Superpoint Transformer for 3D Scene Instance Segmentation

Jiahao Sun, Chunmei Qing, Junpeng Tan, Xiangmin Xu

Introduction

3D scene understanding is regarded as a fundamental ingredient for many applications, including augmented/virtual reality (Park et al. 2020), autonomous driving (Zhou et al. 2020), and robot navigation (Xie et al. 2021). Within 3D scene understanding, instance segmentation is a challenging task: it aims not only to detect instances in sparse point clouds but also to produce a clear mask for each instance.

Existing state-of-the-art methods can be divided into proposal-based (Yang et al. 2019; Liu et al. 2020) and grouping-based (Jiang et al. 2020; Chen et al. 2021; Liang et al. 2021; Vu et al. 2022) methods. Proposal-based methods treat 3D instance segmentation as a top-down pipeline. They first generate region proposals (i.e., bounding boxes), as shown in Fig. 1(b), and then predict instance masks within the proposed regions. These methods are motivated by the great success of Mask R-CNN (He et al. 2017) in 2D instance segmentation. However, they struggle on point clouds because of domain gaps. In 3D, a bounding box has more degrees of freedom (DoF), which makes fitting more difficult. Moreover, points usually exist only on parts of an object's surface, so object geometric centers are often not detectable. Besides, low-quality region proposals affect box-based bipartite matching (Yang et al. 2019) and further degrade model performance.

In contrast, grouping-based methods adopt a bottom-up pipeline. They learn point-wise semantic labels and instance center offsets, and then use the shifted points and semantic predictions to aggregate points into instances, as shown in Fig. 1(c). Over the past two years, grouping-based methods have achieved great improvements in the 3D instance segmentation task (Liang et al. 2021; Vu et al. 2022). However, they also have several shortcomings: (1) Grouping-based methods depend on their semantic segmentation results, which may be wrong. Propagating these wrong predictions to subsequent processing degrades network performance. (2) These methods need an intermediate aggregation step, which increases training and inference time. The aggregation step is independent of network training and lacks supervision, so an additional refinement module is needed.

Given the discussion above, it is natural to consider a hybrid framework that avoids these drawbacks and benefits from both types of methods simultaneously. In this paper, we propose a novel end-to-end two-stage 3D instance segmentation method based on Superpoint Transformer, named SPFormer. SPFormer groups bottom-up potential features from point clouds into superpoints and proposes instances by query vectors as a top-down pipeline.

In the bottom-up grouping stage, a sparse 3D U-net is utilized to extract bottom-up point-wise features. A simple superpoint pooling layer is presented to group potential point-wise features into superpoints. Superpoints (Landrieu and Simonovsky 2018) can leverage geometric regularities to represent homogeneous neighboring points. In contrast to a previous method (Liang et al. 2021), our superpoint features are potential, which avoids supervising the features through indirect semantic and center-distance labels. We consider superpoints as a potential mid-level representation of 3D scenes and directly use instance labels to train the whole network. In the top-down proposal stage, a novel query decoder with transformers is proposed. We utilize learnable query vectors to propose instance predictions from potential superpoint features as a top-down pipeline. The learnable query vectors can capture instance information through a superpoint cross-attention mechanism. Fig. 1(d) illustrates this process: the redder a part of the chair is, the more attention the query vector pays to it. With the query vectors carrying instance information and the superpoint features, the query decoder directly generates instance class, score, and mask predictions. Finally, through bipartite matching based on superpoint masks, SPFormer can be trained end-to-end without a time-consuming aggregation step. Besides, SPFormer is free of post-processing such as non-maximum suppression (NMS), which further speeds up the network.

SPFormer achieves state-of-the-art results on both the ScanNetv2 and S3DIS benchmarks. In particular, SPFormer exceeds the compared state-of-the-art methods in qualitative results, quantitative results, and inference speed simultaneously. With its novel pipeline, SPFormer can serve as a general framework for 3D instance segmentation. In summary, our contributions are as follows:

We propose a novel end-to-end two-stage method named SPFormer that represents a 3D scene with potential superpoint features without relying on the results of object detection or semantic segmentation.

We design a query decoder with transformers in which learnable query vectors capture instance information by superpoint cross-attention. With the query vectors, the query decoder can directly generate instance predictions.

Through bipartite matching based on superpoint masks, SPFormer can be trained without a time-consuming intermediate aggregation step and is free of complex post-processing during inference.

Related Work

Proposal-based Methods.

Proposal-based methods take a top-down pipeline for instance segmentation. Earlier methods (Yi et al. 2019; Hou, Dai, and Nießner 2019; Narita et al. 2019) fuse 2D image features with point cloud features in a volumetric grid and generate region proposals from the grid. 3D-BoNet (Yang et al. 2019) uses PointNet++ (Qi et al. 2017a, b) to extract features from point clouds and treats 3D bounding box generation as an optimal assignment problem. GICN (Liu et al. 2020) predicts a Gaussian heatmap to select instance center candidates and produces instance masks within the proposed bounding boxes. 3D-MPA (Engelmann et al. 2020) samples predicted centroids and clusters points near the centroids to form the final instance masks. Most proposal-based methods rely on 3D bounding boxes, so low-quality bounding box predictions degrade the performance of the instance segmentation model.

Grouping-based Methods.

Grouping-based methods regard 3D instance segmentation as a bottom-up pipeline. MTML (Lahoud et al. 2019) utilizes a multi-task strategy to learn feature embeddings. PointGroup (Jiang et al. 2020) aggregates points from the original and center-shifted point clouds and designs ScoreNet to evaluate the quality of aggregation. PE (Zhang and Wonka 2021) introduces a novel probabilistic embedding space. Dyco3D (He, Shen, and van den Hengel 2021) introduces dynamic convolution kernels. HAIS (Chen et al. 2021) extends PointGroup with hierarchical aggregation and filters noisy points within each instance prediction. SSTNet (Liang et al. 2021) constructs a semantic superpoint tree and obtains instance predictions by splitting non-similar nodes. SoftGroup (Vu et al. 2022) uses a lower threshold for clustering to address wrong hard semantic predictions and refines instances with a tiny 3D U-net. Although grouping-based methods may have a top-down refinement module, they still inevitably rely on an intermediate aggregation step.

Instance Segmentation with Transformer.

Recently, the transformer (Vaswani et al. 2017) has been introduced to image classification (Dosovitskiy et al. 2020; Touvron et al. 2021; Liu et al. 2021), object detection (Carion et al. 2020; Dai et al. 2021) and segmentation (Cheng, Schwing, and Kirillov 2021; Cheng et al. 2022a; Guo et al. 2021). There are also instance segmentation methods (Fang et al. 2021; Cheng et al. 2022b) inspired by the transformer. Mask2Former (Cheng et al. 2022a) successfully applies the transformer to build a universal network for 2D image semantic, instance, and panoptic segmentation.

Inspired by the success of the transformer in 2D segmentation tasks, we are motivated to introduce it to 3D instance segmentation. However, a transformer cannot be naively applied to the output of a sparse convolution backbone, because the complexity of the attention mechanism would introduce a high computational overhead. In this paper, we design a novel query decoder for 3D instance segmentation and employ superpoints to build a bridge between the backbone and the query decoder.

Method

The architecture of the proposed SPFormer is illustrated in Fig. 2. First, a sparse 3D U-net is utilized to extract bottom-up point-wise features, and a simple superpoint pooling layer groups the potential point-wise features into superpoints. Second, a novel query decoder with transformers is proposed, in which learnable query vectors capture instance information by superpoint cross-attention. Finally, through bipartite matching based on superpoint masks, SPFormer can be trained end-to-end without a time-consuming aggregation step.
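As an illustration of the superpoint pooling step, the following minimal sketch averages point-wise features inside each superpoint. The use of mean pooling, the function name `superpoint_mean_pool`, and the list-based data layout are assumptions made for this example; the text above only states that a simple pooling layer is used.

```python
from collections import defaultdict


def superpoint_mean_pool(point_feats, superpoint_ids):
    """Average point-wise features within each superpoint (assumed mean pooling).

    point_feats:    list of N feature vectors (lists of floats), one per point.
    superpoint_ids: list of N ints; superpoint_ids[n] is the superpoint of point n.
    Returns a dict {superpoint id: mean feature vector}.
    """
    sums = {}
    counts = defaultdict(int)
    for feat, sp in zip(point_feats, superpoint_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
        sums[sp] = [a + b for a, b in zip(sums[sp], feat)]
        counts[sp] += 1
    return {sp: [v / counts[sp] for v in total] for sp, total in sums.items()}


# Toy scene: 5 points with 2-D features, grouped into 2 superpoints.
feats = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]
ids = [0, 0, 1, 1, 1]
pooled = superpoint_mean_pool(feats, ids)
```

The number of pooled features equals the number of superpoints, which is typically far smaller than the number of points, so the decoder that follows operates on a much shorter sequence.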

Superpoint Pooling Layer.

Query Decoder

Considering that superpoints are unordered and vary in number across scenes, a transformer structure is introduced to handle the variable-length input. The potential features of the superpoints and the learnable query vectors are used as the input of the transformer decoder. The detailed architecture of our modified transformer decoder layer is depicted in Fig. 3. Inspired by Cheng et al. (2022a), the query vectors are initialized randomly before training, and the instance information of each point cloud can only be obtained through superpoint cross-attention. Therefore, our transformer decoder layer exchanges the order of the self-attention layer and the cross-attention layer compared with the standard one (Vaswani et al. 2017). In addition, because the input consists of the potential features of superpoints, we empirically remove the position embedding.
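The layer ordering described above can be sketched as follows. This is a deliberately simplified illustration, not the actual layer: it uses single-head attention and omits the learned projections, layer normalization, and feed-forward network; all function names are hypothetical. It shows only that cross-attention to the superpoints comes before self-attention among the queries, that no position embedding is added, and that the same code handles any number of superpoints.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; works for any number of keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(xs, ys):
    """Residual connection: element-wise sum of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def decoder_layer(queries, superpoint_feats):
    # 1) Cross-attention first: queries read instance cues from the superpoints.
    queries = add(queries, attention(queries, superpoint_feats, superpoint_feats))
    # 2) Self-attention second: queries exchange information with each other.
    queries = add(queries, attention(queries, queries, queries))
    return queries


queries = [[1.0, 0.0], [0.0, 1.0]]
few = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]                       # 3 superpoints
many = few + [[0.1 * i, 1.0 - 0.1 * i] for i in range(4)]         # 7 superpoints
out_few = decoder_layer(queries, few)
out_many = decoder_layer(queries, many)
```

Both calls return one updated vector per query, regardless of how many superpoints the scene contains.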

Shared Prediction Head.

Iterative Prediction.

Bipartite Matching and Loss Function

With a fixed number of proposals, we formulate ground truth label assignment as an optimal assignment problem. Formally, we introduce a pairwise matching cost $\mathcal{C}_{ik}$ to evaluate the similarity of the $i$-th proposal and the $k$-th ground truth. $\mathcal{C}_{ik}$ is determined by the classification probability and the superpoint mask matching cost $\mathcal{C}^{mask}_{ik}$, as defined in Eq. (3).

$$\mathcal{C}_{ik}=-\lambda_{cls}\cdot p_{i,c_{k}}+\lambda_{mask}\cdot\mathcal{C}^{mask}_{ik} \tag{3}$$

where $p_{i,c_{k}}$ indicates the probability for the category $c_{k}$ of the $i$-th proposal and $\lambda_{cls}$, $\lambda_{mask}$ are the corresponding coefficients of each term. In our experiments, we set $\lambda_{cls}=0.5$, $\lambda_{mask}=1$. The superpoint mask matching cost $\mathcal{C}^{mask}_{ik}$ consists of binary cross-entropy (BCE) and dice loss with Laplace smoothing (Milletari, Navab, and Ahmadi 2016), as

$$\mathcal{C}^{mask}_{ik}=\text{BCE}(m_{i},m^{gt}_{k})+1-\frac{2\,m_{i}\cdot m^{gt}_{k}+1}{|m_{i}|+|m^{gt}_{k}|+1}$$

where $m_{i}$ and $m^{gt}_{k}$ are the superpoint masks of the proposal and the ground truth, respectively. We assign a hard instance label to each superpoint depending on whether more than half of the points within the superpoint belong to the instance. With the matching cost $\mathcal{C}_{ik}$, we use the Hungarian algorithm (Kuhn 1955) to find the optimal matching between proposals and ground truth.
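The matching procedure can be sketched in a few lines. The sketch below makes several assumptions that should be read as illustrative only: the classification term enters the cost with a negative sign (higher probability means lower cost), the dice term uses the common add-one smoothed form, BCE is averaged over superpoints, and the optimal assignment is found by exhaustive search as a stand-in for the Hungarian algorithm, which is only practical for toy sizes. All function names are hypothetical.

```python
import math
from itertools import permutations


def hard_superpoint_label(point_in_instance):
    """1 if more than half of the superpoint's points belong to the instance."""
    return 1 if sum(point_in_instance) > len(point_in_instance) / 2 else 0


def bce(pred, target, eps=1e-7):
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)


def dice(pred, target):
    # Add-one (Laplace) smoothed dice loss; assumed form.
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - (2 * inter + 1) / (sum(pred) + sum(target) + 1)


def matching_cost(probs, masks, gt_classes, gt_masks, lam_cls=0.5, lam_mask=1.0):
    """cost[i][k] for proposal i versus ground truth k (lower is better)."""
    return [[-lam_cls * probs[i][gt_classes[k]]
             + lam_mask * (bce(masks[i], gt_masks[k]) + dice(masks[i], gt_masks[k]))
             for k in range(len(gt_masks))]
            for i in range(len(masks))]


def optimal_assignment(cost):
    """Exhaustive search (stand-in for the Hungarian algorithm).

    Requires at least as many proposals as ground truths.
    Returns {ground truth index k: proposal index i}.
    """
    n_prop, n_gt = len(cost), len(cost[0])
    best = min(permutations(range(n_prop), n_gt),
               key=lambda perm: sum(cost[i][k] for k, i in enumerate(perm)))
    return dict(enumerate(best))


# Ground truth superpoint masks from per-point membership (4 superpoints).
gt_masks = [
    [hard_superpoint_label(sp) for sp in [[1, 1, 1], [1, 1, 0], [0, 0], [1, 0, 0, 0]]],
    [hard_superpoint_label(sp) for sp in [[0, 0, 0], [0, 0, 1], [1, 1], [0, 1, 1, 1]]],
]
gt_classes = [0, 1]
# Three proposals: class probabilities and superpoint mask probabilities.
probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
masks = [[0.1, 0.2, 0.9, 0.8], [0.9, 0.8, 0.1, 0.1], [0.5, 0.5, 0.5, 0.5]]
assignment = optimal_assignment(matching_cost(probs, masks, gt_classes, gt_masks))
```

For realistic numbers of proposals, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the exhaustive search.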

After assignment, we treat the proposals that are not assigned to a ground truth as the “no instance” class and compute the classification cross-entropy loss $\mathcal{L}_{cls}$ for every proposal. Then we compute the superpoint mask loss, which consists of the binary cross-entropy loss $\mathcal{L}_{bce}$ and the dice loss $\mathcal{L}_{dice}$, for each proposal and ground truth pair. In addition, we add the following L2 loss $\mathcal{L}_{s}$ for the score branch:

$$\mathcal{L}_{s}=\frac{1}{\sum_{k=1}^{N_{gt}}\mathds{1}_{\{iou_{k}\}}}\sum_{k=1}^{N_{gt}}\mathds{1}_{\{iou_{k}\}}\,(s_{k}-iou_{k})^{2}$$

where $\{s_{k}\}^{N_{gt}}_{k=1}$ is the set of score predictions that are assigned to the $N_{gt}$ ground truths. $\mathds{1}_{\{iou_{k}\}}$ indicates whether $iou_{k}$, the IoU between the proposal mask prediction and its assigned ground truth, is higher than 50%. We only use high-quality proposals for supervision (Huang et al. 2019). Finally, to enable end-to-end training, we adopt the multi-task loss $\mathcal{L}$, as

$$\mathcal{L}=\beta_{cls}\cdot\mathcal{L}_{cls}+\beta_{s}\cdot\mathcal{L}_{s}+\beta_{mask}\cdot(\mathcal{L}_{bce}+\mathcal{L}_{dice})$$

where $\beta_{cls}$ , $\beta_{s}$ , $\beta_{mask}$ are corresponding coefficients of each term. Empirically, we set $\beta_{cls}=\beta_{s}=0.5$ , $\beta_{mask}=1$ .
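A minimal sketch of the score loss and the weighted multi-task loss follows. It assumes that the L2 penalty is the squared error averaged over the matched proposals whose IoU exceeds 50%, and that the mask weight applies to the sum of the BCE and dice terms; the function names and the zero return value when no proposal qualifies are choices made for this example.

```python
def score_loss(scores, ious, thr=0.5):
    """Squared error between predicted scores and mask IoUs, restricted to
    matched proposals whose IoU with their ground truth exceeds `thr`."""
    kept = [(s, u) for s, u in zip(scores, ious) if u > thr]
    if not kept:
        return 0.0  # assumed behaviour when no proposal is of high quality
    return sum((s - u) ** 2 for s, u in kept) / len(kept)


def total_loss(l_cls, l_s, l_bce, l_dice, b_cls=0.5, b_s=0.5, b_mask=1.0):
    """Weighted sum of the classification, score, and mask losses."""
    return b_cls * l_cls + b_s * l_s + b_mask * (l_bce + l_dice)


# Two matched proposals; only the first has IoU above 50% and is supervised.
l_s = score_loss(scores=[0.9, 0.2], ious=[0.7, 0.3])
loss = total_loss(l_cls=1.0, l_s=l_s, l_bce=0.2, l_dice=0.3)
```

Restricting the score supervision to high-IoU matches keeps low-quality masks from dominating the regression target.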

Inference

During inference, given an input point cloud, SPFormer directly predicts $K$ instances with classification $\{p_{i}\}$, IoU-aware score $\{s_{i}\}$, and corresponding superpoint masks. We additionally obtain a mask score $\{ms_{i}\in[0,1]\}^{K}$ by averaging the probabilities of the superpoints whose probability is higher than 0.5 in each superpoint mask. The final score for sorting is $\widetilde{s_{i}}=\sqrt{p_{i}\cdot s_{i}\cdot ms_{i}}$. SPFormer is free of non-maximum suppression in post-processing, which ensures its fast inference speed.
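The scoring step can be illustrated with a short sketch. The function names are hypothetical, the final score follows the expression given in the text, and returning a mask score of 0 when no superpoint exceeds the threshold is an assumption made for this example.

```python
import math


def mask_score(mask_probs, thr=0.5):
    """Mean probability over superpoints whose probability exceeds `thr`."""
    fg = [p for p in mask_probs if p > thr]
    return sum(fg) / len(fg) if fg else 0.0  # 0.0 for empty masks is assumed


def final_score(p_cls, s_iou, ms):
    # Follows the expression in the text: sqrt(p * s * ms).
    return math.sqrt(p_cls * s_iou * ms)


def rank_instances(cls_probs, iou_scores, masks):
    """Return instance indices sorted by final score, best first."""
    scored = [(final_score(p, s, mask_score(m)), i)
              for i, (p, s, m) in enumerate(zip(cls_probs, iou_scores, masks))]
    return [i for _, i in sorted(scored, reverse=True)]


ranked = rank_instances(
    cls_probs=[0.5, 0.9],
    iou_scores=[0.5, 0.8],
    masks=[[0.6, 0.1], [0.9, 0.7, 0.2]],
)
```

Because the predictions are only sorted and never suppressed against each other, no pairwise overlap computation is needed at this stage.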

Experiments

Experiments are conducted on the ScanNetv2 (Dai et al. 2017) and S3DIS (Armeni et al. 2016) datasets. ScanNetv2 has a total of 1613 indoor scenes, of which 1201 are used for training, 312 for validation, and 100 for testing. It contains 18 categories of object instances. We submit the final predictions of our method to its hidden test set and conduct the ablation studies on its validation set. S3DIS has 6 large-scale areas with 272 scenes in total. It has 13 categories for the instance segmentation task. We follow two common settings for evaluation: testing on Area 5 and 6-fold cross-validation.

Evaluation Metrics.

Task-mean average precision (mAP) is utilized as the common evaluation metric for instance segmentation; it averages the scores over IoU thresholds from 50% to 95% with a step size of 5%. $\text{AP}_{50}$ and $\text{AP}_{25}$ denote the scores at IoU thresholds of 50% and 25%, respectively. We report mAP, $\text{AP}_{50}$ and $\text{AP}_{25}$ on the ScanNetv2 dataset, and we additionally report mean precision (mPrec) and mean recall (mRec) on the S3DIS dataset.
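To make the threshold averaging concrete, the sketch below builds the ten IoU thresholds and averages a per-threshold AP over them. The per-threshold AP used here is a toy stand-in for a scene with a single prediction and a single ground truth (1 if the prediction counts as a true positive at that threshold, else 0); a real evaluation computes AP from a ranked list of predictions per category. All names are hypothetical.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as equal-length lists of 0/1."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
MAP_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_ap(ap_at):
    """Average a per-threshold AP function over the ten thresholds."""
    return sum(ap_at(t) for t in MAP_THRESHOLDS) / len(MAP_THRESHOLDS)


iou = mask_iou([1, 1, 1], [1, 1, 0])            # 2 / 3


def toy_ap(threshold):
    """Toy AP for one prediction and one ground truth."""
    return 1.0 if iou >= threshold else 0.0


m_ap = mean_ap(toy_ap)      # true positive at thresholds 0.50 to 0.65 only
ap_50 = toy_ap(0.5)
ap_25 = toy_ap(0.25)
```

The example shows why mAP is stricter than $\text{AP}_{50}$: a mask with IoU of about 0.67 counts fully at the 50% and 25% thresholds but contributes to only four of the ten thresholds in mAP.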

Benchmark Results

SPFormer is compared with existing state-of-the-art methods on the hidden test set, as shown in Table 1. SPFormer achieves the highest mAP score of 54.9%, outperforming the previous best result by 4.3%. Among the 18 categories, our model achieves the highest AP scores on 8 of them. In particular, SPFormer surpasses the previous best AP score by more than 10% in the counter category, where previous methods have struggled to achieve a satisfactory score.

We also evaluate SPFormer on the ScanNetv2 validation set, as shown in Table 2. SPFormer outperforms all state-of-the-art methods by a large margin. Compared to the second-best results, our method improves by 6.9%, 6.3%, and 4.0% in terms of mAP, $\text{AP}_{50}$ and $\text{AP}_{25}$, respectively.

S3DIS.

We evaluate SPFormer on S3DIS using Area 5 and 6-fold cross-validation, respectively. As shown in Table 3, SPFormer achieves state-of-the-art results in terms of $\text{AP}_{50}$. Following the protocols used in previous methods, we additionally report mPrec and mRec, on which our method also achieves competitive results. The results on S3DIS confirm the generalization ability of SPFormer.

Runtime Analysis.

We test the runtime per scene of different methods on the ScanNetv2 validation set, as shown in Table 4. For a fair comparison, the SSC and SC layers in all the compared methods are implemented with spconv v2.1 (Contributors 2022). We report in detail the running time of the components of each method (the last part of each model contains its own post-processing). Since both SPFormer and SSTNet (Liang et al. 2021) are based on superpoints, we add the superpoint extraction (s.p. extraction) runtime so that the inference speed is measured from raw input point clouds. Note that superpoints can be pre-computed for the training stage, which significantly reduces model training time. Even with superpoint extraction, SPFormer is still the fastest among the existing methods.

Ablation Study

Table 5 shows the performance when different components are omitted. When the output of the backbone is naively fed into the query decoder, we find a huge drop in performance: query vectors cannot attend to several hundred thousand points because of the softmax in cross-attention. We employ superpoints to build a bridge between the backbone and the query decoder, which significantly improves the performance of our method. We then discuss the bipartite matching target by comparing matching by boxes with matching by masks. The implementation details of matching by boxes are given in the supplementary material. We find that matching by masks exceeds matching by boxes by 6.4% on mAP. 3D boxes have more DoF than 2D ones, and object geometric centers are usually not detectable, which inevitably makes matching more difficult. Finally, we confirm that the IoU-aware score branch benefits our method. It brings improvements of +1.3/1.5/0.4 on mAP/$\text{AP}_{50}$/$\text{AP}_{25}$, respectively. The score branch mitigates the misalignment of proposal quality ranking.

The Architecture of Transformer.

The ablation analysis of the transformer architecture is shown in Table 6. Taking the original transformer decoder layer (Vaswani et al. 2017) without position encoding as the baseline, iteratively predicting on each transformer layer with the shared prediction head brings improvements of +1.5/1.8/1.8 on mAP/$\text{AP}_{50}$/$\text{AP}_{25}$, respectively. Moreover, adding superpoint attention masks further improves performance by +3.8/2.4/1.3. Superpoint attention masks allow SPFormer to attend only to the foreground predicted by the previous layer. Because the number of superpoints varies from scene to scene, we only discuss whether to use position encoding on the query vectors. We add position encoding wherever query vectors are fed into a decoder layer. We observe that the position encoding can be safely removed, probably due to the irregularity and diversity of the point clouds. Finally, we swap the order of self-attention and cross-attention so that query vectors can gather context information as soon as they are fed into the decoder layer, which makes the process more sensible and brings a small improvement.
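The idea behind superpoint attention masks can be sketched as follows. The sketch assumes that the previous layer outputs one mask logit per superpoint for each query, that foreground is defined by a sigmoid probability above 0.5, and that a query with no predicted foreground falls back to attending everywhere; these details and the function names are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_mask_from_logits(mask_logits, thr=0.5):
    """True marks superpoints a query may attend to in the next layer."""
    allowed = [sigmoid(z) > thr for z in mask_logits]
    # Assumed fallback: if a query predicts no foreground at all, let it
    # attend everywhere so that the masked softmax stays well defined.
    return allowed if any(allowed) else [True] * len(mask_logits)


def masked_softmax(scores, allowed):
    """Softmax over allowed positions; disallowed positions get weight 0.

    Assumes at least one position is allowed.
    """
    kept = [s for s, a in zip(scores, allowed) if a]
    m = max(kept)
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Previous-layer mask logits of one query over three superpoints.
allowed = attention_mask_from_logits([2.0, -3.0, 0.5])
# The second superpoint has the highest raw score but is masked out.
weights = masked_softmax([1.0, 5.0, 1.0], allowed)
```

In the toy example the background superpoint would otherwise dominate the softmax; masking it lets the query concentrate on the region it already predicted as foreground.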

Number of Queries and Layers.

Table 7 presents the selection of the number of query vectors and transformer decoder layers. The results show that too few or too many layers reduce performance. Interestingly, we observe some performance improvement when using 400 query vectors compared with 200 or 100, and performance only saturates when the number rises to 800. This may be because the number of instances in a 3D scene is usually larger than the number of instances in an image from common 2D datasets.

The Selection of Mask Loss.

Table 8 shows the performance of the components of the mask loss. We observe that using only binary cross-entropy (BCE) loss or focal loss (Lin et al. 2017) leads to much lower performance, so dice loss is indispensable in the mask loss. On top of dice loss, adding BCE loss or focal loss improves the overall performance. The combination of dice loss and BCE loss achieves the best results.

Visualizations

The visualization of 3D instance segmentation is shown in Fig. 4. Compared to the existing state-of-the-art method, SPFormer correctly segments each instance and produces finer segmentation results.

Cross-Attention Mechanism.

Fig. 5 visualizes the cross-attention mechanism. For an input point cloud, query vectors attend to the superpoints and highlight the region of interest. Here we propagate the attention weight of each superpoint to its own points for visualization. The query vectors then carry the attention information and form the final mask prediction in the prediction head.
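Propagating superpoint attention weights to points amounts to a simple lookup, as the following sketch shows. The min-max normalization into a red colour intensity is an assumption about how such a visualization might be rendered, and the function names are hypothetical.

```python
def propagate_to_points(superpoint_weights, superpoint_ids):
    """Give every point the attention weight of the superpoint containing it."""
    return [superpoint_weights[sp] for sp in superpoint_ids]


def to_red_intensity(weights):
    """Min-max normalise weights to [0, 1] for use as a red colour channel."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0] * len(weights)
    return [(w - lo) / (hi - lo) for w in weights]


# One query's attention over 3 superpoints, and 5 points with their superpoint ids.
attn = [0.1, 0.7, 0.2]
ids = [0, 0, 1, 2, 1]
point_weights = propagate_to_points(attn, ids)
red = to_red_intensity(point_weights)
```

Points in the most attended superpoint receive full intensity, and points in the least attended superpoint receive none.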

Conclusion

In this paper, we propose a novel end-to-end two-stage framework (SPFormer) for 3D instance segmentation. It can be considered a combination of proposal-based and grouping-based methods. With a novel hybrid pipeline, SPFormer groups bottom-up potential features from point clouds into superpoints and proposes instances by query vectors as a top-down pipeline. SPFormer achieves state-of-the-art results on both the ScanNetv2 and S3DIS benchmarks while retaining fast inference speed.

Acknowledgments

This paper is partially supported by the following grants: National Natural Science Foundation of China (61972163, U1801262), Natural Science Foundation of Guangdong Province (2022A1515011555), National Key R&D Program of China (2022YFB4500600), Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004) and Pazhou Lab, Guangzhou, 510330, China.