SoftGroup for 3D Instance Segmentation on Point Clouds

Thang Vu, Kookhoi Kim, Tung M. Luu, Xuan Thanh Nguyen, Chang D. Yoo

Introduction

Scene understanding on 3D data has received increasing attention owing to the rapid development of 3D sensors and the availability of large-scale 3D datasets. Instance segmentation on point clouds is a 3D perception task that serves as the foundation for a wide range of applications such as autonomous driving, virtual reality, and robot navigation. It processes a point cloud to output a category and an instance mask for each detected object.

State-of-the-art methods treat 3D instance segmentation as a bottom-up pipeline. They learn point-wise semantic labels and center offset vectors and then group points of the same label with small geometric distances into instances. These grouping algorithms operate on hard semantic predictions, where each point is associated with a single class. In many cases, however, objects are locally ambiguous and the semantic predictions assign different categories to different parts of the same object. Using hard semantic predictions for instance grouping then leads to two problems: (1) low overlap between the predicted instance and the ground truth and (2) extra false-positive instances from wrong semantic regions. Figure 1 shows a visualization example. Here, in the semantic prediction results, some parts of the cabinet are wrongly predicted as other furniture. When hard semantic predictions are used to perform grouping, the semantic prediction error is propagated to the instance prediction. As a result, the predicted cabinet instance has low overlap with the ground truth, and the other furniture instance is a false positive.

This paper proposes SoftGroup to address these problems by performing grouping on soft semantic scores instead of hard one-hot semantic predictions. The intuition of SoftGroup is illustrated in Figure 2. Our finding is that object parts with wrong semantic predictions still have reasonable scores for the true semantic class. SoftGroup therefore relies on a score threshold to determine which category the object belongs to, instead of the argmax of the scores. Grouping on the soft semantic scores produces more accurate instances for the true semantic class. Instances with wrong semantic predictions are suppressed by learning to categorize them as background. To this end, we treat an instance proposal as either a positive or a negative sample depending on its maximum Intersection over Union (IoU) with the ground truth, then construct a top-down refinement stage to refine the positive samples and suppress the negative ones. As shown in Figure 1, SoftGroup is able to produce accurate instance masks from imperfect semantic predictions.

SoftGroup is conceptually simple and easy to implement. Experiments on the ScanNet v2 and S3DIS benchmark datasets show the efficacy of our method. Notably, SoftGroup outperforms the previous state-of-the-art method by a significant margin of +6.2% on the ScanNet hidden test set and +6.8% on S3DIS Area 5 in terms of AP50. SoftGroup is fast, requiring 345ms to process a ScanNet scene. In summary, our contribution is threefold.

We propose SoftGroup, which performs grouping on soft semantic scores to avoid the propagation of errors from hard semantic predictions to instance predictions.

We propose a top-down refinement stage to correct and refine the positive samples and to suppress false positives introduced by wrong semantic predictions.

We report extensive experiments on multiple datasets with different evaluation metrics, showing significant improvements over existing state-of-the-art methods.

Related work

Point cloud representation is a common data format for 3D scene understanding since it is simple while preserving the original geometric information. To process point clouds, early methods extract hand-crafted features based on statistical properties of points. Recent deep learning methods learn to extract features from points. Pointwise methods, such as PointNet, directly process points through a shared Multi-Layer Perceptron (MLP) and then aggregate regional and global features with a symmetric function, such as max-pooling. Voxel-based methods transform the unordered point sets into ordered sparse volumetric grids and then perform 3D sparse convolutions on the grids, which is effective in both accuracy and speed.

Proposal-based Instance Segmentation.

Proposal-based methods follow a top-down strategy that generates region proposals and then segments the object within each proposal. Existing proposal-based methods for 3D point clouds are highly influenced by the success of Mask R-CNN for 2D images. To handle the data irregularity of point clouds, Yi et al. propose GSPN, which takes an analysis-by-synthesis strategy to generate high-objectness 3D proposals that are then refined by a region-based PointNet. Hou et al. present 3D-SIS, which combines multi-view RGB input with 3D geometry to predict bounding boxes and instance masks. Yang et al. propose 3D-BoNet, which directly outputs a set of bounding boxes without anchor generation and non-maximum suppression, then segments the object with a pointwise binary classifier. Liu et al. present GICN, which approximates the instance center of each object as a Gaussian distribution that is sampled to obtain object candidates, and then produces the corresponding bounding boxes and instance masks.

Grouping-based Instance Segmentation.

Grouping-based methods rely on a bottom-up pipeline that produces per-point predictions (such as semantic maps, geometric shifts, or latent features) and then groups points into instances. Wang et al. propose SGPN, which constructs a feature similarity matrix for all points and then groups points with similar features into instances. Pham et al. present JSIS3D, which incorporates the semantic and instance labels in a multi-value conditional random field model and jointly optimizes the labels to obtain object instances. Lahoud et al. propose MTML, which learns feature and directional embeddings, then performs mean-shift clustering on the feature embedding to generate object segments that are scored according to their direction feature consistency. Han et al. introduce OccuSeg, which performs graph-based clustering guided by an object occupancy signal for more accurate segmentation outputs. Zhang et al. consider a probabilistic approach that represents each point as a tri-variate normal distribution, followed by a clustering step to obtain object instances. Jiang et al. propose PointGroup, which segments objects on the original and offset-shifted point sets, relying on a simple yet effective algorithm that groups nearby points of the same label and expands the group progressively. Chen et al. extend PointGroup and propose HAIS, which further absorbs surrounding fragments of instances and then refines the instances based on intra-instance prediction. Liang et al. propose SSTNet, which constructs a tree network from pre-computed superpoints, then traverses the tree and splits nodes to obtain object instances.

Both proposal-based and grouping-based methods have their advantages and drawbacks. Proposal-based methods process each object proposal independently, without interference from other instances. Grouping-based methods process the whole scene without proposal generation, enabling fast inference. However, proposal-based methods have difficulty generating high-quality proposals since points only exist on object surfaces. Grouping-based methods depend heavily on semantic segmentation, so errors in semantic predictions are propagated to instance predictions. The proposed method leverages the advantages and addresses the limitations of both approaches. Our method is constructed as a two-stage pipeline, where the bottom-up stage generates high-quality object proposals by grouping on soft semantic scores, and the top-down stage then processes each proposal to refine positive samples and suppress negative ones.

Method

The overall architecture of SoftGroup is depicted in Figure 3 and is divided into two stages. In the bottom-up grouping stage, the point-wise prediction network (Sec. 3.1) takes point clouds as the input and produces point-wise semantic labels and offset vectors. The soft grouping module (Sec. 3.2) processes these outputs to produce preliminary instance proposals. In the top-down refinement stage, the backbone features corresponding to each proposal are extracted and used to predict classes, instance masks, and mask scores as the final results.

The input of the point-wise prediction network is a set of $N$ points, each of which is represented by its coordinate and color. The point set is voxelized to convert unordered points to ordered volumetric grids, which are fed into a U-Net style backbone to obtain point features. The Submanifold Sparse Convolution is adopted to implement the U-Net for 3D point clouds. From the point features, two branches are constructed to output the point-wise semantic scores and offset vectors.
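The voxelization step can be illustrated with a minimal sketch. The helper name `voxelize` and the mean-pooling of features within a voxel are assumptions made for illustration; the text above does not state how per-voxel features are aggregated. The default voxel size of 0.02 m matches the value given in the implementation details.

```python
import math
from collections import defaultdict


def voxelize(points, features, voxel_size=0.02):
    """Map points to sparse voxels (hypothetical helper, not the authors' code).

    points:   list of (x, y, z) coordinates in meters.
    features: list of per-point feature lists (e.g. RGB color).
    Returns (voxel_coords, voxel_feats, point_to_voxel), where
    point_to_voxel[i] is the index of the voxel that contains point i.
    """
    buckets = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        # Quantize each coordinate to an integer grid index.
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        buckets[key].append(i)

    voxel_coords = sorted(buckets)  # ordered list of occupied voxels only
    voxel_feats = []
    point_to_voxel = [0] * len(points)
    for v, key in enumerate(voxel_coords):
        members = buckets[key]
        dim = len(features[members[0]])
        # Mean-pool the features of all points in the voxel (assumed rule).
        voxel_feats.append(
            [sum(features[i][d] for i in members) / len(members)
             for d in range(dim)]
        )
        for i in members:
            point_to_voxel[i] = v
    return voxel_coords, voxel_feats, point_to_voxel
```

The `point_to_voxel` mapping is what allows voxel-level backbone outputs to be gathered back into per-point features for the semantic and offset branches.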

Offset Branch.

3.2 Soft Grouping

We note that existing proposal-based methods commonly consider bounding boxes as object proposals and then perform segmentation within each proposal. Intuitively, a bounding box with high overlap with the instance should have its center close to the object center. However, generating high-quality bounding box proposals in 3D point clouds is challenging since points only exist on object surfaces. Instead, SoftGroup relies on point-level proposals, which are more accurate and naturally inherit the scattered property of point clouds.
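A minimal sketch of grouping on soft semantic scores is given below, under stated assumptions. A point becomes a candidate for every class whose score exceeds a threshold `tau`, and candidates of the same class that lie within a bandwidth `b` of each other are linked into one proposal by breadth-first search. The function name `soft_group`, the brute-force neighbor search, and the absence of a minimum group size are simplifications for illustration and not the authors' implementation. The defaults of 0.2 and 0.04 m follow the values reported in the implementation details.

```python
from collections import deque


def _dist2(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def soft_group(coords, scores, tau=0.2, b=0.04):
    """Group points into instance proposals (illustrative sketch).

    coords: list of (x, y, z), assumed to be offset-shifted coordinates.
    scores: list of per-point class-score lists (e.g. softmax outputs).
    Returns a list of (class_id, sorted_point_indices) proposals.
    """
    num_classes = len(scores[0]) if scores else 0
    proposals = []
    for c in range(num_classes):
        # Soft membership: keep every point whose score for class c
        # exceeds tau, so one point may be a candidate for several classes.
        candidates = [i for i, s in enumerate(scores) if s[c] > tau]
        visited = set()
        for seed in candidates:
            if seed in visited:
                continue
            visited.add(seed)
            group, queue = [seed], deque([seed])
            while queue:
                i = queue.popleft()
                for j in candidates:
                    # Brute-force O(N^2) radius search; a spatial index
                    # would be used in practice.
                    if j not in visited and _dist2(coords[i], coords[j]) <= b * b:
                        visited.add(j)
                        group.append(j)
                        queue.append(j)
            proposals.append((c, sorted(group)))
    return proposals
```

For example, with four points on a line at x = 0, 0.03, 0.06, and 1.0 and class scores `[0.9, 0.1]`, `[0.7, 0.3]`, `[0.4, 0.6]`, `[0.1, 0.9]`, the class-0 proposal covers points 0, 1, and 2. Grouping on argmax labels would assign point 2 to class 1 only, so the class-0 group would cover points 0 and 1 alone.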

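Since the quality of instance proposals from grouping depends heavily on the quality of semantic segmentation, we quantitatively analyze the impact of $\tau$ on the recall and precision of semantic predictions. For class $j$, the recall is the fraction of ground-truth points of class $j$ that are predicted as class $j$, and the precision is the fraction of points predicted as class $j$ that truly belong to class $j$.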

Figure 4 shows the recall and precision (averaged over classes) for varying score thresholds $\tau$, compared with those of hard semantic prediction. With hard semantic prediction, the recall is 79.1%, indicating that more than 20% of the points, averaged over classes, are not covered by the predictions. When using the score threshold, the recall increases as the score threshold decreases. However, a small score threshold also leads to low precision. We propose a top-down refinement stage to mitigate the low precision problem. The precision can be interpreted as the relation between foreground and background points of object instances. We set the threshold to 0.2, where the precision is near 50%, so that the ratio between foreground and background points is balanced for the ensuing stage.
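The trade-off described above can be reproduced on toy data with a short sketch. It assumes the usual definitions: a point is predicted as class `cls` when its score for that class exceeds `tau` (or, for hard prediction, when `cls` is the argmax), recall is measured against the ground-truth points of the class, and precision against the predicted points. The function name `recall_precision` and the toy scores are hypothetical.

```python
def recall_precision(scores, labels, cls, tau=None):
    """Recall and precision of semantic predictions for one class.

    scores: list of per-point class-score lists.
    labels: list of ground-truth class ids.
    tau=None uses the hard (argmax) prediction; otherwise a point is
    predicted as `cls` whenever its score for `cls` exceeds tau.
    """
    if tau is None:
        predicted = [max(range(len(s)), key=s.__getitem__) == cls for s in scores]
    else:
        predicted = [s[cls] > tau for s in scores]
    actual = [y == cls for y in labels]

    true_pos = sum(p and a for p, a in zip(predicted, actual))
    n_pred, n_actual = sum(predicted), sum(actual)
    recall = true_pos / n_actual if n_actual else 0.0
    precision = true_pos / n_pred if n_pred else 0.0
    return recall, precision


# Toy example: three points of class 0 (one of them ambiguous) and one of class 1.
scores = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6], [0.25, 0.75]]
labels = [0, 0, 0, 1]
hard = recall_precision(scores, labels, cls=0)           # recall 2/3, precision 1.0
soft = recall_precision(scores, labels, cls=0, tau=0.2)  # recall 1.0, precision 0.75
```

In this toy case the threshold recovers the ambiguous point that the argmax misses, at the cost of admitting one background point, which is the kind of error the refinement stage is meant to remove.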

3.3 Top-Down Refinement

The top-down refinement stage classifies and refines the instance proposals from the bottom-up grouping stage. A feature extractor layer processes each proposal to extract its corresponding backbone features. The extracted features are fed into a tiny U-Net network (a U-Net style network with a small number of layers) before predicting classification scores, instance masks, and mask scores at the ensuing branches.

We note that existing grouping-based methods typically derive the object category from semantic predictions. However, instances may come from objects with noisy semantic predictions. The proposed method directly uses the output of the classification branch as the instance class. The classification branch aggregates all point features of the instance and classifies the instance with a single label, leading to more reliable predictions.

Segmentation Branch.

As shown in Section 3.2, the instance proposals contain both foreground and background points. We therefore construct a segmentation branch to predict an instance mask within each proposal. The segmentation branch is a point-wise MLP of two layers that outputs an instance mask $\boldsymbol{m}_{k}$ for each instance $k$.

Mask Scoring Branch.

Learning Targets.

3.4 Multi-task Learning

The whole network can be trained in an end-to-end manner using a multi-task loss

$$L = L_{\text{semantic}} + L_{\text{offset}} + L_{\text{class}} + L_{\text{mask}} + L_{\text{mask\_score}},$$

where $L_{\text{semantic}}$ and $L_{\text{offset}}$ are the semantic and offset losses defined in Section 3.1, while $L_{\text{class}}$, $L_{\text{mask}}$, and $L_{\text{mask\_score}}$ are the classification, segmentation, and mask score losses defined in Section 3.3.

Experiments

The experiments are conducted on the standard ScanNet v2 and S3DIS benchmark datasets. The ScanNet dataset contains 1613 scans, which are divided into training, validation, and testing sets of 1201, 312, and 100 scans, respectively. Instance segmentation is evaluated on 18 object classes. Following existing methods, the benchmark results are reported on the hidden test split. The ablation study is conducted on the validation set.

The S3DIS dataset contains 3D scans of 6 areas with 271 scenes in total. The dataset consists of 13 classes for instance segmentation evaluation. Following existing methods, two settings are used to evaluate the instance segmentation results: testing on Area 5 and 6-fold cross-validation.

Evaluation Metrics.

The evaluation metric is the standard average precision. Here, AP50 and AP25 denote the scores with IoU thresholds of 50% and 25%, respectively. Likewise, AP denotes the score averaged over IoU thresholds from 50% to 95% with a step size of 5%. Additionally, S3DIS is also evaluated using mean coverage (mCov), mean weighted coverage (mWCov), mean precision (mPrec), and mean recall (mRec).
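The IoU underlying these metrics is computed between point masks rather than boxes. The sketch below shows the mask IoU and the ten thresholds averaged by AP. The name `mask_iou`, the constant `AP_THRESHOLDS`, and the representation of a mask as a collection of point indices are assumptions for illustration, and the official benchmark evaluation scripts remain the reference for the full AP computation.

```python
def mask_iou(pred, gt):
    """IoU between two instance masks given as collections of point indices."""
    pred, gt = set(pred), set(gt)
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


# IoU thresholds averaged by AP: 0.50, 0.55, ..., 0.95.
AP_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]

# A predicted mask sharing 2 of 4 distinct points with the ground truth
# has IoU 0.5: it counts as a match at the AP25 and AP50 thresholds, but
# not at the stricter thresholds that enter AP.
iou = mask_iou([0, 1, 2], [1, 2, 3])
matched_at = [t for t in [0.25] + AP_THRESHOLDS if iou >= t]
```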

Implementation Details.

The implementation details follow those of existing methods. The model is implemented in the PyTorch deep learning framework and trained for 120k iterations with the Adam optimizer. The batch size is set to 4. The learning rate is initialized to 0.001 and scheduled by cosine annealing. The voxel size and grouping bandwidth $b$ are set to 0.02m and 0.04m, respectively. The score threshold for soft grouping $\tau$ is set to 0.2. At training time, the scenes are randomly cropped to a maximum of 250k points. At inference, the whole scene is fed into the network without cropping. For S3DIS, which has a high point density, scenes are randomly downsampled at a ratio of 1/4 before cropping. At inference, the scene is divided into four parts before being fed into the model, and the outputs from the four parts are then merged to obtain the final results.
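For reference, a cosine annealing schedule with the stated initial learning rate and iteration count can be sketched as follows. The function name `cosine_lr`, a final learning rate of zero, the absence of warm-up, and annealing over the full 120k iterations are assumptions, since the text only states that cosine annealing is used.

```python
import math


def cosine_lr(step, total_steps=120_000, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given iteration under cosine annealing.

    Decays from base_lr at step 0 to min_lr at total_steps along half a
    cosine period. min_lr=0.0 is an assumed value, not one from the paper.
    """
    t = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Under these assumptions the learning rate is 0.001 at the start, about 0.0005 halfway through training, and 0 at the final iteration.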

We note that the source code and trained models for existing high-performing methods are publicly available on ScanNet v2 only. In this work, the source code and trained models on both ScanNet v2 and S3DIS will be released to support result reproducibility.

4.2 Benchmarking Results

Table 1 shows the results of SoftGroup and recent state-of-the-art methods on the hidden test set of the ScanNet v2 benchmark. We submit our model and report the results from the server. The proposed SoftGroup achieves the highest average AP50 of 76.1%, surpassing the previous strongest method by a significant margin of 6.2%. Regarding class-wise scores, our method achieves the best performance in 12 out of 18 classes.

S3DIS.

Table 2 summarizes the results on Area 5 and 6-fold cross-validation of the S3DIS dataset. On both the Area 5 and cross-validation evaluations, the proposed SoftGroup achieves higher overall performance than existing methods. Notably, on the Area 5 evaluation, SoftGroup achieves AP/AP50 of 51.6/66.1(%), which is an improvement of 8.9/6.8(%) over the second-best method. The state-of-the-art performance on both the ScanNet v2 and S3DIS datasets shows the generalization advantage of our method.

Segmentation and Detection Results.

We further report the instance segmentation and object detection results on the ScanNet v2 validation set. To obtain object detection results, we follow prior work and extract a tight axis-aligned bounding box from each predicted point mask. Table 3 reports the instance segmentation and object detection results. Our method achieves significant improvements over the second-best method of 3.2, 3.3, 6.3, and 7.3(%) in AP50, AP25, box AP50, and box AP25, respectively.
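Extracting a tight axis-aligned box from a point mask amounts to taking the per-axis minimum and maximum over the masked points. A minimal sketch follows, in which the helper name `mask_to_box` and the boolean-list mask format are assumptions for illustration.

```python
def mask_to_box(coords, mask):
    """Tight axis-aligned box (min_xyz, max_xyz) around the masked points.

    coords: list of (x, y, z) point coordinates.
    mask:   list of truthy/falsy flags, one per point.
    Returns None when the mask selects no points.
    """
    selected = [p for p, m in zip(coords, mask) if m]
    if not selected:
        return None
    mins = tuple(min(p[d] for p in selected) for d in range(3))
    maxs = tuple(max(p[d] for p in selected) for d in range(3))
    return mins, maxs
```

Because the box is derived from the mask, any improvement in mask quality carries over directly to the box AP scores.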

Runtime Analysis.

Table 4 reports the runtime per scan of different methods on the ScanNet v2 validation set. For a fair comparison, the reported runtime is measured on the same Titan X GPU model. The inference time of our method is 345ms per scan, which is only 6ms more than that of the fastest model. Regarding the individual components, the latencies of the point-wise prediction network, soft grouping algorithm, and top-down refinement are 152ms, 123ms, and 70ms, respectively. The results show that our method achieves high accuracy while remaining computationally efficient.

4.3 Qualitative Analysis

Figure 5 shows visualization examples from the ScanNet v2 dataset. Without SoftGroup, the semantic prediction errors are propagated to the instance segmentation predictions (highlighted by dashed boxes). In contrast, SoftGroup effectively corrects the semantic prediction errors and thus generates more accurate instance masks.

4.4 Ablation Study

We provide experimental results of SoftGroup when different components are omitted. The baseline is a model with hard grouping in which the confidence scores of the output instances are ranked by a ScoreNet branch. Table 5 shows the ablation results. The baseline achieves 39.5/61.1/75.5(%) in terms of AP/AP50/AP25. A significant improvement is obtained by applying either soft grouping or top-down refinement. Combining these two components achieves the best overall AP/AP50/AP25 of 46.0/67.6/78.9(%), which is higher than the baseline by 6.5/6.5/3.4(%).

Score Threshold for Soft Grouping.

Table 6 shows the experimental results with varying score thresholds for soft grouping. The baseline is the setting with $\tau$ being “None”, indicating that the threshold is deactivated and the hard predicted label is used for grouping. The baseline achieves AP/AP50/AP25 of 44.3/65.4/78.1(%). When $\tau$ is too high or too low, the performance is even worse than the baseline. The best performance is obtained at $\tau$ of 0.2, which confirms our analysis in Section 3.2, where the numbers of positive and negative samples are balanced.

Top-Down Refinement.

We further provide the ablation results on the top-down refinement in Table 7. With only the classification branch, our method achieves AP/AP50/AP25 of 41.1/64.6/79.7(%). When the mask branch and the mask scoring branch are applied in turn, the performance tends to improve at the higher IoU thresholds. Combining all branches yields AP/AP50/AP25 of 46.0/67.6/78.9(%).

Instance Category from Classification Branch.

Table 8 reports the results of different schemes to obtain object categories. The results show that deriving the object category from semantic prediction yields the AP/AP50/AP25 of 45.0/65.6/76.2(%). The proposed method directly uses the output of the classification branch as the instance class. The classification branch aggregates all point features of the instance and classifies the instance with a single label, leading to more reliable prediction. The results show that directly using classification output as object category improves the AP/AP50/AP25 to 46.0/67.6/78.9(%).

Conclusion

We have presented SoftGroup, a simple yet effective method for instance segmentation on 3D point clouds. SoftGroup performs grouping on soft semantic scores to address the problem stemming from hard grouping on locally ambiguous objects. The instance proposals obtained from the grouping stage are assigned to either positive or negative samples. Then a top-down refinement stage is constructed to refine the positives and suppress the negatives. Extensive experiments on different datasets show that our method outperforms the existing state-of-the-art method by a significant margin of +6.2% on the hidden ScanNet v2 test set and +6.8% on S3DIS Area 5 in terms of AP50. SoftGroup is also fast, requiring 345ms to process a ScanNet scene.