DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution

Tong He, Chunhua Shen, Anton van den Hengel

Introduction

Instance segmentation is significantly more challenging than semantic segmentation because it requires identifying every individual instance of a class of objects, and the visible extent of each. The information recovered has nonetheless proven invaluable for scene understanding. With the increasing use of 3D sensors (such as LiDAR and laser scanners), point clouds have become an important modality in scene understanding. Although significant advances have been made in instance segmentation in the image domain, instance segmentation of 3D point clouds has proven far more challenging. This is partly due to the inherent irregularity and sparsity of the data, but also due to the diversity of the scenes. By way of example, Mask R-CNN, which has shown great success when applied to 2D images, performs poorly when applied in 3D.

Many previous top-performing approaches for point cloud instance segmentation adopt a bottom-up strategy, involving heuristic grouping algorithms or complex post-processing steps. 3D-MPA, for example, extracts proposals from the predicted instance centroids. Instances are then generated by aggregating proposal-wise embeddings. PointGroup generates instance proposals by gradually merging neighbouring points that share the same category label. Both original and centroid-shifted points are explored with a manually specified search radius. A separate model (labelled ScoreNet) is used to estimate the objectness of the proposals. Both methods achieved promising performance on the ScanNetV2 and S3DIS benchmarks. However, these bottom-up methods often suffer from several drawbacks: (1) their performance is sensitive to the values of pre-defined hyper-parameters, which require manual tuning. In PointGroup, modifying the clustering radius from 3cm to 2cm causes mAP to drop by more than 6%, illustrating the method’s limited robustness and generalization ability. These results are presented in Fig. 1. (2) they incorporate either complex post-processing steps or training pipelines, rendering them unsuitable for real-time applications such as robotics and driverless cars. For example, 3D-MPA needs an extra 10-layer graph network and a clustering post-processing step to yield its final instance segmentation masks. (3) they are heavily reliant on the quality of the proposals, which limits their robustness and can lead to merged or fragmented instances in practice.

In this paper, we propose a novel pipeline tailored to 3D point cloud instance segmentation using dynamic convolution, which we call DyCo3D. Our approach addresses the task with only a few convolution layers, for which the filters are generated on the fly, conditioned on the category and position of the instance to be decoded. To enable the filters to distinguish different instances, we encode category-specific context by deploying a light-weight sub-network that explores homogeneous points, that is, points that have close votes for instance centroids and share the same semantic label. Instance masks can be decoded in parallel by convolving the generated class-specific filters with the position-embedded features. Compared with bottom-up approaches that are sensitive to the values of numerous hyper-parameters, our approach is superior in both effectiveness and efficiency. Qualitative results illustrating this are presented in Fig. 1.

Moreover, as has been shown in the 2D image domain, a large receptive field and rich context information are critical to the success of instance segmentation. To address this in 3D point clouds, we introduce a small transformer to capture long-range dependencies and build high-level interactions among different regions.

Our main contributions are summarised as follows.

A novel method for 3D point cloud instance segmentation based on dynamic convolution that outperforms previous methods in both efficiency and effectiveness.

A proposal-free instance segmentation approach that is more robust than bottom-up strategies.

A light-weight transformer that enlarges the receptive field and captures non-local dependencies.

Comprehensive experiments demonstrating that the proposed method achieves state-of-the-art results, with improved robustness, and an inference speed superior to that of its comparators.

Related Work

Deep Learning for 3D Point Cloud. In contrast to the image domain, wherein the data representation is relatively consistent (see, e.g., VGGNet and ResNet), methods for 3D point cloud representation are still developing. The most prevalent existing approaches can be roughly categorised as point-based, voxel-based, and multiview-based. PointNet is one of the pioneering point-based approaches. It exploits a shared multi-layer perceptron (MLP) network to extract per-point representations. PointNet++ extends this approach by introducing a nested hierarchical PointNet to extract local context information. Although simple, PointNet and PointNet++ are still widely used in the tasks of semantic segmentation, instance segmentation, and 3D detection. Multi-view solutions often involve view projection to utilize well-explored 2D techniques. In , for instance, view-pooling is used to combine information from different views of a 3D shape and thereby to construct a compact shape descriptor. 3DSIS, in contrast, projects features extracted from 2D views into 3D space. Voxel-based methods first transfer 3D points into rasterized voxels and apply convolution operations for feature extraction. Traditional 3D convolution methods are often constrained by inefficient computation and limited GPU memory. In addition, computational and representational resources are wasted on void space. DyCo3D, in contrast, uses sparse volumetric convolution to efficiently process this inherently sparse data. Focusing computation on the data, rather than the space it occupies, makes DyCo3D faster, more robust, and better able to extract local patterns.

Instance Segmentation of 3D Point Cloud. As in the 2D image domain, 3D instance segmentation approaches can be broadly divided into two groups: top-down and bottom-up. Top-down methods often use a detect-then-segment approach, which first detects 3D bounding boxes of the instances and then predicts foreground points. 3D-BoNet, for instance, first detects unoriented 3D bounding boxes from a single global representation by utilizing a Hungarian matching algorithm. Per-point features are then explored within each bounding box to mask out the background. Instead of regressing bounding boxes for instance proposals, GSPN generates instance shapes and applies analysis-by-synthesis. Bottom-up methods, in contrast, group sub-components into instances. Methods applying this approach have dominated the leaderboard of the ScanNet dataset (http://kaldir.vc.in.tum.de/scannet_benchmark/). The grouping techniques vary from simple clustering to complex graph-based algorithms based on learned embeddings. ASIS, for example, learns point-level embeddings, regularized by a discriminative loss function, which encourages points belonging to the same instance to be mapped to similar locations in a metric space while separating points belonging to different instances. A mean-shift algorithm is then applied to generate instance masks. Many subsequent works use the same general pipeline. PointGroup generates instance clusters from two sets of points: original and centroid-shifted points. A network the authors label ScoreNet is used to evaluate the candidates. OccuSeg, in contrast, uses multi-task learning to generate not only feature embeddings but also explicit occupancy embeddings that enable metric instance scale calculations to be made.

Dynamic Convolution. The existing works most closely related to DyCo3D are and . Dynamic convolution was first proposed to enhance filter representation by encoding sample-specific and position-specific knowledge. CondInst successfully applies it in the 2D image domain for instance segmentation. However, our experiments demonstrate that it performs poorly when applied directly to 3D point clouds, for the following reasons: 1) it introduces a large amount of computation, resulting in optimization difficulties; 2) its performance is constrained by the limited receptive field and representation capability of sparse convolution. In this paper, we adapt dynamic convolution to 3D point cloud instance segmentation and demonstrate its effectiveness and robustness on multiple benchmarks. Details are introduced in the next section.

Methods

2 Backbone Network

Although our method is not restricted to any specific choice of backbone, we select sparse convolution for its efficiency and competitive performance. Following , we construct a U-Net, which consists of an encoder and a decoder that have symmetrical structures. However, sparse convolution is often constrained by a limited receptive field and representation capability, due to the small number of convolution layers and channels. To address this, we place a light-weight transformer on top of the encoder to enhance long-range interactions. The transformer is identical to the implementation of , except for the position embedding layer, where the position-sensitive information is encoded as the mean of the pairwise direction vectors. More details can be found in the supplementary materials.

Semantic Segmentation. We apply the standard cross-entropy loss $\mathcal{L}_{\text{seg}}$ for semantic segmentation. The point-wise predictions of the category labels are denoted $\{l_{\text{seg}}^{i}\}_{i=1}^{N}$.

Centroid Offset. The variability in the distribution of points across surfaces makes aggregating contextual information complex. To address the problem, we follow VoteNet by shifting points towards the centroids of their corresponding instances. The point-wise prediction $o_{\text{off}}^{i}$ is supervised by the following loss function:

$$\mathcal{L}_{\text{ctr}}=\frac{1}{N_{v}}\sum_{i=1}^{N}\left\|p^{i}+o_{\text{off}}^{i}-ctr_{\text{gt}}^{i}\right\|\cdot\mathds{1}(p^{i}),$$

where $p^{i}$ denotes the coordinates of the $i$-th point, $o_{\text{off}}^{i}$ is the $i$-th item of $\textbf{O}_{\text{off}}$, and $ctr_{\text{gt}}^{i}$ is the geometric centroid of the corresponding instance. $\mathds{1}(p^{i})$ is an indicator function, representing whether $p^{i}$ is a valid point for centroid prediction, and $N_{v}$ is the total number of valid points. For example, the categories ‘floor’ and ‘wall’ are ignored for instance segmentation on ScanNetV2, so their points are excluded from offset prediction.
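As an illustration of the centroid-offset supervision described above, the following is a minimal sketch in plain Python. The function name `centroid_offset_loss` and its arguments are hypothetical, and the use of the Euclidean norm is an assumption, since the choice of norm is not stated in this section.

```python
import math

def centroid_offset_loss(points, offsets, centroids, valid):
    """Mean distance between shifted points and their instance centroids.

    points, offsets, centroids: lists of (x, y, z) tuples, one per point.
    valid: list of bools playing the role of the indicator function.
    """
    total, n_valid = 0.0, 0
    for p, o, c, v in zip(points, offsets, centroids, valid):
        if not v:
            continue  # e.g. 'floor' and 'wall' points carry no offset target
        shifted = [pi + oi for pi, oi in zip(p, o)]
        # Assumption: Euclidean distance; the norm is not specified in the text.
        total += math.dist(shifted, c)
        n_valid += 1
    # Normalise by the number of valid points (N_v).
    return total / n_valid if n_valid else 0.0

points = [(0, 0, 0), (1, 0, 0), (5, 5, 5)]
offsets = [(1, 0, 0), (0, 0, 0), (0, 0, 0)]
centroids = [(1, 0, 0), (1, 3, 4), (0, 0, 0)]
valid = [True, True, False]
loss = centroid_offset_loss(points, offsets, centroids, valid)
```

Only the two valid points contribute here: the first is shifted exactly onto its centroid, and the second remains at distance 5 from its centroid.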

3 Dynamic Weight Generator

The combination of the shallow network architecture and sparse convolution would typically cause a limited receptive field, and impair the method’s ability to exploit large-scale context. To generate discriminative filters for distinguishing different instances, we propose to group homogeneous points, that is, points that have close votes for the geometric centroids and share the same category prediction. Instance-aware filters are then dynamically generated by applying a small sub-network for large-context aggregation, as shown in Fig. 3. Given both the predicted semantic labels and the centroid offsets, we group homogeneous points using a strategy similar to that in . Details of the algorithm can be found in the supplementary materials. However, unlike PointGroup, which directly treats the clusters as individual instance proposals, our method explores the spatial distribution of these points and integrates large context to generate filters for instance decoding. Because it no longer relies on the quality of the instance proposals, our method is robust to the pre-defined hyper-parameters, as demonstrated in the following experiments. Qualitative results are presented in Fig. 1. Moreover, compared with CondInst, where filters are generated for every valid pixel, DyCo3D generates far fewer instance candidates (fewer than 60), and each filter is responsible for one instance of a specific class, which eases optimization and reduces the demand on hardware resources.
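The grouping of homogeneous points can be pictured as a connected-components search over centroid-shifted points that share a semantic label. The sketch below is only an illustration under that assumption: the exact algorithm is deferred to the supplementary materials, and the function name, arguments, and brute-force neighbour search are hypothetical choices made for readability rather than speed.

```python
import math
from collections import deque

def group_homogeneous_points(shifted, labels, radius, min_points=1):
    """Connected components over points with equal labels within `radius`.

    shifted: centroid-shifted coordinates, one (x, y, z) tuple per point.
    labels:  predicted semantic label per point.
    """
    n = len(shifted)
    visited = [False] * n
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, members = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            for j in range(n):  # brute-force O(N^2) neighbour search
                if (not visited[j] and labels[j] == labels[i]
                        and math.dist(shifted[i], shifted[j]) <= radius):
                    visited[j] = True
                    queue.append(j)
                    members.append(j)
        if len(members) >= min_points:  # drop clusters that are too small
            clusters.append(sorted(members))
    return clusters

shifted = [(0, 0, 0), (0.02, 0, 0), (0.04, 0, 0), (1, 1, 1), (0.01, 0, 0)]
labels = [3, 3, 3, 3, 5]
clusters = group_homogeneous_points(shifted, labels, radius=0.03)
```

With a radius of 0.03 the first three points chain into one cluster, while a radius of 0.01 leaves every point isolated, which mirrors the over-segmentation risk of a small radius.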

For cluster $\mathcal{C}^{z}$, we first voxelize it with a grid size of $g$, which is set to 14 in all our experiments. The feature of each grid cell is calculated as the average of the point features $\textbf{F}_{b}$ within the cell, where $\textbf{F}_{b}$ is the output of the backbone. To aggregate context for cluster $\mathcal{C}^{z}$, we use a light-weight sub-network $G_{w}(\cdot)$. It contains two sparse convolutional layers with a kernel size of 3, a global pooling layer, and an MLP layer. The output is the set of all convolutional parameters, flattened into a compact vector $\mathcal{W}_{\mathcal{C}}^{z}$. Each $\mathcal{W}_{\mathcal{C}}^{z}$ is responsible for one specific instance. The size of $\mathcal{W}_{\mathcal{C}}^{z}$ is determined by the feature dimension and the number of subsequent convolution layers (see Eq. (3)).
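The voxelization step with average pooling can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the grid spans the cluster's axis-aligned bounding box with `g` cells per axis, which is one plausible reading of "a grid size of $g$", and the function name is hypothetical.

```python
def voxelize_cluster(coords, feats, g=14):
    """Average point features inside each occupied cell of a g x g x g grid.

    Assumption: the grid covers the cluster's bounding box.
    """
    dims = range(len(coords[0]))
    lo = [min(c[d] for c in coords) for d in dims]
    hi = [max(c[d] for c in coords) for d in dims]
    cells = {}
    for c, f in zip(coords, feats):
        # Map each coordinate to a cell index in [0, g - 1].
        cell = tuple(
            min(g - 1, int((c[d] - lo[d]) / (hi[d] - lo[d]) * g))
            if hi[d] > lo[d] else 0
            for d in dims
        )
        total, count = cells.get(cell, ([0.0] * len(f), 0))
        cells[cell] = ([t + x for t, x in zip(total, f)], count + 1)
    # Divide the accumulated sums by the number of points per cell.
    return {cell: [t / n for t in total] for cell, (total, n) in cells.items()}

coords = [(0, 0, 0), (0.01, 0, 0), (1, 1, 1)]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grid = voxelize_cluster(coords, feats, g=14)
```

Only occupied cells are stored, which keeps the representation sparse before it is passed to the sub-network that generates the filters.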

4 Instance Decoder

Given a specific category, position representation is critical to separate different instances. To encode position-sensitive knowledge, we directly append position embeddings in the feature space. For the $z$-th instance with geometric centroid $\mathcal{C}_{\text{ctr}}^{z}$, the position embedding for the $i$-th point, $f_{\text{pos}}^{i}$, is calculated as:

Formally, the instance decoder is formulated as:

where $Z$ is the total number of clusters, $l_{\text{seg}}^{j}$ is the semantic prediction of the $j$-th point, $l_{\mathcal{C}}^{z}$ is the semantic label of the $z$-th cluster, and $L_{\text{BCE}}$ is the binary cross-entropy loss function. $\mathds{1}$ is an indicator function, showing that the loss is only computed on the points that have the same semantic label as group $\mathcal{C}^{z}$, and $N_{z}$ is a normalization term calculated as $\sum_{j=1}^{N}\mathds{1}_{l_{\text{seg}}^{j}=l_{\mathcal{C}}^{z}}$. In addition to the point-wise supervision, we also utilize the dice loss $\mathcal{L}_{\text{dice}}$, which is designed to address the imbalance between foreground and background points.
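To make the idea of decoding with dynamically generated filters concrete, here is a minimal sketch in plain Python. Several details are assumptions made for illustration only: the position embedding is taken to be the offset of each point from the cluster centroid, the generated vector is assumed to hold weights followed by biases for each point-wise (1x1) layer, ReLU is used between layers, and a sigmoid produces the mask probability. All function names and layer sizes are hypothetical.

```python
import math

def split_filters(flat, layer_dims):
    """Unflatten a weight vector into (weights, biases) per 1x1 conv layer."""
    layers, k = [], 0
    for c_in, c_out in layer_dims:
        w = [flat[k + o * c_in: k + (o + 1) * c_in] for o in range(c_out)]
        k += c_in * c_out
        b = flat[k: k + c_out]
        k += c_out
        layers.append((w, b))
    assert k == len(flat), "weight vector length must match the layer sizes"
    return layers

def decode_instance(feats, coords, sem_pred, centroid, cls, flat, layer_dims):
    """Apply one cluster's generated filters to every point of class `cls`."""
    layers = split_filters(flat, layer_dims)
    mask = []
    for f, p, l in zip(feats, coords, sem_pred):
        if l != cls:
            mask.append(0.0)  # category mask: other classes are skipped
            continue
        # Assumed position embedding: offset from the cluster centroid.
        x = list(f) + [pi - ci for pi, ci in zip(p, centroid)]
        for n, (w, b) in enumerate(layers):
            x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
                 for row, bi in zip(w, b)]
            if n < len(layers) - 1:
                x = [max(0.0, v) for v in x]  # ReLU between layers
        mask.append(1.0 / (1.0 + math.exp(-x[0])))  # sigmoid probability
    return mask

layer_dims = [(4, 2), (2, 1)]  # 1 feature channel + 3 position channels
flat = [1, 0, 0, 0, 0, -1, 0, 0, 0, 0, 1, 1, -1]
feats = [[2.0], [0.0], [2.0]]
coords = [(0, 0, 0), (0, 0, 0), (0, 0, 0)]
mask = decode_instance(feats, coords, [1, 1, 2], (0, 0, 0), 1, flat, layer_dims)
```

Because the filters are just a flat vector per cluster, all clusters can in principle be decoded independently, which is what allows the masks to be produced in parallel.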

5 Training details

The loss function of DyCo3D can be formulated as:

$$\mathcal{L}=\mathcal{L}_{\text{seg}}+\mathcal{L}_{\text{ctr}}+\mathcal{L}_{\text{mask}}+\mathcal{L}_{\text{dice}},$$

where $\mathcal{L}_{\text{seg}}$ is for semantic segmentation, $\mathcal{L}_{\text{ctr}}$ is for instance centroid supervision, and $\mathcal{L}_{\text{mask}}$ and $\mathcal{L}_{\text{dice}}$ are the two loss terms for instance segmentation. All loss weights are set to 1.0.
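The two instance-segmentation terms and their unit-weighted combination can be sketched for a single cluster as follows. This is an illustration under assumptions: the dice loss is written in a common squared-denominator form, which may differ from the formulation actually used, the averaging over clusters is omitted, and the function names are hypothetical.

```python
import math

def mask_losses(pred, target, sem_pred, cls, eps=1e-6):
    """Category-masked BCE and dice loss for one cluster's predicted mask."""
    # Indicator: only points whose semantic prediction matches the cluster.
    idx = [j for j, l in enumerate(sem_pred) if l == cls]
    n_z = len(idx)  # normalization term N_z
    if n_z == 0:
        return 0.0, 0.0
    bce = 0.0
    for j in idx:
        p = min(max(pred[j], eps), 1.0 - eps)  # clamp for numerical safety
        bce -= target[j] * math.log(p) + (1 - target[j]) * math.log(1.0 - p)
    bce /= n_z
    # Assumed dice form: 1 - 2 * sum(p * t) / (sum(p^2) + sum(t^2)).
    inter = sum(pred[j] * target[j] for j in idx)
    denom = sum(pred[j] ** 2 for j in idx) + sum(target[j] ** 2 for j in idx)
    dice = 1.0 - 2.0 * inter / (denom + eps)
    return bce, dice

def total_loss(l_seg, l_ctr, l_mask, l_dice):
    """Sum of the four terms, with all loss weights equal to 1.0."""
    return 1.0 * l_seg + 1.0 * l_ctr + 1.0 * l_mask + 1.0 * l_dice

bce, dice = mask_losses([0.5, 0.5, 0.9], [1, 0, 1], [1, 1, 2], cls=1)
```

In this example the third point is ignored because its semantic prediction differs from the cluster's class, so both terms are computed on two points only.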

At inference time, we perform NMS on the instance binary masks $\{m_{\mathcal{C}}^{z}\}_{z=1}^{Z}$, which are scored by the mean value of the semantic scores among the foreground points. The IoU threshold is the same as in , with a value of 0.3. Cluster $\mathcal{C}^{z}$ is ignored if it contains fewer than 50 points. The voxel size is set to 0.02m and 0.05m for ScanNetV2 and S3DIS, respectively. The search radius $r$ is set to 0.03m, which is the same as in , for a fair comparison. We train on 4 GPUs with a batch size of 16. For the first 6k iterations, we only train the semantic segmentation ($\mathcal{L}_{\text{seg}}$) and centroid prediction ($\mathcal{L}_{\text{ctr}}$) tasks, as the dynamic filters depend on the results of both. For the next 24k iterations, we compute all the loss terms. We use the Adam optimizer with an initial learning rate of 0.01. We apply the same data augmentation strategy as , including random cropping, flipping, and rotating.
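The mask-level NMS used at inference can be sketched as below. It is a minimal illustration with hypothetical names: masks are represented as sets of point indices, each mask is scored by the mean semantic score of its foreground points, and a candidate is suppressed when its IoU with an already kept mask exceeds the threshold (the strict inequality is an assumption).

```python
def mask_iou(a, b):
    """Intersection over union of two sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mask_nms(masks, sem_scores, iou_thresh=0.3):
    """Greedy NMS over binary instance masks.

    masks: list of sets of foreground point indices.
    sem_scores: per-point semantic score, indexed by point.
    Returns the indices of the kept masks, highest score first.
    """
    scores = [sum(sem_scores[i] for i in m) / len(m) if m else 0.0
              for m in masks]
    order = sorted(range(len(masks)), key=lambda z: scores[z], reverse=True)
    keep = []
    for z in order:
        # Keep a mask only if it does not overlap a better one too much.
        if all(mask_iou(masks[z], masks[k]) <= iou_thresh for k in keep):
            keep.append(z)
    return keep

masks = [{0, 1, 2, 3}, {1, 2, 3, 4}, {7, 8}]
sem_scores = [0.9, 0.9, 0.9, 0.9, 0.5, 0.0, 0.0, 0.7, 0.7]
keep = mask_nms(masks, sem_scores, iou_thresh=0.3)
```

Here the second mask overlaps the first with an IoU of 0.6 and has a lower score, so it is suppressed, while the disjoint third mask is kept.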

Experiments

To validate the effectiveness of our proposed method, we conduct both qualitative and quantitative experiments on datasets that are widely studied: ScanNetV2 and Stanford 3D Indoor Semantic Dataset (S3DIS) . In this section, we show that our method demonstrates superiority in both effectiveness and efficiency.

S3DIS contains 13 categories that commonly exist in indoor scenes. The point cloud data is collected in 6 large-scale areas, covering more than 6000 $m^{2}$ with more than 215 million points. Following the protocols of previous methods, we evaluate the performance on Area-5 and train the model on the other areas. ScanNet is another large-scale benchmark for indoor scene analysis, which consists of 1613 scans with 40 categories in total. The dataset is split into 1201, 312, and 100 scans for training, validation, and testing, respectively. Like previous methods, we evaluate the performance of instance segmentation on 18 common categories. We also follow the strategy in 3D-MPA and report the performance of 3D detection, where the results are obtained by fitting an axis-aligned bounding box around each predicted instance.

2 Evaluation Metrics

For ScanNetV2, we report the metric of mean average precision (mAP), which is widely used in the 2D image domain. AP@50 and AP@25 are also provided, with the IoU threshold set to 0.5 and 0.25, respectively. For S3DIS, we apply the metrics that are used in : mCov, mWCov, mPrec, and mRec. mCov is defined as the mean instance-wise IoU. mWCov denotes the weighted version of mCov, where the weights are determined by the sizes of the instances. mPrec and mRec denote the mean precision and recall, respectively.
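The mean instance-wise IoU and its size-weighted version can be computed as in the sketch below. It assumes the common convention that each ground-truth instance is matched to the prediction with the highest IoU, and that the weights are the fractions of ground-truth points belonging to each instance. The function name and the single-scene setting are hypothetical simplifications.

```python
def coverage_metrics(gt_instances, pred_instances):
    """Mean and size-weighted mean of the best IoU per ground-truth instance.

    gt_instances, pred_instances: lists of sets of point indices (one scene).
    """
    best = []
    for g in gt_instances:
        # Assumption: each ground-truth instance takes its best-matching mask.
        ious = [len(g & p) / len(g | p) for p in pred_instances if g | p]
        best.append(max(ious, default=0.0))
    total = sum(len(g) for g in gt_instances)
    mean_iou = sum(best) / len(best)
    weighted_iou = sum(len(g) / total * b for g, b in zip(gt_instances, best))
    return mean_iou, weighted_iou

gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {4, 5}]
mean_iou, weighted_iou = coverage_metrics(gt, pred)
```

In this example the larger instance is covered with an IoU of 0.75 and the smaller one perfectly, so the weighted score is lower than the unweighted one.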

3 Ablation Studies

In this section, we analyze the effect of each component of the proposed DyCo3D. Performance is reported in terms of mAP, AP@50, and AP@25. All experiments are conducted with the same settings and training schedule, and are evaluated on the ScanNetV2 validation set.

Baseline. We build a strong baseline by generating filters for each foreground point without introducing any clustering operation. Because $N$ is large, we randomly select 150 points for instance decoding. As presented in Table 1, this baseline achieves 24.8, 43.8, and 56.4 in terms of mAP, AP@50, and AP@25, respectively. We also implement CondInst, which has demonstrated its success in the 2D image domain. As presented in the second row of Table 1, mAP is boosted by 2.2% with the help of instance-related position embeddings.

Ablation on Grouping Homogeneous Points. Because sparse convolution limits the receptive field, it is important to incorporate rich context for distinguishing different instances. To this end, we propose to integrate the homogeneous points defined in Sec. 3.3. Thanks to the grouping operation, the model surpasses the baseline by a large margin on all metrics. In addition, the grouping operation reduces the number of instance candidates (to fewer than 60), easing optimization and reducing the demand on hardware resources.

Ablation on the Category-Aware Decoding. As filters are generated by exploring the points that have identical semantic predictions, only certain category context is encoded. We propose to convolve each filter on these category-specific points and mask out irrelevant points. As presented in Table 1, adding category masks improves the mAP from 31.8% to 34.1%.

Ablation on the Transformer. Because sparse convolution limits the receptive field and representation ability, we add a light-weight transformer on top of the bottleneck layer to capture long-range dependencies and enhance interactions among different points, while maintaining efficiency. As presented in Table 1, the transformer yields a 1.3% improvement in mAP.

Ablation on the Clustering Radius. The clustering radius $r$ is pre-defined in the grouping step. Because PointGroup treats clusters as instance proposals, its performance is highly dependent on the quality of the clustering results. We test the performance with different radii, as shown in Fig. 4. Grouping with a small $r$ may generate over-segmented results, while a large $r$ increases the risk of merging two adjacent objects. As a result, for PointGroup, changing the radius $r$ from 3cm to 2cm drops mAP by 6.3%, and changing it from 3cm to 1cm drops mAP by 23.9%. This volatility means that $r$ must be carefully tuned, and indicates limited generalization to varied scenes. Our method, on the other hand, eliminates the dependence on the proposals and shows strong robustness to the radius $r$. More qualitative results can be found in Fig. 5.

Analysis on Efficiency. Unlike previous point-based approaches, which require splitting each scene into 1m $\times$ 1m blocks and applying a complex block merging algorithm, our method takes the whole scene as input. We also compare DyCo3D with PointGroup, which has shown its efficiency on large-scale scenes. We report the inference time averaged over the whole validation set. With NMS as the only post-processing step, our method runs at 0.28s per scan on a 1080Ti GPU, while PointGroup runs at 0.39s on the same hardware.

4 Comparison with State-of-the-art Methods

Object Detection. Following , we report the performance of 3D object detection on the validation set of ScanNetV2, which is obtained by fitting axis-aligned bounding boxes containing the instances. As shown in Table 2, our method surpasses PointGroup by 3.1% and 3.0% in terms of AP@25 and AP@50, respectively, demonstrating the compactness of our generated instance masks.
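Fitting an axis-aligned bounding box to a predicted instance amounts to taking the per-axis minimum and maximum of its foreground points. The following is a minimal sketch; the function name and the boolean-mask representation are hypothetical.

```python
def fit_aabb(coords, instance_mask):
    """Axis-aligned bounding box of the points selected by a binary mask.

    coords: list of (x, y, z) tuples; instance_mask: list of bools.
    Returns (min_corner, max_corner), or None for an empty instance.
    """
    pts = [c for c, m in zip(coords, instance_mask) if m]
    if not pts:
        return None
    mins = tuple(min(p[d] for p in pts) for d in range(3))
    maxs = tuple(max(p[d] for p in pts) for d in range(3))
    return mins, maxs

coords = [(0, 0, 0), (1, 2, 3), (-1, 5, 0.5), (9, 9, 9)]
box = fit_aabb(coords, [True, True, True, False])
```

Because the box is determined entirely by the extreme foreground points, a few stray points in a mask enlarge it noticeably, which is why detection accuracy reflects how compact the predicted masks are.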

Instance Segmentation on S3DIS. We report the performance of instance segmentation on the S3DIS benchmark, as shown in Table 3. Our method achieves the highest performance on all the evaluation metrics. The results in terms of mPrec and mRec are 2.6% and 2.1% higher than those of PointGroup. Our method also reaches 60.9% in terms of AP@50, which is 3.1% higher than PointGroup. We compute all these metrics with the evaluation code provided by . Qualitative results are illustrated in Fig. 6.

Instance Segmentation on ScanNetV2. We report the results of instance segmentation on the validation and testing sets of ScanNetV2, as presented in Table 4 and Table 5, respectively. We report both AP@50 and mAP on the validation set. We implement DyCo3D with both small and large backbones, denoted as Ours-S and Ours-L, respectively. The two models share the same network structure but use a different number of channels for convolution. We first compare Ours-S with PointGroup, which is implemented with the same backbone. Our method surpasses it by 0.7% and 0.6% in terms of AP@50 and mAP, respectively. We also make a fair comparison with 3D-MPA: our large model surpasses it by 2.3% and 4.9% in terms of AP@50 and mAP, respectively. Finally, we report the performance of DyCo3D on the test set, as shown in Table 5, where it achieves the highest AP@50.

Conclusion

Achieving robustness to the inevitable variation in the data has been one of the ongoing challenges in 3D point cloud segmentation. We have shown here that dynamic convolution offers a mechanism by which the segmentation method can actively respond to the characteristics of the data at test time, and that this does in fact improve robustness. It also allows us to devise an approach that avoids many other pitfalls associated with bottom-up methods. The particular dynamic-convolution-based method that we have proposed, DyCo3D, not only achieves state-of-the-art results but also offers improved efficiency over existing methods.