Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks

Zhihao Liang, Zhihao Li, Songcen Xu, Mingkui Tan, Kui Jia

Introduction

The task of 3D instance segmentation is fundamental in many applications concerned with 3D scene understanding. Given an observed scene of point cloud reconstructed from depth cameras via multi-view fusion techniques , the task is to both assign semantic labels of pre-defined object categories to individual scene points, and differentiate those belonging to different object instances. Learning to achieve 3D instance segmentation is challenging at least in the following aspects: (1) observed scene points are usually sparse and irregular, which poses difficulties for learning point-wise classification based on shape features of local (and possibly global) contexts around individual points; (2) the unknown number of object instances in a scene introduces additional uncertainties to the problem of learning point-instance associations that is already combinatorial; (3) even though point-wise classification and point-instance associations can be conducted, learning consistencies among spatially adjacent points are not guaranteed, which may cause fragmented segmentations, especially around object boundaries (cf. Fig. 1 for an illustration).

State-of-the-art methods , e.g., those ranking top on the ScanNet benchmark , tackle (some of) the above challenges with the following general pipeline. They first train networks to learn point-wise features that are discriminative at both the semantic and instance levels, followed by a separate step of point clustering that groups together those believed to be on same instances, using the learned point-wise features. While promising, they have the following shortcomings. Firstly, the second step of point clustering is independent of network training, whose results are thus not guaranteed by guiding towards the ground-truth groupings of object instances. Secondly, while superpoints have been commonly used for semantic segmentation of 3D points , when coming to instance segmentation, these state-of-the-art methods, except OccuSeg , choose to conduct both feature learning and grouping in a point-wise manner, which takes away their chance to leverage the geometric regularities established at the mid-level shape representation of superpoints.

To overcome these shortcomings, we are motivated to develop an end-to-end solution for proposing object instances from an observed scene of points. Considering that a superpoint represents a geometrically homogeneous neighborhood, we choose to work with superpoints pre-computed from the scene points, and the problem of instance segmentation boils down as learning a network that groups superpoints on same object instances. In this work, we design such a solution called Semantic Superpoint Tree Network (SSTNet), as illustrated in Fig. 2. Similar to existing methods, SSTNet starts with a backbone that learns point-wise semantic and instance-level features; differently from them, SSTNet immediately aggregates these features as superpoint-wise ones efficiently via point-wise pooling. Key in SSTNet is an intermediate, semantic superpoint tree (SST), with the superpoints as its tree leaves. SST is constructed based on the pooled semantic (and instance-level) features of superpoints, and will be traversed and split by the subsequent SSTNet module of binary classification; starting from the root, a proposal of object instance is formed as the superpoints of a tree branch when non-splitting decision is made at the intermediate tree node that spans the branch (cf. Fig. 3 for an illustration). Our tree construction is highly efficient by choosing ways of feature inheritance from leaves to the root and pair-wise similarity metric, which support fast algorithms such as nearest-neighbor chain . We note that erroneous assignments of superpoints to instances may occur when constructing and traversing the tree. To compensate, we design a subsequent refinement module termed CliqueNet, which converts each proposed branch as a graph clique and learns to prune some of the branch nodes. A ScoreNet is finally used to evaluate the generated proposals, which gives instance segmentation results of our SSTNet.

Thorough experiments on the benchmark datasets of ScanNet and S3DIS show the efficacy of our proposed method. Notably, SSTNet outperforms all existing methods on the two benchmarks, and at the time of submission, it ranks top on the ScanNet (V2) leaderboard, with 2% higher of mAP than the second best method. We finally summarize our technical contributions as follows.

We propose an end-to-end solution of Semantic Superpoint Tree Network (SSTNet) to directly propose and evaluate object instances from observed 3D scenes. By working with superpoints, our method enjoys the benefit of geometric regularity that supports consistent and sharp segmentations, especially at object boundaries.

We choose a strategy of divisive grouping in SSTNet, which first builds the tree, followed by tree traversal for object proposal via node splitting. By constructing the tree with appropriate node merging and feature inheritance, our strategy is an order of magnitude faster than the alternative, agglomerative grouping, thus enabling efficient training and inference of SSTNet.

Considering that erroneous assignments of superpoints to instances may occur when constructing and traversing the tree, we design a refinement module in SSTNet, termed CliqueNet, which converts each proposed branch as a graph clique and learns to prune some of the branch nodes. Experiments show its efficacy.

Related Works

In this section, we briefly review the literature of 3D segmentation, focusing on those relevant to elements of our proposed method.

3D Semantic Segmentation Establishing geometric regularities is essential to realize semantic segmentation for an irregular point cloud. Recent methods used projection, voxelization or local aggregation to perform brief regularization, while the subsequent semantic learning task is still challenging. Instead, superpoint-based methods aggregated the geometrically homogeneous points as superpoints to establish a certain degree of geometric regularities. Furthermore, superpoints become the mid-level shape representation to boild down the problem of instance segmentation as grouping the superpoints that belong to the same instance.

3D Instance Segmentation Considering bottom-up methods, which cluster results based on semantic segmentation. heuristically clustered instance masks based on discriminative instance-level features. Intuitively, PointGroup utilized the adjacency of instance-wise coordinates. The above clustering results relied on the boundary conditions due to the lack of explicit boundary supervision. To address this issue, SSTNet combines the bottom-up clustering strategy with top-down traversal to realize end-to-end learning proposal generation.

Image Segmentation for Object Proposals To overcome the complexity caused by sliding windows, segmentation-based methods treat 2D detection as image segmentation, where the candidates are hypothesized from hierarchical image segmentation using an agglomeration manner. Furthermore, SSTNet involves a greedy agglomeration strategy and employs a learning splitting classifier to get rid of dependence on the times of agglomeration and generate precise mask results.

Overview

Individual Modules of the Proposed Network

where $\texttt{CE}(\cdot,\cdot)$ denotes the cross-entropy loss, and the remaining terms in (4.1) define a dice loss that alleviates the imbalance among the $K$ categories . Let $\bm{c}_{\bm{p}}^{*}$ denote the geometric center of the object instance to which any $\bm{p}\in\mathcal{I}$ belongs. We use the following loss to train the MLP for offset prediction

where $\mathds{I}(\bm{p})\in\{0,1\}$ is an indicator function telling whether the point $\bm{p}$ belongs to any object instance, and $N^{\prime}=\sum_{i=1}^{N}\mathds{I}(\bm{p}_{i})$ counts the number of such points. We give specifics of the two MLPs in the supplementary material.

2 Construction of Semantic Superpoint Tree

where the weights $w_{i}$ and $w_{j}$ are proportional to the respective sizes of $\mathcal{P}_{i}$ and $\mathcal{P}_{j}$ , i.e., $w_{i}=|\mathcal{P}_{i}|/(|\mathcal{P}_{i}|+|\mathcal{P}_{j}|)$ and $w_{j}=|\mathcal{P}_{j}|/(|\mathcal{P}_{i}|+|\mathcal{P}_{j}|)$ . Offset prediction of $t$ is computed similarly as $\bm{o}_{t}=w_{i}\bm{o}_{i}+w_{j}\bm{o}_{j}$ . We then compute the augmented $\bm{a}_{t}^{{\dagger}}$ from the obtained $\bm{a}_{t}$ and $\bm{o}_{t}$ . Note that we also compute the feature $\bm{f}_{t}=w_{i}\bm{f}_{i}+w_{j}\bm{f}_{j}$ for the node $t$ , which will be used in the subsequent module of proposal generation via tree traversal and splitting. Given the augmented $\bm{a}_{t}^{{\dagger}}$ for any $t\in\mathcal{T}$ and the pair-wise similarity metric based on Euclidean distance, the tree can be constructed hierarchically, as illustrated in Fig. 3, whose depth ranges between $\log_{2}^{M}$ and $M-1$ . For clarity, we write the $M$ leaf nodes as $\{t_{\mathcal{P}_{i}}\in\mathcal{T}\}_{i=1}^{M}$ and any root or intermediate one as $t\in\mathcal{T}$ .

Our use of the augmented semantic score $\bm{a}^{{\dagger}}=[\bm{a};\bm{c}_{\mathcal{P}}]$ to represent each $t_{\mathcal{P}}$ (and $t$ ) is based on the argument that for any pair of $\mathcal{P}_{i}$ and $\mathcal{P}_{j}$ on a same instance, both their semantic scores and instance centers are expected to be consistent. Empirical results in Section 5.1 show that it gives better performance than alternative choices do, which verifies the hypothesis. We thus term the constructed $\mathcal{T}$ as semantic superpoint tree.

Remarks Given the $M$ superpoints, the hierarchical tree construction described above has a complexity of $\mathcal{O}(M^{3})$ . Due to the linear feature inheritance (3) and the use of Euclidean distance as the similarity metric, construction of $\mathcal{T}$ can be made highly efficient by employing the fast algorithm of nearest-neighbor chain , which results in a same $\mathcal{T}$ at a complexity of $\mathcal{O}(M^{2})$ . On a machine running at 13 Hz, it takes $\sim$ 75 milliseconds per construction (e.g., scenes of ScanNet ), thus supporting online SST construction per iteration of network training.

3 Proposal Generation via Tree Traversal and Splitting

In this work, we implement the classifier $\phi$ as an MLP, whose details are given in the supplementary material. To train $\phi$ , we define the instance-level, ground-truth labels for nodes of the tree as follows. Assume that a training scene $\mathcal{I}$ contains $J$ object instances, which may belong to some of the $K$ categories. For any superpoint $\mathcal{P}$ (i.e, a leaf node $t_{\mathcal{P}}$ ), we assign its instance-level, soft label $\bm{q}_{\mathcal{P}}^{*}\in^{J}$ according to what proportions its contained points belong to (some of) the $J$ instances. The soft label $\bm{q}_{t}^{*}\in^{J}$ for any intermediate or root $t$ is again hierarchically inherited, via weighted averaging, from those of superpoints, similar to the inheritance of features. Given that $s_{1}$ and $s_{2}$ are the two child nodes of $t$ in $\mathcal{T}$ , we use the following loss symmetric to them to train $\phi$

where $\texttt{BCE}(\cdot,\cdot)$ denotes a binary cross-entropy loss, and $\bm{q}_{s_{1}}^{*\top}\bm{q}_{s_{2}}^{*}\in$ indicates, in a soft manner, whether the two child nodes belong to a same instance.

Remarks In the proposed SSTNet, we choose to first build the tree, as described in Section 4.2, and then learn to traverse and split tree nodes to generate instance proposals; in other words, we choose a strategy of divisive grouping, instead of an agglomerative one commonly used in hierarchical image segmentation . Our motivation for such a design is mostly computational: by using nearest-neighbor chain , our tree construction has a complexity of $\mathcal{O}(M^{2})$ , and the tree traversal to propose all the branches of object instances has a complexity of $\mathcal{O}(M)$ , giving rise to an overall complexity of $\mathcal{O}(M^{2}+M)$ ; in contrast, learning to generate proposals in an agglomerative manner has an order-of-magnitude higher complexity of $\mathcal{O}(M^{3})$ .

4 CliqueNet for Refinement of Proposals

We note that in the forward pass of SSTNet, once a superpoint $\mathcal{P}$ truly on an object instance is constructed into a wrong branch $\mathcal{B}_{t}$ of SST $\mathcal{T}$ , e.g., $\mathcal{B}_{t}$ corresponding to the background or a different instance, the mistake cannot be corrected. Nevertheless, when any branch $\mathcal{B}_{t}$ is proposed as an object instance, we have the chance to improve its score evaluation (cf. Section 4.5) by pruning its contained superpoints that may belong to other instances or the background.

where $\bar{\bm{D}}_{\mathcal{C}}$ is the diagonal degree matrix of $\bar{\bm{A}}_{\mathcal{C}}$ , and $\bm{W}_{\psi}^{1}$ denotes weight matrix of the first layer of $\psi$ . In this work, we use a three-layer CliqueNet whose specifics are given in the supplementary material.

CliqueNet outputs scores $\psi(\bm{F}_{\mathcal{C}}^{{\dagger}},\bm{A}_{\mathcal{C}})\in(0,1)^{M_{t}+1}$ defined respectively for the $M_{t}+1$ nodes in $\mathcal{C}$ . To train $\psi$ , we impose supervision on each node pair of $t$ and $\mathcal{P}_{i}$ , $i\in\{1,\dots,M_{t}\}$ , giving rise to

where the instance-level, soft labels $\bm{q}_{t}^{*}\in^{J}$ and $\bm{q}_{\mathcal{P}}^{*}\in^{J}$ are defined in Section 4.3.

5 Proposal Evaluation

where $|\mathcal{R}|$ is the number of proposals generated by our SSTNet (cf. Algorithm 1).

6 Training and Inference

We write our overall objective for training SSTNet as

Note that SSTNet is trained in a greedy, module-wise manner, which means that the individual loss terms applied to their respective modules are sequentially invoked into the overall loss (8). Although the tree $\mathcal{T}$ needs to be constructed in every forward pass of SSTNet, it is highly efficient as indicated by the complexity and practical running time given in preceding sections. The complexity of tree traversal for instance proposals is linear w.r.t. the number of superpoints; furthermore, once a proposal is formed at an intermediate tree node, it is not necessary to traverse all the descendant nodes. The inference is simply a same procedure as a forward pass of SSTNet training. Given the non-overlapping nature of our proposed object instances, post-processing steps such as non-maximum suppression are not necessary.

Experiments

Datasets We conduct experiments using the benchmark datasets of ScanNet (V2) and S3DIS. ScanNet has 1201 training, 312 validation, and 100 test scenes that contain object instances of 18 categories. Surface normals are also provided for each scene. We do analysis and ablation studies on its validation set, and submit our results to the hidden test set. S3DIS contains 6 large-scale indoor scenes with 13 object classes, we evaluate our model in the following aspects: (1) Area-5 is treated as the testing, while residuals are used for training, and (2) 6-fold cross validation that each area is treated as the testing once.

Implementation Details For each input scene, we concatenate the RGB values and point coordinates as the point-wise inputs of SSTNet. The network is trained using AdamW optimizer , with an initial learning rate of 1e-3 and weight decay of 1e-4; learning rates follow a polynomial learning rate policy. We set the batch size as 4. We pre-process scenes of S3DIS dataset by sub-sampling its points at a rate of 1/4. We employ a graph-based segmentation method to generate superpoints for ScanNet scenes. For S3DIS, each scene is represented by colored point cloud and we employ SPP + SPG to generate its superpoints. Module and layer specifics of SSTNet are given in the supplementary material.

Evaluation Metrics Following the official ScanNet (V2) evaluation protocol, we report mean Average Precisions (mAPs) at different thresholds of IoU as the evaluation metric to compare different methods. The mAP@25 and mAP@50 denote the average precision scores with IoU thresholds respectively set to 25% and 50%, and the mAP averages the scores with IoU thresholds set from 50% to 95%, with a step size of 5%.

We first conduct ablation studies to evaluate the efficacy of individual components proposed in SSTNet. These studies are conducted on the ScanNet (V2) dataset .

Analysis on Features for SST Construction The quality of SST depends on what features are used when contructing the tree. In this work, for a superpoint $\mathcal{P}$ , we choose semantic score $\bm{a}$ and predicted instance center $\bm{c}_{\mathcal{P}}$ over the triple $\{\bm{f},\bm{a},\bm{c}_{\mathcal{P}}\}$ , where $\bm{c}_{\mathcal{P}}$ is computed from the offset prediction $\bm{o}$ (cf. Section 4.2), and form the augmented $\bm{a}^{{\dagger}}=[\bm{a};\bm{c}_{\mathcal{P}}]$ for SST construction. Results in Table 3 verify our argument that for any pair of superpoints on a same instance, their semantic scores and instance centers are expected to be consistent, while their superpoint-wise features are not necessarily to be similar.

Efficacy of Proposal Generation via Tree Traversal and Split Learning To verify the efficacy of our main proposal generation scheme via SST, we compare with two alternatives. The first alternative conducts the same traversal of SST but replaces the node-splitting classifier $\phi$ with a simple thresholding scheme, which we term as SST-Thresholding; to determine whether an intermediate node $t$ is to be split into its child nodes $s_{1}$ and $s_{2}$ , it thresholds the Euclidean distance $\|\bm{a}_{s_{1}}^{{\dagger}}-\bm{a}_{s_{2}}^{{\dagger}}\|_{2}$ where we optimally tune the thresholds for its best performance We also try thresholding of the Euclidean distance $\|\bm{f}_{s_{1}}^{{\dagger}}-\bm{f}_{s_{2}}^{{\dagger}}\|_{2}$ , where $\bm{f}_{s_{1}}^{{\dagger}}=[\bm{f}_{s_{1}};\bm{a}_{s_{1}}^{{\dagger}}]$ and $\bm{f}_{s_{2}}^{{\dagger}}$ is computed similarly. It empirically gives even worse performance. . For the second alternative, instead of relying on SST construction, given the $M$ superpoints with its augmented semantic scores $\{\bm{a}_{i}^{{\dagger}}\}_{i=1}^{M}$ , we first build a K-nearest-neighbor graph based on pair-wise Euclidean distances, and then train a classifier to decide whether some graph edges should be disconnected; the resulting, disconnected graph cliques are proposed as object instances; we term this alternative as Superpoint Graph, which can be interpreted as a flattened version of learning to propose object proposals. Table 4 shows that SST-thresholding performs the best at the low-precision metric of mAP@25, suggesting our construction of SST is indeed useful for generation of object proposals. On the averaged metric of mAP, SSTNet greatly outperforms the two alternatives.

Efficacy of the CliqueNet Refinement Ablation study on the efficacy of CliqueNet is presented in Table 5, which shows that pruning superpoints from proposed tree branches is effective at high-precision regimes of mAP metrics.

2 Results on the ScanNet Benchmark

We train SSTNet on the training set of ScanNet (V2) and submit our model onto the testing sever. Table 1 shows that on the leaderboard of ScanNet (V2) test set, SSTNet outperforms all existing methods. Results at the metrics of AP@25 and AP@50 are reported in Table 2.

3 Results on S3DIS

Following the protocols used in previous methods, we employ the Area-5 and 6-fold cross validation, and use the mAP/AP@50/mean precision (mPrec)/mean recall(mRec) with IoU threshold 0.5 to evaluate SSTNet on the S3DIS dataset. One may refer to for precise definitions of these metrics. Table 6 shows that SSTNet outperforms all exist methods, confirming the generalizable advantage of our proposed method.

Acknowledgement This work was partially supported by the Guangdong R $\&$ D key project of China (No.: 2019B010155001), the National Natural Science Foundation of China (No.: 61771201), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No.: 2017ZT07X183).