Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, Niki Trigoni

Introduction

Enabling machines to understand 3D scenes is a fundamental necessity for autonomous driving, augmented reality and robotics. Core problems on 3D geometric data such as point clouds include semantic segmentation, object detection and instance segmentation. Of these problems, instance segmentation has only started to be tackled in the literature. The primary obstacle is that point clouds are inherently unordered, unstructured and non-uniform. Widely used convolutional neural networks require the 3D point clouds to be voxelized, incurring high computational and memory costs.

The first neural algorithm to directly tackle 3D instance segmentation is SGPN, which learns to group per-point features through a similarity matrix. Similarly, ASIS, JSIS3D, MASC and 3D-BEVIS apply the same per-point feature grouping pipeline to segment 3D instances. Mo et al. formulate instance segmentation as a per-point feature classification problem in PartNet. However, the learnt segments of these proposal-free methods do not have high objectness, as they do not explicitly detect object boundaries. In addition, they inevitably require a post-processing step such as mean-shift clustering to obtain the final instance labels, which is computationally heavy. Another pipeline is the proposal-based 3D-SIS and GSPN, which usually rely on two-stage training and expensive non-maximum suppression to prune dense object proposals.

In this paper, we present an elegant, efficient and novel framework for 3D instance segmentation, where objects are loosely but uniquely detected through a single-forward stage using efficient MLPs, and then each instance is precisely segmented through a simple point-level binary classifier. To this end, we introduce a new bounding box prediction module together with a series of carefully designed loss functions to directly learn object boundaries. Our framework is significantly different from the existing proposal-based and proposal-free approaches, since we are able to efficiently segment all instances with high objectness, but without relying on expensive and dense object proposals. Our code and data are available at https://github.com/Yang7879/3D-BoNet.

As shown in Figure 1, our framework, called 3D-BoNet, is a single-stage, anchor-free and end-to-end trainable neural architecture. It first uses an existing backbone network to extract a local feature vector for each point and a global feature vector for the whole input point cloud. The backbone is followed by two branches: 1) instance-level bounding box prediction, and 2) point-level mask prediction for instance segmentation.

The bounding box prediction branch is the core of our framework. This branch aims to predict a unique, unoriented and rectangular bounding box for each instance in a single forward stage, without relying on predefined spatial anchors or a region proposal network. As shown in Figure 2, we believe that roughly drawing a 3D bounding box for an instance is relatively achievable, because the input point clouds explicitly include 3D geometry information. It is also extremely beneficial to do so before tackling point-level instance segmentation, since reasonable bounding boxes can guarantee high objectness for learnt segments. However, learning instance boxes involves two critical issues: 1) the total number of instances is variable, i.e., from 1 to many, and 2) there is no fixed order for the instances. These issues pose great challenges for correctly optimizing the network, because there is no information to directly link predicted boxes with ground truth labels to supervise the network. We show how to solve these issues elegantly. The box prediction branch simply takes the global feature vector as input and directly outputs a large, fixed number of bounding boxes together with confidence scores. These scores indicate whether each box contains a valid instance or not. To supervise the network, we design a novel bounding box association layer followed by a multi-criteria loss function. Given a set of ground-truth instances, we need to determine which of the predicted boxes best fit them. We formulate this association process as an optimal assignment problem and solve it with an existing solver. After the boxes have been optimally associated, our multi-criteria loss function not only minimizes the Euclidean distance of paired boxes, but also maximizes the coverage of valid points inside the predicted boxes.

The predicted boxes, together with point and global features, are then fed into the subsequent point mask prediction branch to predict a point-level binary mask for each instance. The purpose of this branch is to classify whether each point inside a bounding box belongs to the valid instance or the background. Assuming the estimated instance box is reasonably good, an accurate point mask is very likely, because this branch only needs to reject points that are not part of the detected instance. Even a random guess would classify about $50\%$ of the points correctly.

Overall, our framework differs from all existing 3D instance segmentation approaches in three respects. 1) Compared with the proposal-free pipeline, our method segments instances with high objectness by explicitly learning 3D object boundaries. 2) Compared with the widely-used proposal-based approaches, our framework does not require expensive and dense proposals. 3) Our framework is remarkably efficient, since the instance-level masks are learnt in a single forward pass without requiring any post-processing steps. Our key contributions are:

We propose a new framework for instance segmentation on 3D point clouds. The framework is single-stage, anchor-free and end-to-end trainable, without requiring any post-processing steps.

We design a novel bounding box association layer followed by a multi-criteria loss function to supervise the box prediction branch.

We demonstrate significant improvement over baselines and provide intuition behind our design choices through extensive ablation studies.

3D-BoNet

The bounding box prediction branch simply takes the global feature vector $\boldsymbol{F}_{g}$ as input, and directly regresses a predefined and fixed set of bounding boxes, denoted as $\boldsymbol{B}$, and the corresponding box scores, denoted as $\boldsymbol{B}_{s}$. We use ground truth bounding box information to supervise this branch. During training, the predicted bounding boxes $\boldsymbol{B}$ and the ground truth boxes are fed into a box association layer. This layer aims to automatically associate a unique and most similar predicted bounding box with each ground truth box. The output of the association layer is a list of association indices $\boldsymbol{A}$. The indices reorganize the predicted boxes, such that each ground truth box is paired with a unique predicted box for subsequent loss calculation. The predicted bounding box scores are also reordered accordingly before calculating the loss. The reordered predicted bounding boxes are then fed into the multi-criteria loss function. Basically, this loss function aims not only to minimize the Euclidean distance between each ground truth box and the associated predicted box, but also to maximize the coverage of valid points inside each predicted box. Note that both the bounding box association layer and the multi-criteria loss function are designed only for network training and are discarded during testing. Eventually, this branch is able to directly predict a correct bounding box together with a box score for each instance.

In order to predict a point-level binary mask for each instance, every predicted box, together with the previous local and global features, i.e., $\boldsymbol{F}_{l}$ and $\boldsymbol{F}_{g}$, is further fed into the point mask prediction branch. This network branch is shared by all instances of different categories, and is therefore extremely light and compact. Such a class-agnostic approach inherently allows general segmentation across unseen categories.

Bounding Box Prediction

Bounding Box Encoding: In existing object detection networks, a bounding box is usually represented by the center location and the length of three dimensions, or the corresponding residuals together with orientations. Instead, for simplicity, we parameterize the rectangular bounding box by only two vertices, namely its minimum and maximum corners.
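As a minimal sketch of this encoding, the ground truth box of an instance can be read off its points as a $2\times 3$ array holding the minimum and maximum vertices. The function name `box_from_points` and the toy coordinates are illustrative assumptions, not part of the released code.

```python
import numpy as np

def box_from_points(points):
    """Return the axis-aligned box of an (N, 3) point set as a (2, 3) array.

    Row 0 is the minimum vertex, row 1 is the maximum vertex.
    """
    points = np.asarray(points, dtype=float)
    return np.stack([points.min(axis=0), points.max(axis=0)])

# Toy instance with three points (hypothetical values).
instance = np.array([[0.2, 1.0, 0.5],
                     [0.8, 0.4, 0.9],
                     [0.5, 0.7, 0.1]])
box = box_from_points(instance)
```

Six numbers per box are enough here because the boxes are unoriented; an oriented box would need additional rotation parameters.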

Neural Layers: As shown in Figure 4, the global feature vector $\boldsymbol{F}_{g}$ is fed through two fully connected layers with Leaky ReLU as the non-linear activation function, followed by another two parallel fully connected layers. One layer outputs a $6H$ dimensional vector, which is then reshaped as an $H\times 2\times 3$ tensor. $H$ is a predefined and fixed number, the maximum number of bounding boxes that the whole network is expected to predict. The other layer outputs an $H$ dimensional vector followed by a sigmoid function to represent the bounding box scores. The higher the score, the more likely it is that the predicted box contains an instance, and thus the more valid the box.
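The layer structure described above can be sketched in NumPy as follows. This is only an illustration of the tensor shapes: the hidden width, the Leaky ReLU slope, the random initialization and all function names are assumptions, since they are not specified in this section.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption.
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_box_head(rng, d_global, d_hidden, H):
    """Random weights for two shared layers and two parallel output layers."""
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)
    return {
        "fc1": dense(d_global, d_hidden),
        "fc2": dense(d_hidden, d_hidden),
        "box": dense(d_hidden, 6 * H),    # 6H values, reshaped to (H, 2, 3)
        "score": dense(d_hidden, H),      # H values, squashed by a sigmoid
    }

def box_head(f_g, params, H):
    """Map a global feature vector to H boxes and H box scores."""
    w, b = params["fc1"]
    h = leaky_relu(f_g @ w + b)
    w, b = params["fc2"]
    h = leaky_relu(h @ w + b)
    w, b = params["box"]
    boxes = (h @ w + b).reshape(H, 2, 3)
    w, b = params["score"]
    scores = sigmoid(h @ w + b)
    return boxes, scores

rng = np.random.default_rng(0)
H = 24
params = init_box_head(rng, d_global=256, d_hidden=128, H=H)
boxes, scores = box_head(rng.normal(size=256), params, H)
```

Note that the output size depends only on $H$, not on the number of input points, because the head consumes the global feature vector alone.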

Optimal Association Formulation: To associate a unique predicted bounding box from $\boldsymbol{B}$ with each ground truth box of $\boldsymbol{\bar{B}}$, we formulate this association process as an optimal assignment problem. Formally, let $\boldsymbol{A}$ be a boolean association matrix where $\boldsymbol{A}_{i,j}=1$ iff the $i^{th}$ predicted box is assigned to the $j^{th}$ ground truth box. $\boldsymbol{A}$ is also called the association index in this paper. Let $\boldsymbol{C}$ be the association cost matrix where $\boldsymbol{C}_{i,j}$ represents the cost of assigning the $i^{th}$ predicted box to the $j^{th}$ ground truth box. Basically, the cost $\boldsymbol{C}_{i,j}$ reflects the similarity between two boxes: the lower the cost, the more similar the two boxes. Therefore, the bounding box association problem is to find the optimal assignment matrix $\boldsymbol{A}$ with the minimal overall cost, such that each ground truth box is assigned exactly one predicted box.
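Under the definitions above, and writing $T$ for the number of ground truth boxes (assumed to satisfy $T\le H$), the assignment problem can be stated in the standard linear-assignment form below. The constraint notation is a reconstruction from the surrounding text rather than a quotation.

$$
\boldsymbol{A}=\arg\min_{\boldsymbol{A}}\sum_{i=1}^{H}\sum_{j=1}^{T}\boldsymbol{C}_{i,j}\boldsymbol{A}_{i,j}
\quad\text{subject to}\quad
\sum_{i=1}^{H}\boldsymbol{A}_{i,j}=1,\quad
\sum_{j=1}^{T}\boldsymbol{A}_{i,j}\le 1,\quad
\boldsymbol{A}_{i,j}\in\{0,1\}.
$$

The equality constraint gives every ground truth box exactly one predicted box, and the inequality prevents a predicted box from being used twice.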

To solve the above optimal association problem, the existing Hungarian algorithm [20; 21] is applied.
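To make the objective concrete, the sketch below solves a tiny instance by brute-force enumeration. This is not the Hungarian algorithm: it is exponential in $H$ and only serves to show what the solver returns. A polynomial-time routine such as SciPy's `linear_sum_assignment` would be used at realistic sizes. The function name `associate` and the toy cost values are assumptions.

```python
from itertools import permutations

def associate(cost):
    """Find the minimum-cost assignment for an H x T cost matrix (T <= H).

    cost[i][j] is the cost of assigning predicted box i to ground truth box j.
    Returns (idx, total), where idx[j] is the predicted box paired with
    ground truth box j.
    """
    H, T = len(cost), len(cost[0])
    best_idx, best_total = None, float("inf")
    for idx in permutations(range(H), T):
        total = sum(cost[i][j] for j, i in enumerate(idx))
        if total < best_total:
            best_idx, best_total = list(idx), total
    return best_idx, best_total

# H = 4 predicted boxes, T = 2 ground truth boxes (hypothetical costs).
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5],
        [0.3, 0.4]]
idx, total = associate(cost)

# Reorder predictions so that the first T entries are the paired boxes.
order = idx + [i for i in range(len(cost)) if i not in idx]
```

The `order` list plays the role of the association index: applying it to the predicted boxes and scores puts the paired predictions first and the unpaired ones last.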

Association Matrix Calculation: To evaluate the similarity between the $i^{th}$ predicted box and the $j^{th}$ ground truth box, a simple and intuitive criterion is the Euclidean distance between the two pairs of min-max vertices. However, this alone is not optimal, because we also want the predicted box to include as many valid points as possible. As illustrated in Figure 5, the input point cloud is usually sparse and distributed non-uniformly in 3D space. For the same ground truth box #0 (blue), candidate box #2 (red) is much better than candidate box #1 (black), because box #2 shares more valid points with #0. Therefore, the coverage of valid points should be included when calculating the cost matrix $\boldsymbol{C}$. In this paper, we consider the following three criteria:

(1) Euclidean Distance between Vertices. The cost between the $i^{th}$ predicted box $\boldsymbol{B}_{i}$ and the $j^{th}$ ground truth box $\boldsymbol{\bar{B}}_{j}$ is calculated from the distance between their corresponding min-max vertices.
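One plausible form of this cost is the squared difference between corresponding vertices, averaged over the six coordinates. The averaging is an assumption made here for illustration, as is the function name `euclidean_cost`.

```python
import numpy as np

def euclidean_cost(pred_boxes, gt_boxes):
    """Pairwise vertex distance cost.

    pred_boxes: (H, 2, 3) predicted min-max vertices.
    gt_boxes:   (T, 2, 3) ground truth min-max vertices.
    Returns an (H, T) matrix of mean squared coordinate differences.
    """
    diff = pred_boxes[:, None, :, :] - gt_boxes[None, :, :, :]  # (H, T, 2, 3)
    return (diff ** 2).mean(axis=(2, 3))

pred = np.array([[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                 [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]])
gt = np.array([[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]])
C_ed = euclidean_cost(pred, gt)
```

In this toy case the first predicted box matches the ground truth exactly and costs 0, while the second differs by 1 in three of six coordinates and costs 0.5.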

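(2) Soft Intersection-over-Union (sIoU) on Points. This cost compares, point by point, a vector $\boldsymbol{q}_{i}$ describing which input points fall inside the $i^{th}$ predicted box with the corresponding vector $\boldsymbol{\bar{q}}_{j}$ for the $j^{th}$ ground truth box, where $q^{n}_{i}$ and $\bar{q}^{n}_{j}$ are the $n^{th}$ values of $\boldsymbol{q}_{i}$ and $\boldsymbol{\bar{q}}_{j}$.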
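A common way to turn such per-point vectors into a cost is a negated soft IoU, sketched below. It assumes that $\boldsymbol{q}_{i}$ holds soft point-in-box probabilities in $[0,1]$ and that $\boldsymbol{\bar{q}}_{j}$ is a hard 0/1 indicator. Both that reading and the exact formula are assumptions here, and the differentiable computation of $\boldsymbol{q}_{i}$ itself is not reproduced.

```python
import numpy as np

def points_in_box(points, box):
    """Hard 0/1 indicator of which (N, 3) points lie inside a (2, 3) box."""
    inside = np.all((points >= box[0]) & (points <= box[1]), axis=1)
    return inside.astype(float)

def siou_cost(q, q_bar, eps=1e-8):
    """Negated soft IoU between two per-point vectors (lower is better)."""
    inter = np.sum(q * q_bar)
    union = np.sum(q) + np.sum(q_bar) - inter
    return -inter / (union + eps)

points = np.array([[0.5, 0.5, 0.5], [2.0, 0.0, 0.0]])
unit_box = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
hard = points_in_box(points, unit_box)

q_bar = np.array([1.0, 1.0, 0.0, 0.0])   # ground truth indicator (toy values)
q = np.array([0.9, 0.8, 0.1, 0.0])       # soft prediction (toy values)
c_good = siou_cost(q, q_bar)
c_perfect = siou_cost(q_bar, q_bar)
```

Because the cost is negated, a perfect overlap approaches $-1$, and any extra points covered by the predicted box enlarge the union and raise the cost, which is why this criterion favors tight boxes.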

(3) Cross-Entropy Score. In addition, we also consider the cross-entropy score between $\boldsymbol{q}_{i}$ and $\boldsymbol{\bar{q}}_{j}$. Unlike the sIoU cost, which prefers tighter boxes, this score represents how confidently a predicted bounding box includes as many valid points as possible. It prefers larger and more inclusive boxes.
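As a sketch, this score can be computed as a standard binary cross-entropy averaged over the points. The use of the plain binary cross-entropy form, the clipping constant and the function name `ce_score` are assumptions.

```python
import numpy as np

def ce_score(q, q_bar, eps=1e-8):
    """Mean binary cross-entropy between soft vector q and 0/1 vector q_bar."""
    q = np.clip(q, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(q_bar * np.log(q) + (1.0 - q_bar) * np.log(1.0 - q))

q_bar = np.array([1.0, 1.0, 0.0])
confident = ce_score(np.array([0.9, 0.9, 0.1]), q_bar)
hesitant = ce_score(np.array([0.6, 0.6, 0.4]), q_bar)
```

A prediction that assigns high probability to the valid points obtains a lower score than a hesitant one, so minimizing this term pushes the box to cover valid points with confidence.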

Overall, criterion (1) guarantees the geometric boundaries for learnt boxes, while criteria (2) and (3) maximize the coverage of valid points and overcome the non-uniformity illustrated in Figure 5. The final association cost between the $i^{th}$ predicted box and the $j^{th}$ ground truth box is the unweighted combination of the three costs.
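Written out with the per-criterion symbols $\boldsymbol{C}^{ed}$, $\boldsymbol{C}^{sIoU}$ and $\boldsymbol{C}^{ces}$, and assuming the combination is a plain sum with unit weights, the final cost reads:

$$
\boldsymbol{C}_{i,j}=\boldsymbol{C}^{ed}_{i,j}+\boldsymbol{C}^{sIoU}_{i,j}+\boldsymbol{C}^{ces}_{i,j}.
$$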

Loss Functions: After the bounding box association layer, both the predicted boxes $\boldsymbol{B}$ and scores $\boldsymbol{B}_{s}$ are reordered using the association index $\boldsymbol{A}$, such that the first $T$ predicted boxes and scores are paired with the $T$ ground truth boxes.

Multi-criteria Loss for Box Prediction: The previous association layer finds the most similar predicted box for each ground truth box according to the minimal cost, which includes: 1) the vertex Euclidean distance, 2) the sIoU cost on points, and 3) the cross-entropy score. Therefore, the loss function for bounding box prediction is naturally designed to consistently minimize those same costs over the paired boxes.
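One way to write such a loss, assuming that it averages the three costs over the $T$ paired boxes (the symbol $\ell_{bbox}$ and the $1/T$ normalization are assumptions here), is:

$$
\ell_{bbox}=\frac{1}{T}\sum_{t=1}^{T}\left(\boldsymbol{C}^{ed}_{t,t}+\boldsymbol{C}^{sIoU}_{t,t}+\boldsymbol{C}^{ces}_{t,t}\right).
$$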

Here $\boldsymbol{C}^{ed}_{t,t}$, $\boldsymbol{C}^{sIoU}_{t,t}$ and $\boldsymbol{C}^{ces}_{t,t}$ are the costs of the $t^{th}$ paired boxes. Note that we only minimize the costs of the $T$ paired boxes; the remaining $H-T$ predicted boxes are ignored because there is no corresponding ground truth for them. Therefore, this box prediction sub-branch is agnostic to the predefined value of $H$. This raises an issue: since the $H-T$ negative predictions are not penalized, the network might predict multiple similar boxes for a single instance. Fortunately, the loss function for the parallel box score prediction is able to alleviate this problem.

Loss for Box Score Prediction: The predicted box scores aim to indicate the validity of the corresponding predicted boxes. After being reordered by the association index $\boldsymbol{A}$, the ground truth scores for the first $T$ scores are all ‘1’, and ‘0’ for the remaining invalid $H-T$ scores. We use the cross-entropy loss for this binary classification task.

Here $\boldsymbol{B}^{t}_{s}$ is the $t^{th}$ predicted score after being associated. Basically, this loss function rewards the correctly predicted bounding boxes, while implicitly penalizing the cases where multiple similar boxes are regressed for a single instance.
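The box score loss can be sketched as a binary cross-entropy whose targets are 1 for the first $T$ reordered scores and 0 for the remaining $H-T$. Averaging over $H$ and the function name `box_score_loss` are assumptions.

```python
import math

def box_score_loss(scores, T, eps=1e-8):
    """Binary cross-entropy over H reordered box scores.

    The first T scores are paired with ground truth boxes (target 1);
    the remaining H - T scores are unpaired (target 0).
    """
    H = len(scores)
    total = 0.0
    for t, s in enumerate(scores):
        s = min(max(s, eps), 1.0 - eps)  # keep the logarithm finite
        total += -math.log(s) if t < T else -math.log(1.0 - s)
    return total / H

# Two real instances among four predictions (toy values).
clean = box_score_loss([0.9, 0.8, 0.1, 0.2], T=2)
duplicate = box_score_loss([0.9, 0.8, 0.9, 0.2], T=2)
```

In the second call the third, unpaired box also claims a high score, as a duplicate detection would, and the loss rises accordingly. This is the mechanism by which the score loss discourages redundant boxes.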

Point Mask Prediction

Given the predicted bounding boxes $\boldsymbol{B}$ , the learnt point features $\boldsymbol{F}_{l}$ and global features $\boldsymbol{F}_{g}$ , the point mask prediction branch processes each bounding box individually with shared neural layers.

Neural Layers: As shown in Figure 6, both the point and global features are compressed to $256$ dimensional vectors through fully connected layers, before being concatenated and further compressed to $128$ dimensional mixed point features $\widetilde{\boldsymbol{F}}_{l}$. For the $i^{th}$ predicted bounding box $\boldsymbol{B}_{i}$, the estimated vertices and score are fused with the features $\widetilde{\boldsymbol{F}}_{l}$ through concatenation, producing box-aware features $\widehat{\boldsymbol{F}}_{l}$. These features are then fed through shared layers to predict a point-level binary mask, denoted as $\boldsymbol{M}_{i}$. We use a sigmoid as the last activation function. This simple box fusing approach is extremely computationally efficient compared with the commonly used RoIAlign in prior art [58; 15; 13], which involves expensive point feature sampling and alignment.
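The box fusing step can be illustrated with plain array operations. The sketch below assumes that the six vertex coordinates and the score are tiled to every point before concatenation, giving $128+7$ channels per point. The tiling and the function name `box_aware_features` are assumptions about one straightforward way to realize the concatenation.

```python
import numpy as np

def box_aware_features(mixed_feats, box, score):
    """Concatenate one box (2, 3) and its score onto every point feature.

    mixed_feats: (N, 128) mixed point features.
    Returns an (N, 135) array of box-aware features.
    """
    box_vec = np.concatenate([box.reshape(-1), [score]])    # (7,)
    tiled = np.tile(box_vec, (mixed_feats.shape[0], 1))     # (N, 7)
    return np.concatenate([mixed_feats, tiled], axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4096, 128))
one_box = np.array([[0.1, 0.2, 0.0], [0.6, 0.9, 0.5]])
fused = box_aware_features(feats, one_box, score=0.95)
```

No point sampling or feature re-alignment is involved: each box costs one tile and one concatenation, which is consistent with the efficiency argument made above.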

End-to-End Implementation

We use the Adam solver with its default hyper-parameters for optimization. The initial learning rate is set to $5\times 10^{-4}$ and then divided by 2 every $20$ epochs. The whole network is trained from scratch on a Titan X GPU. We use the same settings for all experiments, which supports the reproducibility of our framework.
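The step schedule described here amounts to the following small function, assuming epochs are counted from 0. The function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=5e-4, decay_every=20, factor=0.5):
    """Step decay: halve the learning rate every `decay_every` epochs."""
    return base_lr * factor ** (epoch // decay_every)

schedule = [learning_rate(e) for e in (0, 19, 20, 45)]
```

Epochs 0 to 19 use the initial rate, epochs 20 to 39 use half of it, and so on.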

Experiments

We first evaluate our approach on the ScanNet(v2) 3D semantic instance segmentation benchmark. Similar to SGPN, we divide the raw input point clouds into $1m\times 1m$ blocks for training, while using all points for testing, followed by the BlockMerging algorithm to assemble blocks into complete 3D scenes. In our experiments, we observe that the performance of the vanilla PointNet++ based semantic prediction sub-branch is limited and unable to provide satisfactory semantics. Thanks to the flexibility of our framework, we can easily train a parallel SCN network to estimate more accurate per-point semantic labels for the predicted instances of our 3D-BoNet. The average precision (AP) with an IoU threshold of 0.5 is used as the evaluation metric.

We compare with the leading approaches on 18 object categories in Table 1. In particular, SGPN, 3D-BEVIS and MASC are point feature clustering based approaches; R-PointNet learns to generate dense object proposals followed by point-level segmentation; 3D-SIS is a proposal-based approach using both point clouds and color images as input. PanopticFusion learns to segment instances on multiple 2D images with Mask-RCNN and then uses a SLAM system to reproject them back into 3D space. Our approach surpasses them all using point clouds only. Remarkably, our framework performs relatively well on all categories without favoring specific classes, demonstrating the superiority of our framework.

Evaluation on S3DIS Dataset

We further evaluate the semantic instance segmentation of our framework on S3DIS, which consists of complete 3D scans from 271 rooms belonging to 6 large areas. Our data preprocessing and experimental settings strictly follow PointNet, SGPN, ASIS, and JSIS3D. In our experiments, $H$ is set to $24$ and we follow the 6-fold evaluation [1; 51].

We compare with ASIS, the state of the art on S3DIS, and the PartNet baseline. For a fair comparison, we carefully train the PartNet baseline with the same PointNet++ backbone and other settings as used in our framework. For evaluation, the classical metrics mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5 are reported. Note that we use the same BlockMerging algorithm to merge the instances from different blocks for both our approach and the PartNet baseline. The final scores are averaged across all 13 categories. Table 2 presents the mPrec/mRec scores and Figure 7 shows qualitative results. Our method surpasses the PartNet baseline by large margins. It also outperforms ASIS, though not significantly, mainly because our semantic prediction branch (vanilla PointNet++ based) is inferior to that of ASIS, which tightly fuses semantic and instance features for mutual optimization. We leave feature fusion for future exploration.

Ablation Study

To evaluate the effectiveness of each component of our framework, we conduct $6$ groups of ablation experiments on the largest Area 5 of S3DIS dataset.

(1) Remove Box Score Prediction Sub-branch. Basically, the box score serves as an indicator and regularizer for valid bounding box prediction. After removing it, we train the network without the box score loss.

Initially, the multi-criteria loss function is a simple unweighted combination of the Euclidean distance, the soft IoU cost, and the cross-entropy score. However, this may not be optimal, because the density of input point clouds is usually inconsistent and different densities tend to favor different criteria. We therefore conduct three groups of experiments, (2) to (4), on ablated bounding box loss functions.

(5) Do Not Supervise Box Prediction. The predicted boxes are still associated according to the three criteria, but we remove the box supervision signal, so the framework is trained without the box prediction loss.

(6) Remove Focal Loss for Point Mask Prediction. In the point mask prediction branch, the focal loss is replaced by the standard cross-entropy loss for comparison.
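For reference, the two mask losses compared in experiment (6) can be sketched as below. The focal loss follows its standard published form; the values `alpha=0.75` and `gamma=2.0` are assumed hyper-parameters and are not stated in this section.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-8):
    """Standard binary cross-entropy averaged over points."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-8):
    """Focal loss: down-weights points that are already classified well."""
    p = np.clip(p, eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * np.log(p) * y
    neg = -(1.0 - alpha) * p ** gamma * np.log(1.0 - p) * (1.0 - y)
    return np.mean(pos + neg)

# One instance point among three background points (toy values).
y = np.array([1.0, 0.0, 0.0, 0.0])
p_good = np.array([0.9, 0.1, 0.1, 0.1])
p_bad = np.array([0.1, 0.9, 0.9, 0.9])
fl_good, fl_bad = focal_loss(p_good, y), focal_loss(p_bad, y)
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half of the cross-entropy. Larger `gamma` shifts the weight toward hard points, which matters when background points greatly outnumber instance points inside a box.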

Analysis. Table 3 shows the scores for ablation experiments. (1) The box score sub-branch indeed benefits the overall instance segmentation performance, as it tends to penalize duplicated box predictions. (2) Compared with Euclidean distance and cross-entropy score, the sIoU cost tends to be better for box association and supervision, thanks to our differentiable Algorithm 1. As the three individual criteria prefer different types of point structures, a simple combination of three criteria may not always be optimal on a specific dataset. (3) Without the supervision for box prediction, the performance drops significantly, primarily because the network is unable to infer satisfactory instance 3D boundaries and the quality of predicted point masks deteriorates accordingly. (4) Compared with focal loss, the standard cross entropy loss is less effective for point mask prediction due to the imbalance of instance and background point numbers.

Computation Analysis

(1) For point feature clustering based approaches including SGPN, ASIS, JSIS3D, 3D-BEVIS and MASC, the computational complexity of the post-clustering algorithm, such as Mean Shift, tends towards $\mathcal{O}(TN^{2})$, where $T$ is the number of instances and $N$ is the number of input points. (2) For dense proposal-based methods including GSPN, 3D-SIS and PanopticFusion, a region proposal network and non-maximum suppression are usually required to generate and prune the dense proposals, which is computationally expensive. (3) Both the PartNet baseline and our 3D-BoNet have a similarly efficient computational complexity of $\mathcal{O}(N)$. Empirically, our 3D-BoNet takes around $20$ ms of GPU time to process $4k$ points, while most approaches in (1) and (2) need more than $200$ ms of GPU/CPU time to process the same number of points.

Related Work

To extract features from 3D point clouds, traditional approaches usually craft features manually [5; 42]. Recent learning based approaches mainly include voxel-based [42; 46; 41; 23; 40; 11; 4] and point-based schemes [37; 19; 14; 16; 45].

Semantic Segmentation: PointNet shows leading results on classification and semantic segmentation, but it does not capture context features. To address this, a number of approaches [38; 57; 43; 31; 55; 49; 26; 17] have been proposed recently. Another pipeline is convolutional kernel based approaches [55; 27; 47]. Basically, most of these approaches can be used as our backbone network and trained in parallel with our 3D-BoNet to learn per-point semantics.

Object Detection: The common way to detect objects in 3D point clouds is to project points onto 2D images to regress bounding boxes [25; 48; 3; 56; 59; 53]. Detection performance is further improved by fusing RGB images in [3; 54; 36; 52]. Point clouds can also be divided into voxels for object detection [9; 24; 60]. However, most of these approaches rely on predefined anchors and a two-stage region proposal network, and it is inefficient to extend them to 3D point clouds. Without relying on anchors, the recent PointRCNN learns to detect via foreground point segmentation, and VoteNet detects objects via point feature grouping, sampling and voting. By contrast, our box prediction branch is completely different from them all. Our framework directly regresses 3D object bounding boxes from the compact global features through a single forward pass.

Instance Segmentation: SGPN is the first neural algorithm to segment instances on 3D point clouds by grouping point-level embeddings. ASIS, JSIS3D, MASC and 3D-BEVIS use the same strategy to group point-level features for instance segmentation. Mo et al. introduce a segmentation algorithm in PartNet by classifying point features. However, the learnt segments of these proposal-free methods do not have high objectness, as they do not explicitly detect object boundaries. By drawing on the successful 2D RPN and RoI, GSPN and 3D-SIS are proposal-based methods for 3D instance segmentation. However, they usually rely on two-stage training and a post-processing step for dense proposal pruning. By contrast, our framework directly predicts a point-level mask for each instance within an explicitly detected object boundary, without requiring any post-processing steps.

Conclusion

Our framework is simple, effective and efficient for instance segmentation on 3D point clouds. However, it also has some limitations which point to future work. (1) Instead of using an unweighted combination of the three criteria, it would be better to design a module that automatically learns the weights, so as to adapt to different types of input point clouds. (2) Instead of training a separate branch for semantic prediction, more advanced feature fusion modules can be introduced to mutually improve both semantic and instance segmentation. (3) Our framework follows the MLP design and is therefore agnostic to the number and order of input points. It is desirable to train and test directly on large-scale input point clouds instead of the divided small blocks, by drawing on recent work.

Appendix A Experiments on ScanNet Benchmark

ScanNet(v2) consists of 1613 complete 3D scenes acquired from real-world indoor spaces. The official split has 1201 training scenes, 312 validation scenes and 100 hidden testing scenes. The original large point clouds are divided into $1m\times 1m$ blocks with a $0.5m$ overlap between neighbouring blocks. This data preprocessing step is the same as that used by PointNet for the S3DIS dataset. We sample $4096$ points from each block for training, but use all points of a block for testing. Each point is represented by a 9D vector (normalized xyz in the block, rgb, normalized xyz in the room). $H$ is set to $20$ in our experiments. We train our 3D-BoNet to predict object bounding boxes and point-level masks, and in parallel train an officially released ResNet-based SCN network to predict point-level semantic labels.
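The block division can be sketched along one axis as follows. Treating the scene as starting at coordinate 0, using a stride of $0.5m$ to obtain the stated overlap, and stopping once a block reaches the scene boundary are assumptions. The function name `block_origins` is illustrative.

```python
def block_origins(extent, block=1.0, stride=0.5):
    """Start coordinates of overlapping blocks covering [0, extent] on one axis."""
    origins = []
    x = 0.0
    while True:
        origins.append(x)
        if x + block >= extent:   # this block already reaches the boundary
            break
        x += stride
    return origins

# A hypothetical 2 m x 3 m room, partitioned in the horizontal plane.
xs = block_origins(2.0)
ys = block_origins(3.0)
blocks = [(x, y) for x in xs for y in ys]
```

Each `(x, y)` pair is the lower corner of one $1m\times 1m$ block; the points falling into a block would then be subsampled to $4096$ for training.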

Figure 8 shows qualitative results of our 3D-BoNet for instance segmentation on the ScanNet validation split. It can be seen that our approach tends to predict complete object instances, instead of inferring tiny but invalid fragments. This demonstrates that our framework indeed guarantees high objectness for segmented instances. The red circles highlight failure cases, where very similar instances cannot be segmented well by our approach.

Appendix B Experiments on S3DIS Dataset

The original large point clouds are divided into $1m\times 1m$ blocks with a $0.5m$ overlap between neighbouring blocks, the same as in PointNet. We sample $4096$ points from each block for training, but use all points of a block for testing. Each point is represented by a 9D vector (normalized xyz in the block, rgb, normalized xyz in the room). $H$ is set to $24$ in our experiments. We train our 3D-BoNet to predict object bounding boxes and point-level masks, and in parallel train a vanilla PointNet++ based sub-branch to predict point-level semantic labels. In particular, the semantic, bounding box and point mask sub-branches all share the same PointNet++ backbone to extract point features, and are trained end-to-end from scratch.

Figure 9 shows the training curves of our proposed loss functions on Areas (1,2,3,4,6) of S3DIS dataset. It demonstrates that all the proposed loss functions are able to converge consistently, thus jointly optimizing the semantic segmentation, bounding box prediction, and point mask prediction branches in an end-to-end fashion.

Figure 10 presents the qualitative results of predicted bounding boxes and scores. It can be seen that the predicted boxes are not necessarily tight and precise. Instead, they tend to be inclusive but with high objectness. Fundamentally, this highlights the simple but effective concept of our bounding box prediction network. Given these bounded points, it is extremely easy to segment the instance inside.

Figure 11 visualizes the predicted instance masks, where the black points have $\sim$ 0 probability and the brighter points have $\sim$ 1 probability to be an instance within each predicted mask.

Appendix C Experiments for Computation Efficiency

Table 4 compares the time consumption of four existing approaches using their released code on the validation split (312 scenes) of the ScanNet(v2) dataset. SGPN, ASIS, GSPN and our 3D-BoNet are implemented in TensorFlow 1.4, and 3D-SIS in PyTorch 0.4. All approaches are run on a single Titan X GPU, with the pre/post-processing steps on a single thread of an i7 CPU core. Note that 3D-SIS automatically falls back to the CPU when some large scenes cannot be processed by the single GPU. Overall, our approach is much more computationally efficient than existing methods, running up to $20\times$ faster than ASIS.

Appendix D Gradient Estimation for Hungarian Algorithm

Given the predicted bounding boxes $\mathbf{B}$ and the ground-truth boxes $\mathbf{\bar{B}}$, we compute the assignment cost matrix $\mathbf{C}$. This matrix is converted to a permutation matrix $\mathbf{A}$ using the Hungarian algorithm. Here we focus on the Euclidean distance component of the loss, $\mathbf{C}^{ed}$. By the chain rule, the derivative of this loss component w.r.t. the network parameters $\theta$ (Eq. 9) has two terms: one that propagates the box alignment error through the predicted boxes $\mathbf{B}$, and one that passes through the assignment $\mathbf{A}$ via $\frac{\partial\mathbf{A}}{\partial\mathbf{C}}$.

The components are easily computable except for $\frac{\partial\mathbf{A}}{\partial\mathbf{C}}$, the gradient of the permutation w.r.t. the assignment cost matrix, which is zero nearly everywhere. In our implementation, we found that the network is able to converge when this term is set to zero.

However, convergence could be sped up using the straight-through estimator, which assumes that the gradient of the rounding is the identity (or a small constant), $\frac{\partial\mathbf{A}}{\partial\mathbf{C}}=\mathds{1}$. This would speed up convergence because it allows both the error in the bounding box alignment (1st term of Eq. 9) to be backpropagated and the assignment to be reinforced (2nd term of Eq. 9). This approach has been shown to work well in practice for many problems, including differentiating through permutations for solving combinatorial optimization problems and training binary neural networks. More complex approaches could also be used in our framework for computing the gradient of the assignment, such as one that uses a Plackett-Luce distribution over permutations and a reparameterized gradient estimator.

Appendix E Generalization to Unseen Scenes and Categories

Our framework learns the object bounding boxes and point masks from raw point clouds without coupling with semantic information, which inherently allows generalization across new categories and scenes. We conduct extra experiments to qualitatively demonstrate the generality of our framework. In particular, we use the well-trained model from the S3DIS dataset (Areas 1/2/3/4/6) to directly test on the validation split of the ScanNet(v2) dataset. Since the ScanNet dataset contains many more object categories than the S3DIS dataset, there are a number of categories (e.g., toilet, desk, sink, bathtub) that the trained model has never seen before.

As shown in Figure 12, our model is still able to predict high-quality instance labels even though the scenes and some object categories have not been seen before. This shows that our model does not simply fit the training dataset. Instead, it tends to learn the underlying geometric features which are able to be generalized across new objects and scenes.