Group-Free 3D Object Detection via Transformers
Ze Liu, Zheng Zhang, Yue Cao, Han Hu, Xin Tong
Introduction
3D object detection on point cloud simultaneously localizes and recognizes 3D objects from a 3D point set. As a fundamental technique for 3D scene understanding, it plays an important role in many applications such as autonomous driving, robotics manipulation, and augmented reality.
Different from 2D object detection, which works on regular 2D images, 3D object detection takes an irregular and sparse point cloud as input, which makes it difficult to directly apply techniques developed for 2D object detection. Recent studies infer the object location and extract object features directly from the irregular input point cloud for object detection. In these methods, a point grouping step is required to assign a group of points to each object candidate, and object features are then computed from the assigned groups of points. For this purpose, different grouping strategies have been investigated. Frustum-PointNet applies the frustum envelope of a 2D proposal box for point grouping. Point R-CNN groups points within the 3D box proposals to objects. VoteNet determines the group as the points which vote to the same (or spatially-close) center point. Although these hand-crafted grouping schemes facilitate 3D object detection, the complexity and diversity of objects in real scenes may lead to wrong point assignments (shown in Figure 1) and degrade the 3D object detection performance.
In this paper, we propose a simple yet effective technique for detecting 3D objects from point clouds without the handcrafted grouping step. The key idea of our approach is to take all points in the point cloud for computing features for each object candidate, in which the contribution of each point is determined by an automatically learned attention module. Based on this idea, we adapt the Transformer to fit for 3D object detection, which could simultaneously model the object-object and object-pixel relationships, and extract the object features without handcrafted grouping.
To further release the power of the Transformer architecture, we improve it in two aspects. First, we propose to iteratively refine the prediction of objects by updating the spatial encoding of objects in different stages, while the original application of Transformers adopts a fixed spatial encoding. Second, we use the ensemble of detection results predicted at all stages during inference, instead of only using the results of the last stage as the final results. These two modifications efficiently improve the performance of 3D object detection with little computational overhead.
We validate our method with both ScanNet V2 and SUN RGB-D benchmarks. Results show that our method is effective and robust to the quality of initial object candidates, where even a simple farthest point sampling approach has been able to produce strong results on ScanNet V2 and SUN RGB-D benchmarks. For the SUN RGB-D dataset, our method with the ensemble scheme results in significant performance improvement (+3.8 mAP@0.25). With few bells and whistles, the proposed approach achieved state-of-the-art performance on both benchmarks.
We believe that our method also demonstrates the strong potential of the attention mechanism or Transformers for point cloud modeling, as it naturally addresses the intrinsic irregular and sparse distribution problems of 3D point clouds. This contrasts with 2D image modeling, where such modeling tools mainly act as a challenger or a complementary component to mature grid modeling tools such as ConvNet variants and RoI Align.
Related Work
Early 3D object detection approaches project point cloud to 2D grids or 3D voxels so that the advanced convolutional networks can be directly applied. A set of methods project point cloud to the bird’s view and then employ 2D ConvNets for learning features and generate 3D boxes. These methods are mainly applied for the outdoor scenes in autonomous driving where objects are distributed on a horizontal plane so that their projections on the bird-view are occlusion-free. Note these approaches also need to address the irregular and sparse distribution issues of the 2D point projections, usually by pixelization. Other methods project point clouds into frontal views and then apply 2D ConvNets for object detection. Voxel-based methods convert points into voxels and employ 3D ConvNets to generate features for 3D box generation. All these projection/voxelization based methods suffer from quantization errors. The voxel-based methods also suffer from the large memory and computational cost of 3D convolutions.
Point based Detection
Recent methods directly process point clouds for 3D object detection. A core task of these methods is to compute object features from the irregularly and sparsely distributed points. All existing methods first assign a group of points to each object candidate and then compute object features from each point group. Frustum-PointNet groups points by the 3D Frustum envelope of a 2D box detected using an RGB object detector, and applies a PointNet on the grouped points to extract object features for 3D box prediction. Point R-CNN directly computes 3D box proposals, where the points within this 3D box are used for object feature extraction. PV-RCNN leverages the voxel representation to complement the point-based representation in Point R-CNN for 3D object detection and achieves better performance.
VoteNet groups points according to their voted centers and extracts object features from the grouped points by PointNet. Some follow-up works further improve the point group generation procedure or the object box localization and recognition procedure.
Our method is also a point-based detection approach. Unlike existing point-based approaches, our method involves all the points for computing the features of each object candidate by an attention module. We also stack the attention modules to iteratively refine the detection results while maintaining the simplicity of our method.
Network architecture for Point Cloud
A large set of network architectures have been proposed for various point cloud based learning tasks. Existing surveys provide a good taxonomy and review of these architectures, and discussing all of them is beyond the scope of this paper. Our method can take any point cloud architecture as the backbone network for computing the point features. We adopt PointNet++, used in previous methods, in our implementation for a fair comparison.
Attention Mechanism/Transformer in NLP and 2D Image Recognition
The attention-based Transformer is the dominant network architecture for learning tasks in the field of NLP. It has also been applied in the field of 2D image recognition as a strong competitor to the dominant grid/dense modeling tools such as ConvNets and RoI-Align. The works in 2D image recognition most related to this paper are those that apply the attention mechanism or Transformer architectures to 2D object detection.
Among these approaches, our method is most similar to DETR, which also applies a Transformer architecture to 2D object detection. However, we found that directly applying this method to point clouds leads to significantly lower performance than our approach in the 3D object detection task. On the one hand, this is due to the new techniques we propose; on the other hand, it is probably because our method better integrates the advantages of the traditional 3D detection framework. We discuss these factors in Sec. 4.6.
Our approach improves the Transformer models to better adapt to the 3D object detection task, including the update of object query locations in the multi-stage iterative box prediction, and an ensemble of the detection results of all stages. Although attention mechanisms still have a certain performance gap compared to the dominant convolution-based methods in other tasks, we found that this architecture may well address the point grouping issue for object detection on point clouds. As a result, we advocate a strong potential of this architecture for modeling irregular 3D point clouds.
Methodology
While our framework can leverage any point cloud network to extract point features, we adopt PointNet++ as the backbone network for a fair comparison with recent methods.
The backbone network receives a point cloud of $N$ points (i.e. 2048) as input. We follow an encoder-decoder architecture to first down-sample the point cloud input to $8\times$ reduced resolution (i.e. 256 points) through four stages of set abstraction layers, and then up-sample it to $2\times$ reduced resolution (i.e. 1024 points) by feature propagation layers. The network produces a $C$-channel vector representation for each point at the $2\times$ reduced resolution, denoted as $\{z_i\}_{i=1}^{M}$, which is then used in the initial object candidate sampling module and the stacked attention modules. In the following parts, we first describe these two modules in detail, and then present the loss function and head design for this framework.
1 Initial Object Candidate Sampling
While object detection on 2D images usually adopts data-independent anchor boxes as initial object candidates, it is generally intractable or impractical for 3D object detection to apply this simple top-down strategy, as the number of anchor boxes in the 3D search space is too large to handle. Instead, we follow recent practice and sample initial object candidates directly from the points of a point cloud in a bottom-up way.
We consider three simple strategies to sample initial object candidates from a point cloud:
Farthest Point Sampling (FPS). The FPS approach has been widely adopted to generate a point cloud from a 3D shape or to down-sample point clouds to a lower resolution. This method can also be employed to sample initial candidates from a point cloud. First, a point is randomly sampled from the point cloud. Then the point farthest from the already-chosen point set is iteratively selected until the number of chosen points meets the candidate budget. Though it is simple, we show in experiments that this sampling approach, together with our framework, is already comparable to the previous state-of-the-art 3D object detectors.
$k$-Closest Points Sampling (KPS). In this approach, we classify each point of a point cloud as a real object candidate or not. The label assignment in training follows this rule: a point is assigned positive if it is inside a ground-truth object box and it is one of the $k$-closest points to the object center. In inference, the initial candidates are selected according to the classification score of each point.
KPS with non-maximum suppression (KPS-NMS). Built on the above KPS method, we introduce an additional non-maximum suppression (NMS) step, which iteratively removes spatially close object candidates, to improve the recall of sampled object candidates given a fixed number of objects, following the common practice in 2D object detection. In addition to the objectness scores, we also predict the center of the object that each point belongs to, and the NMS is conducted accordingly. Specifically, the candidates located within a radius of the selected object center are suppressed. The radius is set to 0.05 in our experiments.
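The FPS strategy described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the experiments: the function name `farthest_point_sampling`, the `seed` argument, and the toy point cloud are assumptions introduced here.

```python
import math
import random


def farthest_point_sampling(points, num_samples, seed=0):
    """Return indices of `num_samples` points chosen by farthest point sampling.

    points: list of (x, y, z) tuples. The first index is drawn at random;
    `seed` (an assumption of this sketch) only makes the run repeatable.
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be in [1, len(points)]")
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < num_samples:
        # The next sample is the point farthest from the chosen set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


# Toy usage: two nearby points and two far-away points.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
print(farthest_point_sampling(cloud, 3))
```

The running `min_dist` list avoids recomputing all pairwise distances, so each added sample costs one pass over the point cloud.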
In experiments, we will demonstrate that our framework has strong compatibility with the choice of these sampling approaches, mainly ascribed to the robust object feature extraction approach described in the next subsection (see Table 3). We use the KPS approach by default, due to its better performance than the FPS approach, and the same effectiveness as the more complex KPS-NMS approach.
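As a rough illustration of the KPS rule, the sketch below assigns training labels (positive if a point lies inside a ground-truth box and is among the closest points to its center) and selects candidates by score at inference. The axis-aligned box parameterization `(cx, cy, cz, l, w, h)` and the function names are assumptions of this sketch, not details taken from the method.

```python
import math


def kps_labels(points, boxes, k):
    """Label points as positive candidates following the k-closest rule.

    points: list of (x, y, z).
    boxes: list of (cx, cy, cz, l, w, h) axis-aligned ground-truth boxes
           (an assumed parameterization).
    Returns a list of 0/1 labels, one per point.
    """
    labels = [0] * len(points)
    for cx, cy, cz, l, w, h in boxes:
        center = (cx, cy, cz)
        half = (l / 2, w / 2, h / 2)
        # Points that fall inside this ground-truth box.
        inside = [
            i for i, p in enumerate(points)
            if all(abs(p[a] - center[a]) <= half[a] for a in range(3))
        ]
        # Among them, the k points closest to the box center are positive.
        inside.sort(key=lambda i: math.dist(points[i], center))
        for i in inside[:k]:
            labels[i] = 1
    return labels


def select_candidates(scores, num_candidates):
    """At inference, keep the points with the highest objectness scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_candidates]
```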
2 Iterative Object Feature Extraction and Box Prediction by Transformer Decoder
With the initial object candidates generated by a sampling approach, we adopt the Transformer as the decoder to leverage all points of a point cloud to compute the object feature of each candidate. The multi-head attention network is the foundation of the Transformer. It has three input sets: the query set, the key set, and the value set. Usually, the key set and value set are different projections of the same set of elements. Given a query set $\{q_i\}$ and a common element set $\{p_k\}$ of the key set and value set, the output feature of the multi-head attention for each query element is the aggregation of the values weighted by the attention weights, formulated as:
$$\text{Att}\big(q_i, \{p_k\}\big) = \sum_{h=1}^{H} W_h \Big( \sum_{k} A^{h}_{i,k} \, V_h p_k \Big), \tag{1}$$
$$A^{h}_{i,k} = \frac{\exp\big[(Q_h q_i)^{\top} (U_h p_k)\big]}{\sum_{k'} \exp\big[(Q_h q_i)^{\top} (U_h p_{k'})\big]}, \tag{2}$$
where $h$ indexes over attention heads, $A^{h}_{i,k}$ is the attention weight, and $Q_h$, $V_h$, $U_h$, $W_h$ indicate the query projection weight, value projection weight, key projection weight, and output projection weight, respectively.
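To make the aggregation concrete, here is a minimal plain-Python sketch of multi-head attention for a single query: each head projects the query and the elements, computes softmax weights from query-key dot products, aggregates the projected values, and applies an output projection; the heads are summed. The dictionary keys `"Q"`, `"U"`, `"V"`, `"W"` and the helper names are assumptions of this sketch, and it omits the $1/\sqrt{d}$ scaling commonly used in standard Transformers.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(query, elements, heads):
    """Attend from one query vector to a set of element vectors.

    heads: list of dicts holding the projection matrices "Q" (query),
    "U" (key), "V" (value) and "W" (output); names chosen for this sketch.
    Returns (output vector, list of per-head attention weights).
    """
    out = None
    all_weights = []
    for head in heads:
        q = matvec(head["Q"], query)
        keys = [matvec(head["U"], e) for e in elements]
        values = [matvec(head["V"], e) for e in elements]
        # Attention weights: softmax over the query-key dot products.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        # Weighted aggregation of the values, then the output projection.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        projected = matvec(head["W"], mixed)
        out = projected if out is None else [a + b for a, b in zip(out, projected)]
        all_weights.append(weights)
    return out, all_weights
```

Using object features as both queries and elements gives self-attention, while using object features as queries and point features as elements gives cross-attention.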
While the standard Transformer predicts the sentence of a target language sequentially in an auto-regressive way, our Transformer computes object features and predicts 3D object boxes in parallel. The Transformer consists of several stacked multi-head self-attention and multi-head cross-attention modules, as illustrated in Figure 3.
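Denote the input point features at stage $l$ as $\{z_i^{(l)}\}_{i=1}^{M}$ and the object features at the same stage as $\{o_i^{(l)}\}_{i=1}^{K}$. A self-attention module models the interaction between object features, formulated as:
$$\text{Self-Att}\big(o_i^{(l)}, \{o_j^{(l)}\}\big) = \text{Att}\big(o_i^{(l)}, \{o_j^{(l)}\}\big). \tag{3}$$
A cross-attention module leverages point features to compute object features, formulated as:
$$\text{Cross-Att}\big(o_i^{(l)}, \{z_j^{(l)}\}\big) = \text{Att}\big(o_i^{(l)}, \{z_j^{(l)}\}\big), \tag{4}$$
where $\text{Att}(\cdot, \cdot)$ is the multi-head attention described above and the notations are similar to those in Eq. (3). After the object features are updated through the self-attention module and the cross-attention module, a feed-forward network (FFN) is applied to further transform the feature of each object.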
There are a few differences compared to the original Transformer decoders, as described below.
The original Transformer adopts a fixed spatial encoding for all of the stacked attention modules, indicating the index of each word. The application of Transformers to 2D object detection instantiates the spatial encoding (object prior) as a learnable weight. During inference, the spatial encoding is fixed and is the same for all images.
In this work, we propose to refine the spatial encoding of an object candidate stage by stage. Specifically, we predict the 3D box locations and categories at each decoder stage, and the predicted location of a box in one stage is used to produce the refined spatial encoding of the same object. The refined spatial encoding vector is then added to the output feature of this decoder stage and fed into the next stage. The spatial encodings of an object and a point are computed by applying independent linear layers on the parameterization vector of a 3D box $(x, y, z, l, h, w)$ and a point $(x, y, z)$, respectively. In the experiments, we show that this approach improves the mAP@0.25 and mAP@0.5 by 1.6 and 5.0 on the ScanNet V2 benchmark, compared to the approach without iterative refinement.
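The stage-by-stage refinement can be sketched as a simple loop. In this hedged sketch, each decoder stage is abstracted as a callable returning an updated feature and a predicted box, and the box encoder is a single linear layer applied to a six-number box vector. The names `refine_over_stages` and `linear`, the use of an initial box for the first encoding, and the toy stage used in the usage example are all assumptions, not the actual network.

```python
def linear(weight, bias, x):
    """Affine map y = weight @ x + bias on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def refine_over_stages(feature, initial_box, stages, box_encoder):
    """Run decoder stages, re-encoding the predicted box after each one.

    feature: object feature (list of floats).
    initial_box: six-number box vector of the sampled candidate (assumed).
    stages: list of callables `stage(x) -> (new_feature, box)`, standing in
            for the attention modules, FFN and box head of one decoder stage.
    box_encoder: (weight, bias) of the linear layer producing the encoding.
    Returns the list of boxes predicted at every stage.
    """
    weight, bias = box_encoder
    # Stage input = feature + spatial encoding of the current box estimate.
    x = [f + e for f, e in zip(feature, linear(weight, bias, initial_box))]
    boxes = []
    for stage in stages:
        out, box = stage(x)
        boxes.append(box)
        # The refined encoding from this stage's prediction is added to the
        # stage output and fed into the next stage.
        x = [o + e for o, e in zip(out, linear(weight, bias, box))]
    return boxes


# Toy usage: identity encoder and a dummy stage that halves its input.
identity6 = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
dummy_stage = lambda x: (x, tuple(v / 2 for v in x))
print(refine_over_stages([0.0] * 6, (2.0,) * 6, [dummy_stage, dummy_stage],
                         (identity6, [0.0] * 6)))
```

Keeping the per-stage boxes in a list also makes them available for the multi-stage ensemble at inference.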
Ensemble from Multi-Stage Predictions
Another difference is that we ensemble the predictions of different stages to produce the final detection results, while previous methods usually adopt the output of the last stage as the final results. Concretely, the detection results of different stages are combined and go through an NMS procedure (IoU threshold of 0.25) together to generate the final object detection results. We find this approach can significantly improve the performance on some benchmarks, e.g. +3.8 mAP@0.25 on the SUN RGB-D dataset. Also note that the overhead of this ensembling approach is marginal, mainly owing to the multi-stage nature of the Transformer decoder.
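A minimal sketch of this ensembling step is given below: detections from all stages are pooled, sorted by score, and filtered by greedy NMS with an IoU threshold of 0.25. The axis-aligned IoU, the per-class suppression, and the `(score, class_id, box)` tuple layout are simplifying assumptions of the sketch; oriented boxes would need a different IoU routine.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, l, w, h)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        inter *= max(0.0, hi - lo)
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def ensemble_stages(stage_detections, iou_threshold=0.25):
    """Pool detections from all decoder stages and run greedy NMS.

    stage_detections: one list per stage of (score, class_id, box) tuples.
    Suppression is applied per class, which is an assumption of this sketch.
    """
    pooled = [det for stage in stage_detections for det in stage]
    pooled.sort(key=lambda det: det[0], reverse=True)
    kept = []
    for score, cls, box in pooled:
        # Keep a box only if it does not overlap a kept box of its class.
        if all(c != cls or iou_3d(box, b) <= iou_threshold for _, c, b in kept):
            kept.append((score, cls, box))
    return kept
```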
3 Heads and Loss Functions
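We apply head networks on all decoder stages, with each mostly following the setting in prior work. There are 5 prediction tasks: objectness prediction with a binary focal loss $\mathcal{L}_{\text{obj}}$, box classification with a cross-entropy loss $\mathcal{L}_{\text{cls}}$, center offset prediction with a smooth-L1 loss $\mathcal{L}_{\text{center\_off}}$, size classification with a cross-entropy loss $\mathcal{L}_{\text{sz\_cls}}$, and size offset prediction with a smooth-L1 loss $\mathcal{L}_{\text{sz\_off}}$. All 5 predictions are obtained by a shared 2-layer MLP followed by an independent linear layer for each task.
The loss of the $l$-th decoder stage is the combination of these 5 loss terms by weighted summation:
$$\mathcal{L}^{(l)}_{\text{decoder}} = \beta_1 \mathcal{L}^{(l)}_{\text{obj}} + \beta_2 \mathcal{L}^{(l)}_{\text{cls}} + \beta_3 \mathcal{L}^{(l)}_{\text{center\_off}} + \beta_4 \mathcal{L}^{(l)}_{\text{sz\_cls}} + \beta_5 \mathcal{L}^{(l)}_{\text{sz\_off}},$$
where $\beta_1, \dots, \beta_5$ are balancing factors set to fixed default values. The losses on all $L$ decoder stages are averaged to form the final loss:
$$\mathcal{L}_{\text{decoder}} = \frac{1}{L} \sum_{l=1}^{L} \mathcal{L}^{(l)}_{\text{decoder}}.$$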
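The loss aggregation amounts to a weighted sum per stage followed by an average over stages, as the short sketch below shows. The term names and the balancing factors in `PLACEHOLDER_WEIGHTS` are placeholders chosen for illustration only; they are not the values used in the experiments.

```python
def stage_loss(terms, weights):
    """Weighted sum of the loss terms of one decoder stage."""
    return sum(weights[name] * value for name, value in terms.items())


def decoder_loss(per_stage_terms, weights):
    """Average the per-stage losses over all decoder stages."""
    losses = [stage_loss(terms, weights) for terms in per_stage_terms]
    return sum(losses) / len(losses)


# Placeholder balancing factors (hypothetical, NOT the paper's values).
PLACEHOLDER_WEIGHTS = {
    "objectness": 1.0,
    "box_cls": 1.0,
    "center_off": 1.0,
    "size_cls": 1.0,
    "size_off": 1.0,
}

# Toy usage with two decoder stages and made-up loss values.
stages = [
    {name: 1.0 for name in PLACEHOLDER_WEIGHTS},
    {name: 2.0 for name in PLACEHOLDER_WEIGHTS},
]
print(decoder_loss(stages, PLACEHOLDER_WEIGHTS))
```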
Sampling Head
The head designs and the loss functions of the sampling module are similar to those of the decoders. There are two differences: first, the box classification task is not involved; second, the objectness task follows the label assignment described in Sec. 3.1. Our final loss is the sum of the decoder loss $\mathcal{L}_{\text{decoder}}$ and the sampling-head loss $\mathcal{L}_{\text{sampler}}$:
$$\mathcal{L} = \mathcal{L}_{\text{decoder}} + \mathcal{L}_{\text{sampler}}.$$
Experiments
We validate our approach on two widely-used 3D object detection datasets, ScanNet V2 and SUN RGB-D, and we follow the standard data splits for both.
ScanNet V2 is constructed from a 3D reconstruction dataset of indoor scenes with enriched annotations. It consists of 1513 indoor scenes and 18 object categories. Per-point instance labels, semantic labels, and 3D bounding boxes are provided. We follow a standard evaluation protocol using mean Average Precision (mAP) under different IoU thresholds, without considering the orientation of bounding boxes.
SUN RGB-D is a single-view RGB-D dataset for 3D scene understanding, consisting of 5K indoor RGB and depth images. The annotation consists of per-point semantic labels and oriented object bounding boxes of 37 object categories. The standard mean Average Precision is used as the evaluation metric, and the evaluation is reported on the 10 most common categories, following prior work.
2 Implementation Details
ScanNet V2 We follow recent practice to use PointNet++ as default backbone network for a fair comparison. The backbone has 4 set abstraction layers and 2 feature propagation layers. For each set abstraction layer, the input point cloud is sub-sampled to 2048, 1024, 512, and 256 points with the increasing receptive radius of 0.2, 0.4, 0.8, and 1.2, respectively. Then, two feature propagation layers successively up-sample the points to 512 and 1024. More training details are given in Appendix.
SUN RGB-D The implementation mostly follows prior work. We use 20k points as input for each point cloud. The network architecture and the data augmentation are the same as those for ScanNet V2. As the orientation of the 3D box is required in evaluation, we include an additional orientation prediction branch for all decoder stages. More training details are given in the Appendix.
3 System-level Comparison
In this section, we compare with the previous state of the art on ScanNet V2 and SUN RGB-D. Since previous works usually report the best results over multiple training and testing runs in the system-level comparison, we report both best results and average results. We train each setting 5 times and test each training trial 5 times. The average performance of these 25 trials is reported to account for algorithm randomness.
ScanNet V2 The results are shown in Table 1. With the same backbone network of a standard PointNet++, the proposed approach achieves 67.3 mAP@0.25 and 48.9 mAP@0.5 using 6 decoder stages and 256 object candidates, which is 2.8 and 5.5 better than the previous best results using the same backbone. With 12 decoder stages, the gap increases to 6.3 on mAP@0.5.
With stronger backbones and more sampled object candidates, i.e. more channels and 512 candidates, the performance of the proposed approach is improved to 69.1 mAP@0.25 and 52.8 mAP@0.5, outperforming the previous best method by a large margin.
SUN RGB-D We also compare the proposed approach with the previous state of the art on the SUN RGB-D dataset, which is another widely used 3D object detection benchmark. On this dataset, the ensemble approach over multiple stages is used by default during inference. The results are shown in Table 2. Our base model achieves 63.0 on mAP@0.25 and 45.2 on mAP@0.5, which outperforms all previous state-of-the-art methods that only use the point cloud. In particular, it outperforms H3DNet on mAP@0.5 by 6.2.
4 Ablation Study
In this section, we validate our key designs on ScanNet V2. If not specified, all models have 6 attention modules, 256 sampled candidates, and are equipped with the proposed iterative object prediction approach. In evaluation, we report the average performance of 25 trials by default.
Sampling Strategy We first ablate the effects of different sampling strategies in Table 3. It shows that our approach performs well with different sampling strategies. It also works well over a wide range of hyper-parameters, such as $k$ in the KPS sampling approach (see Table 4).
These results indicate the robustness of our framework for choosing different sampling approaches.
Iterative Box Prediction Table 5 ablates several design choices for iterative box prediction. With a naive iterative method where no spatial encoding is involved in the decoder stages, the approach shows reasonably good performance of 64.7 mAP@0.25 and 43.4 mAP@0.5, likely because the location information is already implicitly included in the input object features. In fact, an additional fixed position encoding does not improve detection performance (64.6 mAP@0.25 and 43.5 mAP@0.5).
By refining the encodings of the box location stage by stage, the localization ability of the approach is significantly improved, with a gain of 4.1 points on the mAP@0.5 metric over the naive implementation (47.5 vs. 43.4). Also, a more detailed spatial encoding using both box center and size is beneficial, compared to one that only encodes box centers (66.3 vs. 65.2 on mAP@0.25 and 48.5 vs. 47.5 on mAP@0.5).
Table 6 shows the performance of iterative box prediction with different numbers of decoder stages. More stages bring significant performance improvement, especially on mAP@0.5. Compared with not applying any attention modules, our 6-stage model performs better on mAP@0.25 and mAP@0.5 by 3.0 and 7.8, respectively.
Ensemble Multi-stage Predictions Each decoder stage of our approach predicts a set of 3D boxes. It is natural to ensemble the results of different decoder stages in the hope of better final detection results. Table 7 shows the results, where significant performance improvements are observed on SUN RGB-D (+3.8 mAP@0.25 and +1.9 mAP@0.5), while the performance on ScanNet V2 is maintained. We hypothesize that this is because the point clouds of SUN RGB-D have lower quality than those of ScanNet V2: SUN RGB-D adopts real RGB-D signals to generate point clouds, in which many objects have missing parts due to occlusion, while ScanNet V2 generates point clouds from 3D shape meshes, which are more complete. The ensemble method can boost the performance more on real 3D scenes.
Comparison with Group-based Approaches Aggregating point features through RoI-Pooling or according to the voted centers are two typical handcrafted grouping strategies in 3D object detection. We take these two grouping strategies as baselines and compare with them. For a fair comparison, we only switch the feature aggregation mechanism while all other settings (e.g. the 6-stage decoder) remain unchanged. More details are in the Appendix. Table 8 shows the results. Although RoI-Pooling outperforms the voting approach, it is still worse than our group-free approach by 1.2 points on mAP@0.25 and 4.1 points on mAP@0.5.
5 Inference Speed
The computational complexity of the attention model is determined by the number of points in a point cloud and the number of sampled object candidates. In our approach, only a small number of object candidates are sampled, which makes the cost of the attention model insignificant. With our default setting (256 object candidates, 1024 output points), stacking one attention model brings 0.95 GFLOPs, which is quite light compared to the backbone.
In addition, the actual inference speed of our method is also very competitive compared to other state-of-the-art methods. For a fair comparison, all experiments are run on the same workstation (single Titan-XP GPU, 256G RAM, and Xeon E5-2650 v3) and environment (Ubuntu-16.04, Python 3.6, Cuda-10.1, and PyTorch-1.3.1). The official code of other methods is used for evaluation. The batch size of all experiments is set to 1 (i.e. a single point cloud). The results are shown in Table 9. Our method achieves better performance and also higher inference speed.
6 Comparison with DETR
DETR is a pioneering work that applies the Transformer to 2D object detection. Compared with DETR, our method involves more domain knowledge, such as the data-dependent initial object candidate generation, whereas DETR represents each object candidate with a data-independent object prior that is learned automatically without explicit supervision. Moreover, DETR has no iterative refinement of spatial encodings as in our approach. We evaluate these differences in 3D object detection. For a fair comparison, the backbone and decoder heads used in DETR are the same as in ours. We carefully tune the hyper-parameters for DETR and choose the best setting for comparison.
The results are shown in Table 10. With the same training length of 400 epochs, DETR achieves 39.6 mAP@0.25 and 21.4 mAP@0.5, significantly worse than our method. We conjecture that this is mainly due to the optimization difficulty caused by the data-independent object representation. The fixed spatial encoding may also contribute to the inferior performance. In fact, the performance improves significantly when these differences are bridged, reaching 59.9 mAP@0.25 and 42.9 mAP@0.5 using the same number of training epochs, and 61.8 mAP@0.25 and 45.2 mAP@0.5 with longer training.
The remaining performance gap is due to the difference in ground-truth assignments: DETR adopts a set loss to automatically determine the assignments by detection losses, while our approach manually assigns object candidates to ground-truths. Such an automatic assignment may also be difficult for a network to learn.
7 Qualitative Results
Fig. 4 illustrates the qualitative results on both ScanNet V2 and SUN RGB-D. As the decoder network goes deeper, more accurate detection results are observed.
Fig. 5 visualizes the learned cross-attention weights of different decoder stages. We could observe that the model of the lower stage always focuses on the surrounding points without considering the geometry. With the refinement, the model of the higher stage could focus more on the geometry and extract more high-quality object features.
Conclusion
In this paper, we present a simple yet effective 3D object detector based on the attention mechanism in Transformers. Unlike previous methods that require a grouping step for object feature computation, this detector is group-free: it computes object features from all points in a point cloud, with the contribution of each point automatically determined by the attention modules. The proposed method achieves state-of-the-art performance on the ScanNet V2 and SUN RGB-D benchmarks.
Appendix A1 Training Details
We follow recent practice to use the PointNet++ as our default backbone network for a fair comparison. The backbone network has four set abstraction layers and two feature propagation layers. For each set abstraction layer, the input point cloud is sub-sampled to 2048, 1024, 512, and 256 points with the increasing receptive radius of 0.2, 0.4, 0.8, and 1.2, respectively. Then, two feature propagation layers successively up-sample the points to 512 and 1024, respectively.
In the training phase, we use 50k points as input and adopt the same data augmentation as in prior work, including a random flip, a random rotation within a bounded range, and a random scaling of the point cloud by [0.9, 1.1]. We also evaluate our model on 40k points on ScanNet V2, following previous works, and the performance is similar: 66.3 (40k) vs. 66.2 (50k) on mAP@0.25, and 48.5 (40k) vs. 48.6 (50k) on mAP@0.5. The network is trained from scratch by the AdamW optimizer ($\beta_1$=0.9, $\beta_2$=0.999) for 400 epochs. The weight decay is set to 5e-4. The initial learning rate is 0.006 and is decayed by a factor of 10 at the 280-th epoch and the 340-th epoch. The learning rate of the attention modules is set to 1/10 of that of the backbone network. Gradient norm clipping (gradnorm_clip) is applied to stabilize the training dynamics. Following prior work, we use a class-aware head for box size prediction.
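The step schedule described above can be written as a small helper. This is a sketch under the assumption that epochs are counted from 0 and that the decay takes effect at the start of each milestone epoch; the function name `learning_rates` and its parameters are introduced here for illustration.

```python
def learning_rates(epoch, base_lr=0.006, milestones=(280, 340),
                   decay=0.1, attention_ratio=0.1):
    """Return (backbone_lr, attention_lr) for a given epoch.

    The backbone rate is divided by 10 at each milestone, and the attention
    modules use a fixed fraction of the backbone rate. Epoch indexing from 0
    is an assumption of this sketch.
    """
    num_decays = sum(epoch >= m for m in milestones)
    backbone_lr = base_lr * decay ** num_decays
    return backbone_lr, backbone_lr * attention_ratio


# Print the rates around the decay milestones of the ScanNet V2 setting.
for e in (0, 279, 280, 340, 399):
    print(e, learning_rates(e))
```

The SUN RGB-D setting would use the same helper with a different base rate, milestones, and ratio.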
SUN RGB-D
The implementation settings mostly follow prior work. We use 20k points as input for each point cloud. The network architecture and the data augmentation are the same as those for ScanNet V2. As the orientation of the 3D box is required in evaluation, we include an additional orientation prediction branch for all decoder layers. The orientation branch contains a classification task and an offset regression task with loss weights of 0.1 and 0.04, respectively.
In training, the network is trained from scratch by the AdamW optimizer ($\beta_1$=0.9, $\beta_2$=0.999) for 600 epochs if not specified. The initial learning rate is 0.004 and is decayed by a factor of 10 at the 420-th epoch, the 480-th epoch, and the 540-th epoch. The learning rate of the attention modules is set to 1/20 of that of the backbone network. The weight decay is set to 1e-7, and gradient norm clipping (gradnorm_clip) is used. We use a class-agnostic head for size prediction.
A1.2 Other Pooling Mechanisms
For a fair comparison, we only switch the feature aggregation mechanism while all other settings remain unchanged. In the following, we will introduce the implementation details of RoI-Pooling and Voting aggregation mechanism.
RoI-Pooling
For a given object candidate, the points within the predicted box of the object candidate are aggregated together, and the refined box is predicted from the aggregated features. As in our group-free approach, multi-stage refinement is also adopted. Thus the aggregated points and features are updated and refined over multiple stages. We also tried two different strategies for feature aggregation: average-pooling and max-pooling. The results are shown in Table 11. The approach with max-pooling performs better, so we use it for comparison by default.
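As a rough sketch of this baseline, the function below pools the features of all points falling inside a predicted box, with either max-pooling or average-pooling. Axis-aligned boxes in the form `(cx, cy, cz, l, w, h)` and the name `roi_pool` are assumptions made for brevity.

```python
def roi_pool(points, features, box, mode="max"):
    """Pool features of the points inside an axis-aligned box.

    points: list of (x, y, z); features: list of equally long feature lists;
    box: (cx, cy, cz, l, w, h), an assumed parameterization.
    Returns None when the box contains no point.
    """
    inside = [
        f for p, f in zip(points, features)
        if all(abs(p[a] - box[a]) <= box[a + 3] / 2 for a in range(3))
    ]
    if not inside:
        return None
    if mode == "max":
        # Channel-wise maximum over the grouped points.
        return [max(col) for col in zip(*inside)]
    if mode == "avg":
        # Channel-wise mean over the grouped points.
        return [sum(col) / len(inside) for col in zip(*inside)]
    raise ValueError("mode must be 'max' or 'avg'")
```

In a multi-stage setting, this function would be called again with the refined box from the previous stage, so the set of grouped points changes from stage to stage.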
Voting
The voting mechanism is first introduced by VoteNet and we implement it in our framework. Specifically, each point predicts the center of its corresponding object, and if the distance between the predicted center of points and the center of an object candidate is less than a threshold (set to 0.3 meters), then these points and the candidate are grouped. Further, a two-layer MLP with max-pooling is used to form the aggregation feature of the object candidate, and the refined boxes are predicted from the aggregated features in the multi-stage refinement process.
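The grouping rule of the voting baseline can be illustrated as follows: points whose predicted (voted) center lies within the threshold of a candidate center are grouped, and their features are max-pooled. The two-layer MLP applied before pooling is omitted here for brevity, and the function name `vote_group` is an assumption of this sketch.

```python
import math


def vote_group(predicted_centers, features, candidate_center, threshold=0.3):
    """Max-pool features of points voting close to a candidate center.

    predicted_centers: per-point voted object centers, list of (x, y, z).
    features: per-point feature lists of equal length.
    threshold: grouping distance in meters.
    Returns None when no point votes within the threshold.
    """
    grouped = [
        f for c, f in zip(predicted_centers, features)
        if math.dist(c, candidate_center) < threshold
    ]
    if not grouped:
        return None
    # Channel-wise maximum over the grouped points.
    return [max(col) for col in zip(*grouped)]
```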
Appendix A2 More Results
We show per-category results on ScanNet V2 and SUN RGB-D under different IoU thresholds. Table 12 and Table 13 show the results of mAP@0.25 and mAP@0.5 on ScanNet V2, respectively. Table 14 and Table 15 show the results of mAP@0.25 and mAP@0.5 on SUN RGB-D, respectively.
We also show more qualitative results of our method on ScanNet V2 and SUN RGB-D. The results are shown in Figure 6 (ScanNet V2) and Figure 7 (SUN RGB-D).