FusionRCNN: LiDAR-Camera Fusion for Two-stage 3D Object Detection

Xinli Xu, Shaocong Dong, Lihe Ding, Jie Wang, Tingfa Xu, Jianan Li

I Introduction

3D object detection is a fundamental task in autonomous driving and robotics; it aims to capture accurate 3D information from multiple sensors. Because LiDAR sensors naturally provide accurate depth and shape information, previous methods achieve competitive performance using point clouds alone. Furthermore, some approaches significantly improve performance with a two-stage refinement module, which has inspired researchers to explore more effective LiDAR-based two-stage detectors.

Two-stage methods can be divided into three main categories according to the representation of Point of Interest, i.e., point-based, voxel-based and point-voxel-based. Point-based approaches take sampled points as input and obtain point features for RoI refinement. Voxel-based methods rasterize point clouds into voxel grids and extract features with 3D CNNs for refinement. Point-voxel-based approaches combine the two feature learning schemes to improve detection performance. However, regardless of the representation, the sparsity and non-uniform distribution of point clouds make it difficult to distinguish and localize distant objects, leading to false or missed detections, as illustrated in Fig. 1. The problem becomes much worse when a proposal contains only a few (1-5) points, from which we can hardly obtain enough semantic information. Fortunately, cameras are complementary to LiDAR, providing dense texture information. How to design a LiDAR-Camera fusion paradigm for the second stage that leverages these complementary strengths is therefore of great importance.

In this work, we focus on fusing LiDAR point clouds and images in the refinement stage. Previous works use an image segmentation sub-network to extract image features and attach them to the raw points. However, we find that such point-based fusion ignores the semantic density of image features and relies heavily on the image segmentation sub-network. In light of the above, this work presents a deep fusion method, dubbed FusionRCNN, which comprises three steps: i) extract RoI features from points and images corresponding to proposals from any one-stage detector; ii) fuse the features of these two modalities through well-designed intra-modality self-attention and inter-modality cross-attention, abandoning the heavy reliance on hard associations between points and images while keeping the semantic density of images; iii) feed the encoded fusion features into a transformer-based decoder to predict the refined 3D bounding boxes and confidence scores.

To sum up, this work makes the following contributions:

We propose a flexible and effective two-stage multi-modality 3D detector named FusionRCNN, which fuses images and point clouds in regions of interest and can boost existing one-stage detectors with minor changes.

We use a novel transformer-based mechanism to achieve attentive fusion between the pixel set and the point set, which is robust to calibration noise.

Our method outperforms existing two-stage approaches on KITTI and the Waymo Open Dataset, especially on difficult samples with sparse points.

II Related Works

LiDAR-Based 3D Detection: Existing LiDAR-based 3D detection methods can be broadly grouped into three categories: voxel-based, point-based, and range-view. Voxel-based detectors voxelize the unstructured point clouds into a regular 2D/3D grid to which conventional CNNs can be easily applied. The pioneering work MV3D projects the point clouds onto 2D bird's-eye-view grids and places many predefined 3D anchors to generate highly accurate 3D candidate boxes, motivating subsequent efficient bird's-eye-view representation methods. VoxelNet applies a mini PointNet for voxel feature extraction. SECOND introduces 3D sparse convolution to accelerate 3D voxel processing. Among point-based methods, PointNet and its variants directly take the raw points as input and use symmetric operators to handle the unordered nature of point clouds. PointRCNN and STD segment foreground points with PointNet and generate proposals. 3DSSD proposes a new sampling strategy for efficient computation. Range-view detectors represent LiDAR point clouds as dense range images, where pixels contain additional accurate depth information. Compared to other methods, voxel-based detectors balance efficiency and performance, so we choose a voxel-based detector as the RPN in this paper.

LiDAR-Camera 3D Detection: Recently, LiDAR-Camera 3D detection has received increasing attention because the two types of sensors are complementary. LiDARs provide sparse point clouds containing accurate depth information, while cameras provide high-resolution images containing rich color and texture. MV3D creates 3D object proposals from LiDAR BEV features and projects the proposals onto multi-view images to extract RoI features. F-PointNet lifts image proposals into 3D frustums and achieves high performance. Point-level fusion methods decorate raw foreground LiDAR points and apply a common LiDAR-based detector to the decorated point clouds. Among these methods, PointPainting, PointAugmenting, MVP, FusionPainting and AutoAlign, which have achieved great success, perform input-level decoration, while DeepFusion and Deep Continuous Fusion perform feature-level decoration. The recent works TransFusion and FUTR3D initialize object queries in 3D space and fuse image features on the proposals. To our knowledge, few works focus on two-stage fusion networks. In this paper we propose a novel framework that can be applied as a plug-and-play RCNN module to existing detectors and boosts their performance significantly.

III Method

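Given the 3D bounding boxes $\bm{B}$ produced by a one-stage detector, the point clouds $\bm{P}$ and the camera images $\bm{I}$, the second stage refines the proposals as

$$(\bm{B}_{r}, \bm{S}_{r}) = \mathcal{R}(\bm{B}, \bm{P}, \bm{I}),$$

where $\bm{B}_{r}$ and $\bm{S}_{r}$ are the corrected bounding boxes and confidence scores, and $\mathcal{R}$ represents the proposed network.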

Fig. 2 shows the overall architecture of the proposed FusionRCNN. We adopt the RoI Feature Extractor (Sec. III-A) to extract the RoI features from points and images corresponding to $\bm{B}$, then fuse the features of these two modalities through the Fusion Encoder (Sec. III-B). The encoded fusion features are further fed into the Decoder (Sec. III-C) to predict the refined 3D bounding boxes and confidence scores.

III-A RoI Feature Extractor

Given 3D bounding boxes $\bm{B}$, point clouds $\bm{P}$ and camera images $\bm{I}$, we capture sufficient structure and context information by fixing the center of each bounding box $\bm{b}_{i}$ while expanding its length, width and height by a ratio $k$, and feed the scaled RoI to the feature extractor. We adopt a two-branch architecture, where the point and image RoI features are extracted from the point clouds $\bm{P}$ and the images $\bm{I}$ separately.
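The RoI expansion can be illustrated with a minimal sketch. The 7-value box layout `(x, y, z, l, w, h, yaw)` and the function name `expand_roi` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def expand_roi(box, k):
    """Scale the size of a 3D box by ratio k while keeping its center fixed.

    box: (x, y, z, l, w, h, yaw); this layout is an assumption.
    """
    box = np.asarray(box, dtype=float).copy()
    box[3:6] *= k  # length, width, height; center and heading stay untouched
    return box

# Hypothetical proposal, expanded with k = 2.
proposal = np.array([10.0, 2.0, -1.0, 4.0, 1.8, 1.5, 0.3])
scaled = expand_roi(proposal, k=2.0)
```

Because only the three size entries are scaled, the enlarged RoI stays centered on the proposal and gathers surrounding context symmetrically.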

For the point branch, points within the expanded box $\bm{b}_{i}$ are sampled or padded to a unified number $N$. Inspired by existing point embedding methods, we enhance the point features by concatenating each point's distance to the eight corners and the center of $\bm{b}_{i}$.
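A minimal sketch of the point branch input follows. It assumes that "distance" means the 3D offset vector from each point to each corner and to the center (scalar distances would be an equally plausible reading), that padding uses zeros, and that boxes use the hypothetical layout `(x, y, z, l, w, h, yaw)` with rotation about the z axis. All function names are illustrative.

```python
import numpy as np

def box_corners(box):
    """Return the (8, 3) corners of a box (x, y, z, l, w, h, yaw)."""
    x, y, z, l, w, h, yaw = box
    # One row of signs per corner, expressed in the box frame.
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    local = signs * np.array([l, w, h]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # about z
    return local @ rot.T + np.array([x, y, z])

def sample_or_pad(points, n, rng):
    """Randomly subsample to n points, or pad with zero rows up to n."""
    if len(points) >= n:
        idx = rng.choice(len(points), size=n, replace=False)
        return points[idx]
    pad = np.zeros((n - len(points), points.shape[1]))
    return np.concatenate([points, pad], axis=0)

def point_embedding_input(points, box):
    """Concatenate each point's offsets to the 8 corners and to the center."""
    corners = box_corners(box)                        # (8, 3)
    to_corners = points[:, None, :] - corners[None]   # (N, 8, 3)
    to_center = points - np.asarray(box[:3])          # (N, 3)
    flat = to_corners.reshape(len(points), -1)        # (N, 24)
    return np.concatenate([flat, to_center], axis=1)  # (N, 27)

rng = np.random.default_rng(0)
box = np.array([0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.3])
raw = rng.uniform(-1.0, 1.0, size=(5, 3))   # only 5 points fall in the RoI
points = sample_or_pad(raw, 8, rng)         # unified number N = 8
features = point_embedding_input(points, box)
```

In this sketch the padded rows are ordinary zero coordinates, so a fuller implementation would probably mask them before attention; that detail is not specified in the text.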

III-B Fusion Encoder

With the RoI Feature Extractor above, we obtain the per-point features and the per-pixel image features inside the RoI (the pixel size varies, since we fix an $S\times S$ pooling size while the projected proposal sizes differ). Previous methods fuse features by painting image features onto points; they rely on the direct correspondence between points and image pixels but neglect the fact that a local region of pixels can contribute to one point and vice versa. Instead, we leverage self-attention and cross-attention to achieve Set-to-Set fusion. Specifically, to align point and image features with each other and to better model their inner relationships, we first feed each of them into its own multi-head self-attention layer.

The embedded point features $\bm{F}^{P}$ are passed through a multi-head self-attention layer.

Correspondingly, the image features are fed into another multi-head self-attention layer to enhance their context information.

Then, we fuse the information of the two domains at the feature level through cross-attention, which yields the fused features $\bm{F}^{PI}_{cross}$.

Note that the cross-attention is optional: the point and image branches can work independently, which increases the flexibility of our model and allows us to train the branches in a decoupled manner.

Finally, $\bm{F}^{PI}_{cross}$ is fed into a feed-forward network (FFN) with two linear layers.

In the encoding layer, we adopt a novel fusion strategy to promote the complementarity of the two modalities. The rich semantic information of the image is integrated into the point features. Correspondingly, the object structure information extracted by the point branch can guide the aggregation of image features, reducing the impact of occlusion and similar situations. In our fusion encoder, we stack several encoding layers to ensure thorough feature fusion.
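The structure of one encoding layer can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: attention is single-head, residual connections and normalization are omitted, the point features act as queries over the image features in the cross-attention, and all weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (multi-head omitted)."""
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (len(q_in), len(kv_in))
    return weights @ v, weights

def encoder_layer(f_p, f_i, params, use_cross=True):
    """Self-attention per modality, optional cross-attention, then an FFN."""
    f_p, _ = attention(f_p, f_p, *params["self_p"])  # intra-modality, points
    f_i, _ = attention(f_i, f_i, *params["self_i"])  # intra-modality, image
    if use_cross:
        # Assumption: points are the queries, image pixels the keys/values.
        f_p, _ = attention(f_p, f_i, *params["cross"])
    w1, w2 = params["ffn"]
    return np.maximum(f_p @ w1, 0.0) @ w2, f_i       # FFN with two linear layers

rng = np.random.default_rng(0)
d, n_points, s = 16, 256, 7

def make_layer_params():
    def proj():
        return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    ffn = [rng.normal(scale=d ** -0.5, size=(d, 4 * d)),
           rng.normal(scale=(4 * d) ** -0.5, size=(4 * d, d))]
    return {"self_p": proj(), "self_i": proj(), "cross": proj(), "ffn": ffn}

f_p = rng.normal(size=(n_points, d))  # per-point RoI features
f_i = rng.normal(size=(s * s, d))     # per-pixel RoI features (S x S pooled)
for params in [make_layer_params() for _ in range(3)]:  # stacked layers
    f_p, f_i = encoder_layer(f_p, f_i, params)
```

Each point attends over all $S\times S$ pixels, so no hard point-to-pixel association is required; setting `use_cross=False` reduces the layer to a LiDAR-only path.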

III-C Decoder

The encoded fusion features are fed into the decoding layers to obtain the features of the final box. We initialize a learnable query embedding $\bm{E}$ with $d$ channels as the query, while the fusion features $\bm{F}^{PI}$ output by the fusion encoding layers serve as keys and values. The decoder module is likewise composed of several decoding layers.
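The decoding step can be sketched as a single query attending over the encoded features. This is a simplified single-head illustration with random placeholder weights; the two linear prediction heads (`w_box`, `w_conf`) and the 7-parameter box residual are assumptions, not details given in the text.

```python
import numpy as np

def decoding_layer(query, f_pi, w_q, w_k, w_v):
    """Attend from the query embedding to the encoded fusion features."""
    q, k, v = query @ w_q, f_pi @ w_k, f_pi @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d, n = 16, 256
E = rng.normal(size=(1, d))      # query embedding (learnable in practice)
f_pi = rng.normal(size=(n, d))   # encoded fusion features: keys and values
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
box_feature, weights = decoding_layer(E, f_pi, w_q, w_k, w_v)

# Hypothetical heads mapping the box feature to refinement outputs.
w_box = rng.normal(scale=d ** -0.5, size=(d, 7))
w_conf = rng.normal(scale=d ** -0.5, size=(d, 1))
box_residual = box_feature @ w_box
confidence = 1.0 / (1.0 + np.exp(-(box_feature @ w_conf)))
```

The single query aggregates all encoded features of one RoI into one $d$-channel box feature, from which the box and score are predicted.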

III-D Objectives

We train our model end to end. The overall loss is the sum of the RPN loss and the second-stage loss. The RPN loss follows the original network (SECOND), and the newly introduced second-stage loss consists of a confidence loss $L_{conf}$ and a regression loss $L_{reg}$.

We employ the binary cross-entropy loss as $L_{conf}$ to guide the prediction of positive and negative samples.

Samples are divided into positives and negatives by comparing their IoU against a threshold $t$.

For positive samples, the regression loss $L_{reg}$ is the smooth L1 loss over all bounding-box parameters, computed between the predicted parameters $\hat{p}$ and the parameters $p$ of the aligned ground-truth boxes.
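The second-stage objective described above can be sketched as follows. Several details are assumptions: the two terms are summed without weights, a sample is positive when its IoU strictly exceeds $t$, the smooth L1 transition point `beta` is 1, and the threshold value used in the example is hypothetical.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    a = np.abs(x)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def second_stage_loss(conf, iou, box_pred, box_gt, t):
    """Binary cross-entropy on confidence plus smooth L1 on positive boxes."""
    conf = np.clip(conf, 1e-7, 1.0 - 1e-7)
    target = (iou > t).astype(float)  # 1 for positive, 0 for negative samples
    l_conf = -(target * np.log(conf)
               + (1.0 - target) * np.log(1.0 - conf)).mean()
    pos = target > 0
    if pos.any():
        # Sum over box parameters, average over positive samples.
        l_reg = smooth_l1(box_pred[pos] - box_gt[pos]).sum(axis=1).mean()
    else:
        l_reg = 0.0
    return l_conf + l_reg, l_conf, l_reg

conf = np.array([0.9, 0.2, 0.7])   # predicted confidence scores
iou = np.array([0.8, 0.1, 0.6])    # IoU of each proposal with its ground truth
box_pred = np.zeros((3, 7))
box_gt = np.zeros((3, 7))
box_gt[0, 0] = 2.0
box_gt[2, 1] = 0.5
total, l_conf, l_reg = second_stage_loss(conf, iou, box_pred, box_gt, t=0.55)
```

Only the first and third proposals are positive here, so the second contributes to the confidence term alone.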

IV Experiments

We evaluate FusionRCNN on both KITTI and the Waymo Open Dataset, and conduct extensive ablation studies to validate our design choices.

Model setup. We implement our network with the open-source OpenPCDet toolbox. We employ SECOND as the RPN and follow the settings in OpenPCDet. For the RoI head, we adopt a ResNet50 pretrained on ImageNet as the image backbone and keep its weights frozen during training to save time; the highest-resolution output of the FPN is selected as the feature map. For each RoI, the expansion ratio $k$ is 2, we sample 256 points, and the corresponding projected image region is converted to a 7 $\times$ 7 resolution by RoIPooling. In addition, the number of encoding layers is set to 3 and the number of decoding layers to 1 to balance performance and efficiency.
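To illustrate how image regions of different sizes map to a fixed $S\times S$ grid, the following is a minimal max-pooling sketch. It is not the RoIPooling operator used in the implementation; the function name, the bin rounding rule and the RoI format `(x0, y0, x1, y1)` in feature-map pixels are assumptions.

```python
import numpy as np

def roi_max_pool(feature_map, roi, s=7):
    """Max-pool a rectangular region of a (C, H, W) map into (C, s, s).

    roi: (x0, y0, x1, y1) in feature-map pixels, with x1 and y1 exclusive
    and the region lying inside the map.
    """
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, s + 1)
    ys = np.linspace(y0, y1, s + 1)
    out = np.empty((feature_map.shape[0], s, s), dtype=feature_map.dtype)
    for i in range(s):
        for j in range(s):
            # Every bin spans at least one pixel, so small RoIs still fill the grid.
            r0 = int(np.floor(ys[i]))
            r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

fmap = np.arange(2 * 14 * 21, dtype=float).reshape(2, 14, 21)
pooled = roi_max_pool(fmap, (0, 0, 21, 14))        # large projected RoI
small = roi_max_pool(fmap, (3, 3, 5, 5))           # small projected RoI
pixel_set = pooled.reshape(pooled.shape[0], -1).T  # (49, C) set of pixels
```

Both the large and the small region yield a 7 $\times$ 7 grid, which is why the effective pixel size varies from one RoI to another.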

Training details. The network is trained end to end on 8 Tesla V100 GPUs. On the Waymo Open Dataset, we use the Adam optimizer with the cycle decay strategy and a learning rate of 0.0008. Following CT3D, we train the model for 80 epochs. On KITTI, we apply the same training strategy and train for 100 epochs with a learning rate of 0.003. Moreover, we design several kinds of data augmentation, i.e., flip, rotation and scaling, that support both images and point clouds.

IV-B Results on Waymo

Data and metrics. The Waymo Open Dataset is a large-scale public outdoor dataset for autonomous driving research, which contains RGB images from five high-resolution cameras and 3D point clouds from five LiDAR sensors. The whole dataset consists of 798 scenes (20 s fragments) for training, 202 scenes for validation and 150 scenes for testing. Results are also reported by the distance from the 3D objects to the sensor, i.e., 0-30m, 30-50m and >50m. The metrics are further divided into two difficulty levels: LEVEL_1 for 3D boxes with more than 5 LiDAR points and LEVEL_2 for boxes with at least 1 LiDAR point. Notably, the cameras in Waymo cover only around 250 degrees horizontally rather than the full 360 degrees; our framework can adapt to this situation. All models are trained on 20% of the Waymo dataset.

Main results. We first evaluate the performance of FusionRCNN on the large public Waymo Open Dataset. Tab. I reports the results of vehicle detection with 3D and BEV AP on the validation sequences. Note that with the strong SECOND baseline, FusionRCNN outperforms all previous methods on both LEVEL_1 and LEVEL_2, leading PV-RCNN by 8.61% mAP and Voxel-RCNN by 3.32% mAP on LEVEL_1. FusionRCNN achieves 78.91% on the commonly used LEVEL_1 3D mAP metric, surpassing the previous state-of-the-art method CT3D by a significant margin (2.61% mAP). We ascribe this performance gain to our novel two-stage deep fusion design, which effectively integrates geometry information from LiDAR and dense texture information from the camera and thereby helps refine bounding box parameters and confidence scores accurately.

Additionally, we show multi-class detection results for Vehicle, Pedestrian, and Cyclist in Tab. II. After adopting FusionRCNN, the baseline models SECOND and CenterPoint improve significantly on small objects, i.e., by 10.55% mAP on Cyclist for SECOND and by 6.43% on Pedestrian for CenterPoint. Tab. III shows that our method surpasses other single-frame methods under a stricter evaluation standard (IoU threshold of 0.8), which suggests that our method localizes objects accurately by exploiting rich structure and texture information.

Visualization. Experiments on Waymo show that our method performs very well in long-range detection. Since CT3D uses the same one-stage detector as its RPN, we show a qualitative comparison between FusionRCNN and CT3D, which uses only point clouds in the refinement stage. The comparison is shown in Fig. 3.

IV-C Results on KITTI

Data and metrics. The KITTI dataset has been widely used in 3D detection tasks since its release. It contains multiple types of sensors, such as stereo cameras and a 64-beam Velodyne LiDAR. There are 7,481 training samples, commonly divided into 3,712 samples for training and 3,769 samples for validation, and 7,518 samples for testing. We conduct experiments on the commonly used car category, whose detection IoU threshold is 0.7. We report results for three difficulty levels (easy, moderate and hard), defined according to object size, occlusion state and truncation level.

Main results. To further verify our framework, we conduct experiments on the KITTI validation set and compare with previous state-of-the-art methods. Tab. IV shows that our method improves the one-stage method SECOND on all three difficulty levels by a significant margin (+1.29% for Easy, +7.02% for Moderate and +2.1% for Hard) and is highly competitive with all LiDAR-based and LiDAR-Camera methods. FusionRCNN outperforms the two-stage fusion competitor PI-RCNN, with a 7.11% improvement on Moderate mAP. Furthermore, we compare FusionRCNN with the released methods PV-RCNN and CT3D, since they share the same RPN. FusionRCNN performs better than PV-RCNN on all difficulty levels. Compared with the state-of-the-art method CT3D, our method performs better overall, leading CT3D by 0.36% on the Easy level and 0.33% on the Hard level with a comparable result on Moderate. Notably, FusionRCNN achieves an AP of 79.32% on the Hard level, outperforming state-of-the-art 3D detectors. Compared with point-based two-stage methods, our novel two-stage fusion framework captures structural and contextual information more effectively.

IV-D Ablation Studies

Effect of LiDAR-Camera fusion. We investigate the effect of introducing texture information from camera images. We switch FusionRCNN to a LiDAR-only variant named FusionRCNN-L by disabling the image branch in the RoI Feature Extractor and the cross-attention module in the Fusion Encoder, and then run inference with the same settings. As shown in Tab. V, FusionRCNN-L achieves 90.25% mAP in Vehicle BEV detection and surpasses most of the methods in Tab. I. With LiDAR-Camera fusion, FusionRCNN improves further, especially for long-range detection (50m-Inf).

Different RPN Backbones. We plug FusionRCNN into popular single-stage detectors, i.e., SECOND, PointPillars and CenterPoint, to verify its generality. Tab. VI shows that our method improves all three baseline models with significant boosts of +6.14%, +2.7% and +5.55% 3D mAP on LEVEL_1, respectively. These gains arise because our method uses a novel LiDAR-Camera fusion mechanism that leverages structure and semantic information from LiDAR and camera images.

RoI Feature Extractor. Our RoI feature extractor contains a point branch and an image branch. Previous works have shown that raw points carry more accurate structure information, which benefits the extraction of local contextual information for bounding boxes. We therefore mainly conduct an ablation study on the image branch, where some parameters may affect image feature extraction and, in turn, detection performance. We test different output sizes $S$ of the RoI image features in Tab. VII and find that these settings have little impact on the image extraction branch. One possible explanation is that LiDAR and image features are fused dynamically in our fusion encoding layer, and the image features contribute to category classification with high-level contextual information.

V Conclusion

In this work, we propose a novel two-stage multi-modality 3D detector named FusionRCNN, which successfully integrates LiDAR point cloud and camera image information in the regions of interest. FusionRCNN leverages a well-designed attention mechanism to achieve Set-to-Set fusion, and thus becomes more robust to LiDAR-Camera calibration noise. We show that FusionRCNN outperforms state-of-the-art two-stage 3D detectors on both the Waymo Open Dataset and the KITTI dataset. As a plug-and-play module, it has considerable potential to boost existing one-stage 3D detectors.