3D Fully Convolutional Network for Vehicle Detection in Point Cloud

Bo Li

I INTRODUCTION

Understanding point cloud data is an essential task for many robotic applications. Compared to image-based detection, object detection in point clouds directly localizes the objects in 3D coordinates, which provides crucial information for subsequent tasks such as navigation or manipulation.

In this paper, we design a 3D fully convolutional network (FCN) to detect and localize objects as 3D boxes from point cloud data. The 2D FCN has achieved notable performance in image-based detection tasks. The proposed approach extends the FCN to 3D and is applied to 3D vehicle detection for an autonomous driving system, using a Velodyne 64E lidar. The approach can also be generalized to other object detection tasks on point clouds captured by Kinect, stereo, or monocular structure from motion.

II RELATED WORKS

A majority of 3D detection algorithms can be summarized as two stages, i.e. candidate proposal and classification. Candidates can be proposed by delicate segmentation algorithms, sliding windows, random sampling, or the recently popular Region Proposal Network (RPN). For the classification stage, research has focused on features including shape models and geometry statistic features. Sparse coding and deep learning are also used for feature representation.

Besides directly operating in the 3D point cloud space, some previous detection algorithms project the 3D point cloud onto a 2D surface as depth maps or range scans. The projection inevitably loses or distorts useful 3D spatial information, but it can benefit from well-developed image-based 2D detection algorithms.

II-B Convolutional Neural Network and 3D Object Detection

CNN-based 3D object detection has recently drawn growing attention in computer vision and robotics. Some previous works embed 3D information in a 2D projection and use a 2D CNN for recognition or detection. Others suggest that 3D object localization can be predicted by a 2D CNN on range scans. Another line of work operates on 3D voxel data but treats one dimension as a channel in order to apply a 2D CNN. Only a few earlier works use a 3D CNN: some focus on object recognition, and another proposes 3D R-CNN techniques for indoor object detection that combine the Kinect image and point cloud.

In this paper, we transplant the fully convolutional network (FCN) to 3D to detect and localize objects as 3D boxes in point clouds. The FCN is a recently popular framework for end-to-end object detection, with top performance in tasks including ImageNet, KITTI, ICDAR, etc. Variations of the FCN include DenseBox, YOLO and SSD. The approach proposed in this paper is inspired by the basic idea of DenseBox.

III APPROACH

The procedure of FCN-based detection frameworks can be summarized as two tasks, i.e. objectness prediction and bounding box prediction. As illustrated in Figure 1, an FCN is formed with two output maps corresponding to the two tasks respectively. The objectness map predicts whether a region belongs to an object, and the bounding box map predicts the coordinates of the object bounding box. Following the notation of previous work, denote $\mathbf{o}^{a}_{\mathbf{p}}$ as the output at region $\mathbf{p}$ of the objectness map, which can be encoded by softmax or hinge loss. Denote $\mathbf{o}^{b}_{\mathbf{p}}$ as the output of the bounding box map, which is encoded by the coordinate offsets of the bounding box.

Denote the groundtruth bounding box coordinate offsets at region $\mathbf{p}$ as $\mathbf{b}_{\mathbf{p}}$. In this paper we assume only one bounding box map is produced, though a more sophisticated network can predict multiple bounding box offsets for one class, corresponding to multiple scales or aspect ratios. Each bounding box loss is denoted as

$$\mathcal{L}_{box}(\mathbf{p}) = \left\|\mathbf{o}^{b}_{\mathbf{p}} - \mathbf{b}_{\mathbf{p}}\right\|^{2}.$$

The overall loss of the network is thus denoted as

$$\mathcal{L} = \sum_{\mathbf{p}\in\mathcal{P}} \mathcal{L}_{obj}(\mathbf{p}) + w \sum_{\mathbf{p}\in\mathcal{V}} \mathcal{L}_{box}(\mathbf{p}),$$

where $\mathcal{L}_{obj}(\mathbf{p})$ is the objectness loss at region $\mathbf{p}$ and $w$ is used to balance the objectness loss and the bounding box loss. $\mathcal{P}$ denotes all regions in the objectness map and $\mathcal{V}\subset\mathcal{P}$ denotes all object regions. In the deployment phase, the regions with positive objectness prediction are selected. Then the bounding box predictions corresponding to these regions are collected and clustered as the detection results.
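The combined loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the objectness loss is taken to be a two-class softmax cross-entropy (the text allows softmax or hinge), the bounding box loss is a squared Euclidean distance, and the function names and the dictionary layout keyed by region are hypothetical.

```python
import math


def objectness_loss(logits, label):
    """Softmax cross-entropy for one region; logits = (background, object)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def box_loss(pred_offsets, gt_offsets):
    """Squared Euclidean distance between predicted and groundtruth offsets."""
    return sum((p - g) ** 2 for p, g in zip(pred_offsets, gt_offsets))


def total_loss(obj_logits, labels, box_preds, box_targets, w=1.0):
    """Objectness loss over all regions P plus w times box loss over V.

    obj_logits:  {region: (background_logit, object_logit)}, keys form P
    labels:      {region: 0 or 1}
    box_preds:   {region: flat list of predicted offsets}
    box_targets: {region: flat list of groundtruth offsets}, keys form V
    """
    l_obj = sum(objectness_loss(obj_logits[p], labels[p]) for p in obj_logits)
    l_box = sum(box_loss(box_preds[p], box_targets[p]) for p in box_targets)
    return l_obj + w * l_box
```

Only regions that appear in `box_targets` contribute to the box term, which mirrors the restriction of the bounding box loss to object regions.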

III-B 3D FCN Detection Network for Point Cloud

Although a variety of discretization embeddings have been introduced for high-dimensional convolution, for simplicity we discretize the point cloud on square grids. The discretized data can be represented by a 4D array with dimensions of length, width, height and channels. In the simplest case, a single channel with value in $\{0,1\}$ indicates whether any points are observed in the corresponding grid element. More sophisticated features have also been introduced in previous works.
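The simplest single-channel discretization can be sketched as below. The grid origin, voxel size, grid shape, and the function name `voxelize` are illustrative assumptions; the text does not specify them.

```python
import math


def voxelize(points, origin, voxel_size, grid_shape):
    """Build a length x width x height x 1 occupancy grid of 0/1 values.

    points:     iterable of (x, y, z) coordinates
    origin:     (x, y, z) of the grid corner with the smallest coordinates
    voxel_size: edge length of one cubic grid element
    grid_shape: (nx, ny, nz) number of grid elements per axis
    """
    nx, ny, nz = grid_shape
    grid = [[[[0] for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]
    for x, y, z in points:
        i = math.floor((x - origin[0]) / voxel_size)
        j = math.floor((y - origin[1]) / voxel_size)
        k = math.floor((z - origin[2]) / voxel_size)
        # Points outside the grid are dropped.
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k][0] = 1  # at least one point observed here
    return grid
```

Several points falling into the same grid element still yield a value of 1, since the channel only records occupancy.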

The mechanism of the 2D CNN naturally extends to 3D on the square grids. Figure 2 shows an example of the network structure used in this paper. The network follows and simplifies the hourglass shape used in previous work. Layers conv1, conv2 and conv3 downsample the input map by $1/2^{3}$ sequentially. Layers deconv4a and deconv4b upsample the incoming map by $2^{3}$ respectively. A ReLU activation is applied after each layer. The output objectness map ($\mathbf{o}^{a}$) and bounding box map ($\mathbf{o}^{b}$) are collected from the deconv4a and deconv4b layers respectively.
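The effect of the resampling layers on the map size can be traced with a few lines of arithmetic. The sketch below reads the factors $1/2^{3}$ and $2^{3}$ as per-layer changes in voxel count, i.e. a factor of 2 along each of the three spatial axes. This reading, the example input size, the padding behaviour, and the helper names are assumptions rather than details confirmed by the text.

```python
def downsample(shape, stride=2):
    """Spatial shape after one strided convolution (ceil division per axis)."""
    return tuple(-(-s // stride) for s in shape)


def upsample(shape, factor=2):
    """Spatial shape after one deconvolution that enlarges every axis."""
    return tuple(s * factor for s in shape)


input_shape = (64, 64, 16)  # hypothetical length x width x height in voxels

shape = input_shape
for _ in ("conv1", "conv2", "conv3"):
    shape = downsample(shape)  # each layer halves every spatial axis
after_convs = shape

# deconv4a and deconv4b both take the conv3 output as input.
after_deconv = upsample(after_convs)
```

Under these assumptions both output maps share the same spatial size, which is coarser than the input grid.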

Similar to DenseBox, the objectness region $\mathcal{V}$ is defined as the center region of the object. For the proposed 3D case, a 3D sphere located at the object center is used. Points inside the sphere are labeled as positive (foreground). The bounding box prediction at point $\mathbf{p}$ is encoded by the coordinate offsets, defined as:

$$\mathbf{o}^{b}_{\mathbf{p}} = \left(\mathbf{c}_{\mathbf{p},1}^{\top} - \mathbf{p}^{\top}, \ldots, \mathbf{c}_{\mathbf{p},8}^{\top} - \mathbf{p}^{\top}\right)^{\top}$$

where $\mathbf{c}_{\mathbf{p},\star}$ denote the 3D coordinates of the 8 corners of the object bounding box corresponding to the region $\mathbf{p}$.
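A minimal sketch of the corner-offset encoding and the center-sphere labeling follows. It assumes a box parameterized by center, size (length, width, height) and a yaw rotation about a vertical z axis; this parameterization, the sphere radius, and all function names are illustrative assumptions.

```python
import math


def box_corners(center, size, yaw):
    """Return the 8 corners of a box rotated by yaw about the z axis."""
    cx, cy, cz = center
    length, width, height = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                corners.append((cx + c * dx - s * dy,
                                cy + s * dx + c * dy,
                                cz + dz))
    return corners


def encode_offsets(p, corners):
    """Flatten the corner offsets c_i - p into a 24-dimensional vector."""
    return [ci - pi for corner in corners for ci, pi in zip(corner, p)]


def in_center_sphere(p, center, radius):
    """True if p lies inside the positive sphere around the object center."""
    return math.dist(p, center) <= radius
```

For example, `encode_offsets(p, box_corners(center, size, yaw))` gives the regression target for a region `p` for which `in_center_sphere(p, center, radius)` holds.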

The training and testing processes of the 3D CNN follow those of previous work. In the testing phase, candidate bounding boxes are extracted from regions predicted as objects, and each candidate is scored by counting its neighbors among all candidate bounding boxes. Bounding boxes are selected in descending order of score, and candidates overlapping with already selected boxes are suppressed.
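The scoring and suppression step can be sketched as a greedy procedure. To keep the example short, each candidate box is reduced to its center, and both the neighbor test and the overlap test are replaced by distance thresholds. This simplification, the threshold values, and the function name are assumptions; a full implementation would compare the boxes themselves.

```python
import math


def cluster_detections(centers, neighbor_radius=0.5, suppress_radius=2.0):
    """Score candidates by neighbor count, then greedily keep the best ones.

    centers: list of (x, y, z) candidate box centers
    Returns a list of (center, score) for the kept candidates.
    """
    # Score: number of other candidates within neighbor_radius.
    scores = [
        sum(1 for j, other in enumerate(centers)
            if j != i and math.dist(c, other) <= neighbor_radius)
        for i, c in enumerate(centers)
    ]
    # Visit candidates from the highest score to the lowest.
    order = sorted(range(len(centers)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        # Suppress a candidate that is too close to an already kept one.
        if all(math.dist(centers[i], centers[k]) > suppress_radius
               for k in kept):
            kept.append(i)
    return [(centers[i], scores[i]) for i in kept]
```

Candidates that many regions agree on receive high scores, so a tight cluster of predictions collapses to a single detection.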

Figure 3 shows an example of the intermediate detection results. Bounding box predictions from objectness points are plotted as green boxes. Note that for severely occluded vehicles, the bounding box shapes are distorted and the boxes are not clustered. This is mainly due to the lack of similar samples in the training phase.

III-C Comparison with 2D CNN

Compared to a 2D CNN, the additional dimension of a 3D CNN inevitably consumes more computational resources, mainly due to 1) the memory cost of the 3D data embedding grids and 2) the increased computation cost of convolving 3D kernels.
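A rough count makes both costs concrete. The sketch below assumes a dense, stride-1 convolution that preserves the spatial size; the layer sizes and the function name are hypothetical and chosen only for illustration.

```python
from math import prod


def conv_cost(spatial_shape, kernel, c_in, c_out):
    """Return (grid cells, multiply-accumulates) for one dense conv layer.

    Assumes stride 1 and an output with the same spatial shape as the input.
    """
    cells = prod(spatial_shape)
    macs = cells * kernel ** len(spatial_shape) * c_in * c_out
    return cells, macs


cells_2d, macs_2d = conv_cost((64, 64), kernel=3, c_in=1, c_out=8)
cells_3d, macs_3d = conv_cost((64, 64, 16), kernel=3, c_in=1, c_out=8)
```

In this example the number of grid cells grows by the size of the added axis (16), and each output additionally needs a kernel that is 3 times larger, so the multiply-accumulate count grows by a factor of 48.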

On the other hand, naturally embedding objects in 3D space avoids the perspective distortion and scale variation of the 2D case. This makes it possible to learn detection using a relatively simple network structure.

IV EXPERIMENTS

We evaluate the proposed 3D CNN on the vehicle detection task from the KITTI benchmark. The task contains images aligned with point clouds, and objects are labeled with both 3D and 2D bounding boxes.

The experiments mainly focus on detection of the Car category for simplicity. Regions within the 3D center sphere of a Car are labeled as positive samples, i.e. in $\mathcal{V}$. Van and Truck are labeled to be ignored. Pedestrian, Bicycle and the rest of the environment are labeled as negative background, i.e. $\mathcal{P}-\mathcal{V}$.

The KITTI training dataset contains 7500+ frames of data, of which 6000 frames are randomly selected for training in the experiments. The remaining 1500 frames are used for offline validation, which evaluates each detected bounding box by its overlap with the groundtruth on the image plane and on the ground plane. The detection results are also compared on the KITTI online evaluation, where only the image space overlap is evaluated.

The KITTI benchmark divides object samples into three difficulty levels. Though this division was originally designed for image-based detection, we find that these difficulty levels can also be used approximately for detection and evaluation in 3D. The minimum height of 40px for the easy level approximately corresponds to objects within 28m, and the minimum height of 25px for the moderate and hard levels approximately corresponds to objects within 47m.

The original KITTI benchmark assumes that detections are presented as 2D bounding boxes on the image plane. The overlap area of the image plane bounding box with its groundtruth is then measured to evaluate the detection. However, from the perspective of building a complete autonomous driving system, evaluation in the 2D image space does not reflect well the needs of subsequent modules such as planning and control, which usually operate in world space, e.g. in the full 3D space or on the ground plane. Therefore, in the offline evaluation, we validate the proposed approach in both the image space and the world space, using the following metrics:

Bounding box overlap on the image plane. This is the original metric of the KITTI benchmark. The 3D bounding box detection is projected back to the image plane, and the minimum enclosing rectangle of the projection is taken as the 2D bounding box. Some previous point cloud based detection methods also use this metric for evaluation. A detection is accepted if its IoU overlap with the groundtruth is larger than 0.7.

Bounding box overlap on the ground plane. The 3D bounding box detection is projected onto the 2D ground plane orthogonally. A detection is accepted if the overlap area IoU with the groundtruth is larger than 0.7. This metric reflects the demand of the autonomous driving system naturally, in which the vertical localization of the vehicle is less important than the horizontal.
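Both metrics reduce to an intersection-over-union (IoU) test between two 2D regions. The sketch below covers the axis-aligned case: it takes the minimum enclosing rectangle of a set of projected corner points and applies the 0.7 acceptance threshold. Treating ground-plane boxes as axis-aligned is a simplification, since projected vehicle boxes are generally rotated, and the function names are hypothetical.

```python
def rect_hull(points_2d):
    """Minimum axis-aligned rectangle (xmin, ymin, xmax, ymax) of 2D points."""
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    return (min(xs), min(ys), max(xs), max(ys))


def iou(a, b):
    """Intersection over union of two axis-aligned rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_accepted(detection, groundtruth, threshold=0.7):
    """A detection is accepted if its IoU with the groundtruth exceeds 0.7."""
    return iou(detection, groundtruth) > threshold
```

For the image-plane metric, `rect_hull` would be applied to the eight projected corners of the detected 3D box before calling `is_accepted`.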

For the above metrics, the naive Average Precision (AP) and the Average Orientation Similarity (AOS) are both evaluated.

The performance of the proposed approach and of the compared previous method is listed in Table I. The proposed approach uses fewer layers and connections than the compared method but achieves much better detection accuracy. This is mainly because objects have less scale variation and occlusion in the 3D embedding. More detection results are visualized in Figure 4.

IV-B KITTI Online Evaluation

The proposed approach is also evaluated on the KITTI online system. Note that on the current KITTI object detection benchmark, image-based detection algorithms outperform previous point cloud based detection algorithms by a significant gap. This is due to two reasons: 1) The benchmark uses the metric of bounding box overlap on the image plane. Projecting 3D bounding boxes from the point cloud inevitably introduces misalignment with the 2D labeled bounding boxes. 2) Images have much higher resolution than the point cloud (range scan), which improves the detection of far or occluded objects.

The proposed approach is compared with previous point cloud based detection algorithms, and the results are listed in Table II. Our method outperforms previous methods by a significant gap of $>20\%$, and is even comparable with image-based algorithms, though not yet as good.

V CONCLUSIONS

Recent studies on deploying deep learning techniques on point clouds have shown the promising ability of 3D CNNs to interpret shape features. This paper attempts to push this research further. To the best of our knowledge, this paper proposes the first 3D FCN framework for end-to-end 3D object detection. The performance improvement of this method is significant compared to previous point cloud based detection approaches. While in this paper the framework is evaluated on point clouds collected by a Velodyne 64E under the scenario of autonomous driving, it naturally applies to point clouds created by other sensors or reconstruction algorithms.

ACKNOWLEDGMENT

The author would like to acknowledge the help from Xiaohui Li and Songze Li. Thanks also go to Ji Wan and Tian Xia.