Center-based 3D Object Detection and Tracking
Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl
Introduction
Strong 3D perception is a core ingredient in many state-of-the-art driving systems. Compared to the well-studied 2D detection problem, 3D detection on point-clouds offers a series of interesting challenges. First, point-clouds are sparse, and most regions of 3D space are without measurements. Second, the resulting output is a three-dimensional box that is often not well aligned with any global coordinate frame. Third, 3D objects come in a wide range of sizes, shapes, and aspect ratios; in the traffic domain, for example, bicycles are near-planar, buses and limousines are elongated, and pedestrians are tall. These marked differences between 2D and 3D detection have made a transfer of ideas between the two domains harder. The crux of it is that an axis-aligned 2D box is a poor proxy for a free-form 3D object. One solution might be to classify a different template (anchor) for each object orientation, but this unnecessarily increases the computational burden and may introduce a large number of potential false-positive detections. We argue that the main underlying challenge in linking up the 2D and 3D domains lies in this representation of objects.
In this paper, we show how representing objects as points (Figure 1) greatly simplifies 3D recognition. Our two-stage 3D detector, CenterPoint, finds the centers of objects and their properties using a keypoint detector; a second stage then refines all estimates. Specifically, CenterPoint uses a standard Lidar-based backbone network, i.e., VoxelNet or PointPillars, to build a representation of the input point-cloud. It then flattens this representation into an overhead map-view and uses a standard image-based keypoint detector to find object centers. For each detected center, it regresses to all other object properties, such as 3D size, orientation, and velocity, from a point-feature at the center location. Furthermore, we use a lightweight second stage to refine the object locations. This second stage extracts point-features at the 3D centers of each face of the estimated object's 3D bounding box. It recovers the local geometric information lost to striding and a limited receptive field, and brings a decent performance boost at minor cost.
The center-based representation has several key advantages. First, unlike bounding boxes, points have no intrinsic orientation. This dramatically reduces the object detector’s search space while allowing the backbone to learn the rotational invariance of objects and rotational equivariance of their relative rotation. Second, a center-based representation simplifies downstream tasks such as tracking. If objects are points, tracklets are paths in space and time. CenterPoint predicts the relative offset (velocity) of objects between consecutive frames, which are then linked up greedily. Third, point-based feature extraction enables us to design an effective two-stage refinement module that is much faster than previous approaches.
We test our models on two popular large datasets: the Waymo Open Dataset and the nuScenes dataset. We show that a simple switch from the box representation to a center-based representation yields a - mAP increase in 3D detection under different backbones. Two-stage refinement further brings an additional mAP boost with small () computation overhead. Our best single model achieves and level 2 mAPH for vehicle and pedestrian detection on Waymo, mAP and NDS on nuScenes, outperforming all published methods on both datasets. Notably, in the NeurIPS 2020 nuScenes 3D Detection challenge, CenterPoint was adopted in 3 of the top 4 winning entries. For 3D tracking, our model performs at AMOTA outperforming the prior state-of-the-art by AMOTA on nuScenes. On the Waymo 3D tracking benchmark, our model achieves and level 2 MOTA for vehicle and pedestrian tracking, respectively, surpassing previous methods by up to . Our end-to-end 3D detection and tracking system runs near real-time, with FPS on Waymo and FPS on nuScenes.
Related work
2D object detection predicts axis-aligned bounding boxes from image inputs. The RCNN family finds category-agnostic bounding box candidates, then classifies and refines them. YOLO, SSD, and RetinaNet directly find category-specific box candidates, sidestepping later classification and refinement. Center-based detectors, e.g., CenterNet or CenterTrack, directly detect the implicit object center point without the need for candidate boxes. Many 3D object detectors evolved from these 2D object detectors. We argue that a center-based representation is a better fit for 3D applications than axis-aligned boxes.
3D object detection aims to predict three-dimensional rotated bounding boxes. These detectors differ from 2D detectors in the input encoder. Vote3Deep leverages feature-centric voting to efficiently process the sparse 3D point-cloud on equally spaced 3D voxels. VoxelNet uses a PointNet inside each voxel to generate a unified feature representation from which a head with 3D sparse convolutions and 2D convolutions produces detections. SECOND simplifies VoxelNet and speeds up sparse 3D convolutions. PIXOR projects all points onto a 2D feature map with 3D occupancy and point intensity information to remove the expensive 3D convolutions. PointPillars replaces all voxel computation with a pillar representation, a single tall elongated voxel per map location, improving backbone efficiency. MVF and Pillar-od combine multiple view features to learn a more effective pillar representation. Our contribution focuses on the output representation; it is compatible with any 3D encoder and can improve them all.
VoteNet detects objects through vote clustering using point feature sampling and grouping. In contrast, we directly regress to 3D bounding boxes through features at the center point without voting. Wong et al. and Chen et al. use a similar multiple-point representation in the object center region (i.e., point-anchors) and regress to other attributes. We use a single positive cell for each object and use a keypoint estimation loss.
Two-stage 3D object detection. Recent works considered directly applying RCNN-style 2D detectors to the 3D domain. Most of them apply RoIPool or RoIAlign to aggregate RoI-specific features in 3D space, using a PointNet-based point or voxel feature extractor. These approaches extract region features from 3D Lidar measurements (points and voxels), resulting in a prohibitive run-time due to the massive number of points. Instead, we extract sparse features of 5 surface center points from the intermediate feature map. This makes our second stage very efficient while remaining effective.
3D object tracking. Many 2D tracking algorithms readily track 3D objects out of the box. However, dedicated 3D trackers based on 3D Kalman filters still have an edge as they better exploit the three-dimensional motion in a scene. Here, we adopt a much simpler approach following CenterTrack. We use a velocity estimate together with the point-based detection to track centers of objects through multiple frames. This tracker is much faster and more accurate than dedicated 3D trackers.
Preliminaries
At test time, the detector produces heatmaps and dense class-agnostic regression maps. Each local maximum (peak) in the heatmaps corresponds to an object, with confidence proportional to the heatmap value at the peak. For each detected object, the detector retrieves all regression values from the regression maps at the corresponding peak location. Depending on the application domain, Non-Maxima Suppression (NMS) may be warranted.
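The decoding step described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single heatmap channel, not the authors' implementation: the 3x3 neighborhood, the `threshold` value, and the function names `extract_peaks` and `decode` are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple


def extract_peaks(
    heatmap: List[List[float]], threshold: float = 0.1
) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for cells that are maxima of their 3x3 neighborhood.

    The 3x3 window and the score threshold are assumptions of this sketch.
    Equal-valued neighbors are all kept, so ties can yield duplicate peaks.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbors):
                peaks.append((r, c, score))
    # Highest-confidence detections first.
    return sorted(peaks, key=lambda p: -p[2])


def decode(
    heatmap: List[List[float]], regression: List[List[Any]], threshold: float = 0.1
) -> List[Dict[str, Any]]:
    """Read the regression values stored at each peak location."""
    return [
        {"cell": (r, c), "score": s, "box": regression[r][c]}
        for r, c, s in extract_peaks(heatmap, threshold)
    ]


heatmap = [
    [0.0, 0.2, 0.0, 0.0],
    [0.2, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.05],
]
# Dummy regression map: each cell simply stores its own index.
regression = [[(r, c) for c in range(4)] for r in range(4)]
detections = decode(heatmap, regression)
```

In a multi-class setting the same procedure would run once per heatmap channel, with the channel index giving the class label.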
3D Detection. Let be an orderless point-cloud of 3D location and reflectance measurements. 3D object detection aims to predict a set of 3D object bounding boxes in the bird's-eye view from this point-cloud. Each bounding box consists of a center location , relative to the object's ground plane, and 3D size , and rotation expressed by yaw . Without loss of generality, we use an egocentric coordinate system with sensor at and yaw.
With a map-view feature map , a detection head, most commonly a one- or two-stage bounding-box detector, then produces object detections from some predefined bounding boxes anchored on this overhead feature-map. As 3D bounding boxes come in various sizes and orientations, anchor-based 3D detectors have difficulty fitting an axis-aligned 2D box to a 3D object. Moreover, during training, previous anchor-based 3D detectors rely on 2D box IoU for target assignment, which creates an unnecessary burden of choosing positive/negative thresholds for different classes or different datasets. In the next section, we show how to build a principled 3D object detection and tracking model based on point representation. We introduce a novel center-based detection head but rely on existing 3D backbones (VoxelNet or PointPillars).
CenterPoint
Center heatmap head. The center-head’s goal is to produce a heatmap peak at the center location of any detected object. This head produces a -channel heatmap , one channel for each of classes. During training, it targets a 2D Gaussian produced by the projection of 3D centers of annotated bounding boxes into the map-view. We use a focal loss. Objects in a top-down map view are sparser than in an image. In map-view, distances are absolute, while an image-view distorts them by perspective. Consider a road scene: in map-view the area occupied by vehicles is small, but in image-view, a few large objects may occupy most of the screen. Furthermore, the compression of the depth-dimension in perspective projection naturally places object centers much closer to each other in image-view. Following the standard supervision of CenterNet results in a very sparse supervisory signal, where most locations are considered background. To counteract this, we increase the positive supervision for the target heatmap by enlarging the Gaussian peak rendered at each ground truth object center. Specifically, we set the Gaussian radius to , where is the smallest allowable Gaussian radius, and is a radius function defined in CornerNet. In this way, CenterPoint maintains the simplicity of the center-based target assignment while the model gets denser supervision from nearby pixels.
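To make the enlarged-Gaussian target concrete, the following sketch renders a heatmap target with a lower bound on the radius. It is illustrative only: the size-based radius is passed in as a plain number rather than computed by the CornerNet radius function, the minimum radius of 2 used in the example is a hypothetical value, and the mapping from radius to standard deviation (`sigma = (2 * radius + 1) / 6`) is an assumed convention.

```python
import math
from typing import List


def target_radius(size_based_radius: int, min_radius: int) -> int:
    """Clamp the size-based radius from below by the smallest allowed radius."""
    return max(size_based_radius, min_radius)


def draw_gaussian(
    heatmap: List[List[float]], center_row: int, center_col: int, radius: int
) -> None:
    """Splat an unnormalized 2D Gaussian onto the heatmap in place.

    Overlapping objects are merged with an element-wise maximum, so the
    peak value at every object center stays exactly 1.
    """
    sigma = (2 * radius + 1) / 6.0  # assumed radius-to-sigma convention
    rows, cols = len(heatmap), len(heatmap[0])
    for r in range(max(0, center_row - radius), min(rows, center_row + radius + 1)):
        for c in range(max(0, center_col - radius), min(cols, center_col + radius + 1)):
            d2 = (r - center_row) ** 2 + (c - center_col) ** 2
            value = math.exp(-d2 / (2.0 * sigma ** 2))
            heatmap[r][c] = max(heatmap[r][c], value)


# A tiny object whose size-based radius would be 0 still receives a 5x5 target
# when the (hypothetical) minimum radius is 2.
target = [[0.0] * 7 for _ in range(7)]
draw_gaussian(target, 3, 3, target_radius(size_based_radius=0, min_radius=2))
positives = sum(1 for row in target for v in row if v > 0.0)
```

Without the lower bound, the same object would contribute a single positive cell, which is the sparse-supervision problem the paragraph above describes.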
At inference time, we use this offset to associate current detections to past ones in a greedy fashion. Specifically, we project the object centers in the current frame back to the previous frame by applying the negative velocity estimate and then matching them to the tracked objects by closest distance matching. Following SORT , we keep unmatched tracks up to frames before deleting them. We update each unmatched track with its last known velocity estimation. See supplement for the detailed tracking algorithm diagram.
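The association step above can be sketched as follows. This is a simplified illustration rather than the authors' tracker: processing detections in descending score order, the distance gate `max_dist`, and the dictionary layout of detections and tracks are assumptions of the example, and track aging is omitted.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def greedy_associate(
    detections: List[dict],
    tracks: Dict[int, Point],
    dt: float,
    max_dist: float,
) -> Dict[int, int]:
    """Match current detections to previous-frame track centers.

    Each detection center is projected back in time by subtracting
    velocity * dt, then matched to the closest unused track within max_dist.
    Returns a mapping from detection index to track id.
    """
    # Assumption: higher-confidence detections choose their match first.
    order = sorted(range(len(detections)), key=lambda i: -detections[i]["score"])
    matches: Dict[int, int] = {}
    used = set()
    for i in order:
        cx, cy = detections[i]["center"]
        vx, vy = detections[i]["velocity"]
        px, py = cx - vx * dt, cy - vy * dt  # apply the negative velocity
        best_id, best_dist = None, max_dist
        for track_id, (tx, ty) in tracks.items():
            if track_id in used:
                continue
            dist = math.hypot(px - tx, py - ty)
            if dist < best_dist:
                best_id, best_dist = track_id, dist
        if best_id is not None:
            matches[i] = best_id
            used.add(best_id)
    return matches


tracks = {7: (0.0, 0.0), 8: (10.0, 0.0)}
detections = [
    {"center": (1.0, 0.0), "velocity": (2.0, 0.0), "score": 0.9},
    {"center": (10.5, 0.0), "velocity": (0.0, 0.0), "score": 0.8},
    {"center": (50.0, 50.0), "velocity": (0.0, 0.0), "score": 0.7},
]
matches = greedy_associate(detections, tracks, dt=0.5, max_dist=2.0)
```

Detections left out of `matches` would start new tracks, and tracks left unmatched would be propagated with their last known velocity until they exceed the allowed age.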
CenterPoint combines all heatmap and regression losses in one common objective and jointly optimizes them. It simplifies and improves previous anchor-based 3D detectors (see experiments). However, all properties of the object are currently inferred from the object’s center-feature, which may not contain sufficient information for accurate object localization. For example, in autonomous driving, the sensor often only sees the side of the object, but not its center. Next, we improve CenterPoint by using a second refinement stage with a light-weight point-feature extractor.
We use CenterPoint unchanged as a first stage. The second stage extracts additional point-features from the output of the backbone. We extract one point-feature from the 3D center of each face of the predicted bounding box. Note that the bounding box center, top and bottom face centers all project to the same point in map-view. We thus only consider the four outward-facing box-faces together with the predicted object center. For each point, we extract a feature using bilinear interpolation from the backbone map-view output . Next, we concatenate the extracted point-features and pass them through an MLP. The second stage predicts a class-agnostic confidence score and box refinement on top of one-stage CenterPoint’s prediction results.
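The five-point feature extraction can be sketched as below. The sketch assumes the box is already expressed in feature-map pixel coordinates (x along columns, y along rows), that length runs along the heading direction, and that the feature map is a nested list indexed as `feature_map[row][col][channel]`; these conventions and the function names are illustrative, not taken from the paper.

```python
import math
from typing import List, Tuple


def five_bev_points(
    x: float, y: float, w: float, l: float, yaw: float
) -> List[Tuple[float, float]]:
    """Box center plus the centers of the four outward-facing faces in map-view."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    front = (x + 0.5 * l * cos_y, y + 0.5 * l * sin_y)
    back = (x - 0.5 * l * cos_y, y - 0.5 * l * sin_y)
    left = (x - 0.5 * w * sin_y, y + 0.5 * w * cos_y)
    right = (x + 0.5 * w * sin_y, y - 0.5 * w * cos_y)
    return [(x, y), front, back, left, right]


def bilinear_sample(
    feature_map: List[List[List[float]]], col: float, row: float
) -> List[float]:
    """Bilinearly interpolate a feature vector at a continuous location."""
    rows, cols = len(feature_map), len(feature_map[0])
    col = min(max(col, 0.0), cols - 1.0)  # clamp to the map border
    row = min(max(row, 0.0), rows - 1.0)
    c0, r0 = int(math.floor(col)), int(math.floor(row))
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    fc, fr = col - c0, row - r0
    channels = len(feature_map[0][0])
    return [
        (1 - fr) * ((1 - fc) * feature_map[r0][c0][k] + fc * feature_map[r0][c1][k])
        + fr * ((1 - fc) * feature_map[r1][c0][k] + fc * feature_map[r1][c1][k])
        for k in range(channels)
    ]


def second_stage_input(
    feature_map: List[List[List[float]]],
    box: Tuple[float, float, float, float, float],
) -> List[float]:
    """Concatenate the five sampled point-features into one vector for the MLP."""
    features: List[float] = []
    for col, row in five_bev_points(*box):
        features.extend(bilinear_sample(feature_map, col, row))
    return features


tiny_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
sampled = bilinear_sample(tiny_map, 0.5, 0.5)
vector = second_stage_input(tiny_map, (0.5, 0.5, 0.4, 0.8, 0.0))
```

Because only five locations are sampled per box, the cost of this step is independent of how many Lidar points fall inside the box.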
For class-agnostic confidence score prediction, we follow and use a score target guided by the box’s 3D IoU with the corresponding ground truth bounding box:
where is the IoU between the -th proposal box and the ground-truth. The training is supervised with a binary cross entropy loss:
where is the predicted confidence score. During inference, we directly use the class prediction from one-stage CenterPoint and compute the final confidence score as the geometric average of the two scores where is the final prediction confidence of object and and are the first stage and second stage confidence of object , respectively.
For box regression, the model predicts a refinement on top of first stage proposals, and we train the model with an L1 loss. Our two-stage CenterPoint simplifies and accelerates previous two-stage 3D detectors that use expensive PointNet-based feature extractors and RoIAlign operations.
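The scoring pipeline described here can be illustrated with a short sketch. The binary cross entropy and the geometric average follow directly from the text; the specific clamp-and-rescale form of the IoU-guided target, including the `scale` and `shift` constants, is an assumption of this example, since the text only states that the target is guided by the 3D IoU.

```python
import math


def iou_guided_target(iou: float, scale: float = 2.0, shift: float = 0.5) -> float:
    """Map a proposal's 3D IoU to a score target in [0, 1].

    Assumed form: a linear rescaling of the IoU clamped to [0, 1].
    """
    return min(1.0, max(0.0, scale * iou - shift))


def binary_cross_entropy(target: float, predicted: float, eps: float = 1e-7) -> float:
    """Binary cross entropy between a soft target and a predicted score."""
    predicted = min(max(predicted, eps), 1.0 - eps)  # avoid log(0)
    return -target * math.log(predicted) - (1.0 - target) * math.log(1.0 - predicted)


def final_confidence(first_stage: float, second_stage: float) -> float:
    """Geometric average of the first-stage and second-stage confidences."""
    return math.sqrt(first_stage * second_stage)


target = iou_guided_target(0.5)
loss = binary_cross_entropy(target, 0.5)
score = final_confidence(0.81, 0.49)
```

The geometric average keeps the final score low whenever either stage is unconfident, which is a natural way to combine a class-specific heatmap score with a class-agnostic quality estimate.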
Architecture
All first-stage outputs share a first convolutional layer, Batch Normalization , and ReLU. Each output then uses its own branch of two convolutions separated by a batch norm and ReLU. Our second-stage uses a shared two-layer MLP, with a batch norm, ReLU, and Dropout with a drop rate of , followed by two branches of three fully-connected layers, one for confidence score and one for box regression prediction.
Experiments
We evaluate CenterPoint on Waymo Open Dataset and nuScenes dataset. We implement CenterPoint using two 3D encoders: VoxelNet and PointPillars , termed CenterPoint-Voxel and CenterPoint-Pillar respectively.
Waymo Open Dataset. The Waymo Open Dataset contains 798 training sequences and 202 validation sequences for vehicles and pedestrians. The point-clouds are captured with a 64-beam Lidar, which produces about 180k Lidar points every 0.1s. The official 3D detection evaluation metrics include the standard 3D bounding box mean average precision (mAP) and mAP weighted by heading accuracy (mAPH). The mAP and mAPH are based on an IoU threshold of 0.7 for vehicles and 0.5 for pedestrians. For 3D tracking, the official metrics are Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP). The official evaluation toolkit also provides a performance breakdown for two difficulty levels: LEVEL_1 for boxes with more than five Lidar points, and LEVEL_2 for boxes with at least one Lidar point.
Our Waymo model uses a detection range of for the and axis, and for the axis. CenterPoint-Voxel uses a voxel size following PV-RCNN while CenterPoint-Pillar uses a grid size of .
nuScenes contains 1000 driving sequences, with 700, 150, 150 sequences for training, validation, and testing, respectively. Each sequence is approximately 20 seconds long, with a Lidar frequency of 20 FPS. The dataset provides calibrated vehicle pose information for each Lidar frame but only provides box annotations every ten frames (0.5s). nuScenes uses a 32-beam Lidar, which produces approximately 30k points per frame. In total, there are 28k, 6k, 6k annotated frames for training, validation, and testing, respectively. The annotations include 10 classes with a long-tail distribution. The official evaluation metrics are an average among the classes. For 3D detection, the main metrics are mean Average Precision (mAP) and nuScenes detection score (NDS). The mAP uses a bird's-eye-view center distance instead of standard box-overlap. NDS is a weighted average of mAP and other attribute metrics, including translation, scale, orientation, velocity, and other box attributes. After our test set submission, the nuScenes team added a new neural planning metric (PKL). The PKL metric measures the influence of 3D object detection on downstream autonomous driving tasks based on the KL divergence of a planner’s route (using 3D detection) and the ground truth trajectory. Thus, we also report the PKL metric for all methods that evaluate on the test set.
For 3D tracking, nuScenes uses AMOTA, which penalizes ID switches, false positives, and false negatives and is averaged among various recall thresholds.
For experiments on nuScenes, we set the detection range to for the and axis, and for axis. CenterPoint-Voxel uses a voxel size and CenterPoint-Pillar uses a grid.
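As a worked example of how NDS combines mAP with the error-based metrics, the sketch below follows the nuScenes benchmark definition as we understand it: mAP receives weight 5 and each of five true-positive error metrics is converted to a score in [0, 1] with weight 1. The weighting is stated from the benchmark definition rather than from the text here, and the input numbers are made up purely for illustration.

```python
from typing import Dict


def nuscenes_detection_score(mean_ap: float, tp_errors: Dict[str, float]) -> float:
    """Combine mAP and the mean true-positive errors into a single score.

    Each error is clipped at 1 and turned into a score via (1 - error),
    so an error of 1 or more contributes nothing.
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Illustrative values only, not results from any model.
example_errors = {
    "translation": 0.30,
    "scale": 0.25,
    "orientation": 0.40,
    "velocity": 1.50,  # clipped to 1, contributes a score of 0
    "attribute": 0.20,
}
example_nds = nuscenes_detection_score(0.5, example_errors)
```

Since velocity error enters the score directly, a detector that also estimates velocity well benefits in NDS even at equal mAP.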
Training and Inference. We use the same network designs and training schedules as prior works . See supplement for detailed hyper-parameters. During the training of two-stage CenterPoint, we randomly sample boxes with : positive negative ratio from the first stage predictions. A proposal is positive if it overlaps with a ground truth annotation with at least 0.55 IoU . During inference, we run the second stage on the top 500 predictions after Non-Maxima Suppression (NMS). The inference times are measured on an Intel Core i7 CPU and a Titan RTX GPU.
Main Results
We first present our 3D detection results on the test sets of Waymo and nuScenes. Both results use a single CenterPoint-Voxel model. Table 1 and Table 2 summarize our results. On the Waymo test set, our model achieves level 2 mAPH for vehicle detection and level 2 mAPH for pedestrian detection, surpassing previous methods by mAPH for vehicles and mAPH for pedestrians. On nuScenes (Table 2), our model outperforms last year's challenge winner CBGS, which uses multi-scale inputs and a multi-model ensemble, by mAP and NDS. Our model is also much faster, as shown later. A breakdown along classes is contained in the supplementary material. Our model displays a consistent performance improvement over all categories and shows more significant improvements in small categories ( mAP for traffic cone) and extreme-aspect ratio categories ( mAP for bicycle and mAP for construction vehicle). More importantly, our model significantly outperforms all other submissions under the neural planning metric (PKL), a hidden metric evaluated by the organizers after our leaderboard submission. This highlights the generalization ability of our framework.
3D Tracking
Table 3 shows CenterPoint’s tracking performance on the Waymo test set. Our velocity-based closest distance matching described in Section 4 significantly outperforms the official tracking baseline in the Waymo paper , which uses a Kalman-filter based tracker . We observe a and MOTA improvement for vehicle and pedestrian tracking, respectively. On nuScenes (Table 4), our framework outperforms the last challenge winner Chiu et al. by AMOTA. Notably, our tracking does not require a separate motion model and runs in a negligible time, on top of detection.
Ablation studies
Center-based vs. Anchor-based. We first compare our center-based one-stage detector with its anchor-based counterparts. On Waymo, we follow the state-of-the-art PV-RCNN to set the anchor hyper-parameters: we use two anchors per location with ° and °; the positive/negative IoU thresholds are set as for vehicles and for pedestrians. On nuScenes, we follow the anchor assignment strategy from the last challenge winner CBGS. All other parameters are the same as our CenterPoint model.
As shown in Table 5, on the Waymo dataset, simply switching from anchors to our centers gives mAPH and mAPH improvements for VoxelNet and PointPillars encoder, respectively. On nuScenes (Table 6) CenterPoint improves anchor-based counterparts by - mAP and - NDS across different backbones. To understand where the improvements come from, we further show the performance breakdown on different subsets based on object sizes and orientation angles on the Waymo validation set.
We first divide the ground truth instances into three bins based on their heading angles: 0° to 15°, 15° to 30°, and 30° to 45°. This division tests the detector’s performance for detecting heavily rotated boxes, which is critical for the safe deployment of autonomous driving. We also divide the dataset into three splits: small, medium, and large, and each split contains of the overall ground truth boxes.
Table 7 and Table 8 summarize the results. Our center-based detectors perform much better than the anchor-based baseline when the box is rotated or deviates from the average box size, demonstrating the model’s ability to capture the rotation and size invariance when detecting objects. These results convincingly highlight the advantage of using a point-based representation of 3D objects.
One-stage vs. Two-stage. In Table 9, we show the comparison between single and two-stage CenterPoint models using 2D CNN features on Waymo validation. Two-stage refinement with multiple center features gives a large accuracy boost to both 3D encoders with small overheads (6ms-7ms). We also compare with RoIAlign, which densely samples points in the RoI; our center-based feature aggregation achieves comparable performance but is faster and simpler.
The voxel quantization limits two-stage CenterPoint’s improvements for pedestrian detection with PointPillars as pedestrians typically only reside in 1 pixel in the model input.
Two-stage refinement does not bring an improvement over the single-stage CenterPoint model on nuScenes in our experiments. We think the reason is that the nuScenes dataset uses a 32-beam Lidar, which produces about 30k Lidar points per frame, about of the number of points in the Waymo dataset, which limits the potential improvements of two-stage refinement. Similar results have been observed in previous two-stage methods like PointRCNN and PV-RCNN.
Effects of different feature components In our two-stage CenterPoint model, we only use features from the 2D CNN feature map. However, previous methods propose to also utilize voxel features for second stage refinement . Here, we compare with two voxel feature extraction baselines:
Voxel-Set Abstraction. PV-RCNN proposes the Voxel-Set Abstraction (VSA) module, which extends PointNet++ ’s set abstraction layer to aggregate voxel features in a fixed radius ball.
Radial basis function (RBF) Interpolation. PointNet++ and SA-SSD use a radial basis function to aggregate grid point features from three nearest non-empty 3D feature volumes.
For both baselines, we combine bird's-eye view features with voxel features using their official implementations. Table 10 summarizes the results. It shows that bird's-eye view features are sufficient for good performance while being more efficient than the voxel features used in the literature.
To compare with prior work that did not evaluate on Waymo test, we also report results on the Waymo validation split in Table 11. Our model outperforms all published methods by a large margin, especially for the challenging pedestrian class (+18.6 mAPH) of the level 2 dataset, where boxes contain as few as one Lidar point.
3D Tracking. Table 12 shows the ablation experiments of 3D tracking on nuScenes validation. We compare with last year’s challenge winner Chiu et al., which uses a Mahalanobis distance-based Kalman filter to associate detection results of CBGS. We decompose the evaluation into the detector and tracker to make the comparison strict. Given the same detected objects, using our simple velocity-based closest point distance matching outperforms the Kalman filter-based Mahalanobis distance matching by AMOTA (line 1 vs. line 3 and line 2 vs. line 4). There are two sources of improvements: 1) we model the object motion with a learned point velocity, rather than modeling 3D bounding box dynamics with a Kalman filter; 2) we match objects by center point-distance instead of a Mahalanobis distance of box states or 3D bounding box IoU. More importantly, our tracking is a simple nearest neighbor matching without any hidden-state computation. This saves the computational overhead of a 3D Kalman filter (73ms vs. 1ms).
Conclusion
We proposed a center-based framework for simultaneous 3D object detection and tracking from the Lidar point-clouds. Our method uses a standard 3D point-cloud encoder with a few convolutional layers in the head to produce a bird-eye-view heatmap and other dense regression outputs. Detection is a simple local peak extraction with refinement, and tracking is a closest-distance matching. CenterPoint is simple, near real-time, and achieves state-of-the-art performance on the Waymo and nuScenes benchmarks.
References
Appendix A Tracking algorithm
Appendix B Implementation Details
Our implementation is based on the open-sourced code of CBGS (https://github.com/poodarchu/Det3D). CBGS provides implementations of PointPillars and VoxelNet on nuScenes. For Waymo experiments, we use the same architecture for VoxelNet and increase the output stride to 1 for PointPillars following the dataset’s reference implementation (https://github.com/tensorflow/lingvo/tree/master/lingvo/tasks/car).
A common practice in nuScenes is to transform and merge the Lidar points of non-annotated frames into its following annotated frame. This produces a denser point-cloud and enables a more reasonable velocity estimation. We follow this practice in all nuScenes experiments.
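The sweep-merging practice can be sketched as follows. The sketch assumes each non-annotated sweep comes with a 4x4 homogeneous transform into the annotated frame's coordinate system, and it appends a per-point time lag as an extra channel; the extra channel and the function names are assumptions of this example rather than details given in the text.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pose = Sequence[Sequence[float]]  # 4x4 row-major homogeneous transform


def transform_point(pose: Pose, point: Point3) -> Point3:
    """Apply a rigid transform to a single 3D point."""
    x, y, z = point
    return tuple(
        pose[i][0] * x + pose[i][1] * y + pose[i][2] * z + pose[i][3]
        for i in range(3)
    )


def merge_sweeps(
    key_points: List[Point3],
    sweeps: List[Tuple[List[Point3], Pose, float]],
) -> List[Tuple[float, float, float, float]]:
    """Merge earlier sweeps into the annotated (key) frame.

    Each sweep is (points, sweep_to_key_pose, time_lag_seconds). Output points
    are (x, y, z, time_lag); key-frame points get a lag of 0.
    """
    merged = [(x, y, z, 0.0) for (x, y, z) in key_points]
    for points, pose, lag in sweeps:
        for p in points:
            merged.append(transform_point(pose, p) + (lag,))
    return merged


# Hypothetical sweep captured 0.05 s earlier, 1 m behind the key frame along x.
shift_x = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
cloud = merge_sweeps([(0.0, 0.0, 0.0)], [([(2.0, 0.0, 0.0)], shift_x, 0.05)])
```

Keeping a time channel lets the network distinguish points of a moving object observed at different times, which is what makes velocity estimation from the merged cloud plausible.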
For data augmentation, we use random flipping along both and axis, and global scaling with a random factor from . We use a random global rotation between for nuScenes and for Waymo . We also use the ground-truth sampling on nuScenes to deal with the long tail class distribution, which copies and pastes points inside an annotated box from one frame to another frame.
For nuScenes dataset, we follow CBGS to optimize the model using AdamW optimizer with one-cycle learning rate policy , with max learning rate 1e-3, weight decay 0.01, and momentum to . We train the models with batch size 16 for 20 epochs on 4 V100 GPUs.
We use the same training schedule for Waymo models except for a learning rate of 3e-3, and we train the model for 30 epochs following PV-RCNN. To save computation on the large-scale Waymo dataset, we finetune the model for 6 epochs with second stage refinement modules for various ablation studies. All ablation experiments are conducted in this same setting.
For the nuScenes test set submission, we use an input grid size of and add two separate deformable convolution layers in the detection head to learn different features for classification and regression. This improves CenterPoint-Voxel’s performance from NDS to NDS on nuScenes validation. For the nuScenes tracking benchmark, we submit our best CenterPoint-Voxel model with flip testing, which yields a result of AMOTA on nuScenes validation.
Appendix C nuScenes Performance across classes
We show per-class comparisons with state-of-the-art methods in Table 13.
Appendix D nuScenes Detection Challenge
As a general framework, CenterPoint is complementary to contemporary methods and was used by three of the top 4 entries in the NeurIPS 2020 nuScenes detection challenge. In this section, we describe the details of our winning submission, which significantly improved over the 2019 challenge winner CBGS by mAP and NDS. We report some improved results in Table 14. We use PointPainting to annotate each Lidar point with image-based instance segmentation results generated by a Cascade RCNN model trained on nuImages (acquired from https://github.com/open-mmlab/mmdetection3d/tree/master/configs/nuimages). This improves the NDS from to . We then perform two test-time augmentations: double flip testing and point-cloud rotation around the yaw axis. Specifically, we use [0°, 6.25°, 12.5°, 25°] for yaw rotations. These test-time augmentations improve the NDS from to . In the end, we ensemble five models with input grid size between to and filter out predictions with zero number of points, which yields our best results on nuScenes validation, with mAP and NDS.