HDNET: Exploiting HD Maps for 3D Object Detection

Bin Yang, Ming Liang, Raquel Urtasun

Introduction

Autonomous vehicles have the potential of providing cheaper and safer transportation. A typical autonomous system is composed of the following functional modules: perception, prediction, planning and control . Perception is concerned with detecting the objects of interest (e.g. vehicles) in the scene and track them over time. The prediction module estimates the intentions and trajectories of all actors into the future. Motion planning is responsible for producing a trajectory that is safe, while control outputs the commands necessary for the self-driving vehicle to execute such trajectory.

3D object detection is a fundamental task in perception systems. Modern 3D object detectors exploit LiDAR as input as it provides good geometric cues and eases 3D localization when compared to camera-only approaches. In the context of real-time applications, single-shot detectors have been shown to be more promising than proposal-based methods as they are very efficient and can produce very accurate estimates. However, object detection is far from solved as many challenges remain, such as dealing with occlusion and the sparsity of the LiDAR at long range.

Most self-driving systems have access to High-Definition (HD) maps that contain geometric and semantic information about the environment. While HD maps are widely used by motion planning systems , they are vastly ignored by perception systems . In this paper we argue that HD maps provide strong priors that can boost the performance and robustness of modern object detectors. Towards this goal, we derive an efficient and effective single-stage detector that operates in Bird’s Eye View (BEV) and fuses LiDAR information with rasterized maps. Bird’s eye view is a good representation for 3D LiDAR as it is amenable to efficient inference and retains the metric space. Since HD maps might not be available everywhere, we also propose a map prediction module that estimates the map geometry and semantics from a single online LiDAR sweep.

Our experiments on the public KITTI BEV object detection benchmark and a large-scale 3D object detection benchmark TOR4D show that we can achieve significant Average Precision (AP) gain on top of a state-of-the-art detector by exploiting HD maps. On TOR4D when HD maps are available, we achieve 2.42%, 3.43% and 5.49% AP gains for ranges over 0-70 m, 30-50 m and 50-70 m respectively. On KITTI, where HD maps are unavailable, we show that when using a pre-trained map prediction module (trained on a different continent) we can still get 2.87% AP gain, surpassing all competing methods including those which also exploit cameras. Importantly, the proposed map-aware detector runs at 20 frames per second.

Related Work

In this paper we show how modern 3D object detectors can benefit from HD maps. Here we revisit literatures on 3D object detection from point clouds, as well as works that exploit priors from maps.

Some detectors search objects in 3D space densely via sliding window. Due to the sparsity of LiDAR point cloud, these approaches suffer from expensive computation, as many proposals are evaluated where there is no evidence. A more efficient representation of point clouds is to exploit 2D projections. MV3D uses a combination of range view and bird’s eye view projections to fuse multi-view information. PIXOR and FAF show superior performance in both speed and accuracy by exploiting the bird’s eye view representation alone. Apart from grid-like representations, some works learn feature representations from un-ordered point sets for object detection. In addition to point cloud based detection, many works try to fuse data from multiple sensors to improve performance.

2 Exploiting Priors from Maps

Maps contain geographic, geometric and semantic priors that are useful for many tasks. leverages dense priors from large-scale crowd-sourced maps to build a holistic model that does joint 3D object detection, semantic segmentation and depth reconstruction. However, the priors are extracted by rendering a 3D world from the map, which is very time-consuming. In , the crowd-sourced maps are also used for fine-grained road segmentation. 3D object detectors often use the ground prior . However, they treat ground as a plane which is often inaccurate for curved roads. In this paper, we exploit the point-wise geometric ground prior as well as the semantic road prior for 3D object detection. In , online estimated maps are used to reason the location of objects in 3D. However, the proposed generative model is very slow and thus not amenable for robotics applications. uses HD maps to predict the intention of different traffic actors like vehicles, pedestrians and bicyclists, which belongs to the more high-level prediction system.

Exploiting HD Maps for 3D Object Detection

HD maps are typically employed for motion planning but they are vastly ignored for perception systems. In this paper we bridge the gap and study how HD maps can be used as priors for modern 3D object detectors. Towards this goal, we derive a single-stage detector that can exploit both semantic and geometric information extracted from the maps (whether built offline or estimated online). We refer the reader to Figure 1 for the overall architecture of our proposed map-aware detector.

We project the LiDAR data to bird’s eye view (BEV) as this provides a compact representation that enables efficient inference. Note that this is a good representation for our application as vehicles drive on the ground. Figure 2 illustrates how we incorporate priors from HD maps into our BEV LiDAR representation.

We treat the $Z$ axis as feature channel and exploit 2D convolutions. This is beneficial as they are much more efficient than 3D convolutions. It is therefore important to have discriminant features along the $Z$ axis. However, LiDAR point clouds often suffer from translation variance along the $Z$ axis due to the slope of the road, particularly at range. For example, 1 degree of road slope would lead to 1.22 meters offset in $Z$ axis at 70 meters distance. To make things worse, standard LiDAR sensors have very sparse returns after 50 meters.

To remedy this translation variance, we exploit detailed ground information from the HD maps. Specifically, given a LiDAR point cloud $\{(x_{i},y_{i},z_{i})\}$ , we query the ground point at each location $(x_{i},y_{i})$ from the map, denoted as $(x_{i},y_{i},z^{0}_{i})$ and replace the absolute distance $z_{i}$ with the distance relative to the ground $z_{i}-z^{0}_{i}$ . We then discretize the resulting representation into a 3D occupancy grid enhanced to also contain LiDAR intensity features following . Specifically, we first define the 3D dimension $L\times W\times H$ of the scene that we are interested in. We then compute binary occupancy maps at a resolution of $d_{L}\times d_{W}\times d_{H}$ , and compute the intensity feature map at a resolution of $d_{L}\times d_{W}\times H$ . The discretized LiDAR BEV representation is of size $\frac{L}{d_{L}}\times\frac{W}{d_{W}}$ , with $\frac{H}{d_{H}}+3$ channelsTwo additional occupancy channels are added to cover points outside the height range..

The LiDAR provides a full scan of the surrounding environment, containing moving objects and static background (e.g. roads and buildings). However, in the context of self driving cars we mainly care about moving objects on the road. Motivated by this, we exploit the semantic road mask available on HD maps as prior knowledge about the scene. Specifically, we extract the road layout information from the HD maps and rasterize it onto the bird’s eye view as a binary channel at the same resolution as the discretized LiDAR representation. We then concatenate the road mask together with the LiDAR representation along the channels to create our input representation.

2 Network Structure

We adopt a fully convolutional network for single-stage dense object detection. Figure 3 (left) shows the detection network, which is composed of two parts: a backbone network that extracts multi-scale features and a header network that outputs pixel-wise dense detection estimates.

Backbone network: consists of four convolutional blocks, each having {2, 2, 3, 6} conv2D layers with filter number {32, 64, 128, 256}, filter size 3, and stride 1. We apply batch normalization and ReLU after each conv2D layer. After each of the first three convolutional blocks there’s a MaxPool layer with filter size 3 and stride 2. Multi-scale features are generated by resizing and concatenating feature maps from different blocks. The total downsampling rate of the network is 4.

Header network: consists of five conv2D layers with filter number 256, filter size 3 and stride 1, followed by the last conv2D layer that outputs pixel-wise dense detection estimates $(p,cos(2\theta),sin(2\theta),dx,dy,\log w,\log l)$ . $p$ denotes the confidence score for object classification, where the rest denotes the geometry of the detection. Specifically, we parameterize the object orientation (in $XY$ plane) by $(cos(2\theta),sin(2\theta))$ with a period of $\pi$ so that we do not distinguish between heading forward and backwardBecause they are the same in terms of IoU computation.. $(dx,dy,\log w,\log l)$ are commonly used parameterization for object center offset and size. The advantages of producing dense detections are two-fold: it is efficient to compute via convolutions and the maximum recall rate of objects is 100%.

3 Learning and Inference

Adding priors (semantic prior in particular) at the input level has both advantages as well as challenges. On one hand, the network can fully exploit the interactions between priors and raw data; on the other hand, this may lead to overfitting issues, and potentially poor results when HD maps are unavailable or noisy. In practice, having a detector that works regardless of map availability is important. Towards this goal, we apply data dropout on the semantic prior, which randomly feeds an empty road mask to the network during training. Our experiments show that data dropout largely improves the model’s robustness to map availability.

During inference, we pick all pixels on the output feature map whose confidence score is above a certain threshold, and decode them into oriented bounding boxes. Non-Maximum-Suppression (NMS) with 0.1 Intersection-Over-Union (IoU) threshold is then applied to get the final detections.

Online Estimation of HD Maps

So far we have shown how to exploit geometric and semantic priors from the already built HD maps. However, such maps might not be available everywhere. To tackle this problem, we create the map priors online from a single LiDAR sweep. In our scenario, there is no need to estimate dense 3D HD maps. Instead, we only have to predict the BEV representation of the geometric and semantic priors (shown in Figure 2). In this way, the estimated map features can be seamlessly integrated into the current framework.

Network structure: We estimate online maps from LiDAR point clouds via two separate neural networks that tackle ground estimation and road segmentation respectively. The network structure for these two tasks are the same U-Net structure as shown in Figure 3 (right). This has the advantages of retaining low-level details and producing high-resolution predictions.

Road segmentation: We predict pixel-wise BEV road segmentation as the estimated road prior. During training we use the rasterized road mask as ground-truth label and train the network with cross-entropy loss summed over all locations. We refer the reader to Figure 2 for an illustration.

Experiments

We validate the proposed method on two benchmarks: a large-scale 3D object detection benchmark TOR4D , which has over one million frames as well as corresponding HD maps; and the public KITTI BEV object detection benchmark . We denote our baseline detector without exploiting map priors as PIXOR++ since it follows the overall architecture of PIXOR , and denote the map-aware detector which exploits map priors as HDNET.

Implementation details: We discretize the LiDAR point cloud within the following 3D region $x\in[-70.4,70.4]$ , $y\in$ , $z\in[-2,3.4]$ in vehicle coordinate system (with the ego-car located at the origin). We use a discretization resolution of 0.2 meter in all three axes. The detection network is trained with stochastic gradient descent with momentum for 1.5 epochs with a batch size of 32. The initial learning rate is 0.02 and is multiplied by 0.1 after 1 and 1.4 epochs.

Baseline PIXOR++ detector: The baseline detector achieves 81.78% AP within 0-70 meters, outperforming FAF and PIXOR by 4.3% and 2.96% respectively. What’s more, the baseline detector runs at 17 ms per frame. In terms of performance at different ranges, we observe an AP drop of 3.44% and 20.74% when evaluated at 30-50 m range and 50-70 m range. This suggests that long range detection is significantly difficult.

Online map estimation: Figure 4 shows the evaluation results of the online map prediction module. In ground height estimation task, we are able to achieve $<5$ cm L1 error pixel-wise within 50 m range. When the range becomes longer, the error increases since we have more sparse LiDAR observations there. In road segmentation task, we achieve 97.70% pixel-wise accuracy and 92.86% IoU on the validation set. Note that these results are from one single LiDAR sweep only.

Map-aware HDNET detector: As seen from Table 1, HDNET outperforms the baseline PIXOR++ by 1.33%/2.42% in settings with online/offline maps. When it comes to longer range, the gain in both settings increases, especially for the offline one (with 5.49% gain in 50-70 m). This shows that map priors are especially helpful for the more difficult long range detection. Comparing geometric and semantic priors in both settings, we have three main observations: (1) geometric prior generally helps more than semantic prior, while in online setting the advantage diminishes (especially in long range) due to larger errors in online ground height estimation; (2) geometric prior and semantic prior are very complementary to each other; (3) in all settings and all ranges, we show that exploiting map priors could bring consistent gain over a state-of-the-art detector.

2 Evaluation on KITTI BEV Benchmark

Implementation details: We set the region of interest for the point cloud to $[0,70.4]\times\times$ meter and use a discretization resolution of $0.1\times 0.1\times 0.2$ meter. Following , we use the same ResNet based network structure that’s more robust to overfitting. Since KITTI has only thousands of training frames and we do not use pre-trained weights for the network, we apply the following data augmentation during training. For each frame of LIDAR point cloud, we apply random scaling ( $0.9\sim 1.1$ for all 3 axes), translation ( $-5\sim 5$ meter for $X$ and $Y$ axes) and rotation ( $-5\sim 5$ degrees along $Z$ axis). KITTI only annotates objects within the camera field of view (FOV), therefore we ignore all pixels outside the camera FOV during training. We use $\alpha=0.75$ and $\gamma=0.5$ in the focal loss. We train the network with stochastic gradient descent with momentum for 50 epochs with a batch size of 16 frames. The initial learning rate is 0.01 and we decay it by $0.1$ after 30 and 45 epochs respectively.

BEV object detection: We compare PIXOR++ and HDNET with other state-of-the-art detectors on KITTI BEV object detection benchmark and show the evaluation results in Table 2. Our baseline detector PIXOR++ has already surpassed all the competing LiDAR based detectors, outperforming the second best VoxelNet by 4.44% AP in moderate setting while being the fastest in runtime (35 ms). Because KITTI doesn’t have HD maps information, we transfer the map prediction module trained on TOR4D to KITTI without any fine-tuning. Note that we do not exploit any object labels from the TOR4D dataset. With map priors added, HDNET achieves an absolute 2.87% AP gain in moderate setting, setting the new record on the benchmark. Note that HDNET even outperforms other competitors that exploit camera images (e.g. ContFuse ) or additional object labels from outside datasets (e.g. F-PointNet ), which highlights the long-ignored value of HD maps in 3D object detection.

3 Ablation Studies

We conduct four ablation studies investigating different ways to exploit the geometry and semantic priors, the effect of data dropout on robustness to unavailable maps, and the detection performance with regard to different ranges using priors from online estimated maps or offline built maps.

We model the geometric prior as a point-wise ground surface, which is more flexible and accurate than commonly used ground plane assumption . We compare these two design choices. For the ground plane counterpart, we use the publicly available ground plane results from 3DOP , which were generated by road segmentation and RANSAC fitting. In contrast, we directly predict the point-wise ground. The way how the ground prior is used in detection is the same for these two methods. We evaluate these two priors using HDNET on KITTI validation set (without data augmentation) and show the results in Table 3a. Both methods can boost the detection performance of the baseline, while our point-wise ground parameterization achieves significantly more gain.

3.2 Road prior

We incorporate semantic road prior into the input representation of the detector, while there are methods that use such prior at the output level. Two variants are created for output level fusion: multi-task learning and output masking. For multi-task learning, we add another output branch to the detection network to predict the road mask during training. For output masking, we mask the detection output with the road mask during both training and testing stages. We show the comparative results using the road prior extracted from offline HD maps in Table 3b, where we see that our input fusion performs the best among all three methods. For multi-task learning, we observe limited performance gain from jointly learning road mask and object detection. Output masking hurts the detection performance as errors in road prior is unrecoverable.

3.3 Data dropout

Data dropout is applied to the map prior during training stage to improve the model’s robustness to missing maps. To validate the effectiveness of data dropout, we train two models with 100% map availability. The only difference is that one model is trained with data dropout applied to the road prior, while the other model doesn’t use data dropout. During testing, we manually control the availability of road prior at {0%, 50%, 100%} and compare the performance.

The evaluation results are shown in Table 3c, where we see that without data dropout, AP drops by 11.57%/6.37% at 0%/50% map availability. However, when data dropout is introduced, the map-aware detector is able to surpass the baseline when map’s available; and when map’s unavailable, the performance is almost as good as the baseline.

3.4 Performance at various ranges

We define the range as the distance from the object to the ego-car on the $XY$ plane. We divide the full 70 meters range into 7 bins, and when evaluating on each range bin we ignore the detections and labels outside this range bin. We conduct the fine-grained evaluation in both online and offline settings and show results in Table 3d.

For offline setting, we see that the gain increases as the object range becomes larger. At $>40$ meters range, the gain is larger than 3%. This makes sense as the map priors help more when we have fewer LiDAR observations. For online setting, we have the similar trending in near range ( $<40$ meters), but the gain starts to decrease at long ranges ( $>40$ meters). The reason is that map priors are estimated from single LiDAR sweep, which is very sparse in long range.

4 Qualitative Results

We compare some qualitative results from two detector variants: the baseline PIXOR++ detector without map, and the HDNET detector with online map estimation on the validation set of KITTI BEV object detection benchmark in Figure 5.

From the figure we can easily identify false positives and false negatives in long range for PIXOR++. However, when it comes to HDNET, the detection accuracy and localization precision at long range increases remarkably. This shows that exploiting map priors does help the long range detection a lot.

Conclusion

In this paper we address the problem of how to exploit HD maps information to boost the performance of modern 3D object detection systems in the context of autonomous driving. We identify the geometric and semantic priors in HD maps, and incorporate them into the bird’s eye view LiDAR representation. Because HD maps are not available everywhere, we also propose a map prediction module that estimates both map priors online from single LiDAR sweep. Experimental results on the public KITTI BEV object detection benchmark and a large-scale 3D object detection benchmark show that the proposed map-aware detector consistently outperforms the baseline detector whether the HD maps are available or not. The whole framework also runs at over 20 frames per second due to the use of BEV representation as well as the single-stage detection framework.