RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving

Peixuan Li, Huaici Zhao, Pengfei Liu, Feidao Cao

Introduction

3D object detection is an essential component of scene perception and motion prediction in autonomous driving. Currently, the most powerful 3D detectors rely heavily on 3D LiDAR laser scanners because they can provide scene locations. However, LiDAR-based systems are expensive and not conducive to embedding into the current vehicle shape. In comparison, monocular cameras are cheaper and more convenient, which has drawn increasing attention to them in many application scenarios. In this paper, the scope of our research is 3D object detection from a single monocular RGB image.

Monocular 3D object detection methods can be roughly divided into two categories by the type of training data. The first category utilizes complex features, such as instance segmentation, vehicle shape priors, and even depth maps, to select the best proposals in a multi-stage fusion module. These features require additional annotation work to train stand-alone networks, which consume considerable computing resources in both the training and inference stages. The second category employs only the 2D bounding box and the properties of the 3D object as supervision. In this case, an intuitive idea is to build a deep regression network that directly predicts the 3D information of the object, but this can cause performance bottlenecks due to the large search space. For this reason, recent works apply geometric constraints between the 3D box vertices and the 2D box edges to refine or directly predict the object parameters. However, the four edges of a 2D bounding box provide only four constraints for recovering a 3D bounding box, while each vertex of the 3D bounding box might correspond to any edge of the 2D box, so the same calculation must be repeated 4,096 times to obtain one result. Meanwhile, the strong reliance on the 2D box causes a sharp decline in 3D detection performance when the prediction of the 2D detector has even a slight error. Therefore, most of these methods use two-stage detectors to ensure the accuracy of the 2D box prediction, which limits the upper bound of the detection speed.
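The figure of 4,096 can be checked with a short enumeration. The sketch below assumes, as the counting argument implies, that each of the four 2D box edges may be touched by any one of the eight 3D vertices; the variable names are illustrative only.

```python
from itertools import product

NUM_VERTICES = 8  # corners of a 3D bounding box
NUM_EDGES = 4     # left, top, right, and bottom edges of the 2D box

# One configuration assigns a (possibly repeated) vertex index to every edge.
assignments = list(product(range(NUM_VERTICES), repeat=NUM_EDGES))
num_configurations = len(assignments)
print(num_configurations)  # 4096
```

Every one of these configurations requires solving the same constraint system, which is the cost that keypoint-based formulations aim to avoid.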

In this paper, we propose an efficient and accurate one-stage monocular 3D detection framework that is tailored for 3D detection and does not rely on 2D detectors. The framework can be divided into two main parts, as shown in Fig. 1. First, we use a one-stage fully convolutional architecture to predict 9 2D keypoints, which are the projections of the 8 vertices and the central point of the 3D bounding box. These 9 keypoints provide 18 geometric constraints on the 3D bounding box. Inspired by CenterNet, we model the relationship between the eight vertices and the central point to solve the keypoint grouping and vertex ordering problems. SIFT, SURF, and other traditional keypoint detection methods compute an image pyramid to achieve scale invariance. A similar strategy is used by CenterNet as a post-processing step to further improve detection accuracy, but it slows down inference. Note that the Feature Pyramid Network (FPN) used in 2D object detection is not applicable to keypoint detection networks, because adjacent keypoints may overlap in small-scale predictions. We therefore propose a novel multi-scale pyramid for keypoint detection to generate a scale-space response. The final keypoint activation map is obtained by means of a soft-weighted pyramid. Given the 9 projected points, the next step is to minimize the reprojection error with respect to the 3D points, which are parameterized by the location, dimensions, and orientation of the object. We formulate the reprojection error as multivariate equations in $\mathfrak{se}_{3}$ space, which allows the detection results to be generated accurately and efficiently. We also discuss the effect of different prior information on our keypoint-based method, such as dimension, orientation, and distance. This information is only worth obtaining if it adds little computation, so that the final detection speed is not affected.
We model these priors and the reprojection error term in an overall energy function to further improve the 3D estimation.

To summarize, our main contributions are the following:

We formulate monocular 3D detection as a keypoint detection problem and combine it with geometric constraints to generate the properties of 3D objects more efficiently and accurately.

We propose a novel one-stage, multi-scale network for 3D keypoint detection, which provides accurate projected points for objects at multiple scales.

We propose an overall energy function that jointly optimizes the prior and 3D object information.

Evaluated on the KITTI benchmark, ours is the first real-time 3D detection method that uses only images, and it achieves better accuracy than competing methods at the same running time.

Related Work

3D detection methods can be divided into two groups by the type of data: LiDAR-based and image-based methods.

LiDAR-based Methods. LiDAR-based systems provide accurate and reliable point clouds of object surfaces in a 3D scene. Therefore, most recent 3D object detection methods employ them, in different representations, to obtain state-of-the-art models.

Extra Data or Network for Image-based 3D Object Detection. In recent years, many studies have pursued image-based 3D detection because camera devices are more convenient and much cheaper. To compensate for the depth information missing from images, most previous approaches rely heavily on stand-alone networks or additional labeled data, such as instance segmentation, stereo, wire-frame models, CAD priors, and depth, as shown in Table 1. Among them, monocular 3D detection is the more challenging task because reliable 3D information is difficult to obtain from a single image. One of the first examples enumerates a multitude of 3D proposals from a pre-defined space where objects may appear, using geometric heuristics. It then uses other complex priors, such as shape, instance segmentation, and contextual features, to filter out poor proposals and scores the remainder with a classifier. To make up for the lack of depth, another approach embeds a pre-trained stand-alone module to estimate the disparity and the 3D point cloud. The disparity map is concatenated with the front-view representation to help the 2D proposal network, and 3D detection is boosted by fusing the features extracted after RoI pooling with the point cloud. A follow-up work combines a 2D detector with a monocular depth estimation model to obtain the 2D box and the corresponding point cloud. The final 3D box is regressed by PointNet after the image features and 3D point information are aggregated through an attention mechanism, which achieves the best performance among methods using a monocular image.
Intuitively, these methods increase detection accuracy, but the additional networks and annotated data lead to more computation and labor-intensive work.

Image-only in Monocular 3D Object Detection. Recent works have tried to fully explore the potential of RGB images for 3D detection. Most of them use geometric constraints and 2D detectors to explicitly describe the 3D information of the object. One such method uses a CNN to estimate the dimensions and orientation from features extracted from the 2D box, and then obtains the location of the object from the geometric constraints of the perspective relationship between the 3D points and the 2D box edges. This contribution is followed by most image-based detection methods, either as a refinement step or as a direct calculation on 3D objects. All that is known in this constraint is that certain 3D points are projected onto the 2D edges; the correspondence and the exact location of each projection are unknown. Therefore, $8^{4}=4096$ configurations must be enumerated exhaustively to determine the final correspondence, and the result provides only four constraints, which is not sufficient for a full 3D representation with 9 parameters. This leads to the need to estimate other prior information. Moreover, with so few constraints, possible inaccuracies in the 2D bounding boxes may result in a grossly inaccurate solution. Therefore, most of these methods obtain a more accurate 2D box through a two-stage detector, which makes real-time speed difficult to achieve.

Keypoints in Monocular 3D Object Detection. It is believed that the detection accuracy for occluded and truncated objects can be improved by deducing complete shapes from vehicle keypoints. Such methods represent regular-shaped vehicles as a wire-frame template, which is obtained from a large number of CAD models. To train the keypoint detection network, they need to re-label the dataset and even use depth maps to enhance the detection capability.
The work most related to ours also considers the wire-frame model as prior information. Furthermore, it jointly optimizes the 2D box, 2D keypoints, 3D orientation, scale hypotheses, shape hypotheses, and depth with four different networks, which limits its run time. In contrast to prior work, we reformulate 3D detection as a coarse keypoint detection task. Instead of predicting the 3D box from off-the-shelf 2D detectors or other data generators, we build a network to predict the 9 2D keypoints projected from the vertices and center of the 3D bounding box, and then minimize the reprojection error to find an optimal result.

Proposed Method

In this section, we first describe the overall architecture for keypoint detection. We then detail how to recover 3D vehicles from the generated keypoints.

where $\odot$ denotes the element-wise product.
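The soft-weighted pyramid can be illustrated with a small NumPy sketch. It assumes that the per-scale keypoint score maps have already been resized to a common resolution and that the soft weights are a softmax taken across scales; both points, together with the function name `soft_weighted_fusion`, are assumptions made for illustration rather than details confirmed by the surviving text.

```python
import numpy as np

def soft_weighted_fusion(score_maps):
    """Fuse per-scale score maps of shape (num_scales, H, W) into one (H, W) map."""
    maps = np.asarray(score_maps, dtype=np.float64)
    # Softmax across the scale axis (shifted by the max for numerical stability).
    shifted = maps - maps.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Element-wise product of weights and scores, summed over the scales.
    return (weights * maps).sum(axis=0)

# Two hypothetical scales: a weak response and a strong response.
scales = np.stack([np.full((2, 2), 0.2), np.full((2, 2), 0.9)])
fused = soft_weighted_fusion(scales)
```

Because the weights grow with the response, the fused value at each pixel is pulled toward the scale that responds most strongly, rather than being a plain average.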

Detection Head. The detection head comprises three fundamental components and six optional components, which can be arbitrarily selected to boost the accuracy of 3D detection at little computational cost. Inspired by CenterNet, we take one keypoint as the maincenter for connecting all features. Since the projected 3D center of the object may fall outside the image boundary in the case of truncation, the center point of the 2D box is a more appropriate choice. Its heatmap is defined as $M\in\mathbb{R}^{\frac{H}{S}\times\frac{W}{S}\times C}$, where $C$ is the number of object categories. Another fundamental component is the heatmap $V\in\mathbb{R}^{\frac{H}{S}\times\frac{W}{S}\times 9}$ of the nine perspective points projected from the vertices and center of the 3D bounding box. To associate the keypoints of one object, we also regress a local offset $V_{c}\in\mathbb{R}^{\frac{H}{S}\times\frac{W}{S}\times 18}$ from the maincenter as an indication. The keypoints of $V$ closest to the coordinates given by $V_{c}$ are taken as the group belonging to one object.
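The grouping rule in the last two sentences can be sketched as a nearest-neighbour lookup. The snippet below is a minimal illustration that assumes heatmap peaks have already been extracted per keypoint type; the function name `group_keypoints`, the argument layout, and the toy coordinates are hypothetical.

```python
import numpy as np

def group_keypoints(main_center, offsets, candidates):
    """Select, for each of the 9 keypoint types, the peak nearest to the
    location regressed from the maincenter.

    main_center: (2,) maincenter position (x, y).
    offsets:     (9, 2) offsets regressed from the maincenter (from V_c).
    candidates:  list of 9 arrays of shape (n_j, 2), the peaks of V per type.
    """
    expected = np.asarray(main_center, dtype=float) + np.asarray(offsets, dtype=float)
    grouped = np.empty((9, 2))
    for j in range(9):
        peaks = np.asarray(candidates[j], dtype=float)
        dists = np.linalg.norm(peaks - expected[j], axis=1)
        grouped[j] = peaks[np.argmin(dists)]
    return grouped

center = np.array([50.0, 40.0])
offsets = np.array([[dx, dy] for dx in (-10, 0, 10) for dy in (-5, 0, 5)], dtype=float)
# Two peaks per type: one close to this object, one belonging to another object.
candidates = [np.array([center + off + 0.4, center + off + 30.0]) for off in offsets]
grouped = group_keypoints(center, offsets, candidates)
```

The sketch does not handle a keypoint type with no detected peak; in that case one plausible fallback would be to use the regressed location itself.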

where $p^{m}$ and $p^{v}$ are the positions of the maincenter and the vertices in the original image. The coordinates of the vertices are regressed with an L1 loss:

Finally, we define the multi-task loss for keypoint detection as:

We empirically set $\omega_{main}=1,\omega_{kpver}=1,\omega_{dim}=1,\omega_{ori}=0.5,\omega_{dis}=0.1,\omega_{off}^{m}=0.5$ and $\omega_{off}^{v}=0.5$ in our experiments.
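As an illustration of how these weights combine, the sketch below assumes that the multi-task loss is a plain weighted sum of the individual terms. Only the weights listed above are included; any term whose weight is not given in the text is omitted, and the dictionary keys and the function name `total_loss` are hypothetical.

```python
# Loss weights as reported in the text; the key names are illustrative.
LOSS_WEIGHTS = {
    "main": 1.0,    # maincenter heatmap
    "kpver": 1.0,   # vertex keypoint heatmap
    "dim": 1.0,     # dimension
    "ori": 0.5,     # orientation
    "dis": 0.1,     # distance
    "off_m": 0.5,   # maincenter offset
    "off_v": 0.5,   # vertex offset
}

def total_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of scalar loss terms keyed like `weights`."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)

# With every term equal to 1.0, the total equals the sum of the weights.
example_terms = {name: 1.0 for name in LOSS_WEIGHTS}
example_total = total_loss(example_terms)
```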

3D Bounding Box Estimate

Consider an image $I$ in which a set of $i=1...N$ objects are represented by 9 keypoints and other optional priors, given by our keypoint detection network. We denote these keypoints as $\widehat{kp}_{ij}$ for $j\in 1...9$, the dimensions as $\widehat{D}_{i}$, the orientation as $\hat{\theta}_{i}$, and the distance as $\widehat{Z}_{i}$. The corresponding 3D bounding box $B_{i}$ is defined by its rotation $R_{i}(\theta)$, position $T_{i}=[T_{i}^{x},T_{i}^{y},T_{i}^{z}]^{T}$, and dimensions $D_{i}=[h_{i},w_{i},l_{i}]^{T}$. Our goal is to estimate the 3D bounding box $B_{i}$ whose projected center and 3D vertices best fit the corresponding 2D keypoints $\widehat{kp}_{ij}$ in image space. This can be solved by minimizing the reprojection error between the 3D keypoints and the 2D keypoints. We formulate this error, together with the other prior errors, as a nonlinear least squares optimization problem:

where $e_{cp}(..)$, $e_{d}(..)$, and $e_{r}(..)$ are the measurement errors of the camera-point, dimension prior, and orientation prior, respectively. We set $\omega_{d}=1$ and $\omega_{r}=1$ in our experiments. $\Sigma$ is the covariance matrix of the keypoint projection error. It is the confidence extracted from the heatmap corresponding to the keypoints:

In the rest of this section, we first define these error terms and then introduce the way to optimize the formulation.

Camera-Point. Following prior work, we define the homogeneous coordinates of the eight vertices and the 3D center as:

Given the camera intrinsic matrix $K$, the projection of these 3D points into image coordinates is:

where $\xi\in\mathfrak{se}_{3}$ and $\exp$ maps $\mathfrak{se}_{3}$ to $SE_{3}$. The projected coordinates should closely match the 2D keypoints detected by the detection network. Therefore, the camera-point error is defined as:

Minimizing the camera-point error requires the Jacobian in $\mathfrak{se}_{3}$ space, which is given by:

where $P'=[X',Y',Z']^{T}=\left(\exp(\xi^{\wedge})P\right)_{1:3}$.

Dimension-Prior: $e_{d}$ is simply defined as:

Rotation-Prior: We define $e_{r}$ in $SE_{3}$ space and use $\log$ to map the error into its tangent vector space:

These multivariate equations can be solved with the Gauss-Newton or Levenberg-Marquardt algorithm in the g2o library. A good initialization is mandatory for this optimization strategy. We adopt the prior information generated by the keypoint detection network as the initial value, which is very important for improving the detection speed.
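The optimization described in this section can be illustrated with a small NumPy sketch. It is not the g2o implementation and does not use the analytic $\mathfrak{se}_{3}$ Jacobian. As simplifying assumptions, the rotation is restricted to a yaw angle about the camera Y axis, the Jacobian is approximated by finite differences, the keypoint confidences are used directly as per-keypoint weights, and a basic Levenberg-Marquardt loop stands in for the library solver. The vertex ordering, the function names, and the intrinsic matrix values are hypothetical.

```python
import numpy as np

def box_keypoints(theta, T, D):
    """Return the 8 vertices and the center of a box as a (9, 3) array."""
    h, w, l = D
    # Assumed layout: signs of the half-extents along x (length), y (height),
    # and z (width); the last column is the box center.
    sx = np.array([1, 1, -1, -1, 1, 1, -1, -1, 0])
    sy = np.array([1, 1, 1, 1, -1, -1, -1, -1, 0])
    sz = np.array([1, -1, -1, 1, 1, -1, -1, 1, 0])
    local = np.vstack([sx * l / 2.0, sy * h / 2.0, sz * w / 2.0])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # yaw about Y
    return (R @ local).T + np.asarray(T, dtype=float)

def project(K, points):
    """Pinhole projection of (n, 3) camera-frame points to (n, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def residuals(p, K, kp2d, conf, D_prior, theta_prior, w_d, w_r):
    """Stack reprojection, dimension-prior, and orientation-prior errors."""
    theta, T, D = p[0], p[1:4], p[4:7]
    e_cp = (project(K, box_keypoints(theta, T, D)) - kp2d) * np.sqrt(conf)[:, None]
    e_d = np.sqrt(w_d) * (D - D_prior)
    diff = theta - theta_prior
    e_r = np.sqrt(w_r) * np.array([np.arctan2(np.sin(diff), np.cos(diff))])
    return np.concatenate([e_cp.ravel(), e_d, e_r])

def fit_box(K, kp2d, conf, init, D_prior, theta_prior, w_d=1.0, w_r=1.0, iters=50):
    """Levenberg-Marquardt over p = [theta, Tx, Ty, Tz, h, w, l]."""
    args = (K, kp2d, conf, D_prior, theta_prior, w_d, w_r)
    p = np.asarray(init, dtype=float).copy()
    r, lam, eps = residuals(p, *args), 1e-3, 1e-6
    for _ in range(iters):
        J = np.empty((r.size, p.size))
        for k in range(p.size):  # forward-difference Jacobian
            step = np.zeros_like(p)
            step[k] = eps
            J[:, k] = (residuals(p + step, *args) - r) / eps
        delta = np.linalg.solve(J.T @ J + lam * np.eye(p.size), -J.T @ r)
        r_new = residuals(p + delta, *args)
        if r_new @ r_new < r @ r:  # accept the step and relax the damping
            p, r, lam = p + delta, r_new, lam * 0.5
            if np.linalg.norm(delta) < 1e-10:
                break
        else:                      # reject the step and increase the damping
            lam *= 10.0
    return p

# Synthetic check: project a known box, then recover it from a perturbed start.
K = np.array([[720.0, 0.0, 620.0], [0.0, 720.0, 190.0], [0.0, 0.0, 1.0]])
truth = np.array([0.3, 1.5, 1.0, 20.0, 1.5, 1.6, 3.9])  # theta, T, D
kp2d = project(K, box_keypoints(truth[0], truth[1:4], truth[4:7]))
init = truth + np.array([-0.1, -0.5, 0.2, -2.0, 0.1, 0.1, 0.1])
estimate = fit_box(K, kp2d, np.ones(9), init, D_prior=truth[4:7], theta_prior=truth[0])
```

In this toy setup the dimension prior is what resolves the scale ambiguity of a single view, since scaling the box and its distance together leaves the projected keypoints unchanged. Starting from the network's own predictions plays the role of `init` here.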

Experiments

We evaluate our method on the KITTI 3D detection benchmark, which has a total of 7481 training images and 7518 test images. Following two previously used protocols, we split the training set into $train_{1}$, $val_{1}$ and $train_{2}$, $val_{2}$, respectively. We comprehensively compare our framework with other methods on these two validation sets as well as the test set.

Comparison with Other Methods

To fully evaluate the performance of our keypoint-based method, we report three official KITTI evaluation metrics for each task: average precision for 3D intersection-over-union ($AP_{3D}$), average precision for Bird's Eye View ($AP_{BEV}$), and Average Orientation Similarity (AOS) if a 2D bounding box is available. We evaluate our method at three difficulty settings: easy, moderate, and hard, according to the object's occlusion, truncation, and height in image space.

$\bm{AP_{3D}}$ and $\bm{AP_{BEV}}$. We compare our method with current image-based SOTA approaches and also provide a comparison of running time. However, it is not realistic to list the running times of all previous methods because most of them do not report their efficiency. The $AP_{3D}$, $AP_{BEV}$, and running time results are shown in Tables 2 and 3. With ResNet-18 as the backbone, our method achieves the best speed while its accuracy outperforms most image-only methods. In particular, it is more than 100 times faster than Mono3D while outperforming it by over 10% in both $AP_{BEV}$ and $AP_{3D}$ across all datasets. In addition, our ResNet-18 model is more than 75 times faster than 3DOP, which takes stereo images as input, while having comparable accuracy. With DLA-34 as the backbone, our method achieves the best accuracy while retaining relatively good speed. It is about 3 times faster than the recently proposed M3D-RPN while improving on most of the metrics. Note that comparing our method with all of these approaches is unfair because most of them rely on extra stand-alone networks or data in addition to monocular images. Nevertheless, we achieve the best speed with better performance.

Results on the KITTI testing set. We also evaluate our results on the KITTI testing set, as shown in Table 4. More details can be found on the KITTI website http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d.

Qualitative Results

Fig. 4 shows some qualitative results of our method. We visualize the outputs of the keypoint detection network, the outputs of the geometric constraint module, and the BEV images. The 3D boxes projected onto the image demonstrate that our method can handle crowded and truncated objects. The BEV images show that our method localizes objects accurately in different scenes.

Ablation Study

Effect of Optional Components. Four optional components are employed to enhance our method: dimension, orientation, distance, and keypoint offset. We experiment with different combinations to demonstrate their effect on 3D detection. The results are shown in Table 5; we train our network with the DLA-34 backbone and evaluate it using $AP_{3D}$ and $AP_{BEV}$. The combination of dimension, orientation, distance, and keypoint offset achieves the best accuracy and also runs faster. This is because we take the output predicted by our network as the initial value of the geometric optimization module, which reduces the search space of the gradient descent method.

Effect of Keypoint FPN. We propose the keypoint FPN (KFPN) as a strategy to improve the performance of multi-scale keypoint detection. To better understand its effect, we compare $AP_{3D}$ and $AP_{BEV}$ with and without KFPN. The details are shown in Table 6: using KFPN yields improvements across all sets with no significant change in time consumption.

2D Detection and Orientation. Although our focus is on 3D detection, we also compare the performance of our method in 2D detection and orientation evaluation. We report the AOS and AP with an IoU threshold of 0.7 for comparison. The results are shown in Table 7. Deep3DBox trains MS-CNN on KITTI to produce 2D bounding boxes and adopts VGG16 for orientation prediction, which gives it the highest accuracy. Although Deep3DBox takes advantage of a better 2D detector, our $AP_{3D}$ outperforms it by about 20% on the moderate sets, which emphasizes the importance of customizing the network specifically for 3D detection. Another interesting finding is that the 2D accuracy of the back-projected 3D results is better than that of the direct prediction, because our method can infer the occluded area of the object.
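One plausible way to obtain a 2D box from a back-projected 3D result is sketched below. It assumes the 2D box is the axis-aligned extent of the eight projected vertices, clipped to the image; the text does not spell out this step, so the function name `box2d_from_projection`, the clipping rule, and the toy coordinates are assumptions.

```python
import numpy as np

def box2d_from_projection(corners_2d, image_size):
    """Axis-aligned 2D box (x_min, y_min, x_max, y_max) enclosing the
    projected vertices, clipped to an image of size (width, height)."""
    pts = np.asarray(corners_2d, dtype=float)
    width, height = image_size
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return (
        float(np.clip(x_min, 0.0, width - 1.0)),
        float(np.clip(y_min, 0.0, height - 1.0)),
        float(np.clip(x_max, 0.0, width - 1.0)),
        float(np.clip(y_max, 0.0, height - 1.0)),
    )

# A box truncated by the left image border: some vertices project to x < 0.
corners = [[-40, 150], [120, 160], [200, 140], [60, 130],
           [-35, 260], [125, 280], [205, 250], [65, 235]]
box = box2d_from_projection(corners, image_size=(1242, 375))
```

Because the 3D box is complete even when part of the object is hidden, the resulting 2D extent covers the occluded region as well as the visible one.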

Conclusion

In this paper, we have proposed a faster and more accurate monocular 3D object detection method for autonomous driving scenarios. We reformulate 3D detection as a keypoint detection problem and show how to recover the 3D bounding box using keypoints and geometric constraints. We specially customize the point detection network for 3D detection so that it outputs the keypoints of the 3D box and other prior information about the object using only images. Our geometry module formulates these priors as easy-to-optimize loss functions. Our approach generates stable and accurate 3D bounding boxes without stand-alone networks or additional annotation, while achieving real-time running speed.