Voxel Transformer for 3D Object Detection

Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, Chunjing Xu

Introduction

3D object detection has received increasing attention in autonomous driving and robotics. Detecting 3D objects from point clouds remains challenging to the research community, mainly because point clouds are naturally sparse and unstructured. Voxel-based detectors transform irregular point clouds into regular voxel-grids and show superior performance in this task. In this paper, we propose Voxel Transformer (VoTr), an effective Transformer-based backbone that can be applied in most voxel-based detectors to further enhance detection performance.

Previous approaches can be divided into two branches. Point-based approaches directly operate and generate 3D bounding boxes on point clouds. Those approaches generally apply point operators to extract features directly from point clouds, but suffer from the sparse and non-uniform point distribution and the time-consuming process of sampling and searching for neighboring points. Alternatively, voxel-based approaches first rasterize point clouds into voxels and apply 3D convolutional networks to extract voxel features, and then voxels are transformed into a Bird-Eye-View (BEV) feature map and 3D boxes are generated on the BEV map. Compared with the point-based methods which heavily rely on time-consuming point operators, voxel-based approaches are more efficient with sparse convolutions, and can achieve state-of-the-art detection performance.

The 3D sparse convolutional network is a crucial component in most voxel-based detection models. Despite its efficiency, a 3D convolutional backbone has a limited receptive field and therefore cannot capture rich context information, which hampers the detection of 3D objects that have only a few voxels. For instance, with a commonly-used 3D convolutional backbone and a voxel size of $(0.05m,0.05m,0.1m)$ on the KITTI dataset, the maximum receptive field in the last layer is only $(3.65m,3.65m,7.3m)$, which can hardly cover a car longer than $4m$. Enlarging the receptive field is also intractable. The maximum theoretical receptive field of each voxel is roughly proportional to the product of the voxel size $V$, the kernel size $K$, the downsample stride $S$, and the layer number $L$. Enlarging $V$ leads to a high quantization error of point clouds. Increasing $K$ leads to cubic growth of the convolved features. Increasing $S$ leads to a low-resolution BEV map, which is detrimental to box prediction, and increasing $L$ adds substantial computational overhead. Thus it is computationally expensive to obtain large receptive fields with 3D convolutional backbones. Given that a large receptive field is much needed for detecting 3D objects, which are naturally sparse and incomplete, a new architecture should be designed to encode richer context information than the convolutional backbone.
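The receptive-field figure quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the layer layout (one stride-1 layer followed by three stages of one stride-2 and two stride-1 layers, all with kernel size 3) is an assumption chosen to be consistent with the 3 sparse and 6 submanifold blocks described later, and the same layout is assumed along every axis.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input voxels, per axis) of stacked convs.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # current field of view and spacing between output cells
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Assumed (hypothetical) backbone layout: one stride-1 layer, then three
# stages of [stride-2 conv, stride-1 conv, stride-1 conv], all with kernel 3.
layers = [(3, 1)] + [(3, 2), (3, 1), (3, 1)] * 3

rf_voxels = receptive_field(layers)
voxel_size = (0.05, 0.05, 0.1)  # metres, as quoted for KITTI
rf_metres = tuple(round(rf_voxels * v, 2) for v in voxel_size)
print(rf_voxels, rf_metres)  # 73 (3.65, 3.65, 7.3)
```

Under this assumed layout the field spans 73 input voxels per axis, which matches the quoted $(3.65m,3.65m,7.3m)$. The loop also makes the trade-off visible: the field grows only linearly in the kernel size and layer count unless the stride, and hence the loss of resolution, grows too.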

Recent advances in 2D object classification, detection, and segmentation show that the Transformer is a more effective architecture than convolutional neural networks, mainly because long-range relationships between pixels can be built by self-attention in the Transformer modules. However, directly applying standard Transformer modules to voxels is infeasible, for two main reasons: 1) Non-empty voxels are sparsely distributed in a voxel-grid. Unlike pixels, which are densely placed on an image plane, non-empty voxels account for only a small proportion of all voxels, e.g., they normally occupy less than $0.1\%$ of the total voxel space on the Waymo Open dataset. Thus, instead of performing self-attention on the whole voxel-grid, special operations should be designed to attend only to the non-empty voxels efficiently. 2) The number of non-empty voxels in a scene is still large, e.g., nearly $90k$ non-empty voxels are generated per frame on the Waymo Open dataset. Applying fully-connected self-attention as in the standard Transformer is therefore computationally prohibitive. New methods are thus needed to enlarge the attention range while keeping the number of attending voxels for each query small.
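To give a rough sense of scale for the second point, the sketch below counts query-key pairs for fully-connected attention versus a fixed per-query budget. The budget of 48 attending voxels is borrowed from the implementation details later in the paper, and the 4-byte score size is an assumption made only for this estimate.

```python
n_sparse = 90_000    # non-empty voxels per frame (figure quoted above)
budget = 48          # per-query attending voxels (value from the implementation details)

full_pairs = n_sparse * n_sparse     # every voxel attends to every voxel
sparse_pairs = n_sparse * budget     # every voxel attends to `budget` voxels

bytes_per_score = 4                  # assumed float32 score, single head
print(f"full:   {full_pairs:,} pairs, {full_pairs * bytes_per_score / 2**30:.1f} GiB")
print(f"sparse: {sparse_pairs:,} pairs, {sparse_pairs * bytes_per_score / 2**20:.1f} MiB")
print(f"reduction: {full_pairs // sparse_pairs}x")
```

Even for a single head, the dense score matrix alone would take tens of GiB per frame, while the budgeted variant stays in the MiB range.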

To this end, we propose Voxel Transformer (VoTr), a Transformer-based 3D backbone that can be applied to voxels efficiently and can serve as a better substitute for conventional 3D convolutional backbones. To handle the sparsity of non-empty voxels effectively, we propose the sparse voxel module and the submanifold voxel module as the basic building blocks of VoTr. The submanifold voxel modules operate strictly on the non-empty voxels to retain the original 3D geometric structure, while the sparse voxel modules can output features at empty locations, which is more flexible and can further enlarge the non-empty voxel space. To resolve the problem that non-empty voxels are too numerous for self-attention, we further propose two attention mechanisms, Local Attention and Dilated Attention, for multi-head attention in the sparse and submanifold voxel modules. Local Attention focuses on the neighboring region to preserve detailed information. Dilated Attention obtains a large attention range with only a few attending voxels by gradually increasing the search step. To further accelerate the querying process for Local and Dilated Attention, we propose Fast Voxel Query, which uses a GPU-based hash table to efficiently store and look up the non-empty voxels. Combining all the above components, VoTr significantly boosts detection performance over the convolutional baselines while maintaining computational efficiency.

Our main contributions can be summarized as follows: 1) We propose Voxel Transformer, the first Transformer-based 3D backbone for voxel-based 3D detectors. 2) We propose the sparse and submanifold voxel module to handle the sparsity characteristic of voxels, and we further propose special attention mechanisms and Fast Voxel Query for efficient computation. 3) Our VoTr consistently outperforms the convolutional baselines and achieves the state-of-the-art performance with $74.95\%$ LEVEL_1 mAP for vehicle and $82.09\%$ mAP for moderate car class on the Waymo dataset and the KITTI dataset respectively.

Related Work

3D object detection from point clouds. 3D object detectors can be divided into $2$ streams: point-based and voxel-based. Point-based detectors operate directly on raw point clouds to generate 3D boxes. F-PointNet is a pioneering work that utilizes frustums for proposal generation. PointRCNN generates 3D proposals from the foreground points in a bottom-up manner. 3DSSD introduces a new sampling strategy for point clouds. Voxel-based detectors transform point clouds into regular voxel-grids and then apply 3D and 2D convolutional networks to generate 3D proposals. VoxelNet utilizes a 3D CNN to extract voxel features from a dense grid. SECOND proposes 3D sparse convolutions to efficiently extract voxel features. HVNet designs a convolutional network that leverages a hybrid voxel representation. PV-RCNN uses keypoints to extract voxel features for box refinement. Point-based approaches suffer from the time-consuming process of sampling and aggregating features from irregular points, while voxel-based methods are more efficient owing to the regular structure of voxels. Our Voxel Transformer can be plugged into most voxel-based detectors to further enhance detection performance while maintaining computational efficiency.

Transformers in computer vision. The Transformer introduces a fully attentional framework for machine translation. Recently, Transformer-based architectures have surpassed convolutional architectures and shown superior performance in image classification, detection, and segmentation. Vision Transformer splits an image into patches and feeds the patches into a Transformer for image classification. DETR utilizes a Transformer-based backbone and a set-based loss for object detection. SETR applies progressive upsampling on a Transformer-based backbone for semantic segmentation. MaX-DeepLab utilizes a mask Transformer for panoptic segmentation. Transformer-based architectures are also used on 3D point clouds. Point Transformer designs a novel point operator for point cloud classification and segmentation. Pointformer introduces attentional operators to extract point features for 3D object detection. Our Voxel Transformer extends the idea of Transformers on images and proposes a novel method to apply the Transformer to sparse voxels. Compared with point-based Transformers, Voxel Transformer benefits from the efficiency of regular voxel-grids and shows superior performance in 3D object detection.

Voxel Transformer

In this section, we present Voxel Transformer (VoTr), a Transformer-based 3D backbone that can be applied in most voxel-based 3D detectors. VoTr can perform multi-head attention at both empty and non-empty voxel positions through the sparse and submanifold voxel modules, and long-range relationships between voxels can be constructed by efficient attention mechanisms. We further propose Fast Voxel Query to accelerate the voxel querying process in multi-head attention. We detail the design of each component in the following sections.

In this section, we introduce the overall architecture of Voxel Transformer. Similar to the conventional convolutional architecture, which contains $3$ sparse convolutional blocks and $6$ submanifold convolutional blocks, our VoTr is composed of a series of sparse and submanifold voxel modules, as shown in Figure 2. In particular, we design $3$ sparse voxel modules, which downsample the voxel-grids $3$ times and output features at voxel positions and resolutions different from those of their inputs. Each sparse voxel module is followed by $2$ submanifold voxel modules, which keep the input and output non-empty locations the same, to maintain the original 3D structure while enlarging the receptive fields. Multi-head attention is performed in all those modules, and the attending voxels for each querying voxel are determined by two special attention mechanisms, Local Attention and Dilated Attention, which capture diverse context in different ranges. Fast Voxel Query is further proposed to accelerate the search for non-empty voxels in multi-head attention.

Voxel features extracted by our proposed VoTr are then projected to a BEV feature map to generate 3D proposals, and the voxels and corresponding features can also be utilized in the second stage for RoI refinement. We note that our proposed VoTr is flexible and can be applied in most voxel-based detection frameworks.

Voxel Transformer Module

In this section, we present the design of sparse and submanifold voxel modules. The major difference between sparse and submanifold voxel modules is that submanifold voxel modules strictly operate on the non-empty voxels and extract features only at the non-empty locations, which maintains the geometric structures of 3D scenes, while sparse voxel modules can extract voxel features at the empty locations, which shows more flexibility and can expand the original non-empty voxel space according to needs. We first introduce self-attention on sparse voxels and then detail the design of sparse and submanifold voxel modules.

Self-attention on sparse voxels. We define a dense voxel-grid, which has $N_{dense}$ voxels in total, to rasterize the whole 3D scene. In practice we only maintain those non-empty voxels with a $N_{sparse}\times 3$ integer indices array $\mathcal{V}$ and $N_{sparse}\times d$ corresponding feature array $\mathcal{F}$ for efficient computation, where $N_{sparse}$ is the number of non-empty voxels and $N_{sparse}\ll N_{dense}$ . In each sparse and submanifold voxel module, multi-head self-attention is utilized to build long-range relationships among non-empty voxels. Specifically, given a querying voxel $i$ , the attention range $\Omega(i)\subseteq\mathcal{V}$ is first determined by attention mechanisms, and then we perform multi-head attention on the attending voxels $j\in\Omega(i)$ to obtain the feature $f^{attend}_{i}$ . Let $f_{i},f_{j}\in\mathcal{F}$ be the features of querying and attending voxels respectively, and $v_{i},v_{j}\in\mathcal{V}$ be the integer indices of querying and attending voxels. We first transform the indices $v_{i},v_{j}$ to the corresponding 3D coordinates of the real voxel centers $p_{i},p_{j}$ by $p=r\cdot(v+0.5)$ , where $r$ is the voxel size. Then for a single head, we compute the query embedding $Q_{i}$ , key embedding $K_{j}$ and value embedding $V_{j}$ as:

$$Q_{i}=W_{q}f_{i},\quad K_{j}=W_{k}f_{j}+E_{pos},\quad V_{j}=W_{v}f_{j}+E_{pos},\tag{1}$$

where $W_{q},W_{k},W_{v}$ are the linear projection of query, key and value respectively, and the positional encoding $E_{pos}$ can be calculated by:

$$E_{pos}=W_{pos}(p_{i}-p_{j}),\tag{2}$$

where $W_{pos}$ is a linear projection applied to the relative coordinates.

Thus self-attention on voxels can be formulated as:

$$f^{attend}_{i}=\sum_{j\in\Omega(i)}\sigma\!\left(\frac{Q_{i}K_{j}}{\sqrt{d}}\right)\cdot V_{j},\tag{3}$$

where $\sigma(\cdot)$ is the softmax normalization function. We note that self-attention on voxels is a natural 3D extension of standard 2D self-attention with sparse inputs and relative coordinates as positional embeddings.
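As an illustration of this formulation, the following pure-Python sketch computes single-head attention for one querying voxel. It assumes that the positional encoding is a linear map of the relative center offset $p_{i}-p_{j}$ added to both keys and values, and that scores are scaled by $\sqrt{d}$; the helper names (`matvec`, `voxel_center`, `attend`) and the toy weights are hypothetical, and the attention range is assumed to be non-empty.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def voxel_center(v, r):
    """Map an integer voxel index v to its real-world center, p = r * (v + 0.5)."""
    return [ri * (vi + 0.5) for vi, ri in zip(v, r)]


def attend(f_i, v_i, neighbours, r, W_q, W_k, W_v, W_pos):
    """Single-head attention for one query; `neighbours` holds (v_j, f_j) pairs."""
    p_i = voxel_center(v_i, r)
    q = matvec(W_q, f_i)
    d = len(q)
    scores, values = [], []
    for v_j, f_j in neighbours:
        p_j = voxel_center(v_j, r)
        # Assumed positional encoding: linear map of the relative offset.
        e_pos = matvec(W_pos, [a - b for a, b in zip(p_i, p_j)])
        k = [a + b for a, b in zip(matvec(W_k, f_j), e_pos)]
        v = [a + b for a, b in zip(matvec(W_v, f_j), e_pos)]
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        values.append(v)
    # Softmax over the attending voxels only (numerically stabilised).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * val[c] for w, val in zip(weights, values)) for c in range(d)]


# Toy run with identity projections and a zero positional term.
identity = [[1.0, 0.0], [0.0, 1.0]]
W_pos = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
out = attend(
    f_i=[1.0, 0.0], v_i=(0, 0, 0),
    neighbours=[((0, 0, 0), [1.0, 0.0]), ((1, 0, 0), [0.0, 1.0])],
    r=(0.05, 0.05, 0.1),
    W_q=identity, W_k=identity, W_v=identity, W_pos=W_pos,
)
print(out)
```

In the toy run the query matches the first neighbour's key, so the first output channel receives the larger attention weight. Only the voxels in $\Omega(i)$ enter the softmax, which is what keeps the cost per query bounded.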

Submanifold voxel module. The outputs of submanifold voxel modules are at exactly the same locations as the input non-empty voxels, which means the module preserves the original 3D structure of its inputs. In the submanifold voxel module, two sub-layers are designed to capture the long-range context information for each non-empty voxel. The first sub-layer is the self-attention layer that combines all the attention mechanisms, and the second is a simple feed-forward layer, as in the standard Transformer. Residual connections are employed around the sub-layers. The differences between the standard Transformer module and our proposed module are threefold: 1) We append an additional linear projection layer after the feed-forward layer for channel adjustment of voxel features. 2) We replace layer normalization with batch normalization. 3) We remove all the dropout layers in the module, since the number of attending voxels is already small and randomly rejecting some of those voxels hampers the learning process.

Sparse voxel module. Unlike the submanifold voxel module, which operates only on the non-empty voxels, the sparse voxel module can extract features at empty locations, which expands the original non-empty space and is typically required in the voxel downsampling process. Since there is no feature $f_{i}$ available for an empty voxel, we cannot obtain the query embedding $Q_{i}$ from $f_{i}$. To resolve this problem, we approximate $Q_{i}$ at the empty location from the attending features $f_{j}$:

$$Q_{i}=\mathop{\mathcal{A}}_{j\in\Omega(i)}(f_{j}),\tag{4}$$

where the function $\mathcal{A}$ can be interpolation, pooling, etc. In this paper, we choose $\mathcal{A}$ as the maxpooling of all the attending features $f_{j}$ . We also use Eq.3 to compute multi-head attention. The architecture of sparse voxel modules is similar to submanifold voxel modules, except that we remove the first residual connection around the self-attention layer, since the inputs and outputs are no longer the same.
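A minimal sketch of the max-pooling choice for $\mathcal{A}$ follows; it takes the channel-wise maximum over the attending features. The function name and the toy features are hypothetical, and any projection applied afterwards is left out.

```python
def approximate_query_feature(attending_features):
    """Channel-wise max over the attending features f_j (max-pooling as A)."""
    if not attending_features:
        raise ValueError("an empty location needs at least one attending voxel")
    return [max(channel) for channel in zip(*attending_features)]


f_j = [[0.2, -1.0, 0.5], [0.7, 0.1, -0.3], [0.4, 0.0, 0.9]]
pooled = approximate_query_feature(f_j)
print(pooled)  # [0.7, 0.1, 0.9]
```

Max-pooling is order-independent and needs no learned parameters, which makes it a convenient stand-in for the missing feature at an empty location.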

Efficient Attention Mechanism

In this section, we delve into the design of the attention range $\Omega(i)$, which determines the attending voxels for each query $i$ and is a crucial factor in self-attention on sparse voxels. $\Omega(i)$ should satisfy the following requirements: 1) $\Omega(i)$ should cover the neighboring voxels to retain the fine-grained 3D structure. 2) $\Omega(i)$ should reach as far as possible to obtain rich context information. 3) The number of attending voxels in $\Omega(i)$ should be small enough, e.g., less than $50$, to avoid heavy computational overhead. To meet these requirements, we take inspiration from prior work and propose two attention mechanisms, Local Attention and Dilated Attention, to control the attention range $\Omega(i)$. The designs of the two mechanisms are as follows.

Local Attention. We define $\varnothing(start,end,stride)$ as a function that returns the non-empty indices in a closed set $[start,end]$ with the step as $stride$ . In the 3D cases, for example, $\varnothing((0,0,0),(1,1,1),(1,1,1))$ searches the set $\{(0,0,0),(0,0,1),(0,1,0),\cdots,(1,1,1)\}$ with $8$ indices for the non-empty indices. In Local Attention, given a querying voxel $v_{i}$ , the local attention range $\Omega_{local}(i)$ parameterized by $R_{local}$ can be formulated as:

$$\Omega_{local}(i)=\varnothing(v_{i}-R_{local},\,v_{i}+R_{local},\,(1,1,1)),\tag{5}$$

where $R_{local}=(1,1,1)$ in our experiments. Local Attention fixes the $stride$ as $(1,1,1)$ to exploit every non-empty voxel inside the local range $R_{local}$ , so that the fine-grained structures can be retained by Local Attention.
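The range function and Local Attention can be sketched as follows. This is a plain-Python illustration, not the paper's GPU implementation: the non-empty voxels are held in a Python set, and the names `nonempty_in_range` and `local_attention_range` as well as the toy scene are hypothetical.

```python
from itertools import product


def nonempty_in_range(occupied, start, end, stride):
    """Return the occupied indices on the grid start, start+stride, ... <= end."""
    axes = [range(s, e + 1, t) for s, e, t in zip(start, end, stride)]
    return [idx for idx in product(*axes) if idx in occupied]


def local_attention_range(occupied, v_i, r_local=(1, 1, 1)):
    """Occupied voxels within +/- r_local of the query index, with unit stride."""
    start = tuple(v - r for v, r in zip(v_i, r_local))
    end = tuple(v + r for v, r in zip(v_i, r_local))
    return nonempty_in_range(occupied, start, end, (1, 1, 1))


# Toy scene: a hypothetical set of non-empty voxel indices.
occupied = {(5, 5, 5), (5, 6, 5), (6, 6, 6), (8, 5, 5)}
local = local_attention_range(occupied, (5, 5, 5))
print(local)  # [(5, 5, 5), (5, 6, 5), (6, 6, 6)]
```

With $R_{local}=(1,1,1)$ the search visits 27 candidate positions around the query and keeps only those that are occupied, so the voxel at `(8, 5, 5)` is never considered.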

Dilated Attention. The attention range $\Omega_{dilated}(i)$ of Dilated Attention is defined by a parameter list $R_{dilated}$ : $[(R^{(1)}_{start},R^{(1)}_{end},R^{(1)}_{stride}),\cdots,(R^{(M)}_{start},R^{(M)}_{end},R^{(M)}_{stride})]$ , and the formulation of $\Omega_{dilated}(i)$ can be represented as:

$$\Omega_{dilated}(i)=\bigcup_{m=1}^{M}\left[\varnothing\big(v_{i}-R^{(m)}_{end},\,v_{i}+R^{(m)}_{end},\,R^{(m)}_{stride}\big)\setminus\varnothing\big(v_{i}-R^{(m)}_{start},\,v_{i}+R^{(m)}_{start},\,R^{(m)}_{stride}\big)\right],\tag{6}$$

where $\setminus$ is the set subtraction operator and the function $\bigcup$ takes the union of all the non-empty voxel sets. With $m$ indexing the entries of $R_{dilated}$, we note that $R^{(m)}_{start}<R^{(m)}_{end}\leq R^{(m+1)}_{start}$ and $R^{(m)}_{stride}<R^{(m+1)}_{stride}$, which means that we gradually enlarge the querying step $R^{(m)}_{stride}$ when searching for non-empty voxels that are more distant. As a result, we preserve more attending voxels near the query while still maintaining some attending voxels that are far away, and $R^{(m)}_{stride}>(1,1,1)$ significantly reduces the searching time and memory cost. With a carefully designed parameter list $R_{dilated}$, the attention range can reach more than $15m$ while the number of attending voxels for each querying voxel is still kept below $50$. It is worth noting that Local Attention can be viewed as a special case of Dilated Attention with $R_{start}=(0,0,0)$, $R_{end}=(1,1,1)$ and $R_{stride}=(1,1,1)$.
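The shell-by-shell construction of Dilated Attention can be sketched in the same style. The sketch assumes each shell is the strided outer box of radius $R_{end}$ minus the strided inner box of radius $R_{start}$, both centered on the query; the parameter list `R_DILATED` and the toy scene are hypothetical and are not the configuration used in the paper, and no cap on the number of attending voxels is applied.

```python
from itertools import product


def nonempty_in_range(occupied, start, end, stride):
    """Occupied indices on the grid start, start+stride, ... <= end (per axis)."""
    axes = [range(s, e + 1, t) for s, e, t in zip(start, end, stride)]
    return {idx for idx in product(*axes) if idx in occupied}


def box(occupied, v_i, radius, stride):
    """Occupied voxels in the strided box v_i - radius .. v_i + radius."""
    start = tuple(v - r for v, r in zip(v_i, radius))
    end = tuple(v + r for v, r in zip(v_i, radius))
    return nonempty_in_range(occupied, start, end, stride)


def dilated_attention_range(occupied, v_i, r_dilated):
    """Union over shells: strided outer box minus strided inner box."""
    attending = set()
    for r_start, r_end, r_stride in r_dilated:
        outer = box(occupied, v_i, r_end, r_stride)
        inner = box(occupied, v_i, r_start, r_stride)
        attending |= outer - inner
    return attending


# Hypothetical parameter list: the stride grows as the shells move outwards.
R_DILATED = [
    ((2, 2, 2), (4, 4, 4), (2, 2, 2)),
    ((4, 4, 4), (12, 12, 12), (4, 4, 4)),
]
occupied = {(0, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0),
            (8, 4, 0), (12, 12, 12), (20, 0, 0)}
dilated = dilated_attention_range(occupied, (0, 0, 0), R_DILATED)
print(sorted(dilated))  # [(4, 0, 0), (8, 4, 0), (12, 12, 12)]
```

In the toy scene, `(2, 0, 0)` falls inside the first inner box and is left to Local Attention, `(3, 0, 0)` is skipped by the stride of 2, and `(20, 0, 0)` lies beyond the outermost shell. The larger stride in the second shell is what keeps the number of candidate positions from growing with the cube of the range.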

Fast Voxel Query

An illustration of Fast Voxel Query is shown in Figure 4. Fast Voxel Query consists of four major steps: 1) we build a hash-table on GPUs which stores the hashed non-empty integer voxel indices $v_{j}$ as keys, and the corresponding indices $j$ for the array $\mathcal{V}$ as values. 2) For each query $i$ , we apply Local Attention and Dilated Attention to obtain the attending voxel indices $v_{j}\in\Omega(i)$ . 3) We look up the respective indices $j$ for $\mathcal{V}$ using the hashed key $v_{j}$ in the hash table, and $v_{j}$ is judged as an empty voxel and rejected if the hash value returns $-1$ . 4) We can finally gather the attending voxel indices $v_{j}$ and features $f_{j}$ from $\mathcal{V}$ and $\mathcal{F}$ with $j$ for voxel self-attention. We note that all the steps can be conducted in parallel on GPUs by assigning each querying voxel $i$ a separate CUDA thread, and in the third step, the lookup process for each query only costs $O(N_{\Omega})$ time complexity, where $N_{\Omega}$ is the number of voxels in $\Omega(i)$ and $N_{\Omega}\ll N_{sparse}$ .

To leverage the spatial locality of GPU memory, we build the hash table as an $N_{hash}\times 2$ tensor, where $N_{hash}$ is the hash table size and $N_{sparse}<N_{hash}\ll N_{dense}$. The first column of the $N_{hash}\times 2$ hash table stores the keys and the second column stores the values. We use linear probing to resolve collisions in the hash table, and atomic operations to avoid data races among CUDA threads. Compared with conventional methods, our proposed Fast Voxel Query is efficient in both time and space, and it remarkably accelerates the computation of voxel self-attention.
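The table layout and lookup logic can be illustrated with a serial CPU sketch. This is a stand-in for the GPU version rather than a reproduction of it: keys are assumed to be flattened voxel indices, the hash is a plain modulo, each of the $N_{hash}$ entries stores one (key, value) pair, and the atomic operations needed for parallel insertion are not shown. All function names are hypothetical.

```python
EMPTY = -1


def linear_index(v, grid_shape):
    """Flatten a 3D integer voxel index into a single non-negative key."""
    (x, y, z), (_, ny, nz) = v, grid_shape
    return (x * ny + y) * nz + z


def build_table(voxel_indices, grid_shape, n_hash):
    """Build an n_hash x 2 table of [key, j] pairs, j being the row in V."""
    if n_hash <= len(voxel_indices):
        raise ValueError("n_hash must exceed the number of non-empty voxels")
    table = [[EMPTY, EMPTY] for _ in range(n_hash)]
    for j, v in enumerate(voxel_indices):
        key = linear_index(v, grid_shape)
        slot = key % n_hash
        while table[slot][0] != EMPTY:       # linear probing on collision
            slot = (slot + 1) % n_hash
        table[slot] = [key, j]
    return table


def lookup(table, v, grid_shape):
    """Return the row id j of voxel v, or -1 if v is empty or out of bounds."""
    if any(c < 0 or c >= n for c, n in zip(v, grid_shape)):
        return EMPTY
    key = linear_index(v, grid_shape)
    slot = key % len(table)
    while table[slot][0] != EMPTY:
        if table[slot][0] == key:
            return table[slot][1]
        slot = (slot + 1) % len(table)
    return EMPTY


V = [(0, 0, 0), (1, 2, 3), (4, 4, 4)]        # toy non-empty voxel indices
GRID = (8, 8, 8)
table = build_table(V, GRID, n_hash=4)       # small table to force a collision
print(lookup(table, (4, 4, 4), GRID))        # 2, found after one probe step
print(lookup(table, (2, 2, 2), GRID))        # -1, empty voxel
```

Requiring the table to be larger than the number of stored voxels guarantees that every probe sequence reaches an empty slot, so a lookup for an empty voxel always terminates with $-1$.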

Experiments

In this section, we evaluate Voxel Transformer on the commonly used Waymo Open dataset and the KITTI dataset. We first introduce the experimental settings and two frameworks based on VoTr, and then compare our approach with previous state-of-the-art methods on the Waymo Open dataset and the KITTI dataset. Finally, we conduct ablation studies to evaluate the effects of different configurations.

Waymo Open Dataset. The Waymo Open Dataset contains $1000$ sequences in total, including $798$ sequences (around $158k$ point cloud samples) in the training set and 202 sequences (around $40k$ point cloud samples) in the validation set. The official evaluation metrics are standard 3D mean Average Precision (mAP) and mAP weighted by heading accuracy (mAPH). Both of the two metrics are based on an IoU threshold of 0.7 for vehicles and 0.5 for other categories. The testing samples are split in two ways. The first way is based on the distances of objects to the sensor: $0-30m$ , $30-50m$ and $>50m$ . The second way is according to the difficulty levels: LEVEL_1 for boxes with more than five LiDAR points and LEVEL_2 for boxes with at least one LiDAR point.

KITTI Dataset. The KITTI dataset contains $7481$ training samples and $7518$ test samples, and the training samples are further divided into the train split ( $3712$ samples) and the $val$ split ( $3769$ samples). The official evaluation metric is mean Average Precision (mAP) with a rotated IoU threshold 0.7 for cars. On the test set mAP is calculated with $40$ recall positions by the official server. The results on the val set are calculated with 11 recall positions for a fair comparison with other approaches.
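The difference between the 11-position and 40-position protocols can be made concrete with a short sketch of interpolated average precision. It assumes the usual definitions, i.e. recall positions $\{0, 0.1, \ldots, 1\}$ for 11 points and $\{1/40, 2/40, \ldots, 1\}$ for 40 points, and it is not the official evaluation code; the precision-recall curve below is made up for illustration.

```python
def interpolated_ap(recalls, precisions, num_positions):
    """Mean over sampled recall positions r of the max precision at recall >= r."""
    if num_positions == 11:
        positions = [k / 10 for k in range(11)]       # includes recall 0
    elif num_positions == 40:
        positions = [k / 40 for k in range(1, 41)]    # excludes recall 0
    else:
        raise ValueError("expected 11 or 40 recall positions")
    total = 0.0
    for r in positions:
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(reachable, default=0.0)
    return total / len(positions)


# Made-up precision-recall points for illustration only.
recalls = [0.2, 0.4, 0.6]
precisions = [1.0, 0.8, 0.6]
ap11 = interpolated_ap(recalls, precisions, 11)
ap40 = interpolated_ap(recalls, precisions, 40)
print(f"AP|R11 = {ap11:.4f}, AP|R40 = {ap40:.4f}")
```

The same curve yields different values under the two protocols, which is why val-split numbers computed with 11 positions should not be compared directly with test-set numbers computed with 40 positions.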

We provide $2$ architectures based on Voxel Transformer: VoTr-SSD is a single-stage voxel-based detector with VoTr as the backbone. VoTr-TSD is a two-stage voxel-based detector based on VoTr.

VoTr-SSD. Voxel Transformer for Single-Stage Detector is built on the commonly-used single-stage framework SECOND. In particular, we replace the 3D sparse convolutional backbone of SECOND with our proposed Voxel Transformer, and we still use the anchor-based assignment following prior work. Other modules and configurations are kept the same for a fair comparison.

VoTr-TSD. Voxel Transformer for Two-Stage Detector is built upon the state-of-the-art two-stage framework PV-RCNN. Specifically, we replace the 3D convolutional backbone in the first stage of PV-RCNN with our proposed Voxel Transformer, and we use keypoints to extract voxel features from Voxel Transformer for the second-stage RoI refinement. Other modules and configurations are kept the same for a fair comparison.

Implementation Details. VoTr-SSD and VoTr-TSD share the same architecture on the KITTI and Waymo dataset. The input non-empty voxel coordinates are first transformed into $16$ -channel initial features by a linear projection layer, and then the initial features are fed into VoTr for voxel feature extraction. The channels of voxel features are lifted up to $32$ and $64$ in the first and second sparse voxel module respectively, and other modules keep the input and output channels the same. Thus the final output features have $64$ channels. The number of total attending voxels is set to $48$ for each querying voxel, and the number of heads is set to $4$ for multi-head attention. The GPU hash table size $N_{hash}$ is set to $400k$ . We would like readers to refer to supplementary materials for the detailed design of attention mechanisms.

Training and Inference Details. Voxel Transformer is trained along with the whole framework with the ADAM optimizer. On the KITTI dataset, VoTr-SSD and VoTr-TSD are trained with batch sizes of $32$ and $16$ respectively, and with a learning rate of $0.01$ for $80$ epochs on $8$ V100 GPUs. On the Waymo Open dataset, we uniformly sample $20\%$ of the frames for training and use the full validation set for evaluation, following prior work. VoTr-SSD and VoTr-TSD are trained with a batch size of $16$ and a learning rate of $0.003$ for $60$ and $80$ epochs respectively on $8$ V100 GPUs. The cosine annealing strategy is adopted for learning rate decay. Data augmentations and other configurations are kept the same as in the corresponding baselines.
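For readers unfamiliar with cosine annealing, a minimal sketch of the textbook schedule is given below. Whether the actual training uses warm-up, per-iteration or per-epoch updates, or a non-zero final rate is not stated here, so those details are assumptions; the function name is hypothetical.

```python
import math


def cosine_annealed_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# KITTI setting from the text: learning rate 0.01 over 80 epochs
# (per-epoch updates assumed for illustration).
for epoch in (0, 20, 40, 60, 80):
    print(epoch, f"{cosine_annealed_lr(epoch, 80, 0.01):.5f}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.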

Comparisons on the Waymo Open Dataset

We conduct experiments on the Waymo Open dataset to verify the effectiveness of our proposed VoTr. As is shown in Table 1, simply switching from the 3D convolutional backbone to VoTr gives $1.05\%$ and $3.26\%$ LEVEL_1 mAP improvements for SECOND and PV-RCNN respectively. In the range of 30-50m and 50m-Inf, VoTr-SSD gives $1.42\%$ and $1.72\%$ improvements, and VoTr-TSD gives $3.37\%$ and $4.83\%$ improvements on LEVEL_1 mAP. The significant performance gains in the far away area show the importance of large context information obtained by VoTr to 3D object detection.

Comparisons on the KITTI Dataset

We conduct experiments on the KITTI dataset to validate the efficacy of VoTr. As shown in Table 2, VoTr-SSD and VoTr-TSD bring $2.29\%$ and $0.66\%$ mAP improvements on the moderate car class on the KITTI test set. For the hard car class, VoTr-TSD achieves $79.14\%$ mAP, outperforming all previous approaches by a large margin, which indicates that the long-range relationships between voxels captured by VoTr are significant for detecting 3D objects that have only a few points. The results on the val split in Table 3 show that VoTr-SSD and VoTr-TSD outperform the baseline methods by $1.79\%$ and $0.35\%$ mAP for the moderate car class. Observations on the KITTI dataset are consistent with those on the Waymo Open dataset.

Ablation Studies

Effects of Local and Dilated Attention. Table 4 indicates that Dilated Attention guarantees larger receptive fields for each voxel and brings $2.79\%$ moderate mAP gain compared to using only Local Attention.

Effects of dropout in Voxel Transformer. Table 5 details the influence of different dropout rates on VoTr. We found that adding dropout layers in each module is detrimental to the detection performance. The mAP drops by $8.52\%$ with a dropout probability of $0.3$.

Effects of the number of attending voxels. Table 6 shows that increasing the number of attending voxels from $24$ to $48$ boosts the performance by $1.19\%$ , which indicates that a voxel can obtain richer context information by involving more attending voxels in multi-head attention.

Comparisons on the model parameters. Table 7 shows that replacing the 3D convolutional backbone with VoTr reduces the model parameters by $0.5M$ , mainly because the modules in VoTr only contain linear projection layers, which have only a few parameters, while 3D convolutional kernels typically contain a large number of parameters.

Comparisons on the inference speed. Table 8 shows that with carefully designed attention mechanisms and Fast Voxel Query, VoTr maintains computational efficiency, running at $14.65$ Hz for the single-stage detector. Replacing the convolutional backbone with VoTr adds only about $20$ ms of latency per frame.
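The two figures above can be related by simple arithmetic. The sketch below converts the reported running speed to per-frame latency and derives the baseline latency implied by the "about $20$ ms" overhead; the baseline values are an estimate obtained from these two numbers, not a reported measurement.

```python
votr_hz = 14.65                      # reported VoTr-SSD running speed
added_ms = 20.0                      # "about 20 ms" extra latency, taken at face value

votr_ms = 1000.0 / votr_hz           # per-frame latency with VoTr
baseline_ms = votr_ms - added_ms     # implied latency of the convolutional baseline
baseline_hz = 1000.0 / baseline_ms

print(f"VoTr-SSD: {votr_ms:.1f} ms/frame")
print(f"implied baseline: {baseline_ms:.1f} ms/frame, {baseline_hz:.1f} Hz")
```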

Visualization of attention weights. Figure 5 shows that a querying voxel can dynamically select the features of attending voxels in a very large context range, which benefits the detection of objects that are sparse and incomplete.

Conclusion

We present Voxel Transformer, a general Transformer-based 3D backbone that can be applied in most voxel-based 3D detectors. VoTr consists of a series of sparse and submanifold voxel modules, and can perform self-attention on sparse voxels efficiently with special attention mechanisms and Fast Voxel Query. For future work, we plan to explore more Transformer-based architectures on 3D detection.

Appendix A Architecture

The detailed architecture of Voxel Transformer is shown in Figure 6. Input voxels are downsampled $3$ times with stride $2$ by the $3$ sparse voxel modules. Figure 7 illustrates the voxel downsampling process. We note that the downsampled voxel centers no longer overlap with the original voxel centers, since the voxel size is doubled during downsampling. Thus sparse voxel modules are needed to perform voxel attention at those empty locations.
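The non-overlap of voxel centers follows from the center mapping $p=r\cdot(v+0.5)$ used in the main text. A short derivation along one axis is given below, assuming that stride-2 downsampling doubles the voxel size from $r$ to $2r$ and keeps the grid origin fixed; the subscripts "fine" and "coarse" are introduced here only for illustration.

```latex
\begin{aligned}
p_{\text{fine}}(v) &= r\,(v + 0.5), && v \in \mathbb{Z},\\
p_{\text{coarse}}(v') &= 2r\,(v' + 0.5) = r\,(2v' + 1), && v' \in \mathbb{Z},\\
p_{\text{coarse}}(v') = p_{\text{fine}}(v) &\iff v = 2v' + 0.5 .
\end{aligned}
```

The last condition has no integer solution, so a downsampled center never coincides with an original center. Instead, $r\,(2v'+1)$ is the shared boundary between the original voxels $2v'$ and $2v'+1$, which is why the downsampled location has no input feature of its own.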

Appendix B Dilated Attention

In this section, we provide the configurations of Dilated Attention in Table 9. We use the same configurations for both VoTr-TSD and VoTr-SSD on the KITTI dataset. With carefully designed Dilated Attention, a single self-attention layer can obtain large context information with only a few attending voxels.

Appendix C Qualitative Results

In this section, we provide the qualitative results on the KITTI dataset in Figure 8, and the Waymo Open dataset in Figure 9. With rich context information captured by self-attention, our Voxel Transformer is able to detect those 3D objects that are sparse and incomplete effectively.