Spherical Transformer for LiDAR-based 3D Recognition

Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, Jiaya Jia

Introduction

Nowadays, point clouds can be easily collected by LiDAR sensors. They are extensively used in various industrial applications, such as autonomous driving and robotics. In contrast to 2D images where pixels are arranged densely and regularly, LiDAR point clouds possess the varying-sparsity property — points near the LiDAR are quite dense, while points far away from the sensor are much sparser, as shown in Fig. 2 (a).

However, most existing work does not specifically consider the varying-sparsity point distribution of outdoor LiDAR point clouds. These methods inherit designs from 2D CNNs or 3D indoor scenarios, and apply local operators (e.g., SparseConv) uniformly at all locations. This causes inferior results for the sparse distant points. As shown in Fig. 1, although these methods perform decently on the dense close points, they struggle to handle the sparse distant points.

We note that the root cause lies in the limited receptive field. For sparse distant points, there are few surrounding neighbors. This not only results in inconclusive features, but also hinders enlarging the receptive field due to information disconnection. To verify this finding, we visualize the Effective Receptive Field (ERF) of a given feature (marked with the yellow star) in Fig. 2 (d). The ERF cannot be expanded due to disconnection, which is caused by the extreme sparsity of the distant car.

Although window self-attention, dilated self-attention, and large-kernel CNNs have been proposed to overcome the limited receptive field, these methods do not specifically handle the LiDAR point distribution. They still enlarge the receptive field by stacking local operators as before, leaving the information disconnection issue unsolved. As shown in Fig. 1, cubic self-attention brings only a limited improvement.

In this paper, we take a new direction: aggregating long-range information directly in a single operator to suit the varying-sparsity point distribution. We propose the SphereFormer module to perceive useful information from points 50+ meters away and yield a large receptive field for feature extraction. Specifically, we represent the 3D space using spherical coordinates $(r,\theta,\phi)$ with the sensor being the origin, and partition the scene into multiple non-overlapping windows. Unlike the cubic window shape, we design radial windows that are long and narrow. They are obtained by partitioning only along the $\theta$ and $\phi$ axes, as shown in Fig. 2 (b). It is noteworthy that we make it a plugin module that can be conveniently inserted into existing mainstream backbones.

The proposed module does not rely on stacking local operators to expand the receptive field, thus avoiding the disconnection issue, as shown in Fig. 2 (e). Also, it enables the sparse distant points to aggregate information from the dense-point region, which is often semantically rich. As a result, the performance on the distant points improves significantly (i.e., +17.1% mIoU), as illustrated in Fig. 1.

Moreover, to fit the long and narrow radial windows, we propose exponential splitting to obtain fine-grained relative position encoding. The radius $r$ of a radial window can be over 50 meters, which causes large splitting intervals. It thus results in coarse position encoding when converting relative positions into integer indices. Besides, to let points at varying locations treat local and global information differently, we propose dynamic feature selection to make further improvements.

In total, our contribution is three-fold.

We propose SphereFormer to directly aggregate long-range information from the dense-point region. It increases the receptive field smoothly and helps improve the performance on sparse distant points.

To accommodate the radial windows, we develop exponential splitting for relative position encoding. Our dynamic feature selection further boosts performance.

Our method achieves new state-of-the-art results on multiple benchmarks of both semantic segmentation and object detection tasks.

Related Work

Segmentation is a fundamental task for vision perception. Approaches for LiDAR-based semantic segmentation can be roughly grouped into three categories, i.e., view-based, point-based, and voxel-based methods. View-based methods either transform the LiDAR point cloud into a range view, or use a bird's-eye view (BEV) so that a 2D network can perform feature extraction. In both cases, the 3D geometric information is simplified.

Point-based methods take the point features and positions as inputs, and design a variety of operators to aggregate information from neighbors. Voxel-based solutions divide the 3D space into regular voxels and then apply sparse convolutions. Further, several methods propose various structures for improved effectiveness. All of them focus on capturing local information. We follow this line of research, and propose to directly aggregate long-range information.

Recently, RPVNet combines the three modalities by feature fusion. Furthermore, 2DPASS incorporates 2D images during training, and fuses multi-modal features. Despite the extra 2D information, the performance of these methods still lags behind ours.

Object Detection.

3D object detection frameworks can be roughly categorized into single-stage and two-stage methods. VoxelNet extracts voxel features by PointNet and applies RPN to obtain the proposals. SECOND is efficient thanks to the accelerated sparse convolutions. VoTr applies cubic window attention to voxels. LiDARMultiNet unifies semantic segmentation, panoptic segmentation, and object detection into a single multi-task network with multiple types of supervision. Our experiments are based on CenterPoint, a widely used anchor-free framework that is both effective and efficient. We aim to enhance the features of sparse distant points, and our proposed module can be conveniently inserted into existing frameworks.

2.2 Vision Transformer

Recently, Transformers have become popular in various 2D image understanding tasks. ViT tokenizes every image patch and adopts a Transformer encoder to extract features. Further, PVT presents a hierarchical structure to obtain a feature pyramid for dense prediction. It also proposes Spatial Reduction Attention to save memory. Also, Swin Transformer uses window-based attention and proposes the shifted window operation in successive Transformer blocks. Moreover, other methods propose different designs to incorporate long-range dependencies. There are also methods that apply Transformers to 3D vision, but few of them consider the point distribution of LiDAR point clouds. In our work, we utilize the varying-sparsity property, and design radial window self-attention to capture long-range information, especially for the sparse distant points.

Our Method

In this section, we first elaborate on radial window partition in Sec. 3.1. Then, we propose the improved position encoding and dynamic feature selection in Sec. 3.2 and 3.3.

To model the long-range dependency, we adopt the window-attention paradigm. However, unlike cubic window attention, we take advantage of the varying-sparsity property of LiDAR point clouds and present the SphereFormer module, as shown in Fig. 3.

Specifically, we represent LiDAR point clouds using the spherical coordinate system $(r,\theta,\phi)$ with the LiDAR sensor being the origin. We partition the 3D space along the $\theta$ and $\phi$ axes. We thus obtain a number of non-overlapping radial windows with a long and narrow 'pyramid' shape, as shown in Fig. 3. We obtain the window index for the token at $(r_{i},\theta_{i},\phi_{i})$ as

where $\Delta\theta$ and $\Delta\phi$ denote the window size corresponding to the $\theta$ and $\phi$ dimension, respectively.
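The window assignment can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes the index is obtained by flooring each angle divided by its window size, i.e. $(\lfloor\theta_{i}/\Delta\theta\rfloor,\lfloor\phi_{i}/\Delta\phi\rfloor)$, and it assumes one particular angle convention (azimuth for $\theta$, inclination from the $+z$ axis for $\phi$, both in degrees). The function names `to_spherical` and `radial_window_index` are hypothetical.

```python
import math

def to_spherical(x, y, z):
    """Convert sensor-centered Cartesian coordinates to (r, theta, phi).

    Assumed convention (not stated in the text): theta is the azimuth in
    [0, 360) degrees, phi is the inclination from the +z axis in [0, 180].
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.degrees(math.atan2(y, x)) % 360.0
    phi = math.degrees(math.acos(z / r)) if r > 0 else 0.0
    return r, theta, phi

def radial_window_index(theta, phi, d_theta=2.0, d_phi=2.0):
    """Assumed form of the window index: it depends on the angles only, never on r."""
    return (math.floor(theta / d_theta), math.floor(phi / d_phi))

# Two points along the same ray: one about 2 m away, one about 60 m away.
near = to_spherical(2.0, 0.1, -0.05)
far = to_spherical(60.0, 3.0, -1.5)
# A point in a clearly different direction.
other = to_spherical(1.0, 5.0, -0.1)

near_win = radial_window_index(near[1], near[2])
far_win = radial_window_index(far[1], far[2])
other_win = radial_window_index(other[1], other[2])
```

Because $r$ never enters the index, the near and far points on the same ray share one window, which is what lets a sparse distant token attend to dense close tokens.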

Tokens with the same window index would be assigned to the same window. The multi-head self-attention is conducted within each window independently as follows.
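As a rough sketch of attention restricted to each window, the snippet below groups tokens by window index and runs scaled dot-product attention inside each group. It is deliberately simplified and is an assumption-laden illustration, not the paper's formulation: a single head, no learned query/key/value projections (the raw features play all three roles), no relative position encoding, and a standard $1/\sqrt{d}$ scaling. The name `window_self_attention` is hypothetical.

```python
import math
from collections import defaultdict

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def window_self_attention(features, window_ids):
    """Single-head attention restricted to tokens that share a window index.

    Simplifying assumptions: queries, keys and values are the raw features
    (no learned projections) and no position encoding is added.
    """
    groups = defaultdict(list)
    for token, win in enumerate(window_ids):
        groups[win].append(token)

    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    out = [None] * len(features)
    for members in groups.values():
        for i in members:
            # Attention scores of token i against every token in its window.
            scores = [scale * sum(a * b for a, b in zip(features[i], features[j]))
                      for j in members]
            weights = softmax(scores)
            # Weighted sum of the values inside the window.
            out[i] = [sum(w * features[j][c] for w, j in zip(weights, members))
                      for c in range(dim)]
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
wins = [(1, 45), (1, 45), (39, 45)]   # tokens 0 and 1 share a window
out = window_self_attention(feats, wins)
```

Token 2 sits alone in its window, so its output equals its input, while tokens 0 and 1 mix only with each other.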

SphereFormer serves as a plugin module and can be conveniently inserted into existing mainstream models, e.g., SparseConvNet, MinkowskiNet, and local window self-attention. In this paper, we find that inserting it at the end of each stage works well, and the network structure is given in the supplementary material. The resulting model can be applied to various downstream tasks, such as semantic segmentation and object detection, with strong performance as shown in the experiments.

SphereFormer is effective for the sparse distant points to get long-range information from the dense-point region. Therefore, the sparse distant points overcome the disconnection issue, and increase the effective receptive field.

Comparison with Cylinder3D.

Although both Cylinder3D and our method use polar or spherical coordinates to match the LiDAR point distribution, there are two essential differences. First, Cylinder3D aims at a more balanced point distribution, while our target is to enlarge the receptive field smoothly and enable the sparse distant points to directly aggregate long-range information from the dense-point region. Second, Cylinder3D only replaces the cubic voxel shape with a fan-shaped one. It still uses local neighbors as before and therefore still suffers from a limited receptive field for the sparse distant points. In contrast, our method changes the way neighbors are found within a single operator (i.e., self-attention) and is not limited to local neighbors. It thus avoids information separation between near and far objects and connects them in a natural way.

3.2 Position Encoding

For the 3D point cloud network, the input features have already incorporated the absolute $xyz$ position. Therefore, there is no need to apply absolute position encoding. Also, we notice that Stratified Transformer develops the contextual relative position encoding. It splits a relative position into several discrete parts uniformly, which converts the continuous relative positions into integers to index the positional embedding tables.

This method works well with local cubic windows. But in our case, the radial window is narrow and long, and its extent along $r$ can exceed 50 meters, which causes large intervals during discretization and thus coarse-grained positional encoding. As shown in Fig. 4 (a), because of the large interval, $key_{1}$ and $key_{2}$ correspond to the same index, even though there is still a considerable distance between them.

Specifically, since the $r$ dimension covers long distances, we propose exponential splitting for the $r$ dimension as shown in Fig. 4 (b). The splitting interval grows exponentially as the index increases. In this way, the intervals near the $query$ are much smaller, and $key_{1}$ and $key_{2}$ can be assigned to different position encodings. Meanwhile, we keep uniform splitting for the $\theta$ and $\phi$ dimensions. In notation, we have a query token $q_{i}$ and a key token $k_{j}$. Their relative position $(r_{ij},\theta_{ij},\phi_{ij})$ is converted into the integer index $(\mathbf{idx}^{r}_{ij},\mathbf{idx}^{\theta}_{ij},\mathbf{idx}^{\phi}_{ij})$ as

where $a$ is a hyper-parameter that controls the starting splitting interval, and $L$ is the length of the positional embedding tables. Note that we also add $\frac{L}{2}$ to the indices to make sure they are non-negative.

The exponential splitting strategy provides smaller splitting intervals for near token pairs and larger intervals for distant ones. This enables a fine-grained position representation between near token pairs while maintaining the same number of intervals. Even though the splitting intervals become larger for distant token pairs, this solution works well since distant token pairs require less fine-grained relative position.
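The contrast between uniform and exponential splitting can be illustrated with a small sketch. The exact indexing formula is an assumption here: the code uses one form that matches the description (bin edges at $a, 2a, 4a, \dots$ on each side of zero, shifted by $\frac{L}{2}$), and the values of `a`, `table_len` and the uniform interval are illustrative only. The function names are hypothetical.

```python
import math

def uniform_index(rel, interval, table_len):
    """Uniform splitting: equal-width bins, shifted by L/2 to be non-negative."""
    idx = math.floor(rel / interval) + table_len // 2
    return min(max(idx, 0), table_len - 1)

def exponential_index(rel_r, a, table_len):
    """Exponential splitting for the r dimension (one possible form).

    Bin edges on each side of zero are a, 2a, 4a, 8a, ..., so the interval
    width doubles as the index moves away from the query.
    """
    half = table_len // 2
    mag = abs(rel_r)
    k = 0 if mag < a else math.floor(math.log2(mag / a)) + 1
    idx = half + k if rel_r >= 0 else half - 1 - k
    return min(max(idx, 0), table_len - 1)

L = 16                      # illustrative table length
key1, key2 = 1.5, 6.0       # relative radial offsets to the query, in meters

# Uniform bins of 7.5 m (about 60 m spread over L/2 bins) cannot separate them.
u1, u2 = uniform_index(key1, 7.5, L), uniform_index(key2, 7.5, L)
# Exponential bins starting at a = 1 m place them in different bins.
e1, e2 = exponential_index(key1, 1.0, L), exponential_index(key2, 1.0, L)
```

With these illustrative settings both keys fall into the same uniform bin but into different exponential bins, which mirrors the $key_{1}$ and $key_{2}$ situation described for Fig. 4.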

3.3 Dynamic Feature Selection

Point clouds scanned by LiDAR have the varying-sparsity property: close points are dense and distant points are much sparser. This property makes points at different locations perceive different amounts of local information. For example, as shown in Fig. 5, a point of the car (circled in green) near the LiDAR has rich local geometric information from its dense neighbors, which is already enough for the model to make a correct prediction; introducing more global context might even be detrimental. However, a point of the bicycle (circled in red) far away from the LiDAR lacks shape information due to the extreme sparsity and even occlusion. In that case, long-range context should be supplied as a supplement. This example shows that treating all the query points equally is not optimal. We thus propose to dynamically select local or global features to address this issue.

As shown in Fig. 6, for each token, we incorporate not only the radial contextual information, but also local neighbor communication. Specifically, input features are projected into query, key and value features as Eq. (2). Then, the first half of the heads are used for radial window self-attention, and the remaining ones are used for cubic window self-attention. After that, these two features are concatenated and then linearly projected to the final output $\mathbf{z}$ for feature fusion. It enables different points to dynamically select local or global features. Formally, the Equations (3-5) are updated to
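As a sketch of the head split described above, the update can be written as follows. The notation is introduced here for illustration and may differ from the original equations: $H$ is the number of heads, $\mathbf{q}_h, \mathbf{k}_h, \mathbf{v}_h$ are the per-head query, key and value features, $\hat{\mathbf{z}}_h$ is the output of head $h$, and $\mathbf{W}_{proj}$ is the final linear projection.

```latex
\hat{\mathbf{z}}_h =
\begin{cases}
\mathrm{Attn}^{\mathrm{radial}}(\mathbf{q}_h, \mathbf{k}_h, \mathbf{v}_h), & 1 \le h \le H/2, \\
\mathrm{Attn}^{\mathrm{cubic}}(\mathbf{q}_h, \mathbf{k}_h, \mathbf{v}_h), & H/2 < h \le H,
\end{cases}
\qquad
\mathbf{z} = \mathrm{Concat}\big(\hat{\mathbf{z}}_1, \dots, \hat{\mathbf{z}}_H\big)\, \mathbf{W}_{proj}
```

Here $\mathrm{Attn}^{\mathrm{radial}}$ and $\mathrm{Attn}^{\mathrm{cubic}}$ denote self-attention computed within the radial window and the cubic window of the token, respectively. The projection $\mathbf{W}_{proj}$ is what lets each token weight the radial and cubic halves differently.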

Experiments

In this section, we first introduce the experimental setting in Sec. 4.1. Then, we show the semantic segmentation and object detection results in Sec. 4.2 and 4.3. The ablation study and visual comparison are shown in Sec. 4.4 and 4.5. Our code and models will be made publicly available.

For semantic segmentation, we adopt the encoder-decoder structure and follow U-Net to concatenate the fine-grained encoder features in the decoder. Following prior work, we use SparseConv as our baseline model. There are a total of 5 stages, with two residual blocks at each stage. Our proposed module is stacked at the end of each encoding stage. For object detection, we adopt CenterPoint as our baseline model, where the backbone has 4 stages. Our proposed module is stacked at the end of the second and third stages. Note that our proposed module incurs negligible extra parameters, and more details are given in the supplementary material.

Datasets.

Following previous work, we evaluate methods on nuScenes, SemanticKITTI, and Waymo Open Dataset (WOD) for semantic segmentation. For object detection, we evaluate our methods on the nuScenes dataset. The details of the datasets are given in the supplementary material.

Implementation Details.

For semantic segmentation, we use 4 GeForce RTX 3090 GPUs for training. We train the models for 50 epochs with AdamW optimizer and ‘poly’ scheduler where power is set to 0.9. The learning rate and weight decay are set to $0.006$ and $0.01$ , respectively. Batch size is set to 16 on nuScenes, and 8 on both SemanticKITTI and Waymo Open Dataset. The window size is set to $[120m,2^{\circ},2^{\circ}]$ for $(r,\theta,\phi)$ on both nuScenes and SemanticKITTI, and $[80m,1.5^{\circ},1.5^{\circ}]$ on Waymo Open Dataset. During data preprocessing, we confine the input scene to the range from $[-51.2m,-51.2m,-4m]$ to $[51.2m,51.2m,2.4m]$ on SemanticKITTI and $[-75.2m,-75.2m,-2m]$ to $[75.2m,75.2m,4m]$ on Waymo. Also, we set the voxel size to $0.1m$ on both nuScenes and Waymo, and $0.05m$ on SemanticKITTI.
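For reference, the commonly used 'poly' learning-rate decay can be sketched as below, using the base learning rate and power quoted above. Whether the schedule is stepped per iteration or per epoch is not stated here, so the per-epoch stepping in the example is an assumption, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """'Poly' decay: base_lr * (1 - step / total_steps) ** power."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power

# Learning rate sampled every 10 epochs over a 50-epoch run (assumed per-epoch stepping).
schedule = [poly_lr(0.006, epoch, 50) for epoch in range(0, 51, 10)]
```

The rate starts at the base value and decays smoothly to zero at the final step.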

For object detection, we adopt the OpenPCDet codebase and follow the default CenterPoint configuration for the training hyper-parameters. We set the window size to $[120m,1.5^{\circ},1.5^{\circ}]$.

4.2 Semantic Segmentation Results

The results on the SemanticKITTI test set are shown in Table 1. Our method yields $74.8\%$ mIoU, a new state-of-the-art result. Compared to the methods based on range images and bird's-eye view (BEV), ours achieves a gain of over $20\%$ mIoU. Moreover, thanks to the capability of directly aggregating long-range information, our method significantly outperforms the models based on sparse convolution. It is also notable that our method outperforms 2DPASS, which uses extra 2D images in training, by $1.9\%$ mIoU.

In Tables 2 and 3, we also show the semantic segmentation results on the nuScenes test and val sets, respectively. Our method consistently outperforms others by a large margin, and achieves 1st place on the benchmark. It is worth noting that our method is purely based on LiDAR data, yet it works even better than approaches that use additional 2D information.

Moreover, we demonstrate the semantic segmentation results on Waymo Open Dataset val set in Table 4. Our model outperforms the baseline model with a substantial gap of $3.3\%$ mIoU. Also, it is worth noting that our method achieves a $9.3\%$ mIoU performance gain for the far points, i.e., the sparse distant points.

4.3 Object Detection Results

Our method also achieves strong performance in object detection. As shown in Table 8, our method outperforms other published methods on the nuScenes test set, and ranks 3rd on the LiDAR-only benchmark. This shows that directly aggregating long-range information is also beneficial for object detection. It also demonstrates the capability of our method to generalize to instance-level tasks.

4.4 Ablation Study

To verify the effectiveness of each component, we conduct an extensive ablation study and list the results in Table 5. Experiment I (Exp. I for short) is our baseline model of SparseConv. Unless otherwise specified, we train the models on the nuScenes train set and evaluate on the nuScenes val set for the ablation study. To comprehensively reveal the effect, we also report the performance at different distances, i.e., close ( $\leq 20m$ ), medium ( $>20m$ & $\leq 50m$ ), and far ( $>50m$ ) distances.
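The distance-wise evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the distance is taken as the 3D Euclidean distance to the sensor at the origin (the text does not say whether height is included), and mIoU is averaged over the classes that occur in each bucket of this toy sample rather than accumulated over a whole dataset. All function names are hypothetical.

```python
import math

def distance_bucket(x, y, z):
    """Bucket a point by its distance to the sensor (assumed at the origin)."""
    d = math.sqrt(x * x + y * y + z * z)
    if d <= 20.0:
        return "close"
    if d <= 50.0:
        return "medium"
    return "far"

def miou(preds, labels, num_classes):
    """Mean IoU over classes that occur in the predictions or the labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, l in zip(preds, labels) if p == c and l == c)
        union = sum(1 for p, l in zip(preds, labels) if p == c or l == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

def miou_by_distance(points, preds, labels, num_classes):
    """Split points into close / medium / far and compute mIoU per bucket."""
    buckets = {"close": ([], []), "medium": ([], []), "far": ([], [])}
    for xyz, p, l in zip(points, preds, labels):
        bucket_preds, bucket_labels = buckets[distance_bucket(*xyz)]
        bucket_preds.append(p)
        bucket_labels.append(l)
    return {name: miou(ps, ls, num_classes) for name, (ps, ls) in buckets.items()}

# Toy example: six points on the x axis with two classes.
points = [(5, 0, 0), (10, 0, 0), (30, 0, 0), (40, 0, 0), (60, 0, 0), (70, 0, 0)]
labels = [0, 1, 0, 1, 0, 1]
preds = [0, 1, 0, 0, 1, 1]
result = miou_by_distance(points, preds, labels, num_classes=2)
```

In this toy sample the close points are all predicted correctly, while the medium and far buckets each contain one error, so their mIoU is lower.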

By comparing Experiments I and II in Table 5, we can conclude that the radial window shape is beneficial. Further, the improvement stems mainly from better handling the medium and far points, where we yield $5.67\%$ and $13.39\%$ mIoU performance gain, respectively. This result exactly verifies the benefit of aggregating long-range information with the radial window shape.

Moreover, we also compare the radial window shape with the cubic one proposed in prior work. As shown in Table 6, the radial window shape considerably outperforms the cubic one.

Besides, we investigate the effect of window size as shown in Table 7. Setting it too small may make it hard to capture meaningful information, while setting it too large may increase the optimization difficulty.

Exponential Splitting.

Compared to Exp. IV, Exp. V improves mIoU by $1.36\%$, which shows the effectiveness of exponential splitting. Moreover, a consistent conclusion can be drawn from Experiments II and III, where we observe $3.88\%$ and $4.43\%$ more mIoU for the medium and far points, respectively. Also, we notice that with exponential splitting, the close, medium, and far points are all handled better.

Dynamic Feature Selection.

From the comparison between Experiments III and V, we note that dynamic feature selection brings a $0.8\%$ mIoU performance gain. Interestingly, the gain mainly comes from the close points, which indicates that the close points may not rely much on global information, since the dense local information is already enough for correct predictions. It also shows that points at varying locations should be treated differently. Moreover, the comparison between Exp. II and IV leads to a consistent conclusion. Although the performance on medium and far points decreases slightly, the overall mIoU still increases, since there are far fewer medium and far points than close points.

4.5 Visual Comparison

As shown in Fig. 7, we visually compare the baseline model (i.e., SparseConv) and ours. It visually indicates that with our proposed module, more sparse distant objects are recognized, which are highlighted with cyan boxes. More examples are given in the supplementary material.

Conclusion

We have studied and addressed the varying-sparsity LiDAR point distribution. We proposed SphereFormer to enable the sparse distant points to directly aggregate information from the close ones. We designed radial window self-attention, which enlarges the receptive field so that distant points can interact with the dense close ones. Also, we presented exponential splitting to yield more detailed position encoding. Dynamically selecting local or global features is also helpful. Our method demonstrates strong performance, ranking 1st on both the nuScenes and SemanticKITTI semantic segmentation benchmarks and 3rd on the nuScenes object detection benchmark. It shows a new way to further enhance 3D visual understanding. Our limitations are discussed in the supplementary material.