Embracing Single Stride 3D Object Detector with Sparse Transformer

Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Hang Zhao, Feng Wang, Naiyan Wang, Zhaoxiang Zhang

Introduction

LiDAR-based 3D object detection for autonomous driving has benefited from progress in image-based object detection. Mainstream 3D detectors quantize the 3D space into a stack of pseudo-images in Bird's Eye View (BEV), which makes it convenient to borrow advanced techniques from their 2D counterparts. Many works follow this paradigm and achieve competitive performance. However, 3D and 2D spaces differ intrinsically in their relative object scales: objects in 3D space have much smaller relative sizes (see Fig. 2). For example, in Waymo Open Dataset, the perception range is usually $150m\times 150m$, while a vehicle is only about $4m$ long and a pedestrian occupies as little as $1m$ in length. Such a pedestrian is equivalent to an object of $8\times 8$ pixels in a $1200\times 1200$ image, which makes detection at such a tiny scale one of the challenges in 3D object detection.

In contrast to the challenge of small scales in 3D space, 2D detectors have to handle objects with varied scales. Fig. 2 shows that the scales of objects in 2D images exhibit a long-tail distribution, while in 3D space they are quite concentrated due to the non-projective transformation used in voxelization. To handle the varied scales, 2D detectors usually build multi-scale features with a series of downsampling and upsampling operations. Such multi-scale architectures are also widely inherited in 3D detectors (see Fig. 1). Since objects in 3D detection are usually tiny and no large objects exist, a question naturally arises: do we really need downsampling in 3D object detectors?

With this question in mind, we make an exploratory attempt at a single-stride architecture with no downsampling operators. The single-stride network maintains the original resolution throughout the network. However, it is challenging to make such a design feasible. Discarding the downsampling operators leads to two issues: 1) an increase in computation cost; 2) a decrease in receptive field. The former constrains the applicability to real-time systems and the latter hinders the capability of object recognition. For the issue of computation, sparse convolution seems to be a solution, but the sparse connectivity between voxels makes the decrease of receptive field even more severe (see Table 7; we provide a clear illustration of this in our supplementary materials). For the issue of receptive field, we experimentally show that some commonly adopted techniques do not meet our needs (see Table 1): the dilated convolution is not friendly to small objects, and the larger kernel leads to unaffordable computational overhead in the single-stride architecture. We therefore face a dilemma: it is difficult to design a convolutional network that simultaneously satisfies three requirements, namely a single-stride architecture, a sufficient receptive field, and an acceptable computation cost.

These difficulties naturally lead us to think outside the CNN paradigm, and the attention mechanism emerges as a better option for two reasons: 1) An attention-based model is better at capturing large context and building a sufficient receptive field. 2) Owing to its capability of modeling dynamic data, an attention-based model fits well with the sparse voxelized representation of point clouds, where only a small portion of voxels are occupied. This property underpins the efficiency of our single-stride network. Although the attention mechanism is efficient on sparse data, computing attention on a global scale is still unaffordable and undesirable. So we partition the voxelized 3D space into many local regions and apply self-attention inside each of them. This local attention mechanism, named Sparse Regional Attention (SRA), enjoys the best of both worlds. By stacking SRA layers, we make the single-stride network feasible and obtain a transformer-style network, called Single-stride Sparse Transformer (SST). Extensive experiments are conducted on the large-scale Waymo Open Dataset. We summarize our contributions as follows:

We rethink the architecture of current mainstream LiDAR-based 3D detectors. With pilot experiments, we point out that the network stride is an overlooked design factor for LiDAR-based 3D detectors.

We propose the Single-stride Sparse Transformer (SST). With its local attention mechanism and capability of handling sparse data, we overcome receptive field shrinkage in the single-stride setting and avoid heavy computational overhead.

Our method achieves state-of-the-art performance on the large-scale Waymo Open Dataset. Thanks to the characteristic of single stride, our method obtains exciting results on tiny objects like pedestrians (83.8 LEVEL_1 AP on the validation split).

Related Work

3D LiDAR-based Detection There are three major representations for point cloud learning in autonomous driving: point-based, voxel-based, and range view. Point-based representations, backed by the PointNet family, are widely adopted for feature learning on small regions of irregular points. Voxel-based representations combined with convolutions are the most popular treatment. As explored in several recent works, range view enjoys computational advantages over voxels, especially for long-range LiDAR sensors. Some hybrid approaches investigate how to combine different types of representations.

Transformers in Visual Recognition The success of transformer architectures in NLP and speech recognition has inspired much work investigating the power of attention in visual recognition. The pioneering work ViT splits an image into patches, and then feeds sequences of patches to multiple transformer blocks for image classification. DeiT explores training strategies for data-efficient learning of vision transformers. Swin-Transformer exploits the power of local attention to build high-performance transformer-based image backbones. Several works have investigated the use of transformers for point cloud perception. Some of them focus on indoor scenes. For autonomous driving scenarios, Pointformer proposes a point-based local and global attention module directly operating on point clouds. In addition, VoTr uses a local self-attention module to replace the sparse convolution for voxel processing, where each voxel serves as a query and attends to its neighbor voxels.

Small Object Detection Small object detection is a challenging track in 2D object detection. The mainstream of current methods focuses on increasing the resolution of the input and output features, while none of them gives up the multi-stride architectures. Some other methods adopt the scale-aware training and strong data augmentations . To the best of our knowledge, there is no method specialized for small object detection in 3D space.

Discussion of Network Stride

The stride of a network is a simple but critical aspect in the architecture design. Some previous works in 3D detection have found that the performance can benefit from the recovery of output resolution by upsampling. However, they do not delve into this phenomenon. Therefore, we conduct a simple pilot study to reveal the influence of network stride on 3D detectors and motivate the design of our network.

For generality, we adopt the widely used PointPillars in MMDetection3D as our base model. The experiments are conducted on Waymo Open Dataset. We uniformly sample 20% of the training data (32K frames), a setting adopted in prior work for efficient validation, and adopt the $1\times$ schedule (12 epochs).

Based on the standard PointPillars model $D_{2}$ , we extend it to three more variants: $D_{3}$ , $D_{1}$ , and $D_{0}$ , and they only differ in the network stride. From $D_{3}$ to $D_{0}$ , the set of strides of their four stages for each model are $\{1,2,4,8\}$ , $\{1,2,4,4\}$ , $\{1,2,2,2\}$ and $\{1,1,1,1\}$ , respectively. Since the output feature maps of the four stages will be upsampled to the original resolution by an FPN-like module, our modification does not change the resolution of feature maps in the detection head. Except for the resolution of feature maps, all the four models have the same hyper-parameters. To reduce memory overhead, we change the filter number from 256 to 128 in convolution layers.
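
To make the four variants concrete, the following sketch lists the side length of the BEV feature map produced by each stage under each stride set. It assumes that the listed strides are cumulative with respect to the input grid, and it uses a hypothetical grid side of 468 cells (roughly $150m$ at a $0.32m$ pillar size); the names `STRIDES`, `stage_sizes`, and `GRID_SIDE` are illustrative only.

```python
import math

# Stride sets of the four stages, as listed in the text.
# Assumption: each stride is cumulative with respect to the input BEV grid.
STRIDES = {
    "D3": (1, 2, 4, 8),
    "D2": (1, 2, 4, 4),
    "D1": (1, 2, 2, 2),
    "D0": (1, 1, 1, 1),
}

# Hypothetical grid side: about 150 m divided by a 0.32 m pillar.
GRID_SIDE = 468


def stage_sizes(grid_side, strides):
    """Side length of the square feature map produced by each stage."""
    return [math.ceil(grid_side / s) for s in strides]


for name, strides in STRIDES.items():
    print(name, stage_sizes(GRID_SIDE, strides))
```

Only $D_{0}$ keeps every stage at the input resolution, which is what makes it the single-stride variant.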

The main results are shown in Table 1. The performance of all three classes improves from $D_{3}$ to $D_{1}$, with a significant boost from $D_{2}$ to $D_{1}$. The improvement from $D_{3}$ to $D_{1}$ supports our motivation that smaller strides are better for 3D detection.

However, from $D_{1}$ to $D_{0}$, the vehicle performance drops significantly, while the pedestrian performance drops only slightly and the cyclist performance keeps going up. We conjecture that the limited receptive field of $D_{0}$ is what hinders further improvement from $D_{1}$ to $D_{0}$: pedestrians and cyclists are smaller than vehicles, so they are less affected by it.

To verify our conjecture, we add two more variants: $D_{0}^{dilation}$ and $D_{0}^{5\times 5}$. $D_{0}^{dilation}$ adopts dilated convolutions with a dilation of 2 in the last two stages. $D_{0}^{5\times 5}$ increases the kernel size in the last two stages to $5\times 5$. Table 1 shows that dilation increases the performance of the vehicle class while decreasing the performance of pedestrian and cyclist, indicating that it indeed enlarges the receptive field but misses fine-grained details. Meanwhile, the larger kernel consistently improves the performance of all three classes but unfortunately has the highest latency. These studies support our major motivation for single-stride 3D detectors, and they also reveal another important aspect of our network design: a sufficient receptive field is crucial.

In summary, above experiments verify two motivations of 3D object detector designs:

The single stride architecture has a great potential in LiDAR-based 3D detection for autonomous driving.

The key to making the single stride architecture feasible lies in appropriately addressing the shrinkage of the receptive field and reducing computational overhead.

Methodology

So far, we know the keys to make single stride architecture feasible are sufficient receptive field and acceptable computational cost. However, as we discussed in Sec. 1, it is difficult to simultaneously satisfy the two factors with convolutional single stride architecture. So we turn to the attention mechanism in Transformer , and present our method as follows.

We build our Single-stride Sparse Transformer (SST) as in Fig. 4. SST voxelizes the point clouds and extracts voxel features following prior work. SST treats each voxel and its features as a “token.” SST first partitions the voxelized 3D space into fixed-size non-overlapping regions (Sec. 4.2). Then SST applies Sparse Regional Attention (SRA) to the voxel tokens in each region (Sec. 4.3). To handle objects that span multiple regions and to capture useful local context, we adopt Region Shift (Sec. 4.4), which is inspired by the shifted window in Swin-Transformer. The backbone preserves the number of voxels as well as their spatial locations, thus satisfying the single-stride property, and can be integrated with mainstream detection heads (Sec. 4.5).

4.2 Regional Grouping

Given the input voxel tokens, Regional Grouping divides the 3D space into non-overlapping regions, so that the self-attentions only interact with tokens coming from the same regions. The regional grouping not only maintains sufficient receptive field, but also avoids expensive computation overhead in global attentions. We illustrate it intuitively in Fig. 3. Each regional grouping divides the input tokens into groups according to their physical locations, where the tokens belonging to the same regions (green rectangles) are assigned to the same group.
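
As a minimal sketch of this grouping step, the snippet below assigns each non-empty voxel to a region by integer division of its coordinates. The function name `regional_grouping` and the representation of tokens as a list of integer `(x, y, z)` voxel coordinates are assumptions made for illustration, not the interface of the released implementation.

```python
from collections import defaultdict


def regional_grouping(coords, region_size):
    """Group token indices by the non-overlapping region that contains them.

    coords:      list of integer (x, y, z) coordinates of non-empty voxels.
    region_size: (lx, ly, lz) extent of one region, measured in voxels.
    Returns a dict mapping a region index (rx, ry, rz) to the list of token
    indices that fall inside that region.
    """
    groups = defaultdict(list)
    for idx, coord in enumerate(coords):
        key = tuple(c // l for c, l in zip(coord, region_size))
        groups[key].append(idx)
    return dict(groups)


# Four occupied pillars on a BEV grid, grouped into 12 x 12 x 1 regions.
coords = [(0, 0, 0), (11, 11, 0), (12, 0, 0), (13, 5, 0)]
print(regional_grouping(coords, (12, 12, 1)))
```

Only occupied voxels appear in the groups, so empty regions cost nothing in the subsequent attention.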

4.3 Sparse Regional Attention

Sparse Regional Attention (SRA) operates on the regional sparse sets of voxel tokens coming from regional grouping. For a group of tokens $\mathcal{F}$ and their corresponding spatial $(x,y,z)$ coordinates $\mathcal{I}$ , SRA follows conventional transformers as follows

where $\mathbf{PE}(\cdot)$ stands for the absolute positional encoding function used in , $\mathbf{MSA}(\cdot)$ denotes the Multi-head Self-Attention, and $\mathbf{LN}(\cdot)$ represents Layer Normalization. This manner of SRA well exploits the sparsity of point clouds, because it only computes the voxels with actual LiDAR points.
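
For reference, a conventional pre-norm transformer block written with the symbols above takes the following form. This is a sketch under assumptions: the placement of $\mathbf{LN}(\cdot)$ before each sub-layer, the way $\mathbf{PE}(\mathcal{I})$ enters the attention, and the feed-forward sub-layer $\mathbf{MLP}(\cdot)$ are introduced here for illustration and are not confirmed by the surrounding text.

$$
\begin{aligned}
\widetilde{\mathcal{F}} &= \mathbf{MSA}\big(\mathbf{LN}(\mathcal{F}),\ \mathbf{PE}(\mathcal{I})\big)+\mathcal{F},\\
\mathcal{F}^{\prime} &= \mathbf{MLP}\big(\mathbf{LN}(\widetilde{\mathcal{F}})\big)+\widetilde{\mathcal{F}}
\end{aligned}
$$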

Region Batching for Efficient Implementation Due to the sparsity of point clouds, the number of valid tokens in each region varies. To utilize the parallel computation of modern devices, we batch regions with similar numbers of tokens together. In practice, if a region contains $N_{token}$ tokens satisfying

$$2^{i}\leq N_{token}<2^{i+1},$$

then we pad the number of tokens to $2^{i+1}$. With padded tokens, we can divide all the regions into several batches, and then process all regions in the same batch in parallel. As the padded tokens are masked in the computation, they have no effect on the valid tokens. In this way, it is easy to implement an efficient SRA module in current popular deep learning frameworks without the engineering effort required by sparse convolution.
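
The batching rule can be sketched as follows. The sketch assumes the bucket rule $2^{i}\leq N_{token}<2^{i+1}$ with padding to $2^{i+1}$ (the treatment of exact powers of two is an assumption), and the names `padded_length`, `batch_regions`, and `pad_id` are hypothetical.

```python
from collections import defaultdict


def padded_length(n_tokens):
    """Return 2**(i+1) for the i that satisfies 2**i <= n_tokens < 2**(i+1)."""
    if n_tokens < 1:
        raise ValueError("a region must hold at least one token")
    return 1 << n_tokens.bit_length()


def batch_regions(region_tokens, pad_id=-1):
    """Bucket regions by padded length so that each bucket forms one batch.

    region_tokens: dict mapping a region key to its list of token indices.
    Returns a dict mapping a padded length to a list of
    (region key, padded token list, validity mask) tuples.
    """
    batches = defaultdict(list)
    for key, tokens in region_tokens.items():
        length = padded_length(len(tokens))
        n_pad = length - len(tokens)
        padded = list(tokens) + [pad_id] * n_pad
        # False marks a padded slot that attention has to ignore.
        mask = [True] * len(tokens) + [False] * n_pad
        batches[length].append((key, padded, mask))
    return dict(batches)


regions = {"a": [0, 1, 2], "b": [3], "c": [4, 5, 6]}
for length, batch in batch_regions(regions).items():
    print(length, [key for key, _, _ in batch])
```

Regions `a` and `c` share a bucket and can be stacked into one tensor, with the mask playing the role of a key padding mask in the attention call.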

4.4 Region Shift

Though SRA can cover a considerably large region, there are some objects inevitably truncated by the grouping. To tackle this issue and aggregate useful context, we further use Region Shift in our design, which is similar to the shifting mechanism in Swin Transformer for information communication. Supposing the size of regions in regional grouping is $(l_{x},l_{y},l_{z})$ , the Region Shift moves the original regions by $(l_{x}/2,l_{y}/2,l_{z}/2)$ and groups the tokens according to this new set of regions, as illustrated in “Shifted regional grouping” of Fig. 3.
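
The effect of the shift can be illustrated with a small sketch. It assumes integer voxel coordinates and rounds the half-region offset down, so a region that is one voxel tall is not shifted along that axis; `region_key` is a hypothetical helper.

```python
def region_key(coord, region_size, shifted=False):
    """Region index of a voxel, optionally under the half-region shift.

    coord:       integer (x, y, z) voxel coordinate.
    region_size: (lx, ly, lz) extent of one region, measured in voxels.
    shifted:     if True, offset the grid by (lx // 2, ly // 2, lz // 2).
    """
    key = []
    for c, l in zip(coord, region_size):
        offset = l // 2 if shifted else 0
        key.append((c + offset) // l)
    return tuple(key)


# Two adjacent voxels that straddle a region border along x.
size = (12, 12, 1)
a, b = (11, 0, 0), (12, 0, 0)
print(region_key(a, size), region_key(b, size))              # different regions
print(region_key(a, size, True), region_key(b, size, True))  # same region
```

Alternating the two groupings across layers lets tokens separated by a border in one layer attend to each other in the next.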

4.5 Integration with Detection

To work with existing detection heads, SST places the sparse voxel tokens back onto dense feature maps according to their spatial locations. Unoccupied locations are filled with zeros. As LiDAR only captures points on object surfaces, 3D object centers are likely to reside on empty locations with zero features, which is unfriendly to the current designs of detection heads. So we add two $3\times 3$ convolutions to fill most of the holes at the object centers.
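
The sketch below illustrates both steps on a single-channel toy grid: scattering sparse tokens onto a dense map, and a $3\times 3$ filter spreading features into an empty object center. The fixed mean filter is only a stand-in for a learned convolution, and `scatter_to_dense` and `mean_filter_3x3` are hypothetical names.

```python
def scatter_to_dense(coords, values, height, width):
    """Place sparse BEV tokens on a dense grid; empty cells stay at zero."""
    dense = [[0.0] * width for _ in range(height)]
    for (y, x), value in zip(coords, values):
        dense[y][x] = value
    return dense


def mean_filter_3x3(grid):
    """Zero-padded 3x3 mean filter, a fixed stand-in for a learned 3x3 conv."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += grid[yy][xx]
            out[y][x] = total / 9.0
    return out


# Surface points form a ring; the object center (1, 1) holds no LiDAR point.
ring = [(y, x) for y in range(3) for x in range(3) if (y, x) != (1, 1)]
dense = scatter_to_dense(ring, [1.0] * len(ring), 3, 3)
filled = mean_filter_3x3(dense)
print(dense[1][1], filled[1][1] > 0.0)
```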

As for the detection head and loss function, we adopt the same settings as PointPillars for simplicity. Specifically, we use the SSD head, the smooth L1 bounding box localization loss $\mathcal{L}_{loc}$, the classification loss $\mathcal{L}_{cls}$ in the form of focal loss, and the direction loss $\mathcal{L}_{dir}$ penalizing wrong orientations. The final loss function is given in Eq. 3, where $N_{p}$ is the number of positive samples. We leave the detailed settings to the supplementary materials.
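
Following the PointPillars formulation that this section adopts, the combined loss can be written as a weighted sum normalized by the number of positive samples. The weights $\beta_{loc}$, $\beta_{cls}$, and $\beta_{dir}$ are symbols introduced here for illustration; their values belong to the settings deferred to the supplementary materials.

$$
\mathcal{L}=\frac{1}{N_{p}}\left(\beta_{loc}\mathcal{L}_{loc}+\beta_{cls}\mathcal{L}_{cls}+\beta_{dir}\mathcal{L}_{dir}\right)
$$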

4.6 Two Stage SST

Although our main contribution lies in the design of the single stride architecture in the first stage, there is a considerable gap between single stage and two stage detectors. To match the performance of current two stage detectors, we apply LiDAR-RCNN as our second stage. LiDAR-RCNN is a lightweight second stage network consisting of a simple PointNet for feature extraction, which takes only the raw point cloud inside each proposal as input.

4.7 Discussion

Because of the distinctions between point clouds and RGB images, there are several differences in the design choices and motivations between our design and Swin-Transformer as highlighted here.

Our SST network follows the single-stride guideline, while Swin-Transformer follows the hierarchical structure with multi-stride, which uses “token merge” to increase the receptive field.

The tokens for our region-based attention scatter sparsely because of the sparsity of point clouds, while the tokens in vision transformers have dense layouts. This is one of the reasons for the efficiency of SST even in the single stride architecture.

Experiments

We conduct our experiments on Waymo Open Dataset (WOD) . The dataset contains 1150 sequences in total (more than 200K frames), 798 for training, 202 for validation and 150 for test. Each frame covers a scene with a size of $150m\times 150m$ . It is a very challenging dataset and adopted as the benchmark in many recent state-of-the-art methods.

5.2 Implementation Details

We implement our model based on the popular 3D object detection codebase – MMDetection3D, which provides standard and solid baselines. Please refer to supplementary materials for more details.

Model Setup For generality, we build our Single-stride Sparse Transformer (SST) on the basis of the popular PointPillars. We replace its backbone with 6 consecutive Sparse Regional Attention (SRA) blocks, each containing 2 attention modules, as Fig. 4 shows. All the attention modules are equipped with 8 heads, 128 input channels, and 256 hidden channels. In Regional Grouping, each region covers a volume of size $3.84m\times 3.84m\times 6m$. For the other parts, SST follows the implementation of PointPillars in MMDetection3D. We use a BEV pillar size of $0.32m\times 0.32m\times 6m$, which can be easily extended to 3D voxels with smaller heights.
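
A quick calculation relates the region and pillar sizes given above. The constant and function names in this sketch are illustrative.

```python
# Sizes stated in the text, in meters.
PILLAR_SIZE_M = (0.32, 0.32, 6.0)
REGION_SIZE_M = (3.84, 3.84, 6.0)

# Backbone layout stated in the text.
NUM_BLOCKS = 6
ATTENTION_MODULES_PER_BLOCK = 2


def region_size_in_pillars(region_m, pillar_m):
    """Number of pillars that one region spans along each axis."""
    return tuple(round(r / p) for r, p in zip(region_m, pillar_m))


region = region_size_in_pillars(REGION_SIZE_M, PILLAR_SIZE_M)
max_tokens_per_region = region[0] * region[1] * region[2]
num_attention_modules = NUM_BLOCKS * ATTENTION_MODULES_PER_BLOCK

print(region, max_tokens_per_region, num_attention_modules)
```

Each region thus spans $12\times 12\times 1$ pillars, so a region holds at most 144 tokens, and the backbone contains 12 attention modules in total.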

Model Variants We develop several variants of SST in our experiments. SST_1f: basic single-stage model using 1-frame point cloud. SST_3f: consecutive 3 frame point clouds are used as model input, and the point cloud in different frames are concatenated together after aligning the ego-pose. SST_TS_1f and SST_TS_3f: two stage model based on above models, using a standard LiDAR-RCNN for refinement.

Training Scheme We train our model for 24 epochs (2 $\times$ ) on WOD with AdamW optimizer and cosine learning rate scheduler. The maximum learning rate is $0.001$ , and the weight decay is $0.05$ .
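
For illustration, a plain cosine schedule with the stated maximum learning rate can be sketched as below. The absence of warmup, the minimum learning rate of zero, and per-epoch (rather than per-iteration) updates are assumptions of this sketch, and the weight decay is listed only for completeness.

```python
import math

MAX_LR = 0.001       # maximum learning rate stated in the text
WEIGHT_DECAY = 0.05  # stated in the text; not used by the schedule itself
EPOCHS = 24          # 2x schedule


def cosine_lr(progress, max_lr=MAX_LR, min_lr=0.0):
    """Cosine-annealed learning rate for a training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lr_at_epoch(epoch, epochs=EPOCHS):
    """Learning rate at the start of a zero-based epoch."""
    return cosine_lr(epoch / epochs)


for epoch in (0, 6, 12, 18, 23):
    print(epoch, lr_at_epoch(epoch))
```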

5.3 Comparison with State-of-the-art Detectors

We compare our SST with state-of-the-art methods in Table 2 (vehicle) and Table 3 (pedestrian). We divide current methods into the branches of one-stage and two-stage detectors for fair comparison.

Table 2 shows the results on vehicles, where our models achieve competitive performances. With a lightweight second stage for refinement, our two-stage detectors are comparable with state-of-the-art methods.

Table 3 shows the results on pedestrians. Due to their tiny size and non-rigid nature, pedestrians are more challenging to detect than vehicles. Networks are prone to confuse pedestrians with other slim objects, like poles and trees, leading to a high false positive rate. Even so, our best model outperforms all other methods in the challenging pedestrian class. SST_TS_3f is 4.4 AP ahead of the second best, RSN, with the same temporal information (3 frames). We attribute this leading performance to the single-stride characteristic of SST.

5.4 Deep Investigation of Single Stride

Single-stride models better use dense observations. First, SST has a larger advantage in short-range metrics (0m - 30m) than in long-range metrics (50m - inf): in Table 4, SST_1f outperforms its PointPillars counterpart in the short-range metric by 12.8 AP for the pedestrian class, but the margin over PointPillars is less significant in the long-range metric. Second, SST benefits more from multi-frame data. In Table 4, RSN improves by 6.4 AP in the long-range metric from RSN_1f to RSN_3f, while SST improves more significantly, by 10.4 AP, from SST_1f to SST_3f.

Does the single stride model fail on large vehicles? As smaller strides reduce the receptive fields, it would be a major concern whether our model has sufficient receptive fields for extreme cases, e.g., extremely large vehicles. We therefore divide all the vehicles into three groups according to the lengths of their ground-truth boxes, and evaluate the recalls of SST on them. Please refer to supplementary materials for the evaluation details. In Table 5, our SST outperforms the PointPillars baseline for all vehicles, even those longer than $8m$ . This supports that our attention mechanism provides proper receptive fields in the single stride architecture.

Localization quality test with stricter IoU thresholds. By preserving the original resolution, our SST is expected to localize objects more precisely. To verify this, we evaluate SST with higher 3D IoU thresholds (0.8 for vehicle, 0.6 for pedestrian). In Table 6, we compare our models with the PointPillars baseline and other models with available results, and a couple of interesting findings emerge:

Comparing MVF++ with our SST_1f on vehicles, MVF++ is slightly better than SST_1f under the normal threshold, while SST_1f is better with the stricter threshold. This suggests the single stride structure enables more precise localization of vehicles.

The 3DAL is an offboard method using all the past and future frames in a sequence (around 200 frames) and is equipped with tracking . Nonetheless, our best model SST_TS_3f surprisingly surpasses 3DAL on pedestrian on both IoU thresholds with as few as 3 frames of point clouds.

These findings suggest that the single-stride architecture is capable of better localizing objects with full and fine-grained information.

Comparison with other alternatives. There are some potential alternatives to our SST in order to preserve the input resolution. Here we make a comprehensive comparison. We first introduce these alternative models as follows. PointPillars-SS: The single stride version of PointPillars introduced in Sec. 3. SparsePillars-SS: We replace all the standard 2D convolutions in backbone of PointPillars-SS with Submanifold Sparse Convolutions . Due to the sparsity, SparsePillars-SS also faces the issue of “empty hole” (details in Sec. 4.5) as in SST, so we add two more 2D convolutions before its detection head. HRNetV2p-W18 : HRNet maintains the high resolution while building multi-scale features. We adopt the standard HRNetV2p-W18 from MMDetection for the experiment. To keep the output resolution in HRNet the same as PointPillars, we reduce the stride of the first two convolutions in HRNet from 2 to 1. All the alternatives have the same setting with SST_1f except their backbones. Table 7 shows the comparison between different models.

In Table 7, our method outperforms all other alternatives with relatively low latency. Two further points are worth noting: (1) SparsePillars-SS is much worse than other models in the vehicle class. Due to the properties of submanifold sparse convolution, this model suffers from more severe receptive field shrinkage than PointPillars-SS. For example, if a vehicle part is isolated, with all the surrounding voxels being empty, it cannot perceive information from other parts during the whole forward process. On the contrary, the attention mechanism in SST addresses this issue while maintaining sparsity. (2) HRNetV2p-W18 allocates too much computation to the high-stride (low-resolution) branches, which are not needed in 3D object detection. So the capacity of its high-resolution branch is limited, leading to its inferior performance.

5.5 Qualitative Analysis of Sparse Attention

We visualize the attention weights in Fig. 5 and list our observations as follows.

Sufficient Coverage In Fig. 5 (a) Complete Vehicle, the query token (pink dot) in the car has strong relation with all other parts of the car. In other words, this single token can effectively cover the whole car. This demonstrates that the attention mechanism is indeed effective to enlarge the receptive field.

Semantic Discrimination In Fig. 5 (b) Person near a Wall, the query token on the person builds strong dependency with other body parts, but has little relations with background points, e.g., wall. In Fig. 5 (c) Person beside a Vehicle, the pedestrian standing next to the vehicle attends only with itself. These two cases reveal that the learned sparse attention weight is discriminative between different semantic classes. This property helps distinguish pedestrians from other slim objects and reduces false positives.

Instance Discrimination In the crowded cases, such as Fig. 5 (d) Multiple Pedestrians, the query token in a person mainly focuses on the same person. Due to the high semantic similarity, it also slightly attends to other people. In Fig. 5 (e) Multiple Vehicles, the query token in the vehicle almost has no dependency on the nearby vehicles. These two cases suggest that the learned sparse attention weights are discriminative for different instances.

5.6 Hyper-parameter Ablation

Region Size We show the performance under different region sizes for Regional Grouping in Table 8. SST is in general robust to the region size and slightly better with larger regions. Especially, SST has the best performance in pedestrian detection with the largest local region size. It suggests that the local context is helpful to recognize pedestrians. For example, pedestrians are more likely to appear on the sidewalks than on vehicle lanes.

Network Depth SST is relatively shallow by design thanks to the large receptive fields from the attention mechanism. In Table 9 we show the impact of model depths on SST. In general, SST is robust to different depths, and the performance of pedestrian class is even slightly better with fewer layers. This demonstrates our method does not rely on a very large or deep model, thus it is easier to build efficient single-stride models.

Conclusion and Limitations

In this paper, we analyze the impact of the network stride on 3D object detectors for autonomous driving, and empirically show that 3D object detectors do not really need downsampling. To build a single-stride network, we adopt sparse regional attention to address the problem of insufficient receptive fields and avoid expensive computation. By stacking the sparse attention modules, we propose the Single-stride Sparse Transformer, achieving state-of-the-art performance on the Waymo Open Dataset. Due to the single stride structure, our models obtain remarkable performance on the challenging pedestrian class. Without elaborate optimization, our model uses slightly more memory than baseline models, and we will pursue a more memory-friendly model in the future. We hope our work can break the stereotype in backbone design for point cloud data, and inspire more thought on specialized architectures.

Appendices

Appendix A Submission on Test Server

Due to the submission frequency limit of the Waymo test server, we only report the results of our best model. We compare SST with the three most competitive methods and report their performances in the multi-frame setting from the official leaderboard. The results are shown in Table A and Table B. The performance of SST on vehicle class is comparable with these methods, and the performance of SST on pedestrian class significantly outperforms other methods.

Appendix B Discussion of Sparse Operations

Due to the space limit of the main paper, we leave the discussion on sparse operations in the supplementary materials. In this section, we discuss two problems for sparse operations: (1) insufficient receptive field of submanifold sparse convolution (SSC) , and (2) the difficulties of downsampling/upsampling in sparse data.

In Sec. 1 and Table 7 in our main paper, we briefly point out that the SSC-based single-stride architecture faces a severe problem of the insufficient receptive field. We demonstrate this issue here in Fig. A by comparing the behaviors of SSC and standard 2D convolution in sparse data. Both the SSC and standard convolutions have two layers with a kernel size of three. However, the SSC could not reach the voxel on the top-left corner from the voxel marked with a star, while the standard convolution is capable of doing this. This example intuitively illustrates the insufficiency of receptive fields for SSC, and we explain the reasons in detail as follows.

The SSC does not “fill” empty voxels for the sake of efficiency, which largely constrains the information communication between voxels. Under such conditions, in Fig. A (a), only one voxel (the pink one) has information communication with the one marked by a red star if the kernel size is $3\times 3$. On the contrary, Fig. A (b) shows that the 2D convolution can gradually enlarge the receptive field by involving the empty voxels in the convolution process, which is more effective for aggregating information than the SSC.
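
The difference can be reproduced with a small reachability sketch on a 2D grid. It models only which occupied voxels can influence a given voxel after a number of $3\times 3$ layers, not the feature values; the layout below is a hypothetical one in the spirit of Fig. A, and the function names are illustrative.

```python
def ssc_receptive_field(active, start, num_layers):
    """Occupied voxels that can influence `start` after stacked 3x3 SSC layers.

    Information only flows between occupied voxels, so each layer extends the
    reach by one step along chains of occupied neighbors.
    """
    reached = {start}
    for _ in range(num_layers):
        frontier = set()
        for (y, x) in reached:
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    neighbor = (y + dy, x + dx)
                    if neighbor in active:
                        frontier.add(neighbor)
        reached |= frontier
    return reached


def dense_receptive_field(active, start, num_layers):
    """Occupied voxels that can influence `start` after stacked 3x3 dense convs.

    Empty cells take part in the convolution, so every voxel within
    Chebyshev distance `num_layers` is covered.
    """
    sy, sx = start
    return {
        (y, x) for (y, x) in active
        if max(abs(y - sy), abs(x - sx)) <= num_layers
    }


# Star voxel at (2, 2), one occupied neighbor, and an isolated corner voxel.
active = {(2, 2), (2, 1), (0, 0)}
print(sorted(ssc_receptive_field(active, (2, 2), 2)))
print(sorted(dense_receptive_field(active, (2, 2), 2)))
```

With two layers, the dense convolution covers the corner voxel at `(0, 0)`, while the SSC never reaches it because no chain of occupied voxels connects it to the star voxel.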

To give an experimental illustration, we conduct experiments on the vehicle class, which requires a sufficient receptive field for detection. In Table 7 of the main paper, replacing the $3\times 3$ standard convolutions with SSC causes a significant drop in AP from 64.69 to 51.57. We further increase the receptive field by expanding the kernel size of SSC to $5\times 5$ and $7\times 7$. These improve the 3D AP from 51.57 to 55.40 and 56.77, respectively, but there is still a large gap to the variant using standard convolutions. These numbers support our analysis of the insufficient receptive fields of SSC.

B.2 Downsampling/Upsampling in Sparse Data

Although downsampling and upsampling are common in dense data, e.g., pooling in CNN, token merge in Swin-Transformer, it is non-trivial to transfer these techniques to sparse data like point clouds. A variant of SSC named Sparse Convolution (SC) follows the standard convolution to implement the downsampling and upsampling in sparse data. With such implementation, data loses sparsity rapidly and this leads to high computational overhead.

In our sparse Transformer, downsampling/upsampling by token merge also needs careful consideration. First, the downsampling operation is still an open problem for point clouds: what is the best way to merge a varied number of tokens scattered across different spatial locations? Second, the upsampling operation is also non-trivial and requires future research: how can several tokens in different locations be recovered from a single token effectively and efficiently? In developing SST, we encountered these challenges and found it difficult to offer satisfying solutions. Although we have bypassed these difficulties by adopting the single-stride architecture, we hope future research will address this downsampling/upsampling question and better utilize sparse data.

Appendix C Potential Improvements

In order to rule out unimportant factors and present a clean architecture, we only inherit the basic framework of PointPillars . So there is a large room for further performance improvements, and we list some of them as follows. We will adopt these techniques in our future work.

IoU Prediction. In detection, the classification score of a bounding box is not always consistent with the real regression quality. So many recent methods use another branch to predict the IoU between output bounding boxes and the corresponding ground-truth boxes, and use the predicted IoU to correct the classification scores.

More Powerful Second Stage. We use LiDAR-RCNN as our second stage, which is a lightweight PointNet-like module that only takes the raw point cloud as input. So it has no effect on our first stage and is convenient for our analysis of the single-stride architecture. However, its performance is inferior to some other elaborately designed RCNNs, e.g., CenterPoint, PartA2, PVRCNN, and PyramidRCNN, which reuse the features from the first stage to achieve better refinement. With point-level features interpolated from the feature maps of the first stage, SST can be equipped with most of these methods to achieve better performance.

Incorporating Advanced Techniques in Vision Transformer. We have witnessed the fast progress of vision transformers. Many advanced techniques can be borrowed to enhance the performance of SST. (1) Better efficiency: Many techniques can be adopted to improve our efficiency, for example, token selection and attention simplification. (2) Better efficacy: Some techniques can be used to make SST more effective, e.g., relative positional encoding and different attention mechanisms.

Appendix D Computational Complexity Compared with Convolutions

We investigate the computational complexity of the SST architecture and convolutional architectures. Our analyses demonstrate that SST has a unique advantage in efficiency by utilizing the sparsity of point clouds and the regional grouping.

Following the calculation in Swin-Transformer, we inspect the computational complexities of convolutional architectures and SST. For an input scene of size $h\times w$, a convolution layer with kernel size $k\times k$ and channel number $C$ has the complexity given in Equation A. On the same scene, an SRA operation has the complexity given in Equation B, where it has $H$ heads, a region size of $R\times R$, and an average sparsity $S$, which is the ratio of non-empty voxels (our calculation is approximate because we assume that non-empty voxels are uniformly scattered in space).

As shown in the equations, the computational complexities of convolutions and SRA operations are both $O(hw)$, and thus both linear in the scale of the input. However, the SRA operations carry a linear factor of $S$, which is generally small due to the sparsity of point clouds. According to our statistics, $S$ roughly equals 0.09 on Waymo Open Dataset with our voxelization. This analysis indicates that our SRA operations are efficient because they properly exploit the sparsity of LiDAR data.
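
As a rough numerical illustration, the sketch below adapts the windowed-attention operation count of Swin-Transformer to sparse tokens. The two formulas are assumptions made for this illustration and may differ in their constants from Equations A and B: the convolution is counted as $hwk^{2}C^{2}$, and the attention as $4NC^{2}+2NnC$ with $N=Shw$ tokens overall and $n=SR^{2}$ tokens per region. The grid side of 468 cells is hypothetical.

```python
def conv_flops(h, w, k, c):
    """Multiply-accumulate count of a dense k x k convolution with C channels."""
    return h * w * k * k * c * c


def sra_flops(h, w, r, c, s):
    """Approximate count for regional attention over sparse tokens.

    Assumes non-empty voxels are uniformly scattered with ratio s, so there
    are s*h*w tokens overall and s*r*r tokens in each r x r region.
    """
    tokens = s * h * w
    tokens_per_region = s * r * r
    projections = 4 * tokens * c * c                # q, k, v and output projections
    attention = 2 * tokens * tokens_per_region * c  # attention scores and weighted sum
    return projections + attention


H = W = 468          # hypothetical BEV grid side
K, C, R = 3, 128, 12
S = 0.09             # sparsity reported in the text

ratio = sra_flops(H, W, R, C, S) / conv_flops(H, W, K, C)
print(f"SRA / conv operation ratio: {ratio:.3f}")
```

Under these assumptions both counts scale linearly with $hw$, and the sparse attention count is a small fraction of the dense convolution count because every term carries at least one factor of $S$.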