SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation

Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka

Introduction

LiDAR sensors are widely used in many applications, especially autonomous driving. For level 4 and 5 autonomous vehicles, most solutions rely on LiDAR to obtain a point-cloud representation of the environment. LiDAR point clouds can be used in many ways to understand the environment, such as 2D/3D object detection, multi-modal fusion, simultaneous localization and mapping, and point-cloud segmentation. This paper focuses on point-cloud segmentation. This task takes a point cloud as input and aims to assign each point a label corresponding to its object category. For autonomous driving, point-cloud segmentation can be used to recognize objects such as pedestrians and cars, identify drivable areas, detect lanes, and so on. More applications of point-cloud segmentation are discussed in prior work.

Recent work on point-cloud segmentation is mainly divided into two categories, focusing on small-scale or large-scale point clouds. For small-scale problems, ranging from object parsing to indoor scene understanding, most recent methods are based on PointNet. Although PointNet-based methods have achieved competitive performance in many 3D tasks, they have limited processing speed, especially for large-scale point clouds. For outdoor scenes and applications such as autonomous driving, typical LiDAR sensors, such as the Velodyne HDL-64E, can scan about $64\times 3000=192,000$ points per frame, covering an area of $160\times 160\times 20$ meters. Processing point clouds at such scale efficiently, let alone in real time, is far beyond the capability of PointNet-based methods. Hence, much of the recent work follows the method based on spherical projection proposed by Wu et al. Instead of processing 3D points directly, these methods first transform a 3D LiDAR point cloud into a 2D LiDAR image and use 2D ConvNets to segment the point cloud, as shown in Figure 1. In this paper, we follow this method based on spherical projection.

To transform a 3D point cloud into a 2D grid representation, each point in 3D space is projected onto a spherical surface. The projection angles of each point are quantized and used to denote the location of the pixel. Each point’s original 3D coordinates are treated as features. Such representations of LiDAR data are very similar to RGB images; therefore, it seems straightforward to adopt 2D convolution to process “LiDAR images”. This pipeline is illustrated in Figure 1.

However, we discovered that an important difference exists between LiDAR images and regular images. For a regular image, the feature distribution is largely invariant to spatial location, as visualized in Figure 2. For a LiDAR image, the features are converted by spherical projection, which introduces very strong spatial priors. As a result, the feature distribution of LiDAR images varies drastically across locations, as illustrated in Figure 2 and Figure 3 (top). When we train a ConvNet to process LiDAR images, convolution filters may fit local features, become active only in some regions, and remain unused elsewhere, as confirmed in Figure 3 (bottom). The capacity of the model is therefore under-utilized, leading to decreased performance in point-cloud segmentation.

To tackle this problem, we propose Spatially-Adaptive Convolution (SAC), as shown in Figure 1. SAC is designed to be spatially-adaptive and content-aware. Based on the input, it adapts its filters to process different parts of the image. To ensure efficiency, we factorize the adaptive filter into a product of a static convolution weight and an attention map. The attention map is computed by a one-layer convolution, whose output at each pixel location is used to adapt the static weight. By carefully scheduling the computation, SAC can be implemented as a series of widely supported and optimized operations including element-wise multiplication, im2col, and reshaping, which ensures the efficiency of SAC.

SAC is formulated as a general framework such that previous methods such as squeeze-and-excitation (SE), the convolutional block attention module (CBAM), the context-aggregation module (CAM), and pixel-adaptive convolution (PAC) can be seen as special cases of SAC. Experiments show that the more general SAC variants proposed in this paper outperform previous ones.

Using spatially-adaptive convolution, we build SqueezeSegV3 for LiDAR point-cloud segmentation. On the SemanticKITTI benchmark, SqueezeSegV3 outperforms all previously published methods by at least 3.7 mIoU with comparable inference speed, demonstrating the effectiveness of spatially-adaptive convolution.

Related work

Recent papers on point-cloud segmentation can be divided into two categories: those that deal with small-scale point clouds, and those that deal with large-scale point clouds. For small-scale point-cloud segmentation, such as object part parsing and indoor scene understanding, mainstream methods are based on PointNet. DGCNN and Deep-KdNet extend the hierarchical architecture of PointNet++ by grouping neighbor points. Building on the PointNet architecture, subsequent work further improves the effectiveness of sampling, reordering, and grouping to obtain a better representation for downstream tasks. PVCNN improves the efficiency of PointNet-based methods using voxel-based convolution with a contiguous memory access pattern. Despite these efforts, the efficiency of PointNet-based methods is still limited since they inherently need to process sparse data, which is more difficult to accelerate. It is worth noting that the most recent RandLA-Net significantly improves the speed of point-cloud processing through its novel use of random sampling.

Large-scale point-cloud segmentation is challenging since 1) large-scale point clouds are difficult to annotate and 2) many applications require real-time inference. Since a typical outdoor LiDAR (such as the Velodyne HDL-64E) can collect about $200K$ points per scan, it is difficult for previous methods to satisfy a real-time latency constraint. To address the data challenge, prior work proposed tools to label 3D bounding boxes and convert them to point-wise segmentation labels. Other work proposed to train with simulated data. Recently, Behley et al. proposed SemanticKITTI, a densely annotated dataset for large-scale point-cloud segmentation. For efficiency, Wu et al. proposed to project 3D point clouds to 2D and transform point-cloud segmentation into image segmentation. Later work continued to improve the projection-based method, making it a popular choice for large-scale point-cloud segmentation.

2 Adaptive Convolution

Standard convolutions use the same weights to process input features at all spatial locations regardless of the input. Adaptive convolutions may change the weights according to the input and the location in the image. Squeeze-and-excitation and its variants compute channel-wise or spatial attention to adapt the output feature map. Pixel-adaptive convolution (PAC) changes the convolution weight along the kernel dimension with a Gaussian function. Wang et al. propose to directly re-weight the standard convolution with a depth-aware Gaussian kernel. 3DNConv extends this idea further by estimating depth from an RGB image and using it to improve image segmentation. In our work, we propose a more general framework such that channel-wise attention, spatial attention, and PAC can be considered special cases of spatially-adaptive convolution. In addition to adapting weights, deformable convolutions adapt the locations from which features are sampled for the convolution. DKN combines both deformable convolution and adaptive convolution for joint-image filtering. However, deformable convolution is orthogonal to our proposed method.

3 Efficient Neural Networks

Many applications that involve point-cloud segmentation require real-time inference. To meet this requirement, we need to design not only efficient segmentation pipelines but also efficient neural networks that optimize parameter size, FLOPs, latency, power, and so on.

Many neural nets have been designed for efficiency, including SqueezeNet, MobileNets, ShiftNet, ShuffleNet, FBNet, ChamNet, MnasNet, and EfficientNet. Previous work shows that using a more efficient backbone network can effectively improve efficiency in downstream tasks. In this paper, however, in order to rigorously evaluate the performance of spatially-adaptive convolution (SAC), we use the same backbone as RangeNet++.

Spherical Projection of LiDAR Point-Cloud

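To process a LiDAR point cloud efficiently, Wu et al. proposed a pipeline (shown in Figure 1) to project a sparse 3D point cloud to a 2D LiDAR image as

$$\begin{bmatrix}p\\ q\end{bmatrix}=\begin{bmatrix}\frac{1}{2}\left(1-\arctan(y,x)/\pi\right)\cdot w\\ \left(1-\left(\arcsin(z\cdot r^{-1})+f_{down}\right)\cdot f^{-1}\right)\cdot h\end{bmatrix}\tag{1}$$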

where $(x,y,z)$ are 3D coordinates, $(p,q)$ are angular coordinates, $(h,w)$ are the height and width of the desired projected 2D map, $f=f_{up}+f_{down}$ is the vertical field-of-view of the LiDAR sensor, and $r=\sqrt{x^{2}+y^{2}+z^{2}}$ is the range of each point. For each point projected to $(p,q)$, we use its measurement of $(x,y,z,r)$ and intensity as features and stack them along the channel dimension. This way, we can represent a LiDAR point cloud as a LiDAR image with the shape of $(h,w,5)$. Point-cloud segmentation can then be reduced to image segmentation, which is typically solved using ConvNets.
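The projection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the floor quantization, the default field-of-view magnitudes (3 degrees up, 25 degrees down), and the use of the lower field-of-view bound as the elevation offset are all assumptions made here. When several points fall on the same pixel, the sketch keeps the farthest one, following the rule stated in the implementation details of this paper.

```python
import numpy as np


def spherical_projection(points, intensity, h=64, w=2048,
                         fov_up_deg=3.0, fov_down_deg=25.0):
    """Project N points of shape (N, 3) to an (h, w, 5) LiDAR image.

    fov_up_deg and fov_down_deg are the magnitudes of the upper and lower
    vertical field-of-view bounds (illustrative values, not from the source).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    f_up, f_down = np.deg2rad(fov_up_deg), np.deg2rad(fov_down_deg)
    f = f_up + f_down                        # total vertical field-of-view

    yaw = np.arctan2(y, x)                   # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))

    p = 0.5 * (1.0 - yaw / np.pi) * w        # column coordinate
    q = (1.0 - (pitch + f_down) / f) * h     # row coordinate, top row = highest pitch

    # Quantize to integer pixel indices (floor is an assumption of this sketch).
    p = np.clip(np.floor(p), 0, w - 1).astype(np.int64)
    q = np.clip(np.floor(q), 0, h - 1).astype(np.int64)

    feats = np.stack([x, y, z, r, intensity], axis=1).astype(np.float32)
    # Write points in order of increasing range: with repeated indices NumPy
    # keeps the last assignment, so the farthest point wins each pixel.
    order = np.argsort(r)
    image = np.zeros((h, w, 5), dtype=np.float32)
    image[q[order], p[order]] = feats[order]
    return image, p, q
```

The returned `p` and `q` give each point's pixel, so a 2D label map predicted on the image can be read back per point with `label_map[q, p]`.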

Despite the apparent similarity between LiDAR and RGB images, we discover that the spatial distribution of RGB features is quite different from that of $(x,y,z,r)$ features. In Figure 2, we sample nine pixels on images from COCO, CIFAR10, and SemanticKITTI and compare their feature distributions. In COCO and CIFAR10, the feature distributions at different locations are rather similar. For SemanticKITTI, however, the feature distribution at each location is drastically different. Such a spatially-varying distribution is caused by the spherical projection in Equation (1). In Figure 3 (top), we plot the mean of the x, y, and z channels of LiDAR images. Along the width dimension, we can see the sinusoidal change of the x and y channels. Along the height dimension, points projected to the top of the image have higher z-values than the ones projected to the bottom. As we will discuss later, such a spatially-varying distribution can degrade the performance of convolutions.
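The sinusoidal pattern can be traced back to the projection geometry. As a sketch, write $\theta=\arctan(y,x)$ for the azimuth and $\phi=\arcsin(z/r)$ for the elevation (symbols introduced here for illustration only), and assume the column index is linear in azimuth, $p=\frac{1}{2}(1-\theta/\pi)\,w$, as in a standard spherical projection. Then

$$x=r\cos\phi\cos\theta,\qquad y=r\cos\phi\sin\theta,\qquad z=r\sin\phi,\qquad \theta=\pi\left(1-\frac{2p}{w}\right).$$

For points at a similar range and elevation, $x$ and $y$ therefore sweep one full cosine and sine period as $p$ runs across the image width, while $z$ grows monotonically with the elevation $\phi$, which is consistent with the higher z-values observed in the top rows.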

Spatially-Adaptive Convolution

Previous methods based on spherical projection treat projected LiDAR images as RGB images and process them with standard convolution as

$$Y[m,p,q]=\sigma\Big(\sum_{i,j,n}W[m,n,i,j]\times X[n,p+\hat{i},q+\hat{j}]\Big)\tag{2}$$

where $Y\in\mathbf{R}^{O\times S\times S}$ is the output tensor, $X\in\mathbf{R}^{I\times S\times S}$ denotes the input tensor, and $W\in\mathbf{R}^{O\times I\times K\times K}$ is the convolution weight. $O,I,S,K$ are the output channel size, input channel size, image size, and kernel size of the weight, respectively. $\hat{i}=i-\left\lfloor{K/2}\right\rfloor$, $\hat{j}=j-\left\lfloor{K/2}\right\rfloor$, and $\sigma(\cdot)$ is a non-linear activation function.

Convolution is based on a strong inductive bias that the distribution of visual features is invariant to image location. For RGB images, this is a somewhat valid assumption, as illustrated in Figure 2. Therefore, regardless of the location, a convolution uses the same weight $W$ to process the input. This design makes the convolution operation very computationally efficient. First, convolutional layers are efficient in parameter size: regardless of the input resolution $S$, a convolutional layer’s parameter size remains $O\times I\times K\times K$. Second, convolution is efficient to compute. In modern computer architectures, loading parameters into memory costs orders of magnitude more energy and latency than floating-point operations such as multiplications and additions. For convolutions, we can load the parameters once and re-use them for all input pixels, which significantly improves latency and power efficiency.
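As a concrete reference for the standard convolution described above, the following NumPy sketch applies one shared weight at every pixel. It assumes stride 1, zero padding, an odd kernel size, and omits the activation $\sigma$; the function name `conv2d` is chosen here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(x, weight):
    """Standard convolution without the activation.

    x:      (I, H, W) input tensor
    weight: (O, I, K, K) shared weight, K odd
    returns (O, H, W)
    """
    k = weight.shape[-1]
    pad = k // 2
    # Zero-pad so that every output pixel has a full K x K neighbourhood.
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # patches[n, p, q, i, j] == x[n, p + i - pad, q + j - pad]
    patches = sliding_window_view(xp, (k, k), axis=(1, 2))
    # Sum over input channel n and kernel offsets (i, j) with the same weight
    # at every location (p, q).
    return np.einsum("oikl,ipqkl->opq", weight, patches)
```

The weight holds `O * I * K * K` values whatever the input resolution, which is the parameter-efficiency argument made above: the same `weight` array can be passed with an 8×8 or a 64×2048 input.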

However, for LiDAR images, the feature distribution is no longer identical across the image, as illustrated in Figure 2 and Figure 3 (top). Many features may only exist in local regions of the image, so the filters that are trained to process them are only active in the corresponding regions and are not useful elsewhere. To confirm this, we analyze a trained RangeNet21 by calculating the average filter activation across the image. We can see in Figure 3 (bottom) that convolutional filters are sparsely activated and remain zero in many regions. This validates that convolution filters are spatially under-utilized.

2 Spatially-Adaptive Convolution

To better process LiDAR images with spatially-varying feature distributions, we re-design convolution to achieve two goals: 1) It should be spatially-adaptive and content-aware. The new operator should process different parts of the image with different filters, and the filters should adapt to feature variations. 2) The new operator should be efficient to compute.

To achieve these goals, we propose Spatially-Adaptive Convolution (SAC), which can be described as follows:

$$Y[m,p,q]=\sigma\Big(\sum_{i,j,n}W(X_{0})[m,n,p,q,i,j]\times X[n,p+\hat{i},q+\hat{j}]\Big)\tag{3}$$

$W(\cdot)\in\mathbf{R}^{O\times I\times S\times S\times K\times K}$ is a function of the raw input $X_{0}$. It is spatially-adaptive, since $W$ depends on the location $(p,q)$. It is content-aware, since $W$ is a function of the raw input $X_{0}$. Computing $W$ in this general form is very expensive, since $W$ contains too many elements to compute.

To reduce the computational cost, we factorize $W$ as the product of a standard convolution weight and a spatially-adaptive attention map as:

$$W(X_{0})[m,n,p,q,i,j]=\hat{W}[m,n,i,j]\times A(X_{0})[m,n,p,q,i,j]\tag{4}$$

$\hat{W}\in\mathbf{R}^{O\times I\times K\times K}$ is a standard convolution weight, and $A\in\mathbf{R}^{O\times I\times S\times S\times K\times K}$ is the attention map. Because the full attention map is as large as $W$ itself, we collapse several dimensions of $A$ to obtain a smaller attention map that is computationally tractable.

We denote the first dimension of $A$ as the output channel dimension (O), the second as the input channel dimension (I), the 3rd and 4th dimensions as spatial dimensions (S), and the last two dimensions as kernel dimensions (K).

As long as we retain the spatial dimensions of $A$, SAC is able to spatially adapt a standard convolution. Experiments show that all variants of SAC effectively improve the performance on the SemanticKITTI dataset.

3 Efficient Computation of SAC

To efficiently compute an attention map, we feed the raw LiDAR image $X_{0}$ into a 7x7 convolution followed by a sigmoid activation. The convolution computes the values of the attention map at each location. The more dimensions the attention map adapts, the more FLOPs and parameters SAC requires. However, most of the variants of SAC are very efficient. Taking SqueezeSegV3-21 as an example, the cost of adding different SAC variants is summarized in Table 1. The extra FLOPs (2.4% - 24.8%) and parameters (1.1% - 14.9%) needed by SAC are quite small.

After obtaining the attention map, we need to efficiently compute the product of the convolution weight $\hat{W}$, the attention map $A$, and the input $X$. One choice is to first compute the adaptive weight as in Equation (4) and then process the input $X$. However, the adaptive weight varies per pixel, so we are no longer able to re-use the weight spatially to retain the efficiency of standard convolution.

So, instead, we first combine the attention map $A$ with the input tensor $X$. For attention maps without kernel dimensions, such as $A_{S}$ or $A_{IS}$, we directly perform element-wise multiplication (with broadcasting) between $A$ and $X$. Then, we apply a standard convolution with weight $\hat{W}$ on the adapted input. The examples of SAC-S and SAC-IS are illustrated in Figures 4(b) and 4(d), respectively. A pseudo-code implementation is provided in the supplementary material.
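The two steps described in this subsection, computing the attention map from the raw LiDAR image and multiplying it into the input before a standard convolution, can be sketched as follows. This is a hedged NumPy illustration rather than the authors' code: the function names are hypothetical, the convolutions use stride 1 with zero padding, and $X$ is assumed to have the same spatial resolution as $X_{0}$. With this ordering, each input pixel is scaled by the attention value at its own location.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(x, weight):
    """Zero-padded, stride-1 convolution. x: (I, H, W); weight: (O, I, K, K)."""
    k = weight.shape[-1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2), (k // 2, k // 2)))
    patches = sliding_window_view(xp, (k, k), axis=(1, 2))  # (I, H, W, K, K)
    return np.einsum("oikl,ipqkl->opq", weight, patches)


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def attention_map(x0, w_att):
    """7x7 convolution on the raw LiDAR image x0 of shape (5, H, W), then sigmoid.

    w_att has shape (C, 5, 7, 7): C = 1 yields A_S, C = I yields A_IS.
    """
    return sigmoid(conv2d(x0, w_att))


def sac_direct(x, x0, w_att, w_hat):
    """SAC-S / SAC-IS: scale the input by the attention map, then convolve."""
    a = attention_map(x0, w_att)   # (1, H, W) or (I, H, W)
    return conv2d(a * x, w_hat)    # broadcasting covers the single-channel case
```

Because the attention is folded into the input, the second step is an ordinary convolution whose weight $\hat{W}$ is still shared across all pixels.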

Overall, SAC can be implemented as a series of element-wise multiplications, im2col, reshaping, and standard convolution operations, which are widely supported and well optimized. This ensures that SAC can be computed efficiently.

4 Relationship with Prior Work

Several prior works can be seen as variants of spatially-adaptive convolution, as described by Equations 3 and 4. Squeeze-and-Excitation (SE) uses global average pooling and fully-connected layers to compute channel-wise attention to adapt the feature map, as illustrated in Figure 6(a). It can be seen as the variant SAC-I with an attention map of $A_{I}\in\mathbf{R}^{1\times I\times 1\times 1\times 1\times 1}$. The convolutional block attention module (CBAM) can be seen as applying $A_{I}$ followed by $A_{S}$ to adapt the feature map, as shown in Figure 6(b). SqueezeSegV2 uses the context-aggregation module (CAM) to combat dropout noise in LiDAR images. At each position, it uses a 7x7 max pooling followed by 1x1 convolutions to compute a channel-wise attention map. It can be seen as the variant SAC-IS with an attention map of $A_{IS}\in\mathbf{R}^{1\times I\times S\times S\times 1\times 1}$, as illustrated in Figure 6(c). Pixel-adaptive convolution (PAC) uses a Gaussian function to compute kernel-wise attention for each pixel. It can be seen as the variant SAC-SK with an attention map of $A_{SK}\in\mathbf{R}^{1\times 1\times S\times S\times K\times K}$, as illustrated in Figure 6(d). Our ablation studies compare variants of SAC, including ones proposed in our paper and in prior work. Experiments show that our proposed SAC variants outperform previous baselines.
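The attention-map shapes quoted above can be collected in one place. The following Python sketch lists them in the (O, I, S, S, K, K) layout and counts their elements; the shapes for SAC-S and SAC-ISK are not stated in this section and are inferred here from the naming convention, and the helper names are hypothetical.

```python
def attention_shape(variant, I, S, K):
    """Attention-map shape in (O, I, S, S, K, K) layout for a named SAC variant."""
    shapes = {
        "I":   (1, I, 1, 1, 1, 1),   # channel-wise attention, as in SE
        "S":   (1, 1, S, S, 1, 1),   # spatial only (inferred from the name)
        "IS":  (1, I, S, S, 1, 1),   # channel and spatial, as in CAM
        "SK":  (1, 1, S, S, K, K),   # spatial and kernel, as in PAC
        "ISK": (1, I, S, S, K, K),   # inferred from the name
    }
    return shapes[variant]


def num_elements(shape):
    """Number of entries in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n
```

For example, with the illustrative sizes `I = O = 64`, `S = 32`, `K = 3`, comparing `num_elements(attention_shape("ISK", 64, 32, 3))` against `num_elements((64, 64, 32, 32, 3, 3))` shows how much smaller a collapsed attention map is than the full six-dimensional one.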

SqueezeSegV3

Using the spatially-adaptive convolution, we build SqueezeSegV3 for LiDAR point-cloud segmentation. The overview of the model is shown in Figure 1.

To facilitate rigorous comparison, SqueezeSegV3’s backbone architecture is based on RangeNet. RangeNet contains five stages of convolution, and each stage contains several blocks. At the beginning of each stage, it performs downsampling. The output is then upsampled to recover the resolution. Each block of RangeNet contains two stacked convolutions. We replace the first one with SAC-ISK, as in Figure 4(a). We remove the last two downsampling operations. To keep the same FLOPs, we reduce the channels of the last two stages. The output channel sizes from $Stage1$ to $Stage5$ are 64, 128, 256, 256, and 256, respectively, while the output channel sizes in RangeNet are 64, 128, 256, 512, and 1024. Due to the removal of the last two downsampling operations, we only adopt 3 upsample blocks using transposed convolution and convolution.

2 Loss Function

We introduce a multi-layer cross-entropy loss to train the proposed network, which is also used in prior work. During training, from $stage1$ to $stage5$, we add a prediction layer at each stage’s output. For each output, we downsample the ground-truth label map by 1x, 2x, 4x, 8x, and 8x, respectively, and use the downsampled maps to train the outputs of $stage1$ to $stage5$. The loss function can be described as

$$L=\sum_{i=1}^{5}\frac{-\sum_{H_{i},W_{i}}\left(\sum_{c}w_{c}\cdot\hat{y}_{c}\cdot\log(y_{c})\right)}{H_{i}\times W_{i}}\tag{5}$$

In the equation, $w_{c}=\frac{1}{\log(f_{c}+\epsilon)}$ is a normalization factor and $f_{c}$ is the frequency of class $c$. $H_{i},W_{i}$ are the height and width of the output in the $i$-th stage, $y_{c}$ is the prediction for the $c$-th class in each pixel, and $\hat{y}_{c}$ is the label. Compared to the single-stage cross-entropy loss used for the final output, the intermediate supervisions guide the model to form features with more semantic meaning. In addition, they help mitigate the vanishing gradient problem in training.
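A minimal NumPy sketch of this multi-layer loss is given below. Several details are assumptions made for illustration and are not stated in the text: $\epsilon=1.02$ (any value above 1 keeps the weights positive when $f_{c}$ is a frequency in $[0,1]$), label downsampling by strided sub-sampling along both axes, and the function names.

```python
import numpy as np


def class_weights(freq, eps=1.02):
    """w_c = 1 / log(f_c + eps); eps = 1.02 is an assumed value."""
    return 1.0 / np.log(np.asarray(freq, dtype=np.float64) + eps)


def weighted_ce(prob, label, w):
    """Class-weighted cross entropy averaged over the H_i x W_i pixels.

    prob:  (C, H, W) predicted class probabilities
    label: (H, W) integer class indices
    w:     (C,) class weights
    """
    # Probability that the model assigns to the labelled class at each pixel.
    picked = np.take_along_axis(prob, label[None], axis=0)[0]
    return float(-(w[label] * np.log(picked)).mean())


def multi_layer_loss(probs, label, w, factors=(1, 2, 4, 8, 8)):
    """Sum of per-stage losses; labels are strided to each stage's resolution."""
    return sum(weighted_ce(p, label[::s, ::s], w)
               for p, s in zip(probs, factors))
```

Each entry of `probs` must match the resolution of the correspondingly downsampled label map, for example `(C, H // 2, W // 2)` for the second stage when `H` and `W` are divisible by 8.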

Experiments

We conduct our experiments on the SemanticKITTI dataset, a large-scale dataset for LiDAR point-cloud segmentation. The dataset contains 21 sequences of point-cloud data with 43,552 densely annotated scans and a total of 4,549 million points. Following prior work, sequences {0-7} and {9, 10} (19,130 scans) are used for training, sequence 08 (4,071 scans) is used for validation, and sequences {11-21} (20,351 scans) are used for testing. Following previous work, we use mIoU over 19 categories to evaluate the accuracy.
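For reference, the mIoU metric averages the per-class intersection-over-union, $\text{IoU}_{c}=TP_{c}/(TP_{c}+FP_{c}+FN_{c})$. The short Python sketch below computes it from flat lists of predicted and ground-truth labels; skipping classes that appear in neither list is a choice made here for illustration, and the function name is hypothetical.

```python
def mean_iou(pred, label, num_classes=19):
    """Mean intersection-over-union over classes, from flat label sequences."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, l in zip(pred, label) if p == c and l == c)
        fp = sum(1 for p, l in zip(pred, label) if p == c and l != c)
        fn = sum(1 for p, l in zip(pred, label) if p != c and l == c)
        union = tp + fp + fn
        if union > 0:            # skip classes absent from both sequences
            ious.append(tp / union)
    return sum(ious) / len(ious)
```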

2 Implementation Details

We pre-process all the points by spherical projection following Equation (1). The 2D LiDAR images are then processed by SqueezeSegV3 to get a 2D predicted label map, which is then restored to 3D space. Following previous work, we project all points in a scan to a $64\times 2048$ image. If multiple points are projected to the same pixel on the 2D image, we keep the point with the largest distance. Following RangeNet21 and RangeNet53, we propose SqueezeSegV3-21 (SSGV3-21) and SqueezeSegV3-53 (SSGV3-53). The model architectures of SSGV3-21 and SSGV3-53 are similar to those of RangeNet21 and RangeNet53, except that we replace regular convolution blocks with SAC blocks. Both models contain 5 stages, each of which has a different input resolution. In SSGV3-21, the 5 stages contain 1, 1, 2, 2, and 1 blocks, respectively, and in SSGV3-53, the 5 stages contain 1, 2, 8, 8, and 4 blocks, which are the same as in RangeNet21 and RangeNet53, respectively.

We use the SGD optimizer to train the whole model end-to-end. During training, SSGV3-21 is trained with an initial learning rate of $0.01$, and SSGV3-53 is trained with an initial learning rate of $0.005$. We use a warm-up strategy to change the learning rate for 1 epoch. During inference, the original points are projected and fed into SqueezeSegV3 to get a 2D prediction. Then we adopt the restoration operation to obtain the 3D prediction, as in previous work.

3 Comparing with Prior Methods

We compare the two proposed models, SSGV3-21 and SSGV3-53, with previously published work. From Table 2, we can see that the proposed SqueezeSegV3 models outperform all the baselines. Compared with the previous state-of-the-art RangeNet53, SSGV3-53 improves the accuracy by 3.0 mIoU. Moreover, when we apply post-processing KNN refinement following prior work (indicated as *), the proposed SSGV3-53* outperforms RangeNet53* by 3.7 mIoU and achieves the best accuracy in 14 out of 19 categories. Meanwhile, the proposed SSGV3-21 also surpasses RangeNet21 by 1.4 mIoU, and its performance is close to that of RangeNet53* with post-processing. The advantages are more significant for smaller objects: SSGV3-53* outperforms RangeNet53* by 13.0 IoU, 10.0 IoU, 7.4 IoU, and 15.3 IoU in the categories of bicycle, other-vehicle, bicyclist, and motorcyclist, respectively.

In terms of speed, SSGV3-21 (16 FPS) is close to RangeNet21 (20 FPS). Even though SSGV3-53 (7 FPS) is slower than RangeNet53 (12 FPS), note that our implementation of SAC is primitive and can be optimized to achieve further speedup. In comparison, PointNet-based methods do not perform well in either accuracy or speed, except RandLA-Net, which is a recent efficient and effective method.

4 Ablation Study

We conduct ablation studies to analyze the performance of SAC with different configurations. Also, we compare it with other related operators to show its effectiveness. To facilitate fast training and experiments, we shrink the LiDAR images to $64\times 512$ , and use the shallower model of SSGV3-21 as the starting point. We evaluate the accuracy directly on the projected 2D image, instead of the original 3D points, to make the evaluation faster. We train the models in this section on the training set of SemanticKITTI and report the accuracy on the validation set. We study different variations of SAC, input kernel sizes, and other techniques used in SqueezeSegV3.

Variants of spatially-adaptive convolution: As shown in Figures 4 and 6, spatially-adaptive convolution can have many variations. Some variants are equivalent or similar to methods proposed in previous papers, including squeeze-and-excitation (SE), the convolutional block attention module (CBAM), pixel-adaptive convolution (PAC), and the context-aggregation module (CAM). To understand the effectiveness of SAC variants and previous methods, we swap them into SqueezeSegV3-21. The results are reported in Table 3.

It can be seen that SAC-ISK significantly outperforms all the other settings in terms of mIoU. CAM and SAC-IS have the worst performance, which demonstrates the importance of attention on the kernel dimension. Squeeze-and-excitation (SE) also does not perform well, since SE is not spatially-adaptive, and the global average pooling used in SE ignores the feature distribution shift across the LiDAR image. In comparison, CBAM improves the baseline by 0.8 mIoU. Unlike SE, it also adapts the input feature spatially. This comparison shows that being spatially-adaptive is crucial for processing LiDAR images. Pixel-adaptive convolution (PAC) is similar to the SAC variant SAC-SK, except that PAC uses a Gaussian function to compute the kernel-wise attention. Experiments show that the proposed SAC-SK slightly outperforms PAC, possibly because SAC-SK adopts a more general and learnable convolution to compute the attention map. Comparing SAC-S and SAC-IS, adding the input channel dimension does not improve the performance.

Kernel Sizes of SAC: We use a one-layer convolution to compute the attention map for SAC. However, what should the kernel size of this convolution be? A larger kernel size lets it capture more of the surrounding spatial information, but it also costs more parameters and MACs. To examine the influence of kernel size, we use different kernel sizes in the SAC convolution. As we can see in Table 4, a 1x1 convolution provides a very strong result that is better than its 3x3 and 5x5 counterparts. The 7x7 convolution performs the best.
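The cost side of this trade-off can be made explicit with a short worked count. As a sketch, consider a single $k\times k$ convolution that maps $C_{in}$ to $C_{out}$ channels on an $H\times W$ feature map, with stride 1 and no bias (symbols introduced here for illustration):

$$\text{params}(k)=C_{in}\,C_{out}\,k^{2},\qquad \text{MACs}(k)=C_{in}\,C_{out}\,k^{2}\,H\,W.$$

Relative to a 1x1 convolution, the 3x3, 5x5, and 7x7 kernels therefore cost $9\times$, $25\times$, and $49\times$ as many parameters and MACs in the attention layer itself.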

The effectiveness of other techniques: In addition to SAC, we also introduce several new techniques to SqueezeSegV3, including removing the last two downsampling layers and the multi-layer loss. We start from the baseline of RangeNet21. First, we remove the downsampling layers and reduce the channel sizes of the last two stages to 256 to keep the MACs the same. The performance improves by 3.9 mIoU. After adding the multi-layer loss, the mIoU increases by another 1.5%. Based on the above techniques, adding SAC-ISK further boosts mIoU by 2.3%.

Appendix

In this appendix, we provide the pseudo code implementation for SAC variants discussed in Section 4.2, including SAC-S, SAC-IS, SAC-SK and SAC-ISK.
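The authors' pseudo code is not reproduced in this text. As a hedged stand-in, the NumPy sketch below illustrates the unfolding (im2col) path for the variants with kernel dimensions, SAC-SK and SAC-ISK; for SAC-S and SAC-IS the attention map is multiplied directly with the input before a standard convolution, so no unfolding is needed. All function names are hypothetical, the attention map `a` is taken as given, the convolution uses stride 1 with zero padding, and the activation is omitted. A naive reference that builds the per-pixel adaptive weight $\hat{W}\times A$ explicitly is included to check the result.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def unfold(x, k):
    """im2col: (I, H, W) -> (I, H, W, K, K) with zero padding."""
    xp = np.pad(x, ((0, 0), (k // 2, k // 2), (k // 2, k // 2)))
    return sliding_window_view(xp, (k, k), axis=(1, 2))


def sac_unfold(x, a, w_hat):
    """SAC-SK / SAC-ISK via unfolding.

    x:     (I, H, W) input
    a:     (1, H, W, K, K) for SAC-SK or (I, H, W, K, K) for SAC-ISK
    w_hat: (O, I, K, K) static convolution weight
    """
    k = w_hat.shape[-1]
    adapted = unfold(x, k) * a     # attention applied to every unfolded entry
    # Contract with the shared static weight over input channel and kernel.
    return np.einsum("oikl,ipqkl->opq", w_hat, adapted)


def sac_naive(x, a, w_hat):
    """Reference: form the adaptive weight w_hat * A separately at each pixel."""
    o, i, k, _ = w_hat.shape
    _, h, w = x.shape
    patches = unfold(x, k)
    a_full = np.broadcast_to(a, (i, h, w, k, k))
    y = np.zeros((o, h, w))
    for p in range(h):
        for q in range(w):
            w_pq = w_hat * a_full[:, p, q][None]   # (O, I, K, K) for this pixel
            y[:, p, q] = np.einsum("oikl,ikl->o", w_pq, patches[:, p, q])
    return y
```

Both functions give the same output, but `sac_unfold` never materializes a separate weight per pixel, which is what keeps the static weight reusable across the image.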