Deformable Filter Convolution for Point Cloud Reasoning

Yuwen Xiong, Mengye Ren, Renjie Liao, Kelvin Wong, Raquel Urtasun

Introduction

3D perception is one of the key components of real-world robotic systems. These robots are typically equipped with 3D sensors such as LiDAR and RGBD cameras which produce outputs in the form of point clouds. These point clouds correspond to a set of vectors of location coordinates and associated features. Driven by the success of deep learning on 2D images, there has been a fresh wave of deep network architectures proposed to tackle this new challenge: unlike image grids, point clouds are sampled sparsely and non-uniformly with continuous spatial coordinates, and they are equivalent up to permutations.

A well-studied approach is to discretize the points into voxels. As 3D convolution can be inefficient in terms of both computation and storage, various approximations have been proposed; recent work also finds 2D convolutions almost as competitive but with much improved efficiency. Despite great strides being made by leveraging existing 2D CNN backbone networks on voxelized inputs, current approaches face a dilemma: they either suffer from a loss of local geometric details or struggle with sensor noise and a smaller field of view.

To complement the weaknesses of voxelized networks, new network architectures have been proposed to process points directly. While simple average and max-pooling operations can preserve permutation invariance, this approach lacks interpretability on how spatial features are aggregated. Various attempts to define a general learnable continuous convolution layer have been made, by using a learned multi-layer network to predict the filter weights or features, or by restricting to a simpler family of functions. While the former may lack robustness and require extra supervision for regions with sparse neighborhoods, the latter could be limited by less powerful feature representations.

This paper advocates a simple idea: we learn 3D cuboid filters just like the ones in a voxelized network. However, when performing the convolution operation, instead of discretizing the points into a voxel grid, we deform the 3D filter towards the point clouds. We propose two ways to integrate our proposed convolution operator into popular backbones: 1) enriching the feature representation with an additional point-wise input stream, and 2) smoothing out point-wise predictions within local neighborhoods. Composing our deformable filter convolution layers with voxelized networks results in a significant gain in performance on semantic segmentation and object detection tasks, compared to voxel-only networks and previous attempts at fusing point-wise features. Moreover, our proposed joint network achieves state-of-the-art performance on the TOR4D large-scale LiDAR semantic segmentation benchmark.

Related work

Previous work on processing 3D data can be roughly categorized into multi-view, voxelization, and point-based representations. Inspired by an agent-centric 2D view of the world, multi-view representations treat 3D data as snapshots of 2D images taken at different viewpoints. Front view representations, which consider depth as an additional channel in the input, can leverage 2D image network architectures. However, these approaches bear a significant loss of 3D information, and are unable to reason about 3D rigid transformations. To address this issue, voxelization-based representations instead process the occupancy grids using 3D convolution. As pure 3D convolution suffers from computation and storage inefficiency, OctNet and O-CNN use octrees to efficiently compute voxel convolutions.

While voxels are intuitive 3D counterparts to 2D images, real-world sensors such as LiDAR and depth cameras instead produce point clouds as their native output. A popular approach is to discretize points with point statistics in each voxel of a grid and learn a 2D convolutional neural net (CNN) using a bird’s eye view representation. Because such a procedure can result in a loss of details, especially when the points are sampled non-uniformly, deep network architectures have been designed to directly handle point data as inputs. Qi et al. propose PointNet, which applies a fully connected network on each point individually and a permutation-invariant max-pooling operation to aggregate global information. To leverage local neighborhoods and hierarchical information passing, PointNet++ adds grouping and sampling layers to perform stagewise aggregation, which is similar to pooling layers in regular CNNs. demonstrate the effectiveness of the PointNet-based architecture on 3D object detectors. Inspired by the SIFT feature extractor, Jiang et al. propose PointSIFT, which uses local octant directional vectors as feature extraction layers, showing good results on point cloud segmentation tasks.

Grouping local neighborhoods of points and aggregating information using permutation-invariant operators, as done in PointNet++, are special cases of graph neural networks (GNNs), where node interactions are modeled using a neural network. Point clouds can be treated as a sparse graph where edges connect two points which are close. 3D-GNN uses a GNN to approximate message passing in point clouds. ECC and EdgeConv propose to generate the edge weights through a neural network. KD-Net recursively processes hierarchical graph structures through a KD-Tree. uses learnable kernel anchor points to smooth out local neighborhoods.

Another line of work views point clouds as discrete samples of a continuous function in space. Various parameterizations of learnable continuous filter functions have thus been proposed. Wang et al. and Hermosilla et al. propose to use a multi-layer perceptron (MLP) to represent the convolution filter function. The MLP takes in an offset vector towards the center of the local neighborhood, and outputs the value of the filter function at that location. ContFuse is a memory efficient successor of as it directly predicts the output features through an MLP. Other families of learnable filter functions have also been studied, e.g. radial basis function (RBF) and polynomial function kernels. To prevent the function value from growing unbounded at a large distance, a step function is applied in to make sure the filter function is zero outside a certain radius. In contrast, instead of being predicted by a parametric function, our 3D filters have learnable weights at well-defined 3D positions, which are potentially more robust and sample efficient.

Motivated by 3D convolution with grid structured filters, Atzmon et al. propose “extension” and “restriction” operators. First, the extension operator interpolates point features onto a grid structure; then 3D convolution is applied on the grids; finally, the restriction operator projects convolved features back to point locations. Similarly, SPLATNet extends points onto lattice grids, and PointwiseCNN discretizes points into filter bins. These extension operators could potentially suffer from the loss of local geometric information due to discretization. The design of our deformable filter convolution also takes inspiration from deformable convolution, extending model capacity with continuous spatial reasoning. Beyond the main difference of operating on 3D point clouds rather than 2D images, our proposed operator does not resample point features, as was done in , but interpolates 3D filters at point locations. This particular design addresses the potential discretization issue mentioned above. In concurrent work, KPConv also tries to deform filters similarly to our approach.

Deformable Filter Convolution

In this section, we first define our deformable filter convolution operator in the context of spatially continuous functions with discrete samples. We then show equivariance properties of the proposed operator.

where ${\cal N}(\mathbf{y})=\{\mathbf{x}:\lVert\mathbf{x}-\mathbf{y}\rVert\leq r\}$ is the set of points in the neighborhood of $\mathbf{y}$ , and $\mu$ is the measure of the volume covered by each neighboring point. For simplicity, we further assume that $\mu$ is some constant based on the approximation that points are uniformly sampled in local neighborhoods. Without loss of generality, we assume $\mu=1$ since the actual value can be merged with the learned filter $g$ .
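A sketch of the discretized convolution that these definitions describe, assuming an input feature function $f$ and an output $h$ (the names used for input and output in the analysis later on); the first expression is the continuous convolution and the second its approximation over the sampled neighbors:

```latex
h(\mathbf{y})
  = \int g(\mathbf{y}-\mathbf{x})\, f(\mathbf{x})\, d\mathbf{x}
  \;\approx\; \sum_{\mathbf{x}\in\mathcal{N}(\mathbf{y})} \mu\, g(\mathbf{y}-\mathbf{x})\, f(\mathbf{x})
```

With $\mu=1$ absorbed into $g$, the sum reduces to $\sum_{\mathbf{x}\in\mathcal{N}(\mathbf{y})} g(\mathbf{y}-\mathbf{x})\, f(\mathbf{x})$.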

Previous work proposed to represent the filter $g(\mathbf{y}-\mathbf{x})$ using an MLP or a polynomial function . The potential issues with these continuous representations are 1) the filter kernel can be highly non-linear, and 2) depending on the point cloud distribution, the actual filter being used can vary significantly.

Different from previous approaches, we only parameterize the filter at discretized anchors $X^{\prime}$ that are coherent with the 3D grid structure. In contrast to , which “voxelizes” the points into a grid structure and then projects them back to the original locations, we “deform” the standard 3D filter from the anchors towards the points $(\mathbf{y}-\mathbf{x})$ . This leads to better preservation of local geometric information compared to feature voxelization as shown in Figure 2 for the 1D case. To deform 3D filters, we use an interpolation kernel $k(\cdot,\cdot)$ ,

In practice, we choose to use a tri-linear interpolation kernel

where $a_{d}$ is the filter grid unit length on dimension $d$ . This interpolation scheme is continuous everywhere and naturally decays to zero when a point falls far away from the anchors, without using a manually designed step function, as was done in . Tri-linear interpolation is easy to implement and, unlike the Gaussian kernel, does not suffer from the vanishing gradient problem. In summary, our 3D deformable filter convolution operator can be written as:
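A minimal NumPy sketch of such an operator follows, under stated assumptions rather than as the authors' implementation. It assumes the deformed filter is $g(\mathbf{z})=\sum_{\mathbf{x}^{\prime}\in X^{\prime}}k(\mathbf{z},\mathbf{x}^{\prime})\,g(\mathbf{x}^{\prime})$ with the tri-linear kernel $k(\mathbf{z},\mathbf{x}^{\prime})=\prod_{d}\max(0,\,1-|z_{d}-x^{\prime}_{d}|/a_{d})$, and that the neighborhood is a ball of radius $r$. The function names, array shapes, and the anchor layout (a grid centered at the origin) are illustrative choices.

```python
import numpy as np

def trilinear_kernel(offsets, anchors, unit):
    """Assumed kernel k(z, x') = prod_d max(0, 1 - |z_d - x'_d| / a_d).

    offsets: (M, 3) offsets z = y - x
    anchors: (A, 3) anchor positions x'
    unit:    (3,) grid unit lengths a_d
    returns: (M, A) interpolation weights
    """
    diff = np.abs(offsets[:, None, :] - anchors[None, :, :]) / unit
    return np.clip(1.0 - diff, 0.0, None).prod(axis=-1)

def make_anchors(size=3, unit=(0.2, 0.2, 0.2)):
    """Anchor positions of a size^3 filter grid centered at the origin."""
    r = np.arange(size) - (size - 1) / 2.0
    grid = np.stack(np.meshgrid(r, r, r, indexing="ij"), axis=-1).reshape(-1, 3)
    return grid * np.asarray(unit)

def deformable_filter_conv(out_xyz, in_xyz, in_feat, weights, anchors, unit, radius):
    """h(y) = sum_{x in N(y)} sum_{x'} k(y - x, x') g(x') f(x).

    weights: (A, C_in, C_out) learnable filter values g(x') at the anchors.
    """
    unit = np.asarray(unit, dtype=float)
    out = np.zeros((len(out_xyz), weights.shape[-1]))
    for i, y in enumerate(out_xyz):
        offsets = y - in_xyz                                # (N, 3)
        mask = np.linalg.norm(offsets, axis=1) <= radius    # neighborhood N(y)
        k = trilinear_kernel(offsets[mask], anchors, unit)  # (M, A)
        # Filter interpolated at each neighbor offset: (M, C_in, C_out)
        g = np.einsum("ma,aio->mio", k, weights)
        out[i] = np.einsum("mi,mio->o", in_feat[mask], g)
    return out
```

In this sketch a neighbor that coincides with an anchor receives exactly that anchor's weights, and a neighbor more than one grid unit beyond the outermost anchors along any axis contributes nothing, which mirrors the decay-to-zero behavior described above.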

Analysis

In this section we analyze various equivariance properties of the proposed convolution operator on point clouds. First, we show that our operator is translation-equivariant. Second, we show that our operator is permutation-equivariant under discretization of continuous signals. Equivariance could be more useful than invariance in certain scenarios as it preserves the transformation information. We start by defining the equivariance property mathematically.

Let $\mathcal{T}^{F}_{g}:F\mapsto F$ be a transformation operator that produces a group action of $g$ in a transformation group $G$ on a function space $F$ . An operator $L:F\mapsto H$ is said to be $\mathcal{T}_{G}$ -equivariant if $L(\mathcal{T}^{F}_{g}(f))=\mathcal{T}^{H}_{g}(L(f))$ for any $f\in F$ , $g\in G$ .

Translation equivariance is a desired property as it is an efficient way of sharing parameters that produces consistent outputs regardless of the location of the regions of interest. In short, if the input $f$ is translated by an offset $\Delta\mathbf{x}$ , the effect on the output $h$ is also a translation of $\Delta\mathbf{x}$ . CNNs have translation equivariance on 2D grid locations, which is one of the reasons they are successful in the image domain. Here, we show that our convolution operator is translation equivariant in a continuous domain with a $d$ -dimensional coordinate system. In contrast, popular voxelization based approaches (e.g., ) are unfortunately not translation equivariant since the voxel grid is fixed and points can be assigned to different discretization bins.
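The contrast with voxelization can be illustrated with a toy example. The sketch below uses a hypothetical `voxel_index` helper and an arbitrary 0.2 voxel size (both assumptions for illustration): two points that share a voxel are separated by a small shift of the whole cloud, while the offset between them, which is all a continuous operator sees, is unchanged.

```python
import numpy as np

def voxel_index(xyz, voxel_size):
    """Integer voxel coordinates of each point on a fixed grid (illustrative helper)."""
    return np.floor(xyz / voxel_size).astype(int)

points = np.array([[0.05, 0.0, 0.0],
                   [0.15, 0.0, 0.0]])
shift = np.array([0.1, 0.0, 0.0])

idx_before = voxel_index(points, 0.2)
idx_after = voxel_index(points + shift, 0.2)

# Both points fall in the same voxel before the shift but not after it.
same_before = np.array_equal(idx_before[0], idx_before[1])
same_after = np.array_equal(idx_after[0], idx_after[1])

# The relative offset between the two points is unaffected by the shift.
offset_before = points[1] - points[0]
offset_after = (points[1] + shift) - (points[0] + shift)
```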

$\mathcal{T}^{F}_{\Delta\mathbf{x}}(\cdot):F\mapsto F$ is a translation operator on F if $\mathcal{T}_{\Delta\mathbf{x}}(f)(\mathbf{x})=f(\mathbf{x}+\Delta\mathbf{x})$ for all $f\in F$ , $\mathbf{x}\in\textbf{dom}(f)$ .

Note that Proposition 1 generalizes to any coordinate system of the input points. In Cartesian coordinates of 3D space, $d=3$ and translation means the conventional translation, whereas translation of the angular coordinate in polar coordinates is equivalent to rotation in Cartesian coordinates.

When the inputs are point clouds, the input function is discretized as an input array, where each entry stores the $D^{\prime}$ -dimensional features of a point. The output of the convolution operation is an array storing the $D$ -dimensional output features of each point. Permutation equivariance ensures that the ordering of the points in the input does not affect the output. PointNet aggregates information using a global max-pooling, which is permutation-invariant but not equivariant. As a result, it cannot aggregate local neighborhood information.

We first define the permutation operator on the set of functions with integer domain. Note that all arrays can be represented as a function that maps from positive integers to numbers.
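Both properties discussed in this section can be checked numerically on random data. The sketch below is an illustration under assumptions, not a proof and not the authors' code: `deform_conv` is a compact, hypothetical re-implementation that assumes a tri-linear kernel over a 3 $\times$ 3 $\times$ 3 anchor grid centered at the origin, and evaluates the output at the input points themselves.

```python
import numpy as np

def deform_conv(xyz, feat, weights, anchors, unit, radius):
    """out[i] = sum_{j in N(i)} sum_a k(x_i - x_j, a) * feat[j] @ weights[a]."""
    off = xyz[:, None, :] - xyz[None, :, :]               # (N, N, 3) offsets y - x
    near = np.linalg.norm(off, axis=-1) <= radius         # (N, N) neighborhood mask
    k = np.clip(1.0 - np.abs(off[:, :, None, :] - anchors) / unit, 0.0, None).prod(-1)
    k = k * near[:, :, None]                              # (N, N, A) masked weights
    return np.einsum("ija,jc,aco->io", k, feat, weights)

rng = np.random.default_rng(0)
xyz = rng.uniform(-1.0, 1.0, size=(32, 3))
feat = rng.normal(size=(32, 4))
r = np.array([-1.0, 0.0, 1.0]) * 0.2
anchors = np.stack(np.meshgrid(r, r, r, indexing="ij"), axis=-1).reshape(-1, 3)
weights = rng.normal(size=(27, 4, 5))

out = deform_conv(xyz, feat, weights, anchors, 0.2, 0.7)

# Translation: shifting every point leaves each point's output features unchanged,
# so the output signal simply moves with the points.
shifted = deform_conv(xyz + np.array([3.0, -1.5, 0.25]), feat, weights, anchors, 0.2, 0.7)
translation_ok = np.allclose(out, shifted)

# Permutation: reordering the input points reorders the output rows the same way.
perm = rng.permutation(len(xyz))
permuted = deform_conv(xyz[perm], feat[perm], weights, anchors, 0.2, 0.7)
permutation_ok = np.allclose(out[perm], permuted)
```

The radius of 0.7 is chosen here so that the ball contains the full support of the tri-linear kernel (a box of half-width 0.4), which keeps the check insensitive to floating-point rounding at the neighborhood boundary.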

Experimental Evaluation

We verify the effectiveness of our proposed operator on a suite of point cloud benchmarks and tasks including semantic segmentation, object detection, and object classification.

First, we verify the usefulness of the proposed deformable filter convolution on the task of LiDAR semantic segmentation. We report results on the TOR4D dataset, which consists of 1,239,706 frames in the training set, 123,975 frames in the validation set and 123,475 frames in the test set. The dataset contains 7 object classes: “vehicle”, “bicyclist”, “pedestrian”, “motorcycle”, “background”, “animal”, and “road”. We omit the “animal” class for evaluation due to the small number of examples.

We adopt a U-Net architecture to extract voxelized features and then add our point convolution layer on top. One baseline approach is to use a voxelized network only, which limits the precision of the output, since points belonging to the same voxel receive the same output. To predict at the point level, one needs to fuse the voxelized features onto point clouds. We consider the continuous fusing operator (ContFuse) proposed in as a strong baseline. This operator is similar to what has been done in PointNet++, where each point first queries a local neighborhood, then passes the neighboring point features plus coordinate offsets through an MLP, and finally averages the neighborhood features together. Note that our approach is much more memory efficient than the ContFuse operation, since we do not need to tile the neighboring point features into a larger tensor. PCC also adopts the same setting, where they found the best performance when their continuous convolution layers are composed on top of the voxelized features. Figure 3 illustrates the overall network architecture, where the point-based convolution header takes inputs from a voxelized feature extractor.
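To make the memory argument concrete, the following NumPy sketch shows the general query, MLP, and average pattern described above. It is an illustration of that pattern only, not the ContFuse implementation: the two-layer MLP, the brute-force nearest-neighbor search, and all names and sizes are assumptions. The point to notice is the intermediate `tiled` tensor of shape (Q, k, C + 3), which grows with the neighborhood size.

```python
import numpy as np

def knn_indices(query, points, k):
    """Indices of the k nearest points for each query (brute force)."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (Q, N)
    return np.argsort(d2, axis=1)[:, :k]                           # (Q, k)

def mlp_fuse(query, points, feats, w1, b1, w2, b2, k=16):
    """Gather neighbor features plus offsets, apply a shared MLP, then average."""
    idx = knn_indices(query, points, k)                  # (Q, k)
    offsets = points[idx] - query[:, None, :]            # (Q, k, 3)
    tiled = np.concatenate([feats[idx], offsets], -1)    # (Q, k, C + 3) tiled tensor
    hidden = np.maximum(tiled @ w1 + b1, 0.0)            # shared layer with ReLU
    return (hidden @ w2 + b2).mean(axis=1)               # average over the neighborhood

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(50, 3))
feats = rng.normal(size=(50, 8))
w1, b1 = rng.normal(size=(11, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 7)), np.zeros(7)
fused = mlp_fuse(points[:5], points, feats, w1, b1, w2, b2, k=16)
```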

Implementation details:

The voxel feature extractor uses a standard U-Net with channels in the encoder and decoder. Each encoder block contains convolution, batch norm, ReLU, and max pooling layers. The decoder has a symmetric structure, with max-pooling replaced by bilinear interpolation to upsample the feature map. Skip connections are added between each pair of corresponding layers connecting the encoder and decoder. Similar to , the ContFuse baseline has 7 blocks in the point-based header, each containing an MLP of size , where the inputs are concatenated with the offset coordinates of the neighborhood (3-dimensional) and the output classes (7 plus none of the above). There are skip connections combining the outputs of these blocks. Our deformable version has 2 blocks, each with channels, and skip connections are also applied on the output channels. We used 3 $\times$ 3 $\times$ 3 convolution filters with filter grid unit length 0.2m and neighborhood size 16. We used the Adam optimizer with initial learning rate 1e-4, weight decay 5e-4, batch size 16 and 0.1 $\times$ learning rate decay at 50k and 100k iterations. The baseline U-Net was trained for 450k iterations, and ContFuse and our deformable filter version were trained for 115k iterations.
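The step schedule described above can be written as a small function. This is a sketch with a hypothetical `step_lr` helper; whether the decay takes effect exactly at the milestone iteration or on the following one is an assumption here.

```python
def step_lr(iteration, base_lr=1e-4, milestones=(50_000, 100_000), gamma=0.1):
    """Learning rate after multiplying by gamma at each milestone already reached."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed

# Learning rates at the start, after the first decay, and near the end of training.
lrs = [step_lr(0), step_lr(50_000), step_lr(114_999)]
```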

Results and discussion:

Results on the TOR4D test set are shown in Table 1, and qualitative visualizations of the output are shown in Figure 7. Our joint network using deformable filter layers significantly surpasses the baseline, achieving state-of-the-art performance. Notably, our deformable filter model outperforms the ContFuse baseline, which uses an MLP to aggregate point cloud neighborhood information. To understand filter activations, we also plot the most activated region of each filter channel using guided backprop, shown in Figure 6. The regions that activate individual neurons the most roughly correspond to the semantic classes.

KITTI BEV object detection

We evaluate our deformable filter convolution on the KITTI BEV object detection benchmark, consisting of 7,481 training and 7,518 testing frames of LiDAR point clouds. Half of the training data is held out for validation. Detectors for the “car”, “pedestrian” and “cyclist” categories are trained and evaluated individually.

We adopt the HDNet architecture proposed in , one of the top-performing object detectors on this benchmark. HDNet is a voxelized network that performs regular 2D convolution on BEV with the $z$ -axis as the channel dimension. It also predicts the map information with a pretrained module. To preserve local geometric details and to provide extra information to the voxelized backbone network, we add a deformable filter input branch that processes raw LiDAR points, and the output of the branch is fused with the backbone network. As shown in Figure 4, we add 3 layers of deformable filter convolution to process the point cloud, with channel dimension respectively. The last deformable filter layer samples the output points at the voxel centers so that we can concatenate the feature with the voxel input branch. We compare our proposed operator with parametric continuous convolution (PCC), which uses an MLP to predict the convolution filter weights. The MLP takes 3D coordinate inputs and outputs the element-wise separable filter weights.

The backbone network consists of five residual blocks with convolution layers with channel dimensions respectively. The initial convolution for each residual block has stride 2 to downsample the feature map. The grid unit lengths are 0.15m for car, 0.5m/0.2m/0.1m/0.05m for pedestrian, and 0.33m for cyclist. We use 3 $\times$ 3 $\times$ 3 filters for car and pedestrian, and 7 $\times$ 7 $\times$ 7 filters for cyclist. NMS thresholds are 0.1, 0.3, and 0.5 for car, pedestrian, and cyclist, respectively. For data augmentation we apply random scaling of 0.9 $\times$ to 1.1 $\times$ , random rotation of -5 to 5 degrees along the $z$ -axis, and random translation of -5 to 5 meters along the $x$ and $y$ axes. Models are trained using SGD with momentum for 50 epochs with mini-batch size 16. The initial learning rate is 0.01 with 0.1 $\times$ learning rate decay at the 30th and 45th epochs; the weight decay constant is 2e-4.
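The augmentation described above can be sketched as a single transform on the point coordinates. The hypothetical `augment` function below assumes uniform sampling of each parameter and a scale, rotate, translate order, neither of which is stated in the text; in a detection pipeline the ground-truth boxes would need the same transform applied.

```python
import numpy as np

def augment(points, rng):
    """Randomly scale, rotate about z, and shift in x/y an (N, 3) point array."""
    scale = rng.uniform(0.9, 1.1)                        # 0.9x to 1.1x scaling
    theta = np.deg2rad(rng.uniform(-5.0, 5.0))           # -5 to 5 degrees about z
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    shift = np.array([rng.uniform(-5.0, 5.0),            # -5 to 5 meters in x
                      rng.uniform(-5.0, 5.0),            # -5 to 5 meters in y
                      0.0])                              # no shift along z
    return (scale * points) @ rot_z.T + shift

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3)) + np.array([0.0, 0.0, 5.0])
augmented = augment(cloud, rng)
```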

Results and discussion:

In Table 2 we show results on the KITTI validation set, where we compare our proposed operator with the HDNet baseline and PCC. Overall, our method has the best performance across all three categories, especially on pedestrian and cyclist. Notably, PCC has a negative impact on the cyclist detector, possibly due to less training data and sparser point clouds for positive examples, whereas our method delivers a significant gain over the baseline network. In Figure 5, we visualize the 3D filters learned on our cyclist detector, compared to the ones learned by PCC. Our filters learn more expressive 3D shapes that are not linearly separable.

ModelNet40 classification

To further verify the effectiveness of our proposed operator, we evaluate on ModelNet , a standard point cloud classification benchmark. ModelNet contains CAD models of 40 categories, 9,843 shapes for training and 2,468 for testing.

We modified the PointNet and PointNet++ architectures to incorporate our deformable filter layers. The inputs to these networks are $xyz$ coordinates (3 channels). In the original PointNet, we replace the pointwise fully connected layers with deformable filter convolution layers. The resulting architecture has the same number of channels. In PointNet++, since our convolution layer operates on a different neighborhood compared to the sampling and grouping layers, we augment the original architecture with a residual branch that contains the deformable filter layers, with the same number of hidden units as the MLP in the original network. We use 3 $\times$ 3 $\times$ 3 convolution filters with filter grid unit length 0.2 for the PointNet architecture, and [0.2, 0.4, 0.8] for each downsample stage of the PointNet++ architecture. We fix the neighborhood size to be 8. We use mini-batch size 16, and base learning rate 1e-3 with an exponential decay of 0.7 $\times$ every 20 epochs. In both experiments, we use the standard 1,024 points obtained by farthest point sampling as inputs to the network, and for fair comparison with baselines we do not aggregate the final prediction from multiple votes.
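For readers unfamiliar with the sampling step, a minimal sketch of greedy farthest point sampling is given below. It is a generic illustration of the technique, not the sampling code used in the experiments; the function name and the choice of starting index are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick indices so that each new point is farthest from those chosen."""
    chosen = np.empty(num_samples, dtype=int)
    chosen[0] = start
    min_d2 = np.full(len(points), np.inf)   # squared distance to nearest chosen point
    for i in range(1, num_samples):
        d2 = ((points - points[chosen[i - 1]]) ** 2).sum(axis=1)
        min_d2 = np.minimum(min_d2, d2)
        chosen[i] = int(np.argmax(min_d2))
    return chosen

# Example: reduce a random cloud of 2,000 points to 1,024 samples.
rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(2000, 3))
sample_idx = farthest_point_sampling(cloud, 1024)
sampled = cloud[sample_idx]
```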

Results and discussion:

As shown in Tables 3 and 4, by simply adding our deformable filter layers to standard PointNet-based architectures, we observe a reasonable increase in performance, comparable to other competitive approaches using point cloud inputs.

Conclusion

This paper presents deformable filter convolution, a learnable convolution layer that combines voxel-like filtering and spatially continuous reasoning. It naturally augments existing top-performing voxel-based network architectures by fusing point features into input and output branches. We show a significant gain in performance of the joint network, compared to voxel-only baselines and other point-based convolution approaches. The proposed convolution operator enables us to achieve state-of-the-art performance on LiDAR semantic segmentation. As future work, we plan to integrate our proposed convolution layer as a fundamental building block of end-to-end point-based networks, without resorting to voxelized feature maps.

Appendix A Technical proofs

Proposition 1 (Translation equivariance).

Proposition 2 (Permutation equivariance).

Appendix B KITTI detection results

Figure 8 shows vehicle detection results on the KITTI dataset.