PointSIFT: A SIFT-like Network Module for 3D Point Cloud Semantic Segmentation

Mingyang Jiang, Yiran Wu, Tianqi Zhao, Zelin Zhao, Cewu Lu

Introduction

3D point cloud understanding is a long-standing problem. Typical tasks include 3D object classification , 3D object detection and 3D semantic segmentation . Among these tasks, 3D semantic segmentation which assigns semantic labels to points is relatively challenging. Firstly, the sparseness of point cloud in 3D space makes most spatial operators inefficient. Moreover, the relationship between points is implicit and difficult to be represented due to the unordered and unstructured property of point cloud. In retrospect of previous work, several lines of solutions have been proposed to resolve the problem. In handcrafted voxel feature is used to model geometric relationship, and in 2D CNN features from RGBD images are extracted. Additionally, there is a dilemma between 2D convolution and 3D convolution: 2D convolution fails to capture 3D geometry information such as normal and shape while 3D convolution requires heavy computation.

Recently, PointNet architecture directly operates on point cloud instead of 3D voxel grid or mesh. It not only accelerates computation but also notably improves the segmentation performance. In this paper, we also work on raw point clouds. We get inspiration from the successful feature detection algorithm Scale-invariant feature transform (SIFT) which involves two key properties: scale-awareness and orientation-encoding. Believing that the two properties should also benefit 3D shape description, we design a novel module called PointSIFT for 3D understanding that possesses the properties. The main idea of PointSIFT is illustrated in Figure 1. Unlike SIFT algorithm which uses handcrafted features, our PointSIFT is a parametric deep learning module which can be optimized and adapted to point cloud segmentation.

The basic building block of our PointSIFT module is an orientation-encoding (OE) unit which convolves the features of nearest points in 8 orientations. In comparison to K-nearest neighbor search in PointNet++ where $K$ neighbors may fall in one orientation, our OE unit captures information of all orientations. We further stack several OE units in one PointSIFT module for representation of different scales. In order to make the whole architecture scale-aware, we connect these OE units by shortcuts and jointly optimize for adaptive scales. Our PointSIFT module receives point cloud with $n$ features each point and outputs points of $n$ features with better representation power. PointSIFT is a general module that can be integrated into various PointNet-based architectures to improve 3D shape representation.

We further build a hierarchical architecture that recursively applies the PointSIFT module as local feature descriptor. Resembling conventional segmentation framework in 2D and 3D , our two-stage network first downsamples the point cloud for effective calculation and then upsamples to get dense predictions. The PointSIFT module is used in each layer of the whole framework and significantly improves the representation ability of the network.

Experimental results show that our architecture based on PointSIFT module outperforms state-of-the-art methods on S3DIS (relative $\mathbf{12\%}$ improvement) and ScanNet dataset(relative $\mathbf{8.4\%}$ mean IoU improvement).

Related Work

Deep learning on 3D data is a growing field of research. We investigate segmentation methods on several important 3D representations. Point cloud segmentation is discussed in more detail. After that, we briefly survey the SIFT descriptor where we borrow inspiration.

The first attempt to apply deep learning is through volumetric representation. Many works attempt to voxelize 3D point cloud or scene into regular voxel grids. However, the main challenges in volumetric representation are data sparsity and computational overhead of 3D convolution. A practical choice for resolution of voxel grid can be $32\times 32\times 32$ , which is far from sufficient to faithfully represent complex shapes or scenes. Further, conversion is required to construct volumetric grids based on handy data format such as point cloud, which suffers from both truncation error and information loss. While some recent work propose techniques to address the sparsity issue (e.g., octree data structure), the performance of volumetric methods is still not on par with methods based on point cloud .

Some literature focuses on the use of graph Laplacian to process meshes. Further, functional map and cycle consistency helps build correspondence between shapes. However, this kind of methods are confined to manifold meshes.

make an effort to exploit the strong capacity of 2D CNNs in 3D recognition. In these works, in order for the input to fit in 2D CNNs, the projection of 3D shapes to 2D images is required. The 3D-level understanding of object (or scene) is achieved by combining 2D images taken from various viewpoints. However, such kind of projection results in the loss of most crucial and discriminative geometric details. For example, calculating normal vectors becomes nontrivial, spatial distance is not preserved, and occlusion prevents a holistic understanding of both local and global structure. Failure to include those geometric details could substantially limit the performance in tasks such as shape completion and semantic segmentation.

2 Deep Learning on Point Cloud

Recently, a series of works propose several effective architectures that process point cloud directly. Among those, a big branch of works apply PoinetNet as an unordered global or local descriptor. PointNet is a pioneering effort that applies deep learning to unordered point clouds by point-wise encoding and aggregation through global max pooling. PointNet++ proposes a hierarchical neural network to capture local geometric details. PointCNN uses $\mathcal{X}$ -Conv layer instead of vanilla mlp (multilayer perceptron) layer to exploit certain canonical ordering of points. Dynamic Graph CNN (DGCNN) suggests an alternative grouping method to ball query used in PointNet++ : KNN w.r.t. Euclidean distance between feature vectors. Superpoint Graphs (SPG) first partitions point cloud into superpoints and embed every superpoint with shared PointNet. The semantic labels of superpoints are predicted from the PointNet embedding of current superpoint and spatially neighboring superpoints. While achieving leading results, we feel that segmentation algorithms could benefit from having ordered operators to some extent.

Another branch of work focuses on rotation equivariance or invariance. G-CNN designs filter so that the filter set is closed under certain rotations (e.g., 90-degree rotation) to achieve rotation invariance for those fixed rotations. This kind of method achieves exact rotation invariance only for some discrete rotations while introduces huge computational overhead. The time complexity is proportional to the cardinality of equivalence classes under the rotations one want to be equivariant of, making it impractical to consider rotation equivariance for large and general groups. Spherical CNN projects 3D shapes onto spheres and process the signal with spherical filters for rotation equivariant representation. Global max pooling that transforms sphere to a single value further achieves rotation invariance. While spherical CNNs is fully invariant to rotation, the projection of shapes onto sphere introduces large error and is not appropriate for objects with certain topological property or scenes. Besides, the operation that helps achieve rotation invariance substantially limits the model capacity, discouraging its use on segmentation tasks.

3 Scale-Invariant Feature Transform (SIFT)

SIFT is a local image pattern descriptor widely used in object recognition, 3D modeling, robotics and various other fields. SIFT and its variants consist of two entities, a scale-invariant detector and a rotation-invariant descriptor.

We borrow inspiration from both entities in SIFT algorithm. In keypoint detection stage, the SIFT algorithm achieves scale invariance with multi-scale representation which we also use for robustly processing objects of various scale. As for feature description stage, SIFT detects dominant orientations for rotation invariance and comprehensively perceives image pattern in different orientations. Given the fact that ordered descriptors like the descriptor of SIFT or kernels of CNN yield impressive results for 2D images, we expect that having such descriptors may also benefit representation of point cloud. The above two observations from SIFT algorithm lead us to design a scale-aware descriptor that encodes information from different orientations with ordered operations.

Problem Statement

The objective of segmentation algorithms are finding optimal function that gives accurate semantic labels.

Several properties of point set $P$ have been emphasized in previous work . The density of $P$ may not be uniform everywhere, and $P$ can be very sparse. Moreover, $P$ as a set is unordered and unstructured which distinguishes point cloud with common sequential or structured data like image or video.

Our Method

Our network follows a encode-decode (downsample-upsample) framework similar to general semantic segmentation network for point cloud segmentation (Illustrated in Figure 2) . In the downsampling stage, we recursively apply our proposed PointSIFT module combined with set abstraction (SA) module introduced in for hierarchical feature embedding. For upsampling stage dense feature is enabled by effectively interleaving feature propagation (FP) module with PointSIFT module. One of our main contribution and core component of our segmentation network is the PointSIFT module which is endowed with the desired property of orientation-encoding and scale-awareness.

Given an $n\times d$ matrix as input which describes a point set of size $n$ with $d$ dimension feature for every point, PointSIFT module outputs an $n\times d$ matrix that assigns a new $d$ dimension feature to every point.

Inspired by the widely used SIFT descriptor, we seek to design our PointSIFT module as a local feature description method that models various orientations and is invariant to scale.

Local descriptors in previous methods typically apply unordered operation (e.g., max pooling ) based on the observation that point cloud is unordered and unstructured. However, using ordered operator could be much more informative (max pooling discards all inputs except for the maximum) while still preserves the invariance to order of input points. One natural ordering for point cloud is the one induced by the ordering of the three coordinates. This observation leads us to the Orientation-encoding(OE) unit which is a point-wise local feature descriptor that encodes information of eight orientations.

The first stage of OE embedding is Stacked 8-neighborhood(S8N) Search which finds nearest neighbors in each of the eight octants partitioned by ordering of three coordinates. Since distant points provides little information for description of local patterns, when no point exists within searching radius $r$ in some octant, we duplicate $p_{0}$ as the nearest neighbor of itself.

We further process features of those neighbors which resides in a $2\times 2\times 2$ cube for local pattern description centering at $p_{0}$ . Many previous works ignore the structure of data and do max pooling on feature vectors along $d$ dimensions to get new features. However, we believe that ordered operators such as convolution can better exploit the structure of data. Thus we propose orientation-encoding convolution which is a three-stage operator that convolves the $2\times 2\times 2$ cube along $X$ , $Y$ , and $Z$ axis successively. Formally, the features of neighboring points is a vector $V$ of shape $2\times 2\times 2\times d$ , where the first three dimensions correspond to three axes. Slices of vector $M$ are feature vectors, for example $M_{1,1,1}$ represents the feature from top-front-right octant.The three-stage convolution is formulated as:

1.2 Scale-awareness

In order for our PointSIFT module to be scale-aware, we follow the long-standing method of multi-scale representation by stacking several Orientation-encoding (OE) units in PointSIFT module, as Figure 4 illustrated. Higher level OE units have larger receptive field than those of lower level. By constructing a hierarchy of OE units, we obtain a multi-scale representation of local regions in the point cloud. The features of various scales are then concatenated by several identity shortcuts and transformed by another point-wise convolution that outputs a $d$ dimensional multi-scale feature. In the process of jointly optimizing feature extraction and the point-wise convolution that integrates multi-scale feature, neural networks will learn to select or attend to appropriate scales which makes our network scale-aware.

The fact that the input and output vector of our PointSIFT module is of the same shape makes it convenient for our module to be integrated into other existing point cloud segmentation architectures. We are looking forward to future applications of PointSIFT module, probably not restricted to point cloud segmentation domain.

2 Overall Architecture

We first revisit set abstraction (SA) and feature propagation (FP) module in PointNet++, thus present how to construct our model by SA, FP and pointSIFT modules.

Set abstraction (SA) and feature propagation (FP) modules are proposed in PointNet++ that correspond to downsampling and upsampling of point cloud respectively. We give a very brief introduction of SA and FP module here.

A set abstraction module takes in $N\times d$ input standing for a point cloud of $N$ points and $d$ dimensional feature each point. The output is $N^{\prime}\times d^{\prime}$ which corresponds to $N^{\prime}$ downsampled points each with $d^{\prime}$ dimensional feature. The downsampling is implemented by finding $N^{\prime}$ centroids with farthest point sampling, assigning points to centroids and then calculate embedding of centroids by feeding feature of assigned points through shared PointNet.

The feature propagation module uses linear interpolation weighted by distances to upsample the point cloud. It receives input point set of size $N$ and outputs an upsampled set of size $N^{\prime}$ where feature dimensionality is kept same. The upsampling process takes points that are dropped during downsampling, and assign features to them based on features of $k$ nearest points that are not dropped weighted by Euclidean distance in 3D space.

2.2 Architecture Details

The input of our architecture is the 3D coordinates (or concatenating with RGB value) of $8192$ points. In the downsampling stage, following , multi-layer perceptron (MLP) is used to transform the input 3D (or 6D) vectors into features with $64$ dimensions. Three consecutive downsampling (set abstraction, SA) operations shrink the size of the point set to $1024$ , $256$ , $64$ respectively. For the upsampling part, we use feature propagation(FP) module as proposed by for dense feature and prediction. The point set is lifted to $256$ , $1024$ , $8192$ points respectively by three FP layers which are aligned to its counterpart in the downsampling stage. Our PointSIFT module is inserted between all adjacent SA and FP layers. Finally, point features of the last upsampling layer pass through a fully connected layer for semantic label prediction.

Moreover, we insert FP modules not only between downsampling layers but also from downsampling layers to its counterpart in the encoding stage. We call these links FP-shortcuts for they resemble shortcuts by linking corresponding downsampling and upsampling layers and complements low-level information that might be lost during downsampling. The FP-shortcuts lead to much faster convergence which is proved by many prior works that use such shortcut flavor connections . The prediction accuracy is also improved by a considerable margin which is also reported in many works, e.g., residual networks .

Experiments

Our experiment consists of two parts: verifying the effectiveness of the OE unit and PointSIFT module (Section 5.1) and introducing the results on semantic segmentation benchmark datasets (Section 5.2). In what follows, SA and FP are set abstraction and feature propagation module respectively, which are proposed by .

We apply stacked 8-neighborhood search in OE unit, which is fundamentally different from ball query search proposed in PointNet++ . The main difference lies in the fact that S8N search finds neighbors in each of 8 octants, while ball query searches for global nearest neighbors. As Figure 7 suggests, the global nearest neighbor search can result in a neighbor set of homogeneous points which is less informative than searching in 8 directions.

To justify our S8N search, we substitute ball query with S8N search in a lightweight version of PointNet++ and compare the performance. The detailed architecture of the network is elaborated in Table 3. For fairness, we set the number of neighbors found by the two neighbor search method to be the same. The results are shown in Figure 6 which demonstrates the effectiveness of our S8N grouping plus OE Convolution.

Another observation that justifies the PointSIFT module is that more points from individual input point cloud contribute to the final representation for PointSIFT. In PointNet++ , the grouping layer fails to assign some points in the point cloud to any centroid and thus lost the information from the unassigned points. We claim the PointSIFT Module can almost avoid information loss in the downsampling process, which is beneficial to semantic segmentation. To prove this, we conduct an experiment on ScanNet dataset compared to PointNet++ . Given an input of size 8192, the first step is downsampling 8192 points to 1024 points. The method of selects 1024 centroids and groups 32 nearest points inside the searching radius.

Our results show that in PointNet++, 1622 points on average are not grouped in 8192 points. That is, information of about $20\%$ points are not integrated into the downsampled representation. On the contrary, our PointSIFT module involves more points in computation by performing multiple orientation-encoding(OE) convolutions in OE units. By inserting PointSIFT module before each downsampling layer, our results show that all points can be processed and make a contribution to final predictions. The results are reported in Table 1.

1.1 Effectiveness of Scale Awareness

We design a toy experiment to verify the effectiveness of scale-awareness in our framework. The experiment setting is that we generate 10000 simple shapes (e.g., spheres, cuboids) with different scales, train our framework on generated data. Then we test if the activation magnitude of PointSIFT modules in different layer for certain shape is aligned to the scale of the shape. As aforementioned, different layers in PointSIFT module correspond to different scales. So if such alignment exists we can conclude that the network is aware of scale. It turns out that 89% of the time, the position of PointSIFT module with the highest activation in the hierarchy is aligned with the relative scale of input shape w.r.t max and min scale. This toy experiment demonstrates that the proposed PointSIFT framework is aware of scale in some sense.

2 Results on Semantic Segmentation Benchmark Datasets

ScanNet is a scene semantic labeling task with a total of 1513 scanned scenes. We follow , use 1201 scenes for training and reserve 312 for testing without RGB information. The result is reported in Table 4. Compared with other methods, our proposed PointSIFT method achieves better performance in the sense of per-voxel accuracy. Moreover, our method outperforms PointNet++ even though we do not use multi-scale grouping (MSG) in SA module which helps address the issue of non-uniform sampling density. The result demonstrates the advantage of our PointSIFT module in point searching and grouping.

The S3DIS dataset takes six folders of RGB-D point cloud data from three different buildings (including 271 rooms). Each point is annotated with labels from 13 categories. We prepare the training dataset following : we split points by room and sample rooms into 1m $\times$ 1m blocks. As have been used in , we use k-fold strategy for train and test. Overall Accuracy and mIoU are shown in Table 5. Our PointSIFT architecture outperforms other methods. Per-category IoUs are shown in Table 2, our method improves remarkably on results of semantic segmentation task for this dataset and wins in most of the categories. Our method can achieve great results in some hard categories that other methods perform poorly, i.e. about 11 mIoU points on the sofa and 42 mIoU points on board. Some results are visualized in Figure 5.

Conclusion

We propose a novel PointSIFT module and demonstrated significant improvement over state-of-art for semantic segmentation task on standard datasets. The proposed module is endowed with two key properties: First, orientation-encoding units capture information of different orientations. Second, multi-scale representation of PointSIFT modules enables the processing of objects with various scale. Moreover, an effective end-to-end architecture for point cloud semantic segmentation is proposed based on PointSIFT modules. We also conduct comprehensive experiments to justify the effectiveness of the proposed PointSIFT modules.