Prototype Adaption and Projection for Few- and Zero-shot 3D Point Cloud Semantic Segmentation

Shuting He, Xudong Jiang, Wei Jiang, Henghui Ding

I Introduction

Point cloud semantic segmentation aims to classify every point in a given 3D point cloud representation of a scene. It is one of the essential and fundamental tasks in computer vision and is in high demand for many real-world applications, e.g., virtual reality, self-driving vehicles, and robotics. Driven by large-scale datasets and powerful deep learning technologies, fully supervised 3D point cloud semantic segmentation methods have demonstrated significant achievements in recent years. Nevertheless, it is laborious and expensive to build large-scale segmentation datasets with point-level annotations. Besides, additional annotated samples for novel categories and fine-tuning/re-training operations are required when extending the trained segmentation model to novel categories. To address these issues, few-shot 3D point cloud semantic segmentation has been proposed and has attracted much attention. Few-shot point cloud segmentation aims to generate a mask for the unlabeled point cloud of a query sample based on the clues of a few labeled support samples. It greatly eases the heavy demand for large-scale datasets and demonstrates good generalization capability on novel categories.

A critical challenge of few-shot 3D point cloud segmentation lies in how to effectively classify every point using the limited support information. Methods in 2D few-shot segmentation extract discriminative and representative support features as prototypes (feature vectors) to guide the segmentation of query images, which has achieved significant results. However, the success of few-shot semantic segmentation in 2D computer vision is driven by pre-training on large-scale datasets like ImageNet. A feature extractor pre-trained on large-scale datasets greatly helps few-shot learning by generating a good feature representation of objects. In contrast, the development of 3D deep learning is hindered by the limited volume and instance modality of current datasets, due to the significant cost of 3D data collection and annotation. This results in less representative features and large feature variation in few-shot 3D point cloud segmentation, even for intra-class samples. Therefore, the prototypical methods that work well in 2D few-shot segmentation are ineffective with the less well pre-trained networks available for 3D point clouds. To address this issue, we propose a Query-Guided Prototype Adaption (QGPA) module that adapts the prototypes obtained from support samples to the feature space of the query sample. Specifically, the proposed QGPA learns the feature distribution mapping via cross attention between support and query features on the channel dimension, which produces a feature channel adaptation that converts the prototypes from support feature space to query feature space. Propagating the channel-wise distribution from support feature space to query feature space for prototypes smooths the channel distribution discrepancy. With such prototype adaptation, we greatly alleviate the feature variation issue in point clouds and significantly improve the performance of few-shot 3D point cloud semantic segmentation.

Moreover, it is worthwhile to optimize prototype generation so as to enhance the category-related features, as more representative and discriminative prototypes are the foundation for the success of subsequent adaptation and segmentation. If the prototype obtained from the support feature is not an apposite representative, it can hardly transfer informative clues to the query sample. Meanwhile, although the proposed QGPA greatly narrows the feature gap between prototypes and query features, the use of prototype adaptation reduces the influence of query mask supervision on prototype generation from the support set. Hence, we propose a Self-Reconstruction (SR) module, which enables prototypes to reconstruct the support masks, to strengthen the representation of prototypes. Specifically, after obtaining the prototype by average pooling over the features of points indicated by the support mask, we apply the prototype back to the support features to reconstruct the support mask and employ explicit supervision on this self-reconstruction process. Such a simple self-reconstruction plays an important regularization role in the whole few-shot point cloud segmentation task by enhancing the discriminative and semantic information embedded in prototypes.

Finally, although previous approaches and the method proposed above reduce the number of required annotated samples via meta-learning, point-wise segmentation masks are still necessary. In some practical application scenarios, we may only have the category name of interest but no corresponding images or masks. Thus, in this work, we propose to go a step further and discard support masks, i.e., to jointly consider few-shot and zero-shot 3D point cloud segmentation, as shown in Fig. 1. To this end, we introduce semantic information, e.g., the words of the category name, to indicate the target categories and propose a Semantic Projection Network that bridges the semantic and visual features. Our projection network takes the semantic embedding as input and outputs a projected prototype, supervised by real prototypes from point clouds. During testing, besides obtaining prototypes via the support branch with densely annotated support masks, prototypes can alternatively be obtained by inputting semantic words into our proposed projection network.

In a nutshell, the main contributions of our work are:

We propose an efficient and effective Query-Guided Prototype Adaption (QGPA) module that propagates the channel-wise features from support sample prototypes to query feature space, thereby mapping the prototypes into the query feature space. Prototypes are thus endowed with better adaption ability to mitigate channel-wise intra-class sample variation.

We introduce a Self-Reconstruction (SR) module that forces each prototype to reconstruct the support mask from which it was generated, which greatly helps the prototype preserve discriminative class information.

We design a semantic projection network to produce prototypes with the input of semantic words, which facilitates the inference without the use of support information.

We achieve new state-of-the-art performance on two few-shot point cloud segmentation benchmarks, S3DIS and ScanNet. Specifically, our method outperforms previous state-of-the-art methods by 7.90% and 14.82% under the challenging 2-way 1-shot setting on the S3DIS and ScanNet benchmarks, respectively.

II Related Work

3D point cloud semantic segmentation aims to label each point in a given 3D point cloud with the most appropriate semantic category from a set of predefined categories. Thanks to the great success of deep neural networks, most recent deep-learning-based approaches have achieved impressive improvements in point cloud segmentation performance. There are two mainstream families in point cloud segmentation: voxel-based and point-based methods. Point-based methods have attracted more and more attention because of their simplicity and effectiveness. PointNet, the first point-based method, proposes a novel neural network that segments point clouds directly and preserves the permutation invariance of the input well. DGCNN utilizes the EdgeConv module to capture local structures, which are neglected in PointNet. Although these approaches achieve promising segmentation performance, they cannot easily segment unseen categories without being fine-tuned on enough labeled data. In this work, we follow the structure of DGCNN to capture local structure features and propose our method to generalize to new classes with only a few annotated samples.

II-B Few-shot 3D Point Cloud Semantic Segmentation

Few-shot 3D point cloud semantic segmentation puts general 3D point cloud semantic segmentation in a few-shot scenario, where the model is endowed with the ability to segment novel classes with only a handful of support data. Zhao et al. propose attention-aware multi-prototype transductive inference (attMPTI), the first method to segment novel classes with a few annotated examples for few-shot point cloud semantic segmentation. However, attMPTI is complicated and time-consuming because it exploits multiple prototypes and requires graph construction, and it does not achieve impressive results. In this work, we deal with the few-shot point cloud semantic segmentation following the paradigm of . We explore how to mitigate the feature variation for objects with the same label but from different images via a simple and effective transformer design.

II-C Few-shot Learning and Zero-shot Learning

Few-shot learning focuses on learning a new paradigm for a novel task with only a few annotated samples. Existing work can be grouped into two main categories, based on metric learning and meta-learning networks, respectively. The core concept in metric learning is distance measurement between images or regions. For example, Vinyals et al. design matching networks that embed an image into a feature space and implement weighted nearest-neighbor matching for classifying unlabelled samples. Snell et al. introduce a prototypical network to build a metric space where an input is identified according to its distance from the class prototypes. Our work follows the prototypical network, but we apply it to the more challenging segmentation task with a simple yet effective design.

Zero-shot learning aims to classify images of unseen categories with no training samples by utilizing semantic descriptors as auxiliary information. There are two main paradigms: classifier-based methods and instance-based methods. Classifier-based methods aim to learn a good projection between visual and semantic spaces or to transfer the relationships obtained in semantic space to visual feature space. Instance-based methods, the other main branch, synthesize fake samples for unseen classes. The proposed semantic projection network bridges semantic prototypes and visual prototypes, and combines zero-shot learning with few-shot learning to flexibly handle cases with masks and without masks, which greatly eases the heavy demand for large-scale 3D datasets.

II-D Few-shot Segmentation

As an extension of few-shot classification, few-shot segmentation performs the challenging task of dense pixel classification on new classes with only a handful of support examples. Shaban et al. introduce this research problem for the first time and design a classical two-branch network following the pipeline of the Siamese network. Later on, PL introduces the concept of prototypical learning into the segmentation task for the first time, where predictions are generated according to the cosine similarity between pixels in the query image and prototypes generated from the support image and support mask. SG-One designs masked average pooling to generate object-related prototypes, which has become the cornerstone of subsequent methods. PANet extends this work to a more efficient design and proposes a prototype alignment regularization to make better use of support set information, achieving better generalization capability. CANet designs a two-branch architecture to perform multi-level feature comparison and embeds an iterative optimization module to refine the predicted results. PPNet and PMMs share the idea of decomposing objects into multiple parts and are capable of obtaining diverse and fine-grained representative features. PFENet generates training-free prior masks utilizing a pre-trained backbone, and alleviates the spatial inconsistency by enhancing query features with prior masks and support features.

However, existing methods have a common characteristic, i.e., the feature extractor pre-trained on large-scale datasets greatly affects the performance of few-shot learning. In contrast, the feature extractor for few-shot 3D point cloud segmentation cannot provide representative features for objects because it lacks pre-training on large-scale datasets, which is hindered by the limited volume and instance modality of 3D data. Consequently, existing popular prototypical methods in 2D few-shot classification/segmentation do not work well in the field of 3D point cloud segmentation. In this work, we tackle this issue by proposing a prototype adapter and self-reconstruction to project the prototype from the support point cloud feature space to the query point cloud feature space effectively.

III Approach

In this section, we present the proposed approach. We first give the task definition in Sec. III-A and the architecture overview of our proposed model in Sec. III-B. Then the proposed Query-Guided Prototype Adaption (QGPA), Self-Reconstruction, and Semantic Prototype Projection are presented in Sec. III-C, Sec. III-D, and Sec. III-E, respectively.

Each training/testing episode in few-shot point cloud segmentation instantiates a $C$ -way $K$ -shot segmentation learning task. There are $K$ $\langle$ point cloud, mask $\rangle$ pairs for each of the $C$ different categories in the support set, and every point in the query point cloud is classified into one of the $C$ categories or “background”, which does not belong to any of the $C$ categories. We denote the support set as $\mathcal{S}=\{(I_{S}^{c,k},M_{S}^{c,k})\}$ , where $I$ is a point cloud, $M$ is a mask, $k\in\{1,\cdots,K\}$ , and $c\in\{1,\cdots,C\}$ . At each episode, the inputs are $\{I_{S}^{c,k},M_{S}^{c,k},I_{Q}\}$ and the output is the segmentation prediction of the query point cloud $I_{Q}$ , of which the ground truth is $M_{Q}$ . During training, the model gains knowledge about the $C\in C_{train}$ classes from the support set and then applies the knowledge to segment the query set. After obtaining the segmentation model from the training set $D_{train}$ , the model’s few-shot segmentation performance is evaluated on the testing set $D_{test}$ . As each training episode contains different semantic categories, the model can generalize well after training.

We further introduce semantic words as an alternative way to provide support information for target categories. In our training stage, the support set is reformulated as $\mathcal{S}=\{(I_{S}^{c,k},M_{S}^{c,k},W_{S})\}$ , where $W_{S}$ is the semantic word set for the support point cloud. In the testing stage, unlike previous methods that strictly require point-level annotated masks, our approach can take only the semantic word related to the target of interest as input, i.e., $\mathcal{S}=\{W_{S}^{c}\}$ . The goal of our approach can be summarized as training a model that, when given a support set $\mathcal{S}$ with either annotated masks or semantic words that provide support information for $C$ classes, generates a segmentation mask with point-wise labels from the supported $C$ classes (and “background”) for the input query point cloud $I_{Q}$ .

III-B Architecture Overview

The overall training architecture of our proposed approach is shown in Figure 2. For each episode in the training stage, point clouds of the support set and query set are processed by a DGCNN backbone and mapped to deep features. To obtain prototypes from the support set, masked average pooling (MAP) is applied over the support features. Then, Query-Guided Prototype Adaption (QGPA) is utilized to rectify the feature channel distribution gap between query and support point clouds. The cosine similarity between each prototype and the query feature is computed to produce the score maps for generating the final mask prediction. Every point in the query point cloud is assigned the label of the most similar prototype. To preserve the class-related discriminative information embedded in the prototypes, the Self-Reconstruction (SR) module is introduced to obtain self-consistent, high-quality prototypes. Furthermore, a semantic projection network is proposed to project the semantic word embeddings to visual prototypes under a regression loss. During the inference stage, the proposed semantic projection network can take the place of the visual support branch to provide prototypes when there are no point-wise annotated masks as support information.
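The prototype-matching part of this pipeline (masked average pooling followed by nearest-prototype labeling under cosine similarity) can be illustrated with a minimal NumPy sketch. The function names, the toy 2-D features, and the epsilon terms are assumptions made for illustration; the DGCNN backbone, QGPA, and SR are deliberately left out.

```python
import numpy as np

def masked_average_pooling(features, mask):
    """Average the features of the points selected by a binary mask.

    features: (N, d) per-point features; mask: (N,) with 1 for selected points.
    """
    mask = mask.astype(features.dtype)
    # The small epsilon guards against an empty mask.
    return (features * mask[:, None]).sum(axis=0) / (mask.sum() + 1e-8)

def cosine_scores(features, prototypes):
    """Cosine similarity between every point (N, d) and every prototype (P, d)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    return f @ p.T  # shape (N, P)

# Toy 1-way 1-shot episode: 4 support points and 3 query points (hypothetical values).
support_feat = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_mask = np.array([1, 1, 0, 0])  # the first two points are foreground

fg_proto = masked_average_pooling(support_feat, support_mask)
bg_proto = masked_average_pooling(support_feat, 1 - support_mask)
prototypes = np.stack([bg_proto, fg_proto])  # index 0 = background, 1 = target class

query_feat = np.array([[2.0, 0.1], [0.0, 3.0], [1.0, 0.2]])
# Each query point takes the label of its most similar prototype.
pred = cosine_scores(query_feat, prototypes).argmax(axis=1)
```

Because cosine similarity ignores feature magnitude, the query point `[2.0, 0.1]` matches the foreground prototype even though its norm differs from that of the support features.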

III-C Query-Guided Prototype Adaption

Prototypes obtained from the support point clouds have a channel-wise feature distribution gap with the features of query point clouds, as discussed in Sec. I. Each sample has a different feature channel response distribution. This feature distribution gap is more pronounced in 3D point clouds than in 2D images, due to the lack of large-scale 3D datasets for pre-training the feature extractor. To rectify the feature distribution gap, we design Query-Guided Prototype Adaption (QGPA), which maps the prototype to the query feature space under the guidance of query-support feature interaction, as shown in Figure 3.

where $\operatorname{softmax}(\cdot)$ is a row-wise softmax function for attention normalization. This cross-attention map Attn establishes the channel-to-channel correspondence between query and support features, which guides the channel distribution propagation. Finally, a matrix multiplication is conducted between the Attn and the transpose of value to rectify the prototype to query feature channel distribution:
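To make the channel-wise attention concrete, the following NumPy sketch shows one way such an adaptation could be wired up. Everything specific here is an assumption for illustration: the projection matrices `w_q`, `w_k` (which map the point dimension to a hidden size) and `w_v`, the scaling by the square root of the hidden size, and the omission of residual connections, normalization, and feed-forward layers. It is not a reproduction of the module's exact equations.

```python
import numpy as np

def row_softmax(x):
    """Softmax applied independently to each row."""
    x = x - x.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def adapt_prototypes(query_feat, support_feat, prototypes, w_q, w_k, w_v):
    """Channel-wise cross attention (illustrative sketch).

    query_feat, support_feat: (N, d); prototypes: (P, d)
    w_q, w_k: (N, N_hidden); w_v: (d, d)
    Returns the adapted prototypes (P, d) and the attention map (d, d).
    """
    q = query_feat.T @ w_q    # (d, N_hidden): one row per query channel
    k = support_feat.T @ w_k  # (d, N_hidden): one row per support channel
    v = prototypes @ w_v      # (P, d)
    # Row i holds the weights of query channel i over all support channels.
    attn = row_softmax(q @ k.T / np.sqrt(q.shape[1]))  # (d, d)
    adapted = (attn @ v.T).T  # multiply Attn by the transpose of the value
    return adapted, attn

rng = np.random.default_rng(0)
n_points, d, n_hidden, n_protos = 16, 8, 4, 3  # toy sizes, not the paper's
query_feat = rng.normal(size=(n_points, d))
support_feat = rng.normal(size=(n_points, d))
prototypes = rng.normal(size=(n_protos, d))
w_q = 0.1 * rng.normal(size=(n_points, n_hidden))
w_k = 0.1 * rng.normal(size=(n_points, n_hidden))
w_v = np.eye(d)  # identity value projection keeps the example easy to inspect

adapted, attn = adapt_prototypes(query_feat, support_feat, prototypes, w_q, w_k, w_v)
```

With an identity value projection, every channel of an adapted prototype is a convex combination of the channels of the original prototype, which is one way to read the phrase "channel distribution propagation".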

where $\langle a,b\rangle$ represents the computation of cosine similarity between $a$ and $b$ , $\alpha$ is an amplification factor. The predicted segmentation mask is then given by

Learning proceeds by minimizing the negative log-probability

where $N$ is the total number of points, and ${M}_{Q}$ is the ground truth mask of query point cloud ${I}_{Q}$ .
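The three steps described in this subsection (similarity scoring, mask prediction, and the loss) can be written out as follows. This is a sketch consistent with the surrounding definitions rather than a reproduction of the original equations: the symbol $\hat{p}^{c}$ for the adapted prototype of class $c$ (with $c=0$ denoting background), the per-point index $n$, and the averaging over the $N$ points are assumptions.

$$S_{Q}^{n,c}=\frac{\exp\left(\alpha\langle F_{Q}^{n},\hat{p}^{c}\rangle\right)}{\sum_{j=0}^{C}\exp\left(\alpha\langle F_{Q}^{n},\hat{p}^{j}\rangle\right)},\qquad \hat{M}_{Q}^{n}=\arg\max_{c}S_{Q}^{n,c},\qquad \mathcal{L}_{seg}=-\frac{1}{N}\sum_{n=1}^{N}\sum_{c=0}^{C}\mathbb{1}\left[M_{Q}^{n}=c\right]\log S_{Q}^{n,c}$$

Since cosine similarity is bounded in $[-1,1]$, the amplification factor $\alpha$ widens the range of the logits so that the softmax can produce confident probabilities.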

III-D Self-Reconstruction

Although the refined prototypes obtained in Sec. III-C better fit the distribution of query features, they may lose the original critical class and semantic information that was learned from the support set. Additionally, it is crucial to extract more representative and discriminative prototypes from the support set, as this is the foundation for the success of subsequent adaptation and segmentation. Discriminative prototypes should carry the category information of the support point cloud, i.e., prototypes need to be able to reconstruct the ground-truth masks from themselves.

For each support feature $F_{S}^{c,k}$ and corresponding support mask $M^{c,k}_{S}$ , we calculate cosine similarity with softmax function between support feature $F_{S}^{c,k}$ and prototypes $\{p^{0},p^{c}\}$ , to get score map $S_{S}^{c,k}$ ,

The reconstructed support mask is given by

This reconstructed support mask $\hat{M}_{S}^{c,k}$ is expected to be consistent with the original support point cloud ground-truth mask $M^{c,k}_{S}$ . We call this process Self-Reconstruction (SR). The Self-Reconstruction loss $\mathcal{L}_{\text{sr}}$ is computed by minimizing the negative log-probability, similar to Eq. (9):

The final segmentation loss is the sum of $\mathcal{L}_{seg}$ and $\mathcal{L}_{sr}$ :

Without Self-Reconstruction as a constraint, prototypes may lose the original critical class and semantic information when aligning with the distribution of the query feature. Moreover, without it, the support information is not adequately utilized for few-shot learning. The proposed Self-Reconstruction module plays an important regularization role in the whole few-shot point cloud segmentation task by preserving the discriminative and semantic information embedded in prototypes. Besides, it provides a mechanism to balance prototype adaptation against maintaining the original separability, adding a further level of refinement to the process.
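The self-reconstruction idea can be sketched in a few lines of NumPy: the prototypes pooled from a support sample are applied back to that sample's own features, and the resulting prediction is scored against the support mask. The function name `prototype_nll`, the toy features, and the value `alpha=10.0` are assumptions for illustration only.

```python
import numpy as np

def prototype_nll(features, prototypes, labels, alpha=10.0):
    """Mean negative log-probability of `labels` under a softmax over
    alpha-scaled cosine similarities between points and prototypes."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = alpha * (f @ p.T)                           # (N, P)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

# Toy support sample (hypothetical values): two foreground, two background points.
support_feat = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_mask = np.array([1, 1, 0, 0])

# Original prototypes from masked average pooling: index 0 = background, 1 = class c.
prototypes = np.stack([
    support_feat[support_mask == 0].mean(axis=0),
    support_feat[support_mask == 1].mean(axis=0),
])

# Self-reconstruction: the prototypes must recover the mask that produced them.
loss_sr = prototype_nll(support_feat, prototypes, support_mask)

# Swapping the two prototypes reconstructs the wrong mask, so the loss rises sharply.
loss_sr_swapped = prototype_nll(support_feat, prototypes[::-1], support_mask)
```

In a full training step the same function could be reused for the query branch, with the adapted prototypes and the query mask, and the two terms added to form the total loss.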

III-E Semantic Prototype Projection

IV Experiments

In this section, we report the experimental results of the proposed approach in comparison with previous state-of-the-art methods and ablation studies that verify the effectiveness of our proposed modules.

Our approach is implemented on the public platform PyTorch (http://pytorch.org). During training, we employ a variety of techniques to augment the training samples, such as Gaussian jittering, random shift, random scale, and random rotation around the z-axis. The training samples are then randomly sampled at every episode of each iteration. The feature extractor is initially pre-trained on the training set $\mathcal{D}_{\text{train}}$ for a total of 100 epochs, using Adam as the optimizer. We set the learning rate and batch size to 0.001 and 32, respectively. Following this pre-training stage, we train our proposed model with the pre-trained weights as initialization. For convenience of explanation, we divide the network into two sub-networks, the segmentation part (including the feature extractor, Query-Guided Prototype Adaption, and Self-Reconstruction) and the semantic projection part (Semantic Projection Network). To ensure that the performance of the segmentation part is not affected by the semantic projection part, we employ two independent optimizers during joint training. Adam is utilized as the optimizer to train the segmentation part, with an initial learning rate of 0.001 for the newly added layers and 0.0001 for the feature extractor. It should be noted that Self-Reconstruction does not introduce any additional parameters. The learning rates decay by 0.5 after every 5K iterations. The semantic projection part is trained using the Adam optimizer with a constant learning rate of 0.0002 throughout the entire training process. As for hyper-parameter settings, we set the $\sigma$ of $\mathcal{G}$ used in Eq. (14) to $\{2,5,10,20,40,60\}$ . We set the number of Transformer layers and attention heads to 1 for simplicity.
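The learning-rate settings for joint training can be summarized in a small, framework-free sketch. The function and key names are hypothetical, and treating iteration 5000 as the first decayed step is an assumed convention; only the numeric values come from the description above.

```python
def step_decay_lr(base_lr, iteration, step=5000, gamma=0.5):
    """Learning rate after `iteration` updates when the rate is multiplied
    by `gamma` once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

def learning_rates(iteration):
    """Learning rates of the three parameter groups at a given iteration."""
    return {
        # Segmentation part: both groups share the step-decay schedule.
        "new_layers": step_decay_lr(0.001, iteration),
        "feature_extractor": step_decay_lr(0.0001, iteration),
        # Semantic projection part: separate optimizer, constant rate.
        "semantic_projection": 0.0002,
    }

for it in (0, 5000, 12000):
    print(it, learning_rates(it))
```

Keeping the semantic projection part on its own optimizer means its updates use a schedule that is independent of the segmentation part.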

IV-B Datasets and Evaluation Metrics

We perform experimental evaluation on two public 3D point cloud datasets, S3DIS and ScanNet. S3DIS is composed of 271 point clouds from Matterport scanners in six different areas from three buildings. The annotation for the point clouds has 12 semantic classes in addition to one background class annotated as clutter. ScanNet is made up of 1,513 point clouds of scans from 707 unique indoor scenes. The annotation for the point clouds has 20 semantic classes in addition to one background class annotated as unannotated space. Since the original scenes are too large to process, we split them into smaller blocks. As a result, S3DIS and ScanNet contain 7,547 and 36,350 blocks, respectively, after applying the data pre-processing strategy utilized in . $N=2,048$ points are randomly sampled for each block and each point is represented by a 9D vector (XYZ, RGB and normalized spatial coordinates).

Following , the semantic classes in each dataset are evenly split into two non-overlapping subsets. For both S3DIS and ScanNet, when testing the model with the test class set $C_{unseen}$ on one fold, we use the other fold to train the model with the train class set $C_{seen}$ for cross-validation.

In the training process, an episode is constructed using the following procedure. First, $C$ classes are randomly chosen from $C_{seen}$ , where $C$ should satisfy $C<|C_{seen}|$ . Next, a support set $S$ and a query set $Q$ are randomly sampled based on the chosen $C$ classes. Finally, the ground-truth masks ${M_{S}}$ for the support set and ${M_{Q}}$ for the query set are generated from the original mask annotation as binary masks according to the chosen classes. The episodes for testing are built in a similar way, with one difference: we traverse all combinations of $C$ classes out of the $C_{unseen}$ classes instead of randomly choosing $C$ classes, to obtain a fairer result. 100 episodes are sampled for evaluation.
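The class-selection logic for training and testing episodes can be sketched with the Python standard library. The helper names and the integer class ids are hypothetical, and the sketch covers only class selection and mask binarization, not the sampling of the point clouds themselves.

```python
import itertools
import random

def sample_training_classes(seen_classes, n_way, rng):
    """Randomly pick the target classes of one training episode."""
    if n_way >= len(seen_classes):
        raise ValueError("n_way must be smaller than the number of seen classes")
    return sorted(rng.sample(seen_classes, n_way))

def enumerate_test_class_sets(unseen_classes, n_way):
    """List every n_way-class combination of the unseen classes."""
    return list(itertools.combinations(sorted(unseen_classes), n_way))

def binary_mask(point_labels, target_class):
    """Turn original per-point class ids into a binary mask for one class."""
    return [1 if label == target_class else 0 for label in point_labels]

# Hypothetical class ids for one fold.
seen_classes = [0, 1, 2, 3, 4, 5]
unseen_classes = [6, 7, 8, 9, 10, 11]

rng = random.Random(0)
train_classes = sample_training_classes(seen_classes, n_way=2, rng=rng)
test_class_sets = enumerate_test_class_sets(unseen_classes, n_way=2)
mask = binary_mask([3, 1, 3, 7], target_class=3)
```

Enumerating the test combinations, rather than sampling them, makes the evaluated class sets deterministic, so different methods are compared on identical tasks.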

Evaluation Metrics

Following conventions in the point cloud semantic segmentation community, we evaluate all methods with Mean Intersection-over-Union (mean-IoU). The per-class Intersection over Union (IoU) is defined as $\frac{TP}{TP+FN+FP}$ , where $TP$ , $FN$ , and $FP$ are the counts of true positives, false negatives, and false positives, respectively. For the few-shot setting, mean-IoU is calculated by averaging over all the classes in the testing class set $\mathcal{C}_{\text{unseen}}$ .
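As a worked example of the metric, the sketch below computes per-class IoU and mean-IoU on a single toy prediction. The function names and the toy labels are assumptions; a full evaluation would typically accumulate the $TP$, $FP$, and $FN$ counts over all test episodes before dividing, which this sketch does not do.

```python
def per_class_iou(pred, target, class_id):
    """IoU = TP / (TP + FN + FP) for one class over paired point labels."""
    pairs = list(zip(pred, target))
    tp = sum(p == class_id and t == class_id for p, t in pairs)
    fp = sum(p == class_id and t != class_id for p, t in pairs)
    fn = sum(p != class_id and t == class_id for p, t in pairs)
    denom = tp + fp + fn
    # The class is absent from both prediction and target: IoU is undefined.
    return tp / denom if denom else float("nan")

def mean_iou(pred, target, class_ids):
    """Average the per-class IoU over the given classes."""
    ious = [per_class_iou(pred, target, c) for c in class_ids]
    return sum(ious) / len(ious)

# Toy labels: 0 = background, 1 and 2 = the two target classes.
pred = [1, 1, 2, 2, 0, 1]
target = [1, 2, 2, 2, 0, 0]

iou_1 = per_class_iou(pred, target, 1)  # TP=1, FP=2, FN=0
iou_2 = per_class_iou(pred, target, 2)  # TP=2, FP=0, FN=1
miou = mean_iou(pred, target, class_ids=[1, 2])
```

Here class 1 scores 1/3 and class 2 scores 2/3, so the mean-IoU over the two target classes is 0.5; the background label is excluded from the average.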

IV-C Ablation Study

We conduct detailed component ablation studies to evaluate each of our proposed modules in TABLE I. The experiments are conducted on both S3DIS and ScanNet under 2-way 1-shot and 2-way 5-shot settings. We adopt ProtoNet as our baseline. First of all, when adding the proposed Query-Guided Prototype Adaption (QGPA) module to our baseline, performance gains of 3.97% and 10.66% in terms of mIoU under the 1-shot setting are observed over the baseline on S3DIS and ScanNet, respectively, as shown in TABLE I. The gain comes from our effective transformer design and the adaption of prototypes from the support feature space to the query feature space. This result demonstrates the capacity of the transformer to adapt channel-wise feature correlations between samples, which is important in point cloud scenarios, especially for few-shot learning with only a handful of training samples.

Then, by introducing Self-Reconstruction (SR) as auxiliary supervision, we further obtain significant improvement over QGPA, e.g., 3.78% and 4.66% performance gains under the 2-way 1-shot setting on S3DIS and ScanNet, respectively, as shown in the last row in TABLE I. The proposed Self-Reconstruction forces the prototypes to restore the support information that generated them, which constrains the prototypes to retain the class-related clues. With SR, better prototypes that contain discriminative feature representations are produced. Meanwhile, the gradients from QGPA may make the class-related support clues less pronounced, while the proposed SR protects and enhances such clues. We also observe that the performance gain from Self-Reconstruction over the baseline without QGPA is smaller than with QGPA, e.g., 0.17% and 0.11% under the 2-way 1-shot setting on S3DIS and ScanNet, respectively, as shown in the third row in TABLE I. This phenomenon suggests that the original prototypes already extract discriminative clues from support samples well, and adding extra constraints on this basis does not play a big role. When combining these two modules, our full method achieves the best results and improves substantially over the baseline, which demonstrates that the proposed SR and QGPA are mutually beneficial. It is worth noting that SR plays a more positive role when added to the model with QGPA than to the model without QGPA, which is in line with our original motivation. When we utilize QGPA to align prototypes with the query feature distribution, the prototypes may lose their original discriminative and semantic characteristics and deviate from the original prototypes. With our proposed SR as supervision, prototypes can preserve these informative clues by reconstructing the support ground-truth masks from themselves.

Ablation Study for Baseline Configuration

We study the effects of various designs of ProtoNet since it is the baseline of our method. The results of different variants are listed in TABLE II. The results improve consistently with the help of data augmentation (AUG), multi-scale feature aggregation (MS), and align loss (AL). As shown in TABLE II, starting from the vanilla ProtoNet, using random scale and random shift to augment training samples yields an improvement of 0.81% on S3DIS and 2.2% on ScanNet. Then, by aggregating multi-scale features from the DGCNN backbone, we further improve the results by 2.05% on S3DIS and 3.33% on ScanNet. Finally, introducing the align loss brings a performance gain of around 1%, and the baseline achieves 52.17% on S3DIS and 41.03% on ScanNet.

Ablation Study for QGPA Configuration

In Figure 4, we illustrate the effects of three hyper-parameters in our proposed QGPA configuration (i.e., the number of transformer block layers, the dropout rate, and the hidden point number $N^{\prime}$ ) under the setting of 2-way 1-shot point cloud semantic segmentation on one split of S3DIS and ScanNet. As shown in Figure 4 (a), increasing the number of QGPA layers achieves better results, but too many layers consume more computing resources and slow down inference. Thus, we choose a single layer to achieve a good balance of accuracy and efficiency. Figure 4 (b) reveals that increasing the dropout rate degrades the result considerably, and a dropout rate of 0.0 gives the best result on both datasets. In Figure 4 (c), as the hidden point number increases, the result first rises and then flattens. Therefore, our transformer’s hidden point number $N^{\prime}$ is set to 512 to achieve robust performance while remaining efficient.

Comparison with Other Designs for QGPA

To verify the superiority of our proposed QGPA, we compare it with several other state-of-the-art transformer-like module designs in TABLE III. For a fair comparison, except for the transformer design, these methods are trained with the same experimental configuration, and all experiments are based on the same baseline without Self-Reconstruction. Classifier Weight Transformer (CWT) is a few-shot segmentation transformer architecture that adapts the classifier weight to each query feature. The classifier weight can be regarded as prototypes; hence, we set the prototype as the query and the query feature as the key and value inputs of the CWT transformer. Please refer to the CWT paper for architecture details. Besides, we use the DETR decoder branch, which contains self-attention and cross-attention blocks. We set the prototype as the query and the query feature as the key and value, similar to CWT. It is worth noting that the number of prototypes varies with the number of ways $C$ . For instance, in a 2-way setting, there would be three prototypes. Therefore, the self-attention block in DETR is able to work as intended. As the experimental results in TABLE III show, CWT and the vanilla DETR transformer are inferior to our proposed Query-Guided Prototype Adaption (QGPA), with performance drops of 3.98%/6.5% on S3DIS and 9.80%/7.66% on ScanNet under the 1-shot setting, respectively. This confirms that simply applying Transformers to the 3D point cloud segmentation task is not effective, because they ignore the discrepancy in the distribution of features in the channel-wise space. D-QGPA denotes a degraded version of our proposed QGPA that replaces the support feature with the query feature as the key input. As a result, the attention values are calculated from the query feature itself and cannot draw knowledge from the support feature. We find that D-QGPA does not lead to improvement but to a large drop (14.18% under the 1-shot setting on ScanNet) compared to QGPA. This indicates the essential importance of adaption from the support to the query feature space. Without the support feature, the prototype adaption has only the information of the target and lacks the information of the source, which renders the adaptation process ineffective. The model parameters and inference speed of these methods are also listed in TABLE III. As can be seen, our method is more lightweight and efficient.

Ablation Study for Different Choices of Prototypes in Self-Reconstruction

Once the prototype is transferred to the query feature space, a feature gap arises between the refined prototype and the original support feature. Consequently, utilizing the refined prototype in the query feature space to reconstruct the support mask in the support feature space is unreasonable. In contrast, applying constraints to the original prototype can facilitate the extraction of a more discriminative representation from the support set. This serves as a foundation for the success of the subsequent adaptation, allowing the refined prototype to indirectly retain critical class and semantic information. Furthermore, we conduct experiments to assess the choice between original prototypes and refined prototypes for reconstructing the support masks, as shown in TABLE IV. Utilizing the original prototypes yields superior performance, with an increase of up to 4.05% in mIoU compared to the refined prototypes, which is in accordance with our analysis.

IV-D Qualitative Result

To qualitatively demonstrate the performance of our proposed approach, we visualize some point cloud segmentation results on S3DIS and ScanNet in Figure 5 and Figure 6, respectively. In both figures, the first column shows the input point clouds, the second column shows the ground-truth masks, and the third and fourth columns show the masks predicted by attMPTI and by our proposed approach, respectively. Our approach achieves better results than attMPTI. For example, in the last row of Figure 5, our approach segments the “sofa” very well, while attMPTI misclassifies part of the “sofa” as “background”. In the last row of Figure 6, attMPTI incorrectly classifies “sink” as “toilet” and “background”, while our approach generates a high-quality mask for “sink”. The qualitative results in Figure 5 and Figure 6 demonstrate the effectiveness of our proposed approach and its superiority over the previous state-of-the-art method attMPTI.

IV-E Comparison with State-of-the-Art Methods

In TABLE V and TABLE VI, we compare with previous state-of-the-art methods and report our quantitative results on the S3DIS and ScanNet datasets, respectively. Our proposed method outperforms the previous state-of-the-art methods by a large margin, and it outperforms attMPTI in all settings. For example, our proposed approach is 7.90% and 3.5% better than attMPTI under the 2-way 1-shot and 2-way 5-shot settings on S3DIS, and 22.46% and 12.32% better than attMPTI under the 3-way 1-shot and 3-way 5-shot settings on ScanNet. Compared to ProtoNet, which follows a design paradigm similar to ours, our method achieves gains of up to 13.57% and 24.07% on S3DIS and ScanNet, respectively. These large improvements demonstrate that our method obtains more discriminative and adaptive prototypes, not only from the support samples but also through support-query feature adaption. The superior results obtained by our method show that the intra-class sample variation problem is critical in 3D point cloud scenes, and that our proposed Query-Guided Prototype Adaption and Self-Reconstruction are effective in addressing it.
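
For readers who want to relate such gains to a concrete metric, the following sketch computes a mean Intersection-over-Union, assuming the comparison is measured in mIoU as in the ablation above. The choice to average over target classes only and to skip classes absent from both prediction and ground truth is an assumption of this sketch, and `mean_iou` is a hypothetical helper, not the evaluation code behind TABLE V and TABLE VI.

```python
def mean_iou(pred, target, class_ids):
    """Average per-class IoU over the given class ids.

    Classes that appear in neither the prediction nor the ground truth
    are skipped (an assumption of this sketch).
    """
    ious = []
    for c in class_ids:
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2-way episode: label 0 is background, labels 1 and 2 are target classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, class_ids=[1, 2])
```

In this toy episode class 1 has an IoU of 2/3 and class 2 an IoU of 1/2, so the mean is 7/12. A reported difference such as 7.90% would then correspond to a difference between two such averages, expressed in percentage points.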

IV-F Comparison with State-of-the-Art Zero-shot Methods

We further evaluate our model with the semantic prototype projection branch, as shown in TABLE VII. During testing, the point-level annotations of the support set are replaced by semantic prototypes generated by our semantic branch. We report our results with both word2vec and CLIP as the text encoder. On the one hand, compared with the results in TABLE V, our text-based model is competitive with the models that use visual support samples. The semantic projection network therefore bridges the gap between visual support and semantic words, and establishes a generalized framework for few- and zero-shot learning that achieves superior performance regardless of whether the input takes the form of semantic words or visual support samples. It is worth noting that our zero-shot segmentation model achieves better results under 5-shot training than under 1-shot training. This is because we jointly train our few-shot and zero-shot models, and use the visual prototypes from support samples as the ground truth for our word-projected prototypes during training. More support samples produce better visual prototypes, which in turn help train a better text-vision projection network that generates more accurate word-projected prototypes. On the other hand, to provide a fair comparison with other zero-shot methods, we run their official code under our experimental settings, using the same training data augmentations and the same training and testing splits, to obtain the results in TABLE VII. The results show that our method improves significantly over 3DGenZ, which is, to the best of our knowledge, the only open-source 3D zero-shot segmentation method. This comparison further validates the effectiveness of our method.
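
The training signal described above, in which visual prototypes serve as the ground truth for word-projected prototypes, can be sketched with a toy projection. Everything specific here is an assumption for illustration: a single linear map stands in for the semantic projection network, the loss is a mean squared error, the optimizer is plain per-sample gradient descent, and the two-class embeddings and prototypes are made-up numbers.

```python
def project(weights, embedding):
    """Map a word embedding to prototype space with a linear layer (no bias)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_projection(word_embeddings, visual_prototypes, steps=500, lr=0.1):
    """Fit the projection so that projected words match the visual prototypes."""
    in_dim, out_dim = len(word_embeddings[0]), len(visual_prototypes[0])
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    for _ in range(steps):
        for emb, proto in zip(word_embeddings, visual_prototypes):
            pred = project(weights, emb)
            for i in range(out_dim):
                # Gradient of the MSE with respect to row i of the weights.
                grad_scale = 2.0 * (pred[i] - proto[i]) / out_dim
                for j in range(in_dim):
                    weights[i][j] -= lr * grad_scale * emb[j]
    return weights


# Hypothetical word embeddings (3-d) and visual prototypes (2-d) for two classes.
word_embeddings = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
visual_prototypes = [[0.9, 0.1], [0.2, 0.8]]

weights = train_projection(word_embeddings, visual_prototypes)
# At test time the projected prototypes stand in for the visual ones.
word_prototypes = [project(weights, emb) for emb in word_embeddings]
```

The sketch only fits the classes it was trained on; in the zero-shot setting the learned projection would be applied to the embeddings of unseen category names. It also makes the dependence on the targets explicit: the projection can be no better than the visual prototypes it is regressed onto, which is in line with the observation that 5-shot training helps the zero-shot branch.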

IV-G Computational Complexity

In TABLE VIII, we present the number of parameters and the computational complexity of our proposed model and of the previous SOTA method attMPTI. The Query-Guided Prototype Adaption (QGPA) introduces two linear layers that map the point cloud from 2048 to 512, resulting in a moderate increase in the number of parameters. With the addition of Self-Reconstruction (SR), we only need to calculate an additional loss term and no additional parameters are introduced, so the computational complexity remains unchanged. Finally, we integrate our Semantic Projection Network (SPN) to obtain the final model. As we only need to learn the mapping from semantic words to visual prototypes, the increase in the number of parameters and in computational complexity is minimal. Our model demonstrates strong performance while maintaining a relatively low number of parameters and low computational complexity, particularly in terms of FPS. Although attMPTI’s Transductive Inference process does not increase the parameter count, it significantly slows down inference. As a result, our approach is an effective and efficient solution for few- and zero-shot 3D point cloud semantic segmentation, delivering superior results at a higher FPS.
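
As a rough sanity check on the size of the QGPA overhead, the parameters of two 2048-to-512 linear layers can be counted directly. The sketch assumes standard fully connected layers, and whether they carry bias terms is not stated in the text, so both variants are shown; the resulting figures are an estimate and are not taken from TABLE VIII.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Two linear layers mapping 2048 -> 512, as described for QGPA.
qgpa_params_with_bias = 2 * linear_params(2048, 512, bias=True)
qgpa_params_no_bias = 2 * linear_params(2048, 512, bias=False)

print(f"with bias:    {qgpa_params_with_bias:,}")
print(f"without bias: {qgpa_params_no_bias:,}")
```

Under these assumptions the two layers add roughly 2.1 million parameters, which fits the description of a moderate increase.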

V Conclusion

We propose a prototype adaption and projection network for few-shot and zero-shot point cloud semantic segmentation. By analyzing the feature channel distributions of 2D images and 3D point clouds, we observe that the intra-class feature variation of 3D point clouds is larger than that of 2D images, owing to the lack of pre-training on large-scale datasets. We hence propose a Query-Guided Prototype Adaption (QGPA) module to map the prototypes extracted in the support feature space to the query feature space, which greatly improves few-shot segmentation performance. To preserve more class-specific clues in the prototypes, we introduce Self-Reconstruction (SR), which encourages each prototype to reconstruct its corresponding mask as well as possible. Furthermore, a semantic projection network is proposed to handle zero-shot learning cases where no annotated sample is provided, only category names. The semantic projection network makes our model more practical in the real world. We evaluate the proposed approach on two popular 3D point cloud segmentation datasets, where it achieves new state-of-the-art performance with significant improvements over previous methods.