Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation

Shuting He, Henghui Ding, Wei Jiang

Introduction

Image segmentation aims to group pixels with different semantics, e.g., category or instance. Deep learning methods have greatly advanced the performance of image segmentation with the powerful learning ability of CNNs and Transformers. However, since deep learning methods are data-driven, they demand large-scale labeled training samples, whose annotation is labor-intensive and time-consuming. To address this issue, zero-shot learning (ZSL) is proposed to classify novel objects with no training samples. Recently, ZSL has been extended to segmentation tasks such as zero-shot semantic segmentation (ZSS) and zero-shot instance segmentation (ZSI). Herein, we further introduce zero-shot panoptic segmentation (ZSP) and aim to build a universal framework for zero-shot panoptic/semantic/instance segmentation with the help of semantic knowledge, as shown in Fig. 1.

Different from image classification, segmentation requires pixel-wise classification and is more challenging in terms of class representation learning. Substantial efforts have been devoted to zero-shot semantic segmentation, and they can be categorized into projection-based methods and generative model-based methods. The generative model-based methods are usually superior to the projection-based methods because they produce synthetic training features for the unseen group, which helps to alleviate the crucial bias issue of tending to classify objects into seen classes. Owing to these merits, we follow the paradigm of generative model-based methods to address zero-shot segmentation tasks.

However, current generative model-based methods usually take the form of per-pixel generation, which is not robust enough in more complicated scenarios. Recently, several works propose to decouple segmentation into class-agnostic mask prediction and object-level classification. We follow this strategy and reduce the pixel-level generation to a more robust object-level generation. Moreover, previous generative works usually learn a direct mapping from semantic embeddings to visual features. Such a generator does not consider the visual-semantic gap in feature granularity, i.e., that images contain much richer information than language. The direct mapping from coarse to fine-grained information results in low-quality synthetic features. To address this issue, we propose to utilize abundant primitives with very fine-grained semantic attributes to compose visual representations. Different assemblies of these primitives construct different class representations, where the assembly is decided by the relevance between primitives and semantic embeddings. Primitives greatly enhance the expressive diversity and effectiveness of the generator, especially in terms of rich fine-grained attributes, making the synthetic features for different classes more reliable and discriminative.

However, there are only real image features of seen classes to supervise the generator, leaving unseen classes unsupervised. To provide more constraints for the feature generation of unseen classes, we propose to transfer the inter-class relationships in semantic space to visual space. The category relationships obtained from semantic embeddings are employed to constrain the inter-class relationships of visual features. With such a constraint, the visual features, especially the synthesized features for unseen classes, are encouraged to have the same inter-class structure as in semantic space. Nevertheless, there is a discrepancy between the visual space and the semantic space, and hence between their inter-class relationships. Visual features contain richer information and cannot be fully aligned with semantic embeddings. Directly aligning two mismatched relationships inevitably compromises the discriminability of visual features. To address this issue, we propose to disentangle visual features into semantic-related and semantic-unrelated features, where the former are better aligned with the semantic embedding while the latter act as noise with respect to semantic space. We only use semantic-related features for relationship alignment. The proposed relationship alignment and feature disentanglement are mutually beneficial. Feature disentanglement builds a semantic-related visual space to facilitate relationship alignment and excludes semantic-unrelated features that are noisy for alignment. Relationship alignment in turn contributes to disentangling semantic-related features by providing semantic clues.

Overall, the main contributions are as follows:

We study universal zero-shot segmentation and propose Primitive generation with collaborative relationship Alignment and feature Disentanglement learning (PADing) as a unified framework for ZSP/ZSI/ZSS.

We propose a primitive generator that employs a large number of learned primitives with fine-grained attributes to synthesize visual features for unseen categories, which helps to address the bias issue and the domain gap issue.

We propose a collaborative relationship alignment and feature disentanglement learning approach to facilitate the generator producing better synthetic features.

The proposed approach PADing achieves new state-of-the-art performance on zero-shot panoptic segmentation (ZSP), zero-shot instance segmentation (ZSI), and zero-shot semantic segmentation (ZSS).

Related Work

Zero-shot learning (ZSL) aims to classify images of unseen classes with no training samples by utilizing semantic descriptors as auxiliary information. There are two main paradigms: classifier-based methods that learn a visual-semantic projection and instance-based methods that synthesize fake samples for unseen classes. Generalized zero-shot learning (GZSL), introduced by Scheirer et al., aims to classify samples from both seen and unseen sets. Chao et al. then show experimentally that ZSL methods do not work well in the GZSL setting because they overfit to seen classes. Classification score calibration methods and out-of-distribution detector methods are proposed to alleviate this bias issue.

Image Segmentation is one of the most fundamental computer vision tasks. Deep-learning-based image segmentation methods in a fully supervised manner are extensively studied. However, these methods require a large number of labeled training samples and cannot handle unseen categories that do not appear or are not defined in training data. To address these issues, Zero-Shot Semantic Segmentation (ZSS) and Zero-Shot Instance Segmentation (ZSI) extend ZSL methods to semantic segmentation and instance segmentation, respectively. In this work, we further introduce Zero-Shot Panoptic Segmentation (ZSP) to extend zero-shot learning to the panoptic segmentation task. There are two main paradigms: projection-based methods and generative-based methods. Projection-based techniques commonly map the visual or semantic features of seen categories onto a shared space (e.g., visual, semantic, or latent space), and then classify novel objects by measuring the feature similarity in the common space. The generative methods adopt a generator to produce synthetic features for unseen classes. However, existing generative works usually learn a direct mapping from semantic embeddings to visual features and do not consider the visual-semantic gap in feature granularity. We design a primitive generation and semantic-related alignment approach to universally address zero-shot segmentation, including ZSP, ZSI, and ZSS.

Methodology

Fig. 2 illustrates the overall architecture of our proposed approach, Primitive generation with collaborative relationship Alignment and feature Disentanglement learning (PADing). Our backbone predicts a set of class-agnostic masks and their corresponding class embeddings. The primitive generator is trained to synthesize class embeddings from semantic embeddings. The real and synthetic class embeddings are disentangled into semantic-related and semantic-unrelated features. We conduct relationship alignment learning on the semantic-related features. With the synthesized unseen class embeddings, we re-train our classifier with both the real class embeddings of seen categories and the synthetic class embeddings of unseen categories. The training process is described in Algorithm 1. The details of each part are introduced in the following sections.

Primitive Cross-Modal Generation

Due to the lack of unseen samples, the classifier cannot be optimized with features of unseen classes. As a result, the classifier trained on seen classes tends to assign all objects/stuff a label from the seen group, which is called the bias issue. To address this issue, previous methods propose to utilize a generative model to synthesize fake visual features for unseen classes. However, previous generative zero-shot segmentation works commonly adopt a Generative Moment Matching Network (GMMN) or GAN, which consists of multiple linear layers, as the feature generator. Such a generator, although it achieves good performance, does not consider the visual-semantic difference in feature granularity. It is well known that images generally contain much richer information than language. Visual information provides very fine-grained attributes of objects while textual information typically provides abstract and high-level attributes. Such a difference results in an inconsistency between visual features and semantic features. To address this challenge, we propose a Primitive Cross-Modal Generator that employs a large number of learned attribute primitives to construct visual representations.

where ${\mathcal{X}}^{\prime}$ represents the synthetic visual features and $\mathcal{Z}$ denotes a random sample drawn from a fixed Gaussian distribution. $\omega_{1}$ is a linear layer. Different from feature generation that processes the semantic embedding with several linear layers, we synthesize visual features by a weighted assembly of these abundant primitives, which provides much more diverse and richer representations. Moreover, for related categories that share some similarities in semantic space, primitives provide an explicit way to express such similarities. For example, dog and cat both have the attributes of hairy and tail, so the primitives related to hairy and tail show a high response to the semantic embedding queries of dog and cat. With such primitives that describe fine-grained attributes, we can easily construct different category representations and transfer the knowledge of seen classes to unseen ones.
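The weighted assembly described above can be illustrated with a minimal NumPy sketch. It is not the exact architecture of our generator: the additive noise injection, the single query projection `w_query`, the scaled dot-product relevance, and all tensor shapes are assumptions made only for illustration, and every name in the block is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def primitive_generate(sem_emb, primitives, w_query, w_out, rng, noise_scale=1.0):
    """Assemble synthetic class embeddings from shared primitives.

    sem_emb:    (C, d_a) semantic embeddings, one per class
    primitives: (N, d)   learned primitives (here simply given)
    w_query:    (d_a, d) projection of the noisy semantic query (assumed)
    w_out:      (d, d)   final linear layer (plays the role of omega_1)
    """
    # Gaussian noise z; adding it to the embedding is an assumption.
    z = noise_scale * rng.standard_normal(sem_emb.shape)
    query = (sem_emb + z) @ w_query                      # (C, d)
    # Relevance of every primitive to every class query.
    scale = np.sqrt(primitives.shape[1])
    relevance = softmax(query @ primitives.T / scale)    # (C, N)
    # Weighted assembly of primitives, then the linear layer.
    assembled = relevance @ primitives                   # (C, d)
    return assembled @ w_out, relevance
```

In this sketch, two classes with similar semantic embeddings obtain similar relevance rows, which is the explicit sharing of attributes (e.g., hairy, tail) that the paragraph above describes.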

Following prior work, we define our generator loss $\mathcal{L}_{\mathcal{G}}$ to minimize the maximum mean discrepancy (MMD) between two probability distributions:

where $X^{s}$ and ${X^{s}}^{\prime}$ denote real visual features and synthetic visual features of seen classes, respectively. $k$ is a Gaussian kernel, $k(f,f^{\prime})=\exp(-\frac{1}{2\sigma^{2}}\|f-f^{\prime}\|^{2})$, with bandwidth $\sigma$.
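As a hedged illustration, a maximum mean discrepancy loss with the Gaussian kernel above can be sketched in NumPy as follows. The sketch uses the standard (biased) MMD estimate and averages the kernel values over pairs; that normalization, the single bandwidth, and the function names are assumptions rather than a statement of our exact implementation.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    # Pairwise k(f, f') = exp(-||f - f'||^2 / (2 sigma^2)) for rows of a and b.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd_loss(real, fake, sigma=1.0):
    """Biased MMD^2 estimate between real and synthetic seen-class features."""
    k_rr = gaussian_kernel(real, real, sigma).mean()
    k_ff = gaussian_kernel(fake, fake, sigma).mean()
    k_rf = gaussian_kernel(real, fake, sigma).mean()
    return k_rr + k_ff - 2.0 * k_rf
```

The loss vanishes when the two feature sets coincide and grows as the synthetic distribution drifts away from the real one.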

When a semantic embedding from the unseen group is fed into the trained Primitive Generator, we obtain its corresponding synthetic class embedding. We then re-train our classifier with both the real class embeddings of seen categories and the synthetic class embeddings of unseen categories, which greatly alleviates the bias issue. Besides, such global representations are more robust than per-pixel classification and can thus achieve a better alignment between visual space and semantic space.

Semantic-Visual Relationship Alignment

It is well known that relationships among categories naturally differ. For example, consider three objects: apple, orange, and cow. Obviously, the relationship between apple and orange is closer than that between apple and cow. Class relationships in semantic space are powerful prior knowledge, while category-specific feature generation does not explicitly leverage such relationships. As shown in Fig. 4, we build such relationships with semantic embeddings and explore transferring this knowledge to visual space, achieving semantic-visual alignment in terms of class-wise relationships. Considering these relationships places more constraints on the feature generation of unseen categories, pulling them towards or pushing them away from seen categories.

Relationship Alignment

Then we conduct relationship alignment between semantic-related visual space and semantic space. We use KL divergence loss to make the similarity of any two semantic-related features $\hat{x}_{i}$ and $\hat{x}_{j}$ reach the similarity of their corresponding semantic embeddings $a_{[\hat{x}_{i}]}$ and $a_{[\hat{x}_{j}]}$ , i.e.,

where $[\hat{x}_{i}]$ is the ground truth class index of $\hat{x}_{i}$, and $\tau$ is the temperature parameter that controls the sharpness of the similarity distributions used in the KL loss. $\hat{x}_{i}^{s}$ of the seen group is from either real features or synthetic features, while $\hat{x}_{i}^{u}$ of the unseen group comes only from features synthesized by the generator. There are two kinds of alignment, intra-group alignment and inter-group alignment, with different focuses in Eq. 6. When $\hat{x}_{i}$ and $\hat{x}_{j}$ are from the same group, e.g., $\hat{x}_{i}^{s}$ and $\hat{x}_{j}^{s}$ both from the seen group, it is intra-group alignment and contributes to extracting better class representations with the relationships as a constraint. When they are from different groups, e.g., $\hat{x}_{i}^{s}$ from the seen group and $\hat{x}_{j}^{u}$ from the unseen group, it is inter-group alignment that aims to transfer the relationship knowledge from seen to unseen. Inter-group alignment constrains the relationships between seen and unseen categories and between real and synthetic features. It greatly improves the model’s adaptability and generalization to unseen categories.
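The idea of matching pairwise similarities across the two spaces can be sketched as follows. This is only one plausible instantiation: the use of cosine similarity, the row-wise softmax normalization, the direction of the KL divergence (semantic distribution as target), and the averaging over rows are assumptions, and the function names are hypothetical.

```python
import numpy as np

def _row_softmax(x):
    # Numerically stable softmax over each row.
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def _cosine_sim(x):
    # Pairwise cosine similarity between the rows of x.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def relationship_alignment_loss(feats, sem_embs, labels, tau=0.1):
    """KL between semantic and visual pairwise-similarity distributions.

    feats:    (B, d)   semantic-related features (real or synthetic)
    sem_embs: (C, d_a) semantic embeddings of all classes
    labels:   (B,)     class index [x_i] of every feature
    """
    a = sem_embs[labels]                        # a_[x_i] for each feature
    p_sem = _row_softmax(_cosine_sim(a) / tau)  # target relationships
    p_vis = _row_softmax(_cosine_sim(feats) / tau)
    kl = (p_sem * (np.log(p_sem) - np.log(p_vis))).sum(axis=1)
    return kl.mean()
```

A batch that mixes seen and unseen features covers both intra-group and inter-group pairs in the same similarity matrix, so no separate code path is needed for the two kinds of alignment in this sketch.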

Collaborative Disentanglement and Alignment. Our disentanglement and alignment are complementary and mutually beneficial. On the one hand, disentanglement promotes relationship alignment. With the disentanglement, semantic-related features can be extracted for alignment and semantic-unrelated noise is excluded. On the other hand, relationship alignment facilitates disentanglement. By introducing intra-group and inter-group alignment, class-wise relationships among semantic-related features can be constructed and the discrepancy between semantic and visual feature distributions can be reduced, eventually improving the feature disentanglement.

Training Objective

Algorithm 1 shows the overall training pipeline of our universal zero-shot segmentation model. First, we pre-train our segmentation backbone with annotated data from seen classes in a fully supervised manner. Next, we train the primitive generator under the following objective:

where $\lambda$ is the weight that controls the importance of the disentanglement and alignment module. Once the generator is trained, it can generate synthetic features for unseen classes. Together with the real features from seen classes, these are used to train a new classification layer.

Experiments

Datasets. We use the popular dataset MSCOCO 2017, which consists of a training set with 118k images and a validation set with 5k images. For panoptic segmentation, 133 classes (80 thing classes and 53 stuff classes) are included in the annotations. For semantic segmentation, COCO-Stuff contains 171 valid classes in total. For a fair comparison with ZSI, we use MSCOCO 2014 for instance segmentation, which contains 80k training and 40k validation images.

Zero-Shot Panoptic Segmentation Task

Because of the high similarity between semantic segmentation and panoptic segmentation, we develop the ZSP datasets by following previous ZSS works. In order to avoid any information leakage, SPNet selects 15 classes in COCO-Stuff that do not appear in ImageNet as unseen classes. In the COCO panoptic dataset, we find 14 classes that overlap with the 15 selected by SPNet and set them as unseen classes, i.e., {cow, giraffe, suitcase, frisbee, skateboard, carrot, scissors, cardboard, sky-other-merged, grass-merged, playingfield, river, road, tree-merged}, while the remaining 119 classes are set as seen classes. To guarantee no information leakage in the training set, we discard the training images that contain even one pixel of any unseen class. Thus the model is trained on samples of seen classes only, with 45617 training images. We use all 5k validation images to evaluate the performance of ZSP. Panoptic and semantic segmentation tasks are evaluated on the union of thing and stuff classes, while instance segmentation is only evaluated on the thing classes.
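The filtering rule for the training set can be sketched as below. The sketch assumes that each training image comes with a per-pixel class-index map; the function name and the integer class ids used in the usage lines are hypothetical and do not correspond to the actual COCO category ids.

```python
import numpy as np

def keep_for_training(label_map, unseen_ids):
    """Return True only if no pixel of the image belongs to an unseen class.

    label_map:  (H, W) integer array of per-pixel class indices (assumed format)
    unseen_ids: iterable of class indices treated as unseen
    """
    contains_unseen = np.isin(label_map, sorted(unseen_ids)).any()
    return not contains_unseen

# Usage with made-up ids: a single unseen pixel is enough to discard an image.
example = np.array([[1, 1], [2, 3]])
kept = keep_for_training(example, {7, 9})      # no unseen pixel
discarded = keep_for_training(example, {3})    # one pixel of class 3
```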

Evaluation Metrics. Under the GZSL setting, the model needs to segment objects/stuff of both seen and unseen classes, which is closer to real-world complicated scenarios. Following previous ZSS, ZSD, and ZSI tasks, we compute seen metrics, unseen metrics, and the harmonic mean (HM) of seen metrics and unseen metrics as follows,

where $\rm{P}_{seen}$ and $\rm{P}_{unseen}$ denote the seen and unseen metrics, respectively. We use the PQ (panoptic quality) metric, which can be viewed as the product of a segmentation quality (SQ) and a recognition quality (RQ). We also report the results on instance segmentation, object detection, and semantic segmentation tasks. For instance segmentation and object detection, we use the standard mAP (mean Average Precision) with an IoU threshold of 0.5. For semantic segmentation, we use mIoU (mean Intersection-over-Union).
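For concreteness, the harmonic mean of a seen metric and an unseen metric, $2\,\rm{P}_{seen}\rm{P}_{unseen}/(\rm{P}_{seen}+\rm{P}_{unseen})$, can be computed as in the short sketch below. The zero-denominator convention and the numbers in the usage lines are illustrative assumptions, not results from our tables.

```python
def harmonic_mean(p_seen, p_unseen):
    """Harmonic mean (HM) of a seen metric and an unseen metric."""
    if p_seen + p_unseen == 0:
        return 0.0  # convention assumed here when both metrics are zero
    return 2.0 * p_seen * p_unseen / (p_seen + p_unseen)

# Illustrative values only: HM is dominated by the weaker of the two groups.
hm_balanced = harmonic_mean(40.0, 10.0)
hm_collapsed = harmonic_mean(30.0, 0.0)
```

Because HM drops to zero whenever the unseen metric is zero, it penalizes models that are strongly biased towards seen classes.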

Ablation Study

In Tab. 1 and Tab. 2, we perform ablation studies of the proposed PADing on the MS-COCO dataset under four tasks: zero-shot panoptic segmentation, zero-shot instance segmentation, zero-shot object detection, and zero-shot semantic segmentation. It is worth noting that the results in Tab. 2 are obtained by the model trained on the zero-shot panoptic segmentation task only, which achieves our goal of training a single model for universal zero-shot image segmentation tasks. For simplicity, our ablation analysis mainly focuses on ZSP, because ZSI, ZSD, and ZSS show trends similar to ZSP. First, to demonstrate the advantage of introducing a generative model, we implement a projection-based segmentation baseline by using CLIP text embeddings as the classifier’s weights, similar to ZegFormer-seg. During training, 119 text embeddings are used in the classifier, while during inference, we add another 14 unseen text embeddings to the classifier and assign each object to one of these 133 classes. As shown in the 2nd row of Tab. 1, there is a strong bias towards seen classes, resulting in extremely low, even zero, accuracy for the unseen group. Next, we construct a baseline built upon the generative GMMN model following ZS3, which outperforms the projection-based method by 4.9% in terms of unseen PQ. This shows that the generative model helps to solve the crucial bias issue.

GMMN vs. Our Primitive Generator. As shown in Tab. 1 and Tab. 2, our primitive generator significantly surpasses GMMN generator by at least 9.0% PQ, 5.7% SQ, and 10.9% RQ for HM metric. This shows that our primitive generator is capable of generating more effective features and the primitives can better grasp the real distribution of visual features compared to the baseline generator GMMN.

Number of Primitives. We report the network’s performance with different numbers of primitives in Tab. 3. The results show that increasing the number of primitives from 100 to 400 brings a significant performance gain of 4.2%. The performance drops slightly when the number of primitives is larger than 400, so we choose 400 as the default setting.

Effectiveness of Alignment. By applying semantic alignment as a constraint to our generator, the HM-PQ is further improved by 2.6%, demonstrating the effectiveness of introducing the inter-class relationships inherited from semantic space. Finally, we evaluate the alignment module with disentanglement, see 6) PADing in Tab. 1 and Tab. 2. In comparison to using alignment only, alignment+disentanglement transfers semantic prior knowledge on semantic-related features and consistently brings performance gains of 2.0% HM-PQ, 13.3% HM-SQ, and 2.7% HM-RQ. The significant improvement demonstrates that the semantic-visual discrepancy has been alleviated by omitting semantic-unrelated noise. The utilization of disentanglement enables more effective alignment in the separated semantic-related space.

Visualization of synthesized feature representations. To study the properties of our synthesized unseen features and demonstrate the effectiveness of our proposed approach, we employ t-SNE to show the distribution of our synthetic features in Fig. 5. As we can see in Fig. 5 (a), the synthesized features produced by GMMN generator are messy due to the semantic-visual discrepancy. In Fig. 5 (b), when introducing our primitive generator, features belonging to the same class become more compact and features from different classes are highly separable. Furthermore, after applying relationship-alignment constraint on the semantic-related feature, see Fig. 5 (c), features belonging to different classes are farther apart with better-structured distributions, which shows that the structure relationship is embedded into synthetic features and the synthesized unseen features are greatly enhanced with better discrimination.

Comparison with State-of-the-art ZSS Methods

To further validate the superiority of our approach, we compare it with previous state-of-the-art ZSS methods on the challenging semantic segmentation dataset COCO-Stuff in Tab. 4. For a fair comparison, we only report results without self-training and without the complicated crop-mask image preprocessing used for the CLIP image encoder. We train our model with semantic segmentation annotations. The proposed approach outperforms the previous best method ZegFormer-seg by 3.5% HM-IoU and 3.4% unseen-IoU, demonstrating its effectiveness. It is worth noting that the above methods use ResNet-101 while we only use ResNet-50.

Comparison with State-of-the-art ZSI Methods

We compare the proposed method with the previous state-of-the-art method ZSI under the Generalized Zero-Shot Instance Segmentation (GZSI) setting in Tab. 5. Our model is trained with instance segmentation annotations for a fair comparison. We achieve new state-of-the-art performance on both 48/17 split and 65/15 split. For example, we surpass ZSI by 7.20% HM-mAP and 5.27% HM-Recall on 48/17 split. It is worth noting that ZSI uses ResNet-101 while we use ResNet-50.

Qualitative Results

To qualitatively demonstrate the effectiveness of our proposed approach, we visualize some examples of zero-shot panoptic segmentation results in Fig. 6. The second row shows the ground-truth masks, while the third and fourth rows show the masks predicted by the baseline and by our proposed approach, respectively. We observe that our PADing successfully finds several unseen classes, e.g., suitcase, grass, frisbee, road, tree, skateboard, that are missed or misclassified by the baseline model. Besides, thanks to the class-agnostic mask generation ability of Mask2Former, our results show high-quality masks.

Conclusion

We propose primitive generation with collaborative relationship alignment and feature disentanglement learning (PADing) as a unified framework to achieve universal zero-shot segmentation. A primitive generator is proposed to synthesize fake training features for unseen classes. A collaborative feature disentanglement and relationship alignment learning strategy is proposed to help the generator produce better fake unseen features, where the former decouples visual features into a semantic-related part and a semantic-unrelated part and the latter transfers inter-class knowledge from semantic space to visual space. Extensive experiments on three zero-shot segmentation tasks demonstrate the effectiveness of the proposed approach.

Acknowledgement. Shuting He and Wei Jiang were partially supported by the National Natural Science Foundation of China (No. 62173302).