OMG-Seg: Is One Model Good Enough For All Segmentation?

Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy

Introduction

Visual segmentation, which aims to understand semantics at the pixel level, has been a longstanding problem in the vision community, fueling advancements in diverse applications such as robotics, autonomous vehicles, and augmented/virtual reality systems. Over the past decade, owing to the tremendous progress in deep learning, this fundamental problem has been transformed into a diverse set of tasks for image and video data, including basic semantic object/instance segmentation, panoptic segmentation, and the more recent prompt-driven interactive segmentation. Consequently, a plethora of task-specific deep segmentation models (e.g., Mask-RCNN, Mask2Former, and SAM), along with different benchmarks, have been proposed. The latest studies strive to extend these standard closed-set segmentation models to more dynamic, real-world scenarios. This involves integrating pre-trained vision-language foundation models, e.g., CLIP, into deep segmentation frameworks, enabling visual segmentation through open-vocabulary text descriptions.

Most existing deep segmentation models focus on a single specific task. In many scenarios, a generalizable model capable of handling a broader spectrum of segmentation tasks is highly desired. A unified model of this nature would eliminate the need for task-specific designs while providing a versatile solution to a wide range of segmentation tasks through a single, cohesive architecture. Such a model can also leverage large and varied data corpora, which enhances its adaptability and effectiveness across different segmentation tasks.

Unifying diverse segmentation tasks within a single model is non-trivial because each task typically comes with its own unique model design. The emergence of transformers has catalyzed several segmentation models based on the Detection Transformer (DETR) architecture , yielding notable successes in performance and task integration. Concurrently, there are also models that employ a similar framework to merge open-vocabulary and multi-dataset segmentation within a unified architecture. Yet, these models often fall short in generalizing to video or interactive segmentation, both essential for broader applications. Some recent studies aim to unify all vision tasks under one single framework with segmentation included. However, these more generalized models still lag behind task-specific segmentation models in terms of performance.

In this study, we demonstrate that one model is good enough for all segmentation by introducing OMG-Seg, a unified segmentation model designed to deliver competitive performance across a broad spectrum of visual segmentation tasks. (Here, "all segmentation" primarily covers pure visual, 2D segmentation tasks and excludes specific tasks like medical segmentation and referring segmentation. Nonetheless, these could be integrated into our OMG-Seg framework with appropriate input adaptations.) Unlike previous unified models that typically employ a shared visual backbone but several task-specific branches, OMG-Seg adopts a shared encoder-decoder architecture. In particular, we cast all the task outputs into a unified query representation. One query can represent a mask label, an image or tube mask, a unique ID, and a visual prompt. Then, we can adopt a shared decoder to process all types of queries with their features. This setup facilitates general training and inference processes that unify all visual-only segmentation tasks, capitalizing on the extensive parameter sharing across tasks. Through co-training on combined image and video datasets, OMG-Seg, once trained, is capable of handling up to ten diverse segmentation tasks across different datasets.

OMG-Seg achieves comparable results on image, video, open-vocabulary, and interactive segmentation settings over eight different datasets, including COCO , ADE-20k , VIPSeg , Youtube-VIS-2019 , Youtube-VIS-2021, and DAVIS-17 , based on one single shared model. To the best of our knowledge, we are the first to achieve four different settings in one single model.

Related Work

Universal Image/Video Segmentation. The advent of vision transformers has led to a wave of innovation in universal segmentation. Recent works have developed mask classification architectures grounded in an end-to-end set prediction approach, outperforming specialized models in both image and video segmentation tasks. Despite these advancements, most existing methods still rely on distinct models for different segmentation tasks and datasets. Recently, there has been a shift towards training a single model across diverse datasets and tasks, reaping the benefits of parameter sharing. For instance, OneFormer integrates three image segmentation tasks within a single model, while UNINEXT concentrates on unifying instance-level tasks. Similarly, TarViS combines various video segmentation tasks using target prompts. However, none of these existing works has thoroughly investigated the joint training of image, video, and prompt-driven data within one comprehensive segmentation model. Our work stands as the first attempt in this direction, stretching the potential of co-training across these domains. For a more in-depth comparison of model capabilities, please refer to Tab. 1.

Visual Foundation Models. Recent studies in visual foundation models have exhibited a diversification in optimization techniques, encompassing various learning paradigms. These include vision-only pre-training strategies , joint vision-language pre-training approaches, and multi-modal frameworks that incorporate visual prompting . A notable example, SAM , demonstrates the generalizability and scalability of extensive training in achieving general segmentation. Building on this, Semantic-SAM augments the SAM model by adding semantic labels and increased levels of granularity. However, despite their impressive capabilities, these visual foundation models typically fall short in video segmentation tasks, necessitating further refinement for optimal performance in more dynamic contexts.

Open Vocabulary Segmentation. This line of visual segmentation research aims to recognize and segment novel objects beyond the limited closed-set visual concepts. Leveraging the transferrable representations offered by vision language models (VLMs), many studies explore the alignment between region and text representations during training. At the inference stage, detectors can recognize new classes using the text embeddings derived from VLMs. Our model follows this notion to achieve open vocabulary segmentation. In particular, we use frozen VLMs to serve both as a feature extractor and classifier. This strategy allows for a seamless transition into the open vocabulary setting.

Unified Modeling. The adaptable nature of the transformer architecture facilitates the sharing of fundamental modules across various modalities. This versatility has inspired several research initiatives that use a common transformer framework for different domains. Notably, efforts in the realm of vision generalists have been directed toward unifying disparate tasks within the vision domain. For instance, the Pix2Seq series approach task unification through auto-regressive token prediction. Similarly, Unified-IO implements a sequence-to-sequence pipeline, converting diverse inputs and outputs into discrete token sequences. Furthermore, recent advancements have explored visual in-context learning as the means to combine various vision tasks. These methods predominantly target task unification across domains. However, bridging the performance gap between unified segmentation models and purpose-built segmentation models remains an open problem.

Methodology

Motivation and Overview. Our OMG-Seg is a single yet versatile model—with reduced task-specific customization and maximal parameter sharing—that can support a diverse set of segmentation tasks, making it one model for all segmentation. Our goal is not to pursue state-of-the-art results for each task but to increase the modeling capacity of one generalizable segmentation model while allowing extensive knowledge sharing between tasks.

The main idea of our approach is to leverage object queries for representing distinct entities, encompassing various mask types and their respective video formats. In Sec. 3.1, we begin by reexamining the definitions of image, video, interactive, and open vocabulary segmentation settings. In this exploration, we show that the target outputs of these varied settings can be effectively transformed into a unified query representation. Specifically, a single query can encapsulate a mask label, an image or tube mask, a unique identifier, or a visual prompt.

For example, video segmentation tasks require only an additional ID compared to image segmentation, which can be adapted from the query ID. This allows us to employ a shared decoder to process each query and its associated features in a streamlined manner, with the primary distinction being the specific feature inputs used in cross-attention layers. In the context of image tasks, we follow the established design of Mask2former , enabling queries and features to engage in masked-cross attention. For video tasks, we incorporate temporal features with 3D position embeddings and focus on predicting tube masks for objects across short video clips. For interactive segmentation tasks, we employ the same decoder as image tasks but skip the self-attention operation to condition mask prediction only on the visual prompts and image contents, as detailed in Sec. 3.2.

In addition, to circumvent class taxonomy conflicts, we adopt CLIP embeddings for mask classification. We employ the frozen CLIP visual encoder as the backbone, whose features are shared by the pixel decoder and the open-vocabulary mask classification. This design enables efficient open-vocabulary inference without incurring additional costs. The training and inference pipelines built on such a frozen backbone are described in Sec. 3.3.

Open-Vocabulary and Multi-Dataset Segmentation. The task formulation is the same as the previous image and video segmentation. However, this setting goes beyond fixed label space. In particular, it requires open-set recognition on various datasets. Meanwhile, multi-dataset segmentation requires one model to segment more concepts under different datasets. As a common practice, we adopt CLIP text embedding as the mask classifier, which avoids taxonomy conflicts and achieves open-set recognition at the same time. As a result, we measure the distance between the visual query feature and class embeddings rather than the learned classifier.
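As a minimal sketch of the text-embedding classifier described above, the snippet below scores each query feature against per-class text embeddings by cosine similarity and applies a softmax. The toy 3-dimensional vectors stand in for CLIP text embeddings, and the function names and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def classify_queries(query_feats, text_embeds, temperature=0.01):
    """Return per-query class probabilities from cosine similarity to text embeddings."""
    text_unit = [l2_normalize(t) for t in text_embeds]
    probs = []
    for q in query_feats:
        q_unit = l2_normalize(q)
        logits = [sum(a * b for a, b in zip(q_unit, t)) / temperature for t in text_unit]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

# Toy embeddings standing in for the text features of two class names (hypothetical values).
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query_feats = [[0.9, 0.1, 0.0], [0.2, 2.0, 0.1]]
probs = classify_queries(query_feats, text_embeds)
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Because the class set is defined only by the list of text embeddings, swapping in a different vocabulary at inference time changes the label space without touching any learned weights.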

All the Things are in Queries. As mentioned above, by combining all the different settings, we can represent every output segmentation entity using the same query-based mask classification framework. In particular, one object query corresponds to one mask $m_{i}$, label $c_{i}$, and ID $d_{i}$. The exact formats and ranges of $m_{i}$, $c_{i}$, and $d_{i}$ vary with the task setting, but they are similar enough to share one representation. Thus, it is natural to put all these tasks into one shared encoder-and-decoder framework and co-train one model for all segmentation tasks.
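To make the shared output format concrete, here is a small illustrative container for what one query decodes to. The class name, field names, and the convention of using `None` for unused fields are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class QueryOutput:
    """One decoded query: mask m_i, label c_i, and ID d_i (names are illustrative)."""
    mask: List[List[List[int]]]      # T x H x W binary mask; T == 1 for images
    label: Optional[int] = None      # None for class-agnostic (interactive) masks
    track_id: Optional[int] = None   # None when no cross-frame identity is needed

    @property
    def is_tube(self) -> bool:
        """A mask spanning more than one frame is treated as a tube mask."""
        return len(self.mask) > 1

# The same container covers an image entity, a video entity, and a prompt-driven mask.
image_entity = QueryOutput(mask=[[[0, 1], [1, 1]]], label=3)
video_entity = QueryOutput(mask=[[[0, 1], [1, 1]], [[0, 0], [1, 1]]], label=3, track_id=7)
prompt_entity = QueryOutput(mask=[[[1, 0], [0, 0]]])
```

Only the populated fields differ between the three cases, which is why a single decoder interface can serve all of them.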

3.2 OMG-Seg Architecture

Overview. OMG-Seg follows the architecture design of Mask2Former. As shown in Fig. 2, it contains a backbone, a pixel decoder, and a mask decoder. It differs in three aspects: a frozen backbone design, combined object queries that contain both object queries and visual prompts, and a shared multi-task decoder. Given different task settings, the decoder outputs the corresponding masks and labels.

VLM Encoder as Frozen Backbone. To enable open-vocabulary recognition, we adopt the frozen CLIP visual model as the backbone feature extractor. We use the ConvNeXt architecture from OpenCLIP. Given image/video inputs, the VLM encoder extracts multi-scale frozen features $\{F^{frozen}_{j}\}^{3}_{j=1}$ for further processing.

Pixel Decoder as Feature Adapter. The pixel decoder is the same as in Mask2Former and contains multi-stage deformable attention layers. It transforms the frozen features $\{F^{frozen}_{j}\}^{3}_{j=1}$ into fused features $\{F^{fuse}_{j}\}^{3}_{j=1}$ with the same channel dimension, where $j$ is the layer index of the feature and $j=3$ denotes the highest-resolution feature.

Combined Object Queries. As analyzed above, each object query represents one type of mask output. However, from the functionality perspective, the image, video, and interactive modes represent different properties. For images, object queries focus on object-level localization and recognition. For video, object queries may also involve temporal consistency, such as following the same object across different frames. For interactive segmentation, object queries are forced to locate specific regions. For image and video inputs, we adopt object queries to represent image masks or tracked tube masks. Since both need semantic labels, we term them semantic queries, $Q_{obj}^{s}$. For the interactive mode, following SAM, we adopt the prompt encoder to encode the various visual prompts into the same shape as the object queries. We term them location queries, $Q_{obj}^{l}$. Thus, we can share the same interface for the transformer decoder.

Shared Multi-Task Decoder. Its main operation is cross-attention, which takes in the combined object queries ($Q_{obj}^{s}$ and $Q_{obj}^{l}$) and the image/video feature $F^{fuse}_{j}$, and outputs refined object queries. The final masks are obtained via the dot product of the refined queries and the high-resolution feature $F^{fuse}_{3}$. For image semantic-level tasks, we adopt the same procedure as Mask2Former. In particular, $Q_{obj}^{s}$ performs masked cross-attention with the multi-scale features $F^{fuse}_{j}$, where $Q_{obj}^{s}$ is the Query and $F^{fuse}_{j}$ provides the Key and Value. Then, a multi-head self-attention (MHSA) layer is applied to the refined queries. The refined queries and high-resolution features are used to predict the image masks.
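The dot-product mask prediction mentioned above can be illustrated with a small pure-Python sketch: each refined query (a vector of length C) is multiplied with the C-channel high-resolution feature at every pixel, and the resulting logit is thresholded. The function name, the toy feature values, and the zero threshold are assumptions for illustration; a real implementation would use a tensor library.

```python
def predict_masks(queries, features, threshold=0.0):
    """queries: N x C list; features: C x H x W list -> N x H x W binary masks."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    masks = []
    for q in queries:
        mask = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                # Per-pixel logit is the dot product of the query and the feature vector.
                logit = sum(q[c] * features[c][y][x] for c in range(channels))
                mask[y][x] = 1 if logit > threshold else 0
        masks.append(mask)
    return masks

features = [
    [[1.0, -1.0], [1.0, -1.0]],   # channel 0: positive on the left column
    [[-1.0, -1.0], [1.0, 1.0]],   # channel 1: positive on the bottom row
]
queries = [[1.0, 0.0], [0.0, 1.0]]  # each toy query selects one channel
masks = predict_masks(queries, features)
```

In this toy case the first query recovers the left column and the second recovers the bottom row, showing how different queries carve different masks out of the same feature map.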

For video tasks, we adopt the same cross-attention design. The only difference is that the pyramid features $F^{fuse}_{j}$ are concatenated along the temporal dimension with 3D position embeddings, which is the default setting in previous works. The combined video features and refined queries are used to predict the tube mask.

For interactive segmentation, we carry out the same cross-attention design. However, we skip the self-attention to avoid interaction between mask queries in the MHSA layer, since the interactive segmentation only cares about the input visual prompt regions. After obtaining the refined object query, it is passed through a prediction FFN, which typically consists of a 3-layer perceptron with a ReLU activation layer and a linear projection layer. All the queries are supervised by mask classification loss and mask prediction loss. The decoding process is in a cascaded manner, in three stages for each feature pyramid.

3.3 Training and Inference

Joint Image Video Dataset Co-training. Rather than first pre-training on image datasets, our goal is to train all segmentation tasks jointly, only once. The training targets in all three cases are one entity label and one mask. The entity can be a thing, stuff, or class-agnostic mask, with its corresponding label. Note that the instance masks with the same ID $d$ form the tube masks. During training, we apply Hungarian matching between the predicted and ground-truth entity masks to assign object queries to video/image entities, and then supervise their predicted masks and classification. The classifier is replaced by CLIP text embeddings to avoid cross-dataset taxonomy conflicts. The final loss function is given as $L=\lambda_{cls}L_{cls}+\lambda_{ce}L_{ce}+\lambda_{dice}L_{dice}$. Here, $L_{cls}$ is the Cross-Entropy (CE) loss for mask classification, and $L_{ce}$ and $L_{dice}$ are the mask Cross-Entropy (CE) loss and Dice loss for segmentation, respectively.
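The matching-then-supervision step can be sketched as follows. This is a toy illustration under stated assumptions: the assignment is found by exhaustive search, which gives the same optimum as Hungarian matching on tiny inputs but does not scale; the matching cost uses only the Dice term; and the loss weights default to 1.0 as placeholders, since the values of $\lambda_{cls}$, $\lambda_{ce}$, and $\lambda_{dice}$ are not given in this section.

```python
import math
from itertools import permutations

def dice_loss(pred, target, eps=1.0):
    """Dice loss between flat lists of mask probabilities and 0/1 targets."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def mask_ce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over mask pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def match(cost):
    """Exhaustive minimum-cost assignment; result[j] is the query matched to ground truth j."""
    n_pred, n_gt = len(cost), len(cost[0])
    return list(min(permutations(range(n_pred), n_gt),
                    key=lambda perm: sum(cost[perm[j]][j] for j in range(n_gt))))

def total_loss(cls_probs, gt_label, mask_pred, mask_gt, w_cls=1.0, w_ce=1.0, w_dice=1.0):
    """L = w_cls * L_cls + w_ce * L_ce + w_dice * L_dice for one matched pair."""
    l_cls = -math.log(max(cls_probs[gt_label], 1e-7))
    return w_cls * l_cls + w_ce * mask_ce(mask_pred, mask_gt) + w_dice * dice_loss(mask_pred, mask_gt)

# Three predicted queries, two ground-truth entities (all values are toy data).
pred_masks = [[0.9, 0.1, 0.1, 0.1], [0.1, 0.9, 0.9, 0.1], [0.5, 0.5, 0.5, 0.5]]
pred_cls = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
gt_masks = [[0, 1, 1, 0], [1, 0, 0, 0]]
gt_labels = [0, 1]

cost = [[dice_loss(p, g) for g in gt_masks] for p in pred_masks]
assignment = match(cost)
loss = sum(total_loss(pred_cls[q], gt_labels[j], pred_masks[q], gt_masks[j])
           for j, q in enumerate(assignment))
```

The third query is left unmatched here; in Mask2Former-style training such queries are typically supervised toward a "no object" class, which this sketch omits.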

Universal Inference. For image segmentation, we follow the same inference procedure of Mask2Former . For example, for PS, we merge the things and stuff according to the sorted scores. The scores are generated by CLIP text embedding. For video segmentation tasks, for VIS and VPS, to generate instance ID, following previous work, we use query matching rather than introducing extra tracking components. For VOS tasks, we adopt mask matching between the first frame and the remaining frames. For interactive segmentation tasks, we follow the original SAM , by providing box and point prompts, and obtain the binary masks. For open vocabulary segmentation, since we have a frozen CLIP encoder, we merge mask pooled score and learned score with the open-vocabulary embeddings.
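One possible form of the query matching used to carry instance IDs across frames or clips is sketched below. The paper only states that query matching replaces extra tracking components, so the greedy cosine-similarity scheme, the function names, and the toy embeddings here are all assumptions rather than the authors' exact procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def propagate_ids(prev_queries, prev_ids, cur_queries):
    """Greedily give each current query the ID of its most similar unused previous query."""
    pairs = sorted(
        ((cosine(p, c), i, j)
         for i, p in enumerate(prev_queries)
         for j, c in enumerate(cur_queries)),
        reverse=True,
    )
    cur_ids = [None] * len(cur_queries)
    used_prev = set()
    for _, i, j in pairs:
        if i in used_prev or cur_ids[j] is not None:
            continue  # each previous and current query is matched at most once
        cur_ids[j] = prev_ids[i]
        used_prev.add(i)
    return cur_ids

prev_queries = [[1.0, 0.0], [0.0, 1.0]]
prev_ids = [10, 20]
cur_queries = [[0.1, 0.9], [0.8, 0.2]]  # order swapped relative to the previous frame
cur_ids = propagate_ids(prev_queries, prev_ids, cur_queries)
```

Current queries that remain unmatched keep `None`, which a caller could treat as a newly appearing instance.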

Combining Tasks For More Applications. Since our model can perform various segmentation tasks, combining interactive, open vocabulary and image/video segmentation tasks can lead to several new applications. For example, we can combine interactive and video segmentation, leading to flexible prompt-driven video object segmentation. Or we can combine interactive segmentation with an open vocabulary setting, which results in open vocabulary interactive segmentation. More examples are provided in Sec. 4 and supplementary.

Experiments

Datasets and Metrics. Unlike regular settings, we aim to explore co-training on as many datasets as possible. In Tab. 2, we use COCO panoptic, COCO-SAM, VIPSeg, and Youtube-VIS-2019 (YT-VIS-19) as training datasets. In addition to closed-set testing, we include open-vocabulary (OV) inference on the Youtube-VIS-2021, ADE-20k, and DAVIS-2017 datasets, whose annotations are not used during training. COCO-SAM is created by using the ground-truth boxes and mask center points as visual prompts; the annotations are derived from the COCO panoptic masks. Moreover, we also include the multi-dataset settings in Tab. 3 to verify the effectiveness of multi-dataset co-training of our OMG-Seg. Compared with Tab. 2, we add more datasets, including ADE-20k and YT-VIS-21, for joint co-training. We use the corresponding metrics for each dataset, including PQ, mask mAP, VPQ, tube mAP, J&F, and mIoU.
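As an illustration of how box and point prompts could be derived from a ground-truth mask when building a COCO-SAM-style dataset, the sketch below computes a tight bounding box and a center point from a binary mask. Taking the center as the centroid of the foreground pixels is an assumption of this sketch (the paper only says "mask center points"), and the function name is hypothetical.

```python
def mask_to_prompts(mask):
    """mask: H x W list of 0/1 -> (box (x_min, y_min, x_max, y_max), center (x, y))."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None, None  # an empty mask yields no prompt
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    box = (min(xs), min(ys), max(xs), max(ys))
    center = (sum(xs) / len(xs), sum(ys) / len(ys))  # centroid of foreground pixels
    return box, center

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
box, point = mask_to_prompts(mask)
```

Note that for strongly concave masks the centroid can fall outside the object, so a practical pipeline might instead pick the foreground pixel nearest to it.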

Implementation Details. We implement our models and all other baselines in MMDetection . We use the distributed training framework with 32 A100 GPUs. Each mini-batch has one image per GPU. For data augmentation, we adopt large-scale jitter as previous works to build strong baselines. For all models in each table, we adopt the same training steps. We use OpenCLIP to initialize the backbone network and replace learned classifiers with their corresponding text embeddings. For image inputs, we treat them as pseudo videos by concatenating two images and their masks into one. We adopt different sampling rates to balance the training examples for each dataset. We report results of both frozen and trained backbones for reference. We list more details in the supplementary material.

System-level Comparison. In Tab. 2, we present a comparative analysis of our OMG-Seg against recent methodologies across a variety of settings. A significant highlight of our work is its unique capability to deliver substantial results in all scenarios using a single model framework. In the realm of specific image and video segmentation models, OMG-Seg demonstrates performance on par with leading approaches like Mask2Former, Tube-Link, and TarViS. While it exhibits a slight decrease in performance on the COCO image segmentation benchmark, it achieves near state-of-the-art results on the VIPSeg dataset, showcasing its robustness and versatility. Furthermore, when benchmarked against open vocabulary methods such as FC-CLIP and ODISE, OMG-Seg not only competes favorably but also outperforms ODISE in certain scenarios. This is particularly evident in the realm of open vocabulary video segmentation on YT-VIS-21, as detailed in the 7th column of the table. These findings underscore the effectiveness and adaptability of our OMG-Seg approach in handling a wide array of segmentation challenges.

In addition, our method has been benchmarked against recent unified models, revealing insightful comparisons. When compared with vision generalists such as that described in , our approach, OMG-Seg, demonstrates superior performance. However, in comparison with several specialized segmentation models, including UNINEXT and Wang et al., we observe a discernible performance gap on the COCO dataset, notably in panoptic and instance segmentation tasks. This gap, we argue, can be partially attributed to our training regime, which spans only 24 epochs and keeps the backbone frozen. Furthermore, the integration of video segmentation and interactive segmentation datasets for joint co-training presents a more formidable challenge compared to previous works. This is primarily because learning spatial-temporal and localization-sensitive features from image data is inherently more complex, given the diversity of the learning targets.

Despite these challenges, it is noteworthy that no other existing models offer the comprehensive segmentation capabilities that OMG-Seg does. This ability to effectively handle all forms of segmentation, despite the small performance gaps noted, reinforces our assertion that OMG-Seg is a robust and versatile model suitable for diverse segmentation scenarios.

Multi-dataset Setting. In Tab. 3, we extend our investigation to multi-dataset settings. To ensure a fair comparison in the same setting, we reimplemented two key baselines: K-Net and Mask2Former . Our findings indicate that joint co-training generally enhances performance across most video segmentation datasets, leading to substantial model parameter reduction (from 1326M to 221M). This improvement is consistent across three VPS and VIS datasets, irrespective of whether the backbones are frozen or not. However, it is noteworthy that the performance on the ADE-20k dataset significantly diminishes under joint co-training. We hypothesize that this is largely due to the challenges posed by scale variance and the uneven distribution of classes within the dataset. Interestingly, when using a pre-trained backbone, we observe an uplift in image segmentation performance, albeit at the cost of a minor decline in video segmentation efficacy. This trade-off can be attributed to the unbalanced nature of samples that pursue different optimization objectives, essentially causing a tug-of-war over the representational capacity of the backbone. Such a scenario suggests that incorporating a greater volume of video training examples could potentially address this issue.

Qualitative Result. In Fig. 3, we show the effectiveness of our OMG-Seg model using a ConvNeXt-Large model across five different tasks. The first two rows demonstrate the model’s high-quality image segmentation capabilities on the COCO dataset. In the VIS and VPS tasks, OMG-Seg shows proficiency in segmenting and tracking foreground objects. Notably, in the last row, we show an open-vocabulary video instance segmentation on Youtube-VIS, successfully identifying the “lizard” class, which was not included in the training set.

4.2 Ablation Study and Analysis

In this section, we use COCO, VIPSeg, and Youtube-VIS-19 for ablation studies of our OMG-Seg. All experiments use frozen ConvNeXt-Large as the backbone and the same data augmentation with 12 epochs training by default.

Effect of Training Dataset. In Tab. 4, we evaluate the impact of various datasets on model performance. As indicated in the first row, using only the COCO dataset yields satisfactory zero-shot results across other datasets, largely attributed to the employment of frozen CLIP visual features for zero-shot region feature classification. Upon integration of the VIPSeg dataset, a slight dip in performance on the COCO dataset is observed. However, this is counterbalanced by significant improvements in both the VIPSeg and Youtube-VIS datasets. Incorporating all three datasets, COCO, VIPSeg, and Youtube-VIS, results in an optimal performance balance across all datasets, establishing this combination as our preferred and default configuration.

Ablation on Shared Decoder Design. In Tab. 5, we explore the efficacy of a shared decoder design. Employing a separate decoder head for video segmentation tasks results in a slight performance decrease. This outcome is influenced by our use of pseudo-video samples during image dataset training. By sharing the decoder, we align the optimization objectives more closely, which particularly benefits the video datasets with short clips.

Ablation on Extra Adapter. In Tab. 6, we assess the addition of an extra adapter to the frozen CLIP backbone, enhancing the capacity of OMG-Seg. Our experiments reveal that the adapter boosts performance with fewer training epochs, but its effectiveness against the baseline disappears in extended training scenarios. In addition, we experiment with increasing the neck capacity by duplicating attention layers in the pixel decoder, observing similar outcomes to the adapter implementation. Consequently, we opt not to incorporate additional adapters, maintaining a cleaner and simpler framework.

Ablation on Other CLIPs. In Tab. 7, following the approach of prior open vocabulary research , we primarily employ convolution-based CLIP models due to their spatial information handling and adaptability to scale variations across different datasets. As we scale up the CLIP model size and extend training steps, we observe improvements across all three datasets. Notably, model convergence is achieved at 24 epochs, faster than in previous studies . This accelerated convergence may be attributed to the model’s limited capacity, suggesting that larger models could further elevate performance.

Conclusion

In this study, we introduce the first joint co-training framework for image, video, open-vocabulary, and interactive segmentation. Our solution, OMG-Seg, is a novel yet simple framework that uses a unified query representation and a shared decoder for diverse tasks. For the first time, it is possible to train a single segmentation model capable of performing across ten different tasks with competitive performance compared to task-specific models. This approach significantly reduces both the parameter size and the need for specialized engineering in model design for various applications. We envision that our efficient and versatile framework will serve as a robust baseline for multi-task and multi-dataset segmentation.

Appendix

Overview. In this appendix, we first present more method details in Sec. A. Then, we present more experiment results in Sec. B. Finally, we show more image, video, open-vocabulary, and interactive segmentation demos in Sec. C.

Appendix A More Method Details

More Detailed Comparison with Recent Works. Due to the page limitation, we only select several representative works for setting comparison. Compared with specific models , our method achieves extreme parameter sharing and performs various tasks that these models cannot perform.

Compared with video segmentation and unified video segmentation , our method can also achieve open-vocabulary and interactive segmentation, as well as good enough performance on image segmentation. This is because our model is jointly co-trained on both image and video segmentation datasets without introducing task-specific tuning on video segmentation datasets. In addition, due to the frozen CLIP backbone, our method can also perform video open vocabulary segmentation without any architecture modification.

Compared with recent partially unified models, our method achieves all related visual segmentation in one model. For example, compared with Semantic-SAM, our model can achieve both video segmentation (VIS, VSS, VPS) and open-vocabulary segmentation. Compared with UNINEXT, our method can perform interactive segmentation, panoptic segmentation (VPS, PS), and open-vocabulary segmentation. Compared with OneFormer, we can achieve video, open-vocabulary, and interactive segmentation. Compared with TarViS, we can keep image segmentation without specific fine-tuning. Compared with the recent FreeSeg, we can achieve both video segmentation and interactive segmentation in one model.

Implementation Details of OMG-Seg. We use balanced training for our model. In particular, for the two different settings of Tab. 2 and Tab. 3 in the main paper, we balance the samples of each dataset according to the COCO dataset size. Then, we choose the same data augmentation as Mask2Former. For the text embedding generation, we follow the standard open-vocabulary detection and segmentation setting. We generate multiple text prompts with the class names and keep the text embeddings fixed for both training and inference. In this way, we can achieve multi-dataset and open-vocabulary segmentation.
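One simple way to balance datasets against a reference size is to compute a per-dataset repeat factor, as in the sketch below. The paper does not spell out its exact balancing scheme, so this ratio-based rule, the function name, and the dataset sizes (toy numbers, not the real dataset sizes) are assumptions for illustration.

```python
def repeat_factors(dataset_sizes, reference="coco"):
    """Return how many times each dataset is repeated to match the reference size."""
    ref = dataset_sizes[reference]
    return {name: ref / size for name, size in dataset_sizes.items()}

# Toy sizes chosen for readability; they are not the actual dataset sizes.
sizes = {"coco": 1000, "video_a": 250, "video_b": 500}
factors = repeat_factors(sizes)
effective = {name: sizes[name] * factors[name] for name in sizes}
```

After applying the factors, every dataset contributes the same effective number of samples per epoch, so smaller video datasets are not drowned out by the larger image dataset.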

More Detailed Inference Process. Our model has various inference modes. For image segmentation on various datasets, we simply follow Mask2Former to obtain the corresponding masks and labels. For video segmentation, we adopt simple query matching without learning extra tracking query embeddings. We believe adding such components would improve video segmentation. For open-vocabulary segmentation, we fuse the frozen CLIP visual score and the predicted score to boost novel class segmentation. For interactive segmentation, we mainly use point prompts for evaluation, although box prompts can also be used, as in SAM. Moreover, since our model adopts the frozen CLIP features, we can freely label the prompt-driven segmentation masks, which achieves open-vocabulary interactive segmentation. The GFLOPs in the main paper are calculated with a $1200\times 800$ input by default.
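The fusion of the two classification scores is not specified in detail here. One common choice in frozen-CLIP open-vocabulary segmentation is a geometric ensemble of the learned (in-vocabulary) probabilities and the mask-pooled CLIP probabilities; the sketch below assumes that form, with a hypothetical mixing weight `alpha`, and should not be read as the exact rule used by OMG-Seg.

```python
def fuse_scores(learned, clip_pooled, alpha=0.5):
    """Geometric ensemble of two per-class probability lists, renormalized to sum to 1."""
    fused = [(l ** (1.0 - alpha)) * (c ** alpha) for l, c in zip(learned, clip_pooled)]
    total = sum(fused)
    return [f / total for f in fused] if total > 0 else fused

learned = [0.7, 0.2, 0.1]       # score from the trained mask classifier (toy values)
clip_pooled = [0.2, 0.2, 0.6]   # score from mask-pooled frozen CLIP features (toy values)
fused = fuse_scores(learned, clip_pooled)
learned_only = fuse_scores(learned, clip_pooled, alpha=0.0)
```

A larger `alpha` leans more on the frozen CLIP score, which tends to matter most for classes never seen during training.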

Appendix B More Experiment Results

In addition to the main paper, we also provide more ablation studies and experiment results here.

Results Using ResNet50 backbone. In Tab. 8, we report our model using a ResNet50 backbone. We jointly co-train our model for 24 epochs. Compared with the task-specific Mask2Former trained for 50 epochs, our model achieves considerable results with fewer parameters.

Exploration on ViT-based CLIP backbone. In Tab. 9, we explore the CLIP-ViT backbone. We find that using a frozen CLIP-ViT leads to inferior results. This is because the position embedding of ViT is fixed (224 by default), and a simple bilinear upsampling operation hurts the original representation. Thus, in the second row, we adopt the learned architecture. However, we still find performance gaps with convolution-based CLIP. Moreover, since CLIP is no longer frozen, the open-vocabulary ability is lost during fine-tuning.

Interactive Segmentation with Masked Self-Attention. In interactive mode, we make the queries invisible to each other (achieved by masking) during the self-attention process. Otherwise, as shown in Tab. 10, we find a significant performance drop on both COCO-SAM and COCO-PS. This is because, for interactive segmentation, the local features are good enough, while introducing global information brings noise to the query learning.
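The masking described above can be sketched as a boolean attention mask over the combined queries. The layout (semantic queries first, then location queries), the convention that `True` means "blocked", and the choice to also hide location queries from semantic queries are assumptions of this sketch.

```python
def build_query_attention_mask(num_semantic, num_location):
    """Return an N x N matrix where True means attention from query i to query j is blocked."""
    total = num_semantic + num_location
    blocked = [[True] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            both_semantic = i < num_semantic and j < num_semantic
            # Semantic queries see each other; every query always sees itself.
            if both_semantic or i == j:
                blocked[i][j] = False
    return blocked

attn_mask = build_query_attention_mask(num_semantic=2, num_location=2)
```

With this mask, each location query depends only on its own prompt and the image content, which matches the motivation that interactive segmentation only cares about the prompted region.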

Appendix C More Visualization Example

More Visual Results on More Tasks. In Fig. 4, we present more visual examples for two additional tasks. One is open-vocabulary panoptic segmentation on ADE-20k. As shown in the top row, our method can achieve good zero-shot segmentation quality. In the second row, we also provide interactive segmentation on the ImageNet-1k dataset. We add the class labels that are from the simple CLIP score. To this end, we achieve open-vocabulary interactive segmentation.

Limitation and Future Work. One limitation of our work is the capacity of our model. Because we use a frozen architecture to keep the open-vocabulary ability, our results on any one specific dataset or task are inferior. However, we believe that co-training on more datasets with a learned backbone will improve our model's performance. With the aid of more text-image pairs or classification datasets, we could also retain the open-vocabulary segmentation ability while improving performance on closed sets. Scaling up our model in this way is left for future work. Moreover, we can also add a text path to support language-driven segmentation tasks, such as referring image/video segmentation, or even combine the model with large language models (LLMs) to perform joint reasoning and segmentation in one framework.