Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively

Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy

Introduction

The Segment Anything Model (SAM) and CLIP have made significant strides in various vision tasks, showcasing remarkable generalization capabilities in segmentation and recognition, respectively. SAM, in particular, has been trained with a massive dataset of mask labels, making it highly adaptable to a wide range of downstream tasks through interactive prompts. On the other hand, CLIP’s training with billions of text-image pairs has given it an unprecedented ability in zero-shot visual recognition. This has led to numerous studies exploring the extension of CLIP to open vocabulary tasks, such as detection and segmentation.

While SAM and CLIP offer considerable advantages, they also have inherent limitations in their original designs. SAM, for instance, lacks the capability to recognize the segments it identifies. Efforts have been made to overcome this by integrating a classification head, but these solutions are constrained to specific datasets or closed-set settings. On the other hand, CLIP, which is trained using image-level contrastive losses, faces challenges in adapting its representations for dense prediction tasks. To address this, several studies have investigated ways to align CLIP’s representation for dense predictions. However, these approaches tend to be dataset-specific and not universally applicable. For example, some research has focused on open vocabulary segmentation on the ADE-20k dataset, using the COCO dataset for pre-training. Merging SAM and CLIP in a naïve manner, as illustrated in Fig. 2 (a) and (b), proves to be inefficient. This approach not only incurs substantial computational expenses but also yields subpar results, particularly in the recognition of small-scale objects, as evidenced by our experimental results.

In this study, we address these challenges with a unified encoder-decoder framework that integrates a CLIP encoder and a SAM decoder, as depicted in Fig. 2 (c). To bridge these two distinct components effectively, we introduce two novel modules, SAM2CLIP and CLIP2SAM, facilitating dual knowledge transfer. First, we distill knowledge from the SAM encoder to a CLIP encoder using SAM2CLIP. This distillation process is uniquely executed not directly on the CLIP encoder, which is kept frozen to maintain its existing knowledge, but rather on a lightweight transformer-like adapter using a pixel-wise distillation loss. The adapter takes multiscale features as input, with the goal of aligning CLIP features with SAM representation. On the decoding side, the CLIP2SAM module transfers knowledge from the frozen CLIP encoder to the SAM decoder. In particular, we design a feature pyramid adapter with a RoIAlign operator to be jointly trained with the SAM decoder.

Following the spirit of SAM, we enhance our model’s recognition capabilities by harnessing the power of established semantic datasets, including COCO, LVIS, and ImageNet-22k. This strategy elevates our model to the versatility of SAM, endowing it with enhanced capability to segment and recognize any object, as shown in Fig. 1. As our approach is an adaptation of SAM, it is flexible enough to be integrated with various detectors, making it suitable for both closed-set and open-set environments.

We conduct extensive experiments across a range of datasets and scenarios, encompassing closed-set and open-vocabulary interactive segmentation. Notably, when compared to basic combined baselines, our approach demonstrates superior performance, achieving over 2% improvement in IoU and 3% in mAP with various detectors on the COCO dataset. In particular, for recognition on LVIS, our approach achieves an improvement of over 20% compared with previous adapters. Furthermore, by expanding our approach with a more diverse array of datasets, we have developed a versatile, interactive tool suitable for practical applications. For detailed results, we direct the reader to Sec. 4 and the appendix.

Related Work

Vision Language Models (VLMs). Vision-language pre-training has given rise to models with aligned image and text representations. Recent studies on contrastive vision-language pre-training have significantly improved the generalization ability of recognition models. Meanwhile, several works aim to design better optimization goals for downstream multi-modal tasks, including captioning and visual question answering. Among these works, CLIP models that are pre-trained on billion-scale image-text pairs have shown impressive zero-shot classification performance on a wide range of datasets. Our goal is to enable SAM to perform recognition tasks with the help of pre-trained VLMs.

Open Vocabulary Dense Prediction. This direction aims to recognize region visual concepts of arbitrary categories described by texts, which includes object detection, semantic segmentation, and panoptic segmentation. This necessitates the alignment between region and text representations with the help of VLMs. For open-vocabulary detection, a series of works distill knowledge from the CLIP models to recognize novel objects. In contrast to distillation-based methods, several works directly build object detectors upon frozen CLIP CNNs. For open-vocabulary segmentation, the typical works first generate class-agnostic mask proposals and then classify the proposals with CLIP. Recently, several works build the mask generator upon the frozen diffusion model and CLIP model. Meanwhile, several studies focus on class-agnostic segmentation and detection to enrich generalization ability in various domains. However, most approaches are trained and tested on specific datasets. Our approach is based on SAM, which provides a general, interactive tool to support different open vocabulary detectors.

Prompting in Computer Vision. Prompting, originating from in-context learning in natural language processing (NLP) as seen in works like Brown et al. and Rubin et al., leverages a large language model to infer unseen tasks through context-specific input-output pairs. Recent studies have explored in-context learning for visual tasks. Common techniques involve masked image modeling for cross-task visual prompting, as employed by approaches like Painter and Bar et al. SAM demonstrates in-context learning through interactive segmentation, using diverse visual prompts like points, boxes, and masks, although it is limited to class-agnostic mask prediction. Meanwhile, other studies have concentrated on efficient parameter tuning of visual foundation models, typically focusing on a single model. Our work uniquely bridges two models, CLIP and SAM, exploring their combined potential for enhanced general segmentation and recognition capabilities.

Segment Anything Model. SAM presents a new data engine and portable model for general object segmentation. Subsequent research has employed SAM as an interactive segmentation tool for various vision tasks, including grounding, tracking, distillation, medical analysis, and generation. While most studies use SAM to augment downstream tasks, none have yet integrated VLMs and SAM into a unified model capable of both segmentation and recognition of novel classes. Our work makes the first attempt to merge the capabilities of VLMs with SAM for enhanced task versatility.

Methodology

We first review the SAM, CLIP, and combined baselines in Sec. 3.1. Then, we detail our Open Vocabulary SAM in Sec. 3.2. Last, we present our model’s training details and application in Sec. 3.3.

The mask decoder takes the image feature $F$, sparse prompts $Q_{sp}$, mask tokens $Q_{mask}$, and the IoU token $Q_{IoU}$ as input. All the inputs will be concatenated and encoded with a lightweight two-way transformer. Consequently, each mask token is transformed into a dynamic linear classifier, capable of calculating the foreground mask probability for every sparse prompt. Simultaneously, the IoU token is tasked with predicting the confidence score for each mask. Considering the multi-granular nature of SAM’s data annotations, encompassing both instance and part level, $Q_{mask}$ naturally encodes multi-granularity. Our study concentrates exclusively on the object level, which aligns more closely with prevalent real-world applications and datasets such as COCO and LVIS.

CLIP. Given an input image $X$ and a corresponding caption $C$, the CLIP framework processes these modalities to produce respective embeddings: the image embedding $E_{I}$, derived from its image encoder, and the text embedding $\mathbf{t}$, obtained from its text encoder. In the context of open-vocabulary object detection and segmentation, CLIP’s capability to generalize beyond fixed class labels is leveraged to replace traditional classifiers. For instance, in open-vocabulary detection scenarios, the text embedding $\mathbf{t}_{c}$ for the $c$-th object category is generated by inputting the category name into the CLIP text encoder. This process can employ a single template prompt, such as “a photo of {category}”, or multiple prompt templates. Subsequently, for a given region embedding $r$ produced by RoI-Align, the classification score for the $c$-th category is computed as follows:

$$p_{c} = \frac{\exp(\tau \cdot \langle r, \mathbf{t}_{c} \rangle)}{\sum_{i} \exp(\tau \cdot \langle r, \mathbf{t}_{i} \rangle)} \quad (1)$$

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity, the sum runs over all candidate categories, and $\tau$ is a learnable or fixed temperature to re-scale the value.
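As a rough illustration of this scoring rule, the sketch below computes temperature-scaled cosine similarities between one region embedding and a set of text embeddings. It assumes the scores are normalized with a softmax over the candidate categories. The toy 3-dimensional embeddings, the function names, and the value `tau=100.0` are assumptions made for the example, not values taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classification_scores(region_embedding, text_embeddings, tau=100.0):
    """Softmax over temperature-scaled cosine similarities.

    tau is a hypothetical fixed temperature chosen for illustration.
    """
    logits = [tau * cosine_similarity(region_embedding, t) for t in text_embeddings]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings for one region and two hypothetical categories.
r = [0.9, 0.1, 0.0]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = classification_scores(r, text)
```

Because the region embedding points almost entirely along the first text embedding, the first category receives nearly all of the probability mass.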

Combined Baselines. We introduce two different baselines for combining CLIP and SAM, as depicted in Fig. 2 (a) and (b). The first approach, termed the ‘cropped image baseline’, employs the SAM mask decoder’s output to segment and resize the original input image. This processed image then serves as the input for the CLIP image encoder, and, in conjunction with the CLIP text embedding, the mask is classified using Equ. (1). The second approach, referred to as the ‘cropped CLIP image feature baseline’, employs the same initial CLIP feature extraction step. However, in this method, masks predicted by the SAM decoder are used to crop the CLIP image features. Subsequent pooling of these masked features yields the final label, akin to baseline (a).
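The pooling step of the second baseline can be sketched as follows. This is a minimal illustration that assumes average pooling and a binary mask already resized to the feature-map resolution; the function name `mask_pool` and the toy 2x2 feature grid are hypothetical.

```python
def mask_pool(features, mask):
    """Average the feature vectors at positions where the mask is nonzero.

    features: H x W grid of length-D lists; mask: H x W grid of 0/1 values.
    """
    dim = len(features[0][0])
    pooled = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for d in range(dim):
                    pooled[d] += vec[d]
    if count == 0:
        raise ValueError("mask is empty")
    return [x / count for x in pooled]

# Toy 2x2 feature map with 2 channels; the mask selects the right-hand column.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
mask = [
    [0, 1],
    [0, 1],
]
pooled = mask_pool(features, mask)
```

The pooled vector would then be compared with the CLIP text embeddings to obtain a label.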

While both baselines enable zero-shot inference of images, they exhibit a noticeable knowledge gap on specific datasets. To address this, we draw inspiration from recent advancements in visual prompting or adapters. Specifically, we propose incorporating additional learnable tokens as an adapter to fine-tune the model for enhanced performance on downstream datasets. These zero-shot inference capabilities and the fine-tuned models constitute our primary comparison baselines under various experimental conditions, detailed in Sec. 4.1.

3.2 Open Vocabulary SAM

While both baseline models can be enhanced through visual prompting or adapters, as we will discuss in Sec. 4, they face several challenges in real-world applications. First, the requirement for two independent backbones in the combined model increases computational costs ($Prob.1$). Second, SAM and CLIP are trained with distinct objectives (SAM through supervised learning and CLIP via contrastive learning), and there is limited research on knowledge transfer between such diverse architectures ($Prob.2$). Third, despite adapter integration, significant performance gaps remain in recognizing small objects ($Prob.3$). Fourth, there is a lack of exploration into integrating open-vocabulary capabilities for SAM and CLIP, particularly in the context of feature fusion and data scaling ($Prob.4$). Our work aims to solve these problems in a unified yet effective framework.

Unified Architecture. We design a unified architecture for both segmentation and recognition to address $Prob.1$ . Specifically, we adopt the frozen CLIP visual encoder as our feature extractor. Then, both SAM’s mask decoder and prompt encoder are appended behind the CLIP encoder. The meta-architecture of open-vocabulary SAM is shown in Fig. 2 (c), with the more detailed version shown in Fig. 3. This unified architecture is made possible via the SAM2CLIP, which transfers knowledge of SAM to CLIP with distillation, and CLIP2SAM, which employs CLIP knowledge and combines the SAM mask decoder for recognition. We have chosen convolution-based visual backbones for the frozen CLIP backbone, aligning with previous studies that have highlighted their superiority in capturing spatial structures . The efficacy of different CLIP backbones is further explored in Sec. 4.2.

SAM2CLIP. To resolve $Prob.2$, we design the SAM2CLIP module that bridges the gap in feature representations learned by SAM and CLIP, using adaptation and distillation methods. Through comprehensive experiments, we discovered that employing a distillation loss $L_{distill}$ along with transformer-based adapters yields effective results. Specifically, the distillation process involves a simple pixel-wise approach, where SAM-Huge serves as the teacher and the frozen CLIP equipped with an adapter assumes the student’s role. We then implement a per-pixel mean squared error (MSE) loss to align the SAM feature $F_{sam}$ with the CLIP feature $E_{I}$, as detailed below:

$$L_{distill} = \mathrm{MSE}(F_{sam}, E_{I}) \quad (2)$$

We design a multi-scale adapter $A_{sam2clip}$ to align the features from CLIP and SAM. In particular, we take pyramid CLIP features $E_{I}^{i}, i=1,2,3$ as the inputs. Such pyramid features contain both high-resolution and semantic information, which has proven crucial for semantic segmentation. The MSE loss is revised as follows:

$$L_{distill} = \mathrm{MSE}\big(F_{sam}, A_{sam2clip}(\{E_{I}^{i}\}_{i=1}^{3})\big) \quad (3)$$
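To make the pixel-wise distillation objective concrete, the sketch below computes a mean squared error between a teacher feature map and a student feature map of the same shape. It assumes both maps are already spatially aligned and share the same channel dimension. The adapter itself is not modeled here, and the function name and toy values are illustrative.

```python
def mse_distill_loss(teacher, student):
    """Mean squared error over every spatial position and channel.

    teacher, student: H x W grids of length-D lists with identical shapes.
    """
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher, student):
        for t_vec, s_vec in zip(t_row, s_row):
            for t, s in zip(t_vec, s_vec):
                total += (t - s) ** 2
                count += 1
    return total / count

# Toy 1x2 feature maps with 2 channels each.
teacher_feat = [[[1.0, 2.0], [0.0, 0.0]]]   # stands in for the SAM encoder output
student_feat = [[[1.0, 0.0], [0.0, 2.0]]]   # stands in for the adapted CLIP output
loss = mse_distill_loss(teacher_feat, student_feat)
```

In a real training loop only the adapter parameters would receive gradients from this loss, since the text states that the CLIP encoder stays frozen.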

CLIP2SAM. This module aims to leverage CLIP’s knowledge to enhance the recognition capabilities of the SAM decoder. A straightforward approach involves appending a label token $Q_{label}$ to the existing mask token $Q_{mask}$ and IoU token $Q_{IoU}$ . Using $Q_{label}$ , we introduce a specialized adapter to facilitate the transfer of knowledge from the frozen CLIP to the SAM decoder. Subsequently, the enhanced $Q_{label}$ , combined with the output of the prompt encoder and adapted CLIP features, is fed into a two-way transformer. Following the cross-attention process, the improved $Q_{label}$ undergoes further refinement through a multilayer perceptron (MLP), ensuring better alignment with CLIP’s text embedding. The final labels are derived by calculating the distance between the refined label token and the CLIP text embedding, as in Equ. (1).

This design, however, falls short of recognizing small objects ($Prob.3$) since the adaptation only involves the single-scale feature, which is mainly focused on segmentation. We present a simple yet effective solution to handle this issue, introducing a lightweight feature pyramid network (FPN) for CLIP2SAM adaptation. As shown in Fig. 3, the pyramid network extracts multi-scale CLIP features as the inputs. Then, we apply the RoI-Align operation to extract region features. Like the R-CNN framework, we apply one convolution layer and an MLP to learn the feature embedding without introducing cross-attention in the mask decoder. In particular, for point prompts, we first obtain the corresponding masks via the SAM decoder and then derive the boxes from those masks. For box prompts, we can directly send them to the FPN for region feature extraction. Given that our method incorporates only a few convolution layers, it does not significantly increase computational costs compared to the original SAM.
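For point prompts, the step of deriving a box from a predicted mask can be illustrated with a short sketch. It assumes a binary mask and returns inclusive pixel indices. The function name `mask_to_box` and the coordinate convention are choices made for this example.

```python
def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around nonzero pixels.

    Coordinates are inclusive pixel indices; returns None for an empty mask.
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# Toy 4x5 mask with a small object in the middle.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)
```

The resulting box would then play the same role as a user-provided box prompt when extracting region features.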

Open Vocabulary. To tackle $Prob.4$, the open-vocabulary challenge, we leverage the knowledge embedded in the frozen CLIP backbone, which aids in recognizing novel and unseen objects during inference. In line with previous studies, we fuse the learned class scores with those from the frozen CLIP via a geometric mean to leverage information from both the CLIP backbone and CLIP2SAM. Additionally, we investigate various strategies to expand the vocabulary size, such as joint training with multiple datasets, as detailed in Sec. 4.2. Our experimental results demonstrate that the model scales effectively with large datasets.
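One way to picture the score fusion is a weighted geometric mean of the two per-class score vectors. The sketch below is an assumption-laden illustration: the weight `alpha`, the renormalization step, and the function name `fuse_scores` are hypothetical and are not specified in the text.

```python
def fuse_scores(learned, frozen, alpha=0.5):
    """Weighted geometric mean of two per-class score lists.

    learned: scores from the trained head; frozen: scores from frozen CLIP.
    alpha is a hypothetical weight on the frozen CLIP scores; alpha=0.5
    gives the plain geometric mean. The result is renormalized to sum to 1.
    """
    fused = [(p ** (1.0 - alpha)) * (q ** alpha) for p, q in zip(learned, frozen)]
    total = sum(fused)
    return [f / total for f in fused]

learned_scores = [0.9, 0.1]   # toy scores from the learned classifier
frozen_scores = [0.4, 0.6]    # toy scores from the frozen CLIP backbone
fused = fuse_scores(learned_scores, frozen_scores)
```

With `alpha=0.0` the fused scores reduce to the learned scores, and with `alpha=1.0` they reduce to the frozen CLIP scores, which makes the trade-off between the two sources explicit.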

3.3 Training and Application

Training and Loss Function. We first use 1% of the SA-1B dataset to train the SAM2CLIP module, transferring SAM’s knowledge into open-vocabulary SAM with the loss $L_{distill}$ (Equ. (3)). Then, we jointly train the CLIP2SAM module and the mask decoder using segmentation mask and label annotations from COCO or LVIS. The final loss function is given as $L=\lambda_{cls}L_{t\_cls}+\lambda_{ce}L_{t\_ce}+\lambda_{dice}L_{t\_dice}$. Here, $L_{t\_cls}$ is the Cross-Entropy (CE) loss for mask classification, and $L_{t\_ce}$ and $L_{t\_dice}$ are the mask Cross-Entropy (CE) loss and Dice loss for segmentation, respectively. In addition, we adopt joint training with the ImageNet dataset for the demo version of our Open-Vocabulary SAM (see Fig. 5).
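The weighted sum of a classification loss, a mask cross-entropy loss, and a Dice loss can be sketched for a single instance as follows. The loss weights default to 1.0 purely for illustration, since the text does not state their values. The smoothing constant in the Dice term and all function names are likewise assumptions.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy on flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """One minus the (smoothed) Dice coefficient."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def class_cross_entropy(class_probs, label, eps=1e-7):
    """Cross-entropy for one instance given its class probabilities."""
    return -math.log(max(class_probs[label], eps))

def total_loss(mask_probs, mask_targets, class_probs, label,
               w_cls=1.0, w_ce=1.0, w_dice=1.0):
    """Weighted sum of classification, mask CE, and Dice terms.

    The weights are hypothetical placeholders for the lambda coefficients.
    """
    return (w_cls * class_cross_entropy(class_probs, label)
            + w_ce * binary_cross_entropy(mask_probs, mask_targets)
            + w_dice * dice_loss(mask_probs, mask_targets))

targets = [1, 0, 1, 0]
good = total_loss([0.9, 0.1, 0.8, 0.2], targets, [0.7, 0.2, 0.1], 0)
bad = total_loss([0.2, 0.8, 0.3, 0.7], targets, [0.7, 0.2, 0.1], 0)
```

A mask prediction that agrees with the target yields a lower total loss than one that inverts it, as the two toy calls show.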

Inference and Demo Tools. Our model performs inference like SAM, with points and boxes as visual prompts. Specifically, we test boxes and points as visual prompts for the encoder in Sec. 4. On the project page, we show a demo of our model, which can segment and recognize with prompts.

Experiment

Datasets and Metrics. We mainly use COCO and LVIS datasets for the experiments. Moreover, we also use part of SAM data (1%) for SAM2CLIP knowledge transfer. For COCO, we report the results of both closed-set and open-vocabulary settings for the instance segmentation task. In particular, following Zareian et al., we split the categories into 48 base classes with annotations and 17 target classes without annotations. We use the base class annotations for training. For the LVIS dataset, we adopt the open-vocabulary setting and report the results of $AP_{rare}$ for novel classes. In addition, we also report the accuracy of each box or point prompt for reference since our goal is to add recognition ability to SAM. Meanwhile, each prompt’s intersection-over-union (IoU) with its ground truth mask is also adopted to verify the segmentation ability of our method. Different from previous open vocabulary segmentation tasks, where the proposals are generated by the detectors themselves, we term our setting open-vocabulary interactive segmentation, where the box or point prompts serve as conditional inputs.
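For reference, the per-prompt mask IoU used to assess segmentation quality can be computed as in the sketch below. It assumes binary masks of equal size. Returning 0.0 when both masks are empty is a convention chosen for this example.

```python
def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks of the same shape."""
    inter = 0
    union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Convention for this sketch: two empty masks give an IoU of 0.0.
    return inter / union if union else 0.0

pred_mask = [
    [1, 1, 0],
    [0, 1, 0],
]
gt_mask = [
    [1, 0, 0],
    [0, 1, 1],
]
iou = mask_iou(pred_mask, gt_mask)
```

Here the masks share two pixels out of four covered in total, so the IoU is 0.5.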

Baselines. As shown in Fig. 2 (a) and (b), based on different adapter designs, we append these adapters to different locations of the combined models. For example, when using CoOp, we append the learnable tokens by combining them with CLIP features. For several convolution-based adapters, we add the extra convolution layers along with the SAM or CLIP backbone for fair comparison. By default, we adopt SAM-huge and CLIP R50x16.

Implementation Details. We implement our models in PyTorch with both the MMDetection framework and the SAM codebase. We use 8 A100 GPUs for distributed training. Each mini-batch has two images per GPU. The optimizer is AdamW with a weight decay of 0.0001. We adopt full image size for a random crop in the pre-training and training process following Cheng et al. All the class names are converted into CLIP text embeddings, following previous works. We train each model for 12 epochs for fair comparison. Due to the limitation of computation costs, we do not adopt joint SAM data and COCO data training. We first train SAM2CLIP on SAM data, and then we fine-tune the model on COCO or LVIS data. Please refer to the supplementary material for more details.

Comparison with Combined Baselines Using Ground Truth. To avoid the influence of other modules, we first demonstrate the recognition ability of our model in Tab. 1. Compared to the simple combined approaches, adding various adapters with joint co-training leads to better results. However, the recognition ability is still limited on both COCO and LVIS. Our Open-Vocabulary SAM achieves the best results with both boxes and points as visual prompts. We observe more significant gains on the LVIS dataset. We argue that LVIS contains more small objects, which makes it more challenging than COCO. Our method can solve $Prob.3$ and lead to over 20% accuracy improvement. Although the segmentation quality of the baselines is already good (about 80 IoU on COCO and LVIS with box prompts), our method still achieves a 2% IoU improvement. This indicates the effectiveness of our joint co-training on mask prediction and classification. Compared with boxes as prompts, using points as prompts is more challenging since the location clues of points are much weaker than those of boxes. However, our approach still outperforms the combined baselines, with or without adapters.

Comparison with Combined Baselines on OV-Detector. In Tab. 2, we adopt a more challenging setting by using the box predictions from an existing open-vocabulary detector to simulate the interactive segmentation process with deviation. We choose the representative Detic as the open-vocabulary detector. Again, our method achieves the best performance on both COCO and LVIS datasets. In particular, on COCO, compared with previous works, our method improves mask mAP by 3.0 with much lower parameter costs. Results with more detectors can be found in the supplementary material.

Comparison with SAM on various detectors. In Tab. 3, we also test the mask prediction quality of our model and original SAM on two different detectors. Our method can achieve better performance than the original SAM and perform comparably with SAM fine-tuned on COCO. It is worth noting that our Open-Vocabulary SAM has much lower computational costs and parameters than SAM.

Visualization Comparison. In Fig. 4, we compare our approach with the feature-crop baseline. Our model performs better at classifying small and rare objects, as well as at handling occlusion scenarios.

Model as a Zero Shot Annotation Tool. In addition to training on the standard COCO and LVIS datasets, following the spirit of SAM, we also scale up our model by training it with more data. In particular, we adopt more detection data (V3Det, Objects365) and classification data (ImageNet-22k). Owing to significant costs, we have not conducted comparisons with other baselines for this setting. Rather, we have adapted our method into an interactive annotation tool capable of segmenting and recognizing over 22,000 classes.

4.2 Ablation Studies and Analysis

Effectiveness of SAM2CLIP and CLIP2SAM. We first verify the effectiveness of our two proposed modules in Tab. 4. We adopt the image-crop variant of the baseline for comparison. In particular, by sharing a single backbone, we observe a significant drop in the number of parameters and FLOPs, with a slight drop in segmentation performance. The slight drop is caused by the domain gap between SAM data and COCO data during the training of the SAM2CLIP module. However, after adding our CLIP2SAM module and joint co-training with mask classification and prediction, a significant improvement in both segmentation and classification is observed, with just a negligible increase in compute cost.

Detailed Design on SAM2CLIP. In Tab. 5, we explore the detailed design of SAM2CLIP in the first stage of open-vocabulary SAM training. The results show that distillation benefits most when multi-scale features are adopted, suggesting that both high-resolution features and high-level semantics are important for aligning CLIP’s features with SAM’s features.

Detailed Design on CLIP2SAM. In Tab. 6, we explore several designs for the CLIP2SAM module. We compare two designs: a simple classification token with cross attention (Cls Token) and a combination of this token with mask pooled CLIP feature (CLS Token & CLIP MLP fusion). These designs work better than the combined baseline shown in the first row. Nonetheless, due to resolution constraints, these variants cannot handle small objects well, as shown in Fig. 4. In contrast, our design that includes a light FPN improves the performance considerably.

Ablation on Different CLIP Backbones. In Tab. 7, we explore the effect of the frozen CLIP visual backbone. We do not add the CLIP2SAM module in this study. Recent works suggest that CNN-based CLIPs encapsulate more structural information, which suits our goal since we have location-sensitive visual prompts as the input. Thus, we avoid a naïve ViT design for SAM2CLIP and adopt CNN-based CLIPs instead. As shown in the table, we find that ConvNeXt-Large achieves the best performance.

Conclusion

We present Open Vocabulary SAM, a SAM-inspired method for interactive segmentation and recognition. Unlike previous open-vocabulary detection and segmentation methods, our method explores interactive open-vocabulary segmentation for the first time. Given the user’s inputs, such as boxes or points, the proposed approach can interactively segment and label each visual prompt. Compared with the combined baselines and various visual adapters, our proposed CLIP2SAM and SAM2CLIP are both efficient and effective in various settings. Our open vocabulary segmentation is compatible with different detectors, including open-vocabulary detectors and closed-set detectors. With more data, our model plays a role similar to that of SAM, offering an effective annotation tool for both segmentation and instance labeling. In particular, our method can perform large vocabulary segmentation and recognition over 22K classes. We hope our research can provide a solid baseline for combining the strengths of different forms of vision foundation models.

Broader Impact and Limitations. Our study advances interactive segmentation and recognition by combining CLIP and SAM into one framework. This integration holds substantial potential to enhance annotation and image editing applications. A current limitation is the lack of exploration into mask prompts for interactive tasks, which we plan to address in future work.

Appendix

Overview. In the supplementary materials, we provide more details and results to support our main paper. The contents are presented as follows: we first present more details and discussions of our method in Sec. A. Then, we report more results in Sec. B. Finally, we present a video demo for method introduction and discuss future work in Sec. C (see introduction.mp4 and tools_demo.mov).

Appendix A Method Discussion

Comparison with Recent Joint SAM and CLIP Models. Several recent works also explore joint segmentation and recognition as one system. Recognize Anything adopts the tagging-based model. However, the focus of that work is to build large-scale tagging datasets via automatic text semantic parsing. Moreover, it cannot perform interactive segmentation. SAM-CLIP integrates multi-task learning, continual learning techniques, and teacher-student distillation, where its goal is to build semantic segmentation models. On the other hand, our Open-Vocabulary SAM is a variant of SAM, which can support interactive segmentation and mask classification with flexible user inputs. Therefore, SAM-CLIP is orthogonal to the proposed Open-Vocabulary SAM. Meanwhile, Semantic-SAM extends Mask-DINO with fine-grained and interactive segmentation. It contains a decoupled transformer decoder to generate entity proposals, making it a universal and multi-task model. In contrast, our model only contains a lightweight decoder (SAM decoder) and CLIP2SAM prediction. Besides, our work mainly explores the open-vocabulary ability of a frozen CLIP model to enhance the SAM. As a result, our method only requires training partial parameters and fuses the knowledge embodied in SAM and CLIP.

Comparison with Open-Vocabulary Methods. Our work is orthogonal to previous open-vocabulary methods. Specifically, we aim to build a SAM-like model with interactive prompts, such as points and boxes. Previous open-vocabulary methods need to generate region proposals or mask proposals to recall all possible objects in the scene. Following the same spirit of SAM, our model mainly takes these proposals for segmentation and recognition. Thus, we can deploy our model on various OV-detectors to achieve instance segmentation or class-agnostic detectors to achieve joint segmentation and recognition.

Training and Inference Details. In Fig. 6, we present a more detailed visualization of both training and inference of our Open-Vocabulary SAM. As shown in the figure, only three components are learned during the training, including SAM2CLIP (129M), CLIP2SAM (3.8M), and SAM decoder (4.0M). The remaining parameters are frozen. The total parameters are 304M. After SAM2CLIP distillation, the heavy SAM encoder (637M) is dropped when fine-tuning our model on the detection and classification datasets, which speeds up the CLIP2SAM process.
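The parameter budget quoted above can be checked with a few lines of arithmetic. The component sizes are taken directly from the text (in millions of parameters). The frozen remainder and the trainable share are derived quantities, not figures reported by the paper.

```python
# Parameter counts in millions, as quoted in the text.
trainable_m = {"SAM2CLIP": 129.0, "CLIP2SAM": 3.8, "SAM decoder": 4.0}
total_m = 304.0

trainable_total = sum(trainable_m.values())   # parameters updated during training
frozen_total = total_m - trainable_total      # derived: parameters kept frozen
trainable_share = trainable_total / total_m   # derived: fraction that is trained

print(f"trainable: {trainable_total:.1f}M, frozen: {frozen_total:.1f}M, "
      f"share: {trainable_share:.0%}")
```

Under these figures, the three learned components account for a little under half of the 304M total.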

The CLIP2SAM module is co-trained using a diverse mixture of datasets, encompassing detection, segmentation, and classification tasks. Specifically, for segmentation, the training integrates both mask and classification labels, while for detection and classification tasks, the focus is solely on training the classification head. The classification dataset, notable for its extensive range of class categories, coupled with the inclusion of several recent works that feature a large vocabulary, significantly enhances the model’s capabilities. After this extensive co-training process, our model exhibits the remarkable ability to recognize and segment over twenty thousand class categories, all within a single, unified framework.

Appendix B More Experimental Results

Comparative Results on More Detectors. In Tab. 8, we compare the Open-Vocabulary SAM with the original SAM (ViT-H) across various detectors. Our method uses the scores and labels generated by the corresponding detectors and bounding boxes as prompts for mask generation. Notably, with ViTDet (Huge), a strong detector, our Open-Vocabulary SAM achieves comparable or superior segmentation to the original SAM. With less robust detectors, the segmentation ability of Open-Vocabulary SAM decreases slightly. It is noteworthy that the Open-Vocabulary SAM is more efficient, requiring fewer parameters and fewer computational resources than the SAM (ViT-H) model.

More Comparisons with SAM Models. In Tab. 9, we compare our method with several SAM models, focusing on open-vocabulary recognition. We generate masks using ground truth bounding boxes from the COCO dataset and evaluate them using the per-instance IoU metric (1-IoU), as per Semantic-SAM. Our Open-Vocabulary SAM stands out for its ability to recognize a broad spectrum of categories in an open-vocabulary setting, in addition to effective mask generation.

Scaling Up With More Datasets. Our Open-Vocabulary SAM, scaled with large-scale datasets, demonstrates impressive zero-shot classification on the COCO validation set, detailed in Tab. 10. Using ImageNet-21k (I-21k), a dataset with 19,167 categories and image-level supervision, the model achieves a 44.5% accuracy in zero-shot instance-level classification. This underscores the efficacy of training with readily available image-level annotations. In addition, incorporating large-vocabulary detection datasets like LVIS, Objects365, and V3Det, which offer instance-level annotations, significantly enhances classification accuracy. For a more comprehensive understanding of the model’s open-vocabulary classification capabilities, we direct readers to watch our demo video on our project page https://www.mmlab-ntu.com/project/ovsam.

Appendix C Failure Cases, Demo and Future Work

Failure Cases. In Fig. 8, we show two instances where our Open-Vocabulary SAM falls short. The first case, depicted on the left, involves class labels that are subtly distinct and challenging to differentiate, even for humans. The second scenario, shown on the right, features partial occlusions where the differentiation between a bowl and a vase becomes complex. These examples highlight the need for our model to develop a more nuanced understanding of the scene’s details.

Short Introduction To Demo. We include an introduction video and a demo video in addition to our main paper and supplementary file. The former presents a short introduction to better understand our work, while the latter shows the demo tools of our Open-Vocabulary SAM (please also refer to Fig. 7 for illustration), which can segment and recognize various classes on many scenes. Please refer to our project page: https://www.mmlab-ntu.com/project/ovsam for more details.

Future work. While users can efficiently interact with specific objects using point-and-click or box-dragging techniques, future work will explore using coarse masks or language descriptions as interactive prompts. We aim to continue investigating these promising new directions.