SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer

Introduction

The goal of interactive image segmentation is to obtain high-quality pixel-level annotations with limited user interaction such as clicking. Interactive image segmentation approaches have been widely applied to annotate large-scale image datasets, which drive the success of deep models in various applications, including video understanding, self-driving, and medical imaging. Much research has been devoted to exploring interactive image segmentation with different interaction types, such as bounding boxes, polygons, clicks, scribbles, and their combinations. Among them, the click-based approach is the most common due to its simplicity and well-established training and evaluation protocols.

Recent advances in click-based approaches mainly lie in two orthogonal directions: 1) the development of more effective backbone networks and 2) the exploration of more elaborate refinement modules built upon the backbone. For the former direction, different hierarchical backbones, including both ConvNets and ViTs, have been developed for interactive segmentation. For the latter direction, various refinement modules, including local refinement and click imitation, have been proposed to further boost segmentation performance. In this work, we delve into the former direction and focus on exploring a plain backbone for interactive segmentation.

A hierarchical backbone is the predominant architecture for current interactive segmentation methods. This design is deeply rooted in ConvNets, represented by ResNet, and has been adopted by ViTs, represented by the Swin Transformer. The motivation for a hierarchical backbone stems from the locality of convolution operations, which leads to an insufficient receptive field size without the hierarchy. To increase the receptive field size, ConvNets have to progressively downsample feature maps to capture more global contextual information. Therefore, they often require a feature pyramid network such as FPN to aggregate multi-scale representations for high-quality segmentation. However, this reasoning no longer applies to a plain ViT, in which global information can be captured from the first self-attention block. Because all feature maps in the ViT are of the same resolution, the motivation for an FPN-like feature pyramid also disappears. This reasoning is supported by a recent finding that a plain ViT can serve as a strong backbone for object detection. This finding indicates that a general-purpose ViT backbone might be suitable for other tasks as well, which would decouple pretraining from finetuning and transfer the benefits of readily available pretrained ViT models (e.g. MAE) to these tasks. However, although this design is simple and has been proven effective, it has not yet been explored in interactive segmentation. In this work, we propose SimpleClick, the first plain-backbone method for interactive segmentation. The core of SimpleClick is a plain ViT backbone that maintains single-scale representations throughout. We only use the last feature map from the plain backbone to build a simple feature pyramid for segmentation, largely decoupling the general-purpose backbone from the segmentation-specific modules. To make SimpleClick more efficient, we use a light-weight MLP decoder to transform the simple feature pyramid into a segmentation (see Sec. 3 for details).

We extensively evaluate our method on 10 public benchmarks, including both natural and medical images. With the plain backbone pretrained as an MAE, our method achieves 4.15 NoC@90 on SBD, which outperforms the previous best method by 21.8% without a complex FPN-like design or local refinement. We demonstrate the generalizability of our method by out-of-domain evaluation on medical images. We further analyze the computational efficiency of SimpleClick, highlighting its suitability as a practical annotation tool.

Our main contributions are as follows:

We propose SimpleClick, the first plain-backbone method for interactive image segmentation.

SimpleClick achieves state-of-the-art performance on natural images and shows strong generalizability on medical images.

SimpleClick meets the computational efficiency requirement for a practical annotation tool, highlighting its readiness for real-world applications.

Related Work

Interactive Image Segmentation Interactive image segmentation is a longstanding problem for which increasingly better solution approaches have been proposed. Early works tackle this problem using graphs defined over image pixels. However, these methods only focus on low-level image features, and therefore tend to have difficulty with complex objects.

Thriving on large datasets, ConvNets have evolved as the dominant architecture for high-quality interactive segmentation. ConvNet-based methods have explored various interaction types, such as bounding boxes, polygons, clicks, and scribbles. Click-based approaches are the most common due to their simplicity and well-established training and evaluation protocols. Xu et al. first proposed a click simulation strategy that has been adopted by follow-up work. DEXTR extracts a target object by specifying its four extreme points (left-most, right-most, top, and bottom pixels). FCA-Net demonstrates the critical role of the first click for better segmentation. Recently, ViTs have been applied to interactive segmentation. FocalClick uses SegFormer as the backbone network and achieves state-of-the-art segmentation results with high computational efficiency. iSegFormer uses a Swin Transformer as the backbone network for interactive segmentation on medical images. Beyond backbone design, some works explore elaborate refinement modules built upon the backbone. FocalClick and FocusCut propose similar local refinement modules for high-quality segmentation. PseudoClick proposes a click-imitation mechanism that estimates the next click to further reduce human annotation cost. Our method differs from all previous click-based methods in its plain, non-hierarchical ViT backbone, which benefits from readily available pretrained ViT models (e.g. MAE).

Vision Transformers for Non-Interactive Segmentation Recently, ViT-based approaches have shown competitive performance on segmentation tasks compared to ConvNets. The original ViT is a non-hierarchical architecture that only maintains single-scale feature maps throughout. SETR and Segmenter use the original ViT as the encoder for semantic segmentation. To allow for more efficient segmentation, the Swin Transformer reintroduces a computational hierarchy into the original ViT architecture using shifted window attention, leading to a highly efficient hierarchical ViT backbone. SegFormer designs hierarchical feature representations based on the original ViT using overlapped patch merging, combined with a light-weight MLP decoder for efficient segmentation. HRViT integrates a high-resolution multi-branch architecture with ViTs to learn multi-scale representations. Recently, the original ViT has been reintroduced as a competitive backbone for semantic segmentation and object detection , with the aid of MAE pretraining and window attention. Inspired by this finding, we explore using a plain ViT as the backbone network for interactive segmentation.

Method

Our goal is not to propose new modules, but to adapt a plain-ViT backbone for interactive segmentation with minimal modifications so as to enjoy the readily available pretrained ViT weights. Sec. 3.1 introduces the main modules of SimpleClick. Sec. 3.2 describes the training and inference details of our method.

3.1 Network Architecture

Adaptation of Plain-ViT Backbone We use a plain ViT as our backbone network, which only maintains single-scale feature maps throughout. The patch embedding layer divides the input image into non-overlapping fixed-size patches (e.g. 16 $\times$ 16 for ViT-B); each patch is flattened and linearly projected to a fixed-length vector (e.g. 768 for ViT-B). The resulting sequence of vectors is fed into a stack of Transformer blocks (e.g. 12 for ViT-B) for self-attention. We implement SimpleClick with three backbones: ViT-B, ViT-L, and ViT-H (Tab. 1 shows the number of parameters for the three backbones). The three backbones were pretrained on ImageNet-1k as MAEs. We adapt the pretrained backbones to higher-resolution inputs during finetuning using non-shifting window attention aided by a few global self-attention blocks (e.g. 2 for ViT-B), as introduced in ViTDet. Since the last feature map has passed through all the attention blocks, it should have the strongest representation. Therefore, we only use the last feature map to build a simple multi-scale feature pyramid.
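
To make the patch arithmetic concrete, the following minimal sketch computes the token grid produced by non-overlapping patches. The helper names `vit_token_grid` and `patch_vector_length` are hypothetical and only illustrate the bookkeeping; they are not part of any released code.

```python
from typing import Tuple


def vit_token_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image that is
    split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


def patch_vector_length(patch_size: int, channels: int = 3) -> int:
    """Length of one flattened patch before the linear projection."""
    return patch_size * patch_size * channels


print(vit_token_grid(224, 16))   # (14, 196)  pretraining resolution
print(vit_token_grid(448, 16))   # (28, 784)  finetuning resolution
print(patch_vector_length(16))   # 768
```

Doubling the input side from 224 to 448 quadruples the number of tokens, which is why the window attention described above becomes useful at the higher resolution.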

Simple Feature Pyramid For the hierarchical backbone, a feature pyramid is commonly produced by an FPN to combine features from different stages. For the plain backbone, a feature pyramid can be generated in a much simpler way: by a set of parallel convolutional or deconvolutional layers using only the last feature map of the backbone. As shown in Fig. 2, given the input ViT feature map, a multi-scale feature map can be produced by four convolutions with different strides. Though the effectiveness of this simple feature pyramid design is first demonstrated in ViTDet for object detection, we show in this work the effectiveness of this simple feature pyramid design for interactive segmentation. We also propose several additional variants (Fig. 6) as part of an ablation study (Sec. 4.4).

All-MLP Segmentation Head We implement a lightweight segmentation head using only MLP layers. It takes in the simple feature pyramid and produces a segmentation probability map of scale $1/4$, followed by an upsampling operation to recover the original resolution. (This probability map may be miscalibrated and can be improved by calibration approaches.) Note that this segmentation head avoids computationally demanding components and only accounts for up to 1% of the model parameters (Tab. 1). The key insight is that with a powerful pretrained backbone, a lightweight segmentation head is sufficient for interactive segmentation. The proposed all-MLP segmentation head works in three steps. First, each feature map from the simple feature pyramid goes through an MLP layer to transform it to an identical channel dimension (i.e. $C_{2}$ in Fig. 2). Second, all feature maps are upsampled to the same resolution (i.e. $1/4$ in Fig. 2) for concatenation. Third, the concatenated features are fused by another MLP layer to produce a single-channel feature map, followed by a sigmoid function to obtain a segmentation probability map, which is then transformed to a binary segmentation given a predefined threshold (i.e. 0.5).
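
The three steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the actual implementation: each "MLP layer" is modeled as a single per-pixel linear layer, upsampling is nearest-neighbour for brevity, the weights are random, and the values of $C_{1}$ and $C_{2}$ are arbitrary. The final upsampling from the $1/4$ scale to the input resolution is omitted.

```python
import numpy as np


def mlp(x, weight, bias):
    """Per-pixel linear layer: x is (H, W, C_in), weight is (C_in, C_out)."""
    return x @ weight + bias


def upsample_nearest(x, factor):
    """Upsample an (H, W, C) map by an integer factor (nearest neighbour)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def all_mlp_head(pyramid, proj_params, fuse_params, threshold=0.5):
    """pyramid: list of (H_i, H_i, C_i) maps. Returns (probability, mask)."""
    target = max(feat.shape[0] for feat in pyramid)
    unified = []
    for feat, (weight, bias) in zip(pyramid, proj_params):
        feat = mlp(feat, weight, bias)                          # step 1: project to C2
        feat = upsample_nearest(feat, target // feat.shape[0])  # step 2: common resolution
        unified.append(feat)
    fused = np.concatenate(unified, axis=-1)                    # (H, H, 4 * C2)
    logits = mlp(fused, *fuse_params)[..., 0]                   # step 3: single channel
    prob = 1.0 / (1.0 + np.exp(-logits))                        # sigmoid
    return prob, prob > threshold


# Toy example; C1 = 8 and C2 = 16 are illustrative values only.
rng = np.random.default_rng(0)
c1, c2 = 8, 16
sides = [4, 8, 16, 32]                       # 1/32, 1/16, 1/8, 1/4 of a 128-pixel input
chans = [8 * c1, 4 * c1, 2 * c1, c1]
pyramid = [rng.normal(size=(s, s, c)) for s, c in zip(sides, chans)]
proj_params = [(rng.normal(size=(c, c2)) * 0.1, np.zeros(c2)) for c in chans]
fuse_params = (rng.normal(size=(4 * c2, 1)) * 0.1, np.zeros(1))

prob, mask = all_mlp_head(pyramid, proj_params, fuse_params)
print(prob.shape, mask.dtype)                # (32, 32) bool
```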

Symmetric Patch Embedding and Beyond To fuse human clicks into the plain backbone, we introduce a patch embedding layer that is symmetric to the patch embedding layer in the backbone, followed by element-wise feature addition. The user clicks are encoded in a two-channel disk map, one for positive clicks and the other for negative clicks. The positive clicks should be placed on the foreground, while the negative clicks should be placed on the background. The previous segmentation and the two-channel click map are concatenated as a three-channel map for patch embedding. The two symmetric embedding layers operate on the image and the concatenated three-channel map, respectively. The inputs are patchified, flattened, and projected to two vector sequences of the same dimension, followed by element-wise addition before inputting into the self-attention blocks.
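
A minimal NumPy sketch of this fusion is given below. It assumes a toy 64 $\times$ 64 input, random projection weights, and no bias terms; the helper `patchify` and all variable names are hypothetical. The point is only that both inputs yield token sequences of identical shape, so they can be added element-wise.

```python
import numpy as np


def patchify(x, patch):
    """Split an (H, W, C) array into flattened non-overlapping patches.
    Returns an array of shape (num_patches, patch * patch * C)."""
    h, w, c = x.shape
    x = x.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)


rng = np.random.default_rng(0)
patch, dim = 16, 768                         # ViT-B values from the text

image = rng.random((64, 64, 3))              # RGB image
prev_mask = np.zeros((64, 64, 1))            # previous segmentation
clicks = np.zeros((64, 64, 2))               # positive / negative click channels
clicks[20, 30, 0] = 1.0                      # one positive click (single pixel here)
aux = np.concatenate([prev_mask, clicks], axis=-1)   # three-channel map

# Two embedding layers with the same structure but separate weights.
w_img = rng.normal(size=(patch * patch * 3, dim)) * 0.02
w_aux = rng.normal(size=(patch * patch * 3, dim)) * 0.02

tokens = patchify(image, patch) @ w_img + patchify(aux, patch) @ w_aux
print(tokens.shape)                          # (16, 768)
```

Because the sum has the same shape as the image tokens alone, the self-attention blocks see the same input dimensions as the pretrained backbone did.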

3.2 Training and Inference Settings

Backbone Pretraining Our backbone models are pretrained as MAEs on ImageNet-1K . In MAE pretraining, the ViT models reconstruct the randomly masked pixels of images while learning a universal representation. This simple self-supervised approach turns out to be an efficient and scalable way to pretrain ViT models . In this work, we do not perform pretraining ourselves. Instead, we simply use the readily available pretrained MAE weights from .

End-to-end Finetuning With the pretrained backbone, we finetune our model end-to-end on the interactive segmentation task. The finetuning pipeline can be briefly described as follows. First, we automatically simulate clicks based on the current segmentation and gold standard segmentation, without a human-in-the-loop providing the clicks. Specifically, we use a combination of random and iterative click simulation strategies, inspired by RITM . The random click simulation strategy generates clicks in parallel, without considering the order of the clicks. The iterative click simulation strategy generates clicks iteratively, where the next click should be placed on the erroneous region of a prediction that was obtained using the previous clicks. This strategy is more similar to human clicking behavior. Second, we incorporate the segmentation from the previous interaction as an additional input for the backbone, further improving the segmentation quality. This also allows our method to refine from an existing segmentation, which is a desired feature for a practical annotation tool. We use the normalized focal loss (NFL) to train all our models. Previous works show that NFL converges faster and achieves better performance than the widely used binary cross entropy loss for interactive segmentation tasks. Similar training pipelines have been proposed by RITM and its follow-up works .
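
The iterative strategy can be illustrated with the simplified sketch below. It is an assumption-laden stand-in rather than the RITM procedure: it treats all false negatives (or all false positives) as a single error region instead of finding connected components, and it places the click on the error pixel closest to that region's centroid. The function name `next_click` is hypothetical.

```python
import numpy as np


def next_click(pred, gt):
    """Pick the next simulated click from boolean masks of shape (H, W).

    Returns (row, col, is_positive), or None if the prediction is perfect.
    Simplification: the larger of the false-negative and false-positive
    sets is used as the error region, without connected-component analysis.
    """
    false_neg = gt & ~pred            # missed foreground -> positive click
    false_pos = pred & ~gt            # leaked background -> negative click
    if not false_neg.any() and not false_pos.any():
        return None
    is_positive = bool(false_neg.sum() >= false_pos.sum())
    error = false_neg if is_positive else false_pos
    rows, cols = np.nonzero(error)
    dist = (rows - rows.mean()) ** 2 + (cols - cols.mean()) ** 2
    k = int(np.argmin(dist))          # error pixel nearest to the centroid
    return int(rows[k]), int(cols[k]), is_positive


gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
empty_pred = np.zeros_like(gt)
print(next_click(empty_pred, gt))     # (3, 3, True)
```

During training, such a function would be called repeatedly, feeding each new click and the previous prediction back into the model.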

Inference There are two inference modes: automatic evaluation and human evaluation. For automatic evaluation, clicks are automatically simulated based on the current segmentation and gold standard. For human evaluation, a human-in-the-loop provides all clicks based on their subjective evaluation of current segmentation results. We use automatic evaluation for quantitative analyses and human evaluation for a qualitative assessment of the interactive segmentation behavior.

Experiments

Datasets We conducted experiments on 10 public datasets including 7 natural image datasets and 3 medical datasets. The details are as follows:

GrabCut : 50 images (50 instances), each with clear foreground and background differences.

Berkeley : 96 images (100 instances); this dataset shares a small portion of images with GrabCut.

DAVIS : 50 videos; we only use the same 345 frames as used in for evaluation.

Pascal VOC : 1449 images (3427 instances) in the validation set. We only test on the validation set.

SBD : 8498 training images (20172 instances) and 2857 validation images (6671 instances). Following previous works , we train our model on the training set and evaluate on the validation set.

COCO +LVIS (C+L): COCO contains 118K training images (1.2M instances); LVIS shares the same images with COCO but has much higher segmentation quality. We combine the two datasets for training.

ssTEM : two image stacks, each containing 20 medical images. We use the same stack that was used in .

BraTS : 369 magnetic resonance image (MRI) volumes; we test on the same 369 slices used in .

OAIZIB : 507 MRI volumes; we test on the same 150 slices (300 instances) as used in .

Evaluation Metrics Following previous works, we automatically simulate user clicks by comparing the current segmentation with the gold standard. In this simulation, the next click is placed at the center of the region with the largest error. We use the Number of Clicks (NoC) as the evaluation metric, which counts the number of clicks required to achieve a target Intersection over Union (IoU). We set two target IoUs, 85% and 90%, denoted NoC@85 and NoC@90, respectively. The maximum number of clicks for each instance is set to 20. We also use the average IoU given $k$ clicks (mIoU@$k$) as an evaluation metric to measure the segmentation quality given a fixed number of clicks.
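
As a small worked example of these metrics, the sketch below computes NoC and mIoU@$k$ from per-instance IoU traces, where entry $k-1$ of a trace is the IoU after $k$ clicks. It assumes the common convention that an instance which never reaches the target is counted with the maximum of 20 clicks; the function names and the toy traces are illustrative only.

```python
def number_of_clicks(ious, target, max_clicks=20):
    """Clicks needed to reach `target` IoU; capped at `max_clicks`
    when the target is never reached (assumed convention)."""
    for k, iou in enumerate(ious[:max_clicks], start=1):
        if iou >= target:
            return k
    return max_clicks


def mean_noc(traces, target, max_clicks=20):
    """Average NoC over all instances."""
    return sum(number_of_clicks(t, target, max_clicks) for t in traces) / len(traces)


def mean_iou_at_k(traces, k):
    """Average IoU over all instances after exactly k clicks."""
    return sum(t[k - 1] for t in traces) / len(traces)


# Two made-up instances: one easy, one that saturates at 0.80 IoU.
traces = [
    [0.70, 0.88, 0.93],
    [0.50, 0.60] + [0.80] * 18,
]
print(mean_noc(traces, 0.85))   # 11.0  -> (2 + 20) / 2
print(mean_noc(traces, 0.90))   # 11.5  -> (3 + 20) / 2
```

The example shows how a single failure case, capped at 20 clicks, can dominate the average NoC.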

Implementation Details We implement our models using Python and PyTorch. We implement three models based on three vanilla ViT models (i.e. ViT-B, ViT-L, and ViT-H). These backbone models are initialized with the MAE pretrained weights and are then finetuned end-to-end with the other modules. We train our models on either SBD or COCO+LVIS for 55 epochs; the initial learning rate is set to $5\times 10^{-5}$ and decreases to $5\times 10^{-6}$ after epoch 50. We set the batch size to 140 for ViT-Base, 72 for ViT-Large, and 32 for ViT-Huge to fit the models into GPU memory. All our models are trained on four NVIDIA RTX A6000 GPUs. We use the following data augmentation techniques: random resizing (scale range from 0.75 to 1.25), random flipping and rotation, random brightness and contrast, and random cropping. Though the ViT backbone was pretrained on images of size 224 $\times$ 224, we finetune on $448\times 448$ with non-shifting window attention for better performance. We optimize using Adam with $\beta_{1}=0.9$, $\beta_{2}=0.999$.

4.1 Comparison with Previous Results

We show in Tab. 2 the comparisons with previous state-of-the-art results. Our models achieve the best performance on all five benchmarks. Remarkably, when trained on the SBD training set, our ViT-H model achieves 4.15 NoC@90 on the SBD validation set, outperforming the previous best score by 21.8%. Since the SBD validation set contains the largest number of instances (6671 instances) among the five benchmarks, this improvement is convincing. When trained on COCO+LVIS, our models also achieve state-of-the-art performance on all benchmarks. Fig. 7 shows several segmentation cases on DAVIS, including the worst case. Note that the DAVIS dataset requires high-quality segmentations because all its instances have a high-quality gold standard. Our models still achieve the state-of-the-art on DAVIS without using specific modules, such as a local refinement module, which is beneficial for high-quality segmentation. Fig. 3 shows that our method converges better than other methods with sufficient clicks, leading to fewer failure cases as shown in Fig. 4. We only report results on SBD and Pascal VOC, the two largest datasets.

4.2 Out-of-Domain Evaluation on Medical Images

We further evaluate the generalizability of our models on three medical image datasets: ssTEM , BraTS , and OAIZIB . Tab. 3 reports the evaluation results on these three datasets. Fig. 5 shows the convergence analysis on BraTS and OAIZIB. Overall, our models generalize well to medical images. We also find that the models trained on larger datasets (i.e. C+L) generalize better than the models trained on smaller datasets (i.e. SBD).

4.3 Towards Practical Annotation Tool

Tiny Backbone To allow for practical applications, especially on low-end devices with limited computational resources, we implement an extremely tiny backbone (i.e. ViT-xTiny) for SimpleClick. Compared with ViT-Base, ViT-xTiny decreases the embedding dimension from 768 to 160 and the number of attention blocks from 12 to 8. We end up with a SimpleClick-xTiny model, which is comparable with the tiny FocalClick models in terms of parameters. Comparison results in Tab. 4 show that our model outperforms FocalClick models, even though it is trained from scratch due to the lack of readily available pretrained weights.

Computational Analysis Tab. 5 shows a comparison of computational requirements with respect to model parameters, FLOPs, GPU memory consumption, and speed; the speed is measured in seconds per click (SPC). Fig. 1 shows the interactive segmentation performance of methods in terms of FLOPs. In Fig. 1 and Tab. 5, each method is denoted by its backbone. For a fair comparison, we evaluate all the methods on the same benchmark (i.e. GrabCut) and on the same computer (GPU: NVIDIA RTX A6000, CPU: Intel Silver $\times$ 2). We only calculate the FLOPs of a single forward pass. For methods like FocusCut, which require multiple forward passes for each click, the FLOPs may be much higher than reported. By default, our method takes images of size 448 $\times$ 448 as the fixed input. Even for our ViT-H model, the speed (132 ms) and memory consumption (3.22 GB) are sufficient to meet the requirements of a practical annotation tool.

4.4 Ablation Study

In this section, we ablate the backbone finetuning and feature pyramid design. Tab. 6 shows the ablation results. By default, we finetune the backbone along with the other modules. As an ablation, we freeze the backbone during finetuning, which leads to significantly worse performance. This result is expected, considering that the ViT backbone accounts for most of the model parameters (Tab. 1). For the second ablation, we compare the default simple feature pyramid design with three variants depicted in Fig. 6 (i.e. (b), (c), and (d)). First, we observe that the multi-scale representation matters for the feature pyramid: when the multi-scale property is removed from the simple feature pyramid, the performance drops considerably. We also notice that the last feature map from the backbone is strong enough to build the feature pyramid. The parallel feature pyramid generated from multi-stage feature maps of the backbone does not surpass the simple feature pyramid that only uses the last feature map of the backbone.

Limitations and Remarks

Our best-performing model (ViT-H) is much larger than existing models, leading to concerns about an unfair comparison. We justify the effectiveness of SimpleClick by developing a tiny model and comparing it fairly with other methods. Other than this, our models may fail in some challenging scenarios, such as objects with very thin and elongated shapes or cluttered occlusions ((a) and (b) in Fig. 7). We leave these improvements for future work.

We are entering an era of large-scale pretraining on multimodal foundation models, which is dramatically transforming the landscape of vision and language tasks. In this context, we hope SimpleClick will serve as a strong baseline for a new wave of high-performing interactive segmentation methods based on ViTs and large-scale pretraining.

Conclusions

We proposed SimpleClick, the first plain-backbone method for interactive image segmentation. Our method leveraged a general-purpose ViT backbone that can benefit from readily available pretrained ViT models. With the MAE-pretrained weights, SimpleClick achieved state-of-the-art performance on natural images and demonstrated strong generalizability on medical images. We also developed a tiny SimpleClick model and provided a detailed computational analysis, highlighting the suitability of SimpleClick as a practical annotation tool.

Appendix A Datasets

This section supplements the “Datasets” (Sec. 4) in the main paper. Our models are trained either using SBD or the combined COCO +LVIS datasets. Before RITM , most of the deep learning-based interactive segmentation models were trained either using the SBD or Pascal VOC datasets. These two datasets only cover 20 categories of general objects such as persons, transportation vehicles, animals, and indoor objects. The authors of RITM constructed the combined COCO+LVIS dataset, which contains 118k training images of 80 diverse object classes, for interactive segmentation. This large and diverse training dataset contributes to the state-of-the-art performance of RITM models. Inspired by RITM and its follow-up works , we use SBD and COCO+LVIS as our training datasets.

Appendix B Implementation Details

This section supplements Sec. 3.1 “Network Architecture” in the main paper. Tab. 7 shows the main architecture parameters of our models. By default, our models use an input size of $448\times 448$ during training and evaluation. Our ViT-B and ViT-L models use a patch size of $16\times 16$, while the ViT-H model uses a smaller patch size of $14\times 14$. This leads to a higher-resolution representation in terms of the number of patches. Each patch is flattened and projected to an embedding dimension of $C_{0}$ through the patch embedding layer. The tokens generated by the patch embedding layer are processed by $N$ self-attention blocks, where $N$ is a hyper-parameter inherited from plain ViT models. Inspired by ViTDet, we build a simple feature pyramid with the four resolutions { $\frac{1}{32},\frac{1}{16},\frac{1}{8},\frac{1}{4}$ }. The $\frac{1}{16}$ resolution uses the last feature map of the ViT backbone. The $\frac{1}{32}$ resolution is built by a $2\times 2$ convolutional layer with a stride of 2. The $\frac{1}{8}$ (or $\frac{1}{4}$) resolution is built by one (or two) $2\times 2$ transposed convolution layer(s) with a stride of 2. We use a $1\times 1$ convolution layer with layer normalization to convert the channels of each feature map to predefined dimensions. Specifically, feature maps of resolutions { $\frac{1}{32},\frac{1}{16},\frac{1}{8},\frac{1}{4}$ } are converted to channel dimensions of { $8C_{1},4C_{1},2C_{1},C_{1}$ }, respectively. Each feature map is then converted to the same dimension of $C_{2}$ through an MLP layer in the segmentation head, followed by upsampling to the $\frac{1}{4}$ resolution. At this point, the four feature maps have the same resolution and the same number of channels. They are concatenated as a single feature map with $4C_{2}$ channels. Another MLP layer in the segmentation head converts this multi-channel feature map to a one-channel feature map, followed by a sigmoid function and thresholding to obtain the final binary segmentation. We use $C_{1}$ and $C_{2}$ as hyper-parameters without tuning.
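
The shape bookkeeping of the simple feature pyramid can be checked with a few lines of arithmetic. In the sketch below, the function name and the value $C_{1}=128$ are illustrative assumptions; the strides and channel multipliers follow the description above.

```python
def simple_feature_pyramid_shapes(image_size, patch_size, c1):
    """Return {scale: (spatial side, channels)} for the four pyramid levels."""
    side = image_size // patch_size          # last ViT feature map ("1/16" level)
    return {
        "1/32": (side // 2, 8 * c1),         # 2x2 convolution, stride 2
        "1/16": (side, 4 * c1),              # last feature map, 1x1 convolution only
        "1/8": (side * 2, 2 * c1),           # one 2x2 transposed convolution, stride 2
        "1/4": (side * 4, c1),               # two 2x2 transposed convolutions, stride 2
    }


# ViT-B / ViT-L: 448 input, 16x16 patches; C1 = 128 is an illustrative value.
for scale, shape in simple_feature_pyramid_shapes(448, 16, c1=128).items():
    print(scale, shape)
# 1/32 (14, 1024)
# 1/16 (28, 512)
# 1/8 (56, 256)
# 1/4 (112, 128)
```

With the $14\times 14$ patches of ViT-H, the same input gives a $32\times 32$ token grid, so the scale labels are nominal in that case and each level is slightly larger than for ViT-B and ViT-L.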

B.2 Clicks Encoding

This section also supplements Sec. 3.1 in the main paper. We encode clicks, which are represented by the coordinates in an image, as disks with a small radius of 5 pixels. Positive and negative clicks are encoded separately. In our implementation, we also attach the previous segmentation as an additional channel, resulting in a three-channel disk map. Two patch embedding layers, which are of the same structure, process the three-channel disk map and the RGB image separately. The tokens of the two inputs after the patch embedding layers are added element by element, without changing the input dimensions for the self-attention blocks. This design is more efficient than other designs such as concatenation and allows our ViT backbones to be initialized with pretrained ViT weights.
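
A minimal sketch of the disk encoding is shown below, assuming clicks are given as (row, column) pixel coordinates and that disks are binary. The function name `encode_clicks` is hypothetical.

```python
import numpy as np


def encode_clicks(height, width, positive, negative, radius=5):
    """Encode clicks as a two-channel binary disk map of shape (2, H, W).

    Channel 0 holds positive clicks, channel 1 holds negative clicks.
    Each click is a (row, col) tuple.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    disk_map = np.zeros((2, height, width), dtype=np.float32)
    for channel, clicks in enumerate((positive, negative)):
        for r, c in clicks:
            inside = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
            disk_map[channel][inside] = 1.0
    return disk_map


disk_map = encode_clicks(32, 32, positive=[(10, 10)], negative=[(20, 25)])
prev_mask = np.zeros((1, 32, 32), dtype=np.float32)      # previous segmentation
three_channel = np.concatenate([prev_mask, disk_map], axis=0)
print(three_channel.shape)                               # (3, 32, 32)
```

The resulting three-channel map is what the second patch embedding layer would consume.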

B.3 Finetuning on Higher-Resolution Images

This section supplements Sec. 3.2 “Training and Inference Settings” in the main paper. Our models are pretrained on an image size of $224\times 224$ but are finetuned on an image size of $448\times 448$. We first interpolate the positional encoding to the high resolution. Then, we perform non-overlapping window attention with a few global blocks for cross-window attention. The high-resolution feature map is divided into regular non-overlapping windows. The non-global blocks perform self-attention within each window, while global blocks perform global self-attention. We set the number of global blocks to 2, 6, and 8 for the ViT-B, ViT-L, and ViT-H models, respectively.
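
The window partition and its effect on attention cost can be sketched as follows. The window size of $14\times 14$ tokens is an assumption made for illustration (it matches the $224\times 224$ pretraining grid with $16\times 16$ patches); the text above does not state the window size, and the function names are hypothetical.

```python
import numpy as np


def window_partition(tokens, window):
    """Split an (H, W, C) token map into non-overlapping windows.
    Returns an array of shape (num_windows, window * window, C)."""
    h, w, c = tokens.shape
    x = tokens.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)


def attention_pairs(side, window=None):
    """Number of query-key pairs in one attention block on a side x side grid."""
    if window is None:                        # global self-attention
        return (side * side) ** 2
    num_windows = (side // window) ** 2       # windowed self-attention
    return num_windows * (window * window) ** 2


side = 448 // 16                              # 28 x 28 tokens for ViT-B
tokens = np.zeros((side, side, 768))
windows = window_partition(tokens, window=14)
print(windows.shape)                          # (4, 196, 768)
print(attention_pairs(side))                  # 614656
print(attention_pairs(side, window=14))       # 153664
```

Under this assumption a windowed block evaluates a quarter of the query-key pairs of a global block, while the few global blocks restore information flow across windows.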

Appendix C Additional Comparison Results

This section supplements Sec. 4.1 “Comparison with Previous Results” in the main paper. Fig. 9 shows convergence results for our models on four datasets: GrabCut , Berkeley , DAVIS , and COCO . Overall, our models perform better than other models on these datasets. However, the results in Fig. 9 are not as compelling as the results on SBD or Pascal VOC (shown in Fig. 3 of the main paper). This is likely due to the limited number of images in these datasets (e.g. GrabCut only contains 50 instances, while SBD contains 6671 instances for evaluation).

Appendix D Human Evaluation on Medical Images

This section supplements Sec. 4.2 “Out-of-Domain Evaluation on Medical Images” in the main paper. In the main paper, we report quantitative results on medical images using an automatic evaluation mode where clicks are automatically simulated. In this section, we perform human evaluations where a human-in-the-loop provides all the clicks. Fig. 8 shows qualitative results on three medical image datasets: ssTEM, OAIZIB, and BraTS. For simple objects such as cell nuclei in ssTEM, it may take as little as one click to obtain a good segmentation. However, for more challenging objects such as knee cartilage in the OAIZIB dataset or brain tumors in the BraTS dataset, it may take more than ten clicks until a high-quality segmentation is obtained. Considering that our models are not finetuned on the label-scarce medical imaging datasets, the observed performance is quite promising. The attached videos demonstrate the evaluation process.