Transformer-Based Visual Segmentation: A Survey

Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy

cs.CV

Introduction

Visual segmentation aims to group the pixels of a given image or video into a set of semantic regions. It is a fundamental problem in computer vision with numerous real-world applications, such as robotics, automated surveillance, image/video editing, social media, and autonomous driving. Starting from hand-crafted features and classical machine learning models, segmentation problems have attracted a great deal of research effort. During the last ten years, deep neural networks, in particular Convolutional Neural Networks (CNNs) such as Fully Convolutional Networks (FCNs), have achieved remarkable success on different segmentation tasks and led to much better results. Compared to traditional segmentation approaches, CNN-based approaches have better generalization ability. Because of their exceptional performance, CNNs and the FCN architecture have been the basic components of segmentation research.

Recently, the transformer was introduced in natural language processing (NLP) as a replacement for recurrent neural networks. The transformer contains a novel self-attention design and can process various tokens in parallel. Based on the transformer design, BERT and GPT-3 scale up the model parameters and pre-train on huge amounts of unlabeled text. They achieve strong performance on many NLP tasks, accelerating the adoption of transformers in the vision community. Researchers have since applied transformers to computer vision (CV) tasks. Early methods use self-attention layers to augment CNNs. Meanwhile, several works use pure self-attention layers to replace convolution layers. After that, two remarkable methods advanced CV tasks. One is the vision transformer (ViT), a pure transformer that directly takes sequences of image patches to classify the full image. It achieves state-of-the-art performance on multiple image recognition datasets. The other is the detection transformer (DETR), which introduces the concept of the object query. Each object query represents one instance. The object query replaces the complex anchor design of previous detection frameworks, which simplifies the pipeline of detection and segmentation. Subsequent works adopt improved designs on various vision tasks, including representation learning, object detection, segmentation, low-level image processing, video understanding, 3D scene understanding, and image/video generation.

As for visual segmentation, recent state-of-the-art methods are all based on the transformer architecture. Compared with CNN approaches, most transformer-based approaches have simpler pipelines but stronger performance. Because of the rapid upsurge in transformer-based vision models, there are several surveys on vision transformers. However, most of them focus on general transformer design and its application to several specific vision tasks. Meanwhile, there are previous surveys on deep-learning-based segmentation. However, to the best of our knowledge, no survey focuses on using vision transformers for visual segmentation or query-based object detection. We believe it would be beneficial for the community to summarize these works and keep tracking this evolving field.

$\bullet$ Contribution. In this survey, we systematically introduce recent advances in transformer-based visual segmentation methods. We start by defining the task, datasets, and CNN-based approaches and then move on to transformer-based approaches, covering existing methods and future work directions. Our survey groups existing representative works from a more technical perspective of the method details. In particular, for the main review part, we first summarize the core framework of existing approaches into a meta-architecture in Sec. 3.1, which is an extension of DETR. By changing the components of the meta-architecture, we divide existing approaches into five categories in Sec. 3.2: Representation Learning, Interaction Design in Decoder, Optimizing Object Query, Using Query For Association, and Conditional Query Generation.

Moreover, we also survey closely related specific subfields, including point cloud segmentation, tuning foundation models, domain-aware segmentation, data/model efficient segmentation, class agnostic segmentation and tracking, and medical segmentation. We also evaluate the performance of influential works published in top-tier conferences and journals on several widely used segmentation benchmarks. Additionally, we provide an overview of previous CNN-based models and relevant literature in other areas, such as object detection, object tracking, and referring segmentation in the background section.

$\bullet$ Scope. This survey will cover several mainstream segmentation tasks, including semantic segmentation, instance segmentation, panoptic segmentation, and their variants, such as video and point cloud segmentation. Additionally, we cover related subfields in Sec. 4. We focus on transformer-based approaches and only review a few closely related CNN-based approaches for reference. Although there are many preprints or published works, we only include the most representative works.

$\bullet$ Organization. The rest of the survey is organized as follows. Fig. 1 shows the overall pipeline of our survey. We first introduce the background knowledge on problem definition, datasets, and CNN-based approaches in Sec. 2. Then, we review representative papers on transformer-based segmentation methods in Sec. 3 and Sec. 4. We compare the experimental results in Sec. 5. Finally, we discuss future directions in Sec. 6 and conclude the survey in Sec. 7. We provide more benchmarks and details in the appendix.

Background

In this section, we first present a unified problem definition of different segmentation tasks. Then, we detail the common datasets and evaluation metrics. Next, we present a summary of previous approaches before the transformer. Finally, we present a review of basic concepts in transformers. To facilitate understanding of this survey, we list the brief notations in Tab. I for reference.

$\bullet$ Related Problems. Object detection and instance-wise segmentation (IS/VIS/VPS) are closely related tasks. Object detection involves predicting object bounding boxes, which can be considered a coarse form of IS. Since the introduction of the DETR model, many works have treated object detection and IS as the same task, as IS can be achieved by adding a simple mask prediction head to an object detector. Similarly, video object detection (VOD) aims to detect objects in every video frame. In our survey, we also examine query-based object detectors for both object detection and VOD. Point cloud segmentation is another segmentation task, where the goal is to assign each point in a point cloud to one of a set of pre-defined categories. We can apply the same definitions of semantic segmentation, instance segmentation, and panoptic segmentation to this task, resulting in point cloud semantic segmentation (PCSS), point cloud instance segmentation (PCIS), and point cloud panoptic segmentation (PCPS). Referring segmentation aims to segment objects described by a natural language text input. It has two subtasks: referring image segmentation (RIS), which performs language-driven segmentation, and referring video object segmentation (RVOS), which segments and tracks a specific object in a video based on the given text input. Finally, video object segmentation (VOS) involves tracking an object in a video by predicting pixel-wise masks in every frame, given a mask of the object in the first frame.

2.2 Datasets and Metrics

$\bullet$ Commonly Used Datasets. For image segmentation, the most commonly used datasets are COCO, ADE20k, and Cityscapes. For video segmentation, the most commonly used datasets are VSPW and YouTube-VIS. We compare results on several of these datasets in Sec. 5. More datasets are listed in Tab. II.

$\bullet$ Common Metric. For SS and VSS, the commonly used metric is mean intersection over union (mIoU), which calculates the pixel-wise intersection over union between the predicted image or video masks and the ground truth masks. For IS, the metric is mask mean average precision (mAP), which is extended from object detection by replacing box IoU with mask IoU. For VIS, the metric is 3D mAP, which extends mask mAP in a spatial-temporal manner. For PS, the metric is panoptic quality (PQ), which unifies thing and stuff prediction by matching predicted and ground truth segments with a fixed IoU threshold of 0.5. For VPS, the commonly used metrics are video panoptic quality (VPQ) and segmentation and tracking quality (STQ). The former extends PQ to a temporal window, while the latter decouples segmentation and tracking in a per-pixel manner. Note that there are other metrics, including pixel accuracy and temporal consistency. For simplicity, we only report the primary metrics used in the literature. We present the detailed formulation of these metrics in the supplementary material.
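To make the two most frequently reported metrics concrete, the following is a minimal sketch in plain Python. It assumes flat lists of per-pixel class labels for mIoU, and it assumes that segment matching for PQ has already been done, so that only the IoUs of matched pairs and the counts of unmatched segments are passed in. The function names `mean_iou` and `panoptic_quality` are illustrative and not taken from any benchmark toolkit, and real evaluations typically accumulate statistics over a whole dataset rather than per image.

```python
def mean_iou(pred, gt, num_classes):
    """pred, gt: flat lists of integer class labels, one entry per pixel."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


def panoptic_quality(matched_ious, num_fp, num_fn):
    """matched_ious: IoU of every predicted/ground-truth segment pair with IoU > 0.5.

    num_fp / num_fn: numbers of unmatched predicted / ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    print(panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1))
```

Because a pair counts as a match only when its IoU exceeds 0.5, each ground truth segment can be matched by at most one prediction, which is why the threshold is fixed rather than tuned.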

2.3 Segmentation Approaches Before Transformer

$\bullet$ Semantic Segmentation. Prior to the emergence of ViT and DETR, SS was typically approached as a dense pixel classification problem, as initially proposed by FCN. Subsequent works are all based on the FCN framework. These methods can be grouped by the aspect they improve: better encoder-decoder frameworks, larger kernels, multiscale pooling, multiscale feature fusion, non-local modeling, efficient modeling, and better boundary delineation. After the transformer was proposed, several works designed variants of self-attention operators to replace the CNN prediction heads, with the goal of global context modeling.

$\bullet$ Instance Segmentation. IS aims to detect and segment each object, which goes beyond object detection. Most IS approaches focus on how to represent instance masks beyond object detection and can be divided into two categories: top-down approaches and bottom-up approaches. The former extends an object detector with an extra mask head. Mask head designs vary and include FCN heads, diverse mask encodings, and dynamic kernels. The latter performs instance clustering on semantic segmentation maps to form instance masks. The performance of top-down approaches is closely related to the choice of detector, while bottom-up approaches depend on both the semantic segmentation results and the clustering method. Besides, there are also several approaches that use a grid representation to learn instance masks directly. The ideas of using kernels and different mask encodings are also extended to several transformer-based approaches, which will be detailed in Sec. 3.

$\bullet$ Panoptic Segmentation. Previous works on PS mainly focus on how to fuse the results of SS and IS, treating PS as two independent tasks. Based on the IS subtask, previous works can also be divided into two categories, top-down approaches and bottom-up approaches, according to how the instance masks are generated. Several works use a shared backbone with multitask heads to jointly learn IS and SS, focusing on mutual task association. Meanwhile, several bottom-up approaches use a sequential pipeline that performs instance clustering on semantic segmentation results and then fuses both. In summary, most PS methods involve complex pipelines and are highly engineered.

$\bullet$ Video Segmentation. Research on VSS mainly focuses on better spatial-temporal fusion or on acceleration using extra cues in the video. VIS requires segmenting and tracking each instance. Most VIS approaches focus on learning instance-wise spatial and temporal relations and feature fusion. Several works learn 3D temporal embeddings. Like PS, VPS approaches can also be divided into top-down and bottom-up approaches. The top-down approaches learn to link temporal features and then perform instance association online. In contrast, the bottom-up approaches predict the center map of the neighboring frame and perform instance association in a separate stage. Most of these approaches are highly engineered. For example, MaskProp combines state-of-the-art IS models, deformable CNNs, and offline mask propagation in one system. There are also several other video segmentation tasks, including video object segmentation (VOS), referring video segmentation, and multi-object tracking and segmentation (MOTS).

$\bullet$ Point Cloud Segmentation. This task aims to group point clouds into semantic or instance categories, similar to image and video segmentation. Depending on the input scene, it is typically categorized as either indoor or outdoor. Indoor scene segmentation mainly includes point cloud semantic segmentation (PCSS) and point cloud instance segmentation (PCIS). PCSS is commonly achieved using PointNet, while PCIS can be achieved through two approaches: top-down approaches and bottom-up approaches. The former extracts 3D bounding boxes and uses a mask learning branch to predict masks, while the latter predicts semantic labels and utilizes point embeddings to group points into different instances. For outdoor scenes, point cloud segmentation can be divided into point-based and voxel-based approaches. Point-based methods focus on processing individual points, while voxel-based methods divide the point cloud into 3D grids and apply 3D convolution. Like panoptic segmentation, most 3D panoptic segmentation methods first predict semantic segmentation results, separate instances based on these predictions, and fuse the two to obtain the final results.

2.4 Transformer Basics

$\bullet$ Vanilla Transformer is a seminal model in the transformer-based research field. It is an encoder-decoder structure that takes tokenized inputs and consists of stacked transformer blocks. Each block has two sub-layers: a multi-head self-attention (MHSA) layer and a position-wise fully-connected feed-forward network (FFN). The MHSA layer allows the model to attend to different parts of the input sequence while the FFN processes the output of the MHSA layer. Both sub-layers use residual connections and layer normalization for better optimization.

In the vanilla transformer, the encoder and decoder share the same basic architecture. However, the decoder is modified to include a mask that prevents it from attending to future tokens during training. Additionally, sine and cosine functions are used to produce positional embeddings, which allow the model to understand the order of the input sequence. Subsequent models such as BERT and GPT-2 have built upon this architecture and achieved state-of-the-art results on a wide range of natural language processing tasks.

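$\bullet$ Self-Attention (SA). Given the input tokens $X$, the Query, Key, and Value are obtained through three learnable linear projections, which in the standard formulation of the vanilla transformer are written as:

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}, \qquad (1)$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are the projection matrices and $d$ is the hidden dimension. The Query and Key are usually used to generate the attention map in SA. Then the SA is performed as follows:

$$\mathrm{SA}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V. \qquad (2)$$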

According to Eq. 2, given an input $X$, self-attention allows each token $x_{i}$ to attend to all the other tokens. Thus, it has a global receptive field, in contrast to the local CNN operator. Motivated by this, several works treat it as a fully-connected graph or a non-local module for visual recognition tasks.
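The global, all-to-all nature of self-attention is easy to see in code. The following is a minimal single-head sketch in plain Python that assumes the standard scaled dot-product form of the vanilla transformer (scores divided by $\sqrt{d}$ before the softmax). The helper names and the toy projection matrices are illustrative assumptions, and multi-head splitting, masking, and positional embeddings are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """x: N x c tokens; w_q, w_k, w_v: c x d projection matrices."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))  # N x N: every token scores every token
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(attn, v), attn


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, attn = self_attention(tokens, identity, identity, identity)
    print(attn)  # each row is a distribution over all three tokens
```

The attention map has one row per token and one column per token, so its cost grows quadratically with the number of tokens, which is the price paid for the global view.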

$\bullet$ Feed-Forward Network. The goal of the feed-forward network (FFN) is to add non-linearity on top of the attention layer outputs. It is also called a multi-layer perceptron (MLP) since it consists of two successive linear layers with a non-linear activation in between.
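As a small illustration, an FFN of this kind can be sketched in a few lines of plain Python. ReLU is assumed as the activation, the weight shapes are toy values chosen for readability, and the function names are illustrative only. The same weights are applied to every token independently, which is what "position-wise" means.

```python
def linear(x, w, b):
    """x: N x in tokens; w: in x out weights; b: bias of length out."""
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) + bj for col, bj in zip(zip(*w), b)]
        for row in x
    ]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def ffn(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to each token separately."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


if __name__ == "__main__":
    w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]    # expands 2 -> 3 channels
    w2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # projects 3 -> 2 channels
    print(ffn([[1.0, -2.0]], w1, [0.0, 0.0, 0.0], w2, [0.5, -0.5]))
```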

Methods: A Survey

In this section, based on DETR-like meta-architecture, we review the key techniques of transformer-based segmentation. As shown in Fig. 3, the meta-architecture contains a feature extractor, object query, and a transformer decoder. Then, according to the meta-architecture, we survey existing methods by considering the modification or improvements to each component of the meta-architecture in Sec. 3.2.1, Sec. 3.2.2 and Sec. 3.2.3. Finally, based on such meta-architecture, we present several detailed applications in Sec. 3.2.4 and Sec. 3.2.5.

$\bullet$ Neck. The feature pyramid network (FPN) has been shown to be effective for modeling scale variation in object detection and instance segmentation. FPN maps the features from different stages into the same channel dimension $C$ for the decoder. Several works design stronger FPNs via cross-scale modeling using dilation or deformable convolution. For example, Deformable DETR proposes a deformable FPN to model cross-scale fusion using deformable attention. Lite-DETR further refines the deformable cross-scale attention design by efficiently sampling high-level and low-level features in an interleaved manner. The output features are used for decoding the boxes and masks. The role of FPN is the same as in previous detection-based or FCN-based segmentation methods: it generates multi-scale features to handle and balance both small and large objects in the scene. For transformer-based methods, the FPN architecture is often used to refine object queries at different scales, which can lead to stronger results than single-scale refinement.

$\bullet$ Transformer Decoder. The transformer decoder is a crucial architectural component in transformer-based segmentation and detection models. Its main operation is cross-attention, which takes in the object query $Q_{obj}$ and the image/video feature $F$. It outputs a refined object query, denoted as $Q_{out}$. The cross-attention operation is derived from the vanilla transformer architecture, where $Q_{obj}$ serves as the query, and $F$ is used as the key and value in the attention mechanism. After obtaining the refined object query $Q_{out}$, it is passed through a prediction FFN, which typically consists of a 3-layer perceptron with ReLU activations and a linear projection layer. The FFN outputs the final prediction, which depends on the specific task. For example, for classification, the refined query is mapped directly to the class prediction via a linear layer. For detection, the FFN predicts the normalized center coordinates, height, and width of the object bounding box. For segmentation, the output embedding is multiplied with the feature $F$ via a dot product, which yields the binary mask logits. The transformer decoder iteratively repeats the cross-attention and FFN operations to refine the object query and obtain the final prediction. The intermediate predictions are used for auxiliary losses during training and discarded during inference. The outputs from the last stage of the decoder are taken as the final detection or segmentation results. We show the detailed process in Fig. 3 (b).
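The decoding loop described above can be sketched as follows. This is a deliberately simplified illustration in plain Python, not the implementation of any particular model: the learned query/key/value projections, layer normalization, query self-attention, and the prediction FFN are all omitted, and the names `cross_attention` and `decode_masks` are assumptions made for this example. Queries are refined by attending to the flattened feature $F$, and mask logits are obtained from the dot product between each refined query and every pixel feature.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q_obj, feats):
    """q_obj: N x d object queries; feats: HW x d flattened image features."""
    d = len(q_obj[0])
    scores = matmul(q_obj, transpose(feats))  # N x HW: queries score pixels
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    update = matmul(attn, feats)  # N x d: features aggregated per query
    # Residual connection; projections, normalization, and the FFN are omitted.
    return [[q + u for q, u in zip(qr, ur)] for qr, ur in zip(q_obj, update)]


def decode_masks(q_obj, feats, num_layers=3):
    """Refine the queries layer by layer and keep the mask logits of every stage."""
    all_logits = []
    for _ in range(num_layers):
        q_obj = cross_attention(q_obj, feats)
        all_logits.append(matmul(q_obj, transpose(feats)))  # N x HW mask logits
    return q_obj, all_logits


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # 4 pixels, d = 2
    queries = [[1.0, 0.0], [0.0, 1.0]]                          # 2 object queries
    _, logits = decode_masks(queries, feats)
    print(logits[-1])  # last stage is the final prediction
```

In this sketch, `all_logits[:-1]` plays the role of the intermediate predictions that receive auxiliary losses during training, while `all_logits[-1]` is what would be kept at inference.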

$\bullet$ Mask Prediction Representation. Transformer-based segmentation approaches adopt two formats to represent the mask prediction: pixel-wise prediction as in FCNs and per-mask prediction as in DETR. The former is used in semantic-aware segmentation tasks, including SS, VSS, and VOS. The latter is used in instance-aware segmentation tasks, including IS, VIS, and VPS, where each query represents one instance.

$\bullet$ Bipartite Matching and Loss Function. The object query is usually combined with bipartite matching during training, which uniquely assigns predictions to ground truth. This means each object query builds a one-to-one matching during training. Such matching is based on the matching cost between ground truth and predictions. The matching cost is defined as the distance between prediction and ground truth, including labels, boxes, and masks. By minimizing the cost with the Hungarian algorithm, each object query is assigned to its corresponding ground truth. For object detection, each object query is trained with classification and box regression losses. For instance-aware segmentation, each object query is trained with a classification loss and a segmentation loss. The output masks are obtained via the inner product between the object query and the decoder features. The segmentation loss usually contains a binary cross-entropy loss and a dice loss.
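A toy version of the matching step is sketched below in plain Python. To stay dependency-free it replaces the Hungarian algorithm with an exhaustive search over assignments, which finds the same minimum-cost one-to-one matching but is only practical for a handful of queries. The cost used here (negative class probability plus a dice term, with smoothing constant `eps=1.0`) is an assumed, simplified combination; actual methods weight several terms and may add box or cross-entropy costs. All function names are illustrative.

```python
from itertools import permutations


def dice_loss(pred, target, eps=1.0):
    """pred: per-pixel probabilities; target: binary mask; both flat lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def matching_cost(pred_probs, pred_masks, gt_labels, gt_masks):
    """cost[i][j]: cost of assigning prediction i to ground-truth object j."""
    return [
        [-probs[label] + dice_loss(mask, gt_mask)
         for label, gt_mask in zip(gt_labels, gt_masks)]
        for probs, mask in zip(pred_probs, pred_masks)
    ]


def match(cost):
    """Minimum-cost one-to-one assignment by brute force (needs N_pred >= N_gt)."""
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):  # perm[j] = prediction for gt j
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return {i: j for j, i in enumerate(best)}  # prediction index -> gt index


if __name__ == "__main__":
    pred_probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]  # 3 queries, 2 classes
    pred_masks = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9], [0.5, 0.5, 0.5, 0.5]]
    gt_labels = [1, 0]
    gt_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
    print(match(matching_cost(pred_probs, pred_masks, gt_labels, gt_masks)))
```

Queries that remain unmatched (the third one in this example) are typically supervised toward a "no object" class, as in DETR, so that every query still receives a training signal.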

$\bullet$ Discussion on Scope of Meta-Architecture. We acknowledge that our meta-architecture may not cover all transformer-based segmentation methods. In semantic segmentation, methods such as Segformer and SETR employ a fully connected layer and predict a label for each pixel, as in previous FCN-based methods. These methods concentrate on enhanced feature representation. We believe that this represents a basic form of our meta-architecture, wherein each query corresponds to a class category. The cascaded cross-attention layers are omitted and bipartite matching is also removed. Thus, the object query plays the same role as a fully connected layer.

3.2 Method Categorization

In this section, we review five aspects of transformer-based segmentation methods. Rather than classifying the literature by the task settings, our goal is to extract the essential and common techniques used in the literature. We summarize the methods, techniques, related tasks, and corresponding references in Tab. III. Most approaches are based on the meta-architecture described in Sec. 3.1. We list the comparison of representative works in Tab. IV.

Learning a strong feature representation generally leads to better segmentation results. Taking the SS task as an example, SETR is the first to replace the CNN backbone with a ViT backbone. It achieves state-of-the-art results on the ADE20k dataset without bells and whistles. After ViT, researchers started to design better vision transformers. We categorize the related works into three aspects: better vision transformer design, hybrid CNNs/transformers/MLPs, and self-supervised learning.

$\bullet$ Better ViTs Design. Rather than introducing local bias, these works follow the original ViT design and process features using the original MHSA for token mixing. DeiT proposes knowledge distillation and provides strong data augmentation to train ViT efficiently. Starting from DeiT, nearly all ViTs adopt the stronger training procedure. MViT-V1 introduces multiscale feature representation and pooling strategies to reduce the computation cost of MHSA. MViT-V2 further incorporates decomposed relative positional embeddings and a residual pooling design into MViT-V1, which leads to better representation. Motivated by MViT, at the architecture level, MPViT introduces multiscale patch embedding and a multi-path structure to explore tokens of different scales jointly. Meanwhile, at the operator level, XCiT operates across feature channels rather than token inputs and proposes cross-covariance attention, which has linear complexity in the number of tokens. This design makes it easy to adapt to segmentation tasks, which typically have high-resolution inputs. Pyramid ViT is the first work to build multiscale features for detection and segmentation tasks. There are also several works exploring cross-scale modeling via MHSA, which exchange long-range information across different feature pyramid levels.

$\bullet$ Hybrid CNNs/Transformers/MLPs. Rather than modifying ViTs, many works focus on introducing local bias into ViT or using CNNs with large kernels directly. To build a multi-stage pipeline, Swin adopts shifted-window attention in a CNN style. The authors also scale the models up to large sizes and achieve significant improvements on many vision tasks. From an efficiency perspective, Segformer designs a light-weight transformer encoder. It contains a sequence reduction during MHSA and a light-weight MLP decoder. Segformer achieves a better speed and accuracy trade-off for SS. Meanwhile, several works directly add CNN layers to a transformer to explore the local context. Several works explore pure MLP designs to replace the transformer. With specific designs such as shifting and fusion, MLP models can also achieve results comparable to ViTs. Later, several works point out that CNNs can achieve stronger results than ViTs when using the same data augmentation pipeline. In particular, DWNet revisits the training pipeline of ViTs and proposes dynamic depth-wise convolution. Then, ConvNeXt uses larger-kernel depth-wise convolution and a stronger training pipeline. It achieves stronger results than Swin. Motivated by ConvNeXt, SegNext designs a CNN-like backbone with linear self-attention and performs strongly on multiple SS benchmarks. Meanwhile, Meta-Former shows that the meta-architecture of ViT is the key to achieving stronger results. Such a meta-architecture contains a token mixer, an MLP, and residual connections. In ViT, the token mixer is a simple MHSA layer. Meta-Former shows that the token mixer is not as important as the meta-architecture: using simple pooling as the token mixer can still achieve strong results. Following Meta-Former, recent work re-benchmarks several previous works using a unified architecture to eliminate unfair engineering techniques. However, under stronger settings, the authors find that the spatial token mixer design still matters. Meanwhile, several works explore MLP-like architectures for dense prediction.

$\bullet$ Self-Supervised Learning (SSL). SSL has made huge progress in recent years. Compared with supervised learning, SSL exploits unlabeled data via specially designed pretext tasks and can be easily scaled up. MoCo-v3 is the first study that trains ViTs with SSL. It freezes the patch projection layer to stabilize the training process. Motivated by BERT, BEiT proposes BERT-like pre-training (Masked Image Modeling, MIM) of vision transformers. After BEiT, MAE shows that ViTs can be trained with the simplest MIM style. By masking a portion of the input tokens and reconstructing the RGB images, MAE achieves better results than supervised training. As a concurrent work, MaskFeat mainly studies the reconstruction targets of the MIM framework, such as histogram of oriented gradients (HOG) features. Subsequent works focus on improving the MIM framework or replacing the ViT backbone with a CNN architecture. The DINO series finds that self-supervised features themselves exhibit grouping effects, which are often exploited in unsupervised learning contexts (Sec. 4.4). Recently, several works on vision language models (VLMs) also adopt SSL by utilizing easily obtained text-image pairs. Recent work demonstrates the effectiveness of VLMs in downstream tasks, including IS and SS. Moreover, several recent works adopt multi-modal SSL pre-training and design a unified model for many vision tasks. For video representation learning, most current works verify the learned representation on action or motion tasks, such as action recognition. Several works adopt a video backbone for video segmentation. However, for video segmentation, from the method design perspective, most works focus on matching and association of entities or pixels, which is discussed in Sec. 3.2.2 and Sec. 3.2.4.

3.2.2 Cross-Attention Design in Decoder

In this section, we review new transformer decoder designs. We categorize them into two groups: one for improved cross-attention design in image segmentation and the other for spatial-temporal cross-attention design in video segmentation. The former focuses on designing a better decoder to improve on the decoder of the original DETR. The latter extends query-based object detectors and segmenters to the video domain for VOD, VIS, and VPS, focusing on modeling temporal consistency and association.

$\bullet$ Improved Cross-Attention Design. Cross-attention is the core operation of the meta-architecture for segmentation and detection. Current solutions for improved cross-attention mainly focus on designing new or enhanced cross-attention operators and improved decoder architectures. Following DETR, Deformable DETR proposes deformable attention to efficiently sample point features and perform cross-attention with the object query jointly. Meanwhile, several works bring object queries into previous RCNN frameworks. Sparse-RCNN uses RoI-pooled features to refine the object query for object detection. It also proposes a new dynamic convolution and self-attention to enhance the object query without extra cross-attention. In particular, the pooled query features reweight the object query, and then self-attention is applied to the object query to obtain a global view. After that, several works add extra mask heads for IS. QueryInst adds mask heads and refines the mask query with dynamic convolution. Meanwhile, several works extend Deformable DETR by directly applying an MLP on the shared query. Inspired by MEInst, SOLQ utilizes mask encodings on the object query via an MLP. By applying the strong Deformable DETR detector and the Swin transformer backbone, it achieves remarkable results on IS. However, these works still need extra box supervision, which makes the system complex. Moreover, most RoI-based approaches for IS suffer from low mask quality since the mask resolution is limited within the boxes.

To fix the issues of extra box heads, several works remove the box prediction and adopt pure mask-based approaches. An earlier work, OCRNet, characterizes a pixel by exploiting the representation of the corresponding object class, which forms a category query. Then, Segmenter adopts a strong ViT backbone with class queries to directly decode class-wise masks. Pure mask-based approaches directly generate segmentation masks from high-resolution features and naturally have better mask quality. Max-Deeplab is the first to remove the box head and design a pure mask-based segmenter for PS. It also achieves stronger performance than box-based PS methods. It combines a CNN-transformer hybrid encoder and a transformer decoder as an extra path. Max-Deeplab still needs extra auxiliary loss functions, such as a semantic segmentation loss and an instance discrimination loss. K-Net uses mask pooling to group the mask features and designs a gated dynamic convolution to update the corresponding query. By viewing segmentation tasks as convolution with different kernels, K-Net is the first to unify all three image segmentation tasks: SS, IS, and PS. Meanwhile, MaskFormer extends the original DETR by removing the box head and transforming the object query into a mask query via MLPs. It shows that simple mask classification can work well enough for all three segmentation tasks. Compared to MaskFormer, K-Net is more efficient in its use of training data. This is because K-Net adopts mask pooling to localize object features and then updates object queries accordingly. Motivated by this, Mask2Former proposes masked cross-attention, which replaces the cross-attention in MaskFormer. Masked cross-attention makes each object query attend only to the object area, guided by the mask outputs from previous stages. Mask2Former also adopts a stronger Deformable FPN backbone, stronger data augmentation, and multiscale mask decoding. The above works only consider updating object queries. To handle this, CMT-Deeplab proposes an alternating procedure for object queries and decoder features. It jointly updates object queries and pixel features. After that, inspired by the k-means clustering algorithm, kMaX-DeepLab proposes k-means cross-attention by introducing a cluster-wise argmax operation into the cross-attention operation. Meanwhile, PanopticSegformer proposes a decoupling query strategy and a deeply supervised mask decoder to speed up the training process. For the real-time segmentation setting, SparseInst proposes a sparse set of instance activation maps that highlight informative regions for each foreground object.
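The idea behind masked cross-attention, restricting each query to the region predicted by the previous stage, can be sketched as follows in plain Python. This is an illustrative simplification rather than the Mask2Former implementation: learned projections and multi-head attention are omitted, thresholding the previous mask logits at `0.0` is an assumption of this sketch, and so is the fallback to unrestricted attention when the previous mask is empty.

```python
import math


def masked_cross_attention(q_obj, feats, prev_mask_logits, threshold=0.0):
    """q_obj: N x d queries; feats: HW x d features; prev_mask_logits: N x HW."""
    d = len(q_obj[0])
    refined = []
    for q, mask_row in zip(q_obj, prev_mask_logits):
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(d) for f in feats]
        keep = [v > threshold for v in mask_row]  # pixels inside the previous mask
        if not any(keep):
            keep = [True] * len(feats)  # empty mask: attend to every pixel instead
        top = max(s for s, k in zip(scores, keep) if k)
        # Pixels outside the mask get zero weight, i.e. a score of minus infinity.
        weights = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep)]
        total = sum(weights)
        weights = [w / total for w in weights]
        refined.append(
            [sum(w * f[c] for w, f in zip(weights, feats)) for c in range(d)]
        )
    return refined


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    query = [[1.0, 1.0]]
    # The previous stage predicted the first two pixels as foreground.
    print(masked_cross_attention(query, feats, [[2.0, 1.0, -3.0]]))
```

In the usage example, the third pixel has by far the largest dot product with the query, yet it contributes nothing because it lies outside the previous mask. This localization is what lets the query focus on its own object instead of the whole image.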

Besides segmentation tasks, several works speed up the convergence of DETR by introducing new decoder designs, and most of these approaches can be extended to IS. Several works bring semantic priors into the DETR decoder. SAM-DETR projects object queries into a semantic space and searches for salient points with the most discriminative features. SMCA conducts location-aware co-attention by sampling features near the estimated bounding box locations. Several works adopt dynamic feature re-weighting. From the multiscale feature perspective, AdaMixer samples features over space and scales using the estimated offsets. It dynamically decodes the sampled features with an MLP, which yields a fast-converging query-based detector. ACT-DETR clusters the query features adaptively using locality-sensitive hashing and replaces the query-key interaction with a prototype-key interaction to reduce the cross-attention cost. From the feature re-weighting view, Dynamic-DETR introduces dynamic attention to both the encoder and decoder parts of DETR using RoI-wise dynamic convolution. Motivated by the sparsity of the decoder features, Sparse-DETR selectively updates the tokens referenced by the decoder and proposes an auxiliary detection loss on the selected tokens in the encoder to preserve the sparsity. In summary, dynamically assigning features during query learning speeds up the convergence of DETR.

$\bullet$ Spatial-Temporal Cross-Attention Design.

After extending the object query to the video domain, each object query represents a tracked object across different frames, as shown in Fig. 4. The simplest extension is proposed by VisTR for VIS. VisTR extends the cross-attention in DETR to multiple frames by stacking all clip features into flattened spatial-temporal features. The spatial-temporal features also involve temporal embeddings. During inference, one object query can directly output spatial-temporal masks without extra tracking. Meanwhile, TransVOD proposes to link object queries and the corresponding features across the temporal domain. It splits the clips into sub-clips and performs clip-wise object detection. TransVOD utilizes the local temporal information and achieves a better speed-accuracy trade-off. IFC adopts message tokens to exchange temporal context among different frames. The message tokens are similar to learnable queries, which perform cross-attention with the features in each frame and self-attention among the tokens. After that, TeViT proposes a novel messenger shift mechanism for temporal fusion and a shared spatial-temporal query interaction mechanism to utilize both frame-level and instance-level temporal context information. SeqFormer combines Deformable-DETR and VisTR in one framework. It also proposes to use image datasets to augment video segmentation training. Mask2Former-VIS extends the masked cross-attention in Mask2Former into temporal masked cross-attention. Following VisTR, it also directly outputs spatial-temporal masks.

In addition to VIS, several works have shown that query-based methods can naturally unify different segmentation tasks. Following this pipeline, there are also several works solving multiple video segmentation tasks in one framework. In particular, based on K-Net, Video K-Net proposes to unify VPS/VIS/VSS via tracking and linking kernels and works in an online manner. Meanwhile, TubeFormer extends Max-DeepLab into the temporal domain by obtaining mask tubes. Cross-attention is carried out in a clip-wise manner. During inference, the instance association is performed by mask-based matching. Moreover, several works propose a local temporal window to refine the global spatial-temporal cross-attention. For example, VITA aggregates the local temporal queries on top of an off-the-shelf transformer-based image instance segmentation model. Recently, several works have explored cross-clip association for video segmentation. In particular, Tube-Link designs a universal video segmentation framework via learning cross-tube relations. It performs better than task-specific methods in VSS, VIS, and VPS.
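As an illustration of mask-based matching for instance association, the sketch below links masks predicted in the current clip to masks from the previous clip by mask IoU. It is a simplified example under stated assumptions: masks are flat lists of 0/1 over the same pixels (for example, on a frame shared by two overlapping clips), the greedy strategy and the `min_iou` threshold are illustrative choices, and the function names are hypothetical.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def match_by_mask_iou(prev_masks, curr_masks, min_iou=0.5):
    """Greedy association: take the highest-IoU pairs first, use each mask once.

    Returns a dict mapping current-mask index -> previous-mask index.
    Current masks that are missing from the dict start new tracks.
    """
    pairs = sorted(
        ((mask_iou(p, c), i, j)
         for i, p in enumerate(prev_masks)
         for j, c in enumerate(curr_masks)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), {}
    for iou, i, j in pairs:
        if iou < min_iou:
            break  # remaining pairs overlap too little to be the same instance
        if i in used_prev or j in used_curr:
            continue
        matches[j] = i
        used_prev.add(i)
        used_curr.add(j)
    return matches
```

An optimal one-to-one assignment (for example, Hungarian matching on the IoU matrix) could replace the greedy loop when many instances overlap.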

2.3 Optimizing Object Query

Compared with Faster-RCNN, DETR needs a much longer schedule for convergence. Due to the critical role of the object query, several approaches study how to speed up training schedules and improve performance. According to how they treat the object query, we divide the following literature into two aspects: adding position information and adopting extra supervision. The position information provides cues for sampling the query feature, which leads to faster training. The extra supervision focuses on designing specific loss functions in addition to the default ones in DETR.

$\bullet$ Adding Position Information into Query. Conditional DETR finds that cross-attention in DETR relies highly on the content embeddings for localizing the four extremities. The authors introduce a conditional spatial query to explore the extremity regions explicitly. Conditional DETR V2 introduces box queries to improve detection results. The box queries are directly learned from the image content and are therefore dynamic with respect to the image input. The image-dependent box query helps locate the object and improves the performance. Motivated by previous anchor designs in object detectors, several works bring anchor priors into DETR. Efficient DETR adopts a hybrid design by including query-based and dense anchor-based predictions in one framework. Anchor DETR proposes to use anchor points to replace the learnable query and also designs an efficient self-attention head for faster training. Each object query predicts multiple objects at one position. DAB-DETR identifies the localization issues of the learnable query and proposes dynamic anchor boxes to replace it. Dynamic anchor boxes make the query learning more explainable and explicitly decouple the localization and content parts, further improving the detection performance.

$\bullet$ Adding Extra Supervision into Query. DN-DETR finds that the instability of bipartite graph matching causes the slow convergence of DETR and proposes a denoising loss to stabilize query learning. In particular, the authors feed noised GT bounding boxes into the transformer decoder and train the model to reconstruct the original boxes. Motivated by DN-DETR and built on Mask2Former, MP-Former finds inconsistent predictions between consecutive layers. It further adds class embeddings of both ground truth class labels and masks to reconstruct the masks and labels. Meanwhile, DINO improves DN-DETR via contrastive denoising training and a mixed query selection for better query initialization. Mask DINO extends DINO by adding an extra query decoding head for mask prediction. It proposes a unified architecture and a joint training process for both object detection and instance segmentation. By sharing the training data, Mask DINO can scale up and fully utilize the detection annotations to improve IS results. Meanwhile, motivated by contrastive learning, IUQ introduces two extra supervisions: a cross-image contrastive query loss via extra memory blocks and an equivalent loss against geometric transformations. Both losses can be naturally adapted into query-based detectors. There are also several works exploring query supervision from the target assignment perspective. In particular, since DETR lacks the capability of exploiting multiple positive object queries, DE-DETR first introduces one-to-many label assignment in a query-based instance perception framework to provide richer supervision for model training. Group DETR proposes group-wise one-to-many assignments during training. H-DETR adds auxiliary queries that use a one-to-many matching loss during training. Rather than adding more queries, Co-DETR proposes a collaborative hybrid training scheme using parallel auxiliary heads supervised by one-to-many label assignments. All these approaches drop the extra supervision heads during inference. These extra supervision designs can be easily extended to query-based segmentation methods.
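The denoising idea can be sketched as follows: ground-truth boxes are perturbed, fed to the decoder as extra queries, and the decoder is trained to recover the clean boxes. The snippet below only covers the noising step and an L1 reconstruction loss. The noise scales, the `(cx, cy, w, h)` normalized box format, and the function names are assumptions for illustration and are not taken from the DN-DETR code.

```python
import random

def noise_boxes(gt_boxes, shift_scale=0.4, size_scale=0.4, seed=0):
    """Jitter normalized (cx, cy, w, h) boxes; scales are illustrative values."""
    rng = random.Random(seed)
    noised = []
    for cx, cy, w, h in gt_boxes:
        # Shift the center by at most shift_scale * half the box size,
        # so the noised center stays inside the original box.
        ncx = cx + rng.uniform(-1, 1) * shift_scale * w / 2
        ncy = cy + rng.uniform(-1, 1) * shift_scale * h / 2
        # Rescale width and height by a factor in [1 - size_scale, 1 + size_scale].
        nw = w * (1 + rng.uniform(-1, 1) * size_scale)
        nh = h * (1 + rng.uniform(-1, 1) * size_scale)
        noised.append(tuple(min(max(v, 0.0), 1.0) for v in (ncx, ncy, nw, nh)))
    return noised

def l1_reconstruction_loss(pred_boxes, gt_boxes):
    """Mean per-box L1 distance between decoder outputs and the clean GT boxes."""
    total = sum(abs(p - g)
                for pb, gb in zip(pred_boxes, gt_boxes)
                for p, g in zip(pb, gb))
    return total / max(len(gt_boxes), 1)
```

Because every noised query is tied to a known ground-truth box, this branch needs no bipartite matching, which is why it can stabilize training.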

2.4 Using Query For Association

Benefiting from the simplicity of query representation, several recent works have adopted it as an association tool to solve downstream tasks. There are mainly two usages: one for instance-level association and the other for task-level association. The former adopts the idea of instance discrimination, for instance-wise matching problems in video, such as joint segmentation and tracking. The latter adopts queries to link features for multitask learning.

$\bullet$ Using Query for Instance Association. The research in this area can be divided into two aspects: one for designing extra tracking queries and the other for using object queries directly. TrackFormer is the first to treat multi-object tracking as a set prediction problem by performing joint detection and tracking-by-attention. TransTrack uses the object query from the last frame as a new track query and outputs tracking boxes from the shared decoder. MOTR introduces an extra track query to model the tracked instances of the entire video. In particular, MOTR proposes a new tracklet-aware label assignment to train track queries and a temporal aggregation module to fuse temporal features. There are also several works adopting the object query solely for tracking. In particular, MinVIS directly uses object queries for matching without extra tracking head modeling for VIS, where it adopts image instance segmentation training. Both Video K-Net and IDOL learn the association embeddings directly from the object query using a temporal contrastive loss. During inference, the learned association embeddings are used to match instances across frames. These methods are usually verified on VIS and VPS tasks. All methods pre-train their image baseline on image datasets, including COCO and Cityscapes, and fine-tune their video architecture on the video datasets.
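A temporal contrastive loss on association embeddings can be sketched as below. For one query embedding (the anchor), the embedding of the same instance in another frame is the positive and the embeddings of other instances are negatives. The specific form, log(1 + sum of exp(negative similarity minus positive similarity)), is one common choice and is used here as an assumption; the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_contrastive_loss(anchor, positive, negatives):
    """Small when the anchor is closer to its positive than to every negative.

    anchor:    embedding of an instance query in the key frame
    positive:  embedding of the same instance in a reference frame
    negatives: embeddings of other instances in the reference frame
    """
    pos = dot(anchor, positive)
    return math.log(1.0 + sum(math.exp(dot(anchor, n) - pos) for n in negatives))
```

At inference time, the same embeddings can be compared directly (for example, by dot product) to link instances across frames.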

$\bullet$ Using Query for Linking Multi-Tasks. Several works use object queries to link features across different tasks to achieve mutual benefits. Rather than directly fusing multitask features, object query fusion not only selects the most discriminative parts to fuse but is also more efficient than dense feature fusion. In particular, Panoptic-PartFormer links part and panoptic features via different object queries in an end-to-end framework, where joint learning leads to better part segmentation results. Several works combine segmentation features and depth features using the MHSA layer on the corresponding depth query and segmentation query, which unifies depth prediction and panoptic segmentation prediction via shared masks. Both works find the mutual effect for both segmentation and depth prediction. Recently, several works have adopted vision transformers with multiple task-aware queries for multi-task dense prediction. In particular, they treat object queries as task-specific hidden features for fusion and perform cross-task reasoning using MHSA on task queries. Moreover, in addition to dense prediction tasks, FashionFormer unifies fashion attribute prediction and instance part segmentation in one framework. It also finds the mutual effect on instance segmentation and attribute prediction via query sharing. Recently, X-Decoder uses two different queries for segmentation and language generation tasks. The authors jointly pre-train the two queries using large-scale vision-language datasets and find that both queries can benefit the corresponding tasks, including visual segmentation and caption generation.

2.5 Conditional Query Fusion

In addition to using object query for multitask prediction, several works adopt conditional query design for cross-modal and cross-image tasks. The query is conditional on the task inputs, and the decoder head uses such a conditional query to obtain the corresponding segmentation masks. Based on the source of different inputs, we split these works into two aspects: language features and image features.

$\bullet$ Conditional Query Fusion From Language Feature. Several works adopt conditional query fusion according to the input language feature for both referring image segmentation (RIS) and referring video object segmentation (RVOS) tasks. In particular, VLT is the first to adopt the vision transformer for the RIS task and proposes a query generation module to produce multiple sets of language-conditional queries, which enhances the diversified comprehensions of the language. Then, it adaptively selects the output features of these queries via the proposed query balance module. Following the same idea, LAVT designs a new gated cross-attention fusion where the image features are the query inputs of a self-attention layer in the encoder part. Compared with previous CNN approaches, using a vision transformer significantly improves the language-driven segmentation quality. With the help of CLIP's knowledge, CRIS proposes vision-language decoding and contrastive learning for achieving text-to-pixel alignment. Meanwhile, several works adopt the video detection transformer in Sec. 3.2.2 for the RVOS task. MTTR models the RVOS task as a sequence prediction problem and processes both language and video features jointly. Each object query in each frame is combined with the language features before being sent into the decoder. To speed up the query learning, ReferFormer designs a small set of object queries conditioned on the language as the input to the transformer. The conditional queries are transformed into dynamic kernels to generate tracked object masks in the decoder. With the same design as VisTR, ReferFormer can segment and track object masks with given language inputs. In this way, each object tracklet is controlled by a given language input. In addition to referring segmentation tasks, MDETR presents an end-to-end modulated detector that detects objects in an image conditioned on a raw text query. In particular, the authors fuse the text embedding directly into visual features and jointly train the fused feature and object query. X-DETR proposes an effective architecture for instance-wise vision-language tasks by using the dot product to align vision and language. In summary, these works fully utilize the interaction of language features and query features.
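The language-conditioned query and dynamic-kernel idea can be illustrated with a small sketch. It is a simplified example under several assumptions: the sentence embedding is fused into each query by addition, the transformer decoder between the two steps is omitted, and a dynamic kernel is applied as a per-pixel dot product (a 1x1 dynamic convolution). The function names are hypothetical and do not correspond to the ReferFormer code.

```python
def condition_queries(queries, sentence_embed):
    """Fuse a sentence embedding into every learnable query (additive fusion)."""
    return [[q + s for q, s in zip(query, sentence_embed)] for query in queries]

def dynamic_kernel_masks(kernels, pixel_feats):
    """Apply each query-derived kernel to every pixel feature.

    kernels:     N x C list, one kernel per (conditioned) query
    pixel_feats: P x C list, one feature per pixel
    returns:     N x P list of mask logits
    """
    return [[sum(k * f for k, f in zip(kernel, feat)) for feat in pixel_feats]
            for kernel in kernels]
```

Since the kernels depend on the sentence, changing the language input changes which pixels receive high mask logits, without changing the pixel features.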

$\bullet$ Conditional Query Fusion From Image Feature. Several tasks take multiple images as references and refine the corresponding object masks of the main image. The multiple images can be support images in few-shot segmentation or the same input image in matting and semantic segmentation. These works aim to model the correspondences between the main image and other images via conditional query fusion. For SS, StructToken presents a new framework that performs interactions between a set of learnable structure tokens and the image features, where the image features are the spatial priors. In the video domain, BATMAN fuses optical flow features and previous frame features into mixed features and uses such features as a query to decode the current frame outputs. For few-shot segmentation, CyCTR aggregates pixel-wise support features into query features. In particular, CyCTR performs cross-attention between features from different images in a cycle manner, where support image features and query image features are jointly the query inputs of the transformer. Meanwhile, MM-Former adopts a class-agnostic method to decompose the query image into multiple segment proposals. Then, the support and query image features are used to select the correct masks via a transformer module. For few-shot instance segmentation, RefTwice proposes an object query enhanced framework to weight query image features via object queries from support queries. In image matting, MatteFormer designs a new attention layer called prior-attentive window self-attention based on Swin. The prior token represents the global context feature of each trimap region, which is the query input of window self-attention. The prior token introduces spatial cues and achieves thinner matting results. In summary, depending on the task, the image features act as the decoder features described in Sec. 3.2.2, which enhance the features in the main images.

Specific Subfields

In this section, we revisit several related subfields that adopt vision transformers for segmentation tasks. The subfields include point cloud segmentation, domain-aware segmentation, label and model efficient segmentation, class agnostic segmentation, tracking, and medical segmentation.

$\bullet$ Semantic Level Point Cloud Segmentation. Like image segmentation and video semantic segmentation, adopting transformers for semantic level processing mainly focuses on learning a strong representation (Sec. 3.2.1). The works focus on transferring the success in image/video representation learning to the point cloud. Early works directly use modified self-attention as backbone networks and design U-Net-like architectures for segmentation. In particular, Point-Transformer proposes vector self-attention and a subtraction relation to aggregate local features progressively. The concurrent work PCT also adopts a self-attention operation and enhances the input embedding with the support of farthest point sampling and nearest neighbor searching. However, the ability to model long-range context and cross-scale interaction is still limited. Stratified-Transformer extends the idea of Swin Transformer to the point cloud and divides 3D inputs into cubes. It proposes a mixed key sampling method for the attention input and enlarges the effective receptive field by merging different cube outputs. Meanwhile, several works also focus on better pre-training or distilling the knowledge of 2D pre-trained models. Point-BERT designs the first Masked Point Modeling (MPM) task to pre-train point cloud transformers. It divides a point cloud into several local point patches as the input of a standard transformer. Moreover, it also pre-trains a point cloud tokenizer with a discrete variational autoencoder to encode the semantic contents and trains an extra decoder using the reconstruction loss. Following MAE, several works simplify the MIM pre-training process. Point-MAE divides the input point cloud into irregular point patches and randomly masks them at a high ratio. Then, it uses a standard transformer-based autoencoder to reconstruct the masked points. Point-M2AE designs a multiscale MIM pre-training by making the encoder and decoder into pyramid architectures to model spatial geometries and multilevel semantics progressively. Meanwhile, benefiting from the same transformer architecture for point clouds and images, several works adopt image pre-trained standard transformers by distilling the knowledge from models pre-trained on large-scale image datasets.
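Farthest point sampling, mentioned above as a tool for building input embeddings and point patches, can be sketched in a few lines. This is the textbook greedy algorithm on plain tuples, given only as an illustration; the function names and the choice of the first point as the seed are assumptions.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points that are spread out over the point cloud."""
    if not points or k <= 0:
        return []
    chosen = [start]
    # min_d2[i]: squared distance from point i to its nearest chosen point.
    min_d2 = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        # Pick the point that is farthest from everything chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_d2[i])
        chosen.append(nxt)
        min_d2 = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d2, points)]
    return chosen
```

The selected points can then serve as patch centers, with each patch formed by the nearest neighbors of its center.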

$\bullet$ Instance Level Point Cloud Segmentation. As shown in Sec. 2, previous PCIS / PCPS approaches are based on manually-tuned components, including a voting mechanism that predicts hand-selected geometric features for top-down approaches and heuristics for clustering the votes for bottom-up approaches. Both approaches involve many hand-crafted components and post-processing. The usage of transformers in instance-level point cloud segmentation is similar to the image or video domain, and most works use bipartite matching for instance-level masks for indoor and outdoor scenes. For example, Mask3D proposes the first Transformer-based approach for 3D semantic instance segmentation. It models each object instance as an instance query and uses the transformer decoder to refine each instance query by attending to point cloud features at different scales. Meanwhile, SPFormer learns to group the potential features from point clouds into super-points, and directly predicts instances through instance queries with a mask-based transformer decoder. The super-points utilize geometric regularities to represent homogeneous neighboring points, which is more efficient than using all point features. The transformer decoder works similarly to Mask2Former, where the cross-attention between instance queries and super-point features is guided by the attention mask from the previous stage. PUPS proposes a unified PPS system for outdoor scenes. It presents two types of learnable queries named semantic score and grouping score. The former predicts the class label for each point, while the latter indicates the probability of the grouping ID for each point. Then, both queries are refined via grouped point features, which shares the same idea as the previous Sparse-RCNN and K-Net. Moreover, PUPS also presents a context-aware mixing to balance the training instance samples, which achieves new state-of-the-art results.
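The mask-guided cross-attention used by such decoders can be sketched for a single instance query. The query attends only to the positions (pixels, points, or super-points) that the previous stage predicted as foreground. This is a minimal single-head example without learned projections; the fallback to full attention for an empty mask is an assumed safeguard, and the function name is hypothetical.

```python
import math

def masked_cross_attention(query, keys, values, prev_mask):
    """Attention for one query, restricted to positions where prev_mask is 1."""
    allowed = [i for i, m in enumerate(prev_mask) if m]
    if not allowed:
        # Empty mask: attend everywhere instead of producing an undefined softmax.
        allowed = list(range(len(keys)))
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in allowed]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # numerically stable softmax
    norm = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, allowed):
        for d in range(len(out)):
            out[d] += (w / norm) * values[i][d]
    return out
```

Restricting attention to the predicted foreground keeps each query focused on its own instance, which is the property the decoders above rely on.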

2 Tuning Foundation Models

We divide this section into two aspects: vision adapter design and open vocabulary learning. The former introduces new ways to adapt the pre-trained large-scale foundation models for downstream tasks. The latter tries to detect and segment unknown objects with the help of the pre-trained vision language model and zero-shot knowledge transfer on unseen segmentation datasets. The core idea for vision adapter design is to extract the knowledge of foundation models and design better ways to fit the downstream settings. For open vocabulary learning, the core idea is to align pre-trained VLM features into current detectors to achieve novel class classification.

$\bullet$ Vision Adapter and Prompting Modeling. Following the idea of prompt tuning in NLP, early works adopt learnable parameters with frozen foundation models to better transfer to the downstream datasets. These works use small image classification datasets for verification and achieve better results than the original zero-shot results. Meanwhile, there are several works designing adapters with frozen foundation models for video recognition tasks. In particular, the pre-trained parameters are frozen, and only a few learnable parameters or layers are tuned. Following the idea of learnable tuning, recent works extend the vision adapter to dense prediction tasks, including segmentation and detection. In particular, ViT-Adapter proposes a spatial prior module to address the weak location prior assumptions in ViTs. The authors design a two-stream adaptation framework using deformable attention and achieve comparable results in downstream tasks. From the CLIP knowledge usage view, DenseCLIP converts the original image-text matching problem in CLIP to a pixel-text matching problem and uses the pixel-text score maps to guide the learning of dense prediction models. From the task prompt view, CLIPSeg builds a system to generate image segmentations based on arbitrary prompts at test time. A prompt can be a text or an image, and the CLIP visual model is frozen during training. In this way, the segmentation model can be turned into a different task driven by the task prompt. Previous works only focus on a single task. OneFormer extends Mask2Former with a multi-target training setting and performs segmentation driven by the task prompt. Moreover, using a vision adapter and text prompt can easily reduce the taxonomy problems of each dataset and learn a more general representation for different segmentation tasks. Recently, SAM proposes more generalized prompting methods, including mask, points, box, and text. The authors build a larger dataset with 1 billion masks. SAM achieves good zero-shot performance on various segmentation datasets.

$\bullet$ Open Vocabulary Learning. Recent studies focus on the open vocabulary and open world settings, where the goal is to detect and segment novel classes that are not seen during training. Different from zero-shot learning, an open vocabulary setting assumes that large vocabulary data or knowledge can provide cues for the final classification. Most models are trained by leveraging pre-trained language-text pairs, including captions and text prompts, or with the help of a VLM. Then, the trained models can detect and segment the novel classes with the help of weakly annotated captions or existing publicly available VLMs. In particular, ViLD distills the knowledge from a trained open vocabulary image classification model, CLIP, into a two-stage detector. However, ViLD still needs an extra visual CLIP encoder for visual distillation. To handle this, Frozen-VLM adopts the frozen visual CLIP model and combines the scores of both the learned visual embedding and the CLIP embedding for novel class detection. From the data augmentation view, MViT combines Deformable DETR and the CLIP text encoder for open world class-agnostic detection, where the authors build a large dataset by mixing existing detection datasets. Motivated by the more balanced samples from image classification datasets, Detic improves the performance on novel classes with existing image classification datasets by supervising the max-size proposal with all image labels. OV-DETR designs the first query-based open vocabulary framework by learning conditional matching between class text embeddings and query features. Besides these open vocabulary detection settings, recent works perform open vocabulary segmentation. In particular, L-Seg presents a new setting for language-driven semantic image segmentation and proposes a transformer-based image encoder that computes dense per-pixel embeddings according to the language inputs. OpenSeg learns to generate segmentation masks for possible candidates using a DETR-like transformer. Then it performs visual-semantic alignment by aligning each word in a caption to one or a few predicted masks. BetrayedCaption presents a unified transformer framework with joint segmentation and caption learning, where the caption part contains both caption generation and caption grounding. The novel class information is encoded into the network during training. With the goal of unifying different segmentation tasks with text prompts, FreeSeg adopts a similar pipeline to OpenSeg to crop frozen CLIP features for novel class classification. Meanwhile, open set segmentation requires the model to output class-agnostic masks and enhances the generality of segmentation models. Recently, ODISE uses a frozen diffusion model as the feature extractor, a Mask2Former head, and joint training with caption data to perform open vocabulary panoptic segmentation. There are also several works focusing on open-world object detection, where the task is to detect a known set of object categories while simultaneously identifying unknown objects. In particular, OW-DETR adopts DETR as the base detector and proposes several improvements, including attention-driven pseudo-labeling, novelty classification, and objectness scoring. In summary, most approaches adopt the idea of a region proposal network to generate class-agnostic mask proposals via different approaches, including the anchor-based and query-based decoders in Sec. 3.1. Then, the open vocabulary problem turns into a region-level matching problem that matches the visual region features with pre-trained VLM language embeddings.
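The region-level matching described in the summary can be sketched as a nearest-text-embedding classifier. Each class-agnostic region (or mask) embedding is compared with the text embeddings of the candidate class names, and the most similar class is assigned. The embeddings here are toy vectors; in practice they would come from a pre-trained VLM, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def classify_regions(region_embeds, text_embeds, class_names):
    """Assign each region the class whose text embedding is most similar."""
    labels = []
    for region in region_embeds:
        sims = [cosine(region, text) for text in text_embeds]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(class_names[best])
    return labels
```

Because the class list is only a set of text embeddings, novel categories can be added at test time by embedding their names, without retraining the mask proposal branch.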

3 Domain-aware Segmentation

$\bullet$ Domain Adaptation. Unsupervised Domain Adaptation (UDA) aims at adapting a network trained on the source (synthetic) domain to the target (real) domain without access to target labels. UDA has two different settings: semantic segmentation and object detection. Before ViTs, previous works mainly designed domain-invariant representation learning strategies. DAFormer replaces the outdated backbone with an advanced transformer backbone and proposes three training strategies: rare class sampling, a thing-class ImageNet feature loss, and a learning rate warm-up method. It achieves new state-of-the-art results and is a strong baseline for UDA segmentation. Then, HRDA improves DAFormer via a multi-resolution training approach and uses various crops to preserve fine segmentation details and long-range contexts. Motivated by MIM, MIC proposes a masked image consistency to learn spatial context relations of the target domain as additional clues. MIC enforces the consistency between predictions of masked target images and pseudo-labels via a teacher-student framework. It is a plug-in module that is verified on various UDA settings. Meanwhile, SePiCo introduces a new framework that extracts the semantic meaning of individual pixels to learn class-discriminative and class-balanced pixel representations. It supports both CNN and Transformer architectures. For detection transformers on UDA, SFA finds that feature distribution alignment on the CNN brings limited improvements. Instead, it proposes a domain query-based feature alignment module and a token-wise feature alignment module. In particular, the alignment is achieved by introducing a domain query and performing domain classification on the decoder. DA-DETR proposes a hybrid attention module (HAM), which contains a coordinate attention module and a level attention module along with the transformer encoder. A single domain-aware discriminator supervises the output of HAM. MTTrans presents a teacher-student framework and a shared object query strategy. The image and object features between the source and target domains are aligned at local, global, and instance levels.

$\bullet$ Multi-Dataset Segmentation. The goal of multi-dataset segmentation is to learn a universal segmentation model on various domains. MSeg re-defines the taxonomies and aligns the pixel-level annotations by relabeling several existing semantic segmentation benchmarks. Then, the following works try to avoid taxonomy conflicts via various approaches. For example, Sentence-Seg replaces each class label with a vector-valued embedding generated by a language model. To further handle the inflexible one-hot common taxonomy, LMSeg extends such embeddings with learnable tokens and proposes a dataset-specific augmentation for each dataset. It dynamically aligns the segment queries in MaskFormer with the category embeddings for both SS and PS tasks. Meanwhile, there are several works on multi-dataset object detection. In particular, Detection-Hub proposes to adapt object queries on the language embeddings of the categories of each dataset. Rather than using a shared embedding for all datasets as before, it learns a semantic bias for each dataset based on the common language embedding to avoid the domain gap. Meanwhile, several works focus on segmentation domain generalization, which directly transfers learned knowledge from one domain to the remaining domains. Recently, TarVIS jointly pre-trains one video segmentation model for different tasks spanning multiple benchmarks, where it extends Mask2Former to the video domain and adopts unified image dataset pre-training and video fine-tuning.

4 Label and Model Efficient Segmentation

$\bullet$ Weakly Supervised Segmentation. Weakly supervised segmentation methods learn segmentation with weaker annotations, such as image labels and object boxes. For weakly supervised semantic segmentation, previous works improve the typical CNN pipeline with class activation maps (CAM) and use refined CAM as training labels, which requires an extra model for training. ViT-PCM shows that self-supervised transformers with a global max pooling can leverage patch features to negotiate pixel-label probability and achieve end-to-end training and testing with one model. MCTformer adopts the idea that the attended regions of the one-class token in the vision transformer can be leveraged to form a class-agnostic localization map. It extends this to multiple classes by using multiple class tokens to learn interactions between the class tokens and the patch tokens to generate the segmentation labels. For weakly supervised instance segmentation, previous works mainly leverage box priors to supervise mask heads. Recently, MAL shows that vision transformers are good mask auto-labelers. It takes box-cropped images as inputs and adopts a teacher-student framework, where the two vision transformers are trained with a multiple instance learning loss. MAL demonstrates zero-shot segmentation ability and achieves nearly mask-supervised performance on various baselines. Meanwhile, several works explore text-only supervision for semantic segmentation. One representative work, GroupViT, adopts ViT to group image regions into progressively larger arbitrary-shaped segments.
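The role of global max pooling in learning from image-level labels can be sketched as follows. Per-patch class probabilities are pooled into one image-level score per class, and only these pooled scores are compared with the image labels. The binary cross-entropy loss and the function names are assumptions made for this illustration and are not taken from a specific method.

```python
import math

def global_max_pool(patch_scores):
    """patch_scores: P x C per-patch class probabilities -> C image-level scores."""
    num_classes = len(patch_scores[0])
    return [max(patch[c] for patch in patch_scores) for c in range(num_classes)]

def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between image-level scores and image labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)
```

Since the image score of a class equals the score of its most confident patch, the image-level loss pushes at least one patch to respond to each present class, and the per-patch scores can then be read out as a coarse segmentation.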

$\bullet$ Mobile Segmentation. Most transformer-based segmentation methods have huge computational costs and memory requirements, which make these methods unsuitable for mobile devices. Different from previous real-time segmentation methods, mobile segmentation methods need to be deployed on mobile devices while considering both power cost and latency. Several earlier works focus on a more efficient transformer backbone. In particular, Mobile-ViT introduces the first transformer backbone for mobile devices. It reduces image patches via MLPs before performing MHSA and shows better task-level generalization properties. There are also several works designing mobile semantic segmentation using transformers. TopFormer proposes a token pyramid module that takes the tokens from various scales as input to produce the scale-aware semantic feature. SeaFormer proposes a squeeze-enhanced axial transformer that contains a generic attention block. The block mainly contains two branches: a squeeze axial attention layer to model efficient global context and a detail enhancement module to preserve the details.

5 Class Agnostic Segmentation and Tracking

$\bullet$ Fine-grained Object Segmentation. Several applications, such as image and video editing, often need fine-grained details of object mask boundaries. Earlier CNN-based works focus on refining the object masks with extra convolution modules or extra networks. Most transformer-based approaches adopt vision transformers due to their fine-grained multiscale features and long-range context modeling. Transfiner refines the regions of the coarse mask via a quad-tree transformer. By considering multiscale point features, it produces more natural boundaries while revealing details of the objects. Then, Video-Transfiner refines the spatial-temporal mask boundaries by applying Transfiner to a video segmentation method. It can also be used to refine the existing video instance segmentation datasets. PatchDCT adopts the idea of ViT by splitting object masks into patches. Then, each patch is encoded into a DCT vector, and PatchDCT designs a classifier and a regressor to refine each encoded patch. Entity segmentation aims to segment all visual entities without predicting their semantic labels. Its goal is to obtain high-quality and generalized segmentation results.

$\bullet$ Video Object Segmentation. Recent approaches for VOS mainly focus on designing better memory-based matching methods. Inspired by the Non-local network in image recognition tasks, the representative work STM is the first to adopt cross-frame attention, where previous features are seen as memory. Then, the following works design a better memory-matching process. Associating Objects with Transformers (AOT) matches and decodes multiple objects jointly. The authors propose a novel hierarchical matching and propagation module, named long short-term transformer, in which they jointly preserve an identity bank and long short-term attention. XMem proposes a mixed memory design to handle long video inputs. The mixed memory design is also based on the self-attention architecture. Meanwhile, Clip-VOS introduces per-clip memory matching for inference efficiency. Recently, to enhance instance-level context, Wang et al. add an extra query from Mask2Former into memory matching for VOS.
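To make the idea of memory matching concrete, the following is a minimal sketch of a cross-frame attention read, where features of previous frames act as memory. It is not the implementation of STM, AOT, or XMem: the function name `memory_read`, the plain dot-product affinity without scaling, and the absence of learned projections are all simplifying assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def memory_read(query_keys: List[Vector],
                memory_keys: List[Vector],
                memory_values: List[Vector]) -> List[Vector]:
    """Read from memory with softmax attention (hypothetical, simplified).

    query_keys:    one key vector per location of the current frame.
    memory_keys:   one key vector per location of the stored past frames.
    memory_values: one value vector per memory location.
    """
    value_dim = len(memory_values[0])
    outputs = []
    for q in query_keys:
        # Affinity between this query location and every memory location.
        logits = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        # Numerically stable softmax over memory locations.
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        # Weighted sum of the memory values.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, memory_values)) / total
            for d in range(value_dim)
        ])
    return outputs


# A query that matches the first memory key retrieves (almost) its value.
read = memory_read([[10.0, 0.0]],
                   [[10.0, 0.0], [0.0, 10.0]],
                   [[1.0, 0.0], [0.0, 1.0]])
```

In this toy setting, each location of the current frame retrieves the values of the memory locations it most resembles, which is the mechanism the later works refine with better matching and memory designs.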

6 Medical Image Segmentation

CNNs have achieved milestones in medical image analysis. In particular, the U-shaped architecture and skip-connections have been widely applied in various medical image segmentation tasks. With the success of ViTs, recent representative works integrate vision transformers into the U-Net architecture and achieve better results. TransUNet merges the transformer and U-Net, where the transformer encodes tokenized image patches to build the global context. Then, the decoder upsamples the encoded features, which are combined with the high-resolution CNN feature maps to enable precise localization. Swin-Unet designs a symmetric Swin-like decoder to recover fine details. TransFuse combines transformers and CNNs in a parallel style, where global dependency and low-level spatial details can be efficiently captured jointly. UNETR focuses on 3D input medical images and designs a similar U-Net-like architecture. The encoded representations of different layers in the transformer are extracted and merged with a decoder via skip connections to get the final 3D mask outputs.

Benchmark Results

In this section, we report the results of recent transformer-based visual segmentation methods and tabulate the performance of previously discussed algorithms. For each reviewed field, the most widely used datasets are selected for performance benchmarking in Sec. 5.1 and Sec. 5.3. We further re-benchmark several representative works in Sec. 5.2 using the same data augmentations and feature extractor. Note that we only list published works for reference. For simplicity, we have excluded several works on representation learning and only present specific segmentation methods. For a comprehensive method comparison, please refer to the supplementary material, which provides a more detailed analysis. In addition, several works achieve better results but rely on extra datasets, so we do not list them here.

$\bullet$ Results On Semantic Segmentation Datasets. In Tab. V, Mask2Former and OneFormer perform the best on the Cityscapes and ADE20K datasets, while SegNext achieves the best results on the COCO-Stuff and Pascal-Context datasets.

$\bullet$ Results on COCO Instance Segmentation. In Tab. VI, Mask DINO achieves the best results on the COCO instance segmentation with both ResNet and Swin-L backbones.

$\bullet$ Results on Panoptic Segmentation. In Tab. VII, for panoptic segmentation, Mask DINO and K-Max Deeplab achieve the best results on the COCO dataset. K-Max Deeplab also achieves the best results on Cityscapes. OneFormer performs the best on ADE20K.

5.2 Re-Benchmarking For Image Segmentation

$\bullet$ Motivation. We perform re-benchmarking on two segmentation tasks, semantic segmentation and panoptic segmentation, on four public datasets: ADE20K, COCO, Cityscapes, and COCO-Stuff. In particular, we want to explore the effect of the transformer decoder. Thus, we use the same encoder and neck architecture for a fair comparison.

$\bullet$ Results on Semantic Segmentation. As shown in Tab. IX, we carry out re-benchmark experiments for SS. In particular, using the same neck architecture, Segformer+ achieves the best results on COCO-Stuff and Cityscapes. Mask2Former achieves the best result on the ADE20K dataset.

$\bullet$ Results on Instance Segmentation. In Tab. X, we also explore the instance segmentation methods on the COCO dataset. Under the same neck architecture, we observe gains on both K-Net and MaskFormer, compared with the original results in Tab. VI. Mask2Former achieves the best results.

$\bullet$ Results on Panoptic Segmentation. In Tab. XI, we present the re-benchmark results for PS. In particular, Mask2Former achieves the best results on all three datasets. Compared with K-Net and MaskFormer, both K-Net+ and MaskFormer+ achieve improvements of around 3-4% due to the use of a stronger neck, which closes the gaps between their original results and Mask2Former.

5.3 Main Results for Video Segmentation Datasets

$\bullet$ Results On Video Semantic Segmentation. In Tab. VIII, we report VSS results on VSPW. Among the methods, TubeFormer achieves the best results.

$\bullet$ Results on Video Instance Segmentation. In Tab. XII, for VIS, CTVIS achieves the best results on YT-VIS-2019 and YT-VIS-2021 using a ResNet50 backbone. GenVIS achieves better results on OVIS using a ResNet50 backbone. When adopting a Swin-L backbone, CTVIS achieves the best results.

$\bullet$ Results on Video Panoptic Segmentation. In Tab. XIII, for VPS, SLOT-VPS achieves the best results on Cityscapes-VPS. TubeLink achieves the best results on the VIP-Seg dataset. Video K-Net achieves the best results on the KITTI-STEP dataset.

Future Directions

$\bullet$ General and Unified Image/Video Segmentation. The trend of using transformers to unify diverse segmentation tasks is gaining traction. Recent studies have employed query-based transformers for various segmentation tasks within a unified architecture. A promising research avenue is the integration of image and video segmentation tasks in a universal model across different datasets. Such models may achieve general, robust segmentation capabilities in multiple scenarios, like detecting rare classes for improved robotic decision-making. This approach holds significant practical value, particularly in applications like robot navigation and autonomous vehicles.

$\bullet$ Joint Learning with Multi-Modality. Transformers’ inherent flexibility in handling various modalities positions them as ideal for unifying vision and language tasks. Segmentation tasks, which offer pixel-level information, can enhance associated vision-language tasks such as text-image retrieval and caption generation. Recent studies demonstrate the potential of a universal transformer architecture that concurrently learns segmentation alongside visual language tasks, paving the way for integrated multi-modal segmentation learning.

$\bullet$ Life-Long Learning for Segmentation. Existing segmentation methods are usually benchmarked on closed-world datasets with a set of predefined categories, i.e., assuming that the training and testing samples have the same categories and feature spaces that are known beforehand. However, realistic scenarios are usually open-world and non-stationary, where novel classes may occur continuously. For example, unseen situations can occur unexpectedly in self-driving vehicles and medical diagnoses. There is a distinct gap between the performance of existing methods on closed-world benchmarks and their capabilities in realistic, open-world settings. Thus, it is desirable to gradually and continuously incorporate novel concepts into the existing knowledge base of segmentation models, making the model capable of lifelong learning.

$\bullet$ Long Video Segmentation in Dynamic Scenes. Long videos introduce several challenges. First, existing video segmentation methods are designed to work with short video inputs and may struggle to associate instances over longer periods. Thus, new methods must incorporate long-term memory design and consider the association of instances over a more extended period. Second, maintaining segmentation mask consistency over long periods can be difficult, especially when instances move in and out of the scene. This requires new methods to incorporate temporal consistency constraints and update the segmentation masks over time. Third, heavy occlusion can occur in long videos, making it challenging to segment all instances accurately. New methods should incorporate occlusion reasoning and detection to improve segmentation accuracy. Finally, long video inputs often involve various scene inputs, which can bring domain robustness challenges for video segmentation models. New methods must incorporate domain adaptation techniques to ensure the model can handle diverse scene inputs. In short, addressing these challenges requires the development of new long video segmentation models that incorporate advanced memory design, temporal consistency constraints, occlusion reasoning, and detection techniques.

$\bullet$ Generative Segmentation. With the rise of stronger generative models, recent works solve image segmentation problems via generative modeling, inspired by a stronger transformer decoder and high-resolution representation in the diffusion model. Adopting a generative design avoids the transformer decoder and object query design, which makes the entire framework simpler. However, these generative models typically introduce a complicated training pipeline. A simpler training pipeline is needed for further research.

$\bullet$ Segmentation with Visual Reasoning. Visual reasoning requires the robot to understand the connections between objects in the scene, and this understanding plays a crucial role in motion planning. Previous research has explored using segmentation results as input to visual reasoning models for various applications, such as object tracking and scene understanding. Joint segmentation and visual reasoning can be a promising direction, with the potential for mutual benefits for both segmentation and relation classification. By incorporating visual reasoning into the segmentation process, researchers can leverage the power of reasoning to improve the segmentation accuracy, while segmentation can provide better input for visual reasoning.

Conclusion

This survey provides a comprehensive review of recent advancements in transformer-based visual segmentation, which, to our knowledge, is the first of its kind. The paper covers essential background knowledge and an overview of previous works before transformers and summarizes more than 120 deep-learning models for various segmentation tasks. The recent works are grouped into six categories based on the meta-architecture of the segmenter. Additionally, the paper reviews five specific subfields and reports the results of several representative segmentation methods on widely-used datasets. To ensure fair comparisons, we also re-benchmark several representative works under the same settings. Finally, we conclude by pointing out future research directions for transformer-based visual segmentation.

Acknowledgement. This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s). It is also supported by Singapore MOE AcRF Tier 1 (RG16/21). We also acknowledge the GPU resource provided by SenseTime Research for benchmarking experiments. We also thank Xia Li for proofreading and suggestions.

Overview. In this appendix, we provide more details as a supplementary adjunct to the main paper.

More descriptions on task metrics. (Sec. .1)

Representative works in the section "Specific Subfields" in the main paper. (Sec. .2)

More detailed benchmark and re-benchmark results on remaining segmentation tasks. (Sec. .3)

Detailed experiment settings and implementation details for re-benchmarking. (Sec. .4)

In this section, we present detailed descriptions for different segmentation task metrics.

Mean Intersection over Union (mIoU). It is a metric used to evaluate the performance of a semantic segmentation model. The predicted segmentation mask is compared to the ground truth segmentation mask for each class. The IoU score measures the overlap between the predicted mask and the ground truth mask for each class. The IoU score is calculated as the ratio of the area of intersection between the predicted and ground truth masks to the area of union between the two masks. The mean IoU score is then calculated as the average of the IoU scores across all the classes. A higher mean IoU score indicates better performance of the model in accurately segmenting the image into different classes. A mean IoU score of 1 indicates perfect segmentation, where the predicted and ground truth masks completely overlap for all classes. The mean IoU score indicates how well the model can separate the different objects or regions in the image only based on the semantic class. It is a commonly used metric to evaluate the performance of semantic segmentation models as well as video or point cloud semantic segmentation.
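The computation described above can be sketched in a few lines. This is a minimal illustration for a single image with label maps given as nested lists; the function name `mean_iou` and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions of this sketch. Benchmark toolkits usually accumulate intersections and unions over the whole dataset before averaging.

```python
from typing import Sequence


def mean_iou(pred: Sequence[Sequence[int]],
             gt: Sequence[Sequence[int]],
             num_classes: int) -> float:
    """Mean IoU over the classes present in the prediction or ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # The pixel belongs to both masks of class p.
                inter[p] += 1
                union[p] += 1
            else:
                # The pixel belongs to the predicted mask of class p
                # and to the ground-truth mask of class g.
                union[p] += 1
                union[g] += 1
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [[0, 0, 1],
        [1, 1, 1]]
gt = [[0, 1, 1],
      [1, 1, 0]]
# Class 0: IoU = 1/3, class 1: IoU = 3/5, so mIoU is about 0.467.
print(mean_iou(pred, gt, num_classes=2))
```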

Mean Average Precision (mAP). This is calculated by comparing the predicted segmentation masks with the ground truth masks for each object in the image. The mAP score is calculated by averaging the Average Precision (AP) scores across all the object categories in the dataset. The AP score for each object category is calculated based on the intersection-over-union (IoU) between the predicted segmentation mask and the ground truth mask. IoU measures the overlap between the two masks, and a higher IoU indicates a better match between the predicted and ground truth masks. A prediction counts as correct only if its IoU with a ground truth mask exceeds a given threshold. The mAP score ranges from 0 to 1, with a higher score indicating better performance of the model in detecting and localizing objects in the image. For the COCO dataset, the mAP metric is usually reported as the average over ten IoU thresholds from 0.5 to 0.95 with a step of 0.05, which measures the performance of the model at different localization levels. The scores at the individual thresholds 0.5 and 0.75 are commonly reported alongside this average.
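As a rough illustration of how AP is obtained for one category, the sketch below ranks predictions by confidence, builds the precision-recall curve, and integrates it. It assumes the COCO convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, and it assumes that each prediction already comes with the IoU of its matched ground truth mask, which sidesteps the matching step performed by real evaluation toolkits. The function names and the all-point integration of the curve are choices of this sketch (the COCO toolkit samples the curve at 101 recall points instead).

```python
from typing import List, Optional, Tuple


def average_precision(scored_hits: List[Tuple[float, bool]], num_gt: int) -> float:
    """AP for one category from (confidence, is_true_positive) pairs."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Interpolate: precision at a recall level is the best precision
    # reachable at that recall or any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def ap_over_iou_thresholds(detections: List[Tuple[float, float]],
                           num_gt: int,
                           thresholds: Optional[List[float]] = None) -> float:
    """Average AP over IoU thresholds; detections are (confidence, matched IoU)."""
    if thresholds is None:
        thresholds = [t / 100 for t in range(50, 100, 5)]  # 0.50, 0.55, ..., 0.95
    aps = [
        average_precision([(s, iou >= t) for s, iou in detections], num_gt)
        for t in thresholds
    ]
    return sum(aps) / len(aps)


# Three predictions for a category with two ground-truth objects.
detections = [(0.9, 0.92), (0.8, 0.30), (0.7, 0.63)]
print(ap_over_iou_thresholds(detections, num_gt=2))
```

The mAP is then the mean of this value over all categories.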

Similar to instance segmentation in images, the mAP score for VIS is calculated by comparing the predicted segmentation masks with the ground truth masks for each object in each frame of the video. The mAP score is then calculated by averaging the AP scores across all the object categories in all the frames of the video.

Panoptic Quality (PQ). This is the default evaluation metric for panoptic segmentation. PQ is computed by comparing the predicted segmentation masks with the ground truth masks for both objects and stuff in an image. In particular, PQ is measured at the segment level. Given a semantic class $c$, by matching predictions $p$ to ground-truth segments $g$ based on the IoU scores, these segments can be divided into true positives (TP), false positives (FP), and false negatives (FN). A prediction and a ground-truth segment are matched only if their IoU is greater than 0.5, which guarantees a unique matching. Then, the PQ of class $c$ is calculated as follows, and the final PQ is the average over all classes:

$$\text{PQ} = \frac{\sum_{(p,g)\in \text{TP}} \text{IoU}(p,g)}{|\text{TP}| + \frac{1}{2}|\text{FP}| + \frac{1}{2}|\text{FN}|}.$$
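As a minimal sketch of the segment-level matching and the PQ computation for one class, the code below represents each segment as a set of pixel indices. The function name `pq_single_class` and the set-based representation are assumptions for illustration; the sketch also assumes non-empty segments that do not overlap within the prediction or within the ground truth, which is what makes matching at IoU greater than 0.5 unique.

```python
from typing import List, Set


def pq_single_class(pred_segments: List[Set[int]],
                    gt_segments: List[Set[int]]) -> float:
    """PQ for one class: sum of matched IoUs / (TP + FP / 2 + FN / 2)."""
    matched_ious = []
    matched_gt = set()
    for p in pred_segments:
        for j, g in enumerate(gt_segments):
            if j in matched_gt:
                continue
            iou = len(p & g) / len(p | g)
            if iou > 0.5:
                # With non-overlapping segments, at most one ground-truth
                # segment can exceed 0.5 IoU with this prediction.
                matched_ious.append(iou)
                matched_gt.add(j)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp   # predictions without a match
    fn = len(gt_segments) - tp     # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom > 0 else 0.0


pred = [{0, 1, 2, 3}, {10, 11}]
gt = [{0, 1, 2}, {20, 21}]
# One match with IoU 0.75, one FP, one FN: PQ = 0.75 / (1 + 0.5 + 0.5) = 0.375
print(pq_single_class(pred, gt))
```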

Video Panoptic Quality (VPQ). This metric extends PQ to video by calculating the spatial-temporal mask IoU along different temporal window sizes. It is designed for VPS tasks. When the temporal window size is 1, VPQ is the same as PQ. Following PQ, VPQ performs matching with a threshold of greater than 0.5 IoU for all temporal segment predictions. In particular, for the thing classes, only tracked objects with the same semantic classes are considered as TP. Compared with PQ, there are only two differences: one for temporal mask prediction and the other for the different temporal window sizes. For the former, any cross-frame inconsistency of semantic or instance label prediction will result in a low tube IoU, and may drop the match out of the TP set. For the latter, short window sizes measure the consistency for short clip inputs, while long window sizes focus on long clip inputs. The final VPQ is obtained by averaging the results of different window sizes.
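The effect of cross-frame inconsistency on the tube IoU can be illustrated with a small sketch. Here a tube is a list of per-frame pixel-index sets for one tracked segment inside a temporal window; the function name `tube_iou` and this representation are assumptions for illustration only.

```python
from typing import List, Set


def tube_iou(pred_tube: List[Set[int]], gt_tube: List[Set[int]]) -> float:
    """IoU of two spatial-temporal tubes, accumulated over all frames."""
    inter = sum(len(p & g) for p, g in zip(pred_tube, gt_tube))
    union = sum(len(p | g) for p, g in zip(pred_tube, gt_tube))
    return inter / union if union else 0.0


gt = [{0, 1}, {0, 1}, {0, 1}]

# Consistent identity in all three frames: tube IoU is 1.0.
consistent = [{0, 1}, {0, 1}, {0, 1}]

# Identity lost in the last two frames (the pixels are assigned to another
# track there): tube IoU is 2 / 6, below 0.5, so the match is not a TP.
id_switch = [{0, 1}, set(), set()]

print(tube_iou(consistent, gt), tube_iou(id_switch, gt))
```

With a window size of 1, each tube contains a single frame and the computation reduces to the per-image IoU used by PQ.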

Segmentation and Tracking Quality (STQ). Both PQ and VPQ have different thresholds and extra parameters for VPS. To solve this and give a decoupled view of segmentation and tracking, STQ is proposed to evaluate the entire video clip at the pixel level. STQ combines association quality (AQ) and segmentation quality (SQ), which measure the tracking and segmentation quality, respectively. The proposed AQ is designed to work at the pixel level of a full video. All correct and incorrect associations influence the score, independent of whether a segment is above or below an IoU threshold. Motivated by HOTA, AQ is calculated by jointly considering the localization and association ability for each pixel. SQ has the same definition as mIoU. The overall STQ score is the geometric mean of AQ and SQ.
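The final combination step is simple enough to state in code. The sketch below only covers the geometric mean of two precomputed scores; computing AQ and SQ themselves is out of scope here, and the function name `stq` is an assumption.

```python
import math


def stq(aq: float, sq: float) -> float:
    """Geometric mean of association quality (AQ) and segmentation quality (SQ)."""
    return math.sqrt(aq * sq)


print(stq(0.64, 0.81))  # about 0.72
print(stq(0.0, 0.9))    # 0.0: good segmentation cannot offset failed tracking
```

Because of the geometric mean, a method has to do well on both tracking and segmentation to obtain a high STQ.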

Region Similarity, J. This is used for region-based segmentation similarity for VOS. Previous VOS methods adopt the Jaccard index $J$, defined as the intersection over the union of the estimated segmentation and the ground truth mask.

Contour Accuracy, F. This is used for contour-based segmentation similarity for VOS. Previous works adopt the F-measure between the contour points from both the predicted segmentation mask and the ground truth mask.
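The two VOS measures can be sketched for a single object in a single frame as follows. This is a simplified illustration, not the official benchmark code: masks are nested lists of 0/1 values, contour points are foreground pixels with a 4-connected background or out-of-image neighbor, and a contour point counts as matched if a contour point of the other mask lies within `tol` pixels. The function names, the tolerance-based matching, and the handling of empty masks are assumptions of this sketch.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index J between two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    return inter / union if union else 1.0  # both masks empty: treated as perfect


def contour_points(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels that touch the background or the image border."""
    h, w = len(mask), len(mask[0])
    points = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    points.add((r, c))
                    break
    return points


def contour_f_measure(pred: Mask, gt: Mask, tol: int = 1) -> float:
    """F-measure between the contour points of two binary masks."""
    pred_pts, gt_pts = contour_points(pred), contour_points(gt)
    if not pred_pts and not gt_pts:
        return 1.0
    if not pred_pts or not gt_pts:
        return 0.0

    def near(pt, others):
        return any(abs(pt[0] - o[0]) <= tol and abs(pt[1] - o[1]) <= tol
                   for o in others)

    precision = sum(near(p, gt_pts) for p in pred_pts) / len(pred_pts)
    recall = sum(near(g, pred_pts) for g in gt_pts) / len(gt_pts)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(region_similarity(pred, gt))   # 2 shared pixels / 4 pixels in the union = 0.5
print(contour_f_measure(gt, gt))     # identical masks give 1.0
```

VOS benchmarks typically report the mean of J and F over all objects and frames.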

.2 Representative Works in Specific Subfields

Due to the limited space of the main paper, in Tab. XV, we list several representative works to augment the Specific Subfields section. Moreover, we also list the detailed improvements for different techniques to improve K-Net using COCO-panoptic datasets, and we adopt the default settings for re-benchmarking.

.3 More Benchmark Results

$\bullet$ Results on Point Cloud Segmentation. In Tab. XVI, we list the transformer-based point cloud segmentation methods on the ScanNet and S3DIS datasets. OneFormer3D achieves the best results on three subtasks.

$\bullet$ Results on Open Vocabulary Semantic Segmentation. In Tab. XVII, we report methods on open vocabulary semantic segmentation in the self-evaluation setting. The self-evaluation setting splits the classes of a dataset into base and novel classes. It trains on the base classes, treating the novel classes as background, and uses both base and novel classes for testing. From that table, FreeSeg achieves the best results. In Tab. XVIII, we list several representative works under the cross-evaluation setting. Since different methods adopt different datasets and supervision types for pre-training and co-training, we also show the detailed extra data for reference. Among these methods, X-Decoder achieves the best results. We refer the reader to the work for a more comprehensive comparison.

$\bullet$ Results on Open Vocabulary Instance Segmentation. In Tab. XIX, we report open vocabulary instance segmentation on COCO datasets. Among these methods, CGG achieves the best performance.

$\bullet$ Results on Open Vocabulary Panoptic Segmentation. In Tab. XX, we report open vocabulary panoptic segmentation on COCO datasets. Among these methods, FreeSeg achieves the best performance.

$\bullet$ Results on Weakly-Supervised Semantic Segmentation. In Tab. XXI, we list the transformer-based weakly-supervised semantic segmentation methods. WeakTr achieves the best performance on the VOC and COCO datasets.

$\bullet$ Results on Unsupervised Semantic Segmentation. We also list several unsupervised semantic segmentation methods in Tab. XXII using ViT-S and ViT-B as backbones. CLAUSE achieves the best results on both the regular unsupervised setting and the linear probing setting.

$\bullet$ Improved Techniques in Re-benchmarking. In Tab. XIV, we explore various techniques used in re-benchmarking. Here, we use K-Net trained on the COCO-panoptic dataset for reference. Adding LSJ augmentation improves the performance by 0.5%. Adding deformable FPN leads to 1.0% gains. Adding both leads to better results. Finally, we follow the default Mask2Former design by extending the training schedule to 50 epochs for re-benchmarking, which leads to the best performance.

.4 Details of Benchmark Experiment

Implementation Details on Semantic Segmentation Benchmarks. We adopt the MMSegmentation codebase with the same setting to carry out the re-benchmarking experiments for semantic segmentation. In particular, we train the models with the AdamW optimizer for 160K iterations on ADE20K and Cityscapes, and for 80K iterations on COCO-Stuff. For Cityscapes, we adopt a random crop size of 1024. For the other datasets, we adopt a crop size of 512.

Implementation Details on Instance and Panoptic Segmentation Benchmarks. We adopt the MMDetection codebase with the same setting to carry out the re-benchmarking experiments for instance and panoptic segmentation. We strictly follow the Mask2Former settings for all models and all datasets. In particular, a learning rate multiplier of 0.1 is applied to the backbone. For data augmentation, we use the default large-scale jittering (LSJ) augmentation with a random scale sampled from the range 0.1 to 2.0 and a crop size of 1024 $\times$ 1024 on the COCO dataset. For the ADE20K dataset, the crop size is set to 640 and the number of training iterations is set to 160k. For Cityscapes, we set the crop size to $512\times 1024$ with 90k training iterations. We refer the readers to our code for more details.
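To give a feel for what the LSJ setting above does to an input image, the sketch below samples the geometric parameters of one augmentation: a random scale in [0.1, 2.0], an aspect-preserving resize relative to the 1024 target, and a random crop with padding up to the fixed output size. This is an illustrative approximation under stated assumptions, not the code used for the benchmarks; the function name `lsj_params`, the returned dictionary keys, and the exact resize rule are hypothetical.

```python
import random


def lsj_params(img_h: int, img_w: int, target: int = 1024,
               min_scale: float = 0.1, max_scale: float = 2.0,
               rng=random) -> dict:
    """Sample resize, crop, and pad sizes for one large-scale-jittering draw."""
    # 1) Sample the jitter scale.
    scale = rng.uniform(min_scale, max_scale)
    # 2) Resize so the image fits in a (target * scale) square, keeping aspect ratio.
    ratio = min(target * scale / img_h, target * scale / img_w)
    new_h = max(1, round(img_h * ratio))
    new_w = max(1, round(img_w * ratio))
    # 3) Crop a random window no larger than the target size.
    crop_h, crop_w = min(new_h, target), min(new_w, target)
    top = rng.randint(0, new_h - crop_h)
    left = rng.randint(0, new_w - crop_w)
    # 4) Pad the remainder so every sample ends up target x target.
    return {
        "scale": scale,
        "resized_h": new_h, "resized_w": new_w,
        "top": top, "left": left,
        "crop_h": crop_h, "crop_w": crop_w,
        "pad_h": target - crop_h, "pad_w": target - crop_w,
    }


params = lsj_params(480, 640, rng=random.Random(0))
print(params)
```

Small scales shrink the image and rely on padding, while large scales enlarge it and keep only a crop, which is what exposes the model to a wide range of object sizes.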

.5 More Future Directions

In this section, we present more discussion on potential future directions, including mobile segmentation, using synthetic datasets for joint training, efficient modeling in segmentation, domain generalization, and 4D point cloud segmentation.

$\bullet$ Mobile Segmentation. Several works design mobile segmentation methods using transformers. However, these designs mainly focus on the semantic level with pure image inputs. With the rise of short videos in mobile applications, instance-level mobile video segmentation deserves more research effort. In particular, efficiently segmenting and tracking each instance in a video clip on mobile devices remains an open problem with many potential applications.

$\bullet$ Using Synthetic Datasets for Joint Training. Segmentation models typically need large amounts of pixel-wise annotations. While methods like collecting image-text pairs and using open vocabulary approaches can reduce annotation costs, they still require significant manual effort in data collection. One alternative solution is to use a generated synthetic dataset. Recently, diffusion-based generation models have emerged as a promising option for high-quality image and mask generation. These models can create synthetic images and masks with small domain gaps to natural images, making them well suited for training segmentation models without requiring access to real data. Additionally, synthetic datasets can be tailored to specific application scenarios, such as few-shot segmentation tasks or long-tail segmentation tasks. Using synthetic datasets for joint training has the potential to significantly reduce annotation costs and accelerate the development of segmentation models for real-world applications.

$\bullet$ Weakly Supervised or Unsupervised Segmentation. Most segmentation approaches need lots of mask annotations for training, which leads to huge manual annotation costs. Thus, developing annotation-efficient segmentation algorithms is needed. Weakly supervised annotations, including image labels or boxes, can be used to replace fine-grained mask annotations. Recent work shows that the transformer itself can learn a good mask classifier with only box supervision. This finding makes training instance segmentation models easier. Moreover, with the recent progress of contrastive pre-training, exploring the vision transformer itself as a mask generator for unsupervised segmentation is also a promising direction.

$\bullet$ Domain Generalization. This task aims to adapt a segmenter from the seen domains to new unseen domains without re-training or accessing the unseen images during training. Previous works adopt specific designs, including data augmentation, feature distillation, and feature whitening. Only a few works explore the effectiveness of segmentation transformers combined with data augmentation for domain generalization. However, there are no works using stronger foundation models to build stronger baselines. Moreover, current works only consider similar domains in the driving scene, which does not generalize to real applications. A robot should adapt itself to various scene inputs, including outdoor scenes and indoor scenes. Thus, methods that analyze the generalization ability in various domains are needed in the future.

$\bullet$ 4D Point Cloud Panoptic Segmentation. This task requires the model to segment and track each point in the video. Current methods for 4D point cloud panoptic segmentation usually adopt specific pipelines, including point segmentation, point clustering, and point association. A unified solution using a transformer may facilitate this direction by simplifying the pipeline and performing task association.