OneFormer: One Transformer to Rule Universal Image Segmentation

Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi

Introduction

Image segmentation is the task of grouping pixels into multiple segments. Such grouping can be semantic-based (e.g., road, sky, building) or instance-based (objects with well-defined boundaries). Earlier segmentation approaches tackled these two segmentation tasks individually, with specialized architectures and therefore separate research effort into each. In a recent effort to unify semantic and instance segmentation, Kirillov et al. proposed panoptic segmentation, with pixels grouped into an amorphous segment for amorphous background regions (labeled “stuff”) and distinct segments for objects with well-defined shape (labeled “thing”). This effort, however, led to new specialized panoptic architectures instead of unifying the previous tasks (see Fig. 1a). More recently, the research trend shifted towards unifying image segmentation with new panoptic architectures, such as K-Net, MaskFormer, and Mask2Former. Such panoptic architectures can be trained on all three tasks and obtain high performance without changing architecture. They do, however, need to be trained individually on each task to achieve the best performance (see Fig. 1b). The individual training policy requires extra training time and produces different sets of model weights for each task. In that regard, they can only be considered a semi-universal approach. For example, Mask2Former is trained for 160k iterations on ADE20K for each of the semantic, instance, and panoptic segmentation tasks to obtain the best performance for each task, yielding a total of 480k training iterations and three models to store and host for inference.

In an effort to truly unify image segmentation, we propose a multi-task universal image segmentation framework (OneFormer), which outperforms existing state-of-the-art methods on all three image segmentation tasks (see Fig. 1c) by training only once on one panoptic dataset. Through this work, we aim to answer the following questions:

(i) Why are existing panoptic architectures not successful with a single training process or model to tackle all three tasks? We hypothesize that existing methods need to train individually on each segmentation task due to the absence of task guidance in their architectures, making it challenging to learn the inter-task domain differences when trained jointly or with a single model. To tackle this challenge, we introduce a task input token in the form of text: “the task is {task}”, to condition the model on the task in focus, making our architecture task-guided for training, and task-dynamic for inference, all with a single model. We uniformly sample {task} from {panoptic, instance, semantic} and the corresponding ground truth during our joint training process to ensure our model is unbiased in terms of tasks. Motivated by the ability of panoptic data to capture both semantic and instance information, we derive the semantic and instance labels from the corresponding panoptic annotations during training. Consequently, we only need panoptic data during training. Moreover, our joint training time, model parameters, and FLOPs are comparable to the existing methods, decreasing training time and storage requirements up to 3×, making image segmentation less resource intensive and more accessible.

(ii) How can the multi-task model better learn inter-task and inter-class differences during the single joint training process? Following the recent success of transformer frameworks in computer vision, we formulate our framework as a transformer-based approach, which can be guided through the use of query tokens. To add task-specific context to our model, we initialize our queries as repetitions of the task token (obtained from the task input) and compute a query-text contrastive loss with the text derived from the corresponding ground-truth label for the sampled task as shown in Fig. 2. We hypothesize that a contrastive loss on the queries helps guide the model to be more task-sensitive. Furthermore, it also helps reduce the category mispredictions to a certain extent.

We evaluate OneFormer on three major segmentation datasets: ADE20K, Cityscapes, and COCO, each with all three (semantic, instance, and panoptic) segmentation tasks. OneFormer sets a new state of the art on all three tasks with a single jointly trained model. To summarize, our main contributions are:

We propose OneFormer, the first multi-task universal image segmentation framework based on transformers that needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing frameworks across semantic, instance, and panoptic segmentation tasks, even though the latter need to be trained separately on each task and use several times the resources.

OneFormer uses a task-conditioned joint training strategy, uniformly sampling different ground truth domains (semantic, instance, or panoptic) by deriving all labels from panoptic annotations to train its multi-task model. Thus, OneFormer actually achieves the original unification goal of panoptic segmentation.

We validate OneFormer through extensive experiments on three major benchmarks: ADE20K, Cityscapes, and COCO. OneFormer sets a new state-of-the-art performance on all three segmentation tasks compared with methods using the standard Swin-L backbone, and improves even more with new ConvNeXt and DiNAT backbones.

Related Work

Image segmentation is one of the most fundamental tasks in image processing and computer vision. Traditional works usually tackle one of the three image segmentation tasks with specialized network architectures (Fig. 1a).

Semantic Segmentation. Semantic segmentation was long tackled as a pixel classification problem with CNNs. More recent works have shown the success of transformer-based methods in semantic segmentation, following their success in language and vision. Among them, MaskFormer treated semantic segmentation as a mask classification problem, following early works, by using a transformer decoder with object queries. We also formulate semantic segmentation as a mask classification problem.

Instance Segmentation. Traditional instance segmentation methods are also formulated as mask classifiers, which predict binary masks and a class label for each mask. We also formulate instance segmentation as a mask classification problem.

Panoptic Segmentation. Panoptic segmentation was proposed to unify instance and semantic segmentation. One of the earliest architectures in this scope was Panoptic-FPN, which introduced separate instance and semantic task branches. Works that followed significantly improved performance with transformer-based architectures. Despite the progress made so far, panoptic segmentation models are still behind in performance compared to individual instance and semantic segmentation models, therefore not living up to their full unification potential. Motivated by this, we design our OneFormer to be trained with panoptic annotations only.

2.2 Universal Image Segmentation

The concept of universal image segmentation has existed for some time, starting with image and scene parsing, followed by panoptic segmentation as an effort to unify semantic and instance segmentation. More recently, promising architectures designed specifically for panoptic segmentation have emerged which also perform well on semantic and instance segmentation tasks. K-Net, a CNN, uses dynamic learnable instance and semantic kernels with bipartite matching. MaskFormer is a transformer-based architecture, serving as a mask classifier. It was inspired by DETR’s reformulation of object detection in the scope of transformers, where the image is fed to the encoder, and the decoder produces proposals based on queries. Mask2Former improved upon MaskFormer with learnable queries, deformable multi-scale attention in the decoder, and a masked cross-attention, setting a new state of the art on all three tasks. Unfortunately, it requires training the model individually on each task to achieve the best performance. Therefore, there remains a gap in truly unifying the three segmentation tasks. To the best of our knowledge, OneFormer is the first framework to beat the state of the art on all three image segmentation tasks with a single universal model.

2.3 Transformer-based Architectures

Architectures based on the transformer encoder-decoder structure have proved effective in object detection since the introduction of DETR. Mask2Former demonstrated the effectiveness of such architectures for image segmentation with a mask classification formulation. Inspired by this success, we also formulate our framework as a query-based mask classification task. Additionally, we claim that calculating a query-text contrastive loss on the task-guided queries can help the model learn inter-task differences and reduce the category mispredictions in the model outputs. Concurrent to our work, LMSeg uses text derived from multiple datasets’ taxonomy to calculate a query-text contrastive loss and tackle the multi-dataset segmentation training challenge. Unlike LMSeg, our work focuses on multiple tasks and uses the classes present in the training sample’s ground-truth label to calculate the query-text contrastive loss.

Method

In this section, we introduce OneFormer, a universal image segmentation framework that is jointly trained on panoptic, semantic, and instance segmentation and outperforms individually trained models. We provide an overview of OneFormer in Fig. 2. OneFormer uses two inputs: a sample image and a task input of the form “the task is {task}”. During our single joint training process, the task is uniformly sampled from {panoptic, instance, semantic} for each image. First, we extract multi-scale features from the input image using a backbone and a pixel decoder. We tokenize the task input to obtain a 1-D task token used to condition the object queries and, consequently, our model on the task for each input. Additionally, we create a text list representing the number of binary masks for each class present in the GT label and map it to text query representations. Note that the text list depends on the input image and the {task}. For supervision of the model’s task-dynamic predictions, we derive the corresponding ground truths from panoptic annotations. As the ground truth is task-dependent, we calculate a query-text contrastive loss between the object and text queries to ensure there is task distinction in the object queries. The object queries and multi-scale features are fed into a transformer decoder to produce final predictions. We provide more details in the following sections.

Existing semi-universal architectures for image segmentation face a significant drop in performance when jointly trained on all three segmentation tasks (Tab. 7). We attribute their failure to tackle the multi-task challenge to the absence of task-conditioning in their architecture.

We tackle the multi-task train-once challenge for image segmentation using a task-conditioned joint training strategy. Particularly, we first uniformly sample the task from {panoptic, semantic, instance} for the GT label. We realize the unification potential of panoptic annotations by deriving the task-specific labels from the panoptic annotations, thus, using only one set of annotations.
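The sampling and label derivation described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the segment layout (a dict with `category`, `isthing`, and `pixels` keys) and the function names are hypothetical, and masks are represented as plain sets of pixel indices.

```python
import random

TASKS = ("panoptic", "instance", "semantic")


def sample_task(rng):
    """Uniformly sample the task for one training image."""
    return rng.choice(TASKS)


def derive_masks(segments, task):
    """Derive (class_name, pixel_set) masks for `task` from panoptic segments.

    Each segment is a dict with keys "category", "isthing", and "pixels"
    (a set of pixel indices). This layout is an assumption for illustration.
    """
    if task == "instance":
        # One mask per thing instance; stuff regions are ignored.
        return [(s["category"], set(s["pixels"])) for s in segments if s["isthing"]]
    if task == "semantic":
        # One merged (amorphous) mask per class, for thing and stuff alike.
        merged = {}
        for s in segments:
            merged.setdefault(s["category"], set()).update(s["pixels"])
        return list(merged.items())
    # Panoptic: one mask per thing instance, one merged mask per stuff class.
    things, stuff = [], {}
    for s in segments:
        if s["isthing"]:
            things.append((s["category"], set(s["pixels"])))
        else:
            stuff.setdefault(s["category"], set()).update(s["pixels"])
    return things + list(stuff.items())
```

With two car instances and one road region, the semantic task produces two masks, the instance task two, and the panoptic task three, all from the same panoptic annotation.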

Next, we extract a set of binary masks for each category present in the image from the task-specific GT label: the semantic task yields exactly one amorphous binary mask for each class present in the image, whereas the instance task yields non-overlapping binary masks for thing classes only, ignoring the stuff regions. The panoptic task yields a single amorphous mask for each stuff class and non-overlapping masks for thing classes, as shown in Fig. 3. Subsequently, we iterate over the set of masks to create a list of text ( $\mathbf{T}_{\text{list}}$ ) with a template “a photo with a {CLS}”, where CLS is the class name for the corresponding binary mask. The number of binary masks per sample varies over the dataset. Therefore, we pad $\mathbf{T}_{\text{list}}$ with “a/an {task} photo” entries to obtain a padded list ( $\mathbf{T}_{\text{pad}}$ ) of constant length $N_{\text{text}}$ , with padded entries representing no-object masks. We later use $\mathbf{T}_{\text{pad}}$ for computing a query-text contrastive loss (Sec. 3.3).
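As a hedged illustration of the text-list construction, the sketch below builds the padded list from the class names of the extracted masks. The function name `build_padded_text_list` is hypothetical, and raising an error when there are more masks than $N_{\text{text}}$ is an assumption, since the text does not say how that case is handled.

```python
def build_padded_text_list(class_names, task, n_text):
    """Build T_pad: one entry per binary mask, padded to length n_text.

    `class_names` holds one class name per binary mask, so a class with
    several instances appears several times.
    """
    t_list = [f"a photo with a {cls}" for cls in class_names]
    if len(t_list) > n_text:
        # Assumption: the overflow policy is not specified in the text.
        raise ValueError("more masks than text entries")
    article = "an" if task[0] in "aeiou" else "a"
    no_object = f"{article} {task} photo"  # stands for a no-object mask
    return t_list + [no_object] * (n_text - len(t_list))
```

For example, `build_padded_text_list(["car", "car", "road"], "panoptic", 5)` yields three class entries followed by two “a panoptic photo” entries.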

We condition our architecture on the task using a task input ( $\mathbf{I}_{\text{task}}$ ) with the template “the task is {task}”, which is tokenized and mapped to a task-token ( $\mathbf{Q}_{\text{task}}$ ). We use $\mathbf{Q}_{\text{task}}$ to condition OneFormer on the task (Sec. 3.2).

3.2 Query Representations

During training, we use two sets of queries in our architecture: text queries ( $\mathbf{Q}_{\text{text}}$ ) and object queries ( $\mathbf{Q}$ ). $\mathbf{Q}_{\text{text}}$ is the text-based representation for the segments in the image, while $\mathbf{Q}$ is the image-based representation.

To obtain $\mathbf{Q}_{\text{text}}$ , we first tokenize the text entries $\mathbf{T}_{\text{pad}}$ and pass the tokenized representations through a text encoder, which is a 6-layer transformer. The encoded $N_{\text{text}}$ text embeddings represent the number of binary masks and their corresponding classes in the input image. We further concatenate a set of $N_{\text{ctx}}$ learnable text context embeddings ( $\mathbf{Q}_{\text{ctx}}$ ) to the encoded text embeddings to obtain the final $N$ text queries ( $\mathbf{Q}_{\text{text}}$ ), as shown in Fig. 4. Our motivation behind using $\mathbf{Q}_{\text{ctx}}$ is to learn a unified textual context for a sample image. We only use the text queries during training; therefore, we can drop the text mapper module during inference to reduce the model size.

To obtain $\mathbf{Q}$ , we first initialize the object queries ( $\mathbf{Q^{\prime}}$ ) as $N-1$ repetitions of the task-token ( $\mathbf{Q}_{\text{task}}$ ). Then, we update $\mathbf{Q^{\prime}}$ with guidance from the flattened $1/4$ -scale features inside a 2-layer transformer. The updated $\mathbf{Q^{\prime}}$ from the transformer (rich with image-contextual information) is concatenated with $\mathbf{Q}_{\text{task}}$ to obtain a task-conditioned representation of $N$ queries, $\mathbf{Q}$ . Unlike the vanilla all-zeros or random initialization, the task-guided initialization of the queries and the concatenation with $\mathbf{Q}_{\text{task}}$ are critical for the model to learn multiple segmentation tasks (Sec. 4.3).
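The shape bookkeeping of this initialization can be sketched as follows. This is only an illustration with plain Python lists: `update_fn` is a hypothetical stand-in for the 2-layer transformer that attends to the $1/4$ -scale features, and appending the task token at the end (rather than the start) is an assumption.

```python
def init_object_queries(q_task, n, update_fn=None):
    """Return N task-conditioned queries from a single task token.

    q_task: list of floats (the task token).
    n: total number of object queries N.
    update_fn: placeholder for the 2-layer transformer update (assumption);
        it maps a list of N-1 queries to a list of N-1 queries.
    """
    # Q': N-1 repetitions of the task token.
    q_prime = [list(q_task) for _ in range(n - 1)]
    if update_fn is not None:
        q_prime = update_fn(q_prime)
    # Concatenate the (unmodified) task token to reach N queries.
    return q_prime + [list(q_task)]
```

With `n = 4` and a 2-dimensional task token, the result holds four 2-dimensional queries, the last of which is the task token itself.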

3.3 Task Guided Contrastive Queries

Developing a single model for all three segmentation tasks is challenging due to the inherent differences among the three tasks. The meaning of the object queries, $\mathbf{Q}$ , is task-dependent. Should the queries focus only on the thing classes (instance segmentation), or should the queries predict only one amorphous object for each class present in the image (semantic segmentation) or a mix of both (panoptic segmentation)? Existing query-based architectures do not take such differences into account and hence, fail at effectively training a single model on all three tasks.

To this end, we propose to calculate a query-text contrastive loss using $\mathbf{Q}$ and $\mathbf{Q}_{\text{text}}$ . We use $\mathbf{T}_{\text{pad}}$ to obtain the text queries representation, $\mathbf{Q}_{\text{text}}$ , where $\mathbf{T}_{\text{pad}}$ is a list of textual representations for each mask-to-be-detected in a given image with “a/an {task} photo” representing the no-object detections in $\mathbf{Q}$ . Thus, the text queries align with the purpose of object queries, representing the objects/segments present in an image. Therefore, we can successfully learn the inter-task distinctions in the query representations using a contrastive loss between the ground truth-derived text and object queries. Moreover, contrastive learning on the queries enables us to attend to inter-class differences and reduce category misclassifications.

Considering that we have a batch of $B$ object-text query pairs $\{(q_{i}^{obj},q_{i}^{txt})\}_{i=1}^{B}$ , where $q_{i}^{obj}$ and $q_{i}^{txt}$ are the corresponding object and text queries, respectively, of the $i$ -th pair, we measure the similarity between the queries by calculating a dot product. The total contrastive loss is composed of two losses: (i) an object-to-text contrastive loss ( $\mathcal{L}_{\mathbf{Q}\rightarrow{\mathbf{Q}_{\text{text}}}}$ ) and (ii) a text-to-object contrastive loss ( $\mathcal{L}_{{\mathbf{Q}_{\text{text}}}\rightarrow{\mathbf{Q}}}$ ) as shown in Eq. 1. $\tau$ is a learnable temperature parameter to scale the contrastive logits.
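Eq. 1 does not appear in this text. A form consistent with the description above (dot-product similarity, temperature $\tau$ , and two directional terms) would be the following, where the $1/B$ averaging and the unweighted sum of the two terms are assumptions:

$$\mathcal{L}_{\mathbf{Q}\rightarrow\mathbf{Q}_{\text{text}}}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(q_{i}^{obj}\cdot q_{i}^{txt}/\tau)}{\sum_{j=1}^{B}\exp(q_{i}^{obj}\cdot q_{j}^{txt}/\tau)}$$

$$\mathcal{L}_{\mathbf{Q}_{\text{text}}\rightarrow\mathbf{Q}}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(q_{i}^{txt}\cdot q_{i}^{obj}/\tau)}{\sum_{j=1}^{B}\exp(q_{i}^{txt}\cdot q_{j}^{obj}/\tau)}$$

$$\mathcal{L}_{\mathbf{Q}\leftrightarrow\mathbf{Q}_{\text{text}}}=\mathcal{L}_{\mathbf{Q}\rightarrow\mathbf{Q}_{\text{text}}}+\mathcal{L}_{\mathbf{Q}_{\text{text}}\rightarrow\mathbf{Q}}$$

A minimal pure-Python sketch of this assumed form is given below. The function names are hypothetical, $\tau$ is passed as a fixed number rather than learned, and no embedding normalization is applied because the text does not mention one.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directional_loss(anchors, targets, tau):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    batch = len(anchors)
    total = 0.0
    for i in range(batch):
        logits = [_dot(anchors[i], targets[j]) / tau for j in range(batch)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - logits[i]
    return total / batch


def query_text_contrastive_loss(q_obj, q_txt, tau):
    """Sum of the object-to-text and text-to-object terms."""
    return directional_loss(q_obj, q_txt, tau) + directional_loss(q_txt, q_obj, tau)
```

When each object query is most similar to its own text query the loss is small; permuting the text queries so that pairs no longer match increases it.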

3.4 Other Architecture Components

Backbone and Pixel Decoder: We use widely adopted ImageNet pre-trained backbones to extract multi-scale feature representations from the input image. Our pixel decoder aids the feature modeling by gradually upsampling the backbone features. Motivated by the recent success of multi-scale deformable attention, we use the same Multi-Scale Deformable Transformer (MSDeformAttn) based architecture for our pixel decoder.

Transformer Decoder: We use a multi-scale strategy to utilize the higher resolution maps inside our transformer decoder. Specifically, we feed the object queries ( $\mathbf{Q}$ ) and the multi-scale outputs from the pixel decoder ( $F_{i}$ ), $i\in\{1/4,1/8,1/16,1/32\}$ as inputs. We use the features with resolution $1/8$ , $1/16$ and $1/32$ of the original image alternately to update $\mathbf{Q}$ using a masked cross-attention (CA) operation, followed by a self-attention (SA) and finally a feed-forward network (FFN). We perform these sets of alternating operations $L$ times inside the transformer decoder.
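The alternation over scales can be made concrete with a small schedule generator. This sketch only enumerates which feature scale each decoder stage would attend to; the function name is hypothetical, and the order within a round (here $1/8$ , $1/16$ , $1/32$ , as listed in the text) is an assumption.

```python
def decoder_schedule(num_rounds, scales=("1/8", "1/16", "1/32")):
    """List the feature scale used by each decoder stage.

    Each stage applies masked cross-attention to one scale, then
    self-attention, then a feed-forward network. The scales are cycled
    num_rounds (L) times, giving len(scales) * L stages in total.
    """
    return [scale for _ in range(num_rounds) for scale in scales]
```

With `num_rounds = 3` this produces nine stages, three per feature scale.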

The final query outputs from the transformer decoder are mapped to a $K+1$ dimensional space for class predictions, where $K$ denotes the number of classes and an extra $+1$ for the no-object predictions. To obtain the final masks, we decode the pixel features ( $F_{1/4}$ ) at $1/4$ resolution of the original image with the help of an einsum operation between $\mathbf{Q}$ and $F_{1/4}$ . During inference, we follow the same post-processing technique as prior work to obtain the final panoptic, semantic, and instance segmentation predictions. We only keep predictions with scores above a threshold of 0.5, 0.8, and 0.8 during post-processing for panoptic segmentation on the ADE20K, Cityscapes and COCO datasets, respectively.
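The einsum between the queries and the $1/4$ -scale pixel features amounts to a per-pixel dot product over the channel dimension. The sketch below writes that contraction out with nested Python lists so the index pattern is explicit; it assumes the queries have already been projected to the same channel width as the features, and the function name is hypothetical.

```python
def decode_masks(queries, features):
    """Contract queries (N x C) with features (C x H x W) into N x H x W logits.

    Equivalent to the einsum pattern "nc,chw->nhw".
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    return [
        [
            [sum(q[c] * features[c][y][x] for c in range(channels)) for x in range(width)]
            for y in range(height)
        ]
        for q in queries
    ]
```

Each query thus yields one mask-logit map at $1/4$ resolution.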

3.5 Losses

In addition to the contrastive loss on the queries, we calculate the standard classification CE-loss ( $\mathcal{L}_{\text{cls}}$ ) over the class predictions. Following prior work, we use a combination of binary cross-entropy ( $\mathcal{L}_{\text{bce}}$ ) and dice loss ( $\mathcal{L}_{\text{dice}}$ ) over the mask predictions. Therefore, our final loss function is a weighted sum of the four losses (Eq. 2). We empirically set $\lambda_{\mathbf{Q}\leftrightarrow{\mathbf{Q}_{\text{text}}}}=0.5$ , $\lambda_{\text{cls}}=2$ , $\lambda_{\text{bce}}=5$ and $\lambda_{\text{dice}}=5$ . To find the least cost assignment, we use bipartite matching between the set predictions and the ground truths. We set $\lambda_{\text{cls}}$ as $0.1$ for the no-object predictions.
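Eq. 2 does not appear in this text. Written out from the four terms and weights named above, the weighted sum would read as follows, where the symbol $\mathcal{L}_{\text{final}}$ is an assumed label for the total:

$$\mathcal{L}_{\text{final}}=\lambda_{\mathbf{Q}\leftrightarrow\mathbf{Q}_{\text{text}}}\mathcal{L}_{\mathbf{Q}\leftrightarrow\mathbf{Q}_{\text{text}}}+\lambda_{\text{cls}}\mathcal{L}_{\text{cls}}+\lambda_{\text{bce}}\mathcal{L}_{\text{bce}}+\lambda_{\text{dice}}\mathcal{L}_{\text{dice}}$$

With the stated weights this becomes $0.5\,\mathcal{L}_{\mathbf{Q}\leftrightarrow\mathbf{Q}_{\text{text}}}+2\,\mathcal{L}_{\text{cls}}+5\,\mathcal{L}_{\text{bce}}+5\,\mathcal{L}_{\text{dice}}$ .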

Experiments

We illustrate that OneFormer, when trained only once with our task-conditioned joint-training strategy, generalizes well to all three image segmentation tasks on three widely used datasets. Furthermore, we provide extensive ablations to demonstrate the significance of OneFormer’s components. Due to space constraints, we provide implementation details in the appendix.

Datasets. We experiment on three widely used datasets that support all three segmentation tasks: semantic, instance, and panoptic. Cityscapes consists of a total of 19 (11 “stuff” and 8 “thing”) classes with 2,975 training, 500 validation and 1,525 test images. ADE20K is another benchmark dataset with 150 (50 “stuff” and 100 “thing”) classes among the 20,210 training and 2,000 validation images. COCO has 133 (53 “stuff” and 80 “thing”) classes with 118k training and 5,000 validation images.

Evaluation Metrics. For all three image segmentation tasks, we report the PQ, AP, and mIoU scores. Since we only have a single model for all three tasks, we use the value of the task token to decide the scores to consider. For example, when the task is panoptic, we report the PQ score; similarly, we report the AP and mIoU scores when the task is instance and semantic, respectively.

4.2 Main Results

ADE20K. We compare OneFormer with the existing state-of-the-art pseudo-universal and specialized architectures on the ADE20K val dataset in Tab. 1. With the standard Swin-L† backbone, OneFormer, while being trained only once, outperforms Mask2Former’s individually trained models on all three image segmentation tasks and sets a new state-of-the-art performance when compared with other methods using the same backbone.

Cityscapes. We compare OneFormer with the existing state-of-the-art pseudo-universal and specialized architectures on the Cityscapes val dataset in Tab. 2. With the Swin-L† backbone, OneFormer outperforms Mask2Former with a $+\mathbf{0.6}\%$ and $+\mathbf{1.9}\%$ improvement on the PQ and AP metrics, respectively. Additionally, with the ConvNeXt-L† and ConvNeXt-XL† backbones, OneFormer sets a new state of the art of $68.5\%$ PQ and $46.7\%$ AP, respectively.

COCO. We compare OneFormer with the existing state-of-the-art pseudo-universal and specialized architectures on the COCO val2017 dataset in Tab. 3. With the Swin-L† backbone, OneFormer performs on par with the individually trained Mask2Former with a $+0.1\%$ improvement in the PQ score. Due to the discrepancies between the panoptic and instance annotations in COCO, we evaluate the AP score using the instance ground truths derived from the panoptic annotations. We provide more information in the appendix. Following prior work, we evaluate mIoU on semantic ground truths derived from panoptic annotations.

4.3 Ablation Studies

We analyze OneFormer’s components through a series of ablation studies. Unless stated otherwise, we ablate with Swin-L† OneFormer on the Cityscapes dataset.

Task-Conditioned Architecture. We validate the importance of the task token ( $\mathbf{Q}_{\text{task}}$ ), initializing the queries with repetitions of the task token (task-guided query init.) and the learnable text context ( $\mathbf{Q}_{\text{ctx}}$ ) by removing each component one at a time in Tab. 4. Without the task token, we observe a significant drop in the AP score ( $-2.7\%$ ). Furthermore, using a learnable text context ( $\mathbf{Q}_{\text{ctx}}$ ) leads to an improvement of $+4.5\%$ in the PQ score, proving its significance. Lastly, initializing the queries as repetitions of the task token (task-guided query init.) instead of using an all-zeros initialization leads to an improvement of $+1.4\%$ in the PQ and $+1.1\%$ in the AP score, indicating the importance of task-conditioning the initialization of the queries.

Contrastive Query Loss. We report results without the query-text contrastive loss ( $\mathcal{L}_{\mathbf{Q}\leftrightarrow{\mathbf{Q}_{\text{text}}}}$ ) in Tab. 5. We observe that the contrastive loss significantly benefits the PQ ( $+8.4\%$ ) and AP ( $+3.2\%$ ) scores. We also conduct experiments substituting our query-text contrastive loss with a classification loss ( $\mathcal{L}_{\text{cls}}$ ) on the queries. $\mathcal{L}_{\text{cls}}$ can be regarded as a straightforward alternative for $\mathcal{L}_{\mathbf{Q}\leftrightarrow{\mathbf{Q}_{\text{text}}}}$ as both provide supervision for the number of masks for each class present in the image. However, we observe significant drops on all the metrics ( $-0.8\%$ PQ, $-0.9\%$ AP and $-0.4\%$ mIoU) using the classification loss instead of the contrastive loss. We attribute the drops to the inability of the classification loss to capture the inter-task differences effectively.

Input Text Template. We study the importance of the template choice for the entries in the text list ( $\mathbf{T}_{\text{list}}$ ) in Tab. 6. We experiment with “a photo with a {CLS} {TYPE}” template for our text entries where CLS is the class name for the object mask and TYPE is the task-dependent class-type: “stuff” for amorphous masks (panoptic and semantic task) and “thing” for all distinct object masks. We also experiment with the identity template “{CLS}”. Our choice of the template: “a photo with a {CLS}” gives a strong performance as a baseline. We believe more exploration in the text template space could help in improving the performance further.

Task Conditioned Joint Training. As a baseline for comparison, we train a Swin-L† Mask2Former-Joint with our joint training strategy, i.e., uniformly sampling each task’s GT on the ADE20K dataset. We compare the Mask2Former-Joint baseline with our Swin-L† OneFormer in Tab. 7. We train both models for 160k iterations with a batch size of 16. Our OneFormer achieves a $+1.1\%$ , $+2.2\%$ , and $+0.8\%$ improvement on the PQ, AP and mIoU metrics, respectively, proving the importance of our architecture design for practical multi-task joint training.

Task Token Input. We demonstrate that our framework is sensitive to the task token input by setting the value of {task} during inference as panoptic, instance, or semantic in Tab. 8. We report results with our Swin-L† OneFormer trained on the ADE20K dataset. We observe a significant drop in the PQ and mIoU metrics when the task is instance compared to panoptic. Moreover, the PQ ${}^{\text{St}}$ drops to $1.5\%$ , and there is only a $-0.8\%$ drop on the PQ ${}^{\text{Th}}$ metric, proving that the network learns to focus mainly on the distinct “thing” instances when the task is instance. Similarly, there is a sizable drop in the PQ, PQ ${}^{\text{Th}}$ and AP metrics for the semantic task with PQ ${}^{\text{St}}$ staying the same, showing that our framework can segment out amorphous masks for “stuff” regions but does not predict different masks for “thing” objects. Therefore, OneFormer dynamically learns the inter-task distinctions, which is critical for a train-once multi-task architecture. We include qualitative analysis on the task-dynamic nature of OneFormer in the appendix.

Reduced Category Misclassifications. Our query-text contrastive loss helps OneFormer learn the inter-task distinctions and reduce the number of category misclassifications in the predictions. Mask2Former incorrectly predicts “wall” as “fence” in the first row, “vegetation” as “terrain”, and “terrain” as “sidewalk”. At the same time, our OneFormer produces more accurate predictions in regions (inside blue boxes) with similar classes, as shown in Fig. 5.

Conclusion

In this work, we present OneFormer, a new multi-task universal image segmentation framework with transformers and task-guided queries to unify semantic, instance, and panoptic segmentation with a single universal architecture, a single model, and training on a single dataset. Our jointly trained single OneFormer model outperforms the individually trained specialized Mask2Former models, the previous single-architecture state of the art, on all three segmentation tasks across major datasets. Consequently, OneFormer can cut training time, weight storage, and inference hosting requirements down to a third, making image segmentation more accessible. We believe OneFormer is a significant step towards making image segmentation more universal and accessible, and we will support further research in this direction by open-sourcing our code and models.

We thank Intelligence Advanced Research Projects Activity (IARPA), University of Oregon, University of Illinois at Urbana-Champaign, and Picsart AI Research (PAIR) for their generous support that made this work possible.

Appendix A Implementation Details

We implement our framework using the Detectron2 library.

Multi-Scale Feature Modeling. We adopt the settings from prior work for modeling the image pixel-level features. More specifically, we use 6 MSDeformAttn layers inside our pixel decoder, applied to feature maps with resolutions $1/8$ , $1/16$ , and $1/32$ of the original image. We use lateral connections and upsampling to aggregate the multi-scale features to a final $1/4$ resolution scale. We map all the features to a hidden dimension of $256$ .

Unified Task-Conditioned Query Formulation. We initialize the $N-1$ queries as repetitions of the task-token, $\mathbf{Q}_{\text{task}}$ . Unless stated otherwise, we set $N=250$ and $N_{\text{ctx}}=16$ . Our text tokenizer and text encoder follow prior work. We use a single linear layer to project the tokenized task input, followed by a layer-norm to obtain $\mathbf{Q}_{\text{task}}$ .

Task-Dynamic Mask and Class Prediction Formation. Following prior work, we set $L=3$ inside the transformer decoder. Therefore, we have a total of $3L$ (9) stages inside our transformer decoder. We also calculate an auxiliary loss on the intermediate class and mask predictions after every transformer decoder stage.

Training Settings. We train our model with a batch size of 16. When training on ADE20K and Cityscapes, we use the AdamW optimizer with a base learning rate of $0.0001$ , poly learning rate decay and weight decay $0.1$ . We use a crop size of $512\!\times\!512$ and $512\!\times\!1024$ on ADE20K and Cityscapes, respectively. We train for 90k and 160k iterations on Cityscapes and ADE20K, respectively. For data augmentation, we use shortest edge resizing, fixed size cropping, and color jittering followed by a random horizontal flip.
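For reference, a poly learning rate decay is commonly written as the base rate scaled by $(1 - t/T)^{p}$ , with $t$ the current iteration and $T$ the total. The sketch below follows that common form; the exponent `power=0.9` is an assumed, frequently used value and is not stated in the text, and the function name is hypothetical.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning-rate decay from base_lr at step 0 to 0 at max_steps.

    `power` is an assumed value; the text only names the schedule type.
    """
    return base_lr * (1.0 - step / max_steps) ** power
```

For the ADE20K setting, `poly_lr(0.0001, step, 160_000)` starts at the base rate and decays smoothly to zero at the final iteration.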

When training on COCO, we use a step learning rate schedule along with the AdamW optimizer, a base learning rate of $0.0001$ , $10$ warmup iterations, and a weight decay of $0.05$ . We decay the learning rate at $0.9$ and $0.95$ fractions of the total number of training steps by a factor of $10$ . We train for a total of 100 epochs with LSJ augmentation with a random scale sampled from the range $0.1$ to $2.0$ followed by a fixed size crop to $1024\!\times\!1024$ resolution.
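The step schedule described above can be sketched as a small function of the training progress. This is an illustrative version only: the function name is hypothetical and the $10$ warmup iterations are ignored for brevity.

```python
def step_lr(base_lr, step, total_steps, milestones=(0.9, 0.95), factor=10.0):
    """Step decay: divide the rate by `factor` at each milestone fraction.

    Warmup is omitted here; the milestones are fractions of total_steps.
    """
    lr = base_lr
    for fraction in milestones:
        if step >= fraction * total_steps:
            lr /= factor
    return lr
```

With a base rate of $0.0001$ , the rate stays at $0.0001$ for the first 90% of training, drops to $0.00001$ , and drops again to $0.000001$ for the final 5%.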

Evaluation Settings. We follow the same evaluation settings as Mask2Former. Unless stated otherwise, we report results for the single-scale inference setting. Unlike the training stage, during evaluation, we use the ground-truth annotations from the respective task GT labels to calculate the metric scores instead of deriving the labels from the panoptic annotations. Additionally, we set the value of task in “the task is {task}” as panoptic, instance and semantic to obtain the corresponding task predictions.

Appendix B Additional Ablations

Ablation on Number of Queries. We study the effect of the number of queries on the COCO dataset in Tab. I. We conduct experiments using the ResNet-50 (R50) backbone and train for 50 epochs. We find that $N=150$ performs the best.

Additionally, we tune the number of queries on the Swin-L† backbone separately. During our experiments, we found that $N=250$ is the best setting with Swin-L† on the ADE20K and Cityscapes datasets. On COCO, $N=150$ gives the best performance with Swin-L†. We also noticed that with smaller backbones like R50, $N=150$ is the optimal setting on the ADE20K dataset.

Ablation on Contrastive Loss’ Weight. We run ablations on the weight for the contrastive loss on the COCO dataset in Tab. III. We conduct our experiments using the ResNet-50 (R50) backbone and train for 50 epochs. We find that $\lambda_{\mathbf{Q}\leftrightarrow{\mathbf{Q}_{\text{text}}}}=0.5$ is the optimal weight setting.

Ablation on Number of Learnable Text Context Embeddings. We study the effect of different numbers of learnable text context embeddings on the ADE20K dataset in Tab. II. We conduct our experiments using the ResNet-50 (R50) backbone and train for 160k iterations. We find that $N_{\text{ctx}}=16$ performs best.

Appendix C Individual Training

In this section, we analyze OneFormer’s performance with individual training on the panoptic, instance, and semantic segmentation tasks. For this study, we conduct experiments with the ResNet-50 (R50) backbone on the ADE20K dataset. We train all models for 160k iterations with a batch size of 16.

As shown in Tab. IV, OneFormer outperforms Mask2Former (the previous SOTA pseudo-universal image segmentation method) with every training strategy. Furthermore, with joint training, Mask2Former suffers a significant drop in performance, and OneFormer achieves the highest PQ, AP and mIoU scores.

In order to train OneFormer on a single task, we set the value of task as that of the corresponding task in our task token input: “the task is {task}” for the samples during training. Therefore, under Panoptic Training, only panoptic ground truth labels will be used, and similarly, for Semantic and Instance Training, only semantic and instance ground truth labels shall be used, respectively. The joint training strategy remains the same as described in Sec 3.1 (main text) with uniform sampling for each task-specific ground truth label. Note that for training OneFormer, we derive all ground truth labels from the panoptic annotations.
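The difference between the individual and joint training strategies can be sketched as follows. This is an illustrative sketch rather than our implementation: the helper names `task_prompt` and `sample_training_task` and the `strategy` argument are hypothetical, and only the prompt template and the uniform sampling over tasks are taken from the text.

```python
import random

TASKS = ("panoptic", "instance", "semantic")


def task_prompt(task):
    """Build the task token input text for a given task."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"the task is {task}"


def sample_training_task(strategy, rng):
    """Pick the task for one training sample.

    strategy is "joint" (uniform sampling over the three tasks) or the
    name of a single task (individual training on that task only).
    """
    if strategy == "joint":
        return rng.choice(TASKS)
    if strategy in TASKS:
        return strategy
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    rng = random.Random(0)
    for strategy in ("joint", "semantic"):
        task = sample_training_task(strategy, rng)
        print(strategy, "->", task_prompt(task))
```

The sampled task would then determine both the task token input and which ground-truth labels are derived from the panoptic annotation for that sample.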

Appendix D Analysis on the Task-Dynamic Nature of OneFormer

We analyze OneFormer’s ability to capture the inter-task differences by changing the value of {task} in the task token input: “the task is {task}” to panoptic, instance, or semantic, during inference. We report quantitative results with our Swin-L† OneFormer trained on the Cityscapes dataset in Tab. V. When we set task as “instance”, we observe that PQ${}^{\text{St}}$ drops to $0.0\%$, and there is only a $0.2\%$ drop on the PQ${}^{\text{Th}}$ metric as compared to the setting when task is panoptic. This observation shows that OneFormer learns to change its feed-forward output dynamically depending on the task. Similarly, there is a sizable drop in the PQ, PQ${}^{\text{Th}}$ and AP metrics for the semantic task, with PQ${}^{\text{St}}$ improving by $+0.2\%$, showing that our framework can segment out amorphous masks for “stuff” regions but does not predict different masks for “thing” objects.

We further provide qualitative evidence in Fig. II. As demonstrated by the first example in Fig. II, the rider and bicycle regions are detected. However, the other “stuff” regions are misclassified in the semantic inference output when task=“instance”. Similarly, the people are detected in the second example, and the other “stuff” regions are misclassified. As further evidence, in both examples, the distinct “thing” objects are segmented into a single amorphous mask in the panoptic and instance inference outputs when task=“semantic”. Therefore, the differences in the qualitative results demonstrate OneFormer’s ability to output task-dependent class and mask predictions, which our task token input can guide.

Appendix E Comparison to SOTA Methods at System-Level for Image Segmentation

In this section, we compare OneFormer to other SOTA systems for panoptic, instance, and semantic segmentation tasks on the ADE20K val, Cityscapes val, and COCO val2017 datasets. As shown in Fig. I, our single OneFormer model outperforms Mask2Former for the three image segmentation tasks on all three datasets. Note that we are comparing the same OneFormer models referenced in our main text to other systems without applying additional system-level training techniques or using additional data and huge backbones.

E.1 SOTA Systems on ADE20K val

As shown in Tab. VI, without using any extra training data, Swin-L OneFormer sets new state-of-the-art performance on instance segmentation with 37.8% AP, and DiNAT-L OneFormer sets new state-of-the-art performance on panoptic segmentation with 51.5% PQ, beating the previous state-of-the-art Swin-L Mask2Former’s 34.9% AP and ConvNeXt-L KMaX-DeepLab’s 50.9% PQ, respectively. Furthermore, DiNAT-L OneFormer and ConvNeXt-L OneFormer achieve the new state-of-the-art single-scale and multi-scale mIoU scores of 58.3% and 58.8%, respectively, compared to other systems that do not use extra data during training.

E.2 SOTA Systems on Cityscapes val

Without any extra data during training, our ConvNeXt-L OneFormer sets a new state-of-the-art performance on panoptic segmentation with 68.5% PQ with single-scale inference. Similarly, ConvNeXt-XL OneFormer achieves a new state-of-the-art 46.7% AP score with single-scale inference, as shown in Tab. VII.

E.3 SOTA Systems on COCO val

Without using any extra training data, DiNAT-L OneFormer matches the previous state-of-the-art KMaX-DeepLab with a 58.0% PQ score. Swin-L OneFormer achieves the best PQ${}^{\text{Th}}$ score of 64.4%. For evaluating on the semantic segmentation task, we generate semantic GT annotations from the corresponding panoptic annotations. As shown in Tab. VIII, DiNAT-L OneFormer achieves an impressive 68.1% mIoU.

While analyzing the COCO dataset, we found serious discrepancies between the GT panoptic and instance annotations. Therefore, for fair comparison, during evaluation, we generate the instance annotations from the panoptic annotations for calculating the AP scores, as we only use panoptic annotations during training. We provide more information about the discrepancies in Appendix F. DiNAT-L OneFormer achieves 49.2% AP, outperforming Mask2Former-Instance.

Appendix F Analysis on Discrepancy between Instance and Panoptic Annotations in COCO

During our joint training, we derive the semantic and instance ground-truth labels from the corresponding panoptic annotations. Unlike the Cityscapes and ADE20K datasets, which combine the semantic and instance annotations to generate the corresponding panoptic annotations while preparing the data, COCO has separate sets of panoptic and instance annotations. As expected, there are no discrepancies between the panoptic and instance annotations in the Cityscapes and ADE20K datasets. However, because COCO has separately developed panoptic and instance annotations, we discover significant discrepancies in the COCO train2017 and val2017 datasets, as shown in Fig. III and Fig. IV, respectively.
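Deriving semantic and instance labels from a panoptic annotation can be sketched as below. This is a simplified illustration under stated assumptions, not our data pipeline: the function `derive_task_labels`, the nested-list map representation, the `thing_ids` argument, and the `ignore_label` value are hypothetical, and each segment is assumed to carry an `id` and a `category_id`.

```python
def derive_task_labels(pan_map, segments_info, thing_ids, ignore_label=255):
    """Derive semantic and instance labels from one panoptic annotation.

    pan_map: 2D list of per-pixel segment ids.
    segments_info: list of dicts with "id" and "category_id" keys.
    thing_ids: set of category ids that count as "thing" classes.
    """
    seg_to_cat = {seg["id"]: seg["category_id"] for seg in segments_info}
    # Semantic label: every pixel takes the category of its segment, so
    # separate instances of the same class merge into one region.
    semantic = [[seg_to_cat.get(sid, ignore_label) for sid in row]
                for row in pan_map]
    # Instance labels: one binary mask per "thing" segment; "stuff"
    # segments are dropped.
    instances = []
    for seg in segments_info:
        if seg["category_id"] in thing_ids:
            mask = [[sid == seg["id"] for sid in row] for row in pan_map]
            instances.append((seg["category_id"], mask))
    return semantic, instances


if __name__ == "__main__":
    # Toy example: segment 1 is "stuff" (category 7); segments 2 and 3
    # are two instances of the same "thing" category 1; id 0 is unlabeled.
    pan_map = [[1, 1, 2],
               [3, 3, 2],
               [0, 3, 2]]
    segments_info = [{"id": 1, "category_id": 7},
                     {"id": 2, "category_id": 1},
                     {"id": 3, "category_id": 1}]
    semantic, instances = derive_task_labels(pan_map, segments_info, {1})
    print(semantic)
    print(len(instances))
```

Because both label sets come from the same panoptic map, they cannot disagree with each other, which is not guaranteed when instance annotations are produced separately.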

In Fig. III, the instance annotations merge the “tie” object into the “person” object. In another example, instance annotations merge the “dog” and “boat” into a single instance, while the panoptic annotations segment the two instances correctly.

In Fig. IV, the instance annotations skip multiple “person” and “motorcycle” objects in different images, while the panoptic annotations include them all. In another example, instance annotations leave out a group of “person” object instances in the background, and panoptic annotations merge those instances into a single object mask.

These discrepancies are a significant barrier to developing and evaluating a unified image segmentation model. As demonstrated in Fig. III and Fig. IV, our predictions match the panoptic annotations much more closely than the instance annotations, which is expected given that our training strategy involves only panoptic annotations. Therefore, while comparing our Swin-L† OneFormer to other SOTA methods in Tab. 3 (main text), we evaluate the AP score on instance GTs derived from the panoptic annotations.