Per-Pixel Classification is Not All You Need for Semantic Segmentation

Bowen Cheng, Alexander G. Schwing, Alexander Kirillov

Introduction

The goal of semantic segmentation is to partition an image into regions with different semantic categories. Starting from the Fully Convolutional Networks (FCNs) work of Long et al., most deep learning-based semantic segmentation approaches formulate semantic segmentation as per-pixel classification (Figure 1 left), applying a classification loss to each output pixel. Per-pixel predictions in this formulation naturally partition an image into regions of different classes.

Mask classification is an alternative paradigm that disentangles the image partitioning and classification aspects of segmentation. Instead of classifying each pixel, mask classification-based methods predict a set of binary masks, each associated with a single class prediction (Figure 1 right). The more flexible mask classification dominates the field of instance-level segmentation. Both Mask R-CNN and DETR yield a single class prediction per segment for instance and panoptic segmentation. In contrast, per-pixel classification assumes a static number of outputs and cannot return a variable number of predicted regions/segments, which is required for instance-level tasks.

Our key observation: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks. In fact, before FCN, the best performing semantic segmentation methods like O2P and SDS used a mask classification formulation. Given this perspective, a natural question emerges: can a single mask classification model simplify the landscape of effective approaches to semantic- and instance-level segmentation tasks? And can such a mask classification model outperform existing per-pixel classification methods for semantic segmentation?

To address both questions we propose a simple MaskFormer approach that seamlessly converts any existing per-pixel classification model into a mask classification model. Using the set prediction mechanism proposed in DETR, MaskFormer employs a Transformer decoder to compute a set of pairs, each consisting of a class prediction and a mask embedding vector. The mask embedding vector is used to get the binary mask prediction via a dot product with the per-pixel embedding obtained from an underlying fully-convolutional network. The new model solves both semantic- and instance-level segmentation tasks in a unified manner: no changes to the model, losses, and training procedure are required. Specifically, for semantic and panoptic segmentation tasks alike, MaskFormer is supervised with the same per-pixel binary mask loss and a single classification loss per mask. Finally, we design a simple inference strategy to blend MaskFormer outputs into a task-dependent prediction format.

We evaluate MaskFormer on five semantic segmentation datasets with various numbers of categories: Cityscapes (19 classes), Mapillary Vistas (65 classes), ADE20K (150 classes), COCO-Stuff-10K (171 classes), and ADE20K-Full (847 classes). While MaskFormer performs on par with per-pixel classification models for Cityscapes, which has a few diverse classes, the new model demonstrates superior performance for datasets with larger vocabulary. We hypothesize that a single class prediction per mask models fine-grained recognition better than per-pixel class predictions. MaskFormer achieves the new state-of-the-art on ADE20K (55.6 mIoU) with Swin-Transformer backbone, outperforming a per-pixel classification model with the same backbone by 2.1 mIoU, while being more efficient (10% reduction in parameters and 40% reduction in FLOPs).

Finally, we study MaskFormer’s ability to solve instance-level tasks using two panoptic segmentation datasets: COCO and ADE20K. MaskFormer outperforms a more complex DETR model with the same backbone and the same post-processing. Moreover, MaskFormer achieves the new state-of-the-art on COCO (52.7 PQ), outperforming prior state-of-the-art by 1.6 PQ. Our experiments highlight MaskFormer’s ability to unify instance- and semantic-level segmentation.

Related Works

Both per-pixel classification and mask classification have been extensively studied for semantic segmentation. In early work, Konishi and Yuille apply per-pixel Bayesian classifiers based on local image statistics. Then, inspired by early works on non-semantic groupings, mask classification-based methods became popular, demonstrating the best performance in PASCAL VOC challenges. Methods like O2P and CFM have achieved state-of-the-art results by classifying mask proposals. In 2015, FCN extended the idea of per-pixel classification to deep nets, significantly outperforming all prior methods on mIoU (a per-pixel evaluation metric which particularly suits the per-pixel classification formulation of segmentation).

Per-pixel classification became the dominant way for deep-net-based semantic segmentation since the seminal work of Fully Convolutional Networks (FCNs). Modern semantic segmentation models focus on aggregating long-range context in the final feature map: ASPP uses atrous convolutions with different atrous rates; PPM uses pooling operators with different kernel sizes; DANet, OCNet, and CCNet use different variants of non-local blocks. Recently, SETR and Segmenter replace traditional convolutional backbones with Vision Transformers (ViT) that capture long-range context starting from the very first layer. However, these concurrent Transformer-based semantic segmentation approaches still use a per-pixel classification formulation. Note that our MaskFormer module can convert any per-pixel classification model to the mask classification setting, allowing seamless adoption of advances in per-pixel classification.

Mask classification is commonly used for instance-level segmentation tasks. These tasks require a dynamic number of predictions, making application of per-pixel classification challenging as it assumes a static number of outputs. Omnipresent Mask R-CNN uses a global classifier to classify mask proposals for instance segmentation. DETR further incorporates a Transformer design to handle thing and stuff segmentation simultaneously for panoptic segmentation. However, these mask classification methods require predictions of bounding boxes, which may limit their usage in semantic segmentation. The recently proposed Max-DeepLab removes the dependence on box predictions for panoptic segmentation with conditional convolutions. However, in addition to the main mask classification losses it requires multiple auxiliary losses (i.e., instance discrimination loss, mask-ID cross entropy loss, and the standard per-pixel classification loss).

From Per-Pixel to Mask Classification

In this section, we first describe how semantic segmentation can be formulated as either a per-pixel classification or a mask classification problem. Then, we introduce our instantiation of the mask classification model with the help of a Transformer decoder. Finally, we describe simple inference strategies to transform mask classification outputs into task-dependent prediction formats.

For per-pixel classification, a segmentation model aims to predict the probability distribution over all possible $K$ categories for every pixel of an $H\times W$ image: $y=\{p_{i}|p_{i}\in\Delta^{K}\}_{i=1}^{H\cdot W}$ . Here $\Delta^{K}$ is the $K$ -dimensional probability simplex. Training a per-pixel classification model is straight-forward: given ground truth category labels $y^{\text{gt}}=\{y_{i}^{\text{gt}}|y_{i}^{\text{gt}}\in\{1,\dots,K\}\}_{i=1}^{H\cdot W}$ for every pixel, a per-pixel cross-entropy (negative log-likelihood) loss is usually applied, i.e., $\mathcal{L}_{\text{pixel-cls}}(y,y^{\text{gt}})=\sum\nolimits_{i=1}^{H\cdot W}-\log p_{i}(y_{i}^{\text{gt}})$ .
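As a minimal sketch of the loss above, the following NumPy snippet evaluates $\mathcal{L}_{\text{pixel-cls}}$ on a toy $2\times 2$ image with $K=3$. The function name `pixel_cls_loss`, the flattened array layout, and the 0-based labels (instead of $\{1,\dots,K\}$) are assumptions made for illustration only.

```python
import numpy as np

def pixel_cls_loss(probs, labels):
    """Sum of -log p_i(y_i^gt) over all H*W pixels.

    probs:  (H*W, K) array, each row a distribution over K categories.
    labels: (H*W,) integer array of 0-based ground-truth categories.
    """
    # Pick, for every pixel, the probability assigned to its true class.
    picked = probs[np.arange(labels.shape[0]), labels]
    return float(-np.log(picked).sum())

# Toy 2x2 image (4 pixels, flattened) with K = 3 categories.
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.50, 0.25, 0.25],
])
labels = np.array([0, 1, 2, 0])
loss = pixel_cls_loss(probs, labels)
```

In practice the sum is often replaced by a mean over pixels; that only rescales the loss.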

Mask classification formulation

Mask classification splits the segmentation task into 1) partitioning/grouping the image into $N$ regions ( $N$ does not need to equal $K$ ), represented with binary masks $\{m_{i}|m_{i}\in[0,1]^{H\times W}\}_{i=1}^{N}$ ; and 2) associating each region as a whole with some distribution over $K$ categories. To jointly group and classify a segment, i.e., to perform mask classification, we define the desired output $z$ as a set of $N$ probability-mask pairs, i.e., $z=\{(p_{i},m_{i})\}_{i=1}^{N}.$ In contrast to per-pixel class probability prediction, for mask classification the probability distribution $p_{i}\in\Delta^{K+1}$ contains an auxiliary “no object” label ( $\varnothing$ ) in addition to the $K$ category labels. The $\varnothing$ label is predicted for masks that do not correspond to any of the $K$ categories. Note, mask classification allows multiple mask predictions with the same associated class, making it applicable to both semantic- and instance-level segmentation tasks.

To train a mask classification model, a matching $\sigma$ between the set of predictions $z$ and the set of $N^{\text{gt}}$ ground truth segments $z^{\text{gt}}=\{(c_{i}^{\text{gt}},m_{i}^{\text{gt}})|c_{i}^{\text{gt}}\in\{1,\dots,K\},m_{i}^{\text{gt}}\in\{0,1\}^{H\times W}\}_{i=1}^{N^{\text{gt}}}$ is required. Here $c_{i}^{\text{gt}}$ is the ground truth class of the $i^{\text{th}}$ ground truth segment. Different mask classification methods utilize various matching rules. For instance, Mask R-CNN uses a heuristic procedure based on anchor boxes and DETR optimizes a bipartite matching between $z$ and $z^{\text{gt}}$ . Since the size of prediction set $|z|=N$ and ground truth set $|z^{\text{gt}}|=N^{\text{gt}}$ generally differ, we assume $N\geq N^{\text{gt}}$ and pad the set of ground truth labels with “no object” tokens $\varnothing$ to allow one-to-one matching.

For semantic segmentation, a trivial fixed matching is possible if the number of predictions $N$ matches the number of category labels $K$ . In this case, the $i^{\text{th}}$ prediction is matched to a ground truth region with class label $i$ and to $\varnothing$ if a region with class label $i$ is not present in the ground truth. In our experiments, we found that a bipartite matching-based assignment demonstrates better results than the fixed matching. Unlike DETR that uses bounding boxes to compute the assignment costs between prediction $z_{i}$ and ground truth $z_{j}^{\text{gt}}$ for the matching problem, we directly use class and mask predictions, i.e., $-p_{i}(c_{j}^{\text{gt}})+\mathcal{L}_{\text{mask}}(m_{i},m_{j}^{\text{gt}})$ , where $\mathcal{L}_{\text{mask}}$ is a binary mask loss.
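The assignment cost above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the helper names, the placeholder mask loss (mean absolute difference), and the brute-force search over assignments are all assumptions. A practical implementation would typically use a Hungarian solver such as `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.

```python
import itertools
import numpy as np

def placeholder_mask_loss(pred_mask, gt_mask):
    # Stand-in for L_mask; chosen only to keep the example small.
    return float(np.abs(pred_mask - gt_mask).mean())

def matching_cost(pred_probs, pred_masks, gt_classes, gt_masks, mask_loss):
    """cost[i, j] = -p_i(c_j^gt) + L_mask(m_i, m_j^gt)."""
    cost = np.zeros((len(pred_probs), len(gt_classes)))
    for i in range(len(pred_probs)):
        for j in range(len(gt_classes)):
            cost[i, j] = (-pred_probs[i][gt_classes[j]]
                          + mask_loss(pred_masks[i], gt_masks[j]))
    return cost

def brute_force_match(cost):
    """Return sigma, where sigma[j] is the prediction matched to ground truth j."""
    n_pred, n_gt = cost.shape
    best_total, best_sigma = float("inf"), None
    for sigma in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[i, j] for j, i in enumerate(sigma))
        if total < best_total:
            best_total, best_sigma = total, sigma
    return best_sigma

# N = 3 predictions, K = 2 classes plus "no object" (last column), 4 pixels.
pred_probs = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.2, 0.6]])
pred_masks = np.array([[0.9, 0.8, 0.1, 0.2],
                       [0.1, 0.2, 0.9, 0.7],
                       [0.5, 0.5, 0.5, 0.5]])
gt_classes = [0, 1]
gt_masks = np.array([[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]])

cost = matching_cost(pred_probs, pred_masks, gt_classes, gt_masks,
                     placeholder_mask_loss)
sigma = brute_force_match(cost)
```

Here prediction 1 is matched to ground truth segment 0 and prediction 0 to segment 1; the remaining prediction 2 is left for the “no object” label.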

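To train model parameters, given a matching, the main mask classification loss $\mathcal{L}_{\text{mask-cls}}$ is composed of a cross-entropy classification loss and a binary mask loss $\mathcal{L}_{\text{mask}}$ for each predicted segment:

$$\mathcal{L}_{\text{mask-cls}}(z,z^{\text{gt}})=\sum\nolimits_{j=1}^{N}\left[-\log p_{\sigma(j)}(c_{j}^{\text{gt}})+\mathbb{1}_{c_{j}^{\text{gt}}\neq\varnothing}\,\mathcal{L}_{\text{mask}}(m_{\sigma(j)},m_{j}^{\text{gt}})\right].$$

Here $\sigma(j)$ is the index of the prediction matched to the $j^{\text{th}}$ entry of the padded ground truth set, and the mask loss is applied only to entries whose class is not $\varnothing$ .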

Note that most existing mask classification models use auxiliary losses (e.g., a bounding box loss or an instance discrimination loss) in addition to $\mathcal{L}_{\text{mask-cls}}$ . In the next section we present a simple mask classification model that allows end-to-end training with $\mathcal{L}_{\text{mask-cls}}$ alone.
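To make the structure of $\mathcal{L}_{\text{mask-cls}}$ concrete, the sketch below evaluates it for a given matching. It assumes that every prediction receives a cross-entropy term (unmatched predictions are assigned the “no object” label) and that the mask loss is applied only to matched predictions. The function names and the placeholder mask loss are hypothetical, and no class re-weighting is applied.

```python
import numpy as np

def placeholder_mask_loss(pred_mask, gt_mask):
    # Stand-in for L_mask, used only for illustration.
    return float(np.abs(pred_mask - gt_mask).mean())

def mask_cls_loss(pred_probs, pred_masks, gt_classes, gt_masks,
                  sigma, mask_loss, no_object):
    """sigma[j] is the prediction index matched to ground-truth segment j."""
    n_pred = len(pred_probs)
    # Pad the ground truth: unmatched predictions get the "no object" label.
    target = [no_object] * n_pred
    for j, i in enumerate(sigma):
        target[i] = gt_classes[j]
    loss = 0.0
    for i in range(n_pred):              # classification term, all predictions
        loss += -np.log(pred_probs[i][target[i]])
    for j, i in enumerate(sigma):        # mask term, matched predictions only
        loss += mask_loss(pred_masks[i], gt_masks[j])
    return float(loss)

# N = 3 predictions, K = 2 classes, "no object" stored at index 2.
pred_probs = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.2, 0.6]])
pred_masks = np.array([[0.9, 0.8, 0.1, 0.2],
                       [0.1, 0.2, 0.9, 0.7],
                       [0.5, 0.5, 0.5, 0.5]])
gt_classes = [0, 1]
gt_masks = np.array([[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]])
sigma = (1, 0)  # prediction 1 <-> segment 0, prediction 0 <-> segment 1

total = mask_cls_loss(pred_probs, pred_masks, gt_classes, gt_masks,
                      sigma, placeholder_mask_loss, no_object=2)
```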

MaskFormer

We now introduce MaskFormer, the new mask classification model, which computes $N$ probability-mask pairs $z=\{(p_{i},m_{i})\}_{i=1}^{N}$ . The model contains three modules (see Fig. 2): 1) a pixel-level module that extracts per-pixel embeddings used to generate binary mask predictions; 2) a transformer module, where a stack of Transformer decoder layers computes $N$ per-segment embeddings; and 3) a segmentation module, which generates predictions $\{(p_{i},m_{i})\}_{i=1}^{N}$ from these embeddings. During inference, discussed in Sec. 3.4, $p_{i}$ and $m_{i}$ are assembled into the final prediction.

Note, we empirically find it beneficial not to enforce mutual exclusivity among mask predictions, i.e., not to apply a softmax activation across the $N$ masks. During training, the $\mathcal{L}_{\text{mask-cls}}$ loss combines a cross entropy classification loss and a binary mask loss $\mathcal{L}_{\text{mask}}$ for each predicted segment. For simplicity we use the same $\mathcal{L}_{\text{mask}}$ as DETR, i.e., a linear combination of a focal loss and a dice loss multiplied by hyper-parameters $\lambda_{\text{focal}}$ and $\lambda_{\text{dice}}$ respectively.

Mask-classification inference

First, we present a simple general inference procedure that converts mask classification outputs $\{(p_{i},m_{i})\}_{i=1}^{N}$ to either panoptic or semantic segmentation output formats. Then, we describe a semantic inference procedure specifically designed for semantic segmentation. We note that the specific choice of inference strategy largely depends on the evaluation metric rather than the task.

General inference partitions an image into segments by assigning each pixel $[h,w]$ to one of the $N$ predicted probability-mask pairs via $\operatorname*{arg\,max}_{i:c_{i}\neq\varnothing}p_{i}(c_{i})\cdot m_{i}[h,w]$ . Here $c_{i}$ is the most likely class label $c_{i}=\operatorname*{arg\,max}_{c\in\{1,\dots,K,\varnothing\}}p_{i}(c)$ for each probability-mask pair $i$ . Intuitively, this procedure assigns a pixel at location $[h,w]$ to probability-mask pair $i$ only if both the most likely class probability $p_{i}(c_{i})$ and the mask prediction probability $m_{i}[h,w]$ are high. Pixels assigned to the same probability-mask pair $i$ form a segment where each pixel is labelled with $c_{i}$ . For semantic segmentation, segments sharing the same category label are merged; whereas for instance-level segmentation tasks, the index $i$ of the probability-mask pair helps to distinguish different instances of the same class. Finally, to reduce false positive rates in panoptic segmentation we follow previous inference strategies. Specifically, we filter out low-confidence predictions prior to inference and remove predicted segments that have large parts of their binary masks ( $m_{i}>0.5$ ) occluded by other predictions.
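The core assignment step of general inference can be sketched in NumPy as below. The function name, the array layout (the “no object” label stored in the last column of `probs`), and the use of `-1` for pixels without an eligible pair are assumptions; the confidence filtering and occlusion check used for panoptic segmentation are deliberately left out of this sketch.

```python
import numpy as np

def general_inference(probs, masks):
    """probs: (N, K+1), last column is "no object"; masks: (N, H, W) in [0, 1].

    Returns (segment_id, class_map), both of shape (H, W).
    """
    no_object = probs.shape[1] - 1
    classes = probs.argmax(axis=1)        # c_i, most likely label per pair
    conf = probs.max(axis=1)              # p_i(c_i)
    keep = np.flatnonzero(classes != no_object)
    if keep.size == 0:
        empty = np.full(masks.shape[1:], -1)
        return empty, empty.copy()
    # Score of pair i at pixel [h, w] is p_i(c_i) * m_i[h, w].
    scores = conf[keep][:, None, None] * masks[keep]
    segment_id = keep[scores.argmax(axis=0)]
    return segment_id, classes[segment_id]

# N = 3 pairs, K = 2 classes; pair 2 predicts "no object" and is ignored.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.10, 0.10, 0.80]])
masks = np.array([[[0.9, 0.8], [0.2, 0.1]],
                  [[0.1, 0.3], [0.9, 0.7]],
                  [[1.0, 1.0], [1.0, 1.0]]])
segment_id, class_map = general_inference(probs, masks)
```

For semantic segmentation only `class_map` is needed, whereas `segment_id` separates different instances of the same class.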

Semantic inference is designed specifically for semantic segmentation and is done via a simple matrix multiplication. We empirically find that marginalization over probability-mask pairs, i.e., $\operatorname*{arg\,max}_{c\in\{1,\dots,K\}}\sum_{i=1}^{N}p_{i}(c)\cdot m_{i}[h,w]$ , yields better results than the hard assignment of each pixel to a probability-mask pair $i$ used in the general inference strategy. The argmax does not include the “no object” category ( $\varnothing$ ) as standard semantic segmentation requires each output pixel to take a label. Note, this strategy returns a per-pixel class probability $\sum_{i=1}^{N}p_{i}(c)\cdot m_{i}[h,w]$ . However, we observe that directly maximizing per-pixel class likelihood leads to poor performance. We hypothesize that gradients are evenly distributed to every query, which complicates training.
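The marginalization above reduces to one matrix product between the class probabilities and the flattened masks. The following is a minimal sketch under the assumption that the “no object” label occupies the last column of `probs`; the function name and array layout are illustrative.

```python
import numpy as np

def semantic_inference(probs, masks):
    """probs: (N, K+1), last column is "no object"; masks: (N, H, W).

    Returns an (H, W) map of argmax_c sum_i p_i(c) * m_i[h, w].
    """
    n, h, w = masks.shape
    class_probs = probs[:, :-1]                        # drop "no object": (N, K)
    scores = class_probs.T @ masks.reshape(n, h * w)   # (K, H*W)
    return scores.argmax(axis=0).reshape(h, w)

probs = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.10, 0.10, 0.80]])
masks = np.array([[[0.9, 0.8], [0.2, 0.1]],
                  [[0.1, 0.3], [0.9, 0.7]],
                  [[1.0, 1.0], [1.0, 1.0]]])
semantic_map = semantic_inference(probs, masks)
```

Unlike the hard assignment, every pair contributes to every pixel here, including pairs whose most likely label is “no object”.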

Experiments

We demonstrate that MaskFormer seamlessly unifies semantic- and instance-level segmentation tasks by showing state-of-the-art results on both semantic segmentation and panoptic segmentation datasets. Then, we ablate the MaskFormer design confirming that observed improvements in semantic segmentation indeed stem from the shift from per-pixel classification to mask classification.

Datasets. We study MaskFormer using four widely used semantic segmentation datasets: ADE20K (150 classes) from the SceneParse150 challenge, COCO-Stuff-10K (171 classes), Cityscapes (19 classes), and Mapillary Vistas (65 classes). In addition, we use the ADE20K-Full dataset annotated in an open vocabulary setting (we keep 847 classes that are present in both train and validation sets). For panoptic segmentation evaluation we use COCO (80 “things” and 53 “stuff” categories) and ADE20K-Panoptic (100 “things” and 50 “stuff” categories). Please see the appendix for detailed descriptions of all used datasets.

Evaluation metrics. For semantic segmentation the standard metric is mIoU (mean Intersection-over-Union), a per-pixel metric that directly corresponds to the per-pixel classification formulation. To better illustrate the difference between segmentation approaches, in our ablations we supplement mIoU with PQ ${}^{\text{St}}$ (PQ stuff), a per-region metric that treats all classes as “stuff” and evaluates each segment equally, irrespective of its size. We report the median of 3 runs for all datasets, except for Cityscapes where we report the median of 5 runs. For panoptic segmentation, we use the standard PQ (panoptic quality) metric and report single run results due to prohibitive training costs.
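For reference, mIoU can be sketched on a single pair of label maps as follows. This simplified version is an assumption made for brevity: benchmark implementations usually accumulate per-class intersections and unions over the whole dataset before averaging, and handle an ignore label, both of which are omitted here.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean over classes of |pred == c AND gt == c| / |pred == c OR gt == c|."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps, skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, gt, num_classes=2)  # class 0: 1/2, class 1: 2/3
```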

Baseline models. On the right we sketch the used per-pixel classification baselines. The PerPixelBaseline uses the pixel-level module of MaskFormer and directly outputs per-pixel class scores. For a fair comparison, we design PerPixelBaseline+ which adds the transformer module and mask embedding MLP to the PerPixelBaseline. Thus, PerPixelBaseline+ and MaskFormer differ only in the formulation: per-pixel vs. mask classification. Note that these baselines are for ablation and we compare MaskFormer with state-of-the-art per-pixel classification models as well.

Backbone. MaskFormer is compatible with any backbone architecture. In our work we use the standard convolution-based ResNet backbones (R50 and R101 with 50 and 101 layers respectively) and recently proposed Transformer-based Swin-Transformer backbones. In addition, we use the R101c model which replaces the first $7\times 7$ convolution layer of R101 with 3 consecutive $3\times 3$ convolutions and which is popular in the semantic segmentation community.

Pixel decoder. The pixel decoder in Figure 2 can be implemented using any semantic segmentation decoder. Many per-pixel classification methods use modules like ASPP or PSP to collect and distribute context across locations. The Transformer module attends to all image features, collecting global information to generate class predictions. This setup reduces the need of the per-pixel module for heavy context aggregation. Therefore, for MaskFormer, we design a light-weight pixel decoder based on the popular FPN architecture.

Following FPN, we $2\times$ upsample the low-resolution feature map in the decoder and sum it with the projected feature map of corresponding resolution from the backbone; projection is done to match channel dimensions of the feature maps with a $1\times 1$ convolution layer followed by GroupNorm (GN). Next, we fuse the summed features with an additional $3\times 3$ convolution layer followed by GN and ReLU activation. We repeat this process starting with the stride 32 feature map until we obtain a final feature map of stride 4. Finally, we apply a single $1\times 1$ convolution layer to get the per-pixel embeddings. All feature maps in the pixel decoder have a dimension of 256 channels.

Transformer decoder. We use the same Transformer decoder design as DETR. The $N$ query embeddings are initialized as zero vectors, and we associate each query with a learnable positional encoding. We use 6 Transformer decoder layers with 100 queries by default, and, following DETR, we apply the same loss after each decoder layer. In our experiments we observe that MaskFormer is competitive for semantic segmentation with a single decoder layer too, whereas for instance-level segmentation multiple layers are necessary to remove duplicates from the final predictions.

Segmentation module. The multi-layer perceptron (MLP) in Figure 2 has 2 hidden layers of 256 channels to predict the mask embeddings $\mathcal{E}_{\text{mask}}$ , analogously to the box head in DETR. Both per-pixel $\mathcal{E}_{\text{pixel}}$ and mask $\mathcal{E}_{\text{mask}}$ embeddings have 256 channels.
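The mask prediction step, a dot product between each mask embedding and the per-pixel embeddings, can be sketched as below. The function name and tensor layout are assumptions, random arrays stand in for the real embeddings, and a sigmoid is assumed as the activation, which is consistent with masks not being forced to be mutually exclusive.

```python
import numpy as np

def predict_masks(mask_embed, pixel_embed):
    """mask_embed: (N, C); pixel_embed: (C, H, W). Returns (N, H, W) in (0, 1)."""
    # Dot product over the channel dimension for every query and pixel.
    logits = np.einsum("nc,chw->nhw", mask_embed, pixel_embed)
    return 1.0 / (1.0 + np.exp(-logits))  # per-mask sigmoid (assumed)

rng = np.random.default_rng(0)
n_queries, channels, height, width = 100, 256, 8, 8
mask_embed = rng.normal(size=(n_queries, channels)) * 0.05
pixel_embed = rng.normal(size=(channels, height, width))
pred_masks = predict_masks(mask_embed, pixel_embed)
```

Because `pixel_embed` is computed once per image and shared by all queries, adding queries only adds rows to a single matrix product.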

Loss weights. We use focal loss and dice loss for our mask loss: $\mathcal{L}_{\text{mask}}(m,m^{\text{gt}})=\lambda_{\text{focal}}\mathcal{L}_{\text{focal}}(m,m^{\text{gt}})+\lambda_{\text{dice}}\mathcal{L}_{\text{dice}}(m,m^{\text{gt}})$ , and set the hyper-parameters to $\lambda_{\text{focal}}=20.0$ and $\lambda_{\text{dice}}=1.0$ . Following DETR, the weight for the “no object” ( $\varnothing$ ) in the classification loss is set to 0.1.
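A minimal NumPy sketch of this weighted mask loss is given below. Only $\lambda_{\text{focal}}=20.0$ and $\lambda_{\text{dice}}=1.0$ come from the text; the focal parameters `alpha=0.25` and `gamma=2.0`, the dice smoothing term, and the mean reduction are common defaults assumed here for illustration.

```python
import numpy as np

def focal_loss(pred, gt, alpha=0.25, gamma=2.0, eps=1e-8):
    """Binary focal loss averaged over pixels; pred holds probabilities."""
    p_t = pred * gt + (1 - pred) * (1 - gt)          # prob. of the true label
    alpha_t = alpha * gt + (1 - alpha) * (1 - gt)    # class balancing weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps)))

def dice_loss(pred, gt, smooth=1.0):
    """One minus the (smoothed) dice coefficient."""
    inter = (pred * gt).sum()
    return float(1 - (2 * inter + smooth) / (pred.sum() + gt.sum() + smooth))

def mask_loss(pred, gt, lambda_focal=20.0, lambda_dice=1.0):
    return lambda_focal * focal_loss(pred, gt) + lambda_dice * dice_loss(pred, gt)

gt = np.array([1.0, 0.0, 1.0, 0.0])
perfect = mask_loss(gt, gt)           # prediction equals the ground truth
inverted = mask_loss(1.0 - gt, gt)    # every pixel is wrong
```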

Training settings

Semantic segmentation. We use Detectron2 and follow the commonly used training settings for each dataset. More specifically, we use AdamW and the poly learning rate schedule with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-4}$ for ResNet backbones, and an initial learning rate of $6\cdot 10^{-5}$ and a weight decay of $10^{-2}$ for Swin-Transformer backbones. Backbones are pre-trained on ImageNet-1K if not stated otherwise. A learning rate multiplier of $0.1$ is applied to CNN backbones and $1.0$ is applied to Transformer backbones. The standard random scale jittering between $0.5$ and $2.0$ , random horizontal flipping, random cropping as well as random color jittering are used as data augmentation . For the ADE20K dataset, if not stated otherwise, we use a crop size of $512\times 512$ , a batch size of $16$ and train all models for 160k iterations. For the ADE20K-Full dataset, we use the same setting as ADE20K except that we train all models for 200k iterations. For the COCO-Stuff-10k dataset, we use a crop size of $640\times 640$ , a batch size of 32 and train all models for 60k iterations. All models are trained with 8 V100 GPUs. We report both performance of single scale (s.s.) inference and multi-scale (m.s.) inference with horizontal flip and scales of $0.5$ , $0.75$ , $1.0$ , $1.25$ , $1.5$ , $1.75$ . See appendix for Cityscapes and Mapillary Vistas settings.
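The poly learning rate schedule mentioned above is commonly written as $\text{lr}=\text{lr}_{0}\cdot(1-t/T)^{\text{power}}$ . The sketch below uses the ADE20K ResNet setting from the text (initial rate $10^{-4}$ , 160k iterations); the exponent `power=0.9` is a widely used default and is an assumption, since the text does not state it.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial decay from base_lr at iteration 0 to 0 at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power

base_lr, max_iter = 1e-4, 160_000
schedule = [poly_lr(base_lr, it, max_iter) for it in (0, 80_000, 160_000)]

# A backbone multiplier (0.1 for CNN backbones in the text) scales the rate
# applied to backbone parameters only.
backbone_lr_start = 0.1 * schedule[0]
```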

Panoptic segmentation. We follow exactly the same architecture, loss, and training procedure as we use for semantic segmentation. The only difference is supervision: i.e., category region masks in semantic segmentation vs. object instance masks in panoptic segmentation. We strictly follow the DETR setting to train our model on the COCO panoptic segmentation dataset for a fair comparison. On the ADE20K panoptic segmentation dataset, we follow the semantic segmentation setting but train for longer (720k iterations) and use a larger crop size ( $640\times 640$ ). COCO models are trained using 64 V100 GPUs and ADE20K experiments are trained with 8 V100 GPUs. We use the general inference (Section 3.4) with the following parameters: we filter out masks with class confidence below 0.8 and set masks whose contribution to the final panoptic segmentation is less than 80% of their mask area to VOID. We report performance of single scale inference.
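The two filtering rules can be sketched on top of the general inference assignment as follows. The thresholds 0.8 and 80% come from the text; everything else (function name, `VOID = -1`, measuring a mask's area as the number of pixels with $m_{i}>0.5$ , array layout) is an assumption about one plausible way to implement the description.

```python
import numpy as np

VOID = -1

def panoptic_postprocess(probs, masks, conf_thresh=0.8, overlap_thresh=0.8):
    """probs: (N, K+1), last column "no object"; masks: (N, H, W) in [0, 1]."""
    no_object = probs.shape[1] - 1
    classes, conf = probs.argmax(axis=1), probs.max(axis=1)
    # Rule 1: drop "no object" and low-confidence predictions before inference.
    keep = np.flatnonzero((classes != no_object) & (conf >= conf_thresh))
    segment_id = np.full(masks.shape[1:], VOID)
    if keep.size == 0:
        return segment_id
    scores = conf[keep][:, None, None] * masks[keep]
    winner = keep[scores.argmax(axis=0)]
    for i in keep:
        binary = masks[i] > 0.5              # binary mask of prediction i
        visible = binary & (winner == i)     # part that survives the argmax
        area = binary.sum()
        # Rule 2: keep the segment only if enough of its mask is visible.
        if area > 0 and visible.sum() / area >= overlap_thresh:
            segment_id[visible] = i
    return segment_id

probs = np.array([[0.95, 0.03, 0.02],   # class 0, confident
                  [0.05, 0.90, 0.05],   # class 1, confident
                  [0.50, 0.30, 0.20]])  # below the confidence threshold
masks = np.array([[[0.9, 0.9, 0.9, 0.1]],
                  [[0.1, 0.2, 0.99, 0.9]],
                  [[1.0, 1.0, 1.0, 1.0]]])
panoptic = panoptic_postprocess(probs, masks)
```

In this toy case prediction 0 keeps only two of its three mask pixels (about 67%), so its pixels become VOID, while prediction 1 is fully visible and kept.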

Main results

Semantic segmentation. In Table 1, we compare MaskFormer with state-of-the-art per-pixel classification models for semantic segmentation on the ADE20K val set. With the same standard CNN backbones (e.g., ResNet), MaskFormer outperforms DeepLabV3+ by 1.7 mIoU. MaskFormer is also compatible with recent Vision Transformer backbones (e.g., the Swin Transformer), achieving a new state-of-the-art of 55.6 mIoU, which is 2.1 mIoU better than the prior state-of-the-art. Observe that MaskFormer outperforms the best per-pixel classification-based models while having fewer parameters and faster inference time. This result suggests that the mask classification formulation has significant potential for semantic segmentation. See appendix for results on test set.

Beyond ADE20K, we further compare MaskFormer with our baselines on COCO-Stuff-10K, ADE20K-Full as well as Cityscapes in Table 2 and we refer to the appendix for comparison with state-of-the-art methods on these datasets. The improvement of MaskFormer over PerPixelBaseline+ is larger when the number of classes is larger: for Cityscapes, which has only 19 categories, MaskFormer performs on par with PerPixelBaseline+, while for ADE20K-Full, which has 847 classes, MaskFormer outperforms PerPixelBaseline+ by 3.5 mIoU.

Although MaskFormer shows no improvement in mIoU for Cityscapes, the PQ ${}^{\text{St}}$ metric increases by 2.9 PQ ${}^{\text{St}}$ . We find MaskFormer performs better in terms of recognition quality (RQ ${}^{\text{St}}$ ) while lagging in per-pixel segmentation quality (SQ ${}^{\text{St}}$ ) (we refer to the appendix for detailed numbers). This observation suggests that on datasets where class recognition is relatively easy to solve, the main challenge for mask classification-based approaches is pixel-level accuracy (i.e., mask quality).

Panoptic segmentation. In Table 3, we compare the same exact MaskFormer model with DETR on the COCO panoptic val set. To match the standard DETR design, we add 6 additional Transformer encoder layers after the CNN backbone. Unlike DETR, our model does not predict bounding boxes but instead predicts masks directly. MaskFormer achieves better results while being simpler than DETR. To disentangle the improvements from the model itself and our post-processing inference strategy we run our model following DETR post-processing (MaskFormer (DETR)) and observe that this setup outperforms DETR by 2.2 PQ. Overall, we observe a larger improvement in PQ ${}^{\text{St}}$ compared to PQ ${}^{\text{Th}}$ . This suggests that detecting “stuff” with bounding boxes is suboptimal, and therefore, box-based segmentation models (e.g., Mask R-CNN) do not suit semantic segmentation. MaskFormer also outperforms recently proposed Max-DeepLab without the need of special network design as well as sophisticated auxiliary losses (i.e., instance discrimination loss, mask-ID cross entropy loss, and per-pixel classification loss). MaskFormer, for the first time, unifies semantic- and instance-level segmentation with the exact same model, loss, and training pipeline.

We further evaluate our model on the panoptic segmentation version of the ADE20K dataset. Our model also achieves state-of-the-art performance. We refer to the appendix for detailed results.

Ablation studies

We perform a series of ablation studies of MaskFormer using a single ResNet-50 backbone.

Per-pixel vs. mask classification. In Table 4, we verify that the gains demonstrated by MaskFormer come from shifting the paradigm to mask classification. We start by comparing PerPixelBaseline+ and MaskFormer. The models are very similar and there are only 3 differences: 1) per-pixel vs. mask classification used by the models, 2) MaskFormer uses bipartite matching, and 3) the new model uses a combination of focal and dice losses as a mask loss, whereas PerPixelBaseline+ utilizes per-pixel cross entropy loss. First, we rule out the influence of loss differences by training PerPixelBaseline+ with exactly the same losses and observing no improvement. Next, in Table 4a, we compare PerPixelBaseline+ with MaskFormer trained using a fixed matching (MaskFormer-fixed), i.e., $N=K$ and assignment done based on category label indices identically to the per-pixel classification setup. We observe that MaskFormer-fixed is 1.8 mIoU better than the baseline, suggesting that shifting from per-pixel classification to mask classification is indeed the main reason for the gains of MaskFormer. In Table 4b, we further compare MaskFormer-fixed with MaskFormer trained with bipartite matching (MaskFormer-bipartite) and find bipartite matching is not only more flexible (allowing the model to predict fewer masks than the total number of categories) but also produces better results.

Number of queries. The table to the right shows results of MaskFormer trained with a varying number of queries on datasets with different numbers of categories. The model with 100 queries consistently performs the best across the studied datasets. This suggests we may not need to adjust the number of queries much w.r.t. the number of categories or datasets. Interestingly, even with 20 queries MaskFormer outperforms our per-pixel classification baseline.

We further calculate the number of classes which are on average present in a training set image. We find these statistics to be similar across datasets despite the fact that the datasets have different total numbers of categories: 8.2 classes per image for ADE20K (150 classes), 6.6 classes per image for COCO-Stuff-10K (171 classes) and 9.1 classes per image for ADE20K-Full (847 classes). We hypothesize that each query is able to capture masks from multiple categories.

The figure to the right shows the number of unique categories predicted by each query (sorted in descending order) of our MaskFormer model on the validation sets of the corresponding datasets. Interestingly, the number of unique categories per query does not follow a uniform distribution: some queries capture more classes than others. We try to analyze how MaskFormer queries group categories, but we do not observe any obvious pattern: there are queries capturing categories with similar semantics or shapes (e.g., “house” and “building”), but there are also queries capturing completely different categories (e.g., “water” and “sofa”).

Number of Transformer decoder layers. Interestingly, MaskFormer with even a single Transformer decoder layer already performs well for semantic segmentation and achieves better performance than our 6-layer-decoder PerPixelBaseline+. For panoptic segmentation, however, multiple decoder layers are required to achieve competitive performance. Please see the appendix for a detailed discussion.

Discussion

Our main goal is to show that mask classification is a general segmentation paradigm that could be a competitive alternative to per-pixel classification for semantic segmentation. To better understand its potential for segmentation tasks, we focus on exploring mask classification independently of other factors like architecture, loss design, or augmentation strategy. We pick the DETR architecture as our baseline for its simplicity and deliberately make as few architectural changes as possible. Therefore, MaskFormer can be viewed as a “box-free” version of DETR.

In this section, we discuss in detail the differences between MaskFormer and DETR and show how these changes are required to ensure that mask classification performs well. First, to achieve a pure mask classification setting we remove the box prediction head and perform matching between prediction and ground truth segments with masks instead of boxes. Secondly, we replace the compute-heavy per-query mask head used in DETR with a more efficient per-image FPN-based head to make end-to-end training without box supervision feasible.

Matching with masks is superior to matching with boxes. We compare MaskFormer models trained using matching with boxes or masks in Table 5. To do box-based matching, we add to MaskFormer an additional box prediction head as in DETR. Observe that MaskFormer, which directly matches with mask predictions, has a clear advantage. We hypothesize that matching with boxes is more ambiguous than matching with masks, especially for stuff categories where completely different masks can have similar boxes as stuff regions often spread over a large area in an image.

MaskFormer mask head reduces computation. Results in Table 5 also show that MaskFormer performs on par with DETR when the same matching strategy is used. This suggests that the difference in mask head designs between the models does not significantly influence the prediction quality. The new head, however, has significantly lower computational and memory costs in comparison with the original mask head used in DETR. In MaskFormer, we first upsample image features to get high-resolution per-pixel embeddings and directly generate binary mask predictions at a high-resolution. Note, that the per-pixel embeddings from the upsampling module (i.e., pixel decoder) are shared among all queries. In contrast, DETR first generates low-resolution attention maps and applies an independent upsampling module to each query. Thus, the mask head in DETR is $N$ times more computationally expensive than the mask head in MaskFormer (where $N$ is the number of queries).

Conclusion

The paradigm discrepancy between semantic- and instance-level segmentation results in entirely different models for each task, hindering development of image segmentation as a whole. We show that a simple mask classification model can outperform state-of-the-art per-pixel classification models, especially in the presence of a large number of categories. Our model also remains competitive for panoptic segmentation, without a need to change model architecture, losses, or training procedure. We hope this unification spurs a joint effort across semantic- and instance-level segmentation tasks.

Acknowledgments and Disclosure of Funding

We thank Ross Girshick for insightful comments and suggestions. Work of UIUC authors Bowen Cheng and Alexander G. Schwing was supported in part by NSF under Grant #1718221, 2008387, 2045586, 2106825, MRI #1725729, NIFA award 2020-67021-32799 and Cisco Systems Inc. (Gift Award CG 1377144 - thanks for access to Arcetri).

We first provide more information regarding the datasets used in our experimental evaluation of MaskFormer (Appendix A). Then, we provide detailed results of our model on more semantic (Appendix B) and panoptic (Appendix C) segmentation datasets. Finally, we provide additional ablation studies (Appendix D) and visualization (Appendix E).

Appendix A Datasets description

We study MaskFormer using five semantic segmentation datasets and two panoptic segmentation datasets. Here, we provide more detailed information about these datasets.

A.1 Semantic segmentation datasets

ADE20K contains 20k images for training and 2k images for validation. The data comes from the ADE20K-Full dataset, from which 150 semantic categories are selected for evaluation in the SceneParse150 challenge . The images are resized such that the shortest side is no greater than 512 pixels. During inference, we resize the shorter side of the image to the corresponding crop size.

COCO-Stuff-10K has 171 semantic-level categories. There are 9k images for training and 1k images for testing. Images in the COCO-Stuff-10K dataset are a subset of the COCO dataset . During inference, we resize the shorter side of the image to the corresponding crop size.

ADE20K-Full contains 25k images for training and 2k images for validation. The ADE20K-Full dataset is annotated in an open-vocabulary setting with more than 3000 semantic categories. We filter these categories by selecting those that are present in both training and validation sets, resulting in a total of 847 categories. We follow the same process as ADE20K-SceneParse150 to resize images such that the shortest side is no greater than 512 pixels. During inference, we resize the shorter side of the image to the corresponding crop size.

Cityscapes is an urban egocentric street-view dataset with high-resolution images ( $1024\times 2048$ pixels). It contains 2975 images for training, 500 images for validation, and 1525 images for testing with a total of 19 classes. During training, we use a crop size of $512\times 1024$ , a batch size of 16 and train all models for 90k iterations. During inference, we operate on the whole image ( $1024\times 2048$ ).

Mapillary Vistas is a large-scale urban street-view dataset with 65 categories. It contains 18k, 2k, and 5k images for training, validation and testing with a variety of image resolutions, ranging from $1024\times 768$ to $4000\times 6000$ . During training, we resize the short side of images to 2048 before applying scale augmentation. We use a crop size of $1280\times 1280$ , a batch size of $16$ and train all models for 300k iterations. During inference, we resize the longer side of the image to 2048 and only use three scales (0.5, 1.0 and 1.5) for multi-scale testing due to GPU memory constraints.

A.2 Panoptic segmentation datasets

COCO panoptic is one of the most commonly used datasets for panoptic segmentation. It has 133 categories (80 “thing” categories with instance-level annotation and 53 “stuff” categories) in 118k images for training and 5k images for validation. All images are from the COCO dataset .

ADE20K panoptic combines the ADE20K semantic segmentation annotation from the SceneParse150 challenge and the ADE20K instance annotation from the COCO+Places challenge . Among the 150 categories, there are 100 “thing” categories with instance-level annotation. We find that filtering masks with a lower threshold for ADE20K (0.7) than for COCO (0.8) gives slightly better performance.

Appendix B Semantic segmentation results

ADE20K test. Table I compares MaskFormer with previous state-of-the-art methods on the ADE20K test set. Following , we train MaskFormer on the union of the ADE20K train and val sets with an ImageNet-22K pre-trained checkpoint and use multi-scale inference. MaskFormer outperforms previous state-of-the-art methods on all three metrics by a large margin.

COCO-Stuff-10K. Table IIa compares MaskFormer with our baselines as well as the state-of-the-art OCRNet model on the COCO-Stuff-10K dataset. MaskFormer outperforms our per-pixel classification baselines by a large margin and achieves competitive performance compared to OCRNet. These results demonstrate the generality of the MaskFormer model.

ADE20K-Full. We further demonstrate the benefits in large-vocabulary semantic segmentation in Table IIb. Since we are the first to report performance on this dataset, we only compare MaskFormer with our per-pixel classification baselines. MaskFormer not only achieves better performance, but is also more memory efficient on the ADE20K-Full dataset with 847 categories, thanks to decoupling the number of masks from the number of classes. These results show that our MaskFormer has the potential to deal with real-world segmentation problems with thousands of categories.
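A back-of-the-envelope count gives a sense of why decoupling the number of masks from the number of classes saves memory. The sketch below only counts final output values and ignores intermediate activations, so it is an illustration rather than a measurement. The 512×512 output resolution is an assumption, and `output_value_count` is a hypothetical helper; 847 is the number of ADE20K-Full categories and 100 is the default number of queries.

```python
def output_value_count(num_maps, height, width):
    """Number of values in a stack of `num_maps` dense H x W prediction maps."""
    return num_maps * height * width


height = width = 512   # assumed output resolution, for illustration only
num_classes = 847      # ADE20K-Full categories
num_queries = 100      # default number of MaskFormer queries

# Per-pixel classification predicts one dense map per class.
per_pixel_values = output_value_count(num_classes, height, width)
# Mask classification predicts one dense map per query.
mask_cls_values = output_value_count(num_queries, height, width)

ratio = per_pixel_values / mask_cls_values
```

The ratio equals the number of classes divided by the number of queries regardless of the resolution chosen, so the gap widens as the vocabulary grows while the number of queries stays fixed.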

Cityscapes. In Table IIIa, we report MaskFormer performance on Cityscapes, the standard testbed for modern semantic segmentation methods. The dataset has only 19 categories and therefore, the recognition aspect of the dataset is less challenging than in other considered datasets. We observe that MaskFormer performs on par with the best per-pixel classification methods. To better analyze MaskFormer, in Table IIIb, we further report PQ ${}^{\text{St}}$ . We find MaskFormer performs better in terms of recognition quality (RQ ${}^{\text{St}}$ ) while lagging in per-pixel segmentation quality (SQ ${}^{\text{St}}$ ). This suggests that on datasets where recognition is relatively easy to solve, the main challenge for mask classification-based approaches is pixel-level accuracy.
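For readers less familiar with the metric, the decomposition of PQ into segmentation quality and recognition quality can be sketched as follows. The formula follows the standard panoptic quality definition (predicted and ground-truth segments count as matched when their IoU exceeds 0.5) and is not spelled out in this appendix; the function name and the toy numbers are hypothetical. PQ ${}^{\text{St}}$ applies the same computation to "stuff" categories.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Return (PQ, SQ, RQ) for one category.

    matched_ious: IoU of every matched (true positive) segment pair.
    num_false_positives: predicted segments with no match.
    num_false_negatives: ground-truth segments with no match.
    """
    num_true_positives = len(matched_ious)
    denominator = (num_true_positives
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if num_true_positives == 0 or denominator == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_true_positives  # average IoU of matches
    rq = num_true_positives / denominator        # F1-style recognition score
    return sq * rq, sq, rq


# Toy numbers: two matched segments, one spurious and one missed segment.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_false_positives=1,
                              num_false_negatives=1)
```

A model can therefore gain PQ either by matching more segments correctly (higher RQ) or by producing tighter masks for the segments it matches (higher SQ), which is the trade-off discussed above.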

Mapillary Vistas. Table IV compares MaskFormer with state-of-the-art per-pixel classification models on the high-resolution Mapillary Vistas dataset which contains images up to $4000\times 6000$ resolution. We observe: (1) MaskFormer is able to handle high-resolution images, and (2) MaskFormer outperforms multi-scale per-pixel classification models even without multi-scale inference. We believe the Transformer decoder in MaskFormer is able to capture global context even for high-resolution images.

Appendix C Panoptic segmentation results

COCO panoptic test-dev. Table V compares MaskFormer with previous state-of-the-art methods on the COCO panoptic test-dev set. We train our model only on the COCO train2017 set with an ImageNet-22K pre-trained checkpoint and outperform the previous state of the art by 2 PQ.

ADE20K panoptic. We demonstrate the generality of our model for panoptic segmentation on the ADE20K panoptic dataset in Table VI, where MaskFormer is competitive with the state-of-the-art methods.

Appendix D Additional ablation studies

We perform additional ablation studies of MaskFormer for semantic segmentation using the same setting as that in the main paper: a single ResNet-50 backbone , and we report both the mIoU and the PQ ${}^{\text{St}}$ . The default setting of our MaskFormer is: 100 queries and 6 Transformer decoder layers.

Inference strategies. In Table VII, we ablate inference strategies for mask classification-based models performing semantic segmentation (discussed in Section 3.4). We compare our default semantic inference strategy and the general inference strategy which first filters out low-confidence masks (a threshold of 0.3 is used) and assigns the class labels to the remaining masks. We observe 1) general inference is only slightly better than the PerPixelBaseline+ in terms of the mIoU metric, and 2) on multiple datasets the general inference strategy performs worse in terms of the mIoU metric than the default semantic inference. However, the general inference has higher PQ ${}^{\text{St}}$ , due to better recognition quality (RQ ${}^{\text{St}}$ ). We hypothesize that the filtering step removes false positives which increases the RQ ${}^{\text{St}}$ . In contrast, the semantic inference aggregates mask predictions from multiple queries thus it has better mask quality (SQ ${}^{\text{St}}$ ). This observation suggests that semantic and instance-level segmentation can be unified with a single inference strategy (i.e., our general inference) and the choice of inference strategy largely depends on the evaluation metric instead of the task.
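The two strategies compared here can be sketched side by side. This is a simplified pure-Python illustration under stated assumptions, not the released code: each query is assumed to output a probability vector over $K$ classes plus a final "no object" entry and a soft mask with values in $[0, 1]$; the function names, the nested-list layout, and the exact form of the confidence filter are hypothetical.

```python
def semantic_inference(class_probs, masks):
    """Label each pixel with the class whose summed (prob x mask) score is largest."""
    num_classes = len(class_probs[0]) - 1  # last entry is "no object"
    height, width = len(masks[0]), len(masks[0][0])
    labels = []
    for h in range(height):
        row = []
        for w in range(width):
            # Aggregate evidence for each class over all queries.
            scores = [sum(p[c] * m[h][w] for p, m in zip(class_probs, masks))
                      for c in range(num_classes)]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


def general_inference(class_probs, masks, threshold=0.3):
    """Drop low-confidence queries, then give each pixel to the best remaining query."""
    no_object = len(class_probs[0]) - 1
    kept = []
    for p, m in zip(class_probs, masks):
        label = max(range(len(p)), key=p.__getitem__)
        if label != no_object and p[label] > threshold:
            kept.append((label, p[label], m))
    height, width = len(masks[0]), len(masks[0][0])
    return [[max(kept, key=lambda k: k[1] * k[2][h][w])[0] if kept else None
             for w in range(width)]
            for h in range(height)]


# Toy input: 3 queries, 2 classes (+ "no object"), a 1 x 2 image.
class_probs = [[0.90, 0.05, 0.05],
               [0.20, 0.25, 0.55],   # most likely "no object": filtered out
               [0.10, 0.80, 0.10]]
masks = [[[0.9, 0.1]],
         [[0.5, 0.5]],
         [[0.2, 0.8]]]

semantic_labels = semantic_inference(class_probs, masks)
general_labels = general_inference(class_probs, masks)
```

In the semantic variant every query contributes to every class score, while in the general variant each pixel is owned by exactly one surviving query, which is what makes it applicable to instance-level outputs as well.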

Number of Transformer decoder layers. In Table VIII, we ablate the effect of the number of Transformer decoder layers on ADE20K for both semantic and panoptic segmentation. Surprisingly, we find a MaskFormer with even a single Transformer decoder layer already performs reasonably well for semantic segmentation and achieves better performance than our 6-layer-decoder per-pixel classification baseline PerPixelBaseline+. For panoptic segmentation, however, the number of decoder layers is more important. We hypothesize that stacking more decoder layers helps to de-duplicate predictions, which is required by the panoptic segmentation task.

To verify this hypothesis, we train MaskFormer models without self-attention in all 6 Transformer decoder layers. On semantic segmentation, we observe that MaskFormer without self-attention performs similarly well in terms of the mIoU metric; however, the per-mask metric PQ ${}^{\text{St}}$ is slightly worse. On panoptic segmentation, MaskFormer models without self-attention perform worse across all metrics.

“Semantic” queries vs. “panoptic” queries. In Figure I we visualize predictions for the “car” category from MaskFormer trained with semantic-level and instance-level ground truth data. In the case of semantic-level data, the matching cost and loss used for mask prediction force a single query to predict one mask that combines all cars together. In contrast, with instance-level ground truth, MaskFormer uses different queries to make mask predictions for each car. This observation suggests that our model has the capacity to adapt to different types of tasks given different ground truth annotations.

Appendix E Visualization

We visualize sample semantic segmentation predictions of the MaskFormer model with Swin-L backbone (55.6 mIoU) on the ADE20K validation set in Figure II.