MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers

Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

Introduction

The goal of panoptic segmentation is to predict a set of non-overlapping masks along with their corresponding class labels. Modern panoptic segmentation methods address this mask prediction problem by approximating the target task with multiple surrogate sub-tasks. For example, Panoptic-FPN adopts a ‘box-based pipeline’ with three levels of surrogate sub-tasks, illustrated as a tree structure in \figreffig:surrogate. Each level of this proxy tree involves manually designed modules, such as anchors, box assignment rules, non-maximum suppression (NMS), thing-stuff merging, etc. Although there are good solutions to individual surrogate sub-tasks and modules, undesired artifacts are introduced when these sub-tasks are assembled into a pipeline for panoptic segmentation, especially under challenging conditions (\figreffig:teaser).

Recent work on panoptic segmentation has attempted to simplify this box-based pipeline. For example, UPSNet proposes a parameter-free panoptic head, permitting back-propagation to both semantic and instance segmentation modules. More recently, DETR presented an end-to-end approach for box detection, which is used to replace detectors in panoptic segmentation, but the whole training process of DETR still relies heavily on the box detection task.

Another line of work removes boxes from the pipeline entirely, which aligns better with the mask-based definition of panoptic segmentation. The state-of-the-art method in this regime, Axial-DeepLab, along with other box-free methods, predicts pixel-wise offsets to pre-defined instance centers. But this center-based surrogate sub-task makes it challenging to deal with highly deformable objects, or nearby objects with close centers. As a result, box-free methods do not perform as well as box-based methods on the challenging COCO dataset.

In this paper, we streamline the panoptic segmentation pipeline with an end-to-end approach. Inspired by DETR, our model directly predicts a set of non-overlapping masks and their corresponding semantic labels with a mask transformer. The output masks and classes are optimized with a panoptic quality (PQ) style objective. Specifically, inspired by the definition of PQ, we define a similarity metric between two class-labeled masks as the product of their mask similarity and their class similarity. Our model is trained by maximizing this similarity between ground truth masks and predicted masks via one-to-one bipartite matching. This direct modeling of panoptic segmentation enables end-to-end training and inference, removing the hand-coded priors that are necessary in existing box-based and box-free methods (\tabreftab:intro). Our method is dubbed MaX-DeepLab for extending Axial-DeepLab with a Mask Xformer.

To complement direct training and inference, we equip our mask transformer with a novel architecture. Instead of stacking a traditional transformer on top of a Convolutional Neural Network (CNN), we propose a dual-path framework for combining CNNs with transformers. Specifically, we enable any CNN layer to read and write a global memory, using our dual-path transformer block. This block supports all types of attention between the CNN-path and the memory-path, including memory-path self-attention (M2M), pixel-path axial self-attention (P2P), memory-to-pixel attention (M2P), and finally pixel-to-memory attention (P2M). The transformer block can be inserted anywhere in a CNN, enabling communication with the global memory at any layer. Besides this communication module, our MaX-DeepLab employs a stacked-hourglass-style decoder. The decoder aggregates multi-scale features into a high-resolution output, which is then multiplied with the global memory feature to form our mask set prediction. The classes for the masks are predicted with another branch of the mask transformer.

We evaluate MaX-DeepLab on one of the most challenging panoptic segmentation datasets, COCO, against the state-of-the-art box-free method, Axial-DeepLab, and the state-of-the-art box-based method, DetectoRS (\figreffig:teaser). Our MaX-DeepLab, without test time augmentation (TTA), achieves the state-of-the-art result of 51.3% PQ on the test-dev set. This result surpasses Axial-DeepLab (with TTA) by 7.1% PQ in the box-free regime, and outperforms DetectoRS (with TTA) by 1.7% PQ, bridging the gap between box-based and box-free methods for the first time. For a fair comparison with DETR, we also evaluate a lightweight model, MaX-DeepLab-S, that matches the number of parameters and M-Adds of DETR. We observe that MaX-DeepLab-S outperforms DETR by 3.3% PQ on the val set and 3.0% PQ on the test-dev set. In addition, we perform extensive ablation studies and analyses on our end-to-end formulation, model scaling, dual-path architectures, and our loss functions. We also notice that the extra-long training schedule of DETR is not necessary for MaX-DeepLab.

To summarize, our contributions are four-fold:

MaX-DeepLab is the first end-to-end model for panoptic segmentation, inferring masks and classes directly without hand-coded priors like object centers or boxes.

We propose a training objective that optimizes a PQ-style loss function via a PQ-style bipartite matching between predicted masks and ground truth masks.

Our dual-path transformer enables CNNs to read and write a global memory at any layer, providing a new way of combining transformers with CNNs.

MaX-DeepLab closes the gap between box-based and box-free methods and sets a new state-of-the-art on COCO, even without using test time augmentation.

Related Work

Transformers, first introduced for neural machine translation, have advanced the state-of-the-art in many natural language processing tasks. Attention, as the core component of Transformers, was developed to capture both correspondence of tokens across modalities and long-range interactions in a single context (self-attention). Later, the complexity of transformer attention was reduced by introducing local or sparse attention, together with a global memory. The global memory, which inspires our dual-path transformer, recovers long-range context by propagating information globally.

Transformers and attention have also been applied to computer vision, by combining non-local modules with CNNs or by applying self-attention only. Both classes of methods have boosted various vision tasks such as image classification, object detection, semantic segmentation, video recognition, image generation, and panoptic segmentation. It is worth mentioning that DETR stacked a transformer on top of a CNN for end-to-end object detection.

Box-based panoptic segmentation.

Most panoptic segmentation models, such as Panoptic FPN, follow a box-based approach that detects object bounding boxes and predicts a mask for each box, usually with a Mask R-CNN and FPN. Then, the instance segments (‘thing’) and semantic segments (‘stuff’) are fused by merging modules to generate panoptic segmentation. For example, UPSNet developed a parameter-free panoptic head, which facilitates unified training and inference. Recently, DETR extended box-based methods with its transformer-based end-to-end detector, and DetectoRS advanced the state-of-the-art with recursive feature pyramid and switchable atrous convolution.

Box-free panoptic segmentation.

Contrary to box-based approaches, box-free methods typically start with semantic segments. Then, instance segments are obtained by grouping ‘thing’ pixels with various methods, such as instance center regression, Watershed transform, Hough-voting, or pixel affinity. Recently, Axial-DeepLab advanced the state-of-the-art by equipping Panoptic-DeepLab with a fully axial-attention backbone. In this work, we extend Axial-DeepLab with a mask transformer for end-to-end panoptic segmentation.

Method

In this section, we describe how MaX-DeepLab directly predicts class-labeled masks for panoptic segmentation, followed by the PQ-style loss used to train the model. Then, we introduce our dual-path transformer architecture as well as the auxiliary losses that are helpful in training.

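The goal of panoptic segmentation is to segment the image $I$ into a set of class-labeled masks:

$$\{y_{i}\}_{i=1}^{K}=\{(m_{i},c_{i})\}_{i=1}^{K}.$$

The $K$ ground truth masks $m_{i}\in{\{0,1\}}^{H\times W}$ do not overlap with each other, i.e., $\sum_{i=1}^{K}m_{i}\leq 1^{H\times W}$, and $c_{i}$ denotes the ground truth class label of mask $m_{i}$.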

Our MaX-DeepLab directly predicts outputs in the exact same form as the ground truth. MaX-DeepLab segments the image $I$ into a fixed-size set of class-labeled masks:

$$\{\hat{y}_{i}\}_{i=1}^{N}=\{(\hat{m}_{i},\hat{p}_{i}(c))\}_{i=1}^{N}.$$

End-to-end inference of MaX-DeepLab is enabled by adopting the same formulation for both ground truth definition and model prediction. As a result, the final panoptic segmentation prediction is obtained by simply performing argmax twice. Specifically, the first argmax predicts a class label for each mask:

$$\hat{c}_{i}=\arg\max_{c}\ \hat{p}_{i}(c).$$

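The other argmax assigns a mask-ID $\hat{z}_{h,w}$ to each pixel:

$$\hat{z}_{h,w}=\arg\max_{i}\ \hat{m}_{i,h,w},\quad\forall h\in\{1,\dots,H\},\ \forall w\in\{1,\dots,W\}.$$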

In practice, we filter each argmax with a confidence threshold: masks or pixels with low confidence are removed, as described in \secrefsec:exp. In this way, MaX-DeepLab infers panoptic segmentation directly, dispensing with the manually designed post-processing, e.g., NMS and thing-stuff merging, found in almost all previous methods. Besides, MaX-DeepLab does not rely on hand-crafted priors such as anchors, object boxes, or instance mass centers.
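The two argmax steps and the confidence filtering can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released implementation: the function name, the tensor layout, and the use of `-1` for filtered pixels are assumptions, and the default thresholds are simply the inference settings quoted in the experiments.

```python
import numpy as np

def panoptic_inference(mask_probs, class_probs, void_class,
                       class_thresh=0.7, pixel_thresh=0.4):
    """Hypothetical helper illustrating the double-argmax inference.

    mask_probs:  (N, H, W) per-pixel distribution over the N mask slots.
    class_probs: (N, C) per-slot class distribution (C includes the void class).
    Returns a mask-ID map and a class map, both of shape (H, W).
    """
    # First argmax: one class label (and its confidence) per mask slot.
    mask_class = class_probs.argmax(axis=1)          # (N,)
    class_conf = class_probs.max(axis=1)             # (N,)

    # Second argmax: one mask-ID (and its confidence) per pixel.
    mask_id = mask_probs.argmax(axis=0)              # (H, W)
    pixel_conf = mask_probs.max(axis=0)              # (H, W)

    # Drop slots that are unconfident or that predict "no object".
    keep_slot = (class_conf >= class_thresh) & (mask_class != void_class)

    # A pixel is kept only if its slot is kept and its own confidence is high.
    valid = keep_slot[mask_id] & (pixel_conf >= pixel_thresh)

    mask_id_out = np.where(valid, mask_id, -1)       # -1 marks filtered pixels
    class_out = np.where(valid, mask_class[mask_id], void_class)
    return mask_id_out, class_out
```

No NMS, box decoding, or thing-stuff merging appears anywhere in this path; the area-based filtering mentioned in the experiments would be an additional pass over `mask_id_out`.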

PQ-style loss

In addition to simple inference, MaX-DeepLab enables end-to-end training as well. In this section, we introduce how we train MaX-DeepLab with our PQ-style loss, which draws inspiration from the definition of panoptic quality (PQ). This evaluation metric of panoptic segmentation, PQ, is defined as the product of a recognition quality (RQ) term and a segmentation quality (SQ) term:

$$\text{PQ}=\text{RQ}\times\text{SQ}.$$

Based on this decomposition of PQ, we design our objective in the same manner: First, we define a PQ-style similarity metric between a class-labeled ground truth mask and a predicted mask. Next, we show how we match a predicted mask to each ground truth mask with this metric, and finally how to optimize our model with the same metric.
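For reference, the RQ and SQ factors can be computed as below for a single class. The sketch follows the standard PQ definition (matched segments are those with IoU above 0.5; SQ is the mean IoU of matches and RQ is an F1-style detection score); the function name and argument layout are hypothetical, and the per-class averaging used in benchmark evaluation is omitted.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    tp_ious: IoU values of matched (true-positive) segment pairs.
    num_fp:  number of unmatched predicted segments.
    num_fn:  number of unmatched ground truth segments.
    """
    num_tp = len(tp_ious)
    if num_tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    # Segmentation quality: average IoU over matched pairs.
    sq = sum(tp_ious) / num_tp if num_tp else 0.0
    # Recognition quality: F1-style score over matches.
    rq = num_tp / (num_tp + 0.5 * num_fp + 0.5 * num_fn)
    return sq * rq, sq, rq
```

For example, two matches with IoUs 0.8 and 0.6, one false positive, and one false negative give SQ = 0.7 and RQ = 2/3, so PQ is their product.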

Our mask similarity metric ${\rm sim}(\cdot,\cdot)$ between a class-labeled ground truth mask $y_{i}=(m_{i},c_{i})$ and a prediction $\hat{y}_{j}=(\hat{m}_{j},\hat{p}_{j}(c))$ is defined as

$${\rm sim}(y_{i},\hat{y}_{j})=\hat{p}_{j}(c_{i})\times{\rm Dice}(m_{i},\hat{m}_{j}),$$

where $\hat{p}_{j}(c_{i})\in[0,1]$ is the probability of predicting the correct class (recognition quality) and ${\rm Dice}(m_{i},\hat{m}_{j})\in[0,1]$ is the Dice coefficient between a predicted mask $\hat{m}_{j}$ and a ground truth $m_{i}$ (segmentation quality). The two terms are multiplied together, analogous to the decomposition of PQ.

This mask similarity metric has a lower bound of 0, which means either the class prediction is incorrect, OR the two masks do not overlap with each other. The upper bound, 1, however, is only achieved when the class prediction is correct AND the mask is perfect. The AND gating enables this metric to serve as a good optimization objective for both model training and mask matching.
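A minimal NumPy sketch of this metric is given below, assuming the usual soft Dice coefficient $2\,|m\cap\hat{m}|/(|m|+|\hat{m}|)$. The function names and the small `eps` stabilizer are illustrative choices rather than details from the paper.

```python
import numpy as np

def dice(m, m_hat, eps=1e-8):
    """Soft Dice coefficient between a binary mask m and a soft mask m_hat."""
    intersection = (m * m_hat).sum()
    return 2.0 * intersection / (m.sum() + m_hat.sum() + eps)

def similarity(m, c, m_hat, p_hat):
    """Class probability of the true class times mask Dice.

    m:     (H, W) ground truth mask, c: its integer class label.
    m_hat: (H, W) predicted mask, p_hat: (C,) predicted class distribution.
    """
    return p_hat[c] * dice(m, m_hat)
```

The product acts as the AND gate described above: the value is near 1 only if both factors are near 1, and it drops to 0 as soon as either the class probability or the mask overlap is 0.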

Mask matching.

In order to assign a predicted mask to each ground truth, we solve a one-to-one bipartite matching problem between the prediction set $\{\hat{y}_{i}\}_{i=1}^{N}$ and the ground truth set $\{y_{i}\}_{i=1}^{K}$. Formally, we search for a permutation of $N$ elements $\sigma\in\mathfrak{S}_{N}$ that best assigns the predictions to achieve the maximum total similarity to the ground truth:

$$\hat{\sigma}=\arg\max_{\sigma\in\mathfrak{S}_{N}}\ \sum_{i=1}^{K}{\rm sim}(y_{i},\hat{y}_{\sigma(i)}).$$

The optimal assignment is computed efficiently with the Hungarian algorithm, following prior work. We refer to the $K$ matched predictions as positive masks, which will be optimized to predict the corresponding ground truth masks and classes. The remaining $(N-K)$ masks are negatives, which should predict the $\varnothing$ class (no object).

Our one-to-one matching is similar to DETR, but with a different purpose: DETR allows only one positive match in order to remove duplicated boxes in the absence of NMS, while in our case, duplicated or overlapping masks are precluded by design. Assigning multiple predicted masks to one ground truth mask would nevertheless be problematic for us too, because multiple masks cannot possibly be optimized to fit a single ground truth mask at the same time. In addition, our one-to-one matching is consistent with the PQ metric, where only one predicted mask can theoretically match (i.e., have an IoU over 0.5) with each ground truth mask.
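To make the matching objective concrete, the sketch below finds the assignment that maximizes total similarity by exhaustive search over a small similarity matrix. Exhaustive search is a deliberate stand-in for readability: it is only feasible for tiny $K$ and $N$, whereas the method described here uses the Hungarian algorithm, which returns the same optimum in polynomial time. The function name and return format are hypothetical.

```python
import itertools
import numpy as np

def match_masks(sim):
    """Brute-force one-to-one matching.

    sim: (K, N) array with sim[i, j] = similarity between ground truth i
         and prediction j, where K <= N.
    Returns (sigma, score): sigma[i] is the prediction matched to ground
    truth i, and score is the total similarity of that assignment.
    """
    num_gt, num_pred = sim.shape
    best_sigma, best_score = None, -np.inf
    # Enumerate every injective assignment of ground truths to predictions.
    for sigma in itertools.permutations(range(num_pred), num_gt):
        score = sum(sim[i, j] for i, j in enumerate(sigma))
        if score > best_score:
            best_sigma, best_score = sigma, score
    return best_sigma, best_score

sim = np.array([[0.90, 0.80, 0.10],
                [0.85, 0.10, 0.20]])
sigma, score = match_masks(sim)
# Predictions not in sigma are the negatives that should predict "no object".
negatives = sorted(set(range(sim.shape[1])) - set(sigma))
```

In this toy matrix a greedy choice would give prediction 0 to the first ground truth (0.90) and leave the second with at most 0.20, while the optimal assignment gives prediction 1 to the first and prediction 0 to the second.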

PQ-style loss.

Given our mask similarity metric and the mask matching process based on this metric, it is straightforward to optimize model parameters $\theta$ by maximizing this same similarity metric over matched (i.e., positive) masks. Writing $\hat{\sigma}$ for the optimal assignment found by the matching step, we solve

$$\max_{\theta}\ \sum_{i=1}^{K}{\rm sim}(y_{i},\hat{y}_{\hat{\sigma}(i)}).$$

Substituting the similarity metric (\equrefeq:similarity) gives our PQ-style objective ${\cal O}_{\rm PQ}^{\rm pos}$ to be maximized for positive masks:

$${\cal O}_{\rm PQ}^{\rm pos}=\sum_{i=1}^{K}\hat{p}_{\hat{\sigma}(i)}(c_{i})\times{\rm Dice}(m_{i},\hat{m}_{\hat{\sigma}(i)}).$$

In practice, we rewrite ${\cal O}_{\rm PQ}^{\rm pos}$ into two common loss terms by applying the product rule of gradient and then changing a probability $\hat{p}$ to a log probability $\log\hat{p}$. The change from $\hat{p}$ to $\log\hat{p}$ aligns with the common cross-entropy loss and scales gradients better in practice for optimization:

$${\cal L}_{\rm PQ}^{\rm pos}=\sum_{i=1}^{K}\underbrace{\hat{p}_{\hat{\sigma}(i)}(c_{i})}_{\text{weight}}\cdot\big[-{\rm Dice}(m_{i},\hat{m}_{\hat{\sigma}(i)})\big]+\sum_{i=1}^{K}\underbrace{{\rm Dice}(m_{i},\hat{m}_{\hat{\sigma}(i)})}_{\text{weight}}\cdot\big[-\log\hat{p}_{\hat{\sigma}(i)}(c_{i})\big],$$

where the loss weights are constants (i.e., no gradient is passed to them). This reformulation provides insights by bridging our objective with common loss functions: our PQ-style loss is equivalent to optimizing a Dice loss weighted by the class correctness and optimizing a cross-entropy loss weighted by the mask correctness. The logic behind this loss is intuitive: we want both the mask and the class to be correct at the same time. For example, if a mask is far off the target, it is a false negative anyway, so we disregard its class. This intuition aligns with the down-weighting of class losses for wrong masks, and vice versa.

Apart from the ${\cal L}_{\rm PQ}^{\rm pos}$ for positive masks, we define a cross-entropy term ${\cal L}_{\rm PQ}^{\rm neg}$ for negative (unmatched) masks:

$${\cal L}_{\rm PQ}^{\rm neg}=\sum_{i=K+1}^{N}\big[-\log\hat{p}_{\hat{\sigma}(i)}(\varnothing)\big].$$

This term trains the model to predict $\varnothing$ for negative masks. We balance the two terms by $\alpha$, as is common practice for weighting positive and negative samples:

$${\cal L}_{\rm PQ}=\alpha{\cal L}_{\rm PQ}^{\rm pos}+(1-\alpha){\cal L}_{\rm PQ}^{\rm neg},$$

where ${\cal L}_{\rm PQ}$ denotes our final PQ-style loss.
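The loss value can be sketched numerically as follows, under stated assumptions: the positive term is taken to be a class-probability-weighted Dice loss plus a Dice-weighted cross-entropy, the negative term is a cross-entropy towards the void class, and the two are mixed with $\alpha$ and $(1-\alpha)$. The function name and argument layout are hypothetical, and plain NumPy has no gradients, so the "constant weight" behaviour is only noted in comments.

```python
import numpy as np

def pq_style_loss(dice_vals, p_correct, p_void_neg, alpha=0.75):
    """Illustrative PQ-style loss value.

    dice_vals:  (K,) Dice between each ground truth and its matched mask.
    p_correct:  (K,) predicted probability of the true class for matched masks.
    p_void_neg: (N-K,) predicted void-class probability for unmatched masks.
    """
    # In an autodiff framework the weights (p_correct in the first term,
    # dice_vals in the second) would be wrapped in a stop-gradient.
    dice_term = np.sum(p_correct * (-dice_vals))         # class-weighted Dice loss
    ce_term = np.sum(dice_vals * (-np.log(p_correct)))   # Dice-weighted cross-entropy
    loss_pos = dice_term + ce_term

    # Unmatched slots are pushed towards the "no object" class.
    loss_neg = np.sum(-np.log(p_void_neg))

    return alpha * loss_pos + (1.0 - alpha) * loss_neg
```

With a perfect prediction (Dice 1, class probability 1, void probability 1 for negatives) the value is $-\alpha$, its minimum; a matched mask with Dice near 0 contributes almost nothing to the cross-entropy term, which mirrors the down-weighting argument in the text.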

MaX-DeepLab Architecture

As shown in \figreffig:architecture, the MaX-DeepLab architecture includes a dual-path transformer, a stacked decoder, and output heads that predict the masks and classes.

Instead of stacking a transformer on top of a CNN, we integrate the transformer and the CNN in a dual-path fashion, with bidirectional communication between the two paths. Specifically, we augment a 2D pixel-based CNN with a 1D global memory of size $N$ (i.e., the total number of predictions) and propose a transformer block as a drop-in replacement for any CNN block or an add-on for a pretrained CNN block. Our transformer block enables all four possible types of communication between the 2D pixel-path CNN and the 1D memory-path: (1) the traditional memory-to-pixel (M2P) attention, (2) memory-to-memory (M2M) self-attention, (3) pixel-to-memory (P2M) feedback attention that allows pixels to read from the memory, and (4) pixel-to-pixel (P2P) self-attention, implemented as axial-attention blocks. We select axial-attention rather than global 2D attention for efficiency on high-resolution feature maps. One could optionally approximate the pixel-to-pixel self-attention with a convolutional block that only allows local communication. This transformer design, with a memory path alongside the main CNN path, is termed the dual-path transformer. Unlike previous work, it allows transformer blocks to be inserted anywhere in the backbone at any resolution. In addition, the P2M feedback attention enables the pixel-path CNN to refine its features given the memory-path features that encode mask information.

In the memory-path attention, a single softmax is performed over the concatenated pixel and memory dimension of size $(\hat{H}\hat{W}+N)$, inspired by ETC.
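The single softmax over the concatenated dimension can be illustrated as follows. This is a simplified, single-head sketch: the learned query/key/value projections, multi-head splitting, and any scaling factor are omitted, and all names are hypothetical. It shows only how memory queries attend jointly to pixel keys (M2P) and memory keys (M2M) with one normalization.

```python
import numpy as np

def memory_path_attention(q_m, k_p, v_p, k_m, v_m):
    """Memory queries attend to pixels and memory with one softmax.

    q_m: (N, d)  queries from the memory path.
    k_p: (P, d), v_p: (P, dv)  keys/values from P = H*W flattened pixels.
    k_m: (N, d), v_m: (N, dv)  keys/values from the memory path.
    Returns the attended memory features (N, dv) and the weights (N, P + N).
    """
    keys = np.concatenate([k_p, k_m], axis=0)       # (P + N, d)
    values = np.concatenate([v_p, v_m], axis=0)     # (P + N, dv)

    logits = q_m @ keys.T                           # (N, P + N)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # one softmax over P + N

    return weights @ values, weights
```

Because the normalization is shared, each memory slot distributes a single unit of attention between pixels and other memory slots, rather than normalizing the M2P and M2M terms separately.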

Stacked decoder.

Unlike previous work that uses a light-weight decoder, we explore stronger hourglass-style stacked decoders. As shown in \figreffig:architecture, our decoder is stacked $L$ times, traversing output strides (4, 8, and 16) multiple times. At each decoding resolution, features are fused by simple summation after bilinear resizing. Then, convolutional blocks or transformer blocks are applied, before the decoder feature is sent to the next resolution. This stacked decoder is similar to feature pyramid networks designed for pyramidal anchor predictions, but our purpose here is only to aggregate multi-scale features, i.e., intermediate pyramidal features are not directly used for prediction.

Output heads.

In practice, we use batch norm on $f$ and ($f\cdot g$) to avoid deliberate initialization, and we bilinearly upsample the mask prediction $\hat{m}$ to the original image resolution. Finally, the combination $\{(\hat{m}_{i},\hat{p}_{i}(c))\}_{i=1}^{N}$ is our mask transformer output, used to generate panoptic results as introduced in \secrefsec:model.

Our mask prediction head is inspired by CondInst and SOLOv2, which extend dynamic convolution to instance segmentation. However, unlike our end-to-end method, these methods require hand-designed object centers and assignment rules for instance segmentation, and a thing-stuff merging module for panoptic segmentation.
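The mask head can be illustrated with a short sketch in which the memory-path mask feature $f$ is multiplied with the decoder feature $g$ and the result is normalized across the $N$ slots. The shapes, the function name, and the softmax over slots are assumptions made for illustration (the latter chosen so that the predicted masks are mutually exclusive per pixel); batch normalization and bilinear upsampling are omitted.

```python
import numpy as np

def mask_head(f, g):
    """Illustrative mask prediction from mask features and decoder features.

    f: (N, D) one D-dimensional mask feature per memory slot.
    g: (D, H, W) decoder feature map with D channels.
    Returns (N, H, W) soft masks that sum to 1 over the N slots at each pixel.
    """
    # Inner product of every slot feature with every pixel feature.
    logits = np.einsum('nd,dhw->nhw', f, g)
    # Softmax across slots: each pixel is softly assigned to one mask.
    logits = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)
```

Viewed this way, each row of `f` acts as a dynamically generated 1×1 convolution kernel applied to `g`, which is the connection to dynamic convolution noted above.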

Auxiliary losses

In addition to the PQ-style loss (\secrefsec:pqloss), we find it beneficial to incorporate auxiliary losses in training. Specifically, we propose a pixel-wise instance discrimination loss that helps cluster decoder features into instances. We also use a per-pixel mask-ID cross-entropy loss that classifies each pixel into $N$ masks, and a semantic segmentation loss. Our total loss function thus consists of the PQ-style loss ${\cal L}_{\rm PQ}$ and these three auxiliary losses.

This gives us $K$ instance embeddings $\{t_{i,:}\}_{i=1}^{K}$ representing $K$ ground truth masks. Then, we let each pixel feature $g_{:,h,w}$ perform an instance discrimination task, i.e., each pixel should correctly identify which mask embedding (out of $K$) it belongs to, as annotated by the ground truth masks. The contrastive loss at a pixel $(h,w)$ is written as:

$${\cal L}_{h,w}^{\rm InstDis}=\sum_{i=1}^{K}m_{i,h,w}\left[-\log\frac{\exp(t_{i,:}\cdot g_{:,h,w}/\tau)}{\sum_{j=1}^{K}\exp(t_{j,:}\cdot g_{:,h,w}/\tau)}\right],$$

where $\tau$ denotes the temperature, and note that $m_{i,h,w}$ is non-zero only when pixel $(h,w)$ belongs to the ground truth mask $m_{i}$ . In practice, this per-pixel loss is applied to all instance pixels in an image, encouraging features from the same instance to be similar and features from different instances to be distinct, in a contrastive fashion, which is exactly the property required for instance segmentation.
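A compact NumPy sketch of this per-pixel contrastive loss is shown below. It assumes that each instance embedding is the L2-normalized sum of the (already normalized) decoder features inside the corresponding ground truth mask, and that the loss is averaged over labeled pixels; both choices, as well as the function name, are assumptions for illustration. The default temperature is the value quoted in the experimental settings.

```python
import numpy as np

def instance_discrimination_loss(g, masks, tau=0.3):
    """Illustrative pixel-wise instance discrimination loss.

    g:     (D, H, W) L2-normalized decoder features.
    masks: (K, H, W) binary, non-overlapping ground truth masks.
    """
    num_channels = g.shape[0]
    g_flat = g.reshape(num_channels, -1)                  # (D, P)
    m_flat = masks.reshape(masks.shape[0], -1).astype(float)  # (K, P)

    # One embedding per ground truth mask: normalized sum of its pixel features.
    t = m_flat @ g_flat.T                                 # (K, D)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-8)

    # Each pixel classifies itself into one of the K embeddings.
    logits = (t @ g_flat) / tau                           # (K, P)
    logits = logits - logits.max(axis=0, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))

    # Cross-entropy against the mask each pixel actually belongs to.
    per_pixel = -(m_flat * log_prob).sum(axis=0)          # (P,)
    labeled = m_flat.sum(axis=0) > 0
    return per_pixel[labeled].mean()
```

When features inside each mask are identical and features of different masks are orthogonal, the loss is small; when all pixels share the same feature, the $K$ embeddings coincide and the loss equals $\log K$.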

Our instance discrimination loss is inspired by previous works. However, they discriminate instances either in an unsupervised manner or with image classes, whereas we perform a pixel-wise instance discrimination task, as annotated by panoptic segmentation ground truth.

Mask-ID cross-entropy.

In \equrefeq:pixelargmax, we describe how we infer the mask-ID map given our mask prediction. In fact, we can train this per-pixel classification task by applying a cross-entropy loss on it. This is consistent with the literature that uses a cross-entropy loss together with a dice loss to learn better segmentation masks.
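As a small illustration, the per-pixel classification loss can be written as below. The sketch assumes that the target for each pixel is the index of the prediction slot matched to the ground truth mask covering that pixel, with `-1` marking unlabeled pixels; the function name and this encoding are hypothetical.

```python
import numpy as np

def mask_id_cross_entropy(mask_probs, target_id):
    """Illustrative per-pixel mask-ID cross-entropy.

    mask_probs: (N, H, W) per-pixel distribution over the N mask slots.
    target_id:  (H, W) integer slot index per pixel, or -1 if unlabeled.
    """
    rows, cols = np.nonzero(target_id >= 0)
    # Probability assigned to the correct slot at every labeled pixel.
    p_correct = mask_probs[target_id[rows, cols], rows, cols]
    return -np.log(p_correct + 1e-12).mean()
```

This is the training counterpart of the per-pixel argmax used at inference: the argmax picks the slot with the largest probability, and the loss raises the probability of the correct slot.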

Semantic segmentation.

We also use an auxiliary semantic segmentation loss to help capture per-pixel semantic features. Specifically, we apply a semantic head on top of the backbone if no stacked decoder is used (i.e., $L=0$). Otherwise, we connect the semantic head to the first decoder output at stride 4, because we find it helpful to separate the final mask feature $g$ from semantic segmentation.

Experiments

We report our main results on COCO, comparing with state-of-the-art methods. Then, we provide a detailed ablation study on the architecture variants and losses. Finally, we analyze how MaX-DeepLab works with visualizations.

Most of our default settings follow Axial-DeepLab. Specifically, we train our models with 32 TPU cores for 100k (400k for main results) iterations (54 epochs), a batch size of 64, RAdam with Lookahead, a ‘poly’ schedule learning rate of $10^{-3}$ ($3\times 10^{-4}$ for MaX-DeepLab-L), a backbone learning rate multiplier of 0.1, a weight decay of $10^{-4}$, and a drop path rate of 0.2. We resize and pad images to 641 $\times$ 641 (1025 $\times$ 1025 for main results) for inference and M-Adds calculation. During inference, we set masks with class confidence below 0.7 to void and filter pixels with mask-ID confidence below 0.4. Finally, following previous work, we filter stuff masks with an area limit of 4096 pixels, and instance masks with a limit of 256 pixels. In training, we set our PQ-style loss weight (\equrefeq:pqloss, normalized by $N$) to 3.0, with $\alpha=0.75$. Our instance discrimination uses $\tau=0.3$ and a weight of 1.0. We set the mask-ID cross-entropy weight to 0.3, and the semantic segmentation weight to 1.0. We use an output size $N=128$ and $D=128$ channels. We fill the initial memory with learnable weights (more details and architectures in \secrefsec:appendix_details).

Main results

We present our main results on the COCO val set and test-dev set, with a small model, MaX-DeepLab-S, and a large model, MaX-DeepLab-L.

MaX-DeepLab-S augments ResNet-50 with axial-attention blocks in the last two stages. After pretraining, we replace the last stage with dual-path transformer blocks and use an $L=0$ (not stacked) decoder. We match parameters and M-Adds to DETR-R101 for a fair comparison.

MaX-DeepLab-L stacks an $L=2$ decoder on top of Wide-ResNet-41. We replace all stride 16 residual blocks with our dual-path transformer blocks with wide axial-attention blocks. This large variant is meant to be compared with state-of-the-art results.

In \tabreftab:coco_val, we report our validation set results and compare with both box-based and box-free panoptic segmentation methods. As shown in the table, our single-scale MaX-DeepLab-S already outperforms all other box-free methods by a large margin of more than 4.5% PQ, regardless of whether the other methods use test time augmentation (TTA, usually flipping and multi-scale). Specifically, it surpasses single-scale Panoptic-DeepLab by 8.7% PQ, and single-scale Axial-DeepLab by 5.0% PQ with similar M-Adds. We also compare MaX-DeepLab-S with DETR, which is based on an end-to-end detector, in a controlled setting with a similar number of parameters and M-Adds. Our MaX-DeepLab-S outperforms DETR by 3.3% PQ in this fair comparison. Next, we scale up MaX-DeepLab to a wider variant with a stacked decoder, MaX-DeepLab-L. This scaling further improves the single-scale performance to 51.1% PQ, outperforming multi-scale Axial-DeepLab by 7.2% PQ with similar inference M-Adds.

Test-dev set.

Our improvements on the val set transfer well to the test-dev set, as shown in \tabreftab:coco_test. On the test-dev set, we are able to compare with more competitive methods and stronger backbones equipped with group convolution, deformable convolution, or recursive backbone, while we do not use these improvements in our model. In the regime of no TTA, our MaX-DeepLab-S outperforms Axial-DeepLab by 5.4% PQ, and DETR by 3.0% PQ. Our MaX-DeepLab-L without TTA further attains 51.3% PQ, surpassing Axial-DeepLab with TTA by 7.1% PQ. This result also outperforms the best box-based method, DetectoRS with TTA, by 1.7% PQ, closing the large gap between box-based and box-free methods on COCO for the first time. Our MaX-DeepLab sets a new state-of-the-art on COCO, even without using TTA.

Ablation study

In this subsection, we provide more insights by teasing apart the effects of MaX-DeepLab components on the val set. We first define a default baseline setting and then vary each component of it: We augment Wide-ResNet-41 by applying dual-path transformer to all blocks at stride 16, enabling all four types of attention. For faster wall-clock training, we use an $L=0$ (not stacked) decoder and approximate P2P attention with convolutional blocks.

We first study the scaling of MaX-DeepLab in \tabreftab:scaling. We notice that replacing convolutional blocks with axial-attention blocks gives the most improvement. Further changing the input resolution to $1025\times 1025$ improves the performance to 49.4% PQ, with a short 100k schedule (54 epochs). Stacking the decoder once ($L=1$) improves PQ by 1.4%, but further scaling to $L=2$ starts to saturate. Training with more iterations helps convergence, but we find a long schedule less necessary than for DETR, which is trained for 500 epochs.

Dual-path transformer.

Next, we vary attention types of our dual-path transformer and the stages (strides) where we apply transformer blocks. Note that we always apply M2P attention that attaches the transformer to the CNN. And P2P attention is already ablated above. As shown in \tabreftab:transformer, removing our P2M feedback attention causes a drop of 0.7% PQ. On the other hand, we find MaX-DeepLab robust (-0.6% PQ) to the removal of M2M self-attention. We attribute this robustness to our non-overlapping mask formulation. Note that DETR relies on M2M self-attention to remove duplicated boxes. In addition, it is helpful (+1.0% PQ) to apply transformer blocks to stride 8 also, which is impossible for DETR without our dual-path design. Pushing it further to stride 4 does not show more improvements.

Loss ablation.

Finally, we ablate our PQ-style loss and auxiliary losses in \tabreftab:losses. We first switch our PQ-style similarity in \equrefeq:similarity from $\text{RQ}\times\text{SQ}$ to $\text{RQ}+\text{SQ}$, which changes the Hungarian matching (\equrefeq:matching) and removes the dynamic loss weights in \equrefeq:pqlosspos. We observe that $\text{RQ}+\text{SQ}$ works reasonably well, but $\text{RQ}\times\text{SQ}$ improves on it by 0.8% PQ, confirming the effect of our PQ-style loss in practice, besides its conceptual soundness. Next, we vary the auxiliary losses applied to MaX-DeepLab, without tuning loss weights for the remaining losses. Our PQ-style loss alone achieves a reasonable performance of 39.5% PQ. Adding instance discrimination significantly improves PQ$^{\rm Th}$, showing the importance of a clustered feature embedding. Mask-ID prediction shares the same target with the Dice term in \equrefeq:pqlosspos, but helps focus on large masks when the Dice term is overwhelmed by small objects. Combining both of the auxiliary losses leads to a large 5.6% PQ gain. Further multi-tasking with semantic segmentation improves PQ by 0.6%, because its class-level supervision helps stuff classes but not instance-level discrimination for thing classes.

Analysis

We provide more insights into MaX-DeepLab by plotting our training curves and visualizing the mask output head.

We first report the validation PQ curve in \figreffig:valpq, with our default ablation model. MaX-DeepLab converges quickly to around 46% PQ within 100k iterations (54 epochs), roughly 1/10 of the training schedule of DETR. In \figreffig:classsim and \figreffig:masksim, we plot the characteristics of all matched masks in an image. The matched masks tend to have a better class correctness than mask correctness. Besides, we report per-pixel accuracies for instance discrimination (\figreffig:instdis) and mask-ID prediction (\figreffig:maskid). We see that most pixels learn quickly to find their own instances (out of $K$) and predict their own mask-IDs (out of $N$). Only 10% of all pixels predict wrong mask-IDs, but they contribute to most of the PQ error.

Visualization.

In order to intuitively understand the normalized decoder output $g$, the transformer mask feature $f$, and how they are multiplied to generate our mask output $\hat{m}$, we train a MaX-DeepLab with $D=3$ and directly visualize the normalized features as RGB colors. As shown in \figreffig:vis3channels, the decoder feature $g$ assigns similar colors (or feature vectors) to pixels of the same mask, regardless of whether the mask is a thing or stuff, while different masks are colored differently. Such effective instance discrimination (as colorization) facilitates our simple mask extraction with an inner product.

Conclusion

In this work, we have shown for the first time that panoptic segmentation can be trained end-to-end. Our MaX-DeepLab directly predicts masks and classes with a mask transformer, removing the need for many hand-designed priors such as object bounding boxes and thing-stuff merging. Equipped with a PQ-style loss and a dual-path transformer, MaX-DeepLab achieves the state-of-the-art result on the challenging COCO dataset, closing the gap between box-based and box-free methods for the first time.

We would like to thank Maxwell Collins and Sergey Ioffe for their feedback on the paper, Jiquan Ngiam for the Hungarian matching implementation, Siyuan Qiao for DetectoRS segmentation results, Chen Wei for instance discrimination insights, Jieneng Chen for dice loss comments, and Google Mobile Vision for their support. This work is supported by a Google Research Faculty Award.

Appendix A

Similar to the case study in \figreffig:teaser, we provide more panoptic segmentation results of our MaX-DeepLab-L and compare them to the state-of-the-art box-free method, Axial-DeepLab, the state-of-the-art box-based method, DetectoRS, and the first Detection Transformer, DETR, in \figreffig:comparison and \figreffig:failures. MaX-DeepLab demonstrates robustness to the challenging cases of similar object bounding boxes and nearby objects with close centers, while other methods make systematic mistakes because of their individual surrogate sub-task design. MaX-DeepLab also shows exceptional mask quality, and performs well in the cases of many small objects. Similar to DETR, MaX-DeepLab typically fails when there are too many object masks.

A.2 Runtime

In \tabreftab:runtime, we report the end-to-end runtime (i.e., inference time from an input image to final panoptic segmentation) of MaX-DeepLab on a V100 GPU. All results are obtained with (1) a single-scale input without flipping, and (2) the built-in TensorFlow library without extra inference optimization. In the fast regime, MaX-DeepLab-S takes 67 ms with a typical 641 $\times$ 641 input. This runtime includes 5 ms of postprocessing and 15 ms of batch normalization that can be easily optimized. This fast MaX-DeepLab-S not only outperforms DETR-R101, but is also around 2x faster. In the slow regime, the standard MaX-DeepLab-S takes 131 ms with a 1025 $\times$ 1025 input, similar to Panoptic-DeepLab-X71. This runtime is also similar to our run of the official DETR-R101, which takes 128 ms on a V100, including 63 ms for box detection and 65 ms for the heavy mask decoding.

A.3 Mask Output Slot Analysis

In this subsection, we analyze the statistics of all $N=128$ mask prediction slots using MaX-DeepLab-L. In \figreffig:distribution, we visualize the joint distribution of mask slot firings and the classes they predict. We observe that the mask slots have imbalanced numbers of predictions and that they specialize in ‘thing’ classes and ‘stuff’ classes. Similar to this Mask-Class joint distribution, we visualize the Mask-Pixel joint distribution by extracting an average mask for each mask slot, as shown in \figreffig:pixel. Specifically, we resize all COCO validation set panoptic segmentation results to a unit square and take an average of masks that are predicted by each mask slot. We split all mask slots into three categories according to their total firings and visualize mask slots in each category. We observe that besides the class-level specialization, our mask slots also specialize in certain regions of an input image. This observation is similar to DETR, but we do not see the pattern that almost all slots have a mode of predicting large image-wide masks.

A.4 Mask Head Visualization

In \figreffig:vis3channels, we visualize how the mask head works by training a MaX-DeepLab with only $D=3$ decoder feature channels (for visualization purposes only). Although this extreme setting degrades the performance from 45.7% PQ to 37.8% PQ, it enables us to directly visualize the decoder features as RGB colors. Here in \figreffig:maskhead we show more examples using this model, together with the corresponding panoptic segmentation results. We see a similar clustering effect of instance colors, which enables our simple mask extraction with just a matrix multiplication (a.k.a. dynamic convolution).

A.5 Transformer Attention Visualization

We also visualize the M2P attention that connects the transformer to the CNN. Specifically, given an input image from COCO validation set, we first select four output masks of interest from the MaX-DeepLab-L panoptic prediction. Then, we probe the attention weights between the four masks and all the pixels, in the last dual-path transformer block. Finally, we colorize the four attention maps with four colors and visualize them in one figure. This process is repeated for two images and all eight attention heads as shown in \figreffig:attention. We omit our results for the first transformer block since it is mostly flat. This is expected because the memory feature in the first transformer block is unaware of the pixel-path input image at all. Unlike DETR which focuses on object extreme points for detecting bounding boxes, our MaX-DeepLab attends to individual object (or stuff) masks. This mask-attending property makes MaX-DeepLab relatively robust to nearby objects with similar bounding boxes or close mass centers.

A.6 More Technical Details

In \figreffig:axial_block, \figreffig:building_blocks, and \figreffig:archs, we include more details of our MaX-DeepLab architectures. As marked in the figure, we pretrain our model on ImageNet. The pretraining model uses only P2P attention (could be a convolutional residual block or an axial-attention block), without the other three types of attention, the feed-forward network (FFN), or the memory. We directly pretrain with an average pooling followed by a linear layer. This pretrained model is used as a backbone for panoptic segmentation, and it uses the backbone learning rate multiplier we mentioned in \secrefsec:exp. After pretraining the CNN path, we apply (with random initialization) our proposed memory path, including the memory, the three types of attention, the FFNs, the decoding layers, and the output heads for panoptic segmentation. In addition, we employ multi-head attention with 8 heads for all attention operations. In MaX-DeepLab-L, we use shortcuts in the stacked decoder. Specifically, each decoding stage (resolution) is connected to the nearest two previous decoding stage outputs of the same resolution.