Unifying Training and Inference for Panoptic Segmentation

Qizhu Li, Xiaojuan Qi, Philip H. S. Torr

Introduction

As a pixel-wise classification task, panoptic segmentation aims to achieve a seamless semantic understanding of all countable and uncountable objects in a scene - a.k.a. “things” and “stuff” respectively, and delineate the instance boundaries of objects where semantically possible.

While early attempts at tackling panoptic segmentation often resort to two separate networks for instance and semantic segmentation, recent works are able to improve the overall efficiency by constructing the two branches on a single, shared feature extractor, and training the multi-head, multi-task network jointly. However, these works have stopped short of devising an end-to-end pipeline for panoptic segmentation, as they all adopt a post-processing stage with heuristics to combine the different outputs of their multi-task networks, following . Such pipelines suffer from several shortcomings. Firstly, post-processing often requires a time-consuming trial-and-error procedure to mine a good set of hyperparameters, which may need to be repeated for each image domain. As the performance of an algorithm can be quite sensitive to the choice of hyperparameters, how well a method performs can quickly degenerate to a function of the amount of computation resources at its disposal . Secondly, methods without an explicit loss function for panoptic segmentation cannot directly optimise for the ultimate goal. Even with expert knowledge, it is difficult to design an exhaustive set of rules and remedies for all failure modes. An example is shown in Fig. 1 (c): after the heuristic post-processing, the missing part of the car cannot be recovered.

To achieve an end-to-end system, we reckon three challenging steps need to be taken: (1) unify the training and inference, enabling the network to differentiably produce panoptic segmentation during training; (2) embed a data-driven mechanism in the multi-task network whereby imperfect and coarse cues can be cleaned and corrected; (3) design an appropriate loss function to directly optimise the global objective for panoptic segmentation.

To achieve (1) and (2), we propose a novel pipeline using segmentation and localisation cues to predict a coherent panoptic segmentation in an end-to-end manner. At the heart of this pipeline lie a dynamic potential head – a parameter-free stage that represents a dynamic number of panoptic instances, and a dense instance affinity head – a parametrised, efficient, and data-driven module that predicts and utilises the likelihood for any pair of pixels to belong to the same “thing” instance or “stuff” class. These two differentiable heads produce full panoptic segmentation during training and inference, eradicating the train-test logic discrepancy.

Furthermore, to fulfil (3), we propose a panoptic matching loss which computes the loss directly on panoptic segmentations. This objective function, together with the differentiable nature of our proposed panoptic head, enables the network to learn in an end-to-end manner. To the best of our knowledge, our loss is the first to perform online segment matching before computing a cross entropy loss in an end-to-end panoptic segmentation system. The matching step allows training the network with predicted detections, thereby incentivising it to handle imperfect localisation cues. While the idea is not convoluted, our ablation studies (Table C, Supplementary) show that doing so – as opposed to training with ground truth detections – yields performance gains.

By closing the gap between training and inference, the network enjoys improved accuracy in challenging scenarios. As illustrated in Fig. 1, by aggregating panoptic logits across the whole image according to the predicted affinity strengths (Fig. 1e), our parametrised panoptic head is able to fix inaccurate predictions from a previous stage - truncated objects due to imperfect bounding box localisations (Fig. 1c).

Last but not least, thanks to its power of improving coarse panoptic logits, our network achieves competitive performance even without using object mask cues, which are required in most recent approaches . This means our method can offer an additional degree of flexibility in terms of network design, a trait desirable for applications with a limited computation and time budget. On the challenging Cityscapes and COCO datasets, our models set new records for ResNet-50-based networks, achieving panoptic qualities (PQ) of 61.4 and 43.4 respectively.

Related work

Arguably, the problem of panoptic segmentation can be viewed as a combination of instance and semantic segmentation. Indeed, this interpretation has guided many recent works on panoptic segmentation , where it is largely approached as a bi-task problem, and the focus is placed on solving both sub-problems simultaneously and efficiently. Shared features of these works include the use of networks with multiple specialised subnets for each sub-task, and the lack of an explicit objective on panoptic segmentation.

In addition to the inclusion of “stuff” classes, another major difference between panoptic and instance segmentation is that the former requires all pixels to be given a unique label, whereas the latter does not. As a result, “thing” predictions from an off-the-shelf detection-driven instance segmentation network – e.g., Mask-RCNN – cannot be readily inserted into the panoptic prediction, as pixels need to have their conflicting instance labels resolved. Moreover, contradictions between the semantic and instance branches must also be carefully resolved. This prompted recent works to adopt an offline postprocessing step first described in to perform conflict resolution and merger of instance and semantic predictions, based on a set of carefully tuned heuristics. A number of works have also attempted to encourage consistency between semantic and instance predictions by adding a communication mechanism between the two subnets . However, as these proposed changes do not modify the output format of the network, they still rely on postprocessing to produce panoptic predictions. In addition, Liu et al. propose to directly learn the ordering of “thing” instances for conflict resolution . However, this approach does not handle overlapping instances pixel-by-pixel – as it predicts a single ranking score for each instance – and does not reconcile conflicts between “stuff” and “thing”.

A small number of works have attempted to advance towards an end-to-end network with a unified train-test logic. We observe that extends a dynamically instantiated instance segmentation network described in to solve the panoptic segmentation problem. It produces non-overlapping segments by design, and is trained end-to-end, given detections. However, it is prone to failures when objects of the same class are nearby and similarly coloured. Moreover, its Instance CRF suffers from the very small number of trainable parameters (since the compatibility transforms are frozen as the Potts model), and is made less attractive by the need to grid search good kernel variances for the bilateral filters in the message passing step.

Recently, Xiong et al. modify the unary terms of and propose a parameter-free, differentiable panoptic head to fuse semantic and instance segmentation predictions during training. Similar to , it allows a panoptic loss to be directly applied on the fused probabilities. However, in the inference phase, it still resorts to several heuristic strategies (e.g., overlap-based instance mask pruning) and relies on a complex voting mechanism to determine the semantic categories of predicted segments, deviating from a unified training and inference pipeline. Furthermore, the effectiveness of their parameter-free panoptic head heavily depends on the quality of semantic and instance predictions it receives, since it arguably functions as an online heuristic merger due to the absence of learnable weights.

Also pertinent to this work is the extensive research carried out around the techniques of long-range contextual aggregation. Aside from CRF-driven methods , Bertasius et al. propose a semantic segmentation method based on random walks that learns and predicts inter-pixel affinity graphs, and iteratively multiplies the learnt affinity with an initial segmentation until convergence . Lately, another technique, self-attention, has been successful in several vision tasks . However, its quadratic memory and computation complexity have cast doubt over its practicality. To mitigate this problem, Shen et al. suggest invoking the associativity of matrix multiplication to avoid the explicit production of expensive attention maps. This approach effectively reduces the complexity to a linear one, $O(HW)$ , making it suitable for pixel-level labelling tasks.

Albeit sharing certain operational similarities with self-attention and non-local methods , our proposed dense instance affinity head serves a different purpose, and cannot be substituted by directly inserting these operations in the backbone. The aforementioned methods work by enhancing the expressiveness of extracted features, as reflected in the fact that these actions are performed in the feature space, and can generally lead to performance gains for many tasks. In contrast, our proposed instance affinity is not a generic feature enhancer. It is specifically designed and tasked to model the pairwise probability for any two pixels to belong to the same “thing” instance or “stuff” category. This relationship in turn enables our network to revise and resolve conflicts in its predictions. With this purpose in mind, we incorporate insights from to construct a module that is lightweight, learnable, and agnostic to the number of channels, allowing us to model a dynamic number of instances across different images.

Proposed approach

Our proposed network (Fig. 2) consists of four blocks. A shared fully convolutional backbone extracts a set of features. Operating on these features, a semantic segmentation submodule and an object detection submodule produce segmentation and localisation cues, which are fused and revised by the proposed panoptic segmentation submodule. All components are differentiable and trained jointly, end-to-end.

The pipeline starts with a shared fully convolutional backbone, which takes an input image of spatial dimension $H\times W$ , and generates a set of features $\bm{F}$ . In our experiments, we adopt a simple ResNet-FPN backbone that outputs four multi-scale feature maps , following a common practice in prior works . To encourage global consistency, we carry out a squeeze-and-excitation operation on the top-level ResNet feature before producing the first FPN feature. A similar strategy is used in .

3.2 Semantic segmentation submodule

The backbone features $\bm{F}$ are fed into the semantic segmentation submodule to produce a $\frac{H}{d}\times\frac{W}{d}\times(N_{st}+N_{th})$ tensor $\bm{V}$ , where $N_{st}$ and $N_{th}$ are the number of “stuff” and “thing” classes respectively. $V_{i}(l)$ denotes the probability that pixel $p_{i}$ belongs to semantic class $l$ . The spatial dimensions are downsampled by a factor of $d$ to strike a balance between resolution and complexity. We choose $d=4$ in the experiments.

Multiple implementations for this submodule have been proposed in the literature, all showing decent performance . In this work, we modify the design in by inserting a Group Normalisation operation after each convolution, which has been observed to help stabilise training. Please refer to the supplementary for further details.

3.3 Object detection submodule

In parallel, the features $\bm{F}$ are also passed to an object detection submodule, which generates $D$ object detections, consisting of bounding boxes $\bm{B}=\{B_{1},B_{2},B_{3},...,B_{D}\}$ , confidence scores $\bm{s}=\{s_{1},s_{2},s_{3},...,s_{D}\}$ , and predicted classes $\bm{c}=\{c_{1},c_{2},c_{3},...,c_{D}\}$ . Additionally, we add a whole image bounding box for each “stuff” class to the object detection predictions, raising the total number of detections to $D+N_{st}$ . Doing so allows the panoptic submodule to process “things” and “stuff” with a unified architecture.
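To make the unified treatment concrete, the following is a minimal sketch of how the dummy “stuff” boxes could be appended to the detector output. The function name `append_stuff_boxes`, the `(x0, y0, x1, y1)` box layout, and the confidence of 1.0 given to the dummy boxes are illustrative assumptions, not details stated in the text.

```python
import numpy as np

def append_stuff_boxes(boxes, scores, classes, stuff_class_ids, height, width):
    """Append one whole-image box per "stuff" class to D detections.

    boxes: (D, 4) array in (x0, y0, x1, y1) order (assumed layout).
    scores: (D,) confidence scores; classes: (D,) predicted class ids.
    Returns arrays with D + N_st entries.
    """
    n_st = len(stuff_class_ids)
    full_image = np.array([[0, 0, width, height]], dtype=boxes.dtype)
    boxes = np.concatenate([boxes, np.tile(full_image, (n_st, 1))], axis=0)
    # Assumption: dummy "stuff" boxes carry a confidence of 1.0.
    scores = np.concatenate([scores, np.ones(n_st, dtype=scores.dtype)])
    classes = np.concatenate(
        [classes, np.asarray(stuff_class_ids, dtype=classes.dtype)]
    )
    return boxes, scores, classes
```

After this step every channel handled by the panoptic submodule, whether “thing” or “stuff”, is described by the same triplet of box, score, and class.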

Notably, the versatility of the panoptic submodule allows our network to work with or without object masks. When the object detection submodule has the capability to predict instance masks for “things” $\bm{M}=\{M_{1},M_{2},M_{3},...,M_{D}\}$ , they are easily incorporated into the dynamic potential $\bm{\Psi}$ . Details will be given in Sec. 3.4.1.

3.4 Panoptic segmentation submodule

This submodule serves as the mastermind of the pipeline. Receiving cues from the two prior submodules, the panoptic segmentation submodule combines them into a dynamic potential $\bm{\Psi}$ (Sec. 3.4.1) and revises it according to predicted pairwise instance affinities (Sec. 3.4.2), producing the final panoptic segmentations with the same logic in training and inference. This pipeline is illustrated in Fig. 3.

The dynamic potential head functions as an assembly node for segmentation and localisation cues from prior submodules. This head is capable of representing varying numbers of instances as it outputs a dynamic number of channels, one for each object instance or “stuff” class. We present three variants of dynamic head design, as illustrated in Fig. 4. Variant A is proposed in , whereas the mask-free parent of B and C is first described in as the box consistency term. A main difference between variant A and the rest is the absence of detection scores in A. We argue that leveraging detection scores can suppress false positives in the final output, as low-confidence detections are attenuated by their scores. We therefore describe variants B and C in more detail.

Given $(D+N_{st})$ bounding boxes $\bm{B}$ and box classes $\bm{c}$ (including the dummy full-image “stuff” boxes), the head populates each box region with a combination of semantic segmentation probabilities $\bm{V}$ and box confidence scores $\bm{s}$ to produce a dynamic potential $\bm{\Psi}$ with $(D+N_{st})$ channels:
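The display equation is not reproduced in this text. As a minimal sketch of the mask-free case, we assume that channel $k$ is filled inside box $B_{k}$ with $s_{k}V_{i}(c_{k})$ and left at zero elsewhere. This form is consistent with the description above, but it is an assumption rather than a quotation, and the function name `dynamic_potential` is hypothetical.

```python
import numpy as np

def dynamic_potential(V, boxes, scores, classes):
    """Assemble a dynamic potential with one channel per detection.

    V: (H, W, L) semantic segmentation probabilities.
    boxes: (K, 4) integer boxes (x0, y0, x1, y1) on the grid of V,
           including the full-image "stuff" boxes.
    scores: (K,) confidence scores; classes: (K,) class indices into V.
    Returns psi of shape (H, W, K).
    """
    H, W, _ = V.shape
    K = len(boxes)
    psi = np.zeros((H, W, K), dtype=V.dtype)
    for k in range(K):
        x0, y0, x1, y1 = (int(v) for v in boxes[k])
        # Assumed form: score-weighted class probability inside the box,
        # zero outside it.
        psi[y0:y1, x0:x1, k] = scores[k] * V[y0:y1, x0:x1, classes[k]]
    return psi
```

The number of output channels follows the number of detections, which is what allows the representation to vary from image to image.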

Optionally, if provided with object masks $\bm{M}$ , the dynamic potential head can also incorporate them into $\bm{\Psi}$ . Defining $\bm{M}$ to be image-resolution instance masks where the raw masks have been resized to their actual dimensions and pasted to the appropriate spatial locations in the image, the dynamic potential with object masks can be summarised as:

In variants B and C, the operator $\odot$ denotes multiplication and summation respectively. Further analysis of variants B and C is included in the supplementary.
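The difference between the two operators can be illustrated with a single pixel. The sketch below assumes the fused value takes the form $s_{k}\,(V_{i}(c_{k})\odot M_{i}(k))$, which is our reading of the description rather than the exact expression; `fuse_with_mask` and all numbers are hypothetical.

```python
def fuse_with_mask(v, m, score, variant):
    """Combine a semantic probability v and a mask score m for one pixel.

    variant "B" multiplies the two cues, variant "C" sums them.
    The multiplication by the detection score is an assumed form.
    """
    if variant == "B":
        fused = v * m
    elif variant == "C":
        fused = v + m
    else:
        raise ValueError("variant must be 'B' or 'C'")
    return score * fused

# A pixel on an object where the semantic branch is unsure (v = 0.3)
# but the mask branch is confident (m = 0.9), competing with a
# "stuff" channel whose value is left unmodified at 0.6.
thing_b = fuse_with_mask(0.3, 0.9, 0.95, "B")
thing_c = fuse_with_mask(0.3, 0.9, 0.95, "C")
stuff = 0.6
print(thing_b < stuff, thing_c > stuff)
```

In this toy case variant B lets the “stuff” channel win, whereas variant C keeps the “thing” channel on top, which mirrors the trade-off between robustness to false positives and sensitivity to semantic errors.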

3.4.2 Dense instance affinity head

We observe that the dynamic potential $\bm{\Psi}$ often carries conflicts and errors due to imperfect cues from semantic segmentation and object localisation. This motivates the design of this parametrised head, with the aim to enable a data-driven mechanism that resolves and revises the output of the dynamic potential head. The main difficulty with injecting parameters into an instance-level head is the varying number of instances across images, which practically translates to a dynamic number of channels in the input tensor. On the other hand, the fundamental building block of a convolutional neural network – convolution – is designed to handle a fixed number of input channels. This apparent incompatibility has led prior works on panoptic segmentation to use either no parameters at all , or only single scaling factors for entire tensors, which provide limited modelling capacity.

This conundrum can be tackled by driving this head with a pairwise dense instance affinity, which is predicted from data, fully differentiable, and compatible with a dynamic number of input channels. By integrating global information according to the pairwise affinities, it produces the final panoptic segmentation probabilities, from which inference can be trivially made with an argmax operation along the channel dimension. Thus, it is amenable to a direct panoptic loss, an ingredient of an end-to-end network.

To construct the dense instance affinity, this head first extracts from the backbone features $\bm{F}$ a single feature tensor $\bm{Q}$ of dimension $\frac{H}{d}\times\frac{W}{d}\times C$ , where $C$ is the number of feature channels, and $d$ is a downsampling factor. This corresponds to the affinity feature extractor in Fig. 5. The spatial dimensions of $\bm{Q}$ can be easily collapsed to produce a $\frac{HW}{d^{2}}\times C$ feature matrix.

Normally, the pairwise instance affinities $\bm{A}$ – a large $\frac{HW}{d^{2}}\times\frac{HW}{d^{2}}$ matrix – would then be produced by performing a matrix multiplication $\bm{A}=\bm{Q}\bm{Q}^{T}$ . This would be followed by multiplying $\bm{A}$ with a $\frac{HW}{d^{2}}\times C^{\prime}$ input tensor to complete the process. This is, however, prohibitively expensive due to the quadratic complexity with respect to $HW$ . In a typical training step, where $(H,W)=(800,1300)$ and $d=4$ , a single-precision matrix with the size of $\bm{A}$ would occupy 15.7 GB of GPU memory, making this approach impractical.
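The memory figure can be checked with a few lines of arithmetic. In this sketch the helper name `affinity_matrix_bytes` is ours, and the result is expressed in units of $2^{30}$ bytes, which is the reading under which the quoted 15.7 is recovered.

```python
def affinity_matrix_bytes(height, width, d, bytes_per_entry=4):
    """Size of a dense (HW/d^2) x (HW/d^2) single-precision matrix."""
    n = (height // d) * (width // d)  # number of downsampled pixels
    return n * n * bytes_per_entry

size = affinity_matrix_bytes(800, 1300, 4)
print(size)                     # 16900000000
print(round(size / 2**30, 1))   # 15.7
```

With $d=4$ there are $200\times 325=65{,}000$ pixels, so the matrix has about $4.2$ billion entries.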

Drawing on the insight of , we design a lightweight pipeline for computing and applying the dense instance affinities (Fig. 5). Instead of sequentially computing $\bm{Q}\bm{Q}^{T}\bm{\Psi}$ , which explicitly produces $\bm{A}$ , we compute $\bm{Q}\big(\bm{Q}^{T}\bm{\Psi}\big)$ , since matrix multiplication is associative: $(\bm{Q}\bm{Q}^{T})\bm{\Psi}=\bm{Q}\big(\bm{Q}^{T}\bm{\Psi}\big)$ .

The result of $\bm{Q}^{T}\bm{\Psi}$ is a very small $C\times(D+N_{st})$ tensor, taking only tens of kilobytes. In terms of computation, using the same $H$ , $W$ , $d$ as the example above and $(C,D,N_{st})=(128,100,53)$ as typically used in experiments, the efficient implementation reduces the total number of multiply-adds by 99.8%, to roughly 5 billion FLOPs. For reference, a ResNet-50-FPN backbone at the same input resolution requires 140 billion FLOPs.
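Both the equivalence of the two orderings and the size of the saving can be verified numerically. The sketch below uses small random matrices for the equivalence check and plain operation counts for the saving; the helper `multiply_adds` is ours, and it counts one multiply-add per scalar product term.

```python
import numpy as np

def multiply_adds(n, c, k):
    """Multiply-add counts for (Q Q^T) Psi versus Q (Q^T Psi).

    n: number of pixels, c: affinity channels, k: channels of Psi.
    """
    naive = n * n * c + n * n * k      # form A = Q Q^T, then A Psi
    efficient = c * n * k + n * c * k  # form Q^T Psi, then Q (Q^T Psi)
    return naive, efficient

# The two orderings agree up to floating point error.
rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 8))
Psi = rng.standard_normal((50, 5))
assert np.allclose((Q @ Q.T) @ Psi, Q @ (Q.T @ Psi))

# Sizes from the text: 200 x 325 pixels, C = 128, D + N_st = 153.
naive, efficient = multiply_adds(n=200 * 325, c=128, k=153)
reduction = 1 - efficient / naive
print(efficient, round(100 * reduction, 1))  # 2545920000 99.8
```

Counting each multiply-add as two floating point operations gives about 5.1 billion FLOPs for the efficient ordering, in line with the figure quoted above. The intermediate $\bm{Q}^{T}\bm{\Psi}$ has $128\times 153$ entries, or roughly 78 kB in single precision.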

Finally, we add the product back to the input, forming a residual connection to ease the learning task. As such, the full action of our dense instance affinity applier can be summarised with the following expression:

where $\phi_{0}$ and $\phi_{1}$ are each a $1\times 1$ convolution followed by an activation. From this formulation, inference is straightforward and does not require any post-processing, as an argmax operation on $\bm{P}$ along the channel direction readily produces the panoptic segmentation prediction.
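Since the expression itself is not reproduced in this text, the following sketch shows one arrangement consistent with the description, namely $\bm{P}=\bm{\Psi}+\phi_{1}\big(\bm{Q}(\bm{Q}^{T}\phi_{0}(\bm{\Psi}))\big)$ . Both this arrangement and the stand-ins for $\phi_{0}$ and $\phi_{1}$ (a shared scalar weight followed by a ReLU, chosen so that the operation stays agnostic to the number of channels) are assumptions made for illustration.

```python
import numpy as np

def apply_affinity(psi, Q, w0=1.0, w1=1.0):
    """Residual application of dense affinities (assumed arrangement).

    psi: (N, K) flattened dynamic potential, K = D + N_st channels.
    Q:   (N, C) flattened affinity features.
    w0, w1: hypothetical scalar weights standing in for the 1x1 convolutions.
    """
    phi0 = np.maximum(w0 * psi, 0.0)         # phi_0: scale + ReLU
    aggregated = Q @ (Q.T @ phi0)            # never forms the N x N matrix
    phi1 = np.maximum(w1 * aggregated, 0.0)  # phi_1: scale + ReLU
    return psi + phi1                        # residual connection

def predict(P):
    """Panoptic prediction: channel index of the largest value per pixel."""
    return P.argmax(axis=1)
```

Because `predict` is a plain argmax over channels, the same code path serves training-time evaluation and inference, with no merging heuristics in between.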

Note that we do not compute a loss directly over $\bm{Q}$ ; instead, the instance affinities are implicitly trained by supervision from the panoptic matching loss described in the next section. In the preliminary experiments, we tried directly supervising $\bm{Q}$ with a contrastive loss, but did not observe performance gains. This shows that our end-to-end training scheme with the panoptic matching loss is already able to guide the model to learn effectively. Detailed discussion of the dense instance affinity operation, with ablation studies and visualisations, is provided in Sec. 4.1.

For simplicity, the affinity feature extractor adopts the same architecture as our semantic segmentation submodule. We use $C=128$ in all experiments.

3.5 Panoptic matching loss

For instance-level segmentation, different permutations of the indices in the segmentation map are qualitatively equivalent, since the indices merely act to distinguish between each other, and do not carry actual semantic meanings.

During training, we feed predicted object detections into the panoptic segmentation submodule. As a result, the indices of the instances are not fixed or known beforehand. To compute the loss, we first match the ground truth segmentation to the predicted detections by maximising the intersection over union between their bounding boxes (box IoU). Given a set of $\alpha$ ground truth segments $\bm{\mathcal{T}}=\{\mathcal{T}_{1},\mathcal{T}_{2},\mathcal{T}_{3},...,\mathcal{T}_{\alpha}\}$ , and a set of $\beta$ predicted bounding boxes $\bm{B}=\{B_{1},B_{2},B_{3},...,B_{\beta}\}$ , we find the “matched” ground truth $\bm{\mathcal{T}}^{\star}$ which satisfies:
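The matching criterion is not reproduced in this text. As an illustration of matching by box IoU, the sketch below uses a greedy one-to-one assignment in descending IoU order. The greedy strategy, the `min_iou` cut-off, and the convention of marking unmatched predictions with -1 are all assumptions, and may differ from the exact optimisation used.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_ground_truth(gt_boxes, pred_boxes, min_iou=0.5):
    """For each predicted box, return the index of its matched ground
    truth box, or -1 if none is matched (hypothetical convention)."""
    pairs = sorted(
        ((box_iou(g, p), gi, pi)
         for gi, g in enumerate(gt_boxes)
         for pi, p in enumerate(pred_boxes)),
        reverse=True,
    )
    match = [-1] * len(pred_boxes)
    used = set()
    for iou, gi, pi in pairs:
        if iou < min_iou:
            break  # remaining pairs overlap even less
        if gi in used or match[pi] != -1:
            continue  # keep the assignment one-to-one
        match[pi] = gi
        used.add(gi)
    return match

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(21, 21, 30, 30), (0, 0, 9, 10), (50, 50, 60, 60)]
print(match_ground_truth(gt, pred))  # [1, 0, -1]
```

Once the channel order of the ground truth follows the order of the predicted detections, a standard cross entropy loss can be applied to the panoptic output.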

Unlike ours, the panoptic loss used by does not have the matching stage and its panoptic head is trained with ground truth detections instead. As a result, the models of are not trained to handle imperfect localisations. In addition, our loss differs from as the loss used by their spatial ranking module does not directly supervise panoptic segmentation, does not take “stuff” into account, and thus does not globally optimise in an end-to-end way.

Experimental evaluation

The Cityscapes dataset features high resolution road scenes with 11 “stuff” and 8 “thing” classes. There are 2,975 training images, 500 validation images, and 1,525 test images. We report on its validation set and test set.

The COCO panoptic dataset has a greater number of images and categories. It features 118k training images, 5k validation images, and 20k test-dev images. There are 133 semantic classes, including 53 “stuff” and 80 “thing” categories. We report on its validation set and test-dev set.

Our main evaluation metric is the panoptic quality (PQ), which is the product of segmentation quality (SQ) and recognition quality (RQ) . SQ captures the average segmentation quality of matched segments, whereas RQ measures the ability of an algorithm to correctly detect objects.
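For reference, the decomposition can be written out as a short computation. The sketch follows the standard definition of PQ, in which a predicted and a ground truth segment are matched when their IoU exceeds 0.5; the function name and the example numbers are ours.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoUs of true positive segment pairs (each > 0.5).
    num_fp: unmatched predicted segments; num_fn: unmatched ground truth.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                  # mean IoU of matches
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # F1-style detection score
    return sq * rq, sq, rq

pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))  # 0.467 0.7 0.667
```

The overall PQ is obtained by averaging the per-class values.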

We also report the mean Intersection over Union (IoU) score of our initial category-level segmentation $\bm{V}$ , and the box Average Precision ( $AP_{box}$ ) of our predicted bounding boxes $\bm{B}$ . Additionally, for models which predict object instance masks $\bm{M}$ in the object detection submodule, we report their mask Average Precision ( $AP_{mask}$ ) as well. Both $AP_{box}$ and $AP_{mask}$ are averaged across IoU thresholds between $0.5$ and $0.95$ , at increments of $0.05$ .

We follow most of the learning settings described in . We distribute the 32 crops in a minibatch over 4 GPUs instead. The weights for the detection, semantic segmentation, and panoptic segmentation losses are set to 0.25, 1.0, and 1.0 respectively.

We follow most of the learning settings for COCO experiments in . For the learning schedule, we train for 200k iterations with a base learning rate of 0.02, and reduce it by a factor of 10 at 150k and 190k iterations. While this learning schedule differs from that used in , we found that our panoptic submodule with its additional parameters benefits from the new schedule. In terms of loss weights, we use 1.0, 0.2, and 0.1 for the object detection, semantic segmentation, and panoptic segmentation losses.

4.1 Ablation studies

We conduct detailed ablation studies for five different settings, including two architecture choices (msk. and aff.), one training strategy (e2e.), and two inference options (heu. and amx.). We report the results in Table 1. Explanations for the abbreviations can be found in the table caption. For clarity, we provide a brief description of the ablation models:

4.2 Comparison with state-of-the-art

We compare our results with other methods on the Cityscapes validation set in Table 2. All entries are ResNet-50 based except . We sort prior works into two tracks, depending on whether the network performs instance segmentation internally. In both tracks, our method achieves state-of-the-art results. The most telling comparison is between our model and UPSNet, as these methods have a similar network architecture other than our proposed panoptic segmentation submodule. Our network is able to outperform UPSNet by $2.1$ PQ. On the other hand, among methods that do not rely on instance segmentation , our system outperforms the previous state of the art by $3.5$ PQ, even though they utilise stronger backbones (Xception-71 and ResNet-101 ) than ours (ResNet-50).

Speed-wise, our design compares favourably with other state-of-the-art models. On Cityscapes, inference takes $386$ ms (obtained by running our re-implementation) and $201$ ms (obtained by running its publicly released code) per image for and , whereas our full model runs at $197$ ms per image. All models are ResNet-50 based and timed on a single RTX 2080Ti card.

Results on the COCO panoptic validation set are reported in Table 3. Due to the disentangling power of our proposed pipeline and unified train-test logic, we are able to outperform the previous state-of-the-art method by $0.9$ in terms of overall PQ, and $2.1$ in terms of PQ for “stuff”.

Results on the Cityscapes test set and COCO test-dev set are reported in Tables 4 and 5. We perform single-scale inference, without any test-time augmentation. For fair comparison, only methods that are ResNe(X)t-based are reported. Our method achieves state-of-the-art performance on both datasets with a PQ of $63.3$ and $47.2$ respectively.

Qualitative results are shown in Fig. 7, where we compare with our re-implementation of Panoptic FPN. As the instance affinity operation integrates information from pixels locally and globally, our method can resolve errors in the detection stage by propagating meaningful information from other pixels. The “void” regions (displayed in black) shown in Fig. 7c are typically present in results produced by the heuristic merging process popularised by . They are due to the method’s inability to resolve inconsistencies between semantic and instance predictions. In contrast, our method successfully handles such cases, as evident in Fig. 7d.

Conclusion

We have presented an end-to-end panoptic segmentation approach that exploits a novel pairwise instance affinity operation. It is lightweight, learnt from data, and capable of modelling a dynamic number of instances. By integrating information across the image in a differentiable manner, the instance affinity operation with the panoptic matching loss enables end-to-end training and heuristics-free inference, leading to improved qualities for panoptic segmentation. Furthermore, our method bestows additional flexibility upon network design, allowing our model to perform well even if it only uses bounding boxes as localisation cues.

This work was supported by Huawei Technologies Co., Ltd., the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to thank the Royal Academy of Engineering and FiveAI.

Appendices

Our semantic segmentation submodule is modified from , by performing Group Normalisation after each $3\times 3$ convolution. We illustrate the pipeline in Fig. H. Note that the architecture of the feature decoder inside this submodule is also adopted by our dense instance affinity head to extract affinity features $\bm{Q}$ . This submodule is supervised by a cross-entropy loss, unless otherwise stated.

A.2 Object detection submodule

In our experiments, we use the standard box head from Faster-RCNN and optionally the mask head from Mask-RCNN for this submodule, following . For the mask head, we use the Lovasz Hinge loss to replace the binary cross entropy loss. Thanks to the modular design of our network, it is easy to substitute it with any other detector architecture.

A.3 Dynamic potential head

We refer to design variants B and C presented in Sec. 3.4.1 (Fig. 4). At first glance, variant B, which multiplies semantic segmentation probabilities $V_{i}(c_{k})$ with mask scores $M_{i}(k)$ , appears to be a more appropriate method than variant C, which sums them instead. The output of variant B is high only when both inputs are high. This can filter out spurious misclassifications from either input, and improve robustness towards false positive predictions. Indeed, on Cityscapes, we observe that variant B achieves a $1.1$ PQ lead over its variant C counterpart (first row of Table F).

However, on COCO, we notice a high tendency for the semantic segmentation submodule to mistake “things” for “stuff” (Table G2). The multiplicative action of variant B can systematically and substantially weaken the panoptic logits for “thing” classes, relative to the unattenuated panoptic logits of “stuff” classes. This can be undesirable for models whose semantic segmentation submodule is already prone to misclassifying “things” as “stuff”. On the other hand, the opposite is true for variant C, as summation strengthens the panoptic logits of “things” in comparison to unmodified “stuff” scores. This led us to use variant C for COCO, and we observe a $0.7$ PQ improvement in comparison to B (second row of Table F).

A.4 Training with predicted detections

In contrast with the practice in , we argue that, during training, the dynamic potential head should use predicted detections instead of ground truth ones to construct its output $\bm{\Psi}$ . This allows the network to learn from realistic examples, and build up its robustness towards imperfections in detection localisation and scoring. To test our hypothesis, we carried out an ablation study on Cityscapes using our mask-free model. When training with ground truth boxes, a uniform score of $1.0$ is used for their confidence scores. Results are shown in Table H. As expected, training with predicted detections yields performance improvements across all panoptic metrics, including a $0.4$ increase in PQ. A large boost is observed for $AP_{box}$ ( $+1.3$ ), because training with predicted boxes allows gradients from the panoptic segmentation submodule to flow to the object detection submodule, giving it more fine-grained supervision. IoU has not changed, as this ablation setting does not affect the semantic segmentation module.

B Implementation details

We run our experiments on four V100-32GB GPUs. This allows us to load each GPU with eight image crops and obtain an effective batch size of 32. The large number of crops per GPU enables us to use a Lovasz Softmax loss instead of a cross entropy loss for supervising semantic segmentation, which we found to be effective. Following , we use a base learning rate of $0.01$ , a weight decay of $0.0001$ , and train for a total of $65$ k iterations. The learning rate is reduced by a factor of $10$ after the first $40$ k iterations, and once more after another $15$ k iterations. Additionally, we adopt a “warm-up” period at the start of training – linearly increasing the learning rate from a third of the base rate to the full rate in 500 iterations, which helps stabilise the training.
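The schedule described above can be summarised as a function of the iteration index. This is a minimal sketch: the function name is ours, and whether each drop takes effect exactly at iteration 40k and 55k (rather than one step later) is an assumption.

```python
def learning_rate(it, base_lr=0.01, warmup_iters=500, warmup_factor=1 / 3,
                  steps=(40_000, 55_000), gamma=0.1):
    """Step schedule with linear warm-up, as a function of iteration."""
    # One tenfold reduction for every milestone already passed.
    lr = base_lr * gamma ** sum(it >= s for s in steps)
    if it < warmup_iters:
        # Linearly ramp from warmup_factor * base_lr up to base_lr.
        alpha = it / warmup_iters
        lr *= warmup_factor * (1 - alpha) + alpha
    return lr

for it in (0, 250, 500, 40_000, 55_000):
    print(it, learning_rate(it))
```

The same function covers other step schedules by changing `base_lr` and `steps`.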

We augment input images on-the-fly during training to reduce the network’s tendency to overfit. Our augmentation pipeline resizes the input image by a random factor between $0.5$ and $2$ , takes a random $512\times 1024$ crop, and applies a horizontal flip with $50$ % chance. On top of these techniques, we also apply image relighting, randomly adjusting the brightness, contrast, hue, and saturation of the image by a small amount, as used in .

On COCO, as the dataset is larger than Cityscapes, less overfitting is observed. Therefore, in terms of data augmentation techniques, we only apply resizing, where the shorter side is resized to $800$ and the longer side is kept under $1333$ , and random horizontal flipping with $0.5$ probability.
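The resizing rule amounts to choosing a single scale factor per image. A minimal sketch follows; the function name is ours, and treating the limit on the longer side as inclusive is an assumption.

```python
def resize_scale(height, width, short_side=800, max_long_side=1333):
    """Scale that brings the shorter side to short_side, unless that
    would push the longer side beyond max_long_side."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        scale = max_long_side / max(height, width)
    return scale

print(resize_scale(480, 640))    # shorter side reaches 800
print(resize_scale(400, 1000))   # capped by the longer side
```

For a $480\times 640$ image the shorter side is scaled to 800, whereas for a $400\times 1000$ image the cap on the longer side determines the scale.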

We use ImageNet pretrained ResNet-50 to initialise all experiments. The batch normalisation statistics are kept unchanged, though further performance gains are likely if they are finetuned on the target dataset. When a normalisation step is used in either the semantic or panoptic submodules, we use the Group Normalisation operation , as it is less sensitive to batch sizes.

We conduct single-scale inference for all experiments, letting the network process and make predictions on full-resolution images in a single forward pass. Note that only detection predictions whose confidence scores are more than a threshold are fed into the dynamic potential head during inference, to minimise unnecessary computation. This cut-off is $0.5$ and $0.75$ for Cityscapes and COCO respectively.

C Evaluation of “stuff”

The PQ metric effectively treats “stuff” classes as image-wide instances – making all “stuff” segments undergo the same matching procedure with ground truth segments as “thing” segments. While this approach has its merits, including a unified evaluation logic and a simplified PQ implementation, it should be noted that matching “stuff” predictions to ground truth is not strictly necessary, since at most one “stuff” instance for each “stuff” class is present per image.

Furthermore, this approach towards “stuff” is neither robust nor fair as a measure for “stuff” segmentation quality, and arguably encourages post-processing of panoptic predictions. Under the PQ formulation, misclassifying even a single pixel into a “stuff” class absent in the ground truth will increment false positive detections by one, and such mistakes – exacerbated by the relatively small number of ground truth “stuff” segments in a dataset – attract a large penalty on the “stuff” RQ, even though the practical impact on perceptual quality is minimal. This also contrasts in spirit with the mean IoU metric widely adopted to measure semantic segmentation quality, as the mean IoU accumulates intersection and union counts over the whole dataset and is minimally affected by individual pixels.

On the other hand, CNN-based semantic segmentation models are typically prone to produce spurious misclassifications, as they usually do not explicitly enforce smoothness. As a result, recent panoptic segmentation works collectively resort to setting small “stuff” segments to “void” in the final panoptic segmentation. Therefore, to foster meaningful comparison with other state-of-the-art panoptic segmentation approaches, unless specified otherwise, we also carry out this strategy as part of evaluation.
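A minimal sketch of such a trimming step is given below. It assumes a panoptic map in which each “stuff” segment is identified by a single integer id; the function name, the `min_area` threshold, and the `void_id` value are hypothetical, since the text does not specify them.

```python
import numpy as np

def void_small_stuff(panoptic, stuff_ids, min_area, void_id=255):
    """Set "stuff" segments smaller than min_area pixels to void.

    panoptic: (H, W) integer map of segment ids (assumed encoding).
    """
    out = panoptic.copy()
    for sid in stuff_ids:
        mask = panoptic == sid
        if 0 < mask.sum() < min_area:
            out[mask] = void_id
    return out

seg = np.ones((4, 4), dtype=np.int64)  # one large "stuff" segment (id 1)
seg[0, 0] = 3                          # a single stray pixel of id 3
print(void_small_stuff(seg, stuff_ids=[1, 3], min_area=4))
```

In this example the stray pixel would otherwise count as a false positive “stuff” segment, while the large segment is left untouched.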

On Cityscapes validation set, we test our full model, our re-implemented Panoptic FPN , and the released UPSNet model with and without trimming off small “stuff” regions, to quantitatively assess the impact of this step on state-of-the-art models. The findings are reported in Table I.

The results show that PQ and RQ are very sensitive to such operations, as removing small stuff segments consistently results in an increase of approximately 2 points for “stuff” PQ, and 2.5 points for “stuff” RQ. This can be largely attributed to the reduced number of false positive stuff segments. On the other hand, the “stuff” IoU metric is insensitive to such modifications, as in all three cases, it suffers a slight decrease of 0.1 or 0.2 points. This prompts us to believe that “stuff” IoU is a better metric for capturing “stuff” segmentation quality than the “thing”-centric PQ family.

D Detailed validation set results

We report the detailed results of our models on the Cityscapes and COCO validation sets in Table J. In addition to the metrics reported in the main paper, this table also includes breakdowns of SQ and RQ by “stuff” and “thing”.

E Visualisation of learnt instance affinities

Additional visualisations of some predicted instance affinities are provided in Fig. I. Note that these instance affinities are extracted from our mask-free model. Interestingly, the model has learnt to resolve car regions covered by multiple car bounding boxes – a problem difficult for methods only using boxes as localisation cues – by creating strong instance affinities to the bottoms and tyres of cars. The model has found that these regions of cars are normally not covered by multiple bounding boxes, and therefore that associating uncertain pixels with these regions is most helpful for instance discrimination.

F Qualitative results

We show more qualitative results in Fig. J and K, and comparisons to previous state-of-the-art methods .