Probabilistic Anchor Assignment with IoU Prediction for Object Detection

Kang Kim, Hee Seok Lee

Introduction

Object detection, in which objects in a given image are classified and localized, is considered one of the fundamental problems in Computer Vision. Since the seminal work of R-CNN, recent advances in object detection have shown rapid improvements with many innovative architectural designs, training objectives and post-processing schemes with strong CNN backbones. For most CNN-based detectors, a dominant paradigm of representing objects of various sizes and shapes is to enumerate anchor boxes of multiple scales and aspect ratios at every spatial location. In this paradigm, an anchor assignment procedure, in which anchors are assigned as positive or negative samples, needs to be performed. The most common strategy to determine positive samples is to use the Intersection-over-Union (IoU) between an anchor and a ground truth (GT) bounding box. For each GT box, one or more anchors are assigned as positive samples if their IoU with the GT box exceeds a certain threshold. Target values for both classification and localization (i.e. regression offsets) of these anchors are determined by the object category and the spatial coordinates of the GT box.

Although the simplicity and intuitiveness of this heuristic make it a popular choice, it has a clear limitation in that it ignores the actual content of the intersecting region, which may contain noisy background, nearby objects or few meaningful parts of the target object to be detected. Several recent studies have identified this limitation and suggested various new anchor assignment strategies. These works include selecting positive samples based on the detection-specific likelihood, the statistics of anchor IoUs or the cleanness score of anchors. All these methods show improvements compared to the baseline, and verify the importance of anchor assignment in object detection.

In this paper we extend some of these ideas further and propose a novel anchor assignment strategy. For an anchor assignment strategy to be effective, a flexible number of anchors should be assigned as positives (or negatives) based not only on the IoUs between anchors and a GT box but also on how probable it is that a model can reason about the assignment. In this respect, the model needs to take part in the assignment procedure, and positive samples need to vary depending on the model. When no anchor has a high IoU for a GT box, some of the anchors need to be assigned as positive samples to reduce the impact of the improper anchor design. In this case, anchors in which the model finds the most meaningful cues about the target object (which may not necessarily be the anchors of the highest IoU) can be assigned as positives. On the other hand, when there are many anchors that the model finds equally competitive and of high quality, all of these anchors need to be treated as positives so as not to confuse the training process. Most importantly, to satisfy all these conditions, the quality of anchors as positive samples needs to be evaluated in a way that reflects the model’s current learning status, i.e. its parameter values.

With this motivation, we propose a probabilistic anchor assignment (PAA) strategy that adaptively separates a set of anchors into positive and negative samples for a GT box according to the learning status of the model associated with it. To do so we first define a score of a detected bounding box that reflects both the classification and localization qualities. We then identify the connection between this score and the training objectives and represent the score as the combination of two loss objectives. Based on this scoring scheme, we calculate the scores of individual anchors that reflect how the model finds useful cues to detect a target object in each anchor. With these anchor scores, we aim to find a probability distribution of two modalities that best represents the scores as positive or negative samples as in Figure 1. Under the found probability distribution, anchors whose probabilities under the positive component are high are selected as positive samples. This transforms the anchor assignment problem into a maximum likelihood estimation for a probability distribution where the parameters of the distribution are determined by anchor scores. Based on the assumption that anchor scores calculated by the model are samples drawn from a probability distribution, it is expected that the model can infer the sample separation in a probabilistic way, leading to easier training of the model compared to other non-probabilistic assignments. Moreover, since positive samples are adaptively selected based on the anchor score distribution, it requires neither a pre-defined number of positive samples nor an IoU threshold.

On top of that, we identify that in most modern object detectors, there is inconsistency between the testing scheme (selecting boxes according to the classification score only during NMS) and the training scheme (minimizing both classification and localization losses). Ideally, the quality of detected boxes should be measured based not only on classification but also on localization. To improve this incomplete scoring scheme and at the same time to reduce the discrepancy of objectives between the training and testing procedures, we propose to predict the IoU of a detected box as a localization quality, and multiply the classification score by the IoU score as a metric to rank detected boxes. This scoring is intuitive, and allows the box scoring scheme in the testing procedure to share the same ground not only with the objectives used during training, but also with the proposed anchor assignment strategy that brings both classification and localization into account, as depicted in Figure 2. Combined with the proposed PAA, this simple extension significantly contributes to detection performance. We also compare the IoU prediction with the centerness prediction and show the superiority of the proposed method.

With an additional improvement in post-processing named score voting, each of our methods shows clear improvements as revealed in the ablation studies. In particular, on the COCO test-dev set all our models achieve new state-of-the-art performance with significant margins. Our model only requires adding a single convolutional layer and uses a single anchor per spatial location, resulting in a smaller number of parameters compared to RetinaNet. The proposed anchor assignment can be parallelized using GPUs and does not require extra computation at test time. All this evidence verifies the efficacy of our proposed methods. The contributions of this paper are summarized below:

1. We model the anchor assignment as a probabilistic procedure by calculating anchor scores from a detector model and maximizing the likelihood of these scores for a probability distribution. This allows the model to infer the assignment in a probabilistic way and adaptively determines positive samples.

2. To align the objectives of anchor assignment, optimization and post-processing procedures, we propose to predict the IoU of detected boxes and use the unified score of classification and localization as a ranking metric for NMS. On top of that, we propose the score voting method as an additional post-processing using the unified score to further boost the performance.

3. We perform extensive ablation studies and verify the effectiveness of the proposed methods. Our experiments on the MS COCO dataset with five backbones set new AP records for all tested settings.

Related Work

Since Region-CNN and its improvements, the concept of anchors and offset regression between anchors and ground truth (GT) boxes along with object category classification has been widely adopted. In many cases, multiple anchors of different scales and aspect ratios are assigned to each spatial location to cover various object sizes and shapes. Anchors that have IoU values greater than a threshold with one of the GT boxes are considered as positive samples. Some systems use two-stage detectors, which apply the anchor mechanism in a region proposal network (RPN) for class-agnostic object proposals. A second-stage detection head is run on aligned features of each proposal. Other systems use single-stage detectors, which do not have an RPN and directly predict object categories and regression offsets at each spatial location. More recently, anchor-free models that do not rely on anchors to define positive and negative samples and regression offsets have been introduced. These models predict various key points such as corners, extreme points, center points or arbitrary feature points induced from deformable convolution. Another line of work combines anchor-based detectors with anchor-free detection by adding anchor-free regression branches. It has also been found that anchor-based and anchor-free models show similar performance when they use the same anchor assignment strategy.

2 Anchor Assignment in Object Detection

The task of selecting which anchors (or locations for anchor-free models) are to be designated as positive or negative samples has recently been identified as a crucial factor that greatly affects a model’s performance. In this regard, several methods have been proposed to overcome the limitation of the IoU-based hard anchor assignment. MetaAnchor predicts the parameters of the anchor functions (the last convolutional layers of detection heads) dynamically and takes anchor shapes as an argument, which provides the ability to change anchors in training and testing. Rather than enumerating pre-defined anchors across spatial locations, GuidedAnchoring defines the locations of anchors near the center of GTs as positives and predicts their shapes. FreeAnchor proposes a detection-customized likelihood that takes both the recall and precision of samples into account and determines positive anchors based on the estimated likelihood. ATSS suggests an adaptive anchor assignment that calculates the mean and standard deviation of IoU values from a set of close anchors for each GT. It assigns anchors whose IoU values are higher than the sum of the mean and the standard deviation as positives. Although these works show some improvements, they either require additional layers and complicated structures, or force only one anchor to have a full classification score, which is not desirable in cases where multiple anchors are of high quality and competitive, or rely on IoUs between pre-defined anchors and GTs and consider neither the actual content of the intersecting regions nor the model’s learning status.

Similar to our work, MultipleAnchorLearning (MAL) and NoisyAnchor define anchor score functions based on classification and localization losses. However, they do not model the anchor selection procedure as a likelihood maximization for a probability distribution; rather, they choose a fixed number of best scoring anchors. Such a mechanism prevents these models from selecting a flexible number of positive samples according to the model’s learning status and input. MAL uses a linear scheduling that reduces the number of positives as training proceeds and requires a heuristic feature perturbation to mitigate it. NoisyAnchor fixes the number of positive samples throughout training. Also, they either miss the relation between the anchor scoring scheme and the box selection objective in NMS or only indirectly relate them using soft-labels.

3 Predicting Localization Quality in Object Detection

Predicting IoUs as a localization quality of detected bounding boxes is not new. YOLO and YOLOv2 predict an “objectness score”, which is the IoU of a detected box with its corresponding GT box, and multiply it with the classification score during inference. However, they do not investigate its effectiveness compared to the method that uses classification scores only, and their latest version removes this prediction. IoU-Net also predicts the IoUs of predicted boxes and proposes “IoU-guided NMS”, which uses predicted IoUs instead of classification scores as the ranking keyword and adjusts the selected box’s score to the maximum score of overlapping boxes. Although this approach can be effective, they do not combine the classification score with the IoU into a unified score, nor do they relate the NMS procedure to the anchor assignment process. In contrast to predicting IoUs, some works add an additional head to predict the variance of localization to regularize training or penalize the classification score in testing.

Proposed Methods

Our goal here is to devise an anchor assignment strategy that takes three key considerations into account: Firstly, it should measure the quality of a given anchor based on how likely the model associated with it finds evidence to identify the target object with that anchor. Secondly, the separation of anchors into positive and negative samples should be adaptive so that it does not require a hyperparameter such as an IoU threshold. Lastly, the assignment strategy should be formulated as a likelihood maximization for a probability distribution in order for the model to be able to reason about the assignment in a probabilistic way. In this respect, we design an anchor scoring scheme and propose an anchor assignment that brings the scoring scheme into account.

Specifically, let us define the score of an anchor that reflects the quality of its bounding box prediction for the closest ground truth (GT) $g$. One intuitive way is to calculate a classification score (compatibility with the GT class) and a localization score (compatibility with the GT box) and multiply them:

$$S(f_{\theta}(a,x),g)=S_{cls}(f_{\theta}(a,x),g)\times S_{loc}(f_{\theta}(a,x),g)^{\lambda} \tag{1}$$

where $S_{cls}$, $S_{loc}$, and $\lambda$ are the classification score and the localization score of anchor $a$ given $g$, and a scalar to control the relative weight of the two scores, respectively. $x$ and $f_{\theta}$ are an input image and a model with parameters $\theta$. Note that this scoring function depends on the model parameters $\theta$. We can obtain $S_{cls}$ directly from the output of the classification head. How to define $S_{loc}$ is less obvious, since the output of the localization head is encoded offset values rather than a score. Here we use the Intersection-over-Union (IoU) of a predicted box with its GT box as $S_{loc}$, as its range matches that of the classification score and its values naturally correspond to the quality of localization:

$$S_{loc}(f_{\theta}(a,x),g)=\mathrm{IoU}(f_{\theta}(a,x),g) \tag{2}$$

Taking the negative logarithm of score function $S$, we get the following:

$$-\log S(f_{\theta}(a,x),g)=-\log S_{cls}(f_{\theta}(a,x),g)-\lambda\log S_{loc}(f_{\theta}(a,x),g)=\mathcal{L}_{cls}(f_{\theta}(a,x),g)+\lambda\,\mathcal{L}_{IoU}(f_{\theta}(a,x),g) \tag{3}$$

where $\mathcal{L}_{cls}$ and $\mathcal{L}_{IoU}$ denote binary cross entropy loss (we assume a binary classification task; extending it to a multi-class case is straightforward) and IoU loss, respectively. One can also replace either of the losses with a more advanced objective such as Focal Loss or GIoU Loss. It is then legitimate that the negative sum of the two losses can act as a scoring function of an anchor given a GT box.
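As a minimal sketch of the scoring scheme described above, the snippet below computes the product of a classification probability and the IoU raised to a weight, and checks numerically that its negative logarithm is the sum of two loss-like terms. The function names, the `(x1, y1, x2, y2)` box format and the `eps` clamp are assumptions made for illustration only.

```python
import math

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def anchor_score(cls_prob, pred_box, gt_box, lam=1.0):
    """S = S_cls * S_loc ** lam, with S_loc taken as IoU(pred_box, gt_box)."""
    return cls_prob * box_iou(pred_box, gt_box) ** lam

def neg_log_score(cls_prob, pred_box, gt_box, lam=1.0, eps=1e-12):
    """-log S written as classification loss + lam * IoU loss."""
    loss_cls = -math.log(max(cls_prob, eps))  # binary cross entropy, positive target
    loss_iou = -math.log(max(box_iou(pred_box, gt_box), eps))  # -log IoU
    return loss_cls + lam * loss_iou

# Hypothetical predicted box and GT box.
pred, gt = (10, 10, 50, 50), (20, 20, 60, 60)
s = anchor_score(0.8, pred, gt, lam=1.0)
nl = neg_log_score(0.8, pred, gt, lam=1.0)
print(s, math.exp(-nl))  # the two values agree up to rounding
```

Swapping in Focal Loss or GIoU Loss, as the text suggests, would replace the two loss terms while keeping the "negative summed loss as score" structure.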

To allow a model to reason about whether it should predict an anchor as a positive sample in a probabilistic way, we model anchor scores for a certain GT as samples drawn from a probability distribution and maximize the likelihood of the anchor scores w.r.t. the parameters of the distribution. The anchors are then separated into positive and negative samples according to the probability of each being a positive or a negative. Since our goal is to separate a set of anchors into two groups (positives and negatives), any probability distribution that can model the multi-modality of samples can be used. Here we choose a Gaussian Mixture Model (GMM) of two modalities to model the anchor score distribution:

$$P(a\mid x,g,\theta)=w_{1}\,\mathcal{N}_{1}(a;m_{1},p_{1})+w_{2}\,\mathcal{N}_{2}(a;m_{2},p_{2})$$

where $w_{1},m_{1},p_{1}$ and $w_{2},m_{2},p_{2}$ represent the weight, mean and precision of the two Gaussians, respectively. Given a set of anchor scores, the likelihood of this GMM can be optimized using the Expectation-Maximization (EM) algorithm.

With the parameters of the GMM estimated by EM, the probability of each anchor being a positive or a negative sample can be determined. With these probability values, various techniques can be used to separate the anchors into two groups. Figure 3 illustrates different examples of separation boundaries based on anchor probabilities. The proposed algorithm using one of these boundary schemes is described in Procedure 1. To calculate anchor scores, anchors are first allocated to the GT of the highest IoU (Line 3). To make EM efficient, we collect the top $K$ anchors from each pyramid level (Lines 5-11) and perform EM (Line 12). Non-top-$K$ anchors are assigned as negative samples (Line 16).
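The assignment steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: anchor scores are taken to be negative summed losses (so higher is better), EM is initialized with the minimum and maximum candidate scores as means and unit precision, and the separation boundary is assumed to be "posterior of the positive component above 0.5", which is only one of the possible schemes. The names `fit_two_gaussians` and `paa_split` are hypothetical.

```python
import math

def fit_two_gaussians(scores, n_iter=30, min_var=1e-6):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances, resp)."""
    means = [min(scores), max(scores)]  # assumed init: min / max score
    variances = [1.0, 1.0]              # assumed init: precision 1
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for s in scores:
            logd = [
                math.log(max(weights[k], 1e-300))
                - 0.5 * math.log(2.0 * math.pi * variances[k])
                - 0.5 * (s - means[k]) ** 2 / variances[k]
                for k in range(2)
            ]
            top = max(logd)
            e = [math.exp(v - top) for v in logd]
            z = sum(e)
            resp.append([v / z for v in e])
        # M-step: update weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, min_var)
    return weights, means, variances, resp

def paa_split(level_scores, k=9):
    """level_scores: one list of (anchor_id, score) per pyramid level, for a single GT.

    Returns the ids of positive anchors; every other anchor is a negative.
    """
    candidates = []
    for level in level_scores:
        # Keep the top-k scoring anchors of this level as candidates.
        candidates.extend(sorted(level, key=lambda t: t[1], reverse=True)[:k])
    ids = [a for a, _ in candidates]
    scores = [s for _, s in candidates]
    _, means, _, resp = fit_two_gaussians(scores)
    pos = 0 if means[0] > means[1] else 1  # positive component = higher mean score
    return [a for a, r in zip(ids, resp) if r[pos] > 0.5]

# Toy example with two pyramid levels and hypothetical scores.
levels = [
    [(0, -0.3), (1, -0.4), (2, -4.0), (3, -4.2), (4, -9.0)],
    [(5, -0.5), (6, -0.6), (7, -4.5), (8, -4.8), (9, -5.0), (10, -5.2)],
]
print(paa_split(levels, k=4))
```

Note that the number of returned positives is not fixed in advance; it follows from the fitted mixture.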

Note that the number of positive samples is adaptively determined depending on the estimated probability distribution conditioned on the model’s parameters. This is in contrast to previous approaches that ignore the model or heuristically determine the number of samples as a hyperparameter without modeling the anchor assignment as a likelihood maximization for a probability distribution. FreeAnchor defines a detection-customized likelihood and models the product of the recall and the precision as the training objective. But their approach is significantly different than ours in that we do not separately design likelihoods for recall and precision, nor do we restrict the number of anchors that have a full classification score to one. In contrast, our likelihood is based on a simple one-dimensional GMM of two modalities conditioned on the model’s parameters, allowing the anchor assignment strategy to be easily identified by the model. This results in easier learning compared to other anchor assignment methods that require complicated sub-routines (e.g. the mean-max function to stabilize training or the anchor depression procedure to avoid local minima) and thus leads to better performance as shown in the experiments.

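To summarize our method and plug it into the training process of an object detector, we formulate the final training objective for an input image $x$ (we omit $x$ for brevity):

$$\operatorname*{arg\,max}_{\theta}\prod_{g}\prod_{a\in\mathcal{A}_{g}}P_{pos}(a,\theta,g)\,S_{pos}+P_{neg}(a,\theta,g)\,S_{neg}$$

$$S_{pos}=S(f_{\theta}(a),g)=\exp\left(-\mathcal{L}_{cls}(f_{\theta}(a),g)-\lambda_{1}\mathcal{L}_{IoU}(f_{\theta}(a),g)\right)$$

$$S_{neg}=\exp\left(-\mathcal{L}_{cls}(f_{\theta}(a),\varnothing)\right)$$

where $\mathcal{A}_{g}$ is the set of anchors allocated to $g$, and $P_{pos}(a,\theta,g)$ and $P_{neg}(a,\theta,g)$ indicate the probability of an anchor being a positive or a negative and can be obtained by the proposed PAA. $\varnothing$ means the background class. Our PAA algorithm can be viewed as a procedure to compute $P_{pos}$ and $P_{neg}$ and approximate them as binary values (i.e. separate anchors into two groups) to ease optimization. In each training iteration, after estimating $P_{pos}$ and $P_{neg}$, the gradients of the loss objectives w.r.t. $\theta$ can be calculated and stochastic gradient descent can be performed.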

2 IoU Prediction as Localization Quality

The anchor scoring function in the proposed anchor assignment is derived from the training objective (i.e. the combined loss of two tasks), so the anchor assignment procedure is well aligned with the loss optimization. However, this is not the case for the testing procedure, where non-maximum suppression (NMS) is performed solely on the classification score. To remedy this, the localization quality can be incorporated into the NMS procedure so that the same scoring function (Equation 1) can be used. However, GT information is only available during training, so the IoU between a detected box and its corresponding GT box cannot be computed at test time.

Here we propose a simple solution to this: we extend our model to predict the IoU of a predicted box with its corresponding GT box. This extension is straightforward as it requires a single convolutional layer as an additional prediction head that outputs a scalar value per anchor. We use Sigmoid activation on the output to obtain valid IoU values. The training objective then becomes (we omit input $x$ for brevity):

$$\mathcal{L}(f_{\theta}(a),g)=\mathcal{L}_{cls}(f_{\theta}(a),g)+\lambda_{1}\mathcal{L}_{IoU}(f_{\theta}(a),g)+\lambda_{2}\mathcal{L}_{IoUP}(f_{\theta}(a),g)$$

where $\mathcal{L}_{IoUP}$ is the IoU prediction loss, defined as the binary cross entropy between predicted IoUs and true IoUs. With the predicted IoU, we compute the unified score of the detected box using Equation 1 and use it as a ranking metric for the NMS procedure. As shown in the experiments, bringing IoU prediction into NMS significantly improves performance, especially when coupled with the proposed probabilistic anchor assignment. The overall network architecture is exactly the same as the one in FCOS and ATSS, which is RetinaNet with modified feature towers and an auxiliary prediction head. Note that this structure uses only a single anchor per spatial location and so has a smaller number of parameters and FLOPs compared to RetinaNet-based models using nine anchors.
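To illustrate how ranking by the unified score can change the outcome of NMS, here is a small greedy NMS sketch. It assumes the unified score is the plain product of the classification score and the predicted IoU; the IoU threshold of 0.6, the box format and the function names are assumptions for this example and are not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_unified(boxes, cls_scores, pred_ious, iou_thr=0.6):
    """Greedy NMS ranked by cls_score * predicted_iou; returns (kept indices, scores)."""
    scores = [c * q for c, q in zip(cls_scores, pred_ious)]
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep, scores

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
cls_scores = [0.9, 0.8, 0.7]  # box 0 wins if ranked by classification only
pred_ious = [0.5, 0.9, 0.8]   # but box 1 is predicted to be better localized
keep, scores = nms_unified(boxes, cls_scores, pred_ious)
print(keep)
```

In this toy case box 1 suppresses box 0, whereas ranking by the classification score alone would have kept box 0.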

3 Score Voting

As an additional improvement, here we propose a simple yet effective post-processing scheme. The proposed score voting method works on each box $b$ of the boxes remaining after the NMS procedure as follows:

$$p_{i}=e^{-(1-\mathrm{IoU}(b,b_{i}))^{2}/\sigma_{t}}$$

$$\hat{b}=\frac{\sum_{i}p_{i}s_{i}b_{i}}{\sum_{i}p_{i}s_{i}}\quad\text{subject to }\mathrm{IoU}(b,b_{i})>0$$

where $\hat{b}$, $s_{i}$ and $\sigma_{t}$ are the updated box, the score computed by Equation 1 and a hyperparameter to adjust the weights of adjacent boxes $b_{i}$, respectively. This voting algorithm is inspired by the “variance voting” method of prior work, and $p_{i}$ is defined in the same way. However, we do not use the variance prediction to calculate the weight of each neighboring box. Instead we use the unified score of classification and localization $s_{i}$ as a weight along with $p_{i}$.

We found that using $p_{i}$ alone as a box weight leads to a performance improvement, and multiplying it by $s_{i}$ further boosts the performance. In contrast to variance voting, detectors without the variance prediction are capable of using score voting by just weighting boxes with $p_{i}$. Detectors with an IoU prediction head, like ours, can multiply it by $s_{i}$ for better accuracy. Unlike the classification score alone, $s_{i}$ can act as a reliable weight since it does not assign large weights to boxes that have a high classification score and a poor localization quality.
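A minimal sketch of score voting for a single box kept by NMS is given below. It assumes the variance-voting form of the proximity weight, $p_{i}=\exp(-(1-\mathrm{IoU}(b,b_{i}))^{2}/\sigma_{t})$, and restricts the sum to neighbors with non-zero overlap; the function names and the box format are hypothetical.

```python
import math

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def score_voting(kept_box, candidates, scores, sigma_t=0.025):
    """Refine kept_box as a weighted average of overlapping candidate boxes.

    candidates: boxes b_i before NMS (including kept_box itself)
    scores:     unified scores s_i (classification score * predicted IoU)
    """
    num = [0.0, 0.0, 0.0, 0.0]
    den = 0.0
    for b_i, s_i in zip(candidates, scores):
        iou = box_iou(kept_box, b_i)
        if iou <= 0.0:
            continue  # only boxes that overlap the kept box may vote
        p_i = math.exp(-((1.0 - iou) ** 2) / sigma_t)
        w = p_i * s_i
        den += w
        for d in range(4):
            num[d] += w * b_i[d]
    if den == 0.0:
        return tuple(float(v) for v in kept_box)
    return tuple(v / den for v in num)

kept = (0, 0, 10, 10)
candidates = [(0, 0, 10, 10), (1, 0, 11, 10), (100, 100, 110, 110)]
scores = [0.8, 0.6, 0.9]
voted = score_voting(kept, candidates, scores)
print(voted)  # x-coordinates shift slightly toward the overlapping neighbor
```

Dropping `s_i` from the weight (i.e. using `w = p_i`) gives the variant described above for detectors without an IoU prediction head.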

Experiments

In this section we conduct extensive experiments to verify the effectiveness of the proposed methods on the MS COCO benchmark. We follow the common practice of using ‘trainval35k’ as training data (about 118k images) for all experiments. For ablation studies we measure accuracy on ‘minival’ (5k images), and comparisons with previous methods are done on ‘test-dev’ (about 20k images). All accuracy numbers are computed using the official COCO evaluation code.

We use a COCO training setting that follows prior work in the batch size, frozen Batch Normalization, learning rate, etc. The exact setting can be found in the supplementary material. For ablation studies we use a Res50 backbone and run 135k iterations of training. For comparisons with previous methods we run 180k iterations with various backbones. Similar to recent works, we use GroupNorm in detection feature towers, Focal Loss as the classification loss, GIoU Loss as the localization loss, and add trainable scalars to the regression head. $\lambda_{1}$ is set to 1 to compute anchor scores and 1.3 when calculating Equation 3. $\lambda_{2}$ is set to 0.5 to balance the scales of each loss term. $\sigma_{t}$ is set to 0.025 if the score voting is used. Note that we do not use “centerness” prediction or “center sampling” in our models. We set $\mathcal{K}$ to 9, although our method is not sensitive to its value. For GMM optimization, we set the minimum and maximum scores of the candidate anchors as the means of the two Gaussians and set the precision values to one as an initialization of EM.

2 Ablation Studies

Here we compare the anchor separation boundaries depicted in Figure 3. The left part of Table 1 shows the results. All the separation schemes work well, and we find that (c) gives the most stable performance. We also compare our method with two simpler methods, namely fixed numbers of positives (FNP) and fixed positive score ranges (FSR). FNP defines a pre-defined number of top-scoring samples as positives, while FSR treats all anchors whose scores exceed a certain threshold as positives. As the results in the right part of Table 1 show, both methods perform worse than PAA. FSR ($>$ 0.3) fails because the model cannot find anchors whose scores are within the range at early iterations. This shows an advantage of PAA, which adaptively determines the separation boundaries without hyperparameters that require careful hand-tuning and are therefore hard to adapt to each input.
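For reference, the two simpler baselines can be sketched in a few lines. The snippet assumes anchor scores in $[0,1]$ (the product form of the score); the function names and the toy score values are hypothetical and are only meant to show why a fixed score range can select nothing early in training.

```python
def fixed_number_of_positives(scores, n):
    """FNP: the n top-scoring anchors are positives (indices returned sorted)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n])

def fixed_score_range(scores, threshold):
    """FSR: every anchor whose score exceeds the threshold is a positive."""
    return [i for i, s in enumerate(scores) if s > threshold]

late_scores = [0.05, 0.42, 0.31, 0.12, 0.28]  # hypothetical scores late in training
early_scores = [0.05, 0.10, 0.20]             # hypothetical scores early in training

print(fixed_number_of_positives(late_scores, 2))  # always exactly n positives
print(fixed_score_range(late_scores, 0.3))
print(fixed_score_range(early_scores, 0.3))       # no anchor passes the threshold
```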

2.2 Effects of individual modules

In this section we verify the effectiveness of individual modules of the proposed methods. Accuracy numbers for various combinations are in Table 1. Changing anchor assignment from the IoU-based hard assignment to the proposed PAA shows improvements of 5.3% in AP score. Adding the IoU prediction head and applying the unified score function in the NMS procedure further boosts the performance to 40.8%. To further verify the impact of IoU prediction, we compare it with the centerness prediction used in FCOS and ATSS. As can be seen in the results, centerness does not bring improvements to PAA. This is expected, as weighting the scores of detected boxes according to their centerness can hinder the detection of acentric or slanted objects. This shows that centerness-based scoring does not generalize well and that the proposed IoU-based scoring can overcome this limitation. We also verify that IoU prediction is more effective than centerness prediction for ATSS (39.8% vs. 39.4%). Finally, applying the score voting improves the performance to 41.0%, surpassing previous methods with the Res50 backbone in Table 2 (left) by significant margins.

2.3 Accuracy of IoU prediction

We calculate the average error of IoU prediction for various backbones in Table 2 (right). All backbones show errors below 0.1, indicating that IoU prediction is feasible with an additional convolutional head.

2.4 Visualization of anchor assignment

We visualize positive and negative samples separated by PAA in Figure 4(a). As training proceeds, the distinction between positive and negative samples becomes clearer. Note that the positive anchors do not necessarily have larger IoU values with the target bounding box than the negative ones. Also, many negative anchors at iterations 30k and 50k have high IoU values. Methods with a fixed number of positive samples can assign these anchors as positives, and the model might predict these anchors with high scores during inference. Finally, many positive anchors have more accurate localization as training proceeds. In contrast to ours, methods like FreeAnchor penalize all these anchors except the single best one, which can confuse training.

2.5 Statistics of positive samples

To compare our method with recent works that also select positive samples by scoring anchors, we plot the number of positive samples over training iterations in Figure 4(b). Unlike methods that either fix the number of samples or use a linear decay, ours chooses a different number of samples per iteration, showing the adaptability of the method.

3 Comparison with State-of-the-art Methods

To compare our methods with previous state-of-the-art ones, we conduct experiments with five backbones as in Table 3. We first compare our models trained with Res101 and previous models trained with the same backbone. Our Res101 model achieves 44.8% accuracy, surpassing the previous best models at 43.6%. With ResNext101 our model improves to 46.6% (single-scale testing) and 49.4% (multi-scale testing), which also beats the previous best model at 45.9% and 47.0%. Then we extend our models by applying the deformable convolution to the backbones and the last layer of the feature towers, as in prior work. These models also outperform the counterparts of ATSS, showing 1.1% and 1.3% improvements. Finally, with the deformable ResNext152 backbone, our models set new records for both single-scale testing (50.8%) and multi-scale testing (53.5%).

Conclusions

In this paper we proposed a probabilistic anchor assignment (PAA) algorithm in which the anchor assignment is performed as a likelihood optimization for a probability distribution given anchor scores computed by the model associated with it. The core of PAA is in determining positive and negative samples in favor of the model so that it can infer the separation in a probabilistically reasonable way, leading to easier training compared to the heuristic IoU hard assignment or non-probabilistic assignment strategies. In addition to PAA, we identified the discrepancy of objectives in key procedures of object detection and proposed IoU prediction as a measure of localization quality to apply a unified score of classification and localization to NMS procedure. We also provided the score voting method which is a simple yet effective post-processing scheme that is applicable to most dense object detectors. Experiments showed that the proposed methods significantly boosted the detection performance, and surpassed all previous methods on COCO test-dev set.

Appendix

We train our models with 8 GPUs, each of which holds two images during training. The parameters of Batch Normalization layers are frozen, as is common practice. All backbones are pre-trained on the ImageNet dataset. We set the initial learning rate to 0.01 and decay it by a factor of 10 at 90k and 120k iterations for the 135k setting and at 120k and 160k for the 180k setting. For the 180k setting the multi-scale training strategy (resizing the shorter side of input images to a scale randomly chosen from 640 to 800) is adopted, as is also common practice. The momentum and weight decay are set to 0.9 and 1e-4 respectively. Following prior work, we use learning rate warmup for the first 500 iterations. We note that multiplying individual localization losses by the scores of an auxiliary task (in our case, predicted IoUs with corresponding GT boxes, or centerness scores when using the centerness prediction), which is also applied in previous works, helps train faster and leads to better performance.
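The learning rate schedule described above can be written as a small function. The base rate, decay factor, milestones and the 500 warmup iterations come from the text; the linear shape of the warmup and its starting factor of 1/3 are assumptions, since the text does not specify them.

```python
def learning_rate(iteration, base_lr=0.01, milestones=(90_000, 120_000),
                  gamma=0.1, warmup_iters=500, warmup_factor=1.0 / 3.0):
    """Step-decay schedule with linear warmup (warmup shape is an assumption)."""
    # Multiply by gamma once for every milestone already reached.
    lr = base_lr * gamma ** sum(iteration >= m for m in milestones)
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        lr *= warmup_factor * (1.0 - alpha) + alpha
    return lr

# 135k setting: decay at 90k and 120k iterations.
for it in (0, 250, 500, 100_000, 130_000):
    print(it, learning_rate(it))

# 180k setting: decay at 120k and 160k iterations.
print(learning_rate(150_000, milestones=(120_000, 160_000)))
```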

2 Network architecture

Here we provide Figure 5 for a visualization of our network architecture. It is a modified RetinaNet architecture with a single anchor per spatial location, which is exactly the same as the models used in FCOS and ATSS. The only difference is that the additional head in our model predicts IoUs of predicted boxes, whereas FCOS and ATSS models predict centerness scores.

3 More Ablation Studies

We conduct additional ablation studies regarding the effects of the top-k parameter $\mathcal{K}$ and the default anchor scale. All the experiments in the main paper are conducted with $\mathcal{K}=9$ and the default anchor scale of 8. The anchor size for each pyramid level is determined by the product of its stride and the default anchor scale (so with the default anchor scale 8 and a feature pyramid of strides from 8 to 128, the anchor sizes are from 64 to 1024). Table 4 shows the results on different default anchor scales. It shows that the proposed probabilistic anchor assignment is robust to both $\mathcal{K}$ and anchor sizes.
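The anchor size rule stated above is simple arithmetic and can be checked directly. The intermediate strides (16, 32, 64) are an assumption based on the usual doubling between pyramid levels; the text only gives the range from 8 to 128.

```python
def anchor_sizes(strides=(8, 16, 32, 64, 128), anchor_scale=8):
    """Anchor size per pyramid level = stride * default anchor scale."""
    return [stride * anchor_scale for stride in strides]

print(anchor_sizes())                # sizes for the default anchor scale of 8
print(anchor_sizes(anchor_scale=4))  # a smaller hypothetical anchor scale
```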

4 More Visualization of Anchor Assignment

We visualize the proposed anchor assignment during training. Figure 6 shows anchor assignment results on COCO training set. Figure 7 shows anchor assignment results on a non-COCO image.

5 Visualization of Detection Results

We visualize detection results on COCO minival set in Figure 8.