End-to-End Object Detection with Fully Convolutional Network

Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, Nanning Zheng

Introduction

Object detection is a fundamental topic in computer vision: it predicts a set of bounding boxes with pre-defined category labels for each image. Most mainstream detectors rely on hand-crafted designs such as anchor-based label assignment and non-maximum suppression (NMS). Recently, quite a number of methods have been proposed to eliminate the pre-defined set of anchor boxes by using distance-aware and distribution-based label assignments. Although they achieve remarkable progress and superior performance, discarding the NMS post-processing remains a challenge, which hinders fully end-to-end training.

To tackle this issue, Learnable NMS, Soft NMS and other NMS variants, and CenterNet have been proposed to improve duplicate removal, but they still do not provide an effective end-to-end training strategy. Meanwhile, many approaches based on recurrent neural networks have been introduced to predict the bounding box for each instance by using an autoregressive decoder. These approaches naturally model the prediction of bounding boxes as a sequence. However, they are only evaluated on small datasets without modern detectors, and their iterative nature makes inference inefficient.

Recently, DETR introduced a bipartite-matching-based training strategy and transformers with a parallel decoder to enable end-to-end detection. It achieves competitive performance against many state-of-the-art detectors. However, DETR currently suffers from a much longer training duration to converge and relatively lower performance on small objects. To this end, this paper explores a new perspective: could a fully convolutional network achieve competitive end-to-end object detection?

In this paper, we attempt to answer this question along two dimensions, i.e., label assignment and network architecture. As shown in Fig. 1, most fully convolutional detectors adopt a one-to-many label assignment rule, i.e., assigning many predictions as foreground samples for one ground-truth instance. This rule provides adequate foreground samples to obtain a strong and robust feature representation. Nevertheless, the massive number of foreground samples leads to duplicate predicted boxes for a single instance, which prevents end-to-end detection. To demonstrate this, we first give an empirical comparison of different existing hand-designed label assignments. We find that the one-to-one label assignment plays a crucial role in eliminating the post-processing step of duplicate removal. However, the hand-designed one-to-one assignment still has a drawback. The fixed assignment could cause ambiguity and reduce the discriminability of features, since the predefined regions of an instance may not be the best choice for training. To solve this issue, we propose a prediction-aware one-to-one (POTO) label assignment, which dynamically assigns the foreground samples according to the quality of classification and regression simultaneously.

Furthermore, for modern FPN-based detectors, extensive experiments demonstrate that the duplicate bounding boxes mainly come from the nearby regions of the most confident prediction across adjacent scales. Therefore, we design a 3D Max Filtering (3DMF) module, which can be embedded into the FPN head as a differentiable module. This module improves the discriminability of convolution in local regions by using a simple 3D max filtering operator across adjacent scales. Besides, to provide adequate supervision for feature representation learning, we modify a one-to-many assignment and use it as an auxiliary loss.

With the proposed techniques, our end-to-end detection framework achieves competitive performance against many state-of-the-art detectors. On the COCO dataset, our end-to-end detector based on the FCOS framework and a ResNeXt-101 backbone outperforms the baseline with NMS by 1.1% mAP. Furthermore, our end-to-end detector is more robust and flexible for crowded detection. To demonstrate its superiority in crowded scenes, we conduct further experiments on the CrowdHuman dataset. With the ResNet-50 backbone, our end-to-end detector achieves 3.0% AP50 and 6.0% mMR absolute gains over the FCOS baseline with NMS.

Related Work

Owing to the success of convolutional networks, object detection has achieved tremendous progress during the last decade. Modern one-stage or two-stage detectors heavily rely on anchors or anchor-based proposals. In these detectors, the anchor boxes are made up of pre-defined sliding windows, which are assigned as foreground or background samples with bounding box offsets. Because the anchor boxes are hand-designed and data-independent, the training targets of anchor-based detectors are typically sub-optimal and require careful tuning of hyper-parameters. Recently, FCOS and CornerNet give a different perspective for fully convolutional detectors by introducing an anchor-free framework. Nevertheless, these frameworks still need a hand-designed post-processing step for duplicate removal, i.e., non-maximum suppression (NMS). Since NMS is a heuristic approach and adopts a constant threshold for all instances, it needs careful tuning and might not be robust, especially in crowded scenes. In contrast, based on the anchor-free framework, this paper proposes a prediction-aware one-to-one assignment rule for classification to discard the non-trainable NMS.

2.2 End-to-End Object Detection

To achieve end-to-end detection, many approaches have been explored in the literature. Concretely, in earlier research, numerous detection frameworks based on recurrent neural networks attempt to produce a set of bounding boxes directly. Although they allow end-to-end learning in principle, their effectiveness is only demonstrated on small datasets and not against modern baselines. Meanwhile, Learnable NMS is proposed to learn duplicate removal by using a very deep and complex network, which achieves comparable performance against NMS. But it is constructed from discrete components and does not give an effective solution for end-to-end training. Recently, the relation network and DETR apply the attention mechanism to object detection, which models pairwise relations between different predictions. By using one-to-one assignment rules and direct set losses, they do not need any additional post-processing steps. Nevertheless, when performing massive numbers of predictions, these methods incur a high computational cost, making them inappropriate for dense prediction frameworks. Due to the lack of image priors and a multi-scale fusion mechanism, DETR also suffers from a much longer training duration than mainstream detectors and lower performance on small objects. Different from the approaches mentioned above, our method is the first to enable end-to-end object detection based on a fully convolutional network.

Methodology

To reveal the effect of label assignment on end-to-end object detection, we conduct several ablation studies of conventional label assignments on the COCO dataset. As shown in Tab. 1, all the experiments are based on the FCOS framework, whose centerness branch is removed to achieve a head-to-head comparison. The results demonstrate the superiority of one-to-many assignment in feature representation and the potential of one-to-one assignment for discarding the NMS. The detailed analysis is elaborated in the following sections.

Since the NMS post-processing is widely adopted in dense prediction frameworks, one-to-many label assignment has become the conventional way to assign training targets. The adequate foreground samples lead to a strong and robust feature representation. However, when discarding the NMS, the redundant foreground samples of one-to-many label assignment produce duplicate false-positive predictions that could cause a dramatic drop in performance, e.g., a 28.4% mAP absolute drop on the FCOS baseline. In addition, the reported mAR in Tab. 1 indicates the recall rate for the predictions with the top 100 scores. Without NMS, the one-to-many assignment rule leads to numerous duplicate predictions with high scores, thus reducing the recall rate. Therefore, it is hard for the detector to achieve competitive end-to-end detection by relying only on the one-to-many assignment.
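
For reference, the post-processing step under discussion can be sketched in a few lines. The snippet below is a minimal pure-Python illustration of standard greedy NMS, not the implementation used in the experiments; the function names and the IoU threshold of 0.5 are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first.

    iou_threshold is a single constant shared by all instances
    (0.5 is an illustrative value, not taken from the paper).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object plus one box on another object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.82 and is suppressed. Without this step, both boxes would be reported for the same object, which is the duplicate false-positive behaviour described above.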

3.1.2 Hand-designed One-to-one Label Assignment

MultiBox and YOLO demonstrate the potential of applying one-to-one label assignment to a dense prediction framework. In this paper, we evaluate two one-to-one label assignment rules to reveal the underlying connection with discarding NMS. These rules are modified from two widely used one-to-many label assignments: Anchor rule and Center rule. Concretely, Anchor rule is based on RetinaNet: each ground-truth instance is only assigned to the anchor with the maximum Intersection-over-Union (IoU). Center rule is based on FCOS: each ground-truth instance is only assigned to the pixel closest to the center of the instance in the pre-defined feature layer. All other anchors or pixels are set as background samples.
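
The Center rule can be illustrated with a small sketch. The code below is a simplified pure-Python version under stated assumptions: a single feature level is used (the selection of the pre-defined feature layer is omitted), pixel locations are given in image coordinates, and the function name `center_rule_one_to_one` is hypothetical.

```python
def center_rule_one_to_one(points, gt_boxes):
    """Assign each ground-truth box to the single pixel closest to its center.

    points:   list of (x, y) pixel locations of one feature level,
              expressed in image coordinates.
    gt_boxes: list of (x1, y1, x2, y2) boxes.
    Returns a list with one entry per point: the index of the assigned
    ground-truth box, or -1 for background.
    """
    assignment = [-1] * len(points)
    for g, (x1, y1, x2, y2) in enumerate(gt_boxes):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        nearest = min(
            range(len(points)),
            key=lambda p: (points[p][0] - cx) ** 2 + (points[p][1] - cy) ** 2,
        )
        # If two boxes share the same nearest pixel, the later one overwrites
        # the earlier one here; a real detector needs an explicit tie rule.
        assignment[nearest] = g
    return assignment


# A 4x4 feature map with stride 8: pixel centers at 4, 12, 20, 28.
stride = 8
points = [(x * stride + stride // 2, y * stride + stride // 2)
          for y in range(4) for x in range(4)]
gt_boxes = [(0, 0, 24, 24), (24, 24, 32, 32)]
labels = center_rule_one_to_one(points, gt_boxes)
```

Each ground-truth instance ends up with exactly one foreground pixel and every other pixel is background. The Anchor rule follows the same pattern, with the distance criterion replaced by the maximum IoU between anchors and the ground-truth box.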

As shown in Tab. 1, compared with the one-to-many label assignment, the one-to-one label assignment allows fully convolutional detectors to greatly reduce the gap between the results with and without NMS and to achieve reasonable performance without NMS. For instance, the detector based on Center rule achieves 21.5% mAP absolute gains over the FCOS baseline. Besides, as it avoids the erroneous suppression of the NMS in complex scenes, the recall rate is further increased. Nevertheless, there still exist two unresolved issues. First, when one-to-one label assignment is applied, the performance gap between detectors with and without NMS remains non-negligible. Second, due to less supervision for each instance, the performance of the one-to-one label assignment is still inferior to the FCOS baseline.

3.2 Our Methods

In this paper, to enable competitive end-to-end object detection, we propose a mixture label assignment and a new 3D Max Filtering (3DMF). The mixture label assignment is made up of the proposed prediction-aware one-to-one (POTO) label assignment and a modified one-to-many label assignment (auxiliary loss). With these techniques, our end-to-end framework can discard the NMS post-processing and keep the strong feature representation.

The hand-designed one-to-one label assignment follows a fixed rule. However, this rule may be sub-optimal for various instances in complex scenes, e.g., Center rule for an eccentric object. Thus, if the assignment procedure is forced to assign a sub-optimal prediction as the unique foreground sample, the difficulty for the network to converge could be dramatically increased, leading to more false-positive predictions. To this end, we propose a new rule named Prediction-aware One-To-One (POTO) label assignment, which dynamically assigns samples according to the quality of predictions.

Let $\Psi$ denote the index set of all the predictions. $G$ and $N$ correspond to the number of ground-truth instances and predictions, respectively, where typically $G\ll N$ in dense prediction detectors. $\hat{\pi}\in\Pi_{G}^{N}$ indicates a $G$-permutation of $N$ predictions. Our POTO aims to generate a suitable permutation $\hat{\pi}$ of predictions as the foreground samples. The training loss is formulated as Eq. 1, which consists of the foreground loss $\mathcal{L}_{\mathit{fg}}$ and the background loss $\mathcal{L}_{\mathit{bg}}$.

In Eq. 1, $\mathcal{R}(\hat{\pi})$ denotes the corresponding index set of the assigned foreground samples. For the $i$-th ground-truth, $c_{i}$ and $b_{i}$ are its category label and bounding box coordinates, respectively. For the $\hat{\pi}(i)$-th prediction, $\hat{p}_{\hat{\pi}(i)}$ and $\hat{b}_{\hat{\pi}(i)}$ correspond to its predicted classification scores and predicted box coordinates, respectively.

To achieve competitive end-to-end detection, we need to find a suitable label assignment $\hat{\pi}$. As shown in Eq. 2, previous works treat it as a bipartite matching problem by using the foreground loss as the matching cost, which can be rapidly solved by the Hungarian algorithm.

However, the foreground loss typically needs additional weights to alleviate optimization issues, e.g., unbalanced training samples and joint training of multiple tasks. As shown in Tab. 1, this property makes the training loss a sub-optimal choice for the matching cost. Therefore, as presented in Eq. 3 and Eq. 4, we propose a cleaner and more effective formulation (POTO) to find a better assignment.
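
The assignment procedure can be sketched as follows. This is a hedged illustration rather than the authors' implementation. It assumes the matching quality is a spatial-prior indicator multiplied by a weighted geometric mean of the classification score and the IoU, with $\alpha$ as the exponent on the IoU term, which agrees with how $\alpha$ is described in the ablation study. An exhaustive search over $G$-permutations stands in for the Hungarian algorithm so that the example needs only the standard library. All function names, the `prior_mask` argument, and the value $\alpha=0.8$ are assumptions.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def quality(score, pred_box, gt_box, in_prior, alpha):
    """Assumed quality: prior indicator * score^(1 - alpha) * IoU^alpha."""
    if not in_prior:
        return 0.0
    return (score ** (1.0 - alpha)) * (iou(pred_box, gt_box) ** alpha)


def poto_assign(cls_scores, pred_boxes, gt_labels, gt_boxes, prior_mask,
                alpha=0.8):
    """Return pi, where pi[g] is the prediction index assigned to ground truth g.

    cls_scores[n][c]: score of prediction n for class c.
    prior_mask[g][n]: True if prediction n lies in the spatial prior of g.
    Brute-force search; only practical for tiny examples.
    """
    num_gt, num_pred = len(gt_boxes), len(pred_boxes)
    q = [[quality(cls_scores[n][gt_labels[g]], pred_boxes[n], gt_boxes[g],
                  prior_mask[g][n], alpha)
          for n in range(num_pred)] for g in range(num_gt)]
    best, best_total = None, -1.0
    for pi in permutations(range(num_pred), num_gt):
        total = sum(q[g][pi[g]] for g in range(num_gt))
        if total > best_total:
            best, best_total = list(pi), total
    return best


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_labels = [0, 1]
pred_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (19, 19, 29, 29)]
cls_scores = [[0.3, 0.1], [0.9, 0.1], [0.2, 0.8], [0.1, 0.6]]
prior_mask = [[True] * 4 for _ in gt_boxes]

pi_mixed = poto_assign(cls_scores, pred_boxes, gt_labels, gt_boxes, prior_mask, alpha=0.8)
pi_cls_only = poto_assign(cls_scores, pred_boxes, gt_labels, gt_boxes, prior_mask, alpha=0.0)
```

In this toy case, the first ground truth is matched to prediction 0 (perfect box, moderate score) when $\alpha=0.8$, but to prediction 1 (highest score, looser box) when $\alpha=0$, which shows how the exponent trades classification against regression quality. For realistic sizes of $N$, a polynomial-time solver such as the Hungarian algorithm would replace the exhaustive search.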

3.2.2 3D Max Filtering

In addition to the label assignment, we attempt to design an effective architecture to realize more competitive end-to-end detection. To this end, we first reveal the distribution of duplicate predictions. As shown in Tab. 2, for a modern FPN-based detector, the performance degrades noticeably when NMS is applied to each scale separately. Moreover, we find that the duplicate predictions mainly come from the nearby spatial regions of the most confident prediction. Therefore, we propose a new module called 3D Max Filtering (3DMF) to suppress duplicate predictions.

Convolution is a linear operation with translational equivariance, which produces similar outputs for similar patterns at different positions. However, this property is a major obstacle to duplicate removal, since different predictions of the same instance typically have similar features in dense prediction detectors. The max filter is a rank-based non-linear filter, which can be used to compensate for the limited discriminative ability of convolutions in a local region. Besides, the max filter has also been utilized in key-point based detectors, e.g., CenterNet and CornerNet, as a new post-processing step to replace non-maximum suppression. It shows some potential for duplicate removal, but its non-trainable nature hinders effectiveness and end-to-end training. Meanwhile, the max filter only considers single-scale features, which is not appropriate for the widely used FPN-based detectors.

Therefore, we extend the max filter to a multi-scale version, called 3D Max Filtering, which transforms the features in each scale of FPN. The 3D Max Filtering is applied to each channel of a feature map separately.

Specifically, as shown in Eq. 5, given an input feature $x^{s}$ in the scale $s$ of FPN, we first adopt the bilinear operator to interpolate the features from $\tau$ adjacent scales to the same size as the input feature $x^{s}$.

As shown in Eq. 6, for a spatial location $i$ in scale $s$, the maximum value $y^{s}_{i}$ is then obtained in a pre-defined 3D neighbour tube with $\tau$ scales and $\phi\times\phi$ spatial distance. This operation can be easily implemented by a highly efficient 3D max-pooling operator.
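
The max step can be sketched in pure Python for a single channel. As a simplifying assumption, the feature maps from the $\tau$ adjacent scales are taken to be already resized to the resolution of scale $s$ (the bilinear interpolation is omitted), and borders are handled by clipping the window. The function name `max_filter_3d` is hypothetical; in a deep learning framework the same result would typically come from a 3D max-pooling call.

```python
def max_filter_3d(resized, phi=3):
    """Maximum over a tube of len(resized) scales and a phi x phi window.

    resized: list of H x W grids (nested lists), one per adjacent scale,
             all already resized to the resolution of the target scale.
    Returns an H x W grid y where y[i][j] is the maximum over all scales
    and over the phi x phi neighbourhood centred on (i, j).
    """
    height, width = len(resized[0]), len(resized[0][0])
    r = phi // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            out[i][j] = max(
                grid[y][x]
                for grid in resized
                for y in range(max(0, i - r), min(height, i + r + 1))
                for x in range(max(0, j - r), min(width, j + r + 1))
            )
    return out


# Target scale s and one adjacent scale (tau = 2), single channel.
scale_s = [[0.1, 0.2, 0.1],
           [0.2, 0.9, 0.3],
           [0.1, 0.3, 0.2]]
scale_adj = [[0.0, 0.0, 0.0],
             [0.0, 0.5, 0.0],
             [0.0, 0.0, 0.95]]
y = max_filter_3d([scale_s, scale_adj], phi=3)
```

Here the peak of 0.9 at the center of `scale_s` is no longer the maximum of its tube, because the adjacent scale holds a stronger response of 0.95 nearby. This cross-scale competition is what a single-scale 2D max filter cannot express.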

Furthermore, to embed the 3D Max Filtering into existing frameworks and enable end-to-end training, we propose a new module, as shown in Fig. 3. This module leverages max filtering to select the predictions with the highest activation value in a local region and could enhance the distinction from other predictions, which is further verified in Sec. 4.2.1. Owing to this property, as shown in Fig. 2, we adopt the 3DMF to refine the coarse dense predictions and suppress the duplicate predictions. Besides, all the modules are constructed from simple differentiable operators and incur only slight computational overhead.

3.2.3 Auxiliary Loss

In addition, when using the NMS, as shown in Tab. 1, the performance of POTO and 3DMF is still inferior to the FCOS baseline. This phenomenon may be attributed to the fact that one-to-one label assignment provides less supervision, making it difficult for the network to learn a strong and robust feature representation. It could further reduce the discrimination of classification, thus causing a decrease in performance. To this end, motivated by many previous works, we introduce an auxiliary loss based on one-to-many label assignment to provide adequate supervision, which is illustrated in Fig. 2.

Similar to ATSS, our auxiliary loss adopts the focal loss with a modified one-to-many label assignment. Specifically, the one-to-many label assignment first takes the top-9 predictions as candidates in each FPN stage, according to the proposed matching quality in Eq. 4. It then assigns as foreground samples the candidates whose matching qualities exceed a statistical threshold. The statistical threshold is calculated as the sum of the mean and the standard deviation of all the candidate matching qualities. In addition, different forms of one-to-many label assignment for the auxiliary loss are reported in detail in the supplementary material.
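
The candidate selection and thresholding for one ground-truth instance can be sketched as follows. This is an illustration under assumptions: the matching qualities are taken as given, the population standard deviation is used (the text does not specify which estimator applies), the comparison is strict, and the function name `aux_foreground` is hypothetical.

```python
from statistics import mean, pstdev


def aux_foreground(qualities_per_stage, top_k=9):
    """Select foreground samples for one ground-truth instance.

    qualities_per_stage: one list per FPN stage, each containing
                         (prediction_index, matching_quality) pairs.
    Returns (foreground_indices, threshold).
    """
    candidates = []
    for stage in qualities_per_stage:
        # Top-k candidates of each stage, ranked by matching quality.
        ranked = sorted(stage, key=lambda item: item[1], reverse=True)
        candidates.extend(ranked[:top_k])
    values = [q for _, q in candidates]
    # Statistical threshold: mean plus standard deviation of the candidates.
    # pstdev (population estimator) is an assumption made for this sketch.
    threshold = mean(values) + pstdev(values)
    foreground = [idx for idx, q in candidates if q > threshold]
    return foreground, threshold


# Two FPN stages with three predictions each; top_k=2 keeps the example small.
stage0 = [(0, 0.9), (1, 0.5), (2, 0.1)]
stage1 = [(3, 0.3), (4, 0.2), (5, 0.05)]
fg, thr = aux_foreground([stage0, stage1], top_k=2)
```

With these toy values the four candidates have a mean quality of 0.475 and a standard deviation of about 0.268, so only prediction 0 exceeds the threshold of about 0.743. Because the threshold adapts to each instance, the number of foreground samples varies from one ground truth to another.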

Experiments

As in FCOS, our detector adopts a pair of 4-convolution heads for classification and regression, respectively. The output channel numbers of the first convolution and the second convolution in 3DMF are 256 and 1, respectively. All the backbones are pre-trained on the ImageNet dataset with frozen batch normalization. In the training phase, input images are resized so that their shorter side is 800 pixels. All the training hyper-parameters are identical to the 2x schedule (180k iterations) in Detectron2 unless otherwise specified.

4.2 Ablation Studies on COCO

Fig. 4 visualizes the classification scores from the FCOS baseline and our proposed framework. For a single instance, the FCOS baseline with the one-to-many assignment rule outputs massive duplicate predictions, which are highly activated and have activation scores comparable to that of the most confident one. These duplicate predictions are evaluated as false-positive samples and greatly affect performance. In contrast, by using the proposed POTO rule, the scores of duplicate samples are significantly suppressed. This property is crucial for the detector to achieve direct bounding box prediction without NMS. Moreover, with the proposed 3DMF module, this property is further enhanced, especially in the nearby regions of the most confident prediction. Besides, since the 3DMF module introduces a multi-scale competitive mechanism, the detector can produce unique predictions across different FPN stages, e.g., an instance in Fig. 4 has a single highly activated score in various stages.

4.2.2 Prediction-Aware One-to-One Label Assignment

Spatial prior. As shown in Tab. 3, for the spatial range of assignment, the center sampling strategy is relatively superior to the inside box and global strategies on the COCO dataset. It reflects that the prior knowledge of images is essential in the real world scenario.

Classification vs. regression. The hyper-parameter $\alpha$, as shown in Eq. 4, controls the ratio of the importance between classification and regression. As reported in Tab. 3, when $\alpha=1$, the gap with NMS is not narrowed. It could be attributed to the misalignment between the best positions for classification and regression. When $\alpha=0$, the assignment rule only relies on the predicted scores of classification. Under this condition, the gap with NMS is considerably eliminated, but the absolute performance is still unsatisfactory, which could be caused by overfitting the sub-optimal initialization. In contrast, with a proper fusion of classification and regression quality, the absolute performance is remarkably improved.

4.2.3 3D Max Filtering

Components. As shown in Tab. 5, without NMS post-processing, our end-to-end detector with POTO achieves 19.0% mAP absolute gains over the vanilla FCOS. By using the proposed 3DMF, the performance is further improved by 1.8% mAP, and the gap with NMS is narrowed to 0.2% mAP. As shown in Fig. 4, the result shows the crucial role of the multi-scale and local-range suppression for end-to-end object detection. The proposed auxiliary loss gives adequate supervision, making our detector obtain competitive performance against the FCOS with NMS.

End-to-end. To demonstrate the superiority of end-to-end training, we replace the 2D Max Filtering of CenterNet with the 3D Max Filtering as a new post-processing step for duplicate removal. This post-processing is then applied to the FCOS detector. As shown in Tab. 5, the end-to-end approach achieves a significant absolute gain of 1.1% mAP.

Kernel size. As shown in Tab. 6, we evaluate different settings of spatial range $\phi$ and scale range $\tau$ in the 3DMF. When $\phi=3$ and $\tau=2$, our method obtains the highest performance on the COCO dataset. This suggests that the duplicate predictions mainly come from a local region across adjacent scales, which is similar to the observation in Sec. 3.2.2.

Performance w.r.t. training duration. As illustrated in Fig. 5(a), at the very beginning, the performance of our end-to-end detectors on the COCO val set is inferior to that of the detectors with NMS. As training progresses, the performance gap becomes smaller and smaller. After 180k training iterations, our method finally outperforms the other detectors with NMS. This phenomenon also occurs on the CrowdHuman val set, as shown in Fig. 5(c). Moreover, owing to the removal of hand-designed post-processing, Fig. 5(b) demonstrates the superiority of our method in recall rate over the NMS-based methods.

4.2.4 Larger Backbone

To further demonstrate the robustness and effectiveness of our method, we provide experiments with larger backbones. The detailed results are reported in Tab. 7. Concretely, when using ResNet-101 as the backbone, our method is slightly superior to FCOS by 0.5% mAP. With a stronger backbone, i.e., ResNeXt-101 with deformable convolutions, our end-to-end detector achieves 1.1% mAP absolute gains over FCOS with NMS. This might be attributed to the flexible spatial modeling of deformable convolutions. Moreover, the proposed 3DMF is efficient and easy to implement. As shown in Tab. 7, our 3DMF module adds only a slight computational overhead compared with the baseline detector with NMS.

4.3 Evaluation on CrowdHuman

We evaluate our model on the CrowdHuman dataset, which is a large human detection dataset with various kinds of occlusions. Compared with the COCO dataset, CrowdHuman has more complex and crowded scenes, posing severe challenges to conventional duplicate removal. Our end-to-end detector is more robust and flexible in crowded scenes. As shown in Tab. 8 and Fig. 5, our method significantly outperforms several state-of-the-art detectors with NMS, e.g., 3.0% mAP and 6.0% mMR absolute gains over FCOS. Moreover, the recall rate of our method even exceeds that obtained by applying NMS to the ground-truth boxes.

Conclusion

This paper has presented a prediction-aware one-to-one label assignment and a 3D Max Filtering to bridge the gap between fully convolutional networks and end-to-end object detection. With the auxiliary loss, our end-to-end framework achieves superior performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. Our method also demonstrates great potential in complex and crowded scenes, which may benefit many other instance-level tasks.

Acknowledgement

This research was supported by National Key R&D Program of China (No. 2017YFA0700800), National Natural Science Foundation of China (No. 61790563 and 61751401) and Beijing Academy of Artificial Intelligence (BAAI).

Appendix A Auxiliary Loss

In this section, we evaluate different one-to-many label assignment rules for the auxiliary loss. The detailed implementations are elaborated as follows:

FCOS. We adopt the assignment rule in FCOS.

ATSS. We adopt the assignment rule in ATSS.

Quality-ATSS. The rule is elaborated in Sec. 3.2.3.

Quality-FCOS. Similar to FCOS, each ground-truth instance is assigned to the pixels in the pre-defined central area of a specific FPN stage. But the specific FPN stage is selected according to the proposed quality instead of the size of instances.

Quality-Top-$k$. Each ground-truth instance is assigned to the pixels with the top-$k$ highest qualities over all the FPN stages. We set $k=9$ to align with other rules.
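
As a point of comparison with Quality-ATSS, the Quality-Top-$k$ rule reduces to a single global selection. The sketch below assumes the matching qualities of one ground-truth instance have already been computed and pooled over all FPN stages; the function name `quality_top_k` is hypothetical.

```python
import heapq


def quality_top_k(qualities, k=9):
    """Return the indices of the k predictions with the highest quality.

    qualities: list of (prediction_index, matching_quality) pairs pooled
               over all FPN stages for one ground-truth instance.
    """
    best = heapq.nlargest(k, qualities, key=lambda item: item[1])
    return [idx for idx, _ in best]


pooled = [(0, 0.2), (1, 0.9), (2, 0.5), (3, 0.7)]
selected = quality_top_k(pooled, k=2)
```

Unlike the statistical threshold of Quality-ATSS, this rule always yields exactly $k$ foreground samples per instance (or fewer when fewer predictions are available), regardless of how the qualities are distributed.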

As shown in Tab. 9, the results demonstrate the superiority of our proposed prediction-aware quality function over the hand-designed matching metrics. Compared with the standard ATSS framework, the quality based rule can obtain 1.3% mAP absolute gains.

Appendix B Comparison to DETR

As shown in Tab. 10 and Tab. 11, we give the comparison of different methods based on ResNet-50 backbone, where the NMS is not utilized except for FCOS.

Compared with transformers, convolutions have been extensively tested in vision applications and have many variants that yield better performance than DETR, e.g., deformable convolutions in Tab. 10. Moreover, as shown in Tab. 11, our framework has great advantages over DETR in convergence speed and in crowded scenes.