Rank-DETR for High Quality Object Detection

Yifan Pu, Weicong Liang, Yiduo Hao, Yuhui Yuan, Yukang Yang, Chao Zhang, Han Hu, Gao Huang

Introduction

The landscape of modern object detection systems has undergone significant transformation since the pioneering work DEtection TRansformer (DETR). Since DETR yielded impressive results in object detection, numerous subsequent works such as Deformable-DETR, DINO, and H-DETR have further advanced the field. Moreover, these DETR-based approaches have been successfully extended to various core visual recognition tasks, including instance/panoptic segmentation, pose estimation, and multi-object tracking. The notable progress in these areas can be credited to ongoing improvements to the DETR-based framework for object detection.

Considerable effort has been dedicated to advancing the performance of DETR-based methods from various perspectives, including refining the Transformer encoder and decoder architectures and redesigning the query formulations. While substantial research has been dedicated to developing accurate ranking mechanisms for dense one-stage object detectors like FCOS or ATSS, few studies have specifically investigated this aspect for modern object detectors based on DETR. However, ranking mechanisms are vital for improving average precision, particularly under high IoU thresholds.

The primary focus of this study is constructing a high-quality DETR-based object detector that performs strongly at relatively high IoU thresholds. Establishing an accurate ranking order for bounding box predictions is critical to constructing such detectors. To achieve this, we introduce two rank-oriented designs that leverage precise ranking information. First, we propose a rank-adaptive classification head and a query rank layer after each Transformer decoder layer. The rank-adaptive classification head adjusts the classification scores using rank-aware learnable logit bias vectors, while the query rank layer fuses additional ranking embeddings into the object queries. Second, we propose two rank-oriented optimization techniques: a loss function modification and a matching cost design. These techniques facilitate the ranking procedure of the model and prioritize more accurate bounding box predictions, i.e., those with higher IoU scores with respect to the ground truth. In summary, our rank-oriented designs consistently enhance object detection performance, particularly the AP scores under high IoU thresholds.

To validate the efficacy of our approach, we conducted comprehensive experiments, showcasing consistent performance improvements across recent strong DETR-based methods such as H-DETR and DINO-DETR. For example, based on H-DETR, our method demonstrates a notable increase in AP75 of +2.1% (52.9% vs. 55.0%) and +2.7% (55.1% vs. 57.8%) when utilizing ResNet-50 and Swin-T backbones, respectively. It is worth highlighting that our approach achieves competitive performance, reaching a 50.2% AP in the $1\times$ training schedule on the COCO val dataset. These results serve as compelling evidence for the effectiveness and reliability of our proposed methodology.

Related Work

Since the groundbreaking introduction of transformers in 2D object detection by the pioneering work DETR, numerous subsequent studies have developed diverse and advanced extensions based on DETR. This is primarily due to DETR’s ability to eliminate the need for hand-designed components such as non-maximum suppression (NMS). One of the first foundational developments, Deformable-DETR, introduced a multi-scale deformable self/cross-attention scheme, which selectively attends to a small set of key sampling points in a reference bounding box. This approach yielded improved performance compared to DETR, particularly for small objects. Furthermore, DAB-DETR and DN-DETR demonstrated that a novel query formulation could also enhance performance. The subsequent work, DINO-DETR, achieved state-of-the-art results in object detection tasks, showcasing the advantages of DETR design by addressing the inefficiency caused by the one-to-one matching scheme. In contrast to these works, our focus lies in the design of the rank-oriented mechanism for DETR. We propose rank-oriented architecture designs and rank-oriented matching cost and loss function designs to construct a highly performant DETR-based object detector with competitive AP75 results.

Ranking for Object Detection.

Considerable effort has gone into improving ranking for object detection tasks. For example, IoU-Net constructed an additional IoU predictor and an IoU-guided NMS scheme that considers both classification scores and localization scores during inference. Generalized focal loss proposed a quality focal loss to act as a joint representation of the IoU score and classification score. VarifocalNet introduced an IoU-aware classification score to achieve a more accurate ranking of candidate detection results. TOOD defined a high-order combination of the classification score and the IoU score as the anchor alignment metric to encourage the object detector to focus on high-quality anchors dynamically. In addition, ranking-based loss functions are designed to encourage the detector to rank the predicted bounding boxes according to their quality and to penalize incorrect rankings. The very recent concurrent works Stable-DINO and Align-DETR also applied the idea of an IoU-aware classification score to improve the loss and matching design for DINO-DETR. In contrast to the aforementioned endeavors, we further introduce a query rank scheme aimed at reducing false positive rates.

Dynamic Neural Networks.

In contrast to static models, which have fixed computational graphs and parameters at the inference stage, dynamic neural networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of performance, adaptiveness, computational efficiency, and representational power. Dynamic networks are typically categorized into three types: sample-wise, spatial-wise, and temporal-wise. In this work, we introduce a novel query-wise dynamic approach, which dynamically integrates ranking information into the object queries based on their box quality ranking, endowing object queries with better representation ability.

Approach

We first revisit the overall pipeline of modern DETR-based methods in Section 3.1. The detailed design of the proposed method, including the rank-oriented architecture design and optimization design, is then presented in Section 3.2 and Section 3.3. Finally, we discuss the connections and differences between our approach and related works in Section 3.4.

Detection Transformers (DETRs) process an input image $\mathcal{I}$ by first passing it through a backbone network and a Transformer encoder to obtain a sequence of enhanced pixel embeddings $\mathcal{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{\textsf{N}}\}$. The enhanced pixel embeddings, along with a default set of object query embeddings $\mathcal{Q}^{0}=\{\mathbf{q}_{1}^{0},\mathbf{q}_{2}^{0},\cdots,\mathbf{q}_{n}^{0}\}$, are then fed into the Transformer decoder. After each Transformer decoder layer, task-specific prediction heads are applied to the updated object query embeddings to generate a set of classification predictions $\mathcal{P}^{l}=\{\mathbf{p}_{1}^{l},\mathbf{p}_{2}^{l},\cdots,\mathbf{p}_{n}^{l}\}$ and bounding box predictions $\mathcal{B}^{l}=\{\mathbf{b}_{1}^{l},\mathbf{b}_{2}^{l},\cdots,\mathbf{b}_{n}^{l}\}$, respectively, where $l\in\{1,2,\cdots,L\}$ denotes the layer index of the Transformer decoder. Finally, DETR performs one-to-one bipartite matching between the predictions and the ground-truth bounding boxes and labels $\mathcal{G}=\{\mathbf{g}_{1},\mathbf{g}_{2},\cdots,\mathbf{g}_{m}\}$ by associating each ground truth with the prediction that has the minimal matching cost and applying the corresponding supervision.
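To make the one-to-one matching step concrete, the following is a minimal sketch, assuming a precomputed cost table `cost[i][j]` between prediction `i` and ground truth `j`. The function name `match_one_to_one` and the toy costs are hypothetical, and the exhaustive search is only a stand-in for the Hungarian algorithm that DETR-style detectors use in practice; it returns the same optimum on small inputs but does not scale.

```python
from itertools import permutations


def match_one_to_one(cost):
    """Return (assignment, total_cost) for a one-to-one matching.

    cost[i][j] is the matching cost between prediction i and ground truth j.
    assignment[j] is the index of the prediction matched to ground truth j.
    Assumes there are at least as many predictions as ground truths.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_assign = float("inf"), None
    # Brute force: try every way of picking n_gt distinct predictions.
    for assign in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(assign))
        if total < best_total:
            best_total, best_assign = total, assign
    return list(best_assign), best_total


# Four predictions, two ground-truth boxes (illustrative costs only).
cost = [
    [0.90, 0.20],
    [0.10, 0.80],
    [0.30, 0.25],
    [0.70, 0.60],
]
matched, total = match_one_to_one(cost)
unmatched = [i for i in range(len(cost)) if i not in matched]
```

Predictions left in `unmatched` are the ones supervised as background, which is why their ordering relative to matched predictions matters for ranking.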

Object Query.

To update the object query $\mathcal{Q}^{0}$ after each Transformer decoder layer, typically, DETRs form a total of $L$ subsets, i.e., $\{\mathcal{Q}^{1},\mathcal{Q}^{2},\cdots,\mathcal{Q}^{L}\}$, for $L$ Transformer decoder layers. For both the initial object query $\mathcal{Q}^{0}$ and the updated ones after each layer, each $\mathcal{Q}^{l}$ is formed by adding two parts: content queries $\mathcal{Q}^{l}_{c}=\{\mathbf{q}^{l}_{c,1},\mathbf{q}^{l}_{c,2},\cdots,\mathbf{q}^{l}_{c,n}\}$ and position queries $\mathcal{Q}^{l}_{p}=\{\mathbf{q}^{l}_{p,1},\mathbf{q}^{l}_{p,2},\cdots,\mathbf{q}^{l}_{p,n}\}$. The content queries capture semantic category information, while the position queries encode prior positional information such as the distribution of bounding box centers and sizes.

Ranking in DETR.

Rank-oriented design plays a crucial role in modern object detectors, particularly in achieving superior average precision (AP) scores under high Intersection over Union (IoU) thresholds. The success of state-of-the-art detectors, such as H-DETR and DINO-DETR, relies on using simple rank-oriented designs, specifically a two-stage scheme and mixed query selection. These detectors generate the initial positional query $\mathcal{Q}_{p}^{0}$ by ranking the dense coarse bounding box predictions output by the Transformer encoder feature maps and selecting the top ${\sim}300$ confident ones. During evaluation, they gather $n\times K$ bounding box predictions based on the object query embedding $\mathcal{Q}^{L}$ produced by the final Transformer decoder layer (each query within $\mathcal{Q}^{L}$ generates $K$ predictions associated with each category), sort them by their classification confidence scores in descending order, and only return the top ${\sim}100$ most confident predictions.
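The evaluation-time ranking described above can be sketched in a few lines. This is an illustrative sketch rather than the code of any particular detector: `top_k_detections` is a hypothetical helper, scores are plain nested lists, and boxes are opaque placeholders.

```python
def top_k_detections(scores, boxes, k=100):
    """Rank all n x K (query, class) pairs and keep the k most confident.

    scores[i][c] is the confidence of query i for class c;
    boxes[i] is the box predicted by query i.
    Returns a list of (box, class_index, score) in descending score order.
    """
    candidates = [
        (score, i, c)
        for i, row in enumerate(scores)
        for c, score in enumerate(row)
    ]
    # Sort every (query, class) pair by confidence, highest first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(boxes[i], c, score) for score, i, c in candidates[:k]]


# Three queries, two classes: six candidates in total, keep the top three.
scores = [[0.90, 0.10], [0.20, 0.60], [0.05, 0.30]]
boxes = ["box0", "box1", "box2"]
detections = top_k_detections(scores, boxes, k=3)
```

Because only the classification score decides the order, a confident but poorly localized box can displace a well localized one, which is the failure mode the rank-oriented designs target.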

In this work, we focus on further extracting the benefits of rank-oriented designs and introduce a set of improved designs to push the envelope of high-quality object detection performance. The subsequent discussion elaborates on these details.

3.2 Rank-oriented Architecture Design: ensure lower FP and FN

While the original rank-oriented design only incorporates rank information into the initial positional query $\mathcal{Q}_{p}^{0}$, we propose an enhanced approach that leverages the benefits of sorting throughout the entire Transformer decoder process. Specifically, we introduce a rank-adaptive classification head after each Transformer decoder layer, and a query rank layer before each of the last $L-1$ Transformer decoder layers. This novel design is intended to boost the detection of true positives while suppressing false positives and correcting false negatives, leading to lower false positive rates and false negative rates. Figure 1 illustrates the detailed pipeline of our rank-oriented architecture designs.

Rank-adaptive Classification Head. We modify the original classification head by adding a set of learnable logit bias vectors $\mathcal{S}^{l}=\{\mathbf{s}_{1}^{l},\mathbf{s}_{2}^{l},\cdots,\mathbf{s}_{n}^{l}\}$ to the classification scores $\mathcal{T}^{l}=\{\mathbf{t}_{1}^{l},\mathbf{t}_{2}^{l},\cdots,\mathbf{t}_{n}^{l}\}$ (before the $\operatorname{Sigmoid}(\cdot)$ function) associated with each object query independently. The classification predictions of the $l$-th decoder layer $\mathcal{P}^{l}=\{\mathbf{p}_{1}^{l},\mathbf{p}_{2}^{l},\cdots,\mathbf{p}_{n}^{l}\}$ can be formulated as:

$$\mathbf{p}_{i}^{l}=\operatorname{Sigmoid}(\mathbf{t}_{i}^{l}+\mathbf{s}_{i}^{l}),\qquad \mathbf{t}_{i}^{l}=\operatorname{MLP}_{\text{cls}}(\mathbf{q}_{i}^{l}), \tag{1}$$

where $\mathcal{Q}^{l}=\{\mathbf{q}_{1}^{l},\mathbf{q}_{2}^{l},\cdots,\mathbf{q}_{n}^{l}\}$ represents the output embedding after the $l$-th Transformer decoder layer. The hidden dimensions of both $\mathbf{t}^{l}_{i}$ and $\mathbf{s}^{l}_{i}$ are the number of categories, i.e., $K$. The overall pipeline is shown in Figure 1(b). It is noteworthy that we can directly incorporate a set of learnable embeddings, denoted as $\mathcal{S}^{l}$, into the classification scores $\mathcal{T}^{l}$. This is practicable because the associated $\mathcal{Q}^{l}$ has already been sorted in the query rank layer, as explained below.
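As a minimal sketch of the rank-adaptive head, assuming the logits have already been produced by the classification MLP and the queries are already in rank order, the bias for rank position `i` is simply added to the logits of the query at that position before the sigmoid. The helper name `rank_adaptive_scores` and the numeric values are hypothetical; in training the bias vectors would be learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rank_adaptive_scores(logits, rank_bias):
    """Add a per-rank logit bias before the sigmoid.

    logits[i][c]    : classification logit of the query at rank i, class c.
    rank_bias[i][c] : bias tied to rank position i (not to the query itself).
    """
    return [
        [sigmoid(t + s) for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(logits, rank_bias)
    ]


# Two queries with identical logits; only their rank positions differ.
logits = [[0.0, -1.0], [0.0, -1.0]]
rank_bias = [[1.0, 0.0], [-1.0, 0.0]]  # illustrative values
probs = rank_adaptive_scores(logits, rank_bias)
```

With these toy values the top-ranked query ends up with a higher class-0 probability than the second one even though the raw logits are equal, which is the effect a rank-indexed bias is meant to provide.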

Query Rank Layer. We further introduce a query rank layer before each of the last $L-1$ Transformer decoder layers to regenerate the sorted positional query and content query accordingly.

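First, we explain how to construct the rank-aware content query:

$$\hat{\mathcal{Q}}_{c}^{l-1}=\operatorname{Sort}(\mathcal{Q}_{c}^{l-1};\hat{\mathcal{P}}^{l-1}),\qquad \overline{\mathcal{Q}}_{c}^{l}=\operatorname{MLP}_{\text{fuse}}(\hat{\mathcal{Q}}_{c}^{l-1}\,\|\,\mathcal{C}^{l}), \tag{2}$$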

where we first sort the output of the $(l-1)$-th Transformer decoder layer $\mathcal{Q}_{c}^{l-1}$ in descending order of $\hat{\mathcal{P}}^{l-1}=\operatorname{MLP}_{\text{cls}}(\mathcal{Q}_{c}^{l-1})$. Since each element in $\hat{\mathcal{P}}^{l-1}$ is a $K$-dimensional vector, we use the maximum value over the $K$ categories (classification confidence) as the ranking basis. The operation $\operatorname{Sort}(A;B)$ sorts the elements of $A$ in decreasing order of the elements in $B$. Then, we concatenate ($\|$) the sorted object content queries $\hat{\mathcal{Q}}_{c}^{l-1}$ with a set of randomly initialized content queries $\mathcal{C}^{l}$ along the feature dimension, where $l\in\{2,\cdots,L\}$. This set of content queries $\mathcal{C}^{l}$ is optimized in an end-to-end manner. Subsequently, we fuse them back to the original dimension using a fully connected layer ($\operatorname{MLP}_{\text{fuse}}$). In other words, for each Transformer decoder layer, we maintain a set of rank-aware static content embeddings shared across different samples. These embeddings effectively model and utilize the distribution of the most frequent semantic information.
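The sort, concatenate, and fuse steps can be sketched as follows. This is a toy, dependency-free illustration under stated assumptions: queries are short Python lists, the fusion layer is a single linear map with hand-picked weights (in the method these are learned), and the names `sort_by_confidence` and `fuse` are hypothetical.

```python
def sort_by_confidence(queries, class_scores):
    """Sort(A; B): reorder queries by decreasing maximum class score."""
    order = sorted(
        range(len(queries)),
        key=lambda i: max(class_scores[i]),
        reverse=True,
    )
    return [queries[i] for i in order], order


def fuse(sorted_queries, rank_embeddings, weight, bias):
    """Concatenate each sorted query with the embedding of its rank
    position, then project back to the original dimension (2d -> d)."""
    fused = []
    for q, c in zip(sorted_queries, rank_embeddings):
        x = q + c  # list concatenation along the feature dimension
        fused.append(
            [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]
        )
    return fused


queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]        # d = 2
class_scores = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.5]]   # K = 2
rank_embeddings = [[10.0, 0.0], [0.0, 0.0], [-10.0, 0.0]]
weight = [[1, 0, 1, 0], [0, 1, 0, 1]]  # toy weights: adds query and embedding
bias = [0.0, 0.0]

sorted_queries, order = sort_by_confidence(queries, class_scores)
content_queries = fuse(sorted_queries, rank_embeddings, weight, bias)
```

Note that the rank embeddings are indexed by position after sorting, so the same embedding is applied to whichever query currently holds that rank in a given image.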

Next, we present the mathematical formulations for computing the rank-aware positional query. To align the order of the positional queries with the ranked content query, we either sort or recreate the positional queries, depending on the initialization method of positional queries for different DETR-based detectors. For H-DETR, which inherits Deformable DETR and uses the same positional query for all $L$ Transformer decoder layers, we simply sort the positional query of the previous layer:

$$\overline{\mathcal{Q}}_{p}^{l}=\operatorname{Sort}(\mathcal{Q}_{p}^{l-1};\hat{\mathcal{P}}^{l-1}). \tag{3}$$

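For DINO-DETR, which generates new positional queries from the bounding box predictions in each Transformer decoder layer, we sort the bounding box predictions of each object query and recreate the positional query embedding from the sorted boxes:

$$\overline{\mathcal{B}}^{l-1}=\operatorname{Sort}(\mathcal{B}^{l-1};\hat{\mathcal{P}}^{l-1}),\qquad \overline{\mathcal{Q}}_{p}^{l}=\operatorname{PE}(\overline{\mathcal{B}}^{l-1}), \tag{4}$$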

where $\mathcal{B}^{l-1}$ and $\hat{\mathcal{P}}^{l-1}$ represent the bounding box predictions and classification predictions based on the output of the $(l-1)$-th Transformer decoder layer, i.e., $\mathcal{Q}^{l-1}$. $\operatorname{PE}(\cdot)$ contains a sine position encoding function and a small multilayer perceptron to recreate the positional query embedding $\overline{\mathcal{Q}}_{p}^{l}$. In other words, each element of $\overline{\mathcal{Q}}_{p}^{l}$ is estimated by $\overline{\mathbf{q}}_{p,i}^{l}=\operatorname{PE}(\overline{\mathbf{b}}_{i}^{l-1})$. In Figure 1(c), we illustrate the positional query update process for H-DETR (Equation 3) and omit that process for DINO-DETR (Equation 4), because we primarily conducted experiments on H-DETR.
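The two positional-query variants can be contrasted in a short sketch. Everything here is illustrative: `sine_encode` is a simplified sinusoidal encoding with hypothetical defaults (`num_freqs`, `temperature`), the small MLP that follows it in the method is omitted, and `order` is assumed to be the rank permutation obtained from the previous layer's classification confidences.

```python
import math


def sine_encode(box, num_freqs=2, temperature=10000.0):
    """Simplified sinusoidal encoding of a box (cx, cy, w, h)."""
    feats = []
    for value in box:
        for k in range(num_freqs):
            freq = temperature ** (-k / num_freqs)
            feats += [math.sin(value * freq), math.cos(value * freq)]
    return feats


def reorder(items, order):
    return [items[i] for i in order]


def positional_queries_sorted(prev_pos_queries, order):
    """H-DETR style: reuse the previous positional queries, only sorted."""
    return reorder(prev_pos_queries, order)


def positional_queries_recreated(prev_boxes, order):
    """DINO-DETR style: sort the predicted boxes, then re-encode each box.
    The small MLP applied after the sine encoding is omitted in this sketch."""
    return [sine_encode(box) for box in reorder(prev_boxes, order)]


order = [2, 0, 1]  # rank permutation from the previous layer
pos_sorted = positional_queries_sorted(["a", "b", "c"], order)
boxes = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.3, 0.3), (0.0, 0.9, 0.4, 0.4)]
pos_recreated = positional_queries_recreated(boxes, order)
```

In both variants the same permutation is applied to content and positional queries, so position `i` always refers to one and the same query after ranking.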

Finally, we transmit the regenerated rank-aware positional query embedding and content query embedding to the subsequent Transformer decoder layer.

Analysis. The key motivation behind these two rank-oriented architecture designs is to adjust the classification scores of the object queries according to their ranking order. Within each Transformer decoder layer, we incorporate two sets of learnable representations: the logit bias vectors $\mathcal{S}^{l}$ and the content query vectors $\mathcal{C}^{l}$. By leveraging these two rank-aware representations, we empirically demonstrate that our approach reduces false positives (oLRP$_{\text{FP}}$: $24.5\%\to 24.1\%$) and mitigates false negatives (oLRP$_{\text{FN}}$: $39.5\%\to 38.6\%$). For a more comprehensive understanding of our findings, please refer to the experiment section.

3.3 Rank-oriented Matching Cost and Loss: boost the AP under high IoU thresholds

The conventional DETR and its derivatives specify the Hungarian matching cost function $\mathcal{L}_{\textrm{Hungarian}}$ and the training loss function $\mathcal{L}$ in an identical manner, as shown below:

GIoU-aware Classification Loss. Instead of applying the binary target to supervise the classification head, we propose to use the normalized GIoU scores to supervise the classification prediction:

where IoU represents the intersection over union score between the predicted bounding box and the ground-truth one. We adopt a larger value of $\alpha$ (e.g., $>2$) to prioritize localization accuracy, thereby promoting more accurate bounding box predictions and downgrading inaccurate ones. It is worth noting that we use the high-order matching cost starting from the middle training stage, as most predictions exhibit poor localization quality during the early training stage.

Analysis. The rank-oriented loss function and matching cost are designed to enhance object detection performance at high IoU thresholds. The GIoU-aware classification loss facilitates the ranking procedure by endowing the classification score with GIoU-awareness, resulting in more accurate ranking in the query ranking layer. Meanwhile, the high-order matching cost selects the queries with both high classification confidence and superior IoU scores as positive samples, effectively suppressing challenging negative predictions with high classification scores but low localization IoU scores. This is achieved by magnifying the advantage of a more accurate localization score using $\gamma^{\alpha}$, where $\gamma$ is the ratio of the more accurate localization score to the less accurate one. Empirical results show significant boosts in AP75 with the GIoU-aware classification loss ($52.9\%\to 54.1\%$) or the high-order matching cost design ($52.9\%\to 54.0\%$). Detailed comparisons are provided in the experiment section.
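A small worked example illustrates the $\gamma^{\alpha}$ amplification. It assumes the matching score has the form of the classification probability multiplied by $\text{IoU}^{\alpha}$, which is consistent with the description above but is stated here as an assumption; `high_order_score` and the numbers are hypothetical.

```python
def high_order_score(cls_prob, iou, alpha):
    """Assumed form of the matching score: classification prob * IoU ** alpha."""
    return cls_prob * iou ** alpha


# Candidate A: modest confidence, well localized.
# Candidate B: high confidence, poorly localized.
a_prob, a_iou = 0.5, 0.9
b_prob, b_iou = 0.9, 0.6

gamma = a_iou / b_iou  # ratio of the better to the worse localization score

first_order = (high_order_score(a_prob, a_iou, 1), high_order_score(b_prob, b_iou, 1))
fourth_order = (high_order_score(a_prob, a_iou, 4), high_order_score(b_prob, b_iou, 4))
localization_advantage = gamma ** 4
```

With $\alpha=1$ the confident but poorly localized candidate B scores higher, while with $\alpha=4$ the localization ratio of 1.5 is amplified to about 5.06 and candidate A is preferred as the positive sample.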

3.4 Discussion

The concept of GIoU-aware classification loss has been explored in several prior works predating the era of DETR. These works aimed to address the discrepancy between classification scores and localization accuracy. In line with recent concurrent research, our method shares the same insight in designing rank-oriented matching cost and loss functions. However, our approach distinguishes itself by emphasizing the ranking aspect and introducing an additional rank-oriented architecture design, which includes the rank-adaptive classification head and query rank layer. Furthermore, our empirical results demonstrate the complementarity between the rank-oriented architecture design and the rank-oriented matching cost and loss design.

Experiment

We perform object detection experiments on the COCO object detection benchmark with the detrex toolbox. Our model is trained on the train set and evaluated on the val set. We adhere to the same experimental setup as the original papers for H-DETR and DINO-DETR.

4.2 Main Results

Comparison with competing methods. Table 1 compares Rank-DETR with other high-performing DETR-based methods on the COCO object detection val dataset. The evaluation demonstrates that Rank-DETR achieves remarkable results, attaining an AP score of 50.2% with only 12 training epochs. This performance surpasses H-DETR by +1.5% and outperforms the recent state-of-the-art method DINO-DETR by +1.2% AP. Notably, we observe significant improvements in AP75, highlighting the advantage of our approach at higher IoU thresholds.

Improving H-DETR. Table 3 presents a detailed comparison of our proposed approach with the highly competitive H-DETR. The experimental evaluation demonstrates consistent enhancements in object detection performance across various backbone networks and training schedules. For example, under the 12-epoch training schedule, our method achieves superior AP scores of 50.2%, 52.7%, and 57.3% with ResNet-50, Swin-Tiny, and Swin-Large backbone networks, respectively. These results surpass the baseline methods by +1.5%, +2.1%, and +1.4%, respectively. Extending the training schedule to 36 epochs consistently improves the AP scores, resulting in +1.2% for ResNet-50, +1.5% for Swin-Tiny, and +1.1% for Swin-Large. The improvement is more significant under high IoU thresholds: our method outperforms the baseline by +2.1%, +2.7%, and +1.9% in AP75 with ResNet-50, Swin-Tiny, and Swin-Large, respectively. These findings validate our proposed mechanism’s consistent and substantial performance improvements across diverse settings and with different backbone networks, especially under high IoU thresholds. We also validate our performance gain by providing the PR curves under different IoU thresholds in Figure 5(a).

Improving DINO-DETR. Table 3 shows the results of applying our approach to improve DINO-DETR. Notably, our method demonstrates an increase of +1.4% with the ResNet-50 backbone and +0.8% with the Swin-Large backbone. Under the higher IoU setting, our method further obtains a +1.8% AP75 improvement with ResNet-50 and +1.1% with Swin-Large. These results provide evidence for the generalization ability of our approach across different DETR-based models.

4.3 Ablation Study and Analysis

We conduct a systematic analysis to assess each proposed component’s influence within our method. We followed a step-by-step approach, progressively adding modules on top of the baseline (LABEL:tab:ablate:grad_add), incorporating each module into the baseline (LABEL:tab:ablate:add_each), and subsequently removing each module from our method (LABEL:tab:ablate:remove_each). This procedure allowed us to understand the contribution of each individual component to the final performance. Furthermore, we conducted statistical and qualitative analyses to comprehensively assess the functionality of each component. We mark the best-performing numbers with ◼ colored regions in each table along each column.

We choose H-DETR with a ResNet-50 backbone as our baseline method. By progressively adding the proposed mechanisms on top of the baseline model, we observe that the performance steadily increases, and the best performance is achieved by using all the proposed components (LABEL:tab:ablate:grad_add). We also observe that the lowest false negative rate (oLRP$_{\text{FN}}$) is achieved when only using the rank-oriented architecture designs.

Rank-adaptive Classification Head (RCH).

LABEL:tab:ablate:add_each and LABEL:tab:ablate:remove_each demonstrate that the rank-adaptive classification head can slightly improve the AP (+0.2% when adding RCH to the H-DETR baseline, comparing row1 and row2 in LABEL:tab:ablate:add_each; +0.4% when adding RCH to complete our method, comparing row1 and row2 in LABEL:tab:ablate:remove_each). Furthermore, RCH improves AP75 more than AP.

Query Ranking Layer (QRL).

The proposed QRL mechanism effectively integrates ranking information into the DETR architecture, compensating for the absence of sequential handling of queries in attention layers. The detector’s performance is also consistently improved by utilizing QRL (+0.3% when adding QRL to the H-DETR baseline; +0.7% to complete our method). We further compute the cumulative probability distribution of classification scores of positive and negative queries. QRL yields enhanced classification confidence for matched positive queries (Figure 5(b)), while the classification confidence for unmatched queries is effectively suppressed (Figure 5(c)), thereby ranking true predictions higher than potential false predictions. This phenomenon is further apparent from the $\operatorname{oLRP}_{\text{FP}}$ results in LABEL:tab:ablate:add_each, where QRL reduces the value from 24.5% to 23.8%, i.e., it lowers the false positive rate. These results are in accordance with our design intent.

GIoU-aware Classification Loss (GCL).

Table 4 also shows the effectiveness of the proposed GIoU-aware classification loss, which gains 0.7% mAP over the vanilla baseline (comparing row1 and row4 in LABEL:tab:ablate:add_each) and yields a 0.7% mAP increase when comparing row1 and row4 in LABEL:tab:ablate:remove_each. We also ablate the formulation of the learning target $t$ in Eq. 6 in LABEL:{tab:ablate:1}. The results show that adopting IoU (and its exponent) as the optimization target is inferior to using $(\text{GIoU}+1)/2$, because GIoU can better model the relationship between two non-overlapping boxes. We use $(\text{GIoU}+1)/2$ rather than GIoU because $-1<\text{GIoU}\leq 1$ and $0<(\text{GIoU}+1)/2\leq 1$.
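The difference between IoU and the normalized GIoU target is easy to see on non-overlapping boxes. The sketch below uses the standard GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union); the helper names and the toy boxes are hypothetical, and boxes are given as corner coordinates `(x1, y1, x2, y2)`.

```python
def area(box):
    """Area of (x1, y1, x2, y2); degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_and_giou(a, b):
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest axis-aligned box enclosing both a and b.
    hull = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    giou = iou - (hull - union) / hull
    return iou, giou


def normalized_giou(a, b):
    """Classification target t = (GIoU + 1) / 2, which lies in (0, 1]."""
    return (iou_and_giou(a, b)[1] + 1.0) / 2.0


gt = (0.0, 0.0, 2.0, 2.0)
near = (3.0, 0.0, 5.0, 2.0)   # does not overlap gt, but is close
far = (8.0, 0.0, 10.0, 2.0)   # does not overlap gt, and is far away

iou_near, giou_near = iou_and_giou(gt, near)
iou_far, giou_far = iou_and_giou(gt, far)
t_near, t_far = normalized_giou(gt, near), normalized_giou(gt, far)
```

Both predictions have an IoU of zero, so an IoU target cannot tell them apart, whereas the normalized GIoU target is larger for the nearby box (0.4) than for the distant one (0.2).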

High-order Matching Cost (HMC).

We also show how the high-order matching cost affects the overall performance in Table 4. The HMC can also significantly improve the overall performance of the object detector (+0.6% when comparing row1 and row5 in LABEL:tab:ablate:add_each, +0.4% when comparing row1 and row5 in LABEL:tab:ablate:remove_each). We further ablate the formulation of the matching cost. As illustrated in LABEL:tab:ablate:2, a high-order exponent on IoU consistently improves the performance, which peaks when the power is 4. Using a high-order exponent can suppress the importance of the predicted boxes with low IoU. We can also observe the same trend from LABEL:tab:ablate:2 when replacing IoU with $(\text{GIoU}+1)/2$, but the latter practice has a slightly inferior performance.

HMC Suppresses the Overlap between Negative Query and Ground Truth. Figure 6 illustrates the cumulative probability distribution of the IoU of unmatched queries. The IoU of each unmatched query is defined as the largest IoU between it and all ground truth boxes. As shown in Figure 6, the adoption of HMC can decrease the IoU between unmatched queries and all the ground truth bounding boxes, effectively pushing the negative queries away from the ground truth boxes. Furthermore, this phenomenon becomes increasingly pronounced in the later Transformer decoder layers.
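The statistic behind this analysis can be sketched as follows: for each unmatched query, take the largest IoU against any ground-truth box, then read off the empirical cumulative distribution. The helper names and toy boxes are hypothetical, and boxes are corner coordinates `(x1, y1, x2, y2)`.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def max_iou_with_gt(query_box, gt_boxes):
    """IoU of an unmatched query: its largest IoU over all ground truths."""
    return max(iou(query_box, gt) for gt in gt_boxes)


def empirical_cdf(values, threshold):
    """Fraction of values that are <= threshold."""
    return sum(v <= threshold for v in values) / len(values)


gt_boxes = [(0.0, 0.0, 2.0, 2.0), (4.0, 4.0, 6.0, 6.0)]
unmatched = [(1.0, 0.0, 3.0, 2.0), (10.0, 10.0, 12.0, 12.0), (4.0, 4.0, 6.0, 5.0)]

ious = [max_iou_with_gt(q, gt_boxes) for q in unmatched]
fraction_below = empirical_cdf(ious, 0.4)
```

A cumulative curve that rises faster at low IoU values means that more unmatched queries sit far from every ground-truth box, which is the trend attributed to HMC above.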

Comparison with Varifocal loss.

In order to assess the effectiveness of the proposed GIoU-aware classification loss (GCL, Eq. 6), we compare it with varifocal loss (VFL) due to their similar mathematical formulations. Following VFL, we utilize the training target $t=\text{IoU}$ in Eq. 7. To simplify the comparison and focus on the impact of GCL, we conduct the evaluation without HMC (row3 in LABEL:tab:ablate:grad_add). Our method, leveraging the GCL, achieves a mAP of 49.8% (row4 in LABEL:tab:ablate:grad_add), whereas VFL achieves only 49.5% mAP. The primary distinction between GCL and VFL lies in the optimization target. By employing the normalized GIoU as the training target, our approach better models the distance between two non-overlapping boxes, leading to improved performance. In addition, VFL removes scaling factors on positive samples, as they are rare compared to negatives in CNN-based detectors. However, for DETR-based detectors, where positive examples are relatively more abundant, we empirically show that retaining a scaling factor can enhance performance.

Computational Efficiency.

In Table 6, we provide comprehensive data on the number of parameters, computational complexity measured in FLOPs, training time per epoch, inference frames per second (FPS), and Average Precision performance, for both the H-DETR baseline and our approach. These assessments were conducted on RTX 3090 GPUs, allowing us to evaluate testing and training efficiency. The results highlight a substantial enhancement in detection performance achieved by our proposed method, with only a slight increase in FLOPs and inference latency. Considering the effectiveness and efficiency, our method has the potential to be adapted to 3D object detection, semantic segmentation tasks, or other applications.

Qualitative Analysis.

Figure 7 visualizes the predicted bounding boxes and their classification confidence scores of both the matched positive queries and the unmatched hard negative queries, respectively. We find that, compared to the baseline method (row1), the proposed approach (row2) effectively promotes the classification scores of positive samples, while that of hard negative queries is rapidly suppressed, progressing layer by layer. These qualitative results further illustrate how the proposed approach achieves high performance by decreasing the false positive rate.

Conclusion

This paper presents a series of simple yet effective rank-oriented designs that boost the performance of modern object detectors, resulting in a high-quality object detector named Rank-DETR. The core insight behind the effectiveness lies in establishing a more precise ranking order of predictions, thereby ensuring robust performance under high IoU thresholds. By incorporating accurate ranking order information into the network architecture and optimization procedure, our approach demonstrates improved performance under high IoU thresholds within the DETR framework. While there remains ample scope to explore rank-oriented designs further, we hope that our initial work serves as an inspiration for future efforts in building high-quality object detectors.

Acknowledgement

This work is supported by National Key R&D Program of China under Grant 2022ZD0114900 and 2018AAA0100300, National Nature Science Foundation of China under Grant 62071013 and 61671027. We also appreciate Ding Jia, Yichao Shen, Haodi He and Yutong Lin for their insightful discussions, as well as the generous donation of computing resources by High-Flyer AI.

References