Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

Introduction

Modern object detectors in general consist of a backbone feature extractor that is agnostic to the detection task and a set of necks and heads that incorporate detection-specific prior knowledge. Common components in the necks/heads may include Region-of-Interest (RoI) operations, Region Proposal Networks (RPN) or anchors, Feature Pyramid Networks (FPN), etc. If the design of the task-specific necks/heads is decoupled from the design of the backbone, they may evolve in parallel. Empirically, object detection research has benefited from the largely independent exploration of general-purpose backbones and detection-specific modules. For a long while, these backbones have been multi-scale, hierarchical architectures due to the de facto design of convolutional networks (ConvNets), which has heavily influenced the neck/head design for detecting objects at multiple scales (e.g., FPN).

Over the past year, Vision Transformers (ViT) have been established as a powerful backbone for visual recognition. Unlike typical ConvNets, the original ViT is a plain, non-hierarchical architecture that maintains a single-scale feature map throughout. Its “minimalist” pursuit is met by challenges when applied to object detection. For example, how can we address multi-scale objects in a downstream task with a plain backbone from upstream pre-training? Is a plain ViT too inefficient to use with high-resolution detection images? One solution, which abandons this pursuit, is to re-introduce hierarchical designs into the backbone. This solution, e.g., Swin Transformers and related works, can inherit the ConvNet-based detector design and has shown successful results.

In this work, we pursue a different direction: we explore object detectors that use only plain, non-hierarchical backbones. (In this paper, “backbone” refers to architectural components that can be inherited from pre-training and “plain” refers to the non-hierarchical, single-scale property.) If this direction is successful, it will enable the use of original ViT backbones for object detection; this will decouple the pre-training design from the fine-tuning demands, maintaining the independence of upstream vs. downstream tasks, as has been the case for ConvNet-based research. This direction also in part follows the ViT philosophy of “fewer inductive biases” in the pursuit of universal features. As the non-local self-attention computation can learn translation-equivariant features, it may also learn scale-equivariant features from certain forms of supervised or self-supervised pre-training.

In our study, we do not aim to develop new components; instead, we make minimal adaptations that are sufficient to overcome the aforementioned challenges. In particular, our detector builds a simple feature pyramid from only the last feature map of a plain ViT backbone (Figure 1). This abandons the FPN design and waives the requirement of a hierarchical backbone. To efficiently extract features from high-resolution images, our detector uses simple non-overlapping window attention (without “shifting”, unlike Swin). A small number of cross-window blocks (e.g., 4), which could be global attention or convolutions, are used to propagate information. These adaptations are made only during fine-tuning and do not alter pre-training.

Our simple design turns out to achieve surprising results. We find that the FPN design is not necessary in the case of a plain ViT backbone and its benefit can be effectively gained by a simple pyramid built from a large-stride (16), single-scale map. We also find that window attention is sufficient as long as information is well propagated across windows in a small number of layers.

More surprisingly, under some circumstances, our plain-backbone detector, named ViTDet, can compete with the leading hierarchical-backbone detectors (e.g., Swin, MViT). With Masked Autoencoder (MAE) pre-training, our plain-backbone detector can outperform the hierarchical counterparts that are pre-trained on ImageNet-1K/21K with supervision (Figure 3). The gains are more prominent for larger model sizes. The competitiveness of our detector is observed under different object detector frameworks, including Mask R-CNN, Cascade Mask R-CNN, and their enhancements. We report 61.3 AP$^{\text{box}}$ on the COCO dataset with a plain ViT-Huge backbone, using only ImageNet-1K pre-training with no labels. We also demonstrate competitive results on the long-tailed LVIS detection dataset. While these strong results may be in part due to the effectiveness of MAE pre-training, our study demonstrates that plain-backbone detectors can be promising, challenging the entrenched position of hierarchical backbones for object detection.

Beyond these results, our methodology maintains the philosophy of decoupling the detector-specific designs from the task-agnostic backbone. This philosophy is in contrast to the trend of redesigning Transformer backbones to support multi-scale hierarchies. In our case, the detection-specific prior knowledge is introduced only during fine-tuning, without needing to tailor the backbone design a priori in pre-training. This makes our detector compatible with ViT developments along various directions that are not necessarily limited by the hierarchical constraint, e.g., block designs, self-supervised learning, and scaling. We hope our study will inspire future research on plain-backbone object detection. (This work is an extension of a preliminary version that was unpublished and not submitted for peer review.)

Related Work

Pioneered by the work of R-CNN, object detection and many other vision tasks adopt a pre-training + fine-tuning paradigm: a general-purpose, task-agnostic backbone is pre-trained with supervised or self-supervised training, and its structure is later modified and adapted to the downstream tasks. The dominant backbones in computer vision have been ConvNets of various forms.

Earlier neural network detectors were based on a single-scale feature map when originally presented. While they use ConvNet backbones that are by default hierarchical, in principle, they are applicable on any plain backbone. SSD is among the first works that leverage the hierarchical nature of the ConvNet backbones (e.g., the last two stages of a VGG net). FPN pushes this direction further by using all stages of a hierarchical backbone, connected by lateral and top-down connections. The FPN design is widely used in object detection methods. More recently, works including Trident Networks and YOLOF have revisited single-scale feature maps, but unlike our work they focus on a single-scale map taken from a hierarchical backbone.

ViT is a powerful alternative to standard ConvNets for image classification. The original ViT is a plain, non-hierarchical architecture. Various hierarchical Transformers have been presented, e.g., Swin, MViT, PVT, and PiT. These methods inherit some designs from ConvNets, including the hierarchical structure and the translation-equivariant priors (e.g., convolutions, pooling, sliding windows). As a result, it is relatively straightforward to replace a ConvNet with these backbones for object detection.

Plain-backbone detectors.

The success of ViT has inspired people to push the frontier of plain backbones for object detection. Most recently, UViT is presented as a single-scale Transformer for object detection. UViT studies the network width, depth, and input resolution of plain ViT backbones under object detection metrics. A progressive window attention strategy is proposed to address the high-resolution inputs. Unlike UViT that modifies the architecture during pre-training, our study focuses on the original ViT architecture without a priori specification for detection. By maintaining the task-agnostic nature of the backbone, our approach supports a wide range of available ViT backbones as well as their improvements in the future. Our method decouples the backbone design from the detection task, which is a key motivation of pursuing plain backbones.

UViT uses single-scale feature maps for the detector heads, while our method builds a simple pyramid on the single-scale backbone. In the context of our study, it is an unnecessary constraint for the entire detector to be single-scale. Note the full UViT detector has several forms of multi-scale priors too (e.g., RPN and RoIAlign) as it is based on Cascade Mask R-CNN. In our study, we focus on leveraging pre-trained plain backbones and we do not constrain the detector neck/head design.

Object detection methodologies.

Object detection is a flourishing research area that has embraced methodologies of distinct properties, e.g., two-stage vs. one-stage, anchor-based vs. anchor-free, and region-based vs. query-based (DETR). Research on different methodologies has been continuously advancing the understanding of the object detection problem. Our study suggests that the topic of “plain vs. hierarchical” backbones is worth exploring and may bring in new insights.

Method

Our goal is to remove the hierarchical constraint on the backbone and to enable explorations of plain-backbone object detection. To this end, we aim for minimal modifications to adapt a plain backbone to the object detection task only during fine-tuning time. After these adaptations, in principle one can apply any detector heads, for which we opt to use Mask R-CNN and its extensions. We do not aim to develop new components; instead, we focus on what new insights can be drawn in our exploration.

FPN is a common solution of building an in-network pyramid for object detection. If the backbone is hierarchical, the motivation of FPN is to combine the higher-resolution features from earlier stages and the stronger features from later stages. This is realized in FPN by top-down and lateral connections (Figure 1 left).

If the backbone is non-hierarchical, the foundation of the FPN motivation is lost, as all the feature maps in the backbone are of the same resolution. In our scenario, we simply use only the last feature map from the backbone, which should have the strongest features. On this map, we apply a set of convolutions or deconvolutions in parallel to produce multi-scale feature maps. Specifically, with the default ViT feature map of a scale of $\frac{1}{16}$ (stride = 16), we produce feature maps of scales $\{\frac{1}{32},\frac{1}{16},\frac{1}{8},\frac{1}{4}\}$ using convolutions of strides $\{2,1,\frac{1}{2},\frac{1}{4}\}$, where a fractional stride indicates a deconvolution. We refer to this as a “simple feature pyramid” (Figure 1 right).
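The stride arithmetic above can be checked with a short sketch. It assumes a square 1024-pixel input and a patch size of 16, as in the implementation details of this paper; the function name `simple_pyramid_shapes` is hypothetical and only illustrates the resulting map sizes, not the layers themselves.

```python
from fractions import Fraction

def simple_pyramid_shapes(image_size=1024, patch_size=16,
                          conv_strides=(Fraction(2), Fraction(1),
                                        Fraction(1, 2), Fraction(1, 4))):
    """Return {scale: side length} for each level built from the last ViT map."""
    base = image_size // patch_size            # side of the stride-16 map
    shapes = {}
    for stride in conv_strides:
        # A conv stride of 2 halves the map; a fractional stride (deconv) enlarges it.
        scale = Fraction(1, patch_size) / stride
        side = Fraction(base) / stride
        assert side.denominator == 1, "map side must be an integer"
        shapes[scale] = int(side)
    return shapes

shapes = simple_pyramid_shapes()
for scale, side in shapes.items():
    print(f"scale {scale}: {side} x {side}")
```

Under these assumptions the four levels have sides 32, 64, 128, and 256, all derived from the single 64 by 64 backbone output.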

The strategy of building multi-scale feature maps from a single map is related to that of SSD. However, our scenario involves upsampling from a deep, low-resolution feature map, unlike SSD, which taps into shallower feature maps. In hierarchical backbones, upsampling is often aided by lateral connections; in plain ViT backbones, we empirically find this is not necessary (Sec. 4) and simple deconvolutions are sufficient. We hypothesize that this is because ViT can rely on positional embedding for encoding locations and also because the high-dimensional ViT patch embeddings do not necessarily discard information. (With a patch size of 16$\times$16 and 3 colors, a hidden dimension $\geq$768 (ViT-B and larger) can preserve all information of a patch if necessary.)
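The counting argument behind the last remark can be written out explicitly. Here $P$ denotes the patch size, $C$ the number of color channels, and $d$ the hidden dimension; these symbols are introduced only for this illustration.

```latex
\[
  P \times P \times C \;=\; 16 \times 16 \times 3 \;=\; 768 \;\leq\; d .
\]
```

A patch therefore contains 768 raw values, so an embedding of dimension 768 or more has enough coordinates to represent the patch without loss, although nothing forces the learned embedding to do so.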

We will compare with two FPN variants that are also built on a plain backbone (Figure 2). In the first variant, the backbone is artificially divided into multiple stages to mimic the stages of a hierarchical backbone, with lateral and top-down connections applied (Figure 2 (a)). The second variant is like the first one, but uses only the last map instead of the divided stages (Figure 2 (b)). We show that these FPN variants are not necessary (Sec. 4). (From a broader perspective, the spirit of FPN is “to build a feature pyramid inside a network”. Our simple feature pyramid follows this spirit. In the context of this paper, the term “FPN” refers to the specific architecture design of the original FPN paper.)

Backbone adaptation.

Object detectors benefit from high-resolution input images, but computing global self-attention throughout the backbone is prohibitive in memory and is slow. In this study, we focus on the scenario where the pre-trained backbone performs global self-attention, which is then adapted to higher-resolution inputs during fine-tuning. This is in contrast to the recent methods that modify the attention computation directly with backbone pre-training. Our scenario enables us to use the original ViT backbone for detection, without redesigning pre-training architectures.

We explore using window attention with a few cross-window blocks. During fine-tuning, given a high-resolution feature map, we divide it into regular non-overlapping windows. (We set the window size as the pre-training feature map size by default, 14$\times$14.) Self-attention is computed within each window. This is referred to as “restricted” self-attention in the original Transformer.
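To give a rough sense of why window attention helps at detection resolution, the sketch below counts query-key pairs for one attention layer on a 64 by 64 feature map (a 1024-pixel input at stride 16). The padding of the map up to a multiple of the window size is an assumption made here for illustration; the text does not specify how a map that is not divisible by 14 is handled. The function name `attention_pairs` is hypothetical.

```python
import math

def attention_pairs(feat_side, window=None):
    """Count query-key pairs for one self-attention layer on a square map."""
    if window is None:                          # global self-attention
        tokens = feat_side * feat_side
        return tokens * tokens
    # Assumption: pad the map up to a multiple of the window size.
    windows_per_side = math.ceil(feat_side / window)
    num_windows = windows_per_side ** 2
    tokens_per_window = window * window
    # Attention is computed independently inside each window.
    return num_windows * tokens_per_window ** 2

feat_side = 1024 // 16
global_pairs = attention_pairs(feat_side)
window_pairs = attention_pairs(feat_side, window=14)
print(global_pairs, window_pairs, round(global_pairs / window_pairs, 1))
```

With these numbers, global attention touches about 16.8 million pairs per layer while 14 by 14 windows touch under one million. This is only a count of attention pairs, not a measurement of memory or time.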

Unlike Swin, we do not “shift” the windows across layers. To allow information propagation, we use very few (by default, 4) blocks that can go across windows. We evenly split a pre-trained backbone into 4 subsets of blocks (e.g., 6 in each subset for the 24-block ViT-L). We apply a propagation strategy in the last block of each subset. We study these two strategies:

(i) Global propagation. We perform global self-attention in the last block of each subset. As the number of global blocks is small, the memory and computation cost is feasible. This is similar to the hybrid window attention that was used jointly with FPN in prior work.

(ii) Convolutional propagation. As an alternative, we add an extra convolutional block after each subset. A convolutional block is a residual block that consists of one or more convolutions and an identity shortcut. The last layer in this block is initialized as zero, such that the initial status of the block is an identity. Initializing a block as identity allows us to insert it into any place in a pre-trained backbone without breaking the initial status of the backbone.
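The identity-at-initialization property of the convolutional block can be illustrated with a toy 1-D example in plain Python. This is a minimal sketch, not the block used in the paper: the 1-D convolution, the ReLU, and the specific kernel values are all assumptions chosen to keep the example small.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def residual_block(x, first_kernel, last_kernel):
    """y = x + conv(relu(conv(x))): a toy stand-in for a residual conv block."""
    hidden = [max(0.0, v) for v in conv1d_same(x, first_kernel)]
    residual = conv1d_same(hidden, last_kernel)
    return [xi + ri for xi, ri in zip(x, residual)]

x = [0.5, -1.0, 2.0, 3.5]
first_kernel = [0.2, -0.4, 0.7]    # arbitrary freshly initialized weights
zero_last = [0.0, 0.0, 0.0]        # last layer initialized as zero

y = residual_block(x, first_kernel, zero_last)
print(y == x)   # the block starts out as an identity mapping
```

Because the residual branch outputs zeros at initialization, inserting the block leaves the pre-trained features unchanged until fine-tuning updates the last layer.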

Our backbone adaptation is simple and makes detection fine-tuning compatible with global self-attention pre-training. As stated, it is not necessary to redesign the pre-training architectures.
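The placement rule described above (split the backbone evenly into 4 subsets and propagate in the last block of each) can be sketched as follows. The function names are hypothetical, block indices are 0-based, and the depth of 12 used for ViT-B in the usage line is the standard ViT-B depth rather than a number stated in this section.

```python
def propagation_block_indices(depth, num_subsets=4):
    """0-based indices of the last block in each evenly sized subset."""
    if depth % num_subsets != 0:
        raise ValueError("depth must be divisible by the number of subsets")
    subset = depth // num_subsets
    return [subset * (i + 1) - 1 for i in range(num_subsets)]

def attention_plan(depth, num_subsets=4):
    """Label every block as using 'window' or 'global' attention."""
    global_blocks = set(propagation_block_indices(depth, num_subsets))
    return ["global" if i in global_blocks else "window" for i in range(depth)]

vit_l_blocks = propagation_block_indices(24)   # 24-block ViT-L
vit_b_blocks = propagation_block_indices(12)   # assumed 12-block ViT-B
print(vit_l_blocks, vit_b_blocks)
```

For global propagation these indices mark the blocks that switch to global attention; for convolutional propagation they would instead mark the positions after which an extra convolutional block is inserted.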

Discussion.

Object detectors contain components that can be task agnostic, such as the backbone, and other components that are task-specific, such as RoI heads. This model decomposition enables the task-agnostic components to be pre-trained using non-detection data (e.g., ImageNet), which may provide an advantage since detection training data is relatively scarce.

Under this perspective, it becomes reasonable to pursue a backbone that involves fewer inductive biases, since the backbone may be trained effectively using large-scale data and/or self-supervision. In contrast, the detection task-specific components have relatively little data available and may still benefit from additional inductive biases. While pursuing detection heads with fewer inductive biases is an active area of work, leading methods like DETR are challenging to train and still benefit from detection-specific prior knowledge.

Driven by these observations, our work follows the spirit of the original plain ViT paper with respect to the detector’s backbone. While the ViT paper’s discussion focused on reducing inductive biases on translation equivariance, in our case, it is about having fewer or even no inductive bias on scale equivariance in the backbone. We hypothesize that the way for a plain backbone to achieve scale equivariance is to learn the prior knowledge from data, analogous to how it learns translation equivariance and locality without convolutions.

Our goal is to demonstrate the feasibility of this approach. Thus we choose to implement our method with standard detection-specific components (i.e., Mask R-CNN and its extensions). Exploring even fewer inductive biases in the detection heads is an open and interesting direction for future work. We hope it can benefit from and build on our work here.

Implementation.

We use the vanilla ViT-B, ViT-L, and ViT-H as the pre-training backbones. We set the patch size as 16 and thus the feature map scale is 1/16, i.e., stride = 16. (Changing the stride affects the scale distribution and presents a different accuracy shift for objects of different scales. This topic is beyond the scope of this study. For simplicity, we use the same patch size of 16 for all of ViT-B, L, H; see the appendix.) Our detector heads follow Mask R-CNN or Cascade Mask R-CNN, with architectural details described in the appendix. The input image is 1024$\times$1024, augmented with large-scale jittering during training. Due to this heavy regularization, we fine-tune for up to 100 epochs on COCO. We use the AdamW optimizer and search for optimal hyper-parameters using a baseline version. More details are in the appendix.

Experiments

We perform ablation experiments on the COCO dataset. We train on the train2017 split and evaluate on the val2017 split. We report results on bounding-box object detection (AP$^{\text{box}}$) and instance segmentation (AP$^{\text{mask}}$).

By default, we use the simple feature pyramid and global propagation described in Sec. 3. We use 4 propagation blocks, evenly placed in the backbone. We initialize the backbone with MAE pre-trained on IN-1K without labels. We ablate these defaults and discuss our main observations as follows.

In Table 1 we compare the feature pyramid building strategies illustrated in Figure 2.

We study a baseline with no feature pyramid: both the RPN and RoI heads are applied on the backbone’s final, single-scale ($\frac{1}{16}$) feature map. This case is similar to the original Faster R-CNN before FPN was proposed. All feature pyramid variants (Table 1 a-c) are substantially better than this baseline, increasing AP by up to 3.4 points. We note that using a single-scale feature map does not mean the detector is single-scale: the RPN head has multi-scale anchors and the RoI heads operate on regions of multiple scales. Even so, feature pyramids are beneficial. This observation is consistent with the observation in the FPN paper on hierarchical backbones.

However, the FPN design is not needed and our simple feature pyramid is sufficient for a plain ViT backbone to enjoy the benefit of a pyramid. To ablate this design, we mimic the FPN architecture (i.e., the top-down and lateral connections) as in Figure 2 (a, b). Table 1 (a, b) shows that while both FPN variants achieve strong gains over the baseline with no pyramid (as has been widely observed with the original FPN on hierarchical backbones), they are no better than our simple feature pyramid. The original FPN was motivated by combining lower-resolution, stronger feature maps with higher-resolution, weaker feature maps. This foundation is lost when the backbone is plain and has no high-resolution maps, which can explain why our simple pyramid is sufficient.

Our ablation reveals that the set of pyramidal feature maps, rather than the top-down/lateral connections, is the key to effective multi-scale detection. To see this, we study an even more aggressive case of the simple pyramid: we generate only the finest scale ($\frac{1}{4}$) feature map by deconvolution and then from this finest map we subsample other scales in parallel by strided average pooling. There are no unshared, per-scale parameters in this design. This aggressively-simple pyramid is nearly as good: it has 54.5 AP (ViT-L), 3.3 higher than the no-pyramid baseline. This shows the importance of pyramidal feature maps. For any variant of these feature pyramids, the anchors (in RPN) and regions (in RoI heads) are mapped to the corresponding level in the pyramid based on their scales, as in FPN. We hypothesize that this explicit scale-equivariant mapping, rather than the top-down/lateral connection, is the main reason why a feature pyramid can greatly benefit multi-scale object detection.
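The scale-based assignment of regions to pyramid levels can be sketched with the heuristic from the FPN paper, $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$, where level $k$ has stride $2^k$. Whether the detectors in this study use exactly these constants is an assumption; the function name `fpn_level` and the clamping range (levels 2 to 5, i.e., scales $\frac{1}{4}$ to $\frac{1}{32}$) are chosen here to match the four-level pyramid described above.

```python
import math

def fpn_level(width, height, k0=4, canonical=224, k_min=2, k_max=5):
    """Map a region of size width x height to a pyramid level (stride 2**level)."""
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical))
    return max(k_min, min(k_max, k))            # clamp to the available levels

for side in (32, 112, 224, 512):
    level = fpn_level(side, side)
    print(f"{side}x{side} region -> level {level} (scale 1/{2 ** level})")
```

Small regions are read from the high-resolution maps and large regions from the coarse maps, so a region of any size is seen by the head at a roughly constant size in feature-map cells.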

Window attention is sufficient when aided by a few propagation blocks.

Table 2 ablates our backbone adaptation approach. In short, on top of a baseline that has purely window attention and none of the cross-window propagation blocks (Table 2, “none”), various ways of propagation can show decent gains. (Even our baseline with no propagation in the backbone is reasonably good, at 52.9 AP. This can be explained by the fact that the layers beyond the backbone, i.e., the simple feature pyramid, RPN, and RoI heads, also induce cross-window communication.)

In Table 2 (propagation strategies), we compare our global and convolutional propagation strategies vs. the no-propagation baseline. They have gains of 1.7 and 1.9 points over the baseline, respectively. We also compare with the “shifted window” (Swin) strategy, in which the window grid is shifted by a half-window size for every other block. The shifted window variant has a 1.1 gain over the baseline, but is worse than ours. Note that here we focus only on the “shifted window” aspect of Swin: the backbone is still a plain ViT, adapted to shifted window attention only during fine-tuning; it is not the Swin architecture, which we will compare to later.

Table 2 (convolutional block types) compares different types of residual blocks for convolutional propagation. We study the basic (two 3$\times$3), bottleneck (1$\times$1 $\rightarrow$ 3$\times$3 $\rightarrow$ 1$\times$1), and a naïve block that has one 3$\times$3 convolution. They all improve over the baseline, while the specific block design makes only marginal differences. Interestingly, even though convolution is a local operation, if its receptive field covers two adjacent windows, it is sufficient in principle to connect all pixels of the two windows. This connectivity is thanks to the self-attention in both windows in the succeeding blocks. This may explain why it can perform as well as global propagation.

In Table 2 (propagation block locations) we study where cross-window propagation should be located in the backbone. By default 4 global propagation blocks are placed evenly. We compare with placing them in the first or last 4 blocks instead. Interestingly, performing propagation in the last 4 blocks is nearly as good as even placement. This is in line with the prior observation that ViT has longer attention distance in later blocks and is more localized in earlier ones. In contrast, performing propagation only in the first 4 blocks shows no gain: in this case, there is no propagation across windows in the backbone after these 4 blocks. This again demonstrates that propagation across windows is helpful.

Table 2 (number of propagation blocks) compares the number of global propagation blocks to use. Even using just 2 blocks achieves good accuracy and clearly outperforms the baseline. For comprehensiveness, we also report a variant where all 24 blocks in ViT-L use global attention. This has a marginal gain of 0.5 points over our 4-block default, while its training requires special memory optimization (we use memory checkpointing). This requirement makes scaling to larger models (like ViT-H) impractical. Our solution of window attention plus a few propagation blocks offers a practical, high-performing tradeoff.

We benchmark this tradeoff in Table 3. Using 4 propagation blocks gives a good trade-off. Convolutional propagation is the most practical, increasing memory and time by merely $\leq$ 5%, at a small cost of 4% more parameters. Global propagation with 4 blocks is also feasible and does not increase the model size. Global self-attention in all 24 blocks is not practical.

In sum, Table 2 shows that various forms of propagation are helpful, while we can keep using window attention in most or all blocks. Importantly, all these architecture adaptations are performed only during fine-tuning time; they do not require a redesign of the pre-training architecture.

Masked Autoencoders provide strong pre-trained backbones.

Table 4 compares backbone pre-training strategies. Supervised pre-training on IN-1K is slightly worse than no pre-training, similar to an observation reported in prior work. Supervised pre-training on IN-21K is marginally better for ViT-L.

In contrast, MAE pre-training on IN-1K (without labels) shows massive gains, increasing AP$^{\text{box}}$ by 3.1 for ViT-B and 4.6 for ViT-L. We hypothesize that the vanilla ViT, with fewer inductive biases, may require higher capacity to learn translation- and scale-equivariant features, while higher-capacity models are prone to heavier overfitting. MAE pre-training can help to relieve this problem. We discuss more about MAE in context next.

4.2 Comparisons with Hierarchical Backbones

Modern detection systems involve many implementation details and subtleties. To focus on comparing backbones under as fair conditions as possible, we incorporate the Swin and MViTv2 backbones into our implementation.

We use the same implementation of Mask R-CNN and Cascade Mask R-CNN for all ViT, Swin, and MViTv2 backbones. We use FPN for the hierarchical backbones of Swin/MViTv2. We search for optimal hyper-parameters separately for each backbone (see the appendix). Our Swin results are better than their counterparts in the original paper (for example, Swin-B with IN-1K pre-training and Cascade Mask R-CNN has 51.9 AP$^{\text{box}}$ reported in the official repo, while this result in our implementation is 52.7); our MViTv2 results are better than or on par with those reported in the original paper.

Following the original papers, Swin and MViTv2 both use relative position biases. For a fairer comparison, here we also adopt relative position biases in our ViT backbones, but only during fine-tuning, not affecting pre-training. This addition improves AP by $\sim$1 point. Note that our ablations in Sec. 4.1 are without relative position biases.

Results and analysis.

Table 5 shows the comparisons. Figure 3 plots the tradeoffs. The comparisons here involve two factors: the backbone and the pre-training strategy. Our plain-backbone detector, combined with MAE pre-training, presents better scaling behavior. When the models are large, our method outperforms the hierarchical counterparts of Swin/MViTv2, including those using IN-21K supervised pre-training. Our result with ViT-H is 2.6 better than that with MViTv2-H. Moreover, the plain ViT has a better wall-clock performance (Figure 3 right, see ViT-H vs. MViTv2-H), as the simpler blocks are more hardware-friendly.

We are also curious about the influence of MAE on hierarchical backbones. This is largely beyond the scope of this paper, as it involves finding good training recipes for hierarchical backbones with MAE. To provide some insight, we implement a naïve extension of MAE with the MViTv2 backbone (see the appendix). We observe that MViTv2-L with this MAE pre-training on IN-1K is 1.3 better than that with IN-21K supervised pre-training (54.9 vs. 53.6 AP$^{\text{box}}$). As a comparison, this gap is 4 points for our plain-backbone detector (Table 4). This shows that the plain ViT backbone may benefit more from MAE pre-training than the hierarchical backbone, suggesting that the lack of inductive biases on scales could be compensated by the self-supervised training of MAE. While improving hierarchical backbones with MAE pre-training is an interesting future topic, our plain-backbone detector enables us to use the readily available ViT backbones from MAE to achieve strong results.

We also note that hierarchical backbones in general involve enhanced self-attention block designs. Examples include the shifted window attention in Swin and pooling attention in MViT v1/v2. These block designs, if applied to plain backbones, may also improve accuracy and parameter-efficiency. While this may put our competitors at an advantage, our method is still competitive without these enhancements.

4.3 Comparisons with Previous Systems

Next we provide system-level comparisons with the leading results reported in previous papers. We refer to our system as ViTDet, i.e., ViT Detector, aiming at the usage of a ViT backbone for detection. Since these comparisons are system-level, the methods use a variety of different techniques. While we make efforts to balance the comparisons (as noted below), making a perfectly controlled comparison is infeasible in general; our goal, instead, is to situate our method in the context of current leading methods.

Table 6 reports the system-level comparisons on COCO. For a fairer comparison, here we make two changes following our competitors: we adopt soft-NMS, as is used by all competitors in this table, and increase the input size (from 1024 to 1280) following prior work. We note that we do not use these improvements in previous ablations. As in the previous subsection (Sec. 4.2), we use relative position biases here.
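For readers unfamiliar with soft-NMS, the following is a minimal plain-Python sketch of its Gaussian variant: instead of discarding boxes that overlap a higher-scoring box, their scores are decayed by $\exp(-\text{IoU}^2/\sigma)$. The text does not state which variant or parameters were used, so the Gaussian form, `sigma=0.5`, and the score threshold are all assumptions for illustration.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian soft-NMS: decay overlapping scores instead of removing boxes."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Pick the highest-scoring box still in play.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            score *= math.exp(-iou(best_box, box) ** 2 / sigma)
            if score >= score_threshold:        # drop boxes whose score vanished
                decayed.append((box, score))
        remaining = decayed
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
for box, score in kept:
    print(box, round(score, 3))
```

In this toy input the second box overlaps the first heavily, so its score is reduced and it is ranked last, while the non-overlapping third box keeps its score unchanged.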

The leading systems thus far are all based on hierarchical backbones (Table 6). For the first time, we show that a plain-backbone detector can achieve highly accurate results on COCO and can compete with the leading systems.

We also compare with UViT, which is a recent plain-backbone detection method. As discussed in Sec. 2, UViT and our work have different focuses. UViT aims at designing a new plain backbone that is good for detection, while our goal here is to support general-purpose ViT backbones, including the original ones. Despite the different focuses, both UViT and our work suggest that plain-backbone detection is a promising direction with strong potential.

Comparisons on LVIS.

We further report system-level comparisons on the LVIS dataset. LVIS contains $\sim$2M high-quality instance segmentation annotations for 1203 classes that exhibit a natural, long-tailed object distribution. Unlike COCO, the class distribution is heavily imbalanced and many classes have very few (e.g., $<$10) training examples.

We follow the same model and training details as used for the COCO system-level comparison plus two common LVIS practices: we use the federated loss and sample images with repeat factor sampling. We fine-tune for 100 epochs on the v1 train split.
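Repeat factor sampling, as defined in the LVIS paper, oversamples images that contain rare categories: each category $c$ gets a factor $r_c = \max(1, \sqrt{t/f_c})$, where $f_c$ is the fraction of training images containing $c$ and $t$ is a threshold, and each image is repeated according to the largest factor among its categories. The sketch below is a toy illustration; the function name is hypothetical and the threshold is set unrealistically high so that the effect is visible on a four-image dataset (the value used in this study is not stated here).

```python
import math
from collections import Counter

def repeat_factors(image_categories, threshold):
    """Per-image repeat factors from per-category image frequencies."""
    num_images = len(image_categories)
    # Number of images in which each category appears at least once.
    image_count = Counter(c for cats in image_categories for c in set(cats))
    category_rep = {
        c: max(1.0, math.sqrt(threshold / (n / num_images)))
        for c, n in image_count.items()
    }
    # An image is repeated according to its rarest category.
    return [max((category_rep[c] for c in set(cats)), default=1.0)
            for cats in image_categories]

images = [["cat"], ["cat"], ["cat"], ["cat", "zebra"]]
factors = repeat_factors(images, threshold=1.0)   # toy threshold, for illustration
print(factors)
```

Here the only image containing the rare "zebra" category receives a repeat factor of 2, while images with only the frequent category keep a factor of 1.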

Table 7 shows the results on the v1 val split. Our plain-backbone detector achieves competitive performance vs. previous leading results that all use hierarchical backbones. Ours is 5.0 points higher than the 2021 competition winner’s “strong baseline” (48.1 vs. 43.1 AP$^{\text{mask}}$), which uses HTC with CBNetV2 that combines two Swin-L backbones. A special issue in LVIS is the long-tailed distribution, which is beyond the scope of our study. Techniques dedicated to this issue, e.g., using CLIP text embeddings or other recent advancements, can largely increase AP on the rare classes (AP$^{\text{mask}}_{\text{rare}}$) and thus improve overall AP. These are orthogonal to our method and could be complementary. Nevertheless, our results on LVIS again suggest that plain-backbone detectors can compete with hierarchical ones.

Conclusion

Our exploration has demonstrated that plain-backbone detection is a promising research direction. This methodology largely maintains the independence of the general-purpose backbones and the downstream task-specific designs, which had been the case for ConvNet-based research but not for Transformer-based research. We hope decoupling pre-training from fine-tuning is a methodology that will generally benefit the community. For example, in natural language processing (NLP), general-purpose pre-training (GPT, BERT) has greatly advanced the field and has been supporting various downstream tasks. In this study, our plain-backbone detector has benefited from the readily available pre-trained models from MAE. We hope this methodology will also help bring the fields of computer vision and NLP closer.

Appendix A

Table 8 is the ViT-B counterpart of Table 2 on backbone adaptation. The observations are similar to those for ViT-L: compared with the baseline using no propagation (“none”), various propagation strategies show good gains.

Table 9 presents Table 5 with additional details about FLOPs, parameters, and inference time, plotted in Figure 3.

Table 10 is the ablation on pre-training strategies for LVIS. Similar to Table 4, MAE pre-training has large gains over supervised pre-training.

Figure 4 is the LVIS counterpart of Figure 3. The trends are similar to those in COCO, while the gain of IN-21K supervised pre-training is larger because it significantly improves rare category detection in LVIS.

Figure 5 is the RetinaNet counterpart of Figure 3, showing the trade-off between accuracy and model size. Here, we evaluate ViTDet with a one-stage RetinaNet detector head and compare it to using Swin and MViTv2 as hierarchical backbones, all without hyper-parameter tuning. Compared to using Mask R-CNN and Cascade R-CNN (Table 5 and Figure 3), we observe similar trends with RetinaNet. In particular, our plain-backbone detector presents better scaling behavior (e.g., ViT-H gains +3.4 AP$^{\text{box}}$ over MViTv2-H). These results suggest that the proposed training recipe transfers well to different detectors and that our proposed plain backbone adaptations are general and can likely work with even more detection architectures.

A.2 Implementation Details

We build a simple feature pyramid of scales $\{\frac{1}{32},\frac{1}{16},\frac{1}{8},\frac{1}{4}\}$ (see Sec. 3). The $\frac{1}{32}$ scale is built by stride-2 2$\times$2 max pooling (average pooling or convolution works similarly). The $\frac{1}{16}$ scale simply uses the ViT’s final feature map. Scale $\frac{1}{8}$ (or $\frac{1}{4}$) is built by one (or two) 2$\times$2 deconvolution layer(s) with stride=2. In the $\frac{1}{4}$ scale case, the first deconvolution is followed by LayerNorm (LN) and GeLU. Then for each pyramid level, we apply a 1$\times$1 convolution with LN to reduce dimension to 256 and then a 3$\times$3 convolution also with LN, similar to the per-level processing of FPN.
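To make the resulting shapes concrete, the following sketch lists, for each pyramid level, the operations described above and the output shape they would produce for a 1024$\times$1024 input with patch size 16. It is only shape bookkeeping, not a network implementation; the function name `simple_pyramid_shapes` is hypothetical, and the sketch assumes that the 3$\times$3 convolution is padded so that it preserves spatial size.

```python
def simple_pyramid_shapes(input_size=1024, patch_size=16, out_dim=256):
    """Return (scale, ops applied to the final ViT map, output shape) per level.

    The scale labels assume patch_size=16, so the ViT map is at 1/16 scale.
    """
    base = input_size // patch_size  # side length of the 1/16 feature map
    levels = [
        ("1/32", ["maxpool 2x2, stride 2"], base // 2),
        ("1/16", ["identity"], base),
        ("1/8", ["deconv 2x2, stride 2"], base * 2),
        ("1/4", ["deconv 2x2, stride 2", "LN", "GeLU",
                 "deconv 2x2, stride 2"], base * 4),
    ]
    # Every level then gets a 1x1 conv + LN (to out_dim) and a 3x3 conv + LN.
    # Assumption: the 3x3 conv uses padding 1, so the side length is unchanged.
    tail = ["conv 1x1 + LN", "conv 3x3 + LN"]
    return [(name, ops + tail, (out_dim, side, side))
            for name, ops, side in levels]


for name, ops, shape in simple_pyramid_shapes():
    print(name, shape, " -> ".join(ops))
```

All four levels are derived directly from the single final feature map, which is what distinguishes this pyramid from an FPN built on a hierarchical backbone.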

We study three detection frameworks: Mask R-CNN, Cascade Mask R-CNN and RetinaNet. For Mask R-CNN and Cascade Mask R-CNN, we incorporate some common best practices developed since they were presented years ago. We use 2 hidden convolution layers for the RPN and 4 hidden convolution layers for the RoI heads as per . These hidden convolution layers are followed by LN. For all three detection frameworks, we use the same detection implementation for both plain and hierarchical backbones.

We use a patch size of 16 for all ViT backbones. As ViT-H in by default has a patch size of 14, after pre-training we interpolate the patch embedding filters from 14$\times$14$\times$3 to 16$\times$16$\times$3.
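The interpolation of the patch embedding filters can be illustrated with a small sketch. The text does not state which interpolation method is used, so the bilinear, corner-aligned resampling below is an assumption chosen for simplicity, and `resize_kernel` is a hypothetical helper. It resizes one 2D slice of a filter; a full 14$\times$14$\times$3 filter would be handled by applying it to each input channel of each output filter.

```python
def resize_kernel(kernel, out_size):
    """Bilinearly resize a square 2D kernel (list of lists) to out_size x out_size.

    Assumption: corner-aligned sampling, i.e. the first and last output
    positions coincide with the first and last input positions.
    """
    in_size = len(kernel)
    scale = (in_size - 1) / (out_size - 1)
    out = []
    for i in range(out_size):
        y = i * scale
        y0 = min(int(y), in_size - 2)  # top neighbour, clamped at the border
        wy = y - y0
        row = []
        for j in range(out_size):
            x = j * scale
            x0 = min(int(x), in_size - 2)  # left neighbour, clamped
            wx = x - x0
            top = kernel[y0][x0] * (1 - wx) + kernel[y0][x0 + 1] * wx
            bot = kernel[y0 + 1][x0] * (1 - wx) + kernel[y0 + 1][x0 + 1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out


# Toy 14x14 slice (a linear ramp) resized to 16x16.
kernel14 = [[float(x + 2 * y) for x in range(14)] for y in range(14)]
kernel16 = resize_kernel(kernel14, 16)
```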

Hyper-parameters for COCO.

Our default training recipe is as follows (unless noted in context for ablation). The input size is 1024$\times$1024, augmented during training by large-scale jitter with a scale range of $[0.1,2.0]$. We use AdamW ($\beta_{1},\beta_{2}{=}0.9,0.999$) with step-wise learning rate decay. We use linear learning rate warm-up for 250 iterations. The batch size is 64, distributed across 64 GPUs (1 image per GPU).
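The learning rate schedule in this recipe (linear warm-up for 250 iterations, then step-wise decay) can be sketched as below. Only the warm-up length comes from the text. The function name `learning_rate`, the base learning rate, the decay milestones, the decay factor `gamma`, and the exact starting point of the warm-up ramp are hypothetical placeholders.

```python
def learning_rate(it, base_lr, warmup_iters=250, milestones=(), gamma=0.1):
    """Linear warm-up followed by step-wise decay.

    it: current iteration (0-based).
    milestones: iterations at which the lr is multiplied by gamma
        (hypothetical values; the text does not list them).
    """
    # Step-wise decay: apply gamma once for every milestone already reached.
    lr = base_lr * gamma ** sum(it >= m for m in milestones)
    if it < warmup_iters:
        # Linear ramp that reaches the full lr at the last warm-up iteration.
        lr *= (it + 1) / warmup_iters
    return lr


# Hypothetical example: base lr 1e-4, decay at iterations 1000 and 2000.
schedule = [learning_rate(i, 1e-4, milestones=(1000, 2000))
            for i in (0, 249, 500, 1500, 2500)]
```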

We search for the learning rate (lr), weight decay (wd), drop path rate (dp), and epochs, for each model size (B, L, H) and for each model type (ViT, Swin, MViTv2). The hyper-parameters used are in Table 11. We also use a layer-wise lr decay of 0.7/0.8/0.9 for ViT-B/L/H with MAE pre-training, which has a small gain of up to 0.3 AP; we have not seen this gain for hierarchical backbones or ViT with supervised pre-training.
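Layer-wise lr decay assigns smaller learning rates to earlier layers. A minimal sketch follows, assuming the indexing convention common in ViT fine-tuning code: index 0 is the patch/position embedding, indices 1 to $N$ are the Transformer blocks, and non-backbone parameters keep a multiplier of 1. This convention and the helper name `layer_lr_scales` are assumptions rather than details given in the text. The block counts used (12/24/32 for ViT-B/L/H) are the standard depths of these models.

```python
def layer_lr_scales(num_blocks, decay):
    """Learning-rate multipliers for a ViT backbone under layer-wise decay.

    Index 0 is the embedding layer and indices 1..num_blocks are the
    Transformer blocks. Parameters outside the backbone are assumed to keep
    a multiplier of 1.0 and are not included in the returned list.
    """
    return [decay ** (num_blocks + 1 - i) for i in range(num_blocks + 1)]


# Decay rates from the text, paired with the standard ViT depths.
scales = {
    "ViT-B": layer_lr_scales(12, 0.7),
    "ViT-L": layer_lr_scales(24, 0.8),
    "ViT-H": layer_lr_scales(32, 0.9),
}
```

Under this convention the last block is scaled by the decay rate itself, and each earlier layer is scaled by one additional factor of the decay rate.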

Hyper-parameters for LVIS.

MAE for hierarchical backbones.

We implement a naïve extension of MAE pre-training for the hierarchical backbone ablation (Sec. 4.2). MAE enjoys the efficiency benefit from plain ViT by skipping the encoder mask token. Extending this strategy to hierarchical backbones is beyond the scope of this paper. Instead, we adopt a straightforward solution in which we do not skip the encoder mask token (similar to ), at the cost of slower training. We use normalized pixels as the MAE reconstruction target and set the decoder depth to 2.

Acknowledgement.

We would like to acknowledge Xinlei Chen, Saining Xie, Piotr Dollár, and Christoph Feichtenhofer for discussions and support.