HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs

Ting Yao, Yehao Li, Yingwei Pan, Tao Mei

Introduction

Inspired by the dominant Transformer structure in Natural Language Processing (NLP), the Computer Vision (CV) field has witnessed the rise of Vision Transformer (ViT) for designing vision backbones. This trend has been most visible for image/action recognition and dense prediction tasks like object detection. Much of this success can be attributed to the flexible modeling of long-range interactions among input visual tokens via the self-attention mechanism in the conventional Transformer block. Most recently, several concurrent studies point out that it is sub-optimal to directly employ a pure Transformer block over the visual token sequence, since such a design inevitably lacks the right inductive bias for 2D regional structure modeling. To alleviate this limitation, they lead a new wave of instilling the 2D inductive bias of Convolutional Neural Networks (CNN) into ViT, yielding CNN+ViT hybrid backbones.

A common practice in CNN backbone design is to enlarge network depth/width/input resolution, thereby enhancing model capacity by capturing more fine-grained patterns within inputs. In a similar spirit, our work delves into the process of scaling CNN+ViT hybrid backbones with high resolution inputs. Nevertheless, in analogy to the scaling of CNN backbones, simply enlarging the input resolution of prevalent ViT backbones introduces practical challenges, especially sharply increased computational cost. Taking a widely adopted ViT backbone, Swin Transformer, as an example, when directly enlarging the input resolution from 224 $\times$ 224 to 384 $\times$ 384, the top-1 accuracy on ImageNet-1K increases from 83.5% to 84.5%. However, as shown in Figure 1, the computational cost of Swin Transformer with 384 $\times$ 384 inputs (GFLOPs: 47.0, Inference time: 3.95 ms) is roughly three times that with 224 $\times$ 224 inputs (GFLOPs: 15.4, Inference time: 1.17 ms).

In view of this issue, our central question is: is there a principled way to scale up CNN+ViT hybrid backbones with high resolution inputs while maintaining comparable computational overhead? To this end, we devise a family of five-stage Vision Transformers tailored for high-resolution inputs, which contain two-branch building blocks in the earlier stages that seek a better balance between performance and computational cost. Specifically, the key component in the remoulded stem/CNN block is a combination of a high-resolution branch (with fewer convolution operations over high-resolution inputs) and a low-resolution branch (with more convolution operations over low-resolution inputs) in parallel. Such a two-branch design takes the place of a single branch with standard convolution operations in stem/CNN blocks. By doing so, we not only preserve the strengthened model capacity with high-resolution inputs, but also significantly reduce the computational cost through the light-weight design of each branch. As shown in Figure 1, by enlarging the input resolution from 224 $\times$ 224 to 384 $\times$ 384, a clear performance boost is attained for our HIRI-ViT, but the computational cost only slightly increases (GFLOPs: 8.2 to 9.3, Inference time: 0.84 ms to 1.04 ms). Even when enlarging the input resolution to 768 $\times$ 768, HIRI-ViT leads to significant performance improvements, while requiring less computational cost than Swin Transformer.
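To make the contrast concrete, the following minimal sketch computes the cost ratios implied by the Figure 1 numbers quoted above. The variable and function names are illustrative only; the figures themselves are taken from the text.

```python
# Cost figures quoted in the text (Figure 1), keyed by input resolution.
swin = {224: {"gflops": 15.4, "ms": 1.17}, 384: {"gflops": 47.0, "ms": 3.95}}
hiri = {224: {"gflops": 8.2, "ms": 0.84}, 384: {"gflops": 9.3, "ms": 1.04}}


def growth(costs, key, low=224, high=384):
    """Multiplicative increase of one cost metric when going from low to high resolution."""
    return costs[high][key] / costs[low][key]


# The number of input pixels grows by (384 / 224)^2 regardless of the backbone.
pixel_ratio = (384 ** 2) / (224 ** 2)

swin_flops_ratio = growth(swin, "gflops")
swin_time_ratio = growth(swin, "ms")
hiri_flops_ratio = growth(hiri, "gflops")
hiri_time_ratio = growth(hiri, "ms")

print(f"pixels x{pixel_ratio:.2f}")
print(f"Swin: GFLOPs x{swin_flops_ratio:.2f}, time x{swin_time_ratio:.2f}")
print(f"HIRI-ViT: GFLOPs x{hiri_flops_ratio:.2f}, time x{hiri_time_ratio:.2f}")
```

Under these numbers, the cost of Swin Transformer grows at least as fast as the pixel count, whereas the reported cost of HIRI-ViT grows far more slowly.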

By integrating this two-branch design into CNN+ViT hybrid backbone, we present a new principled five-stage vision backbone, namely HIRI-ViT, that efficiently scales up Vision Transformer with high-resolution inputs. In particular, we first upgrade the typical Conv-stem block by decomposing the single CNN branch into two parallel branches (i.e., high-resolution and low-resolution branches), leading to high resolution stem (HR-stem) blocks. Next, CNN blocks in earlier stages are remoulded by replacing CNN branch with the proposed two-branch design. Such new High Resolution block (HR-block) triggers a cost-efficient encoding of high-resolution inputs.

The main contribution of this work is the proposal of scaling up the CNN+ViT hybrid backbone with high resolution inputs while maintaining favorable computational cost. This also leads to the elegant design of decomposing the typical CNN operations over high resolution inputs into two parallel light-weight CNN branches. Through an extensive set of experiments over multiple vision tasks (e.g., image recognition, object detection and instance/semantic segmentation), we demonstrate the superiority of our new HIRI-ViT backbone in comparison to the state-of-the-art ViT and CNN backbones under comparable computational cost.

Related Work

Inspired by the breakthrough of AlexNet on the ImageNet-1K benchmark, Convolutional Neural Networks (CNN) have become the de-facto backbones in the computer vision field. Specifically, one of the pioneering works is VGG, which increases the network depth to enhance the model capability. ResNet trains deeper networks by introducing skip connections between convolutional blocks, leading to better generalization and impressive results. DenseNet further scales up the network to hundreds of layers by connecting each convolutional block to all the previous blocks. Besides going deeper, designing multi-branch blocks is another direction to enhance model capacity. InceptionNet integrates multiple paths with different kernels into a single convolutional block via the split-transform-merge strategy. ResNeXt demonstrates that increasing cardinality with a homogeneous, multi-branch architecture is an effective way of improving performance. Res2Net develops multiple receptive fields at a more granular level by constructing hierarchical residual-like connections. EfficientNet exploits neural architecture search to seek a better balance between network width, depth, and resolution. Recently, ConvNeXt modernizes ResNet by integrating it with Transformer designs, achieving competitive results against Vision Transformer while retaining the efficiency of CNN.

2 Vision Transformer

Inspired by Transformer in NLP field, Vision Transformer architectures start to dominate the construction of backbones in vision tasks recently. The debut of Vision Transformer splits the image into a sequence of patches (i.e., visual tokens) and then directly applies self-attention over the visual tokens. DeiT learns Vision Transformer in a data-efficient manner with the upgraded training strategies and distillation procedure. Since all layers of ViT/DeiT are designed under the same lower resolution, it might not be suitable to directly apply them for dense prediction tasks . To address this issue, PVT adopts a pyramid structure of ViT with four stages, whose resolutions progressively shrink from high to low. Swin integrates shifted windowing scheme into local self-attention, allowing cross-window connection with linear computational complexity. Twins interleaves locally-grouped attention and global sub-sampled attention to exploit both fine-grained and long-distance global information. DaViT further proposes dual attention mechanism with spatial window attention and channel group attention, aiming to enable local fine-grained and global interactions. Later on, CNN and ViT start to interact with each other, yielding numerous hybrid backbones. In particular, CvT and CeiT upgrade self-attention and feed-forward module with convolutions respectively. ViTAE introduces an extra convolution block in parallel to self-attention module, whose outputs are fused and fed into feed-forward module. iFormer couples max-pooling, convolution, and self-attention to learn both high- and low-frequency information. MaxViT performs both local and global spatial interactions by combining convolution, local self-attention and dilated global self-attention into a single block.

3 High-resolution Representation Learning

Significant advancements have been made in exploring high-resolution inputs in CNN backbone design. For example, HRNet maintains the high-resolution branch through the whole network, and fuses multi-resolution features repeatedly. EfficientHRNet further unifies EfficientNet and HRNet, and designs a downwards scaling method to scale down the input resolution, backbone network, and high-resolution feature network. Later on, Lite-HRNet enhances the efficiency of HRNet by applying shuffle blocks and a conditional channel weighting unit. Subsequently, several works start to build Transformer backbones with high-resolution inputs. Among them, HR-NAS introduces a multi-resolution search space including both CNN and Transformer blocks for multi-scale information and global context modeling. Recently, HRViT and HRFormer aim to construct Vision Transformers with multi-scale inputs by keeping all resolutions throughout the network and performing cross-resolution interaction. Nevertheless, the inputs of those ViT backbones are still limited to a small resolution (i.e., 224 $\times$ 224). Even though most hybrid backbones can be directly scaled up with higher resolution (e.g., 384 $\times$ 384), the computational cost becomes much heavier, since it scales quadratically w.r.t. the input resolution. Instead, our work paves a new way to scale up the CNN+ViT hybrid backbone with high resolution inputs while preserving computational overhead comparable to that at small resolution.

Preliminaries

A conventional multi-stage Vision Transformer (M-ViT) is commonly composed of a stem plus four stages as in ConvNets (see Figure 2 (a)). Specifically, the stem layer is first utilized to split the input image (resolution: 224 $\times$ 224) into patches. Each patch is regarded as a “visual token” and is further fed into the following stages. Each stage contains multiple Transformer blocks, and each Transformer block consists of a multi-head self-attention module ( $\bf{MHA}$ ) followed by a feed-forward network ( $\bf{FFN}$ ). Typically, a downsampling layer ( $\bf{DS}$ ) is inserted between every two stages to merge the input “visual tokens” (i.e., reduce the resolution of the feature map) and meanwhile enlarge their channel dimension. Finally, a classifier layer is adopted to predict the probability distribution based on the last feature map.
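For reference, multi-head self-attention in its standard form (assumed here from the original Transformer formulation) can be written as follows. The head count $h$, the per-head projections $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$ and the key dimension $d_{k}$ are illustrative symbols, not notation taken from this text.

```latex
\begin{aligned}
{\bf{MHA}}(Q, K, V) &= {\bf{Concat}}(head_{1}, \ldots, head_{h})\, W^{O}, \\
head_{i} &= {\bf{Softmax}}\left(\frac{Q W_{i}^{Q} \left(K W_{i}^{K}\right)^{\top}}{\sqrt{d_{k}}}\right) V W_{i}^{V},
\end{aligned}
```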

where $W^{O}$ is a weight matrix and ${\bf{Concat}}(\cdot)$ is the concatenation operation. Considering that the computational cost of self-attention scales quadratically w.r.t. the token number, spatial reduction is usually applied over keys/values to reduce the computational/memory overhead.

Feed-Forward Network. The original feed-forward network consists of two fully-connected ( $\bf{FC}$ ) layers coupled with a non-linear activation in between:
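Written out from this description (a reconstruction, with $X$ denoting the input token features as an assumed symbol):

```latex
{\bf{FFN}}(X) = {\bf{FC}}\left(\sigma\left({\bf{FC}}(X)\right)\right),
```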

where $\sigma$ denotes the non-linear activation. Inspired by , we upgrade $\bf{FFN}$ with an additional convolutional operation to impose 2D inductive bias, yielding Convolutional Feed-Forward Network ( $\bf{CFFN}$ ). The overall operations of this $\bf{CFFN}$ are summarized as
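A form consistent with this description is given below. The placement of the depth-wise convolution between the first $\bf{FC}$ layer and the activation is an assumption, since the text fixes only the set of operations and not their order; $X$ again denotes the input token features.

```latex
{\bf{CFFN}}(X) = {\bf{FC}}\left(\sigma\left({\bf{DWConv}}\left({\bf{FC}}(X)\right)\right)\right),
```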

where $\bf{DWConv}$ denotes the depth-wise convolution.

HIRI-ViT

In this paper, our goal is to design a principled Transformer structure (namely HIRI-ViT) that enables a cost-efficient scaling up of Vision Transformer with high resolution inputs. To do so, we upgrade the typical four-stage M-ViT into a new family of five-stage ViT that contains two-branch building blocks in the earlier stages, which decompose the single-branch CNN operations into two parallel CNN branches. This design leads to favorable computational overhead tailored for high resolution inputs. Figure 2 (b) illustrates the overall architecture of our HIRI-ViT.

The design of the stem layer in conventional Vision Transformers can be briefly grouped into two categories: ViT-stem and Conv-stem. As shown in Figure 3 (a), ViT-stem is implemented as a single strided convolution layer (e.g., stride = 4, kernel size = 7) which aims to divide the input image into patches. Recently, it has been shown that replacing the ViT-stem with several stacked 3 $\times$ 3 convolutions (i.e., Conv-stem shown in Figure 3 (b)) can stabilize the network optimization procedure and meanwhile improve peak performance. Conv-stem results in a slight increase of the parameters and GFLOPs for typical input resolution (e.g., 224 $\times$ 224). However, when the input resolution significantly increases (e.g., 448 $\times$ 448), the GFLOPs of Conv-stem become much heavier than those of ViT-stem. To alleviate this issue, we design a new High Resolution stem layer (HR-stem in Figure 3 (c)) by remoulding the single-branch Conv-stem as two parallel CNN branches. Such a design not only preserves the high model capacity of Conv-stem, but also incurs favorable computational cost under high resolution inputs.

Technically, HR-stem first utilizes a strided convolution (stride = 2, kernel size = 3) to downsample the input image as in Conv-stem. After that, the downsampled feature map is fed into two parallel branches (i.e., high- and low-resolution branches). The high-resolution branch contains a light-weight depth-wise convolution followed by a strided convolution. For the low-resolution branch, a strided convolution is first employed to downsample the feature map. Then two convolutions (3 $\times$ 3 and 1 $\times$ 1 convolutions) are applied to impose inductive bias. Finally, the output of HR-stem is obtained by summing the two branches, and the sum is further normalized via batch normalization ( $\bf{BN}$ ).
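The data flow above can be traced at the level of spatial sizes. The sketch below is only a shape check, not an implementation of HR-stem: the padding values, the kernel size of the depth-wise convolution, and the stride of 2 for the strided convolutions inside each branch are assumptions chosen so that the two branches can be summed. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def hr_stem_shapes(size):
    """Trace the spatial size through the two HR-stem branches (assumed strides/paddings)."""
    # Shared strided 3x3 convolution (stride = 2), as stated in the text.
    shared = conv_out(size, kernel=3, stride=2, padding=1)

    # High-resolution branch: depth-wise conv at full size, then a strided conv.
    high = conv_out(shared, kernel=3, stride=1, padding=1)  # depth-wise (kernel assumed 3)
    high = conv_out(high, kernel=3, stride=2, padding=1)    # strided conv (stride assumed 2)

    # Low-resolution branch: strided conv first, then 3x3 and 1x1 convolutions.
    low = conv_out(shared, kernel=3, stride=2, padding=1)   # strided conv (stride assumed 2)
    low = conv_out(low, kernel=3, stride=1, padding=1)      # 3x3 conv
    low = conv_out(low, kernel=1, stride=1, padding=0)      # 1x1 conv

    if high != low:
        raise ValueError("branches must match in size before they are summed")
    return {"shared": shared, "high_branch": high, "low_branch": low}


print(hr_stem_shapes(448))
print(hr_stem_shapes(224))
```

Under these assumptions the stem reduces the input by an overall factor of 4, and most of the convolutions in the low-resolution branch run on a map with a quarter of the positions seen by the depth-wise convolution in the high-resolution branch.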

2 High Resolution Block

In view of the fact that the input resolution of the first two stages in hybrid backbones is large, the computational costs are relatively higher for Transformer blocks. To address this limitation, we replace the Transformer blocks in the first two stages with our new High resolution block (HR block), enabling a cost-efficient encoding of high-resolution inputs in the earlier stages. Specifically, similar to HR-stem, HR block is composed of two branches in parallel. The light-weight high-resolution branch captures coarse-level information over high-resolution inputs, while the low-resolution branch utilizes more convolution operations to extract high-level semantics over low-resolution inputs. Figure 4 depicts the detailed architecture of HR block.

Concretely, the high-resolution branch is implemented as a light-weight depth-wise convolution. For low-resolution branch, a strided depth-wise convolution (stride = 2, kernel size = 3) with $\bf{BN}$ is first utilized to downsample the input feature map. Then, a feed-forward operation (i.e., two $\bf{FC}$ with an activation in between) is applied over the low-resolution feature map. After that, the low-resolution output is upsampled via repetition, which is further fused with the high-resolution output.
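Upsampling via repetition can be illustrated on a small feature map. This is a minimal sketch on nested Python lists for a single channel; the element-wise sum used for fusion is an assumption, as the text does not state the fusion operator, and the function names are hypothetical.

```python
def upsample_repeat(fmap, factor=2):
    """Upsample a 2D map by repeating every element factor x factor times."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse(high, low_up):
    """Element-wise sum of two equally sized maps (fusion operator assumed)."""
    return [[h + l for h, l in zip(h_row, l_row)] for h_row, l_row in zip(high, low_up)]


low = [[1, 2],
       [3, 4]]                      # low-resolution branch output (2 x 2)
high = [[10] * 4 for _ in range(4)]  # high-resolution branch output (4 x 4)

up = upsample_repeat(low)
fused = fuse(high, up)
for row in fused:
    print(row)
```

Repetition needs no learned parameters and no arithmetic, which keeps the cost of returning to the high resolution negligible.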

3 Inverted Residual Downsampling

In conventional M-ViT, the spatial downsampling is performed through a single strided convolution (e.g., stride = 2, kernel size = 3), as shown in Figure 5 (a). Inspired by ConvNets, we design a more powerful downsampling layer with two parallel branches, namely Inverted Residual Downsampling (IRDS). In particular, for the first two stages with high resolution inputs, we adopt IRDS-a (Figure 5 (b)) for downsampling. IRDS-a first uses a strided $3\times 3$ convolution to expand the channel dimension and reduce the spatial size, and then a $1\times 1$ convolution is utilized to shrink the channel dimension. For the last two downsampling layers, we leverage IRDS-b (Figure 5 (c)), which is similar to the inverted residual block. The difference lies in that we only apply normalization and activation operations after the first convolution. Note that we add extra downsampling shortcuts to stabilize the training procedure.

4 Normalization of Block

ConvNets usually use $\bf{BN}$ to stabilize the training process. $\bf{BN}$ can also be merged into the convolution operation to speed up inference. In contrast, Vision Transformer backbones tend to normalize the features with layer normalization ( $\bf{LN}$ ). $\bf{LN}$ is more friendly for dense prediction tasks with small training batch size (e.g., object detection and semantic segmentation) because it is independent of batch size. Compared to $\bf{BN}$ , $\bf{LN}$ can also lead to slightly better performance. Nevertheless, $\bf{LN}$ results in heavier computational cost for high-resolution inputs. Accordingly, we utilize $\bf{BN}$ for the first three stages with high-resolution inputs, while $\bf{LN}$ is applied to the last two stages with low-resolution inputs. Moreover, we also replace $\bf{LN}$ with $\bf{BN}$ in the $\bf{CFFN}$ block. By doing so, inference is sped up by 7.6% while the performance is maintained.
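The merging of $\bf{BN}$ into a convolution at inference time follows from both being affine maps once the running statistics are frozen. The sketch below shows the folding for a single scalar weight, which stands in for one output channel of a convolution; this per-channel simplification and all names and numeric values are illustrative assumptions.

```python
import math


def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: affine 'convolution' followed by inference-time batch normalization."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold frozen BN statistics into the preceding weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


w, b = 0.5, 0.1                                  # hypothetical conv weight and bias
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0     # hypothetical BN parameters and statistics

w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var)

x = 2.0
reference = conv_then_bn(x, w, b, gamma, beta, mean, var)
folded = w_folded * x + b_folded
print(reference, folded)
```

After folding, the normalization adds no work at inference time. $\bf{LN}$ computes its statistics from each input at run time, so it cannot be folded in the same way.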

5 EMA Distillation

During training, Exponential Moving Average (EMA) has been widely adopted to stabilize and improve the training procedure for both ConvNets and ViT. Nevertheless, the message passing in conventional EMA is unidirectional, i.e., the teacher network is updated by EMA based on the parameters of the student network, thereby resulting in a sub-optimal solution. In an effort to trigger bi-directional message interaction between the teacher and student networks, we propose a new EMA distillation strategy to train HIRI-ViT. EMA distillation additionally leverages the probability distribution learnt by the teacher network to guide the training of the student network. In contrast to traditional knowledge distillation, our EMA distillation does not rely on any extra large-scale pre-trained network.
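The two directions of message passing can be sketched as follows. This is a minimal illustration under stated assumptions, not the training recipe of HIRI-ViT: the momentum value, the use of a cross-entropy term against the teacher distribution, and the loss weight are all assumed, and the function names are hypothetical.

```python
import math


def ema_update(teacher, student, momentum=0.999):
    """Student -> teacher: teacher parameters track an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))


def ema_distill_loss(student_logits, teacher_logits, label, weight=1.0):
    """Teacher -> student: label loss plus a term guided by the teacher distribution."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, student_logits)
    soft = cross_entropy(softmax(teacher_logits), student_logits)  # teacher is not updated by gradients
    return hard + weight * soft


teacher_params = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.9)
loss = ema_distill_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], label=0)
print(teacher_params, loss)
```

Because the teacher is an average of the student's own past weights, no separately pre-trained network is required.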

6 Architecture Details

Table I details the architectures of our HIRI-ViT family. Following the basic network configuration of existing CNN+ViT hybrid backbones, we construct three variants of our HIRI-ViT in different model sizes, i.e., HIRI-ViT-S (small size), HIRI-ViT-B (base size), and HIRI-ViT-L (large size). Specifically, the entire architecture of HIRI-ViT is composed of one HR-stem layer and five stages. For the first two stages with high-resolution inputs, we replace the conventional Transformer blocks with our light-weight High Resolution blocks to avoid huge computational overhead. For the third stage, we leverage only $\bf{CFFN}$ blocks to handle the middle-resolution feature maps. Similar to conventional Vision Transformers, we employ Transformer blocks in the last two stages with low-resolution inputs. For each stage $i$ , $E_{i}$ , $C_{i}$ , and $HD_{i}$ represent the expansion ratio of the feed-forward layer, the channel dimension, and the head number, respectively.
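The effect of the additional stage on feature-map resolution can be sketched as below. The overall stem stride of 4 and the factor-2 downsampling between consecutive stages are assumptions for illustration (Table I is not reproduced here), and `stage_resolutions` is a hypothetical helper.

```python
def stage_resolutions(input_size, num_stages, stem_stride=4):
    """Feature-map side length at each stage, halving between consecutive stages."""
    sizes = []
    size = input_size // stem_stride
    for _ in range(num_stages):
        sizes.append(size)
        size //= 2
    return sizes


four_stage_224 = stage_resolutions(224, num_stages=4)
four_stage_448 = stage_resolutions(448, num_stages=4)
five_stage_448 = stage_resolutions(448, num_stages=5)

print("four stages @224:", four_stage_224)
print("four stages @448:", four_stage_448)
print("five stages @448:", five_stage_448)
```

Under these assumptions, the last four stages of the five-stage network at 448 $\times$ 448 operate on the same resolutions as a four-stage network at 224 $\times$ 224, so only the first stage sees the enlarged feature map.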

Experiments

We evaluate our HIRI-ViT on four vision tasks: image classification, object detection, instance segmentation, and semantic segmentation. In particular, HIRI-ViT is first trained from scratch on ImageNet-1K for the image classification task. Next, we fine-tune the pre-trained HIRI-ViT for the remaining three downstream tasks: object detection and instance segmentation on COCO, and semantic segmentation on ADE20K.

Setup. The ImageNet-1K dataset contains 1.28 million training images and 50,000 validation images over 1,000 object classes. During training, we adopt the common data augmentation strategies: random cropping, random horizontal flipping, Cutmix, Mixup, Random Erasing, and RandAugment. The whole network is optimized via AdamW over 8 V100 GPUs for 300 epochs, with a cosine decay learning rate scheduler and 5 epochs of linear warm-up. The batch size, initial learning rate and weight decay are set as 1,024, 0.001 and 0.05, respectively. We report Top-1/5 accuracy on the ImageNet-1K validation set, and Top-1 accuracy (i.e., V2 Top-1) on the ImageNet V2 matched frequency test set.
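The learning-rate schedule described above can be sketched as a per-epoch function. The epoch counts and initial learning rate come from the text; the minimum learning rate of 0, the shape of the warm-up ramp, and counting the warm-up epochs inside the 300 are assumptions, and `learning_rate` is a hypothetical helper.

```python
import math


def learning_rate(epoch, base_lr=0.001, total_epochs=300, warmup_epochs=5, min_lr=0.0):
    """Linear warm-up followed by cosine decay, evaluated once per epoch (0-indexed)."""
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warm-up epoch reaches base_lr (assumed ramp).
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 150, 299):
    print(epoch, round(learning_rate(epoch), 6))
```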

Performance Comparison. Table II shows the performance comparisons between our HIRI-ViT family and existing CNN/ViT backbones. It is worth noting that all the baselines are fed with typical resolution inputs (224 $\times$ 224), while our HIRI-ViT family scales up Vision Transformer with high resolution inputs (448 $\times$ 448). Overall, under comparable computational cost in each group, our HIRI-ViT (448 $\times$ 448) achieves consistent performance improvements against state-of-the-art backbones across all model sizes. Remarkably, for the large model size (backbones with GFLOPs more than 11.7), the Top-1 accuracy of our HIRI-ViT-L (448 $\times$ 448) is 85.7%, an absolute gain of 0.5% over the best competitor MaxViT-L (85.2%). Although the input resolution of HIRI-ViT-L is significantly larger than that of MaxViT-L, our HIRI-ViT-L (448 $\times$ 448) requires fewer GFLOPs than MaxViT-L and delivers faster inference with almost doubled throughput. Such results clearly demonstrate that our HIRI-ViT achieves a better balance between performance and computational cost, especially for high resolution inputs. It is also worth noting that when fed with the typical resolution inputs (224 $\times$ 224), our HIRI-ViT under each model size achieves performance comparable to state-of-the-art backbones, while requiring significantly less computational cost. For example, in the group of large model size, the Top-1 accuracy of our HIRI-ViT-L (224 $\times$ 224) is 85.3% and the corresponding throughput is 660 images per second, which is far faster than the best competitor MaxViT-L (Top-1 accuracy: 85.2%, Throughput: 241 images per second). The results again confirm the cost-efficient design of our HIRI-ViT.

Performance Comparison at Higher Resolution. Table III illustrates the comparisons between our HIRI-ViT family and other state-of-the-art vision backbones with larger input image size. For this upgraded HIRI-ViT with higher resolution inputs (768 $\times$ 768), we adopt the AdamW optimizer on 8 V100 GPUs, with a momentum of 0.9, an initial learning rate of $1.0\times 10^{-5}$ , and a weight decay of $1.0\times 10^{-8}$ . The optimization process includes 30 epochs with a cosine decay learning rate scheduler. Similarly, for each group with comparable computational cost, our HIRI-ViT consistently obtains performance gains in comparison to other vision backbones at higher resolution. These results clearly validate the effectiveness of our proposed five-stage ViT backbone tailored for high-resolution inputs. This design decomposes the typical CNN operations into high-resolution and low-resolution branches in parallel, and thus maintains favorable computational cost even under the setup with higher-resolution inputs.

2 Object Detection and Instance Segmentation on COCO

Setup. We conduct both object detection and instance segmentation tasks on COCO dataset. We adopt the standard setting in and train all models on COCO-2017 training set ( $\sim$ 118K images). The learnt model is finally evaluated over COCO-2017 validation set (5K images). We use two mainstream detectors (RetinaNet and Mask R-CNN ) for object detection and instance segmentation. The primary CNN backbones in each detector are replaced with our HIRI-ViT family (initially pre-trained on ImageNet-1K). All the other newly added layers are initialized with Xavier . We fine-tune the detectors on 8 V100 GPUs via AdamW optimizer (batch size: 16). For RetinaNet and Mask R-CNN, we adopt the standard 1 $\times$ training schedule (12 epochs). The shorter side of each image is resized to 1,600 pixels, while the longer side does not exceed 2,666 pixels. For object detection, we conduct experiments on four additional object detection methods: Cascade Mask R-CNN , ATSS , GFL , and Sparse RCNN . Following , the 3 $\times$ schedule (36 epochs) with multi-scale strategy is utilized for training. The input image is randomly resized by maintaining the shorter side within the range of , while the longer side is forced to be less than 2,666 pixels. We report the Average Precision score ( $AP$ ) across different IoU thresholds and three different object sizes, i.e., small ( $AP_{S}$ ), medium ( $AP_{M}$ ), large ( $AP_{L}$ ). For instance segmentation task, the bounding box and mask AP scores ( $AP^{b}$ , $AP^{m}$ ) are reported.
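The resizing rule for the 1 $\times$ schedule (shorter side 1,600 pixels, longer side at most 2,666 pixels) can be sketched as an aspect-ratio-preserving rescale. This follows common detection practice and is an assumption about the exact rule used here; the rounding convention and the function name are also illustrative.

```python
def resize_keep_ratio(width, height, short_side=1600, max_long_side=2666):
    """Scale so the shorter side reaches short_side unless the longer side would exceed its cap."""
    scale = min(short_side / min(width, height), max_long_side / max(width, height))
    return int(width * scale + 0.5), int(height * scale + 0.5)


# A 4:3 image is limited by the shorter side.
print(resize_keep_ratio(640, 480))
# A very wide image is limited by the cap on the longer side.
print(resize_keep_ratio(1000, 300))
```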

Performance Comparison. Table IV summarizes the object detection and instance segmentation performances of RetinaNet and Mask R-CNN with different backbones on the COCO benchmark. Our HIRI-ViT-S and HIRI-ViT-B consistently exhibit better performance across all metrics than other backbones in each group with comparable computational costs. Concretely, HIRI-ViT-S outperforms the best competitor ScalableViT-S by 1.4% ( $AP$ ) and Ortho-S by 0.5% ( $AP^{m}$ ) on the basis of the RetinaNet and Mask R-CNN detectors, respectively. Meanwhile, the input resolution of HIRI-ViT-S is twice that of ScalableViT-S, while both of them require a similar parameter number and GFLOPs. This clearly validates the superior generalizability of scaling up Vision Transformer via our design in downstream tasks. Table V further shows the performances of four additional object detectors under different backbones on COCO. Similarly, HIRI-ViT-S leads to consistent performance gains against other baselines for each object detector.

3 Semantic Segmentation on ADE20K

Setup. We next evaluate HIRI-ViT for semantic segmentation on ADE20K. This dataset covers 150 semantic categories and contains 20,000 images for training, 2,000 for validation, and 3,000 for testing. Here we use UPerNet as the base model, where the CNN backbone is replaced with our HIRI-ViT. The whole network is trained for 160K iterations over 8 V100 GPUs via the AdamW optimizer. We utilize the linear learning rate decay scheduler with a 1,500-iteration linear warm-up for optimization. The batch size, weight decay, and initial learning rate are set as 16, 0.01, and 0.00006, respectively. We adopt the standard data augmentations: random horizontal flipping, random photometric distortion, and random re-scaling within the ratio range of [0.5, 2.0]. All the other hyperparameters and detection heads are set as in Swin for fair comparison.

Performance Comparison. Table VI details the performances of different backbones on ADE20K validation set for semantic segmentation. Similar to the observations in object detection and instance segmentation downstream tasks, our HIRI-ViT-S and HIRI-ViT-B attain the best mIoU score within each group with comparable computational costs. Specifically, under the same base model of UPerNet, HIRI-ViT-S boosts up the mIoU score of HRViT-b2 by 1.2%, which again demonstrates the effectiveness of our proposal.

4 Ablation Study

In this section, we first illustrate how to construct a strong multi-stage Vision Transformer (M-ViT) with four stages, which acts as a base model. Then, we extend the base model with five stages, and study how each design in HIRI-ViT influences the overall performances on ImageNet-1K for image classification task. Table VII summarizes the performances of different ablated runs by progressively considering each design into the base model.

M-ViT. We start from a base model, i.e., multi-stage Vision Transformer (M-ViT) with four stages. The whole architecture of M-ViT is similar to PVT, which consists of one ViT-stem layer and four stages. Each stage contains a stack of Transformer blocks. Each Transformer block is composed of $\bf{MHA}$ and $\bf{FFN}$ . In the first two stages, M-ViT utilizes a strided convolution as spatial reduction to downsample the keys and values, while the last two stages do not use any spatial reduction operation. A single strided convolution is leveraged to perform spatial downsampling and meanwhile enlarge the channel dimension between every two stages. $\bf{LN}$ is adopted for feature normalization. As indicated in Table VII (Row 1), M-ViT achieves a top-1 accuracy of 82.7%.

FFN $\rightarrow$ CFFN. Row 2 upgrades the base model by integrating each $\bf{FFN}$ with additional depth-wise convolution (i.e., $\bf{CFFN}$ ). In this way, $\bf{CFFN}$ enables the exploitation of inductive bias and thus boosts up performance (82.9%).

Remove MHA. Next, following prior work, we remove the multi-head self-attention ( $\bf{MHA}$ ) in the first two stages. The channel expansion dimension in $\bf{CFFN}$ is also enlarged. As shown in Table VII (Row 3), this ablated run maintains the performance, while both GFLOPs and parameter number decrease.

LN $\rightarrow$ BN. Then, we replace $\bf{LN}$ with $\bf{BN}$ in the first two stages and each $\bf{CFFN}$ block (Row 4). The top-1 accuracy slightly increases to 83.0%, showing that $\bf{BN}$ is more suitable than $\bf{LN}$ for blocks with convolution operations.

ViT-stem $\rightarrow$ Conv-stem. When we replace ViT-stem with Conv-stem (Row 5), the top-1 accuracy is further improved to 83.4%. This observation demonstrates the merit of Conv-stem that injects a small dose of inductive bias in the early visual processing, thereby stabilizing the optimization and improving peak performance.

IRDS. After that, we apply the inverted residual downsampling between every two stages (Row 6). The final version of M-ViT with four stages achieves a top-1 accuracy of 83.6%, which is competitive with, and even surpasses, most existing ViT backbones.

Resolution 224 $\rightarrow$ 448. Next, we directly enlarge the input resolution from 224 $\times$ 224 to 448 $\times$ 448 for this four-stage structure (Row 7). This ablated run leads to clear performance boosts, while sharply increasing GFLOPs from 4.7 to 21.5. This observation aligns with existing four-stage ViT architectures (e.g., PVT and Swin Transformer) where the computational cost scales quadratically for enlarged input resolution.

Four Stages $\rightarrow$ Five Stages. Furthermore, we extend the four-stage ablated run (Row 7) by leveraging an additional stage to further downsample high-resolution inputs. Such a five-stage structure (Row 8) significantly decreases GFLOPs with high-resolution inputs (448 $\times$ 448), and the Top-1 accuracy drops to 84.0%, which still outperforms the four-stage ablated run (Row 6) under 224 $\times$ 224 inputs. The results basically confirm the effectiveness of our five-stage structure, which seeks a better cost-performance balance for high-resolution inputs.

Conv-stem $\rightarrow$ HR-stem. Then we replace Conv-stem with our HR-stem (remoulded in the two-branch design). As shown in Table VII (Row 9), the GFLOPs clearly drop (to 5.1), while the top-1 accuracy is maintained (84.0%).

HR block. After that, we replace each $\bf{CFFN}$ with two stacked HR blocks in the first two stages (Row 10). Note that the Top-1 accuracy on ImageNet is almost saturated ( $\sim$ 84.0%) for the small size (GFLOPs: $\sim$ 5.0), and it is relatively difficult to introduce a large margin of improvement. However, our HR block still achieves a performance gain of 0.1% in both Top-1 and Top-5 accuracies and meanwhile slightly reduces computational cost, which again validates the advantage of the cost-efficient encoding via our high- and low-resolution branches in parallel.

EMA Distillation. Finally, Row 11 is the full version of our HIRI-ViT-S that is optimized with additional EMA distillation, leading to the best top-1 accuracy (84.3%).

5 Impact of Each Kind of Block in Five Stages

Here we conduct an additional ablation study to fully examine the impact of each kind of block (HR/CFFN/Transformer block) in five stages. Table VIII shows the performances of different ablated variants of HIRI-ViT by constructing five stages with different blocks. Concretely, the results of Rows 1-3 indicate that replacing more HR blocks with Transformer blocks in the last three stages can generally lead to performance improvement, but suffers from heavier computational cost. For example, the use of Transformer blocks in the third stage (Row 3) obtains a marginal performance gain (0.1% in Top-1 accuracy), while GFLOPs clearly increase. The results basically validate the effectiveness of Transformer blocks with low-resolution inputs in the late stages, but Transformer blocks naturally result in huge computational overhead, especially for high-resolution inputs in the early stages. Furthermore, we replace the Transformer block with the CFFN block to handle middle-resolution inputs in the third stage (Row 4). Such a design nicely reduces GFLOPs and meanwhile maintains the Top-1 and Top-5 accuracies, which demonstrates the best trade-off between computational cost and performance in HIRI-ViT.

6 Computational Cost vs. Accuracy

Figure 6 further illustrates the accuracy curves with regard to computational cost (i.e., (a) GFLOPs, (b) model parameter number and (c) inference time) for our HIRI-ViT and other state-of-the-art vision backbones. As shown in this figure, the curves of our HIRI-ViT backbones always lie above those of other vision backbones. That is, our HIRI-ViT backbones achieve better computational cost-accuracy tradeoffs than existing vision backbones.

7 Extension to Other Backbones

Here we report the performance/computational cost by leveraging our proposal to scale up three different vision backbones (HRNet, PVTv2, DaViT) with higher-resolution inputs (224 $\times$ 224 to 448 $\times$ 448). As shown in Table IX, our five-stage structure with two-branch design in HIRI-ViT consistently leads to significant performance boost for each vision backbone, while retaining favorable computational cost. The results further validate the generalizability of our five-stage structure with two-branch design for scaling up vision backbones with high-resolution inputs.

Conclusions

In this work, we design a new five-stage ViT backbone for high-resolution inputs, namely HIRI-ViT, that decomposes the typical CNN operations into high-resolution and low-resolution branches in parallel. With such a principled five-stage and two-branch design, our HIRI-ViT is able to scale up the Vision Transformer backbone with high-resolution inputs in a cost-efficient fashion. Extensive experiments are conducted on ImageNet-1K (image classification), COCO (object detection and instance segmentation) and ADE20K (semantic segmentation) to validate the effectiveness of our HIRI-ViT against competitive CNN or ViT backbones.

In spite of these observations, open problems remain. While our five-stage structure with two-branch design has clearly improved the efficiency of scaling up Vision Transformer, we observe less improvement in the performance/computational cost trade-off when employing a six-stage structure. Moreover, how to scale up Video Vision Transformer with high-resolution inputs still presents a major challenge.