Global Context Vision Transformers

Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

Introduction

In recent years, Transformers (Vaswani et al., 2017) have achieved State-Of-The-Art (SOTA) performance in Natural Language Processing (NLP) benchmarks and have become the de facto model for various tasks. A key element in the success of Transformers is the self-attention mechanism, which captures contextual representations by attending to both distant and nearby tokens (Yin et al., 2021). Following this trend, Vision Transformer (ViT) (Dosovitskiy et al., 2020) proposed to utilize image patches as tokens in a monolithic architecture with minor differences compared to the encoder of the original Transformer. Despite the historic dominance of Convolutional Neural Networks (CNNs) in computer vision, ViT-based models have achieved SOTA or competitive performance in various computer vision tasks.

In essence, the self-attention mechanism in ViT allows for learning more uniform short- and long-range information (Raghu et al., 2021) in comparison to CNNs. However, the monolithic architecture of ViT and the quadratic computational complexity of self-attention hinder their swift application to high-resolution images (Yang et al., 2021a), in which capturing multi-scale long-range information is crucial for accurate representation modeling.

Several efforts (Liu et al., 2021; Dong et al., 2022; Chu et al., 2021a; Tu et al., 2022), most notably Swin Transformer (Liu et al., 2021), have attempted to address the balance between short- and long-range spatial dependencies by proposing multi-resolution architectures in which the self-attention is computed in local windows. In this paradigm, cross-window connections such as window shifting are used for modeling the interactions across different regions. Despite the progress, the limited receptive field of local windows challenges the capability of self-attention to capture long-range information, and window-connection schemes such as shifting only cover a small neighborhood in the vicinity of each window. Subsequent efforts such as Focal Transformer (Yang et al., 2021b) attempted to address this issue by designing highly sophisticated self-attention modules with increased model complexity.

In this work, we introduce the Global Context (GC) ViT network to address these limitations. Specifically, we propose a hierarchical ViT architecture consisting of local and global self-attention modules. At each stage, we compute global query tokens that encompass global contextual information from different image regions, using novel fused inverted residual blocks, which we refer to as modified Fused-MBConv blocks. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local key and value representations.

The design of our proposed framework for the global query generator and self-attention is intuitive and simple, and can be efficiently implemented using major deep learning frameworks. Hence, it eliminates sophisticated and computationally expensive operations and ensures the effectiveness of self-attention when applied to high-resolution images. In addition, we propose a novel downsampling block with a parameter-efficient Fused-MBConv layer to address the lack of inductive bias in ViTs and to enhance the modeling of inter-channel dependencies.

We have extensively validated the effectiveness of the proposed GC ViT using three publicly available datasets for various computer vision tasks. For image classification on the ImageNet-1K dataset, GC ViT variants with 51M, 90M and 201M parameters achieve new SOTA benchmarks of 84.3%, 85.0% and 85.7% Top-1 accuracy, respectively, without using extra data or pre-training.

Hence, GC ViT consistently outperforms ConvNeXt (Liu et al., 2022b), MaxViT (Tu et al., 2022) and Swin Transformer (Liu et al., 2021) models, sometimes by a significant margin (see Fig. 1).

Using an ImageNet-1K pre-trained GC ViT base backbone with a Cascade Mask-RCNN (He et al., 2017) head, our model achieves a box mAP of 52.9 for object detection and a mask mAP of 45.8 for instance segmentation on the MS COCO dataset using single-scale inference. We also used an ImageNet-21K pre-trained GC ViT model as the backbone with a 4-scale DINO detection head and achieved a box AP of 58.3%.

In addition, using a UPerNet (Xiao et al., 2018) head, our model achieves a mIoU of 49.2 on ADE20K for semantic segmentation using only a single-scale inference scheme. Other variants of GC ViT with different learning capacities also demonstrate SOTA results when compared to similarly sized models on both the MS COCO and ADE20K datasets. Hence, GC ViT demonstrates great scalability for high-resolution images on various downstream tasks, validating the effectiveness of the proposed framework in capturing both short- and long-range information.

The main contributions of our work are summarized as follows:

We introduce a compute and parameter-optimized hierarchical ViT with reparametrization of the design space (e.g., embedding dimension, number of heads, MLP ratio).

We design an efficient CNN-like token generator that encodes spatial features at different resolutions for global query representations.

We propose global query tokens that can effectively capture contextual information in an efficient manner and model both local and global interactions.

We introduce a parameter-efficient downsampling module with modified Fused-MBConv blocks that not only integrates inductive bias but also enables the modeling of inter-channel dependencies.

We demonstrate new SOTA benchmarks for: (1) ImageNet classification, with Pareto fronts on ImageNet-1K for the number of parameters and FLOPs; and (2) downstream tasks, namely detection and instance segmentation on MS COCO and semantic segmentation on ADE20K.

GC ViT architecture

Architecture. Fig. 2 depicts the architecture of GC ViT. We propose a hierarchical framework to obtain feature representations at several resolutions (called stages) by decreasing the spatial dimensions while expanding the embedding dimension, both by factors of 2.

Every GC ViT stage is composed of alternating local and global self-attention modules to extract spatial features. Both operate in local windows like Swin Transformer (Liu et al., 2021); however, the global self-attention has access to global features extracted by the global query generator. The query generator is a CNN-like module that extracts features from the entire image only once at every stage. After each stage, the spatial resolution is decreased by a factor of 2 while the number of channels is increased by a factor of 2 via a downsampling block. The resulting features are passed through average pooling and linear layers to create an embedding for a downstream task.
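The halving and doubling across stages can be made concrete with a small sketch. The input size, stem stride, base embedding dimension and number of stages below are illustrative assumptions, not values stated in this section, and `stage_shapes` is a hypothetical helper.

```python
def stage_shapes(height, width, embed_dim, num_stages=4, stem_stride=4):
    """Return the (H, W, C) feature shape seen by each stage.

    num_stages and stem_stride are assumed values for illustration only.
    """
    h, w, c = height // stem_stride, width // stem_stride, embed_dim
    shapes = []
    for _ in range(num_stages):
        shapes.append((h, w, c))
        # Downsampling block between stages: spatial dims / 2, channels * 2.
        h, w, c = h // 2, w // 2, c * 2
    return shapes


# Hypothetical 224x224 input with a base embedding dimension of 64.
shapes = stage_shapes(224, 224, 64)
for index, shape in enumerate(shapes, start=1):
    print(f"stage {index}: H={shape[0]}, W={shape[1]}, C={shape[2]}")
```

Each step keeps the number of feature values per stage shrinking by a factor of 2, since the spatial area drops by 4 while the channel count doubles.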

The GC ViT architecture benefits from novel blocks such as a downsampling operator, a global query generator and a global self-attention module described in the next sections.

Downsampler. We leverage an idea of spatial feature contraction from CNN models that imposes locality bias and cross-channel interaction while reducing dimensions. We utilize a modified Fused-MBConv block, followed by a max pooling layer with a kernel size of 3 and stride of 2, as a downsampling operator. The Fused-MBConv block in our work is similar to the one in EfficientNetV2 (Tan & Le, 2021) with modifications as in

where SE, GELU and DW-Conv3×3 denote Squeeze and Excitation block (Hu et al., 2018), Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016) and 3×3 depth-wise convolution, respectively. In our proposed architecture, the Fused-MBConv blocks provide desirable properties such as inductive bias and modeling of inter-channel dependencies. The downsampler design is ablated in Table 8.
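Using the three operators named above, the modified Fused-MBConv block can be sketched as the following sequence. The order of operations, the closing 1×1 convolution and the residual connection are assumptions made for illustration; only SE, GELU and the 3×3 depth-wise convolution are defined in the text.

```latex
\begin{align}
\hat{x} &= \text{DW-Conv}_{3\times 3}(x), \\
\hat{x} &= \text{GELU}(\hat{x}), \\
\hat{x} &= \text{SE}(\hat{x}), \\
x &= \text{Conv}_{1\times 1}(\hat{x}) + x.
\end{align}
```

Under this reading, the depth-wise convolution supplies the locality bias, while SE and the assumed 1×1 convolution mix information across channels.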

Fig. 3 demonstrates the main idea behind our contribution. Local self-attention can only query patches within a local window, whereas the global attention can query different image regions while still operating within the window. At each stage, the global query component is pre-computed. The global self-attention utilizes the extracted global query tokens, which are shared across all blocks, to interact with the local key and value representations. In addition, GC ViT employs alternating local and global self-attention blocks to effectively capture both local and global spatial information. Fig. S.1 illustrates the difference between local and global self-attention. The global attention query q_g has a size of B×C×h×w, wherein B, C, h and w denote batch size, embedding dimension, local window height and width, respectively. Moreover, q_g is repeated along the batch dimension to compensate for the overall number of windows, giving an aggregated batch size B* = B×N*, where N* is the number of local windows. q_g is further reshaped into multiple heads. The value and key are computed within each local window using a linear layer.

Since the partitioned windows only contain local information, interaction with rich contextual information embedded in the global query tokens provides an effective way of enlarging the receptive field and attending to various regions in the input feature maps. The self-attention module is computed as in
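As a rough illustration of the global self-attention described above, the NumPy sketch below partitions a feature map into windows, repeats the global query once per window so that the batch becomes B* = B×N*, and attends to keys and values computed inside each window. It is a minimal sketch under stated assumptions, not the authors' implementation: the standard scaled dot-product form with a 1/sqrt(d) factor is assumed, any relative position bias is omitted, the query is stored as (B, h·w, C) rather than B×C×h×w for convenience, and all function names and weights are hypothetical.

```python
import numpy as np


def window_partition(x, h, w):
    """Split (B, H, W, C) features into (B*N, h*w, C) non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // h, h, W // w, w, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the two window-index axes
    return x.reshape(-1, h * w, C)     # windows of image 0 first, then image 1, ...


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def global_window_attention(x, q_g, w_k, w_v, h, w, num_heads):
    """x: (B, H, W, C) features; q_g: (B, h*w, C) pre-computed global query."""
    B, H, W, C = x.shape
    n_win = (H // h) * (W // w)
    d = C // num_heads

    windows = window_partition(x, h, w)   # (B*N, h*w, C)
    k = windows @ w_k                     # local keys via a linear layer
    v = windows @ w_v                     # local values via a linear layer
    q = np.repeat(q_g, n_win, axis=0)     # repeat along batch: B -> B* = B*N

    def split_heads(t):
        return t.reshape(-1, h * w, num_heads, d).transpose(0, 2, 1, 3)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))
    out = attn @ v                        # (B*N, heads, h*w, d)
    return out.transpose(0, 2, 1, 3).reshape(-1, h * w, C)


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
B, H, W, C, h, w, heads = 2, 8, 8, 16, 4, 4, 4
x = rng.standard_normal((B, H, W, C))
q_g = rng.standard_normal((B, h * w, C))
w_k = rng.standard_normal((C, C))
w_v = rng.standard_normal((C, C))
out = global_window_attention(x, q_g, w_k, w_v, h, w, heads)
print(out.shape)
```

Because `np.repeat` emits consecutive copies of each image's query, the repeated queries line up with the batch-major window ordering produced by `window_partition`.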

Complexity Analysis

Given an input feature map x ∈ R^{H×W×C} at each stage with a window size of h×w, the computational complexity of GC ViT is as follows

The efficient design of the global query token generator and other components allows GC ViT to maintain a computational complexity similar to that of Swin Transformer (Liu et al., 2021) while capturing long-range information and achieving higher accuracy for classification and for downstream tasks such as detection and segmentation.
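To give a feel for the comparison with Swin Transformer, the sketch below evaluates the commonly quoted multiply-accumulate counts for window attention, 2HW(2C² + hwC), and for full self-attention over all HW tokens. Treating the window-attention count as the per-block cost of GC ViT is an assumption made here for illustration, and the feature-map sizes in the example are hypothetical.

```python
def window_attention_cost(H, W, C, h, w):
    """Cost of attention restricted to h x w windows: 2*H*W*(2*C^2 + h*w*C)."""
    projections = 4 * H * W * C * C    # query, key, value and output projections
    attention = 2 * h * w * H * W * C  # q k^T and attn v, computed per window
    return projections + attention


def full_attention_cost(H, W, C):
    """Cost of self-attention over all H*W tokens (quadratic in H*W)."""
    return 4 * H * W * C * C + 2 * (H * W) ** 2 * C


# Hypothetical stage: 56x56 feature map, 96 channels, 7x7 windows.
windowed = window_attention_cost(56, 56, 96, 7, 7)
full = full_attention_cost(56, 56, 96)
print(f"windowed: {windowed:,}  full: {full:,}  ratio: {full / windowed:.1f}x")
```

The windowed cost grows linearly with the number of tokens HW for a fixed window size, which is why the quadratic term disappears at high resolution.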

Experiments

For image classification, we trained and tested our model on the ImageNet-1K dataset (Deng et al., 2009). To allow for a fair comparison, all GC ViT variants are trained by following the training configurations of previous efforts (Liu et al., 2021; Yang et al., 2021b; Chu et al., 2021a). Specifically, all models are trained with the AdamW (Kingma & Ba, 2014) optimizer for 300 epochs with an initial learning rate of 0.001, weight decay of 0.05, a cosine decay scheduler and 20 warm-up and cool-down epochs, respectively.
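The shape of such a schedule can be sketched as follows. Several details are assumptions rather than facts from the text: the warm-up is taken to be linear, the cool-down is taken to hold the minimum learning rate, both phases are counted inside the 300 epochs, the 20 epochs are read as applying to each phase, and the `warmup_start_lr` and `min_lr` values are hypothetical.

```python
import math


def lr_at_epoch(epoch, total_epochs=300, base_lr=1e-3, warmup_epochs=20,
                cooldown_epochs=20, warmup_start_lr=1e-6, min_lr=1e-5):
    """Learning rate for a given epoch: linear warm-up, cosine decay, cool-down."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from warmup_start_lr up to base_lr.
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)

    decay_epochs = total_epochs - warmup_epochs - cooldown_epochs
    if epoch >= warmup_epochs + decay_epochs:
        # Assumed cool-down: hold the minimum learning rate.
        return min_lr

    # Cosine decay from base_lr down to min_lr.
    progress = (epoch - warmup_epochs) / decay_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 150, 279, 280, 299):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```

The weight decay of 0.05 is applied by the optimizer and does not appear in the schedule itself.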

For object detection and instance segmentation, we trained our model on MS COCO (Lin et al., 2014) with DINO (Zhang et al., 2022) and Mask-RCNN (He et al., 2017) heads, using a 3× LR schedule with an initial learning rate of 0.0001, a batch size of 16 and weight decay of 0.05. Following Liu et al. (2022b), we compared against Tiny, Small and Base model variants using Cascade Mask-RCNN but only compared against the Tiny variant using Mask-RCNN. For semantic segmentation, we used the ADE20K dataset (Zhou et al., 2017) with a UPerNet (Xiao et al., 2018) segmentation head. Following previous efforts, we used a random crop size of 512×512 for the input images.

We present the ImageNet-1K classification benchmarks in Table 1 and compare against CNN and ViT-based models across different model sizes. Our model achieves better performance when compared to other established benchmarks such as ConvNeXt (Liu et al., 2022b). Furthermore, as shown in Fig. 1, GC ViT models have better or comparable computational efficiency in terms of number of FLOPs relative to competing models.

Detection and Instance Segmentation

In Table 2, we present object detection and instance segmentation benchmarks on MS COCO dataset. Using a Mask-RCNN head, the model with pre-trained GC ViT-T (47.9/43.2) backbone outperforms counterparts with pre-trained ConvNeXt-T (Liu et al., 2022b) (46.2/41.7) by +1.7 and +1.5 and Swin-T (Liu et al., 2021) (46.0/41.6) by +1.9 and +1.6 in terms of box AP and mask AP, respectively. Using a Cascade Mask-RCNN head, the models with pre-trained GC ViT-T (51.6/44.6) and GC ViT-S (52.4/45.4) backbones outperform ConvNeXt-T (Liu et al., 2022b) (50.4/43.7) by +1.2 and +0.9 and ConvNeXt-S (Liu et al., 2022b) (51.9/45.0) by +0.5 and +0.4 in terms of box AP and mask AP, respectively. Furthermore, the model with GC ViT-B (52.9/45.8) backbone outperforms the counterpart with ConvNeXt-B (Liu et al., 2022b) (52.7/45.6) by +0.2 and +0.2 in terms of box AP and mask AP, respectively.

As shown in Table 2, we have also tested the performance of GC ViT-L model, pre-trained on ImageNet-21K dataset, with a 4-scale DINO (Zhang et al., 2022) detection head and achieved a box AP of 58.3% on MS COCO dataset. Hence our model outperforms the counterpart with Swin-L backbone.

Semantic Segmentation

We present semantic segmentation benchmarks on ADE20K dataset in Table 4. The models using pre-trained GC ViT-T (47.0), GC ViT-S (48.3) and GC ViT-B (49.2) backbones outperform counterpart models with pre-trained Twins-SVT-S (Chu et al., 2021a) (46.2), Twins-SVT-B (Chu et al., 2021a) (47.7) and Twins-SVT-L (Chu et al., 2021a) (48.8) by +0.8, +0.6 and +0.4 in terms of mIoU, respectively. In addition, models with GC ViT backbones significantly outperform counterparts with Swin Transformer backbones, hence demonstrating the effectiveness of the global self-attention.

Ablation

Component-wise Analysis. As shown in Table 5, we study the role of each component in the GC ViT model for classification, detection, instance segmentation and semantic segmentation. For simplicity, we start with Swin Transformer as the base model and progressively re-design the components to demonstrate their importance in improving the performance. Firstly, we remove the window shifting and predictably observe significant performance degradation across all tasks. Changing the distribution of parameters to our design improves the performance by +1.7, +2.8, +2.2 and +1.7 in terms of accuracy, box AP, mask AP and mIoU. Such reparametrization includes changing the window size, MLP ratio and number of layers, to name but a few. Adding the CNN-based stem of GC ViT to the model provides additional improvements of +0.3, +0.2, +0.2 and +0.2 in terms of accuracy, box AP, mask AP and mIoU. In addition, using our proposed downsampler further improves the accuracy, box AP, mask AP and mIoU by +0.4, +0.1, +0.1 and +0.3, respectively. The last two changes demonstrate the importance of convolutional inductive bias and of capturing the inter-channel dependencies in our model. Finally, leveraging the proposed global self-attention improves the performance by +0.9, +0.8, +0.6 and +1.2 in terms of accuracy, box AP, mask AP and mIoU. Hence, this validates the effectiveness of the proposed global self-attention, in particular for downstream tasks with high-resolution images such as semantic segmentation, in which modeling long-range spatial dependencies is critical.

In Table 6, we compare the performance of the GC ViT-L model, which is pre-trained on the ImageNet-21K dataset and fine-tuned on the ImageNet-1K dataset, with counterpart approaches. GC ViT-L outperforms Swin-L and CSwin-L by +0.3% and +0.1% in terms of Top-1 accuracy, respectively, while performing on par with the ConvNeXt-L model. This validates the effectiveness of the model in large-scale data regimes.

Generalizability

In Table 7, we have evaluated the performance of GC ViT on the ImageNetV2 dataset (Recht et al., 2019) to further measure its robustness. Specifically, we have used the Matched Frequency and Threshold-0.7 sampling strategies. These benchmarks demonstrate the competitive performance of GC ViT on the ImageNetV2 dataset and validate its robustness and generalizability.

Downsampler Design

We studied the effectiveness of various downsampler blocks in Table 8. The simplest alternative to our design is a pair of convolutional and max pooling layers. However, it reduces ImageNet Top-1 accuracy by 0.8. Patch merging is another variant, which was introduced in Swin Transformers (Liu et al., 2021). However, it reduces the accuracy by 0.6. Finally, our downsampler, which consists of a modified Fused-MBConv block and strided convolution, shows the best result. The importance of the former component is explained by the SE operation, which boosts cross-channel interaction while keeping the number of parameters and FLOPs low. We conclude that our proposed downsampler is essential to achieve high accuracy as it introduces convolutional inductive bias.

Interpretability

To provide further insights on interpretability of the proposed global self-attention and query tokens, we demonstrate visualization of the learned attention and Grad-CAM (Selvaraju et al., 2017) maps in Fig. 5. The associated attention distributions uncovered by the global self-attention modules align with image semantics, and hence act as an informative source for local attention modules. In addition, corresponding Grad-CAM maps demonstrate accurate object localization with most intricate details.

Related work

ConvNet. Since the advent of deep learning, CNNs (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Howard et al., 2017; He et al., 2016; Szegedy et al., 2016; Huang et al., 2017; Hu et al., 2018) have dominated computer vision benchmarks with SOTA performance. Recently, ConvNeXt (Liu et al., 2022b) proposed modifications to the architecture of ResNet (He et al., 2016), and achieved competitive benchmarks for classification, detection and segmentation tasks.

Transformer. The ViT (Dosovitskiy et al., 2020) was first proposed as an alternative to CNNs with the advantage of enlarged receptive field, due to its self-attention layers. However, it lacked desirable properties of CNNs such as inductive biases and translation invariance and required large-scale training datasets to achieve competitive performance. Data-efficient Image Transformers (DeiT) (Touvron et al., 2021) introduced a distillation-based training strategy which significantly improved the classification accuracy.

Hybrid. LeViT (Graham et al., 2021) proposed a hybrid model with re-designed multi-layer perceptron (MLP) and self-attention modules that are highly-optimized for fast inference. Cross-covariance Image Transformer (XCiT) (Ali et al., 2021) introduced a transposed self-attention module for modeling the interactions of feature channels. Convolutional vision Transformer (CvT) (Wu et al., 2021) introduced convolutional token embedding layer and Transformer block in a hierarchical architecture to improve the efficiency and accuracy of ViTs. Conditional Position encoding Vision Transformer (CPVT) (Chu et al., 2021b) demonstrated improved performance on various tasks such as image classification and object detection by conditioning the position encoding on localized patch token. Tokens-To-Token Vision Transformer (T2T-ViT) (Yuan et al., 2021) proposed a transformation layer for aggregating adjacent tokens and establishing image prior by exploiting spatial correlations. Pyramid Vision Transformer (PVT) (Wang et al., 2021) proposed a hierarchical architecture with patch embedding at the beginning of each stage and spatial dimension reduction to improve the computational efficiency. Independently, Swin Transformers (Liu et al., 2021) also proposed a hierarchical architecture in which self-attention is computed within local windows which are shifted for region interaction. Twins Transformer (Chu et al., 2021a) proposed a spatially separable self-attention with locally-grouped and global sub-sampling modules to improve the efficiency.

Global Attention. Other efforts such as EdgeViT (Pan et al., 2022) in computer vision and BigBird (Zaheer et al., 2020) in NLP have proposed global self-attention in their respective applications. The global attention in GC ViT is fundamentally different from these approaches. For instance, EdgeViT samples representative tokens and only computes sparse self-attention between these representative tokens with reduced feature size. On the contrary, GC ViT computes self-attention between the global queries (not just the token) and local keys and values without any subsampling in their respective local regions. Furthermore, in EdgeViT, only subsampled representative tokens per region interact in the self-attention module; however, in GC ViT, the global queries interact with the entire local regions. Furthermore, BigBird uses a combination of random, window and global attention mechanisms, which is different from the proposed local and global self-attention scheme in GC ViT. BigBird does not have any specific mechanisms for extracting global tokens, as the existing tokens or additional special tokens can be specified as global tokens. However, the global tokens in GC ViT are obtained by the query generator via extracting contextual information from the entire input features. Lastly, BigBird employs a set of global tokens which attend to the entire input sequence. However, in GC ViT, the global query tokens attend to local key and value tokens in partitioned windows, since attending to the entire input sequence is not feasible considering the larger size of input features.

Conclusion

In this work, we introduced a novel hierarchical ViT, referred to as GC ViT, which can efficiently capture global context by utilizing global query tokens that interact with local regions. We have extensively validated the effectiveness of our model on various tasks. The proposed GC ViT model achieves new SOTA benchmarks for image classification across various model sizes on the ImageNet-1K dataset, and surpasses both CNN and ViT-based counterparts by a significant margin. We have also achieved SOTA or competitive performance for downstream tasks of detection and semantic segmentation on high-resolution images.

References

Appendix A

GC ViT model configurations are presented in Table S.1, which describes the choice of internal hyperparameters used to obtain models with various computational costs and parameter counts.

A.2 Ablation

We performed ablation studies to validate the effectiveness of the proposed global query. Using the same architecture, instead of the global query, we compute: (1) global key and value features, which interact with the local query; and (2) global value features, which interact with the local query and key. As shown in Table S.2, replacing the global query may significantly impact the performance for image classification and downstream tasks such as object detection, instance segmentation and semantic segmentation.

A.2.2 Effect of Global Context Module

In Fig. S.1, we illustrate the difference between GC ViT local and global attention blocks. In order to demonstrate the effectiveness of the Global Context (GC) self-attention module, we use Swin Transformers as the base model and add our proposed GC module. In this analysis, we remove the window shifting operation from Swin Transformers, since the GC module is capable of modeling cross-region interactions. As shown in Table S.3, the addition of the GC module improves the ImageNet Top-1 accuracy by +0.9% and +0.7% for the Swin Transformers Tiny and Small variants, respectively.

A.2.3 EMA and Batch Size

We also used Exponential Moving Averages (EMA) and observed a slight improvement in terms of ImageNet Top-1 accuracy. Furthermore, the performance of the model was stable across different batch sizes, as we did not observe significant changes. Table S.4 demonstrates the effect of EMA and batch size on the accuracy of a GC ViT-T model.

A.3 Training Details

For image classification, GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs. The total training batch size is 1024 (32 per GPU) for GC ViT-S, GC ViT-B and GC ViT-L, and 4096 (128 per GPU) for GC ViT-XXT, GC ViT-XT and GC ViT-T. On average, each model required 32 hours of training with the specified hyperparameters as indicated in the paper. All classification models were trained using the timm package (Wightman, 2019). Object detection and instance segmentation models as well as semantic segmentation models were trained using one computational node with 8 NVIDIA A40 GPUs using a total batch size of 16, hence a batch size of 2 per GPU. Detection and instance segmentation models were trained using the mmdetection (Chen et al., 2019) package and on average required 56 hours of training. Semantic segmentation models were trained using the mmsegmentation (Contributors, 2020) package and on average required 34 hours of training.
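The per-GPU batch sizes quoted above follow directly from dividing the total batch size by the number of GPUs, as the short check below shows. The helper name `per_gpu_batch` is hypothetical and is only meant to make the arithmetic explicit.

```python
def per_gpu_batch(total_batch, num_gpus):
    """Split a total batch evenly across GPUs, rejecting uneven splits."""
    if total_batch % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch // num_gpus


# Classification: 32 A100 GPUs in total.
small_base_large = per_gpu_batch(1024, 32)
xxt_xt_tiny = per_gpu_batch(4096, 32)
# Detection and segmentation: 8 A40 GPUs in total.
downstream = per_gpu_batch(16, 8)
print(small_base_large, xxt_xt_tiny, downstream)
```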

A.4 Interpretability

In Fig. S.2, we illustrate the learned global query token maps and demonstrate their effectiveness in capturing long-range contextual representations from different image regions.