ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao

Introduction

Transformers (Vaswani et al, 2017; Kenton and Toutanova, 2019) have become the popular frameworks in NLP studies owing to their strong ability in modeling long-range dependencies by the self-attention mechanism. Such success and good properties of transformers have inspired many following works that apply them in various computer vision tasks (Dosovitskiy et al, 2020; Zheng et al, 2021; Wang et al, 2021a). Among them, ViT (Dosovitskiy et al, 2020) is the pioneering work that adapts a pure transformer model for vision by embedding images into a sequence of visual tokens and modeling the global dependencies among them with stacked transformer blocks. Although it achieves promising performance on image classification, it experiences a severe data-hungry issue, i.e., requiring large-scale training data and a longer training schedule for better performance. One important reason is that ViT does not efficiently utilize the prior knowledge in vision tasks and lacks such inductive bias (IB) in modeling local visual clues (e.g., edges and corners) and dealing with objects at various scales like convolutions. Alternatively, ViT has to learn such IB implicitly from large-scale data.

Unlike vision transformers, Convolutional Neural Networks (CNNs) are naturally equipped with the intrinsic IBs of locality and scale-invariance and still serve as prevalent backbones in vision tasks (He et al, 2016; Szegedy et al, 2017; Chen et al, 2017; Zhao et al, 2017). The success of CNNs inspires us to explore the benefits of introducing intrinsic IBs in vision transformers. We start by analyzing the above two IBs of CNNs, i.e., locality and scale-invariance. Convolution, which computes local correlation among neighboring pixels, is good at extracting local features such as edges and corners. Consequently, CNNs can provide great low-level features at the shallow layers (Zeiler and Fergus, 2014), which are then aggregated into high-level features progressively by a bulk of sequential convolutions (Huang et al, 2017; Simonyan and Zisserman, 2015; Szegedy et al, 2015). Moreover, CNNs have a hierarchical structure to extract multi-scale features at different layers (Simonyan and Zisserman, 2015; Krizhevsky et al, 2012; He et al, 2016). Intra-layer convolutions can also learn features at different scales by varying their kernel sizes and dilation rates (He et al, 2015; Szegedy et al, 2017; Chen et al, 2017; Lin et al, 2017; Zhao et al, 2017). Consequently, scale-invariant feature representations can be obtained via intra- or inter-layer feature fusion. Nevertheless, CNNs are not well suited to model long-range dependencies, which is the key advantage of transformers. (Although the projection layer in a transformer can be viewed as a $1\times 1$ convolution (Chen et al, 2021c), the term convolution here refers to those with larger kernels, e.g., $3\times 3$ , which are widely used in typical CNNs to extract spatial features.) An interesting question then comes up: can we improve vision transformers by leveraging the good properties of CNNs? 
Recently, DeiT (Touvron et al, 2021a) explores the idea of distilling knowledge from CNNs to transformers to facilitate training and improve performance. However, it requires an off-the-shelf CNN model as the teacher and incurs extra training costs.

Different from DeiT, we explicitly introduce intrinsic IBs into vision transformers by re-designing the network structures in this paper. Current vision transformers always obtain tokens with single-scale context (Dosovitskiy et al, 2020; Yuan et al, 2021b; Wang et al, 2021a; Liu et al, 2021) and learn to adapt to objects at different scales from data. For example, T2T-ViT (Yuan et al, 2021b) improves ViT by delicately generating tokens in a soft split manner. Specifically, it uses a series of Tokens-to-Token transformation layers to aggregate single-scale neighboring contextual information and progressively structures the image to tokens. Motivated by the success of CNNs in dealing with scale variance, we explore a similar design in transformers, i.e., intra-layer convolutions with different receptive fields (Szegedy et al, 2017; Yu et al, 2017), to embed multi-scale context into tokens. Such a design allows tokens to carry useful features of objects at various scales, thereby naturally having the intrinsic scale-invariance IB and explicitly facilitating transformers to learn scale-invariant features more efficiently from data. On the other hand, low-level local features are fundamental elements to generate high-level discriminative features. Although transformers can also learn such features at shallow layers from data, they are not skilled as convolutions by design. Recently, (Yan et al, 2021; Li et al, 2021; Graham et al, 2021) stack convolutions and attention layers sequentially and demonstrate that locality is a reasonable compensation of global dependency. However, this serial structure ignores the global context during locality modeling (and vice versa). To avoid such a dilemma, we follow the “divide-and-conquer” idea and propose modeling locality and long-range dependencies in parallel and fusing the features to account for both. In this way, we empower transformers to learn local and long-range features within each block more effectively.

Technically, we propose a new Vision Transformers Advanced by Exploring Intrinsic Inductive Bias (ViTAE), which is a combination of two types of basic cells, i.e., reduction cell (RC) and normal cell (NC). RCs are used to downsample and embed the input images into tokens with rich multi-scale context, while NCs aim to jointly model locality and global dependencies in the token sequence. Moreover, these two types of cells share a simple basic structure, i.e., paralleled attention module and convolutional layers followed by a feed-forward network (FFN). It is noteworthy that RC has an extra pyramid reduction module with atrous convolutions of different dilation rates to embed multi-scale context into tokens. Following the setting in (Yuan et al, 2021b), we stack three reduction cells to reduce the spatial resolution by $1/16$ and a series of NCs to learn discriminative features from data. ViTAE outperforms representative vision transformers in terms of data efficiency and training efficiency (see Figure 1) as well as classification accuracy and generalization on downstream image classification tasks. In addition, we further scale up ViTAE to large models and show that the inductive bias still helps to obtain better performance, e.g., ViTAE-H with 644M parameters achieves 88.5% Top-1 classification accuracy on ImageNet without using extra private data.

Beyond image classification, backbone networks should adapt well to various downstream tasks such as object detection, semantic segmentation, and pose estimation. To this end, we extend the vanilla ViTAE to the multi-stage design, i.e., ViTAEv2. Specifically, a natural choice is to construct the model by re-arranging the reduction cells and normal cells according to the strategies in (Wang et al, 2021a; Liu et al, 2021) to have multi-scale feature outputs, i.e., several consecutive NC cells are used following one RC module at each stage (feature resolution) rather than using a series of NCs only at the last stage. As a result, the multi-scale features from different stages can be utilized for those various downstream tasks. One remaining issue is that the vanilla attention operations in transformers have a quadratic computational complexity, requiring a large memory footprint and computation cost, especially for feature maps with a large resolution. To mitigate this issue, we further explore another inductive bias, i.e., local window attention introduced in (Liu et al, 2021), in the RC and NC modules. Since the parallel convolution branch in the proposed two cells can encode position information and enable inter-window information exchange, special designs like the relative position encoding and window-shifting mechanism in (Liu et al, 2021) can be omitted. Consequently, our ViTAEv2 models outperform state-of-the-art methods for various vision tasks, including image classification, object detection, semantic segmentation, and pose estimation, while keeping a fast inference speed and reasonable memory footprint.

The contribution of this study is threefold. First, we explore two types of intrinsic IB in transformers, i.e., scale invariance and locality, and demonstrate the effectiveness of this idea by designing a new transformer architecture named ViTAE based on two new reduction and normal cells that incorporate the above two IBs. ViTAE outperforms representative vision transformers regarding classification accuracy, data efficiency, training efficiency, and generalization on downstream vision tasks. Second, we scale up our ViTAE model to 644M parameters and obtain 88.5% Top-1 classification accuracy on ImageNet without using extra private data, which is better than the state-of-the-art Swin Transformer, demonstrating that the introduced inductive bias still helps when the model size becomes large. Third, we extend the vanilla ViTAE to the multi-stage design, i.e., ViTAEv2. It learns multi-scale features at different stages efficiently while keeping a fast inference speed and reasonable memory footprint for large-size input images. Experiments on popular benchmarks demonstrate that it outperforms state-of-the-art methods for various downstream vision tasks, including image classification, object detection, semantic segmentation, and pose estimation.

The remainder of this paper is organized as follows. Section 2 describes the works related to our paper. We then detail the two basic cells, the vanilla ViTAE model, the scaling strategy for ViTAE, as well as the multi-stage design for ViTAEv2 in Section 3. Next, Section 4 presents the extensive experimental results and analysis. Finally, we conclude our paper in Section 5 and discuss the potential applications and future research directions.

Related Work

CNNs (Krizhevsky et al, 2012; Zeiler and Fergus, 2014; He et al, 2016) have explored several inductive biases with specially designed operations and have led to a series of breakthroughs in vision tasks, such as image classification, object detection, and semantic segmentation. For example, following the fact that local pixels are more likely to be correlated in images (LeCun et al, 1995), the convolution operations in CNNs extract features from the neighbor pixels within the receptive field determined by the kernel size (LeCun et al, 2015). By stacking convolution operations, CNNs have the inductive bias in modeling locality naturally.

In addition to the locality, another critical inductive bias in visual tasks is scale-invariance, where multi-scale features are needed to represent the objects at different scales effectively (Luo et al, 2016; Yu and Koltun, 2016). For example, to effectively learn features of large objects, a large receptive field is needed by either using large convolution kernels (Yu and Koltun, 2016; Yu et al, 2017) or a series of convolution layers in deeper architectures (He et al, 2016; Huang et al, 2017; Simonyan and Zisserman, 2015; Szegedy et al, 2015). However, such operations may ignore the features of small objects. To construct multi-scale feature representation for objects at different scales effectively, various image pyramid techniques (Chen et al, 2017; Adelson et al, 1984; Olkkonen and Pesola, 1996; Burt and Adelson, 1987; Lai et al, 2017; Demirel and Anbarjafari, 2010) have been explored, where features are extracted from a pyramid of images at different resolutions respectively (Lin et al, 2016; Chen et al, 2017; Ng and Henikoff, 2003; Rublee et al, 2011; Ke and Sukthankar, 2004; Bay et al, 2006), either in a hand-crafted manner or learned manner. Accordingly, features from the small-scale images mainly encode the large objects, while features from the large-scale images respond more to small objects. Then, features extracted from different resolutions are fused to form the scale-invariant feature, i.e., the inter-layer fusion. Another way to obtain the scale-invariant feature is to extract and aggregate multi-scale context by using multiple convolutions with different receptive fields in a parallel manner, i.e., the intra-layer fusion (Zhao et al, 2017; Szegedy et al, 2015, 2017, 2016). Either the inter-layer or intra-layer fusion empowers the CNNs with the scale-invariance inductive bias. It helps improve their performance in recognizing objects at different scales.

However, it is unclear whether these inductive biases can help the vision transformer to achieve better performance. This paper explores the possibility of introducing two types of inductive biases in the vision transformer, namely locality by introducing convolution in the vision transformer and scale-invariance by encoding multi-scale context into each visual token using multiple convolutions with different dilation rates, following the convention of intra-layer fusion.

2 Vision transformers with inductive bias

ViT (Dosovitskiy et al, 2020) is the pioneering work that applies a pure transformer to vision tasks and achieves promising results. It treats images as a 1D sequence, embeds them into several tokens, and then processes them by stacked transformer blocks to get the final prediction. However, since ViT simply treats images as 1D sequences and thus lacks inductive bias in modeling local visual structures, it indeed implicitly learns the IB from a large amount of data. Similar phenomena can also be observed in models with fewer inductive biases in their structures (Tolstikhin et al, 2021; Ali et al, 2021; He et al, 2021).

To alleviate the data-hungry issue, the following works explicitly introduce inductive bias into vision transformers, e.g., leveraging the IB from CNNs to facilitate the training of vision transformers with less training data or shorter training schedules. For example, DeiT (Touvron et al, 2021a) proposes to distill knowledge from pre-trained CNNs to transformers during training via an extra distillation token to imitate the behavior of CNNs. However, it requires an off-the-shelf CNN model as a teacher, introducing extra computation cost. Recently, some works try to introduce the intrinsic IB of CNNs into vision transformers explicitly (Han et al, 2021; Peng et al, 2021; Graham et al, 2021; Li et al, 2021; d’Ascoli et al, 2021; Yan et al, 2021; Wu et al, 2021; Yuan et al, 2021a; Chen et al, 2021b; Liu et al, 2021). For example, (Li et al, 2021; Graham et al, 2021; Wu et al, 2021; Dai et al, 2021) stack convolutions and attention layers sequentially, resulting in a serial structure and modeling the locality and global dependency accordingly. (Wang et al, 2021a) design sequential multi-stage structures while (Liu et al, 2021) apply attention within local windows. However, these serial structures may ignore the global context during locality modeling (and vice versa). (Wang et al, 2021b) establishes connection across different scales at the cost of heavy computation. To jointly model global and local context, Conformer (Peng et al, 2021) and MobileFormer (Chen et al, 2022) employ a model-parallel structure, consisting of parallel individual convolution and transformer branches and a complicated bridge connection between the two branches. Different from them, we follow the “divide-and-conquer” idea and propose to model locality and global dependencies simultaneously via a parallel structure within each transformer layer. 
In this way, the convolution and attention modules are designed to complement each other within the transformer block, which is more beneficial for the models to learn better features for both classification and dense prediction tasks.

3 Self-supervised learning and model scaling

As demonstrated in previous studies, scaled-up models are naturally few-shot learners and tend to obtain better performance in the language, image, and cross-modal domains alike (Kenton and Toutanova, 2019; Zhai et al, 2022; Radford et al, 2021). Recently, many efforts have been made to scale up vision models, e.g., BiT (Kolesnikov et al, 2020) and EfficientNet (Tan and Le, 2019) scale up CNN models to hundreds of millions of parameters by employing wider and deeper networks and obtain superior performance on many vision tasks. However, they need to train the scaled-up models with a much larger scale of private data, i.e., JFT-300M (Kolesnikov et al, 2020). Similar phenomena can be observed when training the scaled-up vision transformer models for better performance (Dosovitskiy et al, 2020; Zhai et al, 2022).

However, it is not easy to gather such large amounts of labeled data to train the scaled-up models. On the other hand, self-supervised learning can help train scaled-up models using data without labels. For example, CLIP (Radford et al, 2021) adopts paired text and image data captured from the Internet and exploits the consistency between text and images to train a big transformer model, which obtains good performance on image and text generation tasks. (Liu et al, 2019) adopt masked language modeling (MLM) as pretext tasks and generate supervisory signals from the input data. Specifically, they take as input masked sentences, in which several words are overwritten with a mask token, and predict the masked words, using the words from the sentence before masking as supervision. In this way, these models do not require additional labels for the training data and achieve superior performance on translation, sentiment analysis, etc. Inspired by the superior performance of MLM tasks in language, masked image modeling (MIM) tasks have been explored in vision tasks recently. For example, BEiT (Bao et al, 2021) tokenizes the images into visual tokens and randomly masks some tokens in a block-wise manner. The vision transformer model must predict the original tokens for those masked tokens. In this way, BEiT obtains superior classification and dense prediction performance using the publicly available ImageNet-22K dataset (Deng et al, 2009). MAE (He et al, 2022) removes the need for a tokenizer and simply treats the image pixels as the targets for reconstruction. Using only ImageNet-1K training data, MAE obtains impressive performance. It is under-explored whether the vision transformers with introduced inductive bias can be scaled up, e.g., in a self-supervised setting. Besides, whether inductive bias can still help these scaled-up models achieve better performance remains unclear. 
In this paper, we make an attempt to answer this question by scaling up the ViTAE model and training it in a self-supervised manner. Experimental results confirm the value of introducing inductive bias in scaled-up vision transformers.

4 Comparison to the conference version

A preliminary version of this work was presented in (Xu et al, 2021). This paper extends the previous study by introducing three major improvements.

We scale up the ViTAE model to different model sizes, including ViTAE-B, ViTAE-L, and ViTAE-H. With the help of inductive bias, the proposed ViTAE-H model with 644M parameters obtains the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 classification accuracy on ImageNet Real validation set, without using extra private data. It demonstrates that the introduced inductive bias still helps when the model size becomes large. We also show the excellent few-shot learning ability of the scaled-up ViTAE models.

We extend the vanilla ViTAE to the multi-stage design and devise ViTAEv2. The efficiency of the RC and NC modules is also improved by exploring another inductive bias from local window attention. ViTAEv2 outperforms state-of-the-art models for image classification tasks as well as downstream vision tasks, including object detection, semantic segmentation, and pose estimation.

We also present more ablation studies and experiment analysis regarding module design, inference speed, memory footprint, and comparisons with the latest works.

Methodology

We first give a brief review of the vision transformer in this part. To adapt transformers to vision tasks, ViT (Dosovitskiy et al, 2020) first splits an image $x\in R^{H\times W\times C}$ into several non-overlapping patches with the patch size $p$ , and embeds them into visual tokens (i.e., $x_{t}\in R^{N\times D}$ ) in a patch-to-token manner, where $H$ , $W$ , $C$ denote the height, width, and channel dimensions of the input image respectively, $N$ and $D$ denote the token number and token dimension, respectively, and $N=(H\times W)/p^{2}$ . Then, an extra learnable embedding with the same dimension $D$ , considered as a class token, is concatenated to the visual tokens before adding position embeddings to all the tokens in an element-wise manner. In the following part of this paper, we use $x_{t}$ to represent all tokens, and $N$ is the total number of tokens after concatenation for simplicity unless specified. These tokens are fed into several sequential transformer layers for the final prediction. Each transformer layer is composed of two parts, i.e., a multi-head self-attention module (MHSA) and a feed-forward network (FFN).
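As a small illustration of the patch-to-token step described above, the following sketch splits an image into non-overlapping patches and checks that the number of patches equals $(H\times W)/p^{2}$ . The function name `patchify` and the toy image size are assumptions made for this example only; in ViT each flattened patch is then linearly projected to dimension $D$ , which is omitted here.

```python
def patchify(image, p):
    """Split an H x W x C image (nested lists) into non-overlapping p x p patches.

    Returns N = (H * W) / p^2 flattened patches, each of length p * p * C.
    """
    height, width = len(image), len(image[0])
    if height % p or width % p:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            # Flatten one p x p x C patch in row-major order.
            patch = [c
                     for row in image[top:top + p]
                     for pixel in row[left:left + p]
                     for c in pixel]
            patches.append(patch)
    return patches


# Toy example (sizes are illustrative): an 8 x 8 RGB image with patch size 4.
toy_image = [[[0.0, 0.0, 0.0] for _ in range(8)] for _ in range(8)]
toy_patches = patchify(toy_image, 4)
num_tokens = len(toy_patches)        # (8 * 8) / 4^2 = 4 visual tokens
patch_length = len(toy_patches[0])   # 4 * 4 * 3 = 48 values per patch
```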

MHSA extends single-head self-attention (SHSA) by using different projection matrices for each head. In other words, MHSA is obtained by repeating SHSA $h$ times, where $h$ is the number of heads. Specifically, for SHSA, the input tokens $x_{t}$ are first projected to queries ( $Q$ ), keys ( $K$ ) and values ( $V$ ) using three different projection matrices, i.e., $Q,K,V=x_{t}W_{Q},x_{t}W_{K},x_{t}W_{V}$ , where $W_{Q/K/V}\in R^{D\times\frac{D}{h}}$ denotes the projection matrix for query/key/value, respectively. Then, the self-attention operation is calculated as:

$$Attention(Q,K,V)=softmax\left(\frac{QK^{T}}{\sqrt{D/h}}\right)V,$$

where the output of each head is of size $R^{N\times\frac{D}{h}}$ . Then the features of all the $h$ heads are concatenated along the channel dimension to form the MHSA module’s output.
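The computation can be sketched in plain Python as follows. This is a minimal, unoptimized illustration that assumes the standard scaled dot-product form (scores divided by the square root of the per-head dimension $D/h$ ); the helper names and the random toy weights are hypothetical and not part of any released implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(x, w_q, w_k, w_v):
    """SHSA: softmax(Q K^T / sqrt(D / h)) V for tokens x of shape N x D."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)  # shape N x (D / h)


def multi_head_attention(x, heads):
    """MHSA: run SHSA once per head and concatenate along the channel axis."""
    outputs = [single_head_attention(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def random_matrix(rows, cols, rng):
    """Toy weights for the example; real models learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, D, h = 4, 8, 2
x_t = random_matrix(N, D, rng)
heads = [tuple(random_matrix(D, D // h, rng) for _ in range(3)) for _ in range(h)]
mhsa_out = multi_head_attention(x_t, heads)  # shape N x D after concatenation
```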

FFN is placed on top of the MHSA module and applied to each token identically and separately. It consists of two linear transformations with an activation function in between. Besides, a layer normalization (Ba et al, 2016) and a shortcut are added before and aside from the MHSA and FFN, respectively.

2 The isotropic design of ViTAE

ViTAE aims to introduce the intrinsic IB in CNNs to vision transformers. As shown in Figure 2, ViTAE is composed of two types of cells, i.e., RCs and NCs. RCs are responsible for downsampling while embedding multi-scale context and local information into tokens, and NCs are used to further model the locality and long-range dependencies in the tokens. Taking an image $x\in R^{H\times W\times C}$ as input, three RCs are used to gradually downsample $x$ by a total ratio of 16 $\times$ , i.e., by 4 $\times$ , 2 $\times$ , and 2 $\times$ , respectively. Thereby, the output tokens of the RCs after downsampling are of size $[H/16,W/16,D]$ where $D$ is the token dimension (64 in our experiments). The output tokens of RCs are then flattened as $R^{(HW/256)\times D}$ , concatenated with the class token (red in the figure), and summed with the sinusoidal position encoding. Next, the tokens are fed into the following NCs, which keep the length of the token sequence unchanged. Finally, the prediction probability is obtained using a linear classification layer on the class token from the last NC.
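The shape bookkeeping above can be checked with a few lines of Python. The helper name `rc_output_shape` is hypothetical; the reduction ratios, the token dimension of 64, and the $224\times 224$ input size are taken from the text.

```python
def rc_output_shape(height, width, ratios=(4, 2, 2), token_dim=64):
    """Trace the spatial size of the features through the stacked reduction cells."""
    for r in ratios:
        height, width = height // r, width // r
    return height, width, token_dim


out_h, out_w, out_d = rc_output_shape(224, 224)   # (14, 14, 64) = [H/16, W/16, D]
num_visual_tokens = out_h * out_w                  # HW / 256 = 196
num_tokens_with_cls = num_visual_tokens + 1        # after concatenating the class token
```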

Instead of directly splitting and flattening images into visual tokens based on a linear image patch embedding layer, we devise the reduction cell to embed multi-scale context and local information into visual tokens, introducing the intrinsic scale-invariance and locality IBs from convolutions into ViTAE. Technically, RC has two parallel branches responsible for modeling locality and long-range dependency, followed by an FFN for feature transformation. We denote the input feature of the $i_{th}$ RC as $f_{i}\in R^{H_{i}\times W_{i}\times D_{i}}$ . The input of the first RC is the image $x$ . In the global dependency branch, $f_{i}$ is first fed into a Pyramid Reduction Module (PRM) to extract multi-scale context, i.e.,

$$f_{i}^{ms}\triangleq PRM_{i}(f_{i})=Cat([Conv_{ij}(f_{i};s_{ij},r_{i})\mid s_{ij}\in\mathcal{S}_{i},r_{i}\in\mathcal{R}]),$$

where $Conv_{ij}(\cdot)$ indicates the $j_{th}$ convolutional layer in the $i_{th}$ PRM (i.e., $PRM_{i}(\cdot)$ ). It uses a dilation rate $s_{ij}$ from the predefined dilation rate set $\mathcal{S}_{i}$ corresponding to the $i_{th}$ RC. Note that we use strided convolution to reduce the spatial dimension of features by a ratio $r_{i}$ from the predefined reduction ratio set $\mathcal{R}$ . The features after convolution are concatenated along the channel dimension, i.e., $f_{i}^{ms}\in R^{(H_{i}/r_{i})\times(W_{i}/r_{i})\times(|\mathcal{S}_{i}|D_{i})}$ , where $|\mathcal{S}_{i}|$ denotes the number of dilation rates in the set $\mathcal{S}_{i}$ . $f_{i}^{ms}$ is then processed by an MHSA module to model long-range dependencies, i.e.,

$$f_{i}^{g}=MHSA_{i}(Img2Seq(f_{i}^{ms})),$$

where $Img2Seq(\cdot)$ is a simple reshape operation to flatten the feature map to a 1D sequence. In this way, $f_{i}^{g}$ embeds the multi-scale context in each token. Note that the traditional MHSA individually attends each token at the same scale and thus lacks the ability to model the relationship between tokens at different scales. By contrast, the introduced multi-scale convolutions in reduction cells can (1) mitigate the information loss when merging tokens by looking at a larger field and (2) embed multi-scale information into tokens to aid the following MHSA to model the better global dependencies based on features at different scales.
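To make the effect of the dilation rates concrete, the sketch below computes the spatial extent covered by a dilated kernel and the shape of the concatenated multi-scale feature. The dilation set `(1, 2, 3, 4)` and the "same"-style padding are assumptions chosen for illustration, not the configuration reported in Table 1; the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def conv_output_size(size, kernel_size, stride, dilation):
    """Output length of a strided, dilated convolution.

    The padding d * (k - 1) // 2 is an assumption made here so that every
    dilation rate yields the same output size and the results can be concatenated.
    """
    pad = dilation * (kernel_size - 1) // 2
    return (size + 2 * pad - dilation * (kernel_size - 1) - 1) // stride + 1


def prm_output_shape(h, w, d, kernel_size, stride, dilations):
    """Shape of the concatenated feature f^ms: (H / r, W / r, |S| * D)."""
    out_h = {conv_output_size(h, kernel_size, stride, s) for s in dilations}
    out_w = {conv_output_size(w, kernel_size, stride, s) for s in dilations}
    if len(out_h) != 1 or len(out_w) != 1:
        raise ValueError("branches disagree on the output size")
    return out_h.pop(), out_w.pop(), len(dilations) * d


dilations = (1, 2, 3, 4)  # hypothetical dilation rate set
extents = [effective_kernel_size(7, s) for s in dilations]   # 7, 13, 19, 25
ms_shape = prm_output_shape(224, 224, 3, kernel_size=7, stride=4, dilations=dilations)
```

With a $7\times 7$ kernel, the four branches in this example cover windows of 7, 13, 19, and 25 pixels, so each token aggregates context at several scales before attention is applied.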

In addition, we use a Parallel Convolutional Module (PCM) to embed local context within the tokens, which are fused with $f_{i}^{g}$ as follows:

$$f_{i}^{lg}=f_{i}^{g}+PCM_{i}(f_{i}).$$

Here, $PCM_{i}(\cdot)$ represents the PCM of the $i_{th}$ RC, which is composed of an $Img2Seq(\cdot)$ operation and three stacked convolution layers with BN layers and activation layers in between. It is noteworthy that the parallel convolution branch has the same spatial downsampling ratio as the PRM by using strided convolutions. In this way, the token features can carry both local and multi-scale context, implying that RC acquires the locality IB and scale-invariance IB by design. The fused tokens are then processed by the FFN and reshaped back to feature maps, i.e.,

$$f_{i+1}=Seq2Img(FFN_{i}(f_{i}^{lg})+f_{i}^{lg}),$$

where the $Seq2Img(\cdot)$ is a simple reshape operation to reshape a token sequence back to feature maps. $FFN_{i}(\cdot)$ represents the FFN in the $i_{th}$ RC. In our ViTAE, three RCs are stacked sequentially to gradually reduce the input image’s spatial dimension by 4 $\times$ , 2 $\times$ , and 2 $\times$ , respectively. As the first RC handles images with high resolution, we adopt Performer (Choromanski et al, 2020) to reduce the computational burden and memory cost.
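The data flow of one reduction cell can be summarized by the following structural sketch. It assumes that the two branches are fused by element-wise addition and that the FFN has a residual shortcut; tokens are scalars and the four sub-modules are toy stand-ins, so the code only illustrates the wiring, not the actual layers.

```python
def img2seq(feature_map):
    """Flatten an H x W map (list of rows) into a 1D token sequence."""
    return [token for row in feature_map for token in row]


def seq2img(tokens, width):
    """Reshape a token sequence back into rows of the given width."""
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]


def reduction_cell(f, prm, mhsa, pcm, ffn, out_width):
    """Wire the parallel global and local branches of one RC."""
    f_ms = prm(f)                                       # multi-scale context
    f_g = mhsa(img2seq(f_ms))                           # global branch
    f_l = pcm(f)                                        # local branch, returns a sequence
    fused = [g + l for g, l in zip(f_g, f_l)]           # assumed additive fusion
    out = [y + x for y, x in zip(ffn(fused), fused)]    # FFN with assumed shortcut
    return seq2img(out, out_width)


# Toy stand-ins (placeholders only): identity PRM, global-mean "attention",
# a scaling "PCM", and a doubling "FFN".
toy_prm = lambda fm: fm
toy_mhsa = lambda seq: [sum(seq) / len(seq)] * len(seq)
toy_pcm = lambda fm: [0.5 * v for v in img2seq(fm)]
toy_ffn = lambda seq: [2.0 * v for v in seq]

rc_out = reduction_cell([[1.0, 2.0], [3.0, 4.0]], toy_prm, toy_mhsa, toy_pcm, toy_ffn, out_width=2)
```

Note that the local branch reads the cell input directly rather than the attention output, which is what distinguishes this parallel layout from a serial stack of convolution and attention.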

2.2 Normal cell

3 Scaling up ViTAE via self-supervised learning

Besides stacking the proposed RCs and NCs to construct the isotropic ViTAE models with 4M, 6M, 13M, and 24M parameters, we also scale up ViTAE to evaluate the benefit of introducing the inductive bias in vision transformers with large model sizes. Specifically, we follow the setting in ViT (Dosovitskiy et al, 2020) to scale up the proposed ViTAE model, i.e., we embed the image into visual tokens and process them using stacked NCs to extract features. The stacking strategy is exactly the same as the strategy adopted in ViT (Dosovitskiy et al, 2020), where we use 12 NCs with an embedding dimension of 768 to construct the ViTAE base model (i.e., ViTAE-B with 89M parameters), 24 NCs with an embedding dimension of 1,024 to construct the ViTAE large model (i.e., ViTAE-L with 311M parameters), and 36 NCs with an embedding dimension of 1,248 to construct the ViTAE huge model (i.e., ViTAE-H with 644M parameters). The normal cells are stacked sequentially. However, the scaled-up models overfit easily if they are trained only on the ImageNet-1K dataset under a fully supervised training setting. Self-supervised learning (He et al, 2022), in contrast, can alleviate this issue and facilitate the training of scaled-up models. In this paper, we adopt MAE (He et al, 2022) to train the scaled-up ViTAE model due to its simplicity and efficiency. Specifically, we first embed the input images into tokens and then randomly remove 75% of the tokens. The removed tokens are filled with randomly initialized mask tokens. After that, the remaining visual tokens are processed by the ViTAE model for feature extraction. The extracted features and the mask tokens are then concatenated and fed into the decoder network to predict the values of the pixels belonging to the masked regions. The mean squared error between the prediction and the masked pixels is minimized during training.
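The masking and loss described above can be sketched as follows. This is a simplified illustration of MAE-style random masking, assuming uniform sampling without replacement and a loss averaged over all values of the masked tokens; the function names and the seed argument are hypothetical.

```python
import random


def random_masking(num_tokens, mask_ratio=0.75, seed=None):
    """Randomly choose which token indices the encoder keeps and which are masked."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_keep = int(num_tokens * (1 - mask_ratio))
    keep = sorted(order[:num_keep])      # visible tokens fed to the encoder
    masked = sorted(order[num_keep:])    # positions replaced by mask tokens
    return keep, masked


def mse_on_masked(pred, target, masked):
    """Mean squared error computed only over the masked token positions."""
    errors = [(p - t) ** 2 for i in masked for p, t in zip(pred[i], target[i])]
    return sum(errors) / len(errors)


keep_ids, masked_ids = random_masking(196, mask_ratio=0.75, seed=0)
# With 196 tokens, 49 stay visible and 147 must be reconstructed.
toy_loss = mse_on_masked([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [3.0, 1.0]], masked=[1])
```

Because the encoder sees only a quarter of the tokens in this setting, most of the per-image encoder cost is avoided during pretraining.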

However, as the encoder only processes the visual tokens, i.e., the tokens remaining after removal, the built-in locality property of images is broken among the visual tokens. To adapt the proposed ViTAE model to the self-supervised task, we simply use convolutions with kernel size $1\times 1$ instead of $3\times 3$ to formulate the ViTAE model for pretraining. This simple modification helps us to preserve a similar architecture between the network’s pretraining and finetuning stages and helps the convolution branch to learn a meaningful initialization, as demonstrated in (Zhang et al, 2018). After the pretraining stage, we convert the kernels of the convolutions from $1\times 1$ to $3\times 3$ by zero-padding to recover the complete ViTAE models, which are further finetuned on the ImageNet-1k training data for 50 epochs. Inspired by (Bao et al, 2021), we use layer-wise learning rate decay during the finetuning to adapt the pre-trained models for specific vision tasks.
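The kernel conversion preserves the pretrained function at the start of finetuning: a $1\times 1$ kernel placed at the centre of an all-zero $3\times 3$ kernel produces the same output. The single-channel sketch below verifies this on a toy input; the helper names are hypothetical and the convolution is a naive reference implementation with zero padding.

```python
def pad_kernel_1x1_to_3x3(weight):
    """Embed a 1x1 kernel weight at the centre of an all-zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0],
            [0.0, weight, 0.0],
            [0.0, 0.0, 0.0]]


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (output size = input size)."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += kernel[di + radius][dj + radius] * image[y][x]
            out[i][j] = acc
    return out


toy_map = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
out_1x1 = conv2d_same(toy_map, [[0.5]])
out_3x3 = conv2d_same(toy_map, pad_kernel_1x1_to_3x3(0.5))
same_output = out_1x1 == out_3x3  # the padded kernel reproduces the 1x1 result
```

During finetuning the zero entries become trainable, so the convolution branch can then start to use neighboring tokens.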

4 The multi-stage design for ViTAE

Apart from classification, a general backbone should also adapt to other important downstream tasks, including object detection, semantic segmentation, and pose estimation. These downstream tasks usually need to extract multi-level features from the backbone to deal with objects at different scales. To this end, we extend the vanilla ViTAE model to the multi-stage design, i.e., ViTAEv2. A natural choice for the design of ViTAEv2 is to re-construct the model by re-organizing RCs and NCs. As shown in Figure 3, ViTAEv2 has four stages where four corresponding RCs are used to gradually downsample the features by $4\times$ , $2\times$ , $2\times$ , and $2\times$ , respectively. At each stage, $N_{i}$ normal cells are sequentially stacked following the $i_{th}$ RC. Note that a series of NCs are used only at the coarsest stage in the isotropic design. The number of normal cells, i.e., $N_{i}$ , controls the model depth and size. By doing so, ViTAEv2 can extract a feature pyramid from different stages which can be used by the decoders specifically designed for various downstream tasks.
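The resulting feature pyramid can be traced with a short sketch. The per-stage reduction ratios follow the text; the embedding sizes `(64, 128, 256, 512)` are placeholder values assumed for this example (the actual sizes are listed in Table 1), and the function name is hypothetical.

```python
def feature_pyramid_shapes(height, width, ratios=(4, 2, 2, 2), dims=(64, 128, 256, 512)):
    """Return (cumulative stride, H, W, D) for the output of each stage."""
    shapes = []
    stride = 1
    for ratio, dim in zip(ratios, dims):
        stride *= ratio
        shapes.append((stride, height // stride, width // stride, dim))
    return shapes


pyramid = feature_pyramid_shapes(224, 224)
# Strides 4, 8, 16, 32 give 56x56, 28x28, 14x14, and 7x7 feature maps.
```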

One remaining issue is that the vanilla attention operations in transformers have a quadratic computational complexity, therefore requiring a large memory footprint and computation cost, especially for feature maps with a large resolution. In contrast to the fast resolution reduction in the vanilla ViTAE design, we adopt a slow resolution reduction strategy in the multi-stage design, e.g., the resolution of the feature maps at the first stage is still $1/4$ of the original image size, thereby incurring more computational cost, especially when the images in downstream tasks have high resolutions. To mitigate this issue, we further explore another inductive bias, i.e., local window attention introduced in (Liu et al, 2021), in the RC and NC modules. Specifically, the window attention splits the whole feature map into several non-overlapping local windows and conducts the multi-head self-attention within each window, i.e., all query tokens within the same window share the same key and value sets. Since the parallel convolution branch in the proposed two cells can encode position information and enable inter-window information exchange, special designs like the relative position encoding and window-shifting mechanism in (Liu et al, 2021) can be omitted. We empirically find that replacing the full attention with local window attention at early stages can achieve a good trade-off between computational cost and performance. Therefore, we only use local window attention in the RC and NC modules at the first two stages. Consequently, our ViTAEv2 models can deliver superior performance for various vision tasks, including image classification, object detection, semantic segmentation, and pose estimation, while keeping a fast inference speed and reasonable memory footprint.
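A rough count of query-key pairs shows why window attention helps at high resolution. The sketch below compares full attention with non-overlapping window attention and builds the window partition explicitly; the window size of 7 is an assumption used for illustration, and the function names are hypothetical.

```python
def attention_pairs_full(h, w):
    """Query-key pairs for full attention over N = h * w tokens: N^2."""
    n = h * w
    return n * n


def attention_pairs_window(h, w, win):
    """Query-key pairs when each token attends only within its win x win window."""
    if h % win or w % win:
        raise ValueError("feature map size must be divisible by the window size")
    return h * w * win * win


def window_partition(h, w, win):
    """Group row-major token indices into non-overlapping win x win windows."""
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([y * w + x
                            for y in range(top, top + win)
                            for x in range(left, left + win)])
    return windows


full_pairs = attention_pairs_full(56, 56)            # 3136^2 = 9,834,496
window_pairs = attention_pairs_window(56, 56, 7)     # 3136 * 49 = 153,664
saving = full_pairs // window_pairs                  # 64x fewer pairs
toy_windows = window_partition(4, 4, 2)              # four windows of four tokens
```

For a fixed window size the cost grows linearly with the number of tokens, whereas tokens in different windows no longer interact through attention; in the design above that exchange is left to the parallel convolution branch.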

5 Model details

In this paper, we propose ViTAE and further extend it to the multi-stage version ViTAEv2 as described above. We devise several ViTAE and ViTAEv2 variants in our experiments for comparison with other models of similar sizes. Their details are summarized in Table 1. The ‘dilation’ column determines the dilation rate sets $\mathcal{S}$ in each RC. The two rows in the ‘RC’ and ‘NC’ columns denote the specific configurations of RCs and NCs, respectively, where ‘P’, ‘W’, and ‘F’ refer to Performer (Choromanski et al, 2020), local window attention, and the vanilla full attention, respectively, and the number in the second row denotes the number of heads in the corresponding attention module. The ‘arrangement’ column denotes the number of NCs at each stage, while the ‘embedding’ column denotes the token embedding size at each stage. Specifically, the default convolution kernel size in the first RC is $7\times 7$ with a stride of $4$ and dilation rates from $\mathcal{S}_{1}=$ . In the following two RCs (or three RCs for ViTAEv2), the convolution kernel size is $3\times 3$ with a stride of $2$ and dilation rates from $\mathcal{S}_{2}=$ and $\mathcal{S}_{3}=$ (and $\mathcal{S}_{4}=$ for ViTAEv2), respectively. Since the number of tokens decreases at later stages, there is no need to use large kernels and dilation rates there. PCM in both RCs and NCs comprises three convolutional layers with a kernel size of $3\times 3$.

Experiments

Unless explicitly stated, we train and test the proposed ViTAE and ViTAEv2 models on the ImageNet-1k (Krizhevsky et al, 2012) dataset, which contains about 1.3 million images from 1k classes. The image size during training is set to $224\times 224$. We use the AdamW (Loshchilov and Hutter, 2018) optimizer with the cosine learning rate scheduler and exactly the same data augmentation strategy as T2T (Yuan et al, 2021b) for a fair comparison regarding the training strategies and the size of models. We use a batch size of 512 for training ViTAE and 1024 for ViTAEv2. The learning rate is set in proportion to the batch size, with a base value of 5e-4 for a batch size of 512. The results of our models can be found in Table 2, where all the models are trained for 300 epochs. The models are built on PyTorch (Paszke et al, 2019) and TIMM (Wightman, 2019).
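
As a rough sketch of this schedule, the snippet below scales the base learning rate with the batch size and applies cosine decay over the 300 training epochs. It rests on assumptions not stated in the text: the scaling is taken to be linear relative to a batch size of 512, no warmup is modeled, and the final learning rate `min_lr` is assumed to be zero. The function names are hypothetical.

```python
import math


def scaled_lr(batch_size, base_lr=5e-4, base_batch=512):
    """Scale the base learning rate linearly with the batch size (assumed rule)."""
    return base_lr * batch_size / base_batch


def cosine_lr(epoch, total_epochs, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at epoch 0 to min_lr at total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_vitae = scaled_lr(512)     # batch size used for ViTAE
lr_vitaev2 = scaled_lr(1024)  # batch size used for ViTAEv2
lr_midway = cosine_lr(150, 300, lr_vitaev2)  # halfway through training
```

Under this reading, ViTAE trains with a peak learning rate of 5e-4 and ViTAEv2 with 1e-3, and the cosine schedule has decayed to half of the peak value at epoch 150.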

2 Comparison with the state-of-the-art

We compare our ViTAE and ViTAEv2 with both CNN models and vision transformers with similar model sizes in Table 2 and Table 3. Both Top-1/5 accuracy and real Top-1 accuracy (Beyer et al, 2020) on the ImageNet validation set are reported. We categorize the methods into CNN models, vision transformers with learned IB, and vision transformers with introduced intrinsic IB. Compared with CNN models, our ViTAE-T achieves a 75.3% Top-1 accuracy, which is better than ResNet-18, which has more parameters. The real Top-1 accuracy of the ViTAE model is 82.9%, which is comparable to ResNet-50, which has four times more parameters than ours. Similar phenomena can also be observed when comparing ViTAE-T with MobileNetV1 (Howard et al, 2017) and MobileNetV2 (Sandler et al, 2018), where ViTAE obtains better performance with fewer parameters. ViTAE-S achieves 82.0% Top-1 accuracy with half of the parameters of ResNet-101 and ResNet-152, showing the superiority of learning both local and long-range features from specific structures with corresponding intrinsic IBs by design. When adopting the multi-stage design, ViTAEv2-S further improves the Top-1 accuracy to 82.6%. When finetuning the model using images of a larger resolution, e.g., using $384\times 384$ images as input, ViTAE-S’s performance is further improved by 1.2% absolute Top-1 accuracy. ViTAEv2-48M in Table 3 also benefits from it, and the performance increases from 83.8% to 84.7%, which further shows the potential of vision transformers with intrinsic IBs for large resolution images that are common in downstream dense prediction tasks. When the model size increases to 88M, ViTAEv2-B reaches 84.6% Top-1 accuracy, significantly outperforming other transformer models including Swin-B (Liu et al, 2021), Focal-B (Yang et al, 2021), and CrossFormer-B (Wang et al, 2021b). When finetuning using images of larger resolution or pretraining the model with ImageNet-22k, ViTAEv2-B’s performance increases to 85.3% and 86.1% Top-1 accuracy, respectively, confirming that the introduced IBs scale to large models and large-scale training datasets.

3 Analysis of the isotropic design of ViTAE

To validate the effectiveness of introducing intrinsic IBs in improving data efficiency and training efficiency, we compare our ViTAE-T model with the baseline model T2T-ViT-7 at different training settings: (a) training them using 20%, 60%, and 100% of the ImageNet training set for the equivalent of 100 epochs on the full ImageNet training set, e.g., we train for 5 times as many epochs when using 20% of the data as when using 100% of the data; and (b) training them using the full ImageNet training set for 100, 200, and 300 epochs, respectively. The results are shown in Figure 1. As can be seen, ViTAE-T consistently outperforms the T2T-ViT-7 baseline by a large margin in terms of both data efficiency and training efficiency. For example, ViTAE-T using only 20% of the training data achieves comparable performance with T2T-ViT-7 using all data. When 60% of the training data are used, ViTAE-T significantly outperforms T2T-ViT-7 using all data by about an absolute 3% accuracy. It is also noteworthy that ViTAE-T trained for only 100 epochs outperforms T2T-ViT-7 trained for 300 epochs. After training ViTAE-T for 300 epochs, its performance is significantly boosted to 75.3% Top-1 accuracy. With the proposed RCs and NCs, the transformer layers in our ViTAE only need to focus on modeling long-range dependencies, leaving the locality and multi-scale context modeling to their convolution counterparts, i.e., PCM and PRM. Such a “divide-and-conquer” strategy facilitates ViTAE’s training, making learning more efficient with less training data and fewer training epochs.

3.2 Generalization on downstream classification tasks

We further investigate the generalization of the proposed ViTAE models pre-trained on ImageNet-1k for downstream image classification tasks by finetuning them further on the training sets of several fine-grained classification tasks, including Flowers (Nilsback and Zisserman, 2008), Cars (Krause et al, 2013), Pets (Parkhi et al, 2012), and iNaturalist19. We also finetune the proposed ViTAE models pre-trained on ImageNet-1k further on Cifar10 (Krizhevsky et al, 2009) and Cifar100 (Krizhevsky et al, 2009). The results are shown in Table 4. It can be seen that ViTAE achieves SOTA performance on most of the datasets using comparable or fewer parameters. These results demonstrate the good generalization ability of our ViTAE models.

3.3 Ablation study of the design of RC and NC

We use T2T-ViT (Yuan et al, 2021b) as our baseline model in the following ablation study of our ViTAE. As shown in Table 5, we investigate the hyper-parameter settings of RC and NC in the ViTAE-T model by isolating them separately. All the models are trained for 100 epochs on ImageNet-1k, following the same training setting and data augmentation strategy as described in Section 4.1.

We use $\checkmark$ and $\times$ to denote whether or not the corresponding module is enabled during the experiments. If all columns under the RC and NC are marked $\times$ as shown in the first row, the model becomes the standard T2T-ViT-7 model. “Pre” denotes the early fusion strategy that fuses output features of PCM and MHSA before FFN, while “Post” denotes the alternative late fusion strategy. The $\checkmark$ in “BN” denotes that PCM uses BN. “ $\times 3$ ” in the first column denotes that the dilation rate set is the same in the three RCs. “ $\downarrow$ ” denotes using smaller dilation rates in deeper RCs, i.e., $\mathcal{S}_{1}=$ , $\mathcal{S}_{2}=$ , $\mathcal{S}_{3}=$ .

As can be seen, using an early fusion strategy and BN in NC achieves the best Top-1 accuracy of 69.9% among all NC settings. It is noteworthy that all the variants of NC outperform the vanilla T2T-ViT, implying the effectiveness of PCM, which introduces the intrinsic locality IB in transformers. It can also be observed that BN plays an important role in improving the model’s performance as it can help to alleviate the scale deviation between convolution’s and attention’s features. For RC, we first investigate the influence of using different dilation rates in the PRM, as shown in the first column. As can be seen, using larger dilation rates (e.g., 4 or 5) does not deliver better performance. We suspect that larger dilation rates may lead to plain features in the deeper RCs due to the smaller resolution of feature maps. To validate the hypothesis, we use smaller dilation rates in deeper RCs as denoted by $\downarrow$. As can be seen, it achieves performance comparable to the $\times 3$ setting. However, compared with $\downarrow$, the $\times 3$ setting increases the number of parameters from 4.35M to 4.6M. Therefore, we select $\downarrow$ as the default setting. In addition, using PCM in the RC introduces the intrinsic locality IB and the performance increases to 71.7% Top-1 accuracy. Finally, the combination of RCs and NCs achieves the best accuracy at 72.6%, demonstrating their complementarity.

3.4 Visual inspection of ViTAE

To further analyze the properties of our ViTAE, we apply Grad-CAM (Selvaraju et al, 2017) on the MHSA’s output in the last NC to qualitatively inspect ViTAE. The visualization results are provided in Figure 4. Compared with the baseline T2T-ViT, our ViTAE covers the single or multiple targets in the images more precisely and attends less to the background. Moreover, ViTAE can better handle scale variation, as shown in Figure 4(b). That is, it covers birds accurately whether they are small, medium, or large in size. Such observations demonstrate that introducing the intrinsic IBs of locality and scale-invariance from convolutions to transformers helps ViTAE learn more discriminative features than pure transformers.

Besides, we calculate the average attention distance of each layer in ViTAE-T and the baseline T2T-ViT-7 on the ImageNet validation set, respectively. The results are shown in Figure 5. It can be observed that with the usage of PCM, which focuses on modeling locality, the transformer layers in the proposed NCs can better focus on modeling long-range dependencies, especially in shallow layers. In the deep layers, where modeling long-range dependencies is much more important, the average attention distances of ViTAE-T and T2T-ViT-7 are almost the same. It implies that the PCM does not affect the transformer’s behavior in deep layers. These results confirm the effectiveness of the adopted “divide-and-conquer” idea in ViTAE, i.e., introducing the intrinsic locality IB from convolutions into vision transformers allows the transformer layers to be responsible only for long-range dependencies, since the convolutions in PCM can model locality well.
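
The exact formula for the average attention distance is not spelled out here. One common definition, assumed in the sketch below, weights the Euclidean distance between each query and key position on the token grid by the corresponding attention weight and averages over queries. The function name and the toy attention maps are hypothetical.

```python
import math


def mean_attention_distance(attn, grid_h, grid_w):
    """Attention-weighted mean distance; attn[q][k] rows are assumed to sum to 1."""
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    total = 0.0
    for q, (qr, qc) in enumerate(coords):
        for k, (kr, kc) in enumerate(coords):
            # Distance is measured in token units on the 2D grid.
            total += attn[q][k] * math.hypot(qr - kr, qc - kc)
    return total / len(coords)


n = 4  # tokens of a toy 2 x 2 grid
local = [[1.0 if q == k else 0.0 for k in range(n)] for q in range(n)]
uniform = [[1.0 / n] * n for _ in range(n)]
d_local = mean_attention_distance(local, 2, 2)
d_uniform = mean_attention_distance(uniform, 2, 2)
```

A head that attends only to the query's own position has distance zero, while uniform attention over the grid yields a larger value. Under this definition, a larger average distance in shallow layers indicates that those layers spend more of their attention on distant tokens.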

4 Analysis of the scaled up ViTAE models

We evaluate the performance of the scaled-up models on the ImageNet dataset. The scaled-up models are pre-trained for 1600 epochs using MAE (He et al, 2022), taking images from the ImageNet-1K training set. Then, the models are finetuned for 100 (the base model) or 50 (the large and huge models) epochs using the labeled data from the ImageNet-1K training set. It should be noted that the original MAE is trained on TPU machines with TensorFlow, while our implementation adopts PyTorch as the framework and uses NVIDIA GPUs for training. This implementation difference may cause a slight difference in the models’ classification accuracy. We compare our methods with T2T-ViT (Yuan et al, 2021b), CvT (Yuan et al, 2021a), Swin (Liu et al, 2021), SwinV2 (Liu et al, 2022), and ViT (Dosovitskiy et al, 2020) with either supervised learning or self-supervised learning like MAE (He et al, 2022), MaskFeat (Wei et al, 2022), and SimMIM (Xie et al, 2022). The results are summarized in Table 6. They demonstrate that the proposed ViTAE-B model with the introduced inductive bias outperforms the baseline ViT-B model by 0.4 in Top-1 classification accuracy. For the ViTAE-L model with 300M parameters, the inductive bias still brings a performance gain of about 0.3. After using ImageNet-22K labeled data for finetuning, the classification accuracy on the ImageNet-1K validation set further increases by about 1%. These results show that the benefit of introducing inductive bias in vision transformers is scalable to large models and datasets. Notably, our ViTAE-H, trained with only the ImageNet-1K dataset, obtains a classification accuracy of 91.2 on the ImageNet Real dataset (Beyer et al, 2020), which is the highest accuracy we are aware of. It outperforms other methods trained with additional private data, such as EfficientNet (Pham et al, 2021) and ViT-G (Zhai et al, 2022), where the former obtains 91.1 accuracy using the JFT300M dataset and the latter obtains 90.8 accuracy using the JFT3B dataset.

4.2 Few-shot learning performance

We further evaluate the data efficiency of scaled-up models by using different percentages of data to finetune the pre-trained models. We use 1%, 10%, and 100% of the data from the ImageNet-1k training set to finetune the self-supervised pre-trained ViTAE models with different numbers of parameters. We ensure that each model sees the same number of training images under different data settings, i.e., we train the ViTAE-B model for 10,000 epochs using 1% training data, for 1,000 epochs using 10% training data, and for 100 epochs using 100% training data. Similarly, the ViTAE-L and ViTAE-H models are trained for 5,000 epochs using 1% training data and 500 epochs using 10% training data. The smaller models, i.e., those with fewer than 20M parameters, are trained from scratch using 100% of the ImageNet-1k training data for 300 epochs. As shown in Figure 6, the models with more parameters are more data-efficient than those with fewer parameters. For example, the ViTAE-H model with 644M parameters trained with 10% data outperforms the small model with 13.2M parameters trained with 100% data, i.e., 82.4 vs. 81.0. Such observations mirror the findings in previous studies, including both image classification and language modeling.
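
The epoch counts follow from keeping the number of seen images fixed: the full-data schedule is divided by the data fraction. The sketch below reproduces this arithmetic with exact fractions; the helper name `equivalent_epochs` is hypothetical, and the full-data schedules of 100 and 50 epochs are taken from the finetuning settings of the base and large/huge models described earlier.

```python
from fractions import Fraction


def equivalent_epochs(full_data_epochs, data_fraction):
    """Epochs needed on a data subset to see as many images as on the full set."""
    epochs = Fraction(full_data_epochs) / Fraction(data_fraction)
    assert epochs.denominator == 1  # the settings here divide evenly
    return int(epochs)


fractions_used = ("1/100", "1/10", "1")  # 1%, 10%, and 100% of the training set
base_schedule = {f: equivalent_epochs(100, f) for f in fractions_used}
large_schedule = {f: equivalent_epochs(50, f) for f in fractions_used}
```

This gives 10,000, 1,000, and 100 epochs for ViTAE-B, and 5,000, 500, and 50 epochs for ViTAE-L and ViTAE-H.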

4.3 Ablation study of the convolutional kernel size in the scaled up ViTAE models

We conduct experiments to investigate the influence of the convolutional kernel size in the scaled-up ViTAE models during pretraining. The results are presented in Table 7. The kernel size 0 represents that we do not use the convolution branch in ViTAE-L during pretraining, which degenerates to the original ViT-L model. Then, we add the convolution branches during finetuning and initialize the convolutional kernel weight as 0. If we use 1 $\times$ 1 kernels in ViTAE-L during pretraining, we pad them to 3 $\times$ 3 with zero padding during finetuning. We pretrain the models for 400 epochs and further finetune them for 50 epochs on the ImageNet-1K training set. As can be seen, using no convolution branch during pretraining leads to no improvement over the baseline ViT-L since those convolutional kernel weights in ViTAE-L during finetuning are zero-initialized. Directly pretraining and finetuning ViTAE-L with 3 $\times$ 3 convolutional kernels in the convolution branches leads to slightly better performance over the baseline ViT-L, but it is inferior to the proposed setting, i.e., using 1 $\times$ 1 convolutional kernels in the convolution branches during pretraining ViTAE-L while zero-padding them to 3 $\times$ 3 during finetuning. We argue the reason is that most tokens (75%) during pretraining are randomly removed, and the remaining ones have lost spatial information. Therefore, using 3 $\times$ 3 kernels may lead to overfitting while 1 $\times$ 1 convolutions pay little attention to spatial structures and could learn better feature representation, which is in line with the observations in (Zhang et al, 2018).
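
The zero-padding step can be checked on a toy example. The sketch below assumes that padding a $1\times 1$ kernel to $3\times 3$ means placing the pretrained weight at the kernel center and zeros elsewhere, so that the padded convolution initially computes the same function as the $1\times 1$ one. It uses a single channel and pure Python for clarity; the helper names are hypothetical and this is not the released implementation.

```python
def pad_kernel_to_3x3(weight_1x1):
    """Embed a scalar 1x1 kernel weight at the center of a zero 3x3 kernel."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[1][1] = weight_1x1
    return kernel


def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with one pixel of zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:  # outside pixels count as zero
                        out[r][c] += kernel[dr + 1][dc + 1] * image[rr][cc]
    return out


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
weight = 0.5  # stands in for a pretrained 1x1 kernel weight
out_1x1 = [[weight * v for v in row] for row in image]
out_3x3 = conv3x3_same(image, pad_kernel_to_3x3(weight))
```

The two outputs coincide, so finetuning starts from exactly the pretrained function, and the eight surrounding weights are free to learn spatial structure once all tokens are visible.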

5 Analysis of the multi-stage design ViTAEv2

In this paper, we extend ViTAE to a multi-stage design and propose ViTAEv2 accordingly. To achieve a good trade-off between classification performance and computational cost, we study the design choice of the attention type at each stage. The results are summarized in Table 8, where ‘P’, ‘W’, ‘F’ refer to the Performer attention, local window attention, and vanilla attention, respectively. They only differ in the implementation of attention calculation while having the same number of parameters. We list the Top-1 classification accuracy of different model variants trained from scratch using 224 $\times$ 224 images from the ImageNet-1k training set. We gradually increase the image resolution to compare the memory footprint and training speed of different models considering that the backbone models should well adapt to downstream vision tasks where large resolution images are common. Specifically, we set the batch size to 128 for all models for the 224 $\times$ 224 resolution and reduce it for larger resolutions to fit the A100 GPU memory.

We start from a baseline multi-stage design where the Performer attention is used at the first stage while the vanilla full attention is used at the following three stages, denoted as ‘P,F,F,F’. Then, we gradually introduce inductive bias into the model by replacing the Performer and full attention with local window attention. As can be seen, all the models with introduced inductive bias outperform or are at least comparable to the baseline in terms of both classification performance and training cost. Specifically, using local window attention at the first two stages (i.e., ‘W,W,F,F’) leads to the best trade-off between classification performance and computational cost for different image resolutions. Compared with the model with the best performance (i.e., ‘W,F,F,F’), its classification accuracy only drops by 0.1 while the memory footprint is significantly reduced by 56.4% at the $896\times 896$ image resolution. Moreover, it outperforms the other two designs (i.e., ‘W,W,W,F’ and ‘W,W,W,W’) by an absolute 0.4% Top-1 accuracy while having about the same training speed. Therefore, we choose the ‘W,W,F,F’ design and devise the ViTAEv2 models at different model sizes accordingly.

A natural follow-up question is whether we still need the window-shifting mechanism, which enables inter-window information exchange, and the relative position encoding (RPE) from the original implementation of local window attention proposed in (Liu et al, 2021), given that the convolutional layers in PRM and PCM can already enable inter-window information exchange and encode position information. To answer this question, we carry out an ablation study by isolating them one by one in our ViTAEv2 model. We choose ViTAEv2-S as the base model, and all model variants are trained using $224\times 224$ images. The results are summarized in Table 9. As can be seen, the above two components contribute only marginally in our model, i.e., about 0.1% accuracy. Therefore, we do not include them in our default design to keep the model simple and easy to implement.

We also compare ViTAEv2 and some representative transformer models in terms of inference speed in Table 10. All the experiments are conducted on the same A100 GPU, and TensorRT is adopted to accelerate all models. As can be seen, our ViTAEv2-S model outperforms ViT-Small by 2.7% Top-1 accuracy while keeping a fast inference speed, especially for large size images, e.g., 896 $\times$ 896. Compared with the state-of-the-art Swin transformer, the inference speed of ViTAEv2-S is slightly slower, i.e., about 10% $\sim$ 20%, but its classification performance is significantly improved by an absolute 1.3% Top-1 accuracy. ViTAEv2-S also outperforms T2T-ViT-24 in terms of both performance and inference speed.

We further evaluate the proposed ViTAEv2 models on representative downstream vision tasks, including object detection, semantic segmentation, and pose estimation. The results are detailed below.

5.2 The performance on object detection and instance segmentation

Settings To evaluate ViTAEv2’s performance on object detection and instance segmentation tasks, we adopt Mask RCNN (He et al, 2017) and Cascade RCNN (Cai and Vasconcelos, 2018) as the detection framework and finetune the models on COCO 2017 dataset, which contains 118K training images, 5K validation images, and 20K test-dev images. We adopt exactly the same training setting used in Swin (Liu et al, 2021), i.e., multi-scale training, AdamW optimizer (Loshchilov and Hutter, 2017) and the mmdetection code base. The models are trained for 12 (the 1x setting) and 36 epochs (the 3x setting), respectively. We compare the performance of ViTAEv2-S and other backbones, including the classic CNNs, i.e., ResNet (He et al, 2016), and current transformer models.

Results The results are summarized in Table 11, where ViTAEv2-S achieves the best performance with the fewest parameters. Thanks to the introduced inductive biases like locality and scale invariance, the proposed ViTAEv2 model obtains 2.6 $AP^{b}$ and 2.0 $AP^{m}$ performance gains over Swin when using Mask RCNN as the decoder for the 1 $\times$ setting. It also significantly outperforms other backbones like Conformer and CrossFormer, owing to our model’s efficient divide-and-conquer structure design. When we extend the training schedule to the 3 $\times$ setting (36 epochs in total), ViTAEv2 reaches 50.6 $AP^{b}$ and 42.6 $AP^{m}$, significantly better than the other models. It is noteworthy that ViTAEv2 trained for 12 epochs outperforms Swin-T trained for 36 epochs, validating the data efficiency our model gains from the introduced inductive bias. The superiority of ViTAEv2 is retained when using Cascade RCNN as the decoder, obtaining 50.6 $AP^{b}$ and 43.6 $AP^{m}$ when trained for 12 epochs and 51.4 $AP^{b}$ and 44.5 $AP^{m}$ when trained for 36 epochs. It can be concluded that introducing inductive bias into transformers helps our model better utilize the data and deliver the best performance for both object detection and instance segmentation.

5.3 The performance on semantic segmentation

Settings We evaluate the ViTAEv2’s performance on the semantic segmentation task on the ADE20K (Zhou et al, 2017, 2019) dataset. The ADE20K dataset covers 150 semantic categories with 20K images for training and 2K for validation. We adopt UperNet (Xiao et al, 2018b) as the segmentation framework and train the UperNet with ViTAEv2-S as backbone with default setting used in mmsegmentation (Contributors, 2020), i.e., using the AdamW (Loshchilov and Hutter, 2017) optimizer and fixed image size $512\times 512$ . The models are trained for 160K iterations with a polynomial learning rate decay scheduler.

Results The results can be found in Table 12. With 10M fewer parameters, the segmentation model with ViTAEv2-S as the backbone obtains 45.0 mIoU and significantly outperforms the counterparts using either ResNet or Swin transformer. Besides, when tested with multi-scale input, the segmentation model with ViTAEv2-S as the backbone obtains much better performance than the others, i.e., 48.0 mIoU. It implies that the ViTAE model can better extract multi-scale features owing to the introduced scale-invariance inductive bias. Therefore, it benefits more from the multi-scale input.

5.4 The performance on pose estimation

Settings We evaluate the models’ performance on the animal pose estimation task on the AP-10K (Yu et al, 2021) dataset. The AP-10K dataset contains 50 different animal species with animal keypoint annotations. Compared with human pose estimation tasks, animal pose estimation is more challenging due to the diverse species, less labeled data for each species, and significant appearance variance. Therefore, this task is well suited to evaluating a model’s generalization ability. Following the setting in AP-10K, we adopt SimpleBaseline (Xiao et al, 2018a) as the pose estimation framework and train the models with various backbones for 210 epochs using the Adam optimizer and images of size $256\times 256$. A step-wise learning rate decay scheduler is employed, and the learning rate is reduced by a factor of 10 after 170 and 200 epochs.
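
A minimal sketch of this step-wise schedule is given below. The base learning rate is not specified in this section, so it is left as a parameter. Whether each drop takes effect exactly at epochs 170 and 200 under zero-based epoch counting is also an assumption, and the function name is hypothetical.

```python
def step_lr(epoch, base_lr, milestones=(170, 200), factor=0.1):
    """Learning rate at a zero-based epoch under step-wise decay."""
    # Count how many milestones have been reached so far.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Relative learning rate (base_lr = 1.0) over the 210-epoch schedule.
relative_lr = [step_lr(e, 1.0) for e in range(210)]
```

With this rule, the model trains at the base rate for the first 170 epochs, at one tenth of it for the next 30 epochs, and at one hundredth for the final 10 epochs.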

Results The results are summarized in Table 13. As can be seen, the proposed ViTAEv2-S model has fewer parameters yet brings an absolute 3% AP performance gain over the ResNet-50 backbone. Besides, it also outperforms the Swin-T (Liu et al, 2021) backbone, especially on the stricter evaluation metric, i.e., $AP_{75}$. These results further demonstrate the superiority of the proposed ViTAEv2 model, which can better handle tasks with limited data but rich categories, owing to the introduced inductive bias. Recently, it has been shown that the isotropic ViTAE can also achieve superior performance in human pose estimation (Xu et al, 2022). More research efforts could be made to establish a foundation model for various pose estimation tasks based on ViTAE.

6 Robustness

ViTAE employs a parallel PCM module and attention module to jointly extract features from both local and global perspectives. Since the two modules extract features in a complementary manner, it is interesting to explore whether such a design makes the backbone network robust to adversarial attacks (Bhojanapalli et al, 2021). As demonstrated in a recent study (Tang et al, 2021), which benchmarks the robustness of a series of vision transformer models, CNN models, and MLP models, the ViTAE model obtains better robustness under the $l_{\infty}$ attack compared with ViT (Dosovitskiy et al, 2020), MLPMixer (Tolstikhin et al, 2021), and ResNet (He et al, 2016). A theoretical foundation for vision transformers and their variants is still needed to explain why introducing the inductive bias into vision transformers can help improve robustness.

Discussions

This paper explores different types of IBs and incorporates them into transformers through the proposed reduction and normal cells. With the collaboration of these two cells, our ViTAE model achieves impressive performance on ImageNet with fast convergence and high data efficiency. According to the attention distance analysis shown in Figure 5, the ensemble nature enables the transformer and convolution layers to focus on what they are good at, i.e., modeling long-range dependencies and locality, respectively. As illustrated in Figure 2, our ViTAE model can be viewed as an intra-cell ensemble of complementary transformer layers and convolution layers owing to the skip connection and parallel structure. Moreover, the inductive bias also benefits the transformer models at larger model sizes, i.e., ViTAE-H, or on larger datasets, i.e., ImageNet-22K. Besides, we explore the multi-stage design of ViTAE models and propose ViTAEv2 accordingly, which obtains SOTA performance on image classification and downstream vision tasks, including object detection, semantic segmentation, and pose estimation. More kinds of intrinsic or learnable IBs (Sabour et al, 2017; Zhang et al, 2022), such as constituting viewpoint invariance, can be explored in future studies. On the other hand, although the proposed parallel structure obtains comparable inference speed with better performance, it may also slow down the training depending on the deep learning framework, e.g., dynamic computation graph frameworks like PyTorch need to compute the parallel branches sequentially. Alternatively, static computation graph frameworks like TensorFlow can be adopted to mitigate this issue.

Conclusion

In this paper, we incorporate two types of intrinsic inductive bias (IB), i.e., locality and scale-invariance, into vision transformers via reduction and normal cells. By stacking the two cells in both isotropic and multi-stage manners, the proposed ViTAE and ViTAEv2 models obtain superior performance and data efficiency. Specifically, extensive experiments show that the multi-stage ViTAEv2 outperforms representative vision transformers in various respects, including classification accuracy, data efficiency, and generalization ability on downstream tasks. When scaling to large-scale models, the inductive bias still helps improve vision transformers’ performance. In future work, we can explore other kinds of IBs to improve their performance further. We hope that this study will provide valuable insights for subsequent studies on introducing intrinsic IB into vision transformers and on understanding the impact of intrinsic and learned IBs.