LocalViT: Analyzing Locality in Vision Transformers

Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, Michele Magno, Luca Benini, Luc Van Gool

Introduction

Convolutional neural networks (CNNs) now define the state of the art for computer vision tasks such as image classification, object detection, segmentation, and low-level vision. CNNs are based on locality in that convolutional filters only perceive a local region of the input image, i.e. the receptive field. By stacking multiple layers, the effective receptive field of a deep neural network can be enlarged progressively. This design enables the network to learn a hierarchy of deep features, which is essential for the success of CNNs. Meanwhile, the local, repetitive connections save many parameters compared with fully connected layers. Yet, one problem is that a larger receptive field can only be achieved by stacking layers, despite alternative attempts at enlarging the receptive field.

A parallel, thriving research strand incorporates global connectivity into the network via self-attention. This family of networks, i.e. transformer networks, originates from machine translation and is very good at modelling long-range dependencies in sequences. There is also rising interest in applying transformers to vision. Vision transformers have already achieved performance competitive with their CNN counterparts.

To process 2D images with transformers, the input image is first converted to a sequence of tokens which correspond to patches in the image. Then the attention module attends to all tokens and a weighted sum is computed as the tokens for the next layer. In this way, the effective receptive field is expanded to the whole image via a single self-attention layer. Yet, the problem of vision transformers is that global connectivity runs counter to the locality on which convolution is built.

Considering the merits of CNNs vs. transformers, a natural question is whether we can efficiently combine the locality of CNNs and the global connectivity of vision transformers to improve performance while not increasing model complexity.

We try to fill the gap between CNNs and vision transformers. Specifically, we introduce a locality mechanism to the feed-forward network of transformers, which is inspired by examining the feed-forward network and inverted residuals. The feed-forward network of transformers consists of two fully connected layers and the hidden dimension between them is expanded (usually by a factor of 4) to extract richer features. Similarly, in inverted residual blocks, the hidden channel between the two $1\times 1$ convolutions is also expanded. The major difference between them is the efficient depth-wise convolution in the inverted residual block. Such depth-wise convolution can provide precisely the mechanism for local information aggregation which is missing in the feed-forward network of vision transformers. In addition, depth-wise convolution is efficient in both parameters and computational complexity.

To cope with the 2D depth-wise convolution, the image tokens of the sequence from the self-attention module must be rearranged to a 2D feature map, which is processed by the feed-forward network. The class token is split out and bypasses the feed-forward network. The derived new feature map is converted back to image tokens and concatenated with the bypassed class token. The concatenated sequence is processed by the next transformer layer.

The effectiveness of the introduced locality mechanism is validated in two ways. Firstly, its properties are investigated experimentally. We draw four basic conclusions. i. Depth-wise convolution alone can already improve the performance of the baseline transformer. ii. A better activation function after depth-wise convolution can result in a significant performance gain. iii. The locality mechanism is more important for lower layers. iv. Expanding the hidden dimension of the feed-forward network leads to a larger model capacity and a higher classification accuracy. Secondly, as shown in Fig. 1, the locality mechanism is successfully applied to 4 vision transformers, which underlines its generality. The contributions of this paper are three-fold:

We bring a locality mechanism to vision transformers by introducing depth-wise convolutions. The new transformer architecture combines a self-attention mechanism for global relationship modelling and a locality mechanism for local information aggregation.

We analyze the basic properties of the introduced locality mechanism. The influence of each component (depth-wise convolution, non-linear activation function, layer placement, and hidden dimension expansion ratio) is singled out.

We apply these ideas to vision transformers including DeiT, T2T-ViT, PVT, and TNT. Experiments show that the simple technique proposed in this paper generalizes well to various transformer architectures.

Related Work

Transformers were first introduced for machine translation. The proposed attention mechanism aggregates information from the whole input sequence. Thus, transformers are especially good at modelling long-range dependencies between elements of a sequence. Since then, there have been several attempts to adapt transformers to vision tasks including object detection, image classification, segmentation, multiple object tracking, human pose estimation, point cloud processing, video processing, image super-resolution, image synthesis, etc. An extensive review is out of the scope of this paper. We focus on the most relevant works.

Carion et al. first proposed a detection transformer (DETR) for end-to-end object detection. This method regards object detection as a set prediction problem and removes the hand-crafted designs for object detection. DETR reasons about the relationship between the learned object queries and the global image context. Following this work, image classification was targeted. Dosovitskiy et al. showed that a pure transformer can be directly applied to images and performs quite well compared with CNNs on image classification. Yet, this network relies heavily on large-scale models and datasets. Thus, Touvron et al. showed that it is possible to train vision transformers in a data-efficient way. The authors introduced an additional distillation token to the network and proposed hard-label distillation for vision transformers. Such transformers are identical to those for machine translation. Recent works propose to adapt transformers to images. Yuan et al. proposed a progressive tokenization method that can model the local information of nearby tokens and reduce the number of tokens. Wang et al. proposed a pyramid architecture for vision transformers. Han et al. introduced an additional transformer block for the image token embeddings.

2.2 Locality vs. global connectivity

Both local information and global connectivity help to reason about the relationships between image contents. Thus, they are both important for visual perception. The convolution operation applies a sliding window to the input and local information is inherently aggregated to compute new representations. Thus, locality is an intrinsic property of CNNs. Although CNNs can extract information from a larger receptive field by stacking layers and forming deep networks, they still lack global connectivity. To overcome this problem, some researchers add global connectivity to CNNs with non-local blocks.

By contrast, transformers are especially good at modelling long-range dependencies within a sequence owing to their attention mechanism. But, in return, a locality mechanism remains to be added for visual perception. Some works have already contributed towards this goal. Yet, they mainly focus on improving the tokenization and self-attention parts. Other work introduces hybrid architectures of CNNs and transformers. In summary, little attention has been paid to the feed-forward network of vision transformers.

2.3 Depth-wise convolution and inverted residuals

Compared with normal convolution, the computations of depth-wise convolution are only conducted channel-wise. That is, to obtain a channel of the output feature map, the convolution is only conducted on one input feature map. Thus, depth-wise convolution is efficient both in terms of parameters and computation. Howard et al. first proposed the MobileNet architecture based on depth-wise separable convolutions. This lightweight and computationally efficient network is quite friendly for mobile devices. Since then, depth-wise convolution has been widely used to design efficient models. Inverted residual blocks are based on depth-wise convolution and were first introduced in MobileNetV2. The inverted residual blocks are composed of a sequence of $1\times 1$ - depth-wise - $1\times 1$ convolutions. The hidden dimension between the two $1\times 1$ convolutions is expanded. The utilization of depth-wise convolution avoids the drastic increase of model complexity brought by normal convolution. Due to the efficiency of this module, it is widely used to form the search space of neural architecture search (NAS). The expansion of the hidden dimension of inverted residuals is quite similar to the feed-forward network of vision transformers. This motivates us to think about the connection between them (see Sec. 3.2).
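To make the efficiency argument concrete, the following sketch counts parameters and multiply-accumulate operations (MACs) for a normal and a depth-wise convolution. The function names, the bias-free counting convention, stride 1 with "same" padding, and the example sizes (768 channels on a 14×14 feature map) are illustrative assumptions rather than values taken from the paper.

```python
def standard_conv_cost(c_in, c_out, k, h, w):
    """Parameters and MACs of a normal k x k convolution (no bias, stride 1)."""
    params = k * k * c_in * c_out
    macs = params * h * w  # the full kernel is applied at every output position
    return params, macs


def depthwise_conv_cost(channels, k, h, w):
    """Depth-wise convolution: one k x k filter per channel (no bias, stride 1)."""
    params = k * k * channels
    macs = params * h * w
    return params, macs


if __name__ == "__main__":
    c, k, h, w = 768, 3, 14, 14  # illustrative sizes only
    std_params, std_macs = standard_conv_cost(c, c, k, h, w)
    dw_params, dw_macs = depthwise_conv_cost(c, k, h, w)
    print(f"normal:     {std_params:,} params, {std_macs:,} MACs")
    print(f"depth-wise: {dw_params:,} params, {dw_macs:,} MACs")
    print(f"reduction factor: {std_params // dw_params}")
```

With equal input and output widths, the depth-wise variant is cheaper by a factor equal to the number of channels, which is why it can be inserted between two $1\times 1$ convolutions at little extra cost.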

Methodology

Transformers are usually composed of encoders and decoders with similar building blocks. For the image classification task considered here, only the encoders are included in the network. Thus, we mainly describe the operations in the encoder layers. The encoders have two components, i.e. the self-attention mechanism that relates a token to all of the tokens and a feed-forward network that is applied to every token. We specifically explain how to introduce locality into the feed-forward network.

Self-attention. In the self-attention mechanism, the relationship between the tokens is modelled by the similarity between the projected query-key pairs, yielding the attention score. The new tokens are computed as the weighted sum of the projected values.
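In the standard scaled dot-product form, which is assumed here to be the intended one, this computation can be written as follows. The symbols $\mathbf{X}$ (input tokens), $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ (projection matrices) and $d$ (embedding dimension) are assumed notation, and the equation number is chosen to agree with the later references to Eqn. (1).

```latex
\begin{equation}
\mathbf{Z} = \operatorname{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V},
\qquad
\mathbf{Q} = \mathbf{X}\mathbf{W}_Q,\quad
\mathbf{K} = \mathbf{X}\mathbf{W}_K,\quad
\mathbf{V} = \mathbf{X}\mathbf{W}_V
\tag{1}
\end{equation}
```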

Feed-forward network. After the self-attention layer, a feed-forward network is appended. The feed-forward network consists of two fully-connected layers and transforms the features along the embedding dimension. The hidden dimension between the two fully-connected layers is expanded to learn a richer feature representation.
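A compact way to write this two-layer transformation is sketched below. The weight symbols $\mathbf{W}_1$, $\mathbf{W}_2$ and the activation $f(\cdot)$ are assumed notation, bias terms are omitted, and $\gamma$ is the hidden dimension expansion ratio used in the experiments. The equation number follows the later reference to Eqn. (2).

```latex
\begin{equation}
\mathbf{Y} = f(\mathbf{Z}\mathbf{W}_1)\,\mathbf{W}_2,
\qquad
\mathbf{W}_1 \in \mathbb{R}^{d \times \gamma d},\quad
\mathbf{W}_2 \in \mathbb{R}^{\gamma d \times d}
\tag{2}
\end{equation}
```

Because each row of $\mathbf{Z}$ is multiplied by the same matrices, every token is transformed independently of the others.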

3.2 Locality

Since only $1\times 1$ convolution is applied to the feature map, there is a lack of information interaction between adjacent pixels. Besides, the self-attention part of the transformer only captures global dependencies between all of the tokens. Thus, the transformer block does not have a mechanism to model the local dependencies between nearby pixels. It would be interesting if locality could be brought to transformers in an efficient way.

The expansion of the hidden dimension between fully-connected layers and the lattice perspective of the feed-forward network remind us of the inverted residual block proposed in MobileNets. As shown in Fig. 3, both the feed-forward network and the inverted residual expand and squeeze the hidden dimension by $1\times 1$ convolution. The only difference is that there is a depth-wise convolution in the inverted residual block. Depth-wise convolution applies a $k\times k$ ($k>1$) convolution kernel per channel. The features inside the $k\times k$ kernel are aggregated to compute a new feature. Thus, depth-wise convolution is an efficient way of introducing locality into the network. Considering that, we reintroduce depth-wise convolution into the feed-forward network of transformers.
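One way to write the resulting computation is sketched below. The operator names $\operatorname{Seq2Img}$ and $\operatorname{Img2Seq}$, the superscript $r$ for rearranged quantities, and the kernel symbols $\mathbf{W}_1^{r}$, $\mathbf{W}_d$, $\mathbf{W}_2^{r}$ are assumed notation, with $\circledast$ denoting convolution and $h \times w$ the token grid. The equation numbers are chosen to match the in-text reference to Eqns. (3), (6), and (5).

```latex
\begin{align}
\mathbf{Z}^{r} &= \operatorname{Seq2Img}(\mathbf{Z}), \quad \mathbf{Z}^{r} \in \mathbb{R}^{h \times w \times d} \tag{3} \\
\mathbf{Y}^{r} &= f\big(f(\mathbf{Z}^{r} \circledast \mathbf{W}_{1}^{r}) \circledast \mathbf{W}_{d}\big) \circledast \mathbf{W}_{2}^{r} \tag{6} \\
\mathbf{Y} &= \operatorname{Img2Seq}(\mathbf{Y}^{r}) \tag{5}
\end{align}
```

Here $\mathbf{W}_{1}^{r}$ and $\mathbf{W}_{2}^{r}$ are the $1\times 1$ kernels that expand and squeeze the hidden dimension, and $\mathbf{W}_{d}$ is the $k\times k$ depth-wise kernel.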

Note that the non-linear activation functions are not visualized in Fig. 3. Yet, they play a quite important role in enhancing the network capacity, especially for efficient networks. In particular, we try ReLU6, h-swish, the squeeze-and-excitation (SE) module, the efficient channel attention (ECA) module, and their combinations. A thorough analysis of the activation function is given in the experiments section.
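For readers unfamiliar with the two point-wise activations, a minimal scalar sketch of their standard definitions is given below. It is meant only as a reference for the formulas; the channel attention modules (SE, ECA) are not covered, and the function names are illustrative.

```python
def relu6(x: float) -> float:
    """ReLU clipped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard swish: x * ReLU6(x + 3) / 6, a piecewise approximation of swish."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0, 8.0):
        print(f"x={x:+.1f}  relu6={relu6(x):.3f}  h_swish={h_swish(x):.3f}")
```

Unlike ReLU6, h-swish is not clipped from above and passes small negative values for inputs between -3 and 0.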

3.3 Class token

To apply vision transformers to image classification, a trainable class token is added and inserted into the token embedding, i.e. the sequence grows from $N$ image tokens to $N+1$ tokens.

When depth-wise convolution is introduced into the feed-forward network, the sequence of tokens needs to be rearranged into an image feature map. Yet, the additional dimension brought by the class token makes the exact rearrangement impossible. To circumvent this problem, we split the $N+1$ tokens in Eqn. (1) into a class token and image tokens again, i.e. into $\mathbf{Z}_{cls}$ and the $N$ image tokens.

Then the new image tokens are passed through the feed-forward network according to Eqns. (3), (6), and (5), leading to $\mathbf{Y}$. The class token is not passed through the feed-forward network. Instead, it is directly concatenated with $\mathbf{Y}$, i.e. the layer outputs $\mathrm{Concat}(\mathbf{Z}_{cls}, \mathbf{Y})$.

The split and concatenation of the class token is done for every transformer layer. Although the class token $\mathbf{Z}_{cls}$ is not passed through the feed-forward network, the performance of the overall network is not adversely affected. This is because the information exchange and aggregation is done only in the self-attention part. A feed-forward network like Eqn. (2) only enforces a transformation within each token.
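The split, rearrangement, depth-wise convolution, and concatenation steps described in this section can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the $1\times 1$ convolutions, normalization, and activations are omitted, and the class token at index 0, the row-major token order, the $3\times 3$ kernel, and zero padding are all assumptions. All function names are hypothetical.

```python
from typing import List

Vector = List[float]


def seq_to_img(tokens: List[Vector], h: int, w: int) -> List[List[Vector]]:
    """Rearrange N = h * w tokens (row-major order) into an h x w grid."""
    assert len(tokens) == h * w, "image tokens must fill the grid exactly"
    return [tokens[r * w:(r + 1) * w] for r in range(h)]


def img_to_seq(grid: List[List[Vector]]) -> List[Vector]:
    """Flatten an h x w grid back into a row-major token sequence."""
    return [tok for row in grid for tok in row]


def depthwise_conv3x3(grid: List[List[Vector]],
                      kernels: List[List[List[float]]]) -> List[List[Vector]]:
    """Depth-wise 3x3 convolution with zero padding.

    kernels[c] is the 3x3 filter of channel c; channels are never mixed.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    for r in range(h):
        for col in range(w):
            for c in range(d):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, col + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            acc += kernels[c][dr + 1][dc + 1] * grid[rr][cc][c]
                out[r][col][c] = acc
    return out


def local_ffn_step(tokens: List[Vector], h: int, w: int,
                   kernels: List[List[List[float]]]) -> List[Vector]:
    """Split off the class token, filter the image tokens, concatenate back."""
    cls_token, img_tokens = tokens[0], tokens[1:]  # split (class token assumed first)
    grid = seq_to_img(img_tokens, h, w)            # sequence -> 2D feature map
    grid = depthwise_conv3x3(grid, kernels)        # local aggregation per channel
    return [cls_token] + img_to_seq(grid)          # concatenate with bypassed token


if __name__ == "__main__":
    h, w = 2, 2
    tokens = [[100.0]] + [[1.0], [2.0], [3.0], [4.0]]  # class token + 4 image tokens
    box = [[[1.0] * 3 for _ in range(3)]]              # one channel, 3x3 window sum
    print(local_ffn_step(tokens, h, w, box))
    # -> [[100.0], [10.0], [10.0], [10.0], [10.0]]
```

In the toy run, the class token passes through unchanged while every image token becomes the sum of its $3\times 3$ neighbourhood, which on a $2\times 2$ grid covers all four tokens.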

Experimental Results

This section gives the experimental results for image classification. We first study how the locality brought by depth-wise convolution can improve the performance of transformers. Then we show the influence of a non-linear activation function placed after the depth-wise convolution. We also study how the choice of layers equipped with locality influences the network capacity, and conduct an ablation study of the hidden dimension expansion ratio $\gamma$. All those experiments are based on DeiT-T. Finally, we study the generalization to other vision transformers, including T2T-ViT, PVT, and TNT, for image classification and compare the results with CNNs. The transformers that are equipped with locality are denoted as LocalViT followed by a suffix that denotes the basic architecture.

In order to introduce locality into transformers, we only adapt the feed-forward network of vision transformers while the other parts such as self-attention and position encoding are not changed. The implementation is based on the inverted residual blocks. A batch normalization layer is appended to the 2D convolutions. The layer normalization before the feed-forward network is removed. Path drop after the feed-forward network is also removed. The same modification is also applied to the T2T module of T2T-ViT. The feed-forward network of the inner transformer block in TNT is not adapted. For PVT, the class token is only introduced in the final stage of the pyramid. Thus, the split and concatenation of the class token for the feed-forward network is only applied in the final stage.

Experimental setup. The ImageNet2012 dataset is used in this paper. The dataset contains 1.28M training images and 50K validation images from one thousand classes. We follow the same training protocol as DeiT. The input image is randomly cropped with size $224\times 224$. Cross-entropy is used as the loss function. Label smoothing is used. The weight decay factor is set to 0.05. The AdamW optimizer is used with a momentum of 0.9. The training continues for 300 epochs. The batch size is set to 1024. The initial learning rate is set to $1\times 10^{-3}$ and decreases to $1\times 10^{-5}$ following a cosine learning rate scheduler. During validation, a center crop of the validation images is conducted. In the following sections, we study different aspects that could influence the performance of the introduced locality.
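The learning rate schedule described above can be illustrated with a short sketch. It assumes plain cosine annealing evaluated once per epoch, with no warm-up phase and no restarts; whether the original training protocol includes such details is not stated here, and the function name is hypothetical.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 300,
              lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Cosine-annealed learning rate for a given (0-based) epoch."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 300):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

The rate starts at $1\times 10^{-3}$, passes the midpoint between the two bounds halfway through training, and reaches $1\times 10^{-5}$ at epoch 300.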

4.2 Influence of the locality

We first study how the local information could help to improve the performance of vision transformers. Different hidden dimension expansion ratios $\gamma$ are investigated. First of all, due to the change of the operations in the feed-forward network (Sec. 4.1), the Top-1 accuracy of LocalViT-T is slightly increased even without the depth-wise convolution. The performance gain is 0.3% for $\gamma=4$ and is increased to 1.2% for $\gamma=6$. Note that compared with DeiT-T, no additional parameters and computation are introduced for the improvement. When locality is incorporated into the feed-forward network, there is a significant improvement of the model accuracy, i.e. 1.5% for $\gamma=4$ and 3.0% for $\gamma=6$. Compared with the baseline, there is only a marginal increase in the number of parameters and a negligible increase in the amount of computation. Thus, the performance of vision transformers can be significantly improved by the incorporation of a locality mechanism and the adaptation of the operation in the feed-forward network.

4.3 Activation functions

The non-linear activation function after depth-wise convolution used in the above experiments is simply ReLU6. The benefit of using other non-linear activation functions is also studied. First of all, by replacing ReLU6 with h-swish, the gain of Top-1 accuracy over the baseline is increased from 1.5% to 2.2%. This shows that the benefit of h-swish activation functions can be easily extended from CNNs to vision transformers. Next, the h-swish activation function is combined with other channel attention modules including ECA and SE. By adding ECA, the performance is further improved by 0.1%. Considering that only 60 parameters are introduced, this improvement is still considerable under a harsh parameter budget.

Another significant improvement is brought by a squeeze-and-excitation module. When the reduction ratio in the squeeze-and-excitation module is reduced from 192 to 4, the gain of Top-1 accuracy is gradually increased from 2.6% to 3.6%. The number of parameters is also increased accordingly. Note that, for all of the networks, the computational complexity is almost the same. This implies that if there is no strict limitation on the number of parameters, advanced non-linear activation functions could be used. In the following experiments, we use the combination of h-swish and SE as the non-linear activation function after depth-wise convolution. Additionally, the reduction ratio of the squeeze-and-excitation module is chosen such that only 4 channels are kept after the squeeze operation. This choice of design achieves a good balance between the number of parameters and the model accuracy. Thus, local information is also important in vision transformers. A wide range of efficient modules could be introduced into the feed-forward network of vision transformers to expand the network capacity.
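The trade-off between reduction ratio and parameter count can be checked with a short calculation. The sketch assumes a squeeze-and-excitation block built from two fully connected layers with biases and a hidden width of 768 channels (an embedding dimension of 192 expanded by $\gamma=4$, which is assumed here for DeiT-T) in each of the 12 layers. These implementation details are assumptions for illustration, so the printed numbers are not the figures reported in the paper.

```python
def se_params(channels: int, reduction: int, bias: bool = True) -> int:
    """Parameters of a squeeze-and-excitation block with two fully connected layers."""
    squeezed = channels // reduction
    weights = 2 * channels * squeezed  # C -> C/r and C/r -> C
    biases = (squeezed + channels) if bias else 0
    return weights + biases


if __name__ == "__main__":
    hidden, layers = 768, 12  # assumed: 192-dim embedding expanded by gamma = 4
    for reduction in (192, 4):
        per_layer = se_params(hidden, reduction)
        print(f"reduction={reduction}: {hidden // reduction} squeezed channels, "
              f"{per_layer:,} params per layer, {layers * per_layer:,} in total")
```

Under these assumptions, a reduction ratio of 192 leaves 4 channels after the squeeze operation and costs only a few thousand parameters per layer, whereas a ratio of 4 costs roughly as much as one of the $1\times 1$ convolutions.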

4.4 Placement of locality

The transformer layers where the locality is introduced can also influence the performance of the network. Thus, an ablation study based on LocalViT-T is conducted to study their effect. The results are reported in Table 3. There are in total 12 transformer layers in the network. We divide the 12 layers into 3 groups corresponding to “Low”, “Mid”, and “High” stages. For the first three rows of Table 3, we independently insert locality into the three stages. As the locality is moved gradually from the lower stages to the higher stages, the accuracy of the network decreases. This shows that local information is especially important for the lower layers. This is also consistent with our intuition. That is, when the depth-wise convolution is applied to the lower layers, the local information aggregated there could also be propagated to the higher layers, which is important to improve the overall performance of the network.

When the locality is introduced only in the higher stage, the Top-1 accuracy is even lower than DeiT-T. To investigate whether locality in the higher layers always has an adverse effect, we progressively allow more lower layers to have depth-wise convolution until locality is enabled for all layers. This corresponds to the last three rows of Table 3. The observation is that starting from the lower layers, the performance of the network could be gradually improved as locality is enabled for more layers. Thus, introducing the locality to the lower layers is more advantageous compared with higher layers.

4.5 Expansion ratio $\gamma$

The effect of the expansion ratio of the hidden dimension of the feed-forward network is also investigated. The results are shown in Table 4. Expanding the hidden dimension of the feed-forward network can have a significant effect on the performance of the transformers. As the expansion ratio is increased from 1 to 4, the Top-1 accuracy is increased from less than 70% to nearly 75%. The model complexity is also almost doubled. Thus, the network performance and model complexity can be balanced by the hidden dimension expansion ratio $\gamma$. Squeeze-and-excitation can be more beneficial for smaller $\gamma$.
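A rough parameter count shows how the model size scales with $\gamma$. The sketch below counts only the dominant weight matrices of one transformer layer and assumes an embedding dimension of 192, 12 layers, and a $3\times 3$ depth-wise kernel. Biases, normalization layers, the activation modules, and the patch embedding are ignored, so the totals are estimates and not the values of Table 4.

```python
def layer_params(d: int, gamma: int, k: int = 3) -> int:
    """Approximate weights in one transformer layer with a depth-wise FFN.

    Counts only the attention projections (4 d^2), the two 1x1 convolutions
    (2 * gamma * d^2) and the depth-wise kernel (k^2 * gamma * d).
    """
    attention = 4 * d * d
    pointwise = 2 * gamma * d * d
    depthwise = k * k * gamma * d
    return attention + pointwise + depthwise


if __name__ == "__main__":
    d, layers = 192, 12  # assumed DeiT-T-like sizes
    for gamma in (1, 2, 3, 4, 6):
        total = layers * layer_params(d, gamma)
        print(f"gamma={gamma}: ~{total / 1e6:.2f}M weights in the encoder layers")
```

Under these assumptions the count roughly doubles between $\gamma=1$ and $\gamma=4$, in line with the observation above, while the depth-wise kernel itself contributes only a small fraction of the total.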

4.6 Generalization and comparison

Finally, we try to generalize the knowledge derived above to more vision transformers including DeiT-S, T2T-ViT, TNT, and PVT, and compare their performance with CNNs. The results are shown in Table 5.

We draw two major conclusions from Table 5. Firstly, the effectiveness of locality can be generalized to a wide range of vision transformers based on the following observations. I. Compared with DeiT, LocalViT yields a higher classification accuracy for both the tiny and small versions of the network. The increase of Top-1 accuracy is 2.6% and 1.0%, respectively. LocalViT-T even outperforms the DeiT-T variant that is enhanced by knowledge distillation from RegNetY-160. The small version LocalViT-S is slightly worse than the distilled DeiT-S by 0.4%. II. LocalViT-T2T outperforms T2T-ViT-7 by 0.8%. Note that T2T-ViT already tries to model the local structure information in the tokens-to-token module. III. In TNT, an additional transformer block is used to extract local features for the image tokens. Thus, the locality is also considered in TNT. The modified network, i.e. LocalViT-TNT, could still improve the classification accuracy by a large margin of 2.3%. IV. The biggest improvement comes from PVT. Introducing the locality module leads to a gain of 3.1% over PVT-T. V. The comparison of the training log between the baseline transformers and LocalViT is shown in Fig. 4. During the training phase, LocalViT consistently leads to a lower training loss and higher validation accuracy than the baseline transformers.

Secondly, some versions of the enhanced vision transformer LocalViT are already comparable to or even outperform CNNs. This conclusion can be drawn by making pairwise comparisons, i.e. LocalViT-T vs. MobileNetV2 (1.4), LocalViT-S vs. ResNet-50, LocalViT-T2T vs. MobileNetV1, LocalViT-PVT vs. DenseNet-169, etc.

Conclusion

In this paper, we proposed to incorporate a locality mechanism into vision transformers. This is done by incorporating 2D depth-wise convolutions followed by a non-linear activation function into the feed-forward network of vision transformers. The idea is motivated by the comparison between the feed-forward network of transformers and the inverted residuals of MobileNets. In previous works, the input to the feed-forward network is a sequence of token embeddings converted from an image. To cope with the locality mechanism, the sequence of token embeddings is rearranged into a lattice as a 2D feature map, which is used as the input to the enhanced feed-forward network. To enable the rearrangement, the class token is split off before the feed-forward network and concatenated with the other image embeddings after the feed-forward network. A series of studies were made to investigate various factors (activation function, layer placement, and expansion ratio) that might influence the performance of the locality mechanism. The proposed locality mechanism is successfully applied to four different vision transformers, which validates its generality.

Acknowledgements: This work was partly supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, and an Amazon AWS grant.