On the Integration of Self-Attention and Convolution
Xuran Pan, Chunjiang Ge, Rui Lu, Shiji Song, Guanfu Chen, Zeyi Huang, Gao Huang
Introduction
Recent years have witnessed the rapid development of convolution and self-attention in computer vision. Convolutional neural networks (CNNs) are widely adopted for image recognition, semantic segmentation and object detection, and achieve state-of-the-art performance on various benchmarks. Self-attention, on the other hand, was first introduced in natural language processing and also shows great potential in image generation and super-resolution. More recently, with the advent of vision transformers, attention-based modules have achieved comparable or even better performance than their CNN counterparts on many vision tasks.
Despite the great success that both approaches have achieved, convolution and self-attention modules usually follow different design paradigms. Traditional convolution applies an aggregation function over a localized receptive field according to the convolution filter weights, which are shared across the whole feature map. These intrinsic characteristics impose crucial inductive biases for image processing. In comparison, the self-attention module applies a weighted average operation based on the context of the input features, where the attention weights are computed dynamically via a similarity function between related pixel pairs. This flexibility enables the attention module to focus on different regions adaptively and capture more informative features.
Considering the different and complementary properties of convolution and self-attention, it should be possible to benefit from both paradigms by integrating these modules. Previous work has explored the combination of self-attention and convolution from several perspectives. Early work, e.g., SENet and CBAM, shows that the self-attention mechanism can serve as an augmentation for convolution modules. More recently, self-attention modules have been proposed as individual blocks to substitute traditional convolutions in CNN models, e.g., SAN and BoTNet. Another line of research focuses on combining self-attention and convolution in a single block, e.g., AA-ResNet and Container, although these architectures are limited to designing independent paths for each module. Therefore, existing approaches still treat self-attention and convolution as distinct parts, and the underlying relations between them have not been fully exploited.
In this paper, we seek to unearth a closer relationship between self-attention and convolution. By decomposing the operations of these two modules, we show that they heavily rely on the same $1\times 1$ convolution operations. Based on this observation, we develop a mixed model, named ACmix, which integrates self-attention and convolution elegantly with minimum computational overhead. Specifically, we first project the input feature maps with $1\times 1$ convolutions and obtain a rich set of intermediate features. Then, the intermediate features are reused and aggregated following different paradigms, i.e., in self-attention and convolution manners respectively. In this way, ACmix enjoys the benefits of both modules and avoids conducting expensive projection operations twice.
To summarize, our contributions are twofold:
(1) A strong underlying relation between self-attention and convolution is revealed, providing new perspectives on understanding the connections between the two modules and inspiration for designing new learning paradigms.
(2) An elegant integration of the self-attention and convolution modules, which enjoys the benefits of both worlds, is presented. Empirical evidence demonstrates that the hybrid model consistently outperforms its pure convolution or self-attention counterparts.
Related Work
Convolutional neural networks, which use convolution kernels to extract local features, have become the most powerful and conventional technique for various vision tasks. Meanwhile, self-attention has also demonstrated its prevailing performance on a broad range of language tasks, as in BERT and GPT3. Theoretical analysis indicates that, when equipped with sufficiently large capacity, self-attention can express the function class of any convolution layer. Therefore, a line of research has recently explored the possibility of adopting the self-attention mechanism in vision tasks. There are two mainstream methods: one uses self-attention as the building block of a network, and the other views self-attention and convolution as complementary parts.
Inspired by the expressive power of self-attention in modeling long-range dependencies, a line of work endeavours to use self-attention alone as the elementary building block for vision models. Some works show that self-attention can become a stand-alone primitive for vision models that completely substitutes convolutional operations. Recently, Vision Transformer shows that, given enough data, we can treat an image as a sequence of 256 tokens and leverage Transformer models to achieve competitive results in image recognition. Furthermore, the transformer paradigm has been adopted in detection, segmentation, point cloud recognition and other vision tasks.
2.2 Attention enhanced Convolution
Multiple previously proposed attention mechanisms over images suggest that attention can overcome the locality limitation of convolutional networks. Therefore, many researchers explore the possibility of employing attention modules or utilizing more relational information to enhance the functionality of convolutional networks. In particular, Squeeze-and-Excitation (SE) and Gather-Excite (GE) reweigh the map for each channel. BAM and CBAM independently reweigh both channels and spatial locations to better refine the feature map. AA-ResNet augments certain convolutional layers by concatenating attention maps from another independent self-attention pipeline. BoTNet substitutes convolutions with self-attention modules at late stages of the model. Some work aims at designing a more flexible feature extractor by aggregating information from a wider range of pixels. Hu et al. proposed a local-relation approach to adaptively determine aggregation weights based on the compositional relationship of local pixels. Wang et al. proposed the non-local network, which increases the receptive field by introducing non-local blocks that compare similarity among global pixels.
2.3 Convolution enhanced Attention
With the advent of Vision Transformer, numerous transformer-based variants have been proposed and have achieved significant improvements on computer vision tasks. Among them are works that complement transformer models with convolution operations to introduce additional inductive biases. CvT adopts convolution in the tokenization process and utilizes strided convolution to reduce the computational complexity of self-attention. ViT with a convolutional stem adds convolutions at the early stage to achieve more stable training. CSwin Transformer adopts a convolution-based positional encoding technique and shows improvements on downstream tasks. Conformer combines Transformer with an independent CNN model to integrate both features.
Revisiting Convolution and Self-Attention
Convolution and self-attention are widely known in their current forms. To better capture the relationship between these two modules, we revisit them from a novel view by decomposing their operations into separate stages.
Convolution is one of the most essential parts of modern ConvNets. We first review the standard convolution operation and reformulate it from a different perspective. The illustration is shown in Fig.2(a). For simplicity, we assume the stride of convolution is 1.
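Consider a standard convolution with the kernel $K \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$, where $k$ is the kernel size and $C_{in}, C_{out}$ are the input and output channel sizes. Given tensors $F \in \mathbb{R}^{C_{in} \times H \times W}$ and $G \in \mathbb{R}^{C_{out} \times H \times W}$ as the input and output feature maps, where $H, W$ denote the height and width, we denote $f_{ij} \in \mathbb{R}^{C_{in}}$ and $g_{ij} \in \mathbb{R}^{C_{out}}$ as the feature tensors of pixel $(i, j)$ corresponding to $F$ and $G$ respectively. Then, the standard convolution can be formulated as:
$$g_{ij} = \sum_{p,q} K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor}, \tag{1}$$
where $K_{p,q} \in \mathbb{R}^{C_{out} \times C_{in}}$, $p, q \in \{0, 1, \dots, k-1\}$, represents the kernel weights with regard to the indices of the kernel position $(p, q)$.
For convenience, we can rewrite Eq.(1) as the summation of the feature maps from different kernel positions:
$$g_{ij} = \sum_{p,q} g_{ij}^{(p,q)}, \tag{2}$$
$$g_{ij}^{(p,q)} = K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor}. \tag{3}$$
To further simplify the formulation, we define the Shift operation, $\tilde{f} \triangleq \mathrm{Shift}(f, \Delta x, \Delta y)$, as:
$$\tilde{f}_{i,j} = f_{i+\Delta x,\; j+\Delta y}, \quad \forall i, j, \tag{4}$$
where $\Delta x, \Delta y$ correspond to the horizontal and vertical displacements. Then, Eq.(3) can be rewritten as:
$$g_{ij}^{(p,q)} = \mathrm{Shift}\big(K_{p,q} f_{ij},\; p - \lfloor k/2 \rfloor,\; q - \lfloor k/2 \rfloor\big). \tag{5}$$
As a result, the standard convolution can be summarized as two stages:
$$\text{Stage I:} \quad \tilde{g}_{ij}^{(p,q)} = K_{p,q} f_{ij}, \tag{6}$$
$$\text{Stage II:} \quad g_{ij}^{(p,q)} = \mathrm{Shift}\big(\tilde{g}_{ij}^{(p,q)},\; p - \lfloor k/2 \rfloor,\; q - \lfloor k/2 \rfloor\big), \tag{7}$$
$$g_{ij} = \sum_{p,q} g_{ij}^{(p,q)}. \tag{8}$$
At the first stage, the input feature map is linearly projected w.r.t. the kernel weights from a certain position, i.e., $(p, q)$. This is the same as a standard $1\times 1$ convolution. In the second stage, the projected feature maps are shifted according to the kernel positions and finally aggregated together. It can be easily observed that most of the computational cost is incurred in the $1\times 1$ convolution, while the following shift and aggregation are lightweight.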
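The two-stage view can be checked numerically on a small example. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes stride 1, zero padding at the borders, and nested lists indexed as `feat[c][i][j]` and `kernel[p][q][o][c]`. All function names are hypothetical. It compares "project with each kernel position, then shift and sum" against a direct pixel-by-pixel convolution.

```python
import random

def shift(feat, dx, dy):
    """Shift(f, dx, dy): out[c][i][j] = feat[c][i+dx][j+dy], zero outside the map."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                a, b = i + dx, j + dy
                if 0 <= a < H and 0 <= b < W:
                    out[c][i][j] = feat[c][a][b]
    return out

def project(feat, weight):
    """1x1 convolution: apply the C_out x C_in matrix `weight` at every pixel."""
    C_in, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][c] * feat[c][i][j] for c in range(C_in))
              for j in range(W)] for i in range(H)] for o in range(len(weight))]

def conv_two_stage(feat, kernel, k):
    """Stage I projects with each kernel position; Stage II shifts and sums."""
    C_out = len(kernel[0][0])
    H, W = len(feat[0]), len(feat[0][0])
    out = [[[0] * W for _ in range(H)] for _ in range(C_out)]
    for p in range(k):
        for q in range(k):
            projected = project(feat, kernel[p][q])              # Stage I
            shifted = shift(projected, p - k // 2, q - k // 2)   # Stage II: shift
            for o in range(C_out):                               # Stage II: sum
                for i in range(H):
                    for j in range(W):
                        out[o][i][j] += shifted[o][i][j]
    return out

def conv_direct(feat, kernel, k):
    """Reference: zero-padded, stride-1 convolution computed pixel by pixel."""
    C_in, H, W = len(feat), len(feat[0]), len(feat[0][0])
    C_out = len(kernel[0][0])
    out = [[[0] * W for _ in range(H)] for _ in range(C_out)]
    for o in range(C_out):
        for i in range(H):
            for j in range(W):
                for p in range(k):
                    for q in range(k):
                        a, b = i + p - k // 2, j + q - k // 2
                        if 0 <= a < H and 0 <= b < W:
                            out[o][i][j] += sum(kernel[p][q][o][c] * feat[c][a][b]
                                                for c in range(C_in))
    return out

# Small integer example so that both results can be compared exactly.
rng = random.Random(0)
k, C_in, C_out, H, W = 3, 2, 3, 4, 5
feat = [[[rng.randint(-3, 3) for _ in range(W)] for _ in range(H)] for _ in range(C_in)]
kernel = [[[[rng.randint(-2, 2) for _ in range(C_in)] for _ in range(C_out)]
           for _ in range(k)] for _ in range(k)]
same = conv_two_stage(feat, kernel, k) == conv_direct(feat, kernel, k)
```

Integer inputs are used only so the equality check is exact; with floating-point features the two orderings of the sum would agree up to rounding.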
3.2 Self-Attention
The attention mechanism has also been widely adopted in vision tasks. Compared to traditional convolution, attention allows the model to focus on important regions within a larger context. We show the illustration in Fig.2(b).
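Consider a standard self-attention module with $N$ heads. Let $F \in \mathbb{R}^{C_{in} \times H \times W}$ and $G \in \mathbb{R}^{C_{out} \times H \times W}$ denote the input and output features. Let $f_{ij} \in \mathbb{R}^{C_{in}}$ and $g_{ij} \in \mathbb{R}^{C_{out}}$ denote the corresponding tensors of pixel $(i, j)$. Then, the output of the attention module is computed as:
$$g_{ij} = \mathop{\Vert}_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(W_q^{(l)} f_{ij},\, W_k^{(l)} f_{ab}\big)\, W_v^{(l)} f_{ab} \Big), \tag{9}$$
where $\Vert$ is the concatenation of the outputs of $N$ attention heads, and $W_q^{(l)}, W_k^{(l)}, W_v^{(l)}$ are the projection matrices for queries, keys and values. $\mathcal{N}_k(i,j)$ represents a local region of pixels with spatial extent $k$ centered around $(i, j)$, and $A\big(W_q^{(l)} f_{ij}, W_k^{(l)} f_{ab}\big)$ is the corresponding attention weight with regard to the features within $\mathcal{N}_k(i,j)$.
For the widely adopted self-attention modules, the attention weights are computed as:
$$A\big(W_q^{(l)} f_{ij},\, W_k^{(l)} f_{ab}\big) = \mathop{\mathrm{softmax}}_{\mathcal{N}_k(i,j)} \left( \frac{\big(W_q^{(l)} f_{ij}\big)^{\top} \big(W_k^{(l)} f_{ab}\big)}{\sqrt{d}} \right), \tag{10}$$
where $d$ is the feature dimension of $W_q^{(l)} f_{ij}$.
Also, multi-head self-attention can be decomposed into two stages, and reformulated as:
$$\text{Stage I:} \quad q_{ij}^{(l)} = W_q^{(l)} f_{ij}, \quad k_{ij}^{(l)} = W_k^{(l)} f_{ij}, \quad v_{ij}^{(l)} = W_v^{(l)} f_{ij}, \tag{11}$$
$$\text{Stage II:} \quad g_{ij} = \mathop{\Vert}_{l=1}^{N} \Big( \sum_{a,b \in \mathcal{N}_k(i,j)} A\big(q_{ij}^{(l)},\, k_{ab}^{(l)}\big)\, v_{ab}^{(l)} \Big). \tag{12}$$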
Similar to the traditional convolution in Sec.3.1, $1\times 1$ convolutions are first conducted in Stage I to project the input feature as query, key and value. Stage II, on the other hand, comprises the calculation of the attention weights and the aggregation of the value matrices, which amounts to gathering local features. The corresponding computational cost is also shown to be minor compared to Stage I, following the same pattern as convolution.
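The same split can be sketched for local self-attention. The snippet below is an illustrative single-head version in plain Python, under several assumptions that are not taken from the paper: features are stored as `feat[i][j]` (a length-C list per pixel), the window is clipped at the image border rather than padded, and no positional encoding is used. With multiple heads, the per-head outputs would simply be concatenated.

```python
import math

def matvec(w, x):
    """Multiply matrix `w` (list of rows) by vector `x`."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def local_attention(feat, w_q, w_k, w_v, k):
    """Single-head attention over a k x k window around every pixel."""
    H, W = len(feat), len(feat[0])
    # Stage I: per-pixel linear projections, i.e. three 1x1 convolutions.
    q = [[matvec(w_q, feat[i][j]) for j in range(W)] for i in range(H)]
    key = [[matvec(w_k, feat[i][j]) for j in range(W)] for i in range(H)]
    v = [[matvec(w_v, feat[i][j]) for j in range(W)] for i in range(H)]
    d = len(w_q)            # dimension of the projected query
    r = k // 2
    out = [[None] * W for _ in range(H)]
    # Stage II: attention weights and aggregation of values in the window.
    for i in range(H):
        for j in range(W):
            nbrs = [(a, b) for a in range(i - r, i + r + 1)
                    for b in range(j - r, j + r + 1)
                    if 0 <= a < H and 0 <= b < W]
            logits = [sum(x * y for x, y in zip(q[i][j], key[a][b])) / math.sqrt(d)
                      for a, b in nbrs]
            top = max(logits)                      # subtract max for stability
            exps = [math.exp(l - top) for l in logits]
            z = sum(exps)
            out[i][j] = [sum(e / z * v[a][b][c] for e, (a, b) in zip(exps, nbrs))
                         for c in range(len(w_v))]
    return out

# With a zero query projection all logits are equal, so the output is the
# plain average of the values inside each (clipped) window.
feat = [[[float(3 * i + j + 1)] for j in range(3)] for i in range(3)]
out = local_attention(feat, [[0.0]], [[1.0]], [[1.0]], k=3)
```

In this toy run the centre pixel averages all nine values and the corner pixel averages its four in-bounds neighbours, which makes the aggregation step easy to verify by hand.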
3.3 Computational Cost
To fully understand the computation bottleneck of the convolution and self-attention modules, we analyse the floating-point operations (FLOPs) and the number of parameters at each stage and summarize them in Tab.1. It is shown that the theoretical FLOPs and parameters at Stage I of convolution have quadratic complexity with regard to the channel size $C$, while the computational cost for Stage II is linear in $C$ and no additional training parameters are required.
A similar trend is also found for the self-attention module, where all training parameters are preserved at Stage I. As for the theoretical FLOPs, we consider a normal case in a ResNet-like model where and for various layer depths. It is explicitly shown that Stage I consumes a heavier operation as , and the discrepancy is more distinct as channel size grows.
To further verify the validity of our analysis, we also summarize the actual computational costs of the convolution and self-attention modules in a ResNet50 model in Tab.1. In practice, we add up the costs of all convolution (or self-attention) modules to reflect the tendency from the model perspective. It is shown that 99% of the computation of convolution and 83% of that of self-attention are conducted at Stage I, which is consistent with our theoretical analysis.
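As a rough illustration of why Stage I dominates, the per-pixel operation counts can be tallied directly from the two-stage descriptions. The sketch below is a back-of-the-envelope count, not a reproduction of Tab.1: it assumes equal input and output channel size `C`, counts one multiply-accumulate per weight, ignores softmax and bias terms, and the channel and window sizes in the example are chosen purely for illustration.

```python
def conv_cost(C, k):
    """Per-pixel multiply-accumulates for a k x k convolution, split by stage."""
    stage1 = k * k * C * C   # k^2 projections, each a C x C matrix-vector product
    stage2 = k * k * C       # summing k^2 shifted C-dimensional features
    return stage1, stage2

def attention_cost(C, k_a):
    """Per-pixel multiply-accumulates for local attention with a k_a x k_a window."""
    stage1 = 3 * C * C           # query, key and value projections
    stage2 = 2 * k_a * k_a * C   # q.k dot products plus weighted sum of values
    return stage1, stage2

def stage1_share(cost):
    """Fraction of the total cost spent in Stage I."""
    stage1, stage2 = cost
    return stage1 / (stage1 + stage2)

# Hypothetical sizes: 64 channels, 3x3 convolution, 7x7 attention window.
conv = conv_cost(64, 3)
attn = attention_cost(64, 7)
```

Under these assumptions the Stage I cost grows quadratically with `C` while the Stage II cost grows only linearly, so the Stage I share increases as the channel size grows.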
Method
The decomposition of self-attention and convolution modules in Sec.3 reveals deeper relations from various perspectives. First, the two stages play quite similar roles. Stage I is a feature learning module, where both approaches share the same operations by performing $1\times 1$ convolutions to project features into deeper spaces. Stage II, on the other hand, corresponds to the procedure of feature aggregation, despite the differences in their learning paradigms.
From the computation perspective, the $1\times 1$ convolutions conducted at Stage I of both convolution and self-attention modules require a quadratic complexity of theoretical FLOPs and parameters with regard to the channel size $C$. In comparison, at Stage II both modules are lightweight or nearly free of computation.
In conclusion, the above analysis shows that (1) convolution and self-attention practically share the same operation of projecting the input feature maps through $1\times 1$ convolutions, which is also where the computational overhead of both modules lies, and (2) although crucial for capturing semantic features, the aggregation operations at Stage II are lightweight and do not require additional learnable parameters.
4.2 Integration of Self-Attention and Convolution
The aforementioned observations naturally lead to an elegant integration of convolution and self-attention. As both modules share the same $1\times 1$ convolution operations, we need to perform the projection only once, and can reuse these intermediate feature maps for the different aggregation operations. The illustration of our proposed mixed module, ACmix, is shown in Fig.2(c).
Specifically, ACmix also comprises two stages. At Stage I, the input feature is projected by three $1\times 1$ convolutions and reshaped into $N$ pieces, respectively. Thus, we obtain a rich set of intermediate features containing $3 \times N$ feature maps.
At Stage II, they are used following different paradigms. For the self-attention path, we gather the intermediate features into $N$ groups, where each group contains three pieces of features, one from each $1\times 1$ convolution. The corresponding three feature maps serve as queries, keys, and values, following the traditional multi-head self-attention modules (Eq.(12)). For the convolution path with kernel size $k$, we adopt a light fully connected layer and generate $k^2$ feature maps. Consequently, by shifting and aggregating the generated features (Eq.(7),(8)), we process the input feature in a convolution manner, and gather information from a local receptive field like the traditional ones.
Finally, outputs from both paths are added together and the strengths are controlled by two learnable scalars:
$$F_{out} = \alpha F_{att} + \beta F_{conv}. \tag{13}$$
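The reuse of the Stage I projections can be illustrated with a deliberately simplified sketch. The code below is not the authors' implementation: it assumes a single head (so that both paths produce the same number of channels without extra kernel groups), plain tensor shifts with zero padding instead of the group convolution introduced later, a window clipped at the border, and no positional encoding. The function name and the `w_fc` layout (one row of three mixing weights per kernel position) are hypothetical.

```python
import math

def matvec(w, x):
    """Multiply matrix `w` (list of rows) by vector `x`."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def acmix_single_head(feat, w_q, w_k, w_v, w_fc, alpha, beta, k_att=3, k_conv=3):
    """feat[i][j]: length-C vector; w_q/w_k/w_v: C x C; w_fc: k_conv^2 rows of 3."""
    H, W, C = len(feat), len(feat[0]), len(w_q)
    # Stage I (shared): three 1x1 projections, computed once per pixel.
    q = [[matvec(w_q, feat[i][j]) for j in range(W)] for i in range(H)]
    key = [[matvec(w_k, feat[i][j]) for j in range(W)] for i in range(H)]
    v = [[matvec(w_v, feat[i][j]) for j in range(W)] for i in range(H)]

    out = [[None] * W for _ in range(H)]
    r = k_att // 2
    for i in range(H):
        for j in range(W):
            # Stage II, attention path: softmax over the local window.
            nbrs = [(a, b) for a in range(i - r, i + r + 1)
                    for b in range(j - r, j + r + 1)
                    if 0 <= a < H and 0 <= b < W]
            logits = [sum(x * y for x, y in zip(q[i][j], key[a][b])) / math.sqrt(C)
                      for a, b in nbrs]
            top = max(logits)
            exps = [math.exp(l - top) for l in logits]
            z = sum(exps)
            att = [sum(e * v[a][b][c] for e, (a, b) in zip(exps, nbrs)) / z
                   for c in range(C)]
            # Stage II, convolution path: the FC layer mixes (q, k, v) into
            # k_conv^2 maps; map (p, s) is read at its shifted position and
            # all maps are summed.
            conv = [0.0] * C
            for p in range(k_conv):
                for s in range(k_conv):
                    a, b = i + p - k_conv // 2, j + s - k_conv // 2
                    if 0 <= a < H and 0 <= b < W:
                        m_q, m_k, m_v = w_fc[p * k_conv + s]
                        for c in range(C):
                            conv[c] += (m_q * q[a][b][c] + m_k * key[a][b][c]
                                        + m_v * v[a][b][c])
            # Combine both paths with the two scalars.
            out[i][j] = [alpha * x + beta * y for x, y in zip(att, conv)]
    return out

# Toy check: constant input, identity value projection, and an FC layer that
# only passes the value map at the centre kernel position.
feat = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_fc = [[0.0, 0.0, 0.0] for _ in range(9)]
w_fc[4] = [0.0, 0.0, 1.0]
out = acmix_single_head(feat, [[0.3, -0.1], [0.2, 0.5]], [[1.0, 0.0], [0.5, 0.5]],
                        identity, w_fc, alpha=0.5, beta=0.5)
```

In the toy check both paths return the input unchanged, so the combined output equals the input for any `alpha + beta = 1`; the point of the sketch is only that `q`, `key` and `v` are computed once and consumed by both paths.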
4.3 Improved Shift and Summation
As shown in Sec.4.2 and Fig.2, intermediate features in the convolution path follow the shift and summation operations as conducted in traditional convolution modules. Although they are theoretically lightweight, shifting tensors in various directions breaks data locality in practice and is difficult to vectorize. This may greatly impair the actual efficiency of our module at inference time.
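As a remedy, we resort to applying depthwise convolution with fixed kernels as a replacement for the inefficient tensor shifts, as shown in Fig.3 (b). Taking $\mathrm{Shift}(f, -1, -1)$ as an example, the shifted feature is computed as:
$$\tilde{f}_{c,i,j} = f_{c,\,i-1,\,j-1}, \quad \forall c, i, j, \tag{14}$$
where $c$ represents each channel of the input feature.
On the other hand, if we denote the convolution kernel (kernel size $k = 3$) as:
$$K_c = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \forall c, \tag{15}$$
the corresponding output can be formulated as:
$$f^{(dwc)}_{c,i,j} = \sum_{p,q \in \{0,1,2\}} K_{c,p,q}\, f_{c,\; i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor} = f_{c,\,i-1,\,j-1} = \tilde{f}_{c,i,j}. \tag{16}$$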
Therefore, with carefully designed kernel weights for specific shift directions, the convolution outputs are equivalent to the simple tensor shifts (Eq.(14)). To further incorporate the summation of features from different directions, we concatenate all the input features and convolution kernels respectively, and formulate the shift operation as a single group convolution, as depicted in Fig.3 (c.I). This modification gives our module higher computational efficiency.
On this basis, we additionally introduce several adaptations to enhance the flexibility of the module. As shown in Fig.3 (c.II), we relax the convolution kernels to be learnable weights, with the shift kernels as initialization. This improves the model capacity while maintaining the ability of the original shift operations. We also use multiple groups of convolution kernels to match the output channel dimension of the convolution and self-attention paths, as depicted in Fig.3 (c.III).
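The equivalence between a tensor shift and a depthwise convolution with a fixed one-hot kernel is easy to verify on a single channel. The following plain-Python sketch assumes zero padding and stride 1; the helper names are hypothetical. The one-hot kernels it builds are the kind of shift kernels that could serve as the initialization mentioned above.

```python
def shift(channel, dx, dy):
    """out[i][j] = channel[i+dx][j+dy], with zeros outside the map."""
    H, W = len(channel), len(channel[0])
    return [[channel[i + dx][j + dy]
             if 0 <= i + dx < H and 0 <= j + dy < W else 0
             for j in range(W)] for i in range(H)]

def shift_kernel(dx, dy, k=3):
    """One-hot k x k kernel whose convolution reproduces shift(., dx, dy)."""
    kernel = [[0] * k for _ in range(k)]
    kernel[dx + k // 2][dy + k // 2] = 1
    return kernel

def depthwise_conv(channel, kernel):
    """Zero-padded, stride-1 convolution of one channel with one kernel."""
    k = len(kernel)
    H, W = len(channel), len(channel[0])
    out = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for p in range(k):
                for q in range(k):
                    a, b = i + p - k // 2, j + q - k // 2
                    if 0 <= a < H and 0 <= b < W:
                        out[i][j] += kernel[p][q] * channel[a][b]
    return out

x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
# Every one of the nine 3x3 shift directions matches its one-hot kernel.
all_match = all(depthwise_conv(x, shift_kernel(dx, dy)) == shift(x, dx, dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1))
```

Stacking the nine kernels along the channel axis and applying them in one grouped convolution is what turns the separate shifts into a single vectorized operation.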
4.4 Computational Cost of ACmix
For better comparison, we summarize the FLOPs and parameters of ACmix in Tab.1. The computational cost and training parameters at Stage I are the same as for self-attention and lighter than for traditional convolution (e.g., a $3\times 3$ conv). At Stage II, ACmix introduces additional computation overhead with a light fully connected layer and the group convolution described in Sec.4.3, whose computational complexity is linear with regard to the channel size $C$ and comparatively minor relative to Stage I. The practical cost in a ResNet50 model shows similar trends to the theoretical analysis.
4.5 Generalization to Other Attention Modes
With the development of the self-attention mechanism, numerous studies have explored variations of the attention operator to further improve model performance. Patchwise attention incorporates information from all features in the local region as the attention weights to replace the original softmax operation. Window attention, adopted by Swin-Transformer, keeps the same receptive field for tokens in the same local window to save computational cost and achieve fast inference speed. ViT and DeiT, on the other hand, consider global attention to retain long-range dependencies within a single layer. These modifications have proved effective under specific model architectures.
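Under these circumstances, it is worth noticing that our proposed ACmix is independent of the self-attention formulation, and can be readily adopted on the aforementioned variants. Specifically, the attention weights can be summarized as:
$$A(q_{ij}, k_{ab})_{\text{patch}} = \phi\big(\big[q_{ij},\, [k_{ab}]_{a,b \in \mathcal{N}_k(i,j)}\big]\big), \tag{17}$$
$$A(q_{ij}, k_{ab})_{\text{window}} = \mathop{\mathrm{softmax}}_{a,b \in W_k(i,j)} \left( \frac{q_{ij}^{\top} k_{ab}}{\sqrt{d}} \right), \tag{18}$$
$$A(q_{ij}, k_{ab})_{\text{global}} = \mathop{\mathrm{softmax}}_{a,b \in W} \left( \frac{q_{ij}^{\top} k_{ab}}{\sqrt{d}} \right), \tag{19}$$
where $[\cdot]$ refers to feature concatenation, $\phi(\cdot)$ represents two linear projection layers with an intermediate nonlinear activation, $W_k(i,j)$ is the specialized receptive field for each query token, and $W$ represents the whole feature map (please refer to the original papers for further details). Then, the computed attention weights can be applied to Eq.(12) and fit into the general formulation.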
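For the softmax-based variants, the main difference lies in which key positions each query attends to. The sketch below lists three such receptive fields in plain Python. It is only an illustration under stated assumptions: the function names are hypothetical, the local region is clipped at the border, and the window variant uses fixed non-overlapping windows without the shifted-window scheme of Swin-Transformer. Patchwise attention additionally changes how the weights themselves are computed and is not covered here.

```python
def local_region(i, j, H, W, k):
    """k x k neighbourhood centred on (i, j), clipped to the feature map."""
    r = k // 2
    return [(a, b) for a in range(i - r, i + r + 1)
            for b in range(j - r, j + r + 1)
            if 0 <= a < H and 0 <= b < W]

def window_region(i, j, H, W, k):
    """Non-overlapping k x k window containing (i, j); shared by all its tokens."""
    i0, j0 = (i // k) * k, (j // k) * k
    return [(a, b) for a in range(i0, min(i0 + k, H))
            for b in range(j0, min(j0 + k, W))]

def global_region(i, j, H, W):
    """Every position of the feature map, independent of the query."""
    return [(a, b) for a in range(H) for b in range(W)]

# On a 4x4 map, tokens (0, 0) and (1, 1) fall into the same 2x2 window and
# therefore share one receptive field, unlike the centred local region.
same_window = window_region(0, 0, 4, 4, 2) == window_region(1, 1, 4, 4, 2)
```

Any of these position sets can be plugged into the Stage II aggregation without touching the Stage I projections, which is why the shared-projection design carries over to these variants.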
Experiments
In this section, we empirically validate ACmix on ImageNet classification, semantic segmentation, and object detection tasks, and compare with state-of-the-art models. See Appendix for detailed dataset and training configurations.
Implementation. We implement ACmix on 4 baseline models, including ResNet, SAN, PVT and Swin-Transformer. We also compare our models with competitive baselines, i.e., SASA, LR-Net, AA-ResNet, BoTNet, T2T-ViT, ConViT, CvT, ConT and Conformer.
Results. We show the classification results in Fig.4. For ResNet-ACmix models, our model outperforms all baselines with comparable FLOPs or parameters. For example, ResNet-ACmix 26 achieves same top-1 accuracy as SASA-ResNet 50 with FLOPs. With similar FLOPs, our model outperforms SASA by . The superiority against other baselines is even larger. For SAN-ACmix, PVT-ACmix and Swin-ACmix, our models achieve consistent improvements. As a showcase, SAN-ACmix 15 outperforms SAN 19 with FLOPs. PVT-ACmix-T shows comparable performance with PVT-Large, with only FLOPs. Swin-ACmix-S achieves higher accuracy than Swin-B with FLOPs.
5.2 Downstream Tasks
Semantic Segmentation. We evaluate the effectiveness of our models on a challenging scene parsing dataset, ADE20K, and display the results on two segmentation approaches, Semantic-FPN and UperNet. Backbones are pretrained on ImageNet-1K. It is shown that ACmix achieves improvements under all settings.
Object Detection. We also conduct experiments on the COCO benchmark. Tab.3 and Tab.4 display the results of ResNet-based models and Transformer-based models with various detection heads, including RetinaNet, Mask R-CNN and Cascade Mask R-CNN. We can observe that ACmix consistently outperforms baselines with similar parameters or FLOPs. This further validates the effectiveness of ACmix when transferred to downstream tasks.
5.3 Practical Inference Speed
We further investigate the practical inference speed of our method under an Ascend 910 environment with MindSpore, a deep learning computing framework for mobile, edge, and cloud scenarios. We summarize the results in Tab.5. Compared to PVT-S, our model achieves 1.3x the fps with comparable mAP. For the larger model, the superiority is more distinct: ACmix outperforms PVT-L by 1.9 mAP with 1.8x the fps.
5.4 Ablation Study
To evaluate the effectiveness of different components in ACmix, we conduct a series of ablation studies.
Combining the output of both paths. We explore how different combinations of the convolution and self-attention outputs influence model performance. We conduct experiments with multiple combination methods and summarize the results in Tab.6. We also show the performance of models adopting only one path: Swin-T for self-attention, and Conv-Swin-T for convolution, obtained by replacing the window attention with traditional convolutions. As we can observe, the combination of convolution and self-attention modules consistently outperforms models with a single path. Fixing the ratio of convolution and self-attention for all operators also leads to worse performance. In comparison, using learned parameters gives ACmix greater flexibility, and the strengths of the convolution and self-attention paths can be adaptively adjusted according to the position of the filter in the whole network.
Group Convolution Kernels. We also conduct ablations on the choices of group convolution kernels, as shown in Sec.4.3 and Fig.3. We empirically show the effectiveness of each adaptation, and its influence on practical inference speed, in Tab.7. By substituting the tensor shifts with group convolutions, inference speed is greatly boosted. Also, using learnable convolution kernels and carefully designed initialization enhances model flexibility and contributes to the final performance.
5.5 Bias towards Different Paths
It is also worth noting that ACmix introduces two learnable scalars $\alpha, \beta$ to combine the outputs from both paths (Eq.(13)). This leads to a by-product of our module, where $\alpha$ and $\beta$ reflect the model's bias towards convolution or self-attention at different depths.
We conduct parallel experiments and show the learned parameters from different layers of SAN-ACmix and Swin-ACmix models in Fig.5. The left and middle plots show the changing tendency of rates for the self-attention and convolution paths respectively. The variation of the rates across experiments is relatively small, especially as layers go deeper. This observation shows that deep models have a stable preference between the two design patterns. A more distinct trend is shown in the right plot, where the ratio between the two paths is explicitly presented. We can see that convolution can serve as a good feature extractor at the early stages of the Transformer models. At the middle stage of the network, the model tends to leverage the mixture of both paths with an increasing bias towards convolution. At the last stage, self-attention shows superiority over convolution. This is also consistent with the design patterns in previous works, where self-attention is mostly adopted in the last stages to replace the original convolution, and convolutions at early stages are shown to be more effective for vision transformers.
Conclusion
In this paper, we explore a close relationship between two powerful techniques, convolution and self-attention. By decomposing the operations of both modules, we show that they share the same computation overhead on projecting the input feature maps. On this basis, we take a step forward and propose a hybrid operator to integrate self-attention and convolution modules by sharing the same heavy operations. Extensive results on image classification and object detection benchmarks demonstrate the effectiveness and efficiency of the proposed operator.
Acknowledgements
This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grants 2018AAA0100701, the National Natural Science Foundation of China under Grants 61906106 and 62022048, and Huawei Technologies Ltd.
References
Appendix
A. Model Architectures
We summarize the architectures of ResNet 26/38/50, SAN 10/15/19, PVT-T/S, Swin-T/S, and their respective ACmix versions in Tab.9-12. For fair comparison, we only substitute the original convolution or self-attention module with our proposed operator in the modified models.
B. Dataset and Training Setup
ImageNet. ImageNet 2012 comprises 1.28 million training images and 50,000 validation images from 1000 different classes. For ResNet-based models, we follow the training schedule in and train all the models for 100 epochs. We use SGD with batchsize 256 on 8 GPUs. Cosine learning rate is adopted with the base learning rate set to 0.1. We apply standard data augmentation, including random cropping, random horizontal flipping and normalization. We use label smoothing with coefficient 0.1. For experiments on Transformer-based models, including PVT and Swin-Transformer, we follow training configurations in the original paper.
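For reference, a cosine learning-rate schedule of the kind described above can be written in a few lines. This is a generic sketch, not the exact schedule used in the experiments: it assumes the rate is updated once per epoch, decays to zero at the end of training, and uses no warmup. The function name is hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=100, base_lr=0.1):
    """Cosine-annealed learning rate for a given (0-based) epoch."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Learning rate at the start, middle and end of a 100-epoch run.
schedule = [cosine_lr(e) for e in (0, 50, 100)]
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches zero after the final epoch.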
COCO. The COCO dataset is a standard object detection benchmark; we use a subset of 80k samples as the training set and 35k for validation. For ResNet and SAN models, we train the network with SGD on 8 GPUs with a batch size of 16. For PVT and Swin-Transformer models, we train the network with AdamW. Backbone networks are pretrained on the ImageNet dataset following the same training configurations as in the original papers. We apply standard data augmentation, that is, resizing, random flipping and normalization. The learning rate is set to 0.01 and linear warmup is used in the first 500 iterations. We follow the "1x" learning schedule to train the whole network for 12 epochs and divide the learning rate by 10 at the 8th and 11th epochs. For several transformer-based models, we follow the configurations in the original papers and additionally experiment with the "3x" schedule of 36 epochs. All mAP results in the main paper are tested with input image size (3, 1333, 800).
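The "1x" step schedule with linear warmup can be sketched as follows. Several details are assumptions made for illustration rather than settings stated in the text: `epoch` counts completed epochs (so the rate drops once 8 and 11 epochs have finished), `iteration` is the global step, and warmup starts from a hypothetical fraction `warmup_start` of the base rate.

```python
def detection_lr(epoch, iteration, base_lr=0.01, warmup_iters=500,
                 warmup_start=0.001, milestones=(8, 11)):
    """Step schedule with linear warmup; `warmup_start` is a hypothetical ratio."""
    # Divide by 10 for every milestone that has already been reached.
    lr = base_lr / (10 ** sum(epoch >= m for m in milestones))
    if iteration < warmup_iters:
        frac = iteration / warmup_iters
        lr *= warmup_start + (1.0 - warmup_start) * frac
    return lr

# Learning rate at the very first step, right after warmup, and after each drop.
lrs = [detection_lr(0, 0), detection_lr(0, 500),
       detection_lr(8, 100000), detection_lr(11, 100000)]
```

Frameworks differ in whether milestones are 0-based or 1-based and in how warmup is parameterized, so the exact step at which each change takes effect should be checked against the training code actually used.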
ADE20K. ADE20K is a widely-used semantic segmentation dataset, containing 150 categories. ADE20K has 25K images, with 20K for training, 2K for validation, and another 3K for testing. For two baseline models, PVT and Swin-Transformer, we follow the training configurations in their original paper respectively. For PVT, we implement the backbone models on the basis of Semantic FPN . We optimize the models using AdamW with an initial learning rate of 1e-4 for 80k iterations. For Swin-Transformer, we implement the backbone models on the basis of UperNet . We use the AdamW optimizer with an initial learning rate of 6e-5 and a linear warmup of 1,500 iterations. Models are trained for a total of 160K iterations. We randomly resize and crop the image to 512 × 512 for training, and rescale to have a shorter side of 512 pixels during testing.
C. Hyper-parameters
For ResNet-ACmix models, we set for all the experiments.
For SAN-ACmix models, the channel dimension for queries, keys and values are different in the original model . Given input features with channel dimension , queries and keys are projected to features with channels, while values are projected to features with channels. Therefore, when implementing our ACmix operator, we divide values into 4 groups, where the divided groups have the same channel dimension . The following self-attention and convolution operations follow the same patchwise attention in and the same designing pipeline as we stated in Sec.4, respectively.
For PVT-ACmix and Swin-ACmix models, we follow the configurations in the original model .
and is set for all experiments, unless stated otherwise.
D. Positional Encoding
Positional encoding is widely adopted in self-attention modules, while it is not used in the SAN and PVT models. Therefore, we follow this setting and only adopt positional encoding in the ResNet-ACmix and Swin-ACmix models. Specifically, the popular relative positional encoding is adopted when computing the attention weights:
$$A(q_{ij}, k_{ab}) = \mathop{\mathrm{softmax}}_{\mathcal{N}_k(i,j)} \left( \frac{q_{ij}^{\top} \big(k_{ab} + r_{a-i,\,b-j}\big)}{\sqrt{d}} \right),$$
where $q, k, r$ represent queries, keys and relative positional encodings respectively. We did not include positional encoding in the analysis of computational complexity in Tab.1 of the main paper, as patchwise attention demonstrates the effectiveness of self-attention modules without adopting it. Nevertheless, the computational cost of positional encoding is also linear with respect to the channel dimension $C$, which is comparatively minor relative to the feature projection operations. Therefore, considering the positional encoding does not affect our main statement.
E. Practical Costs for Other Models
We also summarize the practical FLOPs and Parameters for convolution, self-attention and ACmix based on various models introduced in the Experiment section. The numbers are shown in Tab.8. Similar to ResNet 50, more than of the computation are performed at Stage I of the self-attention module in SAN and Swin models. Meanwhile, it also demonstrates that ACmix only introduces minimum computational cost to integrate both convolution and self-attention modules based on various model structures.