Involution: Inverting the Inherence of Convolution for Visual Recognition
Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, Qifeng Chen
Introduction
Despite the rapid advance of neural network architectures, convolution remains the mainstay building block of deep neural networks. Drawing inspiration from classical image filtering, convolution kernels enjoy two remarkable properties that contribute to their appeal and popularity: they are spatial-agnostic and channel-specific. In the spatial extent, the former property guarantees the efficiency of convolution kernels by reusing them among different locations and pursues translation equivariance. In the channel domain, a spectrum of convolution kernels is responsible for collecting diverse information encoded in different channels, satisfying the latter property. Furthermore, since the advent of the seminal VGGNet, modern neural networks have favored compact convolution kernels, restricting their spatial span to no more than 3×3.
On the one hand, although the spatial-agnostic and spatially compact nature of convolution kernels makes sense for enhancing efficiency and explaining translation equivariance, it deprives convolution kernels of the ability to adapt to diverse visual patterns at different spatial positions. Besides, locality constrains the receptive field of convolution, posing challenges for capturing long-range spatial interactions in a single shot. On the other hand, as is well known, inter-channel redundancy inside convolution filters stands out in many successful deep neural networks, casting doubt on the large flexibility of convolution kernels with respect to different channels.
To overcome the aforementioned limitations, we present an operation, coined involution, whose inherent characteristics are symmetrically inverse to those of convolution: it is spatial-specific and channel-agnostic. Concretely, involution kernels are distinct in the spatial extent but shared across channels. Because of this spatial-specific nature, if involution kernels were parameterized as fixed-sized matrices like convolution kernels and updated using the back-propagation algorithm, the learned kernels could not be transferred between input images with variable resolutions. To handle variable feature resolutions, the involution kernel belonging to a specific spatial location can instead be generated conditioned solely on the incoming feature vector at that location, as an intuitive yet effective instantiation. Besides, we alleviate the redundancy of kernels by sharing the involution kernel along the channel dimension. Taking these two factors together, the computational complexity of an involution operation scales linearly with the number of feature channels, which allows the dynamically parameterized involution kernels to cover a wide spatial extent. By virtue of this inverted design, our proposed involution has two advantages over convolution: (i) involution can summarize the context in a wider spatial arrangement, thus overcoming the difficulty of modeling long-range interactions; (ii) involution can adaptively allocate the weights over different positions, so as to prioritize the most informative visual elements in the spatial domain.
Analogously, recent approaches have advocated going beyond convolution in favor of self-attention, for the purpose of capturing long-range dependencies. Among these works, pure self-attention can be used to construct stand-alone models with promising performance. Intriguingly, we reveal that self-attention is a particular case of our generally defined involution, with a sophisticated formulation of kernel construction. By comparison, the involution kernel adopted in this work is generated conditioned on a single pixel, rather than on its relationship with the neighboring pixels. Going one step further, we show in our experiments that even with our embarrassingly simple version, involution achieves accuracy-cost trade-offs competitive with self-attention. Being fully aware that the affinity matrix acquired by comparing the query with each key in self-attention is also an instantiation of the involution kernel, we question the necessity of composing query and key features to produce such a kernel, since our simplified involution kernel also attains decent performance while avoiding the superfluous involvement of key content, let alone the dedicated positional encoding in self-attention.
The presented involution operation readily facilitates visual recognition by embedding extendable and switchable spatial modeling into the representation learning paradigm in a fairly lightweight manner. Built upon this redesigned visual primitive, we establish a backbone architecture family, dubbed RedNet, which achieves superior performance over convolution-based ResNet and self-attention based models for image classification. On downstream tasks including detection and segmentation, we perform a comprehensive step-by-step study to inspect the effectiveness of involution on different components of detectors and segmentors, such as their backbone and neck. Involution proves helpful for each of the considered components, and combining them leads to the greatest efficiency.
In summary, our primary contributions are as follows:
We rethink the inherent properties of convolution, associated with the spatial and channel scope. This motivates our advocacy of other potential operators endowed with discrimination capability and expressiveness for visual recognition as an alternative, breaking through the existing inductive biases of convolution.
We bridge the emerging philosophy of incorporating self-attention into the learning procedure of visual representation. In this context, the desideratum of composing pixel pairs for relation modeling is challenged. Furthermore, we unify the view of self-attention and convolution through the lens of our involution.
The involution-powered architectures work universally well across a wide array of vision tasks, including image classification, object detection, instance and semantic segmentation, offering significantly better performance than the convolution-based counterparts.
Sketch of Convolution
Note that each such kernel is specific to a single feature slice along the channel dimension and is shared among all the spatial locations within that slice.
Design of Involution
Unlike convolution kernels, the shape of involution kernels depends on that of the input feature map. A natural idea is to generate the involution kernels conditioned on (part of) the original input tensor, so that the output kernels are automatically aligned with the input. We introduce a kernel generation function and abstract the functional mapping at each location as in Eqn. 5, where an index set specifies the pixels on which the kernel at that location is conditioned.
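As a rough illustration of the idea described above, the following pure-Python sketch generates a kernel at each location from that location's feature vector alone, then uses it to aggregate the surrounding pixels, sharing one kernel across the channels of a group. The two-layer generator with a ReLU in between, the zero padding at the borders, and the names `involution`, `W0`, `W1`, `K`, and `G` are assumptions made for this sketch rather than details stated in the text; a practical implementation would use a tensor library instead of nested lists.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def involution(X, W0, W1, K, G):
    """Naive involution over a feature map X of shape H x W x C (nested lists).

    W0 is a (hidden x C) matrix and W1 is a (K*K*G x hidden) matrix; together
    they form the hypothetical kernel generator used in this sketch.
    """
    H, W, C = len(X), len(X[0]), len(X[0][0])
    pad = K // 2
    per_group = C // G  # number of channels that share one kernel
    Y = [[[0.0] * C for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Kernel generation: conditioned only on the pixel at (i, j).
            hidden = [max(0.0, h) for h in matvec(W0, X[i][j])]  # ReLU (assumed)
            flat = matvec(W1, hidden)  # K*K weights for each of the G groups
            for k in range(C):
                g = k // per_group  # group whose kernel channel k uses
                acc = 0.0
                for u in range(K):
                    for v in range(K):
                        p, q = i + u - pad, j + v - pad
                        if 0 <= p < H and 0 <= q < W:  # zero padding (assumed)
                            acc += flat[(g * K + u) * K + v] * X[p][q][k]
                Y[i][j][k] = acc
    return Y


# Tiny usage example: a 3 x 3 map of ones with 2 channels and a single group.
X = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
W0 = [[1.0, 0.0]]               # reduce 2 channels to 1 hidden unit
W1 = [[1.0] for _ in range(9)]  # expand to a 3 x 3 kernel of equal weights
Y = involution(X, W0, W1, K=3, G=1)
```

With these toy weights every generated kernel consists of ones, so each output value simply counts the in-bounds neighbours of its location; real kernels would differ from pixel to pixel.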
To build the entire network with involution, we mirror the design of ResNet by stacking residual blocks, since the elegant architecture of ResNet makes it apt for incubating new ideas and making comparisons. We substitute involution for convolution at all bottleneck positions in the stem (with a kernel size that differs between classification and dense prediction) and the trunk (using involution for all tasks) of ResNet, but retain all the convolution for channel projection and fusion. These carefully redesigned entities unite to shape a new species of highly efficient backbone networks, termed RedNet.
Once spatial and channel information interweave, heavy redundancy tends to occur inside neural networks. However, the information interactions are deliberately decoupled in our RedNet towards a favorable accuracy-efficiency trade-off, as empirically evidenced in Figure 2. To be specific, the information encoded in the channel dimension of one pixel is implicitly scattered to its spatial vicinity in the kernel generation step, after which the information in an enriched receptive field is gathered thanks to the vast and dynamic involution kernels. Indispensably, linear transformations (realized by convolutions) are interspersed for channel information exchange. In short, channel-spatial, spatial-alone, and channel-alone interactions alternately and independently act on the stream of information propagation, collaboratively facilitating the miniaturization of network architectures while ensuring the representation capability.
In Context of Prior Literature
This section relates to several important aspects revolving around neural architecture in prior literature. We clarify their similarities and differences compared to our method.
As the de facto standard operator of modern vision systems, convolution possesses two principal characteristics: it is spatial-agnostic and channel-specific. Convolution kernels are location-independent in the spatial extent for translation equivariance but private to each channel for information discrimination. Along another research line, depth-wise convolution demonstrates wide applicability in efficient neural network architecture design. Depth-wise convolution is a pioneering attempt at factorizing the spatial and channel entanglement of standard convolution, and it is symmetric to our proposed involution operation: depth-wise convolution contains a set of kernels that are specific to each channel and spatially shared, while our involution kernels are shared over channels and dedicated to each planar location in the image lattice.
More recently, dynamic convolutions have emerged as powerful variants of the stationary ones. These approaches either straightforwardly generate the entire convolution filters, or parameterize the sampling grid associated with each convolution kernel. Regarding the former category, unlike ours, their dynamically generated convolution filters still conform to the two properties of standard convolution, thus incurring significant memory or computation consumption for filter generation. Regarding the latter category, only certain attributes, e.g., the footprint of convolution kernels, are determined in an adaptive fashion.
In fact, early work in the field of face recognition, DeepFace and DeepID, explored locally connected layers without weight sharing in the spatial domain, motivated by the apparently different regional distributions of statistics in face imagery. Nevertheless, such excessive relaxation of convolution parameters can be problematic for transferring knowledge from one position to others. Resembling dynamic convolutions, our involution tackles this dilemma by sharing the meta-weights of the kernel generation function across different positions, though not directly the weights of kernel instances. There also exist previous works that adopt pixel-wise dynamic kernels for feature aggregation, but they mainly capitalize on the context information for feature up-sampling and still rely on convolution for basic feature extraction. The most relevant prior work aims at substituting convolution rather than up-sampling, but its pixel-wise generated filters still inherit one original property of convolution, namely performing feature aggregation in a distinct manner over each channel.
Attention Mechanism
The attention mechanism originates from the field of machine translation and exhibits blossoming development in the arena of natural language processing. Its success has also been translated to a plethora of vision tasks, including image recognition, image generation, video understanding, object detection, and semantic segmentation. Some works sparingly insert self-attention as plugin modules into the backbone neural network or attach them on top of the backbone to extract high-level semantic relationships, retaining the substratum of convolutional features. More aggressively, other works adopt the off-the-shelf self-attention layer as the fundamental backbone component for vision. Still, limited emphasis has been laid on delving deep into the learning dynamics of this functional form compared to convolution.
Our proposed involution in Eqn. 4 is reminiscent of self-attention and can essentially be regarded as a generalized version of it. Self-attention pools values according to the affinities obtained by computing correspondences between the query and key content, as formulated in Eqn. 7, where the query, key, and value are linearly transformed from the input and the aggregation is carried out per head as in multi-head self-attention. The similarity lies in that both operators collect pixels in the neighborhood, or in a less bounded scope, through a weighted sum. On the one hand, the computing regime of involution can be considered as an attentive aggregation over the spatial domain. On the other hand, the attention map, i.e., the affinity matrix in self-attention, can be viewed as a kind of involution kernel.
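To make this correspondence concrete, the minimal sketch below treats the local query-key affinities at one position as a per-position kernel and applies it to the values through a weighted sum. It works on a 1-D sequence of scalar features with a single head; the scalar projection weights `wq`, `wk`, `wv`, the window size `K`, the zero weights outside the sequence, and the omission of softmax and positional encoding are all simplifying assumptions of this sketch, not a configuration taken from the text.

```python
def local_affinity_kernel(x, i, wq, wk, K):
    """Affinities between the query at position i and the keys in a K-wide window."""
    pad = K // 2
    q = wq * x[i]
    # Neighbours outside the sequence receive weight 0 (an assumption).
    return [q * wk * x[i + u - pad] if 0 <= i + u - pad < len(x) else 0.0
            for u in range(K)]


def apply_kernel(values, i, kernel):
    """Involution-style aggregation: weighted sum of neighbours with a per-position kernel."""
    pad = len(kernel) // 2
    return sum(w * values[i + u - pad]
               for u, w in enumerate(kernel)
               if 0 <= i + u - pad < len(values))


def local_self_attention(x, wq, wk, wv, K):
    """Local self-attention written as kernel generation followed by aggregation."""
    values = [wv * xi for xi in x]
    return [apply_kernel(values, i, local_affinity_kernel(x, i, wq, wk, K))
            for i in range(len(x))]


x = [1.0, 2.0, 3.0]
y = local_self_attention(x, wq=1.0, wk=1.0, wv=1.0, K=3)
```

The only part specific to self-attention is `local_affinity_kernel`, which needs the neighbouring keys; swapping in a generator that looks at `x[i]` alone would turn the same aggregation into the single-pixel variant discussed in this paper.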
However, the particulars of kernel generation distinguish self-attention from our materialized involution form in Eqn. 6. Previous endeavors that replace convolution with local self-attention to establish backbone models have to derive the affinity matrix (equivalent to the involution kernel in our context) from the relationship between the query and key content, optionally with hand-crafted relative positional encoding for permutation-variance. From this point of view, for self-attention, the input to the kernel generation function in Eqn. 5 becomes a set of pixels that includes both the pixel of interest and its surrounding ones. Subsequently, the function can compose all these attended pixels, in either an ordered or an unordered manner, and exploit complex relationships between them. In stark contrast, we constitute the involution kernel by operating solely on the original input pixel itself, as expressed by Eqn. 6. From the perspective of self-attention, our involution kernels explicitly rely only on the query content, while the relative positional information is implicitly encoded in the organized output form of our kernel generation function. We sacrifice pixel-paired relationship modeling, but the final performance of our RedNet is on par with that of heavily relation-based models. Therefore, we may conclude that it is the macro design principles of involution, rather than its micro setup nuances, that are instrumental in representation learning for visual understanding, as corroborated by our empirical results in the experimental part. Further strong evidence supporting our hypothesis is that using only position encoding (by replacing the content-based affinity term in Eqn. 7 with one computed from the position embedding matrix) retains decent performance of self-attention based models. Previously, this observation was interpreted as evidence of the crucial role of position encoding in self-attention, but a reinterpretation of the root cause might be that this position-based term is still a form of dynamically parameterized involution kernel.
More importantly, preceding self-attention based works seldom show their versatility across diverse vision tasks, whereas our involution paves a viable pathway for a great variety of tasks, as we shall see in Section 5.1.
Experiments
We conduct comprehensive experiments from conceptual prediction to (semi-)dense prediction. All the network models are implemented with the PyTorch library.
We perform the backbone training from scratch on the ImageNet training set, which is one of the most challenging benchmarks for object recognition to date. For a fair comparison, we adhere to the training protocol of Stand-Alone Self-Attention and Axial Attention, except that we do not use exponential moving average (EMA) over the trainable parameters during training. Following the identical recipe, we re-implement both pairwise and patchwise SAN with their open-source code (https://github.com/hszhao/SAN) as a stronger baseline, and show our reproduced results in the corresponding tables and figures. The detailed training setup is provided in the Appendix. We apply the Inception-style preprocessing for data augmentation, i.e., random resized cropping and horizontal flipping. For evaluation, we use the single-crop testing method on the validation set, following common practice.
In the same spirit as ResNet, we scale the network depth to establish our RedNet family. The comparisons to convolution and self-attention based vision models are summarized in Table 1. Within almost every group of the table, RedNet achieves the highest recognition accuracy with the most parsimonious parameter storage and computational budget. RedNet substantially outperforms ResNet across all depths. For example, RedNet-50 achieves 1.6% higher accuracy than ResNet-50, using 39.5% fewer parameters and 34.1% less computation. Moreover, RedNet-50 is on par with ResNet-101 regarding top-1 recognition accuracy, while saving 65.2% and 65.8% of the storage and computation resources respectively. For an intuitive demonstration, the corresponding accuracy-complexity envelope is illustrated in Figure 2a, where our RedNet shows the top-performing Pareto frontier, abreast of other state-of-the-art self-attention models, while being free from more complex relation modeling. Likewise, we observe a similar trend in the accuracy-parameter envelope shown in Figure 2b. It is noteworthy that RedNet strikes a better balance between parameters and complexity than top competitors like SAN and Axial ResNet, as they are enveloped by the curve of the RedNet series in either Figure 2a or 2b.
To reflect the practical runtime, we measure the inference time of different architectures with comparable performance for a single input image. We report the running time on GPU/CPU in Table 2, where RedNet demonstrates its merits in terms of wall-clock time at the same level of accuracy. A customized CUDA kernel implementation with optimized memory scheduling for involution is highly anticipated for further acceleration on GPU. Depending on how much hardware-level optimization is devoted to this new involution operator, the on-device speedup over convolution might approach the theoretical speedup in the future.
Object Detection and Instance Segmentation
Beyond fundamental image classification, we demonstrate the generalization ability of our proposed involution on downstream vision tasks, such as object detection and instance segmentation. For object detection, we employ the representative one- and two-stage detectors, RetinaNet and Faster R-CNN, both equipped with the FPN neck. For instance segmentation, we adopt the mainstream detection system, Mask R-CNN, also in combination with FPN. These three detectors with the underlying backbones, ResNet-50 or RedNet-50, are fine-tuned on the Microsoft COCO train2017 set for transferring the learned representations of images. More training details are reported in the appendix. During quantitative evaluation, we test on the val2017 set and report the COCO-style mean Average Precision (mAP) under different IoU thresholds ranging from 0.5 to 0.95 with an increment of 0.05.
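The threshold sweep behind the COCO-style metric can be sketched as follows. This is only an illustration of the averaging over IoU thresholds from 0.5 to 0.95 in steps of 0.05; it does not compute precision-recall curves, and the helper names `coco_iou_thresholds`, `box_iou`, and `coco_style_map` are hypothetical rather than part of any evaluation toolkit.

```python
def coco_iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift)."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 2) for k in range(n)]


def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def coco_style_map(ap_per_threshold):
    """Average the per-threshold AP values into a single score."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


thresholds = coco_iou_thresholds()
pred, gt = (0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 1.0)
# A prediction counts as a match at threshold t only if its IoU reaches t.
matched = [box_iou(pred, gt) >= t for t in thresholds]
```

A detection that overlaps its ground truth loosely, as in this toy pair, is counted only at the most lenient thresholds, which is why the averaged metric rewards precise localization.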
Table 3 compares our models against the baseline of a ResNet backbone with a convolution-based neck and head. First, with the RedNet backbone, all three detectors outperform their ResNet-based counterparts with considerable performance gains, i.e., 1.7%, 1.8%, and 1.8% higher bounding box AP, while being more parameter- and computation-efficient. Second, additionally swapping involution for convolution in the FPN neck brings healthy margins for Faster/Mask R-CNN, while further reducing their parameters and computational cost to 71%/73% and 65%/72%. In particular, the margins with respect to bounding box AP are enlarged to 2.5% and 2.4% respectively. Third, to build fully involution-based detectors, we further replace convolution in the task-specific heads of Faster/Mask R-CNN with involution, which cuts more than half of the computational complexity while retaining superior or on-par performance. Such fully involution-based detectors may stand out especially in cases where computational resources are the major bottleneck. Fourth, we pay special attention to the scores of small/medium/large objects and notice that the most compelling performance improvement appears in the AP of large objects. Our best detection models surpass the baselines by more than 3% bounding box AP in this regard, specifically 3.4%, 4.3%, and 3.3% for RetinaNet, Faster R-CNN, and Mask R-CNN. We hypothesize that the success in detecting large-scale objects arises from the design of spread-out and position-aware involution kernels. Beyond this metric, the performance gains are consistent across the fine-grained AP evaluation metrics shown in the different columns of Table 3.
Semantic Segmentation
To further explore the versatility of involution, we also conduct experiments on the task of semantic image segmentation. We choose the segmentation frameworks of Semantic FPN and UPerNet, loaded with ImageNet pre-trained backbone weights. We fine-tune these segmentors on the finely-annotated part of the Cityscapes dataset, which contains a split of 2975 and 500 images for training and validation respectively, divided into 19 classes. More training details can be found in the appendix. After training, we perform the evaluation on the validation set under the single-scale mode and adopt the Intersection-over-Union (IoU) as the evaluation metric.
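For reference, per-class IoU and its mean over classes can be computed from a pixel-level confusion matrix, as in the small sketch below. The function names and the toy two-class matrix are illustrative assumptions; a real evaluation on Cityscapes would accumulate a 19 x 19 matrix over the whole validation set.

```python
def per_class_iou(confusion):
    """confusion[t][p] counts pixels of true class t predicted as class p."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed pixels of class c
        fp = sum(confusion[t][c] for t in range(n)) - tp   # pixels wrongly labelled c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(confusion):
    """Mean IoU over the classes that appear in the prediction or ground truth."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # drop NaN entries
    return sum(valid) / len(valid)


# Toy example with two classes.
confusion = [[3, 1],
             [1, 5]]
ious = per_class_iou(confusion)
miou = mean_iou(confusion)
```

Because each class contributes equally to the mean regardless of its pixel count, improvements on a few large-object classes can move the mean IoU noticeably.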
Based on the Semantic FPN framework, we achieve 3.8% higher mean IoU over all classes by using RedNet instead of ResNet as the backbone. After further infusing involution into the FPN neck to replace convolution, the gain in mean IoU rises to 4.7%, while the parameters and FLOPs are cut down to 57.5% and 56.6% of the baseline model respectively. The detailed comparison results are shown in Table 4. Going one step further, we investigate the effectiveness of our method on different object classes. In line with the finding in object detection, we notice that the segmentation results for objects with a large spatial arrangement are improved by more than 10%, e.g., wall, truck, and bus, while slight improvements are observed for classes of relatively small objects, e.g., traffic light, traffic sign, person, and bicycle. Once again, the involution operation effectively aids large object perception by endowing the representation process with dynamic and distant interactions. In addition, we replace the ResNet backbone of UPerNet with RedNet and evaluate the final performance, as displayed in Table 5. Though this is not an apples-to-apples comparison using the same segmentor and training strategy, RedNet-based UPerNet appears more efficient than Axial-DeepLab, which is dedicatedly designed for segmentation tasks by converting the original Axial ResNet backbone network.
Ablation Analysis
We present several ablation studies designed to understand the contributions of individual components, taking RedNet-50 as an example.
First of all, we isolate the impact of involution on the network stem. Following the practice of recent self-attention based architectures, the network stem is decomposed into three consecutive operations to save memory cost. In accordance with our practice of integrating involution into the trunk, we place involution at the bottleneck position of the stem. This change improves the accuracy from 77.7% to 78.4% with marginal cost, leading to our default setting of RedNet in the main experiments.
Unless otherwise mentioned, we use RedNet-50 with a convolution stem for the following ablation analysis.
In the spatial dimension, we probe the effect of kernel size. Steady improvement is observed in Table 6a as the spatial extent increases, with negligible computational overhead. The improvement plateaus when the spatial extent is expanded further, which is possibly related to the feature resolution in the network. This set of controlled experiments shows the superiority of harnessing large involution kernels over compact and static convolution, without introducing prohibitive memory and computational cost.
In the channel dimension, we assess the feasibility of sharing an involution kernel. As can be seen in Table 6b, sharing a kernel per 16 channels halves the parameters and computational cost compared to the non-shared one, sacrificing only 0.2% accuracy. However, sharing a single kernel across all the channels clearly underperforms in accuracy. Considering the channel redundancy of involution kernels, as long as the number of channels shared in a group is set within an acceptable range, the channel-agnostic behavior not only preserves the performance but also reduces the parameter count and computational cost. This also permits a larger kernel size under the same budget.
Next, we validate the utility of the bottleneck architecture for the kernel generation process in Table 6c. Adopting a single linear transform, or two transforms without a bottleneck, as the kernel generation function incurs more parameters and FLOPs but performs only marginally better than the default setting. Moreover, overly aggressive channel reduction leads to inferior performance.
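The trade-offs discussed in the two paragraphs above can be illustrated with a back-of-the-envelope parameter count for the kernel generator alone. The sketch assumes a generator made of two bias-free linear maps (a channel reduction followed by an expansion to one kernel per group); the function name, the argument names, and the example sizes are hypothetical, and the numbers are not meant to reproduce Table 6, which reports whole-network figures.

```python
def kernel_generator_params(channels, kernel_size, channels_per_group, reduction):
    """Weights in an assumed two-layer, bias-free kernel generator."""
    hidden = channels // reduction             # bottleneck width
    groups = channels // channels_per_group    # number of distinct kernels per pixel
    reduce_params = channels * hidden          # first linear map
    expand_params = hidden * kernel_size * kernel_size * groups  # second linear map
    return reduce_params + expand_params


# Example with 64 channels, a 7 x 7 kernel, and a reduction ratio of 4.
shared_16 = kernel_generator_params(64, 7, channels_per_group=16, reduction=4)
non_shared = kernel_generator_params(64, 7, channels_per_group=1, reduction=4)
fully_shared = kernel_generator_params(64, 7, channels_per_group=64, reduction=4)
```

Under these assumptions the expansion layer dominates the count, and its size grows with both the squared kernel size and the number of groups, which is why sharing a kernel across more channels frees budget for a larger kernel.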
Further attaching an activation function such as softmax or sigmoid to the kernel generation function would constrain the kernel values, thus restricting its expressive capability, and ends up hurting the performance by over 1%. We therefore opt not to insert any additional functions at the output end of the kernel generation function, allowing the generated kernel to span the entire subspace of matrices.
Visualization
To dissect the learned involution kernels, we take the sum of values from each involution kernel as its representative value. The representatives at all spatial locations together form the corresponding heat map. Some selected heat maps are plotted in Figure 3, where the columns following the original image show different involution kernels in the last block of the third stage (conv3_4 in the usual layer naming convention), separated by groups. On the one hand, involution kernels automatically attend to crucial parts of objects in the spatial range for correct image recognition. On the other hand, in a single involution layer, different kernels from different groups focus on varying semantic concepts of the original image, highlighting peripheral parts, sharp edges or corners, smoother regions, and the outline of the foreground and background objects, respectively (from left to right in each row).
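The reduction from kernels to a heat map described above amounts to summing each kernel's entries, as in this short sketch. The nested-list layout `kernels[i][j]` (one K x K kernel per spatial location, for a single group) and the function name are assumptions made for illustration.

```python
def kernel_heat_map(kernels):
    """Sum each K x K kernel into one representative value per location.

    kernels[i][j] is the kernel generated at spatial location (i, j).
    """
    return [[sum(sum(row) for row in kernel) for kernel in kernel_row]
            for kernel_row in kernels]


# Toy example: a 1 x 2 grid of 2 x 2 kernels.
kernels = [[[[1, 2], [3, 4]], [[0, 0], [0, 1]]]]
heat = kernel_heat_map(kernels)
```

Repeating this per group yields one heat map per group, matching the column layout described for Figure 3.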
Conclusion and Prospect
In this work, we present involution, an effective and efficient operator for visual representation learning, reversing the design principles of convolution and generalizing the formulation of self-attention. Through the medium of involution, we are able to disclose the underlying relationship between self-attention and convolution and empirically ascertain the essential driving force behind the recent progress of self-attention in vision. Our proposed involution is evaluated on several standard vision benchmarks, consistently delivering enhanced performance at reduced cost compared to convolution-based counterparts and self-attention based models. Furthermore, careful ablation analysis helps us better understand that such performance enhancement is rooted in the core contributions of involution, from the efficacy of spatial modeling to the efficiency of architecture design.
We believe that this work could foster future research enthusiasm for simple yet effective visual primitives beyond convolution, which are expected to make inroads into fields of neural architecture engineering where uniform and local spatial modeling has long dominated.
Appendix A Implementation Details
A.1 Image Classification
In accordance with Stand-Alone Self-Attention and Axial Attention, we train all these models for 130 epochs using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0001. The learning rate starts at 0.8 and gradually approaches zero following a half-cosine schedule. The mini-batch size per GPU is set to 32 and the training procedure is conducted on 64 GPU devices in total. Label smoothing regularization is applied with a coefficient of 0.1.
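A half-cosine schedule of this kind can be sketched as below. The function name is hypothetical, and whether the rate is updated per iteration or per epoch is not stated here, so `step` and `total_steps` are left generic; the effective batch size follows directly from the per-GPU batch size and the GPU count given above.

```python
import math


def half_cosine_lr(step, total_steps, base_lr=0.8):
    """Learning rate that decays from base_lr to zero along half a cosine period."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


# Effective batch size across all devices: 32 samples on each of 64 GPUs.
global_batch_size = 32 * 64

# Example: learning rate at the start, midpoint, and end of a 130-step schedule.
lrs = [half_cosine_lr(s, 130) for s in (0, 65, 130)]
```

The schedule spends relatively long near the initial and final rates and changes fastest around the midpoint, where it passes through half of the base rate.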
A.2 Object Detection and Instance Segmentation
Following the widely adopted pipeline, the input images are resized to keep their shorter/longer side as 800/1333 pixels prior to being fed into the networks. The training procedure lasts for 12 epochs, using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0001. The initial learning rate is set to 0.02 for Faster/Mask R-CNN and 0.01 for RetinaNet with a linear warm-up period of 500 iterations, and is divided by 10 in the 8th and 11th epochs. When necessary, we moderately extend the warm-up period and apply gradient clipping for the sake of convergence stability. The detectors are trained on 8 Tesla V100 GPUs with 2 samples per GPU.
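The warm-up and step decay described above might be combined as in the following sketch. Several details are assumptions of this sketch rather than stated settings: the warm-up starting factor `warmup_start`, the convention that epochs are counted from 1 and that the decay applies from the listed epoch onward, and the function name itself.

```python
def step_lr_with_warmup(epoch, iteration, base_lr=0.02, warmup_iters=500,
                        warmup_start=0.001, decay_epochs=(8, 11)):
    """Learning rate for a 1-based epoch and a 0-based global iteration."""
    lr = base_lr
    for boundary in decay_epochs:
        if epoch >= boundary:  # assumed boundary convention
            lr /= 10.0
    if iteration < warmup_iters:
        # Linear ramp from warmup_start * lr up to lr over the warm-up period.
        alpha = iteration / warmup_iters
        lr *= warmup_start + (1.0 - warmup_start) * alpha
    return lr


# Faster/Mask R-CNN setting (base_lr=0.02); RetinaNet would pass base_lr=0.01.
lr_first_iter = step_lr_with_warmup(epoch=1, iteration=0)
lr_after_warmup = step_lr_with_warmup(epoch=1, iteration=500)
lr_late = step_lr_with_warmup(epoch=11, iteration=80000)
```

Extending the warm-up, as mentioned for stability, corresponds to raising `warmup_iters` in this sketch.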
A.3 Semantic Segmentation
The high-resolution urban scene images are randomly resized, with their aspect ratios kept in the range from 0.5 to 2.0, and input patches of a fixed size are then randomly cropped from them, followed by random horizontal flipping and a sequence of photometric distortions as data augmentation. We adopt a training schedule of 80k iterations and apply the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0005. The learning rate starts from 0.01 and anneals following the conventional “poly” policy, under which the initial learning rate is multiplied by a polynomially decaying factor at each iteration. The segmentation networks are trained on 4 Tesla V100 GPUs with 2 samples per GPU. We apply synchronized Batch Normalization for more stable estimation of the batch statistics.
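A common form of the "poly" policy is sketched below. The exponent is not recoverable from the text above, so `power=0.9` is an assumed, conventional value, and the function name is hypothetical; the initial rate and the 80k-iteration schedule are taken from the setup described here.

```python
def poly_lr(iteration, max_iters=80000, base_lr=0.01, power=0.9):
    """Initial learning rate scaled by (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power


# Learning rate at the start, halfway point, and end of training.
lr_start = poly_lr(0)
lr_half = poly_lr(40000)
lr_end = poly_lr(80000)
```

With an exponent below 1 the decay is slightly slower than linear early on and reaches zero exactly at the final iteration.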
Appendix B Comparison to State-of-the-art on COCO
For both object detection and instance segmentation on COCO, we compare our involution-based Mask R-CNN with the RedNet-50 backbone against other celebrated architectures with ResNet-50 in Table 7. Our approach performs substantially better than convolution-based Mask R-CNN equipped with self-attention blocks, such as NLNet, CCNet, and GCNet. Additionally, our method outperforms methods that embed dynamic mechanisms into the networks, including Deformable ConvNets (DCN) and Dynamic Graph Message passing Networks (DGMN). Note that all these approaches introduce extra parameters and FLOPs to the vanilla Mask R-CNN by appending complementary modules, while our proposed involution operator even reduces the complexity of the baseline by substituting convolution.
Appendix C Visualization of Segmentation
Based on the Semantic FPN framework, we provide some prediction results on the Cityscapes validation set in Figure 4. Without the help of involution, pixels of large objects are often mistaken for other objects with high similarity. For instance, the wall in the first image example is mostly confused with building by the convolution-based FPN. Some pixels of the bus in the third image example are misclassified as truck or car, distracted by the occlusion of the cyclist. In contrast, our involution-based FPN resolves these ambiguities by dynamically reasoning over an enlarged spatial range. Also, better consistency among the inner pixels of an object is observed in the segmentation results of our method, reaping the benefits of involution.
Appendix D Discussion
The topological connectivity and hyper-parameter configurations of convolutional neural networks have undergone rapid evolution, but developing brand new operators for crafting innovative architectures has attracted little attention. In this work, we hope to address this gap by disassembling the elements of convolution and reassembling them into a more effective and efficient involution. Meanwhile, one of the current frontiers of neural architecture engineering is automatically searching for network structures. Our invention can also enrich the search space of most existing Neural Architecture Search (NAS) strategies. In the near future, we look forward to discovering more effective involution-equipped neural networks with the help of NAS.