HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu

Introduction

Convolutional neural networks (CNNs) have driven remarkable progress in deep learning and computer vision since the introduction of AlexNet in the last decade. CNNs have several properties that make them naturally suitable for a wide range of vision applications. Translation equivariance introduces useful inductive biases to major vision tasks and enables transferability across different input resolutions. Highly optimized implementations make them efficient on both high-performance GPUs and edge devices. The evolution of architectures further increases their popularity on various vision tasks.

The emergence of Transformer-based architectures greatly challenges the dominance of CNNs. By combining some successful designs in CNN architectures with the new self-attention mechanism, vision Transformers have shown leading performance on various vision tasks such as image classification, object detection, semantic segmentation and video understanding. What makes vision Transformers more powerful than CNNs? Some efforts have been made to improve CNN architectures by learning from the new designs in vision Transformers. ConvNeXt presents a thorough study of adopting the meta architecture of vision Transformers to improve CNNs and proposes to use a large 7×7 kernel to construct a modern CNN. GFNet and RepLKNet propose to use even larger kernels to learn long-range relations, with global filters and up to 31×31 convolutions, respectively. Han et al. show that input-adaptive weights play a key role in vision Transformers and achieve performance similar to Swin Transformers with dynamic convolutions. However, the effectiveness of dot-product self-attention in vision tasks has not been analyzed from the perspective of high-order spatial interactions.

While there exist complex and often high-order interactions between two spatial locations in a deep model due to the non-linearity, the success of self-attention and other dynamic networks suggests that the explicit, high-order spatial interactions introduced by architectural design are beneficial to the modeling power of vision models. As illustrated in Figure 1, the plain convolution operation does not explicitly consider the spatial interactions between a spatial location (i.e., the red feature) and its neighboring region (i.e., the light gray region). Enhanced convolution operations like dynamic convolution introduce explicit spatial interactions by generating dynamic weights. The dot-product self-attention operation in Transformers consists of two successive spatial interactions, performed by matrix multiplication among queries, keys and values. This trend in the basic operations for visual modeling indicates that network capacity can be improved by increasing the order of spatial interactions.

In this paper, we summarize that the key ingredient behind the success of vision Transformers is the new way of spatial modeling with input-adaptive, long-range and high-order spatial interactions performed by the self-attention operation. While previous work has successfully migrated the meta architecture, the input-adaptive weight generation strategy and the long-range modeling ability of vision Transformers to CNN models, a higher-order spatial interaction mechanism has not been studied. We show that all three key ingredients can be efficiently implemented using a convolution-based framework. We propose the Recursive Gated Convolution ($g^n$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. Instead of simply imitating the successful designs in self-attention, $g^n$Conv has several extra favorable properties: 1) Efficient. The convolution-based implementation avoids the quadratic complexity of self-attention. The design that progressively increases the channel width while performing spatial interactions also enables us to achieve higher-order interactions with bounded complexity; 2) Extendable. We extend the two-order interaction in self-attention to arbitrary orders to further improve the modeling power. Since we do not make assumptions about the type of spatial convolution, $g^n$Conv is compatible with various kernel sizes and spatial mixing strategies; 3) Translation-equivariant. $g^n$Conv fully inherits the translation equivariance of the standard convolution, which introduces beneficial inductive biases to major vision tasks and avoids the asymmetry brought by local attention.

Based on $g^n$Conv, we construct a new family of generic vision backbones named HorNet. We conduct extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation to verify the effectiveness of our models. With the same 7×7 kernel/window and similar overall architecture and training configurations, HorNet outperforms Swin and ConvNeXt by a large margin on all tasks at different levels of complexity. The gap can be further enlarged by using a global kernel size. HorNet also shows favorable scalability to more training data and larger model sizes, attaining 87.7% top-1 accuracy on ImageNet, 57.9% mIoU on ADE20K val and 59.2% bounding box AP on COCO val with ImageNet-22K pre-training. Apart from applying $g^n$Conv in visual encoders, we further test the generality of our designs on task-specific decoders. By adding $g^n$Conv to the widely used feature fusion model FPN, we develop HorFPN to model the high-order spatial relationships of features from different hierarchical levels. We observe that HorFPN can also consistently improve various dense prediction models with lower computational costs. Our results demonstrate that $g^n$Conv can be a promising alternative to self-attention for visual modeling and effectively combines the merits of both vision Transformers and CNNs.

Related Work

Vision Transformers. The Transformer architecture was originally designed for natural language processing tasks. Since Dosovitskiy et al. showed that vision models constructed only from Transformer blocks and a patch embedding layer can also achieve performance competitive with CNNs, many new models have been proposed to modify the Transformer-based architecture and make it more suitable for various vision tasks. Different from the original design, state-of-the-art vision Transformers usually utilize a CNN-like hierarchical architecture and change the global self-attention among all patches to local self-attention to avoid the quadratic complexity. In this paper, we follow the overall architecture of previous hierarchical vision Transformers and replace the self-attention sub-layer with our proposed $g^n$Conv to fairly compare with previous Transformer-based models.

Convolution-based models. Inspired by the recent success of vision Transformers, several papers propose to adopt Transformer-style architectures and spatial convolutions with a large kernel size to improve the performance of CNNs. Han et al. replace the window self-attention in Swin Transformers with large-kernel dynamic convolutions and achieve better performance. GFNet proposes to perform global spatial interactions like vision Transformers with global filters in the frequency domain, which are equivalent to depth-wise convolutions with a global kernel size and circular padding. ConvNeXt thoroughly analyzes the designs in recent vision Transformers and presents a strong convolutional model with 7×7 depth-wise convolutions. RepLKNet explores CNN models with very large kernels (up to 31×31), showing scalability comparable to that of vision Transformers. VAN and FocalNet use gated convolutions to perform input-adaptive attention and adopt large-kernel dilated convolutions and multiple successive 3×3 convolutions, respectively, to produce the weights. Previous work focuses on the meta architecture, large-kernel designs and input-adaptive weights to improve CNNs by learning from vision Transformers. In this paper, we offer a new perspective of high-order spatial attention to analyze the merits of vision Transformers. We show that the proposed HorNet, which combines the advantages of both CNNs and vision Transformers, is a better architecture for various vision tasks.

Hybrid models. Combining vision Transformers and CNNs to develop hybrid architectures is a new direction in various visual recognition problems. Recently, several efforts have been made to integrate the two types of blocks into a unified model with a sequential or parallel design. Many enhanced vision Transformers also use lightweight convolutions in the basic building block to efficiently capture neighboring patterns or relax the quadratic complexity of self-attention. Different from these hybrid models, we aim to develop a self-attention-free model while combining the favorable properties of both vision Transformers and CNNs.

Method

In this section, we present $g^n$Conv, an efficient operation to achieve long-term and high-order spatial interactions. The $g^n$Conv is built with standard convolutions, linear projections and element-wise multiplications, but has a function of input-adaptive spatial mixing similar to self-attention.

Input-adaptive interactions with gated convolution. The recent success of vision Transformers mainly depends on the proper modeling of spatial interactions in visual data. Unlike CNNs, which simply use a static convolution kernel to aggregate neighboring features, vision Transformers apply multi-head self-attention to dynamically generate the weights that mix spatial tokens. However, the quadratic complexity of self-attention w.r.t. the input size largely hinders the application of vision Transformers, especially on downstream tasks such as segmentation and detection where higher-resolution feature maps are required. In this work, instead of reducing the complexity of self-attention like previous methods, we seek a more efficient and effective way to perform spatial interactions with simple operations like convolution and fully-connected layers.

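The basic operation of our method is the gated convolution (gConv). Let $\mathbf{x}\in\mathbb{R}^{HW\times C}$ be the input feature. The output of the gated convolution $\mathbf{y}=\text{gConv}(\mathbf{x})$ can be written as

$$[\mathbf{p}_0,\mathbf{q}_0]=\phi_{\rm in}(\mathbf{x})\in\mathbb{R}^{HW\times 2C},\qquad \mathbf{p}_1=f(\mathbf{q}_0)\odot\mathbf{p}_0\in\mathbb{R}^{HW\times C},\qquad \mathbf{y}=\phi_{\rm out}(\mathbf{p}_1)\in\mathbb{R}^{HW\times C}, \tag{3.1}$$

where $\phi_{\rm in},\phi_{\rm out}$ are linear projection layers that perform channel mixing and $f$ is a depth-wise convolution. Note that $p_{1}^{(i,c)}=\sum_{j\in\Omega_{i}}w_{i\to j}^{c}q_{0}^{(j,c)}p_{0}^{(i,c)}$, where $\Omega_{i}$ is the local window centered at $i$ and $w$ represents the convolution weight of $f$. Therefore, the above formulation explicitly introduces interactions between the neighboring features $\mathbf{p}_{0}^{(i)}$ and $\mathbf{q}_{0}^{(j)}$ through the element-wise multiplication. We consider the interaction in gConv a 1-order interaction, as each $\mathbf{p}_{0}^{(i)}$ has interacted with its neighboring feature $\mathbf{q}_{0}^{(j)}$ only once.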
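The gated convolution described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes a 1-D signal instead of a 2-D feature map, zero padding, bias-free projections, and random weights; the helper names `linear`, `depthwise_conv1d`, `gconv` and `rand_matrix` are hypothetical.

```python
import random


def linear(x, weight):
    """Channel mixing at every position. x: [L][C_in], weight: [C_in][C_out]."""
    return [[sum(row[i] * weight[i][o] for i in range(len(weight)))
             for o in range(len(weight[0]))] for row in x]


def depthwise_conv1d(x, kernels):
    """Depth-wise 1-D convolution with zero padding. x: [L][C], kernels: [C][K], K odd."""
    length, channels = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * channels for _ in range(length)]
    for i in range(length):
        for c in range(channels):
            for t, w in enumerate(kernels[c]):
                j = i + t - half
                if 0 <= j < length:  # positions outside the signal contribute zero
                    out[i][c] += w * x[j][c]
    return out


def gconv(x, w_in, kernels, w_out):
    """[p0, q0] = phi_in(x); p1 = f(q0) * p0 (element-wise); y = phi_out(p1)."""
    c = len(x[0])
    proj = linear(x, w_in)                  # [L][2C]
    p0 = [row[:c] for row in proj]          # branch used as the multiplicative gate
    q0 = [row[c:] for row in proj]          # branch that is mixed spatially by f
    fq0 = depthwise_conv1d(q0, kernels)
    p1 = [[fq0[i][ch] * p0[i][ch] for ch in range(c)] for i in range(len(x))]
    return linear(p1, w_out)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
L, C, K = 6, 4, 3
x = rand_matrix(L, C, rng)
w_in, kernels, w_out = rand_matrix(C, 2 * C, rng), rand_matrix(C, K, rng), rand_matrix(C, C, rng)
y = gconv(x, w_in, kernels, w_out)
y_doubled = gconv([[2.0 * v for v in row] for row in x], w_in, kernels, w_out)
print(len(y), len(y[0]))
```

Because the output is a product of two terms that are each linear in the input, doubling the input multiplies the output by four in this bias-free sketch, which is one simple way to see the single explicit interaction.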

High-order interactions with recursive gating. After achieving efficient 1-order spatial interactions with the gConv, we design the $g^n$Conv, a recursive gated convolution that further enhances the model capacity by introducing higher-order interactions. Formally, we first use $\phi_{\rm in}$ to obtain a set of projected features $\mathbf{p}_{0}$ and $\{\mathbf{q}_{k}\}_{k=0}^{n-1}$:

$$[\mathbf{p}_0,\mathbf{q}_0,\ldots,\mathbf{q}_{n-1}]=\phi_{\rm in}(\mathbf{x})\in\mathbb{R}^{HW\times\left(C_0+\sum_{k=0}^{n-1}C_k\right)}, \tag{3.2}$$

where $\mathbf{p}_0$ has $C_0$ channels and $\mathbf{q}_k$ has $C_k$ channels.

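We then perform the gated convolution recursively by

$$\mathbf{p}_{k+1}=f_k(\mathbf{q}_k)\odot g_k(\mathbf{p}_k)/\alpha,\qquad k=0,1,\ldots,n-1, \tag{3.3}$$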

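where we scale the output by $1/\alpha$ to stabilize the training. $\{f_{k}\}$ are a set of depth-wise convolution layers and $\{g_{k}\}$ are used to match the dimensions of different orders:

$$g_k=\begin{cases}\text{Identity}, & k=0,\\ \text{Linear}(C_{k-1},C_k), & 1\le k\le n-1.\end{cases} \tag{3.4}$$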

Finally, we feed the output of the last recursion step $\mathbf{p}_{n}$ to the projection layer $\phi_{\rm out}$ to obtain the result of the $g^n$Conv. From the recursive formula in Equation (3.3), it is easy to show that the interaction order of $\mathbf{p}_{k}$ increases by 1 after each step. As a result, the $g^n$Conv achieves $n$-order spatial interactions. It is also worth noting that we need only a single $f$ that performs the depth-wise convolution on the concatenation of the features $\{\mathbf{q}_{k}\}_{k=0}^{n-1}$, instead of computing the convolution in each recursive step as in Equation (3.3), which further simplifies the implementation and improves the efficiency on GPUs. To ensure that the high-order interactions do not introduce too much computational overhead, we set the channel dimension of each order as:

$$C_k=\frac{C}{2^{\,n-k-1}},\qquad 0\le k\le n-1. \tag{3.5}$$

This design indicates that we perform the interactions in a coarse-to-fine manner, where lower orders are computed with fewer channels. Besides, the channel dimension of $\phi_{\rm in}(\mathbf{x})$ is exactly $2C$, and the total FLOPs are strictly bounded even as $n$ increases. It can be proved that (see Appendix A):

$$\text{FLOPs}(g^n\text{Conv})<HWC\left(2K^2+\frac{11}{3}C+2\right), \tag{3.6}$$

where $K$ is the kernel size of the depth-wise convolution. Therefore, our $g^n$Conv achieves high-order interactions with a computational cost similar to that of a convolutional layer.
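The channel schedule and the bounded cost can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' derivation: it assumes the schedule $C_k = C/2^{n-k-1}$, counts one multiply-accumulate as one FLOP, ignores biases and normalization, and compares against the bound $HWC(2K^2 + \frac{11}{3}C + 2)$. The function names `order_channels`, `gnconv_macs` and `macs_bound` are hypothetical.

```python
from fractions import Fraction


def order_channels(c, n):
    """Assumed channel schedule: C_k = C / 2^(n-k-1) for k = 0..n-1."""
    return [Fraction(c, 2 ** (n - k - 1)) for k in range(n)]


def gnconv_macs(h, w, c, n, k):
    """Multiply-accumulate count of one gnConv layer on an h x w map with c channels."""
    dims = order_channels(c, n)
    hw = h * w
    proj = hw * c * (2 * c) + hw * c * c      # phi_in (C -> 2C) and phi_out (C -> C)
    dwconv = hw * k * k * sum(dims)           # one depth-wise conv over all q_k
    gate_proj = hw * sum(dims[i - 1] * dims[i] for i in range(1, n))  # g_k for k >= 1
    gate_mul = hw * sum(dims)                 # element-wise products
    return proj + dwconv + gate_proj + gate_mul


def macs_bound(h, w, c, k):
    """Order-independent upper bound HWC(2K^2 + 11/3 C + 2)."""
    return Fraction(h * w * c) * (2 * k * k + Fraction(11, 3) * c + 2)


for n in range(1, 7):
    dims = order_channels(64, n)
    ratio = gnconv_macs(14, 14, 64, n, 7) / macs_bound(14, 14, 64, 7)
    print(n, [int(d) for d in dims], float(ratio))
```

Under these assumptions the width of $\phi_{\rm in}(\mathbf{x})$, namely $C_0+\sum_k C_k$, equals $2C$ for every order, and the ratio to the bound approaches but never reaches 1 as $n$ grows.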

Long-term interactions with large kernel convolutions. Another difference between vision Transformers and conventional CNNs is the receptive field. Conventional CNNs often use 3×3 convolutions throughout the whole network, while vision Transformers calculate self-attention on the whole feature map or inside a relatively large local window (e.g., 7×7). The large receptive field in vision Transformers makes it easier to capture long-term dependencies, which is also recognized as one of the key advantages of vision Transformers. Inspired by this design, there have been some recent efforts to introduce large kernel convolutions to CNNs. To make our $g^n$Conv capable of capturing long-term interactions, we adopt two implementations for the depth-wise convolution $f$:

7×7 Convolution. 7×7 is the default window/kernel size of Swin Transformers and ConvNeXt. Prior studies show that this kernel size produces good performance on ImageNet classification and various downstream tasks. We follow this configuration to fairly compare with representative vision Transformers and modern CNNs.

Global Filter (GF). The GF layer multiplies the frequency-domain features with learnable global filters, which is equivalent to a convolution in the spatial domain with a global kernel size and circular padding. We use a modified version of the GF layer that processes half of the channels with the global filter and the other half with 3×3 depth-wise convolutions, and we only use GF layers in late stages to preserve more local details.
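The equivalence between frequency-domain filtering and circular convolution with a global kernel follows from the convolution theorem, and can be verified on a toy example. The sketch below assumes a single-channel 1-D signal and a naive DFT for self-containment; the actual GF layer works on 2-D feature maps and learns the filter directly in the frequency domain, whereas here the filter is obtained by transforming a spatial kernel so that the two paths can be compared. The function names are hypothetical.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); the inverse divides by n."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def circular_conv(x, kernel):
    """Spatial-domain convolution with a kernel as long as the signal and circular padding."""
    n = len(x)
    return [sum(kernel[m] * x[(i - m) % n] for m in range(n)) for i in range(n)]


def global_filter(x, kernel):
    """Element-wise multiplication in the frequency domain, then transform back."""
    fx, fk = dft(x), dft(kernel)
    return [v.real for v in dft([a * b for a, b in zip(fx, fk)], inverse=True)]


x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.25]   # wraps around: taps at offsets -1, 0, +1
spatial = circular_conv(x, kernel)
spectral = global_filter(x, kernel)
print(max(abs(a - b) for a, b in zip(spatial, spectral)))
```

The two results agree up to floating-point error, which is why a learnable frequency-domain filter acts like a depth-wise convolution whose kernel covers the whole feature map.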

Spatial interactions in vision models. We review some representative vision model designs from the perspective of spatial interactions, as shown in Figure 1. Specifically, we are interested in the interactions between a feature $\mathbf{x}_{i}$ and its neighboring feature $\mathbf{x}_{j}$, $j\in\Omega_{i}$. By using the tool designed for explaining the interaction effect (IE), we provide an intuitive analysis of the order of explicit spatial interactions in Appendix B. Our analysis reveals a key difference between vision Transformers and previous architectures from a new view, i.e., vision Transformers have higher-order spatial interactions in each basic block. This result inspires us to explore an architecture that can realize more efficient and effective spatial interactions with more than two orders. As discussed above, our proposed $g^n$Conv can achieve arbitrary-order interactions with bounded complexity. It is also worth noting that, similar to other scaling factors in deep models like width and depth, simply increasing the order of spatial interactions without considering the overall model capacity will not lead to a good trade-off. In this paper, we focus on developing a stronger visual modeling architecture based on the analysis of the spatial interaction orders of well-designed models. We believe a more thorough and formal discussion of high-order spatial interactions can be an important future direction.

Relation to dot-product self-attention. Although the computation of our $g^n$Conv largely differs from dot-product self-attention, we will show that $g^n$Conv also accomplishes the goal of input-adaptive spatial mixing. Let $\mathbf{M}$ be the attention matrix obtained by multi-head self-attention (MHSA); we write $\mathbf{M}$ as $(m_{ij}^{c})$ since the mixing weight may vary across channels. The spatial mixing result (before the final channel mixing projection) of the $c$-th channel at location $i$ is

$$x_{\rm SA}^{(i,c)}=\sum_{j\in\Omega}\sum_{c'=1}^{C}m_{ij}^{c}\,w_{V}^{(c',c)}\,x^{(j,c')}, \tag{3.7}$$

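where $w_{V}$ is the weight of the V-projection layer. Note that $m_{ij}$ obtained by the dot-product operation contains a 1-order interaction. On the other hand, the output of our $g^n$Conv (before $\phi_{\rm out}$) can be written as

$$x_{g^n\text{Conv}}^{(i,c)}=p_{n}^{(i,c)}=\sum_{j\in\Omega}\sum_{c'=1}^{C}w_{n-1,i\to j}^{c}\,g_{n-1}^{(i,c)}\,w_{\phi_{\rm in}}^{(c',c)}\,x^{(j,c')}\triangleq\sum_{j\in\Omega}\sum_{c'=1}^{C}h_{ij}^{c}\,w_{\phi_{\rm in}}^{(c',c)}\,x^{(j,c')}, \tag{3.8}$$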

where $w_{n-1}$ is the convolutional weight of $f_{n-1}$, $w_{\phi_{\rm in}}$ is the linear weight of $\phi_{\rm in}$, and $\mathbf{g}_{n-1}=g_{n-1}(\mathbf{p}_{n-1})$ is a projection of $\mathbf{p}_{n-1}$. From the formulation in Equation (3.8), we find that our $g^n$Conv also achieves input-adaptive spatial mixing with $\{h_{ij}^{c}\}$ as the weights. Observing that $h_{ij}$ is computed from $\mathbf{p}_{n-1}$, which contains $(n-1)$-order interactions, we can regard our $g^n$Conv as an extension of self-attention in terms of the order of the spatial mixing weight. Therefore, our $g^n$Conv can better model more complex spatial interactions.

The details of $g^n$Conv and our implementation are summarized in Figure 2.
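To complement the figure, the recursion can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: 1-D signals, zero padding, bias-free projections, random weights, the channel schedule $C_k = C/2^{n-k-1}$ with $C$ divisible by $2^{n-1}$, and a single depth-wise convolution applied to the concatenated $\mathbf{q}_k$. All helper names and the `params` layout are hypothetical.

```python
import random


def linear(x, weight):
    """Channel mixing at every position. x: [L][C_in], weight: [C_in][C_out]."""
    return [[sum(row[i] * weight[i][o] for i in range(len(weight)))
             for o in range(len(weight[0]))] for row in x]


def depthwise_conv1d(x, kernels):
    """Depth-wise 1-D convolution with zero padding. x: [L][C], kernels: [C][K], K odd."""
    half = len(kernels[0]) // 2
    out = [[0.0] * len(x[0]) for _ in range(len(x))]
    for i in range(len(x)):
        for c in range(len(x[0])):
            for t, w in enumerate(kernels[c]):
                if 0 <= i + t - half < len(x):
                    out[i][c] += w * x[i + t - half][c]
    return out


def order_channels(c, n):
    """Assumed schedule: C_k = C / 2^(n-k-1), k = 0..n-1."""
    return [c // 2 ** (n - k - 1) for k in range(n)]


def init_params(c, n, kernel_size, rng):
    dims = order_channels(c, n)

    def rand(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_in": rand(c, dims[0] + sum(dims)),                       # phi_in: C -> 2C
        "kernels": rand(sum(dims), kernel_size),                    # one f for all q_k
        "w_g": [rand(dims[k - 1], dims[k]) for k in range(1, n)],   # g_k, k >= 1
        "w_out": rand(c, c),                                        # phi_out: C -> C
    }


def gnconv(x, params, n, alpha=1.0):
    dims = order_channels(len(x[0]), n)
    proj = linear(x, params["w_in"])
    p = [row[:dims[0]] for row in proj]                             # p_0 with C_0 channels
    q_all = depthwise_conv1d([row[dims[0]:] for row in proj], params["kernels"])
    offset = 0
    for k in range(n):
        q_k = [row[offset:offset + dims[k]] for row in q_all]       # f_k(q_k)
        offset += dims[k]
        g = p if k == 0 else linear(p, params["w_g"][k - 1])        # g_0 is the identity
        p = [[q_k[i][ch] * g[i][ch] / alpha for ch in range(dims[k])]
             for i in range(len(x))]                                # p_{k+1}
    return linear(p, params["w_out"])


rng = random.Random(0)
L, C, N, K = 5, 8, 3, 3
x = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(L)]
params = init_params(C, N, K, rng)
y = gnconv(x, params, N)
y_doubled = gnconv([[2.0 * v for v in row] for row in x], params, N)
print(len(y), len(y[0]))
```

In this bias-free sketch each recursion step multiplies in one more term that is linear in the input, so the output is a polynomial of degree $n+1$ in the input; doubling the input scales the output by $2^{n+1}$.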

3.2 Model Architectures

HorNet. The $g^n$Conv can be a drop-in replacement for the spatial mixing layer in vision Transformers or modern CNNs. We follow the same meta-architecture as these models to construct HorNet, where the basic block contains a spatial mixing layer and a feed-forward network (FFN). Depending on the model size and the implementation of the depth-wise convolution $f_{k}$ in our $g^n$Conv, we have two series of model variants named HorNet-T/S/B/L$_{7\times7}$ and HorNet-T/S/B/L$_{\rm GF}$. We consider the popular Swin Transformer and ConvNeXt as the vision Transformer and CNN baselines, since our models are implemented in a convolution-based framework while having high-order interactions like vision Transformers. To fairly compare with the baselines, we directly follow the number of blocks of Swin Transformers-S/B/L but insert an extra block into stage 2 to make the overall complexity close, resulting in $[2, 3, 18, 2]$ blocks in each stage in all of the model variants. We simply adjust the base number of channels $C$ to construct models of different sizes and set the number of channels in the 4 stages to $[C, 2C, 4C, 8C]$ following common practice. We use $C=64, 96, 128, 192$ for HorNet-T/S/B/L, respectively. We set the interaction orders (i.e., the $n$ in $g^n$Conv) for each stage to 2, 3, 4, 5 by default, such that the number of channels of the coarsest order $C_{0}$ is the same across different stages.
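The stage widths and per-order channel counts implied by this configuration can be tabulated with a short script. The base widths and interaction orders are taken from the text; the channel schedule $C_k = C_{\text{stage}}/2^{n-k-1}$ is an assumption of this sketch, and the names `BASE_CHANNELS`, `ORDERS` and `stage_config` are hypothetical.

```python
BASE_CHANNELS = {"T": 64, "S": 96, "B": 128, "L": 192}   # base width C per variant
ORDERS = [2, 3, 4, 5]                                    # interaction order n per stage


def stage_config(variant):
    """Return width, order and per-order channels for each of the 4 stages."""
    c = BASE_CHANNELS[variant]
    stages = []
    for stage, n in enumerate(ORDERS):
        width = c * 2 ** stage                           # [C, 2C, 4C, 8C]
        dims = [width // 2 ** (n - k - 1) for k in range(n)]
        stages.append({"width": width, "order": n, "order_channels": dims})
    return stages


for variant in BASE_CHANNELS:
    for cfg in stage_config(variant):
        print(variant, cfg["width"], cfg["order"], cfg["order_channels"])
```

Because the width doubles at each stage while the order grows by one, the coarsest order keeps $C_0 = C/2$ channels in every stage under this schedule.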

HorFPN. Apart from using $g^n$Conv in visual encoders, we find that our $g^n$Conv can be an enhanced alternative to standard convolution that considers higher-order spatial interactions in a wide range of convolution-based models. Thus, we replace the spatial convolutions for feature fusion in the FPN with our $g^n$Conv to improve spatial interactions for downstream tasks. Specifically, we add our $g^n$Conv after the fusion of features from different pyramid levels. For object detection, we replace the 3×3 convolution after the top-down pathway with the $g^n$Conv in each level. For semantic segmentation, we simply replace the 3×3 convolution after the concatenation of the multi-level feature maps with $g^n$Conv, since the final results are directly predicted from this concatenated feature. We also have two implementations, called HorFPN$_{7\times7}$ and HorFPN$_{\rm GF}$, determined by the choice of $f_{k}$.

Experiments

We conduct extensive experiments to verify the effectiveness of our method. We present the main results on ImageNet and compare them with various architectures. We also test our models on downstream dense prediction tasks on the commonly used semantic segmentation benchmark ADE20K and the object detection dataset COCO. Lastly, we provide ablation studies of our designs and analyze the effectiveness of $g^n$Conv on a wide range of models.

Setups. We conduct image classification experiments on the widely used ImageNet dataset. We train our HorNet-T/S/B models on the standard ImageNet-1K dataset following common practice. To fairly compare with previous work, we directly reuse the training configurations of prior work to train our models. We train the models for 300 epochs with 224×224 input. To evaluate the scaling ability of our designs, we further train the HorNet-L models on the ImageNet-22K dataset, which contains over 10× more images and more categories. We follow previous practice and train our models for 90 epochs with a data augmentation strategy similar to that of the ImageNet-1K experiments. We fine-tune the models pre-trained on ImageNet-22K or at 224×224 resolution on ImageNet-1K and/or at 384×384 resolution for 30 epochs, following prior practice. When adapting the ImageNet-22K models to ImageNet-1K, we initialize the classifier with the pre-trained class centers to stabilize the training process. More details can be found in Appendix C.

Results. The results of our ImageNet classification experiments are summarized in Table 1. We see that our models achieve very competitive performance with state-of-the-art vision Transformers and CNNs. Notably, HorNet surpasses Swin Transformers and ConvNeXt which have similar overall architectures and training configurations by a healthy margin on various model sizes and settings. Our models also generalize well to a larger image resolution, larger model sizes and more training data. These results clearly demonstrate the effectiveness and generality of our designs.

4.2 Dense Prediction Tasks

HorNet for semantic segmentation. We evaluate HorNet on the semantic segmentation task on the ADE20K dataset using the commonly used UperNet framework. All models are trained for 160k iterations using the AdamW optimizer with a global batch size of 16. The image size during training is 512×512 for the ImageNet-1K pre-trained models (HorNet-T/S/B) and 640×640 for the ImageNet-22K pre-trained models (HorNet-L). The results are summarized in the left part of Table 2, where we report both the single-scale (SS) and multi-scale (MS) mIoU on the validation set. Both our HorNet$_{7\times7}$ and HorNet$_{\rm GF}$ models outperform Swin and ConvNeXt models with similar model sizes and FLOPs. Specifically, HorNet$_{\rm GF}$ models achieve better results than the HorNet$_{7\times7}$ and ConvNeXt series by large margins in single-scale mIoU, indicating that the global interactions captured by the global filter are helpful for semantic segmentation. Notably, we find that both HorNet-L$_{7\times7}$ and HorNet-L$_{\rm GF}$ even outperform ConvNeXt-XL with ~25% fewer FLOPs. These results clearly demonstrate the effectiveness and scalability of HorNet on semantic segmentation.

HorNet for object detection. We also evaluate our models on the COCO dataset. We adopt the Cascade Mask R-CNN framework to perform object detection and instance segmentation using HorNet-T/S/B/L backbones. Following Swin and ConvNeXt, we use the 3× schedule with multi-scale training. The right part of Table 2 compares the box AP and mask AP of our HorNet models and the Swin/ConvNeXt models. Similarly, our HorNet models achieve consistently and significantly better performance than the Swin/ConvNeXt counterparts in both box AP and mask AP. The HorNet$_{\rm GF}$ series obtains +1.2 to 2.0 box AP and +1.0 to 1.9 mask AP compared with ConvNeXt. Again, our large models HorNet-L$_{7\times7}$ and HorNet-L$_{\rm GF}$ outperform ConvNeXt-XL, which further validates the favorable transferability with a larger model size and a larger pre-training dataset.

HorFPN for dense prediction. We now show another application of the proposed $g^n$Conv, i.e., serving as a fusion module that better captures the higher-order interactions among different levels of features in dense prediction tasks. Specifically, we directly modify the FPN as described in Section 3.2 in UperNet and Mask R-CNN for semantic segmentation and object detection, respectively. We show the results in Table 3, where we compare the performance of our HorFPN and the standard FPN on different backbones, including ResNet-50/101, Swin-S and HorNet-S$_{7\times7}$. For semantic segmentation, we find that our HorFPN can significantly reduce the FLOPs (~50%) while achieving better validation mIoU. For object detection, our HorFPN also outperforms the standard FPN in terms of both box AP and mask AP on different backbones, with about 30G fewer FLOPs. Besides, we observe that HorFPN$_{\rm GF}$ is consistently better than HorFPN$_{7\times7}$, indicating that global interactions are also important when fusing hierarchical features.

Results with state-of-the-art frameworks. To further show the effectiveness of our backbone, we conduct experiments that combine our large HorNet model with recent state-of-the-art dense prediction frameworks, including HTC++, DINO and Mask2Former. For HTC++ and DINO, we train our models on COCO for 36 epochs (3× schedule) and do not introduce extra pre-training data like Objects365. We report the single-scale performance on the validation set and compare with several state-of-the-art methods in Table 5. For Mask2Former, we train our models on ADE20K with 640×640 input. We report the mIoU of both single-scale and multi-scale testing on the validation set in Table 5.

4.3 Analysis

Ablation study. We provide detailed ablation studies of the $g^n$Conv and our HorNet in Table 6. We first study the model designs of our HorNet in Table 6(a). Our baseline ([*]) is obtained by simply replacing the self-attention in Swin-T with a 7×7 depth-wise convolution. We first show that both SE and our $g^n$Conv with $n=1$ ($g^{\{1,1,1,1\}}$Conv) improve over the baseline model [*], and $g^{\{1,1,1,1\}}$Conv is slightly better. We then perform ablations on the interaction order $n$ for each stage and find: (1) if $n$ is shared across the 4 stages, the accuracy increases with larger $n$ but saturates at 82.5 when $n=4$; (2) a progressively increased order ($g^{\{2,3,4,5\}}$Conv) can further improve the accuracy. Our final models are built on $g^{\{2,3,4,5\}}$Conv by adjusting the depth and width of the networks (HorNet-T$_{7\times7}$) and applying the Global Filter for the depth-wise convolution (HorNet-T$_{\rm GF}$). These results clearly show that our $g^n$Conv is an efficient and extendable operation that can better capture high-order spatial interactions than both self-attention and depth-wise convolution.

$g^n$Conv for isotropic models. We also evaluate $g^n$Conv on isotropic architectures (with constant spatial resolution). We replace the self-attention in DeiT-S with our $g^n$Conv and adjust the number of blocks to 13 to obtain the isotropic HorNet-S$_{7\times7}$ and HorNet-S$_{\rm GF}$. We compare DeiT-S, isotropic ConvNeXt-S and isotropic HorNet-S in Table 6(b). While isotropic ConvNeXt-S cannot improve over DeiT-S, our isotropic HorNet surpasses DeiT-S by a large margin. These results indicate that our $g^n$Conv can better realize the functions of self-attention than plain convolutions and has a better ability to model complex spatial interactions.

$g^n$Conv for other operations. To further demonstrate the universality of $g^n$Conv, we use 3×3 depth-wise convolution and 3×3 pooling as the basic operation in the $g^n$Conv. The results in Table 6(c) show that $g^n$Conv can also improve these two operations by large margins, indicating that our $g^n$Conv is potentially more powerful when equipped with better basic operations.

Accuracy-complexity trade-offs. We visualize the accuracy-complexity trade-offs of the Swin, ConvNeXt and HorNet series in Figure 3. For fair comparison, we fix the input image size to 224×224 and use HorNet$_{7\times7}$ so that all the compared models are based on a 7×7 local window. We see that HorNet achieves better trade-offs than the representative vision Transformers and modern CNNs with regard to model size, FLOPs and GPU latency.

Visualization. We provide some visualizations of the adaptive weights learned by $g^n$Conv in Figure 4. For each sample, we show the value of $\frac{1}{C}\sum_{c=1}^{C}h_{ij}^{c}$ (see Equation (3.8) for the definition of $h_{ij}^{c}$) for two random spatial locations $i$ from layers {1, 3, 5, 7, 8, 12} of the isotropic HorNet-S model. Figure 4 demonstrates that the spatial mixing weights of our $g^n$Conv are adaptive both to input samples and to spatial locations, which further indicates that $g^n$Conv shares these two desirable characteristics with the self-attention operation.

Limitations. While HorNet shows better overall latency-accuracy trade-offs, we notice that HorNet is slower than ConvNeXt with similar FLOPs on GPU, which may be caused by the more complex designs to perform the high-order interactions. We think that developing a more hardware-friendly operation for high-order spatial interactions is an interesting future direction to improve our work.

Conclusion

We have presented the Recursive Gated Convolution ($g^n$Conv), which performs efficient, extendable, and translation-equivariant high-order spatial interactions with gated convolutions and recursive designs. $g^n$Conv can serve as a drop-in replacement for the spatial mixing layer in various vision Transformers and convolution-based models. Based on this operation, we have constructed a new family of generic vision backbones, HorNet. Extensive experiments demonstrate the effectiveness of $g^n$Conv and HorNet on commonly used visual recognition benchmarks. We hope our attempt can inspire future work to further explore high-order spatial interactions in vision models.

Acknowledgments

Jiwen Lu was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, the National Natural Science Foundation of China under Grant 62125603 and Grant U1813218, and a grant from the Beijing Academy of Artificial Intelligence (BAAI).

References

Appendix A

We divide the computation of our $g^n$Conv into 3 parts and calculate the FLOPs for each part.

Projection layers. The FLOPs of two projection layers ϕin\phi_{\rm in} and ϕout\phi_{\rm out} can be easily derived as:

Recursive Gating. We consider both the flops of the projection layer gkg_{k} and the element-wise multiplication.

Appendix B Spatial Interactions in Vision Models.

We review some representative vision model designs from the perspective of spatial interactions, as shown in Figure 1. Specifically, we are interested in the interactions between a feature $\mathbf{x}_{i}$ and its neighboring feature $\mathbf{x}_{j}$, $j\in\Omega_{i}$. Inspired by the interaction effect (IE), we consider that a binary function $F(\mathbf{x}_{i},\mathbf{x}_{j})$ that directly operates on $\mathbf{x}_{i},\mathbf{x}_{j}$ introduces an effective interaction between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ if

We now analyze the cases in Figure 1 of our main paper using the above rule. (a): Convolution. The output is $F_{i}=\sum_{j\in\Omega}w_{i\to j}\mathbf{x}_{j}$, which leads to $\operatorname{IE}(F)=\mathbf{0}$. Therefore, standard convolution introduces no interaction between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$, and we call it a 0-order interaction. (b): SE Block/Gated Convolution. In this case, we have $F_{i}=\sum_{j\in\Omega}w_{i\to j}\mathbf{x}_{j}s_{i}(\mathbf{x})$, where $s_{i}(\mathbf{x})=\frac{1}{HW}\sum_{l=1}^{HW}\mathbf{x}_{l}$ for the SE block and $s_{i}(\mathbf{x})=\mathbf{x}_{i}$ for the gated convolution. It is easy to show that $\operatorname{IE}(F)\neq\mathbf{0}$ because $\frac{\partial s_{i}}{\partial\mathbf{x}_{i}}\neq\mathbf{0}$. Hence, both operations introduce a 1-order interaction. (c): Self-attention (SA). We first denote the projected query/key/value features as $\mathbf{q},\mathbf{k},\mathbf{v}$. SA first performs a 1-order interaction by computing the attention with a dot product: $\mathbf{a}_{i}=\mathbf{q}_{i}^{\top}[\mathbf{k}_{1},\ldots,\mathbf{k}_{HW}]/\sqrt{C}$. We then view $\mathbf{a}_{i}$ as the feature at location $i$ in the following computation. The normalized $\hat{\mathbf{a}}_{i}$ is then obtained by Softmax, which does not contribute to the order since it can be viewed as an implicit interaction that does not explicitly introduce $\mathbf{x}_{j}$ to the computation. The second interaction is performed by $\mathbf{x}_{i}=\sum_{j\in\Omega}\hat{\mathbf{a}}_{i}\mathbf{v}_{j}$. To sum up, SA is a 2-order interaction. (d): $\textit{g}^{\textit{n}}$Conv. According to Section 3.1, we already know that $\textit{g}^{\textit{n}}$Conv can achieve $n$-order interaction with bounded computational cost.
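
The difference between cases (a) and (b) can be checked numerically on a toy example. The sketch below assumes that the interaction effect is measured by the mixed second derivative $\partial^{2}F/\partial\mathbf{x}_{i}\partial\mathbf{x}_{j}$, and it uses scalar features and made-up weights; the function names and values are hypothetical and serve only as an illustration.

```python
# Toy check of the interaction order of plain vs. gated convolution.
# Assumption: the interaction effect is the mixed second derivative
# d^2 F / (d x_i d x_j); features are scalars and the weights are made up.

def mixed_partial(f, xi, xj, eps=1e-3):
    """Central finite-difference estimate of d^2 f / (d xi d xj)."""
    return (
        f(xi + eps, xj + eps)
        - f(xi + eps, xj - eps)
        - f(xi - eps, xj + eps)
        + f(xi - eps, xj - eps)
    ) / (4.0 * eps * eps)

W_I, W_J = 0.7, -1.3  # hypothetical convolution weights for x_i and x_j


def plain_conv(xi, xj):
    # Case (a): a weighted sum, linear in both inputs.
    return W_I * xi + W_J * xj


def gated_conv(xi, xj):
    # Case (b) with s_i(x) = x_i: the weighted sum is gated by x_i.
    return (W_I * xi + W_J * xj) * xi


if __name__ == "__main__":
    print("plain conv :", mixed_partial(plain_conv, 0.5, 2.0))
    print("gated conv :", mixed_partial(gated_conv, 0.5, 2.0))
```

Under these assumptions the estimate vanishes for the plain convolution (up to rounding error), while for the gated convolution it equals the weight on $\mathbf{x}_{j}$, which is non-zero.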

From the above discussion, we reveal a key difference between ViTs and previous architectures from a new view, i.e., ViTs have higher-order spatial interactions in each basic block. This raises the question of whether we can achieve better accuracy-complexity trade-offs via interactions with more than 2 orders. Our proposed $\textit{g}^{\textit{n}}$Conv targets exactly this question for the first time. First, we can easily realize arbitrary $n$-order interactions as long as $1\leq n\leq 1+\log_{2}C$. Second, unlike the quadratic complexity of self-attention, the computational cost of $\textit{g}^{\textit{n}}$Conv has an upper bound with respect to the order $n$.
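
Both properties can be illustrated with a small sketch. It assumes a channel-halving scheme in which the $k$-th recursive step operates on $C_{k}=C/2^{n-k-1}$ channels ($0\leq k\leq n-1$); the helper name `channel_dims` is hypothetical.

```python
# Channel widths used by the recursive steps, under the assumption
# C_k = C / 2^(n-k-1) for k = 0, ..., n-1 (a halving scheme).

def channel_dims(C, n):
    """Return [C_0, ..., C_{n-1}] for a C-channel input and order n."""
    # C_0 = C / 2^(n-1) must be at least 1, i.e. n <= 1 + log2(C).
    if n < 1 or 2 ** (n - 1) > C:
        raise ValueError("order must satisfy 1 <= n <= 1 + log2(C)")
    return [C // 2 ** (n - k - 1) for k in range(n)]


if __name__ == "__main__":
    C = 64
    for n in (1, 2, 3, 5, 7):
        dims = channel_dims(C, n)
        # The total width is a geometric sum and never reaches 2 * C.
        print(n, dims, sum(dims), "<", 2 * C)
```

Under this assumption the smallest width $C_{0}$ reaches 1 when $n=1+\log_{2}C$, which gives the upper limit on $n$, and the total width $\sum_{k}C_{k}$ is a geometric sum that stays below $2C$ for every valid $n$.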

In our implementation of $\textit{g}^{\textit{n}}$Conv, the higher-order spatial interactions are based on the gating mechanism, which has also been investigated in LSTM and some vision modules. However, these previous methods can only achieve up to 2-order interactions and did not fully reveal the potential of higher-order interactions. In contrast, our $\textit{g}^{\textit{n}}$Conv can be extended to arbitrary higher-order spatial interactions under a controllable computational budget.

Appendix C Implementation Details

To better verify the effectiveness of our new designs, we introduce minimal changes to the overall architecture of Swin Transformers. Specifically, we make two changes: 1) we add one block in stage 2 to make the overall computation and parameter count close to previous models; 2) we use the LayerScale technique to make our models more stable during training, following the practice of ConvNeXt. Note that both changes have been applied to the baseline model considered in our ablation study to clearly show the effects of our designs. The detailed architectures of ConvNeXt, Swin Transformers and HorNet are summarized in Table 7.

C.2 Experimental Settings for Image Classification.

ImageNet-1K training. ImageNet-1K is a widely used large-scale benchmark for image classification, which contains around 1.2 million images from 1,000 categories. Following common practice, we train our models on the training set of ImageNet and report the single-crop top-1 accuracy on the 50,000 validation images. To fairly compare with our baseline methods (i.e., Swin Transformers and ConvNeXt), we follow most of the training details of ConvNeXt and make several small modifications to make the training configurations suitable for our models. For HorNet with 7$\times$7 convolutions, we find that applying gradient clipping with a maximal norm of 5 significantly stabilizes the training process, which may be due to the large gradients brought by the high-order structures in our models. For HorNet with global filters, we use stronger regularization strategies since we find that larger kernels improve the model capacity but may also cause more severe overfitting. Specifically, we set the maximal gradient norm to 1 and use more aggressive RandAug data augmentation strategies (i.e., we adjust the magnitudes for the tiny, small and base models to 9, 12 and 15, respectively). We set the stochastic depth coefficient of the HorNet-T/S/B models to 0.2, 0.4 and 0.5, respectively. The other details are identical to ConvNeXt. Our models are trained using 32 NVIDIA A100 GPUs with a global batch size of 4096.
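
The settings above can be collected in a small configuration, together with a minimal sketch of gradient clipping by global norm. The dictionary keys and the helper `clip_by_global_norm` are hypothetical names chosen for illustration, and the clipping rule is the standard global-norm formulation rather than code taken from our training scripts; in practice a framework utility would be used.

```python
import math

# Hyper-parameters stated in the text; the key names are hypothetical.
IMAGENET1K_CONFIG = {
    "global_batch_size": 4096,
    "stochastic_depth": {"T": 0.2, "S": 0.4, "B": 0.5},
    "7x7": {"max_grad_norm": 5.0},
    "GF": {
        "max_grad_norm": 1.0,
        "randaug_magnitude": {"T": 9, "S": 12, "B": 15},
    },
}


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so that its L2 norm does not
    exceed max_norm. Returns the clipped values and the original norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    grads = [30.0, 40.0]  # toy gradient with L2 norm 50
    for variant in ("7x7", "GF"):
        max_norm = IMAGENET1K_CONFIG[variant]["max_grad_norm"]
        clipped, norm = clip_by_global_norm(grads, max_norm)
        print(variant, "norm before:", norm, "clipped:", clipped)
```

Gradients whose norm is already below the threshold are left unchanged, so the smaller threshold used with global filters only takes effect on steps with unusually large gradients.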

ImageNet-22K training. ImageNet-22K is a larger dataset that contains more than 21k classes and around 14M images. We use the subset suggested by prior work, since the winter 2021 release is the version that is currently accessible. We also follow prior practice and remove categories with few images, resulting in roughly half as many categories and only 13% fewer images compared to the original dataset. We follow previous practice to train our models for 90 epochs and use a data augmentation strategy similar to that of the ImageNet-1K experiments. We set the stochastic depth coefficient to 0.2. We also set the maximal gradient norm to 5 and 1 for our large models with standard 7$\times$7 convolutions and global filters, respectively, and adjust the weight decay to 0.1. The other details are identical to ConvNeXt. We also fine-tune our best model, HorNet-L$_{\rm GF}$, on 384$\times$384 images on ImageNet-22K for 10 epochs to compete with state-of-the-art models on downstream tasks. The model is only used in the experiments in Appendix 5.

ImageNet-1K fine-tuning. We fine-tune the models pre-trained on ImageNet-22K or at the 224$\times$224 resolution on ImageNet-1K and/or at the 384$\times$384 resolution for 30 epochs with a batch size of 512 and a cosine learning rate schedule with an initial learning rate of 5e-5. We set the weight decay to 1e-6 and disable MixUp and CutMix, following prior work. We initialize the ImageNet-1K classifier with the corresponding classifier weights for ImageNet-22K classes to further stabilize the training process.
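
A cosine learning rate schedule with these values can be sketched as follows. The sketch assumes per-epoch updates, no warm-up, and decay to zero; none of these three choices is specified above, and the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=30, base_lr=5e-5, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch.

    Assumptions (not stated in the text): per-epoch updates, no warm-up,
    and decay from base_lr down to min_lr = 0.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 5, 15, 25, 30):
        print(epoch, cosine_lr(epoch))
```

The schedule starts at the initial learning rate, passes through half of it at the midpoint, and reaches the minimum at the final epoch.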

C.3 Experimental Settings for Downstream Tasks.

Object detection and instance segmentation on COCO. We adopt the widely used Cascade Mask R-CNN framework to perform object detection and instance segmentation on COCO, following Swin and ConvNeXt. Our backbones are pre-trained on ImageNet-1K for HorNet-T/S/B and on ImageNet-22K for HorNet-L. We use the 3$\times$ schedule, where we train all of our models for 36 epochs with the AdamW optimizer and a global batch size of 16. We set the learning rate as {2e-4, 2e-4, 2e-4, 1e-4} and the stochastic depth rate as {0.4, 0.6, 0.7, 0.7} for HorNet-T/S/B/L. We set the weight decay as 0.05 for all the models.

Semantic Segmentation on ADE20K. We use the UperNet 160K framework for semantic segmentation on ADE20K. We use a global batch size of 16 and train all the models for 160K iterations with the AdamW optimizer. We use 512$\times$512 images for the ImageNet-1K pre-trained HorNet-T/S/B and 640$\times$640 images for the ImageNet-22K pre-trained HorNet-L. We set the learning rate as 1e-4 and the weight decay as 0.05 for all the models. We report the mIoU of both single-scale and multi-scale testing on the validation set.

Appendix D More Analysis

Comparisons with state-of-the-art methods on ImageNet. Our experiments are designed to clearly verify the superiority of our design over previous basic operations like plain convolution and self-attention. We therefore choose to follow the basic architecture and the training configuration of the widely used Swin Transformers and ConvNeXt. As a result, there is still substantial room to further improve the performance on ImageNet-1K. We notice that some recent works like Pale Transformers and Dynamic Group Transformer, with hybrid architectures or more careful designs, achieve better performance than HorNet on ImageNet-1K. We think many techniques that have been used in previous work can be useful to further improve our models, including further optimized overall architectures (e.g., optimized depth/width for each stage), better patch embedding strategies (e.g., overlapping convolutional layers for input embedding and downsampling), more efficient ways to compute adaptive weights (e.g., using downsampled features to produce attention weights), and more advanced training methods and hybrid architectures (e.g., combining $\textit{g}^{\textit{n}}$Conv with self-attention and plain convolutions).

Throughput analysis. We provide the detailed throughput statistics of our models and several baseline methods in Table 8. Apart from ConvNeXt and Swin Transformers, we also compare our models with the recent MViTv2 models. The multiple small matrix multiplications introduced by $\textit{g}^{\textit{n}}$Conv affect the speed of our models on GPU. We observe that our method is slower than ConvNeXt by 7% to 15% with similar FLOPs. Meanwhile, thanks to the highly efficient depth-wise convolution implementation in cuDNN, our models achieve similar or slightly faster speed than typical vision Transformers with similar FLOPs. Notably, as shown in Figure 3(c), the higher classification accuracy helps our models achieve better speed-accuracy trade-offs than ConvNeXt and Swin Transformers. Therefore, we believe the speed of our method is still competitive with these recent models.

Effects of $\alpha$. We find that re-scaling the output of the gated convolution avoids the large values produced by the recursive process and stabilizes the training process. We analyze the effects of $\alpha$ in our ImageNet experiments based on HorNet-B$_{7\times 7}$. The results are summarized in Table 9. We see that $\alpha=3$ leads to the best performance. Therefore, we set $\alpha$ to 3 in all our models.

Effects of activation functions in gated convolutions. The gated convolutions used in our models can be viewed as a type of channel attention that uses different attention weights in different locations and generates weights based on spatial interactions. Previous channel attention methods like SE-Net usually add a sigmoid function to the attention weights to generate bounded attention. Therefore, we investigate several possible activation functions in our models. The results are presented in Table 10. We see that the version described in our paper (i.e., no activation function) achieves the best performance. The result also suggests that $\textit{g}^{\textit{n}}$Conv exhibits a different behavior from conventional channel attention methods. Since the gating weights are critical components in our models, activation functions that can cause significant information loss, like GELU and sigmoid, severely hurt the performance.
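
The kind of information loss referred to above can be seen on a toy example. The sketch below applies three candidate activations to a few hypothetical gate values before they multiply a feature; the values and names are illustrative only and are not taken from our experiments.

```python
import math


def identity(g):
    # No activation: the gate keeps its sign and magnitude.
    return g


def sigmoid(g):
    # Bounded to (0, 1): the gate can no longer flip or amplify a feature.
    return 1.0 / (1.0 + math.exp(-g))


def gelu(g):
    # Exact GELU: negative gate values are squashed towards zero.
    return 0.5 * g * (1.0 + math.erf(g / math.sqrt(2.0)))


if __name__ == "__main__":
    gates = [-2.0, -0.5, 0.5, 2.0]  # hypothetical gating weights
    feature = 1.0                   # hypothetical feature value
    for name, act in (("none", identity), ("sigmoid", sigmoid), ("gelu", gelu)):
        print(name, [round(feature * act(g), 3) for g in gates])
```

Without an activation, the gated output preserves the sign and scale of the gating weights, whereas sigmoid compresses them into (0, 1) and GELU maps negative values close to zero.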