HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser-Nam Lim, Jiwen Lu

Introduction

Convolutional neural networks (CNNs) have driven remarkable progress in deep learning and computer vision since the introduction of AlexNet in the last decade. CNNs have several properties that make them naturally suitable for a wide range of vision applications. Translation equivariance introduces useful inductive biases to major vision tasks and enables transferability across different input resolutions. Highly optimized implementations make them efficient on both high-performance GPUs and edge devices. The evolution of architectures has further increased their popularity on various vision tasks.

The emergence of Transformer-based architectures greatly challenges the dominance of CNNs. By combining some successful designs in CNN architectures and the new self-attention mechanism, vision Transformers have shown leading performance on various vision tasks such as image classification, object detection, semantic segmentation and video understanding. What makes vision Transformers more powerful than CNNs? Some efforts have been made to improve CNN architectures by learning from the new designs in vision Transformers. ConvNeXt presents a thorough study that adopts the meta architecture of vision Transformers to improve CNNs and proposes to use a large 7 $\times$ 7 kernel to construct a modern CNN. GFNet and RepLKNet propose to use even larger kernels to learn long-range relations, with global filters and up to 31 $\times$ 31 convolutions, respectively. Han et al. show that input-adaptive weights play a key role in vision Transformers and achieve performance similar to Swin Transformers with dynamic convolutions. However, the effectiveness of dot-product self-attention in vision tasks has not been analyzed from the perspective of high-order spatial interactions.

While there exist complex and often high-order interactions between two spatial locations in a deep model due to the non-linearity, the success of self-attention and other dynamic networks suggests that the explicit and high-order spatial interactions introduced by the architectural designs are beneficial to improving the modeling power of vision models. As illustrated in Figure 1, the plain convolution operation does not explicitly consider the spatial interactions between a spatial location (i.e., the red feature) and its neighboring region (i.e., the light gray region). Enhanced convolution operations like dynamic convolution introduce explicit spatial interaction by generating dynamic weights. The dot-product self-attention operation in Transformers consists of two successive spatial interactions, performed by matrix multiplication among queries, keys and values. The trend of the basic operations for visual modeling indicates that the network capacity can be improved by increasing the order of spatial interactions.

In this paper, we summarize that the key ingredient behind the success of vision Transformers is the new way of spatial modeling with input-adaptive, long-range and high-order spatial interactions performed by the self-attention operation. While previous work has successfully migrated the meta architecture, input-adaptive weight generation strategy and large-range modeling ability of vision Transformers to CNN models, a higher-order spatial interaction mechanism has not been studied. We show that all three key ingredients can be efficiently implemented using a convolution-based framework. We propose the Recursive Gated Convolution ( $\textit{g}^{\textit{n}}$ Conv) that performs high-order spatial interactions with gated convolutions and recursive designs. Instead of simply imitating the successful designs in self-attention, $\textit{g}^{\textit{n}}$ Conv has several extra favorable properties: 1) Efficient. The convolution-based implementation avoids the quadratic complexity of self-attention. The design that progressively increases the channel width while performing spatial interactions also enables us to achieve higher-order interactions with bounded complexity; 2) Extendable. We extend the two-order interaction in self-attention to arbitrary orders to further improve the modeling power. Since we do not make assumptions on the type of spatial convolution, $\textit{g}^{\textit{n}}$ Conv is compatible with various kernel sizes and spatial mixing strategies; 3) Translation-equivariant. $\textit{g}^{\textit{n}}$ Conv fully inherits the translation equivariance of the standard convolution, which introduces beneficial inductive biases to major vision tasks and avoids the asymmetry brought by local attention.

Based on $\textit{g}^{\textit{n}}$ Conv, we construct a new family of generic vision backbones named HorNet. We conduct extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation to verify the effectiveness of our models. With the same 7 $\times$ 7 kernel/window and similar overall architecture and training configurations, HorNet outperforms Swin and ConvNeXt by a large margin on all tasks at different levels of complexity. The gap can be further enlarged by using a global kernel size. HorNet also shows favorable scalability to more training data and larger model size, attaining 87.7% top-1 accuracy on ImageNet, 57.9% mIoU on ADE20K val and 59.2% bounding box AP on COCO val with ImageNet-22K pre-training. Apart from applying $\textit{g}^{\textit{n}}\text{Conv}$ in visual encoders, we further test the generality of our designs on task-specific decoders. By adding $\textit{g}^{\textit{n}}\text{Conv}$ to the widely used feature fusion model FPN, we develop HorFPN to model the high-order spatial relationships of features from different hierarchical levels. We observe that HorFPN can also consistently improve various dense prediction models with lower computational costs. Our results demonstrate that $\textit{g}^{\textit{n}}$ Conv can be a promising alternative to self-attention for visual modeling and effectively combines the merits of both vision Transformers and CNNs.

Related Work

Vision Transformers. The Transformer architecture was originally designed for natural language processing tasks. Since Dosovitskiy et al. showed that vision models constructed only from Transformer blocks and a patch embedding layer can also achieve performance competitive with CNNs, many new models have been proposed to modify the Transformer-based architecture and make it more suitable for various vision tasks. Different from the original designs, state-of-the-art vision Transformers usually utilize a CNN-like hierarchical architecture and change the global self-attention among all patches to local self-attention to avoid the quadratic complexity. In this paper, we follow the overall architecture of the previous hierarchical vision Transformers and replace the self-attention sub-layer with our proposed $\textit{g}^{\textit{n}}$ Conv to fairly compare with the previous Transformer-based models.

Convolution-based models. Inspired by the recent success of vision Transformers, several papers propose to adopt the Transformer-style architecture and spatial convolutions with a large kernel size to improve the performance of CNNs. Han et al. replace the window self-attention in Swin Transformers with large-kernel dynamic convolutions and achieve better performance. GFNet proposes to perform global spatial interactions like vision Transformers with global filters in the frequency domain, which are equivalent to depth-wise convolutions with a global kernel size and circular padding. ConvNeXt thoroughly analyzes the designs in recent vision Transformers and presents a strong convolutional model with 7 $\times$ 7 depth-wise convolutions. RepLKNet explores CNN models with very large kernels (up to 31 $\times$ 31), showing scalability as good as that of vision Transformers. VAN and FocalNet use gated convolutions to perform input-adaptive attention and adopt large-kernel dilated convolutions and multiple successive 3 $\times$ 3 convolutions, respectively, to produce the weights. Previous work focuses on the meta architecture, large-kernel designs and input-adaptive weights to improve CNNs by learning from vision Transformers. In this paper, we offer a new perspective of high-order spatial attention to analyze the merits of vision Transformers. We show that the proposed HorNet, which combines the advantages of both CNNs and vision Transformers, is a better architecture for various vision tasks.

Hybrid models. Combining vision Transformers and CNNs to develop hybrid architectures is a new direction in various visual recognition problems. Recently, several efforts have been made to integrate the two types of blocks into a unified model with a sequential or parallel design. Many enhanced vision Transformers also use lightweight convolutions in the basic building block to efficiently capture neighboring patterns or relax the quadratic complexity of self-attention. Different from these hybrid models, we aim to develop a self-attention-free model while combining the favorable properties of both vision Transformers and CNNs.

Method

In this section, we present $\textit{g}^{\textit{n}}\text{Conv}$ , an efficient operation to achieve long-term and high-order spatial interactions. The $\textit{g}^{\textit{n}}\text{Conv}$ is built with standard convolutions, linear projections and element-wise multiplications, yet performs input-adaptive spatial mixing similar to self-attention.

Input-adaptive interactions with gated convolution. Recent success in vision Transformers mainly depends on the proper modeling of the spatial interactions in visual data. Unlike CNNs that simply use the static convolution kernel to aggregate neighboring features, vision Transformers apply multi-head self-attention to dynamically generate the weights to mix spatial tokens. However, the quadratic complexity w.r.t. the input size of the self-attention largely hinders the application of vision Transformers, especially on downstream tasks including segmentation and detection where higher-resolution feature maps are required. In this work, instead of reducing the complexity of self-attention like previous methods, we seek a more efficient and effective way to perform spatial interactions with simple operations like convolution and fully-connected layers.

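Let $\mathbf{x}\in\mathbb{R}^{HW\times C}$ be the input feature. The output $\mathbf{y}$ of the gated convolution (gConv) can be written as

$$\left[\mathbf{p}_{0}^{HW\times C},\ \mathbf{q}_{0}^{HW\times C}\right]=\phi_{\rm in}(\mathbf{x})\in\mathbb{R}^{HW\times 2C},\qquad\mathbf{p}_{1}=f(\mathbf{q}_{0})\odot\mathbf{p}_{0}\in\mathbb{R}^{HW\times C},\qquad\mathbf{y}=\phi_{\rm out}(\mathbf{p}_{1})\in\mathbb{R}^{HW\times C},\tag{3.1}$$

where $\phi_{\rm in},\phi_{\rm out}$ are linear projection layers to perform channel mixing and $f$ is a depth-wise convolution. Note that $p_{1}^{(i,c)}=\sum_{j\in\Omega_{i}}w_{i\to j}^{c}q_{0}^{(j,c)}p_{0}^{(i,c)}$ , where $\Omega_{i}$ is the local window centered at $i$ and $w$ represents the convolution weight of $f$ . Therefore, the above formulation explicitly introduces interactions among the neighboring features $\mathbf{p}_{0}^{(i)}$ and $\mathbf{q}_{0}^{(j)}$ through the element-wise multiplication. We consider the interaction in gConv as a 1-order interaction, as each $\mathbf{p}_{0}^{(i)}$ has interacted with its neighbor feature $\mathbf{q}_{0}^{(j)}$ only once.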

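High-order interactions with recursive gating. After achieving efficient 1-order spatial interactions with the gConv, we then design the $\textit{g}^{\textit{n}}\text{Conv}$ , a recursive gated convolution to further enhance the model capacity by introducing higher-order interactions. Formally, we first use $\phi_{\rm in}$ to obtain a set of projected features $\mathbf{p}_{0}$ and $\{\mathbf{q}_{k}\}_{k=0}^{n-1}$ :

$$\left[\mathbf{p}_{0}^{HW\times C_{0}},\ \mathbf{q}_{0}^{HW\times C_{0}},\ \ldots,\ \mathbf{q}_{n-1}^{HW\times C_{n-1}}\right]=\phi_{\rm in}(\mathbf{x})\in\mathbb{R}^{HW\times\left(C_{0}+\sum_{0\leq k\leq n-1}C_{k}\right)},\tag{3.2}$$

where $C_{k}$ denotes the channel dimension of the $k$ -th order.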

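We then perform the gated convolution recursively by

$$\mathbf{p}_{k+1}=f_{k}(\mathbf{q}_{k})\odot g_{k}(\mathbf{p}_{k})/\alpha,\qquad k=0,1,\ldots,n-1,\tag{3.3}$$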

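where we scale the output by $1/\alpha$ to stabilize the training. $\{f_{k}\}$ are a set of depth-wise convolution layers and $\{g_{k}\}$ are used to match the dimension in different orders:

$$g_{k}=\begin{cases}\text{Identity},&k=0,\\ \text{Linear}(C_{k-1},C_{k}),&1\leq k\leq n-1.\end{cases}\tag{3.4}$$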

Finally, we feed the output of the last recursion step $\mathbf{p}_{n}$ to the projection layer $\phi_{\rm out}$ to obtain the result of the $\textit{g}^{\textit{n}}\text{Conv}$ . From the recursive formula in Equation (3.3), it is easy to show that the interaction order of $\mathbf{p}_{k}$ increases by 1 after each step. As a result, the $\textit{g}^{\textit{n}}\text{Conv}$ achieves $n$ -order spatial interactions. It is also worth noting that we need only a single $f$ to perform depth-wise convolution on the concatenation of the features $\{\mathbf{q}_{k}\}_{k=0}^{n-1}$ instead of computing the convolution in each recursive step as in Equation (3.3), which further simplifies the implementation and improves the efficiency on GPUs. To ensure that the high-order interactions do not introduce too much computational overhead, we set the channel dimension in each order as:

$$C_{k}=\frac{C}{2^{n-k-1}},\qquad 0\leq k\leq n-1.\tag{3.5}$$

This design indicates that we perform the interactions in a coarse-to-fine manner, where lower orders are computed with fewer channels. Besides, the channel dimension of $\phi_{\rm in}(\mathbf{x})$ is exactly $2C$ and the total FLOPs can be strictly bounded even with $n$ increasing. It can be proved that (see Appendix A):

$$\text{FLOPs}(\textit{g}^{\textit{n}}\text{Conv})<HWC\left(2K^{2}+\frac{11}{3}C+2\right),\tag{3.6}$$

where $K$ is the kernel size of the depth-wise convolution. Therefore, our $\textit{g}^{\textit{n}}\text{Conv}$ achieves high-order interactions with a similar computational cost to a convolutional layer.
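The bounded cost can be checked numerically with a small counter. The following is a minimal sketch, not the authors' code: it assumes the halving channel schedule $C_{k}=C/2^{n-k-1}$, counts one multiply-accumulate as one FLOP, and assumes the bound takes the form $HWC(2K^{2}+\frac{11}{3}C+2)$. The function names `channel_dims`, `gnconv_flops` and `flops_bound` are hypothetical.

```python
def channel_dims(C, n):
    """Assumed coarse-to-fine widths C_k = C / 2^(n-k-1) for k = 0..n-1."""
    assert C % 2 ** (n - 1) == 0, "C must be divisible by 2^(n-1)"
    return [C // 2 ** (n - k - 1) for k in range(n)]


def gnconv_flops(H, W, C, n, K):
    """Multiply-accumulate count of one g^nConv layer under the assumptions above."""
    dims = channel_dims(C, n)
    hw = H * W
    projections = hw * C * (2 * C) + hw * C * C        # phi_in (C -> 2C), phi_out (C -> C)
    depthwise = hw * K * K * sum(dims)                 # one K x K filter per q channel
    gates = sum(hw * dims[k - 1] * dims[k] for k in range(1, n))  # g_k: C_{k-1} -> C_k
    products = sum(hw * width for width in dims)       # element-wise multiplications
    return projections + depthwise + gates + products


def flops_bound(H, W, C, K):
    """Assumed upper bound HWC(2K^2 + 11C/3 + 2); it does not depend on n."""
    return H * W * C * (2 * K * K + 11 * C / 3 + 2)


for order in range(1, 6):
    cost = gnconv_flops(56, 56, 64, order, 7)
    print(order, cost, cost < flops_bound(56, 56, 64, 7))
```

Under this schedule the widths of $\mathbf{p}_{0}$ and all $\mathbf{q}_{k}$ add up to $2C$ for every $n$, which is why the cost of $\phi_{\rm in}$ stays fixed while the order grows.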

Long-term interactions with large kernel convolutions. Another difference between vision Transformers and conventional CNNs is the receptive field. Conventional CNNs often use 3 $\times$ 3 convolution through the whole network, while vision Transformers calculate self-attention on the whole feature maps or inside a relatively large local window (e.g., 7 $\times$ 7). The large receptive field in vision Transformers makes it easier to capture long-term dependencies, which is also recognized as one of the key advantages of vision Transformers. Inspired by this design, there have been some recent efforts to introduce large kernel convolutions to CNNs. To make our $\textit{g}^{\textit{n}}\text{Conv}$ capable of capturing long-term interactions, we adopt two implementations for the depth-wise convolution $f$ :

7 $\times$ 7 Convolution. 7 $\times$ 7 is the default window/kernel size of Swin Transformers and ConvNeXt. Previous studies show that this kernel size produces good performance on ImageNet classification and various downstream tasks. We follow this configuration to fairly compare with representative work of vision Transformers and modern CNNs.

Global Filter (GF). The GF layer multiplies the frequency domain features with learnable global filters, which is equivalent to a convolution in the spatial domain with a global kernel size and circular padding. We use a modified version of the GF layer by processing half of the channels with the global filter and the other half with 3 $\times$ 3 depth-wise convolutions and only use GF layers in late stages to preserve more local details.
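The stated equivalence between frequency-domain filtering and circular convolution can be illustrated in one dimension. This is a minimal sketch under simplifying assumptions: a single channel, a 1-D signal instead of a 2-D feature map, and a naive DFT written out by hand. The names `dft`, `global_filter` and `circular_conv` are hypothetical and are not taken from any GF implementation.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); the inverse divides by n."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [value / n for value in out] if inverse else out


def global_filter(signal, freq_filter):
    """Multiply the spectrum by a filter, then transform back to the spatial domain."""
    spectrum = dft(signal)
    filtered = [s * f for s, f in zip(spectrum, freq_filter)]
    return [value.real for value in dft(filtered, inverse=True)]


def circular_conv(signal, kernel):
    """Spatial convolution with a kernel as long as the signal and wrap-around padding."""
    n = len(signal)
    return [sum(kernel[t] * signal[(i - t) % n] for t in range(n))
            for i in range(n)]


signal = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
kernel = [0.2, -0.1, 0.0, 0.4, 0.3, -0.5]
via_frequency = global_filter(signal, dft(kernel))
via_spatial = circular_conv(signal, kernel)
print(max(abs(a - b) for a, b in zip(via_frequency, via_spatial)))
```

The two results agree up to floating-point error, which is the 1-D form of the convolution theorem. In the sketch the frequency filter is derived from a spatial kernel only to make the comparison possible; a learnable global filter would be parameterized directly in the frequency domain.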

Spatial interactions in vision models. We review some representative vision model designs from the perspective of spatial interactions, as shown in Figure 1. Specifically, we are interested in the interactions between a feature $\mathbf{x}_{i}$ and its neighboring feature $\mathbf{x}_{j},j\in\Omega_{i}$ . By using the tool designed for explaining the interaction effect (IE), we provide an intuitive analysis of the order of explicit spatial interactions in Appendix B. Our analysis reveals a key difference between vision Transformers and previous architectures from a new view, i.e., vision Transformers have higher-order spatial interactions in each basic block. The result inspires us to explore an architecture that can realize more efficient and effective spatial interactions with more than two orders. As discussed above, our proposed $\textit{g}^{\textit{n}}\text{Conv}$ can achieve arbitrary-order interactions with bounded complexity. It is also worth noting that, similar to other scaling factors in deep models like width and depth, simply increasing the order of spatial interactions without considering the overall model capacity will not lead to a good trade-off. In this paper, we focus on developing a stronger visual modeling architecture based on the analysis of the spatial interaction orders of well-designed models. We believe a more thorough and formal discussion on the high-order spatial interactions can be an important future direction.

Relation to dot-product self-attention. Although the computation of our $\textit{g}^{\textit{n}}\text{Conv}$ largely differs from dot-product self-attention, we will show that $\textit{g}^{\textit{n}}\text{Conv}$ also accomplishes the goal of input-adaptive spatial mixing. Let $\mathbf{M}$ be the attention matrix obtained by multi-head self-attention (MHSA). We write $\mathbf{M}$ as $(m_{ij}^{c})$ since the mixing weight may vary across the channels. The spatial mixing result (before the final channel mixing projection) of the $c$ -th channel at location $i$ is

$$x_{\rm SA}^{(i,c)}=\sum_{j\in\Omega}\sum_{c'=1}^{C}m_{ij}^{c}\,w_{V}^{(c',c)}\,x^{(j,c')},\tag{3.7}$$

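where $w_{V}$ is the weight of the V-projection layer. Note that $m_{ij}$ obtained by the dot-product operation contains 1-order interaction. On the other hand, the output of our $\textit{g}^{\textit{n}}\text{Conv}$ (before the $\phi_{\rm out}$ ) can be written as

$$x_{\textit{g}^{\textit{n}}\text{Conv}}^{(i,c)}=p_{n}^{(i,c)}=\sum_{j\in\Omega_{i}}\sum_{c'=1}^{C}\frac{1}{\alpha}\,w_{n-1,i\to j}^{c}\,g_{n-1}^{(i,c)}\,w_{\phi_{\rm in}}^{(c',c)}\,x^{(j,c')}\triangleq\sum_{j\in\Omega_{i}}\sum_{c'=1}^{C}h_{ij}^{c}\,w_{\phi_{\rm in}}^{(c',c)}\,x^{(j,c')},\tag{3.8}$$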

where $w_{n-1}$ is the convolutional weight for $f_{n-1}$ , $w_{\phi_{\rm in}}$ is the linear weight of $\phi_{\rm in}$ , and $\mathbf{g}_{n-1}=g_{n-1}(\mathbf{p}_{n-1})$ is a projection of $\mathbf{p}_{n-1}$ . From the formulation in Equation (3.8) we find our $\textit{g}^{\textit{n}}\text{Conv}$ also achieves input-adaptive spatial mixing with $\{h_{ij}^{c}\}$ as the weights. Observing that $h_{ij}$ is computed from $\mathbf{p}_{n-1}$ which contains $n-1$ order interactions, we can regard our $\textit{g}^{\textit{n}}\text{Conv}$ as an extension of the self-attention in terms of the order of the spatial mixing weight. Therefore, our $\textit{g}^{\textit{n}}\text{Conv}$ can better model more complex spatial interactions.

The details of $\textit{g}^{\textit{n}}\text{Conv}$ and our implementation are summarized in Figure 2.
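For readers who prefer code to a diagram, the recursion can be sketched in plain Python. This is a toy illustration under stated assumptions, not the released implementation: features are 1-D sequences stored as `[channels][positions]`, the depth-wise convolution uses zero padding, there are no biases, the channel widths are assumed to halve toward lower orders ($C_{k}=C/2^{n-k-1}$), and all helper names (`channel_dims`, `linear`, `depthwise_conv`, `make_params`, `gn_conv`) are hypothetical.

```python
import random


def channel_dims(C, n):
    """Assumed coarse-to-fine widths C_k = C / 2^(n-k-1), k = 0..n-1."""
    assert C % 2 ** (n - 1) == 0, "C must be divisible by 2^(n-1)"
    return [C // 2 ** (n - k - 1) for k in range(n)]


def linear(x, weight):
    """Channel mixing. x: [C_in][L], weight: [C_out][C_in] -> [C_out][L]."""
    length = len(x[0])
    return [[sum(w * x[c][i] for c, w in enumerate(row)) for i in range(length)]
            for row in weight]


def depthwise_conv(x, kernels):
    """Per-channel 1-D convolution with zero padding. kernels: [C][K], K odd."""
    out = []
    for channel, kernel in zip(x, kernels):
        half, length = len(kernel) // 2, len(channel)
        out.append([sum(kernel[t] * channel[i + t - half]
                        for t in range(len(kernel))
                        if 0 <= i + t - half < length)
                    for i in range(length)])
    return out


def make_params(C, n, K=3, seed=0):
    """Random toy weights (hypothetical helper, not a trained model)."""
    rng = random.Random(seed)
    dims = channel_dims(C, n)

    def mat(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    return {
        "phi_in": mat(2 * C, C),                       # C -> 2C
        "dw": mat(sum(dims), K),                       # one kernel per q channel
        "g": [mat(dims[k], dims[k - 1]) for k in range(1, n)],
        "phi_out": mat(C, C),                          # C -> C
    }


def gn_conv(x, params, n, alpha=1.0):
    """Toy recursive gated convolution on x: [C][L]."""
    dims = channel_dims(len(x), n)
    fused = linear(x, params["phi_in"])
    p, q_all = fused[:dims[0]], fused[dims[0]:]
    # A single depth-wise convolution over the concatenated q features.
    q_all = depthwise_conv(q_all, params["dw"])
    start = 0
    for k, width in enumerate(dims):
        q_k = q_all[start:start + width]
        start += width
        if k > 0:                                      # g_k maps C_{k-1} -> C_k
            p = linear(p, params["g"][k - 1])
        # Gating step: p_{k+1} = f_k(q_k) * g_k(p_k) / alpha
        p = [[a * b / alpha for a, b in zip(pc, qc)] for pc, qc in zip(p, q_k)]
    return linear(p, params["phi_out"])


x = [[random.Random(c).uniform(-1, 1) for _ in range(6)] for c in range(8)]
y = gn_conv(x, make_params(C=8, n=3), n=3)
print(len(y), len(y[0]))
```

Because this sketch has no biases or non-linearities, doubling the input multiplies the output by $2^{n+1}$: each of the $n$ gating steps multiplies in one more input-dependent factor, which is a quick way to see the interaction order grow by one per step.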

3.2 Model Architectures

HorNet. The $\textit{g}^{\textit{n}}\text{Conv}$ can be a drop-in replacement of the spatial mixing layer in vision Transformers or modern CNNs. We follow the same meta-architecture as previous hierarchical vision Transformers to construct HorNet, where the basic block contains a spatial mixing layer and a feed-forward network (FFN). Depending on the model size and the implementation of the depth-wise convolution $f_{k}$ in our $\textit{g}^{\textit{n}}\text{Conv}$ , we have two series of model variants named HorNet-T/S/B/L $_{7\times 7}$ and HorNet-T/S/B/L $_{\rm GF}$ . We consider the popular Swin Transformer and ConvNeXt as the vision Transformer and CNN baselines since our models are implemented based on a convolution-based framework while having high-order interactions like vision Transformers. To fairly compare with the baselines, we directly follow the number of blocks of Swin Transformers-S/B/L but insert an extra block into stage 2 to make the overall complexity close, resulting in $[2,3,18,2]$ blocks in each stage in all of the model variants. We simply adjust the base number of channels $C$ to construct models with different sizes and set the number of channels in the 4 stages as $[C,2C,4C,8C]$ following common practice. We use $C=64,96,128,192$ for HorNet-T/S/B/L, respectively. We set the interaction orders (i.e., the $n$ in $\textit{g}^{\textit{n}}\text{Conv}$ ) for each stage as 2, 3, 4, 5 by default, such that the number of channels of the coarsest order $C_{0}$ is the same across different stages.
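The claim that the default orders keep $C_{0}$ constant across stages follows from simple arithmetic, sketched below. The sketch assumes the coarsest order has $C_{0}=C_{\rm stage}/2^{n-1}$ channels, where $C_{\rm stage}$ is the stage width; `hornet_stage_config` is a hypothetical helper, not part of any released code.

```python
def hornet_stage_config(base_channels, orders=(2, 3, 4, 5)):
    """Per-stage width, interaction order and coarsest-order width C_0."""
    configs = []
    for stage, order in enumerate(orders):
        channels = base_channels * 2 ** stage          # [C, 2C, 4C, 8C]
        coarsest = channels // 2 ** (order - 1)        # assumed C_0 = C_stage / 2^(n-1)
        configs.append({"stage": stage + 1, "channels": channels,
                        "order": order, "C0": coarsest})
    return configs


for name, base in zip("TSBL", (64, 96, 128, 192)):
    print(name, [(cfg["channels"], cfg["order"], cfg["C0"])
                 for cfg in hornet_stage_config(base)])
```

The stage width doubles each time the order increases by one, so the two factors of two cancel and $C_{0}=C/2$ in every stage.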

HorFPN. Apart from using $\textit{g}^{\textit{n}}\text{Conv}$ in visual encoders, we find our $\textit{g}^{\textit{n}}\text{Conv}$ can be an enhanced alternative for standard convolution that considers higher-order spatial interactions in a wide range of convolution-based models. Thus, we replace spatial convolutions for feature fusion in the FPN with our $\textit{g}^{\textit{n}}\text{Conv}$ to improve spatial interactions for downstream tasks. Specifically, we add our $\textit{g}^{\textit{n}}\text{Conv}$ after the fusion of features from different pyramid levels. For object detection, we replace the 3 $\times$ 3 convolution after the top-down pathway with the $\textit{g}^{\textit{n}}\text{Conv}$ in each level. For semantic segmentation, we simply replace the 3 $\times$ 3 convolution after the concatenation of the multi-level feature maps with $\textit{g}^{\textit{n}}\text{Conv}$ since the final results are directly predicted from this concatenated feature. We also have two implementations called HorFPN $_{7\times 7}$ and HorFPN $_{\rm GF}$ , determined by the choice of $f_{k}$ .

Experiments

We conduct extensive experiments to verify the effectiveness of our method. We present the main results on ImageNet and compare them with various architectures. We also test our models on downstream dense prediction tasks, using the commonly used semantic segmentation benchmark ADE20K and the object detection dataset COCO. Lastly, we provide ablation studies of our designs and analyze the effectiveness of $\textit{g}^{\textit{n}}\text{Conv}$ on a wide range of models.

Setups. We conduct image classification experiments on the widely used ImageNet dataset. We train our HorNet-T/S/B models using the standard ImageNet-1K dataset following common practice. To fairly compare with previous work, we directly use the training configurations of prior work to train our models. We train the models for 300 epochs with $224\times 224$ input. To evaluate the scaling ability of our designs, we further train the HorNet-L models on the ImageNet-22K dataset, which contains over $10\times$ more images and more categories. We follow previous practice to train our models for 90 epochs and use a data augmentation strategy similar to that of the ImageNet-1K experiments. We fine-tune the models pre-trained on ImageNet-22K and/or at $224\times 224$ resolution on ImageNet-1K and/or at $384\times 384$ resolution for 30 epochs, following prior work. When adapting the ImageNet-22K models to ImageNet-1K, we initialize the classifier with the pre-trained class centers to stabilize the training process. More details can be found in Appendix C.

Results. The results of our ImageNet classification experiments are summarized in Table 1. We see that our models achieve very competitive performance with state-of-the-art vision Transformers and CNNs. Notably, HorNet surpasses Swin Transformers and ConvNeXt which have similar overall architectures and training configurations by a healthy margin on various model sizes and settings. Our models also generalize well to a larger image resolution, larger model sizes and more training data. These results clearly demonstrate the effectiveness and generality of our designs.

4.2 Dense Prediction Tasks

HorNet for semantic segmentation. We evaluate our HorNet on the semantic segmentation task on the ADE20K dataset using the commonly used UperNet framework. All the models are trained for 160k iterations using the AdamW optimizer with a global batch size of 16. The image size during training is $512\times 512$ for ImageNet-1K pre-trained models (HorNet-T/S/B) and $640\times 640$ for the ImageNet-22K pre-trained models (HorNet-L). The results are summarized in the left part of Table 2, where we report both the single-scale (SS) and multi-scale (MS) mIoU on the validation set. Both our HorNet $_{7\times 7}$ and HorNet $_{\rm GF}$ models outperform Swin and ConvNeXt models with similar model sizes and FLOPs. Specifically, HorNet $_{\rm GF}$ models achieve better results than the HorNet $_{7\times 7}$ and ConvNeXt series by large margins in single-scale mIoU, indicating that the global interactions captured by the global filter are helpful for semantic segmentation. Notably, we find both our HorNet-L $_{7\times 7}$ and HorNet-L $_{\rm GF}$ even outperform ConvNeXt-XL with $\sim$ 25% fewer FLOPs. These results clearly demonstrate the effectiveness and scalability of our HorNet on semantic segmentation.

HorNet for object detection. We also evaluate our models on the COCO dataset. We adopt the cascade Mask R-CNN framework to perform object detection and instance segmentation using HorNet-T/S/B/L backbones. Following Swin and ConvNeXt, we use the $3\times$ schedule with multi-scale training. The right part of Table 2 compares the box AP and mask AP of our HorNet models and Swin/ConvNeXt models. Similarly, our HorNet models achieve consistently and significantly better performance than the Swin/ConvNeXt counterparts, in both box AP and mask AP. The HorNet $_{\rm GF}$ series obtains +1.2 $\sim$ 2.0 box AP and +1.0 $\sim$ 1.9 mask AP compared with ConvNeXt. Again, our large models HorNet-L $_{7\times 7}$ and HorNet-L $_{\rm GF}$ can outperform ConvNeXt-XL, which further validates the favorable transferability with a larger model size and a larger pre-training dataset.

HorFPN for dense prediction. We now show another application of the proposed $\textit{g}^{\textit{n}}\text{Conv}$ , i.e., to serve as a better fusion module that can better capture the higher-order interactions among different levels of features in dense prediction tasks. Specifically, we directly modify the FPN as described in Section 3.2 in UperNet and Mask R-CNN for semantic segmentation and object detection, respectively. We show the results in Table 3, where we compare the performance of our HorFPN and standard FPN on different backbones including ResNet-50/101, Swin-S and HorNet-S $_{7\times 7}$ . For semantic segmentation, we find our HorFPN can significantly reduce the FLOPs ( $\sim$ 50%) while achieving better validation mIoU. For object detection, our HorFPN can also outperform standard FPN in terms of both box AP and mask AP on different backbones with about 30G fewer FLOPs. Besides, we observe that HorFPN $_{\rm GF}$ is consistently better than HorFPN $_{7\times 7}$ , indicating that global interactions are also important when fusing hierarchical features.

Results with state-of-the-art frameworks. To further show the effectiveness of our backbone, we conduct experiments to combine our large HorNet model with recent state-of-the-art dense prediction frameworks including HTC++, DINO and Mask2Former. For HTC++ and DINO, we train our models on COCO for 36 epochs (3 $\times$ schedule) and do not introduce extra pre-training data like Objects365. We report the single-scale performance on the validation set and compare with several state-of-the-art methods in Table 5. For Mask2Former, we train our models on ADE20K with $640\times 640$ input. We report the mIoU of both single-scale and multi-scale testing on the validation set in Table 5.

4.3 Analysis

Ablation study. We provide detailed ablation studies of the $\textit{g}^{\textit{n}}\text{Conv}$ and our HorNet in Table 6. We first study the model designs of our HorNet in Table 6(a). Our baseline ([*]) is obtained by simply replacing the self-attention with 7 $\times$ 7 depth-wise convolution in Swin-T. We first show that both SE and our $\textit{g}^{\textit{n}}\text{Conv}$ with $n=1$ ( $\textit{g}^{\text{\{1,1,1,1\}}}$ Conv) can improve over the baseline model [*], and $\textit{g}^{\text{\{1,1,1,1\}}}$ Conv is slightly better. We then perform ablations on the interaction order $n$ for each stage and find: (1) if $n$ is shared across the 4 stages, the accuracy will increase with larger $n$ but saturate at 82.5 when $n=4$ ; (2) progressively increased order ( $\textit{g}^{\text{\{2,3,4,5\}}}$ Conv) can further improve the accuracy. Our final models are built on $\textit{g}^{\text{\{2,3,4,5\}}}$ Conv by adjusting the depth and width of the networks (HorNet-T $_{7\times 7}$ ) and applying Global Filter for the depth-wise convolution (HorNet-T $_{\rm GF}$ ). These results clearly show that our $\textit{g}^{\textit{n}}\text{Conv}$ is an efficient and extendable operation that can better capture high-order spatial interactions than both self-attention and depth-wise convolution.

$\textit{g}^{\textit{n}}\text{Conv}$ for isotropic models. We also evaluate $\textit{g}^{\textit{n}}\text{Conv}$ on isotropic architectures (with constant spatial resolutions). We replace the self-attention in DeiT-S with our $\textit{g}^{\textit{n}}\text{Conv}$ and adjust the number of blocks to 13 to obtain the isotropic HorNet-S $_{7\times 7}$ and HorNet-S $_{\rm GF}$ . We compare DeiT-S, isotropic ConvNeXt-S and isotropic HorNet-S in Table 6(b). While isotropic ConvNeXt-S cannot improve over DeiT-S, our isotropic HorNet surpasses DeiT-S by a large margin. These results indicate that our $\textit{g}^{\textit{n}}\text{Conv}$ can better realize the functions of self-attention compared to plain convolutions and has a better ability to model complex spatial interactions.

$\textit{g}^{\textit{n}}\text{Conv}$ for other operations. To further demonstrate the universality of $\textit{g}^{\textit{n}}\text{Conv}$ , we use 3 $\times$ 3 depth-wise convolution and 3 $\times$ 3 pooling as the basic operation in the $\textit{g}^{\textit{n}}\text{Conv}$ . The results in Table 6(c) show that $\textit{g}^{\textit{n}}\text{Conv}$ can also improve these two operations by large margins, indicating our $\textit{g}^{\textit{n}}\text{Conv}$ is potentially more powerful when equipped with some better basic operations.

Accuracy-complexity trade-offs. We visualize accuracy-complexity trade-offs of the Swin, ConvNeXt and HorNet series in Figure 3. For fair comparisons, we fix the input image size to $224\times 224$ and use HorNet $_{7\times 7}$ such that all the compared models are based on a 7 $\times$ 7 local window. We see HorNet can achieve better trade-offs than the representative vision Transformers and modern CNNs with regard to model size, FLOPs and GPU latency.

Visualization. We provide some visualizations of the adaptive weights learned by $\textit{g}^{\textit{n}}\text{Conv}$ in Figure 4. For each sample, we show the value of $\frac{1}{C}\sum_{c=1}^{C}h_{ij}^{c}$ (see Equation (3.8) for the definition of $h_{ij}^{c}$ ) for two random spatial locations $i$ from layers {1, 3, 5, 7, 8, 12} of the isotropic HorNet-S model. Figure 4 demonstrates that the spatial mixing weights of our $\textit{g}^{\textit{n}}\text{Conv}$ are adaptive both to input samples and spatial locations, which further indicates that $\textit{g}^{\textit{n}}\text{Conv}$ shares these two desirable characteristics with the self-attention operation.

Limitations. While HorNet shows better overall latency-accuracy trade-offs, we notice that HorNet is slower than ConvNeXt with similar FLOPs on GPU, which may be caused by the more complex designs to perform the high-order interactions. We think that developing a more hardware-friendly operation for high-order spatial interactions is an interesting future direction to improve our work.

Conclusion

We have presented the Recursive Gated Convolution ( $\textit{g}^{\textit{n}}$ Conv) that performs efficient, extendable, and translation-equivariant high-order spatial interactions with gated convolutions and recursive designs. $\textit{g}^{\textit{n}}$ Conv can serve as a drop-in replacement of the spatial mixing layer in various vision Transformers and convolution-based models. Based on the operation, we have constructed a new family of generic vision backbones, HorNet. Extensive experiments demonstrate the effectiveness of $\textit{g}^{\textit{n}}$ Conv and HorNet on commonly used visual recognition benchmarks. We hope our attempt can inspire future work to further explore the high-order spatial interactions in vision models.

Acknowledgments

Jiwen Lu was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, the National Natural Science Foundation of China under Grant 62125603 and Grant U1813218, and a grant from the Beijing Academy of Artificial Intelligence (BAAI).

References

Appendix A Computational Complexity of $\textit{g}^{\textit{n}}\text{Conv}$ .

We divide the computation of our $\textit{g}^{\textit{n}}\text{Conv}$ into 3 parts and calculate the FLOPs for each part.

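Projection layers. Since $\phi_{\rm in}$ maps $C$ channels to $2C$ channels and $\phi_{\rm out}$ maps $C$ channels to $C$ channels, the FLOPs of the two projection layers can be easily derived as:

$$\text{FLOPs}(\phi_{\rm in})=2HWC^{2},\qquad\text{FLOPs}(\phi_{\rm out})=HWC^{2}.$$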

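Recursive Gating. We consider both the FLOPs of the projection layer $g_{k}$ and the element-wise multiplication. With $C_{k}$ denoting the channel dimension of the $k$ -th order, the projection $g_{k}$ (for $1\leq k\leq n-1$ ) costs $HWC_{k-1}C_{k}$ and the element-wise multiplication at order $k$ (for $0\leq k\leq n-1$ ) costs $HWC_{k}$ .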
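Putting the parts together gives the bound quoted in the main text. The derivation below is a sketch under two assumptions that are not restated in this appendix: the channel widths follow $C_{k}=C/2^{n-k-1}$, and a single depth-wise convolution with kernel size $K$ is applied to the concatenation of all $\mathbf{q}_{k}$ (the third part of the computation).

$$\begin{aligned}
\text{FLOPs}(\phi_{\rm in})+\text{FLOPs}(\phi_{\rm out})&=3HWC^{2},\\
\text{FLOPs}(f)&=HWK^{2}\sum_{k=0}^{n-1}C_{k}=HWK^{2}C\left(2-2^{1-n}\right)<2HWCK^{2},\\
\sum_{k=1}^{n-1}\text{FLOPs}(g_{k})&=HW\sum_{k=1}^{n-1}C_{k-1}C_{k}=\tfrac{2}{3}HWC^{2}\left(1-4^{1-n}\right)<\tfrac{2}{3}HWC^{2},\\
\sum_{k=0}^{n-1}HWC_{k}&=HWC\left(2-2^{1-n}\right)<2HWC,\\
\text{FLOPs}(\textit{g}^{\textit{n}}\text{Conv})&<HWC\left(2K^{2}+\tfrac{11}{3}C+2\right).
\end{aligned}$$

Each sum is a geometric series in $n$ , so every term stays below a constant that does not depend on the order.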

Appendix B Spatial Interactions in Vision Models.

We review some representative vision model designs from the perspective of spatial interactions, as shown in Figure 1. Specifically, we are interested in the interactions between a feature $\mathbf{x}_{i}$ and its neighbor feature $\mathbf{x}_{j},j\in\Omega_{i}$ . Inspired by the interaction effect (IE), we consider that a binary function $F(\mathbf{x}_{i},\mathbf{x}_{j})$ which directly operates on $\mathbf{x}_{i},\mathbf{x}_{j}$ introduces an effective interaction between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ if its interaction effect, taken here as the mixed second-order derivative with respect to the two features, is non-zero:

$$\operatorname{IE}(F)=\frac{\partial^{2}F}{\partial\mathbf{x}_{i}\,\partial\mathbf{x}_{j}}\neq\mathbf{0}.$$

We now analyze the cases in Figure 1 of our main paper using the above rule. (a): Convolution. The output is $F_{i}=\sum_{j\in\Omega}w_{i\to j}\mathbf{x}_{j}$, which leads to $\operatorname{IE}(F)=\mathbf{0}$. Therefore, standard convolution introduces no interaction between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$, and we call it a 0-order interaction. (b): SE Block/Gated Convolution. In this case, we have $F_{i}=\sum_{j\in\Omega}w_{i\to j}\mathbf{x}_{j}s_{i}(\mathbf{x})$, where $s_{i}(\mathbf{x})=\frac{1}{HW}\sum_{l=1}^{HW}x_{l}$ for the SE block and $s_{i}(\mathbf{x})=\mathbf{x}_{i}$ for the gated convolution. It is easy to show $\operatorname{IE}(F)\neq\mathbf{0}$ because $\frac{\partial s_{i}}{\partial\mathbf{x}_{i}}\neq\mathbf{0}$. Hence, these two operations both introduce a 1-order interaction. (c): Self-attention (SA). We first denote the projected query/key/value features as $\mathbf{q},\mathbf{k},\mathbf{v}$. SA first performs a 1-order interaction by computing the attention with the dot product: $\mathbf{a}_{i}=\mathbf{q}_{i}^{\top}[\mathbf{k}_{1},\ldots,\mathbf{k}_{HW}]/\sqrt{C}$. We then view $\mathbf{a}_{i}$ as the feature at location $i$ in the following computation. The normalized $\hat{\mathbf{a}}_{i}$ is then obtained by Softmax, which does not contribute to the order since it can be viewed as an implicit interaction that does not explicitly introduce $\mathbf{x}_{j}$ to the computation. The second interaction is performed by $\mathbf{x}_{i}=\sum_{j\in\Omega}\hat{\mathbf{a}}_{i}\mathbf{v}_{j}$. To sum up, SA is a 2-order interaction. (d): $\textit{g}^{\textit{n}}\text{Conv}$. According to Section 3.1, we already know that $\textit{g}^{\textit{n}}\text{Conv}$ can achieve $n$-order interaction with bounded computational cost.
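Cases (a) and (b) can be checked numerically on a toy example. The sketch below assumes that the interaction effect is measured by the mixed second-order partial derivative $\partial^{2}F/\partial\mathbf{x}_{i}\partial\mathbf{x}_{j}$, and it uses scalar features with two fixed weights. The helper names and the weight values are hypothetical and only serve the illustration.

```python
def mixed_partial(f, a, b, h=1e-3):
    """Central finite-difference estimate of d^2 f / (da db) at (a, b)."""
    return (
        f(a + h, b + h) - f(a + h, b - h) - f(a - h, b + h) + f(a - h, b - h)
    ) / (4 * h * h)


# Hypothetical fixed convolution weights for locations i and j.
W_I, W_J = 0.5, -1.5


def plain_conv(x_i, x_j):
    # Case (a): a weighted sum with input-independent weights.
    return W_I * x_i + W_J * x_j


def gated_conv(x_i, x_j):
    # Case (b): the same weighted sum, gated by s_i(x) = x_i.
    return (W_I * x_i + W_J * x_j) * x_i


ie_conv = mixed_partial(plain_conv, 0.7, -0.3)
ie_gated = mixed_partial(gated_conv, 0.7, -0.3)
print(ie_conv, ie_gated)
```

The estimate vanishes (up to floating-point error) for the plain convolution, whereas for the gated convolution it is close to the weight `W_J`, since the gate multiplies the term $w_{j}\mathbf{x}_{j}$ by $\mathbf{x}_{i}$.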

From the above discussion, we reveal a key difference between ViTs and previous architectures from a new view, i.e., ViTs have higher-order spatial interactions in each basic block. This raises the question of whether we can achieve better accuracy-complexity trade-offs via interactions of order higher than 2. Our proposed $\textit{g}^{\textit{n}}\text{Conv}$ targets exactly this question for the first time. First, we can easily realize arbitrary $n$-order interaction as long as $1\leq n\leq 1+\log_{2}C$. Second, unlike the quadratic complexity of self-attention, the computational cost of $\textit{g}^{\textit{n}}\text{Conv}$ has an upper bound w.r.t. the order $n$.
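The constraint on $n$ can be read off from the channel widths used at each recursive step. The sketch below assumes the halving schedule $C_{k}=C/2^{n-1-k}$ for $0\leq k\leq n-1$, which is not restated in this appendix. The function names are hypothetical.

```python
def max_order(C):
    """Largest n with 1 <= n <= 1 + log2(C), using floor(log2(C)) for integers."""
    assert C >= 1
    # For a positive integer, bit_length() - 1 equals floor(log2(C)).
    return 1 + (C.bit_length() - 1)


def channel_dims(C, n):
    """Assumed per-step widths C_k = C / 2^(n-1-k), narrowest first."""
    if not 1 <= n <= max_order(C):
        raise ValueError("order n must satisfy 1 <= n <= 1 + log2(C)")
    return [C // 2 ** (n - 1 - k) for k in range(n)]


for n in (1, 3, max_order(64)):
    dims = channel_dims(64, n)
    # The summed width never reaches 2C, whatever the order.
    print(n, dims, sum(dims))
```

At the largest admissible order the narrowest step has a single channel, so a higher order would leave it empty. Because the widths form a geometric series, their sum stays below $2C$ for every $n$, which is why the cost does not grow without bound as the order increases.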

In our implementation of $\textit{g}^{\textit{n}}\text{Conv}$, the higher-order spatial interactions are based on the gating mechanism, which has also been investigated in LSTM and some vision modules. However, these previous methods can only achieve up to 2-order interactions and did not fully reveal the potential of higher-order interactions. In contrast, our $\textit{g}^{\textit{n}}\text{Conv}$ is more extendable and can achieve arbitrary higher-order spatial interactions under a controllable computational budget.

Appendix C Implementation Details

To better verify the effectiveness of our new designs, we introduce minimal changes to the overall architecture of Swin Transformers. Specifically, we make two changes: 1) we add one block in stage 2 to make the overall computation and parameters close to previous models; 2) we use the LayerScale technique to make our models more stable during training, following the practice of ConvNeXt. Note that the two changes have been applied to the baseline model considered in our ablation study to clearly show the effects of our designs. The detailed architectures of ConvNeXt, Swin Transformers and HorNet are summarized in Table 7.

C.2 Experimental Settings for Image Classification.

ImageNet-1K training. ImageNet-1K is a widely used large-scale benchmark for image classification, which contains around 1.2 million images from 1,000 categories. Following common practice, we train our models on the training set of ImageNet and report the single-crop top-1 accuracy on 50,000 validation images. To fairly compare with our baseline methods (i.e., Swin Transformers and ConvNeXt), we follow most of the training details of ConvNeXt and make several small modifications to make the training configurations suitable for our models. For HorNet with 7 $\times$ 7 convolutions, we find that applying gradient clipping with a maximal norm of 5 significantly stabilizes the training process, which may be due to the large gradients brought by the high-order structures in our models. For HorNet with global filters, we use stronger regularization strategies since we find that larger kernels improve the model capacity but may also cause more severe overfitting. Specifically, we set the maximal gradient norm to 1 and use more aggressive RandAug data augmentation strategies (i.e., we adjust the magnitudes for tiny, small and base models to 9, 12 and 15, respectively). We set the stochastic depth coefficient of HorNet-T/S/B models to 0.2, 0.4 and 0.5, respectively. The other details are identical to ConvNeXt. Our models are trained using 32 NVIDIA A100 GPUs with a global batch size of 4096.
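Gradient clipping with a maximal norm can be sketched as follows. This is a minimal, framework-free illustration that assumes the norm is the global L2 norm over all gradient entries, here flattened into a plain list. The function name is hypothetical, and this is not the training code used for our models.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


# Maximal norm 5 (7x7 convolutions) versus 1 (global filters), as stated above.
for max_norm in (5.0, 1.0):
    clipped, norm_before = clip_grad_norm([6.0, 8.0], max_norm)
    print(max_norm, norm_before, clipped)
```

Clipping only changes the length of the gradient vector, not its direction, so a smaller maximal norm acts as a stronger limit on the size of each update.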

ImageNet-22K training. ImageNet-22K is a larger dataset that contains $>$ 21k classes and around 14M images. We use the subset suggested by prior work since the new winter 2021 release is the version that is accessible now. We also follow prior work to remove categories with few images, resulting in roughly half as many categories and only 13% fewer images compared to the original dataset. We follow previous practice to train our models for 90 epochs and use a data augmentation strategy similar to that of the ImageNet-1K experiments. We set the stochastic depth coefficient to 0.2. We also set the maximal gradient norm to 5 and 1 for our large models with standard 7 $\times$ 7 convolutions and global filters, respectively. We also adjust the weight decay to 0.1. The other details are identical to ConvNeXt. We also fine-tune our best model HorNet-L$_{\rm GF}$ on 384 $\times$ 384 images on ImageNet-22K for 10 epochs to compete with state-of-the-art models on downstream tasks. The model is only used in the experiments in Appendix 5.

ImageNet-1K fine-tuning. We fine-tune the models pre-trained on ImageNet-22K or at the 224 $\times$ 224 resolution on ImageNet-1K and/or at the 384 $\times$ 384 resolution for 30 epochs with a batch size of 512 and a cosine learning rate schedule with an initial learning rate of 5e-5. We set the weight decay to 1e-6 and disable MixUp and CutMix following prior practice. We initialize the ImageNet-1K classifier with the corresponding classifier weights for ImageNet-22K classes to further stabilize the training process.
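The cosine learning rate schedule used for fine-tuning can be sketched as below. The initial learning rate of 5e-5 and the 30 epochs come from the text. The absence of warmup, the final learning rate of zero, and the per-epoch granularity are assumptions made for this illustration, and the function name is hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr=5e-5, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle and end of a 30-epoch fine-tuning run.
for epoch in (0, 15, 30):
    print(epoch, cosine_lr(epoch, total_steps=30))
```

In practice the schedule is often evaluated per iteration rather than per epoch, which only changes the values passed as `step` and `total_steps`.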

C.3 Experimental Settings for Downstream Tasks.

Object detection and instance segmentation on COCO. We adopt the widely used Cascade Mask R-CNN framework to perform object detection and instance segmentation on COCO, following Swin and ConvNeXt. Our backbones are pre-trained on ImageNet-1K for HorNet-T/S/B and on ImageNet-22K for HorNet-L. We use the 3 $\times$ schedule, where we train all of our models for 36 epochs with the AdamW optimizer and a global batch size of 16. We set the learning rate as {2e-4, 2e-4, 2e-4, 1e-4} and the stochastic depth rate as {0.4, 0.6, 0.7, 0.7} for HorNet-T/S/B/L. We set the weight decay as 0.05 for all the models.

Semantic Segmentation on ADE20K. We use the UperNet 160K framework for semantic segmentation on ADE20K. We use a global batch size of 16 and train all the models for 160K iterations with the AdamW optimizer. We use $512\times 512$ images for ImageNet-1K pre-trained HorNet-T/S/B and $640\times 640$ images for ImageNet-22K pre-trained HorNet-L. We set the learning rate as 1e-4 and the weight decay as 0.05 for all the models. We report the mIoU of both single-scale and multi-scale testing on the validation set.

Appendix D More Analysis

Comparisons with state-of-the-art methods on ImageNet. Our experiments are designed to clearly verify the superiority of our design over previous basic operations like plain convolution and self-attention. Therefore, we choose to follow the basic architecture and the training configuration of the widely used architectures Swin Transformers and ConvNeXt. As a result, there is still substantial room to further improve the performance on ImageNet-1K. We notice that some recent work like Pale Transformers and Dynamic Group Transformer, with hybrid architectures or more careful designs, achieves better performance than HorNet on ImageNet-1K. We think many techniques that have been used in previous work can be useful to further improve our models, including further optimized overall architectures (e.g., optimized depth/width for each stage), better patch embedding strategies (e.g., overlapping convolutional layers for input embedding and downsampling), more efficient ways to compute adaptive weights (e.g., using downsampled features to produce attention weights), and more advanced training methods and hybrid architectures (e.g., combining $\textit{g}^{\textit{n}}\text{Conv}$ with self-attention and plain convolutions).

Throughput analysis. We provide the detailed throughput statistics of our models and several baseline methods in Table 8. Apart from ConvNeXt and Swin Transformers, we also compare our model with recent MViTv2 models. The multiple small matrix multiplications introduced by $\textit{g}^{\textit{n}}\text{Conv}$ affect the speed of our models on GPU. We observe that our method is slower than ConvNeXt by 7% to 15% with similar FLOPs. Meanwhile, thanks to the highly efficient depth-wise convolution implementation in cuDNN, we also see that our models achieve similar or slightly faster speed than typical vision Transformers with similar FLOPs. Notably, as shown in Figure 3(c), the higher classification accuracy helps our models achieve better speed-accuracy trade-offs than ConvNeXt and Swin Transformers. Therefore, we believe the speed of our method is still competitive with these recent models.

Effects of $\alpha$. We find that re-scaling the output of the gated convolution avoids the large values produced by the recursive process and stabilizes the training process. We analyze the effects of $\alpha$ in our ImageNet experiments based on HorNet-B$_{7\times 7}$. The results are summarized in Table 9. We see that $\alpha=3$ leads to the best performance. Therefore, we set $\alpha$ to 3 in all our models.

Effects of activation functions in gated convolutions. The gated convolutions used in our models can be viewed as a type of channel attention that uses different attention weights in different locations and generates weights based on spatial interactions. Previous channel attention methods like SE-Net usually add a sigmoid function to the attention weights to generate bounded attention. Therefore, we investigate several possible activation functions in our models. The results are presented in Table 10. We see that the version described in our paper (i.e., no activation function) achieves the best performance. The result also suggests that $\textit{g}^{\textit{n}}\text{Conv}$ exhibits a different behavior from conventional channel attention methods. Since the gating weights are critical components in our models, activation functions that can cause significant information loss, like GELU and sigmoid, severely hurt the performance.