Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

Xiaohan Ding, Xiangyu Zhang, Yizhuang Zhou, Jungong Han, Guiguang Ding, Jian Sun

Introduction

Convolutional neural networks (CNNs) used to be a common choice of visual encoders in modern computer vision systems. Recently, however, CNNs have been greatly challenged by Vision Transformers (ViTs), which have shown leading performance on many visual tasks: not only image classification and representation learning, but also many downstream tasks such as object detection, semantic segmentation and image restoration. Why are ViTs so powerful? Some works believe that the multi-head self-attention (MHSA) mechanism in ViTs plays a key role. They provide empirical results to demonstrate that MHSA is more flexible, more capable (less inductive bias), more robust to distortions, or able to model long-range dependencies. But other works challenge the necessity of MHSA, attributing the high performance of ViTs to the proper building blocks and/or dynamic sparse weights. Still other works explain the superiority of ViTs from different points of view.

In this work, we focus on one view: the way of building up large receptive fields. In ViTs, MHSA is usually designed to be either global or local but with large kernels, so each output from a single MHSA layer is able to gather information from a large region. However, large kernels are not popularly employed in CNNs (except for the first layer). Instead, a typical fashion in state-of-the-art CNNs is to use a stack of many small spatial convolutions (e.g., 3 $\times$ 3) to enlarge the receptive fields; here, spatial convolutions refer to convolutional kernels (including variants such as depth-wise/group convolutions) whose spatial size is larger than 1 $\times$ 1. Only some old-fashioned networks such as AlexNet, Inceptions and a few architectures derived from neural architecture search adopt large spatial convolutions (whose size is greater than 5) as the main part. The above view naturally leads to a question: what if we use a few large instead of many small kernels in conventional CNNs? Is large kernel or the way of building large receptive fields the key to closing the performance gap between CNNs and ViTs?

To answer this question, we systematically explore the large kernel design of CNNs. We follow a very simple “philosophy”: just introducing large depth-wise convolutions, whose sizes range from 3 $\times$ 3 to 31 $\times$ 31, into conventional networks, although there exist other alternatives to introduce large receptive fields via a single or a few layers, e.g., feature pyramids, dilated convolutions and deformable convolutions. Through a series of experiments, we summarize five empirical guidelines to effectively employ large convolutions: 1) very large kernels can still be efficient in practice; 2) identity shortcut is vital especially for networks with very large kernels; 3) re-parameterizing with small kernels helps to make up the optimization issue; 4) large convolutions boost downstream tasks much more than ImageNet; 5) large kernel is useful even on small feature maps.

Based on the above guidelines, we propose a new architecture named RepLKNet, a pure CNN (namely, a CNN free of any attention or dynamic mechanism, e.g., squeeze-and-excitation, multi-head self-attention, dynamic weights, etc.) where re-parameterized large convolutions are employed to build up large receptive fields. Our network in general follows the macro architecture of Swin Transformer with a few modifications, while replacing the multi-head self-attentions with large depth-wise convolutions. We mainly benchmark middle-size and large-size models, since ViTs used to be believed to surpass CNNs on large data and models. On ImageNet classification, our baseline (of similar model size to Swin-B), whose kernel size is as large as 31 $\times$ 31, achieves 84.8% top-1 accuracy trained only on the ImageNet-1K dataset, which is 0.3% better than Swin-B but much more efficient in latency.

More importantly, we find that the large kernel design is particularly powerful on downstream tasks. For example, our networks outperform ResNeXt-101 or ResNet-101 backbones by 4.4% on COCO detection and 6.1% on ADE20K segmentation under the similar complexity and parameter budget, which is also on par with or even better than the counterpart Swin Transformers but with higher inference speed. Given more pretraining data (e.g., 73M images) and more computational budget, our best model obtains very competitive results among the state-of-the-arts with similar model sizes, e.g. 87.8% top-1 accuracy on ImageNet and 56.0% on ADE20K, which shows excellent scalability towards large-scale applications.

We believe the high performance of RepLKNet is mainly because of the large effective receptive fields (ERFs) built via large kernels, as compared in Fig. 1. Moreover, RepLKNet is shown to leverage more shape information than conventional CNNs, which partially agrees with human cognition. We hope our findings can help to understand the intrinsic mechanism of both CNNs and ViTs.

Related Work

As mentioned in the introduction, apart from a few old-fashioned models like Inceptions, large-kernel models fell out of favor after VGG-Net. One representative work is Global Convolution Networks (GCNs), which use very large convolutions of 1 $\times$ K followed by K $\times$ 1 to improve semantic segmentation. However, large kernels are reported to harm the performance on ImageNet. Local Relation Networks (LR-Net) propose a spatial aggregation operator (LR-Layer) to replace standard convolutions, which can be viewed as a dynamic convolution. LR-Net could benefit from a kernel size of 7 $\times$ 7, but the performance decreases with 9 $\times$ 9. With a kernel size as large as the feature map, the top-1 accuracy drops significantly from 75.7% to 68.4%.

Recently, Swin Transformers propose to capture the spatial patterns with shifted window attention, whose window sizes range from 7 to 12, which can also be viewed as a variant of large kernel. The follow-ups employ even larger window sizes. Inspired by the success of those local transformers, a recent work replaces MHSA layers with static or dynamic 7 $\times$ 7 depth-wise convolutions while still maintaining comparable results. Though the network proposed by that work shares a similar design pattern with ours, the motivations are different: it does not investigate the relationship between ERFs, large kernels and performance; instead, it attributes the superior performance of vision transformers to sparse connections, shared parameters and dynamic mechanisms. Another three representative works are Global Filter Networks (GFNets), CKConv and FlexConv. GFNet optimizes the spatial connection weights in the Fourier domain, which is equivalent to circular global convolutions in the spatial domain. CKConv formulates kernels as continuous functions to process sequential data, which can construct arbitrarily large kernels. FlexConv learns different kernel sizes for different layers, which can be as large as the feature maps. Although they use very large kernels, they do not intend to answer the key questions we care about: why do traditional CNNs underperform ViTs, and how should large kernels be applied in common CNNs? Besides, two of these works do not evaluate their models on strong baselines, e.g., models larger than Swin-L. Hence it is still unclear whether large-kernel CNNs can scale up as well as transformers.

ConvMixer uses up to 9 $\times$ 9 convolutions to replace the “mixer” component of ViTs or MLPs. MetaFormer suggests that a pooling layer is an alternative to self-attention. ConvNeXt employs 7 $\times$ 7 depth-wise convolutions to design strong architectures, pushing the limit of CNN performance. Although those works show excellent performance, they do not show benefits from much larger convolutions (e.g., 31 $\times$ 31).

2.2 Model Scaling Techniques

Given a small model, it is a common practice to scale it up for better performance, so the scaling strategy plays a vital role in the resultant accuracy-efficiency trade-offs. For CNNs, existing scaling approaches usually focus on model depth, width, input resolution, bottleneck ratio and group width. Kernel size, however, is often neglected. In Sec. 3, we will show that the kernel size is also an important scaling dimension in CNNs, especially for downstream tasks.

2.3 Structural Re-parameterization

Structural Re-parameterization is a methodology of equivalently converting model structures via transforming the parameters. For example, RepVGG targets a deep inference-time VGG-like (e.g., branch-free) model and constructs extra ResNet-style shortcuts parallel to the 3 $\times$ 3 layers during training. In contrast to a real VGG-like model, which is difficult to train, such shortcuts help the model reach a satisfactory performance. After training, the shortcuts are absorbed into the parallel 3 $\times$ 3 kernels via a series of linear transformations, so that the resultant model becomes a VGG-like model. In this paper, we use this methodology to add a relatively small (e.g., 3 $\times$ 3 or 5 $\times$ 5) kernel into a very large kernel. In this way, we make the very large kernel capable of capturing small-scale patterns, hence improving the performance of the model.

Guidelines of Applying Large Convolutions

Trivially applying large convolutions to CNNs usually leads to inferior performance and speed. In this section, we summarize 5 guidelines for effectively using large kernels.

Guideline 1: very large kernels can still be efficient in practice.

It is believed that large-kernel convolutions are computationally expensive because the number of parameters and FLOPs increases quadratically with the kernel size. The drawback can be largely overcome by applying depth-wise (DW) convolutions. For example, in our proposed RepLKNet (see Table 5 for details), increasing the kernel sizes in different stages from 3 $\times$ 3 to as large as 31 $\times$ 31 only increases the FLOPs and number of parameters by 18.6% and 10.4% respectively, which is acceptable. The remaining 1 $\times$ 1 convolutions actually dominate most of the complexity.
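A minimal sketch of the parameter arithmetic behind this observation. It assumes a hypothetical block layout (1 $\times$ 1 conv, DW $K \times K$ conv, 1 $\times$ 1 conv, followed by two 1 $\times$ 1 layers with 4 $\times$ expansion) and a made-up channel count of 512; the function names are ours, biases and BN are ignored, and the numbers are not those of Table 5. Because every layer in this toy block runs at stride 1 on the same feature map, FLOPs scale with the same per-layer counts.

```python
def dw_params(channels, k):
    """Weights of a depth-wise k x k conv: one k x k filter per channel."""
    return channels * k * k


def pw_params(c_in, c_out):
    """Weights of a 1 x 1 (point-wise) conv."""
    return c_in * c_out


def block_params(channels, k, ffn_ratio=4):
    """Return (DW weights, 1x1 weights) for the assumed block layout."""
    dw = dw_params(channels, k)
    around_dw = 2 * pw_params(channels, channels)        # 1x1 before and after the DW conv
    ffn = 2 * pw_params(channels, ffn_ratio * channels)  # expand, then project back
    return dw, around_dw + ffn


CHANNELS = 512  # hypothetical width, chosen only for illustration
for k in (3, 13, 31):
    dw, pw = block_params(CHANNELS, k)
    share = dw / (dw + pw)
    print(f"K={k:2d}: DW weights={dw:7d}, 1x1 weights={pw}, DW share={share:.1%}")
```

Under these assumptions the DW weights grow with $K^2$ but are multiplied by $C$ only, whereas each 1 $\times$ 1 layer costs on the order of $C^2$, which is why the point-wise layers still dominate once $C$ is in the hundreds.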

One may be concerned that DW convolutions could be very inefficient on modern parallel computing devices like GPUs. This is true for conventional DW 3 $\times$ 3 kernels, because DW operations have a low ratio of computation to memory access cost, which is not friendly to modern computing architectures. However, we find that when the kernel size becomes large, the computational density increases: for example, in a DW 11 $\times$ 11 kernel, each value loaded from the feature map can take part in at most 121 multiplications, while in a 3 $\times$ 3 kernel the number is only 9. Therefore, according to the roofline model, the actual latency should not increase as much as the FLOPs when the kernel size becomes larger.
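The "at most 121 multiplications" figure can be checked by brute force. The sketch below assumes stride 1 and zero "same" padding (assumptions of ours; `max_reuse` is a hypothetical helper name) and counts, for every input position, how many output positions have a $K \times K$ window covering it.

```python
def max_reuse(k, h, w):
    """Max number of multiplications a single input value takes part in
    for a depth-wise k x k conv on an h x w map (stride 1, 'same' padding)."""
    pad = k // 2
    best = 0
    for y in range(h):
        for x in range(w):
            # Output (oy, ox) reads input (y, x) iff |oy - y| <= pad and |ox - x| <= pad.
            ny = min(h - 1, y + pad) - max(0, y - pad) + 1
            nx = min(w - 1, x + pad) - max(0, x - pad) + 1
            best = max(best, ny * nx)
    return best


for k in (3, 11):
    print(f"DW {k}x{k} on a 32x32 map: up to {max_reuse(k, 32, 32)} multiplications per loaded value")
```

On maps smaller than the kernel the reuse is bounded by the map size instead: `max_reuse(13, 7, 7)` cannot exceed 49.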

Remark 1. Unfortunately, we find off-the-shelf deep learning tools (such as PyTorch) support large DW convolutions poorly, as shown in Table 1. Hence we try several approaches to optimize the CUDA kernels. An FFT-based approach appears reasonable for implementing large convolutions. However, in practice we find that a block-wise (inverse) implicit GEMM algorithm is a better choice. The implementation has been integrated into the open-sourced framework MegEngine, and we omit the details here. We have also released an efficient implementation for PyTorch. Table 1 shows that our implementation is far more efficient than the PyTorch baseline. With our optimization, the latency contribution of DW convolutions in RepLKNet reduces from 49.5% to 12.3%, which is roughly in proportion to their share of the FLOPs.

Guideline 2: identity shortcut is vital especially for networks with very large kernels.

To demonstrate this, we use MobileNet V2 to benchmark, since it heavily uses DW layers and has two published variants (with or without shortcuts). For the large-kernel counterparts, we simply replace all the DW 3 $\times$ 3 layers with 13 $\times$ 13. All the models are trained on ImageNet with the identical training configurations for 100 epochs (see Appendix A for details). Table 2 shows large kernels improve the accuracy of MobileNet V2 with shortcuts by 0.77%. However, without shortcuts, large kernels reduce the accuracy to only 53.98%.

Remark 2. The guideline also works for ViTs. A recent work finds that without identity shortcuts, attention loses rank doubly exponentially with depth, leading to an over-smoothing issue. Although large-kernel CNNs may degenerate via a different mechanism from ViTs, we also observe that without shortcuts it is difficult for the network to capture local details. From a perspective similar to that of prior work, shortcuts make the model an implicit ensemble composed of numerous models with different receptive fields (RFs), so it can benefit from a much larger maximum RF while not losing the ability to capture small-scale patterns.

Guideline 3: re-parameterizing [31] with small kernels helps to make up the optimization issue.

We replace the 3 $\times$ 3 layers of MobileNet V2 by 9 $\times$ 9 and 13 $\times$ 13 respectively, and optionally adopt Structural Re-parameterization methodology. Specifically, we construct a 3 $\times$ 3 layer parallel to the large one, then add up their outputs after Batch normalization (BN) layers (Fig. 2). After training, we merge the small kernel as well as BN parameters into the large kernel, so the resultant model is equivalent to the model for training but no longer has small kernels. Table 3 shows directly increasing the kernel size from 9 to 13 reduces the accuracy, while re-parameterization addresses the issue.
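The merge step can be illustrated with a small, dependency-free sketch. It assumes a single channel, BN in inference mode (fixed running statistics), stride 1 and zero "same" padding; all function names and the BN values are made up for illustration and this is not the released implementation. The idea is to fold each BN into its conv, zero-pad the 3 $\times$ 3 kernel to the large size, and add kernels and biases.

```python
import math
import random


def conv2d_same(x, kernel, bias=0.0):
    """Single-channel 2D cross-correlation, stride 1, zero 'same' padding."""
    h, w, k = len(x), len(x[0]), len(kernel)
    p = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = bias
            for a in range(k):
                for b in range(k):
                    y, z = i + a - p, j + b - p
                    if 0 <= y < h and 0 <= z < w:
                        acc += kernel[a][b] * x[y][z]
            out[i][j] = acc
    return out


def bn_eval(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN with fixed running statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [[(v - mean) * scale + beta for v in row] for row in x]


def fuse_conv_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding bias-free conv; returns (kernel, bias)."""
    scale = gamma / math.sqrt(var + eps)
    return [[wt * scale for wt in row] for row in kernel], beta - mean * scale


def pad_kernel(kernel, size):
    """Zero-pad a small odd kernel to size x size, keeping it centered."""
    k = len(kernel)
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for a in range(k):
        for b in range(k):
            out[a + off][b + off] = kernel[a][b]
    return out


def merge_branches(large, bn_large, small, bn_small):
    """Merge (large conv + BN) and (small conv + BN) into one kernel and bias."""
    kl, bl = fuse_conv_bn(large, *bn_large)
    ks, bs = fuse_conv_bn(small, *bn_small)
    ks = pad_kernel(ks, len(large))
    n = len(large)
    merged = [[kl[a][b] + ks[a][b] for b in range(n)] for a in range(n)]
    return merged, bl + bs


def rand_grid(n, rng):
    return [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]


rng = random.Random(0)
x = rand_grid(12, rng)                      # toy 12 x 12 single-channel feature map
large, small = rand_grid(13, rng), rand_grid(3, rng)
bn_l = (1.2, 0.1, 0.05, 0.8)                # gamma, beta, running mean, running var (made up)
bn_s = (0.7, -0.2, -0.1, 1.5)

# Training-time structure: two parallel conv -> BN branches, outputs summed.
y_large = bn_eval(conv2d_same(x, large), *bn_l)
y_small = bn_eval(conv2d_same(x, small), *bn_s)
two_branch = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(y_large, y_small)]

# Inference-time structure: a single 13 x 13 conv with bias.
merged_k, merged_b = merge_branches(large, bn_l, small, bn_s)
single = conv2d_same(x, merged_k, merged_b)

max_err = max(abs(a - b) for r1, r2 in zip(two_branch, single) for a, b in zip(r1, r2))
print(f"max abs difference: {max_err:.2e}")
```

The two structures agree up to floating-point rounding because convolution and inference-mode BN are both linear, so the sum of the two branches is itself a single convolution. The equivalence does not hold while BN is still updating batch statistics, which is why the merge is applied after training.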

We then transfer the ImageNet-trained models to semantic segmentation with DeepLabv3+ on Cityscapes. We only replace the backbone and keep all the default training settings provided by MMSegmentation. The observation is similar to that on ImageNet: 3 $\times$ 3 re-param improves the mIoU of the 9 $\times$ 9 model by 0.19 and that of the 13 $\times$ 13 model by 0.93. With such simple re-parameterization, increasing the kernel size from 9 to 13 no longer degrades the performance on either ImageNet or Cityscapes.

Remark 3. It is known that ViTs have optimization problems, especially on small datasets. A common workaround is to introduce a convolutional prior, e.g., adding a DW 3 $\times$ 3 convolution to each self-attention block, which is analogous to ours. Those strategies introduce additional translational equivariance and locality priors to the network, making it easier to optimize on small datasets without loss of generality. Similar to the behavior of ViT, we also find that when the pretraining dataset increases to 73 million images (refer to RepLKNet-XL in the next section), re-parameterization can be omitted without degradation.

Guideline 4: large convolutions boost downstream tasks much more than ImageNet classification.

Table 3 (after re-param) shows that increasing the kernel size of MobileNet V2 from 3 $\times$ 3 to 9 $\times$ 9 improves the ImageNet accuracy by 1.33% but the Cityscapes mIoU by 3.99%. Table 5 shows a similar trend: as the kernel sizes increase from 3 $\times$ 3 to as large as 31 $\times$ 31, the ImageNet accuracy improves by only 0.96%, while the mIoU on ADE20K improves by 3.12%. Such a phenomenon indicates that models of similar ImageNet scores could have very different capability in downstream tasks (just as the bottom 3 models in Table 5).

Remark 4. What causes the phenomenon? First, large kernel design significantly increases the Effective Receptive Fields (ERFs). Numerous works have demonstrated that “contextual” information, which implies large ERFs, is crucial in many downstream tasks like object detection and semantic segmentation. We will discuss the topic in Sec. 5. Second, we deem another reason might be that large kernel design contributes more shape bias to the network. Briefly speaking, ImageNet pictures can be correctly classified according to either texture or shape, as proposed in prior work. However, humans recognize objects mainly based on shape cues rather than texture, so a model with stronger shape bias may transfer better to downstream tasks. A recent study points out that ViTs are strong in shape bias, which partially explains why ViTs are so powerful in transfer tasks. In contrast, conventional CNNs trained on ImageNet tend to bias towards texture. Fortunately, we find that simply enlarging the kernel size in CNNs can effectively improve the shape bias. Please refer to Appendix C for details.

Guideline 5: large kernel (e.g., 13 $\times$ 13) is useful even on small feature maps (e.g., 7 $\times$ 7).

To validate it, we enlarge the DW convolutions in the last stage of MobileNet V2 to 7 $\times$ 7 or 13 $\times$ 13, so that the kernel size is on par with or even larger than the feature map size (7 $\times$ 7 by default). We apply re-parameterization to the large kernels as suggested by Guideline 3. Table 4 shows that although convolutions in the last stage already involve a very large receptive field, further increasing the kernel sizes still leads to performance improvements, especially on downstream tasks such as Cityscapes.

Remark 5. Notice that when the kernel size becomes large, the translational equivariance of CNNs does not strictly hold. As illustrated in Fig. 3, two outputs at adjacent spatial locations share only a fraction of the kernel weights, i.e., are transformed by different mappings. The property also agrees with the “philosophy” of ViTs: relaxing the symmetric prior to obtain more capacity. Interestingly, we find that 2D Relative Position Embedding (RPE), which is widely used in the transformer community, can also be viewed as a large depth-wise kernel of size $(2H-1)\times(2W-1)$, where $H$ and $W$ are the feature map height and width respectively. Large kernels not only help to learn the relative positions between concepts, but also encode the absolute position information due to the padding effect.
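A short check of the $(2H-1)\times(2W-1)$ size: a relative position table needs one entry per distinct offset between a query location and a key location. The helper below (`relative_offsets`, a hypothetical name of ours) simply enumerates those offsets on a small grid.

```python
def relative_offsets(h, w):
    """All (dy, dx) offsets between any two positions on an h x w grid."""
    return {
        (ky - qy, kx - qx)
        for qy in range(h) for qx in range(w)
        for ky in range(h) for kx in range(w)
    }


H, W = 7, 7  # the small feature-map size used as an example in Guideline 5
offsets = relative_offsets(H, W)
n_dy = len({dy for dy, _ in offsets})
n_dx = len({dx for _, dx in offsets})
assert (n_dy, n_dx) == (2 * H - 1, 2 * W - 1)
print(f"{n_dy} x {n_dx} distinct offsets = {len(offsets)} table entries")
```

For a 7 $\times$ 7 map this gives 13 $\times$ 13 offsets, the same spatial extent as the 13 $\times$ 13 kernel in Guideline 5.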

RepLKNet: a Large-Kernel Architecture

Following the above guidelines, in this section we propose RepLKNet, a pure CNN architecture with large kernel design. To our knowledge, CNNs still dominate small models up to now, while vision transformers are believed to be better than CNNs under a larger complexity budget. Therefore, in this paper we mainly focus on relatively large models (whose complexity is on par with or larger than ResNet-152 or Swin-B), in order to verify whether large kernel design could eliminate the performance gap between CNNs and ViTs.

We sketch the architecture of RepLKNet in Fig. 4:

Stem refers to the beginning layers. Since we target high performance on downstream dense-prediction tasks, we desire to capture more details with several conv layers at the beginning. After the first 3 $\times$ 3 conv with 2 $\times$ downsampling, we arrange a DW 3 $\times$ 3 layer to capture low-level patterns, a 1 $\times$ 1 conv, and another DW 3 $\times$ 3 layer for downsampling.

Stages 1-4 each contain several RepLK Blocks, which use shortcuts (Guideline 2) and DW large kernels (Guideline 1). We use a 1 $\times$ 1 conv before and after the DW conv as a common practice. Note that each DW large conv uses a 5 $\times$ 5 kernel for re-parameterization (Guideline 3), which is not shown in Fig. 4. Besides the large conv layers, which provide sufficient receptive field and the ability to aggregate spatial information, the model’s representational capacity is also closely related to the depth. To provide more nonlinearities and information communications across channels, we use 1 $\times$ 1 layers to increase the depth. Inspired by the Feed-Forward Network (FFN), which has been widely used in transformers and MLPs, we use a similar CNN-style block composed of a shortcut, BN, two 1 $\times$ 1 layers and GELU, which is referred to as the ConvFFN Block. Compared to the classic FFN, which uses Layer Normalization before the fully-connected layers, BN has the advantage that it can be fused into conv for efficient inference. As a common practice, the number of internal channels of the ConvFFN Block is 4 $\times$ that of the input. Simply following ViT and Swin, which interleave attention and FFN blocks, we place a ConvFFN after each RepLK Block.

Transition Blocks are placed between stages, which first increase the channel dimension via 1 $\times$ 1 conv and then conduct 2 $\times$ downsampling with DW 3 $\times$ 3 conv.

In summary, each stage has three architectural hyper-parameters: the number of RepLK Blocks $B$, the channel dimension $C$, and the kernel size $K$. Thus a RepLKNet architecture is defined by $[B_{1},B_{2},B_{3},B_{4}]$, $[C_{1},C_{2},C_{3},C_{4}]$, $[K_{1},K_{2},K_{3},K_{4}]$.
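The three hyper-parameter lists can be captured in a small container, sketched below. The class name, the odd-kernel check and the example values are assumptions of ours for illustration; the values are placeholders, not a configuration reported in this paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RepLKNetSpec:
    blocks: List[int]    # [B1, B2, B3, B4]: RepLK Blocks per stage
    channels: List[int]  # [C1, C2, C3, C4]: channel dimension per stage
    kernels: List[int]   # [K1, K2, K3, K4]: DW kernel size per stage

    def __post_init__(self):
        if not (len(self.blocks) == len(self.channels) == len(self.kernels) == 4):
            raise ValueError("expected four entries (one per stage) in each list")
        if any(k % 2 == 0 for k in self.kernels):
            # Assumption of this sketch: odd kernels, so 'same' padding is symmetric.
            raise ValueError("kernel sizes are assumed to be odd")

    def layout(self):
        """List the block sequence: each RepLK Block is followed by a ConvFFN."""
        seq = ["Stem"]
        stages = zip(self.blocks, self.channels, self.kernels)
        for i, (b, c, k) in enumerate(stages, start=1):
            seq.append(f"Stage {i}: {b} x [RepLK(C={c}, K={k}) + ConvFFN(C={c}, hidden={4 * c})]")
            if i < 4:
                seq.append(f"Transition: C={c} -> {self.channels[i]}, 2x downsampling")
        return seq


# Placeholder values for illustration only, NOT a configuration from the paper.
toy = RepLKNetSpec(blocks=[1, 1, 2, 1], channels=[32, 64, 128, 256], kernels=[13, 13, 13, 7])
for line in toy.layout():
    print(line)
```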

4.2 Making Large Kernels Even Larger

We continue to evaluate large kernels on RepLKNet by fixing $\mathbf{B}$ and $\mathbf{C}$, varying $\mathbf{K}$ and observing the performance of both classification and semantic segmentation. Without careful tuning of the hyper-parameters, we casually set three kernel-size configurations and refer to the models as RepLKNet-13/25/31. We also construct two small-kernel baselines where the kernel sizes are all 3 or 7 (RepLKNet-3/7).

On ImageNet, we train for 120 epochs with the AdamW optimizer, RandAugment, mixup, CutMix, Rand Erasing and Stochastic Depth, following recent works. The detailed training configurations are presented in Appendix A.

For semantic segmentation, we use ADE20K, which is a widely-used large-scale semantic segmentation dataset containing 20K images of 150 categories for training and 2K for validation. We use the ImageNet-trained models as backbones and adopt UperNet implemented by MMSegmentation with the 80K-iteration training setting and test the single-scale mIoU.

Table 5 shows our results with different kernel sizes. On ImageNet, though increasing the kernel sizes from 3 to 13 improves the accuracy, making them even larger brings no further improvements. However, on ADE20K, scaling up the kernels from 13 to as large as 31 brings 0.82 higher mIoU with only 5.3% more parameters and 3.5% higher FLOPs, which highlights the significance of large kernels for downstream tasks.

In the following subsections, we use RepLKNet-31 with stronger training configurations to compare with the state-of-the-arts on ImageNet classification, Cityscapes/ADE20K semantic segmentation and COCO object detection. We refer to the aforementioned model as RepLKNet-31B (B for Base) and a wider model as RepLKNet-31L (Large). We construct another model, RepLKNet-XL, with its own channel configuration $\mathbf{C}$ and a 1.5 $\times$ inverted bottleneck design in the RepLK Blocks (i.e., the channels of the DW large conv layers are 1.5 $\times$ those of the inputs).

4.3 ImageNet Classification

Since the overall architecture of RepLKNet is akin to Swin, we first compare the two. For RepLKNet-31B on ImageNet-1K, we extend the aforementioned training schedule to 300 epochs for a fair comparison. Then we finetune for 30 epochs with an input resolution of 384 $\times$ 384, so that the total training cost is much lower than that of the Swin-B model, which was trained with 384 $\times$ 384 from scratch. Then we pretrain RepLKNet-B/L models on ImageNet-22K and finetune on ImageNet-1K. RepLKNet-XL is pretrained on our private semi-supervised dataset named MegData73M, which is introduced in the Appendix. We also present the throughput tested with a batch size of 64 on the same 2080Ti GPU. The training configurations are presented in the Appendix.

Table 6 shows that though very large kernels are not intended for ImageNet classification, our RepLKNet models show a favorable trade-off between accuracy and efficiency. Notably, with only ImageNet-1K training, RepLKNet-31B reaches 84.8% accuracy, which is 0.3% higher than Swin-B, and runs 43% faster. And even though RepLKNet-XL has higher FLOPs than Swin-L, it runs faster, which highlights the efficiency of very large kernels.

4.4 Semantic Segmentation

We then use the pretrained models as the backbones on Cityscapes (Table 7) and ADE20K (Table 8). Specifically, we use the UperNet implemented by MMSegmentation with the 80K-iteration training schedule for Cityscapes and 160K for ADE20K. Since we desire to evaluate the backbone only, we do not use any advanced techniques, tricks or custom algorithms.

On Cityscapes, the ImageNet-1K-pretrained RepLKNet-31B outperforms Swin-B by a significant margin (2.7 in single-scale mIoU), and even outperforms the ImageNet-22K-pretrained Swin-L. Even equipped with DiversePatch, a technique customized for vision transformers, the single-scale mIoU of the 22K-pretrained Swin-L is still lower than that of our 1K-pretrained RepLKNet-31B, though the former has 2 $\times$ the parameters.

On ADE20K, RepLKNet-31B outperforms Swin-B with both 1K and 22K pretraining, and the margins of single-scale mIoU are particularly significant. Pretrained with our semi-supervised dataset MegData73M, RepLKNet-XL achieves an mIoU of 56.0, which shows feasible scalability towards large-scale vision applications.

4.5 Object Detection

For object detection, we use RepLKNets as the backbone of FCOS and Cascade Mask R-CNN, which are representatives of one-stage and two-stage detection methods, with the default configurations in MMDetection. The FCOS model is trained with the 2x (24-epoch) training schedule for a fair comparison with the X101 (short for ResNeXt-101) baseline from the same code base, and the other results with Cascade Mask R-CNN all use 3x (36-epoch). Again, we simply replace the backbone and do not use any advanced techniques. Table 9 shows that RepLKNets outperform ResNeXt-101-64x4d by up to 4.4 mAP while having fewer parameters and lower FLOPs. Note that the results may be further improved with advanced techniques like HTC, HTC++, Soft-NMS or a 6x (72-epoch) schedule. Compared to Swin, RepLKNets achieve higher or comparable mAP with fewer parameters and lower FLOPs. Notably, RepLKNet-XL achieves an mAP of 55.5, which demonstrates the scalability again.

Discussions

We have demonstrated that large kernel design can significantly boost CNNs (especially on downstream tasks). However, it is worth noting that a large kernel can be expressed by a series of small convolutions, e.g., a 7 $\times$ 7 convolution can be decomposed into a stack of three 3 $\times$ 3 kernels without information loss (more channels are required after the decomposition to maintain the degree of freedom). Given that fact, a question naturally comes up: why do conventional CNNs, which may contain tens or hundreds of small convolutions (e.g., ResNets), still perform worse than large-kernel networks?

We argue that in terms of obtaining a large receptive field, a single large kernel is much more effective than many small kernels. First, according to the theory of Effective Receptive Field (ERF), the ERF is proportional to $\mathcal{O}(K\sqrt{L})$, where $K$ is the kernel size and $L$ is the depth, i.e., the number of layers. In other words, the ERF grows linearly with the kernel size but sub-linearly with the depth. Second, increasing depth introduces optimization difficulty. Although ResNets seem to overcome the dilemma, managing to train a network with hundreds of layers, some works indicate that ResNets might not be as deep as they appear to be. For example, prior work suggests that ResNets behave like ensembles of shallow networks, which implies the ERFs of ResNets could still be very limited even if the depth dramatically increases. Such a phenomenon is also empirically observed in previous works. To summarize, large kernel design requires fewer layers to obtain large ERFs and avoids the optimization issue brought by increasing depth.
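A worked illustration of the $K\sqrt{L}$ argument, as a rough sketch: the hidden constant in $\mathcal{O}(\cdot)$ is set to 1, so only ratios are meaningful, and the function names are ours. The theoretical receptive field of $L$ stacked $K \times K$ layers (stride 1, no dilation) is $1 + L(K-1)$, which is shown alongside for contrast.

```python
import math


def theoretical_rf(k, depth):
    """Theoretical receptive field of `depth` stacked k x k convs (stride 1, no dilation)."""
    return 1 + depth * (k - 1)


def erf_scale(k, depth):
    """ERF growth rate K * sqrt(L), with the hidden constant set to 1."""
    return k * math.sqrt(depth)


def layers_to_match(k_small, k_large):
    """Depth at which stacked k_small kernels reach the ERF scale of ONE k_large layer."""
    return math.ceil((k_large / k_small) ** 2)


print(theoretical_rf(3, 3))        # theoretical RF of three stacked 3x3 layers
print(theoretical_rf(3, 15))       # theoretical RF of fifteen stacked 3x3 layers
print(round(erf_scale(3, 15), 1))  # ERF scale of those fifteen layers
print(erf_scale(31, 1))            # ERF scale of a single 31x31 layer
print(layers_to_match(3, 31))      # 3x3 depth needed under the K * sqrt(L) rule
```

Under this rule, fifteen 3 $\times$ 3 layers already have a theoretical receptive field of 31, yet matching the ERF scale of a single 31 $\times$ 31 layer would take $\lceil (31/3)^2 \rceil = 107$ such layers.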

5.2 Large-Kernel Models are More Similar to Human in Shape Bias

We have found out that RepLKNet-31B has much higher shape bias than Swin Transformer and small-kernel CNNs.

A recent work reported that vision transformers are more similar to the human vision system in that they make predictions based more on the overall shapes of objects, while CNNs focus more on the local textures. We follow its methodology and use its toolbox to obtain the shape bias (i.e., the fraction of predictions made based on the shapes, rather than the textures) of RepLKNet-31B and Swin-B pretrained on ImageNet-1K or 22K, together with two small-kernel baselines, RepLKNet-3 and ResNet-152. Fig. 5 shows that RepLKNet has higher shape bias than Swin. Considering that RepLKNet and Swin have similar overall architectures, we reckon that shape bias is closely related to the Effective Receptive Field rather than the concrete formulation of self-attention (i.e., the query-key-value design). This also explains 1) the high shape bias of ViTs reported by that work (since ViTs employ global attention), 2) the low shape bias of 1K-pretrained Swin (attention within local windows), and 3) the shape bias of the small-kernel baseline RepLKNet-3, which is very close to that of ResNet-152 (both models are composed of $3\times 3$ convolutions).
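For concreteness, the shape-bias score described above can be computed as in the sketch below. It assumes cue-conflict images that each carry one shape label and one texture label, and it counts only predictions that match one of the two; the function name and the toy labels are ours, and the toolbox's actual interface may differ.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-consistent predictions that follow shape rather than texture."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # no cue conflict, so the sample is uninformative
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")


# Toy example with made-up labels (not data from Fig. 5).
preds = ["cat", "elephant", "car", "dog", "clock"]
shapes = ["cat", "cat", "car", "dog", "bottle"]
textures = ["elephant", "elephant", "clock", "bear", "clock"]
print(shape_bias(preds, shapes, textures))
```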

5.3 Large Kernel Design is a Generic Design Element that Works with ConvNeXt

Replacing the 7 $\times$ 7 convolutions in ConvNeXt by kernels as large as 31 $\times$ 31 brings significant improvements, e.g., ConvNeXt-Tiny + large kernel > ConvNeXt-Small, and ConvNeXt-Small + large kernel > ConvNeXt-Base.

We use the recently proposed ConvNeXt as the benchmark architecture to evaluate large kernel as a generic design element. We simply replace the 7 $\times$ 7 convolutions in ConvNeXt by kernels as large as 31 $\times$ 31. The training configurations on ImageNet (120 epochs) and ADE20K (80K iterations) are identical to the results shown in Sec. 4.2. Table 11 shows that though the original kernels are already 7 $\times$ 7, further increasing the kernel sizes still brings significant improvements, especially on the downstream task: with kernels as large as 31 $\times$ 31, ConvNeXt-Tiny outperforms the original ConvNeXt-Small, and the large-kernel ConvNeXt-Small outperforms the original ConvNeXt-Base. Again, such phenomena demonstrate that kernel size is an important scaling dimension.

5.4 Large Kernels Outperform Small Kernels with High Dilation Rates

Limitations

Although large kernel design greatly improves CNNs on both ImageNet and downstream tasks, according to Table 6, as the scale of data and model increases, RepLKNets start to fall behind Swin Transformers, e.g., the ImageNet top-1 accuracy of RepLKNet-31L is 0.7% lower than that of Swin-L with ImageNet-22K pretraining (while the downstream scores are still comparable). It is not clear whether the gap results from suboptimal hyper-parameter tuning or some other fundamental drawback of CNNs which emerges when the data/model scales up. We are working on the problem.

Conclusion

This paper revisits large convolutional kernels, which have long been neglected in designing CNN architectures. We demonstrate that using a few large kernels instead of many small kernels results in larger effective receptive field more efficiently, boosting CNN’s performances especially on downstream tasks by a large margin, and greatly closing the performance gap between CNNs and ViTs when data and models scale up. We hope our work could advance both studies of CNNs and ViTs. On one hand, for CNN community, our findings suggest that we should pay special attention to ERFs, which may be the key to high performances. On the other hand, for ViT community, since large convolutions act as an alternative to multi-head self-attentions with similar behaviors, it may help to understand the intrinsic mechanism of self-attentions.

References

Appendix A: Training Configurations

For training MobileNet V2 models (Sec. 3), we use 8 GPUs, an SGD optimizer with momentum of 0.9, a batch size of 32 per GPU, input resolution of 224 $\times$ 224, weight decay of $4\times 10^{-5}$ , learning rate schedule with 5-epoch warmup, initial value of 0.1 and cosine annealing for 100 epochs. For the data augmentation, we only use random cropping and left-right flipping, as a common practice.

For training RepLKNet models (Sec. 4.2), we use 32 GPUs and a batch size of 64 per GPU to train for 120 epochs. The optimizer is AdamW with momentum of 0.9 and weight decay of 0.05. The learning rate setting includes an initial value of $4\times 10^{-3}$ , cosine annealing and 10-epoch warm-up. For the data augmentation and regularization, we use RandAugment (“rand-m9-mstd0.5-inc1” as implemented by timm ), label smoothing coefficient of 0.1, mixup with $\alpha=0.8$ , CutMix with $\alpha=1.0$ , Random Erasing with probability of 25% and Stochastic Depth with a drop-path rate of 30%, following the recent works . The RepLKNet-31B reported in Sec. 4.3 is trained with the same configurations except for an epoch number of 300 and a drop-path rate of 50%.

For finetuning the 224 $\times$ 224-trained RepLKNet-31B with 384 $\times$ 384, we use 32 GPUs, a batch size of 32 per GPU, initial learning rate of $4\times 10^{-4}$ , cosine annealing, 1-epoch warm-up, 30 epochs, model EMA (Exponential Moving Average) with momentum of $10^{-4}$ , and the same RandAugment as above but neither CutMix nor mixup.
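The model EMA mentioned above can be illustrated with a small update rule. This is a sketch under an assumed convention: a momentum of $10^{-4}$ is taken to be the weight given to the current parameters at each update (equivalent to a decay of 0.9999). The text does not state the convention, and the helper `ema_update` is hypothetical; scalar weights stand in for tensors.

```python
def ema_update(ema_weights, model_weights, momentum=1e-4):
    """Blend the current model weights into the EMA copy, in place.

    Assumed convention: ema <- (1 - momentum) * ema + momentum * w,
    so a small momentum means the EMA copy changes slowly.
    """
    for name, w in model_weights.items():
        ema_weights[name] = (1.0 - momentum) * ema_weights[name] + momentum * w
    return ema_weights


if __name__ == "__main__":
    ema = {"w": 1.0}
    # Pretend the optimizer moved the weight to 0.0 and stayed there.
    for _ in range(3):
        ema_update(ema, {"w": 0.0})
    print(ema)
```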

ImageNet-22K Pretraining and 1K Finetuning

For pretraining RepLKNet-31B/L on ImageNet-22K, we use 128 GPUs and a batch size of 32 per GPU to train for 90 epochs with a drop-path rate of 10%. The other configurations are the same as the aforementioned ImageNet-1K pretraining.

Then for finetuning RepLKNet-31B with 224 $\times$ 224, we use 16 GPUs, a batch size of 32 per GPU, drop-path rate of 20%, initial learning rate of $4\times 10^{-4}$ , cosine annealing, and model EMA with momentum of $10^{-4}$ , and finetune for 30 epochs. Note again that we use the same RandAugment as above but neither CutMix nor mixup.

For finetuning RepLKNet-31B/L with 384 $\times$ 384, we use 32 GPUs and a batch size of 16 per GPU, and the drop-path rate is raised to 30%.
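For reference, the global batch size implied by each configuration above is the number of GPUs times the per-GPU batch size. The sketch below assumes plain data parallelism without gradient accumulation, which the text does not state explicitly; the dictionary keys are descriptive labels chosen here, not names used by the authors.

```python
# (num_gpus, per_gpu_batch) pairs as listed in this appendix
CONFIGS = {
    "MobileNet V2, ImageNet-1K": (8, 32),
    "RepLKNet, ImageNet-1K pretraining": (32, 64),
    "RepLKNet-31B, 1K finetuning at 384": (32, 32),
    "RepLKNet-31B/L, ImageNet-22K pretraining": (128, 32),
    "RepLKNet-31B, 22K-to-1K finetuning at 224": (16, 32),
    "RepLKNet-31B/L, 22K-to-1K finetuning at 384": (32, 16),
}


def global_batch_size(num_gpus, per_gpu_batch):
    """Samples per optimizer step, assuming no gradient accumulation."""
    return num_gpus * per_gpu_batch


if __name__ == "__main__":
    for name, (gpus, per_gpu) in CONFIGS.items():
        print(f"{name}: {global_batch_size(gpus, per_gpu)}")
```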

RepLKNet-XL and Semi-supervised Pretraining

We continue to scale up our architecture and train a ViT-L level model named RepLKNet-XL. We use $B=$ , $C=$ , $K=$ , and introduce inverted bottleneck with expansion ratio of 1.5 to each RepLK Block. During pretraining, we use a private semi-supervised dataset named MegData73M, which contains 38 million labeled images and 35 million unlabeled ones. Labeled images come from public and private classification datasets such as ImageNet-1K, ImageNet-22K and Places365 . Unlabeled images are selected from YFCC100M . We design a multi-task label system according to , and utilize soft pseudo labels, generated offline by multiple task-specific ViT-Ls, wherever human annotations are unavailable. We pretrain our model for up to 15 epochs with configurations similar to those of ImageNet-1K pretraining. We do not use CutMix or mixup, decrease the drop-path rate to 20%, and use a lower initial learning rate of $1.5\times 10^{-3}$ and a total batch size of 2048. Structural Re-parameterization is omitted because it brings less than 0.1% performance gain on such a large-scale dataset. In other words, we observe that the inductive bias (re-parameterization with small kernels) becomes less important as the data become bigger, which is similar to the discoveries reported by ViT .

We finetune on ImageNet-1K with input resolution of 320 $\times$ 320 for 30 epochs following BeiT , except for a higher learning rate of $10^{-4}$ and stage-wise learning rate decay of 0.4. Finetuning with a higher resolution of 384 $\times$ 384 brings no further improvements. For downstream tasks, we use the default training setting except for a drop-path rate of 50% and stage-wise learning rate decay.

Appendix B: Visualizing the ERF

Appendix C: Dense Convolutions vs. Dilated Convolutions

As another way to implement large convolutions, dilated convolution is a common component for increasing the receptive field (RF). However, Table 12 shows that although a depth-wise dilated convolution may have the same maximum RF as a depth-wise dense convolution, its representational capacity is much lower, which is expected because it is mathematically equivalent to a sparse large convolution. Literature (e.g., ) further suggests that dilated convolutions may suffer from the gridding problem. We reckon that the drawbacks of dilated convolutions could be overcome by a mixture of convolutions with different dilations, which will be investigated in the future.
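The equivalence between a dilated convolution and a sparse large convolution can be checked numerically on a single channel. The following is an illustrative sketch in plain Python, not the code used for Table 12: `dilate_kernel` and `correlate2d_valid` are hypothetical helpers, and the example uses a 3 $\times$ 3 kernel with dilation 2, which covers the same 5 $\times$ 5 window as a dense 5 $\times$ 5 kernel while having only 9 free parameters instead of 25.

```python
def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between kernel taps, giving a sparse dense kernel."""
    k = len(kernel)
    size = dilation * (k - 1) + 1
    dense = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dense[i * dilation][j * dilation] = kernel[i][j]
    return dense


def correlate2d_valid(image, kernel, dilation=1):
    """Single-channel cross-correlation without padding, with optional dilation."""
    k = len(kernel)
    span = dilation * (k - 1) + 1  # side length of the window covered by the kernel
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - span + 1):
        row = []
        for x in range(w - span + 1):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += kernel[i][j] * image[y + i * dilation][x + j * dilation]
            row.append(acc)
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[float((3 * r + 5 * c) % 7) for c in range(7)] for r in range(7)]
    kernel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    dilated_out = correlate2d_valid(image, kernel, dilation=2)
    sparse_out = correlate2d_valid(image, dilate_kernel(kernel, 2), dilation=1)
    print(dilated_out == sparse_out)
```

The zero entries of the sparse kernel are fixed rather than learned, which is one way to read the lower representational capacity noted above.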

Appendix D: Visualizing the Kernel Weights with Small-Kernel Re-parameterization

We visualize the weights of the re-parameterized 13 $\times$ 13 kernels. Specifically, we investigate the MobileNet V2 models both with and without 3 $\times$ 3 re-parameterization. As shown in Sec. 3 (Guideline 3), the ImageNet scores are 73.24% and 72.53%, respectively. We use the first stride-1 13 $\times$ 13 conv in the last stage (i.e., the stage with input resolution of 7 $\times$ 7) as the representative, aggregate (take the absolute value and sum up across channels) the resultant kernel into a 13 $\times$ 13 matrix, and rescale each matrix to $ $ for comparability. For the model with 3 $\times$ 3 re-param, we show both the original 13 $\times$ 13 kernel (only after BN fusion) and the result after re-param (i.e., adding the 3 $\times$ 3 kernel onto the central part of the 13 $\times$ 13 kernel). For the model without re-param, we also fuse the BN for a fair comparison.
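The two operations described above, adding the small kernel onto the center of the large one and aggregating absolute values across channels, can be sketched as follows. This is a simplified illustration, not the authors' visualization code: it assumes depth-wise kernels stored as nested lists of shape (channels, K, K) with BN already fused, the helper names are hypothetical, and the final rescaling step is left out.

```python
def merge_small_into_large(large, small):
    """Add each small k x k kernel onto the center of the matching K x K kernel.

    Both inputs are lists of per-channel square kernels (depth-wise layout).
    Returns a new list; the inputs are left unchanged.
    """
    big_k, small_k = len(large[0]), len(small[0])
    off = (big_k - small_k) // 2  # top-left corner of the central patch
    merged = [[row[:] for row in channel] for channel in large]
    for c, channel in enumerate(small):
        for i in range(small_k):
            for j in range(small_k):
                merged[c][off + i][off + j] += channel[i][j]
    return merged


def aggregate_kernel(weight):
    """Sum absolute values across channels, giving one K x K matrix."""
    k = len(weight[0])
    agg = [[0.0] * k for _ in range(k)]
    for channel in weight:
        for i in range(k):
            for j in range(k):
                agg[i][j] += abs(channel[i][j])
    return agg


if __name__ == "__main__":
    # Toy weights: 2 channels, 13 x 13 large kernels and 3 x 3 small kernels
    large = [[[0.1] * 13 for _ in range(13)] for _ in range(2)]
    small = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
    agg = aggregate_kernel(merge_small_into_large(large, small))
    print(agg[6][6], agg[0][0])
```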

We observe that every aggregated kernel shows a similar pattern: the central point has the largest magnitude; generally, points closer to the center have larger values; and the “skeleton” parameters (the 13 $\times$ 1 and 1 $\times$ 13 criss-cross parts) are relatively larger, which is consistent with the discovery reported by ACNet . But the kernel with 3 $\times$ 3 re-param differs in that the central 3 $\times$ 3 part of the resultant kernel is further enhanced, which is found to improve the performance.