MetaFormer Is Actually What You Need for Vision
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
Introduction
Transformers have gained much interest and success in the computer vision field. Since the seminal work of Vision Transformer (ViT), which adapts pure Transformers to image classification tasks, many follow-up models have been developed to make further improvements and have achieved promising performance in various computer vision tasks.
The Transformer encoder, as shown in Figure 1(a), consists of two components. One is the attention module for mixing information among tokens, which we term the token mixer. The other component contains the remaining modules, such as channel MLPs and residual connections. By regarding the attention module as a specific token mixer, we further abstract the overall Transformer into a general architecture, MetaFormer, where the token mixer is not specified, as shown in Figure 1(a).
The success of Transformers has long been attributed to the attention-based token mixer. Based on this common belief, many variants of the attention module have been developed to improve the Vision Transformer. However, a very recent work replaces the attention module completely with spatial MLPs as token mixers, and finds that the derived MLP-like model can readily attain competitive performance on image classification benchmarks. Follow-up works further improve MLP-like models through data-efficient training and specific MLP module designs, gradually narrowing the performance gap to ViT and challenging the dominance of attention as the token mixer.
Some recent approaches explore other types of token mixers within the MetaFormer architecture and have demonstrated encouraging performance. For example, one such work replaces attention with the Fourier Transform and still achieves around 97% of the accuracy of vanilla Transformers. Taking all these results together, it seems that as long as a model adopts MetaFormer as its general architecture, promising results can be attained. We thus hypothesize that, compared with specific token mixers, MetaFormer is more essential for a model to achieve competitive performance.
To verify this hypothesis, we apply an extremely simple non-parametric operator, pooling, as the token mixer to conduct only basic token mixing. Astonishingly, the derived model, termed PoolFormer, achieves competitive performance and even consistently outperforms well-tuned Transformer and MLP-like models, including DeiT and ResMLP, as shown in Figure 1(b). More specifically, PoolFormer-M36 achieves 82.1% top-1 accuracy on the ImageNet-1K classification benchmark, surpassing the well-tuned vision Transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 50%/62% fewer MACs. These results demonstrate that MetaFormer, even with a naive token mixer, can still deliver promising performance. We thus argue that MetaFormer is the de facto need for vision models, and that it is more essential for achieving competitive performance than any specific token mixer. Note that this does not mean the token mixer is insignificant; MetaFormer still has this abstracted component. It means the token mixer is not limited to a specific type, e.g., attention.
The contributions of our paper are two-fold. Firstly, we abstract Transformers into a general architecture, MetaFormer, and empirically demonstrate that the success of Transformer/MLP-like models is largely attributable to the MetaFormer architecture. Specifically, by employing only a simple non-parametric operator, pooling, as an extremely weak token mixer for MetaFormer, we build a simple model named PoolFormer and find it can still achieve highly competitive performance. We hope our findings inspire more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Secondly, we evaluate the proposed PoolFormer on multiple vision tasks including image classification, object detection, instance segmentation, and semantic segmentation, and find it achieves competitive performance compared with SOTA models that use sophisticated token mixer designs. PoolFormer can readily serve as a good starting baseline for future MetaFormer architecture design.
Related work
Transformers were first proposed for translation tasks and then rapidly became popular in various NLP tasks. In language pre-training tasks, Transformers are trained on large-scale unlabeled text corpora and achieve amazing performance. Inspired by the success of Transformers in NLP, many researchers apply attention mechanisms and Transformers to vision tasks. Notably, Chen et al. introduce iGPT, where the Transformer is trained to auto-regressively predict pixels on images for self-supervised learning. Dosovitskiy et al. propose Vision Transformer (ViT) with hard patch embedding as input. They show that on supervised image classification tasks, a ViT pre-trained on a large proprietary dataset (the JFT dataset with 300 million images) can achieve excellent performance. DeiT and T2T-ViT further demonstrate that a ViT trained from scratch on only ImageNet-1K (1.3 million images) can achieve promising performance. Many works have focused on improving the token mixing approach of Transformers through shifted windows, relative position encoding, refined attention maps, or incorporated convolution, etc. In addition to attention-like token mixers, recent works surprisingly find that merely adopting MLPs as token mixers can still achieve competitive performance. This discovery challenges the dominance of attention-based token mixers and has triggered a heated discussion in the research community about which token mixer is better. However, the target of this work is neither to engage in this debate nor to design new complicated token mixers to achieve a new state of the art. Instead, we examine a fundamental question: what is truly responsible for the success of Transformers and their variants? Our answer is the general architecture, i.e., MetaFormer. We simply utilize pooling as a basic token mixer to probe the power of MetaFormer.
Concurrently, some works contribute to answering the same question. Dong et al. prove that without residual connections or MLPs, the output converges doubly exponentially to a rank-one matrix. Raghu et al. compare the features of ViT and CNNs, finding that self-attention allows early gathering of global information while residual connections strongly propagate features from lower layers to higher ones. Park et al. show that multi-head self-attention improves accuracy and generalization by flattening the loss landscape. Unfortunately, these works do not abstract Transformers into a general architecture or study them from the perspective of a general framework.
Method
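We first present the core concept of this work, “MetaFormer”. As shown in Figure 1, MetaFormer is a general architecture abstracted from Transformers, in which the token mixer is not specified while the other components are kept the same as in Transformers. The input $I$ is first processed by input embedding, such as patch embedding for ViTs,
$$X = \mathrm{InputEmb}(I), \tag{1}$$
where $X \in \mathbb{R}^{N \times C}$ denotes the embedding tokens with sequence length $N$ and embedding dimension $C$.
Then, the embedding tokens are fed to repeated MetaFormer blocks, each of which includes two residual sub-blocks. Specifically, the first sub-block mainly contains a token mixer to communicate information among tokens, and this sub-block can be expressed as
$$Y = \mathrm{TokenMixer}(\mathrm{Norm}(X)) + X, \tag{2}$$
where $\mathrm{Norm}(\cdot)$ denotes a normalization such as Layer Normalization or Batch Normalization, and $\mathrm{TokenMixer}(\cdot)$ denotes a module mainly responsible for mixing token information.
The second sub-block primarily consists of a two-layered MLP with non-linear activation,
$$Z = \sigma(\mathrm{Norm}(Y) W_1) W_2 + Y, \tag{3}$$
where $W_1 \in \mathbb{R}^{C \times rC}$ and $W_2 \in \mathbb{R}^{rC \times C}$ are learnable parameters with MLP expansion ratio $r$, and $\sigma(\cdot)$ is a non-linear activation function such as GELU or ReLU.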
Instantiations of MetaFormer. MetaFormer describes a general architecture with which different models can be obtained immediately by specifying the concrete design of the token mixers. As shown in Figure 1(a), if the token mixer is specified as attention or spatial MLP, MetaFormer then becomes a Transformer or MLP-like model respectively.
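As a rough illustration of the abstraction described above, the following NumPy sketch implements one MetaFormer block with the token mixer passed in as a function. It is not the authors' code: the parameter-free `layer_norm`, the tanh approximation of GELU, the random weights, and the two toy mixers (`identity_mixer`, `global_mean_mixer`) are assumptions chosen only to keep the example self-contained.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token over its channels (no learnable scale or shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def metaformer_block(x, token_mixer, w1, w2):
    """x: (N, C) tokens; w1: (C, r*C); w2: (r*C, C)."""
    # Sub-block 1: token mixer with a residual connection.
    y = token_mixer(layer_norm(x)) + x
    # Sub-block 2: two-layer channel MLP with a residual connection.
    return gelu(layer_norm(y) @ w1) @ w2 + y


def identity_mixer(tokens):
    # Hypothetical mixer: no token mixing at all.
    return tokens


def global_mean_mixer(tokens):
    # Hypothetical toy mixer: every token receives the mean over all tokens.
    return np.repeat(tokens.mean(axis=0, keepdims=True), tokens.shape[0], axis=0)


rng = np.random.default_rng(0)
num_tokens, channels, ratio = 16, 8, 4
x = rng.standard_normal((num_tokens, channels))
w1 = 0.02 * rng.standard_normal((channels, ratio * channels))
w2 = 0.02 * rng.standard_normal((ratio * channels, channels))

# Swapping the mixer changes the model; the rest of the block is untouched.
out_identity = metaformer_block(x, identity_mixer, w1, w2)
out_mean = metaformer_block(x, global_mean_mixer, w1, w2)
print(out_identity.shape, out_mean.shape)
```

Passing attention or a spatial MLP as `token_mixer` would correspond to the Transformer and MLP-like instantiations mentioned above.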
PoolFormer
Since the introduction of Transformers, many works have attached much importance to attention and focused on designing various attention-based token mixer components. In contrast, these works pay little attention to the general architecture, i.e., MetaFormer.
In this work, we argue that the MetaFormer general architecture is mainly responsible for the success of recent Transformer and MLP-like models. To demonstrate this, we deliberately employ an embarrassingly simple operator, pooling, as the token mixer. This operator has no learnable parameters and simply makes each token aggregate the average of its nearby token features. Assuming the input $T \in \mathbb{R}^{C \times H \times W}$ is in channel-first data format, the pooling operator can be expressed as
$$T'_{:,i,j} = \frac{1}{K \times K} \sum_{p,q=1}^{K} T_{:,\, i+p-\frac{K+1}{2},\, j+q-\frac{K+1}{2}} - T_{:,i,j}, \tag{4}$$
where $K$ is the pooling size. Since the MetaFormer block already has a residual connection, the subtraction of the input itself is added in Equation (4). The PyTorch-like code of the pooling is shown in Algorithm 1.
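The pooling token mixer described above can be sketched in NumPy as a stride-1 average pool followed by subtraction of the input. This is an illustrative re-implementation, not the authors' Algorithm 1. Two choices are assumptions: the pooling size is odd, and padded border positions are excluded from the average.

```python
import numpy as np


def pooling_token_mixer(t, pool_size=3):
    """Average-pool a channel-first (C, H, W) array with stride 1, then subtract the input."""
    t = np.asarray(t, dtype=float)
    _, h, w = t.shape
    pad = pool_size // 2
    # Zero-pad the spatial dimensions so the output keeps the input resolution.
    padded = np.pad(t, ((0, 0), (pad, pad), (pad, pad)))
    # Mask of real (non-padded) positions, used to count valid neighbours.
    valid = np.pad(np.ones((h, w)), pad)
    window_sum = np.zeros_like(t)
    window_count = np.zeros((h, w))
    for dy in range(pool_size):
        for dx in range(pool_size):
            window_sum += padded[:, dy:dy + h, dx:dx + w]
            window_count += valid[dy:dy + h, dx:dx + w]
    # Subtract the input because the surrounding block adds it back via the residual.
    return window_sum / window_count - t


t = np.zeros((1, 3, 3))
t[0, 1, 1] = 9.0
mixed = pooling_token_mixer(t)
print(mixed[0])
```

A constant feature map is mapped to all zeros, which is a quick sanity check that the operator only passes on differences between a token and its neighbourhood.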
As is well known, self-attention and spatial MLPs have computational complexity quadratic in the number of tokens to mix. Even worse, spatial MLPs bring many more parameters when handling longer sequences. As a result, self-attention and spatial MLPs can usually only process hundreds of tokens. In contrast, pooling has computational complexity linear in the sequence length and no learnable parameters. Thus, we take advantage of pooling by adopting a hierarchical structure similar to traditional CNNs and recent hierarchical Transformer variants. Figure 2 shows the overall framework of PoolFormer. Specifically, PoolFormer has 4 stages with $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, and $\frac{H}{32} \times \frac{W}{32}$ tokens respectively, where $H$ and $W$ represent the height and width of the input image. There are two groups of embedding sizes: 1) small-sized models with embedding dimensions of 64, 128, 320, and 512 corresponding to the four stages; 2) medium-sized models with embedding dimensions of 96, 192, 384, and 768. Assuming there are $L$ PoolFormer blocks in total, stages 1, 2, 3, and 4 contain $L/6$, $L/6$, $L/2$, and $L/6$ PoolFormer blocks respectively. The MLP expansion ratio is set to 4. According to this simple model scaling rule, we obtain 5 different model sizes of PoolFormer, and their hyper-parameters are shown in Table 1.
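To make the scaling rule concrete, the sketch below derives a per-stage layout from the total number of blocks. The function name `poolformer_stage_layout` is hypothetical, and the 224×224 input is an assumption. The sketch takes the block split to be L/6, L/6, L/2, L/6 and the per-stage downsampling factors to be 4, 8, 16, and 32.

```python
def poolformer_stage_layout(total_blocks, size="small", height=224, width=224):
    """Return blocks, embedding dimension, and token count for each of the 4 stages."""
    embed_dims = {
        "small": (64, 128, 320, 512),
        "medium": (96, 192, 384, 768),
    }[size]
    if total_blocks % 6 != 0:
        raise ValueError("total_blocks must be divisible by 6")
    unit = total_blocks // 6
    blocks = (unit, unit, 3 * unit, unit)  # L/6, L/6, L/2, L/6
    strides = (4, 8, 16, 32)               # assumed downsampling per stage
    return [
        {"stage": i + 1, "blocks": b, "dim": d, "tokens": (height // s) * (width // s)}
        for i, (b, d, s) in enumerate(zip(blocks, embed_dims, strides))
    ]


layout = poolformer_stage_layout(12)
for row in layout:
    print(row)
```

Under these assumptions the first stage mixes a few thousand tokens, which is the regime where a linear-cost mixer such as pooling is most useful.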
Experiments
Results. Table 2 shows the performance of PoolFormers on ImageNet classification. Qualitative results are shown in the appendix. Surprisingly, despite the simple pooling token mixer, PoolFormers can still achieve highly competitive performance compared with CNNs and other MetaFormer-like models. For example, PoolFormer-S24 reaches a top-1 accuracy of more than 80 while requiring only 21M parameters and 3.4G MACs. Comparatively, the well-established ViT baseline DeiT-S attains a slightly worse accuracy of 79.8 and requires 35% more MACs (4.6G). The MLP-like model ResMLP-S24 needs 43% more parameters (30M) and 76% more computation (6.0G), yet attains only 79.4 accuracy. Even compared with more improved ViT and MLP-like variants, PoolFormer still shows better performance. Specifically, the pyramid Transformer PVT-Medium obtains 81.2 top-1 accuracy with 44M parameters and 6.7G MACs, while PoolFormer-S36 reaches 81.4 with 30% fewer parameters (31M) and 25% fewer MACs (5.0G) than PVT-Medium.
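The relative figures quoted above follow directly from the parameter and MAC counts given in the same paragraph. The short check below recomputes them; the helper names are hypothetical, and only numbers that appear in the text are used.

```python
def percent_more(value, reference):
    # How much larger `value` is than `reference`, in percent.
    return 100.0 * (value / reference - 1.0)


def percent_fewer(value, reference):
    # How much smaller `value` is than `reference`, in percent.
    return 100.0 * (1.0 - value / reference)


comparisons = {
    "DeiT-S MACs vs PoolFormer-S24": percent_more(4.6, 3.4),
    "ResMLP-S24 params vs PoolFormer-S24": percent_more(30, 21),
    "ResMLP-S24 MACs vs PoolFormer-S24": percent_more(6.0, 3.4),
    "PoolFormer-S36 params vs PVT-Medium": percent_fewer(31, 44),
    "PoolFormer-S36 MACs vs PVT-Medium": percent_fewer(5.0, 6.7),
}
for name, value in comparisons.items():
    print(f"{name}: {value:.1f}%")
```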
Besides, compared with RSB-ResNet (“ResNet Strikes Back”), where ResNet is trained with an improved training procedure for the same 300 epochs, PoolFormer still performs better. With 22M parameters/3.7G MACs, RSB-ResNet-34 gets 75.5 accuracy while PoolFormer-S24 obtains 80.3. Since the local spatial modeling ability of the pooling layer is much worse than that of the neural convolution layer, the competitive performance of PoolFormer can only be attributed to its general architecture, MetaFormer.
With the pooling operator, each token evenly aggregates the features from its nearby tokens. Thus it is an extremely basic token mixing operation. However, the experiment results show that even with this embarrassingly simple token mixer, MetaFormer still obtains highly competitive performance. Figure 3 clearly shows that PoolFormer surpasses other models with fewer MACs and parameters. This finding conveys that the general architecture MetaFormer is actually what we need when designing vision models. By adopting MetaFormer, it is guaranteed that the derived models would have the potential to achieve reasonable performance.
Object detection and instance segmentation
Setup. We evaluate PoolFormer on the challenging COCO benchmark, which includes 118K training images (train2017) and 5K validation images (val2017). The models are trained on the training set and the performance on the validation set is reported. PoolFormer is employed as the backbone for two standard detectors, i.e., RetinaNet and Mask R-CNN. ImageNet pre-trained weights are utilized to initialize the backbones and Xavier initialization is used for the added layers. AdamW is adopted for training with a batch size of 16. Following prior work, we employ the 1× training schedule, i.e., training the detection models for 12 epochs. The training images are resized to a shorter side of 800 pixels and a longer side of no more than 1,333 pixels. For testing, the shorter side of the images is also resized to 800 pixels. The implementation is based on the mmdetection codebase and the experiments are run on 8 NVIDIA A100 GPUs.
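The resizing rule in the setup can be read as an aspect-ratio-preserving rescale: pick the largest scale for which the shorter side is at most 800 and the longer side is at most 1,333. The sketch below follows that reading; `detection_resize` is a hypothetical helper and the rounding convention is an assumption, not a quote from the codebase used by the authors.

```python
def detection_resize(height, width, short_side=800, max_long_side=1333):
    """Return the (height, width) after an aspect-ratio-preserving rescale."""
    scale = min(short_side / min(height, width), max_long_side / max(height, width))
    # Round to the nearest integer pixel size (assumed convention).
    return int(height * scale + 0.5), int(width * scale + 0.5)


# A 480x640 image is limited by the shorter side.
print(detection_resize(480, 640))
# A very wide 500x1500 image is limited by the longer side instead.
print(detection_resize(500, 1500))
```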
Results. Equipped with RetinaNet for object detection, PoolFormer-based models consistently outperform their comparable ResNet counterparts, as shown in Table 3. For instance, PoolFormer-S12 achieves 36.2 AP, substantially surpassing ResNet-18 (31.8 AP). Similar results are observed for the models based on Mask R-CNN on object detection and instance segmentation. For example, PoolFormer-S12 substantially surpasses ResNet-18 (bounding box AP 37.3 vs. 34.0, and mask AP 34.6 vs. 31.2). Overall, for COCO object detection and instance segmentation, PoolFormers achieve competitive performance, consistently outperforming their ResNet counterparts.
Semantic segmentation
Setup. ADE20K, a challenging scene parsing benchmark, is selected to evaluate the models for semantic segmentation. The dataset includes 20K and 2K images in the training and validation sets, respectively, covering 150 fine-grained semantic categories. PoolFormers are evaluated as backbones equipped with Semantic FPN. ImageNet-1K trained checkpoints are used to initialize the backbones while Xavier initialization is used for the other newly added layers. Common practice trains models for 80K iterations with a batch size of 16. To speed up training, we double the batch size to 32 and decrease the iteration number to 40K. The AdamW optimizer is employed with an initial learning rate that decays following a polynomial schedule with a power of 0.9. Images are resized and cropped for training and are resized to a shorter side of 512 pixels for testing. Our implementation is based on the mmsegmentation codebase and the experiments are conducted on 8 NVIDIA A100 GPUs.
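A polynomial decay schedule with power 0.9 over 40K iterations can be sketched as below. The function `poly_lr` is a hypothetical helper; it uses a placeholder `base_lr` of 1.0, so it returns a multiplier on whatever initial learning rate is used, and it assumes the rate decays to zero at the final iteration.

```python
def poly_lr(iteration, max_iterations=40_000, base_lr=1.0, power=0.9):
    """Polynomially decayed learning rate at a given iteration."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


for step in (0, 10_000, 20_000, 30_000, 40_000):
    print(step, round(poly_lr(step), 4))
```

With a power below 1, the rate stays slightly above a straight linear ramp for most of training and drops more steeply near the end.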
Results. Table 4 shows the ADE20K semantic segmentation performance of different backbones using FPN. PoolFormer-based models consistently outperform the models with CNN-based ResNet and ResNeXt backbones as well as the Transformer-based PVT. For instance, PoolFormer-S12 achieves an mIoU of 37.1, which is 4.3 and 1.5 better than ResNet-18 and PVT-Tiny, respectively.
These results demonstrate that PoolFormer, serving as a backbone, can attain competitive performance on semantic segmentation even though it only uses pooling for basic communication of information among tokens. This further indicates the great potential of MetaFormer and supports our claim that MetaFormer is actually what we need.
Ablation studies
The ablation experiments are conducted on ImageNet-1K. Table 5 reports the ablation study of PoolFormer. We discuss the ablation below according to the following aspects.
Token mixers. Compared with Transformers, the main change made by PoolFormer is using simple pooling as a token mixer. We first conduct ablation for this operator by directly replacing pooling with identity mapping. Surprisingly, MetaFormer with identity mapping can still achieve 74.3% top-1 accuracy, supporting the claim that MetaFormer is actually what we need to guarantee reasonable performance.
Further, pooling is replaced with depthwise convolution, which has learnable parameters for spatial modeling. Not surprisingly, the derived model still achieves highly competitive performance with a top-1 accuracy of 78.1%, 0.9% higher than PoolFormer-S12, due to its better local spatial modeling ability. So far, we have specified multiple token mixers in MetaFormer, and all resulting models maintain promising results, supporting the claim that MetaFormer is the key to guaranteeing the models’ competitiveness. Due to its simplicity, pooling is mainly utilized as a tool to demonstrate MetaFormer.
We test the effect of pooling size on PoolFormer. We observe similar performance when the pooling size is 3, 5, or 7. However, when the pooling size increases to 9, there is an obvious performance drop of 0.5%. Thus, we adopt a default pooling size of 3 for PoolFormer.
Activation. We change GELU to ReLU or SiLU. When ReLU is adopted for activation, an obvious performance drop of 0.8% is observed. For SiLU, the performance is almost the same as that of GELU. Thus, we keep GELU as the default activation.
Other components. Besides token mixer and normalization discussed above, residual connection and channel MLP are two other important components in MetaFormer. Without residual connection or channel MLP, the model cannot converge and only achieves the accuracy of 0.1%/5.7%, proving the indispensability of these parts.
Hybrid stages. Among token mixers based on pooling, attention, and spatial MLP, the pooling-based one can handle much longer input sequences, while attention and spatial MLP are good at capturing global information. Therefore, it is intuitive to stack MetaFormer blocks with pooling in the bottom stages to handle long sequences and to use attention or spatial MLP-based mixers in the top stages, where the sequences have been largely shortened. Thus, we replace the pooling token mixer with attention or spatial FC in the top one or two stages of PoolFormer (following prior work, we use only one spatial fully connected layer as a token mixer, so we call it FC). From Table 5, the hybrid models perform quite well. The variant with pooling in the bottom two stages and attention in the top two stages delivers highly competitive performance, achieving 81.0% accuracy with only 16.5M parameters and 2.5G MACs. As a comparison, ResMLP-B24 needs 7.0× the parameters (116M) and 9.2× the MACs (23.0G) to achieve the same accuracy. These results indicate that combining pooling with other token mixers for MetaFormer may be a promising direction to further improve performance.
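The hybrid design amounts to a per-stage choice of token mixer: pooling in the bottom stages and attention or spatial FC in the top one or two. A minimal sketch of that assignment follows; the function name and the string labels are hypothetical and serve only to show the layout.

```python
def hybrid_token_mixers(top_mixer, num_top_stages, num_stages=4):
    """List the token mixer used in each stage, ordered from bottom to top."""
    if top_mixer not in ("attention", "spatial_fc"):
        raise ValueError("top_mixer must be 'attention' or 'spatial_fc'")
    if not 0 <= num_top_stages <= num_stages:
        raise ValueError("num_top_stages must lie in [0, num_stages]")
    # Pooling handles the long sequences of the early stages;
    # the global mixer is used only once the sequences are short.
    return ["pooling"] * (num_stages - num_top_stages) + [top_mixer] * num_top_stages


print(hybrid_token_mixers("attention", 2))
print(hybrid_token_mixers("spatial_fc", 1))
```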
Conclusion and future work
In this work, we abstracted the attention in Transformers as a token mixer, and the overall Transformer as a general architecture termed MetaFormer where the token mixer is not specified. Instead of focusing on specific token mixers, we point out that MetaFormer is actually what we need to guarantee achieving reasonable performance. To verify this, we deliberately specify token mixer as extremely simple pooling for MetaFormer. It is found that the derived PoolFormer model can achieve competitive performance on different vision tasks, which well supports that “MetaFormer is actually what you need for vision”.
In the future, we will further evaluate PoolFormer under more different learning settings, such as self-supervised learning and transfer learning. Moreover, it is interesting to see whether PoolFormer still works on NLP tasks to further support the claim “MetaFormer is actually what you need” in the NLP domain. We hope that this work can inspire more future research devoted to improving the fundamental architecture MetaFormer instead of paying too much attention to the token mixer modules.
Acknowledgement
The authors would like to thank Quanhong Fu at Sea AI Lab for the help to improve the technical writing aspect of this paper. Weihao Yu would like to thank TPU Research Cloud (TRC) program and Google Cloud research credits for the support of partial computational resources. This project is in part supported by NUS Faculty Research Committee Grant (WBS: A-0009440-00-00). Shuicheng Yan and Xinchao Wang are the corresponding authors.
Appendix A Detailed hyper-parameters on ImageNet-1K
PoolFormer. On the ImageNet-1K classification benchmark, we use the hyper-parameters shown in Table 6 to train the models in our paper. The batch size is set to 4096, with the learning rate set according to the relation between batch size and learning rate in Table 6. For stochastic depth, following the original paper, we linearly increase the probability of dropping a layer from 0.0 for the bottom block to its peak value for the top block.
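The linear stochastic depth rule can be sketched as a list of per-block drop probabilities that grows evenly from 0.0 at the bottom block to a peak at the top block. The helper name and the peak value of 0.1 in the usage example are hypothetical placeholders, not the values used in the paper.

```python
def stochastic_depth_rates(num_blocks, peak_rate):
    """Per-block drop probabilities, rising linearly from 0.0 to peak_rate."""
    if num_blocks < 2:
        raise ValueError("need at least two blocks to interpolate")
    return [peak_rate * i / (num_blocks - 1) for i in range(num_blocks)]


# Hypothetical example: 12 blocks with a placeholder peak rate of 0.1.
rates = stochastic_depth_rates(num_blocks=12, peak_rate=0.1)
print([round(r, 3) for r in rates])
```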
Hybrid Models. We use these hyper-parameters for all models except the hybrid models with pooling and attention token mixers. For these hybrid models, we find that they achieve much better performance by changing the batch size to 1024, changing the learning rate accordingly, and using Layer Normalization as the normalization.
Appendix B Training for longer epochs
In our paper, PoolFormer models are trained for the default 300 epochs on ImageNet-1K. For DeiT/ResMLP, it is observed that the performance saturates after 400/800 epochs. Thus, we also conduct experiments training PoolFormer-S12 for longer, and the results are shown in Table 7. We observe that PoolFormer-S12 reaches saturated performance after around 2000 epochs, with a top-1 accuracy improvement of 1.8%. However, for fair comparison with other ViT/MLP-like models, we still train PoolFormers for 300 epochs by default.
Appendix C Qualitative results
We use Grad-CAM to visualize the results of different models trained on ImageNet-1K. We find that although ResMLP also activates some irrelevant parts, all models can locate the semantic objects. The activation parts of DeiT and ResMLP in the maps are more scattered, while those of RSB-ResNet and PoolFormer are more gathered.
[Figure: Grad-CAM activation maps. Columns: Input, RSB-ResNet-50, DeiT-small, ResMLP-S24, PoolFormer-S24.]
Appendix D Comparison between Layer Normalization and Modified Layer Normalization
Appendix E Code in PyTorch
We provide the PyTorch-like code in Algorithm 3 associated with the modules used in the PoolFormer block. Algorithm 4 further shows the PoolFormer block built with these modules.