A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

Introduction

Convolutional neural networks (CNNs) have dominated the computer vision (CV) field since the renaissance of deep neural networks (DNNs). They have demonstrated effectiveness in numerous vision tasks, from image classification and object detection to pixel-based segmentation. Remarkably, despite the huge success of the Transformer structure in natural language processing (NLP), the CV community continued to focus on the CNN structure for quite some time.

The Transformer structure finally made its grand debut in CV last year. Vision Transformer (ViT) showed that a pure Transformer applied directly to a sequence of image patches can perform very well on image classification tasks if the training dataset is sufficiently large. DeiT further demonstrated that a Transformer can be successfully trained on a typical-scale dataset, such as ImageNet-1K, with appropriate data augmentation and model regularization.

Interestingly, before the heat of Transformer dissipated, the structure of multi-layer perceptrons (MLPs) was revived by Tolstikhin et al. in a work called MLP-Mixer. MLP-Mixer is based exclusively on MLPs applied across spatial locations and feature channels. When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks. The success of MLP-Mixer suggests that neither convolution nor attention is necessary for good performance. It sparked further research on MLP, as the authors wished.

However, as new network designs from various camps continue to push up the reported accuracy on image classification benchmarks, no conclusion can be drawn as to which structure among CNN, Transformer, and MLP performs the best or is most suitable for vision tasks. This is partly because the pursuit of high scores leads to multifarious tricks and exhaustive parameter tuning. As a result, network structures cannot be fairly compared in a systematic way. The work presented in this paper fills this gap by conducting a series of controlled experiments over CNN, Transformer, and MLP in a unified framework.

We first develop a unified framework called SPACH as shown in Fig. 1. It is mostly adopted from current Transformer and MLP frameworks, since convolution can also fit into this framework and is in general robust to optimization. The SPACH framework contains a plug-and-play module called mixing block which could be implemented as convolution layers, Transformer layers, or MLP layers. Aside from the mixing block, other components in the framework are kept the same when we explore different structures. This is in stark contrast to previous work which compares different network structures in different frameworks that vary greatly in layer cascade, normalization, and other non-trivial implementation details. As a matter of fact, we found that these structure-free components play an important role in the final performance of the model, and this is commonly neglected in the literature.

With this unified framework, we design a series of controlled experiments to compare the three network structures. The results show that all three network structures can perform well on the image classification task when pre-trained on ImageNet-1K. In addition, each individual structure has distinctive properties, leading to different behaviors when the network size scales up. We also find several common design choices that contribute substantially to the performance of our SPACH framework. The detailed findings are listed below.

Multi-stage design is standard in CNN models, but its effectiveness is largely overlooked in Transformer-based or MLP-based models. We find that the multi-stage framework consistently and notably outperforms the single-stage framework no matter which of the three network structures is chosen.

Local modeling is efficient and crucial. With only light-weight depth-wise convolutions, the convolution model can achieve similar performance as a Transformer model in our SPACH framework. By adding a local modeling bypass in both MLP and Transformer structures, a significant performance boost is obtained with negligible parameters and FLOPs increase.

MLP can achieve strong performance under small model sizes, but it suffers severely from over-fitting when the model size scales up. We believe that over-fitting is the main obstacle that prevents MLP from achieving SOTA performance.

Convolution and Transformer are complementary in the sense that convolution structure has the best generalization capability while Transformer structure has the largest model capacity among the three structures. This suggests that convolution is still the best choice in designing lightweight models but designing large models should take Transformer into account.

Based on these findings, we propose two hybrid models of different scales which are built upon convolution and Transformer layers. Experimental results show that, when a sweet spot between generalization capability and model capacity is reached, the performance of these straightforward hybrid models is already on par with SOTA models with sophisticated architecture designs.

Background

CNN and its variants have dominated the vision domain. During the evolution of CNN models, useful experience in architecture design has been accumulated. Recently, two types of architectures, namely Transformer and MLP, have begun to emerge in the vision domain and have shown performance similar to that of well-optimized CNNs. These results kindle a spark towards building better vision models beyond CNNs.

Convolution-based vision models. Since the start of the deep learning era pioneered by AlexNet, the computer vision community has devoted enormous effort to designing better vision backbones. In the past decade, most work focused on improving the design of CNNs, and a series of networks, including VGG, ResNet, SENet, Xception, MobileNet, and EfficientNet, were designed. They achieve significant accuracy improvements in various vision tasks.

A standard convolution layer learns filters in a 3D space, with two spatial dimensions and one channel dimension. Thus, the learning of spatial correlations and channel correlations is coupled inside a single convolution kernel. In contrast, a depth-wise convolution layer only learns spatial correlations, moving the learning of channel correlations to an additional 1x1 convolution. The fundamental hypothesis behind this design is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly. Recent work shows that depth-wise convolution can achieve both high accuracy and good efficiency, confirming this hypothesis to some extent. In addition, the idea of decoupling spatial and channel correlations is adopted in the vision Transformer. Therefore, this paper employs the spatial-channel decoupling idea in our framework design.
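To make the efficiency argument concrete, the following minimal sketch counts the weights of a standard convolution against a depth-wise convolution followed by a 1x1 convolution. The function names and the values `k = 3` and `c = 256` are illustrative assumptions, not taken from the paper, and biases are omitted for simplicity.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias omitted)."""
    # Every output channel has a k x k filter over every input channel,
    # so spatial and channel correlations are learned jointly.
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k depth-wise convolution plus a 1x1 convolution."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixes channels only
    return depthwise + pointwise


if __name__ == "__main__":
    k, c = 3, 256  # illustrative values only (assumption)
    full = standard_conv_params(k, c, c)
    split = separable_conv_params(k, c, c)
    print(f"standard: {full}, separable: {split}, ratio: {full / split:.1f}x")
```

Under these assumptions the decoupled form needs far fewer weights, because the spatial term grows with `k * k * c_in` rather than `k * k * c_in * c_out`.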

Transformer-based vision models. With the success of Transformer in natural language processing (NLP), many researchers have started to explore the use of Transformer as a stand-alone architecture for vision tasks. They face two main challenges. First, Transformer operates over a group of tokens, but an image has no natural tokens comparable to the words in natural language. Second, images have a strong local structure, while the Transformer structure treats all tokens equally and ignores locality. The pioneering work ViT solved the first challenge by simply dividing an image into non-overlapping patches and treating each patch as a visual token. ViT also reveals that Transformer models trained on large-scale datasets can attain SOTA image recognition performance. However, when the training data is insufficient, ViT does not perform well due to the lack of inductive biases. DeiT mitigates the problem by introducing a regularization and augmentation pipeline on ImageNet-1K.

Swin and Twins propose local ViT to address the second challenge. They adopt locally-grouped self-attention by computing the standard self-attention within non-overlapping windows. The local mechanism not only leads to performance improvement thanks to the reintroduction of locality, but also brings sufficient improvement in memory and computational efficiency. Thus, the pyramid structure becomes feasible again for vision Transformer.

There has been explosive growth in the design of Transformer-based vision models. Since this paper is not intended to review the progress of vision Transformer, we only briefly introduce some closely related Transformer models. CPVT and CvT introduce convolution into Transformer blocks, bringing the desired translation-invariance properties into the ViT architecture. CaiT introduces a LayerScale approach to enable effective training of deeper ViT networks. It also finds that class-attention layers built on top of a ViT network offer more effective processing than the class embedding. LV-ViT proposes a bag of training techniques to build a strong baseline for vision Transformer. LeViT proposes a hybrid neural network for fast image classification inference.

MLP-based vision models. Although MLP is not a new concept for the computer vision community, the recent progress on MLP-based visual models surprisingly demonstrates, both conceptually and technically, that a simple architecture can achieve performance competitive with CNN or Transformer. The pioneering work MLP-Mixer proposed a Mixer architecture using channel-mixing MLPs and token-mixing MLPs to communicate between different channels and spatial locations (tokens), respectively. It achieves promising results when trained on a large-scale dataset (i.e., JFT). ResMLP built a similar MLP-based model with a deeper architecture. ResMLP does not need large-scale datasets, and it achieves accuracy/complexity trade-offs on ImageNet-1K comparable to those of Transformer-based models. FF showed that simply replacing the attention layer in ViT with an MLP layer applied over the patch dimension could achieve moderate performance on ImageNet classification. gMLP proposed a gating mechanism on MLP and suggested that self-attention is not a necessary ingredient for scaling up machine learning models.

A Unified Experimental Framework

In order to fairly compare the three network structures, we need a unified framework that excludes other performance-affecting factors. Since recent MLP-based networks already share a framework similar to that of Transformer-based networks, we build the unified experimental framework based on them and include CNN-based networks in this framework as well.

We build our experimental framework with reference to ViT and MLP-Mixer. Fig. 1(a) shows the single-stage version of the SPACH framework, which is used for our empirical study. The architecture is very simple and consists mainly of a cascade of mixing blocks, plus some necessary auxiliary modules, such as patch embedding, global average pooling, and a linear classifier. Fig. 1(b) shows the details of the mixing block. Note that the spatial mixing and channel mixing are performed in consecutive steps. The name SPACH for our framework is coined to emphasize the serial structure of SPAtial and CHannel processing.

We also enable a multi-stage variation, referred to as SPACH-MS, as shown in Fig. 1(c). Multi-stage design is an important mechanism for improving performance in CNN-based networks. Unlike the single-stage SPACH, which processes the image at a low resolution by down-sampling it by a large factor at the input, SPACH-MS is designed to keep a high resolution in the initial stage of the framework and progressively perform down-sampling. Specifically, our SPACH-MS contains four stages with down-sample ratios of 4, 8, 16, and 32, respectively. Each stage contains $N_{s}$ mixing blocks, where $s$ is the stage index. Due to the extremely high computational cost of Transformer and MLP on high-resolution feature maps, we implement the mixing blocks in the first stage with convolutions only. The feature dimension within a stage remains constant, and is multiplied by a factor of 2 after down-sampling.
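As a rough illustration of why the first stage is restricted to convolutions, the sketch below derives the feature-map size, token count, and channel width of each stage from the down-sample ratios stated above, then compares the number of token pairs a global operator (self-attention or a spatial MLP) must relate with the number of neighbor links of a 3x3 local operator. The input size of 224 is the default resolution used in the experiments; `base_channels = 64` is a hypothetical value, since the actual widths come from Table 1.

```python
def stage_geometry(image_size=224, ratios=(4, 8, 16, 32), base_channels=64):
    """Return (ratio, side, tokens, channels) for each SPACH-MS stage.

    base_channels is a hypothetical first-stage width (assumption);
    the channel count doubles after every down-sampling step.
    """
    stages = []
    channels = base_channels
    for ratio in ratios:
        side = image_size // ratio          # feature-map height and width
        tokens = side * side                # number of spatial positions
        stages.append((ratio, side, tokens, channels))
        channels *= 2
    return stages


if __name__ == "__main__":
    for ratio, side, tokens, channels in stage_geometry():
        global_pairs = tokens * tokens      # all-to-all spatial mixing
        local_links = tokens * 9            # 3x3 neighborhood per position
        print(f"ratio {ratio:2d}: {side}x{side} map, {tokens} tokens, "
              f"{channels} channels, global pairs {global_pairs}, "
              f"local links {local_links}")
```

The all-to-all cost grows quadratically with the token count, so it is largest in the high-resolution first stage, whereas the local cost grows only linearly.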

We list the hyper-parameters used in different model configurations in Table 1. Three model sizes are designed for each variant of SPACH, namely SPACH-XXS, SPACH-XS, and SPACH-S, by controlling the number of blocks, the number of channels, and the expansion ratio of the channel mixing MLP $\mathcal{F}_{c}$. The model size, theoretical computational complexity (FLOPs), and empirical throughput are presented in Section 4. We measure the throughput using one P100 GPU.

3.2 Mixing Block Design

Following ViT, we use an MLP with appropriate normalization and residual connection to implement $\mathcal{F}_{c}$. The MLP here can also be viewed as a 1x1 convolution (also known as point-wise convolution), which is a special case of regular convolution. Note that $\mathcal{F}_{c}$ only performs channel fusion and does not explore any spatial context.

The spatial mixing function $\mathcal{F}_{s}$ is the key to implementing different architectures. As shown in Fig. 2, we implement three structures using convolution, self-attention, and MLP. The common components include normalization and residual connection. Specifically, the convolution structure is implemented by a 3x3 depth-wise convolution, as channel mixing will be handled separately in subsequent steps. For the Transformer structure, there is a positional embedding module in the original design. However, recent research suggests that absolute positional embedding breaks translation invariance, which makes it unsuitable for images. In view of this, and inspired by recent vision Transformer designs, we introduce a convolutional positional encoding (CPE) as a bypass in each spatial mixing module. The CPE module has negligible parameters and FLOPs. For the MLP-based network, the pioneering work MLP-Mixer does not use any positional embedding, but we empirically find that adding the very lightweight CPE significantly improves the model performance, so we use the same treatment for MLP as for Transformer.
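The serial spatial-then-channel structure can be summarized in a few lines. This is a sketch under stated assumptions rather than the exact formulation: the text says only that normalization and residual connections are used, so the pre-normalization placement shown here (as in ViT) is assumed, $\mathcal{M}_{s}$ is a hypothetical symbol for the core spatial operator (depth-wise convolution, self-attention, or spatial MLP), and the CPE bypass is left out.

```latex
\begin{aligned}
\mathcal{F}_{s}(X) &= X + \mathcal{M}_{s}\big(\mathrm{Norm}(X)\big), \\
\mathcal{F}_{c}(Y) &= Y + \mathrm{MLP}\big(\mathrm{Norm}(Y)\big), \\
\mathrm{Block}(X) &= \mathcal{F}_{c}\big(\mathcal{F}_{s}(X)\big).
\end{aligned}
```

Only $\mathcal{M}_{s}$ changes between the three structures, which is what allows the rest of the framework to be held fixed in the comparison.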

The three implementations of $\mathcal{F}_{s}$ have distinctive properties, as listed in Table 2. First, the convolution structure only involves local connections, so it is computationally efficient. Second, the self-attention structure uses dynamic weights for each input instance, so the model capacity is increased. Moreover, it has a global receptive field, which enables information to flow freely across different positions. Third, the MLP structure has a global receptive field just like the self-attention structure, but it does not use dynamic weights. In summary, these three properties seen in different architectures are all desirable and may have a positive influence on model performance or efficiency. Convolution and self-attention have complementary properties, so there is potential to build a hybrid model that combines all desirable properties. In contrast, the MLP structure appears inferior to self-attention in this analysis.

Empirical Studies on Mixing Blocks

In this section, we design a series of controlled experiments to compare the three network structures. We first introduce the experimental settings in Section 4.1, and then present our main findings in Sections 4.2, 4.3, 4.4, and 4.5.

We conduct experiments on ImageNet-1K (IN-1K) image classification, which has 1k classes. The training set has 1.28M images, while the validation set has 50k images. The Top-1 accuracy on a single crop is reported. Unless otherwise indicated, we use an input resolution of 224x224. Most of our training settings are inherited from DeiT. We train for 300 epochs with an AdamW optimizer, a cosine decay learning rate scheduler, and 20 epochs of linear warm-up. The weight decay is 0.05, and the initial learning rate is $0.005\times\frac{\text{batchsize}}{512}$. We train on 8 GPUs with a mini-batch of 128 per GPU, resulting in a total batch size of 1024. We use exactly the same data augmentation and regularization configurations as DeiT, including Rand-Augment, random erasing, Mixup, CutMix, stochastic depth, and repeated augmentation. We use the same training pipeline for all compared models. The implementation is built upon PyTorch and the timm library.
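The learning-rate rule above can be sketched as a small function. This is an illustration under stated assumptions, not the schedule implementation used in the experiments: it updates once per epoch, warms up linearly from a fraction of the peak value, and decays to a hypothetical minimum of zero, whereas a library scheduler may use a different warm-up start value, a non-zero floor, or per-iteration updates.

```python
import math


def learning_rate(epoch, total_epochs=300, warmup_epochs=20,
                  batch_size=1024, base_lr=0.005, min_lr=0.0):
    """Linear warm-up followed by cosine decay, evaluated per epoch.

    The peak rate follows the scaling rule base_lr * batch_size / 512.
    min_lr and the per-epoch granularity are assumptions for illustration.
    """
    peak = base_lr * batch_size / 512
    if epoch < warmup_epochs:
        # Linear ramp that reaches the peak at the end of warm-up.
        return peak * (epoch + 1) / warmup_epochs
    # Fraction of the cosine phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 10, 19, 20, 160, 299):
        print(epoch, f"{learning_rate(epoch):.6f}")
```

With the stated total batch size of 1024, the scaling rule gives a peak rate of twice the base value.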

4.2 Multi-Stage is Superior to Single-Stage

Multi-stage design is standard in CNN models, but it is largely overlooked in Transformer-based or MLP-based models. Our first finding is that multi-stage design should always be adopted in vision models no matter which of the three network structures is chosen.

Table 3 compares the image classification performance of the multi-stage framework and the single-stage framework. For all three network scales and all three network structures, the multi-stage framework consistently achieves a better complexity-accuracy trade-off. For ease of comparison, the changes in FLOPs and accuracy are highlighted in Table 3. Most of the multi-stage models are designed to have a slightly lower computational cost, but they manage to achieve a higher accuracy than the corresponding single-stage models. An accuracy loss of 2.6 points is observed for the Transformer model at the XXS scale, but this is understandable, as the multi-stage model happens to have only half of the parameters and FLOPs of the corresponding single-stage model.

In addition, Fig. 3 shows how the image classification accuracy changes with the size of model parameters and model throughput. Despite the different trends observed for different network structures, the multi-stage models always outperform their single-stage counterparts.

This finding is consistent with the results reported in recent work. Both Swin-Transformer and Twins adopt a multi-stage framework and achieve stronger performance than the single-stage framework DeiT. Our empirical study suggests that the use of a multi-stage framework can be an important reason.

4.3 Local Modeling is Crucial

Although many previous works have pointed out that local modeling is crucial for vision models, we show in this subsection how remarkably efficient local modeling can be.

In our empirical study, the spatial mixing block of the convolution structure is implemented by a $3\times 3$ depth-wise convolution, which is a typical local modeling operation. It is so light-weight that it accounts for only 0.3% of the model parameters and 0.5% of the FLOPs. However, as Table 3 and Fig. 3 show, this structure can achieve competitive performance when compared with the Transformer structure in the XXS and XS configurations.
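A back-of-the-envelope count shows why the depth-wise convolution is such a small part of a mixing block. The sketch below compares its parameters with those of the channel-mixing MLP in the same block. The channel count of 384 and the expansion ratio of 4 are hypothetical values chosen for illustration (the actual settings are in Table 1), and normalization layers, patch embedding, and the classifier are ignored, so the result is not expected to reproduce the reported percentages exactly.

```python
def depthwise_conv_params(channels, kernel=3):
    """Parameters of a kernel x kernel depth-wise convolution with bias."""
    return kernel * kernel * channels + channels


def channel_mlp_params(channels, expansion=4):
    """Parameters of a two-layer channel MLP with the given expansion ratio."""
    hidden = expansion * channels
    first = channels * hidden + hidden      # expand: weights + bias
    second = hidden * channels + channels   # project back: weights + bias
    return first + second


def spatial_share(channels, expansion=4, kernel=3):
    """Fraction of mixing-block parameters held by the depth-wise conv."""
    spatial = depthwise_conv_params(channels, kernel)
    channel = channel_mlp_params(channels, expansion)
    return spatial / (spatial + channel)


if __name__ == "__main__":
    share = spatial_share(channels=384)     # hypothetical width (assumption)
    print(f"depth-wise conv share of block parameters: {share:.2%}")
```

The spatial term scales linearly with the channel count while the channel MLP scales quadratically, so under these assumptions the share stays well below 1% and shrinks further as the model widens.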

It is due to the sheer efficiency of $3\times 3$ depth-wise convolution that we propose to use it as a bypass in both MLP and Transformer structures. The increase of model parameters and inference FLOPs is almost negligible, but the locality of the models is greatly strengthened. In order to demonstrate how local modeling helps the performance of Transformer and MLP structures, we carry out an ablation study which removes this convolution bypass in the two structures.

Table 4 shows the performance comparison between models with and without local modeling. The two models we pick are the top performers in Table 3 when the multi-stage framework is used and the network scale is S. The convolution bypass only slightly decreases the throughput, but brings a notable accuracy increase to both models. Note that the convolution bypass is treated as convolutional positional embedding in Trans-MS-S, so we bring back the standard patch embedding as in ViT in $\text{Trans-MS-S}^{-}$. For $\text{MLP-MS-S}^{-}$, we follow the practice in MLP-Mixer and do not use any positional embedding. This experiment confirms the importance of local modeling and suggests the use of $3\times 3$ depth-wise convolution as a bypass for any network structure.

4.4 A Detailed Analysis of MLP

Due to the excessive number of parameters, MLP models suffer severely from over-fitting. We believe that over-fitting is the main obstacle for MLP to achieve SOTA performance. In this part, we discuss two mechanisms which can potentially alleviate this problem.

One is the use of the multi-stage framework. We have already shown in Table 3 that the multi-stage framework brings gains. Such gains are even more prominent for larger MLP models. In particular, the MLP-MS-S model achieves a 2.6-point accuracy gain over the single-stage model MLP-S. We attribute this to the strong generalization capability of the multi-stage framework. Fig. 4 shows how the test accuracy increases as the training loss decreases. Over-fitting can be observed when the test accuracy starts to flatten. These results also lead to a very promising baseline for MLP-based models. Without bells and whistles, the MLP-MS-S model achieves 82.1% ImageNet Top-1 accuracy, which is 5.7 points higher than the best result reported by MLP-Mixer when ImageNet-1K is used as training data.

The other mechanism is parameter reduction through weight sharing. We apply weight sharing to the spatial mixing function $\mathcal{F}_{s}$. For the single-stage model, all $N$ mixing blocks use the same $\mathcal{F}_{s}$, while for the multi-stage model, each stage uses the same $\mathcal{F}_{s}$ for its $N_{s}$ mixing blocks. We present the results of the S models in Table 5. The shared-weight variants, denoted by “+Shared”, achieve higher accuracy with almost the same model size and computation cost. Although they are still inferior to Transformer models, their performance is on par with or even better than that of convolution models. Fig. 4 confirms that using shared weights in the MLP-MS model further delays the appearance of over-fitting signs. Therefore, we conclude that MLP-based models remain competitive if they can solve or alleviate the over-fitting problem.

4.5 Convolution and Transformer are Complementary

We find that convolution and Transformer are complementary in the sense that convolution structure has the best generalization capability while Transformer structure has the largest model capacity among the three structures we investigated.

Fig. 5 shows that, before the performance of Conv-MS saturates, it achieves a higher test accuracy than Trans-MS at the same training loss. This shows that convolution models generalize better than Transformer models. In particular, when the training loss is relatively large, the convolution models show a clear advantage over Transformer models. This suggests that convolution is still the best choice in designing lightweight vision models.

On the other hand, both Fig. 3 and Fig. 5 show that Transformer models achieve higher accuracy than the other two structures when we increase the model size and allow for higher computational cost. Recall that we have discussed three properties of network architectures in Section 3.2. It is now clear that the sparse connectivity helps to increase generalization capability, while dynamic weight and global receptive field help to increase model capacity.

Hybrid Models

As discussed in Sections 3.2 and 4.5, convolution and Transformer structures have complementary characteristics and have the potential to be used in a single model. Based on this observation, we construct hybrid models at the XS and S scales based on these two structures. The procedure we use to construct hybrid models is rather simple. We take a multi-stage convolution-based model as the base model, and replace some selected layers with Transformer layers. Considering the local modeling capability of convolutions and the global modeling capability of Transformers, we tend to make such replacements in later stages of the model. The details of layer selection in the two hybrid models are listed as follows.

Hybrid-MS-XS: It is based on Conv-MS-XS. The last ten layers in Stage 3 and the last two layers in Stage 4 are replaced by Transformer layers. Stage 1 and 2 remain unchanged.

Hybrid-MS-S: It is based on Conv-MS-S. The last two layers in Stage 2, the last ten layers in Stage 3, and the last two layers in Stage 4 are replaced by Transformer layers. Stage 1 remains unchanged.
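The two replacement rules above can be written as a small layout builder. This is a sketch of the selection logic only: the per-stage block counts in `HYPOTHETICAL_BLOCKS` are placeholder values, since the real counts are defined in Table 1, and `build_hybrid` is an illustrative helper name rather than part of any released code.

```python
def build_hybrid(blocks_per_stage, transformer_tail):
    """Return, per stage, a list of layer types ('conv' or 'trans').

    blocks_per_stage: number of mixing blocks in stages 1..4.
    transformer_tail: maps a 1-based stage index to how many of its
        last layers are replaced by Transformer layers.
    """
    layout = []
    for stage, n_blocks in enumerate(blocks_per_stage, start=1):
        n_trans = transformer_tail.get(stage, 0)
        if n_trans > n_blocks:
            raise ValueError(f"stage {stage} has only {n_blocks} blocks")
        # Keep convolution layers first, Transformer layers at the end.
        layout.append(["conv"] * (n_blocks - n_trans) + ["trans"] * n_trans)
    return layout


HYPOTHETICAL_BLOCKS = (3, 4, 12, 3)     # placeholder counts (assumption)
HYBRID_XS_TAIL = {3: 10, 4: 2}          # Hybrid-MS-XS rule from the text
HYBRID_S_TAIL = {2: 2, 3: 10, 4: 2}     # Hybrid-MS-S rule from the text

if __name__ == "__main__":
    for name, tail in (("XS", HYBRID_XS_TAIL), ("S", HYBRID_S_TAIL)):
        for stage, layers in enumerate(build_hybrid(HYPOTHETICAL_BLOCKS, tail), 1):
            print(name, stage, layers.count("conv"), "conv,",
                  layers.count("trans"), "trans")
```

In both rules Stage 1 stays purely convolutional and the Transformer layers sit at the end of the later stages, matching the preference for local modeling early and global modeling late.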

In order to unleash the full potential of hybrid models, we further adopt the deep patch embedding layer (PEL) implementation as suggested in LV-ViT. Different from the default PEL, which uses one large (16x16) convolution kernel, the deep PEL uses four convolution kernels with kernel size $\{7,3,3,2\}$, stride $\{2,1,1,2\}$, and channel number $\{64,64,64,C\}$. By using small kernel sizes and more convolution kernels, deep PEL helps a vision model to explore the locality inside a single patch embedding vector. We mark models with deep PEL as “Hybrid-MS-*+”.
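The spatial effect of the deep PEL can be checked with the standard convolution output-size formula. The kernel sizes and strides below are those stated above; the paddings `(3, 1, 1, 0)` are an assumption, chosen so that the stride-1 layers preserve the resolution, since the text does not specify them.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output side length of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def deep_pel_sizes(size=224, kernels=(7, 3, 3, 2), strides=(2, 1, 1, 2),
                   paddings=(3, 1, 1, 0)):
    """Trace the feature-map side length through the four deep-PEL layers.

    paddings is a hypothetical choice (assumption), not given in the text.
    """
    trace = []
    for kernel, stride, padding in zip(kernels, strides, paddings):
        size = conv_output_size(size, kernel, stride, padding)
        trace.append(size)
    return trace


if __name__ == "__main__":
    trace = deep_pel_sizes()
    print("side length after each layer:", trace)
    print("overall down-sampling factor:", 224 // trace[-1])
```

Under this padding assumption the four layers together down-sample by the product of their strides, which is 4, the same ratio as the first stage of SPACH-MS.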

Table 6 shows a comparison between our hybrid models and some of the state-of-the-art models based on CNN, Transformer, or MLP. All listed models are trained on ImageNet-1K. Within the section of our models, hybrid models achieve a better trade-off between model size and performance than pure convolution models or Transformer models. Hybrid-MS-XS achieves 82.4% top-1 accuracy with 28M parameters, which is higher than Conv-MS-S with 44M parameters and only a little lower than Trans-MS-S with 40M parameters. In addition, Hybrid-MS-S achieves 83.7% top-1 accuracy with 63M parameters, a gain of 0.8 points over Trans-MS-S.

The Hybrid-MS-S+ model we propose achieves 83.9% top-1 accuracy with 63M parameters. This number is higher than the accuracy achieved by the SOTA models Swin-B and CaiT-S36, which have model sizes of 88M and 68.2M, respectively. Our model also requires fewer FLOPs than these two models. We believe Hybrid-MS-S can serve as a strong yet simple baseline for future research on the architecture design of vision models.

Conclusion

The objective of this work is to understand how the emerging Transformer and MLP structures compare with CNNs in the computer vision domain. We first built a simple and unified framework, called SPACH, that can use CNN, Transformer, or MLP as plug-and-play components. Under the SPACH framework, we discover, somewhat to our surprise, that all three network structures are similarly competitive in terms of the accuracy-complexity trade-off, although they show distinctive properties when the network scales up. In addition to the analysis of specific network structures, we also investigate two important design choices, namely the multi-stage framework and local modeling, which are largely overlooked in previous work. Finally, inspired by the analysis, we propose two hybrid models which achieve SOTA performance on ImageNet-1K classification without bells and whistles.

Our work also raises several questions worth exploring. First, realizing the fact that the performance of MLP-based models is largely affected by over-fitting, is it possible to design a high-performing MLP model that is not subject to over-fitting? Second, current analyses suggest that neither convolution nor Transformer is the optimal structure across all model sizes. What is the best way to fuse these two structures? Last but not least, do better visual models exist beyond the known structures including CNN, Transformer, and MLP?