ResMLP: Feedforward networks for image classification with data-efficient training

Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

Introduction

Recently, the transformer architecture, adapted from its original use in natural language processing with only minor changes, has achieved performance competitive with the state of the art on ImageNet-1k when pre-trained with a sufficiently large amount of data. Retrospectively, this achievement is another step towards learning visual features with fewer priors: Convolutional Neural Networks (CNNs) had replaced the hand-designed choices of hard-wired features with flexible and trainable architectures. Vision transformers further remove several hard decisions encoded in convolutional architectures, namely translation invariance and local connectivity.

This evolution toward fewer hard-coded priors in the architecture has been fueled by better training schemes, and, in this paper, we push this trend further by showing that a purely multi-layer perceptron (MLP) based architecture, called Residual Multi-Layer Perceptrons (ResMLP), is competitive on image classification. ResMLP is designed to be simple and to encode little prior about images: it takes image patches as input, projects them with a linear layer, and sequentially updates their representations with two residual operations: (i) a cross-patch linear layer applied to all channels independently; and (ii) a cross-channel single-layer MLP applied independently to all patches. At the end of the network, the patch representations are average pooled and fed to a linear classifier. We outline ResMLP in Figure 1 and detail it further in Section 2.

The ResMLP architecture is strongly inspired by the vision transformers (ViT), yet it is much simpler in several ways: we replace the self-attention sublayer by a linear layer, resulting in an architecture with only linear layers and GELU non-linearities. We observe that the training of ResMLP is more stable than that of ViTs when using the same training scheme as in DeiT and CaiT, which allows us to remove the need for batch-specific or cross-channel normalizations such as BatchNorm, GroupNorm or LayerNorm. We speculate that this stability comes from replacing self-attention with linear layers. Finally, another advantage of using a linear layer is that we can still visualize the interactions between patch embeddings, revealing filters that are similar to convolutions in the lower layers, and longer range in the last layers.

We further investigate whether our purely MLP-based architecture could benefit other domains beyond images, particularly those with more complex output spaces. To this end, we adapt our MLP-based architecture to take inputs of variable length, and show its potential on the problem of Machine Translation. We develop a sequence-to-sequence (seq2seq) version of ResMLP, where both the encoder and the decoder are based on ResMLP, with cross-attention between the encoder and decoder. This model is similar to the original seq2seq Transformer, with ResMLP layers instead of Transformer layers. Despite not being originally designed for this task, we observe that ResMLP is competitive with Transformers on the challenging WMT benchmarks.

In summary, in this paper, we make the following observations:

despite its simplicity, ResMLP reaches surprisingly good accuracy/complexity trade-offs with ImageNet-1k training only, without requiring normalization based on batch or channel statistics [Footnote 1: Concurrent work by Tolstikhin et al. brings complementary insights to ours: they achieve interesting performance with larger MLP models pre-trained on the larger public ImageNet-22k and even more data with the proprietary JFT-300M. In contrast, we focus on faster models trained on ImageNet-1k. Other concurrent related work includes that of Melas-Kyriazi and the RepMLP and gMLP models.];

these models benefit significantly from distillation methods; they are also compatible with modern self-supervised learning methods based on data augmentation, such as DINO;

a seq2seq ResMLP achieves competitive performance compared to a seq2seq Transformer on the WMT benchmark for Machine Translation.

Method

In this section, we describe our architecture, ResMLP, as depicted in Figure 1. ResMLP is inspired by ViT and this section focuses on the changes made to ViT that lead to a purely MLP based model. We refer the reader to Dosovitskiy et al. for more details about ViT.

The overall ResMLP architecture. Our model, denoted by ResMLP, takes a grid of $N\times N$ non-overlapping patches as input, where the patch size is typically equal to $16\times 16$. The patches are then independently passed through a linear layer to form a set of $N^{2}$ $d$-dimensional embeddings.

The resulting set of $N^{2}$ embeddings is fed to a sequence of Residual Multi-Layer Perceptron layers to produce a set of $N^{2}$ $d$-dimensional output embeddings. These output embeddings are then averaged (“average-pooling”) into a $d$-dimensional vector that represents the image, which is fed to a linear classifier to predict the label associated with the image. Training uses the cross-entropy loss.
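The input and output stages described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper `patchify`, the random toy weights, the $224\times 224$ input and the values `d = 384` and `num_classes = 1000` are choices made for the example, and patches are stored as rows here purely for convenience.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    n_h, n_w = h // patch_size, w // patch_size
    # (n_h, p, n_w, p, c) -> (n_h, n_w, p, p, c) -> (n_h * n_w, p * p * c)
    patches = image[: n_h * patch_size, : n_w * patch_size].reshape(
        n_h, patch_size, n_w, patch_size, c
    )
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # toy input (assumed resolution)
d, num_classes = 384, 1000                   # illustrative sizes

patches = patchify(image, patch_size=16)     # (196, 768): N = 14, N^2 = 196
w_embed = rng.standard_normal((patches.shape[1], d)) * 0.02  # toy weights
embeddings = patches @ w_embed               # (196, 384), one row per patch

# ... the residual layers would update `embeddings` here ...

pooled = embeddings.mean(axis=0)             # average pooling over patches
w_cls = rng.standard_normal((d, num_classes)) * 0.02
logits = pooled @ w_cls                      # (1000,) class scores
```

Each patch is embedded by the same linear map, so the only place where patches interact is inside the residual layers.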

The Residual Multi-Perceptron Layer. Our network is a sequence of layers that all have the same structure: a linear sublayer applied across patches followed by a feedforward sublayer applied across channels. Similar to the Transformer layer, each sublayer is paralleled with a skip-connection. The absence of self-attention layers makes the training more stable, allowing us to replace Layer Normalization by a simpler Affine transformation:

$$\texttt{Aff}_{\bm{\alpha},\bm{\beta}}(\mathbf{x}) = \texttt{Diag}(\bm{\alpha})\,\mathbf{x} + \bm{\beta}, \tag{1}$$

where $\bm{\alpha}$ and $\bm{\beta}$ are learnable weight vectors. This operation only rescales and shifts the input element-wise. It has several advantages over other normalization operations: first, as opposed to Layer Normalization, it has no cost at inference time, since it can be absorbed in the adjacent linear layer. Second, as opposed to BatchNorm and Layer Normalization, the Aff operator does not depend on any batch or channel statistics. The closest operator to Aff is the LayerScale introduced by Touvron et al., with an additional bias term. For convenience, we denote by $\texttt{Aff}(\mathbf{X})$ the Affine operation applied independently to each column of the matrix $\mathbf{X}$.

We apply the Aff operator at the beginning (“pre-normalization”) and end (“post-normalization”) of each residual block. As a pre-normalization, Aff replaces LayerNorm without using channel-wise statistics. Here, we initialize $\bm{\alpha}=\bm{1}$ and $\bm{\beta}=\bm{0}$. As a post-normalization, Aff is similar to LayerScale and we initialize $\bm{\alpha}$ with the same small value as in CaiT.
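The claim that Aff has no inference cost can be checked numerically. Since $\mathbf{W}(\texttt{Diag}(\bm{\alpha})\mathbf{x}+\bm{\beta})+\mathbf{b} = (\mathbf{W}\,\texttt{Diag}(\bm{\alpha}))\mathbf{x} + (\mathbf{W}\bm{\beta}+\mathbf{b})$, the operator can be folded into the weights and bias of the linear layer that follows it. The sketch below uses random toy values; the names `w`, `b`, `w_folded` and `b_folded` are introduced only for this illustration.

```python
import numpy as np

def aff(x, alpha, beta):
    """Element-wise rescale and shift: Diag(alpha) x + beta."""
    return alpha * x + beta

rng = np.random.default_rng(0)
d, hidden = 8, 32                       # toy sizes
x = rng.standard_normal(d)
alpha, beta = rng.standard_normal(d), rng.standard_normal(d)
w, b = rng.standard_normal((hidden, d)), rng.standard_normal(hidden)

# Training-time computation: Aff followed by a linear layer.
y_train = w @ aff(x, alpha, beta) + b

# Inference-time computation: fold Aff into the linear layer once.
w_folded = w * alpha                    # scales column j of w by alpha[j], i.e. w @ Diag(alpha)
b_folded = w @ beta + b
y_infer = w_folded @ x + b_folded

assert np.allclose(y_train, y_infer)
```

The same folding does not apply to Layer Normalization, whose scaling depends on statistics of the input itself.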

Overall, our Multi-layer perceptron takes a set of $N^{2}$ $d$-dimensional input features stacked in a $d\times N^{2}$ matrix $\mathbf{X}$, and outputs a set of $N^{2}$ $d$-dimensional output features, stacked in a matrix $\mathbf{Y}$, with the following set of transformations:

$$\mathbf{Z} = \mathbf{X} + \texttt{Aff}\left(\left(\mathbf{A}\,\texttt{Aff}(\mathbf{X})^{\top}\right)^{\top}\right), \tag{2}$$

$$\mathbf{Y} = \mathbf{Z} + \texttt{Aff}\left(\mathbf{C}\,\texttt{GELU}(\mathbf{B}\,\texttt{Aff}(\mathbf{Z}))\right), \tag{3}$$

where $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are the main learnable weight matrices of the layer. Note that Eq. (3) is the same as the feedforward sublayer of a Transformer with the ReLU non-linearity replaced by a GELU function. The dimensions of the parameter matrix $\mathbf{A}$ are $N^{2}\times N^{2}$, i.e., this “cross-patch” sublayer exchanges information between patches, while the “cross-channel” feedforward sublayer works per location. Similar to a Transformer, the intermediate activation matrix $\mathbf{Z}$ has the same dimensions as the input and output matrices, $\mathbf{X}$ and $\mathbf{Y}$. Finally, the weight matrices $\mathbf{B}$ and $\mathbf{C}$ have the same dimensions as in a Transformer layer, which are $4d\times d$ and $d\times 4d$, respectively.
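A minimal NumPy sketch of one layer, following Eqs. (2) and (3) with $\mathbf{X}$ stored as a $d\times N^{2}$ matrix, may help make the shapes concrete. Several details are assumptions made for the example: the tanh approximation of GELU, the random toy initialization, the placeholder `post_scale = 0.1` standing in for the small post-normalization value, the sizes `d = 384` and `n_patches = 196`, and the omission of bias terms in the linear maps (as in the equations).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact form uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def aff(x, alpha, beta):
    # x: (d, n_patches); alpha, beta: (d, 1), applied to every column (patch)
    return alpha * x + beta

def init_layer(d, n_patches, rng, post_scale=0.1):
    # post_scale is a placeholder for the small value used at the end of each block
    def pre():
        return np.ones((d, 1)), np.zeros((d, 1))
    def post():
        return np.full((d, 1), post_scale), np.zeros((d, 1))
    return {
        "A": rng.standard_normal((n_patches, n_patches)) * 0.02,
        "B": rng.standard_normal((4 * d, d)) * 0.02,
        "C": rng.standard_normal((d, 4 * d)) * 0.02,
        "pre1": pre(), "post1": post(), "pre2": pre(), "post2": post(),
    }

def resmlp_layer(x, p):
    # Eq. (2): cross-patch linear sublayer, applied to the transposed features
    z = x + aff((p["A"] @ aff(x, *p["pre1"]).T).T, *p["post1"])
    # Eq. (3): cross-channel feedforward sublayer, applied per patch
    y = z + aff(p["C"] @ gelu(p["B"] @ aff(z, *p["pre2"])), *p["post2"])
    return y

rng = np.random.default_rng(0)
d, n_patches = 384, 196
params = init_layer(d, n_patches, rng)
x = rng.standard_normal((d, n_patches))
y = resmlp_layer(x, params)                             # same shape as x

n_params_A = params["A"].size                           # N^4 = 196 * 196
n_params_BC = params["B"].size + params["C"].size       # 8 * d^2
```

With these sizes the cross-patch matrix holds 38,416 weights, against 1,179,648 for the two feedforward matrices, so most of the parameters of a layer sit in the cross-channel sublayer.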

Differences with the Vision Transformer architecture. Our architecture is closely related to the ViT model. However, ResMLP departs from ViT with several simplifications:

no self-attention blocks: they are replaced by a linear layer with no non-linearity,

no positional embedding: the linear layer implicitly encodes information about patch positions,

no extra “class” token: we simply use average pooling on the patch embeddings,

no normalization based on batch statistics: we use a learnable affine operator.

Class-MLP as an alternative to average-pooling. We propose an adaptation of the class-attention token introduced in CaiT. In CaiT, this consists of two layers that have the same structure as the transformer, but in which only the class token is updated based on the frozen patch embeddings. We translate this method to our architecture, except that, after aggregating the patches with a linear layer, we replace the attention-based interaction between the class and patch embeddings by simple linear layers, still keeping the patch embeddings frozen. This increases the performance, at the expense of adding some parameters and computational cost. We refer to this pooling variant as “class-MLP”, since the purpose of these few layers is to replace average pooling.

Sequence-to-sequence ResMLP. Similar to the Transformer, the ResMLP architecture can be applied to sequence-to-sequence tasks. First, we follow the general encoder-decoder architecture from Vaswani et al., where we replace the self-attention sublayers by the residual multi-perceptron layer. In the decoder, we keep the cross-attention sublayers, which attend to the output of the encoder. Also in the decoder, we adapt the linear sublayers to the task of language modeling by constraining the matrix $\mathbf{A}$ to be triangular, in order to prevent a given token representation from accessing tokens from the future. Finally, the main technical difficulty of using linear sublayers in a sequence-to-sequence model is dealing with variable sequence lengths. However, we observe that simply padding with zeros and extracting the submatrix of $\mathbf{A}$ corresponding to the longest sequence in a batch works well in practice.
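The two adaptations described above, a triangular $\mathbf{A}$ for the decoder and a submatrix of $\mathbf{A}$ for variable lengths, can be illustrated with a small NumPy sketch. The function `cross_token_linear`, the maximum length of 16, and the convention that row $i$ of $\mathbf{A}$ holds the weights used to compute output token $i$ are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d = 16, 8                           # toy sizes
a_full = rng.standard_normal((max_len, max_len))
a_causal = np.tril(a_full)                   # zero out weights on future tokens

def cross_token_linear(tokens, a):
    """tokens: (batch, length, d), zero-padded; uses the top-left submatrix of a."""
    length = tokens.shape[1]
    return a[:length, :length] @ tokens      # matmul broadcasts over the batch

# Two sequences of different lengths, zero-padded to the longest one.
lengths = [5, 3]
batch = np.zeros((len(lengths), max(lengths), d))
for i, n in enumerate(lengths):
    batch[i, :n] = rng.standard_normal((n, d))

out = cross_token_linear(batch, a_causal)    # (2, 5, 8)

# Causality check: changing tokens at positions >= 3 leaves outputs 0..2 untouched.
perturbed = batch.copy()
perturbed[:, 3:] += 1.0
out_perturbed = cross_token_linear(perturbed, a_causal)
assert np.allclose(out[:, :3], out_perturbed[:, :3])
```

Because padded positions are exactly zero, they contribute nothing to the weighted sums, which is one way to see why zero padding is a reasonable default even with the non-triangular encoder matrix.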

Experiments

In this section, we present experimental results for the ResMLP architecture on image classification and machine translation. We also study the impact of the different components of ResMLP in ablation studies. We consider three training paradigms for images:

Supervised learning: We train ResMLP from labeled images with a softmax classifier and cross-entropy loss. This paradigm is the main focus of our work.

Self-supervised learning: We train the ResMLP with the DINO method of Caron et al., which trains a network without labels by distilling knowledge from previous instances of the same network.

Knowledge distillation: We employ the knowledge distillation procedure proposed by Touvron et al. to guide the supervised training of ResMLP with a convnet.

Datasets. We train our models on the ImageNet-1k dataset, which contains 1.2M images evenly spread over 1,000 object categories. In the absence of an available test set for this benchmark, we follow the standard practice in the community by reporting performance on the validation set. This is not ideal since the validation set was originally designed to select hyper-parameters. Comparing methods on this set may not be conclusive enough because an improvement in performance may not be caused by better modeling, but by a better selection of hyper-parameters. To mitigate this risk, we report additional results in transfer learning and on two alternative versions of ImageNet that have been built to have distinct validation and test sets, namely the ImageNet-real and ImageNet-v2 datasets. We also report a few data-points when training on ImageNet-21k. Our hyper-parameters are mostly adopted from Touvron et al.

Hyper-parameter settings. In the case of supervised learning, we train our network with the Lamb optimizer with a learning rate of $5\times 10^{-3}$ and weight decay $0.2$. We initialize the LayerScale parameters as a function of the depth by following CaiT. The rest of the hyper-parameters follow the default setting used in DeiT. For the knowledge distillation paradigm, we use the same RegNetY-16GF as in DeiT with the same training schedule. The majority of our models take two days to train on eight V100-32GB GPUs.

2 Main Results

In this section, we compare ResMLP with architectures based on convolutions or self-attentions with comparable size and throughput on ImageNet.

Supervised setting. In Table 1, we compare ResMLP with different convolutional and Transformer architectures. For completeness, we also report the best-published numbers obtained with a model trained on ImageNet alone. While the trade-off between accuracy, FLOPs, and throughput for ResMLP is not as good as that of convolutional networks or Transformers, its strong accuracy still suggests that the structural constraints imposed by the layer design do not have a drastic influence on performance, especially when training with enough data and recent training schemes.

Self-supervised setting. We pre-train ResMLP-S12 using the self-supervised method called DINO during 300 epochs. We report our results in Table 3.2. The trend is similar to the supervised setting: the accuracy obtained with ResMLP is lower than that of ViT. Nevertheless, the performance is surprisingly high for a pure MLP architecture and competitive with convnets in $k$-NN evaluation. Additionally, we also fine-tune the network pre-trained with self-supervision on ImageNet using the ground-truth labels. Pre-training substantially improves performance compared to a ResMLP-S24 solely trained with labels, achieving 79.9% top-1 accuracy on ImageNet-val (+0.5%).

Knowledge distillation setting. We study our model when training with the knowledge distillation approach of Touvron et al. In their work, the authors show the impact of training a ViT model by distilling it from a RegNet. In this experiment, we explore whether ResMLP also benefits from this procedure and summarize our results in Table 3 (blocks “Baseline models” and “Training”). We observe that, similar to DeiT models, ResMLP greatly benefits from distilling from a convnet. This result concurs with the observations made by d’Ascoli et al., who used convnets to initialize feedforward networks. Even though our setting differs from theirs in scale, the problem of overfitting for feedforward networks is still present on ImageNet. The additional regularization obtained from the distillation is a possible explanation for this improvement.

3 Visualization & analysis of the linear interaction between patches

Visualisations of the cross-patch sublayers. In Figure 2, we show in the form of squared images, the rows of the weight matrix from cross-patch sublayers at different depths of a ResMLP-S24 model. The early layers show convolution-like patterns: the weights resemble shifted versions of each other and have local support. Interestingly, in many layers, the support also extends along both axes; see layer 7. The last 7 layers of the network are different: they consist of a spike for the patch itself and a diffuse response across other patches with different magnitude; see layer 20.

Measuring sparsity of the weights. The visualizations described above suggest that the linear communication layers are sparse. We analyze this quantitatively in more detail in Figure 4. We measure the sparsity of the matrix $\mathbf{A}$, and compare it to the sparsity of $\mathbf{B}$ and $\mathbf{C}$ from the per-patch MLP. Since there are no exact zeros, we measure the rate of components whose absolute value is lower than 5% of the maximum value. Note that discarding the small values is analogous to the case where we normalize the matrix by its maximum and use a finite-precision representation of weights. For instance, with a 4-bit representation of weights, one would typically round to zero all weights whose absolute value is below 6.25% of the maximum value.
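The sparsity measure described above is straightforward to express in code. The sketch below is one reading of it, with the hypothetical helper `sparsity` and a small hand-made matrix standing in for a trained weight matrix; the default `threshold=0.05` corresponds to the 5% criterion, and `threshold=1/16` to the 4-bit example (6.25%).

```python
import numpy as np

def sparsity(weights, threshold=0.05):
    """Fraction of entries whose magnitude is below threshold * max magnitude."""
    magnitude = np.abs(weights)
    return float(np.mean(magnitude < threshold * magnitude.max()))

# Toy matrix: a strong diagonal plus a tiny constant everywhere.
w = np.eye(4) + 0.001
rate = sparsity(w)                           # 12 of the 16 entries are "small"
rate_4bit = sparsity(w, threshold=1 / 16)    # same toy matrix, 6.25% criterion
print(rate)  # 0.75
```

Applying the same function to $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ of each layer would give per-layer sparsity curves of the kind the text refers to.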

The measurements in Figure 4 show that all three matrices are sparse, with the layers implementing the patch communication being significantly more so. This suggests that they may be compatible with parameter pruning, or better, with modern quantization techniques that induce sparsity at training time, such as Quant-Noise and DiffQ. The sparsity structure, in particular in earlier layers (see Figure 2), hints that we could implement the patch interaction linear layer with a convolution. We provide some results for convolutional variants in our ablation study. Further research on network compression is beyond the scope of this paper, yet we believe it is worth investigating in the future.

Communication across patches. If we remove the linear interaction layer (linear $\rightarrow$ none), we obtain substantially lower accuracy (-20% top-1 acc.) for a “bag-of-patches” approach. We have tried several alternatives for the cross-patch sublayer, which are presented in Table 3 (block “patch communication”). Among them is the use of the same MLP structure as for patch processing (linear $\rightarrow$ MLP), which we analyze in more detail in the supplementary material. The simpler choice of a single linear square layer led to a better accuracy/performance trade-off (considering that the MLP variant requires compute halfway between ResMLP-S12 and ResMLP-S24) and requires fewer parameters than a residual MLP block.

The visualization in Figure 2 indicates that many linear interaction layers look like convolutions. In our ablation, we replaced the linear layer with different types of $3\times 3$ convolutions. The depth-wise convolution, like our linear patch communication layer, does not implement interaction across channels, and yields similar performance at a comparable number of parameters and FLOPs. While full $3\times 3$ convolutions yield the best results, they come with roughly double the number of parameters and FLOPs. Interestingly, the depth-separable convolutions combine accuracy close to that of full $3\times 3$ convolutions with a number of parameters and FLOPs comparable to our linear layer. This suggests that using convolutions on low-resolution feature maps at all layers is an interesting alternative to the common pyramidal design of convnets, where early layers operate at higher resolution and smaller feature dimension.

4 Ablation studies

Table 3 reports the ablation study of our base network and a summary of our preliminary exploratory studies. We discuss the ablation below and give more detail about early experiments in Appendix A.

Control of overfitting. Since MLPs are subject to overfitting, we show in Fig. 4 a control experiment to probe for problems with generalization. We explicitly analyze the difference in performance between ImageNet-val and the distinct ImageNet-V2 test set. The relative offsets between curves reflect to what extent models are overfitted to ImageNet-val w.r.t. hyper-parameter selection. The degree of overfitting of our MLP-based model is overall neutral or slightly higher than that of other transformer-based architectures or convnets with the same training procedure.

Normalization & activation. Our network configuration does not contain any batch normalizations. Instead, we use the affine per-channel transform Aff. This is akin to Layer Normalization, typically used in transformers, except that we avoid collecting any sort of statistics, since we do not need them for convergence. In preliminary experiments with pre-norm and post-norm, we observed that both choices converged. Pre-normalization in conjunction with Batch Normalization could provide an accuracy gain in some cases, see Appendix A.

We choose to use a GELU function. In Appendix A we also analyze the activation function: ReLU also gives good performance, but it was a bit more unstable in some settings. We did not manage to get good results with SiLU and HardSwish.

Pooling. Replacing average pooling with Class-MLP, see Section 2, brings a significant gain for a negligible computational cost. We do not include it by default to keep our models simpler.

Patch size. Smaller patches significantly increase the performance, but also increase the number of FLOPs (see block “Patch size” in Table 3). Smaller patches benefit larger models more, but only with an improved optimization scheme involving more regularization (distillation) or more data.

Training. Consider the block “Training” in Table 3. ResMLP significantly benefits from modern training procedures such as those used in DeiT. For instance, the DeiT training procedure improves the performance of ResMLP-S12 by 7.4% compared to the training employed for ResNet [Footnote 2: Interestingly, if trained with this “old-fashioned” setting, ResMLP-S12 outperforms AlexNet by a margin.]. This is in line with recent work pointing out the importance of the training strategy over the model choice. Pre-training on more data and distillation also improve the performance of ResMLP, especially for the bigger models, e.g., distillation improves the accuracy of ResMLP-B24/8 by 2.6%.

Other analysis. In our early exploration, we evaluated several alternative design choices. As in transformers, we could use positional embeddings mixed with the input patches. In our experiments we did not see any benefit from using these features, see Appendix A. This observation suggests that our cross-patch sublayer provides sufficient spatial communication, and referencing absolute positions obviates the need for any form of positional encoding.

5 Transfer learning

We evaluate the quality of features obtained from a ResMLP architecture when transferring them to other domains. The goal is to assess whether the features generated from a feedforward network are more prone to overfitting on the training data distribution. We adopt the typical setting where we pre-train a model on ImageNet-1k and fine-tune it on the training set associated with a specific domain. We report the performance with different architectures on various image benchmarks in Table 4, namely CIFAR-10 and CIFAR-100, Flowers-102, Stanford Cars and iNaturalist. We refer the reader to the corresponding references for a more detailed description of the datasets.

We observe that the performance of our ResMLP is competitive with the existing architectures, showing that pretraining feedforward models with enough data and regularization via data augmentation greatly reduces their tendency to overfit on the original distribution. Interestingly, this regularization also prevents them from overfitting on the training set of smaller datasets during the fine-tuning stage.

6 Machine translation

We also evaluate the ResMLP transpose-mechanism as a replacement for the self-attention in the encoder and decoder of a neural machine translation system. We train models on the WMT 2014 English-German and English-French tasks, following the setup from Ott et al. We consider models of dimension 512, with a hidden MLP size of 2,048, and with 6 or 12 layers. Note that the current state of the art employs much larger models: our 6-layer model is more comparable to the base transformer model from Vaswani et al., which serves as a baseline, along with pre-transformer architectures such as recurrent and convolutional neural networks. We use Adagrad with learning rate 0.2, 32k steps of linear warmup, label smoothing 0.1, and a dropout rate of 0.15 for En-De and 0.1 for En-Fr. We initialize the LayerScale parameter to 0.2. We generate translations with the beam search algorithm, with a beam of size 4. As shown in Table 5, the results are at least on par with the compared architectures.

Related work

We review the research on applying Fully Connected Networks (FCNs) to computer vision problems, as well as other architectures that share common modules with our model.

Fully-connected network for images. Many studies have shown that FCNs are competitive with convnets for the tasks of digit recognition, keyword spotting and handwriting recognition. Several works have questioned whether FCNs are also competitive on natural image datasets, such as CIFAR-10. More recently, d’Ascoli et al. have shown that an FCN initialized with the weights of a pretrained convnet achieves performance superior to that of the original convnet. Neyshabur further extends this line of work by achieving competitive performance when training an FCN from scratch, but with a regularizer that constrains the model to be close to a convnet. These studies have been conducted on small-scale datasets with the purpose of studying the impact of architectures on generalization in terms of sample complexity and energy landscape. In our work, we show that, in the larger-scale setting of ImageNet, FCNs can attain surprising accuracy without any constraint or initialization inspired by convnets.

Finally, the application of FCNs in computer vision has also emerged in the study of the properties of networks with infinite width, or for inverse scattering problems. More interestingly, the Tensorizing Network is an approximation of a very large FCN that shares similarity with our model, in that it intends to remove priors by approximating even more general tensor operations, i.e., not arbitrarily marginalized along some pre-defined sharing dimensions. However, this method is designed to compress the MLP layers of standard convnets.

Other architectures with similar components. Our FCN architecture shares several components with other architectures, such as convnets or transformers. A fully connected layer is equivalent to a convolution layer with a $1\times 1$ receptive field, and several works have explored convnet architectures with small receptive fields. For instance, the VGG model uses $3\times 3$ convolutions, and later, other architectures such as ResNeXt or Xception mix $1\times 1$ and $3\times 3$ convolutions. In contrast to convnets, in our model interaction between patches is obtained via a linear layer that is shared across channels, and that relies on absolute rather than relative positions.

More recently, transformers have emerged as a promising architecture for computer vision. In particular, our architecture takes inspiration from the structure used in the Vision Transformer (ViT) and, as a consequence, shares many components. Our model takes a set of non-overlapping patches as input and passes them through a series of MLP layers that share the same structure as ViT, replacing the self-attention layer with a linear patch interaction layer. Both layers have a global field-of-view, unlike convolutional layers. Whereas in self-attention the weights to aggregate information from other patches are data dependent through queries and keys, in ResMLP the weights are not data dependent and are only based on the absolute positions of patches. In our implementation we follow the improvements of DeiT to train vision transformers, and use the skip-connections from ResNets with pre-normalization of the layers.

Finally, our work questions the importance of self-attention in existing architectures. Similar observations have been made in natural language processing. Notably, Synthesizer shows that dot-product self-attention can be replaced by a feedforward network, with competitive performance on sentence representation benchmarks. As opposed to our work, Synthesizer does use data dependent weights, but in contrast to transformers the weights are determined from the queries only.

Conclusion

In this paper we have shown that a simple residual architecture, whose residual blocks consist of a one-hidden-layer feed-forward network and a linear patch interaction layer, achieves unexpectedly high performance on ImageNet classification benchmarks, provided that we adopt a modern training strategy such as those recently introduced for transformer-based architectures. Thanks to its simple structure, with linear layers as the main means of communication between patches, we can visualize the filters learned by this simple MLP. While some of the layers are similar to convolutional filters, we also observe sparse long-range interactions as early as the second layer of the network. We hope that our model free of spatial priors will contribute to further understanding of what networks with fewer priors learn, and potentially guide the design choices of future networks without the pyramidal design prior adopted by most convolutional neural networks.

Acknowledgments

We would like to thank Mark Tygert for relevant references. This work builds upon the Timm library by Ross Wightman.

References

Appendix A Report on our exploration phase

As discussed in the main paper, our work on designing a residual multi-layer perceptron was inspired by the Vision Transformer. For our exploration, we have adopted the recent CaiT variant as a starting point. This transformer-based architecture achieves state-of-the-art performance with ImageNet training only (achieving 86.5% top-1 accuracy on ImageNet-val for the best model). Most importantly, the training is relatively stable with increasing depth.

In our exploration phase, our objective was to radically simplify this model. For this purpose, we have considered the CaiT-S24 model for faster iterations. This network consists of 24 layers with a working dimension of 384. All our experiments below were carried out with images in resolution $224\times 224$ and $N=16\times 16$ patches. Trained with regular supervision, CaiT-S24 attains 82.7% top-1 acc. on ImageNet.

The self-attention can be seen as a weight generator for a linear transformation on the values. Therefore, our first design modification was to get rid of the self-attention by replacing it by a residual feed-forward network, which takes as input the transposed set of patches instead of the patches. In other terms, in this case we alternate residual blocks operating along the channel dimension with some operating along the patch dimension. In that case, the MLP replacing the self-attention consists of the sequence of operations

$(\cdot)^{T}$ $\rightarrow$ linear $N\times 4N$ $\rightarrow$ GELU $\rightarrow$ linear $4N\times N$ $\rightarrow$ $(\cdot)^{T}$

Hence this network is symmetrical in $N$ and $d$. By keeping the other elements identical to CaiT, the accuracy drops to 80.2% (-2.5%) when replacing self-attention layers.
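The sequence of operations above can be sketched as follows. This is an illustration only: the tanh approximation of GELU, the random toy weights, the hypothetical expansion-factor name `e`, and the sizes `d = 384` and `n = 196` (the number of patches for a $224\times 224$ image cut into $16\times 16$-pixel patches) are assumptions of the example, and bias terms are omitted.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact form uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def patch_mlp(x, w1, w2):
    """x: (d, n); w1: (e * n, n); w2: (n, e * n)."""
    h = x.T            # transpose: (n, d), patches become the mixed dimension
    h = gelu(w1 @ h)   # linear n -> e * n, then GELU: (e * n, d)
    h = w2 @ h         # linear e * n -> n: (n, d)
    return h.T         # transpose back: (d, n)

rng = np.random.default_rng(0)
d, n, e = 384, 196, 4
w1 = rng.standard_normal((e * n, n)) * 0.02
w2 = rng.standard_normal((n, e * n)) * 0.02
x = rng.standard_normal((d, n))
out = x + patch_mlp(x, w1, w2)   # residual connection around the block

mlp_params = w1.size + w2.size   # 2 * e * n^2
linear_params = n * n            # a single n x n linear layer, for comparison
```

With these sizes the patch MLP has 307,328 weights, eight times the 38,416 of a single $n\times n$ linear layer, which gives a sense of the capacity at stake when choosing between the two.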

Class-attention $\rightarrow$ class-MLP.

If we further replace the class-attention layer of CaiT by an MLP as described in our paper, then we obtain an attention-free network whose top-1 accuracy on ImageNet-val is 79.2%, which is comparable to a ResNet-50 trained with a modern training strategy. This network has served as our baseline for subsequent ablations. Note that, at this stage, we still include LayerScale, a class embedding (in the class-MLP stage) and positional encodings.

Distillation.

The same model trained with distillation inspired by Touvron et al. achieves 81.5%. The distillation variant we choose corresponds to the “hard-distillation”, whose main advantage is that it does not require any parameter-tuning compared to vanilla cross-entropy. Note that, in all our experiments, this distillation method seems to bring a gain that is complementary and seemingly almost orthogonal to other modifications.

Activation: LayerNorm $\rightarrow$ X.

We have tried different activations on top of the aforementioned MLP-based baseline, and kept GELU for its accuracy and to be consistent with the transformer choice.

Ablation on the size of the communication MLP.

For the MLP that replaced the class-attention, we have explored different sizes of the latent layer, by adjusting the expansion factor $e$ in the sequence: linear $N\times (e\times N)$ $\rightarrow$ GELU $\rightarrow$ linear $(e\times N)\times N$. For this experiment we used average pooling to aggregate the patches before the classification layer.

We observe that a large expansion factor is detrimental in the patch communication, possibly because we should not introduce too much capacity in this residual block. This has motivated the choice of adopting a simple linear layer of size $N\times N$: this subsequently improved performance to 79.5% in a setting comparable to the table above. Additionally, as shown earlier, this choice allows visualizations of the interaction between patches.

Normalization.

On top of our MLP baseline, we have tested different variations for normalization layers. We report the variation in performance below.

For the sake of simplicity, we therefore adopted only the Aff transformation so as to not depend on any batch or channel statistics.

Position encoding.

In our experiments, removing the position encoding does not change the results when using an MLP or a simple linear layer as a means of communication across patch embeddings. This is not surprising considering that the linear layer implicitly encodes each patch identity as one of its dimensions, and that the linear layer additionally includes a bias that makes it possible to differentiate the patch positions before the shared linear layer.

Appendix B Analysis of interaction layers in 12-layer networks

In this section we further analyze the linear interaction layers in 12-layer models.

In Figure B.1 we consider a ResMLP-S12 model trained on the ImageNet-1k dataset, as explained in Section 3.1, and show all the 12 linear patch interaction layers. The linear interaction layers in the supervised 12-layer model are similar to those observed in the 24-layer model in Figure 2.

We also provide the corresponding sparsity measurements for this model in Figure B.2, analogous to the measurements in Figure 4 for the supervised 24-layer model. The sparsity levels in the supervised 12-layer model (left panel) are similar to those observed in the supervised 24-layer model, cf. Figure 4. In the right panel of Figure B.2 we consider the sparsity levels of the distilled 12-layer model, which are overall similar to those observed for the supervised 12-layer and 24-layer models.

Appendix C Model definition in Pytorch

In Algorithm 1 we provide the pseudo-pytorch-code associated with our model.

Appendix D Additional Ablations

DeiT proposes a training strategy which allows for data-efficient vision transformers on ImageNet only. In Table D.1 we ablate each component of the DeiT training to go back to the initial ResNet-50 training. As expected, the training used in the ResNet-50 paper degrades the performance.

Training schedule.

Table D.2 compares the performance of ResMLP-S36 according to the number of training epochs. We observe a saturation of the performance after 800 epochs for ResMLP. This saturation is observed in DeiT from 400 epochs. So ResMLP needs more epochs to be optimal.

Pooling layers.

Table D.3 compares the performance of two pooling layers, average-pooling and class-MLP, at different depths, with and without distillation. We can see that class-MLP performs much better than average pooling while changing the FLOPs and number of parameters only slightly. Nevertheless, the gap between the two approaches seems to decrease with deeper models.