MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

Introduction

Convolutional Neural Networks (ConvNets) have been the dominant architectural design choice for computer vision since AlexNet . ConvNets continue to excel on numerous vision problems by going deeper , wider , adding dense connections , efficient separable convolutions , atrous convolutions , using encoder-decoder frameworks , and even introducing modern micro-design components . Meanwhile, as inspired by the evolution of self-attention models like Transformers in natural language processing , numerous researchers have started to introduce attention mechanisms into vision . The Vision Transformer (ViT) is perhaps the first fully Transformer-based architecture for vision, whereby image patches are simply regarded as sequences of words and a transformer encoder is applied on these visual tokens. When pre-trained on large-scale datasets , ViT can achieve compelling results on image recognition.

However, it has been observed that, without extensive pre-training, ViT underperforms on image recognition. This is because Transformers have strong model capacity but less inductive bias, which leads to overfitting. To properly regularize the model capacity and improve its scalability, numerous subsequent efforts have studied sparse Transformer models tailored for vision tasks, such as local attention. These methods typically re-introduce hierarchical architectures to compensate for the loss of non-locality. The Swin Transformer is one such successful attempt to modify Transformers by applying self-attention on shifted non-overlapping windows. For the first time, this approach outperformed ConvNets on the ImageNet benchmark with a pure vision Transformer. Despite having more flexibility and generalizability than the full attention used in ViT, window-based attention has been observed to have limited model capacity due to the loss of non-locality, and hence scales unfavorably on larger data regimes such as ImageNet-21K and JFT. However, acquiring global interactions via full attention at early or high-resolution stages in a hierarchical network is computationally heavy, as the attention operator has quadratic complexity. How to efficiently incorporate global and local interactions to balance model capacity and generalizability under a computation budget remains challenging.

In this paper, we present a new type of Transformer module, called multi-axis self-attention (Max-SA), that serves as a basic architecture component and performs both local and global spatial interactions in a single block. Compared to full self-attention, Max-SA enjoys greater flexibility and efficiency, i.e., it adapts naturally to different input lengths with linear complexity; in contrast to (shifted) window/local attention, Max-SA allows for stronger model capacity by providing a global receptive field. Moreover, with merely linear complexity, Max-SA can be used as a general stand-alone attention module in any layer of a network, even in earlier, high-resolution stages.

To demonstrate its effectiveness and universality, we further design a simple but effective vision backbone called Multi-axis Vision Transformer (MaxViT) by hierarchically stacking repeated blocks composed of Max-SA and convolutions. While our proposed model belongs to the category of hybrid vision Transformers, MaxViT distinguishes itself from previous approaches in that we strive for simplicity by designing a basic block that unifies convolution, local attention, and global attention, and then simply repeating it. Our experiments show that MaxViT significantly improves upon state-of-the-art (SOTA) performance under all data regimes for a broad range of visual tasks including classification, object detection and segmentation, image aesthetics assessment, and image generation. Specifically, as Figure 1 shows, MaxViT outperforms all recent Transformer-based models with regard to both accuracy vs. FLOPs and accuracy vs. parameter curves. Our contributions are:

A generic strong Transformer backbone, MaxViT, that can capture both local and global spatial interactions throughout every stage of the network.

A novel stand-alone multi-axis attention module composed of blocked local and dilated global attention, enjoying global perception in linear complexity.

We examine a large number of design choices, including the number of layers, layouts, and the use of MBConv, through extensive ablation studies that eventually converge on our final modular design, the MaxViT-Block.

Our extensive experiments show that MaxViT achieves SOTA results under various data regimes for a broad range of tasks including image classification, object detection, image aesthetic assessment, and image generation.

Related work

Convolutional networks. Since AlexNet , convolutional neural networks (ConvNets) have been used as de facto solutions to almost all vision tasks before the “Roaring 20s” . Phenomenal architectural improvements have been made in the past decade: residual and dense connections , fully-convolutional networks , encoder-decoder schemes , feature pyramids , increased depths and widths , spatial- and channel-wise attention models , non-local interactions , to name a few. A remarkable recent work ConvNeXt has re-introduced core designs of vision Transformers and shown that a ‘modernized’ pure ConvNet can achieve performance comparable to Transformers on broad vision tasks.

Transformers in vision. Transformers were originally proposed for natural language processing . The debut of the Vision Transformer (ViT) in 2020 showed that pure Transformer-based architectures are also effective solutions for vision problems. The elegantly novel view of ViT that treats image patches as visual words has stimulated explosive research interest in visual Transformers. To account for locality and 2D nature of images, the Swin Transformer aggregates attention in shifted windows in a hierarchical architecture . More recent works have been focused on improving model and data efficiency, including sparse attention , improved locality , pyramidal designs , improved training strategies , etc. We refer readers to dedicated surveys of vision Transformers for a comprehensive review.

Hybrid models. Pure Transformer-based vision models have been observed to generalize poorly due to relatively less inductive bias . Vision Transformers also exhibit substandard optimizability . An intriguingly simple improvement is to adopt a hybrid design of Transformer and convolution layers such as using a few convolutions to replace the coarse patchify stem . A broad range of works fall into this category, either explicitly hybridized or in an implicit fashion .

Transformer for GANs. Transformers have also proven effective in generative adversarial networks (GANs) . TransGAN built a pure Transformer GAN with a careful design of local attention and upsampling layers, demonstrating effectiveness on small scale datasets . GANformer explored efficient global attention mechanisms to improve on StyleGAN generator. HiT presents an efficient Transformer generator based on local-global attention that can scale up to 1K high-resolution image generation.

Method

Inspired by prior sparse attention approaches, we introduce a new type of attention module, dubbed blocked multi-axis self-attention (Max-SA), by decomposing the fully dense attention mechanism into two sparse forms – window attention and grid attention – which reduces the quadratic complexity of vanilla attention to linear, without any loss of non-locality. Our sequential design offers greater simplicity and flexibility, while performing even better than previous methods – each individual module can be used either standalone or combined in any order (Table 10), whereas parallel designs offer no such benefits. Because of the flexibility and scalability of Max-SA, we are able to build a novel vision backbone, which we call MaxViT, by simply stacking alternating layers of Max-SA and MBConv in a hierarchical architecture, as shown in Figure 2. MaxViT benefits from global and local receptive fields throughout the entire network, from shallow to deep stages, demonstrating superior performance with regard to both model capacity and generalization ability.

Self-attention allows for mixing across all spatial (or sequence) locations while also benefiting from content-dependent weights based on normalized pairwise similarity. The standard self-attention is location-unaware, i.e., it lacks translation equivariance, an important inductive bias built into ConvNets. Relative self-attention has been proposed to improve on vanilla attention by adding a learned relative bias to the attention weights, which has been shown to consistently outperform the original attention on many vision tasks. In this work, we mainly adopt pre-normalized relative self-attention as the key operator in MaxViT.

3.2 Multi-axis Attention

Despite bypassing the notoriously heavy computation of full self-attention, local-attention models have been observed to underfit on huge-scale datasets. Inspired by block attention, we present a surprisingly simple but effective way to gain sparse global attention, which we call grid attention. Instead of partitioning feature maps using a fixed window size, we grid the tensor into the shape $(G\times G,\frac{H}{G}\times\frac{W}{G},C)$ using a fixed $G\times G$ uniform grid, resulting in windows having adaptive size $\frac{H}{G}\times\frac{W}{G}$. Employing self-attention on the decomposed grid axis, i.e., $G\times G$, corresponds to dilated, global spatial mixing of tokens. By using the same fixed window size $P$ and grid size $G$ (we use $P=G=7$ following Swin), we can fully balance the computation between local and global operations, both having only linear complexity with respect to spatial size or sequence length. Note that our proposed Max-SA module can be a drop-in replacement of the Swin attention module with exactly the same number of parameters and FLOPs. Yet it enjoys global interaction capability without requiring masking, padding, or cyclic-shifting, making it more implementation-friendly and preferable to the shifted window scheme. For instance, the multi-axis attention can be easily implemented with einops without modifying the original attention operation (see Appendix). It is worth mentioning that our proposed multi-axis attention (Max-SA) is fundamentally different from the axial-attention models. Please see Appendix for a detailed comparison.
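The two partitions can be illustrated with a minimal NumPy sketch. It assumes a single channel-last feature map of shape $(H, W, C)$ with $H$ and $W$ divisible by the window and grid sizes; the function names `block_partition` and `grid_partition` are illustrative and not taken from any released implementation. Both functions return tokens grouped along the second-to-last axis, so the same attention operator could be applied to either output.

```python
import numpy as np

def block_partition(x, p):
    """(H, W, C) -> (H/p * W/p, p*p, C): non-overlapping p x p local windows."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p, c)

def grid_partition(x, g):
    """(H, W, C) -> (H/g * W/g, g*g, C): each group is a strided g x g grid."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)           # (H/g, W/g, g, g, C)
    return x.reshape(-1, g * g, c)

# Toy 4x4 map whose single channel stores the flattened pixel index.
tokens = np.arange(16).reshape(4, 4, 1)
blocks = block_partition(tokens, p=2)
grids = grid_partition(tokens, g=2)

# First local window: a contiguous 2x2 patch.
print(blocks[0, :, 0])   # pixels 0, 1, 4, 5
# First grid group: pixels sampled with stride H/G = 2 across the whole map.
print(grids[0, :, 0])    # pixels 0, 2, 8, 10
```

In this toy case the block groups are contiguous patches, whereas each grid group spans the full map with stride $H/G$, which is the dilated, global mixing referred to above.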

MaxViT block. We sequentially stack the two types of attention to gain both local and global interactions in a single block, as shown in Figure 3. Note that we also adopt typical designs in Transformers, including LayerNorm, feedforward networks (FFNs), and skip-connections. We also add an MBConv block with a squeeze-and-excitation (SE) module prior to the multi-axis attention, as we have observed that using MBConv together with attention further increases the generalization as well as the trainability of the network. Using MBConv layers prior to attention offers another advantage, in that depthwise convolutions can be regarded as conditional position encoding (CPE), making our model free of explicit positional encoding layers. Note that the two components of our proposed stand-alone multi-axis attention may be used together or in isolation for different purposes – block attention for local interaction, and grid attention for global mixing. These elements can be easily plugged into many vision architectures, especially for high-resolution tasks that can benefit from global interactions with affordable computation.

3.3 Architecture Variants

We designed a series of extremely simple architectural variants to explore the effectiveness of our proposed MaxViT block, as shown in Figure 2. We use a hierarchical backbone similar to common ConvNet practices, where the input is first downsampled using Conv3x3 layers in the stem stage (S0). The body of the network contains four stages (S1-S4), with each stage having half the resolution of the previous one and a doubled number of channels (hidden dimension). In our network, we employ identical MaxViT blocks throughout the entire backbone. We apply downsampling in the Depthwise Conv3x3 layer of the first MBConv block in each stage. The expansion and shrink rates for the inverted bottleneck and squeeze-excitation (SE) are 4 and 0.25 by default. We set the attention head size to 32 for all attention blocks. We scale up the model by increasing the number of blocks per stage $B$ and the channel dimension $C$. We summarize the architectural configurations of the MaxViT variants in Table 1.
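The halving and doubling rule can be traced with a short Python sketch. The input size of 224, the S1 width of 64 channels, and the overall stem stride of 2 are assumptions made for illustration (the stride is chosen to agree with the $1/8$, $1/16$, $1/32$ resolutions the Appendix reports for S2-S4); the actual per-variant widths are those listed in Table 1. The helper name `stage_shapes` is hypothetical.

```python
def stage_shapes(image_size=224, base_channels=64, num_stages=4, head_dim=32):
    """Per-stage resolution, width, and head count for a 4-stage backbone.

    Assumes the stem (S0) downsamples by 2 and every later stage halves the
    resolution again while doubling the channel count.
    """
    shapes = []
    resolution = image_size // 2      # after the stem S0 (assumed stride 2)
    channels = base_channels
    for stage in range(1, num_stages + 1):
        resolution //= 2              # stride-2 depthwise conv in the first MBConv
        shapes.append({
            "stage": f"S{stage}",
            "resolution": resolution,
            "channels": channels,
            "heads": channels // head_dim,   # fixed head size of 32
        })
        channels *= 2
    return shapes

for row in stage_shapes():
    print(row)
```

With these assumed defaults the stages run at resolutions 56, 28, 14, and 7; a final $7\times 7$ map matches the window and grid size $P=G=7$ used in the attention layers.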

Experiments

We validated the efficacy of our proposed model on various vision tasks: ImageNet classification , image object detection and instance segmentation , image aesthetics/quality assessment , and unconditional image generation . More experimental details can be found in the Appendix.

ImageNet-1K. We show in Table 2 the performance comparisons on ImageNet-1K classification. Under the basic $224\times 224$ setting, MaxViT outperforms the most recent strong hybrid model CoAtNet by a large margin across the entire FLOPs spectrum, as shown in Figure 1(a). The MaxViT-L model sets a new performance record of 85.17% at $224\times 224$ training without extra training strategies, outperforming CoAtNet-3 by 0.67%. With regard to throughput-accuracy trade-offs at $224^{2}$, MaxViT-S obtains 84.45% top-1 accuracy, 0.25% higher than CSWin-B and 0.35% higher than CoAtNet-2 with comparable throughput.

When fine-tuned at higher resolutions (384/512), MaxViT continues to deliver high performance compared to strong ConvNet and Transformer competitors: (1) at $384^{2}$ , MaxViT-B attains 86.34% top-1 accuracy, outperforming EfficientNetV2-L by 0.64%; (2) when fine-tuned at $512^{2}$ , our MaxViT-L (212M) achieves top-1 accuracy 86.7% , setting new SOTA performance on ImageNet-1K under the normal training setting. As Figure 1 shows, MaxViT scales much better than SOTA vision Transformers on the ImageNet-1K trained model scale.

ImageNet-21K. Table 3 shows the results of models pre-trained on ImageNet-21K. Remarkably, the MaxViT-B model achieves 88.38% accuracy, outperforming the previous best model CoAtNet-4 by 0.28% using only 43% of parameter count and 38% of FLOPs, demonstrating greater parameter and computing efficiency. Figure 4(a) visualizes the model size comparison – MaxViT scales significantly better than previous attention-based models of similar complexities, across the board. Additionally, the MaxViT-XL model achieves new SOTA performance, an accuracy of 88.70% when fine-tuned at resolution $512\times 512$ .

JFT-300M. We also trained our model on a larger-scale proprietary dataset JFT-300M which contains $\sim$ 300 million weakly labeled images. As shown in Table 3 and Figure 4(b), our model is also scalable to massive scale training data – MaxViT-XL achieves a high accuracy of 89.53% with 475 million parameters, outperforming previous models under comparable model sizes. Due to resource limitations, we leave experiments on billion-parameter-scale models on planet-scale datasets (e.g., JFT-3B ) as future work.

4.2 Object Detection and Instance Segmentation

Setting. We evaluated the MaxViT architectures on the COCO2017 object bounding box detection and instance segmentation tasks with a two-stage framework . On the object detection task, a feature-pyramid architecture was employed to boost different levels of objectiveness. In the instance segmentation task, a well-known Cascade Mask-RCNN framework was employed. The dataset contains 118K training and 5K validation samples. For all the compared models, the backbones are first pretrained using ImageNet-1K. The pretrained models are then used to finetune on the detection and segmentation tasks.

Results on COCO. As shown in Table 4, $AP$ , $AP_{50}$ , and $AP_{75}$ are reported for comparison. The parameters and FLOPs are also reported as a reference for model complexity. The MaxViT backbone models, used in object detection and segmentation tasks, outperform all other backbones by large margins, including Swin, ConvNeXt, and UViT at various model sizes with respect to both accuracy and efficiency. Note that MaxViT-S outperforms other base-level models (e.g., Swin-B, UViT-B), with about 40% less computational cost.

4.3 Image Aesthetic Assessment

Setting. We train and evaluate the MaxViT model on the AVA benchmark, which contains 255K images with aesthetics scores rated by amateur photographers. Following prior work, we split the dataset into 80%/20% training and test sets and use the normalized Earth Mover’s Distance as our training loss. We trained MaxViT at three different input resolutions: $224^{2}$, $384^{2}$ and $512^{2}$, initialized with ImageNet-1K pre-trained weights.

Results on AVA. To evaluate and compare our model against existing methods, we present a summary of our results in Table 6. For similar input resolutions, the proposed MaxViT-T model outperforms existing image aesthetic assessment methods. As the input resolution increases, the performance improves, benefiting from its strong non-local capacity. Also, MaxViT shows better linear correlation compared to the SOTA method which uses multi-resolution inputs.

4.4 Image Generation

Setting. We evaluate the ability of MaxViT blocks to generate images of 128x128 resolution on ImageNet-1K. We choose unconditional image generation to focus on the performance of different generators in GANs. We use the Inception Score (IS) and the Fréchet Inception Distance (FID) as quantitative evaluation metrics. 50,000 samples were randomly generated to calculate the FID and IS scores. We compared MaxViT against HiT, a SOTA generative Transformer model, which uses attention at low resolutions (e.g., 32, 64) and implicit neural functions at high resolutions (e.g., 128). By contrast, MaxViT uses the proposed MaxViT block at every resolution. Note that we use an inverse block order (GA-BA-Conv) as we found it to perform better (see Table 10). Since Batch Normalization achieves better results on image generation, we replaced all Layer Norm with Batch Norm under this setting.

Results on ImageNet-1K. The results are shown in Table 6. Our MaxViT achieved better FID and IS with a significantly lower number of parameters. These results demonstrate the effectiveness of MaxViT blocks for generation tasks. More details of the generative experiment can be found in the Appendix.

4.5 Ablation Studies

In this section, we ablate important design choices in MaxViT on ImageNet-1K image classification. We use the MaxViT-T model trained for 300 epochs by default and report top-1 accuracy on ImageNet-1K. Except for the ablated design choice, we used the same training configurations, unless stated otherwise.

Global grid-attention. One of our main contributions is the grid-attention module, which allows for sparse global interactions in linear time, enabling our model to capture global information at all stages. We conducted two ablations to understand its gain: 1) completely removing global attention at each stage; 2) replacing grid attention with block attention to retain the same parameter count and FLOPs. As Table 10 shows, enabling global attention at earlier stages can further boost performance over using only local attention or convolutions.

MBConv layer. We also ablated the usage of MBConv layers in MaxViT by removing all MBConv layers in each stage. Note that removing the MBConv layers also reduces the parameter count and FLOPs, which should be taken into account. In addition, Stage 3 has 5 blocks whereas other stages have only 2. As Table 10 shows, the usage of MBConv layers in MaxViT significantly boosts performance.

Block order study. We present three different modules to build the MaxViT block – MBConv, block-, and grid-attention – which capture spatial interactions from local to global. To investigate the most effective way to combine them, we evaluated the MaxViT-T model using all 6 permutations. We always apply downsampling in the first layer, which might cause a minor difference in model size. We can observe from Table 10 that placing MBConv before attention layers is almost always better than other combinations. The reason might be that it is more suitable to extract local features/patterns in early layers and then aggregate them globally, which is aligned with existing hybrid models that put Conv layers in front of attention. In generative experiments (Section 4.4), however, we found the best order to be from global to local: GA-BA-C. We hypothesize that it may be advantageous for generation tasks to first get the overall structure correct with global processing blocks (i.e., grid-attention layers), then fill in finer details using local processing blocks (i.e., MBConv).

Sequential vs. parallel. In our approach, we sequentially stack the multi-axis attention modules, while other models adopt a parallel design. In this ablation, we compare our sequential Max-SA against parallel branches containing block- and grid-attention respectively. Note that we use an input projection to double the channels, then split the heads to feed the two branches to keep the complexity similar to that of MaxViT, and an output projection that reduces the concatenated branches. We did rough parameter tuning and found that an initial learning rate of $10^{-3}$ performs significantly better than $3\times 10^{-3}$ for parallel models. All other hyperparameters are kept the same. As Table 10 shows, our sequential approach remarkably outperforms its parallel counterparts with fewer parameters and less computation. The reason may be that the parallel designs learn complementary cues with fewer interactions between them, whereas our sequential stack is able to learn more powerful fusions between local and global layers.

Vertical layout. We further examine our vertical layout design, i.e., the number of blocks per stage. We compared our design against the choice of Swin/ConvNeXt. We change MaxViT-T and -S to have blocks $B=(2,2,6,2)$, and MaxViT-B and -L to have blocks $B=(2,2,18,2)$, strictly following the stage ratio of Swin. It may be seen from Figure 5 that our layout performs comparably to Swin for small models, but scales significantly better for larger models.

Discussion and Conclusion

While recent works in the 2020s have arguably shown that ConvNets and vision Transformers can achieve similar performance on image recognition, our work presents a unified design that takes advantage of the best of both worlds – efficient convolution and sparse attention – and demonstrates that a model built on top, namely MaxViT, can achieve state-of-the-art performance on a variety of vision tasks, and more importantly, scale extremely well to massive data sizes. Even though we present our model in the context of vision tasks, the proposed multi-axis approach can easily extend to language modeling to capture both local and global dependencies in linear time. We also look forward to studying other forms of sparse attention in higher-dimensional or multi-modal signals such as videos, point clouds, and vision-language data.

Societal impact. Investigating the performance and scalability of large model designs would consume considerable computing resources. These efforts can contribute to increased carbon emissions, which could hence raise environmental concerns. However, the proposed model offers strong modular candidates that expand the network’s design space for future efforts on automated architectural design. If trained improperly, the proposed model may express bias and fairness issues. The proposed generative model can be abused to generate misleading media and fake news. These issues demand caution in future related research.

Acknowledgment. We thank Xianzhi Du and Wuyang Chen for extensive help on experiments. We also thank Hanxiao Liu, Zihang Dai, Anurag Arnab, Huiwen Chang, Junjie Ke, Mauricio Delbracio, Sungjoon Choi, and Irene Zhu for valuable discussions and help.

Appendix

In this Appendix we provide the following material:

Sec. A describes the detailed architectures of MaxViT for image classification (Sec. A.1), object detection and segmentation (Sec. A.2), image aesthetics assessment (Sec. A.3), and image generation (Sec. A.4).

Sec. B presents complete training settings and hyperparameters for image classification (Sec. B.1), object detection and segmentation (Sec. B.2), image aesthetics assessment (Sec. B.3), and image generation (Sec. B.4).

Sec. C presents comprehensive experimental results, including image classification on ImageNet-1K (Table 13), ImageNet-21K and JFT (Table 14), as well as more image generation visualizations on ImageNet-1K (Figure 8).

Appendix A Model Details

MaxViT leverages the MBConv block as the main convolution operator. We also adopt a pre-activation structure to promote homogeneity between MBConv and Transformer blocks. Specifically, assuming $\mathbf{x}$ to be the input feature, the MBConv block without downsampling is formulated as:

$$\mathbf{x}\leftarrow\mathbf{x}+\mathsf{Proj}(\mathsf{SE}(\mathsf{DWConv}(\mathsf{Conv}(\mathsf{Norm}(\mathbf{x}))))),$$

where $\mathsf{Norm}$ is $\mathsf{BatchNorm}$, and $\mathsf{Conv}$ is the expansion Conv1x1 followed by $\mathsf{BatchNorm}$ and $\mathsf{GELU}$ activation, a typical choice for Transformer-based models. $\mathsf{DWConv}$ is the Depthwise Conv3x3 followed by $\mathsf{BatchNorm}$ and $\mathsf{GELU}$. $\mathsf{SE}$ is the Squeeze-Excitation layer, while $\mathsf{Proj}$ is the shrink Conv1x1 that down-projects the number of channels. Note that for the first MBConv block in every stage, the downsampling is done by applying a stride-2 Depthwise Conv3x3, denoted $\mathsf{DWConv}\!\downarrow$, while the shortcut branch should also apply pooling ($\mathsf{Pool2D}$) and channel projection:

$$\mathbf{x}\leftarrow\mathsf{Proj}(\mathsf{Pool2D}(\mathbf{x}))+\mathsf{Proj}(\mathsf{SE}(\mathsf{DWConv}\!\downarrow(\mathsf{Conv}(\mathsf{Norm}(\mathbf{x}))))).$$

A.1.2 Relative Attention

Relative attention has been explored in several previous studies for both NLP and vision . Here to simplify the presentation, we present our model using only a single head of the multi-head self-attention. In the actual implementation, we always use multi-head attention with the same head dimension. The relative attention can be defined as:
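A commonly used single-head form, consistent with the description in the Method section of a learned bias added to the attention weights, is $\mathsf{softmax}(QK^{T}/\sqrt{d}+B)V$, where $B$ is the learned relative bias. The NumPy sketch below illustrates that form under stated assumptions: the bias is looked up from a $(2P-1)^{2}$ table indexed by the relative offset between two tokens in a $P\times P$ window (a Swin-style parameterization, assumed here rather than confirmed by this text), and the helper names `relative_position_index` and `relative_attention` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_position_index(p):
    """Index into a (2p-1)^2 bias table for every (query, key) pair in a p x p window."""
    coords = np.stack(np.meshgrid(np.arange(p), np.arange(p), indexing="ij"))
    coords = coords.reshape(2, -1)                              # (2, p*p)
    rel = coords[:, :, None] - coords[:, None, :] + (p - 1)     # offsets shifted to [0, 2p-2]
    return rel[0] * (2 * p - 1) + rel[1]                        # (p*p, p*p)

def relative_attention(q, k, v, bias):
    """softmax(q k^T / sqrt(d) + bias) v over the second-to-last axis."""
    d = q.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d) + bias
    return softmax(logits) @ v

# Example: 3 windows of 2x2 tokens with 8 channels each.
rng = np.random.default_rng(0)
p, dim = 2, 8
q, k, v = (rng.normal(size=(3, p * p, dim)) for _ in range(3))
bias_table = rng.normal(size=(2 * p - 1) ** 2)       # would be learned in practice
bias = bias_table[relative_position_index(p)]        # (p*p, p*p), shared by all windows
out = relative_attention(q, k, v, bias)
print(out.shape)
```

Because the bias depends only on the relative offset between two tokens, pairs with the same offset share one table entry, which is what makes the operator location-aware while remaining translation-consistent within a window.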

A.1.3 Multi-Axis Attention

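Given an input feature map $\mathbf{x}\in\mathbb{R}^{H\times W\times C}$, we define the $\mathsf{Block}(\cdot)$ operation with parameter $P$ as partitioning the input into non-overlapping blocks, each of size $P\times P$:

$$\mathsf{Block}:(H,W,C)\rightarrow\left(\tfrac{H}{P}\times P,\tfrac{W}{P}\times P,C\right)\rightarrow\left(\tfrac{H}{P}\times\tfrac{W}{P},P\times P,C\right).$$

We denote the $\mathsf{Unblock}(\cdot)$ operation as the reverse of the above block partition procedure. Similarly, we define the $\mathsf{Grid}(\cdot)$ operation with parameter $G$ as dividing the input feature into a uniform $G\times G$ grid, with each lattice having adaptive size $\frac{H}{G}\times\frac{W}{G}$. Unlike the $\mathsf{Block}$ operator, we need to apply an extra $\mathsf{Transpose}$ to place the grid dimension in the assumed spatial axis (i.e., the -2 axis):

$$\mathsf{Grid}:(H,W,C)\rightarrow\left(G\times\tfrac{H}{G},G\times\tfrac{W}{G},C\right)\rightarrow\left(G\times G,\tfrac{H}{G}\times\tfrac{W}{G},C\right)\xrightarrow{\mathsf{Transpose}}\left(\tfrac{H}{G}\times\tfrac{W}{G},G\times G,C\right),$$

with its inverse operation $\mathsf{Ungrid}(\cdot)$ that reverses the gridded input back to the normal 2D feature space.

The local Block Attention module is then formulated as:

$$\mathbf{x}\leftarrow\mathbf{x}+\mathsf{Unblock}(\mathsf{RelAttention}(\mathsf{Block}(\mathsf{LN}(\mathbf{x})))),\qquad\mathbf{x}\leftarrow\mathbf{x}+\mathsf{MLP}(\mathsf{LN}(\mathbf{x})),$$

while the global, dilated Grid Attention module is formulated as:

$$\mathbf{x}\leftarrow\mathbf{x}+\mathsf{Ungrid}(\mathsf{RelAttention}(\mathsf{Grid}(\mathsf{LN}(\mathbf{x})))),\qquad\mathbf{x}\leftarrow\mathbf{x}+\mathsf{MLP}(\mathsf{LN}(\mathbf{x})),$$

where we omit the $QKV$ input format in the $\mathsf{RelAttention}$ operation for simplicity. $\mathsf{LN}$ denotes Layer Normalization, and $\mathsf{MLP}$ is a standard MLP network consisting of two linear layers: $\mathbf{x}\leftarrow W_{2}\mathsf{GELU}(W_{1}\mathbf{x})$.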

A.1.4 Comparison to Axial attention

It should be noted that our proposed multi-axis attention (Max-SA) module is completely different from axial attention. As shown in Figure 6(a), axial attention first applies column-wise attention and then row-wise attention, which achieves a global receptive field with $\mathcal{O}(N\sqrt{N})$ complexity (assuming $N$ equals the number of pixels). In contrast, our proposed Max-SA, shown in Figure 6(b), first employs local attention and then sparse global attention, enjoying a global receptive field with only $\mathcal{O}(N)$ linear complexity. Moreover, we deem the proposed Max-SA a more natural approach for vision, since the design of the attended regions accounts for the 2D structure of images, e.g., mixing tokens in a spatially local small window.
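To make the comparison concrete, the following back-of-the-envelope count tallies query-key pairs for a square feature map with $N=H\times W$ tokens. The example sizes ($56\times 56$ and $112\times 112$) are illustrative assumptions, constant factors and channel dimensions are ignored, and the window and grid sizes are taken as $P=G=7$ as in the main text.

```latex
\begin{align*}
\text{Full attention:}\quad  & N\cdot N = N^{2} \\
\text{Axial attention:}\quad & N\cdot(\sqrt{N}+\sqrt{N}) = 2N\sqrt{N} \\
\text{Max-SA:}\quad          & N\cdot(P^{2}+G^{2}) = 98\,N \qquad (P=G=7) \\[4pt]
N=56^{2}=3136:\quad   & N^{2}=9{,}834{,}496,\quad 2N\sqrt{N}=351{,}232,\quad 98N=307{,}328 \\
N=112^{2}=12544:\quad & N^{2}=157{,}351{,}936,\quad 2N\sqrt{N}=2{,}809{,}856,\quad 98N=1{,}229{,}312
\end{align*}
```

Under these assumptions, quadrupling the number of tokens multiplies the Max-SA count by 4, the axial count by 8, and the full-attention count by 16, which is why the gap widens at higher resolutions.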

A.1.5 MaxViT Block

We demonstrate in Algo. 1 an einops-style pseudocode of the MaxViT block which contains MBConv, block attention, and grid attention.
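As a rough stand-in for such pseudocode, the NumPy sketch below chains block attention and grid attention with pre-normalization and residual connections. It is a simplified illustration under several assumptions: the MBConv layer, the MLP sub-layers, the learned query/key/value projections, the relative bias, and multiple heads are all omitted, and the helper names are hypothetical. Its only purpose is to show how the two partitions compose into a global receptive field.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(t):
    """Single-head attention over axis -2 with Q = K = V = t (no learned weights)."""
    logits = t @ np.swapaxes(t, -1, -2) / np.sqrt(t.shape[-1])
    logits -= logits.max(-1, keepdims=True)
    w = np.exp(logits)
    return (w / w.sum(-1, keepdims=True)) @ t

def block(x, p):
    h, w, c = x.shape
    return x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def unblock(t, p, h, w):
    c = t.shape[-1]
    return t.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4).reshape(h, w, c)

def grid(x, g):
    h, w, c = x.shape
    return x.reshape(g, h // g, g, w // g, c).transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)

def ungrid(t, g, h, w):
    c = t.shape[-1]
    return t.reshape(h // g, w // g, g, g, c).transpose(2, 0, 3, 1, 4).reshape(h, w, c)

def max_sa(x, p=2, g=2):
    """Block attention followed by grid attention, each with a residual connection."""
    h, w, _ = x.shape
    x = x + unblock(self_attention(block(layer_norm(x), p)), p, h, w)   # local mixing
    x = x + ungrid(self_attention(grid(layer_norm(x), g)), g, h, w)     # dilated global mixing
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
x_perturbed = x.copy()
x_perturbed[3, 3, 0] += 1.0          # change one value in the far corner

# The output at the opposite corner reacts to the change after one Max-SA pass.
delta = np.abs(max_sa(x)[0, 0] - max_sa(x_perturbed)[0, 0]).max()
print(delta > 0)
```

In this toy setting a single pass already lets the top-left token respond to a change at the bottom-right token, whereas block attention alone would leave it untouched because the two tokens fall in different windows.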

A.1.6 Classification Head

Instead of using the [cls] token , we simply apply global average pooling to the output of the last stage (S4) to obtain the feature representation, followed by the final classification head.

A.1.7 Architectural Specifications

Finally, we present detailed architectural specifications for the MaxViT model family (T/S/B/L) in Table 11.

A.2 Detection and Segmentation Models

We follow the settings of the cascaded Faster-RCNN and Mask-RCNN , but replace the feature extraction backbone with our MaxViT backbone. We also applied FPN in the feature map generation, where the S2, S3, S4 (multi-scale features of targeted resolution $1/8$ , $1/16$ , $1/32$ in MaxViT, respectively) are used. Then the generated feature maps are fed into the detection head. For fair comparison, we follow the original implementation without adopting any system-level strategies to further boost the final performance, such as the HTC framework , instaboost , etc. used in Swin . We show the results of MaxViT-T/S/B on these two tasks to compare it against recent strong models at similar model complexity.

A.3 Image Aesthetics Model

This task requires incorporating both local and global information of an image to accurately predict human perceptual preference. To this end, the model needs to have the capacity to learn pixel-level quality aspects such as sharpness, noisiness and contrast as well as semantic-level aspects such as composition and depth-of-field. We use the normalized Earth Mover’s Distance as our training loss. Given the ground truth and predicted probability mass functions $\textbf{p}$ and $\widehat{\textbf{p}}$ representing the histogram of scores, the normalized Earth Mover’s Distance can be expressed as:

$$\mathrm{EMD}(\textbf{p},\widehat{\textbf{p}})=\left(\frac{1}{N}\sum_{k=1}^{N}\left|\mbox{CDF}_{\textbf{p}}(k)-\mbox{CDF}_{\widehat{\textbf{p}}}(k)\right|^{r}\right)^{1/r},$$

where $\mbox{CDF}_{\textbf{p}}(k)$ is the cumulative distribution function $\sum_{i=1}^{k}\textbf{p}_{i}$, and $N=10$ represents the number of score bins. In our experiments we set $r=2$. We remove the classification head used in MaxViT, and instead append a fully-connected layer with 10 neurons followed by $\mathsf{softmax}$.
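The loss can be sketched in a few lines of NumPy. The function below assumes the usual normalized form, i.e., the $r$-th root of the mean $r$-th power of the difference between the two cumulative distributions over the $N$ score bins; the name `normalized_emd` is illustrative.

```python
import numpy as np

def normalized_emd(p, p_hat, r=2):
    """Normalized Earth Mover's Distance between two score histograms.

    p, p_hat: arrays whose last axis holds N probabilities summing to 1.
    """
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    cdf_diff = np.cumsum(p, axis=-1) - np.cumsum(p_hat, axis=-1)
    return np.mean(np.abs(cdf_diff) ** r, axis=-1) ** (1.0 / r)

# All probability mass on a single score bin (N = 10 bins, scores 1..10).
score_1 = np.eye(10)[0]
score_2 = np.eye(10)[1]
score_10 = np.eye(10)[9]

print(normalized_emd(score_1, score_1))    # identical histograms
print(normalized_emd(score_1, score_2))    # mass moved by one bin
print(normalized_emd(score_1, score_10))   # mass moved across all bins
```

Unlike a per-bin cross-entropy, this loss grows with how far the probability mass is moved along the ordered score axis, so predicting a 2 for an image rated 1 is penalized less than predicting a 10.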

A.4 GAN Model

The above image recognition tasks validate the power of our proposed MaxViT block in downsampling (contracting) models. For this GAN experiment, we would like to demonstrate its effectiveness in upsampling (expanding) architectures. The MaxViT-GAN model for image generation is illustrated in Figure 7. For unconditional image generation, MaxViT-GAN first takes a latent code $z\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ as input, then progressively generates an image of the target resolution through a hierarchical upsampling structure. We start by linearly projecting the input to a feature with spatial dimension $8\times 8$. During generation, the feature goes through five stages consisting of identical GAN blocks with gradually increasing spatial resolution, similar to the design of our main model. Following prior work, we apply a cross-attention layer before the MaxViT block as a memory-efficient form of self-modulation in every stage, which has been shown to stabilize GAN training and also improve mode coverage. We use pixel shuffle for upsampling at the end of each stage.
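Pixel shuffle trades channels for spatial resolution. The NumPy sketch below shows one possible channel-last layout, which is an assumption for illustration; frameworks differ in how they order the channel groups, and the function name `pixel_shuffle` here refers to this sketch only.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange (H, W, C) -> (H*r, W*r, C / r^2) by moving channel groups into space."""
    h, w, c = x.shape
    if c % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c // (r * r)
    x = x.reshape(h, w, r, r, c_out)       # split channels into an r x r sub-pixel grid
    x = x.transpose(0, 2, 1, 3, 4)         # (H, r, W, r, C_out)
    return x.reshape(h * r, w * r, c_out)

feature = np.arange(8 * 8 * 16, dtype=float).reshape(8, 8, 16)
upsampled = pixel_shuffle(feature, r=2)
print(feature.shape, "->", upsampled.shape)   # (8, 8, 16) -> (16, 16, 4)
```

No values are created or discarded: each input position contributes its $r^{2}$ channel groups to an $r\times r$ patch of the output, so the operation is a pure rearrangement with no learned parameters.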

Appendix B Experimental Settings

We provide ImageNet-1K experimental settings of MaxViT models for both pre-training and fine-tuning in Table 12. All the MaxViT variants used similar hyperparameters except that we mainly customize the stochastic depth rate to regularize each model separately.

B.2 COCO Detection and Segmentation

We evaluated MaxViT on the COCO2017 object bounding box detection and instance segmentation tasks. The dataset contains 118K training and 5K validation samples. All the MaxViT backbones used are pretrained on ImageNet-1K at resolution $224\times 224$. These pretrained checkpoints are then used as the warm-up weights for fine-tuning on the detection and segmentation tasks. For both tasks, the input images are resized to $896\times 896$. The training is conducted with a batch size of 256, using the AdamW optimizer with learning rates of 1e-3, 3e-3, 3e-3 and stochastic depth rates of $0.8,0.3,0.3$ for MaxViT-T/S/B, respectively.

B.3 Image Aesthetics Assessment

We trained and evaluated the MaxViT model on the AVA benchmark. This dataset consists of 255K images rated by amateur photographers through photography contests. Each image is rated by an average of 200 human raters, each assigning a score from 1 to 10. The higher the score, the better the visual aesthetic quality of the image. Each image in the dataset has a histogram of scores associated with it, which we use as the ground truth label. Following prior work, we split the dataset into train and test sets, such that 20% of the data is used for testing. We train MaxViT for three different input resolutions: $224\times 224$, $384\times 384$ and $512\times 512$. We initialized the model with ImageNet-1K $224\times 224$ pre-trained weights. The weight and bias momentums are set to 0.9, and a dropout rate of 0.75 is applied on the last layer of the baseline network. We use an initial learning rate of 1e-3, exponentially decayed with decay factor 0.9 every 10 epochs. We set the stochastic depth rate to 0.5.
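The learning-rate schedule can be written as a one-line function. The sketch below assumes a staircase interpretation, in which the rate is multiplied by 0.9 once every 10 epochs rather than decayed continuously; the text does not say which variant was used, and the function name is illustrative.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.9, decay_every=10):
    """Staircase exponential decay: multiply by `decay` once every `decay_every` epochs."""
    return base_lr * decay ** (epoch // decay_every)

for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch))
```

Under this assumption the rate stays at 1e-3 for epochs 0 through 9, drops to 9e-4 at epoch 10, and reaches 8.1e-4 from epoch 20 onward until the next step.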

B.4 Image Generation

We use a ResNet-based discriminator following prior work. To train the model, we use the standard non-saturating logistic GAN loss with an $R_{1}$ gradient penalty applied to the discriminator, with the gradient penalty weight set to 10. We employ the Adam optimizer with a learning rate of 1e-4 for both generator and discriminator. The model is trained on TPU for one million steps with batch size 256. Notably, we do not employ extra GAN training tricks such as pixel norm, noise injection, or progressive growing, on which recent state-of-the-art models rely heavily to attain good results. The overall objectives of the GAN training are defined as:
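A standard way to write the non-saturating logistic loss with an $R_{1}$ penalty is given below. It is a reconstruction from the description above rather than a verbatim formula: $D$ is assumed to output a probability, $p_{x}$ and $p_{z}$ denote the data and latent distributions, and some implementations scale the penalty by $\gamma/2$ instead of $\gamma$.

```latex
\begin{align*}
\mathcal{L}_{D} &= -\mathbb{E}_{x\sim p_{x}}\left[\log D(x)\right]
                   -\mathbb{E}_{z\sim p_{z}}\left[\log\left(1-D(G(z))\right)\right]
                   +\gamma\,\mathbb{E}_{x\sim p_{x}}\left[\lVert\nabla_{x}D(x)\rVert_{2}^{2}\right], \\
\mathcal{L}_{G} &= -\mathbb{E}_{z\sim p_{z}}\left[\log D(G(z))\right].
\end{align*}
```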

where $\gamma$ denotes the $R_{1}$ gradient penalty weight.

Appendix C Complete Experimental Results

We provide complete experimental comparisons for the ImageNet-1K, ImageNet-21K, and JFT datasets in Table 13 and Table 14. We also provide more visual results for unconditional image generation on ImageNet-1K in Figure 8.