VMamba: Visual State Space Model

Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, Yunfan Liu

Introduction

Visual representation learning is one of the most fundamental research topics in computer vision, and it has experienced significant breakthroughs since the onset of the deep learning era. Two primary categories of deep foundation models, i.e., Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have been extensively employed in a variety of visual tasks. While both have achieved remarkable success in computing expressive visual representations, ViTs generally exhibit superior performance compared to CNNs, which could be attributed to the global receptive fields and dynamic weights facilitated by the attention mechanism.

However, the attention mechanism has quadratic complexity with respect to image size, resulting in expensive computational overhead when addressing downstream dense prediction tasks such as object detection and semantic segmentation. To tackle this issue, substantial effort has been dedicated to improving the efficiency of attention by constraining the size or stride of computing windows, albeit at the cost of restricting the scale of receptive fields. This motivates us to design a novel visual foundation model with linear complexity, while still preserving the advantages associated with global receptive fields and dynamic weights.

Drawing inspiration from the recently proposed state space model, we introduce the Visual State Space Model (denoted as VMamba) for efficient visual representation learning. The pivotal concept behind VMamba’s success in effectively reducing attention complexity is inherited from the Selective Scan Space State Sequential Model (S6), originally devised to address Natural Language Processing (NLP) tasks. In contrast to the conventional attention computation approach, S6 enables each element in a 1-D array (e.g., text sequence) to interact with any of the previously scanned samples through a compressed hidden state, effectively reducing the quadratic complexity to linear.

However, due to the non-causal nature of visual data, directly applying such a strategy to a patchified and flattened image would inevitably result in restricted receptive fields, as the relationships against unscanned patches could not be estimated. We term this issue as the ‘direction-sensitive’ problem and propose to address it through the newly introduced Cross-Scan Module (CSM). Instead of traversing the spatial domain of image feature maps in a unidirectional pattern (either column-wise or row-wise), CSM adopts a four-way scanning strategy, i.e., from four corners all across the feature map to the opposite location (see Figure 2 (b)). This strategy ensures that each element in a feature map integrates information from all other locations in different directions, which renders a global receptive field without increasing the linear computational complexity.

Extensive experiments on diverse visual tasks are conducted to verify the effectiveness of VMamba. As shown in Figure 1, VMamba models show superior or at least competitive performance on ImageNet-1K in comparison with benchmark vision models including ResNet, ViT, and Swin. (Footnote: We encountered a bug during the training of VMamba-B, and we will update the latest result as soon as possible.) We also report the results on downstream dense prediction tasks. For example, VMamba-Tiny/Small/Base (with 22/44/75 M parameters, respectively) achieves 46.5%/48.2%/48.5% mAP on COCO using the Mask R-CNN detector ($1\times$ training schedule) and 47.3%/49.5%/50.0% mIoU on ADE20K using UperNet with $512\times 512$ inputs, demonstrating its potential to serve as a powerful foundation model. Furthermore, when larger images are used as input, the FLOPs of ViT increase significantly faster than those of CNN models, although ViT usually still exhibits superior performance. However, it is intriguing that VMamba, which is built on a state space model rather than on attention, is able to attain performance comparable to ViT with a steady increase in FLOPs.

We propose VMamba, a visual state space model with global receptive fields and dynamic weights for visual representation learning. VMamba presents a novel option for vision foundation models, extending beyond the existing choices of CNNs and ViTs.

The Cross-Scan Module (CSM) is introduced to bridge the gap between 1-D array scanning and 2-D plane traversal, facilitating the extension of S6 to visual data without compromising the receptive field.

Without bells and whistles, we show that VMamba achieves promising results across various visual tasks including image classification, object detection, and semantic segmentation. These findings underscore the potential of VMamba to serve as a robust vision foundation model.

Related Work

Deep neural networks have substantially advanced the research in machine visual perception. There are primarily two prevalent types of visual foundation models, i.e., CNNs and ViTs. Recently, the success of State Space Models (SSMs) has illustrated their efficacy in efficient long sequence modeling, which has attracted extensive attention in both the NLP and CV communities. Our study follows this line of work and proposes VMamba, an SSM-based architecture for data modeling in the vision domain. VMamba offers the community an alternative foundation model alongside CNNs and ViTs.

Convolutional Neural Networks (CNNs) serve as the landmark models in the history of visual perception. Early CNN-based models were designed for basic tasks, such as recognizing handwritten digits and classifying character categories. The distinctive characteristics of CNNs are encapsulated in the convolution kernels, which employ receptive fields to capture visual information of interest from images. With the aid of powerful computing devices (GPUs) and large-scale datasets, increasingly deep and efficient models have been proposed to enhance performance across a spectrum of visual tasks. In addition to these efforts, progress has been made in proposing more advanced convolution operators and more efficient network architectures.

Vision Transformers (ViTs) are adapted from the NLP community, showcasing a potent perception model for visual tasks and swiftly evolving into one of the most promising visual foundation models. Early ViT-based models usually require large-scale datasets and appear in a plain configuration. Later, DeiT employs training techniques to address challenges encountered in the optimization process, and subsequent studies tend to incorporate the inductive biases of visual perception into network design. For example, the community proposes hierarchical ViTs to gradually decrease the feature resolution throughout the backbone. Moreover, other studies propose to harness the advantages of CNNs, for example by introducing convolution operations or by designing hybrid architectures that combine CNN and ViT modules.

State Space Models (SSMs) are recently proposed models that are introduced into deep learning as state space transformations. Inspired by continuous state space models in control systems and combined with HiPPO initialization, LSSL showcases the potential of SSMs in handling long-range dependency problems. However, due to the prohibitive computation and memory requirements induced by the state representation, LSSL is infeasible to use in practice. To solve this problem, S4 proposes to normalize the parameters into a diagonal structure. Since then, many flavors of structured state space models have sprung up with different structures, such as complex-diagonal structure, multiple-input multiple-output support, decomposition into diagonal plus low-rank operations, and selection mechanisms. These models have since been integrated into large representation models.

These models mainly focus on how state space models are applied to long-range and causal data such as language and speech, covering tasks such as language understanding, content-based reasoning, and pixel-level 1-D image classification; few of them pay attention to visual recognition. The most similar work to ours is S4ND. S4ND is the first work to apply the state space mechanism to visual tasks, showing that its performance may compete with ViT. However, S4ND expands the S4 model in a simple manner and fails to capture image information efficiently in an input-dependent manner. We demonstrate that, with the selective scan mechanism introduced by Mamba, the proposed VMamba is able to match existing popular vision foundation models such as ResNet, ViT, Swin, and ConvNeXt, showcasing the potential of VMamba to be a powerful foundation model.

Method

In this section, we start by introducing the preliminary concepts related to VMamba, including the state space models, the discretization process, and the selective scan mechanism. We then provide detailed specifications of the 2D state space model which serves as the core element of VMamba. Finally, we present a comprehensive discussion of the overall VMamba architecture.

Discretization. State Space Models (SSMs), as continuous-time models, face great challenges when integrated into deep learning algorithms. To overcome this obstacle, the discretization process becomes imperative.
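As a concrete illustration of this step, the sketch below applies the zero-order hold (ZOH) rule, one common discretization choice, to a scalar continuous-time system $h'(t) = a\,h(t) + b\,x(t)$, $y(t) = c\,h(t)$. The scalar setting, the function names `discretize_zoh` and `run_recurrence`, and the parameter values are assumptions made for illustration; they are not taken from the VMamba implementation.

```python
import math


def discretize_zoh(a, b, delta):
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t).

    Returns (a_bar, b_bar) such that h_k = a_bar*h_{k-1} + b_bar*x_k.
    Assumes a != 0 (illustrative scalar case only).
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def run_recurrence(a_bar, b_bar, c, xs):
    """Run the discrete recurrence in O(len(xs)) time with one hidden state."""
    h = 0.0
    ys = []
    for x in xs:
        h = a_bar * h + b_bar * x   # update the compressed hidden state
        ys.append(c * h)            # read out y_k = c * h_k
    return ys


# Hypothetical parameters: a stable system sampled with step size 0.1.
a, b, c, delta = -0.5, 1.0, 2.0, 0.1
a_bar, b_bar = discretize_zoh(a, b, delta)
ys = run_recurrence(a_bar, b_bar, c, [1.0] * 10)

# Closed-form continuous solution at t = 10 * delta for a constant input x = 1.
exact = c * (b / a) * (math.exp(a * 10 * delta) - 1.0)
print(ys[-1], exact)
```

For a piecewise-constant input, the ZOH recurrence reproduces the continuous-time state exactly at the sampling instants, which is why the two printed values agree up to floating-point error. The loop also shows why the cost is linear in sequence length: each step touches only the current input and a fixed-size state.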

2D Selective Scan

Despite its distinctive characteristics, S6 causally processes the input data, and thus can only capture information within the scanned part of the data. This naturally aligns S6 with NLP tasks that involve temporal data but poses significant challenges when adapting to non-causal data such as image, graph, set, etc. A straightforward solution to this problem would be to scan data along two different directions (i.e., forward and backward), allowing them to compensate for the receptive field of each other without increasing the computational complexity.

Beyond their non-causal nature, images differ from texts in that they contain 2D spatial information (e.g., local texture and global structure). To tackle this problem, S4ND suggests reformulating the SSM with convolution and straightforwardly expanding the kernel from 1-D to 2-D via an outer product. However, such a modification prevents the weights from being dynamic (i.e., input-dependent), resulting in a loss of the context-based data modeling capability. Therefore, we choose to preserve dynamic weights by sticking to the selective scan approach, which unfortunately prevents us from following this route and integrating convolution operations.

To address this problem, we propose the Cross-Scan Module (CSM) as shown in Figure 2. We choose to unfold image patches along rows and columns into sequences (scan expand), and then proceed with scanning along four different directions: top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right. In this way, any pixel (such as the center pixel in Figure 2) integrates information from all other pixels in different directions. We then reshape each sequence back into a single image, and all sequences are merged to form a new one, as illustrated in Figure 3 (scan merge).
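The scan expand and scan merge steps can be sketched on a small grid as follows. This is a minimal illustration under stated assumptions: the four sequences are taken to be the row-major and column-major traversals plus their reversals, merging is done by element-wise summation, and a running prefix sum stands in for the causal S6 scan. The helper names (`cross_scan`, `cross_merge`, `prefix_sum`) are hypothetical and do not come from the authors' code.

```python
def cross_scan(grid):
    """Scan expand: unfold an H x W grid into four 1-D sequences."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    # Each traversal is paired with its reversal, giving four directions.
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def cross_merge(seqs, h, w):
    """Scan merge: map each sequence back to the grid and sum (assumed merge rule)."""
    row_f, row_b, col_f, col_b = seqs
    row_b = row_b[::-1]   # undo the reversal so indices match row_f
    col_b = col_b[::-1]   # undo the reversal so indices match col_f
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r = i * w + j   # index in row-major order
            c = j * h + i   # index in column-major order
            out[i][j] = row_f[r] + row_b[r] + col_f[c] + col_b[c]
    return out


def prefix_sum(seq):
    """Stand-in for a causal scan: position t only sees positions <= t."""
    total, out = 0.0, []
    for x in seq:
        total += x
        out.append(total)
    return out


grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
scanned = [prefix_sum(s) for s in cross_scan(grid)]
merged = cross_merge(scanned, 2, 3)
print(merged)
```

With the prefix sum as the causal operator, every merged entry depends on every input entry: each forward/backward pair together covers the whole grid, so the merged value at a position equals twice the grid total plus twice that position's own value. A single forward scan would leave the first position blind to everything after it.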

The integration of S6 with CSM, referred to as the S6 block, serves as the core element to construct the Visual State Space (VSS) block, which constitutes the fundamental building block of VMamba (further detailed in the next subsection). We emphasize that the S6 block inherits the linear complexity of the selective scan mechanism while retaining a global receptive field, which aligns with our motivation to construct such a vision model.

VMamba Model

An overview of the architecture of VMamba-Tiny is illustrated in Figure 4 (a). VMamba begins the process by partitioning the input image into patches using a stem module, similar to ViTs, but without further flattening the patches into a 1-D sequence. This modification preserves the 2D structure of images, resulting in a feature map with dimensions of $\frac{H}{4}\times\frac{W}{4}\times C_{1}$.

VMamba then stacks several VSS blocks on the feature map, maintaining the same dimension, constituting “Stage 1”. Hierarchical representations in VMamba are built by down-sampling the feature map in “Stage 1” through a patch merge operation. Subsequently, more VSS blocks are involved, resulting in an output resolution of $\frac{H}{8}\times\frac{W}{8}$ and forming “Stage 2”. This procedure is repeated to create “Stage 3” and “Stage 4” with resolutions of $\frac{H}{16}\times\frac{W}{16}$ and $\frac{H}{32}\times\frac{W}{32}$, respectively. All these stages collectively construct hierarchical representations akin to popular CNN models and some ViTs. The resulting architecture can serve as a versatile replacement for other vision models in practical applications with similar requirements.
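The resolution pyramid described above can be checked with a few lines of arithmetic. The helper `stage_resolutions` is a hypothetical name, and the use of floor division for input sizes that are not multiples of 32 is an assumption of this sketch.

```python
def stage_resolutions(height, width, stem_stride=4, num_stages=4):
    """Return the (height, width) of the feature map at each stage.

    The stem reduces the input by `stem_stride`; each later stage halves
    the resolution again through patch merging.
    """
    shapes = []
    stride = stem_stride
    for _ in range(num_stages):
        shapes.append((height // stride, width // stride))
        stride *= 2
    return shapes


for stage, (h, w) in enumerate(stage_resolutions(224, 224), start=1):
    print(f"Stage {stage}: {h} x {w} ({h * w} positions)")
```

For a $224\times 224$ input this yields feature maps of side 56, 28, 14, and 7, so the number of positions each scan has to traverse shrinks by a factor of four per stage.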

We develop VMamba in three distinct scales, i.e., VMamba-Tiny, VMamba-Small, and VMamba-Base (referred to as VMamba-T, VMamba-S, and VMamba-B, respectively). Detailed architectural specifications are outlined in Table 1. The FLOPs for all models are assessed using a $224\times 224$ input size. Additional architectures, such as a large-scale model, will be introduced in future updates.

VSS Block

The structure of the VSS block is illustrated in Figure 4 (b). The input undergoes an initial linear embedding layer, and the output splits into two information flows. One flow passes through a $3\times 3$ depth-wise convolution layer, followed by a SiLU activation function, before entering the core SS2D module. The output of SS2D goes through a layer normalization layer and is then added to the output of the other information flow, which has undergone a SiLU activation. This combination produces the final output of the VSS block.

Unlike vision transformers, we refrain from utilizing position embedding bias in VMamba due to its causal nature. Our design also diverges from the typical vision transformer block, which applies the sequence Norm $\rightarrow$ attention $\rightarrow$ Norm $\rightarrow$ MLP: the VSS block discards the MLP operation. Consequently, the VSS block is shallower than the ViT block, which allows us to stack more blocks with a similar budget of total model depth.

Experiment

In this section, we perform a series of experiments to assess and compare VMamba against popular models, including CNNs and vision transformers. Our evaluation spans diverse tasks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K. Subsequently, we delve into analysis experiments to gain deeper insights into the architecture of VMamba.

Settings. We evaluate VMamba’s classification performance on ImageNet-1K. Following the configuration in , VMamba-T/S/B undergo training from scratch for 300 epochs (with the first 20 epochs used for warmup), utilizing a batch size of 1024. The training process incorporates the AdamW optimizer with betas set to $(0.9, 0.999)$, a momentum of 0.9, a cosine decay learning rate scheduler, an initial learning rate of $1\times 10^{-3}$, and a weight decay of 0.05. Additional techniques such as label smoothing (0.1) and exponential moving average (EMA) are also employed. Beyond these, no further training techniques are applied.
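The learning rate schedule described above (20 warmup epochs followed by cosine decay over 300 epochs from an initial rate of $1\times 10^{-3}$) can be sketched as below. The linear shape of the warmup, the per-epoch granularity, and the final learning rate of zero are assumptions of this sketch, since the text does not specify them; `lr_at_epoch` is a hypothetical helper name.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear ramp that reaches base_lr at the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 19, 20, 160, 299):
    print(epoch, lr_at_epoch(epoch))
```

In practice such schedules are usually stepped per iteration rather than per epoch; the per-epoch form is used here only to keep the example short.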

Results. Table 2 summarizes results on ImageNet-1K, comparing VMamba with popular CNN models and vision transformers. The comparison reveals that, with similar FLOPs, VMamba-T achieves a top-1 accuracy of 82.2%, surpassing RegNetY-4G by 2.2%, DeiT-S by 2.4%, and Swin-T by 0.9%. Notably, the performance advantages of VMamba persist across small and base scale models. For instance, at the small scale, VMamba-S attains a top-1 accuracy of 83.5%, outperforming RegNetY-8G by 1.8% and Swin-S by 0.5%. Meanwhile, VMamba-B achieves a top-1 accuracy of 83.2%, surpassing RegNetY-16G by 0.3% and DeiT-B by 0.1%. These promising results underscore VMamba’s potential as a robust foundation model, extending its superiority beyond traditional CNN models and vision transformers.

Object Detection on COCO

Settings. In this section, we assess the performance of the proposed VMamba on object detection using the MSCOCO 2017 dataset. Our training framework is built on the mmdetection library, and we adhere to the hyperparameters in Swin with the Mask-RCNN detector. Specifically, we employ the AdamW optimizer and fine-tune the pre-trained classification models (on ImageNet-1K) for both 12 and 36 epochs. The drop path rates are set to 0.2%/0.2%/0.2% for VMamba-T/S/B, respectively. (Footnote: All being 0.2 is due to our oversight, and we will update the latest experiments.) The learning rate is initialized at $1\times 10^{-4}$ and is reduced by a factor of 10 at the 9th and 11th epochs. We implement multi-scale training and random flip with a batch size of 16. These choices align with established practices for object detection evaluations.
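The step schedule for the 12-epoch setting can be written out as a small function. This sketch assumes 1-indexed epochs and that each reduction takes effect from the start of the named epoch; the text does not pin down either convention, and `step_lr` is a hypothetical helper rather than an mmdetection API.

```python
def step_lr(epoch, base_lr=1e-4, milestones=(9, 11), gamma=0.1):
    """Learning rate for a 1-indexed epoch under a step schedule.

    The rate is multiplied by `gamma` once for every milestone reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


for epoch in range(1, 13):
    print(epoch, step_lr(epoch))
```

Under these assumptions, epochs 1 to 8 run at $1\times 10^{-4}$, epochs 9 and 10 at $1\times 10^{-5}$, and epochs 11 and 12 at $1\times 10^{-6}$.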

Results. The results for COCO are summarized in Table 3. VMamba maintains superiority in box/mask Average Precision (AP) on COCO, regardless of the training schedule employed (12 or 36 epochs). Specifically, with a 12-epoch fine-tuning schedule, VMamba-T/S/B models achieve object detection mAPs of 46.5%/48.2%/48.5%, surpassing Swin-T/S/B by 3.8%/3.6%/1.6% mAP, and ConvNeXt-T/S/B by 2.3%/2.8%/1.5% mAP. Using the same configuration, VMamba-T/S/B achieves instance segmentation mask APs of 42.1%/43.0%/43.1%, outperforming Swin-T/S/B by 2.8%/2.1%/0.8% mask AP, and ConvNeXt-T/S/B by 2.0%/1.2%/0.7% mask AP, respectively.

Furthermore, the advantages of VMamba persist under the 36-epoch fine-tuning schedule with multi-scale training, as indicated in Table 3. When compared to counterparts, including Swin, ConvNeXt, PVTv2, and ViT (with Adapters), VMamba-T/S exhibit superior performance, achieving 48.5%/49.7% mAP on object detection and 43.2%/44.0% mask AP on instance segmentation. These results underscore the potential of VMamba in downstream dense prediction tasks.

Semantic Segmentation on ADE20K

Settings. Following Swin, we construct a UperHead on top of the pre-trained model. Employing the AdamW optimizer, we set the learning rate as $6\times 10^{-5}$. The fine-tuning process spans a total of 160k iterations with a batch size of 16. The default input resolution is $512\times 512$, and we additionally present experimental results using $640\times 640$ inputs and multi-scale (MS) testing.

Results. The results are presented in Table 4. Once again, VMamba exhibits superior accuracy, particularly with the VMamba-T model achieving 47.3% mIoU with a resolution of $512\times 512$ and 48.3% mIoU using multi-scale (MS) input. These scores surpass all competitors, including ResNet, DeiT, Swin, and ConvNeXt. Notably, the advantages extend to VMamba-S/B models, even when using $640\times 640$ inputs.

Analysis Experiments

Effective Receptive Field. To assess the effective receptive fields (ERFs) across various models, we present a comparative analysis in Figure 5. The ERF measures the significance of the model input with respect to its output. Visualizing the ERF of the central pixel with an input size of $1024\times 1024$, we compare VMamba with four prominent visual foundation models: ResNet50, ConvNeXt-T, Swin-T, and DeiT-S (ViT), at both the before-training and after-training stages. Key observations from Figure 5 include: 1) Only DeiT (ViT) and VMamba exhibit global ERFs, while other models demonstrate local ERFs, despite their theoretical global potential. It is important to note that the DeiT (ViT) model incurs quadratic complexity costs (refer to Figure 6). 2) In contrast to DeiT (ViT), which evenly activates all pixels using the attention mechanism, VMamba activates all pixels and notably emphasizes cross-shaped activations. The Cross-Scan Module’s scanning mechanism ensures the central pixel is most influenced by pixels along the cross, prioritizing long-dependency context over local information for each pixel. 3) Intriguingly, VMamba exhibits only a local ERF before training. However, training transforms the ERF into a global one, signifying an adaptive process in the model’s global capability. We believe this adaptive process contributes to the model’s enhanced perception of images. This stands in contrast to DeiT, which maintains nearly identical ERFs before and after training.

Input Scaling. We proceed to perform experiments on input scaling, measuring top-1 accuracy on ImageNet-1K and FLOPs, as illustrated in Figure 6. In Figure 6 (a), we assess the inference performance of popular models (trained with a $224\times 224$ input size) across various image resolutions (ranging from $64\times 64$ to $1024\times 1024$). In comparison to counterparts, VMamba demonstrates the most stable performance across different input image sizes. Notably, as the input size increases from $224\times 224$ to $384\times 384$, only VMamba exhibits an upward trend in performance (VMamba-S achieving 84%), highlighting its robustness to changes in input image size. In Figure 6 (b), we evaluate FLOPs using different image resolutions (also ranging from $64\times 64$ to $1024\times 1024$). As anticipated, the VMamba series report a linear growth in complexity, aligning with CNN models. VMamba’s complexity is also consistent with carefully designed vision transformers like Swin. However, it is crucial to note that, among these linear-complexity models, only VMamba achieves a global effective receptive field (ERF). DeiT, which also exhibits a global ERF, experiences a quadratic growth in complexity.
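The gap between linear and quadratic growth can be made concrete with a back-of-the-envelope calculation. The sketch below counts only the token-mixing term (a scan that is linear in the number of positions versus pairwise attention that is quadratic) and ignores per-token projections, so it is not a FLOPs estimate for any real model. The patch size of 16 and the helper names are assumptions for illustration.

```python
def num_tokens(image_size, patch_size=16):
    """Number of positions for a square image split into square patches."""
    return (image_size // patch_size) ** 2


def relative_mixing_cost(image_size, base_size=224, patch_size=16, order=1):
    """Token-mixing cost relative to `base_size`.

    order=1 models a linear-time scan; order=2 models pairwise attention.
    """
    ratio = num_tokens(image_size, patch_size) / num_tokens(base_size, patch_size)
    return ratio ** order


for size in (224, 384, 512, 1024):
    linear = relative_mixing_cost(size, order=1)
    quadratic = relative_mixing_cost(size, order=2)
    print(f"{size:>4}px  linear x{linear:.1f}  quadratic x{quadratic:.1f}")
```

Doubling the image side quadruples the number of positions, so a linear-time mixer costs 4 times as much while a quadratic one costs 16 times as much; the divergence widens quickly toward $1024\times 1024$.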

Conclusion

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant foundation models for visual representation learning. While CNNs exhibit linear complexity with respect to image resolution, ViTs excel in fitting capabilities despite quadratic complexity. Our investigation reveals that ViTs achieve superior visual modeling through global receptive fields and dynamic weights. Motivated by this, we propose the Visual State Space Model (VMamba), drawing inspiration from the state space model to achieve linear complexity without sacrificing global receptive fields. To address direction sensitivity, we introduce the Cross-Scan Module (CSM) for spatial traversal, converting non-causal visual images into ordered patch sequences. Extensive experiments demonstrate VMamba’s promising performance across visual tasks, with pronounced advantages as image resolution increases, surpassing established benchmarks.

References