Multiscale Vision Transformers
Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer
Introduction
We begin with the intellectual history of neural network models for computer vision. Based on their studies of cat and monkey visual cortex, Hubel and Wiesel developed a hierarchical model of the visual pathway with neurons in lower areas such as V1 responding to features such as oriented edges and bars, and in higher areas to more specific stimuli. Fukushima proposed the Neocognitron, a neural network architecture for pattern recognition explicitly motivated by Hubel and Wiesel’s hierarchy. His model had alternating layers of simple cells and complex cells, thus incorporating downsampling, and shift invariance, thus incorporating convolutional structure. LeCun et al. took the additional step of using backpropagation to train the weights of this network. But already the main aspects of hierarchy of visual processing had been established: (i) Reduction in spatial resolution as one goes up the processing hierarchy and (ii) Increase in the number of different “channels”, with each channel corresponding to ever more specialized features.
In a parallel development, the computer vision community developed multiscale processing, sometimes called “pyramid” strategies, with Rosenfeld and Thurston, Burt and Adelson, and Koenderink among the key papers. There were two motivations: (i) to decrease the computing requirements by working at lower resolutions and (ii) to gain a better sense of “context” at the lower resolutions, which could then guide the processing at higher resolutions (this is a precursor to the benefit of “depth” in today’s neural networks).
The Transformer architecture allows learning arbitrary functions defined over sets and has been scalably successful in sequence tasks such as language comprehension and machine translation. Fundamentally, a transformer uses blocks with two basic operations. The first is an attention operation for modeling inter-element relations. The second is a multi-layer perceptron (MLP), which models relations within an element. Intertwining these operations with normalization and residual connections allows transformers to generalize to a wide variety of tasks.
Recently, transformers have been applied to key computer vision tasks such as image classification. In the spirit of architectural universalism, vision transformers approach the performance of convolutional models across a variety of data and compute regimes. By only having a first layer that ‘patchifies’ the input in the spirit of a 2D convolution, followed by a stack of transformer blocks, the vision transformer aims to showcase the power of the transformer architecture using little inductive bias.
In this paper, our intention is to connect the seminal idea of multiscale feature hierarchies with the transformer model. We posit that the fundamental vision principle of resolution and channel scaling can be beneficial for transformer models across a variety of visual recognition tasks.
We present Multiscale Vision Transformers (MViT), a transformer architecture for modeling visual data such as images and videos. Consider an input image as shown in Fig. 1. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution ‘scale’ stages. Starting from the image resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of feature activations inside the transformer network, effectively connecting the principles of transformers with multiscale feature hierarchies.
Our conceptual idea provides an effective design advantage for vision transformer models. The early layers of our architecture can operate at high spatial resolution to model simple low-level visual information, due to the lightweight channel capacity. In turn, the deeper layers can effectively focus on spatially coarse but complex high-level features to model visual semantics. The fundamental advantage of our multiscale transformer arises from the extremely dense nature of visual signals, a phenomenon that is even more pronounced for space-time visual signals captured in video.
A noteworthy benefit of our design is the presence of strong implicit temporal bias in video multiscale models. We show that vision transformer models trained on natural video suffer no performance decay when tested on videos with shuffled frames. This indicates that these models are not effectively using the temporal information and instead rely heavily on appearance. In contrast, when testing our MViT models on shuffled frames, we observe significant accuracy decay, indicating strong use of temporal information.
Our focus in this paper is video recognition, and we design and evaluate MViT for video tasks (Kinetics, Charades, SSv2 and AVA). MViT provides a significant performance gain over concurrent video transformers, without any external pre-training data.
In Fig. A.4 we show the computation/accuracy trade-off for video-level inference when varying the number of temporal clips used in MViT. The vertical axis shows accuracy on Kinetics-400 and the horizontal axis the overall inference cost in FLOPs for different models: MViT and the concurrent ViT video variants VTN, TimeSformer, and ViViT. To achieve a similar accuracy level as MViT, these models require significantly more computation and parameters (e.g. ViViT-L has 6.8× higher FLOPs and 8.5× more parameters at equal accuracy, more analysis in §A.1) and need large-scale external pre-training on ImageNet-21K (which contains around 60× more labels than Kinetics-400).
We further apply our architecture to an image classification task on ImageNet, by simply removing the temporal dimension of the video model found with ablation experiments on Kinetics, and show significant gains over single-scale vision transformers for image recognition.
Related Work
Convolutional networks (ConvNets). Incorporating downsampling, shift invariance, and shared weights, ConvNets are the de-facto standard backbones for image and video computer vision tasks.
Self-attention in ConvNets. Self-attention mechanisms have been used for image understanding, unsupervised object recognition, as well as vision and language. Hybrids of self-attention operations and convolutional networks have also been applied to image understanding and video recognition.
Vision Transformers. Much of the current enthusiasm for applying Transformers to vision tasks commences with the Vision Transformer (ViT) and Detection Transformer. We build directly upon ViT with a staged model allowing channel expansion and resolution downsampling. DeiT proposes a data-efficient approach to training ViT. Our training recipe builds on, and we compare our image classification models to, DeiT under identical settings.
An emerging thread of work aims at applying transformers to vision tasks such as object detection, semantic segmentation, 3D reconstruction, pose estimation, generative modeling, image retrieval, medical image segmentation, point clouds, video instance segmentation, object re-identification, video retrieval, video dialogue, video object detection and multi-modal tasks. A separate line of work attempts to model visual data with learnt discretized token sequences.
Efficient Transformers. Recent works reduce the quadratic attention complexity to make transformers more efficient for natural language processing applications, which is complementary to our approach.
Three concurrent works propose a ViT-based architecture for video. However, these methods rely on pre-training on vast amounts of external data such as ImageNet-21K, and thus use the vanilla ViT with minimal adaptations. In contrast, our MViT introduces multiscale feature hierarchies for transformers, allowing effective modeling of dense visual input without large-scale external data.
Multiscale Vision Transformer (MViT)
Our generic Multiscale Transformer architecture builds on the core concept of stages. Each stage consists of multiple transformer blocks with specific space-time resolution and channel dimension. The main idea of Multiscale Transformers is to progressively expand the channel capacity, while pooling the resolution from input to output of the network.
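We first describe Multi Head Pooling Attention (MHPA), a self-attention operator that enables flexible resolution modeling in a transformer block, allowing Multiscale Transformers to operate at progressively changing spatiotemporal resolution. In contrast to original Multi Head Attention (MHA) operators, where the channel dimension and the spatiotemporal resolution remain fixed, MHPA pools the sequence of latent tensors to reduce the sequence length (resolution) of the attended input. Fig. 3 shows the concept.
Concretely, consider a $D$-dimensional input tensor $X$ of sequence length $L$, $X \in \mathbb{R}^{L \times D}$. Following MHA, MHPA projects the input $X$ into an intermediate query tensor $\hat{Q} \in \mathbb{R}^{L \times D}$, key tensor $\hat{K} \in \mathbb{R}^{L \times D}$ and value tensor $\hat{V} \in \mathbb{R}^{L \times D}$ with the linear operations
$$\hat{Q} = XW_Q, \quad \hat{K} = XW_K, \quad \hat{V} = XW_V,$$
with weights $W_Q$, $W_K$, $W_V$ of dimensions $D \times D$. The obtained intermediate tensors are then pooled in sequence length, with a pooling operator $\mathcal{P}$ as described below.
Before attending the input, the intermediate tensors $\hat{Q}$, $\hat{K}$, $\hat{V}$ are pooled with the pooling operator $\mathcal{P}(\cdot; \Theta)$, which is the cornerstone of our MHPA and, by extension, of our Multiscale Transformer architecture.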
Pooling Attention.
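The pooling operator $\mathcal{P}(\cdot; \Theta)$ is applied to all the intermediate tensors $\hat{Q}$, $\hat{K}$ and $\hat{V}$ independently, with chosen pooling kernels $k$, stride $s$ and padding $p$. This yields the pre-attention vectors $Q = \mathcal{P}(\hat{Q}; \Theta_Q)$, $K = \mathcal{P}(\hat{K}; \Theta_K)$ and $V = \mathcal{P}(\hat{V}; \Theta_V)$ with reduced sequence lengths. Attention is now computed on these shortened vectors, with the operation,
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(QK^T / \sqrt{D}\right) V.$$
Naturally, the operation induces the constraint $s_K \equiv s_V$ on the pooling operators. In summary, pooling attention is computed as,
$$\mathrm{PA}(\cdot) = \mathrm{Softmax}\left(\mathcal{P}(\hat{Q}; \Theta_Q)\, \mathcal{P}(\hat{K}; \Theta_K)^T / \sqrt{D}\right) \mathcal{P}(\hat{V}; \Theta_V),$$
where $\sqrt{D}$ scales the inner product matrix, which Softmax then normalizes row-wise. The output of the pooling attention operation thus has its sequence length reduced by a stride factor of $s^Q_T s^Q_H s^Q_W$, following the shortening of the query vector $Q$ in $\mathcal{P}(\cdot)$.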
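The pooling attention computation can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: for brevity it pools along a 1D sequence axis rather than over space-time, uses max pooling, and takes already-projected tensors as input. The helper names (`max_pool_1d`, `pooling_attention`) and the tuple layout `(kernel, stride, padding)` for each pooling setting are assumptions made for this example.

```python
import math

def max_pool_1d(seq, kernel, stride, padding):
    """Max-pool a list of D-dim vectors along the sequence axis."""
    dim = len(seq[0])
    pad = [[float("-inf")] * dim for _ in range(padding)]
    padded = pad + seq + pad
    out_len = (len(seq) + 2 * padding - kernel) // stride + 1
    return [
        [max(vec[c] for vec in padded[i * stride:i * stride + kernel])
         for c in range(dim)]
        for i in range(out_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def pooling_attention(q_hat, k_hat, v_hat, theta_q, theta_kv):
    """Pool Q, K, V, then apply scaled dot-product attention.

    theta_q and theta_kv are (kernel, stride, padding) tuples; K and V
    share one setting so that their sequence lengths match.
    """
    q = max_pool_1d(q_hat, *theta_q)
    k = max_pool_1d(k_hat, *theta_kv)
    v = max_pool_1d(v_hat, *theta_kv)
    scale = math.sqrt(len(q[0]))
    out = []
    for q_vec in q:
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / scale for k_vec in k]
        weights = softmax(scores)
        out.append([sum(w * v_vec[c] for w, v_vec in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Toy input: sequence length 8, channel dimension 4.
x = [[math.sin(i + c) for c in range(4)] for i in range(8)]
out = pooling_attention(x, x, x, theta_q=(3, 2, 1), theta_kv=(5, 4, 2))
print(len(out), len(out[0]))  # 4 4: output length follows the pooled query
```

The output length (4) is set by the query stride alone, while the key/value stride only changes how many positions each query attends over (2 here).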
Multiple heads.
As in MHA, the computation can be parallelized by considering $h$ heads, where each head performs the pooling attention on a non-overlapping subset of $D/h$ channels of the $D$-dimensional input tensor $X$.
Computational Analysis.
Since attention computation scales quadratically w.r.t. the sequence length, pooling the key, query and value tensors has dramatic benefits on the fundamental compute and memory requirements of the Multiscale Transformer model. Denoting the sequence length reduction factors by $f_Q$, $f_K$ and $f_V$ we have,
$$f_j = s^j_T \cdot s^j_H \cdot s^j_W, \quad \forall\, j \in \{Q, K, V\}.$$
Considering the input tensor to $\mathcal{P}(\cdot; \Theta)$ to have dimensions $D \times T \times H \times W$, the run-time complexity of MHPA is $O\left(THW \frac{D}{h}\left(D + \frac{THW}{f_Q f_K}\right)\right)$ per head and the memory complexity is $O\left(THW\, h\left(\frac{D}{h} + \frac{THW}{f_Q f_K}\right)\right)$.
This trade-off between the number of channels $D$ and the sequence length term $THW / (f_Q f_K)$ informs our design choices about architectural parameters such as number of heads and width of layers. We refer the reader to the supplement for a detailed analysis and discussions on the time-memory complexity trade-off.
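To get a feel for how pooling changes the cost, the sketch below evaluates the reduction factor $f_j = s_T s_H s_W$ and the two complexity terms up to constant factors. It assumes the per-head run-time term $THW \cdot \frac{D}{h} \cdot (D + \frac{THW}{f_Q f_K})$ and the memory term $THW \cdot h \cdot (\frac{D}{h} + \frac{THW}{f_Q f_K})$; the function names and the example sizes (an 8×56×56 token grid with 96 channels and one head) are illustrative choices, not measurements.

```python
def reduction_factor(stride):
    """f_j = s_T * s_H * s_W for a (temporal, height, width) stride."""
    s_t, s_h, s_w = stride
    return s_t * s_h * s_w

def mhpa_cost_terms(t, h, w, dim, heads, stride_q, stride_k):
    """Return (run-time per head, memory) up to constant factors."""
    thw = t * h * w
    pooled = thw / (reduction_factor(stride_q) * reduction_factor(stride_k))
    runtime_per_head = thw * (dim / heads) * (dim + pooled)
    memory = thw * heads * (dim / heads + pooled)
    return runtime_per_head, memory

# Hypothetical example: with and without key pooling at stride 1x8x8.
no_pool = mhpa_cost_terms(8, 56, 56, 96, 1, (1, 1, 1), (1, 1, 1))
kv_pool = mhpa_cost_terms(8, 56, 56, 96, 1, (1, 1, 1), (1, 8, 8))
print(f"run-time ratio: {no_pool[0] / kv_pool[0]:.1f}")
print(f"memory ratio:   {no_pool[1] / kv_pool[1]:.1f}")
```

Because the quadratic term dominates at long sequence lengths, a 64× shorter key sequence cuts both terms by well over an order of magnitude in this setting.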
3.2 Multiscale Transformer Networks
Building upon Multi Head Pooling Attention (Sec. 3.1), we describe the Multiscale Transformer model for visual representation learning using exclusively MHPA and MLP layers. First, we present a brief review of the Vision Transformer Model that informs our design.
The Vision Transformer (ViT) architecture starts by dicing the input video of resolution $T \times H \times W$, where $T$ is the number of frames, $H$ the height and $W$ the width, into non-overlapping patches of size 1×16×16 each, followed by point-wise application of a linear layer on the flattened image patches to project them into the latent dimension, $D$, of the transformer. This is equivalent to a convolution with equal kernel size and stride of 1×16×16 and is shown as the patch1 stage in the model definition in Table 1.
The resulting sequence of length $L + 1$ ($L$ patches plus one class token) is then processed sequentially by a stack of $N$ transformer blocks, each one performing attention (MHA), multi-layer perceptron (MLP) and layer normalization (LN) operations. Considering $X$ to be the input of the block, the output of a single transformer block, $\mathrm{Block}(X)$, is computed by
$$X_1 = \mathrm{MHA}(\mathrm{LN}(X)) + X$$
$$\mathrm{Block}(X) = \mathrm{MLP}(\mathrm{LN}(X_1)) + X_1.$$
The resulting sequence after $N$ consecutive blocks is layer-normalized and the class embedding is extracted and passed through a linear layer to predict the desired output (e.g. class). By default, the hidden dimension of the MLP is $4D$. We refer the reader to the original ViT and Transformer papers for details.
In context of the present paper, it is noteworthy that ViT maintains a constant channel capacity and spatial resolution throughout all the blocks (see Table 1).
Multiscale Vision Transformers (MViT).
Our key concept is to progressively grow the channel resolution (i.e. dimension), while simultaneously reducing the spatiotemporal resolution (i.e. sequence length) throughout the network. By design, our MViT architecture has fine spacetime (and coarse channel) resolution in early layers that is up-/downsampled to a coarse spacetime (and fine channel) resolution in late layers. MViT is shown in Table 2.
Scale stages.
A scale stage is defined as a set of $N$ transformer blocks that operate on the same scale with identical resolution across channels and space-time dimensions $D \times T \times H \times W$. At the input (cube1 in Table 2), we project the patches (or cubes if they have a temporal extent) to a smaller channel dimension (e.g. 8× smaller than a typical ViT model), but a longer sequence (e.g. 4×4 = 16× denser than a typical ViT model; cf. Table 1).
At a stage transition (e.g. scale1 to scale2 in Table 2), the channel dimension of the processed sequence is up-sampled while the length of the sequence is down-sampled. This effectively reduces the spatio-temporal resolution of the underlying visual data while allowing the network to assimilate the processed information in more complex features.
Channel expansion.
When transitioning from one stage to the next, we expand the channel dimension by increasing the output of the final MLP layer in the previous stage by a factor that is relative to the resolution change introduced at the stage. Concretely, if we down-sample the space-time resolution by 4×, we increase the channel dimension by 2×. For example, scale3 to scale4 changes resolution from $2D \times \frac{T}{s_T} \times \frac{H}{8} \times \frac{W}{8}$ to $4D \times \frac{T}{s_T} \times \frac{H}{16} \times \frac{W}{16}$ in Table 2. This roughly preserves the computational complexity across stages, and is similar to ConvNet design principles.
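One way to see why pairing a 4× resolution reduction with a 2× channel expansion roughly preserves cost is the short calculation below. It assumes that the per-block cost is dominated by the linear layers (projections and MLP), whose cost scales as $L \cdot D^2$ for sequence length $L$ and channel dimension $D$; this scaling argument is an illustration and not taken from the paper's own analysis.

```latex
\underbrace{\frac{L}{4}}_{\text{sequence length}} \cdot
\underbrace{(2D)^2}_{\text{channels}}
  \;=\; \frac{L}{4} \cdot 4D^2
  \;=\; L \cdot D^2 ,
\qquad\text{whereas}\qquad
\left(\frac{L}{4}\right)^{2} \cdot 2D \;=\; \frac{L^2 D}{8}.
```

Under this assumption the linear-layer term is unchanged across the transition, while the attention term (second expression, scaling as $L^2 D$) shrinks by 8×, which is consistent with the cost being preserved only roughly.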
Query pooling.
The pooling attention operation affords flexibility not only in the length of key and value vectors but also in the length of the query, and thereby output, sequence. Pooling the query vector $\mathcal{P}(Q; k; p; s)$ with a stride $s \equiv (s^Q_T, s^Q_H, s^Q_W)$ leads to sequence reduction by a factor of $s^Q_T \cdot s^Q_H \cdot s^Q_W$. Since our intention is to decrease resolution at the beginning of a stage and then preserve this resolution throughout the stage, only the first pooling attention operator of each stage operates at non-degenerate query stride $s^Q > 1$, with all other operators constrained to $s^Q \equiv (1, 1, 1)$.
Key-Value pooling.
Unlike query pooling, changing the sequence length of the key $K$ and value $V$ tensors does not change the output sequence length and, hence, the space-time resolution. However, it plays a key role in the overall computational requirements of the pooling attention operator.
We decouple the usage of $K, V$ and $Q$ pooling, with $Q$ pooling being used in the first layer of each stage and $K, V$ pooling being employed in all other layers. Since the sequence lengths of the key and value tensors need to be identical to allow attention weight calculation, the pooling stride used on the $K$ and $V$ tensors needs to be identical. In our default setting, we constrain all pooling parameters ($k; p; s$) to be identical, i.e. $\Theta_K \equiv \Theta_V$ within a stage, but vary $s$ adaptively w.r.t. the scale across stages.
Skip connections.
Since the channel dimension and sequence length change inside a residual block, we pool the skip connection to adapt to the dimension mismatch between its two ends. MHPA handles this mismatch by adding the query pooling operator $\mathcal{P}(\cdot; \Theta_Q)$ to the residual path. As shown in Fig. 3, instead of directly adding the input $X$ of MHPA to the output, we add the pooled input $X$ to the output, thereby matching the resolution to the attended query $Q$.
For handling the channel dimension mismatch between stage changes, we employ an extra linear layer that operates on the layer-normalized output of our MHPA operation. Note that this differs from the other (resolution-preserving) skip-connections that operate on the un-normalized signal.
3.3 Network instantiation details
Table 3 shows concrete instantiations of the base models for Vision Transformers and our Multiscale Vision Transformers. ViT-Base (Table 3a) initially projects the input to patches of shape 1×16×16 with dimension $D = 768$, followed by stacking $N = 12$ transformer blocks. With an 8×224×224 input the resolution is fixed to 768×8×14×14 throughout all layers. The sequence length (spacetime resolution + class token) is $8 \cdot 14 \cdot 14 + 1 = 1569$.
Our MViT-Base (Table 3b) is comprised of 4 scale stages, each having several transformer blocks of consistent channel dimension. MViT-B initially projects the input to a channel dimension of $D = 96$ with overlapping space-time cubes of shape 3×7×7. The resulting sequence of length $8 \cdot 56 \cdot 56 + 1 = 25089$ is reduced by a factor of 4 for each additional stage, to a final sequence length of $8 \cdot 7 \cdot 7 + 1 = 393$ at scale4. In tandem, the channel dimension is up-sampled by a factor of 2 at each stage, increasing to 768 at scale4. Note that all pooling operations, and hence the resolution down-sampling, are performed only on the data sequence without involving the processed class token embedding.
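The sequence lengths and per-stage shapes can be checked with a few lines of arithmetic. The sketch below assumes an 8×56×56 token grid with 96 channels at scale1 (i.e. 4×4 = 16× denser and 8× narrower than ViT-B's 8×14×14 grid with 768 channels), halving of the spatial side and doubling of channels at each of the three stage transitions, and one class token that is never pooled. The function names are hypothetical.

```python
def sequence_length(frames, height, width, class_token=True):
    """Number of tokens in a space-time grid, optionally plus a class token."""
    return frames * height * width + (1 if class_token else 0)

def mvit_stage_shapes(channels=96, frames=8, size=56, num_stages=4):
    """(channels, T, H, W) per scale stage under the assumed schedule."""
    shapes = []
    for _ in range(num_stages):
        shapes.append((channels, frames, size, size))
        channels *= 2   # channel dimension up-sampled by 2
        size //= 2      # H and W each halved: sequence length down by 4
    return shapes

# ViT-B keeps one resolution throughout: 8 x 14 x 14 patches + class token.
print("ViT-B:", sequence_length(8, 224 // 16, 224 // 16))

for stage, (c, t, h, w) in enumerate(mvit_stage_shapes(), start=1):
    print(f"scale{stage}: {c}x{t}x{h}x{w}, sequence length {sequence_length(t, h, w)}")
```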
We set the number of MHPA heads to $h = 1$ in the scale1 stage and increase the number of heads with the channel dimension (channels per head, $D/h$, remain consistent at 96).
At each stage transition, the previous stage output MLP dimension is increased by 2× and MHPA pools on $Q$ tensors with $s^Q = (1, 2, 2)$ at the input of the next stage.
We employ $K, V$ pooling in all MHPA blocks, with $\Theta_K \equiv \Theta_V$ and $s^K = (1, 8, 8)$ in scale1, and adaptively decay this stride w.r.t. the scale across stages such that the $K, V$ tensors have consistent scale across all blocks.
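The effect of the adaptive key-value stride can be sketched as follows. The example assumes spatial stage resolutions of 56, 28, 14 and 7, a spatial stride of 8 in the first stage that is halved at every later stage (with a stride of 1 in the last stage assumed by continuing the pattern), and it approximates the pooled resolution by integer division. The function names are hypothetical.

```python
def adaptive_kv_stride(stage_index, base_stride=8):
    """Spatial K,V stride for a 0-based stage index: 8, 4, 2, 1."""
    return max(base_stride >> stage_index, 1)

def kv_resolution(stage_size, stride):
    """Approximate spatial side length of the pooled K,V tensors."""
    return stage_size // stride

stage_sizes = [56, 28, 14, 7]
kv_sizes = [kv_resolution(size, adaptive_kv_stride(i))
            for i, size in enumerate(stage_sizes)]
print(kv_sizes)  # [7, 7, 7, 7]
```

Under these assumptions every block attends over the same 7×7 spatial grid of keys and values per frame, regardless of the stage's own query resolution.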
Experiments: Video Recognition
We use Kinetics-400 (K400) (240k training videos in 400 classes) and Kinetics-600. We further assess transfer learning performance on Something-Something-v2, Charades, and AVA.
We report top-1 and top-5 classification accuracy (%) on the validation set, computational cost (in FLOPs) of a single, spatially center-cropped clip and the number of clips used.
Training.
By default, all models are trained from random initialization (“from scratch”) on Kinetics, without using ImageNet or other pre-training. Our training recipe and augmentations follow prior work. For Kinetics, we train for 200 epochs with 2 repeated augmentation repetitions.
We report ViT baselines that are fine-tuned from ImageNet, using a 30-epoch version of the training recipe from prior work.
For the temporal domain, we sample a clip from the full-length video, and the input to the network is $T$ frames with a temporal stride of $\tau$, denoted as $T \times \tau$.
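As a small illustration of the $T \times \tau$ notation, the sketch below lists the raw frame indices a clip covers. The helper names and the choice of a fixed start index are assumptions for this example; how the start frame is drawn during training is not specified here.

```python
def clip_frame_indices(start, num_frames, temporal_stride):
    """Indices of the T frames sampled every tau raw frames from `start`."""
    return [start + i * temporal_stride for i in range(num_frames)]

def clip_span(num_frames, temporal_stride):
    """Number of raw video frames covered by one T x tau clip."""
    return (num_frames - 1) * temporal_stride + 1

# A 16x4 clip: 16 frames, one every 4 raw frames.
indices = clip_frame_indices(start=0, num_frames=16, temporal_stride=4)
print(indices[:4], "...", indices[-1])
print("span in raw frames:", clip_span(16, 4))
```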
Inference.
We apply two testing strategies following prior work: (i) temporally, we uniformly sample $K$ clips (e.g. $K = 5$) from a video, scale the shorter spatial side to 256 pixels and take a 224×224 center crop, and (ii) the same as (i) temporally, but taking 3 crops of 224×224 to cover the longer spatial axis. We average the scores for all individual predictions.
4.1 Main Results
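The two testing strategies can be sketched as view enumeration followed by score averaging. The code below is an illustration under stated assumptions: clip start positions are spread evenly between the first and last valid start, the three spatial crops sit at the start, center and end of the longer side, and the per-view class scores are given as plain lists. None of these helper names come from the paper.

```python
def uniform_clip_starts(num_video_frames, clip_span, num_clips):
    """Evenly spaced start frames for K clips covering the video."""
    last_start = max(num_video_frames - clip_span, 0)
    if num_clips == 1:
        return [last_start // 2]
    return [(i * last_start) // (num_clips - 1) for i in range(num_clips)]

def three_crop_offsets(long_side, crop=224):
    """Offsets of 3 crops that together cover the longer spatial axis."""
    slack = long_side - crop
    return [0, slack // 2, slack]

def average_scores(per_view_scores):
    """Average class scores over all views (clips x crops)."""
    n = len(per_view_scores)
    return [sum(col) / n for col in zip(*per_view_scores)]

# A 300-frame video, 16x4 clips (61 raw frames each), K = 5 clips.
print(uniform_clip_starts(300, 61, 5))  # [0, 59, 119, 179, 239]
# Shorter side scaled to 256, longer side assumed to be 340 pixels.
print(three_crop_offsets(340))          # [0, 58, 116]
```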
Table 4 compares to prior work. From top-to-bottom, it has four sections and we discuss them in turn.
The first section of Table 4 shows prior art using ConvNets.
The second section shows concurrent work using Vision Transformers for video classification. Both approaches rely on ImageNet pre-trained base models. ViT-B-VTN achieves 75.6% top-1 accuracy, which is boosted by 3% to 78.6% by merely changing the pre-training from ImageNet-1K to ImageNet-21K. ViT-B-TimeSformer shows another 2.1% gain on top of VTN, at a higher cost of 7140G FLOPs and 121.4M parameters. ViViT improves accuracy further with an even larger ViT-L model.
The third section in Table 4 shows our ViT baselines. We first list our ViT-B, also pre-trained on ImageNet-21K, which achieves 79.3%, thereby being 1.4% lower than ViT-B-TimeSformer, but with 4.4× fewer FLOPs and 1.4× fewer parameters. This result shows that simply fine-tuning an off-the-shelf ViT-B model from ImageNet-21K provides a strong baseline on Kinetics. However, training this model from scratch with the same fine-tuning recipe results in 34.3%. Using our “training-from-scratch” recipe produces 68.5% for this ViT-B model, using the same 1×5, spatial × temporal, views for video-level inference.
The final section of Table 4 lists our MViT results. All our models are trained from scratch using this recipe, without any external pre-training. Our small model, MViT-S, produces 76.0% while being relatively lightweight with 26.1M parameters and 32.9×5 = 164.5G FLOPs, outperforming ViT-B by +7.5% at 5.5× less compute in an identical train/val setting.
Our base model, MViT-B, provides 78.4%, a +9.9% accuracy boost over ViT-B under identical settings, while having 2.6×/2.4× fewer FLOPs/parameters. When changing the frame sampling from 16×4 to 32×3, performance increases to 80.2%. Finally, we take this model and fine-tune it for 5 epochs with longer 64-frame input, after interpolating the temporal positional embedding, to reach 81.2% top-1 using 3 spatial and 3 temporal views for inference (it is sufficient to test with fewer temporal views if a clip has more frames). Further quantitative and qualitative results are in §A.
Kinetics-600
Kinetics-600 is a larger version of Kinetics. Results are in Table 5. We train MViT from scratch, without any pre-training. MViT-B, 16×4 achieves 82.1% top-1 accuracy. We further train a deeper 24-layer model with longer sampling, MViT-B-24, 32×3, to investigate model scale on this larger dataset. MViT achieves a state-of-the-art 83.4% with 5-clip center crop testing while having 56.0× fewer FLOPs and 8.4× fewer parameters than ViT-L-ViViT, which relies on large-scale ImageNet-21K pre-training.
Something-Something-v2
Something-Something-v2 (SSv2) is a dataset with videos containing object interactions, which is known as a ‘temporal modeling’ task. Table 6 compares our method with the state-of-the-art. We first report a simple ViT-B (our baseline) that uses ImageNet-21K pre-training. Our MViT-B with 16 frames has 64.7% top-1 accuracy, which is better than the SlowFast R101, which shares the same setting (K400 pre-training and 3×1 view testing). With more input frames, our MViT-B achieves 67.7% and the deeper MViT-B-24 achieves 68.7% using our K600 pre-trained model from above. In general, Table 6 verifies the capability of temporal modeling for MViT.
Charades
Charades is a dataset with longer range activities. We validate our model in Table 7. With similar FLOPs and parameters, our MViT-B 16×4 achieves better results (+2.0 mAP) than SlowFast R50. As shown in the table, the performance of MViT-B is further improved by increasing the number of input frames and MViT-B layers and using K600 pre-trained models.
AVA
AVA is a dataset for spatiotemporal localization of human actions. We validate our MViT on this detection task. Details about the detection architecture of MViT can be found in §D.2. Table 8 shows the results of our MViT models compared with SlowFast and X3D. We observe that MViT-B can be competitive with SlowFast and X3D using the same pre-training and testing strategy.
4.2 Ablations on Kinetics
We carry out ablations on Kinetics-400 (K400) using 5-clip center 224×224 crop testing. We show top-1 accuracy (Acc), as well as computational complexity measured in GFLOPs for a single clip input of spatial size $224^2$. Inference computational cost is proportional, as a fixed number of 5 clips is used (to roughly cover the inferred videos with $T \times \tau$ = 16×4 sampling). We also report parameters in M ($10^6$) and training GPU memory in G ($10^9$) for a batch size of 4. By default all MViT ablations are with MViT-B, 16×4 and max-pooling in MHPA.
Table 9 shows results for randomly shuffling the input frames in time during testing. All models are trained without any shuffling and have temporal embeddings. We notice that our MViT-B architecture suffers a significant accuracy drop of -7.1% (77.2 → 70.1) for shuffling inference frames. By contrast, ViT-B is surprisingly robust for shuffling the temporal order of the input.
This indicates that a naïve application of ViT to video does not model temporal information, and the temporal positional embedding in ViT-B seems to be fully ignored. We also verified this with the 79.3% ImageNet-21K pre-trained ViT-B of Table 4, which has the same accuracy of 79.3% for shuffling test frames, suggesting that it implicitly performs bag-of-frames video classification in Kinetics.
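The logic of the shuffling test can be illustrated with two toy scoring functions; these are stand-ins for the idea, not the ViT-B or MViT-B models. A bag-of-frames score that averages per-frame features is unchanged by any permutation of the frames, whereas a score that depends on frame position generally changes. All names and the scalar per-frame "features" are hypothetical.

```python
import random

def shuffle_frames(clip, seed=0):
    """Return a copy of the clip with its frames in random temporal order."""
    rng = random.Random(seed)
    shuffled = list(clip)
    rng.shuffle(shuffled)
    return shuffled

def bag_of_frames_score(clip):
    """Order-invariant toy model: mean of per-frame features."""
    return sum(clip) / len(clip)

def order_sensitive_score(clip):
    """Order-dependent toy model: features weighted by temporal position."""
    return sum(i * f for i, f in enumerate(clip)) / len(clip)

clip = list(range(8))              # one scalar "feature" per frame
shuffled = shuffle_frames(clip)

print(bag_of_frames_score(clip) == bag_of_frames_score(shuffled))  # True
print(order_sensitive_score(clip), order_sensitive_score(shuffled))
```

A model whose accuracy is unchanged under this test behaves like the first function, which is the sense in which the text describes bag-of-frames classification.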
Two scales in ViT.
We provide a simple experiment that ablates the effectiveness of scale-stage design on ViT-B. For this we add a single scale stage to the ViT-B model. To isolate the effect of having different scales in ViT, we do not alter the channel dimensionality for this experiment. We do so by performing $Q$-pooling with $s^Q \equiv (1, 2, 2)$ after 6 Transformer blocks (cf. Table 3). Table 10 shows the results. Adding a single scale stage to the ViT-B baseline boosts accuracy by +1.5% while decreasing FLOPs and memory cost by 38% and 41%. Pooling Key-Value tensors reduces compute and memory cost while slightly increasing accuracy.
Separate space & time embeddings in MViT.
In Table 11, we ablate using (i) none, (ii) space-only, (iii) joint space-time, and (iv) a separate space and time (our default), positional embeddings. We observe that no embedding (i) decays accuracy by -0.9% over using just a spatial one (ii) which is roughly equivalent to a joint spatiotemporal one (iii). Our separate space-time embedding (iv) is best, and also has 2.1M fewer parameters than a joint spacetime embedding.
Input Sampling Rate.
Table 12 shows results for different cubification kernel sizes $c$ and sampling strides $s$ (cf. Table 2). We observe that sampling patches, $c_T = 1$, performs worse than sampling cubes with $c_T > 1$. Further, sampling twice as many frames, $T = 16$, with twice the cube stride, $s_T = 2$, keeps the cost constant but boosts performance by +1.3% (75.9% → 77.2%). Also, sampling overlapping input cubes allows better information flow and benefits performance. While $c_T > 1$ helps, a very large temporal kernel size ($c_T = 7$) doesn’t further improve performance.
Stage distribution.
The ablation in Table 13 shows the results for distributing the number of transformer blocks in each individual scale stage. The overall number of transformer blocks, $N = 16$, is consistent. We observe that having more blocks in early stages increases memory, and having more blocks in later stages increases the parameters of the architecture. Shifting the majority of blocks to the scale4 stage (Variants V5 and V6 in Table 13) achieves the best trade-off.
Key-Value pooling.
The ablation in Table 14 analyzes the pooling stride $s = s_T \times s_H \times s_W$ for pooling the $K$ and $V$ tensors. Here, we compare an “adaptive” pooling that uses a stride w.r.t. stage resolution, and keeps the $K, V$ resolution fixed across all stages, against a non-adaptive version that uses the same stride at every block. First, we compare the baseline which uses no pooling with non-adaptive pooling with a fixed stride of 2×4×4 across all stages: this drops accuracy from 77.6% to 74.8% (and reduces FLOPs and memory by over 50%). Using an adaptive stride that is 1×8×8 in the scale1 stage, 1×4×4 in scale2, and 1×2×2 in scale3 gives the best accuracy of 77.2% while still preserving most of the efficiency gains in FLOPs and memory.
Pooling function.
The ablation in Table 15 looks at the kernel size $k$ w.r.t. the stride $s$, and the pooling function (max/average/conv). First, we see that having equivalent kernel and stride, $k = s$, provides 76.1%, increasing the kernel size to $k = 2s + 1$ decays to 75.5%, but using a kernel $k = s + 1$ gives a clear benefit of 77.2%. This indicates that overlapping pooling is effective, but too large an overlap ($2s + 1$) hurts. Second, we investigate average instead of max-pooling and observe that accuracy decays from 77.2% to 75.4%.
Third, we use conv-pooling by a learnable, channelwise convolution followed by LN. This variant has +1.2% over max pooling and is used for all experiments in §4.1 and §5.
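The three kernel choices in this ablation differ only in how much neighbouring pooling windows overlap, not in the output resolution. The sketch below uses the standard pooling output-length formula $\lfloor (L + 2p - k)/s \rfloor + 1$ along one axis and assumes a shape-preserving padding of $p = \lfloor (k - 1)/2 \rfloor$; the padding rule and function names are assumptions for this example.

```python
def pooled_length(length, kernel, stride, padding):
    """Output length of a pooling window slid along one axis."""
    return (length + 2 * padding - kernel) // stride + 1

def kernel_variants(stride):
    """The three kernel sizes compared in the ablation, by name."""
    return {"k=s": stride, "k=s+1": stride + 1, "k=2s+1": 2 * stride + 1}

stride, length = 2, 56
for name, kernel in kernel_variants(stride).items():
    padding = (kernel - 1) // 2          # assumed shape-preserving padding
    overlap = kernel - stride            # positions shared by adjacent windows
    out = pooled_length(length, kernel, stride, padding)
    print(f"{name}: kernel={kernel}, overlap={overlap}, output length={out}")
```

All three variants reduce a length-56 axis to 28 at stride 2; only the overlap (0, 1 or 3 positions) changes.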
Speed-Accuracy tradeoff.
In Table 16, we analyze the speed/accuracy trade-off of our MViT models, along with their vision transformer (ViT) and ConvNet (SlowFast 8×8 R50, SlowFast 8×8 R101, and X3D-L) counterparts. We measure training throughput as the number of video clips per second on a single M40 GPU.
We observe that both MViT-S and MViT-B models are not only significantly more accurate but also much faster than both the ViT-B baseline and convolutional models. Concretely, MViT-S has 3.4× higher throughput speed (clips/s), is +5.8% more accurate (Acc), and has 3.3× fewer parameters (Param) than ViT-B. Using a conv instead of max-pooling in MHPA, we observe a training speed reduction of about 20% for convolution and additional parameter updates.
Experiments: Image Recognition
We apply our video models on static image recognition by using them with a single frame, $T = 1$, on ImageNet-1K.
Our recipe is identical to DeiT and summarized in the supplementary material. Training is for 300 epochs and results improve with longer training.
5.1 Main Results
For this experiment, we take our models which were designed by ablation studies for video classification on Kinetics and simply remove the temporal dimension. Then we train and validate them (“from scratch”) on ImageNet.
Table 17 shows the comparison with previous work. From top to bottom, the table contains RegNet and EfficientNet as ConvNet examples, and DeiT, with DeiT-B being identical to ViT-B but trained with the improved DeiT recipe. Therefore, this is the vision transformer counterpart we are interested in comparing to.
The bottom section in Table 17 shows results for our Multiscale Vision Transformer (MViT) models.
We show models of different depth, MViT-B-Depth (16, 24, and 32), where MViT-B-16 is our base model and the deeper variants are simply created by repeating the number of blocks in each scale stage (cf. Table 3b). “wide” denotes a larger channel dimension of $D = 112$. All our models are trained using the identical recipe as DeiT.
(i) Our lightweight MViT-B-16 achieves 82.5% top-1 accuracy, with only 7.8 GFLOPs, which outperforms the DeiT-B counterpart by +0.7% with lower computation cost (2.3× fewer FLOPs and parameters). If we use conv instead of max-pooling, this number is increased by +0.5% to 83.0%.
(ii) Our deeper model MViT-B-24, provides a gain of +0.6% accuracy at slight increase in computation.
(iii) A larger model, MViT-B-24-wide with input resolution $320^2$, reaches 84.3%, corresponding to a +1.2% gain, at 1.7× fewer FLOPs, over DeiT-B↑$384^2$. Using convolutional instead of max-pooling elevates this to 84.8%.
These results suggest that Multiscale Vision Transformers have an architectural advantage over Vision Transformers.
Conclusion
We have presented Multiscale Vision Transformers that aim to connect the fundamental concept of multiscale feature hierarchies with the transformer model. MViT hierarchically expands the feature complexity while reducing visual resolution. In empirical evaluation, MViT shows a fundamental advantage over single-scale vision transformers for video and image recognition. We hope that our approach will foster further research in visual recognition.
Appendix
In this appendix, §A contains further ablations for Kinetics (§A.1) & ImageNet (§A.2), §C contains an analysis on computational complexity of MHPA, and §B qualitative observations in MViT and ViT models. §D contains additional implementation details for: Kinetics (§D.1), AVA (§D.2), Charades (§D.3), SSv2 (§D.4), and ImageNet (§D.5).
Appendix A Additional Results
In the spirit of we aim to provide further ablations for the effect of using fewer testing clips for efficient video-level inference. In Fig. A.4 we analyze the trade-off for the full inference of a video, when varying the number of temporal clips used. The vertical axis shows the top-1 accuracy on K400-val and the horizontal axis the overall inference cost in FLOPs for different model families: MViT, X3D , SlowFast , and concurrent ViT models, VTN ViT-B-TimeSformer ViT-L-ViViT , pre-trained on ImageNet-21K.
We first compare MViT with concurrent Transformer-based methods in the left plot in Fig. A.4. All these methods, VTN , TimeSformer and ViViT , pre-train on ImageNet-21K and use the ViT model with modifications on top of it. The inference FLOPs of these methods are around 5-10higher than MViT models with equivalent performance; for example, ViT-L-ViViT uses 4 clips of 1446G FLOPs (i.e. 5.78 TFLOPs) each to produce 80.3% accuracy while MViT-B, 323 uses 5 clips of 170G FLOPs (i.e. 0.85 TFLOPs) to produce 80.2% accuracy. Therefore, MViT-L can provide similar accuracy at 6.8 lower FLOPs (and 8.5 lower parameters), than concurrent ViViT-L . More importantly, the MViT result is achieved without external data. All concurrent Transformer based works require the huge scale ImageNet-21K to be competitive, and the performance degrades significantly (-3% accuracy, see IN-1K in Fig. A.4 for VTN ). These works further report failure of training without ImageNet initialization.
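The FLOPs comparison in this paragraph is simple arithmetic over per-clip cost and the number of clips, which the following sketch reproduces from the figures quoted in the text. The function name is a hypothetical helper.

```python
def video_inference_gflops(gflops_per_clip, num_clips):
    """Total video-level inference cost for a given number of test clips."""
    return gflops_per_clip * num_clips

vivit_l = video_inference_gflops(1446, 4)   # ViT-L-ViViT: 4 clips
mvit_b = video_inference_gflops(170, 5)     # MViT-B, 32x3: 5 clips

print(f"ViViT-L: {vivit_l / 1000:.2f} TFLOPs")   # 5.78 TFLOPs
print(f"MViT-B:  {mvit_b / 1000:.2f} TFLOPs")    # 0.85 TFLOPs
print(f"ratio:   {vivit_l / mvit_b:.1f}x")       # 6.8x
```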
The right plot in Fig. A.4 shows the same data with a logarithmic scale applied to the FLOPs axis. With this scaling it is easier to see that smaller convolutional models (X3D-S and X3D-M) can still provide more efficient inference in terms of multiply-add operations, and that the compute/accuracy trade-off of MViT-B is similar to that of X3D-XL.
Ablations on skip-connections.
Recall that, at each scale-stage transition in MViT, we expand the channel dimension by increasing the output dimension of the previous stage's MLP layer; therefore, it is not possible to directly apply the original skip-connection design, because the input channel dimension differs from the output channel dimension. We ablate three strategies for this:
(a) First normalize the input with layer normalization and then expand its channel dimension to match the output dimension with a linear layer (Fig. A.5a); this is our default.
(b) Directly expand the channel dimension of the input by using a linear layer to match the dimension (Fig. A.5b).
(c) No skip-connection for stage-transitions (Fig. A.5c).
Table A.1 shows the Kinetics-400 ablations for all 3 variants. Our default of using a normalized skip-connection (a) obtains the best results with 77.2% top-1 accuracy, while using an un-normalized skip-connection after channel expansion (b) degrades accuracy significantly to 74.6%, and using no skip-connection for all stage-transitions (c) gives a similar result. We hypothesize that for expanding the channel dimension, normalizing the signal is essential to foster optimization, and use this design as our default in all other experiments.
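The three skip-connection variants can be sketched for a single token as follows. This is a minimal pure-Python illustration, not the released implementation: the helper names (`layer_norm`, `linear`, `skip_branch`), the omission of learned affine and bias parameters, and the toy dimensions are all assumptions made for brevity.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one token (a list of floats) to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight):
    """Apply a bias-free linear map; weight has shape [d_out][d_in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def skip_branch(x, weight, variant):
    """Return the skip signal for one token at a stage transition."""
    if variant == "a":      # normalize, then expand channels (the default)
        return linear(layer_norm(x), weight)
    if variant == "b":      # expand the un-normalized input directly
        return linear(x, weight)
    if variant == "c":      # no skip-connection at the transition
        return None
    raise ValueError(f"unknown variant: {variant}")

random.seed(0)
d_in, d_out = 4, 8          # illustrative sizes, not taken from the paper
weight = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
token = [0.5, -1.0, 2.0, 0.0]
print(len(skip_branch(token, weight, "a")))   # 8
```

In variants (a) and (b) the skip signal has the expanded dimension and can be added to the MLP output; in variant (c) the block output is the MLP output alone.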
SlowFast with MViT recipe.
To investigate if our training recipe can benefit ConvNet models, we apply the same augmentations and training recipe as for MViT to SlowFast in Table A.2. The results suggest that SlowFast models do not benefit from the MViT recipe directly and more studies are required to understand the effect of applying our training-from-scratch recipe to ConvNets, as it seems higher capacity ConvNets (R101) perform worse when using our recipe.
A.2 Ablations: ImageNet Image Classification
We carry out ablations on ImageNet with the MViT-B-16 model with 16 layers, and show top-1 accuracy (Acc) as well as computational complexity measured in GFLOPs (floating-point operations). We also report parameters in M ($10^6$) and training GPU memory in G ($10^9$) for a batch size of 512.
The ablation in Table A.3 analyzes the pooling stride $s_K \equiv s_V$ for pooling the $K$ and $V$ tensors. Here, we use our default 'adaptive' pooling that sets the stride with respect to the stage resolution, and keeps the $K$, $V$ resolution fixed across all stages.
First, we compare the baseline, which uses pooling with a fixed stride of 4×4, with a model that has a stride of 8×8: this drops accuracy from 82.5% to 81.6%, and reduces FLOPs and memory by 0.6G and 2.9G.
Second, we reduce the stride to 2×2, which increases FLOPs and memory significantly but performs 0.7% worse than our default stride of 4×4.
Third, we remove the pooling completely which increases FLOPs by 33% and memory consumption by 45%, while providing lower accuracy than our default.
Overall, the results show that our pooling is an effective technique to increase accuracy and decrease cost (FLOPs/memory) for image classification.
Appendix B Qualitative Experiments: Kinetics
In Figure A.6, we plot the mean attention distance for all heads across all the layers of our Multiscale Transformer model and its Vision Transformer counterpart, at initialization with random weights, and at convergence after training. Each head represents a point in the plots (ViT-B has more heads). Both models use the exact same weight initialization scheme, and the difference in the attention signature stems purely from the multiscale skeleton in MViT. We observe that the dynamic range of attention distance is about 4× larger in the MViT model than in ViT at initialization itself (A.6a vs. A.6b). This signals the strong inductive bias stemming from the multiscale design of MViT. Also note that while at initialization every layer in ViT has roughly the same mean attention distance, the MViT layers have strikingly different mean attention signatures, indicating distinct predilections towards global and local features.
The bottom row of Fig. A.6 shows the same plot for a converged Vision Transformer (A.6c) and Multiscale Vision Transformer (A.6d) model.
We notice very different trends between the two models after training. While the ViT model (A.6c) has a consistent increase in attention distance across layers, the MViT model (A.6d) is not monotonic at all. Further, the intra-head variation in the ViT model decreases as the depth saturates, while, for MViT, different heads are still focusing on different features even in the higher layers. This suggests that some of the capacity in the ViT model might indeed be wasted on redundant computation, while the lean MViT heads utilize their compute more judiciously. The ViT model also shows a noticeably larger change in its overall attention distance signature between initialization (Fig. A.6a) and convergence (Fig. A.6c) than MViT's location distribution does.
Appendix C Computational Analysis
Since attention is quadratic in compute and memory complexity, pooling the key, query and value vectors has direct benefits on the fundamental compute and memory requirements of the pooling operator and, by extension, on the complete Multiscale Transformer model. Consider an input tensor of dimensions $T \times H \times W$ and corresponding sequence length $L = T \cdot H \cdot W$. Further, assume the key, query and value strides to be $s^K$, $s^Q$ and $s^V$. As described in Sec. 3.1 of the main paper, each of the vectors would experience a spatio-temporal resolution downsampling by a factor of its corresponding stride. Equivalently, the sequence length of the query, key and value vectors would be reduced by a factor of $f^Q$, $f^K$ and $f^V$ respectively, where
$$f^j = s^j_T \cdot s^j_H \cdot s^j_W, \quad \forall\, j \in \{Q, K, V\}.$$
Using these shorter sequences yields a corresponding reduction in space and runtime complexities for the pooling attention operator. Considering key, query and value vectors to have sequence lengths $L/f^K$, $L/f^Q$ and $L/f^V$ after pooling, the overall runtime complexity of computing the key, query and value embeddings is $O(THWD^2/h)$ per head, where $D$ is the channel dimension and $h$ is the number of heads in MHPA. Further, the runtime complexity for calculating the full attention matrix and the weighted sum of value vectors with reduced sequence lengths is $O(T^2H^2W^2D/(f^Q f^K h))$ per head. The computational complexity of pooling, with kernel size $k_T \times k_H \times k_W$ and stride $s_T \times s_H \times s_W$, is
$$O\!\left(THW \cdot D \cdot \frac{k_T k_H k_W}{s_T s_H s_W}\right),$$
which is negligible compared to the quadratic complexity of the attention computation and hence can be ignored in asymptotic notation. Thus, the final runtime complexity of MHPA is $O\!\left(THWD\,(D + THW/(f^Q f^K))\right)$.
Memory complexity.
The space complexity for storing the sequence itself and other tensors of similar sizes is $O(THWD)$. The complexity for storing the full attention matrix is $O(T^2H^2W^2h/(f^Q f^K))$. Thus the total space complexity of MHPA is $O\!\left(THWh\,(D/h + THW/(f^Q f^K))\right)$.
Design choice.
Note the trade-off between the number of channels $D$ and the sequence length term $THW/(f^Q f^K)$ in both space and runtime complexity. This trade-off in multi-head pooling attention informs two critical design choices of the Multiscale Transformer architecture.
First, as the effective spatio-temporal resolution decreases with layers because of diminishing $THW/(f^Q f^K)$, the channel capacity is increased to keep the computational time spent (FLOPs) roughly the same for each stage.
Second, for a fixed channel dimension $D$, a higher number of heads $h$ causes a prohibitively larger memory requirement because of the $(D + h \cdot THW/(f^Q f^K))$ term. Hence, Multiscale Transformer starts with a small number of heads, which is increased as the resolution factor $THW/(f^Q f^K)$ decreases, to hold the effect of $(D + h \cdot THW/(f^Q f^K))$ roughly constant.
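To make the trade-off concrete, the sketch below evaluates the leading-order runtime and memory terms of multi-head pooling attention for a given input size, channel dimension, head count, and query/key strides. It assumes the standard accounting (linear projections cost sequence length times channels squared; the attention matrix has one entry per pooled query-key pair per head) and drops all constants. The function name `mhpa_cost` and the example sizes are illustrative choices, not values prescribed by this appendix.

```python
from math import prod

def mhpa_cost(thw, dim, heads, stride_q, stride_k):
    """Leading-order runtime and memory terms of multi-head pooling attention.

    thw, stride_q and stride_k are (T, H, W) triples; constants are dropped.
    """
    seq_len = prod(thw)                       # L = T * H * W
    f_q, f_k = prod(stride_q), prod(stride_k) # sequence-length reduction factors
    pooled = seq_len / (f_q * f_k)            # the THW / (f_Q f_K) term
    runtime = seq_len * dim * (dim + pooled)
    memory = seq_len * heads * (dim / heads + pooled)
    return runtime, memory

# Illustrative comparison: no pooling vs. key/value pooling with stride 1x8x8.
base_rt, base_mem = mhpa_cost((8, 56, 56), dim=96, heads=1,
                              stride_q=(1, 1, 1), stride_k=(1, 1, 1))
pool_rt, pool_mem = mhpa_cost((8, 56, 56), dim=96, heads=1,
                              stride_q=(1, 1, 1), stride_k=(1, 8, 8))
print(f"runtime reduced {base_rt / pool_rt:.1f}x, "
      f"memory reduced {base_mem / pool_mem:.1f}x")
```

Because the quadratic term dominates at high resolution, pooling only the keys and values already removes most of the cost in this example, while the output sequence length (set by the query stride) is unchanged.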
Appendix D Additional Implementation Details
We implement our model with PySlowFast . Code and models are available at: https://github.com/facebookresearch/SlowFast.
As in the original ViT, we use residual connections and Layer Normalization (LN) in the pre-normalization configuration that applies LN at the beginning of the residual function, and our MLPs consist of two linear layers with GELU activation, where the first layer expands the dimension from $D$ to $4D$, and the second restores the input dimension $D$, except at the end of a scale-stage, where we increase the channel dimension to match the input of the next scale-stage. At such stage-transitions, our skip connections receive an extra linear layer that takes as input the layer-normalized signal which is also fed into the MLP. In case of $Q$-pooling at scale-stage transitions, we correspondingly pool the skip-connection signal.
Optimization details.
Regularization details.
We use weight decay of $5 \times 10^{-2}$, a dropout of 0.5 before the final classifier, label-smoothing of 0.1, and stochastic depth (i.e. drop-connect) with rate 0.2.
Our data augmentation is performed on input clips by applying the same transformation across all frames. To each clip, we apply a random horizontal flip, Mixup with to half of the clips in a batch and CutMix to the other half, Random Erasing with probability , and Rand Augment with probability of for layers of maximum magnitude .
For the temporal domain, we randomly sample a clip from the full-length video, and the input to the network is $T$ frames with a temporal stride of $\tau$, denoted as $T \times \tau$. For the spatial domain, we use Inception-style cropping that randomly resizes the input area between a [min, max] scale of [0.08, 1.00] and jitters the aspect ratio between 3/4 and 4/3, before taking an $H \times W$ = 224×224 crop.
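The temporal and spatial sampling described above can be sketched as below. This is an illustrative stand-alone version, not the PySlowFast code: the function names, the clamping of indices for short videos, the log-uniform aspect-ratio sampling, the retry count, and the whole-frame fallback are assumptions. The sampled crop would subsequently be resized to 224×224.

```python
import math
import random

def sample_clip_indices(num_video_frames, num_frames, stride, rng=random):
    """Pick num_frames frame indices spaced by stride from a random start."""
    span = (num_frames - 1) * stride + 1
    start = rng.randint(0, max(num_video_frames - span, 0))
    # Clamp so that short videos repeat their last frame (an assumption).
    return [min(start + i * stride, num_video_frames - 1)
            for i in range(num_frames)]

def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                rng=random, tries=10):
    """Return (top, left, h, w) of a random-area, random-aspect crop."""
    area = height * width
    for _ in range(tries):
        target = area * rng.uniform(*scale)
        # Sample the aspect ratio log-uniformly so 3/4 and 4/3 are symmetric.
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target * aspect)))
        h = int(round(math.sqrt(target / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return rng.randint(0, height - h), rng.randint(0, width - w), h, w
    return 0, 0, height, width  # fallback: use the whole frame

random.seed(0)
indices = sample_clip_indices(num_video_frames=300, num_frames=16, stride=4)
top, left, crop_h, crop_w = sample_crop(256, 320)
```

The same crop parameters would be applied to every frame of the clip, in line with the statement that transformations are shared across frames.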
Fine-tuning from ImageNet.
D.2 Details: AVA Action Detection
The AVA dataset has bounding box annotations for spatiotemporal localization of (possibly multiple) human actions. It has 211k training and 57k validation video segments. We follow the standard protocol reporting mean Average Precision (mAP) on 60 classes on AVA v2.2.
Detection architecture.
We follow the detection architecture in to allow direct comparison of MViT against SlowFast networks as a backbone.
First, we reinterpret our transformer spacetime cube outputs from MViT as a spatial-temporal feature map by concatenating them according to the corresponding temporal and spatial location.
Second, we employ a detector similar to Faster R-CNN with minimal modifications adapted for video. Region-of-interest (RoI) features are extracted at the generated feature map from MViT by extending a 2D proposal at a frame into a 3D RoI by replicating it along the temporal axis, as done in previous work, followed by application of frame-wise RoIAlign and temporal global average pooling. The RoI features are then max-pooled and fed to a per-class sigmoid classifier for prediction.
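The RoI feature extraction can be illustrated on a toy single-channel feature map. The sketch below is a deliberate simplification: an integer crop stands in for RoIAlign (which uses bilinear sampling), and the function name and nested-list layout are assumptions made for readability.

```python
def roi_feature(feature_map, box):
    """feature_map[t][y][x] holds one channel; box = (x0, y0, x1, y1) in cells."""
    x0, y0, x1, y1 = box
    num_frames = len(feature_map)
    # Replicating the 2D box along time: the same crop is taken in every frame.
    crops = [[row[x0:x1] for row in frame[y0:y1]] for frame in feature_map]
    # Temporal global average pooling over the per-frame crops.
    pooled = [
        [sum(crops[t][y][x] for t in range(num_frames)) / num_frames
         for x in range(x1 - x0)]
        for y in range(y1 - y0)
    ]
    # Spatial max-pooling gives the value passed on to the classifier.
    return max(max(row) for row in pooled)

frames = [
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],      # frame 0
    [[2, 3, 4], [5, 6, 7], [8, 9, 10]],     # frame 1
]
print(roi_feature(frames, box=(1, 1, 3, 3)))   # 9.0
```

In a real detector the same steps are applied per channel, producing one feature vector per proposal.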
Training.
We initialize the network weights from the Kinetics models and adopt synchronized SGD training on 64 GPUs. We use 8 clips per GPU as the mini-batch size and a half-period cosine schedule for learning rate decay. The base learning rate is set as . We train for 30 epochs with linear warm-up for the first 5 epochs and use a weight decay of $10^{-8}$ and stochastic depth with rate 0.4. Ground-truth boxes, and proposals overlapping with ground-truth boxes by IoU > 0.9, are used as the samples for training. The region proposals are identical to the ones used in .
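A half-period cosine schedule with linear warm-up, as used in this training setup, can be written as follows. This is a sketch under stated assumptions: the warm-up starts from `warmup_start_lr` (a hypothetical parameter, defaulting to zero) and ramps to the value the cosine curve takes at the end of warm-up; `base_lr` is left as an argument rather than filled in.

```python
import math

def learning_rate(epoch, base_lr, total_epochs=30.0, warmup_epochs=5.0,
                  warmup_start_lr=0.0):
    """Half-period cosine decay with linear warm-up; epoch may be fractional."""
    cosine_lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    if epoch >= warmup_epochs:
        return cosine_lr
    # Ramp linearly from warmup_start_lr to the cosine value at warm-up end.
    warmup_end_lr = base_lr * 0.5 * (
        1.0 + math.cos(math.pi * warmup_epochs / total_epochs))
    alpha = epoch / warmup_epochs
    return warmup_start_lr + alpha * (warmup_end_lr - warmup_start_lr)

# Relative schedule for a placeholder base learning rate of 1.0.
schedule = [learning_rate(e, base_lr=1.0) for e in range(31)]
```

With these defaults the rate rises during the first 5 epochs, peaks at the end of warm-up, and decays to zero at epoch 30.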
Inference.
We perform inference on a single clip with $T$ frames sampled with stride $\tau$, centered at the frame that is to be evaluated.
D.3 Details: Charades Action Classification
Charades has 9.8k training videos and 1.8k validation videos in 157 classes in a multi-label classification setting of longer activities spanning 30 seconds on average. Performance is measured in mean Average Precision (mAP).
Training.
We fine-tune our MViT models from the Kinetics models. A per-class sigmoid output is used to account for the multi-label nature of the task. We train with SGD on 32 GPUs for 200 epochs using 8 clips per GPU. The base learning rate is set as 0.6 with half-period cosine decay. We use weight decay of 10 and stochastic depth with rate 0.45. We perform the same data augmentation schemes as for Kinetics in §D.1, except for Mixup.
Inference.
To infer the actions over a single video, we spatio-temporally max-pool prediction scores from multiple clips in testing .
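Video-level inference by max-pooling can be sketched as an element-wise maximum over the per-clip (and per-crop) score vectors. The function name and the list-of-lists layout are illustrative assumptions.

```python
def video_scores(clip_scores):
    """clip_scores[i][c] is the sigmoid score of class c for clip/crop i."""
    num_classes = len(clip_scores[0])
    # For each class, keep the highest score seen in any clip or crop.
    return [max(scores[c] for scores in clip_scores)
            for c in range(num_classes)]

print(video_scores([[0.1, 0.9], [0.4, 0.2]]))   # [0.4, 0.9]
```

Taking the maximum rather than the mean suits multi-label videos, where an action may be visible in only a few of the sampled clips.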
D.4 Details: Something-Something V2 (SSv2)
The Something-Something V2 dataset contains 169k training, and 25k validation videos. The videos show human-object interactions to be classified into 174 classes. We report accuracy on the validation set.
Training.
We fine-tune the pre-trained Kinetics models. We train for 100 epochs using 64 GPUs with 8 clips per GPU and a base learning rate of 0.02 with half-period cosine decay. Weight decay is set to $10^{-4}$ and stochastic depth rate is 0.4. Our training augmentation is the same as in §D.1, but as SSv2 requires distinguishing between directions, we disable random flipping in training. We use segment-based input frame sampling that splits each video into segments, and from each of them, we sample one frame to form a clip.
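Segment-based frame sampling can be sketched as follows: the video is divided into equal-length segments and one frame index is drawn from each. Choosing a random frame during training and the middle frame otherwise is an assumption of this sketch, as is the function name.

```python
import random

def segment_sample(num_video_frames, num_segments, training=True, rng=random):
    """Return one frame index per equal-length segment of the video."""
    seg_len = num_video_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(int((s + 1) * seg_len) - 1, start)
        # Random frame in training; the middle frame otherwise (an assumption).
        idx = rng.randint(start, end) if training else (start + end) // 2
        indices.append(min(idx, num_video_frames - 1))
    return indices

print(segment_sample(32, 4, training=False))   # [3, 11, 19, 27]
```

Unlike fixed-stride sampling, this always covers the whole video, which helps when the class depends on how an interaction unfolds from start to end.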
Inference.
We take a single clip with 3 spatial crops to form predictions over a single video in testing.
D.5 Details: ImageNet
For image classification experiments, we perform our experiments on ImageNet-1K dataset that has 1.28M images in 1000 classes. We train models on the train set and report top-1 and top-5 classification accuracy (%) on the val set. Inference cost (in FLOPs) is measured from a single center-crop with resolution of if the input resolution was not specifically mentioned.
Training.
We use the training recipe of DeiT and summarize it here for completeness. We train for epochs with repeated augmentation repetitions (overall computation equals epochs), using a batch size of in GPUs. We use truncated normal distribution initialization and adopt synchronized AdamW optimization with a base learning rate of per batch-size that is warmed up and decayed as half-period cosine, as in . We use a weight decay of , label-smoothing of . Stochastic depth (i.e. drop-connect) is also used with rate for model with depth of 16 (MViT-B-16), and rate for deeper models (MViT-B-24). Mixup with to half of the clips in a batch and CutMix to the other half, Random Erasing with probability , and Rand Augment with maximum magnitude and probability of for layers (for max-pooling) or layers (for conv-pooling).
Acknowledgements
We are grateful for discussions with Chao-Yuan Wu, Ross Girshick, and Kaiming He.