UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao

Introduction

Spatiotemporal representation learning is a fundamental task in video understanding. Recently, Vision Transformers (ViTs) have achieved remarkable success in the image domain (Dosovitskiy et al., 2021; Wang et al., 2021b; Liu et al., 2021; Li et al., 2022a). Therefore, researchers have made great efforts to transfer image-based ViTs to video modeling (Bertasius et al., 2021; Arnab et al., 2021; Yan et al., 2022) by extending Multi-Head Self-Attention (MHSA) along the temporal dimension. However, the spatiotemporal attention mechanism in these approaches mainly focuses on capturing global video dependency, while lacking the capacity to tackle local video redundancy. As a result, these models bear a large computational burden to encode local video representations in the shallow layers, leading to an unsatisfactory accuracy-efficiency balance in spatiotemporal learning.

To tackle these problems, researchers introduced the concise UniFormer (Li et al., 2022a), which unifies convolution and self-attention as a Multi-Head Relation Aggregator (MHRA) in a transformer fashion. By modeling local and global relations in shallow and deep layers respectively, it can not only learn discriminative spatiotemporal representations but also largely reduce the computation burden. However, as a new architecture for video modeling, UniFormer does not have any image-based pretraining as a start. To obtain a robust visual representation, it has to go through a tedious supervised pretraining phase by learning images from scratch, before finetuning on videos. Alternatively, we notice that there are various open-sourced image ViTs (Wightman, 2019; Touvron et al., 2021), which have been well-pretrained on huge web datasets under rich supervision such as image-text contrastive learning (Radford et al., 2021) and mask image modeling (He et al., 2022; Bao et al., 2021). These models exhibit great generalization capacity on a range of vision tasks (Luo et al., 2022; Chen et al., 2022; Shen et al., 2021). Hence, we are motivated by a natural question: Can we integrate advantages from both ViTs and UniFormer for video modeling?

In this paper, we propose a generic paradigm to construct a powerful family of video networks, by arming the image-pretrained ViTs with efficient video designs of UniFormer. We call the resulting model UniFormerV2 (Fig. 1), since it inherits the concise style of UniFormer but equips local and global UniBlocks with new MHRA. In the local UniBlock, we flexibly insert a local temporal MHRA before the spatial ViT block. In this way, we can largely reduce temporal redundancy as well as leverage the well-pretrained ViT block, for learning local spatiotemporal representation effectively. In the global UniBlock, we introduce a query-based cross MHRA. Unlike the costly global MHRA in the original UniFormer, our cross MHRA can summarize all the spatiotemporal tokens into a video token, for learning global spatiotemporal representation efficiently. Finally, we re-organize local and global UniBlocks as a multi-stage fusion architecture. It can adaptively integrate multi-scale spatiotemporal representation to capture complex dynamics in videos.

We deploy our paradigm on ViTs that are pretrained with three popular forms of supervision: supervised learning, contrastive learning, and mask image modeling. All the enhanced models perform strongly on video classification, showing the generic property of our UniFormerV2. Moreover, we develop a compact Kinetics-710 benchmark, where we integrate action categories of Kinetics-400/600/700, and remove the repeated and/or leaked videos in the training sets of these benchmarks for fairness (i.e., the total number of training videos is reduced from 1.14M to 0.66M). After training on K710, our model can simply achieve higher accuracy on K400/600/700 via only 5-epoch finetuning. Finally, extensive experiments show that our UniFormerV2 achieves state-of-the-art performance on 8 popular video benchmarks, including scene-related datasets (i.e., Kinetics-400/600/700 (Carreira & Zisserman, 2017; Carreira et al., 2018; 2019) and Moments in Time (Monfort et al., 2020)), temporal-related datasets (i.e., Something-Something V1/V2 (Goyal et al., 2017b)), and untrimmed datasets (i.e., ActivityNet (Heilbron et al., 2015) and HACS (Zhao et al., 2019)). To the best of our knowledge, it is the first model to achieve 90.0% top-1 accuracy on Kinetics-400.

Related Work

Vision Transformer. Following Transformer in NLP (Vaswani et al., 2017), Vision Transformer (ViT) (Dosovitskiy et al., 2021) has achieved great success in various vision tasks, including object detection (Carion et al., 2020; Zhu et al., 2021), semantic segmentation (Xie et al., 2021; Cheng et al., 2021), low-level image processing (Liang et al., 2021; Cui et al., 2022), action recognition (Bertasius et al., 2021; Arnab et al., 2021), temporal localization (Zhang et al., 2022) and multi-modality learning (Radford et al., 2021; Wang et al., 2022). To make ViT more efficient and effective, researchers introduce scale and locality modeling in different ways, such as multi-scale architectures (Wang et al., 2021b; Fan et al., 2021), local windows (Liu et al., 2021), early convolution embedding (Xiao et al., 2021; Yuan et al., 2021a) and convolutional position encoding (Chu et al., 2021; Dong et al., 2022). Alternatively, UniFormer (Li et al., 2022a) unifies convolution and self-attention as a relation aggregator in a transformer manner, thus reducing large local redundancy.

Video Learning. 3D Convolutional Neural Networks (CNNs) once played a dominant role in video understanding (Tran et al., 2015; Carreira & Zisserman, 2017). Because 3D CNNs are difficult to optimize, great efforts have been made to factorize 3D convolution in the spatiotemporal dimension (Tran et al., 2018; Qiu et al., 2017; Feichtenhofer et al., 2019) or channel dimension (Tran et al., 2019; Feichtenhofer, 2020; Kondratyuk et al., 2021). However, the local receptive field prevents 3D convolution from capturing long-range dependency. The global attention of transformers motivates researchers to transfer image-pretrained ViTs to video tasks (Bertasius et al., 2021; Neimark et al., 2021; Zhang et al., 2021b; Arnab et al., 2021; Bulat et al., 2021; Patrick et al., 2021). To make the video transformer more efficient, prior works introduce hierarchical structure with pooling self-attention (Fan et al., 2021), local self-attention (Liu et al., 2022) or unified attention (Li et al., 2022a). Though these novel models are adept at temporal modeling, they rely on tiresome image pretraining. In contrast, various well-pretrained ViTs with rich supervision are open-sourced (Wightman, 2019). In this paper, we aim to extend efficient UniFormer designs to ViT, arming it as a strong video learner.

Method

Overall Framework. We propose to arm an image ViT with video designs of UniFormer (Li et al., 2022a), leading to UniFormerV2. On one hand, spatial interactions in well-pretrained ViT can be fully leveraged and preserved to enhance spatial modeling. On the other hand, hierarchical temporal interactions in efficient UniFormer can be flexibly adopted to enhance temporal modeling. Our overall architecture is shown in Fig. 2. It first projects input videos into tokens, then conducts local and global modeling with the corresponding UniBlocks. Finally, a multi-stage fusion block adaptively integrates global tokens of different stages to further enhance video representation.

Local UniBlock

To efficiently model temporal dependency upon the well-learned spatial representation, we propose a new local UniBlock, by inserting a local temporal MHRA before the standard ViT block.
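Based on this description, the local UniBlock can be sketched as three residual updates. The display is a reconstruction under assumptions: the residual form, the intermediate symbols $\mathbf{X}^{in}$, $\mathbf{X}^{T}$, $\mathbf{X}^{S}$ and the generic ${\rm Norm}(\cdot)$ are illustrative choices, while $\mathbf{X}^{L}$ follows the notation used later for the local UniBlock output.

```latex
\begin{aligned}
\mathbf{X}^{T} &= \mathbf{X}^{in} + {\rm LT\_MHRA}\left({\rm Norm}\left(\mathbf{X}^{in}\right)\right), \\
\mathbf{X}^{S} &= \mathbf{X}^{T} + {\rm GS\_MHRA}\left({\rm Norm}\left(\mathbf{X}^{T}\right)\right), \\
\mathbf{X}^{L} &= \mathbf{X}^{S} + {\rm FFN}\left({\rm Norm}\left(\mathbf{X}^{S}\right)\right).
\end{aligned}
```

In this sketch, ${\rm Norm}$ stands for BN in the temporal step and LN in the spatial and FFN steps, matching the normalization choices stated in the text.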

${\rm LT\_MHRA}$ and ${\rm GS\_MHRA}$ refer to MHRA with local temporal affinity and global spatial affinity respectively. ${\rm FFN}$ consists of two linear projections separated by GeLU (Hendrycks & Gimpel, 2016). Additionally, following the normalization in UniFormer (Li et al., 2022a), we adopt Batch Norm (BN) (Ioffe & Szegedy, 2015) before local MHRA, and Layer Norm (LN) (Ba et al., 2016) before global MHRA and FFN. Note that ${\rm GS\_MHRA}$ and ${\rm FFN}$ come from the image-pretrained ViT block. In general, MHRA (Li et al., 2022a) learns token relations via multi-head fusion.
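In the style of UniFormer (Li et al., 2022a), the multi-head fusion and the local temporal affinity can be written as follows. This is a sketch rather than a verbatim formula: the symbols ${\rm R}_{n}$, ${\rm V}_{n}$, $\mathbf{U}$ and the head count $N$ are assumed for illustration, and the learnable offset-dependent affinity $a_{n}^{i-j}$ mirrors the UniFormer local affinity, restricted here to a temporal tube $\Omega_{i}^{t\times 1\times 1}$.

```latex
\begin{aligned}
{\rm R}_{n}(\mathbf{X}) &= {\rm A}_{n}\,{\rm V}_{n}(\mathbf{X}), \\
{\rm MHRA}(\mathbf{X}) &= {\rm Concat}\left({\rm R}_{1}(\mathbf{X});\,{\rm R}_{2}(\mathbf{X});\,\cdots;\,{\rm R}_{N}(\mathbf{X})\right)\mathbf{U}, \\
{\rm A}_{n}^{\rm LT}(\mathbf{X}_{i},\mathbf{X}_{j}) &= a_{n}^{i-j}, \quad j\in\Omega_{i}^{t\times 1\times 1}.
\end{aligned}
```

Each head $n$ aggregates the projected tokens ${\rm V}_{n}(\mathbf{X})$ with its affinity ${\rm A}_{n}$, and $\mathbf{U}$ fuses the concatenated heads. Because $a_{n}^{i-j}$ depends only on the relative temporal offset, the local temporal aggregator behaves like a depth-wise temporal convolution.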

In ${\rm LT\_MHRA}$, the affinity is restricted to a local temporal tube, which allows the block to efficiently learn the local temporal relation between one token $\mathbf{X}_{i}$ and other tokens $\mathbf{X}_{j}$ in the tube. Alternatively, ${\rm GS\_MHRA}$ belongs to the original ViT block. Therefore, the affinity in ${\rm GS\_MHRA}$ refers to a global spatial self-attention in the single frame $1\times H\times W$.
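Written out, such a single-frame self-attention affinity takes the usual softmax form. The sketch assumes per-head query and key projections ${\rm Q}_{n}$ and ${\rm K}_{n}$, names that are introduced here for illustration:

```latex
{\rm A}_{n}^{\rm GS}(\mathbf{X}_{i},\mathbf{X}_{j}) =
\frac{\exp\left({\rm Q}_{n}(\mathbf{X}_{i})^{\top}{\rm K}_{n}(\mathbf{X}_{j})\right)}
     {\sum_{j^{\prime}\in\Omega_{1\times H\times W}}\exp\left({\rm Q}_{n}(\mathbf{X}_{i})^{\top}{\rm K}_{n}(\mathbf{X}_{j^{\prime}})\right)}
```

The normalization runs over the $H\times W$ tokens of one frame only, so no temporal mixing happens in this step.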

Discussion. (I) Note that the spatiotemporal affinity in our local UniBlock is decomposed into a local temporal one ${\rm A}_{n}^{\rm LT}$ in Eq. (6) and a global spatial one ${\rm A}_{n}^{\rm GS}$ in Eq. (7). In this way, we can not only leverage the efficient video processing design of UniFormer but also inherit the effective image pretraining of ViT. In contrast, the local affinity in the original UniFormer (Li et al., 2022a) is jointly spatiotemporal, i.e., ${\rm A}_{n}^{local}(\mathbf{X}_{i},\mathbf{X}_{j})=a_{n}^{i-j}$ , where $j$ belongs to a 3D tube $\Omega_{i}^{t\times h\times w}$ . Its parameter matrix has to be learned from scratch, which inevitably increases the training cost. (II) Compared with UniFormer, we abandon its Dynamic Position Encoding (DPE) in the local UniBlock, since the position encoding in the ViT block already characterizes token locations. Table 9 also reveals that an extra DPE in the local UniBlock does not help. (III) Instead of applying global temporal modeling as in TimeSformer (Bertasius et al., 2021), we use local affinity for temporal characterization, largely reducing the computation burden by tackling temporal redundancy in the UniFormer style.

Global UniBlock

To explicitly conduct long-range dependency modeling on the spatiotemporal scale, we introduce a global UniBlock in our UniFormerV2. Specifically, this global UniBlock consists of three basic components: DPE, MHRA, and FFN.
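A sketch of how these three components could be composed is given here. It assumes a residual connection around the DPE and the FFN, a learnable query $\mathbf{q}$ fed to the cross MHRA, and intermediate symbols $\mathbf{X}^{C}$ and $\mathbf{X}^{ST}$ chosen for illustration; $\mathbf{X}^{L}$ and $\mathbf{X}^{G}$ follow the notation used for the local and global block outputs.

```latex
\begin{aligned}
\mathbf{X}^{C} &= {\rm DPE}\left(\mathbf{X}^{L}\right) + \mathbf{X}^{L}, \\
\mathbf{X}^{ST} &= {\rm C\_MHRA}\left({\rm Norm}\left(\mathbf{q}\right),\,{\rm Norm}\left(\mathbf{X}^{C}\right)\right), \\
\mathbf{X}^{G} &= {\rm FFN}\left({\rm Norm}\left(\mathbf{X}^{ST}\right)\right) + \mathbf{X}^{ST}.
\end{aligned}
```

Under this reading, the DPE first injects position information into the local tokens, the cross MHRA condenses them into a video token, and the FFN refines that token.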

The DPE is instantiated as depth-wise spatiotemporal convolution (Li et al., 2022a). We design the global ${\rm C\_MHRA}$ in a cross-attention style to efficiently construct a video representation.
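One way to write such a cross-attention relation aggregator is sketched here. The per-head projections ${\rm Q}_{n}$, ${\rm K}_{n}$, ${\rm V}_{n}$, the fusion matrix $\mathbf{U}$ and the head count $N$ are assumed names for illustration; the key point is that the query side is the learnable token $\mathbf{q}$ instead of the $L$ video tokens.

```latex
\begin{aligned}
{\rm R}_{n}^{\rm C}(\mathbf{q},\mathbf{X}) &= {\rm A}_{n}^{\rm C}(\mathbf{q},\mathbf{X})\,{\rm V}_{n}(\mathbf{X}), \\
{\rm A}_{n}^{\rm C}(\mathbf{q},\mathbf{X}_{j}) &=
\frac{\exp\left({\rm Q}_{n}(\mathbf{q})^{\top}{\rm K}_{n}(\mathbf{X}_{j})\right)}
     {\sum_{j^{\prime}\in\Omega_{T\times H\times W}}\exp\left({\rm Q}_{n}(\mathbf{q})^{\top}{\rm K}_{n}(\mathbf{X}_{j^{\prime}})\right)}, \\
{\rm C\_MHRA}(\mathbf{q},\mathbf{X}) &= {\rm Concat}\left({\rm R}_{1}^{\rm C}(\mathbf{q},\mathbf{X});\,\cdots;\,{\rm R}_{N}^{\rm C}(\mathbf{q},\mathbf{X})\right)\mathbf{U}.
\end{aligned}
```

Since there is a single query, each head computes one affinity per token, and the output is one video token regardless of the clip length.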

Discussion. We further discuss the distinct design of our global UniBlock, compared to the one in the original UniFormer (Li et al., 2022a). (I) We add the global UniBlock on top of the local UniBlock, extracting multi-scale spatiotemporal representations in token form. Such a design helps strengthen the discriminative video representation without compromising the pretrained architecture. (II) The typical global spatiotemporal attention is computationally heavy, due to its quadratic complexity. To pursue a better accuracy-computation balance, we introduce a cross-attention style of global MHRA in UniFormerV2, thus largely reducing the computation complexity from $O(L^{2})$ to $O(L)$ , where $L$ is the number of tokens. More importantly, since the query $\mathbf{q}$ is learnable, it can adaptively integrate the spatiotemporal context from all $L$ tokens to boost video recognition. (III) The global UniBlock inherits the DPE design from UniFormer, and Table 9 shows that it also helps.
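To make the $O(L^{2})$ versus $O(L)$ gap concrete, the following sketch counts how many pairwise affinities each attention style has to compute. The function names and the token grid (8 frames of $14\times 14$ patches, as produced by a patch size of 16 at resolution 224) are assumptions chosen for illustration, not values taken from the experiments.

```python
def self_attention_affinities(num_tokens: int) -> int:
    """Pairwise affinities when every token attends to every token: L * L."""
    return num_tokens * num_tokens


def cross_attention_affinities(num_tokens: int, num_queries: int = 1) -> int:
    """Affinities when a few learnable queries attend to all tokens: Q * L."""
    return num_queries * num_tokens


# Hypothetical token grid: 8 frames, each split into 14 x 14 patches.
num_tokens = 8 * 14 * 14

full = self_attention_affinities(num_tokens)
cross = cross_attention_affinities(num_tokens)
print(f"L = {num_tokens}, self-attention = {full}, cross-attention = {cross}")
```

With a single query, the saving factor equals $L$ itself, so it grows with the number of input frames.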

Multi-Stage Fusion Block

We propose a multi-stage fusion block to integrate all video tokens from each global UniBlock as in Fig. 3. For simplicity, we denote the $i$ -th global block as $\mathbf{X}_{i}^{G}={\rm G}_{i}(\mathbf{q}_{i},\mathbf{X}_{i}^{L})$ . Given the tokens $\mathbf{X}_{i}^{L}$ from the local UniBlock, the global block transforms the learnable query $\mathbf{q}_{i}$ into a video token $\mathbf{X}_{i}^{G}$ . In this paper, we explore four fusion strategies to integrate the video tokens from all the global blocks $\{\mathbf{X}_{i}^{G}\}_{i=1}^{N}$ into a final video representation $\mathbf{F}$ , and adopt the sequential strategy for its balance of efficacy and efficiency.

Finally, we dynamically integrate the final tokens from both local and global blocks, which effectively promotes recognition performance in empirical studies (Table 15). Specifically, we extract the class token $\mathbf{F}^{C}$ from the final local UniBlock, and combine it with the video token $\mathbf{F}$ by a weighted sum, i.e., $\mathbf{Z}=\alpha\mathbf{F}+(1-\alpha)\mathbf{F}^{C}$ , where $\alpha$ is a learnable parameter processed by the Sigmoid function.
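The weighted sum can be illustrated with a minimal sketch in plain Python. It assumes a scalar raw parameter (here called `raw_alpha`, a hypothetical name) that is squashed by a sigmoid, and represents tokens as lists of floats; an actual model would use tensors and could learn one weight per channel.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_tokens(video_token, class_token, raw_alpha):
    """Compute Z = alpha * F + (1 - alpha) * F_C with alpha = sigmoid(raw_alpha).

    video_token (F) and class_token (F_C) are equal-length lists of floats.
    raw_alpha is the unconstrained learnable parameter (assumed scalar here).
    """
    if len(video_token) != len(class_token):
        raise ValueError("tokens must have the same dimension")
    alpha = sigmoid(raw_alpha)
    return [alpha * f + (1.0 - alpha) * c for f, c in zip(video_token, class_token)]


# With raw_alpha = 0 the sigmoid gives 0.5, so both tokens contribute equally.
print(fuse_tokens([2.0, 4.0], [0.0, 0.0], raw_alpha=0.0))
```

Passing the parameter through a sigmoid keeps $\alpha$ in $(0, 1)$, so the result is always a convex combination of the two tokens.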

Experiments

Datasets. To verify the learning capacity of our UniFormerV2, we conduct experiments on 8 popular video benchmarks, including trimmed videos shorter than 10 seconds and untrimmed videos longer than 1 minute. For the trimmed video benchmarks, we divide them into two categories. (a) Scene-related datasets: Kinetics family (Kay et al., 2017) (i.e., Kinetics-400, 600 and 700) and Moments in Time V1 (Monfort et al., 2020). (b) Temporal-related datasets: Something-Something V1/V2 (Goyal et al., 2017b). For untrimmed video recognition, we choose ActivityNet (Heilbron et al., 2015) and HACS (Zhao et al., 2019). More dataset details can be found in Appendix A.

Kinetics-710 for Post-Pretraining. We propose a unified video benchmark for post-pretraining UniFormerV2. Unlike Yan et al. (2022), who exploit a web-scale video dataset (i.e., 60M video-text pairs), we build a much smaller video benchmark based on Kinetics-400/600/700. Concretely, we merge the training sets of these Kinetics datasets, and then delete the repeated videos according to YouTube IDs. Note that we also remove testing videos from different Kinetics datasets that leak into our combined training set, for correctness. As a result, the total number of training videos is reduced from 1.14M to 0.66M. Additionally, we merge the action categories in these three Kinetics datasets, which leads to 710 classes in total. Hence, we call this video benchmark Kinetics-710. More detailed descriptions can be found in Appendix F. In our experiments, we empirically show the effectiveness of our Kinetics-710. For post-pretraining, we simply use 8 input frames and adopt the same hyperparameters as training on the individual Kinetics datasets. After that, no matter how many frames are input (16, 32, or even 64), we only need 5-epoch finetuning for more than 1% top-1 accuracy improvement on Kinetics-400/600/700, as shown in Table 9.
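The merging procedure described above can be sketched in a few lines. The data layout is an assumption made for illustration: each training set is a list of `(youtube_id, label)` pairs, each held-out set is a collection of YouTube IDs, and when the same ID appears in several training sets the first label seen is kept. The function name and the toy IDs are hypothetical.

```python
def build_merged_training_set(train_sets, held_out_sets):
    """Merge several training sets, deduplicate by YouTube ID, drop leaked videos.

    train_sets: dict mapping dataset name -> list of (youtube_id, label) pairs.
    held_out_sets: dict mapping dataset name -> iterable of YouTube IDs that
        are used for evaluation in any of the datasets.
    Returns a dict mapping youtube_id -> label for the merged training set.
    """
    # Any video evaluated in one dataset must not be trained on in another.
    leaked = set()
    for ids in held_out_sets.values():
        leaked.update(ids)

    merged = {}
    for videos in train_sets.values():
        for youtube_id, label in videos:
            if youtube_id in leaked:
                continue  # leaked evaluation video
            merged.setdefault(youtube_id, label)  # keeps the first occurrence
    return merged


# Toy example with made-up IDs.
toy_train = {
    "k400": [("a", "dancing"), ("b", "running")],
    "k600": [("b", "running"), ("c", "cooking")],
    "k700": [("c", "cooking"), ("d", "swimming")],
}
toy_held_out = {"k400": ["d"], "k600": [], "k700": ["x"]}
merged = build_merged_training_set(toy_train, toy_held_out)
print(sorted(merged), len({label for label in merged.values()}))
```

In the toy example, six training entries shrink to three unique videos: two duplicates are merged and one video is dropped because it is held out in another dataset.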

Implementation Details. Unless stated otherwise, we follow most of the training recipes in UniFormer (Li et al., 2022a), and the detailed training hyperparameters can be found in Appendix A. We build UniFormerV2 based on ViTs pretrained with various supervisions (see Table 8), showing the generality of our design. For the best result, we adopt CLIP-ViT (Radford et al., 2021) as the backbone by default, due to its robust representation pretrained by vision-language contrastive learning. For most datasets, we insert the global UniBlocks in the last 4 layers of ViT-B/L to perform the multi-stage fusion. For Sth-Sth V1/V2, however, we insert the global UniBlocks in the last 8/16 layers of ViT-B/L for better temporal modeling. The corresponding ablation studies are shown in Table 9. Finally, we adopt sparse sampling (Wang et al., 2016) with a resolution of 224 for all the datasets.
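As a reminder of what sparse sampling does, the sketch below splits a video into equal segments and picks one frame per segment. It is a simplified, deterministic variant that takes the center frame of each segment; during training, a random offset within each segment is commonly used instead. The function name is hypothetical.

```python
def sparse_sample_indices(num_frames: int, num_segments: int):
    """Return one frame index per segment, taken at the segment center.

    The video is divided into num_segments equal-length segments, so the
    sampled frames cover the whole video no matter how long it is.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    segment_length = num_frames / num_segments
    return [
        min(num_frames - 1, int(segment_length * (i + 0.5)))
        for i in range(num_segments)
    ]


# An 80-frame video sampled with 8 segments of 10 frames each.
print(sparse_sample_indices(80, 8))
```

Because the number of sampled frames is fixed by the number of segments, the same model input size covers both short trimmed clips and long untrimmed videos.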

Kinetics. Table 1 presents the state-of-the-art comparison on Kinetics-400. (1) The first part lists the models pretrained on open-source datasets like ImageNet (Deng et al., 2022). On one hand, compared with UniFormerV1-B (Li et al., 2022a), our UniFormerV2-B uses only 50% of the fine-tuning epochs but achieves better accuracy, showing the importance of inheriting the pretrained weights. On the other hand, compared with TimeSformer-L (Bertasius et al., 2021), our model achieves a 2.7% performance gain with 50% of the FLOPs, showing the importance of adopting the UniFormer designs. Besides, compared with Swin-L (Liu et al., 2022), our UniFormerV2-L, based on BeiT (Bao et al., 2021) pretrained on ImageNet-22K, achieves comparable results with only 12% of the FLOPs. (2) The second part shows the methods using web-scale data. On one hand, compared with MTV-H (ensembling 4 models) (Yan et al., 2022), our single model only requires 1% of the video post-pretraining, 16% of the finetuning epochs and 35% of the model parameters to achieve competitive accuracy. On the other hand, under the same CLIP-400M pretraining, our UniFormerV2-L (frozen) uses only 25% of the FLOPs to achieve competitive accuracy compared with EVL-L (frozen) (Lin et al., 2022), and obtains a 1.1% accuracy improvement with similar FLOPs. Finally, our UniFormerV2 is the first model to achieve 90.0% top-1 accuracy on K400, to the best of our knowledge. For Kinetics-600 and 700, our UniFormerV2 also obtains state-of-the-art performance (90.1% and 82.7%, see Table 2).

Moments in Time. Due to complex inter-class and intra-class variation, MiT is more challenging than Kinetics. As shown in Table 3, our model beats most of the recent methods. For example, compared with ViViT-L (Arnab et al., 2021), UniFormerV2-B obtains a 4.2% performance gain with only 19% of the model parameters and 15% of the FLOPs. Compared with MTV-H (Yan et al., 2022), UniFormerV2-L uses only 35% of the model parameters and 25% of the FLOPs to achieve a 1.2% top-5 accuracy improvement.

Something-Something. In Table 4, we show the results on Sth-Sth V2. First, our model outperforms standard models built on well-pretrained image ViTs. For example, under the same CLIP-400M pretraining and the same number of sampled frames, our UniFormerV2-B obtains 4% higher accuracy with only 11% of the FLOPs, compared with EVL-L (Lin et al., 2022). Second, we compare our model with models whose backbones are specially designed. Since pretrained weights are unavailable for these models, they have to go through a tedious training phase, consisting of image pretraining, video pretraining and video finetuning. Alternatively, our UniFormerV2 can work well with only video finetuning, e.g., our model uses only 22 epochs to achieve the performance of UniFormerV1 (Li et al., 2022a), which requires 110+50=160 video epochs to obtain its results. Finally, we compare UniFormerV2 with models that do not apply image pretraining. Such models require a huge number of training epochs, e.g., VideoMAE-B (Tong et al., 2022) needs 2400 video pretraining epochs and 40 video finetuning epochs, much longer than our UniFormerV2-B with a similar accuracy (only 22 video finetuning epochs, i.e., 0.9% of the training epochs of VideoMAE-B). For Sth-Sth V1 in Table 8, we reach new state-of-the-art performance (62.7%). The above results reveal the effectiveness and efficiency of our UniFormerV2 for temporal modeling.

ActivityNet and HACS. For untrimmed videos, it is essential to capture long-range temporal information, since the action may occur multiple times at arbitrary moments. As shown in Table 8 and 8, our UniFormerV2 significantly outperforms the previous best results on the large-scale untrimmed benchmarks ActivityNet and HACS by 4.5% and 3.6%, respectively. These results demonstrate the strong long-term modeling capacity of our UniFormerV2.

Ablation Studies

To evaluate the effectiveness of UniFormerV2, we investigate each key structure design, as shown in Table 8 and Table 9. All the models are directly finetuned from CLIP-ViT-B/16 by default. We utilize ‘8 $\times$ 4 $\times$ 3’ and ‘16 $\times$ 1 $\times$ 3’ testing strategies for Kinetics and Something-Something respectively.

Pretraining Sources. To demonstrate the generality of our UniFormerV2 design, we apply it to ViTs with different pretraining methods, including supervised learning (Dosovitskiy et al., 2021; Touvron et al., 2022), contrastive learning (Caron et al., 2021; Radford et al., 2021) and mask image modeling (He et al., 2022; Bao et al., 2021). Table 8 shows that all the models beat TimeSformer (Bertasius et al., 2021), especially on Something-Something, which relies on strong temporal modeling. It also shows that a better-pretrained ViT leads to stronger video performance.

Different Components. In Table 9, note that the global UniBlock is crucial for the scene-related benchmark (e.g., K400), since this block can effectively provide a holistic video representation for classification. In contrast, the local UniBlock is critical for the temporal-related benchmark (e.g., SSV2), since this block can effectively describe detailed video representation for classification. Besides, using temporal downsampling with double the input frames (similar FLOPs) is also helpful for distinguishing fine-grained videos like SSV2, due to the larger temporal receptive field.

Local UniBlock. To explore the structure of the local UniBlock, we conduct experiments in Table 9. It reveals that convolution is better than self-attention for temporal modeling, and our local MHRA is more powerful than both of them on SSV2. Following ST-Adapter (Pan et al., 2022), we add another local MHRA after the spatial MHRA for better performance. Besides, we add local MHRA in all the layers and reduce the channel by 1.5 times for the best accuracy-FLOPs trade-off.

Global UniBlock and Multi-stage Fusion. In Table 9, we find that the features in the deep layers are critical for capturing long-term dependency, while the DPE and the middle information are necessary for identifying the motion difference. For the fusion strategy, Table 9 shows that the simplest sequential fusion is adequate for integrating multi-stage features.

Training Recipes. We compare different training and finetuning methods in Table 9. Note that when co-training with K400, K600 and K700, we remove the leaked videos in the validation set and introduce three classification heads. K710 maintains only about 60% of the total training videos (0.66M vs. 1.14M for K400+K600+K700), but it improves classification performance significantly for Kinetics. Meanwhile, it saves about 33% of the training cost (see Appendix A). Besides, direct training on it works better than Kinetics co-training, especially for K600 (+1.3% vs. +1.0%) and K700 (+0.5% vs. -0.2%). Though co-finetuning shares the backbone and saves parameters, we adopt individual finetuning for each dataset for the best performance.

Conclusion

In this paper, we propose a powerful video model, namely UniFormerV2. It arms image-pretrained ViTs with efficient UniFormer designs for video learning. With novel local and global video relation aggregators, it is capable of conducting effective spatiotemporal modeling with a tractable complexity. Besides seamlessly integrating advantages from both ViTs and UniFormer, we also introduce multi-scale token fusion for further enhancing video representation. Our UniFormerV2 achieves state-of-the-art performance on 8 popular video benchmarks, and is the first to reach 90% top-1 accuracy on Kinetics-400, to the best of our knowledge.

Reproducibility. To ensure all the results can be reproduced, we give the details of the datasets, model and training hyperparameters in our experiments (see Table 10 and Table 11). For Kinetics-710, we provide its label list in Table 20 for reproduction. All the code is based on the UniFormer (Li et al., 2022b) repository.

References

Appendix A Additional Implementation Details

Datasets. In Table 10, we give more details of our datasets. The Kinetics family (Kay et al., 2017) is the most widely-used benchmark and includes Kinetics-400, 600 and 700. Since some videos are unavailable on YouTube, the Kinetics datasets are gradually shrinking over time. We report the video number of our version for a fairer comparison. Moments in Time V1 (Monfort et al., 2020) contains 0.8M 3-second video clips annotated with 339 classes, and is designed around capturing the gist of a dynamic scene. Something-Something V1/V2 (Goyal et al., 2017b) consist of 174 actions involving interactions with everyday objects. They require strong temporal modeling to distinguish confusing actions such as opening/closing something. ActivityNet (Heilbron et al., 2015) and HACS (Zhao et al., 2019) are two large-scale untrimmed video benchmarks. They respectively contain about 20K and 50K videos of 200 human daily living actions. For these two datasets, we sample those video clips containing actions for training, thus we do not add another background class. For testing, we sample the frames sparsely from the whole untrimmed videos.

Implementation Details. For the scene-related datasets, we only insert the global UniBlocks in the last 4 layers of ViT-B/L to perform multi-stage fusion, since the local UniBlocks and temporal downsampling do not further improve the results in Table 9. For Something-Something V1/V2, however, we adopt all the designs and insert the global UniBlocks in the last 8/16 layers of ViT-B/L for better temporal modeling. Besides, when finetuning those models with large-scale dataset pretraining, it is necessary to initialize the new parameters properly. For stable training, we zero-initialize some of the layers, including the last point-wise convolutions in the local temporal MHRA, the query tokens and output projection layers in the query-based cross MHRA, the last linear layers in the FFN of the global UniBlock, and the learnable fusion weights. In addition, we provide the detailed hyperparameters in Table 11. Most of the training scripts follow UniFormer (Li et al., 2022a), except that we do not apply Mixup (Zhang et al., 2018), CutMix (Yun et al., 2019), Label Smoothing (Szegedy et al., 2016) or Random Erasing (Zhong et al., 2020). When finetuning the full models on Kinetics directly from image pretraining, we adopt the same hyperparameters as in K710 pretraining. If the backbone is frozen, we use a larger learning rate (4e-4) without warmup.

Training Cost. In Table 9, we compare different training scripts. When finetuning Kinetics-400, 600 and 700 individually, we train the models for 55 epochs, and the total training data is about $0.24+0.366+0.529\approx 1.14$ M. When pretraining with Kinetics-710 (0.66M), we only finetune the models for 5 epochs. Counting cost as videos times epochs, and assuming the Kinetics-710 pretraining also runs for 55 epochs, the percentage of saved cost is $1-\frac{0.66\times 55+1.14\times 5}{1.14\times 55}\approx 33\%$ .

Thus we save almost 33% of the training cost. More importantly, for the models with more frames (16, 32, or even 64), we only need to finetune from the K710 models pretrained with 8 frames. Our training scripts are efficient yet effective for the Kinetics family.
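The 33% figure can be checked with a short calculation. The sketch measures cost as training videos (in millions) times epochs, and assumes that Kinetics-710 pretraining runs for the same 55 epochs as individual training, followed by 5 finetuning epochs on each of the three Kinetics datasets. Both the cost measure and the function names are assumptions for illustration.

```python
def training_cost(videos_millions: float, epochs: int) -> float:
    """Cost proxy: number of training videos (in millions) times epochs."""
    return videos_millions * epochs


def k710_saving() -> float:
    """Fraction of cost saved by K710 pretraining plus short finetuning."""
    # Baseline: train K400, K600 and K700 individually for 55 epochs each.
    baseline = training_cost(0.24 + 0.366 + 0.529, 55)
    # Proposed: 55 epochs on K710 (0.66M videos, assumed epoch count),
    # then 5 finetuning epochs on each of the three datasets.
    proposed = training_cost(0.66, 55) + training_cost(0.24 + 0.366 + 0.529, 5)
    return 1.0 - proposed / baseline


print(f"saved fraction of training cost: {k710_saving():.1%}")
```

Under these assumptions the saving comes out at roughly one third, in line with the number quoted in the text.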

Appendix B Visualizations

In Figure 4, we compare UniFormerV2 with a typical ViT-based model, i.e., TimeSformer (Bertasius et al., 2021), and with UniFormerV1 (Li et al., 2022a) through visualization. Since UniFormerV1 is a multi-scale architecture, we show its features at the bottom of its 4 stages. TimeSformer and UniFormerV2 are based on ViTs with a fixed resolution, thus we show their features every 3 layers. We use CAM (Zhou et al., 2016) to show the most discriminative features that the network locates. The red parts indicate where the models focus more, while the blue parts are ignored.

It reveals that both UniFormerV1 and UniFormerV2 are good at capturing local details, but UniFormerV1 may lose information in deeper layers due to the shrinking resolution, thus it fails to activate the discriminative parts. In contrast, TimeSformer only learns local features in the shallow layers, thus struggling to focus on meaningful areas. As for UniFormerV2, it surprisingly maintains local details even in the deep layers. More importantly, it can observe the whole video and learn to concentrate more on the woman’s leg, which helps recognize the action. These results demonstrate that our UniFormerV2 is effective at capturing both local details and long-term dependency.

Appendix C More ablation studies

We conduct more ablation studies based on CLIP-ViT-B/16 (Radford et al., 2021).

Output token combination. When only using global token for classification, the top-1 accuracy drops from 84.4% to 81.8% in Table 15. It shows that both local and global output tokens are essential for maintaining performance.

Kinetics pretraining for Something-Something. Different from the prior works (Li et al., 2022a; Fan et al., 2021), in Table 15, we find that extra Kinetics pretraining harms the representation inherited from CLIP, leading to lower performance.

Query number. In Table 15, we try to increase the query number. However, more queries lead to severe overfitting, thus the performance drops.

Different modules. In Table 15, we compare our local MHRA with popular temporal modules, including simple mean pooling (Wang et al., 2016), divided and joint space-time MHSA (Bertasius et al., 2021), temporal convolution (Tran et al., 2018), temporal shift (Lin et al., 2019) and temporal transformer (Sharir et al., 2021). All the modules are inserted before all the spatial MHSA, except that the 6-layer temporal transformer is added after the backbone. The results show that our local MHRA beats the previous methods, achieving 2.0% to 22.6% higher top-1 accuracy. This demonstrates the effectiveness of our local MHRA for temporal modeling.

Appendix D Additional results

In Table 16, Table 17, Table 19 and Table 19, we give more results on the 8 video benchmarks, i.e., Kinetics-400/600/700, Moments in Time, Something-Something V1/V2, ActivityNet and HACS.

Appendix E More discussions

Local UniBlock vs. ST-Adapter (Pan et al., 2022). Our local UniBlock is motivated by the style of UniFormer (Li et al., 2022a), i.e., we treat temporal depth-wise convolution as a local temporal relation aggregator. Hence, like UniFormer, we introduce an extra BatchNorm (Ioffe & Szegedy, 2015) before the first linear projection ${\rm V}(\cdot)$ . In contrast, ST-Adapter does not have this design, since it simply treats temporal depth-wise convolution as adaptation. With such motivation, it further introduces an extra activation function to enhance the adaptation, while our local UniBlock does not need it. In fact, we have also made comparisons in Table 9. It shows that our local MHRA beats ST-Adapter (69.1% vs. 68.0%).

Global UniBlock vs. Perceiver (Jaegle et al., 2021), DETR (Carion et al., 2020) and Flamingo (Alayrac et al., 2022). Our global UniBlock is also motivated by the style of UniFormer (Li et al., 2022a). Differently, to decrease the global computation in UniFormer, we replace the self-attention MHRA with a cross-attention MHRA in our UniFormerV2. Hence, our global UniBlock consists of Dynamic Position Encoding (DPE), cross MHRA and FFN. In contrast, none of those works uses this combination of operations, as they are not built on the UniFormer insight for video learning. In fact, these methods often use the standard cross-style transformer block including self MHRA, cross MHRA and FFN.

Limitations. In UniFormerV2, we propose effective designs to arm pretrained ViTs as spatiotemporal learners. Although its training is more efficient than that of non-trivial video backbones, its performance tends to depend on the scale of the pretraining data, as shown in Table 8. Hence, it would be interesting to explore our UniFormerV2 on huge image foundation models pretrained on massive datasets, to further evaluate its scalability and generalization capacity.

Appendix F Label list of Kinetics-710

To generate our Kinetics-710, we align labels in different Kinetics datasets by filtering symbols and replacing synonyms. The final label list is shown in Table 20. Compared with Kinetics-700, there are 8 and 2 unique labels in Kinetics-400 and Kinetics-600 respectively. When finetuning the models pretrained on Kinetics-710, it is vital to load the pretrained weights of the classification layer, thus we map the weights according to the label list.
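The weight mapping can be sketched as a row selection on the classification layer. The sketch assumes that the classifier weight is stored as one row per class, that the target labels have already been aligned to the Kinetics-710 naming (symbols filtered, synonyms replaced), and that a bias vector would be mapped with the same indices. The function name and the toy labels are hypothetical.

```python
def map_classifier_weights(k710_labels, target_labels, k710_weights):
    """Select the K710 classifier rows that correspond to a target label list.

    k710_labels: list of the 710 class names, in classifier row order.
    target_labels: class names of the downstream dataset (e.g. Kinetics-400),
        already normalized to the K710 naming.
    k710_weights: list of rows, one weight vector per K710 class.
    Returns the rows reordered to match target_labels.
    """
    row_of = {label: row for row, label in enumerate(k710_labels)}
    missing = [label for label in target_labels if label not in row_of]
    if missing:
        raise KeyError(f"labels not found in the K710 label list: {missing}")
    return [list(k710_weights[row_of[label]]) for label in target_labels]


# Toy example: 3 pretraining classes, 2 of which exist downstream.
toy_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(map_classifier_weights(["a", "b", "c"], ["c", "a"], toy_weights))
```

Initializing the downstream head this way keeps the class-specific knowledge learned during Kinetics-710 pretraining, which is plausibly why a 5-epoch finetuning schedule suffices.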