UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

Introduction

Learning spatiotemporal representations is a fundamental task for video understanding. Basically, there are two distinct challenges. On the one hand, videos contain large spatiotemporal redundancy, where target motions across local neighboring frames are subtle. On the other hand, videos contain complex spatiotemporal dependency, since target relations across long-range frames are dynamic.

Advances in video classification have mostly been driven by 3D convolutional neural networks (Tran et al., 2015; Carreira & Zisserman, 2017b; Feichtenhofer et al., 2019) and spatiotemporal transformers (Bertasius et al., 2021; Arnab et al., 2021). Unfortunately, each of these two frameworks focuses on only one of the aforementioned challenges. 3D convolution can capture detailed, local spatiotemporal features by processing each pixel with context from a small 3D neighborhood (e.g., 3 $\times$ 3 $\times$ 3). Hence, it can reduce spatiotemporal redundancy across adjacent frames. However, due to its limited receptive field, 3D convolution has difficulty learning long-range dependency (Wang et al., 2018; Li et al., 2020a). Alternatively, vision transformers are good at capturing global dependency with the help of self-attention among visual tokens (Dosovitskiy et al., 2021). Recently, this design has been introduced to video classification via spatiotemporal attention mechanisms (Bertasius et al., 2021). However, we observe that video transformers are often inefficient at encoding local spatiotemporal features in the shallow layers. We take the well-known TimeSformer (Bertasius et al., 2021) for illustration. As shown in Figure 1, TimeSformer indeed learns detailed video representations in the early layers, but with very redundant spatial and temporal attention. Specifically, spatial attention mainly focuses on neighboring tokens (mostly in 3 $\times$ 3 local regions), while learning nothing from the remaining tokens in the same frame. Similarly, temporal attention mostly aggregates tokens in adjacent frames, while ignoring those in distant frames. More importantly, such local representations are learned through global token-to-token similarity comparison in all layers, which requires a large computation cost. This clearly degrades the computation-accuracy balance of such video transformers (Figure 2).

To tackle these difficulties, we propose to unify 3D convolution and spatiotemporal self-attention in a concise transformer format. We name the network Unified transFormer (UniFormer); it achieves a preferable balance between efficiency and effectiveness. More specifically, our UniFormer consists of three core modules, i.e., Dynamic Position Embedding (DPE), Multi-Head Relation Aggregator (MHRA), and Feed-Forward Network (FFN). The key difference between our UniFormer and traditional video transformers is the distinct design of our relation aggregator. First, instead of utilizing a self-attention mechanism in all layers, our proposed relation aggregator tackles video redundancy and dependency separately. In the shallow layers, our aggregator learns local relation with a small learnable parameter matrix, which largely reduces the computation burden by aggregating context from adjacent tokens in a small 3D neighborhood. In the deep layers, our aggregator learns global relation with similarity comparison, which can flexibly build long-range token dependencies from distant frames in the video. Second, unlike the separation of spatial and temporal attention in traditional transformers (Bertasius et al., 2021; Arnab et al., 2021), our relation aggregator jointly encodes spatiotemporal context in all layers, which can further boost video representations in a joint learning manner. Finally, we build our model by progressively integrating UniFormer blocks in a hierarchical manner. In this way, we strengthen the cooperation between local and global UniFormer blocks for efficient spatiotemporal representation learning in videos. We conduct extensive experiments on popular video benchmarks, e.g., Kinetics-400 (Carreira & Zisserman, 2017a), Kinetics-600 (Carreira et al., 2018) and Something-Something V1&V2 (Goyal et al., 2017b). With only ImageNet-1K pre-training, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring $\mathbf{10}\times$ fewer GFLOPs than other comparable methods (e.g., $\mathbf{16.7}\times$ fewer GFLOPs than ViViT (Arnab et al., 2021) with JFT-300M pre-training). For Something-Something V1 and V2, our UniFormer achieves 60.9% and 71.2% top-1 accuracy respectively, which are new state-of-the-art performances.

Related Work

Convolution-based Video Networks. 3D Convolutional Neural Networks (CNNs) have been dominant in video understanding (Tran et al., 2015; Feichtenhofer et al., 2019). However, they are difficult to optimize and incur a large computation cost. To ease optimization, I3D (Carreira & Zisserman, 2017b) inflates pre-trained 2D convolution kernels. Other prior works (Tran et al., 2018; Qiu et al., 2017; Tran et al., 2019; Feichtenhofer, 2020; Wang et al., 2020a) factorize the 3D convolution kernel along different dimensions to reduce complexity. Recent methods propose well-designed modules to enhance the temporal modeling ability of 2D CNNs (Wang et al., 2016; Lin et al., 2019; Luo & Yuille, 2019; Jiang et al., 2019; Liu et al., 2020a; Li et al., 2020b; Kwon et al., 2020; Li et al., 2020a; 2021a; Wang et al., 2020b). However, 3D convolution struggles to capture long-range dependency due to its limited receptive field.

Transformer-based Video Networks. Vision Transformers (Dosovitskiy et al., 2021; Touvron et al., 2021a; b; Liu et al., 2021a) have become popular for vision tasks and outperform many CNNs. Based on ViT, several prior works (Bertasius et al., 2021; Neimark et al., 2021; Sharir et al., 2021; Li et al., 2021b; Arnab et al., 2021; Bulat et al., 2021; Patrick et al., 2021; Zha et al., 2021) propose different variants for spatiotemporal learning, verifying the outstanding ability of the transformer to capture long-term dependencies. To reduce the high cost of dot-product computation, MViT (Fan et al., 2021) introduces a hierarchical structure and pooling self-attention, while Video Swin (Liu et al., 2021b) advocates an inductive bias of locality for video. Nevertheless, the self-attention mechanism is inefficient at encoding low-level features, which limits the potential of these models. To tackle this challenge, unlike Video Swin, which applies self-attention in a local 3D window, we adopt 3D convolution in a concise transformer format to encode local features. Besides, we follow their hierarchical designs and propose our UniFormer, achieving powerful performance for video understanding.

Method

In this section, we describe our UniFormer in detail. First, we introduce the overall architecture of the UniFormer block. Then, we explain the vital designs of our UniFormer for spatiotemporal modeling, i.e., multi-head relation aggregator and dynamic position embedding. Finally, we hierarchically stack UniFormer blocks to build up our video network.

To overcome problems of spatiotemporal redundancy and dependency, we propose a novel and concise Unified transFormer (UniFormer) shown in Figure 3. We utilize a basic transformer format (Vaswani et al., 2017) but specially design it for efficient and effective spatiotemporal representation learning. Specifically, our UniFormer block consists of three key modules: Dynamic Position Embedding (DPE), Multi-Head Relation Aggregator (MHRA), and Feed-Forward Network (FFN):
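The displayed equations for the block (Eq. 1 to 3) are missing from this text. A form consistent with the description above, and with the pre-normalization layout described in Appendix C, would be the following sketch, where $\mathbf{X}_{in}$ denotes the input token tensor and $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ denote the outputs of the three modules (these symbol names are assumptions):

```latex
\begin{aligned}
\mathbf{X} &= \mathrm{DPE}(\mathbf{X}_{in}) + \mathbf{X}_{in}, \\
\mathbf{Y} &= \mathrm{MHRA}(\mathrm{Norm}(\mathbf{X})) + \mathbf{X}, \\
\mathbf{Z} &= \mathrm{FFN}(\mathrm{Norm}(\mathbf{Y})) + \mathbf{Y}.
\end{aligned}
```

Each module is wrapped in a residual connection, so the block keeps the familiar transformer layout while the content of MHRA changes between shallow and deep layers.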

3.2 Multi-Head Relation Aggregator

As discussed above, efficient and effective spatiotemporal representation learning must address both large local redundancy and complex global dependency. Unfortunately, the popular 3D CNNs and spatiotemporal transformers each focus on only one of these two challenges. For this reason, we design an alternative Relation Aggregator (RA), which can flexibly unify 3D convolution and spatiotemporal self-attention in a concise transformer format, addressing video redundancy in the shallow layers and video dependency in the deep layers. Specifically, our MHRA conducts token relation learning via multi-head fusion:
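The multi-head fusion referred to above is not displayed in this text. Using the symbols that appear later in this section ($\mathrm{A}_{n}$, $\mathrm{V}_{n}$, $\mathrm{R}_{n}$, $\mathbf{U}$), it can be sketched as follows, where $N$ is the number of heads (a reconstruction, not the original display):

```latex
\begin{aligned}
\mathrm{R}_{n}(\mathbf{X}) &= \mathrm{A}_{n}\,\mathrm{V}_{n}(\mathbf{X}), \\
\mathrm{MHRA}(\mathbf{X}) &= \mathrm{Concat}\big(\mathrm{R}_{1}(\mathbf{X});\ \mathrm{R}_{2}(\mathbf{X});\ \cdots;\ \mathrm{R}_{N}(\mathbf{X})\big)\,\mathbf{U}.
\end{aligned}
```

Under the assumed shapes $\mathbf{X}\in\mathbb{R}^{L\times C}$ with $L=T\times H\times W$ tokens, each head uses a token affinity $\mathrm{A}_{n}\in\mathbb{R}^{L\times L}$ and a linear map $\mathrm{V}_{n}(\mathbf{X})\in\mathbb{R}^{L\times C/N}$, and $\mathbf{U}\in\mathbb{R}^{C\times C}$ fuses the concatenated heads. Only the definition of $\mathrm{A}_{n}$ differs between the local and global variants.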

Local MHRA. In the shallow layers, we aim at learning detailed video representation from the local spatiotemporal context in small 3D neighborhoods. This coincidentally shares a similar insight with the design of a 3D convolution filter. As a result, we design the token affinity as a learnable parameter matrix operated in the local 3D neighborhood, i.e., given one anchor token $\mathbf{X}_{i}$ , RA learns local spatiotemporal affinity between this token and other tokens in the small tube $\Omega_{i}^{t\times h\times w}$ :
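The display for this local affinity (Eq. 6) is missing here. Consistent with Appendix A, it can be written as below, where $a_{n}\in\mathbb{R}^{t\times h\times w}$ is the learnable parameter and $i-j$ denotes the relative 3D offset between the two tokens. Writing the zero case outside the tube explicitly is an assumption about notation:

```latex
\mathrm{A}_{n}^{local}(\mathbf{X}_{i},\mathbf{X}_{j}) =
\begin{cases}
a_{n}^{\,i-j}, & j \in \Omega_{i}^{t\times h\times w}, \\
0, & \text{otherwise}.
\end{cases}
```

Because the value depends only on the relative offset, the same weights are shared by every anchor token, which is what makes the operation convolution-like.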

Comparison to 3D Convolution Block. Interestingly, we find that our local MHRA can be interpreted as a spatiotemporal extension of the MobileNet block (Sandler et al., 2018; Tran et al., 2019; Feichtenhofer, 2020). Specifically, the linear transformation ${\rm V}(\cdot)$ can be instantiated as a pointwise convolution (PWConv). Furthermore, the local token affinity ${\rm A}_{n}^{local}$ is a spatiotemporal matrix that operates on each output channel (or head) ${\rm V}_{n}(\mathbf{X})$ , thus the relation aggregator ${\rm R}_{n}(\mathbf{X})={\rm A}_{n}^{local}{\rm V}_{n}(\mathbf{X})$ can be explained as a depthwise convolution (DWConv). Finally, all heads are concatenated and fused by a linear matrix $\mathbf{U}$ , which can also be instantiated as a pointwise convolution (PWConv). As a result, this local MHRA can be reformulated in the PWConv-DWConv-PWConv manner of the MobileNet block. In our experiments, we instantiate our local MHRA as such a channel-separated spatiotemporal convolution, so that our UniFormer inherits its computation efficiency for light-weight video classification. Different from the MobileNet block, our UniFormer block follows a generic transformer format, so an extra FFN is inserted after MHRA, which can further mix token context at each spatiotemporal position to boost classification accuracy.
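As a rough illustration of why the channel-separated form is light-weight, the sketch below counts weights for a PWConv-DWConv-PWConv instantiation and for a dense 3D convolution with the same kernel. The function names, the choice of 64 channels, and the omission of biases and normalization layers are assumptions made only for this example.

```python
def local_mhra_params(channels, tube=(5, 5, 5)):
    """Weights in PWConv (V) + DWConv (local affinity) + PWConv (U); biases ignored."""
    t, h, w = tube
    pointwise_v = channels * channels      # V(.): 1x1x1 convolution
    depthwise = channels * t * h * w       # one t*h*w kernel per channel
    pointwise_u = channels * channels      # U: 1x1x1 convolution
    return pointwise_v + depthwise + pointwise_u


def dense_conv3d_params(channels, tube=(5, 5, 5)):
    """Weights in a standard 3D convolution mapping channels -> channels."""
    t, h, w = tube
    return channels * channels * t * h * w


separated = local_mhra_params(64)
dense = dense_conv3d_params(64)
print(separated, dense, round(dense / separated, 1))
```

The depthwise term grows with the tube volume but only linearly with the channel count, whereas the dense kernel multiplies the two.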

Global MHRA. In the deep layers, we focus on capturing long-term token dependency in the global video clip. This naturally shares a similar insight with the design of self-attention. Hence, we design the token affinity via comparing content similarity among all the tokens in the global view:

where $\mathbf{X}_{j}$ can be any token in the global 3D tube with size of $T\times H\times W$ , while $Q_{n}(\cdot)$ and $K_{n}(\cdot)$ are two different linear transformations. Most video transformers apply self-attention in all stages, introducing a large amount of calculation. To reduce the dot-product computation, the prior works tend to divide spatial and temporal attention (Bertasius et al., 2021; Arnab et al., 2021), but it deteriorates the spatiotemporal relation among tokens. In contrast, our MHRA performs local relation aggregation in the early layers, which largely saves the computation of token comparison. Hence, instead of factorizing spatiotemporal attention, we jointly encode spatiotemporal relation in our MHRA for all the stages, in order to achieve a preferable computation-accuracy balance.
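To make the cost argument concrete, the sketch below counts token-pair affinity entries per layer for three schemes: joint global attention, divided space-time attention, and local aggregation over a small tube. The token grid of 8 × 14 × 14 and the function names are hypothetical, and the counts ignore heads, channels and constant factors.

```python
def joint_global_pairs(T, H, W):
    """Every token is compared with every token in the clip."""
    L = T * H * W
    return L * L


def divided_pairs(T, H, W):
    """Each token attends over its own frame (H*W) and over time (T) separately."""
    L = T * H * W
    return L * (H * W + T)


def local_pairs(T, H, W, tube=(5, 5, 5)):
    """Each token aggregates at most t*h*w neighbors (fewer at the borders)."""
    t, h, w = tube
    return T * H * W * t * h * w


grid = (8, 14, 14)  # hypothetical token grid
for name, fn in [("joint", joint_global_pairs),
                 ("divided", divided_pairs),
                 ("local", local_pairs)]:
    print(name, fn(*grid))
```

Local aggregation grows linearly with the number of tokens, which is why using it in the shallow, high-resolution stages leaves room for joint attention in the deep stages, where the token count is smaller.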

Comparison to Transformer Block. In the deep layers, our UniFormer block is equipped with a global MHRA ${\rm A}_{n}^{global}$ (Eq. 7). It can be instantiated as spatiotemporal self-attention, where ${\rm Q}_{n}(\cdot)$ , ${\rm K}_{n}(\cdot)$ and ${\rm V}_{n}(\cdot)$ become the Query, Key and Value in the transformer (Dosovitskiy et al., 2021). Hence, it can effectively learn long-term dependency. Instead of the spatial and temporal factorization used in previous video transformers (Bertasius et al., 2021; Arnab et al., 2021), our global MHRA is based on joint spatiotemporal learning to generate more discriminative video representations. Moreover, we adopt dynamic position embedding (DPE, see Section 3.3) to overcome permutation-invariance; DPE maintains translation-invariance and is friendly to different input clip lengths.
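Eq. 7 is referenced above but its display is missing from this text. Given that ${\rm Q}_{n}$ and ${\rm K}_{n}$ act as Query and Key, the global affinity presumably takes the usual softmax-normalized form below. This is a reconstruction, with $\Omega_{T\times H\times W}$ denoting the set of all tokens in the clip:

```latex
\mathrm{A}_{n}^{global}(\mathbf{X}_{i},\mathbf{X}_{j}) =
\frac{\exp\!\big(\mathrm{Q}_{n}(\mathbf{X}_{i})^{\top}\mathrm{K}_{n}(\mathbf{X}_{j})\big)}
     {\sum_{j'\in\Omega_{T\times H\times W}} \exp\!\big(\mathrm{Q}_{n}(\mathbf{X}_{i})^{\top}\mathrm{K}_{n}(\mathbf{X}_{j'})\big)}
```

Unlike the local affinity, this value depends on the content of both tokens, so it has to be recomputed for every input clip.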

3.3 Dynamic Position Embedding

Since videos vary both spatially and temporally, it is necessary to encode spatiotemporal position information in token representations. Previous methods mainly adapt the absolute or relative position embeddings of image tasks to tackle this problem (Bertasius et al., 2021; Arnab et al., 2021). However, when testing with longer input clips, the absolute embedding must be interpolated to the target input size and fine-tuned. Besides, the relative version modifies the self-attention and performs worse due to the lack of absolute position information (Islam et al., 2020). To overcome the above problems, we extend the conditional position encoding (CPE) (Chu et al., 2021) to design our DPE:

where ${\rm DWConv}$ means a simple 3D depthwise convolution with zero padding. Thanks to the shared parameters and locality of convolution, DPE can overcome permutation-invariance and is friendly to arbitrary input lengths. Moreover, it has been shown in CPE that zero padding helps the tokens on the borders be aware of their absolute positions, so all tokens can progressively encode their absolute spatiotemporal position information by querying their neighbors.
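The DPE referred to above (Eq. 8, not displayed in this text) amounts to $\mathrm{DPE}(\mathbf{X}_{in})=\mathrm{DWConv}(\mathbf{X}_{in})$. The effect of zero padding can be seen in a toy one-dimensional, single-channel sketch: with identical token values everywhere, only the border outputs differ, so position information enters from the edges. The kernel values and the helper name are made up for illustration.

```python
def depthwise_conv1d_zero_pad(tokens, kernel):
    """Same-size 1D convolution (cross-correlation) with zero padding."""
    radius = len(kernel) // 2
    padded = [0.0] * radius + list(tokens) + [0.0] * radius
    return [
        sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
        for i in range(len(tokens))
    ]


tokens = [1.0] * 6                 # identical content at every position
kernel = [0.25, 0.5, 0.25]         # hypothetical learned weights
position = depthwise_conv1d_zero_pad(tokens, kernel)
print(position)                    # border entries differ from interior ones
```

The same kernel applies unchanged to a sequence of any length, which matches the claim that DPE is friendly to arbitrary input lengths.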

3.4 Model Architecture

Comparison to Convolution+Transformer Network. Prior works have demonstrated that self-attention can perform convolution (Ramachandran et al., 2019; Cordonnier et al., 2020), but they propose to replace convolution instead of combining the two. Recent works have attempted to introduce convolution into vision transformers (Wu et al., 2021; Dai et al., 2021; Gao et al., 2021; Srinivas et al., 2021), but they mainly focus on image recognition, without any spatiotemporal consideration for video understanding. Moreover, the combination in prior video transformers is rather simple, e.g., using a transformer as global attention (Wang et al., 2018) or using convolution as a patch stem (Liu et al., 2020b). In contrast, our UniFormer tackles both video redundancy and dependency within a unified framework (Table 1). Via local and global token affinity learning, we can achieve a preferable computation-accuracy balance for video classification.

Experiments

We conduct experiments on widely-used Kinetics-400 (Carreira & Zisserman, 2017a) and larger benchmark Kinetics-600 (Carreira et al., 2018). We further verify the transfer learning performance on temporal-related datasets Something-Something V1&V2 (Goyal et al., 2017b). For training, we utilize the dense sampling strategy (Wang et al., 2018) for Kinetics and uniform sampling strategy (Wang et al., 2016) for Something-Something. We adopt the same training recipe as MViT (Fan et al., 2021) by default, but the random horizontal flip is not applied for Something-Something. To reduce the total training cost, we inflate the 2D convolution kernels pre-trained on ImageNet for Kinetics (Carreira & Zisserman, 2017b). More implementation specifics are shown in Appendix C. For testing, we explore the sampling strategies in our experiments. To obtain a preferable computation-accuracy balance, we adopt multi-clip testing for Kinetics and multi-crop testing for Something-Something. All scores are averaged for the final prediction.
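The final step, averaging scores over all test views, can be sketched as follows. The function name and the toy scores are hypothetical; each inner list stands for the per-class scores of one clip or crop.

```python
def average_view_scores(view_scores):
    """Average per-class scores over all clips/crops of one video."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [
        sum(scores[c] for scores in view_scores) / num_views
        for c in range(num_classes)
    ]


# Four clips of one video, three classes (made-up numbers).
views = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.75, 0.0, 0.25], [0.5, 0.25, 0.25]]
video_scores = average_view_scores(views)
prediction = max(range(len(video_scores)), key=video_scores.__getitem__)
print(video_scores, prediction)
```

Here the second clip alone would vote for a different class, and averaging over the views smooths out that disagreement.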

4.2 Comparison to state-of-the-art

Kinetics-400&600. Table 2 presents comparisons to the state-of-the-art methods on Kinetics-400 and Kinetics-600. The first part shows the prior works using CNNs. Compared with SlowFast (Feichtenhofer et al., 2019), our UniFormer-S16f requires $\mathbf{42}\times$ fewer GFLOPs but obtains a 1.0% performance gain on both datasets. Even compared with MoViNet (Kondratyuk et al., 2021), which is designed through extensive neural architecture search, our model achieves slightly better results with fewer input frames ( $16f\times 4$ vs. $120f$ ). The second part lists the recent works based on vision transformers. With only ImageNet-1K pre-training, UniFormer-B16f surpasses most of the other backbones with large dataset pre-training. For example, compared with ViViT-L pre-trained from JFT-300M and Swin-B pre-trained from ImageNet-21K, UniFormer-B32f obtains comparable performance with $\mathbf{16.7\times}$ and $\mathbf{3.3\times}$ less computation, respectively, on both Kinetics-400 and Kinetics-600.

Something-Something V1&V2. Results on Something-Something V1&V2 are shown in Table 3. Since these datasets depend on temporal relation modeling, it is difficult for the CNN-based methods to capture long-term dependencies, which leads to their worse results. In contrast, transformer-based backbones are good at processing long sequential data and demonstrate better transfer learning capabilities (Zhou et al., 2021). Our UniFormer pre-trained from Kinetics-600 outperforms all the current methods under the same settings. In fact, our best model achieves new state-of-the-art results: 61.0% top-1 accuracy on Something-Something V1 (4.2% higher than TDN$_{EN}$ (Wang et al., 2020b)) and 71.2% top-1 accuracy on Something-Something V2 (1.6% higher than Swin-B (Liu et al., 2021b)). Such results verify the capability of spatiotemporal learning for UniFormer.

4.3 Ablation Studies

UniFormer vs. Convolution: Does transformer-style FFN help? As mentioned in Section 3.2, our UniFormer block in the shallow layers can be interpreted as a transformer-style spatiotemporal MobileNet block (Tran et al., 2019) with an extra FFN. Hence, we first investigate its effectiveness by replacing our UniFormer blocks in the shallow layers with MobileNet blocks (the expand ratios are set to 3 to keep the parameter counts similar). As expected, our default UniFormer outperforms such a spatiotemporal MobileNet block in Table 4. This shows that the FFN in our UniFormer can further mix token context at each spatiotemporal position to boost classification accuracy.

Does dynamic position embedding matter to UniFormer? With dynamic position embedding, our UniFormer improves the top-1 accuracy by 0.5% and 1.7% on ImageNet and Kinetics-400, respectively. This shows that, by encoding position information, our DPE can maintain spatiotemporal order, contributing to better spatiotemporal representation learning.

Is our UniFormer more transferable? We further verify the transfer learning ability of our UniFormer in Table 4. All models share the same stage numbers but the stage types are different. Compared with pre-training from ImageNet, pre-training from Kinetics-400 further improves the top-1 accuracy by 1.8%. However, such an improvement is not observed for the pure local MHRA structure or for UniFormer with divided spatiotemporal attention. This demonstrates that the joint learning manner is preferable for transfer learning.

Empirical investigation on model parameters. We further evaluate the robustness of our UniFormer network to several important model parameters. (1) Size of local tube: In our local token affinity (Eq. 6), we aggregate spatiotemporal context from a small local tube. Hence, we investigate the influence of this tube by changing its 3D size (Table 4). Our network is robust to the tube size. We choose 5 $\times$ 5 $\times$ 5 for better accuracy. (2) Sampling method: We explore the sampling methods shown in Table 4. For training, 16 $\times$ 4 means that we sample 16 frames with frame stride 4. For testing, 4 $\times$ 1 means four-clip testing. As expected, a sparser sampling method achieves a higher single-clip result. For multi-clip testing, dense sampling is slightly better when sampling a few frames. However, when sampling more frames, sparse sampling is clearly better. (3) Testing strategy: We evaluate our network with different numbers of clips and crops for the validation videos. As shown in Figure 4, since Kinetics is a scene-related dataset and is trained with dense sampling, multi-clip testing is preferable because it covers more frames. In contrast, Something-Something is a temporal-related dataset and is trained with uniform sampling, so multi-crop testing is better at capturing the discriminative motion.
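The two sampling schemes mentioned in (2) can be sketched as index generators. Dense sampling follows the 16 $\times$ 4 convention stated above (16 frames, frame stride 4). The uniform variant assumes the common segment-based scheme of splitting the video into equal segments and taking the middle frame of each; the function names, the start offset and the middle-frame choice are assumptions, not details given in the text.

```python
def dense_sample(start, num_frames=16, stride=4):
    """Indices of a clip with a fixed frame stride, starting at `start`."""
    return [start + i * stride for i in range(num_frames)]


def uniform_sample(video_length, num_frames=16):
    """One index per equal-length segment (middle frame of each segment)."""
    segment = video_length / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


clip = dense_sample(start=0)                 # first and last index: 0 and 60
sparse = uniform_sample(video_length=160)    # indices spread over the whole video
print(clip[0], clip[-1], sparse[0], sparse[-1])
```

A 16 $\times$ 4 clip covers a window of only 61 frames, which is why several clips are needed to cover a longer video, whereas uniform sampling already spans the full video with a single clip.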

4.4 Visualization

Conclusion

In this paper, we propose a novel UniFormer, which can effectively unify 3D convolution and spatiotemporal self-attention in a concise transformer format to overcome video redundancy and dependency. We adopt local MHRA in shallow layers to largely reduce computation burden and global MHRA in deep layers to learn global token relation. Extensive experiments demonstrate that our UniFormer achieves a preferable balance between accuracy and efficiency on popular video benchmarks, Kinetics-400/600 and Something-Something V1/V2.

Acknowledgement

This work is partially supported by the National Natural Science Foundation of China (61876176, U1813218), Guangdong NSF Project (No. 2020B1515120085), the Shenzhen Research Program (RCJC20200714114557087), and the Shanghai Committee of Science and Technology, China (Grant No. 21DZ1100100).

References

Appendix A More details about local MHRA

For local MHRA, it is vital to determine the neighbor tokens. Considering any token $\mathbf{X}_{k}$ ( $k\in[0,L-1]$ ), we can calculate its index $(t_{k},h_{k},w_{k})$ as follows:
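The display for this index is missing from the text. Assuming tokens are flattened in row-major order over $(T, H, W)$, so that $L=T\times H\times W$, the index can be recovered as:

```latex
t_{k} = \left\lfloor \frac{k}{H\times W} \right\rfloor, \qquad
h_{k} = \left\lfloor \frac{k - t_{k}\times H\times W}{W} \right\rfloor, \qquad
w_{k} = (k - t_{k}\times H\times W) \bmod W .
```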

Therefore, for an anchor token $\mathbf{X}_{i}$ , any of its neighbor tokens $\mathbf{X}_{j}$ in $\Omega_{i}^{t\times h\times w}$ should satisfy
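The condition itself is not displayed here. For a tube of size $t\times h\times w$ centered on the anchor token, a natural reading (a reconstruction, assuming a centered tube) is:

```latex
|t_{i} - t_{j}| \le \frac{t}{2}, \qquad
|h_{i} - h_{j}| \le \frac{h}{2}, \qquad
|w_{i} - w_{j}| \le \frac{w}{2}.
```

For an odd tube size such as 5, the bound $5/2$ admits integer offsets from $-2$ to $2$ along that axis.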

Thus the local spatiotemporal affinity in Eq. 6 can be calculated as follows:
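In index form, Eq. 6 would then read $\mathrm{A}_{n}^{local}(\mathbf{X}_{i},\mathbf{X}_{j}) = a_{n}[t_{i}-t_{j},\,h_{i}-h_{j},\,w_{i}-w_{j}]$ with $a_{n}\in\mathbb{R}^{t\times h\times w}$; the display is missing here, so this is a reconstruction. The Python sketch below puts the three steps of this appendix together, assuming row-major flattening over $(T, H, W)$ and odd tube sizes. All function names are hypothetical.

```python
def token_index(k, H, W):
    """Map a flat token id k to its (t, h, w) position (row-major over T, H, W)."""
    t = k // (H * W)
    h = (k - t * H * W) // W
    w = (k - t * H * W) % W
    return t, h, w


def relative_offset(i, j, H, W, tube=(5, 5, 5)):
    """Offset of X_j relative to anchor X_i, or None if X_j lies outside the tube."""
    ti, hi, wi = token_index(i, H, W)
    tj, hj, wj = token_index(j, H, W)
    offset = (ti - tj, hi - hj, wi - wj)
    inside = all(abs(d) <= size // 2 for d, size in zip(offset, tube))
    return offset if inside else None


def local_affinity(a_n, i, j, H, W, tube=(5, 5, 5)):
    """Look up the learnable weight for the pair (i, j); zero outside the tube."""
    offset = relative_offset(i, j, H, W, tube)
    if offset is None:
        return 0.0
    # Shift offsets from [-r, r] to [0, 2r] so they index the t x h x w array.
    dt, dh, dw = (d + size // 2 for d, size in zip(offset, tube))
    return a_n[dt][dh][dw]
```

Storing $a_{n}$ as a nested list indexed by shifted offsets is just one convenient layout; in practice the same lookup is what a depthwise 3D convolution kernel performs implicitly.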

Appendix B More details about FFN

We adopt the standard FFN (Eq. 3) in vision transformers (Dosovitskiy et al., 2021),
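The display for the FFN is missing here. The standard two-layer form used in vision transformers is sketched below. The input symbol $\mathbf{Y}$ is an assumption, and the GELU activation and the hidden expansion (commonly a ratio of 4) follow the usual ViT convention rather than anything stated in this text:

```latex
\mathrm{FFN}(\mathbf{Y}) = \mathrm{Linear}_{2}\big(\mathrm{GELU}(\mathrm{Linear}_{1}(\mathbf{Y}))\big)
```

Both linear layers act on each token independently, which is why the main text describes the FFN as mixing context at each spatiotemporal position.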

Appendix C Additional Implementation Details

Architecture details. As in ViT (Dosovitskiy et al., 2021), we adopt the pre-normalization configuration (Wang et al., 2019) that applies norm layer at the beginning of the residual function (He et al., 2016). Differently, we utilize BN (Ioffe & Szegedy, 2015) for local MHRA and LN (Ba et al., 2016) for global MHRA. Moreover, we add an extra layer normalization in the downsampling layers.

Training details. We adopt the AdamW (Loshchilov & Hutter, 2017a) optimizer with a cosine learning rate schedule (Loshchilov & Hutter, 2017b) to train the entire network. The first 5 or 10 epochs are used for warm-up (Goyal et al., 2017a) to overcome early optimization difficulty. For UniFormer-S, the warm-up epochs, total epochs, stochastic depth rate and weight decay are set to 10, 110, 0.1 and 0.05 respectively for Kinetics, and to 5, 50, 0.3 and 0.05 respectively for Something-Something. For UniFormer-B, all hyper-parameters are the same except that the stochastic depth rates are doubled. We linearly scale the base learning rates according to the batch size, which are $1\times 10^{-4}\times\frac{batchsize}{32}$ and $2\times 10^{-4}\times\frac{batchsize}{32}$ for Kinetics and Something-Something, respectively.
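The learning-rate rule stated above, together with a warm-up plus cosine schedule, can be sketched as follows. The base rates per 32 samples and the warm-up and total epoch counts come from the text; the linear shape of the warm-up, the decay towards zero, and the per-epoch granularity are assumptions, as are the function names.

```python
import math

BASE_LR = {"kinetics": 1e-4, "something-something": 2e-4}


def scaled_lr(dataset, batch_size):
    """Linear scaling rule: base_lr * batch_size / 32."""
    return BASE_LR[dataset] * batch_size / 32


def lr_at_epoch(epoch, peak_lr, warmup_epochs, total_epochs):
    """Linear warm-up followed by cosine decay (epoch is 0-based)."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# UniFormer-S on Kinetics: 10 warm-up epochs, 110 epochs in total.
peak = scaled_lr("kinetics", batch_size=64)
schedule = [lr_at_epoch(e, peak, 10, 110) for e in range(110)]
print(peak, schedule[0], schedule[-1])
```

With a batch size of 64 the peak rate for Kinetics is twice the base rate; the schedule reaches that peak at the end of warm-up and decays afterwards.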

Appendix D Visualization

Appendix E Additional Results

Table 5 shows more results on Kinetics-400 (Carreira & Zisserman, 2017a) and Kinetics-600 (Carreira et al., 2018). The trends of the results on both datasets are similar. When sampling with a large frame stride, the corresponding single-clip testing result will be better. It is mainly because sparser sampling covers a larger time range. For multi-clip testing, sampling with frame stride 4 always performs better, thus we adopt frame stride 4 by default.

E.2 More results on Something-Something

Table 6 presents more results on Something-Something V1&V2 (Goyal et al., 2017b). For UniFormer-S, pre-training with Kinetics-600 is better than pre-training with Kinetics-400, improving the top-1 accuracy by approximately 1.5%. However, for UniFormer-B, the improvement is not obvious. We conjecture that the small model is harder to fit, so pre-training on a larger dataset helps it fit better. Besides, UniFormer-B with 16 frames performs better than UniFormer-S with 32 frames.

E.3 Comparison to state-of-the-art on ImageNet

Table 7 compares our method with state-of-the-art methods on ImageNet (Deng et al., 2009). We design four model variants as follows:

UniFormer-S: channel numbers= $\{64,128,320,512\}$ , stage numbers= $\{3,4,8,3\}$

UniFormer-S $\dagger$ : channel numbers= $\{64,128,320,512\}$ , stage numbers= $\{3,5,9,3\}$

UniFormer-B: channel numbers= $\{64,128,320,512\}$ , stage numbers= $\{5,8,20,7\}$

UniFormer-L: channel numbers= $\{128,192,448,640\}$ , stage numbers= $\{5,10,24,7\}$
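For reference, the four variants listed above can be written down as a small configuration table. The dataclass, its field names and the "UniFormer-S+" key (standing in for UniFormer-S $\dagger$) are hypothetical; the numbers are copied from the list, and reading "stage numbers" as the number of blocks per stage is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniFormerConfig:
    channels: tuple   # channel numbers per stage
    depths: tuple     # "stage numbers", read here as blocks per stage


VARIANTS = {
    "UniFormer-S": UniFormerConfig((64, 128, 320, 512), (3, 4, 8, 3)),
    "UniFormer-S+": UniFormerConfig((64, 128, 320, 512), (3, 5, 9, 3)),   # S-dagger
    "UniFormer-B": UniFormerConfig((64, 128, 320, 512), (5, 8, 20, 7)),
    "UniFormer-L": UniFormerConfig((128, 192, 448, 640), (5, 10, 24, 7)),
}

for name, cfg in VARIANTS.items():
    print(name, "total blocks:", sum(cfg.depths))
```

Under that reading, UniFormer-S and UniFormer-B share the same widths and differ only in depth, while UniFormer-L is both wider and deeper.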

All the other model parameters are the same as described in Section 3.4. For UniFormer-S $\dagger$ , we adopt overlapped convolutional patch embedding. All the training hyper-parameters are the same as DeiT (Touvron et al., 2021a) by default. When training our models with Token Labeling, we follow the settings used in LV-ViT (Jiang et al., 2021). The results show that our models outperform other methods with similar parameters/FLOPs on ImageNet, especially when training with Token Labeling. Moreover, our model surpasses models that combine CNNs with Transformers, e.g., CvT (Wu et al., 2021) and CoAtNet (Dai et al., 2021), which suggests that our UniFormer unifies convolution and self-attention more effectively for a preferable accuracy-computation balance.