Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao

Introduction

Video understanding has emerged as a critical skill for artificial intelligence systems to analyze and comprehend videos effectively. The progress in video understanding is currently driven by the Image Foundation Models (IFMs) , which are trained from massive datasets and adapted for different downstream tasks . However, IFMs tend to focus more on scenes and objects, disregarding the essential motion patterns and object interactions required for complex video understanding. The true Video Foundation Models (VFMs) are underexplored due to the high computational costs and data scarcity.

While building VFMs on well-learned IFMs reduces training costs, it poses significant challenges in transferring knowledge from the image domain to the video domain. Firstly, due to limited video data and a substantial domain gap, video post-pretraining may undermine the generality inherited from IFMs . Moreover, the strong spatial initialization offers a shortcut to perceive videos from scenes in single frames (e.g., “grass” in “horse riding”), which constrains VFMs from learning spatiotemporal relationships to recognize and localize temporal-related actions, such as “opening” and “closing” in Figure 2. Lastly, this paradigm is difficult to scale up as it requires well-prepared IFMs.

The recent success of VideoMAE offers a data-efficient way to learn effective spatiotemporal features from scratch, which handles complex temporal action recognition and detection tasks impressively. Nonetheless, its strong data efficiency and spatiotemporal modeling come at the cost of long pre-training (e.g., 2400 epochs on 160k videos). Besides, it is not well-suited for video-language tasks since the low-level pixel reconstruction task conflicts with high-level cross-modal alignment. Additionally, the extra decoder that handles masked and unmasked tokens causes high memory costs due to global self-attention, which also makes this paradigm challenging to scale up.

In this paper, we present a training-efficient method for temporal-sensitive VFMs by integrating the benefits of previous methods. Rather than directly adapting a public IFM, e.g., CLIP, we utilize it as an UnMasked Teacher (UMT) to train a vanilla ViT from scratch. We mask out most of the video tokens with low semantics and only align the unmasked tokens with a linear projection to the corresponding ones from the teacher. This approach not only inherits data efficiency from VideoMAE but also makes the learned video encoder multimodal-friendly (validated in Table 1). Moreover, training with only unmasked tokens without a decoder further saves GPU memory compared to VideoMAE, and the guidance from the teacher’s semantically rich representation leads to faster convergence. Notably, the resulting model can handle both scene-related and temporal-related actions exceptionally well, while the alignment to CLIP features enables the model to be compatible with cross-modal learning.

To address various video tasks, we propose a progressive pre-training framework in Figure 2. In Stage 1, we only use video data for masked video modeling, resulting in a model that excels at video-only tasks. In Stage 2, we employ public vision-language data for multi-modality learning. This allows the model to conduct complex video-language tasks, such as video-text retrieval and video question answering . We use the UMT in both stages, significantly reducing the training sources and speeding up convergence. Thanks to readily-available image and language foundation models , our simple framework is easily scalable for video foundation models.

We conduct extensive experiments to verify the effectiveness and efficiency of our approach. As shown in Figure 1, with public sources (data/models) for pre-training, our method achieves state-of-the-art performances on various video tasks, including action recognition (90.6% top-1 accuracy on K400), spatiotemporal localization (39.8 mAP on AVA), video-text retrieval (58.8 R@1 on MSRVTT) and video question-answering (47.1% accuracy on MSRVTT). It is worth emphasizing that our method is much more environmentally friendly compared to CoCa , which uses 2,048 CloudTPUv4 chips for 5 days. In contrast, our pre-training requires 32 A100(80G) GPUs within 6 days, leading to a remarkable 70 $\times$ reduction in carbon emissions.

Related Works

Video foundation models. The present Video Foundation Models (VFMs) are primarily based on well-prepared Image Foundation Models (IFMs) . However, the strong spatial pre-training restricts their ability to learn spatiotemporal representations. Despite the impressive results demonstrated by Florence , CoCa , MTV , and UniFormerV2 on video-only tasks , these models struggle to handle temporal-related actions and localize actions . As for video-language tasks, there have been promising explorations on model architecture and learning paradigms . Recently, InternVideo introduces general VFMs through generative and discriminative learning. However, the dependence on CLIP pre-training and tremendous training costs make it difficult to scale up. In this paper, we propose an easily scalable framework for VFMs that is much more training-efficient.

Masked vision modeling. Inspired by the success of masked language modeling , masked vision modeling has been proposed for vision transformers . BeiT is the first to propose a BERT-like mask-then-predict framework to recover the discrete tokens , while MAE designs masked autoencoders to reconstruct normalized pixel values, which reduces memory consumption by processing only unmasked tokens in the encoder. Later works can be roughly divided into BeiT-style and MAE-style with various target supervision, such as HOG descriptors and momentum features . For spatiotemporal learning, BEVT and VideoMAE can be seen as extensions of BeiT and MAE, respectively. Recent works also indicate that CLIP features provide good guidance for mask modeling , but all of them actually perform worse than CLIP itself with elaborate fine-tuning . In contrast, we demonstrate that in the video domain, our model with CLIP supervision clearly outperforms the teacher.

Method

In this section, we introduce our UnMasked Teacher (UMT) for masked video modeling and the progressive pre-training framework for temporal-sensitive video foundation models, as illustrated in Figure 2.

As discussed in the introduction, directly adapting the public Image Foundation Model (IFM) to Video Foundation Model (VFM) is challenging, thus we propose using IFM as a teacher to train a VFM from scratch. Given the limited data scale, we leverage mask modeling to make good use of the video data. However, unlike VideoMAE, we selectively align the unmasked tokens with the teacher, removing an extra decoder for efficient training.

Architecture. We choose CLIP-ViT as an unmasked teacher due to its rich semantics that are learned with language guidance, which is beneficial for our following multi-modality learning. To fully impart the teacher’s knowledge, we maintain its spatial architecture to process each video frame individually. For our backbone, we apply the vanilla ViT without a class token. We employ spatiotemporal attention to encourage all the unmasked tokens to interact with each other. For better alignment with the spatial teacher, we do not use temporal downsampling, thus the tokens can be aligned frame by frame.

Target. For the teacher model, we input all $L$ spatial tokens along with the class token, frame by frame. In contrast, for the student model, we only input the unmasked tokens, which are equal to $L(1-r)T$ tokens, where $r$ is the masking ratio and $T$ is the frame number. To distill the rich semantics more effectively, we process the output teacher tokens using the pre-trained visual projection, which is designed to establish meaningful connections between visual and text embeddings. Additionally, we add a simple linear projection for the student model to align the token dimension. We select the corresponding unmasked token from the student and teacher, and compute the mean squared error (MSE) between the normalized pairs. Compared to low-level pixel reconstruction, token alignment requires a high-level understanding, which is beneficial for multi-modality learning.
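The alignment target described above can be illustrated with a minimal sketch. It assumes that the student and teacher tokens have already been projected to the same dimension and that "normalized" means L2 normalization along the feature dimension (an assumption; the text does not name the normalization). The function and variable names are hypothetical, and plain Python lists stand in for tensors to keep the example self-contained.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (assumed form of normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def uta_loss(student_tokens, teacher_tokens, keep_idx):
    """Mean squared error between normalized student/teacher token pairs.

    student_tokens: one vector per unmasked token (already projected).
    teacher_tokens: vectors for all tokens of the clip.
    keep_idx: positions of the unmasked tokens in the teacher's token list.
    """
    assert len(student_tokens) == len(keep_idx)
    total, count = 0.0, 0
    for s, idx in zip(student_tokens, keep_idx):
        s_n = l2_normalize(s)
        t_n = l2_normalize(teacher_tokens[idx])
        total += sum((a - b) ** 2 for a, b in zip(s_n, t_n))
        count += len(s_n)
    return total / count

# Toy example: 4 teacher tokens, of which tokens 1 and 2 are unmasked.
teacher = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.0]]
keep = [1, 2]
student_same_direction = [[0.0, 5.0], [1.0, 1.0]]
student_orthogonal = [[4.0, 0.0], [1.0, -1.0]]

loss_aligned = uta_loss(student_same_direction, teacher, keep)
loss_misaligned = uta_loss(student_orthogonal, teacher, keep)
```

Because both sides are normalized, the loss in this sketch depends only on the direction of each token, not on its magnitude, so a student token pointing the same way as its teacher token contributes zero loss.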

Progressive Pre-training

For general video understanding, it is vital for the foundation model to handle video-language tasks. However, directly training such a model from scratch is inefficient. For example, CoCa utilizes 4.8B data to train 5 days on 2,048 CloudTPUv4 chips. Therefore, we introduce a training-efficient framework with progressive pre-training.

Pre-training pipeline. Figure 2 outlines our pipeline. In Stage 1, we train the ViT from scratch using only high-quality videos and guidance from Unmasked Teacher. The masked video modeling fully mines knowledge from the videos, resulting in a model that excels at video-only tasks. In Stage 2, we equip the pre-trained ViT with a text encoder and cross-modal decoder, initialized with the well-prepared language model. And we conduct multi-modality training with large-scale vision-text pairs, enabling the model to handle complex video-language tasks. It’s worth noting that currently, open-source language models are larger and more diverse than vision models, making it easy to scale up our foundation models. For example, the largest OPT has 175B parameters, while ViT-G only has 1.8B.

Pre-training objectives. For both stages, we utilize Unmasked Teacher to perform Unmasked Token Alignment (UTA). In Stage 2, we employ three other popular objectives: (i) Video-Text Contrastive (VTC) learning, which aims to align the pooled unmasked video and text embeddings. We use the symmetric contrastive loss to maximize the mutual information. (ii) Video-Text Matching (VTM) enhances cross-modal fusion by aligning the unmasked video and text tokens. We adopt the binary cross-entropy loss with hard negative mining . (iii) Masked Language Modeling (MLM) uses the cross-modal decoder to predict masked words from the other text and unmasked video tokens. We follow the BERT strategy but mask 50% of the text tokens.
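As an illustration of the VTC objective, the following is a minimal sketch of a symmetric contrastive loss over a batch of pooled video and text embeddings. The temperature of 0.07 and all function names are assumptions made for this example and are not taken from the paper; plain Python lists stand in for tensors.

```python
import math

def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        loss += log_sum - row[i]
    return loss / len(logits)

def vtc_loss(video_emb, text_emb, temperature=0.07):
    """Average of video-to-text and text-to-video contrastive losses."""
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-video
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))

# Toy batch of two pairs; swapping the texts breaks the pairing.
video = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[2.0, 0.1], [0.1, 2.0]]
swapped_text = [matched_text[1], matched_text[0]]

loss_matched = vtc_loss(video, matched_text)
loss_swapped = vtc_loss(video, swapped_text)
uninformative = vtc_loss([[1.0, 0.0]] * 2, [[1.0, 0.0]] * 2)
```

In this sketch the loss is small when each video is most similar to its own text and grows when the pairing is broken; when all embeddings are identical it equals the logarithm of the batch size.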

Experiments

Datasets. Unless otherwise stated, we use Kinetics-710 dataset in Stage 1, which is a combination of Kinetics-400, 600 and 700 and excludes any repeated or leaked videos. In Stage 2, we utilize image-text data for co-training , where images are treated as single-frame videos. We use three corpora as in : (i) 5M Corpus comprises WebVid-2M video-text pairs and CC3M image-text pairs. (ii) 17M Corpus includes four other image-text datasets: COCO , Visual Genome , SBU Captions , and CC12M . (iii) 25M Corpus uses a larger version of WebVid containing 10M video-text pairs.

Settings. In this paper, we consider two model configurations: ViT-B/16 with BERTbase and ViT-L/16 with BERTlarge. CLIP-ViT-B/16 and CLIP-ViT-L/14 are adopted as teachers for the base and large models, respectively. For Stage-1 pre-training, we follow most of the hyperparameter settings in VideoMAE. However, we sparsely sample 8 frames and use a masking ratio of 80%. By default, we train both models on 32 A100 GPUs with a batch size of 2048 for 200 epochs. The training on Kinetics-710 takes about 60 and 90 hours for ViT-B/16 and ViT-L/16, respectively. In Stage 2, following prior work, we sample 4 frames and train for 10 epochs. Specifically, we mask 50% of the image tokens and 80% of the video tokens. Both models are trained on 32 A100 GPUs with a batch size of 4096. The pre-training on 25M Corpus takes about 24 and 40 hours for the base and large models, respectively. For more implementation details about training, please refer to the appendix.
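To make the token budget $L(1-r)T$ concrete, the short sketch below counts the tokens seen by the student under the settings above. It assumes a 224×224 input resolution (not stated in this section) and rounds the number of masked tokens per frame to the nearest integer; both choices, and the function name, are assumptions for illustration.

```python
def token_budget(frames, image_size, patch_size, mask_ratio):
    """Return (total tokens, unmasked tokens) for one clip."""
    per_side = image_size // patch_size
    tokens_per_frame = per_side * per_side                    # L
    masked_per_frame = round(tokens_per_frame * mask_ratio)   # about L * r
    kept_per_frame = tokens_per_frame - masked_per_frame      # about L * (1 - r)
    return tokens_per_frame * frames, kept_per_frame * frames

# Stage 1: 8 frames, 16x16 patches, 80% masking.
stage1_total, stage1_kept = token_budget(8, 224, 16, 0.8)

# Stage 2: 4-frame videos at 80% masking, single-frame images at 50%.
video_total, video_kept = token_budget(4, 224, 16, 0.8)
image_total, image_kept = token_budget(1, 224, 16, 0.5)
```

Under these assumptions the Stage-1 student processes 312 of the 1,568 tokens in a clip, which is the source of the memory savings relative to processing every token.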

Ablation Study

We ablate the properties of UMT in both stages on both scene-related and temporal-related tasks. For single-modality learning, we pre-train ViT-B/16 for 200 epochs on the SthSth V2 or K400 dataset. For multi-modality learning, we use K710 pre-trained models and further pre-train them for 10 epochs on 5M Corpus, except for Table 1, where we use K400 pre-training.

Target. Table 1 presents a comparison of training targets. Compared with pixel reconstruction , our unmasked token alignment significantly improves the accuracy with only 36% memory cost. However, combining the two targets results in poor results on K400 and MSRVTT, indicating a conflict between low-level reconstruction and high-level alignment. Moreover, recovering the masked tokens has a detrimental effect, possibly due to the high masking ratio making high-level recovery too challenging. The results demonstrate our method is effective to learn temporal-sensitive and multimodal-friendly representation.

Mask type, sampling method, and temporal downsampling. Table 2 indicates that different masking strategies yield comparable results in SthSth V2. We contend that recognizing the category of “something” is not necessary for SthSth V2, but it requires deducing the intricate motion between objects, thus random masking suffices. However, it is critical for K400 to identify the scene and objects, making semantic masking advantageous for knowledge distillation. Moreover, sparse sampling without temporal downsampling is more appropriate for our approach.
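The difference between semantic and random masking can be sketched as follows. The example assumes that each token has a non-negative importance score supplied by the teacher (how that score is obtained is not specified in this section, so the scores below are hypothetical) and keeps tokens by weighted sampling without replacement, frame by frame. Uniform scores reduce the procedure to random masking.

```python
import random

def sample_keep_indices(scores, mask_ratio, rng):
    """Choose which tokens of one frame stay unmasked.

    Tokens are drawn without replacement with probability proportional to
    their score, so low-score tokens are masked more often.
    """
    n_keep = len(scores) - round(len(scores) * mask_ratio)
    candidates = list(range(len(scores)))
    # Small epsilon keeps the total weight positive even if all scores are zero.
    weights = [float(s) + 1e-8 for s in scores]
    kept = []
    for _ in range(n_keep):
        pos = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        kept.append(candidates.pop(pos))
        weights.pop(pos)
    return sorted(kept)

def mask_video(scores_per_frame, mask_ratio, seed=0):
    """Apply the per-frame sampling to every frame of a clip."""
    rng = random.Random(seed)
    return [sample_keep_indices(s, mask_ratio, rng) for s in scores_per_frame]

# Hypothetical scores for two frames of 10 tokens each.
semantic_scores = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
uniform_scores = [[1.0] * 10, [1.0] * 10]  # equivalent to random masking

semantic_keep = mask_video(semantic_scores, mask_ratio=0.8)
random_keep = mask_video(uniform_scores, mask_ratio=0.8)
```

With an 80% ratio, two of the ten tokens in each frame stay unmasked; under the skewed scores these are almost always the high-score tokens, whereas uniform scores spread the kept tokens arbitrarily.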

Aligned layers. We try to align more layers in Figure 3, and the losses are averaged across multiple layers. Since the GPU memory and running speed are similar, we simply align the last 6 layers for the best results.

Masking ratio. Figure 4 shows that proper high ratios work better. When using a ratio of 95%, the performances dramatically drop since it is too challenging for token alignment. Conversely, when removing masks, the task is too easy to learn the token relationships in space and time. By default, we adopt the ratio of 80% for better trade-offs.

Why does UMT work? In Table 3, we investigate the crucial designs of our Unmasked Teacher. (i) Spatiotemporal attention: In the 2nd and 3rd parts, we compare the student with spatial attention and spatiotemporal attention during fine-tuning. Our results indicate that utilizing joint attention significantly enhances performance. Moreover, employing spatiotemporal attention during pre-training further improves performance (the 4th part), validating our assumption that joint attention encourages interaction among all unmasked tokens. (ii) Masked modeling: In the 4th part, we observe that masked modeling plays a crucial role. However, when using spatial attention during pre-training, masked modeling becomes detrimental. We argue that when processing each frame individually with a high mask ratio of 80%, the token alignment task becomes excessively challenging. (iii) Teacher attention: The 5th part shows that although CLIP-ST achieves better performance after fine-tuning, directly applying it as the teacher model leads to a performance drop. We contend that without post-training in the video domain, CLIP-ST may disrupt the representation learned in the image domain.

Outperforming the CLIP teacher. In the image domain, prior research has shown that CLIP itself with fine-tuning surpasses existing CLIP-targeted MIM methods. However, Table 3 indicates that in the video domain, the student model (the 4th part) clearly outperforms the teacher, i.e., CLIP-ST with our elaborate fine-tuning. We attribute the success to masked video modeling with spatiotemporal attention, which encourages the model to capture long-term dependencies among objects.

Multi-modality masking ratios. In Table 4, we first alter the masking ratios of the image and video data. Since we co-train image-text and video-text data with the same batch size, the GPU memory primarily depends on the video masking ratio. As expected, processing images requires a lower masking ratio of 50%. Although higher masking ratios reduce memory consumption, the corresponding performances are lower. Additionally, masking too few (25%) or too many (75%) text tokens leads to inferior results.

Multi-modality pre-training objectives. For cross-modal retrieval, utilizing either VTC or VTM for visual-text pairs is necessary. In Table 5, all loss weights are set to 1. The 1st part reveals that VTM performs better than VTC. Besides, the 2nd part shows that combining VTC or MLM with VTM leads to a minor improvement, while integrating all three objectives significantly enhances the performance. Lastly, without our unmasked teacher alignment, the memory usage triples, while the performances drop.

Single-modality tasks

We evaluate our method on two conventional video-only tasks: recognizing and localizing actions on six large-scale benchmarks, including the Kinetics family (i.e., Kinetics-400, 600 and 700), Moments in Time V1 and Something-Something V2 for action recognition, and AVA V2.2 for spatiotemporal localization.

Kinetics. Table 6 reports the SOTA methods with supervised and self-supervised learning on K400. On one hand, our UMT with intermediate fine-tuning outperforms the previous models that rely on web-scale pre-training, e.g., the UMT-L achieves 0.4% higher top-1 accuracy than MTV-H with only 1/10 of the FLOPs and 1/3 of the parameters. On the other hand, our UMT surpasses its counterparts with masked video modeling, e.g., compared with VideoMAE with 1600-epoch pre-training, the UMT-L with 400-epoch pre-training obtains a 3.7% accuracy improvement. For K600 and K700, our UMT-L also obtains SOTA performance (90.5% and 83.6%; see Table 7).

Moments in Time. As shown in Table 8, our UMT-L achieves 1.0%/1.7% higher top-1/5 accuracy compared to the advanced UniFormerV2-L , while utilizing fewer FLOPs. Note that MiT is more challenging due to the large inter-class and intra-class variation, thus the results demonstrate the robustness and effectiveness of our method.

Something-Something. Distinct from previous benchmarks, this particular dataset requires complex and long-term modeling to accurately recognize temporal-related actions, such as “pretending to close something without actually closing it”. Without any additional data, our UMT-L model outperforms the UniFormerV2-L (74.4% vs. 73.0% in Table 9), which was specifically tailored for temporal modeling. Additionally, our approach achieves comparable performance to VideoMAE with significantly fewer epochs. Intriguingly, VideoMAE performs worse when utilizing Kinetics for masked modeling, while our UMT performs even better. This demonstrates the versatility and adaptability of our method, which can be applied to diverse video domains with the same pre-training.

AVA. Table 10 presents the results of the action detection on AVA. Remarkably, our UMT achieves 2.0 mAP improvement over the advanced VideoMAE with only K400 pre-training. Furthermore, our method achieves the impressive 39.8 mAP with K710 pre-training, showcasing its robust transferability for spatiotemporal understanding.

Multi-modality tasks

We further validate our model on two mainstream video-language tasks, including video-text retrieval (MSRVTT, DiDeMo, ActivityNet, LSMDC, MSVD and Something-Something) and video question-answering (ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and MSVD-QA).

Zero-shot text-to-video retrieval. Table 11 indicates that the UMT-B outperforms the top-performing models by 0.9%, 5.0%, and 4.6% R@1 on MSRVTT, DiDeMo, and ActivityNet, respectively. Moreover, our UMT-L achieves new state-of-the-art results among all the datasets, highlighting its remarkable robustness.

Text-to-video retrieval. Table 12 lists the fine-tuned results, where our UMT-L significantly outperforms previous methods pre-trained with large-scale pairs. Specifically, our UMT-L achieves 58.8% (+3.6%), 70.4% (+9.2%), 66.8% (+4.6%), 43.0% (+9.0%), and 80.3% (+21.9%) on MSRVTT, DiDeMo, ActivityNet, LSMDC, and MSVD, respectively. Besides, the strong results on the temporally-heavy SthSth V2 dataset (73.3% and 90.8%) in Table 13 further support its broad applicability.

Video question-answering. As shown in Table 14, our UMT outperforms the methods specifically designed for QA such as JustAsk, and achieves comparable performance with state-of-the-art models that are pre-trained with large-scale pairs, which demonstrates its powerful capability of complex multimodal reasoning.

Conclusion

In this paper, we propose using the image foundation model as the unmasked teacher for masked video modeling. Besides, we present a progressive pre-training framework for building environmentally friendly video foundation models, which handles both scene-related and temporal-related actions, as well as complex video-language understanding. We hope that our simple, scalable, and reproducible framework will facilitate further research on video foundation models for future AI systems.

Appendix A More results

We conduct more ablation studies based on ViT-B/16. Results are shown in Figure 5, Table 15 and Table 16.

Training schedule. Figure 5 presents the results of different training schedules. On one hand, a longer training schedule consistently improves the performance on both benchmarks. On the other hand, compared to VideoMAE, our method shows a faster convergence speed. For example, when pre-training for 200 epochs, our models achieve 3.9% and 6.8% higher top-1 accuracy on SthSth V2 and Kinetics-400, respectively.

Different teachers. In Table 15, we adopt different models as the unmasked teachers. As expected, the student models clearly outperform the corresponding teacher models, which have undergone elaborate fine-tuning. It’s important to note that both student and teacher models share the same architecture, further emphasizing the effectiveness of our approach.

Other designs. Table 16 showcases alternative designs for our multi-modality pre-training. Firstly, we attempt to directly perform Stage 2 with a randomly initialized video encoder. For a fair comparison, we incorporate Kinetics-710 and conduct the same number of data iterations. However, the results demonstrate that the one-stage pre-training is challenging to converge, leading to poor performance. Secondly, we randomly mask the video without an unmasked teacher for supervision, which slightly reduces the overall performance. Additionally, we consider aligning the visual and text projection with the CLIP teacher, since the teacher model also adopts contrastive learning. However, introducing extra alignment tasks turns out to be redundant and even harmful. Finally, we conduct extra pre-training without masks after masked pre-training. Though it improves zero-shot performance (+1.5% higher average recall accuracy), the fine-tuned results are not as good as expected.

A.2 Video-text retrieval

Table 17 and Table 18 show more zero-shot and fine-tuned retrieval results on MSRVTT, DiDeMo, ActivityNet, LSMDC and MSVD.

Appendix B More implementation details

In this section, we introduce the model architectures and training hyperparameters in our experiments.

Stage 1. In Stage 1, we train the video encoder from scratch, which is a vanilla ViT without temporal downsampling. We use the same patch size for both ViT-B and ViT-L, i.e., $1\times16\times16$ ($T\times H\times W$). To align with the unmasked teacher, we use a simple linear projection, including Layer Normalization and one linear layer. The example architecture is shown in Table 19. For pre-training, we follow most of the hyperparameters in VideoMAE, as presented in Table 20. However, to prevent overfitting, we use drop path in our approach.

Stage 2. In Stage 2, we equip the pre-trained video encoder with a text encoder and cross-modal decoder. Following Singularity, for the base model, we use the first 9 layers and the last 3 layers of BERTbase to initialize the text encoder and decoder, respectively. For our large model, we adopt the first 19 layers and the last 5 layers of BERTlarge, respectively. For pre-training, we set all the loss weights to 1. More details are shown in Table 21.
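The layer split can be checked with a small sketch. BERTbase has 12 Transformer layers and BERTlarge has 24, so assigning the first 9 (or 19) layers to the text encoder leaves 3 (or 5) layers for the cross-modal decoder. The function name is hypothetical, and integer layer indices stand in for the actual pre-trained layers.

```python
def split_bert_layers(num_layers, num_encoder_layers):
    """Split layer indices into (text encoder, cross-modal decoder) parts."""
    layer_ids = list(range(num_layers))
    return layer_ids[:num_encoder_layers], layer_ids[num_encoder_layers:]

# Base model: 12 layers, the first 9 go to the text encoder.
base_encoder, base_decoder = split_bert_layers(12, 9)

# Large model: 24 layers, the first 19 go to the text encoder.
large_encoder, large_decoder = split_bert_layers(24, 19)
```

Every layer of the language model is used exactly once in this split, so no pre-trained weights are duplicated or discarded.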

Action Recognition. We adopt the Stage-1 pre-trained video encoder and add an extra classification layer for fine-tuning. Detailed hyperparameters for different datasets are shown in Table 22. In our experiments, we have tried to fine-tune the Stage-2 pre-trained video encoder, but the results on Kinetics are similar.

Action Detection. Following VideoMAE and ST-MAE, we add ROIAlign with MaxPooling to generate the regions of interest. Since the Kinetics pre-trained models adopt sparse sampling, we use a frame span of 300 for action detection, which is the default frame number of Kinetics videos. More details are listed in Table 23.
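One common reading of sparse sampling is to divide the frame span into equal segments and take one frame from each. The sketch below follows that reading and picks the center frame of every segment; the exact sampling rule, the choice of 8 sampled frames, and the function name are assumptions for illustration, not details confirmed by this section.

```python
def sparse_sample_indices(num_frames, num_samples):
    """Center frame index of each of num_samples equal-length segments."""
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

# A span of 300 frames covered by 8 sampled frames.
indices = sparse_sample_indices(300, 8)
```

Under this scheme the sampled frames cover the whole span regardless of how few frames are drawn, which is why the span, rather than a fixed stride, determines the temporal context.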

Video-text retrieval. For fine-tuning, we adopt the same architecture as in Stage 2, but we only apply VTC and VTM losses. For all datasets, we sparsely sample 12 frames for both training and testing. More details are listed in Table 26. For a fair comparison, we follow Singularity to apply flip augmentation for SSV2 retrieval, which may harm the performance of this temporal-related dataset.

Video question-answering. Following previous works, we formulate this task as text generation instead of classification. We add an extra multi-modal decoder that takes the output of the cross-modal decoder as the keys/values and decodes the answer text with “[CLS]” as the start token. Following prior work, we adopt the same architecture as the cross-modal decoder and initialize it using the pre-trained cross-modal decoder. As for multiple-choice question-answering, we follow prior work and convert it to a text-to-video retrieval task, where the question and candidate answers are concatenated. The detailed hyperparameters are shown in Table 24 and Table 25.
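The conversion of multiple-choice question-answering into retrieval can be sketched as follows. Each candidate answer is concatenated with the question, every resulting text is scored against the video, and the best-scoring candidate is returned. The space separator and the toy scoring function are assumptions: `toy_score` is a hypothetical stand-in for the model's video-text matching score and only counts overlapping words.

```python
def answer_multiple_choice(video, question, candidates, score_fn):
    """Return the index of the candidate whose text best matches the video."""
    texts = [f"{question} {candidate}" for candidate in candidates]
    scores = [score_fn(video, text) for text in texts]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, texts

def toy_score(video_tags, text):
    """Hypothetical placeholder score: number of words found in the tags."""
    return sum(word in video_tags for word in text.lower().split())

# Toy "video" represented by a set of tags instead of frames.
video_tags = {"dog", "frisbee"}
best_idx, texts = answer_multiple_choice(
    video_tags,
    "what is the dog catching?",
    ["ball", "frisbee", "stick"],
    toy_score,
)
```

Because the task is reduced to ranking texts for a fixed video, the same retrieval model can be reused without adding a classification head.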

B.2 Dataset descriptions

We show the statistics of pre-training datasets in Table 27, and downstream datasets in Table 28.