Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang

Introduction

For self-supervised visual representation learning, recent masked image modeling (MIM) methods like MAE, BEiT and PeCo achieve promising results with vision transformers on various vision downstream tasks. Such a pretraining paradigm has also been adapted to the video domain and boosts video transformers by clear margins compared with supervised pretraining on several video downstream tasks. Representative masked video modeling (MVM) works include BEVT, VideoMAE and ST-MAE.

Following MAE and BEiT, existing masked video modeling methods pretrain video transformers by reconstructing low-level features, e.g., raw pixel values or low-level VQVAE tokens. However, using low-level features as reconstruction targets often introduces considerable noise. Moreover, due to the high redundancy in video data, it is easy for masked video modeling to learn shortcuts, which limits transfer performance on downstream tasks. To alleviate this issue, masked video modeling often uses larger masking ratios.

In this paper, we observe that much better performance on video downstream tasks can be achieved by conducting masked feature prediction with the high-level features of pretrained MIM and MVM models as the prediction targets. This can be viewed as two-stage masked video modeling, where MIM pretrained image models (i.e., an image teacher) or MVM pretrained video models (i.e., a video teacher) are obtained in the first stage, and they then act as teachers for the student model in the second stage by providing the high-level feature targets. Therefore, we call this method Masked Video Distillation (MVD).

More interestingly, we find that student models distilled with different teachers in MVD exhibit different properties on different video downstream tasks. Specifically, students distilled from the image teacher perform better on video tasks that mainly rely on spatial clues, while students distilled from the video teacher perform better on video downstream tasks where temporal dynamics matter more. We hypothesize that, during masked video modeling pretraining in the first stage, video teachers learn spatial-temporal context in their high-level features. Therefore, employing such high-level representations as the prediction targets of masked feature modeling encourages the student model to learn stronger temporal dynamics. By analogy, image teachers provide high-level features that include more spatial information as targets, which helps the student model learn more spatially meaningful representations. We further analyze the feature targets provided by image teachers and video teachers by calculating the cross-frame feature similarity. The analysis shows that the features provided by the video teachers contain more temporal dynamics.

Motivated by the above observation, to leverage the advantages of both video teachers and image teachers, we propose a simple yet effective spatial-temporal co-teaching strategy for MVD. In detail, the student model is designed to reconstruct the features coming from both the image teacher and the video teacher with two different decoders, so as to learn stronger spatial representations and temporal dynamics at the same time. Experiments demonstrate that MVD with co-teaching from both the image teacher and the video teacher significantly outperforms MVD using only a single teacher on several challenging downstream tasks.

Despite its simplicity, MVD with co-teaching is highly effective and achieves very strong performance on multiple standard video recognition benchmarks. For example, on the Kinetics-400 and Something-Something-v2 datasets, compared to the baseline without MVD, MVD co-teaching for 400 epochs with a teacher model of the same size achieves Top-1 accuracy gains of 1.2% and 2.8% on ViT-B, respectively. If a larger ViT-L teacher model is used, more significant performance gains (i.e., 1.9% and 4.0%) can be obtained. When ViT-Large is the target student model, our method achieves 86.4% and 76.7% Top-1 accuracy on these two datasets, surpassing the existing state-of-the-art method VideoMAE by 1.2% and 2.4%, respectively. When a larger ViT-Huge model is adopted, MVD achieves state-of-the-art performance with 77.3% Top-1 accuracy on Something-Something-v2 and 41.1 mAP on AVA v2.2.

Our contributions can be summarized as below:

We find that using MIM pretrained image models and MVM pretrained video models as teachers that provide high-level features for continued masked feature prediction leads to better video representations. Moreover, representations learned with image teachers and video teachers show different properties on different downstream video datasets.

We propose masked video distillation together with a simple yet effective co-teaching strategy, which enjoys the synergy of image and video teachers.

We demonstrate strong performance on multiple standard video recognition benchmarks, surpassing both the baseline without MVD and prior state-of-the-art methods by clear margins.

Related Work

Vision transformers for video understanding. For video understanding tasks, modeling spatial-temporal information is the most important factor to consider in the architecture design. In early works on video understanding, common video architectures, e.g., 3D CNNs and 2D CNNs with temporal modules, are designed by extending existing 2D CNN models along the temporal dimension. Recently, Vision Transformers have achieved significant progress on several computer vision tasks. Some works also adapt vision transformers to the video domain and achieve superior performance compared to previous CNN-based architectures. For example, TimeSformer and ViViT study several variants of space-time factorization for extending the plain ViT architecture to the video domain. Some works further explore how to reduce the computational cost of space-time attention. VideoSwin and MViT study hierarchical architectures and introduce an inductive locality bias into video transformers. Uniformer and Video Mobile-Former propose to integrate 3D CNNs and a spatial-temporal self-attention mechanism for efficiency. To reach convincing performance on video understanding tasks, most video transformers require model weights pretrained on large-scale image datasets. In this paper, we explore the self-supervised pretraining of video transformers and show that the pretraining strategy significantly influences downstream performance, which is orthogonal to the transformer architecture design.

Self-supervised video representation learning. Early works on self-supervised video representation learning focus on designing pretext tasks based on the temporal structure of videos. More recently, contrastive learning, which forces different views of the same image sample to be closer in the feature space while pushing the views of different images farther apart, has become a new paradigm of representation learning, and some works design contrastive learning methods for the video domain by exploring effective spatial-temporal augmentations. However, as the supervision in contrastive learning is applied to the global representation, it cannot model local relationships well or learn fine-grained local representations.

Masked visual modeling. Masked language modeling has been one of the dominant pretraining methods for language transformers. With the success of vision transformers, masked visual modeling has been introduced to self-supervised visual pretraining and has also proven helpful for multimodal visual-language learning. Following BERT pretraining, BEiT and PeCo pretrain ViT by predicting the discrete visual tokens of masked patches, which are encoded by a pretrained VQ-VAE. MAE proposes an asymmetric encoder-decoder framework for the reconstruction of pixels, which reduces the computational cost of masked image modeling significantly. SimMIM and MaskFeat propose to recover low-level features of masked patches like pixels or HOG features with hierarchical ViT. In contrast, iBOT, BootMAE and sdAE adopt an exponential moving average of the student model as the online teacher model, which makes the target features bootstrapped during training. In the video domain, some pioneering works extend masked image modeling to masked video modeling. BEVT proposes a two-stream joint pretraining framework that predicts discrete tokens with both an image transformer and a video transformer. VideoMAE and ST-MAE follow MAE and reconstruct the pixels of masked video patches with an extremely high masking ratio. Unlike most previous works on masked video modeling, our MVD focuses on masked feature modeling with high-level features as targets, and finds that student models distilled from image and video teacher models have different properties and complement each other.

Knowledge distillation. Knowledge distillation aims to transfer the knowledge of a teacher model to a student model by adopting the output of the teacher model as the target for training the student model. Typical knowledge distillation works mainly focus on supervised learning, e.g., image classification. Recently, self-supervised knowledge distillation has also been studied to learn representations from self-supervised pretrained models. In this paper, we present the first attempt to use image and video models pretrained by masked visual modeling as the source of masked feature prediction targets in the video domain. It shows that self-supervised MIM pretrained models can further bootstrap masked video pretraining and bring significant performance gains.

Method

While masked video modeling has demonstrated promising performance for self-supervised learning, most existing approaches reconstruct relatively low-level information in the form of raw pixels, low-level features like HOG, or VQVAE tokens. In this paper, instead of reconstructing low-level information, we conduct masked video modeling at the feature level. This is achieved by a two-stage framework, MVD, optimized to predict high-level features that are derived from off-the-shelf MIM pretrained image models and MVM pretrained video models, which are readily available. Below, we first give an overview of the masked feature modeling paradigm in Sec. 3.1 and then introduce our proposed MVD in Sec. 3.2. Finally, we present the architectural design of MVD in Sec. 3.3.

3.1 Masked Feature Modeling

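In masked feature modeling, an encoder $f$ processes the visible tokens and a decoder $g$ predicts the masked content from the encoded tokens together with the mask tokens:

$$Y = g(\mathrm{concat}(f(X_{vis}), T_{m}))$$

where $X_{vis}$ denotes the visible input tokens, and $T_{m}$ denotes the mask tokens. The subset of output tokens from the decoder that corresponds to the input mask tokens contains the reconstructed information of the masked tokens. The reconstruction target for each masked patch $X(p)$ is represented as a patch feature $h(X(p))$. Here, $h$ is a function that generates the target features, e.g., for pixel regression $h$ produces the low-level RGB values of the pixels in the patch. Then, to train the encoder and the decoder, a loss function that measures the distance $D$ between the ground-truth features of the masked patches and the reconstructed ones is defined as:

$$L_{mfm}(h) = \frac{1}{|M|} \sum_{p \in M} D\big(Y(p),\, h(X(p))\big)$$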

where $p$ is the token index and $M$ is the set of masked tokens. For pixel regression in MAE and VideoMAE, the L2 distance is used as the distance metric.
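
As a minimal sketch of the masked feature modeling loss described above, the following Python snippet averages a distance over the masked token set only. The function names, the list-based token layout, and the choice to average the squared L2 distance over feature dimensions are illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def squared_l2(pred: Vector, target: Vector) -> float:
    """Squared L2 distance, averaged over feature dimensions (assumed normalization)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def masked_feature_loss(
    outputs: Sequence[Vector],
    targets: Sequence[Vector],
    masked_indices: Sequence[int],
    distance: Callable[[Vector, Vector], float] = squared_l2,
) -> float:
    """Average the distance D between prediction and target over the masked set M.

    outputs[p] is the reconstructed feature for token p and targets[p] is h(X(p)).
    Visible tokens do not contribute to the loss.
    """
    if not masked_indices:
        raise ValueError("the masked set M must not be empty")
    total = sum(distance(outputs[p], targets[p]) for p in masked_indices)
    return total / len(masked_indices)
```

Passing a different `distance` (for example a Smooth L1 function) changes the metric without touching the masking logic.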

3.2 Masked Video Distillation

In this paper, we propose Masked Video Distillation (MVD), which performs masked feature modeling on videos using high-level features as opposed to low-level pixels. In particular, we simply use the outputs generated by off-the-shelf self-supervised pretrained image or video models, which are readily available, as reconstruction targets. These high-level features, serving as targets of the mask-and-predict task, are encoded by teacher models pretrained by masked visual modeling like MAE or VideoMAE. For video representation learning, the reconstruction targets can take the form of spatial features encoded by image teacher models, or spatial-temporal features encoded by video teacher models. More specifically, the image teacher is pretrained by masked image modeling, while the video teacher is pretrained by masked video modeling, both of which aim at reconstructing raw pixels. Once the teachers are trained, we use the image encoder $h_{img}$ to generate the spatial targets, and the pretrained video transformer encoder $h_{vid}$ to generate the spatial-temporal targets. The loss functions of MVD with the image teacher and the video teacher can be denoted by $L_{mfm}(h_{img})$ and $L_{mfm}(h_{vid})$, respectively.

Spatial-temporal Co-teaching. When performing MVD with a single teacher, we observe that students distilled from different teachers learn different video representations and perform well on different kinds of downstream video tasks. To improve the performance of MVD across different downstream video tasks, we propose spatial-temporal co-teaching, which exploits information from both image and video teachers so that the student model can better handle videos of different types. For instance, videos with rapidly changing human actions require more temporal information, while spatial clues might be sufficient for relatively static videos. To this end, MVD is trained to predict the target high-level features produced by the image teacher and the video teacher at the same time. This is achieved by using two separate decoders to reconstruct the different target features. The final loss of MVD with spatial-temporal co-teaching is:

$$L_{MVD} = \lambda_{1} L_{mfm}(h_{img}) + \lambda_{2} L_{mfm}(h_{vid})$$

where $\lambda_{1}$ and $\lambda_{2}$ denote the hyper-parameters that balance the weights of the image teacher and the video teacher. The pseudo code of MVD is shown in Algorithm 1.
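
The weighted combination of the two teacher branches can be sketched as follows. This is an illustrative sketch rather than the paper's Algorithm 1: the decoders are passed in as plain callables, `mean_abs_error` is a placeholder distance, and the default weights of 1.0 are assumptions, not the paper's hyper-parameter values.

```python
from typing import Callable, List

Tokens = List[List[float]]


def mean_abs_error(pred: Tokens, target: Tokens) -> float:
    """Placeholder distance: mean absolute error over all tokens and dimensions."""
    diffs = [abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t)]
    return sum(diffs) / len(diffs)


def co_teaching_loss(
    latent: Tokens,
    decode_img: Callable[[Tokens], Tokens],
    decode_vid: Callable[[Tokens], Tokens],
    target_img: Tokens,
    target_vid: Tokens,
    lambda_1: float = 1.0,  # assumed weight for the image-teacher branch
    lambda_2: float = 1.0,  # assumed weight for the video-teacher branch
    distance: Callable[[Tokens, Tokens], float] = mean_abs_error,
) -> float:
    """Combine the two masked feature losses from one shared encoder output.

    Both decoders read the same latent tokens but predict different targets:
    spatial features from the image teacher and spatial-temporal features
    from the video teacher.
    """
    loss_img = distance(decode_img(latent), target_img)
    loss_vid = distance(decode_vid(latent), target_vid)
    return lambda_1 * loss_img + lambda_2 * loss_vid
```

Setting either weight to zero recovers distillation from a single teacher.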

3.3 Architectural Design

Mask strategy. For MVD, following prior work, we adopt tube masking for masked feature modeling. First, a 2D random mask is generated and then extended along the temporal dimension. Therefore, the spatial mask on each time slice is the same, which prevents information leakage between frames. Tube masking with a high masking ratio (e.g., 90%) encourages the video transformer to model high-level semantics during pretraining.
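
A minimal sketch of tube masking as described above: one 2D random mask is sampled and then repeated for every time slice. The function name, the rounding rule for the number of masked positions, and the 8 x 14 x 14 token grid used in the usage note are illustrative assumptions.

```python
import random
from typing import List, Optional


def tube_mask(
    num_time_slices: int,
    height: int,
    width: int,
    mask_ratio: float,
    seed: Optional[int] = None,
) -> List[List[bool]]:
    """Return a [T][H*W] boolean mask (True = masked) shared by all time slices."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    num_spatial = height * width
    # Rounding to the nearest integer is an assumption of this sketch.
    num_masked = int(round(mask_ratio * num_spatial))
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    spatial_mask = [i in masked_positions for i in range(num_spatial)]
    # Extend the same 2D mask along the temporal dimension.
    return [list(spatial_mask) for _ in range(num_time_slices)]
```

For example, `tube_mask(8, 14, 14, 0.9)` masks 176 of the 196 spatial positions in each of 8 time slices, under the assumed grid size. Because every slice uses the same positions, a masked patch cannot be recovered by copying the same location from a neighboring frame.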

Decoder. For MVD, the shallow decoders consist of vanilla transformer layers and a linear projection layer. The transformer layers in the decoders are the same as those in the encoder. Since spatial-temporal co-teaching introduces two different reconstruction targets for masked feature modeling, two separate decoders that share the same architecture but have different weights are placed on top of the encoder. Learnable mask tokens corresponding to the masked patches are concatenated with the visible tokens from the encoder before being fed into the decoder. After jointly modeling the spatial-temporal relationships, the output tokens of the transformer layers are mapped to the final predictions by the linear projection layer.
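
The concatenation step can be illustrated with a short sketch. The helper name, the visible-first ordering, and the idea of returning the original position of each token (so that positional information can be attached afterwards) are assumptions made for illustration, not details confirmed by the text.

```python
from typing import List, Sequence, Tuple


def build_decoder_input(
    visible_tokens: Sequence[Sequence[float]],
    visible_positions: Sequence[int],
    masked_positions: Sequence[int],
    mask_token: Sequence[float],
) -> Tuple[List[List[float]], List[int]]:
    """Concatenate encoder outputs with copies of one shared learnable mask token.

    Returns the decoder input tokens and, for each token, the index of the
    patch it stands for in the original sequence.
    """
    if len(visible_tokens) != len(visible_positions):
        raise ValueError("one position is required per visible token")
    tokens = [list(t) for t in visible_tokens]
    tokens += [list(mask_token) for _ in masked_positions]
    positions = list(visible_positions) + list(masked_positions)
    return tokens, positions
```

With co-teaching, the same decoder input would be passed to both decoders, each followed by its own linear projection.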

Reconstruction targets. For generating spatial-temporal target features, the video teacher, which shares the same architecture as the student model, is pretrained in the VideoMAE manner on the video dataset. For obtaining spatial targets, we adopt the vanilla image ViT pretrained by masked image modeling on the image dataset (e.g., ImageNet-1K). It is worth noting that one 3D patch (of size $2\times 16\times 16$) for the video transformer corresponds to two 2D patches (of size $16\times 16$) for the image transformer. Following prior work, we predict the spatial features of a single time slice (that is, the front 2D patch), which reduces the size of the prediction layer.
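
The correspondence between 3D patches and 2D patches can be made concrete with a small sketch that picks the front frame of every temporal tubelet. The helper names are hypothetical, and the assumption that image-teacher features are available per frame and then subsampled is ours, not stated in the text.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def front_frame_indices(num_frames: int, tubelet_size: int = 2) -> List[int]:
    """Index of the first (front) frame covered by each temporal tubelet."""
    if tubelet_size <= 0 or num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be a positive multiple of tubelet_size")
    return list(range(0, num_frames, tubelet_size))


def select_image_targets(per_frame_features: Sequence[T], tubelet_size: int = 2) -> List[T]:
    """Keep only the image-teacher features of the front frame of each tubelet."""
    indices = front_frame_indices(len(per_frame_features), tubelet_size)
    return [per_frame_features[i] for i in indices]
```

With a 16-frame clip and a temporal patch size of 2, this yields 8 time slices of spatial targets, matching the temporal resolution of the video transformer's tokens.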

Experiments

In this section, we first introduce the experimental setup in Sec. 4.1, and then present the main results in Sec. 4.2, followed by an extensive analysis to verify the effectiveness of different components in Sec. 4.3.

4.1 Experimental Setup

Dataset. We pretrain the vanilla ViT with MVD on Kinetics-400 by default, and evaluate the learned model on four video recognition downstream tasks: (a) Kinetics-400 (K400), which consists of ${\sim}240K$ training videos and ${\sim}20K$ validation videos with an average duration of 10 seconds. All video clips are labeled into 400 classes. (b) Something-Something V2 (SSv2), which contains ${\sim}160K$ videos for training and ${\sim}20K$ videos for validation. The videos in SSv2, with an average duration of 4 seconds, are labeled into 174 motion-centric categories. (c) UCF101 is a relatively small dataset, consisting of ${\sim}9.5K$ training videos and ${\sim}3.5K$ validation videos. (d) HMDB51 is also a small video dataset that contains around 3.5K/1.5K train/val videos. On UCF101 and HMDB51, we follow the commonly used protocols and evaluate our method across all 3 train/val splits. We also transfer pretrained models to AVA, a challenging spatial-temporal action detection dataset.

Implementation details. Our MVD is performed on vanilla ViTs with different capacities (i.e., ViT-S, ViT-B, ViT-L). By default, image teacher models are pretrained on ImageNet-1K for 1600 epochs and video teachers are pretrained on K400 for 1600 epochs. We follow the training strategy in MAE and VideoMAE for image teachers and video teachers respectively. In the distillation stage, student models are first pretrained from scratch on K400 for 400 epochs unless mentioned otherwise. The resulting models are then finetuned on downstream video tasks. The video clip length is 16 for both pretraining and finetuning. We adopt AdamW optimizer and Smooth L1 loss for the optimization of student models. We conduct experiments of pretraining on 32 NVIDIA V100 GPUs and experiments of finetuning on 16 NVIDIA V100 GPUs. More details are included in supplementary materials.
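
For reference, a Smooth L1 loss of the kind mentioned above can be sketched in a few lines. The threshold `beta = 1.0` and the averaging over feature dimensions are assumptions of this sketch; the text does not state the exact configuration used for MVD.

```python
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Smooth L1 loss: quadratic for small errors, linear for large ones.

    For |d| < beta the per-element loss is 0.5 * d^2 / beta,
    otherwise it is |d| - 0.5 * beta. The result is averaged over elements.
    """
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    total = 0.0
    for a, b in zip(pred, target):
        diff = abs(a - b)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)
```

Compared with a plain L2 distance, the linear regime limits the influence of large feature errors on the gradient.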

4.2 Main Results

Students distilled from different teachers. Unlike masked image modeling, masked feature modeling on video data has more choices of reconstruction targets. Besides spatial features, to include temporal dynamics in the reconstruction target, we can also adopt spatial-temporal features encoded by pretrained video models. In Table 1, we compare students distilled from the image teacher and the video teacher on K400, a downstream task that mainly relies on spatial clues, and SSv2, a temporally-heavy downstream task. Our observations are as follows: (a) Masked feature modeling with high-level features as targets achieves convincing performance on downstream video tasks, and outperforms the VideoMAE baseline significantly (compared with the baseline results in Table 2). In particular, with both image and video teachers using a ViT-S as the backbone, MVD achieves consistent gains over VideoMAE on both K400 (80.6% vs. 79.0%) and SSv2 (70.7% vs. 66.4%). (b) Students distilled from the image teacher achieve higher top-1 accuracy on K400, while students distilled from the video teacher perform better on SSv2. For example, ViT-S achieves an accuracy of 80.4% and 69.4% on K400 and SSv2, respectively, using an image teacher. With a video teacher, on the other hand, ViT-S reaches a top-1 accuracy of 80.1% and 70.0%, respectively. As it has been shown that videos in K400 are less sensitive to temporal modeling than those in SSv2, the results demonstrate that students learn stronger spatial representations from the image teacher, while the video teacher transfers more knowledge about temporal dynamics to students.

Co-teaching outperforms distilling with a single teacher. To improve the performance on different kinds of downstream video tasks, we introduce spatial-temporal co-teaching in MVD, which trains the model to predict spatial features and spatial-temporal features of masked patches in a decoupled way. The results in Table 1 indicate that students distilled with spatial-temporal co-teaching outperform students distilled from either single teacher on both the spatially-heavy task and the temporally-heavy task.

MVD outperforms VideoMAE baseline significantly. In Table 2, MVD with spatial-temporal co-teaching is compared with VideoMAE pretrained on K400. When the size of teacher models is the same as that of students, MVD outperforms VideoMAE by a clear margin on both K400 and SSv2. Larger models as teachers can further boost the performance of MVD. It is worth mentioning that not only is our MVD particularly effective for relatively small models, but also it improves the performance of large vision models like ViT-L. For example, with a ViT-L model as the student model, MVD achieves 86.0% and 76.1% on K400 and SSv2, surpassing the VideoMAE model by 0.8% and 2.1%, respectively.

Comparison with state-of-the-art. We compare MVD with prior studies on four video recognition tasks. Results on K400 are shown in Table 3. Our MVD outperforms previous self-supervised methods with similar or lower computational cost. Even compared with video transformers pretrained on ImageNet-21K, MVD achieves superior performance. In particular, MVD-H achieves an accuracy of 87.2% on K400, outperforming previous top-performing methods by clear margins. Table 4 presents comparisons to the state-of-the-art methods on SSv2. We observe that for downstream tasks depending on temporal relationship modeling, self-supervised methods based on masked video modeling achieve better performance than supervised methods (cf. results in the middle group of Table 4 vs. results in the top group). Once again, our MVD, producing an accuracy of 76.1% with a large model, beats both supervised and self-supervised methods by clear margins. With more training epochs (i.e., 800 epochs), MVD with ViT-L achieves more significant performance gains (i.e., 1.2% and 2.4% on K400 and SSv2) compared with VideoMAE. When a huge model is used, the performance can be boosted further, and MVD achieves 77.3% top-1 accuracy on SSv2.

We also evaluate the transfer learning ability of MVD on two relatively small datasets, UCF101 and HMDB51. As shown in Table 6, MVD with ViT-B obtains higher accuracy than prior works based on well-designed pretext tasks, contrastive learning and masked video modeling. In particular, compared to the original VideoMAE ViT-B teacher model, we achieve 0.9% and 3.1% higher accuracy on these two datasets, respectively. Additionally, when larger teachers are adopted, MVD achieves stronger transfer learning performance.

When transferred to the more complicated action detection task (AVA v2.2), MVD still shows remarkable improvement over previous methods, as shown in Table 5. For example, without the additional labels of K400, MVD with ViT-L outperforms VideoMAE by 3.4 mAP and achieves 37.7 mAP. When we intermediately finetune the pretrained models on K400, MVD with ViT-L also achieves a significant performance improvement (i.e., 1.7 mAP) over VideoMAE. Finally, with a ViT-Huge model, MVD achieves 41.1 mAP, an improvement of 1.6 mAP over the prior state-of-the-art method.

4.3 Analysis and Discussion

In this section, we provide an in-depth analysis of the effectiveness of different components in MVD.

Analysis of features encoded by different teachers. The properties of target features generated by different teachers may influence the performance of students on different downstream tasks. To quantify the temporal dynamics that teacher models capture from the input video, we measure the cosine similarity between the feature maps of different frames of each input video clip. As the similarity matrices in Figure 3 show, for image teachers, the feature maps of different frames are almost the same. For video teachers, however, the features of different frames differ more. This indicates that video teachers capture more temporal differences. Therefore, students distilled from video teachers can learn stronger temporal dynamics and perform better on temporally-heavy downstream tasks.
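
A cross-frame similarity matrix of this kind can be computed as in the sketch below. It assumes that the feature map of each frame has already been flattened into a single vector; how the features are pooled or flattened for Figure 3 is not specified in the text.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def cross_frame_similarity(frame_features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarity between the (flattened) features of all frames."""
    n = len(frame_features)
    return [
        [cosine_similarity(frame_features[i], frame_features[j]) for j in range(n)]
        for i in range(n)
    ]
```

Off-diagonal entries close to 1 indicate features that barely change across frames, which is the pattern described above for image teachers.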

Training time comparison. We study whether MVD achieves a better balance of accuracy and efficiency than VideoMAE. For a fair comparison, the training time of the teacher models is also counted in the total training time of MVD. Results are shown in Table 7. We see that MVD achieves better accuracy (i.e., 81.9%) with a total of 164 hours of training, which is 50 hours less than VideoMAE trained for 1600 epochs (which produces an accuracy of 81.5%).

Reconstruction signals in MVD. In MVD, we first pretrain teacher models by recovering the pixels of masked patches in the MAE manner, then adopt the features produced by the teacher models as the targets of masked feature modeling. In Table 8, we study whether to include an additional decoder branch for reconstructing the pixels of masked patches in the distillation stage. The experimental results show that, for both MVD with a single teacher and MVD with spatial-temporal co-teaching, additionally reconstructing low-level targets degrades the performance on downstream tasks. Therefore, we only reconstruct the high-level features of masked patches in the distillation stage of MVD.

Comparison with bootstrapped teachers. Some recent approaches to image representation learning adopt the features of a momentum encoder as the targets of masked image modeling, while we use frozen teacher models in MVD. In Table 9, we compare fixed teachers with bootstrapped teachers that are updated by an exponential moving average of the online encoder during pretraining. Following a masked image modeling method, two strong baselines with bootstrapped teachers are built for video representation learning: (a) The student model is trained from scratch with masked feature modeling, and the target features are generated by a momentum encoder. (b) The framework of bootstrapped teachers is first pretrained on IN-1K for 800 epochs, then the pretrained weights are used to initialize the video pretraining. As shown in Table 9, MVD with a frozen teacher beats the method with a bootstrapped teacher on the downstream video tasks, even when only a single teacher is utilized.
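
The exponential moving average update used by bootstrapped teachers can be sketched as follows. The flat list of parameters and the momentum value are illustrative assumptions; the baselines in Table 9 may use a different schedule.

```python
from typing import List, Sequence


def ema_update(
    teacher_params: Sequence[float],
    student_params: Sequence[float],
    momentum: float = 0.999,  # assumed value, for illustration only
) -> List[float]:
    """Move each teacher parameter a small step toward the student's value."""
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if len(teacher_params) != len(student_params):
        raise ValueError("teacher and student must have the same number of parameters")
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]
```

A frozen teacher, as used in MVD, corresponds to never applying this update, so the targets stay fixed throughout distillation.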

Comparison with feature distillation. In previous work on self-supervised feature distillation, the distillation loss is directly computed over the full feature maps of teachers and students. Accordingly, we build a baseline method named per-token distillation. Specifically, the output features of students are projected by an MLP, and then forced to mimic the teacher's features at each token with a Smooth L1 loss. As shown in Table 10, masked feature reconstruction in our MVD outperforms per-token distillation on both K400 and SSv2.

Conclusion

In this paper, we study masked video distillation upon MIM pretrained image or video transformers. We have three interesting findings: 1) Using MIM pretrained image transformers or MVM pretrained video transformers as teachers to supervise masked feature prediction can significantly boost the finetuning performance on video downstream tasks; 2) The representations distilled from image and video teachers have different properties, i.e., image teachers benefit spatially-heavy video tasks more, while video teachers benefit temporally-heavy video tasks more; 3) Combining image and video teachers exploits their synergy and thus produces higher performance. Even though the proposed masked video distillation is very straightforward, we hope these findings can motivate more thinking about masked video pretraining.

Appendix A Additional Results

We perform masked reconstruction of high-level features for the image ViT on ImageNet-1K. For masked feature modeling on image data, only the image teacher in MVD can be used. As shown in Table 11, compared with the MAE baseline, masked feature distillation achieves a 0.4% Top-1 accuracy gain on ImageNet-1K. Comparing the improvement over masked pixel reconstruction between image models and video models, we observe that MVD achieves greater performance gains on video downstream tasks.

Appendix B Implementation Details

We pretrain image teacher models on ImageNet-1K following the strategy in MAE, and pretrain video teacher models on Kinetics-400 following the strategy in VideoMAE. For the distillation stage in MVD, we distill student models with teacher models for 400 epochs on Kinetics-400 unless otherwise stated. The length of input videos is 16 frames during pretraining. We adopt tube masking, and the masking ratio in the distillation stage is 90%. The default pretraining setting is presented in Table 12.

B.2 Finetuning Experiments

We transfer models pretrained by MVD on Kinetics-400 to video downstream tasks with the default setting in Table 13.

Kinetics experiments. When finetuning on Kinetics-400, we adopt dense sampling following prior work, and the default length of input videos is 16 frames. For inference, we use 3 spatial crops $\times$ 5 temporal clips.

Something-Something v2 experiments. When finetuning on Something-Something v2, we adopt uniform sampling following prior work, and the default length of input videos is 16 frames. For inference, we use 3 spatial crops $\times$ 2 temporal clips.

UCF101 and HMDB51 experiments. When finetuning on UCF101 and HMDB51, we adopt dense sampling, and the default length of input videos is 16 frames. For inference, we use 3 spatial crops $\times$ 5 temporal clips.
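
The crop $\times$ clip settings above determine how many views of each video are evaluated at inference. The sketch below counts the views and aggregates their class scores by simple averaging; averaging is a common choice and an assumption here, since the text does not state the aggregation rule.

```python
from typing import List, Sequence


def num_views(spatial_crops: int, temporal_clips: int) -> int:
    """Total number of views evaluated per video."""
    return spatial_crops * temporal_clips


def average_view_scores(view_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class scores over all views of one video."""
    if not view_scores:
        raise ValueError("at least one view is required")
    count = len(view_scores)
    num_classes = len(view_scores[0])
    return [sum(view[c] for view in view_scores) / count for c in range(num_classes)]


def predict(view_scores: Sequence[Sequence[float]]) -> int:
    """Class index with the highest averaged score."""
    averaged = average_view_scores(view_scores)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

Under these settings, each Kinetics-400, UCF101 or HMDB51 video is scored on `num_views(3, 5)` = 15 views and each Something-Something v2 video on `num_views(3, 2)` = 6 views.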

AVA experiments. When finetuning on AVA v2.2, we follow prior work in adopting the detection architecture and the detected person boxes from AIA. The default length of input videos is 16 frames. We also use the default finetuning setting of prior work for a fair comparison.

Appendix C Visualization

In our paper, to quantify the temporal dynamics that models capture from the input video, we study the similarity between feature maps across different frames of each input video clip via the cosine similarity.

Analysis of features encoded by different teachers. The properties of target features generated by different teachers may influence the performance of students on different downstream tasks. As the similarity matrices in Figure 4 show, for image teachers, the feature maps of different frames are almost the same. For video teachers, however, the features of different frames differ more. This indicates that video teachers capture more temporal differences. Therefore, students distilled from video teachers can learn stronger temporal dynamics and perform better on temporally-heavy downstream tasks.

Analysis of features encoded by students distilled from different teachers. To study what students learn from different teachers, we visualize the feature similarity across different frames for student models. As the results in Figure 5 show, we observe that (a) for the student distilled from the image teacher, the features of different frames differ more than those encoded by the image teacher itself. This indicates that students can learn temporal dynamics from the masked reconstruction of spatial features on videos. (b) For the student distilled from the video teacher, the features of different frames differ more than those encoded by the student distilled from the image teacher. This demonstrates that students learn stronger temporal dynamics from video teachers.