Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

cs.CV cs.AI

Introduction

One major gap between artificial intelligence and human intelligence lies in their abilities to generalize and perform well on new tasks with limited annotations. Recent advances in large-scale pre-trained generative language models have shown promising few-shot capabilities in understanding natural language. However, few-shot video-language understanding is still in its infancy. A particular limitation of most recent video-language pretraining frameworks is that they are encoder-only, which means they cannot generate text from videos for purposes such as captioning, question answering, and future prediction. Meanwhile, unified video-language models that are capable of language decoding still rely heavily on finetuning with a large number of manually annotated video-text pairs, and therefore cannot adapt quickly to unseen tasks. Few-shot video-to-text decoding is challenging because the natural language supervision for learning video-language representations is typically based on subtitles and automatic speech recognition (ASR) transcripts, which differ significantly from downstream tasks in terms of distribution and may have poor semantic alignment across the vision and text modalities.

We propose to address this problem by harnessing the few-shot power of frozen large-scale language models, such as InstructGPT. Our inspiration is derived from the fact that humans are excellent visual storytellers, with the ability to piece together a coherent story from a few isolated images. To mimic this, we propose VidIL, a few-shot Video-language Learner via Image and Language models. VidIL uses image models to provide information about the visual content in the video (and optionally uses ASR to represent speech), and then instructs language models to generate a video-based summary, answer, or other target output for diverse video-language tasks.

The main challenge of understanding videos is that they contain rich semantics and temporal content at multiple granularities. Unlike static images, which depict objects, attributes and events in a snapshot, the temporal dimension of videos further conveys the state changes of the objects, actions, and events. For example, in Figure 1, the individual frame captions of the video clip only describe static visual features such as "a person holding a green object in hand". In contrast, a correct video-level description would be "a woman makes realistic looking leaves and flowers for a cake", which involves reasoning over a collection of objects and events that occur at different timestamps in the video clip, such as "cake decorating" and "flowered design". Hence, to inform video-level descriptions and queries, we need to represent all of this information and its temporal ordering.

To address the unique challenges of videos, we propose to decompose a video into three levels: the video output, frame captions, and visual tokens (including objects, events, attributes). One major benefit of this hierarchical video representation is that we can separate the visual and temporal dimensions of a video. We leverage frozen image-language foundation models at the lower levels to collect salient visual features from the sparsely sampled frames. Specifically, we first leverage a pretrained image-language contrastive model, CLIP, to perform visual tokenization, based on the similarity score between frames and tokens of objects, events and attributes. The tokenization is done under the guidance of semantic role labeling, which provides us with candidate events with involved objects and related attributes. Next, in order to capture the overall semantics at the frame level, we employ the pretrained image captioner in the image-language model BLIP to obtain frame captions. Finally, we instruct a pretrained large language model using in-context learning to interpret visual tokens and frame captions into the target textual output. In detail, we temporally order visual tokens and frame captions using specially designed prompts such as “First…Then…Finally”, to instruct the pretrained language model to track the changes of objects, events, attributes and frame semantics along the temporal dimension.

Without pretraining or finetuning on any video datasets, we show that our approach outperforms both video-language and image-language state-of-the-art baselines on few-shot video captioning and question answering tasks. Moreover, on video-language event prediction, our approach significantly outperforms fully-supervised models while using only 10 labeled examples. We further demonstrate that our generative model can benefit broader video-language understanding tasks, such as text-video retrieval, via pseudo label generation. Additionally, we show that our model is highly flexible in adding new modalities, such as ASR transcripts.

Related Work

Large-scale image-language pretraining models optimize image-text matching through contrastive learning and multimodal fusion. Recently, BLIP proposed a bootstrapping image-language pretraining framework with a captioner and a filterer, which has shown promising performance on various image-language tasks. However, video-language pretraining is still hindered by noisy and domain-specific video datasets. Naturally, researchers have started to explore transferring the rich knowledge from image models to videos. Different from the traditional way of representing videos by 3D dense features, recent work shows that sparse sampling is an effective way to represent videos, which facilitates applying pre-trained image-language models to video-language tasks. Specifically, the image-language model BLIP sets a new state of the art on zero-shot retrieval-style video-language tasks, such as video retrieval and video question answering. However, for generation-style tasks such as domain-specific video captioning, the video-language model UniVL still leads in performance but relies heavily on fine-tuning. In this work, we extend the idea of leveraging image-language models to a wide variety of video-to-text generation tasks. We further connect image-language models with language models, which gives our model strong generalization ability. We show that the knowledge from both image-language pretraining and language-only pretraining can benefit video-language understanding in various aspects.

2.2 Unifying MultiModal Tasks with Language Models

The community has recently paid much attention to connecting different modalities with a unified representation. Text-only generation models, such as T5, have been extended to vision-language tasks by text generation conditioned on visual features. In order to fully leverage the generalization power of pretrained language models, prior work represents images using text in a fully symbolic way. Other work includes more modalities such as video and audio, but requires annotated video-text data to jointly train the language model with the video and audio tokenizer. In this work, we propose a temporal-aware hierarchical representation for describing a video textually. To our knowledge, ours is the first work to prompt a frozen language model for tackling few-shot video-language tasks with a unified textual representation. Concurrent work Socratic uses a zero-shot language-based world-state history to represent long videos with given time stamps, while our model can quickly adapt to different video and text distributions with few examples. Furthermore, we show that by injecting temporal markers into the prompt we can make a pre-trained language model understand fine-grained temporal dynamics in video events. Compared with the concurrent work Flamingo, which requires dedicated vision-language post-pretraining, our framework does not require pretraining or finetuning on any video data. Our framework is simple and highly modular, and all of its components are publicly available. Additionally, our framework is more flexible in adding new modalities, e.g., automatic speech recognition, without the need for complex redesign.

Method

We propose a hierarchical video representation framework which decomposes a video into three levels, i.e., visual token level, frame level and video level. The motivation is to separate the spatial and temporal dimensions of a video in order to leverage image-language and language-only foundation models, such as CLIP and GPT-3. All three levels use a unified textual representation, which enables us to leverage the powerful few-shot ability of pretrained language models.

Following prior work, we first perform sparse sampling to obtain several video frames. Unless otherwise specified, we sample 4 frames for the frame level and 8 frames for the visual token level. We then feed each frame into a pre-trained image-language model to obtain frame-level captions. An example can be found in the blue part of Figure 2. In our experiments, we use BLIP, a recent image-language framework containing both an image-grounded encoder and decoder, for generating frame captions. Following prior work, we perform both captioning and filtering on each frame. However, as mentioned in Section 1, videos contain rich semantics and temporal content at multiple granularities. Frame captions alone are not enough to generate video-level target text such as video captions. Thus, we further perform visual tokenization for each frame to capture features at a finer granularity.
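The sparse sampling step can be sketched as follows. This is a minimal illustration that assumes uniform sampling from the midpoint of equal-length segments; the text does not specify the exact sampling scheme, and the function name `sample_frame_indices` is hypothetical.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick one frame index from the middle of each of `num_samples` equal segments.

    Assumption: uniform midpoint sampling; the paper only says "sparse sampling".
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * (i + 0.5))) for i in range(num_samples)]


# 4 frames for frame-level captions, 8 frames for visual tokenization.
caption_frames = sample_frame_indices(120, 4)
token_frames = sample_frame_indices(120, 8)
print(caption_frames)
print(token_frames)
```

Using two different sampling densities mirrors the defaults stated above: captions are comparatively expensive and verbose, so fewer frames are captioned than are tokenized.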

3.2 Visual Token Level: Structure-Aware Visual Tokenization

At this level, we aim to extract the textual representations of salient visual token types, such as objects, events and attributes. We found that pre-defined classes for classification, such as those in ImageNet, are far from enough to cover the rich semantics in open-domain videos. Thus, instead of using classification-based methods for visual tokenization as in previous work, we adopt a retrieval-based visual tokenization approach by leveraging pre-trained contrastive image-language models. Given a visual token vocabulary which contains all candidate object, event, and attribute text phrases, we compute the image embedding of a frame and the text embeddings of the candidate visual tokens using a contrastive multi-modal encoder, CLIP. We then select the top 5 visual tokens per frame based on the cosine similarity of the image and text embeddings. An example of the extracted object tokens can be found in the green part of Figure 2.
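The retrieval step itself is a top-k selection by cosine similarity. The sketch below illustrates it with hand-written toy vectors in place of real CLIP embeddings; the function names and the example vocabulary are illustrative assumptions, not part of the original implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_tokens(frame_emb: Sequence[float],
                 vocab_embs: Dict[str, Sequence[float]],
                 k: int = 5) -> List[Tuple[str, float]]:
    """Return the k vocabulary phrases most similar to the frame embedding."""
    scored = [(token, cosine(frame_emb, emb)) for token, emb in vocab_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d embeddings standing in for CLIP image/text embeddings.
frame = [1.0, 0.0]
vocab = {
    "bath toy": [0.9, 0.1],
    "bathtub": [0.7, 0.7],
    "car": [0.0, 1.0],
    "dog": [-1.0, 0.0],
}
print(top_k_tokens(frame, vocab, k=2))
```

In practice the text embeddings of the whole vocabulary can be computed once and cached, so each frame costs one image encoding plus a similarity lookup.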

Unlike in images, where objects and attributes already cover most visual features, events are more informative in videos. In order to discover events from video frames, we construct our own event vocabulary by extracting event structures from Visual Genome synsets using Semantic Role Labeling (we use the keys in the Visual Genome object synsets, which contain frequent <verb, object> pairs). Specifically, we first select the phrases that contain at least one verb and one argument as events. Then we remove highly similar events based on their sentence similarity using SentenceBERT embeddings. For the object vocabulary, we adopt the full OpenImage classes ($\sim$20k), instead of using the visually groundable subset ($\sim$600) as in concurrent work. We found that using a large but noisy vocabulary is more effective than using a small but clean vocabulary in our retrieval-based setting with CLIP. For the attribute vocabulary, we adopt the Visual Genome attribute synsets. In Section 4.6, we provide an ablation study on the impact of different types of visual tokens. The statistics of the visual token vocabulary can be found in Appendix Table 8.

3.3 Video Level: Temporal-Aware Few-shot Prompting

Once we obtain the textual representation from the frame level and the visual token level, the final step is to put the pieces together to generate a video-level target text. The goal is to build a model that can be quickly adapted to any video-to-text generation task with only a few examples. To this end, we propose to leverage large-scale pre-trained language models, such as GPT-3, with a temporal-aware few-shot prompt. As shown in Figure 2, our framework can be readily applied to various video-to-text generation tasks, such as video captioning and video question answering, with a shared prompt template. The proposed prompting strategy enables a language model to attend to the lower-level visual information and to take the temporal ordering into account.

Here, we use the video captioning task depicted in Figure 2 to illustrate the details. The few-shot prompt consists of three parts: instruction, few-shot context, and task query. The instruction is a concise description of the generation task, e.g., "Generate a video caption based on the objects, events, attributes and frame captions. Example:", which has been shown to be effective in zero-shot and few-shot settings. The few-shot context contains the selected in-context examples as well as the test video instance. Each video instance is represented by the aggregated visual tokens, e.g., "Objects: First, bath toy. Then,...", the frame captions, such as "Frame Captions: First, a toddler playing in a bathtub filled with toys. Then,...", and the ASR inputs if available, e.g., "Subtitle:". To obtain video-level visual tokens, the visual tokens extracted from each frame are further ranked and ordered based on frequency and frame index; more details can be found in Appendix C. Finally, the task query is a task-specific suffix indicating the target text format, e.g., "Video Caption:". For in-context examples (omitted here for simplicity), the task query is followed by the ground truth annotation, while for the test instance, the generation starts at the end of the task query.
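The three-part prompt (instruction, few-shot context, task query) can be assembled with plain string formatting. The sketch below is illustrative only: the exact line layout, the helper names `render_instance` and `build_prompt`, and the toy example content are assumptions; the full prompts actually used are shown in the figures of Appendix B.

```python
from typing import Dict, List, Optional, Tuple

INSTRUCTION = (
    "Generate a video caption based on the objects, events, "
    "attributes and frame captions. Example:"
)


def render_instance(fields: Dict[str, str], task_query: str,
                    target: Optional[str] = None) -> str:
    """Render one video instance; in-context examples also carry their target."""
    lines = [f"{name}: {text}" for name, text in fields.items()]
    lines.append(task_query if target is None else f"{task_query} {target}")
    return "\n".join(lines)


def build_prompt(instruction: str,
                 examples: List[Tuple[Dict[str, str], str]],
                 test_fields: Dict[str, str],
                 task_query: str = "Video Caption:") -> str:
    """Instruction, then in-context examples, then the test instance ending at the query."""
    blocks = [instruction]
    blocks += [render_instance(fields, task_query, target) for fields, target in examples]
    blocks.append(render_instance(test_fields, task_query))
    return "\n\n".join(blocks)


# Made-up toy content, for illustration only.
examples = [(
    {"Objects": "First, bath toy. Then, bathtub.",
     "Frame Captions": "First, a toddler playing in a bathtub filled with toys. "
                       "Then, a toddler holding a toy."},
    "a toddler plays with toys in a bathtub",
)]
test_fields = {
    "Objects": "First, cake. Then, flower.",
    "Frame Captions": "First, a person holding a green object in hand. "
                      "Then, a cake with flowers on it.",
}
prompt = build_prompt(INSTRUCTION, examples, test_fields)
print(prompt)
```

Because the prompt ends exactly at the task query of the test instance, the language model's continuation is the target text; switching tasks only requires a different instruction and task query.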

Formally, we denote the instruction line as $\mathbf{t}$, the few-shot context as $\mathbf{c}$, the task query as $\mathbf{q}$, and the target text as $\mathbf{y}$, where $\mathbf{y}=(y_{1},y_{2},...,y_{L})$. The generation of the next target token $y_{l}$ can be modeled as:

$$y_{l}=\operatorname*{arg\,max}_{y}\;p(y\mid\mathbf{t},\mathbf{c},\mathbf{q},y_{<l})$$

In order to capture the temporal dynamics between frames and visual tokens, we further propose to inject temporal markers into the prompt. As shown in the few-shot context in Figure 2, each visual token and frame caption is prefixed with a natural language phrase indicating its temporal ordering, e.g., "First,", "Then,", and "Finally,". We found that adding the temporal markers conditions the language model on not only the literal but also the temporal information of the context. We show an example in Figure 3, where we compare our temporal-aware prompt with a static prompt on video captioning using InstructGPT. Again, the in-context examples are omitted here and can be found in Appendix B. In this example, the only difference between the two contexts is the ordering of the visual tokens and the frame captions. For the context on the left, where "sun moving" appears before "night sky", we expect to see a story about sunset, while for the context on the right, we expect to see sunrise. The static prompt generates captions about sunset for both contexts, while the temporal-aware prompt captures the temporal ordering correctly and generates sunrise for the context on the right.

Experiments

To comprehensively evaluate our model, we show results on four video-language understanding tasks in few-shot settings: video captioning, video question answering (QA), video-language event prediction, and text-video retrieval. We compare our approach with state-of-the-art approaches on five benchmarks, i.e., MSR-VTT, MSVD, VaTeX, YouCook2, and VLEP. The statistics of the datasets can be found in Table 1. For more details please refer to Appendix C.

We use CLIP-L/14 (https://huggingface.co/openai/clip-vit-large-patch14) as our default encoder for visual tokenization. We adopt the BLIP captioning checkpoint (https://github.com/salesforce/BLIP#finetuned-checkpoints) finetuned on COCO for frame captioning. We use InstructGPT as our default language model for generating text conditioned on the few-shot prompt. To construct the event vocabulary, we use the semantic role labeling model from AllenNLP (https://docs.allennlp.org/models/main/models/structured_prediction/predictors/srl/). All experiments, including few-shot finetuning of the baselines and semi-supervised training, are conducted on 2 NVIDIA V100 (16GB) GPUs.

In-context Example Selection.

From our preliminary experiments, we find that the generation performance is sensitive to the quality of in-context examples. For example, for QA tasks such as MSVD-QA where the annotations are automatically generated, the question-answer pairs in randomly selected in-context examples can be only weakly correlated with the video context. Thus, instead of using a fixed prompt for each query, we dynamically filter out the irrelevant in-context examples. Specifically, given a randomly sampled M-shot support set from the training set, we select a subset of N shots as in-context examples based on their SentenceBERT similarities with the text queries. Furthermore, we reorder the selected examples in ascending order of similarity score to account for the recency bias in large language models. For QA tasks, we choose the most relevant in-context examples by comparing with questions, while for captioning tasks, we compare with frame captions. Unless otherwise specified, we use M=10 and N=5, which we consider as 10-shot training.
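The selection and reordering procedure can be sketched as below. To keep the example self-contained, a simple word-overlap score (`token_overlap`, a hypothetical helper) stands in for SentenceBERT cosine similarity; the function names and toy support set are likewise assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a toy stand-in for SentenceBERT cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_in_context(query: str,
                      support: List[Tuple[str, str]],
                      n: int = 5,
                      similarity: Callable[[str, str], float] = token_overlap
                      ) -> List[Tuple[str, str]]:
    """Keep the n support items most similar to the query, least similar first.

    Each support item is (text_key, payload); text_key is the question for QA
    or the frame captions for captioning. Ascending order places the most
    similar example last, closest to the test instance in the prompt.
    """
    top = sorted(support, key=lambda item: similarity(query, item[0]), reverse=True)[:n]
    return list(reversed(top))


support_set = [
    ("what is the man doing", "A"),
    ("what color is the car", "B"),
    ("who is singing", "C"),
]
selected = select_in_context("what is the woman doing", support_set, n=2)
print(selected)
```

Swapping in a real sentence encoder only requires passing a different `similarity` callable; the selection and ordering logic is unchanged.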

4.2 Few-shot Video Captioning

We report BLEU-4, ROUGE-L, METEOR, and CIDEr scores on three video captioning benchmarks covering both open-domain (MSR-VTT, VaTeX) and domain-specific (YouCook2) videos. We compare with both a state-of-the-art video captioner (UniVL) and image captioner (BLIP). In order to implement the BLIP baseline for few-shot video captioning, we extend the approach used for text-video retrieval evaluation in prior work to video-language training. Specifically, we concatenate the visual features of sampled frames and then feed them into the image-grounded text-encoder to compute the language modeling loss. This is equivalent to stitching the sampled frames into a large image and then feeding it to BLIP for image captioning. We found that this simple approach results in very strong baselines.

As shown in Table 2, existing methods are strongly biased toward certain datasets. For example, UniVL performs well on YouCook2 but fails on MSR-VTT and VaTeX, while BLIP shows the opposite pattern. This is because UniVL is pretrained on HowTo100M, which favors instructional videos, i.e., YouCook2, while BLIP is pretrained on image-caption pairs, which favors description-style captions, i.e., MSR-VTT and VaTeX. In contrast, our model performs competitively on both open-domain and instructional videos, and significantly outperforms the baselines on the average CIDEr score across all three benchmarks. This indicates that by leveraging language models, we can maintain strong few-shot ability regardless of the video domain or the target caption distribution.

As discussed in Section 1, video captions describe the content in various semantic levels. The N-gram based metric may not fairly reflect the models’ performance in capturing the video-caption alignment. We further verify this hypothesis in Section 4.5. Thus, in addition to automatic metrics, we include qualitative examples illustrated in Figure 4. More examples are in Appendix A.

Additionally, for most existing methods and also concurrent work, e.g., Flamingo , adding a new modality often requires a dedicated model redesign or retraining. However, the nature of our framework, where we use a unified textual representation for each level, makes it highly flexible for incorporating new modalities. As shown in row 6 in Table, our model can effectively utilize extra information from ASR to obtain significantly better few-shot performance on certain datasets such as YouCook2.

4.3 Few-shot Video Question Answering

We compare the test accuracy of our approach with few-shot pretrained BLIP, BLIPVQA, and concurrent work Flamingo on two video question answering benchmarks, MSR-VTT_QA and MSVD_QA. BLIPVQA denotes BLIP finetuned on the VQA dataset, which is the previous SOTA on zero/few-shot video question answering. In order to have a fairer comparison with BLIPVQA, we reduce the shot number to 5 and report the average accuracy on three sets of randomly selected 5-shot examples. As shown in Table 4, our method outperforms the previous SOTA by a large margin. Compared with concurrent work Flamingo, which is post-pretrained on a large amount of video-text data, our model is training-free and has not observed any video data. Nevertheless, with only image-language and language-only knowledge, our 5-shot model is able to outperform 8-shot Flamingo-3B and achieve on-par performance with 4-shot Flamingo-80B.

4.4 Few-shot Video-Language Event Prediction

In this section, we show that our model can not only answer questions about the visual content of a video but also answer "What is more likely to happen next?". Given a video with an associated subtitle transcript as the premise, the video-language event prediction (VLEP) task is to predict the most likely future event. The original VLEP paper formulates the problem as binary classification, where the model chooses between two possible future event candidates. Instead, we formulate this problem as another video-to-text generation problem to fit into our framework. Figure 5 depicts an example with the same format as in Figure 2. Similar to the evaluation setting in QA, the generated free-form text is first mapped to one of the two candidate answers using SentenceBERT, and the accuracy is then calculated on the mapped answers. In Table 4, we report accuracy on the hidden test set of VLEP. To our surprise, our 10-shot model outperforms the state-of-the-art fully-supervised baseline, i.e., MERLOT, by a large margin ($\sim 4\%$). This shows that our model has strong few-shot ability not only on video-language understanding but also on prediction. Since event prediction relies heavily on temporal ordering, this result suggests that, with the proposed temporal-aware prompting, language models can be guided to capture temporal dynamics between historical and future events.
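The mapping from free-form generated text to one of the candidate answers can be sketched as a nearest-candidate lookup followed by standard accuracy. As a self-contained illustration, a word-overlap score (`token_overlap`, a hypothetical helper) stands in for SentenceBERT similarity, and the example sentences are made up.

```python
from typing import Callable, List, Sequence


def token_overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a toy stand-in for SentenceBERT cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def map_to_candidate(generated: str,
                     candidates: Sequence[str],
                     similarity: Callable[[str, str], float] = token_overlap) -> int:
    """Return the index of the candidate most similar to the generated text."""
    scores = [similarity(generated, candidate) for candidate in candidates]
    return scores.index(max(scores))


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


generated = "the man opens the door and leaves"
candidates = [
    "the man leaves through the door",
    "the man sits down on the couch",
]
pred = map_to_candidate(generated, candidates)
print(pred, accuracy([0, 1, 0], [0, 1, 1]))
```

This keeps the generative formulation intact: the language model never sees the two candidates as a classification choice, and the mapping is applied only at evaluation time.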

4.5 Semi-supervised Text-Video Retrieval

In addition to video-to-text generation tasks, we show that a broader range of video-language tasks can benefit from our few-shot video captioner from a data perspective. Here, we consider a low-budget semi-supervised setting where we only have a few labeled video-caption pairs and a large number of unlabeled videos. The idea is to leverage our video captioner to generate pseudo labels for training any given vision-language model. As a case study, we evaluate on two text-video retrieval benchmarks, i.e., MSR-VTT and VaTeX. We use greedy decoding to generate a pseudo caption for each video in the training set. We then train an identical base model, i.e., BLIP, using different pseudo-labeled data as well as ground truth annotations. We report Recall @ 1 and 5 for both video-to-text and text-to-video retrieval. Table 5 shows that by training on our pseudo labels, we can achieve significant improvements compared with zero-shot BLIP. We also show that the performance gain is not simply a result of training on more data, since finetuning on the pseudo labels generated by other baselines (UniVL, BLIP) is less effective and can even hurt the performance. Furthermore, on MSR-VTT Recall @ 5 we can even achieve performance comparable to the BLIP model finetuned on full ground truth annotations.
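For reference, Recall @ K can be computed from a query-by-candidate similarity matrix as sketched below. The sketch assumes a simplified one-to-one pairing in which the correct candidate for query i is candidate i; datasets with several captions per video need an explicit ground-truth mapping instead. The function name and the toy matrix are illustrative.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Fraction of queries whose ground-truth candidate is ranked in the top k.

    Assumption: sim[i][j] scores query i against candidate j, and the
    ground truth for query i is candidate i.
    """
    if not sim:
        raise ValueError("similarity matrix must be non-empty")
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (rows: queries, columns: candidates).
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.1],
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

Text-to-video and video-to-text retrieval use the same routine on the matrix and its transpose, respectively.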

Another interesting observation is that the gain of our model over the baselines is more visible on text-video retrieval than on the video captioning results in Table 2. A key factor in performing well on text-video retrieval tasks is learning a good video-text multi-modal alignment. This result shows that our pseudo labels capture richer video-text alignment that can benefit the retrieval-style downstream task. The N-gram based generation metrics, e.g., BLEU, may not be able to fully reflect the alignment information, due to the variety of semantic levels in video captions. Furthermore, from a data perspective, our video captioner can be viewed as a data augmentation tool which is capable of generating or augmenting any open-domain video-language pretraining dataset with minimal human effort. As a result, we can potentially improve video-language pretraining by constructing a cleaner and more diverse video-text corpus.

4.6 Ablation Studies

We perform comprehensive ablation studies on our few-shot prompt, including the impact of different video representations, the number of shots, and in-context selection. All ablation results are evaluated on the MSVD_QA validation set, and we report the mean and standard deviation of each setting over three sets of randomly sampled shots. For the cases with in-context example selection, we further select $5$ examples as in-context examples from the sampled shots, while for the cases without in-context selection, all shots are fed into the prompt. In the video representation ablation table, we show that adding visual tokens consistently improves the model accuracy and also reduces the variance. A lower standard deviation indicates that the model is less sensitive to the few-shot sampling.

To further demonstrate the impact of the additional temporal dimension of videos, we perform two ablations on the "Frame+Object+Event+Attribute" setting. First, we reduce the number of frame captions and visual tokens to one per video, using the frame caption and visual tokens from the middle frame. We found that the performance drops significantly compared with using the default four frames, which demonstrates the model's ability to incorporate information from multiple timestamps. Further, we found that fine-grained temporal modeling is rarely required for performing well on current video-language benchmarks. As shown in the ablation result where we reverse the order of all visual tokens and frame captions, the performance decreases only marginally, which indicates that current benchmarks may not be sufficient to reflect the benefits of better temporal ordering.

In Table 7, we first show that, with the same context length, namely $5$ in-context examples, in-context example selection significantly increases both performance and robustness. At $10$-shot and $20$-shot, directly fitting more shots into the prompt results in better performance. In-context selection achieves slightly lower performance but with significantly better efficiency due to the shorter context. Interestingly, at $30$-shot, in-context selection with $5$ examples outperforms directly adding all $30$ shots into the prompt. This shows that in-context selection can help the model utilize a larger number of noisy video examples. Nevertheless, we still observe that the benefit of adding more shots saturates at around $20$ to $30$ shots, even with in-context selection. We view how to make language models benefit from longer contexts as a remaining challenge.

Conclusions, Limitations and Future Work

This paper proposes VidIL, a few-shot Video-language Learner via Image and Language models. It demonstrates the strong ability of large-scale language models on performing video-to-text tasks when frame features are provided as unified text representations using image-language models. We propose a temporal order aware prompt by decomposing videos into a hierarchical structure, which is able to plug in multiple levels of frame features, along with speech transcripts. Without pretraining on videos, our model outperforms vision-language models learned from large-scale video datasets on a variety of few-shot tasks, such as domain-specific captioning, question answering, and future event prediction. One limitation of using unified textual representation is that we might lose low-level visual features which can be essential for some specific tasks, such as fine-grained spatial visual question answering. We also observe that current video-language benchmarks rarely require explicit temporal tracking on the frames and visual tokens. Future work will focus on leveraging large-scale language models for learning script knowledge from long videos where temporal dynamics are better emphasized.

Broader Impact

An open-domain few-shot video-language learner has a wide range of beneficial applications for society, such as automatically detecting violent or mature content in videos and helping people with vision impairment understand videos. However, since the language model is pretrained on massive internet-scale text data, there might be unexpected output that can have potential negative impact on the society, such as bias against people of a certain gender, race or sexuality. Future work and dedicated collaboration from the community are needed to alleviate the potential negative societal impact of large language models.

Acknowledgements

We thank the anonymous reviewers for their helpful suggestions. This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Appendix A Additional Qualitative Examples

Additional qualitative examples on MSR-VTT, YouCook2 and VaTeX captioning can be found in Figures 6 and 7. We show that our framework can capture important video semantics (shown in bold green text), such as objects, events and attributes, that are missing in the captions generated by the baselines.

Appendix B Few-shot Prompt Examples

We show a full view of the few-shot prompts used in video captioning (Figures 8 and 9), video question answering (Figure 10) and video-language event prediction (Figure 11). Additionally, in Figure 12, we show the omitted in-context examples for Figure 3 in the main body.

Appendix C Additional Experimental Details

For MSR-VTT captioning and question answering, we use the original split containing 6,513 videos for training and 2,990 for testing. For MSR-VTT retrieval, we use the split containing 7,010 videos for training and 1,000 for testing, following previous work. For MSVD question answering, we use the original split containing 30,933 questions for training and 13,157 questions for testing. For VaTeX captioning and retrieval, we use the latest v1.1 version, which contains 25,991 videos for training and 6,000 videos for public testing. (Previous work only reports results on the 1,500 validation videos, since the previous version of VaTeX does not have a public test set.) For YouCook2 captioning, we use 10,337 short clips for training and 3,492 for validation, following the VALUE benchmark. For Video-Language Event Prediction (VLEP), we report results on the hidden test set using its official CodaLab evaluation server (https://github.com/jayleicn/VideoLanguageFuturePred/tree/main/standalone_eval).

Statistics of Visual Token Vocabulary.

We construct our visual token vocabulary based on OpenImage v6 class names (https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv), Visual Genome object synsets (https://visualgenome.org/static/data/dataset/object_synsets.json.zip) and Visual Genome attribute synsets (https://visualgenome.org/static/data/dataset/attribute_synsets.json.zip). The statistics can be found in Table 8. Visual Genome synsets are key-value pairs, where the keys are noisy natural language phrases and the values are the mapped WordNet synsets. For the object vocabulary, we perform minimal cleaning by removing fictional character names such as "robin (fictional character)", toward which we found the CLIP model to be highly biased on video frames. For the attribute vocabulary, we clean up the attribute synset keys by removing phrases with a cosine similarity larger than 0.9 using SentenceBERT embeddings, such as "facing upward" and "facing upwards". For the event vocabulary, we select phrases containing verb-argument structures from the object synset keys by running semantic role labeling (https://docs.allennlp.org/models/main/models/structured_prediction/predictors/srl/). We then remove semantically similar entries with a threshold of 0.9 based on SentenceBERT embeddings.
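The near-duplicate removal with a 0.9 threshold can be sketched as a greedy filter: a phrase is kept only if it is not too similar to any phrase already kept. The greedy, order-dependent strategy is an assumption, since the text only states the threshold. To stay self-contained, the sketch uses the character-level ratio from Python's `difflib` as a stand-in for SentenceBERT cosine similarity.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a toy stand-in for SentenceBERT."""
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(phrases: List[str],
                threshold: float = 0.9,
                similarity: Callable[[str, str], float] = char_similarity
                ) -> List[str]:
    """Greedily keep phrases whose similarity to every kept phrase is <= threshold."""
    kept: List[str] = []
    for phrase in phrases:
        if all(similarity(phrase, other) <= threshold for other in kept):
            kept.append(phrase)
    return kept


attributes = ["facing upward", "facing upwards", "facing down"]
print(deduplicate(attributes))
```

The greedy pass is quadratic in the vocabulary size in the worst case; for vocabularies of tens of thousands of phrases, batching the embedding similarities as a matrix product is the more practical route.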

Implementation Details for Visual Token Aggregation.

Once we obtain the top 5 visual tokens for each frame, we further aggregate them to construct the video-level visual tokens, which will be part of the few-shot prompt. We first rank the visual tokens based on their single-frame ranking score, with the appearance frequency across all frames as a tie breaker. In our implementation, we consider up to the top 4 video-level visual tokens, and then filter out any visual token that has not been ranked within the top 2 in any frame. To identify the ordering of the obtained video-level visual tokens, we use the index of the frame from which they are extracted as their temporal indicator. If a visual token occurs in multiple frames, we use the averaged frame index as its temporal indicator. Finally, in order to apply the temporal prompt template to a variable number of visual tokens, we use a dynamic template which changes according to the number of tokens. For example, if we have three visual tokens, we remove "After that" and only use "First", "Then", "Finally". If we have more than four visual tokens, we repeat "Then" or "After that" for tokens in the middle.
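The aggregation and dynamic template described above can be sketched as follows. Several details are assumptions made for illustration: the "single-frame ranking score" is taken to be a token's best rank in any frame, markers for more than four tokens alternate between "Then" and "After that", and two tokens use "First" and "Finally". All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_tokens(per_frame: List[List[str]],
                     max_tokens: int = 4,
                     rank_cutoff: int = 2) -> List[str]:
    """Aggregate per-frame ranked tokens (best first) into ordered video-level tokens."""
    best_rank: Dict[str, int] = {}
    frames: Dict[str, List[int]] = defaultdict(list)
    for frame_idx, tokens in enumerate(per_frame):
        for rank, token in enumerate(tokens):
            best_rank[token] = min(best_rank.get(token, rank), rank)
            frames[token].append(frame_idx)
    # Rank by best single-frame rank; break ties by frequency across frames.
    ranked = sorted(best_rank, key=lambda t: (best_rank[t], -len(frames[t])))
    # Keep up to max_tokens, then drop tokens never ranked within the cutoff.
    kept = [t for t in ranked[:max_tokens] if best_rank[t] < rank_cutoff]
    # Order temporally by the average index of the frames containing the token.
    return sorted(kept, key=lambda t: sum(frames[t]) / len(frames[t]))


def temporal_markers(n: int) -> List[str]:
    """Markers for n items; middle markers alternate (an assumption for n > 4)."""
    if n <= 0:
        return []
    if n == 1:
        return ["First"]
    middle = ["Then" if i % 2 == 0 else "After that" for i in range(n - 2)]
    return ["First", *middle, "Finally"]


def render(label: str, tokens: List[str]) -> str:
    """Render tokens as e.g. 'Objects: First, bath toy. Then, ...'."""
    parts = [f"{m}, {t}." for m, t in zip(temporal_markers(len(tokens)), tokens)]
    return f"{label}: " + " ".join(parts)


per_frame = [
    ["bath toy", "baby", "water"],
    ["baby", "bath toy", "soap"],
    ["towel", "baby", "water"],
]
print(render("Objects", aggregate_tokens(per_frame)))
```

With three tokens the template drops "After that", and with four it yields "First", "Then", "After that", "Finally", matching the behavior described in the paragraph above.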

Implementation Details for Few-shot Video Captioning Baselines.

In order to finetune the pretrained baselines (UniVL, BLIP, BLIPcap) with few annotated examples on video captioning, we set the learning rate to be small and the number of warm-up steps to be high. Specifically, for UniVL, we set the number of epochs to $50$ and the number of linear warm-up steps to $40$. We use a learning rate of $1\times 10^{-6}$ for the captioning task without ASR input and $3\times 10^{-6}$ with ASR input. For BLIP and BLIPcap, we set the number of epochs to $5$ with a learning rate of $5\times 10^{-7}$. For each video, we sample 4 frames (each with a size of 224) at training time and 8 frames at test time. We set all batch sizes to be the same as the few-shot number, i.e., $10$.

Implementation Details for Semi-supervised Text-Video Retrieval.

We use pretrained BLIP with ViT-B/16 (https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth) as our base model for training on different pseudo-labeled datasets as well as ground truth annotations for text-video retrieval. We train the model for one epoch using a batch size of 16 and a learning rate of $5\times 10^{-6}$. For each video, we sample 4 frames (each with a size of 224) at training time and 8 frames at test time. Following prior work, we first select $k$ candidates based on the video-text feature similarity, where the video features are represented by concatenated frame features. We then rerank the selected candidates based on their pairwise Image-Text Matching (ITM) score. We set $k=64$ for both MSR-VTT and VaTeX retrieval.
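The two-stage procedure (shortlist by feature similarity, then rerank by ITM score) can be sketched as below. The function name `retrieve`, the toy similarity values, and the dictionary lookup standing in for an ITM head are assumptions for illustration.

```python
from typing import Callable, List, Sequence


def retrieve(feature_sims: Sequence[float],
             itm_score: Callable[[int], float],
             k: int = 64) -> List[int]:
    """Shortlist the k candidates with the highest feature similarity,
    then rerank only that shortlist by the pairwise ITM score."""
    shortlist = sorted(range(len(feature_sims)),
                       key=lambda j: feature_sims[j], reverse=True)[:k]
    return sorted(shortlist, key=itm_score, reverse=True)


# Toy scores for one query against four candidates.
feature_sims = [0.9, 0.8, 0.1, 0.7]
itm_scores = {0: 0.20, 1: 0.95, 2: 0.99, 3: 0.50}
ranking = retrieve(feature_sims, itm_scores.__getitem__, k=3)
print(ranking)
```

Candidate 2 has the highest ITM score in this toy example but never reaches the reranking stage, which shows the trade-off: the pairwise scorer runs on only $k$ candidates per query, at the cost of depending on the recall of the first stage.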