VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu

Introduction

Large language models (LLMs) have garnered significant attention due to their exceptional capabilities in text understanding and generation. However, harnessing the potential of LLMs for understanding and reasoning over multimodal data, especially videos, remains a substantial challenge, because analyzing videos requires a deep understanding of both visual details and temporal dynamics. Several preliminary attempts utilize LLMs for video understanding. Nevertheless, these works predominantly focus on generating generic video captions and offer only surface-level summaries of the content, thus failing to capture the relationships between specific moment boundaries and the bounded events, as shown in Figure 1.

To tackle the problem, in this paper, we investigate improving the boundary-aware ability of Video LLM, which faces the following two challenges.

There is a scarcity of large-scale video datasets with accurate boundary annotations to train the Video LLM for temporal alignment.

It is non-trivial to design effective temporal-related video tasks for training LLM to understand the content of multiple moments within videos.

To address these challenges, we propose VTimeLLM, a novel Video LLM that can perceive fine-grained segments in videos with better temporal reasoning ability. VTimeLLM consists of i) a visual encoder and a visual adapter to process the input video, and ii) a tailored LLM to understand both text and video content, which is trained via a novel boundary-aware three-stage training strategy. Specifically, visual features are aligned with the LLM’s semantic space through image-text training in the first stage. In the second stage, we design single-turn and multi-turn question answering (QA) tasks to endow VTimeLLM with the awareness of time boundaries and the ability to understand the events bounded within them. We employ a large-scale video-text dataset containing multiple segments together with their roughly annotated labels for training VTimeLLM with the QA tasks. Finally, in the third stage, we create a high-quality dialogue dataset for instruction tuning, which simultaneously aligns VTimeLLM with human intention and enables VTimeLLM to conduct temporal understanding for video segments more precisely. Extensive experiments show that VTimeLLM significantly outperforms existing Video LLMs in time-related video understanding tasks, such as Temporal Video Grounding and Dense Video Captioning. In addition, benefiting from its fine-grained temporal understanding of videos, VTimeLLM outperforms existing Video LLMs on the video dialogue benchmark, demonstrating its superiority in cross-modal understanding and reasoning for videos. Our contributions in this paper are as follows:

We propose VTimeLLM, the first boundary-aware Video LLM, to the best of our knowledge.

We propose the boundary-aware three-stage training strategy, which consecutively leverages i) large-scale image-text data for feature alignment, ii) large-scale multi-event video-text data together with the temporal-related single-turn and multi-turn QA to enhance the awareness of time boundary, and iii) instruction tuning on the high-quality dialog dataset for better temporal reasoning ability.

We conduct extensive experiments to demonstrate that the proposed VTimeLLM significantly outperforms existing Video LLMs in various fine-grained temporal-related video tasks, showing its superior ability for video understanding and reasoning.

Related Works

To enable Large Language Models (LLMs) to comprehend visual information, significant efforts have been made to align visual and linguistic modalities. BLIP-2 introduced the concept of Q-Former, utilizing learnable query vectors to extract visual features from frozen image encoders. MiniGPT-4 demonstrated that further fine-tuning with detailed image descriptions significantly enhances its usability. LLaVA explored diverse multi-modal instruction-following data, including conversations, detailed descriptions, and complex reasoning, aiming to construct a general-purpose visual assistant. Recent endeavors, such as Kosmos-2 and VisionLLM, delved into more detailed aspects of image comprehension, including referring and grounding, significantly enhancing the capability to describe intricate image details.

Driven by the success of Image LLMs, researchers have naturally extended their focus from single-frame images to multi-frame videos, leading to the emergence of video-compatible LLMs like VideoChat, Video-LLaMA, and Video-ChatGPT. These models employ a two-stage training strategy. In the first stage, large-scale datasets align video features with the feature space of LLMs. In the second stage, a limited amount of GPT-annotated or human-annotated data is used for instruction tuning. While these models exhibit impressive overall video comprehension, their abilities to describe specific video segments and perform temporal reasoning remain limited. The limitation arises mainly from the nature of the datasets used in the first training stage, such as WebVid, which usually consist of one-event videos and noisy textual annotations. Moreover, the scarcity of high-quality, temporally annotated data in the second stage makes it difficult for models to conduct temporal reasoning. To bridge this gap, our approach, VTimeLLM, introduces a boundary perception stage between these two stages. This stage enables the model to precisely locate events within videos and describe multiple distinct events accurately, empowering our model to grasp fine-grained details of video moments.

Fine-Grained Video Understanding

Fine-grained video understanding, the ability to precisely locate and comprehend specific events within a video, is a crucial challenge for video analysis. When integrated with natural language, there are two primary tasks: Temporal Video Grounding and Dense Video Captioning.

Temporal Video Grounding aims to identify corresponding video segments for given textual inputs. Traditional approaches can be categorized into two types: proposal-based and proposal-free methods. Proposal-based techniques generate candidate proposals before ranking them based on relevance. In contrast, proposal-free methods directly predict the start and end boundaries of the target moment.

Dense Video Captioning is a more intricate task, demanding both temporal localization and captioning for all events within an untrimmed video. Earlier methods employed a two-stage process involving temporal localization followed by event captioning. Recent developments in this field have witnessed a shift towards joint training of captioning and localization modules. For instance, Vid2Seq enhances a language model by incorporating specific time tokens, enabling the model to generate event boundaries and textual descriptions within a unified output sequence.

Both tasks share a fundamental requirement: the alignment of video segments with semantic context. Leveraging the power of LLMs with the help of our training strategy, our VTimeLLM model unifies these tasks and has demonstrated remarkable effectiveness. Concurrently, VTimeLLM enables natural language interaction with humans, establishing itself as an excellent assistant for comprehending video content.

VTimeLLM: Being Aware of Time Boundaries in Videos

In this section, we introduce VTimeLLM, which is designed to enable LLMs to grasp precise video moments. We first provide a detailed description of the model architecture, and then present our innovative boundary-aware three-stage training framework, as shown in Figure 2.

To enable the LLM to comprehend videos, our VTimeLLM model incorporates two additional modules within LLM, i.e., the visual encoder and the visual adapter, which transform the visual information into text space.

where $p$ represents the number of patches in the ViT.

We utilize the global feature $v_{i}^{cls}$ as the feature for the $i$-th frame, and apply a linear layer $f(\cdot)$ to project the features of each frame into the same embedding space as that of LLM:

Note that in the visual modules, we do not model the temporal relationships for the frames, inspired by the fact that the LLM itself can receive sequential input embeddings and capture their temporal relations.

To enable the simultaneous processing of video and text inputs, we introduce a special token, ‘

where $j-1$ and $j$ are the indexes of the words that are close to the special token

We employ the text format ‘from $s$ to $e$’ to denote a video moment, where $s$ and $e$ represent the starting and ending frame indexes of the moment, ranging from 00 to 99, with each number corresponding to a specific frame.
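To make the moment format concrete, the following minimal sketch converts a segment given in seconds into the ‘from $s$ to $e$’ text. It assumes that the 100 frames are sampled uniformly over the video and that timestamps are floored to a frame index; neither detail is stated in the text, and the function names are hypothetical.

```python
def to_frame_index(t_seconds: float, duration: float, num_frames: int = 100) -> int:
    """Map a timestamp in seconds to a frame index in [0, num_frames - 1].

    Assumption: frames are sampled uniformly over the video duration.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    idx = int(t_seconds / duration * num_frames)
    # Clamp so that the end of the video maps to the last index (99).
    return max(0, min(num_frames - 1, idx))


def format_moment(start_s: float, end_s: float, duration: float) -> str:
    """Render a segment as two-digit frame indexes, e.g. 'from 25 to 99'."""
    s = to_frame_index(start_s, duration)
    e = to_frame_index(end_s, duration)
    return f"from {s:02d} to {e:02d}"


print(format_moment(30, 120, 120))  # from 25 to 99
```

Zero-padding to two digits keeps every timestamp the same length in text, which matches the examples such as ‘from 00 to 10’ that appear later in the paper.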

Boundary-aware Training

In contrast to the previous typical two-stage training approaches, consisting of alignment and instruction tuning, our approach introduces an additional stage to improve the temporal understanding ability of the model. Specifically, the first stage, feature alignment, aims to train the visual adapter to align video features with the LLM’s semantic space. The second stage, boundary perception, focuses on enabling the LLM to develop attentional capabilities for specific moments, facilitating the understanding of various events occurring within the video. The third stage, instruction tuning, allows the LLM to align with human intent and enables more precise event localization and description. In the following sections, we will elaborate on the training methods and datasets utilized for each of these three stages.

In the feature alignment stage, we employ the image-text LCS-558K dataset as curated by LLaVA. This dataset is meticulously filtered to achieve a more balanced distribution of conceptual coverage. It comprises image-text pairs only; we deliberately choose not to incorporate datasets containing video-text pairs, based on the following two considerations. Firstly, contemporary large-scale video-text datasets contain substantial textual noise, which severely impedes the alignment between visual features and textual semantics. Secondly, the transformation from visual information to text space usually suffers from information loss, e.g., when captioning an image or a video as “a dog is running on the grass”, we may lose information about the visual details (such as the color) of the dog. Comparatively, the loss of information resulting from summarizing an image into a few words is less than that of videos. Our experiments also demonstrate the superiority of using image datasets for alignment over a video dataset (a filtered subset of WebVid2M), and even over a combination of both.

For each image-text pair $\langle I, T\rangle$ in the dataset, a special token is directly appended before the text $T$; the embedding of this token, denoted as $Z_{I}$, is extracted with the visual encoder and the visual adapter as follows:

and we can obtain the embedding sequence:

Subsequently, we can use the sequence to train the visual adapter $f$, with the original auto-regressive training objective of the LLM.

Stage 2: Boundary Perception

After the training in the first stage, the LLM model becomes proficient in understanding visual information. In the second stage, we enhance the model’s capabilities to comprehend sequential image frames, i.e., video, encompassing the semantic understanding of video segments while ensuring alignment with the corresponding boundaries.

Due to the time-consuming nature of manually annotating timestamps and semantics for video segments, there is currently a lack of large-scale multi-event video-text datasets. Traditional methods align video segments with text transcripts generated by Automatic Speech Recognition (ASR). However, this approach faces challenges due to the lack of synchronicity and consistency between actions performed and spoken content, leading to weak correlations and inaccuracies in boundary annotations.

Recently, we identified the InternVid-10M-FLT dataset, which offers a viable solution for our boundary-aware training. This dataset employs an entirely automated process to segment and annotate video clips, eliminating the need for manual intervention. Consequently, a single video may contain multiple event annotations. To ensure suitability for our study, we selected specific videos, each not exceeding 120 seconds in length. These videos encompass multiple non-overlapping event annotations, each lasting more than 3 seconds, and the average duration of these events exceeds 8% of the video length. Thus, we curate a dataset comprising 134k videos, where each video contains multiple events and their rough temporal annotations and descriptions.
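The selection criteria above can be sketched as a simple filter. This is an illustration only: the function name and data layout are hypothetical, “multiple” is read as at least two events, and the exact handling of boundary cases (such as events that merely touch) is an assumption.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_seconds, end_seconds)


def keep_video(duration: float, events: List[Event],
               max_duration: float = 120.0,
               min_event_len: float = 3.0,
               min_avg_ratio: float = 0.08) -> bool:
    """Return True if a video satisfies the stage-2 selection criteria."""
    # Videos must not exceed 120 seconds and must contain multiple events.
    if duration > max_duration or len(events) < 2:
        return False
    ordered = sorted(events)
    # Events must not overlap: each event starts at or after the previous end.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return False
    lengths = [end - start for start, end in ordered]
    # Every event must last more than 3 seconds.
    if any(length <= min_event_len for length in lengths):
        return False
    # The average event duration must exceed 8% of the video length.
    return sum(lengths) / len(lengths) > min_avg_ratio * duration


print(keep_video(100.0, [(0.0, 10.0), (20.0, 35.0)]))  # True
```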

In each video, a series of events $\{s_{i},e_{i},T_{i}\}$ is contained, where $s_{i}$ and $e_{i}$ represent the start and end timestamps of a segment, ranging from 00 to 99, and $T_{i}$ corresponds to its textual description. To transform these events into dialogue data $\{Q_{1},A_{1},Q_{2},A_{2},\ldots\}$ suitable for training the LLM, we devise two types of QA dialogues: single-turn and multi-turn, constituting 20% and 80% respectively. In Box 1, we provide examples of both single-turn and multi-turn QA dialogues for a video containing three events. Specifically, the task of single-turn QA is dense video captioning. $Q_{1}$ prompts a question requiring a comprehensive description of all events and their corresponding timestamps, while $A_{1}$ outputs the respective textual descriptions and timestamps in a specified format as shown in the upper box of Box 1. On the other hand, multi-turn QA involves segment captioning and temporal video grounding tasks, demanding the description generation given timestamps or timestamps generation given descriptions, as shown in the lower box of Box 1. In multi-turn QA, each event will be randomly queried for one of these two tasks, and the questions are not necessarily presented in the order of the events’ occurrence. We design 10 templates for each task to transform events into QA dialogues, which can be found in the appendix.
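The event-to-dialogue transformation can be sketched as follows. The dense-captioning query is the one quoted in the appendix; the answer format, the two multi-turn templates, and the equal chance of choosing either multi-turn task are placeholders assumed for illustration, not the actual templates from Box 3.

```python
import random

# Dense-captioning query quoted in the appendix; other templates are hypothetical.
DENSE_QUERY = ("Could you please detail the events that took place "
               "during different time segments in the video?")


def moment(s: int, e: int) -> str:
    """Format frame indexes (0..99) as 'from 05 to 20'."""
    return f"from {s:02d} to {e:02d}"


def build_dialogue(events, rng, p_single=0.2):
    """Turn a list of (s, e, text) events into a list of (question, answer) turns."""
    if rng.random() < p_single:
        # Single-turn QA: dense video captioning over all events.
        answer = " ".join(f"{text}, {moment(s, e)}." for s, e, text in events)
        return [(DENSE_QUERY, answer)]
    # Multi-turn QA: one turn per event, in random order.
    shuffled = list(events)
    rng.shuffle(shuffled)
    turns = []
    for s, e, text in shuffled:
        if rng.random() < 0.5:  # assumed 50/50 split between the two tasks
            turns.append((f"What happens {moment(s, e)}?", text))       # captioning
        else:
            turns.append((f"When does '{text}' happen?", moment(s, e)))  # grounding
    return turns


rng = random.Random(0)
events = [(0, 10, "A man enters the room"), (15, 40, "He sits down"),
          (50, 99, "He reads a book")]
for question, answer in build_dialogue(events, rng):
    print(question, "->", answer)
```

Shuffling the events before querying reflects the statement that questions need not follow the order in which the events occur.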

We format these QA pairs according to the original LLM’s format, keeping the initial system prompts intact. Moreover, we insert the statement “This is a video with 100 frames:

Stage 3: Instruction Tuning

Following the training in the second stage, our VTimeLLM model demonstrates the ability to comprehend all events within the video and align them with the corresponding timestamps. Despite the diverse templates employed, the model’s output still tends to overfit the answers: it behaves more like a multi-task pretrained model and loses its ability to chat with the user, e.g., when we input “What color is the coat of the man” to the model, it may respond “from 00 to 10”. Additionally, the labels of the video-text data in the second stage are annotated in an automated way, and are therefore noisy and not very accurate. To tackle these two problems, in the third stage, we incorporate high-quality dialogue data for instruction tuning, enabling the model to follow human instructions for more accurate video temporal comprehension and reasoning.

In this stage, we select a subset from the ActivityNet Captions and DiDeMo datasets, and transform it into a high-quality QA dialogue dataset with the assistance of Large Language Models. In contrast to InternVid, which employs automated segmenting and labeling, these two datasets are entirely manually annotated, resulting in descriptions that are more detailed and temporal boundaries that are more accurate. Specifically, we carefully select a subset of videos from the training set of ActivityNet Captions. These videos contain a minimum of three non-overlapping events, collectively covering over 90% of the video duration, amounting to approximately 4.2k videos. Similarly, a subset of videos is selected from the DiDeMo dataset, each containing at least two non-overlapping events and covering 40% of the video duration. This process results in a total of about 4k videos for the DiDeMo subset. Subsequently, we also transform these videos, which contain a series of events $\{s_{i},e_{i},T_{i}\}$, into QA dialogues. However, results from the second stage of training indicate that template-based conversations lead to model overfitting. Therefore, we utilize an LLM for this transformation. Specifically, we provide these events to the LLM, prompting it to assume the role of an AI visual assistant capable of analyzing the video and to generate a dialogue about the video between itself and a user. The prompt can be found in the appendix. This approach results in QA dialogues that are grammatically correct, linguistically coherent, and may encompass a variety of tasks. We generate two distinct sets of dialogues for each video, yielding a final dataset comprising around 16k high-quality QA dialogues. Additionally, we observe that introducing a comparable number of other video instruction tuning datasets further enhances the descriptive capabilities of the model, with minimal impact on temporal understanding abilities. Therefore, we add an extra 20k QA pairs from the VideoInstruct100K dataset. Overall, in this stage, a total of approximately 36k QA dialogues are used for training, which is significantly smaller than the dataset used in the second stage.

We merge the LoRA module trained in the second stage with the original model and introduce a new LoRA module, which serves as the only trainable parameters. All other training details remain consistent with those of the second stage.
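For reference, the merge-then-retrain step can be written in the standard LoRA parameterization. This is a sketch under the usual LoRA convention (low-rank factors $A$, $B$ with scaling $\alpha/r$); the symbols $W_0$, $A_2$, $B_2$, $A_3$, $B_3$ are notation introduced here for illustration and do not appear in the original text.

```latex
% Stage 2: frozen base weight W_0 plus a trainable low-rank update (A_2, B_2)
W^{(2)} = W_0 + \frac{\alpha}{r} B_2 A_2

% Stage 3: W^{(2)} is merged and frozen; only the new factors (A_3, B_3) are trained
W^{(3)} = W^{(2)} + \frac{\alpha}{r} B_3 A_3
```

Under this formulation, the stage-2 update is baked into the frozen weights, so stage-3 training cannot overwrite it directly and only learns a further low-rank correction.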

Experiment

To assess the capability of VTimeLLM in comprehending various event segments, we mainly conduct evaluations on two tasks: Temporal Video Grounding and Dense Video Captioning.

For the Temporal Video Grounding task, we utilize datasets from ActivityNet Captions and Charades-STA. We calculate the Intersection over Union (IoU) between the time segments generated by the model and the corresponding ground truth time segments. We report mean IoU (mIoU) and Recall@1 at IoU $\geq m$ (R@$m$), where $m \in \{0.3,0.5,0.7\}$.
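The grounding metrics can be computed as in the following minimal sketch, which assumes one predicted segment per query (hence Recall@1) and segments given as (start, end) pairs in the same unit. The function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU between two 1-D segments given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return mIoU and R@m (fraction of queries with IoU >= m)."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {"mIoU": sum(ious) / n}
    for m in thresholds:
        metrics[f"R@{m}"] = sum(iou >= m for iou in ious) / n
    return metrics


preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(0.0, 10.0), (5.0, 15.0)]
print(grounding_metrics(preds, gts))
```

In this toy example the first query is a perfect match (IoU 1) and the second has IoU 1/3, so the second query counts towards R@0.3 but not towards R@0.5 or R@0.7.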

In the case of Dense Video Captioning, we employ the ActivityNet Captions dataset. The evaluation process encompasses two categories of metrics. Firstly, we employ SODA_c, a metric specifically tailored for dense video caption tasks, taking into account the video’s storyline. Secondly, we compute matched pairs between the generated events and the ground truth across IoU thresholds of $\{0.3,0.5,0.7,0.9\}$, and calculate captioning metrics based on these matched pairs. We report CIDEr and METEOR averaged over the different IoU thresholds to provide a comprehensive analysis.

In our study, we use Vicuna v1.5 as the Large Language Model and train two versions: 7B and 13B. We use a total batch size of 128 throughout the training process. The AdamW optimizer is applied with a cosine learning rate decay and a warm-up period. In the first training stage, we train for 1 epoch with a learning rate of $1\times 10^{-3}$; in the second and third stages, we train for 2 epochs each with a learning rate of $1\times 10^{-4}$. The LoRA parameters are set to $r=64$ and $\alpha=128$. Thanks to the efficiency of LoRA, we can complete the training of the 7B model within 30 hours on a single RTX-4090 GPU.
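As an illustration of the schedule described above, the sketch below implements linear warm-up followed by cosine decay. The warm-up length and the decay-to-zero floor are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_steps: int) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Stage 1 uses a peak learning rate of 1e-3; stages 2 and 3 use 1e-4.
# The step counts below are placeholders, not values from the paper.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at(step, total_steps=100, peak_lr=1e-3, warmup_steps=10))
```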

Main Results

We evaluate the capabilities of existing Video LLMs in temporal video grounding and dense video captioning tasks, as shown in Table 1. Detailed information about the evaluation process can be found in the appendix. VTimeLLM-7B outperforms these Video LLMs of the same size by a significant margin. Upon further scaling up the model to 13B parameters, we observe minor changes in performance on ActivityNet tasks, while the temporal grounding ability improves on Charades-STA. It is worth mentioning that our training dataset does not include Charades-STA training data, indicating that increasing the scale of our VTimeLLM model enhances its out-of-distribution generalization capability.

We provide several possible explanations to account for the poor performance of other models: firstly, both VideoChat and VideoLLaMA extract only N=8 frames as input, making it challenging for them to achieve a fine-grained understanding of the video content. Secondly, the commonly used LLM (Vicuna) lacks robust positional awareness in input sequences. For instance, when posed with the question “What is the position of the word ‘video’ in the phrase ‘a video clip’ ?”, it may erroneously respond, “The word ‘video’ appears at position 67.” Relying solely on a limited set of temporally annotated data for instruction tuning is insufficient to address this issue. Therefore, it is essential to integrate boundary-aware training to achieve precise video comprehension.

Ablation Study

In this section, we provide detailed ablations of our three-stage training strategy through experiments on the 7B model, as illustrated in Table 2. The questions of greatest concern to us, together with the corresponding results, are discussed below.

In contrast to other Video LLMs, we utilize a pure image modality for the first stage and find it to be superior across all metrics to using a pure video modality (Rows 1, 2 vs Rows 3, 4). The effectiveness of using images for alignment could be attributed to the higher quality and reduced information loss in image datasets. Additionally, using pure images outperforms the fusion of the two modalities (Rows 1, 2 vs Rows 5, 6). This could be due to the significant disparity in tasks, where describing a single-frame event and describing a sequence of 100 frames of events pose distinct challenges for model fitting.

Another question arises during the following stage: should the previously pretrained visual adapter be tuned or frozen? Upon comparing Row 1 vs Row 2, Row 3 vs Row 4, and Row 5 vs Row 6, we observe only minor differences in performance between the two approaches. To retain the comprehensive information acquired during the pretraining stage, we opt to freeze the parameters of the visual adapter in the latter two stages.

Upon comparing Row 9 to Row 10, it is evident that in stage 3, merging the LoRA module from the second stage with the LLM parameters and additionally incorporating another LoRA module yields superior results. This approach ensures that the temporal understanding capabilities acquired during stage 2 are effectively preserved within the model.

By comparing Rows 1 to 6 with Row 7, we observe a substantial disparity in the model’s performance on temporal grounding when training without stage 1. The scores are abnormally high on the ActivityNet dataset, while significantly low on the Charades-STA dataset. Upon careful analysis of the outputs under this setting, we find that the model has not effectively learned to localize events. Instead, it tends to predict a temporal segment spanning nearly the entire video (e.g., from 00 to 95). In such cases, if the ratio of ground truth duration to video length is denoted as $x$, the IoU with the model’s output is approximately $x$. The ActivityNet dataset contains a significant number of long samples, with 20% of queries having $x>0.5$, leading to an inflated evaluation metric. Conversely, in the Charades-STA dataset, $x$ rarely exceeds 0.5, demanding more precise localization. However, the model without stage 1 training fails to achieve it. Moreover, the model’s performance in dense captioning tasks is unsatisfactory, which also highlights the essential nature of the feature alignment stage.
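The claim that the IoU is approximately $x$ follows from a one-line calculation. The sketch below idealizes the prediction as covering the whole video; $P$, $G$, and $D$ are notation introduced here for illustration.

```latex
% P: predicted segment, G: ground-truth segment, D: video length
P = [0, D], \qquad G \subseteq [0, D], \qquad |G| = xD

\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|} = \frac{xD}{D} = x
```

Under this idealization, a whole-video prediction counts as a hit at threshold $m$ exactly when $x \geq m$, so the share of queries with $x>0.5$ translates directly into R@0.5 without any real localization.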

The necessity of stage 2 can be demonstrated by comparing Row 8 with Rows 9, 10. Despite the higher quality of annotations in stage 3, the limited dataset size hinders the model’s ability to achieve a robust temporal understanding through stage 3 training alone. Models trained solely in stage 3 exhibit inferior performance across various tasks compared to those that have undergone preliminary training in stage 2.

After stage 3 training, the model exhibits comprehensive improvement in the tasks outlined in the table (Row 1 vs Row 10). Furthermore, it regains chatting ability, enabling it to respond to a wide range of questions posed by humans.

Video Dialogue Performance

Besides the ability for fine-grained video understanding tasks, we explore whether VTimeLLM can address a broader range of questions through dialogue. We employ the Video-ChatGPT benchmark and conduct an evaluation of video-based generative performance. This benchmark covers many questions associated with five key aspects. GPT-3.5 assigns a score, not exceeding 5, to the model-predicted answer based on the question and the correct answer. We present the average scores in Table 3 and compare VTimeLLM with all existing Video LLMs, including VideoLLaMA, LLaMA-Adapter, VideoChat, VideoChatGPT and BT-Adapter.

Thanks to the fine-grained video comprehension capabilities, VTimeLLM achieves state-of-the-art results in all aspects. The most substantial improvement is observed in the aspect of detail orientation, where VTimeLLM achieves a noteworthy enhancement of +0.41 (15.2%). We attribute this progress to two primary factors. Firstly, the image-based training in stage 1 ensures comprehensive preservation of visual details in individual frames, facilitating a detailed understanding of spatial dimension. Secondly, the temporal-aware training employed in the second and third stages enables VTimeLLM to capture multiple events within videos, enhancing its ability to depict details of temporal dimension.

To better illustrate the video dialogue performance of VTimeLLM, we present a qualitative example, as shown in Figure 3.

Conclusion

In this work, we introduce VTimeLLM, a Video LLM capable of comprehending multiple events within a video and providing precise temporal boundaries. We unify video tasks demanding fine-grained comprehension, such as temporal video grounding and dense video captioning, and are the first to address them with a Video LLM. Specifically, we propose a three-stage temporal-aware training framework. This framework utilizes large-scale image-text data for feature alignment, leverages extensive multi-event video-text data along with temporal-related question-answering to enhance temporal awareness, and employs instruction tuning on a high-quality dialogue dataset to improve temporal reasoning ability. Extensive experiments demonstrate that VTimeLLM outperforms existing Video LLMs significantly across various tasks, particularly excelling in fine-grained temporal-related video tasks, showing VTimeLLM’s superior ability for video understanding and reasoning.

Appendix A More Examples

We showcase additional examples of video dialogues across various tasks, encompassing a creative task (Figure 4), a fine-grained understanding task (Figure 5), and a video reasoning task (Figure 6). In the creative task (Figure 4), our VTimeLLM demonstrates a remarkable capacity to comprehend visual information and subsequently craft a poem inspired by it. This achievement is attributed to the fact that we freeze the LLM at all three stages of training, thereby preserving its ability to engage in creative dialogue. In the fine-grained understanding task (Figure 5), our VTimeLLM comprehends multiple events within the video, as well as the specific visual content within individual events. This demonstration underscores its proficiency in grasping temporal and spatial details, a capability attributed to our three-stage training strategy. In the video reasoning task (Figure 6), our VTimeLLM responds to several questions requiring inference, showing its capacity to engage in reasoning based on a comprehensive understanding of visual content.

Appendix B Templates and Prompts

In Stage 2, we need to transform events $\{s_{i},e_{i},T_{i}\}$ into template-based QA, where $s_{i}$ and $e_{i}$ represent the start and end timestamps of a segment, ranging from 00 to 99, and $T_{i}$ corresponds to its textual description. For a given sequence of events, there is a 20% probability of transformation into single-turn QA, completing a dense caption task where all events are described within a single answer. Conversely, there is an 80% probability of transformation into multi-turn QA. In this scenario, each event is individually queried and answered within a dialogue, in the form of one of two tasks: event captioning or temporal grounding. We provide 10 templates for each task, as shown in Box 3.

In Stage 3, we need to transform events into high-quality dialogue. This is accomplished by providing a prompt to a text-based LLM (Vicuna-7B v1.5). The prompt can be found in Box 4. In the prompt, specific timestamps are not provided because their inclusion does not enhance the LLM’s comprehension of temporal relationships. On the contrary, they may introduce errors into the dialogue. Consequently, events are presented in a sequential order, accompanied by specific symbols e.g., , in the box, denoting the timestamps. The generated dialogue is expected to integrate temporal perception and reasoning.

Appendix C Evaluation Process

In this section, we provide a detailed process on the evaluation of temporal grounding and dense captioning tasks for VTimeLLM and other Video LLMs.

For VTimeLLM that has undergone only stage 1 and stage 2 training, without stage 3, the input and output formats remain entirely consistent with the template. Consequently, we can directly employ the templates in Box 3 as queries. Specifically, for the dense captioning task, we employ $Q_{D1}$, i.e., “Could you please detail the events that took place during different time segments in the video?” as the query. For the temporal grounding task, we employ queries $Q_{T1}$, $Q_{T2}$, and $Q_{T3}$ to compute IoU for their respective outputs, and we report the average metrics. The performance obtained from different queries is similar.

VTimeLLM that has undergone stage 3 training demonstrates commendable instruction-following ability, and the performance may vary with different queries. For example, the inclusion of the phrase “in detail” in the query leads to a more detailed description of the video. For the dense captioning task, we utilize the following query: “Could you please describe the events in the video in detail? Be specific about the activities of individuals, their surroundings, and interactions with others. The output should be in JSON format, structured as follows: {‘event’: ‘xx’, ‘timestamps’: ‘from xx to xx’}.” We find that this query outperforms $Q_{D1}$ across various metrics by approximately 10%. For the temporal grounding task, we continue to report the average results of queries $Q_{T1}$, $Q_{T2}$, and $Q_{T3}$, with metrics for each query remaining consistently close. Notably, even with the adoption of a simpler query such as “When does $T_{i}$ happen?”, we achieve comparable results, underscoring the stability of outputs in this task.
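To score such outputs against ground truth given in seconds, the ‘from xx to xx’ strings have to be mapped back to time. The sketch below shows one way to do this; the regular expression, the function name, and the linear index-to-seconds mapping are assumptions for illustration rather than the evaluation code used in the paper.

```python
import re

# Matches 'from 05 to 20' with one- or two-digit frame indexes.
MOMENT_RE = re.compile(r"from\s+(\d{1,2})\s+to\s+(\d{1,2})", re.IGNORECASE)


def parse_moments(text: str, duration: float, num_frames: int = 100):
    """Extract (start_seconds, end_seconds) pairs from a model response."""
    segments = []
    for s, e in MOMENT_RE.findall(text):
        s, e = int(s), int(e)
        if s > e:
            continue  # skip malformed segments
        # Assumed mapping: frame index k corresponds to k / num_frames of the video.
        segments.append((s * duration / num_frames, e * duration / num_frames))
    return segments


response = "{'event': 'A man opens the door', 'timestamps': 'from 10 to 50'}"
print(parse_moments(response, duration=200.0))  # [(20.0, 100.0)]
```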

C.2 Evaluation of other Video LLMs

For other Video LLMs (VideoLLaMA, VideoChat, and VideoChatGPT) that we test in our study, we try our best to assess their optimal performance, as they were not trained on these tasks. Our testing methodology follows several principles: First, we include the video duration $D$ in the query. Second, as these models often fail to adhere to our prompt for outputting in JSON format, we apply multiple regular expressions to format the output. This successfully handles over 70% of the outputs. For the outputs that cannot be processed, we exclude the corresponding data from metric calculations. Third, we design multiple queries and select the one yielding the best performance as the final result. For example, in our experiment, we find that the best query for VideoChatGPT in the dense captioning task is: “This video has a duration of $D$ seconds. From which second to which second in the video, what event happens? Be specific about the activities of individuals, their surroundings, and interactions with others. List the events in the format: 1. From x1 second to y1 second: event1. \n 2. From x2 second to y2 second: event2. \n …”
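The regular-expression post-processing can be illustrated with a single pattern for the list format requested in the query above. The actual expressions used in the evaluation are not given in the text, so the pattern and function name below are hypothetical; lines that do not match are simply skipped, mirroring the exclusion of outputs that cannot be processed.

```python
import re

# Matches lines such as '1. From 0 second to 10 second: A man runs.'
EVENT_RE = re.compile(
    r"from\s+(\d+(?:\.\d+)?)(?:\s*(?:seconds?|secs?|s)\b)?\s+to\s+"
    r"(\d+(?:\.\d+)?)(?:\s*(?:seconds?|secs?|s)\b)?\s*[:,\-]?\s*(.+)",
    re.IGNORECASE,
)


def parse_events(output: str, duration: float):
    """Return a list of (start_seconds, end_seconds, caption) from free-form text."""
    events = []
    for line in output.splitlines():
        match = EVENT_RE.search(line)
        if match is None:
            continue  # line does not follow the requested format
        start, end = float(match.group(1)), float(match.group(2))
        # Keep only segments that are well-formed and lie inside the video.
        if 0 <= start < end <= duration:
            events.append((start, end, match.group(3).strip()))
    return events


output = ("1. From 0 second to 10 second: A man runs.\n"
          "2. From 10.5 seconds to 30 seconds: He jumps.\n"
          "The video is about sports.")
print(parse_events(output, duration=60.0))
```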