TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou

Introduction

From educational tutorials to feature films, long-form videos have become an essential medium in our daily lives. However, it is both time-consuming and frustrating for individuals to sift through lengthy videos. Instead, human attention is consistently drawn to meaningful or highlighted visual segments, such as essential steps in a cooking tutorial or fantastic moments from sports events. An intelligent time-sensitive video assistant that analyzes long videos for users, encompassing temporal localization, timestamp detection, and key moment summarization, is a longstanding pursuit of the community. With the emergence of Large Language Models (LLMs) and their impressive capacity to execute human instructions, a natural question arises: is it feasible to develop an LLM-based assistant for long-form video comprehension tasks to satisfy realistic user requirements?

Preliminary endeavors have been made to integrate video encoders with LLMs for basic video understanding, including detailed captioning and question answering. However, existing Video LLMs (VidLLMs) can only capture global visual semantics for short clips and fail to associate significant video content with accurate timestamps. For example, Video-LLaMA and VideoChat struggle to localize and describe meaningful events in untrimmed videos, leading to the low accuracy reported in Tab. 2. Two main obstacles hinder the performance of existing VidLLMs. Firstly, their rigid compression, which converts video tokens to a fixed number (e.g., 32), is unsuitable for long-form video input. It neglects the video's duration and results in severe spatial-temporal semantics degradation when processing massive numbers of frames from long videos. Secondly, they handle video and timestamp information separately without considering the explicit time-vision association, and are thus unable to localize timestamps accurately.

In this paper, we propose TimeChat, a time-sensitive multimodal large language model for long video understanding and accurate temporal localization. To handle long video input, we propose a sliding video Q-Former to accommodate adaptive video token length during the extraction and compression of video features. Specifically, the video Q-Former compresses the frames within a sliding window into video tokens. By temporally moving the window, we can dynamically create a video token sequence of varying lengths to accommodate videos of various durations. It preserves the significant visual semantics of long videos and leads to more expressive and scalable video representation. Furthermore, to enhance the vision-timestamp association, we propose a time-aware frame encoder, which explicitly binds the visual context with the timestamp description of each frame.

To stimulate the intrinsic timestamp localization capability of TimeChat and enhance its instruction-following ability, we construct a novel instruction tuning dataset, TimeIT, involving diverse timestamp-related user instructions. This dataset is compiled from a variety of timestamp-associated long-video datasets with an average video length of 190.8 seconds. It is composed of 6 diverse tasks, 12 widely-used academic benchmarks, and a total of 125K instances. We reformat the original academic datasets into dialog style with manually written high-quality instructions. To the best of our knowledge, TimeIT is the first time-sensitive video-centric dataset designed for instruction tuning, with the aim of facilitating the development of VidLLMs.

Utilizing the TimeIT dataset, we perform instruction tuning on TimeChat and then assess its performance across various downstream tasks, including dense video captioning, temporal video grounding, and video highlight detection. Experimental results show that our model substantially outperforms previous VidLLMs under zero-shot settings, with +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA. Furthermore, qualitative results in new domains such as movie and egocentric videos demonstrate the generalization of TimeChat towards a versatile and practical video assistant.

Related Work

With advancements in Large Language Models (LLMs), numerous studies have endeavored to integrate LLMs with a video encoder, thereby harnessing the powerful comprehension and generation capabilities of LLMs for video tasks. These studies typically employ open-source LLMs such as Vicuna and LLaMA. Their key difference lies in how they encode the video into vision tokens compatible with the LLMs. Representative work like VideoChat utilizes a video transformer to encode video features and subsequently implements a Query Transformer (Q-Former) to compress video tokens. Video-LLaMA first uses a vision transformer (ViT) with an image Q-Former to encode individual frames, and then employs a video Q-Former for temporal modeling. However, these methods compress video tokens to a fixed number, resulting in visual semantic degradation when handling lengthy videos. In contrast, our model TimeChat offers adjustable compression rates for visual tokens, increasing adaptability to varying video lengths. Moreover, our model explicitly establishes a frame-level vision-timestamp relationship to improve temporal localization capabilities.

2.2 Vision-Language Instruction Tuning

Inspired by the recent success of instruction tuning on LLMs, researchers have adopted vision-language instruction tuning to improve the instruction-following capabilities of Multimodal LLMs. This primarily entails producing high-quality data with human instructions, and existing efforts can be categorized into two technical branches. The first branch integrates available multi-modal benchmark datasets and converts them to instruction format, with efforts like MultiInstruct, InstructBLIP, and $\text{M}^{3}\text{IT}$. The second branch leverages LLMs such as ChatGPT and GPT-4 to create more diverse dialog-style data. Approaches like MiniGPT4, LLaVA, MIMIC-IT, VideoChat, and Valley obtain detailed visual descriptions and build image-centric or video-centric conversation data from LLMs. However, they all neglect time-aware user requests for video understanding. To address this, we propose a time-aware instruction tuning dataset to enhance the time-vision association ability of Multimodal LLMs.

2.3 Video Temporal Localization

Temporal localization is a foundational capability in video understanding tasks, particularly for untrimmed long videos. There is a wide range of time-sensitive video tasks, including temporal video grounding, dense video captioning, video summarization, video highlight detection, and step localization. These tasks necessitate explicit associations between video semantics and the corresponding timestamps. Previous studies tend to address each task separately on specialized downstream datasets. Although recent works make preliminary attempts to bridge several tasks, a generalist paradigm based on LLMs remains underexplored. In this paper, we unify a wide range of time-sensitive video tasks in a language modeling format and take a first step towards leveraging LLMs for them.

Method

In this section, we present TimeChat, a VidLLM featuring two novel modules: a timestamp-aware frame encoder and a sliding video Q-Former. These modules enhance TimeChat's ability to localize temporally and understand long videos (§3.1). To further empower TimeChat to follow human instructions across time-sensitive video tasks, we collect an instruction-tuning dataset named TimeIT (§3.2). This dataset comprises 6 tasks and 125K instances. Based on TimeIT, we perform instruction tuning on our model to unlock its full potential.

TimeChat is composed of a time-aware frame encoder, a sliding video Q-Former, and a large language model, as depicted in Fig. 1. Given an input video, the frame encoder first extracts visual and timestamp features for each frame independently. Next, the video Q-Former models temporal relations across frames within a sliding window to produce video tokens. Finally, these video tokens are concatenated with optional transcribed speech and user instructions, which are then fed into the LLM to generate responses.

3.1.2 Timestamp-aware Frame Encoder

Previous studies typically separate the modeling of visual semantics from the timestamp information of the input frames. For example, VideoChat utilizes a visual encoder to process visual frame semantics but relies on the LLM to receive timestamp information, e.g., “This video contains 8 frames sampled at 2s, 4s, …, 16s”. As a result, this approach fails to directly capture the time when a visual event occurs, thereby leading to inaccurate temporal localization.
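To make the contrast concrete, the sketch below builds the two kinds of timestamp text in Python. The first function reproduces the single video-level sentence quoted above; the second emits one description per frame, which is the granularity a time-aware frame encoder needs in order to bind time to visual content. The function names and the exact per-frame wording are illustrative assumptions, not taken from the released code.

```python
def global_timestamp_prompt(times_s):
    """Video-level prompt in the style quoted above: one sentence for all frames."""
    listed = ", ".join(f"{t}s" for t in times_s)
    return f"This video contains {len(times_s)} frames sampled at {listed}"


def per_frame_timestamp_prompts(times_s):
    """One description per frame. The wording is an assumed template."""
    return [f"This frame is sampled at {t}s." for t in times_s]


times = list(range(2, 17, 2))  # 2s, 4s, ..., 16s
print(global_timestamp_prompt(times))
print(per_frame_timestamp_prompts(times)[:2])
```

With the per-frame variant, each frame's visual features can be paired with its own timestamp text before any temporal fusion takes place.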

3.1.3 Sliding Video Q-Former

After applying the time-aware frame encoder, we obtain $T\times N_{I}$ visual tokens for a $T$-frame video input. Since frames are encoded independently, the temporal relationship across frames has not been modeled yet. To this end, we incorporate a sliding video Q-Former (yellow block in Fig. 1) to enhance the feature fusion in the temporal dimension. The video Q-Former mirrors the structure of the image Q-Former, except that it only takes $N_{V}$ learnable queries of dimension $D_{Q}$ as input, without timestamps. We design a sliding window of length $L_{W}$ and, within each window, use the video Q-Former to extract $N_{V}$ video tokens from $L_{W}$ frames. By sliding the video Q-Former in strides of $S$, we can represent the input video as $(T/S)\times N_{V}$ video tokens.
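The windowing logic can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that the text does not spell out: only full windows are kept, and the token count of (T/S) times the per-window token count holds when the window length equals the stride and the frame count is a multiple of the stride. The function names are hypothetical.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame-index pairs, end exclusive, for each full window."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def num_video_tokens(num_frames, window, stride, tokens_per_window):
    """Total video tokens when the video Q-Former is applied once per window."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window


# Settings reported in the experiments: 96 frames, window = stride = 32,
# and 32 video tokens per window.
print(sliding_windows(96, 32, 32))
print(num_video_tokens(96, 32, 32, 32))
```

Because the number of windows grows with the number of frames, the token sequence length scales with video duration instead of staying fixed.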

Considering the 3D nature of videos and the redundancy in space-time information, the original sequence of visual tokens (i.e., patches in all frames) can be extremely long. Thus, it is crucial to condense video information into a reduced number of video tokens, thereby decreasing the computation burden on the LLM. However, previous work usually sets a fixed number of video tokens $N_{V}$, such as 32, which can result in severe visual semantics degradation when the number of input frames $T$ is large. Concretely, we define the compression rate $R$ as the number of original visual tokens divided by the number of final video tokens. The compression rate for previous work like Video-LLaMA is:

$$R = \frac{T\times N_{P}}{N_{V}},$$

where $N_{P}$ is the number of patches of each frame. This ratio increases with the number of input frames $T$ and can cause excessive compression for long videos. With our sliding video Q-Former, our compression rate $R'$ becomes a constant value:

$$R' = \frac{T\times N_{P}}{(T/S)\times N_{V}} = \frac{S\times N_{P}}{N_{V}},$$

retaining richer semantics for long videos. By adjusting the stride $S$, we can control the final number of video tokens according to the computation budget. Finally, we use a linear layer to transform the dimension $D_{Q}$ of video tokens to match the dimension $D_{LLM}$ of the LLM embedding space.
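As a worked example, the Python sketch below evaluates both compression rates. It takes the compression rate to be the number of original visual tokens divided by the number of final video tokens, so a fixed token budget gives T times the patches per frame over the token budget, while the sliding variant gives the stride times the patches per frame over the tokens per window. The value of 256 patches per frame is an assumption (a 224×224 input with 14×14 patches); the text does not state it.

```python
def fixed_compression_rate(num_frames, patches_per_frame, num_video_tokens):
    """Original visual tokens over a fixed budget of video tokens."""
    return num_frames * patches_per_frame / num_video_tokens


def sliding_compression_rate(stride, patches_per_frame, tokens_per_window):
    """With (T / S) windows the frame count cancels, leaving a constant rate."""
    return stride * patches_per_frame / tokens_per_window


PATCHES = 256  # assumed patches per frame, not given in the text
for frames in (32, 96):
    print(frames,
          fixed_compression_rate(frames, PATCHES, 32),
          sliding_compression_rate(32, PATCHES, 32))
```

Under these assumptions, tripling the frame count triples the fixed-budget rate, whereas the sliding rate is unchanged.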

3.1.4 Large Language Model

Ultimately, we concatenate inputs from various modalities, i.e., the video tokens $\mathbf{X}_{v}$ and the text query tokens $\mathbf{X}_{q}$ (including optional transcribed speech and the user instruction), and feed them into a large language model to generate reasonable and coherent responses (answers) $\mathbf{X}_{a}$. Here, $\mathbf{X}_{v}$, $\mathbf{X}_{q}$, and $\mathbf{X}_{a}$ have the same token embedding dimension $D_{LLM}$. VidLLMs are typically trained in two stages. The first stage pre-trains the model using large-scale image/video-text pairs for vision-language alignment. The second stage finetunes the model with instruction data for instruction following. Considering computing efficiency, we reuse the checkpoints of existing open-source models after the first-stage training (see §4.1) and conduct only instruction tuning. During training, we utilize the language modeling loss for generating target answers $\mathbf{X}_{a}$ with length $L_{T}$, which serves as the objective function:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{L_{T}} \log p_{\theta}\left(x_{i} \mid \mathbf{X}_{v}, \mathbf{X}_{q}, \mathbf{X}_{a,<i}\right),$$

where $\theta$ denotes the trainable parameters, and $\mathbf{X}_{a,<i}$ refers to the answer tokens preceding the current prediction token $x_{i}$. To better adapt the LLM to video tasks, we apply the parameter-efficient fine-tuning method LoRA.
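The objective is a standard autoregressive negative log-likelihood restricted to the answer tokens. The Python sketch below illustrates that restriction with plain lists: video and query positions are masked out, and only answer positions contribute. Summing (not averaging) over tokens and the function name are assumptions made for illustration.

```python
import math


def masked_lm_loss(log_probs, is_answer):
    """Negative log-likelihood over answer positions only.

    log_probs[i] is the log-probability the model assigns to the i-th target
    token given everything before it; is_answer[i] marks answer tokens.
    """
    if len(log_probs) != len(is_answer):
        raise ValueError("log_probs and is_answer must have the same length")
    return -sum(lp for lp, keep in zip(log_probs, is_answer) if keep)


# One query token (ignored) followed by two answer tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
is_answer = [False, True, True]
print(masked_lm_loss(log_probs, is_answer))  # equals log(8)
```

Training frameworks usually average this quantity over tokens or over the batch, which rescales the loss without changing its minimizer.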

3.2 Instruction Data TimeIT

To boost TimeChat’s ability to understand time-sensitive human instructions, we introduce TimeIT, a video-centric instruction-tuning dataset involving timestamps. This dataset integrates a wide range of timestamp-associated video datasets and is characterized by long-form videos.

TimeIT encompasses 6 longstanding timestamp-related video tasks, i.e., (1) temporal video grounding, (2) dense video captioning, (3) video summarization, (4) video highlight detection, (5) step localization and captioning, as well as (6) transcribed speech generation. It also incorporates 12 specific datasets derived from different domains as illustrated in Fig. 2. Please refer to Appendix A for details. Our dataset accommodates prevalent user requests involving video timestamps when interacting with AI assistants in real-world applications.

3.2.2 Data Construction

We convert the above datasets into an instruction-following format to obtain high-quality video-centric instruction data. The construction process comprises two main steps including (1) instruction writing and (2) answer formatting.

The quality and diversity of instructions are essential in the construction process. We manually write well-designed instructions for each task as a starting point. Then we utilize GPT-4 to produce more diverse and flexible expressions based on the manual initialization. Finally, we manually select and refine the LLM-generated instructions to obtain the final version. Inspired by the observation in $\text{M}^{3}\text{IT}$ that using around five instructions per task is sufficient, we generate six high-quality instructions for each task. Specific instructions designed for each task are given in Appendix B.

Based on the written instructions, we further reformulate the task outputs into a user-friendly natural language response paradigm (format details are provided in Appendix B). Considering the involved video datasets are manually collected, the overall quality of TimeIT data is guaranteed.
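The two construction steps can be sketched as follows in Python. The answer template, the field names, and the example instructions are hypothetical stand-ins, since the exact formats live in Appendix B; the sketch only shows the mechanics of pairing a sampled instruction with a natural-language rendering of timestamped annotations.

```python
import random


def format_dense_caption_answer(events):
    """Render (start_s, end_s, caption) triples as one natural-language answer.

    The sentence template is an assumption made for illustration.
    """
    return " ".join(
        f"{start:.1f} - {end:.1f} seconds, {caption.rstrip('.')}."
        for start, end, caption in events
    )


def build_instance(instructions, events, rng):
    """Pair a randomly chosen instruction with the formatted answer."""
    return {
        "instruction": rng.choice(instructions),
        "answer": format_dense_caption_answer(events),
    }


# Hypothetical instructions and annotations.
instructions = [
    "Localize the key events in the video and describe each one.",
    "List the start and end time of every step, with a short caption.",
]
events = [(0.0, 12.5, "spread butter on the bread"), (12.5, 30.0, "grill the sandwich.")]
print(build_instance(instructions, events, random.Random(0)))
```

Sampling among several instructions per task is what exposes the model to varied phrasings of the same request.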

Tab. 1 compares our TimeIT data with existing video-centric instruction tuning data, showing advantages in data scale, task diversity, and video length. We hope it can promote the investigation of challenging long-form video understanding in the community. Furthermore, it can also serve as a good complement to existing datasets for two reasons: (1) the original video sources are distinct, and (2) the video-timestamp association complements common video content reasoning.

Experiments

We take ViT-G/14 from EVA-CLIP as the image encoder and LLaMA-2 (7B) as the language foundation model. The parameters of the image Q-Former are initialized from InstructBLIP's checkpoint, while the video Q-Former is initialized from Video-LLaMA's checkpoint. We finetune TimeChat on TimeIT for 3 epochs with a batch size of 32 on a single machine with 8 V100 (32G) GPUs. As shown in Fig. 1, the parameters of the ViT and the LLM are frozen, while those of the image Q-Former, the video Q-Former, and the linear layer are tuned. The rank in LoRA is 32. The window size $L_{W}$, the stride $S$, and the number of video tokens $N_{V}$ per window are all set to 32. The number of input frames is 96. Please refer to Appendix C for additional hyper-parameters.
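To give a sense of what a LoRA rank of 32 means in parameter terms, the Python sketch below counts the weights added by one adapter on a square projection of the LLM. The hidden size of 4096 is that of LLaMA-2 7B; which projection matrices receive adapters is not stated here, so the single-matrix view is an assumption for illustration.

```python
def lora_extra_params(d_in, d_out, rank):
    """Weights added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


hidden = 4096                 # hidden size of LLaMA-2 7B
full = hidden * hidden        # weights in one frozen square projection
extra = lora_extra_params(hidden, hidden, rank=32)
print(extra, full, extra / full)
```

Each adapted matrix therefore trains roughly 1.6% as many weights as the frozen projection it modifies.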

4.2 Evaluation Setups

We evaluate our model on three tasks of long video understanding, i.e., dense video captioning, temporal grounding, and highlight detection, in a zero-shot setting. The evaluation datasets include YouCook2, Charades-STA, and QVHighlights. See Appendix D for details of the datasets and evaluation metrics.

It is important to note that the outputs generated by LLMs may include colloquial expressions, leading to a wide range of response variations. Accordingly, we carefully devise a considerable number of heuristic rules to guarantee that predicted answers can be accurately extracted from the model’s responses for the computation of final metrics.
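One such heuristic can be sketched with a regular expression in Python. The pattern below is a hypothetical example of an extraction rule, not one of the rules actually used: it pulls "start to end" pairs out of free-form text, tolerating optional units and a few connective words.

```python
import re

_NUM = r"(\d+(?:\.\d+)?)"
_SPAN = re.compile(
    _NUM + r"\s*(?:seconds?|s)?\s*(?:-|to|and)\s*" + _NUM,
    re.IGNORECASE,
)


def extract_spans(response):
    """Return (start, end) pairs in seconds found in a model response."""
    spans = []
    for match in _SPAN.finditer(response):
        start, end = float(match.group(1)), float(match.group(2))
        if start <= end:  # discard pairs that cannot be a valid interval
            spans.append((start, end))
    return spans


print(extract_spans("The event happens from 12.5 to 30 seconds."))
print(extract_spans("Between 5s and 10s the chef chops onions."))
```

A real rule set would need further cases, such as minute:second formats or a single timestamp without an end time.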

We compare our model with two branches of baselines. (1) Multi-model pipelines, including VideoChat-Text and InstructBLIP+ChatGPT. These pipelines integrate specialized visual models with ChatGPT: they first convert video semantics (e.g., frame descriptions, clip captions, or action tags) into textual descriptions and then leverage ChatGPT to process all inputs and solve the target task. See Appendix E for more details. (2) End-to-end models, including Valley, VideoChat-Embed, and Video-LLaMA with 7B LLMs. These models directly take videos as inputs and generate responses in an end-to-end manner.

4.3 Zero-shot Performance

Tab. 2 shows the zero-shot performance of TimeChat (7B), which outperforms previous VidLLMs (7B/13B) in all tasks.

Dense video captioning. This task on YouCook2 is quite challenging. The model is required to accurately identify roughly 8 essential cooking steps within an average video duration of 320 seconds, while providing faithful descriptions that match the visual content. Moreover, the specialized nature of cooking amplifies the task complexity, thereby challenging the model's generalizability. Existing end-to-end VidLLMs struggle with precise moment localization, as evidenced by the low F1 score of 3.4 achieved by the top-performing VideoChat-Embed model. Such imprecision in moment localization significantly impacts the captioning evaluation, with both SODA_c and CIDEr metrics approaching zero. In comparison, our model exceeds the previous SOTA by +1.0 SODA_c, +2.8 CIDEr, and +9.2 F1 score. This reveals that TimeChat effectively processes lengthy videos with precise temporal localization capability. Moreover, our performance also significantly surpasses the multi-model pipelines powered by ChatGPT (F1 score from 8.4 to 12.6), which demonstrates both the challenging nature of this task and the advantage of our model in processing long videos.

Highlight detection. While the dense video captioning task focuses on localizing events at the clip level, this task requires more fine-grained video comprehension at the frame level. For an input video, it aims to output the times and the associated saliency scores of highlight frames. Overall, our model achieves 14.5 mAP and 23.9 HIT@1 on QVHighlights, surpassing previous VidLLMs by +1.4 and +5.8 points, respectively. This highlights the contribution of our timestamp-aware frame encoder in identifying the salient semantics of each frame. Moreover, this task is a held-out task in TimeIT, indicating the generalization ability of our model on novel tasks. The multi-model pipeline approaches achieve even stronger performance. We speculate that this is because the format of highlight detection is more compatible with their methods, as the model receives a series of joint timestamp-visual descriptions for the input frames. This enables frame-by-frame assessment by the LLM, facilitating more accurate judgments.

Temporal grounding. This task aims to identify the timestamps of the moment described by a query sentence. TimeChat achieves 32.2 points on “R@1, IoU=0.5” of the Charades-STA dataset, which surpasses the previous SOTA end-to-end VidLLM, i.e., Valley, by a substantial margin (+27.5). This demonstrates that our model excels at accurately localizing the video moment referred to by a given text query. Notably, TimeChat gains the largest improvement on the temporal grounding task. We argue that this task mainly emphasizes temporal localization capability, which is exactly the main advantage of TimeChat.

4.4 Qualitative Evaluation

Fig. 3 presents qualitative comparisons between TimeChat and other VidLLMs in zero-shot settings. Video-LLaMA falls short in fully adhering to the user instruction, as it only describes the cooking steps without the corresponding start and end timestamps for each step. VideoChat, on the other hand, produces captions that fit the requested format but misplaces the timing of all the steps. In addition, the description generated by VideoChat includes several hallucinations, such as references to “pepper” and “avocados” that are not present in the video. In contrast, TimeChat demonstrates improved temporal localization and summarization capabilities compared to the previous models. It successfully matches the content of the video for almost all extracted clips. Furthermore, the occurrence of hallucinations is significantly reduced. However, there is still room for improvement in the richness and detail of the summaries generated by our model.

In Fig. 4, we show qualitative results in new domains such as movie and egocentric videos, demonstrating the generalization of TimeChat to novel scenarios. This generalization is a key characteristic of a practical video assistant and represents a fundamental difference between the LLM-based TimeChat and current specialized models tailored for specific downstream datasets. More cases can be found in Appendix F.

4.5 Ablation Study

We conduct an ablation study on YouCook2 to assess the efficacy of the key designs in TimeChat. As illustrated in Tab. 3, when the sliding video Q-Former is removed, the number of final visual tokens decreases from 96 to 32, which corresponds to a 3× higher compression rate. This reduction in semantic information weakens the alignment between the generated descriptions and the video content. Specifically, the SODA_c metric decreases by 1.0, while the CIDEr metric decreases by 2.8. Additionally, the accuracy of timestamps (measured by F1 score) decreases by 3.0. When the timestamp-aware frame encoder is removed, the model's ability to temporally ground the descriptions diminishes, as indicated by a decrease of 2.3 in the F1 score. These results highlight the effectiveness of the two novel modules in the model.

4.6 Further Analysis

We provide further analysis to validate the advantages of our model. To demonstrate that the performance gain of our model is not solely attributed to the new TimeIT dataset, but also to the improvements in our model architecture, we conduct fine-tuning and evaluation using only the YouCook2 dataset. In this setup, we initialize our model with existing open-source checkpoints (see §4.1). For all the models, we finetune their Q-Formers and apply LoRA to their LLMs. Tab. 4 presents the results, showing that our model consistently outperforms previous models across all metrics, with increases of +2.2 CIDEr and +3.9 F1 score.

In Fig. 5, we examine the performance scalability of our model with respect to the number of input frames. As mentioned in §3.1.3, previous models like Video-LLaMA and VideoChat compress long videos excessively, so increasing the number of input frames from 32 to 96 has minimal impact on their performance. In contrast, TimeChat decouples the number of frames $T$ and the compression rate $R'$ using the sliding video Q-Former. Our curve exhibits linear improvement in performance as the number of frames increases, showcasing superior scalability.

4.7 Comparison with Specialized Models

In this subsection, we compare our generalist model, TimeChat, with state-of-the-art specialized models on each of the three tasks. Given that all specialized models have been fine-tuned on specific datasets, we also finetune our model for a fair comparison. As shown in Tab. 5, after fine-tuning, TimeChat achieves further performance gains, e.g., +6.9 F1 score on YouCook2, +16.9 HIT@1 on QVHighlights, and +16.4 R@1 (IoU=0.5) on Charades-STA. Nonetheless, there is still much room for improving our approach compared to specialized models, whose superior performance arises from task-specific designs. For example, Vid2Seq pretrains on YT-Temporal-1B, which contains many more high-quality long videos than the tuning dataset we used. QD-DETR employs a special saliency token for saliency prediction and introduces 4 loss functions for training, while our model trains purely through language modeling. In addition, these models also use many more fine-tuning steps to better fit the downstream dataset. However, as a generalist model, TimeChat exhibits strong generalization in zero-shot, multi-task, and multi-domain settings, which is not present in those expert models. Achieving state-of-the-art performance on every task is not the major goal of this paper, and we leave this as future work.

Discussion and Conclusion

We present TimeChat, a time-sensitive VidLLM for long video understanding. Benefiting from the novel time-aware frame encoder, sliding video Q-Former, and instruction tuning on TimeIT, our model demonstrates strong temporal localization capabilities that were absent in previous VidLLMs. Through its ability to identify significant events within lengthy videos, pinpoint events’ start and end times, and generate concise summarization, TimeChat makes a crucial step toward an intelligent video assistant. In the future, we will make architectural advances to improve video semantic density while reducing spatial-temporal redundancy. We will also collect more diverse and high-quality instruction-tuning data to broaden the time-related applications.

Acknowledgements

This work is supported in part by a Huawei Research Grant and National Natural Science Foundation of China (No. 62176002). Xu Sun is the corresponding author of this paper.

Appendix A Task Coverage in TimeIT

TimeIT encompasses 6 longstanding timestamp-related video tasks and incorporates 12 specific datasets derived from different domains.

Temporal Video Grounding. This task aims to predict a timestamp boundary, including the start and end time in the video, given a natural language query. We include the DiDeMo, QuerYD, $\text{HiREST}_{grounding}$, and Charades-STA datasets to achieve accurate moment localization when users interact in natural language.

Dense Video Captioning. This task unifies the event localization and event captioning subtasks. It detects a series of events in the given video and outputs the corresponding timestamps and descriptions. We gather the ActivityNet Captions, ViTT, and YouCook2 datasets to facilitate the narration of significant events for users watching long videos.

Video Summarization. The goal is to create a compressed set of frames or clip shots that represents the most informative content of the given video. The TVSum and SumMe datasets are included to give time-constrained viewers an efficient video overview.

Video Highlight Detection. Different from video summarization, this task identifies the most exciting, impressive, or emotional moments, which may not cover the full scope of the original video. The QVHighlights dataset is utilized to evaluate the highlight moment recommendation ability of AI assistants.

Step Localization and Captioning. This task is designed to automatically segment and describe significant steps in a long untrimmed video, which is useful for instructional videos. We incorporate two datasets, COIN and $\text{HiREST}_{step}$, to support key-step detection when processing noisy instructional videos in cooking, repairing, or furniture-assembly scenarios.

Transcribed Speech Generation. The objective of this task is to predict the speech content and its corresponding start and end timestamps based on visual signals in the video. This task can be regarded as a weakly-supervised event localization and description task. We use the YT-Temporal-1B dataset. The original dataset includes 18 million narrated videos collected from YouTube, from which we sample 31.6K videos for instruction tuning. Following Vid2Seq, we leverage Whisper-timestamped to automatically transcribe speech and use it as the target answer.

Our dataset accommodates prevalent user requests involving video timestamps when interacting with AI assistants in real-world applications.

Appendix B Instructions for Each Task

The quality and diversity of instructions are essential in the construction process. We manually write well-designed instructions for each task as a starting point. Then we utilize GPT-4 to produce more diverse and flexible expressions based on the manual initialization. Finally, we manually select and refine the LLM-generated instructions to obtain the final version. Inspired by the observation in $\text{M}^{3}\text{IT}$ that using around five instructions per task is sufficient, we generate six high-quality instructions for each task.

Tab. 6 shows instruction template examples and formatted output answers for each task.

Appendix C Hyper-parameters for Instruction Tuning

Tab. 7 lists hyper-parameters for instruction tuning.

Appendix D Details of Evaluation Datasets and Metrics

We evaluate our model on a range of benchmarks for long video understanding, i.e., dense video captioning, temporal video grounding, and video highlight detection, in a zero-shot setting.

(1) For dense video captioning, we use the YouCook2 dataset, which has 1,790 untrimmed videos of cooking procedures. On average, each video lasts 320s and is annotated with 7.7 temporally-localized imperative sentences. The dataset is split into 1,333 videos for training and 457 videos for validation. We evaluate caption quality using CIDEr. For an overall evaluation at the story level, we use the SODA_c metric. We also report the F1 score, which is the harmonic mean of the average precision and recall across IoU thresholds of 0.3, 0.5, 0.7, and 0.9, to measure event localization performance.
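The localization F1 described above can be sketched in Python as follows. The matching rule is an assumption borrowed from common dense-captioning evaluation practice: a predicted segment counts towards precision if it overlaps some ground-truth segment with IoU at or above the threshold, and a ground-truth segment counts towards recall symmetrically. The official evaluation script may differ in details.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def localization_f1(preds, gts, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Harmonic mean of threshold-averaged precision and recall."""
    if not preds or not gts:
        return 0.0
    precisions, recalls = [], []
    for thr in thresholds:
        p_hits = sum(any(temporal_iou(p, g) >= thr for g in gts) for p in preds)
        r_hits = sum(any(temporal_iou(p, g) >= thr for p in preds) for g in gts)
        precisions.append(p_hits / len(preds))
        recalls.append(r_hits / len(gts))
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


preds = [(0, 10), (20, 30)]
gts = [(0, 10), (22, 30), (40, 50)]
print(localization_f1(preds, gts))
```

In this toy case the second prediction has IoU 0.8 with its match, so it counts at every threshold except 0.9, and the missed third ground-truth segment lowers recall.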

(2) For temporal video grounding, we use the Charades-STA dataset. The dataset contains 6,670 videos and involves 16,124 queries, where 12,404 pairs are used for training and 3,720 for testing. The average duration of the videos is 30.59 seconds, each video contains 2.41 annotated moments on average, and a moment has an average duration of 8.09 seconds. The evaluation metric is "R@1, IoU = $\mu$", which denotes the percentage of retrieved moments with an intersection over union (IoU) greater than $\mu$ compared to the ground truth, given language queries.
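A minimal Python sketch of this metric is given below. It assumes one top-ranked predicted moment and one ground-truth moment per query, and it uses a strict "greater than" comparison to follow the wording above; some evaluation toolkits use "greater than or equal" instead. The function names are illustrative.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.5):
    """Percentage of queries whose top-1 moment exceeds the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one prediction per query")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) > iou_threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


top1 = [(0, 10), (5, 8), (20, 30)]
truth = [(0, 10), (0, 10), (26, 40)]
print(recall_at_1(top1, truth, iou_threshold=0.5))
```

Here the three IoUs are 1.0, 0.3, and 0.2, so only the first query counts as a hit at a threshold of 0.5.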

(3) For video highlight detection, we use the QVHighlights dataset. It consists of over 10,000 videos annotated with human-written text queries. The evaluation metrics are mAP (mean average precision) with IoU thresholds of 0.5 and 0.75, and HIT@1 (the hit ratio of the highest-scored clip).

Appendix E Details of Multi-model Pipelines

We take VideoChat-Text and InstructBLIP+ChatGPT as the multi-model pipeline baselines. These pipelines integrate specialized visual models with ChatGPT: they first convert video semantics into textual descriptions and then leverage ChatGPT to process all inputs and solve the target task.

VideoChat-Text utilizes ffmpeg to extract key frames from the video at FPS=1. Then it leverages visual tools to obtain rich video information, including action labels, frame summaries, video tags, comprehensive descriptions, object positional coordinates, video narratives, timestamps, and segment-related details. The overall visual information is processed by ChatGPT to respond to user instructions. We design task-related prompts to endow VideoChat-Text with the capability to solve timestamp-sensitive tasks.

InstructBLIP+ChatGPT employs a more powerful visual expert model, i.e., InstructBLIP, to describe each frame with exhaustive paragraphs containing detailed video semantics. We employ well-designed prompts (illustrated in Fig. 6) for ChatGPT to solve each task. For video input, we uniformly sample 50 frames to obtain frame descriptions.
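Uniform sampling of a fixed number of frames can be sketched in Python as below. Taking the midpoint of each equal-length segment is one common convention and is an assumption here; the text only states that 50 frames are sampled uniformly. The function name is illustrative.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the video.

    The video is split into equal segments and the middle frame of each
    segment is taken; short videos simply return every frame.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# A 5-minute video at 30 frames per second has 9000 frames.
indices = uniform_frame_indices(9000, 50)
print(indices[:3], indices[-1])
```

Each sampled index can then be converted to a timestamp by dividing by the frame rate, so that the frame descriptions passed to ChatGPT carry their times.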

Appendix F More Qualitative Results

Within Figures 7-9, we present an extended range of qualitative results, encompassing dense video captioning, temporal video grounding, and video highlight detection tasks. Overall, our model demonstrates proficiency in executing a diverse array of intricate temporal localization tasks.