Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan

Introduction

The surge of deep learning applications for video understanding has lead to major advancements in video-related tasks. However, the current video understanding models are still unable to hold an open-ended conversation about the video content in a coherent manner. A video-based dialogue model can revolutionize video search, surveillance operations and help summarize key events and abnormal event detection. Above all, it can provide a unified human-understandable interface to video-related tasks such as action recognition, localization, detection, segmentation, retrieval, and tracking. Further, such a capability is of great interest as it will demonstrate the model’s ability to encode temporal and spatial cues, contextual relationships and long-term dependencies.

Recent advancements in multimodal understanding are largely based on the combination of pretrained image models with Large Language Models (LLMs) but generally do not consider video inputs . It is therefore interesting to leverage the vast capabilities of LLMs for video understanding tasks in a way that would not only maintain the temporal and spatial characteristics but also be adept at generating human-like conversations about videos. In this paper, we introduce Video-ChatGPT, a novel multimodal model that merges the representational abilities of a pretrained visual encoder and the generative powers of an LLM, capable of understanding and conversing about videos.

Video-ChatGPT leverages an adapted LLM that integrates the visual encoder of CLIP with Vicuna as a language decoder, fine-tuned on generated instructional image-text pairs. Our approach further adapts the desgin for spatiotemporal video modeling and fine-tunes the model on video-instruction data to capture temporal dynamics and frame-to-frame consistency relationships available in video data. In contrast to other concurrent works for video-based conversation , Video-ChatGPT excels at temporal understanding, spatial consistency and contextual comprehension as demonstrated by our extensive evaluations.

A fundamental contribution of this work is the creation of a dataset of 100,000 video-instruction pairs using a combination of human-assisted and semi-automatic annotation methods. Each pair consists of a video and its associated instruction in the form of a question-answer. This provides Video-ChatGPT with a large and diverse dataset to learn from, increasing its video-specific understanding, attention to temporal relationships and conversation capabilities.

Moreover, we introduce the first quantitative video conversation evaluation framework for benchmarking, allowing for a more accurate evaluation of the performance of video conversation models. This framework evaluates models on a variety of capabilities, such as correctness of information, detail orientation, contextual understanding, temporal understanding, and consistency.

The contributions of this work are as follows,

We propose Video-ChatGPT, a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representations.

We introduce 100,000 high-quality video instruction pairs together with a novel annotation framework that is scalable and generates a diverse range of video-specific instruction sets.

We develop the first quantitative video conversation evaluation framework for benchmarking video conversation models. We demonstrate Video-ChatGPT to perform well compared to concurrent conversational engines for videos such as Video Chat .

Related Work

Vision Language Models: Significant advancements in the field of computer vision have recently been observed due to the development of many foundational vision-language models. These models represent a significant leap towards creating general-purpose vision models capable of tackling various tasks simultaneously . A prime example is CLIP , which is trained on 400M image-text pairs and has demonstrated impressive zero-shot performance on numerous benchmarks. It has been employed in various downstream applications, from image-based object detection and segmentation to 3D applications . Numerous attempts have also been made to adapt CLIP for video applications . Similar to our design, ViFi-CLIP suggests employing temporal pooling across video frames to adapt the image-based CLIP model for video-based tasks.

Large Language Models: The field of natural language processing has witnessed a paradigm shift with the advent of pretrained Large Language Models (LLMs) such as GPT , LLaMA , OPT , and MOSS . These models exhibit extraordinary abilities like language generation and in-context learning, and their knack for understanding intricate tasks given user prompts in a zero-shot manner reflects their impressive adaptability and generalization. The proven capabilities of LLMs have encouraged researchers to fine-tune them to maximize their proficiency.

A key strategy in this pursuit is instruction tuning. This approach focuses on improving the model’s alignment with user intentions and optimizing their output quality. For instance, InstructGPT and ChatGPT significantly benefit from this technique, showcasing improvements in diverse conversational interaction capabilities and their aptitude to answer a broad range of complex questions. This effective approach has recently been employed in open-source models like Alpaca and Vicuna , both developed using the LLaMA framework, resulting in performance improvements.

Pre-trained LLMs in Vision-Language Tasks: The recent strides in multimodal understanding have primarily been driven by the integration of image-based vision models with LLMs. Seminal contributions such as Flamingo and BLIP-2 have demonstrated the power of utilizing web-scale image-text data, as well as pioneering techniques in cross-modal alignment, to exhibit dynamic abilities in conversational and few-shot learning contexts. Building on this foundation, MiniGPT-4 allows image-based conversations by integrating BLIP-2 and Vicuna for zero-shot image comprehension.

Equally significant is the emergence of LLaVA , a model derived from the LLaMa architecture, leveraging GPT-4’s language proficiency to generate multimodal instruction-following data. With instruction tuning applied on the derived data, LLaVA has displayed interesting multimodal chat capability, hinting at the scalability potential of such a methodology. In addition, InstructBLIP model has demonstrated strong image-based dialogue capabilities via vision-language instruction tuning by innovating with instruction-aware visual feature extraction.

More closely related to our work, VideoChat employs selective components of video foundational models and image foundation models , and integrates them with LLMs in conjunction with few learnable layers, tuned using a two-stage lightweight training. Additionally, they construct a video-specific dataset using off-the-shelf vision-language models for generating noisy detailed textual descriptions to enhance the training of video-centric conversational models.

Different from VideoChat, we propose a novel human assisted and semi-automatic annotation framework for generation high quality instruction data for videos (see Sec. 4). Our simple and scalable architecture design utilizes pretrained CLIP to generate spatiotemporal features which help Video-ChatGPT in generating meaningful video conversation. Further, we are the first to propose quantitative framework for evaluating video conversation tasks (see Sec. 4).

Video-ChatGPT

Video-ChatGPT is a large vision-language model that aligns video representations with a Large Language Model (LLM), thus enhancing its ability to generate meaningful conversation about videos. Our approach draws from the approach employed in designing vision-language (VL) models for the video domain. Given the limited availability of video-caption pairs and the substantial resources required for training on such data from scratch, these models commonly adapt pretrained image-based VL models for video tasks . We adopt a similar approach, starting with the Language-aligned Large Vision Assistant (LLaVA) as our foundation.

LLaVA is a LMM that integrates the visual encoder of CLIP with the Vicuna language decoder and is fine-tuned end-to-end on generated instructional vision-language data. We fine-tune this model using our video-instruction data, adapting it for video conversation task. The video-instruction data is obtained as a combination of manual and automated pipelines in our proposed instruction generation setup. This adaptation on video-specific instructions allows for accommodating additional temporal dynamics, frame-to-frame consistency, and long-range relationships present in video data. As a result, our Video-ChatGPT excels in video reasoning, creativity, and understanding of spatial, temporal, and action-oriented components within videos.

A simple trainable linear layer $g$ , projects these video-level features into the language decoder’s embedding space, transforming them into corresponding language embedding tokens $Q_{v}$ ,

2 Video Instruction Tuning

We employ instruction-tuning of the LLM on the prediction tokens, utilizing its original auto-regressive training objective. The pretrained model is finetuned with curated, high-quality video-text pairs. During the finetuning phase, we use predefined prompts based on the following template:

USER: Assistant:

Using the notations, we can represent it as,

In this prompt, the represents a question pertaining to the video, randomly sampled from the training set of video-question-answer pairs. Questions can be general, asking to describe the video, or they may relate to specific temporal, spatial, or creative aspects of the video content. The prediction answer corresponds to the specific question asked. Throughout the training, the weights for both the video encoder and LLM remain frozen, and the model maximizes the likelihood of predicting tokens representing the answer by adapting the linear layer. Consequently, the video features $Q_{v}$ become aligned with the pre-trained LLM word embeddings, equipping Video-ChatGPT with the ability to produce more natural and dependable responses.

Video Instruction Data Generation

In this section, we discuss our data-focused approach, which uses both human-assisted and semi-automatic annotation methods to generate high-quality video instruction data. This data is crucial for training Video-ChatGPT, making sure the model gives accurate and meaningful responses. Our data collection involves two key methods. The human-assisted annotation, involves expert annotators analysing video content and providing detailed descriptions. This process generates data rich in context and detail, which helps our model understand complex aspects of video content. On the other hand, the semi-automatic annotation framework is more cost-effective and scalable. Leveraging state-of-the-art vision-language models, this method generates broad, high-volume annotations, thus increasing the quantity of data without compromising the quality substantially. Through these combined methods, we have successfully accumulated a robust set of 100,000 video-instructional pairs. This extensive dataset is crucial in fine-tuning our model to comprehend video content effectively, integrating both spatial and temporal cues into its understanding.

Our instructional data is both diverse and comprehensive, incorporating a wide range of data types. These include detailed descriptions, summarizations, question-answer pairs, tasks that stimulate creativity or generation of new ideas, and conversational tasks. The data spans a broad spectrum of concepts, ranging from visual appearance and temporal relations to complex reasoning tasks and beyond, providing a diverse training ground for our model to learn from.

In this process, we leverage datasets containing video-caption pairs and utilize the expertise of human annotators to enrich the original ground truth annotations. Specifically, we use a subset of the ActivityNet-200 dataset which provides concise ground truth descriptions of various activities in distinct video segments.

The annotators further enrich the captions by adding comprehensive information about physical appearances and spatial and temporal localization, among other critical contextual details. Figure 2 shows an example of how a ground truth caption is enriched using human-assisted annotation.

2 Semi-automatic Annotation Framework

In addition to the rich human-assisted annotations, we also harness the capabilities of advanced dense image vision-language models, developing a semi-automatic annotation framework. This approach is cost-effective and scalable, thereby increasing the quantity of data without substantially compromising the quality.

Similar to the human-assisted process, this framework also leverages datasets containing video-caption pairs. We enrich these datasets using contextual information drawn from off-the-shelf dense prediction and captioning image-based vision-language models. These models provide predictions that deliver additional contextual information, thereby enriching the video captions. We crafted developed a comprehensive method that combines these predictions, and utilize specific models for the purpose of eliminating noisy or irrelevant context from the data. This ensures that the data maintains its accuracy and relevance.

Building on the use of off-the-shelf models, we apply pretrained models like BLIP-2 and GRiT for key-frame analysis in the videos. The BLIP-2 image-captioning model generates frame-level captions, while the GRiT dense captioning model provides detailed captions for scene objects. Additionally, the pretrained Tag2Text model is used to generate tags for each key-frame of the video. Despite their utility, these models can introduce noise into the data.

To ensure high-quality data and mitigate noise, we implement three key steps. First, we maintain a high prediction threshold for all off-the-shelf models to uphold accuracy. Second, we employ a specialized filtering mechanism that removes any frame-level caption from BLIP-2 or GRiT not matching with the Tag2Text frame-level tags. This process involves extracting words from the frame-level captions that are within the predefined Tag2Text tags vocabulary, and eliminating any captions that contain words not in the tags for a given frame. This strategy acts as an additional filtering layer, enriches the captions by integrating predictions from multiple models.

In the third step, we merge frame-level captions and use the GPT-3.5 model to generate a singular, coherent video-level caption. This step augments the original ground truth caption with context from these models. We also direct GPT-3.5 to discard inconsistent information across frames, ensuring a precise, contextually rich video instruction dataset. Figure 3 illustrates how a ground truth caption is enriched using this process after all three refinement stages.

3 GPT-Assisted Postprocessing

Lastly, we implement a GPT-Assisted Postprocessing mechanism that refines and optimizes the enriched annotations, in order to generate high-quality video instructional data. We prompt GPT-3.5 model to create question-answer pairs from the enriched and detailed captions that cover a wide variety of aspects. These aspects include detailed descriptions, summarizations, question-answer pairs, tasks that stimulate creativity or the generation of new ideas, and conversational tasks.

Each of these elements plays a crucial role in our data-centric approach. Our ultimate goal is to create a video-based conversation model that is accurate, capable of understanding video content from both spatial and temporal cues, and adept at engaging in conversations.

Experiments

We use LLaVA as our baseline model and finetune it on 100K video instruction pairs. We only update the linear layer projecting the video features to the LLMs’ input space, while the rest of the architecture is kept frozen. We finetune the model for 3 epochs using a learning rate of 2 $e^{-5}$ and an overall batch size of 32. The training of our 7B model took around 3 hours on 8 A100 40GB GPUs. During inference, for memory efficiency, we load the models in FP16 mode.

In our semi-automatic annotation framework, we use Katna to extract the video key-frames. For the off-the-shelf Tag2Text model, we use the Swin-B version with input size of 384 $\times$ 384 and confidence threshold of 0.7. For GRIT , we use ViT-B version with CenterNet2 .

2 Quantitative evaluation

In this section, we highlight a key contribution of our work: the quantitative evaluation of Video-ChatGPT using advanced metrics and comparative evaluations with existing state-of-the-art models. We conduct two types of quantitative evaluations: i) Video-based Generative Performance Benchmarking and ii) Zero-Shot Question-Answer Evaluation.

Video-based Text Generation Performance Benchmarking: We introduce a benchmark to evaluate the text generation performance of video-based conversation models. To do this, we curate a test set based on the ActivityNet-200 dataset , featuring videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. We also develop an evaluation pipeline using the GPT-3.5 model. This pipeline assesses various capabilities of the model and assigns a relative score to the generated predictions on a scale of 1-5, in the following five aspects:

Correctness of Information: We verify the accuracy of the generated text, ensuring it aligns with the video content and doesn’t misinterpret or misinform.

Detail Orientation: We evaluate the depth of the model’s responses, looking for both completeness, meaning the model’s response covers all major points from the video, and specificity, denoting the inclusion of specific details rather than just generic points in the model’s response.

Contextual Understanding: We assess the model’s understanding of the video’s context, checking if its responses aligns with the overall context of the video content.

Temporal Understanding: We examine the model’s grasp of the temporal sequence of events in the video when answering questions.

Consistency: We evaluate the model’s consistency across different but similar questions or different sections of the video.

We present the results of the evaluation of our proposed model, Video-ChatGPT, using the quantitative benchmarking framework in Table 1. The results reveal its competent performance across all key aspects when compared with the recently introduced contemporary video conversation model, Video Chat . Video-ChatGPT shows good performance, largely due to the instruction tuning we perform and its straightforward architecture that leverages LLMs with a pretrained visual encoder fine-tuned for video data. This provides it with the robust ability to generate contextually relevant, detailed, and temporally accurate text from video input.

Zero-Shot Question-Answer Evaluation: We conducted a comprehensive quantitative evaluation using several commonly used open-ended question-answer datasets: MSRVTT-QA , MSVD-QA , TGIF-QA FrameQA , and ActivityNet-QA . These evaluations were carried out in a zero-shot manner, employing GPT-assisted evaluation to assess the model’s capabilities. This evaluation process measures the accuracy of the model’s generated predictions and assigns a relative score on a scale of 1-5.

To benchmark Video-ChatGPT, we compared its performance with other significant models, such as FrozenBiLM and the generative video model, Video Chat. FrozenBiLM is a model that adapts frozen bidirectional language models pretrained on Web-scale text-only data to multi-modal inputs, showing promising results in zero-shot VideoQA settings. Despite the solid foundation established by these models, Video-ChatGPT consistently outperformed them, achieving state-of-the-art (SOTA) performance across all datasets. These results indicate Video-ChatGPT’s ability to understand video content and generate accurate, contextually rich answers to questions.

3 Qualitative Evaluation

We performed an extensive evaluation of our model on a variety of open-ended video question-answering tasks, utilizing diverse videos sourced from ActivityNet and YouTube. The evaluation tasks included video reasoning (Figure 4), creative and generative tasks (see Figure 5), spatial understanding (Figure 6), action recognition (Figure 7), video conversation (Figure 8), question answering (Figure 9) and temporal understanding (Figure 10). Our model demonstrates proficiency in comprehending the content of the videos and generating accurate responses across multiple video based task. Our model can effectively understand the visual information present in the videos and provide precise answers (see Figures 4, 5, 6, 7, 8, 9 and 10).

Conclusion and Future Directions

In this work, we presented Video-ChatGPT, a multimodal model that merges a pretrained visual encoder with a large language model (LLM) to enable video understanding and conversations based on videos. Video-ChatGPT leverages an adapter on top of pretrained LLM and vision backbones and is fine-tuned on video-instruction data to capture temporal dynamics and spatial consistency relationships in spatiotemporal sequences. A dataset of 100,000 video-instruction pairs is created to enhance Video-ChatGPT’s video-specific understanding and conversation capabilities. The work also introduced a quantitative video conversation evaluation framework for benchmarking, evaluating models on a diverse set of capabilities including conventional video question answering as well as open-ended descriptions. While the model performs competitively in several scenarios, we note it finds it challenging to understand subtle temporal relationships and the visual details of small objects. As a future work, Video-ChatGPT can be extended to simultaneously deal with multiple modalities and to enhance its video comprehension capabilities towards an all-in-one dialogue agent for universal visual content understanding.

Acknowledgements

We would like to thank colleagues for their contribution to the video annotation task, including Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Jiale Cao, Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Muzammal Naseer, Akshay Dudhane, Jean Lahoud, Awais Rauf, Sahal Shaji, Bokang Jia.