VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao

Introduction

Videos offer a remarkably close representation of how humans consistently perceive the visual world. Intelligent video understanding is crucial for various real-world applications, such as human-robot interaction, autonomous driving, and intelligent surveillance. However, current paradigms in video understanding are limited by task-specific tuning of pre-trained video foundation models, restricting a general spatiotemporal comprehension for video content.

Vision-centric multimodal dialogue systems have recently emerged as an essential research area. By utilizing a pre-trained large language model (LLM), an image encoder, and additional learnable modules, these systems can deeply understand images (e.g., recognizing memes or jokes) and perform image-related tasks through multi-round dialogues with user queries. This revolutionizes numerous visual applications, but existing systems have yet to formally address video-centric tasks from a data-centric perspective using learning machines.

Our initial video-centric multimodal dialogue system (https://github.com/OpenGVLab/Ask-Anything, released on April 15, 2023) formulates video understanding as a natural language processing (NLP) question-answering task by textualizing video content with open-source vision models. Despite demonstrating decent performance in short-term scenarios with clear objects and actions, transforming videos into textual descriptions inevitably results in visual information loss and over-simplification of spatiotemporal complexities. Additionally, almost all utilized vision models struggle with spatiotemporal reasoning, event localization, and causal relationship inference within videos.

To tackle these challenges, we improve our initial dialogue system and introduce a groundbreaking chat-centric video understanding system that leverages state-of-the-art techniques from both video and language domains. Our approach creates a full loop, integrating video and language foundation models in a learnable manner from a model perspective, and provides all techniques required to learn the system from a data perspective.

We begin by presenting our novel video-centric multimodal dialogue system. We propose an innovative system architecture that combines video foundation models and large language models (LLMs) through a learnable neural interface. By a two-stage lightweight training (with only spatiotemporal and video-language alignment modules) on large-scale video-text datasets and self-built video instruction ones, our method excels in spatiotemporal perception & reasoning, and causal inference, marking the first attempt to create a fully learnable and efficient video understanding system that facilitates effective communication.

We introduce a novel video-centric multimodal instruction fine-tuning dataset. We create a unique dataset comprising thousands of videos paired with detailed textual descriptions and conversations generated using dense captions fed to ChatGPT in temporal order. This dataset emphasizes spatiotemporal objects, actions, events, and causal relationships, offering a valuable resource for training video-centric multimodal dialogue systems.

Through these contributions, our work pioneers new frontiers in video and natural language processing integration. By developing a new and effective chat-centric video understanding dialogue system, we pave the way for a wide range of applications across various domains while setting a standard for future research in this field. Our research not only pushes the boundaries of video understanding and reasoning but also offers protocols for both academic and industrial communities.

Related Work

Large-scale video-text pretraining coupled with downstream task fine-tuning has emerged as the standard paradigm in the video-language domain. Early methods employed pretrained visual and language encoders to derive offline video and text features; however, more recent approaches have demonstrated the effectiveness of end-to-end training. Additionally, prevalent techniques often encompass two or three pretraining tasks, such as masked language modeling, video-text matching, video-text contrastive learning, masked video modeling, and video-text masked modeling. Within the realm of video multimodal tasks, VIOLET integrates masked language and masked video modeling, while All-in-one suggests a unified video-language pretraining methodology using a shared backbone, and LAVENDER consolidates the tasks through masked language modeling. Although these approaches yield impressive results in multimodal benchmarks, their training relies on limited video-text data, which leads to difficulties in video-only tasks such as action recognition. On the other hand, MERLOT Reserve compiles 20 million video-text-audio pairs for training joint video representations via contrastive span matching, thereby establishing state-of-the-art outcomes in video recognition and visual commonsense reasoning.

Recent advances in large language models (LLMs) have showcased remarkable capabilities such as language generation and in-context learning. These abilities enable LLMs to tackle complex tasks with user prompts in a zero-shot fashion. GPT-3 shows notable zero-shot performance across numerous benchmarks. InstructGPT models are finetuned using datasets containing prompts with corresponding human-annotated desired behavior. This results in better alignment with users, improved output quality compared to GPT-3, increased truthfulness, and reduced risks. Instruction-tuned models also present remarkable generalization capacity for zero-shot tasks. Therefore, instruction-tuning is crucial in leveraging LLMs’ potential. Besides the GPT family, there are multiple LLMs, including OPT, LLaMA, MOSS, and GLM, providing high-performance, open-source resources that can be finetuned for various purposes. For instance, Alpaca proposes a self-instruct framework to instruction-tune LLaMA models without heavily relying on human-authored instruction data.

The accomplishments of LLMs have accelerated the creation of AI systems that merge vision models with LLMs to enable multimodal reasoning and action. Flamingo pioneered this approach by capitalizing on both vision and language models using web-scale image-text interwoven data, unveiling exceptional zero-shot image-text abilities in a conversational format for the first time. Kosmos-1 models have been shown to be naturally equipped to tackle a broad array of perception-intensive tasks, including visual dialogue, visual explanation, visual question answering, image captioning, basic math equations, OCR, and zero-shot image classification using descriptions. Visual instruction tuning introduces an innovative technique for refining large language models on visual instruction tasks, enabling pretrained BLIP and Vicuna to nearly match GPT-4 level conversation performance for image-based tasks. MiniGPT-4 is a multimodal large language model, fine-tuned on multimodal tasks, that exhibits respectable zero-shot image comprehension in dialogues.

VideoChat

VideoChat unifies video-related tasks into the formulation of multi-round video question answering, in which tasks are specified in words at inference time and no examples, or only a few, are given for learning. In this formulation, we treat an LLM as a universal video task decoder, turning video-related descriptions or embeddings into human-understandable text. This procedure makes it user-friendly to employ foundation models for various video applications.

Formally, we extract concepts from videos using vision models as:
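One way to write this, consistent with the definitions that follow (the frame index $i$ and the video-model symbol $f_{\text{vid}}$ are assumed notation), is:

$$\mathbf{E}_{i}^{j} = f_{\text{img}}^{j}(\mathbf{I}_{i}), \quad \mathbf{I}_{i} \in \mathbf{V}, \qquad \mathbf{E}_{v} = f_{\text{vid}}(\mathbf{V})$$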

where $\mathbf{E}$ denotes a text description or embedding according to context, $f_{\text{img}}^{j}$ denotes the $j$-th image model, which predicts human-readable annotations or visual features, while $\mathbf{I}$ and $\mathbf{V}$ denote an image and a video, respectively. Then we decode the task prediction from an LLM based on the user’s question as:
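A form consistent with the symbols defined next (a sketch; the conditioning bar is assumed notation) is:

$$\mathbf{W}_{t}^{a} = f_{\text{llm}}\left(\mathbf{E} \mid \mathbf{W}_{\leq t}^{q}\right)$$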

where $\mathbf{W}_{t}^{a}$ and $\mathbf{W}_{\leq t}^{q}$ stand for the answer from the LLM at round $t$ and all questions given by users up to and including round $t$, respectively, and $f_{\text{llm}}$ denotes the LLM.

In technical terms, an ideal end-to-end chat-centric video understanding system should utilize a video/vision model (an encoder) to convert visual sequences into latent features for LLM, guaranteeing the system’s overall differentiability. Prior to this, we verify the efficacy of LLM as a universal video task interpreter through our proposed VideoChat-Text (Section 3.1). This method transforms videos into textual streams for subsequent discrimination/reasoning tasks using LLMs by incorporating various open-source vision models. While VideoChat-Text can tackle typical spatiotemporal tasks such as spatial and temporal perception, it falls short in comprehending intricate temporal reasoning and causal inference. Therefore, we introduce VideoChat-Embed (Section 3.2), a multimodal system that combines both video and language foundation models. Finetuned with video instruction data, it significantly enhances performance in higher-order temporal assignments. We will describe these two approaches in the following sections.

We employ several vision models to convert video data into textual format. Subsequently, we create purpose-built prompts to temporally structure the predicted text. Ultimately, we rely on a pretrained LLM to address user-specified tasks by responding to questions based on video text descriptions.

In particular, for a given video, we use $\mathtt{ffmpeg}$ to extract key frames from the video at a low $\mathtt{FPS}$, resulting in $T$ video frames and associated audio. By feeding the extracted frames and audio into various models, we acquire action labels, frame summaries, video tags, comprehensive descriptions, object positional coordinates, video narratives, timestamps, and other segment-related details. We then consolidate related content in the captions according to timing and generate a timestamped video text description. We will first outline the vision models and prompt schematics employed, and then conclude with an analysis of VideoChat-Text.
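The extraction step can be sketched as follows. This is a minimal illustration, not the system's actual script: the helper names, the output file names, and the 16 kHz mono WAV audio format are assumptions, and the `fps` filter resamples frames uniformly rather than selecting key frames.

```python
from pathlib import Path


def build_ffmpeg_commands(video_path: str, out_dir: str, fps: float = 1.0):
    """Return two ffmpeg argument lists: one for frames, one for audio.

    File names and the audio format are illustrative assumptions.
    """
    out = Path(out_dir)
    frames_cmd = [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",          # resample to the requested frame rate
        str(out / "frame_%04d.jpg"),  # numbered frames: frame_0001.jpg, ...
    ]
    audio_cmd = [
        "ffmpeg", "-i", video_path,
        "-vn",                        # drop the video stream
        "-ac", "1", "-ar", "16000",   # mono, 16 kHz
        str(out / "audio.wav"),
    ]
    return frames_cmd, audio_cmd


def frame_timestamps(duration_s: float, fps: float = 1.0):
    """Approximate timestamps (in seconds) of the T sampled frames."""
    count = int(duration_s * fps)
    return [i / fps for i in range(count)]
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists; the timestamps are what later tie each frame-level prediction to a position in the video.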

Utilizing a combination of video and image models, we analyze videos from various aspects such as actions (with InternVideo), objects, object annotations with positions, and more. While the majority of these models’ outputs are comparatively independent, we utilize the pretrained T5 language model to refine their descriptions for improved clarity. Moreover, we integrate the Whisper speech recognition model into VideoChat-Text to capitalize on audio data within videos, further enhancing the richness of video descriptions.

3.1.2 Prompt System

We pass the video through several vision models to obtain textual descriptions of it, and then organize these descriptions in a template (Table 1) as input to an LLM. We then instruct the LLM to act as if it were watching the given video through the formatted texts (the structured video knowledge from the perception models) and to chat with users. This prompt is shown in Table 2.
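A minimal sketch of this assembly step is given below. The line format, the system instruction wording, and both function names are hypothetical stand-ins; the actual template and prompt are those of Tables 1 and 2.

```python
def format_timestamped_text(segments):
    """Render (second, source, text) triples as time-ordered lines."""
    ordered = sorted(segments, key=lambda s: s[0])
    return "\n".join(f"[{sec}s] {source}: {text}" for sec, source, text in ordered)


def build_chat_prompt(segments, question):
    """Wrap the structured video text and a user question into one prompt."""
    # Hypothetical instruction; the real wording is given in Table 2.
    system = (
        "You are a chatbot that answers as if you had watched the video. "
        "Use only the timestamped descriptions below."
    )
    video_text = format_timestamped_text(segments)
    return f"{system}\n\n{video_text}\n\nUser: {question}\nAssistant:"


segments = [
    (3, "caption", "a boy starts dancing"),
    (0, "action", "playing basketball"),
    (3, "speech", "watch this"),
]
prompt = build_chat_prompt(segments, "What happens first?")
```

Sorting by timestamp before rendering keeps outputs from different perception models interleaved in temporal order, which is what lets the LLM answer questions about what happened when.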

Lightweight perception models enable VideoChat-Text to convert videos into time-stamped text at 1 FPS, processing a 10-second video clip in about 2 seconds on an NVIDIA-A10 GPU. It communicates with users through an LLM. However, using text as the communication medium restricts the representation capabilities of the perception models, as it limits their decoders. To provide richer visual information from videos to the LLM, we must employ more advanced and more numerous perception models, which may conflict with VideoChat-Text’s efficiency. Additionally, VideoChat-Text has limited potential to benefit from popular visual instruction tuning.

3.2 VideoChat-Embed: VideoChat by Encoding Videos as Embeddings

VideoChat-Embed is an end-to-end model designed to handle video-based dialogue. It employs an architecture that combines both video and language foundation models with an additional learnable Video-Language Token Interface (VLTF). To achieve better cross-modality optimization, the model incorporates language-friendly video foundation models, following prior work. Considering the redundancy in video, we introduce the VLTF, which uses cross-attention to compress the video tokens. It is tuned with video-text data for video-to-language representation alignment. Finally, the video tokens, user queries, and dialogue context are input into the LLM for communication.
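The compression idea can be illustrated with a single-head cross-attention in plain Python. This is a toy sketch under simplifying assumptions, not the VLTF itself: the learned key, value, and output projections are omitted, the queries are random rather than learned, and all sizes are arbitrary.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_compress(queries, video_tokens):
    """Compress N video tokens into len(queries) output tokens.

    Each query attends over all video tokens (scaled dot-product attention)
    and returns the attention-weighted average of those tokens. Projections
    are omitted, so keys == values == video_tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in video_tokens]
        weights = softmax(scores)
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, video_tokens))
            for d in range(dim)
        ])
    return outputs


random.seed(0)
DIM, NUM_VIDEO_TOKENS, NUM_QUERIES = 8, 64, 4  # illustrative sizes only
video_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_VIDEO_TOKENS)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
compressed = cross_attention_compress(queries, video_tokens)
```

The number of output tokens is set by the number of queries rather than by the video length, so the LLM receives a fixed, small token budget however many frame tokens the encoder produces.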

In this paper, we instantiate VideoChat-Embed based on BLIP-2 and StableVicuna (Figure 2). Concretely, we incorporate the pretrained ViT-G with the Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo and UniFormerV2. For the token interface, we employ the pretrained QFormer with an extra linear projection, supplemented by additional query tokens to account for video context modeling. This allows us to obtain compact LLM-compatible video embeddings for dialogues.

When training, we freeze most of the parameters except the newly incorporated GMHRA, queries, and linear projection. Following prior work, we introduce image data for joint training (Figure 2). In Stage1, we align the video encoder with the LLM via large-scale video-text fine-tuning. In Stage2, we tune the system with two types of video instruction data: in-depth video descriptions and video question-answer pairs. The following sections describe the process of generating instruction data and present the details of the two-stage training paradigm.

3.2.2 Instruction Data

We build a video-centric multimodal instruction dataset based on WebVid-10M. The corresponding detailed descriptions and question-answer pairs are produced by ChatGPT based on video text (aided by VideoChat-Text) with several prompts concerning spatiotemporal features. Compared with detailed video descriptions, video conversations are introduced to further improve the diversity, temporal, and causal features in the video instruction data.

We condense the provided video description into a video narrative employing GPT-4, as shown in Table 5. This highlights the temporal aspects of the video by illustrating its progression over time. The associated prompts can be found in Table 5 and 5. The first converts the various predicted textual labels into a cohesive, evolving story, while the second one refines the narrative to improve clarity and coherence, as well as minimize hallucination. We generated a total of 7K descriptions from randomly chosen videos.
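The two-pass procedure can be sketched as below. The function name, the prompt wording, and the injected `llm` callable are all hypothetical; the actual prompts are those given in the tables, and any chat-completion client could be wrapped as `llm`.

```python
from typing import Callable, Dict, List


def build_narrative(labels: Dict[str, List[str]], llm: Callable[[str], str]) -> str:
    """Two-pass description: draft a story from labels, then refine it."""
    facts = "\n".join(f"{name}: {'; '.join(items)}" for name, items in labels.items())
    # Pass 1: merge the predicted textual labels into one evolving story.
    draft = llm(
        "Turn these timestamped predictions into one coherent, "
        "chronological narrative:\n" + facts
    )
    # Pass 2: refine for clarity and coherence, discouraging added details.
    refined = llm(
        "Rewrite the narrative for clarity and coherence. Remove anything "
        "not supported by the narrative itself:\n" + draft
    )
    return refined
```

Passing the model in as a plain callable keeps the sketch independent of any particular API and makes the pipeline easy to exercise with a stub.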

With the video description, we use ChatGPT to generate multi-round dialogues with three types of prompts concerning descriptive, temporal, and causal content for videos. The descriptive part mostly inherits key points from LLaVA. For the temporal and causal parts, we propose prompts (Table 6) that focus on temporal perception/reasoning and on explanation/uncovering intentions/causes, respectively. We produced multi-round dialogues from 4K randomly chosen videos. One example of the video conversation can be found in Table 7.

3.2.3 Two-stage Training

Motivated by MiniGPT-4 and LLaVA, we have designed a two-stage joint training paradigm. This approach allows us to benefit from readily available image instruction data, creating a system capable of handling both images and videos with shared spatial perception and reasoning capacity.

To strike a balance between training convergence and efficiency, we introduce 25M vision-text pairs for one epoch of fine-tuning. The data consists of 10M video-text pairs from WebVid-10M and 15M image-text pairs from COCO Caption, Visual Genome, SBU Captions, CC3M, and CC12M. The input prompts for the LLM are as follows:

“###Human: video_embed video_instruction ###Assistant:”

“###Human: image_embed image_instruction ###Assistant:”

The video_embed and image_embed are the outputs of the token interface. Meanwhile, the video_instruction and image_instruction ask for concise video and image descriptions and are randomly sampled from the predefined instructions in Table 8. The language model is given the corresponding visual descriptions as answers.
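A minimal sketch of how such a prompt string might be assembled is shown below. The instruction lists are hypothetical stand-ins for the predefined instructions of Table 8, and the literal placeholders `video_embed` and `image_embed` mark where the token interface outputs would be spliced in as embeddings.

```python
import random

# Hypothetical stand-ins for the predefined brief instructions (Table 8).
VIDEO_INSTRUCTIONS = ["Describe the video concisely.", "Give a brief summary of the video."]
IMAGE_INSTRUCTIONS = ["Describe the image concisely.", "Give a brief summary of the image."]


def build_stage1_prompt(modality, rng=random):
    """Return a prompt string with a placeholder where visual embeddings go."""
    if modality == "video":
        embed, instruction = "video_embed", rng.choice(VIDEO_INSTRUCTIONS)
    elif modality == "image":
        embed, instruction = "image_embed", rng.choice(IMAGE_INSTRUCTIONS)
    else:
        raise ValueError(f"unknown modality: {modality}")
    return f"###Human: {embed} {instruction} ###Assistant:"
```

For example, `build_stage1_prompt("video", random.Random(0))` yields a string that begins with `###Human: video_embed` and ends with `###Assistant:`; the caption paired with the sample serves as the target text that follows.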

As discussed in Section 3.2.2, our self-built video instruction data consists of 7K detailed video descriptions and 4K video conversations. To improve spatial perception and reasoning capabilities, we also gather 3K detailed image descriptions from MiniGPT-4, together with 2K image conversations and 2K image reasoning tasks from LLaVA. With this 18K data collection, we tune the system for 3 epochs. Note that, to support temporal reasoning, we include frame-sampling information for video data: “The video contains T frames sampled at $t_{0},t_{1},...,t_{T}$ seconds.”
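The data mix and the sampling sentence can be checked with a few lines. The dictionary keys, the function names, and the midpoint sampling scheme are assumptions made for illustration; only the counts and the sentence template come from the text.

```python
# Stage2 instruction data sizes as stated in the text (keys are illustrative).
STAGE2_DATA = {
    "video_descriptions": 7_000,
    "video_conversations": 4_000,
    "image_descriptions_minigpt4": 3_000,
    "image_conversations_llava": 2_000,
    "image_reasoning_llava": 2_000,
}


def sampling_sentence(timestamps):
    """Build the frame-sampling sentence that accompanies a video input."""
    times = ", ".join(f"{t:.1f}" for t in timestamps)
    return f"The video contains {len(timestamps)} frames sampled at {times} seconds."


def uniform_timestamps(duration_s, num_frames):
    """Midpoints of num_frames equal segments (an assumed sampling scheme)."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


total_samples = sum(STAGE2_DATA.values())  # 7K + 4K + 3K + 2K + 2K
sentence = sampling_sentence(uniform_timestamps(8, 4))
```

Here `total_samples` comes to 18,000, matching the 18K collection, and `sentence` reads "The video contains 4 frames sampled at 1.0, 3.0, 5.0, 7.0 seconds."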

Experiments

We give some case studies at this stage. Besides our VideoChat-Text and VideoChat-Embed, we make qualitative comparisons with LLaVA, MiniGPT-4, and mPLUG-Owl.

In Figure 7, our approach (VideoChat-Embed) accurately deduces the corresponding music by recognizing Japanese-style clothing and determining the number of individuals present. This confirms the system’s ability to identify objects along with their properties, while also providing pertinent recommendations based on visual elements. Also, we give some image-centric dialogue examples in Figure 10.

Figure 5, 7, and 9 demonstrate that VideoChat-Embed is capable of performing accurate temporal perception and reasoning. In Figure 5, our system identifies actions over time in a zero-shot fashion, recognizing that the subject played basketball and engaged in dance movements within a specific timeframe. Additionally, it captures camera motion, showcasing its understanding of filming perspectives. In Figure 9, VideoChat-Text accurately identifies yoga in the video and provides rational explanations for this activity (practice and enjoyment). Intriguingly, when questioned about the likelihood of the yoga practitioner falling, it asserts that proper safety precautions were taken by the individual.

It is evident that VideoChat-Embed can infer causal relationships using spatiotemporal clues, as demonstrated in Figure 5, 7, 7, and 5. In Figure 5, the model provides an impartial description of the video, primarily highlighting objects, actions, and emotions without commenting on the boy’s dance style. To explain why the video is amusing, VideoChat-Embed cites the erratic and spontaneous nature of the boy’s movements while also conveying an emotional assessment that his dancing appears foolish, accounting for the unexpected humor within the clip. Empirically, we confirm that these visually associated abstract concepts are derived from the video foundation model instead of being hallucinations produced by the utilized LLM. In Figure 7, VideoChat-Embed accurately assesses that a car accident occurred due to the collision of vehicles and the damage sustained to the front car’s license plate visible to the camera. In Figure 7, VideoChat suggests pairing the video with light, enjoyable music, as it can sense the girls’ dancing rhythm and progression over time. This demonstrates that our system can effectively capture and summarize abstract concepts related to movement patterns.

4.2 Comparisons

As depicted in Figure 3, we present a comparison of our approach to recent image-based multimodal dialogue systems in image-related tasks, using a query example from the TVQA dataset. We assess this case through the online demos provided by each respective system. It is evident that our VideoChat-Embed correctly identifies the scene, while other systems inaccurately perceive the conversation setting as indoors. This result highlights the superior spatial perception abilities of VideoChat-Embed in relation to its counterparts. Furthermore, this proficiency remains consistent when dealing with a video from the same dataset, as demonstrated in the right-hand example.

Conclusion

We have embarked on a pioneering investigation into general video understanding by developing VideoChat, a multimodal dialogue system specifically designed for videos. Two versions of VideoChat are implemented: a text-based version demonstrates the effectiveness of employing large language models as universal decoders for video tasks, while an end-to-end version makes a preliminary attempt to tackle video understanding using an instructed video-to-text formulation. Our end-to-end solution effectively merges video foundation models with large language models through a trainable neural interface. To enhance the system’s performance, we introduced a video-centric instruction dataset, which highlights spatiotemporal reasoning and causality and offers a learning resource for video-based multimodal dialogue systems. Initial qualitative evaluations showcase our system’s promising capabilities across various video applications and motivate its ongoing development.

1) Both VideoChat-Text and VideoChat-Embed struggle with long videos ($\geq$ 1 min). On the one hand, effectively and efficiently modeling the context of long videos remains a complex and persistent research issue. On the other hand, balancing response time, GPU memory usage, and user expectations for system performance becomes challenging when striving to provide user-friendly interactions while processing longer videos. 2) Our system’s capacities for temporal and causal reasoning remain rudimentary. These limitations stem from the current scale of our instruction data and its construction approaches, as well as the overall system scale and the models employed. 3) Addressing performance disparities in time-sensitive and performance-critical applications, such as egocentric task instruction prediction and intelligent monitoring, is an ongoing challenge.

Our future work lies in 1) scaling video foundation models in both capacity and data for better spatiotemporal modeling, 2) building video-centric multimodal training data and reasoning benchmarks for evaluation at scale, and 3) developing long-term video processing techniques.

References

Appendix A

Following the brief image instruction in LLaVA, we generate the video instruction with the aid of ChatGPT as shown in Table 8. In Stage1, we randomly sample the instruction to generate brief descriptions of images and videos.

Following the detailed image instruction in LLaVA, we generate the video instruction with the help of ChatGPT as shown in Table 9. To build the instruction data used in Stage2, we randomly sample the instruction and combine it with the detailed descriptions.