Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan

Introduction

Large language models (LLMs) have demonstrated strong capabilities in handling natural language processing (NLP) tasks, including comprehension, composition and reasoning, and achieved remarkable advancements on NLP benchmarks. This success has also inspired studies on Video-LLMs, where models process video inputs with textual prompts and generate corresponding answers, shedding light on the future format of artificial general intelligence (AGI) for video understanding.

With the ultimate goal of achieving artificial general intelligence in mind, we assert that a truly intelligent video-language model should exhibit at least three distinct human-like capabilities: (i) Video-exclusive Understanding, i.e., performing well on questions whose answers can be extracted from the video itself; (ii) Prior Knowledge-based Question-Answering, i.e., answering questions that require prior knowledge beyond the video, such as commenting on NBA games or providing background information on specific music videos; (iii) Comprehension and Decision-making, i.e., comprehensively understanding scenarios and making predictions and informed decisions. Example applications include 3D scene understanding and decision-making for autonomous driving.

To gradually approach this goal, the establishment of an evaluation benchmark is indispensable for precisely measuring and steering the development progress. However, we find that existing benchmarks fall short of serving this purpose comprehensively. For instance, MMBench and LVLM-eHub are concentrated on image understanding, ignoring the video understanding ability. SEED-Bench includes several video tasks but is limited to temporal understanding, thus only covering the first level. To this end, we propose a new large-scale benchmark along with a toolkit, referred to as “Video-Bench”, to furnish a thorough evaluation of Video-LLMs. The composition of Video-Bench is depicted in Fig. 1.

In detail, aligning with our motivation, our Video-Bench encompasses tasks categorized into three distinct levels of capability: (i) For Video-exclusive Understanding, we begin by randomly selecting parts of traditional QA pairs, and then propose more challenging tasks to assess both temporal and contextual aspects of videos. Tasks include video summarization, abnormal detection, and crowd counting; (ii) For Prior Knowledge-based Question-Answering, we evaluate the capability of models in understanding TV dramas, appreciating music videos, and providing information about players and games in NBA videos; (iii) For Comprehension and Decision-making, we employ two classical tasks: 3D indoor scene understanding and auto-driving decision-making to assess the comprehension and decision-making abilities of models.

To streamline the evaluation process, we include another crucial component, i.e., the evaluation toolkit, along with the benchmarks. The toolkit automatically maps the long text outputs of Video-LLMs to corresponding answers with probability selection or LLM-based semantic understanding. Subsequently, it calculates accuracy for each question and generates a final score, enhancing the efficiency of the evaluation workflow.

We evaluate eight representative Video-LLMs on Video-Bench: VideoChat, Video-ChatGPT, Otter, Valley, PandaGPT, mPLUG-Owl, Video-LLaMA, and Chat-UniVi, all with verified open-source model weights. The evaluation results reveal several interesting findings: (i) Most recent models can summarize the main content of videos but lack the capacity to detect details and temporal information. (ii) Due to the absence of domain-specific prior knowledge in the training data, these models encounter challenges in accurately comprehending and responding to queries within a particular domain. (iii) Due to constraints in multimodal information extraction and the use of a weakened LLM backend (either 7B or 13B), the majority of tested models exhibit limited proficiency in comprehension and decision-making within complex scenarios. Our contributions can be summarized as follows:

We introduce Video-Bench, the first comprehensive evaluation benchmark for Video-LLMs, featuring a three-level ability assessment that systematically evaluates models in video-exclusive understanding, prior knowledge incorporation, and video-based decision-making abilities.

We provide a user-friendly evaluation toolkit. Accompanied by our datasets and QA pairs, the toolkit can streamline the performance assessment of Video-LLMs.

We conduct extensive experiments to evaluate prominent Video-LLMs, summarizing their behaviors, analyzing main causes for observed limitations, and proposing future directions for improvement.

Related Work

Video-LLMs. Extending Image-based Large Language Models (Image-LLMs) to the video modality introduces a complex challenge, necessitating the incorporation of temporal dimensions to interpret diverse frame information. Beyond visual content, the integration of audio, subtitles, and other modalities becomes crucial for a comprehensive understanding of video semantics. In response to this challenge, a series of Video-LLMs have emerged, building upon open-source LLMs or Image-LLMs.

As outlined in Table 1, VideoChat utilizes the Q-Former to map visual representations to Vicuna, implementing a two-stage training process. Video-ChatGPT and Valley originate from the LLaVA framework and introduce average pooling to enhance temporal sequence perception. Otter proposes the MIMIC-IT dataset and fine-tunes OpenFlamingo on it. PandaGPT employs ImageBind as its backend for video comprehension. mPLUG-Owl introduces an abstractor module to align image and text. Video-LLaMA incorporates a frame embedding layer and ImageBind to inject temporal and audio information into the LLM backend, while Chat-UniVi merges visual tokens with similar semantic meanings using a clustering strategy. Existing Video-LLMs vary in their training strategies and data scales, with only a subset addressing challenges related to temporal dimensions and audio modalities.
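
As a rough illustration of what averaging over the time axis means for frame features, the following minimal sketch pools per-patch vectors across frames. The nested-list layout (frames x patches x dimensions) and the function name are assumptions made for illustration, and this is not the exact pooling scheme of any of the models listed above.

```python
from typing import List

Vector = List[float]


def temporal_average_pool(frame_features: List[List[Vector]]) -> List[Vector]:
    """Average each patch's feature vector over all frames.

    frame_features[t][p] is the feature vector of patch p in frame t
    (hypothetical layout: T frames x P patches x D dimensions).
    Returns P vectors, each the mean over the T frames.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    num_frames = len(frame_features)
    num_patches = len(frame_features[0])
    dim = len(frame_features[0][0])
    pooled = [[0.0] * dim for _ in range(num_patches)]
    for frame in frame_features:
        for p, vec in enumerate(frame):
            for d, value in enumerate(vec):
                pooled[p][d] += value / num_frames
    return pooled


# Toy input: 2 frames, 2 patches, 2-dimensional features.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 6.0], [5.0, 8.0]],
]
pooled_features = temporal_average_pool(features)
```

Pooling of this kind compresses the time axis into a fixed number of tokens, which is cheap but discards the order of frames.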

Video Datasets. Deep learning for video analysis relies on diverse datasets tailored to specific tasks. A notable task is human action recognition, featuring action classification datasets such as UCF-101, HMDB51, and Kinetics, and action localization datasets like AVA and Fineaction. Tasks involving anomaly detection in surveillance videos are addressed by datasets like UCSD-anomaly and UCF-crime. Object identification and tracking in videos encompass multiple object tracking (MOT), video object segmentation (DAVIS), and video instance segmentation (Youtube-VIS). For multimodal tasks, video captioning datasets such as MSVD, MSRVTT, and Activitynet exist, along with their corresponding QA datasets. Scenario-specific datasets like MovieQA and TVQA also contribute to the diversity of available datasets. However, these datasets often focus on specific tasks and lack the complexity needed to measure the comprehensive abilities of Video-LLMs effectively.

Vision Language Evaluation Benchmarks. To evaluate the capabilities of LLMs, various benchmarks have been introduced, including AI2 Reasoning, HellaSwag, MMLU, and TruthfulQA. These benchmarks assess reasoning, scientific knowledge, fact retention, and the tendency to generate misinformation. In the realm of multimodal LLMs, corresponding benchmarks have also emerged. MMBench constructs a broad spectrum of evaluation for Vision-LLMs, and converts free-form predictions into pre-defined choices, enhancing the robustness of the evaluation process. SEED-Bench introduces a series of temporal understanding tasks and establishes an automatic filtering and manual verification pipeline to ensure the quality and relevance of the evaluations. LVLM-eHub presents an online arena platform for user-level evaluation, providing a more realistic assessment of model performance in real-world applications. ELEVATER focuses on evaluating the transferability of language-augmented visual models across multiple tasks. However, the aforementioned vision-language benchmarks are not tailored specifically for videos. Drawing inspiration from HELM, we introduce Video-Bench, specifically designed to measure human-like abilities of Video-LLMs across various capabilities and scenarios.

Video-Bench

In Fig. 2, we show the overall structure of Video-Bench and the corresponding average results for existing Video-LLMs.

1 Video-exclusive Understanding

As illustrated in Fig. 3 (A), we aim to measure the capacity of Video-LLMs to comprehend and summarize information from the video itself, encompassing objects, actions, attributes, and their temporal connections. These tasks are video-exclusive, requiring no external prior knowledge or complex logical inference.

Basic Understanding. This task primarily evaluates the basic video recognition ability, such as responding to queries related to human actions in Activitynet-QA, providing answers related to objects, attributes, and actions corresponding to videos in MSVD-QA and MSRVTT-QA, and comprehending GIFs in TGIF-QA.

Summarization. This task assesses the summarization ability of Video-LLMs when dealing with longer videos. Using the YouCook2 dataset, which offers rich annotations and extended video duration, we generate a series of QA pairs to evaluate whether the model can comprehend cooking information presented in the video and audio, and then provide accurate feedback about the correct procedure.

Abnormal Detection. This task evaluates the ability to review videos and identify anomalies. Leveraging the UCF-Crime dataset, a collection of surveillance videos annotated with the type and timestamp of anomalies, we construct questions to assess the temporal comprehension ability of Video-LLMs.

Crowd Counting. This task primarily evaluates the ability to localize and count dense objects. Utilizing the MOT dataset, which annotates all pedestrians, vehicles, and other targets in street or mall images, we test whether Video-LLMs can identify different pedestrians in different frames and provide the correct number of people.
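
To make the counting target concrete, the sketch below derives a ground-truth person count from tracking annotations by collecting distinct track ids. The column layout (frame, id, left, top, width, height, confidence, class, visibility), the class id 1 for pedestrians, and the sample rows are all assumptions for illustration; the construction of the actual Video-Bench questions is not specified at this level of detail.

```python
import csv
import io


def count_unique_pedestrians(gt_text, pedestrian_class=1):
    """Count distinct track ids of one object class in MOT-style annotations.

    Assumed columns per row:
    frame, id, left, top, width, height, confidence, class, visibility
    """
    track_ids = set()
    for row in csv.reader(io.StringIO(gt_text)):
        if not row:
            continue  # skip blank lines
        track_id = int(row[1])
        object_class = int(row[7])
        if object_class == pedestrian_class:
            track_ids.add(track_id)
    return len(track_ids)


# Hypothetical annotation rows: two pedestrians (ids 1 and 2) and one
# object of another class (id 3) across two frames.
sample_gt = """1,1,100,200,40,90,1,1,1.0
1,2,300,210,42,95,1,1,0.8
2,1,104,201,40,90,1,1,1.0
2,3,500,220,80,60,1,3,1.0
"""
num_people = count_unique_pedestrians(sample_gt)
```

Counting by track id rather than by box is what makes the task temporal: the same person appearing in several frames must be counted once.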

2 Prior Knowledge-based Question-Answering

ChatGPT and LLaMA exhibit strong capability in answering questions and giving suggestions across various domains due to the extensive prior knowledge acquired during pre-training. This prompts us to investigate whether Video-LLMs possess similar abilities. As depicted in Fig. 3 (B), our goal is to assess the capability of Video-LLMs in addressing questions that require prior knowledge, akin to human beings. Examples include identifying actors in a movie or discerning the music style of a particular song.

TV-QA. Television programs, as prevalent sources of entertainment videos, integrate multiple modalities, including video, audio, and subtitles, to convey information. Utilizing the TVQA dataset, we convert its image frames into videos and incorporate audio and subtitles. This dataset allows us to evaluate the ability of Video-LLMs to integrate prior knowledge and information from video, audio, and text to answer questions related to TV content.

MV-QA. Music videos, characterized by the synchronization of visual elements with music, pose a unique challenge due to their reliance on prior knowledge. Answering questions about these videos requires familiarity with the song, recognition of artists, and potentially basic music theory. In the absence of relevant existing datasets, we search for top music videos on YouTube and construct corresponding QA pairs based on authoritative wiki sources. This task assesses the ability of Video-LLMs to understand the song associated with the music video and provide answers regarding performers, background information, and relevant music theory knowledge.

NBA-QA. Understanding competitive sports videos also demands relevant prior knowledge. Viewers must possess knowledge of the corresponding rules and engage in long-term observation to identify competing teams, players, technical actions, scores, or fouls within the video. We select top NBA plays from YouTube and manually annotate teams, players, and technical actions in each game, transforming them into question-answer pairs. These videos and questions serve as input to the model, expecting it to respond based on relevant prior knowledge.
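
Turning a manual annotation into a question-answer pair can be sketched as follows. The function name, the placeholder team names, and the seeded shuffling of options are assumptions for illustration only and do not reproduce the authors' annotation pipeline.

```python
import random
from string import ascii_uppercase


def build_multiple_choice(question, correct, distractors, seed=0):
    """Build one multiple-choice item with shuffled, lettered options."""
    options = [correct] + list(distractors)
    rng = random.Random(seed)  # fixed seed keeps the item reproducible
    rng.shuffle(options)
    letters = ascii_uppercase[: len(options)]
    choices = dict(zip(letters, options))
    answer = next(letter for letter, text in choices.items() if text == correct)
    return {"question": question, "choices": choices, "answer": answer}


# Hypothetical annotation with placeholder team names.
qa = build_multiple_choice(
    question="Which team scores in this play?",
    correct="Team A",
    distractors=["Team B", "Team C", "Team D"],
)
```

Shuffling the options avoids a fixed position for the correct answer, which would otherwise let a model score well by always picking the same letter.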

3 Comprehension and Decision-making

Humans possess the innate ability to comprehend complex scenarios and make informed decisions and judgments. As shown in Fig. 3 (C), to assess a similar capability in Video-LLMs, we propose evaluations in the realms of 3D scene understanding and autonomous-driving related tasks.

3D Scene Comprehension. Indoor scene comprehension and navigation hold significant practical implications. The complexity arises from the necessity for extensive knowledge-intensive reasoning to understand different situations (scenes and locations). The SQA3D dataset is introduced to evaluate the 3D scene comprehension of Video-LLMs within the video modality. The models are tasked with understanding their environment and engaging in perception, reasoning, and action to accomplish the task.

Driver’s License Examination. Video-based questions in driver’s license examinations assess the ability of candidates to interpret simple animations depicting motor vehicle and driver status, requiring judgments of potential anomalies. In this task, we challenge Video-LLMs to comprehend scenarios and answer exam questions.

Driving Decision-Making. Making decisions for real-world driving scenarios is a more intricate task that demands a higher level of scene understanding and decision-making ability. For this task, we compile a diverse collection of YouTube driving videos depicting complex traffic situations and accidents. We conduct manual annotations for scene analysis and accident causes. Our expectation is that the model can effectively comprehend the origins of these complex traffic situations or accidents and make correct decisions to prevent their occurrence.

4 Automatic Evaluation Toolkit

LLMs are known for generating long-form text responses, often without adhering to a fixed format, making it challenging to quantify the correctness of their answers. To address this, we propose an automatic evaluation toolkit to systematically assess the performance of Video-LLMs. Our toolkit provides three metrics to map the output of Video-LLMs to pre-defined answer choices and subsequently calculate the final scores. The first one is Probability, a logits-based metric that acquires the probability of the next token following the prompt and treats the highest-probability option as the prediction.
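
A minimal sketch of such a logits-based selection is given below. It assumes the next-token logits are available as a plain mapping from token strings to scores, and it renormalizes over the option letters only; both choices are assumptions for illustration rather than the toolkit's actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict of token -> logit."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def pick_by_probability(next_token_logits, option_letters):
    """Keep only the option letters, renormalize, and take the argmax."""
    option_logits = {letter: next_token_logits[letter] for letter in option_letters}
    probs = softmax(option_logits)
    prediction = max(probs, key=probs.get)
    return prediction, probs


# Hypothetical logits for the token that follows the prompt.
logits = {"A": 1.2, "B": 3.4, "C": 0.3, "The": 5.0}
prediction, probs = pick_by_probability(logits, ["A", "B", "C"])
```

In this toy case the most likely next token overall is "The", not an option letter, which illustrates why a model that starts its reply with free text can be mapped poorly by a purely logits-based metric.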

The other two metrics are sentence-based, leveraging the natural language understanding capabilities of LLMs to obtain options. The T5-based metric calculates the textual similarity between generated sequences and options. The GPT-3.5-based metric transforms the sequences into a fixed format with the prompt ‘Please output your responses in the form of a dictionary "maximum probability":"xxx", where xxx is A or B or C or …’. All the above metrics are implemented automatically in our toolkit, so users can faithfully analyze the ability of Video-LLMs to comprehend video content and provide accurate responses to questions.
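
Once a response has been rewritten into the fixed dictionary format, the chosen letter still has to be extracted. The sketch below shows one tolerant way to do so with a regular expression; the function name, the accepted quote characters, and the set of valid letters are assumptions for illustration, not part of the released toolkit.

```python
import re


def parse_choice(response, valid_letters="ABCDEFG"):
    """Extract the option letter from a 'maximum probability' dictionary string.

    Returns the upper-case letter, or None if no valid option is found.
    """
    # Accept straight or curly quotes around the key and the value.
    pattern = r"maximum probability[\"'”“]?\s*:\s*[\"'”“]?\s*([A-Za-z])"
    match = re.search(pattern, response)
    if match:
        letter = match.group(1).upper()
        if letter in valid_letters:
            return letter
    return None


parsed = parse_choice('{"maximum probability":"B"}')
unparsed = parse_choice("The video shows a person cooking in a kitchen.")
```

Returning None for unparseable responses lets the caller decide whether to count them as wrong or to retry the formatting step.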

Experiment and Result

Implementation details. The detailed statistics of Video-Bench are listed in Fig. 4. To mitigate the impact of randomness, we apply an additional weight of 0.5 to tasks with fewer questions when computing the final average score. To ensure a fair comparison, we utilize the 7B LLM backend versions for all tested Video-LLMs during the inference process, thereby mitigating language ability discrepancies stemming from different model sizes. The GPT-based metric is employed in the reported results by default, and the API version is set to gpt-3.5-turbo-0613 in the automatic evaluation toolkit.
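
The down-weighting of small tasks can be sketched as a weighted mean. The text does not state the question-count threshold or whether the result is normalized by the sum of weights, so the threshold parameter, the normalization, and the numbers below are assumptions for illustration.

```python
def final_score(task_accuracy, question_counts, small_task_threshold,
                small_task_weight=0.5):
    """Weighted mean of per-task accuracies.

    Tasks with fewer than small_task_threshold questions get weight
    small_task_weight; all other tasks get weight 1.0.
    Normalizing by the sum of weights is an assumption.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for task, accuracy in task_accuracy.items():
        is_small = question_counts[task] < small_task_threshold
        weight = small_task_weight if is_small else 1.0
        weighted_sum += weight * accuracy
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical accuracies and question counts, not results from the paper.
accuracy = {"task_a": 0.6, "task_b": 0.4, "task_c": 0.2}
counts = {"task_a": 1000, "task_b": 1000, "task_c": 100}
score = final_score(accuracy, counts, small_task_threshold=500)
```

Here the small task contributes 0.5 x 0.2 to the numerator and 0.5 to the denominator, giving 1.1 / 2.5 = 0.44 instead of the unweighted mean of 0.4.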

Results on Video-exclusive Understanding. To evaluate the video-exclusive understanding ability, we validate Video-LLMs on the traditional basic QA tasks, summarization, abnormal detection and crowd counting tasks, as reported in Table 2 (A). We have three observations. (i) Most Video-LLMs perform well on the four traditional QA datasets due to the simplicity of their questions, especially Video-ChatGPT and Otter with massive video instruction data, and PandaGPT with a well-pretrained video encoder from ImageBind, which suggests that extending the video data scale could be effective. (ii) Existing Video-LLMs are not temporally sensitive. They cannot effectively summarize the order of each operation in YouCook2, and cannot respond effectively to the timestamp-related questions in UCF-Crime. (iii) These methods almost fail in the crowd counting task. These failures may come from weak abilities in precise localization and temporal association.

Results on Prior Knowledge-based QA. Compared to the enormous training data of LLMs, existing Video-LLMs are trained with limited instruction tuning data, as listed in Table 1, resulting in a poor ability to recognize objects and information in specific domains. As shown in Table 2 (B), we have two observations. (i) Existing methods lack visual prior knowledge, which means they struggle to establish an effective connection between the video and knowledge. For example, in the NBA-QA task, even though knowledge of the players and technical actions is stored in the LLM backend, the models cannot answer the questions when watching videos. Otter, which has the most instruction tuning data, achieves the best performance on these tasks, indicating that some prior knowledge is indeed contained in MIMIC-IT. (ii) Their poor performance on MV-QA indicates that they have limited audio understanding ability, since only some of the Video-LLMs possess audio modules. PandaGPT, with the audio module of ImageBind, shows results consistent with the champion Otter on MV-QA, suggesting that adding an audio encoder might alleviate this problem. In conclusion, existing Video-LLMs require abundant prior knowledge pre-training for general domains on different modalities.

Results on Comprehension and Decision-making. The performance of existing Video-LLMs on 3D scene understanding and driving decision-making tasks is shown in Table 2 (C). In these tasks, Video-ChatGPT continues to perform the best, thanks to its robust video instruction tuning. It is followed by Valley, which also possesses powerful multi-modal understanding ability from vast instruction-tuning videos. To enhance comprehension and decision-making abilities, we suggest that future Video-LLMs be trained with more prior knowledge and larger-scale data to cover more diverse domains. Besides, adopting Reinforcement Learning from Human Feedback (RLHF) and larger model capacity is also important for generalization and specific applications.

Results on Different Metrics. Our Video-Bench consists of a series of multiple-choice questions. Compared to open-ended questions, this test is relatively straightforward. However, due to the uncertainty and free form of LLM outputs, there is still room for designing more robust metrics. We evaluate the results of the best tested model, comparing the Probability, T5-based and GPT-based metrics, as shown in Fig. 5. It can be seen that the Probability result is low overall, because the Video-LLMs often do not give a clear choice in their output, so the probability-based mapping may not faithfully reflect correctness. Therefore, we recommend adopting GPT as the metric, especially for Video-LLMs with fewer LLM parameters and unstable outputs.

Visualization and Multi-Dimension Analysis

Visualization. Fig. 6 illustrates a set of typical responses from tested Video-LLMs. It can be observed that only Video-ChatGPT provides the correct response, while other models engage in discussions related to the video but fail to make the correct judgment after a lengthy discourse. This highlights that the models struggle with questions requiring even the most fundamental prior knowledge. This situation reflects the current state of Video-LLMs, which can generate responses related to videos but lack trustworthy reference value. Therefore, we can conclude that current Video-LLMs are limited to generating human-like text while lacking the desired intelligence.

Multi-dimension Analysis. In Fig. 7, a comparative analysis of Video-LLMs with different modules is presented. We can conclude that with the current data and training settings, Video-LLMs lack a tailored focus on the three-level ability of video comprehension. Moreover, the empirically proposed modules have not yielded significant improvements.

We also analyze the impact of different data sizes in the pre-training or instruction tuning process, as shown in Fig. 8. It can be observed that the pre-training data size may not necessarily play a decisive role, as the top-3 models, Video-ChatGPT, PandaGPT and Otter, have no extra pre-training process. We suppose that the video encoders have received adequate training in multimodal pre-training. In contrast, the influence of the instruction tuning data size is notably evident, showing two trends: (i) The models trained on videos demonstrate overall better performance compared to those trained on images. This substantiates that native video data facilitates enhanced comprehension of video information by Video-LLMs. (ii) Model performance is positively correlated with the amount of video instruction tuning data. Video-ChatGPT and Otter, trained on large-scale video instruction tuning datasets, are significantly better than other models.

Discussion and Conclusion

According to the above experimental results, we can conclude that the existing models are far from the truly intelligent Video-LLM that can fully understand the visual and audio content in videos, help people precisely summarize videos, explain details with prior knowledge, or help provide global perception and make decisions. We believe there are primarily three directions for improvement.

Vision Encoder with Temporal Awareness. Existing methods process videos as frame clips, potentially missing crucial temporal information. Ideal Video-LLMs should understand the temporal sequence, possibly by selectively choosing keyframes or sampling frames to traverse the content efficiently.
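
The simplest frame-sampling strategy mentioned above can be sketched as picking evenly spaced frame indices. The function name and the choice of segment midpoints are assumptions for illustration; a content-aware keyframe selector would replace the index rule with a score computed from the frames themselves.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Return num_samples frame indices spread evenly over a video.

    The video is split into num_samples equal segments and the middle
    frame of each segment is selected.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        return list(range(num_frames))  # short video: keep every frame
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


indices = uniform_frame_indices(num_frames=100, num_samples=4)
short_video = uniform_frame_indices(num_frames=3, num_samples=8)
```

For a 100-frame video and 4 samples this selects frames 12, 37, 62 and 87. Uniform sampling preserves temporal order but can skip short events, which is one reason content-aware selection may be preferable.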

Domain-Specific Prior Knowledge Pre-training. Lack of visual prior knowledge hinders accurate video comprehension. Incorporating domain-specific prior knowledge through pre-training can enhance domain expertise.

Long Video Understanding. One key differentiator of Video-LLMs compared to Image-LLMs should be the capability of processing long videos, which is largely neglected by existing research. Due to memory and computation constraints, how to efficiently compress past frames and design an effective memory mechanism is crucial.

Simultaneously, we also require more robust and effective evaluation metrics that can measure the long-text response of Video-LLMs. The utilization of GPT or similar LLMs in this process could be highly beneficial.

T5 evaluation

In our answer evaluation benchmark project, we explore two approaches: the GPT-based metric and the T5-based metric. The T5-based metric serves as an auxiliary tool in the evaluation process, offering advantages in terms of cost, deployment, and performance. It provides a cost-effective solution by eliminating the need for ChatGPT API usage and allows for offline deployment on personal servers. As shown in Table 3, T5-based results demonstrate performance comparable to GPT-based results in answer evaluation tasks, making it a valuable addition to our benchmark project for reliable and efficient assessment.
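
The idea of mapping a free-form response to the most similar option can be sketched as follows. To stay self-contained, the sketch uses a bag-of-words vector with cosine similarity as a stand-in for T5 sentence representations; the embed parameter marks where a real encoder would be plugged in, and all names here are assumptions for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy text representation: lower-cased word counts (stand-in for T5)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_option(response, options, embed=bag_of_words):
    """Return the option letter whose text is most similar to the response."""
    response_vec = embed(response)
    scores = {letter: cosine(response_vec, embed(text))
              for letter, text in options.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical question options and model response.
options = {"A": "cutting vegetables", "B": "boiling water", "C": "frying an egg"}
best, scores = match_option("The person is frying an egg in a pan.", options)
```

A learned sentence encoder would also match paraphrases that share no words with the option text, which the word-count stand-in cannot do.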

Visualization Samples

In this part, we provide more samples from all datasets included in Video-Bench, to illustrate the performance and behaviour of the tested Video-LLMs.

1 Video-exclusive Understanding

Activitynet-QA. The results on Activitynet-QA are shown in Fig. 9. As mentioned in Sec. 4, Video-LLMs perform well on these simple questions. Similar results are observed on the remaining three datasets of Basic QA.

MSVD-QA. The results on MSVD-QA are shown in Fig. 10. As part of Basic QA, the performance of Video-LLMs here is good overall.

MSRVTT-QA. The results on MSRVTT-QA are shown in Fig. 11. The results show a trend similar to the above.

TGIF-QA. The results on TGIF-QA are shown in Fig. 12. The results indicate that Video-LLMs can also understand simple GIFs.

YouCook2. The results on YouCook2 are shown in Fig. 13. The poor results show that existing Video-LLMs possess limited temporal awareness and struggle to summarize the sequence of action steps.

UCF-Crime. The results on UCF-Crime are shown in Fig. 14. The poor performance again illustrates that existing Video-LLMs lack temporal perception.

MOT. The results on MOT are shown in Fig. 15. Existing Video-LLMs are shown to lack the ability to count accurately.

2 Prior Knowledge-based Question-Answering

TV-QA. The results on TV-QA are shown in Fig. 16, which demonstrates that existing Video-LLMs can hardly understand TV segments. This could be caused by the lack of prior knowledge and of audio or subtitle understanding ability.

MV-QA. The results on MV-QA are shown in Fig. 17. The poor performance may also be caused by the lack of prior knowledge and audio understanding ability.

NBA-QA. The results on NBA-QA are shown in Fig. 18, which illustrates that without vision-language pre-training for specific domains, the Video-LLMs cannot connect the knowledge stored in the LLM with visual content and respond to the corresponding questions.

3 Comprehension and Decision-Making

Driver’s License Examination. The results on the Driver’s License Examination are shown in Fig. 19. The poor performance confirms that the tested Video-LLMs have limited scene understanding and decision-making ability.

Driving Decision-Making. The results on Driving Decision-Making are shown in Fig. 20, which demonstrates that the tested Video-LLMs have difficulty understanding real driving environments.

SQA3D. The results on SQA3D are shown in Fig. 21. The results show that the models can only understand simple environments and cannot understand complex spatial relationships.