MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

cs.CV cs.AI cs.CL

Introduction

Foundation models, such as Large Language Models (LLMs) [OpenAI, 2023c; Touvron et al., 2023a; Jiang et al., 2023; Anil et al., 2023] and Multimodal LLMs (MLLMs) [OpenAI, 2023b; Team et al., 2023; Lin et al., 2023a; Li et al., 2023c; Maaz et al., 2024; Chen et al., 2023], have demonstrated remarkable abilities in text and image domains, igniting debates about their potential pathways to Artificial General Intelligence (AGI). This raises a critical question: how well do these models understand the dynamics of the real world? Are they equipped with an inherent World Model [LeCun, 2022; Chen et al., 2024; Ha and Schmidhuber, 2018; Xiang et al., 2024] that can understand and reason about the underlying principles and causalities of the dynamic, multimodal world?

Videos, with their rich, dynamic portrayal of the real world, are ideally suited for evaluating the "world modeling" capabilities of MLLMs. Existing video understanding benchmarks [Li et al., 2023d; Ning et al., 2023b; Pătrăucean et al., 2023], however, fall short in two key respects for such evaluations. First, as discussed by LeCun [LeCun, 2022], the world model should be able to (1) estimate missing information about the state of the world not provided by perception, and (2) predict plausible future states of the world. Evaluation of such capabilities requires multi-faceted reasoning beyond the perception level, including explaining the video dynamics, counterfactual thinking about alternative consequences, and predicting future activities within videos. Second, the multi-discipline nature of the multimodal world necessitates a grasp of diverse fundamental principles, ranging from physics and chemistry to engineering and business. Hence, domain expertise across a variety of disciplines is imperative for a thorough evaluation of a model’s world understanding towards AGI [Morris et al., 2023; Yue et al., 2023].

Therefore, we introduce MMWorld, a multi-discipline multi-faceted multimodal video understanding benchmark to comprehensively evaluate MLLMs’ abilities in reasoning and interpreting real-world dynamics. (Note that MMWorld is not a sufficient testbed for world model evaluation, but we believe overcoming the unique challenges presented in MMWorld is essential and necessary towards comprehensive world modeling.) MMWorld encompasses a wide range of disciplines and presents multi-faceted reasoning challenges that demand a combination of visual, auditory, and temporal understanding. It consists of 1,910 videos that span seven common disciplines, including Art & Sports, Business, Science, Health & Medicine, Embodied Tasks, Tech & Engineering, and Games, and 69 subdisciplines (see Figure 1) such as Robotics, Chemistry, Trading, and Agriculture, thereby fulfilling the objective of breadth in discipline coverage. The dataset includes a total of 1,559 question-answer pairs and video captions annotated and reviewed by humans. Meanwhile, for multi-faceted reasoning, MMWorld mainly contains seven kinds of questions, focusing on explanation (explaining the phenomenon in videos), counterfactual thinking (answering what-if questions), future prediction (predicting future events), domain expertise (answering domain-specific inquiries), temporal understanding (reasoning about temporal information), and others. A video example with four such questions from the Health & Medicine discipline is depicted in Figure 1. MMWorld comprises two datasets: a human-annotated dataset for evaluating MLLMs on the whole video and a synthetic dataset designed to analyze MLLMs’ perception within single visual or audio modalities. We evaluate 12 MLLMs that can handle videos or image sequences on MMWorld, including both open-source (e.g., Video-LLaVA-7B [Lin et al., 2023a]) and proprietary models (GPT-4V [OpenAI, 2023b] and Gemini [Team et al., 2023]).

We summarize the contributions and key findings as follows:

We introduce MMWorld, a new benchmark designed to rigorously evaluate the capabilities of Multimodal Large Language Models (MLLMs) in world modeling through the realm of video understanding. MMWorld spans a broad spectrum of disciplines, featuring a rich array of question types for multi-faceted reasoning.

In addition to the human-annotated dataset, we develop an automatic data collection pipeline, streamlining video content selection and question-answer generation, and construct a well-controlled synthetic dataset to analyze MLLMs within single visual or audio modalities.

We observe that existing MLLMs still face substantial challenges posed by MMWorld. Even the best performer, GPT-4V, can only achieve a 52.30% overall accuracy, and four MLLMs particularly trained on videos perform worse than random chance.

Although there is still a clear gap between open-source and proprietary models, the best open-source model, Video-LLaVA-7B, outperforms GPT-4V and Gemini on Embodied Tasks by a large margin and performs similarly on Art & Sports, where spatiotemporal dynamics play a more crucial role in video understanding. This is further validated by its leading results on the Temporal Understanding question type.

In our study comparing MLLMs with average humans (non-experts), we notice some correlation between question difficulties as perceived by humans and MLLMs. However, MLLMs present different skill sets than humans in that they can answer a reasonable number of difficult questions that humans completely fail, but also struggle with easy questions that humans excel at. This indicates different perception, cognition, and reasoning abilities between MLLMs and humans.

Related Work

With recent breakthroughs [OpenAI, 2023a; Google, 2023; Touvron et al., 2023a; Chiang et al., 2023; Touvron et al., 2023b; Bai et al., 2023a] in Large Language Models (LLMs), several counterparts in the vision-and-language domain have been proposed [Dai et al., 2023; Liu et al., 2023b, a; Li et al., 2023a; Zhu et al., 2023; Zheng et al., 2023; Bai et al., 2023b], including the recently released GPT-4V [OpenAI, 2023b], followed by the Gemini Vision family [Team et al., 2023]. Many MLLMs have expanded their capabilities beyond handling only text and image inputs. VideoChat [Li et al., 2023c] leverages the QFormer [Li et al., 2023b] to map visual representations to the LLM [Chiang et al., 2023], and performs a multi-stage training pipeline. Otter [Li et al., 2023a] proposes to conduct instruction finetuning based on OpenFlamingo [Awadalla et al., 2023]. PandaGPT [Su et al., 2023] employs ImageBind [Han et al., 2023] as the backbone and finetunes it. mPLUG-Owl [Ye et al., 2023] introduces an abstractor module to perform visual and language alignment. VideoLLaMA [Zhang et al., 2023a] introduces a frame embedding layer and also leverages ImageBind to inject temporal and audio information into the LLM backend. Chat-UniVi [Jin et al., 2023] uses clustering to perform feature fusion. Observing their emerging abilities in multimodal video understanding, we propose MMWorld to evaluate these models’ skills in understanding the dynamics of the real world.

Benchmarking MLLMs

To evaluate MLLMs, there has been a flourishing of analyses [Liu et al., 2024a; Zhang et al., 2023b; Jiang et al., 2022; Lu et al., 2024; Fan et al., 2024; Cui et al., 2023; Guan et al., 2024; Yu et al., 2023; Fu et al., 2023a] and innovative benchmarks, such as VisIT-Bench [Bitton et al., 2023], which evaluates models' real-world instruction-following ability given image inputs, MMMU [Yue et al., 2023], designed to assess models on college-level image-question pairs that span different disciplines, and VIM [Lu et al., 2023], which challenges the model’s visual instruction following capability. However, these recent analyses and benchmarks only cover image inputs, which hinders the evaluation of MLLMs’ performance as world models. Recently, video benchmarks such as Perception Test [Pătrăucean et al., 2023] have been proposed to focus on perception and skills like memory and abstraction. However, Perception Test uses scenarios with a few objects manipulated by a person, which limits the variety of contexts. MVBench [Li et al., 2023d] centers on temporal understanding, while MMWorld not only includes temporal reasoning but also evaluates other multi-faceted reasoning abilities.

Video Understanding Benchmarks

Previous video benchmarks, as shown in Table 1, focus on video understanding tasks, including activity-focused question answering on web videos [Yu et al., 2019a], description-based question answering [Zeng et al., 2017], video completion [Fu et al., 2023b], and video infilling [Himakunthala et al., 2023]. Recently, Video-Bench [Ning et al., 2023b] introduces a benchmark by collecting videos and annotations from multiple existing datasets. LWM [Liu et al., 2024b] collects a large video and language dataset from public books and video datasets and trains a world model that is capable of processing sequences of more than a million tokens. However, modeling millions of tokens is extremely difficult due to high memory cost, computational complexity, and lack of suitable datasets. Mementos [Wang et al., 2024a] builds a benchmark for MLLM reasoning over input image sequences. STAR [Wu et al., 2021] builds a benchmark for situated reasoning in real-world videos. CLEVRER [Yi et al., 2020] builds a benchmark containing videos focusing on objects with simple visual appearance. Our contribution, in contrast, presents a new video understanding benchmark designed to evaluate models on several pivotal components crucial for a comprehensive world model. These components encompass interdisciplinary coverage, task diversity, and multifaceted reasoning capabilities (including future prediction, counterfactual thinking, and more), underpinned by original human annotations and integrated domain knowledge.

The MMWorld Benchmark

The MMWorld benchmark is built on three key design principles: multi-discipline coverage, multi-faceted reasoning, and temporal reasoning. It spans various disciplines that require domain expertise and incorporates diverse reasoning skills such as explanation, counterfactual thinking, and future prediction. The benchmark consists of two parts: a human-annotated dataset and a synthetic dataset. The human-annotated dataset serves as the main test bed to evaluate MLLMs from multiple perspectives. The synthetic dataset contains two subsets, which evaluate MLLMs’ perception behavior on visual signals and audio inputs, respectively.

We collect videos from YouTube with Creative Commons licenses in seven disciplines: Art & Sports (18.5%), Business (12.0%), Science (20.4%), Health & Medicine (12.0%), Embodied Tasks (12.0%), Tech & Engineering (12.9%), and Game (12.2%). For Art & Sports, 29 videos are collected from the SportsQA dataset [Li et al., 2024]. For Embodied Tasks, 24 videos are sourced from the IKEA Assembly [Ben-Shabat et al., 2021], RT-1 [Brohan et al., 2022], and Ego4D [Grauman et al., 2022] datasets to increase video diversity.

Our manual benchmark collection takes two stages. In the first stage, we conduct a detailed examination of each of the seven primary disciplines to identify a comprehensive range of subdisciplines for inclusion in our benchmark. Our selection of videos is driven by three key principles:

The first principle, multi-discipline coverage, emphasizes the requirement for domain knowledge—selecting videos that inherently demand an understanding of specialized content across various disciplines.

The second principle, multi-faceted annotation, involves collecting videos that enable the creation of question-answer pairs from multiple perspectives to evaluate world model properties comprehensively.

The third principle, temporal information, prioritizes the inclusion of videos that provide meaningful content over time, as understanding temporal information is crucial for grasping world dynamics. This allows models to engage in temporal reasoning. Therefore, answering questions in our dataset requires implicit temporal reasoning, e.g., the model needs to understand temporal information to explain “why does the robot need to do the step shown in the video”. We also design a “temporal understanding” question type to explicitly test models’ ability to reason about temporal information (examples can be found in Section F in the Appendix).

During the second stage, our team embarks on the task of question annotation. We craft questions that primarily test seven aspects of multimodal video understanding from the perspective of multi-faceted reasoning: 1) Explanation: Questions ask the model to elucidate the underlying logic or purpose within the video; 2) Counterfactual Thinking: Tests the model’s ability to hypothesize and consider alternative outcomes; 3) Future Prediction: Aims to predict future events based on the current scenario, challenging the model’s foresight; 4) Domain Expertise: Evaluates the model’s depth of knowledge in specific fields, such as how to assemble a coffee table; 5) Temporal Understanding: Assesses the model’s capability to reason about temporal sequences and dynamics; 6) Attribution Understanding: These questions focus on identifying cause-and-effect relationships within the video, including tasks like counting; 7) Procedure Understanding: Tests the model’s ability to comprehend and explain procedural tasks shown in the video. The detailed distribution and examples are shown in Figure 2.

Automated Data Collection

Understanding real-world dynamics requires models to process both audio and visual modalities. To evaluate MLLMs’ perception abilities in these modalities, we designed an automated data collection pipeline. This pipeline collects targeted videos and generates QA pairs based on either audio or visual information, ensuring the model’s capabilities are assessed independently for each modality. By using information from a single modality to generate QA pairs, our pipeline ensures that the synthetic data remains unbiased regarding input modality.

The synthetic data generation pipeline is illustrated in Figure 3. We employ a systematic approach to gather videos with Creative Commons licenses from YouTube and the extensive YouTube-8M dataset [Abu-El-Haija et al., 2016]. This method ensures a diverse and comprehensive collection of video data, which is important for the robust evaluation of multimodal video understanding models.

We start with the Video Query Generator, using the same seven disciplines as the manually collected dataset. For each discipline, a set of subdisciplines is defined to encapsulate a wide spectrum of topics, ensuring a diverse and comprehensive dataset. Once the queries are generated, the Video Mapping and Filtering step is initiated. We perform mapping of videos to YouTube-8M and online videos, constrained by a strict time limit of two minutes per query, keeping only the most pertinent videos that satisfy the predefined criteria. Simultaneously, this step works in conjunction with the video transcripts to extract key terms and concepts. This iterative process refines the search parameters and enhances the semantic richness of the dataset by identifying and encoding the salient themes present in the videos. The Video Summarization module utilizes query-focused video summarization techniques based on Katna (https://github.com/keplerlab/katna) and UniVTG [Lin et al., 2023b]. This module selects ten representative frames from each video, distilling the essence of the content while preserving the narrative context. This summarization facilitates efficient storage and quicker processing times, which are crucial for large-scale analysis.
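As a rough illustration of reducing a video to ten frames, the sketch below picks frame indices at evenly spaced positions. This is a deliberately simple stand-in and not the query-focused summarization performed with Katna and UniVTG; the function name `sample_frame_indices` and the segment-midpoint rule are assumptions made only for this example.

```python
def sample_frame_indices(num_frames: int, k: int = 10) -> list[int]:
    """Return up to k frame indices spread evenly across a video.

    Uniform sampling is an assumed stand-in for the keyframe selection
    described in the text; it ignores both the query and the frame content.
    """
    if num_frames <= 0 or k <= 0:
        return []
    if num_frames <= k:
        # Short clip: keep every frame.
        return list(range(num_frames))
    # Split the video into k equal segments and take each segment's midpoint.
    segment = num_frames / k
    return [int(segment * i + segment / 2) for i in range(k)]
```

A content-aware selector would replace the midpoint rule with a per-frame relevance score, but the interface (frame count in, ten indices out) could stay the same.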

QA Generation

The final stage in our pipeline is the QA / Caption Generation module, where we leverage the capabilities of GPT-4V to generate accurate and contextually relevant questions and answers, as well as captions, based on the video frames and transcripts. This step not only provides rich annotations for each video but also equips the dataset with a multimodal dimension that supports various downstream tasks such as video QA, captioning, and more.

Quality of the Synthetic Dataset

Human evaluators were engaged to ascertain the reasonableness of automatically generated questions and answers, ensuring that the synthetic dataset maintains a high standard of quality and relevance. The findings from this human evaluation phase are detailed in Section D of the Appendix, offering insights into the dataset’s efficacy and the realism of its constructed queries and responses.

Finally, the statistics of automated curated data, which is used for the ablation study, are shown in Table 2. The taxonomy of our dataset is shown in Figure 1. We note that only a portion of the subdisciplines are shown due to space concerns. Please refer to the Appendix for full information.

Experiments

In our study, we compare the performance of MLLMs on the MMWorld benchmark, including GPT-4V [OpenAI, 2023b], Gemini Pro [Team et al., 2023], Video-Chat [Li et al., 2023c], Video-LLaMA [Zhang et al., 2023a], ChatUnivi [Jin et al., 2023], mPLUG-Owl [Ye et al., 2023], Otter [Li et al., 2023a], ImageBind-LLM [Han et al., 2023], PandaGPT [Su et al., 2023], LWM [Liu et al., 2024b], and X-Instruct-BLIP [Panagopoulou et al., 2023]. For both Gemini Pro and GPT-4V, we adhere to the default settings provided by their official APIs. They both take ten image frames extracted from the video content as the input. Gemini Pro is set to process visual input and configured with safety settings to filter a range of harmful content. The configuration thresholds are set to ‘BLOCK_NONE’. For PandaGPT, we set ‘top_p’ to 0.7 and ‘temperature’ to 0.5. For VideoChat, we set ‘max_frames’ to 100. For X-Instruct-BLIP, the model is implemented using four image frames. We use GPT-4-32K as the judge to decide whether the model answer is correct when it cannot be mapped to an option letter using the rule-based method. For all other models, we use the default settings. All inferences are run on an NVIDIA A6000 workstation. The detailed implementation is given in the Appendix.

Evaluation

Our dataset includes multiple-choice questions and captions corresponding to each video, enabling tasks such as video question answering and video captioning. We focus on video question answering by evaluating a model’s performance based on its accuracy in selecting the correct answer from the provided options. One challenge lies in reliably parsing the model’s response to map it to one of the predefined choices. To address this, we employ two mapping strategies. The first method employs automated scripts to parse the models’ predictions and compare the parsed results with the ground truth, similar to the approach used in [Yue et al., 2023]. The second method involves models freely generating answers, which are then evaluated by GPT-4. Given the question, correct answer, and model’s prediction, GPT-4 returns a True or False judgment. This approach is based on recent works in model evaluation [Maaz et al., 2024; Hsu et al., 2023; Hackl et al., 2023; Liu et al., 2023c]. We validated this method with human evaluators, showing an error rate of 4.76% across 189 examples, confirming the effectiveness of GPT-4 as an evaluator. Detailed results for human evaluation and for these two different strategies are provided in Appendix B. In the main paper, all results are evaluated using the second approach.
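The first, rule-based strategy can be sketched as follows. The regular expressions, the function names, and the assumption of four options labeled a to d (taken from the judge prompt quoted in Appendix C) are illustrative choices and not the authors' released script. A response that none of the rules can map returns `None` and would be left to a judge model.

```python
import re
from typing import Optional

OPTION_LETTERS = "abcd"  # assumed: four options per question
CHANCE_ACCURACY = 1 / len(OPTION_LETTERS)  # uniform guessing baseline

# Hypothetical parsing rules, tried in order.
_PATTERNS = [
    # The whole response is a bare letter: "b", "(b)", "B."
    re.compile(r"^\s*\(?([a-d])\)?\s*[.:]?\s*$", re.IGNORECASE),
    # The response starts with a labeled option: "(b) ...", "b. ...", "b: ..."
    re.compile(r"^\s*\(?([a-d])[).:]", re.IGNORECASE),
    # An explicit statement: "the answer is b.", "Answer: (c)", "option d"
    re.compile(
        r"\b(?:answer|option)(?:\s+is)?\s*:?\s*\(?([a-d])\)?\s*(?:[.,;:]|$)",
        re.IGNORECASE,
    ),
]

def parse_option(response: str) -> Optional[str]:
    """Map a free-form response to an option letter, or None if no rule fires."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1).lower()
    return None

def accuracy(predictions: list[Optional[str]], answers: list[str]) -> float:
    """Fraction of questions whose parsed letter equals the ground truth."""
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

The rules are intentionally conservative: a response such as "The answer is a dog" is not mapped to option a, since treating the article as a letter would silently inflate or deflate accuracy. Under the four-option assumption, `CHANCE_ACCURACY` gives the 25% reference point for "worse than random" comparisons.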

Main Evaluation Results

We show in Table 3 the main evaluation results of different MLLMs. Among these, GPT-4V emerges as the top performer, closely followed by Gemini Pro. Video-LLaVA also demonstrates strong results, primarily due to the extensive training data which consists of 558K LAION-CCSBU image-text pairs and 702K video-text pairs from WebVid [Bain et al., 2021]. For instruction tuning, datasets were gathered from two sources: a 665K image-text instruction dataset from LLaVA v1.5 and a 100K video-text instruction dataset from Video-ChatGPT [Maaz et al., 2024]. This superior performance may also be attributed to Video-LLaVA’s adoption of CLIP ViT-L/14 trained in LanguageBind [Lin et al., 2023a] as its vision model and the inclusion of a large volume of image-video-text pairings within the training data. On the other hand, models like Otter and LWM perform poorly across most disciplines, possibly due to their weaker backbone and architecture used. Otter uses the LLaMA-7B language encoder and a CLIP ViT-L/14 vision encoder, both of which are frozen, with only the Perceiver resampler module fine-tuned, which may contribute to its lower performance. Additionally, some MLLMs perform even worse than random, highlighting the challenging nature of MMWorld.

Study on Multi-faceted Reasoning on MMWorld

Figure 4 illustrates the multi-faceted reasoning performance for each MLLM. GPT-4V emerges as the strongest model across Future Prediction, Domain Expertise, and Attribution Understanding. Closed-source models like GPT-4V and Gemini Pro perform similarly on counterfactual thinking and outperform all others. However, for temporal understanding, Video-LLaVA performs the best. This may be due to its extensive training on large amounts of video-language data, which enhances its spatio-temporal reasoning abilities. This can be also observed in its high scores on the Art & Sports and Embodied Tasks, which involve dense spatio-temporal information, as shown in Table 3. Video-LLaVA’s performance is comparable to GPT-4V and Gemini on explanation tasks, likely because of its two-stage training process and exposure to a large amount of instruction-tuning data in the second stage, which includes similar instructions.

Study on MLLM Performance at Different Difficulty Levels for Average Humans

Figure 5(a) indicates some correlation between the difficulty levels as perceived by humans and the performance of MLLMs. MLLMs generally follow a trend where accuracy decreases as the difficulty level increases, which aligns with human performance patterns. However, the correlation is not perfect, suggesting that while models and humans share some common ground in understanding question difficulty, there are also notable differences in their capabilities. The data reveals that MLLMs exhibit different skill sets compared to humans. As highlighted in Figure 5(b), models like GPT-4V can correctly answer expert-level questions that humans often get wrong, particularly in disciplines such as Business and Health & Medicine, where humans often struggle, yet they sometimes falter on easier questions, likely due to a lack of contextual understanding. Notably, discrepancies in disciplines like Art & Sports and Tech & Engineering highlight areas where MLLMs’ performance does not align with human results, suggesting different perception, cognition, and reasoning abilities in handling abstract concepts. These differences suggest that MLLMs can complement human capabilities, offering potential for enhanced task performance by combining the data-driven insights of models with human intuition and contextual knowledge.

Study on Modality of Perception

We conduct ablations on the synthetic dataset of MMWorld to evaluate MLLMs’ ability to perceive the world. With our synthetic dataset, we consider scenarios where only one modality (either audio or visual) is available. Table 4 shows the results, which evaluate the model’s ability to interpret spoken language, background noises, and other audio elements without the aid of visual context, as well as the model’s perception ability when operating without any audio input. For the visual perception test, Gemini Pro performed the best, demonstrating its strong ability to process visual information. Interestingly, Video-Chat exhibited better audio perception than ChatUnivi, despite its poorer visual perception. This may be attributed to its use of the Whisper [Radford et al., 2022] speech recognition model. It also explains why, in Table 3, Video-Chat outperforms ChatUnivi in the Art & Sports discipline, which requires a greater understanding of music, voice, and background audio. However, in other disciplines such as Science and Health & Medicine, Video-Chat’s performance is significantly poorer.

Error Analysis

To gain deeper insights into the limitations of MLLMs, we prompted the models to explain the reasoning behind their choices, particularly when errors occurred. Through this analysis, we identified common error patterns and summarized them into seven distinct categories. We conducted a simple test where the same questions that triggered errors in GPT-4V were also posed to other MLLMs. The frequencies of each type of error are presented in Figure 6, as annotated by human evaluators. Detailed qualitative examples of these errors and further analysis are provided in the Appendix.

Conclusion

Our MMWorld Benchmark represents a significant step forward in the quest for advanced multi-modal language models capable of understanding complex video content. By presenting a diverse array of videos across seven disciplines, accompanied by questions that challenge models to demonstrate explanation, counterfactual thinking, future prediction, and domain expertise, we have created a rigorous testing ground for the next generation of AI. While using LLMs for data generation can introduce hallucination issues, these challenges are manageable and are commonly addressed [Wang et al., 2024b; Shen et al., 2023]. Another potential risk is the misuse of MLLMs for surveillance or privacy invasion. The ability of models to understand video content and perform reasoning could be exploited to monitor individuals without their consent, leading to serious ethical and legal concerns regarding privacy.

References

Appendix A Overview of the Appendix

We host the project website at https://mmworld-bench.github.io/. The benchmark and code implementations can be found at https://github.com/eric-ai-lab/MMWorld. The Croissant metadata record documenting the dataset/benchmark is available for viewing and downloading at https://github.com/eric-ai-lab/MMWorld/blob/main/data/croissanta_hf_data.json. This Appendix is organized as follows:

Section B contains additional experimental results;

Section C contains the implementation details;

Section D contains the settings and results from human evaluations;

Section F contains the data examples from MMWorld;

Section G contains additional data statistics of MMWorld;

Section H contains the datasheet of MMWorld;

Section I contains the author statement, licence, and maintenance plan.

Appendix B Additional Results

In Table 5, we show detailed results using three different seeds for each evaluated model.
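One common way to summarize repeated runs is the mean and sample standard deviation across seeds. The sketch below assumes that convention, which the text does not state, and uses placeholder accuracies rather than values from Table 5; `summarize_runs` is a hypothetical helper name.

```python
from statistics import mean, stdev

def summarize_runs(accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(accuracies), stdev(accuracies)

# Placeholder accuracies (percent) for three seeds; NOT values from Table 5.
runs = [50.0, 51.0, 52.0]
avg, spread = summarize_runs(runs)
print(f"{avg:.2f} +/- {spread:.2f}")
```

With only three seeds the spread estimate is coarse, so it is best read as a sanity check on run-to-run stability rather than a confidence interval.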

B.2 Results from Amazon Turkers

Table 6 presents the evaluation results from three sets of Amazon Turkers across various disciplines. The results indicate that there is slight variability in performance across different human evaluators.

B.3 Results for the Two Different Evaluation Strategies

In Table 8, we give additional evaluation results for the different MLLMs evaluated in this paper. For closed-source models, the evaluation pipeline is the one used in the main paper, which uses GPT-4V as a judge. The process consists of presenting GPT-4V with the question, the corresponding answer generated by the baseline model, and the set of possible options. GPT-4V then assesses whether the model-generated answer is accurate within the given context. The other strategy is open-ended generation, for which we employ a two-step methodology. We first prompt each model to do open-ended generation. Subsequently, we prompt the model to align its generative response with one of the predefined options: ‘a’, ‘b’, ‘c’, or ‘d’.

B.4 Detailed Results on Multi-faceted Reasoning

In Table 7, we give detailed performance numbers of different MLLMs on multi-faceted reasoning corresponding to Figure 4 in the main paper.

Appendix C Implementation Details

We use the optimum number of video frames and report the performance in the main paper. The numbers of sampled frames are 10 for GPT-4V/o and Gemini Pro, 8 for Video-LLaVA, and 32 for ChatUniVi. For the closed-source models, Gemini Pro and GPT-4V, we use the default settings provided by their official APIs. We use Katna (https://github.com/keplerlab/katna) to extract key video frames as input to these two models. Gemini Pro is set to process visual input and configured with safety settings to filter a range of harmful content. The configuration thresholds are set to ‘BLOCK_NONE’. For PandaGPT, we set ‘top_p’ to 0.7 and ‘temperature’ to 0.5. For VideoChat, we set ‘max_frames’ to 100. For LWM, we use the LWM-Chat-1M variant. For X-Instruct-BLIP, the model is implemented using four image frames. For Otter, we use the video variant. We use GPT-4-32K as the judge to decide whether the model answer is correct when it cannot be mapped to an option letter using the rule-based method. The prompt provided to GPT-4-32K is structured as follows: "I will present a response from a question-answering model alongside several answer options. Your task is to evaluate the response and determine which of the following options it most closely aligns with, denoting the most similar option by its corresponding letter (a, b, c, or d)."
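The judge fallback can be sketched as below. The instruction string is quoted from the paragraph above; everything else is an assumption for illustration, including the layout of the response and options after the instruction, the function names, and `ask_judge`, a hypothetical callable that wraps whichever API is used to query the judge model.

```python
from typing import Callable, Optional

# Instruction text quoted from the implementation details above.
JUDGE_INSTRUCTION = (
    "I will present a response from a question-answering model alongside "
    "several answer options. Your task is to evaluate the response and "
    "determine which of the following options it most closely aligns with, "
    "denoting the most similar option by its corresponding letter "
    "(a, b, c, or d)."
)

def build_judge_prompt(response: str, options: dict[str, str]) -> str:
    """Assemble the judge input; the layout after the instruction is assumed."""
    option_lines = "\n".join(f"({k}) {v}" for k, v in sorted(options.items()))
    return (
        f"{JUDGE_INSTRUCTION}\n\n"
        f"Response: {response}\n\n"
        f"Options:\n{option_lines}"
    )

def judge_option(
    response: str,
    options: dict[str, str],
    ask_judge: Callable[[str], str],  # hypothetical wrapper around the judge API
) -> Optional[str]:
    """Ask the judge for a letter; return None if its reply is not a valid option."""
    reply = ask_judge(build_judge_prompt(response, options))
    letter = reply.strip().lower().strip("(). ")
    return letter if letter in options else None
```

Keeping the model call behind a plain callable makes the mapping logic testable with a stub, and returning `None` on an unexpected reply avoids counting a malformed judgment as either correct or incorrect.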

For the discipline of Science, queries are generated for subdisciplines such as Geography, Chemistry, Wildlife Restoration, Mycology, Nature, Physics, Weather, Zoology, Math, Botany, Biology, and Geology. In the Tech & Engineering discipline, our queries span Electronics, Animal Behavior, Mechanical Engineering, Energy & Power, Architecture, Agriculture, Nature, Physics, Robotics, Woodworking, and Gardening. The Art & Sports discipline encompasses a broad range of cultural and physical activities, including Music, Drawing and Painting, Football, Volleyball, Aerobic Gymnastics, Basketball, Instrument, Baking, Dance, Woodworking, Graffiti, Anatomy, and additional Music-related topics. Embodied Tasks are represented through queries for Assembly, Ego-motion, and Single Object Manipulation, focusing on the interaction between agents and their physical environment. The Health & Medicine discipline is segmented into Pharmacy, Public Health, Clinical Medicine, and Basic Medical Science, reflecting the multifaceted nature of healthcare and medical studies. The Business discipline is stratified into fundamental areas such as accounting, finance, management, marketing, and economics, each representing key facets of the commercial and economic world. Lastly, the Game discipline consists of Role Playing Game, First Person Shooting Game, Racing Game, Adventure Game, Real-Time Strategy Game, Tower Defense Game, and Fighting Game.

Each generated query retrieves relevant video content, which is then filtered and processed to align with the specific needs of our research objectives. Videos that meet our criteria in terms of content, length, and quality are downloaded and incorporated into our dataset, forming the basis for subsequent analysis and model training.

Appendix D Human Evaluation

We hired workers on Amazon Mechanical Turk to do human evaluation on the data, with the results shown in Table 6. Workers were required to have completed more than 1000 Human Intelligence Tasks (HITs) and have a HIT approval rate greater than 95% to qualify for our tasks. We show in Figure 7 the human evaluation interface on the generated data. Each worker was compensated $0.20 for completing an assignment. This amount was determined based on the estimated time and effort required to complete each task. We set the number of unique workers per task to 3 to collect diverse perspectives while avoiding redundancy. Workers were given 1 hour to complete each assignment. This time frame was chosen to enable thoughtful responses from workers.

We also recruited students from campus to perform human evaluation on a subset of the data. The results are shown in Table 10. The performance of the human evaluators did not surpass that of GPT-4V and Gemini-Pro. This outcome underscores the challenging nature of the dataset, which often necessitates specialized domain knowledge that our evaluators (primarily non-experts) found demanding. These results highlight the complexity of the questions and the potential necessity for discipline-specific understanding to achieve high accuracy.

D.2 Quality of Using GPT as the Judger

To assess GPT-4V’s accuracy as the judger, we devised a human evaluation protocol, also run on Amazon Mechanical Turk, as visualized in Figure 8. The evaluators are presented with a series of statements derived from the video, and GPT-4V is tasked with selecting the most accurate answer from a set of multiple-choice options. Through this interface, human evaluators can efficiently gauge GPT-4V’s performance as the judger across different types of questions.

The results of this human evaluation are shown in Table 9. Across 189 examples, only 9 are incorrect, an error rate of 4.76%, which validates the effectiveness of using GPT-4V as the judger.
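The reported error rate follows directly from the two counts. The snippet below is only an arithmetic check; the function name `error_rate` is ours.

```python
def error_rate(num_incorrect, num_total):
    """Fraction of judged examples that human evaluators marked incorrect."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_incorrect / num_total


rate = error_rate(9, 189)
agreement = 1.0 - rate

print(f"error rate: {rate:.2%}")      # error rate: 4.76%
print(f"agreement: {agreement:.2%}")  # agreement: 95.24%
```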

Appendix E Error Analysis

In this section, we analyze the errors made by the evaluated MLLMs. We summarize the error types as follows:

Question Understanding Error (QUE): Models misinterpret the question’s intent, such as misunderstanding how a pendulum’s period would change if a condition in the scenario is altered.

Audio Understanding Error (AUE): Models fail to interpret audio cues correctly, shown by their failure to recognize blue and red lines on a stock chart.

Visual Perception Error (VPE): There is a misinterpretation of visual content, leading to incorrect assumptions about the visual data presented in the video.

Hallucinations (HE): Models generate content or details that are not present in the actual data, essentially ‘hallucinating’ information.

Reasoning Error (RE): Models demonstrate a lack of logical reasoning, leading to incorrect conclusions based on the given data.

Lack of Domain Knowledge (LDK): Models show an inability to answer questions that require specific domain expertise, indicating a gap in their knowledge.

Reject to Answer (RA): Models decline to choose among the provided options. An example of this error was observed when the model was asked to select an answer regarding the outcome of an experiment involving liquid nitrogen. Instead of choosing an option, the model provided an unrelated response concerning a light bulb. This indicates either a misunderstanding or a cautious approach, since the question could be interpreted as pertaining to a sensitive topic and thus trigger content filters focused on safety and compliance policies.

Figures 15, 16, 17, and 18 show error cases of Question Understanding Error, Audio Understanding Error, Visual Perception Error, Hallucinations, Reasoning Error, Lack of Domain Knowledge, and Reject to Answer from MLLMs evaluated on MMWorld.
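When tallying such errors, the seven categories can be encoded once and counted per model. The sketch below assumes each failed example carries exactly one abbreviation as its label. The `labels` list is made-up illustrative data, not a result from our analysis.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    """The seven error categories, keyed by the abbreviations used in the text."""
    QUE = "Question Understanding Error"
    AUE = "Audio Understanding Error"
    VPE = "Visual Perception Error"
    HE = "Hallucinations"
    RE = "Reasoning Error"
    LDK = "Lack of Domain Knowledge"
    RA = "Reject to Answer"


def error_distribution(labels):
    """Map each observed error type to its share of all labeled errors."""
    counts = Counter(ErrorType[label] for label in labels)
    total = sum(counts.values())
    return {e: counts[e] / total for e in ErrorType if counts[e]}


# Made-up labels for four failed examples, for illustration only.
labels = ["RE", "LDK", "RE", "VPE"]
dist = error_distribution(labels)
```

Looking labels up with `ErrorType[label]` raises a `KeyError` on an unknown abbreviation, which catches annotation typos early.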

Appendix F Data Examples

Figures 9, 10, 11, 12, 13, and 14 show additional examples from MMWorld.

Appendix G Additional Data Statistics

For the human-annotated dataset, the length of each video was capped at approximately two minutes. The statistical distribution of the disciplines within this part of the dataset is as follows:

Sports & Arts: This subset consists of 77 videos, showcasing a vibrant collection that covers a wide range of topics from athletic endeavors to various forms of artistic expression.

Science: A subset of 75 videos, which delves into the empirical world of scientific inquiry, spanning a multitude of specializations from fundamental physics to advanced biological studies.

Tech & Engineering: Encompassing 54 videos, this segment captures the cutting-edge advancements and foundational concepts that drive innovation and infrastructure in the modern world.

Embodied Tasks: With 50 videos, the dataset provides a focused insight into the dynamic field of Embodied Tasks, highlighting the intersection of AI, mechanics, and automation.

Health & Medicine: This essential discipline is well-represented with 50 videos, offering perspectives on medical breakthroughs, healthcare practices, and life sciences.

Business: This discipline includes 50 videos, reflecting on the multifaceted nature of commerce, from economics to management sciences.

Game: This discipline includes 51 videos, reflecting various aspects of gaming.

Altogether, the MMWorld Benchmark’s diversity is visually encapsulated in Figure 19, which delineates the distribution of videos across 61 subdisciplines. The horizontal bar chart provides a quantified representation of the dataset’s range, reflecting the careful curation process that has gone into ensuring breadth across various knowledge areas.
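As a quick check on the per-discipline numbers, the seven counts listed above can be summed and converted to shares. The snippet uses only those counts. The variable names are ours.

```python
# Video counts per discipline, copied from the list above.
VIDEOS_PER_DISCIPLINE = {
    "Sports & Arts": 77,
    "Science": 75,
    "Tech & Engineering": 54,
    "Embodied Tasks": 50,
    "Health & Medicine": 50,
    "Business": 50,
    "Game": 51,
}

total_videos = sum(VIDEOS_PER_DISCIPLINE.values())
shares = {name: n / total_videos for name, n in VIDEOS_PER_DISCIPLINE.items()}

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<20} {VIDEOS_PER_DISCIPLINE[name]:>3}  {share:6.1%}")
print(f"{'Total':<20} {total_videos:>3}")
```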

The world we live in is rich with both audio and visual information, and effective world modeling requires an understanding of how these modalities interact and convey meaning. To achieve this, we annotated additional attributes such as "Requires Audio," "Requires Video," and "Question Only." These annotations help determine whether correctly answering a question necessitates audio information, visual cues from the video, or can be addressed based solely on the question itself. By doing so, we ensure that our benchmark tests the full spectrum of multimodal comprehension, reflecting the complex, sensory-rich environment in which real-world understanding takes place. The statistics of these annotations are shown in Figure 20.
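One way to represent these attributes is as boolean flags per question, which can then be aggregated. This is a sketch under stated assumptions: the class `QuestionAnnotation` and its field names are hypothetical and need not match the released JSON schema, the flags are not assumed to be mutually exclusive, and the three records are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionAnnotation:
    """Hypothetical per-question modality flags; not the released schema."""
    question_id: str
    requires_audio: bool
    requires_video: bool
    question_only: bool


def modality_counts(annotations):
    """Count how many questions carry each modality flag."""
    return {
        "Requires Audio": sum(a.requires_audio for a in annotations),
        "Requires Video": sum(a.requires_video for a in annotations),
        "Question Only": sum(a.question_only for a in annotations),
    }


# Made-up records, for illustration only.
annotations = [
    QuestionAnnotation("q1", requires_audio=True, requires_video=True, question_only=False),
    QuestionAnnotation("q2", requires_audio=False, requires_video=True, question_only=False),
    QuestionAnnotation("q3", requires_audio=False, requires_video=False, question_only=True),
]
counts = modality_counts(annotations)
```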

Appendix H Datasheets

For what purpose was the dataset created?

To introduce a multi-discipline multi-faceted multimodal video understanding benchmark to comprehensively evaluate MLLMs’ abilities in reasoning and interpreting real-world dynamics.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

The dataset is created by authors from UCSC, UCSB, and Microsoft.

H.2 Composition

What do the instances that comprise the dataset represent? (e.g., documents, photos, people, countries)

Videos along with captions and question/answer pairs.

How many instances are there in total (of each type, if appropriate)?

6,627 instances. The data distribution over different types can be found in Figure 2 of the main paper.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

Is there a label or target associated with each instance?

Is any information missing from individual instances?

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

Are there recommended data splits (e.g., training, development/validation, testing)?

MMWorld is intended for evaluation purposes only.

Are there any errors, sources of noise, or redundancies in the dataset?

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

Does the dataset contain data that might be considered confidential?

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

H.3 Collection Process

The data collection process is described in Section 3 of the main paper.

H.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

We extract video frames from the collected videos in the automatically generated dataset.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?

Is the software that was used to preprocess/clean/label the data available?

Yes. The source code can be found in https://github.com/eric-ai-lab/MMWorld.

H.5 Uses

Has the dataset been used for any tasks already?

Yes. We have used the dataset to evaluate video question answering.

Is there a repository that links to any or all papers or systems that use the dataset?

Yes. See the GitHub repository: https://github.com/eric-ai-lab/MMWorld.

What (other) tasks could the dataset be used for?

Video captioning and evaluating faithfulness of evaluation metrics.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

Are there tasks for which the dataset should not be used?

The videos in this dataset are from different sources and are unique. The dataset should not be used for tasks such as video editing.

H.6 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

Yes. The benchmark is publicly available.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)?

We host it on the webpage, GitHub, and Huggingface.

It is available and open to the public now.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

H.7 Maintenance

Who will be supporting/hosting/maintaining the dataset?

The authors will be supporting/hosting/maintaining the dataset.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Is there an erratum?

No. We will publish one if any erratum becomes necessary.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

Yes. We will make announcements on GitHub if there is any update.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?

Will older versions of the dataset continue to be supported/hosted/maintained?

Yes. Old versions can still be accessed from Huggingface.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

Yes. Contributors can post issues or submit pull requests on GitHub. We will review and verify contributions, and update the dataset if the contribution is useful.

Appendix I Author Statement, Hosting, Licensing, and Maintenance Plan

We bear all responsibility in case of violation of rights, and we confirm the data license.

Hosting

MMWorld is hosted on https://mmworld-bench.github.io/. The dataset is provided in the JSON file format. The metadata can be found at https://huggingface.co/datasets/Xuehai/MMWorld.

License

MMWorld is licensed under the CC-BY 4.0 license.

Maintenance Plan

We will keep maintaining and updating the dataset and benchmark, including the leaderboard.