MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang

Introduction

The breakthroughs in large language models (LLMs) bring generalist AI models that can solve a wide range of complicated natural language tasks, many approaching the human-expert-level performance . Large multimodal models (LMMs) aim to achieve even stronger general intelligence via extending LLMs with multimodal inputs. Since more than 80% of our human being’s perception, learning, cognition, and activities are mediated through vision , it is natural to start the exploration by equipping LLMs with “eyes.” One main thread of LMM works, represented by Frozen , Flamingo , PaLM-E , GPT-4 , extend LLMs with the visual understanding capability via end-to-end tuning. There also exists the exploration on the modular combination of LLMs and image-to-text vision-language models. Recently, thanks to the open-source of powerful LLMs like LLaMA , more open-sourced LMMs are built, including OpenFlamingo , LLaVA , MiniGPT-4 , Otter , InstructBLIP , and many more . These studies showcase the intriguing ability to solve various complicated multimodal tasks, such as open-world recognition, multimodal knowledge and commonsense, scene text understanding, and so on.

Despite the promising qualitative results on LMM’s capabilities, it remains unclear how to systematically evaluate those showcased complicated multimodal tasks and what are the relationships among evaluated tasks, which is the first step in developing a quantitative evaluation benchmark. As shown in Figure 1, existing vision-language benchmarks focus on simple Vision-Language (VL) tasks that require specific one or two capabilities, such as recognition, language generation, or OCR, but fall short in benchmarking more complicated tasks. Alternatively, we examine the arbitrary integration of core VL capabilities for complicated tasks, with the insight that the intriguing ability to solve complicated multimodal tasks can be achieved by a generalist model mastering and integrating different core capabilities. Following this insight, we propose a new benchmark for evaluating LMMs, namely MM-Vet. MM-Vet defines six core VL capabilities, including recognition, OCR, knowledge, language generation, spatial awareness, and math, which integrate to solve various complicated multimodal tasks. MM-Vet contains 16 tasks for quantitative evaluation. For example, in Figure 1(d), answering the question “What will the girl on the right write on the board?” in MM-Vet requires recognizing the genders of the three kids, locating queried girl spatially, recognizing the scene text written by the girl, and finally calculating the result.

Other than the evaluation category definition, the evaluation metrics are another challenge in benchmark development, given the diverse answer styles and question types. Specifically: (1) The desired outputs in different multimodal tasks have diverse formats, e.g., Figure 1(d)’s math problem can be answered by a single word, while outputs for the essay writing question are hundred-words long; (2) The core aspect to evaluate in different tasks varies, e.g., text generation focuses more on the text quality, recognition can be considered correct with the key concept recognized. Most integrated tasks would require comprehensive evaluations from multiple dimensions. Inspired by recent NLP studies that use LLMs for model evaluation, we propose an LLM-based evaluator as the evaluation metric for open-ended model outputs. As shown in Table 1, we prompt GPT-4 with few-shot evaluation prompts to obtain an evaluation score ranging from to $1$ . Instead of manually defining the possible answer styles and question types, we include different sample types as few-shot examples and let LLMs infer the scoring criteria automatically. Such metric design eases the future extension to more question types, such as box localization .

MM-Vet’s evaluation category and metric designs allow users to obtain capability insights for different LMMs. Such model analyses are more informative than a single overall ranking, which highly depends on the dataset sample composition and might be biased. We evaluate two sets of multimodal systems, i.e., the end-to-end tuned LMMs including OpenFlamingo , LLaVA , MiniGPT-4 , Otter , InstructBLIP , etc, and the LLM-tool-using systems such as MM-ReAct . Despite not knowing model details, we also evaluate industry solutions such as Bard . We first discuss the capability analyses of these two system paradigms and the representative models. We then dive deeper into the open-sourced LMMs and examine how the training data, vision encoder, and LLM selection influence the performance on different capabilities.

Our contributions are summarized as follows.

We propose MM-Vet to evaluate LMMs’ ability on complicated multimodal tasks. MM-Vet defines 16 emergent tasks of interest, integrated from the six defined core VL capabilities.

We propose an LLM-based evaluator for open-ended outputs of LMMs, which unifies the evaluation across different answer styles and question types. The evaluation metrics ensure the thorough evaluation of both the factual correctness and text quality of the responses.

We benchmark representative LMMs on MM-Vet, revealing the relative strengths and weaknesses of different system paradigms and models, as summarized in Section 4.5.

Related work

Multimodal models. Vision-language models approach multimodal intelligence of jointly understanding and generating vision and language signals. Inspired by the impressive quality and genericity in recent large language models (LLMs) , researchers explore large multimodal models (LMMs) that seamlessly integrate different vision-language capabilities to solve complicated multimodal tasks. In approaching such multimodal generalist systems, one direction is to extend LLMs with the multi-sensory ability, such as pioneer works Frozen , Flamingo , PaLM-E , GPT-4 . Recent open-sourced LLMs also facilitate various research studies including OpenFlamingo , LLaVA , MiniGPT-4 , Otter , InstructBLIP , and so on . On the other hand, multimodal agents explore chaining different vision tools with LLMs to achieve integrated vision-language capabilities.

VL benchmarks. Classic VL benchmarks focus on specific capabilities of interest, such as visual recognition , image description , as well as other benchmarks for specialized capabilities such as scene text understanding , commonsense reasoning , outside knowledge . The recent development of generalist LMMs posts a strong need for modernized VL benchmarks, which contain complicated multimodal tasks that require integrated VL capabilities.

Our MM-Vet is most related to the concurrent evaluation studies such as MME and MMBench, which design comprehensive evaluation samples to facilitate the LMM evaluation. One major difference is that MM-Vet defines and studies the integrated VL capabilities, allowing the evaluation to provide insights beyond the overall model ranking.

LLM-based evaluation. MM-Vet adopts the open-ended LLM-based evaluator, allowing the evaluation across answer styles and question types without requiring binary or multiple answer choices. The technique of prompting LLMs for model evaluation is related to the explorations in NLP . We show that the technique extends well to multimodal tasks, and presents a unified prompt to evaluate samples with different answer styles and question types.

MM-Vet

Our aim is to develop a multimodal benchmark that requires comprehensive capabilities, corresponding to realistic scenarios an AI agent might encounter. Consider, for instance, this scenario: Awakening from slumber, you reach out for your smartphone (recognition capability) to check the current time (OCR capability). Today, your plan is to visit a new grocery that you have not been to. Guided by the information that the grocery is situated directly opposite the stadium and next to the cinema (spatial awareness), you manage to locate it successfully. Keeping in mind your doctor’s advice to shed some weight, you consciously steer clear of high-calorie food and choose milk, vegetables, and fruits instead (knowledge capability). In the dairy aisle, you’re faced with a choice between two types of pure milk. The first is 4 dollars for one liter with 20% discount, while the second is 7 dollars for 1.5 liter with 25% discount. After some quick arithmetic, you find the former is cheaper (math capability) and and opt for the one-liter package. After shopping, you walk past the cinema and find a person pointing to the poster to introduce a new movie (language generation).

From the scenarios of interest, we summarize the following six core VL capabilities for evaluation, with corresponding MM-Vet examples shown in Tables 10-15.

Recognition (Rec). Recognition refers to the general visual recognition capability, including recognizing scenes, objects, object attributes (color, material, shape, etc), counting, and various other high-level visual recognition tasks in computer vision.

Knowledge (Know). The knowledge category covers various knowledge-related capabilities, including social and visual commonsense knowledge, encyclopedic knowledge, and time-sensitive knowledge like news. This capability necessitates that the model not only possesses such knowledge, but also effectively utilizes it to solve complicated tasks as required.

OCR. Optical character recognition (OCR) refers to the scene text understanding and reasoning capability. The models are tested to read the scene text in images, and reason over the texts to solve various tasks.

Spatial awareness (Spat). Spatial awareness embodies a diverse spectrum of capabilities related to understanding space, including the comprehension of the spatial relationship among object and scene text regions.

Language generation (Gen). Language generation is a vital ability that empowers models to articulate their responses in a clear, engaging, and informative manner. We use questions that demand more extended answers for language generation capacity evaluation.

Math. Math evaluates the model’s arithmetic capability in solving either written equations or problems in the wild.

In real-world scenarios, various complicated multimodal tasks would require the integrations of different core VL capabilities. For instance, explaining visual jokes as shown in Table 10(a) requires recognition, knowledge of humor, and language generation; reading documents and solving math problems as shown in Table 11(a) takes OCR, spatial awareness and math; and answering exam questions given images as shown in Table 14(b) needs OCR, knowledge, spatial awareness. To solve these complicated tasks, LMMs are expected to seamlessly integrate different VL capabilities. Therefore, it is crucial to establish a benchmark that evaluates the performance of these integrated abilities within LMMs.

To build the benchmark, we have gathered 187 images from various online sources and ask 205 questions, each of which requires one or more capabilities to answer. As shown in Tables 10-15, these questions are varied in type and entail open-ended responses of differing lengths. The ground truths for 155 questions are human-annotated, while the remainder of the answers for 50 questions were gathered from the Internet. In addition to the 187 images, ten extra images with high-quality questions are collected from VCR , with the questions and answers modified to an open-ended answering format. Another three images are from ChestX-ray14 to obtain corresponding medical expert knowledge. In total, our MM-Vet contains 200 images, and 218 questions (samples), all paired with their respective ground truths. For each question, we have also identified the capacities required to answer them and displayed this information statistically in Figure 2.

2 LLM-based evaluator for open-ended model outputs

Questions and expected responses in MM-Vet are designed to be open-ended to cover the diverse real-world scenarios. This naturally poses a great challenge in terms of model evaluation and metric design. Drawing inspiration from recent NLP studies that utilize LLMs for open-ended evaluations, we leverage GPT-4 to assist evaluation. As shown in Table 1, we craft a few-shot prompt for model evaluation. The few-shot design allows us to define the scoring metrics via in-context examples and supports easy extension onto new problem sets. Specifically, our implemented prompt incorporates five in-context examples with open-ended short answers and two examples with long answers. We cover examples that are fully correct (i.e., 1.0) or incorrect (i.e., 0.0), as well as examples used to define different types of “partially correct” responses. The LLM-based evaluator allows any style of model outputs to be evaluated with a unified consistent metric. Furthermore, it also supports easy adaptation to diverse question types and answer styles by simply modifying the evaluation examples.

By inputting the prompt, GPT-4 automatically generates scores for each sample, conditioned on each sample’s input question, ground truth, and model output. The score for each sample ranges from 0 to 1. The total scores are computed by

where $s_{i}$ is the score of sample $i$ , and $N$ is the sample number. The score regarding each capability or capability integration can be similarly obtained by

where $C$ is the set of samples requiring a specific capability or capability integration, and $N_{c}$ is the sample number of the set.

Evaluation results

We utilize MM-Vet to evaluate two types of LMMs, i.e., (1) end-to-end tuned LMMs (OpenFlamingo , BLIP-2 , LLaVA , MiniGPT-4 , Otter and InstructBLIP ); (2) LLM-tool-using methods (MM-ReAct and Transformers Agent ). The summary of these methods is shown in Table 2. As shown in Table 1, for each sample, we fill the prompt template with its question, ground truth, and output from a specific LMM. By taking the filled prompt into GPT-4, GPT-4 will generate a score from 0 to 1 for the sample. It is found that outputs of GPT-4 still exist variance, although the temperature is set as 0. Therefore, we utilize GPT-4 to evaluate the outputs of LLMs by 5 times. Due to the space limit, we report average scores for capabilities/capability integrations, and average as well as variance for total score.

2 Result analyses

The main results of different methods are shown in Table 3 regarding each capability, and Table 4 for each capability integration.

Recognition. The “Recognition” category contains the questions requiring recognition capability to answer. Examples are shown in Tables 10(a, b), 11(b), 12(a, b), 13(a, b), 14(a, c), and 15(b). The “Rec” column in Table 3 compares the performance on the “Recognition”. Among the evaluated models, LLaVA-13B (LLaMA-2) is the best one, obtaining 39.2%. There may be two reasons. First, LLaVA-13B (LLaMA-2) adopts ViT-L/14 from CLIP as a vision model, which is trained by a large amount of data, 400 million image-text pairs; 2) Second, it is surprising that stronger language model can largely boost the recognition performance. LLaVA-13B (LLaMA-2) obtains 8.3% important over LLaVA-13B (Vicuna-13B). Stronger LLMs may help understand questions better and identify key information from visual inputs.

LLaMA-Adapter v2-7B is another strong model in recognition, achieving 38.5%. This outstanding ability may be obtained from its various and large amounts of tuning data, LAION-400M , COYO-700M , Multimodal C4 and Tuning data of LLaVA etc as shown in Table 2. Besides, InstructBLIP-8B attains 32.4%. As shown in Table 2, the tuning data of InstructBLIP includes 26 publicly available datasets, which contain recognition heavily datasets, like VQA v2 and GQA . The promising capability of InstructBLIP in recognition may benefit from these datasets.

OCR. OCR assesses models’ capabilities in recognizing scene texts in images and performing various types of reasoning including math, spatial, recognition, etc. Examples are shown in Tables 10(c), 11(a, c, d), 12(b), 13(a, b), 14(a, b), 15(a, b). As shown in Table 2’s “OCR” column, MMReAct-GPT4 performs the best (65.7%) in OCR capability with the assistance of an external OCR model as a tool. Among end-to-end tuned models, LLaVA-13B (LLaMA-2) achieves the highest performance (22.7%). This superior performance may be attributed to LLaVA’s adoption of CLIP ViT-L/14 as its vision model, and the inclusion of a large volume of image-OCR pairings within the training data .

Knowledge. As depicted in Tables 10(a), 12(a, b) and 14(b, c), the “knowledge” category covers a wide range of knowledge-related questions, ranging from joke understanding to encyclopedia knowledge. LLaVA-Adapter v2-7B is the best model in this capability with a score of 31.4%, as shown in Table 3. It may be beneficial from its large-scale tuning data including GPT-4-LLM . MMReAct-GPT-4 also achieves a remarkable score (29.0%) in this capability, because of its strong LLM backbone , coupled with external tools like Bing search for knowledge acquisition.

Language generation. “Language generation” denotes the proficiency to produce fluent and informative text outputs, as illustrated in Table 10(a), 12(b), 13(a), and 15(a). The performance within this category is highly correlated with the efficacy of language modeling. As a result, MMReAct-GPT4 and LLaVA-13B (LlaMA-2) stand out as the top two models. Their success can be attributed to the GPT-4 and LlaMA-2 language models on which these systems are built.

Spatial awareness. “Spatial awareness” involves the understanding of the spatial relationship among visual object regions (e.g., Table 10(c)) and scene text regions (e.g., Table 13(a, b)). MMReAct-GPT4 has a significant lead in this capability (56.8%), because the adopted tools, such as dense captioning and OCR, provide detailed object and scene text location information in the form of coordinates, which can be understood and processed by GPT-4.

When it comes to end-to-end tuned models, LLaVA-13B (V1.3, 336px) exhibits the best performance of 31.3%. The tuning data for LLaVA is partly derived from capturing object names and their corresponding coordinates as input. This procedure ensures the generation of data imbued with spatial information, potentially aiding the models in developing and enhancing their spatial awareness capabilities.

Math. “Math” measures the arithmetic capability on either written equations (e.g., Table 15(b)) or problems in the wild (e.g., Table 11(d)). Notably, MMReAct-GPT4 consistently outperforms other models. This superior performance may be attributed to the adopted PAL math tool (Program-aided Language Models) .

2.2 Regarding each capability integration

Recognition, knowledge, and language generation.. As shown in Table 10(a), this capability integration can enable models to explain visual jokes. LLaMA-Adapter-v2-7B is the best model in this capability integration. This may be attributed to its large scale of tuning data as shown in Table 2. LLaVA-13B (LLaMA-2) and LLaVA-13B (V1.3, 336px) are the other two outstanding models. Stronger language models may be the reason. The tuning data of LLaVA shown in Table 2 can also not be ignored.

Recognition (sole). This category contains samples only requiring recognition, as shown in Table 10(b). InstructBLIP-14B and InstructBLIP-8B achieve the best performance, which may result from the tuning data including recognition datasets, like VQA v2 and GQA .

OCR and spatial awareness. For this integration, an example is shown in Table 10(c). MM-ReAct-GPT-4 is the best method for this integration. Notably, compared with MM-ReAct-GPT-3.5, MM-ReAct-GPT-4 has a significant improvement, over 40%, indicating the importance of LLMs to integrate information of OCR and location.

OCR, spatial awareness, and math. An example of this integration is shown in Table 11(a), which requires reading the floor plan and conducting arithmetic. Compared with the above integration, this combination involves one more capability of math. The observation is similar to the integration of OCR and spatial awareness. MM-ReAct-GPT-4 still achieves the best performance.

Recognition and spatial awareness. Table 11(b) shows an example for this integration. LLaVA-13B (V1.3, 336px) performs best for this category. Compared with LLaVA-13B (LLaMA-2), LLaVA-13B (V1.3, 336px) obtains an improvement of 8.4%, indicating the significant contribution of larger resolution of images.

OCR (sole). This task requires OCR only, as shown in Table 11(c). MM-ReAct-GPT-4 has the best results for sole OCR due to an OCR tool from Azure API. Notable, MM-ReAct-GPT-4 is much better than MM-ReAct-GPT-3.5 with an improvement of 23.0%, demonstrating the importance of language models in OCR.

OCR and Math. This integration enables reading text from real-world scenarios and solving math problems, as shown in Table 11(d). MM-ReAct-GPT-4 obtains the best performance in this capability integration, far ahead of other models. We highly recommend using MM-ReAct-GPT-4 to complete tasks related to this capability integration.

Other capability integrations. 9 other capability integrations are in long-tailed distribution, where MMReAct-GPT-4 achieves the best scores in 5 integrations out of 9. Their examples are shown in Tables 12-15.

3 Result discussion

In this subsection, we discuss the modules in LMMs and speculate how each component may affect the LMMs’ capabilities in different aspects, evaluated by MM-Vet. We mainly consider the models based on open-sourced LLMs, i.e., Flan-T5 , LLaMA , Vicuna , and LLaMA-2 .

Vision. For the Vision component, two models have been employed in the end-to-end LMMs we evaluated, namely, CLIP-ViT/L14 (428M) and EVA-ViT-G (1.13B). Determining a superior model is currently not possible due to the absence of a comprehensive ablation study . However, it’s noteworthy that, when paired with the same language model, Vicuna-7B, InstructBLIP-8B excels in recognition tasks, while LLaVA-7B works particularly well for OCR.

Language. There is a notable trend indicating that superior language models (LLMs) typically yield better performance, such as comparing the 7B and 13B variants of different models, except for the outlier of InstructBLIP where the 8B version performs better than the 14B one.

Tuning data. Increasing the volume of data can enhance performance. An example is InstructBLIP-8B , which utilizes more data from 26 publicly available datasets to tune the model and achieve higher scores than BLIP-2-12B.

3.2 Comparison with Bard

Bard is one popular closed-source commercial LMM system. One problem in evaluation is that Bard rejects images containing people and instead outputs “Sorry, I can’t help with images of people yet.” To conduct a fair comparison with other models, we constructed a subset of MM-Vet with 168 samples that Bard could process, henceforth referred to as the Bard set. The results on the Bard set are shown in Tables 5 and 6.

Bard achieves the highest scores in three out of six capabilities, seven out of fifteen capability integrations, and holds the highest overall score (53.5%). MM-ReAct-GPT-4 outperforms in the remaining three out of six capabilities, and tops the chart in nine out of the fifteen capability integrations. Particularly, MM-ReAct performs better in OCR, spatial awareness, and math capabilities, indicating the potential benefit of having specialized external tools, even when working with state-of-the-art LMMs.

When considering end-to-end models, there is still a big gap from Bard. For instance, Vicuna-13B (V1.3, 336px) obtains 31.5%, a substantial 22.0% lower than Bard. Future stronger open-sourced LLMs and advancements in multimodal training hold potential to further narrow this gap.

3.3 Comparison with GPT-4V(ision)

We evaluate and benchmark the state-of-the-art LMM, GPT-4V(ison) on MM-Vet. In our queries to GPT-4V, we prepend the prompt with “Generate a short and concise response to the following image text pair.” The quantitative results are shown in Tables 7, 8, and the qualitative results are expressed in Figures 3-6. Remarkably, GPT-4V achieves a score of 67.7%, surpassing both open-sourced LMMs and LLM-based multimodal agents by substantial margins.

We aspire that the detailed per-category performance breakdown sheds light on potential avenues for enhancing model capabilities, thereby bridging the existing performance gap. To illustrate, integrating specialized tools within agent systems proves advantageous for specific functionalities like OCR and math. While other categories, such as recognition and language generation, would require enhancements in the core vision and language modules, respectively. Figures 3-6 offer an exhaustive analysis, highlighting exemplary success and failure instances of GPT-4V’s performance.

This MM-Vet analysis is intended as a source of inspiration for future research, specifically in the realms of advanced multimodal prompting techniques and model refinements to further improve the LMM performance.

4 Effectiveness analysis of LLM-based evaluation

To verify the effectiveness of LLM-based evaluation for LMM predictions, we select the outputs from MMReAct-GPT-4 on 138 objective questions, which can be objectively annotated by humans. We compute the absolute value of the difference between the evaluator’s output score and the human-annotated score on each sample. By default, we use GPT-4 (0613) as the evaluator. Here we also replace it with other LLMs, e.g. LLaMA-2, GPT-3.5. The average difference to the human scoring is reported in Table 9, represented as $\overline{\Delta}$ .

The maximum potential discrepancy is 1.0. The baseline evaluation method, keyword matching, results in a high difference of 0.273. This illustrates the unsuitability of keyword matching for MM-Vet when dealing with open-ended answers. It is surprising that $\overline{\Delta}$ of LLaMA-2-7B is even higher than that of keyword matching, while $\overline{\Delta}$ LLaMA-2-13B only marginally less than keyword matching. This suggests that assessing open-ended outputs from models is far from straightforward. For OpenAI’s models, GPT-3.5 (turbo-0613) obtains 0.178 of $\overline{\Delta}$ , and GPT-4 (0613) achieves the lowest difference of 0.042. In this paper, we utilize GPT-4 (0613) to evaluate the outputs of LMMs.

5 Takeaway notes

We summarize the above analyses and discussions as follows:

In the evaluation of integrated capabilities on MM-Vet (Sections 4.2, 4.3.2, 4.3.3), GPT-4V and Bard outperform existing open-sourced methods. The tool-using approach, MM-ReAct-GPT-4 , achieves comparable performance to Bard with effective external tools. The pros and cons in different categories motivate future studies on tool-enhanced LMMs. Among end-to-end LMMs, LLaVA-13B (LLaMA-2)/LLaVA-13B (V1.3, 336px) demonstrates the best performance on MM-Vet.

Analysis of open-source LMMs (Section 4.3.1) leaves room for ambiguity regarding the superior vision encoders for LMMs, based on current model comparisons. However, it is evident that stronger LLMs can boost the performance of LMMs.

For open-ended evaluation (Section 4.4), it is effective to use GPT-4 for evaluating the open-ended outputs of LMMs. The use of less powerful LLMs could result in more significant deviations from the gold standard of human evaluation results.

Current top-performing methods, such as GPT-4V and MM-ReAct-GPT-4 , only achieve scores of around 68%/45% on MM-Vet (where full score is 100%). The gap signifies that further effort is necessary to enhance the performance of LMMs in terms of integrated capabilities, e.g., by developing stronger LMMs or extending LMMs with external tools.

Conclusion

In this paper, we introduce the MM-Vet benchmark to evaluate LMMs in terms of their integrated vision-language capabilities. We have assembled a new multimodal dataset, which requires the integration of multiple vision-language capabilities. To facilitate open-ended evaluation, we adopt an LLM-based evaluator to grade open-ended outputs from LMMs. We then evaluate various LMMs on MM-Vet, analyzing their results to provide insights into different LMM system paradigms and module selections. We observe that the current best LMMs GPT-4V achieve around 68% score on MM-Vet (full score 100%), indicating the need for efforts to further improve the integrated capabilities of LMMs.