MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, Ran He
Introduction
The rapid progress of Large Language Models (LLMs) has opened a new path for the multimodal field, i.e., the Multimodal Large Language Model (MLLM). An MLLM uses an LLM as a brain to process multimodal information and produce reasoning results. Equipped with a powerful LLM, an MLLM is expected to address more complex multimodal tasks. The three representative abilities of LLMs, namely instruction following, In-Context Learning (ICL), and Chain-of-Thought (CoT), are also manifested in multimodality. For example, Flamingo enables multimodal ICL, which can adapt to new tasks given a few examples. PaLM-E achieves impressive OCR-free math reasoning via CoT. GPT-4V shows even stronger ability in a variety of complex reasoning tasks. MiniGPT-4 implements GPT-4-like instruction following capabilities, such as converting images into corresponding website code, by introducing multimodal instruction tuning. These emergent abilities of MLLMs are exciting and suggest that a new dawn has broken in artificial intelligence.
Although these models exhibit surprising conversational capabilities in everyday chats, we still know little about how well they perform quantitatively in various aspects. The three common quantitative evaluation approaches for MLLMs each have limitations that make a comprehensive evaluation difficult. The first approach evaluates on existing traditional multimodal datasets, such as image captioning and VQA. However, on the one hand, these datasets may not reflect the emergent abilities of MLLMs. On the other hand, since the training sets of large models are no longer unified, it is difficult to guarantee that no MLLM has used the testing set for training. The second approach collects new data for an open-ended evaluation, but either the data is not yet publicly available or the amount is small (only 50 images). The third approach focuses on a single aspect of MLLMs, such as object hallucination or adversarial robustness, and therefore cannot provide a comprehensive evaluation.
In light of these concerns, a new comprehensive evaluation benchmark is urgently needed to keep pace with the rapid development of MLLMs. We argue that a universal comprehensive evaluation benchmark should have the following four characteristics: (1) It should cover as much as possible, including both perception and cognition abilities. The former refers to recognizing a specific object, such as its existence, count, position, and color. The latter refers to combining the perceived information with the knowledge in the LLM to deduce more complex answers. The former is clearly the premise of the latter. (2) Its data or annotations should, as far as possible, not come from existing publicly available datasets, avoiding the risk of data leakage. (3) Its instructions should be as concise as possible and in line with human cognition. Although instruction design may have a large impact on the output, all models should be tested under the same unified instructions for fair comparison. A good MLLM should be able to generalize to such concise instructions. (4) The responses of MLLMs to the instructions should be intuitive and convenient for quantitative analysis. The open-ended answers of MLLMs pose significant challenges to quantification. Existing methods tend to use GPT or manual scoring, but these may suffer from inaccuracy and subjectivity.
To this end, we collect a comprehensive MLLM Evaluation benchmark, named as MME, which meets the above four characteristics at the same time:
MME covers the examination of perception and cognition abilities. Apart from OCR, the perception includes the recognition of coarse-grained and fine-grained objects. The former identifies the existence, count, position, and color of objects. The latter recognizes movie posters, celebrities, scenes, landmarks, and artworks. The cognition includes commonsense reasoning, numerical calculation, text translation, and code reasoning. The total number of subtasks is up to 14, as shown in Fig. 1.
All instruction-answer pairs are manually constructed. For the few public datasets involved in our study, we only use images without directly relying on their original annotations. Meanwhile, we make efforts to collect data through real photographs and image generation.
The instructions of MME are designed concisely to avoid the impact of prompt engineering on the model output. We argue that a good MLLM should be able to generalize to such simple and frequently used instructions, which are fair to all models. Please see Fig. 1 for the specific instruction of each subtask.
Benefiting from our instruction design “please answer yes or no”, we can easily perform quantitative statistics based on the “yes” or “no” output of MLLMs, which is accurate and objective. It should be noted that we have also tried to design instructions with multiple-choice questions, but find that following such complex instructions may be beyond the capabilities of current MLLMs.
We conduct extensive experiments to evaluate the zero-shot performance of 30 advanced MLLMs on the 14 subtasks. The evaluated MLLMs include BLIP-2, InstructBLIP, MiniGPT-4, PandaGPT, Multimodal-GPT, VisualGLM-6B, ImageBind-LLM, VPGTrans, LaVIN, mPLUG-Owl, Octopus, Muffin, Otter, LRV-Instruction, Cheetor, LLaMA-Adapter-v2, GIT2, BLIVA, Lynx, MMICL, GPT-4V, Skywork-MM, mPLUG-Owl2, Qwen-VL-Chat, XComposer-VL, LLaVA, Lion, SPHINX, InfMLLM, and WeMM. As displayed in Fig. 2, which consists of 2 overall leaderboards (perception and cognition) and 14 individual leaderboards, these MLLMs show clear discrepancies on our MME evaluation benchmark. Fig. 3 provides a comparison from another perspective, showing the range that current MLLMs can reach in each capability dimension. More importantly, we summarize four prominent problems exposed in the experiments: an inability to follow basic instructions, a lack of basic perception, a lack of basic reasoning, and object hallucination, as shown in Fig. 4. We expect these findings to be instructive for subsequent model optimization.
In summary, the contributions of this work are as follows: (1) We propose a new benchmark MME to meet the urgent need of MLLM evaluation. (2) A total of 30 up-to-date MLLMs are evaluated on our MME. (3) We summarize the problems exposed in the experiments, providing guidance for the evolution of MLLMs.
MME Evaluation Suite
In order to facilitate quantitative performance statistics, our instruction design aims to let the model answer “yes” or “no”. As a result, the instruction consists of two parts: a concise question and the description “Please answer yes or no.” For each test image, we manually design two instructions that differ only in their questions. The ground truth answer of the first question is “yes” and that of the second question is “no”, as shown in Fig. 1. When an MLLM answers both questions correctly, we can be more confident that it actually comprehends the image and the corresponding knowledge behind it, rather than just guessing.
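The paired-instruction design described above can be illustrated with a minimal Python sketch. The `Instruction` class, the `build_pair` helper, the image file name, and the two example questions are all hypothetical names introduced here for illustration; only the suffix “Please answer yes or no.” and the yes/no pairing come from the text.

```python
from dataclasses import dataclass
from typing import Tuple

# Suffix taken from the text; everything else below is a hypothetical sketch.
SUFFIX = "Please answer yes or no."


@dataclass(frozen=True)
class Instruction:
    image_id: str
    prompt: str
    ground_truth: str  # "yes" or "no"


def build_pair(image_id: str, yes_question: str, no_question: str) -> Tuple[Instruction, Instruction]:
    """Build the two instructions for one image.

    The first question has ground truth "yes", the second "no";
    both share the same image and the same suffix.
    """
    return (
        Instruction(image_id, f"{yes_question.strip()} {SUFFIX}", "yes"),
        Instruction(image_id, f"{no_question.strip()} {SUFFIX}", "no"),
    )


# Hypothetical example: an image that contains exactly two bananas.
pair = build_pair(
    "example_0001.jpg",
    "Are there two bananas in the image?",
    "Are there three bananas in the image?",
)
for instruction in pair:
    print(instruction.ground_truth, "|", instruction.prompt)
```

Keeping both questions attached to the same `image_id` is what later allows a per-image metric to be computed in addition to a per-question one.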
Evaluation Metric
Since the output of the model is limited to two types (“yes” or “no”), it is convenient to measure the metrics of accuracy and accuracy+. The former is calculated per question, while the latter is calculated per image and requires both of the image's questions to be answered correctly. The random-guess accuracies of the two metrics are 50% and 25%, respectively. Accuracy+ is thus a stricter measurement, but it also better reflects how comprehensively the model understands the image. In addition, we calculate the score of a subtask as the sum of accuracy and accuracy+, so each subtask has a full score of 200. The perception score is the sum of the scores of all perception subtasks, and the cognition score is calculated in the same way. With 10 perception subtasks and 4 cognition subtasks, the full scores of perception and cognition are therefore 2000 and 800, respectively.
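As a worked illustration of these definitions, the following Python sketch computes accuracy, accuracy+, and the subtask score from per-question results. The function name `subtask_score` and the record layout `(image_id, is_correct)` are assumptions made for this example, not the benchmark's official evaluation script.

```python
from collections import defaultdict


def subtask_score(records):
    """Compute (accuracy, accuracy_plus, score) for one subtask.

    `records` is an iterable of (image_id, is_correct) tuples, two per image.
    Accuracy is per question, accuracy+ is per image (both questions correct),
    and the score is their sum; accuracies are expressed in percent.
    """
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(bool(is_correct))

    n_questions = sum(len(results) for results in per_image.values())
    n_correct = sum(sum(results) for results in per_image.values())
    n_images_all_correct = sum(all(results) for results in per_image.values())

    accuracy = 100.0 * n_correct / n_questions
    accuracy_plus = 100.0 * n_images_all_correct / len(per_image)
    return accuracy, accuracy_plus, accuracy + accuracy_plus


# Hypothetical run on a 30-image subtask where a single question is answered wrongly.
records = []
for i in range(30):
    records.append((f"img_{i}", True))    # first question correct
    records.append((f"img_{i}", i != 0))  # second question wrong only for img_0

acc, acc_plus, score = subtask_score(records)
print(f"accuracy={acc:.2f}  accuracy+={acc_plus:.2f}  score={score:.2f}")
# accuracy=98.33  accuracy+=96.67  score=195.00
```

One wrong answer costs 1/60 of the accuracy but 1/30 of the accuracy+, which is why accuracy+ is the stricter of the two metrics.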
Data Collection
We argue that perception is one of the most fundamental capabilities of MLLMs, and that a lack of perception easily leads to the object hallucination problem. That is, the MLLM answers questions based on its own fantasies rather than on the actual content of the image, as displayed in Fig. 4.
Coarse-Grained Recognition. Coarse-grained recognition covers the existence of common objects, as well as their count, color, and position. The images are sampled from COCO, but the instruction-answer pairs are all manually constructed rather than taken directly from publicly available annotations. Even if MLLMs have seen these COCO images, our manually prepared pairs are not present in their training sets. This requires MLLMs to understand the instructions and infer the corresponding answers. For each of the perception subtasks of existence, count, color, and position, we prepare 30 images with 60 instruction-answer pairs.
Fine-Grained Recognition. Fine-grained recognition focuses more on testing the knowledge resources of MLLMs. The subtasks consist of recognizing movie posters, celebrities, scenes, landmarks, and artworks, containing 147, 170, 200, 200, and 200 images respectively. For the celebrities, we draw a red box around a person with a clearly visible face in the image, and the corresponding instruction is “Is the actor inside the red box named [celebrity name]? Please answer yes or no.” Similar to the coarse-grained recognition above, the images of these subtasks are from publicly available datasets and all of the instructions are manually designed.
OCR. Optical Character Recognition (OCR) is also a foundational capability of MLLMs, serving subsequent text-based tasks such as text translation and text understanding. The images are sampled from an existing dataset, and all of the instruction-answer pairs are manually designed. Considering that MLLMs are still in their infancy, we only choose relatively simple samples in this version of MME. The numbers of images and instruction-answer pairs are 20 and 40, respectively.
Cognition Tasks
We evaluate if any MLLM can carry out further logical reasoning after perceiving the image, which is the most fascinating aspect of MLLMs over previous traditional methods. In order to infer the correct answer, MLLMs need to follow the instruction, perceive the contents of the image, and invoke the knowledge reserved in LLMs, which is much more challenging than the single perception tasks. Examples of the following subtasks are shown in Fig. 1.
Commonsense Reasoning. Unlike the ScienceQA dataset, which requires specialized knowledge, commonsense here refers to basic knowledge of daily life. For example, given a photo of a down jacket, we ask the MLLM whether it is appropriate to wear this clothing when it is cold (or hot). Humans can judge such basic knowledge instantly without complex step-by-step reasoning. Therefore, we expect MLLMs to perform well in a zero-shot setting. The images are all manually photographed or generated by diffusion models, and the instruction-answer pairs are all manually designed. There are a total of 70 images and 140 instruction-answer pairs.
Numerical Calculation. This subtask requires MLLMs to read the arithmetic problem in the image and output the answer in an end-to-end way, a capability that has been demonstrated in prior work. In this version, we only consider relatively easy arithmetic problems, such as addition and multiplication. There are 20 images and 40 instruction-answer pairs. The images are all manually taken, and the instruction-answer pairs are all manually designed.
Text Translation. Considering that the MLLM supports both English and Chinese, we set the text translation subtask. It requires MLLMs to translate the Chinese written in an image to the corresponding English. In this version, we only design basic translation problems, which will be updated according to the development of MLLMs in the future. The images of this part are all manually taken, and the instruction-answer pairs are all manually designed. There are a total of 20 images and 40 instruction-answer pairs.
Code Reasoning. This subtask requires MLLMs to read the code in the images and automatically carry out the logical operations inside the code. A similar task, writing website code based on an image, has been demonstrated in prior work. The images are all manually taken, and the instruction-answer pairs are all manually designed. We only set basic code problems in this version. There are a total of 20 images and 40 instruction-answer pairs.
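The per-subtask image counts stated above can be tallied with a short Python sketch. The dictionary layout and key names are assumptions made for illustration; the counts themselves are the ones given in the text, with two instruction-answer pairs per image and a full score of 200 per subtask.

```python
# Image counts per subtask as stated in the text; the key names are illustrative.
IMAGES_PER_SUBTASK = {
    "perception": {
        "existence": 30, "count": 30, "position": 30, "color": 30,
        "posters": 147, "celebrities": 170, "scenes": 200,
        "landmarks": 200, "artworks": 200, "ocr": 20,
    },
    "cognition": {
        "commonsense_reasoning": 70, "numerical_calculation": 20,
        "text_translation": 20, "code_reasoning": 20,
    },
}
QUESTIONS_PER_IMAGE = 2    # one "yes" question and one "no" question
FULL_SUBTASK_SCORE = 200   # 100 (accuracy) + 100 (accuracy+)

summary = {}
for ability, subtasks in IMAGES_PER_SUBTASK.items():
    n_images = sum(subtasks.values())
    summary[ability] = {
        "subtasks": len(subtasks),
        "images": n_images,
        "pairs": n_images * QUESTIONS_PER_IMAGE,
        "full_score": len(subtasks) * FULL_SUBTASK_SCORE,
    }
    print(ability, summary[ability])
# perception {'subtasks': 10, 'images': 1057, 'pairs': 2114, 'full_score': 2000}
# cognition {'subtasks': 4, 'images': 130, 'pairs': 260, 'full_score': 800}
```

The tally shows that the subtasks differ considerably in size, so a single wrong answer moves the score of a 20-image subtask much more than that of a 200-image one.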
Experiments
In this section, a total of 30 MLLMs are evaluated on our MME benchmark, including BLIP-2, InstructBLIP, MiniGPT-4, PandaGPT, Multimodal-GPT, VisualGLM-6B, ImageBind-LLM, VPGTrans, LaVIN, mPLUG-Owl, Octopus, Muffin, Otter, LRV-Instruction, Cheetor, LLaMA-Adapter-v2, GIT2, BLIVA, Lynx, MMICL, GPT-4V, Skywork-MM, mPLUG-Owl2, Qwen-VL-Chat, XComposer-VL, LLaVA, Lion, SPHINX, InfMLLM, and WeMM.
There are a total of 10 subtasks for the evaluation of the perception ability, from the perspectives of coarse-grained recognition, fine-grained recognition, and OCR. Figs. 2 (3)-(6) show the score leaderboards of the individual coarse-grained recognition subtasks. With respect to object existence, Otter, Lynx, WeMM, Muffin, and SPHINX get the highest score of 195, with a 98.33% accuracy and a 96.67% accuracy+ as listed in Table 1. The second-place models, including GIT2, XComposer-VL, Lion, GPT-4V, and others, trail the first place by only 5 points. The results show that these models already perform well on object existence. For object count, position, and color, Muffin, Lion (tied with SPHINX), and InfMLLM take the top spot, respectively. This suggests that different models have their own strengths. Note that among the four coarse-grained subtasks, these MLLMs get the worst results on object position, indicating that current models are not sensitive enough to position information.
Figs. 2 (7)-(11) display the score leaderboards of the individual fine-grained recognition subtasks. Regarding poster recognition, GPT-4V, Lion, and Qwen-VL-Chat are the top three. Interestingly, Qwen-VL-Chat relatively underperforms in the coarse-grained recognition, but performs well here. This implies that our division into coarse-grained and fine-grained recognition is reasonable, enabling us to examine different aspects of MLLMs. For celebrity recognition, WeMM, SPHINX, and Otter take the top three with similar scores. It is worth noting that GPT-4V refuses to answer questions that involve individuals, resulting in a zero score in the celebrity subtask. For scene recognition, WeMM, InfMLLM, and Lynx are ahead of the other MLLMs. This is the first time InfMLLM and Lynx have broken into the top three in the fine-grained recognition subtasks. For landmark recognition, the top three places are taken by Lion, WeMM, and LLaVA respectively, with Lion in the top spot. For artwork recognition, WeMM, GPT-4V, and GIT2 exceed their counterparts, and the scores of the last two are similar. Note that GPT-4V declines to answer some questions about private art collections, which lowers its score. With respect to OCR, shown in Fig. 2 (12), GPT-4V, Skywork-MM, and WeMM take the top three with scores of 185, 162.5, and 147.5 respectively. GPT-4V presents a huge advantage, leading the other two models by more than 22 points. As presented in Fig. 2 (1), in the overall perception leaderboard, WeMM, InfMLLM, and SPHINX come in the top three, closely followed by Lion, LLaVA, and XComposer-VL.
Cognition
There are four subtasks for the evaluation of the cognition ability, including commonsense reasoning, numerical calculation, text translation, and code reasoning. Figs. 2 (13)-(16) plot the score leaderboards of individual subtasks. In terms of the commonsense reasoning, the “ever-victorious generals” GPT-4V, WeMM, and XComposer-VL exceed other MLLMs, especially GPT-4V, which gets a score of 142.14. With respect to numerical calculation, GPT-4V still achieves first place, but falls short in the text translation. Regardless of whether it is commonsense reasoning, numerical calculation, or text translation, none of the highest scores exceed 150. This suggests that MLLMs have a lot of room for improvement in these capabilities. For the code reasoning, GPT-4V achieves a high score of 170, far ahead of other counterparts. For all of the cognition tasks, GPT-4V, Lion, and WeMM win the gold, silver, and bronze medals respectively, as shown in Fig. 2 (2).
Analysis
We summarize four common problems that largely affect the performance of MLLMs. The first problem is not following instructions. Although we have adopted a very concise instruction design, some MLLMs answer freely rather than following the instructions. For example, as shown in the first row of Fig. 4, the instruction states “Please answer yes or no”, but the MLLM only makes a declarative statement. If neither “yes” nor “no” appears at the beginning of the generated text, the model is judged to have given a wrong answer. We argue that a good MLLM (especially after instruction tuning) should be able to follow such a simple instruction, which is also very common in everyday life.
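The judging rule above can be sketched in a few lines of Python. The text only says that “yes” or “no” must appear at the beginning of the generated output; the exact matching details used here (case-insensitive, leading whitespace ignored, whole-word match) are assumptions of this sketch, and the function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: case-insensitive whole-word match at the start, ignoring leading whitespace.
_YES_NO_PREFIX = re.compile(r"\s*(yes|no)\b", flags=re.IGNORECASE)


def parse_yes_no(response: str) -> Optional[str]:
    """Return "yes" or "no" if the response begins with one of them, else None."""
    match = _YES_NO_PREFIX.match(response)
    return match.group(1).lower() if match else None


def is_correct(response: str, ground_truth: str) -> bool:
    """A response that does not start with yes/no is always judged wrong."""
    return parse_yes_no(response) == ground_truth


print(parse_yes_no("Yes, there is a dog in the image."))  # yes
print(parse_yes_no("No."))                                # no
print(parse_yes_no("The image shows a busy street."))     # None
print(is_correct("The image shows a busy street.", "yes"))  # False
```

The word boundary `\b` keeps words such as “Nothing” or “Not” from being counted as a “no” answer, so a free-form declarative reply is scored as wrong regardless of its content.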
The second problem is a lack of perception. As shown in the second row of Fig. 4, the MLLM misidentifies the number of bananas in the first image, and misreads the characters in the second image, resulting in wrong answers. We notice that the performance of perception is vulnerable to the nuance of instructions, since the two instructions of the same image differ in only one word, but lead to completely different and even contradictory perception results.
The third problem is a lack of reasoning. In the third row of Fig. 4, we can see from the red text that the MLLM already knows that the first image is not an office, but still gives an incorrect answer of “yes”. Analogously, in the second image, the MLLM has calculated the right arithmetic result, but finally delivers a wrong answer. These phenomena indicate that the logic chain is broken during the reasoning process of MLLMs. Adding CoT prompts, such as “Let’s think step by step”, may yield better results. We look forward to further in-depth research.
The fourth problem is object hallucination following instructions, which is exemplified in the fourth row of Fig. 4. When the instruction contains descriptions of an object that does not appear in the image, the MLLM will imagine that the object exists and ultimately give a “yes” answer. Such a pattern of constantly answering “yes” results in an accuracy of about 50% and an accuracy+ of about 0, as shown in Tables 1 and 2. This suggests an urgent need to suppress hallucinations, and the community should take into account the reliability of the generated answers.
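The effect of constantly answering “yes” follows directly from the paired design, in which every image has one “yes” question and one “no” question. The Python sketch below makes this concrete; the `evaluate` function and the `predict` callback signature are hypothetical constructs for this illustration.

```python
def evaluate(predict, n_images=30):
    """Score a predictor on a synthetic subtask with one "yes" and one "no" question per image.

    `predict(image_index, question_index)` must return "yes" or "no".
    Returns (accuracy, accuracy_plus) in percent.
    """
    ground_truths = ("yes", "no")  # first question "yes", second question "no"
    correct_questions = 0
    correct_images = 0
    for image_index in range(n_images):
        results = [
            predict(image_index, question_index) == truth
            for question_index, truth in enumerate(ground_truths)
        ]
        correct_questions += sum(results)
        correct_images += all(results)
    accuracy = 100.0 * correct_questions / (len(ground_truths) * n_images)
    accuracy_plus = 100.0 * correct_images / n_images
    return accuracy, accuracy_plus


def always_yes(image_index, question_index):
    """A degenerate model that hallucinates every queried object."""
    return "yes"


print(evaluate(always_yes))  # (50.0, 0.0)
```

An always-“yes” model gets every first question right and every second question wrong, so it never answers both questions of an image correctly. Its accuracy+ of 0 is even below the 25% expected from random guessing, which is what makes accuracy+ useful for exposing this failure mode.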
Conclusion
This paper has presented the first MLLM evaluation benchmark MME, which has four distinct characteristics in terms of task type, data source, instruction design, and quantitative statistics. A total of 30 advanced MLLMs are evaluated on MME, and the experimental results show that there is still substantial room for improvement. We also summarize the common problems exposed in the experimental results, providing valuable guidance for the development of MLLMs.