MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

cs.CL

Introduction

In recent years, advancements in large language models (LLMs) have significantly transformed the field of natural language processing (NLP). These models, including state-of-the-art examples like GPT-4, Gemini, and Claude, are pushing the envelope both in general applicability across various tasks and in specialized performance in specific areas. A key objective in this ongoing development is achieving expert-level intelligence, characterized by performance that meets or surpasses the top 10% of skilled adults in a diverse range of tasks.

To effectively track progress towards this expert-level intelligence, it is essential to evaluate these models on a broad range of tasks. There are multiple popular benchmarks used to measure such general intelligence. For example, AGIEval focuses on general-exam questions from SAT, Gaokao, GRE, etc. ARC focuses on science-based questions. BBH focuses on solving hard synthetic tasks. MMLU includes a broad range of exam questions from 57 subjects across STEM, the humanities, the social sciences, etc. Among these benchmarks, MMLU emerged as the de facto standard for evaluating LLMs due to its broad coverage and high quality. However, the rapid progress of current LLMs has quickly led to performance saturation on MMLU. Since GPT-4 achieved 86.4% in March 2023, there has not been any significant progress on the benchmark. The most recent frontier models, such as GPT-4-Turbo, Gemini-1.5-Pro, Claude, and LLaMA-3-400B (all released in early to mid 2024), settle at an accuracy between 86% and 87%. The recently published GPT-4o achieves a remarkable performance boost (10+% improvement) on MATH and Chatbot Arena. However, it improves by only 1% on MMLU, reaching 87.4%. This leads us to re-examine the effectiveness of MMLU in measuring future (stronger) LLMs. Besides the saturation issue, performance on MMLU is also known to be highly sensitive to the prompt and scoring function, which causes significant order changes in the leaderboard. Here, we conjecture that these issues are due to the following causes:

The questions in MMLU only have three distractor options. LLMs could potentially exploit shortcuts to derive the answer without truly understanding the rationale. This could lead to an overestimate of LLMs’ true performance, also leading to a degree of instability.

The questions in MMLU are mostly knowledge-driven without requiring too much reasoning, especially in the STEM subjects, which reduces its difficulty. In fact, most models achieve better performance with ‘direct’ answer prediction without chain-of-thought.

There is a portion of questions that are either unanswerable or mistakenly annotated. This dataset noise leads to a lower ceiling, which the frontier models hit.

These issues highlight the need for more challenging, discriminative, and reliable datasets to track the progress of LLMs. In this paper, we introduce MMLU-Pro: a comprehensive benchmark designed for proficient-level multi-discipline language understanding and reasoning. MMLU-Pro spans 14 diverse domains including mathematics, physics, chemistry, law, engineering, psychology, and health, encompassing over 12,000 questions and thus meeting the breadth requirement. MMLU-Pro differs from MMLU in the following aspects:

MMLU-Pro has ten options per question, that is, nine distractors compared with three in MMLU (three times as many). By increasing the number of distractors, we significantly reduce the probability of a correct guess by chance, which boosts the benchmark’s difficulty and robustness.

MMLU-Pro increases the portion of challenging college-level exam problems. These questions require LLMs to perform deliberate reasoning in different domains to derive the final answer.

We integrate two rounds of expert reviews to reduce the noise of the dataset. The first round is based on expert verification. In the second round, we utilize the SoTA LLMs to identify potential errors and employ annotators to perform more targeted verification.

We evaluated more than 50 LLMs on MMLU-Pro, including open-source and closed-source models such as GPT-4o, Claude-3-Opus, Gemini, LLaMA-3, and Phi-3. Our key findings are summarized as follows:

MMLU-Pro presents significant challenges; notably, the leading model, GPT-4o, only achieves an accuracy of 72.6% and GPT-4-Turbo reaches 63.7%, indicating substantial room for improvement.

MMLU-Pro is more discriminative than MMLU in distinguishing the nuances between models. For example, the gap between GPT-4o and GPT-4-Turbo is 1% on MMLU, while it becomes 9% on MMLU-Pro. This discriminative nature makes MMLU-Pro a more suitable benchmark.

Advanced open-source models like Llama-3-70B-Instruct and DeepSeek-V2-Chat, while not yet performing at the level of leading closed-source models such as GPT-4o and Claude-3-Opus, have shown performance that is close to Claude-3-Sonnet.

MMLU-Pro necessitates chain-of-thought (CoT) to achieve promising results. For instance, CoT can boost the performance of GPT-4o by 19%. In contrast, CoT will actually hurt the performance of models on MMLU. This reflects the necessity to perform deliberate reasoning on MMLU-Pro, which is not required in the knowledge-driven MMLU questions.

Our error analysis on 120 erroneous cases of GPT-4o, the current top-performing model, reveals that 39% of errors are due to flaws in the reasoning process, 35% stem from a lack of specific domain expertise, and another 12% from computational errors. These results highlight the MMLU-Pro benchmark’s difficulties and indicate areas needing further research and model enhancement.

Related Work

Recent advancements in LLMs have significantly propelled the field of natural language processing. GPT-3 demonstrated robust few-shot prediction capabilities, interpreting tasks and examples from natural language inputs. Subsequent models like InstructGPT, which employ human-feedback reinforcement learning, have achieved strong user instruction-following capability. More recent models, including GPT-4o, GPT-4, Claude-3, Gemini, and Llama-3, have shown notable improvements in complex reasoning across various domains. To rigorously assess and push the capabilities of these LLMs, we introduce MMLU-Pro, a new benchmark designed to test the upper limits of reasoning and knowledge in advanced language models.

2.2 LLMs Evaluation Benchmarks

In recent years, the development of various benchmarks has significantly enhanced the evaluation of Large Language Models (LLMs). For instance, GLUE and its successor SuperGLUE have played a pivotal role in advancing language understanding tasks, setting the stage for more specialized evaluations. Other recent benchmarks, including MMLU, HELM, BigBench, HellaSwag, and the AI2 Reasoning Challenge (ARC), have broadened the scope by assessing capabilities across language generation, knowledge understanding, and complex reasoning.

To facilitate performance comparison across diverse LLMs, several leaderboards have been established, such as the OpenLLM Leaderboard and OpenCompass. However, as LLM capabilities have rapidly advanced, leaderboard scores have become increasingly concentrated at the top, with models like GPT-4 achieving near-perfect scores on multiple benchmarks. This trend highlights the urgent need for more challenging benchmarks to fully test the limits of LLM capabilities.

Recent studies have revealed that the performance of LLMs on current benchmarks is not robust to minor perturbations. Specifically, slight variations in the style or phrasing of prompts can lead to significant shifts in model scores. Beyond the inherent non-robustness of the models themselves, the typical four-option format of multiple-choice questions (MCQs) also contributes to this instability in model scoring. This format may not sufficiently challenge the models or differentiate between closely performing systems, leading to potential overestimations of their capabilities. Our new benchmark, MMLU-Pro, aims to address these issues by introducing more complex questions and increasing the number of answer choices from four to ten, thus enhancing performance differentiation and reducing the likelihood of correct guesses by chance.

The MMLU-Pro Benchmark

Our dataset comprises 14 discipline subsets, totaling 12,032 questions, with their distribution and origins detailed in Figure 3. It integrates questions from several sources: (1) Original MMLU Questions, which form the core of our dataset and include questions adapted from the original MMLU dataset with trivial and erroneous questions removed; (2) the STEM Website, providing selected high-quality STEM problems from online platforms; (3) TheoremQA, featuring high-quality, human-annotated questions that necessitate the application of theorems for resolution; and (4) SciBench, which includes advanced science questions derived from college exams, ensuring the inclusion of curriculum-aligned questions. For detailed data statistics, prompt instructions, and the incorrect answer cases we eliminated, please refer to Appendix A.1.

3.2 Dataset Construction Pipeline

As shown in Figure 2, our dataset construction begins with a comprehensive review of the original MMLU dataset, which we streamline by merging the previous 57 subject categories into 14 broader categories. This restructuring aims to better focus the evaluation on key areas of knowledge and reduce redundancy. We also aim to eliminate overly simple questions that fail to challenge the models effectively. We employ eight models: Llama-2-7B, Llama-2-7B-Chat, Llama-2-13B, Llama-2-13B-Chat, Mistral-7B, Gemma-7B, Yi-6B, and Yi-6B-Chat. Each model is evaluated on MMLU, and questions answered correctly by more than four models are considered “too easy” and excluded. Using this criterion, a total of 5,886 questions are filtered out across the various subjects.
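The filtering rule above can be sketched as a simple majority-style vote. This is a minimal illustration, assuming per-question correctness flags are already available for each of the eight models; the function name, the data layout, and the toy inputs are hypothetical and are not taken from our pipeline.

```python
from typing import Dict, List


def filter_easy_questions(
    correct_by_model: Dict[str, List[bool]], max_correct: int = 4
) -> List[int]:
    """Return the indices of questions that are kept.

    A question is dropped as "too easy" when MORE than `max_correct`
    models answer it correctly (more than four of eight in the text).
    """
    num_questions = len(next(iter(correct_by_model.values())))
    kept = []
    for q in range(num_questions):
        n_correct = sum(results[q] for results in correct_by_model.values())
        if n_correct <= max_correct:
            kept.append(q)
    return kept


# Toy data (hypothetical): eight models, three questions.
# Question 0 is solved by all eight models, question 1 by exactly four,
# question 2 by none.
toy = {f"model_{i}": [True, i < 4, False] for i in range(8)}
print(filter_easy_questions(toy))  # [1, 2]
```

Note that a question solved by exactly four models is retained, since the criterion is "more than four".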

Question Collection and Integration

We then expand our dataset by incorporating questions from the STEM Website (https://stemez.com/subjects), TheoremQA, and SciBench. It is important to note that the questions from the STEM Website are in the form of problem statements with solutions, while those from TheoremQA are in the format of questions accompanied by brief answers. To adapt these for our dataset, we utilize GPT-4-Turbo (in this paper, GPT-4-Turbo refers to GPT-4-turbo-2024-04-09; for more details, see https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) to extract short answers from the solutions, serving as the correct answer options. We also generate three additional distractors for each question. We then manually compare the solutions with the extracted answers to remove questions where the extracted answers are incomplete or incorrect. This step is essential for aligning the STEM Website and TheoremQA questions with those from other sources, preparing them for the subsequent option augmentation.

Option Augmentation

We expand the options of each question from four to ten using GPT-4-Turbo, introducing six additional choices. These are not merely quantitative additions; rather, they serve as plausible distractors that necessitate discerning reasoning for correct selection. This approach significantly lowers the likelihood of correctly guessing an answer, thereby increasing both the difficulty and the robustness of the benchmark. In experiments, we found that GPT-4-Turbo does not gain additional advantage from such an augmentation procedure.
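As a rough illustration of the guessing argument, a uniformly random guesser is correct with probability 1/k on a question with k options. The sketch below computes this chance level for four and ten options, and for a mixed set of option counts; the function names and the example list of option counts are illustrative assumptions, not part of the benchmark code.

```python
from typing import Sequence


def chance_accuracy(num_options: int) -> float:
    """Probability that a uniformly random guess is correct."""
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least two options")
    return 1.0 / num_options


def expected_chance_accuracy(option_counts: Sequence[int]) -> float:
    """Average chance accuracy over questions with differing option counts."""
    return sum(chance_accuracy(k) for k in option_counts) / len(option_counts)


print(chance_accuracy(4))   # 0.25 (four-option format)
print(chance_accuracy(10))  # 0.1  (ten-option format)

# Hypothetical mix of option counts, for illustration only.
print(expected_chance_accuracy([10, 10, 5, 4]))
```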

Expert Review

The expert review process for our dataset construction comprises two main phases to ensure its quality and reliability. Phase 1: Verification of Correctness and Appropriateness involves experts verifying the accuracy of each answer, removing questions unsuitable for a multiple-choice format, and discarding questions that lack necessary information or require non-textual elements like images or tables. Phase 2: Ensuring Distractor Validity involves the Gemini-1.5-Pro model re-evaluating all answer options to identify potential false negatives. In this context, a ‘false negative’ refers to a correct option initially misclassified as incorrect. Subsequently, human experts rigorously review these identified options to ensure that actual distractors are indeed incorrect and distinctly different from the correct answer. In Table 1, we present a distribution of issues identified during the expert review process. For better illustration, we categorize them into three types:

Incorrect Answers: refer to instances where the provided answer is incorrect. There are two main sources of these errors: the pre-existing errors within the original MMLU dataset and errors on the STEM Website arising from flawed or incomplete answer extraction.

False Negative Options: primarily arise from distractors generated in two key stages: converting single answers into four options from sources like the STEM Website and TheoremQA, and further expanding these four options to ten in the option augmentation phase. Human experts remove each False Negative Option, keeping the correct answer and suitable distractors. In our dataset, 83% of questions have ten options, 17% have fewer, and the average number of options per question is 9.47.

Bad Questions: include several problematic aspects: (1) Questions that require non-text information such as images or tables. (2) Questions that lack sufficient textual information to derive a conclusive answer. (3) Questions that are unsuitable for a multiple-choice format, such as proof problems, true or false questions, and open-ended questions.
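Removing False Negative Options shrinks the option list, so the position of the correct answer may shift. The following minimal sketch shows one way to drop flagged distractors while keeping the answer index consistent; the function name and the data representation (a list of option strings plus a zero-based answer index) are assumptions made for illustration and do not describe the tooling used by the reviewers.

```python
from typing import List, Tuple


def drop_flagged_options(
    options: List[str], answer_index: int, flagged: List[int]
) -> Tuple[List[str], int]:
    """Remove flagged distractors and return the new options and answer index."""
    if answer_index in flagged:
        raise ValueError("the annotated correct answer must not be removed")
    flagged_set = set(flagged)
    # Keep (original index, option) pairs so the answer can be relocated.
    kept = [(i, opt) for i, opt in enumerate(options) if i not in flagged_set]
    new_options = [opt for _, opt in kept]
    new_answer_index = [i for i, _ in kept].index(answer_index)
    return new_options, new_answer_index


# Hypothetical question: option 0 ("four") duplicates the correct answer "4".
options, answer = drop_flagged_options(["four", "4", "5", "6"], 1, [0])
print(options, answer)  # ['4', '5', '6'] 0
```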

Experimental Setup

We utilize a 5-shot Chain-of-Thought (CoT) approach to measure model performance on challenging tasks presented by MMLU-Pro. This CoT reasoning, adapted from Chain-of-Thought Hub, incorporates essential reasoning steps from the original MMLU dataset. Our approach introduces two enhancements: firstly, extending the original options available from the Chain-of-Thought Hub, and secondly, selecting five representative demonstration examples for each discipline. Unlike traditional performance measures such as Perplexity (PPL), which primarily focus on linguistic probabilities, our method emphasizes reasoning capabilities, crucial for handling the complexities of MMLU-Pro. A comprehensive comparison of performances using direct answering and CoT methods will be presented in Section 6.2, demonstrating the effectiveness of the CoT approach.

To extract answers from the model-generated reasoning content in response to the MMLU-Pro dataset, we initially use the regular expression ‘answer is \(?([A-J])\)?’ to match the format specified in the prompt instructions and few-shot examples. If this regex fails to retrieve a valid response, possibly due to formatting deviations by the model, we employ a secondary regex ‘.*[aA]nswer:\s*([A-J])’ for a second attempt to extract the answer. If both fail to retrieve a valid response, a fallback mechanism selects a random option from the answer choices. This ensures consistent answer provision across all evaluations.
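The two-stage extraction with a random fallback can be sketched as follows. This is a minimal illustration, assuming both patterns are meant to capture a single letter from A to J, optionally wrapped in parentheses in the first case; the function name and the optional `rng` argument are our own additions for reproducibility and are not claimed to match the evaluation script.

```python
import random
import re
from typing import Optional

# Assumed reading of the two patterns described in the text.
PRIMARY = re.compile(r"answer is \(?([A-J])\)?")
SECONDARY = re.compile(r".*[aA]nswer:\s*([A-J])")


def extract_answer(
    response: str,
    options: str = "ABCDEFGHIJ",
    rng: Optional[random.Random] = None,
) -> str:
    """Return the option letter chosen in `response`, or a random fallback."""
    match = PRIMARY.search(response)
    if match:
        return match.group(1)
    match = SECONDARY.search(response)
    if match:
        return match.group(1)
    # Neither pattern matched: fall back to a uniformly random option.
    chooser = rng if rng is not None else random
    return chooser.choice(options)


print(extract_answer("Adding the terms gives 12, so the answer is (C)."))  # C
print(extract_answer("Answer: B"))                                         # B
```

Passing a seeded `random.Random` instance makes the fallback reproducible across runs, which matters when comparing scores of models that frequently fail to follow the output format.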

Results and Analysis

Table 2 showcases the performance of frontier models on the MMLU-Pro benchmark. Due to space constraints, we selected a subset of models and representative domains, including three reasoning-focused subjects (Mathematics, Physics, Engineering) and three knowledge-heavy subjects (History, Law, Psychology). Full results are available on our leaderboard at https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro.

GPT-4o emerges as the strongest model with an overall accuracy of 72.6%, showing superior performance across all subjects. Additionally, Phi-3-medium-4k-instruct (14B parameters) and Phi-3-mini-4k-instruct (3.8B parameters) perform exceptionally well, possibly due to their pre-training on high-quality educational data and code.

The results in Table 2 also indicate that top-tier closed-source models outperform the open-source ones. Among the leading open-source models, Llama-3-70B-Instruct performs the best, achieving an accuracy of 56.2%, close to that of Yi-Large and Claude-3-Sonnet. However, it still significantly lags behind GPT-4o and Claude-3-Opus in all subjects.

5.2 Subject-Specific Insights

Math and Physics: In computation and reasoning-intensive subjects like Math and Physics, we observe significant performance disparities among models. The gap stretches from over 70% accuracy for GPT-4o to just over 20% for Mistral-7B-v0.1, illustrating a wide range in capabilities and underscoring the value of our benchmark in distinguishing these differences.

History and Psychology: In knowledge-intensive subjects such as History and Psychology, models generally show a higher performance floor compared to reasoning-intensive disciplines. Interestingly, the DeepSeek-V2-Chat model underperforms relative to its peers in these subjects, indicating its comparatively stronger reasoning abilities over its knowledge retrieval capabilities.

Engineering and Law: Among the 14 subjects evaluated, Engineering and Law consistently scored lower. Upon reviewing model outputs, we found that the lower scores in Engineering are largely due to the addition of new questions sourced from the STEM Website, which require complex formula derivations and multi-step calculations. This aspect leaves substantial room for improvement in future, more advanced models. Law scores suffer as questions become more intricate and detailed with additional options, necessitating a deeper understanding of legal reasoning.

5.3 Error Analysis

In this section, we explore an error analysis of GPT-4o, currently the best-performing model on the MMLU-Pro benchmark, to examine its performance strengths and weaknesses. This examination not only highlights areas where the model falls short but also provides insights that could inform future improvements in both its architecture and training processes. We conducted a detailed review of 120 randomly selected erroneous predictions made by GPT-4o. These errors were analyzed by expert annotators who determined the underlying causes of each misprediction using their expert judgment. Specific cases and further detailed discussions are provided in Appendix A.6.

Reasoning Errors (39%)

The model frequently encounters difficulties with logical reasoning, even when it recalls the correct information and knowledge. These issues often arise from logical inconsistencies in its responses, likely due to its dependence on recognizing patterns in training data rather than engaging in a true understanding of the problem.

Lack of Specific Knowledge (35%)

A fundamental root cause of domain-specific errors in the GPT-4o model is the lack of specialized knowledge. Errors such as incorrect financial calculations and misapplications of optical principles highlight this issue.

Calculation Errors (12%)

We distinguish calculation errors from reasoning errors to aid model developers, as many AI systems can effectively utilize calculators or Python for complex, multi-step calculations. In our review of error cases, it is common to find instances where the model has the correct formula but makes errors in computing values.

Other Errors

The remaining errors include No Selection Made (5%), Question Understanding Errors (4%), Generation Issues (2%), Annotation Errors (2%), and Answer Extraction Errors (1%). These errors are attributed to various factors, such as limitations in final response selection, complex text interpretation challenges, limitations in response generation, inaccuracies in data annotation, and issues in extracting precise answers from model outputs.

Comparison with MMLU

In this section, we compare the MMLU and MMLU-Pro benchmarks from three perspectives: difficulty level, reasoning strength, and robustness degree.

6.1 Difficulty Level

In Figure 5, we present scores of different models on both MMLU and MMLU-Pro benchmarks. It is evident that as language model capabilities enhance, the scores on MMLU are not only increasing but also clustering closely together, making it difficult to distinguish between models. For instance, models like Gemini-1.5-Flash, Llama-3-70B-Instruct, Phi-3-medium-4k-instruct, and Qwen1.5-110B all score between 78% and 82%, a narrow 4% range that encompasses four models, challenging the differentiation of their performance. MMLU-Pro expands this range to approximately 10%. Similarly, the score difference between models like GPT-4o, Claude-3-Opus, and GPT-4-Turbo has widened from about 2% on MMLU to around 9% on MMLU-Pro. Additionally, the increased difficulty in MMLU-Pro ensures ample room for future model improvement. Currently, the best-performing model, GPT-4o, scores 72.6% on MMLU-Pro, leaving a substantial margin of 27.4% for potential improvement, whereas MMLU offers only about 11.3% space for further enhancement.

6.2 Reasoning Level

According to Table 3, we can observe differences in performance between the Chain of Thought (CoT) method and Direct Answering (DA) across various models on MMLU and MMLU-Pro. The comparison shows that the CoT method generally results in more significant performance improvements on MMLU-Pro compared to MMLU.

Specifically, GPT-4o improves by 1.5% using the Chain of Thought (CoT) method compared to direct answering on MMLU, while on MMLU-Pro, its improvement reaches 19.1%. Similarly, GPT-4-Turbo shows a 15.3% increase in performance using CoT over direct answering on MMLU-Pro, although its performance slightly decreases by 0.2% on MMLU. Other models such as Phi-3-medium-4k-instruct, Llama-3-8B, and Gemma-7B display similar trends, with larger gains from CoT over direct answering on MMLU-Pro than on MMLU. These findings indicate that MMLU-Pro assesses deeper and more complex reasoning skills, as evidenced by the enhanced performance of models using CoT, highlighting its focus on professional-level problem-solving.

6.3 Robustness Degree

It is widely recognized that even minor variations in prompts can significantly impact model outputs, leading to substantial fluctuations when evaluating models. This poses challenges for accurately ranking models and maintaining consistent leaderboards. This sensitivity is generally attributed to models’ lack of robustness, a characteristic tied to the underlying principles of language models that fall outside the scope of this study. However, a high-quality benchmark should aim to minimize the impact of prompt variability on scores, ensuring more consistent and reliable evaluations.

To assess this, we evaluated models using 24 different but reasonable prompts. Figure 5 showcases the score range for different models under varying prompts. On the MMLU benchmark, the influence of these prompts generally ranges between 4% and 5%, with peaks up to 10.98%. In contrast, on the MMLU-Pro benchmark, the impact of prompt changes is generally around 2%, with a maximum of 3.74%. This reduced variability indicates improved consistency and reliability over the original MMLU benchmark when assessing language models’ capabilities.
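One simple way to quantify this prompt sensitivity is the spread between the best and worst score a model obtains across prompts. The sketch below assumes that "score range" means maximum minus minimum accuracy over the prompt variants; the function names and the example scores are hypothetical placeholders, not measured results.

```python
from typing import Dict, List


def score_range(scores: List[float]) -> float:
    """Spread between the highest and lowest score across prompt variants."""
    return max(scores) - min(scores)


def sensitivity_report(scores_by_model: Dict[str, List[float]]) -> Dict[str, float]:
    """Per-model score range in percentage points, rounded to two decimals."""
    return {
        model: round(score_range(scores), 2)
        for model, scores in scores_by_model.items()
    }


# Hypothetical accuracies (%) for one model under three prompt variants.
example = {"model_a": [70.0, 72.5, 71.0]}
print(sensitivity_report(example))  # {'model_a': 2.5}
```

A smaller range means that the ranking of models is less likely to change when the prompt wording changes.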

Limitations

The MMLU-Pro dataset, while enhancing the complexity of MMLU by incorporating more challenging, reasoning-focused questions, remains constrained by the limitations of the multiple-choice format. This format may not capture the depth of comprehension and creative response generation as effectively as open-ended answers, which better reflect real-world scenarios. Additionally, MMLU-Pro exclusively focuses on language models and does not include assessments for multi-modal models, limiting its applicability in scenarios requiring synthesis of visual, auditory, and textual data.

Conclusion

In this paper, we introduce MMLU-Pro, a more challenging benchmark designed to elevate the assessment of multi-task language understanding capabilities in language models. By incorporating more complex, reasoning-intensive tasks, MMLU-Pro addresses the performance saturation observed in previous benchmarks, effectively differentiating models’ capabilities. Our evaluations show that even leading models like GPT-4o encounter significant challenges, indicating a successful increase in difficulty and an improvement in the benchmark’s ability to test deeper cognitive processes. MMLU-Pro also enhances its robustness by reducing dependency on prompt styles, making it a valuable tool for advancing our understanding of AI language capabilities. As AI technology evolves, we hope MMLU-Pro plays a crucial role in pushing the boundaries of what language models can achieve.

Acknowledgments and Disclosure of Funding

We would like to thank Reddit user Dorrin Verrakai, who provided invaluable feedback for this work. We also express our gratitude to Ankesh Anand from Google DeepMind and Ning Shang from Microsoft for their insightful comments and suggestions. Additionally, we appreciate the contributions of all open-source language model providers, whose efforts have significantly propelled the advancement of research in this field.

Appendix A

Table 4 provides a comprehensive overview of the initial data filtering performed on the Massive Multitask Language Understanding (MMLU) dataset across various academic disciplines. It details the original number of items in each discipline, the number filtered out due to specific criteria, the percentage of items filtered, and the remaining number of items after filtering. Among the disciplines, Business, History, Other, and Psychology exhibit the highest filtering percentages, with more than 50% of the original questions being filtered out. This indicates that the questions in these disciplines are relatively simple and that many current language models have already mastered the relevant knowledge.

Distribution of MMLU-Pro Question Origin Details

Table 5 illustrates the varied sources of questions across different disciplines, highlighting the dependency on specific databases. Disciplines like Law, Other, Health, Philosophy, and History rely exclusively on questions from MMLU. Engineering, on the other hand, predominantly uses questions from STEM websites, accounting for 93.08% of its total. Similarly, Business, Chemistry, and Biology show a significant reliance on STEM website sources. Additionally, Math and Physics display a more diversified sourcing pattern, relatively evenly drawing questions from three or four sources.

Details of LLMs Utilization in Dataset Construction

In Table 6, we showcase the prompts used with Large Language Models (LLMs) in the dataset construction process. These prompts include using GPT-4-Turbo to convert problems from STEM Website and TheoremQA into multiple-choice questions (MCQs), expanding four-option MCQs to ten-option MCQs with GPT-4-Turbo, and employing Gemini-1.5-Pro to recall False Negative Options.

A.2 5-shot CoT Prompt example

As shown in Table 7, when evaluating models on MMLU-Pro, the prompt consists of an initial prompt, 5 demonstration examples, and a question to be answered. These demonstration examples are drawn from the validation subset of the MMLU-Pro dataset.
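The three-part structure of the prompt can be sketched as below. This is only an illustration of how an initial prompt, five demonstrations, and the target question might be concatenated; the exact wording and layout are given in Table 7, and the field names (`question`, `options`, `cot`), the "Question:/Options:/Answer:" layout, and the placeholder instruction text used here are assumptions.

```python
from typing import Dict, List, Optional

LETTERS = "ABCDEFGHIJ"


def format_question(
    question: str, options: List[str], cot_answer: Optional[str] = None
) -> str:
    """Render one question; demonstrations also include their CoT answer."""
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    answer_line = "Answer: Let's think step by step."
    if cot_answer:
        answer_line += f" {cot_answer}"
    lines.append(answer_line)
    return "\n".join(lines)


def build_prompt(
    initial_prompt: str, demos: List[Dict], question: str, options: List[str]
) -> str:
    """Concatenate the initial prompt, the demonstrations, and the target question."""
    parts = [initial_prompt]
    parts += [format_question(d["question"], d["options"], d["cot"]) for d in demos]
    parts.append(format_question(question, options))
    return "\n\n".join(parts)


# Hypothetical demonstrations and instruction text, for illustration only.
demos = [
    {"question": f"What is {i} + {i}?", "options": [str(2 * i), str(2 * i + 1)],
     "cot": f"{i} + {i} = {2 * i}. The answer is (A)."}
    for i in range(1, 6)
]
prompt = build_prompt("Placeholder instruction text.", demos, "What is 3 * 3?", ["6", "9"])
print(prompt.count("Question:"))  # 6
```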

A.3 List of Language Models Studied

In this part, we detail the model families included in our study. Our focus is on widely-used models in current production environments, such as GPT, Claude, Gemini, LLaMA, Yi (https://www.lingyiwanwu.com/), Phi, and other popular model families:

For closed-source models, we utilized the APIs of the most recent versions as of May 2024:

OpenAI GPT including GPT-4o and GPT-4-Turbo. Currently strongest GPT models.

Anthropic Claude including Claude-3-Opus and Claude-3-Sonnet.

Google Gemini including Gemini-1.5-Pro, the most powerful model in the series, and Gemini-1.5-Flash, the newest and fastest model in the Gemini family served in the API.

01.AI Yi including Yi-Large. A capable closed-source model that achieves a high score on the Chatbot Arena leaderboard.

We also examined a range of open-source base and instruction-tuned models:

Meta LLaMA including Llama-3-70B-Instruct, Llama-3-70B, Llama-2-70B, Llama-3-8B-Instruct and Llama-3-8B. Important open-sourced base and instruction-tuned models.

Microsoft Phi including Phi-3-medium-4k-instruct and Phi-3-mini-4k-instruct. Compact yet powerful models, excelling in knowledge and reasoning.

Qwen including Qwen1.5-110B and Qwen1.5-72B-Chat.

TIGER Lab MAmmoTH2 including MAmmoTH2-8x7B-Plus. A reasoning-enhanced LLM instruction tuned from Mixtral-8×7B.

Mistral AI Mixtral and Mistral including Mistral-7B-v0.1 and two Mixture of Experts (MoE) models: Mixtral-8x7B-Instruct-v0.1, Mixtral-8x7B-v0.1.

Google Gemma including Gemma-7B and Gemma-2B. A family of lightweight open models from Google, built from the same research and technology used to create the Gemini models.

01.AI Yi including Yi-1.5-34B-Chat and Yi-34B.

InternLM including InternMath-20B-Plus and InternMath-7B-Plus.

Other open-source LLMs including Starling-7B, c4ai-command-r-v01, OpenChat-3.5-8B, Zephyr-7B-Beta, Neo-7B-Instruct and Llemma-7B.

A.4 Computational Resources

Our experiments were conducted on NVIDIA A100 GPUs. To speed up inference, we used the vLLM inference library. For instance, evaluating a language model with 7 billion parameters on the MMLU-Pro dataset takes approximately 20-30 minutes. Additionally, for closed models that necessitate API calls, our evaluations on our custom dataset involved processing approximately 20M input tokens and 5M output tokens.

A.5 Dataset Licensing

The MMLU-Pro dataset comprises data from four distinct sources, each governed by its own licensing terms:

MMLU dataset: Licensed under the MIT License. This license allows for free usage, modification, and distribution, provided the original license and copyright notice are included.

TheoremQA: Licensed under the MIT License.

SciBench: Licensed under the MIT License.

Additionally, the MMLU-Pro dataset itself is licensed under the MIT License, ensuring broad usability and distribution rights under similar conditions.

A.6 Error Analysis Cases

Reasoning Errors (39%)

Even though the model may recall correct knowledge, it often struggles to logically process steps toward the right answer. This issue often stems from logical inconsistencies in the output, possibly caused by the model’s reliance on patterns from its training data rather than true understanding. For instance, as shown in Table 10, when calculating the pressure difference inside and outside a container, the model erroneously added the internal and external pressures together.

Lack of Specific Knowledge (35%)

A fundamental root cause of domain-specific errors in the GPT-4o model is the lack of specialized knowledge. For example, as shown in Table 8, the model lacks financial knowledge: the cash balance for interest calculation is determined by subtracting the down payment from the product price. Because the model used $1650 instead of $1600 as the principal, the result was erroneous. Similarly, as in Table 9, the model did not correctly understand that when using a lens in different media, the ratio of the refractive indices of the lens material and the medium should be considered, rather than directly subtracting their numerical values. This lack of understanding of how to properly apply optical principles led to a misconception.

Calculation Errors (12%)

We distinguish calculation errors from reasoning errors to aid model developers, as many AI systems can utilize calculators or Python for complex, multi-step calculations. For example, as in Table 11, the model had the correct formula for calculating the molecular weight of a compound but made an error in summing the values, leading to an incorrect final answer.

Other Errors

The remaining errors include No Selection Made (5%), Question Understanding Errors (4%), Generation Issues (2%), Annotation Errors (2%), and Answer Extraction Errors (1%). “No Selection Made” refers to instances where the model responded but did not select a final option as dictated by the prompt and few-shot format. “Question Understanding Errors” occur when the model incorrectly interprets the question or options, such as in Table 13, where the model incorrectly focused on Singer’s broader views on equality for all beings rather than strictly on the equality principle as it applies to humans. The correct answer (E), focusing solely on human beings, was overlooked in favor of a broader interpretation (H). “Generation Issues” refer to anomalies in the generation process, such as in Table 12, where the model repeatedly generated one sentence until it exceeded the maximum length and terminated. “Annotation Errors” occur when the ground truth answer is incorrect. “Answer Extraction Errors” refer to cases where the extraction script fails to recover the chosen option because the response uses an unusual format.