CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin

Introduction

Large language models (LLMs) have driven remarkable advancements in natural language processing and artificial intelligence, revolutionizing the field (Zhang et al., 2022; Scao et al., 2022; Zeng et al., 2023; Touvron et al., 2023a; OpenAI, 2023; Wu et al., 2023; Taori et al., 2023; Li et al., 2023a). However, assessing the knowledge and reasoning abilities of these models has become increasingly challenging, especially with the proliferation of LLMs that generate fluent and plausible responses.

To this end, researchers have created various benchmarks intended to evaluate different model capabilities (Wang et al., 2019b; a; Lin et al., 2022; Zellers et al., 2019; Hendrycks et al., 2021b; Chen et al., 2021). Specifically, Hendrycks et al. (2021a) proposed MMLU, a benchmark that encompasses various tasks ranging from elementary mathematics and computer science to management and law, which can be used to comprehensively measure LLM capabilities in terms of the knowledge embedded in them. Due to its multiple-choice question format, which facilitates easy evaluation, and the breadth of subject areas it encompasses, it has become widely used as a fundamental assessment tool of the knowledge encoded by LLMs. However, this benchmark is in English, which limits its ability to assess LLMs in other languages. Although some researchers (OpenAI, 2023) have attempted to automatically translate it to evaluate LLMs in other languages, the inherent bias towards Western (and specifically US) culture in the dataset renders it unsuitable and even inappropriate for assessing LLMs across diverse cultures and languages.

In this paper, we propose CMMLU (Figure 1), a comprehensive Chinese assessment suite specifically designed to evaluate the advanced knowledge and reasoning abilities of LLMs in a Chinese linguistic and cultural context. CMMLU covers a wide range of subjects, comprising 67 topics from elementary to advanced professional levels. It includes subjects that require computational expertise, such as physics and mathematics, as well as disciplines within the humanities and social sciences. Many of these tasks are not easily translatable from other languages due to their specific contextual nuances and wording. Furthermore, numerous tasks within CMMLU have answers specific to China, which may not be universally applicable or considered correct in other regions or languages.

We assess GPT4, ChatGPT, and more than 20 advanced open-source multilingual and Chinese LLMs on CMMLU. The results reveal that the majority of these models struggle to achieve an accuracy score of 60%, relative to random accuracy of 25%. Notably, GPT4 achieves an average accuracy of 71%. These findings highlight the considerable room for improvement in LLMs in terms of Chinese knowledge and language understanding.

To gain a deeper understanding of the proficiency of the models in handling Chinese knowledge, we conduct a comprehensive analysis. We first focus on examining model performance across various subjects and find that all models exhibit uneven performance across different subjects, with comparatively higher scores in humanities and social sciences, but lower scores in China-specific and STEM subjects.

Furthermore, through extensive experiments, we find that: (1) most existing models do not benefit from chain-of-thought prompts in CMMLU; (2) few-shot examples help foundation models in the comprehension of tasks and enhance their reasoning abilities but do not help models that have undergone supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF); (3) LLMs perform worse on questions with negation words compared to those without negation words, but recently-released models mitigate this disparity either through better pre-training data or fine-tuning; and (4) questions with sub-options (Section 4.2) are difficult for all existing LLMs, with even GPT4 dropping 20% in accuracy over such questions.

Related Work

Benchmarking plays a crucial role in measuring AI development, particularly in the domain of LLMs. While benchmarks such as GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) have played an important role in tracking progress in natural language understanding (NLU) tasks, they primarily focus on specific language skills. With an increasing move to generative models which are highly adept at generating fluent outputs, the value of these benchmarks has diminished, and new datasets have been proposed to evaluate LLM abilities over more general tasks, such as reading comprehension (Rajpurkar et al., 2018; Kwiatkowski et al., 2019; Li et al., 2022), summarization (Hermann et al., 2015), commonsense reasoning (Clark et al., 2018; Talmor et al., 2019; Sakaguchi et al., 2020), mathematical reasoning (Hendrycks et al., 2021b; Cobbe et al., 2021), and code generation (Chen et al., 2021; Austin et al., 2021).

In order to comprehensively assess the capabilities of LLMs, some benchmarks have incorporated massive multi-task evaluations into their frameworks (Hendrycks et al., 2021a; Liang et al., 2022; Srivastava et al., 2023). An example is MMLU (Hendrycks et al., 2021a), which includes multiple domains and tasks based on real-world exams. It has become very popular for LLM evaluation due to its standardized and simplified format, comprehensive nature, and real-world relevance. However, all aforementioned benchmarks are primarily focused on English.

Given that Chinese is the language with the largest number of speakers worldwide, several benchmarks have been proposed for Chinese LLM evaluation. Following in the footsteps of GLUE and SuperGLUE, Xu et al. (2020) introduced CLUE, a benchmark for Chinese NLU that is widely used today. They also recently proposed SuperCLUE (Xu et al., 2023), which specifically focuses on LLMs. Recently, several Chinese benchmarks have emerged that follow the MMLU style, all of which are concurrent work with ours. In detail, Zhang & Li (2023) proposed ACLUE, focusing on ancient Chinese language understanding. Zeng (2023) presented MMCU, which covers four major domains (medicine, law, psychology, and education), with a particular focus on medicine and education. AGIEval (Zhong et al., 2023) provides problems from both Chinese and English standardized exams. C-Eval (Huang et al., 2023) and M3KE (Liu et al., 2023) collect more than 50 tasks from standard exams in China, while C-Eval covers various professions, and M3KE focuses on education examinations.

Compared to these benchmarks, CMMLU has several distinct features. Firstly, it includes more than 10 subjects that are not typically found in standard exams but are relevant to daily life, such as Chinese food culture, and Chinese driving rules. Secondly, it covers not only China-specific knowledge but also general world knowledge, such as world religion, world history, and global facts. Lastly, we have made our data completely public, enabling the community to evaluate their models freely and conveniently. A detailed comparison between CMMLU and other concurrent benchmarks is provided in Appendix A.

CMMLU

We created an extensive multitask test for Mandarin Chinese, which covers diverse areas of knowledge, including the humanities, social sciences, STEM (science, technology, engineering, and mathematics), and other areas that are important in daily life. It includes common test questions in subjects like mathematics, physics, and chemistry with answers that are not language- or region-specific, as well as several tasks that are very region-specific, such as Chinese driving rules, Chinese food culture, and Chinese teacher qualifications. The questions in these tasks involve a great deal of China-related knowledge and can test a model’s understanding of and adaptability to Chinese. In addition, CMMLU contains tasks that can only be expressed in Chinese, such as ancient Chinese language and Chinese literature. The terms and concepts involved in these tasks rely heavily on Chinese expression and are almost impossible to obtain through translation. The full list of subjects, the concepts tested in each subject, the number of questions, and the statistics of question and answer lengths are provided in Appendix B.

Data collection

We hired four annotators with undergraduate or higher education levels to manually collect the questions and answers from freely available resources, at a rate of 50 CNY per hour. To prevent our questions from appearing in the training set of LLMs, we invested specific effort in identifying non-publicly available materials, mock exam questions, and questions from quiz shows. More than 80% of our data was crawled from PDFs (after OCR), which further reduces the possibility of it occurring in LLM training data. The entire collection process took around 250 hours.

Format

Each question in the dataset is a multiple-choice question with 4 choices, only one of which is correct; see Figure 2 for an example. The questions are expressed as fill-in-the-blank (by choosing the correct option) or direct-answer questions. For chemical formulae and mathematical expressions, we use a 50:50 mixture of LaTeX and plain text, where plain text was only allowed if an expression is commonly used and not prone to ambiguity (as judged by the annotators). For instance, the chemical expression for water can be written in plain text as H2O, or in LaTeX format as $H_{2}O$.

Quality Check

To further check data quality, we randomly sampled 5% of the questions with answers for each subject and conducted detailed verification through online resources. We estimate that there is around 2% noise in the data, in terms of the correct answer not being present or being incorrectly labeled. Based on the results in Section 4 that most models struggle to achieve an average accuracy of 60%, we believe such an error rate does not compromise the overall results.
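The per-subject sampling step can be sketched as follows. This is a minimal illustration, assuming questions are stored as dictionaries with a hypothetical `subject` key; the function name, the fixed seed, and rounding up within each subject are assumptions made for illustration, not details of the actual verification process.

```python
import math
import random
from collections import defaultdict


def sample_for_verification(questions, percent=5, seed=0):
    """Draw roughly `percent`% of the questions from every subject."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    sample = {}
    for subject, items in by_subject.items():
        # Round up so that small subjects still contribute at least one question.
        k = max(1, math.ceil(len(items) * percent / 100))
        sample[subject] = rng.sample(items, k)
    return sample


# Toy data: two hypothetical subjects with 200 and 105 questions.
toy = [{"subject": "physics", "id": i} for i in range(200)]
toy += [{"subject": "chinese_food_culture", "id": i} for i in range(105)]
picked = sample_for_verification(toy)
```

Sampling within each subject, rather than over the pooled dataset, ensures that every subject is represented in the verified subset.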

Statistics

CMMLU contains 11,528 questions across 67 subjects. Each subject has at least 105 questions, which we split into a few-shot development set with 5 questions, and a test set with more than 100 questions. In terms of task types, CMMLU comprises 17 STEM tasks, 13 humanities tasks, 22 social science tasks, and 15 other tasks. Among these, 16 tasks are China-specific, which means they either do not exist in other countries or regions, or their answers may be different in other places. We provide an example for each subject type in Appendix C.

Experiments

To provide an overview of existing LLMs on language understanding within the context of Chinese, we evaluate two commercial LLMs and more than 20 open-source LLMs of different sizes, language orientations, and stages (i.e., either foundation model or SFT/RLHF model). We analyze their performance and investigate several factors that could affect the performance of LLMs.

Our goal is to assess LLM performance on CMMLU, which contains multiple-choice questions with one correct answer each. There are several strategies for performing multiple-choice question answering. In this paper, for commercial models whose weights we cannot access (i.e., GPT4 and ChatGPT), we input the question with all candidate choices, allow the model to generate the output, and use a series of regular expressions (regex) to match the model’s prediction. We call this the free generation strategy. For open-source models, we follow Hendrycks et al. (2021a): we input the question and choices and prompt the model for the answer key. We then obtain the logits of the next predicted token, compare the probabilities of the four tokens ‘A’, ‘B’, ‘C’, and ‘D’, and select the token with the highest probability as the model’s choice. We call this the next token prediction strategy. Besides these two strategies, a third option is to select the answer that yields the lowest perplexity when concatenated with the question.
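A minimal sketch of the next token prediction step is given below, assuming the logits for the four answer tokens have already been read off the model's next-token distribution. The `answer_logits` dictionary and the function name are hypothetical; how the logits are obtained depends on the model library in use.

```python
import math


def pick_answer(answer_logits):
    """Return the answer key with the highest probability among A-D.

    `answer_logits` maps each answer token ('A', 'B', 'C', 'D') to the
    logit the model assigns to it as the next token.
    """
    keys = ["A", "B", "C", "D"]
    # Softmax restricted to the four answer tokens (shifted for stability).
    top = max(answer_logits[k] for k in keys)
    exps = {k: math.exp(answer_logits[k] - top) for k in keys}
    total = sum(exps.values())
    probs = {k: exps[k] / total for k in keys}
    choice = max(keys, key=lambda k: probs[k])
    return choice, probs


# Made-up logits for illustration only.
choice, probs = pick_answer({"A": 1.2, "B": 3.4, "C": 0.3, "D": -0.5})
```

Because softmax is monotonic, comparing the raw logits would select the same key; the normalized probabilities are only needed when one wants to inspect how confident the model is among the four options.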

We compare the different strategies in Appendix G and find that next token prediction is the most efficient. Therefore, for the majority of the remaining paper, we report next token prediction results. However, for some analyses in Section 4.2, we use the free generation strategy. The regexes are designed based on observation of ChatGPT and ChatGLM responses. Details of the regexes and the matching algorithm are provided in Appendix H.
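To illustrate the idea of matching an answer key in free-form output, a small sketch follows. The two patterns and the function name are illustrative assumptions only; they are not the regexes used in the evaluation, which are described in Appendix H.

```python
import re

# Illustrative patterns only, tried in order; the first match wins.
PATTERNS = [
    re.compile(r"答案(?:是|为|选)?[:：]?\s*([ABCD])"),  # e.g. "答案是：B"
    re.compile(r"^\s*([ABCD])(?![A-Za-z])"),             # response starts with a key
]


def extract_choice(response):
    """Return the matched answer key, or None if nothing matches."""
    for pattern in PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None
```

A response for which `extract_choice` returns `None` corresponds to an unmatched response, which would be counted as incorrect under this sketch.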

Prompt

We introduce each question with the phrase “以下是关于[主题]的单项选择题，请直接给出正确答案的选项 (Here are some multiple-choice questions about [subject], please provide the correct answer choice directly)”, and evaluate models in both zero-shot and few-shot settings. For zero-shot evaluation, we present a question with choices directly after the prompt. For few-shot evaluation, we provide up to 5 demonstration examples with answers before the question. The prompt concludes with the phrase “答案是：(Answer:)”, as shown in the example in Figure 2. If the context exceeds the model’s maximum length with few-shot examples, we dynamically remove the longest examples by counting sub-tokens.
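The prompt assembly described above can be sketched as follows. The helper names, the dictionary layout of an example, and the use of character count as a stand-in for sub-token count are assumptions for illustration; a real run would count sub-tokens with the model's own tokenizer.

```python
HEADER = "以下是关于{subject}的单项选择题，请直接给出正确答案的选项"
ANSWER_PREFIX = "答案是："


def format_example(example, with_answer):
    """Render one question with its four choices and the answer prefix."""
    lines = [example["question"]]
    for key in "ABCD":
        lines.append(f"{key}. {example[key]}")
    lines.append(ANSWER_PREFIX + (example["answer"] if with_answer else ""))
    return "\n".join(lines)


def build_prompt(subject, question, demos, max_len, count_tokens=len):
    """Build a few-shot prompt, dropping the longest demos if it is too long.

    `count_tokens` defaults to character count as a placeholder for a
    tokenizer-based sub-token count.
    """
    demos = list(demos[:5])  # at most 5 demonstration examples
    while True:
        parts = [HEADER.format(subject=subject)]
        parts += [format_example(d, with_answer=True) for d in demos]
        parts.append(format_example(question, with_answer=False))
        prompt = "\n\n".join(parts)
        if count_tokens(prompt) <= max_len or not demos:
            return prompt
        # Remove the longest remaining demonstration and try again.
        longest = max(demos, key=lambda d: count_tokens(format_example(d, True)))
        demos.remove(longest)
```

Passing an empty `demos` list yields the zero-shot prompt, so the same function covers both evaluation settings.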

Models

We assessed more than 20 models of different sizes from 12 model families. For commercial models, we evaluated ChatGPT and GPT4, which are two of the strongest LLMs. (The evaluation was conducted in May 2023 for ChatGPT and July 2023 for GPT4.) For open-source models, we selected (1) English and multilingual-oriented models: BLOOM-7.1B (Scao et al., 2022), BLOOMZ-7.1B (Muennighoff et al., 2022), LLaMA-7B/13B/30B/65B (Touvron et al., 2023a), Bactrian-X-LLaMA (BX ${}_{\text{LLaMA}}$ )-7B/13B (Li et al., 2023a), Falcon-7B/40B (Almazrouei et al., 2023), LLaMA2-7B/13B/70B (Touvron et al., 2023b), Chinese-LLaMA (ZH ${}_{\text{LLaMA}}$ )-7B/13B (Cui et al., 2023); (2) Chinese-oriented models: Baichuan-7B/13B and Baichuan2-7B/13B (Yang et al., 2023), ChatGLM-6B and ChatGLM2-6B (Zeng et al., 2023), Xverse-13B (https://github.com/xverse-ai/XVERSE-13B), InternLM-7B/20B (Team, 2023), MOSS-SFT-16B (OpenLMLab, 2023), Chinese-GLM-10B (Du et al., 2022), BatGPT-15B (Li et al., 2023b). The details about these models are provided in Appendix F.

Main Results

Table 1 shows the performance of all models under the five-shot setting. Since the zero-shot results are similar to the five-shot results, we provide them in Appendix J.1.

From the first block of the table, we observe the following: (1) LLaMA2-70B is the best open-source multilingual model, achieving an average accuracy of 53.21%, coming close to the ChatGPT performance at 55.51%. However, there is still a significant gap between LLaMA2-70B and GPT4 (70.95%); (2) 7B pre-trained multilingual models (except LLaMA2-7B) achieve nearly random results of 25% (since their scores are lower than 30%, they are not displayed in the table); (3) For those multilingual models, fine-tuning using Chinese resources consistently improves their performance (BX ${}_{\text{LLaMA}}$ and ZH ${}_{\text{LLaMA}}$ vs. LLaMA, BLOOMZ vs. BLOOM).

From the second block, we find that: (1) Among the Chinese LLMs, Baichuan2-13B demonstrates the best overall performance (beats ChatGPT) with only 13B parameters. We attribute it to the high quality of the training data; (2) Several Chinese LLMs achieve competitive results compared to LLaMA2-70B with less than 20B parameters. This demonstrates that when focusing on a single language, high-quality monolingual (or bilingual) training data can empower small models (7B or 13B) with good capability compared to multilingual training data. An overall observation is that models from the same family always improve as the model size increases.

By subject

From the perspective of subject type, all models exhibit relatively high performance in humanities, social sciences, and other subjects, medium performance in China-specific subjects, and low performance in STEM subjects. We attribute this to the nature of each subject type and the capability of LLMs: (a) humanities and social sciences rely more on memorization, which is relatively easy for LLMs; (b) China-specific topics encompass information that is either absent from the training data or inconsistent in multilingual training data; (c) STEM topics usually require complex reasoning, which has been proven to be difficult for existing LLMs. As expected, Chinese LLMs exhibit smaller gaps between China-specific subjects and other categories.

We compare the performance of the best-performing Chinese model, Baichuan2-13B, with the best-performing multilingual model, GPT4, for each subject. We categorize the subjects and present the results in Figure 3. The numerical results can be found in Appendix J.2.

From the figure, we note that the models’ performance appears to be unbalanced, excelling in certain subjects but struggling in others. Specifically, ancient Chinese and college actuarial science are the most challenging subjects for both Baichuan2 and GPT4, yielding slightly better results than random, while the legal and moral basis is one of the easiest subjects for both models. When comparing the two models, we find that for most subjects, GPT4 outperforms Baichuan2 by a significant margin, while Baichuan2 surpasses GPT4 in 8 subjects: 6 of these are China-specific subjects, and the other 2 (arts and philosophy) contain a large amount of Chinese elements. (Because these subjects contain a mixture of Chinese and global elements, we did not categorize them as China-specific.) These findings suggest that including region- and culture-specific data in training is essential to accommodate users with different language backgrounds.

Analysis

In order to gain a comprehensive understanding of the LLM’s performance on CMMLU, we explored three factors that may enhance the model’s performance and two factors that could potentially diminish its performance. Specifically, we investigated whether the following factors can improve the model’s performance: (1) utilizing chain-of-thought prompts, (2) increasing the number of input examples, and (3) employing larger-sized models within the same family. Conversely, we explored whether the following factors make the task more challenging for LLMs: (4) questions containing negation words, and (5) questions with sub-options within them. For different analyses, we choose different models in different stages according to the relevance and result availability.

To investigate the potential benefits of chain-of-thought (COT) prompts in generating better results, we modified the prompt from “请直接给出正确答案的选项 (please provide the correct answer choice directly)” to “逐步分析并选出正确答案 (Analyze step by step and select the correct answer).” Since our dataset does not contain answer analysis, we adopt the zero-shot setting for this experiment. The results are presented in Table 2; the breakdown of all sub-categories is provided in Appendix J.3.

From the table, we see that for most models, the use of a chain-of-thought prompt does not lead to improvement. ChatGPT and ChatGLM2 improve slightly on STEM subjects with the COT prompt, although their overall accuracy still decreases. We manually checked the outputs and found that models either fail to explicitly generate the answer option after the analysis (instead generating the content of the answer), or generate complex context to wrap the choice, which leads to regex matching failures. An obvious case is Xverse: compared to the direct-answer prompt, the COT prompt results in a 19.77% increase in responses that cannot be matched by our regex.

Do few-shot examples help?

Many studies have shown that LLMs can benefit from the in-context examples, while some other studies have reported opposite observations (Liu et al., 2023; Zeng, 2023). In this context, we use CMMLU as a case study to investigate in-context learning (ICL) in LLM evaluation on multiple-choice questions.

As illustrated in Figure 4, we present the overall accuracy of models utilizing varying numbers of in-context examples. There is a clear discrepancy that, when provided with only one example, foundation models exhibit an overall boost, whereas fine-tuned models experience a decline in performance. We conjecture this is because foundation models are primarily optimized for natural text and may struggle to follow instructions. Providing examples helps these models better understand the task. In contrast, SFT/RLHF models are optimized to follow instructions, and the introduction of examples introduces a certain degree of mismatch with the data distribution during their fine-tuning, thus leading to a decline in performance.

When provided with more examples, while there may be fluctuations, the overall trend for foundation models indicates an improvement in performance with an increase in the number of examples. However, for fine-tuned models, there is no consistent trend.

Impact of model size on performance

We explored how the model’s performance improves with an increase in the number of parameters. To this end, we examine several model families and present their five-shot accuracy in relation to model size in Figure 5.

From the figure, we see that both LLaMA and LLaMA2 gain a 5-point increase in scores as the model size changes from 7B to 13B, while Baichuan shows a remarkable 10-point improvement, although Baichuan-13B was also trained on 0.2T more tokens than Baichuan-7B. We believe that having 7 billion parameters limits the model’s capability in numerous tasks, while doubling the parameters to about 13 billion significantly enhances certain capabilities and improves memorization. As the model size continues to increase (as seen with LLaMA and LLaMA2), the efficiency of performance improvement decreases, with a 5x increase in model size resulting in a 7% improvement for LLaMA and a 15% improvement for LLaMA2. Comparing LLaMA2 and Baichuan, it becomes evident that a smaller model equipped with higher-quality monolingual training data can not only match but also surpass a larger model with insufficient monolingual training data in terms of monolingual performance.

Are questions with negation more challenging?

Previous research has pointed out that language models may encounter challenges with negation expressions (Kassner & Schütze, 2020; Hosseini et al., 2021). To investigate whether this issue persists in the context of the Chinese language and LLMs, we first employ string matching to classify the test set into questions with and without negation words. We then compare the performance of different models on these two subsets. Note that according to our string matching results, approximately 10.7% of the data contains negation expressions.
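The string-matching split can be sketched as below. The list of negation cues is a hypothetical example, since the actual word list is not given in this section, and the function names are likewise illustrative.

```python
# Hypothetical list of negation cues; the actual list is not given here.
NEGATION_WORDS = ("不", "没", "非", "无", "未", "否")


def has_negation(question):
    """True if the question text contains any negation cue."""
    return any(word in question for word in NEGATION_WORDS)


def split_by_negation(questions):
    """Partition question strings into (with_negation, without_negation)."""
    with_neg = [q for q in questions if has_negation(q)]
    without_neg = [q for q in questions if not has_negation(q)]
    return with_neg, without_neg


with_neg, without_neg = split_by_negation(
    ["下列说法不正确的是", "中国的首都是哪里"]
)
```

Plain substring matching is cheap but over-triggers on words that merely contain a cue character (for example, 非洲, "Africa"), so a split obtained this way should be read as approximate.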

Table 3 presents results for 4 model families. We find that most models (with the exception of GPT4 and ChatGLM2) perform less effectively on questions containing negation words than on those without, which aligns with the findings of previous studies and highlights this common limitation of large language models.

Interestingly, developers have successfully mitigated this problem in different stages of development. For example, LLaMA2 demonstrates that SFT/RLHF can enhance a model’s ability to process negation: the accuracy gap between questions with and without negation decreases by about 5% after applying SFT/RLHF. Baichuan shows that better pre-training can also effectively alleviate this issue. Specifically, Baichuan2 reduces the gap to 1-2%, compared to Baichuan’s 8-10%, by using improved pre-training data. ChatGLM2 shows almost the same performance when answering questions with and without negation. We conjecture that the developers noticed the negation problem and found that, compared to complex reasoning ability, enhancing negation processing is relatively easy.

Are questions with sub-options more challenging?

There is a typical question type in all kinds of Chinese exams called sub-option questions. These questions include a main statement along with multiple sub-options, and inquire about the count, order, or selection of the sub-options, which requires the model to have deeper reasoning and inference skills (see example in Figure 6). The sub-options in CMMLU can appear in different formats, such as “a, b, c…; ①, ②, ③…”, and account for about 10.8% of the dataset. We classified the data into two subsets based on sub-option presence, and put the evaluation results in Table 4. We observed that all these models performed worse on sub-option questions than on those without sub-options, with a decline ranging from 10% to 20%. Intuitively, the COT prompt should alleviate such a problem by guiding the model to analyze the sub-options one by one. However, we observe that ChatGLM2 and BatGPT benefit from the COT prompt while Baichuan does not.
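A rough way to detect sub-option questions is sketched below. It is a heuristic built on the two formats mentioned above (circled numerals and lowercase letters); the exact marker patterns, the threshold of two distinct markers, and the function name are assumptions, not the classification rule actually used.

```python
import re

# Circled numerals ① to ⑩.
CIRCLED = re.compile(r"[①-⑩]")
# Lowercase item markers such as "a." "b、" "c)"; an assumption about formatting.
LETTERED = re.compile(r"(?<![A-Za-z])([a-f])[.、)）]")


def has_sub_options(question):
    """Heuristically detect enumerated sub-options inside a question."""
    circled = set(CIRCLED.findall(question))
    lettered = set(LETTERED.findall(question))
    # Require at least two distinct markers of the same style.
    return len(circled) >= 2 or len(lettered) >= 2
```

Requiring two distinct markers avoids flagging a question that happens to mention a single enumerated item.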

Conclusion

We introduce CMMLU, a groundbreaking benchmark designed to assess the multi-task language understanding capabilities in Chinese. Our experimental findings reveal substantial opportunities for improvement within existing large language models. Through extensive analysis, we identify several factors that impact model performance and propose actionable directions for enhancing LLMs. We are confident that our benchmark dataset and analytical insights will empower researchers to effectively evaluate and design Chinese LLMs.

References

Appendix A Comparison to concurrent benchmarks

C-Eval (Huang et al., 2023) and M3KE (Liu et al., 2023) are two similar benchmarks concurrent with our work. We compare the task distribution of these benchmarks in Table 5, and demonstrate that CMMLU contains more culture-related and region-related tasks. While there are differences in task distribution, we acknowledge that these datasets exhibit similarities in the task types and can, therefore, be jointly used as assessment criteria for evaluating the Chinese language capabilities of large models.

We further assess the overlap between CMMLU and both of these benchmarks. For this purpose, we first sort four choices for each question to eliminate the influence of choice order. Subsequently, we concatenate the question string with the sorted choice strings. Then, we remove all punctuation marks, including underscores and brackets, from the resulting strings. The final overlap, computed using exact string matching, yields a total of 74 for CEval and 158 for M3KE. This overlap accounts for approximately 1% of our dataset.
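The overlap procedure can be sketched as follows, under the assumption that each item is a `(question, choices)` pair. The function names are illustrative, and treating every character in a Unicode punctuation category as punctuation is one possible reading of "all punctuation marks".

```python
import unicodedata


def normalize(question, choices):
    """Sort choices, concatenate with the question, and strip punctuation."""
    text = question + "".join(sorted(choices))
    # Drop every character in a Unicode punctuation category (P*), which
    # covers ASCII and full-width marks, underscores, and brackets.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def count_overlap(ours, theirs):
    """Count items in `ours` whose normalized form also appears in `theirs`."""
    seen = {normalize(q, c) for q, c in theirs}
    return sum(normalize(q, c) in seen for q, c in ours)


# Same question with a different blank style and choice order.
ours = [("水的化学式是____。", ["H2O", "CO2", "O2", "NaCl"])]
theirs = [("水的化学式是（）", ["NaCl", "O2", "CO2", "H2O"])]
n_overlap = count_overlap(ours, theirs)
```

Exact matching after normalization is conservative: it catches reordered choices and punctuation differences, but not paraphrased questions.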

Appendix B CMMLU Subjects

Table 6 lists all subjects of CMMLU. The table also provides details for each subject test, including the concepts covered, the supercategory to which each subject belongs, and the total number of questions.

Table 7 presents the breakdown of statistical results of the CMMLU test set for each supercategory, including the number of tasks, number of questions, average question counts for each subject, maximum and minimum counts of questions, and average token length for question and choices. Meanwhile, Figure 7 provides a visualization of the token lengths of questions and answers for each subject.

Appendix C CMMLU Examples

Table 8 provides examples from CMMLU in each category.

Appendix D CMMLU Difficulty Distribution

We analyze the difficulty distribution of CMMLU from two perspectives. Firstly, the CMMLU benchmark encompasses a diverse range of difficulty levels: 5 subjects at primary school level, 10 at middle/high school level, 23 at college level, and 29 at professional level, ensuring a comprehensive difficulty spectrum.

Secondly, to estimate the difficulty distribution within each subject, we evaluated the top 20 models from our main results table. Each question was treated as a data point, and we recorded the number of models correctly answering each question. This approach allowed us to map out the difficulty distribution across subjects.
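The per-question counting can be sketched as below, assuming each model's results are available as a list of booleans in a shared question order. The function name and input layout are hypothetical.

```python
from collections import Counter


def difficulty_histogram(correct_by_model):
    """Count, for each question, how many models answered it correctly.

    `correct_by_model` is a list with one entry per model; each entry is a
    list of booleans, one per question, in the same question order.
    Returns (per_question_counts, histogram), where histogram[k] is the
    number of questions solved by exactly k models.
    """
    per_question = [sum(flags) for flags in zip(*correct_by_model)]
    histogram = Counter(per_question)
    return per_question, histogram


# Toy example with 3 models and 4 questions.
results = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, False, False],
]
per_question, histogram = difficulty_histogram(results)
```

Questions solved by few models sit at the hard end of the distribution, and questions solved by most models sit at the easy end.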

Figure 8 reveals that the majority of subjects exhibit a single peak in their difficulty distribution. This single-peak pattern indicates a uniform level of difficulty within these subjects, suggesting a consistent challenge for models across the range of questions. However, certain subjects, such as machine learning and professional law, display dual peaks. This dual-peak pattern signifies a notable presence of both relatively easy and challenging questions, with fewer intermediate-level questions. Despite the presence of two peaks, the transition between these peaks is gradual rather than abrupt, indicating a smooth progression in difficulty levels within these subjects.

Appendix E Emergent Ability shown in CMMLU subjects

We assessed the concept of emergent ability using the LLaMA-2 model family. Figure 9 illustrates the performance of the LLaMA-2 pre-trained models (7B, 13B, and 70B) across various subjects. The figure indicates that, for most subjects, there is a correlation between increased model size and enhanced performance. Notably, in subjects like college education, elementary commonsense, human sexuality, and public relations, the performance of the 7B and 13B models is comparable, while the 70B model shows a significant improvement.

However, the LLaMA-2-70B model has been trained on a more extensive dataset than its 7B and 13B counterparts, which likely includes more comprehensive coverage of these specific domains, so we cannot simply attribute the improvement to emergent ability. In addition, these tasks mostly belong to the social sciences rather than STEM (which might need intensive reasoning). Given these complexities, we leave the exploration of emergent ability to future research.

Appendix F Models being Evaluated

ChatGPT and GPT4

are GPT models developed by OpenAI and fine-tuned using reinforcement learning from human feedback (RLHF). As commercial products, specific details about the model size, training data, and training process remain undisclosed.

Falcon

is a decoder-only model created by TII and trained on 1,000B tokens of RefinedWeb (Penedo et al., 2023) data. Due to the high quality of its training data, Falcon-40B performs competitively with LLaMA-65B on various benchmarks.

LLaMA

is an auto-regressive language model proposed by Meta. It incorporates several structural improvements over the vanilla transformer and is trained on a mixture of publicly available data sources. LLaMA has demonstrated performance that is comparable to or even superior to models that are ten times its size.

LLaMA2

is an upgraded version of LLaMA developed by Meta. The preprocessing stage involves more robust data cleaning and updating data mixes, and the model employs a 40% increase in the total token count during training. Additionally, it up-samples the most factual sources to enhance knowledge and reduce hallucinations. Grouped-query attention (GQA) has been employed to reduce GPU memory usage.

BLOOM

is a multi-lingual targeted LLM developed by BigScience. It is trained on 46 natural languages and 13 programming languages. The largest BLOOM model consists of 176B parameters, but deploying such a large model can be challenging. In this paper, we evaluate the performance of the 7B BLOOM model.

BLOOMZ

is derived from BLOOM through fine-tuning on a cross-lingual task mixture (xP3), which is an instruction-following dataset. BLOOMZ exhibits competitive performance with models that have a larger number of parameters across various non-generation tasks.

Bactrian-X

is a series of LLMs (LLaMA, BLOOM, mT5) proposed by MBZUAI. These models are fine-tuned on a multilingual instruction-following dataset that encompasses 52 languages. All the fine-tuned Bactrian-X models demonstrate performance improvements compared to their corresponding base models in multilingual generation settings.

ChatGLM and ChatGLM2

are bidirectional dense models pre-trained using the General Language Model (GLM) algorithm developed by Tsinghua University. They support bilingual (Chinese and English) language processing. ChatGLM is a version of GLM that is enhanced with supervised fine-tuning, feedback bootstrap, and reinforcement learning with human feedback, specifically optimized for Chinese question answering (QA) and dialogue tasks. In this paper, we evaluate the performance of 10B and 6B models of GLM.

BatGPT

jointly developed by Wuhan University and Shanghai Jiaotong University, is a bilingual (Chinese and English) and bidirectional language model. BatGPT is initialized with a novel parameter expansion method, which enables it to absorb knowledge from the pre-training of other LLMs. With a bidirectional autoregressive architecture and further enhancement through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human and AI Feedback (RLHAF), BatGPT is able to handle long-range, multi-turn question-answering tasks effectively and alleviate concerns regarding memory limitations. The evaluation of the 15B version is presented in this work.

MOSS-SFT

is an open-source Chinese language model proposed by Fudan University. It is comparable to ChatGPT in terms of training scale and alignment techniques. MOSS-SFT is initialized with CodeGen and further pre-trained on 100B Chinese tokens and 20B English tokens. The Supervised Fine-Tuned (SFT) version of MOSS-SFT enables the model to follow instructions in multi-turn dialogues.

Chinese-LLaMA

is part of the Chinese-LLaMA-Alpaca project, an open-source initiative that extends the vocabulary of LLaMA and Alpaca to include more Chinese tokens. The models are then further trained on a larger Chinese corpus to enhance their performance.

Baichuan and Baichuan2

are large language model families publicly released by Baichuan Intelligent Technology. Both include versions with 7B and 13B parameters, as well as base and chat variants. Baichuan models are trained on high-quality corpora totaling 1.4 trillion tokens, which surpasses LLaMA-13B by 40%. The models offer support for both Chinese and English languages, and have an extensive context window of 4096. Baichuan2 series is trained on nearly twice the amount of high-quality data, resulting in additional performance enhancements.

Xverse

is a 13B multilingual large language model developed by Shenzhen Yuanxiang Technology. It is trained on 1.4 trillion tokens from diverse sources and supports an extensive 8k context length, efficient tokenization, and advanced training technologies, making it both versatile and efficient.

InternLM

is an open-source, lightweight training framework developed collaboratively by Shanghai AI Laboratory in partnership with researchers from various universities and companies. Its primary objective is to facilitate model pre-training without the need for extensive dependencies. Utilizing a unified codebase, it supports both large-scale cluster pre-training on thousands of GPUs and fine-tuning on a single GPU, achieving remarkable performance enhancements. Notably, InternLM achieves nearly 90% acceleration efficiency when training on 1024 GPUs. Based on the InternLM framework, a model family including 7B and 20B versions as well as base and chat variants was released.

Appendix G Strategies for Estimating Model Choices

In this section, we compare three strategies for multiple-choice question evaluation. We introduce the mechanism of each strategy, explain its rationale, and compare their efficiency, strengths, and weaknesses. For convenience, we assume the question is “textQ”, and the four choices are: “textA”, “textB”, “textC”, “textD”.

Strategy 1 – Next Token Prediction

The idea is to input the question along with all candidate choices and prompt the model with a direct answer text, such as “The answer is: ”. We then retrieve the probabilities of the next predicted token and compare these probabilities over the four choice indicator tokens, typically $[A,B,C,D]$. The token with the highest probability is treated as the model’s choice.

Question: textQ A. textA B. textB C. textC D. textD Answer:

Con: The model may not be inclined to generate one of the choice letters as its next token.

How to mitigate the cons: Provide few-shot examples with their expected answers.

Works or frameworks use this strategy: MMLU (Hendrycks et al., 2021a), HELM (Liang et al., 2022).
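As a minimal sketch of this strategy, the snippet below picks the choice letter with the highest next-token log-probability. The function name `pick_choice` and the dictionary `next_token_logprobs` are hypothetical; in practice the scores would be read from the model's output distribution at the position right after the answer prompt.

```python
import math

CHOICES = ["A", "B", "C", "D"]

def pick_choice(next_token_logprobs, choices=CHOICES):
    """Return the choice letter with the highest next-token log-probability.

    `next_token_logprobs` maps token strings to log-probabilities (a stand-in
    for real model output). Letters missing from it are scored as -inf.
    """
    return max(choices, key=lambda c: next_token_logprobs.get(c, -math.inf))

# Illustrative values only, not real model output. The token "The" is the
# most likely continuation here, but only the four letters are compared.
example = {"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9, "The": -0.2}
print(pick_choice(example))
```

Because only the four letters are compared, a choice is always returned, even when the model would prefer to continue with some other token.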

Strategy 2 – Perplexity Comparison

After combining the question with all candidate choices, we append each candidate answer in turn to the full question-and-candidates text. These concatenated texts are then input to the model for a forward pass, and we compute the perplexity of each. The sequence with the lowest perplexity is treated as the model’s choice.

Question: textQ A. textA B. textB C. textC D. textD Answer: A. textA

Question: textQ A. textA B. textB C. textC D. textD Answer: B. textB

Question: textQ A. textA B. textB C. textC D. textD Answer: C. textC

Question: textQ A. textA B. textB C. textC D. textD Answer: D. textD

Pro: Aligns with the objective of language model optimization as perplexity reflects the true probability of a model generating the given text.

Con: Low efficiency. It usually takes 4x the time (for a 4-choice question) compared to Next Token Prediction.

How to mitigate the cons: Efficient implementation that only computes the same prefix once.

Works or frameworks use this strategy: LM-Evaluation-Harness (Gao et al., 2021), OpenCompass (https://github.com/open-compass/opencompass).
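The perplexity comparison can be sketched as follows, assuming the per-token log-probabilities of each concatenated sequence are already available. The names `perplexity` and `pick_by_perplexity` and all numbers are hypothetical and purely illustrative.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the tokens of a sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_by_perplexity(logprobs_per_choice):
    """Return the choice whose full sequence has the lowest perplexity."""
    return min(logprobs_per_choice,
               key=lambda c: perplexity(logprobs_per_choice[c]))

# Illustrative numbers only. The first three entries stand for the shared
# question-and-candidates prefix, the remaining ones for the appended answer.
prefix = [-1.2, -0.8, -1.0]
sequences = {
    "A": prefix + [-2.0, -1.5],
    "B": prefix + [-0.3, -0.6],
    "C": prefix + [-2.5, -2.2],
    "D": prefix + [-1.9, -2.4],
}
print(pick_by_perplexity(sequences))
```

In this sketch the prefix scores are identical across the four sequences, which is why an implementation that computes the shared prefix only once can recover much of the lost efficiency.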

Strategy 3 – Free Generation

We input the question and candidate choices to the model and prompt it by asking for the correct choice. We allow the model to continue generating text, and then use an auxiliary method to match patterns and extract the model’s choice.

Question: textQ A:textA B:textB C:textC D:textD Answer:

Con: Requires answer extraction by a human, a model, or regular expressions. This process can be costly and error-prone. The generation can also be very long, resulting in significant time consumption.

How to mitigate the cons: Train a robust answer extraction model, or design robust regular expressions. Use a small temperature when doing generation.

Works or frameworks use this strategy: OpenCompass, C-Eval (Huang et al., 2023).

Table 9 compares models’ performance using strategy 1 and strategy 3. Since strategy 2 is time-consuming, we did not run experiments with it. From the table, we find that next token prediction achieves a higher score than free generation for all models, but the gap is less than 3% for most of the models under the zero-shot setting (with the exception of BatGPT, for which it is about 5%). For both zero-shot and five-shot settings, the gap between strategy 1 and strategy 3 is positively correlated with the proportion of instances that cannot be matched to any choice using regex. Hence, we believe that using next token prediction to force the model to choose among the given options can effectively reflect its knowledge capacity.

Appendix H Regular expression matching algorithms

The pseudocode in Algorithm 1 outlines the ExtractChoice function for extracting choices from an LLM output string.

Initially, the function examines whether the first character of the string corresponds to a valid choice and returns that choice if true. To accommodate the complex responses of different LLMs, we adopt a four-step matching mechanism.

First: Identify and extract choices by seeking patterns of choice statements, such as the term “answer” followed by valid options. Second: Employ a pattern to recursively identify and extract the choices mentioned in the string, iterating until they finally appear. Third: Use weak single matching patterns. Fourth: Check for responses that mention a single choice.

If there is no matching pattern or unique selection, “E” is returned by default, indicating that no selection was confidently extracted.
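The following is a simplified sketch in the spirit of the procedure described above, not a reproduction of Algorithm 1. The regular expressions are assumptions chosen for English responses, and the later matching steps are collapsed into two levels (option-style markers, then any standalone letter), each accepted only when the match is unique.

```python
import re

CHOICES = "ABCD"

def extract_choice(response):
    """Simplified multi-step choice extractor; returns 'E' when nothing matches."""
    response = response.strip()
    # Shortcut: the response starts directly with a valid choice letter.
    if response and response[0] in CHOICES:
        return response[0]
    # Explicit statement (assumed pattern), e.g. "answer is B" or "answer: (C)".
    m = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([ABCD])\b", response)
    if m:
        return m.group(1)
    # Weaker patterns (assumed): option-style markers such as "(B)" or "B.",
    # then any standalone letter. Accept a level only if it yields one letter.
    for pattern in (r"\(([ABCD])\)|\b([ABCD])[.:]", r"\b([ABCD])\b"):
        found = {m.group(m.lastindex) for m in re.finditer(pattern, response)}
        if len(found) == 1:
            return found.pop()
    # Default: no choice could be confidently extracted.
    return "E"

for text in ["B. Because ...", "The answer is C.", "I think (D) is right",
             "Either A or B", "I am not sure"]:
    print(repr(text), "->", extract_choice(text))
```

Requiring a unique match at each level trades recall for precision: a response that mentions several options is reported as “E” rather than guessed.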

Appendix I Correlation to other Benchmarks

To investigate the correlation between models’ performance on CMMLU and on other benchmarks, we choose 6 popular English LLMs and 5 benchmarks to conduct a correlation analysis.

From Figure 10 we find that CMMLU demonstrates a strong correlation with four of these benchmarks, which span areas such as mathematics, commonsense reasoning, and coding. The exception is the PIQA task, where the correlation is somewhat weaker because most models achieve high scores ($>$80%) on this task. However, a coefficient of 0.88 still indicates a strong positive correlation.
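For readers who wish to run this kind of analysis on their own results, the sketch below computes a correlation coefficient between two lists of per-model scores. It assumes the Pearson coefficient, which the text does not specify, and the scores are made-up placeholders rather than values from Figure 10.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for six hypothetical models, for illustration only.
cmmlu_scores = [30.0, 35.0, 42.0, 48.0, 55.0, 61.0]
other_scores = [25.0, 33.0, 38.0, 50.0, 52.0, 66.0]
print(round(pearson(cmmlu_scores, other_scores), 2))
```

With only six models per benchmark, a coefficient of this kind is sensitive to individual data points, so it is best read as a rough indicator.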

Appendix J Breakdown of Model Performance

Table 11 displays zero-shot results of the LLMs on CMMLU by 5 sub-categories.

J.2 The results of each subject

We compared the 0-shot and 5-shot results of selected LLMs that showed higher performance on each subject in Table 10. We further analyze the performance distribution of multiple LLMs across all subjects in Figure 11. It is evident from the figure that LLMs with higher performance exhibit diverse abilities across various tasks, while those with lower performance face challenges in most subjects. Furthermore, the scatter plot distribution indicates comparable performance levels among LLMs across different subjects.

J.3 The effect of chain-of-thought prompt

Table 12 shows the breakdown of the models’ performance when using the chain-of-thought prompt.