SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, Kai Yu

Introduction

Large Language Models (LLMs), such as ChatGPT (Schulman et al. 2022), have attracted widespread attention in general scenarios, including information search, code generation, and more. In the field of science, LLMs have also shown preliminary potential in improving scientific research efficiency and transforming scientific research paradigms (Blanco-Gonzalez et al. 2023; WANG and MIAO 2023). Meanwhile, several scientific LLMs have been proposed by researchers (Taylor et al. 2022; Luo et al. 2022; Frey et al. 2022). In the general field, there are already numerous benchmarks for evaluating the language understanding, language generation and reasoning capabilities of LLMs, such as MMLU (Hendrycks et al. 2020), AGIEval (Zhong et al. 2023), and C-EVAL (Huang et al. 2023), shown in Table 1. Although these benchmarks cover data from the science domain, the data sources are usually confined to educational materials, which cannot adequately assess the research ability of LLMs and do not align with real-life scientific research scenarios. In addition, some benchmarks have been proposed to evaluate the scientific capability of LLMs, such as MultiMedQA (Singhal et al. 2023), ChemLLMBench (Guo et al. 2023), and MATH (Hendrycks et al. 2021), but each of these benchmarks is restricted to a specific scientific discipline, leaving a lack of a more general scientific evaluation benchmark. (Due to the page limitation, we only compare some widely used benchmarks; for more information, we refer to Chang et al. 2023.) Moreover, these benchmarks (1) lack evaluation systems for scientific capabilities, (2) are all based on objective questions, which are insufficient to assess scientific abilities, and (3) face the risk of data leakage.

In response to this gap, we present SciEval, an English benchmark designed to evaluate advanced abilities of LLMs in the scientific domain. SciEval consists of a total of about 18000 challenging scientific questions, spanning three important basic science fields: chemistry, physics and biology, each of which is further divided into multiple sub-topics. SciEval mainly has the following three characteristics:

Multi-level and comprehensive evaluation of the ability of LLMs in the scientific field. The scientific ability of LLMs needs to be evaluated from multiple aspects. Leveraging the cognitive domain of Bloom’s taxonomy (Krathwohl 2002; Forehand 2010), which covers six levels, SciEval evaluates the scientific capabilities of large language models across four dimensions: basic knowledge, knowledge application, scientific calculation, and research ability, where each capability aligns with one or more cognitive levels.

Combination of objective and subjective questions. SciEval is mainly based on objective questions, which allow for quick and standard model evaluations and include multiple-choice, fill-in-the-blank, and judgment questions. These questions can help us understand whether the model can correctly understand and memorize scientific knowledge. However, objective questions are insufficient to assess scientific capability holistically. To better assess scientific reasoning and application ability, SciEval introduces a small number of subjective questions, involving a total of twelve basic science experiments; this subset is named Experimental Data.

Dynamic data generation based on basic scientific principles. The huge amount of data used for pre-training LLMs creates a risk of data leakage for evaluation. To address this problem, one of the main features of SciEval is the use of Dynamic Data, which can prevent potential data leakage and ensure the fairness and credibility of the evaluation results. The Dynamic Data will be updated regularly, and we will maintain a stable version to make a fair comparison of model performance. The objective questions other than Dynamic Data are referred to as Static Data.

We conduct experiments to evaluate LLMs on SciEval in answer-only, chain-of-thought and few-shot settings. Results indicate that GPT-4 is the strongest model, with only GPT-4, GPT-3.5-turbo and Claude-v1.3 surpassing 60% average accuracy on the Static Data, signifying considerable opportunities for improvement. From the results on Dynamic Data, we find that these LLMs have little knowledge about molecules, and most models achieve only near-random accuracy on the physics subset. As for Experimental Data, some top-tier models perform satisfactorily on experimental principles and design, while almost all models struggle to analyze the experimental results. Based on our analysis of the experiment results, we claim that training on a large-scale scientific corpus is helpful for the scientific capability of LLMs, and that most LLMs perform poorly on calculation problems, especially in the physics domain. We hope that SciEval can provide an excellent benchmark for the assessment of the scientific capability of LLMs, and promote their wide application in science.

Related Work

To evaluate the performance of LLMs across different tasks, several benchmarks have been proposed. MMLU (Hendrycks et al. 2020) aims to develop a comprehensive test for evaluating text models in multi-task contexts. HELM (Liang et al. 2022) offers a comprehensive assessment, evaluating LLMs across various aspects, such as language understanding and common-sense reasoning. Big-Bench (Srivastava et al. 2022) introduces 204 challenging tasks covering various domains, aiming to evaluate tasks beyond the capabilities of existing language models. AGIEval (Zhong et al. 2023) serves as an evaluation framework for assessing the performance of foundation models in human-centric standardized exams. C-Eval (Huang et al. 2023) assesses the advanced knowledge and reasoning capabilities of foundation models in Chinese.

Specific Benchmarks for LLMs

Apart from general tasks, specific benchmarks are designed for certain downstream tasks. MultiMedQA (Singhal et al. 2023) focuses on medical question-answering, evaluating LLMs in terms of clinical knowledge and QA abilities. MATH (Hendrycks et al. 2021) assesses reasoning and problem-solving proficiencies of LLMs in mathematics. ScienceQA (Lu et al. 2022) proposes a multi-modal benchmark with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations, collected from elementary and high school science curricula. SCIBENCH (Wang et al. 2023) examines the reasoning capabilities required for complex scientific problem-solving and proposes two datasets of college-level scientific problems. Compared to these benchmarks, SciEval (1) evaluates scientific capabilities from multiple aspects, giving it broader coverage, (2) uses data from community Q&A, which is more flexible and diverse, and (3) designs a subset of dynamic data, making an effort to mitigate data leakage.

The SciEval dataset

In this section, we first introduce the evaluation system of SciEval (§Scientific Research Evaluation System), followed by the data collection process (§Data Collection). Finally, we present the data statistics (§Data Statistics).

Scientific research requires different dimensions of knowledge, such as understanding and calculation; hence, the evaluation of scientific ability should be conducted at multiple levels. Bloom’s taxonomy (Krathwohl 2002; Forehand 2010) is a set of three hierarchical models used for the classification of educational learning objectives, covering the cognitive, affective and psychomotor domains. The cognitive domain is frequently used to structure curriculum learning objectives, assessments and activities, and is broken into six levels: Remember, Understand, Apply, Analyze, Evaluate and Create, as shown in Figure 1. These levels are suitable for the evaluation of scientific capability.

Based on the cognitive domain of Bloom’s taxonomy, the evaluation system of SciEval consists of four knowledge dimensions: Basic Knowledge, Knowledge Application, Scientific Calculation, and Research Ability. As is shown in Figure 1, Basic Knowledge primarily assesses the fundamental scientific knowledge of LLMs. Knowledge Application focuses on how to apply basic knowledge to solve scientific problems, requiring models to have comprehension, application, and analysis abilities. Scientific Calculation is a specialized application of knowledge that further examines complex reasoning capabilities of LLMs based on their general knowledge application abilities. Research Ability assesses evaluation capabilities at a higher cognitive level, requiring models to participate in various aspects of scientific research, including problem formulation, experimental design, data analysis, and summarization.

Based on the evaluation system, we design three different types of data: Static Data, Dynamic Data, and Experimental Data. The Static Data covers all four knowledge dimensions and will remain constant throughout, while the Dynamic Data examines Knowledge Application and Scientific Calculation and will be regularly updated to prevent data leakage. The Experimental Data comprises a set of questions for twelve scientific experiments and can be used to evaluate Research Ability.

Data Collection

The collection steps of Static Data are shown in Figure 2. The primary source of Static Data is Socratic Q&A (https://socratic.org), a community-driven website that covers a wide range of subjects such as science and literature. Specifically, we collect data from the fields of biology, chemistry, and physics. To ensure quality, we employ rule-based methods to preprocess the crawled data. While gathering the questions, we found that not all of them are suitable as titles. To address this, we utilize GPT-4 with the “Task 1” prompt, as depicted in Figure 2, to process these questions. Since most of the collected questions are open-ended and challenging to evaluate, we employ GPT-4 to simplify the ground-truth answers and generate three wrong answers, so that each question can be formulated as a multiple-choice question. Additionally, we classify the questions into their respective knowledge domains. During this process, we manually check the content generated by GPT-4 to ensure data quality.
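The last formatting step, turning one simplified ground-truth answer and three wrong answers into a multiple-choice item, can be sketched as follows. This is only an illustrative sketch: the function name `build_multiple_choice`, the A to D labelling, and the shuffling are assumptions, not details taken from the SciEval pipeline, and the GPT-4 calls that produce the answers are not shown.

```python
import random


def build_multiple_choice(question, correct, distractors, seed=None):
    """Format a question as a four-option multiple-choice item.

    Hypothetical helper: `correct` is the simplified ground-truth answer and
    `distractors` are the three generated wrong answers.
    """
    rng = random.Random(seed)
    options = [correct] + list(distractors)
    rng.shuffle(options)  # avoid always placing the answer first
    labels = "ABCD"
    lines = [question]
    lines += [f"{label}. {option}" for label, option in zip(labels, options)]
    answer_label = labels[options.index(correct)]
    return "\n".join(lines), answer_label


text, answer = build_multiple_choice(
    "Which particle carries a negative charge?",
    "Electron",
    ["Proton", "Neutron", "Photon"],
    seed=0,
)
print(text)
print("Answer:", answer)
```

Shuffling with a fixed seed keeps the item reproducible while preventing the correct option from always appearing in the same position.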

To make the dataset more diverse and comprehensive, we further integrate data from some publicly available datasets:

MedQA (Jin et al. 2021) is a free-form multiple-choice OpenQA dataset for solving medical problems, collected from professional medical board exams. We use the test set of USMLE, which is the English subset of MedQA.

PubMedQA (Jin et al. 2019) is a biomedical question-answering dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe using the corresponding abstracts, which makes it well suited to evaluating literature comprehension ability. We incorporate its 1000 expert-annotated instances and frame them as judgment questions.

Reagent Selection (Guo et al. 2023), a subset of ChemLLMBench, involves the identification and proposal of the most fitting reagents for a specific chemical reaction or process. We randomly select 40% of the data and formulate them as multiple-choice questions.

Dynamic Data

The current training of LLMs often uses a large amount of data, resulting in a risk of data leakage for evaluation. To address this problem, we design a “dynamic” subset, which can generate data dynamically according to scientific principles. The dynamic subset covers two disciplines, chemistry and physics. For chemistry data, we use the basic information and properties of molecules crawled from PubChem (https://pubchem.ncbi.nlm.nih.gov/) to create data. For physics data, we manually write Python scripts according to physics formulas. When users obtain the evaluation dataset, we will provide them with a regenerated version, which we will update regularly; at the same time, we will maintain a stable version of the dynamic data to allow a fair comparison.
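To illustrate what a formula-driven generator for the physics subset might look like, the sketch below samples parameters for uniformly accelerated motion and builds a four-choice question. Everything here is an assumption made for illustration: the choice of formula (v = v0 + a·t), the parameter ranges, the way distractors are derived, and the name `generate_kinematics_question` are hypothetical and are not taken from the SciEval scripts.

```python
import random


def generate_kinematics_question(rng):
    """Generate one four-choice question from v = v0 + a * t (illustrative only)."""
    v0 = rng.randint(1, 20)  # initial velocity in m/s
    a = rng.randint(1, 10)   # acceleration in m/s^2
    t = rng.randint(2, 10)   # elapsed time in s
    correct = v0 + a * t

    # Distractors come from plausible misapplications of the formula.
    wrong = []
    for value in (a * t, v0 + a * t ** 2, v0 * t + a, v0 + a + t):
        if value != correct and value not in wrong:
            wrong.append(value)
    # Pad with nearby values if some distractors coincided.
    offset = 1
    while len(wrong) < 3:
        if correct + offset not in wrong:
            wrong.append(correct + offset)
        offset += 1

    options = [correct] + wrong[:3]
    rng.shuffle(options)
    question = (
        f"An object moves with initial velocity {v0} m/s and constant "
        f"acceleration {a} m/s^2. What is its velocity in m/s after {t} s?"
    )
    return {
        "question": question,
        "options": options,
        "answer": "ABCD"[options.index(correct)],
        "v0": v0,
        "a": a,
        "t": t,
    }


item = generate_kinematics_question(random.Random(42))
print(item["question"])
for label, option in zip("ABCD", item["options"]):
    print(f"{label}. {option}")
```

Because the parameters are sampled at generation time, re-running such a script with a new seed yields fresh questions, whereas fixing the seed would give a stable version for comparison.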

Experimental Data

To better evaluate the scientific thoughts and abilities of LLMs, SciEval introduces a subset of experimental data, involving 12 different basic scientific experiments. These experiments are collected from basic science experiment courses at university, and each experiment conducts a comprehensive investigation of the ability of LLMs in scientific research and experimentation from the perspectives of experimental principle, process, and analysis and summarization of experimental results.

Data Statistics

Summarized statistics of SciEval are shown in Table 2, where we only count Static Data. For Dynamic Data, the chemistry part examines the Knowledge Application ability and contains 2000 questions, while the physics part evaluates the Scientific Calculation ability and contains 890 questions. All these questions are in English, and we show some data examples in Appendix D.

For Static Data, we further split the data into dev, valid, and test sets. For each data source, each knowledge domain, and each discipline, we randomly select 5 instances to form the dev set, which can be used for few-shot learning, and we split the remaining data with a ratio of 1:9 to construct the valid set and test set respectively.
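The split procedure can be sketched as below. This is a minimal sketch under stated assumptions: the record fields `source`, `domain`, and `discipline`, the function name `split_static_data`, and the choice to apply the 1:9 ratio within each group (rather than over the pooled remainder) are hypothetical.

```python
import random
from collections import defaultdict


def split_static_data(items, dev_size=5, valid_fraction=0.1, seed=0):
    """Split records into dev/valid/test per (source, domain, discipline) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        key = (item["source"], item["domain"], item["discipline"])
        groups[key].append(item)

    dev, valid, test = [], [], []
    for key in sorted(groups):  # sorted for a deterministic group order
        group = list(groups[key])
        rng.shuffle(group)
        dev.extend(group[:dev_size])  # few-shot exemplars
        rest = group[dev_size:]
        n_valid = round(len(rest) * valid_fraction)  # 1:9 valid:test
        valid.extend(rest[:n_valid])
        test.extend(rest[n_valid:])
    return dev, valid, test


records = [
    {"id": i, "source": "socratic", "domain": "basic", "discipline": "physics"}
    for i in range(105)
]
dev, valid, test = split_static_data(records)
print(len(dev), len(valid), len(test))
```

With 105 records in a single group, this yields 5 dev, 10 valid, and 90 test records.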

Experiment

We evaluate LLMs in both Answer-Only (AO) and Chain-Of-Thought (CoT) (Kojima et al. 2022) settings. The prompts we used are shown in Figures 3 and 4 respectively. Furthermore, we also evaluate in a 3-shot setting, where the three exemplars are selected from the dev set.

In order to comprehensively assess the scientific capabilities of Large Language Models (LLMs), we evaluate 15 high-performing LLMs that are widely accessible. These models are selected to represent a diverse range of organizations and vary in size. The details of these models are summarized in Table 3.

GPT-3.5-turbo and GPT-4 (Schulman et al. 2022; OpenAI 2023) are the strongest GPT model variants from OpenAI that have undergone pretraining, instruction tuning, and reinforcement learning from human feedback (RLHF; Ouyang et al. 2022).

Claude (https://www.anthropic.com/index/introducing-claude), developed by Anthropic, is often considered comparable to GPT-3.5-turbo. We evaluate both Claude-v1.3 and Claude-instant-v1.1, a lighter version of Claude.

ERNIE Bot (https://yiyan.baidu.com/) is developed by Baidu, possessing deep semantic understanding and generation capabilities across modalities and languages. SparkDesk (https://xinghuo.xfyun.cn/) is proposed by iFLYTEK. It has cross-domain knowledge and language understanding capabilities and can understand and execute tasks based on natural dialogue.

LLaMa (Touvron et al. 2023), developed by Meta, is probably the best open-weight foundation model so far. Vicuna (Zheng et al. 2023) and Alpaca (Taori et al. 2023) are both fine-tuned from LLaMa with supervised instruction fine-tuning. It is believed that the performance of Vicuna is better than that of Alpaca.

Galactica (Taylor et al. 2022), also developed by Meta, is trained on a large-scale scientific corpus. It is developed to study the use of language models for the automatic organization of science and can perform numerous scientific tasks, such as citation prediction, scientific QA, and molecular property prediction.

ChatGLM and ChatGLM2, created by Tsinghua University, are based on GLM architecture (Du et al. 2022), and further adapted on conversational data. MOSS (Sun et al. 2023), developed by Fudan University, is the first publicly available Chinese LLM, and it follows a training procedure similar to ChatGPT.

We evaluate GPT-3.5-turbo, GPT-4 and Claude on all three subsets, including Static Data, Dynamic Data, and Experimental Data. Since we can only access ERNIE Bot and SparkDesk through a web interface, we evaluate these two models only on the Experimental Data. For the remaining LLMs, which have billions or tens of billions of parameters, the length of the Experimental Data exceeds their length limit (the maximum context length of ChatGLM2 is extended to 32k, but it has limited ability to understand long texts), so we evaluate them on Static Data and Dynamic Data, as shown in Table 3.

In the case of Static Data, all questions are objective, making accuracy the appropriate evaluation metric. For Dynamic Data, the physics questions are presented as multiple-choice questions, which can also be evaluated using accuracy. Conversely, the chemistry questions involve complex components, such as “What is the molecular weight of A?” and “What is the SMILES expression of B?”. Hence, for questions with numerical answers, we employ MSE as the evaluation metric (if the predictions do not contain any number, we regard the MSE as $1\times 10^{10}$), while for questions with string answers, we utilize the BLEU score (Papineni et al. 2002). Additionally, we also calculate the exact match scores. As for Experimental Data, each experiment consists of multiple open-ended questions. As a result, we assess the model-generated responses manually.
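The numerical and exact match metrics for the chemistry subset could be computed along the following lines. This is a sketch of one possible reading, not the paper's evaluation code: the regular expression, the choice to take the first number found in a response, and the decision to apply the 1e10 penalty per prediction before averaging are all assumptions, as are the function names. BLEU is not shown; it would normally come from an existing implementation.

```python
import re

NO_NUMBER_PENALTY = 1e10  # used when a prediction contains no number
_NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def squared_error(prediction, target):
    """Squared error between the first number in `prediction` and `target`."""
    match = _NUMBER.search(prediction)
    if match is None:
        return NO_NUMBER_PENALTY
    return (float(match.group()) - target) ** 2


def mean_squared_error(predictions, targets):
    """Average squared error over free-text predictions (assumed aggregation)."""
    errors = [squared_error(p, t) for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)


def exact_match(predictions, references):
    """Fraction of predictions identical to the reference after trimming."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


print(mean_squared_error(["about 18.0 g/mol", "unknown"], [18.0, 5.0]))
print(exact_match(["CCO", "C"], ["CCO", "CC"]))
```

Taking the first number in a response is a simplification: a reply that restates a formula such as C2H6 before giving the weight would be scored on the wrong number, so a real evaluation would need a stricter answer-extraction rule.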

Experiment Results

Answer-only results of all the models on the test set are shown in Table 4, and detailed results on Static Data across different knowledge domains are provided in Appendix B. On Static Data, GPT-4 demonstrates significantly superior performance compared to other models. Only GPT-4, GPT-3.5-turbo, and Claude-v1.3 achieve an average accuracy exceeding 60%, which highlights the challenge posed by SciEval.

For the results of Dynamic Data, GPT-4 performs the best in terms of average accuracy and BLEU score. However, for counting and calculation questions, Galactica-30B yields the best results, indicating its strong aptitude in the field of science. Conversely, models with billions or tens of billions of parameters perform poorly on the chemistry subset, suggesting their limited knowledge about molecules. Regarding the performance of models on the physics subset, since all questions are four-choice questions, the accuracy should be at least 25%. However, none of these models achieve satisfactory results on this subset.

As for Experimental Data, GPT-series models and Claude-series models achieve good results, while the other two models do not. The detailed scores that models reach in each experiment are shown in Appendix C. However, although some models achieve strong overall performance, we find during the experiments that these models are good at experimental principles and design, whereas their performance on analyzing the experiment results is not satisfactory.

CoT Setting and 3-Shot Setting

Comparisons of the experiment results among the Answer-Only, Chain-of-Thought and 3-Shot settings are shown in Figure 5 and Table 5. (When evaluating in the CoT and 3-Shot settings, Claude-Instant and Claude were not available to us, due to the limitation of the API.) Detailed results are given in Appendices A and B.

The experimental results on Static Data reveal that only the GPT-series LLMs improve in the CoT setting, due to the limited CoT capabilities of the other LLMs. As for the 3-Shot setting, roughly half of the LLMs analyzed perform better than in the Answer-Only setting. The performance of the remaining LLMs is close to that observed in the Answer-Only setting.

From the experimental results on Dynamic Data, we observe that both CoT and 3-Shot significantly enhance the performance of most LLMs on the chemistry subset. However, the performance achieved is still not up to the mark. On the physics subset, the impact of CoT and 3-Shot on most LLMs is less pronounced, resulting in nearly random performance. Under the CoT setting, GPT-3.5-turbo achieves an accuracy of 47.19, suggesting a robust understanding of physical principles. Conversely, the performance of GPT-4 is markedly poor; we find that despite its extensive knowledge of physical principles, it frequently employs incorrect formulas to solve problems. Nevertheless, GPT-4 attains an accuracy of 51.01 under the 3-Shot setting, the highest among all models, demonstrating its ability to learn from a mere three examples.

Discussion

Based on experimental results (Table 4), Galactica (Taylor et al. 2022), which has been trained on an extensive scientific corpus, significantly outperforms other LLMs with a comparable number of parameters, although Galactica is trained with a much smaller amount of data. Remarkably, when tested on Dynamic Data, Galactica surpasses the GPT-series and Claude-series LLMs in computational problems.

Most LLMs perform poorly on calculation problems, especially in the physics domain.

Detailed results across various knowledge domains on Static Data (refer to Appendix B) reveal that most LLMs underperform in the Scientific Calculation domain while demonstrating relatively superior performance in other domains; this gap is particularly acute in the field of physics. Similar issues are also observed in Dynamic Data and Experimental Data. In the context of Dynamic Data, the mean square error, employed to evaluate calculation abilities within the chemistry subset, is exceedingly high for most LLMs, and almost all LLMs can only achieve nearly random performance within the physics subset. Regarding Experimental Data, our findings indicate that these LLMs struggle with the analysis of experimental results.

Conclusion

In this paper, we introduce SciEval, a benchmark designed to evaluate scientific capabilities of LLMs. SciEval comprises about 18,000 challenging scientific questions, covering three fundamental fields of science. SciEval assesses the scientific ability of LLMs across four dimensions. It incorporates both objective and subjective questions, and employs dynamic data generation to mitigate potential data leakage. We conduct comprehensive experiments on various advanced LLMs using SciEval and perform thorough analyses. Our experimental results reveal that most LLMs do not perform well on our benchmark, with the exception of the GPT-series and Claude-series LLMs. We hope that SciEval can serve as a robust benchmark for assessing scientific capabilities of LLMs.

References

Appendix A Detailed Results on Dynamic Data

In this section, we show detailed results on the Chemistry subset of Dynamic Data under Chain-of-Thought (Table 6), and 3-Shot settings (Table 7). The performance comparison under different settings can be found in Table 5 of the main body.

Appendix B Detailed Results on Static Data

In this section, we show detailed results on Static Data across different knowledge domains under Answer-Only (Table 9), Chain-of-Thought (Table 10) and 3-Shot settings (Table 11), and the overall results are shown in Table 8.

Appendix C Detailed Results on Experimental Data

In this section, we show detailed results for each experiment in Table 12. Each category contains four experiments, and each experiment is composed of several questions.

Appendix D Dataset Example

In this section, we show examples of different disciplines, different knowledge domains, and different subsets, including Static Data (Figures 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15) and Dynamic Data (Figures 16 and 17).