SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Mingjun Xu, Jin Huang, Xi Fang, Jiaxi Zhuang, Yuqi Yin, Yaqi Li, Changhong Chen, Zheng Cheng, Zifeng Zhao, Linfeng Zhang, Guolin Ke

cs.CL

Introduction

Recent advances in Large Language Models (LLMs), such as Llama, Gemini, and GPT-4, have attracted considerable attention due to their profound capabilities in natural language understanding and generation. Evaluating these models is crucial for exploring their capability boundaries and limitations, further driving technological advancements. In response, a variety of benchmarks tailor-made for LLMs have been proposed to broaden the scope of evaluation, covering wider skill sets and introducing more complex tasks.

Meanwhile, in the realm of scientific research, the role of LLMs has become increasingly significant, particularly in the domain of scientific literature analysis. Applications such as literature summarization and knowledge extraction have seen practical deployments, enhancing researchers’ productivity and broadening the scope of literature that can be synthesized and utilized. However, existing benchmarks are limited in their ability to fully assess the capabilities of LLMs within the field of scientific literature analysis, particularly when handling complex, comprehensive understanding and multimodal data scenarios. These benchmarks often fail to simulate the intricate challenges presented by scientific literature, which include but are not limited to domain-specific jargon, intricate relational reasoning, and the integration of multimodal information. To bridge this gap, there is a critical need for the development of advanced benchmarks that accurately reflect the complexity and specificity of scientific literature analysis.

In addition to expanding the evaluation scope to encompass the full range of LLM capabilities in the context of scientific literature analysis, meticulous design is imperative for creating evaluations that yield deep insights, ensure fairness across different LLMs, and offer high utility for stakeholders interested in selecting and optimizing LLMs. Thus, the benchmark design should be founded on three critical considerations:

Model Ability: A benchmark must delineate the desired capabilities and model the intrinsic relationships among them, facilitating diagnostic understanding of how these abilities can be acquired and enhanced.

Scope & Task: Benchmarks should be predicated on a broad array of scientific domains to ensure a comprehensive representation of disciplines. Within each domain, the tasks selected must authentically represent the typical challenges and scenarios characteristic of that field.

Quality Control: The quality of the benchmark dataset must be impeccable to serve as a dependable basis for deriving accurate, actionable, and applicable insights. Each data point should undergo stringent validation by domain experts to confirm its accuracy and reliability.

In light of existing limitations, we introduce SciAssess, a new benchmark specifically designed for scientific literature analysis. It covers diverse tasks and question types, aiming to provide a more nuanced and rigorous evaluation of LLMs that captures the complexity of scientific inquiry. We describe it along the three cornerstones introduced above: model ability, scope & task, and quality control.

For model ability, SciAssess evaluates it across three progressive levels: memorization, comprehension, and analysis & reasoning. Consequently, SciAssess yields nuanced and informative evaluation outcomes, pinpointing specific aspects where the examined models may fall short.

For scope and task, SciAssess encompasses a wide spectrum of tasks pertinent to various scientific domains, such as general chemistry, organic electrolytes, alloy materials, drug discovery, and biology. To secure a thorough and representative benchmark, the raw data is meticulously curated from publicly accessible scientific publications and specialized databases, so that SciAssess is both comprehensive and reflective of the current state of scientific inquiry.

For quality control, SciAssess underwent rigorous expert cross-validation to ensure correctness and reliability. Meanwhile, meticulous screening is conducted to remove or anonymize sensitive information, thus protecting privacy and security. These measures guarantee compliance with copyright laws, upholding SciAssess’s legal and ethical integrity.

SciAssess aims to reveal the performance of LLMs in the domain of scientific literature analysis, thereby identifying both their strengths and weaknesses. Building on this, our objective is to promote the evolution of LLMs, ensuring they are not only better skilled at navigating scientific literature but also more capable of aiding the progress of research in diverse scientific fields. We hope the insights gained from SciAssess will catalyze further improvements in the capabilities of LLMs in scientific literature analysis, ultimately contributing to the acceleration of scientific discovery and innovation.

Benchmark Dataset

In crafting a comprehensive benchmark for evaluating LLMs within the scientific domain, we have undertaken meticulous designs considering three factors: model ability, scope & task, and quality control.

Guided by the widely accepted cognitive learning processes outlined in Bloom’s Taxonomy, we have developed SciAssess, a benchmark specifically designed for the evaluation of LLMs within the context of scientific literature analysis. This evaluation encompasses three core competencies:

Memorization (L1) refers to the model’s extensive knowledge base, which allows it to accurately answer common factual questions in science autonomously.

Comprehension (L2) is the ability to precisely identify and extract key information and facts within a given text, and to comprehend them.

Analysis and Reasoning (L3) is the model’s advanced capability to amalgamate extracted information with its existing knowledge base for logical reasoning and analysis, leading to well-founded conclusions or predictions.

2 Scope & Task

As illustrated in Table 1, our benchmark encompasses a broad spectrum of practical scientific disciplines, ensuring a comprehensive evaluation of LLMs across various domains. Meanwhile, we devise five types of questions to evaluate the models: True/False questions, Multiple-Choice questions, Table Extraction, Constrained Generation, and Open-ended Generation. For an in-depth understanding of these question types, including detailed descriptions and concrete examples, please refer to Appendix A.

Note: The "Multimodal Content" column indicates whether the task requires the use of elements beyond the text from the literature, such as images, tables, charts, molecules, and so on. Note on Evaluated Ability: L1 represents Memorization, L2 represents Comprehension, and L3 represents Analysis and Reasoning.

The General Chemistry evaluation set is a comprehensive suite of tasks designed to assess LLMs’ capabilities across a range of chemistry-related skills, from foundational knowledge to applied problem-solving and research analysis. This set encompasses five distinct tasks, each targeting different chemistry and scholarly comprehension aspects. Including these tasks provides a holistic overview of an LLM’s ability to engage with the academic study of chemistry and the practical application of its principles. All test data are collected from OpenAI’s evals repository.

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and similar to how we evaluate humans. We select the high-school chemistry and college chemistry from its 57 subjects to examine the memorization of knowledge.

This evaluation benchmark tests the model’s ability to generate a proper title using the abstract section of the literature. The LLM should understand the abstract section and rephrase it concisely. The conciseness of the generated title is evaluated by GPT-4, prompting like:

The "Research Question Extraction" evaluation task for LLMs aims to assess the models’ ability to identify, extract, and summarize the primary research questions from a scientific paper’s abstract. This task challenges the LLMs to demonstrate deep comprehension of the abstract’s content, which typically includes the research’s background, purpose, methodology, results, and conclusions. The primary goal is to distill this information into a clear, concise statement or set of statements encapsulating the core research question(s) the paper seeks to address.

This evaluation task tests several key capabilities of LLMs, including their understanding of complex and domain-specific language, ability to discern the main focus amidst potentially broad and detailed information, and proficiency in summarizing and reformulating academic content. Research question extraction requires not just surface-level processing of the text but a deeper analysis to identify the research gaps the study aims to fill, the hypotheses it tests, or the problems it seeks to solve.

This task is particularly relevant for evaluating the utility of LLMs in academic and research settings. Efficiently understanding and extracting the essence of scholarly articles can aid in literature reviews, research proposal development, and the identification of research trends and gaps. It underscores the potential of LLMs to support researchers, scholars, and students by streamlining the process of engaging with the vast and ever-growing body of scientific literature.

The answers are scored by GPT-4 on a scale from 1 to 5, using a prompt similar to the one in the Abstract2Title task.

The "Balancing Chemical Equations" evaluation task for LLMs is designed to assess the LLMs’ ability to understand and apply principles of chemical stoichiometry and the laws of mass and energy conservation. Balancing chemical equations is a fundamental skill in chemistry that involves adjusting the coefficients of the reactants and products in a chemical equation to ensure that the number of atoms for each element is the same on both sides of the equation, reflecting the conservation of matter.

This task not only tests the LLMs’ capacity to parse and interpret the symbolic language of chemistry but also evaluates their problem-solving skills and their ability to engage with domain-specific knowledge. To successfully balance a chemical equation, an LLM must identify the reactants and products, understand the stoichiometric relationships between them, and apply mathematical reasoning to find a set of coefficients that balance the equation.
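The conservation condition described above can be checked mechanically once a model proposes coefficients. The following is a minimal sketch, not the benchmark's grading code: the function names are hypothetical, the equation is assumed to use "+" between species and "->" between sides, and formulas with parentheses, hydrates, or charges are not handled.

```python
import re
from collections import Counter


def count_atoms(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'H2O' or 'C6H12O6'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_totals(side: str) -> Counter:
    """Sum atoms over one side of an equation, e.g. '2 H2 + O2'."""
    totals = Counter()
    for term in side.split("+"):
        match = re.fullmatch(r"\s*(\d*)\s*([A-Z][A-Za-z0-9]*)\s*", term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coefficient = int(match.group(1)) if match.group(1) else 1
        for element, n in count_atoms(match.group(2)).items():
            totals[element] += coefficient * n
    return totals


def is_balanced(equation: str) -> bool:
    """True if every element has the same atom count on both sides."""
    left, right = equation.split("->")
    return side_totals(left) == side_totals(right)


print(is_balanced("2 H2 + O2 -> 2 H2O"))  # True
print(is_balanced("H2 + O2 -> H2O"))      # False
```

A check of this kind only verifies a candidate answer; finding the coefficients in the first place is the part of the task that exercises the model's reasoning.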

2.2 Alloy Materials

Alloy materials are metallic materials formed by mixing two or more metal elements in a certain proportion. Alloys are widely used in aerospace, automobile manufacturing, construction, electrical appliances, and other fields. By adjusting the composition and preparation process, an alloy can be made to achieve specific properties and meet specific requirements. Extracting alloy compositions and process values from the literature is therefore very important for alloy design.

In this chapter, we will investigate the capabilities of LLMs in extracting information for alloy design. We have designed a comprehensive set of tasks related to literature research, including extraction of alloy composition, extraction of process values, determination of process sequence, and sample differentiation. The standard answers for all tasks in this section are manually extracted from literature in different journals and verified by a second person.

Extracting alloy composition information from an article’s text or tables and processing it into a unified structure can help researchers use historical data more effectively and provide experience-based guidance for subsequent design. This comprehensive task evaluates the LLMs’ ability to extract alloy composition (containing all element content) from text and tables. The alloy elements are typically found in one of two places: (1) the element content is stored in a table, as illustrated in Table 2; (2) the element content is implicitly indicated by the alloy name. For instance, ’Fe30Co20Ni50’ represents an atomic ratio of 30% Fe, 20% Co, and 50% Ni. The objective of this task is to comprehensively extract this information and organize it into a result table, which is then scored by its match against the standard answer table. This showcases the LLM’s comprehension ability to integrate, extract, and structure multimodal information.
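The second case, where the composition is encoded in the alloy name, can be illustrated with a short parser. This is a sketch under simple assumptions: the function name is hypothetical, each element symbol is immediately followed by its amount, and elements written without a number (as in equiatomic shorthand) are skipped.

```python
import re


def parse_alloy_name(name: str) -> dict[str, float]:
    """Read element/amount pairs from a name such as 'Fe30Co20Ni50'."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)", name)
    return {element: float(amount) for element, amount in pairs}


print(parse_alloy_name("Fe30Co20Ni50"))
# {'Fe': 30.0, 'Co': 20.0, 'Ni': 50.0}
```

Whether the numbers are atomic or weight percentages cannot be read from the name alone and has to come from the surrounding text, which is part of what makes the task a comprehension problem.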

The properties of an alloy are determined by its composition and its process, which includes mechanical processing, heat treatment, and so on. Extracting heat treatment values is therefore critical. This task aims to determine the maximum heat treatment temperature of the alloy. To allow accurate statistical analysis, prompts are designed as multiple-choice questions.

Example Paragraph: Cast NiMnGa samples, of Ni50Mn30Ga20 nominal composition, were prepared by 5 arc melting cycles of the pure elements (electrolytic Ni 99.97%, electrolytic Mn 99.5% and Ga 99.99%) in stoichiometric ratio, in a non-consumable electrode furnace (Leybold LK6/45) (Leybold, Cologne, Germany). The as-cast ingot was ground to powder in a planetary ball mill (Fritsch Pulverisette 4) (Fritsch, Idar-Oberstein, Germany) and the powder size was selected by means of sieves. Densified pellets were produced by die-pressing alloy powders with different average sizes (lower than 50 µm or between 50 and 100 µm) at 0.75 GPa at room temperature and sintered by thermal treatment at 925 °C for 24, 72, and 168 h in an Ar atmosphere, followed by slow cooling in the furnace. Sintered pellets had the following dimensions: approximately 3 mm in height and 13 mm in diameter. Table 1 provides a summary of the prepared sintered samples.

We prompt the model with the following:

Each alloy treatment process has a clear sequence requirement, so the sequence of the extracted heat treatment steps must be consistent with the experimental sequence. For example, after solution treatment, the sample is further aged to relieve its internal stress. This task aims to objectively analyze and evaluate the sequential relationship between two heat treatments and provide True/False answers. In addition, if a heat treatment named in the question does not appear in the paper, the answer should be False. This task assesses the LLM’s comprehension ability to judge treatment order from text.

Prompt
System Message: You are a highly intelligent alloy researcher who focuses on the heat treatment process (homogenization, annealing, aging, solution, quenching, tempering, etc.) and answers the following question correctly. Only write the True or False.
User Message: In the upper paper, Is the processing heat treatment technique after the homogenization called aging?
Expected Answer: True

It is worth noting that alloy process steps are defined at the level of complete heat treatments. For two consecutive steps such as annealing and tempering, each step contains its own temperature rise, constant temperature, and cooling stages. Under this definition, the process preceding tempering should be judged to be annealing, rather than the cooling sub-stage of annealing.
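The True/False rule for treatment order can be written down compactly. The sketch below is illustrative only: the function name is hypothetical, the ordered step list is assumed to have been extracted already, and "after" is read as "immediately after", which is one interpretation of the definition given above.

```python
def follows_directly(steps: list[str], earlier: str, later: str) -> bool:
    """True only if `later` is the step immediately after `earlier`.

    A step name that does not occur in `steps` yields False, matching
    the rule that a treatment absent from the paper is judged False.
    """
    names = [step.lower() for step in steps]
    earlier, later = earlier.lower(), later.lower()
    if earlier not in names or later not in names:
        return False
    return any(a == earlier and b == later for a, b in zip(names, names[1:]))


steps = ["homogenization", "aging"]
print(follows_directly(steps, "homogenization", "aging"))      # True
print(follows_directly(steps, "homogenization", "tempering"))  # False
```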

The processing steps and properties of alloy materials are usually presented in the form of charts, such as a single chart comparing the performance of multiple alloys or showing how elongation changes with composition. Therefore, extracting information from charts and integrating it with textual information is of vital importance. To further evaluate the retrieval capability of models regarding alloy chart information, we have designed multiple-choice questions involving alloy composition, processing techniques, and properties.

In order to minimize the impact of the article’s body text, for every ChartQA task mentioned in this article, we only input the PDF page containing the targeted image. We prompt the model with the following:

Alloys with the same composition but treated by different processes are considered different samples because they have different properties. Therefore, the distinction between different samples and the differences in their processes needs to be considered. This multiple-choice question is designed to comprehensively judge the number of different alloy samples proposed or studied by the authors. It assesses the LLMs’ analysis and reasoning ability regarding alloy distinctions from text.

The following example is a process paragraph in which the samples are treated by different processes:

Example Paragraph: An induction furnace was used to produce the Zn-21Al-2Cu alloy by melting proper amounts of Zn (99.99%), Al (99.99%) and Cu (99.96%). The alloy was melted in a graphite crucible exposed to air and poured into cylindrical bars of 19 mm in diameter and 35 mm in length. After that, some bars were homogenized at 350 °C for 24 h in the air. Cast and homogenized samples were subjected to an equal channel angular extrusion (ECAP) in a die with two cylindrical channels with a diameter of 15.8 mm. The inner intersecting angle (y) was 90 and the outer angle (y) was 36°. All samples were extruded by two and six passes with a ram velocity of 5 mm/min and by using B. route. The lubricant used was MoS, and it was applied to both channels on each pass.

We prompt the model with:

2.3 Organic Materials

Organic materials, derived from carbon-based molecules or polymers, exhibit distinct functionalities advantageous across a spectrum of applications. Distinguished from inorganic counterparts by their modifiable properties and adaptability, these materials are vital in applications such as electronics, photonics, sensing, and energy, leveraging the vast potential of organic chemistry for technological advancements.

In this chapter, our primary focus lies on two subfields within organic functional materials: organic electrolytes and polymer materials. For polymer materials, we evaluate the efficacy of LLMs in extracting crucial properties associated with polymer materials from the scientific literature. We specifically focus on the application of conjugated polymers in organic solar cells as a case study and design two tasks for plain text and tables, respectively. We aim to assess the model’s ability to recognize and identify information about these materials from a variety of tasks.

Organic electrolytes are electrolytic solutions extensively used in battery technologies, especially in lithium-ion batteries. Comprising organic solvents, lithium salts, and additives when necessary, these electrolytes facilitate ion transfer within the battery, enabling energy storage and release.

Understanding solubility in organic electrolytes is crucial as it directly impacts the efficiency of electrolytic processes, product selectivity, and equipment design in various applications. In this task, we investigate the LLM’s capability in retrieving solubility-related tables. Because papers on electrolytes typically draw on data from various aspects to describe the system, asking models to combine multiple tables into an appropriate format for fuzzy matching turns out to be overly difficult. We therefore focus on examining the semantic understanding ability of LLMs: the model must select the largest, most relevant table related to "solubility" from numerous alternatives and then convert it into the specified format.

The composition and properties of organic electrolytes are crucial for battery performance, stability, and safety. Therefore, to further evaluate the model’s retrieval capabilities concerning information related to electrolytes, we posed multiple choice questions about both the components of the solution systems and the dissolution reactions, focusing on their physical and chemical properties as presented in the tables within the articles.

Investigating electrolyte reactions helps improve the solid electrolyte interphase (SEI) layer, which directly affects battery performance and lifespan. Such studies lead to the development of advanced electrolytes that promote a robust SEI, resulting in more efficient and durable batteries. Hence we designed a complex task to test the model’s capability in extracting information from schematic diagrams of chemical reaction mechanisms. We require the model to understand the images specified in the articles and select the correct answer from the provided multiple-choice descriptions.

This task extracts vital values from the literature, including power conversion efficiency (PCE), open-circuit voltage ($V_{\mathrm{OC}}$), and other electronic properties. These properties are generally reported in tables. Using LLMs to extract them shows significant potential for supporting the AI community in polymer modeling, for example in computer-aided screening, targeted design, and optimization. Source data are collected from journals including Nature Communications, Advanced Materials, Nature Photonics, J. Phys. Chem, and Appl. Phys. Lett.

This task extracts the blend ratio of donor to acceptor in the most efficient solar cell. The blend ratio is generally reported in the plain text. Using LLMs to extract it shows significant potential for supporting the AI community in polymer modeling, for example in computer-aided screening, targeted design, and optimization. Source data are collected from journals including Nature Communications, Advanced Materials, Nature Photonics, J. Phys. Chem, J. Am. Chem. Soc. and Appl. Phys. Lett.

This comprehensive task evaluates the LLM’s ability to extract OLED molecules and their optical properties. It tests several key capabilities of LLMs, including their understanding of complex, domain-specific language and their understanding of molecules and tables.

An example output is shown in Table 2.2.3.

The processing steps and properties of polymer materials are often represented through charts. As a result, it is crucial to extract information from these charts and combine it with textual data. To further assess the retrieval capabilities of models concerning polymer chart information, we have designed multiple-choice questions that involve polymer composition, processing techniques, and properties. The polymer ChartQA example is similar to the Alloy ChartQA task in Section 2.2.2.

2.4 Drug Discovery

In this chapter, we examine the LLMs’ capabilities in the realm of drug discovery. We devise a comprehensive set of tasks related to patent and literature research, ranging from affinity data extraction to patent coverage.

This comprehensive task evaluates the LLM’s ability to extract an affinity table (containing molecules’ tags, SMILES, and affinity to different targets in bioassays). This evaluation task tests several key capabilities of LLMs, including their understanding of complex and domain-specific language, and understanding of molecules and tables. Affinity data extraction requires not just surface-level processing of the text but a deeper analysis to match different modalities.

Prompt
System Message: You are an expert in the field of pharmaceutical chemistry, and your task is to summarize the results of activity assays from an article in a tabular format. Please follow these steps to complete the task:
1. Determine if the article includes an activity assay. If it does, locate the section(s) presenting the assay results, which are usually in one or more tables.
2. Compile all the activity assay results into a single table. You may use multiple columns to represent different conditions or outcomes of various experiments.
3. Identify the names or codes used in the table, such as Example 1 or Compound A, and find the corresponding sections in the article that mention these substances. Extract the full name and SMILES notation of each substance.
4. Compile the names and SMILES notations of each substance in the table. Output in CSV format with multi-index (Affinities, protein/cell line), write units not in the header but in the value like "10.5 µM". Quote the value if it has comma! For example:
```csv
Compound,Name,SMILES,Affinities,Affinities,Affinities,Affinities
,,,5HT1A (IC50),5HT1D (IC50),5HT-UT (IC50),5HT1E ()
"5a","Aspirin","CC(=O)Oc1ccccc1C(=O)O",2.0 nM,8.0 nM,12.6 nM,>1000 nM
```
5. If there are multiple tables, concat them. Don’t give me reference or using "…", give me complete table!

The dataset is curated from PubChem BioAssays to cover literature from different journals and years. Since the original dataset is organized by bioassay number, we merged the source data according to their DOIs and carefully sampled a subset. The papers cover a wide range of protein targets and cell lines and present their tables in different display formats.
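The two-row header requested in the prompt (a group row followed by a target row) can be read back with the standard `csv` module. The sketch below is only an illustration of that layout: the function name and the record structure are assumptions, and the sample text is the example row from the prompt itself.

```python
import csv
import io


def read_affinity_csv(text: str) -> list[dict]:
    """Turn a two-header-row affinity CSV into one record per compound."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    top, sub, data = rows[0], rows[1], rows[2:]
    records = []
    for row in data:
        record = {"affinities": {}}
        for group, target, value in zip(top, sub, row):
            if target:  # second header row is filled only for affinity columns
                record["affinities"][target] = value
            else:       # plain columns such as Compound, Name, SMILES
                record[group] = value
        records.append(record)
    return records


EXAMPLE = '''Compound,Name,SMILES,Affinities,Affinities,Affinities,Affinities
,,,5HT1A (IC50),5HT1D (IC50),5HT-UT (IC50),5HT1E ()
"5a","Aspirin","CC(=O)Oc1ccccc1C(=O)O",2.0 nM,8.0 nM,12.6 nM,>1000 nM
'''

first = read_affinity_csv(EXAMPLE)[0]
print(first["Compound"], first["affinities"]["5HT1A (IC50)"])  # 5a 2.0 nM
```

Keeping units inside the cell values, as the prompt requires, means a later comparison step has to parse strings such as ">1000 nM" rather than plain numbers.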

In the following tasks, we divide the affinity table extraction task logically, into several direct sub-tasks.

This task evaluates the model’s ability to find the correct SMILES of a molecule given its tag in a document. Usually, a molecule is shown as an image of its structure with the tag below it. The LLM should recognize both and understand the connection between them.

This task evaluates the model’s ability to find all related targets (proteins or cell lines) of the bioassays given a document.

Prompt
User Message: Please show which targets (protein / cell line) this paper’s experiments do against.

The extracted targets are compared to ideal answers by asking GPT-4:

LLM Scoring Prompt
User Message: You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: Please show which targets (proteins / cell lines) this paper’s experiments go against.
************
[Expert]: {ideal}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation. The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don’t matter from the perspective of factuality.

The answer is then given a score according to the model-selected choice: 0.5 for A, 0.75 for B, 1.0 for C, 0.0 for D, and 0.75 for E.
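The mapping from the grader's choice to a numeric score can be sketched as a small lookup. The score values are the ones stated above; the function name and the way the letter is pulled out of the reply (a parenthesized letter, or a bare single letter) are assumptions made for illustration.

```python
import re
from typing import Optional

# Scores per option, as listed in the text.
CHOICE_SCORES = {"A": 0.5, "B": 0.75, "C": 1.0, "D": 0.0, "E": 0.75}


def score_choice(reply: str) -> Optional[float]:
    """Map a grader reply such as '(C)' or 'B' to its score, or None."""
    text = reply.strip()
    match = re.search(r"\(([A-E])\)", text)
    letter = match.group(1) if match else text
    return CHOICE_SCORES.get(letter)


print(score_choice("(C)"))                 # 1.0
print(score_choice("The answer is (D)."))  # 0.0
```

Returning `None` for an unparseable reply keeps grading failures visible instead of silently counting them as zero.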

The analysis and understanding of drug properties, compositions, and processing techniques are critical for drug discovery and development. Often, this information is presented in the form of charts, making it essential to extract and integrate such information with textual data. To further assess the retrieval capabilities of models in the context of drug chart information, we have designed multiple-choice questions focusing on drug composition, processing methods, and properties. The drug ChartQA example follows a format similar to the Alloy ChartQA task in Section 2.2.2.

Organic and bio-catalyzed synthetic reactions are of vital importance to the manufacture of drug-like molecules. Hence we designed a complex task to test the model’s capability in extracting information from schematic diagrams and texts of chemical reactions. We require the model to understand the images specified in the articles and select the correct answer from the provided multiple-choice descriptions.

This comprehensive task evaluates the model’s ability to judge whether the molecule (represented by SMILES) is covered in a document. The LLM should recognize all the Markush formulas and their substituents, and then judge whether the required molecule is covered.

This task evaluates the model’s ability to produce the correct SMILES given a Markush formula (in CXSMILES pattern) and its substituents.

Prompt
System Message: You are ChemistGPT, can help user insert substituents into markush formula to get SMILES formula (removing Hs). For example, if user’s input is "*CCCC* | $R1_{p};;;;;R2_{p}$ |, R1 = OMe, R2 = NH2", you will reply the completed SMILES: "COCCCCN", without explanation. If you can’t answer the completion, just reply "Unknown".
User Message: *C(*)CC(*)CC* | $A;;Pol_{p};;;Q_{e};;;M_{p}$ |, A = H, Pol = NH2, Q = OH, M = [Li]

2.5 Biology

MedMCQA is an evaluation task designed to assess the capability of AI models in understanding and reasoning over medical multiple-choice questions. It originates from a compilation of clinically relevant queries and knowledge assessments.

CompDisease Recognition is the "Compound-Disease" association recognition task, featured in BioGPT as BC5CDR. It evaluates the capability of models to identify and understand the associations between compounds and diseases. The task originates from the BioCreative V challenge and is detailed in the study by Chih-Hsuan Wei et al., "Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task", published in the journal Database in 2016.

Prompt
System Message: You are a biologist AI. I’ll give you the abstract of literature. Please identify all the (compound,disease) relations in the abstract, and just give me a list of all relations you recognized. To be mentioned, all the relations must be strictly presented to me in the format ’(compound 1, disease 1),(compound 2, disease 2),(compound 3, disease 3),…’, without adding any additional comments or explanations!
User Message: [processed text]

Example of processed text:

Example Paragraph: Twenty children with acute lymphoblastic leukemia who developed meningeal disease were treated with a high-dose intravenous methotrexate regimen that was designed to achieve and maintain CSF methotrexate concentrations of 10(-5) mol/L without the need for concomitant intrathecal dosing. The methotrexate was administered as a loading dose of 6,000 mg/m2 for a period of one hour followed by an infusion of 1,200 mg/m2/h for 23 hours. Leucovorin rescue was initiated 12 hours after the end of the infusion with a loading dose of 200 mg/m2 followed by 12 mg/m2 every three hours for six doses and then every six hours until the plasma methotrexate level decreased to less than 1 X 10(-7) mol/L. The mean steady-state plasma and CSF methotrexate concentrations achieved were 1.1 X 10(-3) mol/L and 3.6 X 10(-5) mol/L, respectively. All 20 patients responded to this regimen, 16/20 (80%) achieved a complete remission, and 20% obtained a partial remission. The most common toxicities encountered were transient serum transaminase and bilirubin elevations, neutropenia, and mucositis. One patient had focal seizures and transient hemiparesis but recovered completely. High-dose intravenous methotrexate is an effective treatment for the induction of remission after meningeal relapse in acute lymphoblastic leukemia.

The ideal answer is "(methotrexate,transient hemiparesis), (methotrexate,neutropenia),(methotrexate,seizures), (methotrexate,mucositis)".
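Answers in the "(compound, disease)" list format can be parsed into a set of pairs and compared with the ideal answer. The sketch below makes two assumptions that are not stated in this section: comparison is case-insensitive exact matching, and agreement is summarized as set-based precision, recall, and F1. The function names and the predicted string are hypothetical, and names that themselves contain commas or parentheses are not handled.

```python
import re


def parse_relations(answer: str) -> set[tuple[str, str]]:
    """Parse '(compound, disease),(compound, disease)' into lowercase pairs."""
    pairs = re.findall(r"\(\s*([^(),]+?)\s*,\s*([^(),]+?)\s*\)", answer)
    return {(compound.lower(), disease.lower()) for compound, disease in pairs}


def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 (an assumed metric, see lead-in)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = parse_relations(
    "(methotrexate,transient hemiparesis), (methotrexate,neutropenia),"
    "(methotrexate,seizures), (methotrexate,mucositis)"
)
# Hypothetical model output, used only to exercise the functions.
predicted = parse_relations(
    "(Methotrexate, neutropenia),(methotrexate, mucositis),(leucovorin, seizures)"
)
print(precision_recall_f1(predicted, gold))
```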

The GeneDisease Text Mining task is for "Gene-Disease" association semantics text mining. This task evaluates the ability of models to extract and understand the relationships between genes and diseases from scientific literature, focusing on the identification of gene and disease entities as the primary objects of evaluation.

To test the model’s text comprehension and triplet extraction abilities, we predefine entities and relationships that capture complex biomedical interactions in the form (entities, relationship, entities).

Entities fall into two groups. Molecular entities include Disease, Gene, Protein, and Enzyme, each representing a biological component crucial for understanding molecular biology and genetics. Trigger-word entities include Variation (Var), indicating changes in DNA, RNA, proteins, or molecular structures; Molecular Physiological Activity (MPA), referring to molecular-level activities such as gene expression; Interaction, denoting connections between molecules or between molecules and cells; Pathway, encompassing various biological pathways; Cell Physiological Activity (CPA), representing activities at or above the cellular level, including responses and development; and the regulatory terms Regulation (Reg), Positive Regulation (PosReg), and Negative Regulation (NegReg), which indicate neutral, gain-of-function, and loss-of-function cues, respectively.

Two kinds of relationships are defined. The semantic roles are ThemeOf, which points from the main event (Theme) entity to the current entity, and CauseOf, which points from the current entity to the cause entity. The regulatory types are LOF (loss of function), GOF (gain of function), REG (general regulation), and COM (complex functional changes between genes and diseases); they are used to predict "gene; function change; disease" triples from the text.

For extracting triplets (entities, semantic roles, entities), we prompt the model with:

Prompt

System Message: In this semantic role regconition task, you need to follow 3 steps, and finally just return me triples that needed. First, you need to identify the entities in the text. Entities can be classified into 2 categories–molecular, and trigger word. ’Molecular’ includes disease, gene, protein, and enzyme. ’Trigger word’ includes 1)Variation(Var), which means DNA, RNA, and mutations in proteins and changes in molecular structure, e.g. ’mutations on the Arg248 and Arg282’, ’mutant R282W’, ’missense mutations’; 2)Molecular Physiological Activity (MPA), including molecular activity, gene expression and molecular physiological activity, e.g. ’phosphorylation’, ’transcription’, ’histone methylation’, ’bioactivation of cyclophosphamide’; 3)Interaction, molecule-to-molecule or molecule-to-cell connections, e.g. ’bind’, ’interaction’; 4)Pathway, e.g. ’Bmp pathway’,’PI3K pathway’; 5)Cell Physiological Activity (CPA), Activities at or above the cellular level, including cellular reactivity and cell or organ development and growth, e.g. ’T helper cell responses’, ’renal development’; 6)Regulation (Reg), a neutral cue word or phrase meaning no loss or gain, e.g. ’resolved in’, ’regulated’; 7)Positive Regulation (PosReg), a cue word or phrase that indicates the acquisition of a function, e.g. ’facilitates’, ’enhanced’, ’increased’; 8)Negative Regulation (NegReg), a clue word or phrase that indicates a loss of function, e.g. ’suppressed’, ’decreased’, ’inhibited’. Second, you need to identify the semantic role labeling objects, including ’ThemeOf’(from the main thing entity to the current entity) and ’CauseOf’(From the current entity to the Cause entity). Third, please give me tripples that contain entities and semantic role labeling objects(ThemeOf or Causeof). Note: You should only give me the triples like this form: (…, …, …),(…, …, …),…."

User Message: [processed text]

Example of processed text:

Example Paragraph A novel frameshift mutation (+G) at codons 15/16 in a beta0 thalassaemia gene results in a significant reduction of beta globin mRNA values. AIMS: To identify a novel beta globin gene mutation found in a Chinese family, and also to assess its functional consequences. METHODS: Haematological analysis was performed on all family members. The 23 common mutations of beta thalassaemia found in Chinese populations were detected by means of a reverse dot blot method. Direct DNA sequencing of polymerase chain reaction (PCR) amplified complete beta globin gene was carried out to identify the novel mutation. A real time, one step reverse transcription PCR assay was used to measure beta globin mRNA in the reticulocytes of heterozygous patients. RESULTS: A novel frameshift mutation-an insertion of G between codons 15 and 16 in a homonucleotide run of four guanines-was determined, which generates a new premature chain terminator at the 22nd codon. Relative quantitative analysis of the beta globin mRNA in heterozygous subjects demonstrated a 39.83% reduction compared normal controls. CONCLUSIONS: The significantly lower amounts of beta globin mRNA found in mutation carriers is probably caused by the rapid nonsense mediated degradation of the mutant mRNA. These data, combined with haematological analysis, suggest that this novel mutation of CDs 15/16 (+G) results in a beta(0) thalassaemia phenotype. The ideal answer is "(frameshift,CauseOf,reduction),(caused by,CauseOf,lower),(mutation,CauseOf,results in),(beta(0) thalassaemia,ThemeOf,results in),(beta globin mRNA,ThemeOf,reduction),(beta0 thalassaemia gene,ThemeOf,frameshift),(insertion,CauseOf,generates),(premature chain terminator,ThemeOf,generates),(amounts of beta globin mRNA,ThemeOf,lower),(mutation,CauseOf,caused by),(degradation,ThemeOf,caused by)".

For extracting triplets (entities, regulatory types, entities), we prompt the model with:

Prompt

System Message: In this Gene-Disease relation extraction task, you need to follow 3 steps. You need to extract the (gene, function change, disease) triplet from the text, such as: (SHROOM3, LOF, Neural tube defects). The second element in the triple means the regulation that the gene produces to the disease. Types of regulations are: LOF and GOF, which indicate loss or gain of function; REG, which indicates a general regulatory relationship; COM, which indicates that the functional change between genes and diseases is more complex, and it is difficult to determine whether the functional change is LOF or GOF. Please return all the relations extracted from the text in ternary format (GENE, FUNCTION, DISEASE). If there are more than one triple, please write in this form: ’(GENE, FUNCTION, DISEASE),(GENE, FUNCTION, DISEASE),……’.

User Message: [processed text]

Example of processed text:

Example Paragraph Mutation spectra in autosomal dominant and recessive retinitis pigmentosa in northern Sweden. Retinal degenerations represent a heterogeneous group of disorders affecting the function of the retina. The frequency of retinitis pigmentosa (RP) is 1/3500 worldwide, however, in northern Sweden it is 1/2000 due to limited migration and a ’founder’ effect. In this study we identified genetic mechanisms underlying autosomal dominant and recessive RP present in northern Sweden. Several novel mutations unique for this region were found. In an autosomal recessive form of RP, Bothnia dystrophy caused by mutations in the RLBP1 gene, bi-allelic mutations R234W, M226K and compound heterozygosity, M226K+R234W was detected.In dominant form of RP mapped to 19q13.42 a 59 kb genomic deletion including the PRPF31 and three other genes was found.These data provide additional information on the molecular mechanisms of RP evolvement and in the future might be useful in development of therapeutic strategies. Identification of the disease-causing mutations allowed introducing molecular genetic testing of the patients and their families into the clinical practice. The ideal answer is "(RLBP1, REG, Bothnia dystrophy)".
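Both Gene-Disease prompts request answers as comma-separated triples, and entity names may themselves contain parentheses (for example "beta(0) thalassaemia"). The sketch below shows one way such replies could be parsed and filtered by the allowed middle element. It is a hedged illustration rather than the SciAssess implementation: `parse_triples`, `SEMANTIC_ROLES`, and `REGULATION_TYPES` are hypothetical names, the label sets are taken from the task description above, and entity names are assumed to contain no commas (triples that break this assumption are skipped).

```python
import re
from typing import List, Set, Tuple

# Label sets as listed in the task description above.
SEMANTIC_ROLES = {"ThemeOf", "CauseOf"}
REGULATION_TYPES = {"LOF", "GOF", "REG", "COM"}


def parse_triples(answer: str, allowed_labels: Set[str]) -> List[Tuple[str, str, str]]:
    """Parse '(a, label, b),(c, label, d),...' and keep triples with an allowed label."""
    body = answer.strip()
    # Drop only the outermost pair of parentheses; inner ones such as
    # 'beta(0)' belong to entity names and must survive.
    if body.startswith("("):
        body = body[1:]
    if body.endswith(")"):
        body = body[:-1]
    triples = []
    # Triples are separated by '),(' with optional whitespace around the comma.
    for chunk in re.split(r"\)\s*,\s*\(", body):
        parts = [part.strip() for part in chunk.split(",")]
        # Assumption: entity names contain no commas, so a valid chunk has 3 parts.
        if len(parts) == 3 and parts[1] in allowed_labels:
            triples.append((parts[0], parts[1], parts[2]))
    return triples


# Two triples taken from the semantic-role ideal answer above.
roles = parse_triples(
    "(beta(0) thalassaemia,ThemeOf,results in),(frameshift,CauseOf,reduction)",
    SEMANTIC_ROLES,
)
# The regulatory-type ideal answer above.
regulations = parse_triples("(RLBP1, REG, Bothnia dystrophy)", REGULATION_TYPES)
print(roles)
print(regulations)
```

Matching the middle element exactly is a deliberately strict choice; a more lenient evaluation could normalize case before checking the label.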

The analysis and understanding of biological properties, compositions, and processing techniques are critical to discovery and development in the life sciences. This information is often presented in charts, so it is essential to extract it and integrate it with textual data. To further assess models’ ability to retrieve information from biological charts, we have designed multiple-choice questions. Examples of biological ChartQA follow the same format as the Alloy ChartQA task in Section 2.2.2.

3 Data Quality, Privacy, and Copyright Compliance

To safeguard the quality and ethical standards of our dataset, meticulous steps were undertaken in its preparation and validation:

Expert Validation: To guarantee the accuracy and reliability of SciAssess, all tasks have been subjected to multiple rounds of cross-validation by domain experts to ensure the correctness of dataset labels and maintain high-quality standards.

Screening and Anonymization: SciAssess underwent thorough screening for sensitive information. Any potential sensitive data identified was either removed or anonymized to protect privacy and ensure data security.

Copyright Compliance: We actively enforced rigorous copyright review procedures across all documents and data used, ensuring SciAssess adheres to legal standards and ethical norms without infringing on intellectual property rights.

Models: We evaluated the ability of leading LLMs to analyze scientific literature:

GPT-4: OpenAI’s GPT-4 excels in text generation and comprehension, augmented with capabilities for image processing, code interpretation, and information retrieval. These features make it adept at handling the complexities of scientific texts, positioning it as a versatile tool for scientific research. Because the newest version of GPT-4 can produce its answer through the code interpreter, we use a chain-of-thought (CoT) prompt to extract its final result. The CoT prompt is listed in Appendix B.

GPT-3.5: Preceding GPT-4, GPT-3.5 by OpenAI distinguishes itself with adept language processing skills, enabling effective engagement with complex texts.

Gemini: Google DeepMind’s Gemini model family excels in multimodal comprehension, integrating text, code, image, and audio analysis. Particularly notable is its performance on the MMLU test, where Gemini-1.0-Ultra outperforms human benchmarks. Because we had not yet received API access to Gemini-1.0-Ultra, we evaluated Gemini-1.0-Pro instead. Its adeptness at understanding and synthesizing complex scientific content makes it an advanced tool for academic research, offering insights and enhancing productivity in scientific literature analysis.

Baseline Workflow: SciAssess builds on an adapted version of the framework provided by openai/evals (https://github.com/openai/evals). We have added support for invoking further models (e.g., Gemini), bespoke tasks and evaluation metrics, datasets, and a PDF processing module, and we intend to release the detailed code shortly. Most of SciAssess centers on scholarly literature, and we process literature PDFs differently for each model:

For GPT-4, we utilize the web-based ChatGPT4 interface. We directly upload the original PDF files to the chat interface and pose queries, thereby leveraging OpenAI’s built-in PDF processing capabilities.

For GPT-3.5, we use PyPDF2 to convert the PDFs into text and then feed the plain text into the model.

For Gemini, given its proficiency in processing text and images together, we first use PyPDF2 to extract the text from the PDFs. We then use PyMuPDF to retrieve the images within the documents and sort them into reading order, and finally pass both the text and the images to the model.
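The text does not say how the extracted images are sorted into reading order. One plausible reading, sketched below purely as an assumption, is to order them by page, then top to bottom, then left to right using their bounding boxes. The `ImageBlock` type and `sort_reading_order` function are hypothetical, the coordinates are made up, and the sketch assumes a single-column layout with a top-left origin; it does not call PyPDF2 or PyMuPDF.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageBlock:
    """Hypothetical record for one extracted image and its position."""
    page: int    # zero-based page index
    x0: float    # left edge of the bounding box
    y0: float    # top edge of the bounding box (assumed origin: top-left)
    name: str    # label used to refer to the image


def sort_reading_order(blocks: List[ImageBlock]) -> List[ImageBlock]:
    """Order images by page, then top-to-bottom, then left-to-right.

    Assumption: single-column layout. A multi-column paper would need
    column detection before this ordering is meaningful.
    """
    return sorted(blocks, key=lambda block: (block.page, block.y0, block.x0))


# Made-up coordinates, listed in an arbitrary extraction order.
extracted = [
    ImageBlock(page=1, x0=50.0, y0=50.0, name="fig_d"),
    ImageBlock(page=0, x0=50.0, y0=400.0, name="fig_c"),
    ImageBlock(page=0, x0=300.0, y0=100.0, name="fig_b"),
    ImageBlock(page=0, x0=50.0, y0=100.0, name="fig_a"),
]
ordered_names = [block.name for block in sort_reading_order(extracted)]
print(ordered_names)
```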

2 Results and Analysis

In this section, we analyze the performance of LLMs across various scientific domains, focusing on their capabilities in memorization, comprehension, and analysis, and comparing their performance on tasks both involving and excluding multimodal content.

2.1 Overall Performance

Table 5 summarizes the overall performance of the LLMs across scientific domains and reveals the distinct strengths and weaknesses of each model. GPT-4 outperforms the other models in almost all domains and secures the best overall average rank, demonstrating its superior adaptability and understanding of complex scientific literature. GPT-3.5, though lagging behind GPT-4, is competent across a broad spectrum of tasks, indicating its robustness. Gemini, despite finishing third in the overall rankings, shows strength in some particular tasks.

2.2 Domain-Specific Performances

Across scientific domains, GPT-4 performs best in nearly all fields; the exception is biology, where it ranks on par with Gemini. This highlights GPT-4’s ability to comprehend scientific literature and its robust adaptability. Despite ranking third overall, Gemini matches GPT-4 in the biology domain, which suggests possible domain-specific strengths. In the drug discovery domain, all models scored near zero on the "Tag2Molecule" task. This indicates a limitation in current models’ ability to handle highly specialized chemical content and complex molecular structure transformations. These findings reveal the strengths and limitations of different models within specific scientific domains, providing valuable insights for future model improvements.

2.3 Abilities Comparison

Table 6 shows the average rankings of the models in memorization, comprehension, and analysis abilities.

Memorization (L1) reflects a model’s capacity to recall information it has previously learned. In this capacity, GPT-4 stands out with the highest average ranking, demonstrating its superiority. For instance, in the "MMLU High-School Chemistry" task, GPT-4 showcased its accurate recall of foundational chemistry knowledge, leading other models with an accuracy rate of 0.591. This advantage of GPT-4 can likely be attributed to its vast training dataset, enabling it to cover a broader range of scientific knowledge domains.

Comprehension (L2) measures a model’s understanding of complex texts and its capability to extract key information. GPT-4 continues to lead in comprehension ability, exhibiting exceptional performance across multiple tasks. For example, in the "Abstract2Title" task, GPT-4 achieved a model-graded score of 0.99, ranking at the top and showing its profound understanding of the text content and accurate generation of relevant titles.

Analysis and Reasoning (L3) refers to a model’s capacity to handle complex problems, reason, and generate solutions. GPT-4 marginally leads in this capability with an average rank of 1.75, indicating its proficiency in applying knowledge to analyze situations and deduce conclusions. For instance, in the "Sample Differentiation" task, GPT-4 achieved an accuracy of 0.528, significantly outperforming GPT-3.5 and Gemini, which scored 0.177 and 0.059, respectively. This demonstrates GPT-4’s superior ability to distinguish between complex samples, further highlighting its strength in tasks requiring in-depth analysis and reasoning.
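The average ranks discussed above can be illustrated with a short sketch: rank the models within each task (rank 1 is best) and average those ranks across tasks. This is not the authors’ code. The function names are hypothetical, giving tied models the mean of the positions they share is an assumption (the text does not describe tie handling), and only the "Sample Differentiation" scores come from the text; the second task and its scores are invented to show a tie.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models on one task; higher score is better, rank 1 is best.

    Assumption: tied models share the mean of the positions they occupy.
    """
    ordered = sorted(scores, key=lambda model: scores[model], reverse=True)
    ranks: Dict[str, float] = {}
    start = 0
    while start < len(ordered):
        end = start
        # Extend the group while the next model has the same score.
        while end + 1 < len(ordered) and scores[ordered[end + 1]] == scores[ordered[start]]:
            end += 1
        shared_rank = ((start + 1) + (end + 1)) / 2
        for position in range(start, end + 1):
            ranks[ordered[position]] = shared_rank
        start = end + 1
    return ranks


def average_rank(task_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each model's per-task rank over all tasks."""
    collected: Dict[str, List[float]] = {}
    for scores in task_scores.values():
        for model, rank in rank_models(scores).items():
            collected.setdefault(model, []).append(rank)
    return {model: sum(ranks) / len(ranks) for model, ranks in collected.items()}


task_scores = {
    # Accuracies reported in the text above.
    "Sample Differentiation": {"GPT-4": 0.528, "GPT-3.5": 0.177, "Gemini": 0.059},
    # Hypothetical task with made-up scores, included only to show a tie.
    "hypothetical_task": {"GPT-4": 0.6, "GPT-3.5": 0.6, "Gemini": 0.2},
}
averages = average_rank(task_scores)
print(averages)
```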

2.4 Influence of Multimodal Content

Table 7 demonstrates how multimodal content influences the rankings of models. The term "multimodal content" refers to elements beyond text, such as tables, charts, and molecules. GPT-4 consistently maintains its leading position in tasks both with and without multimodal content, showcasing its exceptional adaptability and multimodal understanding capabilities.

However, the data reveal significant potential for improvement across all models when dealing with multimodal content. This is particularly evident in the drug discovery domain, where tasks frequently involve a substantial amount of multimodal content, thereby challenging the models’ processing abilities. For example, GPT-4’s accuracy in the "Tag2Molecule," "Reaction QA," and "Molecule in Document" tasks was below 0.2. This highlights the substantial challenges that even the most advanced LLMs face in processing highly complex multimodal scientific content. These observations emphasize the need for continuous improvements in multimodal understanding in the context of scientific literature.

Conclusion

The SciAssess project aims to rigorously assess the capabilities of LLMs in the field of scientific literature analysis. This benchmark covers several specific scientific domains: General Chemistry, Alloy Materials, Organic Materials, Drug Discovery, and Biology, focusing on evaluating LLMs’ core competencies in memorization, comprehension, and analysis within these contexts. Through detailed evaluations of leading models, including GPT-4, GPT-3.5, and Gemini, our research has illuminated their strengths and pinpointed aspects requiring enhancement, thus providing robust support for the continued development of LLMs in the realm of scientific research. Moving forward, we intend to broaden the scientific domains included in our benchmark tests and incorporate more complex multimodal datasets, significantly improving the benchmark’s utility and efficacy. This effort seeks to provide clearer guidance and enhance the use of LLMs in advancing scientific research and innovation.

Appendix

Appendix A Question Type

Five types of questions, as illustrated in Figure 1, are devised to evaluate the models: True/False questions, Multiple Choice questions, Open-ended Generation, Constrained Generation, and Table Extraction. Each question type is accompanied by a detailed description and representative examples, along with the corresponding metrics used for assessment. For convenience, the input in each example is simplified, and its instruction is omitted.

Appendix B Chain-of-Thought Implementation

Chain-of-thought (CoT) is implemented in two steps. For any chat prompt, the first step is CoT reasoning, for example:

CoT reasoning

System Message: You are ChemistGPT, which can help the user insert substituents into the Markush formula to get SMILES formula (removing Hs). For example, if the user’s input is "*CCCC* | $R1_{p};;;;;R2_{p}$ |, R1 = OMe, R2 = NH2", you will reply with completed SMILES: "COCCCCN", without explanation. If you can’t answer the completion, just reply "Unknown".

User Message: *C(*)CC(*)CC* | $A;;Pol_{p};;;Q_{e};;;M_{p}$ |, A = H, Pol = NH2, Q = OH, M = OLi Before answering, reason in a step-by-step manner to get the right answer, then conclude with the answer.

The second step is CoT extraction: