Can multiple-choice questions really be useful in detecting the abilities of LLMs?

Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, Noa Garcia

Introduction

Over the past few years, large language models (LLMs) have exhibited remarkable performance on a wide range of question-answering (QA) tasks (Brown et al., 2020; Kadavath et al., 2022; Robinson and Wingate, 2023). The evaluation of LLMs’ strengths and limitations often relies on diverse benchmarks presented in different formats (Singhal et al., 2023; Liu et al., 2023b), domains (Jin et al., 2019; Zhong et al., 2020), and languages (Petroni et al., 2019; Bang et al., 2023). As previous research has shown (Liang et al., 2022; Chang et al., 2023; Li et al., 2023a; Chia et al., 2023), evaluation using benchmarks is essential for the detection and mitigation of various issues such as misinformation (Zheng et al., 2021; Gao et al., 2022), hate speech (ElSherief et al., 2021; Lu et al., 2023), and malicious uses (Xu et al., 2021; Ganguli et al., 2022; Shaikh et al., 2023; Zou et al., 2023). Such mechanisms are critical for safeguarding against harmful content and promoting responsible usage of LLMs in various contexts.

QA benchmarks come in a variety of formats, including True/False questions (TFQs), in which models predict whether a statement in the question is correct or not, multiple-choice questions (MCQs), in which multiple candidate answers accompany the input question, and long-form generation questions (LFGQs), in which a generated answer could span multiple sentences. Among these, multiple-choice is the most popular format as it allows a simple and quick assessment of model performance (Bhakthavatsalam et al., 2021; Ramamurthy and Aakur, 2022; Liu et al., 2023a; Huang et al., 2023). However, MCQs also present several limitations, such as potential misalignment with real-world use cases where LLMs are often required to answer questions in long-form generation format (Nuance, 2023; Bommasani et al., 2021). In addition, LLMs have been shown to be affected by changes in the position of the candidate answers (Zheng et al., 2023; Wang et al., 2023) and their contents (Pezeshkpour and Hruschka, 2023) when answering MCQs. These problems highlight the limitations of MCQ benchmarks in evaluating LLMs, which could lead to an overestimation of LLMs’ capabilities.

With the above issues in mind, our motivation is to explore the limitations and characteristics of both MCQs and LFGQs as main evaluation formats in QA tasks. We aim to answer the following research questions:

How does the arrangement of options in MCQs influence LLMs’ selection of responses?

What methodologies can be employed to conduct comprehensive comparative experiments between MCQs and LFGQs? Additionally, what specific aspects should be considered when conducting comparative tests?

The answers to these questions contribute to the understanding and comparison between MCQs and LFGQs as evaluation formats in QA tasks. Given the prevalence of MCQs as the dominant evaluation format, our aim is to thoroughly examine their efficacy. This begins with a detailed exploration of MCQs’ capabilities and subsequently extends to a comparative analysis with LFGQs, providing a comprehensive assessment of both formats.

We address the first question by conducting a series of experiments (§3) that reveal the sensitivity of LLMs when answering MCQs by applying slight perturbations to the positional order of the options. We find significant differences between the answers in multiple LLMs (§3.1). We also identify specific patterns of the selected answer according to its position that vary among different LLMs (§3.2). For the second question, we conduct comparative experiments (§4) to quantify the misalignment between MCQs and LFGQs in three different spaces: direct output space (§4.1), token logits space (§4.2), and hidden embedding space (§4.3). By doing this, we aim to gain a deeper understanding of the characteristics that distinguish the two types of questions. Our main findings are as follows:

LLMs exhibit order sensitivity in bilingual MCQs, favoring answers at the first position.

Answers obtained from MCQs and LFGQs for identical questions have a low correlation.

Higher consistency does not indicate better model performance.

The misalignment between MCQs and LFGQs is evident in the evaluation performance as well as in the embedding space.

Overall, our study aims to provide a better understanding of the difference in QA formats in LLM evaluation, uncover underlying patterns, and shed light on the improvement of current methods.

Experimental Details

We use different models in different experiments, tailoring our choices to the specific goals of each experiment, as summarized in Table 1. To check whether LLMs are sensitive to the order of the candidate answers (§3), we evaluate three models: ChatGLM-6B (Zeng et al., 2023; Du et al., 2022) and two models from the GPT family, namely GPT-3.5-turbo (OpenAI, 2023b) and GPT-4 (OpenAI, 2023a). In comparing MCQs and LFGQs (§4), we again use different models in the three different spaces. For the direct output space (§4.1), we use GPT-3.5-turbo, GPT-4, and ChatGLM-6B, considering both their diversity and performance. For the token logits space (§4.2), we only test GPT-3.5-turbo, as it is the only one of the three models that can output token probabilities (Manakul et al., 2023). Finally, in the embedding space (§4.3), we conduct experiments with models from multiple popular LLM families across various sizes, including StableLM-Tuned-Alpha-3/7B (Stability-AI, 2023), RedPajama-INCITE-Instruct-3B-v1 (Computer, 2023), Llama-2-7b-chat-hf (Touvron et al., 2023b), Dolly-v2-3/7/12B (Conover et al., 2023), Vicuna-7b-v1.3 (Chiang et al., 2023), and Open-llama-3/7B (Touvron et al., 2023a).

We conduct experiments on four evaluation benchmarks:

CARE-MI (Xiang et al., 2023): A Chinese benchmark for evaluating LLM misinformation in the maternity and infant care domain. It includes $1,612$ LFGQs. Following the question generation process of CARE-MI, the questions can also be obtained as MCQs and TFQs from the original MLEC-QA (Li et al., 2021) and MEDQA (Jin et al., 2020) datasets, so that each question is available in the three formats: MCQ, LFGQ, and TFQ.

M3KE (Liu et al., 2023a): A dataset with $20,477$ standard Chinese questions for $71$ tasks in MCQ format, encompassing all major levels of the Chinese education system and covering humanities, history, politics, law, education, psychology, science, technology, art, and religion. Each question presents four candidate answers.

ARC (Clark et al., 2018): A dataset with natural, grade-school science questions (authored for human tests) in English. It is the largest public-domain set of this kind with $7,787$ questions. Each question contains four candidate answers.

MATH: A synthetic dataset randomly generated by a script with simple mathematical questions in English. Each question has four candidate answers.

As shown in Table 2, we ensure data balance by using a similar number of samples from each dataset.

For the first research question (§3), we use all the datasets: CARE-MI, M3KE, ARC, and MATH. In the second research question (§4), we use the CARE-MI dataset for the direct output (§4.1) and token logits analysis (§4.2) as it is the only dataset offering the three different QA formats. The ARC dataset is used on the embedding space analysis (§4.3), wherein we extend its MCQs to LFGQs by not presenting the candidate answers to the LLMs.

The selection of benchmarks is guided by three specific criteria: (1) Source diversity: we aim to conduct our analyses across different domains. (2) Language: we presume that language is a potential factor influencing LLM evaluation performance, so we conduct experiments on two high-resource languages, i.e., Chinese and English. (3) Performance level: by incorporating benchmarks on which LLMs demonstrate varying levels of performance, we aim to better understand how model proficiency influences results in the different QA formats. Additionally, in the token logits space (§4.2), we investigate changing the number of candidate answers to explore their impact on the expected calibration error.

To encourage the generation of concise responses, we provide LLMs with prompts both prior to (pre-prompt) and following (post-prompt) each question in any dataset format. Additionally, for MCQs, each question is accompanied by four candidate options, with only one being correct. The pre-prompt for MCQs is “Please select a correct option”, while the post-prompt is “Only one option can be selected. No explanation is allowed”. For LFGQs, the post-prompt is “Just answer in one sentence”, and for TFQs, it is “Just answer ‘yes’ or ‘no’”. We find that this prompt design facilitates the generation of brief content, aiding subsequent accuracy evaluation (§4.1) and automatic confidence calculation (§4.2).
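The prompt design described above can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt strings are quoted from the text, while the function name `build_prompt`, the dictionary layout, and the newline-joined template are assumptions. The text only states a pre-prompt for MCQs, so none is assumed here for LFGQs or TFQs.

```python
# Prompt strings are quoted from the text; everything else (names, layout,
# newline-joined template) is an illustrative assumption.
PRE_PROMPT = {"MCQ": "Please select a correct option"}
POST_PROMPT = {
    "MCQ": "Only one option can be selected. No explanation is allowed",
    "LFGQ": "Just answer in one sentence",
    "TFQ": "Just answer 'yes' or 'no'",
}


def build_prompt(question, qa_format, options=None):
    """Wrap a question with the pre-/post-prompt of its QA format.

    `options` is a list of (label, text) pairs and is required for MCQs.
    """
    if qa_format not in POST_PROMPT:
        raise ValueError(f"unknown QA format: {qa_format}")
    parts = []
    if qa_format in PRE_PROMPT:
        parts.append(PRE_PROMPT[qa_format])
    parts.append(question)
    if qa_format == "MCQ":
        if not options or len(options) != 4:
            raise ValueError("an MCQ needs exactly four candidate options")
        parts.extend(f"{label}. {text}" for label, text in options)
    parts.append(POST_PROMPT[qa_format])
    return "\n".join(parts)
```

Keeping the format-specific strings in dictionaries makes it easy to present the same question in all three formats while changing only the `qa_format` argument.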

Are LLMs sensitive to the order of candidate answers?

We first investigate how the arrangement of candidate answers in MCQ datasets affects the evaluation of LLMs. We find that LLMs consistently exhibit a strong preference for specific positions when presented with options in different orders, as illustrated in Figure 1.

To check whether there are significant differences in LLMs’ answers when the candidate options are arranged in a different order, we employ the chi-squared test (McHugh, 2013). To isolate the influence of the correct answer, we designate option D as the only correct option for all the questions. Then, we establish two scenarios: in CASE1, the option order is ‘ABCD’, and in CASE2, it is ‘BACD’. Importantly, when arranging the option order, we also rearrange the contents and positions of each candidate option accordingly, rather than simply altering the numbering, as shown in Figure 1.
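One reading of the two scenarios is that each label keeps its own content and the pair moves together, so CASE2 shows option B (with B's content) first and option A second. The sketch below illustrates that reading only; the function name `render_options` and the sample question contents are hypothetical.

```python
def render_options(options, order):
    """Return option lines in the requested display order.

    `options` maps each label to its content. A label always keeps its own
    content, so reordering moves label and content together (an assumed
    reading of the setup described in the text).
    """
    if sorted(order) != sorted(options):
        raise ValueError("order must be a permutation of the option labels")
    return [f"{label}. {options[label]}" for label in order]


# Hypothetical question contents; the correct answer is fixed at option D.
options = {"A": "Oxygen", "B": "Nitrogen", "C": "Helium", "D": "Carbon dioxide"}
case1 = render_options(options, "ABCD")  # original order
case2 = render_options(options, "BACD")  # first two options interchanged
```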

In the chi-squared test, we set the null hypothesis, $H_{0}$, stating that the responses in CASE1 and CASE2 originate from the same distribution. The chi-squared statistic is calculated as

$$X^{2}=\sum_{i=1}^{N}\frac{(O_{i}-R_{i})^{2}}{R_{i}},$$

where the sum runs over the $N$ candidate options, $O_{i}$ is the frequency of each option in CASE1, and $R_{i}$ is the frequency of each option in CASE2. With the significance test, we obtain the p-value $P$ from the chi-squared distribution based on the chi-squared statistic and the degrees of freedom, $N-1=3$. Additionally, we calculate the accuracy gap, which represents the difference between the original accuracy and the accuracy after reordering.
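The test can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions: the statistic is taken to be the goodness-of-fit form, the sum over options of $(O_{i}-R_{i})^{2}/R_{i}$ with the CASE2 frequencies as the reference, and the p-value uses the closed-form survival function of the chi-squared distribution with exactly three degrees of freedom. The function names are hypothetical.

```python
import math


def chi_squared_statistic(observed, reference):
    """Sum of (O_i - R_i)^2 / R_i over the candidate options.

    Assumes every reference frequency R_i is non-zero.
    """
    if len(observed) != len(reference):
        raise ValueError("both frequency lists must cover the same options")
    return sum((o - r) ** 2 / r for o, r in zip(observed, reference))


def p_value_df3(statistic):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom.

    Closed form valid for df = 3 only:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return (math.erfc(math.sqrt(statistic / 2))
            + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2))


def accuracy_gap(original_accuracy, reordered_accuracy):
    """Difference between the accuracy in CASE1 and the accuracy in CASE2."""
    return original_accuracy - reordered_accuracy
```

Applied to the two statistics reported for GPT-4 in the text, `p_value_df3(2.681)` and `p_value_df3(4.513)` come out at roughly 0.443 and 0.211, matching the reported p-values.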

Results are shown in Table 3, from which we note the following observations:

There is a considerable disparity in LLMs’ outputs across the two scenarios. Except for two instances, the GPT-4 model on the ARC ($X^{2}=2.681$, $P=0.443$) and the MATH ($X^{2}=4.513$, $P=0.211$) datasets, all the results have a p-value $P<0.05$, rejecting the null hypothesis and implying that the distribution of answers predicted by the model varies significantly when options A and B are interchanged. This indicates that the order of options significantly influences LLMs’ predictions in MCQ datasets.

Among the GPT family, the rearrangement of options has a more pronounced effect on GPT-3.5-turbo, with bigger accuracy gaps, than on GPT-4.

Higher accuracy can mitigate significant differences in the order arrangement to some extent. Results from GPT-4 on the ARC and the MATH datasets indicate that high accuracies ( $\geq 0.780$ ) can lead to not rejecting the null hypothesis.

There is no evident correlation between the accuracy gap and the original accuracy. A higher accuracy does not necessarily imply a lower gap between the two scenarios.

3.2. Pattern Decomposition

Next, we further explore the pattern decomposition of LLMs to investigate potential patterns underlying their sensitivity to order. We propose the following two hypotheses for exploration: (1) LLMs may have different positional preferences due to their different model bases; (2) LLMs may have different positional preferences depending on whether they have previously memorized the contents of the datasets.

We use the same LLMs as in §3.1, as they can provide concise answers to the questions and come from different model bases. Regarding the datasets, CARE-MI, M3KE, and ARC are derived from website documents, while the MATH dataset, synthetically generated by us, ensures that the LLMs have not been exposed to identical questions during training. Results are presented in Table 4, from which we can extract the following conclusions:

Within the GPT family, GPT-3.5-turbo and GPT-4 exhibit different behavior. When predicting incorrect options (A, B, or C), GPT-4 shows a stronger inclination towards the option positioned first compared to GPT-3.5-turbo. Specifically, when option B is presented first, GPT-3.5-turbo tends to lean towards selecting option B. Furthermore, ChatGLM-6B showcases a certain preference for the first two options.

The behavior of the LLMs remains consistent across datasets originating from different languages and sources. Hence, we can conclude that the dataset’s language or source, regardless of whether the models were previously exposed to them or not, is not the underlying cause of the models’ positional preferences.

3.3. Yes, LLMs are sensitive to ordering

Our experiments showed that the order of candidate answers in MCQs significantly impacts LLMs’ outputs. GPT-3.5-turbo and GPT-4 exhibited different preferences, while ChatGLM-6B showed a certain preference for the first two positions. In addition, the positional preferences in each LLM seemed to remain consistent across datasets originating from different languages and sources.

These findings are problematic because they reveal potential biases and inconsistencies in LLM outputs, which can affect the reliability and accuracy of their responses. Failure to understand and address these preferences may lead to biased recommendations, inaccurate information retrieval, and flawed decision-making. It is important to develop methods to mitigate these effects, as well as to investigate evaluation protocols that are less impacted by positional preferences. In light of these observations, in the next section, we compare MCQs and LFGQs evaluation methods.

Multiple Choice vs Long Form Generation

To compare QA evaluation formats and gain a deeper understanding of LLMs evaluation protocols, we expose several LLMs to the same questions presented in different formats. Then, we analyze and compare the results in three spaces: the direct output space (§4.1), the token logits space (§4.2), and the embedding space (§4.3).

In the direct output space, which refers to the responses generated by the LLMs, accuracy is one of the most common evaluation metrics used for benchmarking purposes and performance assessment. The difference in accuracy between the MCQs and LFGQs formats is the first aspect we consider (§4.1.1). Additionally, we study the relationship between consistency and accuracy (§4.1.2) by exploring whether LLMs, if familiar with a particular concept, tend to generate responses that are similar and encompass consistent factual information (Manakul et al., 2023).

We randomly select $100$ samples from the CARE-MI dataset and evaluate GPT-4, GPT-3.5-turbo, and ChatGLM-6B on them. For MCQs, accuracy is computed as usual, i.e., if the predicted answer matches the ground truth, it is considered correct. For LFGQs, accuracy is determined through human evaluation, with $0$ denoting an incorrect response and $1$ a correct one. Figure 2 (top) compares the accuracy between MCQs and LFGQs across the three LLMs. Notably, the accuracies of MCQs are consistently higher than those of LFGQs. This difference can be attributed to the fact that MCQs offer candidate options, facilitating the prediction task. To delve deeper into the analysis, in Figure 2 (bottom), we visualize a matrix in which, for the same question, there are four scenarios: 1) the response is correct in both formats, 2) the response is incorrect in both formats, 3) the response is correct in MCQs but incorrect in LFGQs, and 4) the response is correct in LFGQs but incorrect in MCQs. Results show that there is a relatively large number of questions that the LLMs answer correctly in the MCQ format but fail in the LFGQ format. Furthermore, we quantify the differences in accuracy produced by the two formats with Pearson correlation coefficients. The obtained values are remarkably low: $0.39$ for GPT-4, $0.7$ for GPT-3.5-turbo, and $0.33$ for ChatGLM-6B, clearly indicating that different versions of the same question yield different answers from the LLM.
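The four-scenario matrix and the correlation between formats can be sketched as below. The sketch assumes that each question contributes one binary correctness label per format ($1$ correct, $0$ incorrect) and that the Pearson coefficient is computed over these paired labels; the function names and dictionary keys are hypothetical.

```python
import math


def agreement_matrix(mcq_correct, lfgq_correct):
    """Count the four per-question scenarios for paired 0/1 correctness labels."""
    counts = {"both_correct": 0, "both_incorrect": 0, "mcq_only": 0, "lfgq_only": 0}
    for m, l in zip(mcq_correct, lfgq_correct):
        if m and l:
            counts["both_correct"] += 1
        elif m:
            counts["mcq_only"] += 1
        elif l:
            counts["lfgq_only"] += 1
        else:
            counts["both_incorrect"] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)
```

A model that is correct on every question in one format makes the coefficient undefined, which is why the constant-sequence case raises an error instead of returning a number.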

4.1.2. Consistency

Next, we explore the relationship between consistency and accuracy. Consistency stands for the degree to which the LLMs provide the same answer when asked the same question multiple times. For example, an answer of ‘AAAAB’ is considered more consistent than ‘BCDAB’ when presented with the same question five times. To conduct this evaluation, we first define the quantitative measures for consistency and accuracy in a sequence of responses to a repeated question.

Formally, let $\mathcal{A}=\{A_{1},A_{2},...,A_{D}\}$ be a sequence of answers, where a model is queried $D$ times. Each answer $A_{i}\in\mathcal{A}$ is selected from a set of $N$ unique options $\mathcal{O}=\{Opt_{1},Opt_{2},...,Opt_{N}\}$. From $\mathcal{A}$, we derive a count sequence $\mathcal{C}=\{\textsc{count}(Opt_{1}),\textsc{count}(Opt_{2}),...,\textsc{count}(Opt_{N})\}$, where $\textsc{count}(Opt_{i})$ represents the number of occurrences of $Opt_{i}$ in $\mathcal{A}$, marking $Opt_{\max}$ as the option with the largest $\textsc{count}(Opt_{i})$. We define sequence consistency $K$ as

$$K=\frac{\textsc{count}(Opt_{\max})}{D}.$$

As for accuracy, if $A_{ref}\in\mathcal{O}$ is the correct answer, the accuracy for sequence $\mathcal{A}$ can be defined as

$$\text{Acc}(\mathcal{A})=\frac{\textsc{count}(A_{ref})}{D}.$$

In the experiments, we use the same samples as in §4.1.1 and repeat each question five times for each LLM. To compute consistency and accuracy on LFGQs, we manually group the long-text generated answers into options so that answers with similar meanings are grouped together under the same option. We also explore the impact of different temperatures, the parameter that controls the degree of randomness of the generated text, by using the values $0$, $0.5$, and $1$.
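The two sequence-level measures can be sketched as follows, under the assumption that both are normalized by the number of queries $D$: consistency is the share of the most frequent option and accuracy is the share of the reference answer. The function names are hypothetical, and LFGQ answers are assumed to have already been grouped into option labels.

```python
from collections import Counter


def sequence_consistency(answers):
    """Share of the most frequent option among the D repeated answers."""
    if not answers:
        raise ValueError("need at least one answer")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def sequence_accuracy(answers, reference):
    """Share of the D repeated answers that equal the reference answer."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(1 for a in answers if a == reference) / len(answers)


# The example from the text: 'AAAAB' is more consistent than 'BCDAB'.
k_high = sequence_consistency(list("AAAAB"))  # 4 of 5 answers agree
k_low = sequence_consistency(list("BCDAB"))   # 2 of 5 answers agree
```

Note that a sequence can be fully consistent and fully wrong, e.g. five identical answers that differ from the reference, which is the case the consistency analysis is probing.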

Figure 3 shows GPT-3.5-turbo’s consistency for MCQs and LFGQs across the three temperature values. The model tends to be consistent in its answers in both formats. Even when the temperature is increased, consistency does not decrease notably. Between the two formats, answers to LFGQs tend to be more consistent than answers to MCQs. We also calculate the Pearson correlation coefficient between consistency and accuracy. In the case of MCQs, the Pearson correlation coefficient is $0.32$, while for LFGQs, the coefficient reaches $0.416$, implying that higher consistency does not necessarily mean more correct answers. Our findings suggest that a higher level of consistency indicates a sharper probability distribution over specific knowledge, but it does not guarantee the correctness of that knowledge. Unlike SelfCheckGPT (Manakul et al., 2023), which leverages the idea that the higher the consistency, the higher the correctness, we do not find a direct relationship between consistency and accuracy. We believe this is due to the knowledge required to answer the evaluation dataset. While SelfCheckGPT is evaluated on information about famous individuals (Lebret et al., 2016), we use specialized professional medical datasets.

4.2. Token Logits

To compare MCQs and LFGQs in the token logits space, which is the space of predicted probabilities, we rely on two techniques: unified confidence calculation and expected calibration error.

One of the mainstream approaches used to analyze why LLMs select specific options when answering MCQs is through token logits (Manakul et al., 2023). GPT-3.5-turbo, for instance, can generate log probabilities for the most probable tokens associated with each output token (see https://platform.openai.com/docs/guides/gpt). However, while there are formulas to calculate confidence for multiple options (Jiang et al., 2021; Holtzman et al., 2021; Lin et al., 2022a), direct utilization of token probability calculations for comparing MCQs with LFGQs is not straightforward.

Since our goal is to compare MCQs and LFGQs, we follow Jiang et al. (2021) and propose a unified confidence calculation applicable to the three QA formats: MCQs, LFGQs, and TFQs. Let us assume an input question $q$ that makes an LLM generate the set of answers $\mathcal{A}$.

Each answer $A_{i}\in\mathcal{A}$ contains $|A_{i}|$ tokens. Each token, denoted as $t_{k}$, with $1\leq k\leq|A_{i}|$, has a corresponding autoregressive token log probability $P_{log}(t_{k}|q,t_{<k})$. We first compute the average token log probability of each answer $A_{i}$ as

$$P_{log}(A_{i}|q)=\frac{1}{|A_{i}|}\sum_{k=1}^{|A_{i}|}P_{log}(t_{k}|q,t_{<k}).$$

From the initial set of answers $\mathcal{A}$, which may contain duplicates, we consolidate them into $z$ unique answers, denoted as $\mathcal{A}^{uni}=\{A_{1}^{uni},A_{2}^{uni},...,A_{z}^{uni}\}$, where $z\leq D$. For each unique answer, we select the highest log probability observed for any instance of that answer in $\mathcal{A}$, denoted as $P_{log}^{highest}(A_{i}^{uni}|q)$. Subsequently, we rank the unique answers in $\mathcal{A}^{uni}$ in descending order by their frequency and the corresponding $P_{log}^{highest}(A_{i}^{uni}|q)$, and keep the first $W$ of them, where $W\leq z$ ($W=4$ in our experiments), to filter out excessively similar responses and maintain the diversity in the unique answers. We calculate the standardized confidence for the first $W$ answers as

$$\text{Conf}(A_{i}^{uni}|q)=\frac{\exp\left(P_{log}^{highest}(A_{i}^{uni}|q)\right)}{\sum_{j=1}^{W}\exp\left(P_{log}^{highest}(A_{j}^{uni}|q)\right)}.$$

Finally, for MCQs, we use regular-expression matching to map the first $W$ answers to the four candidate options, and get the final label ($0$ or $1$) with the sum of the corresponding standardized confidence. For LFGQs, we directly get the final label ($0$ or $1$) according to the standardized confidence for the first $W$ answers and the human labeling of each answer.
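The confidence pipeline above (average token log probability, consolidation into unique answers, ranking, and standardization over the first $W$ answers) can be sketched as follows. The normalization step is an assumption: the sketch exponentiates the highest average log probability of each kept answer and divides by the sum over the kept answers. Answers are treated as duplicates only when their text is identical, and all function names are hypothetical.

```python
import math
from collections import Counter


def average_log_prob(token_log_probs):
    """Mean of the per-token log probabilities of one generated answer."""
    return sum(token_log_probs) / len(token_log_probs)


def standardized_confidence(answers, w=4):
    """Map the top-w unique answers to confidences that sum to one.

    `answers` is a list of (answer_text, token_log_probs) pairs, one per query.
    """
    frequency = Counter(text for text, _ in answers)
    highest = {}
    for text, token_log_probs in answers:
        avg = average_log_prob(token_log_probs)
        highest[text] = max(highest.get(text, -math.inf), avg)
    # Rank by frequency first, then by the highest average log probability.
    ranked = sorted(highest, key=lambda t: (frequency[t], highest[t]), reverse=True)
    kept = ranked[:w]
    weights = [math.exp(highest[t]) for t in kept]
    total = sum(weights)
    return {t: wgt / total for t, wgt in zip(kept, weights)}


# Hypothetical log probabilities for three queries of the same question.
conf = standardized_confidence([
    ("A", [-0.1, -0.3]),
    ("A", [-0.2]),
    ("B", [-1.0]),
])
```

For MCQs, the confidences of kept answers that map to the same candidate option would then be summed; that matching step is omitted here because it depends on the answer format.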

4.2.2. Expected Calibration Error

After obtaining the unified confidence, we compute model calibration (Gupta et al., 2006; Ahmed et al., 2020) to test whether an LLM exhibits good calibration across different dataset evaluation formats. A well-calibrated model should provide confidence (i.e., logit) estimates that closely match the actual probability of the correctness of the answer. Inaccurate predictions should correspond to low confidence (i.e., logit) values, whereas accurate predictions should yield high confidence (i.e., logit) values.

In practice, we employ a commonly used metric known as expected calibration error (ECE) (Niculescu-Mizil and Caruana, 2005) to assess the alignment of confidence and accuracy. ECE is computed as the weighted average of the difference between the accuracy and confidence. To measure it quantitatively, we divide the $[0,1]$ interval into multiple bins. Each sample falls into one of these bins based on the model’s predicted confidence. The average model confidence is calculated in each bin and then compared with the average accuracy given by the true labels of the samples in the bin. The absolute value of the difference between the two measures the model’s calibration: a larger difference indicates poorer calibration. Formally,

$$\text{ECE}=\sum_{b=1}^{\mathcal{B}}\frac{n_{b}}{n}\left|\text{acc}(b)-\text{conf}(b)\right|,$$

where $b$ represents the $b$-th bin, $\mathcal{B}$ represents the total number of bins, $n_{b}$ represents the number of samples in the $b$-th bin, $n$ represents the total number of samples, $\text{acc}(b)$ represents the average value of the true labels of the samples in the $b$-th bin, and $\text{conf}(b)$ represents the average value of the model prediction probability in the $b$-th bin. In our experiments, we set $\mathcal{B}=100$.
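The ECE computation can be sketched as below. It assumes equal-width bins over $[0,1]$, binary true labels, and weighting of each bin by its share of the samples, which is the standard form of the metric; the function name and the bin-assignment rule for a confidence of exactly $1$ are illustrative choices.

```python
def expected_calibration_error(confidences, labels, num_bins=100):
    """Weighted average of |accuracy - confidence| over equal-width bins.

    `confidences` are values in [0, 1]; `labels` are 0/1 correctness labels.
    """
    n = len(confidences)
    if n == 0 or n != len(labels):
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(num_bins)]
    for conf, label in zip(confidences, labels):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, label))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_acc - avg_conf)
    return ece
```

An overconfident model, for example one that reports a confidence of $0.9$ on four answers of which only one is correct, ends up with all samples in one bin and an ECE of $|0.25-0.9|=0.65$.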

4.2.3. Results

Within the CARE-MI dataset, we use the three QA formats, MCQs, LFGQs, and TFQs, to compute ECE and reliability. We use confidence scores and true labels to draw reliability diagrams as in Kängsepp et al. (2022). A reliability diagram closely aligning with the identity line suggests good model calibration, while a significant deviation indicates poor calibration. Results are shown in Figure 4 and in Table 5. LLMs operating on MCQs exhibit the poorest calibration and highest ECE compared to the other two formats. This suggests that the LLMs’ predictions in MCQs are not accurately aligned with the true probability of correct answers, indicating overconfidence in their responses. Additionally, we observe that the ECE in TFQs ($0.276$), which contain only two candidate options, is lower than in MCQs ($0.426$), which have four candidate options. To investigate the impact of the number of candidate answers, we conduct experiments by varying the number of options in MCQs across the four datasets. We analyze whether the number of options and the domain of each dataset affect ECE. Throughout these experiments, we keep the correct answer positioned as the last option. As depicted in Table 5, we do not find a clear correlation between these factors and ECE, meaning that the number of options and the domain do not seem to influence LLMs’ performance.

4.3. Embeddings

Up to this point, our analysis reveals that the misalignment between MCQ and LFGQ answers is evident in both the direct output (§4.1) and the token logits (§4.2). Next, we investigate whether this difference is also manifested in the embedding space derived from the hidden states of the models (Burns et al., 2023). We also explore how the embeddings behave under different question formats and models. The technique proposed by Li et al. (2023b) enables the extraction of hidden outputs from the model by collecting the heads of the attention blocks and using these heads as indices to obtain the hidden outputs of each layer from the model (see https://github.com/davidbau/baukit). We utilize the hidden outputs of the last token in the input. To isolate the distinctions in the embedding space between MCQs and LFGQs themselves as much as possible, unlike the prompt design in the previous experiments, we only set a post-prompt for MCQs, and LFGQs do not contain any prompts. Finally, the hidden outputs have information on the number of input samples, the number of hidden layers, the number of attention heads, and the dimensions of the heads. Refer to Table 6 in the Appendix (§8) for more details.

We randomly select $40$ samples from the ARC dataset, each in both the MCQ and LFGQ formats, and plot t-SNE (Van der Maaten and Hinton, 2008) representations of the hidden embeddings in each layer. Figure 5 shows the visualizations for Llama-2-7b-chat-hf. Other model visualizations are provided in the Appendix (§8). The results show that the embeddings from MCQs and LFGQs display clear separations in some layers of the hidden states. We observe a consistent trend across the various LLMs: in the initial layers, embeddings of the two formats show clear separations. However, as we progress towards the final layers, the embeddings corresponding to MCQs and LFGQs tend to become closer. Additionally, in certain models, the embeddings are distinctly separated in specific middle layers. For instance, in the open-llama-7b model, the embeddings exhibit clear differentiation in the 14th layer. Finally, the representation of embeddings from the same model family but different sizes can vary, as shown in the embeddings of Dolly-v2-3b and Dolly-v2-7b in Figures 9 and 10 in the Appendix.

4.4. Different QA formats produce different answers

Our experiments showed that different formats of a single question may result in different performances. Moreover, our results challenged the notion that greater consistency leads to higher accuracy by closely examining the relationship between them. When comparing MCQs and LFGQs in expected calibration error, prompts from MCQs were the most overconfident in their predictions. Finally, in the embedding space, MCQs and LFGQs representations were clearly separated in some layers of the hidden states.

Related Work

QA is a prevalent evaluation method in natural language processing tasks. With the surge of LLMs, several QA evaluation benchmarks have emerged to assess models’ reasoning and fact-retrieval skills (Liang et al., 2022; Chang et al., 2023; Li et al., 2023a; Chia et al., 2023). These QA benchmarks encompass diverse dataset formats, including multiple-choice questions (MCQs) (Bhakthavatsalam et al., 2021; Ramamurthy and Aakur, 2022; Liu et al., 2023a; Huang et al., 2023), long-form generation questions (LFGQs) (Zhang et al., 2018; Lin et al., 2022b; Xiang et al., 2023), and True/False questions (TFQs) (Singhal et al., 2023). Many existing QA evaluation benchmarks use relatively simple MCQ formats, on which models can strongly rely to formulate their answers. In addition, previous work has primarily focused on evaluations of MCQs, not considering comparisons between the different formats (Jiang et al., 2021; Lin et al., 2022a; Robinson and Wingate, 2023). In this paper, we focus on conducting a comprehensive comparative analysis between MCQs and LFGQs, thereby enhancing the understanding of the drawbacks and limitations of the different evaluation methods.

5.2. LLMs and Multiple-Choice Questions

Previous work has underscored the sensitivity of LLMs to prompting strategies (Zhao et al., 2021; Singhal et al., 2023) and positional bias (Wang et al., 2023), which pose challenges to model assessment. For instance, Zheng et al. (2023) showed that GPT-4 tends to favor the candidate answer presented in the first position, leading to unfair evaluation results. Additionally, Pezeshkpour and Hruschka (2023) observed that GPT-4 and InstructGPT (Ouyang et al., 2022) perform differently on various benchmarks when answer options are rearranged. We expand upon prior work, which focused on a limited number of models and scenarios, to study and identify general patterns and analyze their underlying causes across diverse datasets and models.

Discussion and Conclusion

This paper focused on testing the effectiveness of MCQs in evaluating LLMs. Motivated by the observation of consistent preference biases across different datasets with several LLMs, we first conducted a significance test to determine whether the position of the candidate answers affects LLMs’ predictions, resulting in accuracy instability. More specifically, we analyzed how different LLMs have different positional preference patterns on the same dataset, while the positional preference patterns of a particular LLM remained constant across datasets from different sources. In addition, we conducted comparative experiments between MCQs and LFGQs in three different spaces to ascertain the advantages and disadvantages of each as evaluation benchmarks.

Based on our experiments, we offer a few suggestions for utilizing MCQs and LFGQs formats in LLM evaluation benchmarks:

The choice of QA format should be aligned with the type of knowledge being evaluated. Whereas it may be fine to use MCQs for testing general knowledge, in some professional domains, particularly those carrying legal responsibilities such as the medical field, it is advisable to use LFGQs under human supervision to ensure a more rigorous evaluation.

When using MCQs for evaluating LLMs, adjusting the number of options, whether decreasing or increasing them, does not necessarily enhance accuracy and confidence. However, regarding order sensitivity, reordering candidate answers for each question and repeating questions can enhance the robustness of the assessment process.

Our findings do not indicate a strong correlation between consistency and accuracy in LLMs’ responses. Therefore, we do not recommend relying on consistency as a tool to enhance performance in LLMs.

Given the discrepancy we found between MCQ and LFGQ results, we believe that LFGQs are the best format for evaluating LLMs, as they align well with real-world use cases. We recommend prioritizing the LFGQ format and evaluating LLMs from various perspectives, including correctness, completeness, relevance, and interpretability.

We hope that the results presented in this paper, together with the investigation of order sensitivity and the comparative analyses between MCQs and LFGQs, can inspire future research to improve evaluation benchmarks for LLMs.

Acknowledgment

This work was partly supported by JSPS KAKENHI No.JP22K12091 and the State Key Program of National Nature Science Foundation of China No.61936001.

Appendix

In this appendix, we report the visualization of the t-SNE projected embeddings in the other seven models (Figures 6-13) and the hidden embedding space details for each LLM (Table 6).