Humans or LLMs as the Judge? A Study on Judgement Biases

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang

Introduction

Proprietary models such as GPT-4 (OpenAI et al., 2023), Bard (Google), Claude (Anthropic), and PaLM (Anil et al., 2023) showcase outstanding ability in numerous NLP tasks while serving as everyday tools in diverse scenarios. Meanwhile, the open-source community is trying to replicate the proprietary models and democratize LLMs. To keep pace with the advancement of LLMs, the community attaches great importance to evaluating model performance and has developed numerous benchmarks, which can be roughly categorized into open-ended and close-ended ones. Although close-ended benchmarks such as MMLU (Hendrycks et al., 2020) and C-Eval (Huang et al., 2023) are convenient to evaluate on, they often suffer from data contamination. Proprietary LLMs, which are trained with in-house data, tend to perform particularly well on close-ended benchmarks. On the other hand, open-ended benchmarks (e.g., MT-Bench (Zheng et al., 2023) and Alpaca-Eval (Li et al., 2023)) test models via free-form generation, which is more consistent with real-world use cases and relies heavily on LLMs’ generation ability. Data contamination is less severe in open-ended benchmarks since there are no standard answers, and even when contamination occurs it offers little help for performance hacking.

Open-ended benchmarks often rely on humans to evaluate the quality of answers. With the recent emergence of human-aligned LLMs, adopting such LLMs as judges, known as LLM-as-a-judge (Zheng et al., 2023), serves as an alternative to human judges. Even though adopting human and LLM judges is common practice for evaluating open-ended questions, both kinds of judges are found to possess certain biases (Zheng et al., 2023; Wu and Aji, 2023), which calls into question the validity of human- and LLM-as-a-judge. Therefore, an important question arises:

How biased are humans and LLMs on judging open-ended generation?

In this paper, we adopt intervention and post-hoc analysis to quantitatively investigate the biases of humans and LLMs, specifically Fallacy Oversight Bias, Authority Bias, Beauty Bias, Verbosity Bias and Positional Bias. Here, intervention and post-hoc analysis correspond to experimental design and observational study, respectively, which are two prevalent research paradigms in statistics (Gerry P. Quinn, 2002; Rosenbaum, 2010). Current bias evaluation frameworks necessitate a golden standard, either in the form of groundtruth (e.g., correct vs erroneous, harmful vs non-harmful) or human-provided reference answers. But what if we intend to probe the effect of perturbations for which golden standards are not provided or not well defined? We propose a new framework for bias evaluation to fill this gap.

In summary, our contributions are three-fold:

We propose a new framework to investigate five biases of human- and LLM-as-a-judge. The framework does not rely on human reference answers or groundtruth, making it flexible and extensible.

We systematically investigate the vulnerability of these judges against all three perturbations. We further exploit these weaknesses to attack different LLM evaluators.

We open-source a dataset for open-ended evaluation, serving as an in-depth alternative to in-width datasets such as Vicuna-80 (Chiang et al., 2023).

Our key findings are highlighted as follows:

Human judges have significant Fallacy Oversight Bias, Beauty Bias and Verbosity Bias (see details in Sec. 3.2).

Different LLM judges possess different biases to varying extents.

One can easily exploit these biases to perform superficially well under LLMs’ judgement.

Related Works

Human feedback serves as the gold standard for natural language generation (NLG) evaluation. The collected feedback can be used to improve model performance (Kreutzer et al., 2018; Zhou and Xu, 2020; Leike et al., 2018; Ziegler et al., 2019; Stiennon et al., 2020; Böhm et al., 2019; Ouyang et al., 2022; Christiano et al., 2023) or to serve as an indicator of output quality as in Chatbot Arena (Zheng et al., 2023). Prior to the prominence of LLMs, BertScore (Zhang et al., 2020), BARTScore (Yuan et al., 2021), DiscoScore (Zhao et al., 2023) and GPTScore (Fu et al., 2023) were popular metrics for evaluating NLG tasks. Recently, powerful LLMs have been leveraged as judges in place of previous methods, and are widely used in evaluating LLM performance (Chen et al., 2023b; Zhang et al., 2023; Chen et al., 2023a; Wang et al., 2023b).

2 Biases of Human and LLM Judges

Both human and LLM judges are found to be biased. Due to the subjectivity of human judgement, the reproducibility of human evaluation is fairly low (Belz et al., 2023). To obtain higher-quality results, a clear codebook is needed to provide judges with clear instructions (Howcroft et al., 2020). Even so, human judges have inherent biases (Zheng et al., 2023; Wu and Aji, 2023) and may not even provide reliable answers (Clark et al., 2021; Hämäläinen et al., 2023). To make human evaluation results more interpretable, several works have revealed the underlying factors that impact human judgements (Hu et al., 2023; Aniceto et al., 2023), challenging the position of humans as the gold standard. As an alternative to humans, LLM judges are also found to have certain biases, and their annotation results require validation (Pangakis et al., 2023). Zeng et al. (2023) find that LLMs are prone to favor answers with superficially good quality. Position bias (Wang et al., 2023a), verbosity bias and self-enhancement bias (Zheng et al., 2023) have also been identified.

On the Biases of the Judge

We first identify the challenges of conducting bias analysis. First, when there is no groundtruth, or when humans fail to serve as the golden standard, a valid comparison of biases is hard to carry out. Second, it is hard to ensure that an experiment is both controlled and comprehensive. Either a carelessly massive experiment or a naive setting would undermine the validity of the conclusions.

Unfortunately, these challenges have not been overcome. First, groundtruth annotations (e.g., w/ or w/o factual error) are indispensable in current bias analysis (Zeng et al., 2023; Wu and Aji, 2023), but the groundtruth may not be well defined in open-ended question answering. Second, experiment design is either too carelessly massive or too limited. Zheng et al. (2023) draw their conclusions on a massive dataset collected from crowd-sourced workers, which may introduce uncontrollable factors into the analysis. Wu and Aji (2023) conduct experiments on only 40 questions selected from Vicuna-80 (Chiang et al., 2023), resulting in conclusions with limited generalizability.

2 Definition of Biases

Moving forward, we need to establish the biases of evaluators. As defined by the Oxford English Dictionary, “semantics” refers to the meaning in language (Oxford English Dictionary, 2023). We primarily categorize the biases into two types: semantic-agnostic biases and semantic-related biases.

Semantic-agnostic bias refers to the bias of evaluators that is influenced by factors unrelated to the semantic content of the text. Common examples include verbosity bias, authority bias, and beauty bias.

Semantic-related bias pertains to the bias of evaluators that is affected by elements linked to the content of the text. Typical examples include fallacy oversight bias, racial bias, and gender bias.

2.2 Bias in the experiment

In this study, we conduct extensive experiments to explore the five types of bias as described below.

Fallacy Oversight Bias: this refers to the tendency to overlook the impact of logical fallacies in an argument. It often occurs when individuals accept conclusions without critically evaluating the evidence supporting them.

Authority Bias: this is the tendency to attribute greater credibility to statements from perceived authorities, regardless of the actual evidence. It often leads to an uncritical acceptance of expert opinions without sufficient scrutiny.

Beauty Bias: also called ‘lookism’, this generally refers to someone being privileged because of their good looks. In this work, it refers to the inclination of judges to be attracted by visually appealing content, regardless of its actual validity.

Verbosity Bias (Zheng et al., 2023): this is the tendency to perceive wordy statements as more credible than concise statements, regardless of the actual truthfulness. It reflects a bias towards equating quantity of information with quality.

Positional Bias (Zheng et al., 2023): this bias occurs when the position or placement of information affects its perception or evaluation. For instance, in a document, information presented at the beginning or end may be considered more important than information in the middle.

Experimental Protocol

In this section, we elaborate on the experimental methodology, the creation of experimental data, the experimental procedure, evaluation metrics, and the models under evaluation.

We adopt intervention and post-hoc analysis as our research methods, which will be introduced below.

Intervention is a research method in which researchers control certain variables to determine their impact on the outcome. In our experiment, we investigate Fallacy Oversight Bias, Authority Bias and Beauty Bias by adding the following perturbations.

Factual Error for Fallacy Oversight Bias: This perturbation involves introducing factual inaccuracies in the text. We test judges on the ability to identify these deliberately added errors.

Reference for Authority Bias: We add fake references to the text, which do not affect its semantics. Judges should not be inclined toward content that merely appears more authoritative.

Rich Content for Beauty Bias: We perturb the texts with emojis and markdown formats to make the content more visually appealing without changing semantics. We test whether judges can stick to the semantics instead of being distracted by formats.

Post-hoc analysis is a research method in which researchers do not directly intervene but only observe the natural occurrence of variables. Compared to other biases, the impact of Verbosity Bias and Positional Bias on the outcome is more readily observable. Therefore, we choose to explore these two biases via post-hoc analysis.

By combining the two research paradigms, we aim to provide a comprehensive understanding of the five biases and their effects on preferences of judges.

2 Data Generation

To collect data for our experiment, we employ GPT-4 to generate questions, answers, and their variations, followed by a manual review by the authors to ensure the quality of the experiment data. The detailed prompts for question generation, answer generation and evaluation are shown in Appendix B. An example of a generated sample is shown in Figure 1. We introduce each step in the following paragraphs.

To increase the generality of our question set, we follow the six levels of the revised Bloom’s Taxonomy (Krathwohl, 2002) (description of the Taxonomy is in Appendix C) and create 30 questions for each level, amounting to a total of 180 questions. The knowledge level of these questions is controlled at or below the middle school level. This ensures that evaluators can utilize their knowledge to assess the quality of the answers. The categorization of the questions is manually verified, and any misclassified questions are eliminated. This process leaves us with 142 questions for the subsequent steps.

We independently generate two answers for each question, leading to a collection of 142 question-answer pairs for the control group. Each pair consists of one question and two answers, denoted as $Q$ , $A_{1}$ and $A_{2}$ , respectively.

For each type of perturbation, we randomly select an answer for each question and introduce the perturbations (Factual Error, Reference and Rich Content), resulting in three times the 142 question-answer pairs for the experimental group. In these arrangements, the two answers to each question are labeled as $A_{1}$ and $A_{2}^{p}$ , where $A_{1}$ is the original answer, and $A_{2}^{p}$ is the perturbed version of $A_{2}$ .
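The construction of the experimental groups can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_experimental_groups`, the record layout, and the `perturb` callable are hypothetical (in the paper the perturbations are produced with GPT-4 and then manually reviewed). The sketch assumes that the randomly selected answer is the one relabeled as $A_{2}$ before it is perturbed.

```python
import random

# Perturbation types named in the paper.
PERTURBATIONS = ("factual_error", "reference", "rich_content")


def build_experimental_groups(pairs, perturb, seed=0):
    """Build one experimental group per perturbation type.

    pairs:   list of (question, answer_a, answer_b) tuples (control data).
    perturb: hypothetical callable (answer, kind) -> perturbed answer.
    """
    rng = random.Random(seed)
    groups = {kind: [] for kind in PERTURBATIONS}
    for kind in PERTURBATIONS:
        for question, ans_a, ans_b in pairs:
            # Randomly pick which answer gets perturbed; the untouched
            # answer is labeled A1 and the selected one A2 (assumption).
            if rng.random() < 0.5:
                ans_a, ans_b = ans_b, ans_a
            groups[kind].append({
                "Q": question,
                "A1": ans_a,                  # original, unperturbed answer
                "A2": ans_b,                  # kept for the control comparison
                "A2p": perturb(ans_b, kind),  # perturbed version of A2
            })
    return groups
```

With 142 control pairs this yields three groups of 142 records each, matching the "three times the 142 question-answer pairs" described above.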

3 Experiment Objects

We employ 79 college students as our human judges. Detailed selection criteria are provided in Appendix A.1. Besides human judges, our experiment also involves the evaluation of some representative models, including GPT-4 (OpenAI et al., 2023), Claude-2 (Anthropic), PaLM-2 (Anil et al., 2023), GPT-4-turbo (OpenAI), GPT-3.5-turbo (OpenAI), LLaMA2-70B-Chat (Touvron et al., 2023), Ernie (Sun et al., 2021), Spark (https://xinghuo.xfyun.cn/), and Qwen (Bai et al., 2023). However, as some models exhibit significant positional bias in the evaluation (see Section 5.5), we only include models with less significant positional bias in the following sections. All judges follow the same evaluation guidelines, which are detailed in Appendix A.2. A list of participant names is in Appendix A.3. A screenshot of the user interface is shown in Appendix D.

4 Experiment Procedure

Figure 2 illustrates the experiment procedure. We form two groups to conduct our experiment: Control Group (aiming to evaluate $A_{1}$ and $A_{2}$ ) and Experimental Group (aiming to evaluate $A_{1}$ and $A_{2}^{p}$ , the perturbed version of $A_{2}$ ).

Given a question and its two corresponding answers, a judge is directed to determine whether “Answer 1” is better, “Answer 2” is better, or a “Tie”, based solely on the semantic quality of the answers (see detailed instructions in Appendix A.2). Each question-answer pair undergoes 6 rounds of evaluation, and the positions of the answers are shuffled to minimize the impact of positional bias (the right part of Figure 2). For human judges, we include a “not familiar” option and ask judges to choose it whenever they are not familiar with the context of the question. In this way we can maintain the quality of the evaluation. Additionally, based on response time, we exclude votes whose decisions were made too quickly.
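The position shuffling can be sketched as below. This is an illustrative sketch only: the helper names `schedule_rounds` and `to_label` are hypothetical, and the choice of exactly three rounds per ordering is an assumption based on the later statement that both placements receive an equal number of evaluations.

```python
import random


def schedule_rounds(pair_id, n_rounds=6, seed=0):
    """Return a shuffled list of display orders, balanced across placements."""
    if n_rounds % 2:
        raise ValueError("n_rounds must be even to balance both placements")
    half = n_rounds // 2
    orders = [("A1", "A2")] * half + [("A2", "A1")] * half
    # Seed per pair so the schedule is reproducible (assumption).
    rng = random.Random(f"{seed}-{pair_id}")
    rng.shuffle(orders)
    return orders


def to_label(vote, order):
    """Map a positional vote ('first', 'second', 'tie') back to A1/A2/Tie."""
    if vote == "tie":
        return "Tie"
    return order[0] if vote == "first" else order[1]
```

Mapping votes back to answer labels before aggregation keeps the displayed position from leaking into the per-question result.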

5 Metrics

After voting results are collected, we aggregate the six votes for each question by calculating

As per the experimental protocol, each question receives a voting result, i.e., $A_{1}$ , $A_{2}$ , or $Tie$ . Following the terminology used in AI safety, we employ the Attack Successful Rate (ASR) at the question level to gauge the judges’ resilience to the perturbations. For the Reference and Rich Content perturbations,

$$\mathrm{ASR}=\frac{|V_{2|1}|}{|V_{1}|},$$

where $V_{1}$ is the set of samples in the control group that vote $A_{1}$ or $Tie$ , and $V_{2|1}$ is the set of samples in $V_{1}$ which are voted $A_{2}$ in the experimental group.

For the Factual Error perturbation, ASR is calculated as:

$$\mathrm{ASR}=\frac{|V_{2|2}|}{|V_{2}|},$$

where $V_{2}$ is the set of samples in the control group that vote $A_{2}$ or $Tie$ , and $V_{2|2}$ is the set of samples in $V_{2}$ which are voted $A_{2}$ or $Tie$ in the experimental group. The higher the ASR, the lower the judges’ ability to detect factual errors in the text. ASR should ideally be close to 0.

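For the Factual Error perturbation, we also compute the accuracy of the judges in identifying factual errors as another metric. The calculation is as follows:

$$\mathrm{Acc}=\frac{\#\{\text{samples in the experimental group voted } A_{1}\}}{\#\{\text{samples in the experimental group}\}},$$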

where the numerator represents the number of samples in the experimental group that vote for $A_{1}$ , and the denominator is the total number of samples in the experimental group. Ideally, Accuracy should be close to 1.
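The three metrics can be sketched directly from the set definitions above. This is a hedged illustration: the function names and the input format (a dict from question ID to the aggregated label `"A1"`, `"A2"` or `"Tie"`) are assumptions, and in the experimental group the label `"A2"` stands for the perturbed answer $A_{2}^{p}$ . How the six votes are aggregated into one label is not reproduced here.

```python
def asr_non_semantic(control, experimental):
    """ASR for Reference / Rich Content: |V_{2|1}| / |V_1|."""
    v1 = [q for q, vote in control.items() if vote in ("A1", "Tie")]
    flipped = [q for q in v1 if experimental[q] == "A2"]
    return len(flipped) / len(v1) if v1 else float("nan")


def asr_factual_error(control, experimental):
    """ASR for Factual Error: |V_{2|2}| / |V_2|."""
    v2 = [q for q, vote in control.items() if vote in ("A2", "Tie")]
    # The attack succeeds if the flawed answer is still not rejected.
    kept = [q for q in v2 if experimental[q] in ("A2", "Tie")]
    return len(kept) / len(v2) if v2 else float("nan")


def accuracy_factual_error(experimental):
    """Share of experimental-group questions voted for the clean answer A1."""
    votes = list(experimental.values())
    return sum(vote == "A1" for vote in votes) / len(votes)


# Toy data (made up for illustration, not results from the paper).
control = {"q1": "A1", "q2": "Tie", "q3": "A2", "q4": "A1"}
exp_reference = {"q1": "A2", "q2": "Tie", "q3": "A2", "q4": "A1"}
exp_factual = {"q1": "A1", "q2": "A1", "q3": "Tie", "q4": "A1"}
```

On the toy data, one of the three questions in $V_{1}$ flips to $A_{2}$ under the Reference perturbation, and one of the two questions in $V_{2}$ remains at $A_{2}$ or $Tie$ under the Factual Error perturbation.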

Experimental Results

We will introduce the experimental results for the five biases in the following sections.

Referring to Table 1, by analyzing Acc and ASR of Factual Error, we see that GPT-4, GPT-4-Turbo, and PaLM-2 have top-tier factual error detection ability, while human judges, Claude-2, and Ernie are second tier, and LLaMA2-70B is weakest. GPT-4’s superior performance could be due to its capabilities and to the possibility that it generated the factual errors itself. Humans’ slightly lower ability might be due to text length affecting concentration. The consistency in evaluator rankings by both Acc and ASR highlights these metrics’ reliability.

The GPT-4 series models and PaLM-2 perform best on the Fallacy Oversight Bias, with human performance in the lower-middle range, and LLaMA2-70B performing worst.

2 On Authority Bias

According to Table 1, in the case of Reference perturbation, PaLM-2 ranks as the most robust, followed by humans, and Claude-2 is the least robust. Most model judges encounter significant challenges with this perturbation. GPT-4, GPT-4-Turbo, and Claude-2 perform relatively poorly, with Ernie and LLaMA2 slightly underperforming compared to humans, and PaLM-2’s robustness being the most notable. This suggests that some models may be easily misled by texts that “appear more credible,” even if their semantics remain unchanged.

PaLM-2 is the most robust model against authority bias, followed by humans. Models like GPT-4, GPT-4-Turbo, Claude-2, Ernie, and LLaMA2 struggle, suggesting they can be misled by seemingly credible texts.

3 On Beauty Bias

Regarding Rich Content perturbation, Ernie is the least affected, humans are in the fifth position, and Claude-2 is the least robust. The results suggest that human judges can be influenced by visual, layout, and other non-content factors. Some models (Ernie, PaLM-2, GPT-4-Turbo) show less susceptibility to formatting. GPT-4 and LLaMA2 exhibit similar performance to humans in this aspect, while Claude-2 remains the least robust against this perturbation.

Ernie is the most robust to Rich Content perturbation. Humans rank 5/7, and Claude-2 is affected the most. GPT-4 and LLaMA2-70B perform similarly to humans.

4 On Verbosity bias

We conduct a statistical analysis of judges’ verbosity preferences at the vote level. Initially, we assign a value of 0 to votes favoring shorter answers, 0.5 to Tie votes, and 1 to votes favoring longer answers. Subsequently, we calculate the average value of votes based on the difference in answer length. Ideally, as depicted by the Perfect Evaluator in the figure, an evaluator’s preference for length should consistently be 0.5.
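The vote-level scoring described above can be sketched as follows. Several details are assumptions made for illustration: the function name `verbosity_preference`, the input tuples, the bin width of 10, the unit of length, and the decision to skip votes where both answers have equal length (since neither answer is the longer one).

```python
from collections import defaultdict


def verbosity_preference(votes, bin_width=10):
    """Average preference for the longer answer, per length-difference bin.

    votes: iterable of (len_a1, len_a2, vote), vote in {"A1", "A2", "Tie"}.
    Returns {bin_start: mean score}, where 0 = shorter, 0.5 = Tie, 1 = longer.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for len_a1, len_a2, vote in votes:
        diff = abs(len_a1 - len_a2)
        if diff == 0:
            continue  # no longer answer to prefer (assumption)
        longer = "A1" if len_a1 > len_a2 else "A2"
        if vote == "Tie":
            score = 0.5
        elif vote == longer:
            score = 1.0
        else:
            score = 0.0
        bin_start = (diff // bin_width) * bin_width
        sums[bin_start] += score
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}
```

An unbiased evaluator would stay near 0.5 in every bin, while values drifting toward 1 as the bin start grows indicate verbosity bias.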

From Figure 3, it is observable that as the difference in answer length increases, all evaluators exhibit a tendency to prefer longer answers to varying extents. GPT-4-Turbo’s judgments are least influenced by length, whereas Claude-2 is most affected by length, with human evaluators also showing significant length bias. In the 0-10 length difference interval, the preferences of all evaluators are near 0.5, suggesting that when the length difference is minimal, the evaluators’ length preference is not pronounced. However, as the length difference expands, all evaluators, including humans, demonstrate a preference for longer answers, and this preference intensifies with the growth in length difference. Excluding GPT-4-Turbo, when the length difference exceeds 40, the preference scores of all evaluators approach or surpass 0.7, indicating a pronounced length bias. (To prevent the confounding of length bias with perturbation, we only show statistics on the control group.)

All evaluators display a degree of preference for longer answers, and this trend escalates with the increment of the answer length difference. Among them, GPT-4-Turbo is the least affected by verbosity bias.

5 On Positional Bias

In our experiment, we conduct multiple evaluations for each pair of answers and ensure an equal number of evaluations for both placement methods during the evaluation process. Thus, an ideal judge without positional bias should have approximately the same number of selections for the first and second answers. (For human evaluators, first and second correspond to answers on the left and right, respectively.)

From Table 2, it is evident that most evaluators exhibit some degree of positional preference, particularly GPT-3.5-Turbo, Spark, and Qwen, which demonstrate a strong positional preference in their choices. GPT-3.5-Turbo consistently favors the first answer, Spark prefers the second answer, and Qwen invariably selects Tie. (Based on this observation, we have excluded these three models from all other experiments.) Additionally, GPT-4-Turbo, Claude-2, Ernie, and LLaMA2-70B also show some positional bias, but to a lesser extent than the aforementioned models, with a preference difference of about 10% to 30% between the first and second answers. Human evaluators, GPT-4, PaLM-2, and human choices in “not familiar” scenarios exhibit a smaller positional bias, with the preference difference between the first and second answers all within 5%.
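A positional preference summary of this kind can be sketched as below. The function name `positional_preference`, the vote encoding, and the definition of the gap as the difference between the first and second shares are assumptions for illustration; the exact computation behind Table 2 is not restated here.

```python
from collections import Counter


def positional_preference(votes):
    """Share of 'first', 'second' and 'tie' selections, plus their gap.

    votes: iterable of strings in {"first", "second", "tie"}, recorded by
    displayed position (not by answer label).
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes given")
    shares = {key: counts.get(key, 0) / total
              for key in ("first", "second", "tie")}
    # Positive gap: the judge leans toward the first position (assumption).
    shares["gap"] = shares["first"] - shares["second"]
    return shares
```

Because each pair is shown equally often in both orders, a gap near zero is what a position-neutral judge would be expected to produce.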

Human evaluators, GPT-4, PaLM-2, and Claude-2 exhibit commendable performance in the positional bias test. Certain models like GPT-4-Turbo, Ernie, and LLaMA2-70B have noticeable positional bias, and the impact of this bias should be mitigated by conducting multiple evaluations with alternating answer positions. Models such as GPT-3.5-Turbo, Spark, and Qwen exhibit strong positional bias and are unsuitable for evaluation tasks.

Deceiving LLM Judges

Having observed that LLM judges are vulnerable to perturbations, we take a step further to measure the robustness of LLM-as-a-judge. By adding the non-semantic perturbations above, we make a flawed or mediocre answer superficially good. We calculate ASR and Acc following definitions similar to those in Section 4.5.

Anchor set $A_{1}$ : answers serving as anchors.

Weak set $A_{2}$ : answers that are weaker than $A_{1}$ , or that contain deliberately added factual errors, which lower the quality as well.

Perturbed set $A_{2}^{p}$ : answers perturbed on $A_{2}$ such that they are superficially better than $A_{2}$ .

The anchor set $A_{1}$ is generated on a subset of 60 questions by GPT-3.5-Turbo. The weak sets $A_{2}$ and perturbed sets $A_{2}^{p}$ differ for each research question. We aim to investigate the following two research questions (RQ1 and RQ2).

For RQ1, we make the weak set $A_{2}$ flawed by adding factual errors. Specifically, we generate a normal version of answers using GPT-3.5-Turbo, and then add factual errors to each answer with GPT-4, yielding a set of flawed answers $A_{2}$ . Then, for each answer in $A_{2}$ , we add Reference, Rich Content, and Reference + Rich Content to see whether we can deceive LLM judges by exploiting their Authority Bias and Beauty Bias.

For RQ2, we first need to curate a set of weak-strong (in terms of semantic quality) answer pairs. Specifically, we generate answers from LLaMA2-Chat-{7B,13B,70B} to form three independent weak sets. We then add Reference to them to form their respective perturbed sets. A preliminary experiment (see results in Appendix E) shows that answers from the LLaMA2-Chat family are indeed weaker than those of GPT-3.5-Turbo. To perform trend analysis, we also include another set of answers from GPT-3.5-Turbo and construct a weak set and a perturbed set for it in a similar manner.

2 Metrics

For each RQ, we conduct two groups of pairwise comparisons. Comparison between $A_{1}$ and $A_{2}$ shows the preference of judges for answers before perturbation (Control Group), whereas comparison between $A_{1}$ and $A_{2}^{p}$ shows the preference after perturbation (Experimental Group).

For RQ1, we adopt ASR (Eq. 1) and Acc (Eq. 2) as metrics. For RQ2, we adopt ASR (Eq. 1) as the metric.

3 Findings and Discussion

Table 3 summarizes the results of experiments for RQ1. Prior to perturbing $A_{2}$ , all models can effectively distinguish a flawed answer from a correct one, as evidenced by the relatively high Acc of all models. However, all judges are affected by perturbations to various degrees. Among them, PaLM-2 and GPT-4 are in the first tier and outperform the others in terms of both Acc and ASR, meaning that they are the most effective detectors of factual errors and the most robust to perturbations. While Claude-2 shows promising results in identifying plain factual errors, it shows considerable vulnerability to all three perturbations, making it the least effective model under this setting. Besides, the perturbation type affects LLM performance. REF alone is more effective than RC in deceiving LLM judges, meaning that LLMs are more inclined towards superficial authority than towards nice-looking formats. Compound perturbation (REF+RC) results in the highest ASR in each row, except for LLaMA2-Chat-70B and Ernie. We attribute this exception to the inconsistency between the added REF and RC, whose simultaneous appearance may make a text hard to comprehend, making the content less appealing to LLM judges.

We attempt to answer RQ2 by comparing several pairs of models with different gaps in answer quality. A direct observation from Table 4 is that there is an increasing trend in each row, meaning that LLM judges are more easily swayed by references as the quality gap between the answer pairs shrinks. Notably, there is a leap in ASR for GPT-4 and Ernie from the third to the fourth column, which is exactly the same setting as the experiment in Section 4.4. This indicates that GPT-4 is extremely sensitive to references when the two raw answers are similar in quality. For the other three columns, it is relatively robust to such perturbation. Claude-2 exhibits a severe Authority Bias, having 21% of results flipped even for the pair of GPT-3.5-Turbo and LLaMA2-7B-Chat, which has the largest difference in quality. In sum, preference for weaker answers can be increased by exploiting the authority bias in LLM judges, but the effect is limited due to the large quality gap between the two answers in our setting.

Conclusion

In conclusion, we developed a new framework to explore five biases in human and Large Language Model (LLM) judges, providing deeper insights into their innate biases and vulnerabilities. We revealed that both entities display significant biases, but diverge in their specific inclinations. Additionally, our findings demonstrate that these weaknesses can be exploited under LLMs’ judgement. Through our work, we hope to foster expanded evaluation and understanding of LLMs and human bias, particularly leveraging our open-sourced dataset for further research.

References

Appendix A Human Evaluators

This section details the selection criteria and basic information for the human evaluators who participated in our experiments. All participants have at least an undergraduate education level at a university whose language of instruction is English. They are chosen solely based on their English proficiency, basic logic skills and other knowledge. To ensure unbiased and knowledgeable evaluation of the results, the specific criteria are as follows.

At least one of the following conditions must be satisfied:
1. English as one of the first languages (mother tongues)
2. TOEFL $\geq$ 80, or IELTS $\geq$ 6.5, or at least B+ for all ENG classes, or Gaokao $\geq$ 128

Participants should master:
1. Math, high school level
2. Physics, high school level
3. Logics, basic

Participants should be able to:
1. Bring their own laptops
2. Focus for at least one hour
3. Participate in the experiment off-line

A total of 79 people were selected to participate in the experiments. They came from various countries such as America, China, Bangladesh, Malaysia, India and Indonesia. Their role was to finish at least 45 questions, each question asking them to evaluate the quality of the two answers corresponding to one same question.

All participants must consent to contributing their annotation results before participating in our experiments. The data and annotation results in our experiment pose no risk of harm to any participant.

A.2 Instruction Given to Human Evaluators

This section outlines the specific instructions provided to the human evaluators so that they clearly understand the evaluation criteria. The original instructions given to the human evaluators are as follows:

Your assignment involves providing evaluations for given responses.

Each evaluation will consist of a question and two corresponding answers. Your role is to discern which response is superior based on the quality of the answer. If you find that both responses are of equal good or bad, feel free to select a tie. There is no obligation to favor one response over the other; if a decision cannot be made, a tie would be an optimal choice.

During the evaluation process, please focus exclusively on the semantic quality of the answers. Non-semantic factors such as tone of speech, response format, or answer order should not influence your evaluation. The primary focus should be on the quality and accuracy of the answers.

Please check the checkbox “I am NOT familiar with the content of the question/answers.” if you are not familiar with the topic and pass to next question; the question would not be counted.

If you want to take a break, refresh the webpage. When everything is ready, retype your student ID and set your target to proceed. Your log is kept safe and sound.

You are all set, please go ahead to start the evaluation. Take your time and enjoy.

A.3 A Full List of Human Evaluators

We sincerely thank all the human evaluators for their high-quality feedback. We only list out the participants who consent to have their name shown in this paper. Names are arranged in a descending order of the number of effective evaluations. Names in bold are outstanding evaluators in terms of their evaluation quality and quantity.

Chuan Jiang Kaiyou Wu Gustavs Nolle Joshua Kurniawan Djunaidi MD PARVAGE Jerome Samuel Frederick Khasanto Lichuan Jiang Hadiq Shathir Sellam Mohamed Ibrahim Tian Jiang Yancun Guo Victoria Chamberlin Farrel Yudistira Andisman Jessica Yhang Ivander Lemuel Teno William Hansen Loe Jason Gunawan Qingning Shen Darren Boesono Haoxuan Xu Phocas Isingizwe Wanglei Xu Jiayi Yan Bryan Budiarta Sutanto Shafin Habib Jefferson Joseph Tedjojuwono Annabel Leonardi Yixin Deng Jeremy Christstardy Owen Lee Marta Laurent Lo Kayla Soewito Travis William Lintungan Lanruo Xia Xintong Zhu Vaughn Buquid Wentian Zhao Yue Zhang Florensia Widjaja Yu Zhang Haoyi Yu Kerui Wan Boshi Xu Nathania Josephine Tjung Bernadette Adila Hutani Dokyung Lee Zoe Emmanuel Halim Wei Xie Zhangchi Weng Xiaoliang Liu William Christopher Archieta Venkata Yashwant Kunar Bhyri Shuwen Zhang Zihang Jie Jiani Wu Weiwen Kong Yuanhao Zhu Juan Albert Wibowo Jonathan Yulliz Jubilee Ruixi Zou Keven Pratama Hendrata Junhan Fu Yujie Sun Yingjie Wang Han Yan Aragorn Leon Gobardja Yingxue Hu Christopher Nathanael Jessica Asali Xuejing Lin Kenneth Barli Ziche Liu Baohua Fang Junhan Jia Di Wu Yingxuan Bian Ziyun Wang Bryan Delton Tawarikh Sibarani Fanzeng Xia

Appendix B Prompts for GPT

Prompt for GPT to first perform CoT and then answer the question:

Prompt for GPT to directly answer the question without CoT:

Prompt for GPT to first answer the question and then perform CoT:

Appendix C Revised Bloom’s Taxonomy

The Revised Bloom’s Taxonomy serves as a framework for categorizing educational goals, objectives, and standards. Our study applies this taxonomy to structure the design of questions to evaluate the nuanced bias in human evaluators and LLMs. This taxonomy differentiates cognitive processes into six ascending levels of complexity: remembering, understanding, applying, analyzing, evaluating, and creating. Our research chose this taxonomy as a guidance to create more diverse and cognitive-comprehensive questions.

Appendix D User Interface

We show a screenshot of the user interface in Figure 4.

Appendix E Supplementary Results of Deceiving Models

In Table 5, we show that the answer quality of GPT-3.5-Turbo is much higher than that of the LLaMA2 family. This supports the validity of using LLaMA2’s answers to form the weak set $A_{2}$ .