Are LLM-based Evaluators Confusing NLG Quality Criteria?

Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan

Introduction

With the emergence of powerful large language models (LLMs) such as ChatGPT, LLM-based evaluators have been widely used for various natural language generation (NLG) tasks (Chiang and Lee, 2023a; Kocmi and Federmann, 2023; Luo et al., 2023; Gao et al., 2023). In evaluation for common NLG tasks such as summarization (Fabbri et al., 2021), dialogue (Mehri and Eskénazi, 2020), and story generation (Xie et al., 2023a), different aspects of quality (such as fluency and faithfulness) should be considered individually. Traditional evaluation metrics are either incapable of evaluating specific aspects, like BLEU (Papineni et al., 2002) and BERTScore (Zhang et al., 2020), or can only roughly assess a single aspect, like FactCC (Kryscinski et al., 2020). In contrast, LLMs can be treated much like human annotators, with the definitions of various aspects supplied in the prompt for flexible evaluation. Some studies (Wang et al., 2023a; Mendonça et al., 2023; Liu et al., 2023b) have shown that LLM-based evaluators achieve performance comparable to humans in many NLG tasks, suggesting that LLMs are promising candidates for automatic evaluation.

However, while exploring LLM-based NLG evaluation, we observed two noteworthy phenomena that have not been reported in previous work. First, the evaluation results from LLMs for a given aspect can achieve a higher correlation with human judgments on another, clearly different aspect. Second, the correlations between LLM-generated scores across different aspects are significantly higher than the corresponding correlations between human judgments. These observations lead us to question the reliability of LLM evaluations on the required aspects, since LLMs seem to confuse different aspects.

Understanding these issues first requires a look at the aspects themselves, which stem from human evaluation for NLG tasks and are typically described by terms and definitions that together form specific criteria. Through semi-structured interviews, Zhou et al. (2022) revealed that if aspects for evaluation lacked clear conceptualization, human annotators might conflate different aspects, such as fluency and grammaticality. Howcroft et al. (2020) pointed out the long-standing confusion of terms and definitions in human annotations, resulting in incomparable evaluations. Combining these findings with our own investigation of previous work involving evaluation criteria, we believe that there are two distinct issues. The first is inconsistent conceptualization, where the definition is clearly articulated but inconsistent with other definitions of the same aspect. The second is ambiguous expression, where the definition is so vague that human annotators are not sure what it really means. In Figure 1, we present an example of evaluation for fluency, where the criteria enclosed by the dashed box are selected from existing work and correspond to these two issues.

Therefore, we should reduce the influence of the issues within the evaluation criteria as much as possible, so as to reveal the actual performance of LLMs on NLG evaluation across aspects. We collect many existing criteria from the relevant previous papers and summarize a clear hierarchical classification system for the most commonly used aspects. For each aspect, we construct five criteria with descriptions of different levels of detail, including default, detailed, and simplified ones, to explore the corresponding effects. Then, inspired by behavioral testing in NLP (Ribeiro et al., 2020), we carefully design a series of perturbation attacks based on the classification system to conduct targeted analyses on both proprietary LLMs (GPT-3.5 and GPT-4) and a specifically fine-tuned LLM (Prometheus). Unlike previous related work, each of our perturbations is designed for a specific aspect, so that we can better verify how evaluation scores vary for related and unrelated aspects. We also engage human annotators to check our perturbations and their expected impacts to enhance the reliability of our attack tests. To sum up, our contributions and findings are as follows:

To the best of our knowledge, we are the first to explore the capabilities of LLMs in distinguishing aspects during NLG evaluation and the impacts of different criteria descriptions, bridging human and LLM-based evaluation.

We summarize a classification system containing 11 common aspects and propose 18 aspect-targeted perturbation attacks, which have been verified by human annotators, to test the fine-grained evaluation behaviors of LLMs.

Our experimental results reveal the confusion across different aspects in LLM-based evaluation, even for the powerful GPT-4, which necessitates attention and in-depth research. Our resources and data will be released, serving as a benchmark to facilitate future relevant work.

Preliminary Study

To explore the NLG evaluation capabilities of LLMs and potential issues, we conduct experiments with GPT-3.5 on the commonly used summarization evaluation dataset SummEval (Fabbri et al., 2021), attempting the evaluation forms introduced by Chiang and Lee (2023b). Their work, as well as other studies (Wang et al., 2023a; Chiang and Lee, 2023a; Liu et al., 2023b), has explored directly prompting LLMs for NLG evaluation. Since their evaluations are all zero-shot, we additionally employ few-shot methods. The main experimental results are presented in Table 1. Consistent with the findings of Chiang and Lee (2023b), requiring the model to analyze before rating (analyze-rate), combined with multiple samplings and a temperature of 1, achieves the best performance. These settings are therefore also used in our following experiments. However, the few-shot method does not help as expected; instead, it leads to worse performance.

We also present the Pearson correlation coefficients between the evaluation scores of the model and human experts across different aspects in Figure 2. Interestingly, we notice some issues of confusion inherent in LLMs. First, the evaluation from the model is likely to achieve a higher correlation with human judgments on another aspect than on the current aspect (such as fluency and relevance). Second, most correlations between scores from the model across the four aspects are significantly higher than the corresponding ones between human judgments. It seems that GPT-3.5 confuses different aspects during evaluation to a certain extent, leading to a convergence in their results. We therefore study some output cases and discover that GPT-3.5 indeed incorporates assessments regarding other aspects, as illustrated by the red part in Figure 1. We speculate that this may lead to the poor performance of few-shot evaluations. Similar problems can also be found in GPT-4 and Prometheus (Kim et al., 2023a), with more discussion and details in Appendix A. These results suggest that LLM-based NLG evaluation hides reliability problems, and that more targeted research and experiments are required.
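The cross-aspect comparison described above can be illustrated with a minimal sketch. It assumes per-text scores are available as plain lists keyed by aspect name; the function names and the toy numbers are hypothetical and are not taken from the paper's data or code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(vx * vy)


def cross_aspect_correlations(model_scores, human_scores):
    """Return {(model_aspect, human_aspect): r} for every pair of aspects."""
    return {
        (m, h): pearson(ms, hs)
        for m, ms in model_scores.items()
        for h, hs in human_scores.items()
    }


def confused_aspects(corr):
    """List (aspect, other) pairs where the model's scores for `aspect`
    correlate more strongly with human scores for `other` than for `aspect`."""
    confused = []
    for aspect in sorted({m for m, _ in corr}):
        row = {h: r for (m, h), r in corr.items() if m == aspect}
        best = max(row, key=row.get)
        if best != aspect and row[best] > row.get(aspect, float("-inf")):
            confused.append((aspect, best))
    return confused


# Toy scores for four texts (illustrative values only).
model = {"fluency": [1, 2, 3, 4], "relevance": [1, 2, 3, 4]}
human = {"fluency": [2, 1, 4, 3], "relevance": [1, 2, 3, 4]}
print(confused_aspects(cross_aspect_correlations(model, human)))
```

In this toy case the model's fluency scores track human relevance more closely than human fluency, which is the kind of pattern the first observation refers to.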

Methodology

Our findings in the preliminary study lead us to question whether LLMs can properly understand and execute the evaluation requirements represented by different criteria. To explore this in more depth, we propose a fine-grained perturbation test inspired by behavioral testing (Ribeiro et al., 2020), hoping to reveal their actual capacities for NLG evaluation. In particular, instead of the relatively coarse-grained perturbations in previous work, our perturbation attacks are crafted to target specific evaluation aspects without affecting evaluations for unrelated ones. We formulate our approach as follows:

We first collect a set of common criteria denoted as $C=\{c_{i},i=1,2,\ldots,m\}$, whose concepts may or may not include one another. Each of our perturbation attacks $p_{j},j=1,2,\ldots,n$ is applied to the original text $x$ to generate the corresponding perturbed text $p_{j}(x)$. Each perturbation is designed and expected to reduce the text quality only with respect to the criteria for the originally targeted aspect and for other aspects whose scopes cover it. We define the set of criteria affected by the perturbation $p_{j}$ as $C^{j}_{T}$, and the rest of $C$ as $C^{j}_{F}$. The distinction between these two groups is made more reliable by our classification system and human annotations. Then, to conduct the test, we prompt the model to score all the perturbed texts, as well as the original text, to check the two expected evaluation behaviors:

where $N$ denotes the number of original texts pending perturbations in our test, and $M_{s}$ serves as the model’s scoring based on provided information, which includes additional necessary content $v$ aside from texts and criteria to evaluate, such as task instructions. $S_{T}$ represents the set of those that should be affected, where each item $s^{j}_{i}$ represents the change in evaluation scores after the perturbation $p_{j}$ regarding the criterion $c_{i}$ , which should be significant. On the other hand, $S_{F}$ is defined in a similar manner, but each item of it is expected to be zero, showing no impacts of perturbations. We will describe important components of our approach in more detail in the following sections.
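The two sets described here can be written out explicitly. The following is a sketch consistent with the surrounding definitions rather than a verbatim formula: the index $k$ over the $N$ original texts $x_{k}$ and the sign convention (original score minus perturbed score, so that a drop in quality is positive) are assumptions.

$$
\begin{aligned}
S_{T} &= \Bigl\{\, s^{j}_{i} = \frac{1}{N}\sum_{k=1}^{N}\bigl[M_{s}(x_{k}, c_{i}, v) - M_{s}(p_{j}(x_{k}), c_{i}, v)\bigr] \;\Bigm|\; c_{i}\in C^{j}_{T},\ j=1,2,\ldots,n \Bigr\},\\
S_{F} &= \Bigl\{\, s^{j}_{i} = \frac{1}{N}\sum_{k=1}^{N}\bigl[M_{s}(x_{k}, c_{i}, v) - M_{s}(p_{j}(x_{k}), c_{i}, v)\bigr] \;\Bigm|\; c_{i}\in C^{j}_{F},\ j=1,2,\ldots,n \Bigr\}.
\end{aligned}
$$

Under this reading, the items of $S_{T}$ are expected to be clearly positive, while the items of $S_{F}$ are expected to stay close to zero.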

As noted by Howcroft et al. (2020) and Zhou et al. (2022), existing evaluation aspects suffer from inconsistent and unclear conceptualization, which makes it difficult to understand their requirements and the relationships among them. In light of this, we carefully collect and read about 300 papers that involve various aspects for NLG evaluation, and select the most commonly used ones. Then, we integrate the definitions used in the corresponding work and construct our default criteria as unambiguously as possible. Furthermore, thanks to the relatively clear relationships within our definitions, they can be organized into a tree-like classification system, as shown in Figure 3.

3.2 Perturbation Attacks

For each fundamental aspect in Figure 3, we design several targeted perturbation attacks, some of which are displayed in Table 2. Since fluency involves more considerations than grammaticality alone, we also propose additional perturbations for it, such as adding repetitive content. The perturbations are crafted and expected to affect, as far as possible, only the current aspect and those located at its ancestor nodes. Compared to prior perturbation research, where the attack texts are constructed for universal checking using simple templates or rules, our perturbations are more fine-grained and require better control during generation; for example, the complements added for non-hallucination should be related but not contradictory. Therefore, we manually write some high-quality examples as demonstrations, along with corresponding instructions, and then prompt the powerful GPT-4 to construct the perturbed texts in 10-shot settings. We have sampled and manually inspected them to ensure their quality and reliability; more details, including all 18 types of perturbations, are described in Appendix B.

3.3 Different Descriptions of Criteria

Since different definitions are often used in practice for the same aspect, forming different corresponding criteria, we also study the impact of different levels of detail in definitions, as shown in Table 3. Taking fluency as an example, there are four different types besides our default definitions in Figure 3: simplified, detailed, term, and list. Moreover, to better analyze the existing issues of quality criteria, we select for each aspect several typical criteria that have been used for NLG evaluation in the existing literature. We will release all the resources and data mentioned above, including the collections of criteria, prompts for data construction, perturbed texts, and relevant experimental results, to facilitate future research on NLG evaluation.

Data and Test Settings

We select three common NLG tasks for our experiments and tests: summarization (including news and dialogue), paraphrase, and table-to-text generation. The construction of perturbation attacks requires high-quality original texts to ensure significant declines in quality on the targeted aspects. However, previous studies typically employed references directly from the corresponding datasets, which are often generated by rules instead of being written by humans, leading to unsatisfactory quality (Kryscinski et al., 2019; Pu et al., 2023; Sottana et al., 2023). Therefore, we carefully prompt the powerful GPT-4 to obtain better references based on the original data. We finally sample 1000 pieces of data, each of which is subjected to the 18 different perturbations we propose.

Our tests cover both proprietary LLMs (GPT-3.5 and GPT-4) and open-source LLMs (Prometheus). GPT-3.5 and GPT-4 have been reported in many existing studies to perform well in flexible NLG evaluation. On the other hand, some research has recently shifted toward fine-tuning specialized open-source LLMs for evaluation, aiming to avoid the deficiencies of prompting LLMs, such as high costs and unstable reproducibility. However, most of them do not support evaluation with specified criteria, and among the remaining ones, only Prometheus (Kim et al., 2023a), fine-tuned on Llama-2-Chat-13B (Touvron et al., 2023), has a released model.

To more reliably distinguish whether different criteria would be affected by specific perturbation attacks, we collect human annotations and judgments in addition to the guidance of the classification system. Because human annotation is costly and time-consuming, it is not feasible to manually judge all the data. We therefore sample a portion of the data and recruit 40 annotators (each of whom is proficient in English and holds a corresponding certificate) so that each piece of data is annotated four times. Overall, the more detailed the aspect definitions are, the more closely the corresponding human judgments match our expectations based on the classification system, and the higher the annotation consistency. In particular, definitions of the detailed type achieve the highest match rate of 94.4%, with full results shown in Table 21.

Due to space limitations, more details for this section, including related discussions and the prompts used, are described in Appendix C.

Experiments

We primarily present the experiments and analyze the results with GPT-3.5; the performance of other LLMs is described in Section 5.4. To minimize the interference from the criteria themselves, we first analyze the results with the detailed type of aspect definitions, which human judgments have confirmed to align most closely with our expectations. The main results are shown in Table 4, where each item represents the average change in evaluation scores between the texts before and after a perturbation attack. Moreover, the items with consistent judgments, as shown in Table 21, can be categorized into two groups: $S_{T}$ (with wavy lines) and $S_{F}$ (with underlines), as defined in Section 3, which correspond to the directional expectation test for impactful attacks and the invariance test for non-impactful attacks, respectively. Furthermore, we explore the effects of different levels of detail in aspect definitions. The complete experimental results for the three LLMs can be found in Appendix D.

The results show that the perturbations for Coherence and Informativeness led to almost no degradation, with the changes in evaluation scores before and after perturbation all less than 0.2. However, human annotators judged clearly, and contrary to the model, that these perturbations should affect their respective aspects, which indicates that the model lacks understanding of these two aspects. As for Fluency, the impact of perturbations intensified progressively from repetition to passive voice and then to inversion, which is consistent with intuition since the degree of sentence alteration increases. Notably, despite our explicit mention that redundant information should be considered in evaluations regarding Fluency, both GPT-3.5 and human annotators fail to adhere to the instruction. Through discussions with human annotators, we find that repetition issues are common and easy to ignore; if such issues are also present in training data, they may lead to the verbosity bias in LLMs (also observed by Zheng et al. (2023)). On the other hand, all perturbations for Grammaticality and Non-contradiction except for negation, as well as complement for Non-hallucination, show noticeable and expected decreases (greater than 2). The remaining ones, like those for Simplicity, are less pronounced, with score variations ranging between 0.5 and 1.

5.2 Invariance Test

Conversely, in situations where the evaluation should not be affected, the primary deviations from expectations and human judgments occur for the Grammaticality and Non-contradiction perturbations, particularly the latter. Grammatical issues influence all criteria, yet there is a clear hierarchy. The undeserved impacts on Coherence and Simplicity, aspects that also fall under Readability, are greater than those on the less related aspects under Adequacy. They are also smaller than the impacts on criteria that are indeed expected to be affected, such as Fluency. This indicates that while GPT-3.5 struggles to disregard grammatical errors when assessing irrelevant criteria, it can still differentiate to some extent. However, two perturbations for Non-contradiction cause almost indistinguishable degradations in all criteria, even those under Readability that do not require the source content. In comparison, perturbations for Non-hallucination, which is also part of Faithfulness, do not result in similar behaviors. This suggests that GPT-3.5 may be overly sensitive to conflicting points between the target and source content, while being more restrained in judging unverifiable information.
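The two checks discussed in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' evaluation code: the thresholds `min_drop` and `tolerance`, the function names, and the toy numbers are all hypothetical, and a positive change means the score dropped after perturbation.

```python
def mean_score_change(original_scores, perturbed_scores):
    """Average drop in score (original minus perturbed) over the same texts."""
    if not original_scores or len(original_scores) != len(perturbed_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [o - p for o, p in zip(original_scores, perturbed_scores)]
    return sum(diffs) / len(diffs)


def run_tests(changes, affected, min_drop=1.0, tolerance=0.2):
    """Label each (perturbation, criterion) item with its test and outcome.

    changes:  {(perturbation, criterion): mean score change}
    affected: {perturbation: set of criteria expected to be affected}
    min_drop and tolerance are illustrative thresholds, not values from the paper.
    """
    report = {}
    for (perturbation, criterion), delta in changes.items():
        if criterion in affected.get(perturbation, set()):
            # Directional expectation test: the score should drop noticeably.
            report[(perturbation, criterion)] = ("directional", delta >= min_drop)
        else:
            # Invariance test: the score should stay essentially unchanged.
            report[(perturbation, criterion)] = ("invariance", abs(delta) <= tolerance)
    return report


# Toy example: a spelling perturbation should hurt grammaticality, not coherence.
changes = {("spelling", "grammaticality"): 2.4, ("spelling", "coherence"): 0.9}
affected = {"spelling": {"grammaticality", "fluency"}}
for item, (test, passed) in run_tests(changes, affected).items():
    print(item, test, "pass" if passed else "fail")
```

A failed invariance item, like the coherence entry in the toy data, corresponds to the undeserved impact on an unrelated criterion described above.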

5.3 Different Definition Types

Considering the strong instruction-following capabilities of current LLMs, it is intuitive that the more detailed the description of criteria, the more accurate the evaluation from the model should be. However, the correlations between the results of the five different types of criterion descriptions are quite high, as shown in Figure 8. The nearly identical evaluation behaviors suggest that GPT-3.5 may rely primarily on terminology to understand and assess each criterion. We speculate that our written definitions are close to the model’s inherent understanding of the corresponding terms, which leads to this phenomenon. The related knowledge is likely derived from a wide range of pre-training corpora. In contrast, human annotators, who lack extensive NLG evaluation experience, do perform differently when given these different types of descriptions, as shown in Appendix C.

Furthermore, for deeper comparison, we display the score distributions of $S_{T}$ and $S_{F}$ with different description types in Figure 4(a). It seems that exhaustive descriptions can still help the model make clearer judgments to some extent, since the variations in the evaluation of $S_{T}$ are more significant. However, the changes in scores in the invariant situation are somewhat erratic, suggesting that the confusion issues are unrelated to whether descriptions are detailed or not. In addition, we calculate the correlation of evaluation scores for different aspects for each description type. We present the results of the detailed and term types in Figure 4(b), with the complete results displayed in Appendix D. The less detailed the description is, the more similar the evaluations for different aspects are, indicating more severe confusion.

5.4 Different LLMs

We have also conducted the same experiments on GPT-4 and Prometheus for a comparative analysis of different types of LLMs. Due to the large scale of our perturbation attacks and the high cost of prompting GPT-4, we sampled one-fifth of the data for the GPT-4 test. All of the results are shown in Appendix D and the corresponding figures. We find that both the more powerful GPT-4 and the specially fine-tuned Prometheus also have the issues described above for GPT-3.5. GPT-4 performs better than GPT-3.5 in the directional expectation test but, surprisingly, worse in the invariance test, showing especially severe sensitivity to the perturbations targeting Grammaticality and Non-contradiction. On the other hand, Prometheus performs the worst in both tests, basically failing to differentiate between various aspects, which may be due to its small model size and its training data constructed by GPT-4.

Discussions

To investigate the failures of LLMs in our attack tests, we conduct some extended experiments based on the detailed aspect definitions. We retain only the definition or only the term, or even use an empty criterion, with the results presented in Figures 30 and 31, which still exhibit convergence. This indicates that the improper sensitivity of LLMs to grammaticality and contradiction is likely derived from default evaluation behaviors inherent in LLMs. These behaviors accumulate regardless of whether the current criteria are unclear or unrelated to those two aspects. Moreover, the detailed aspect definitions do have an effect for aspects whose terms are not commonly used in NLG evaluation, like non-hallucination.

Furthermore, we have attempted different methods to intervene in LLM-based evaluation to mitigate these issues. When the model is instructed not to consider unrelated aspects and problems, or even when grammaticality and contradiction are explicitly mentioned, there are only slight improvements of varying degrees. We therefore also draw on the ideas of Chain of Thought (CoT) and Multidimensional Quality Metrics (MQM) to decompose the evaluation. However, with both GPT-3.5 and GPT-4, this method did not alleviate the aforementioned problems but further increased their confusion. Overall, such issues are quite stubborn and pose a challenge for reliable LLM-based evaluation.

To cover a more diverse range of descriptions of quality criteria, we also conduct human evaluation and LLM-based evaluation with descriptions selected from existing papers. We find that ambiguous expressions play a role similar to that of the less informative descriptions we designed, in both human and LLM-based evaluation. Inconsistent conceptualizations (e.g., a description mixing Fluency and Grammaticality) can alter human judgments on related aspect-targeted perturbations, and a similar but weaker effect exists in LLM-based evaluation.

Related Works

Recent studies have highlighted issues with NLG evaluation metrics through synthetic perturbations, showing that their scores often diverge from human judgments (Sai et al., 2021) and exposing some of their blind spots (He et al., 2023). Moreover, some studies focused on diagnostic tests for single tasks or specific aspects, such as translation (Karpinska et al., 2022), summarization (Ernst et al., 2023), story generation (Xie et al., 2023b), and factuality (Chen et al., 2021). Notably, Zhang et al. (2023) explored the robustness of LLM-based dialogue evaluators using perturbation strategies, while Liu et al. (2023d) highlighted their inability to judge closed-ended responses without references under adversarial conditions. Neither study addressed varying expressions or distinctions among evaluation aspects.

Wang et al. (2023b) pointed out that the order of the two texts affects evaluation results when ChatGPT and GPT-4 are used as comparison-based evaluators. LLM-based evaluators also prefer longer responses (Zheng et al., 2023) and responses generated by themselves (Liu et al., 2023b). Wang et al. (2023b) discovered that the performance of ChatGPT on summarization evaluation varies across systems and aspects. Hada et al. (2023) stated that LLM-based evaluators may have more biases in non-Latin languages.

In human evaluation, Belz et al. (2020) proposed a classification system based on the properties of quality criteria to support comparability. Howcroft et al. (2020) demonstrated that different descriptions of quality criteria can be mapped to normalized criteria. In LLM-based NLG evaluation, researchers have attempted to automatically generate quality standards more suitable for LLMs. Liu et al. (2023e) let LLMs draft expressions of quality criteria based on examples with human ratings. Kim et al. (2023b) utilized LLMs to review user-defined criteria and offered suggestions for disambiguation, merging, and splitting. Furthermore, some studies aim to improve LLMs’ ability to evaluate specific aspects through chain-of-thought prompting (Gong and Mao, 2023) and instruction tuning (Liu et al., 2023a).

Conclusions

In this work, we conduct fine-grained perturbation attack tests on LLMs, guided by the classification system and human judgments, to reveal their actual performance in NLG evaluation. Our findings can be summarized as follows: 1) The performance of LLMs in our perturbation tests deviates significantly from expectations, with both unawareness and oversensitivity in some aspects. 2) The different levels of detail in criteria barely change the evaluation behaviors of LLMs, except for criteria with uncommon terms like non-hallucination. 3) The oversensitivity may be inherent in LLMs rather than caused by problems within criterion descriptions, since it persists in evaluations with empty criteria. 4) The confusion issues are so stubborn that even explicit instructions telling LLMs to consider or ignore specific problems have no obvious effect. These results show that LLM-based evaluation is not that reliable across different evaluation aspects. Therefore, in-depth analysis of the aforementioned problems in LLMs and effective methods for improving the evaluation capabilities of LLMs are necessary and worth exploring in future research.

Limitations

Our summarized classification system and designed perturbation attacks are mainly applicable to the commonly used aspects in closed-ended text generation tasks. Our work therefore does not include aspects with strong subjectivity, such as interestingness, which can be further explored in future work.

Due to limited resources, the domains and task types covered in our experiments are limited. The lengths of source documents and texts to be evaluated are generally a few hundred words, and the data we use is in English. Therefore, we cannot guarantee the same conclusions for long texts, other languages, or data from special domains.

We make extensive use of APIs from GPT-3.5 and GPT-4 for constructing data and testing, which incurs significant costs. This may discourage others from replicating these experiments, but we will release all the resources and data to facilitate related research.

References

Appendix A Details for Preliminary Study

We follow the evaluation forms proposed by Chiang and Lee (2023b), including scoring modes, temperatures, and sampling settings. For more information, please refer to their paper and repository (Liu et al., 2023c). As for the prompts and instructions used for evaluation, we employ those from Chiang and Lee (2023b) for GPT-3.5 and GPT-4, and those provided by Kim et al. (2023a) for Prometheus. The complete results are included in Table 5; the default settings are zero-shot evaluation with a sampling number of 20 and a temperature of 1. The multiple sampled results are post-processed and averaged to obtain the final scores. For few-shot evaluations, the selected demonstrations have uniformly distributed human labels, and the analyses are generated with GPT-3.5 where required. Moreover, the correlation matrices for GPT-4 and Prometheus are shown in Figure 5 and Figure 6, respectively. Although the performance of GPT-4 is significantly better than that of GPT-3.5, its confusion issues seem to be more severe; meanwhile, Prometheus not only performs the worst, but its confusion is also quite serious.

Appendix B Details for Perturbation Attacks

We construct four relatively simple types of perturbations (sentence exchange, word exchange, spelling mistake, and sentence deletion) based on corresponding rules, while the remaining 14 types are generated by GPT-4 in 10-shot settings. We manually write the 140 demonstrations and carefully check them to ensure they meet the requirements of the corresponding perturbation attacks. Then, we prompt GPT-4 with these demonstrations as well as detailed instructions so that it generates perturbed texts as close to the desired ones as possible. All 18 types of perturbations and corresponding examples are shown in Table 6, together with the aspects to which they belong. All the demonstrations, the instructions for GPT-4, and the code for rule-based constructions are included in the supplemental materials. In addition, Tables 7-17 show the different criteria for each aspect described in Section 3.3.
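To make the four rule-based perturbation types concrete, the sketch below shows one possible way to implement them. It is not the code from the supplemental materials: the naive sentence splitter, the choice to swap adjacent words, and the choice to transpose two interior letters for a spelling mistake are all assumptions made for illustration.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_exchange(text, rng):
    """Swap two randomly chosen sentences."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return text
    i, j = rng.sample(range(len(sentences)), 2)
    sentences[i], sentences[j] = sentences[j], sentences[i]
    return " ".join(sentences)


def sentence_deletion(text, rng):
    """Delete one randomly chosen sentence."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return text
    del sentences[rng.randrange(len(sentences))]
    return " ".join(sentences)


def word_exchange(text, rng):
    """Swap one randomly chosen pair of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def spelling_mistake(text, rng):
    """Transpose two different adjacent interior letters of one alphabetic word."""
    words = text.split()
    options = [
        (k, p)
        for k, w in enumerate(words)
        if w.isalpha()
        for p in range(1, len(w) - 2)  # empty for words shorter than 4 letters
        if w[p] != w[p + 1]
    ]
    if not options:
        return text
    k, p = rng.choice(options)
    w = words[k]
    words[k] = w[:p] + w[p + 1] + w[p] + w[p + 2:]
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the perturbations are reproducible
text = "The cat sat. The dog ran. The bird flew."
for perturb in (sentence_exchange, sentence_deletion, word_exchange, spelling_mistake):
    print(perturb.__name__, "->", perturb(text, rng))
```

Passing an explicit `random.Random` instance keeps each perturbed text reproducible, which matters when the same perturbed texts are scored under many criteria.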

Appendix C Details for Data and Test Settings

We select 200, 200, 300, and 300 pieces of data from CNN/Dailymail (Hermann et al., 2015), SAMSum (Gliwa et al., 2019), News Commentary (http://data.statmt.org/news-commentary/v18.1), and WebNLG (Gardent et al., 2017), respectively, for the tasks of news summarization, dialogue summarization, paraphrase generation, and table-to-text generation. However, the original references in common datasets for these tasks are often not written by humans or are even missing. For instance, references for news summarization are often assembled from article highlights to build large-scale datasets, but they tend to be incoherent or contain information not present in the source news. To ensure that the references are of sufficient quality to serve as the original texts in perturbation attack tests, we take advantage of the powerful GPT-4 to improve them, avoiding expert annotations that are hard to obtain. Specifically, depending on the condition of the original references in different tasks, we prompt GPT-4 to generate new references for news summarization and paraphrase generation, while for table-to-text generation GPT-4 modifies and improves the original references. We directly use the original references from SAMSum in dialogue summarization since they are human-written. As shown in the evaluation results of GPT-3.5, the original texts for perturbations, namely the references, are generally scored around 5 in all aspects, showing their high quality. The prompts we use here are shown in Table 19. For each reference, we construct 18 different perturbed texts in various directions, leading to 19000 texts to be evaluated (the references plus their perturbed versions). Moreover, taking into account the eleven different aspects and the different types of definitions involved with each, there are a total of 80 distinct evaluation criteria. Combined, they constitute our experimental data at a scale of $80\times 19000=1.52$M.
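The scale figures in this paragraph follow from simple counting, which the short check below spells out. The variable names are illustrative; the numbers are the ones stated in the text.

```python
# Per-dataset sample sizes: CNN/Dailymail, SAMSum, News Commentary, WebNLG.
n_references = 200 + 200 + 300 + 300
n_perturbations = 18
n_criteria = 80

# Each reference is evaluated once as-is and once per perturbation.
texts_per_reference = 1 + n_perturbations
n_texts = n_references * texts_per_reference
n_evaluations = n_texts * n_criteria

print(n_texts, n_evaluations)  # 19000 1520000
```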

C.2 LLMs

We test GPT-3.5 and GPT-4 through the API provided by OpenAI, using the versions GPT-3.5 Turbo (1106) and GPT-4 Turbo (1106), respectively. Prometheus (Kim et al., 2023a), on the other hand, was proposed to achieve performance close to that of proprietary LLMs like GPT-4 in NLG evaluation. Its authors constructed 100K evaluations with feedback through GPT-4 and fine-tuned Llama-2-Chat-13B (Touvron et al., 2023) on them, endowing the model with the capacity to evaluate across diverse and customized criteria. We directly use the prompts provided by Kim et al. (2023a) for the evaluation with Prometheus. For all three LLMs in our test, we follow the setting of Chiang and Lee (2023b): the model analyzes before rating on a scale of 1–5, with the temperature set to 1.0 and the sampling number set to 10 in the zero-shot setting. The prompts are shown in Table 20.
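Turning several sampled analyze-then-rate responses into one score involves a parsing step and an averaging step, sketched below. The response format is an assumption (a final line such as "Score: 4" or "Rating: 5"); the actual prompts are in Table 20, and the function names here are hypothetical. No API call is shown, only the post-processing of already collected responses.

```python
import re
from statistics import mean

# Assumed format: the rating appears as "Score: N" or "Rating: N" with N in 1-5.
SCORE_PATTERN = re.compile(r"(?:score|rating)\s*[:=]?\s*([1-5])(?!\d)", re.IGNORECASE)


def extract_score(response):
    """Return the last 1-5 rating found in a response, or None if none is found."""
    matches = SCORE_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


def aggregate(responses):
    """Average the parsable ratings over sampled responses (None if none parse)."""
    scores = [s for s in map(extract_score, responses) if s is not None]
    return mean(scores) if scores else None


samples = [
    "Analysis: a few awkward phrases, otherwise smooth. Score: 4",
    "The text reads naturally throughout.\nRating: 5",
    "I cannot decide on a number.",
]
print(aggregate(samples))  # 4.5
```

Taking the last match assumes that the analysis precedes the rating, as in the analyze-rate form; responses without a parsable rating are simply skipped in this sketch.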

C.3 Human annotation

To help human annotators compare texts before and after perturbations, we use a comparative form in human evaluations. Two texts are displayed simultaneously on the annotation interface, and annotators judge their quality relationship based on the given description of the quality criterion, as shown in Figure 7. Because we design different types of descriptions and select some quality criteria with ambiguous expressions from existing papers, we want to record the uncertainty of annotators facing quality criteria of varying detail. The quality relationships they can choose from are therefore "better than" (A), "worse than" (B), "as well as" (C), and "uncertain" (D). All 40 annotators come from the company’s professional data annotation department, have certificates of English proficiency, and are paid more than the local minimum wage. Due to limited resources, we sample one example from each of the four datasets (CNN/Dailymail, SAMSum, News Commentary, and WebNLG) for human annotation. Each sample is subjected to the 18 types of perturbation attacks, resulting in 18 pairs of test samples with and without perturbations. We have 11 quality criteria in total, and for each criterion, besides the 5 descriptions of varying detail that we design, we also select 1-3 descriptions from existing papers, making up a total of 80 descriptions. To prevent interference from other descriptions, an annotator is exposed to at most one description of each quality criterion. Specifically, we divide the 40 annotators into four groups of ten, with each group annotating all the data once, meaning that each test sample is annotated by four annotators. Within each group, every annotator annotates all test samples under 8 descriptions of different quality criteria, with the types of descriptions distributed as evenly as possible (e.g., an annotator would not annotate only descriptions of the "Term" type). The total volume of annotations is $4\times 18\times 80\times 4=23040$. The entire annotation process takes about 20 days.

C.3.2 Results

We define the annotation consistency per sample as the proportion of annotations that fall on the most frequently chosen option, excluding the "uncertain" (D) option. For example, if the options given by four annotators on a sample are $\{A,A,C,D\}$, then the annotation consistency is 0.5. The final annotation consistency is the average across all samples. We calculate the match rate (i.e., the proportion of human judgments about perturbations that match our expectations) in two ways. The result is shown in Table 18.
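The consistency measure can be sketched as follows. The reading used here is the one implied by the worked example: the count of the most frequent option other than D is divided by the total number of annotations, including any D votes. Returning 0.0 when every annotator chose D is an assumption, as are the function names.

```python
from collections import Counter

UNCERTAIN = "D"


def sample_consistency(labels):
    """Share of all annotations taken by the most frequent option other than D."""
    if not labels:
        raise ValueError("a sample needs at least one annotation")
    counts = Counter(label for label in labels if label != UNCERTAIN)
    if not counts:
        return 0.0  # assumption: a sample where everyone is uncertain scores 0
    return max(counts.values()) / len(labels)


def overall_consistency(samples):
    """Average the per-sample consistency across all samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(sample_consistency(labels) for labels in samples) / len(samples)


print(sample_consistency(["A", "A", "C", "D"]))  # 0.5
```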

Appendix D Details for Experiments

We display all the results of perturbation attacks on GPT-3.5, GPT-4, and Prometheus in Figures 8-31.