You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments
Bangzhao Shu, Lechen Zhang, Minje Choi, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card, David Jurgens
Introduction
Large Language Models (LLMs), trained on a massive and diverse volume of human-generated text corpora, show remarkable capabilities in understanding instruction-based tasks and achieving high performance in several NLP benchmarks Brown et al. (2020); Touvron et al. (2023); Taori et al. (2023). Prior work has shown that the generated responses of LLMs often contain human-like properties that were acquired during the model’s training phase Sap et al. (2022), such as being able to explain a theory of mind about others in situations. These human-like behaviors have led to several studies that focus on identifying the psychological properties or the persona of LLMs using specifically designed prompts. Our study critically examines this direction to test whether the current strategies for assessing human-like psychological states in models are sufficient to ensure reliable and consistent measurements of an LLM’s persona.
In humans, a persona encompasses a broad class of attributes that make up a person’s identity, such as personality, demographics, or values, which all influence how a person portrays themself Cheng et al. (2023). This terminology has been adopted in several NLP studies which range from identifying personas from text corpora Bamman et al. (2013); Chu et al. (2018); Ghosh et al. (2022); Zhu et al. (2023) to injecting personas into language generation tasks Ahn et al. (2023); Xu et al. (2022); Lee et al. (2022); Li et al. (2023). As LLMs are increasingly used in interpersonal settings, it is beneficial to have accurate measurements of latent properties in the model that can influence what text they generate in order to mitigate any potential harm that may arise from undesired innate model biases Lucy and Bamman (2021); Feng et al. (2023). As a result, several recent studies have investigated the opinions that LLMs such as ChatGPT represent from angles such as political perspectives Liu et al. (2022), personality tests Pan and Zeng (2023); V Ganesan et al. (2023); Miotto et al. (2022); Jiang et al. (2023), and moral choices Santurkar et al. (2023); Cheng et al. (2023).
Current approaches to measuring dimensions of LLMs’ personas typically assess them like humans by turning the questions in psychological instruments into prompts and recording answers. While models are trained to answer questions in general, multiple works have raised concerns about the brittleness of the model’s answering abilities Sclar et al. (2023), e.g., their sensitivity to prompt formats. Further, recent studies have shown that LLMs are prone to cues such as negation and thus generate inconsistent results rather than fully comprehending the question García-Ferrero et al. (2023). Thus, here, we investigate the capability of LLMs to generate responses to persona-related questionnaires from three angles: (1) Comprehensibility: are LLMs capable of understanding instructions and generating answers given a specific prompt? (2) Sensitivity: do model answers stay the same under spurious changes to the question format? (3) Consistency: do model answers stay consistent under content-level changes to the question?
This study has the following three contributions. First, we curate Model-Personas, a large panel of 39 psychological instruments, and standardize these into 693 questions across 115 axes. Second, we introduce a systematic evaluation framework for testing the sensitivity and consistency of LLMs’ answers to persona questions through controlled variations of the prompts. Third, we evaluate multiple open-source LLMs using Model-Personas, showing that models vary widely in their sensitivity and consistency levels, with most models having no consistent persona. Our results reveal that BLOOMZ models are most robust to sensitivity perturbations, while FLAN-T5 models are most consistent. In general, however, most LLMs failed to deliver robust predictions, raising concerns about claims of models’ “personalities” or “values.”
Model-Personas: A Comprehensive Benchmark for Measuring Personas
Studies for creating and identifying personas largely involve qualitative methods such as interviews, surveys, and field studies Brickey et al. (2012); Salminen et al. (2020). In particular, surveys in the form of questionnaires have been widely adopted in psychology and behavioral studies to measure personality traits and opinions of individuals in a categorizable manner at a large scale Spence et al. (1974); Dalbert (1999); Patrick et al. (2002); Van Der Zee and Van Oudenhoven (2000). These questionnaires, known as psychological instruments, are frequently calibrated through experiments to capture core axes of variation in people. Questionnaires are easily compatible with LLMs pretrained with instruction-based prompts, as these models can provide a wide range of answers ranging from open-ended responses to yes/no answers; further, prompting is widely accepted as the default input form for eliciting responses from LLMs. Following this trend, our benchmark also takes the form of a questionnaire designed to be answered in a yes/no format.
In constructing a benchmark for assessing model persona, we performed a comprehensive survey of possible instruments. The selection criteria were focused on persona attributes that are mostly stable and excluded instruments focusing on mental health. Our persona instruments can be categorized into five groups: Belief statements, Normative statements, Values, Descriptors, and Situations. Belief statements include instruments that reflect an individual’s conviction about the truth of a particular idea, such as Unjust World Scale (UWS) Dalbert et al. (2001) and Money Attitudes Measure (MAM) Furnham and Grover (2020); Normative statements include instruments that express value judgments, opinions, or prescriptions about how things ought to be, such as Holistic Cognition Scale (HCS) Lux et al. (2021) and Ambivalent Classism Inventory (ACI) Jessica A. Jordan and Bosson (2021); Values include instruments that examine an individual’s deeply held beliefs about what is important or desirable and serve as guiding principles for behavior and judgment, such as Strength Based Inventory (SBI) Nahathai Wongpakaran and Kuntawong (2020); Descriptors include instruments that are used to detail personality traits, such as Big 5 Personality Traits (OCEAN) Poropat (2009); Situations include instruments that measure individuals’ responses and behaviors in various social contexts and scenarios, such as Emotional Response to Unfairness Scale (ERUS) Bizer (2020).
Each instrument contains one or more axes that evaluate a specific dimension. For example, the Ethics Position Questionnaire (EPQ) Forsyth (1980) contains two axes: Idealism and Relativism, which evaluate an individual’s ethics position from two different aspects. Furthermore, each axis contains one or more statements, and individuals can get a score on the axes based on how they agree or disagree with the statements. Overall, our dataset consists of 693 questions under 39 instrument categories and 115 axes, encapsulating a broad spectrum of psychological and sociological constructs. The sample instruments are shown in Appendix Table 3.
Instruments each have their own question format or phrasing, which introduces unnecessary variability when evaluating LLMs. Therefore, across all instruments, we introduce a standard question format to prompt models with. Following best practice on prompt design (e.g., Aher et al. (2023)), we use a structured prompt of "Statement:\n
Design of Prompt Variants
Given that LLMs can be sensitive to the format Sclar et al. (2023) and content Min et al. (2023) of prompts, here, we introduce the design choices for perturbing the prompts. These changes are intended to affect the comprehensibility, sensitivity, and consistency of an LLM’s answer for a given instrument question.
Our first analysis centers on whether spurious changes to the input prompt can affect model predictions when inferring persona. The perturbations made in this aspect, in theory, should not change the model’s confidence in generating an answer. Four types of prompt variations are used:
Sentence Ending: We compare two types of sentence ending, "?" and ":". For example, we compare a prompt ending in "Do you agree?" with one ending in "Do you agree? Your Answer:".
Colon+<Space>: We test whether varying the whitespace after the colon (zero spaces, one space, two spaces, or a line-break) can affect performance. For example, we test whether "Answer: " would produce different results from "Answer:\n".
Answer/Response: We compare two choices for the final word of the prompt, "Answer:" and "Response:".
Section Separation Format: We compare different formats to separate sections (Statement/Question/Answer) in our prompt. The separators include Line-break, Single Space, Double-Bar (//) and Triple-Sharp (###).
Detailed examples can be found in Table 5.
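The four spurious variations above can be enumerated programmatically. The following is a minimal sketch, not the authors' code: the helper names (`build_prompt`, `spurious_variants`), the section labels, the way separators are padded with spaces, and the example item are all assumptions made for illustration, since the actual templates are those listed in Table 5.

```python
# Hypothetical building blocks: the exact templates used in the paper are in Table 5.
SEPARATORS = {
    "line_break": "\n",
    "single_space": " ",
    "double_bar": " // ",
    "triple_sharp": " ### ",
}
COLON_SUFFIXES = {"no_space": "", "one_space": " ", "two_spaces": "  ", "line_break": "\n"}
END_WORDS = ("Answer", "Response")


def build_prompt(statement, question, separator="\n", end_word="Answer", colon_suffix=""):
    """Join the Statement/Question/Answer sections into a single prompt string.

    Passing end_word=None drops the final section, so the prompt ends with the
    question itself (the "?" sentence ending).
    """
    sections = [f"Statement: {statement}", f"Question: {question}"]
    if end_word is not None:
        sections.append(f"{end_word}:{colon_suffix}")
    return separator.join(sections)


def spurious_variants(statement, question):
    """Return {variant name: prompt}, changing one formatting factor at a time."""
    variants = {"ending=question_mark": build_prompt(statement, question, end_word=None)}
    for name, sep in SEPARATORS.items():
        variants[f"separator={name}"] = build_prompt(statement, question, separator=sep)
    for word in END_WORDS:
        variants[f"end_word={word}"] = build_prompt(statement, question, end_word=word)
    for name, suffix in COLON_SUFFIXES.items():
        variants[f"colon_suffix={name}"] = build_prompt(statement, question, colon_suffix=suffix)
    return variants


# Illustrative item, not taken from any instrument in the benchmark.
variants = spurious_variants("I enjoy meeting new people.", "Do you agree?")
```

Changing one factor at a time, rather than taking the full cross-product, keeps each variant attributable to a single formatting choice when its answers are compared with the baseline.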
2 Prompts for Content-level Variation
Even if LLMs are able to understand the instructions and generate an answer with high probability, it is possible that they are merely generating based on the question structure rather than on their understanding of the question. To evaluate this aspect, we construct a set of perturbations on the original input prompts to examine whether LLMs can generate consistent responses. Four types of prompts are used:
Option Consistency: We test the consistency of response when asked to return different types of labels. For example, the responses of an LLM when asked to answer "Yes/No" should be consistent when asked to answer "True/False".
Negation Consistency: LLM predictions are known to be affected by the inclusion of negation words Jang et al. (2023); García-Ferrero et al. (2023). We test this by manually rewriting each question so that its meaning is reversed and examining the changes in response. We test two types of negation: (1) Direct Negation: We insert a negation word such as “not”, “no”, and “don’t” in a syntactically coherent position to reverse the answer’s polarity. (2) Paraphrastic Negation: We reverse the meaning of the sentence by rephrasing it without including a negation word. Examples of this are in Appendix Table 4.
Order Consistency: We test the consistency of a response when the given answering options are in reversed order. For example, if we ask LLMs to answer using "Yes/No", the answer should be consistent when also asked to answer using "No/Yes".
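The content-level variations can be organized as a small data structure. This is a rough sketch under stated assumptions: `NegatedItem`, `option_question`, and `content_variants` are hypothetical names, the question wording is invented, and the example sentences are illustrative rather than drawn from the benchmark. The negated statements are supplied by hand, mirroring the manual rewriting described above, and the full cross-product of statements and option orders is a convenience of this sketch, not necessarily the paper's design.

```python
from dataclasses import dataclass

OPTION_PAIRS = (("Yes", "No"), ("True", "False"))  # (affirmative, negative)


@dataclass(frozen=True)
class NegatedItem:
    """A statement plus its two manually written reversals."""
    original: str
    direct_negation: str
    paraphrastic_negation: str


def option_question(affirmative, negative, reverse_order=False):
    """Build the question text, optionally listing the negative option first."""
    first, second = (negative, affirmative) if reverse_order else (affirmative, negative)
    return f"Do you agree? Reply with {first} or {second}."


def content_variants(item):
    """Return {(statement kind, option order): (statement, question)}."""
    statements = {
        "original": item.original,
        "direct_negation": item.direct_negation,
        "paraphrastic_negation": item.paraphrastic_negation,
    }
    variants = {}
    for kind, statement in statements.items():
        for affirmative, negative in OPTION_PAIRS:
            for reverse_order in (False, True):
                order = f"{negative}/{affirmative}" if reverse_order else f"{affirmative}/{negative}"
                variants[(kind, order)] = (statement, option_question(affirmative, negative, reverse_order))
    return variants


# Illustrative item only.
item = NegatedItem(
    original="I enjoy meeting new people.",
    direct_negation="I do not enjoy meeting new people.",
    paraphrastic_negation="I dislike meeting new people.",
)
variants = content_variants(item)
```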
Experimental Setup
Here, we describe the experimental setup and define the metrics for evaluating model performance.
We define a model’s comprehensibility as the ability to generate an answer corresponding to one of the available options, e.g., “True” or “False”. Let $P(t \mid q)$ be the probability that token $t$ is the next token after the prompt $q$.
The comprehensibility of a question is defined as $C(q) = \sum_{t \in T_{+} \cup T_{-}} P(t \mid q)$, where $T_{+}$ and $T_{-}$ are the sets of possible valid Positive and Negative answers to the prompt’s question obtained via regular expressions. Given that the prompts all contain explicit instructions for valid answers, this sum indicates the degree to which the model can generate one of the desired tokens as an answer. The model’s answer to the prompt is calculated by normalizing the probabilities of the positive and negative words: $A(q) = \frac{\sum_{t \in T_{+}} P(t \mid q)}{\sum_{t \in T_{+}} P(t \mid q) + \sum_{t \in T_{-}} P(t \mid q)}$, which captures the leaning towards the affirmative option.
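To make the two quantities concrete, the sketch below computes them from a next-token distribution given as a plain dictionary. It is an illustration under assumptions: the regular expressions, the function name `score_prompt`, and the example probabilities are invented here, and a real tokenizer may mark word boundaries with symbols that these patterns do not handle.

```python
import re

# Hypothetical patterns for valid answers; the paper's actual expressions are not given.
POSITIVE = re.compile(r"^\s*(yes|true)\W*$", re.IGNORECASE)
NEGATIVE = re.compile(r"^\s*(no|false)\W*$", re.IGNORECASE)


def score_prompt(next_token_probs):
    """Return (comprehensibility, leaning) for one prompt.

    next_token_probs maps each candidate next token to its probability.
    Comprehensibility is the total mass on valid answer tokens; leaning is the
    share of that mass on the affirmative side (None if no mass is valid).
    """
    p_pos = sum(p for tok, p in next_token_probs.items() if POSITIVE.match(tok))
    p_neg = sum(p for tok, p in next_token_probs.items() if NEGATIVE.match(tok))
    comprehensibility = p_pos + p_neg
    leaning = p_pos / comprehensibility if comprehensibility > 0 else None
    return comprehensibility, leaning


# Made-up distribution for illustration.
probs = {"True": 0.5, " true": 0.25, "False": 0.125, "The": 0.125}
comprehensibility, leaning = score_prompt(probs)
```

Returning `None` when no probability mass falls on a valid answer keeps non-answers distinct from a balanced leaning of 0.5, which matters for how an experimenter later treats them.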
2 Measuring Sensitivity and Consistency
If a model is able to answer the prompt, to what degree do its answers vary when the format and content of the question are varied? We define a model’s sensitivity as the degree to which its answers change when prompted with spurious variations, and a model’s consistency as the degree to which a model agrees across different paraphrases of the same question.
For each question $q$ in the instrument dataset $D$, we first measure the model’s response $r(q)$ as the valid response option with the highest probability. We then modify $q$ into a different prompt $q'$. This modification can either occur as a spurious change (§3.1) or at content-level (§3.2). We then obtain $r(q')$ as well.
Since our model should generate answers that are robust to perturbations, we measure both sensitivity and consistency as the fraction of samples for which the answers did not change after perturbation. However, for negation consistency, we expect the model to answer with the reverse option in order to be consistent with the non-negated original prompt; negation consistency is therefore measured as the fraction of questions for which the answer for $q'$ is the opposite of the answer for $q$.
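Both metrics reduce to counting agreements or flips over paired answers. The sketch below assumes answers have already been mapped to booleans (`True` for the affirmative option, whatever its label); the function names and the example answers are hypothetical.

```python
def _check_pairs(original, perturbed):
    """Validate that the two answer lists are non-empty and aligned."""
    if not original or len(original) != len(perturbed):
        raise ValueError("answer lists must be non-empty and of equal length")


def unchanged_fraction(original, perturbed):
    """Fraction of questions whose answer is the same before and after perturbation."""
    _check_pairs(original, perturbed)
    return sum(o == p for o, p in zip(original, perturbed)) / len(original)


def negation_consistency(original, negated):
    """Fraction of questions whose answer flips when the statement is negated."""
    _check_pairs(original, negated)
    return sum(o != n for o, n in zip(original, negated)) / len(original)


# Made-up answers for four questions.
original = [True, True, False, True]
after_format_change = [True, False, False, True]
after_negation = [False, True, True, False]

format_score = unchanged_fraction(original, after_format_change)
negation_score = negation_consistency(original, after_negation)
```

Under this definition, a model that always answers affirmatively scores 1.0 on `unchanged_fraction` and 0.0 on `negation_consistency`, which is why the two are best read together.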
3 Model Details
Using our consistency metrics, we perform evaluations on several variants of open-source LLMs which are widely used in current research. The models included in our experiments are GPT-2 Radford et al. (2019), Falcon-7B Penedo et al. (2023), BLOOMZ (560M, 1B1, 3B, 7B1) Muennighoff et al. (2022), Llama2 (7B, 7B-Chat, 13B, 13B-Chat) Touvron et al. (2023), RedPajama-7B Computer (2023), and FLAN-T5 (Small, Base, Large, XL) Chung et al. (2022). Note that we exclude closed-source LLMs such as GPT-3.5 as our analysis requires obtaining the exact probability scores for the next predicted token. All experiments are conducted on NVIDIA RTX A6000 GPUs using Hugging Face Transformers 4.22.1 and PyTorch 2.0.1 on a CUDA 11.7 environment.
Results
Here, we present our results on the robustness of LLM predictions on Model-Personas.
Models varied widely in their ability to generate a valid answer to the instruments’ questions, as shown in Table 1. Models from the BLOOMZ and FLAN-T5 families demonstrate a uniformly higher likelihood of responding correctly to all variations of the prompts. In evaluations using nine varied prompt formats, the BLOOMZ family models consistently return correct responses with probabilities exceeding 0.95. The FLAN-T5 models have probabilities ranging from 0.5 to 0.7, which may be related to differences in model size. Falcon-7B, RedPajama-7B, Llama 2-7B, and Llama 2-13B show varied performance when faced with different prompt formats. For instance, in Falcon-7B, adding a single space after a colon can drastically cut the comprehensibility score from 0.934 to 0.006, indicating that its ability to respond as “True” or “False” to a given question is harshly impeded.
Our results suggest that psychological instruments cannot be blindly given to models without first testing whether the question format is understandable. Subtle changes in prompt syntax can significantly influence the performance of some models in validly answering questions, which, depending on how an experimenter handles non-answers, may significantly influence the model’s scores on the instrument.
2 LLMs can be Sensitive even to Spurious Prompt Variation
Even when models can validly answer questions, our experiments show that their answers may change due to small, spurious differences in the format of the prompt itself. Table 2 shows the sensitivity of each LLM in comparison with the baseline prompt setting, which is marked with an asterisk. Ideally, an LLM should not change the leaning of its answer when asked the same question with slightly different prompt formats, especially under trivial changes such as changing a single space to a double space or line break. Nevertheless, we observe that several LLMs change their responses when prompted using such variations. In several cases, we observe that the sensitivity score of a model in a particular setting is similar to random (0.5), though it is hard to find a consistent pattern among the cases where models express sensitivity. Notably, all of the LLMs in the FLAN-T5 family exhibit perfect robustness to these perturbations. Despite LLMs of the BLOOMZ family consistently showing the highest comprehensibility across prompt variations (Table 1), the models’ answers change frequently, nearing the consistency of random behavior. This experiment suggests that, while the two may be correlated, returning answers with high confidence does not imply robustness to spurious prompt variation, and vice versa.
3 Staying Consistent is Challenging for LLMs
We now turn to whether LLMs are capable of understanding the persona questions and providing consistent answers that suggest a latent persona property. Figure 1 shows each model’s consistency scores; we summarize the trends below.
All models show Order Consistency performance above random choice. Most models scored over 0.7, excluding GPT-2 and RedPajama. No clear relationship emerges between model size and order consistency, and performance disparities are also present across different model families, with FLAN-T5 models leading. It is important to note that a high order consistency score does not unequivocally indicate model superiority, as models that consistently respond positively, regardless of prompt, will naturally score higher.
The variance in option consistency is also noteworthy. Within model families, larger models do not always outperform their smaller counterparts, though Llama2-13B does outperform Llama2-7B. When models of the same size from different families are compared, the performance varies. Similar to order consistency, a higher score in option consistency does not necessarily mean the model is performing well; it could indicate a tendency to respond positively regardless of the prompt. GPT-2, for instance, shows high option consistency but low negation consistency, suggesting it provides uniform answers independent of prompt variations. BLOOMZ-560M exhibits lower option consistency, indicating a potential disparity in its performance on True/False versus Yes/No questions. The FLAN-T5 model family stands out for its stability and superior performance in option consistency.
Negation Consistency is hard to achieve with LLMs
While most LLMs maintain consistency levels over a random-answering baseline for order consistency and option consistency, all models struggle to generate consistent answers when the meaning of the question is reversed using negation, either with the direct inclusion of negative words or through semantic changes. These results align with recent work on negation prompts García-Ferrero et al. (2023) which showed that understanding negation is challenging for various LLMs. In fact, the majority of models (10 of 15) achieve a score close to random (0.5) regarding Negation Consistency. Only five models exhibit higher consistency—primarily among larger models such as FLAN-T5-Large, FLAN-T5-XL, BLOOMZ-7B, and Llama2-13B-Chat, and, interestingly, BLOOMZ-560M, the smallest of the BLOOMZ family. It can also be seen that models tend to achieve higher consistency when direct negation words are used rather than the sentence being semantically negated. We can observe from Figure 1 that all models perform worse on paraphrastic negations than on direct negations. A potential reason is that paraphrastic negation introduces subtle shifts and requires a deeper understanding of the context to be able to provide a flipped answer, whereas for direct negation the negation word itself can lead to a flipped answer.
Summary
In summary, consistency varies substantially across the tested models, with larger models generally exhibiting a greater likelihood of consistent responses across the four metrics examined. Nevertheless, the majority fail to outperform a simple random choice. Notably, the BLOOMZ-560M model displays exceptional consistency with True/False questions but significantly less so with Yes/No questions. The FLAN-T5 family consistently performs well across all metrics of persona consistency.
The key implication of this result is that simply prompting a model with an instrument’s questions is not sufficient to claim any persona. Instead, models must be prompted with at least negated forms of the questions to verify that the model’s answers indicate a deeper understanding of the prompt and not just an artifact.
For our purposes, we consider a model to exhibit consistent personas if its consistency score reaches a threshold of 0.7. Of the 15 models evaluated, only two, Flan-T5-Large and Flan-T5-XL, both from the Flan-T5 family, met this criterion, suggesting the potential for these models to possess consistent personas.
4 Persona Leanings can Differ even within the Same Model Group
Using the two models with semantic consistency above 0.7, Flan-T5-Large and Flan-T5-XL, we generate answers for every question in Model-Personas and aggregate them by each instrument axis. Figure 2 displays a comparison of the leanings on the dimensions where the two models differed most (a full display of all axes can be found in Appendix Figures 3-5). Even though both models share the same architecture and training dataset, we observe significant differences among the leanings for the selected dimensions. In general, for most of the instruments Flan-T5-XL can be seen as aligning more towards the extreme values compared to Flan-T5-Large. This can be seen as Flan-T5-XL being more certain of its answers and assigning greater probabilities to either “True” or “False” than its counterpart, which may relate to its higher comprehensibility scores as seen in Table 1.
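The per-axis aggregation and the selection of the most divergent axes can be sketched as follows. This is only an illustration: the text does not state which aggregate is used, so the mean is an assumption, as are the function names and every number in the example (the EPQ axis names come from the benchmark description, the leanings do not).

```python
from collections import defaultdict
from statistics import mean


def axis_leanings(records):
    """Average per-question leanings within each (instrument, axis) pair.

    records is an iterable of (instrument, axis, leaning) tuples.
    Using the mean as the aggregate is an assumption of this sketch.
    """
    grouped = defaultdict(list)
    for instrument, axis, leaning in records:
        grouped[(instrument, axis)].append(leaning)
    return {key: mean(values) for key, values in grouped.items()}


def largest_gaps(leanings_a, leanings_b, top_k=5):
    """Return the axes on which two models' leanings differ most, largest first."""
    shared = leanings_a.keys() & leanings_b.keys()
    gaps = {key: abs(leanings_a[key] - leanings_b[key]) for key in shared}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up leanings for two hypothetical models.
model_a = axis_leanings([("EPQ", "Idealism", 0.75), ("EPQ", "Idealism", 0.25), ("EPQ", "Relativism", 0.5)])
model_b = axis_leanings([("EPQ", "Idealism", 1.0), ("EPQ", "Idealism", 1.0), ("EPQ", "Relativism", 0.5)])
top = largest_gaps(model_a, model_b, top_k=1)
```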
Discussion
In this section, we discuss the implications of our study as well as future steps for addressing and mitigating inconsistency issues.
Despite the rapidly increasing view of LLMs as a means of understanding and emulating human responses, our results show that most of our models fail to generate coherent responses even when tested on simple variants of input prompts.
Interestingly, our results indicate that the reliability of LLMs is not necessarily correlated with a model’s number of parameters, which is consistent with recent findings indicating that larger model sizes do not always lead to higher task performance or task understandability Choi et al. (2023). Rather, we observe that comprehensibility, sensitivity, and consistency scores can be better grouped at the architecture family level. This was particularly notable for the models belonging to the FLAN-T5 and BLOOMZ families, showing that design details in the pretraining phase might have a profound effect on an LLM’s zero-shot capabilities when prompted to provide answers in downstream tasks.
2 Mitigating the Unreliability of Prompts
What measures can we adopt to mitigate the current unreliability of LLMs? We offer two suggestions based on prior studies. One approach would be to perform a preliminary test on the confidence scores of the answers for a given set of prompts before running the prompts to obtain persona leanings. For instance, Aher et al. (2023) adopt an approach of selecting, from a set of $k$-choice prompts, the prompt that maximizes the validity rate, then conducting subsequent experiments on that prompt.
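A preliminary check of this kind could look like the sketch below, which picks the prompt format with the highest mean validity on a pilot set of questions. It is a simplified illustration in the spirit of the approach described above, not a reproduction of Aher et al. (2023): `select_prompt`, the `min_rate` guard, and the pilot scores are all assumptions.

```python
from statistics import mean


def select_prompt(validity_by_format, min_rate=0.0):
    """Return (format name, mean validity) for the best-scoring prompt format.

    validity_by_format maps a format name to per-question validity scores,
    e.g. the probability mass placed on valid answer tokens.
    Raises ValueError if no format reaches min_rate.
    """
    if not validity_by_format:
        raise ValueError("no prompt formats to choose from")
    rates = {name: mean(scores) for name, scores in validity_by_format.items()}
    best = max(rates, key=rates.get)
    if rates[best] < min_rate:
        raise ValueError(f"best format {best!r} is below the minimum rate {min_rate}")
    return best, rates[best]


# Made-up pilot scores for three prompt endings.
pilot = {"Answer:": [0.75, 1.0], "Answer: ": [0.0, 0.25], "Response:": [0.5, 0.5]}
best_format, best_rate = select_prompt(pilot)
```

The `min_rate` guard reflects the point that a format should be rejected outright when even the best candidate rarely yields a valid answer.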
Another promising approach is to perform additional fine-tuning steps to improve model robustness. However, performing additional fine-tuning steps might not always be beneficial. For instance, one recent study has shown that additionally fine-tuning LLMs on text corpora that include negation samples does not significantly improve their capability to understand negation (García-Ferrero et al., 2023). Besides, there is the risk that additional fine-tuning might alter the LLM’s innate persona that was present before fine-tuning, which raises an issue in reproducibility and generalizability.
Conclusion
Human-like interactions with large language models can inspire a desire to assess models like humans. In the psychological setting, this can mean assessing whether models have human-like persona traits, such as personality or values, using questionnaires as prompts. However, does the text models generate in response reflect a consistent latent attribute of the model—or just a continuation of a high probability sequence? Here, we systematically assess whether LLMs are capable of generating robust responses for assessing personas by evaluating the extent to which LLMs can understand questions and provide answers in a consistent manner under various prompt variations. Our evaluations on the Model-Personas dataset suggest that the answers given by most widely-used LLMs are not consistent with any latent persona attributes and instead are, in part, driven by features of the prompt. Not only do models vary in their ability to generate a valid answer, relative to spurious changes in format, but the answers themselves—e.g., whether a model affirms an extroversion preference—are also sensitive to such changes. Furthermore, most LLMs fail to deliver consistent leanings when the question meaning is reversed using negation. In fact, only two of the fifteen models, both from the Flan-T5 family, were able to achieve a reasonable average consistency score over 0.7. Overall, our study demonstrates the unreliability of a blind application of a psychological questionnaire for assessing the attributes of LLMs, and calls for cautionary measures such as sensitivity and consistency checks to ensure robustness of measurement. The code and dataset are available at https://github.com/orange0629/llm-personas.
Acknowledgments
This work was funded in part by LG AI Research and by the National Science Foundation under Grant No. IIS-2143529. The authors thank Moontae Lee, Lajanugen Logeswaran, and Honglak Lee for feedback on early versions of this work.
References
Appendix
Table 3 contains examples of personas, their corresponding instrument set, and an example question that is used as a prompt for the LLM.
Negation Examples
Table 4 contains examples of reversing the meaning of a sentence via both direct and paraphrastic negation.
Detailed Prompt Variants
Table 5 shows the format of every prompt variant that was used to evaluate an LLM’s comprehensibility and sensitivity.
Persona Leanings across All Instruments
Figures 3, 4, 5 show a comparison of persona leaning scores across every single instrument for the FLAN-T5-Large and FLAN-T5-XL models.