Towards Measuring the Representation of Subjective Global Opinions in Language Models

Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, Deep Ganguli

cs.CL cs.AI

Introduction

Large language models (LLMs) exhibit remarkable performance on a wide variety of tasks, some of which involve subjective decision-making processes such as detecting bias and toxicity, steering model outputs to abide by ethical principles, generating model evaluations, and summarizing the most important information from articles. However, such applications may require language models to make subjective judgments that could vary significantly across different social groups. If a language model disproportionately represents certain opinions, it risks imposing potentially undesirable effects such as promoting hegemonic worldviews and homogenizing people’s perspectives and beliefs. To assess this risk, we develop a framework to quantitatively measure the opinions of LLMs (§2). Inspired by prior work (see §6 for related work), we first compile a set of questions and responses from two established cross-national surveys designed to capture values and beliefs from thousands of participants across many countries: the Pew Global Attitudes Survey (PEW; https://www.pewresearch.org/) and the World Values Survey (WVS) (§2.1, see Table 1 for example questions). (Pew Research Center bears no responsibility for the analyses or interpretations of the data presented here. The opinions expressed herein, including any implications for policy, are those of the author and not of Pew Research Center.) (Assessing people’s opinions is challenging. We rely on the Pew Global Attitudes Survey and the World Values Survey, which means we inherit all the pros, cons, assumptions, and caveats of the social science research that attempts to measure such values.) We then administer the survey questions to an LLM trained to be helpful, honest, and harmless with reinforcement learning from human feedback and Constitutional AI (§2.2). (While we evaluate our framework using a single language model, the methodology can be applied to assess other models as well. Here, we scope our work to focus on the evaluation framework and results, rather than on systematically benchmarking the values of multiple models as in prior work.) Finally, we compute the similarity between model responses and human responses, where the human responses are averaged within a country (Fig. 1, §2.3). (We fully recognize that computing an average of human survey responses within each country elides the fact that there is significant variability in opinions within a country. Nevertheless, to compute the similarity between LLM responses and people’s responses, we must make a simplifying assumption such as this one.)

With our framework, we run three experiments described in §2.4. In our first experiment, we simply administer the survey questions as they are and analyze the resulting model outputs. We find that the model we analyze generates survey responses that are quantitatively more similar to the opinions of participants from the USA, Canada, Australia, and several European and South American countries than to those of participants from other countries (Fig. 2, §3). This is consistent with qualitative findings from prior work. It suggests there may be biases inherent in the models that can lead to certain groups’ opinions being underrepresented, compared to the opinions of participants in Western countries. (Following a definition from prior work, the West refers to the regions and nations of Europe, the United States, Canada, and Australasia, and their common norms, values, customs, beliefs, and political systems.) We also find that for some questions, the model assigns high probability to a single response, whereas human responses across countries to the same question reveal a greater diversity of responses (§4).

In our second experiment, we find that prompting the models to consider the opinions of certain groups, e.g., ones from China and Russia, can lead the models to modify their responses (Fig. 3). However, this does not necessarily mean the models have a meaningful, nuanced understanding of those perspectives and values (§4). Some of these changes could reflect over-generalizations around complex cultural values (see Tab. 5).

Finally, we find that prompting models in different languages does not necessarily translate to responses that are most similar to the opinions of populations that predominantly speak those languages. Despite promising adaptability, language models require deeper understanding of social contexts in order to produce responses that reflect people’s diverse opinions and experiences (Fig. 4, §4).

We believe transparency into the opinions encoded and reflected by current language models is critical for building AI systems that represent and serve all people equitably. Although our framework is a step in this direction, it suffers from several limitations and caveats that we highlight throughout the text in footnotes and in §5. Despite these limitations, we hope our framework can help guide the development of language models that embody a diversity of cultural viewpoints and life experiences, not just those of privileged or dominant groups. (We recognize that LLMs were initially (primarily) developed in the West, and specifically in Silicon Valley. These regions have their own cultures and values which are imbued into the technology.)

Methods

We compile 2,556 multiple-choice questions and responses from two large cross-national surveys: Pew Research Center’s Global Attitudes surveys (GAS, 2,203 questions) and the World Values Survey (WVS Wave 7, 353 questions). Pew Research Center is a nonpartisan organization that provides data and research on public opinion, social issues, and demographic trends in the U.S. and worldwide. Global Attitudes surveys cover topics such as politics, media, technology, religion, race, and ethnicity. The World Values Survey is a global research project that investigates people’s beliefs and values across the world, how these beliefs change over time, and the social and political impact of these beliefs. Some example questions are in Table 1, along with a more detailed analysis of these questions in Appendix A.

We choose these datasets for three main reasons. First, both the GAS and WVS surveys provide a starting point, backed by rigorous social science research, that we can easily adapt to assess how language models respond when posed with subjective questions regarding global issues. Second, the surveys include responses from people across the world, which allows us to directly compare human responses with model responses (described in §2.3). Finally, the surveys use a multiple-choice format, which is readily suitable for LLMs since responses can be scored more objectively than responses to open-ended questions. (We recognize the limitations in using these surveys to evaluate language models, as they were not specifically designed for this purpose. As such, the construct validity of these measures when applied to LLMs is limited. While these surveys can provide some insights into LLMs’ capabilities, the results should be interpreted cautiously given the possibility of biases encoded in measurement artifacts. More tailored evaluations may be needed to gain a comprehensive understanding of language models’ strengths and weaknesses.)

2.2 Models

We study a decoder-only transformer model fine-tuned with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI (CAI) to function as a helpful, honest, and harmless dialogue model. Details about model architectures, training data, training procedures, and evaluations are described in prior work.

For the model we study here, the majority of the pre-training data are in English. The human feedback data for RLHF (used to train the model to be helpful) are primarily provided by North Americans (primarily in English) whose demographics roughly match the U.S. Census. A small set of the principles used for CAI training (used to train the model to be honest and harmless) encourages the model to consider non-US-centric perspectives; other principles are based on the Universal Declaration of Human Rights (https://www.anthropic.com/index/claudes-constitution). (Additionally, we examined the influence of the amount of RLHF training on our results because previous work shows that the amount of RLHF training can significantly change metrics on a wide range of personality, political preference, and social bias evaluations; however, we surprisingly found no strong effects on whose opinions the model’s generations are more similar to. As such, we only report on a model after a fixed amount of RLHF and CAI training in the main text.) A priori, it was unclear how this combination of pre-training data, RLHF fine-tuning data, and CAI principles might influence the models to consider non-US-centric perspectives. We leave a detailed analysis of this for future work that we discuss in §5.

2.3 Metric

Given a set of survey questions $Q=\{q_{1},q_{2},...,q_{n}\}$ extracted from GAS and WVS, we compute the similarity of the responses from a set of models $M=\{m_{1},m_{2},...,m_{k}\}$ with the responses from a set of countries $C=\{c_{1},c_{2},...,c_{l}\}$ as follows (illustrated in Figure 1):

For each model $m\in M$, record predicted probabilities over options $O_{q}$ for each question $q\in Q$:

$$P_{m}(o_{i}\mid q)\quad \forall\, o_{i}\in O_{q},\ q\in Q$$

For each country $c\in C$, compute average probabilities over options $O_{q}$ for each question $q\in Q$ based on responses, if $n_{c|q}>0$:

$$P_{c}(o_{i}\mid q)=\frac{n_{o_{i},c|q}}{n_{c|q}}\quad \forall\, o_{i}\in O_{q},\ q\in Q$$

where $n_{c|q}$ denotes the number of respondents from country $c$ who answered question $q\in Q$ and $n_{o_{i},c|q}$ denotes the number of respondents from country $c$ who chose option $o_{i}\in O_{q}$ for question $q\in Q$.

Compute the similarity ($S_{mc}$) between a model $m\in M$ and a country $c\in C$ across the questions $q\in Q$, given a similarity metric $Sim$:

$$S_{mc}=\frac{1}{n}\sum_{j=1}^{n} Sim\big(P_{m}(O_{q_{j}}\mid q_{j}),\,P_{c}(O_{q_{j}}\mid q_{j})\big)$$

Here, we use 1 - Jensen-Shannon Distance as our similarity metric. However, our overall method is agnostic to the specific metric used.
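As an illustration, the similarity computation could be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: it assumes base-2 logarithms (so the Jensen-Shannon distance lies in [0, 1]), it assumes that questions with no respondents from a country are skipped and the average is taken over the remaining questions, and all function names and numbers are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (log base 2, an assumption)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding error


def country_distribution(counts):
    """Turn per-option respondent counts into shares; None if nobody answered."""
    n = sum(counts)
    return [c / n for c in counts] if n > 0 else None


def similarity(model_probs, country_counts):
    """Average of (1 - JS distance) over questions that have respondents."""
    scores = []
    for question, p_model in model_probs.items():
        p_country = country_distribution(country_counts.get(question, []))
        if p_country is None:
            continue  # assumption: skip questions nobody from this country answered
        scores.append(1.0 - js_distance(p_model, p_country))
    return sum(scores) / len(scores) if scores else float("nan")


if __name__ == "__main__":
    # Toy, made-up numbers for two questions with two options each.
    model_probs = {"q1": [0.95, 0.05], "q2": [0.50, 0.50]}
    country_counts = {"q1": [600, 400], "q2": [480, 520]}
    print(similarity(model_probs, country_counts))
```

Because the overall method is agnostic to the metric, `js_distance` could be swapped for another bounded distance without changing `similarity`.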

2.4 Experimental Setup

We conduct three experiments to analyze the similarity between survey responses from LLMs and responses from the participants of the surveys. Further details on the experimental methodology are provided in Appendix B.

Default Prompting (DP) – We ask the model WVS and GAS multiple-choice survey questions as they were originally written. The goal of the default prompt is to measure the opinions stated by the model, relative to people’s aggregate opinions from a country. We hypothesize that responses to the default prompt may reveal biases and challenges models may have at representing diverse views.

Cross-national Prompting (CP) – We prepend to the DP: "How would someone from [country X] respond to this question?" The goal of CP is to measure (relative to the DP condition) potential stereotypes or other assumptions the model may make about people from different countries. We aim to elicit the model’s high-level associations with the named country, while acknowledging that diverse viewpoints may exist within each country. Furthermore, this condition measures how malleable or ‘steerable’ the model’s opinions may be with respect to a minor perturbation in the default prompt. We examine 6 different countries, enumerated in Appendix B.

Linguistic Prompting (LP) – We change the language of the DP. Language variation may reveal information related to individuals’ social identity and background. As large language models are trained on vast amounts of human text data, they may implicitly encode information about the social identities of the original speakers and writers. The goal of LP is to measure how model responses change (relative to the DP condition) based on linguistic cues. Since human translations are not available for all questions, we rely on the language model for translation into 3 target languages: Russian, Chinese, and Turkish. We acknowledge that relying on language models for translation risks errors, ambiguous translation, and a loss of cultural nuances. As such, we verified that the translations are accurate with native speakers (authors of this paper, details in Appendix D).
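The DP and CP conditions can be illustrated with a short sketch of prompt construction. Only the CP prefix sentence is taken from the text above; the multiple-choice layout, the option labels, and the function names are hypothetical assumptions, and the exact prompt templates used in the experiments may differ.

```python
from string import ascii_uppercase


def default_prompt(question, options):
    """DP: the survey question as written, followed by lettered options (layout is assumed)."""
    lines = [question]
    lines += [f"({label}) {text}" for label, text in zip(ascii_uppercase, options)]
    return "\n".join(lines)


def cross_national_prompt(country, question, options):
    """CP: prepend the cross-national instruction to the default prompt."""
    prefix = f"How would someone from {country} respond to this question?"
    return prefix + "\n" + default_prompt(question, options)


if __name__ == "__main__":
    question = "If you had to choose between a good democracy or a strong economy, which would you say is more important?"
    options = ["A good democracy", "A strong economy"]
    print(default_prompt(question, options))
    print()
    print(cross_national_prompt("Russia", question, options))
```

Keeping CP as a pure prefix on top of DP makes the two conditions differ by exactly one sentence, which matches the description of CP as a minor perturbation of the default prompt.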

Main Experimental Results

With default prompting (DP), model responses are most similar to the opinion distributions of countries like the USA, Canada, Australia, and some European and South American countries (Figure 2). Model responses highlight the potential for embedded biases in the models that systematically favor Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations. As mentioned in §2.2, this is likely due to the fact that the model we test is predominantly trained on English data and English human feedback. Prior work also points out that development of AI systems is predominantly centered around Western contexts. As such, models may learn latent patterns that primarily reflect these populations.

With Cross-national Prompting (CP), model responses appear to become most similar to the opinion distributions of the prompted countries (Figure 3). When prompted to give responses tailored to the opinions of people from countries like China or Russia, the model’s stated opinions shift to be most similar to the opinions of those populations. However, this does not necessarily suggest that models are capable of nuanced, culturally-situated representation of diverse beliefs. As we show in Section 4, we find evidence that the model generations exhibit (possibly harmful) cultural assumptions and stereotypes as opposed to a deeper understanding of different cultures. Ultimately, we find that our evaluation framework in this experimental condition reveals new forms of potentially harmful outputs that need to be addressed.

With Linguistic Prompting (LP), model responses do not become more similar to the opinions of the populations that predominantly speak the target languages (Figure 4). For example, we observe that, even when we ask questions in Russian, the model’s responses remain more similar to responses from the USA, Canada, and some European countries (as in the DP condition) than to responses from Russia.

While translating the prompts into different languages provides more linguistic context, this alone may not sufficiently address other factors that contribute to the model’s biases in representing some countries’ opinions more predominantly than others. The primarily English training data, RLHF annotation, and non-US-centric CAI principles (see §2.2 for details) appear insufficient for the model to steer its responses to represent the opinions of the target countries based on linguistic cues. Further analysis and examples illustrating this finding are provided in Section 4.

Question Level Analysis

For some questions, the model assigns high confidence to a single response, whereas human responses across countries reveal a greater diversity of viewpoints. For example, Fig. 1 shows that in response to the question: “If you had to choose between a good democracy or a strong economy, which would you say is more important”, the model assigns a 1.35% probability to the option “A strong economy”. In contrast, people from the USA reply “A strong economy” 41.2% of the time, people from Russia 83.08% of the time, and people from Turkey 48.67% of the time. We observe that human respondents worldwide show a greater range of perspectives on this issue than the model does. (Models trained with RLHF, like the one we analyze here, tend to be less well-calibrated than pre-trained models. As such, they tend to assign probability mass less evenly across the choices in multiple-choice questions than pre-trained models do.) We leave further examples of high model confidence and distributional differences between the model and human responses in Appendix C (e.g., Figures 8 and 9).

Analysis of Cross National Prompting

Although we find that CP can steer models to be more similar to the opinions of the prompted countries (§3, Fig. 3), it is not perfect. For example, Fig. 5 shows the distribution of model and human responses to the question: “Do you personally believe that sex between unmarried adults is morally acceptable?”. In the DP setting, the model responds “Morally unacceptable” 0.8% of the time (it responds “Depends on the situation” 67.3% of the time), whereas Americans and Russians both respond “Morally unacceptable” ${\sim}$31% of the time. When we prompt the model to respond to the question as though it were from Russia, it responds “Morally unacceptable” 73.9% of the time and “Morally acceptable” 0.5% of the time (42.1% of Russians respond this way). CP inadequately reflects the diversity of responses to these questions amongst Russians. One potential reason for this discrepancy is that the model may have limited country-specific training data, such that it learns superficial associations between certain countries and value preferences, as opposed to learning a deeper representation of values across different countries and contexts. We leave further examples in Appendix C (Figures 7 and 8).

Examining Model Generations

Table 2 shows the model generations to the question about sex between unmarried adults (Fig. 5) in the DP and CP settings. With Default Prompting, the model output declines to make a moral judgement. However, with Cross-national Prompting to respond as though Russian, the model output conveys a strong (but not representative) judgement that justifies itself by claiming that Russians hold conservative views on sexuality, traditional family values and Orthodox Christian morality. In this case, the model may over-generalize: the justification may be too simplistic and lack nuance. We leave further examples and analysis in Appendix C (Tables 6-13).

Analysis of Linguistic Prompting

In order to understand the discrepancies between LP and CP, we examine model generations. Table 3 shows an example where Linguistic Prompting (asking the question in Turkish) results in a response that does not match the response distribution of the participants from Turkey (57% of the participants select Option B). Furthermore, we observe that the model generates different responses with CP and LP for this example. With CP, the model generated a response indicating that Turkish people would believe the government should be able to prevent statements calling for violent protests. However, with LP the model selected a response emphasizing the right to free speech. Additional examples where the model generates differing responses with CP versus LP are provided in Appendix C (Tables 12 and 13).

Limitations and Discussion

Our study relies on two established global surveys and social science literature to analyze broad societal values. However, we acknowledge several limitations of this approach. Opinions and values continuously evolve, and surveys may not fully capture cultural diversity or represent all individuals within a society. Furthermore, human values are complex and subjective. We choose to average survey responses across humans within a country, which is a simplifying assumption, but it is unclear what to do when people within a country have dissenting opinions. The main focus of our work is to measure whether language models under- or over-represent certain perspectives, rather than to prescribe exactly how models should reflect human values. While we believe that it is important to consider social contexts when developing AI systems, we do not make definitive claims about ideal levels of cultural representation.

Although we build a framework and dataset to measure the subjective representation of global values in LLMs, we have not attempted to articulate a road map for building models that are inclusive, equitable, and benefit all groups. We hypothesize that some simple interventions may help, such as increasing the amount of multilingual pre-training data, having people from diverse backgrounds provide labels and feedback for instruction-tuning methods such as RLHF, and incorporating more inclusive principles into the constitution for models based on Constitutional AI. We believe our framework and dataset can be used to quantify the impact of these interventions; however, we leave a systematic analysis for future work.

Related Work

While a large amount of technical work has focused on mitigating known issues or aligning with clearly defined values, understanding how models function in settings involving ambiguity, nuance or diverse human experiences has been less explored. However, understanding model behaviour in settings that involve ambiguity is crucial to identifying and mitigating potential biases in order to build models that respect human diversity. Furthermore, there is evidence that LLMs exhibit biases in these settings. For example, they propagate ideological assumptions, values and biases that align with particular political viewpoints. ChatGPT has been found to express pro-environmental, left-libertarian views. Furthermore, analyses of the values and opinions reflected in LLMs have shown greater alignment with those of left-leaning US demographic groups. These findings highlight how LLMs have the potential to reflect and spread biases, assumptions and values aligned with certain demographic identities or political ideologies over others.

LLMs have been shown to reflect and amplify the biases present in their training data. Several studies have found harmful biases related to gender, race, religion and other attributes in these models. There have been various attempts to address these issues. One approach is red teaming and adversarial testing to systematically identify potential harms, shortcomings and edge cases in these models. Another focus has been developing methods to align models’ values and behaviors with human preferences and priorities. However, efforts to remedy the challenge of value imposition, by relying on prompts or other linguistic cues, may not be sufficient. Therefore, we may need to explore methods that embed ethical reasoning, social awareness, and diverse viewpoints during model development and deployment.

Conclusion

We develop a dataset and evaluation framework to help analyze which global values and opinions LLMs align with by default, as well as when prompted with different contexts. With additional transparency into the values reflected by AI systems, researchers can help address social biases and potentially develop models that are more inclusive of diverse global viewpoints. Although our work is a start, we believe we must continue to research how to develop models with broad, structured understanding of social contexts that can serve and respect all people.

Author Contributions

Esin Durmus mainly designed the study, led the project, conducted most of the experiments, and wrote significant portions of the paper. Karina Nguyen developed the interactive data visualization tool and contributed the map visualizations in the paper. Nicholas Schiefer helped Esin Durmus with writing the initial inference and data analysis code. Thomas I. Liao ran the experiment to compute BLEU scores for model translations and wrote Appendix A. Amanda Askell, Alex Tamkin and Carol Chen provided feedback on drafts of the paper. Jared Kaplan, Jack Clark, and Deep Ganguli supervised the project. Deep Ganguli also helped develop core ideas, and helped frame and write the paper. All other listed authors contributed to the development of otherwise-unpublished models, infrastructure, or contributions that made our experiments possible.

Acknowledgements

We thank Samuel R. Bowman, Iason Gabriel, Tatsunori Hashimoto, Atoosa Kasirzadeh, Seth Lazar, Giada Pistilli, Michael Sellitto and Irene Solaiman for their detailed feedback on the paper.

Appendix A Survey Details

Pew Research Center staff design and execute all aspects of the cross-national surveys, from determining the topics and questions to the countries and samples included. However, they hire local research organizations in each country to implement the surveys on the ground. Pew Research Center consults with subject matter experts and experienced researchers on the survey design and content. Pew aims to synchronize fieldwork across countries as much as possible to minimize external events impacting the results. These cross-national studies present special challenges to ensuring comparable data across countries, languages and cultures. Pew Research Center has identified best practices and strategies for overcoming these challenges to conduct high-quality research across countries (https://www.pewresearch.org/our-methods/international-surveys/). The surveys aim to be nationally representative using probability-based sampling. Rigorous quality control measures are implemented, including supervising interviewers, back-checking interviews, monitoring interviewer metrics, and checking on progress and metrics during the field period. Pew Research Center is actively involved in all stages of the research process, from survey design through data collection and analysis.

For each WVS wave, an international team of social scientists develops a master questionnaire in English covering a wide range of topics. The questionnaire is then translated into various languages for use in each country. The latest WVS-7 questionnaire includes 290 questions on topics such as cultural values, gender and family attitudes, poverty and health, tolerance and trust, global governance, etc. It is also used to monitor UN Sustainable Development Goals. To ensure high quality, comparable data across countries, the World Values Survey implements strict standards around sampling, questionnaire translation, fieldwork procedures, and data cleaning. Each country must follow probability sampling to survey a nationally representative sample of at least 1200 people aged 18 and over. The master questionnaire is carefully translated into local languages and pre-tested. Survey agencies report on and address any issues arising during fieldwork. The WVS examines each country’s data for logical consistency, missing information, and unreliable respondents. They check that sample characteristics match expectations. Full documentation from each country allows proper understanding of the context (https://www.worldvaluessurvey.org/WVSContents.jsp).

The survey data did not have predefined topic labels for each question. We use the language model to classify each question into one of the following broader topics based on the question content and responses. The topics are drawn from PEW and WVS survey websites and match the themes covered in the questions. This allows us to understand the key themes covered in the survey. We use the following prompt, and get the probability assigned to each letter appearing before the topic categories:

Figure 6 shows the distribution of topics in the data. The majority of the questions are classified into “Politics and policy” and “Regions and countries”.

Appendix B Experimental Details

B.2 Prompt Sensitivity Analysis

Prior research has demonstrated that results from multiple-choice studies can be sensitive to seemingly arbitrary design choices such as the ordering of options. To ensure our findings are not confounded by such effects, we conduct a sensitivity analysis. Specifically, we test whether our results are robust to changes in the ordering of choices. We randomly shuffle the order of options presented to the model, while keeping the prefix labels (e.g., A, B, C, D) attached to each position consistent. We find that our primary conclusions remain largely the same.
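The shuffling procedure can be sketched as below. This is a minimal illustration under assumptions: the function names, the seeding scheme, and the step that maps per-label probabilities back to the original option order are hypothetical and not taken from the paper.

```python
import random
from string import ascii_uppercase


def shuffle_options(options, seed=None):
    """Shuffle option texts while the labels A, B, C, ... stay in their fixed positions.

    Returns the labeled options as shown to the model, plus the permutation used:
    order[pos] is the original index of the option displayed at position pos.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    labeled = [(ascii_uppercase[pos], options[idx]) for pos, idx in enumerate(order)]
    return labeled, order


def unshuffle_probs(label_probs, order):
    """Map probabilities given per displayed position back to the original option order."""
    original = [0.0] * len(order)
    for pos, idx in enumerate(order):
        original[idx] = label_probs[pos]
    return original


if __name__ == "__main__":
    options = ["Morally acceptable", "Morally unacceptable", "Depends on the situation"]
    labeled, order = shuffle_options(options, seed=0)
    for label, text in labeled:
        print(f"({label}) {text}")
```

Mapping the probabilities back with `unshuffle_probs` lets the shuffled runs be compared against country distributions using the same option order as the unshuffled run.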

Appendix C Additional Analysis

Additional examples are provided to demonstrate model generations as well as how the model’s responses and generations can change with cross-national and linguistic prompts.

Table 6 shows example model generations for questions about economic problems of countries like Greece and Italy, as well as policies restricting head scarves in public places. We observe that the model takes stances on both of these issues and provides further justification to support its positions. For example, for the headscarf policies, the model argues that bans should not be imposed in order to uphold principles of freedom of religion.

Cross-national prompting affects the model’s responses for some questions (Figures 7, 8, and 9). In certain cases, the model adapts its responses to be more similar to those from participants in the target countries. However, for other questions, cross-national prompting does not bring the model’s responses closer to the human responses. We analyze in greater depth how the model’s generations change with cross-national prompting. For example, Table 7 shows the model’s responses for the question in Figure 7. We observe that the model justifies its response by referring to surveys and opinions of Turkish citizens. It further posits that Turkish people believe a free market economy has stimulated economic growth in Turkey. However, for this question, we see that a majority of participants from Turkey agree that people are better off in a free market. Similarly, for the question in Figure 8, cross-national prompting alters the model’s response; however, it does not make the response more like that of participants from China. The model generates explanations to justify its response (Table 8). It also states that "not every Chinese citizen would answer this way," pointing to the diversity of views among individuals. However, with the cross-national prompt, the model’s responses can reflect overgeneralizations regarding a country’s perceptions (e.g., Tables 9 and 10). We further observe that in some cases, the model generates responses stating that it does not hold any opinions or evaluations on a topic because it is just an AI system (Table 11).

Appendix D Translation Ability of the Model into Target Languages

As part of our methodology we use the model to translate questions from English into Russian, Turkish, and Chinese. Since the pre-training data consists primarily of English text, we validate the translation ability of the model into the three respective languages by measuring its performance on a translation benchmark, FLORES-200. The model’s BLEU score when translating from English text ranges from 31.68 to 36.78, suggesting that the translations are generally understandable. We also manually validate the quality of the model translations by asking native speakers to inspect a small sample of outputs. We ask raters to evaluate 100 model-translated questions on a scale of 1 to 5, where 1 represents a very poor translation and 5 represents an excellent translation. Table 5 shows that the model translations are of relatively high quality, according to human ratings.