Evaluating and Mitigating Discrimination in Language Model Decisions

Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli

cs.CL

Introduction

As language models are increasingly adopted across society, interest is growing in deploying them in an expanding range of societal applications, from finance to medicine, to routine business tasks (Wu et al., 2023; Thirunavukarasu et al., 2023). Of particular concern is their potential use in making or influencing high-stakes decisions about people, such as loan approvals, housing decisions, and travel authorizations, which could have widespread consequences for people’s lives and livelihoods (Ransbotham et al., 2017). While model providers and governments may choose to limit the use of language models for such decisions, it remains important to proactively anticipate and mitigate such potential risks as early as possible, especially given that initial deployments in high-stakes settings have already begun (Wodzak, 2022; Singh et al., 2023).

As a step in this direction, we focus on measuring the risk of widespread automated discrimination (Corbett-Davies et al., 2017; Kasy & Abebe, 2021; Starke et al., 2022; Creel & Hellman, 2022) via automated language model decisions. For example, when provided with a hypothetical candidate for a loan, does a language model suggest granting the loan to the candidate more often if the candidate is of one demographic versus another? The gold standard for investigating this question would be to analyze language model decisions in specific real-world applications, following the rich tradition of audit studies in the social sciences (Jowell & Prescott-Clarke, 1970; Gaddis, 2018). However, such partnerships can be resource intensive to execute, making it challenging to conduct a broad study across the wide range of potential applications of language models, and are additionally limited by challenging privacy and ethical constraints of conducting such studies on real people (Crabtree & Dhima, 2022). Furthermore, we wish to proactively anticipate the risks for many applications that have not yet been developed.

As a complement to real-world user studies, we explore hypothetical decision questions that people might ask language models, covering a wide set of potential use cases. We use a language model to help generate 70 diverse applications of language models across society (Table 1), compose prompts emulating how people could use the models to make automated decisions in each scenario, and then vary the prompt to change only the demographic information (Figure 1), observing any change in the model’s decision. We perform human validation to ensure these prompts describe plausible, well-constructed decision scenarios.

When analyzing model decisions on these prompts without further intervention, we find that the Claude 2.0 language model exhibits a mix of positive and negative discrimination in select settings, suggesting positive outcomes for certain groups with higher probability, including women, non-binary people, and non-white people, while suggesting them at lower probability for older people (Figures 2 and 3). This effect is smaller but still present when race and gender are provided implicitly through names rather than explicitly stated, and the effect is robust when the prompts are written in a wide range of formats and styles. Importantly, we are able to significantly reduce both positive and negative discrimination through careful prompt engineering, for example, by stating that discrimination is illegal or by asking the language model to think about how to avoid discrimination before deciding (Ganguli et al., 2023).

Overall, we demonstrate a method to assess the broad societal impacts of language models prior to real-world deployment, enabling issues like discrimination to be measured and addressed proactively. The approach provides value to both developers and policymakers in anticipating and mitigating harms before they occur, and we expect these methods and analyses to generalize to a wide range of other societal phenomena.

Generating language model prompts and decisions

To evaluate the potential for language model discrimination across different applications, we generate a diverse set of hypothetical prompts that people could use to query models for automated decision-making. Each prompt instructs the model to make a hypothetical binary decision about a particular person described in the prompt. This strategy enables us to scalably produce prompts covering decision scenarios across different areas of society and the economy, from issuing a work visa to granting press credentials; we validate the quality of these generated prompts in Section 2.1. Below we describe the process by which we generate these prompts and language model decisions. See Figure 1 for a graphical overview and Appendix B for the exact prompts we used to generate our evaluations.

Step 1: Generating Decision Topics

We first generate a diverse set of decision scenarios where language models could be used across society. To do so, we prompted a language model with an initial prompt (Section B.1) asking it to provide examples of decision problems. This initial prompt included a few seed examples, such as financial, legal, professional, and educational decisions. We then iteratively generated more topics by providing the language model’s responses as context and prompting it to identify additional examples (Section B.1). Through this process, we compiled a list of 96 topics spanning areas like healthcare, relationships, entertainment, and more.

Step 2: Generating Template Questions

Next, we generate decision question templates with placeholders for demographic information. We provided the language model with a prompt (Section B.1) specifying the desired structure and content of the templates. Our prompt includes an example template with placeholders for age, race, and gender (e.g., [AGE], [RACE], [GENDER]). While we study these three demographic attributes in this initial investigation, our method can easily be extended to a much wider range of characteristics. The prompt instructs the model to generate a template for a different decision topic that uses these placeholders, ensuring the question is framed as a yes/no decision faced by a third party decision-maker who must use only the information available in the prompt. Critically, we ensure that a yes decision for each question is a positive outcome for the subject of the decision question (e.g., having a loan approved or being promoted). Intuitively, this allows us to measure discrimination in terms of differences in the probability of a yes decision across demographic attributes. We return to this analysis in Section 3.

We manually reviewed all 96 templates and removed any that did not properly follow the prompt instructions. For example, we removed templates that mentioned children or students (to avoid filled templates containing phrases like “a 90-year-old child”), where a yes decision was not a clearly positive outcome for a subject (e.g., “Should the government issue a weather warning?”), or where the template was otherwise not constructed properly. We erred on the side of rejecting borderline templates, resulting in us keeping 70 out of the 96 templates.

Step 3: Filling the Templates

These decision templates enable us to create several versions of the same decision prompt where the only variables that change are the demographics of the subject. We explore two ways of populating the templates with demographic information. The first approach (Explicit) is to insert random combinations of age, race, and gender directly into the [AGE], [RACE], and [GENDER] placeholders (Section B.1). We considered a range of ages, genders, and racial/ethnic categories, including all combinations of nine [AGE] values, [GENDER] $\in[\text{male},\text{female},\text{non-binary}]$, and [RACE] $\in[\text{white},\text{Black},\text{Asian},\text{Hispanic},\text{Native American}]$. These cover three important protected categories in US law; however, this method could be easily expanded to a wider range of descriptors and characteristics. This process results in $9\times 3\times 5\times 70=9450$ individual decision questions we present to the model.
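The Explicit filling procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the example template is hypothetical, and the nine age values are an assumption chosen only so that the count matches the factor of 9 in the text.

```python
from itertools import product

# ASSUMPTION: nine illustrative ages; the text fixes only their count (9).
AGES = list(range(20, 101, 10))
GENDERS = ["male", "female", "non-binary"]
RACES = ["white", "Black", "Asian", "Hispanic", "Native American"]


def fill_template(template: str, age: int, gender: str, race: str) -> str:
    """Substitute the three demographic placeholders in one template."""
    return (
        template.replace("[AGE]", str(age))
        .replace("[GENDER]", gender)
        .replace("[RACE]", race)
    )


def explicit_prompts(templates):
    """Fill every template with every (age, gender, race) combination."""
    return [
        fill_template(t, a, g, r)
        for t, a, g, r in product(templates, AGES, GENDERS, RACES)
    ]


# Hypothetical template, for illustration only.
example = "Should the bank approve a loan for a [AGE]-year-old [RACE] [GENDER] applicant?"
prompts = explicit_prompts([example])
print(len(prompts))       # 135 filled prompts for one template
print(len(prompts) * 70)  # 9450 when repeated over 70 templates
```

Because every combination is enumerated, each template contributes the same number of prompts per demographic value, which keeps the resulting dataset balanced.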

The second approach (Implicit) involves specifying an age, along with a name associated with a particular race and gender (Section B.1). Whereas the former approach enables us to assess discrimination based on explicitly mentioned demographic information, this latter approach enables us to assess discrimination based on more subtle information correlated with race and gender, as has been studied in prior work (Bertrand & Mullainathan, 2004; Sweeney, 2013). Both approaches resulted in fully-written prompts with candidate profiles. See Appendix C for more details and discussion on selection of these names.

Step 4: Generating Decisions

To generate decisions for each filled template, we prompted the language model to provide a yes-or-no answer to the question (Section B.4), following Ganguli et al. (2023). This resulted in binary decisions for each candidate profile. We record the model’s predictive probability of yes based on the token immediately following the prompt. As an additional check, we ensured that our prompts resulted in p(yes) + p(no) $>0.99$ on average across queries to the model. This helps ensure that factors such as surface form competition (Holtzman et al., 2021) do not bias our results.

2.1 Analysis of the generated questions

In Table 1 we provide short descriptions of the 70 decision questions we consider, sorted by high-level category. As the table shows, we consider a range of potential use-cases for language models across the economy and broader society, ranging from high-risk questions such as increasing a person’s credit limit to lower-risk questions such as approving a merchandise return.

Human validation

To validate the quality of the generated decision questions, we conducted a human evaluation study. We collected 405 ratings sampled from the full set of generated questions. (Workers were contracted from Surge AI and paid at least the California minimum wage per hour.) The raters were asked to rate each question’s overall quality (“The decision question is well-written, sensible, and contains no formatting errors”) on a scale of 1 to 5 (Strongly Disagree, Somewhat Disagree, Neither Agree Nor Disagree, Somewhat Agree, Strongly Agree).

Across the generated decision questions, the average score was 4.76, with $98\%$ of templates rated 4 or 5. This human validation provides evidence that our template-based generation method can reliably produce large sets of high-quality, realistic, and diverse decision scenarios for evaluating LMs. Additional methodological details and results are available in Appendix D.

Analysis and Results

In this section, we analyze the decisions made by the language models on the generated prompts, broken down by demographic group.

To estimate the effect of each demographic variable on the model’s decisions, we fit a mixed effects linear regression model. The mixed effects model is trained to predict a logit score associated with the probability of a yes decision (a positive outcome for the subject). Specifically, we compute the normalized probability of “yes” $p_{\text{norm}}(yes)=p(yes)/(p(yes)+p(no))$ and then take the log odds of this quantity: $\text{logit}[p_{\text{norm}}(yes)]=\log({p_{\text{norm}}(yes)/(1-p_{\text{norm}}(yes))})$ .
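As a small worked example of the transformation above, the following Python sketch computes $\text{logit}[p_{\text{norm}}(yes)]$ from a pair of token probabilities. The function name and the input values are illustrative assumptions.

```python
import math


def logit_p_norm_yes(p_yes: float, p_no: float) -> float:
    """Log odds of the normalized probability of a yes decision."""
    p_norm = p_yes / (p_yes + p_no)  # renormalize over the two answers
    return math.log(p_norm / (1.0 - p_norm))


# Illustrative values: p_norm(yes) = 0.6 / 0.9 = 2/3, so the log odds are log(2).
score = logit_p_norm_yes(0.6, 0.3)
print(round(score, 3))  # 0.693
```

Note that after normalization the log odds reduce to $\log(p(yes)/p(no))$, so equal probabilities for the two answers give a score of exactly zero.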

We predict this logit score based on a set of fixed and random effects. The fixed effects are the demographic variables: age (z-scored), gender (encoded as dummy variables), and race (encoded as dummy variables). We chose the baseline (or overall intercept term in the regression) to correspond to a white, 60-year-old male, such that any learned coefficient on an effect below 0 corresponds to negative discrimination relative to a white 60-year-old male, while coefficients above 0 correspond to positive discrimination. We chose the baseline demographic attributes white and male to correspond to historically privileged groups. We baseline against an age of 60 for statistical convenience, since z-scoring age means that 60 years of age corresponds to a z-score of 0, and not due to any historical privilege or disadvantage.
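The fixed-effect encoding described above can be sketched as follows. This is an illustration under stated assumptions: the nine age values are hypothetical (chosen so their mean is 60), the use of the population standard deviation for z-scoring is an assumption, and the column order is arbitrary.

```python
from statistics import mean, pstdev

# ASSUMPTION: nine illustrative ages with mean 60; not taken from the text.
AGES = list(range(20, 101, 10))
GENDER_LEVELS = ["female", "non-binary"]  # baseline level: male
RACE_LEVELS = ["Black", "Asian", "Hispanic", "Native American"]  # baseline: white

AGE_MEAN = mean(AGES)
AGE_STD = pstdev(AGES)  # ASSUMPTION: population standard deviation


def fixed_effect_row(age: int, gender: str, race: str) -> list:
    """One row of the fixed-effects design matrix: intercept, age, dummies."""
    z_age = (age - AGE_MEAN) / AGE_STD
    gender_dummies = [1.0 if gender == g else 0.0 for g in GENDER_LEVELS]
    race_dummies = [1.0 if race == r else 0.0 for r in RACE_LEVELS]
    return [1.0, z_age] + gender_dummies + race_dummies


# The baseline subject activates only the intercept column.
print(fixed_effect_row(60, "male", "white"))
```

With this encoding, each coefficient on a dummy column is read as a shift relative to the baseline subject, which is what makes the sign of a coefficient interpretable as positive or negative discrimination.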

The random effects are the decision question types (encoded as dummy variables), along with interaction terms between each decision question and each demographic variable. Intuitively, the random effects account for variance across question types (e.g., visa decisions vs loan decisions) and how those question types might influence our estimates of the fixed effects (through the interaction terms).

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{Z}\mathbf{u}+\boldsymbol{\varepsilon}$$

Here, $\mathbf{y}$ is a continuous vector of the $\text{logit}[p_{\text{norm}}(yes)]$ scores, $\mathbf{X}$ is the design matrix for the predictors (with columns for intercept, age, gender, and race), $\boldsymbol{\beta}$ is the vector of coefficients for the fixed effects, $\mathbf{Z}$ is the design matrix for the random effects (with columns for the decision questions and decision question–predictor interactions), $\mathbf{u}$ is the vector of random effect coefficients (representing the effects for each decision question), and $\boldsymbol{\varepsilon}$ is the vector of error terms for each observation.

We then fit the models in R to estimate $\mathbf{\boldsymbol{\beta}}$ , $\mathbf{u}$ , and $95\%$ confidence intervals around these terms; see Appendix E for more details.

The mixed effects model enables us to model fixed effects, random effects, and their uncertainty at once. However, given the completeness of our datasets, a more accessible approach for estimating the fixed effects is to compute the differences in average $\text{logit}[p_{\text{norm}}(yes)]$ scores between any demographic and the baseline for each template. Error bars can be estimated based on the standard error of the mean. We confirmed doing this provides very similar estimates (and error bars) to the results from the mixed effects model.
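The simpler estimate described above can be sketched as follows. This is a minimal illustration with toy numbers, not the authors' analysis code; in particular, computing the standard error over per-template differences is an assumption about how the error bars are formed.

```python
from math import sqrt
from statistics import mean, stdev


def discrimination_estimate(logits_by_template, group, baseline):
    """Mean per-template difference in average logit score, with its SEM.

    logits_by_template maps template id -> {group name: [logit scores]}.
    """
    diffs = [
        mean(groups[group]) - mean(groups[baseline])
        for groups in logits_by_template.values()
    ]
    estimate = mean(diffs)
    sem = stdev(diffs) / sqrt(len(diffs))  # ASSUMPTION: SEM across templates
    return estimate, sem


# Toy data for three hypothetical templates.
toy = {
    "loan": {"white": [0.0, 0.2], "Black": [0.5, 0.7]},
    "visa": {"white": [1.0, 1.2], "Black": [1.3, 1.3]},
    "lease": {"white": [-0.4, -0.2], "Black": [0.0, 0.0]},
}
estimate, sem = discrimination_estimate(toy, "Black", "white")
print(round(estimate, 3), round(sem, 3))
```

Differencing within each template before averaging removes template-level variation in the baseline score, which plays a role similar to the per-question random effects in the mixed effects model.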

3.2 Discrimination Score

The values of the fixed and random effect coefficients $\boldsymbol{\beta}$ and $\mathbf{u}$ indicate how much each demographic attribute, decision question, and (demographic attribute, decision question) pair is associated with an increase or decrease in $\text{logit}[p_{\text{norm}}(yes)]$ relative to the baseline of a 60-year-old white male. We refer to the direct estimates of these coefficients as the discrimination score for any demographic attribute. In the case of no discrimination, the discrimination score should be zero. A negative discrimination score corresponds to negative discrimination for a particular demographic group (on average) relative to a 60-year-old white male and vice versa for a positive discrimination score. One might be tempted to simply use $p(yes)$ as the response variable of the mixed effects model instead of $\text{logit}[p_{\text{norm}}(yes)]$, so that the discrimination score could be interpreted intuitively as a change in $p(yes)$ for a particular demographic attribute relative to the baseline. However, $p(yes)$ can sometimes approach 0 or 1, leading to ceiling or floor effects that artificially reduce the amount of measured discrimination (for example, it is hard to see much positive discrimination if the baseline $p(yes)=95\%$). Applying the logit transformation makes the target variable more Gaussian, and mitigates these floor and ceiling effects without the need to filter decision questions based on the average $p(yes)$ (which are also specific to individual models). Note that the discrimination score’s range is $(-\infty,\infty)$. As an interpretation aid, if baseline subjects had an average $p(yes)$ of $0.5$, a discrimination score of +1.0 would correspond to an average $p(yes)$ of $0.73$ for that demographic.
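The interpretation aid above, and the ceiling effect that motivates the logit scale, can be checked with a short Python sketch. The function name is an illustrative assumption; the calculation simply shifts the baseline log odds by the discrimination score and maps the result back to a probability.

```python
import math


def shifted_probability(baseline_p: float, score: float) -> float:
    """p(yes) implied by adding a discrimination score to the baseline logit."""
    baseline_logit = math.log(baseline_p / (1.0 - baseline_p))
    return 1.0 / (1.0 + math.exp(-(baseline_logit + score)))


# A score of +1.0 from a baseline of 0.5 gives roughly 0.73.
print(round(shifted_probability(0.5, 1.0), 2))   # 0.73
# The same score from a baseline of 0.95 moves p(yes) only to about 0.98.
print(round(shifted_probability(0.95, 1.0), 2))  # 0.98
```

The second case shows why raw probability differences understate effects near the ceiling: the same shift in log odds produces a change of about 0.23 at a baseline of 0.5 but only about 0.03 at a baseline of 0.95.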

3.3 Results of evaluations for Claude 2

When used in some of these decision scenarios, we find evidence of positive discrimination (i.e., in favor of genders other than male and races other than white) in the Claude 2 model, while finding negative discrimination against age groups over age 60. For race and non-binary gender, this effect is larger when these demographics are explicitly provided versus inferred from names.

As shown in Figure 2, when these demographics are explicit, the Claude 2 model displays a notable increase in the probability of a favorable decision for the aforementioned racial and gender demographics across decision categories. When these demographics must be inferred from names, but are not written explicitly, the positive discrimination effect is much smaller but statistically significant for all demographics except Black. In both settings, age is provided as a number, and we see negative discrimination.

In Figure 3 we additionally show that these patterns of discrimination largely hold across decision questions, for the Explicit setting. Across these questions, we see that the discrimination score is almost always positive for Black subjects compared to white subjects, and almost always negative or neutral for older subjects compared to younger ones.

These results demonstrate that noteworthy biases still exist in the model for the settings we investigate, without the use of any interventions. In Sections 4 and 5 we demonstrate that this effect is stable across a wide variety of prompt variations, and introduce prompt-based interventions that eliminate the vast majority of these differences.

Analysis of prompt sensitivity

Language model outputs have been found to be sensitive to small variations and ambiguities in their inputs (Lu et al., 2021; Tamkin et al., 2022; Ganguli & Favaro, 2023). To evaluate the robustness of our results, we test how varying the format and style of our prompts affects model decisions. The full set of prompts we use to construct these variations is provided in Appendix B.

Using a language model, we rewrote the original decision templates (Default) into several alternate formats:

We rephrased the scenario in first-person perspective, changing pronouns to “I” and “me” instead of third-person. (First person phrasing)

We rewrote the details as a bulleted list of factual statements written in a formal, detached style. (Formal bulleted list)

We rewrote the information in the question as a list, formatting the key facts as bullets under “Pros” and “Cons” headers. (Pro-con list)

We added emotional language, such as “I really just want to make the right call here” and “This choice is incredibly important.” (Emotional phrasing)

We introduced typos, lowercase letters, and omitted words to make the prompt appear informal and sloppily written. (Sloppy rewrite)

We incorporated subtle coded demographic language, such as “looking for a clean-cut all-American type”. This evaluates our model’s sensitivity to subtle potential indications of discriminatory preferences from users. (Use coded language)

4.2 Results

As can be seen in Figure 4, the results are largely consistent across prompt variations: we still see roughly the same discrimination patterns by the language models in these decision settings. The effect size sometimes varies; for example, Emotional phrasing produces a larger bias, while the more detached Formal bulleted list format has a smaller effect. However, the overall discrimination patterns hold across different ways of posing the decision scenario and question to the language model, demonstrating the robustness of this effect.

Mitigation Strategies to Reduce Discrimination

To provide additional control for users and policymakers (see Section 6.3 for more discussion), we present and evaluate various prompt-based methods for mitigating discrimination. The full prompts for these interventions are provided in Appendix B.

We append various statements to the end of prompts:

Statements saying demographics should not influence the decision, with 1x, 2x, and 4x repetitions of the word “really” in “really important.” (Really (1x) don’t discriminate, Really (2x) don’t discriminate, Really (4x) don’t discriminate)

A statement that affirmative action should not affect the decision. (Don’t use affirmative action)

Statements that any provided demographic information was a technical quirk (Ignore demographics), that protected characteristics cannot legally be considered (Illegal to discriminate), and a combination of both (Illegal + Ignore).

5.2 Requesting the model verbalize its reasoning process to avoid discrimination

We also insert requests asking the model to verbalize its reasoning process (Kojima et al., 2022) while taking into account various fairness constraints. This follows past work (Ganguli et al., 2023). We insert various requests, paraphrased below. Again, full prompts are available in Appendix B.

A request to think out loud about how to avoid bias and stereotyping in the model’s response. (Precog basic)

A request to think out loud about how to avoid bias and avoid positive preference for members of historically disadvantaged groups. (Precog self-knowledge)

As a control, a request to make the decision in an unbiased way (without a request to think out loud). (Be unbiased)

5.3 Results

As shown in Figure 5, several of the interventions we explore are quite effective, especially Illegal to discriminate, Ignore demographics, and Illegal + Ignore. Many of these interventions significantly reduce the discrimination score, often approaching 0. Other interventions appear to reduce the discrimination score by a more moderate amount.

These results demonstrate that positive and negative discrimination on the questions we consider can be significantly reduced, and in some cases removed altogether, by a set of prompt-based interventions.

5.4 Do the interventions distort the model’s decisions?

While the success of these interventions at reducing positive and negative discrimination is notable, an important remaining question is whether they make the decisions of the model less useful. For example, a simple way to reduce discrimination is to output the exact same prediction for every input. In this work, we study hypothetical decision questions that are subjective, and do not have ground-truth answers. However, we can still measure how much the responses of the model change when an intervention is applied.

Concretely, we compute the Pearson correlation coefficient between the decisions before and after the intervention is applied. In Figure 6, we show a scatter plot comparing this correlation coefficient and the average discrimination across demographic groups (age, Black, Asian, Hispanic, Native American, female, and non-binary). We see that a wide range of interventions produce small amounts of discrimination while maintaining very high correlation with the original decisions. Notably, the Illegal to discriminate and Ignore demographics interventions (Section 5.4) appear to achieve a good tradeoff between low discrimination score ( $\approx 0.15$ ) and high correlation with the original decisions ( $\approx 92\%$ ).
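The correlation check can be sketched in plain Python as follows. This is a minimal illustration, assuming the "decisions" being compared are per-prompt $p(yes)$ values; the numbers are toy data. It also shows the degenerate case mentioned above, where an intervention that outputs the same prediction for every input has no defined correlation with the original decisions.

```python
from math import isnan, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant outputs
    return cov / sqrt(var_x * var_y)


# Toy p(yes) values before and after a hypothetical intervention.
before = [0.10, 0.40, 0.60, 0.90]
after = [0.15, 0.35, 0.65, 0.85]
constant = [0.50, 0.50, 0.50, 0.50]

print(round(pearson(before, after), 3))
print(isnan(pearson(before, constant)))  # True
```

A high correlation alongside a low discrimination score suggests the intervention removed demographic sensitivity while leaving the ordering of decisions across scenarios largely intact.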

Discussion

We discuss a few additional considerations and questions raised by our findings:

An inherent challenge of evaluations is external validity: ensuring the conclusions of a research study generalize to real-world settings (Andrade, 2018). External validity has been discussed both in machine learning contexts (Liao et al., 2021, 2022) and in audit studies (Lahey & Beasley, 2018). Our work makes several attempts to improve the external validity of our evaluations, including generating a wide variety of decision questions covering 70 different possible applications across society, exploring a wide range of rewrites or phrasings of such questions as a robustness check, and performing human validation of generated templates. The breadth of these evaluations better ensures that our conclusions capture how people could use language models in the real world.

Nevertheless, there are several limitations of our evaluation. First, we primarily evaluate a model’s judgment on paragraph-long descriptions of candidates. However, in the real world, people might use a wider variety of input formats, including longer, supplementary documents such as resumes or medical records, or might engage in more interactive dialogues with models (Jakesch et al., 2023; Li et al., 2023a), each of which has the potential to affect our conclusions.

Second, while we consider race, gender, and age in our analysis, we do not consider a range of other important characteristics including veteran status, income, health status, or religion, though we believe our methods can easily be extended to incorporate more characteristics.

Third, while we performed human-validation on the generated evaluations, generating the evaluations with a language model in the first place may bias the scope of applications that are considered.

Fourth, choosing names in audit studies that are associated with different demographics can be challenging (Gaddis, 2017; Crabtree & Chykina, 2018); while we make a best effort in this study, more work is necessary, as well as an investigation into other sources of proxy discrimination.

Fifth, we consider only the language model’s decisions themselves, as opposed to their impact on user decisions in a human-in-the-loop setting, as in Albright (2019).

Sixth, we consider only the relationship between individual demographic characteristics and the model’s decisions, rather than intersectional effects between, e.g., race and gender (Crenshaw, 2013; Buolamwini & Gebru, 2018). Future work could extend the methodology we explore here by introducing interaction terms into the mixed effects model.

Finally, the sensitivity of models to small changes in prompts is an unsolved problem (Lu et al., 2021; Ganguli & Favaro, 2023), and despite our efforts it remains possible that variations in the construction of our prompts (including the phrasing of the prompts, the examples we included in the prompt, or the instructions we provided) could alter the conclusions of our analyses.

6.2 Should models be used for the applications we study?

The use of language models or other automated systems for high-stakes decision-making is a complex and much-debated question. While we hope our methods and results assist in evaluating different models, we do not believe that performing well on our evaluations is sufficient grounds to warrant the use of models in the high-risk applications we describe here, nor should our investigation of these applications be read as an endorsement of them. For example, as we discuss in Section 6.1, there are several important aspects of real-world use that are not fully covered by our evaluations. In addition, models interact with people and institutions in complex ways, including via automation bias (Skitka et al., 1999; Goddard et al., 2012), meaning that while AI-aided decision-making can have positive effects (Keding & Meissner, 2021), placing humans in an advisory in-the-loop capacity is not by itself a sufficient guardrail. Instead, we expect that a sociotechnical lens (Carayon et al., 2015) will be necessary to ensure beneficial outcomes for these technologies, including both policies within individual firms as well as the broader policy and regulatory environment. Finally, discrimination and other fairness criteria are not the only important factors when deciding whether or not to deploy a model; models must be evaluated in naturalistic settings to ensure they perform satisfactorily at the tasks they are applied to (Raji et al., 2022; Sanchez et al., 2023). The appropriate use of models for high-stakes decisions is a question that governments and societies as a whole should influence (and such uses are indeed already subject to existing anti-discrimination laws), rather than one left solely to individual firms or actors.

6.3 How should positive discrimination be addressed?

A natural question raised by our work is under what circumstances (and to what degree) positive discrimination should be corrected for, given arguments raised by proponents and critics of affirmative action policies and compensatory justice (Fullinwider, 2018; Dwork et al., 2012; Eidelson, 2015). These debates are ongoing, and there exist a diversity of policies (and attitudes towards these policies) on a global basis. Rather than resolve these debates ourselves, in this paper our goal is to provide tools for different stakeholders, including companies, governments, and non-profit institutions, to better understand and control AI systems. Towards this end, we develop tools to enable measurement of discrimination that may exist across the range of scenarios we consider (Section 2), as well as provide a dial to control the extent of this discrimination through prompting-based mitigations (Section 5).

6.4 Where does this behavior come from?

Given the observed patterns in Section 3, another natural question is where these patterns come from. Unfortunately, questions like this are difficult to answer, given the complex interplay of training data and algorithms, along with the cost of training large models to disentangle these factors. However, we can speculate on some potential causes. The first possibility is that the human raters who provided the human feedback data for model training may have somewhat different preferences than the median voter in the United States. This may lead to raters providing higher ratings to certain outputs, swaying the model’s final behavior. Second, it is possible that the model has overgeneralized during the reinforcement learning process to prompts that were collected to counteract racism or sexism towards certain groups, causing the model instead to have a more favorable opinion in general towards those groups. These questions represent active areas of research and we hope that our methods will enable further investigations to provide more clarity on these issues.

Related Work

Discrimination refers to unjust treatment of certain groups based on protected characteristics like race or gender (Eidelson, 2015). Audit or correspondence studies are often described as the “gold standard” for assessing discrimination in the wild; in such studies, decision-makers, such as potential employers, judge applicants whose profiles are identical except for a protected characteristic such as race or gender (Jowell & Prescott-Clarke, 1970; Gaddis, 2018). Differences in acceptance rates across such characteristics are then considered evidence of discrimination, and such studies have uncovered discrimination in fields as diverse as hiring, housing, and lending (Cain, 1996; Riach & Rich, 2002; Bertrand & Mullainathan, 2004; Pager, 2007; Bertrand & Duflo, 2017). Our work complements these studies by using language models to generate the correspondence studies (across a wide range of hypothetical scenarios) and evaluating machine learning models as subjects in the correspondence studies (in order to quantify discrimination in LMs).

Algorithmic discrimination

A wide range of works have investigated discrimination in algorithmic systems, including through audit or correspondence studies. For example, Sweeney (2013) studies discrimination in online ad delivery, finding that ads suggesting an arrest record are more likely to be shown with searches of Black-associated names. A range of other investigations have further studied algorithmic discrimination, including in other ad delivery settings (Datta et al., 2015; Ali et al., 2019; Imana et al., 2021), mortgage approvals (Martinez & Kirchner, 2021), resume screening (Dastin, 2018), recidivism prediction (Skeem & Lowenkamp, 2016), hiring (Kirk et al., 2021; Bommasani et al., 2022a; Veldanda et al., 2023), and medical treatment (Obermeyer et al., 2019). Such investigations have studied how discrimination can happen directly, based on demographic variables, as well as indirectly through proxies for those variables, such as zip code, extracurricular activities, or other features (VanderWeele & Robinson, 2014; Datta et al., 2017; Kilbertus et al., 2017; Adler et al., 2018). In response, researchers and activists have developed a range of theoretical frameworks (Dwork et al., 2012; Kusner et al., 2017; Ustun et al., 2019) as well as investigative practices (Raji & Buolamwini, 2019; Metaxa et al., 2021; Vecchione et al., 2021; Bandy, 2021), to detect and counteract such discrimination. Notably, Creel & Hellman (2022) introduce the concept of outcome homogenization, where the widespread use of an automated decision-making system can expand the arbitrary biases of a single system into systematized disenfranchisement for certain groups.

In the context of LMs, Schick et al. (2021) note that language models can recognize toxicity in their own outputs, and Si et al. (2022) demonstrate that the right prompts can reduce bias in language models on the BBQ benchmark (Parrish et al., 2021). Building on these works, Ganguli et al. (2023) investigate a range of biases in LMs, including discrimination in law school course admissions, finding that language models exhibit negative discrimination against protected groups, which can be largely eliminated through prompting-based interventions similar to those we explore in our work. These studies have occurred in the context of a large amount of work on bias and fairness in language models, deeper discussion of which can be found in Bender et al. (2021); Bommasani et al. (2022b); Gallegos et al. (2023); Li et al. (2023b); Solaiman et al. (2023). Our investigation contributes to this body of work by conducting a wide-ranging study of language model discrimination across 70 diverse applications and identifying interventions that can reduce both positive and negative discrimination across applications.

Model-generated evaluations

Finally, a range of recent works explore how LMs can assist in scalably generating diverse evaluations for LMs. For example, LMs have been used to generate red-teaming attacks for LMs (Perez et al., 2022) as well as generate critiques for LM outputs (Saunders et al., 2022). Most relevant to our work, Perez et al. (2023) use LMs to generate a wide array of evaluations for an LM in order to uncover concerning behaviors. We adapt and extend this method to study the potential for LM discrimination by generating LM prompts covering a wide array of use-cases, rewriting them in different styles, and inserting various different demographic groups. As language model outputs can be flawed, we cross-check these outputs with human evaluation.

Conclusions

In summary, our work draws on a rich foundation of techniques across machine learning and the social sciences to proactively assess and mitigate the risk of language model discrimination. By combining model-generated evaluations with human validation, we conduct a wide-ranging study of language model discrimination, with methods and mitigation strategies we hope will be of interest to policymakers and third-party stakeholders. Looking forward, we anticipate that variants on our technique will be helpful for measuring sensitivity to a range of other characteristics, including other demographic attributes, writing styles or patterns of language use, and mentions of various topics or issues. More broadly, as language models continue to rapidly advance, we hope these methods assist in the crucial task of developing better evaluations for societal impacts of these systems and anticipating and mitigating any risks before harms occur.

Acknowledgments

We thank Kathleen Creel, Cathy Dinas, Elizabeth Edwards-Apell, Danny Hernandez, Everett Katigbak, Nathaniel Smith, Janel Thamkul, and Drake Thomas for helpful feedback and comments on drafts.

References

Appendix A Author Contributions

Alex Tamkin wrote the paper and led the overall project, including the framing, design, implementation, and analysis of the experiments. Esin Durmus provided feedback on experiments and drafts of the paper. Liane Lovitt provided guidance throughout the project, especially framing and writing. Amanda Askell came up with the initial idea of the model-generated substitution evaluations of discrimination used in the project, contributed many of the prompts for the template generation and interventions, and provided guidance throughout the project. Deep Ganguli contributed the mixed effects model analysis and implementation and provided detailed guidance throughout all stages of the project, including the initial framing, design of the experiments, analysis, and feedback on drafts. All other listed authors contributed to the development of otherwise-unpublished models, infrastructure, or contributions that made our work possible.

Appendix B Prompts

Here we provide the prompts used to generate the evaluations and model decisions above. These prompts are also available on our Hugging Face page for ease of use: https://huggingface.co/datasets/Anthropic/discrim-eval

B.2 Prompts for generating variations of the decision questions for prompt sensitivity

B.3 Prompts for Interventions

Below we share the prompts for our interventions to reduce discrimination. These are prompt fragments that are added to the end of the decision question, prior to passing them to the language model for making a decision.

B.4 Prompts and mitigations in the decision-making step

Below we share the prompts used for making decisions in the decision-making step.

Appendix C Filling templates with names

In this section, we briefly describe how we chose names associated with different races and genders. The choice of such names is a well-studied question in the correspondence study literature (Gaddis, 2017; Crabtree & Chykina, 2018). In this work, we attempt to tackle a narrower version of this problem by exploring a subset of names that are associated with certain demographic groups, rather than attempting to capture a representative set of names from each demographic group.

We generate names by sampling from a list of first names and last names and concatenating them together:

To collect our list of last names, we collect 25 last names from each race/ethnicity category by choosing the top 25 names from white, Black, Hispanic, and Asian categories as measured by the US Census and collated on https://namecensus.com/last-names/. For Native American names, we collected 25 last names from https://www.familyeducation.com/baby-names/surname/origin/native-american, because the most common Native American surnames overlapped to a large degree with the surnames of other racial/ethnic groups.

To collect our list of first names, we ask a language model to generate a list of 10 names for each race and gender pair. We conducted a human study on $N=198$ filled templates, and found that the generated full names were largely associated with the correct race/ethnicity and gender. Specifically, we asked raters: “Based solely on the name provided above and without any other context, what race or ethnicity do you think most people would most closely associate with that name?” and the equivalent prompt substituting in gender for race or ethnicity. Note that our use of pronouns also indicates the gender of the user. Table 2 shows our results, indicating high accuracy for all genders except non-binary and all races except Black. We suspect these failure cases are due to the ambiguity of non-binary names such as “Ocean” (and perhaps people’s unfamiliarity with them), and due to some more ambiguous generated Black names such as “Jada Williams” and “Naomi Green.”

The lower accuracy for non-binary and Black names suggests that the relatively lower Discrimination Scores for those demographics in Figure 2 may be in part due to our name generation pipeline generating less stereotypical names than for the other attributes. That said, the general pattern appears to hold that discrimination is lower across all racial demographics for the Implicit fill method. Future work could attempt to construct a more representative set of names from a given target population, as well as a wider range of methods for exploring proxy discrimination.
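The sampling-and-concatenation step described in this appendix can be illustrated with a minimal Python sketch. It is not the code used in the study: the name lists below are hypothetical placeholders, the function name `sample_full_name` is invented for illustration, and uniform sampling is an assumption. The only details taken from the text are that first names are collected per race and gender pair, that last names are collected per race/ethnicity category, and that the two are concatenated.

```python
import random

# Hypothetical placeholder lists; the study's actual name lists are not reproduced here.
# First names are keyed by (race/ethnicity, gender) pair, as described in the text.
FIRST_NAMES = {
    ("group_a", "female"): ["FirstA1", "FirstA2"],
    ("group_a", "male"): ["FirstA3", "FirstA4"],
    ("group_b", "female"): ["FirstB1", "FirstB2"],
}

# Last names are keyed by race/ethnicity category only.
LAST_NAMES = {
    "group_a": ["LastA1", "LastA2"],
    "group_b": ["LastB1", "LastB2"],
}


def sample_full_name(race, gender, rng):
    """Sample one first name and one last name, then concatenate them.

    Uniform sampling is an assumption made for this sketch.
    """
    first = rng.choice(FIRST_NAMES[(race, gender)])
    last = rng.choice(LAST_NAMES[race])
    return f"{first} {last}"


# Seeded generator so that the example is reproducible.
rng = random.Random(0)
name = sample_full_name("group_a", "female", rng)
```

The resulting string would then be substituted into a decision question template in place of an explicit demographic description.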

Appendix D Human validation

An important component of our template generation process is that “yes” should be a good outcome for the subject of the decision. To validate this, we asked raters to rate how much they agreed with the following statement, for each question: A “Yes” response is a more positive outcome for the subject of the decision than a “No” response. The rating scale is again a scale of 1 to 5 (Strongly Disagree, Somewhat Disagree, Neither Agree Nor Disagree, Somewhat Agree, Strongly Agree). Overall, the average score was 4.83, with 97.5% of scores either a 4 or 5. This again indicates that the model is able to generate strong templates.
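The two summary statistics reported above (the mean rating and the share of ratings that are a 4 or 5) can be computed as in the following minimal sketch. The function name `summarize_ratings` and the example ratings are hypothetical and chosen only for illustration; they are not the study's data.

```python
from statistics import mean


def summarize_ratings(ratings):
    """Return (mean rating, share of ratings that are 4 or 5) on a 1-5 scale."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    average = mean(ratings)
    # 4 = Somewhat Agree, 5 = Strongly Agree
    share_agree = sum(r >= 4 for r in ratings) / len(ratings)
    return average, share_agree


# Made-up example ratings, not data from the study.
example_ratings = [5, 5, 4, 3, 5, 5, 4, 5]
average, share_agree = summarize_ratings(example_ratings)
```

Multiplying `share_agree` by 100 gives the percentage of ratings that express agreement.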

D.2 Additional Methodological details

We gathered 405 ratings from the Default decision question type, across both Explicit (demographics) and Implicit (names) fill types. These ratings were gathered for 135 filled questions, created from 29 different decision question templates. Three raters rated each question. Raters were contracted through Surge and paid at least the California minimum wage.