Are Emily and Greg Still More Employable than Lakisha and Jamal? Investigating Algorithmic Hiring Bias in the Era of ChatGPT

Akshaj Kumar Veldanda, Fabian Grob, Shailja Thakur, Hammond Pearce, Benjamin Tan, Ramesh Karri, Siddharth Garg

Introduction

Large Language Models (LLMs) trained on vast datasets have shown promise in generalizing to a wide range of tasks and have been deployed in applications such as automated content creation Liu et al. (2021), text translation Brown et al. (2020), and software programming Sobania et al. (2022). Future applications extend to finance, e-commerce, healthcare, human resources (HR), and beyond.

This study focuses on LLMs in algorithmic hiring, i.e., automated tools that assist HR professionals in hiring decisions. Over 98% of leading companies use some automation in their hiring processes Hu (2019). There is growing interest in the use of LLMs to assist in a range of algorithmic hiring tasks.

While automated systems offer efficiency gains, they raise bias and discrimination concerns.

A 2018 report suggested that an AI-based hiring tool was biased against women, keying on gendered terms (e.g., "executed" or "women’s") in resumes Dastin (2022). Recognizing such risks, governments are beginning to address bias and discrimination in hiring practices through legislation. For example, the European Parliament has approved the EU AI Act, which identifies AI-based hiring tools as high-risk Hupont et al. (2023), and New York City passed a law to regulate AI systems used in hiring decisions Lohr (2023). That law, effective July 2023, requires companies to notify candidates when an automated system is used and to have their AI systems independently audited for bias.

This raises the question of how such audits can be conducted. There is a long body of work investigating bias in conventional hiring practices via randomized field experiments, starting with Bertrand & Mullainathan (2003). In their study, Bertrand & Mullainathan (2003) responded to job postings with edited resumes that differed only in the gender and race of the applicant, using stereotypically White and African-American male and female names as proxies. Employer responses were then analyzed to infer bias on these sensitive attributes. In this work, our key contributions are as follows.

A method for evaluating bias in LLM-enabled algorithmic hiring on legally prohibited or normatively unacceptable demographics, i.e., gender, race, maternity/paternity leave, pregnancy status, and political affiliation. The method extends to other attributes.

The first comprehensive evaluation of three state-of-the-art LLMs in two algorithmic hiring tasks, to classify full-text resumes into job categories, and to summarize resumes and then classify summaries into job categories.

Key results, including instances of a statistically significant Equal Opportunity Gap when using LLMs to classify resumes into job categories, particularly when pregnancy status or political affiliation is mentioned. We also show that sensitive attribute flags are retained in up to 94% of LLM-generated resume summaries, but that LLM-based classification of resume summaries exhibits less bias compared to full-text classification.

Method and Experimental Design

Here we describe our method, shown in Figure 1, beginning with the resume corpus we built for our study.

Prior work that conducted field experiments on hiring bias has typically not released its resume datasets. Hence, we began with a recently released public dataset of 2484 resumes spanning 24 job categories scraped from livecareer.com Bhawal (2021), which was anonymized by removing all personally identifying information such as names, addresses, and e-mails. However, due to rate limits for state-of-the-art LLM APIs, it was infeasible to exhaustively evaluate resumes from all 24 categories, especially because adding demographic information results in more than a ten-fold increase in the total number of resumes that need to be evaluated.

Therefore, we restrict ourselves to a subset of the raw dataset covering three of the 24 categories: Information-Technology (IT), Teacher, and Construction. These categories were selected because of their distinct gender characteristics based on labor force statistics in the 2022 Current Population Survey U.S. Bureau of Labor Statistics (2022). Women accounted for only 4.2% of workers in construction and extraction occupations, and conversely accounted for 73.3% of the education, training, and library occupations workforce. Computer and mathematical occupations fell in between, with approximately 26.7% female workers. This yielded a "raw" resume corpus containing 334 resumes ((1) in Figure 1). We manually inspected a sample of the resumes to ensure they matched their ground-truth job categories and had relevant information, such as experience and educational qualifications.

2 Adding Sensitive Attributes

The subset we chose of the raw resume dataset does not have demographic information. We use the approach of Bertrand & Mullainathan (2003) to intervene on race and gender, yielding "Baseline" resumes labeled (2) in Figure 1. We then intervene on three other factors: (i) maternity or paternity-based employment gaps, (ii) pregnancy status, and (iii) political affiliation. Adding these attributes yields "Flagged" resumes, labeled (3) in Figure 1. Next, we describe how we incorporate this information in raw resumes and the basis for each choice.

Since job applicants often prefer not to reveal race, we use Bertrand & Mullainathan (2003)’s approach of adding stereotypically ’White’ (W) or ’African American’ (AA) names to each resume, using the same names identified in their work (see Appendix A for the actual names used). For each racial group, we create one version with a stereotypically female name and one with a stereotypically male name, yielding four versions of each resume: White female (WF), African American female (AAF), White male (WM), and African American male (AAM). We also add appropriate pronouns (she/her or he/his), since this is common practice today. Finally, we embed email addresses into each resume to emulate genuine resumes. This augmentation step turns the 334 raw resumes into 1336 "Baseline" resumes, labeled (2) in Figure 1.
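The augmentation step above can be sketched as follows. This is a minimal illustration, not the authors' released code: the first names are the four from the paper's title, the last names are a subset of Appendix A, and the header layout and e-mail format are assumptions made only for this example.

```python
from itertools import product

# First names are the four from the paper's title; last names are taken from
# Appendix A. The study's full name pool is larger than this illustration.
FIRST_NAMES = {
    ("W", "F"): "Emily",
    ("W", "M"): "Greg",
    ("AA", "F"): "Lakisha",
    ("AA", "M"): "Jamal",
}
LAST_NAMES = {"W": "Baker", "AA": "Jackson"}
PRONOUNS = {"F": "she/her", "M": "he/his"}


def make_baseline_versions(raw_resume: str) -> dict[str, str]:
    """Return the four Baseline versions (WF, WM, AAF, AAM) of one raw resume."""
    versions = {}
    for race, gender in product(("W", "AA"), ("F", "M")):
        first = FIRST_NAMES[(race, gender)]
        last = LAST_NAMES[race]
        # Hypothetical e-mail format; the paper does not specify one.
        email = f"{first.lower()}.{last.lower()}@example.com"
        # Hypothetical header layout: name, pronouns, then e-mail.
        header = f"{first} {last} ({PRONOUNS[gender]})\n{email}\n"
        versions[race + gender] = header + raw_resume
    return versions


versions = make_baseline_versions("Experience: 5 years as a software engineer.")
print(sorted(versions))
```

Applied to all 334 raw resumes, four versions per resume gives the 1336 Baseline resumes reported above.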

Adding employment gap flag.

Prior work has suggested that employers discriminate based on maternity (or paternity) gaps Waldfogel (1998); Hideg et al. (2018), or infer family status from this information. Anecdotally, women have been advised to include this information on resumes Jurcisinova (2022a). We include maternity/paternity leave for female/male applicants by adding to the resume: "For the past two years, I have been on an extended period of maternity/paternity leave to care for my two children until they are old enough to begin attending nursery school." This text is consistent with the advice available on internet job advice forums Jurcisinova (2022b).

Adding pregnancy status flag.

Hiring discrimination on the basis of pregnancy status is forbidden by law in several jurisdictions, for example, under the Pregnancy Discrimination Act in the United States Commission (1978). Although it is atypical for women to report pregnancy status on resumes, this intervention "stress-tests" the fairness of LLMs on the basis of legally or morally protected categories. Additionally, in practice, algorithmic hiring might include information gleaned from sources other than applicant resumes, which could be included in the prompt. To denote the pregnancy status of the applicant, we include the phrase "Please note that I am currently pregnant" at the end of the resume for female candidates.

Adding political affiliation flag.

Political affiliation is legally protected against discrimination in some jurisdictions Mateo-Harris (2016). Although this information is atypical in resumes, it could be gleaned in algorithmic hiring from the applicants’ social media and serves as a second stress-test to interrogate bias in LLMs. To indicate political affiliation, we include a statement such as "I am proud to actively support the Democratic/Republican Party through my volunteer work."
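The three flag interventions can be sketched as a single helper. The flag sentences are quoted from the text above; the function name, the flag keys, and the choice to append every flag at the end of the resume are assumptions for illustration (the paper states the placement only for the pregnancy flag).

```python
EMPLOYMENT_GAP = (
    "For the past two years, I have been on an extended period of "
    "{leave} leave to care for my two children until they are old enough "
    "to begin attending nursery school."
)
PREGNANCY = "Please note that I am currently pregnant"
POLITICAL = (
    "I am proud to actively support the {party} Party through my volunteer work."
)


def add_flag(resume: str, flag: str, gender: str = "F",
             party: str = "Democratic") -> str:
    """Append one sensitive-attribute flag to a Baseline resume."""
    if flag == "employment_gap":
        leave = "maternity" if gender == "F" else "paternity"
        text = EMPLOYMENT_GAP.format(leave=leave)
    elif flag == "pregnancy":
        # The study adds this flag for female candidates only.
        if gender != "F":
            raise ValueError("pregnancy flag applies to female candidates only")
        text = PREGNANCY
    elif flag == "political":
        text = POLITICAL.format(party=party)
    else:
        raise ValueError(f"unknown flag: {flag}")
    # Assumption: every flag is appended at the end of the resume.
    return resume.rstrip() + "\n\n" + text


flagged = add_flag("Emily Baker (she/her)\nExperience: ...", "pregnancy")
print(flagged)
```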

3 Algorithmic Hiring Tasks

We evaluate two algorithmic hiring tasks from the literature: (i) resume classification Javed et al. (2015) and (ii) resume summarization Bondielli & Marcelloni (2021) (followed by classification).

For each job category, we pose a binary classification problem to the LLM to identify whether a resume belongs to that job category or not. We then evaluate the accuracy, true positive and true negative rates using ground-truth labels from our dataset.
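As a minimal sketch of the metrics named above, the helper below computes accuracy, true positive rate (TPR), and true negative rate (TNR) from binary ground-truth labels and binary LLM decisions. The function name and the example labels are hypothetical.

```python
def classification_rates(y_true, y_pred):
    """Return (accuracy, TPR, TNR) for binary labels given as 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # TPR: fraction of resumes in the category that the model accepts.
    tpr = tp / (tp + fn) if (tp + fn) else None
    # TNR: fraction of resumes outside the category that the model rejects.
    tnr = tn / (tn + fp) if (tn + fp) else None
    return accuracy, tpr, tnr


# Hypothetical example: 1 = "belongs to the IT category", 0 = "does not".
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
print(classification_rates(y_true, y_pred))
```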

For consistency, we employ a standardized prompt for all LLMs throughout the study. We set the temperature of all LLMs to 0 to remove variability in LLM outputs. This yielded high baseline accuracy on the three LLMs we tested (see Section 3.1), establishing the soundness and practicality of the evaluation method.

Summarizing resumes

In addition to direct classification, prior work has proposed resume summarization to reduce the burden on HR professionals Bondielli & Marcelloni (2021). As before, we keep the prompt consistent across all LLMs and evaluate with zero temperature. Our prompt is:

4 LLMs Evaluated

We evaluate bias in three state-of-the-art black-box LLMs: (1) GPT-3.5 Turbo from OpenAI Brown et al. (2020) (gpt-3.5-turbo); (2) Bard (PaLM-2) by Google Anil et al. (2023) (chat-bison-001); (3) Claude by Anthropic (Claude-v1, Claude-v2 in Appendix). These LLMs are API accessible and are similar to the LLMs used in their respective chat interfaces. We also evaluate two white-box LLMs: (1) Alpaca-7B, a fine-tuned 7B LLaMA model Touvron et al. (2023a); (2) Llama-2 chat models (7B and 13B versions) from Meta Touvron et al. (2023b) (data in Appendix). These LLMs all have a token limit of more than 4096, except Alpaca, which has a smaller 512-token limit. White-box LLMs enable further interrogation of the cause of bias.

5 Evaluating Bias

In this paper, we evaluate fairness via the Equal Opportunity Gap (EOG), a commonly used mathematical notion of fairness Hardt et al. (2016) that measures the difference in True Positive Rates (TPR) between two groups. We analyze five pairwise differences on the basis of (1) race (White vs. African-American), (2) gender (men vs. women), (3) maternity leave gap (with flag vs. without), (4) pregnancy status (pregnant vs. not), (5) political affiliation (Democrat vs. Republican). For each comparison, we identify if the TPR gap is greater than 15%, and perform hypothesis tests to determine if the differences between the pairs are statistically significant. Since we are analyzing categorical data, we conduct Fisher exact tests Fisher (1992) and use p ≤ 0.05 for statistical significance.
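The evaluation described above can be sketched as below, assuming each group is summarized by how many of its in-category resumes the LLM accepted. The two-sided Fisher exact test is written in pure Python for self-containment and uses the common convention of summing all tables no more probable than the observed one; the paper does not state which implementation or sidedness it used, and the counts in the example are hypothetical.

```python
from math import comb


def equal_opportunity_gap(tp_a, pos_a, tp_b, pos_b):
    """TPR of group A minus TPR of group B."""
    return tp_a / pos_a - tp_b / pos_b


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts: 50 in-category resumes per group.
tp_unflagged, tp_flagged, n = 45, 30, 50
gap = equal_opportunity_gap(tp_unflagged, n, tp_flagged, n)
p = fisher_exact_two_sided(tp_unflagged, n - tp_unflagged,
                           tp_flagged, n - tp_flagged)
print(f"TPR gap = {gap:.2f}, exceeds 15%: {abs(gap) > 0.15}, "
      f"significant: {p <= 0.05}")
```

Rows of the table are the two groups and columns are accepted versus rejected, so the test asks whether acceptance is independent of group membership.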

Experimental Results

Next, we describe our findings on the three black-box LLMs and on the white-box Alpaca model.

We begin by demonstrating that all models exhibit acceptable overall performance. Bard demonstrates the highest accuracy (F1-score) of 94.39% (0.9145), surpassing other models. GPT-3.5 closely follows with an accuracy of 93.55% (0.9059). In contrast, Claude exhibits noticeably lower but still usable performance, with an accuracy of 68.16% and an F1-score of 0.6599. The TPRs for resume classification across sensitive attributes and job categories are plotted for GPT-3.5 (Figure 2(a)), Bard (Figure 2(b)), and Claude (Figure 2(c)). We make several observations from the data.

Perhaps surprisingly, we find no statistically significant TPR Gaps between White and African American resumes or between male and female resumes. From public statements, it is known that these LLMs have been sanitized to mitigate bias, and it would stand to reason that this has been performed at least on the most ‘obvious’ sensitive attributes like race and gender.

Large bias on flag attributes

We find a large bias on the three flagged sensitive attributes, especially on Claude. Claude has a statistically significant bias against women with maternity-based employment gaps, and against pregnant women. Further, Claude is biased on political affiliation, with bias in favor of Democrats. In most instances, the TPR Gap exceeds the 15% threshold and frequently exceeds 30%.

GPT-3.5 demonstrates bias only on political affiliation (favoring Democrats) for teaching roles, with a TPR Gap of 30%. Bard is the fairest LLM, with remarkably consistent performance across all sensitive attributes. This shows that bias is not a fait accompli; LLMs can be trained to withstand bias on attributes that are infrequently tested against. That said, Bard could be biased along sensitive attributes that were not in this study. We refer the reader to Figures 7 and 8 in the Appendix for data on paternity leave.

Similar results on TNR Gaps

For completeness, in Appendix Figure 6, we evaluate bias on true negative rates (TNR), which are used (with TPR Gaps) to evaluate equal odds. As before, we observe that all three LLMs are fair on race and gender, and Claude is biased on maternity, pregnancy, and political affiliation. Additionally, GPT-3.5 is biased on the same three attributes. Qualitatively, these results are similar to the TPR Gap results. We do not report TNR Gaps in the main text and refer the reader to the Appendix.

2 LLM-Generated Summaries

Table 1 reports the percentage of times LLM-generated summaries contain sensitive attributes. We find something interesting: in many instances, Bard does not provide a summary and outputs an error message: "Content was blocked, but the reason is uncategorized." Similarly, in some instances, Claude does not provide an output at all. Table 1 therefore also reports the percentage of instances in which an output was generated. We report key takeaways.

Over all job categories, GPT-3.5 summaries mention pregnancy status and political affiliation less than 12.75% of the time. Employment gaps are reported between 22.5% and 64.71% of the time.

Bard frequently refuses to summarize.

Unlike GPT-3.5, which summarized (almost) every resume, Bard provides a summary for about 54% to 90% of resumes. When Bard provides a summary, it is more likely to mention political affiliation and pregnancy status compared to GPT-3.5 but less likely to mention employment gaps. However, a fairer comparison between the two should also account for the instances when Bard blocks information. This data (the product of the two numbers in Table 1) is shown in the Appendix Table 9. Although Bard is more likely to mention sensitive information, the difference between Bard and GPT-3.5 is less stark when normalized over all requests.
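The normalization mentioned above is a simple product of two rates. The sketch below illustrates it with hypothetical numbers; the function name and both input values are assumptions and do not come from Table 1.

```python
def normalized_mention_rate(summary_rate, mention_rate_given_summary):
    """Fraction of ALL requests whose output mentions the sensitive attribute.

    summary_rate: fraction of requests for which a summary was generated.
    mention_rate_given_summary: fraction of generated summaries that mention
        the sensitive attribute.
    """
    return summary_rate * mention_rate_given_summary


# Hypothetical values: a model that answers 54% of requests and mentions the
# flag in half of its summaries exposes it in 27% of all requests.
print(normalized_mention_rate(0.54, 0.50))
```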

Claude is most likely to include sensitive information across the board.

Claude mentions sensitive information more frequently overall than the other two models. The starkest difference is for pregnancy status, which is mentioned in 80% to 94.12% of the summaries generated. Claude does block some responses, although infrequently enough that it does not change our key conclusions.

3 Classifying LLM-generated Summaries

Figure 3 plots TPRs for classification on resume summaries. In each instance, we used the same LLM for classification as the one used to generate summaries. TPRs are computed only over the subset of summaries that were actually generated.

Note that Figure 3 has only one instance of a statistically significant TPR Gap: Claude, for Teacher roles based on pregnancy status. Interestingly, this is contrary to our prior observations: Claude summaries frequently mention sensitive attributes, and Claude is highly biased when classifying entire resumes. Therefore, the reduced bias on summaries is surprising given that sensitive attributes are making their way to summaries. The same is true of GPT-3.5, which is also less biased on summaries than on entire resumes.

We hypothesize that this is perhaps because summaries make it easier for a model to attend to relevant information. We confirm this by evaluating classification bias only on the subset of summaries that actually contain sensitive attribute flags (see Appendix Figure 9), and find little evidence of bias. Unfortunately, further investigation is hindered by the black-box nature of these LLMs.

4 Analysis Using Alpaca

The black-box nature of the state-of-the-art LLMs we evaluated hinders a deeper examination of the causes of bias in the models. We therefore performed additional experiments on Alpaca, a smaller but white-box LLM. Because of its smaller token limit, we could not run experiments with entire resumes; instead, we evaluate Alpaca with GPT-3.5-generated summaries. However, because GPT-3.5 summaries often remove sensitive attribute flags, we used GPT-3.5 to first summarize baseline resumes and then added sensitive attribute flags to the generated summaries.

In Figure 4, we present TPRs obtained by classifying summaries using Alpaca. Reflecting the results from larger models, we note that Alpaca shows statistically significant differences for maternity leave gaps, pregnancy status, and political affiliation.

Explaining bias using contrastive input decoding.

Contrastive input decoding (CID) is a recent method to interrogate bias in LLMs Yona et al. (2023) that replaces decoding strategies like beam search with a strategy that seeks to explain the difference between a pair of prompts. Given two prompts, CID picks the next token whose probability is maximally different across the prompts. In other words, CID generates sequences that are likely given one input but unlikely given another.
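The token-selection idea described above can be illustrated with a toy sketch. It assumes the two next-token distributions are already available as dictionaries and scores each token by the plain difference in probability; the actual scoring rule in Yona et al. (2023) may differ, and the distributions below are hypothetical.

```python
def contrastive_next_token(p_input, p_contrast):
    """Pick the token most favored by the input prompt relative to the
    contrastive prompt, given two next-token probability distributions."""
    vocab = sorted(set(p_input) | set(p_contrast))  # sorted for determinism
    return max(vocab,
               key=lambda tok: p_input.get(tok, 0.0) - p_contrast.get(tok, 0.0))


# Hypothetical next-token distributions after "...is not suitable because".
# p_flagged: resume with a sensitive-attribute flag; p_baseline: without it.
p_flagged = {"experience": 0.50, "pregnant": 0.30, "skills": 0.20}
p_baseline = {"experience": 0.55, "pregnant": 0.01, "skills": 0.44}

# Greedy decoding on the flagged prompt alone would pick "experience";
# the contrastive choice surfaces what the flag changed.
print(max(p_flagged, key=p_flagged.get))
print(contrastive_next_token(p_flagged, p_baseline))
```

Repeating this choice token by token yields a continuation that is likely under one prompt but unlikely under the other, which is what makes the method useful for surfacing flag-specific rejection reasons.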

We perform a qualitative analysis using CID to explain biases in Alpaca using two prompts:

Input prompt: You are evaluating the following resume: for job position. The resume is not suitable because

Contrastive prompt: You are evaluating the following resume: for job position. The resume is not suitable because

Using CID for maternity leave, some responses offered the following reason for rejection: "Including personal information about maternity leave is not relevant to the job and could be seen as a liability." For pregnancy status, CID rejected candidates because "She is pregnant" or "Because of her pregnancy." Finally, CID analysis indicated that certain candidates were unsuitable because, ‘The candidate is a member of the Republican party, which may be a conflict of interest for some employers.’ It is important to note that CID only sometimes offers these reasons, potentially because it picks one of the potentially many reasons for rejection. Nonetheless, these results suggest that CID could be an effective tool to analyze bias even on larger models, given white-box access.

Discussion and Limitations

Our study only examined cis-gendered individuals for two racial groups. Although we did not find evidence of bias on these attributes, there could be bias for other racial groups and for transgender, non-binary, and other individuals. Further studies are necessary to investigate these biases, especially as these groups are also historically marginalized.

Results on other sensitive attributes.

Besides employment (maternity) gaps, pregnancy status, and political affiliation, there are other attributes, such as disability status, sexual orientation, and age that may have some legal protection against hiring discrimination. Some of these may be more easily discernable on resumes and merit further study.

Culturally- and geographically-aware categories.

Our study is largely in the American context in terms of the names we use, racial groups, and legal protections. These can vary by culture and geography. In India, for example, caste discrimination is a serious concern and is prohibited by law. Thus, we acknowledge that our results are valid within a limited context.

Statistical significance of results.

We used statistical testing to more concretely support our observations of bias (or lack thereof). Although we note that prior work, including the pioneering work of Buolamwini & Gebru (2018), does not always use statistical significance to ascertain bias, one might observe significant differences by chance over a large number of experiments. To mitigate this concern, we picked experimental settings in advance, i.e., job categories, LLMs, fairness metrics, and sensitive attribute flags. Further, prompt engineering was performed only to maximize overall accuracy and not based on pre-evaluations of bias. All our code and data are publicly released.

Implications for AI-based hiring.

Mindful of these limitations, our study suggests limited bias on the basis of race and gender across state-of-the-art LLMs in this context. This is despite previous demonstrations of biased LLM outputs on toy tasks in social media; e.g., writing an algorithm to identify a "good" programmer based on race and gender. This suggests that bias on toy tasks may not translate to real-world tasks like resume evaluations. Further, the unexplained unwillingness of Bard to generate summaries when sensitive attribute flags are in resumes suggests that models might have been heavily sanitized to the point of being sometimes unusable. Finally, the observation of reduced bias on resume summaries might have practical consequences for real-world algorithmic hiring.

Related Work

A body of work on AI-assisted hiring exists. Sayfullina et al. (2017) and Javed et al. (2015) have explored the use of conventional ML methods to classify and profile resumes. Others focused on matching job descriptions with resumes Zaroor et al. (2017); Bian et al. (2020), but not job categories. Some studies investigated the use of LLMs, either to infer job titles through skills Decorte et al. (2021) or to evaluate job candidates during a virtual interview Carțiș & Suciu (2020). However, none of them investigate bias as we do.

A body of work starting with the seminal work of Buolamwini & Gebru (2018) has exposed gender and racial discrimination in commercial face recognition systems and in image search results Metaxa et al. (2021). Prior studies in natural language processing identified gender biases Bolukbasi et al. (2016); Nangia et al. (2020); Vig et al. (2020), religious bias Abid et al. (2021) and ethnic bias Ahn & Oh (2021). However, these studies have not been performed in the LLM context and do not look at algorithmic hiring.

Shifting the focus to bias in hiring systems, notable research by Bertrand & Mullainathan (2003) provides valuable insights into biases in traditional hiring. However, there is limited work on bias in AI-assisted hiring, especially using LLMs. Raghavan et al. (2020) did a qualitative survey of algorithmic hiring practices in industry, but do not perform a quantitative or statistical analysis with specific AI tools as we do. A recent paper uses LLMs to generate resumes given names and gender and perform simple context association tasks using LLMs; however, these tasks are only peripherally (if at all) related to real-world tasks in algorithmic hiring.

Conclusion

We proposed a method to study the biases of state-of-the-art commercial LLMs for two key tasks in algorithmic hiring: matching resumes to job categories Javed et al. (2015) and summarizing employment-relevant information from resumes Bondielli & Marcelloni (2021). Building on gold-standard methodology for identifying hiring bias in manual hiring processes, we evaluated GPT-3.5, Bard, and Claude for bias on the basis of race, gender, maternity-related employment gaps, pregnancy status, and political affiliation. We did not find evidence of bias on race and gender but found that Claude in particular (and GPT-3.5 to a lesser extent) were biased on the other sensitive attributes. We find similar results on the resume summarization task; surprisingly, we find greater bias on full resume classification versus classification on summaries. Future work involves a more inclusive set of sensitive attributes.

References

Appendix A Name Pool

The White last names used to create baseline resumes are ‘Baker’, ‘Kelly’, ‘McCarthy’, ‘Murphy’, ‘Murray’, ‘O’Brien’, ‘Ryan’, ‘Sullivan’, ‘Walsh’.

The African American last names used to create baseline resumes are ‘Jackson’, ‘Jones’, ‘Robinson’, ‘Washington’, ‘Williams’.

Appendix B Alpaca Additional Results