Sociodemographic Bias in Language Models: A Survey and Forward Path

Vipul Gupta, Pranav Narayanan Venkit, Shomir Wilson, Rebecca J. Passonneau

Introduction

Recent years have seen rapid progress in natural language processing capabilities due to advances in language modeling. Large language models have demonstrated impressive performance on many tasks Raffel et al. (2020); Zhong et al. (2020); Yang et al. (2019). However, alongside these successes, recent work indicates that Deep Neural Networks (DNNs) often learn unintended shortcuts and biases Mehrabi et al. (2021); Wen et al. (2022); España-Bonet and Barrón-Cedeño (2022); Gupta et al. (2022b); Hutchinson and Mitchell (2019). Additional studies discuss ways that learned models can potentially have harmful social impacts when deployed in real-world settings Rudin (2019); Schwartz et al. (2021); Bender et al. (2021). In this paper, we survey recent work on bias in NLP to identify distinct research strands, and to develop a deeper understanding of bias versus harm.

We present a survey of 214 works on sociodemographic bias in NLP. We categorize this literature into three major strands of investigation: types of bias, quantifying bias, and debiasing techniques. This categorization emerged from a comprehensive consideration of a large body of work on different aspects of bias, and helps organize the papers included in our survey, facilitating a clearer understanding of similarities and differences among approaches. We highlight current trends in quantifying bias and in debiasing strategies, while also placing these efforts in the context of previous work.

In machine learning, bias and fairness are closely related concepts, and are often used interchangeably. In this work, we consider investigations of bias and fairness in NLP to fall under the same general umbrella. Recent research has proposed a variety of metrics to quantify bias in NLP models Dixon et al. (2018); Garg et al. (2019); Gaut et al. (2020); Dev et al. (2021). However, many bias metrics do not relate to bias that occurs in real-world applications Goldfarb-Tarrant et al. (2021). Moreover, current approaches to quantify bias are not robust and have reliability issues Seshadri et al. (2022); Du et al. (2021). In this work, we propose several criteria for developing reliable bias metrics.

Additional works have explored techniques for mitigating bias in language models Dixon et al. (2018); Ahn and Oh (2021); Lauscher et al. (2021); Bartl et al. (2020). Although bias is learned from training data, different DNNs can amplify training data bias in different ways. We concur with the findings of Gonen and Goldberg (2019) that current debiasing techniques are relatively superficial and often hide rather than remove underlying bias. Additionally, recent debiasing methods rely primarily on finetuning approaches, which may not be effective with large language models (LLMs). To enhance the effectiveness of bias mitigation, we recommend that future work should focus more on debiasing methods applied during training.

One of the confounding factors regarding investigation of bias in NLP is that bias is not always harmful. After a brief section on other surveys of bias, we present a perspective on bias grounded in ideas from psychology and behavioral economics, and point out that humans have evolved to rely on bias as an efficient way of navigating a complex world. However, unintended bias that can be discovered to produce different model behavior for different social groups can have harmful effects when deployed in real-world settings Blodgett et al. (2020).

In sum, our survey makes three main contributions: a way to understand research on sociodemographic bias in NLP in terms of three categories, an overview of datasets used in the quantification of bias, and insights into the strengths of current trends alongside the identification of the main issues with bias metrics and mitigation strategies. We conclude with recommendations for future work.

Related Work

There have been several recent surveys of bias in NLP, each with different emphases. One recent survey focuses mainly on metrics to quantify social bias Czarnowska et al. (2021). The authors unify the reviewed metrics under three generalized fairness metrics and provide recommendations for which metric to choose. Stanczak and Augenstein (2021) provide an extensive survey on gender bias; however, their survey omits discussion of the various techniques employed for quantifying bias and for debiasing language models (LMs). Bansal (2022) describes how various bias metrics differ, but draws no strong conclusions about criteria that bias metrics should meet. Another bias survey concludes that the cause of all bias in LMs lies in the biased training datasets Garrido-Muñoz et al. (2021). However, this work does not discuss flaws in prevailing bias quantification approaches. Notably, Devinney et al. (2022) survey 176 works regarding gender bias in NLP, and find that conceptualizations of gender bias lack specificity in most existing literature. An earlier survey highlights significant variation in motivations across bias studies Blodgett et al. (2020). The diversity of definitions of bias and of goals of bias research reinforces the need for a unified categorization and review.

Our survey aims for a systematic organization and critique of the NLP literature on bias. Building on this analysis, we summarize the current issues in bias research and offer recommendations to inspire future work toward more rigorous bias measurement and mitigation.

Understanding and Defining Bias

The Nobel Prize-winning psychologist and behavioral economist Daniel Kahneman argues that human understanding depends on implicit biases about common category features Kahneman (2011). For example, the sentence “a large mouse climbed over a small elephant” immediately calls to mind a mouse that, while large relative to other mice, is tiny relative to the elephant, which we know to be one of the largest mammals on earth. Our implicit bias about the relative sizes of mice and elephants is immediate. Extrapolating Kahneman’s argument to NLP, for a proper understanding of the real world, language models could potentially benefit from biases about certain categories, such as mice versus elephants. Thus, not all biases learned during training by language models will necessarily introduce harm.

Furthermore, Kahneman (2011) defines bias as “the tendency to make systematic errors in judgments or decisions based on factors that are irrelevant or immaterial to the task at hand” and cautions that human judgment is susceptible to bias from irrelevant factors. Applying this insight to sociodemographic bias in NLP, we need to understand the impact it might have in real-world settings. Crawford (2017) and Barocas et al. (2017) examine representational harm and allotted harm in NLP. Representational harm is defined as the harm that arises when a system represents some social groups in a less favorable light than others. Allotted harm is defined as the harm that arises when a system allocates resources or opportunities unfairly to a social group Shahbazi et al. (2023). Most works on bias in NLP look at bias through the lens of allotted harm. Blodgett et al. (2020) break down harm into allocational, stereotypical and other representational harm. Blodgett (2021) breaks down harm into quality of service, stereotyping, denigration and stigmatization, alienation, and public participation. Dev et al. (2022) created a framework that can be used to capture aspects of a bias measure (dataset, metric(s), motivations) that align with various forms of harm.

In the past decade, much work on LMs has focused on bigger models for better performance. A.M. Turing Award winner Yoshua Bengio states that the “bigger is better” mentality needs to change, because current LMs “make stupid mistakes” Bengio (2019). This is consistent with findings from recent work on sociodemographic bias in LMs Luo et al. (2023); Zhuo et al. (2023); Li et al. (2020). In summary, bias is neither good nor bad in itself. Some kinds of bias might be useful for building high-performance LMs, while other types of bias can potentially be harmful through a negative impact on society.

Most definitions of bias in NLP are framed in terms of potential for harm. Two highly cited definitions of NLP model bias are from Dixon et al. (2018) and Hardt et al. (2016). Dixon et al. (2018) argue that “models are not intended to discriminate between genders. If a model does so, we call that unintended bias.” Hardt et al. (2016) argue that “a model contains unintended bias if it performs better for some demographic groups than others.”

We propose the following definition of sociodemographic bias: “A model is biased if it does not perform consistently across all demographic groups.” Such bias has the potential for harm in a real-world setting. Our definition is applicable to prominent demographic distinctions such as gender identity (male, female, non-binary), income-based groupings (e.g., low, middle, and high income), or other broad-coverage distinctions that are learnable by LMs. For example, associating “Caucasian man” with “handsome”, and “African-American man” with “angry”, is a clear indication of bias in models Garimella et al. (2021). In occupation-related tasks, associating “receptionist” with “she”, and “philosopher” with “he”, can have harmful effects in real-world settings Bolukbasi et al. (2016). Henceforth, for brevity, we use the term bias to mean sociodemographic bias.

Categories of Works on Bias in NLP

As discussed above, the main focus of our survey is on sociodemographic bias, although a few of the papers included here fall outside that scope. Two strategies were used to identify candidate papers: 1) using the keywords "bias" and "fairness," we searched for recent papers in the ACL Anthology, NeurIPS proceedings, and the ACM’s Conference on Fairness, Accountability and Transparency (FAccT); 2) we included papers from citation graphs for retrieved papers. We included papers only if they addressed language modeling, thus omitting papers on speech, where different issues arise. These criteria reduced an initial large set of 250 papers down to the 214 in Appendix A.

To achieve a broad-brush organization, we identified three major branches of investigation: (1) types of bias, (2) quantifying bias, and (3) debiasing techniques. Figure 1 illustrates the upper levels of our hierarchical classification of papers, where the three types of leaves, T, Q or D, are used to tag individual papers. Types of bias can be further categorized into gender, race, ethnicity, and other types of bias. Work on quantifying bias can be further classified into four approaches, based on distance in vector space (Q1), performance on test data (Q2), model prompting (Q3) or probes (Q4). Our review of the literature on quantifying bias reveals significant reliability issues with current bias measurement techniques. Based on this analysis, we propose several criteria necessary for developing reliable bias metrics. In the absence of metrics satisfying these criteria, it is difficult to make strong claims about how well bias metrics and debiasing methods work. Works on debiasing differ regarding application during fine-tuning (D1), training (D2) or inference (D3).

In the following subsections, we discuss many works from our hierarchical classification; a complete list of works can be found in Appendix A.

The NLP bias that has the greatest potential for harm in real-world settings is sociodemographic bias, where model performance differs by sociodemographic group Smith et al. (2022). Sociodemographic bias includes gender bias, when models are biased against a particular gender De-Arteaga et al. (2019); Park et al. (2018); Du et al. (2021); Bartl et al. (2020); Webster et al. (2021); Tan and Celis (2019); racial bias, when models are biased against certain races Nadeem et al. (2021); Garimella et al. (2021); Nangia et al. (2020); Tan and Celis (2019); ethnic bias, when models are partial towards certain ethnicities Ahn and Oh (2021); Garg et al. (2018); Li et al. (2020); Abid et al. (2021); Manzini et al. (2019); Venkit et al. (2023a); age bias Nangia et al. (2020); Diaz et al. (2018); and sexual-orientation bias Nangia et al. (2020); Cao and Daumé III (2020).

Gender bias is the most widely studied bias type in NLP. It can be measured directly in source data, such as co-occurrence of gender and occupational or other identity categories Liu et al. (2021); De-Arteaga et al. (2019). If a model is trained on a dataset that contains gender-specific pronouns, it may learn to associate certain professions with a particular gender, even if such associations are not accurate or fair. Bias can be identified and mitigated by carefully examining the model’s output and selecting or generating more balanced training data.

In some cases, bias arises from language usage that can trigger inferences about gender or other demographic categories Lauscher et al. (2020). This type of bias results from subtle cultural or societal bias reflected in training data. For example, certain words or phrases may be more commonly associated with a particular race or ethnicity, and a model trained on such data may inadvertently perpetuate these biases. Liu et al. (2021) show that model performance sometimes reflects the demographic attributes of the authors of training documents. Venkit et al. (2022) measure implicit disability bias in word embeddings and LMs, showing how it causes a model to associate negative and toxic words with a specific sociodemographic group. Karimi Mahabadi et al. (2020) discuss how models learn to associate certain language styles with particular sociodemographic groups, resulting in unfair and biased decisions. De-Arteaga et al. (2019) show that text classifiers can learn to associate demographic information with certain content, leading to unfair decisions toward certain groups.

Our analysis revealed that previous studies primarily concentrated on gender bias, whereas recent works have started examining more bias categories. We observed that certain bias evaluation and mitigation methods are tailored to address specific types of bias and may not readily apply to other bias types. We encourage future works to specifically outline how their methodologies can be extended and adapted to accommodate different bias categories.

2 Quantifying Bias

Measurement of bias is challenging because it arises from aggregate behavior and is therefore sensitive to the samples that model behavior is measured on. However, quantifying bias is a precondition to addressing or mitigating bias that might be harmful. One of the key questions that emerge from existing methods to measure bias is whether these measurements are reliable. Here we review different methods of measuring bias in NLP and how they differ from each other.

Initial work in quantifying bias focused mainly on analogical distance in embedding space. These approaches define a set of target words, such as engineer and nurse, along with a set of attributes, such as male versus female, to quantify bias. One of the first is the Word Embedding Association Test (WEAT) score Caliskan et al. (2017), which sums the cosine similarity between the target word and the attribute word for each class. The WEAT score is the difference between the sums for the two classes. Dev and Phillips (2019) proposed the Embedding Coherence Test (ECT), where an attribute class, e.g., female, is represented as the average of embeddings of all attribute words such as she, women, and girl. Bias is quantified by the difference in cosine similarity between this attribute representation and the target word embedding. Intuitively, larger divergence in these similarities indicates greater incongruence between the target word and attribute class prototypes in the embedded space. Bolukbasi et al. (2016) computed the relative difference between the cosine similarity of gendered words (e.g., man/woman) and gender-neutral words (e.g., doctor) within the embedding space. A greater relative difference in these similarities indicates greater bias in the associations between the gendered and neutral words. Ethayarajh et al. (2019) introduced RIPA, using the inner product instead of cosine similarity so that both the magnitude and the angle of vectors contribute to the bias measurement.
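The association score described for WEAT can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vectors are toy two-dimensional values rather than trained embeddings, the helper names (`cosine`, `association`, `weat_statistic`) are hypothetical, and the per-word association uses the mean similarity to each attribute set, as in the commonly cited formulation of the test statistic. The effect size and permutation-based significance test that usually accompany the score are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(word, attrs_a, attrs_b):
    """Mean similarity of `word` to attribute set A minus attribute set B."""
    mean_a = sum(cosine(word, a) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(word, b) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b

def weat_statistic(targets_x, targets_y, attrs_a, attrs_b):
    """Difference between the summed associations of the two target sets."""
    sum_x = sum(association(x, attrs_a, attrs_b) for x in targets_x)
    sum_y = sum(association(y, attrs_a, attrs_b) for y in targets_y)
    return sum_x - sum_y

# Toy vectors (assumed for illustration only, not real embeddings).
male_attrs = [[1.0, 0.0]]
female_attrs = [[0.0, 1.0]]
engineer = [[0.9, 0.1]]
nurse = [[0.1, 0.9]]

score = weat_statistic(engineer, nurse, male_attrs, female_attrs)
print(f"WEAT-style statistic: {score:.3f}")
```

A positive value here means the first target set sits closer to the first attribute set than the second target set does; swapping the two target sets flips the sign.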

Guo and Caliskan (2021) and Tan and Celis (2019) extended WEAT to contextualized word embeddings, which are dynamic word representations generated by LMs based on surrounding words in a sentence. May et al. (2019) proposed SEAT by extending WEAT to the sentence level. Other metrics based on location in n-dimensional embedding space use clustering or other neighborhood methods. Chaloner and Maldonado (2019) proposed a method to discover new bias categories by clustering word embeddings. Bordia and Bowman (2019) quantified bias based on co-occurrence of neighborhoods of words. They hypothesized that words occurring in close proximity to a particular gender in the training data are prone to be more biased towards that gender during testing.

Approaches based on distance in vector space have certain limitations. Goldfarb-Tarrant et al. (2021) argue that intrinsic measures based on measurements in embedding space do not necessarily relate to real-world bias. If intrinsic metrics show similar bias between models, but testing the models in real-world situations reveals unequal performance, the real-world results expose the most worrying biases. This is because the paramount priority is finding and reducing biases that could reach real users when models are deployed in products. Another limitation of works in Q1 is that the values produced by these methods are not reliable, as they vary significantly if the sets of target words and attribute words are modified Du et al. (2021); Antoniak and Mimno (2021).

The approaches in Q1 require access to hidden layers of the models to quantify bias. However, the growing trend of larger model sizes poses challenges in determining the appropriate layer for quantifying bias. Moreover, access to these layers may be limited due to the lack of open-source availability of LMs, further complicating the process.

2.2 Performance metrics - Q2

A second class of metrics relies on differences in model performance on test data. They generally split the testing dataset into two parts based on sociodemographic groups. De-Arteaga et al. (2019) quantified gender bias as the difference between the true positive rates when male names and pronouns are swapped for female names and pronouns. Dixon et al. (2018) and Zhao et al. (2018a) took similar approaches, using area under the curve (AUC) and false positive rate to quantify bias Dixon et al. (2018), or relative accuracy Zhao et al. (2018a). Zhang et al. (2022) and Huang et al. (2020) generated augmented datasets to measure bias as the difference in accuracy between the original and augmented datasets, irrespective of type of bias. Stanovsky et al. (2019) proposed a metric based on differences in accuracy across genders for machine translation. Approaches in Q2 utilize final predictions made by the models and thus are not restricted to open-source models unlike Q1.
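A performance-gap metric of this kind can be sketched as below. This is a generic illustration rather than the exact setup of any cited paper: the records are hypothetical classifier outputs, the function names are our own, and the gap is computed for a single task rather than per occupation or per dataset.

```python
def true_positive_rate(labels, predictions):
    """Fraction of positive examples (label 1) that are predicted as positive."""
    positives = [p for y, p in zip(labels, predictions) if y == 1]
    if not positives:
        raise ValueError("no positive examples in this group")
    return sum(positives) / len(positives)

def tpr_gap(records, group_a, group_b):
    """TPR of group_a minus TPR of group_b over (group, label, prediction) records."""
    def rate(group):
        rows = [(y, p) for g, y, p in records if g == group]
        labels = [y for y, _ in rows]
        preds = [p for _, p in rows]
        return true_positive_rate(labels, preds)
    return rate(group_a) - rate(group_b)

# Hypothetical classifier outputs: (group, gold label, predicted label).
records = [
    ("male", 1, 1), ("male", 1, 1), ("male", 1, 0), ("male", 0, 0),
    ("female", 1, 1), ("female", 1, 0), ("female", 1, 0), ("female", 0, 1),
]
gap = tpr_gap(records, "male", "female")
print(f"TPR gap (male - female): {gap:.3f}")
```

Because only final predictions are needed, the same computation applies to a model that is available only through an API.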

2.3 Prompt-based metrics - Q3

Here we review methods that prompt models using a range of prompt-generation methods, with equally varied metrics based on different performance measures.

In these approaches, models are prompted for bias through a set of pre-defined templates, or patterns, that capture specific types of bias or stereotypes. The templates contain slots that are filled through selection from a set of pre-defined demographic target terms during evaluation. For instance, a template could be "A is walking", where the template's slot is systematically substituted with names associated with different demographic groups. By contrasting model outputs across the varied slot fillers, the presence and degree of bias can be measured.

Prabhakaran et al. (2019) generated templates for toxicity detection, and proposed metrics based on average difference, standard deviation and range of model performance for different target groups. Kiritchenko and Mohammad (2018) used 11 templates to create a dataset of 8,640 English sentences to measure gender and race bias. Smith et al. (2022) proposed a metric based on 450,000 unique sentence prompts. Webster et al. (2021) defined fourteen templates to determine gender identity bias. Ribeiro et al. (2020) created diverse templates to represent general linguistic capabilities, combined with a tool to generate test cases at scale. Parrish et al. (2022) measure nine types of demographic bias on a question-answering dataset. They generate more than 25 different templates for each bias category. In contrast to performance-based metrics, which divide the dataset into two parts as discussed in Q2, these approaches increase the size of the bias-testing dataset significantly and therefore perform a more exhaustive examination of model behavior. Presumably, by relying on large-scale testing, the results achieved are more robust than for the papers in Q2.
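The template-filling procedure and the spread statistics mentioned above can be sketched as follows. Everything named here is an assumption made for illustration: the templates, the slot fillers, and `toy_score`, which stands in for a call to the model under test. "Average difference" is interpreted as the mean absolute pairwise gap between per-group means, which may differ from the definition used in the cited work.

```python
import statistics
from itertools import combinations

def fill_templates(templates, terms):
    """Map each term to the sentences obtained by filling every template with it."""
    return {term: [t.format(slot=term) for t in templates] for term in terms}

def toy_score(sentence):
    """Stand-in for a model score; a real study would query the model under test."""
    # Hypothetical behavior: pretend the model scores one filler differently.
    return 0.8 if "Alex" in sentence else 0.5

def group_means(filled, score_fn):
    """Mean model score per slot filler."""
    return {term: statistics.mean(score_fn(s) for s in sentences)
            for term, sentences in filled.items()}

def summarize(means):
    """Spread of per-group means: mean pairwise gap, standard deviation, range."""
    values = list(means.values())
    gaps = [abs(a - b) for a, b in combinations(values, 2)]
    return {
        "avg_diff": statistics.mean(gaps),
        "std": statistics.pstdev(values),
        "range": max(values) - min(values),
    }

# Hypothetical templates and slot fillers.
templates = ["{slot} is walking.", "I talked to {slot} yesterday."]
terms = ["Alex", "Sam", "Kim"]

filled = fill_templates(templates, terms)
summary = summarize(group_means(filled, toy_score))
print(summary)
```

Adding templates or slot fillers grows the test set multiplicatively, which is how these approaches reach very large numbers of prompts.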

Several works aim to make template-based approaches more rigorous. They test the models using attributes of individuals that should have no effect on the probability of model outputs. These attributes are normally referred to as protected attributes. Specifically, "a decision is fair towards an individual if it is the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group."

These approaches generate counterfactuals by changing protected attributes in the testing examples. They give a better understanding of which attributes influence model behavior Garg et al. (2019); Kusner et al. (2017). Huang et al. (2020) created counterfactuals on the testing dataset and quantified sentiment bias as the Wasserstein-1 distance between the original and counterfactual datasets. The Wasserstein-1 distance measures the minimum cost of transforming one probability distribution into another, which captures differences in both probabilities. They showed that generative LLMs like GPT-2 Radford et al. (2019) tend to generate continuations with more positive sentiment for “baker”, and more negative sentiment for “accountant”, as the occupation. Gardner et al. (2020) created contrast sets by generating counterfactuals for ten NLP datasets and showed that model performance drops significantly on counterfactuals. Liang et al. (2022) perturb existing test examples by substituting terms related to a particular demographic group with terms for other groups.
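For intuition, the Wasserstein-1 distance between two sets of sentiment scores can be computed as in the sketch below. It assumes two one-dimensional samples of equal size, in which case the distance reduces to the mean absolute difference between sorted values; the scores are hypothetical and the function name is our own. It is not the evaluation code of the cited work.

```python
def wasserstein_1(sample_a, sample_b):
    """Wasserstein-1 distance between two equal-size one-dimensional samples.

    For equal-size empirical distributions on the real line, the optimal
    transport plan pairs the sorted values, so the distance is the mean
    absolute difference between order statistics.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)

# Hypothetical sentiment scores in [0, 1] for original and counterfactual inputs.
original_scores = [0.70, 0.65, 0.80, 0.75]
counterfactual_scores = [0.55, 0.60, 0.70, 0.50]

distance = wasserstein_1(original_scores, counterfactual_scores)
print(f"Wasserstein-1 distance: {distance:.4f}")
```

A distance of zero means the two score distributions coincide; larger values mean the counterfactual substitution shifted the sentiment distribution further.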

Another approach to bias measurement is to mask words in sentences, and use the probabilities of models’ predictions. Kurita et al. (2019) applied this approach to occupation terms, retrieving model predictions for masked nouns in sentences such as “[MASK] is a programmer.” Differences in normalized probabilities assigned to gendered words are used to measure occupational gender biases. These kinds of prompts are often referred to as cloze prompts. Similarly, Ahn and Oh (2021) quantified bias as the variance of normalized probabilities across various demographic groups. Bartl et al. (2020) used models’ prediction of masked tokens to measure bias.
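The cloze-style measurements above can be sketched with hypothetical probabilities. The sketch assumes one common normalization, in which the probability of a gendered word in the full sentence is divided by its probability under a prior template where the occupation is also masked; the numbers, function names, and the choice of population variance are assumptions for illustration, and a real study would read the probabilities from a masked LM.

```python
import math
import statistics

def normalized_log_prob(p_target, p_prior):
    """Log ratio of a word's probability with and without the occupation visible."""
    return math.log(p_target / p_prior)

def gender_gap(probs, word_a="he", word_b="she"):
    """Difference in normalized log probability between two gendered words."""
    score_a = normalized_log_prob(*probs[word_a])
    score_b = normalized_log_prob(*probs[word_b])
    return score_a - score_b

def variance_across_groups(probs):
    """Population variance of normalized log probabilities over all groups."""
    scores = [normalized_log_prob(t, p) for t, p in probs.values()]
    return statistics.pvariance(scores)

# Hypothetical (p_target, p_prior) pairs for the masked position in
# "[MASK] is a programmer." and in a prior template with the occupation masked.
probs = {"he": (0.60, 0.40), "she": (0.20, 0.40)}

gap = gender_gap(probs)
variance = variance_across_groups(probs)
print(f"gap: {gap:.3f}, variance: {variance:.3f}")
```

The pairwise gap suits a binary contrast, while the variance extends naturally to more than two demographic groups.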

Recently, there has been an increasing emphasis on template-based approaches for quantifying bias Smith et al. (2022); Parrish et al. (2021); Li et al. (2020). The advantage of these approaches is that, by examining model predictions, they provide greater relevance to real-world harms than analyzing internal parameters alone, as in Q1. Additionally, they share the same strengths as Q2, as they are not limited to open-source models or restricted by model size. However, a drawback of these approaches lies in their author bias, as templates are manually designed by the authors. This author bias makes the methodology heavily dependent on template selection and thus very susceptible to slight modifications to templates Seshadri et al. (2022); Alnegheimish et al. (2022); Selvam et al. (2023).

2.4 Probing metrics - Q4

Approaches based on probing either add a classification layer to a pre-trained LM or use a set of probes to test the inner workings of LMs. Mendelson and Belinkov (2021) trained a probing classifier on the latent representation space of LMs, and used it to predict specific linguistic properties to quantify bias. They focused mainly on measuring negative word bias and lexical overlap bias in pre-trained LMs. Dev et al. (2020) probed model bias using natural language inference datasets by measuring whether swapping lexical items for different sociodemographic groups changes entailment relations between sentence pairs. Li et al. (2020) used differences in the probability of answers from question-answering models by measuring the impact of changes in questions’ subject words.

These approaches face limitations similar to those discussed in Q1, as they are constrained to models with accessible hidden layers. Furthermore, applying these approaches to massive language models with billions of parameters poses significant computational and methodological challenges due to the expansive scale of the underlying layers.

2.5 Criteria for reliable metrics

As our foregoing analysis indicates, prevailing approaches for quantifying bias in NLP models face reliability issues Seshadri et al. (2022). We propose the following criteria for developing reliable bias measurement techniques: (1) insensitivity to minor perturbations in evaluation templates and target sets; (2) low variance across repeated measurements of the same model; and (3) ability to generalize to different kinds of bias. We encourage future efforts in bias quantification to prioritize advancing reliable metrics based on these criteria rather than solely introducing new metrics.
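One way criteria (1) and (2) might be checked in practice is sketched below. The sketch assumes that a bias score has already been computed for the same model under several template paraphrases (or several repeated runs); the scores, the function names, and the tolerance threshold are all hypothetical choices made for illustration, not part of the proposed criteria.

```python
import statistics

def sensitivity_report(scores_by_variant):
    """Summarize how much a bias score moves across template or run variants."""
    values = list(scores_by_variant.values())
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "max_shift": max(values) - min(values),
    }

def is_stable(report, tolerance):
    """Treat a metric as stable if its largest shift stays within `tolerance`."""
    return report["max_shift"] <= tolerance

# Hypothetical bias scores for one model under paraphrased templates.
template_scores = {"original": 0.21, "paraphrase_1": 0.19, "paraphrase_2": 0.35}

report = sensitivity_report(template_scores)
print(report, is_stable(report, tolerance=0.05))
```

In this toy case a single paraphrase moves the score far more than the tolerance allows, which is the kind of fragility criterion (1) is meant to rule out.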

3 Debiasing

Debiasing aims to reduce the impact of certain biases present in the data used to train LMs and to ensure that the models are fairer and more accurate in their predictions and recommendations Subramanian et al. (2021). Turning to Daniel Kahneman again, he argues that reducing social stereotypes and biases has costs, but that the costs are worth paying to achieve a better society Kahneman (2011). Extending the same principle to NLP systems, debiasing techniques demand computation time and cost; however, these costs are necessary investments for creating more equitable and ethical AI systems.

A category of debiasing methods occurs during the finetuning phase of pre-trained language models.

Zmigrod et al. (2019) and Lu et al. (2020) introduced Counterfactual Data Augmentation (CDA), which involves generating counterfactual instances to address data distribution imbalances between male and female demographics, thereby mitigating gender bias. This method entails substituting gender-specific words, such as he and she, to construct novel sentences. Maudslay et al. (2019) proposed Counterfactual Data Substitution (CDS) to improve the quality of CDA counterfactuals by associating a probability with each substitution to replace random substitution, resulting in more realistic distributions. Building upon these insights, Park et al. (2018), Liang et al. (2020), and Lauscher et al. (2021) proposed swapping to balance the data distribution of sociodemographic groups. Evaluation of approaches based on data augmentation typically involves assessing the extent to which they achieve improved bias reduction in LMs through quantitative metrics. Some of these data augmentation approaches can also be applied during training time to reduce bias.
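The word-swapping step behind CDA can be sketched as follows. This is a deliberately naive illustration: the swap list is a small hypothetical sample, the function names are our own, and the sketch ignores the grammatical agreement and ambiguity issues (for example, "her" mapping to either "his" or "him") that published methods handle more carefully.

```python
import re

# Hypothetical, deliberately small swap list; real CDA uses curated pair lists.
SWAP_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother")]

def build_swap_map(pairs):
    """Make the word mapping bidirectional."""
    mapping = {}
    for a, b in pairs:
        mapping[a] = b
        mapping[b] = a
    return mapping

def counterfactual(sentence, mapping):
    """Swap each gendered word for its counterpart, preserving capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = mapping.get(word.lower())
        if swapped is None:
            return word
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"[A-Za-z]+", replace, sentence)

def augment(corpus, mapping):
    """Return the corpus plus a counterfactual for every sentence that changes."""
    augmented = list(corpus)
    for sentence in corpus:
        swapped = counterfactual(sentence, mapping)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented

swap_map = build_swap_map(SWAP_PAIRS)
corpus = ["He is a doctor.", "The father said she was late.", "It rained."]
augmented = augment(corpus, swap_map)
for line in augmented:
    print(line)
```

The augmented corpus keeps every original sentence, so gendered contexts end up balanced rather than reversed.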

Dev et al. (2020, 2021) proposed a subspace correction and rectification method for modifying embedding space to mitigate bias. They aimed to disentangle associations between concepts deemed problematic for the models. Ravfogel et al. (2020) learned a linear projection over representations after training a DNN, to remove the bias components in embeddings. Manzini et al. (2019) used principal component analysis to identify the bias subspace.
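The shared idea behind these embedding-space methods, removing the component of a vector that lies along an estimated bias direction, can be sketched as below. This is the simplest single-direction version with toy vectors and hypothetical helper names; the cited methods estimate the bias subspace differently (for example, through learned classifiers or principal component analysis) and may remove more than one direction.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def bias_direction(vec_a, vec_b):
    """Unit vector along the difference of two group-defining word vectors."""
    return normalize([a - b for a, b in zip(vec_a, vec_b)])

def project_out(vector, direction):
    """Remove the component of `vector` that lies along the unit `direction`."""
    scale = dot(vector, direction)
    return [x - scale * d for x, d in zip(vector, direction)]

# Toy three-dimensional vectors (assumed for illustration only).
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
doctor = [0.4, 0.5, 0.7]

direction = bias_direction(he, she)
debiased = project_out(doctor, direction)
print(debiased)
```

After projection the vector is orthogonal to the estimated direction, while its other components are left unchanged.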

Park et al. (2018) showed that fine-tuning with larger corpora helps to debias a model. This method could prevent potential over-fitting to a small, biased dataset. Ahn and Oh (2021) proposed that training BERT Devlin et al. (2019) using multiple languages helps to reduce the ethnic bias in each language.

In human-in-the-loop approaches, humans first identify bias in trained models, and the models are then finetuned to reduce these biases. Chopra et al. (2020) used human-in-the-loop methods to find word pairs linking a sociodemographic group to a positive or negative trait. Yao et al. (2021) used human-provided explanations regarding observed bias to find spurious patterns in model output, then used these spurious patterns to reduce bias in models.

Works based on debiasing during finetuning offer enhanced opportunities for analyzing bias present in the models. These methods are easier to implement and can be tailored to each specific model. However, as large language models become more prevalent, models are trained on enormous amounts of data. In such cases, bias becomes ingrained within the models, and it is difficult to debias these models using post-training techniques.

3.2 Debiasing during Training - D2

Several works have applied debiasing at training time, or even before, to word embeddings used at initialization. Bolukbasi et al. (2016) proposed a hard debiasing technique aimed at equalizing gender associations in embeddings. Their method compensates for differences in average vector deviations between female and male gender terms relative to gender-neutral vocabulary. They published pre-trained embeddings using hard-debiasing, for use in place of Word2Vec embeddings. Park et al. (2018) and Zhao et al. (2018b) also presented results on use of debiased word embeddings to reduce gender bias in language models.

Garimella et al. (2021) used declustering loss to reduce bias, and in a similar vein Bordia and Bowman (2019) proposed a loss regularization method. Huang et al. (2020) proposed a three-step curriculum training using distance between the embeddings as a fairness loss to reduce sentiment bias. Liu et al. (2021) aimed to reduce the significance of sociodemographic attributes in the input using adversarial training.

With the growing prominence of LLMs, works on debiasing during training are highly impactful, as they result in LMs with significantly reduced bias that can be used more safely in real-world applications. However, more work on debiasing during training is needed to cover a broader spectrum of bias. By addressing a wider range of bias, these approaches can further enhance the trustworthiness of LMs in real-world scenarios.

3.3 Debiasing at Inference Time - D3

The work in this section applies debiasing methods at inference time. In general, these methods are quite diverse. Abid et al. (2021) and Venkit et al. (2023b) applied adversarial machine learning to trigger positive associations in text generative models to reduce anti-Muslim bias and nationality bias through prompt modifications. Qian et al. (2021) performed keyword-based distillation to remove bias during inference, and to block bias acquired during training. Zhao et al. (2019) addressed gender bias at inference time through averaging of representations for different gender vocabulary, but with little reduction in bias.

Most debiasing techniques share a problem: the term “debiasing” is often used in a loose sense. In practice, debiasing often acts as a preventive measure that hides harmful bias by forcing the model to produce less biased results. Gonen and Goldberg (2019) showed that current debiasing methods are superficial and hide bias rather than removing it. This suggests that existing debiasing techniques are insufficient.

Evaluation Datasets

Bias benchmark datasets provide valuable resources for NLP fairness research. These datasets commonly contain illustrative examples of biased language, often templated sentences filled with contrastive social group terms. They allow standardized bias evaluation on diverse tasks using controlled examples. Many of them focus on a particular type of language context, such as co-reference, sentiment, or question answering, while others probe for stereotype bias through word associations. Table 1 in Appendix B summarizes these datasets.

In the case of coreference resolution, Zhao et al. (2018a) proposed a method for identifying gender bias using Winograd-schema sentences for occupation terms. Webster et al. (2018) introduced GAP, a gender-balanced, labeled corpus of 8,908 ambiguous pronoun–name pairs designed to detect gender bias in coreference resolution. In the word association domain, Nangia et al. (2020) presented CrowS-Pairs, a sentence pair corpus that measures a model’s bias by assessing whether it favors sentences with stereotypes. Nadeem et al. (2021) released StereoSet, a large-scale natural dataset in English designed to measure stereotypical bias using inter- and intra-sentence association of words to stereotypical contexts. Li et al. (2020) proposed UNQOVER, a general framework for probing bias in question answering models, using questions that test whether a model associates a sociodemographic group with a stereotype. Smith et al. (2022) published HolisticBias, consisting of 450,000 unique sentence prompts for measuring 13 types of sociodemographic bias in generative LMs.
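The sentence-pair style of evaluation can be sketched as a simple preference rate: the fraction of pairs for which a model scores the stereotypical sentence higher than its counterpart, with a value near 0.5 indicating no systematic preference. The code below is a minimal illustration and not the official scoring script of any benchmark. The sentence pairs and the lookup table `TOY_SCORES` are invented, and `toy_score` stands in for a real model score such as a sentence (pseudo-)log-likelihood.

```python
def stereotype_preference_rate(pairs, score):
    """Fraction of (stereotypical, counterpart) pairs where the first scores higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    preferred = sum(1 for stereo, anti in pairs if score(stereo) > score(anti))
    return preferred / len(pairs)


# Invented example pairs: (stereotypical sentence, counterpart sentence).
PAIRS = [
    ("The nurse said she was tired.", "The nurse said he was tired."),
    ("The engineer said he was late.", "The engineer said she was late."),
    ("The secretary said she was busy.", "The secretary said he was busy."),
    ("The pilot said he was ready.", "The pilot said she was ready."),
]

# Invented scores standing in for model log-likelihoods.
TOY_SCORES = {
    "The nurse said she was tired.": -10.0,
    "The nurse said he was tired.": -12.0,
    "The engineer said he was late.": -9.5,
    "The engineer said she was late.": -11.0,
    "The secretary said she was busy.": -10.5,
    "The secretary said he was busy.": -10.0,
    "The pilot said he was ready.": -8.0,
    "The pilot said she was ready.": -9.0,
}


def toy_score(sentence):
    """Hypothetical scorer: look up a made-up log-likelihood."""
    return TOY_SCORES[sentence]


rate = stereotype_preference_rate(PAIRS, toy_score)
print("stereotype preference rate:", rate)
```

With so few pairs the rate is only a demonstration of the computation; meaningful estimates require a corpus of the size described above.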

In the domain of sentiment evaluation, Kiritchenko and Mohammad (2018) released EEC, a collection of 8,640 English sentences curated to test bias toward certain races and genders in sentiment analysis models. BITS Venkit and Wilson (2021); Venkit et al. (2023c) is a similar corpus of 1,126 sentences curated to measure disability, race, and gender bias in sentiment and toxicity analysis models.
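Template-based sentiment corpora of this kind can be approximated by filling templates with group-associated terms and comparing the average model score per group. The following is a minimal sketch, assuming invented templates, placeholder name lists, and a toy scorer `toy_sentiment` with an artificial offset injected so that the gap is non-zero; none of these reproduce the contents of EEC or BITS.

```python
from itertools import product
from statistics import mean

# Invented templates and placeholder name lists, for illustration only.
TEMPLATES = ["{name} feels {emotion}.", "The talk with {name} was {emotion}."]
EMOTIONS = ["happy", "angry"]
GROUPS = {"group_a": ["Alex", "Sam"], "group_b": ["Jordan", "Casey"]}


def fill_templates(names):
    """Generate every template x name x emotion combination."""
    return [
        t.format(name=n, emotion=e)
        for t, n, e in product(TEMPLATES, names, EMOTIONS)
    ]


def toy_sentiment(sentence):
    """Hypothetical scorer with an artificial penalty for group_b names."""
    lexicon = {"happy": 0.8, "angry": -0.6}
    score = sum(v for w, v in lexicon.items() if w in sentence)
    if any(n in sentence for n in GROUPS["group_b"]):
        score -= 0.1  # injected bias so that the example shows a gap
    return score


def mean_score(names, score=toy_sentiment):
    """Average score over all sentences generated for a list of names."""
    return mean(score(s) for s in fill_templates(names))


gap = mean_score(GROUPS["group_a"]) - mean_score(GROUPS["group_b"])
print("sentences per group:", len(fill_templates(GROUPS["group_a"])))
print("mean score gap (group_a - group_b):", gap)
```

In practice the toy scorer would be replaced by the sentiment or toxicity model under test, and the gap would be accompanied by a significance test over the paired sentences.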

Issues and Recommendations

The most critical issue, and perhaps the most challenging to address, is whether there are metrics and measurements in the literature that are robust to perturbations in the datasets and templates, or to other changes in the evaluation setting, such as the type of bias or the training corpora. Serious reliability issues have been identified for both distance-based and performance-based approaches Webster et al. (2018); Du et al. (2021). It has been argued, for example, that distance-based metrics for bias do not correlate well with real-world phenomena Goldfarb-Tarrant et al. (2021). Additionally, it has been shown that distance-based metrics can change considerably with different initialization Antoniak and Mimno (2021). Reliability issues also persist with template-based approaches, which are currently very sensitive to small modifications to templates Selvam et al. (2023); Seshadri et al. (2022); Alnegheimish et al. (2022). Without reliable metrics, making strong claims about debiasing methods is difficult. One step towards tackling this challenge would be to increase the scale of bias-testing templates and datasets. It could also be helpful to investigate whether different types of bias require different metrics.

The second problematic issue we identify concerns current debiasing methods. We find that the term “debiasing” is often used in a loose sense: many methods force models to produce less biased results without actually removing the bias in the models. We find the arguments of Gonen and Goldberg (2019) convincing that these methods are largely superficial. Even worse, other work suggests that these methods can potentially increase bias Mendelson and Belinkov (2021). We reiterate, however, that robust measurements of bias are a precondition for research on debiasing methods, along with investigations into whether different metrics might be required for different types of bias. Until better metrics have been proposed, we recommend that investigators report results using multiple metrics, which could lead to insights into both measurements of bias and debiasing techniques.

Moreover, the majority of recent debiasing work focuses on fine-tuning. These approaches are easier to implement and can be tailored to each model, because they address bias after the training process. However, with the increasing prevalence of large language models trained on vast amounts of data, bias has become entrenched within these models, which makes it challenging to eliminate through fine-tuning-based approaches. We encourage future work to focus on debiasing techniques applied during training, as they are more impactful given the current trend toward large language models.

The third problematic issue we identify concerns template-based methods for quantifying bias. Recent years have seen the emergence of various datasets built with template-based approaches. However, a critical limitation of these works is their reliance on a small number of templates, often written by the authors themselves Seshadri et al. (2022); Selvam et al. (2023). Consequently, the diversity of these templates is compromised, and the bias metrics derived from them become highly sensitive to the authors’ template creation process. Furthermore, these works use a restricted set of target words, such as the top-20 or top-50 names from US Census data, to quantify bias. These commonly used names fail to accurately represent the overall population and thus do not provide a comprehensive understanding of how models will perform in real-world scenarios. We recommend that future work on template-based approaches incorporate a larger number of diverse templates from various data sources, which will help improve the reliability of these approaches. Additionally, we advocate for the use of an expanded set of target words to measure bias comprehensively, thus enabling more accurate and holistic evaluations of NLP models.
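One way to make template sensitivity visible is to compute the bias score separately for each template and report its spread alongside the mean. The sketch below does this for a score gap between two name groups. It rests on stated assumptions: the templates and name lists are invented, and `toy_score` is a hypothetical scorer whose bias deliberately depends on template wording in order to mimic the sensitivity reported in the literature.

```python
from statistics import mean, pstdev

# Invented templates and placeholder name lists, for illustration only.
TEMPLATES = [
    "{name} is a teacher.",
    "{name} was a teacher yesterday.",
    "I met {name}.",
]
GROUP_A = ["Alex", "Sam"]
GROUP_B = ["Jordan", "Casey"]


def toy_score(sentence):
    """Hypothetical scorer whose penalty for GROUP_B depends on the wording."""
    penalty = 0.0
    if any(n in sentence for n in GROUP_B):
        penalty = 0.3 if "yesterday" in sentence else 0.1
    return 0.5 - penalty


def per_template_gaps(templates, group_a, group_b, score):
    """Mean score gap (group_a - group_b), computed separately per template."""
    gaps = {}
    for t in templates:
        a = mean(score(t.format(name=n)) for n in group_a)
        b = mean(score(t.format(name=n)) for n in group_b)
        gaps[t] = a - b
    return gaps


def summarize(gaps):
    """Mean, population standard deviation, and range of per-template gaps."""
    values = list(gaps.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


gaps = per_template_gaps(TEMPLATES, GROUP_A, GROUP_B, toy_score)
summary = summarize(gaps)
for template, gap in gaps.items():
    print(f"{template!r}: gap = {gap:.3f}")
print("summary:", summary)
```

A large spread relative to the mean suggests that the reported bias depends on template choice, which is a signal to add more templates and a wider set of names before drawing conclusions.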

Our survey focuses on sociodemographic bias, reflecting the high proportion of bias research devoted to it. Most of the research on sociodemographic bias investigates gender bias. This is unsurprising given that gender is one of the most essential components of social identity across cultures, and that many languages have grammatical gender, which partly aligns with gender identity. However, it would likely deepen our understanding of bias in NLP to broaden the types of sociodemographic bias that are investigated, and to compare them to one another.

We also recommend a rather different direction that could build on the kinds of research surveyed here, namely investigating the real-world impacts of sociodemographic bias in NLP through multidisciplinary collaborations with investigators in social science and behavioral economics. This could lead to ways of addressing bias other than through changes in the methodologies used in NLP.

Conclusion

We have presented a comprehensive literature survey encompassing 214 relevant works on sociodemographic bias in NLP. Our proposed categorization of this literature provides enhanced clarity regarding the current research landscape. Our survey reveals several critical challenges facing the field: [A] limitations in the reliability of existing bias quantification methods, [B] a lack of alignment between bias quantification metrics and real-world bias, [C] a scarcity of recent work on training-based debiasing methods, and [D] the issue of author bias in template-based approaches for bias quantification. In light of these challenges, we offer a set of actionable recommendations to guide future work toward a more impactful and responsible approach to addressing bias in NLP.

Limitations

In our survey, we have included works from the ACL Anthology, NeurIPS proceedings, and FAccT. We might have missed some relevant works that did not appear in any of these venues.

References

Appendix A Appendix

We tried to place each work in one of the following categories based on its main contribution. Some works make major contributions in multiple categories and may appear multiple times below. Thus the total number of works listed below is more than 214, but the total number of unique works is 214. (Footnote 1: A GitHub link will be provided upon acceptance.)

Liu et al. (2021); De-Arteaga et al. (2019); Silva et al. (2021); Park et al. (2018); Sap et al. (2020); B et al. (2021); Lauscher and Glavaš (2019); Rozado (2020); Rudinger et al. (2017); Shah et al. (2020); Du et al. (2022); Nozza et al. (2022); Honnavalli et al. (2022); Lucy and Bamman (2021); Mendelson and Belinkov (2021); Matthews et al. (2021); Cao et al. (2022); Papakyriakopoulos et al. (2020); Kementchedjhieva et al. (2021); Garrido-Muñoz et al. (2021); Strengers et al. (2020); Delobelle et al. (2022); Fisher et al. (2020); Sheng et al. (2020); Zhang et al. (2020a); Hendricks et al. (2018); Mehrabi et al. (2021); Mayfield et al. (2019); Schwartz et al. (2021); Nozza et al. (2019); Vaidya et al. (2020); He et al. (2019); Hovy and Søgaard (2015); Wolfe and Caliskan (2021); Sakaguchi et al. (2021); Agarwal et al. (2019); White and Cotterell (2021)

Gender Bias : Gaut et al. (2020); Sun et al. (2019); Hamidi et al. (2018); Zhou et al. (2019); Savoldi et al. (2021); Sahlgren and Olsson (2019); Ahn et al. (2022); Tal et al. (2022); Kaneko et al. (2022); Field and Tsvetkov (2020); Garimella et al. (2019); Escudé Font and Costa-jussà (2019); Bhaskaran and Bhallamudi (2019); McCurdy and Serbetci (2020); Kaneko and Bollegala (2019); Larson (2017); Du et al. (2021); Bartl et al. (2020); Webster et al. (2021); Tan and Celis (2019); Bolukbasi et al. (2016); Maudslay et al. (2019); Zhao et al. (2019); Rudinger et al. (2018); Lu et al. (2020)

Racial Bias : Sap et al. (2019); Hanna et al. (2020); Blodgett et al. (2016); Davidson et al. (2019); Friedman et al. (2019); Shen et al. (2018); Karve et al. (2019); Nadeem et al. (2021); Garimella et al. (2021); Nangia et al. (2020); Tan and Celis (2019); Guo and Caliskan (2021); Brown et al. (2020)

Disability bias : Venkit and Wilson (2021); Hutchinson et al. (2020); Bennett and Keyes (2020); Mills and Whittaker (2019); Hassan et al. (2021)

Ethnic bias : Malik et al. (2022); Li et al. (2022); Ahn and Oh (2021); Garg et al. (2018); Li et al. (2020); Abid et al. (2021); Manzini et al. (2019); Venkit et al. (2023b); Bhatt et al. (2022), age bias Nangia et al. (2020); Diaz et al. (2018) and sexual-orientation bias Nangia et al. (2020); Cao and Daumé III (2020)

: Caliskan et al. (2017); Dev and Phillips (2019); Zhao et al. (2017); Basta et al. (2019); Shen et al. (2018); Brunet et al. (2019); May et al. (2019); Dev et al. (2021); Zhou et al. (2019); Pujari et al. (2020); Sutton et al. (2018); Lauscher et al. (2020); Guo and Caliskan (2021); Bolukbasi et al. (2016); Ross et al. (2021); Tan and Celis (2019); Ethayarajh et al. (2019); Chaloner and Maldonado (2019); Bordia and Bowman (2019)

: De-Arteaga et al. (2019); Kwon and Mihindukulasooriya (2022); Zhang et al. (2022); Huang et al. (2020); Dixon et al. (2018); Zhao et al. (2018a); Cho et al. (2019); Stanovsky et al. (2019); Gonen and Webster (2020); Borkan et al. (2019); Dev et al. (2020)

: Webster et al. (2021); Smith et al. (2022); Kurita et al. (2019); Krishna et al. (2022); Bhaskaran and Bhallamudi (2019); Gupta et al. (2022b); Prabhakaran et al. (2019); Ahn and Oh (2021); Bartl et al. (2020); Li et al. (2020); Venkit and Wilson (2021); Salazar et al. (2020); Dev et al. (2020); Diaz et al. (2018); Zhang et al. (2020b); Garg et al. (2019); Liang et al. (2022); Kusner et al. (2017); Huang et al. (2020); Akyürek et al. (2022); Gardner et al. (2020); Ousidhoum et al. (2021); Parrish et al. (2022); Kiritchenko and Mohammad (2018)

: Ousidhoum et al. (2021); Dev et al. (2020); de Vassimon Manela et al. (2021); Immer et al. (2022); Kennedy et al. (2020); Sweeney and Najafian (2019); Tan et al. (2020); Li et al. (2020); Mendelson and Belinkov (2021)

: Jin et al. (2021); He et al. (2022); Zmigrod et al. (2019); Jin et al. (2021); Gupta et al. (2022a); Ghaddar et al. (2021); Kumar et al. (2020); Han et al. (2021); Attanasio et al. (2022); Joniak and Aizawa (2022); Chopra et al. (2020); Maudslay et al. (2019); Park et al. (2018); Yao et al. (2021); Liang et al. (2020); Sen et al. (2022); Ma et al. (2020); Limisiewicz and Mareček (2022); Yang et al. (2021); Wang et al. (2021); Pujari et al. (2020); Sedoc and Ungar (2019); Tan et al. (2020); Sutton et al. (2018); Ravfogel et al. (2020); Kaneko and Bollegala (2019); Karve et al. (2019); Gyamfi et al. (2020); Shin et al. (2020); Zhang et al. (2020a); Wen et al. (2022); Chopra et al. (2020); Yang and Feng (2020); Lu et al. (2020); Lauscher et al. (2021); Garg et al. (2019); Dev et al. (2020, 2021); Manzini et al. (2019); Bolukbasi et al. (2016); Ahn and Oh (2021); Orgad et al. (2022).

: An et al. (2022); Bolukbasi et al. (2016); He et al. (2019); Han et al. (2022); Liu et al. (2020b); Escudé Font and Costa-jussà (2019); Prost et al. (2019); James and Alvarez-Melis (2019); Park et al. (2018); Zhao et al. (2018b); Sweeney and Najafian (2020); Hube et al. (2020); Sen and Ganguly (2020); Saunders and Byrne (2020); Dixon et al. (2018); Karimi Mahabadi et al. (2020)

Loss functions for bias mitigation : Hashimoto et al. (2018); Qian et al. (2019); Berg et al. (2022); Romanov et al. (2019); Garimella et al. (2021); Bordia and Bowman (2019); Huang et al. (2020); Provilkov and Malinin (2021); Liu et al. (2021); Orgad and Belinkov (2023)

Debiasing at inference time : Qian et al. (2021); Zhao et al. (2019); Abid et al. (2021); Guo et al. (2022); Schick et al. (2021); Venkit et al. (2023b)

: These are works that are difficult to categorize in one of the above categories. Chouldechova and Roth (2020); Green (2019); Zhang and Bareinboim (2018); Mayfield et al. (2019); Katell et al. (2020); Dwork et al. (2012); Jacobs et al. (2020); Anoop et al. (2022); Czarnowska et al. (2021); Blodgett et al. (2021); Zhuo et al. (2023); Mulligan et al. (2019); Jacobs and Wallach (2021); Schoch et al. (2020); Franklin et al. (2022); Bender (2019); España-Bonet and Barrón-Cedeño (2022); Hutchinson and Mitchell (2019); Bender et al. (2021); Goldfarb-Tarrant et al. (2021); Brown et al. (2020); Li et al. (2020); Bagdasaryan et al. (2019); Liu et al. (2020a); Zhiltsova et al. (2019); Chopra et al. (2020); Luo et al. (2023); Shah et al. (2020); Garrido-Muñoz et al. (2021); Delobelle et al. (2022); Czarnowska et al. (2021)

Appendix B Datasets

Table 1 provides a list of datasets for quantifying bias in NLP models.