A Survey on Fairness in Large Language Models

Yingji Li, Mengnan Du, Rui Song, Xin Wang, Ying Wang

Introduction

Large language models (LLMs), such as BERT (Devlin et al. 2019), GPT-3 (Brown et al. 2020), and LLaMA (Touvron et al. 2023a), have shown strong performance and promise across a wide range of natural language processing (NLP) tasks, and have an increasingly wide impact in the real world. Their pre-training relies on large corpora from various sources. Numerous studies have verified that LLMs capture human social biases from unprocessed training data, and that these biases emerge in the encoded embeddings and carry over into downstream tasks (Garg et al. 2018; Sun et al. 2019). Unfair LLM systems make discriminatory, stereotypical, and biased decisions against vulnerable or marginalized demographics, causing undesirable social impacts and potential harms (Blodgett et al. 2020; Kumar et al. 2023).

Social biases in language models are derived primarily from training data collected from human societies. On the one hand, these uncensored corpora contain a lot of harmful information reflecting bias, leading language models to learn stereotyped behaviors (Mehrabi et al. 2022). On the other hand, the labels of different demographic groups in the training data are imbalanced, and the distributional difference can lead to unfair predictions when the model trained under the homogeneity assumption is applied to the heterogeneous real data (Shah, Schwartz, and Hovy 2020). In addition, human factors during language model learning or unanticipated biases in embeddings can cause or even amplify downstream biases (Bansal 2022).

According to the training paradigm, LLMs can be divided into those under the pre-training and fine-tuning paradigm and those under the prompting paradigm. In the pre-training and fine-tuning paradigm, LLMs have fewer than a billion parameters and are easy to tune, such as BERT and RoBERTa (Liu et al. 2019); we call these medium-scale LLMs. Biases in medium-scale LLMs can be roughly divided into two types: intrinsic bias and extrinsic bias (Goldfarb-Tarrant et al. 2021), as shown in Figure 1. Intrinsic bias corresponds to the bias in the embeddings encoded by the LLM and reflects the fairness of the model’s output representation. Extrinsic bias corresponds to the decision bias of the downstream task and reflects the fairness of the model’s prediction. In the prompting paradigm, LLMs have more than a billion parameters and may or may not be tuned based on the prompts, such as GPT-4 (OpenAI 2023) and LLaMA-2 (Touvron et al. 2023b); we call these large-scale LLMs. Biases in large-scale LLMs are generally reflected in the model output when given specific prompts.

In this paper, we provide a comprehensive review of related research on fairness in LLMs, where the overall architecture is shown in Figure 2. Focusing on medium-scale LLMs under the pre-training and fine-tuning paradigm, we introduce the evaluation metrics in Section 2, and the intrinsic debiasing methods and extrinsic debiasing methods in Section 3 and Section 4, respectively. In Section 5, we discuss the fairness of large-scale LLMs under the prompting paradigm, including fairness evaluation, reasons for bias, and debiasing methods. We also provide a discussion of current challenges and future directions in Section 6.

Evaluation Metrics

In this section, we summarize the fairness evaluation metrics for medium-scale LLMs, which are divided into intrinsic metrics and extrinsic metrics. Intrinsic metrics are applied to embeddings, formalizing intrinsic bias by statistically quantifying the associations between targets and concepts. Extrinsic metrics are applied to the output of downstream tasks to characterize extrinsic bias by the performance gap.

Distance-based. The sentence embedding association test (SEAT) (May et al. 2019) adapts the word embedding association test (WEAT) (Caliskan, Bryson, and Narayanan 2017) to contextual embeddings. It measures the association between two sets of targets (e.g., male/female) and two sets of attributes (e.g., family/career) via semantically bleached templates such as “He/She is a [MASK]”. The effect size score is then calculated from the cosine similarity between the two sets of embeddings. The contextualized embedding association test (CEAT) extends WEAT to a dynamic setting by quantifying the distribution of effect sizes for social and cross-bias in contextualized word embeddings (Guo and Caliskan 2021). Given a set of target groups and two polarity attribute sets, CEAT measures the effect size of the difference in distance between the target group and the two attribute sets, with lower effect size scores indicating that the target group is closer to the negative polarity of the attribute.
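As a rough illustration of the effect size that WEAT-style tests report, the following minimal sketch computes it with cosine similarity over toy two-dimensional vectors. The vectors and the function names are assumptions made for illustration; a real test would use embeddings produced by the model, and implementations differ on whether the sample or population standard deviation is used.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, normalized by the pooled spread."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Sample standard deviation over all target associations (an assumption).
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy embeddings (hypothetical values, not taken from any model).
male = [[1.0, 0.1], [0.9, 0.2]]
female = [[0.1, 1.0], [0.2, 0.9]]
career = [[1.0, 0.0]]
family = [[0.0, 1.0]]

effect = weat_effect_size(male, female, career, family)
print(f"effect size: {effect:.3f}")
```

A positive value here means the first target set sits closer to the first attribute set than the second target set does; swapping the two target sets flips the sign.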

Probability-based. Discovery of correlations (DisCo) takes the average score of a model’s predictions as the measurement (Webster et al. 2020). It uses a two-slot template like “$X$ likes [MASK]”, where the first slot $X$ consists of nouns related to the occupation, and the second slot is filled by the language model, keeping its top three predictions. Log probability bias score (LPBS) takes a similar template and measurement (Kurita et al. 2019). It corrects for inconsistencies in the prior probability of the target attribute, such as the model having a higher prior probability for males than females. StereoSet is a crowd-sourced dataset that measures four stereotype biases, where each sample consists of a context sentence and a set of candidate associations (Nadeem, Bethke, and Reddy 2021). The model chooses among three candidate associations (stereotyped, anti-stereotyped, and irrelevant), from which a bias score is obtained for each protected group. Similarly, CrowS-Pairs is a dataset containing pairs of stereotyped and anti-stereotyped sentences, which utilizes the pseudo-log-likelihood to compute the perplexity of all tokens conditioned on typical tokens (Nangia et al. 2020). All unmasked likelihood (AUL) modifies CrowS-Pairs by combining multiple correct predictions instead of testing whether the target token is predicted (Kaneko and Bollegala 2022).
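The prior correction used by LPBS can be sketched as follows. The sketch assumes that the masked-token probabilities have already been obtained from a model: the target probability comes from a template such as “[MASK] is a programmer”, and the prior comes from the same template with the attribute also masked. The probability values and function names below are hypothetical.

```python
import math


def increased_log_prob(p_target, p_prior):
    """Log ratio of the in-context probability to the prior probability."""
    return math.log(p_target / p_prior)


def log_prob_bias_score(p_tgt_a, p_prior_a, p_tgt_b, p_prior_b):
    """Difference of the prior-corrected log probabilities of two targets."""
    return increased_log_prob(p_tgt_a, p_prior_a) - increased_log_prob(p_tgt_b, p_prior_b)


# Hypothetical probabilities for "he" and "she" in the same template.
score = log_prob_bias_score(p_tgt_a=0.6, p_prior_a=0.5, p_tgt_b=0.2, p_prior_b=0.25)
print(f"bias score: {score:.4f}")
```

Dividing by the prior means that a model which simply prefers one pronoun everywhere is not counted as biased for this particular attribute; only the change relative to the prior contributes to the score.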

Extrinsic Metrics

Coreference Resolution. One of the most classical tasks for measuring gender bias is coreference resolution on datasets developed based on the Winograd (Levesque, Davis, and Morgenstern 2012) format. WinoBias is a dataset for the intra-clause coreference resolution task (Zhao et al. 2018), which evaluates the model’s ability to associate gender pronouns and occupations in stereotypical and anti-stereotypical contexts. The bias score is defined as the difference between the model’s performance on the “stereotype” and “anti-stereotype” subsets. Similarly, Winogender is an English coreference resolution dataset based on the Winograd format (Rudinger et al. 2018). The difference is that Winogender includes neutral gender and takes one occupation in each instance, while WinoBias defines binary gender and tests two occupations in each instance. In addition, GAP proposes a gender-balanced tagged corpus of 8,908 ambiguous pronoun-name pairs, which covers a more diverse set of pronouns and is more balanced, so that the actual bias of the model can be measured more accurately (Webster et al. 2018).

Semantic Similarity. The semantic similarity between sentence pairs can be used to assess the associations between gender and occupation, as in STS-B (Cer et al. 2017) and Bias-NLI (Dev et al. 2020). These benchmarks form a series of templates of neutral sentence pairs, where one sentence contains gender terms and the other contains an occupation with gender connotations (e.g., “A [woman] is walking.” and “A [nurse] is walking.”). A model unaffected by gender should give the same similarity estimate for both sets of gender sentence pairs, while any difference represents a gender bias.
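The template comparison can be sketched as below. The similarity function is passed in as a parameter because any sentence-similarity model could play that role; the `toy_similarity` lookup and its scores are invented purely so that the example runs on its own.

```python
def gender_similarity_gap(similarity, occupations, template="A {} is walking."):
    """Per-occupation gap: sim(man sentence, occupation sentence)
    minus sim(woman sentence, occupation sentence)."""
    gaps = {}
    for occupation in occupations:
        occupation_sentence = template.format(occupation)
        sim_man = similarity(template.format("man"), occupation_sentence)
        sim_woman = similarity(template.format("woman"), occupation_sentence)
        gaps[occupation] = sim_man - sim_woman
    return gaps


def toy_similarity(sentence_a, sentence_b):
    """Stand-in for a real similarity model (scores are made up)."""
    scores = {
        ("man", "nurse"): 0.55,
        ("woman", "nurse"): 0.80,
        ("man", "engineer"): 0.75,
        ("woman", "engineer"): 0.60,
    }
    # The slot word is the second token of the template "A {} is walking."
    return scores[(sentence_a.split()[1], sentence_b.split()[1])]


for occupation, gap in gender_similarity_gap(toy_similarity, ["nurse", "engineer"]).items():
    print(f"{occupation}: {gap:+.2f}")
```

A gap of zero for every occupation is what a model unaffected by gender would produce; the sign of a non-zero gap shows which gender the occupation is drawn toward.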

Group Fairness. Some representative metrics measure the performance gap of the model for different groups. BOLD is a large-scale fairness benchmark dataset containing natural prompts to evaluate bias across five domains in open-ended English language generation (Dhamala et al. 2021). Given prompts that describe the target population, BOLD measures bias by evaluating the quality of language model generation. Bias-in-Bios is a dataset of third-person biographies that measures the association between gender and occupation, where each biography contains explicit gender indicators (names and pronouns) and occupation annotations (De-Arteaga et al. 2019). The model is fine-tuned on samples without occupation information, and then binary gender bias is measured based on the difference between the classification results for gender groups.
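One common way to express such a performance gap is the difference in true positive rate between gender groups for a given occupation. The sketch below assumes this choice of statistic; the labels, predictions, and group tags are toy values, not data from Bias-in-Bios.

```python
from collections import defaultdict


def tpr_by_group(y_true, y_pred, groups, occupation):
    """True positive rate for one occupation, computed per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == occupation:
            totals[group] += 1
            hits[group] += int(pred == occupation)
    return {group: hits[group] / totals[group] for group in totals}


def tpr_gap(y_true, y_pred, groups, occupation, group_a="F", group_b="M"):
    """TPR of group_a minus TPR of group_b for the given occupation."""
    rates = tpr_by_group(y_true, y_pred, groups, occupation)
    return rates[group_a] - rates[group_b]


# Toy data (hypothetical).
y_true = ["surgeon"] * 4 + ["nurse"] * 4
groups = ["F", "F", "M", "M", "F", "F", "M", "M"]
y_pred = ["surgeon", "nurse", "surgeon", "surgeon",
          "nurse", "nurse", "nurse", "surgeon"]

for occupation in ("surgeon", "nurse"):
    print(occupation, tpr_gap(y_true, y_pred, groups, occupation))
```

A negative gap for an occupation means that the classifier recovers that occupation less often for the first group than for the second.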

Intrinsic Debiasing

Intrinsic debiasing, which aims to mitigate the intrinsic bias in the representations before they are applied to downstream tasks, is task-agnostic. Considering the application stage of debiasing techniques, intrinsic debiasing methods can be divided into three categories: pre-processing, in-processing, and post-processing (Du et al. 2020).

Pre-processing methods take various remedies for deficiencies in training data before training the model.

CDA-based. Since label imbalance across different demographic groups in the training data is an important factor in inducing bias, a widespread data processing method is to balance labels via counterfactual data augmentation (CDA) (Lu et al. 2020; Zmigrod et al. 2019). CDA augments the original corpus with causal intervention, which replaces the sensitive attributes in the original sample with the sensitive attributes of the opposite demographic based on a prior list of sensitive word pairs. For example, in binary gender debiasing, “[He] is a doctor” is replaced with “[She] is a doctor” based on the sensitive word pair (he, she). However, it is difficult to completely eliminate bias by simply augmenting the dataset. Therefore, most debiasing methods combine CDA and other debiasing strategies (Stahl, Spliethöver, and Wachsmuth 2022; Xie and Lukasiewicz 2023). They make various improvements based on CDA, but the fundamental idea is to balance the training samples.
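The word-pair substitution at the core of CDA can be sketched in a few lines. The pair list below is a small hypothetical sample; published methods use much longer lists and handle ambiguous words (for example, “her” mapping to either “him” or “his”), which this sketch deliberately leaves out.

```python
import re

# Hypothetical list of sensitive word pairs (real lists are far longer).
WORD_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother"), ("son", "daughter")]

# Build a two-way lookup so that each word maps to its counterpart.
SWAP = {}
for first, second in WORD_PAIRS:
    SWAP[first] = second
    SWAP[second] = first

PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE)


def counterfactual(sentence):
    """Replace every listed sensitive word with its counterpart."""
    def swap(match):
        word = match.group(0)
        replacement = SWAP[word.lower()]
        # Preserve sentence-initial capitalization.
        return replacement.capitalize() if word[0].isupper() else replacement
    return PATTERN.sub(swap, sentence)


def augment(corpus):
    """Return the corpus plus a counterfactual copy of each changed sentence."""
    augmented = list(corpus)
    for sentence in corpus:
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


print(augment(["He is a doctor", "It is raining"]))
```

Sentences that contain no listed word are kept once, so the augmentation only adds data where a sensitive attribute actually appears.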

Data Calibration. Other pre-processing methods create fairer training corpora by calibrating harmful information in the data. One approach is to remove potentially biased texts: harmful text subsets are identified either differentially (Brunet et al. 2019) or programmatically (Ngo et al. 2021) and then deleted, after which the model is retrained on the remaining data. For languages with more complex morphology than English, it is more practical to create training data in the opposite direction, that is, to create biased text from real fair text using round-trip translation with a machine translation model (Amrhein et al. 2023).

In-processing

In-processing methods incorporate fairness into LLMs’ design, and obtain a fairer model by tuning the parameters.

Retraining Optimization. Retraining models is a direct way to reduce bias, although it can be resource-intensive and difficult to scale. Dropout regularization interrupts the attention mechanism association between words, and can be used to retrain LLMs to reduce gendered correlations (Webster et al. 2020). Bias in the distilled language model can be mitigated using a fair knowledge distillation approach based on counterfactual role reversal, which improves the fairness of the output probabilities of the teacher model to guide a fair student model (Gupta et al. 2022).

Disentanglement. Disentanglement methods remove biases while preserving useful information. They disentangle potentially correlated concepts by projecting representations into orthogonal subspaces, thus removing discriminatory correlation bias (Dev et al. 2021; Kaneko and Bollegala 2021). Because group-specific subspace projection requires prior group knowledge, some work (Ungless et al. 2022; Omrani et al. 2023) instead projects representations onto the stereotype content model (SCM) (Fiske et al. 2002), which relies on a theoretical understanding of social stereotypes to define bias subspaces, thus overcoming the limitations of prior knowledge.

Alignment Constraint. This mitigation strategy constrains models to learn more similar representations by aligning the distributions between different sensitive attributes. Auto-Debias proposes a max-min debiasing strategy, which maximizes the dissimilarity between different demographic groups through automatically searched biased prompts, and then minimizes the dissimilarity between the two distributions using alignment constraints (Guo, Yang, and Abbasi 2022). To mitigate bias in low-resource multilingual models, Ahn and Oh (2021) propose to leverage the contextual embeddings of two monolingual BERT models and align toward the less biased one.

Contrastive Learning. The training objective is to pull together positive samples specific to different populations and to push negative sample pairs apart. MABEL counterfactually augments premises and hypotheses from the natural language inference (NLI) dataset, and then uses a contrastive learning objective on gender-balanced entailment pairs (He et al. 2022). CCPA learns a continuous biased prompt that pushes apart the representations of different populations, and then utilizes contrastive learning to pull together the representations concatenated with the biased prompt (Li et al. 2023).

Post-processing

Post-processing methods freeze the parameters of the pre-trained LLMs and debias the output representations.

Projection-based. One traditional approach is to remove bias information from representations by linearly separating sensitive and neutral attributes (Dev and Phillips 2019). The strategy is to linearly project the representation into a bias subspace, isolate potentially harmful embeddings associated with the biased concept according to the orientation of the embeddings, and then remove the biased attributes (Dev et al. 2020; Liang et al. 2020). However, removing only useless information is difficult, and it carries the risk of compromising the original semantics (Garimella et al. 2021).
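A single-direction version of this idea can be sketched as follows: estimate a bias direction from definitional word pairs, then subtract each embedding's component along that direction. Using one direction rather than a multi-dimensional subspace is a simplification made here for brevity, and the three-dimensional vectors are toy values.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def bias_direction(pairs):
    """Unit vector along the mean difference of paired embeddings."""
    dim = len(pairs[0][0])
    mean_diff = [sum(a[i] - b[i] for a, b in pairs) / len(pairs) for i in range(dim)]
    return normalize(mean_diff)


def remove_component(v, direction):
    """Subtract the projection of v onto a unit-length direction."""
    scale = dot(v, direction)
    return [a - scale * d for a, d in zip(v, direction)]


# Toy embeddings (hypothetical values).
he, she = [1.0, 0.2, 0.0], [-1.0, 0.2, 0.0]
doctor = [0.3, 0.5, 0.8]

direction = bias_direction([(he, she)])
debiased = remove_component(doctor, direction)
print(debiased)
```

The components orthogonal to the bias direction are left untouched, but in real embeddings the bias direction is rarely this cleanly separated from task-relevant information, which is where the risk to the original semantics comes from.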

Parameter-efficient. Parameter-efficient methods are used to address the potentially catastrophic forgetting (Kirkpatrick et al. 2016) that can occur with in-processing methods, that is, the information from the original training data retained in the pre-trained parameters is erased during tuning. The sustainable debiasing method adds an adapter module after the encoding layer and only updates the adapter’s parameters during training while freezing the LLM’s parameters, achieving debiasing in a parameter-efficient and knowledge-preserving manner (Lauscher, Lüken, and Glavas 2021). GEEP injects LLMs with gender equality prompts that are trainable embeddings of occupation names (Fatemi et al. 2023). Similarly, the LLM’s parameters are fixed while the prompts are updated, thus preserving the original useful information.

Contrastive Learning. Contrastive learning can also be used to deal with intrinsic bias in post-processing. For example, FairFil proposes a neural debiasing method based on the contrastive learning framework (Cheng et al. 2021), which trains a fair filter after the LLM’s encoder. Under the constraint of the contrastive loss, the fair filter makes the embeddings of positive pairs similar, thus alleviating the bias in the representations of different genders.
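A contrastive objective of this kind can be sketched with an InfoNCE-style loss, in which the positive is, for instance, the embedding of the same sentence with gender words swapped. The choice of InfoNCE, the temperature value, and the toy vectors are assumptions for illustration and are not taken from any specific method above.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log of the softmax weight assigned to the positive pair."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical values).
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)
loss_misaligned = info_nce(anchor, [0.0, 1.0], negatives)
print(loss_aligned, loss_misaligned)
```

The loss is small when the anchor and its positive are already close, so minimizing it drives the filter to map an original sentence and its gender-swapped counterpart to similar embeddings.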

Extrinsic Debiasing

Extrinsic debiasing aims to improve fairness in downstream tasks, such as sentiment analysis and machine translation, by making models provide consistent outputs across different demographic groups. Extrinsic debiasing strategies work by debiasing LLMs in a task-specific way. These strategies can be grouped into two types: data-centric and model-centric.

Data-centric debiasing focuses on correcting the defects of training data such as label imbalance, potentially harmful information, and distributional difference.

Data Augmentation. In text classification, classifiers trained on an imbalanced corpus show problematic trends for some identity terms; for example, “gay” is frequently used in toxic comments, causing the model to associate it with toxic labels (Dixon et al. 2018). The source of this bias is the disproportionate representation of identity terms in the training data, which can be addressed by leveraging data augmentation to balance the corpus. Some work bridges robustness and fairness by augmenting the training set with robust word substitution (Pruksachatkun et al. 2021) and counterfactual logit pairing (Garg et al. 2019).

Data Calibration. In order to improve data quality, some work has developed data calibration schemes for specific tasks. In machine translation, data calibration methods include labeling the gender of samples (Vanmassenhove, Hardmeier, and Way 2019) and creating a credible gender-balanced adaptation dataset (Saunders and Byrne 2020). In toxic language detection, methods include using transfer learning to reduce bias from a less biased corpus (Park, Shin, and Fung 2018), relabeling samples by dialect and race priming (Sap et al. 2019) or automatically sensing dialects (Zhou et al. 2021), and identifying and removing proxy words associated with identity terms (Panda et al. 2022). These debiasing methods leverage various data calibration schemes to create training datasets with fewer harmful texts and more balanced labels, and then they improve prediction fairness by training models in unbiased datasets.

Instance Weighting. The main idea is to manipulate the weight of each instance to balance the training data during training for downstream tasks, e.g., reducing the weight of biased instances so that the model pays less attention to them (Han, Baldwin, and Cohn 2022). Social bias in text classification is formalized as a selection bias from a non-discriminatory distribution to a discriminatory distribution (Zhang et al. 2020). It is assumed that each instance of the discriminatory distribution is drawn, according to the social bias, independently from the samples of the non-discriminatory distribution. With instance weights calculated from this formalization, mitigating bias amounts to recovering the non-discriminatory distribution from the selection bias. BLIND treats social bias as a special case of the robustness problem caused by shortcut learning (Orgad and Belinkov 2023). It trains an auxiliary model that predicts the success of the main model in order to detect instances in which demographic characteristics may be used as shortcuts, and then reduces the weights of these instances when training the main model to improve prediction fairness.
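One simple weighting scheme in this spirit assigns each instance the weight P(y) / P(y | z), where y is the task label and z is the demographic attribute, so that the label becomes independent of the attribute in the reweighted data. The sketch below assumes this particular formula and uses toy labels; it is not a reproduction of any cited method.

```python
from collections import Counter


def reweigh(labels, groups):
    """Weight each instance by P(y) / P(y | z)."""
    n = len(labels)
    label_counts = Counter(labels)
    group_counts = Counter(groups)
    joint_counts = Counter(zip(labels, groups))
    weights = []
    for y, z in zip(labels, groups):
        p_y = label_counts[y] / n
        p_y_given_z = joint_counts[(y, z)] / group_counts[z]
        weights.append(p_y / p_y_given_z)
    return weights


# Toy data (hypothetical): label 1 is over-represented in group "a".
labels = [1, 1, 1, 0, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

weights = reweigh(labels, groups)
print([round(w, 3) for w in weights])
```

Instances whose label is over-represented within their group are down-weighted and the rest are up-weighted, so the weighted label rate ends up the same in both groups.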

Model-centric Debiasing

Model-centric debiasing methods focus on designing more effective frameworks to mitigate bias, which mainly consider the fairness objective in the learning process or introduce various advanced techniques to assist debiasing.

Regularization Constraint. The regularization constraint incorporates the fairness objective into the training process of downstream tasks, and adds a regularization term beyond the task objective to encourage debiasing. One approach leverages causal knowledge from model training, which applies regularization to separately penalize causal features and spurious features that are manually identified by a counterfactual framework (Wang, Shu, and Culotta 2021). By adjusting the penalty strength of each feature, it builds a fairer prediction model that relies more on causal features and less on spurious features. Another plug-and-play debiasing method integrates the training objective of masked language models into downstream classification tasks (Ghanbarzadeh et al. 2023). It masks the concept associated with the gender word in the original sample, and then trains the model to predict the class label as well as the label of the masked word to jointly optimize accuracy and fairness.

Adversarial Learning. The main idea of adversarial learning is to hide sensitive information from the decision function (Ravfogel et al. 2022). In general, adversarial networks consist of an attacker that tries to detect protected attributes in the encoder’s representation and an encoder that tries to prevent the attacker from identifying protected attributes in a given task (Lahoti et al. 2020). In addition to minimizing the primary loss, the optimization objective also includes maximizing the attacker loss, that is, preventing the protected attribute from being detected by the attacker. The protected attributes in the input are then more likely to act as independent rather than confounding variables, making the model’s predictions fairer and uncorrelated with sensitive information (Han, Baldwin, and Cohn 2021a). Although adversarial debiasing alleviates the bias to a large extent, important sensitive information is still retained in the model encoding and prediction output (Elazar and Goldberg 2018). To address this, an orthogonality constraint is used to enhance the adversarial component, in which multiple different attackers learn hidden representations that are orthogonal to each other (Han, Baldwin, and Cohn 2021b).

Auxiliary Classifier. Auxiliary classifiers are added to the main model to assist debiasing by predicting the expected target. INLP trains multiple linear classifiers to predict the target attribute along different dimensions, and then projects representations into their null-space (Ravfogel et al. 2020). As a result, the model ignores the target attribute and the data can no longer be easily linearly separated according to it, which leads to fairer predictions. Another representative work adds a classifier as a correction layer after the input layer of the main model, which learns the feature selection of the main model (Liu et al. 2021). The correction layer maps the input text to a saliency distribution by assigning high attention to important features and low attention to irrelevant features. The re-selected representations are fed into the original classifier so that the predictions are less disturbed by irrelevant features.

Contrastive Learning. Incorporating contrastive learning into classifier training is a cheaper and easier-to-optimize way to mitigate bias (Shen et al. 2021). The intuition is that fair representations for classification tasks should cluster instances with the same class label rather than instances with the same sensitive attribute. The training objective combines two contrastive loss components with the cross-entropy loss, which maximizes the similarity of instance pairs sharing the main task label while minimizing the similarity of instance pairs with the same sensitive attribute. In the framework of contrastive learning, instances with the same sensitive attribute are spread apart in the representation space, so that sensitive attributes have less effect on the model’s predictions.

Fairness of Large-scale LLMs

Large-scale LLMs with billion-level parameters based on the prompt training paradigm are under rapid development. As more large-scale LLMs are deployed in various real-world scenarios, concerns about their fairness are growing simultaneously. In this section, we summarize the existing fairness research on large-scale LLMs in terms of fairness evaluation, reasons for bias, and debiasing methods.

For assessing social bias in large-scale LLMs, the basic strategy is to analyze bias associations in the content generated by the model in response to the input prompts (Cheng, Durmus, and Jurafsky 2023; Ramezani and Xu 2023). This can be performed from different perspectives using a variety of tasks, including prompt completion, dialogue generation, and analogical reasoning, while some work has also developed benchmark datasets to test for social bias.

GPT-3 is reported to be socially biased, as validated by prompt completion and co-occurrence tests (Brown et al. 2020). The authors test the association between gender and occupation and find that, for 83% of the 388 occupations tested, the generated text is more likely to contain a male identifier. They feed in 800 prompts about gender, race, and religion in a co-occurrence test, and GPT-3’s output reflects the presence of social bias in the training data. Other work has shown that GPT-3 has a higher violent bias against Muslims than other religious groups, by leveraging tasks such as prompt completion, story generation, and analogical reasoning to quantify the probability of GPT-3 outputting violent content against Muslim groups (Abid, Farooqi, and Zou 2021).

BBQ is a question answering bias benchmark with nine social bias categories, consisting of 58,492 hand-constructed examples with ambiguous and disambiguated contexts (Parrish et al. 2022). It evaluates the degree of bias in LLMs’ responses to input questions at two levels: adequate and insufficient contextual information. The test results on UnifiedQA (11B) (Khashabi et al. 2020) show that the model relies on social bias to varying degrees to make predictions when context information is insufficient, and the degree of bias is reduced when the context is disambiguated. BBQ is subsequently used to evaluate biases and stereotypes contained in 30 well-known LLMs (Liang et al. 2022). That evaluation finds a strong correlation between bias and accuracy in ambiguous contexts for InstructGPT davinci v2 (175B) (Ouyang et al. 2022), T0++ (11B) (Sanh et al. 2022), and TNLG v2 (530B) (Smith et al. 2022), which exhibit the strongest bias while also demonstrating striking accuracy. The trends in disambiguated contexts are quite different: the relationship between model accuracy and bias is less clear, and all models show biases that are contrary to broader social marginalization/bias.

Recent research has focused on the fairness evaluation of ChatGPT/GPT-4 (OpenAI 2023). BiasAsker proposes an automated framework for identifying and measuring social biases in conversational AI systems, which identifies absolute and correlated biases in dialogue (Wan et al. 2023). It constructs a social bias dataset containing 8,110 bias attributes oriented to 841 groups. Based on this dataset, BiasAsker automatically generates questions that can induce bias in ChatGPT and GPT-3. Another study evaluates ChatGPT’s fairness performance in high-stakes domains such as education, criminology, finance, and healthcare (Li and Zhang 2023). Considering group fairness and individual fairness, the authors observe the difference in ChatGPT’s output given a set of biased or unbiased prompts. They adopt datasets from different domains to construct prompts consisting of four parts: task instructions, context samples, feature descriptions in the dataset, and questions. Experiments show that although ChatGPT is better than small models, it still exhibits unfairness.

For a more thorough study, DecodingTrust provides a comprehensive fairness evaluation for ChatGPT and GPT-4, where stereotype bias and fairness are evaluated separately (Wang et al. 2023). For stereotype bias, it creates a dataset of stereotype statements with 16 stereotype topics that affect 24 demographic groups. Bias is evaluated by querying whether the model agrees with a given stereotype statement in three constructed evaluation scenarios. It is found that ChatGPT and GPT-4 are not strongly biased for most stereotyped topics considered in benign scenarios, while they can be tricked into agreeing with stereotyped statements in misleading scenarios, with GPT-4 in particular being more easily misled. Moreover, for different populations and topics, the GPT models exhibit different levels of bias, such as showing higher bias on less sensitive topics such as leadership and greed than on more sensitive topics such as drug dealing and terrorism. For fairness, it constructs three evaluation scenarios: a zero-shot scenario, a scenario with unbalanced samples, and a scenario with different numbers of balanced samples. It is found that while GPT-4 is more accurate in population-balanced test environments, it is less fair in imbalanced test environments. In the zero-shot and few-shot scenarios, ChatGPT and GPT-4 perform very differently on different groups, and a small number of balanced few-shot examples can effectively guide the model to be fairer.

What are the Reasons for Model Bias?

Recent large-scale LLMs such as GPT-4 and LLaMA-2 are found to undergo a “phase transition” of capabilities compared to earlier LLMs, and exploration of the reasons for the bias in earlier models does not necessarily translate. Therefore, there are some experimental studies to understand the reasons for the bias in large-scale LLMs (Santy et al. 2023; Bubeck et al. 2023).

For LLaMA-2, the bias in the generated text is verified to be correlated with the frequency of gender pronouns and identity terms in the training data (Touvron et al. 2023b). The authors perform pronoun analysis in an English pre-training corpus by counting the most common English pronouns and grammatical persons. They find that the frequency of male pronouns is much higher than that of female pronouns, and similar regularities are found in other models of similar size (Chowdhery et al. 2022). However, in the statistics on identity terms, female terms appear in a larger proportion of documents, reflecting the difference between terms and linguistic tags. In addition, among the identity terms, those about LGBTQ+ sexual orientation and Western groups account for a larger proportion.
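A pronoun analysis of this kind can be approximated by counting the share of documents that contain at least one pronoun from each group. The pronoun lists, the tokenization rule, and the four-document corpus below are assumptions chosen to keep the sketch short.

```python
import re
from collections import Counter

# Hypothetical pronoun groups; a real analysis may use different lists.
PRONOUNS = {
    "she": {"she", "her", "hers", "herself"},
    "he": {"he", "him", "his", "himself"},
}


def pronoun_document_share(documents):
    """Fraction of documents containing at least one pronoun per group."""
    counts = Counter()
    for document in documents:
        tokens = set(re.findall(r"[a-z']+", document.lower()))
        for group, words in PRONOUNS.items():
            if tokens & words:
                counts[group] += 1
    return {group: counts[group] / len(documents) for group in PRONOUNS}


# Toy corpus (hypothetical).
corpus = ["He said his piece.", "She left.", "He met him.", "Nothing."]
print(pronoun_document_share(corpus))
```

Counting documents rather than tokens keeps a single long document from dominating the statistic, although a token-level count is an equally reasonable alternative.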

An investigation of an earlier version of GPT-4 finds that the stereotype bias between occupation and gender is proportional to the gender proportion of that occupation in the world (Bubeck et al. 2023). It prompts GPT-4 to generate recommendation letters for a given occupation and counts the model’s gender selection for the occupation, and the results reflect the skewness of the world representation of the occupation. NLPositionality is a framework for characterizing design biases and quantifying the positionality of datasets and models, which collects annotations from volunteers and aligns them with dataset labels and model predictions (Santy et al. 2023). By applying social acceptability and hate speech detection tasks to existing models, it observes that datasets and models favor advantaged groups such as Western, white, young, and highly educated people, while some marginalized groups such as non-binary people and non-native English speakers may be further marginalized.

Debiasing Large-scale LLMs

Compared with the more flexible medium-scale LLMs, large-scale LLMs are more difficult to debias. Under the prompt training paradigm, large-scale LLMs can be debiased by instruction fine-tuning and prompt engineering.

Instruction Fine-tuning. Fine-tuning large-scale LLMs on a collection of datasets expressed as instructions has been shown to mitigate model bias, and some work applies it to debias zero-shot and few-shot tasks (Wei et al. 2022; Chung et al. 2022). Instruction fine-tuning can be further strengthened with reinforcement learning from human feedback (RLHF) (Christiano et al. 2017); representative work includes InstructGPT (Ouyang et al. 2022) and LLaMA-2-Chat (Touvron et al. 2023b). InstructGPT fine-tunes GPT-3 to follow human instructions with RLHF in three steps: 1) collect human-written demonstration data to supervise GPT-3’s learning, 2) collect comparison data of model outputs provided by annotators and train a reward model to predict human-preferred outputs, and 3) optimize the policy against the reward model using the PPO algorithm (Schulman et al. 2017). The fine-tuned InstructGPT is verified to produce significantly less toxic output. However, evaluations on modified versions of the Winogender (Rudinger et al. 2018) and CrowS-Pairs (Nangia et al. 2020) datasets show that the bias of InstructGPT is not significantly reduced compared to GPT-3. To mitigate the safety risks of LLaMA-2, LLaMA-2-Chat employs three safety fine-tuning techniques: 1) collect adversarial prompts and safe demonstrations and include them in the general supervised fine-tuning process, 2) train a safety-specific reward model to integrate safety into the RLHF pipeline, and 3) apply safety context distillation to refine the RLHF pipeline. Validation shows that the fine-tuned LLaMA-2-Chat exhibits more positive sentiment toward many demographic groups, and its fairness is greatly improved over the pre-trained LLaMA-2 base model.
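The reward-model step of RLHF is commonly trained with a pairwise ranking loss over a preferred and a rejected output, which can be sketched as below. The scalar rewards are hypothetical stand-ins for the outputs of a learned reward model, and the function name is illustrative.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable form."""
    x = -(reward_chosen - reward_rejected)
    # softplus(x) = log(1 + exp(x)), rewritten to avoid overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical reward scores for two responses to the same prompt.
print(pairwise_reward_loss(2.0, 0.0))  # preferred output scored higher
print(pairwise_reward_loss(0.0, 2.0))  # preferred output scored lower
```

The loss shrinks as the reward model scores the human-preferred output further above the rejected one, which is what allows the later policy optimization step to use the reward as a proxy for human preference.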

Prompt Engineering. Prompt engineering has also been used to mitigate the bias of large-scale LLMs in language generation by designing additional prompts that guide the model toward fairer output without fine-tuning. For example, in the occupation recommendation task, adding the phrase “in an inclusive way” to the prompt changes GPT-4’s pronoun choice from a gendered third-person pronoun to “they/their” (Bubeck et al. 2023).
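The prompt modification described above can be sketched as follows. This is an illustrative example only: the helper names and the pronoun lists are assumptions, no model is called, and the pronoun counter is merely a crude way to compare a model's responses to the original and the modified prompt.

```python
import re

# Phrase quoted in the text; everything else in this sketch is hypothetical.
INCLUSIVE_PHRASE = "in an inclusive way"

GENDERED_PRONOUNS = {"he", "him", "his", "she", "her", "hers"}
NEUTRAL_PRONOUNS = {"they", "them", "their", "theirs"}


def add_inclusive_instruction(prompt: str) -> str:
    """Append the inclusive phrase, keeping any final punctuation at the end."""
    base = prompt.rstrip()
    trailing = ""
    if base and base[-1] in ".?!":
        base, trailing = base[:-1], base[-1]
    return f"{base} {INCLUSIVE_PHRASE}{trailing}"


def pronoun_counts(text: str) -> dict:
    """Count gendered and neutral third-person pronouns in a response."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "gendered": sum(token in GENDERED_PRONOUNS for token in tokens),
        "neutral": sum(token in NEUTRAL_PRONOUNS for token in tokens),
    }
```

A comparison would send both the original and the modified prompt to the model under study and apply `pronoun_counts` to each response; a shift from gendered to neutral counts would indicate that the added phrase had the intended effect.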

Discussions

Although the fairness of medium-scale LLMs has been studied relatively widely and discussed in previous work, we find that these studies remain limited and deserve further exploration. Meanwhile, large-scale LLMs are still developing into more comprehensive and socially harmless systems, and their fairness is a focus of societal attention. In this section, we discuss the shortcomings, challenges, and future research directions of LLM fairness and offer our insights.

Intrinsic metrics probe the underlying LLMs, while extrinsic metrics evaluate the model on downstream tasks. In the pre-training and fine-tuning paradigm, the pre-trained model is the foundation, but fine-tuning may override the knowledge learned in pre-training. Some work verifies that intrinsic debiasing benefits the fairness of downstream tasks (Jin et al. 2021), whereas other work points out that intrinsic bias and extrinsic bias are not necessarily correlated (Goldfarb-Tarrant et al. 2021; Delobelle et al. 2022), not only in the original setting but even after correcting for metric bias, noise in the dataset, and confounding factors (Cao et al. 2022). Moreover, different metrics are not compatible with each other, making it difficult to guarantee the reliability of a benchmark (Qian et al. 2022). Therefore, we urge practitioners working on debiasing research not to rely only on certain metrics, especially intrinsic metrics, but to focus more on extrinsic metrics and consider fairness on downstream tasks. In addition, new challenge sets and annotated test data should be created to make these metrics more feasible.
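One simple way to check whether an intrinsic and an extrinsic metric agree is to compute a rank correlation between their scores across a set of models. The sketch below assumes Spearman's coefficient as the measure, which is a choice made here for illustration and not necessarily the one used in the cited studies; the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(intrinsic_scores, extrinsic_scores):
    """Spearman rank correlation between two per-model score lists."""
    if len(intrinsic_scores) != len(extrinsic_scores) or len(intrinsic_scores) < 2:
        raise ValueError("need two equally long lists with at least two models")
    rx, ry = average_ranks(intrinsic_scores), average_ranks(extrinsic_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5
```

A coefficient near zero across many models would be consistent with the finding that the two kinds of metrics are not necessarily correlated, which is why reporting extrinsic results alongside intrinsic ones matters.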

2 Accurately Evaluating Fairness of Large-scale LLMs

Expand methods for quantifying bias. For medium-scale LLMs, bias can be measured from both intrinsic and extrinsic perspectives, based on model embeddings and output predictions. By comparison, the evaluation of fairness in large-scale LLMs is relatively inadequate. In particular, for the many large-scale LLMs that are not open source, bias can only be quantified from the model’s responses. How to formalize the bias in model generation more accurately is fundamental to the evaluation. In addition, most methods rely on human judgment of the bias in model responses, which consumes substantial resources and may introduce the annotators’ own biases. Therefore, we propose applying statistical principles and automated measurement techniques from more perspectives to enrich the methods for quantifying bias in large-scale LLMs.
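As one possible instance of such a statistical approach, the sketch below estimates the gap in an automated score between responses about two demographic groups, together with a bootstrap percentile interval. It assumes that each response has already been assigned a numeric score by some automated scorer (for example a sentiment or toxicity classifier); the scorer, the function name, and the default parameters are hypothetical.

```python
import random
from statistics import mean


def bootstrap_gap_interval(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (observed_gap, lower, upper) for mean(scores_a) - mean(scores_b).

    The interval is a percentile bootstrap interval obtained by resampling
    each group's scores with replacement.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    gaps = []
    for _ in range(n_resamples):
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        gaps.append(mean(resample_a) - mean(resample_b))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(scores_a) - mean(scores_b), lower, upper
```

An interval that excludes zero suggests a systematic difference in how the model responds to the two groups under the chosen scorer. The conclusion is only as reliable as that scorer, which may carry biases of its own.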

Develop more diverse datasets. Evaluation presupposes comprehensive benchmark datasets and tasks. Some work uses existing datasets such as BOLD and Bias-in-Bios to evaluate the fairness of models. However, these datasets were not developed specifically for large-scale LLMs, and they have not been proven to accurately reflect model performance. Although benchmark datasets specific to large-scale LLMs have been developed, such as BBQ for question answering tasks and BiasAsker for dialogue tasks, the range of tasks and biases they cover is limited. We believe that it is necessary to develop diverse and comprehensive benchmark datasets specific to large-scale LLMs.

3 Further Explore the Reasons for Bias

As we conclude in Section 5.2, some studies analyze the reasons for bias in large-scale LLMs through experimental validation, focusing on comparing associations in pre-training corpora with real-world stereotypes from a statistical perspective. Other studies explore the reasons for bias in medium-scale LLMs from different perspectives: Watson, Beekhuizen, and Stevenson (2023) examine from a psychological perspective how BERT’s predicted preferences reflect social attitudes toward gender, Walter et al. (2021) analyze bias in historical corpora from a political perspective, and Baldini et al. (2022) explore how model size, random seed, training, and other external factors affect performance, fairness, and the relationship between them. Inspired by this research, we suggest that more exploratory work be carried out on large-scale LLMs to investigate the reasons for bias from broader perspectives and thereby develop fairer systems.

4 Efficiently Debiasing Large-scale LLMs

Improve current debiasing strategies. RLHF-based fine-tuning methods are difficult to apply broadly because of their high labor and resource costs. We expect low-cost methods to be applied to debiasing large-scale LLMs. Although debiasing based on prompt engineering has shown initial effectiveness, its exploration is still in its infancy. Future work can design more targeted and controllable prompt templates that generalize to more models, and combine prompt tuning with other techniques such as interpretability methods, to develop more efficient debiasing strategies. Furthermore, an early version of GPT-4 has been observed to be capable of self-reflection and explanation, combined with the ability to reason about people’s beliefs (Bubeck et al. 2023), which creates new opportunities for guiding the model’s behavior.

Consider fairness during development. As LLMs grow in size, social impact, and commercial use, mitigating bias through training strategies alone cannot fundamentally eliminate model bias. Another approach to debiasing is to consider fairness in data processing and model architecture during the model development phase. In particular, because training data is a major source of bias, we encourage developers to invest resources in data processing rather than ingesting everything on the web, thereby addressing social bias at its source.

Conclusions

We present a comprehensive survey of the fairness problem in LLMs. Social biases mainly come from training data that contains harmful information or is imbalanced, and can be divided into intrinsic bias and extrinsic bias. We summarize fairness research on LLMs, including intrinsic and extrinsic evaluation metrics and debiasing strategies for medium-scale LLMs, as well as fairness evaluation, reasons for bias, and debiasing methods for large-scale LLMs. Further, we discuss the challenges in the development of LLM fairness and the research directions that practitioners can work toward. This survey concludes that current fairness research on LLMs still needs to be strengthened in terms of bias evaluation, sources of bias, and debiasing strategies. Especially for large-scale LLMs, whose fairness research is still at an early stage, practitioners should combine more techniques to build comprehensive and safe language model systems.