Fairness Testing: A Comprehensive Survey and Analysis of Trends

Zhenpeng Chen, Jie M. Zhang, Max Hort, Mark Harman, Federica Sarro

Introduction

Machine Learning (ML)-enabled software, commonly referred to as ML software, has gained widespread adoption in critical areas of society, including hiring (Chan and Wang, 2018), credit assessment (Bono et al., 2021), and criminal justice (Berk et al., 2021). However, the use of such software has also led to instances of unfair decision-making, particularly when sensitive attributes such as sex, race, age, and occupation are involved (Brun and Meliou, 2018). For instance, an ML-enabled recidivism assessment system employed by US courts was found to incorrectly label black defendants as higher-risk individuals compared to white defendants (fai, 2016).

The unfair behaviors exhibited by ML software can have profound ethical implications, resulting in unacceptable outcomes, particularly by disadvantaging minority groups and protected categories. Consequently, there is a growing concern and heightened awareness within the research community regarding the issue of unfairness and its impact.

The exploration of unfairness issues gained significant momentum in the late 1960s (Hutchinson and Mitchell, 2019), when psychometricians began investigating the fairness of educational tests (Cleary, 1966, 1968). Subsequently, the study of fairness expanded to include the domains of ML (Pedreschi et al., 2008) and Software Engineering (SE) (Finkelstein et al., 2008), beginning around 2008, in response to the rapid progress of ML and software applications in supporting decision-making processes.

From the SE perspective, fairness is a non-functional software property that should be treated as a first-class entity throughout the entire SE process (Brun and Meliou, 2018; Aydemir and Dalpiaz, 2018). Ahmad et al. (Ahmad et al., 2021) emphasized the increasing importance of fairness as a requirement that should be considered during the requirements engineering phase of ML software development. Alidoosti (Alidoosti, 2021) argued for the inclusion of fairness considerations in the design process of software architecture. Albarghouthi et al. (Albarghouthi et al., 2017) framed fairness as correctness properties for ML program verification. Zhang et al. (Zhang et al., 2022b) described fairness as a significant testing property for ML software. Additionally, Zhang et al. (Zhang et al., 2022a) viewed fairness as the objective of repairing ML software.

In this paper, we also approach fairness from the SE perspective and define imperfections in software systems that result in a misalignment between desired fairness conditions and actual outcomes as fairness bugs. Our focus is on fairness testing of ML software, which aims to uncover fairness bugs through code execution. Fairness testing represents an important aspect of software fairness research and is closely intertwined with other activities in the SE process. It verifies whether software systems meet fairness requirements, exposes fairness bugs introduced during software implementation, and guides software repair efforts to address fairness issues, among other related SE tasks.

Compared to traditional software testing, fairness testing presents distinct challenges. For example, in traditional software testing, a test oracle typically relies on the output of a single input. In fairness testing, the oracle problem is more challenging because inputs and outputs from different demographic groups need to be considered simultaneously. Additionally, there are diverse fairness definitions, some of which may even conflict with each other or be mathematically impossible to satisfy concurrently (Verma and Rubin, 2018; Castelnovo et al., 2022). Since different definitions can require different test oracles, designing fairness testing techniques to accommodate this multitude of definitions poses a significant problem as well.

The significance of fairness testing and its associated challenges has led to a notable increase in research efforts in this field. Figure 1 illustrates the cumulative number of publications on fairness testing until 2023, revealing a growing interest and emphasizing the relevance of this survey. Notably, 89% of fairness testing publications have emerged since 2019, indicating the emergence of this new domain of software testing.

This paper offers a comprehensive survey of fairness testing in ML software. The collected papers are sourced from various venues including SE, artificial intelligence, computer security, and human-computer interaction. We categorize these papers based on two key aspects: the fairness testing workflow (i.e., how to test) and fairness testing components (i.e., where to identify fairness bugs). Furthermore, we conduct an analysis of research trends and identify potential research opportunities for the fairness testing community. Additionally, we provide an overview of publicly accessible datasets and open-source tools available for fairness testing.

Previous surveys have explored various aspects of fairness in ML and related fields. Mehrabi et al. (Mehrabi et al., 2021) and Pessach and Shmueli (Pessach and Shmueli, 2022) surveyed fairness research on ML algorithms, while Hort et al. (Hort et al., 2022) focused on bias mitigation methods for ML classifiers (the terms “bias” and “unfairness” are often used interchangeably in the literature, as they both denote deviations from “fairness” (Mehrabi et al., 2021)). Sun et al. (Sun et al., 2019), Berk et al. (Berk et al., 2021), and Pitoura et al. (Pitoura et al., 2021) surveyed techniques for improving fairness in specific ML tasks, such as natural language processing, criminal justice risk assessment, and ranking. Tushev et al. (Tushev et al., 2022) surveyed software design strategies for fairness in digital sharing economy applications. Hutchinson and Mitchell (Hutchinson and Mitchell, 2019) provided a historical perspective on fairness assessment tests across disciplines, including education and hiring. Zhang et al. (Zhang et al., 2022b) conducted a broader survey on ML testing, considering fairness as one of several testing properties. A recent systematic literature review (Soremekun et al., 2022a) focused on software fairness, covering only 20 fairness testing papers. In contrast, our survey specifically concentrates on fairness testing, which plays a vital role in assessing bias issues in ML software. To our knowledge, this is the first comprehensive survey specifically dedicated to the literature on fairness testing.

To summarize, this work makes the following contributions:

It provides the reader with a comprehensive survey of 100 fairness testing papers, encompassing diverse research communities.

It defines fairness bug and fairness testing, and provides an overview of the testing workflow and testing components related to fairness testing.

It compiles a summary of public datasets and open-source tools for fairness testing, offering a guide for researchers and practitioners interested in the field.

It analyses research trends and identifies promising research opportunities in fairness testing, aiming to foster further advancements in this area.

The structure of the paper is illustrated in Figure 2, and the detailed survey methodology is presented in Section 3.

Preliminaries

In this section, we begin by presenting the prevailing definitions of fairness that have gained widespread adoption. Subsequently, we offer a definition of fairness bug and fairness testing from the perspective of SE. We further outline the testing workflow and testing components of fairness testing.

The definition of fairness plays a crucial role in establishing the fairness conditions that software systems are expected to meet. Over the years, researchers and practitioners have proposed and explored various fairness definitions (Hellman, 2020; Mitchell et al., 2021; Castelnovo et al., 2022). This section aims to present the fairness definitions that have received widespread adoption in the literature (Zhang et al., 2022b), as listed in Table 1.

These definitions primarily fall into two categories: individual fairness and group fairness (Mehrabi et al., 2021). Individual fairness requires that software should produce similar predictive outcomes for similar individuals, while group fairness requires software to treat different demographic groups in a similar manner. Fairness assessment in the context of ML software often relies on sensitive attributes, also known as protected attributes, which represent characteristics that require protection against unfairness, such as sex, race, age, and physical ability. By considering a sensitive attribute, the population can be categorized into two groups: a privileged group and an unprivileged group.

To facilitate the formalization of fairness, we use $A$ to denote the sensitive attribute (1 for the privileged group, 0 for the unprivileged group), and $X$ to represent the remaining features. We use $Y$ to denote the actual outcome of data, and $\hat{Y}$ to represent the predicted outcome generated by ML software. A favorable outcome is indicated by 1, whereas an unfavorable outcome is denoted by 0.

We first introduce four widely-adopted definitions of individual fairness.

Fairness through unawareness (Grgic-Hlaca et al., 2016) assumes that a software system can achieve fair outcomes by refraining from using sensitive attributes in the decision-making process. By excluding these attributes, the system cannot rely on them and is thus expected to produce fair results. Consequently, individuals $i$ and $j$ with identical non-sensitive attributes should receive the same outcome: $\hat{Y}(X^{(i)})=\hat{Y}(X^{(j)})$ .

Fairness through awareness (Dwork et al., 2012) requires a software system to produce similar predictions for individuals who are deemed similar based on a predefined distance metric. This metric defines the notion of similarity between individuals for a specific task, which serves as the basis of “awareness.” Formally, if individuals $i$ and $j$ are similar according to the metric $d(\cdot,\cdot)$ , their predictions should be similar, i.e., $\hat{Y}(X^{(i)},A^{(i)})\approx\hat{Y}(X^{(j)},A^{(j)})$ .

Counterfactual fairness (Kusner et al., 2017) states that an individual’s prediction should remain the same in the real world as well as in a counterfactual world where the individual belongs to a different demographic group (i.e., the sensitive attribute is different). In practical applications, the input $X$ may contain features that have a causal relationship with the sensitive attribute $A$ . Therefore, when the attribute $A=a$ is changed to the counterfactual value $A=a^{\prime}$ , the input $X=x$ will also be transformed to $X=x^{\prime}$ , where features causally linked to the sensitive attribute will be altered accordingly. Formally, under any context $X=x$ and $A=a$ : $P(\hat{Y}(x,a)=y|X=x,A=a)=P(\hat{Y}(x^{\prime},a^{\prime})=y|X=x,A=a)$ , for all $y$ and for any value $a^{\prime}$ attainable by $A$ .

Causal fairness (Galhotra et al., 2017) requires a software system to produce the same outcome for every two individuals who differ only in sensitive attributes. Formally, for any input $X=x$ and $A=a$ : $\hat{Y}(x,a)=\hat{Y}(x,a^{\prime})$ .

These definitions of individual fairness are interconnected. Fairness through awareness is a more general concept compared to the other three definitions. When the similarity metric is defined based on the similarities of non-sensitive attributes, fairness through awareness can be seen as fairness through unawareness or causal fairness (Verma and Rubin, 2018). On the other hand, when the similarity metric is defined by considering the similarity of samples in counterfactual worlds, fairness through awareness aligns with the concept of counterfactual fairness (Kusner et al., 2017).

Additionally, both causal fairness and counterfactual fairness approach the concept of individual fairness from a causal standpoint, where fairness is evaluated by controlling for variables other than sensitive attributes, highlighting that unfairness is caused by variations in sensitive attributes. However, causal fairness simplifies this perspective by not considering the causal relationships among sensitive attributes and other features.
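The causal fairness condition $\hat{Y}(x,a)=\hat{Y}(x,a^{\prime})$ lends itself to a direct executable check. The following is a minimal sketch, assuming a model exposed as a Python callable over dictionary-encoded individuals; the function name `causal_fairness_violations`, the toy model, and the feature names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Individual = Dict[str, int]


def causal_fairness_violations(
    predict: Callable[[Individual], int],
    individuals: Iterable[Individual],
    sensitive: str,
    sensitive_values: Iterable[int],
) -> List[Tuple[Individual, Individual]]:
    """Return pairs (x, x') that differ only in `sensitive` yet get different outcomes."""
    values = list(sensitive_values)
    violations = []
    for person in individuals:
        original = predict(person)
        for value in values:
            if value == person[sensitive]:
                continue
            # Same non-sensitive features X, different sensitive attribute A.
            twin = dict(person, **{sensitive: value})
            if predict(twin) != original:
                violations.append((person, twin))
    return violations


# Hypothetical toy model (an assumption): it favours individuals with sex = 1.
def toy_model(p: Individual) -> int:
    return 1 if p["income"] + 2 * p["sex"] >= 5 else 0


people = [
    {"income": 3, "sex": 0},
    {"income": 4, "sex": 1},
    {"income": 6, "sex": 0},
]
pairs = causal_fairness_violations(toy_model, people, "sex", [0, 1])
```

Note that the sketch leaves all non-sensitive features untouched when the sensitive attribute changes. A check for counterfactual fairness would additionally need a causal model to transform $x$ into $x^{\prime}$, which is outside the scope of this example.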

1.2. Group Fairness

We introduce three widely-adopted definitions of group fairness. To better illustrate the distinctions between these definitions, we use a credit score software system as an example, with sex as the protected attribute of applicants.

Statistical parity (Barocas and Selbst, 2016), also known as demographic parity, requires the probability of a favorable outcome to be the same among different demographic groups. In other words, a software system satisfies statistical parity if $P[\hat{Y}=1|A=1]=P[\hat{Y}=1|A=0]$ . For instance, the credit score system should assign a good score to male and female applicants in equal proportions.

Equalized odds (Hardt et al., 2016) requires that the privileged and the unprivileged groups have equal true positive rates and equal false positive rates. In other words, the prediction is independent of the sensitive attribute when the target label $Y$ is fixed: $P[\hat{Y}=1|A=0,Y=y]=P[\hat{Y}=1|A=1,Y=y],y\in\{0,1\}$ . In our example, the probability of correctly assigning a good predicted credit score to an applicant with an actual good credit score, and the probability of incorrectly assigning a good predicted credit score to an applicant with an actual bad credit score, should be equal for male and female applicants.

Equal opportunity (Hardt et al., 2016) states that the privileged and the unprivileged groups have equal true positive rates. In other words, the prediction is independent of the sensitive attribute when the target label $Y$ is fixed as 1: $P[\hat{Y}=1|A=0,Y=1]=P[\hat{Y}=1|A=1,Y=1]$ . In our example, the probability of accurately assigning a good predicted credit score to an applicant with an actual good credit score should be the same for male and female applicants.
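The three group fairness definitions above can be estimated from labelled predictions by comparing conditional rates between the two groups. The sketch below is illustrative only: the helper names and the eight-row data set are assumptions, and each function returns the privileged-minus-unprivileged gap, which is zero when the corresponding definition holds exactly.

```python
from typing import List, Sequence, Tuple


def rate(preds: Sequence[int], mask: Sequence[bool]) -> float:
    """Estimate P[Y_hat = 1] over the rows selected by `mask`."""
    selected = [p for p, m in zip(preds, mask) if m]
    if not selected:
        raise ValueError("empty group: the conditional probability is undefined")
    return sum(selected) / len(selected)


def conditional_gap(y: Sequence[int], y_hat: Sequence[int], a: Sequence[int], label: int) -> float:
    """P[Y_hat=1 | A=1, Y=label] - P[Y_hat=1 | A=0, Y=label]."""
    priv = rate(y_hat, [g == 1 and t == label for g, t in zip(a, y)])
    unpriv = rate(y_hat, [g == 0 and t == label for g, t in zip(a, y)])
    return priv - unpriv


def statistical_parity_difference(y_hat: Sequence[int], a: Sequence[int]) -> float:
    """P[Y_hat=1 | A=1] - P[Y_hat=1 | A=0]."""
    return rate(y_hat, [g == 1 for g in a]) - rate(y_hat, [g == 0 for g in a])


def equal_opportunity_difference(y, y_hat, a) -> float:
    """Gap in true positive rates (Y fixed to 1)."""
    return conditional_gap(y, y_hat, a, label=1)


def equalized_odds_differences(y, y_hat, a) -> Tuple[float, float]:
    """Gaps in true positive rates (Y=1) and false positive rates (Y=0)."""
    return conditional_gap(y, y_hat, a, label=1), conditional_gap(y, y_hat, a, label=0)


# Hypothetical credit-score data: A=1 privileged, A=0 unprivileged.
a: List[int] = [1, 1, 1, 1, 0, 0, 0, 0]
y: List[int] = [1, 1, 0, 0, 1, 1, 0, 0]
y_hat: List[int] = [1, 1, 1, 0, 1, 0, 0, 0]

spd = statistical_parity_difference(y_hat, a)
eod = equal_opportunity_difference(y, y_hat, a)
tpr_gap, fpr_gap = equalized_odds_differences(y, y_hat, a)
```

On this toy data the privileged group receives favorable predictions at a rate of 0.75 against 0.25 for the unprivileged group, so all three definitions are violated. In practice a tolerance on the gap, rather than exact equality, would typically be chosen as the passing criterion.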

More about fairness definitions can be found in recent work on surveying, analyzing, and comparing them. Mitchell et al. (Mitchell et al., 2021) have compiled a comprehensive catalog of fairness definitions found in the literature, offering a valuable resource for understanding different perspectives. Verma and Rubin (Verma and Rubin, 2018) provide insights into the rationale behind existing fairness definitions and examine their relationships through a detailed case study. Castelnovo et al. (Castelnovo et al., 2022) delve into the differences, implications, and orthogonal aspects of various fairness definitions, shedding light on their distinct characteristics. Additionally, Mehrabi et al. (Mehrabi et al., 2021) have developed a taxonomy of fairness definitions proposed specifically for ML algorithms.

2. Definition of Fairness Bug and Fairness Testing

A software bug is an imperfection in a computer program that causes a discordance between existing and required conditions (539, 2010). Based on this definition, prior research (Zhang et al., 2022b) defines “ML bug” as any imperfection in an ML artefact that causes a discordance between the existing and required conditions, and “ML testing” as any activity designed to reveal ML bugs. Aligning with the terminology used in SE, we define fairness bug and fairness testing as follows.

$\bullet$ Definition (Fairness Bug). A fairness bug refers to any imperfection in a software system that causes a discordance between the existing and required fairness conditions.

The required fairness conditions depend on the fairness definition adopted by the developers of the software under test. The imperfections can manifest as unfair predictions that violate individual fairness or discrepancies in outcomes among different demographic groups that surpass a predetermined threshold for group fairness. Previous studies have also referred to such imperfections as fairness defects (Brun and Meliou, 2018) or fairness issues (Finkelstein et al., 2008). In this paper, we adopt the term “fairness bug” as a representative of these related concepts, as “bug” carries a broader meaning (539, 2010).

The presence of fairness bugs has spurred research efforts towards developing automated testing techniques to detect these bugs, i.e., fairness testing.

$\bullet$ Definition (Fairness Testing). Fairness testing refers to any activity designed to reveal fairness bugs through code execution.

This definition categorizes fairness testing as a subset of software testing, specifically excluding manual inspection and formal verification methods. Since we focus on fairness testing of ML software, we follow the recent ML testing survey (Zhang et al., 2022b) to consider two key aspects of this emerging testing domain: testing workflow and testing components.

3. Fairness Testing Workflow

The fairness testing workflow refers to how to conduct fairness testing with different testing activities. This section delves into the fairness testing workflow and outlines its key activities.

Figure 3 presents an overview of the fairness testing workflow for ML software, which builds upon the established ML testing workflow in (Zhang et al., 2022b). In this process, software engineers establish and specify the desired fairness conditions for the software being tested. Test inputs are obtained either through sampling from collected data or automatic generation. Test oracles, which are based on the defined fairness requirements, are identified and created. The test inputs are then executed on the software under test to determine if any violations of the fairness requirements occur. Engineers assess the adequacy of the tests, evaluating their effectiveness in uncovering fairness bugs. Simultaneously, the results of the test execution provide a bug report that assists engineers in reproducing, locating, and resolving any identified fairness bugs. The fairness testing process can be iterated to ensure that the repaired software aligns with the desired fairness requirements. Upon satisfying the fairness requirements, the software can be deployed for use. Following deployment, engineers can employ runtime monitoring to continually test the software using new real-world inputs, ensuring ongoing adherence to the fairness requirements.
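The workflow described above can be sketched as a small test harness. This is an illustrative outline rather than the workflow of any particular tool: the fairness requirement (flipping `sex` must not change the output), the `BugReport` structure, and both toy models are assumptions introduced for the example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Individual = Dict[str, int]
Model = Callable[[Individual], int]


@dataclass
class BugReport:
    """Outcome of one testing round: the inputs that violated the requirement."""
    executed: int = 0
    failing_inputs: List[Individual] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failing_inputs


def sample_inputs(n: int, rng: random.Random) -> List[Individual]:
    """Test input generation (here: plain random sampling)."""
    return [{"age": rng.randint(18, 70), "sex": rng.randint(0, 1)} for _ in range(n)]


def oracle(model: Model, x: Individual) -> bool:
    """Assumed fairness requirement: flipping `sex` must not change the output."""
    flipped = dict(x, sex=1 - x["sex"])
    return model(x) == model(flipped)


def run_fairness_tests(model: Model, inputs: List[Individual]) -> BugReport:
    """Test execution: apply the oracle to every input and collect violations."""
    report = BugReport()
    for x in inputs:
        report.executed += 1
        if not oracle(model, x):
            report.failing_inputs.append(x)
    return report


def biased_model(x: Individual) -> int:      # hypothetical software under test
    return int(x["age"] > 30 and x["sex"] == 1)


def repaired_model(x: Individual) -> int:    # hypothetical repaired version
    return int(x["age"] > 30)


inputs = sample_inputs(50, random.Random(0))
report_before = run_fairness_tests(biased_model, inputs)
report_after = run_fairness_tests(repaired_model, inputs)  # re-test after repair
```

Re-running the same inputs against the repaired model mirrors the iteration step of the workflow: the failing inputs in the first report serve as regression tests for the repair.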

Section 4 organizes fairness testing research papers based on the testing workflow perspective. The existing fairness testing literature primarily addresses test input generation (Section 4.1) and test oracle identification (Section 4.2), while other activities remain open challenges and present research opportunities for the community (discussed in Section 8).

4. Fairness Testing Components

Software testing can be conducted for different testable parts within a software system (Laghari and Demeyer, 2018). In the context of fairness testing, this section focuses on introducing the testable components within ML software.

We refer to the entire ML-enabled software that an end user would use as ML software. As shown in Figure 4, ML software consists of ML components and non-ML components, which are interconnected and work together to achieve the software’s objective (Lu et al., 2022; Nahar et al., 2022). In ML components, ML models are trained using large-scale data with ML programs written based on ML frameworks (e.g., scikit-learn (Pedregosa et al., 2011), TensorFlow (Abadi et al., 2016), and Keras (Gulli and Pal, 2017)) (Zhang et al., 2022b; Chen et al., 2020a). An ML program specifies the structure of a desirable ML model and the process by which the model is obtained using the training data (Chen et al., 2020a). To turn ML components into a real-world software system, many other non-ML parts may be needed, such as data storage, a user interface, and monitoring infrastructure. Since ML components (including data, ML programs, ML frameworks, and ML models) and non-ML components together achieve the goals of ML software, developers can test each of the components when conducting fairness testing:

Data Testing. ML follows the data-driven programming paradigm. Training data determines the decision logic of ML models to a large extent (Zhang et al., 2022b). Therefore, training data is considered as an important component to test in the ML testing literature (Zhang et al., 2022b; Breck et al., 2019). Since bias in training data is demonstrated to be a main root cause of ML software unfairness (Chakraborty et al., 2021), training data is also an important component for fairness testing. Data testing in the fairness literature aims to detect different types of training data bias, including checking whether the labels of training data are biased (label bias) (Chakraborty et al., 2020a), whether the distribution of training data implies an unexpected correlation between sensitive attributes and outcomes (selection bias) (Wick et al., 2019), and whether the features of training data contain bias (feature bias) (Li et al., 2022).
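A simple starting point for data testing is to compare how often each demographic group receives the favorable label in the training set, and how well each group is represented. The sketch below assumes a training set encoded as a list of dictionaries; the column names `sex` and `label` and the ten-row data set are hypothetical. A large gap is a signal worth investigating, not proof of label or selection bias on its own.

```python
from collections import defaultdict
from typing import Dict, List, Mapping


def favorable_label_rates(rows: List[Mapping[str, int]], sensitive: str, label: str) -> Dict[int, float]:
    """Fraction of favorable labels (label == 1) within each sensitive group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [favorable, total]
    for row in rows:
        counts[row[sensitive]][0] += row[label]
        counts[row[sensitive]][1] += 1
    return {group: fav / total for group, (fav, total) in counts.items()}


def group_shares(rows: List[Mapping[str, int]], sensitive: str) -> Dict[int, float]:
    """Share of the training set that belongs to each sensitive group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[sensitive]] += 1
    return {group: n / len(rows) for group, n in totals.items()}


# Hypothetical training data: 6 privileged rows (4 favorable), 4 unprivileged rows (1 favorable).
training = (
    [{"sex": 1, "label": 1}] * 4
    + [{"sex": 1, "label": 0}] * 2
    + [{"sex": 0, "label": 1}] * 1
    + [{"sex": 0, "label": 0}] * 3
)

rates = favorable_label_rates(training, "sex", "label")
shares = group_shares(training, "sex")
label_rate_gap = rates[1] - rates[0]
```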

ML Program Testing. An ML program encodes the process by which an ML model is obtained based on the training data. It has been an important fairness testing component of ML software (Zhang et al., 2022b). A fairness bug may arise from improper data processing (Biswas and Rajan, 2021), training algorithm selection (Hort et al., 2021), or hyper-parameter settings (i.e., configuration options that control the learning process) (Tizpaz-Niari et al., 2022) in ML programs.

Framework Testing. ML frameworks (also called ML libraries) implement ML algorithms internally and provide high-level Application Programming Interfaces (APIs) for developers to build ML models without knowledge of the inner working of these algorithms. The ML testing literature (Pham et al., 2019; Wang et al., 2020b; Nejadgholi and Yang, 2019; Wang et al., 2022a; Moussa and Sarro, 2022) has considered ML frameworks as an important testing component and detected bugs inside ML frameworks that lead to accuracy problems in the final ML model. Nevertheless, existing framework testing studies are primarily related to ML performance (e.g., accuracy). To the best of our knowledge, to date, there has been no framework testing work that detects fairness bugs.

Model Testing. Most fairness testing techniques are model-centric (Galhotra et al., 2017; Fan et al., 2022; Udeshi et al., 2018; Aggarwal et al., 2019; Xie and Wu, 2020; Zhang et al., 2020; Zhang et al., 2021c; Zheng et al., 2022). They consider ML models as the testing component and reveal fairness bugs based on the input-output behaviors of ML models. Fairness testing of ML models can be performed in a white-box, black-box, or gray-box manner (Tizpaz-Niari et al., 2022). Black-box testing tests ML software without any knowledge of its internal workings (e.g., code and data); white-box testing takes the internal workings of the software into consideration; gray-box testing relies on limited knowledge of the internal workings of the software under test (Khan and Khan, 2012).

Non-ML Component Testing. From the SE perspective, ML software extends beyond ML models and also includes non-ML components (Lu et al., 2021). These non-ML components may also affect the fairness of ML software. However, to the best of our knowledge, there has been no non-ML component testing work in the fairness testing literature.

Section 5 organizes the related papers from the testing component perspective. The existing literature primarily focuses on data testing (Section 5.1), ML program testing (Section 5.2), and model testing (Section 5.3).

Survey Methodology

This section introduces the scope and the paper collection process of our survey.

We aim to define, collect, and curate the disparate literature, arguing and demonstrating that there does, indeed, exist a coherent area of research in the field that can be termed “fairness testing”.

We apply the following inclusion criteria when collecting papers. The papers that satisfy any of these criteria are included in this survey.

The paper introduces the general idea or one of the related aspects of fairness testing of ML software.

The paper presents an approach, study, framework, or tool that targets fairness testing of ML software.

We do not include papers about the issues of fairness in network systems and hardware systems. Moreover, we filter out papers that are about fairness definitions, but do not consider them in the context of testing. We also do not include papers about gender diversity/inclusion and cognitive bias in software development, because our survey focuses on SE product fairness, not SE process fairness.

2. Paper Collection

We first collected papers by keyword searching on the DBLP publication database (dbl, 2022), which covers arXiv (a widely-used open-access archive), more than 1,800 journals, and 5,800 academic conferences and workshops in computer science (dbl, 2023). DBLP has been extensively used in previous surveys in SE (Zhang et al., 2022b; Chen et al., 2020b; Zhang et al., 2018; Mathew et al., 2018; Garousi and Fernandes, 2016), and a recent ML testing survey (Zhang et al., 2022b) demonstrates that papers collected from other popular publication databases are a subset of those collected from DBLP.

We developed the search keywords through an iterative trial-and-error procedure (Lin et al., 2022a) conducted by the first two authors, with input and discussion from the other authors. Initially, we started with a general search string “fairness testing” to gather initial relevant papers. Subsequently, we carefully examined the titles, abstracts, and keywords of these papers to identify additional keywords and phrases. Through brainstorming sessions, we expanded and refined the list of search strings by incorporating related terms, synonyms, and variations. The iterative process allowed us to continuously improve the search keyword list based on the outcomes of the searches, ensuring that it accurately captured the relevant literature on fairness testing.

The final keywords used for searching included (“fair” OR “bias” OR “discrimination”) AND (“software” OR “learning” OR “bug” OR “defect” OR “fault” OR “algorithm” OR “test” OR “detect” OR “evaluat” OR “discover” OR “identify” OR “find” OR “uncover” OR “reveal” OR “recogniz” OR “unveil”). As a result, we conducted a total of $3\times 16=48$ searches on DBLP on March 12, 2023, and obtained 7,674 hits. Then, the first two authors manually inspected each hit paper to check whether it is in the scope of our survey, and selected 67 relevant papers.
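The count of 48 searches follows from pairing each of the 3 fairness-related terms with each of the 16 context terms. The enumeration can be sketched as below; joining the two terms with a space is an assumption about the query format, which the text does not specify.

```python
from itertools import product

fairness_terms = ["fair", "bias", "discrimination"]
context_terms = [
    "software", "learning", "bug", "defect", "fault", "algorithm",
    "test", "detect", "evaluat", "discover", "identify", "find",
    "uncover", "reveal", "recogniz", "unveil",
]

# One search per (fairness term, context term) pair: 3 x 16 = 48 queries.
# The "term term" query format is an assumption made for this illustration.
queries = [f"{f} {c}" for f, c in product(fairness_terms, context_terms)]
```

Truncated stems such as “evaluat” and “recogniz” match several word forms (e.g., “evaluate”, “evaluation”, “recognize”, “recognizing”).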

The fairness testing community is diverse, with publications appearing in various venues and adopting different terminologies. To capture papers that might have been overlooked by our keywords and ensure a comprehensive coverage of the field, we further employed a snowballing approach. This process, conducted in April and May 2023, aimed to identify transitively dependent papers and expand our paper collection. Both backward and forward snowballing approaches (Jalali and Wohlin, 2012) were employed. In backward snowballing, we examined the references in each collected paper and identified those lying in our scope; in forward snowballing, we used Google Scholar to identify papers of our interest from those that cited the collected papers. We iteratively repeated the snowballing process until we reached a fixed point, where no new relevant papers were identified. Through this process, we were able to retrieve an additional 25 papers, contributing to a more comprehensive coverage of the fairness testing literature.
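The snowballing procedure described above amounts to a fixed-point iteration over the citation graph. The following sketch illustrates the idea under simplifying assumptions: the citation graph is given as two in-memory dictionaries, and the manual scope check is replaced by a hypothetical `in_scope` predicate. Paper identifiers are placeholders.

```python
from typing import Callable, Dict, Iterable, List, Set


def snowball(
    seeds: Iterable[str],
    references: Dict[str, List[str]],  # paper -> papers it cites (backward)
    cited_by: Dict[str, List[str]],    # paper -> papers citing it (forward)
    in_scope: Callable[[str], bool],   # stands in for the manual inspection
) -> Set[str]:
    """Repeat backward and forward snowballing until no new in-scope paper appears."""
    collected = set(seeds)
    frontier = set(seeds)
    while frontier:  # fixed point reached when the frontier is empty
        candidates: Set[str] = set()
        for paper in frontier:
            candidates |= set(references.get(paper, ()))
            candidates |= set(cited_by.get(paper, ()))
        frontier = {p for p in candidates if p not in collected and in_scope(p)}
        collected |= frontier
    return collected


# Hypothetical citation graph: X is out of scope, so F (reachable only via X) is never visited.
references = {"A": ["B", "X"], "B": ["C"]}
cited_by = {"A": ["D"], "C": ["E"], "X": ["F"]}
result = snowball({"A"}, references, cited_by, in_scope=lambda p: p != "X")
```

Only in-scope papers are expanded further, which keeps the search focused but means that relevant papers reachable solely through out-of-scope ones can be missed.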

To ensure that our survey is comprehensive and accurate, we also contacted the authors of the papers we collected via keyword searching and snowballing. We provided them with our paper and asked them to check whether our description of their work was correct. This interaction allowed us to refine our understanding of their contributions and make necessary revisions to our descriptions. Furthermore, during these communications, authors directed our attention to 15 additional papers that we had not initially included in our collection. Among the suggested papers, 8 met our inclusion criteria and were deemed relevant to our survey. These 8 papers were subsequently added to our repository, further enhancing the coverage of the fairness testing literature in our survey.

Table 2 shows the statistics of the paper collection process. In summary, we consider 67 + 25 + 8 = 100 papers for this survey.

3. Paper Analysis

To ensure a rigorous analysis of the collected papers and to enhance the quality and accuracy of the survey, we employed a well-established method in SE literature reviews called thematic synthesis (Huang et al., 2018). This method allowed us to systematically organize and analyze the papers in a structured manner.

The analysis process was led by the first two authors, who extensively read and examined the full text of each paper. Their objective was to develop a comprehensive understanding of the testing workflow and test components described in the papers. Through meticulous manual analysis, they extracted relevant information, identified common patterns, and discerned major themes that emerged across the collection of papers.

Using these identified themes as a framework, the authors categorized and organized the related papers under different thematic headings. This approach facilitated a coherent and structured representation of the survey’s content, enabling readers to easily navigate and comprehend the key insights and findings from the analyzed papers.

During the analysis, in instances where disagreements arose, the first two authors held discussion meetings involving other co-authors. These meetings served as a platform to address conflicts and reach a consensus on the extracted data and the placement of papers within the identified themes. The co-authors, all of whom have published fairness-related papers in top-tier venues, contributed their expertise and acted as arbitrators, ensuring the resolution of any disagreements and maintaining the integrity of the analysis.

Finally, to ensure the integrity and reliability of the survey’s findings, all authors independently double-checked the content. This review process aimed to identify any potential errors, inconsistencies, or omissions.

Additionally, as mentioned before, to further enhance the quality of the survey, we shared the draft with the authors of the collected papers. This collaborative approach allowed us to gather valuable input, feedback, and validation from experts in the field. By incorporating their insights, we ensured that the survey accurately represented the original papers’ findings and perspectives, strengthening the overall credibility and robustness of our analysis.

Fairness Testing Workflow

We first introduce existing techniques that support the key activities involved in fairness testing, i.e., test input generation and test oracle identification.

In the area of fairness testing, test input generation aims to automatically produce instances that can induce discrimination and reveal fairness bugs of software systems. We organize relevant research based on the techniques adopted, including random test input generation, search-based test input generation, verification-based test input generation, and domain-specific test input generation. Table 3 summarizes these techniques and indicates the fairness definitions that they adopt.

Random testing evaluates a software system with inputs randomly chosen from its input space, and it can be used to infer the quantitative estimate of a system’s operational reliability (Duran and Ntafos, 1984).

Galhotra et al. (Galhotra et al., 2017) and Angell et al. (Angell et al., 2018) introduced Themis, a random test input generation approach for fairness. Themis randomly assigns values to non-sensitive attributes and varies the values of sensitive attributes. By observing the behavior of the system under test with these inputs, Themis quantifies the occurrence of discriminatory instances in the input space. Discrimination is measured using two frequency values. The first value represents the proportion of generated inputs where altering the sensitive attributes leads to a change in the output (causal fairness). The second value captures the disparity in receiving favorable outcomes among different demographic groups within the generated inputs (statistical parity).
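The two frequency values described above can be approximated with a few lines of code. This sketch is not the Themis implementation: the function names, the input schema, and the choice to evaluate every group on the same randomly generated non-sensitive values are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Sequence

Individual = Dict[str, int]
Model = Callable[[Individual], int]


def sample_inputs(n: int, rng: random.Random) -> List[Individual]:
    """Randomly assign values to all attributes (hypothetical schema)."""
    return [{"income": rng.randint(0, 100), "sex": rng.randint(0, 1)} for _ in range(n)]


def causal_score(model: Model, inputs: List[Individual], sensitive: str, values: Sequence[int]) -> float:
    """Fraction of inputs whose output changes when only the sensitive attribute varies."""
    changed = 0
    for x in inputs:
        outputs = {model(dict(x, **{sensitive: v})) for v in values}
        changed += len(outputs) > 1
    return changed / len(inputs)


def group_score(model: Model, inputs: List[Individual], sensitive: str, values: Sequence[int]) -> float:
    """Largest gap in favorable-outcome rates between sensitive groups."""
    rates = []
    for v in values:
        favorable = sum(model(dict(x, **{sensitive: v})) for x in inputs)
        rates.append(favorable / len(inputs))
    return max(rates) - min(rates)


def fair_model(x: Individual) -> int:    # ignores the sensitive attribute
    return int(x["income"] > 50)


def biased_model(x: Individual) -> int:  # decides on the sensitive attribute alone
    return x["sex"]


inputs = sample_inputs(200, random.Random(42))
fair_scores = (causal_score(fair_model, inputs, "sex", (0, 1)),
               group_score(fair_model, inputs, "sex", (0, 1)))
biased_scores = (causal_score(biased_model, inputs, "sex", (0, 1)),
                 group_score(biased_model, inputs, "sex", (0, 1)))
```

The two toy models mark the extremes: a model that ignores the sensitive attribute scores 0 on both measures, while one that decides on it alone scores 1 on both.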

1.2. Search-based Test Input Generation

Despite the effectiveness of Themis, random generation can lead to a low success rate in generating discriminatory inputs (Fan et al., 2022), so subsequent fairness testing work generates test inputs using search-based techniques. Search-based test generation uses meta-heuristic search techniques to guide the generation process and make this process more efficient and effective (Harman and Jones, 2001; Lakhotia et al., 2007; Harman et al., 2015). It has been employed in an increasing number of fairness testing techniques to explore the input space of the software under test.

Two-phase search-based techniques. In the realm of fairness testing, most search-based test input generation techniques employ a two-phase approach for generating individual discriminatory instances. These techniques operate based on the causal fairness definition, which implies that altering sensitive attributes should not impact the outcomes.

The two phases involved are the global search and the local search:

During the global search phase, the technique conducts an exploration of the input space to identify an initial set of individual discriminatory instances. These instances consist of pairs where altering the sensitive attributes results in divergent outcomes.

During the local search phase, the technique focuses on searching for additional individual discriminatory instances in the neighborhood of those found during the global search. This phase is based on the hypothesis that if a discriminatory input exists in the input space, there exist more discriminatory inputs located closer to it (Udeshi et al., 2018). The hypothesis draws inspiration from the robustness property of ML models, where similar inputs should yield similar outputs. Hence, the discriminatory inputs and their neighbourhood are likely to be similarly discriminatory, especially for robust models (Udeshi et al., 2018).

In the following, we introduce how existing search-based fairness testing techniques materialize the two phases.

Udeshi et al. (Udeshi et al., 2018) introduced Aequitas, the first fairness testing approach based on a two-phase search framework. In the global search phase, Aequitas randomly explores the input space to uncover discriminatory instances. In the local search phase, Aequitas perturbs the non-sensitive attributes of the discriminatory instances found in the global phase to explore their neighboring inputs and identify more discriminatory instances. Furthermore, Aequitas employs the generated cases to estimate the proportion of inputs that violate causal fairness, offering statistical evidence for fairness bugs. Sano et al. (Sano et al., 2022) proposed KOSEI, which modifies the local search of Aequitas (Udeshi et al., 2018). KOSEI individually applies perturbations to all non-protected attributes, in contrast to Aequitas, which probabilistically selects one attribute at a time. Additionally, KOSEI enables users to set an iteration limit for the perturbation process.
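A minimal sketch of the two-phase idea follows. It is a simplified illustration rather than the Aequitas or KOSEI algorithm: the integer-encoded domains, the biased `toy_model`, and the uniform choice of which attribute to perturb are all assumptions.

```python
import random

# Hypothetical integer-encoded input space; index 0 is the sensitive attribute.
DOMAINS = [(0, 1), (0, 20), (0, 20)]
SENSITIVE_IDX = 0


def toy_model(x):
    """Biased stand-in classifier: the sensitive attribute shifts the boundary."""
    return 1 if x[1] + x[2] + 3 * x[0] >= 20 else 0


def is_discriminatory(model, x):
    """True if altering only the sensitive attribute changes the output."""
    lo, hi = DOMAINS[SENSITIVE_IDX]
    outputs = set()
    for v in range(lo, hi + 1):
        variant = list(x)
        variant[SENSITIVE_IDX] = v
        outputs.add(model(tuple(variant)))
    return len(outputs) > 1


def global_search(model, n_trials, rng):
    """Phase 1: sample the input space at random and keep discriminatory inputs."""
    found = set()
    for _ in range(n_trials):
        x = tuple(rng.randint(lo, hi) for lo, hi in DOMAINS)
        if is_discriminatory(model, x):
            found.add(x)
    return found


def local_search(model, seeds, n_steps, rng):
    """Phase 2: perturb one non-sensitive attribute at a time around each seed."""
    non_sensitive = [i for i in range(len(DOMAINS)) if i != SENSITIVE_IDX]
    found = set()
    for seed in seeds:
        x = list(seed)
        for _ in range(n_steps):
            i = rng.choice(non_sensitive)
            lo, hi = DOMAINS[i]
            x[i] = min(hi, max(lo, x[i] + rng.choice([-1, 1])))  # stay in range
            if is_discriminatory(model, tuple(x)):
                found.add(tuple(x))
    return found
```

For example, `seeds = global_search(toy_model, 200, random.Random(0))` followed by `local_search(toy_model, seeds, 20, random.Random(1))` first collects seed instances and then explores their neighborhoods, mirroring the hypothesis that discriminatory inputs cluster together.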

Aggarwal et al. (Aggarwal et al., 2019) proposed SG, which combines symbolic generation and local explainability for search-based discriminatory instance generation. In the global search phase, SG utilizes a local model explainer to approximate the decision-making process of ML software by constructing a decision tree. Symbolic execution is then employed to cover various paths in the decision tree, aiming to search for discriminatory inputs. In the local search phase, SG perturbs the non-sensitive attributes of these discovered inputs to explore their neighborhood within the input space, thereby generating additional discriminatory inputs.

Fan et al. (Fan et al., 2022) introduced ExpGA, an explanation-guided approach for generating discriminatory instances. Initially, ExpGA employs interpretable methods to identify seed instances that are more likely to produce discriminatory instances when their feature values are slightly modified compared to other instances. Subsequently, using these seed instances as inputs, ExpGA leverages a genetic algorithm for local search, enabling efficient generation of a substantial number of discriminatory instances.

Ma et al. (Ma et al., 2022) proposed I&D to enhance the initial seed selection for the global search of the two-phase search-based generation approach. It achieves this by generating a chiral model, which alters the protected attributes of the training data. This chiral model helps identify initial discriminatory instances by detecting differences in predictions compared to the original model. I&D employs the SHAP value (Lundberg and Lee, 2017), a game-theory-based approach, to explain the variation in prediction behavior between the chiral and original models for each initial discriminatory instance. Moreover, it clusters the initial discriminatory instances based on their SHAP values and selects diverse instances from each cluster in a round-robin fashion for future use in the global search.

In addition to these general techniques, there have been several two-phase search-based generation techniques specifically proposed for Deep Neural Networks (DNNs), including ADF (Zhang et al., 2020, 2021b), EIDIG (Zhang et al., 2021c), NeuronFair (Zheng et al., 2022), and DICE (Monjezi et al., 2023).

Zhang et al. (Zhang et al., 2020, 2021b) proposed ADF, a gradient-guided generation approach for DNNs. In the global search phase, ADF locates the discriminatory instances near the decision boundary by iteratively perturbing a seed input towards the decision boundary with the guidance of gradient. In the local search phase, ADF further uses gradients as the guidance to search the neighborhood of the found individual discriminatory instances to discover more discriminatory instances.

Zhang et al. introduced EIDIG (Zhang et al., 2021c), which inherits and improves ADF by integrating a momentum term into the iterative search for identifying discriminatory instances. The momentum term enables the memorization of the previous trend and helps to escape from local optima, which ensures a higher success rate of finding discriminatory instances. In addition, EIDIG reduces the frequency of gradient calculation by exploiting the prior knowledge of gradients to accelerate the local search phase.
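The gradient-guided global search with an optional momentum term can be sketched on a toy model. This is a heavily simplified illustration, not the ADF or EIDIG algorithm: the logistic model with hand-picked weights `W` and bias `B`, the unit step size, and the sign-based update are assumptions chosen so that the example runs without any ML framework.

```python
import math

# Hypothetical logistic model; index 0 is a binary sensitive attribute.
W = [1.5, 0.4, -0.3]
B = -2.0
SENSITIVE_IDX = 0


def predict_proba(x):
    logit = sum(w * v for w, v in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-logit))


def gradient(x):
    """Analytic gradient of the predicted probability with respect to the input."""
    p = predict_proba(x)
    return [p * (1.0 - p) * w for w in W]


def flip_sensitive(x):
    y = list(x)
    y[SENSITIVE_IDX] = 1 - y[SENSITIVE_IDX]
    return y


def is_discriminatory(x):
    """True if flipping only the sensitive attribute changes the predicted label."""
    return (predict_proba(x) >= 0.5) != (predict_proba(flip_sensitive(x)) >= 0.5)


def sign(v):
    return (v > 0) - (v < 0)


def global_search(seed, max_iter=50, step=1.0, momentum=0.0):
    """Push a seed towards the decision boundary using (momentum-smoothed)
    gradients, changing only the non-sensitive attributes."""
    x = list(seed)
    velocity = [0.0] * len(x)
    for _ in range(max_iter):
        if is_discriminatory(x):
            return x
        grad = gradient(x)
        # momentum = 0 gives a plain gradient step; momentum > 0 keeps past trends.
        velocity = [momentum * v + g for v, g in zip(velocity, grad)]
        # Lower the probability if currently positive, raise it otherwise.
        direction = -1 if predict_proba(x) >= 0.5 else 1
        for i in range(len(x)):
            if i != SENSITIVE_IDX:
                x[i] += direction * step * sign(velocity[i])
    return x if is_discriminatory(x) else None
```

Starting from a confidently classified seed such as `[0, 20.0, 5.0]`, the search walks towards the boundary until flipping the sensitive attribute changes the label. On this linear toy model momentum does not change the outcome; it is included only to show where such a term enters the update.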

Zheng et al. (Zheng et al., 2022) proposed NeuronFair, which uses the identified biased neurons to guide the generation of discriminatory instances for DNNs. In the global search phase, NeuronFair identifies the biased neurons that cause discrimination via neuron analysis and searches for discriminatory instances with the optimization objective of increasing the ActDiff (activation difference) values of the biased neurons. In the local search phase, NeuronFair uses the generated discriminatory instances as seeds and perturbs their non-sensitive attributes to generate more discriminatory instances.

Monjezi et al. (Monjezi et al., 2023) introduced DICE, an information-theoretic approach for fairness testing of DNNs. In the global search phase, DICE uses gradient-guided clustering to explore the input space and identify instances with the maximum quantitative individual discrimination. This discrimination is quantified as the dependence on sensitive attributes using information-theoretic principles. In the local search phase, promising instances from the global phase are used to generate additional discriminatory instances by exploring their neighborhood.

Other search-based techniques. Besides the two-phase approach, there are also other search-based generation techniques that have been proposed in the fairness testing literature.

Xie and Wu (Xie and Wu, 2020) used reinforcement learning to develop a black-box search strategy for generating instances that violate causal fairness. Their approach frames the task of generating discriminatory instances as a reinforcement learning problem, with the ML model under test treated as part of the environment. The reinforcement learning agent interacts with the environment by taking actions to search for discriminatory inputs, while receiving feedback in the form of rewards and observing the resulting state. Through iterative interactions, the agent learns an optimal policy for efficiently generating discriminatory inputs.

Perera et al. (Perera et al., 2022) presented SBFT, a search-based fairness testing approach for regression-based ML models. SBFT measures the degree of unfairness as the maximum difference in regression results between pairs of instances that differ only in their sensitive attributes (i.e., causal fairness). SBFT begins with an initial set of randomly selected test inputs from the input space. Genetic algorithms are then applied to these inputs, aiming to generate the next generation of test inputs that exhibit the desired fairness degree.
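The following sketch shows how a genetic algorithm can search for inputs with a large fairness degree. It is an illustration under stated assumptions, not SBFT itself: the `toy_regressor`, the attribute bounds, truncation selection, uniform crossover, and Gaussian mutation are all hypothetical choices.

```python
import random

SENSITIVE_VALUES = [0, 1]
BOUNDS = [(0.0, 10.0), (0.0, 10.0)]   # hypothetical non-sensitive attributes


def toy_regressor(sensitive, features):
    """Stand-in regression model whose bias grows with the first feature."""
    a, b = features
    return 2.0 * a + b + sensitive * (0.5 * a)


def fairness_degree(model, features):
    """Largest output difference between inputs that differ only in the
    sensitive attribute."""
    outputs = [model(s, features) for s in SENSITIVE_VALUES]
    return max(outputs) - min(outputs)


def evolve(model, rng, pop_size=20, generations=30, mutation_scale=0.5):
    """Evolve non-sensitive feature vectors towards a larger fairness degree."""
    population = [tuple(rng.uniform(lo, hi) for lo, hi in BOUNDS)
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda f: fairness_degree(model, f), reverse=True)
        parents = ranked[: pop_size // 2]        # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = []
            for (lo, hi), g1, g2 in zip(BOUNDS, p1, p2):
                gene = rng.choice([g1, g2]) + rng.gauss(0.0, mutation_scale)
                child.append(min(hi, max(lo, gene)))  # clamp to valid range
            children.append(tuple(child))
        population = parents + children
    return max(population, key=lambda f: fairness_degree(model, f))
```

Running `evolve(toy_regressor, random.Random(1))` drives the first feature towards its upper bound, where the gap between the two sensitive groups is largest for this toy model.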

Tao et al. (Tao et al., 2022) introduced RULER, a fairness testing technique that departs from the strict definition of causal fairness. Unlike existing approaches that require coupling samples differing only in sensitive attributes, RULER relaxes this constraint. It allows simultaneous perturbations on both sensitive and non-sensitive attributes to search for more discriminatory instances, because there may not exist discriminatory instances that strictly satisfy the causal fairness. RULER imposes perturbation constraints: sensitive attributes must remain within their valid value ranges, while non-sensitive attributes are bounded by a small range.

Xiao et al. (Xiao et al., 2023) proposed LIMI, an approach for generating natural discriminatory instances based on causal fairness. LIMI uses a generative adversarial network (GAN) to imitate the decision boundary of the model under test in the latent space. By approximating the decision boundary with a surrogate linear boundary, LIMI can search for instances that closely align with the original data distribution. LIMI performs vector manipulations on latent vectors to move them towards the surrogate boundary. Vector calculations then identify two potential discriminatory candidates in close proximity to the real decision boundary.

Patel et al. (Patel et al., 2022) applied combinatorial t-way testing (Kuhn et al., 2013) to fairness testing. Combinatorial t-way testing is a coverage-based data sampling method that can generate diverse datasets by applying logical constraints to specify the sampling space. The approach creates an input parameter model from the training data set and uses the model to generate a t-way test set. For each test, it mutates protected attributes to search for discriminatory instances.
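For t = 2, the idea can be sketched as a greedy pairwise covering set followed by mutation of the protected attribute. This is a simplified illustration, not the tool used by Patel et al.: the input parameter model, the greedy construction, and the `toy_model` are assumptions.

```python
from itertools import combinations, product

# Hypothetical input parameter model, e.g. derived from a training set.
PARAMETERS = {
    "sex": ["female", "male"],
    "age_band": ["young", "middle", "senior"],
    "income_band": ["low", "medium", "high"],
}
PROTECTED = "sex"


def covered_pairs(test, names):
    """All (parameter, value) pairs that a single test exercises together."""
    return {((a, test[a]), (b, test[b])) for a, b in combinations(names, 2)}


def pairwise_test_set(parameters):
    """Greedy 2-way covering set: every value pair appears in at least one test."""
    names = list(parameters)
    candidates = [dict(zip(names, values))
                  for values in product(*parameters.values())]
    uncovered = set()
    for candidate in candidates:
        uncovered |= covered_pairs(candidate, names)
    tests = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates,
                   key=lambda c: len(covered_pairs(c, names) & uncovered))
        tests.append(best)
        uncovered -= covered_pairs(best, names)
    return tests


def toy_model(x):
    """Deliberately biased stand-in classifier."""
    if x["income_band"] == "high":
        return 1
    return 1 if x["income_band"] == "medium" and x["sex"] == "male" else 0


def discriminatory_tests(model, tests, parameters, protected):
    """Tests whose output changes when only the protected attribute is mutated."""
    found = []
    for test in tests:
        outputs = {model({**test, protected: v}) for v in parameters[protected]}
        if len(outputs) > 1:
            found.append(test)
    return found
```

The pairwise set is smaller than the full Cartesian product of 18 combinations yet still contains at least one test for every pair of values, so the biased `medium` income band is exercised and flagged.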

Haffar et al. (Haffar et al., 2022) introduced two distinct generation approaches for tabular data and images. First, they employed guided adversarial generation on tabular data to search for counterfactuals. Second, they employed GANs to generate counterfactuals for image data. The generation objective is to modify the predicted label with minimal adjustments to the input. If one of the modified attributes happens to be a protected attribute, it suggests that the model is exhibiting unfairness against that particular attribute.

Zhang et al. (Zhang et al., 2023) introduced TestSGD, a group fairness testing approach for DNNs that specifically focuses on the statistical parity of subgroups defined by conjunctions of protected attributes. TestSGD first generates rule sets to capture frequent subgroups. Each rule splits the input space into two groups. To accurately measure group discrimination, TestSGD introduces small, uniform perturbations to the original training samples to search for more samples. Then, it calculates the demographic parity score based on the generated samples.
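A subgroup parity score of this kind can be computed as in the sketch below. It is not the TestSGD implementation: the rule format, the `toy_model`, and the use of an exhaustive grid in place of perturbed training samples are assumptions made to keep the example small.

```python
from itertools import product


def toy_model(sample):
    """Stand-in classifier that is stricter for one intersectional subgroup."""
    strict = sample["sex"] == "female" and sample["race"] == "b"
    required = 70 if strict else 50
    return 1 if sample["income"] >= required else 0


def matches(rule, sample):
    """A rule is a conjunction of protected-attribute values."""
    return all(sample[name] == value for name, value in rule.items())


def demographic_parity_score(model, samples, rule):
    """Absolute difference in favourable-outcome rate between the samples
    that satisfy the rule and those that do not."""
    inside = [model(s) for s in samples if matches(rule, s)]
    outside = [model(s) for s in samples if not matches(rule, s)]
    if not inside or not outside:
        raise ValueError("rule must split the samples into two non-empty groups")
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))


# Hypothetical sample set standing in for the generated samples.
SAMPLES = [
    {"sex": s, "race": r, "income": i}
    for s, r, i in product(["female", "male"], ["a", "b"], range(0, 100, 10))
]
RULE = {"sex": "female", "race": "b"}
```

Here `demographic_parity_score(toy_model, SAMPLES, RULE)` compares the subgroup selected by the conjunction with everyone else; a larger score indicates a larger disparity for that subgroup.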

Cabrera et al. (Cabrera et al., 2019) also focused on fairness testing of subgroups and proposed a testing tool named FAIRVIS. In real-world applications, the number of subgroups to analyze can be overwhelming. To address this challenge, FAIRVIS offers methods to search this large space and find potential issues more efficiently. It introduces a subgroup generation technique that recommends intersectional groups where a model may be underperforming. This technique involves clustering the training dataset to identify statistically similar subgroups and using an entropy-based approach to determine important features and calculate group fairness metrics for the clusters. Finally, FAIRVIS presents generated subgroups sorted by their group fairness metrics.

1.3. Verification-based Test Input Generation

Since our survey focuses on fairness testing, we do not discuss work focusing solely on formal verification of fairness properties (see e.g., (Albarghouthi et al., 2017; Bastani et al., 2019; John et al., 2020; Urban et al., 2020; Sun et al., 2021; Ghosh et al., 2021, 2022)). Nevertheless, there is a research stream focusing on generating discriminatory test inputs through fairness verification using Satisfiability Modulo Theories (SMT) solving. These techniques are called verification-based test input generation (Sharma et al., 2021).

Sharma et al. (Sharma and Wehrheim, 2020; Sharma et al., 2021) introduced fairCheck and MLCheck, two verification-based techniques for generating test inputs to assess fairness. These approaches approximate the black-box model under test with a white-box model, leveraging its predictions. The fairness property and the white-box model are translated into logical formulas using SMT solvers. Test cases are subsequently automatically generated in an attempt to violate the specified fairness property, employing the Z3 SMT solver (de Moura and Bjørner, 2008) to check for satisfiability.

Kitamura et al. (Kitamura et al., 2022) developed VBT-CT, a technique that integrates combinatorial t-way testing into verification-based fairness testing. As described before, combinatorial t-way testing can generate diverse datasets by applying logical constraints to specify the sampling space. By incorporating combinatorial t-way testing into the generation process of verification-based testing, VBT-CT enhances the detection capability of discriminatory data.

Zhao et al. (Zhao et al., 2022) introduced VBT-X, an approach that integrates hash-based sampling (Chakraborty et al., 2015) into the test generation phase of verification-based fairness testing. Hash-based sampling techniques are capable of producing diverse solutions for a given logical formula, offering the advantage of generating varied solutions with reasonable computational overhead. By leveraging the diverse sampling capability of hash-based sampling, VBT-X enhances the effectiveness of test generation in verification-based fairness testing.

1.4. Domain-specific Test Input Generation

Recently, an increasing number of approaches have been proposed for test input generation in specific application domains. These approaches aim to generate natural inputs that belong to the data distribution of a practical application scenario. This section introduces such domain-specific test input generation in typical domains: natural language processing, computer vision, and speech recognition.

Natural language processing. Test input generation for Natural Language Processing (NLP) systems is primarily based on the premise that a fair NLP system should produce similar results for pairs of similar texts.

Researchers can collect text data from the wild and then mutate words related to sensitive attributes to generate test inputs. For example, Díaz et al. (Díaz et al., 2018) conducted a study where they collected sentences containing the word “old” from a blogger and replaced it with “young” to detect age-related fairness issues in sentiment analysis systems. They also applied mutations to common adjectives instead of age-related words. By using word embeddings, they obtained analogs for “older” and “younger” versions of these adjectives. These variants were used as test inputs to identify fairness issues related to age. Zhang et al. (Zhang et al., 2023) generated new samples by substituting sensitive terms in the collected texts with alternative terms within the same sensitive attribute category. For instance, for the gender category, the sensitive terms can be bisexual, female, gay, etc. They then calculated the statistical parity based on the original samples and the generated ones.

Similarly, Liu et al. (Liu et al., 2020) conducted fairness testing on generative and retrieval dialogue models. They created gender and race word lists, including male-female word pairs and African American English-Standard English word pairs. From a large dialogue corpus, they selected contexts containing words from these lists and created parallel contexts by replacing the words with their counterparts. The original and generated texts were then compared based on the response of dialogue models using four measurements: diversity, politeness, sentiment, and attribute words. The attribute word analysis involved comparing the probability of attribute words (e.g., career and family-related) appearing in the responses across different groups’ contexts. This is a mixed adoption of causal fairness and statistical parity.

Generation based on handcrafted templates: A large proportion of fairness testing studies for NLP systems involve the use of handcrafted templates to generate test inputs. These templates consist of short sentences with placeholders (e.g., “<person> goes to the school in our neighborhood”) that can be filled with different words to test for violations of causal fairness. Due to the simplicity of the handcrafted templates, most such test generation studies adopt causal fairness, which checks whether NLP systems produce similar outcomes for texts that differ only in sensitive attributes.
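Template-based generation can be sketched in a few lines. The template is the example quoted above; the word lists and the deliberately biased `toy_sentiment` scorer are hypothetical stand-ins for a real word list and a real NLP system.

```python
from itertools import product

TEMPLATE = "{person} goes to the school in our neighborhood"

# Hypothetical placeholder values for two demographic groups.
FEMALE_TERMS = ["She", "My sister"]
MALE_TERMS = ["He", "My brother"]


def toy_sentiment(text):
    """Deliberately biased stand-in for a sentiment analysis system."""
    return 0.7 if text.startswith(("He ", "My brother ")) else 0.5


def generate_pairs(template, group_a, group_b):
    """Fill the placeholder so that each pair differs only in the sensitive term."""
    return [(template.format(person=a), template.format(person=b))
            for a, b in product(group_a, group_b)]


def find_violations(system, pairs, tolerance=0.0):
    """Pairs whose scores differ by more than the tolerance (causal fairness)."""
    return [(s1, s2) for s1, s2 in pairs
            if abs(system(s1) - system(s2)) > tolerance]
```

Each pair returned by `find_violations(toy_sentiment, generate_pairs(TEMPLATE, FEMALE_TERMS, MALE_TERMS))` is a concrete test input on which the system treats two otherwise identical sentences differently.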

Kiritchenko and Mohammad (Kiritchenko and Mohammad, 2018) created 11 templates focused on gender and race, using predefined values for the placeholder <person> representing different names and noun phrases referring to females/males or African/European American. They then compared the emotion and sentiment scores that the systems predict on pairs of sentences that differ only in one word corresponding to race or gender. (Sex and gender are different concepts that are often used interchangeably. We keep the original usage of the two words in each paper to preserve fidelity.)

Mehrabi et al. (Mehrabi et al., 2020a) created nine templates to compare the performance of named entity recognition systems with respect to recognizing male and female names. Wang et al. (Wang et al., 2022b) focused on machine translation systems and developed 30 templates to test their ability to determine the correct gender of a name.

Sharma et al. (Sharma et al., 2020) designed three templates specifically for gender-related fairness issues in natural language inference systems, with gender-specific hypotheses using the placeholder <gender>.

Similarly, other researchers have designed templates for detecting fairness bugs in natural language generation and machine translation systems. Sheng et al. (Sheng et al., 2019), Huang et al. (Huang et al., 2020), Vig et al. (Vig et al., 2020), and Dhamala et al. (Dhamala et al., 2021) used templates with placeholders for sensitive attributes such as gender and race.

Smith et al. (Smith et al., 2022) developed a set of manually created templates covering 13 different demographic attributes. These templates were used to identify bias in language models by examining statistical disparity, that is, group-level differences in the model’s outputs or assigned probabilities that arise from the presence of different identity or demographic information in the input text.

Another approach, CheckList (Ribeiro et al., 2020), utilizes predefined templates to evaluate NLP systems on various capabilities, including fairness.

Wan et al. (Wan et al., 2023) introduced BiasAsker, a test input generation approach aimed at measuring absolute bias and relative bias of conversational systems towards various demographic groups. Absolute bias refers to direct expressions of bias, such as statements like “Group A is smarter than Group B,” while relative bias involves generating different responses to questions about different groups. To obtain social groups and biased properties, they constructed a comprehensive social bias dataset, which includes a total of 841 groups and 8,110 biased properties. To detect both types of bias, they designed templates and rules for generating Yes-No Questions, Choice-Questions, and Wh-Questions. These generated inputs serve as a means to evaluate and identify bias in conversational systems.

Automated generation: While handcrafted templates have been effective in detecting fairness issues in NLP systems, researchers argue that the test inputs generated from them may be simplistic and limited (Poria et al., 2020). Furthermore, such simplistic test inputs may overlook complex scenarios where multiple words related to sensitive attributes are present, which is often the case in practical applications. To address this limitation, automated test generation techniques should carefully consider each word that may depend on sensitive attributes and generate natural counterfactual inputs that align with the counterfactual fairness definition.

To address the problem, Ma et al. (Ma et al., 2020) introduced the automated framework MT-NLP. It employs advanced NLP techniques to identify human-related tokens in the input text and utilizes word analogy techniques to mutate these tokens, generating discriminatory test inputs. Language fluency metrics are then used to filter out unrealistic inputs.

Asyrofi et al. (Asyrofi et al., 2021) proposed BiasFinder, an automated approach for creating diverse and complex test inputs to uncover fairness bugs in NLP systems. By leveraging NLP techniques like coreference resolution and named entity recognition, BiasFinder identifies words associated with demographic characteristics and replaces them with placeholders to form templates. Concrete values are then filled into these templates, generating a large number of text mutants for testing metamorphic relationships.

Yang et al. (Yang et al., 2021) proposed BiasRV, an approach for testing bias in deployed sentiment analysis systems. By utilizing BiasFinder, they generate a template for a given text and create mutants based on that template to assess if the system exhibits biased predictions. They define distributional fairness, which examines whether mutants from different demographic groups are treated similarly. Specifically, they expect the distribution of predicted sentiments for these groups of mutants to be closely aligned if the system is fair. Unlike traditional statistical parity, which measures overall fairness, distributional fairness evaluates whether the system makes biased predictions for specific inputs.

Soremekun et al. (Soremekun et al., 2022b) developed ASTRAEA, a grammar-based fairness testing approach for generating discriminatory inputs in NLP systems. ASTRAEA incorporates input grammars covering various NLP tasks and biases. By randomly exploring the input grammars and mutating sensitive attribute-related words using alternative tokens, ASTRAEA generates initial test inputs and checks whether they satisfy the metamorphic relations.

For fairness testing of machine translation systems, Sun et al. (Sun et al., 2020; Sun et al., 2022b) proposed TransRepair and CAT. TransRepair conducts sentence mutations by replacing words with context-similar alternatives, while CAT identifies and replaces words using isotopic replacement. The resulting mutants, along with the original input sentence, serve as test inputs for evaluating the fairness of machine translation.

Computer vision. To detect fairness issues in Computer Vision (CV) systems, researchers often examine how the system’s output changes when the sensitive attribute of a person in the input image is altered while keeping other factors constant (i.e., generating counterfactual images). This idea forms the basis of test input generation for CV systems.

GANs (Goodfellow et al., 2014) are commonly used for image transformations in ML testing (Zhang et al., 2022b). However, conventional GANs face challenges in generating precise changes required for fairness testing. For example, changing hair color without affecting other facial features or hair style can be difficult. To address this, recent efforts have adapted and improved conventional GANs for generating test inputs in CV systems. Denton et al. (Denton et al., 2019b) developed a face generative model that maps latent codes to images and inferred directions in latent code space to manipulate specific sensitive attributes. Test inputs were generated by traversing these inferred directions. Joo and Kärkkäinen (Joo and Kärkkäinen, 2020) employed the FaderNetwork architecture (Lample et al., 2017), where specific known attributes of an image are input separately to the generator. Zhang et al. (Zhang et al., 2021a) utilized CycleGAN (Zhu et al., 2017), which limits changes to non-sensitive attributes, to generate discriminatory inputs.

Muthukumar (Muthukumar, 2019) criticized GAN-based approaches for their inability to effectively modify a single attribute while keeping other non-related attributes unchanged. This limitation makes it challenging to identify the exact cause of unfair outcomes. For instance, in gender classification where discrimination occurs between different skin types, it is possible that factors like hairstyle, facial structure, cosmetics, or clothing contribute to the disparity rather than skin type alone. To address this issue, Muthukumar proposed a solution where face images were represented in the YCrCb color space (Hsu et al., 2002), and techniques such as luminance mode-shift and optimal transport (Ferradans et al., 2014) were utilized to alter the skin type of a face.

However, these approaches may fail to account for causal relationships between attributes when generating discriminatory images. Although they claim to produce counterfactual images, they often rely on a causal fairness definition rather than strict counterfactual fairness. To generate test inputs that reflect real-world scenarios, it is crucial to consider the downstream effects resulting from changes in sensitive attributes. For instance, in a chest MRI classification system, a patient’s age can influence the relative size of their organs (Dash et al., 2022). Therefore, altering the age without considering the causal relationship between age and organ size would lack realism. In response to this limitation, Dash et al. (Dash et al., 2022) introduced ImageCFGen, a fairness testing method that incorporates knowledge from a causal graph and utilizes an inference mechanism within a GAN-like framework. This approach enables the generation of discriminatory images that adhere to the definition of counterfactual fairness.

Unlike the aforementioned studies that employ individual fairness definitions, Balakrishnan et al. (Balakrishnan et al., 2020) introduced a new concept of group fairness, investigating the parity in robustness against transformations across various demographic groups. Specifically, they proposed an image generation method that produces synthesized samples considering multiple sensitive attributes, and compares error rates of gender classifiers across various subgroups before and after the synthesis. For generation, the approach modifies multiple attributes simultaneously to create grid-like sets of matched images called transects. This is achieved by navigating the latent space of the generator in directions specific to each attribute.

Similarly, Pu et al. (Pu et al., 2022) investigated the parity in robustness of deepfake detectors across various demographic groups by introducing makeup as a feature perturbation. Specifically, they employed common makeup alterations such as eyeshadows, eyeliners, lipstick, and blushes to perturb the face images before feeding these images to the deepfake detection models. Subsequently, they compared the accuracy differences of the detectors in detecting deepfakes between original and synthetic images for both male and female images, thus quantifying the extent of gender disparity in the two accuracy differences. In addition, they discovered that salient regions near the lips have the greatest impact on the fairness of the tested models.

Speech recognition. The fairness testing literature has dedicated less attention to speech recognition compared to NLP and CV. Rajan et al. (Rajan et al., 2022) introduced AequeVox, a fairness testing approach tailored for speech recognition systems. Similar to Balakrishnan et al. (Balakrishnan et al., 2020) and Pu et al. (Pu et al., 2022), AequeVox measures group fairness as the parity in robustness against perturbations across demographic groups. It generates test inputs by applying eight common real-life metamorphic perturbations to speech signals, such as noise, drop, and low/high pass filters. Then it calculates the error rate increase in speech recognition for various demographic groups after applying these perturbations. If the difference in error rate increase surpasses a predetermined threshold for different groups, the recognition system is considered to have fairness bugs.
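The robustness-parity check described above reduces to a small computation, sketched here. The group names, the error rates, and the threshold are made-up illustrative values, not results from AequeVox.

```python
def error_rate_increases(error_rates):
    """Map each group to its error-rate increase after perturbation.
    `error_rates` maps group -> (error before, error after)."""
    return {group: after - before
            for group, (before, after) in error_rates.items()}


def robustness_gap(error_rates):
    """Largest difference in error-rate increase between any two groups."""
    increases = error_rate_increases(error_rates)
    return max(increases.values()) - min(increases.values())


def has_fairness_bug(error_rates, threshold):
    """Flag the system when the gap exceeds a predetermined threshold."""
    return robustness_gap(error_rates) > threshold


# Hypothetical measurements for one perturbation (e.g. added noise).
EXAMPLE_RATES = {
    "group_a": (0.10, 0.15),
    "group_b": (0.12, 0.30),
}
```

With these illustrative numbers, `has_fairness_bug(EXAMPLE_RATES, threshold=0.1)` flags the system because one group degrades noticeably more than the other under the same perturbation.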

2. Test Oracle Identification

Given an input for a software system, the challenge of distinguishing the corresponding desired behavior from potentially incorrect behavior is called the “test oracle problem” (Barr et al., 2015). Test oracle identification is one of the key problems in ML testing (Zhang et al., 2022b). The test oracle of fairness testing determines whether the software behaves in accordance with fairness requirements and enables the judgement of whether a fairness bug exists. Existing work employs two types of test oracles for fairness testing: metamorphic relations and statistical measurements.

A metamorphic relation is a relationship between the software input change and the output change that we expect to hold across multiple executions (Barr et al., 2015). Suppose a system implements the function $\sin(x)$; then $\sin(x)=\sin(\pi-x)$ is a metamorphic relation. This relation can be used as a test oracle to help detect bugs. If $\sin(x)$ differs from $\sin(\pi-x)$, we can conclude that the system under test has a bug without the need for examining the specific values output by the system.
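A relation of this kind can be checked mechanically, as in the following sketch. It uses the identity $\sin(x)=\sin(\pi-x)$; the tolerance, the sampling range, and the faulty `buggy_sin` implementation are assumptions for illustration.

```python
import math
import random


def check_sine_relation(sin_impl, n_trials=1000, seed=0, tol=1e-9):
    """Return an input that violates sin(x) == sin(pi - x), or None if the
    relation holds on all sampled inputs."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        x = rng.uniform(-10.0, 10.0)
        if not math.isclose(sin_impl(x), sin_impl(math.pi - x), abs_tol=tol):
            return x
    return None


def buggy_sin(x):
    """Hypothetical faulty implementation: a truncated Taylor series."""
    return x - x ** 3 / 6
```

`check_sine_relation(math.sin)` finds no violation, whereas `check_sine_relation(buggy_sin)` returns a counterexample, even though the check never compares against the true value of the sine function.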

Metamorphic relations are a type of pseudo-oracle commonly adopted to mitigate the oracle problem in ML testing. They have also been widely studied for fairness testing of ML software. Specifically, existing work mainly performs fairness-related metamorphic transformations on the input data or training data of ML software, and expects these transformations either to leave the prediction unchanged or to change it in an expected way. Next, we classify and discuss this work according to whether the metamorphic transformations operate on sensitive attributes.

Metamorphic transformations through mutating sensitive attributes. The mutation of sensitive attributes is a widely used technique for generating metamorphic relations in the context of individual fairness. Assessing violations of individual fairness, such as counterfactual fairness and causal fairness, typically requires the comparison of paired data instances that vary in their sensitive attributes.

For ML software designed for classification tasks, a widely employed metamorphic relation for fairness testing involves comparing pairs of instances with different sensitive attributes but similar non-sensitive attributes, expecting them to yield the same classification outcome. For instance, a fair loan application system should make identical decisions for two applicants who differ only in their gender.

This metamorphic relation has found extensive application in testing the fairness of software systems and guiding the generation of fairness tests across various domains, including tabular data classification (Galhotra et al., 2017; Zhang et al., 2021c; Zhang et al., 2021b, 2020; Xie and Wu, 2020; Aggarwal et al., 2019; Udeshi et al., 2018; Angell et al., 2018), text classification (Soremekun et al., 2022b; Asyrofi et al., 2021; Kiritchenko and Mohammad, 2018; Ribeiro et al., 2020; Wang et al., 2022b; Sharma et al., 2020; Huang et al., 2020; Ma et al., 2020), and image classification (Zheng et al., 2022; Zhang et al., 2021a; Dash et al., 2022; Denton et al., 2019a; Joo and Kärkkäinen, 2020; Denton et al., 2019b), among others.

In the case of tabular data, researchers typically select the sensitive attributes of interest in the dataset (e.g., gender and race) and modify the values of those attributes (e.g., from “male” to “female”) to assess if the software violates the fairness metamorphic relation for those instances (i.e., adopting the causal fairness definition). Tao et al. (Tao et al., 2022) argued that altering only the sensitive attributes can lead to unnatural test inputs and therefore relaxed this constraint by allowing small perturbations to be applied to non-sensitive attributes simultaneously. Similarly, counterfactual fairness (Kusner et al., 2017) necessitates modifying the non-sensitive attributes that are causally influenced by the sensitive attributes whenever the sensitive attributes are modified.
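As an illustration of this relation for tabular data, the following minimal sketch mutates only the sensitive attribute of each instance and reports the pairs whose predictions differ. The function name `find_discriminatory_pairs`, the dictionary-based instance format, and the toy loan model are assumptions made for this example and are not taken from any of the cited tools.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Instance = Dict[str, Any]


def find_discriminatory_pairs(
    predict: Callable[[Instance], Any],
    instances: Iterable[Instance],
    sensitive_attr: str,
    values: Iterable[Any],
) -> List[Tuple[Instance, Instance]]:
    """Return (original, mutant) pairs whose predictions differ.

    Each mutant is a copy of the original in which only the sensitive
    attribute is changed, so a changed prediction violates the relation.
    """
    values = list(values)
    violations = []
    for original in instances:
        expected = predict(original)
        for value in values:
            if value == original[sensitive_attr]:
                continue  # same value, not a mutation
            mutant = {**original, sensitive_attr: value}
            if predict(mutant) != expected:
                violations.append((original, mutant))
    return violations


# Hypothetical loan model that (unfairly) favors one gender near the boundary.
def toy_loan_model(applicant: Instance) -> str:
    bonus = 5 if applicant["gender"] == "male" else 0
    return "approve" if applicant["income"] + bonus >= 50 else "reject"


applicants = [
    {"gender": "female", "income": 48},
    {"gender": "male", "income": 70},
    {"gender": "male", "income": 30},
]
pairs = find_discriminatory_pairs(
    toy_loan_model, applicants, "gender", ["male", "female"]
)
```

In this toy setting only the applicant close to the decision boundary exposes the violation, which is one reason why test generation techniques search for such inputs rather than sampling at random.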

For text data, researchers generate inputs by filling sensitive attribute-related placeholders in predefined templates and mutating the sensitive attributes within these templates, or identify the entities related to the sensitive attributes and simultaneously transform all the identified entities to create valid test inputs that reflect real-world scenarios. For example, when dealing with gender, researchers identify and modify person names, gender pronouns, gender nouns, and so on.
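The template-based strategy can be sketched as follows. The templates, the term lists, and the helper names (`generate_test_pairs`, `check_relation`) are hypothetical and deliberately tiny; the cited studies rely on much larger curated resources.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical templates and gender-related terms, for illustration only.
TEMPLATES = [
    "{person} is a great doctor.",
    "{person} feels angry today.",
]
GENDER_TERMS: Dict[str, List[str]] = {
    "male": ["he", "my brother"],
    "female": ["she", "my sister"],
}


def _fill(template: str, term: str) -> str:
    """Fill the placeholder and capitalize the first character."""
    sentence = template.format(person=term)
    return sentence[0].upper() + sentence[1:]


def generate_test_pairs(
    templates: List[str], terms_by_group: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Create sentence pairs that differ only in the sensitive term.

    Terms are aligned by position (he/she, my brother/my sister).
    """
    pairs = []
    for template in templates:
        for group_a, group_b in combinations(sorted(terms_by_group), 2):
            aligned = zip(terms_by_group[group_a], terms_by_group[group_b])
            for term_a, term_b in aligned:
                pairs.append((_fill(template, term_a), _fill(template, term_b)))
    return pairs


def check_relation(
    classify: Callable[[str], str], pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return the pairs for which the classifier's outputs differ."""
    return [(a, b) for a, b in pairs if classify(a) != classify(b)]
```

A text classifier under test would be passed as `classify`; every returned pair is a candidate fairness violation that differs only in the gender-related term.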

For image data, researchers predominantly rely on advanced deep learning techniques such as GANs (Goodfellow et al., 2014) to transform images across different sensitive attributes. The methodologies for altering sensitive attributes in different types of data are extensively discussed in Section 4.1.

For ML software designed for regression tasks, the prediction outcomes are continuous values rather than discrete labels, which poses a challenge in determining metamorphic relations for assessing fairness. Specifically, it is difficult to ascertain whether two predicted continuous outcomes are different enough to indicate fairness issues in the software under test. To address this challenge, Udeshi et al. (Udeshi et al., 2018) introduced the use of a threshold to define metamorphic relations. In this approach, the difference between the outcomes of two similar instances that vary only in their sensitive attribute must be smaller than a manually specified threshold. Similarly, Perera et al. (Perera et al., 2022) proposed the concept of fairness degree, which quantifies the maximum difference in predicted values across all pairs of instances that are similar except for their sensitive attribute. The fairness degree can be used to construct metamorphic relations and guide the generation of test inputs.
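The two ideas for regression tasks can be sketched together. This is a simplified reading of the text above, not the implementation of either cited approach: the function names, the dictionary-based instances, and the toy salary model are assumptions, and the fairness degree is approximated over a given list of instances rather than the whole input space.

```python
from typing import Any, Callable, Dict, Iterable, Sequence

Instance = Dict[str, Any]
Regressor = Callable[[Instance], float]


def outcome_gap(predict: Regressor, instance: Instance,
                sensitive_attr: str, values: Sequence[Any]) -> float:
    """Largest difference among the predictions for an instance and its
    mutants, which differ only in the sensitive attribute."""
    outputs = [predict({**instance, sensitive_attr: v}) for v in values]
    return max(outputs) - min(outputs)


def violates_threshold_relation(predict: Regressor, instance: Instance,
                                sensitive_attr: str, values: Sequence[Any],
                                threshold: float) -> bool:
    """The relation requires the gap to be smaller than the threshold."""
    return outcome_gap(predict, instance, sensitive_attr, values) >= threshold


def fairness_degree(predict: Regressor, instances: Iterable[Instance],
                    sensitive_attr: str, values: Sequence[Any]) -> float:
    """Maximum outcome gap over the supplied instances."""
    return max(outcome_gap(predict, x, sensitive_attr, values) for x in instances)


# Hypothetical salary predictor with a fixed gender-dependent offset.
def toy_salary_model(person: Instance) -> float:
    return 1000 * person["years"] + (200 if person["gender"] == "male" else 0)


GENDERS = ["male", "female"]
people = [{"gender": "female", "years": 1}, {"gender": "male", "years": 5}]
degree = fairness_degree(toy_salary_model, people, "gender", GENDERS)
```

Whether a given gap counts as a violation depends entirely on the chosen threshold, which is why the threshold has to be specified with domain knowledge.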

For ML software designed for generation tasks, it is also challenging to determine the metamorphic relations. In the case of natural language generation systems, it is particularly difficult to evaluate the identity or similarity of generated text. To address this, researchers (Huang et al., 2020; Sheng et al., 2019; Liu et al., 2020; Dhamala et al., 2021) have employed various existing natural language processing techniques to measure text similarity. These techniques include sentiment classification, perplexity and semantic similarity measurement, politeness measurement, diversity measurement, toxicity classification, and regard classification applied to machine-generated text. As a result, metamorphic relations for text generation systems require that pairs of inputs that are identical except for the sensitive attribute yield generated text with consistent properties, such as sentiment polarity, perplexity, semantics, and regard. Consistency in these properties serves as a proxy for similarity of the generated text. In the context of machine translation systems, where translations are generated based on input sentences, Sun et al. (Sun et al., 2020; Sun et al., 2022b) have proposed generating test oracles based on the metamorphic relation between translation inputs and outputs. Specifically, they expect translation outputs for the original input sentence and its mutants, which differ in sensitive characteristics, to exhibit a certain level of consistency modulo the mutated words. To validate this consistency, similarity metrics are employed as test oracles to measure the degree of agreement between translated outputs.

Metamorphic transformations through mutating non-sensitive attributes. Some researchers (Pu et al., 2022; Rajan et al., 2022) adopt a different approach to generate metamorphic relations for fairness testing by focusing on the mutation of non-sensitive attributes. They employ metamorphic transformations as perturbations applied to samples and subsequently assess the ML software’s robustness to these perturbations across various demographic groups. This methodology can be viewed as an implementation of group fairness, as it involves comparing the software’s performance across different groups in response to the perturbations.

Balakrishnan et al. (Balakrishnan et al., 2020) produced synthesized samples taking multiple sensitive attributes into account, and compared error rates of gender classifiers across various subgroups before and after the synthesis.

Pu et al. (Pu et al., 2022) employed makeup as a form of perturbation applied to face images. They then compared the accuracy differences as bias factors between male and female individuals on both the original face images and synthetic images generated by introducing these perturbations.

Rajan et al. (Rajan et al., 2022) applied eight metamorphic transformations to speech signals, and measured the increase in error rates of speech recognition for different demographic groups after these transformations.
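The common pattern in these studies, comparing how much a perturbation degrades performance for each demographic group, can be sketched as follows. The threshold "model", the shift perturbation, and the group data are toy assumptions that stand in for a real recognizer and real transformations.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Tuple[Any, Any]  # (input, expected label)


def error_rate(predict: Callable[[Any], Any], samples: List[Sample]) -> float:
    """Fraction of samples whose prediction differs from the expected label."""
    return sum(predict(x) != y for x, y in samples) / len(samples)


def error_increase_by_group(
    predict: Callable[[Any], Any],
    perturb: Callable[[Any], Any],
    samples_by_group: Dict[str, List[Sample]],
) -> Dict[str, float]:
    """Increase in error rate per group after perturbing every input."""
    increases = {}
    for group, samples in samples_by_group.items():
        perturbed = [(perturb(x), y) for x, y in samples]
        increases[group] = error_rate(predict, perturbed) - error_rate(predict, samples)
    return increases


# Toy stand-ins: a threshold classifier and a constant shift as perturbation.
def toy_model(x: float) -> bool:
    return x > 0


def shift(x: float) -> float:
    return x - 1


samples_by_group = {
    "A": [(3, True), (4, True), (-2, False), (5, True)],
    "B": [(1, True), (1, True), (-2, False), (1, True)],
}
increases = error_increase_by_group(toy_model, shift, samples_by_group)
```

Here both groups start with no errors, but group B lies closer to the decision boundary and loses far more accuracy under the same perturbation, which is the kind of disparity these approaches report.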

2.2. Statistical Measurements as Test Oracles

Researchers have proposed various statistical fairness measurements that align with different fairness definitions. While these measurements do not serve as direct oracles for fairness testing, they provide a quantitative way to assess the fairness of the software under test. For instance, in the case of statistical parity, researchers calculate the favorable rate among demographic groups and identify fairness violations by comparing these rates. This comparison involves measuring the difference between the rates, known as Statistical Parity Difference (SPD), or computing the ratio of the rates, known as Disparate Impact (DI) (Zhang and Harman, 2021). If the calculated SPD or DI deviates from its ideal value (0 for SPD and 1 for DI) by more than a predefined threshold, this indicates the presence of fairness bugs in the software under test.

There is a wide array of statistical fairness measurements available, with the IBM AIF360 toolkit (AIF, 2018) alone offering over 70 such measurements. Determining the appropriate measure of fairness is a requirements engineering problem involving negotiation among various stakeholders and different interpretations (Baresi et al., 2023). For certain social-critical application scenarios, domain knowledge is required for fairness testing, and the prediction-modeler (e.g., data scientists and software engineers) needs to work together with the decision-maker (e.g., product managers and business strategists) (Sarro, 2023). How to achieve this is still an open challenge. However, once a fairness measure is determined, the actual statistical measurement used for testing is typically straightforward. Providing a comprehensive description and comparison of each measurement is beyond the scope of this survey. Verma and Rubin (Verma and Rubin, 2018) conducted a survey and categorized several widely adopted statistical fairness measurements. In this survey, we expand upon their categorization based on our collected papers and present representative measurements from each category.

Measurements based on predicted outcomes. Some measurements are calculated based on the predicted outcomes of the software for privileged and unprivileged groups. For example, the aforementioned Statistical Parity Difference (SPD) (Barocas and Selbst, 2016) measures the difference in favorable rates among different demographic groups; Disparate Impact (DI) (Feldman et al., 2015) measures the ratio of the favorable rate of the unprivileged group against that of the privileged group.
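Both measurements reduce to a few lines of arithmetic. The sketch below assumes binary predictions with 1 as the favorable outcome and computes SPD as the unprivileged rate minus the privileged rate; the sign convention and the function names are assumptions for this example.

```python
from typing import Sequence


def favorable_rate(predictions: Sequence[int], favorable: int = 1) -> float:
    """Fraction of predictions equal to the favorable outcome."""
    return sum(p == favorable for p in predictions) / len(predictions)


def statistical_parity_difference(unprivileged: Sequence[int],
                                  privileged: Sequence[int]) -> float:
    """SPD: favorable rate of the unprivileged group minus that of the
    privileged group (0 means parity)."""
    return favorable_rate(unprivileged) - favorable_rate(privileged)


def disparate_impact(unprivileged: Sequence[int],
                     privileged: Sequence[int]) -> float:
    """DI: favorable rate of the unprivileged group divided by that of the
    privileged group (1 means parity)."""
    privileged_rate = favorable_rate(privileged)
    if privileged_rate == 0:
        raise ValueError("privileged group has no favorable predictions")
    return favorable_rate(unprivileged) / privileged_rate


# Toy predictions: 3 of 4 favorable for the privileged group, 1 of 4 otherwise.
privileged_preds = [1, 1, 1, 0]
unprivileged_preds = [1, 0, 0, 0]
spd = statistical_parity_difference(unprivileged_preds, privileged_preds)
di = disparate_impact(unprivileged_preds, privileged_preds)
```

With these toy predictions SPD is negative and DI is well below 1, both indicating that the unprivileged group receives favorable outcomes less often.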

Measurements based on predicted and actual outcomes. Some measurements not only consider the predicted outcomes for different demographic groups, but also compare them with the actual outcomes recorded in the collected data. For example, the Equal Opportunity Difference (EOD) (Hardt et al., 2016) measures the difference in the true-positive rates of privileged and unprivileged groups, where the true-positive rates are calculated by comparing the predicted and actual outcomes. Another widely-adopted measurement that lies in this category is Average Odds Difference (AOD) (Hardt et al., 2016), which refers to the average of the false-positive rate difference and the true-positive rate difference between unprivileged and privileged groups.
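EOD and AOD follow directly from the per-group true-positive and false-positive rates. The sketch below assumes binary labels with 1 as the favorable outcome and reports differences as unprivileged minus privileged; both the sign convention and the helper names are assumptions.

```python
from typing import Sequence, Tuple

Group = Tuple[Sequence[int], Sequence[int]]  # (actual labels, predicted labels)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of actual positives that are predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_positives) / len(preds_for_positives)


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of actual negatives that are predicted positive."""
    preds_for_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(preds_for_negatives) / len(preds_for_negatives)


def equal_opportunity_difference(unprivileged: Group, privileged: Group) -> float:
    """Difference in true-positive rates between the two groups."""
    return true_positive_rate(*unprivileged) - true_positive_rate(*privileged)


def average_odds_difference(unprivileged: Group, privileged: Group) -> float:
    """Average of the FPR difference and the TPR difference."""
    fpr_diff = false_positive_rate(*unprivileged) - false_positive_rate(*privileged)
    tpr_diff = true_positive_rate(*unprivileged) - true_positive_rate(*privileged)
    return 0.5 * (fpr_diff + tpr_diff)


# Toy data: (actual, predicted) for each group.
privileged_group = ([1, 1, 0, 0], [1, 1, 1, 0])    # TPR 1.0, FPR 0.5
unprivileged_group = ([1, 1, 0, 0], [1, 0, 0, 0])  # TPR 0.5, FPR 0.0
eod = equal_opportunity_difference(unprivileged_group, privileged_group)
aod = average_odds_difference(unprivileged_group, privileged_group)
```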

Measurements based on predicted probabilities and actual outcomes. Some measurements take the predicted probability scores and actual outcomes into account. For example, for any given predicted score, the calibration measurement calculates the difference in the probability of having a favorable outcome for privileged and unprivileged groups (Chouldechova, 2017); the measurement of balance for positive class calculates the difference of average predicted probability scores in the favorable class between privileged and unprivileged groups (Kleinberg et al., 2017).

Measurements based on neuron activation. As DNNs are widely used in software systems to support the decision-making process, researchers have started to leverage the internal behaviors of DNNs to design statistical fairness measurements. Tian et al. (Tian et al., 2020) proposed a new statistical measurement based on neuron activation for DNNs. First, they computed a neuron activation vector for each label class based on the test inputs. Specifically, for a class $c$ , each element of its neuron activation vector represents how frequently a corresponding neuron is activated by all members in the test inputs belonging to class $c$ . Then they computed the distance between neuron activation vectors of different classes as the fairness measurement. If two classes do not show a similar distance with regards to a third class, they consider that the DNN under test contains fairness bugs.

Measurements based on situation testing. Some researchers designed statistical measurements to approximate situation testing, which is a legal experimental procedure that seeks pairs of instances with similar characteristics apart from the sensitive attribute value but with different prediction outputs (Thanh et al., 2011). Thanh et al. (Thanh et al., 2011) leveraged k-nearest neighbor classification to approximate situation testing. They first divided the dataset into the privileged group and the unprivileged group, based on the sensitive attribute. Then, for each instance $r$ in the dataset, they found the k-nearest neighbors in the two groups and denoted them as sets $K_{p}$ and $K_{u}$ , respectively. Finally, they calculated, for each of $K_{p}$ and $K_{u}$ , the proportion of instances whose outcome is the same as that of $r$ , and measured the difference between the two proportions. If the difference is larger than a given threshold, the instance $r$ is considered unfairly treated. Zhang et al. (Zhang et al., 2016b) improved the measurement proposed by Thanh et al. (Thanh et al., 2011). They designed a new distance function that measures the distance between data instances, to improve the k-nearest neighbor classification. Their function considers only the set of attributes that are identified as the direct causes of the outcome by Causal Bayesian Networks (Pearl, 2009).
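The k-nearest-neighbor procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: Euclidean distance over numeric features, the absolute difference between the two proportions, and a hypothetical `Record` type; the cited work uses its own distance functions.

```python
from math import dist
from typing import List, NamedTuple, Sequence


class Record(NamedTuple):
    features: Sequence[float]
    privileged: bool
    outcome: int


def proportion_same_outcome(r: Record, group: List[Record], k: int) -> float:
    """Share of r's k nearest neighbors in `group` that have r's outcome."""
    others = [c for c in group if c is not r]
    neighbors = sorted(others, key=lambda c: dist(c.features, r.features))[:k]
    if not neighbors:
        raise ValueError("group has no candidate neighbors")
    return sum(n.outcome == r.outcome for n in neighbors) / len(neighbors)


def situation_testing_gap(r: Record, data: List[Record], k: int) -> float:
    """Absolute difference between the proportions computed in K_p and K_u."""
    privileged = [x for x in data if x.privileged]
    unprivileged = [x for x in data if not x.privileged]
    return abs(proportion_same_outcome(r, privileged, k)
               - proportion_same_outcome(r, unprivileged, k))


def flag_instances(data: List[Record], k: int, threshold: float) -> List[Record]:
    """Instances whose gap exceeds the threshold."""
    return [r for r in data if situation_testing_gap(r, data, k) > threshold]


# Toy data: near feature value 1 the outcome depends on the group,
# near feature value 10 everyone receives the favorable outcome.
data = (
    [Record((x,), True, 1) for x in (1.0, 1.2, 1.4, 10.0, 10.2, 10.4)]
    + [Record((x,), False, 0) for x in (1.1, 1.3, 1.5)]
    + [Record((x,), False, 1) for x in (10.1, 10.3, 10.5)]
)
flagged = flag_instances(data, k=2, threshold=0.5)
```

Only the instances in the region where similar individuals from the two groups receive different outcomes are flagged; the instances near feature value 10 have a gap of zero.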

Measurements based on optimal transport projections. Several measurements (Black et al., 2020; Taskesen et al., 2021; Si et al., 2021) are based on optimal transport projections (Villani, 2009), which seek a transformation map between two probability measures. Black et al. (Black et al., 2020) mapped the set of women in the data to male counterparts, using the optimal transport projection that minimizes the sum of the distances between each woman and the man to whom she is mapped (called her counterpart). Then they extracted the positive flipset, which contains the women with favorable outcomes whose counterparts receive unfavorable outcomes. They also extracted the negative flipset, which is the set of women with unfavorable outcomes whose counterparts receive favorable outcomes. Finally, they calculated the size difference of the positive and the negative flipsets to measure the unfairness of the system under test.
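To make the flipset construction concrete, the sketch below replaces the optimal transport solver with a brute-force minimum-cost one-to-one matching, which is only feasible for tiny, equal-sized groups. This substitution, the function names, and the toy threshold classifier are assumptions for illustration and do not reproduce the cited implementation.

```python
from itertools import permutations
from math import dist
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def optimal_matching(source: List[Point], target: List[Point]) -> Tuple[int, ...]:
    """Return `match` such that source[i] maps to target[match[i]] and the
    total distance is minimal. Brute force over all permutations."""
    if len(source) != len(target):
        raise ValueError("this sketch requires groups of equal size")
    return min(
        permutations(range(len(target))),
        key=lambda perm: sum(dist(source[i], target[j]) for i, j in enumerate(perm)),
    )


def flipsets(predict: Callable[[Point], bool], source: List[Point],
             target: List[Point]) -> Tuple[List[int], List[int]]:
    """Indices of source members in the positive and negative flipsets."""
    positive, negative = [], []
    for i, j in enumerate(optimal_matching(source, target)):
        own, counterpart = predict(source[i]), predict(target[j])
        if own and not counterpart:
            positive.append(i)   # favorable, counterpart unfavorable
        elif counterpart and not own:
            negative.append(i)   # unfavorable, counterpart favorable
    return positive, negative


# Toy data: one numeric feature; the classifier approves values >= 2.2.
women = [(1.0,), (2.0,), (3.0,)]
men = [(1.5,), (2.5,), (3.5,)]
positive, negative = flipsets(lambda x: x[0] >= 2.2, women, men)
size_difference = len(positive) - len(negative)
```

For realistic data sizes, a dedicated assignment or optimal transport solver would be needed in place of the permutation search.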

Measurements for ranking systems. Applying the aforementioned statistical fairness measurements directly to ranking systems, which are extensively utilized in various domains such as hiring and university admissions (Kuhlman et al., 2021; Gezici et al., 2021), poses significant challenges. To address this challenge, some researchers have tackled the ranking problem by transforming it into a classification problem and subsequently applying existing statistical fairness measurements. For example, researchers (Singh and Joachims, 2018; Yang and Stoyanovich, 2017) have used statistical parity difference as a fairness measurement, assessing whether individuals from different groups have equal representation among desirable outcomes, such as securing top positions in the ranking. Pairwise fairness is another common statistical metric employed in the context of ranking systems (Kuhlman et al., 2019; Narasimhan et al., 2020; Beutel et al., 2019). It necessitates that a ranking system ensures equal likelihood of a clicked item being ranked above another relevant unclicked item across different demographic groups. Recently, a fairness testing approach for deep recommender systems, namely FairRec (Guo et al., 2023), supports the measurement of differences in evaluation metrics between demographic groups through three types of statistical measurements: recommendation performance, alignment of recommended item popularity with user preferences, and diversity in recommendations.

It is difficult to determine the threshold that a statistical measurement must cross to signal a fairness bug. For example, for the aforementioned SPD, it would be too strict to consider the software under test to be fair only when SPD equals 0. In practice, practitioners can set a threshold for the measurement under consideration (DiCiccio et al., 2020). If the measurement result for the software under test falls on the wrong side of the specified threshold (above or below it, depending on the measurement), the software is considered to have a fairness bug. Although the threshold could be empirically specified by engineers, it is challenging to determine the appropriate threshold for each fairness measurement.

To alleviate this problem, researchers use statistical testing based on the measurements to detect fairness bugs. Tramèr et al. (Tramèr et al., 2017) proposed FairTest to analyze the associations between software outcomes and sensitive attributes. The software under test is deemed to have a fairness bug if the associations are statistically significant. Taskesen et al. (Taskesen et al., 2021) and Si et al. (Si et al., 2021) employed a statistical hypothesis test for the fairness measurements based on optimal transport projection. DiCiccio et al. (DiCiccio et al., 2020) presented a non-parametric permutation testing approach for assessing whether a software system is fair in terms of a fairness measurement. The permutation test is used to test the null hypothesis that a system has equitable performance for two demographic groups (e.g., male and female) with respect to the given measurement. Gursoy and Kakadiaris (Gursoy and Kakadiaris, 2022) used the permutation test to detect whether the prediction errors of a regression model differ in a statistically significant manner across demographic groups, in order to determine whether the model under test is unfair.
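A plain two-sample permutation test conveys the idea behind these approaches. The sketch below is a generic textbook version, not the specific procedure of any cited paper; the test statistic (absolute difference in group means), the number of permutations, and the fixed seed are assumptions.

```python
import random
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def permutation_test(group_a: Sequence[float], group_b: Sequence[float],
                     n_permutations: int = 2000, seed: int = 0) -> float:
    """Estimate the p-value for the null hypothesis that both groups have
    the same mean of a per-instance measurement (e.g., 1 if the prediction
    is favorable or correct, 0 otherwise)."""
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group membership at random
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(_mean(shuffled_a) - _mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy outcomes: every member of one group is favored, no member of the other.
p_disparate = permutation_test([1] * 20, [0] * 20)
# Toy outcomes with identical favorable rates in both groups.
p_equal = permutation_test([1, 0] * 10, [1, 0] * 10)
```

A small p-value suggests that the observed disparity is unlikely under random group assignment, which avoids having to pick a fixed threshold on the measurement itself; a significance level still has to be chosen.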

In addition, researchers construct a baseline for the fairness measurement, and detect fairness bugs by comparing the measurement value with the baseline. Zhao et al. (Zhao et al., 2017) used the fairness measurements calculated based on training data as their baseline. Specifically, they used the obtained ML model to annotate unlabeled data instances, and revealed situations in which the ML process amplified existing bias by comparing the fairness measurements on training data and those on the annotated dataset. Wang and Russakovsky (Wang and Russakovsky, 2021) showed that the bias amplification measurement proposed by Zhao et al. (Zhao et al., 2017) conflated different types of bias amplification and failed to account for varying base rates of sensitive attributes. They then proposed a new, decoupled metric for measuring bias amplification, which takes into account the base rate of each sensitive attribute and disentangles the directions of amplification. Wang et al. (Wang et al., 2019) presented a fairness testing approach for visual recognition systems that predict action labels for images containing people. They trained two classifiers to predict gender, one from the ground-truth labels and the other from the model predictions. The difference in how well the two classifiers predicted gender indicated whether the ML process introduced fairness bugs.

Fairness Testing Components

This section introduces the fairness testing literature from the perspective of “what to test.” Just as traditional software testing can be conducted on different testable parts within a software system (Laghari and Demeyer, 2018), fairness testing can also be performed on different parts, including training data, ML programs, ML models, ML frameworks, and non-ML components. Fig. 5 shows the categorization of this section. Existing studies primarily focus on testing training data, ML programs, and ML models.

ML models are developed following the data-driven paradigm. This paradigm makes ML models vulnerable to fairness bugs present in data. Specifically, fairness bugs in training data can be learned and propagated throughout the ML model development pipeline, leading to the creation of biased and unfair ML software systems. To tackle this problem, data testing approaches, which detect bugs in ML training data (Zhang et al., 2022b), have been proposed for fairness testing. They detect the bias in data features, data labels, and data distribution.

Feature bias arises when certain features in the training data exhibit a strong correlation with sensitive attributes, causing them to become the underlying source of software unfairness (Li et al., 2022). Zhang and Harman (Zhang and Harman, 2021) investigated the impact of the feature set on the fairness of ML models. Their findings demonstrated that the selection of features has a substantial influence on fairness, thereby highlighting the importance of testing data features in fairness-related work.

In order to identify the features responsible for fairness issues, it is natural to suspect that the software exhibits discrimination against a particular demographic group due to its consideration of sensitive attributes during both the training and prediction phases (Chakraborty et al., 2020a). To investigate whether the sensitive attribute serves as the underlying cause of fairness problems, Chakraborty et al. (Chakraborty et al., 2020a) conducted an experiment where they removed the sensitive attribute information from the data (i.e., fairness through unawareness). Surprisingly, they found that the resulting machine learning software demonstrated a similar degree of unfairness as observed previously.

A similar discovery was made in a real-world case: back in 2016, it was revealed that Amazon’s same-day delivery service exhibited discriminatory behavior towards neighborhoods with a disproportionately high population of Black residents (ama, 2016). Although the ML model behind the service did not explicitly incorporate race information, the presence of correlated attributes in the training data allowed for the possibility of bias. Specifically, it was found that the “Zipcode” information utilized during model training exhibited a strong correlation with race, causing the ML model to indirectly infer race information from it.

To identify non-sensitive features that could potentially contribute to fairness issues, Peng et al. (Peng et al., 2023) employed logistic regression and decision tree algorithms as models to infer the relationships between sensitive attributes and non-sensitive features. Similarly, Li et al. (Li et al., 2022) utilized linear regression to analyze the association between each feature and sensitive attributes, thereby identifying features that may introduce bias.
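A simple way to screen for such proxy features is to measure how strongly each feature is associated with the sensitive attribute. The sketch below uses the Pearson correlation coefficient as a stand-in for the regression-based analyses cited above (for a single predictor, the squared correlation equals the R² of a simple linear regression). The feature names, the data, and the 0.7 cut-off are assumptions for this example.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def proxy_features(columns: Dict[str, Sequence[float]],
                   sensitive: Sequence[float],
                   threshold: float = 0.7) -> Dict[str, float]:
    """Features whose absolute correlation with the sensitive attribute
    reaches the (assumed) threshold, with their correlation values."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, sensitive)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Toy data: the sensitive attribute is encoded as 0/1; "zip_region" tracks
# it perfectly, while "income" is unrelated to it.
sensitive = [1, 1, 1, 0, 0, 0]
columns = {
    "zip_region": [10, 10, 10, 20, 20, 20],
    "income": [30, 50, 40, 30, 50, 40],
}
flagged = proxy_features(columns, sensitive)
```

Correlation captures only linear, pairwise associations; proxies formed by combinations of features require richer models such as those used in the cited work.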

In contrast, Zhang et al. (Zhang et al., 2016a, 2017, 2019a) employed discrimination detection based on causal modeling to detect both direct and indirect discrimination within datasets. They constructed a causal graph to capture the causal relationships between attributes and outcomes. Direct discrimination was modeled as the causal effect occurring along the direct path from sensitive attributes to the outcome. On the other hand, indirect discrimination was represented by the causal effects along other paths involving non-sensitive features.

Li and Xu (Li and Xu, 2021) proposed a method to detect unknown biased attributes in a classifier that predicts a target attribute, such as gender, based on input images. The biased attribute in question is distinct from the target attribute. For instance, if a gender classifier exhibits varying predictions for female images based on their skin tones, then the skin tone attribute would be considered biased. To identify this unknown biased attribute, the authors optimized a hyperplane within the latent space of a generative model. By analyzing the transformation of synthesized counterfactual images generated by the model, human observers can interpret the semantic meaning of the biased attribute hyperplane. For example, images transitioning from “light skin” to “dark skin” indicate the presence of bias associated with skin color.

Black et al. (Black et al., 2020) utilized optimal transport projections (Villani, 2009) to map data instances from the unprivileged group to their counterparts in the privileged group. This allowed them to extract the positive flipset, which consists of unprivileged group members with favorable outcomes whose counterparts experienced unfavorable outcomes. Additionally, they computed the negative flipset, consisting of unprivileged group members with unfavorable outcomes whose counterparts experienced favorable outcomes. By analyzing the members of these flipsets, they determined which features contributed to inconsistent classifications.

1.2. Detecting Label Bias

Label bias occurs when factors unrelated to the determination of labels influence the process that generates outcome labels (Wick et al., 2019). ML models are often developed using data collected over an extended period. During the data collection process, labels are typically assigned by human annotators or algorithms, introducing the potential for human and algorithmic biases to be encoded into the labels.

To address label bias, Chakraborty et al. (Chakraborty et al., 2021, 2020a; Chakraborty et al., 2022) employed situation testing to identify biased data points and remove them from the training data. They divided the dataset into privileged and unprivileged groups based on the sensitive attribute. Two separate models were then trained, one on the data from each group. For each training instance, the predictions from both models were compared. If the two models produced divergent results, the label of that data point was considered likely to be biased.
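The two-model comparison can be sketched with any simple learner. The example below uses a nearest-centroid classifier purely as a stand-in so that the code stays self-contained; the learner, the tuple-based data layout, and the function names are assumptions and do not reflect the models used in the cited papers.

```python
from collections import defaultdict
from math import dist
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], bool, int]  # (features, privileged?, label)


def train_nearest_centroid(features: List[Sequence[float]],
                           labels: List[int]) -> Callable[[Sequence[float]], int]:
    """Fit one centroid per class and predict the class of the closest one."""
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    centroids = {
        y: tuple(sum(column) / len(column) for column in zip(*xs))
        for y, xs in by_class.items()
    }
    return lambda x: min(centroids, key=lambda y: dist(centroids[y], x))


def suspicious_labels(data: List[Row]) -> List[Row]:
    """Rows on which the privileged-group model and the unprivileged-group
    model disagree, i.e., candidates for label bias."""
    models = {}
    for flag in (True, False):
        subset = [(x, y) for x, privileged, y in data if privileged == flag]
        models[flag] = train_nearest_centroid(
            [x for x, _ in subset], [y for _, y in subset]
        )
    return [row for row in data if models[True](row[0]) != models[False](row[0])]


# Toy data: mid-range feature values are labeled 0 only in the unprivileged group.
data: List[Row] = [
    ((1.0,), True, 0), ((2.0,), True, 0), ((8.0,), True, 1), ((9.0,), True, 1),
    ((1.0,), False, 0), ((2.0,), False, 0), ((6.0,), False, 0),
    ((8.0,), False, 0), ((9.0,), False, 1),
]
suspects = suspicious_labels(data)
```

Disagreement between the two models is only a signal that a label deserves scrutiny, not proof that it is wrong.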

Chen and Joo (Chen and Joo, 2021) utilized facial action units, objective indicators of fundamental muscle actions associated with different facial expressions, to detect label bias in widely-used datasets for facial expression recognition. They demonstrated that many expression datasets exhibited significant label bias between different gender groups, particularly concerning expressions of happiness and anger. Furthermore, they found that conventional fairness repair methods were unable to completely mitigate such biases in trained models.

1.3. Detecting Selection Bias

Selection bias occurs when the process of sampling training data introduces an unexpected correlation between sensitive attributes and the outcome (Wick et al., 2019). For example, the COMPAS dataset (com, 2016), widely studied in the fairness literature, has been shown to exhibit unintended correlations between race and recidivism (Wick et al., 2019). This dataset was collected during a specific time period (2013 to 2014) and from a particular county in Florida, with its inherent policing patterns, making it susceptible to the introduction of unintentional correlations.

Researchers primarily employ distribution testing to detect selection bias in the data. Chen et al. (Chen et al., 2022) tested whether the training data satisfied the “We Are All Equal” worldview, which assumes that there should be no statistical association between the outcome and the sensitive attribute. They specifically examined whether the favorable rates of privileged and unprivileged groups were equal. Chakraborty et al. (Chakraborty et al., 2021; Chakraborty et al., 2022) not only analyzed the disparity in favorable rates between privileged and unprivileged groups but also compared the numbers of data instances in the two groups.

Kärkkäinen and Joo (Kärkkäinen and Joo, 2021) detected bias in public face datasets, revealing a strong bias toward Caucasian faces while other racial groups (e.g., Latino) were significantly underrepresented. Such biases increase the risk of introducing fairness issues in facial analytic systems and limit their applicability.

Similarly, Torralba and Efros (Torralba and Efros, 2011) investigated computer vision datasets to evaluate whether existing datasets are genuinely unbiased representations of the real world. They assessed how well an object detector trained on one dataset generalized when tested on a representative set of other datasets.

Yang et al. (Yang et al., 2022) collected perceived demographic attributes on a popular face detection dataset (wid, 2022) and observed skewed demographic distributions. Face detectors trained on this dataset exhibited demographic bias, as measured by performance disparities among different groups.

Wang et al. (Wang et al., 2020a) detected selection bias in visual datasets across three dimensions: object-based bias, gender-based bias, and geography-based bias. Object-based detection considered statistics related to object size, frequency, context, and diversity of object representation. Gender-based detection revealed the stereotypical portrayal of individuals of different genders. Geography-based detection focused on the representation of different geographic locations.

Mambreyan et al. (Mambreyan et al., 2022) analyzed datasets used for lie detection and discovered significant sex bias within them. Specifically, the percentage of instances labeled as lies for females was greater than that for males in the dataset. They further examined the effect of this bias on lie detection, training a classifier to predict the sex of the identity in a video and using sex as a proxy for lies (predicting lies for females and truth for males). This deception detector simulated a classifier that relied solely on selection bias. The results demonstrated that the performance of this biased classifier was comparable to the state-of-the-art, suggesting that recent techniques claiming near-perfect results may exploit selection bias.

2. ML Program Testing

An ML program includes various parts such as data processing, decision-making logic, and run-time configurations (e.g., ML hyper-parameters) (Chen et al., 2020a). Each of these parts can potentially introduce a discordance between the existing fairness conditions and the desired ones for the final ML software system. Testing the ML program can help identify fairness issues within its implementation.

In ML programs, it is common to include data processing scripts to manipulate and transform training data for downstream learning tasks. The processing can have a significant impact on the fairness of the software.

Biswas and Rajan (Biswas and Rajan, 2021) and Valentim et al. (Valentim et al., 2019) have investigated the introduction of fairness bugs through data processing methods using causal reasoning. They systematically intervened in the development process of ML software by applying different commonly-used data processing methods while keeping other settings unchanged. Their findings indicate that certain pre-processing methods indeed introduce fairness bugs, while other methods may improve software fairness.

Caton et al. (Caton et al., 2022) have observed that real-world datasets often contain missing values, and one common approach to address this issue is to impute the missing values using various techniques during the data processing phase. To evaluate the impact of different imputation strategies on fairness outcomes, they conducted tests and examined the resulting fairness implications.

2.2. Testing Hyper-parameters

The hyper-parameters specified in ML programs can affect fairness. To validate this, researchers explore whether different hyper-parameter settings lead to varying levels of software fairness. This testing process is treated as a search-based problem that aims to discover optimal settings within the hyper-parameter space.

Chakraborty et al. (Chakraborty et al., 2019, 2020a) proposed Fairway, a method that combines situation testing with multi-objective optimization. Since there is often a trade-off between fairness and ML performance (such as accuracy) (Corbett-Davies et al., 2017), Fairway employs sequential model-based optimization (Nair et al., 2020) to search for hyper-parameters that maximize software fairness while minimizing any negative impact on other performance measures.

Similarly, Tizpaz-Niari et al. (Tizpaz-Niari et al., 2022) considered both fairness and accuracy in their approach. They introduced Parfait-ML, which offers three dynamic search algorithms (independently random, black-box evolutionary, and gray-box evolutionary) to approximate the Pareto front of hyper-parameters that balance fairness and accuracy. Parfait-ML not only provides a statistical method to identify hyper-parameters that systematically influence fairness but also incorporates a fairness repair method to discover improved hyper-parameter configurations that simultaneously enhance fairness and accuracy.
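Once candidate configurations have been evaluated, the trade-off between fairness and accuracy can be summarized by their Pareto front. The sketch below filters a list of already-evaluated trials; it does not implement any of the search algorithms named above, and the `Trial` type, the `max_depth` hyper-parameter, and the scores are hypothetical.

```python
from typing import Any, Dict, List, NamedTuple


class Trial(NamedTuple):
    config: Dict[str, Any]
    accuracy: float    # higher is better
    unfairness: float  # lower is better (e.g., an absolute SPD or AOD)


def dominates(a: Trial, b: Trial) -> bool:
    """a dominates b if it is no worse on both objectives and strictly
    better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.unfairness <= b.unfairness
    strictly_better = a.accuracy > b.accuracy or a.unfairness < b.unfairness
    return no_worse and strictly_better


def pareto_front(trials: List[Trial]) -> List[Trial]:
    """Trials that are not dominated by any other trial."""
    return [t for t in trials if not any(dominates(o, t) for o in trials)]


# Hypothetical evaluation results for four hyper-parameter settings.
trials = [
    Trial({"max_depth": 2}, accuracy=0.75, unfairness=0.12),
    Trial({"max_depth": 3}, accuracy=0.80, unfairness=0.10),
    Trial({"max_depth": 5}, accuracy=0.85, unfairness=0.20),
    Trial({"max_depth": 8}, accuracy=0.84, unfairness=0.25),
]
front = pareto_front(trials)
```

The remaining configurations represent different compromises between the two objectives; choosing among them is a requirements decision rather than a testing one.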

Gohar et al. (Gohar et al., 2023) conducted fairness testing on hyper-parameters in ensemble learning. As ensemble hyper-parameters are more intricate due to their impact on how learners are combined within different ensemble categories, the researchers investigated the effects of ensemble hyper-parameters on fairness. They also presented how to design fair ensembles using ensemble hyper-parameters.

2.3. Testing Fairness Repair Algorithms

As fairness becomes an increasingly important requirement for software systems, engineers may incorporate fairness repair algorithms (also known as bias mitigation algorithms) into their programs to ensure fairness. Researchers focus on testing whether these fairness repair algorithms effectively reduce fairness bugs without introducing side effects such as a decrease in accuracy.

Biswas and Rajan (Biswas and Rajan, 2020) applied seven fairness repair algorithms to 40 top-rated ML models collected from a crowdsourced platform. They compared individual fairness, group fairness, and ML performance before and after applying these algorithms.

Qian et al. (Qian et al., 2021) applied fairness repair techniques to five widely-adopted ML tasks and examined the variance of fairness and ML performance associated with these techniques. They investigated whether identical runs with a fixed seed produced different results. The findings indicated that most fairness repair techniques had undesirable impacts on the ML software, such as reducing accuracy, increasing fairness variance, or increasing accuracy variance.

Zhang and Sun (Zhang and Sun, 2022) evaluated existing fairness repair techniques on DNNs and discovered that while these techniques improved fairness, they often resulted in a significant drop in accuracy. In some cases, fairness and accuracy were both worsened. They proposed an adaptive approach that selects the fairness repair method for a DNN based on causality analysis (Sun et al., 2022a).

Hort et al. (Hort et al., 2021) introduced a benchmarking framework called Fairea. Prior work often measured the impacts of fairness repair algorithms on fairness and ML performance separately, making it unclear whether the improved fairness was solely due to the unavoidable loss in ML performance. Fairea addressed this issue by providing a unified baseline to evaluate and compare the fairness-performance trade-off of different repair methods.
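One way to picture such a baseline is to degrade a model's predictions on purpose and record how accuracy and bias move together. The following Python sketch is loosely modeled on that idea: it overwrites a growing share of predictions with a constant label and reports one (fraction, accuracy, bias) point per step. The function names, the parity-gap bias measure, the constant label, and the toy data are all assumptions made for illustration; this is not Fairea's implementation.

```python
import random


def accuracy(preds, labels):
    """Share of predictions that match the ground-truth labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def parity_gap(preds, groups):
    """Absolute difference in favorable-outcome rates between groups 0 and 1."""
    def rate(g):
        selected = [p for p, gi in zip(preds, groups) if gi == g]
        return sum(selected) / len(selected)
    return abs(rate(0) - rate(1))


def mutation_baseline(preds, labels, groups, fractions, constant=0, seed=0):
    """For each fraction, overwrite that share of predictions with a constant
    label and record the resulting (fraction, accuracy, bias) point."""
    rng = random.Random(seed)
    order = list(range(len(preds)))
    rng.shuffle(order)  # fixed random order in which predictions are overwritten
    points = []
    for frac in fractions:
        mutated = list(preds)
        for i in order[: round(frac * len(preds))]:
            mutated[i] = constant
        points.append((frac, accuracy(mutated, labels), parity_gap(mutated, groups)))
    return points


# Hypothetical labels, predictions, and binary group membership.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
baseline = mutation_baseline(preds, labels, groups, fractions=[0.0, 0.5, 1.0])
for point in baseline:
    print(point)
```

A repair method would then be judged against these points: reducing bias is only informative if it costs less accuracy than the trivial degradation does.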

Chen et al. (Chen et al., 2023) utilized Fairea to conduct a large-scale, comprehensive empirical evaluation of 17 representative bias mitigation methods from both the ML and SE communities. They evaluated these methods across 12 ML performance metrics, 4 fairness metrics, and 24 types of fairness-performance trade-off measurements.

Hort and Sarro (Hort and Sarro, 2021) observed another side effect of fairness repair: it could lead to the loss of discriminatory behaviors of anti-protected attributes. Anti-protected attributes refer to attributes on which one might want the ML decision to depend (e.g., students with completed homework should receive higher grades).

Orgad et al. (Orgad et al., 2022; Orgad and Belinkov, 2022) evaluated fairness repair approaches for NLP models from two aspects: extrinsic bias (performance difference across different demographic groups) and intrinsic bias (bias in models’ internal representations, e.g., sentence embeddings). They found that the two types of bias may not be correlated, and the choice of bias measurement and dataset can significantly affect the evaluation results.

2.4. Testing Compression Algorithms

A computation-intensive DL model can be efficiently executed on PC platforms with GPU support, but it cannot be directly deployed and executed on platforms with limited computing power, such as mobile devices (Chen et al., 2020a). To address this issue, model compression algorithms have been proposed to represent DL models in a smaller size with minimal impact on their performance (Chen et al., 2021). Common model compression techniques include quantization (representing weight values of DL models using smaller data types), pruning (eliminating redundant weights that contribute little to the model’s behavior), and knowledge distillation (transferring knowledge from a large model to a smaller one) (Cheng et al., 2017). The widespread adoption of model compression for DL models has motivated researchers to detect fairness bugs introduced by these algorithms.

Since model compression is often applied to large DL models, existing fairness testing of model compression algorithms typically focuses on complex NLP models (Blakeney et al., 2021) and computer vision models (Hooker et al., 2020; Stoychev and Gunes, 2022; Blakeney et al., 2021; Lin et al., 2022b). Hooker et al. (Hooker et al., 2020) demonstrated that pruning and quantization can amplify gender bias when classifying hair color in a computer vision dataset. Xu and Hu (Xu and Hu, 2022) tested the effect of distillation and pruning on bias in generative language models and provided empirical evidence that distilled models exhibited less bias. Stoychev and Gunes (Stoychev and Gunes, 2022) detected fairness bugs introduced by different model compression algorithms in various facial expression recognition systems but did not observe consistent findings across different systems.

3. Model Testing

Most existing fairness testing techniques primarily focus on the evaluation of individual ML models (Galhotra et al., 2017; Fan et al., 2022; Udeshi et al., 2018; Aggarwal et al., 2019; Xie and Wu, 2020; Zhang et al., 2020; Zhang et al., 2021c; Zheng et al., 2022). These techniques can be directly applied to the final ML models obtained, using either a black-box or white-box approach. The distinction between white-box testing and black-box testing lies in the level of access to training data and internal knowledge of the ML models.

Black-box model testing is a technique used to detect fairness issues in ML models without relying on access to training data or knowledge of the internal model structure. This approach primarily relies on analyzing the behavior of the model based on the input space.

Fairness testing in the field typically relies on statistical measurements to identify fairness bugs in black-box models based on their prediction behaviors. For instance, Tramèr et al. (Tramèr et al., 2017) detected fairness bugs by examining the associations between prediction outcomes and sensitive attributes. Similarly, Bae and Xu (Bae and Xu, 2022) compared the performance of pedestrian trajectory prediction models across different demographic groups to uncover performance disparities.
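As a minimal sketch of such a statistical measurement, the following Python snippet computes the difference in favorable-outcome rates between two demographic groups from black-box predictions alone. The function name, the toy data, and the choice of statistical parity difference as the measurement are illustrative assumptions, not the procedure of any specific tool cited above.

```python
def statistical_parity_difference(predictions, groups, privileged):
    """Favorable-outcome rate of the unprivileged group minus that of the
    privileged group.

    predictions: list of 0/1 model outputs (1 = favorable outcome)
    groups: list of sensitive-attribute values, aligned with predictions
    privileged: the value in `groups` that denotes the privileged group
    """
    priv = [p for p, g in zip(predictions, groups) if g == privileged]
    unpriv = [p for p, g in zip(predictions, groups) if g != privileged]
    if not priv or not unpriv:
        raise ValueError("both groups must be non-empty")
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)


# Hypothetical predictions collected by querying a black-box model.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
sex = ["m", "m", "m", "m", "f", "f", "f", "f"]
spd = statistical_parity_difference(preds, sex, privileged="m")
print(spd)  # -0.5: the unprivileged group receives the favorable outcome less often
```

A value of zero would indicate equal favorable-outcome rates; how far from zero counts as a fairness bug is a threshold the tester has to choose.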

Additionally, fairness testing approaches often leverage metamorphic relations to detect fairness bugs by applying transformations to software inputs. These transformations aim to identify unexpected changes in the model’s predictions. Many of these techniques employ black-box testing methodologies. For example, the Themis tool (Galhotra et al., 2017; Angell et al., 2018) generates random test inputs and checks if the software system produces consistent outputs for individuals who differ only in sensitive attribute values. Similarly, Aequitas (Udeshi et al., 2018) and ExpGA (Fan et al., 2022) search the input space of the software for discriminatory instances that reveal unfair predictions.
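The metamorphic relation described above can be sketched in a few lines of Python: sample random inputs, change only the sensitive attribute, and count how often the prediction changes. This is a simplified illustration in the spirit of random black-box test generation, not the implementation of Themis, Aequitas, or ExpGA; the function name, the attribute domains, and both toy models are assumptions.

```python
import random


def discrimination_rate(model, domains, sensitive, n_trials=1000, seed=0):
    """Estimate the fraction of random inputs whose prediction changes when
    only the sensitive attribute is altered."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Draw one random input from the declared attribute domains.
        x = {name: rng.choice(values) for name, values in domains.items()}
        baseline = model(x)
        # Metamorphic relation: changing only the sensitive attribute
        # should leave the output unchanged.
        alternatives = [v for v in domains[sensitive] if v != x[sensitive]]
        if any(model({**x, sensitive: v}) != baseline for v in alternatives):
            hits += 1
    return hits / n_trials


# Hypothetical loan models used only to exercise the test harness.
def fair_model(x):
    return x["income"] >= 50


def biased_model(x):
    return x["income"] >= 50 or (x["sex"] == "male" and x["income"] >= 40)


domains = {"income": [20, 30, 40, 50, 60], "sex": ["male", "female"]}
fair_rate = discrimination_rate(fair_model, domains, "sex")
biased_rate = discrimination_rate(biased_model, domains, "sex")
print(fair_rate, biased_rate)
```

Search-based tools replace the uniform sampling step with a guided search so that discriminatory inputs are found with fewer model queries.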

Black-box testing is commonly used to detect fairness bugs in complex software systems, including NLP, computer vision, and ranking systems, where the internal workings of the system are not fully visible to the testers.

Researchers have developed various text templates to uncover fairness bugs in different NLP systems, such as sentiment analysis (Kiritchenko and Mohammad, 2018; Asyrofi et al., 2021), machine translation (Wang et al., 2022b), text generation (Sheng et al., 2019; Huang et al., 2020; Dhamala et al., 2021), natural language inference (Sharma et al., 2020), named entity recognition (Mehrabi et al., 2020a), and conversational systems (Wan et al., 2023). Further details regarding these studies can be found in Section 4.1.4.

For computer vision systems, state-of-the-art fairness techniques (Goodfellow et al., 2014; Denton et al., 2019a; Karras et al., 2018; Joo and Kärkkäinen, 2020; Zhang et al., 2021a; Denton et al., 2019b; Dash et al., 2022) often utilize GAN-based algorithms to generate images that differ in sensitive attributes. These techniques then check if the computer vision systems make different decisions for equivalent image mutants. More information about these techniques is available in Section 4.1.4.

Ranking systems, being predominantly black-box, are also tested in a black-box manner (Singh and Joachims, 2018; Yang and Stoyanovich, 2017; Kuhlman et al., 2019; Narasimhan et al., 2020; Beutel et al., 2019). For instance, researchers have measured whether different demographic groups have proportional representation in top ranking positions based on the system’s ranking outputs. FairRec (Guo et al., 2023) assesses recommendation differences across demographic groups using statistical measurements such as recommendation performance, alignment of recommended item popularity with user preferences, and diversity in recommendations.
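A proportional-representation check of this kind needs only the ranking output and the group membership of each ranked item. The sketch below compares each group's share of the top-k positions with its share of the whole ranking; the function name, the candidate identifiers, and the group labels are hypothetical, and the measure is a simplification of those used in the cited work.

```python
from collections import Counter


def top_k_representation_gap(ranking, group_of, k):
    """For each group, return its share of the top-k positions minus its
    share of the whole ranking (0 means proportional representation)."""
    overall = Counter(group_of[item] for item in ranking)
    top = Counter(group_of[item] for item in ranking[:k])
    return {g: top[g] / k - overall[g] / len(ranking) for g in overall}


# Hypothetical ranking output, best candidate first.
ranking = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
group_of = {"c1": "a", "c2": "a", "c3": "b", "c4": "a",
            "c5": "b", "c6": "b", "c7": "a", "c8": "b"}
gaps = top_k_representation_gap(ranking, group_of, k=4)
print(gaps)  # group "a" is over-represented in the top 4, group "b" under-represented
```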

Some black-box testing techniques approximate the behavior of the black-box software using a white-box model, allowing the application of white-box testing techniques. For example, Aggarwal et al. (Aggarwal et al., 2019) approximated the decision-making process of the black-box ML software using a decision tree constructed through a local model explainer. They then employed symbolic execution-based test input generation to discover discriminatory inputs. Sharma and Wehrheim (Sharma and Wehrheim, 2020) first approximated the black-box software with a white-box model based on its behaviors. They subsequently developed a property-based testing mechanism for fairness checking, where specific fairness requirements can be specified using an assume-assert construct. Test cases were automatically generated to attempt to violate the specified fairness property.

3.2. White-box Model Testing

White-box model testing aims to identify fairness bugs by examining either the training data or the internal structure and information of the ML model that is accessible to test engineers.

Some approaches leverage training data to uncover unfair predictions, without accessing the internals of ML models. Chakraborty et al. (Chakraborty et al., 2020b) proposed an explanation method based on k-nearest neighbors to detect bias in ML software predictions. They identified instances predicted unfavorably and examined their k-nearest neighbors with favorable labels from the training data. By comparing the distribution of these neighbors with the test instance, they determined and explained the presence of bias.

Zhao et al. (Zhao et al., 2017) used fairness measurements from training data as a baseline to identify bias amplification. They annotated unlabeled data using the ML model and compared fairness measurements between the training data and annotated dataset to reveal bias amplification.
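The comparison can be illustrated with a simplified co-occurrence score: the fraction of instances with a given outcome that belong to a given group, computed once on the training data and once on the model's annotations. The sketch below is an assumption-laden simplification for illustration (the function names and the toy outcome and group labels are hypothetical), not the exact formulation used by the authors.

```python
from collections import Counter


def bias_score(pairs, outcome, group):
    """Fraction of instances with `outcome` that co-occur with `group`.

    pairs: list of (outcome, group) tuples.
    """
    counts = Counter(pairs)
    total = sum(c for (o, _), c in counts.items() if o == outcome)
    return counts[(outcome, group)] / total if total else 0.0


def bias_amplification(train_pairs, predicted_pairs, outcome, group):
    """Positive values mean the model's annotations are more skewed toward
    `group` than the training data was."""
    return (bias_score(predicted_pairs, outcome, group)
            - bias_score(train_pairs, outcome, group))


# Hypothetical (activity, gender) pairs from training data and model output.
train_pairs = [("cooking", "woman")] * 3 + [("cooking", "man")] * 1
predicted_pairs = [("cooking", "woman")] * 7 + [("cooking", "man")] * 1
amplification = bias_amplification(train_pairs, predicted_pairs, "cooking", "woman")
print(amplification)  # 0.125: the skew grows from 0.75 to 0.875
```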

Wang and Russakovsky (Wang and Russakovsky, 2021) highlighted issues with Zhao et al.’s bias amplification measurement and proposed a new, decoupled metric that considers varying base rates of sensitive attributes.

Cabrera et al. (Cabrera et al., 2019) developed FAIRVIS, a testing tool for subgroup fairness. It efficiently searches for potential issues among numerous subgroups, clustering the training dataset to identify statistically similar subgroups and calculating group fairness metrics using an entropy-based approach. FAIRVIS presents generated subgroups sorted by their group fairness metrics.

Patel et al. (Patel et al., 2022) utilized combinatorial t-way testing (Kuhn et al., 2013) for fairness testing. This coverage-based data sampling method generates diverse datasets by applying logical constraints. They created an input parameter model from the training data and used it to generate a t-way test set. Discriminatory instances were identified by mutating protected attributes in each test.
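For t = 2, the idea can be sketched as follows: build a small test set in which every pair of attribute values appears at least once, then mutate the protected attribute in each test and compare predictions. The greedy generator below enumerates the full Cartesian product, so it only suits small input parameter models; it stands in for the dedicated covering-array tools used in practice. The attribute domains, the function names, and the toy model are assumptions for illustration.

```python
from itertools import combinations, product


def pairwise_test_set(domains):
    """Greedily select tests until every pair of attribute values is covered."""
    names = list(domains)

    def pairs_of(test):
        return {((a, test[a]), (b, test[b])) for a, b in combinations(names, 2)}

    candidates = [dict(zip(names, values)) for values in product(*domains.values())]
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    tests = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        tests.append(best)
        uncovered -= pairs_of(best)
    return tests


def discriminatory_tests(model, tests, domains, protected):
    """Return the tests whose prediction changes when only `protected` is mutated."""
    found = []
    for t in tests:
        if any(model({**t, protected: v}) != model(t)
               for v in domains[protected] if v != t[protected]):
            found.append(t)
    return found


# Hypothetical input parameter model and model under test.
domains = {"age": ["young", "middle", "senior"],
           "sex": ["male", "female"],
           "job": ["skilled", "unskilled"]}


def model(x):
    return x["job"] == "skilled" or (x["sex"] == "male" and x["age"] == "middle")


tests = pairwise_test_set(domains)
found = discriminatory_tests(model, tests, domains, "sex")
print(len(tests), found)
```

Because every age and job pairing is covered, at least one test combines the middle age band with an unskilled job, which is where this toy model treats the two sexes differently.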

Several techniques (Zhang et al., 2020, 2021b; Tao et al., 2022; Zheng et al., 2022; Zhang et al., 2021c, a) utilize gradient information, which represents the direction of steepest ascent in the loss function, to generate test inputs for fairness testing. ADF (Zhang et al., 2020, 2021b) focuses on discriminatory instances near the decision boundary of DNNs and employs gradients to guide the search for neighboring test inputs. EIDIG (Zhang et al., 2021c) reduces gradient calculations to accelerate the search process.
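To make the role of gradients concrete, the following Python sketch runs a deliberately simplified gradient-guided search on a hand-written logistic model. Starting from a seed input, it moves the non-sensitive features along the sign of the loss gradient, which pushes the input toward the decision boundary, until flipping the sensitive feature changes the predicted label. It is not the ADF or EIDIG algorithm; the model weights, the step size, and all function names are assumptions.

```python
import math


def predict_proba(w, b, x):
    """Probability of the favorable class under a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def is_discriminatory(w, b, x, s):
    """True if flipping the binary sensitive feature s changes the predicted label."""
    x_flipped = list(x)
    x_flipped[s] = 1 - x_flipped[s]
    return (predict_proba(w, b, x) >= 0.5) != (predict_proba(w, b, x_flipped) >= 0.5)


def gradient_guided_search(w, b, seed, s, step=1.0, max_iters=50):
    """Perturb non-sensitive features along the loss gradient until a
    discriminatory instance is found, or give up after max_iters steps."""
    x = list(seed)
    for _ in range(max_iters):
        if is_discriminatory(w, b, x, s):
            return x
        p = predict_proba(w, b, x)
        label = 1.0 if p >= 0.5 else 0.0
        # Gradient of the cross-entropy loss w.r.t. x for a logistic model.
        grad = [(p - label) * wi for wi in w]
        # Ascend the loss on non-sensitive features only, moving x toward the boundary.
        for i, g in enumerate(grad):
            if i != s and g != 0:
                x[i] += step if g > 0 else -step
    return None


# Hypothetical model: feature 0 is a score, feature 1 is a binary sensitive attribute.
w, b, s = [0.5, 2.0], -5.25, 1
seed = [2, 0]
found = gradient_guided_search(w, b, seed, s)
print(found)
```

Random search over the same space would spend most queries far from the boundary, where flipping the sensitive feature cannot change the label.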

Researchers have also developed various methods to detect and analyze neurons responsible for unfair outcomes in emerging deep learning (DL) models. NeuronFair (Zheng et al., 2022) utilizes neuron analysis to identify biased neurons contributing to unfairness, and generates discriminatory instances to amplify the activation differences of these biased neurons. It demonstrates strong interpretability, generation effectiveness, and data generalization.

DeepFAIT (Zhang et al., 2021a) employs significance testing to identify fairness-related neurons by analyzing the activation differences between privileged and unprivileged groups.

Vig et al. (Vig et al., 2020) apply Causal Mediation Analysis (CMA) to identify causally implicated parts, such as neurons or attention heads, in the unfair predictions of a DNN model. CMA measures the direct and indirect effects of targeted neurons on the final unfair predictions, considering each neuron as an intermediate mediator.

Tian et al. (Tian et al., 2020) introduce a novel statistical measurement using neuron activation in DNNs. They compute neuron activation vectors for each label class based on test inputs and calculate the distance between these vectors to assess fairness. Dissimilar distances with respect to a third class indicate the presence of fairness bugs in the tested DNN.

Gao et al. (Gao et al., 2022) propose FairNeuron for DNNs, which employs neuron slicing to identify conflict paths containing neurons that rely on sensitive attributes for predictions. Biased instances triggering the selection of sensitive attributes are identified using these paths, and the model is retrained through selective training. FairNeuron ensures that the conflict paths learn all important features for prediction instead of biased ones for the identified biased instances, while retaining the original training approach for other instances.

Additionally, Zhang et al. (Zhang et al., 2022a) developed a method to identify and rectify fairness-related paths in decision tree and random forest models. They employed a MaxSMT solver to determine the paths that could be altered while satisfying fairness and semantic difference constraints. The identified paths were refined by modifying the leaf labels, resulting in a repaired fair model.

Datasets and Tools

Based on the collected papers, this section summarizes the public datasets and open source tools for fairness testing to provide a quick navigation for researchers and practitioners.

This section lists the public datasets in the literature based on our collected papers. Table 4 provides detailed information about these datasets, including their sizes, data types, sensitive attributes, usage scenarios, and access links. Many of these datasets comprise tabular data, making them suitable for traditional ML classifiers.

In recent years, there has been a surge in the availability of text and image datasets, driven by the growing popularity of natural language processing and computer vision. These datasets are often sourced from social media platforms like Twitter. It is worth noting that certain datasets come with specific usage constraints that researchers must consider when utilizing them. For instance, the well-known image dataset CelebA (Cel, 2015) is restricted to non-commercial research purposes only.

For a comprehensive overview of the existing fairness datasets, we recommend referring to the works of Le Quy et al. (Quy et al., 2021) and Fabris et al. (Fabris et al., 2022). Le Quy et al. surveyed tabular datasets specifically for fairness research, while Fabris et al. expanded the survey to include unstructured data such as text and images. Their surveys encompass datasets from diverse domains, including social sciences, computer vision, health, economics, business, and linguistics.

2. Open-source Testing Tools

There has been a recent proliferation of open-source tools for supporting fairness testing. Nevertheless, Lee and Singh (Lee and Singh, 2021) demonstrated that there is a steep learning curve for practitioners to use these fairness tools. Presently, there is a lack of guidance on tool adoption (Lee and Singh, 2021).

To address this gap, we provide a summary of 42 open-source fairness testing tools in this section, aiming to assist fairness researchers and practitioners in selecting the most suitable tools. The details of these tools are presented in Table 5. The table includes fairness testing tools for various domains, including general ML software (e.g., FairTest (Tramèr et al., 2017) and Themis (Galhotra et al., 2017)), DL software (e.g., ADF (Zhang et al., 2020) and EIDIG (Zhang et al., 2021c)), natural language processing software (e.g., ASTRAEA (Soremekun et al., 2022b) and BiasFinder (Asyrofi et al., 2021)), computer vision software (e.g., REVISE (Wang et al., 2020a)), speech recognition software (e.g., AequeVox (Rajan et al., 2022)), and recommendation systems (e.g., DialogueFairness (Liu et al., 2020) and BiasAsker (Wan et al., 2023)).

Research Trends and Distributions

In Figure 1, we have shown that fairness testing is experiencing a dramatic increase in the number of publications. This section further analyzes the research trends and distributions of fairness testing.

We first describe research trends in terms of the research communities engaging in fairness testing. Since 2017, an increasing number of research communities have dedicated their efforts to studying fairness testing. Notable contributions include:

Galhotra et al. (Galhotra et al., 2017) introduced the first fairness testing approach for ML software in the SE community, receiving the Distinguished Paper Award at ESEC/FSE 2017.

Zhao et al. (Zhao et al., 2017) detected bias in datasets and ML models for visual recognition tasks, earning the Best Paper Award at EMNLP 2017.

Díaz et al. (Díaz et al., 2018) identified age-related bias in sentiment analysis systems, receiving the Best Paper Award at CHI 2018.

Ribeiro et al. (Ribeiro et al., 2020) proposed CheckList, a task-agnostic methodology for testing NLP models, including fairness testing, which received the Best Paper Award at ACL 2020.

Chakraborty et al. (Chakraborty et al., 2021) addressed selection bias and label bias in training data, and were honored with the Distinguished Paper Award at ESEC/FSE 2021.

These five best paper awards, as well as being a credit to their authors, also demonstrate both the significant level of interest and the high quality of research on fairness in three different research communities:

Software Engineering (ESEC/FSE)

Natural Language Processing (EMNLP and ACL)

Human-Computer Interaction (CHI)

Figure 6 illustrates the distribution of the collected papers across various research venues. The majority of fairness testing papers (49%) are published in software engineering venues, including ICSE, ESEC/FSE, ASE, ISSTA, and IEEE TSE. Artificial intelligence venues, such as ICML, NeurIPS, IJCAI, ACL, EMNLP, CVPR, ECCV, and KDD, account for 42% of the fairness testing papers.

Furthermore, our survey reveals that fairness testing is gaining traction in other research communities, such as the computer security, human-computer interaction, mobile computing, and database communities. This highlights the broad audience and significance of our survey across multiple disciplines.

2. Machine Learning Categories

In this section, we explore the research trend of fairness testing across different ML categories. Following previous work (Zhang et al., 2022b), we categorize the gathered papers into two groups: those that concentrate on DL software and those that address general ML software.

Out of the total number of papers, 50 papers (50%) focus on conducting fairness testing for DL software, 41 papers (41%) specifically target general ML software, and 8 papers (8%) consider both traditional ML software and DL software. The significant volume of publications on fairness testing for DL software can be attributed to several factors. On one hand, DL has gained widespread adoption and is being utilized in a diverse range of software applications, generating significant interest from the research community. On the other hand, compared to traditional ML algorithms such as regression and decision trees, DL models are less interpretable (Baranyi et al., 2020), making it more challenging to directly reason about fairness.

To gain further insights, we analyze the publication trends for both categories over the years. Figure 7 depicts the number of papers focusing on fairness testing in general ML and DL per year. Our analysis reveals a clear shift in research focus, with a transition from testing general ML software to testing DL software. Prior to 2019, fairness testing research primarily concentrated on general ML. However, since 2019, the number of papers specifically addressing DL has experienced a notable surge, surpassing the publications on general ML.

3. Data Types

In this section, we investigate the research trends of fairness testing in applications that involve different types of data.

Out of the total number of papers (100) that we have collected, 5 of them consider more than one data type. We count each of these papers for each data type they examine, allowing us to analyze the distribution across various data types. The findings are illustrated in Figure 9.

Our analysis indicates that among the publications we have collected, a significant portion focuses on testing software applications that utilize tabular data as inputs, accounting for 55% of the total. Furthermore, approximately 24% of publications address fairness testing in applications involving text inputs, while another 24% specifically tackle fairness issues in applications utilizing image inputs.

It is important to note that fairness testing for other data types, such as speech and recommendation systems, has not yet received extensive investigation, representing only a small percentage of the publications, approximately 1% each. These data types remain relatively underexplored in the context of fairness testing, emphasizing the need for further research and attention to ensure fairness across a wider range of data-driven applications.

We also plot the data type distribution for SE publications. Figure 8 shows the results. In comparison to the broader research landscape, the SE community predominantly focuses on applications involving tabular data, which account for 79.2% of fairness testing publications in SE venues. In contrast, publications addressing text- or image-based problems account for only 16.7% and 8.3%, respectively. Both shares are lower than the corresponding 24% observed across all collected publications.

4. Fairness Categories

Figure 10 provides insights into the distribution of different fairness categories within the fairness testing literature. Our analysis reveals that comparable research efforts have been dedicated to exploring various aspects of fairness testing.

Specifically, we observe that 46% of fairness testing papers focus on group fairness, addressing issues related to fairness among different groups. Another 46% of papers concentrate on individual fairness, investigating fairness concerns at the individual level. The remaining 8% of papers examine both group fairness and individual fairness, recognizing the importance of considering both perspectives in fairness testing research.

The finding may initially appear contradictory when comparing it with Table 3, where we observe that nearly all test generation techniques are proposed for individual fairness. However, this discrepancy can be attributed to the distinct characteristics of individual fairness and group fairness.

When examining group fairness, researchers can leverage real-world data to construct test inputs that capture fairness considerations among different groups. This availability of data facilitates the exploration of group fairness in testing scenarios. On the other hand, for individual fairness, it becomes challenging for researchers to identify pairs of instances that satisfy the specific input requirements associated with individual fairness. For instance, it is not always straightforward to find two individuals who differ solely in a sensitive protected attribute, making test input generation more focused on individual fairness.

Despite these differences in test input generation, it is important to note that fairness testing studies overall investigate group fairness and individual fairness at a similar level. The distribution of research efforts aims to address both perspectives, acknowledging the significance of both group and individual fairness in testing methodologies.

5. Testing Manners

Finally, we categorize existing fairness testing techniques based on the software testing approach employed, distinguishing between white-box and black-box methods. It is worth noting that although Tizpaz-Niari et al. (Tizpaz-Niari et al., 2022) proposed black-box and gray-box techniques for testing hyper-parameters, we consider their approach as white-box because it requires access to the training data for model training and fairness evaluation.

In our analysis, we find that 53% of the 100 papers employ black-box testing techniques, while 47% adopt white-box testing methods. This observed gap between black-box and white-box testing proportions is reasonable.

Compared to black-box testing, white-box testing necessitates access to either training data or the internal workings of software systems. However, fairness testing often applies to systems that are human-related and have social significance, making it challenging to disclose internal information to the public due to privacy concerns or legal policies. Therefore, it is reasonable that more research studies focus on black-box testing approaches, which do not require detailed knowledge of the internal system mechanisms.

Research Opportunities

Fairness testing remains in a relatively embryonic state. Research in this area is experiencing rapid growth, so there are plenty of open research opportunities. In this section, we outline the challenges for fairness testing and present promising research directions and open problems.

Fairness testing in the absence of sensitive attribute information. Existing fairness testing techniques rely on the existence of sensitive attributes, but in practice, this information might be unavailable or imperfect for many reasons (Awasthi et al., 2021). On the one hand, the data may be collected in a setting where the sensitive attribute information is unnecessary, undesirable, or even illegal, considering recently released regulations such as GDPR (General Data Protection Regulation) (Voigt and Von dem Bussche, 2017) and CCPA (California Consumer Privacy Act) (Goldman, 2020). On the other hand, users may withhold or modify sensitive attribute information, for example, due to privacy concerns or other personal preferences.

To tackle this issue, a straightforward solution is to first use existing demographic information inference techniques (e.g., gender inference, race inference, and age inference) to infer the sensitive attribute and then apply fairness testing techniques. However, existing inference techniques may not be fully satisfactory, and their application scenarios remain limited (Chen et al., 2018). Moreover, building a model to infer sensitive information leaves open the possibility that the model may ultimately be used more broadly, with possibly unintended consequences (Awasthi et al., 2021). Therefore, more research is needed to tackle fairness testing in the absence of sensitive attribute information.

Fairness testing at the intersection of multiple sensitive attributes. Software systems may have multiple sensitive attributes that need to be considered at the same time (Gohar and Cheng, 2023). Human attributes, such as sex, race, and class, intersect with one another, and unfair software systems built into society lead to systematic disadvantages along these intersecting attributes (Crenshaw, 2013). However, existing fairness testing work often tackles a single sensitive attribute at a time. To the best of our knowledge, there has been little work that explores fairness testing for compounded or intersectional effects of multiple sensitive attributes (Cabrera et al., 2019; Tao et al., 2022; Zhang et al., 2023), leaving an interesting research opportunity for the community.
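The following Python sketch illustrates why testing one sensitive attribute at a time can miss intersectional effects. In the toy data, the favorable-outcome rate is identical across sexes and across races, yet it differs completely between intersectional subgroups. The records, attribute values, and function names are hypothetical and chosen only to make the effect visible.

```python
from collections import defaultdict


def favorable_rates(records, attributes):
    """Favorable-outcome rate for each combination of the given attributes."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for rec in records:
        key = tuple(rec[a] for a in attributes)
        totals[key] += 1
        favorable[key] += rec["pred"]
    return {key: favorable[key] / totals[key] for key in totals}


def max_gap(rates):
    """Largest difference in favorable-outcome rate between any two subgroups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical predictions (1 = favorable) for four intersectional subgroups.
records = (
    [{"sex": "male", "race": "white", "pred": 1}] * 2
    + [{"sex": "male", "race": "black", "pred": 0}] * 2
    + [{"sex": "female", "race": "white", "pred": 0}] * 2
    + [{"sex": "female", "race": "black", "pred": 1}] * 2
)

gap_sex = max_gap(favorable_rates(records, ["sex"]))
gap_race = max_gap(favorable_rates(records, ["race"]))
gap_intersection = max_gap(favorable_rates(records, ["sex", "race"]))
print(gap_sex, gap_race, gap_intersection)  # 0.0 0.0 1.0
```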

2. Test Oracle for Fairness Testing

Existing work mainly employs metamorphic relations as pseudo oracles or uses statistical measurements as indirect oracles of fairness testing, which both involve human ingenuity. It is an open challenge to design automatic techniques for constructing reliable oracles for fairness testing.

Furthermore, the emergence of manually-defined oracles for fairness testing brings a challenge for test oracle selection. For instance, the IBM AIF360 toolkit alone offers more than 70 fairness measurements (AIF, 2018; Bellamy et al., 2018), and the research community continues to introduce novel measurements. However, it is impractical to utilize all existing measurements as test oracles for fairness testing. Moreover, while each measurement may be suitable in a specific context, many of them cannot be simultaneously satisfied (Brun and Meliou, 2018). Hence, an important area for the research community is the development of automatic techniques for constructing reliable oracles in fairness testing. It is worth noting that one could argue that this challenge falls within the realm of requirements engineering rather than testing. However, in reality, fairness testing is frequently explored in natural settings, where upfront requirements engineering processes are not always assumed or followed strictly. Fairness testing involves investigating and evaluating fairness concerns in real-world systems or datasets, which may lack comprehensive and formal requirements. As a result, fairness testing needs to adapt to address the unique challenges that arise in these real-world contexts, where strict adherence to traditional requirements engineering may not be practical or feasible.
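The oracle selection problem can be illustrated with a small Python example in which the verdict of a fairness test depends entirely on which measurement serves as the oracle. In the toy data below, the same predictions pass a statistical parity check but fail an equal opportunity check. The data and function names are hypothetical, and the two measurements are only two of the many available.

```python
def positive_rate(preds, groups, g):
    """Share of group g that receives the favorable prediction."""
    selected = [p for p, gi in zip(preds, groups) if gi == g]
    return sum(selected) / len(selected)


def true_positive_rate(preds, labels, groups, g):
    """Share of truly favorable members of group g that are predicted favorable."""
    selected = [p for p, y, gi in zip(preds, labels, groups) if gi == g and y == 1]
    return sum(selected) / len(selected)


# Hypothetical ground truth and predictions for two groups of four individuals.
groups = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 0, 0, 1, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 1, 1, 0]

parity_gap = abs(positive_rate(preds, groups, "a") - positive_rate(preds, groups, "b"))
opportunity_gap = abs(true_positive_rate(preds, labels, groups, "a")
                      - true_positive_rate(preds, labels, groups, "b"))
print(parity_gap, opportunity_gap)  # 0.0 1.0
```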

3. Test Input Generation for Fairness Testing

Generation of natural inputs. Despite the existence of various techniques for test input generation in fairness testing, there is no guarantee that the generated instances are legitimate and natural. Particularly, in Table 3, it is evident that most test input generation techniques are based on the causal fairness definition, which requires generating pairs of instances that differ solely in sensitive attributes. However, it remains uncertain whether altering only the sensitive attribute is sufficient to generate inputs that are truly natural.

Furthermore, existing techniques (Zhang et al., 2021c; Zhang et al., 2021b, 2020) primarily rely on perturbing input features, without explicitly constraining the magnitude of the perturbation. As long as the generated instances can induce the intended output behavior, such as flipping the predicted outcome after modifying sensitive attribute information, they are considered effective. However, this approach may overlook real-world constraints, potentially resulting in generated instances that do not align with reality (e.g., granting a loan to a 10-year-old individual).
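One simple safeguard is to check generated instances against explicit domain constraints before counting them as discriminatory instances. The Python sketch below shows the idea for a loan scenario; the two constraints, the attribute names, and the generated instances are hypothetical, and real constraints would have to come from domain knowledge.

```python
def violated_constraints(instance, constraints):
    """Return the names of all domain constraints that the instance violates."""
    return [name for name, check in constraints.items() if not check(instance)]


# Hypothetical real-world constraints for loan applicants.
constraints = {
    "applicant is an adult": lambda x: x["age"] >= 18,
    "employment fits within working age": lambda x: x["years_employed"] <= x["age"] - 16,
}

# Hypothetical instances produced by a test input generator.
generated = [
    {"age": 35, "years_employed": 10},
    {"age": 10, "years_employed": 0},
    {"age": 20, "years_employed": 15},
]

natural = [x for x in generated if not violated_constraints(x, constraints)]
print(natural)
```

Such hand-written checks only cover constraints someone thought to encode, which is why automating the assessment of naturalness remains an open problem.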

As a result, open problems arise regarding how to generate test inputs for fairness testing that are both legitimate and natural. Researchers need to address the challenges of ensuring the generated instances adhere to real-world constraints while still accurately assessing fairness. Additionally, automating the evaluation of the naturalness of generated test inputs is another important area of exploration, enabling more reliable and efficient fairness testing methodologies.

Exploration of more generation techniques. As mentioned earlier, most test input generation techniques in fairness testing focus on the causal fairness definition. In contrast, test input generation for counterfactual fairness is relatively immature and more challenging. It requires researchers to conduct causal analysis of features and consider the causal relationships among them when altering sensitive attributes during generation. Moreover, when dealing with multiple sensitive attributes simultaneously, the task becomes even more difficult as changes in features need to account for the causal impacts from multiple sensitive attributes.

Furthermore, there is a research opportunity to design test inputs specifically for group fairness testing. While group fairness can be evaluated using real-world collected data, obtaining such data is not always easy. Given the limited work on generating test inputs for group fairness, there is still much potential for exploration in this area.

4. Test Adequacy for Fairness Testing

Test adequacy is a well-explored concept in traditional software testing, focusing on evaluating the coverage provided by existing tests (Zhang et al., 2022b). Adequacy criteria not only offer confidence in testing activities but also serve as a guide for test generation. However, in the domain of fairness testing, the issue of test adequacy remains an open problem, and to the best of our knowledge, no research has specifically addressed this area.

To address this challenge, one approach could be to adapt traditional software test adequacy metrics or ML test adequacy metrics for fairness testing. For instance, traditional software testing has proposed metrics such as line coverage, branch coverage, and dataflow coverage (Zhang et al., 2019b), while ML testing has introduced metrics such as neuron coverage, layer coverage, and surprise adequacy for deep learning models (Zhang et al., 2022b). Neuron coverage assesses the extent to which neurons in a deep learning model are exercised by a test suite, while layer coverage measures the coverage of different layers. Surprise adequacy, on the other hand, evaluates the coverage of discretized input surprise range for deep learning models (Zhang et al., 2022b).
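As a rough illustration of neuron coverage as described above, the sketch below computes the fraction of neurons activated above a threshold by at least one test input. It assumes that activations have already been recorded as one list per input; the function name, the threshold of 0.5, and the toy values are illustrative assumptions, and the sketch omits details such as activation scaling that published definitions may include.

```python
from typing import List


def neuron_coverage(activations: List[List[float]], threshold: float = 0.5) -> float:
    """Fraction of neurons whose activation exceeds `threshold` for at least
    one test input. Each inner list holds one input's activations over the
    same fixed set of neurons."""
    if not activations:
        return 0.0
    n_neurons = len(activations[0])
    covered = [False] * n_neurons
    for row in activations:
        for i, a in enumerate(row):
            if a > threshold:
                covered[i] = True
    return sum(covered) / n_neurons


# Two test inputs, four neurons: only neurons 0 and 1 are ever activated.
acts = [
    [0.9, 0.1, 0.0, 0.2],
    [0.3, 0.7, 0.0, 0.1],
]
coverage = neuron_coverage(acts)
print(coverage)
```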

However, there is currently no empirical evidence to support the applicability and effectiveness of these metrics in assessing the ability to detect fairness bugs and the sufficiency of fairness testing. Further research is needed to investigate and validate the suitability of these metrics in the context of fairness testing. Additionally, novel metrics tailored specifically for fairness testing may need to be developed to capture the unique characteristics and requirements of assessing fairness in ML software.

5. Test Cost Reduction

Test cost poses a significant challenge in fairness testing of ML software. The process of assessing fairness often entails retraining ML models, repeating the prediction process, or generating extensive data to explore the vast behavioral space of the models. However, thus far, there has been no research dedicated to reducing the cost of fairness testing. It would be intriguing to explore specific techniques for test selection, prioritization, and minimization that can effectively reduce the cost of fairness testing without compromising test effectiveness.

Moreover, as discussed in Section 5.2, there is an escalating demand for deploying intelligent software systems on platforms with limited computing power and resources, such as mobile devices. Several studies (Blakeney et al., 2021; Hooker et al., 2020; Stoychev and Gunes, 2022) have addressed fairness testing in such scenarios. This presents a fresh challenge for the research community: how to conduct fairness testing effectively on diverse end devices, including those with restricted computing power, limited memory size, and constrained energy capacity. Addressing this challenge requires innovative approaches that adapt fairness testing methodologies to the limitations of these resource-constrained platforms.

6. Fairness and Other Testing Properties

Testing fairness repair techniques with more properties considered. After fairness repair techniques have been applied to software systems, fairness testing is often performed again. In this process, testers may also take ML performance (e.g., accuracy) into consideration (Hort et al., 2021; Chen et al., 2022), because it is well known that fairness improvement often comes at the cost of ML performance (Hort et al., 2021). However, in addition to ML performance, many other properties are important for software systems, including robustness, security, efficiency, interpretability, and privacy (Zhang et al., 2022b). The relationships between fairness and these properties have not been well studied in the literature and therefore remain poorly understood. Future research is needed to uncover these relationships and to perform testing with these properties considered. Determining which properties to consider requires input from requirements engineers.

Fairness and explainability. Explainability refers to the ability of users to understand why a software system makes a given prediction (Gilpin et al., 2018). Like fairness, it is an important software property required by recent regulatory provisions (Mittelstadt et al., 2019). Because application scenarios that demand fairness often also require explainability, considering the two together is an interesting research direction. Many existing fairness testing studies only generate discriminatory instances that reveal fairness bugs in the software under test, without explaining why the software treats these instances unfairly.

In this case, software engineers have relatively little guidance on the production of targeted fixes to repair the software. Improving the explainability behind the unfair software outcomes can help summarize the reasons for fairness bugs, produce insights for fairness repair, and help stakeholders without technical background (e.g., product managers, compliance officers, and policy makers) understand the software bias simply and quickly.

7. Testing Fairness of More Applications

The majority of existing fairness testing work has concentrated on tabular data, natural language processing systems, and computer vision systems. However, fairness, as a critical non-functional property, should be considered across a broader spectrum of software systems, including speech recognition systems, video analytic systems, multi-modal systems, and recommendation systems. Additionally, existing fairness testing studies have predominantly centered on classification tasks, although fairness is a crucial concern that should also be examined in other machine learning tasks, including regression and clustering.

8. More Fairness Testing Activities and Components

The current body of research primarily focuses on offline fairness testing. There is a pressing need for more research in the realm of online fairness testing, as it can provide valuable insights to guide software maintenance and facilitate the evolution of software systems.

Furthermore, researchers have an opportunity to extend fairness testing investigations to additional testing activities that have been extensively studied in traditional software testing but rarely explored in the context of fairness testing. For instance, exploring bug report analysis (Zhang et al., 2015), bug triage (Jeong et al., 2009), and test evaluation (Zhang et al., 2022b) in fairness testing can contribute valuable insights to the field.

Additionally, there exists a research gap in the fairness testing components. While testing of ML frameworks has received considerable attention in traditional ML testing (Pham et al., 2019; Wang et al., 2020b; Nejadgholi and Yang, 2019; Wang et al., 2022a), its application in fairness testing remains underexplored. Furthermore, exploring non-ML component testing within the context of fairness testing presents a promising research direction that warrants further investigation.

9. More Fairness Testing Tools

Existing fairness testing tools (listed in Table 5) tend to require programming skills, and thus are unfriendly to non-technical stakeholders. However, fairness testing research includes many non-programmer stakeholders and contributors such as compliance officers, policy makers, and legal practitioners.

Conclusion

We have presented a comprehensive survey of 100 papers on fairness testing. We summarized current research status in the fairness testing workflow (including test input generation and test oracle identification) and testing components (including data testing, ML program testing, and model testing). We also listed public datasets and open source tools that can be accessed by researchers and practitioners interested in the fairness testing topic. We analyzed trends and promising research directions for fairness testing. We hope this survey will help researchers from various research communities become familiar with the current status and open opportunities of fairness testing.

Acknowledgment

We shared our work with the authors of the papers we surveyed in order to check for accuracy and omissions. We would like to thank those authors who kindly provided comments and feedback on earlier drafts of this paper. Zhenpeng Chen, Federica Sarro, and Mark Harman are supported by the ERC Advanced Grant No. 741278 (EPIC: Evolutionary Program Improvement Collaborators). Jie M. Zhang is partially supported by the UKRI Trustworthy Autonomous Systems Node in Verifiability, with Grant Award Reference EP/V026801/2.