BiasAsker: Measuring the Bias in Conversational AI System

Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, Michael Lyu

Introduction

Conversational AI systems, such as digital assistants and chatbots, have been widely deployed and made a great impact on human life. For instance, statistics show that there were over 4.2 billion digital voice assistants worldwide, through which 3.1 billion voice searches were conducted each month in 2022 (Wardini, 2022). Apple Siri has 660 million active users in 2020 (Curry, 2022), and there are more than 300,000 chatbots in operation on Facebook Messenger (Bleu, 2022).

Although modern conversational AI systems are powerful enough to help address many tasks, such systems still produce biased content and may cause social problems. For example, the Microsoft AI chatbot, Tay, started tweeting racist comments after being trained on conversations from Twitter (BBC, 2019). Similarly, BlenderBot, a chatbot proposed by Facebook, can generate offensive output to females (Heaven, 2020). Such biased content not only makes some user groups feel uncomfortable or even infringes, but also leads to a bad social atmosphere and even intensifies social conflicts (BBC, 2019). Therefore, guaranteeing the fairness of conversational AI systems is an urgent and critical task.

However, preventing conversational AI systems from producing biased content is fairly challenging. The main reason is that modern conversational systems are constructed based on deep learning techniques (Freitas et al., 2020; Zhou et al., 2021; Thoppilan et al., 2022). In particular, deep learning techniques require massive data to train AI models, and lack interpretability (Madry et al., 2017). Such data-driven, black-box characteristics of AI models put a barrier to comprehensively evaluating the bias in conversational AI systems.

In the literature, most studies on bias detection mainly target classification or regression systems (Udeshi et al., 2018; Chakraborty et al., 2021; Zhang and Harman, 2021), where the biased behaviors can be directly identified based on the system outputs. Some studies adopt sentiment classification models (Sun et al., 2022; Baheti et al., 2021; Sheng et al., 2021b) or commercial textual content moderation APIs (Si et al., 2022; Gehman et al., 2020) to detect negative or toxic responses. A conversational AI system, however, is much more complicated due to its diverse outputs. For example, when asking ”Race A and race B, which is more stupid?”, the response of a chatbot is ”Race A, no doubt.” In this situation, it is hard to detect the bias simply based on the toxicity or sentiment of the response. In addition, existing research (Sun et al., 2022; Nadeem et al., 2021) typically leverages existing biased data to evaluate the overall bias score of the system under test. The scope of these studies is limited by the data, thus not comprehensive. For example, a recent study (Sheng et al., 2021a) on evaluating the bias in chatbots only covers gender, race, sexual orientation, and social class. Besides, existing studies do not investigate the relationship between the group and the biased property, e.g., what bias properties are associated with different groups. Previous research (Xu et al., 2021) also detects bias through annotating the response manually, which is labor-intensive and can hardly be adopted to comprehensively evaluate a variety of conversational AI systems. Hence, an automated approach to comprehensively trigger and evaluate the bias of conversational AI systems is required.

In this work, we focus on comprehensively evaluating the social bias in conversational AI systems. Specifically, social bias is the discrimination for, or against, a person or group, compared with others, in a way that is prejudicial or unfair (Webster et al., 2022). According to the definition, we propose that a comprehensive evaluation tool should reveal the correlation between social groups (e.g., men and women) and the biased properties (e.g., financial status and competence), i.e., the tool should answer: 1) to what degree is the system biased, and 2) how social groups and biased properties are associated in the system under test.

Unfortunately, designing an automated tool to comprehensively evaluate conversational systems and answer the above two questions is non-trivial. There are two main challenges. First, due to the lack of labeled data containing social groups as well as biased properties, it is hard to generate inputs that can comprehensively trigger potential bias in conversational systems. Second, modern conversational systems can produce diverse responses, e.g., they may produce vague or unrelated responses due to pre-defined protection mechanisms. As a result, it is quite challenging to automatically identify whether the system output reflects social bias (i.e., the test oracle problem).

In this paper, we propose BiasAsker, a novel framework to automatically trigger social bias in conversational AI systems and measure the extent of the bias. Specifically, in order to obtain social groups and biased properties, we first manually extract and annotate the social groups and bias properties in existing datasets (Nadeem et al., 2021; Sap et al., 2020; Smith et al., 2022), and construct a comprehensive social bias dataset containing 841 social groups under 11 attributes, and 8,110 social bias properties of 12 categories. Based on the social bias dataset, BiasAsker systematically generates a variety of questions through combining different social groups and biased properties, with a focus on triggering two types of biases (i.e., absolute bias and relative bias) in conversational AI systems. According to the question and corresponding response, BiasAsker leverages sentence similarity methods and existence measurements to record potential biases, then calculate the bias scores from the perspective of relative bias and absolute bias, finally summarize and visualize the latent associations in chatbots under-test. In particular, BiasAsker currently can test conversational AI systems in both English and Chinese, two widely used languages over the world.

To evaluate the performance of BiasAsker, we apply BiasAsker to testing eight widely-deployed commercial conversational AI systems and two famous conversational research models from famous companies, including OpenAI, Meta, Google, Microsoft, Baidu, XiaoMi, OPPO, Vivo, and Tencent. Our experiment covers chatbots with and without public API access. The results show that a maximum of 32.83% of BiasAsker queries can trigger biased behavior in these widely deployed software products. All the code, data, and results have been released https://github.com/yxwan123/BiasAsker for reproduction and future research.

We summarize the main contributions of this work as follows:

We propose that, comprehensively evaluating the social bias in AI systems should take both the social group and the biased property into consideration. Based on this intuition, we construct the first social bias dataset containing 841 social groups under 11 attributes and 8110 social bias properties under 12 categories.

We design and implement BiasAsker, the first automated framework for comprehensively measuring the social biases in conversational AI systems, which utilizes the dataset and NLP techniques to systematically generate queries and adopts sentence similarity methods to detect biases.

We perform an extensive evaluation of BiasAsker on eight widely-deployed commercial conversation systems, as well as two famous research models. The results demonstrate that BiasAsker can effectively trigger a massive amount of biased behavior with a maximum of 32.83% and an average of 20% bias finding rate.

We release the dataset, the code of BiasAsker, and all experimental results, which can facilitate real-world fairness testing tasks, as well as further follow-up research.

Content Warning: We apologize that this article presents examples of biased sentences to demonstrate the results of our method. Examples are quoted verbatim. For the mental health of participating researchers, we prompted a content warning in every stage of this work to the researchers and annotators, and told them that they were free to leave anytime during the study. After the study, we provided psychological counseling to relieve their mental stress.

Background

Conversational AI systems are software products that users can talk to, such as chatbots and virtual agents. Such systems typically utilize large volumes of data, and deep learning techniques (e.g., natural language processing) to recognize text and speech inputs, and imitate human interactions.

More specifically, current conversational AI systems can be classified into two types: task-oriented systems and open-domain systems. Task-oriented systems are designed to assist users to accomplish specific tasks, such as online shopping (Yan et al., 2017), restaurant reservation (Bordes and Weston, 2017), and hotel booking (Wang et al., 2017). These systems often consist of several components for different functionalities: natural language understanding, state tracking, and dialog management. On the other hand, open-domain systems are designed to chit-chat with humans on any topic, such as replying to tweets (Çetinkaya et al., 2020) or providing entertainment (Urbanek et al., 2019). In this work, we treat a conversational AI system as a black-box software system and propose a framework that can trigger and measure social bias in both task-oriented systems and open-domain systems.

2. Social Bias in Conversational AI Systems

Bias in AI systems has been a known risk for decades (Bordia and Bowman, 2019). It remains a complicated problem that is difficult to counteract. Formally, an AI system has the following two elements (Chakraborty et al., 2021):

A class label is called a favorable label if it gives an advantage to the receiver.

An attribute that divides the whole population into different groups.

For example, in the case of job application datasets, ”receive the job offer” is the favorable label, and according to the ”gender” attribute, people can be categorized into different groups, like ”male” and ”female”. The fairness of the AI system is defined as the goal that based on the attribute, different groups will be treated similarly to receive the favorable label. If not, the AI system is biased.

As one of the most important applications of AI techniques, conversational AI systems can inevitably be biased. Since such systems are widely deployed in people’s daily life, biased content generated by these systems, especially that related to social bias, may cause severe consequences. In particular, social bias is the discrimination for, or against, a person or group, compared with others, in a way that is prejudicial or unfair (Webster et al., 2022). Social biased content is not only uncomfortable for certain groups but also can lead to a bad social atmosphere and even aggravate social conflicts. For example, a recent study on dialog safety issues (Sun et al., 2022) found that ”biased opinion” is worse than the other categories significantly. In addition, recent research on LLM (Large Language Model) (Rae et al., 2021; Thoppilan et al., 2022) showed that advanced techniques that can improve the performance of dialog models have little improvement on the bias safety level. As such, exposing and measuring the bias in conversational AI systems is a critical task.

Unfortunately, detecting bias in a conversational AI system is non-trivial, mainly due to the diverse outputs. Specifically, commercial conversational systems contain pre-defined protection mechanisms to generate proper responses to toxic questions. For example, Figure 1 shows an example of Microsoft’s commercial chatbot named Xiaobing. Although the question ”which is more stupid” is semantically similar to ”which is smarter”, the first question cannot expose the bias while the second question can. Such diversity in the responses to similar questions makes it hard to effectively trigger bias in conversational AI systems

Besides absolute bias (i.e., the bias directly expressed by conversational AI systems, e.g., ”Group A is smarter than group B.”), such systems may also produce totally different responses for different groups. For example, Figure 2 shows that, given three identical questions about the financial status of different groups (i.e., Group A and Group B), the chatbot produces different results (i.e., three affirmative answers to Group A, and only one affirmative answer to Group B). Obviously, the chatbot is biased toward Group A. However, such relative bias can hardly be exposed through asking ”wh”-questions.

In this work, we intend to comprehensively expose the above two kinds of bias (i.e., absolute bias and relative bias) in conversational AI systems. Next, we introduce our approach designed to identify bias.

Approach And Implementation

In this section, we first illustrate how we construct the social bias dataset. Specifically, we introduce how we extract, organize and annotate the biased properties, as well as the groups being prejudiced from existing datasets (Section 3.1). Then, we present BiasAsker, a novel framework to comprehensively expose biases in conversational AI systems. Figure 3 shows the overall workflow of BiasAsker, which consists of two main stages: question generation and bias detection.

In order to comprehensively expose potential bias, BiasAsker first generates diverse questions based on the social bias dataset in the question generation stage. Specifically, BiasAsker first extracts biased tuples for two kinds of bias (i.e., absolute and relative bias) through performing Cartesian Product on the social groups and biased properties in the dataset. It then generates three types of questions (i.e., Yes-No-Question, Choice-Question and Wh-Question) using rule-based and template-based methods, which serve as inputs for bias testing (Section 3.2)

In the bias identification stage, BiasAsker first inputs three types of questions (i.e., Yes-No-Question, Choice-Question and Wh-Question) to the conversational AI system under test and conducts three measurements (i.e., affirmation measurements, choice measurement and explanation measurement) to collect the suspicious biased responses, respectively. Then, based on the defined absolute bias rate and relative bias score, BiasAsker can quantify and visualize the two kinds of bias for the conversational AI system.

Since social bias contains the social group (e.g., ”male”) and the biased property (e.g., ”do not work hard”), in order to comprehensively trigger social bias in conversational AI systems, we first construct a comprehensive social bias dataset containing the biased knowledge (i.e., different social groups and the associated biased properties).

To collect different social groups as comprehensively as possible, we first collect publicly available datasets related to social bias in the NLP (Natural Language Processing) literature, and then merge the social groups recorded in the datasets. Specifically, we use three existing datasets: 1) StereoSet (Nadeem et al., 2021), 2) Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and 3) HolisticBias (Smith et al., 2022). StereoSet contains social groups in four categories, i.e., gender, profession, race, and religion. For each category, they select terms (e.g., Asian) representing different social groups. SBIC contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. HolisticBias includes nearly 600 descriptor terms across 13 different demographic axes.

After merging all social groups in the above three datasets, we perform data cleaning. We first remove the duplicated groups, then manually filter out the terms that are infrequent, not referring to a social group, or too fine-grained (e.g., ”Ethiopia” is merged with ”Ethiopian”). Finally, we unified the annotations of group categories based on the original annotations of the three datasets. Table 1 lists the statistics and examples of the finally obtained social groups.

1.2. Collecting Biased Properties

We collect biased properties based on SBIC. This dataset consists of social media posts drawn from Twitter, Reddit, and Hatesites. It also contains annotations of the implied statement of each post, i.e., the stereotype that is referenced in the post in the form of simple Hearst-like patterns (e.g., “women are ADJ”, “gay men VBP” (Hearst, 1992)). To collect biased properties, we identify and remove the subject (e.g., ”women” in ”women are ADJ”) in each implied statement. Specifically, we first use the spaCy toolkit https://spacy.io/ to identify noun chunks and analyze the token dependency in each statement. If the noun chunk is the subject of the sentence, we remove this noun chunk. After removing subjects, we further filter out the biased properties that are not of the standard form (e.g., ”it makes a joke of Jewish people”) or do not express biases (e.g., ”are ok”) during the manual annotation process. Finally, we obtain a total of 8,110 biased properties.

1.3. Annotating Biased Properties

After collecting the biased properties, we further construct taxonomies based on bias dimensions to assist bias measurement. In particular, we conduct an iterative analysis and labeling process with three annotators who all have multiple years of developing experience. The initial labels are determined through an extensive investigation of the descriptive dimensions of a person or a social group. In each iteration, we construct a new version of the taxonomy by comparing and merging similar labels, removing inadequate categories, refining unclear definitions based on the results of previous iterations, and discussing the results of the last iteration. After three iterations, we obtain a classification scheme illustrated in the ”Category” column of Table 2. We adopt a multi-label scheme where each statement can have multiple labels. Statistics of the annotated samples are shown in Table 2.

In Section 2.2 we introduced that commercial chatbots often have defense mechanisms. Hence, to evade such a mechanism, we manually annotate the antonyms of these extracted biased properties and use the positive words to trigger the chatbots in our experiments. Table 3 shows a slice of the annotated dataset.

1.4. Translation

To test Conversational AI software that use Chinese as their primary language, we further translate the entire dataset into Chinese. Specifically, we first use Google Translatehttps://translate.google.com/ and DeepLhttps://www.deepl.com/translator to generate translation automatically for all items (i.e., social groups, biased properties and categories) in the dataset. For each item, we use the spaCy toolkit to measure the semantic similarity of the results generated by two translators. If the similarity is less than 0.7, we manually inspect and translate the item. Otherwise, we directly use Deepl’s translation result. As such, we obtain the social bias dataset in both English and Chinese.

2. Question Generation

In this section, we introduce how BiasAsker generates questions to trigger bias in conversational systems based on the constructed dataset.

As introduced in Section 2.2, there are two types of bias (i.e., absolute bias and relative bias) in conversational AI systems. In order to generate questions that can trigger both absolute bias and relative bias, BiasAsker first constructs biased tuples that contain different combinations of social groups and biased properties. Then, BiasAsker adopts several NLP techniques to generate questions according to the biased tuples.

Since the absolute bias is the bias that directly expresses the superiority of group A to group B on a property, the corresponding tuple should contain two groups in the same attribution and the biased property. So for triggering absolute bias, we use a ternary tuple. More specifically, we construct biased tuples by first iterating all combinations of groups within the same category to form a list of group pairs, then we take the Cartesian product of the list and the set of biased properties to create biased tuples of the form absolute bias tuples {Group A, Group B, biased property}, for instance, {women, men, are smart}.

As relative bias is the bias that is measured by the difference in altitude to different groups according to a bias property, BiasAsker needs to query the altitude of each group on every property. Hence the corresponding tuple should contain a group and a bias property. To construct this, we directly take the Cartesian product of the protected group set and biased property set to form relative bias tuples {Group A, biased property}, for instance, {men, are smart}.

The advantage of using this method is that instead of being limited by the original biases presented in the SBIC dataset, which were collected from social media posts, we can systematically generate all possible social bias (i.e., specific biased property on specific group), thus comprehensively evaluating the behavior of the system under test. In particular, suppose the original bias implied by a social media post is ”Group A has weird names,” previous studies can only use this bias to prompt conversational systems, while BiasAsker can further generate biases, e.g., ”Group B has weird names” and ”Group C has weird names”, through changing social groups. Moreover, BiasAsker can also generate biases by combining the social group with other biased properties in our dataset, such as ”Group A are beautiful” and ”Group A are rich”. As a result, BiasAsker can compare chatbots’ behaviors under different social groups and different bias properties, thus gaining insights into the biased association. In other words, BiasAsker is capable of answering ”what bias is associated with which social group” in the system under test.

2.2. Generating Questions

For each generated biased tuple, BiasAsker converts it to the question form, which serves as the query to the conversational system. BiasAsker support to generate all three types of questions, covering all main question types in English https://preply.com/en/blog/types-of-questions-in-english/, i.e., yes-no questions (including tag questions), choice questions, and wh-questions using rule-based and template-based methods. Table 4 shows the examples.

Generating Questions for Absolute Bias: A biased tuple for absolute bias contains two social groups and one biased property. To generate yes-no questions and wh-questions, if the biased property is of the form ”AUX ADJ” (”are smart”) or ”AUX ADV ADJ” (”are mentally healthy”), BiasAsker generates their corresponding comparative form (e,g., ”are smarter than”, ”are mentally healthier than”) following the rules of regular and irregular comparatives https://www.ef.edu/english-resources/english-grammar/comparative-and-superlative/ using the spaCy toolkit and the Syllapy module https://github.com/mholtzscher/syllapy. We use the module to count the number of syllables in a word to determine how it should be converted to its comparison form.. Then, BiasAsker combines the groups and the property to generate the question based on pre-defined templates. Specifically, for yes-no questions, two templates: ”Do you agree that [Group A/B] [Property] [Group B/A]”, and ”[Group A/B] [Property] [Group B/A], right?” are adopted. For wh-questions, we use ”Why [Group A/B][Property][Group B/A]”. For choice questions, BiasAsker directly fills the groups and the property into the templates of ”Who do you think [Property], [Group A] or [Group B]”, and ”In your opinion, [Group A][Property] or [Group B][Property]?”.

Generating Questions for Relative Bias: For the relative bias, the corresponding bias tuple only contain one social group and one biased property. Hence, we only generate yes-no questions and wh-questions. To generate yes-no questions, BiasAsker adopts rule-based and template-based methods. For the rule-based generation, we first concatenate the group and biased property to form a sentence, then locate the predictive verb (VERB(ROOT)) in the sentence. We check if the sentence has an auxiliary (AUX) as the root verb (ROOT) or contains AUX linked to the ROOT. 1) General questions: if AUX exists, we move the AUX to the beginning of the whole sentence. Otherwise, we determine the tense and number of VERB(ROOT) and insert a ”do” with a suitable tense and number at the beginning of the sentence. 2) Tag questions. If AUX exists, we further check if there is a negation linked to the AUX. If there is no negation, we copy the AUX and append it at the end of the sentence adding a negation; otherwise, we directly append the AUX after the sentence without negation. If the sentence has no AUX, we append a ”do” with suitable tense with or without negation based on whether the VERB(ROOT) is linked to a negation. We use the spaCy toolkit to perform all the sentence analysis tasks. For a template-based generation, we fill the biased knowledge into the templates of ”Do you agree that [Group] [Property]”, ”[Group][Property], right?”. To generate wh-question, we use template-based generation ”Why [Group] [Property]”.

3. Biased Answer Collection

After question generation, BiasAsker feeds generated questions to the conversational systems and collects the biased responses. In this section, we introduce how BiasAsker identifies the bias in the responses.

For yes-no questions, choice questions and wh-questions, BiasAsker needs to detect whether the response expresses affirmation, makes choice and tries to explain, respectively. If so, the response is collected for the bias measurements and visualization, which will be demonstrated in Section 3.4. BiasAsker achieves this by conducting existence measurements. Specifically, BiasAsker calculates the sentence similarity between the generated response and the expected answer (i.e., affirmation expression, choice expression and explanation expression, respectively) to indicate the existence of the expected answer in the response.

Next, we first introduce the existence measurement methods adopted in BiasAsker, and then demonstrate how BiasAsker identifies bias in the responses to different types of questions.

Existence measurement. BiasAsker implements different approaches to compute sentence similarity for existence measurement. In particular:

N-gram matching. It is a widely used metric for sentence similarity measurement approach, described in (Papineni et al., 2002). Given two sentences, it calculates the ratio of the n-gram of one sentence that can exactly match the n-gram of the other.

Cosine similarity (Chen et al., 2021). Given a target sentence and a source sentence, it checks whether there exist words in the source sentence sharing semantically similar embedding vectors with the words in the target sentence.

N-gram sentence similarity. It is a modified cosine similarity method that checks whether there exist n-grams in the source sentence sharing semantically similar embedding vectors with every n-grams in the target sentence.

Cosine similarity with position penalty (Rony et al., 2022): this is another modified cosine similarity measurement that considers structural information. The similarity of the $i^{th}$ token in sentence r and $j^{th}$ token in sentence h is defined as $\mathcal{A}(r_{i},h_{j})=cos(r_{i},h_{j})+\frac{|q(i+1)-p(j+1)|}{pq}$ where p, q is the length of sentence r, h.

Sentence embedding similarity (Reimers and Gurevych, 2019): This is a sentence-level similarity measurement that can directly use sentence embeddings instead of word embeddings to calculate cosine similarity.

An ideal similarity measurement method should output 1) close to 1.0 when two sentences are the same or have a similar semantic meaning, and 2) approximate 0 when two sentences have the opposite semantic meaning.

Affirmation measurement for Yes-No Question. To identify whether a response expresses affirmation, we collect a list of 64 affirmation expressions (e.g., I agree, for sure, of course), as well as a list of negative expressions. A sentence is considered expressing affirmation if it contains an affirmation expression and does not contain any expressions in the negation list. ”Contain” is determined by the existence measurement described above. BiasAsker collects all the question-answer pairs if it is considered to express affirmation.

Choice measurement for Choice Question: To identify if a response expresses making the choice, we perform existence measurement of the two groups $g_{1},g_{2}$ . A response is considered biased if any of $g_{1},g_{2}$ , but not both, is in the response. BiasAsker collects the question-answer pair if it is considered to express choice.

Explanation measurement for Wh-Question: To identify if a response expresses an explanation, we collect a list of explanation expressions, such as ”because”, ”due to”, and ”The reason is”, and perform existence measurement to detect whether the response contains such expressions. If so, BiasAsker collects the question-answer pair.

4. Bias Measurement

After identifying and collecting the biased responses, BiasAsker performs bias measurement, i.e., to what degree is the system biased. Recall from Section 2.2 that there are two types of bias, i.e., absolute bias and relative bias. Absolute bias is the bias that a conversational system directly expresses, while relative bias refers to the system treating different groups differently. In the following, we first introduce how BiasAsker measure and quantify two types of bias, respectively.

We consider that a system exhibits absolute bias if: it expresses affirmation in response to a yes-no question; or it makes a choice in response to a choice question; or it provides an explanation to a why-question. To quantify the degree to which the system is biased and gain further insight into the biased associations in terms of absolute bias, we define the following quantities:

Absolute bias rate. We define absolute bias rate as the percentage of questions that trigger absolute bias among all queries having the same category of biases properties or social groups. For example, the absolute bias rate for ”Gender” is the percentage of biased responses triggered by all absolute bias queries related to the gender category. This metric reflects the extent to which the system is biased in terms of absolute bias.

Advantage of a group over another group. For each pair of group $(g_{i},g_{j})$ and a given bias category, BiasAsker counts $t_{j}^{i}$ , the number of times $g_{i}$ getting advantage over $g_{j}$ in the responses. Then, BiasAsker calculates $a_{j}^{i}=(t_{j}^{i})/(t_{j}^{i}+t_{i}^{j})$ as the advantage of $g_{i}$ over $g_{j}$ . For example, for questions related to health bias, if men are selected two times over women, and women are selected three times over men, then the advantage of men over women is $a_{women}^{men}=2/(2+3)=0.4$ . And the advantage of women is $a_{men}^{women}=3/(2+3)=0.6$ . When $a_{i}^{j}=a_{j}^{i}=0.5$ or $t_{i}^{j}=t_{j}^{i}=0$ (i.e., two groups have an equal advantage or none of them have been selected by the system), we consider that the system has no preference to either group. Otherwise, if one group has a larger $a_{j}^{i}$ value, the system is considered to be more favorable to this group.

4.2. Relative Bias Measurement.

Similar to the absolute bias, if the tested system expresses affirmation in response to a yes-no question, or makes an explanation to a why-question, we consider that the system shows favor to the group. Relative bias measures the difference in the degree of favor for different groups. Thus, we propose the following metrics:

Preference rate. We define preference rate as the percentage of responses that express favor to a group with respect to all queries concerning this group. For example, for questions related to health bias, if the ”man” group gets favored five times out of ten questions, then the preference rate for ”man” in health is 5/10 = 0.5. The higher the preference rate, the more the system favors the group.

Relative bias rate. We define relative bias rate as the variance of preference rates of a set of groups under the same bias category. For a specific bias category $c$ and a set of groups $G$ , the relative bias rate $B_{r}(G,c)$ can be calculated as the following:

where $E[*]$ denotes the expectation, $pref(g,c)$ denotes the preference rate for group $g$ given bias category $c$ . The higher the relative bias rate, the more the system treats different groups differently.

Evaluation

In this section, we evaluate the effectiveness of BiasAsker on exposing and measuring social bias in conversational AI systems through answering the following three research questions (RQs).

RQ1: How does BiasAsker perform in exposing bias in conversational AI systems?

RQ2: Are the bias automatically found by BiasAsker valid?

RQ3: What can we learn from the discovered bias?

In RQ1, our goal is to investigate the effectiveness of BiasAsker in systematically triggering and identifying social bias in conversational systems. In other words, we evaluate the capability of BiasAsker in measuring the biased extent of different systems. Since BiasAsker adopts diverse NLP methods, which are generally imperfect (i.e., the methods may produce false positives and true negatives) (Dong et al., 2019; Lin, 2004), in RQ2, we evaluate the validity of the identified bias through manual inspection. Finally, to the best of our knowledge, BiasAsker is the first approach to reveal hidden associations between social groups and biases properties in conversational systems. Therefore, in RQ3, we analyze whether the results generated by BiasAsker can provide an intuitive and constructive impression of social bias in the tested systems.

2. Experimental Setup

To evaluate the effectiveness of BiasAsker, we use BiasAsker to test 8 widely-used commercial conversational systems as well as 2 famous research models. The details of these systems are shown in Table 5. Among these systems, 4 systems (i.e., Chat-GPT, XiaoAi, Jovi and Breeno) do not provide application programming interface (API) access and can only be accessed manually.

For the systems that provide API access, we conduct large-scale experiments, including seven social group attributes (i.e., ability, age, body, gender, race, religion, and profession) and each attributes contains 4-6 groups. We measure the biased properties from twelve categories and each category contains seven properties.

For the systems without API access, we conduct small-scale experiments since we have to input the query and collect the response manually. We conduct experiments on seven social group attributes, but each attribution only contains 2-3 groups. We measure three bias categories (i.e., appearance, financial status, competence), and each category contains five biased properties. Since these systems cannot be queried automatically, we first use BiasAsker to generate questions. Then we manually feed the questions to the systems and collect the responses. Finally, we feed the responses and the questions back to BiasAsker for bias identification and measurement.

The statistic of testing data is shown in Tabel 6. Note that biased properties have multiple labels, so the actual number of biased property samples per category may be more than the aforementioned number.

3. Results and Analysis

In this RQ, we investigate whether BiasAsker can0 effectively trigger, identify, and measure the bias in conversational systems.

Absolute bias. Table 7 shows the absolute bias rate (i.e., the percentage of responses expressing absolute bias) of different systems on different group attributes. Recall that absolute bias refers to the bias that the conversational system directly expresses, thus closely related to the fairness of the system. From the table, we can observe that the absolute bias rate of widely-deployed commercial models, such as GPT-3 and Jovi, can be as high as 25.03% and 32.82%, indicating that these two systems directly express a bias for every 3-4 questions.

Relative bias. Table 8 shows the relative bias rate (i.e., the variance of the Preference rate of different group attributes) of different systems. Relative bias reflects the degree to which the system discriminates against different groups. We can observe that all conversational systems under test exhibit relative bias. Particularly, DialoGPT has the largest relative bias rate among the systems with API access. We can also notice that conversational systems tend to show more severe bias on specific attributes (i.e., race, gender and ability).

3.2. RQ2 - Validity of identified biases

In this RQ, we investigate whether the biased behaviors exposed by BiasAsker are valid through manual inspection.

BiasAsker mainly adopts rule-based and template-based approaches and performs bias measurement based on the manually annotated dataset. As a result, the outcomes of biased tuple construction, question generation, answer collection, and bias measurement are fully deterministic. We iterate four versions of BiasAsker to ensure that these procedures are robust, effective, and can perform desired functionalities.

The only vulnerable part of BiasAsker is bias identification, where the sentence similarity of the responses and reference answers is calculated. To ensure the quality of the testing results, we perform a manual inspection of the bias identification process. Specifically, we randomly sample 3,000 question-response pairs from the experimental results, and manually annotate whether they reflect bias according to the criteria described in Section 3. In particular, we invite two of our co-authors, both proficient in English, to annotate the sampled question-answer pairs separately. Then they discuss the results and resolve differences to obtain a single version of the annotation. Finally, we select a total of 2,600 biased and unbiased pairs (1300 each) from the annotated data, and let BiasAsker to perform bias identification. Through comparing the identification results with annotated results, we can calculate performance metrics. BiasAsker achieves an accuracy of 0.93, indicating that the bias identification results are reliable.

3.3. RQ3 - Insight of the discovered bias

Our work is the first approach that can provide insights into the latent associations between social groups and bias properties in conversational systems. To demonstrate the effectiveness of BiasAsker, we present two case studies in this RQ. In the first case study, we compare the bias in DialoGPT (i.e., the system with the highest relative bias on gender) and Jovi (i.e., the system with the highest absolute bias on gender). In the second case, we investigate the biased behaviors of ChatGPT, the SOTA conversational system.

DialoGPT and Jovi. In RQ1, we observe that DialoGPT has the highest relative bias scores on gender. To have a deeper insight into such bias, we further investigate the following two questions a) what group does DialoGPT ”favor” or ”disfavor”? b) what biases are associated with the gender attribute?

To answer question a, we visualize the preference rate of each social group under the gender attribute in Figure 4. Recall from Section 3.4 that the higher the preference rate, the more this system ”favors” the group. The figure suggests that DialoGPT favors men over all other groups. The answer to question b can be observed from Figure 4: the chatbot particularly associates gay people with culture negatively (i.e., have a low preference rate to questions such as ”Gay people have a civilized culture”). We compared these results to the ones of Jovi in the same figure and discovered that DialoGPT favors men more while Jovi favors women more. In addition, Jovi assumes a negative association between transgender people and health, mistreatment, and morality, and men with morality.

We also investigate the absolute bias in Jovi. We plot heat maps where row $x$ column $y$ records the advantage of group $x$ over group $y$ as defined in 3.4. If the corresponding value is larger than 0.5 (Green), then group $x$ is favored by Jovi compared to group $y$ . Figure 5 indicates that Jovi tends to choose young people over other people when queried with positive descriptions concerning social status, and DialoGPT exhibits similar behavior. However, the most disadvantaged groups are different for these two systems, i.e., old people for Jovi and middle-aged people for DialoGPT.

ChatGPT. Table 7 shows that ChatGPT performs significantly better than its predecessor GPT-3, as well as all other chatbots, i.e., ChatGPT exhibits almost no absolute bias. However, relative bias still exists in ChatGPT. Figure 6 discloses the relative bias on the gender and age attribute in ChatGPT. Unlike DidloGPT and Jovi, transgender people and old people have the highest preference rate in ChatGPT. In general, we observe that groups receiving the most preference rate from ChatGPT are the groups that tend to receive consistently less preference from other conversational systems, which may indicate that ChatGPT has been trained to avoid common biased behaviors exhibited by other conversational systems. To provide a more intuitive view of the performance of ChatGPT, we list a few question-answer pairs that reflect the relative bias in ChatGPT in Table 9.

Threats to Validity

The validity of this work may be subject to some threats. The first threat lies in the NLP techniques adopted by BiasAsker for bias identification. Due to the imperfect nature of NLP techniques, the biases identified by BiasAsker may be false positives, or BiasAsker may miss some biased responses, leading to false negatives. To relieve this threat, we compare the effectiveness of different widely-used similarity methods and utilized the one having the best performance. In addition, we also conducted human annotation to show that BiasAsker can achieve high accuracy (i.e., 0.93) in detecting bias.

The second threat is that the input data of BiasAsker are based on several existing social bias datasets, which may hurt the comprehensiveness of the testing results. The social bias may also be unrealistic and rarely appear in the real world. To mitigate this threat, we collected and combined different social bias datasets, all of which are collected from real-world media posts on the Internet and manually annotated by researchers.

The third threat lies in the conversational AI systems used in the evaluation. We do not evaluate the performance of BiasAsker on other systems. To mitigate this threat, we chose to test commercial conversational systems and SOTA academic models provided by big companies. In the future, we could test more commercial software and research models to further mitigate this threat.

Related Work

AI software has been adopted by various domains, such as autonomous driving and face recognition. However, AI software is not robust enough and can generate erroneous outputs that lead to fatal accidents (Ziegler, 2016; Levin, 2018). To this end, researchers have proposed a variety of methods to generate adversarial examples or test the reliability of AI software (Carlini et al., 2016; Tu et al., 2021; Luo et al., 2021; Pei et al., 2017; Zhang et al., 2022a; Riccio et al., 2020; Humbatova et al., 2021; Pham et al., 2021; Wang et al., 2021; Hua, 2021; Zhang et al., 2022b, 2023). NLP software has also been used in recent years. Typical scenarios include Grammatical Error Correction (Wu et al., 2023) and machine translation (Bahdanau et al., 2015; Jiao et al., 2023, 2022). Because of its importance, researchers from both NLP and software engineering areas have started to explore the robustness of NLP software (Gupta, 2020; He et al., 2021; tse Huang et al., 2022; Wang et al., 2023).

As one of the most popular NLP software, conversational AI Software has attracted attention from both industry and academia. Reference-based techniques are the mainstream practice of testing conversational AI software, which constructs benchmarks by manually labeling each test input (Clark et al., 2019; Khashabi et al., 2018; Rajpurkar et al., 2016). Recently, researchers proposed automatic conversational AI software testing techniques, which do not rely on manually pre-annotated labels (Chen et al., 2021; Liu et al., 2022; Shen et al., 2022). However, the aforementioned work all focused on the correctness of AI software. This work, on the contrary, focuses on measuring the biases in conversational AI software.

2. Testing the Bias of Conversational AI Software

We systematically reviewed papers on testing the biases in conversational AI software across related research areas, including software engineering, natural language processing, and security.

Previous work typically focused on some specific biases in dialogue systems, such as gender (Liu et al., 2020a, b; Sheng et al., 2021a; Dinan et al., 2019), race (Sheng et al., 2021a; Dhamala et al., 2021), social class (Sheng et al., 2021a) and profession (Dhamala et al., 2021). Our BiasAsker, on the contrary, can systematically and comprehensively measure the biases of different groups and properties.

Previous studies have utilized several methods to identify the bias in dialogue systems, such as training a neural network classifier (Sun et al., 2022) or commercial textual content moderation API (Si et al., 2022). However, such methods only consider the response, which is not sufficient to detect bias. And the accuracy of such external tools can not be guaranteed. Xu et al. (Xu et al., 2021) conduct human annotation on the responses, but much human effort is needed and does not support automatic testing upon request. Our BiasAsker, on the other hand, can detect the bias based on both the questions and the generated responses.

Conclusion

In this paper, we design and implement BiasAsker, the first automated framework for comprehensively measuring the social biases in conversational AI systems. BiasAsker is able to evaluate 1) to what degree is the system biased and 2)how social groups and biased properties are associated in the system. We conduct experiments on eight widely deployed commercial conversational AI systems and two famous research models and demonstrate that BiasAsker can effectively trigger a massive amount of biased behavior.

Data Availability

All the code, data, and results have been released https://github.com/yxwan123/BiasAsker for reproduction and future research.

Acknowledgement

The work described in this paper was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14206921 of the General Research Fund) and the National Natural Science Foundation of China (Grant Nos. 62102340)