Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources

Ninareh Mehrabi, Pei Zhou, Fred Morstatter, Jay Pujara, Xiang Ren, Aram Galstyan

Introduction

Commonsense knowledge is important for a wide range of natural language processing (NLP) tasks as a way to incorporate information about everyday situations necessary for human language understanding. Numerous models have included knowledge resources such as ConceptNet Speer et al. (2017) for question answering Lin et al. (2019), sarcasm generation Chakrabarty et al. (2020), and dialogue response generation Zhou et al. (2018, 2021), among others. However, commonsense knowledge resources are mostly human-generated, either crowdsourced from the public Speer et al. (2017); Sap et al. (2019) or crawled from massive web corpora Bhakthavatsalam et al. (2020). For example, ConceptNet originated from the Open Mind Common Sense project, which collects commonsense statements online from web users Singh et al. (2002) (ConceptNet also includes knowledge from expert-created sources such as WordNet Miller (1995)), and GenericsKB consists of crawled text from public websites. One issue with this approach is that crowdsourcing workers and web page writers may conflate their own prejudices with the notion of commonsense. For instance, we have found that querying ConceptNet for some target words, such as “church” (shown in Table 1), returns biased triples.

The potentially biased nature of commonsense knowledge bases (CSKB), given their increasing popularity, raises the urgent need to quantify biases both in the knowledge resources and in the downstream models that use these resources. We present the first study on measuring bias in two large CSKBs, namely ConceptNet Speer et al. (2017), the most widely used knowledge graph in commonsense reasoning tasks, and GenericsKB Bhakthavatsalam et al. (2020), which expresses knowledge in the form of natural language sentences and has gained increasing usage. We formalize a new quantification of “representational harms,” i.e., how social groups (referred to as “targets”) are perceived Barocas et al. (2017); Blodgett et al. (2020) in the context of CSKBs.

We consider two types of such harms in the context of CSKBs. One is intra-target overgeneralization, indicating that “common sense” in these resources may unfairly attribute a polarized (negative or positive) characteristic to all members of a target class such as “lawyers are dishonest.” The other is inter-target disparity, occurring when targets have significantly different coverage in the CSKB in terms of both the number of statements about the targets (e.g., “Persian” might have much fewer CS statements than “British”) and perception toward the targets (“islam” might have more negative CS statements than “christian”).

We propose a quantification of overgeneralization and disparity in CSKBs using two proxy measures of polarized perceptions: sentiment and regard Sheng et al. (2019). Applying the proposed metrics of bias to ConceptNet and GenericsKB, we find harmful overgeneralizations of both negative and positive perceptions over many target groups, indicating that human biases have been conflated with “common sense” in these resources. We find severe disparities across the targets in demographic categories such as professions and genders, both in the number of statements and the polarized perceptions about the targets.

We then examine two generative downstream tasks and the corresponding models that use ConceptNet. Specifically, we focus on automatic knowledge graph construction and story generation and quantify biases in COMeT Bosselut et al. (2019) and CommonsenseStoryGen (CSG) Guan et al. (2020). We find that these models also contain the harmful overgeneralizations and disparities found in ConceptNet. We then design a simple mitigation method that filters unwanted triples in ConceptNet according to our measures. We retrain COMeT using the filtered ConceptNet and show that our proposed mitigation approach helps reduce both overgeneralization and disparity in the COMeT model, but leads to a drop in the quality of the generated triples according to human evaluations. To support future work on evaluating biases in commonsense resources and models, we open-source our data and prompts at https://github.com/Ninarehm/Commonsense_bias.

Quantifying Representational Harms

Representational harms occur “when systems reinforce the subordination of some groups along the lines of identity” and can be further categorized into stereotyping, recognition, denigration, and under-representation Barocas et al. (2017); Crawford (2017); Blodgett et al. (2020). This work aims to formalize representational harms specifically for a set of statements about some target groups Nadeem et al. (2020); e.g., “lawyer is related to dishonest” is a statement about the group “lawyer.”

When measuring such harms, we consider the core concept of polarized perceptions: non-neutral views that can take the form of either prejudice, which expresses negative views, or favoritism, which expresses positive views, toward a certain target perceived in the statement Mehrabi et al. (2021). (The word prejudice is defined as “preconceived (usually unfavorable) evaluation of another person” Lindzey and Aronson (1968). We focus only on unfavorable prejudice and use favoritism for favorable evaluation.)

Two Types of Harms

To adapt the definition of representational harms to a sentence set, we define two sub-types of harms, intra-target overgeneralization and inter-target disparity, aiming to cover different categories of representational harms Barocas et al. (2017); Crawford (2017). We consider overgeneralization that directly examines whether targets such as “lawyer” or “lady” are perceived positively or negatively in the statements (examples in Table 1), covering categories including stereotyping, denigration, and favoritism. Then we consider disparity across different targets in representation (do some targets have fewer associated statements and lower coverage) and polarized perceptions (whether some targets are more positively or negatively perceived).

Measuring Polarized Perceptions

Prior work Sheng et al. (2019) demonstrated that sentiment and regard are effective measures of bias (polarized views toward a target group). Although this is still an active area of research, these are promising proxies that many works in ethical NLP have also used to measure bias (e.g., Sheng et al. (2019); Li et al. (2020); Brown et al. (2020); Sheng et al. (2020); Dhamala et al. (2021)). However, we acknowledge that these measures remain imperfect proxies for bias and that they can produce noisy labels. To put this to the test and to show that these measures can still be reliable proxies despite these problems, in this section we perform studies that include human evaluators in the loop and compare these measures with a keyword-based approach.

To determine the polarization of the perception that a statement expresses toward a group, we apply sentiment and regard classifiers to the statement containing the target group and obtain the corresponding label from each classifier. We then categorize the statement as favoritism, prejudice, or neutral based on the positive, negative, or neutral label obtained from each classifier.
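The label-to-category step described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name and the plain string labels are assumptions, and the classifier call itself is left out.

```python
# Hypothetical mapping from a classifier label to a polarized-perception
# category; the label strings are assumed, not taken from the paper's code.
LABEL_TO_CATEGORY = {
    "positive": "favoritism",
    "negative": "prejudice",
    "neutral": "neutral",
}


def polarity_category(classifier_label: str) -> str:
    """Map a sentiment or regard label to favoritism, prejudice, or neutral."""
    key = classifier_label.strip().lower()
    if key not in LABEL_TO_CATEGORY:
        raise ValueError(f"unexpected classifier label: {classifier_label!r}")
    return LABEL_TO_CATEGORY[key]
```

The mapping is applied separately to the sentiment label and to the regard label, so a statement receives one category per measure.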

Crowdsourcing Human Labels To validate the quality of these polarity proxies, we conduct crowdsourcing to solicit human labels on the statement polarity. We asked Amazon Mechanical Turk workers to label provided knowledge from GenericsKB Bhakthavatsalam et al. (2020) and ConceptNet Speer et al. (2017) with regard to favoritism, prejudice, and neutrality toward a target group. 3,000 instances were labeled from ConceptNet and more than 1,500 from GenericsKB. The inter-annotator agreement in terms of Fleiss’ kappa scores Fleiss (1971) for this task was 0.5007 and 0.3827 for GenericsKB and ConceptNet respectively.
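For readers unfamiliar with the agreement statistic, the following is a small sketch of the standard Fleiss' kappa computation. The function name and the input layout (one list of labels per item, with the same number of raters for every item) are assumptions made for illustration; it is not the script used to produce the scores above.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: List[Sequence[str]], categories: Sequence[str]) -> float:
    """Fleiss' kappa for items that were each labeled by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(item) for item in ratings]
    # Observed agreement: average of the per-item agreement P_i.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1.0 - p_e)
```

A call such as `fleiss_kappa(labels, ["favoritism", "prejudice", "neutral"])` returns a value of at most 1, where 1 means perfect agreement and values near 0 mean agreement at chance level.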

Alignment with Human Labels We compare human labels with those obtained from sentiment and regard classifiers to check the validity of these measures as proxies for overgeneralization. As shown in Table 2, we found reasonable agreement in terms of accuracy for sentiment and regard with human labels. This was also confirmed in previous work Sheng et al. (2019) in which sentiment and regard were shown to be good proxies to measure bias.

Comparison with Keyword-based Approach We also compare the sentiment and regard classifiers to a keyword-based baseline, in which we collect a list of biased words that could represent favoritism and prejudice from LIWC Tausczik and Pennebaker (2010) and Empath Fast et al. (2016). This method labels the statement sentences from ConceptNet and GenericsKB as positively/negatively overgeneralized if they contain words from our keyword list. As shown in Table 3, this method has a significantly lower recall and overall F1 value in identifying favoritism and prejudice compared to sentiment and regard measures.
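A keyword baseline of this kind, together with the precision, recall, and F1 computation used to compare it against gold labels, can be sketched as below. The keyword sets are tiny hypothetical stand-ins (the actual lists come from LIWC and Empath), and checking negative keywords before positive ones is an assumption made only to break ties.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in keyword lists, not the LIWC/Empath-derived lists.
POSITIVE_KEYWORDS = {"honest", "brave", "kind"}
NEGATIVE_KEYWORDS = {"dishonest", "violent", "lazy"}


def keyword_label(statement: str) -> str:
    """Label a statement by whole-word keyword matching."""
    tokens = set(re.findall(r"[a-z']+", statement.lower()))
    if tokens & NEGATIVE_KEYWORDS:  # assumed tie-break: negative wins
        return "prejudice"
    if tokens & POSITIVE_KEYWORDS:
        return "favoritism"
    return "neutral"


def precision_recall_f1(gold: List[str], pred: List[str], cls: str) -> Tuple[float, float, float]:
    """One-vs-rest precision, recall, and F1 for the class `cls`."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because the baseline only fires on listed words, any biased statement phrased without them is missed, which is consistent with the low recall reported for this method.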

Representational Harms in CSKBs

Collection of CSKB Triples We collect all the triples from ConceptNet 5.7 (https://github.com/commonsense/conceptnet5/wiki/Downloads) Speer et al. (2017) that contain the target words in each category, resulting in more than 100k triples. For GenericsKB, we use the GenericsKB-Best set Bhakthavatsalam et al. (2020), which contains filtered, high-quality sentences, and extract those whose annotated topic is one of our target words, resulting in around 30k statements (sentences).

Quantifying Harms During the classification process using sentiment and regard classifiers, we mask all the demographic information from the sentences to avoid biases in sentiment and regard classifiers that may affect our analysis. We obtain sentiment and regard labels for the masked sentences using the VADER sentiment analysis tool Gilbert and Hutto (2014) which is a rule-based sentiment analyzer. For regard, we use the fine-tuned BERT model from Sheng et al. (2019). After obtaining the labels, we use Eq. (1) and (2) to measure overgeneralization and Eq. (4) for disparity in overgeneralization.
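The masking step mentioned above can be illustrated with a short sketch. The mask token, the function name, and the whole-word, case-insensitive matching rule are all assumptions; the paper does not specify how the masking is implemented.

```python
import re
from typing import Iterable


def mask_targets(sentence: str, targets: Iterable[str], mask_token: str = "[MASK]") -> str:
    """Replace whole-word occurrences of target words with a mask token."""
    # Longest targets first so multi-word targets win over their sub-words.
    ordered = sorted(set(targets), key=len, reverse=True)
    if not ordered:
        return sentence
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask_token, sentence)
```

Note that this simple version does not handle inflected forms (for example, a plural of a target word is left unmasked); a real pipeline would need a rule for those.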

Analysis of Representational Harms

Results on Overgeneralization We quantify overgeneralization using Eq. (1) and (2) in Section 2. The overall average percentage of overgeneralized triples in ConceptNet is 4.5% (4.6k triples) for sentiment and 3.4% (3.6k triples) for regard. For GenericsKB, the percentages are 36.5% for sentiment (11k triples) and 38.6% for regard (11k triples). We find that both KBs contain sentences with polarized perceptions of either favoritism or prejudice, and that GenericsKB has a much higher rate of the two.
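As an illustration of how such percentages can be computed, the sketch below assumes that the overgeneralization measure for a target is the percentage of its statements labeled positive (favoritism) or negative (prejudice). This is an assumed reading of Eq. (1) and (2), and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def overgeneralization_rates(labeled: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Per-target percentage of statements labeled positive or negative.

    `labeled` holds (target, label) pairs, where label is "positive",
    "negative", or "neutral" under one measure (sentiment or regard).
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "total": 0})
    for target, label in labeled:
        entry = counts[target]
        entry["total"] += 1
        if label in ("positive", "negative"):
            entry[label] += 1
    return {
        target: {
            "favoritism": 100.0 * c["positive"] / c["total"],
            "prejudice": 100.0 * c["negative"] / c["total"],
        }
        for target, c in counts.items()
    }
```

Running it once on sentiment labels and once on regard labels gives the two sets of per-target percentages that the figures in this section plot.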

In a closer look, Figure 1 presents the box plots of negative and positive regard/sentiment percentages for targets in 4 categories for both CSKBs. The presence of outliers in these plots is a testament to the fact that targets can be harmed through overgeneralization: their sentiment and regard percentages can reach up to 30% for positive sentiment in ConceptNet and 80% in GenericsKB, and 17% for negative regard in ConceptNet and 100% in GenericsKB. We again find some qualitatively similar trends of representational harms across the two KBs, such as the box shapes for the “Gender” and “Religion” categories, indicating common biases in knowledge resources. Echoing the previous findings on overgeneralization rates, we find that the scales of biased percentages in GenericsKB are much higher than in ConceptNet.

Regions of Overgeneralization By plotting the negative and positive regard percentages for each target along the x and y coordinates, Figure 2 demonstrates the issue of overgeneralization in different categories. For example, for “Profession,” some target professions such as “CEO” are associated with a higher positive regard percentage (blue region) and thus a higher overgeneralization in terms of favoritism. In contrast, some professions, such as “politician,” are associated with a higher negative regard percentage (red region), representing a higher overgeneralization in terms of prejudice. In addition, some professions, such as “psychologist,” are associated with both high negative and high positive regard percentages (purple region) and thus with high overgeneralization in both directions.

ConceptNet vs GenericsKB We compare ConceptNet and GenericsKB on the “Religion” category and see that certain targets carry similar biases in both KBs: for example, “christian” is subject to both favoritism and prejudice, and “sharia” is prejudiced against. Furthermore, we find interesting discrepancies between the two KBs: GenericsKB’s overall percentages of positive and negative biases are much higher than ConceptNet’s, as indicated by the scales of the x and y axes (0-60% for GenericsKB and 0-16% for ConceptNet). This also aligns with our finding that GenericsKB has a higher rate of overgeneralization.

Severity of Overgeneralization Figure 3 further demonstrates how severe the problem of overgeneralization can be, along with some concrete examples. For instance, in the “Origin” category, “british” is overgeneralized because the bar plot shows high values for both the positive (blue) and negative (red) sentiment. In addition, from the “Profession” category, we can see an example for favoritism toward “teacher” because the bar plot shows high values for positive (blue) sentiment. In another instance from the “Religion” category, the high negative sentiment percentage for the “muslim” target illustrates the severity of prejudice toward the “muslim” target.

Representation Disparity We first quantify the disparity in terms of the number of triples for each target (word) in the 4 categories, using Eq. (3). Table 4 shows extremely high variance in both CSKBs. Figure 4 shows the boxplots for the numbers of triples available in ConceptNet and sentences in GenericsKB for different targets within two categories. We can see that the number ranges from 0 to thousands of triples for different targets in the two KBs, and GenericsKB has more severe outliers, with as many as around 6k. We also include sample bar plots for some of the targets within each category to highlight the existing disparities amongst them.

We further analyze the disparities amongst targets in terms of overgeneralization (favoritism and prejudice perceptions measured by sentiment and regard) using Eq. (4), shown in Table 4. We find that GenericsKB has much higher variance compared to ConceptNet. To better illustrate the disparity, boxplots in Figure 1 show the variation of overgeneralization across different groups for 4 categories. These plots illustrate the dispersion of negative sentiment/regard percentages, which represent prejudices against targets, as well as positive sentiment/regard percentages for favoritism toward targets. We can observe that targets such as “muslim” (shown in Figure 3) may be perceived negatively significantly more than others. The same trend also holds for positive sentiment and regard scores. Figure 2 also shows qualitatively that the targets are not clustered at some point with similar negative and positive regard percentages, but rather spread across different regions.
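Both disparity measures discussed above are reported as variances across targets. The sketch below assumes a population variance over per-target values; whether Eq. (3) and (4) use the population or the sample form is not stated here, and the function name is hypothetical.

```python
from statistics import pvariance
from typing import Dict


def disparity(value_by_target: Dict[str, float]) -> float:
    """Variance of a per-target quantity across the targets of one category.

    Pass statement counts for representation disparity, or negative/positive
    percentages for disparity in overgeneralization.
    """
    values = [float(v) for v in value_by_target.values()]
    if not values:
        raise ValueError("at least one target is required")
    return pvariance(values)  # assumed: population variance
```

A category whose targets all have the same value yields a disparity of 0, while a category with a few heavily covered or heavily polarized targets yields a large value.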

Analysis on Downstream Applications

As a popular downstream application, we first consider the task of commonsense knowledge base completion which looks to automatically augment a CSKB with generated facts Li et al. (2016). We focus our analysis on the COMeT model Bosselut et al. (2019), built by fine-tuning a pre-trained GPT model Radford et al. (2018) over ConceptNet triples. COMeT has been shown to generate unseen commonsense knowledge in ConceptNet with high quality, and much recent work has used it to provide commonsense background knowledge Shwartz et al. (2020); Chakrabarty et al. (2020).

Data We collect statements from COMeT as follows: we input the same target words used for ConceptNet as prompts and collect triples by following all relations existing in the model. Specifically, we collect the top 10 generated results from beam search for each of the 34 relations that COMeT learned from ConceptNet. We generate triples for all the targets we consider, resulting in 112k triples, which we convert to statements with the target words masked, following the same process as for ConceptNet.

Overgeneralization The analysis of statements generated by COMeT shows that the overgeneralization issue persists in the generated statements. For instance, for the “Religion” category, the mean of the negative regard is approximately 25%. This illustrates the prejudice toward the targets in the religion category in terms of overgeneralization. In addition, sentiment scores as high as 50% for some of the targets in some categories represent the severity of overgeneralization bias. Some additional qualitative examples are also included in Table 5.

Disparity in Overgeneralization Notice that in COMeT we do not have the data imbalance problem since COMeT is a generative model, and we generate an equal number of statements for each target. Disparity in number of triples is not an issue for this task. However, the disparity in overgeneralization is still an issue in COMeT. For instance, the results from COMeT shown in Figure 5 demonstrate the fact that variances exist in both regard and sentiment measures which is an indication of disparity in overgeneralization. This means that some targets are still extremely favored or disfavored according to regard and sentiment percentages compared to other targets, and that this disparity is still apparent amongst the targets.

Neural Story Generation

As our second downstream task, we consider Commonsense Story Generation (CSG) Guan et al. (2020): given a prompt, the model will generate 3 to 5 sentences to tell a story. The CSG model augments GPT-2 Radford et al. (2019) with external commonsense knowledge by training on the CSKB examples constructed from ConceptNet and ATOMIC Sap et al. (2019).

Data To analyze bias in the story output for CSG, we prompt the CSG model using sentences that are about the social perception of a certain target. We split our targets into: people, locations, professions, and others. Next, we manually come up with 30 templates inspired by the prefix templates for bias in NLG Sheng et al. (2019). Some examples are listed in Table 6. We then generate prompts by filling the corresponding templates with target names, resulting in around 3k prompts for CSG. CSG generates a total of 12k sentences and we calculate regard and sentiment percentages based on all the sentences for a given story.
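The prompt construction described above, filling templates with target names per target type, can be sketched as follows. The template strings in the usage note are hypothetical examples in the prefix-template style, not the 30 templates used in the study, and the function name is an assumption.

```python
from typing import Dict, List


def build_prompts(
    targets_by_type: Dict[str, List[str]],
    templates_by_type: Dict[str, List[str]],
) -> List[str]:
    """Fill every template of a target type with every target of that type."""
    prompts = []
    for target_type, targets in targets_by_type.items():
        for template in templates_by_type.get(target_type, []):
            for target in targets:
                # Each template is assumed to contain a "{target}" placeholder.
                prompts.append(template.format(target=target))
    return prompts
```

For example, with `{"professions": ["lawyer", "teacher"]}` and `{"professions": ["The {target} was known for"]}`, the function returns two prompts, one per profession.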

Overgeneralization From Figure 5, we observe similar patterns in terms of the existence of the overgeneralization issue. For instance, targets in categories such as “Religion” reach up to 60% negative associations in terms of regard and sentiment scores.

Disparity in Overgeneralization As with the COMeT model, since we generate an equal number of statements for each target in this task, we do not observe the disparity in the number of statements that we did with ConceptNet. However, as illustrated in the results presented in Figure 5, the disparity in overgeneralization is still problematic. For instance, the negative sentiment in the “Religion” category spans from 0% to 60%. In addition, the “Origin” category for the CSG task has a significant spread similar to other categories, such as “Religion” and “Gender”.

Bias Mitigation on CSKB Completion

To mitigate the observed representational harms in ConceptNet and their effects on downstream tasks, we propose a pre-processing data filtering technique that reduces the effect of existing representational harms in ConceptNet. We apply our mitigation technique on COMeT as a case study.

Our pre-processing technique relies on data filtering. In this approach, the ConceptNet triples are first passed through regard and sentiment classifiers and only get included in the training process of the downstream tasks if they do not contain representational harms in terms of our regard and sentiment measures. In other words, in this framework, all the biased triples that were associated with a positive or negative label from regard and sentiment classifiers get filtered out and only neutral triples with neutral label get used.
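The filtering step can be sketched as below. The two classifier arguments are placeholders for the sentiment and regard classifiers, and keeping a triple only when both label it neutral is an assumed reading of the description above; the function and type names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def filter_neutral_triples(
    triples: Iterable[Triple],
    sentiment_label: Callable[[Triple], str],
    regard_label: Callable[[Triple], str],
) -> List[Triple]:
    """Keep only triples that both classifiers label as neutral."""
    kept = []
    for triple in triples:
        # Assumed rule: a positive or negative label from either classifier
        # removes the triple from the training data.
        if sentiment_label(triple) == "neutral" and regard_label(triple) == "neutral":
            kept.append(triple)
    return kept
```

The surviving triples then replace the original training set when retraining the downstream model.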

Results on Overgeneralization To measure the effectiveness of mitigation on overgeneralization, we aim to increase the overall mean of neutral triples, which is indicative of reducing the overall favoritism and prejudice according to sentiment and regard measures. We report the effects on overgeneralization for sentiment as Neutral Sentiment Mean (NSM) and for regard as Neutral Regard Mean (NRM). As demonstrated in Table 7, by increasing the overall neutral sentiment and regard means, our filtered model is able to reduce the unwanted positive and negative associations and reduce the overgeneralization issue.

Results on Disparity in Overgeneralization To measure the effectiveness of mitigation on disparity in overgeneralization, we aim to reduce the existing variance amongst different targets. We report the disparity in overgeneralization for sentiment as Neutral Sentiment Variance (NSV) and for regard as Neutral Regard Variance (NRV). As shown in Table 7, our filtering technique reduces the variance and disparities amongst targets compared to the standard COMeT model in terms of regard and sentiment measures.
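A small sketch of how the neutral mean and variance metrics could be computed from per-target neutral percentages follows. It assumes that NSM/NRM are means and NSV/NRV are population variances over targets; the exact aggregation behind Table 7 is not spelled out here, and the function name is hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, Tuple


def neutral_mean_and_variance(neutral_pct_by_target: Dict[str, float]) -> Tuple[float, float]:
    """Return (mean, variance) of the per-target neutral percentages.

    With sentiment labels this corresponds to (NSM, NSV); with regard labels
    to (NRM, NRV), under the assumptions stated above.
    """
    values = [float(v) for v in neutral_pct_by_target.values()]
    if not values:
        raise ValueError("at least one target is required")
    return mean(values), pvariance(values)
```

Under this reading, successful mitigation shows up as a higher mean together with a lower variance.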

Human Evaluation of Mitigation Results In addition to reporting regard and sentiment scores, we perform human evaluation on Amazon Mechanical Turk on 3,000 generated triples from the standard COMeT and COMeT-Filtered models to evaluate both the quality of the generated triples and their bias from the human perspective. From the results in Table 7, one can observe that COMeT-Filtered is judged to have less overall overgeneralization harm, since humans rated more of its generated triples as neutral, containing neither negative nor positive associations. This is shown as Human Neutral Mean (HNM) in Table 7. However, this came with a trade-off in quality: COMeT-Filtered is rated lower than standard COMeT in terms of the validity of its triples. We encourage future work to improve the quality of the filtered model. In addition, we measure the inter-annotator agreement and report the Fleiss’ kappa scores Fleiss (1971) to be 0.4788 and 0.6407 for quality and representational harm ratings respectively for the standard COMeT model, and 0.4983 and 0.6498 for COMeT-Filtered.

Related Work

Work on fairness in NLP has expanded to different applications and domains including coreference resolution Zhao et al. (2018a), named entity recognition Mehrabi et al. (2020), machine translation Font and Costa-jussà (2019), word embedding Bolukbasi et al. (2016); Zhao et al. (2018b); Zhou et al. (2019), as well as surveys Sun et al. (2019); Blodgett et al. (2020); Mehrabi et al. (2021). Despite the aforementioned extensive research in this area, not much attention has been given to the representational harms in tools and models used for commonsense reasoning.

Injecting commonsense knowledge into NLP tasks is gaining attention Storks et al. (2019); Chang et al. (2020). In our work, we study two downstream tasks in this area and show how they are affected by existing biases in upstream commonsense knowledge resources like ConceptNet. Although Sweeney and Najafian (2019) have previously shown that ConceptNet word embeddings Speer (2017) are less biased compared to other embeddings, we demonstrate that destructive biases still exist in ConceptNet that need to be carefully studied.

Conclusion

Incorporating commonsense knowledge into models is becoming a popular trend, as it is important for our models to mimic humans and the way they utilize commonsense knowledge in performing different tasks. One danger of mimicking humans is adopting their biases. We performed a study to analyze existing representational harms in two commonsense knowledge resources and their effects on different downstream tasks and models. We analyzed two harms, overgeneralization and disparity, using models of sentiment and regard. In addition, we introduced a pre-processing mitigation technique and evaluated this approach using our measures as well as human evaluations. Future directions include designing more effective mitigation techniques that do not harm the quality of models.

Acknowledgments

We thank anonymous reviewers for providing insightful feedback along with Brendan Kennedy and Lee Kezar for their comments and help. Xiang Ren’s research is supported in part by the DARPA MCS program under Contract No. N660011924033, the Defense Advanced Research Projects Agency with award W911NF-19-20271, NSF IIS 2048211, and NSF SMA 182926. This material is based upon work supported, in part, by the Defense Advanced Research Projects Agency (DARPA) and Army Research Office (ARO) under Contract No. W911NF-21-C-0002.

Ethics and Broader Impact

This work primarily advocates for having more ethical commonsense reasoning resources and models. In the near future, there will likely be more efforts to incorporate commonsense in NLP models. Conflating human biases with commonsense is harmful. Thus, pointing out the existing problems and proposing simple solutions to them can have a significant, broad impact on the community. We acknowledge that our paper contains disturbing content, but these egregious examples are representative of the knowledge supplied to NLP models. Our goal is not to devalue any work or any target group, but to raise awareness of these problems in the AI community. We also acknowledge that we do not cover all the possible existing target groups in each category, such as non-binary gender groups. However, we incorporated groups from Nadeem et al. (2020) and made extensions to fill gaps in these groups. Additionally, during our studies, we made sure to consider these ethical aspects. For instance, while conducting Mechanical Turk experiments with human workers, we made sure to keep the workers aware of the potentially offensive content that our work may contain, and we also made sure to pay workers a reasonable amount for their work (around $11 per hour, well above the minimum wage). We hope that our material will help the research community to consider these problems as serious issues and work toward addressing them in a more rigorous fashion.

References

Appendix A Qualitative Examples

In the appendix, we provide additional qualitative analysis and detailed experimental results that we could not include in the main text due to space limitations. For instance, in Table 9 we include more qualitative results and demonstrate some destructive triples existing in ConceptNet. In addition to ConceptNet examples, Table 9 includes some examples from the COMeT model. Similarly, Table 10 includes some examples for the Commonsense Story Generation model (CSG). Given a prompt, we show what outputs CSG can generate that can be in favor of or against a target group or word. Tables 14 and 15 contain the detailed list of these target groups and words.

Appendix B Mitigation Framework

In addition, we provide a visual for our mitigation framework in Figure 7 and detailed results of COMeT vs COMeT-Filtered comparisons over different categories. Table 11 contains detailed results for the sentiment and regard measures over all the categories, and Table 12 contains detailed results from human evaluations over all the categories.

Appendix C Human Evaluation

COMeT vs Filtered-COMeT For human evaluations, we sample the top 3 generated triples for each of the “CapableOf”, “Causes”, and “HasProperty” relations for all the groups in each category, resulting in around 1,000 triples for each model, and ask three Mechanical Turk workers to rate each of the triples in terms of quality (whether a triple is valid commonsense or not) and bias (whether a triple shows favoritism or prejudice or is neutral toward the demographic groups). This gave us around 3,000 ratings for each of the models (around 6,000 in total across the two models). Figure 10 includes a sample from our survey on the Amazon Mechanical Turk platform. We report the inter-annotator agreement in terms of Fleiss’ kappa scores in the main text. These numbers indicate reasonable agreement. Specifically, the annotators agreed more on the bias ratings, which reflect the main strength of our COMeT-Filtered model, than on the quality ratings. While it is easier for the annotators to judge whether something is biased or not, it might be harder for them to judge the quality of generated commonsense. With that being said, the agreements are reasonable and acceptable for both tasks.

ConceptNet vs GenericsKB For this task, we also asked three Mechanical Turk workers to rate 1,000 instances from ConceptNet and more than 500 instances from GenericsKB. The statements were chosen randomly. We also made sure that each type (favoritism, prejudice, and neutral) was well represented.

Appendix D Experimental Details

Sentiment Analysis For sentiment analysis, we used a threshold value of greater than or equal to $0.05$ for positive sentiment classification and a threshold value of less than or equal to $-0.05$ for negative sentiment classification, as suggested in Gilbert and Hutto (2014).

Filtered-COMeT and COMeT We used the same configurations for training Filtered-COMeT as config_0.json in the COMeT repository (https://github.com/atcbosselut/comet-commonsense); details for training COMeT can be obtained from the same repository. The train, test, and two dev sets were adopted from the COMeT repository (ConceptNet train100k.txt, test.txt, dev1.txt, and dev2.txt) and augmented according to our filtering approach. As in COMeT, our model is initialized from a pre-trained GPT model with 768 hidden dimensions, 12 layers, and 12 heads. We used an Nvidia GeForce RTX 2080 to train the Filtered-COMeT model using the Adam optimizer for 100,000 iterations.

Commonsense Story Generation Experimental details can be found in the CommonsenseStoryGen repository (https://github.com/thu-coai/CommonsenseStoryGen).
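For the sentiment analysis setup in this appendix, the thresholding step can be sketched as below. It assumes the thresholds are applied to VADER's compound score, which lies in [-1, 1]; the function name is hypothetical and the call to the sentiment analyzer itself is omitted.

```python
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def sentiment_label(compound_score: float) -> str:
    """Map a compound sentiment score to positive, negative, or neutral."""
    if compound_score >= POSITIVE_THRESHOLD:
        return "positive"
    if compound_score <= NEGATIVE_THRESHOLD:
        return "negative"
    # Scores strictly between the two thresholds are treated as neutral.
    return "neutral"
```

Both thresholds are inclusive, so a score of exactly 0.05 counts as positive and a score of exactly -0.05 counts as negative.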