Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings

Thomas Manzini, Yao Chong Lim, Yulia Tsvetkov, Alan W Black

Introduction

In addition to possessing informative features useful for a variety of NLP tasks, word embeddings reflect and propagate social biases present in training corpora (Caliskan et al., 2017; Garg et al., 2018). Machine learning systems that use embeddings can further amplify biases (Barocas and Selbst, 2016; Zhao et al., 2017), discriminating against users, particularly those from disadvantaged social groups.

Bolukbasi et al. (2016) introduced a method to debias embeddings by removing components that lie in stereotype-related embedding subspaces. They demonstrate the effectiveness of the approach by removing gender bias from word2vec embeddings (Mikolov et al., 2013), preserving the utility of embeddings and potentially alleviating biases in downstream tasks. However, this method handles only binary labels (e.g., male/female), whereas most real-world demographic attributes, including gender, race, and religion, are not binary but continuous or categorical, with more than two categories.

In this work, we present a generalization of Bolukbasi et al.’s (2016) method that enables multiclass debiasing while preserving the utility of embeddings (§3). We train word2vec embeddings using the Reddit L2 corpus (Rabinovich et al., 2018) and apply multiclass debiasing using lexicons from studies on bias in NLP and social science (§4.2). We introduce a novel metric for evaluation of bias in collections of word embeddings (§5). Finally, we validate that the utility of debiased embeddings in the tasks of part-of-speech (POS) tagging, named entity recognition (NER), and POS chunking is on par with off-the-shelf embeddings.

Background

As defined by Bolukbasi et al. (2016), debiasing word embeddings in a binary setting requires identifying the bias subspace of the embeddings. Components lying in that subspace are then removed from each embedding.

Bolukbasi et al. (2016) define the gender subspace using defining sets of words, where the words in each set represent different ends of the bias. For example, in the case of gender, one defining set might be the gendered pronouns {he, she} and another set might be the gendered nouns {man, woman}. The gender subspace is then computed from these defining sets by 1) computing the vector differences between the embedding of each word in a set and that set’s mean, and 2) taking the most significant principal components of these difference vectors.

Removing bias components

Following the identification of the gender subspace, one can apply hard or soft debiasing (Bolukbasi et al., 2016) to completely or partially remove the subspace components from the embeddings.

Hard debiasing (also called “Neutralize and Equalize”) involves two steps. First, bias components are removed from words that are not gendered and should not contain gender bias (e.g., doctor, nurse), and second, gendered word embeddings are centered and their bias components are equalized. For example, in the binary case, man and woman should have bias components in opposite directions, but of the same magnitude. Intuitively, this then ensures that any neutral words are equidistant to any biased words with respect to the bias subspace.

More formally, to neutralize, given a bias subspace $\mathcal{B}$ spanned by the vectors $\{\bm{b_{1}},\bm{b_{2}},...,\bm{b_{k}}\}$, we compute the component of each embedding in this subspace:
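Assuming the basis vectors are orthonormal (as principal components are), this component is the sum of the projections of $\mathbf{w}$ onto each basis vector (Equation 1):

```latex
\[
\mathbf{w}_{\mathcal{B}} = \sum_{i=1}^{k} \langle \mathbf{w}, \bm{b_{i}} \rangle \, \bm{b_{i}} \tag{1}
\]
```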

We then remove this component from words that should be bias-neutral and normalize to get the debiased embedding:
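With $\mathbf{w}_{\mathcal{B}}$ denoting the component of $\mathbf{w}$ in the bias subspace (Equation 1) and $\mathbf{w}'$ the debiased embedding, this step can be written as (Equation 2):

```latex
\[
\mathbf{w}' = \frac{\mathbf{w} - \mathbf{w}_{\mathcal{B}}}{\lVert \mathbf{w} - \mathbf{w}_{\mathcal{B}} \rVert} \tag{2}
\]
```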

To equalize the embeddings of words in an equality set $E$, let $\bm{\mu}=\frac{1}{|E|}\sum_{\mathbf{w}\in E}\mathbf{w}$ be the mean embedding of the words in the set and $\bm{\mu}_{\mathcal{B}}$ be its component in the bias subspace as calculated in Equation 1. Then, for $\mathbf{w}\in E$,
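the equalized embedding $\mathbf{w}'$ can be written, following the equalize step of Bolukbasi et al. (2016) and the notation above (with $\mathbf{w}_{\mathcal{B}}$ the bias-subspace component of $\mathbf{w}$), as (Equation 3):

```latex
\[
\mathbf{w}' = (\bm{\mu} - \bm{\mu}_{\mathcal{B}})
  + \sqrt{1 - \lVert \bm{\mu} - \bm{\mu}_{\mathcal{B}} \rVert^{2}} \;
    \frac{\mathbf{w}_{\mathcal{B}} - \bm{\mu}_{\mathcal{B}}}
         {\lVert \mathbf{w}_{\mathcal{B}} - \bm{\mu}_{\mathcal{B}} \rVert} \tag{3}
\]
```

The first term is shared by every word in $E$ and is orthogonal to the bias subspace; the second term lies inside the bias subspace and is rescaled so that the sum has unit length.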

Note that in both Equations 2 and 3, the new embedding has unit length.
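The neutralize and equalize steps can be illustrated with a minimal NumPy sketch. It assumes the bias subspace is supplied as a matrix `basis` with orthonormal rows and that the input embeddings have unit length; the function names are illustrative and are not taken from the released code.

```python
import numpy as np


def subspace_component(w, basis):
    """Component of w in the subspace spanned by the orthonormal rows of basis."""
    return basis.T @ (basis @ w)


def neutralize(w, basis):
    """Remove the bias component from w and rescale the result to unit length."""
    residual = w - subspace_component(w, basis)
    return residual / np.linalg.norm(residual)


def equalize(equality_set, basis):
    """Equalize the unit-length embeddings (rows) of one equality set.

    Assumes no word's bias component coincides with the mean's bias
    component, which would make the normalization below divide by zero.
    """
    mu = equality_set.mean(axis=0)
    mu_b = subspace_component(mu, basis)
    nu = mu - mu_b  # shared part, orthogonal to the bias subspace
    scale = np.sqrt(max(0.0, 1.0 - nu @ nu))
    equalized = []
    for w in equality_set:
        diff = subspace_component(w, basis) - mu_b
        equalized.append(nu + scale * diff / np.linalg.norm(diff))
    return np.stack(equalized)
```

Because the shared part `nu` is orthogonal to the bias subspace and the in-subspace part is a unit vector scaled by `scale`, the squared norms add up to one, which is the unit-length property noted above.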

Soft debiasing

Soft debiasing involves learning a projection of the embedding matrix that preserves the inner product between biased and debiased embeddings while minimizing the projection onto the bias subspace of embeddings that should be neutral.

Given embeddings $\mathbf{W}$ and $\mathbf{N}$, which are the embeddings for the whole vocabulary and for the subset of bias-neutral words respectively, and the bias subspace $\mathcal{B}$ obtained in Section 2.1, soft debiasing seeks a linear transformation $A$ that minimizes the following objective:
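Following the formulation of Bolukbasi et al. (2016), and treating $\mathcal{B}$ as a matrix whose columns are the basis vectors of the bias subspace, the objective can be written as below. The weight $\lambda$ is a trade-off hyperparameter assumed here from that formulation:

```latex
\[
\min_{A} \;
  \lVert (A\mathbf{W})^{\top}(A\mathbf{W}) - \mathbf{W}^{\top}\mathbf{W} \rVert_{F}^{2}
  + \lambda \, \lVert (A\mathbf{N})^{\top}(A\mathcal{B}) \rVert_{F}^{2}
\]
```

The first term preserves pairwise inner products across the vocabulary; the second penalizes the projection of the neutral words onto the bias subspace.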

Methodology

We now discuss our proposed extension of word embedding debiasing to the multiclass setting.

As in the binary setting, debiasing consists of two steps: identifying the “bias subspace” and removing this component from the set of embeddings.

The core contribution of our work is in identifying the “bias subspace” in a multiclass setting; if we can identify the bias subspace then prior work can be used for multiclass debiasing.

Past work has shown that it is possible to linearly separate multiple social classes based on components of word embeddings (Garg et al., 2018). Based on this we hypothesize that there exists some component of these embeddings which can capture multiclass bias. While a multiclass problem is inherently not a linearly separable problem, a one versus rest classifier is. Following from this, the computation of a multiclass bias subspace does not have any linear constraints, though it does come with a loss of resolution. As a result we can compute the principal components required to compute the “bias subspace” by simply adding an additional term for each additional bias class to each defining set.

Formally, given defining sets of word embeddings $D_{1},D_{2},...,D_{n}$, let the mean of defining set $i$ be $\bm{\mu}_{i}=\frac{1}{|D_{i}|}\sum_{\mathbf{w}\in D_{i}}\mathbf{w}$, where $\mathbf{w}$ is the word embedding of $w$. Then the bias subspace $\mathcal{B}$ is given by the first $k$ components of the following principal component analysis (PCA) evaluation:
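One way to write this evaluation, consistent with the definitions above, pools the mean-centered embeddings from every defining set:

```latex
\[
\mathcal{B} = \mathrm{PCA}\!\left( \bigcup_{i=1}^{n} \; \bigcup_{\mathbf{w} \in D_{i}} \left( \mathbf{w} - \bm{\mu}_{i} \right) \right)
\]
```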

The number of components $k$ can be empirically determined by inspecting the eigenvalues of the PCA, or using a threshold. Also, note that the defining sets do not have to be the same size. We discuss the robustness of this method later.
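The subspace computation can be sketched in a few lines of NumPy. This is a minimal illustration that assumes each defining set is given as an array of embeddings (one row per word); the function name `bias_subspace` is hypothetical, and the PCA is carried out with a singular value decomposition rather than any particular library routine.

```python
import numpy as np


def bias_subspace(defining_sets, k):
    """Top-k principal directions of the pooled, mean-centered defining sets.

    defining_sets: list of arrays, each of shape (n_i, d); the n_i may differ.
    Returns (basis, singular_values), where basis has k orthonormal rows.
    k must not exceed min(total number of words, d).
    """
    # Center every defining set on its own mean, then pool the differences.
    centered = [D - D.mean(axis=0) for D in defining_sets]
    X = np.vstack(centered)
    # The pooled rows already sum to zero, so the right singular vectors
    # of X are the principal directions of the pooled differences.
    _, singular_values, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k], singular_values
```

The squared singular values are proportional to the PCA eigenvalues, so inspecting how quickly they decay is one way to choose $k$.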

Removing Bias Components

Following the identification of the bias subspace, we apply the hard (Neutralize and Equalize) and soft debiasing methods presented by Bolukbasi et al. (2016) and discussed in Section 2.2 to completely or partially remove the subspace components from the embeddings.

For equalization, we take the defining sets to be the equality sets as well.

Quantifying Bias Removal

We propose a new metric for the evaluation of bias in collections of words which is simply the mean average cosine similarity (MAC). This approach is motivated by the WEAT evaluation method proposed by Caliskan et al. (2017) but modified for a multiclass setting. To compute this metric the following data is required: a set of target word embeddings $T$ containing terms that inherently contain some form of social bias (e.g. {church, synagogue, mosque}), and a set $A$ which contains sets of attributes $A_{1},A_{2},...,A_{N}$ containing word embeddings that should not be associated with any word embeddings contained in the set $T$ (e.g. {violent, liberal, conservative}).

We define a function $S$ that computes the mean cosine distance between a particular target $T_{i}$ and all terms in a particular attribute set $A_{j}$:
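Writing $\cos(\mathbf{u},\mathbf{v})$ for the cosine distance referred to in the text, a form consistent with the definitions above is given below. The second line averages $S$ over all pairs of targets and attribute sets to obtain the MAC:

```latex
\[
S(\mathbf{t}, A_{j}) = \frac{1}{|A_{j}|} \sum_{\mathbf{a} \in A_{j}} \cos(\mathbf{t}, \mathbf{a})
\]
\[
\mathrm{MAC}(T, A) = \frac{1}{|T|\,|A|} \sum_{T_{i} \in T} \; \sum_{A_{j} \in A} S(T_{i}, A_{j})
\]
```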

We also perform a paired $t$-test on the distribution of average cosines used to calculate the MAC. Thus we can quantify the effect of debiasing on word embeddings in $T$ and sets in $A$.
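A minimal NumPy sketch of the metric and the paired test statistic follows. It assumes targets and attributes are given as arrays of embeddings; the function names are illustrative, and only the $t$ statistic is computed here (a $p$-value could be obtained with, for example, `scipy.stats.ttest_rel`).

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance: one minus the cosine similarity of u and v."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def s_scores(targets, attribute_sets):
    """One mean cosine distance S(T_i, A_j) per (target, attribute set) pair."""
    return np.array([
        np.mean([cosine_distance(t, a) for a in A_j])
        for t in targets
        for A_j in attribute_sets
    ])


def mac(targets, attribute_sets):
    """Mean of S over all targets and all attribute sets."""
    return s_scores(targets, attribute_sets).mean()


def paired_t_statistic(before, after):
    """Paired t statistic for S scores measured before and after debiasing."""
    diff = np.asarray(after) - np.asarray(before)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
```

Calling `s_scores` on the biased and the debiased embeddings with the same targets and attribute sets yields the two paired samples passed to `paired_t_statistic`.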

Measuring Downstream Utility

To measure the utility of the debiased word embeddings, we use the tasks of NER, POS tagging, and POS chunking. This is to ensure that the debiasing procedure has not destroyed the utility of the word embeddings. We evaluate test sentences that contain at least one word affected by debiasing. Additionally, we measure the change in performance after replacing the biased embedding matrix by a debiased one, and retraining the model on debiased embeddings.

Data

In this section we discuss the different data sources we used for our initial word embeddings, the social bias data used for evaluating bias, and the linguistic data used for evaluating the debiasing process.

We used the L2-Reddit corpus (Rabinovich et al., 2018), a collection of Reddit posts and comments by both native and non-native English speakers. The native countries of post authors are determined based on their posts in country- and region-specific subreddits (such as r/Europe and r/UnitedKingdom), and other metadata such as user flairs, which serve as self-identification of the user’s country of origin.

In this work, we exclusively explore data collected from the United States. This was done to leverage extensive studies of social bias done in the United States. To obtain the initial biased word embeddings, we trained word2vec embeddings (Mikolov et al., 2013) using approximately 56 million sentences.

Social Bias Data

We used the following vocabularies and studies to compile lexicons for bias detection and removal. The source code and lexicons can be found here: https://github.com/TManzini/DebiasMulticlassWordEmbedding/.

For gender, we used vocabularies created by Bolukbasi et al. (2016) and Caliskan et al. (2017).

For race we consulted a number of different sources for each race: Caucasians (Chung-Herrera and Lankau, 2005; Goad, 1998); African Americans (Punyanunt-Carter, 2008; Brown Givens and Monahan, 2005; Chung-Herrera and Lankau, 2005; Hakanen, 1995; Welch, 2007; Kawai, 2005); and Asian Americans (Leong and Hayes, 1990; Lin et al., 2005; Chung-Herrera and Lankau, 2005; Osajima, 2005; Garg et al., 2018).

Finally, for religion we used the following sources and labels: Christians (Rios et al., 2015; Zuckerman, 2009; Unnever et al., 2005); Jews (Dundes, 1971; Fetzer, 2000); and Muslims (Shryock, 2010; Alsultany, 2012; Shaheen, 1997).

Downstream Tasks

We evaluate biased and debiased word embeddings on several downstream tasks. Specifically, we use the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003), which provides evaluation data for NER, POS tagging, and POS chunking.

Results and Discussion

In this section we review the results of our experiments and discuss what those results mean in the context of this work.

We use the analogy task from Bolukbasi et al. (2016) to demonstrate that bias exists in these word embeddings. To construct our analogies, we trained five word2vec (Mikolov et al., 2013) embedding spaces on the same data. We then constructed a set of analogies for each embedding space and took the intersection of these sets to form a working set of analogies. We performed this extra step to ensure the analogies were robust to perturbations in the embedding space. Inspecting the resulting analogies directly, we observe that bias is present. A small subset of these analogies is shown in Table 1 to highlight our findings.
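The intersection step can be sketched as follows. This is only an illustration: it assumes each analogy is represented as a hashable tuple of four words, and the function name is hypothetical. How the analogies are generated in each space is not shown.

```python
def robust_analogies(analogies_per_space):
    """Keep only the analogies that were generated in every embedding space.

    analogies_per_space: one iterable of analogy tuples per embedding space.
    """
    sets = [set(analogies) for analogies in analogies_per_space]
    if not sets:
        return set()
    return set.intersection(*sets)
```

An analogy survives only if all of the independently trained spaces produce it, which filters out analogies that depend on one particular random initialization.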

Removal of Bias

We perform our debiasing in the same manner as described in Section 3.1 and calculate the MAC scores and $p$-values to measure the effects of debiasing. Results are presented in Table 2.

Does multiclass debiasing decrease bias? We see that this debiasing procedure categorically moves MAC scores closer to 1.0. This indicates an increase in cosine distance. Further, the associated $p$-values indicate these changes are statistically significant. This demonstrates that our approach for multiclass debiasing decreases bias.

Downstream Effects of Bias Removal

The effects of debiasing on downstream tasks are shown in Table 3. Debiasing can either help or harm performance. For POS tagging there is almost always a decrease in performance. However, for NER and POS chunking, there is a consistent increase. We conclude that these models have learned to depend on some bias subspaces differently. Note that many performance changes are of questionable statistical significance.

Does multiclass debiasing preserve semantic utility? We argue the minor changes in Table 3 support the preservation of semantic utility in the multiclass setting, especially compared to gender debiasing which is known to preserve utility (Bolukbasi et al., 2016).

Is the calculated bias subspace robust? The bias subspace is at least robust enough to support the above debiasing operations. This is shown by statistically significant changes in MAC scores.

Limitations & Future Work

Calculating multiclass bias subspace using our proposed approach has drawbacks. For example, in the binary gender case, the extremes of bias subspace reflect extreme male and female terms. However, this is not possible when projecting multiple classes into a linear space. Thus, while we can calculate the magnitude of the bias components, we cannot measure extremes of each class.

Additionally, the methods presented here rely on words that represent biases (defining sets) and words that should or should not contain biases (equality sets). These lists are based on data collected specifically from the US. Thus, they may not translate to other countries or cultures. Further, some of these vocabulary terms, while peer reviewed, may be subjective and may not fully capture the bias subspace.

Recent work by Gonen and Goldberg (2019) suggests that debiasing methods based on bias component removal are insufficient to completely remove bias in the embeddings, since embeddings with similar biases are still clustered together after bias component removal. Following Gonen and Goldberg’s (2019) procedure, we plot the number of neighbors of a particular bias class as a function of the original bias, before and after debiasing, in Figures 1 and 2 in the Appendix. In line with Gonen and Goldberg’s (2019) findings, simply removing the bias component is insufficient to remove multiclass “cluster bias”. However, increasing the size of the bias subspace reduces the correlation of the two variables (Table 4 in the Appendix).

Conclusion

We showed that word embeddings trained on www.reddit.com data contain multiclass biases. We presented a novel metric for evaluating debiasing procedures for word embeddings. We robustly removed multiclass bias using a generalization of existing techniques. Finally, we showed that this multiclass generalization preserves the utility of embeddings for different NLP tasks.

Acknowledgments

This research was supported by Grant No. IIS1812327 from the United States National Science Foundation (NSF). We also acknowledge several people who contributed to this work: Benjamin Pall for his valuable early support of this work; Elise Romberger who helped edit this work prior to its final submission. Finally, we are greatly appreciative of the anonymous reviewers for their time and constructive comments.

References

Appendix A Addressing Cluster Bias

To visualize the degree of cluster bias before and after our debiasing procedure, we follow a similar procedure to Gonen and Goldberg (2019). For a defining set $D$ for the target task (e.g. religion, race, gender), we compute the mean embedding $\bm{\mu}=\frac{1}{|D|}\sum_{\mathbf{c}\in D}\mathbf{c}$. Then, for each class $\mathbf{c}$ in the defining set, we define the bias direction as $\mathbf{b}=\frac{\mathbf{c}-\bm{\mu}}{\lVert\mathbf{c}-\bm{\mu}\rVert}$. Using this, we find the 500 most biased words in each direction in the whole vocabulary based on their component in the bias direction: $\langle\mathbf{w},\mathbf{b}\rangle$.

Then, using the list of professions from Bolukbasi et al. (2016) (available at https://github.com/tolga-b/debiaswe/blob/master/data/professions.json), we find the 100 closest neighbors for each profession. We then plot the number of neighbors with positive bias against the original bias of the profession word, as shown in Figures 1 and 2. The plots suggest that while the correlation between the bias component and the number of positively-biased neighbors might decrease slightly as the number of bias subspace dimensions increases, the cluster bias is still not fully removed. As Table 4 shows, while the correlation between the two quantities decreases as the number of subspace dimensions increases to 2 or 3, its magnitude is still high.
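The procedure above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the vocabulary is a matrix with one embedding per row, neighbors are ranked by cosine similarity, and a neighbor counts as "positively biased" if it is among the most biased words for the chosen direction. That last reading is an interpretation made for this sketch, and all function names are hypothetical.

```python
import numpy as np


def bias_direction(class_vec, defining_set):
    """Unit vector from the defining-set mean towards one class embedding."""
    diff = class_vec - defining_set.mean(axis=0)
    return diff / np.linalg.norm(diff)


def most_biased(vocab_matrix, b, n=500):
    """Indices of the n words with the largest component <w, b>."""
    scores = vocab_matrix @ b
    return np.argsort(-scores)[:n]


def biased_neighbor_count(query, vocab_matrix, biased_idx, k=100):
    """Count how many of the k nearest neighbors of query are in biased_idx.

    Simplification: if the query word is itself a row of vocab_matrix,
    it is counted as one of its own neighbors.
    """
    normed = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    sims = normed @ (query / np.linalg.norm(query))
    neighbors = np.argsort(-sims)[:k]
    return int(np.isin(neighbors, biased_idx).sum())
```

Computing `biased_neighbor_count` for each profession word, together with that word's original component `query @ b`, gives the two quantities whose relationship is plotted and correlated.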