Better Hit the Nail on the Head than Beat around the Bush: Removing Protected Attributes with a Single Projection

Pantea Haghighatkhah, Antske Fokkens, Pia Sommerauer, Bettina Speckmann, Kevin Verbeek

Introduction

Word embedding spaces can contain rich information that is valuable for a wide range of NLP tasks. High quality word embeddings should capture the semantic attributes associated with a word’s meaning. Though we can establish that points representing words with similar semantic attributes tend to be close to each other, it is challenging to reason about which attributes are captured to what extent and, in particular, which dimensions capture these attributes (Sommerauer and Fokkens, 2018). But even though we do not (yet) know exactly which attributes are represented in which way, we nevertheless may want to ensure that one particular attribute is no longer present in the embedding, for example, because this attribute is a protected attribute such as gender which can lead to harmful bias, or because we want to test how the attribute was affecting a system in a probing setup.

There are various approaches to remove a particular attribute from an embedding. Clearly it is important to remove as much of the target attribute as possible while preserving other information present. Arguably linear projections are one of the least intrusive methods to transform an embedding space. Iterative nullspace projection (INLP; Ravfogel et al., 2020) is a recent popular method which follows this paradigm and uses multiple linear projections. INLP achieves linear guarding of the protected attribute, that is, a linear classifier can no longer separate instances that have the attribute from instances that do not.

However, it remains unclear to what extent INLP affects the remainder of the embedding space. In general it requires 10-15 iterations (projections) to achieve linear guarding; it is hence likely that other attributes are negatively affected as well. This is particularly problematic in settings where INLP is used to determine what impact a specific attribute has on overall systems.

In this paper, we introduce two methods that find a single targeted projection which achieves linear guarding at once. Specifically, our Mean Projection [MP] method uses the mean of the points in each class to determine the projection direction. Using the mean makes MP more robust to unbalanced class distributions than INLP. The mean can, however, be overly sensitive to outliers. We prove that our second method, the Tukey Median Projection [TMP], finds nearly worst-case optimal projections for any input. Unfortunately, computing the Tukey median (Edelsbrunner, 1987) is computationally expensive in high dimensions. In our experiments we hence compared only MP to INLP. Specifically, we carried out the gender debiasing experiments of Ravfogel et al. (2020) with both methods. These experiments show that:

MP only needs one projection for linearly guarding gender where INLP needs 10-15.

MP has less impact on other aspects of the embedding space.

INLP projections improve simlex scores and reduce WEAT scores more than MP. A priori it is unclear why their many projections should have this effect. We investigated and show that the same improvements appear after applying random projections (either after MP or after the first INLP projections). Applying one MP projection to linearly guard an attribute is hence methodologically cleaner than applying multiple (INLP) projections.

Related Work

Multiple methods for removing bias from embeddings have been suggested. Bias can be addressed at the level of the training data (Zhao et al., 2018a), the training process (Zhang et al., 2018; Madras et al., 2018), and the resulting model itself (Ravfogel et al., 2020; Bolukbasi et al., 2016; Vargas and Cotterell, 2020). Debiasing an existing model has the clear advantage that it can be done with relatively little data and that the model does not need to be retrained. In this paper, we focus on removing bias through transforming the embedding space.

Transformations of embedding spaces can be performed by means of linear projections (Bolukbasi et al., 2016), multiple linear projections (Ravfogel et al., 2020), or via non-linear kernels (Vargas and Cotterell, 2020). Ravfogel et al. (2020) introduce a method called Iterative Nullspace Projection (INLP) where they attempt to remove the bias by projecting points iteratively. Our methods achieve linear guarding after a single linear projection. As such we directly improve upon the INLP method, which we describe in detail in Section 3. In the remainder of this section, we focus on more recent post-training approaches and applications of INLP.

Similar to Ravfogel et al. (2020), Ravfogel et al. (2022) aim to find a linear subspace of the data such that removing that subspace using projection removes bias in the data optimally. Doing so necessarily requires a metric to measure how well bias has been removed by a specific projection. Ravfogel et al. (2022) use the classification loss of various classifiers as metrics. That is, a higher classification loss after projection corresponds to better bias removal. This directly translates into a minimax optimization problem, which the authors refer to as a minimax game. The authors present different strategies to solve this optimization problem for their chosen classifiers. A similar strategy was previously described by Haghighatkhah et al. (2021, 2022). Here the corresponding minimax optimization problem is referred to as “maximizing inseparability”. Haghighatkhah et al. (2022) show that this problem can be solved efficiently under certain convexity assumptions on the metric.

Zhang et al. (2022) independently developed a projection method which is very similar to our MP. Their method also uses class means to find the projection direction, but proceeds via a projection to the origin, and hence needs one additional projection. The goal of the paper, however, is quite different, namely the study of human brain reactions to syntactic and semantic features of words. As such, it does not include a systematic comparison to INLP or a comprehensive analysis of the impact on the embedding space.

Shao et al. (2022), in recent and unpublished work, compute linear projection vectors by minimizing the covariance between biased word vectors and the protected attribute. They generally need two projections to achieve performance comparable to or better than INLP. The paper does not report on exactly the same experiments as we do, but for all experiments reported in both papers, our method MP performs similarly or better.

Dev et al. (2021a) use a different kind of linear transformation on embedding spaces, namely rotation. Their goal is to disentangle two particular attributes, such as gender and occupation. To do so, they rotate the embedding space in such a way that the subspaces corresponding to the two attributes become orthogonal. On the positive side, their approach does not remove a dimension and hence arguably retains more information in the embedding. However, it will not remove all bias with respect to gender, but only gender bias with respect to a single specified attribute, e.g., occupation. More generally, it will remove the interaction between the two specified attributes and not actually completely remove an attribute from the embedding.

INLP has recently gained importance for probing studies. Probing has been criticized because (1) results are difficult to interpret, and (2) it is usually not possible to test whether the target attribute is also relevant for downstream tasks. Elazar et al. (2021) propose amnesic probing, which employs INLP to remove the target attribute and then tests the change in performance of the adapted embeddings on downstream tasks. INLP has since been used for this purpose in several other studies (e.g., Celikkanat et al., 2020; Nikoulina et al., 2021; Babazhanova et al., 2021; Dankers et al., 2022; Gonen et al., 2022; Lovering and Pavlick, 2022). When using a projection-based removal method such as INLP or our MP in a probing setup, it is particularly important to ensure that all other information is preserved. If this is not the case, the probing study may lead to misleading conclusions. We hence recommend that future studies of this kind consider using MP instead of INLP.

Projection Methods

In this section we introduce our two new projection methods MP and TMP. Furthermore, we theoretically analyze their relative performance when (linearly) debiasing word embeddings, also in comparison to the existing method INLP. Here we focus on binary classification for ease of explanation. Both methods can however also be used for multi-class classification (see Section 4.1) and then require $n-1$ projections for $n$ classes. We start with a formal description of the problem we study.

In the following we consider only transformations which consist of one or more projections. Let $p$ be a point in $P$ . We define its projection $p_{w}$ along a unit vector $w$ as $p_{w}=p-(p\cdot w)w$ . The projection along the vector $w$ maps points to the hyperplane $H_{w}$ which is orthogonal to $w$ ( $H_{w}\bot w$ ) and contains the origin.
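The projection defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not taken from the released code; the function name `project_along` and the toy points are our own assumptions, and the points of $P$ are assumed to be stored as the rows of an array.

```python
import numpy as np

def project_along(P, w):
    """Project the rows of P along w onto the hyperplane H_w through the origin."""
    P = np.asarray(P, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)        # the definition assumes a unit vector
    return P - np.outer(P @ w, w)    # p_w = p - (p . w) w, applied to every row

# Toy example (hypothetical data): project two 2D points along the y-axis.
P = np.array([[3.0, 4.0], [1.0, -2.0]])
P_w = project_along(P, [0.0, 2.0])
```

After the projection every point is orthogonal to $w$, so the point set has lost exactly one dimension.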

To evaluate how well INLP, MP, and TMP do in terms of linear guarding, we consider the number of misclassifications with respect to $a^{*}$ by the best possible linear classifier after a single projection; the higher the number of misclassifications, the better the method performs. If a single projection is sufficient to achieve linear guarding, then arguably the semantic encoding of other attributes in our word embeddings is preserved as well as possible.

From a theoretical perspective, TMP is the most effective method; it increases the number of misclassifications the most. However, it is costly to compute exactly for data in 300+ dimensions. MP is not as effective in theory as TMP, but it is more effective than INLP. Furthermore, MP is very easy to compute and appears to be very effective in practice. Hence, we evaluate the efficacy of MP extensively in Section 4 and recommend using MP for linear guarding in practice.

Iterative Nullspace Projection (INLP; Ravfogel et al., 2020) debiases by iteratively projecting along a vector computed by a linear classifier. Here we assume that this linear classifier is a linear Support Vector Machine (SVM), as this is the main classifier used in their paper. In that setting, at each iteration, INLP trains an SVM to classify $a^{*}$ in the point set $P$. Since $a^{*}$ is binary, the result of the SVM is a vector $w$ (along with a single real value, which we may ignore). The input points are then projected along $w$ and the resulting point set is used as the input for the next iteration.
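The iterative loop described above can be sketched as follows. To keep the example free of external dependencies, the SVM is swapped for a plain least-squares linear classifier; this is an assumption made only for illustration and is not the classifier used by Ravfogel et al. (2020). The names `fit_direction` and `iterative_projection` and the synthetic data are likewise hypothetical.

```python
import numpy as np

def fit_direction(X, y):
    """Stand-in for the SVM: least-squares fit of labels in {-1, +1}.

    Returns the unit normal of the fitted hyperplane, or None if it vanishes.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # bias column (offset is ignored later)
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=1e-10)   # minimum-norm least-squares solution
    w = coef[:-1]
    norm = np.linalg.norm(w)
    return w / norm if norm > 1e-12 else None

def iterative_projection(X, y, n_iters):
    """Repeat: fit a linear classifier, then project the points along its normal."""
    X = np.array(X, dtype=float)
    directions = []
    for _ in range(n_iters):
        w = fit_direction(X, y)
        if w is None:
            break
        X = X - np.outer(X @ w, w)                   # projected points feed the next iteration
        directions.append(w)
    return X, directions

# Synthetic two-class data (hypothetical): class label shifts every coordinate.
rng = np.random.default_rng(0)
y = np.where(rng.random(200) < 0.5, -1.0, 1.0)
X = rng.normal(size=(200, 5)) + 0.8 * y[:, None]
X_out, ws = iterative_projection(X, y, n_iters=3)
```

Each iteration removes one further dimension from the point set, which is why a method that needs many iterations is likely to affect more of the embedding space than a single projection.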

If $P^{-}$ can be separated from $P^{+}$ with a hyperplane, then the vector $w$ is the normal of a separating hyperplane and indicates the direction along which there is the largest gap between $P^{-}$ and $P^{+}$ : the SVM margin. If $P^{-}$ and $P^{+}$ cannot be separated by a hyperplane, then the classifier allows misclassifications for a certain penalty, but otherwise still maximizes the margin. In this setting, SVM uses a parameter that controls the trade-off between misclassifications and the margin.

The vector $w$ used in the projection essentially maximizes the gap between $P^{-}$ and $P^{+}$, and hence it is determined only by the points on the “outside” of $P^{-}$ and $P^{+}$. Specifically, if $P^{-}$ and $P^{+}$ can be separated, then the vector $w$ can be determined solely by the points on the convex hulls of $P^{-}$ and $P^{+}$ (the convex hull of a set of points is the smallest convex set that contains all of the points), and is not influenced by the points that are interior to the convex hulls. If $P^{-}$ and $P^{+}$ are not separable, then $w$ will be determined by the points on the convex hulls of the correctly classified points. Thus, if the number of misclassifications is small, then many of the “more interior” points of $P^{-}$ and $P^{+}$ still do not contribute to determining the projection vector $w$.

Haghighatkhah et al. (2022) show that the projection along $w$ does eliminate the existence of a perfect linear classifier after projection (that is, there must be at least one misclassification), but this is far from sufficient for linear guarding. We claim that we must also consider the points interior to the convex hulls of $P^{-}$ and $P^{+}$, if we wish to obtain an effective projection for linear guarding.

We illustrate this claim with Figure 2, where $P^{-}$ and $P^{+}$ are colored red and blue, respectively. The black line indicates the hyperplane learned by a linear SVM and $w$ is the corresponding vector. The points at the top are distributed roughly evenly in the convex hull of their class (the colored polygon). Hence the projection along $w$ is very effective for linear guarding. The points at the bottom are distributed unevenly in the same convex hulls. The same projection along $w$ is now not very effective: red and blue points can still be separated by a linear classifier with relatively high accuracy.

We observe that the distribution of all points must be taken into account to compute a projection that is effective for linear guarding. Our two methods (MP and TMP) explore different ways of incorporating the distribution of the interior points when constructing the projection vector $w$ .

Mean Projection (MP)

Our general goal is to make the distribution of the points in $P^{-}$ and $P^{+}$ as similar as possible after the projection; that makes it difficult to distinguish the two sets using a linear classifier. One of the most characteristic values of a distribution is its mean. Therefore, in the Mean Projection method we ensure that the means of $P^{-}$ and $P^{+}$ are identical after projection. Specifically, let $\mu^{-}$ be the mean of $P^{-}$ and let $\mu^{+}$ be the mean of $P^{+}$. We choose the projection vector simply as $w=\mu^{+}-\mu^{-}$ (normalized to unit length when applying the projection defined above). Note that $\mu^{+}$ and $\mu^{-}$ (and hence $w$) can be computed efficiently by summing up the coordinates of the points in each point set and dividing by the number of points in that set.
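A minimal NumPy sketch of Mean Projection follows. It is an illustration under our own assumptions (function name `mean_projection`, synthetic Gaussian data, explicit normalization of $w$ so that the projection formula for unit vectors applies) and is not copied from the released implementation. Because the projection is linear, the projected mean equals the projection of the mean, and the difference $\mu^{+}-\mu^{-}$ is mapped to zero, so both class means coincide afterwards.

```python
import numpy as np

def mean_projection(P_minus, P_plus):
    """Project both classes along the normalized difference of their means."""
    mu_minus = P_minus.mean(axis=0)
    mu_plus = P_plus.mean(axis=0)
    w = mu_plus - mu_minus
    norm = np.linalg.norm(w)
    if norm == 0.0:
        raise ValueError("class means already coincide; nothing to project")
    w = w / norm                                   # unit vector for p - (p . w) w

    def project(P):
        return P - np.outer(P @ w, w)

    return project(P_minus), project(P_plus), w

# Synthetic, unbalanced classes (hypothetical data).
rng = np.random.default_rng(0)
P_minus = rng.normal(loc=0.0, size=(200, 5))
P_plus = rng.normal(loc=1.5, size=(300, 5))
Q_minus, Q_plus, w = mean_projection(P_minus, P_plus)
```

Note that the class sizes never enter the computation of $w$ except through each class's own mean, so an unbalanced split such as 200 versus 300 points does not shift the direction toward the larger class.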

MP tends to be very effective, since the majority of the points are typically concentrated around their mean. Figure 2 compares INLP (top) and MP (bottom) in a scenario where MP clearly outperforms INLP with respect to linear guarding. However, we can also directly see a weakness of MP: the mean of a point set can be pulled into some direction by (few) outliers. Figure 4 (left) shows another example of this phenomenon. Hence, there exist theoretical instances where MP performs quite poorly with respect to linear guarding. However, we demonstrate in Section 4 that MP is more effective than INLP in practice.

Tukey Median Projection (TMP)

As illustrated above, the mean can be influenced by outliers. Ideally, we would hence like to identify a “center point” for each point set that has roughly the same number of points in every “direction”. In 1D this center point is known as the median. We therefore consider a higher-dimensional version of the median as our center point.

Now let $\tau^{-}$ be a Tukey median of $P^{-}$ and let $\tau^{+}$ be a Tukey median of $P^{+}$ . Then the projection vector of the Tukey Median Projection method is simply $w=\tau^{+}-\tau^{-}$ .

Figure 4 contrasts MP and TMP. The mean of the blue point set is clearly not in the center of the majority of the points due to the outliers. As a result, the projection of MP is not very effective for linear guarding. The Tukey median of the blue point set, however, remains centered near the majority of the points and hence TMP is effective.

Unfortunately, TMP has one major drawback: computing the Tukey median of a set of points is computationally expensive in high dimensions. Although it is possible to compute an approximate Tukey median efficiently, the Tukey depth of this approximation may be significantly worse than that of the Tukey median (namely, $O(n/d^{2})$). (We are aiming to maximize the number of misclassifications after projection, so $O(n/d)$ is better than $O(n/d^{2})$.) Hence, we generally recommend using MP in practice.
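To make the idea concrete, the sketch below uses a simple heuristic that is our own assumption and not the approximation algorithm referred to above. It relies on the standard definition of Tukey (halfspace) depth: the depth of a point is the minimum number of data points in any closed halfspace containing it, and a Tukey median is a point of maximum depth. The heuristic only tests a fixed number of random directions and only considers the data points themselves as candidates, so it overestimates depth and carries none of the guarantees discussed in this section. All function names are hypothetical.

```python
import numpy as np

def approx_tukey_median(P, n_directions=500, seed=0):
    """Heuristic: the data point with the largest depth over random directions.

    Assumes no ties among the projected values (true for generic real data).
    """
    rng = np.random.default_rng(seed)
    n, d = P.shape
    U = rng.normal(size=(n_directions, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)       # random unit directions
    S = P @ U.T                                         # projections, shape (n, n_directions)
    ranks = S.argsort(axis=0).argsort(axis=0)           # 0-based rank per direction
    # Points on either side of each candidate (candidate included), worst direction wins.
    depth = np.minimum(ranks + 1, n - ranks).min(axis=1)
    best = int(depth.argmax())
    return P[best], int(depth[best])

def tmp_direction(P_minus, P_plus, **kwargs):
    """Unit vector between the (approximate) Tukey medians of the two classes."""
    tau_minus, _ = approx_tukey_median(P_minus, **kwargs)
    tau_plus, _ = approx_tukey_median(P_plus, **kwargs)
    w = tau_plus - tau_minus
    return w / np.linalg.norm(w)

# Hypothetical 2D example: a bulk of 200 points plus 20 far-away outliers.
rng = np.random.default_rng(1)
bulk = rng.normal(size=(200, 2))
outliers = rng.normal(loc=50.0, size=(20, 2))
P_plus = np.vstack([bulk, outliers])
tau, depth = approx_tukey_median(P_plus)
mu = P_plus.mean(axis=0)
P_minus = rng.normal(loc=-3.0, size=(200, 2))
w = tmp_direction(P_minus, P_plus)
```

In this toy setting the outliers pull the mean away from the bulk, whereas the deepest data point stays inside it, which is the behavior that motivates TMP.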

Experiments

In this section, we present our main experimental results. (The code is available at tue-alga/debias-mean-projection.) We first describe the outcome of the original INLP experiments and how our results compare (Section 4.1). We then outline the additional experiments we carried out to gain more insight into surprising results of the original INLP experiments as well as those of our own method (Section 4.2).

We compare the performance of INLP and MP on the gender bias experiments reported in Sections 6.1 and 6.3 in Ravfogel et al. (2020). We carried out all experiments with the original (INLP) and our new (MP) method. We could successfully reproduce the results of Ravfogel et al. (2020). Since the results were nearly identical, we report the INLP results from the original paper. For reasons of space, this section only contains the intuition and main results. We provide a self-contained description in Appendix B, where all experimental details can be found. We use the same settings and metrics as Ravfogel et al. (2020), unless specified otherwise. Table 1 summarises the results in this section.

We use the GloVe word embeddings (Zhao et al., 2018b; glove.42B.300d) and limit the dataset to the 150,000 most common words. We create the same classes of male, female and neutral embeddings as Ravfogel et al. (2020), following their approach. (Gender is not binary. Our experiments are restricted to removing male and female gender bias because we follow the settings from Ravfogel et al. (2020). Our limitations and ethics sections provide more elaborate comments on this matter.) A male vector $\vec{M}$ is defined as $\vec{he}-\vec{she}$ and a female vector $\vec{F}$ as $\vec{she}-\vec{he}$. Our male dataset consists of the 7500 data points closest to $\vec{M}$ and our female dataset of the 7500 data points closest to $\vec{F}$. We use the same random selection of 7500 neutral words with a cosine similarity of less than 0.3 to $\vec{M}$ as Ravfogel et al. (2020).

We include experiments on two classes (feminine and masculine) and on three classes (feminine, masculine and neutral). The INLP method uses the datasets to train SVM classifiers as described above. For the MP method, let $\vec{v_{F}}$, $\vec{v_{M}}$ and $\vec{v_{N}}$ be the means of the points of a given set $D$ that are labeled feminine, masculine and neutral, respectively. Then in experiments on two classes the mean projection method uses the $\vec{v_{F}}-\vec{v_{M}}$ vector for projection. For experiments with all three classes, we use both $\vec{v_{N}}-\vec{v_{F}}$ and $\vec{v_{N}}-\vec{v_{M}}$ vectors for projection.
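The three-class case can be sketched as below. The text does not spell out how the two projections are combined, so this sketch makes an assumption: it removes the whole subspace spanned by the two mean-difference vectors at once (via an orthonormal basis from a QR decomposition), which makes the result independent of the order of the projections. The function name `multiclass_mean_projection`, the label strings and the synthetic data are hypothetical.

```python
import numpy as np

def multiclass_mean_projection(X, labels, reference):
    """Remove the span of (reference mean - class mean) for every other class."""
    labels = np.asarray(labels)
    ref_mean = X[labels == reference].mean(axis=0)
    others = [c for c in np.unique(labels) if c != reference]
    # One difference vector per non-reference class: n - 1 directions for n classes.
    W = np.stack([ref_mean - X[labels == c].mean(axis=0) for c in others], axis=1)
    Q, _ = np.linalg.qr(W)                 # orthonormal basis containing span(W)
    return X - (X @ Q) @ Q.T               # project onto the orthogonal complement

# Synthetic data (hypothetical): three classes with different means.
rng = np.random.default_rng(0)
labels = np.array(["F"] * 100 + ["M"] * 120 + ["N"] * 80)
shift = {"F": -1.0, "M": 1.0, "N": 0.0}
X = rng.normal(size=(300, 6))
X[:, 0] += np.array([shift[l] for l in labels])
X[labels == "N", 1] += 2.0
X_proj = multiclass_mean_projection(X, labels, reference="N")
means = {c: X_proj[labels == c].mean(axis=0) for c in ("F", "M", "N")}
```

After the projection the three class means coincide, and the point set has lost two dimensions, in line with the $n-1$ projections needed for $n$ classes.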

Linear Guarding.

We first compare the process of obtaining linear guarding using INLP and MP. Figure 4 illustrates the decrease in accuracy of a linear classifier trained to identify gender. The Mean Projection method brings down the accuracy to 34.18%. It takes INLP 12 iterations to reach a comparable result of 34.9% or at least 6 iterations to reach 39.9% accuracy. Figure 8 in Appendix B shows a similar pattern for binary classification, where MP reaches 50.6% and INLP needs 14 iterations to drop to 50.51%.

Classification.

Both methods aim to remove linearly encoded information. Ravfogel et al. (2020) test to what extent the non-linear encoding of gender remains intact by running a 1-hidden-layered MLP with ReLU activation and report an accuracy of 85.0% after 35 iterations with INLP. After a single MP projection, accuracy of the MLP drops to 81.6%.

Effect on the Embedding Space.

Ravfogel et al. (2020) show changes in the three nearest neighbors of 40 randomly selected words and 22 gendered given names (see Tables 6 and 7 in Appendix D). MP keeps these environments more stable: for the random words, it changes 8 out of 120 neighbors, compared to 43 for INLP. For the given names, MP changes 24 and INLP 56 out of 66 neighbors. We also observe that 14 of the 24 changes made by MP are other given names (5 of which are of the opposite gender). With INLP, 46 of the new tokens are not regularly spelled names.
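Counting changed neighbors can be sketched as follows. The details are our assumptions rather than the exact procedure of the experiments: neighbors are ranked by cosine similarity, the query word itself is excluded, and a change is counted whenever a neighbor after the transformation was not among the neighbors before it. The function names and the tiny embedding matrices are hypothetical.

```python
import numpy as np

def nearest_neighbors(E, query_idx, k=3):
    """Indices of the k nearest rows of E (cosine similarity) for each query row."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # assumes no all-zero rows
    sims = En[query_idx] @ En.T                         # shape (n_queries, n_words)
    sims[np.arange(len(query_idx)), query_idx] = -np.inf  # exclude the query itself
    return np.argsort(-sims, axis=1)[:, :k]

def count_neighbor_changes(E_before, E_after, query_idx, k=3):
    """Number of neighbors after the transformation that were not neighbors before."""
    before = nearest_neighbors(E_before, query_idx, k)
    after = nearest_neighbors(E_after, query_idx, k)
    return sum(len(set(a.tolist()) - set(b.tolist())) for b, a in zip(before, after))

# Tiny hypothetical example: word 0 is closest to word 1 before and to word 2 after.
E_before = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
E_after = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
queries = np.array([0])
n_changed = count_neighbor_changes(E_before, E_after, queries, k=1)
```

With 40 query words and $k=3$, the maximum possible count is 120, which matches the denominator used for the random words above.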

Ravfogel et al. (2020) also investigate how INLP impacts semantic similarity scores on multiple datasets and report increased scores, e.g. from 0.373 to 0.489 on simlex-999. Applying MP has less impact, with a result of 0.385. This confirms that MP has less impact on the overall space, but also raises the question whether the additional changes caused by INLP’s extra projections are a positive effect. We investigate this in Section 4.2.

WEAT.

Caliskan et al. (2017) evaluate bias in embeddings by comparing the distance of known stereotypical terms to terms that are either explicitly male or explicitly female. A higher WEAT score means that stereotypical terms are indeed closer to their stereotyped gender. Like Ravfogel et al. (2020) we use gendered names for representing attributes and take the targets of WEAT 6, 7, and 8. MP achieves WEAT scores of 0.173, 0.191 and 0.110 respectively. The first INLP iteration yields scores of 0.278, 0.347, 0.203 and the full 35 iterations reach 0.008, 0.084 and -0.310 respectively. We investigate this in Section 4.2.
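For reference, the WEAT effect size can be sketched as below. The sketch follows the commonly used formulation of Caliskan et al. (2017): each target word is scored by its mean cosine similarity to one attribute set minus its mean cosine similarity to the other, and the difference of the mean scores of the two target sets is divided by the standard deviation over all target words. The use of the sample standard deviation (`ddof=1`), the function names and the toy vectors are assumptions of this sketch.

```python
import numpy as np

def cosine_matrix(U, V):
    """Pairwise cosine similarities between the rows of U and the rows of V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T

def weat_effect_size(X, Y, A, B):
    """Effect size for target sets X, Y and attribute sets A, B (rows are vectors)."""
    def association(W):
        # s(w, A, B): mean similarity to A minus mean similarity to B
        return cosine_matrix(W, A).mean(axis=1) - cosine_matrix(W, B).mean(axis=1)

    s_x, s_y = association(X), association(Y)
    pooled_std = np.concatenate([s_x, s_y]).std(ddof=1)   # assumption: sample std
    return (s_x.mean() - s_y.mean()) / pooled_std

# Toy 2D vectors (hypothetical): X leans toward A, Y leans toward B.
A = np.array([[1.0, 0.0]])
B = np.array([[0.0, 1.0]])
X = np.array([[1.0, 0.1], [1.0, 0.2]])
Y = np.array([[0.1, 1.0], [0.2, 1.0]])
score = weat_effect_size(X, Y, A, B)
```

A score near zero indicates that neither target set is systematically closer to one attribute set, while the sign indicates the direction of the association.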

Bias-by-neighbor.

Gonen and Goldberg (2019) propose an evaluation that checks whether the 100 nearest neighbors of a term after debiasing carried the same gender bias before debiasing. Ravfogel et al. (2020) show that INLP reduces the gender bias of the set of stereotypical professions of Bolukbasi et al. (2016) from 85.2% to 73.4%. MP reaches a comparable reduction (74.5%).

TPR-GAP: “gender in the wild”.

De-Arteaga et al. (2019) propose an approach that measures to what extent debiasing methods can remove gender bias from a system that predicts occupations based on biographies. A fair system should have equal performance for members of different classes and not amplify a bias that is present in the label distribution Hardt et al. (2016). In the case of gender, the system should perform equally well predicting occupations such as surgeon, caretaker, secretary or marine, regardless of whether the biographical texts are about men or women. De-Arteaga et al. (2019) use the $GAP^{TPR}$ score to measure this. The GAPfem, e.g., quantifies to what extent a given occupation is more probable to be predicted correctly for biographies of women compared to biographies of men. A good debiasing approach should obtain a $GAP^{TPR}$ score that is close to zero while maintaining the classification accuracy that was obtained before debiasing.
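The per-occupation gap in true positive rates can be sketched as follows. This is a minimal illustration of the quantity described above, under our own assumptions: gender is given as a binary label per biography, the gap is the true positive rate for one group minus that for the other, and the function name `tpr_gap` and the toy labels are hypothetical.

```python
import numpy as np

def tpr_gap(y_true, y_pred, gender, occupation, group="f", other="m"):
    """TPR for `group` minus TPR for `other`, restricted to one occupation."""
    y_true, y_pred, gender = map(np.asarray, (y_true, y_pred, gender))

    def tpr(g):
        mask = (y_true == occupation) & (gender == g)
        if not mask.any():
            raise ValueError(f"no biographies with occupation={occupation!r}, gender={g!r}")
        return float((y_pred[mask] == occupation).mean())

    return tpr(group) - tpr(other)

# Toy predictions (hypothetical data), four biographies per occupation.
y_true = ["nurse"] * 4 + ["surgeon"] * 4
gender = ["f", "f", "m", "m", "f", "f", "m", "m"]
y_pred = ["nurse", "nurse", "nurse", "surgeon",
          "surgeon", "nurse", "surgeon", "surgeon"]
gap_nurse = tpr_gap(y_true, y_pred, gender, "nurse")
gap_surgeon = tpr_gap(y_true, y_pred, gender, "surgeon")
```

A positive value means the occupation is predicted correctly more often for biographies of women, a negative value means the opposite, and a value of zero corresponds to equal performance for both groups.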

Instead of directly debiasing embeddings, we debias representations of entire biographies within the occupation classification system. Like Ravfogel et al. (2020), we use logistic classifiers that take one of three representations of the biographies as input: (1) one-hot BOW, (2) averaged FastText embeddings (Joulin et al., 2017) and (3) the last hidden state of BERT (Devlin et al., 2019). The projections are applied using scikit-learn (Pedregosa et al., 2011).

In both approaches, biography representations are debiased on the basis of their gender nullspace for each of the 28 occupations. For the MP setup, we create a mean projection for every occupation $o$ by identifying the projection vector $P_{o}$ based on the mean of all female and the mean of all male biographies with occupation $o$. The INLP setup uses logistic regression in 100 iterations for the BOW representations, 150 linear SVM iterations for the FastText representations and 300 linear SVM iterations for the BERT representations.

Overall, results of both methods are comparable. INLP maintains a higher accuracy on identifying professions for one-hot BOW (77.1% vs 76.7%), where MP has higher accuracy for FastText (75.6% vs 73.6%) and BERT (75.2% vs. 74.7%). For BOW, the GAP score drops by 99.4% when using MP compared to 35% for INLP. Differences for other representations are smaller: MP reduces GAP by 49.5% on FastText compared to INLP’s 43.4%. For the BERT model, INLP outperforms MP reducing the GAP by 64.3% compared to MP’s 52.2%. See Table 5 in Appendix B for the full results.

Summary.

In general, we observe that a single MP projection achieves linear guarding where multiple INLP projections are needed and that the rest of the space remains more stable. The improvement of similarity results, as well as INLP’s results on WEAT, raise further questions. Ravfogel et al. (2020) hypothesize that the improved similarity results may be due to a significant gender component in embeddings that does not correlate with human similarity judgment (Ravfogel et al., 2020, p. 7253). Together with the increased drop in WEAT, this may point to 35 INLP iterations removing more gender information than the single MP projection. This is countered by the result of the non-linear classifier where MP induced a larger decrease than INLP. We investigate the results on similarity and WEAT further in the next subsection.

Diving Deeper

We first dive into the impact of INLP on simlex-999 (Hill et al., 2015). Figure 5 illustrates the changes in similarity scores (Correlation) after every iteration of INLP compared to MP (one projection). We observe that the scores mainly increase between the 8th and 14th iteration, remaining relatively stable before and afterwards. Going back to Figure 4, we observe that this increase thus starts when the INLP projections have almost dropped to majority class accuracy, i.e. linear guarding. The main increase in similarity scores thus occurs after gender encoding has been removed, which could imply that the increased result is not related to gender. We investigate this by comparing the impact of INLP iterations to iterations that project the dataset along random vectors. (The random vectors are weighted by the eigenvalues of all principal components, i.e., we choose a random direction from the span of the data.)
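One possible reading of the random projection directions is sketched below; the exact sampling scheme is not specified in this section, so every detail here is an assumption. The sketch computes the principal components of the (centered) data, draws one Gaussian coefficient per component, scales it by the corresponding eigenvalue, and normalizes the resulting combination. The function name `random_direction_in_span` and the synthetic data are hypothetical.

```python
import numpy as np

def random_direction_in_span(X, rng):
    """Random unit vector in the span of the data, weighted by PCA eigenvalues."""
    Xc = X - X.mean(axis=0)                          # assumption: center before PCA
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s ** 2 / (X.shape[0] - 1)              # eigenvalues of the covariance matrix
    coeffs = rng.normal(size=eigvals.shape) * eigvals
    w = coeffs @ Vt                                  # combination of principal directions
    return w / np.linalg.norm(w)

# Hypothetical data whose last coordinate is constant, so it lies in a 3D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, -1] = 0.0
w = random_direction_in_span(X, rng)
X_proj = X - np.outer(X @ w, w)                      # one random projection step
```

Because directions with zero variance receive zero weight, the sampled vector always lies in the span of the data, so each random projection removes an actual dimension of the point set, as an INLP iteration does.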

We compare three scenarios: directly applying 35 iterations with random projections to the original model [Random], adding 34 random projection iterations after one MP projection [MP+R] and adding 27 random projection iterations to 8 INLP projections (the point where the score starts to increase) [INLP-8+R]. We carry out 500 runs of this experiment and report the 95% confidence interval of similarity correlation scores for each setting in Table 2. Running 35 iterations of INLP increased the semantic similarity score from 0.373 to 0.489. This improvement falls within the confidence interval of MP+R and INLP-8+R. Results of random projections only [Random] do not reach this score, but still clearly improve compared to the original 0.373 with a mean score of 0.447. We thus conclude that the improvement in similarity scores is due to reducing dimensions in general rather than removing (partial) representations of gender.

WEAT.

We investigate the WEAT scores in a similar manner. We first inspect the impact on the WEAT score per INLP iteration and then compare this to the impact of projections along weighted random vectors. Figure 6 illustrates the impact of applying INLP for 35 iterations (blue line) and of 34 random projection iterations applied after the MP projection (500 runs). Recall that an ideal WEAT score would be close to zero. We observe that MP+R on average ends up with an equal score for WEAT 6, a slightly higher positive (thus worse) score for WEAT 7 and a score close to zero for WEAT 8, where INLP results in a negative score.

We also compare the effect of applying random projections after 8 INLP iterations. INLP-8+R shows similar results as MP+R. Applying random projections from the start [Random] leads to slightly increasing WEAT scores in most cases (see Figures 10 and 10 in Appendix C). This implies that an initial step that removes gender is necessary for random projections to reduce WEAT scores.

We now turn back to what this means for iterative nullspace projections. When inspecting the blue lines in Figure 6, we observe that all WEAT scores initially drop to the same level as with MP after 2-4 iterations and then increase again. Between 20 and 25 iterations we observe a sharp drop, even flipping to a bias in the opposite direction for WEAT 8. These patterns do not reveal a consistent positive impact of applying additional nullspace projections. Ethayarajh et al. (2019) report that results obtained by applying WEAT are brittle and that statistically significant results in opposite directions can be obtained depending on the attribute words selected. They show this in a highly simplified setting (using only one attribute word at a time). Nevertheless, the WEAT sets are relatively small (8 attributes and targets per set) and some of the results we see could be the effect of the specific selection of terms in the set. Overall, this makes the results even more difficult to interpret. We can say, however, that a single MP and the first INLP projections seem beneficial for decreasing WEAT and there is no evidence that continuing iterations of INLP further removes gender bias.

Summary.

Our further investigations showed that (1) increases in simlex-999 scores occur very locally after gender has been removed and that (2) both for similarity and WEAT, random projections after MP or a limited number of INLP iterations lead to similar results as 35 INLP iterations. We therefore conclude that the side effects of continuing to apply INLP are not related to the overall goal of removing encoding of gender, but rather related to the overall effect of dimension reduction.

Conclusion

This paper started from the idea that one targeted projection results in similar linear guarding and more stability of the remaining space compared to iterative nullspace projections. We proposed two methods to find such projections: Mean Projection (MP) and Tukey Median Projection (TMP). We compared performance of Mean Projection, which is still susceptible to outliers but more efficient to compute than TMP, to the gender based experiments using 35 Iterative Nullspace Projections (INLP) reported in Ravfogel et al. (2020).

Our results show that MP obtains comparable linear guarding in one projection where INLP requires multiple iterations. They also confirm that the rest of the space remains more stable through fewer changes in nearest neighbors of randomly chosen words and similarity scores that are closer to the original scores compared to INLP. Two results led to further questions: similarity scores improved after applying INLP and INLP resulted in better WEAT scores than MP. We conducted additional experiments to test whether the extra debiasing iterations performed in INLP end up removing more subtle representations of gender bias that are missed by the single MP iteration.

The results show that it is unlikely that INLP can indeed remove more subtle representations of bias for two reasons: (1) we observed that most effects occur once linear guarding has already been achieved. (2) More importantly, similar performance is obtained by running random projections after either MP or the first 8 INLP iterations. We therefore conclude that these additional effects are not related to removing representations of the targeted attribute, but rather a side effect of reducing the dimensionality of the space. This leads us to the overall conclusion that a single targeted projection is indeed preferable over iterative nullspace projections when aiming to remove an attribute. This finding is particularly important for the recent line of research that uses INLP for interpretability: these studies test the effect of removing a target attribute and their conclusions should not be confounded by other changes to the space.

Our theoretical findings in Section 3 suggest that MP might produce poor results for skewed data. Hence we plan to analyze MP more rigorously on synthetic data, specifically, data where the points are not centered around the mean. In this way we hope to identify scenarios where MP produces poor projections; subsequently we can test if TMP produces better projections for the same scenarios.

Limitations

The methods we described and evaluated in this paper have the following limitations.

The results of all projection methods (INLP, MP and TMP) directly depend on the representativeness of the labelled data points that are used to identify the projections. All methods would suffer from non-representative points. In addition, outliers can hamper the effectiveness of MP.

Binary gender.

The binary perspective on gender used in our experiments does not reflect the reality of non-binary gender identities. These experiments focus on gender bias in the form of stereotypical associations with binary gender identities. The underrepresentation of non-binary gender references (e.g. singular they, them in English) is another form of bias that also affects NLP systems. Recent work has shown that current NLP technologies indeed have difficulties dealing with non-binary gender references (e.g. Dev et al., 2021b; Brandl et al., 2022). To our knowledge, the question of whether language models also contain stereotypes around non-binary gender has not been addressed yet. If such bias is indeed present, data sparseness is likely to pose a problem for the methods described in this paper.

Suitability for languages with grammatical gender is unclear.

For the application to gender bias, we note that all experiments are run on English. In particular for languages with grammatical gender, it is unclear whether these projections would be able to identify semantic gender bias correctly without removing grammatical characteristics of the language: words with the same semantic gender will almost always also have the same grammatical gender. Projections are likely to remove this highly distinctive feature, which is also a grammatical property. Its removal may impact downstream applications.

Linear guarding only.

The methods described here aim only at (and achieve only) linear guarding. This means that non-linear representations of the attributes remain present in the data (e.g. on binary classification, the perceptron with one hidden layer still achieves an accuracy of 85.0% after 35 INLP iterations and 81.6% after applying MP).

Ethics

We applied our method to gender debiasing. The approach aims to remove bias, which can be seen as helping to address an ethical issue rather than causing one. Nevertheless, we feel it is important to point out the following. When using the methods discussed in this paper to remove a protected attribute, with the idea to avoid harmful bias, the limitations outlined above must be taken into account. This concerns both potential limitations of the effectiveness of the approach and the treatment of gender as a binary variable. Concerning effectiveness, this means considering the fact that (1) non-linear representations are not addressed by these methods and (2) the success of the approach is directly dependent on the quality of the data used to identify the projections. Concerning the treatment of non-binary gender, it should be kept in mind that (1) to our knowledge, not much is known yet about stereotypical representations of non-binary gender in language models and (2) it is known that current NLP methods experience difficulties in dealing with non-binary gender. In general, debiasing methods must always be carefully evaluated so that users of a system do not assume the problem is solved when this is not (completely) the case. Ideally, such evaluation should look beyond the scope of the target. When dealing with gender, this should include looking at what models are doing with non-binary gender references.

In practice, the considerations outlined above mean the following: We do not claim that any of the approaches presented in this paper can fully remove bias from embedding representations. When using debiased embeddings as part of a system, we strongly advise conducting a critical evaluation with respect to potentially remaining bias using data that are representative of the use-case. This is particularly important if the use-case encompasses the potential for discrimination (e.g. automatic CV analysis).

References

Appendix A Tukey Median Projection

Let $q^{-}$ be a point with Tukey depth $t(q^{-})$ with respect to $P^{-}$ and let $q^{+}$ be a point with Tukey depth $t(q^{+})$ with respect to $P^{+}$ . Then the projection of $P$ along vector $w=q^{+}-q^{-}$ ensures that any linear classifier must make at least $\min(t(q^{-}),t(q^{+}))$ misclassifications after projection.

Let $\tau$ be the Tukey medians of $P^{-}$ and $P^{+}$ after projection (by construction, these Tukey medians have been projected onto the same point). Now let $H$ be the hyperplane corresponding to some linear classifier after projection. Let $H_{\leftrightarrow}$ be the hyperplane obtained by shifting $H$ to contain the point $\tau$ . Let $H^{-}_{\leftrightarrow}$ be the halfspace bounded by $H_{\leftrightarrow}$ that contains the majority of $P^{-}$ after projection, and let $H^{+}_{\leftrightarrow}$ be the other halfspace bounded by $H_{\leftrightarrow}$ . By construction of $\tau$ , there must be at least $t(q^{-})$ points of $P^{-}$ in $H^{+}_{\leftrightarrow}$ and at least $t(q^{+})$ points of $P^{+}$ in $H^{-}_{\leftrightarrow}$ , which are misclassified by the linear classifier defined by $H_{\leftrightarrow}$ . By shifting the hyperplane back from $H_{\leftrightarrow}$ to $H$ , we can see that either the misclassified points of $P^{-}$ or the misclassified points of $P^{+}$ remain misclassified by the linear classifier defined by $H$ . Thus, any linear classifier must misclassify at least $\min(t(q^{-}),t(q^{+}))$ points after projection by TMP. ∎

We can show that TMP is asymptotically worst-case optimal in minimizing the number of misclassifications after projection. For that we need the following technical lemma.

On the other hand we have that $(v_{i}\cdot r)=\frac{-1}{d}$ for all $v_{i}\neq v_{j}$ , and hence the stated inequality holds for all $v_{i}\neq v_{j}$ . ∎

Let $w$ be any unit projection vector. We write $w$ as $w=\alpha e_{d+1}+\beta z$ , where $z$ is a unit vector in the subspace spanned by $e_{1},\ldots,e_{d}$ (such that we have $\alpha^{2}+\beta^{2}=1$ ). Let $q_{w}$ be the center of $Q$ after projection along $w$ . It is easy to see that $P$ and $Q$ are linearly separable after projection if $\|q_{w}\|\geq 1+\varepsilon$ (the center of $P$ remains at the origin after projection). We have that:

Since $z$ is independent from $e_{d+1}$ we get that $\|q_{w}\|\geq C(1-\alpha^{2})=C\beta^{2}$ , and hence $\|q_{w}\|\geq 1+\varepsilon$ if $\beta^{2}\geq\frac{1+\varepsilon}{C}$ . We may therefore assume that $\beta^{2}<\frac{1+\varepsilon}{C}$ .

We now show that $H$ has the desired properties. Let $u_{i}$ be one of the $d$ points of $P$ such that $(u_{i}\cdot r)\leq(x\cdot r)-\frac{1}{d}$ , and let $u^{\prime}_{i}$ be the point obtained by projecting $u_{i}$ along $w$ . We get that:

where $\beta^{\prime}_{i}\in[-\beta,\beta]$ . By construction of $x$ , we also get that $(x\cdot r)=(q_{w}\cdot R)$ . We then get the following:

Thus, if $\beta^{2}-\frac{1}{d}<-\varepsilon$ , then $(u^{\prime}_{i}\cdot R)<(q_{w}\cdot R)-\varepsilon$ . Since $\beta^{2}<\frac{1+\varepsilon}{C}$ , we can choose $C=4d$ and $\varepsilon=\frac{1}{2d}$ to ensure this property, as then $\frac{1+\varepsilon}{C}-\frac{1}{d}<\frac{1}{2d}-\frac{1}{d}=-\varepsilon$ .

Now let $v_{j}$ be one of the points of $Q$ , and let $v^{\prime}_{j}$ be the point obtained by projecting $v_{j}$ along $w$ . Since the projection along $w$ can only shrink distances, we get that $\|v^{\prime}_{j}-q_{w}\|\leq\|v_{j}-q\|<\varepsilon$ . This implies that $((v^{\prime}_{j}-q_{w})\cdot R)>-\varepsilon$ , which can be rewritten as $(v^{\prime}_{j}\cdot R)>(q_{w}\cdot R)-\varepsilon$ . As a result, $d$ points of $P$ are on one side of $H$ and $n$ points of $Q$ are on the other side of $H$ , as required.

If $P$ contains $m\neq d+1$ points, then we can simply follow the construction above and distribute the points arbitrarily close to the vertices of the regular simplex, such that there are at most $\left\lceil\frac{m}{d+1}\right\rceil$ points around each vertex. As a result, $n$ points of $Q$ will be on one side of $H$ and at least $m-\left\lceil\frac{m}{d+1}\right\rceil=\left\lfloor\frac{md}{d+1}\right\rfloor$ points of $P$ will be on the other side of $H$, which completes the proof. ∎

Choose $P$ and the binary attribute $a^{*}$ such that $P^{-}$ and $P^{+}$ match $P$ and $Q$ in the statement of Theorem 2 (note that the theorem has to be applied with dimension $d-1$ ). Then, for any projection of $P$ along a single vector, there exists a hyperplane $H$ such that $\left\lfloor\frac{m(d-1)}{d}\right\rfloor$ points of $P^{-}$ are on one side of $H$ and all points of $P^{+}$ are on the other side of $H$ . Thus, the linear classifier defined by $H$ misclassifies at most $\left\lceil\frac{m}{d}\right\rceil$ points, and hence the best possible linear classifier must perform at least as well. ∎

Appendix B Experiments

For convenience, we provide a detailed self-contained description of the experiments here. The first paragraph is an exact repetition (included here for convenience). All others are extended versions of the one included in the main text. This version can be read either instead of or next to the shorter version in the main text.

We use the GloVe word embeddings Zhao et al. (2018b) (glove.42B.300d) and limit the dataset to the 150,000 most common words. We create the same classes of male, female and neutral embeddings as Ravfogel et al. (2020), following their approach. A male vector $\vec{M}$ is defined as $\vec{he}-\vec{she}$ and a female vector $\vec{F}$ as $\vec{she}-\vec{he}$. Our male dataset consists of the 7500 data points closest to $\vec{M}$ and our female dataset of the 7500 data points closest to $\vec{F}$. For the neutral dataset we use the seed of Ravfogel et al. (2020) to obtain the same random selection of 7500 words with a cosine similarity of less than 0.3 to $\vec{M}$.
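The selection of gendered words described above can be illustrated with a minimal sketch. It assumes that "closest" means highest cosine similarity to the direction vector; the toy two-dimensional embeddings and the helper names `cosine` and `closest_words` are hypothetical and only serve the illustration (the experiments use 300-dimensional GloVe vectors and 7500 words per class).

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_words(embeddings, direction, k):
    """Return the k words whose vectors are most similar to `direction`."""
    ranked = sorted(embeddings,
                    key=lambda w: cosine(embeddings[w], direction),
                    reverse=True)
    return ranked[:k]

# Toy embeddings with made-up values (assumption, not real GloVe data).
toy = {
    "he": [1.0, 0.2], "she": [-1.0, 0.2],
    "king": [0.8, 0.5], "queen": [-0.7, 0.6], "table": [0.0, 1.0],
}

male_dir = [a - b for a, b in zip(toy["he"], toy["she"])]  # M = he - she
female_dir = [-x for x in male_dir]                        # F = she - he

male_words = closest_words(toy, male_dir, 2)
female_words = closest_words(toy, female_dir, 2)
```

In this toy space the male set contains he and king, and the female set contains she and queen.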

Linear Guarding.

The first set of experiments illustrates the process of gender information being removed from embeddings using each method. We apply the INLP and MP algorithms to the data labeled feminine, masculine and neutral described above and investigate to what extent a linear classifier can still identify the original gender of the word. The accuracy of this classifier should decrease as gender information is removed from the set. For INLP, we use an L2-regularized SVM with the same parameters and random seeds as Ravfogel et al. (2020). We can observe changes over the course of performing multiple projections for INLP. For MP, only a single projection is needed.

We test the classifier on the same train, test and development sets as Ravfogel et al. (2020). The dataset contains three classes (feminine, masculine and neutral). Like Ravfogel et al. (2020), we report the results for all three classes as well as for two classes (male and female). Since all classes are equally distributed, we report the classifier's accuracy. A set of completely debiased embeddings should yield a result that is equal to chance.

Figure 7 presents the results of 35 iterations of INLP and the single iteration of MP on three classes of the data. Before applying projections, a linear classifier achieves perfect classification. The mean projection method brings down the accuracy to 34.18%. It takes INLP 12 iterations to reach a comparable result of 34.9%, or at least 6 iterations to reach 39.9% accuracy. We repeat the experiment on two classes (masculine and feminine) and provide the results in Figure 8. Here, after applying the MP method, an accuracy of 50.6% is reached, while INLP needs 14 iterations to get down to 50.51% accuracy. These results show that with MP one projection is enough, whereas multiple projections are needed to achieve a similar result with INLP. The results of INLP stabilize at approximately the same result as MP's single projection.
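To make the single projection concrete, the following is a minimal sketch for the two-class case. It assumes that MP projects all points along the vector connecting the two class means, so that the means coincide afterwards; the toy points and the helper names `mean`, `project_along` and `mean_projection` are hypothetical.

```python
import math

def mean(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def project_along(points, w):
    """Remove from every point its component in the direction of w."""
    norm = math.sqrt(sum(x * x for x in w))
    unit = [x / norm for x in w]
    projected = []
    for p in points:
        coeff = sum(a * b for a, b in zip(p, unit))
        projected.append([a - coeff * b for a, b in zip(p, unit)])
    return projected

def mean_projection(pos, neg):
    """Project both classes along the difference of their means."""
    w = [a - b for a, b in zip(mean(pos), mean(neg))]
    return project_along(pos, w), project_along(neg, w)

# Toy data (assumption): two small classes in three dimensions.
pos = [[2.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
neg = [[-2.0, 1.0, 1.0], [-3.0, 0.0, 0.0]]

pos_p, neg_p = mean_projection(pos, neg)
gap = [a - b for a, b in zip(mean(pos_p), mean(neg_p))]
```

After the projection the difference between the class means (`gap`) vanishes up to floating-point error, which is the property a mean-based linear separator would rely on.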

Classification.

Both methods only aim to remove linearly encoded information. We compare to what extent they leave non-linear encoding of gender intact by running an MLP with one hidden layer and ReLU activation. This MLP manages to identify gender with 85.0% accuracy after 35 INLP iterations. After a single MP projection, the accuracy of the MLP drops to 81.6%. Overall, these results confirm that MP needs a single projection to achieve the same linear guarding as multiple INLP projections, and a comparable (slightly better) result on the remaining non-linear encoding.

Clustering.

Another way of assessing the debiasing methods is to test to what extent the embeddings can still form clusters representative of the male and female classes. If the bias has been removed successfully, the clusters should no longer reflect gender information. We use the V-measure Rosenberg and Hirschberg (2007) to assess to what extent the two classes are intertwined. This measure is the harmonic mean of two components: homogeneity, which assesses to what extent the points in a single cluster belong to the same class, and completeness, which assesses to what extent the points of a single class end up in the same cluster. A perfect V-measure (score 1.0) indicates perfect clustering with respect to the gold labels, whereas an equal distribution of the members of all classes over all clusters yields a score of 0. We apply $K$-means clustering with $K=2$ to the male and female gendered words in the test split and then calculate the V-measure.
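The V-measure can be sketched from its entropy-based definition: homogeneity is $1-H(C|K)/H(C)$ and completeness is $1-H(K|C)/H(K)$ for classes $C$ and clusters $K$. The snippet below is a plain-Python illustration under the assumption of equal weighting of both components; the cluster assignments are made up and do not come from $K$-means, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

gold = ["m", "m", "f", "f"]
separated = v_measure(gold, [0, 0, 1, 1])    # clusters mirror the classes
intertwined = v_measure(gold, [0, 1, 0, 1])  # each cluster is half m, half f
```

Clusters that mirror the gold classes give a score of 1, while clusters in which both classes are equally mixed give a score of (numerically) 0, which is the outcome a successful debiasing method aims for.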

We use the 2000 most biased female and male words, as Ravfogel et al. (2020) suggest. Using the MP method reduces the V-measure to 0.53% after a single projection. The V-measure reaches 0.67% after applying the INLP method with all 35 iterations on this set.

Exploring Neighbors.

We inspect changes in the direct environment of 40 words randomly selected by Ravfogel et al. (2020) as well as their selection of 22 gendered given names. If a debiasing method is targeted, it should not affect the neighbors of the randomly selected words and only lead to changes in the neighborhood of the words that should be affected by debiasing (in this case the 22 gendered given names, where new names of the opposite gender may appear).

When applying MP, only 8 out of 120 neighbors of random words changed, compared to 43 changes caused by INLP. For given names, MP yielded 24 changes out of 66 compared to 56 for INLP. Overall, it seems that INLP leads to more changes in the neighborhoods of words targeted by debiasing as well as of other words.
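The counting behind these numbers can be sketched as follows. The sketch assumes that three nearest neighbors are inspected per word (inferred from 120 neighbors for 40 words and 66 for 22 names) and that neighbors are ranked by cosine similarity; the toy vectors and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(embeddings, word, k):
    """The k words most similar to `word`, excluding the word itself."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine(embeddings[word], embeddings[w]), reverse=True)
    return others[:k]

def count_neighbor_changes(before, after, words, k=3):
    """Number of original top-k neighbors that are no longer in the top k."""
    changed = 0
    for w in words:
        old = set(nearest(before, w, k))
        new = set(nearest(after, w, k))
        changed += len(old - new)
    return changed

# Toy spaces (assumption): "b" and "c" swap places after debiasing.
before = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
after = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [-1.0, 0.0]}

changes = count_neighbor_changes(before, after, ["a"], k=1)
unchanged = count_neighbor_changes(before, before, ["a"], k=1)
```

A targeted method should keep this count low for randomly selected words, while changes for gendered names are expected.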

Upon manual inspection, we observed that MP mainly leads to names being replaced by other given names in the neighborhood of gendered names (14 times, 5 times a name of opposite gender). INLP’s new contexts contain more common nouns and rare non-given names (46 tokens that end up in the immediate context of a gendered name are not regularly spelled given names). Another noticeable difference is that MP appears to have moved the name Ariel to its interpretation of a location, whereas it ended up near other terms related to Disney after applying INLP. The full results can be found in Tables D and 7 in Appendix D.

Similarity Experiments

Word debiasing methods should only affect information that reflects the bias and leave the rest of the space unchanged. We investigate the effects of both debiasing methods on three word similarity datasets, namely, simlex-999, Sim353 (both the relatedness and the similarity dataset) and Mturk-771. The datasets can serve as a reflection of how the debiasing methods affect the semantic space. While it is possible that the similarity datasets contain gendered words, it can be assumed that the majority of word pairs do not have an explicit gendered meaning. We verified this assumption by means of a manual analysis of the simlex-999 pairs and found 62 pairs where one or both terms had a meaning that explicitly contains gender. Our manual annotations can be found in our GitHub repository. We report the Spearman correlation between the similarity scores given by the annotators and the cosine distance before and after debiasing with INLP and MP in Table 3.
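As an illustration of the evaluation measure, the Spearman correlation can be computed as the Pearson correlation of the ranks of both score lists. The sketch below uses average ranks for ties; the human and model scores are invented values, and the function names are hypothetical.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented scores for four word pairs (assumption, not dataset values).
human = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.6, 0.3, 0.1]

same_order = spearman(human, model)
reversed_order = spearman(human, list(reversed(model)))
```

Only the ordering of the pairs matters, so a debiasing method can change individual similarity values without changing this score as long as the ranking is preserved.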

The results on these similarity tests remain relatively stable after applying MP, whereas we see a clear increase in correlation scores after INLP is applied. This result confirms that the overall embeddings change less when applying MP compared to INLP. The observed improvement caused by INLP indicates that more research is needed to determine how INLP affects the entire semantic space. If iterative projections can improve the overall quality, the additional effect they have on the space may actually be desirable. In particular, Ravfogel et al. (2020) hypothesize that this increase may be due to words in the datasets containing a gender component which is not perceived by humans, but do not investigate the matter further. We further explore this possibility in our additional experiments described in Section 4.2 (see main content above).

WEAT.

Caliskan et al. (2017) propose the Word Embedding Association Test (WEAT), which measures bias inferred from the semantic similarity between groups of words. We have two groups of target words, for instance $T_{1}=\{$programmer, professor, engineer, …$\}$ and $T_{2}=\{$nurse, teacher, librarian, …$\}$, and two sets of attribute words $A_{1}=\{$man, male, masculine, …$\}$ and $A_{2}=\{$woman, female, feminine, …$\}$. If the word embeddings representing these target words are not biased, then the relative similarities between the target words ($T_{1}$, $T_{2}$) and attributes ($A_{1}$, $A_{2}$) should be equivalent. In other words: words such as secretary and programmer should not be more similar to either of the two groups of gendered attributes.

WEAT measures the association between the target groups and attributes: a biased representation is expected to maintain this association (reflected by a high score), whereas WEAT scores on representations with low to no bias should be close to zero. A WEAT score of 0 indicates that words from the target groups are, on average, equally strongly associated with $A_{1}$ and $A_{2}$. Like Ravfogel et al. (2020), we follow Gonen and Goldberg (2019) and represent the attributes through gendered names rather than attributes. We then take the target words of the same three tests used by Ravfogel et al. (2020) (WEAT 6, WEAT 7 and WEAT 8); the results are shown in Table 4. Ravfogel et al. (2020) report p-values of associations rather than the values. We present the values instead, because the p-values do not distinguish between positive and negative WEAT scores. WEAT 6 includes target words referring to career and family attributes, WEAT 7 has target words related to math and art, and WEAT 8 has science and art target words.
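For illustration, the following sketch computes the WEAT effect size as defined by Caliskan et al. (2017): the difference between the mean associations of the two target groups, divided by the standard deviation of the associations over all target words. Whether this is exactly the value reported in Table 4 is an assumption, as is the use of the sample standard deviation; the two-dimensional vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average(values):
    return sum(values) / len(values)

def association(t, attrs_1, attrs_2):
    """Mean similarity of t to A1 minus mean similarity of t to A2."""
    return (average([cosine(t, a) for a in attrs_1])
            - average([cosine(t, a) for a in attrs_2]))

def weat_effect_size(targets_1, targets_2, attrs_1, attrs_2):
    s1 = [association(t, attrs_1, attrs_2) for t in targets_1]
    s2 = [association(t, attrs_1, attrs_2) for t in targets_2]
    pooled = s1 + s2
    mu = average(pooled)
    # Sample standard deviation over all target words (assumption).
    std = math.sqrt(sum((s - mu) ** 2 for s in pooled) / (len(pooled) - 1))
    return (average(s1) - average(s2)) / std if std else 0.0

# Invented vectors: the first coordinate plays the role of gender.
a1, a2 = [[1.0, 0.0]], [[-1.0, 0.0]]
t1 = [[1.0, 1.0], [1.0, 2.0]]
t2 = [[-1.0, 1.0], [-1.0, 2.0]]

biased = weat_effect_size(t1, t2, a1, a2)
flattened = [[0.0, t[1]] for t in t1 + t2]  # gender coordinate removed
debiased = weat_effect_size(flattened[:2], flattened[2:], a1, a2)
```

Swapping the two target groups flips the sign of the score, which is why the sign carries information that a p-value alone does not show.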

Both MP and INLP reduce the WEAT scores. For WEAT 6 and 7, 35 INLP projections lead to the largest reduction. For WEAT 8, MP provides the best results, followed by a single INLP iteration (rather than the full 35). The negative value of WEAT 8 indicates a stronger bias in the opposite direction than the remaining bias present after the first INLP iteration or after the MP projection. This leads us to wonder how to interpret these results. We therefore investigate them further in Section 4.2.

Bias-by-neighbor.

Gonen and Goldberg (2019) show that even if a debiasing method manages to remove a biased vector from explicitly gendered words, it may still be close to other words with the same stereotypical connotation (e.g. nurse remains close to receptionist and teacher). The bias-by-neighbors measure captures such remaining bias by providing the percentage of the 100 closest words that were male-biased in the original dataset. For completely debiased word embeddings, this percentage should be around 50% (with half of the words being (slightly) biased to one gender and the other half of the words to the other). This measure is applied to a set of biased professions provided by Bolukbasi et al. (2016), which originally has a bias score of 85.2%. INLP reduces this score to 73.4% and MP to 74.5%. Both methods thus yield comparable results, which reveal that bias is reduced but not completely removed.
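The bias-by-neighbors percentage for a single word can be sketched as below. The sketch assumes that the set of originally male-biased words is given and that neighbors are ranked by cosine similarity in the debiased space; the toy vectors, the neighborhood size of 3 and the function names are hypothetical (the experiments use the 100 closest words).

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(embeddings, word, k):
    """The k words most similar to `word`, excluding the word itself."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine(embeddings[word], embeddings[w]), reverse=True)
    return others[:k]

def bias_by_neighbors(debiased, originally_male_biased, word, k=100):
    """Percentage of the k nearest neighbors that were male-biased originally."""
    neighbors = nearest(debiased, word, k)
    male = sum(1 for n in neighbors if n in originally_male_biased)
    return 100.0 * male / len(neighbors)

# Invented debiased vectors and original bias labels (assumption).
debiased = {
    "nurse": [1.0, 0.1], "teacher": [0.9, 0.2], "receptionist": [0.8, 0.1],
    "engineer": [0.7, -0.2], "pilot": [-1.0, 0.0],
}
male_biased = {"engineer", "pilot"}

score = bias_by_neighbors(debiased, male_biased, "nurse", k=3)
```

Here two of the three nearest neighbors of nurse were originally female-biased, so the score is one third (about 33%), which indicates that the stereotypical neighborhood is still largely intact.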

TPR-GAP: “gender in the wild”.

De-Arteaga et al. (2019) propose an approach that considers the impact of bias on downstream classification tasks. The approach measures to what extent debiasing methods can remove gender bias from a system that predicts a person's occupation based on a biography describing them. A fair system should have equal performance for members of different classes Hardt et al. (2016). Specifically, it should not amplify a bias that is present in the label distribution. In the case of gender, the system should perform equally well predicting stereotypically gendered occupations such as surgeon, caretaker, secretary or marine, regardless of whether the biographical texts are about men or women. De-Arteaga et al. (2019) use the $GAP^{TPR}$ score to measure to what degree the performance of a classification system is impacted by bias. The score is calculated as follows: $TPR_{g,y}$ (Equation 1) is the true positive rate for a given profession ($y$) and gender ($g$). The TPR-GAP, $GAP^{TPR}_{g,y}$, is the difference (gap) between the true positive rates (TPR) of gender $g$ and its opposite gender $\sim g$ in a given occupation $y$ (Equation 2). In order to calculate a single measure for bias in all occupations, Romanov et al. (2019) take the root-mean-square of the TPR-GAPs for all occupations (Equation 3).
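The three quantities described above can be sketched directly from their verbal definitions. The record format (gender, true occupation, predicted occupation), the gender codes "f" and "m" and the toy predictions are assumptions made for the illustration.

```python
import math

def tpr(records, gender, occupation):
    """True positive rate for one gender and one occupation."""
    relevant = [r for r in records if r[0] == gender and r[1] == occupation]
    if not relevant:
        raise ValueError("no records for this gender and occupation")
    return sum(1 for r in relevant if r[2] == occupation) / len(relevant)

def tpr_gap(records, occupation, gender="f", other="m"):
    """Difference between the TPR of `gender` and of the opposite gender."""
    return tpr(records, gender, occupation) - tpr(records, other, occupation)

def gap_rms(records, occupations):
    """Root-mean-square of the TPR gaps over all occupations."""
    gaps = [tpr_gap(records, y) for y in occupations]
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))

# Invented predictions: (gender, true occupation, predicted occupation).
records = [
    ("f", "nurse", "nurse"), ("f", "nurse", "nurse"),
    ("m", "nurse", "nurse"), ("m", "nurse", "surgeon"),
    ("f", "surgeon", "surgeon"), ("f", "surgeon", "nurse"),
    ("m", "surgeon", "surgeon"), ("m", "surgeon", "surgeon"),
]

nurse_gap = tpr_gap(records, "nurse")      # 1.0 - 0.5
surgeon_gap = tpr_gap(records, "surgeon")  # 0.5 - 1.0
overall = gap_rms(records, ["nurse", "surgeon"])
```

In this toy example the gaps of the two occupations have opposite signs; because the gaps are squared, they do not cancel and the overall score is 0.5.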

We obtain 393,423 biographies from Ravfogel et al. (2020), a subset of the original corpus of De-Arteaga et al. (2019) describing people with 28 different occupations. (De-Arteaga et al. (2019) use 399,000 biographies; the reduction is due to biographies no longer being available when scraping the data.) We split the data in the same 65% training, 10% development and 25% test splits as used by De-Arteaga et al. (2019) and Ravfogel et al. (2020). Like Ravfogel et al. (2020), we use logistic classifiers that take one of three representations of the biographies as input: (1) one-hot BOW, (2) averaged FastText embeddings (Joulin et al., 2017) and (3) the last hidden state of BERT over the [CLS] token.

In this setup, debiasing projections are applied as follows: For INLP, each of the three representations is debiased by training a linear classifier on all 28 occupation classes and using the learned vector to calculate the nullspace. Logistic regression in 100 iterations is used for the BOW representations, 150 linear SVM iterations for FastText representations and 300 linear SVM iterations for BERT. We use Scikit Learn Pedregosa et al. (2011) for all classification approaches. For the MP setup, we create a mean projection for every occupation $o$ (28 in total) by identifying the projection vector $P_{o}$ based on the mean of all female and the mean of all male biographies with occupation $o$.

We report the accuracy of predicting the correct profession (Acc.), the root-mean-square of the $GAP^{TPR}$ scores of all professions ( $GAP^{RMS}$ ) and the correlation between percentage of women in a given profession and the respective $GAP^{TPR}$ scores (Corr.). If debiasing is successful, the classification accuracy should remain high (compared to before debiasing), while the GAP score should decrease. We measure the correlation between underpredicting women (the $GAP^{TPR}$ score) and the true percentage of women in the set of professions. A lower correlation is preferable. The intuition behind measuring the correlations in this way is to measure the degree to which bias is amplified compared to the distribution of men and women in a profession. For example, a difference of 5% can have different implications depending on the original distribution: It is worse if the predicted percentage of females in a profession goes down to 15% when the true percentage is 20% compared to a true percentage of 48% going to 43%.

The results presented in Table 5 show that overall results of INLP and MP are comparable. MP retains higher accuracy compared to INLP. MP reduces the $GAP^{RMS}$ much more for one-hot BOW, slightly more for FastText embeddings and slightly less for BERT. INLP reduces the correlation a bit more for both FastText and BERT.

Appendix C Additional Results of Diving Deeper

This appendix provides additional results on the WEAT scores. Figure 10 presents the result of applying random projection iterations after 8 INLP iterations. Figure 10 illustrates what happens when random projections are applied to the original data (without first applying projections for removing gender). These figures illustrate that random projections after 8 INLP iterations mostly lead to a decrease in WEAT score, whereas applying random projections directly to fully biased data increases the scores in the vast majority of cases.

Appendix D Changes in the Embedding Space of GloVe