Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information

Shun Shao, Yftah Ziser, Shay B. Cohen

Introduction

Natural language processing (NLP) models currently play a critical role in decision-supporting systems. Their predictions are often affected by undesirable biases encoded in real-world data they are trained on. Making sensitive predictions based on irrelevant input attributes such as gender, race, or religion (protected or guarded attributes) impacts user trust and the practical broad utility of NLP methods.

In recent years, representation learning approaches have become the mainstay of input encoding in NLP. While representation learning has yielded state-of-the-art results in many NLP tasks, controlling or inspecting the information encoded in these representations is hard. Thus, using rule-based methods to remove unwanted information from such representations is often not feasible. In the context of protected attributes, Bolukbasi et al. (2016) showed that word embeddings trained on the Google News corpus encode gender stereotypes. Later, Manzini et al. (2019) expanded this work and showed that word embeddings trained on the Reddit L2 corpus (Rabinovich et al., 2018) encode race and religion biases.

We propose a simple yet effective technique to remove protected attribute information from neural representations. Our method, dubbed SAL for Spectral Attribute removaL, applies Singular Value Decomposition (SVD) on a covariance matrix between the input representation and the protected attributes and prunes highly co-varying directions. Figure 1 demonstrates how professional biography text representations from labeled gender clusters (each biography is marked with the gender of its subject; De-Arteaga et al. 2019) for different professions expand after the use of SAL and become closer, implying a higher spread of each profession's representations after SAL (§5.2.2).

In addition, we overcome the linear removal limitations of SAL and previous work by using eigenvalue decomposition of kernel matrices to obtain projections into directions with reduced covariance in the kernel feature space. We refer to this method as kSAL (for kernel SAL).

SAL outperforms the recent method of Ravfogel et al. (2020) aimed at solving the same problem and is able to remove guarded information much faster while retaining better performance for the main task. Further experiments demonstrate that our method performs well even when the available data for the protected attributes is limited.

Problem Formulation and Notation

We assume $n$ samples of $(\mathbf{X},\mathbf{Y},\mathbf{Z})$ , denoted by $(\mathbf{x}^{(i)},\mathbf{y}^{(i)},\mathbf{z}^{(i)})$ for $i\in[n]$ . These samples are used to train the classifier to predict the target values ( $y$ ) from the inputs ( $x$ ). These samples are also used to remove the information from the inputs based on the guarded attributes ( $z$ ).

Erasing Principal Directions

We describe SAL in this section. We explain the use of SVD on cross-covariance matrices (§3.1) and describe the core algorithm in §3.2 and the connection to other algorithms in §3.3.

where $\mathcal{O}_{i}$ is the set of pairs of vectors $(\mathbf{a},\mathbf{b})$ such that $||\mathbf{a}||_{2}=||\mathbf{b}||_{2}=1$ , $\mathbf{a}$ is orthogonal to $\bm{U}_{1},\ldots,\bm{U}_{i-1}$ and similarly, $\mathbf{b}$ is orthogonal to $\bm{V}_{1},\ldots,\bm{V}_{i-1}$ .

Once the orthogonal matrices in the form of $\bm{U}$ and $\bm{V}$ are found, one can truncate them (for example, use only a subset of the columns of $\bm{U}$ , represented as the semi-orthonormal matrix $\hat{\bm{U}}$ ) to use, for example, $\hat{\bm{U}}^{\top}\mathbf{X}$ , as a representation (linear projection) of $\mathbf{X}$ which co-varies the most with $\mathbf{Z}$ .

We suggest that rather than using the largest singular value vectors in $\bm{U}$ to project $\mathbf{X}$ , we should project $\mathbf{X}$ using the principal directions with the smallest singular values. This means we find a representative of $\mathbf{X}$ that co-varies the least with $\mathbf{Z}$ , essentially removing the information from $\mathbf{X}$ that is most related to $\mathbf{Z}$ and can be detected through covariance.

3.2 The SAL Algorithm

SVD is then performed on $\bm{\Omega}$ to obtain $(\bm{U},\bm{\Sigma},\bm{V})$ . We choose an integer value $k$ and define $\overline{\bm{U}}=\bm{U}_{(k+1):d}$ . The value of $k$ is bounded by the rank of $\bm{\Omega}$ , which is in turn bounded from above by $\min(d,d^{\prime})$ , where $d$ and $d^{\prime}$ are the dimensions of the vectors of $\mathbf{X}$ and $\mathbf{Z}$ .

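Then, the vectors $\mathbf{x}^{(i)}$ are projected using either $\overline{\bm{U}}^{\top}$ or $\overline{\bm{U}}\overline{\bm{U}}^{\top}$ . The latter projection attempts to project $\mathbf{x}^{(i)}$ to the original dimensionality and space after removing the information. More specifically, $\overline{\bm{U}}\overline{\bm{U}}^{\top}$ is the orthogonal projection onto the span of the columns of $\overline{\bm{U}}$ , that is, onto the orthogonal complement of the top $k$ left singular vectors of $\bm{\Omega}$ (the null space of $\bm{\Omega}^{\top}$ when $k$ equals the rank of $\bm{\Omega}$ ).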
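The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: the function names `sal_projection` and `remove_guarded` are hypothetical, the rows of `X` and `Z` are assumed to hold the samples $\mathbf{x}^{(i)}$ and $\mathbf{z}^{(i)}$ , and both matrices are centered before the empirical cross-covariance is formed (an assumption of this sketch).

```python
import numpy as np

def sal_projection(X, Z, k):
    """Return (U_bar, S): U_bar holds the left singular vectors of the
    empirical cross-covariance of X and Z, without the top-k directions."""
    n = X.shape[0]
    # Assumption of this sketch: center both views before forming the covariance.
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    omega = Xc.T @ Zc / n                      # shape (d, d')
    # full_matrices=True gives a square U of shape (d, d).
    U, S, _ = np.linalg.svd(omega, full_matrices=True)
    return U[:, k:], S                         # U_bar has shape (d, d - k)

def remove_guarded(X, U_bar, keep_dim=True):
    """Project the rows of X with U_bar U_bar^T (same dimension) or U_bar^T."""
    if keep_dim:
        return X @ U_bar @ U_bar.T
    return X @ U_bar
```

With a one-dimensional guarded attribute, removing $k=1$ direction leaves cleaned inputs whose empirical cross-covariance with $\mathbf{Z}$ is zero up to floating-point error.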

The criterion we use to choose $k$ is based on the singular values in $\bm{\Sigma}$ . More specifically, we choose a threshold $\alpha\geq 1$ and choose the minimal $k$ such that $\bm{\Sigma}_{11}/\bm{\Sigma}_{k+1,k+1}>\alpha$ .
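The threshold rule can be written as a short helper. The sketch below is illustrative: `choose_k` is a hypothetical name, singular values are assumed to be sorted in decreasing order (as returned by `np.linalg.svd`), and returning the number of singular values when no ratio exceeds $\alpha$ is a fallback chosen for this example only.

```python
import numpy as np

def choose_k(singular_values, alpha):
    """Smallest k with sigma_1 / sigma_{k+1} > alpha (1-based indices in the text)."""
    if alpha < 1:
        raise ValueError("alpha is expected to be >= 1")
    s = np.asarray(singular_values, dtype=float)
    for k in range(len(s)):
        # s[k] is sigma_{k+1}; compare without dividing so zeros are handled.
        if s[0] > alpha * s[k]:
            return k
    # Fallback (assumption of this sketch): no ratio exceeds alpha.
    return len(s)
```

For example, with singular values `[10, 6, 1, 0.5]` and $\alpha=5$ , the first ratio above the threshold is $10/1$ , so the rule selects $k=2$ .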

3.3 Connection to CCA and PCA

We describe connections to other matrix factorization methods.

The use of SVD on the cross-covariance matrix is very much related to the technique of Canonical Correlation Analysis (CCA), in which projections of $\mathbf{X}$ and $\mathbf{Z}$ are found such that they maximize the cross-correlation between these two random vectors. Rather than applying SVD on the cross-correlation matrix (CCA), we apply it on the cross-covariance matrix to preserve the $\mathbf{X}$ scale in our projection.

In all three cases of CCA, PCA and in addition, LSA (Latent Semantic Analysis; Dumais 2004), SVD or eigenvalue decomposition is used with the aim of maximizing the correlation or covariance between one or two random vectors. In our case, the SVD is used to minimize the covariance between projections of $\mathbf{X}$ and $\mathbf{Z}$ .

Kernel Extension to SAL

The kernel trick refers to learning and prediction without explicitly representing $\phi(\mathbf{x})$ or $\psi(\mathbf{z})$ . Rather than that, we assume two kernel functions, $K_{\phi}(\mathbf{x},\mathbf{x^{\prime}})$ and $K_{\psi}(\mathbf{z},\mathbf{z}^{\prime})$ that calculate similarities between two $x$ s or between two $z$ s.

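Every kernel that satisfies the necessary properties can be shown to be a dot product in some feature space. This means that for a given kernel function $K_{\phi}(\mathbf{x},\mathbf{x^{\prime}})$ it holds that

$$K_{\phi}(\mathbf{x},\mathbf{x}^{\prime})=\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle\qquad(4)$$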

for some $\phi$ function and similarly for $K_{\psi}(\mathbf{z},\mathbf{z}^{\prime})$ . Masking learning and prediction through a kernel function is often useful when the feature representations $\phi$ and $\psi$ are hard to explicitly compute, for example, because $m=\infty$ or $m^{\prime}=\infty$ (such as the case with the Radial Basis Function, RBF, kernel).
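As a concrete instance of a kernel being a dot product in a feature space, the degree-2 polynomial kernel $(1+\mathbf{x}^{\top}\mathbf{x}^{\prime})^{2}$ has a finite explicit feature map. The sketch below checks this numerically; the function names are hypothetical and the particular feature ordering is just one valid choice.

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel (1 + x^T y)^2."""
    return (1.0 + x @ y) ** 2

def poly2_features(x):
    """One explicit feature map for the kernel above:
    [1, sqrt(2) * x_i, x_i * x_j for all ordered pairs (i, j)]."""
    return np.concatenate(([1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()))
```

For any two vectors, `poly2_features(x) @ poly2_features(y)` agrees with `poly2_kernel(x, y)` up to rounding, while the kernel itself never materializes the $1+d+d^{2}$ features.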

We show next that the kernel trick can be used to generalize SAL to nonlinear information removal.

4.2 Removal with the Kernel Trick

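Rather than assuming a set of examples in the form mentioned in §2, we assume we are given as input two kernel matrices of dimension $n\times n$ :

$$[\bm{K}_{\phi}]_{ij}=K_{\phi}(\mathbf{x}^{(i)},\mathbf{x}^{(j)}),\qquad[\bm{K}_{\psi}]_{ij}=K_{\psi}(\mathbf{z}^{(i)},\mathbf{z}^{(j)}).$$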

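In addition, for the justification of our algorithm, we define the following two feature matrices based on the kernel feature functions:

$$\bm{\Phi}=\left[\phi(\mathbf{x}^{(1)}),\ldots,\phi(\mathbf{x}^{(n)})\right]\in\mathbb{R}^{m\times n},\qquad\bm{\Psi}=\left[\psi(\mathbf{z}^{(1)}),\ldots,\psi(\mathbf{z}^{(n)})\right]\in\mathbb{R}^{m^{\prime}\times n}.$$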

Note that these two matrices are never calculated explicitly. Given the definition of the kernel as a dot product in the feature space (Eq. 4), it can be shown that $\bm{K}_{\phi}=\bm{\Phi}^{\top}\bm{\Phi}$ and $\bm{K}_{\psi}=\bm{\Psi}^{\top}\bm{\Psi}$ . In addition, we slightly change the empirical cross-covariance matrix $\bm{\Omega}$ definition in Eq. 3 to: $\bm{\Omega}=\bm{\Phi}\bm{\Psi}^{\top}$ . (This means we ignore the constant $1/n$ in the above definition of $\bm{\Omega}$ , the constant that normalizes the matrix with respect to the number of examples. This does not change the nature of the following discussion, but it makes it simpler.) At this point, the question is how to perform SVD on $\bm{\Omega}$ without ever accessing directly the feature functions. This is where the spectral theory of matrices comes in handy.

More specifically, it is known that the left singular vectors of $\bm{\Omega}$ ( $\bm{U}$ ) are the eigenvectors of $\bm{\Omega}\bm{\Omega}^{\top}$ . In addition, the singular values of $\bm{\Omega}$ correspond to the square-root values of the eigenvalues of $\bm{\Omega}\bm{\Omega}^{\top}$ .
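This relation between SVD and eigenvalue decomposition is easy to verify numerically. The helper below is a small sketch with a hypothetical name; it recovers singular values and left singular vectors of a matrix from the eigendecomposition of $\bm{\Omega}\bm{\Omega}^{\top}$ using `np.linalg.eigh`, which applies because that product is symmetric.

```python
import numpy as np

def left_singular_vectors_via_eig(omega):
    """Return (singular values, left singular vectors) of omega computed
    from the eigendecomposition of omega @ omega.T."""
    evals, evecs = np.linalg.eigh(omega @ omega.T)   # ascending eigenvalues
    order = np.argsort(evals)[::-1]                  # reorder to descending
    evals = np.clip(evals[order], 0.0, None)         # guard against tiny negatives
    return np.sqrt(evals), evecs[:, order]
```

The recovered vectors match those of `np.linalg.svd` up to sign, since eigenvectors are only determined up to a scalar factor.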

In addition, we show in Appendix A why an eigenvector $\mathbf{w}$ of $\bm{\Gamma}=\bm{K}_{\phi}\bm{K}_{\psi}$ can be transformed to an eigenvector of $\bm{\Omega}\bm{\Omega}^{\top}$ by multiplying $\mathbf{w}$ on the left by $\bm{\Phi}$ and calculating $\bm{\Phi}\mathbf{w}$ .

With this fact in mind, we are now ready to find the left singular vectors of $\bm{\Omega}$ by finding the eigenvectors of $\bm{\Gamma}$ , a matrix which is solely based on the kernel functions of $\mathbf{x}$ and $\mathbf{z}$ .

Let $\mathbf{w}_{1},\ldots,\mathbf{w}_{k}$ be eigenvectors of $\bm{\Gamma}$ and let $\mathbf{w}^{\prime}_{1},\ldots,\mathbf{w}^{\prime}_{k}$ be the orthonormalization of $\mathbf{w}_{i}$ , $i\in[k]$ based on the inner product $\langle\mathbf{w}_{i},\mathbf{w}_{j}\rangle=\mathbf{w}_{i}\bm{K}_{\phi}\mathbf{w_{j}}^{\top}$ . If we denote by $\bm{W}$ the matrix such that $\bm{W}_{j}=\mathbf{w}^{\prime}_{j}$ for $j\in[k]$ , then $\bm{\Phi}\bm{W}=\bm{U}$ where $\bm{U}$ is the left singular vector matrix of $\bm{\Omega}$ . Then,

$$\bm{U}^{\top}\phi(\mathbf{x})=\bm{W}^{\top}\bm{\Phi}^{\top}\phi(\mathbf{x})=\bm{W}^{\top}\kappa(\mathbf{x}),\qquad(9)$$

where $\kappa(\mathbf{x})$ is a function that returns a vector of length $n$ such that $[\kappa(\mathbf{x})]_{j}=K_{\phi}(\mathbf{x}^{(j)},\mathbf{x})$ . Eq. 9 shows we can calculate the projection of $\phi(\mathbf{x})$ while removing the information in $\psi(\mathbf{z})$ by using the smallest eigenvalue eigenvectors of $\bm{\Gamma}$ and kernel calculations of each training example with $\mathbf{x}$ .
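The identity $\bm{\Phi}\bm{W}=\bm{U}$ and the kernel-only projection $\bm{W}^{\top}\kappa(\mathbf{x})$ can be sanity-checked with linear kernels, where the feature maps are explicit. The sketch below is an illustration under stated assumptions, not the kSAL procedure itself: `kernel_top_directions` is a hypothetical name, it works with the top-$k$ directions only, it uses a column-vector convention in which the eigenvectors are taken from the product `K_psi @ K_phi`, and it assumes distinct nonzero eigenvalues so that normalizing each vector under the $\bm{K}_{\phi}$ inner product is enough to orthonormalize them.

```python
import numpy as np

def kernel_top_directions(K_phi, K_psi, k):
    """Return W (n x k) such that Phi @ W has unit-norm columns that are
    eigenvectors of Omega Omega^T, using only the two kernel matrices."""
    # Assumption: column-vector convention, eigenvectors of K_psi @ K_phi.
    gamma = K_psi @ K_phi
    evals, evecs = np.linalg.eig(gamma)        # gamma is not symmetric in general
    evals, evecs = np.real(evals), np.real(evecs)
    top = np.argsort(evals)[::-1][:k]          # indices of the k largest eigenvalues
    W = evecs[:, top]
    # Normalize under the K_phi inner product: ||Phi w||^2 = w^T K_phi w.
    norms = np.sqrt(np.einsum("ij,ik,kj->j", W, K_phi, W))
    return W / norms

def kernel_project(W, kappa):
    """Projection W^T kappa(x), with kappa(x)_j = K_phi(x^(j), x)."""
    return W.T @ kappa
```

With linear kernels ( $\bm{K}_{\phi}=\bm{X}\bm{X}^{\top}$ for a sample matrix $\bm{X}$ with rows $\mathbf{x}^{(i)}$ , and similarly for $\mathbf{z}$ ), the output of `kernel_project` agrees up to sign with projecting $\mathbf{x}$ onto the top left singular vectors of $\bm{X}^{\top}\bm{Z}$ computed directly.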

4.3 Practical Kernel Removal

Using the kernel algorithm as above may lead to issues with tractability, as it possibly requires calculating the full eigenvector matrix of a large matrix (the product of two kernel matrices). We propose an alternative algorithm (kSAL) for the kernel case, which is more tractable.

Absorbing the kernel function computation as a constant, computing the kernel matrices takes $\mathcal{O}(n^{2})$ time and computing their product $\bm{\Gamma}$ takes $\mathcal{O}(n^{\omega})$ for $\omega<2.808$ using Strassen’s algorithm, but this can be done much more efficiently when $\bm{K}_{\psi}$ is sparse, as normally expected. Calculating the top $k$ eigenvectors of $\bm{\Gamma}$ has a cost of $\mathcal{O}(nk^{2}+k^{3})$ using, for example, the Arnoldi method (Matlab, for instance, implements a variant of the Arnoldi method for its function eigs). In §5.4, we report the clock running time for the kernel method.

Below, we experiment with RBF kernels (where $K_{\phi}(\mathbf{x},\mathbf{x}^{\prime})=\exp(-\gamma||\mathbf{x}-\mathbf{x}^{\prime}||_{2}^{2})$ ; we use $\gamma=0.1$ ) and polynomial kernel of degree 2 (where $K_{\phi}(\mathbf{x},\mathbf{x}^{\prime})=(1+\mathbf{x}^{\top}\mathbf{x}^{\prime})^{2}$ ). The $\mathbf{z}$ kernel remains linear (dot product).
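The three kernels mentioned here can be computed in matrix form as follows. This is a minimal NumPy sketch with hypothetical function names; rows of `A` and `B` are assumed to be the input vectors, and the default `gamma=0.1` mirrors the value stated above.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=0.1):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def poly2_kernel_matrix(A, B):
    """K[i, j] = (1 + A[i]^T B[j])^2."""
    return (1.0 + A @ B.T) ** 2

def linear_kernel_matrix(A, B):
    """K[i, j] = A[i]^T B[j], used here for the guarded attribute z."""
    return A @ B.T
```

Calling `rbf_kernel_matrix(X, X)` on a sample matrix gives the $n\times n$ matrix $\bm{K}_{\phi}$ , and `linear_kernel_matrix(Z, Z)` gives $\bm{K}_{\psi}$ .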

Experiments

In our experiments, our main comparison algorithm is the iterative null space projection (INLP) algorithm of Ravfogel et al. (2020), which aims at solving an equivalent problem to ours. For the word embedding debiasing and fair classification (both setups), we follow the experimental settings of Ravfogel et al. (2020). We use the authors’ implementation for both the INLP method and the experimental settings: https://github.com/shauli-ravfogel/nullspace_projection. SAL provides linear guarding, similarly to INLP, while kSAL also captures nonlinear regularities with respect to $\mathbf{Z}$ (one-hot vector). We can provide such guarding for representations of state-of-the-art encoders (such as BERT), provided the representations are eventually fed into a classifier for prediction. The protected attributes we experiment with are gender and race.

For debiasing word embeddings (§5.1), we use 7,500 male and female associated words, 15K words overall. The dataset train/validation/test split sizes are (49%/21%/30%). All the splits are balanced, i.e., containing an equal amount of male and female associated words. For the fair sentiment classification task (§5.2), we use 100K training examples across all authors’ ethnicity ratios (0.5, 0.6, 0.7, and 0.8). All training sets have an equal amount of positive and negative sentiment examples. The test set is balanced for both sentiment and authors’ ethnicity labels. For the profession classification task (§5.2.2), the data train/validation/test split sizes are (65%/10%/25%), and all the splits combined contain 115K samples.

5.1 Word Embedding Debiasing

Word embeddings are often prone to encoding biases in various ways (see §6). We evaluate our methods on gender bias removal from GloVe word embeddings. We use the 150,000 most common words and discard the rest. We sort the embeddings by their projection on the $\overrightarrow{\text{he}}$ - $\overrightarrow{\text{she}}$ direction. Then we consider the top 7,500 word embeddings as male-associated words ( $z=1$ ) and the bottom 7,500 as female-associated words ( $z=-1$ ).
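The selection of gender-associated words can be sketched as follows. The function name and arguments are hypothetical; `E` is assumed to hold one embedding per row (already restricted to the most common words), and the projection is taken as a plain dot product with the unnormalized $\overrightarrow{\text{he}}-\overrightarrow{\text{she}}$ direction, which yields the same ordering as a normalized projection.

```python
import numpy as np

def split_by_gender_direction(E, he_vec, she_vec, n_per_group):
    """Return (male_idx, female_idx): row indices of E with the largest and
    smallest projections on the he - she direction."""
    direction = he_vec - she_vec
    scores = E @ direction                 # one projection score per word
    order = np.argsort(scores)             # ascending order of scores
    female_idx = order[:n_per_group]       # most negative projections (z = -1)
    male_idx = order[-n_per_group:]        # most positive projections (z = 1)
    return male_idx, female_idx
```

In the setting described above, `n_per_group` would be 7,500.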

A linear classifier can perfectly predict the guarded gender attribute when trained on out-of-the-box GloVe embeddings. Removing the first direction ( $k=1$ ) does not affect the accuracy, as demonstrated in Figure 2. For $k=2$ , the performance drops to 50.2%, almost a random guess.

We further perform intrinsic semantic tests to ensure the debiased embeddings remain useful. We use SimLex-999, WordSim353, and Mturk771 (similarity and relatedness datasets) to calculate the correlation between cosine similarities of the word embeddings and the human-annotated similarity scores (Hill et al., 2015; Finkelstein et al., 2001; Halawi et al., 2012). We observed minor improvements for all tests when using debiased embeddings (Table 1), suggesting that our method keeps the embeddings intact. We also report the three most similar words (nearest neighbors) for ten random words before and after SAL (see Appendix B). We observe almost no change between the two sets of embedding results.

SAL debiasing does not provide a nonlinear information removal. In Figure 2 we plot the performance of nonlinear classifiers in the prediction of the linearly-guarded attribute (gender) as a function of the number of removed directions. We also provide linear classifier results for reference. We see that even after removing up to 30 principal directions, (linear) SAL is not sufficient for nonlinear classifiers – the gender can still be predicted. This finding is also noted by Ravfogel et al. (2020), who did not offer a direct solution. This finding partially motivates our development of kSAL.

All three kernels achieve high gender prediction accuracy when no information is removed ( $k=0$ ), with accuracy of 100%, 99.9% and 95.7% for the linear, polynomial, and RBF kernel, respectively. While the performance of the linear and polynomial kernels is not affected by removing one principal direction ( $k=1$ ), the RBF kernel accuracy drops to 86.3%. With $k=2$ , performance drops to 50.2%, 44.5% and 50.2% for the linear, polynomial, and RBF kernel, respectively, under nonlinear kernel removal. Compared to Figure 2 with SAL, we see kSAL effectively removes nonlinear information.

To quantitatively test whether the embeddings retain their geometric form when removing gender information, we compare the standard deviation ( $\rho$ ) of the values in $\bm{K}_{\phi}$ to the average deviation ( $\gamma$ ) of values of $\bm{K}_{\phi}$ from the corresponding values in $\hat{\bm{K}}_{\phi}$ (Eq. 10). When removing two principal directions, the largest approximation difference is seen in the linear kernel, with $\gamma/\rho=0.64$ . For the polynomial kernel, we observe $\gamma/\rho=0.52$ . For RBF, we have $\gamma/\rho=0.16$ .

5.2 Fair Classification

To further evaluate our method on downstream tasks, we follow fair classification tests of social media text and other texts.

The first task is sentiment analysis for social network users’ posts. We use the TwitterAAE dataset (Blodgett et al., 2016), which contains users’ tweets ( $\mathbf{x}$ ), coupled with the users’ ethnic affiliations ( $\mathbf{z}$ ), and a binary label for the sentiment the tweet conveys ( $\mathbf{y}$ ). The dataset splits the users into two groups, African American English (AAE) speakers and Standard American English (SAE) speakers. As users’ privacy makes it hard to obtain ground truth labels for ethnic affiliation, the dataset uses the demographics of the neighborhoods the users live in as a proxy. Following Ravfogel et al. (2020), we use the encoder of Felbo et al. (2017), DeepMoji, to obtain the tweet representations. DeepMoji is suitable for our goal, as it has been shown to encode demographic information and, therefore, might lead to unfair classification (Elazar and Goldberg, 2018).

We experiment with four different setups. In all of them, the dataset consists of an equal amount of positive and negative sentiment examples. The setups differ with respect to the guarded attribute ratio: a ratio of $p\in\{0.5,0.6,0.7,0.8\}$ means that a fraction $p$ of the positive class examples comes from AAE speakers, and a fraction $p$ of the negative class examples comes from SAE speakers. The larger the ratio, the higher the classifier’s tendency to make use of protected attributes to make its prediction.

We report the accuracy of the methods on the sentiment analysis task. To measure fairness, we use the difference in true positive rate (TPR-gap) between individuals belonging to different guarded attribute groups (Hardt et al., 2016; Ravfogel et al., 2020). The rationale behind the TPR-gap is that for an equal opportunity, a positive outcome must be independent of the guarded attribute ( $\mathbf{z}$ ), conditional on ( $\mathbf{y}$ ) being an actual positive. See Hardt et al. (2016) for more details.
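The TPR-gap for a binary task can be computed as in the sketch below. The function names, the explicit `group_a` / `group_b` arguments, and the sign convention (group A minus group B) are choices made for this illustration only.

```python
import numpy as np

def true_positive_rate(y_true, y_pred, positive=1):
    """Fraction of actual positives that are predicted positive."""
    mask = y_true == positive
    return float(np.mean(y_pred[mask] == positive))

def tpr_gap(y_true, y_pred, z, group_a, group_b, positive=1):
    """TPR of group_a minus TPR of group_b, conditioning on y being positive."""
    in_a = z == group_a
    in_b = z == group_b
    return (true_positive_rate(y_true[in_a], y_pred[in_a], positive)
            - true_positive_rate(y_true[in_b], y_pred[in_b], positive))
```

A gap of zero means that, among truly positive examples, both groups are equally likely to receive a positive prediction.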

Table 2 presents our results for the fair sentiment classification. For the first three ratios, $0.5$ , $0.6$ , and $0.7$ , we can see that both SAL ( $k=1,2$ ) and INLP maintain most of the main-task performance. In debiasing (TPR-gap), SAL with $k=2$ significantly outperforms INLP. As expected, removing two directions results in better debiasing than removing one, and it does not lead to a performance drop on the main task. For the last ratio, $0.8$ , INLP achieves the lowest TPR-gap, but this comes at the cost of a sharp performance drop on the main task, resulting in a nearly random classifier. SAL ( $k=1,2$ ) maintains most of the main-task performance, and for $k=2$ , the TPR-gap is halved.

5.2.2 Fair Profession Classification

The second task is profession classification. De-Arteaga et al. (2019) attempt to quantify the bias in automatic hiring systems and show that even for a simple task, predicting a candidate’s profession based on a self-provided short biography, significant gaps result from the writer’s gender. This might influence the open positions an automatic system will recommend to a candidate, thus favoring candidates from one gender over the other. We hence follow the setup of De-Arteaga et al. (2019), who experiment with profession classification ( $\mathbf{y}$ ), from short biographies ( $\mathbf{x}$ ), and gender as a guarded attribute ( $\mathbf{z}$ ). We use a multiclass classifier to predict the profession, as there are 28 profession classes. We experiment with two types of text representations: FastText (Joulin et al., 2016), based on bag of word embeddings (BWE), and BERT (Devlin et al., 2018) encodings.

We report accuracy for the profession classification. For bias level measurement, we use a generalization of the TPR-gap to the multi-class case, suggested by De-Arteaga et al. (2019), calculating the root mean square (RMS) of the per-class TPR-gaps over all classes.
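A multi-class version of the metric can be sketched by computing one TPR-gap per class and taking the root mean square. The function name is hypothetical, and skipping classes in which one of the two groups has no examples is a choice made for this sketch.

```python
import numpy as np

def rms_tpr_gap(y_true, y_pred, z, group_a, group_b):
    """Root mean square of per-class TPR gaps between two guarded groups."""
    gaps = []
    for c in np.unique(y_true):
        in_a = (y_true == c) & (z == group_a)
        in_b = (y_true == c) & (z == group_b)
        # Assumption of this sketch: skip classes missing one of the groups.
        if in_a.sum() == 0 or in_b.sum() == 0:
            continue
        tpr_a = np.mean(y_pred[in_a] == c)
        tpr_b = np.mean(y_pred[in_b] == c)
        gaps.append(tpr_a - tpr_b)
    if not gaps:
        raise ValueError("no class contains examples from both groups")
    return float(np.sqrt(np.mean(np.square(gaps))))
```

Squaring the gaps makes the score insensitive to which group is favored in each class, so gaps of opposite sign do not cancel out.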

De-Arteaga et al. (2019) also provided evidence for a strong correlation between TPR-gap and existing gender imbalances in occupations, which may lead to unfair classification.

Table 3 presents the profession classification results. Similar to the sentiment analysis task, SAL ( $k=1,2$ ) maintains most of the main-task performance, and for $k=2$ , the two-direction removal, the TPR-gap is lower. When comparing SAL ( $k=2$ ) to INLP, we observe a clear trade-off between maintaining the main task performance (SAL, $k=2$ ) and low TPR-gap scores (INLP).

5.3 Scarce Protected Attribute Labels

For many real-world applications, obtaining large amounts of labeled data for protected attributes can be costly, labor-intensive, and in some cases, infeasible due to an ever-increasing number of privacy regulations. In this analysis, we stress-test our algorithm by simulating a scenario in which only a limited number of samples from the main task are coupled with the desired protected attribute labels. For this purpose, we replicate the fair sentiment classification experiments, but this time feed only a fraction of the annotated data to our debiasing method. The experiment is identical in terms of the main task, i.e., we use 100K samples for training the sentiment classifier. However, only 5% of the sentiment training data carries labels for the protected attribute, so we feed 5,000 samples for debiasing. The subsets for debiasing are chosen randomly, and we repeat each experiment 10 times with different subsets.

Table 4 presents our results. Using a small fraction of the data for debiasing did not significantly affect SAL’s ( $k=1,2$ ) main-task performance. INLP, on the other hand, suffers from a sharp performance decrease, resulting in a near-random sentiment classifier. SAL’s ( $k=1,2$ ) ability to debias the data is slightly worse than in the complete dataset setting, but the resulting representations are still significantly less biased than the original ones. INLP achieves low TPR-gaps, but it is hard to determine if this is due to an accurate bias removal or a result of corrupting the representations.

5.4 Kernel Experiments

Despite their flexibility in modeling rich feature functions, kernels have been documented to be computationally intensive. Lack of computational resources prevented us from using the full sentiment and bios datasets for our kernel experiments, and instead, we use $15,000$ training examples and $7,998$ test set examples (the full test set) for the sentiment dataset and $15,000$ training examples and $5,000$ examples for the profession dataset. For training on the acquired $15,000$ training examples, we used one Intel Xeon E5-2407 CPU, running at 2.2 GHz, for approximately five hours (for a time complexity analysis, see §4.3).

Table 5 shows that using only a small subset of the data, kSAL-poly2 reduces the TPR-gaps while maintaining almost identical performance to the original model on both the sentiment analysis and profession classification tasks. For the sentiment analysis task, kSAL-RBF slightly improves the main task results while reducing the TPR-gap (RMS). For the RBF profession classification task, the results are unexpected, with main task performance increasing as we remove principal directions. This could be due to the pruning of the rich, infinite feature space the RBF kernel represents (we also observe significant overfitting with RBF). With INLP, RBF-kernel SVM also obtains low-accuracy results.

5.5 Perturbed Inputs

While the transformation through $\overline{\bm{U}}\overline{\bm{U}}^{\top}$ maps $\mathbf{x}$ back into the original vector space (as a projection), it often turns out that it removes information in such a way that the original classifier (trained on data without removal) can no longer be used with the inputs after removal. This issue exists not only with our algorithm, but also with INLP, and indeed, like us, Ravfogel et al. (2020) re-trained their classifier after they created the cleaned projected inputs.

Ideally, we would want to remove information without necessarily having to retrain a classifier for the main task, as this is costly and perhaps unattainable. To test the effect of such an approach, we interpolated $\overline{\bm{U}}\overline{\bm{U}}^{\top}$ with the identity matrix, to eventually project $\mathbf{x}$ using $\lambda\overline{\bm{U}}\overline{\bm{U}}^{\top}+(1-\lambda)\bm{I}$ for $\lambda\in\{0,0.1,\ldots,1.0\}$ . This approach weakens the impact of the removal projection and retains some of the information in $\mathbf{x}$ . While an adversary can attack this approach (consider that the matrix $\lambda\overline{\bm{U}}\overline{\bm{U}}^{\top}+(1-\lambda)\bm{I}$ is invertible for $\lambda<1$ ), it can mitigate the effects of privacy violations in cases where the service or software used with the modified representations cannot be retrained, especially if the service providers have no malicious intent.
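The interpolated projection and its invertibility can be illustrated with a small NumPy sketch. The function name is hypothetical, and `U_bar` is assumed to have orthonormal columns, as it does when taken from the left singular vectors of $\bm{\Omega}$ .

```python
import numpy as np

def interpolated_projection(U_bar, lam):
    """Return lam * U_bar U_bar^T + (1 - lam) * I for U_bar of shape (d, d - k)."""
    d = U_bar.shape[0]
    return lam * (U_bar @ U_bar.T) + (1.0 - lam) * np.eye(d)
```

On the span of $\overline{\bm{U}}$ the matrix acts as the identity, and on the $k$ removed directions it scales by $1-\lambda$ , so its determinant is $(1-\lambda)^{k}$ . This is nonzero for every $\lambda<1$ , which is why the interpolated map can in principle be undone, and zero at $\lambda=1$ , where the removal is complete.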

Figure 3 describes an ablation experiment, ranging $\lambda$ as above on the bios dataset. We see that as we increase the intensity of the use of the SAL projection (increasing $\lambda$ ), the accuracy of both gender prediction and profession prediction decreases when training the original classifier on the non-projected inputs. While the behavior is similar for the gender accuracy for both INLP and our method, the decrease for the profession prediction is much sharper for $\lambda>0.4$ with INLP.

5.6 Runtime of SAL

We measure the time it takes both methods to learn a projection matrix for a given training set. Once we have a projection matrix, debiasing the data is done by multiplying the data representation matrix by the learned projection matrix. Since matrix multiplication is a common practice for many research disciplines, and both methods use it, we do not benchmark it as well. Table 6 presents the run-time differences between SAL and INLP. For all of the experiments, SAL runtime is smaller by at least three orders of magnitude than INLP runtime.

Related Work

In their influential work, Bolukbasi et al. (2016) revealed that word embeddings for many gender-neutral terms show a gender bias. Zhao et al. (2018) presented a customized training scheme for word embeddings, which minimizes the negative distances between words in the two groups, e.g., male and female related words, for gender debiasing. Gonen and Goldberg (2019) demonstrated that bias remains deeply intertwined in word embeddings even after using the above methods. For example, they showed several methods that can accurately predict the gender associated with gender-neutral words, even after applying the methods mentioned above. Similar to Ethayarajh et al. (2019), they concluded that removing a small number of intuitively selected gender directions cannot guarantee the elimination of bias. Motivated by this conclusion, Ravfogel et al. (2020) presented iterative null space projection (INLP). This debiasing algorithm iteratively projects features into a space where a linear classifier cannot predict the guarded attribute. The debiased representations are linearly guarded, i.e., they cannot guarantee bias removal beyond the linear level. Indeed, they show a simple nonlinear classifier can achieve high accuracy when predicting the guarded attribute. Their approach is also related to that of Xu et al. (2017). Previous work uses adversarial methods (Ganin et al., 2016) for information removal (Edwards and Storkey, 2015; Li et al., 2018; Coavoux et al., 2018; Elazar and Goldberg, 2018; Barrett et al., 2019; Han et al., 2021), with the one by Ravfogel et al. (2022) being related to ours through the use of the mini-max theorem with the squared-error loss on the reconstruction of a matrix similar to our covariance matrix. In addition, methods based on similarity measures between neural representations (Colombo et al., 2022) were developed. To support the increasing interest in fair classification, Han et al. (2022) presented an open-source framework for standardizing the evaluation of debiasing methods. Finally, most relevant to this paper is an extension of SAL to the unaligned case, where protected attributes are not paired with input examples (Shao et al., 2023).

Conclusions

We presented a method for removing information from learned representations. We extended our method by using kernels, showing we can provide an effective nonlinear guarding. We also experimented with real-world low-resource situations, in which only a small guarded attribute dataset is provided for information removal.

Limitations

There are two main technical limitations to our work: (a) while the kernel removal is nonlinear, it still depends on a feature representation that captures a specific type of nonlinearities; (b) like other kernel methods, the kernel removal method is significantly slower than direct SVD removal in cases where the feature representations can be written out without the need of an implicit kernel. Future work may apply random projections to the kernel matrices to decompose them more efficiently.

A general limitation of current information removal methods is that they can only remove information with respect to a specific class of classifiers. It could always be the case that complex correlations between the inputs and the guarded attributes exist, and that an adversary can try to exploit them to predict the guarded attribute if this class of classifiers is not sufficiently complex. Our use of kernels alleviates some of this issue, though not completely.

Finally, experimentally, we focus on text only in English. It is not clear to what extent our method generalizes to other languages in a useful manner, especially when morphology is rich, and the neural representations encode important information for the task at hand, but that information would be removed by our method.

Ethical Considerations

Public trust plays a significant role in the broad applicability of NLP in real-world scenarios, especially in critical situations that may directly impact people’s lives. NLP research of the kind presented in this paper helps this issue take the spotlight it deserves. However, we discourage NLP practitioners from using our method (and similar methods) as an off-the-shelf solution in deployed systems. We recommend investing a significant amount of time and effort in understanding the applicability and universality of our method to the debiasing of representations. Issues such as the expected type of adversariality or the tolerance level for a drop in system performance need to be considered.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. We especially appreciate the comment one of the reviewers provided regarding our title. Particularly, it could be misinterpreted as an indication of frustration at rejections of our paper (“gold”) in previous conferences. Rather, the “gold” in our case is the low-intensity principal vectors, which are pruned in most use cases of SVD. We also thank Shauli Ravfogel for providing support with the code for INLP, Ryan Cotterell for discussions and Matt Grenander for feedback on earlier drafts. The experiments in this paper were supported by compute grants from the Edinburgh Parallel Computing Center (Cirrus) and from the Baskerville Tier 2 HPC service (University of Birmingham).

References

Appendix A Eigenvectors of $\bm{\Lambda}$

We turn to the following Lemma used in §4.2.

Since $\mathbf{w}$ is an eigenvector of $\bm{\Gamma}$ , it holds that $\bm{\Gamma}\mathbf{w}=\lambda\mathbf{w}$ . Therefore:

and therefore $\bm{\Phi}\mathbf{w}$ is an eigenvector of $\bm{\Omega}\bm{\Omega}^{\top}$ .
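The computation behind this lemma can be sketched in the notation of §4.2, using $\bm{\Omega}=\bm{\Phi}\bm{\Psi}^{\top}$ , $\bm{K}_{\phi}=\bm{\Phi}^{\top}\bm{\Phi}$ and $\bm{K}_{\psi}=\bm{\Psi}^{\top}\bm{\Psi}$ . The sketch assumes a column-vector convention in which $\mathbf{w}$ satisfies $\bm{K}_{\psi}\bm{K}_{\phi}\mathbf{w}=\lambda\mathbf{w}$ ; because both kernel matrices are symmetric, this is the same as $\mathbf{w}^{\top}$ being a left eigenvector of $\bm{K}_{\phi}\bm{K}_{\psi}$ .

$$\bm{\Omega}\bm{\Omega}^{\top}\bm{\Phi}\mathbf{w}=\bm{\Phi}\bm{\Psi}^{\top}\bm{\Psi}\bm{\Phi}^{\top}\bm{\Phi}\mathbf{w}=\bm{\Phi}\bm{K}_{\psi}\bm{K}_{\phi}\mathbf{w}=\lambda\bm{\Phi}\mathbf{w}.$$

Under this assumption, whenever $\bm{\Phi}\mathbf{w}\neq\mathbf{0}$ it is an eigenvector of $\bm{\Omega}\bm{\Omega}^{\top}$ with the same eigenvalue $\lambda$ .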

Appendix B Nearest Neighbors Test for Word Embedding Debiasing

We give in Table 7 the three nearest neighbor words for ten random words from the data, before and after using SAL. The neighboring words are determined through cosine similarity of the corresponding embeddings with respect to the pivot word embedding. We observe little to no difference in these two lists (before and after the removal).