Efficient fair PCA for fair representation learning

Matthäus Kleindessner, Michele Donini, Chris Russell, Muhammad Bilal Zafar

INTRODUCTION

Over the last decade, fairness in machine learning (Barocas et al., 2018) has become an established field. Numerous definitions of fairness, and algorithms trying to satisfy these, have been proposed. In the context of classification, two of the most prominent fairness notions are demographic parity (DP; Kamiran and Calders, 2011) and equality of opportunity (EO; Hardt et al., 2016). DP requires a classifier’s prediction to be independent of a datapoint’s demographic attribute (such as a person’s gender or race), and EO requires the prediction to be independent of the attribute given that the datapoint’s ground-truth label is positive. Formally, in the case of binary classification,

where $\operatorname{Pr}$ is a probability distribution over random variables $Y,\hat{Y}\in\{0,1\}$ and $Z\in\mathcal{Z}$ , with $Y$ representing the ground-truth label, $\hat{Y}$ representing the classifier’s prediction and $Z$ representing the demographic attribute.

An appealing approach to satisfy DP or EO is fair representation learning (e.g., Zemel et al., 2013; see Section 4 for related work): let $X\in\mathcal{X}$ denote a random vector representing features based on which predictions are made. The idea of fair representation learning is to learn a fair feature representation $f:\mathcal{X}\rightarrow\mathcal{X}^{\prime}$ such that $f(X)$ is (approximately) independent of the demographic attribute $Z$ (conditioned on $Y=1$ if one aims to satisfy EO). Once a fair representation is found, any model trained on this representation will also be fair. Of course, the representation still needs to contain some information about $X$ in order to be useful.

Leaving fairness aside, one of the most prominent methods for representation learning (in its special form of dimensionality reduction) is principal component analysis (PCA; e.g., Shalev-Shwartz and Ben-David, 2014). PCA projects the data onto a linear subspace such that the approximation error is minimized. The key idea of our paper is to alter PCA such that it gives a fair representation. This idea is not new: Olfat and Aswani (2019) and Lee et al. (2022) already proposed formulations of fair PCA that aim for the same goal. We discuss the differences between our paper and these works in detail in Section 4. In short, the differences are twofold: (i) while the goal is the same, the derivations are different, and we consider our derivation to be simpler and more intuitive. (ii) the different derivations lead to different algorithms, with our main algorithm being very similar to standard PCA. While our formulation allows for an analytical solution by means of eigenvector computations, the methods by Olfat and Aswani and Lee et al. rely on semidefinite programming or manifold optimization. While our algorithm can be implemented in a few lines of code and runs very fast, with the same complexity as standard PCA, their algorithms rely on specialized libraries and suffer from a huge running time. We believe that because of these advantages our new derivation of fair PCA and our proposed approach add value to the existing literature.

In Section 2, we first review PCA and then derive our formulation of fair PCA. We discuss extensions and variants, including a kernelized version, in Section 3. We provide a detailed discussion of related work in Section 4 and present extensive experiments in Section 5. Some details and experiments are deferred to the appendix.

Notation

FAIR PCA FOR FAIR REPRESENTATION LEARNING

We first review PCA and then derive our formulation of fair PCA. Our formulation is a relaxation of a strong constraint imposed on the PCA objective. We provide a natural interpretation of the relaxation and show that it is equivalent to the original constraint under a particular data model.

Our formulation of fair PCA In fair PCA, we aim to remove demographic information when projecting the dataset onto the $k$ -dimensional linear subspace. We look for a best-approximating projection such that the projected data does not contain demographic information anymore: let $z_{i}\in\{0,1\}$ denote the demographic attribute of datapoint $\mathbf{x}_{i}$ , which encodes membership in one of two demographic groups (we discuss how to extend our approach to multiple groups in Section 3.4 and to multiple attributes in Section 3.5). Ideally, we would like that no classifier can predict $z_{i}$ when getting to see only the projection of $\mathbf{x}_{i}$ onto the $k$ -dimensional subspace, that is we would want to solve

The derivation above yields a natural interpretation of the relaxed Problem (4). It is easy to see that the condition $\mathbf{U}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}\mathbf{z}=\mathbf{0}$ is equivalent to

Hence, fair PCA finds a best-approximating projection such that the projected data’s group-conditional means coincide. This interpretation implies that for a special data-generating model the relaxed Problem (4) solved by fair PCA coincides with Problem (3), which we originally wanted to solve.

If datapoints are sampled from a mixture of two Gaussians with identical covariance matrices and the two Gaussians corresponding to demographic groups, then, in the limit of $n\rightarrow\infty$ , (3) and (4) are equivalent.

EXTENSIONS & VARIANTS

We discuss several extensions and variants of our formulation of fair PCA and our proposed algorithm from Section 2.

2 Adaptation to Equal Opportunity

Our formulation of fair PCA in Section 2 aimed at making the data representation independent of the demographic attribute, thus aiming for demographic parity fairness of arbitrary downstream classifiers. If we instead aim for equality of opportunity fairness, of downstream classifiers trained to solve a specific task (coming with ground-truth labels $y_{i}$ ), we apply the procedure only to datapoints $\mathbf{x}_{i}$ with $y_{i}=1$ .

3 Kernelizing Fair PCA

A solution $\mathbf{\Lambda}$ is obtained by filling the columns of $\mathbf{\Lambda}$ with the generalized eigenvectors, corresponding to the largest $k$ eigenvalues, that solve $\mathbf{R}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{K}\mathbf{K}\mathbf{R}\mathbf{\Lambda}=\mathbf{R}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{K}\mathbf{R}\mathbf{\Lambda}\mathbf{W}$ , where $\mathbf{W}$ is a diagonal matrix containing the eigenvalues (Ghojogh et al., 2019). When projecting datapoints onto the linear subspace, we can write $\mathbf{U}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}=\mathbf{B}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}=\mathbf{\Lambda}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{R}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{K}$ , and hence we have kernelized fair PCA. We provide the pseudo code of kernelized fair PCA in Appendix B.1. Its running time is $\mathcal{O}(n^{3})$ when being given $\mathbf{K}$ as input, which is the same as the running time of standard kernel PCA.

4 Multiple Groups

5 Multiple Demographic Attributes

6 Higher-Order Variant: Equalizing Group-Conditional Covariance Matrices

Fair PCA finds a best-approximating projection that equalizes the group-conditional means. It is natural to ask whether one can additionally equalize group-conditional covariances in order to further exacerbate discriminability of the projected group-conditional distributions. For one demographic attribute with two demographic groups, this additional constraint would result in the following problem:

where $\mathbf{z}$ is the vector encoding group-membership as in Section 2 and $\mathbf{\Sigma}_{0}$ and $\mathbf{\Sigma}_{1}$ are the two group-conditional covariance matrices. Unfortunately, depending on $\mathbf{\Sigma}_{0}$ and $\mathbf{\Sigma}_{1}$ , this problem may not have a solution (e.g., when the feature variances for one group are much bigger than for the other group and hence $\mathbf{\Sigma}_{0}-\mathbf{\Sigma}_{1}$ is positive or negative definite). However, for small $k$ (or large $d$ ) we can apply a simple strategy to solve (8) approximately. After writing $\mathbf{U}=\mathbf{R}\mathbf{\Lambda}$ as in Section 2, the problem becomes

RELATED WORK

Most works study the problem of fair classification (e.g., Zafar et al., 2019), but fairness has also been studied for unsupervised learning tasks (e.g., Chierichetti et al., 2017). Two of the most prominent definitions of fairness in classification are demographic parity (Kamiran and Calders, 2011) and equal opportunity (Hardt et al., 2016) as introduced in Section 1. Methods for fair classification are commonly categorized into pre-processing, in-processing, and post-processing methods, depending on at which stage of the training pipeline they are applied (d’Alessandro et al., 2017). In the following we discuss the works most closely related to our paper, all of which can generally be considered as pre-processing methods.

Fair representation learning

Zemel et al. (2013) initiated the study of fair representation learning, where the goal is to learn an intermediate data representation that obfuscates demographic information while encoding other (non-demographic) information as well as possible. Once such a representation is found, any ML model trained on it should not be able to discriminate based on demographic information and hence be demographic parity fair. The approach of Zemel et al. learns prototypes and a probabilistic mapping of datapoints to these prototypes. Since then, numerous methods for fair representation learning have been proposed (e.g. Louizos et al., 2016; Moyer et al., 2018; Sarhan et al., 2020; Balunovic et al., 2022; Oh et al., 2022), many of them formulating the problem as an adversarial game (e.g. Edwards and Storkey, 2016; Beutel et al., 2017; Xie et al., 2017; Jia et al., 2018; Madras et al., 2018; Raff and Sylvester, 2018; Adel et al., 2019; Alvi et al., 2019; Feng et al., 2019; Song et al., 2019) and some of them adapting their approach to aim for downstream classifiers to be equal opportunity fair (e.g. Madras et al., 2018; Song et al., 2019). In contrast to our proposed approach, none of these techniques allows for an analytical solution and all of them require numerical optimization, which has often been found hard to perform, in particular for the adversarial approaches (cf. Feng et al., 2019, Sec. 5, or Oh et al., 2022, Sec. 2.2).

Fair PCA for fair representation learning and other methods for linear guarding

The methods discussed next are all methods for fair representation learning that bear some resemblance to our proposed approach. Most closely related to our work are the papers by Olfat and Aswani (2019), Lee et al. (2022), and Shao et al. (2022).

Olfat and Aswani (2019) introduced a notion of fair PCA with the same goal that we are aiming for in our formulation, that is finding a best-approximating projection such that no linear classifier can predict demographic information from the projected data. They use Pinsker’s inequality and an approximation of the group-conditional distributions by two Gaussians to obtain an upper bound on the best linear classifier’s accuracy. The upper bound is minimized when the projected data’s group-conditional means and covariance matrices coincide. Olfat and Aswani then formulate a semidefinite program (SDP) to minimize the projection’s reconstruction error while satisfying upper bounds on the differences in the projected data’s group-conditional means and covariance matrices. This SDP approach has been criticized by Lee et al. (2022, Section 5.1) for its high runtime and its relaxation of the rank constraint to a trace constraint, “yielding sub-optimal outputs in presence of (fairness) constraints, even to substantial order in some cases”. In Section 5 we rerun the experiments of Lee et al. and also observe that the running time of the method by Olfat and Aswani is prohibitively high. Furthermore, we consider our derivation of fair PCA to be more intuitive since we do not rely on upper bounds or a Gaussian approximation.

Arguing that matching only group-conditional means and covariance matrices of the projected data might be too weak of a constraint, Lee et al. (2022) define a version of fair PCA by requiring that the projected data’s group-conditional distributions coincide. They use the maximum mean discrepancy to measure the deviation of the group-conditional distributions and a penalty method for manifold optimization to solve the resulting optimization problem. While running much faster than the method by Olfat and Aswani (2019), we find the running time of the method by Lee et al. to be significantly higher than the running time of our proposed algorithms; still, in terms of the quality of the data representation our algorithms can compete. Lee et al. present their algorithm only for two demographic groups and it is unclear whether it can be extended to more than two groups.

Concurrently with the writing of our paper, Shao et al. (2022) proposed the spectral attribute removal (SAL) algorithm to remove demographic information via a data projection. Their algorithm is based on the observation that a singular value decomposition of the cross-covariance matrix between feature vector $\mathbf{x}$ and demographic attribute $z$ yields projections that maximize the covariance of $\mathbf{x}$ and $z$ . Although derived differently, it turns out that the SAL algorithm and our fair PCA method are closely related: SAL projects the data onto the subspace spanned by the columns of the matrix $\mathbf{R}$ in our Algorithm 1. Hence, for $k=d-1$ the two algorithms project the data onto the same subspace. However, SAL does not allow to choose an embedding dimension smaller than $d-1$ . While Shao et al. also provide a kernelized variant of their algorithm, they do not provide the interpretation of matching group-conditional means or any extension to also match group-conditional covariances.

There are also papers that propose methods for linear guarding, that is finding a data representation from which no linear classifier can predict demographic information, that are not related to PCA: Ravfogel et al. (2020) iteratively train a linear classifier to predict the demographic attribute and then project the data onto the classifier’s nullspace; Haghighatkhah et al. (2021) describe a procedure to find a projection such that the projected data is not linearly separable w.r.t. the demographic attribute anymore, but still linearly separable w.r.t. some other binary attributes; Ravfogel et al. (2022) formulate the problem of linear guarding as a linear minimax game, where a projection matrix competes against the parameter vector of a linear model. In case of linear regression this game can be solved analytically, while for logistic regression and other linear models a relaxation of the game is solved via alternate minimization and maximization.

Fair PCA for balancing reconstruction error

A very different notion of fair PCA was introduced by Samadi et al. (2018), which views PCA as a standalone problem and wants to balance the excess reconstruction error across different demographic groups. This line of work, which is incomparable to our notion of fair PCA and the notions discussed above, has been extended by Tantipongpipat et al. (2019), Pelegrina et al. (2021) and Kamani et al. (2022).

Information bottleneck method

EXPERIMENTS

In this section, we present a number of experiments.Code available on https://github.com/amazon-science/fair-pca. We first rerun and extend the experiments performed by Lee et al. (2022) in order to compare our algorithms to the existing methods for fair PCA by Olfat and Aswani (2019) and Lee et al. (2022) and to the methods for linear guarding by Ravfogel et al. (2020) and Ravfogel et al. (2022). We also apply our version of fair PCA to the CelebA dataset of facial images to illustrate its applicability to large high-dimensional datasets. We then demonstrate the usefulness of our proposed algorithms as means of bias mitigation and compare their performance to the reductions approach of Agarwal et al. (2018), which is the state-of-the-art in-processing method implemented in Fairlearn (https://fairlearn.org/). Some implementation details and details about datasets are provided in Appendix C.1 and C.2.

We used the code provided by Lee et al. (2022) to rerun their experiments and perform a comparison with our proposed algorithms. We additionally compared to the methods by Ravfogel et al. (2020) and Ravfogel et al. (2022) using the code provided by those authors, where we set all parameters to their default values except for the maximum number of outer iterations for the second method, which we decreased from 75000 to 10000 to get a somewhat acceptable running time. We extended the experimental evaluation of Lee et al. by reporting additional metrics, but other than that did not modify their code or experimental setting in any way.

In their first experiment (Section 8.2 in their paper), Lee et al. (2022) applied standard PCA, their method (referred to as MbF-PCA) and the method by Olfat and Aswani (2019) (FPCA) to synthetic data sampled from a mixture of two Gaussians of varying dimension $d$ . The two Gaussians correspond to two demographic groups. The target dimension $k$ is held constant at $5$ . We reran the code of Lee et al. and additionally applied the methods of Ravfogel et al. (2020) (INLP) and Ravfogel et al. (2022) (RLACE) and our algorithms for fair PCA, fair kernel PCA with a Gaussian kernel, and the variant of fair PCA that additionally aims to equalize group-conditional covariance matrices (referred to as Fair PCA-S; we set $l=\lfloor 0.85d\rfloor$ —cf. Section 3.6). Lee et al. reported the fraction of explained variance of the projected data (i.e., $\operatorname{trace}(\mathbf{U}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}\mathbf{X}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{U})/\operatorname{trace}(\mathbf{X}\mathbf{X}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}})$ for the projection defined by $\mathbf{U}$ —higher means better approximation of the original data), the squared maximum mean discrepancy (MMD2) based on a Gaussian kernel between the two groups after the projection (lower means the representation is more fair) and the running time of the methods. We additionally report the error of a linear classifier trained to predict the demographic information from the projected data (higher means the representation is more fair; we refer to this metric as linear inseparability). Figure 2 and Figure 4 show the results, where the boxplots are obtained from considering ten random splits into training and test data and the runtime curves show an average over the ten splits. While standard PCA does best in approximating the original data (variance about 50%), it does not yield a fair representation with high values for MMD2 (more than 0.5) and low values for linear inseparability ( $\approx 0$ ). Our algorithm for fair PCA does worse in approximating the data (variance above 30%), but drastically reduces the unfairness of standard PCA (MMD2 smaller than 0.07). The other methods yield even lower values for MMD2, but this comes at the cost of a worse approximation of the data. Our variant Fair PCA-S performs similarly to FPCA by Olfat and Aswani (2019). All methods except standard PCA perform similarly in terms of linear inseparability. The biggest difference is in the methods’ running times: while FPCA runs for more than 2000 seconds, RLACE for about 20 seconds, MbF-PCA for about 1.3 seconds, and INLP for about 0.8 seconds when the data dimension $d$ is as small as 100, none of our algorithms runs for more than 0.5 seconds even when $d=800$ . In the latter case, RLACE runs for about 260 seconds, MbF-PCA for about 43 seconds, and INLP for about 1270 seconds. In Appendix C.3, we study the running time of the methods as a function of the target dimension $k$ and observe that the running time of MbF-PCA drastically increases with $k$ (about 290 seconds when $d=100$ and $k=50$ ). This shows that none of the existing methods can be applied when both $d$ and $k$ are large (such as in the experiment on the CelebA dataset below) and provides strong evidence for the benefit of our proposed methods.

One of the reviewers asked for a comparison with the method of Samadi et al. (2018), which aims to balance the excess reconstruction error of PCA across different demographic groups (cf. Section 4). We emphasize once more that this fairness notion is incomparable to ours (also see the discussion in Appendix A of Lee et al., 2022). Still, we provide the results for the method of Samadi et al. (2018) on the three real-world datasets in Appendix C.5. As expected, their method yields much higher DP violations than our methods or the other competitors.

Applying fair PCA to CelebA similarly to Ravfogel et al. (2022)

Similarly to Ravfogel et al. (2022), we applied our fair PCA method to the CelebA dataset (Liu et al., 2015) to erase concepts such as “glasses” or “mustache” from facial images. The CelebA dataset comprises 202599 pictures of faces of celebrities. We rescaled all images to $80\times 80$ grey-scale images and applied our Algorithm 1 to the flattened raw-pixel vectors, using one of the bald, beard, eyeglasses, hat, mustache, or smiling annotations as demographic attributes. Figure 5 shows some results for eyeglasses; we provide more results, also for the other attributes, and a discussion in Appendix C.6. Due to their high running time, we were not able to apply the methods by Olfat and Aswani (2019), Lee et al. (2022), Ravfogel et al. (2020), or Ravfogel et al. (2022) to this large and high-dimensional dataset. However, results for the method of Ravfogel et al. (2022) for a smaller resolution can be found in their paper.

2 Comparison with Agarwal et al. (2018)

We compare our proposed algorithms as means of bias mitigation to the state-of-the-art in-processing method of Agarwal et al. (2018). While our algorithms learn a fair representation and perform standard training (without fairness considerations) on top of that representation to learn a fair classifier, the approach of Agarwal et al. modifies the training procedure. Concretely, their approach solves a sequence of cost-sensitive classification problems. We apply the various methods to the Adult Income and the Bank Marketing dataset (Moro et al., 2014), which are both available on the UCI repository (Dua and Graff, 2017). The goal for each method is to produce good accuracy vs. fairness trade-off curves—every point on a trade-off curve corresponds to a specific classifier. Note that the approach of Agarwal et al. yields randomized classifiers, which is problematic if a classifier strongly affects humans’ lives (Cotter et al., 2019). For our algorithms we deploy the strategy of Section 3.1 to produce the trade-off curves. Figure 4 shows the results. All results are average results obtained from considering ten random draws of train and test data (see Appendix C.2 for details). The plots show on the y-axis the accuracy of a classifier and on the x-axis its fairness violation, which is $\Delta_{DP}=|\operatorname{Pr}(\hat{Y}=1|Z=0)-\operatorname{Pr}(\hat{Y}=1|Z=1)|$ as in Section 5.1 when aiming for DP and $\Delta_{EO}:=|\operatorname{Pr}(\hat{Y}=1|Z=0,Y=1)-\operatorname{Pr}(\hat{Y}=1|Z=1,Y=1)|$ when aiming for EO. In the first and the second plot of each row we learn a logistic regression classifier, aiming to satisfy DP or EO. We see that fair PCA produces similar curves as the method by Agarwal et al. (note that in the bottom left plot $\Delta_{DP}$ is very small for all classifiers). However, fair PCA runs much faster: including the classifier training, fair PCA runs for 0.04 seconds on average while the method by Agarwal et al. runs for 4.6 seconds (see Appendix C.7 for details). These plots do not show results for Fair PCA-S and fair kernel PCA since they cannot compete (we provide those results in Appendix C.7). Fair PCA-S and fair kernel PCA can compete when training a kernel SVM classifier though (third and fourth plot of each row).

DISCUSSION

We provided a new derivation of fair PCA, aiming for a fair representation that does not contain demographic information. Our derivation is simple and allows for efficient algorithms based on eigenvector computations similar to standard PCA. Compared to existing methods for fair PCA, our proposed algorithms run much faster while achieving similar results. In a comparison with a state-of-the-art in-processing bias mitigation method we saw that our algorithms provide a significantly faster alternative to train fair classifiers.

APPENDIX

Appendix A ADDENDUM TO SECTION 2

The same example shows that $\mathcal{U}$ can be empty if we require $h(\mathbf{U}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{x}_{i})$ and $z_{i}$ to be uncorrelated (rather than independent) for all functions $h$ .

This shows that we have to relax Problem (3) in two ways in order to arrive at a well defined problem.

Appendix B ADDENDUM TO SECTION 3

In fair PCA for multiple groups we want to solve

If $m=2$ , then the first and the second column of $\mathbf{Z}$ coincide up to multiplication by $-1$ and the nullspace of $\mathbf{Z}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}$ is the same as if we removed one of the two columns from $\mathbf{Z}$ . This shows that for two groups, fair PCA as presented here is equivalent to fair PCA as presented in Section 2.

Finally, the interpretation of fair PCA provided in Section 2 also applies to the case of multiple groups: $\mathbf{Z}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{X}^{\mathchoice{\raisebox{0.0pt}{$ \displaystyle\intercal $}}{\raisebox{0.0pt}{$ \textstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptstyle\intercal $}}{\raisebox{0.0pt}{$ \scriptscriptstyle\intercal $}}}\mathbf{U}=\mathbf{0}$ is equivalent to

which in turn is equivalent to the projected data’s group-conditional means to coincide for all groups. Hence, an analogous version of Proposition 1 holds true for multiple groups.

The pseudo code of fair PCA for multiple demographic groups is provided in Algorithm 2. The pseudo code of fair kernel PCA for multiple demographic groups is provided in Algorithm 3.

Appendix C ADDENDUM TO SECTION 5

Solving generalized eigenvalue problem for fair kernel PCA: Fair kernel PCA requires to solve a generalized eigenvalue problem of the form $\mathbf{A}\mathbf{x}=\lambda\mathbf{B}\mathbf{x}$ for square matrices $\mathbf{A}$ and $\mathbf{B}$ that are given as input. In fair kernel PCA, $\mathbf{B}$ is guaranteed to be symmetric positive semi-definite, but not necessarily positive definite (and so is $\mathbf{A}$ ). We use the function eigsh from SciPy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.eigsh.html) to solve the generalized eigenvalue problem. While eigsh allows for a positive semi-definite $\mathbf{B}$ , it requires a parameter sigma to use the shift-invert mode in this case (this is in contrast to the function eig in Matlab, which does not require such a parameter and automatically chooses the best algorithm to solve the generalized eigenvalue problem in case of a singular $\mathbf{B}$ ; cf. https://de.mathworks.com/help/matlab/ref/eig.html). In order to avoid having to look for an appropriate value of sigma, when eigsh would require its specification, we simply add $10^{-5}\cdot\mathbf{I}$ to $\mathbf{B}$ , where $\mathbf{I}$ is the identity matrix, to guarantee that $\mathbf{B}$ is positive definite. This is a common practice in the context of kernel methods to avoid numerical instabilities (see, e.g., Williams and Seeger, 2000, Section 1.2).

Bandwith for fair kernel PCA: When running our proposed fair kernel PCA algorithm with a Gaussian kernel, we set the parameter $\gamma$ of the kernel function (cf. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.rbf_kernel.html#sklearn.metrics.pairwise.rbf_kernel) to $1/(d\cdot\operatorname{Var}(\text{training data}))$ , where $d$ is the dimension of the data (i.e., number of features) and $\operatorname{Var}(\text{training data})$ the variance of the flattened training data array. This value of $\gamma$ is the default value in Scikit-learn’s kernel SVM implementation (cf. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

Details for the experiments of Section 5.1

We used the experimental setup and code of Lee et al. (2022). Hence, most implementation details can be found in their paper or code repository. In addition, we provide the following details:

Training additional classifiers for evaluating representations: In addition to the metrics reported by Lee et al. (2022), we reported the accuracy and $\Delta_{DP}$ of a linear support vector machine (SVM) and a multilayer perceptron (MLP). We trained the linear SVM in Matlab using the function fitcsvm (https://de.mathworks.com/help/stats/fitcsvm.html) and the MLP using Scikit-learn’s MLPClassifier class (https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) with all parameters set to the default values (except for hidden_layer_sizes, max_iter, and random_state for the MLP).

Details for the experiments of Section 5.2

Data normalization: We normalized the data to have zero mean and unit variance on the training data.

Target dimension for our methods: As target dimension $k$ we chose $k=d-1$ , where $d$ is the data dimension, for fair PCA and fair kernel PCA, $k=\lfloor d/4\rfloor$ for Fair PCA-S (0.5), and $k=\lfloor d/2\rfloor$ for Fair PCA-S (0.85).

Controlling accuracy vs. fairness trade-off: For our methods, we deployed the strategy described in Section 3.1 to trade off accuracy vs. fairness. For the reductions approach of Agarwal et al. (2018), we controlled the trade-off by varying the parameter difference_bound in the classes DemographicParity or TruePositiveRateParity, which implement the fairness constraints. For all methods, we used 11 parameter values for generating the trade-off curves. For our methods, we set the fairness parameter $\lambda$ of Section 3.1 to $(i/10)^{3}$ , $i=0,1,\ldots,10$ . For the approach of Agarwal et al. we set difference_bound to 0.001, 0.005, 0.01, 0.015, 0.02, 0.03, 0.05, 0.07, 0.1, 0.15, 0.2.

Regularization parameters: We trained the logistic regression classifier using Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with regularization parameter $C=1/(2\cdot\text{size of training data}\cdot 0.01)$ and the kernel SVM classifier using Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) with regularization parameter $C=1/(2\cdot\text{size of training data}\cdot 0.00005)$ . By default, both classifiers are trained with $l_{2}$ -regularization.

C.2 Details about Datasets

The Adult Income dataset is available on the UCI repository (Dua and Graff, 2017). Each record comprises 14 features (before one-hot encoding categorical ones) for an individual, such as their education or marital status, and the task is to predict whether an individual makes more than $50k per year or not (distribution: 23.9% yes - 76.1% no). In Section 5.1, we used the dataset as provided by Lee et al. (2022). They removed the features “fnlwgt” and “race”, and they subsampled the dataset to comprise 2261 records (cf. Appendix I.3 in their paper). They used the binary feature “sex” as demographic attribute (distribution: 66.8% male - 33.2% female). In our comparison with the method of Agarwal et al. (2018) presented in Section 5.2, we also used “sex” as demographic attribute; however, we did not remove any features and we randomly subsampled the dataset to comprise 5000 records for training and 5000 different records for evaluation (i.e., computing a classifier’s accuracy and fairness violation). In the runtime comparison of Appendix C.7 we used between 1000 and 40000 randomly sampled records for training.

Bank Marketing dataset (Moro et al., 2014; Dua and Graff, 2017)

The Bank Marketing dataset is available on the UCI repository (Dua and Graff, 2017). There are four versions available. We worked with the file bank-additional-full.csv. Each record comprises 20 features (before one-hot encoding categorical ones) for an individual, and the task is to predict whether an individual subscribes a term deposit or not (distribution: 11.3% yes - 88.7% no). We used a person’s binarized age (older than 40 vs. not older than 40) as demographic attribute (distribution: 42.3% older than 40 - 57.7% not older than 40), and we randomly subsampled the dataset to comprise 5000 records for training and 5000 different records for evaluation.

CelebA dataset (Liu et al., 2015)

The CelebA dataset comprises 202599 pictures of faces of celebrities together with 40 binary attribute annotations for each picture. The dataset comes in two versions: one that provides in-the-wild images, which may not only show a person’s face, but also their upper body, and one that provides aligned-and-cropped images, which only show a person’s face. For our experiment, we used the latter one. We used one of the bald, beard, eyeglasses, hat, mustache, or smiling annotations as demographic attributes. The distributions of these attributes can be seen in Figure 6.

COMPAS dataset (Angwin et al., 2016)

The COMPAS dataset is available on https://github.com/propublica/compas-analysis. We used the dataset as provided by Lee et al. (2022). They subsampled the dataset to comprise 2468 datapoints, removed the features “sex” and “c_charge_desc”, and used the feature “Race” for defining the demographic attribute (cf. Appendix I.1 in their paper).

German Credit (Dua and Graff, 2017)

The German Credit dataset is available on the UCI repository (Dua and Graff, 2017). It comprises 1000 datapoints. We used the dataset as provided by Lee et al. (2022). They removed the features “sex” and “personal_status” and used the feature “Age” for defining the demographic attribute (cf. Appendix I.2 in their paper).

C.3 Another Runtime Comparison

Figure 7 provides a comparison of the running times of the various methods (except for FPCA, which we already have seen to run extremely slow in Figure 4) in a related, but different setting as in the experiments of Section 5.1 on the synthetic data. The data is generated in the same way as in Section 5.1, but now we vary the target dimension $k$ and hold the data dimension $d$ constant at 100. We see that while the running time of our proposed methods only moderately increases with $k$ , the running time of MbF-PCA drastically increases with $k$ . Note that the running time of INLP even decreases with $k$ , which is by the design of the method (cf. Section 4).

C.4 Tables for German Credit and COMPAS

Table 2 and Table 3 provide the results of the experiments of Section 5.1 on the real data for the German Credit dataset and the COMPAS dataset, respectively. Note that the dimension of the COMPAS dataset is rather small—in particular, for $k=10$ , we do not expect Fair PCA-S (0.5) or Fair PCA-S (0.85) to behave any differently than fair PCA since $l=\max\{k,\lfloor f\cdot d\rfloor\}=k=d-1$ for both $f=0.5$ and $f=0.85$ .

C.5 Comparison with Samadi et al. (2018)

We applied the fair PCA method of Samadi et al. (2018) to the three real-world datasets considered in Section 5.1. As discussed in Section 4 and Section 5.1, the fairness notion underlying the method of Samadi et al. is incomparable to our notion of fair PCA. Samadi et al. provide theoretical guarantees for an algorithm that relies on solving a semidefinite program (SDP), but then propose to use a multiplicative weight update method for solving the SDP approximately in order to speed up computation. We observed that this can result in embedding dimensions that are much larger than the desired target dimension. We used the code provided by Samadi et al. without modifications; in particular, we used the same parameters for the multiplicative weight update algorithm as they used in their experiment on the LFW dataset.

Table 4 provides the results. We see that the downstream classifiers trained on the fair PCA representation of Samadi et al. have roughly similar values of accuracy and DP violation as standard PCA. Clearly, the DP violations are much higher than for our methods or the other competitors. Since the dimension of the fair PCA representation of Samadi et al. is not guaranteed to equal the desired target dimension $k$ , we do not report the metrics %Var and MMD2 in Table 4.

C.6 Fair PCA Applied to the CelebA Dataset

Figures 10 to 15 show examples of original CelebA images (top row of each figure) together with the results of applying fair PCA (middle and bottom row) for the various demographic attributes. We see that fair PCA adds something looking like glasses / a mustache / a beard to the faces, making it hard to tell whether an original face features those, and successfully obfuscates the demographic information for these attributes. Still the projected faces resemble the original ones to a good extent. For the attribute “smiling”, fair PCA also succeeds in obfuscating the demographic information, but the whole faces become more perturbed and less similar to the original ones. For the attributes “bald” and “hat”, fair PCA appears to fail, and we can tell for all of the faces under consideration that they do not feature baldness / a hat. We suspect that the reason for this might be the high diversity of hats or non-bald faces (see Figure 16 for some example images).

C.7 Comparison with Agarwal et al. (2018)

Figure 8 shows the results of the comparison with the reductions approach of Agarwal et al. (2018) when training a logistic regression classifier for Fair PCA-S and fair kernel PCA (next to the results for fair PCA and the method of Agarwal et al., which we have already seen in Figure 4 in Section 5.2). We see that Fair PCA-S produces smooth trade-off curves and can achieve lower fairness violation than fair PCA or the method of Agarwal et al. in some cases. However, the representation learned by fair kernel PCA only allows for a constant logistic regression classifier (with zero fairness violation and an accuracy equaling the probability of the predominant label—cf. Appendix C.2).

The plots of Figure 9 show the running time of the various methods as a function of the number of training points on the Adult Income dataset and when training a logistic regression classifier. The curves show the average over the eleven values of the fairness parameter / the parameter difference_bound (cf. Appendix C.1) and over ten random draws of training data together with the standard deviation as error bars. Note that for our methods the running time includes the time it takes to train the classifier on a representation produced by our method. While none of our methods ever runs for more than 0.5 seconds, the method of Agarwal et al., on average, runs for more than 32 seconds when training with 40000 datapoints and aiming for DP. The plots do not show curves for fair kernel PCA, which we cannot simply apply to this large number of training points due to its cubic running time in the number of datapoints. We leave it as an interesting question for future work to develop scalable approximation techniques for fair kernel PCA similarly to those that have been developed for standard kernel PCA or other kernel methods (e.g., Williams and Seeger, 2000; Kim et al., 2005; Chin and Suter, 2006).