Certifying and removing disparate impact

Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian

Introduction

In Griggs v. Duke Power Co. , the US Supreme Court ruled a business hiring decision illegal if it resulted in disparate impact by race even if the decision was not explicitly determined based on race. The Duke Power Co. was forced to stop using intelligence test scores and high school diplomas, qualifications largely correlated with race, to make hiring decisions. The Griggs decision gave birth to the legal doctrine of disparate impact, which today is the predominant legal theory used to determine unintended discrimination in the U.S. Note that disparate impact is different from disparate treatment, which refers to intended or direct discrimination. Ricci v. DeStefano examined the relationship between the two notions, and disparate impact remains a topic of legal interest.

Today, algorithms are being used to make decisions both large and small in almost all aspects of our lives, whether they involve mundane tasks like recommendations for buying goods, predictions of credit rating prior to approving a housing loan, or even life-altering decisions like sentencing guidelines after conviction. How do we know if these algorithms are biased, involve illegal discrimination, or are unfair?

These concerns have generated calls, by governments and NGOs alike, for research into these issues . In this paper, we introduce and address two such problems with the goals of quantifying and then removing disparate impact.

While the Supreme Court has resisted a “rigid mathematical formula” defining disparate impact , we will adopt a generalization of the 80 percent rule advocated by the US Equal Employment Opportunity Commission (EEOC) . We note that disparate impact itself is not illegal; in hiring decisions, business necessity arguments can be made to excuse disparate impact.

Given data set $D=(X,Y,C)$ , with protected attribute $X$ (e.g., race, sex, religion, etc.), remaining attributes $Y$ , and binary class to be predicted $C$ (e.g., “will hire”), we will say that $D$ has disparate impact if

for positive outcome class $YES$ and majority protected attribute $1$ where $\Pr(C=c|X=x)$ denotes the conditional probability (evaluated over $D$ ) that the class outcome is $c\in C$ given protected attribute $x\in X$ .Note that under this definition disparate impact is determined based on the given data set and decision outcomes. Notably, it does not use a broader sample universe, and does not take into account statistical significance as has been advocated by some legal scholars .

The two problems we consider address identifying and removing disparate impact. The disparate impact certification problem is to guarantee that, given $D$ , any classification algorithm aiming to predict some $C^{\prime}$ (which is potentially different from the given $C$ ) from $Y$ would not have disparate impact. By certifying any outcomes $C^{\prime}$ , and not the process by which they were reached, we follow legal precedent in making no judgment on the algorithm itself, and additionally ensure that potentially sensitive algorithms remain proprietary. The disparate impact removal problem is to take some data set $D$ and return a data set $\bar{D}=(X,\bar{Y},C)$ that can be certified as not having disparate impact. The goal is to change only the remaining attributes $Y$ , leaving $C$ as in the original data set so that the ability to classify can be preserved as much as possible.

We first introduce these problems to the computer science community and develop its theoretical underpinnings. The study of the EEOC’s 80% rule as a specific class of loss function does not appear to have received much attention in the literature. We link this measure of disparate impact to the balanced error rate (ber). We show that any decision exhibiting disparate impact can be converted into one where the protected attribute leaks, i.e. can be predicted with low ber.

Second, this theoretical result gives us a procedure for certifying the impossibility of disparate impact on a data set. This procedure involves a particular regression algorithm which minimizes ber. We connect ber to disparate impact in a variety of settings (point and interval estimates, and distributions). We discuss these two contributions in Sections 3 and 4.

In Section 5, we show how to transform the input dataset so that predictability of the protected attribute is impossible. We show that this transformation still preserves much of the signal in the unprotected attributes and has nice properties in terms of closeness to the original data distribution.

Finally, we present a detailed empirical study in Section 6. We show that our algorithm certifying lack of disparate impact on a data set is effective, such that with the three classifiers we used certified data sets don’t show disparate impact. We demonstrate the fairness / utility tradeoff for our partial repair procedures. Comparing to related work, we find that for any desired fairness value we can achieve a higher accuracy than other fairness procedures. This is likely due to our emphasis on changing the data to achieve fairness, thus allowing any strong classifier to be used for prediction.

Our procedure for detecting disparate impact goes through an actual classification algorithm. As we show in our experiments, a better classifier provides a more sensitive detector. We believe this is notable. As algorithms get better at learning patterns, they become more able to introduce subtle biases into the decision-making process by finding subtle dependencies among features. But this very sophistication helps detect such biases as well via our procedure! Thus, data mining can be used to verify the fairness of such algorithms as well.

Related Work

There is, of course, a long history of legal work on disparate impact. There is also related work under the name statistical discrimination in Economics. We will not survey such work here. Instead, we direct the reader to the survey of Romei and Ruggieri and to a discussion of the issues specific to data mining and disparate impact . Here, we focus on data mining research relating to combating discrimination. This research can be broadly categorized in terms of methods that achieve fairness by modifying the classifiers and those that achieve fairness by modifying data.

Kamishima et al. develop a regularizer for classifiers to penalize prejudicial outcomes and show that this can reduce indirect prejudice (their name for implicit discrimination like disparate impact) while still allowing for accurate classification. They note that as prejudicial outcomes are decreased, the classification accuracy is also decreased. Our work falls into the category of algorithms that change the input data. Previous work has focused on changing the class values of the original data in such a way so that the total number of class changes is small , while we will keep the class values the same for training purposes and change the data itself. Calders et al. have also previously examined one method for changing the data in which different data items are given weights and the weights are adjusted to achieve fairness. In this category of work, as well, there is worry that the change to the data will decrease the classification accuracy, and Calders et al. have formalized this as a fairness/utility tradeoff . We additionally note that lower classification accuracy may actually be the desired result, if that classification accuracy was due to discriminatory decision making in the past.

An important related work is the approach of “fairness through awareness” of Dwork et al. and Zemel et al. . Dwork et al. focus on the problem of individual fairness; their approach posits the existence of a similarity measure between individual entities and seeks to find classifiers that ensure similar outcomes on individuals that are similar, via a Lipschitz condition. In the work of Zemel et al. , this idea of protecting individual fairness is combined with a statistical group-based fairness criterion that is similar to the approach we take in this work. A key contribution of their work is that they learn a modified representation of the data in which fairness is ensured while attempting to preserve fidelity with the original classification task. While this group fairness measure is similar to ours in spirit, it does not match the legal definition we base our work on. Another paper that also (implicitly) defines fairness on an individual basis is the work by Thanh et al. . Their proposed repair mechanism changes class attributes of the data (rather than the data itself).

Pedreschi, Ruggieri and Turini have examined the “80% rule” that we study in this paper as part of a larger class of measures based on a classifier’s confusion matrix.

Disparate Impact and Error Rates

We start by reinterpreting the “80% rule” in terms of more standard statistical measures of quality of a classifier. This presents notational challenges. The terminology of “right” and “wrong”, “positive” and “negative” that is used in classification is an awkward fit when dealing with majority and minority classes, and selection decisions. For notational convenience only, we will use the convention that the protected class $X$ takes on two values: $X=0$ for the “minority” class and $X=1$ for the “default” class. For example, in most gender-discrimination scenarios the value would be assigned to “female” and $1$ to “male”. We will denote a successful binary classification outcome $C$ (say, a hiring decision) by $C=\,$ yes and a failure by $C=\,$ no. Finally, we will map the majority class to “positive” examples and the minority class to “negative” examples with respect to the classification outcome, all the while reminding the reader that this is merely a convenience to do the mapping, and does not reflect any judgments about the classes. The advantage of this mapping is that it renders our results more intuitive: a classifier with high “error” will also be one that is least biased, because it is unable to distinguish the two classes.

Table 1 describes the confusion matrix for a classification with respect to the above attributes where each entry is the probability of that particular pair of outcomes for data sampled from the input distribution (we use the empirical distribution when referring to a specific data set).

The $80\%$ rule can then be quantified as:

Note that the traditional notion of “accuracy” includes terms in the numerator from both columns, and so cannot be directly compared to the $80\%$ rule. Still, other class-sensitive error metrics are known, and more directly relate to the $80\%$ rule:

The sensitivity of a test (informally, its true positive rate) is defined as the conditional probability of returning yes on “positive” examples (a.k.a. the majority class). In other words,

The specificity of a test (its true negative rate) is defined as the conditional probability of returning no on “negative” examples (a.k.a. the minority) class. I.e.,

The likelihood ratio positive, denoted by $\text{LR}_{+}$ , is given by

We can now restate the $80\%$ rule in terms of a data set.

It will be convenient to work with the reciprocal of $\text{LR}_{+}$ , which we denote by

This will allow us to discuss the value associated with disparate impact before the threshold is applied.

Multiple classes. Disparate impact is defined only for two classes. In general, one might imagine a multivalued class attribute (for example, like ethnicity). In this paper, we will assume that a multivalued class attribute has one value designated as the “default” or majority class, and will compare each of the other values pairwise to this default class. While this ignores zero-sum effects between the different class values, it reflects the current binary nature of legal thought on discrimination. A more general treatment of joint discrimination among multiple classes is beyond the scope of this work.

Computational Fairness

Our notion of computational fairness starts with two players, Alice and Bob. Alice runs an algorithm $\mathcal{A}$ that makes decisions based on some input. For example, Alice may be an employer using $\mathcal{A}$ to decide who to hire. Specifically, $\mathcal{A}$ takes a data set $D$ with protected attribute $X$ and unprotected attributes $Y$ and makes a (binary) decision $C$ . By law, Alice is not allowed to use $X$ in making decisions, and claims to use only $Y$ . It is Bob’s job to verify that on the data $D$ , Alice’s algorithm $\mathcal{A}$ is not liable for a claim of disparate impact.

Trust model. We assume that Bob does not have access to $\mathcal{A}$ . Further, we assume that Alice has good intentions: specifically, that Alice is not secretly using $X$ in $\mathcal{A}$ while lying about it. While assuming Alice is lying about the use of $X$ might be more plausible, it is much harder to detect. More importantly, from a functional perspective, it does not matter whether Alice uses $X$ explicitly or uses proxy attributes $Y$ that have the same effect: this is the core message from the Griggs case that introduced the doctrine of disparate impact. In other words, our certification process is indifferent to Alice’s intentions, but our repair process will assume good faith.

We summarize our main idea with the following intuition:

If Bob cannot predict $X$ given the other attributes of $D$ , then $\mathcal{A}$ is fair with respect to Bob on $D$ .

We now present a formal definition of predictability and link it to the legal notion of disparate impact. Recall that $D=(X,Y,C)$ where $X$ is the protected attribute, $Y$ is the remaining attributes, and $C$ is the class outcome to be predicted.

The basis for our formulation is a procedure that predicts $X$ from $Y$ . We would like a way to measure the quality of this predictor in a way that a) can be optimized using standard predictors in machine learning and b) can be related to $\text{LR}_{+}$ . The standard notions of accuracy of a classifier fail to do the second (as discussed earlier) and using $\text{LR}_{+}$ directly fails to satisfy the first constraint.

The error measure we seek turns out to be the balanced error rate ber.

Let $f:Y\to X$ be a predictor of $X$ from $Y$ . The balanced error rate ber of $f$ on distribution $\mathcal{D}$ over the pair $(X,Y)$ is defined as the (unweighted) average class-conditioned error of $f$ . In other words,

$X$ is said to be $\epsilon$ -predictable from $Y$ if there exists a function $f:Y\rightarrow X$ such that

This motivates our definition of $\epsilon$ -fairness, as a data set that is not predictable.

A data set $D=(X,Y,C)$ is said to be $\epsilon$ -fair if for any classification algorithm $f\colon Y\to X$

with (empirical) probabilities estimated from $D$ .

Recall the definition of disparate impact from Section 3. We will be interested in examining the potential disparate impact of a classifier $g\colon Y\rightarrow C$ and will consider the value $\mathsf{DI}(g)=1/\text{LR}_{+}(g(Y),X)$ as it relates to the threshold $\tau$ . Where $g$ is clear from context, we will refer to this as $\mathsf{DI}$ .

The justification of our definition of fairness comes from the following theorem:

A data set is $(1/2-\beta/8)$ -predictable if and only if it admits disparate impact, where $\beta$ is the fraction of elements in the minority class $(X=0)$ that are selected $(C=1)$ .

We will start with the direction showing that disparate impact implies predictability. Suppose that there exists some function $g:Y\rightarrow C$ such that $\text{LR}_{+}(g(y),c)\geq\frac{1}{\tau}$ . We will create a function $\psi:C\rightarrow X$ such that $\text{{ber}}(\psi(g(y)),x)<\epsilon$ for $(x,y)\in D$ . Thus the combined predictor $f=\psi\circ g$ satisfies the definition of predictability.

Consider the confusion matrix associated with $g$ , depicted in Table 2.

Set $\alpha\triangleq\frac{b}{b+d}$ and $\beta\triangleq\frac{c}{a+c}$ . Then we can write $\text{LR}_{+}(g(y),X)=\frac{1-\alpha}{\beta}$ and $\mathsf{DI}(g)=\frac{\beta}{1-\alpha}$ .

We define the purely biased mapping $\psi\colon C\to X$ as $\psi(\text{YES})=1$ and $\psi(\text{NO})=0$ . Finally, let $\phi\colon Y\to X=\psi\circ g$ . The confusion matrix for $\phi$ is depicted in Table 3. Note that the confusion matrix for $\phi$ is identical to the matrix for $g$ .

We can now express $\text{{ber}}(\phi)$ in terms of this matrix. Specifically, $\text{{ber}}(\phi)=\frac{\alpha+\beta}{2}$ .

Representations. We can now express contours of the $\mathsf{DI}$ and ber functions as curves in the unit square $^{2}$ . Reparametrizing $\pi_{1}=1-\alpha$ and $\pi_{0}=\beta$ , we can express the error measures as $\mathsf{DI}(g)=\frac{\pi_{0}}{\pi_{1}}$ and $\text{{ber}}(\phi)=\frac{1+\pi_{0}-\pi_{1}}{2}$

As a consequence, any classifier $g$ with $\mathsf{DI}(g)=\delta$ can be represented in the $^{2}$ unit square as the line $\pi_{1}=\pi_{0}/\delta$ . Any classifier $\phi$ with $\text{{ber}}(\phi)=\epsilon$ can be written as the function $\pi_{1}=\pi_{0}+1-2\epsilon$ .

Let us now fix the desired $\mathsf{DI}$ threshold $\tau$ , corresponding to the line $\pi_{1}=\pi_{0}/\tau$ . Notice that the region $\{(\pi_{0},\pi_{1})\mid\pi_{1}\geq\pi_{0}/\tau\}$ is the region where one would make a finding of disparate impact (for $\tau=0.8$ ).

Now given a classification that admits a finding of disparate impact, we can compute $\beta$ . Consider the point $(\beta,\beta/\tau)$ at which the line $\pi_{0}=\beta$ intersects the $\mathsf{DI}$ curve $\pi_{1}=\pi_{0}/\tau$ . This point lies on the ber contour $(1+\beta-\beta/\tau)/2=\epsilon$ , yielding $\epsilon=1/2-\beta(\frac{1}{\tau}-1)/2$ In particular, for the $\mathsf{DI}$ threshold of $\tau=0.8$ , the desired ber threshold is

and so disparate impact implies predictability.

With this infrastructure in place, the other direction of the proof is now easy. To show that predictability implies disparate impact, we will use the same idea of a purely biased classifier. Suppose there is a function $f\colon Y\to X$ such that $\text{{ber}}(f(y),x)\leq\epsilon$ . Let $\psi^{-1}\colon X\to C$ be the inverse purely biased mapping, i.e. $\psi^{-1}(1)=\text{YES}$ and $\psi^{-1}(0)=\text{NO}$ . Let $g\colon Y\to C=\psi^{-1}\circ f$ . Using the same representation as before, this gives us $\pi_{1}\geq 1+\pi_{0}-2\epsilon$ and therefore

Recalling that $\mathsf{DI}(g)=\frac{\pi_{0}}{\pi_{1}}$ and that $\pi_{0}=\beta$ yields $\mathsf{DI}(g)\leq 1-\frac{1-2\epsilon}{\beta+1-2\epsilon}=\tau$ . For $\tau=0.8$ , this again gives us a desired ber threshold of $\epsilon=\frac{1}{2}-\frac{\beta}{8}$ . ∎

Note that as $\epsilon$ approaches $1/2$ the bound tends towards the trivial (since any binary classifier has ber at most $1/2$ ). In other words, as $\beta$ tends to , the bound becomes vacuous.

This points to an interesting line of attack to evade a disparate impact finding. Note that $\beta$ is the (class conditioned) rate at which members of the protected class are selected. Consider now a scenario where a company is being investigated for discriminatory hiring practices. One way in which the company might defeat such a finding is by interviewing (but not hiring) a large proportion of applicants from the protected class. This effectively drives $\beta$ down, and the observation above says that in this setting their discriminatory practices will be harder to detect, because our result can not guarantee that a classifier will have error significantly less than $0.5$ .

Observe that in this analysis we use an extremely weak classifier to prove the existence of a relation between predictability and disparate impact. It is likely that using a better classifier (for example the Bayes optimal classifier or even a classifier that optimizes ber) might yield a stronger relationship between the two notions.

Another source of uncertainty is in the ber estimate itself. Suppose that our classifier yields an error that lies in a range $[\gamma,\gamma^{\prime}]$ . Again, because of monotonicity, we will obtain an interval of values $[\tau,\tau^{\prime}]$ for $\mathsf{DI}$ . Note that if (using a Bayesian approach) we are able to build a distribution over ber, this distribution will then transfer over to the $\mathsf{DI}$ estimate as well.

2 Certifying (lack of) DI with SVMs

The above argument gives us a way to determine whether a data set is potentially amenable to disparate impact (in other words, whether there is insufficient information to detect a protected attribute from the provided data).

Algorithm. We run a classifier that optimizes ber on the given data set, attempting to predict the protected attributes $X$ from the remaining attributes $Y$ . Suppose the error in this prediction is $\epsilon$ . Then using the estimate of $\beta$ from the data, we can substitute this into the equation above and obtain a threshold $\epsilon^{\prime}$ . If $\epsilon^{\prime}<\epsilon$ , then we can declare the data set free from disparate impact.

Assume that we have an optimal classifier with respect to ber. Then we know that all classifiers will incur a ber of at least $\epsilon$ . By Theorem 4.1, this implies that no classifier on $D$ will exhibit disparate impact, and so our certification is correct.

The only remaining question is what classifier is used by this algorithm. The usual way to incorporate class sensitivity into a classifier is to use different costs for misclassifying points in different classes. A number of class-sensitive cost measures fall into this framework, and there are algorithms for optimizing these measures (see for a review), as well as a general (but expensive) method due to Joachims that does a clever grid search over a standard SVM to optimize a large family of class-sensitive measures. Oddly, ber is not usually included among the measures studied.

Formally, as pointed out by Zhao et al, ber is not a cost-sensitive classification error measure because the weights assigned to class-specific misclassification depend on the relative class sizes (so they can be normalized). However, for any given data set we know the class sizes and can reweight accordingly. We adapt a standard hinge-loss SVM to incorporate class-sensitivity and optimize for (regularized) ber. This adaptation is standard, and yields a cost function that can be optimized using AdaBoost.

We illustrate this by showing how to adapt a standard hinge-loss SVM to instead optimize ber This approach in general is well known; we provide a detailed derivation here for clarity.. Consider the standard soft-margin (linear) SVM:

The constraints in (2) are equivalent to the following:

in which the right side is the hinge loss.

Minimizing the balanced error rate (BER) would result in the following optimization:

where $n^{+}=|\{j\mid y_{j}=1\}|$ and $n^{-}=|\{j\mid y_{j}=-1\}|$ (we’ll use $n_{j}$ to refer to the correct count for the corresponding $y_{j}$ ). This optimization is NP-hard for the same reasons as minimizing the 0-1 loss is, so we will relax it to hinge loss as usual:

where $\xi_{j}$ has the same constraints as in (2). (7) now has the same form as an SVM in AdaBoost, so we can use existing techniqes to solve this SVM. Note also that $\sum_{j}D_{j}=1$ , making the $D_{j}$ a distribution over $[1..n]$ – this means that we can use an AdaBoost formulation without needing to adjust constants.

Removing disparate impact

Once Bob’s certification procedure has made a determination of (potential) disparate impact on $D$ , Alice might request a repaired version $\bar{D}$ of $D$ , where any attributes in $D$ that could be used to predict $X$ have been changed so that $\bar{D}$ would be certified as $\epsilon$ -fair. We now describe how to construct such a set $\bar{D}=(X,\bar{Y},C)$ such that $\bar{D}$ does not have disparate impact in terms of protected attribute $X$ . While for notational simplicity we will assume that $X$ is used directly in what follows, in practice the attribute used to stratify the data for repair need not directly be the protected attribute or even a single protected attribute. In the case of the Texas Top 10% Rule that admits the top ten percent of every high school class in Texas to the University of Texas , the attribute used to stratify is the high school attended, which is an attribute that correlates with race. If repair of multiple protected attributes is desired, the joint distribution can be used to stratify the data. (We will look into the effects of this experimentally in Section 6.2.)

Of course, it is important to change the data in such a way that predicting the class is still possible. Specifically, our goal will be to preserve the relative per-attribute ordering as follows. Given protected attribute $X$ and a single numerical attribute $Y$ , let $Y_{x}=\Pr(Y|X=x)$ denote the marginal distribution on $Y$ conditioned on $X=x$ . Let $F_{x}:Y_{x}\to$ be the cumulative distribution function for values $y\in Y_{x}$ and let $F^{-1}_{x}:\to Y_{x}$ be the associated quantile function (i.e $F^{-1}_{x}(1/2)$ is the value of $y$ such that $\Pr(Y\geq y|X=x)=1/2$ ). We will say that $F_{x}$ ranks the values of $Y_{x}$ .

Let $\bar{Y}$ be the repaired version of $Y$ in $\bar{D}$ . We will say that $\bar{D}$ strongly preserves rank if for any $y\in Y_{x}$ and $x\in X$ , its “repaired” counterpart $\bar{y}\in\bar{Y}_{x}$ has $F_{x}(y)=F_{x}(\bar{y})$ . Strongly preserving rank in this way, despite changing the true values of $Y$ , appears to allow Alice’s algorithm to continue choosing stronger (higher ranked) applicants over weaker ones. We present experimental evidence for this in Section 6.

With this motivation, we now give a repair algorithm that strongly preserves rank and ensures that $\bar{D}=(X,\bar{Y},C)$ is fair (i.e., is $\epsilon$ -fair for $\epsilon=1/2$ ). In the discussion that follows, for the sake of clarity we will treat $Y$ as a single attribute over a totally-ordered domain. To handle multiple totally-ordered attributes $Y_{1},\ldots,Y_{k}$ we will repair each attribute individually.

We define a “median” distribution $A$ in terms of its quantile function $F^{-1}_{A}$ : $F_{A}^{-1}(u)=\text{median\ }_{x\in X}F_{x}^{-1}(u)$ . The choice of the term “median” is not accidental.

For any two distributions $P$ and $Q$ on the line, the earthmover distance (using the underlying Euclidean distance $d(x,y)=|x-y|$ as the metric) can be written as

Algorithm. Our repair algorithm creates $\bar{Y}$ , such that for all $y\in Y_{x}$ , the corresponding $\bar{y}=F_{A}^{-1}(F_{x}(y))$ . The resulting $\bar{D}=(X,\bar{Y},C)$ changes only $Y$ while the protected attribute and class remain the same as in the original data, thus preserving the ability to predict the class. See Figure 1 for an example.

Notes. We note that this definition is reminiscent of the method by which partial rankings are combined to form a total ranking. The rankings are “Kemeny”-ized by finding a ranking that minimizes the sum of distances to the original rankings. However, there is a crucial difference in our procedure. Rather than merely reorganizing the data into a total ranking, we are modifying the data to construct this consensus distribution.

$\bar{D}$ is fair and strongly preserves rank.

In order to show that $\bar{D}$ strongly preserves rank, recall that we would like to show that $F_{x}(y)=F_{x}(\bar{y})$ for all $x\in X$ , $\bar{y}\in\bar{Y}_{x}$ , and $y\in Y_{x}$ . Since, by definition of our algorithm, $\bar{y}=F_{A}^{-1}(F_{x}(y))$ , we know that $F_{x}(\bar{y})=F_{x}(F^{-1}_{A}(F_{x}(y)))\mbox{,}$ so we would like to show that $F_{x}(F_{A}^{-1}(z))=z$ for all $z\in$ and for all $x$ . Recall that $F_{A}^{-1}(z)=\text{median\,}_{x\in X}F_{x}^{-1}(z)$ .

Suppose the above claim is not true. Then there are two values $z_{1}<z_{2}$ and some value $x$ such that $F_{x}(F_{A}^{-1}(z_{1}))>F_{x}(F_{A}^{-1}(z_{2}))$ . That is, there is some $x$ and two elements $y_{1}=F_{A}^{-1}(z_{1})$ , $y_{2}=F_{A}^{-1}(z_{2})$ such that $y_{1}>y_{2}$ . Now we know that $y_{1}=\text{median\,}_{x\in X}F_{x}^{-1}(z_{1})$ . Therefore, if $y_{1}>y_{2}$ it must be that there are strictly less than $|X|/2$ elements of the set $\{F_{x}^{-1}(z_{1})|x\in X\}$ below $y_{2}$ . But by the assumption that $z_{1}<z_{2}$ , we know that each element of $\{F_{x}^{-1}(z_{1})|x\in X\}$ is above the corresponding element of $\{F_{x}^{-1}(z_{2})|x\in X\}$ and there are $|X|/2$ elements of this latter set below $y_{2}$ by definition. Hence we have a contradiction and so a flip cannot occur, which means that the claim is true.

Note that the resulting $\bar{Y}_{x}$ distributions are the same for all $x\in X$ , so there is no way for Bob to differentiate between the protected attributes. Hence the algorithm is $1$ -fair. ∎

This repair has the effect that if you consider the $\bar{Y}$ values at some rank $z$ , the probability of the occurrence of a data item with attribute $x\in X$ is the same as the probability of the occurrence of $x$ in the full population. This informal observation gives the intuitive backing for the lack of predictability of $X$ from $\bar{Y}$ and, hence, the lack of disparate impact in the repaired version of the data.

Since the repair process outlined above is likely to degrade Alice’s ability to classify accurately, she might want a partially repaired data set instead. This in effect creates a tradeoff between the ability to classify accurately and the fairness of the resulting data. This tradeoff can be achieved by simply moving each inverse quantile distribution only part way towards the median distribution. Let $\lambda\in$ be the amount of repair desired, where $\lambda=0$ yields the unmodified data set and $\lambda=1$ is the fully repaired version described above. Recall that $F_{x}:Y_{x}\rightarrow$ is the function giving the rank of $y$ . The repair algorithm for $\lambda=1$ creates $\bar{Y}$ such that $\bar{y}=F_{A}^{-1}(F_{x}(y))$ where $A$ is the median distribution.

In the partial repair setting we will be creating a different distribution $A_{x}$ for each protected value $x\in X$ and setting $\bar{y}=F_{A_{x}}^{-1}(F_{x}(y))$ . Consider the ordered set of all $y$ at rank $u$ in their respective conditional distributions i.e the set $U(u)=\{F_{x}^{-1}(u)|x\in X\}$ . We can associate with $U$ the cumulant function $UF(u,y)=|\{y^{\prime}\geq y|y\in U(u)\}|/|U(u)|$ and define the associated quantile function $UF^{-1}(u,\alpha)=y$ where $UF(u,y)=\alpha$ . We can restate the full repair algorithm in this formulation as follows: for any $(x,y)$ , $\bar{y}=UF^{-1}(F_{x}(y),1/2)$ .

We now describe two different approaches to performing a partial repair, each with their own advantages and disadvantages. Intuitively, these repair methods differ in which space they operate in: the combinatorial space of ranks or the geometric space of values.

The intuition behind this repair strategy is that each item, rather than being moved to the median of its associated distribution, is only moved part of the way there, with the amount moved being proportional (in rank) to its distance from the median.

Fix an $x$ and consider any pair $(x,y)$ . Let $r=F_{x}(y)$ be the rank of $y$ conditioned on $X=x$ . Suppose that in the set $U(r)$ (the collection of all $y^{\prime}\in Y$ with rank $r$ in their respective conditional distributions) the rank of $y$ is $\rho$ . Then we replace $y$ by $\bar{y}\in U(r)$ whose rank in $U(r)$ is $\rho^{\prime}=\lfloor(1-\lambda)\rho+\lambda/2\rfloor$ . Formally, $\bar{y}=UF^{-1}(r,\rho^{\prime})$ . We call the resulting data set $\bar{D}_{\lambda}$ .

While this repair is intuitive and easy to implement, it does not satisfy the property of strong rank preservation. In other words, it is possible that two pairs $(x,y)$ and $(x,y^{\prime})$ with $y>y^{\prime}$ to be repaired in a way that $\bar{y}<\bar{y^{\prime}}$ . While this could potentially affect the quality of the resulting data (we discuss this in Section 6.2), it does not affect the fairness properties of the repair. Indeed, we formulate the fairness properties of this repair as a formal conjecture.

$\bar{D}_{\lambda}$ is $f(\lambda)$ -fair for a monotone function $f$ .

1.2 A Geometric Repair

The algorithm above has an easy-to-describe operational form. It does not however admit a functional interpretation as an optimization of a certain distance function, like the full repair. For example, it is not true that for $\lambda=1/2$ the modified distributions $\bar{Y}$ are equidistant (under the earthmover distance) between the original unrepaired distributions and the full repair. The algorithm we propose now does have this property, as well as possessing a simple operational form. The intuition is that rather than doing a linear interpolation in rank space between the original item and the fully repaired value, it does a linear interpolation in the original data space.

Let $F_{A}$ be the cumulative distribution associated with $A$ , the result performing a full repair on the conditional cumulative distributions as described in Section 5. Given a conditional distribution $F_{x}(y)$ , its $\lambda$ -partial repair is given by

Linear interpolation allows us to connect this repair to the underlying earthmover distance between repaired and unrepaired distributions. In particular,

For any $x$ , $d(Y_{x},\bar{Y}_{x})=\lambda d(Y_{x},Y_{A})$ where $Y_{A}$ is the distribution on $Y$ in the full repair, and $\bar{Y}_{x}$ is the $\lambda$ -partial repair. Moreover, the repair strongly preserves rank.

2 Fairness / Utility Tradeoff

The reason partial repair may be desired is that increasing fairness may result in a loss of utility. Here, we make this intuition precise. Let $\bar{D}_{\lambda}=(X,\bar{Y},C)$ be the partially repaired data set for some value of $\lambda\in$ as described above (where $\bar{D}_{\lambda=0}=D$ ). Let $\bar{g}_{\lambda}\colon\bar{Y}\to C$ be the classifier with the utility we are trying to measure.

The utility of a classifier $\bar{g}_{\lambda}\colon\bar{Y}\to C$ with respect to some partially repaired data set $\bar{D}_{\lambda}$ is

If the classifier $\bar{g}_{\lambda=0}\colon Y\to C$ has an error of zero on the unrepaired data, then the utility is 1. More commonly, $\gamma(\bar{g}_{\lambda=0},\bar{D}_{\lambda=0})<1$ . In our experiments, we will investigate how $\gamma$ decreases as $\lambda$ increases.

Experiments

We will now consider the certification algorithm and repair algorithm’s fairness/utility tradeoff experimentally on three data sets. The first is the Ricci data set at the heart of the Ricci v. DeStefano case . It consists of 118 test taker entries, each including information about the firefighter promotion exam taken (Lieutenant or Captain), the score on the oral section of the exam, the written score, the combined score, and the race of the test taker (black, white, or Hispanic). In our examination of the protected race attribute, we will group the black and Hispanic test takers into a single non-white category. The classifier originally used to determine which test takers to promote was the simple threshold classifier that allowed anyone with a combined score of at least 70% to be eligible for promotion . Although the true number of people promoted was chosen from the eligible pool according to their ranked ordering and the number of slots available, for simplicity in these experiments we will describe all eligible candidates as having been promoted. We use a random two-thirds / one-third split for the training / test data.

The other two data sets we will use are from the UCI Machine Learning Repositoryhttp://archive.ics.ucu.edu/ml. So that we can compare our results to those of Zemel et al. , we will use the same data sets and the same decisions about what constitutes a sensitive attribute as they do. First, we will look at the German credit data set, also considered by Kamiran and Calders . It contains 1000 instances, each of which consists of 20 attributes and a categorization of that instance as GOOD or BAD. The protected attribute is Age. In the examination of this data set with respect to their discriminatory measure, Kamiran and Calders found that the most discrimination was possible when splitting the instances into YOUNG and OLD at age 25 . We will discretize the data accordingly to examine this potential worst case. We use a random two-thirds / one-third split for the training / test data.

We also look at the Adult income data set, also considered by Kamishima et al. . It contains 48,842 instances, each of which consists of 14 attributes and a categorization of that person as making more or less than $50,000 per year. The protected attribute we will examine is Gender. Race is also an attribute in the data, and it will be excluded for classification purposes, except for when examining the effects of having multiple protected attributes - in this case, race will be categorized as white and non-white. The training / test split given in the original data is also used for our experiments.

For each of these data sets, we look at a total of 21 versions of the data - the original data set plus 10 partially or fully repaired attribute sets for each of the combinatorial and geometric partial repairs. These are the repaired attributes for $\lambda\in$ at increments of $0.1$ . Data preprocessing was applied before the partial repair algorithm was run.

Preprocessing. Datasets were preprocessed as follows:

Remove all protected attributes from $Y$ . This ensures that we are not trying to learn a classifier that depends on other protected attributes that might correlate with the target protected attribute. (The repair process does still get to know $X$ .)

Remove all unordered categorical features since our repair procedure assumes that the space of values is ordered. Ordered categories are converted to integers.On the Adult Income data, it happens that all missing values were of these unordered categorical columns, so no data sets had missing values after this step.

Scale each feature so that the minimum is zero and the maximum is one.

Classifiers. Three different classifiers were used as oracles for measuring discrimination (under the disparate impact measure and a measure by Zemel et al. ), and to test the accuracy of a classification after repair. The classifiers used for our experimental tasks were provided by the Scikit-learnhttp://scikit-learn.org. python package.

Logistic Regression: Liblinear’s logistic regression algorithm for L2 regularization and logistic loss. The classifier was configured to weight the examples automatically so that classes were weighted equally.

Support Vector Machine: Liblinear’s linear SVM algorithm for L2 regularization and L2 loss. The classifier was configured to weight the examples automatically so that classes were weighted equally.

Gaussian Naïve Bayes: Scikit-Learn’s naïve Bayes algorithm with a balanced class prior.

Parameter selection and cross-validation. LR and SVM classifiers were cross-validated using three-fold cross validation and the best parameter based on BER was chosen. We cross-validated the parameter controlling the tradeoff between regularization and loss, and $13$ parameters between $10^{-3}$ and $10^{3}$ , with logarithms uniformly spaced, were searched.

Repair details. The repair procedure requires a ranking of each attribute. The numeric values and ordered categorical attributes were ordered in the natural way and then quantiles were used as the ranks. Since the repair assumes that there is a point at each quantile value in each protected class, the quantiles were determined in the following way. For each attribute, the protected class with the smallest number of members was determined. This size determined how many quantile buckets to create. The other protected classes were then appropriately divided into the same number of quantile buckets, with the median value in each bucket chosen as a representative value for that quantile. Each quantile value in the fully repaired version is the median of the representative values for that quantile. The combinatorial partial repair determines all valid values for an attribute and moves the original data part way to the fully repaired data within this space. The geometric repair assumes all numeric values are allowed for the partial repair.

The goal in this section is to experimentally validate our certification algorithm, described in subsection 4.2. On each of the data sets described above, we attempt to predict the protected attribute from the remaining attributes. The resulting ber is compared to $\mathsf{DI}(g)$ where $g:Y\to C$ , i.e., the disparate impact value as measured when some classifier attempts to predict the class given the non-protected attributes. From the underlying data, we can calculate the ber threshold $\epsilon=1/2-\beta/8$ . Above this threshold, any classifier applied to the data will have disparate impact. The threshold is chosen conservatively so as to preclude false positives (times when we falsely declare the data to be safe from disparate impact).

In Figure 2 we can see that there are no data points greater than the ber threshold and also much below $\tau=0.8$ , the threshold for legal disparate impact. The only false positives are a few points very close to the line. This is likely because the $\beta$ value, as measured from the data, has some error. We can also see, from the points close to the ber threshold line on its left but below $\tau$ that while we chose the threshold conservatively, we were not overly conservative. Still, using a classifier other than the purely biased one in the certification algorithm analysis might allow this threshold to be tightened.

The points in the upper left quadrant of these charts represent false negatives of our certification algorithm on a specific data set and a specific classifier. However, our certification algorithm guarantees lack of disparate impact over any classifier, so these are not false negatives in the traditional sense. In fact, when a single data set is considered over all classifiers, we see that all such data sets below the ber threshold have some classifier that has $\mathsf{DI}$ close to or below $\tau=0.8$ .

One seemingly surprising artifact in the charts is the vertical line in the Adult Income data chart at $\text{{ber}}=0.5$ for the GNB repair. Recall that the chart is based off of two different confusion matrices - the ber comes from predicting gender while the disparate impact is calculated when predicting the class. In a two class system, the ber cannot be any higher than $0.5$ , so while the ability to predict the gender cannot get any worse, the resulting fairness of the class predictions can still improve, thus causing the vertical line in the chart.

2 Fairness / Utility Tradeoff

The goal in this section is to determine how much the partial repair procedure degrades utility. Using the same data sets as described above, we will examine how the utility (see Definition 5.3) changes $\mathsf{DI}$ (measuring fairness) increases. Utility will be defined with respect to the data labels. Note that this may itself be faulty data, in that the labels may not themselves provide the best possible utility based on the underlying, but perhaps unobservable, desired outcomes. For example, the results on the test from the Ricci data may not perfectly measure a firefighter’s ability and so outcomes based on that test may not correctly predict who should be promoted. Still, in the absence of knowledge of more precise data, we will use these labels to measure utility. For the Ricci data, which is unlabeled, we will assume that the true labels are those provided by the simple threshold classifier used on the non-repaired version of the Ricci data, i.e. that anyone with a score of at least 70% should pass the exam. Disparate impact ( $\mathsf{DI}$ ) for all data sets is measured with respect to the predicted outcomes on the test set as differentiated by protected attribute. The SVM described above is used to classify on the Adult Income and German Credit data sets while the Ricci data uses the simple threshold classifier. The utility ( $1-\text{{ber}}$ ) shown is based on the confusion matrix of the original labels versus the labels predicted by these classifiers.

The results, shown in Figure 3, demonstrate the expected decay over utility as fairness increases. Each unrepaired data set begins with $\mathsf{DI}<0.8$ , i.e., it would fail the 80% rule, and we are able to repair it to a legal value. For the Adult Income data set, repairing the data fully only results in a utility loss from about 74% to 72%, while for the German Credit data, repairing the data fully reduces the utility from about 72% to 50% - essentially random. We suspect that this difference in decay is inherent to the class decisions in the data set (and the next section will show that other existing fairness repairs face this same decay). We suspect that the lack of linearity in the utility decay in the German Credit data after it has fairness greater than $\mathsf{DI}=0.8$ is due to this low utility.

Looking more closely at the charts, we notice that some of the partially repaired data points have $\mathsf{DI}>1$ . Since $\mathsf{DI}$ is calculated with respect to fixed majority and minority classes, this happens when the classifier has given a good outcome to proportionally more minority than majority class members. These points should be considered unfair to the majority class.

Figure 3 also shows that combinatorial and geometric repairs have similar $\mathsf{DI}$ and utility values for all partial repair data sets. This means that either repair can be used.

Multiple Protected Attributes. Our repair procedure can operate over the joint distribution of multiple protected attributes. To examine how this affects utility, we considered the Adult Income data set repaired by gender only, race only, and over both gender and race. For the repairs with respect to race, a binary racial categorization of white and non-white is used. Repairs with respect to both race and gender are taken over the joint distribution. In the joint distribution case, the $\mathsf{DI}$ calculated is the average of the $\mathsf{DI}$ of each of the three protected sets (white women, non-white men, and non-white women) with respect to the advantaged group (white men). The classifier used to predict the class from the non-protected attributes is the SVM described earlier.

The results, shown in Figure 4, show that the utility loss over the joint distribution is close to the maximum of the utility loss over each protected attribute considered on its own. In other words, the loss does not compound. These good results are likely due in part to the size of the data set allowing each subgroup to still be large enough. On such data sets, allowing all protected attributes to be repaired appears reasonable.

3 Comparison to previous work

Here, we compare our results to related work on the German credit data and Adult income data sets. Logistic regression is used as a baseline comparison, fair naive Bayes is the solution from Kamiran and Calders , regularized logistic regression is the repair method from Kamishima et al. , and learned fair representations is Zemel et al.’s solution . All comparison data is taken from Zemel et al.’s implementations . Zemel et al. define discrimination as $(1-\alpha)-\beta$ . So that increasing Zemel scores mean that fairness has increased, as is the case with $\mathsf{DI}$ , we will look at the Zemel fairness score which we define as $1-((1-\alpha)-\beta)=2\cdot\text{{ber}}$ . Accuracy is the usual rate of successful classification. Unlike the compared works, we do not choose a single partial repair point. Figure 5 shows our fairness and accuracy results for both combinatorial and geometric partial repairs for values of $\lambda\in$ at increments of $0.1$ using all three classifiers described above.

Figure 5 shows that our method can be flexible with respect to the chosen classifier. Since the repair is done over the data, we can choose a classification algorithm appropriate to the data set. For example, on the Adult Income data set the repairs based on Naïve Bayes have better accuracy at high values of fairness than the repairs based on Logistic Regression. On the German and Adult data sets our results show that for any fairness value a partially repaired data set at that value can be chosen and a classifier applied to achieve accuracy that is better than competing methods.

Since the charts in Figure 5 include unrepaired data, we can also separate the effects of our classifier choices from the effects of the repair. In each classifier repair series, the data point with the lowest Zemel fairness (furthest to the left) is the original data. Comparing the original data point when the LR classifier was used to the LR classifier used by Zemel et al. as a comparison baseline, we see a large jump in both fairness and accuracy. Configuring the classifier to weight classes equally may have accounted for this improvement.

Limitations and Future Work

Our experiments show a substantial difference in the performance of our repair algorithm depending on the specific algorithms we chose. Given the myriad classification algorithms used in practice, there is a clear need for a future systematic study of the relationship between dataset features, algorithms, and repair performance.

In addition, our discussion of disparate impact is necessarily tied to the legal framework as defined in United States law. It would be valuable in future work to collect the legal frameworks of different jurisdictions, and investigate whether a single unifying formulation is possible.

Finally, we note that the algorithm we present operates only on numerical attributes. Although we are satisfied with its performance, we chose this setting mostly for its relative theoretical simplicity. A natural avenue for future work is to investigate generalizations of our repair procedures for datasets with different attribute types, such as categorical data, vector-valued attributes, etc.

Acknowledgments

This research was funded in part by the NSF under grant BIGDATA-1251049. Thanks to Deborah Karpatkin, David Robinson, and Natalie Shapero for helping us understand the legal interpretation of disparate impact. Any misunderstandings about these issues in this paper are our own. Thanks also to Mark Gould for pointing us to the Griggs v. Duke decision, which helped to set us down this path in the first place.

References

Appendix A A Survey of Discrimination Types

Previous work has introduced many seemingly dissimilar notions of discrimination. Here, we give a categorization of these in terms of disparate treatment or disparate impact. Note that the notion of disparate impact is outcome focused, i.e., it is determined based on a discriminatory outcome and, until a question of the legality of the process comes into play in court, is less concerned with how the outcome was determined. In the cases where previous literature has focused on the process, we will try to explain how we believe these processes would be caught under the frameworks of either disparate impact or disparate treatment, hence justifying both our lack of investigation into processes separately as well as the robustness of the existing legal framework.

Disparate treatment is the legal name given to outcomes that are discriminatory due to choices made explicitly based on membership in a protected class. In the computer science literature, this has previously been referred to as blatant explicit discrimination . When the protected class is used directly in a model, this has been referred to as direct discrimination .

Dwork et al. describe a situation in which a strong (under the ranking measure) member of the majority class might be purposefully assigned to the negative class in order to refute a claim of discrimination by members of the protected class by using the rejected candidate as an example . Under a disparate treatment or impact theory, this would not be an effective way to hide discrimination, since many qualified members of the majority class would have to be rejected to avoid a claim of disparate impact. If the rejection of qualified members of the majority class was done explicitly based on their majority class status, this would be categorized as disparate treatment à la the recent Ricci v. DeStefano decision .

Note that the Ricci decision specifically focuses on an instance in which disparate impact was found and, as a response, the results of a promotion test were not put into place. This was found to have disparate treatment against the majority class and presents an interesting circularity that any repair to disparate impact needs to address.

Dwork et al. also describe a scenario in which outcomes would be more accurate and more beneficial to the protected class if their membership status was taken into account in the model . Here there are two possible decisions. If the membership status is explicitly used to determine the outcome then this may be disparate treatment, however since the law is largely driven by the cases that are brought, if no individual was unjustly treated, then there might be no-one to bring a case, and so it’s possible that explicitly using the protected status would not be considered disparate treatment.

If the membership status is not used and the protected class and majority class are not represented in close to equal proportion in the positive outcome, then this is disparate impact. In either case, Dwork et al. might argue that individual fairness has not been maintained, since even if the protected class is proportionally represented, the wrong individuals of that class may be systemically receiving the positive outcome. It is possible to use the disparate impact framework to test for this as well. Suppose that $X$ is the protected attribute, with classes $x_{1}$ and $x_{2}$ . Suppose that $P$ is the attribute which, when combined with $X$ , determined who should receive a positive outcome, such that the “correct” outcomes are positive with values $x_{1}$ and $p_{1}$ or $x_{2}$ and $p_{2}$ . Then by considering the disparate impact over the classes $(x,p)$ in the joint distribution $(X,P)$ , the disparate impact to some subset of population $X$ can be detected. However, $P$ may be a status that is not legally protected, e.g., if $P$ are the grades of a student applying to college it is legal to take those grades into account when determining their acceptance status, so no illegal disparate impact has occurred.

A.2 Disparate Impact

Disparate impact (described formally earlier in this paper), is the legal theory that outcomes should not be different based on individuals’ protected class membership, even if the process used to determine that outcome does not explicitly base the decision on that membership but rather on proxy attribute(s). This idea has been a large point of investigation already in computer science and has appeared under many different names with slightly differing definitions, all of which could be measured as disparate impact. These other names for disparate impact are listed below.

A historic form of discrimination that uses someone’s neighborhood as a proxy for their race in order to deny them services .

Membership in the protected class is not explicitly used, but that information is encoded in other data that is used to make decisions, resulting in discriminatory outcomes .

A model uses attributes that are not independent of membership in the protected class and generates a discriminatory outcome .

Under this scenario, candidates are brought in for interviews at equal rates based on their class status, but are hired at differing rates. Dwork et al. state that this may be perceived as fair (by looking solely at the rates of interviews granted) and then could be later used to justify a lower rate of hiring among the protected class based on this historical data . We avoid this problem by concentrating throughout on the outcomes, i.e., the percent of each class actually hired. With this perspective, this issue would be identified using the disparate impact standard.

Kamishima et al. discuss the issue of unfair sampling or labeling in the training data based on protected class status . If such negative legacy causes the outcomes to differ by membership in the protected class, then such bias will be caught under the disparate impact standard. If negative legacy does not cause any differences to the outcome, then the bias has been overcome and no discrimination has occurred.

Kamishima et al. also discuss the impact of a model that has not yet converged due to the data set’s small size as a possible source of discrimination . Similarly to our understanding of negative legacy, we expect to catch this issue by examining the resulting model’s outcomes under the disparate impact standard.

Dwork et al. point out that differentially treated protected subgroups could be hidden within a larger statistically neutral subgroup . This is a version of Simpson’s paradox . We can find such differential treatment by conditioning on the right protected class subset. In other words, this is the same as one of our suggestions for how to choose $X$ in the repair process - consider multiple classes over their joint distribution as the protected attribute.