Fairer and more accurate, but for whom?

Alexandra Chouldechova, Max G'Sell

Introduction

Actuarial and clinical assessments of risk have long been mainstays of decision making in domains such as criminal justice, health care and human services. Within the criminal justice system, for instance, recidivism prediction instruments and judicial discretion commonly enter into decisions concerning bail, parole and sentencing. In these high-stakes settings, decisions made based on erroneous predictions can have a direct adverse impact on individuals’ lives. Institutions are therefore continually seeking to improve the accuracy of their risk predictions, and many are turning to proprietary commercial tools and more complex “black-box” prediction models in pursuit of accuracy gains.

When determining whether to replace or augment an existing risk assessment method, it is important to compare the proposed model to the existing approach across a range of task-relevant accuracy and fairness metrics. As we will demonstrate, a comparison that looks only at overall performance can present an incomplete and potentially misleading picture.

A motivating example. In May 2016 an investigative journalism team at ProPublica released a report (Angwin et al., 2016b) on a proprietary recidivism prediction instrument called COMPAS (Northpointe, 2010), developed by Northpointe Inc. The data set (Angwin et al., 2016a) released as part of this report contains COMPAS decile scores, 2-year recidivism outcomes and a number of demographic and crime-related variables for defendants scored as part of pre-trial proceedings in Broward County, Florida. In particular, the data set contains information on the number of prior offenses (henceforth denoted Priors) for each defendant. Since criminal history is itself a good predictor of future recidivism, it is reasonable to suppose that before COMPAS was introduced, judges could have based their risk assessments on Priors instead. Our question is thus: Does COMPAS produce more accurate (and/or more equitable) predictions of recidivism than Priors alone?

The table below summarizes the classification performance of the two models on the Broward County data. Following the ProPublica analysis, we restrict our attention to the 6150 defendants in the data whose race was recorded as either African-American or Caucasian.

The numeric scores were converted to classification rules using a cutoff of 2 for Priors and 5 for COMPAS. These cutoffs were selected so that both models would classify approximately the same proportion of defendants as high-risk (42% and 39%, respectively). While COMPAS is somewhat more accurate according to the various metrics, the overall difference in performance is not very large. One might therefore be inclined to conclude that the choice of model does not make much difference, and that the results are similar perhaps because COMPAS likely puts a large weight on criminal history and thus reaches the same conclusion as Priors. This conclusion is incorrect. As it turns out, the two classifiers disagree on 32% of all cases. Furthermore, as we will see in Section 3.1, they differ tremendously in terms of error rates and racial disparities for certain subgroups of defendants. Model choice matters.
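The distinction between similar overall rates and case-level agreement can be illustrated with a short sketch. Everything here is hypothetical: the function names, the toy scores and the toy cutoffs are invented for illustration, and treating a score at or above the cutoff as high-risk is an assumption, since the text does not say on which side the cutoff falls.

```python
def classify(scores, cutoff):
    """Return 1 (high-risk) for scores at or above the cutoff, else 0.

    Assumption: the cutoff is inclusive.
    """
    return [1 if s >= cutoff else 0 for s in scores]


def disagreement_rate(labels_a, labels_b):
    """Fraction of cases on which two classifiers give different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Invented toy scores for six cases (not taken from the Broward County data).
priors_scores = [0, 1, 3, 5, 2, 0]
compas_deciles = [2, 6, 4, 9, 7, 1]

priors_labels = classify(priors_scores, cutoff=2)    # [0, 0, 1, 1, 1, 0]
compas_labels = classify(compas_deciles, cutoff=5)   # [0, 1, 0, 1, 1, 0]

# Both rules flag half of the cases, yet they disagree on two of the six.
share_high_priors = sum(priors_labels) / len(priors_labels)
share_high_compas = sum(compas_labels) / len(compas_labels)
disagreement = disagreement_rate(priors_labels, compas_labels)
```

Matching the proportion classified as high-risk therefore says nothing about whether the same individuals are being flagged.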

Main contributions. We introduce a model comparison framework based on a recursive binary partitioning algorithm for automatically identifying subgroups in which the differences between two classification models are most pronounced. The methods presented in this paper specifically focus on identifying subgroups where the models differ in terms of fairness-related quantities such as racial or gender disparities in error or acceptance rates. Our methods can be applied to black-box models trained according to an unknown mechanism, do not require knowledge of what inputs the models use to make predictions, and do not require the models to use the same input variables.

One noteworthy application of our method is in the model training phase, where one may wish to understand the effect that including a particular set of (potentially sensitive) variables has on the resulting classifications. While there are certainly settings where using a sensitive attribute in decision-making is prohibited by law, this is far from always being the case. Many domains permit the consideration of sensitive attributes when doing so improves the welfare of traditionally disadvantaged groups. Indeed, depending on the problem setting, there may be good reason to expect predictive factors or mechanisms to differ across groups. As Hardt (2014) argues, “statistical patterns that apply to the majority may be invalid within a minority group.” In settings where using information on sensitive attributes may be permitted, it is important to understand the implications that this choice has for fairness. Our framework provides a principled approach to investigating these kinds of issues. We explore this matter further in the hypothetical lending example of Section 3.2.

We begin with an overview of some related literature on model transparency and subgroup analysis. In Section 2 we describe the general framework for our model comparison approach and provide some details on the implementation. We conclude with experimental results where we investigate (i) how racial disparities differ across models in the ProPublica COMPAS data, and (ii) how gender disparities in acceptance rates change when additional sensitive attributes are added to a hypothetical model of creditworthiness.

1.2 Related Work

Within the algorithmic fairness literature, notable recent work has introduced new variable importance measures for quantifying the influence of variables on classification decisions (see, e.g., Henelius et al., 2014; Adler et al., 2016; Datta et al., 2015, 2016). A motivation common to much of this body of work has been the problem of assessing whether sensitive attributes such as race or gender have direct or indirect influence on model outcomes. We also note the recent work of Zhang and Neill (2016), which considers the single-model problem of identifying subgroups in which the estimated event probabilities differ significantly from observed proportions. This existing literature differs from our proposal in that we seek to quantify and characterize the difference in fairness across different models rather than to assess the direct or indirect influence of features in a single pre-trained model. Our proposed method for characterizing differences in fairness across models has connections to recent work on subgroup analysis and recursive binary partitioning approaches for heterogeneous treatment effect estimation (Su et al., 2009; Athey and Imbens, 2015).

1.3 Fairness Metrics

Throughout the paper we will make references to “fairness metrics” or “disparities”, which often correspond to differences in a particular classification metric across groups. For instance, statistical parity or equal acceptance rates with respect to a binary gender indicator would be satisfied if men and women were classified to the positive outcome at approximately equal rates. False positive rate balance with respect to race would be satisfied in the COMPAS example if non-reoffending Black defendants were misclassified as high-risk at the same rate as non-reoffending White defendants. Hardt et al. (2016), Kleinberg et al. (2017), Chouldechova (2017), Corbett-Davies et al. (2017) and Berk et al. (2017) describe numerous commonly used metrics and discuss the inherent trade-offs that exist between them. Romei and Ruggieri (2014) provide a broader survey of multidisciplinary approaches to discrimination analysis that go beyond simple classification metrics.

Model Comparison framework

We now describe our methodology for identifying subgroups in which a given disparity differs across models. The central components of this approach are as follows. First, we define a quantity of interest, $\Delta$, that captures differences in model fairness, and we show how this quantity is a simple function of the parameters of an exponential family model. We then apply a recursive binary partitioning algorithm that uses a score-type test for $\Delta$ to partition the covariate space into regions within which $\Delta$ is homogeneous.

Notation. Let $A\in\{a_{1},a_{2}\}$ indicate a sensitive binary attribute (e.g., race in the COMPAS example), and let $Y\in\{0,1\}$ indicate the true outcome (e.g., 2-year recidivism). Let $\hat{Y}_{m_{1}},\hat{Y}_{m_{2}}\in\{0,1\}$ denote the classifications made by two classifiers, $m_{1}$ and $m_{2}$. Due to space limitations, we focus our description on disparities in the False Positive Rate (FPR). Extensions to other fairness metrics involving expressions of the form $\hat{Y}\mid A,Y$ (e.g., FNR, acceptance rates) are entirely analogous, and are discussed in Section 2.3.
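The quantity of interest can be made concrete as follows. Writing $\mathrm{FPR}_{m}^{a}$ for the false positive rate of classifier $m$ in group $a$, one form consistent with the description of $\Delta$ as a difference-in-differences is given below. The ordering of the groups and of the models, and therefore the sign of $\Delta$, is an assumption made here for illustration.

$$
\mathrm{FPR}_{m}^{a} = P\left(\hat{Y}_{m}=1 \mid Y=0,\, A=a\right),
\qquad
\Delta = \left(\mathrm{FPR}_{m_{1}}^{a_{1}} - \mathrm{FPR}_{m_{1}}^{a_{2}}\right) - \left(\mathrm{FPR}_{m_{2}}^{a_{1}} - \mathrm{FPR}_{m_{2}}^{a_{2}}\right).
$$

Under this convention, $\Delta>0$ indicates that the FPR gap between groups $a_{1}$ and $a_{2}$ is more positive under $m_{1}$ than under $m_{2}$, and $\Delta=0$ indicates that the two models exhibit the same signed disparity.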

We focus on the difference-in-differences instead of difference in absolute differences because it is important to be able to capture cases where the disparity differs in sign between two models but not necessarily in magnitude.
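A small worked example may help show why the signed comparison matters. The sketch below uses invented toy data and hypothetical helper names, and it assumes the disparity is measured as the FPR in group `a1` minus the FPR in group `a2`. The two classifiers have disparities of equal magnitude but opposite sign, so a comparison of absolute disparities reports no difference while the difference-in-differences does.

```python
def fpr(y_true, y_pred):
    """False positive rate: share of true negatives (y = 0) predicted positive."""
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(preds_on_negatives) / len(preds_on_negatives)


def fpr_disparity(y_true, y_pred, group, a1, a2):
    """FPR in group a1 minus FPR in group a2 for a single classifier."""
    def group_fpr(a):
        rows = [i for i, g in enumerate(group) if g == a]
        return fpr([y_true[i] for i in rows], [y_pred[i] for i in rows])
    return group_fpr(a1) - group_fpr(a2)


# Invented toy data: four true negatives, two per group.
y_true = [0, 0, 0, 0]
group = ["a1", "a1", "a2", "a2"]
pred_m1 = [1, 0, 0, 0]   # m1 wrongly flags one member of a1
pred_m2 = [0, 0, 1, 0]   # m2 wrongly flags one member of a2

disp_m1 = fpr_disparity(y_true, pred_m1, group, "a1", "a2")   # 0.5
disp_m2 = fpr_disparity(y_true, pred_m2, group, "a1", "a2")   # -0.5

diff_in_diff = disp_m1 - disp_m2            # 1.0: the models disadvantage opposite groups
diff_of_abs = abs(disp_m1) - abs(disp_m2)   # 0.0: the sign flip is invisible
```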

The goal of the proposed method is to partition the covariate space into subgroups such that $\Delta$ is homogeneous within each subgroup and different between subgroups. We say that $\Delta$ is homogeneous within a group if that group cannot be partitioned into subgroups with significantly different $\Delta$ values. In Section 2.1, we will describe the application of test-based recursive partitioning to obtain subgroups homogeneous in $\Delta$. In Section 2.2, we describe the likelihood model which underlies the tests of homogeneity.

Given a set of partitioning covariates $X_{1},\ldots,X_{p}$ (which need not correspond in any way to the inputs used by either classifier), we recursively partition the covariate space using tests of the homogeneity of $\Delta$. Our partitioning procedure follows the approach of Hothorn and Zeileis (2015) and Zeileis et al. (2008) for model-based recursive binary partitioning, and relies on a modified version of the corresponding software. Our implementation uses custom fitting functions supplied to the R package partykit. We briefly describe the procedure for the simple case where all of the splitting variables are categorical. Both the approach and the software extend fully to numeric and ordinal variables.

Let $K_{j}$ denote the number of distinct levels of variable $X_{j}$, and let $\Delta^{j}_{k}$ denote the (population) value of $\Delta$ in level $k$ of variable $X_{j}$. Beginning with all observations in the root node, recursively split according to the following procedure:

(1) For each partitioning variable $j=1,\ldots,p$, apply a score-type test (see Section 2.2 and Appendix A) for detecting when $\Delta^{j}_{k}$ varies across the levels $k\in\{1,\ldots,K_{j}\}$. Select the splitting variable with the most significant difference in $\Delta$ (the smallest $p$-value for this test).

(2) For the selected variable, partition its levels into the two groups which minimize the total deviance of the resulting model.
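The search over two-group partitions of a categorical variable's levels can be sketched as below. This is a minimal illustration rather than the partykit implementation: the function names are hypothetical, the per-level values are invented, and `toy_loss` is a stand-in criterion (within-group squared deviation) used in place of the total deviance so that the example stays self-contained.

```python
from itertools import combinations


def binary_partitions(levels):
    """Yield every split of `levels` into two non-empty groups, each split once."""
    levels = list(levels)
    if len(levels) < 2:
        return
    first, rest = levels[0], levels[1:]
    # Keeping the first level on the left avoids emitting mirror-image splits.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(lv for lv in rest if lv not in extra)
            yield left, right


def best_binary_split(levels, loss):
    """Return the (left, right) split with the smallest value of `loss`."""
    return min(binary_partitions(levels), key=lambda split: loss(*split))


# Invented per-level estimates of Delta, used only to exercise the search.
toy_delta = {"young": 0.30, "middle": 0.05, "older": 0.02}


def toy_loss(left, right):
    """Stand-in criterion: within-group squared deviation of the toy estimates."""
    total = 0.0
    for side in (left, right):
        values = [toy_delta[lv] for lv in side]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


all_splits = list(binary_partitions(toy_delta))
best = best_binary_split(toy_delta, toy_loss)   # (("young",), ("middle", "older"))
```

A variable with $K$ levels admits $2^{K-1}-1$ such partitions, so exhaustive enumeration is only practical when the number of levels is modest.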

The recursion terminates when nodes cannot be further split without falling below a user-specified minimum size threshold, or no further splits can be identified for which the Bonferroni-adjusted $p$-value is smaller than a user-specified significance threshold. As a final step, the tree is pruned to eliminate splits where the differences in $\Delta$ are not of practical significance, but which were statistically significant due to large sample sizes.
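The variable selection and stopping rules can be sketched as follows. This is a simplified illustration under stated assumptions, not the software used in the paper: `choose_split`, its arguments and the default `alpha` are hypothetical; the p-values are assumed to come from whatever homogeneity test is applied in the node; the Bonferroni factor is assumed to be the number of candidate variables; and the size check only requires that the node could be divided into two children of the minimum size.

```python
def choose_split(p_values, node_size, min_size, alpha=0.05):
    """Return the variable to split on, or None if the recursion should stop.

    p_values maps each candidate variable to the p-value of its homogeneity
    test within the current node.
    """
    # A node smaller than twice the minimum size cannot yield two valid children.
    if not p_values or node_size < 2 * min_size:
        return None
    best = min(p_values, key=p_values.get)
    # Bonferroni adjustment over the candidate variables (assumed here).
    adjusted = min(1.0, len(p_values) * p_values[best])
    return best if adjusted < alpha else None


# Invented p-values for three candidate variables.
split_a = choose_split({"sex": 0.001, "age_cat": 0.2, "c_charge_degree": 0.04},
                       node_size=500, min_size=50)   # "sex"
split_b = choose_split({"sex": 0.03, "age_cat": 0.2},
                       node_size=500, min_size=50)   # None: 2 * 0.03 >= 0.05
split_c = choose_split({"sex": 0.001},
                       node_size=80, min_size=50)    # None: node too small
```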

This partitioning scheme produces what we will refer to as a parameter instability tree, with splits defined based on individual covariates, similar to the familiar trees produced by CART (Breiman et al., 1984) in classification settings. The leaf nodes of the tree correspond to subgroups where $\Delta$ appears homogeneous. Figure 1 shows an example of the parameter instability tree for the COMPAS data, with $\Delta$ taken to be the difference in racial FPR disparity between the Priors and COMPAS models. Section 3.1 provides more details on the experimental setup.

2.2 Modeling classifications

To carry out the recursive partitioning of Section 2.1, we need a model for the classifications which is (a) reasonable, and (b) easily captures $\Delta$ in its parametrization. Given a population, a natural joint model of classifications $(\hat{Y}_{m_{1}},\hat{Y}_{m_{2}})$ is as a multinomial conditional on sensitive attribute $A\in\{a_{1},a_{2}\}$, as illustrated in Table 1. This multinomial is parameterized by the probabilities

where $a\in\{a_{1},a_{2}\}$. The conditional multinomial is a convenient formulation, since all relevant FPR quantities can be represented in terms of these parameters:
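A sketch of these expressions is given below. It assumes the indexing convention used for the counts $n_{ij}^{a}$ in Appendix A (first index for $m_{1}$, second index for $m_{2}$) and assumes that the population is restricted to cases with $Y=0$, as is natural when working with false positive rates.

$$
p_{ij}^{a} = P\left(\hat{Y}_{m_{1}}=i,\ \hat{Y}_{m_{2}}=j \mid A=a,\ Y=0\right), \qquad i,j\in\{0,1\},
$$

$$
\mathrm{FPR}_{m_{1}}^{a} = p_{10}^{a}+p_{11}^{a}, \qquad \mathrm{FPR}_{m_{2}}^{a} = p_{01}^{a}+p_{11}^{a}.
$$

Each false positive rate is the marginal probability that the corresponding classifier outputs 1, obtained by summing over the other classifier's output.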

An important observation is that, for the purpose of computing the quantity of interest $\Delta$, it suffices to consider the coarser conditional multinomial over the three events $\{\hat{Y}_{m_{1}}=0,\hat{Y}_{m_{2}}=1\}$, $\{\hat{Y}_{m_{1}}=1,\hat{Y}_{m_{2}}=0\}$ and $\{\hat{Y}_{m_{1}}=\hat{Y}_{m_{2}}\}$. This reduced multinomial is parameterized by $p_{10}^{a_{1}},p_{01}^{a_{1}},p_{10}^{a_{2}},p_{01}^{a_{2}}$. We summarize this observation in the proposition below.
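The sufficiency of the disagreement cells can be checked numerically. The sketch below uses invented counts among $Y=0$ cases and hypothetical function names, with plug-in proportions in place of population probabilities. It assumes that $\Delta$ is ordered as the $a_{1}$ minus $a_{2}$ FPR gap under $m_{1}$ minus the same gap under $m_{2}$, and that a key `(i, j)` records the outputs of $m_{1}$ and $m_{2}$ in that order.

```python
from fractions import Fraction


def delta_from_fprs(counts, a1="a1", a2="a2"):
    """Delta computed from each model's FPR within each group."""
    def fprs(a):
        n = sum(counts[a].values())
        fpr_m1 = Fraction(counts[a][(1, 0)] + counts[a][(1, 1)], n)
        fpr_m2 = Fraction(counts[a][(0, 1)] + counts[a][(1, 1)], n)
        return fpr_m1, fpr_m2
    (m1_a1, m2_a1), (m1_a2, m2_a2) = fprs(a1), fprs(a2)
    return (m1_a1 - m1_a2) - (m2_a1 - m2_a2)


def delta_from_discordant(counts, a1="a1", a2="a2"):
    """Delta computed from the two disagreement cells only."""
    def gap(a):
        n = sum(counts[a].values())
        return Fraction(counts[a][(1, 0)] - counts[a][(0, 1)], n)
    return gap(a1) - gap(a2)


# Invented counts of Y = 0 cases, keyed by (m1 output, m2 output).
toy_counts = {
    "a1": {(0, 0): 50, (0, 1): 10, (1, 0): 30, (1, 1): 10},
    "a2": {(0, 0): 60, (0, 1): 20, (1, 0): 10, (1, 1): 10},
}
full = delta_from_fprs(toy_counts)            # Fraction(3, 10)
reduced = delta_from_discordant(toy_counts)   # Fraction(3, 10)
```

Exact rational arithmetic is used so that the two routes can be compared for equality without floating-point tolerance.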

The FPR difference-in-difference, $\Delta$, can be written as

The proposition follows directly from the definition of $\Delta$ and the identity,
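The argument can be sketched as follows, assuming that $p_{ij}^{a}$ denotes the probability that $\hat{Y}_{m_{1}}=i$ and $\hat{Y}_{m_{2}}=j$ among $Y=0$ cases in group $a$, and assuming that $\Delta$ is ordered as the $a_{1}$ minus $a_{2}$ FPR gap under $m_{1}$ minus the same gap under $m_{2}$. The concordant cell $p_{11}^{a}$ cancels within each group:

$$
\mathrm{FPR}_{m_{1}}^{a} - \mathrm{FPR}_{m_{2}}^{a} = \left(p_{10}^{a}+p_{11}^{a}\right) - \left(p_{01}^{a}+p_{11}^{a}\right) = p_{10}^{a} - p_{01}^{a}.
$$

Regrouping the four terms of $\Delta$ by sensitive attribute rather than by model then gives

$$
\Delta = \left(\mathrm{FPR}_{m_{1}}^{a_{1}} - \mathrm{FPR}_{m_{2}}^{a_{1}}\right) - \left(\mathrm{FPR}_{m_{1}}^{a_{2}} - \mathrm{FPR}_{m_{2}}^{a_{2}}\right) = \left(p_{10}^{a_{1}} - p_{01}^{a_{1}}\right) - \left(p_{10}^{a_{2}} - p_{01}^{a_{2}}\right).
$$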

To obtain the score-type test statistic for $\Delta$ required in Step (1) of the partitioning scheme described in Section 2.1, we further reparameterize the model according to the transformations:

In this parameterization, $\eta^{+}$, $\delta$ and $\eta^{-}$ are treated as nuisance parameters in the model likelihood for the purpose of forming the score-type test statistic for $\Delta$. A more complete derivation of the test statistic can be found in Appendix A.

2.3 Extensions

In principle, the methodology can also be extended to sensitive attributes $A$ that have more than two levels. One would first need to define a quantity $\Delta$ that reflects the disparity of model predictions with respect to $A$. For instance, in the case of acceptance rates, $\Delta$ could be taken to be the variance in acceptance rates across race (now understood to be non-binary). An extension of the proposed procedure to this quantity would thus identify subgroups where one model exhibits greater variability in acceptance rates across race compared to another model. Alternatively, the proposed approach can be applied directly in an all-pairs or one-versus-all manner.

Lastly, we note that the score-type test for testing the null hypothesis in Step (1) of Section 2.1 can be replaced with any other valid statistical test. One could thus use a test that has greater power against particular types of alternatives. Note, however, that the score test is a computationally efficient choice. This is because, unlike most tests, the score test only requires that maximum likelihood parameters be computed under the null. This obviates the need for model refitting under the alternative for each splitting variable.

Evaluation

We begin by revisiting our motivating example with ProPublica’s COMPAS data from Broward County, Florida. So far we have seen that the COMPAS score performs similarly to the priors count, Priors, in terms of overall classification metrics. To delve deeper into differences between these two recidivism prediction models, we apply our method to identify subgroups where COMPAS and Priors differ in terms of the disparity in false positive rates between Black and White defendants. The candidate splitting variables are taken to be sex, age_cat, c_charge_degree, juv_misd_count, juv_fel_count, and juv_other_count. Figure 1 shows the resulting parameter instability tree, and Figure 2 provides a more easily interpretable representation of the findings.

Our method identifies 7 subgroups defined in terms of sex, age_cat and c_charge_degree splits where the extent or nature of the disparity in FPR between Black and White defendants is different between the two models. For instance, as we can clearly see in the rightmost panel of Figure 2, the racial FPR disparity among young men is large for COMPAS but is nearly zero for Priors.

We emphasize two key points. First, we observe that the Overall difference in racial FPR disparity is not reflective of differences at the subgroup level. Furthermore, we note that while the differences in FPR across the 7 subgroups are at least in part due to differences in recidivism prevalence across the subgroups, the same argument does not explain the differences between COMPAS and Priors within the subgroups.

3.2 Sensitive attributes as inputs

For our next example we use the Adult data set from the UCI database (Lichman, 2013) to frame a hypothetical lending problem. We fit two random forest models to the data to predict whether individuals are in the >50K income (“loan-worthy”) category. The Small model uses sex, age, workclass, education.years as inputs, while the Full model additionally uses race and marital.status, both of which are typically considered to be sensitive attributes. While we do not claim that either model is realistic, this setup does illustrate an interesting phenomenon.

We apply our method to identify subgroups where the disparity in lending rates between Male and Female applicants differs between the Small and Full model. More precisely, $\Delta$ in this example is taken to be:
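One way to write such a quantity, by analogy with the FPR case but with acceptance rates in place of false positive rates, is sketched below. The ordering of the two models and of the two genders, and hence the sign of $\Delta$, is an assumption made here for illustration.

$$
\Delta = \left[P\left(\hat{Y}_{\mathrm{Full}}=1 \mid \mathrm{Male}\right) - P\left(\hat{Y}_{\mathrm{Full}}=1 \mid \mathrm{Female}\right)\right] - \left[P\left(\hat{Y}_{\mathrm{Small}}=1 \mid \mathrm{Male}\right) - P\left(\hat{Y}_{\mathrm{Small}}=1 \mid \mathrm{Female}\right)\right].
$$

Unlike the FPR version, this expression does not condition on the true outcome, since acceptance rates depend only on the classifications.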

The candidate splitting variables are taken to be education, age, marital.status and race. Figure 3 shows the resulting parameter instability tree, and Figure 4 provides a more interpretable representation of the results. Unlike in the COMPAS example, the number of terminal nodes presented in the tree differs from the number of subgroups presented in the Figure 4 summary. This is because the tree is shown prior to pruning, a final step that collapses nodes 3, 5 and 6 into a single {Education <= High School} subgroup.

We observe that overall acceptance (lending) rates go up for both men and women when marital status and race are included in the model. We also find that the gender disparity in lending rates decreases, and even inverts, among Married individuals who have more than a High School education. The disparity also decreases considerably among unmarried individuals with at least a College education. However, this is largely due to the massive drop in lending rates among Men in this subgroup.

Conclusion

This paper introduced a test-based recursive binary partitioning approach to identifying subgroups where two models differ considerably in terms of their fairness properties. Using examples in recidivism prediction and lending, we showed how this approach can be used to detect large subgroup differences in fairness that are not apparent from an overall performance comparison. The methodology can be further extended to target other kinds of disparity parameters and to use other statistical tests for parameter instability.

Acknowledgements

We thank the anonymous FAT/ML referees for their helpful comments on the initial version of this manuscript.

References

Appendix A Score test derivation

In this section we present the derivation of the likelihood and score-type test used in the partitioning scheme described in Section 2.1. This derivation is carried out for the FPR difference-in-differences parameter as defined in expression (1). Tests for other parameters of interest may be derived analogously.

For identifying partitions of covariate space on which $\Delta$ is homogeneous, it is helpful to write the distribution and its likelihood directly in terms of $\Delta$. We use the following reparameterization of the multinomial

The parameterization $(\eta^{+},\eta^{-},\delta,\Delta)$ is equivalent to $(p_{01}^{a_{1}},p_{10}^{a_{1}},p_{01}^{a_{2}},p_{10}^{a_{2}})$, since

Likelihood and score

The log-likelihood for this multinomial model of a sample of size $n$ with parameter $\theta=(p_{00},p_{01},p_{10},p_{11})$ is

where $n_{ij}^{a}$ is the number of observations in group $A=a$ classified to $\hat{Y}_{m_{1}}=i$ and $\hat{Y}_{m_{2}}=j$. Additionally, $n_{\bullet}^{a}=n_{00}^{a}+n_{11}^{a}$, and $p_{\bullet}^{a}=1-p_{10}^{a}-p_{01}^{a}$.
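With the counts and probabilities just defined, the log-likelihood of the reduced three-event multinomial takes the standard form sketched below, up to an additive constant that does not depend on the parameters. This is written out from the definitions above under the assumption that the two groups are modeled independently.

$$
\ell(\theta) = \sum_{a\in\{a_{1},a_{2}\}} \left[\, n_{10}^{a}\log p_{10}^{a} + n_{01}^{a}\log p_{01}^{a} + n_{\bullet}^{a}\log p_{\bullet}^{a} \,\right] + \text{const}.
$$

Only the disagreement counts $n_{10}^{a}$ and $n_{01}^{a}$ enter separately; the two agreement cells are pooled into $n_{\bullet}^{a}$.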

For constructing the test of homogeneity, it is useful to have an expression for the score function for a single observation with respect to the parameters $(\eta^{+},\eta^{-},\delta,\Delta)$. This is given by:

Test statistic