Multiaccuracy: Black-Box Post-Processing for Fairness in Classification

Michael P. Kim, Amirata Ghorbani, James Zou

Introduction

Despite the successes of machine learning at complex tasks that involve making predictions about people, there is growing evidence that “state-of-the-art” models can perform significantly less accurately on minority populations than on the majority population. Indeed, a notable study of three commercial face recognition systems known as the “Gender Shades” project [BG18], demonstrated significant performance gaps across different populations at classification tasks. While all systems achieved roughly 90% accuracy at gender detection on a popular benchmark, a closer investigation revealed that the system was significantly less accurate on female subjects compared to males and on dark-skinned individuals compared to light-skinned. Worse yet, this discrepancy in accuracy compounded when comparing dark-skinned females to light-skinned males; classification accuracy differed between these groups by as much as $34\%$ ! This study confirmed empirically the intuition that machine-learned classifiers may optimize predictions to perform well on the majority population, inadvertently hurting performance on the minority population in significant ways.

A first approach to address this serious problem would be to update the training distribution to reflect the distribution of people, making sure historically-underrepresented populations are well-represented in the training data. While this approach may be viewed as an eventual goal, often for historical and social reasons, data from certain minority populations is less available than from the majority population. In particular, we may not immediately have enough data from these underrepresented subpopulations to train a complex model. Additionally, even when adequate representative data is available, this process necessitates retraining the underlying prediction model. In the common setting where the learned model is provided as a service, like a commercial image recognition system, there may not be sufficient incentive (financial, social, etc.) for the service provider to retrain the model. Still, the clients of the model may want to improve the accuracy of the resulting predictions across the population, even when they are not privy to the inner workings of the prediction system.

At a high level, our work focuses on a setting, adapted from [HKRR18], that is common in practice but distinct from much of the other literature on fairness in classification. We are given black-box access to a classifier, $f_{0}$ , and a relatively small “validation set" of labeled samples drawn from some representative distribution ${\cal D}$ ; our goal is to audit $f_{0}$ to determine whether the predictor satisfies a strong notion of subgroup fairness, multiaccuracy. Multiaccuracy requires (in a sense that we make formal in Section 2) that predictions be unbiased, not just overall, but on every identifiable subpopulation. If auditing reveals that the predictor does not satisfy multiaccuracy, we aim to post-process $f_{0}$ to produce a new classifier $f$ that is multiaccurate, without adversely affecting the subpopulations where $f_{0}$ was already accurate.

Even if the initial classifier $f_{0}$ was trained in good faith, it may still exhibit biases on significant subpopulations when evaluated on samples from ${\cal D}$ . This setting can arise when minority populations are underrepresented in the distribution used to train $f_{0}$ compared to the desired distribution ${\cal D}$ , as in the Gender Shades study [BG18]. In general, we make no assumptions about how $f_{0}$ was trained. In particular, $f_{0}$ may be an adversarially-chosen classifier, which explicitly aims to give erroneous predictions within some protected subpopulation while satisfying marginal statistical notions of fairness. Indeed, the influential work on “Fairness Through Awareness” [DHP+12], followed by [KNRW17, HKRR18], demonstrated the weakness of statistical notions of fairness (such as statistical parity, equalized odds, and calibration), showing that prediction systems can exhibit material forms of discrimination against protected populations, even though they satisfy statistical fairness conditions. Left unaddressed, such forms of discrimination may discourage the participation of minority populations, leading to further underrepresentation of these groups. Thus, our goal will be to mitigate systematic biases broadly enough to handle inadvertent and malicious forms of discrimination.

We investigate a notion of fairness – multiaccuracy – originally proposed in [HKRR18], and develop a framework for auditing and post-processing for multiaccuracy. We develop a new algorithm, Multiaccuracy Boost, where a simple learning algorithm – the auditor – is used to identify subpopulations in ${\cal D}$ where $f_{0}$ is systematically making more mistakes. This information is then used to iteratively post-process $f_{0}$ until the multiaccuracy condition – unbiased predictions in each identifiable subgroup – is satisfied. Our notion of multiaccuracy differs from parity-based notions of fairness, and is reasonable in settings such as gender detection where we would like to boost the classifier’s accuracy across many subgroups. We prove convergence guarantees for Multiaccuracy Boost and show that post-processing for multiaccuracy may actually improve the overall classification accuracy. We describe the post-processing algorithm in Section 3.

Empirically, we validate Multiaccuracy Boost in several different case studies: gender detection from images as in Gender Shades [BG18], a semi-synthetic medical diagnosis task, and adult income prediction. In all three cases, we use standard, initial prediction models that achieve good overall classification error but exhibit biases against significant subpopulations. After post-processing, the accuracy improves across these minority groups, even though minority-status is not explicitly given to the post-processing algorithm as a feature. As long as there are features in the audit set correlated with the (unobserved) human categories, then Multiaccuracy Boost is effective at improving the classification accuracy across these categories. As suggested by the theory, Multiaccuracy Boost actually improves the overall accuracy, by identifying subpopulations where the initial models systematically erred; further, post-processing does not significantly affect performance on groups where accuracy was already high. We show that Multiaccuracy Boost, which only accesses $f_{0}$ as a black-box, performs comparably and sometimes even better than very strong white-box alternatives which has full access to $f_{0}$ . These results are reported in Section 4.

In Section 4.1, we explore the gender detection example further, investigating some of the practical aspects of multiaccuracy auditing and post-processing. In particular, we observe that the representation of images used for auditing (and post-processing) matters; we show that auditing is more effective when using an embedding of the images that was trained using an unsupervised autoencoder compared to using the internal representation of the neural network used for prediction. These findings seem consistent with the guiding philosophy, put forth by [DHP+12], that maintaining “awareness” is paramount to detecting unfairness. We also show that the auditing process, which we use algorithmically as a way to boost the accuracy of the classifier, can also be used to help people understand why their prediction models are making mistakes. Specifically, the output of the multiaccuracy auditor can be used to produce examples of inputs where the predictor is erring significantly; this provides human interpretation for biases of the original classifier.

Setting and multiaccuracy

Let ${\cal X}$ denote the input space; we denote by $y:{\cal X}\to\left\{0,1\right\}$ the function that maps inputs to their label. Let ${\cal D}$ represent the validation data distribution supported on ${\cal X}$ ; the distribution ${\cal D}$ can be viewed as the “true" distribution, on which we will evaluate the accuracy of the final model. In particular, we assume that the important subpopulations are sufficiently represented on ${\cal D}$ (cf. Remark on data distribution). Our post-processing learner receives as input a small sample of labeled validation data $\left\{(x,y(x))\right\}$ , where $x\sim{\cal D}$ , as well as black-box access to an initial regression / classification model $f_{0}:{\cal X}\to$ . The goal is to output a new model (using calls to $f_{0}$ ) that satisfies the multiaccuracy fairness conditions (described below).

Importantly, we make no further assumptions about $f_{0}$ . Typically, we will think of $f_{0}$ as the output of a learning algorithm, trained on some other distribution ${\cal D}_{0}$ (also supported on ${\cal X}$ ); in this scenario, our goal is to mitigate any inadvertently-learned biases. That said, another important setting assumes that $f_{0}$ is chosen adversarially to discriminate against a protected population of individuals, while aiming to appear accurate and fair on the whole; here, we aim to protect subpopulations against malicious misclassification. The formal guarantees of multiaccuracy provide meaningful protections from both of these important forms of discrimination.

1 Multiaccuracy

The goal of multiaccuracy is to achieve low classification error, not just on ${\cal X}$ overall, but also on subpopulations of ${\cal X}$ . This goal is formalized in the following definition adapted from [HKRR18].

Let $\alpha\geq 0$ and let ${\cal C}\subseteq^{{\cal X}}$ be a class of functions on ${\cal X}$ . A hypothesis $f:{\cal X}\to$ is $({\cal C},\alpha)$ -multiaccurate if for all $c\in{\cal C}$ :

$({\cal C},\alpha)$ -multiaccuracy guarantees that a hypothesis appears unbiased according to a class of statistical tests defined by ${\cal C}$ . As an example, we could define the class in terms of a collection of subsets $S\subseteq{\cal X}$ , taking ${\cal C}$ to be $\chi_{S}$ (and its negation) for each subset in the collection; in this case, $({\cal C},\alpha)$ -multiaccuracy guarantees that for each $S$ , the predictions of $f$ are at most $\alpha$ -biased.

Ideally, we would hope to take ${\cal C}$ to be the class of all statistical tests. Requiring multiaccuracy with respect to such a ${\cal C}$ , however, requires learning the function $y(x)$ exactly, which is information-theoretically impossible from a small sample. In practice, if we take ${\cal C}$ to be a learnable class of functions, then $({\cal C},\alpha)$ -multiaccuracy guarantees accuracy on all efficiently-identifiable subpopulations.

For instance, if we took ${\cal C}$ to be the class of width- $4$ conjunctions, then multiaccuracy guarantees unbiasedness, not just on the marginal populations defined by race and separately gender, but by the subpopulations defined by the intersection of race, gender, and two other (possibly “unprotected") features. In particular, the subpopulations that multiaccuracy protects can be overlapping and include groups beyond traditionally-protected populations. This form of computationally-bounded intersectionality provides strong protections against forms of discrimination, like subset targeting, discussed in [DHP+12, HKRR18].

2 Classification accuracy from multiaccuracy

Multiaccuracy guarantees that the predictions of a classifier appear unbiased on a rich class of subpopulations; ideally though, we would state a guarantee in terms of the classification accuracy, not just the bias. Intuitively, as we take ${\cal C}$ to define a richer class of tests, the guarantees of multiaccuracy become stronger. This intuition is formalized in the following proposition.

Let $\hat{y}:{\cal X}\to\left\{-1,1\right\}$ as $\hat{y}(x)=1-2y(x)$ . Suppose that for $S\subseteq{\cal X}$ with $\operatorname*{\textnormal{\bf Pr}}_{x\sim{\cal D}}[x\in S]\geq\gamma$ , there is some $c\in{\cal C}$ such that $\operatorname*{\textnormal{\bf E}}_{x\sim{\cal D}}[\left|c(x)-\hat{y}_{S}(x)\right|]\leq\tau$ . Then if $f$ is $({\cal C},\alpha)$ -multiaccurate, $\textup{er}_{S}(f;y)\leq 2\cdot(\alpha+\tau)/\gamma$ .

That is, if there is a function in ${\cal C}$ that correlates well with the label function on a significant subpopulation $S$ , then multi-accuracy translates into a guarantee on the classification accuracy on this subpopulation.

Note that in our definition of multiaccuracy, we take an expectation over the distribution ${\cal D}$ of validation data. Ideally, ${\cal D}$ should reflect the true population distribution or could be aspirational, increasing the representation of populations who have experienced historical discrimination; for instance, the classification error guarantee of Proposition 1 improves as $\gamma$ , the density of the protected subpopulation $S$ , grows. Still, if we take the multiaccuracy error term $\alpha$ small enough, then we may still hope to improve the accuracy on less-represented subpopulations. Our experiments suggest that applying the multiaccuracy framework with an unbalanced valdiation distribution may still help improve the accuracy on underrepresented groups.

3 Auditing for multiaccuracy

With the definition of $({\cal C},\alpha)$ -multiaccuracy in place, a natural question to ask is how to test if a hypothesis $f$ satisfies the definition; further, if $f$ does not satisfy $({\cal C},\alpha)$ -multiaccuracy, can we update $f$ efficiently to satisfy the definition, while maintaining the overall accuracy? We will use a learning algorithm ${\cal A}$ to audit a classifier $f$ for multiaccuracy. The algorithm ${\cal A}$ receives a small sample from ${\cal D}$ and aims to learn a function $h$ that correlates with the residual function $f-y$ . In Section 3, we describe how to use such an auditor to solve the post-processing problem. This connection between subpopulation fairness and learning is also made in [KNRW17, HKRR18, KRR18], albeit for different tasks.

A special case of $({\cal A},\alpha)$ -multiaccuracy auditing uses a naive learning algorithm that iterates over statistical tests $c\in{\cal C}$ . Concretely, in our experiments, we audit with ridge regression and decision tree regression; both auditors are effective at identifying subpopulations on which the model is underperforming. Further, in the image recognition setting, we show that auditing can be used to produce interpretable synopses of the types of mistakes a predictor makes.

Post-processing for multiaccuracy

Here, we describe an algorithm, Multiaccuracy Boost, for post-processing a pre-trained model to achieve multiaccuracy. The algorithm is given black-box access to an initial hypothesis $f_{0}:{\cal X}\to$ and a learning algorithm ${\cal A}:({\cal X}\times)^{m}\to^{\cal X}$ , and for any accuracy parameter $\alpha>0$ , outputs a hypothesis $f:{\cal X}\to$ that passes $({\cal A},\alpha)$ -multiaccuracy auditing. The post-processing algorithm is an iterative procedure similar to boosting [FS95, SF12], that uses the multiplicative weights framework to improve suboptimal predictions identified by the auditor. This approach is similar to the algorithm given in [HKRR18] in the context of fairness and [TTV09] in the context of pseudorandomness. Importantly, we adapt these algorithms so that Multiaccuracy Boost exhibits what we call the “do-no-harm” guarantee; informally, if $f_{0}$ has low classification error on some subpopulation $S\subseteq{\cal X}$ identified by ${\cal A}$ , then the resulting classification error on $S$ cannot increase significantly. In this sense, achieving our notion of fairness need not adversely affect the utility of the classifier.

A key algorithmic challenge is to learn a multiaccurate predictor without overfitting to the small sample of validation data. In theory, we prove bounds on the sample complexity necessary to guarantee good generalization as a function of the class ${\cal C}$ , the error parameter $\alpha$ , and the size of subpopulations we wish to protect $\gamma$ . In practice, we need to balance the choice of ${\cal C}$ (or ${\cal A}$ ) and the number of iterations of our algorithm to make sure that the auditor is discovering true signal, rather than noise in the validation data. Indeed, if the auditor ${\cal A}$ learns an expressive enough class of functions, then our algorithm will start to overfit at some point; we show empirically that multiaccuracy post-processing improves the generalization error before overfitting. Next, we give an overview of the algorithm, and state its formal guarantees in Section 3.1.

At a high level, Multiaccuracy Boost starts by partitioning the input space ${\cal X}$ based on the initial classifier $f_{0}$ into ${\cal X}_{0}=\left\{x\in{\cal X}:f_{0}(x)\leq 1/2\right\}$ and ${\cal X}_{1}=\left\{x\in{\cal X}:f_{0}(x)>1/2\right\}$ ; note that we can partition ${\cal X}$ simply by calling $f_{0}$ . Partitioning the search space ${\cal X}$ based on the predictions of $f_{0}$ helps to ensure that the $f$ we output maintains the initial accuracy of $f_{0}$ ; in particular, it allows us to search over just the positive-labeled examples (negative, resp.) for a way to improve the classifier. The initial hypothesis may make false positive predictions and false negative predictions for very different reasons, even if in both cases the reason is simple enough to be identified by the auditor.

After partitioning the input space, the procedure iteratively uses the learning algorithm ${\cal A}$ to search over ${\cal X}$ (and within the partitions ${\cal X}_{0},{\cal X}_{1}$ ) to find any function which correlates significantly with the current residual in prediction $f-y$ . If ${\cal A}$ successfully returns some function $h:{\cal X}\to$ that identifies a significant subpopulation where the current hypothesis is inaccurate, the algorithm updates the predictions multiplicatively according to $h$ . In order to update the predictions simultaneously for all $x\in{\cal X}$ , at the $t$ th iteration, we build $f_{t+1}$ by incorporating $h_{t}$ into the previous model $f_{t}$ . This approach of augmenting the model at each iteration is similar to boosting. To guarantee good generalization of $h$ , we assume that ${\cal A}$ uses a fresh sample $D_{t}\sim{\cal D}^{m}$ per iteration. In practice, when we have few samples, we can put all of our samples in one batch and use noise-addition techniques to reduce overfitting [DFH+15, RZ16]; this connection to adaptive data analysis was studied formally in [HKRR18].

From the stopping condition, it is clear that when the algorithm terminates, $f_{T}$ passes $({\cal A},\alpha)$ -multiaccuracy auditing. Thus, it remains to bound the number of iterations $T$ before Multiaccuracy Boost terminates. Additionally, as described, the algorithm evaluates statistics like $\operatorname*{\textnormal{\bf E}}_{x\sim{\cal D}}[h(x)\cdot(f(x)-y(x))]$ , which depends on $y(x)$ for all $x\in{\cal X}$ ; we need to bound the number of validation samples we need to guarantee good generalization to the unseen population. Theorem 2 provides formal guarantees on the convergence rate and the sample complexity from ${\cal D}$ needed to estimate the expectations sufficiently-accurately.

The distinction between our approach and most prior works on fairness (especially [KNRW17]) is made clear from the “do-no-harm” property that Multiaccuracy Boost exhibits, stated formally as Theorem 3. In a nutshell, the property guarantees that on any subpopulation $S\subseteq{\cal X}$ that ${\cal A}$ audits, the classification error cannot increase significantly from $f_{0}$ to the post-processed classifier. Further, the bound we prove is very pessimistic. Both in theory and in practice, we do not expect any increase to occur. In particular, the convergence analysis of Multiaccuracy Boost follows by showing that at every update, the average cross-entropy loss on the population we update must drop significantly. Termination is guaranteed because after too many iterations of auditing, the post-processing will have learned $y$ perfectly. Thus, if we use Algorithm 3 to post-process a model that is already achieves high accuracy on the validation distribution the resulting model’s accuracy should not deteriorate in significant ways; empirically, we observe that classification accuracy (on held-out test set) tends to improve over ${\cal D}$ after multiaccuracy post-processing.

1 Formal guarantees of Multiaccuracy Boost

For clarity of presentation, we describe the formal guarantees of our algorithm assuming that ${\cal A}$ provably agnostic learns a class of tests ${\cal C}$ , in order to describe the sample complexity appropriately. The guarantees on the rate of convergence do not rely on such an assumption. We show that, indeed, Algorithm 3 must converge in a bounded number of iterations. The proof follows by showing that, for an appropriately chosen $\eta$ (on the order of $\alpha$ ), each update improves the cross-entropy loss over the updated set $S$ , so the bound depends on the initial cross-entropy loss.

To estimate the statistics used in Multiaccuracy Boost, we need to bound the sample complexity required for the auditor to generalize. Informally, we say the dimension $d({\cal C})$ of an agnostically learnable class ${\cal C}$ is a value such that $m\geq\Omega\left(\frac{d({\cal C})+\log(1/\delta)}{\alpha^{2}}\right)$ random samples from ${\cal D}$ guarantee uniform convergence over ${\cal C}$ with accuracy $\alpha$ with failure probability at most $\delta$ . Examples of upper bounds on this notion of dimension include $\log(\left|{\cal C}\right|)$ for a finite class of tests, the VC-dimension [KV94] for boolean tests, and the metric entropy [BLM13] of real-valued tests. We state the formal guarantees as Theorem 2.

Roughly speaking, for constant $\alpha,\delta$ , the sample complexity scales with the dimension of the class ${\cal C}$ . For many relevant classes ${\cal C}$ for which we would want to enforce $({\cal C},\alpha)$ -multiaccuracy, this dimension will be significantly smaller than the amount of data required to train an accurate initial $f_{0}$ . Note also that our sample complexity is completely generic and we make no effort to optimize the exact bound. In particular, for structured ${\cal C}$ and ${\cal A}$ , better uniform convergence bounds can be proved. Further, appealing to a recent line of work on adaptive data analysis initiated by [DFH+15, RZ16], we can avoid resampling at each iteration as in [HKRR18].

The do-no-harm property guarantees that the classification error on any subpopulation that ${\cal A}$ audits cannot increase significantly. As we assume ${\cal A}$ can identify a very rich class of overlapping sets, in aggregate, this property gives a strong guarantee on the utility of the resulting predictor. Further, the proof of Theorem 3 reveals that this worst-case bound is very pessimistic and can be improved with stronger assumptions.

Let $\alpha,\beta,\gamma>0$ and $S\subseteq{\cal X}$ be a subpopulation where $\operatorname*{\textnormal{\bf Pr}}_{x\sim{\cal D}}[x\in S]\geq\gamma$ . Suppose ${\cal A}$ audits the characteristic function $\chi_{S}(x)$ and its negation. Let $f:{\cal X}\to$ be the output of Algorithm 3 when given $f_{0}:{\cal X}\to$ , ${\cal A}$ , and $\alpha\leq\beta\gamma$ as input. Then the classification error of $f$ on the subset $S$ is bounded as

Experimental Evaluation

We evaluate the empirical performance of Multiaccuracy Boost in three case studies. The first and most in-depth case study aims to emulate the conditions of the Gender Shades study [BG18], to test the effectiveness of multiaccuracy auditing and post-processing on this important real-world example. In Section 4.1, we show experimental results for auditing using two different validation data sets. In particular, one data set is fairly unbalanced and similar to the data used to train, while the other data set was developed in the Gender Shades study and is very balanced. For each experiment, we report for various subpopulations, the population percentage in ${\cal D}$ , accuracies of the initial model, our black-box post-processed model, and white-box benchmarks. In Section 4.1.1, we discuss further subtleties of applying the multiaccuracy framework involving the representation of inputs passed to ${\cal A}$ for auditing; in Section 4.1.2, we show how auditing can be used beyond post-processing. In particular, the hypotheses that ${\cal A}$ learns can be used to highlight subpopulations – in an interpretable way – on which the model is making mistakes.

We evaluate the effectiveness of multiaccuracy post-processing on two other prediction tasks. In both these case studies, we take the training and auditing data distribution ${\cal D}$ to be the same, though the number of examples used for auditing will be quite small. Multiaccuracy still improves the performance on significant subpopulations. These results suggest some interesting hypotheses about why machine-learned models may incorporate biases in the first place, which warrant further investigation.

In this case study, we replicate the conditions of the Gender Shades study [BG18] to evaluate the effectiveness of the multiaccuracy framework in a realistic setting. For our initial model, we train an inception-resnet-v1 [SIVA17] gender classification model using the CelebA data set with more than 200,000 face images [LLWT15]. The resulting test accuracy on CelebA for binary gender classification is 98.4%.

To test the initial performance of $f_{0}$ , we evaluated on a random subset of the LFW+a data containing 6,880 face images, each of which is labeled with both gender and race – black (B) and non-black (N). For gender classification on LFW+a, $f_{0}$ achieves 94.4% accuracy. Even though the overall accuracy is high, the error rate is much worse for females (23.1%) compared to males (0.7%) and worse for blacks (10.2%) compared to non-blacks (5.1 %); these results are qualitatively very similar to those observed by the commercial gender detection systems studied in [BG18]. We applied Multiaccuracy Boost, which converged in 7 iterations. The resulting classifier’s classification error in minority subpopulations was substantially reduced, even though the auditing distribution was similar as the training distribution.

We compare Multiaccuracy Boost against a strong white-box baseline. Here, we retrain the network of $f_{0}$ using the audit set. Specifically, we retrain the last two layers of the network, which gives the best results amongst retraining methods. We emphasize that this baseline requires white-box access to $f_{0}$ , which is often infeasible in practice. Multiaccuracy Boost accesses $f_{0}$ only as a black-box without any additional demographic information, and still achieves comparable, if not improved, error rates compared to retraining. We report the overall classification accuracy as well as accuracy on different subpopulations – e.g. BF indicates black female – in Table 1.

The second face dataset, PPB, in addition to being more balanced, is much smaller; thus, this experiment can be viewed as a stress test, evaluating the data efficiency of our post-processing technique. The test set has 415 individuals and the audit set has size 855. PPB annotates each face as dark (D) or light-skinned (L). As with LFW+a, we evaluated the test accuracy of the original $f_{0}$ , the multiaccurate post-processed classifier, and retrained classifier on each subgroup. Multiaccuracy Boost converged in 5 iterations and again, substantially reduced error despite a small audit set and the lack of annotation about race or skin color (Table 2).

To further test the data efficiency of Multiaccuracy Boost, we evaluate the effect of audit set size on the resulting accuracy of each method. In Fig. 1, we report the performance of Multiaccuracy Boost versus the white-box retraining method for different sizes of audit set. The plot displays the difference in accuracy for the overall population along with the subgroups that suffered the most initial bias. It shows that the performance of Multiaccuracy Boost may actually be better than the white-box retraining baseline when validation data is especially scarce.

As discussed earlier, in the reported gender detection experiments, we audit for multiaccuracy using ridge regression over an encoding of images produced by a variational autoencoder. Using the representation of images produced by this encoding intuitively makes sense, as the autoencoder’s reconstruction objective aims to preserve as much information about the image as possible while reducing the dimension. Still, we may wonder whether multiaccuracy auditing over a different representation of the images would perform better. In particular, since we are interested in improving the accuracy on the gender detection task, it seems plausible that a representation of the images based on the internal layers of the initial prediction network might preserve the information salient to gender detection more effectively.

Collectively, these findings bolster the argument for “fairness through awareness”, which advocates that in order to make fair predictions, sensitive information (like race or gender) should be given to the (trustworthy) classifier. While none of these representations explicitly encode sensitive group information, the VAE representation does preserve information about the original input, for instance skin color, that seems useful in understanding the group status. The prediction network is trained to have the best prediction accuracy (on an unbalanced training data set), and thus, the representations from the network reasonably may contain less information about these sensitive features. These results suggest that the effectiveness of multiaccuracy does depend on the representation of inputs used for auditing, but so long as the representation is sufficiently expressive, Multiaccuracy Boost may be robust to the exact encoding of the features.

1.2 Multiaccuracy auditing as diagnostic

As was shown in [BG18], we’ve demonstrated that models trained in good faith on unbalanced data may exhibit significant biases on the minority populations. For instance, the initial classification error on black females is significant, whereas on white males, it is near . Importantly, the only way we were able to report these accuracy disparities was by having access to a rich data set where gender and race were labeled. Often, this demographic information will not be available; indeed, the CelebA images are not labeled with race information, and as such, we were unable to evaluate the subpopulation classification accuracy on this set. Thus, practitioners may be faced with a problem: even if they know their model is making undesirable mistakes, it may not be clear if these mistakes are concentrated on specific subpopulations. Absent any identification of the subpopulations on which the model is underperforming, collecting additional training data may not actually improve performance across the board.

We demonstrate that multiaccuracy auditing may serve as an effective diagnostic and interpretation tool to help developers identify systematic biases in their models. The idea is simple: the auditor returns a hypothesis $h$ that essentially “scores” individual inputs $x$ by how wrong the prediction $f_{0}(x)$ is. If we consider the magnitude of their scores $\left|h(x)\right|$ , then we may understand better the biases that the encoder is discovering.

2 Additional case studies

Multiaccuracy auditing and post-processing is applicable broadly in supervised learning tasks, not just in image classification applications. We demonstrate the effectiveness of Multiaccuracy Boost in two other settings: the adult income prediction task and a semi-synthetic disease prediction task.

For the first case study, we utilize the adult income prediction data set [Koh96] with 45,222 samples and 14 attributes (after removing subjects with unknown attributes) for the task of binary prediction of income more than $50k for the two major groups of Black and White. We remove the sensitive features of gender – female (F) and male (M) and race (for the two major groups) – black (B) and white (W) – from the data, to simulate settings where sensitive features are not available to the algorithm training. We trained a base algorithm,$ f_{0}$, which is a neural network with two hidden layers on 27,145 randomly selected individuals. The test set consists of an independent set of 15,060 persons.

We audit using a decision tree regression model (max depth $5$ ) ${\cal A}_{\text{dt}}$ to fit the residual $f(x)-y(x)$ . ${\cal A}_{\text{dt}}$ receives samples of validation data drawn from the same distribution as training; that is ${\cal D}={\cal D}_{0}$ . In particular, we post-process with 3,017 individuals sampled from the same adult income dataset (disjoint from the training set of $f_{0}$ ). The auditor is given the same features as the original prediction model, and thus, is not given the gender or race of any individual. We evaluate the post-processed classifier on the same independent test set. Multiaccuracy Boost converges in 50 iterations with $\eta=1$ .

As a baseline, we trained four separate neural networks with the same architecture as before (two hidden layers) for each of the four subgroups using the audit data. As shown in Table 4, multiaccuracy post-processing achieves better accuracy both in aggregate and for each of the subgroups. Importantly, the subgroup-specific models requires explicit access to the sensitive features of gender and race. Training a classifier for each subgroup, or explicitly adding subgroup accuracy into the training objective, assumes that the subgroup is already identified in the data. This is not feasible in the many applications where, say, race or more granular categories are not given. Even when the subgroups are identified, we often do not have enough samples to train accurate classifiers on each subgroup separately. This example illustrates that multiaccuracy can help to boost the overall accuracy of a black-box predictor in a data efficient manner.

2.1 Semi-Synthetic Disease Prediction

We design a disease prediction task based on real individuals, where the phenotype to disease relation is designed to be different for different subgroups, in order to simulate a challenging setting. We used 40,000 individuals sampled from the UK Biobank [SGA+15]. Each individual contains 60 phenotype features. To generate a synthetic disease outcome for each subgroup, we divided the data set into four groups based on gender – male (M) and female (F) – and age – young (Y) and old (O). For each subgroup, we create synthetic binary labels using a different polynomial function of the input features with different levels of difficulty. The polynomial function orders are 1, 4, 2, and 6 for OF, OM, YF, and YM subgroups respectively.

For $f_{0}$ , we trained a neural network with two hidden layers on 32,000 individuals, without using the gender and age features. Hyperparameter search was done for the best weight-decay and drop-out parameters. The $f_{0}$ we discover performs moderately well on every subpopulation, with the exception of old females (OF) where the classification error is significantly higher. Note that this subpopulation had the least representation in ${\cal D}_{0}$ . Again, we audit using ${\cal A}_{\text{dt}}$ to run decision tree regression with validation data samples drawn from ${\cal D}={\cal D}_{0}$ . Specifically, the auditor receives a sample of 4,000 individuals without the gender or age features. As a baseline, we trained a separate classifier for each of the subgroups using the same audit data. As Table 5 shows, Multiaccuracy Boost significantly lowers the classification error in the old female population.

Discussion

In this work, we propose multiaccuracy auditing and post-processing as a method for improving the fairness and accountability of black-box prediction systems. Here, we discuss how our work compares to prior works, specifically, how it fits into the growing literature on fairness for learning systems. We conclude with further discussion of our results and speculation about future investigations.

Many different notions of fairness have been proposed in literature on learning and classification [DHP+12, HPS16, ZWS+13, DIKL17, HKRR18, KNRW17, HSNL18, KRR18, RY18]. Many of these works encode some notion of parity, e.g. different subgroups should have similar false positive rates, as an explicit objective/constraint in the training of the original classifier. The fairness properties are viewed as constraints on the classifier that ultimately limit the model’s utility. A common belief is that in order to achieve equitable treatment for protected subpopulations, the performance on other subpopulations necessarily must degrade.

A notable exception to this pattern is the work of Hébert-Johnson et al. [HKRR18], which introduced a framework for achieving fairness notions that aim to provide accurate predictions for many important subpopulations. [HKRR18] introduced the notion of multiaccuracy[HKRR18] refers to this notion as “multi-accuracy-in-expectation”. and a stronger notion, dubbed multicalibration, in the context of regression tasks. Multicalibration guarantees (approximately) calibrated predictions, not just overall, but on a rich class of structured “identifiable" subpopulations. [HKRR18] provides theoretical algorithms for achieving multiaccuracy and multicalibration, and shows how to post-process a model to achieve multicalibration in a way that improves the regression objective across all subpopulations (in terms of squared-error). Our work directly extends the approach of [HKRR18], adapting their work to the binary classification setting. Our post-processing algorithm, Multiaccuracy Boost, builds on the algorithm given in [HKRR18], providing the additional “do-no-harm” property. This property guarantees that if the initial predictor $f_{0}$ has small classification error on some identifiable group, then the resulting post-processed model will also have small classification error on this group.

Independent work of Kearns et al. [KNRW17] also investigated how to achieve statistical fairness guarantees, not just for traditionally-protected groups, but on rich families of subpopulations. [KNRW17] proposed a framework for auditing and learning models to achieve fairness notions like statistical parity and equal false positive rates. Both works [HKRR18, KNRW17] connect the task of learning a model that satisfies the notion of fairness to the task of (weak) agnostic learning [Kea98, KSS94, KMV08, Fel10]. [KNRW17] also reduces the problem of learning a classifier satisfying parity-based notions of fairness across subgroups to the problem of auditing; it would be interesting if their notion of auditing can be used by humans as a way to diagnose systematic discrimination.

Our approach to post-processing, which uses a learning algorithm as a fairness auditor, is similar in spirit to the approach to learning of [KNRW17], but differs technically in important ways. In particular, in the framework of [KNRW17], the auditor is used during (white-box) training to constrain the model selected from a pre-specified hypothesis class; ultimately, this constrains the accuracy of the predictions. In our setting (as in [HKRR18]), we do not restrict ourselves to an explicitly-defined hypothesis class, so we can augment the current model using the auditor; these augmentations improve the accuracy of the model.

Indeed, at a technical level, our post-processing algorithm is most similar to work on boosting [FS95, SF12], specifically, gradient boosting [MBBF00, Fri01]. Still, our perspective is quite different from the standard boosting setting. Rather than using an expressive class of predictors as the base classifiers to be able to learn the function directly, our setting focuses on the regime where data is limited and we must restrict our attention to simple classes. Thus, it becomes important that we leverage the expressiveness (and initial accuracy) of $f_{0}$ if we are to obtain strong performance using the multiaccuracy approach. Further, the termination of Multiaccuracy Boost certifies that the final model satisfies $({\cal A},\alpha)$ -multiaccuracy; in general, standard boosting algorithms will not provide such a certificate.

Motivated by unfairness that arises as the result of feedback loops in classification settings, another recent work of Hashimoto et al. [HSNL18] aims to improve fairness at a subpopulation level. Specifically, their notion of fairness similarly aims to give accurate (i.e. bounded loss) predictions not just overall, but on all significant subpopulations. In the multiaccuracy setting, we argued that this goal was information-theoretically infeasible; [HSNL18] sidesteps this impossibility by optimizing over a fixed hypothesis class, and formulating the problem as a min-max optimization. They give show how to relax the problem of minimizing the worst-case subpopulation loss and reduce the relaxation to a certain robust optimization problem. While their approach does not guarantee optimality, it gives a strong certificate upper-bounding the maximum loss over all subpopulations.

A different approach to subgroup fairness is studied by Dwork et al. [DIKL17]. This work investigates the question of how to learn a “decoupled” classifier, where separate classifiers are learned for each subgroup and then combined to achieve a desired notion of fairness. While applicable in some settings, at times, this approach may be untenable. First, decoupling the classification problem requires that we have race, age, and other attributes of interest in the dataset and that the groups we wish to protect are partitioned by these attributes; this information is often not available. Even if this information is available, a priori, it may not always be obvious which subpopulations require special attention. In contrast, the multiaccuracy approach allows us to protect a rich class of overlapping subpopulations without explicit knowledge of the vulnerable populations. An interesting direction for future investigation could try to pair multiaccuracy auditing (to identify subpopulations in need of protection) with the decoupled classification techniques of [DIKL17].

The present work, along with [HKRR18, KNRW17, KRR18], can be viewed as studying information-fairness tradeoffs in prediction tasks (i.e. strengthening the notion of fairness that can be guaranteed using a small sample). These works fit into the larger literature on fairness in learning and prediction tasks [DHP+12, ZWS+13, BG18, HPS16, DIKL17, KRR18, RY18], discussions of the utility-fairness tradeoffs in fair classification [ALMK16, KMR17, Cho17, CG16, CDPF+17, PRW+17]. While fairness and accountability serve as the main motivations for developing the multiaccuracy framework, our results may have broader interest. In particular, multiaccuracy post-processing may be applicable in domain adaptation settings, particularly under label distribution shift as studied recently in [LWS18], but when the learner gets a small number of labeled samples from the new distribution.

2 Conclusion

The multiaccuracy framework can be applied very broadly; importantly, we can post-process any initial model $f_{0}$ given only black-box access to $f_{0}$ and a small set of labeled validation data. We show that in a wide range of realistic settings, post-processing for multiaccuracy helps to mitigate systematic biases in predictors across sensitive subpopulations, even when the identifiers for these subpopulations are not given to the auditor explicitly. In our experiments, we observe that standard supervised learning optimizes for overall performance, leading to settings where certain subpopulations incur substantially worse error rates. Multiaccuracy provides a framework for fairness in classification by improving the accuracy in identifiable subgroups, in a way that suffers no tradeoff between accuracy and utility. We demonstrate – both theoretically and empirically – that post-processing with Multiaccuracy Boost serves as an effective tool for improving the accuracy across important subpopulations, and does not harm the populations that are already classified well.

Multiaccuracy works to the extent that the auditor can effectively identify specific subgroups where the original classifier $f_{0}$ tends to make mistakes. The power of multiaccuracy lies in the fact that in many settings, we can identify issues with $f_{0}$ using a relatively small amount of audit data. Thus, multiaccuracy auditing is limited: if the mistakes appear overly-complicated to the bounded auditor (for information- or complexity-theoretic reasons), then the auditor will not be able to identify these mistakes. Our empirical results suggest, however, that in many realistic settings, the subpopulations on which a classifier errs are efficiently-identifiable. This observation may be of interest beyond the context of fairness. In particular, our experiments improving the accuracy of a model trained on CelebA on the LFW+a and PPB test sets suggests a lightweight black-box alternative to more sophisticated transfer learning techniques, which may warrant further investigation.

Our empirical investigations reveal some additional interesting aspects of the multiaccuracy framework. In particular, we’ve shown that multiaccuracy auditing can identify underrepresented groups receiving suboptimal predictions even when the sensitive attributes defining these groups are not explicitly given to the auditor, which proves useful for diagnosing where models make mistakes. We feel that it may be of further interest within the study of model interpretability. Finally, it is striking that Multiaccuracy Boost tends to improve, not just subgroup accuracy, but also the overall accuracy, even when the minority groups remain underrepresented in the validation data. While some of these findings may be due to suboptimal training of our initial models, we believe this is not the only factor at play. In particular, we hypothesize that understanding why models incorporate biases during training – and further, why simple interventions like multiaccuracy post-processing can significantly improve generalization error – requires investigating the dynamics of overfitting during training, not just on the population as a whole, but across significant subpopulations.

The authors thank Omer Reingold and Guy N. Rothblum for their advice and helpful discussions throughout the development of this work; we thank Weihao Kong, Aditi Raghunathan, and Vatsal Sharan for feedback on early drafts of this work.

References

We use the inner product $\langle h,g\rangle=\operatorname*{\textnormal{\bf E}}_{x\sim{\cal D}}[h(x)\cdot g(x)]$ and the $p$ -norms $\left\|h\right\|_{p}=\left(\operatorname*{\textnormal{\bf E}}_{x\sim{\cal D}}[\left|h(x)\right|^{p}]\right)^{1/p}$ .

Appendix A Multiaccuracy and classification error

Let $\hat{y}:{\cal X}\to\left\{-1,1\right\}$ as $\hat{y}(x)=1-2y(x)$ . Suppose that for $S\subseteq{\cal X}$ with $\operatorname*{\textnormal{\bf Pr}}_{x\sim{\cal D}}[x\in S]\geq\gamma$ , there is some $c\in{\cal C}$ such that $\left\|c-\hat{y}_{S}\right\|_{1}\leq\tau$ . Then if $f$ is $({\cal C},\alpha)$ -multiaccurate, $\textup{er}_{S}(f;y)\leq 2\cdot(\alpha+\tau)/\gamma$ .

For $i,j\in\left\{0,1\right\}$ , let $S_{ij}=\left\{x\in S:y(x)=i\land\bar{f}(x)=j\right\}$ . Further denote $\beta_{ij}=\operatorname*{\textnormal{\bf Pr}}_{x\sim{\cal D}}[x\in S_{ij}]$ . Note that the classification error on a set $S$ is $\textup{er}_{S}(f;y)\leq(\beta_{01}+\beta_{10})/\gamma$ .

Let $\hat{y}(x)=1-2y(x)$ and suppose $c(x)=\hat{y}(x)_{S}+z(x)$ where $\left\|\delta\right\|_{1}\leq\tau$ . Then, we derive the following inequality.

where (6) follows by Hölder’s inequality, from the fact that the contribution to the expectation of $(1-2y(x))\cdot(f(x)-y(x))$ from $S_{00}$ and $S_{11}$ is lower bounded by , and by the definition $\hat{y}_{S}(x)=0$ for $x\not\in S$ . Further, because we know any $x\in S_{01}\cup S_{10}$ is misclassified, we can lower bound the contribution by $1/2$ . Thus, if $\operatorname*{\textnormal{\bf E}}_{x\sim{\cal D}}[c(x)\cdot(f(x)-y(x))]\leq\alpha$ , then by rearranging we conclude

Let $\alpha,\beta,\gamma>0$ and $S\subseteq{\cal X}$ be a subpopulation where $\operatorname*{\textnormal{\bf Pr}}_{x\sim{\cal D}}[x\in S]\geq\gamma$ . Suppose for ${\cal A}$ audits the characteristic function $\chi_{S}(x)$ and its negation. Let $f:{\cal X}\to$ be the output of Algorithm 3 when given $f_{0}:{\cal X}\to$ , ${\cal A}$ , and $0<\alpha\leq\beta\gamma$ as input. Then the classification error of $f$ on the subset $S$ is bounded as

Suppose that $\textup{er}_{S}(f_{0};y)\leq\tau$ . Consider $S_{1}=\left\{x\in S:f_{0}(x)>1/2\right\}$ ; suppose $\textup{er}_{S_{1}}(f_{0};y)=\tau_{1}$ . By assumption, $-\chi_{S}(x)$ is audited on ${\cal X}_{1}$ . Consider $\operatorname*{\textnormal{\bf E}}_{x\sim S_{1}}[-\chi_{S}(x)\cdot(f(x)-y(x))]$ .

where (12) follows from applying Hölder’s inequality and the assumption that $\textup{er}_{S_{1}}(f_{0};y)=\tau_{1}$ ; and (13) follows from lower bounding the contribution to the expectation based on the true label and the predicted label. Note that $\operatorname*{\textnormal{\bf Pr}}_{x\sim S}[x\in S_{1}]\cdot\operatorname*{\textnormal{\bf E}}_{x\sim S_{1}}[y(x)-f(x)]\leq\alpha/\gamma=\beta$ by the fact that $f$ passes multiaccuracy auditing by ${\cal A}$ and the assumption that $\operatorname*{\textnormal{\bf Pr}}_{x\sim{\cal D}}[x\in S]\geq\gamma$ . Rearranging gives the following inequality

where the additional $\tau_{1}$ comes from accounting for the false positives.

A similar argument holds for $S_{0}$ with $\textup{er}_{S_{0}}(f_{0};y)=\tau_{0}$ , using $\chi_{S}(x)$ . We can expand $\textup{er}_{S}(f;y)$ as a convex combination of the classification error over $S_{0}$ and $S_{1}$ .

by the fact that $S$ is partitioned into $S_{0}$ and $S_{1}$ and $\tau$ is a corresponding convex combination of $\tau_{0}$ and $\tau_{1}$ . ∎

Appendix B Analysis of Algorithm 3

Here, we analyze the sample complexity and running time of Algorithm 3.

We essentially assume the sample complexity issues away by working with the notion of dimension. We give an example proof outline of a standard uniform convergence argument using metric entropy as in [BLM13].

Suppose ${\cal C}\subseteq^{\cal X}$ has $\varepsilon$ -covering number $N_{\varepsilon}={\cal N}(\varepsilon,{\cal C},\left\|\cdot\right\|_{1})$ . Then, with probability at least $1-\delta$ ,

The lemma follows from a standard uniform convergence argument. First, observe that because every $c:{\cal X}\to$ and $y\in\left\{0,1\right\}$ that the empirical estimate using $m$ samples has sensitivity $1/m$ . Thus, we can apply McDiarmid’s inequality to show concentration of the following statistic.

Then, using a standard covering argument, for $N={\cal N}(\varepsilon,{\cal C},\left\|\cdot\right\|_{1})$ the $\varepsilon$ -covering number, we can bound the deviation with high probability. Specifically, taking $O\left(\frac{\log(N/\delta)}{\alpha^{2}}\right)$ samples guarantees that the empirical estimate for each $c\in{\cal C}$ will be within $O(\alpha)$ with probability at least $1-\delta$ . Taking $\delta$ small enough to union bound against every iteration and adjusting constants shows gives the lemma. ∎

Note that this analysis is completely generic, and more sophisticated arguments may improve the resulting bounds that leverage structure in the specific ${\cal C}$ of interest.

B.2 Convergence analysis

We will track progress of Algorithm 3 by tracking the expected cross-entropy loss. We show that every update makes the expected cross-entropy loss decrease significantly. As the loss is bounded below by , then positive progress at each iteration combined with an upper bound on the initial loss gives the convergence result.

Note that when we estimate the statistical queries from data, we only have access to approximate answers. Thus, per the sample complexity argument above, we assume that each statistical query is $\alpha/4$ -accurate. Further, we will update $f_{t}$ if we find an update $c_{t}$ where $\langle c_{t},f-y\rangle\geq 3\alpha/4$ . Thus, at convergence, it should be clear that the resulting hypothesis will be $({\cal C},\alpha)$ -multiaccurate. The goal is to show that this way, Multiaccuracy Boost converges quickly.

We state this lemma in terms of a class ${\cal C}$ but the proof reveals that any nontrivial update that ${\cal A}$ returns suffices to make progress.

Recall $f_{t}(x)=\frac{q_{t}(x)}{1+q_{t}(x)}$ , so $1-f_{t}(x)=\frac{1}{1+q_{t}(x)}$ . Thus, we can rewrite (23) as follows.

where (25) follows by the multiplicative weights update rule implies $q_{t+1}(x)=e^{-\eta c_{t}(x)}q_{t}(x)$ for $x\in S_{t}$ . Next, we expand the final logarithmic term.

where (27) follows by upper bounding the Taylor series approximation for $e^{z}$ for $z\geq-1$ ; and (29) follows by the fact that $f_{t}(x)\in$ . Combining the expressions, we can simplify as follows.

When we update $x\in{\cal X}$ simultaneously according to $c$ , we can express the change in expected cross-entropy as follows.

where (35) follows from the fact that we assumed that our estimates of the statistical queries were $\alpha/4$ -accurate and that we update based on $c_{t}$ if $\langle c_{t},f-y\rangle$ is at least $3\alpha/4$ according to our estimates. Thus, taking $\eta=\alpha/4$ , then we see the change in expected cross-entropy over ${\cal X}$ is at least $\alpha^{2}/16$ , which shows the lemma. ∎

Appendix C Linear convergence from gradient learning

We claim that this auditor learns the partial derivative function in a way that guarantees linear convergence.

Using this inequality, we can bound the inner product between $h_{f}$ and $f-y$ .

Thus, using the analysis of the multiplicative weights update from Section B, we can see that the progress in cross-entropy can be bounded as

Rearranging and taking $\eta=\frac{1}{8L}$ , we arrive at the following inequality that implies linear convergence.