Fairness Through Computationally-Bounded Awareness
Michael P. Kim, Omer Reingold, Guy N. Rothblum
Introduction
More and more, machine learning systems are being used to make predictions about people. Algorithmic predictions are now being used to answer questions of significant personal consequence; for instance, Is this person likely to repay a loan? or Is this person likely to recommit a crime? . As these classification systems have become more ubiquitous, concerns have also grown that classifiers obtained via machine learning might discriminate based on sensitive attributes like race, gender, or sexual orientation. Indeed, machine-learned classifiers run the risk of perpetuating or amplifying historical biases present in the training data. Examples of discrimination in classification have been well-illustrated ; nevertheless, developing a systematic approach to fairness has been challenging. Often, it feels that the objectives of fair classification are at odds with obtaining high-utility predictions.
In an influential work, Dwork et al. proposed a framework to resolve the apparent conflict between utility and fairness, which they call “fairness through awareness." This framework takes the perspective that a fair classifier should treat similar individuals similarly. The work formalizes this abstract goal by assuming access to a task-specific similarity metric on pairs of individuals. The proposed notion of fairness requires that if the distance between two individuals is small, then the predictions of a fair classifier cannot be very different. More formally, for some small constant , we say a hypothesis satisfies -metric fairnessNote the definition given in is slightly different; in particular, they propose a more general Lipschitz condition, but fix . if the following (approximate) Lipschitz condition holds for all pairs of individuals from the population .
Subject to these intuitive similarity constraints, the classifier may be chosen to maximize utility. Note that, in general, the metric may be designed externally (say, by a regulatory agency) to address legal and ethical concerns, independent from the task of learning. In particular, in certain settings, the metric designers may have access to a different set of features than the learner. For instance, perhaps the metric designers have access to sensitive attributes, but for legal, social, or pragmatic reasons, the learner does not. In addition to its conceptual simplicity, the modularity of fairness through awareness makes it a very appealing framework. Currently, there are many (sometimes contradictory) notions of what it means for a classifier to be fair , and there is much debate on which definitions should be applied in a given context. Discrimination comes in many forms and classification is used in a variety of settings, so naturally, it is hard to imagine any universally-applicable definition of fairness. Basing fairness on a similarity metric offers a flexible approach for formalizing a variety of guarantees and protections from discrimination.
Still, a challenging aspect of this approach is the assumption that the similarity metric is known for all pairs of individuals.Indeed, identifies this assumption as “one of the most challenging aspects” of the framework. Deciding on an appropriate metric is itself a delicate matter and could require human input from sociologists, legal scholars, and specialists with domain expertise. For instance, in the loan repayment example, a simple, seemingly-objective metric might be a comparison of credit scores. A potential concern, however, is that these scores might themselves be biased (i.e. encode historical discriminations). In this case, a more nuanced metric requiring human input may be necessary. Further, if the metric depends on features that are latent to the learner (e.g. some missing sensitive feature) then the metric could appear arbitrarily complex to the learner. As such, in many realistic settings, the resulting metric will not be a simple function of the learner’s feature vectors of individuals.
In most machine learning applications, where the universe of individuals is assumed to be very large, even writing down an appropriate metric could be completely infeasible. In these cases, rather than require the metric value to be specified for all pairs of individuals, we could instead ask a panel of experts to provide similarity scores for a small sample of pairs of individuals. While it is information-theoretically impossible to guarantee metric fairness from a sampling-based approach, we still might hope to provide a strong, provable notion of fairness that maintains the theoretical appeal and practical modularity of the fairness through awareness framework.
In this work, we propose a new theoretical framework for fair classification based on fairness through awareness – which we dub “fairness through computationally-bounded awareness” – that eliminates the considerable issue of requiring the metric to be known exactly. Our approach maintains the simplicity and flexibility of fairness through awareness, but provably only requires a small number of random samples from the underlying metric, even though we make no structural assumptions about the metric. In particular, our approach works even if the metric provably cannot be learned. Specifically, our notion will require that a fair classifier treat similar subpopulations of individuals similarly, in a sense we will make formal next. While our definition relaxes fairness through awareness, we argue that it still protects against important forms of discrimination that the original work aimed to combat; further, we show that stronger notions necessarily require a larger sample complexity from the metric. As in , we investigate how to learn a classifier that achieves optimal utility under similarity-based fairness constraints, assuming a weaker model of limited access to the metric. We give positive and negative results that show connections between achieving our fairness notion and learning.
Setting and metric multifairness
The focus on linear functions is not too restrictive; in particular, by increasing the dimension to , we can learn any degree- polynomial function of the original features. By increasing , we can approximate increasingly complex functions.
1 Metric multifairness
We define our relaxation of metric fairness with respect to a rich class of statistical tests on the pairs of individuals. Let a comparison be any subset of the pairs of . Our definition, which we call metric multifairness, is parameterized by a collection of comparisons and requires that a hypothesis appear Lipschitz according to all of the statistical tests defined by the comparisons .
Let be a collection of comparisons and let be a metric. For some constants , a hypothesis is -metric multifair if for all ,
To begin, note that metric multifairness is indeed a relaxation of metric fairness; if we take the collection to be the collection of all pairwise comparisons, then -metric multifairness is equivalent to -metric fairness.
In order to achieve metric multifairness from a small sample from the metric, however, we need a lower bound on the density of each comparison in ; in particular, we can’t hope to enforce metric fairness from a small sample. For some , we say that a collection of comparisons is -large if for all , . A natural next choice for would be a collection of comparisons that represent the Cartesian products between traditionally-protected groups, defined by race, gender, etc. In this case, as long as the minority populations are not too small, then a random sample from the metric will concentrate around the true expectation, and we could hope to enforce this statistical relaxation of metric fairness. While this approach is information-theoretically feasible, its protections are very weak.
To highlight this weakness, suppose we want to predict the probability individuals will repay a loan, and our metric is an adjusted credit score. Even after adjusting scores, two populations (say, defined by race) may have large average distance because overall has better credit than ; still, within and , there may be significant subpopulations and that should be treated similarly (possibly representing the qualified members of each group). In this case, a coarse statistical relaxation of metric fairness will not require that a classifier treat and similarly; instead, the classifier could treat everyone in better than everyone in – including treating unqualified members of better than qualified members of . Indeed, the weaknesses of broad-strokes statistical definitions served as motivation for the original work of . We would like to choose a class that strengthens the fairness guarantees of metric multifairness, but maintains its efficient sample complexity.
While we can define metric multifairness with respect to any collection , typically, we will think of as a rich class of overlapping subsets; equivalently, we can think of the collection as an expressive class of boolean functions, where for , if and only if . In particular, should be much more expressive than simply defining comparisons across traditionally-protected groups. The motivation for choosing such an expressive class is exemplified in the following proposition.
Suppose there is some , such that . Then if is -metric multifair, then satisfies -metric fairness for at least a -fraction of the pairs in .
That is, if there is some subset that identifies a set of pairs whose metric distance is small, then any metric multifair hypothesis must also satisfy the stronger individual metric fairness notion on many pairs from . This effect will compound if many different (possibly overlapping) comparisons are identified that have small average distance. We emphasize that these small-distance comparisons are not known before sampling from the metric; indeed, this would imply the metric was (approximately) known a priori. Still, if the class is rich enough to correlate well with various comparisons that reveal significant information about the metric, then any metric multifair hypothesis will satisfy individual-level fairness on a significant fraction of the population!
While increasing the expressiveness of increases the strength of the fairness guarantee, in order to learn from a small sample, we cannot choose to be arbitrarily complex. Thus, in choosing we must balance the strength of the fairness guarantee with the information bottleneck in accessing through random samples. Our resolution to these competing needs is complexity-theoretic: while information-theoretically, we can’t hope to ensure fair treatment across all subpopulations, we can hope ensure fair treatment across efficiently-identifiable subpopulations. For instance, if we take to be a family defined according to some class of computations of bounded dimension – think, the set of conjunctions of a constant number of boolean features or short decision trees – then we can hope to accurately estimate and enforce the metric multifairness conditions. Taking such a bounded ensures that a hypothesis will be fair on all comparisons identifiable within this computational bound. This is the sense in which metric multifairness provides fairness through computationally-bounded awareness.
2 Learning model
Throughout, our goal is to learn a hypothesis from noisy samples from the metric that satisfies multifairness. Specifically, we assume an algorithm can obtain a small number of independent random metric samples where is drawn at random over the distribution of pairs of individuals, and is a random variable of bounded magnitude with .
We emphasize that this is a very limited access model. As Theorem 2 shows we achieve -metric multifairness from a number of samples that depends logarithmically on the size of independent of the complexity of the similarity metric.Alternatively, for continuous classes of , we can replace with some notion of dimension (VC-dimension, metric entropy, etc.) through a uniform convergence argument. Recall that can be an arbitrary symmetric function; thus, the learner does not necessarily have enough information to learn . Still, for exponentially-sized , the learner can guarantee metric multifairness from a polynomial-sized sample, and the strength of the guarantee will scale up with the complexity of (as per Propsition 1).
In order to ensure a strong notion of fairness, we assume that the subpopulations we wish to protect are well-represented in the pairs drawn from . This assumption, while important, is not especially restrictive, as we think of the metric samples as coming from a regulatory committee or ethically-motivated party; in other words, in practical settings, it is reasonable to assume that one can choose the metric sampling distribution based on the notion of fairness one wishes to enforce.
Label access.
When we learn linear families, our goal will be to learn from a sample of labeled examples. We assume the algorithm can ask for independent random samples .
Measuring optimality.
To evaluate the utility guarantees of our learned predictions, we take a comparative approach. Suppose is a collection of comparisons. For , we say a hypothesis is -optimal with respect to , if
where is an optimal -metric multifair hypothesis.
Learning a metric multifair hypothesis
As in , we formulate the problem of learning a fair set of predictions as a convex program. Our objective is to minimize the expected loss , subject to the multifairness constraints defined by .For the sake of presentation, throughout the theorem statements, we will assume that is -Lipschitz on the domain of legal predictions/labels to guarantee bounded error; our results are proved more generally. Specifically, we show that a simple variant of stochastic gradient descent due to learns such linear families efficiently.
Note that the metric sample complexity depends logarithmically on . Thus, information-theoretically, we can hope to enforce metric mutlifairness with a class that grows exponentially and still be efficient. While the running time of each iteration depends on , note that the number of iterations is independent of . In Section 4, we show conditions on under which we can speed up the running time of each iteration to depend logarithmically on .
We give a description of the switching subgradient method in Algorithm 3. At a high level, at each iteration, the procedure checks to see if any constraint is significantly violated. If it finds a violation, it takes a (stochastic) step towards feasibility. Otherwise, it steps according a stochastic subgradient for the objective.
For convenience of analysis, we define the residual on the constraint defined by as follows.
Note that is convex in the predctions and thus, for linear families is convex in . We describe the algorithm assuming access to the following estimators, which we can implement efficiently (in terms of time and samples). First, we assume we can estimate the residual on each with tolerance such that for all , . Next, we assume access to a stochastic subgradient oracle for the constraints and the objective. For a function , let denote the set of subgradients of at . We abuse notation, and let refer to a vector-valued random variable where \operatorname*{\textnormal{{E}}}[\nabla\phi(w)\big{|}w]\in\partial\phi(w). We assume access to stochastic subgradients for for all and . We include a full analysis of the algorithm and proof of Theorem 2 in the Appendix.
One specific application of this result is as a way post-process learned predictions to ensure fairness. In particular, suppose we are given the predictions from some pre-trained model for individuals, but are concerned that these predictions may not be fair. We can use Algorithm 3 to post-process these labels to select near-optimal metric multifair predictions. Note these predictions will be optimal with respect to the unconstrained family of predictions – not just predictions that come from a specific hypothesis class (like linear functions).
Specifically, in this setting we can represent an unconstrained set of predictions as a linear hypothesis in dimensions: take , and let the feature vector for be the th standard basis vector. Then, we can think of the input labels to Algorithm 3 to be the output of any predictor that was trained separately.Nothing in our analysis required labels ; we can instead take the labels . For instance, if we have learned a highly-accurate model, but are unsure of its fairness, we can instantiate our framework with, say, the squared loss between the original predictions and the returned predictions; then, we can view the program as a procedure to project the highly-accurate predictions onto the set of metric multifair predictions. Importantly, our procedure only needs a small set of samples from the metric and not the original data used to train the model.
Post-processing prediction models for fairness has been studied in a few contexts . This post-processing setting should be contrasted to these settings. In our setting, the predictions are not required to generalize out of sample (in terms of loss or fairness). On the one hand, this means the metric multifairness guarantee does not generalize outside the individuals; on the other hand, because the predictions need not come from a bounded hypothesis class, their utility can only improve compared to learning a metric multifair hypothesis directly.
In addition to preserving the utility of previously-trained classifiers, separating the tasks of training for utility and enforcing fairness may be desirable when intentional malicious discrimination may be anticipated. For instance, when addressing the forms of racial profiling that can occur through targeted advertising (as described in ), we may not expect self-interested advertisers to adhere to classification under strict fairness constraints, but it stands to reason that prominent advertising platforms might want to prevent such blatant abuses of their platform. In this setting, the platform could impose metric multifairness after the advertisers specify their ideal policy.
Reducing search to agnostic learning
As presented above, the switching subgradient descent method converges to a nearly-optimal point in a bounded number of subgradient steps, independent of . The catch is that at the beginning of each iteration, we need to search over to determine if there is a significantly violated multifairness constraint. As we generally want to take to be a rich class of comparisons, in many cases will be prohibitive. As such, we would hope to find violated constraints in sublinear time, preferably even poly-logarithmic in . Indeed, we show that if a concept class admits an efficient agnostic learner, then we can solve the violated constraint search problem over the corresponding collection of comparisons efficiently.
Agnostic learning can be phrased as a problem of detecting correlations. Suppose , and let be some distribution supported on . We denote the correlation between and on as . We let denote the concept class and denote the hypothesis class. The task of agnostic learning can be stated as follows: given sample access over some distribution to some function , find some hypothesis that is comparably correlated with as the best . That is, given access to , an agnostic learner with accuracy for concept class returns some from the hypothesis class such that
An agnostic learner is typically considered efficient if it runs in polynomial time in (or an appropriate notion of dimension of ), , and . Additionally, distribution-specific learners and learners with query access to the function have been studied . In particular, membership queries tend to make agnostic learning easier. Our reduction does not use any metric samples other than those that the agnostic learner requests. Thus, if we are able to query a panel of experts according to the learner, rather than randomly, then an agnostic learner that uses queries could be used to speed up our learning procedure.
When we solve the convex program with switching subgradient descent, at the beginning of each iteration, we check if there is any such that the residual quantity is greater than some threshold. If we find such an , we step according to the subgradient of . In fact, the proof of the convergence of switching subgradient descent reveals that as long as when there is some where is in violation, we can find some for some constant , where , then we can argue that the learned hypothesis will be -metric multifair and achieve utility commensurate with the best -metric multifair hypothesis.
We show a general reduction from the problem of searching for a violated comparison to the problem of agnostic learning over the corresponding family of boolean functions. In particular, recall that for a collection of comparisons , we can also think of as a family of boolean concepts, where for each , there is an associated boolean function where if and only if . We frame this search problem as an agnostic learning problem, where we design a set of “labels” for each pair such that any function that is highly correlated with these labels encodes a way to update the parameters towards feasibility.
Recall the search problem: given a current hypothesis , is there some such that ? Consider the labeling each pair with . Let ; note that we can treat as a constant for all . Further, suppose is such that , or equivalently, . Then, by the assumption that is -large, the correlation between the corresponding boolean function and labels can be lower bounded as . Suppose we have an agnostic learner that returns a hypothesis with accuracy. Then, we know that by the lower bound on the optimal . Then, consider the function defined as follows.
Thus, given that there exists some where , we can find some real-valued comparison , such that . ∎
Discovering a violation of at least guarantees progress in the duality gap at each step, so the theorem follows from the analysis of Algorithm 3.
Hardness of learning metric multifair hypotheses
In this section, we show that our algorithmic results cannot be improved significantly. In particular, we focus on the post-processing setting of Section 3.1. We show that the metric sample complexity is tight up to a factor unconditionally. We also show that some learnability assumption on is necessary in order to achieve a high-utility -metric multifair predictions efficiently. In particular, we give a reduction from inverting a boolean concept to learning a hypothesis that is metric multifair on a collection derived from , where the metric samples from encode information about the concept . Recall, that for any and , we can always output a trivial -metric multifair hypothesis by outputting a constant hypothesis. This leads to a subtlety in our reductions, where we need to leverage the learner’s ability to simultaneously satisfy the metric multifairness constraints and achieve high utility.
Both lower bounds follow the same general construction. Suppose we have some boolean concept class for some universe . We will construct a new universe and define a collection of “bipartite” comparisons over subsets of . Then, given samples from , we define corresponding metric values where is some function of for all . Finally, we need to additionally encode the objective of inverting into labels for , such that to obtain good loss, the post-processor must invert on . We give a full description of the reduction in the Appendix.
While we argued earlier that some relaxation of metric fairness is necessary if we want to learn from a small set of metric samples, it is not clear that multifairness with respect to is the strongest relaxation we can obtain. In particular, we might hope to guarantee fairness on all large comparisons, rather than just a finite class . The following theorem shows that such a hope is misplaced: in order for an algorithm to guarantee that the Lipschitz condition holds in expectation over a finite collection of large comparisons , then either the algorithm takes random metric samples, or the algorithm outputs a set of nearly useless predictions. For concreteness, we state the theorem in the post-processing setting of Section 3.1; the construction can be made to work in the learning setting as well.
Let be constants and suppose is an algorithm that has random sample access to and outputs a -metric multifair set of predictions for -large . Then, takes random samples from or outputs a set of predictions with loss that approaches the loss achievable with no metric queries.
The construction uses a reduction from the problem of learning a linear function; we then appeal to a lower bound from linear algebra on the number of random queries needed to span a basis.
Hardness from pseudorandom functions.
Our reduction implies that a post-processing algorithm for -metric multifairness with respect to an arbitrary metric gives us a way of distinguishing functions in from random.
Assuming one-way functions exist, there is no efficient algorithm for computing -optimal -metric multifair predictions for general and constant .
Essentially, without assumptions that is a learnable class of boolean functions, some nontrivial running time dependence on is necessary. The connection between learning and pseudorandom functions is well-established; under stronger cryptographic assumptions as in , the reduction implies that a running time of is necessary for some constant .
Related works and discussion
Many exciting recent works have investigated fairness in machine learning. In particular, there is much debate on the very definitions of what it means for a classifier to be fair . Beyond the work of Dwork et al. , our work bears most similarity to two recent works of Hébert-Johnson et al. and Kearns et al. . As in this work, both of these papers investigate notions of fairness that aim to strengthen the guarantees of statistical notions, while maintaining their practicality. These works also both draw connections between achieving notions of fairness and efficient agnostic learning. In general, agnostic learning is considered a notoriously hard computational problem ; that said, in the context of fairness in machine learning, show that using heuristic methods to agnostically learn linear hypotheses seems to work well in practice.
Metric multifairness does not directly generalize either or , but we argue that it provides a more flexible alternative to these approaches for subpopulation fairness. In particular, these works aim to achieve specific notions of fairness – either calibration or equalized error rates – across a rich class of subpopulations. As has been well-documented , calibration and equalized error rates, in general, cannot be simultaneously satisfied. Often, researchers frame this incompatibility as a choice: either you satisfy calibration or you satisfy equalized error rates; nevertheless, there are many applications where some interpolation between accuracy (à la calibration) and corrective treatment (à la equalized error rates) seems appropriate.
Metric-based fairness offers a way to balance these conflicting fairness desiderata. In particular, one could design a similarity metric that preserves accuracy in predictions and separately a metric that performs corrective treatment, and then enforce metric multifairness on an appropriate combination of the metrics. For instance, returning to the loan repayment example, an ideal metric might be a combination of credit scores (which tend to be calibrated) and a metric that aims to increase the loans given to historically underrepresented populations (by, say, requiring the top percentiles of each subpopulation be treated similarly). Different combinations of the two metrics would place different weights on the degree of calibration and corrective discrimination in the resulting predictor. Of course, one could equally apply this metric in the framework of , but the big advantage with metric multifairness is that we only need a small sample from the metric to provide a relaxed, but still strong guarantee of fairness.
We are optimistic that metric multifairness will provide an avenue towards implementing metric-based fairness notions. At present, the results are theoretical, but we hope this work can open the door to empirical studies across diverse domains, especially since one of the strengths of the framework is its generality. We view testing the empirical performance of metric multifairness with various choices of metric and collection as an exciting direction for future research.
Finally, two recent theoretical works also investigate extensions to the fairness through awareness framework of . Gillen et al. study metric-based individually fair online decision-making in the presence of an unknown fairness metric. In their setting, every day, a decision maker must choose between candidates available on that day; the goal is to have the decision maker’s choices appear metric fair on each day (but not across days). Their work makes a strong learnability assumption about the underlying metric; in particular, they assume that the unknown metric is a Mahalanobis metric, whereas our focus is on fair classification when the metric is unknown and unrestricted. Rothblum and Yona study fair machine learning under a different relaxation of metric fairness, which they call approximate metric fairness. They assume that the metric is fully specified and known to the learning algorithm, whereas our focus is on addressing the challenge of an unknown metric. Their notion of approximate metric fairness aims to protect all (large enough) groups, and thus, is more strict than metric multifairness.
The authors thank Cynthia Dwork, Roy Frostig, Fereshte Khani, Vatsal Sharan, Paris Siminelakis, and Gregory Valiant for helpful conversations and feedback on earlier drafts of this work. We thank the anonymous reviewers for their careful reading and suggestions on how to improve the clarity of the presentation.
References
Appendix A Analysis of Algorithm 3
Here, we give a high-level overview of the analysis of Algorithm 3. We defer some technical lemmas to Appendix A.3. We refer to as the set of “feasible iterations” where we step according to the objective; that is,
Fairness analysis
We begin by showing that the hypothesis that Algorithm 3 returns satisfies metric multifairness.
Suppose for all , the residual oracle has tolerance . Then, is -metric multifair.
We choose our final hypothesis to be the weighted average of the feasible iterates. Note that the update rules for and imply that is a convex combination of hypotheses where no constraint appears significantly violated, . By convexity of we have the following inequality for all .
Further, for all and all , by the assumed tolerance of , we know that
Given that for all , , then applying the triangle inequality, we conclude that for each comparison ,
Hence, is -metric multifair. ∎
Utility and runtime analysis
We analyze the utility of Algorithm 3 using a duality argument. For notational convenience, denote . In addition to the assumptions in the main body, throughout, we assume the following bounds on the subgradients for all .
Let and . After running Algorithm 3 for iterations, then with probability at least (over the stochastic subgradients)
We give the full proof of Lemma 7 in Appendix A.3.
A.1 Answering residual queries
Next, we describe how to answer residual queries efficiently, in terms of time and samples.
For , for a -large collection of comparisons , with probability , given access to metric samples, every residual query can be answered correctly with tolerance provided
Proposition 9 shows that can be estimated for all from a small number of metric samples. The proof follows a standard Chernoff plus union bound argument. For completeness, we give a full proof next. Thus, Lemma 8 follows by showing that at each iteration can be estimated from a small number of evaluations of the current hypothesis .
We can estimate the expected value of the deviation on over with a small set of unlabeled samples from ; we will evaluate the hypothesis for each of these samples. Using an identical argument as in the case of the expected metric value, we can prove the following bound on how many comparisons we need to make, which shows the lemma.
Suppose is -large. Then with probability at least , for all , the empirical estimate for of samples deviates from the true expected value by at most provided
Here, we show that a small number of samples from the metric suffices to estimate the expected metric distance over all . Suppose is -large. Then with probability at least , for all , the empirical estimates for of metric samples deviate from their true expected value by at most provided
Let represent a random metric sample. Suppose for each , we obtain such samples where , and let be the empirical average over the sample. Then, by Hoeffding’s inequality, we know
If , then the probability that the estimate is not within of its true value is less than . Union bounding over , the probability that every estimate has tolerance will be at least .
Because is -large, for every , the probability a random metric sample is in is at least . If we take samples, then with probability at least , one of the samples will be in . Thus, to guarantee has tolerance for all with probability ,
A.2 Answering subgradient queries
Next, we argue that the subgradient oracles can be implemented efficiently without accessing any metric samples. First, suppose we want to take a step according to ; while is not differentiable, we can compute a legal subgradient defined by partial subderivatives given as follows.
The subgradient does not depend on , so no samples from the metric are necessary. Further, Algorithm 3 only assumes access to stochastic subgradient oracle with bounded entries. If we sample a single , then will be an unbiased estimate of a subgradient of ; we claim, the entries will also be bounded. In particular, assuming implies each partial is bounded by , so that we can take .
A.3 Utility analysis of Algorithm 3
In this appendix, we give a full proof of Lemma 7. We defer the proof of certain technical lemmas to Appendix A.4 for the sake of presentation.
with probability at least over the randomness of the algorithm.
As before, we refer to as the set of feasible iterations, where we step according to the objective, and as the set of infeasible iterations, where we step according to the violated constraints. Recall, we denote the set of subgradients of a function (or ) at by and denote by a stochastic subgradient, where \operatorname*{\textnormal{{E}}}[\nabla L(w)\big{|}w]\in\partial L(w).
When we do not step according to the objective, we step according to the subgradient of some violated comparison constraint. In fact, we show that stepping according to any convex combination of such subgradients suffices to guarantee progress in the duality gap. In the case wher , we assume that we can find some convex combination where for all , . We show that if we step according to the corresponding combination of the subgradients of , we can bound the duality gap. Specifically, for , let the algorithm’s step be given by
where for each , we have \operatorname*{\textnormal{{E}}}\left[\nabla R_{S}(w_{k})\big{|}w_{k}\right]\in\partial R_{S}(w_{k}). Let and denote the step size for the objective and residual steps, respectively. Then, consider the following choice of dual multipliers for each .
Expanding the definition of and applying convexity, we can bound the duality gap as follows
where (16) follows from expanding then applying convexity of and the definition of and (18) follows by our choice of for each .
With the duality gap expanded into one sum over the feasible iterates and one sum over the infeasible iterates, we can analyze these iterates separately. The following lemmas show how to track the contribution of each term to the duality gap in terms of a potential function defined as
For notational convenience, for each , let e(w_{k})=\operatorname*{\textnormal{{E}}}[\nabla L(w_{k})\big{|}w_{k}]-\nabla L(w_{k}) be the noise in the subgradient computation.
For all and for all ,
Again, for notational convenience, for each , let e(w_{k})=\sum_{S\in{\cal C}}\alpha_{k,S}\left(\operatorname*{\textnormal{{E}}}[\nabla R_{S}(w_{k})\big{|}w_{k}]-\nabla R_{S}(w_{k})\right) be the noise in the subgradient computation.
For all and for all ,
We defer the proofs of Lemmas 10 and 11 to Appendix A.4. Assuming Lemmas 10 and 11, we bound the duality gap as follows.
by rearranging. Noting that can be bounded by , it remains to bound . We show that for a sufficiently large , then cannot be positive.
Consider the terms in the supremum over . Note that we can upper bound . Additionally, we upper bound the error incurred due to the objective subgradient noise with the following lemma, which we prove in Appendix A.4.
With probability at least , the contribution of the noisy subgradient computation to the duality gap can be bounded as follows.
Assuming the lemma and that , then, we can bound by splitting the negative term involving to balance both positive terms.
Thus, the sum of and is at most . ∎
A.4 Deferred proofs from analysis of Algorithm 3
We show that the update rule implies the following inequality in terms of , and .
Suppose . Then, for all ,
where (29) follows by substituting the definition of twice; and (30) follows from the fact that for any closed convex set and ,
Rearranging (28) implies the following inequality holds for all .
We will use the following technical identity to prove the lemma.
Finally, we can show the inequality stated in the lemma.
where (34) follows from (32); (35) follows by using Proposition 14 to write the expression in terms of ’s; and (36) follows by the gradient step . ∎
Proof of Lemma 10
Here, we bound the contribution to the duality gap of each of the feasible iterations as follows.
Let where .
where (39) follows by substituting ; (40) follows by expanding the inner product and applying Lemma 13 to the first term; (41) follows by our choice of . ∎
Proof of Lemma 11
Here, we bound the contribution to the duality gap of each of the infeasible iterates . We assume has tolerance . Then we show
Recall, we let e(w_{k})=\sum_{S\in{\cal C}}\alpha_{k,S}\left(\operatorname*{\textnormal{{E}}}[\nabla R_{S}(w_{k})\big{|}w_{k}]-\nabla R_{S}(w_{k})\right). For each , for any , we can rewrite as follows.
Multiplying by and taking the convex combination of according to , we apply Lemma 13 to obtain the following inequality.
where (43) follows by substituting for each and the definition of ; (44) follows by expanding the inner product and applying Lemma 13; (45) follows by our choice of ; (46) follows by the fact that when we update according to a constraint, we know with tolerance , so . ∎
Proof of Lemma 12
Here, we show that with probability at least , the contribution of the noisy subgradient computation to the duality gap can be bounded as follows.
Let . Further, let for and for . Then, we can rewrite the expression to bound using and expand as follows.
Taking this probability to be at most , we can upper bound by . Then, noting that for any , we can take a union bound to conclude with probability at least the following inequalities hold.
Further, we note that the first summation concentrates at least as quickly as the second, so by union bounding again,
Appendix B Hardness for metric multifair predictions
The lower bounds follow the same general construction. Suppose we have some boolean concept class for some universe . We will construct a new universe where and define a collection of “bipartite” comparisons over subsets of . We will assume access to random samples for some function ; we define corresponding metric values where is some function of for all . Finally, we need to label such that such that to obtain good loss, the post-processing algorithm must learn something non-trivial about (which will differ across the two lower bounds we prove).
Consider with ideal labels given as for and for , and let be the hinge loss. We encode the original boolean concept in the distance metric, where for ,
Then, consider the collection of comparisons given by where we take . We take , large enough that the average prediction for is at least in the optimal utility set of multifair predictions. In particular, any average deviation by more than in would result in larger loss than setting all of for and all where .
Let be constants and suppose is an algorithm that has random sample access to and outputs a -metric multifair set of predictions for -large . Then, takes random samples from or outputs a set of predictions with loss that approaches the loss achievable with no metric queries.
Further note, if we only take queries, the incurred loss approaches the trivial loss exponentially in . Appealing to lower bounds on the number of random queries needed to span a basis, the theorem follows.
Outline of hardness from pseudorandom functions
We use the construction to demonstrate that under weak complexity assumptions, there are algorithmic barriers to generally efficient algorithms for metric multifairness. In particular, we will assume that defines a pseudorandom function family. The existence of one-way functions implies the existence of pseudorandom functions , so the proposition follows.
Assuming one-way functions exist, any algorithm for computing -optimal -metric multifair predictions for any and requires time .
Suppose we can post-process predictions to achieve -optimal -metric multifair predictions for any and and some small constant . Let define a pseudorandom function family. We will show that we can distinguish between a function drawn uniformly at random from the function family and a truly random function .
We’re given some samples of the form . Returning to the proposed construction, in the case is a truly random function, then with high probability, for all . Thus, any -optimal set of predictions will achieve loss .
In the case, where for some , note that labeling according to and all as is a feasible point that obtains loss
This feasible quantity upper bounds the optimal loss. Then, we can express the expectation of difference in the predictions across the set defined by as follows.
We assume the set of predictions is -metric multifair. Thus, because , we can upper bound this term by . In total then, , and by the fact that , then . Further, consider the following lower bound on the loss on .
If is a pseudorandom function family the must be bounded away from . Thus, in the case where , we have a non-trivial lower bound on the achievable loss under -metric multifairness. Thus, we can distinguish when and when is truly random. ∎