Fair and Diverse DPP-based Data Summarization
L. Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, Nisheeth K. Vishnoi
Introduction
A problem facing many services – from search engines and news feeds to machine learning – is data summarization: how can one select a small but representative, i.e., diverse, subset from a large dataset. For instance, Google Images outputs a small subset of images from its enormous dataset given a user query. Similarly, in training a learning algorithm one may be required to choose a subset of data points to train on as training on the entire dataset may be costly. However, data summarization algorithms prevalent in the online world have been recently shown to be biased with respect to sensitive attributes such as gender, race and ethnicity. For instance, a recent study found evidence of systematic under-representation of women in search results . Concretely, the above work studied the output of Google Images for various search terms involving occupations and found, e.g., that for the search term “CEO”, the percentage of women in top 100 results was , significantly lower than the ground truth of . Through studies on human subjects, they also found that such misrepresentations have the power to influence people’s perception about reality. Beyond humans, since data summaries are used to train algorithms, there is a danger that these biases in the data might be passed on to the algorithms that use them; a phenomena that is being revealed more and more in automated data-driven processes in education, recruitment, banking, and judiciary systems, see .
A robust and widely deployed method for data summarization is to associate a diversity score to each subset and select a subset with probability proportional to this score; see . This paper focuses on a concrete geometric measure of diversity of a subset of a dataset of vectors – the determinantal measure denoted by ; and the resulting probability distribution is called a determinantal point process (DPP). generalizes the correlation measure for two vectors to multiple vectors and, intuitively, the larger , the more diverse is in the feature space. Among benefits of are its overall simplicity, wide applicability – not depending on combinatorial properties of the data, and efficient computability. A potential downside might be the additional effort required in modeling, i.e., to represent the data in a suitable vector form so that the geometry of the dataset indeed corresponds to diversity. Despite the well-acknowledged ability of DPPs to produce diverse subsets, unfortunately, there seems to be no obvious way to ensure that this also guarantees fairness in the DPP samples in the form of appropriate representation of sensitive attributes in the subset selected. Partially, this is due to the fact that fairness could mean different things in different contexts. For instance, consider a dataset in which each data point has a gender. One notion of fairness, useful in ensuring that the ground truth does not get distorted, is proportional representation: i.e., the fraction of Males (respectively Females) in the output set should be identical to that in the input dataset . Another notion of fairness, argued to be necesseary to reverse the effect of historical biases , could be equal representation – the number of Males is equal to that of Females independent of the ratio in the input dataset. While these measures of fairness have natural generalizations to the case when the number of sensitive types is more than two, and can be refined in several ways, one thing remains common: they all operate in the combinatorial space of sensitive attributes of the data points.
Simple examples (see, e.g., Figure 1) show that, in certain settings, geometric diversity does not imply fairness and vice-versa; however, there seems to be no intrinsic barrier in attaining both. We initiate a rigorous study of the problem of incorporating fairness with respect to sensitive attributes of data in DPP-based sampling for data summarization. Our contributions are: A framework that can incorporate a wide class of notions of fairness with respect to disjoint sensitive attributes and, conditioned on being fair in the specified sense, outputs subsets where the probability of a set is still proportional to . In particular, we model the problem as sampling from a partition DPP – the parts correspond to different sensitive attributes and the goal is to select a specified number of points from each. Unfortunately, the problem of sampling from partition DPPs has been recently shown to be intractable in a strong sense and the question of designing fast algorithms for it, at the expense of being approximate, has been open. Our main technical result is a linear time algorithm (see Section 3.2) to sample from partition DPPs that is guaranteed to output samples from close to the DPP distribution under a natural condition on the data (see Definition 3.1). We prove that random data matrices satisfy this condition in Section 3.4. Experimentally, we run our algorithm on the Adult dataset and a curated image dataset with various parameter settings and observe a marked improvement in fairness without compromising geometric diversity by much. A theoretical justification of this low price of fairness is provided in Section 4; while there have been few works on controlling fairness, ours is the first to give a rigorous, quantitative price of fairness guarantee in any setting. Overall, our work gives a general and rigorous algorithmic solution to the problem of controlling bias in DPP-based sampling algorithms for data summarization while maximizing diversity.
DPP-based sampling has been deployed for many data summarization tasks including text and images , videos , documents , recommendation systems , and sensors ; and the study of DPPs with additional budget or resource constraints is of importance. While for unconstrained DPPs there are efficient algorithms to sample , the problem of sampling from constrained DPPs is intractable; see , where pseudopolynomial time algorithms for partition DPPs are presented. There is also work on approximate MCMC algorithms for sampling from various discrete point processes (see and the references therein), and algorithms that are efficient for constrained DPPs under certain restrictions on the data matrix and constraints (see and the references therein). To the best of our knowledge, ours is the first algorithm for constrained DPPs that is near-linear time. Our algorithm is a greedy, approximate algorithm, and can be considered an extension of a similar algorithm for unconstrained DPPs given by . Finally, our work contributes towards an ongoing effort to measure, understand and incorporate fairness in algorithms (e.g., see ).
Our Model
This volume generalizes the correlation measure for two vectors to multiple vectors and, intuitively, the larger the volume, the more diverse is in the feature space; see Figure 2 for an illustration. Geometric diversity gives rise to the following distribution on subsets known as a determinantal point process (DPP).
A characteristic of a DPP measure is that the inclusion of one item makes including other similar items less likely. Consequently, DPPs assign greater probability to subsets of points that are diverse; for example, a DPP prefers search results that cover multiple aspects of a user’s query, rather than the most popular one.
Thus, the distribution above can be thought of as the most diverse while being fair; we call it partition DPP, or -DPP.
From the algorithmic perspective, the main problem we study is that of coming up with efficient algorithms to sample from -DPPs. The flexibility that our framework provides in specifying the fairness constraints comes at a computational cost – coming up with algorithms to sample from -DPPs. This is a significant challenge, especially given the results of that show that sampling from -DPPs is P-hard.
Our Algorithm
The following lemma is a simple generalization of the formula derived above for orthogonal families of vectors and inspires our algorithm for -DPPs. The proof of this lemma is presented in Section 6.2.
2 Our Sample and Project Algorithm
Before we describe our algorithms for sampling from -DPPs, it is instructive to consider the special case of -DPPs itself and the simple “orthogonal” scenario – where all the vectors , for , are pairwise orthogonal. In such a case, there is a simple iterative algorithm: sample with probability , then add to and remove from ; repeat until . It is intuitively clear, and not hard to prove, that the final probability of obtaining a given set as a sample is proportional to and, hence, recovers the -DPP exactly.
In case of -DPPs where all the vectors are pairwise orthogonal, and we need to sample vectors from partition , we can sample the required number of elements from each partition independently using the procedure in the previous paragraph. The orthogonality of the vectors and the disjointness of the parts implies that this sampling procedure gives the right probability distribution.
However, when the vectors are no longer pairwise orthogonal, the above heuristic can fail miserably. This is where we invoke Lemma 3.1. It suggests the following strategy: once we select a vector, then we should orthogonalize all the remaining vectors with respect to it before repeating the sampling procedure. For the case of -DPPs, it can be shown that this heuristic outputs a set with probability no more than times its desired probability . The term is primarily because the vectors can be chosen in any of the orders. Taking this simple heuristic as a starting point and incorporating an additional idea to deal with partition constraints, we arrive at our Sample and Project algorithm – see Algorithm 1.
Note that the size of the data for this problem is already , hence, the algorithm does only linear work per sampled point. For -DPPs there is only one known exact algorithm which samples in time , which is polynomial only when .
Another possible approach for sampling from DPPs is the Markov Chain Monte Carlo method. It was proved in that Markov Chains can be used to sample from -DPPs in time roughly given a “warm start”, i.e., a set of significant probability. This approach does not extend to -DPPs – indeed in the underlying probability distribution is required to be Strongly Rayleigh, a property which holds for -DPPs, but fails for -DPPs whenever the number of parts is at least two. One can still formulate an analogous MCMC algorithm for the case of -DPPs – it fails on specially crafted “bad instances” but seems to perform well on real world data. However, even ignoring the lack of provable guarantees for this algorithm, it does not seem possible to reduce its running time below , which significantly limits its practical applicability.
3 Provable Guarantees for Our Algorithm
We now present a theorem which connects the output distribution of Algorithm 1 to the corresponding -DPP. To establish such a guarantee we require the following assumption on the singular values of the matrices .
One can construct simple examples that motivate the necessity of such a condition.Consider an example with parts and vectors of dimension , where the first part contains vectors (where denotes the th standard basis vector) and the second part consists of . Such a partition is not -balanced for any since has non-zero singular values and has only of them ( has of them). The Sample and Project algorithm indeed fails to approximate the -DPP, as it outputs a set with non-zero determinant with exponentially small probability. For a positive and negative example of -balanced property, see Figure 4.
Importantly, Algorithm 1 never outputs a set , hence the only way its output distribution could significantly differ from the -DPP would be if certain sets appeared in the output with larger probabilities than specified by the -DPP. Our main theoretical result for Sample and Project is that for -balanced instances we can control the scale at which such a violation can happen.
The proof of the approximation guarantee uses techniques inspired by who prove a similar bound for -DPP sampling.
We use the following lemmas in the proof of the theorem. The proof of these lemmas appear in Section 6.3 and Section 6.4.
where are the singular values of and is the sub-matrix of with rows corresponding to .
Given a -balanced partition, Algorithm 1 returns a set such that is non-zero with probability one.
We use also the following low rank approximation lemma in the proof of Theorem 3.2.
is achieved for and attains the value .
Hence in the -th iteration, for all . Therefore, the probability that the sequence is the output of the algorithm is
The numerator of above is by Lemma 3.1. Let denote the denominator. For each term in the denominator where denotes the Frobenius norm and is the rank matrix with rows such that is the projection of vector on . By a result on low rank approximations (see Lemma 3.5), we can bound the above quantity as
where is the -th singular value of and second inequality is due to the -balanced property of the partition. Using above, the denominator of (1) becomes
4 β𝛽\beta-balanced property for random data
To prove Theorem 3.6 we will use the following matrix concentration inequality.
Given independent, random, Hermitian matrices that satisfy
Consider the formation of one such partition . Let be the random variable taking value with probability and 0 with probability . will be all those elements for which we do not sample 0. Then for this instance we have that
Let , and . Then it can be seen that
We know that if , then for all , – see e.g. . Therefore if we show that , then for all ,
This implies that will satisfy the -balanced condition for . To show that holds (with decent probability), it is enough to show that . We will show it using Matrix concentration inequalities. But first we need to bound .
From the above two inequalities, we have that
Hence the probability that all the partitions satisfy this -balanced condition, for , is atleast
Since and , it can be seen that
Therefore the partition is -balanced, for , with probability .
The quantity is also called the statistical leverage score of with respect to . For two partitions, the theorem states that if the leverage score of all rows is , then the partitions are -balanced for .
Price of Fairness
In this section we present conditions under which the -DPP and -DPP distributions are close to each other. Note that the support of a -DPP is a subset of the support of the corresponding -DPP. Thus, a natural definition of the price of fairness is the KL-divergence between them.
We define the following property for the input data and analyze its price of fairness.
For , the partition is called a -drop partition with respect to and if for all , Here is the -th largest singular value of .
Roughly, this says that, if is small, then each of the matrices is effectively a rank- matrix. Such a notion of low effective rank appears frequently in the machine learning literature . We prove the following theorem that asserts that if the -drop condition is satisfied, then we can be sure that most of the probability mass is concentrated on subsets which satisfy partition constraints. In such a case, sampling a sized subset using any -DPP algorithm will output a subset which satisfies partition constraints with high probability.
Let and suppose that the partition is -drop w.r.t. and , with and . If n\geq\sqrt{2}k\cdot\big{(}\frac{\gamma}{\sigma_{n}}\big{)}^{2} (with , where is the largest singular value of and is the smallest non-zero singular value of ) then the price of ensuring fairness is
We will use the following lemma in the proof.
We start by decomposing the terms in and analyzing each term individually using Lemma 4.2. Given a set , let . Then . Using this, the family can be decomposed as
Let denote the following family of subsets
and, for brevity, let denote the following set integer tuples (all but )
Given this notation, we can write the following sum as
We analyze each term of the above summation individually. We start by noting that
where for all , , this is a simple consequence of the fact that is positive semidefinite. Therefore,
Whenever a set of cardinality does not belong to , for at least one , we have that . Let us now analyze how does a sum of the form behave depending on whether or .
Since ,
Since ,
Using the above inequalities, we obtain that for every
Note that the size of the set of tuples is bounded from above by . Therefore,
It remains to find a lower bound for . Using Lemma 3.3, we obtain
By using the inequality we finally arrive at
Using the assumption that n\geq\sqrt{2}k\cdot\big{(}\frac{\gamma}{\sigma_{n}}\big{)}^{2} we obtain
and an application of Lemma 4.2 finishes the proof.
Empirical Results
In each experiment, we compare several different probability distributions from which to select samples from a dataset: As benchmarks we consider the (unconstrained) distributions, -DPP (see Def 2.2), and UNIF, which selects a uniformly random subset of size from the dataset . We compare this against different methods which select from a fair family of allowed subsets, -DPP (see Def 2.3), and -DPP (see Def 5.1 below).
Algorithms for -DPPs are simply obtained by independently using a -DPP sampler with on each part . For sampling from all the above listed distribution we use the Sample and Project algorithm as described in Section 3.2.
In each experiment, we report the geometric diversity (see Def 2.1) and the fairness as measured by the KL-divergence from the desired frequency over parts. Formally, given a probability distribution over the parts of the dataset, we define the relative unfairness measure of a set as where denotes the vector of frequencies, i.e., for . In particular, typically we want to have as small as possible – ideally equal to . When for all , we refer to as . When , we refer to as .
2 Experiment on Image Dataset
We gathered a collection of images curated using Google image search as follows: Four search terms were used: (a) “Scientist Male”, (b) “Scientist Female”, (c) “Painter Male”, and (d) “Painter Female”.The images are available at goo.gl/hNukfP.
Following , each image was processed with the vlfeat toolbox to obtain sets of 128-dimensional SIFT descriptors . All such descriptors are collected in a single set and subsampled to roughly of its total size. The resulting set of descriptors was clustered using the -means algorithm where is the number of means. The feature vector for an image is the normalized histogram of the nearest clusters to the descriptors in the image.
2.2 Experiment on Biased Datasets
Our goal is to understand how the bias in the underlying dataset can affect the performance of the different sampling distributions with respect to fairness and geometric diversity. We include all female (b and d) images, but vary how many of the male images (a and c) appear in the dataset in order to create biased sets that have between to male images. The male images are selected uniformly at random from the set of all male scientists and male artists for each repetition in the experiment. We sample 40 images from each biased dataset; roughly the number that fits on the first page of an image search result. We conduct 200 repetitions. We place fairness constraints so that -DPP and -DPP select exactly of their samples from the male (a and c) images and female (b and d) images, regardless of the bias in the underlying dataset. Note that we do not enforce constraints across scientist (a and b) images and artist (c and d) images, but measure the unfariness with respect to all four attributes.
2.3 Results
With respect to , -DPP significantly outperforms -DPP, and UNIF (paired one-sided -tests, ), see Figure 5. As expected, the bias in the underlying dataset can dramatically affect the fairness of UNIF and -DPP as neither approach is designed to correct for such biases. However, -DPP and -DPP both enforce fairness constraints; note that this is despite the fact that the sampling was only equal with respect to gender and not profession. The latter does not appear to affect the outcome here.
With respect to the diversity , -DPP has significantly higher than UNIF and -DPP (paired one-sided -tests, ). Moreover, -DPP performs comparatively to -DPP; the mean diversity of -DPP is higher, but not significantly so. Thus, we observe that, when the underlying data is biased, there is a tradeoff between (for which -DPP performs best) and (for which -DPP performs best); however the differences in geometric diversity are negligible while differences in unfairness can be very large.
3 Experiment on Real-World Dataset
The Adult income dataset consists of roughly 45000 records of subjects each with 14 features such as age, race, education and a binary label indicating whether a subject’s incomes is above or below 50K USD.Data downloaded from https://archive.ics.uci.edu/ml/datasets/adult. This dataset has been widely studied in the context of fairness (see, ).
In preprocessing the data we filter out incomplete entries, and from the remaining ones we pick a random subset of 5000 records for our experiments. We vectorize the data as follows: Categorical fields (with a small number of possible values) we turn into sets of binary fields. As the dimension of such feature vectors is quite small – – the DPP framework allows sampling sets of cardinality at most . For this reason we enrich the feature vectors in a standard way – by adding pairwise products of all existing features as separate ones – this, after removing redundant columns, yields feature vectors of dimension .
3.2 Experiment on Equal and Proportional Representation
We conduct our experiment across either gender or race as the sensitive attribute. For the former, we use the gender categories provided in the dataset; all entries were labeled either male (68.3%) or female (31.7%). For the latter, we use the race categories provided in the dataset; we consider the partition Caucasian (85.7%) and non-Caucasian (14.3%).
In addition to the algorithms mentioned above, we report the performance of an additional benchmark -UNIF, which selects a uniformly random subset of size from .
In our subsampling, we consider both equal representation, where each attribute makes up of 50% of the selected points, and proportional representation, where each attribute is represented with the same ratio as in the original population.
3.3 Results
We observe that -DPP has the highest diversity out of all constrained sampling methods regardless of the proportion of representation or sensitive attribute; see Table 1. Surprisingly, the diversity of -DPP matches that of the unconstrained -DPP for Gender under proportional representation and for Race under equal representation. In the other two settings – Gender under equal representation and Race under proportional representation – the -DPP score is lower than that of -DPP, but minimally so, and outperforms -DPP by several standard deviations.
We note that -UNIF, although it has very poor geometric diversity as a whole, performs better under equal representation than it does under proportional representation. This fact suggests that there could be value in selecting sensitive attributes equally beyond the consideration of fairness.
The fact that -DPP performs so well, especially when significantly changing the distribution of sensitive attributes (e.g., for race, from 14.3% non-Caucasian to 50% non-Caucasian), is quite surprising. Overall, it appears that one can support very dramatic changes to the underlying distributions of attributes with minimal or even zero loss to geometric diversity by using our -DPP algorithm.
4 Experiment on Price of Fairness
We look at the effect of the scaling of singular values, suggested by Theorem 4.1, on the sampled subsets of our Algorithm. In this experiment we take an instance of random vectors and use different sampling methods to sample a subset from the dataset, and report the and value of the sampled subset. Following this, we scale the tail singular values of the partition matrices by and again report the and values.
We also present a heuristic approach, Scale-And-Sample, for constrained sampling which will use any -DPP algorithm as a sub-routine. The algorithm is simple. For each , scale the smallest singular values by . Then sample a sized subset using any -DPP algorithm.
The results are presented in Table 2. It can be seen that after scaling the tail singular values of the partition matrices, the mean value for -DPP is very low, and resembles closely the constrained sampling case. We also note that the Scale-And-Sample approach to constrained sampling suggested earlier performs very well. The mean relative unfairness measure is almost zero. Furthermore, the value of the geometric diversity parameter is also similar to unscaled -DPP.
Proofs
2 Proof of Lemma 3.1
We will prove this lemma by induction. For the base case where there is just one row in , is equal to which is equal to .
Let be the matrix with as rows. Assume that the statement is true for rows, i.e.,
Note that elementary row product or addition transformations do not change the determinant. We will apply these transformation to make the entries of first row and first column go to zero.
Let denote the -th row of the above matrix and denote the entry. Then the transformation
will make the entry go to zero. For the rest of the elements,
We continue this way and next apply the transformation
This will make the entry go to zero and by the similar analysis as above we get , where is the subspace spanned by the vectors . After applying row transformations of the form
we get that the entries , for and
Note that defined in the statement of the lemma.
We can apply similar column operations to make all the entries of the first column, except , go to zero. Since these elementary operations do not affect the determinant, we get Therefore
3 Proof of Lemma 3.3
where are the singular values of .
The coefficient of in is equal to . Let be the set of all principal -minors of . It is a well known fact in linear algebra that the coefficient of in is equal to
4 Proof of Lemma 3.4
We first show that for every part , the corresponding matrix has rank at least . For this, first note that has at least non-zero singular values, i.e., . This follows from the fact that the number of non-zero singular values determines the rank of . The rank of is certainly at least , since otherwise the diversity of every subset of size would be zero.
From the -balance condition it follows that the number of non-zero singular values of is the same as for , and hence also the rank of is at least , as claimed.
Note now that the set of vectors output by the algorithm has determinant zero if and only if for an iteration there exists a partition such that and for all , where .
Conclusion and Future Work
In this paper we initiated the study of fair and diverse DPP-based sampling for data summarization. We provide a novel and fast algorithm that can sample from a DPP that satisfy fairness constraints based on the desired proportion of samples with a given attribute. Our algorithm gives provably good guarantees when the data matrix satisfies a natural -balance property. We prove that a large class of datasets satisfy the -balance condition. We define a notion of price of fairness, the KL-divergence between the fairness constrained distribution and the unconstrained distribution and theoretically show that, when the data satisfies reasonable properties, this price would be low. We further show experimentally that adding fairness constraints results in minimal loss to diversity, even when the underlying dataset is very biased, or when the proportion of attributes is changed significantly.
Several challenging problems remain from a technical standpoint; naturally, a first question would be whether the theorems can be improved either by attaining better approximation guarantees, or by weakening the necessary conditions. Extending these results to arbitrary group structures (as opposed to partitions) would be very relevant, but appears to be significantly more challenging.
From a practical point of view, it remains to be seen what effect de-biasing a sampler has on the end result of a machine learning algorithm (e.g., classification), both on its accuracy and on the bias down the line.