Fair and Diverse DPP-based Data Summarization

L. Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, Nisheeth K. Vishnoi

Introduction

A problem facing many services – from search engines and news feeds to machine learning – is data summarization: how can one select a small but representative, i.e., diverse, subset from a large dataset. For instance, Google Images outputs a small subset of images from its enormous dataset given a user query. Similarly, in training a learning algorithm one may be required to choose a subset of data points to train on as training on the entire dataset may be costly. However, data summarization algorithms prevalent in the online world have been recently shown to be biased with respect to sensitive attributes such as gender, race and ethnicity. For instance, a recent study found evidence of systematic under-representation of women in search results . Concretely, the above work studied the output of Google Images for various search terms involving occupations and found, e.g., that for the search term “CEO”, the percentage of women in top 100 results was 11%11\%, significantly lower than the ground truth of 27%27\%. Through studies on human subjects, they also found that such misrepresentations have the power to influence people’s perception about reality. Beyond humans, since data summaries are used to train algorithms, there is a danger that these biases in the data might be passed on to the algorithms that use them; a phenomena that is being revealed more and more in automated data-driven processes in education, recruitment, banking, and judiciary systems, see .

A robust and widely deployed method for data summarization is to associate a diversity score to each subset and select a subset with probability proportional to this score; see . This paper focuses on a concrete geometric measure of diversity of a subset SS of a dataset {vx}xX\{v_{x}\}_{x\in X} of vectors – the determinantal measure denoted by G(S)G(S) ; and the resulting probability distribution is called a determinantal point process (DPP). G(S)G(S) generalizes the correlation measure for two vectors to multiple vectors and, intuitively, the larger G(S)G(S), the more diverse is SS in the feature space. Among benefits of G()G(\cdot) are its overall simplicity, wide applicability – not depending on combinatorial properties of the data, and efficient computability. A potential downside might be the additional effort required in modeling, i.e., to represent the data in a suitable vector form so that the geometry of the dataset indeed corresponds to diversity. Despite the well-acknowledged ability of DPPs to produce diverse subsets, unfortunately, there seems to be no obvious way to ensure that this also guarantees fairness in the DPP samples in the form of appropriate representation of sensitive attributes in the subset selected. Partially, this is due to the fact that fairness could mean different things in different contexts. For instance, consider a dataset in which each data point has a gender. One notion of fairness, useful in ensuring that the ground truth does not get distorted, is proportional representation: i.e., the fraction of Males (respectively Females) in the output set should be identical to that in the input dataset . Another notion of fairness, argued to be necesseary to reverse the effect of historical biases , could be equal representation – the number of Males is equal to that of Females independent of the ratio in the input dataset. While these measures of fairness have natural generalizations to the case when the number of sensitive types is more than two, and can be refined in several ways, one thing remains common: they all operate in the combinatorial space of sensitive attributes of the data points.

Simple examples (see, e.g., Figure 1) show that, in certain settings, geometric diversity does not imply fairness and vice-versa; however, there seems to be no intrinsic barrier in attaining both. We initiate a rigorous study of the problem of incorporating fairness with respect to sensitive attributes of data in DPP-based sampling for data summarization. Our contributions are: A framework that can incorporate a wide class of notions of fairness with respect to disjoint sensitive attributes and, conditioned on being fair in the specified sense, outputs subsets where the probability of a set is still proportional to G()G(\cdot). In particular, we model the problem as sampling from a partition DPP – the parts correspond to different sensitive attributes and the goal is to select a specified number of points from each. Unfortunately, the problem of sampling from partition DPPs has been recently shown to be intractable in a strong sense and the question of designing fast algorithms for it, at the expense of being approximate, has been open. Our main technical result is a linear time algorithm (see Section 3.2) to sample from partition DPPs that is guaranteed to output samples from close to the DPP distribution under a natural condition on the data (see Definition 3.1). We prove that random data matrices satisfy this condition in Section 3.4. Experimentally, we run our algorithm on the Adult dataset and a curated image dataset with various parameter settings and observe a marked improvement in fairness without compromising geometric diversity by much. A theoretical justification of this low price of fairness is provided in Section 4; while there have been few works on controlling fairness, ours is the first to give a rigorous, quantitative price of fairness guarantee in any setting. Overall, our work gives a general and rigorous algorithmic solution to the problem of controlling bias in DPP-based sampling algorithms for data summarization while maximizing diversity.

DPP-based sampling has been deployed for many data summarization tasks including text and images , videos , documents , recommendation systems , and sensors ; and the study of DPPs with additional budget or resource constraints is of importance. While for unconstrained DPPs there are efficient algorithms to sample , the problem of sampling from constrained DPPs is intractable; see , where pseudopolynomial time algorithms for partition DPPs are presented. There is also work on approximate MCMC algorithms for sampling from various discrete point processes (see and the references therein), and algorithms that are efficient for constrained DPPs under certain restrictions on the data matrix and constraints (see and the references therein). To the best of our knowledge, ours is the first algorithm for constrained DPPs that is near-linear time. Our algorithm is a greedy, approximate algorithm, and can be considered an extension of a similar algorithm for unconstrained DPPs given by . Finally, our work contributes towards an ongoing effort to measure, understand and incorporate fairness in algorithms (e.g., see ).

Our Model

This volume generalizes the correlation measure for two vectors to multiple vectors and, intuitively, the larger the volume, the more diverse is SS in the feature space; see Figure 2 for an illustration. Geometric diversity gives rise to the following distribution on subsets known as a determinantal point process (DPP).

A characteristic of a DPP measure is that the inclusion of one item makes including other similar items less likely. Consequently, DPPs assign greater probability to subsets of points that are diverse; for example, a DPP prefers search results that cover multiple aspects of a user’s query, rather than the most popular one.

Thus, the distribution above can be thought of as the most diverse while being fair; we call it partition DPP, or PP-DPP.

From the algorithmic perspective, the main problem we study is that of coming up with efficient algorithms to sample from PP-DPPs. The flexibility that our framework provides in specifying the fairness constraints comes at a computational cost – coming up with algorithms to sample from PP-DPPs. This is a significant challenge, especially given the results of that show that sampling from PP-DPPs is #\#P-hard.

Our Algorithm

The following lemma is a simple generalization of the formula derived above for orthogonal families of vectors and inspires our algorithm for PP-DPPs. The proof of this lemma is presented in Section 6.2.

2 Our Sample and Project Algorithm

Before we describe our algorithms for sampling from PP-DPPs, it is instructive to consider the special case of kk-DPPs itself and the simple “orthogonal” scenario – where all the vectors vxv_{x}, for xXx\in X, are pairwise orthogonal. In such a case, there is a simple iterative algorithm: sample xXx\in X with probability vx2\propto\left\lVert v_{x}\right\rVert^{2}, then add xx to SS and remove xx from XX; repeat until S=k|S|=k. It is intuitively clear, and not hard to prove, that the final probability of obtaining a given set SS as a sample is proportional to xSvx2=det(VSVS)\prod_{x\in S}\left\lVert v_{x}\right\rVert^{2}=\det(V_{S}V_{S}^{\top}) and, hence, recovers the kk-DPP exactly.

In case of PP-DPPs where all the vectors are pairwise orthogonal, and we need to sample kik_{i} vectors from partition XiX_{i}, we can sample the required number of elements from each partition independently using the procedure in the previous paragraph. The orthogonality of the vectors and the disjointness of the parts implies that this sampling procedure gives the right probability distribution.

However, when the vectors vxv_{x} are no longer pairwise orthogonal, the above heuristic can fail miserably. This is where we invoke Lemma 3.1. It suggests the following strategy: once we select a vector, then we should orthogonalize all the remaining vectors with respect to it before repeating the sampling procedure. For the case of kk-DPPs, it can be shown that this heuristic outputs a set SS with probability no more than k!k! times its desired probability . The k!k! term is primarily because the kk vectors can be chosen in any of the k!k! orders. Taking this simple heuristic as a starting point and incorporating an additional idea to deal with partition constraints, we arrive at our Sample and Project algorithm – see Algorithm 1.

Note that the size of the data for this problem is already Θ(mn)\Theta(mn), hence, the algorithm does only linear work per sampled point. For PP-DPPs there is only one known exact algorithm which samples in time mO(p)m^{O(p)}, which is polynomial only when p=O(1)p=O(1) .

Another possible approach for sampling from DPPs is the Markov Chain Monte Carlo method. It was proved in that Markov Chains can be used to sample from kk-DPPs in time roughly O~(mk4+mn2)\widetilde{O}(mk^{4}+mn^{2}) given a “warm start”, i.e., a set S0S_{0} of significant probability. This approach does not extend to PP-DPPs – indeed in the underlying probability distribution is required to be Strongly Rayleigh, a property which holds for kk-DPPs, but fails for PP-DPPs whenever the number of parts is at least two. One can still formulate an analogous MCMC algorithm for the case of PP-DPPs – it fails on specially crafted “bad instances” but seems to perform well on real world data. However, even ignoring the lack of provable guarantees for this algorithm, it does not seem possible to reduce its running time below O(mk4+mn2)O(mk^{4}+mn^{2}), which significantly limits its practical applicability.

3 Provable Guarantees for Our Algorithm

We now present a theorem which connects the output distribution of Algorithm 1 to the corresponding PP-DPP. To establish such a guarantee we require the following assumption on the singular values of the matrices VXiV_{X_{i}}.

One can construct simple examples that motivate the necessity of such a condition.Consider an example with p=2p=2 parts and m=3nm=3n vectors of dimension 2n2n, where the first part contains vectors e1,e2,,e2ne_{1},e_{2},\ldots,e_{2n} (where eie_{i} denotes the iith standard basis vector) and the second part consists of e1,e2,,ene_{1},e_{2},\ldots,e_{n}. Such a partition is not β\beta-balanced for any β>0\beta>0 since VV has 2n2n non-zero singular values and VX2V_{X_{2}} has only nn of them (VX1V_{X_{1}} has 2n2n of them). The Sample and Project algorithm indeed fails to approximate the PP-DPP, as it outputs a set with non-zero determinant with exponentially small probability. For a positive and negative example of β\beta-balanced property, see Figure 4.

Importantly, Algorithm 1 never outputs a set SBS\notin\mathcal{B}, hence the only way its output distribution could significantly differ from the PP-DPP would be if certain sets SBS\in\mathcal{B} appeared in the output with larger probabilities than specified by the PP-DPP. Our main theoretical result for Sample and Project is that for β\beta-balanced instances we can control the scale at which such a violation can happen.

The proof of the approximation guarantee uses techniques inspired by who prove a similar bound for kk-DPP sampling.

We use the following lemmas in the proof of the theorem. The proof of these lemmas appear in Section 6.3 and Section 6.4.

where σ1,σ2,,σn\sigma_{1},\sigma_{2},\dots,\sigma_{n} are the singular values of VV and VSV_{S} is the sub-matrix of VV with rows corresponding to SS.

Given a β\beta-balanced partition, Algorithm 1 returns a set SS such that det(VSVS)\det(V_{S}V_{S}^{\top}) is non-zero with probability one.

We use also the following low rank approximation lemma in the proof of Theorem 3.2.

is achieved for B=AB=A^{\prime} and attains the value j=k+1nσj2\sum_{j=k+1}^{n}\sigma_{j}^{2}.

Hence in the (j1)(j-1)-th iteration, wx=ΠHj(vx)w_{x}=\Pi_{H_{j}}(v_{x}) for all xXx\in X. Therefore, the probability that the sequence τ\tau is the output of the algorithm is

The numerator of above is det(VSVS)\det(V_{S}V_{S}^{\top}) by Lemma 3.1. Let Dx1,,xkD_{x_{1},\dots,x_{k}} denote the denominator. For each term in the denominator xXlΠHj(vx)2=VXlVXlF2\sum\limits_{x\in X_{l}}\left\lVert\Pi_{H_{j}}(v_{x})\right\rVert^{2}=\left\lVert V_{X_{l}}-V_{X_{l}}^{\prime}\right\rVert_{F}^{2} where F\left\lVert\cdot\right\rVert_{F} denotes the Frobenius norm and VXlV_{X_{l}}^{\prime} is the rank j1j-1 matrix with rows {vx}xXl\{v_{x}^{\prime}\}_{x\in X_{l}} such that vxv_{x}^{\prime} is the projection of vector vxv_{x} on HjH_{j}. By a result on low rank approximations (see Lemma 3.5), we can bound the above quantity as

where σl,t\sigma_{l,t} is the tt-th singular value of VXlV_{X_{l}} and second inequality is due to the β\beta-balanced property of the partition. Using above, the denominator of (1) becomes

4 β𝛽\beta-balanced property for random data

To prove Theorem 3.6 we will use the following matrix concentration inequality.

Given independent, random, Hermitian matrices M1,,MmM_{1},\dots,M_{m} that satisfy

Consider the formation of one such partition XiX_{i}. Let YjY_{j} be the random variable taking value vjvjv_{j}v_{j}^{\top} with probability 1/p1/p and 0 with probability (11/p)(1-1/p). XiX_{i} will be all those elements for which we do not sample 0. Then for this instance we have that

Let uj:=(pVV)12vju_{j}:=(pV^{\top}V)^{-\frac{1}{2}}v_{j}, Zj=ujujZ_{j}=u_{j}u_{j}^{\top} and M~i:=j=1mZj\widetilde{M}_{i}:=\sum_{j=1}^{m}Z_{j}. Then it can be seen that

We know that if ABA\preceq B, then for all jj, λj(A)λj(B)\lambda_{j}(A)\leq\lambda_{j}(B) – see e.g. . Therefore if we show that (1ε)IM~i(1-\varepsilon)\cdot I\preceq\widetilde{M}_{i}, then for all j{1,,n}j\in\{1,\dots,n\},

This implies that VXiV_{X_{i}} will satisfy the β\beta-balanced condition for β=p1ε\beta=\sqrt{\frac{p}{1-\varepsilon}}. To show that M~i(1ε)I\widetilde{M}_{i}\succeq(1-\varepsilon)\cdot I holds (with decent probability), it is enough to show that λmin(M~i)(1ε)\lambda_{\min}(\widetilde{M}_{i})\geq(1-\varepsilon). We will show it using Matrix concentration inequalities. But first we need to bound λmax(Zj)\lambda_{\max}(Z_{j}).

From the above two inequalities, we have that

Hence the probability that all the partitions satisfy this β\beta-balanced condition, for β=p1ε\beta=\sqrt{\frac{p}{1-\varepsilon}}, is atleast

Since ε=δ/2\varepsilon=\delta/2 and 0δ10\leq\delta\leq 1, it can be seen that

Therefore the partition is β\beta-balanced, for β=(1+δ)p\beta=\sqrt{(1+\delta)p}, with probability 1/e\geq 1/e. \square

The quantity vj(VV)1vjv_{j}^{\top}(V^{\top}V)^{-1}v_{j} is also called the statistical leverage score of vjv_{j} with respect to VVV^{\top}V. For two partitions, the theorem states that if the leverage score of all rows is O(1logn)O(\frac{1}{\log n}), then the partitions are β\beta-balanced for β2\beta\approx\sqrt{2}.

Price of Fairness

In this section we present conditions under which the kk-DPP and PP-DPP distributions are close to each other. Note that the support of a PP-DPP is a subset of the support of the corresponding kk-DPP. Thus, a natural definition of the price of fairness is the KL-divergence between them.

We define the following property for the input data and analyze its price of fairness.

For 0δ10\leq\delta\leq 1, the partition X1,,XpX_{1},\dots,X_{p} is called a δ\delta-drop partition with respect to VV and k1,,kpk_{1},\dots,k_{p} if for all i{1,,p}i\in\{1,\dots,p\}, σi,ki+1δσi,ki.\sigma_{i,k_{i}+1}\leq\delta\sigma_{i,k_{i}}. Here σi,j\sigma_{i,j} is the jj-th largest singular value of VXiV_{X_{i}}.

Roughly, this says that, if δ\delta is small, then each of the matrices VXiV_{X_{i}} is effectively a rank-kik_{i} matrix. Such a notion of low effective rank appears frequently in the machine learning literature . We prove the following theorem that asserts that if the δ\delta-drop condition is satisfied, then we can be sure that most of the probability mass is concentrated on subsets which satisfy partition constraints. In such a case, sampling a kk sized subset using any kk-DPP algorithm will output a subset which satisfies partition constraints with high probability.

Let ε(0,1)\varepsilon\in(0,1) and suppose that the partition X1,,XpX_{1},\dots,X_{p} is δ\delta-drop w.r.t. VV and k1,,kpk_{1},\dots,k_{p}, with δεnN0\delta\leq\frac{\varepsilon}{nN_{0}} and N0:=(k+p1p1)N_{0}:=\binom{k+p-1}{p-1}. If n\geq\sqrt{2}k\cdot\big{(}\frac{\gamma}{\sigma_{n}}\big{)}^{2} (with γ:=max{σi,1}i\gamma:=\max\{\sigma_{i,1}\}_{i}, where σi,1\sigma_{i,1} is the largest singular value of VXiV_{X_{i}} and σn\sigma_{n} is the smallest non-zero singular value of VV) then the price of ensuring fairness is DKL(qq)log1(1ε).D_{KL}(q^{\star}||q)\leq\log\frac{1}{(1-\varepsilon)}.

We will use the following lemma in the proof.

We start by decomposing the terms in SCBdet(VSVS)\sum_{S\in\mathcal{C\setminus B}}\det(V_{S}V_{S}^{\top}) and analyzing each term individually using Lemma 4.2. Given a set SXS\subseteq X, let Si:=SXiS_{i}:=S\cap X_{i}. Then S=i=1pSiS=\bigcup_{i=1}^{p}S_{i}. Using this, the family CB\mathcal{C\setminus B} can be decomposed as

Let S(j1,,jp)S_{(j_{1},\dots,j_{p})} denote the following family of subsets

and, for brevity, let J\mathcal{J} denote the following set integer tuples (all but (k1,k2,,kp)(k_{1},k_{2},\ldots,k_{p}))

Given this notation, we can write the following sum as

We analyze each term of the above summation individually. We start by noting that

where for all ii, Si=SXiS_{i}=S\cap X_{i}, this is a simple consequence of the fact that VVVV^{\top} is positive semidefinite. Therefore,

Whenever a set SS of cardinality kk does not belong to B\mathcal{B}, for at least one ii, we have that Si=SXi>ki|S_{i}|=|S\cap X_{i}|>k_{i}. Let us now analyze how does a sum of the form TXi,T=jdet(VTVT)\sum_{T\subseteq X_{i},|T|=j}\det(V_{T}V_{T}^{\top}) behave depending on whether jkij\leq k_{i} or j>kij>k_{i}.

Since δ<εnN0\delta<\frac{\varepsilon}{nN_{0}},

Since δ<εnN0\delta<\frac{\varepsilon}{nN_{0}},

Using the above inequalities, we obtain that for every (j1,,jp)J(j_{1},\ldots,j_{p})\in\mathcal{J}

Note that the size of the set of tuples J\mathcal{J} is bounded from above by J(k+p1p1)=N0|\mathcal{J}|\leq\binom{k+p-1}{p-1}=N_{0}. Therefore,

It remains to find a lower bound for SCdet(VSVS)\sum_{S\in\mathcal{C}}\det(V_{S}V_{S}^{\top}). Using Lemma 3.3, we obtain

By using the inequality (nk)nkkk\binom{n}{k}\geq\frac{n^{k}}{k^{k}} we finally arrive at

Using the assumption that n\geq\sqrt{2}k\cdot\big{(}\frac{\gamma}{\sigma_{n}}\big{)}^{2} we obtain

and an application of Lemma 4.2 finishes the proof. \square

Empirical Results

In each experiment, we compare several different probability distributions from which to select kk samples from a dataset: As benchmarks we consider the (unconstrained) distributions, kk-DPP (see Def 2.2), and UNIF, which selects a uniformly random subset of size kk from the dataset XX. We compare this against different methods which select from a fair family of allowed subsets, PP-DPP (see Def 2.3), and kik_{i}-DPP (see Def 5.1 below).

Algorithms for kik_{i}-DPPs are simply obtained by independently using a kk-DPP sampler with k=kik=k_{i} on each part XiX_{i}. For sampling from all the above listed distribution we use the Sample and Project algorithm as described in Section 3.2.

In each experiment, we report the geometric diversity G()G(\cdot) (see Def 2.1) and the fairness as measured by the KL-divergence from the desired frequency over parts. Formally, given a probability distribution qq over the pp parts of the dataset, we define the relative unfairness measure of a set SXS\subseteq X as Dq(S):=DKL(qs),D^{q}(S):=D_{KL}(q||s), where s=(s1,,sp)s=(s_{1},\ldots,s_{p}) denotes the vector of frequencies, i.e., si=XiSSs_{i}=\frac{|X_{i}\cap S|}{|S|} for i=1,2,,pi=1,2,\ldots,p. In particular, typically we want to have Dq()D^{q}(\cdot) as small as possible – ideally equal to . When qi=1/pq_{i}=1/p for all ii, we refer to DqD^{q} as DunD^{\rm un}. When qi=Xi/mq_{i}=|X_{i}|/m, we refer to DqD^{q} as DpropD^{\rm prop}.

2 Experiment on Image Dataset

We gathered a collection of images curated using Google image search as follows: Four search terms were used: (a) “Scientist Male”, (b) “Scientist Female”, (c) “Painter Male”, and (d) “Painter Female”.The images are available at goo.gl/hNukfP.

Following , each image was processed with the vlfeat toolbox to obtain sets of 128-dimensional SIFT descriptors . All such descriptors are collected in a single set and subsampled to roughly 10%10\% of its total size. The resulting set of 104\approx 10^{4} descriptors was clustered using the kk-means algorithm where k=128k=128 is the number of means. The feature vector for an image is the normalized histogram of the nearest clusters to the descriptors in the image.

2.2 Experiment on Biased Datasets

Our goal is to understand how the bias in the underlying dataset can affect the performance of the different sampling distributions with respect to fairness and geometric diversity. We include all female (b and d) images, but vary how many of the male images (a and c) appear in the dataset in order to create biased sets that have between 10%10\% to 50%50\% male images. The male images are selected uniformly at random from the set of all male scientists and male artists for each repetition in the experiment. We sample 40 images from each biased dataset; roughly the number that fits on the first page of an image search result. We conduct 200 repetitions. We place fairness constraints so that PP-DPP and kik_{i}-DPP select exactly 50%50\% of their samples from the male (a and c) images and female (b and d) images, regardless of the bias in the underlying dataset. Note that we do not enforce constraints across scientist (a and b) images and artist (c and d) images, but measure the unfariness Dun()D^{\rm un}(\cdot) with respect to all four attributes.

2.3 Results

With respect to Dun()D^{\rm un}(\cdot), PP-DPP significantly outperforms kk-DPP, and UNIF (paired one-sided tt-tests, p<0.05p<0.05), see Figure 5. As expected, the bias in the underlying dataset can dramatically affect the fairness of UNIF and kk-DPP as neither approach is designed to correct for such biases. However, PP-DPP and kik_{i}-DPP both enforce fairness constraints; note that this is despite the fact that the sampling was only equal with respect to gender and not profession. The latter does not appear to affect the outcome here.

With respect to the diversity G()G(\cdot), PP-DPP has significantly higher G()G(\cdot) than UNIF and kik_{i}-DPP (paired one-sided tt-tests, p<0.05p<0.05). Moreover, PP-DPP performs comparatively to kk-DPP; the mean diversity of kk-DPP is higher, but not significantly so. Thus, we observe that, when the underlying data is biased, there is a tradeoff between Dun()D^{\rm un}(\cdot) (for which PP-DPP performs best) and G()G(\cdot) (for which kk-DPP performs best); however the differences in geometric diversity are negligible while differences in unfairness can be very large.

3 Experiment on Real-World Dataset

The Adult income dataset consists of roughly 45000 records of subjects each with 14 features such as age, race, education and a binary label indicating whether a subject’s incomes is above or below 50K USD.Data downloaded from https://archive.ics.uci.edu/ml/datasets/adult. This dataset has been widely studied in the context of fairness (see, ).

In preprocessing the data we filter out incomplete entries, and from the remaining ones we pick a random subset of 5000 records for our experiments. We vectorize the data as follows: Categorical fields (with a small number of possible values) we turn into sets of binary fields. As the dimension nn of such feature vectors is quite small – 5050 – the DPP framework allows sampling sets of cardinality at most k50k\leq 50. For this reason we enrich the feature vectors in a standard way – by adding pairwise products of all existing features as separate ones – this, after removing redundant columns, yields feature vectors of dimension 992992.

3.2 Experiment on Equal and Proportional Representation

We conduct our experiment across either gender or race as the sensitive attribute. For the former, we use the gender categories provided in the dataset; all entries were labeled either male (68.3%) or female (31.7%). For the latter, we use the race categories provided in the dataset; we consider the partition Caucasian (85.7%) and non-Caucasian (14.3%).

In addition to the algorithms mentioned above, we report the performance of an additional benchmark kik_{i}-UNIF, which selects a uniformly random subset of size kik_{i} from XiX_{i}.

In our subsampling, we consider both equal representation, where each attribute makes up of 50% of the selected points, and proportional representation, where each attribute is represented with the same ratio as in the original population.

3.3 Results

We observe that PP-DPP has the highest diversity out of all constrained sampling methods regardless of the proportion of representation or sensitive attribute; see Table 1. Surprisingly, the diversity of PP-DPP matches that of the unconstrained kk-DPP for Gender under proportional representation and for Race under equal representation. In the other two settings – Gender under equal representation and Race under proportional representation – the PP-DPP score is lower than that of kk-DPP, but minimally so, and outperforms kik_{i}-DPP by several standard deviations.

We note that kik_{i}-UNIF, although it has very poor geometric diversity as a whole, performs better under equal representation than it does under proportional representation. This fact suggests that there could be value in selecting sensitive attributes equally beyond the consideration of fairness.

The fact that PP-DPP performs so well, especially when significantly changing the distribution of sensitive attributes (e.g., for race, from 14.3% non-Caucasian to 50% non-Caucasian), is quite surprising. Overall, it appears that one can support very dramatic changes to the underlying distributions of attributes with minimal or even zero loss to geometric diversity by using our PP-DPP algorithm.

4 Experiment on Price of Fairness

We look at the effect of the scaling of singular values, suggested by Theorem 4.1, on the sampled subsets of our Algorithm. In this experiment we take an instance of random vectors and use different sampling methods to sample a subset from the dataset, and report the Dun()D^{\rm un}(\cdot) and logG()\log G(\cdot) value of the sampled subset. Following this, we scale the tail singular values of the partition matrices by δ=O(1/n)\delta=O(1/n) and again report the Dun()D^{\rm un}(\cdot) and logG()\log G(\cdot) values.

We also present a heuristic approach, Scale-And-Sample, for constrained sampling which will use any kk-DPP algorithm as a sub-routine. The algorithm is simple. For each VXiV_{X_{i}}, scale the smallest (nki)(n-k_{i}) singular values by 1/n1/n. Then sample a i=1pki\sum_{i=1}^{p}k_{i} sized subset using any kk-DPP algorithm.

The results are presented in Table 2. It can be seen that after scaling the tail singular values of the partition matrices, the mean Dun()D^{\rm un}(\cdot) value for kk-DPP is very low, and resembles closely the constrained sampling case. We also note that the Scale-And-Sample approach to constrained sampling suggested earlier performs very well. The mean relative unfairness measure Dun()D^{\rm un}(\cdot) is almost zero. Furthermore, the value of the geometric diversity parameter logG()\log G(\cdot) is also similar to unscaled PP-DPP.

Proofs

2 Proof of Lemma 3.1

We will prove this lemma by induction. For the base case where there is just one row in WW, det(WW)\det(WW^{\top}) is equal to w12\left\lVert w_{1}\right\rVert^{2} which is equal to ΠH1w12\left\lVert\Pi_{H_{1}}w_{1}\right\rVert^{2}.

Let WW^{\prime} be the matrix with {w1,,wk1}\{w_{1},\dots,w_{k-1}\} as rows. Assume that the statement is true for k1k-1 rows, i.e.,

Note that elementary row product or addition transformations do not change the determinant. We will apply these transformation to make the entries of first row and first column go to zero.

Let (i)(i) denote the ii-th row of the above matrix and WW(i,j)WW^{\top}_{(i,j)} denote the (i,j)(i,j) entry. Then the transformation

will make the WW(1,2)WW^{\top}_{(1,2)} entry go to zero. For the rest of the elements,

We continue this way and next apply the transformation

This will make the WW(1,3)WW^{\top}_{(1,3)} entry go to zero and by the similar analysis as above we get WW(1,i)=wki+1ΠH2(wk)WW^{\top}_{(1,i)}=w_{k-i+1}^{\top}\Pi_{H^{\prime}_{2}}(w_{k}), where HiH^{\prime}_{i} is the subspace spanned by the vectors {wk1,,wki}\{w_{k-1},\dots,w_{k-i}\}. After applying k1k-1 row transformations of the form

we get that the entries WW(1,i)=0WW^{\top}_{(1,i)}=0, for i1i\neq 1 and

Note that Hk=HkH^{\prime}_{k}=H_{k} defined in the statement of the lemma.

We can apply similar column operations to make all the entries of the first column, except WW(1,1)WW^{\top}_{(1,1)}, go to zero. Since these elementary operations do not affect the determinant, we get Therefore

3 Proof of Lemma 3.3

where σ1,,σm\sigma_{1},\dots,\sigma_{m} are the singular values of VV.

The coefficient of xmkx^{m-k} in i=1m(x+σi2)\prod_{i=1}^{m}(x+\sigma_{i}^{2}) is equal to 1i1<i2<<ikmσi12σi22σik2\sum_{1\leq i_{1}<i_{2}<\ldots<i_{k}\leq m}\sigma_{i_{1}}^{2}\sigma_{i_{2}}^{2}\cdot\cdots\cdot\sigma_{i_{k}}^{2}. Let Wk\mathcal{W}_{k} be the set of all principal kk-minors of VVVV^{\top}. It is a well known fact in linear algebra that the coefficient of xmkx^{m-k} in det(xI+VV)\det(xI+VV^{\top}) is equal to

4 Proof of Lemma 3.4

We first show that for every part ii, the corresponding matrix VXiV_{X_{i}} has rank at least kk. For this, first note that VV has at least kk non-zero singular values, i.e., σk>0\sigma_{k}>0. This follows from the fact that the number of non-zero singular values determines the rank of VV. The rank of VV is certainly at least kk, since otherwise the diversity of every subset of size kk would be zero.

From the β\beta-balance condition it follows that the number of non-zero singular values of VXiV_{X_{i}} is the same as for VV, and hence also the rank of VXiV_{X_{i}} is at least kk, as claimed.

Note now that the set of vectors output by the algorithm has determinant zero if and only if for an iteration jj there exists a partition XiX_{i} such that SXi<ki|S\cap X_{i}|<k_{i} and wx=0\left\lVert w_{x}\right\rVert=0 for all xXix\in X_{i}, where S={x1,,xj1}S=\{x_{1},\dots,x_{j-1}\}.

Conclusion and Future Work

In this paper we initiated the study of fair and diverse DPP-based sampling for data summarization. We provide a novel and fast algorithm that can sample from a DPP that satisfy fairness constraints based on the desired proportion of samples with a given attribute. Our algorithm gives provably good guarantees when the data matrix satisfies a natural β\beta-balance property. We prove that a large class of datasets satisfy the β\beta-balance condition. We define a notion of price of fairness, the KL-divergence between the fairness constrained distribution and the unconstrained distribution and theoretically show that, when the data satisfies reasonable properties, this price would be low. We further show experimentally that adding fairness constraints results in minimal loss to diversity, even when the underlying dataset is very biased, or when the proportion of attributes is changed significantly.

Several challenging problems remain from a technical standpoint; naturally, a first question would be whether the theorems can be improved either by attaining better approximation guarantees, or by weakening the necessary conditions. Extending these results to arbitrary group structures (as opposed to partitions) would be very relevant, but appears to be significantly more challenging.

From a practical point of view, it remains to be seen what effect de-biasing a sampler has on the end result of a machine learning algorithm (e.g., classification), both on its accuracy and on the bias down the line.

References