Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Adam Tauman Kalai

cs.IR cs.LG stat.ML

Introduction

The presence of automated decision-making systems in our daily lives is growing. As a result, these systems play an increasingly active role in shaping our future. Far from being passive players that consume information, automated decision-making systems are participating actors: their predictions today affect the world we live in tomorrow. In particular, they determine many aspects of how we experience the world, from the news we read and the products we shop for to the job postings we see. The increased prevalence of machine learning has therefore been accompanied by a growing concern regarding the circumstances and mechanisms by which such systems may reproduce and augment the various forms of discrimination and injustices that are present in today’s society.

One domain in which the use of machine learning is increasingly popular—and in which unfair practices can lead to particularly negative consequences—is that of online recruiting and automated hiring. Maintaining an online professional presence has become increasingly important for people’s careers, and this information is often used as input to automated decision-making systems that advertise open positions and recruit candidates for jobs and other professional opportunities. In order to perform these tasks, a system must be able to accurately assess people’s current occupations, skills, interests, and “potential.” However, even the simplest of these tasks—determining someone’s current occupation—can be non-trivial. Although this information may be provided in a structured form on some professional networking platforms, this is not always the case. As a result, recruiters often browse candidates’ websites in an attempt to manually determine their current occupations. Machine learning promises to reduce this burden; however, as we will explain in this paper, occupation classification is susceptible to gender bias, stemming from existing gender imbalances in occupations.

To study gender bias in occupation classification, we created a new dataset of hundreds of thousands of online biographies, written in English, from the Common Crawl corpus. Because biographies are typically written in the third person by their subjects (or people familiar with their subjects) and because pronouns are gendered in English, we were able to extract (likely) self-identified binary gender from the biographies. We note, though, that this binary model is a simplification that fails to capture important aspects of gender and erases people who do not fit within its assumptions.

Using this dataset, we predicted people’s occupations by performing multi-class classification using three different semantic representations: bag-of-words, word embeddings, and deep recurrent neural networks. For each representation, we considered two scenarios: (1) where explicit gender indicators are available to the classifier, and (2) where explicit gender indicators are “scrubbed” to promote fairness or to comply with regulations or laws. We define explicit gender indicators to be information, such as first names and gendered pronouns, that makes it possible to determine gender. We note that the practice of “scrubbing” explicit gender indicators and other sensitive attributes is not unique to machine learning, and is often used as a way to mitigate the effects of implicit and explicit bias on decisions made by humans. For example, gender diversity in orchestras was significantly improved by the introduction of “blind” auditions, where candidates play behind a curtain (Goldin and Rouse, 2000).

To quantify gender bias, we compute the true positive rate (TPR) gender gap—i.e., the difference in TPRs between genders—for each occupation. The TPR for a given gender and occupation is defined as the proportion of people with that gender and occupation that are correctly predicted as having that occupation. We also compute the correlation between these TPR gender gaps and existing gender imbalances in occupations, and show how this may compound these imbalances; we connect this finding with an existing notion of indirect discrimination in political philosophy. We show that “scrubbing” explicit gender indicators reduces the TPR gender gaps, while maintaining overall classifier accuracy. However, we also show that significant TPR gender gaps remain in the absence of explicit gender indicators, and that these gaps are correlated with existing gender imbalances. For orchestra auditions, the sounds made by candidates’ shoes mean that a curtain is not sufficient to make an audition “blind.” It is therefore common practice to additionally roll out a carpet or to ask candidates to remove their shoes (Goldin and Rouse, 2000). By analogy, “scrubbing” explicit gender indicators is like introducing a curtain—the sounds made by the candidates’ shoes remain.

Our paper has two main takeaways: First, “scrubbing” explicit gender indicators is not sufficient to remove gender bias from an occupation classifier. Second, even in the absence of such indicators, TPR gender gaps are correlated with existing gender imbalances in occupations; occupation classifiers may therefore compound existing gender imbalances. Although we focus on gender bias, we note that other biases, such as those involving race or socioeconomic status, may also be present in occupation classification or in other tasks related to online recruiting and automated hiring. We structure our analysis so as to inform discussions about these biases as well.

In the next section, we provide a brief overview of related work. We then describe our data collection process in Section 3 and outline our methodology in Section 4, before presenting our analysis and results in Section 5. We conclude with a discussion in Section 6.

Related Work

Recent work has studied the ways in which stereotypes and other human biases may be reflected in semantic representations such as word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018). Natural language processing researchers have also studied gender bias in coreference resolution (Zhao et al., 2018; Rudinger et al., 2018), showing that systems perform better when linking a gender pronoun to an occupation in which that gender is overrepresented than to an occupation in which it is underrepresented. Gender bias has also been studied in YouTube’s autocaptioning (Tatman, 2017), where researchers found a higher word error rate for female speakers. In the context of language identification, researchers have also investigated racial bias, showing that African-American English is often misclassified as non-English (Blodgett and O’Connor, 2017). Finally, machine learning methods for identifying toxic comments exhibit disproportionately high false positive rates for words like gay and homosexual (Dixon et al., 2017).

In the context of structured data, there have been extensive discussions about proxy behavior that may occur when sensitive attributes are not explicitly available but can be determined from other attributes (Pope and Sydnor, 2011; Barocas and Selbst, 2016; Zemel et al., 2013). Related discussions have focused on the phenomenon of differential subgroup validity (Ayres, 2002), where the choice of attributes may disadvantage groups for whom the chosen attributes are not equally predictive of the target label (Calders and Žliobaitė, 2013). Barocas and Selbst (2016) discussed these issues in the context of automated hiring; Kim (2016) explained how data-driven decisions that systematically bias people’s access to opportunities relate to existing antidiscrimination legislation, identifying voids that may need to be filled to account for potential risks stemming from automated decision-making systems. Researchers have also discussed making available sensitive attributes as a means to improve fairness (Dwork et al., 2012), as well as various ways to use these attributes (Dwork et al., 2018; Pope and Sydnor, 2011). Finally, although our paper does not directly consider ranking scenarios, fairness in ranking is particularly relevant to discussions about gender bias in online recruiting and automated hiring (Zehlike et al., 2017; Celis et al., 2018; Yang and Stoyanovich, 2017; Biega et al., 2018; Geyik and Kenthapadi, 2018).

We quantify gender bias by computing the TPR gender gap—i.e., the difference in TPRs between genders—for each occupation. This notion of bias is closely related to the equality of opportunity fairness metric of Hardt et al. (2016). We choose to focus on TPR gender gaps because they enable us to study the ways in which gender imbalances may be compounded; in turn, we relate this to compounding injustices (Hellman, 2018)—an existing notion of indirect discrimination in political philosophy that holds that it is a general moral duty to refrain from taking actions that would harm people when those actions are informed by, and would compound, prior injustices suffered by those people. We show that the TPR gender gaps are correlated with existing gender imbalances in occupations. As a result, occupation classifiers compound injustices when existing gender imbalances are attributable to historical discrimination.

Our paper is also closely related to research on gender bias in hiring (Sarsons, 2015, 2017; Ginther and Kahn, 2004; Bertrand and Duflo, 2017). In particular, Bertrand and Mullainathan (2004) conducted an experiment in which they responded to help-wanted ads using fictitious resumes, varying names so as to signal gender and race, while keeping everything else the same. They were therefore able to measure the effect of (inferred) gender and race on the likelihood of being called for an interview. Similarly, we study the effect of explicit gender indicators on occupation classification.

Computational linguistics researchers have explored the use of lexical and syntactic features to infer authors’ genders (Cheng et al., 2011; Koppel et al., 2002). Given that our dataset consists of online biographies, our paper is also related to research on differences between the ways that men and women represent themselves. In the context of online professional presences, Altenburger et al. (2017) analyzed self-promotion in LinkedIn, finding that women are more modest than men in expressing accomplishments and are less likely to use free-form fields. Researchers have also studied differences in volubility between men and women (Brescoll, 2011), showing that women’s fear of being highly voluble is justified by the fact that both men and women negatively evaluate highly voluble women. Moving beyond self-representation, Niven and Zilber (2001) analyzed congressional websites and found that differences between the ways that the media portray men and women in Congress cannot be explained by differences between the ways that they portray themselves. Meanwhile, Smith et al. (2018) analyzed attributes used to describe men and women in performance evaluations, showing that negative attributes are more often used to describe women than men. This research on representation by others relates to our paper because we cannot be sure that the online biographies in our dataset were actually written by their subjects.

Data Collection Process

To study gender bias in occupation classification, we created a new dataset using the Common Crawl. Specifically, we identified online biographies, written in English, by filtering for lines that began with a name-like pattern (i.e., a sequence of two capitalized words) followed by the string “is a(n) (xxx) title,” where title is an occupation from the BLS Standard Occupational Classification system (https://www.bls.gov/soc/). We identified the twenty-eight most frequent occupations based on their appearance in a small subset of the Common Crawl. In a few cases, we merged occupations. For example, we created the occupation professor by merging occupations that consist of professor and a modifier, such as economics professor. Having identified the most frequent occupations, we processed WET files (WET is a special file format containing cleaned text extracted from webpages) from sixteen distinct crawls from 2014 to 2018, extracting online biographies corresponding to those occupations only. Finally, we performed de-duplication by treating biographies as duplicates if they had the same first name, last name, and occupation, and either no middle name was present or one middle name was a prefix of the other. The resulting dataset consists of 397,340 biographies spanning twenty-eight different occupations. Of these occupations, professor is the most frequent, with 118,400 biographies, while rapper is the least frequent, with 1,406 biographies (see Figure 1). The longest biography is 194 tokens, while the shortest is eighteen; the median biography length is seventy-two tokens. We note that the demographics of online biographies’ subjects differ from those of the overall workforce, and that our dataset does not contain all biographies on the Internet; however, neither of these factors is likely to undermine our findings.
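The line filter described above can be sketched as a regular expression. This is a minimal illustration rather than the released implementation: the title list `TITLES`, the pattern `BIO_START`, and the helper `match_biography` are hypothetical names, and the handling of modifier words is an assumption about what “(xxx)” permits.

```python
import re

# Hypothetical stand-in for the occupation titles; the real list is
# derived from the BLS classification and is much longer.
TITLES = ["professor", "nurse", "surgeon", "rapper"]

# Two capitalized words, then "is a"/"is an", optional modifier words
# (an assumption about "(xxx)"), then one of the titles.
BIO_START = re.compile(
    r"^(?P<first>[A-Z][a-z]+) (?P<last>[A-Z][a-z]+) is an? "
    r"(?:[\w-]+ )*?(?P<title>" + "|".join(map(re.escape, TITLES)) + r")\b"
)

def match_biography(line):
    """Return (first, last, title) if the line opens like a biography, else None."""
    m = BIO_START.match(line)
    if m is None:
        return None
    return m.group("first"), m.group("last"), m.group("title")
```

A pattern this simple will miss names with accents, initials, or more than two parts; it is only meant to make the filtering step concrete.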

Because some occupations have a high gender imbalance, our validation and testing splits must be large enough that every gender and occupation are sufficiently represented. We therefore used stratified-by-occupation splits, with 65% of the biographies (258,370) designated for training, 10% (39,635 biographies) designated for validation, and 25% (99,335 biographies) designated for testing.
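A stratified-by-occupation split of this kind might look as follows. The sketch assumes records are `(occupation, biography)` pairs and uses a hypothetical helper `stratified_split`; the rounding rule and the fixed seed are illustrative choices, not details taken from the released code.

```python
import random
from collections import defaultdict

def stratified_split(records, fractions=(0.65, 0.10, 0.25), seed=0):
    """Split (occupation, biography) records per occupation into train/valid/test."""
    by_occupation = defaultdict(list)
    for occupation, bio in records:
        by_occupation[occupation].append((occupation, bio))

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for occupation in sorted(by_occupation):
        group = by_occupation[occupation]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_valid = round(fractions[1] * len(group))
        train.extend(group[:n_train])
        valid.extend(group[n_train:n_train + n_valid])
        # Whatever remains after rounding goes to the testing split.
        test.extend(group[n_train + n_valid:])
    return train, valid, test
```

Splitting within each occupation keeps rare occupations such as rapper present in all three splits.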

A complete implementation that reproduces the dataset can be found in the source code available at http://aka.ms/biasbios.

Methodology

We used our dataset to predict people’s occupations, taken from the first sentence of their biographies as described in the previous section, given the remainder of their biographies. For example, consider the hypothetical biography “Nancy Lee is a registered nurse. She graduated from Lehigh University, with honours in 1998. Nancy has years of experience in weight loss surgery, patient support, education, and diabetes.” The goal is to predict nurse from “She graduated from Lehigh University, with honours in 1998. Nancy has years of experience in weight loss surgery, patient support, education, and diabetes.”

We used three different semantic representations of varying complexity: bag-of-words (BOW), word embeddings (WE), and deep recurrent neural networks (DNN). When using the BOW and WE representations, we used a one-versus-all logistic regression as the occupation classifier; to construct the DNN representation, we started with word embeddings as input and then trained a DNN to predict occupations in an end-to-end fashion. For each representation, we considered two scenarios: (1) where explicit gender indicators (e.g., first names and pronouns) are available to the classifier, and (2) where explicit gender indicators are “scrubbed.” For example, these scenarios correspond to predicting the occupation nurse from the text “[She] graduated from Lehigh University, with honours in 1998. [Nancy] has years of experience in weight loss surgery, patient support, education, and diabetes,” with and without the bracketed words.

Bag-of-words

The BOW representation encodes the $i^{\textrm{th}}$ biography as a sparse vector $x_{i}^{\textrm{BOW}}$ . Each element of this vector corresponds to a word type in the vocabulary, equal to 1 if the biography contains a token of this type and 0 otherwise. Despite recent successes of using more complex semantic representations for document classification, the BOW representation provides a good baseline and is still widely used, especially in scenarios where interpretability is important. To predict occupations, we trained a one-versus-all logistic regression with $L_{2}$ regularization using our dataset’s training split represented using the BOW representation.
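As a small illustration of the binary encoding, the sketch below builds a vocabulary and a 0/1 vector. The helper names are hypothetical, the vector is dense for readability (a real implementation would store it sparsely), and ignoring out-of-vocabulary tokens is an assumption.

```python
def build_vocabulary(tokenized_bios):
    """Map each word type seen in the training biographies to a column index."""
    types = sorted({token for bio in tokenized_bios for token in bio})
    return {word: index for index, word in enumerate(types)}

def bow_vector(tokens, vocabulary):
    """Binary bag-of-words: 1 if the word type occurs in the biography, else 0."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # out-of-vocabulary tokens are ignored (assumption)
            vector[index] = 1
    return vector
```

Repeated tokens do not change the vector, which is what distinguishes this binary encoding from a count-based one.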

Word embeddings

The WE representation encodes the $i^{\textrm{th}}$ biography as a vector $x_{i}^{\textrm{WE}}$ , obtained by averaging the fastText word embeddings (Bojanowski et al., 2017; Mikolov et al., 2018) for the word types present in that biography. (We note that the fastText word embeddings were trained using the Common Crawl, albeit using a different subset than the one we used to create our dataset.) The WE representation is surprisingly effective at capturing non-trivial semantic information (Adi et al., 2016). To predict occupations, we trained a one-versus-all logistic regression with $L_{2}$ regularization using our dataset’s training split represented using the WE representation.
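The averaging step can be sketched with a toy embedding table. Here `embeddings` is a hypothetical dictionary standing in for the pretrained fastText vectors, and returning a zero vector when no word type is covered is an assumption.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the distinct word types present in a biography."""
    types = [t for t in sorted(set(tokens)) if t in embeddings]
    if not types:
        return [0.0] * dim  # no covered word type (assumed fallback)
    total = [0.0] * dim
    for word in types:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return [value / len(types) for value in total]
```

Because the average is taken over word types rather than tokens, a word that appears several times in a biography contributes only once.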

Deep recurrent neural networks

To construct the DNN representation, we started with the fastText word embeddings as input and then trained a DNN to predict occupations in an end-to-end fashion. We used an architecture similar to that of Yang et al. (2016), but with just one bi-directional recurrent neural network at the level of words and with gated recurrent units (GRUs) (Chung et al., 2014) instead of long short-term memory units; this model uses an attention mechanism—an integral part of modern neural network architectures (Vaswani et al., 2017). Our choice of architecture was motivated by a desire to use a relatively simple model that would be easy to interpret.

Formally, given the $i^{\textrm{th}}$ biography represented as a sequence of tokens $w_{i}^{1},\ldots,w_{i}^{T}$ , we start by replacing each token $w_{i}^{t}$ with the fastText word embedding for that word type to yield $e_{i}^{1},\ldots,e_{i}^{T}$ . The DNN then uses a GRU to process the biography in both forward and reverse directions and concatenates the corresponding hidden states from both directions to re-represent the $t^{\textrm{th}}$ token as follows:

Next, the DNN projects each hidden state $h_{i}^{t}$ to the attention dimension $k_{a}$ via a fully connected layer with weights $W_{a}$ and $b_{a}$ , and transforms the result into an unnormalized scalar $u_{i}^{t}$ via a vector $w_{a}$ :

Each scalar is then normalized to yield an attention weight:

Finally, we obtain the DNN representation via a weighted sum:

where $\hat{y}_{i}$ is the predicted occupation for the $i^{\textrm{th}}$ biography.
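One way to write out the steps described above is given below. This is a reconstruction from the prose, not a quotation: the $\tanh$ nonlinearity is an assumption borrowed from the architecture of Yang et al. (2016), the symbols $a_{i}^{t}$ and $x_{i}^{\textrm{DNN}}$ are chosen by analogy with $x_{i}^{\textrm{BOW}}$ and $x_{i}^{\textrm{WE}}$, and the output layer that maps $x_{i}^{\textrm{DNN}}$ to $\hat{y}_{i}$ is left unspecified.

```latex
\begin{align*}
h_{i}^{t} &= \big[\overrightarrow{\textrm{GRU}}(e_{i}^{t});\ \overleftarrow{\textrm{GRU}}(e_{i}^{t})\big], \\
u_{i}^{t} &= w_{a}^{\top}\tanh\!\big(W_{a}h_{i}^{t}+b_{a}\big), \\
a_{i}^{t} &= \frac{\exp(u_{i}^{t})}{\sum_{t'=1}^{T}\exp(u_{i}^{t'})}, \\
x_{i}^{\textrm{DNN}} &= \sum_{t=1}^{T}a_{i}^{t}\,h_{i}^{t}.
\end{align*}
```

Under this reading, the attention weights $a_{i}^{t}$ are non-negative and sum to one over the tokens of a biography, which is what makes them usable as per-token importance scores later in the analysis.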

We trained the DNN using our dataset’s training split and a standard cross-entropy loss applied to the output of the last layer.

Explicit Gender Indicators

For each semantic representation, we considered two scenarios. In the first scenario, the representation included all word types, meaning that explicit gender indicators are available to the occupation classifier. In the second scenario, we “scrubbed” explicit gender indicators prior to creating the representation, meaning that these indicators are not available to the occupation classifier. Specifically, we deleted the subject’s first name, along with the words he, she, her, his, him, hers, himself, herself, mr, mrs, and ms from each biography.
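The “scrubbing” step can be sketched as a token filter. The word list is the one given above; the tokenization regex and the function name `scrub` are assumptions made for illustration.

```python
import re

# The gendered words listed in the text above.
GENDERED_WORDS = {"he", "she", "her", "his", "him", "hers",
                  "himself", "herself", "mr", "mrs", "ms"}

def scrub(text, first_name):
    """Delete the subject's first name and the listed gendered words."""
    blocked = GENDERED_WORDS | {first_name.lower()}
    # Simple tokenizer (assumption): runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [t for t in tokens if t.lower() not in blocked]
```

Matching is case-insensitive, so a sentence-initial “She” is removed along with “she”.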

Analysis and Results

In this section, we analyze the potential allocation harms that can result from semantic representation bias. To do this, we study the performance of the occupation classifier for each semantic representation, with and without explicit gender indicators, as described in the previous section. The classifiers’ overall accuracies are shown in Figure 2. We start by analyzing gender bias for the scenario in which the semantic representations include all word types, including explicit gender indicators. We then analyze gender bias in the scenario in which explicit gender indicators are “scrubbed,” and use the DNN’s per-token attention weights to understand proxy behavior that occurs in the absence of explicit gender indicators.

For each semantic representation, we quantify gender bias by using our dataset’s testing split to calculate the occupation classifier’s TPR gender gap (i.e., the difference in TPRs between binary genders $g$ and ${\sim}g$ ) for each occupation $y$ :

$$\textrm{TPR}_{g,y}=P\,[\hat{Y}=y\,|\,G=g,Y=y],\qquad\textrm{Gap}_{g,y}=\textrm{TPR}_{g,y}-\textrm{TPR}_{{\sim}g,y},$$

where $\hat{Y}$ and $Y$ are random variables representing the predicted and target labels (i.e., occupations) for a biography and $G$ is a random variable representing the binary gender of the biography’s subject.
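The TPR gender gap can be estimated from a list of test predictions as sketched below. The input format and the function name `tpr_gender_gaps` are hypothetical; occupations for which either gender has no test examples are skipped, which is an assumption about how to handle empty cells.

```python
from collections import Counter

def tpr_gender_gaps(examples, gender="female", other="male"):
    """examples: iterable of (gender, true_occupation, predicted_occupation).

    Returns {occupation: TPR for `gender` minus TPR for `other`}.
    """
    totals, hits = Counter(), Counter()
    for g, y_true, y_pred in examples:
        totals[(g, y_true)] += 1
        if y_pred == y_true:
            hits[(g, y_true)] += 1

    gaps = {}
    for y in {y for (_, y) in totals}:
        # Skip occupations where one gender is absent from the split.
        if totals[(gender, y)] and totals[(other, y)]:
            tpr_g = hits[(gender, y)] / totals[(gender, y)]
            tpr_other = hits[(other, y)] / totals[(other, y)]
            gaps[y] = tpr_g - tpr_other
    return gaps
```

A negative value for an occupation means that people of the first gender who hold that occupation are recognized less often than their counterparts.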

Defining the percentage of people with gender $g$ in occupation $y$ as $\pi_{g,y}=P\,[G=g\,|\,Y=y]$ , Figure 3 shows $\textrm{Gap}_{\textrm{female},y}$ versus $\pi_{\textrm{female},y}$ for each occupation $y$ for the BOW representation with explicit gender indicators; Figure 4 depicts the same information for all three representations, with and without explicit gender indicators.

Compounding imbalance

We define the gender imbalance of occupation $y$ as $\frac{\pi_{g,y}}{\pi_{\sim g,y}}$ ; gender $g$ is underrepresented if $\frac{\pi_{g,y}}{\pi_{\sim g,y}}<1$ or, equivalently, if $\pi_{g,y}<0.5$ . The gender imbalance is compounded if the underrepresented gender has a lower TPR than the overrepresented gender—e.g., if $\textrm{Gap}_{g,y}<0$ and $g$ is underrepresented.

If $\pi_{g,y}<0.5$ and $\textrm{Gap}_{g,y}<0$ , then

If $\pi_{g,y}<\pi_{{\sim}g,y}$ and $\textrm{TPR}_{g,y}<\textrm{TPR}_{{\sim}g,y}$ , then

so the gender imbalance for the true positives in occupation $y$ is larger than the initial gender imbalance in that occupation. ∎
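The argument above can be written out with Bayes’ rule. The following is a reconstruction from the definitions of $\pi_{g,y}$ and $\textrm{TPR}_{g,y}$, not a quotation, and it assumes $\textrm{TPR}_{{\sim}g,y}>0$.

```latex
\begin{align*}
\frac{P\,[G=g\,|\,\hat{Y}=y,Y=y]}{P\,[G={\sim}g\,|\,\hat{Y}=y,Y=y]}
&= \frac{P\,[\hat{Y}=y\,|\,G=g,Y=y]\;P\,[G=g\,|\,Y=y]}
        {P\,[\hat{Y}=y\,|\,G={\sim}g,Y=y]\;P\,[G={\sim}g\,|\,Y=y]} \\
&= \frac{\textrm{TPR}_{g,y}}{\textrm{TPR}_{{\sim}g,y}}\cdot\frac{\pi_{g,y}}{\pi_{{\sim}g,y}}
\;<\; \frac{\pi_{g,y}}{\pi_{{\sim}g,y}}
\qquad\text{whenever } \textrm{TPR}_{g,y}<\textrm{TPR}_{{\sim}g,y}.
\end{align*}
```

The factor $\textrm{TPR}_{g,y}/\textrm{TPR}_{{\sim}g,y}$ is therefore exactly the amount by which the imbalance ratio shrinks among the true positives.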

As explained in Section 2, if the initial gender imbalance is due to prior injustices, an occupation classifier will compound these injustices, which may correspond to indirect discrimination (Hellman, 2018).

It is clear from Figure 3 that there are few occupations with an equal percentage of men and women (i.e., almost all occupations have a gender imbalance) and that for occupations in which women (conversely men) are underrepresented, $\textrm{Gap}_{\textrm{female},y}<0$ (conversely $\textrm{Gap}_{\textrm{male},y}<0$ ). In other words, there is a positive correlation between the TPR gender gap for an occupation $y$ and the gender imbalance in that occupation. (Figure 4 illustrates that this is also the case for the WE and DNN representations.) As a result, if the occupation classifier for the BOW representation were used to recruit candidates for jobs in occupation $y$ , it would compound the gender imbalance by a factor of $\frac{\textrm{TPR}_{g,y}}{\textrm{TPR}_{{\sim}g,y}}$ , where $g$ is the underrepresented gender. For example, $14.6\%$ of the surgeons in our dataset’s testing split are women (i.e., $\pi_{\textrm{female},\textrm{surgeon}}<0.5$ ). The classifier for the BOW representation correctly classifies $71\%$ of male surgeons and $54.5\%$ of female surgeons as surgeons (i.e., $\textrm{Gap}_{\textrm{female},\textrm{surgeon}}<0$ ). Consequently, only $11.6\%$ of the true positives are women, so the gender imbalance is compounded.
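The surgeon figures can be checked with a few lines of arithmetic. The function name is hypothetical; the three input numbers are the ones quoted above.

```python
def share_among_true_positives(pi_g, tpr_g, tpr_other):
    """Fraction of true positives with gender g, given pi_g = P[G=g | Y=y]."""
    hits_g = pi_g * tpr_g
    hits_other = (1.0 - pi_g) * tpr_other
    return hits_g / (hits_g + hits_other)

# 14.6% of surgeons are women; TPRs are 54.5% (women) and 71% (men).
share = share_among_true_positives(pi_g=0.146, tpr_g=0.545, tpr_other=0.71)
print(round(share, 3))  # 0.116
```

The share of women drops from 14.6% of all surgeons to about 11.6% of correctly identified surgeons, matching the figure in the text.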

Counterfactuals

To isolate the effects of explicit gender indicators on the representations’ occupation classifiers, we examined differences between the classifiers’ predictions on our dataset’s testing split as described above and their predictions on our dataset’s testing split with first names removed and other explicit gender indicators (see Section 4.2) swapped for their complements, keeping everything else the same. This analysis is similar in spirit to the experiment of Bertrand and Mullainathan (2004), in which they responded to help-wanted ads using fictitious resumes in order to measure the effect of gender and race on the likelihood of being called for an interview. By analyzing the counterfactuals obtained by swapping gender indicators, we can answer the question, “Which occupation would this classifier predict if this biography had used indicators corresponding to the other gender?” This question is interesting because we would expect an occupation classifier to predict the same occupation for a man and a woman with identical biographies. We note that this question is not the same as the question, “Which occupation would this classifier predict if this biography’s subject were the other gender?” Although the latter question is arguably more interesting, it cannot be answered without additionally changing all other factors that are correlated with gender (Kilbertus et al., 2017).
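Constructing such a counterfactual biography might look like the sketch below. The complement table `SWAP` is hypothetical: the text does not spell out the mapping, and “her” is ambiguous between “his” and “him”, so the choice made here is only one possibility.

```python
# Hypothetical complement table; "her" is mapped to "his" although it can
# also correspond to "him".
SWAP = {"he": "she", "she": "he", "his": "her", "her": "his", "him": "her",
        "hers": "his", "himself": "herself", "herself": "himself",
        "mr": "ms", "mrs": "mr", "ms": "mr"}

def swap_gender_indicators(tokens, first_name):
    """Drop the first name and replace each listed indicator by its complement."""
    swapped = []
    for token in tokens:
        lower = token.lower()
        if lower == first_name.lower():
            continue
        if lower in SWAP:
            replacement = SWAP[lower]
            # Preserve capitalisation of the original token.
            swapped.append(replacement.capitalize() if token[0].isupper() else replacement)
        else:
            swapped.append(token)
    return swapped
```

Everything other than the first name and the listed indicators is left untouched, so any change in the classifier’s prediction can be attributed to those tokens alone.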

For the BOW representation, we find that the classifier’s predictions for $5.5\%$ of the biographies in our testing split change when their gender indicators are swapped; for the WE and DNN representations, these percentages are $12.2\%$ and $4.6\%$ , respectively. To better understand the effects of explicit gender indicators on the classifiers’ predictions, we consider pairs of occupations. Specifically, for each gender $g$ and pair of occupations $(y^{1},y^{2})$ , we identify the set of biographies that are incorrectly predicted as having occupation $y^{1}$ with their original gender indicators, but correctly predicted as having occupation $y^{2}$ when their gender indicators are swapped:

Tables 1 and 2 list, for the BOW representation, the five pairs of occupations with the largest values of $\Pi_{g,(y^{1},y^{2})}$ . For example, $7.1\%$ of male paralegals whose occupations are only correctly predicted when their gender indicators are swapped are incorrectly predicted as attorneys when their biographies use male indicators. Similarly, $14.7\%$ of female rappers whose occupations are only correctly predicted when their gender indicators are swapped are incorrectly predicted as models when their biographies use female indicators.
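One reading of these percentages, inferred from the two examples above rather than from a stated formula, is that each pair’s count is normalized by the number of biographies of that gender and true occupation $y^{2}$ that are only correctly predicted after swapping. The sketch below implements that reading; the row format and the function name are hypothetical.

```python
from collections import Counter

def flip_pair_shares(rows, gender):
    """rows: (gender, true, predicted_original, predicted_swapped) per biography.

    Returns {(y1, y2): share}, where the share is taken among biographies of
    `gender` with true occupation y2 that are correct only after swapping.
    """
    flips = Counter()       # (y1, y2) -> count
    per_target = Counter()  # y2 -> count over all y1
    for g, y_true, pred_orig, pred_swap in rows:
        if g == gender and pred_orig != y_true and pred_swap == y_true:
            flips[(pred_orig, y_true)] += 1
            per_target[y_true] += 1
    return {pair: n / per_target[pair[1]] for pair, n in flips.items()}
```

Under this reading, the shares for a fixed gender and true occupation sum to one across the original (incorrect) predictions.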

Without Explicit Gender Indicators

If there are no differences between the ways that men and women in occupation $y$ represent themselves in their biographies other than explicit gender indicators, then “scrubbing” these indicators should be sufficient to remove all information about gender from the biographies—i.e.,

making it impossible to predict the gender of a “scrubbed” biography’s subject belonging to occupation $y$ better than random.

In order to determine whether “scrubbing” explicit gender indicators is sufficient to remove all information about gender, we used a balanced subsample of our dataset to predict people’s gender. We created a subsampled training split by first discarding from our dataset’s training split all occupations for which there were not at least $1,000$ biographies for each gender. For each of the remaining twenty-one occupations, we then subsampled $1,000$ biographies for each gender to yield $42,000$ biographies, balanced by occupation and gender. To create a subsampled validation split, we first identified the occupation and gender from those represented in the subsampled training split with the smallest number of biographies in our dataset’s validation split. Then, we subsampled that number of biographies from our dataset’s validation split for each of the twenty-one occupations represented in the subsampled training split and each gender. We created a subsampled testing split similarly. When using the BOW and WE representations, we used a logistic regression with $L_{2}$ regularization as the gender classifier; to construct the DNN representation, we started with word embeddings as input and then trained a DNN to predict gender in an end-to-end fashion, similar to the methodology described in Section 4.
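The construction of the balanced training subsample can be sketched as follows. The record format and the function name `balanced_subsample` are hypothetical, and the fixed seed is an illustrative choice.

```python
import random
from collections import defaultdict

def balanced_subsample(records, per_cell=1000, genders=("female", "male"), seed=0):
    """records: (occupation, gender, bio) triples.

    Keeps only occupations with at least `per_cell` biographies for every
    gender, then draws exactly `per_cell` biographies per occupation and gender.
    """
    cells = defaultdict(list)
    for occupation, gender, bio in records:
        cells[(occupation, gender)].append((occupation, gender, bio))

    rng = random.Random(seed)
    sample = []
    for occupation in sorted({occ for occ, _ in cells}):
        if all(len(cells.get((occupation, g), [])) >= per_cell for g in genders):
            for g in genders:
                sample.extend(rng.sample(cells[(occupation, g)], per_cell))
    return sample
```

Balancing by both occupation and gender means that a gender classifier trained on the subsample cannot gain accuracy from occupation frequencies alone.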

Using the subsampled testing split, we find that the gender classifier for the BOW representation has an accuracy of $65.5\%$ , while the DNN representation has an accuracy of $68.2\%$ . These accuracies are higher than $50\%$ , so “scrubbing” explicit gender indicators is not sufficient to remove all information about gender. This finding is reinforced by the scatterplot in Figure 5, which shows log frequency versus correlation with $G=\textrm{female}$ for each word type in the vocabulary. It is clear from this scatterplot that deleting all words that are correlated with gender would not be feasible.

True positive rate gender gap and compounding imbalance

For each semantic representation, we again quantify gender bias by using our (original) dataset’s testing split to calculate the occupation classifier’s TPR gender gap for each occupation. Figure 4 shows $\textrm{Gap}_{\textrm{female},y}$ versus $\pi_{\textrm{female},y}$ for each occupation $y$ for all three representations, with and without explicit gender indicators. “Scrubbing” explicit gender indicators reduces the TPR gender gaps, while the classifiers’ accuracies (shown in Figure 2) remain roughly the same; however, for some occupations, $\textrm{Gap}_{\textrm{female},y}$ is still very large. Moreover, because there is still a positive correlation between the TPR gender gap for an occupation $y$ and the gender imbalance in that occupation, “scrubbing” explicit gender indicators will not prevent the classifiers from compounding gender imbalances.

We note that compounding imbalances are especially problematic if people repeatedly encounter such classifiers—i.e., if an occupation classifier’s predictions determine the data used by subsequent occupation classifiers. Who is offered a job today will affect the gender (im)balance in that occupation in the future. If a classifier compounds existing gender imbalances, then the underrepresented gender will, over time, become even further underrepresented—a phenomenon sometimes referred to as the “leaky pipeline.”

To illustrate this phenomenon, we performed simulations using the DNN representation in which the candidate pool at time $t+1$ is defined by the true positives at time $t$ . Defining the percentage of people with gender $g$ in occupation $y$ at time $t$ as $\pi_{g,y}^{(t)}$ , we fit a linear regression to the TPR gender gaps for different values of $\pi_{g,y}^{(t)}$ :

Using this regression model, we are then able to calculate the percentage of people with gender $g$ in occupation $y$ at time $t+1$ :

Figure 6 shows $\pi_{g,y}^{(t)}$ for $t=0,\ldots,10$ ; each subplot corresponds to a different initial gender imbalance. Over time, the gender imbalances compound. We note that there are many different TPR pairs $\textrm{TPR}^{(t)}_{g,y}$ and $\textrm{TPR}^{(t)}_{{\sim}g,y}$ that can result in a given TPR gender gap $\textrm{Gap}^{(t)}_{g,y}$ . For example, a TPR gender gap of $-0.2$ might correspond to $0.6-0.8$ or to $0.7-0.9$ . Moreover, different TPR pairs will result in different percentages of people with gender $g$ in occupation $y$ at time $t+1$ . The bands in Figure 6 therefore reflect these differences.
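A toy version of this simulation is sketched below. The regression coefficients `slope` and `intercept` and the reference rate `tpr_other` are hypothetical values chosen for illustration, not fitted ones; the linear form of the gap and the choice of TPR pair for a given gap are assumptions.

```python
def simulate(pi0, steps=10, slope=0.5, intercept=-0.25, tpr_other=0.8):
    """Iterate pi^(t+1) = pi * TPR_g / (pi * TPR_g + (1 - pi) * TPR_other)."""
    trajectory = [pi0]
    pi = pi0
    for _ in range(steps):
        gap = slope * pi + intercept  # assumed linear fit of the gap on pi
        tpr_g = min(1.0, max(0.0, tpr_other + gap))
        hits_g = pi * tpr_g
        hits_other = (1.0 - pi) * tpr_other
        # The candidate pool at time t+1 consists of the true positives at time t.
        pi = hits_g / (hits_g + hits_other)
        trajectory.append(pi)
    return trajectory
```

Varying `tpr_other` while keeping the gap fixed changes the trajectory, which corresponds to the observation above that different TPR pairs with the same gap lead to different percentages at time $t+1$.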

Attention to gender

The DNN’s per-token attention weights allow us to understand proxy behavior that occurs in the absence of explicit gender indicators. The attention weights indicate which tokens are most predictive. For example, Figure 7 depicts the per-token attention weights from the occupation classifier for the DNN representation when predicting Bill Gates’ occupation from an excerpt of his biography on Wikipedia; the larger the weight, the stronger the color. The attention weights for the words software and architect are very large, and the DNN predicts software engineer.

In order to understand proxy behavior that occurs in the absence of explicit gender indicators, we first used the subsampled testing split, described above, to obtain per-token attention weights from the gender classifier for the DNN representation. We then used these weights to find “proxy candidates”—i.e., the words that are most predictive of gender in the absence of explicit gender indicators. Specifically, we computed the sum of the per-token attention weights for each word type, and then selected the types with the largest sums as “proxy candidates.” Across multiple runs, we found that the words women, husband, mother, woman, and female (ordered by decreasing total attention) were consistently “proxy candidates.”
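The selection of “proxy candidates” amounts to summing attention weights per word type and ranking the totals. In the sketch below, the input format, the lowercasing of tokens, and the function name are assumptions.

```python
from collections import defaultdict

def proxy_candidates(biographies, top_k=5):
    """biographies: one list of (token, attention_weight) pairs per biography."""
    totals = defaultdict(float)
    for pairs in biographies:
        for token, weight in pairs:
            totals[token.lower()] += weight
    # Largest total attention first; ties broken alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]
```

Because the weights are summed rather than averaged, frequent words with moderate attention can outrank rare words with high attention, which is worth keeping in mind when interpreting the resulting list.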

For each “proxy candidate,” we then used our dataset’s testing split, with and without explicit gender indicators, to create histograms of the per-token attention weights from the occupation classifier for the DNN representation. These histograms represent the extent to which that “proxy candidate” is predictive of occupation, with and without gender indicators. By comparing the histograms for each “proxy candidate,” we are able to identify words that are used as proxies for gender in the absence of explicit gender indicators: if there is a big difference between the histograms, then the “proxy candidate” is likely a proxy. Figure 8 shows per-occupation histograms for the word women, with (left) and without (right) explicit gender indicators. It is clear that in the absence of explicit gender indicators, the classifier has larger attention weights for the word women for all occupations. We see similar behavior for the other “proxy candidates,” suggesting that the classifier uses proxies for gender in the absence of explicit gender indicators.

The occupations in Figure 8 are ordered by TPR gender gap from negative to positive. For occupations in the middle, where there are small or no TPR gender gaps, the classifier still has non-zero attention weights for the word women. This means that using gender information does not necessarily lead to a TPR gender gap. We also note that it is possible that the classifier is using gender information to differentiate between occupations with very different gender imbalances that are otherwise similar, such as physician and surgeon.
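For readers who want to reproduce this kind of ordering, the sketch below computes a per-occupation TPR gender gap and sorts occupations from negative to positive. It assumes the gap is taken as the female TPR minus the male TPR, that gender is encoded as "F" or "M", and that predictions are available as (true occupation, predicted occupation, gender) triples; the function name and data are hypothetical.

```python
from collections import defaultdict


def tpr_gender_gaps(records):
    """Return {occupation: TPR_female - TPR_male}.

    records: iterable of (true_occupation, predicted_occupation, gender)
    triples with gender in {"F", "M"} (assumed encoding).
    Occupations lacking examples of either gender are skipped.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_occ, pred_occ, gender in records:
        totals[(true_occ, gender)] += 1
        hits[(true_occ, gender)] += int(pred_occ == true_occ)

    gaps = {}
    for occ in {occ for occ, _ in totals}:
        n_f = totals.get((occ, "F"), 0)
        n_m = totals.get((occ, "M"), 0)
        if n_f and n_m:
            tpr_f = hits.get((occ, "F"), 0) / n_f
            tpr_m = hits.get((occ, "M"), 0) / n_m
            gaps[occ] = tpr_f - tpr_m
    return gaps


# Made-up predictions for two occupations.
toy_records = [
    ("surgeon", "surgeon", "F"), ("surgeon", "physician", "F"),
    ("surgeon", "surgeon", "M"), ("surgeon", "surgeon", "M"),
    ("nurse", "nurse", "F"), ("nurse", "nurse", "F"),
    ("nurse", "nurse", "M"), ("nurse", "physician", "M"),
]
ordered = sorted(tpr_gender_gaps(toy_records).items(), key=lambda kv: kv[1])
```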

Discussion and Future Work

In this paper, we presented a large-scale study of gender bias in occupation classification using a new dataset of hundreds of thousands of online biographies. We showed that there are significant TPR gender gaps when using three different semantic representations: bag-of-words, word embeddings, and deep recurrent neural networks. We also showed that the correlation between these TPR gender gaps and existing gender imbalances in occupations may compound these imbalances. By performing simulations, we demonstrated that compounding imbalances are especially problematic if people repeatedly encounter occupation classifiers because the underrepresented gender will become even further underrepresented.

Recently, Dwork and Ilvento (2018) showed that fairness does not hold under composition, meaning that if two classifiers are individually fair according to some fairness metric, then the sequential use of these classifiers will not necessarily be fair according to the same metric. One interpretation of our finding regarding compounding imbalances is that unfairness holds under composition. Understanding why this is the case, especially given that fairness does not hold under composition, is an interesting direction for future work.

We found that the TPR gender gaps are reduced by “scrubbing” explicit gender indicators, while the classifiers’ overall accuracies remain roughly the same. This constitutes an empirical example where there is little tradeoff between performance and promoting fairness (in this case, by “scrubbing” explicit gender indicators). It also constitutes an empirical example where fairness is improved by “scrubbing” sensitive attributes, contrary to other examples in the literature (Kleinberg et al., 2018). That said, in the absence of explicit gender indicators, we did find that (1) we were able to predict the gender of a biography’s subject better than random, even when controlling for occupation; (2) significant TPR gender gaps remain for some occupations; and (3) there is still a positive correlation between the TPR gender gap for an occupation and the gender imbalance in that occupation, so existing gender imbalances may be compounded. These findings indicate that there are differences between men’s and women’s online biographies other than explicit gender indicators. These differences may be due to the ways that men and women represent themselves or due to men and women having different specializations within an occupation. Our findings highlight both the risks of using machine learning in a high-stakes setting and the difficulty of trying to promote fairness by “scrubbing” sensitive attributes.
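To make the limitation of “scrubbing” concrete, the following is a minimal sketch of a scrubbing step in Python. The word list, the placeholder character, and the function name `scrub` are assumptions made for illustration and are not the exact procedure or vocabulary used to build the dataset; in particular, first names are not handled here.

```python
import re

# Hypothetical, deliberately small list of explicit gender indicators.
GENDER_INDICATORS = [
    "he", "she", "his", "her", "hers", "him",
    "himself", "herself", "mr", "mrs", "ms",
]

# Match whole words only, case-insensitively, plus an optional trailing
# period so that titles such as "Mrs." are removed in full.
INDICATOR_PATTERN = re.compile(
    r"\b(?:" + "|".join(GENDER_INDICATORS) + r")\b\.?", re.IGNORECASE
)


def scrub(text, placeholder="_"):
    """Replace explicit gender indicators with a placeholder."""
    return INDICATOR_PATTERN.sub(placeholder, text)


scrubbed = scrub("She is a nurse; her husband teaches.")
```

In this toy sentence the pronouns are replaced, but the word "husband" survives, which is the kind of residual signal that the proxy analysis above is designed to detect.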

Our future work will focus primarily on understanding how best to mitigate TPR gender gaps and compounding imbalances in online recruiting and automated hiring. Finally, although we focused on gender bias, we note that other biases, such as those involving race or socioeconomic status, may also be present in occupation classification. Our methodology and analysis approach may prove useful for quantifying such biases, provided relevant group membership information is available. Moreover, quantifying such biases is an important direction for future work—it is likely that they exist and, in the absence of evidence that they do not, online recruiting and automated hiring run the risk of compounding prior injustices.

Appendix

Appendix A True positive rate gender gaps across representations

Figure 9 shows TPR gender gaps for BOW trained without gender indicators. Figures 10 and 11 show the results for WE, with and without gender indicators, respectively. Figures 12 and 13 show the results for DNN, with and without gender indicators, respectively.

Appendix B Attention to gender

Figure 14 shows the aggregated attention of the DNN model to the words “wife” and “husband”. As with the word “women”, the model trained without gender indicators places more attention on these words. Notice, however, that the shift in attention weights, while present, is smaller than for the word “women”, which is consistent with the lower aggregate attention these words receive in the gender prediction model.

B.2. Attention to gender indicators

Figure 15 shows the attention of the model, trained with and without gender indicators, on the word “she” when predicting occupation from biographies that contain gender indicators. One might expect that the model trained without gender indicators would not attend to this word, as it did not see it during training. However, the results indicate quite the opposite: the model pays much more attention to it. This can be attributed to the use of word embeddings, which enable the model to learn about words even if it has not explicitly seen them. Interestingly, when exposed to the word “she” during prediction, the model seems to receive a stronger gender signal than it saw during training, and pays a significant amount of attention to it.