Language (Technology) is Power: A Critical Survey of "Bias" in NLP

Su Lin Blodgett, Solon Barocas, Hal Daumé, Hanna Wallach

Introduction

A large body of work analyzing “bias” in natural language processing (NLP) systems has emerged in recent years, including work on “bias” in embedding spaces (e.g., Bolukbasi et al., 2016a; Caliskan et al., 2017; Gonen and Goldberg, 2019; May et al., 2019) as well as work on “bias” in systems developed for a breadth of tasks including language modeling (Lu et al., 2018; Bordia and Bowman, 2019), coreference resolution (Rudinger et al., 2018; Zhao et al., 2018a), machine translation (Vanmassenhove et al., 2018; Stanovsky et al., 2019), sentiment analysis (Kiritchenko and Mohammad, 2018), and hate speech/toxicity detection (e.g., Park et al., 2018; Dixon et al., 2018), among others.

Although these papers have laid vital groundwork by illustrating some of the ways that NLP systems can be harmful, the majority of them fail to engage critically with what constitutes “bias” in the first place. Despite the fact that analyzing “bias” is an inherently normative process—in which some system behaviors are deemed good and others harmful—papers on “bias” in NLP systems are rife with unstated assumptions about what kinds of system behaviors are harmful, in what ways, to whom, and why. Indeed, the term “bias” (or “gender bias” or “racial bias”) is used to describe a wide range of system behaviors, even though they may be harmful in different ways, to different groups, or for different reasons. Even papers analyzing “bias” in NLP systems developed for the same task often conceptualize it differently.

For example, the following system behaviors are all understood to be self-evident statements of “racial bias”: (a) embedding spaces in which embeddings for names associated with African Americans are closer (compared to names associated with European Americans) to unpleasant words than pleasant words (Caliskan et al., 2017); (b) sentiment analysis systems yielding different intensity scores for sentences containing names associated with African Americans and sentences containing names associated with European Americans (Kiritchenko and Mohammad, 2018); and (c) toxicity detection systems scoring tweets containing features associated with African-American English as more offensive than tweets without these features (Davidson et al., 2019; Sap et al., 2019). Moreover, some of these papers focus on “racial bias” expressed in written text, while others focus on “racial bias” against authors. This use of imprecise terminology obscures these important differences.

We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague and inconsistent. Many lack any normative reasoning for why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. Moreover, the vast majority of these papers do not engage with the relevant literature outside of NLP to ground normative concerns when proposing quantitative techniques for measuring or mitigating “bias.” As a result, we find that many of these techniques are poorly matched to their motivations, and are not comparable to one another.

We then describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. We argue that such work should examine the relationships between language and social hierarchies; we call on researchers and practitioners conducting such work to articulate their conceptualizations of “bias” in order to enable conversations about what kinds of system behaviors are harmful, in what ways, to whom, and why; and we recommend deeper engagements between technologists and communities affected by NLP systems. We also provide several concrete research questions that are implied by each of our recommendations.

Method

Our survey includes all papers known to us analyzing “bias” in NLP systems, 146 papers in total. We omitted papers about speech, restricting our survey to papers about written text only. To identify the 146 papers, we first searched the ACL Anthology (https://www.aclweb.org/anthology/) for all papers with the keywords “bias” or “fairness” that were made available prior to May 2020. We retained all papers about social “bias,” and discarded all papers about other definitions of the keywords (e.g., hypothesis-only bias, inductive bias, media bias). We also discarded all papers using “bias” in NLP systems to measure social “bias” in text or the real world (e.g., Garg et al., 2018).

To ensure that we did not exclude any relevant papers without the keywords “bias” or “fairness,” we also traversed the citation graph of our initial set of papers, retaining any papers analyzing “bias” in NLP systems that are cited by or cite the papers in our initial set. Finally, we manually inspected any papers analyzing “bias” in NLP systems from leading machine learning, human–computer interaction, and web conferences and workshops, such as ICML, NeurIPS, AIES, FAccT, CHI, and WWW, along with any relevant papers that were made available in the “Computation and Language” and “Computers and Society” categories on arXiv prior to May 2020, but found that they had already been identified via our traversal of the citation graph. We provide a list of all 146 papers in the appendix. In Table 1, we provide a breakdown of the NLP tasks covered by the papers. We note that counts do not sum to 146, because some papers cover multiple tasks. For example, a paper might test the efficacy of a technique for mitigating “bias” in embedding spaces in the context of sentiment analysis.
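The selection procedure described above (a keyword-based initial set, followed by a traversal of the citation graph) can be sketched as follows. This is only an illustration under stated assumptions, not the tooling used for the survey: the `cites` mapping, the `is_relevant` predicate, the paper identifiers, and the `max_hops` parameter are all hypothetical. The default of one hop reflects the description of retaining papers that cite or are cited by the initial set.

```python
from collections import deque


def expand_via_citations(seed, cites, is_relevant, max_hops=1):
    """Grow a seed set of papers by following citation links in both directions.

    seed        -- iterable of paper ids found by the keyword search (hypothetical ids)
    cites       -- dict mapping a paper id to the set of paper ids it cites (hypothetical data)
    is_relevant -- callable standing in for the manual judgment of whether a paper
                   analyzes "bias" in NLP systems (an assumption of this sketch)
    max_hops    -- how many citation links away from the seed set to look
    """
    # Build the reverse index so "cited by" lookups are as cheap as "cites" lookups.
    cited_by = {}
    for paper, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)

    retained = set(seed)
    frontier = deque((paper, 0) for paper in retained)
    while frontier:
        paper, hops = frontier.popleft()
        if hops >= max_hops:
            continue
        neighbours = set(cites.get(paper, ())) | cited_by.get(paper, set())
        for neighbour in neighbours:
            if neighbour not in retained and is_relevant(neighbour):
                retained.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return retained


# Toy citation graph: A cites B and X, C cites A, D cites C.
cites = {"A": {"B", "X"}, "C": {"A"}, "D": {"C"}}
relevant = {"A", "B", "C", "D"}  # X stands for a paper judged out of scope

one_hop = expand_via_citations({"A"}, cites, relevant.__contains__)
two_hops = expand_via_citations({"A"}, cites, relevant.__contains__, max_hops=2)
```

In the toy graph, the one-hop expansion retains A, B, and C, while D is only reached when `max_hops` is raised to 2. In the survey itself the relevance judgment was made by reading the papers, so `is_relevant` stands in for a manual step rather than an automated filter.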

Once identified, we then read each of the 146 papers with the goal of categorizing their motivations and their proposed quantitative techniques for measuring or mitigating “bias.” We used a previously developed taxonomy of harms for this categorization, which differentiates between so-called allocational and representational harms Barocas et al. (2017); Crawford (2017). Allocational harms arise when an automated system allocates resources (e.g., credit) or opportunities (e.g., jobs) unfairly to different social groups; representational harms arise when a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether. Adapting and extending this taxonomy, we categorized the 146 papers’ motivations and techniques into the following categories:

Representational harms: (We grouped several types of representational harms into two categories to reflect that the main point of differentiation between the 146 papers’ motivations and proposed quantitative techniques for measuring or mitigating “bias” is whether or not they focus on stereotyping. Among the papers that do not focus on stereotyping, we found that most lack sufficiently clear motivations and techniques to reliably categorize them further.)

Stereotyping that propagates negative generalizations about particular social groups.

Differences in system performance for different social groups, language that misrepresents the distribution of different social groups in the population, or language that is denigrating to particular social groups.

Questionable correlations between system behavior and features of language that are typically associated with particular social groups.

Vague descriptions of “bias” (or “gender bias” or “racial bias”) or no description at all.

In Table 2 we provide counts for each of the six categories listed above. (We also provide a list of the papers that fall into each category in the appendix.) Again, we note that the counts do not sum to 146, because some papers state multiple motivations, propose multiple techniques, or propose a single technique for measuring or mitigating multiple harms. Table 3, which is in the appendix, contains examples of the papers’ motivations and techniques across a range of different NLP tasks.
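A small sketch may help show why per-category counts can exceed the number of papers when a paper is assigned to more than one category. The paper identifiers and category labels below are hypothetical placeholders, not entries from Table 2.

```python
from collections import Counter


def category_counts(paper_categories):
    """Count how many papers fall into each category.

    paper_categories -- dict mapping a paper id to the categories assigned to it;
                        a paper may carry several categories (hypothetical data).
    """
    counts = Counter()
    for categories in paper_categories.values():
        # Use a set so a paper contributes at most once per category.
        counts.update(set(categories))
    return counts


# Three hypothetical papers, one of which states two motivations.
papers = {
    "paper_1": ["stereotyping", "allocational"],
    "paper_2": ["vague"],
    "paper_3": ["stereotyping"],
}

counts = category_counts(papers)
total_assignments = sum(counts.values())
```

Here three papers produce four category assignments, so the column total is larger than the number of papers even though no paper is counted twice within a category.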

Findings

Categorizing the 146 papers’ motivations and proposed quantitative techniques for measuring or mitigating “bias” into the six categories listed above enabled us to identify several commonalities, which we present below, along with illustrative quotes.

We found that the papers’ motivations span all six categories, with several papers falling into each one. Appropriately, papers that provide surveys or frameworks for analyzing “bias” in NLP systems often state multiple motivations (e.g., Hovy and Spruit, 2016; Bender, 2019; Sun et al., 2019; Rozado, 2020; Shah et al., 2020). However, as the examples in Table 3 (in the appendix) illustrate, many other papers (33%) do so as well. Some papers (16%) state only vague motivations or no motivations at all. For example,

“[N]o human should be discriminated on the basis of demographic attributes by an NLP system.” —Kaneko and Bollegala (2019)

“[P]rominent word embeddings […] encode systematic biases against women and black people […] implicating many NLP systems in scaling up social injustice.” —May et al. (2019)

These examples leave unstated what it might mean for an NLP system to “discriminate,” what constitutes “systematic biases,” or how NLP systems contribute to “social injustice” (itself undefined).

Papers’ motivations sometimes include no normative reasoning.

We found that some papers (32%) are not motivated by any apparent normative concerns, often focusing instead on concerns about system performance. For example, the first quote below includes normative reasoning—namely that models should not use demographic information to make predictions—while the other focuses on learned correlations impairing system performance.

“In [text classification], models are expected to make predictions with the semantic information rather than with the demographic group identity information (e.g., ‘gay’, ‘black’) contained in the sentences.” —Zhang et al. (2020a)

“An over-prevalence of some gendered forms in the training data leads to translations with identifiable errors. Translations are better for sentences involving men and for sentences containing stereotypical gender roles.” —Saunders and Byrne (2020)

We found that even papers with clear motivations often fail to explain what kinds of system behaviors are harmful, in what ways, to whom, and why. For example,

“Deploying these word embedding algorithms in practice, for example in automated translation systems or as hiring aids, runs the serious risk of perpetuating problematic biases in important societal contexts.” —Brunet et al. (2019)

“[I]f the systems show discriminatory behaviors in the interactions, the user experience will be adversely affected.” —Liu et al. (2019)

These examples leave unstated what “problematic biases” or non-ideal user experiences might look like, how the system behaviors might result in these things, and who the relevant stakeholders or users might be. In contrast, we find that papers that provide surveys or frameworks for analyzing “bias” in NLP systems often name who is harmed, acknowledging that different social groups may experience these systems differently due to their different relationships with NLP systems or different social positions. For example, Ruane et al. (2019) argue for a “deep understanding of the user groups [sic] characteristics, contexts, and interests” when designing conversational agents.

Papers about NLP systems developed for the same task often conceptualize “bias” differently.

Even papers that cover the same NLP task often conceptualize “bias” in ways that differ substantially and are sometimes inconsistent. Rows 3 and 4 of Table 3 (in the appendix) contain machine translation papers with different conceptualizations of “bias,” leading to different proposed techniques, while rows 5 and 6 contain papers on “bias” in embedding spaces that state different motivations, but propose techniques for quantifying stereotyping.

Papers’ motivations conflate allocational and representational harms.

We found that the papers’ motivations sometimes (16%) name immediate representational harms, such as stereotyping, alongside more distant allocational harms, which, in the case of stereotyping, are usually imagined as downstream effects of stereotypes on résumé filtering. Many of these papers use the imagined downstream effects to justify focusing on particular system behaviors, even when the downstream effects are not measured. Papers on “bias” in embedding spaces are especially likely to do this because embeddings are often used as input to other systems:

“However, none of these papers [on embeddings] have recognized how blatantly sexist the embeddings are and hence risk introducing biases of various types into real-world systems.” —Bolukbasi et al. (2016a)

“It is essential to quantify and mitigate gender bias in these embeddings to avoid them from affecting downstream applications.” —Zhou et al. (2019)

In contrast, papers that provide surveys or frameworks for analyzing “bias” in NLP systems treat representational harms as harmful in their own right. For example, Mayfield et al. (2019) and Ruane et al. (2019) cite the harmful reproduction of dominant linguistic norms by NLP systems (a point to which we return in section 4), while Bender (2019) outlines a range of harms, including seeing stereotypes in search results and being made invisible to search engines due to language practices.

Techniques

Perhaps unsurprisingly given that the papers’ motivations are often vague, inconsistent, and lacking in normative reasoning, we also found that the papers’ proposed quantitative techniques for measuring or mitigating “bias” do not effectively engage with the relevant literature outside of NLP. Papers on stereotyping are a notable exception: the Word Embedding Association Test Caliskan et al. (2017) draws on the Implicit Association Test Greenwald et al. (1998) from the social psychology literature, while several techniques operationalize the well-studied “Angry Black Woman” stereotype Kiritchenko and Mohammad (2018); May et al. (2019); Tan and Celis (2019) and the “double bind” faced by women May et al. (2019); Tan and Celis (2019), in which women who succeed at stereotypically male tasks are perceived to be less likable than similarly successful men Heilman et al. (2004). Tan and Celis (2019) also examine the compounding effects of race and gender, drawing on Black feminist scholarship on intersectionality Crenshaw (1989).

Papers’ techniques are poorly matched to their motivations.

We found that although 21% of the papers include allocational harms in their motivations, only four papers actually propose techniques for measuring or mitigating allocational harms.

Papers focus on a narrow range of potential sources of “bias.”

We found that nearly all of the papers focus on system predictions as the potential sources of “bias,” with many additionally focusing on “bias” in datasets (e.g., differences in the number of gendered pronouns in the training data Zhao et al. (2019)). Most papers do not interrogate the normative implications of other decisions made during the development and deployment lifecycle—perhaps unsurprising given that their motivations sometimes include no normative reasoning. A few papers are exceptions, illustrating the impacts of task definitions, annotation guidelines, and evaluation metrics: Cao and Daumé (2019) study how folk conceptions of gender Keyes (2018) are reproduced in coreference resolution systems that assume a strict gender dichotomy, thereby maintaining cisnormativity; Sap et al. (2019) focus on the effect of priming annotators with information about possible dialectal differences when asking them to apply toxicity labels to sample tweets, finding that annotators who are primed are significantly less likely to label tweets containing features associated with African-American English as offensive.

A path forward

We now describe how researchers and practitioners conducting work analyzing “bias” in NLP systems might avoid the pitfalls presented in the previous section—the beginnings of a path forward. We propose three recommendations that should guide such work, and, for each, provide several concrete research questions. We emphasize that these questions are not comprehensive, and are intended to generate further questions and lines of engagement.

Our three recommendations are as follows:

(R1) Ground work analyzing “bias” in NLP systems in the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Treat representational harms as harmful in their own right.

(R2) Provide explicit statements of why the system behaviors that are described as “bias” are harmful, in what ways, and to whom. Be forthright about the normative reasoning (Green, 2019) underlying these statements.

(R3) Examine language use in practice by engaging with the lived experiences of members of communities affected by NLP systems. Interrogate and reimagine the power relations between technologists and such communities.

Turning first to (R1), we argue that work analyzing “bias” in NLP systems will paint a much fuller picture if it engages with the relevant literature outside of NLP that explores the relationships between language and social hierarchies. Many disciplines, including sociolinguistics, linguistic anthropology, sociology, and social psychology, study how language takes on social meaning and the role that language plays in maintaining social hierarchies. For example, language is the means through which social groups are labeled and one way that beliefs about social groups are transmitted (e.g., Maass, 1999; Beukeboom and Burgers, 2019). Group labels can serve as the basis of stereotypes and thus reinforce social inequalities: “[T]he label content functions to identify a given category of people, and thereby conveys category boundaries and a position in a hierarchical taxonomy” Beukeboom and Burgers (2019). Similarly, “controlling images,” such as stereotypes of Black women, which are linguistically and visually transmitted through literature, news media, television, and so forth, provide “ideological justification” for their continued oppression (Collins, 2000, Chapter 4).

As a result, many groups have sought to bring about social changes through changes in language, disrupting patterns of oppression and marginalization via so-called “gender-fair” language Sczesny et al. (2016); Menegatti and Rubini (2017), language that is more inclusive to people with disabilities ADA (2018), and language that is less dehumanizing (e.g., abandoning the use of the term “illegal” in everyday discourse on immigration in the U.S. Rosa (2019)). The fact that group labels are so contested is evidence of how deeply intertwined language and social hierarchies are. Taking “gender-fair” language as an example, the hope is that reducing asymmetries in language about women and men will reduce asymmetries in their social standing. Meanwhile, struggles over language use often arise from dominant social groups’ desire to “control both material and symbolic resources”—i.e., “the right to decide what words will mean and to control those meanings”—as was the case in some white speakers’ insistence on using offensive place names against the objections of Indigenous speakers (Hill, 2008, Chapter 3).

Sociolinguists and linguistic anthropologists have also examined language attitudes and language ideologies, or people’s metalinguistic beliefs about language: Which language varieties or practices are taken as standard, ordinary, or unmarked? Which are considered correct, prestigious, or appropriate for public use, and which are considered incorrect, uneducated, or offensive (e.g., Campbell-Kibler, 2009; Preston, 2009; Loudermilk, 2015; Lanehart and Malik, 2018)? Which are rendered invisible (Roche, 2019)? (Language ideologies encompass much more than this; see, e.g., Lippi-Green (2012), Alim et al. (2016), Rosa and Flores (2017), Rosa and Burdick (2017), and Charity Hudley (2017).) Language ideologies play a vital role in reinforcing and justifying social hierarchies because beliefs about language varieties or practices often translate into beliefs about their speakers (e.g., Alim et al., 2016; Rosa and Flores, 2017; Craft et al., 2020). For example, in the U.S., the portrayal of non-white speakers’ language varieties and practices as linguistically deficient helped to justify violent European colonialism, and today continues to justify enduring racial hierarchies by maintaining views of non-white speakers as lacking the language “required for complex thinking processes and successful engagement in the global economy” (Rosa and Flores, 2017).

Recognizing the role that language plays in maintaining social hierarchies is critical to the future of work analyzing “bias” in NLP systems. First, it helps to explain why representational harms are harmful in their own right. Second, the complexity of the relationships between language and social hierarchies illustrates why studying “bias” in NLP systems is so challenging, suggesting that researchers and practitioners will need to move beyond existing algorithmic fairness techniques. We argue that work must be grounded in the relevant literature outside of NLP that examines the relationships between language and social hierarchies; without this grounding, researchers and practitioners risk measuring or mitigating only what is convenient to measure or mitigate, rather than what is most normatively concerning.

More specifically, we recommend that work analyzing “bias” in NLP systems be reoriented around the following question: How are social hierarchies, language ideologies, and NLP systems coproduced? This question mirrors Benjamin’s (2020) call to examine how “race and technology are coproduced,” that is, how racial hierarchies, and the ideologies and discourses that maintain them, create and are re-created by technology. We recommend that researchers and practitioners similarly ask how existing social hierarchies and language ideologies drive the development and deployment of NLP systems, and how these systems therefore reproduce these hierarchies and ideologies. As a starting point for reorienting work analyzing “bias” in NLP systems around this question, we provide the following concrete research questions:

How do social hierarchies and language ideologies influence the decisions made during the development and deployment lifecycle? What kinds of NLP systems do these decisions result in, and what kinds do they foreclose?

General assumptions: To which linguistic norms do NLP systems adhere Bender (2019); Ruane et al. (2019)? Which language practices are implicitly assumed to be standard, ordinary, correct, or appropriate?

Task definition: For which speakers are NLP systems (and NLP resources) developed? (See Joshi et al. (2020) for a discussion.) How do task definitions discretize the world? For example, how are social groups delineated when defining demographic attribute prediction tasks (e.g., Koppel et al., 2002; Rosenthal and McKeown, 2011; Nguyen et al., 2013)? What about languages in native language prediction tasks Tetreault et al. (2013)?

Data: How are datasets collected, preprocessed, and labeled or annotated? What are the impacts of annotation guidelines, annotator assumptions and perceptions Olteanu et al. (2019); Sap et al. (2019); Geiger et al. (2020), and annotation aggregation processes Pavlick and Kwiatkowski (2019)?

Evaluation: How are NLP systems evaluated? What are the impacts of evaluation metrics Olteanu et al. (2017)? Are any non-quantitative evaluations performed?

How do NLP systems reproduce or transform language ideologies? Which language varieties or practices come to be deemed good or bad? Might “good” language simply mean language that is easily handled by existing NLP systems? For example, linguistic phenomena arising from many language practices Eisenstein (2013) are described as “noisy text” and often viewed as a target for “normalization.” How do the language ideologies that are reproduced by NLP systems maintain social hierarchies?

Which representational harms are being measured or mitigated? Are these the most normatively concerning harms, or merely those that are well handled by existing algorithmic fairness techniques? Are there other representational harms that might be analyzed?

Conceptualizations of “bias”

Turning now to (R2), we argue that work analyzing “bias” in NLP systems should provide explicit statements of why the system behaviors that are described as “bias” are harmful, in what ways, and to whom, as well as the normative reasoning underlying these statements. In other words, researchers and practitioners should articulate their conceptualizations of “bias.” As we described above, papers often contain descriptions of system behaviors that are understood to be self-evident statements of “bias.” This use of imprecise terminology has led to papers all claiming to analyze “bias” in NLP systems, sometimes even in systems developed for the same task, but with different or even inconsistent conceptualizations of “bias,” and no explanations for these differences.

Yet analyzing “bias” is an inherently normative process—in which some system behaviors are deemed good and others harmful—even if assumptions about what kinds of system behaviors are harmful, in what ways, for whom, and why are not stated. We therefore echo calls by Bardzell and Bardzell (2011), Keyes et al. (2019), and Green (2019) for researchers and practitioners to make their normative reasoning explicit by articulating the social values that underpin their decisions to deem some system behaviors as harmful, no matter how obvious such values appear to be. We further argue that this reasoning should take into account the relationships between language and social hierarchies that we described above. First, these relationships provide a foundation from which to approach the normative reasoning that we recommend making explicit. For example, some system behaviors might be harmful precisely because they maintain social hierarchies. Second, if work analyzing “bias” in NLP systems is reoriented to understand how social hierarchies, language ideologies, and NLP systems are coproduced, then this work will be incomplete if we fail to account for the ways that social hierarchies and language ideologies determine what we mean by “bias” in the first place. As a starting point, we therefore provide the following concrete research questions:

What kinds of system behaviors are described as “bias”? What are their potential sources (e.g., general assumptions, task definition, data)?

In what ways are these system behaviors harmful, to whom are they harmful, and why?

What are the social values (obvious or not) that underpin this conceptualization of “bias”?

Language use in practice

Finally, we turn to (R3). Our perspective, which rests on a greater recognition of the relationships between language and social hierarchies, suggests several directions for examining language use in practice. Here, we focus on two. First, because language is necessarily situated, and because different social groups have different lived experiences due to their different social positions Hanna et al. (2020)—particularly groups at the intersections of multiple axes of oppression—we recommend that researchers and practitioners center work analyzing “bias” in NLP systems around the lived experiences of members of communities affected by these systems. Second, we recommend that the power relations between technologists and such communities be interrogated and reimagined. Researchers have pointed out that algorithmic fairness techniques, by proposing incremental technical mitigations—e.g., collecting new datasets or training better models—maintain these power relations by (a) assuming that automated systems should continue to exist, rather than asking whether they should be built at all, and (b) keeping development and deployment decisions in the hands of technologists Bennett and Keyes (2019); Cifor et al. (2019); Green (2019); Katell et al. (2020).

There are many disciplines for researchers and practitioners to draw on when pursuing these directions. For example, in human–computer interaction, Hamidi et al. (2018) study transgender people’s experiences with automated gender recognition systems in order to uncover how these systems reproduce structures of transgender exclusion by redefining what it means to perform gender “normally.” Value-sensitive design provides a framework for accounting for the values of different stakeholders in the design of technology (e.g., Friedman et al., 2006; Friedman and Hendry, 2019; Le Dantec et al., 2009; Yoo et al., 2019), while participatory design seeks to involve stakeholders in the design process itself (Sanders, 2002; Muller, 2007; Simonsen and Robertson, 2013; DiSalvo et al., 2013). Participatory action research in education (Kemmis, 2006) and in language documentation and reclamation (Junker, 2018) is also relevant. In particular, work on language reclamation to support decolonization and tribal sovereignty (Leonard, 2012) and work in sociolinguistics focusing on developing co-equal research relationships with community members and supporting linguistic justice efforts (e.g., Bucholtz et al., 2014, 2016, 2019) provide examples of more emancipatory relationships with communities. Finally, several workshops and events have begun to explore how to empower stakeholders in the development and deployment of technology (Vaccaro et al., 2019; Givens and Morris, 2020; Sassaman et al., 2020; see also https://participatoryml.github.io/) and how to help researchers and practitioners consider when not to build systems at all (Barocas et al., 2020).

As a starting point for engaging with communities affected by NLP systems, we therefore provide the following concrete research questions:

How do communities become aware of NLP systems? Do they resist them, and if so, how?

What additional costs are borne by communities for whom NLP systems do not work well?

Do NLP systems shift power toward oppressive institutions (e.g., by enabling predictions that communities do not want made, linguistically based unfair allocation of resources or opportunities Rosa and Flores (2017), surveillance, or censorship), or away from such institutions?

Who is involved in the development and deployment of NLP systems? How do decision-making processes maintain power relations between technologists and communities affected by NLP systems? Can these processes be changed to reimagine these relations?

Case study

To illustrate our recommendations, we present a case study covering work on African-American English (AAE). This language variety has had many different names over the years, but is now generally called African-American English (AAE), African-American Vernacular English (AAVE), or African-American Language (AAL) (Green, 2002; Wolfram and Schilling, 2015; Rickford and King, 2016). Work analyzing “bias” in the context of AAE has shown that part-of-speech taggers, language identification systems, and dependency parsers all work less well on text containing features associated with AAE than on text without these features (Jørgensen et al., 2015, 2016; Blodgett et al., 2016, 2018), and that toxicity detection systems score tweets containing features associated with AAE as more offensive than tweets without them (Davidson et al., 2019; Sap et al., 2019).

These papers have been critical for highlighting AAE as a language variety for which existing NLP systems may not work, illustrating their limitations. However, they do not conceptualize “racial bias” in the same way. The first four of these papers simply focus on system performance differences between text containing features associated with AAE and text without these features. In contrast, the last two papers also focus on such system performance differences, but motivate this focus with the following additional reasoning: If tweets containing features associated with AAE are scored as more offensive than tweets without these features, then this might (a) yield negative perceptions of AAE; (b) result in disproportionate removal of tweets containing these features, impeding participation in online platforms and reducing the space available online in which speakers can use AAE freely; and (c) cause AAE speakers to incur additional costs if they have to change their language practices to avoid negative perceptions or tweet removal.

More importantly, none of these papers engage with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies. By failing to engage with this literature—thereby treating AAE simply as one of many non-Penn Treebank varieties of English or perhaps as another challenging domain—work analyzing “bias” in NLP systems in the context of AAE fails to situate these systems in the world. Who are the speakers of AAE? How are they viewed? We argue that AAE as a language variety cannot be separated from its speakers—primarily Black people in the U.S., who experience systemic anti-Black racism—and the language ideologies that reinforce and justify racial hierarchies.

Even after decades of sociolinguistic efforts to legitimize AAE, it continues to be viewed as “bad” English and its speakers continue to be viewed as linguistically inadequate—a view called the deficit perspective Alim et al. (2016); Rosa and Flores (2017). This perspective persists despite demonstrations that AAE is rule-bound and grammatical Mufwene et al. (1998); Green (2002), in addition to ample evidence of its speakers’ linguistic adroitness (e.g., Alim, 2004; Rickford and King, 2016). This perspective belongs to a broader set of raciolinguistic ideologies Rosa and Flores (2017), which also produce allocational harms; speakers of AAE are frequently penalized for not adhering to dominant language practices, including in the education system Alim (2004); Terry et al. (2010), when seeking housing Baugh (2018), and in the judicial system, where their testimony is misunderstood or, worse yet, disbelieved Rickford and King (2016); Jones et al. (2019). These raciolinguistic ideologies position racialized communities as needing linguistic intervention, such as language education programs, in which these and other harms can be reduced if communities accommodate to dominant language practices Rosa and Flores (2017).

In the technology industry, speakers of AAE are often not considered consumers who matter. For example, Benjamin (2019) recounts the experience of an Apple employee who worked on speech recognition for Siri:

“As they worked on different English dialects — Australian, Singaporean, and Indian English — [the employee] asked his boss: ‘What about African American English?’ To this his boss responded: ‘Well, Apple products are for the premium market.’”

The reality, of course, is that speakers of AAE tend not to represent the “premium market” precisely because of institutions and policies that help to maintain racial hierarchies by systematically denying them the opportunities to develop wealth that are available to white Americans Rothstein (2017)—an exclusion that is reproduced in technology by countless decisions like the one described above.

Engaging with the literature outlined above situates the system behaviors that are described as “bias,” providing a foundation for normative reasoning. Researchers and practitioners should be concerned about “racial bias” in toxicity detection systems not only because performance differences impair system performance, but because they reproduce longstanding injustices of stigmatization and disenfranchisement for speakers of AAE. In re-stigmatizing AAE, they reproduce language ideologies in which AAE is viewed as ungrammatical, uneducated, and offensive. These ideologies, in turn, enable linguistic discrimination and justify enduring racial hierarchies Rosa and Flores (2017). Our perspective, which understands racial hierarchies and raciolinguistic ideologies as structural conditions that govern the development and deployment of technology, implies that techniques for measuring or mitigating “bias” in NLP systems will necessarily be incomplete unless they interrogate and dismantle these structural conditions, including the power relations between technologists and racialized communities.

We emphasize that engaging with the literature on AAE, racial hierarchies in the U.S., and raciolinguistic ideologies can generate new lines of engagement. These lines include work on the ways that the decisions made during the development and deployment of NLP systems produce stigmatization and disenfranchisement, and work on AAE use in practice, such as the ways that speakers of AAE interact with NLP systems that were not designed for them. This literature can also help researchers and practitioners address the allocational harms that may be produced by NLP systems, and ensure that even well-intentioned NLP systems do not position racialized communities as needing linguistic intervention or accommodation to dominant language practices. Finally, researchers and practitioners wishing to design better systems can also draw on a growing body of work on anti-racist language pedagogy that challenges the deficit perspective of AAE and other racialized language practices (e.g., Flores and Chaparro, 2018; Baker-Bell, 2019; Martínez and Mejía, 2019), as well as the work that we described in section 4.3 on reimagining the power relations between technologists and communities affected by technology.

Conclusion

By surveying 146 papers analyzing “bias” in NLP systems, we found that (a) their motivations are often vague, inconsistent, and lacking in normative reasoning; and (b) their proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. To help researchers and practitioners avoid these pitfalls, we proposed three recommendations that should guide work analyzing “bias” in NLP systems, and, for each, provided several concrete research questions. These recommendations rest on a greater recognition of the relationships between language and social hierarchies—a step that we see as paramount to establishing a path forward.

Acknowledgments

This paper is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1451512. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank the reviewers for their useful feedback, especially the suggestion to include additional details about our method.

Appendix A

In Table 3, we provide examples of the papers’ motivations and techniques across several NLP tasks.

In this section, we provide some additional details about our method—specifically, our categorization.

We considered a paper to cover a given NLP task if it analyzed “bias” with respect to that task, but not if it only evaluated overall performance on that task. For example, a paper examining the impact of mitigating “bias” in word embeddings on “bias” in sentiment analysis would be counted as covering both NLP tasks. In contrast, a paper assessing whether performance on sentiment analysis degraded after mitigating “bias” in word embeddings would be counted only as focusing on embeddings.

What counts as a motivation?

We considered a motivation to include any description of the problem that motivated the paper or proposed quantitative technique, including any normative reasoning.

We excluded from the “Vague/unstated” category of motivations the papers that participated in the Gendered Ambiguous Pronoun (GAP) Shared Task at the First ACL Workshop on Gender Bias in NLP. In an ideal world, shared task papers would engage with “bias” more critically, but given the nature of shared tasks it is understandable that they do not. As a result, we excluded them from our counts for techniques as well. We cite the papers here; most propose techniques we would have categorized as “Questionable correlations,” with a few as “Other representational harms” Abzaliev (2019); Attree (2019); Bao and Qiao (2019); Chada (2019); Ionita et al. (2019); Liu (2019); Lois et al. (2019); Wang (2019); Xu and Yang (2019); Yang et al. (2019).

We excluded Dabas et al. (2020) from our survey because we could not determine what this paper’s user study on fairness was actually measuring.

Finally, we actually categorized the motivation for Liu et al. (2019) (i.e., the last row in Table 3) as “Questionable correlations” due to a sentence elsewhere in the paper; had the paragraph we quoted been presented without more detail, we would have categorized the motivation as “Vague/unstated.”

A.2 Full categorization: Motivations

Allocational harms

Hovy and Spruit (2016); Caliskan et al. (2017); Madnani et al. (2017); Dixon et al. (2018); Kiritchenko and Mohammad (2018); Shen et al. (2018); Zhao et al. (2018b); Bhaskaran and Bhallamudi (2019); Bordia and Bowman (2019); Brunet et al. (2019); Chaloner and Maldonado (2019); De-Arteaga et al. (2019); Dev and Phillips (2019); Font and Costa-jussà (2019); James-Sorenson and Alvarez-Melis (2019); Kurita et al. (2019); Mayfield et al. (2019); Pujari et al. (2019); Romanov et al. (2019); Ruane et al. (2019); Sedoc and Ungar (2019); Sun et al. (2019); Zmigrod et al. (2019); Hutchinson et al. (2020); Papakyriakopoulos et al. (2020); Ravfogel et al. (2020); Strengers et al. (2020); Sweeney and Najafian (2020); Tan et al. (2020); Zhang et al. (2020b).

Stereotyping

Bolukbasi et al. (2016a, b); Caliskan et al. (2017); McCurdy and Serbetçi (2017); Rudinger et al. (2017); Zhao et al. (2017); Curry and Rieser (2018); Díaz et al. (2018); Santana et al. (2018); Sutton et al. (2018); Zhao et al. (2018a, b); Agarwal et al. (2019); Basta et al. (2019); Bhaskaran and Bhallamudi (2019); Bordia and Bowman (2019); Brunet et al. (2019); Cao and Daumé (2019); Chaloner and Maldonado (2019); Cho et al. (2019); Dev and Phillips (2019); Font and Costa-jussà (2019); Gonen and Goldberg (2019); James-Sorenson and Alvarez-Melis (2019); Kaneko and Bollegala (2019); Karve et al. (2019); Kurita et al. (2019); Lauscher and Glavaš (2019); Lee et al. (2019); Manzini et al. (2019); Mayfield et al. (2019); Précenth (2019); Pujari et al. (2019); Ruane et al. (2019); Stanovsky et al. (2019); Sun et al. (2019); Tan and Celis (2019); Webster et al. (2019); Zmigrod et al. (2019); Gyamfi et al. (2020); Hube et al. (2020); Hutchinson et al. (2020); Kim et al. (2020); Nadeem et al. (2020); Papakyriakopoulos et al. (2020); Ravfogel et al. (2020); Rozado (2020); Sen and Ganguly (2020); Shin et al. (2020); Strengers et al. (2020).

Other representational harms

Hovy and Søgaard (2015); Blodgett et al. (2016); Bolukbasi et al. (2016b); Hovy and Spruit (2016); Blodgett and O’Connor (2017); Larson (2017); Schnoebelen (2017); Blodgett et al. (2018); Curry and Rieser (2018); Díaz et al. (2018); Dixon et al. (2018); Kiritchenko and Mohammad (2018); Park et al. (2018); Shen et al. (2018); Thelwall (2018); Zhao et al. (2018b); Badjatiya et al. (2019); Bagdasaryan et al. (2019); Bamman et al. (2019); Cao and Daumé (2019); Chaloner and Maldonado (2019); Cho et al. (2019); Davidson et al. (2019); De-Arteaga et al. (2019); Fisher (2019); Font and Costa-jussà (2019); Garimella et al. (2019); Loukina et al. (2019); Mayfield et al. (2019); Mehrabi et al. (2019); Nozza et al. (2019); Prabhakaran et al. (2019); Romanov et al. (2019); Ruane et al. (2019); Sap et al. (2019); Sheng et al. (2019); Sun et al. (2019); Sweeney and Najafian (2019); Vaidya et al. (2019); Gaut et al. (2020); Gencoglu (2020); Hovy et al. (2020); Hutchinson et al. (2020); Kim et al. (2020); Peng et al. (2020); Rios (2020); Sap et al. (2020); Shah et al. (2020); Sheng et al. (2020); Tan et al. (2020); Zhang et al. (2020a, b).

Questionable correlations

Jørgensen et al. (2015); Hovy and Spruit (2016); Madnani et al. (2017); Rudinger et al. (2017); Zhao et al. (2017); Burns et al. (2018); Dixon et al. (2018); Kiritchenko and Mohammad (2018); Lu et al. (2018); Park et al. (2018); Shen et al. (2018); Zhang et al. (2018); Badjatiya et al. (2019); Bhargava and Forsyth (2019); Cao and Daumé (2019); Cho et al. (2019); Davidson et al. (2019); Dev et al. (2019); Garimella et al. (2019); Garg et al. (2019); Huang et al. (2019); James-Sorenson and Alvarez-Melis (2019); Kaneko and Bollegala (2019); Liu et al. (2019); Karve et al. (2019); Nozza et al. (2019); Prabhakaran et al. (2019); Romanov et al. (2019); Sap et al. (2019); Sedoc and Ungar (2019); Stanovsky et al. (2019); Sweeney and Najafian (2019); Vaidya et al. (2019); Zhiltsova et al. (2019); Chopra et al. (2020); Gonen and Webster (2020); Gyamfi et al. (2020); Hube et al. (2020); Ravfogel et al. (2020); Rios (2020); Ross et al. (2020); Saunders and Byrne (2020); Sen and Ganguly (2020); Shah et al. (2020); Sweeney and Najafian (2020); Yang and Feng (2020); Zhang et al. (2020a).

Vague/unstated

Rudinger et al. (2018); Webster et al. (2018); Dinan et al. (2019); Florez (2019); Jumelet et al. (2019); Lauscher et al. (2019); Liang et al. (2019); Maudslay et al. (2019); May et al. (2019); Prates et al. (2019); Prost et al. (2019); Qian et al. (2019); Swinger et al. (2019); Zhao et al. (2019); Zhou et al. (2019); Ethayarajh (2020); Huang et al. (2020); Jia et al. (2020); Popović et al. (2020); Pryzant et al. (2020); Vig et al. (2020); Wang et al. (2020); Zhao et al. (2020).

Surveys, frameworks, and meta-analyses

Hovy and Spruit (2016); Larson (2017); McCurdy and Serbetçi (2017); Schnoebelen (2017); Basta et al. (2019); Ethayarajh et al. (2019); Gonen and Goldberg (2019); Lauscher and Glavaš (2019); Loukina et al. (2019); Mayfield et al. (2019); Mirzaev et al. (2019); Prabhumoye et al. (2019); Ruane et al. (2019); Sedoc and Ungar (2019); Sun et al. (2019); Nissim et al. (2020); Rozado (2020); Shah et al. (2020); Strengers et al. (2020); Wright et al. (2020).

Appendix B Full categorization: Techniques

Allocational harms

De-Arteaga et al. (2019); Prost et al. (2019); Romanov et al. (2019); Zhao et al. (2020).

Stereotyping

Bolukbasi et al. (2016a, b); Caliskan et al. (2017); McCurdy and Serbetçi (2017); Díaz et al. (2018); Santana et al. (2018); Sutton et al. (2018); Zhang et al. (2018); Zhao et al. (2018a, b); Agarwal et al. (2019); Basta et al. (2019); Bhaskaran and Bhallamudi (2019); Brunet et al. (2019); Cao and Daumé (2019); Chaloner and Maldonado (2019); Dev and Phillips (2019); Ethayarajh et al. (2019); Gonen and Goldberg (2019); James-Sorenson and Alvarez-Melis (2019); Jumelet et al. (2019); Kaneko and Bollegala (2019); Karve et al. (2019); Kurita et al. (2019); Lauscher and Glavaš (2019); Lauscher et al. (2019); Lee et al. (2019); Liang et al. (2019); Liu et al. (2019); Manzini et al. (2019); Maudslay et al. (2019); May et al. (2019); Mirzaev et al. (2019); Prates et al. (2019); Précenth (2019); Prost et al. (2019); Pujari et al. (2019); Qian et al. (2019); Sedoc and Ungar (2019); Stanovsky et al. (2019); Tan and Celis (2019); Zhao et al. (2019); Zhou et al. (2019); Chopra et al. (2020); Gyamfi et al. (2020); Nadeem et al. (2020); Nissim et al. (2020); Papakyriakopoulos et al. (2020); Popović et al. (2020); Ravfogel et al. (2020); Ross et al. (2020); Rozado (2020); Saunders and Byrne (2020); Shin et al. (2020); Vig et al. (2020); Wang et al. (2020); Yang and Feng (2020); Zhao et al. (2020).

Other representational harms

Jørgensen et al. (2015); Hovy and Søgaard (2015); Blodgett et al. (2016); Blodgett and O’Connor (2017); Blodgett et al. (2018); Curry and Rieser (2018); Dixon et al. (2018); Park et al. (2018); Thelwall (2018); Webster et al. (2018); Badjatiya et al. (2019); Bagdasaryan et al. (2019); Bamman et al. (2019); Bhargava and Forsyth (2019); Cao and Daumé (2019); Font and Costa-jussà (2019); Garg et al. (2019); Garimella et al. (2019); Liu et al. (2019); Loukina et al. (2019); Mehrabi et al. (2019); Nozza et al. (2019); Sap et al. (2019); Sheng et al. (2019); Stanovsky et al. (2019); Vaidya et al. (2019); Webster et al. (2019); Ethayarajh (2020); Gaut et al. (2020); Gencoglu (2020); Hovy et al. (2020); Huang et al. (2020); Kim et al. (2020); Peng et al. (2020); Ravfogel et al. (2020); Rios (2020); Sap et al. (2020); Saunders and Byrne (2020); Sheng et al. (2020); Sweeney and Najafian (2020); Tan et al. (2020); Zhang et al. (2020a, b).

Questionable correlations

Jurgens et al. (2017); Madnani et al. (2017); Rudinger et al. (2017); Zhao et al. (2017); Burns et al. (2018); Díaz et al. (2018); Kiritchenko and Mohammad (2018); Lu et al. (2018); Rudinger et al. (2018); Shen et al. (2018); Bordia and Bowman (2019); Cao and Daumé (2019); Cho et al. (2019); Davidson et al. (2019); Dev et al. (2019); Dinan et al. (2019); Fisher (2019); Florez (2019); Font and Costa-jussà (2019); Garg et al. (2019); Huang et al. (2019); Liu et al. (2019); Nozza et al. (2019); Prabhakaran et al. (2019); Qian et al. (2019); Sap et al. (2019); Stanovsky et al. (2019); Sweeney and Najafian (2019); Swinger et al. (2019); Zhiltsova et al. (2019); Zmigrod et al. (2019); Hube et al. (2020); Hutchinson et al. (2020); Jia et al. (2020); Papakyriakopoulos et al. (2020); Popović et al. (2020); Pryzant et al. (2020); Saunders and Byrne (2020); Sen and Ganguly (2020); Shah et al. (2020); Sweeney and Najafian (2020); Zhang et al. (2020b).