CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation

Mark Diaz, Ian D. Kivlichan, Rachel Rosen, Dylan K. Baker, Razvan Amironesei, Vinodkumar Prabhakaran, Emily Denton

Introduction

Human computation refers to the practice of tapping into human intelligence and cognition as computational elements within an information processing system design, often done on a large global scale (Quinn and Bederson, 2011). The sheer scale of human computation that the Web enables has made possible things that were previously unimaginable, e.g., Captchas digitizing the entire NYTimes historical publications, global participatory platforms for human rights and crisis response (https://www.ushahidi.com/), and large-scale data and distributed analyses enabled by citizen science projects (https://www.citizenscience.gov/). In particular, human computation has played a critical role in the research, development, and deployment of modern-day artificial intelligence systems, through the creation of training datasets (Deng et al., 2009) and human-in-the-loop systems (Filippova et al., 2019; Monarch, 2021). By enabling efficient and scalable distribution of data labelling microtasks, crowdsourcing platforms are a natural choice for dataset developers aiming to cheaply and efficiently generate dataset annotations.

In this paper we explore the challenges and decision points inherent to crowdsourced annotation of machine learning datasets and propose a framework, CrowdWorkSheets, for reflecting on dataset annotation decisions and documenting them in a standardized manner. At a high level, CrowdWorkSheets prompts dataset developers to ask: who is annotating the data, and why is that important? We consider how the ethical concerns of data annotation intersect with the identities of the annotators, the social structures surrounding their work, and how their individual perspectives may become encoded within the dataset labels. In doing so, we push back against the prevalent notion that crowdworkers are interchangeable and instead seek to illuminate why they are not. Data generated in crowdwork tasks is shaped by a range of social factors and the datasets that workers help to build continue to shape systems long after worker engagement ends. Processes of annotation thus impact future models built from this data; therefore, understanding the perspectives captured through data labeling is crucial to fully understanding these models and the potential social impact they can have.

Our work is motivated by, and extends, prior scholarship examining ethical considerations relating to crowdsourcing. For instance, Vakharia and Lease (2015) outline various kinds of challenges encountered in this space by analyzing and comparing seven different crowdsourcing platforms. In addition, Schlagwein et al. (2019) conducted extensive fieldwork, engaging crowdworkers, platform organizers, and requesters over the course of three years to uncover a range of ethical dilemmas relating to gig economy crowdsourcing. Shmueli et al. (2021) identified risks of harm to crowdworkers engaged in NLP tasks. Attending to ethical issues more broadly, Kocsis and De Vreede (2016) used value-sensitive design and transparency literature to develop a taxonomic framework of ethical considerations in crowdsourcing.

Our primary contribution is the introduction of CrowdWorkSheets, a novel framework designed to facilitate critical reflection and transparent documentation of dataset annotation decisions, processes, and outcomes. CrowdWorkSheets complements and extends dataset development and documentation frameworks that have previously been developed in service of transparency, accountability, and reproducibility (Gebru et al., 2018; Bender and Friedman, 2018; Holland et al., 2018; Chmielinski et al., 2020; Hutchinson et al., 2021a; Ramírez et al., 2021; Pushkarna et al., 2022), but focuses specifically on unique considerations relating to crowdsourced dataset annotation. Similar to recent dataset documentation frameworks that have been tailored to specific domains (e.g., Srinivasan et al., 2021; Rostamzadeh et al., 2021), our work starts from a recognition of the limitations of “one-size-fits-all” solutions to ethical issues in dataset development. More specifically, we offer CrowdWorkSheets as a targeted intervention to address unresolved ethical problems in crowdsourcing that relate specifically to worker subjectivity and worker experiences.

The remainder of this paper is structured as follows. First, we review literature relating to (1) how annotators’ individual and collective social experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship means for their ability to engage in fair work. Next, we introduce the CrowdWorkSheets considerations and documentation questions. Finally, we step through a hypothetical case study to illustrate how a dataset developer might use CrowdWorkSheets to document their decisions.

Who is annotating ML datasets and why does it matter?

The historical lineage of crowdsourced labor can be traced back to the manufacturing innovation of piecework (Alkhatib et al., 2017), a form of labor that produced the “unskilled worker” as the paradigmatic interchangeable component and which has been credited with giving rise to the productivity and ingenuity of American manufacturing (Freeman, 2018). In an analogous manner, crowdwork platforms are often designed to position crowdworkers as interchangeable (Irani and Silberman, 2013). While some forms of digital work can be decomposed and distributed, the presumption that crowdsourced dataset annotators exercise near-identical capacities of perception and judgement ignores the fact that social position, identity, and experience shape how annotators apply knowledge. Yet, recent empirical work has revealed that dataset annotators are often treated as interchangeable in practice. For example, relatively little attention is given to, and little is documented about, annotator positionality, that is, how annotators’ social identity shapes their understanding of the world (Geiger et al., 2020; Scheuerman et al., 2021). Crowd workers are often selected by task requesters based on quality metrics, rather than on any socially defining features of their knowledge or experience. This is concerning; when crowdsourced annotations are used to build datasets capturing subjective phenomena, such as sentiment or hate speech, annotators’ values and subjective judgments shape the perspectives that machine learning models learn from in a manner that is wholly unaccounted for.

Understanding socio-cultural factors of an annotator pool, or even selecting annotators based on these factors, is important because annotators’ identities and lived experiences can impact how annotation questions are interpreted and responded to. More generally, subjective interpretations of a task can produce divergent annotations across different communities (Sen et al., 2015). As Aroyo and Welty (2015) argue, the notion of “one truth” in crowdsourcing responses is a myth; disagreement between annotators, which is often viewed as problematic noise, can actually provide a valuable signal.

A variety of social, cultural, economic, and infrastructural factors contribute to the sociodemographic distribution of workers on any given platform. For example, as Gray and Suri (2019) point out, the remote nature of crowdwork differentially attracts workers along gender lines, such as mothers who do crowdwork because it allows for an easier balance of childcare in comparison to other work. Other work similarly notes significant gender differences among workers who report engaging in crowdwork because they are only able to conduct work from their homes (Berg, 2015). This leads to a different gender balance among crowdworkers in the United States than in many other parts of the world; crowdworkers in most of the world are disproportionately male, while in contrast over 60% of U.S. annotators are female (Posch et al., 2018). Ipeirotis (2010) hypothesizes this to be due to the remote nature of the work, which attracts stay-at-home parents and unemployed or underemployed adults, who are more likely to be women. Additionally, health problems and disability are factors that leave many workers able to work only from home and motivate them to pursue crowd work (Berg, 2015).

Since many crowdsourced annotator pools are sociodemographically skewed, there are implications for which populations and cultural values are represented in datasets and models (Ghosh et al., 2021) as well as which populations face the challenges of crowdwork (Irani and Silberman, 2013; Gray and Suri, 2019). Accounting for skews in annotator demographics is critical for contextualizing datasets and ensuring responsible downstream use. In short, there is value in acknowledging, and accounting for, workers’ socio-cultural backgrounds, both from the perspective of data quality and from that of societal impact.

2. Lived experiences of annotators as expertise

Just as substantive work experience lends valuable domain expertise for a given problem (e.g., annotation of medical imagery by a medical professional), lived experience with, and proximity to, a problem domain can provide a valuable source of expertise for dataset annotation. For example, women experience higher rates of sexual harassment online compared to men, and among those who have experienced online abuse, women are more likely to identify it as such (Vogels, 2021). This underscores the importance of considering raters’ experience with gender-based harassment when using crowdwork to annotate or moderate online harassment. Recent work has highlighted how the “average” rater, in terms of gender and other social characteristics, varies dramatically depending on which geographies raters are selected from (Posch et al., 2018). Additionally, as a result of the previously mentioned sociodemographic differences among who is likely to conduct crowdwork, ratings on sexual harassment data, for example, may differ according to the geographic distribution of raters.

At the same time, relevant lived experience among annotators does not always fall along demographic lines. Waseem (2016) demonstrated that incorporating feminist and antiracist activists’ perspectives into hate speech annotations yielded better aligned models. Similarly, Patton et al. (2019) demonstrated the importance of situated domain expertise—including contextualized knowledge of local language, concepts, and gang activity—when annotating Twitter images to detect pathways to violence among gang-involved youth in Chicago. They found that expert annotations (i.e., those from individuals situated in Chicago and with community ties) significantly diverged from those of graduate students who were scholars of social work and who were trained to perform the annotation task but who lacked this lived experience.

In summary, a core question to answer in data collection is how much annotators’ identity, lived experience, and prior knowledge of a problem space matter for the task at hand, and how they impact what the resulting dataset is intended to capture. While the aforementioned examples constitute relatively subjective tasks, even seemingly objective tasks such as annotating medical text vary surprisingly with annotator backgrounds and experience. Aroyo and Welty (2015) show that medical experts are more likely than non-experts to erroneously identify medical relations as being expressed in text, because the experts already know the relation is true based on knowledge external to the task. Their work underscores a need to examine annotator experience even in tasks that appear to be unambiguous or objective.

Worker Experiences of Dataset Annotation

Another series of considerations is rooted in annotators’ experiences with annotation work itself and how those experiences impact how they do their work. These include issues related to worker compensation, power imbalances between worker and requester, and the structure of annotation work itself, all of which can pose barriers to crowdworker well-being and their ability to produce quality work.

Compensation policies of crowdwork platforms should be a core aspect to consider when thinking about responsible data collection. For instance, in the U.S., there are currently no regulations around worker pay for crowdwork (Berg, 2015), and the Fair Labor Standards Act, which established the minimum wage (https://www.dol.gov/agencies/whd/flsa), does not apply to crowdworkers, as they are independent contractors (Semuels, 2018). Reports on how much crowdworkers actually earn vary, but generally show an average lower than minimum wage (Irani and Silberman, 2013); surveys of workers from Amazon Mechanical Turk and Crowdflower place it on average between $1 and $5.5 per hour (Berg, 2015) with a median wage of roughly $2 / hour (Hara et al., 2018; Semuels, 2018); only a small fraction of workers (4%) earn more than $7.25 / hour (Hara et al., 2018).

Recent research has also identified how crowdworking platforms often necessitate various kinds of unpaid labor from crowdworkers, which reduces overall wages.

For example, one report found that for every hour of paid work, workers spend another 18 minutes on unpaid work, including searching for tasks (Berg, 2015). Another recent study found that once daily invisible labor was accounted for, the median hourly wage for crowdworkers on Amazon Mechanical Turk dropped from $3.76 to $2.83 (Toxtli et al., 2021).
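The effect of unpaid time on wages is simple arithmetic, and a short sketch may make it concrete. The function below divides earnings by total time, paid plus unpaid. The $2.00 nominal wage is a hypothetical input chosen for illustration (close to the median reported above), combined with the 18 unpaid minutes per paid hour reported by Berg (2015); the function name and its parameters are our own and do not come from any of the cited studies.

```python
def effective_hourly_wage(paid_earnings, paid_minutes, unpaid_minutes):
    """Earnings divided by total (paid + unpaid) time, in dollars per hour."""
    total_hours = (paid_minutes + unpaid_minutes) / 60.0
    if total_hours <= 0:
        raise ValueError("total time must be positive")
    return paid_earnings / total_hours


# Hypothetical worker: $2.00 earned per paid hour, plus 18 unpaid minutes
# (task search, etc.) for each paid hour.
nominal = 2.00
effective = effective_hourly_wage(paid_earnings=nominal,
                                  paid_minutes=60,
                                  unpaid_minutes=18)
print(f"nominal ${nominal:.2f}/h, effective ${effective:.2f}/h")
```

Under these assumptions the effective wage is roughly $1.54 per hour, about 23% below the nominal figure. This calculation is not how Toxtli et al. (2021) derived their medians; it only illustrates the direction and rough size of the effect.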

Workers often invest significant labor outside the platform itself to find tasks, relying on web browser extensions and participating in crowd work forums (Hanrahan et al., 2021; Kaplan et al., 2018). Time spent working is compounded by competition from other crowdworkers (Semuels, 2018), which can pressure workers to be constantly available to look for work (Berg, 2015).

The working conditions of crowdworkers are characterized by long working hours, partially as a result of this competition. As Berg (2015) notes, this conflicts with the work flexibility that motivates many workers to choose crowd work.

Worker psychological safety is a particular area of concern. Crowd workers who work on content moderation of user generated content often need to look at content that includes violent imagery or sexual and pornographic content (Roberts, 2016), or to transcribe conversations about trafficking children into sexual slavery (Emerson, 2019). In many cases, it is impossible to ascertain that a job may contain such content (Emerson, 2019). If crowdworkers find themselves upset or disturbed by this content, they have little recourse; often, workers need to sign non-disclosure agreements preventing them from talking to anyone about the awful things they must look at, even for support (Roberts, 2016). Additionally, raising concerns to their employers is quite difficult; both bureaucracy and physical distance (many of these workers are in the Global South) prohibit any direct lines of feedback or complaints. There is research available on the long-term impacts of viewing harmful user-generated content, but it is difficult to assess the full harm this causes to workers’ well-being (Roberts, 2016).

2. Power dynamics

Power dynamics between requesters and annotators are another major challenge. Annotators are often heavily distanced from those leading the development of datasets and requesting tasks, which can obfuscate working conditions. Top-down organizational structures often result in workers viewing requesters as more informed, as they are the ones who provided the data and the label schema (Miceli et al., 2020). Hence, instead of resolving ambiguities, workers are more likely to try to judge from the standpoint of the requester, often with limited exposure to the goals of the annotation. This contributes to the portability trap (Selbst et al., 2019): a “failure to understand how repurposing algorithmic solutions designed for one social context may be misleading, inaccurate, or otherwise do harm when applied to a different context.”

Power dynamics are also at play in the rejection of work: a large majority of crowdworkers (94%, per Berg, 2015) have had work that was rejected or for which they were not paid. Yet, some platforms give requesters full rights over the data they receive, regardless of whether they accept or reject it, and workers have no way of taking legal action if requesters use rejected work anyway (Irani and Silberman, 2013); Roberts (2016) describes this system as one that “enables wage theft”. Moreover, rejecting work and withholding pay is painful because rejections are often caused by unclear instructions and the lack of meaningful feedback channels. Many crowdworkers report that poor communication negatively affects their work (Berg, 2015). Requesters also get to choose whether the work is up to their standards before choosing whether to pay for it, even though rejections are often caused by unclear instructions and the very lack of feedback channels they refuse to provide (Berg, 2015). Workers also feel powerless to speak up about perceived injustices from requesters or the platform; Amazon Mechanical Turk (AMT) users have reportedly had their accounts suspended for speaking negatively about Amazon (Semuels, 2018). Additionally, requesters can block users who offer them feedback without consequence (Berg, 2015).

Power asymmetries also reflect global power dynamics. For instance, since technology development happens primarily in the West, human computation from the Global South is often relegated to the margins (Sambasivan et al., 2021). In particular, Sambasivan et al. (2021) point out that the technical, social, ethical, and physical distance between the builders of a technology and the communities it is meant to serve is large in such settings. Bott and Young (2012) have pointed out the potential of crowdsourcing to revolutionize civic participation in many developing countries to address complex challenges in governance around global issues such as climate change, poverty, armed conflict, and other crises. They also point out the challenges when it comes to employing crowdsourced interventions on the ground in the Global South. They note that systemic disparities endemic to local contexts are often reflected in who is represented in the crowd; for instance, the digital crowd in the Global South tends to over-represent the elite, educated, young males who belong to the upper tiers of local social hierarchies.

Roberts (2016) compares commercial content moderation work to the practice of developed nations offloading their hazardous e-waste refuse on countries in the Global South. Her interviewees characterized this work as “akin to being immersed in ‘a cesspool’ – feeling that they are within a pit of toxic matter and waste day in and day out”. The metaphor goes further in highlighting the fact that digital content moderation, when outsourced to countries in the Global South, serves to keep the digital refuse away from the field of vision of those in the Global North who are responsible for its existence, and for whom it was intended, in much the same way that the rotting garbage and e-waste produced in the Global North are kept away.

On the other hand, some platforms have geographical blocking, which many non-Americans find problematic (Berg, 2015) since it can be used to exclude them. This reinforces the dynamic where requesters in the United States get to decide which global perspectives they want to consider for their task, and which they want to disregard.

The anonymous and geographically distributed nature of crowdsourced annotation work imposes significant barriers to collective action on the part of dataset annotators. While some platforms offer communication spaces, such as discussion forums, for workers to communicate with one another, these platform-moderated spaces have been shown to be ineffective at supporting labor organizing or worker power (Gerber, 2021). In response, several tools have been developed independently from crowdwork platforms to support crowd workers.

For example, TurkerNation, Turk Alert, MTurkGrind, and Reddit’s /r/HITsWorthTurkingFor offer online forums for AMT workers to share information about well-paying work and share experiences with different requesters. Turkopticon (Irani and Silberman, 2013; tur, 2008) is a browser add-on that enables AMT workers to review and report requesters and view reviews from other workers. These tools can help workers overcome the information asymmetries built into the AMT platform (Martin et al., 2014). Dynamo is another community platform designed specifically to support and enable collective action for AMT workers, creating “unities without unions” (Salehi et al., 2015). A 2015 study of the platform found that twenty-two ideas for action had been generated and two active campaigns had been initiated.

In summary, responsible data annotation requires careful consideration of the power dynamics that structure the working relationship between requesters, annotators, and the platforms.

CrowdWorkSheets: A Documentation Framework for Crowdsourced Dataset Annotation

We now introduce our framework, CrowdWorkSheets, which outlines a series of considerations designed to guide the collection, use, and dissemination of crowd-sourced annotations and questions designed to elicit information about various decisions and outcomes. We have decomposed the framework into sections based on different parts of a typical dataset construction pipeline, from the formulation of tasks to dissemination of datasets.

First, we must ask: what are we asking annotators to do? Our considerations and documentation questions focus on many aspects of task formulation including which assumptions we make about annotators, how we handle ambiguity and subjectivity within our task, and how our task is ultimately framed and communicated.

While some tasks tend to pose objective questions with a correct answer (is there a human face in an image?), oftentimes datasets aim to capture judgement on relatively subjective tasks with no universally correct answer (is this piece of text offensive?). Moreover, even seemingly objective tasks can still be rife with ambiguity or corner-cases and ultimately require subjective judgements to be made on the part of annotators. As such, it is important to consider how questions afford varied interpretations or may require subjective judgements on the part of the annotators. Clarifying such aspects of an annotation task is critical to ensuring a resulting dataset captures the aspects of human intelligence it is meant to capture. Moreover, as discussed in Section 3, a survey of crowdworkers on AMT found that many instances of work rejection were due to unclear instructions (Berg, 2015).

While we discuss the nuances of annotator selection in greater depth in Section 4.2, tasks should be formulated based on considerations regarding who will be annotating data and what perspectives should (or should not) be included. Determinations should be tied to the purpose of dataset creation and the downstream use cases it is meant to serve, rather than what is convenient, efficient, or scalable. Some tasks may benefit from being informed by the annotators’ lived experiences and thus may be designed to explicitly seek out such expertise. On the other hand, a dataset developer may want to frame task instructions so as to restrict the annotators from relying on their lived experiences, e.g. for a dataset meant to capture a set of policies defined by a platform.

Finally, when formulating a task, it is important to consider how much information to disclose to annotators about the task in advance. Some information may be essential to disclose in order to enable annotators to make an informed decision regarding whether or not to accept the task. For example, disclosure of how data will be stored, packaged, and potentially published may be particularly important when sociodemographic or other sensitive information about annotators is being requested. Similarly, disclosure of risks relating to psychological harm should be included where appropriate.

Consider the role subjectivity plays in your annotation task. Remember that individuals with different social and cultural backgrounds might differ in their judgements.

Consider the forms of expertise that should be incorporated through data annotation, including both formal disciplinary training and lived experience with the problem domain. Remember that insufficiently capturing this expertise in the annotator pool may carry risks for downstream model usage.

Make sure task instructions are clear and unambiguous in order to prevent annotators from wasting time on a task where their work will be rejected due to misunderstandings. Consider assessing the task instructions in a small-scale setting prior to launching your full annotation task.

Consider the personal information you are collecting from annotators and the potential ethical or privacy risks that may accompany such collection.

Consider the amount of information you disclose to annotators prior to engagement with the task and ensure annotators have an opportunity to make informed decisions based on any potential risks the task carries.

At a high level, what are the subjective aspects of your task?

What assumptions do you make about annotators?

How did you choose the specific wording of your task instructions? What steps, if any, were taken to verify the clarity of task instructions and wording for annotators?

What, if any, risks did your task pose for annotators and were they informed of the risks prior to engagement with the task?

What are the precise instructions that were provided to annotators?

2. Selecting annotators

Next, we ask: who is annotating the data? While there is no single “correct” way to assemble an annotator pool, the selection of an annotator pool is a highly consequential decision. Since annotators from different communities can produce significantly different annotations given the same task (Sen et al., 2015), it is important to recognize that annotator selection may have a significant impact on the labels of your dataset. With this in mind, it is important to consider the intended use of the datasets—which communities will be most impacted by models built from the data, and which communities could be harmed the most by resulting biases present if they are not represented in the annotator pool?

In some cases, social identities of annotators indicate a form of expertise relevant to our task so it may be prudent to select annotators based on self-identified sociodemographic factors. In other cases, it may be important to select annotators based on other forms of expertise or experience with a problem domain. Understanding one’s desired annotator pool may subsequently impact decisions regarding platform selection, as different platforms offer differing degrees of flexibility to assemble custom annotator pools.

While selecting annotators based on sociodemographic factors may help ensure a dataset reflects perspectives of certain groups, targeted data collection efforts, particularly those oriented towards the inclusion of marginalized groups, are not without risk. For example, Denton et al. (2020) discuss how the mere inclusion of marginalized groups within a dataset, without sufficient attention to broader considerations of data capture and use, can operate as a form of “predatory inclusion” (a term that has been used in other domains to describe modes of inclusion that are extractive and predatory in nature, e.g., Seamster and Charron-Chénier, 2017). Discourses of inclusion can serve to “further rather than subvert vulnerability to what might more broadly be called ‘data violence’” (Hoffmann, 2021). From a privacy perspective, if sociodemographic information is collected and published with a dataset, developers should take extra care to mitigate risks of unintentionally making annotators identifiable.

While there are multiple valid ways to assemble an annotator pool, remember that annotators are not interchangeable, and that the decisions in this stage can heavily impact the final dataset.

Consider the ways in which social identities of annotators may relate to the forms of expertise important for the task.

Consider the intended usage contexts of the dataset, and the marginalized communities therein, when choosing which annotators to prioritize for inclusion.

Consider how labor practices intersect with the choice of who the annotators are. For example: if female annotators make up the majority, as they do in the U.S. (Posch et al., 2018), consider how fair payment, or a lack thereof, could impact this group.

Are there certain perspectives that should be privileged? If so, how did you seek these perspectives out?

Are there certain perspectives that would be harmful to include? If so, how did you screen these perspectives out?

Were sociodemographic characteristics used to select annotators for your task? If so, please detail the process.

If you have any aggregated sociodemographic statistics about your annotator pool, please describe.

Do you have reason to believe that sociodemographic characteristics of annotators may have impacted how they annotated the data? Why or why not?
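When reporting aggregated sociodemographic statistics, the privacy concern raised above (unintentionally making annotators identifiable) can be partly addressed by not publishing very small groups separately. The following is a minimal sketch of that idea, not a procedure prescribed by CrowdWorkSheets: the function name, the pooled category label, the threshold of 5, and the example age bands are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_with_suppression(values, min_cell_size=5):
    """Count self-reported categories, pooling any category with fewer than
    `min_cell_size` annotators into a single 'other / suppressed' bucket."""
    reported = {}
    pooled = 0
    for category, n in Counter(values).items():
        if n >= min_cell_size:
            reported[category] = n
        else:
            pooled += n  # too few annotators to report on their own
    if pooled:
        reported["other / suppressed"] = pooled
    return reported


# Hypothetical self-reported age bands for a pool of 26 annotators.
age_bands = ["18-29"] * 12 + ["30-44"] * 9 + ["45-59"] * 3 + ["60+"] * 2
summary = aggregate_with_suppression(age_bands)
print(summary)
```

Pooling is only a partial safeguard: if a single category is suppressed, or if several published breakdowns can be cross-referenced, small groups may still be inferable, so the appropriate threshold and reporting granularity remain judgement calls for the dataset developer.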

3. Platform and infrastructure choices

Next, we ask, under what conditions are data annotated? As described in Section 3, platform policies around compensation and power asymmetries play a huge role in shaping worker experiences and the quality of work that annotators produce. Different platforms offer different affordances for communication between task requesters and annotators, which might impact the extent to which task requesters can incorporate annotator feedback into the task framing or annotator guidelines. Different platforms also impose different minimum-pay constraints; requesters may want to support platforms that uphold fair pay standards. Additionally, requesters should be mindful of potential differences between legal minimum wages and a living wage (Glasmeier, 2020).

Separately from the platform, task creators should be aware of worker pay per hour; some platforms may only offer requesters the option to select pay per item for an annotation task, and the defaults may be set low. Task creators should take care when estimating work time per item to ensure they are paying workers fairly. Another thing to consider when choosing a platform for data annotation is how well that platform supports rater psychological safety. Some platforms provide more affordances than others for crowdworkers to seek out support if they are experiencing distress, or if they otherwise have questions or feedback for requesters.
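Where a platform only exposes a per-item price, the hourly wage it implies has to be worked out from an estimate of time per item. The sketch below shows one way to do this, assuming completion times are available from a small pilot; the function names, the pilot timings, and the $15.00 hourly target are hypothetical and are not recommendations from any cited source.

```python
import math
from statistics import median


def min_cents_per_item(target_hourly_cents, pilot_seconds):
    """Smallest whole-cent payment per item such that a worker at the
    median pilot pace earns at least the target hourly wage."""
    if not pilot_seconds:
        raise ValueError("need at least one pilot timing")
    typical_seconds = median(pilot_seconds)
    return math.ceil(target_hourly_cents * typical_seconds / 3600)


def implied_hourly_cents(cents_per_item, seconds_per_item):
    """Hourly earnings implied by a per-item price at a given pace."""
    return cents_per_item * 3600 / seconds_per_item


# Hypothetical pilot: seconds taken per item by five annotators.
pilot = [40, 45, 48, 52, 90]
price = min_cents_per_item(target_hourly_cents=1500, pilot_seconds=pilot)
slowest_rate = implied_hourly_cents(price, max(pilot))
print(price, slowest_rate)
```

Pricing at the median pace means that roughly half of annotators earn less than the target; in this example the slowest pilot annotator would earn well under it. A requester who wants the target to hold for most workers could substitute a higher percentile of the pilot timings, and could add time for reading instructions, which this sketch ignores.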

Consider the platform’s underlying annotator pool and the options it provides to source specialized rater pools, and whether it enables you to curate an appropriate pool of annotators (e.g. considering sociodemographic factors or domain expertise).

Consider comparing and contrasting the minimum pay requirements established across different platforms. You may choose to support a platform that upholds fair pay standards.

Consider the extent to which you would like to establish a channel of communication and feedback between your team and the annotators. Platform-mediated channels of communication can give annotators an opportunity to provide feedback on confusing instructions, or otherwise seek out support.

What annotation platform did you utilize?

At a high level, what considerations informed your decision to choose this platform?

Did the chosen platform sufficiently meet the requirements you outlined for annotator pools? Are any aspects not covered?

How much were annotators compensated? Did you consider any particular pay standards when determining their compensation? If so, please describe.

4. Dataset analysis and evaluation

Once data instances are annotated, what do we do with the results? This section focuses on considerations related to the process of converting the “raw” annotations into the labels that are ultimately packaged in a dataset. A common practice in building crowdsourced annotations for discrete labeling tasks is to obtain multiple annotator judgements that are then aggregated (e.g., through majority voting) to obtain a single “ground truth” that is released in the dataset (Sabou et al., 2014). However, the disagreements between annotators may embed valuable nuances about the task (Ovesdotter Alm, 2011; Aroyo and Welty, 2013). Aggregation in such cases may obscure these nuances and potentially exclude perspectives from minority annotators (Prabhakaran et al., 2021). It is thus critical to consider uncertainty and disagreement between annotators, and potentially leverage this as a signal, to avoid losing nuanced and diverse opinions in the aggregation process. It may also be important to analyze how annotators disagree along sociodemographic lines and to share this information with potential users of the dataset, so they can best understand how to represent these diverse perspectives in their use of the data.

Consider including uncertainty or disagreement between annotations on each instance as a signal in the dataset.

Consider analyzing systematic disagreements between annotators of different sociodemographic groups in order to better understand how diverse perspectives are represented.

Consider how the final dataset annotations will relate to individual annotator responses. For instance, one option is to release only the aggregated labels, e.g. through a majority vote. Consider what valuable information might be lost through such aggregation.

How do you define the annotation quality in your context, and how did you assess quality in your dataset?

Have you conducted any analysis on disagreement patterns? If so, what analyses did you use and what were the major findings?

Did you analyze potential sources of disagreement?

How do the individual annotator responses relate to the final labels released in the dataset?
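One way to keep disagreement as a signal alongside an aggregated label, as this section suggests, is to store the full label distribution and a simple uncertainty measure for each instance. The sketch below is a minimal illustration under assumed names (`summarize_item`, the label strings); the choice of Shannon entropy as the disagreement measure is ours, not one prescribed by the framework.

```python
import math
from collections import Counter


def summarize_item(labels):
    """Summarize one instance's annotations without discarding disagreement.

    Returns the majority label together with the full label distribution
    and its Shannon entropy in bits (0.0 means unanimous agreement).
    """
    counts = Counter(labels)
    total = len(labels)
    distribution = {label: n / total for label, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in distribution.values())
    # Ties are broken by first occurrence; a real pipeline should decide
    # explicitly how ties are handled.
    majority_label = counts.most_common(1)[0][0]
    return {
        "majority": majority_label,
        "distribution": distribution,
        "entropy_bits": entropy,
    }


summary = summarize_item(["toxic", "toxic", "not_toxic", "toxic"])
```

Releasing the `distribution` field alongside (or instead of) the `majority` field lets dataset users see where a single label hides a substantial minority view.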

5. Dataset release and maintenance

Finally, it is critical to consider the future of the dataset. Data exists within an ever-changing world, and should be viewed and used in that context. Users of the dataset now and in the future should understand the limitations of the data based on when and how it was collected. For example, a dataset may require periodic updates to remain robust to new slang or changes in language use over time. In addition, annotation tasks may be predicated upon legal definitions or medical standards that may change according to decisions by institutions or governing bodies.

Consider designing and sharing a dataset maintenance plan (Hutchinson et al., 2021b).

Consider potential conditions under which annotations may become outdated or less useful.

Do you have reason to believe the annotations in this dataset may change over time? Do you plan to update your dataset?

Are there any conditions or definitions that, if changed, could impact the utility of your dataset?

Will you attempt to track, impose limitations on, or otherwise influence how your dataset is used? If so, how?

Were annotators informed about how the data is externalized? If changes to the dataset are made, will they be informed?

Is there a process by which annotators can later choose to withdraw their data from the dataset? Please detail.

Case Study

We now present a hypothetical case study to demonstrate how our considerations outlined in Section 4 might be incorporated in practice and how dataset annotation decisions might be documented using CrowdWorkSheets. Responses to documentation questions are not intended to be prescriptive, nor are they completely comprehensive. Instead, they should be considered as one of many valid responses to this line of inquiry, and as a way to provoke further thought and discussion.

In this hypothetical case study, we take our goal to be the development of a benchmark dataset for public release to support academic research in social media content moderation. A Twitter corpus of 20,000 English-language tweets has been collected and we seek to label each tweet independently on a four-point “toxicity” scale defined in (Dixon, 2018).

At a high level, what are the subjective aspects of your task? Judgements of the toxicity of online comments are highly subjective. What makes a tweet harmful or hurtful varies greatly not only by the literal content of the tweet, but by the context surrounding it. In our task setup, tweets are presented to annotators in isolation, so they do not have access to the overall context of the online conversation. As such, we anticipate that annotators may infer surrounding context and make subjective judgements based on this inference.

What assumptions do you make about annotators? Some of the key assumptions we make of our annotators:

Annotators that claim proficiency in English and familiarity with social media have enough context to reasonably interpret the task.

By giving a clear understanding of the goals of this work and explicitly indicating that this is a subjective task where disagreement is expected, we will increase the likelihood that annotators will allow their lived experiences to inform how they label toxicity.

By paying well, we will increase the likelihood that annotators will take time to think through particularly challenging examples.

How did you choose the specific wording of your task instructions? What steps, if any, were taken to verify the clarity of task instructions and wording for annotators? To align with existing research in the area, we’ve chosen to give annotators an existing definition of toxicity, as “rude, disrespectful or otherwise likely to make someone leave a discussion” (Aroyo et al., 2019). To settle on a final task wording, our research team first completed 50 annotation tasks each to identify any obvious challenges applying this definition. We then ran several small pilot studies with slightly varying task instructions, and allowed annotators the option to give feedback on aspects that were unclear. Looking over these results, we settled on the question phrasing that yielded the least reported confusion. We intentionally chose to leave our definition of toxicity somewhat open to interpretation, operating under the understanding that being overly specific in task instructions for subjective work does not improve response quality (Aroyo and Welty, 2015). We also explicitly informed annotators that we expect a variety of interpretations of each comment, and that we were looking for their personal best judgements in the given situation. To motivate thoughtful responses, we chose to pay well above minimum wage and gave annotators a clear idea of the ultimate purpose of their work. However, we know that it is inevitable that some annotators will simply give answers they think we want as quickly as possible. While we screen out responses below a minimum duration, it’s impossible to ensure every answer is honest and thoughtful. We assume that these responses are randomly distributed; we leave it up to dataset users to do further analysis.

What, if any, risks did your task pose for annotators and were they informed of the risks prior to engagement with the task? Our task required annotators to read text that potentially contained hate speech, slurs, and other harmful content. As such, the task posed a risk of psychological harm to annotators. Moreover, given that we selected annotators who had previously experienced online harassment, there is a potential for the task to trigger an emotional response related to past trauma. We informed annotators about this risk prior to the start of the task. We also informed annotators that we would be requesting sociodemographic information in order to assess disagreement across different groups. We outlined our data storage policy and steps we took to prevent responses from being linked to sociodemographic information.

What are the precise instructions that were provided to annotators? The final task instructions used to collect the released data are available at HypotheticalTaskInstructions.com.

Are there certain perspectives that should be privileged? If so, how did you seek these perspectives out? We want to privilege the perspectives of annotators who have personally experienced online harassment or hold marginalized identities that are often targeted online. To this end, we included screening questions such that our annotator pool consisted of raters who have direct experience with online harassment. We intentionally defined “direct experience” very broadly to capture a wide range of experiences, intending to include annotators who’ve been personally harassed by others via online channels, who’ve encountered online content that threatened or disparaged identities they share, who have experience moderating online forums, or who have felt otherwise personally affected by harmful online content.

Are there certain perspectives that would be harmful to include? If so, how did you screen these perspectives out? We believe that there are many harmful worldviews annotators might hold that we do not want captured by our annotations; we do not want to employ annotators who participate in hateful online communities, for example. To attempt to account for this, we identified several tweets that we agreed were unambiguously toxic, and screened out any annotators that did not label these as toxic.

Were sociodemographic characteristics used to select annotators for your task? If so, please detail the process. In addition to screening for annotators who have previously experienced online harassment, we selected annotators based on self-identified gender and age. We aimed for an approximately gender balanced pool and we selected for at least 10% of the annotators to be older than 65 years old. Because annotators were sourced from multiple geographic regions, we could not easily specify thresholds for racial or ethnic diversity; however, because we are screening for annotators who have experienced harassment online, we achieved decent representation among marginalized groups.

If you have any aggregated sociodemographic statistics about your annotator pool, please describe. We first selected annotators who indicated that they had previously experienced online harassment. This resulted in a pool that is disproportionately composed of women and people of color compared with platform demographics. More specific demographic breakdowns are available with the released dataset.

Do you have reason to believe that sociodemographic characteristics of annotators may have impacted how they annotated the data? Why or why not? Yes, we believe that annotators who have themselves experienced online harassment may be more likely to identify tweets as toxic. Based on rates of reported experience with hate speech attacks, we also expect that these annotators will disproportionately be members of marginalized social groups in their respective geographic region.

Consider the intended context of use of the dataset and the individuals and communities that may be impacted by a model trained on this dataset. Are these communities represented in your annotator pool? Our intended audience is researchers studying English-language online content moderation, although we can anticipate that our work may have impact within industry. Content moderation has far-reaching and pervasive influence on online discourse, which impacts a wide range of individuals and communities. Not everyone is equally vulnerable to the worst impacts of toxic language online, so we specifically selected for an annotator pool where this more vulnerable population is represented.

What annotation platform did you utilize? We’re using HypotheticalPlatform.

At a high level, what considerations informed your decision to choose this platform? We have selected HypotheticalPlatform for several reasons: First, they are a generally reliable platform with a history of high data quality. Second, they are able to guarantee that annotators are paid at or above a living wage. Third, their platform’s interface allows annotators to easily communicate feedback and concerns. And finally, their platform allows us to make ample use of screening questions to select the annotator pool for our main body of work.

Did the chosen platform sufficiently meet the requirements you outlined for annotator pools? Are any aspects not covered? We were able to meet all of our requirements for annotator pools through the use of many screening and demographic questions. The main trade-off we made to accomplish this is in cost; to pay annotators well, including for their time answering screening questions, we set a limit on the number of tweets we could label.

What, if any, communication channels did your chosen platform offer to facilitate communication with annotators? How did this channel of communication influence the annotation process and/or resulting annotations? We included a free response section at the end of our survey to allow feedback from annotators. In our pilot studies, we used this feedback to clarify our task instructions. In the full study, most annotators left this section blank, so we chose to leave the free responses out of the final dataset.

How much were annotators compensated? Did you consider any particular pay standards when determining their compensation? If so, please describe. Informed by the 2020 results of the MIT Living Wage Calculator (Glasmeier, 2020), we aimed for annotators to take home at least $25/hr on our work, with the goal of comfortably reaching a living wage for a single adult with no dependents and decreasing the pressure to complete tasks as quickly as possible. Annotators were paid $6.25 for labeling a batch of 40 tweets. Each batch was designed to take no more than 15 minutes, which we verified over the course of the annotation job.
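The arithmetic behind these figures can be checked directly. The worked calculation below assumes a batch takes the full 15 minutes; the per-tweet figure is derived here for illustration and is not stated in the case study.

```latex
\[
\frac{\$6.25}{15\ \text{min}} \times \frac{60\ \text{min}}{1\ \text{hr}} = \$25.00/\text{hr},
\qquad
\frac{\$6.25}{40\ \text{tweets}} \approx \$0.16\ \text{per tweet}.
\]
```

Because 15 minutes is an upper bound on the intended batch duration, $25/hr is a floor on the effective rate: an annotator who finishes a batch sooner earns more per hour.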

How do you define the quality of annotations in your context, and how did you assess the quality in the dataset you constructed? We assessed quality along several dimensions, each of which had an associated question in each 40-question batch:

Attention: We included 1 attention check question that instructs the annotator to give a particular response, to ensure annotators are reading each question.

Self-consistency: We included 2 duplicated questions within each batch, to ensure annotators were actually reading each tweet and being self-consistent in their responses.

Alignment with pre-defined ratings: We included 2 tweets that the research team had pre-labeled, one as unoffensive and one as highly offensive. We chose tweets for which we would expect no disagreement from annotators.

We removed from the final dataset all batches where 2 or more of these 5 data quality questions were incorrectly answered. This ultimately accounted for 12% of our data.
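The filtering rule above can be sketched as follows. This is a minimal illustration, not the pipeline used in the case study: the function names and data shapes are hypothetical, and treating a pre-labeled tweet as "correct" only on an exact match with the research team's rating is an assumption.

```python
def count_failures(attention_ok, duplicate_pairs, gold_checks):
    """Count failed quality questions in one batch (5 checks in total).

    attention_ok:    True if the attention check was answered as instructed.
    duplicate_pairs: two (first_rating, second_rating) tuples for the
                     duplicated tweets.
    gold_checks:     two (given_rating, expected_rating) tuples for the
                     tweets pre-labeled by the research team.
    """
    failures = 0 if attention_ok else 1
    # A duplicated tweet fails if the annotator rated its two copies differently.
    failures += sum(1 for first, second in duplicate_pairs if first != second)
    # Assumption: a pre-labeled tweet fails unless the rating matches exactly.
    failures += sum(1 for given, expected in gold_checks if given != expected)
    return failures


def keep_batch(attention_ok, duplicate_pairs, gold_checks, max_failures=1):
    """Keep a batch unless 2 or more of its 5 quality questions were failed."""
    return count_failures(attention_ok, duplicate_pairs, gold_checks) <= max_failures


# One inconsistent duplicate only: the batch is kept.
kept = keep_batch(True, [(2, 2), (3, 1)], [(0, 0), (3, 3)])
# Failed attention check plus one inconsistent duplicate: the batch is removed.
dropped = keep_batch(False, [(2, 2), (3, 1)], [(0, 0), (3, 3)])
```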

Have you conducted any analysis on disagreement patterns? If so, what analyses did you use and what were the major findings? While the main purpose of this work is data collection and not analysis, we did conduct very preliminary analyses as a starting point for dataset users. We ran standard inter-annotator agreement metrics and found a relatively low interannotator agreement across all raters (Fleiss’ $\kappa=0.25$ (Fleiss, 1971)). However, we do not believe this to be an issue of data quality—when we looked at the data aggregated along different demographic axes, we found many demographic groups with high interannotator agreement whose annotations differ significantly from the majority opinion.
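For readers who want to reproduce this kind of agreement analysis, the sketch below computes Fleiss’ kappa from a table of per-item category counts. It is a plain implementation of the standard formula rather than the case study's own code, and it assumes every item was rated by the same number of annotators. For the four-point toxicity scale described here, each row would hold four counts.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same number of annotators (>= 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement on each item, then its mean over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    # Undefined (division by zero) if all ratings fall in a single category.
    return (p_bar - p_e) / (1 - p_e)


# Three annotators, two categories, perfect agreement on every item.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
```

Running the same function separately on the rows contributed by each demographic group is one way to check whether low overall agreement hides high agreement within groups.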

Did you analyze potential sources of disagreement? In our preliminary analysis, we looked at a few annotator demographics as a source of disagreement. There are a myriad of other factors one could analyze with respect to disagreement—tweet topic, presence or absence of particular words, or how quickly annotators responded, for example—but as this is intended to be released as a research dataset, we have not conducted all of these analyses.

How do the individual annotator responses relate to the final labels released in the dataset? After bucketing annotator demographics such that no annotator was uniquely identifiable, we released all responses, attached to the demographics of the annotator that gave each response. We chose not to aggregate responses into final tweet toxicity labels, and instead leave this to dataset users to aggregate in a way that’s appropriate for their use case.
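The bucketing step can be illustrated with a simple uniqueness check. Everything in this sketch is hypothetical: the bucket edges, the attribute names, and the sample annotators are chosen only to show the idea of coarsening attributes until no combination of values belongs to a single annotator.

```python
from collections import Counter


def age_bucket(age):
    """Map an exact age to a coarse range (bucket edges are illustrative)."""
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    if age < 65:
        return "50-64"
    return "65+"


def unique_combinations(annotators, keys):
    """Return the attribute combinations held by exactly one annotator."""
    counts = Counter(tuple(a[k] for k in keys) for a in annotators)
    return [combo for combo, n in counts.items() if n == 1]


# Hypothetical annotator records.
annotators = [
    {"age": 24, "gender": "woman"},
    {"age": 27, "gender": "woman"},
    {"age": 68, "gender": "man"},
    {"age": 71, "gender": "man"},
]
bucketed = [
    {"age": age_bucket(a["age"]), "gender": a["gender"]} for a in annotators
]

# With exact ages every annotator is unique; after bucketing none is.
before = unique_combinations(annotators, ["age", "gender"])
after = unique_combinations(bucketed, ["age", "gender"])
```

If `after` is non-empty, the buckets are still too fine for the pool and would need to be widened or merged before release.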

Do you have reason to believe the annotations in this dataset may change over time? Do you plan to update your dataset? The relevancy of and perceptions about tweets will certainly change over time. In an effort to remind dataset users that this data should be taken in its temporal context, we include the month and year that each tweet was (a) written and (b) annotated as meta-data. However, as a longer-term strategy, we are also open-sourcing and making public all parts of our annotation pipeline, including rater instructions, data formatting schemes, and information on how to coordinate with our data labeling partners. We will publicly extend an open invitation to future collaborators who want to reuse our pipeline to annotate more data. If this pipeline is used and our guidelines followed satisfactorily, we will append future annotations to our existing dataset.

Are there any conditions or definitions that, if changed, could impact the utility of your dataset? Over time we expect societal views to deviate somewhat from the annotations collected. For example, it will not capture any shifts in attitude regarding language targeting social groups that may be considered marginalized in the future but that are not considered marginalized today.

Will you attempt to track, impose limitations on, or otherwise influence how your dataset is used? If so, how? To access the data, we require that dataset users indicate their affiliation, contact information, and use case. The research team will be assessing uses on a case-by-case basis, with particular attention given to risks associated with use cases that explicitly include sociodemographic data in their modeling. We also ask that any publications cite our dataset release paper so we can track academic uses of the dataset. Our full data license is available at HypotheticalDataLicense.com.

Were annotators informed about how the data is externalized? If changes to the dataset are made, will they be informed? Annotators were informed that this data will be released as a research dataset prior to engaging in the task. We allowed raters to opt in to an email list that will share updates about data release and availability. This site will contain an automatically-updated list of papers that cite our dataset release paper.

Is there a process by which annotators can later choose to withdraw their data from the dataset? If so, please detail. By design, we have no mechanism for linking individual annotators to specific responses, and so have no option for annotators to withdraw their annotations from our dataset. We make this explicit to the annotators, and allow them to stop answering questions at any point if they decide they no longer want to continue.

Conclusion

In this work, we challenge the common portrayal of dataset annotators as interchangeable. Rather, we argue, their individual histories and experiences bring unique perspectives to the table that can become encoded in the overall dataset in significant ways. Therefore, it becomes imperative to consider how the process of selecting annotators, and their experience working on annotation, is documented alongside other aspects of dataset development. Towards this end, we introduced CrowdWorkSheets, a framework for reflecting on and documenting key decision points of crowdsourced dataset development, and a set of recommendations for dataset developers. While this framework is oriented towards individual dataset developers, we also recognize the role large institutions can play in shifting incentives to engage with these recommendations, e.g. incentivizing transparent dataset documentation through conference submission and reviewer guidelines.

Funding: This research was supported by Google.