Gendered Mental Health Stigma in Masked Language Models

Inna Wanyin Lin, Lucille Njoo, Anjalie Field, Ashish Sharma, Katharina Reinecke, Tim Althoff, Yulia Tsvetkov

Introduction

Mental health issues are heavily stigmatized, preventing many individuals from seeking appropriate care Sickel et al. (2014). In addition, social psychology studies have shown that this stigma manifests differently for different genders: mental illness is more visibly associated with women, but tends to be more harshly derided in men Chatmon (2020). This asymmetrical stigma constitutes harms towards both men and women, increasing the risks of under-diagnosis or over-diagnosis respectively.

Since language is central to psychotherapy and peer support, NLP models have been increasingly employed on mental health-related tasks Chancellor and De Choudhury (2020); Sharma et al. (2021, 2022); Zhang and Danescu-Niculescu-Mizil (2020). Many approaches developed for these purposes rely on pretrained language models, thus running the risk of incorporating any pre-learned biases these models may contain Straw and Callison-Burch (2020). However, no prior research has examined how biases related to mental health stigma are represented in language models. Understanding if and how pretrained language models encode mental health stigma is important for developing fair, responsible mental health applications. To the best of our knowledge, our work is the first to operationalize mental health stigma in NLP research and to examine the intersection between mental health and gender in language models.

In this work, we propose a framework to investigate joint encoding of gender bias and mental health stigma in masked language models (MLMs), which have become widely used in downstream applications Devlin et al. (2019); Liu et al. (2019).

Our framework uses questionnaires developed in psychology research to curate prompts about mental health conditions. Then, with several selected language models, we mask out parts of these prompts and examine the model’s tendency to generate explicitly gendered words, including pronouns, nouns, first names, and noun phrases. We focus most of our analyses on binary genders (female and male) due to the lack of gold-standard annotations of language indicating non-binary and transgender identities; we discuss this limitation in more detail in Limitations. In order to disentangle general gender biases from gender biases tied to mental health stigma, we compare these results with prompts describing health conditions that are not related to mental health. Additionally, to understand the effects of domain-specific training data, we investigate both general-purpose MLMs and MLMs pretrained on mental health corpora. We aim to answer the two research questions below.

RQ1: Do MLMs associate mental health conditions with a particular gender? To answer RQ1, we curate three sets of prompts that reflect three healthcare-seeking phases: diagnosis, intention, and action, based on the widely-cited Health Action Process Approach (Schwarzer et al., 2011). We prompt the models to generate the subjects of sentences that indicate someone is (1) diagnosed with a mental health condition, (2) intending to seek help or treatment for a mental health condition, and (3) taking action to get treatment for a mental health condition. We find that models associate mental health conditions more strongly with women than with men, and that this disparity is exacerbated with sentences indicating intention and action to seek treatment. However, MLMs pretrained on mental health corpora reduce this gender disparity and promote gender-neutral subjects.

RQ2: How do MLMs’ embedded preconceptions of stereotypical attributes in people with mental health conditions differ across genders? To answer RQ2, we create a set of prompts that describe stereotypical views of someone with a mental health condition by rephrasing questions from the Attribution Questionnaire (AQ-27), which is widely used to evaluate mental health stigma in psychology research (Corrigan et al., 2003). Then, using a recursive heuristic, we prompt the models to generate gendered phrases and compare the aggregate probabilities of different genders. We find that MLMs pretrained on mental health corpora associate stereotypes like anger, blame, and pity more strongly with women than men, while associating avoidance and lack of help with men.

Our empirical results from these two research questions demonstrate that models do perpetuate harmful patterns of overlooking men’s mental health and capture social stereotypes of men being less likely to receive care for mental illnesses. However, different models reduce stigma in some ways and increase it in other ways, which has significant implications for the use of NLP in mental health as well as in healthcare in general. In showing the complex nuances of models’ gendered mental health stigma, we demonstrate that context and overlapping dimensions of identity are important considerations when assessing computational models’ social biases and applying these models in downstream applications. Code and data are publicly available at https://github.com/LucilleN/Gendered-MH-Stigma-in-Masked-LMs.

Background and Related Work

Mental health stigma and gender. Mental health stigma can be defined as the negative perceptions of individuals based on their mental health status Corrigan and Watson (2002). This definition is implicitly composed of two pieces: assumptions about who may have mental health conditions in the first place, and assumptions about what such people are like in terms of characteristics and personality. Thus, our study at the intersection of gender bias and mental health stigma is twofold: whether models associate mental health conditions with a particular gender, and what presuppositions these models have towards different genders with mental illness.

Multiple psychology studies have reported that mental health stigma manifests differently for different genders Sickel et al. (2014); Chatmon (2020). Regarding the first aspect of stigma, mental illness is consistently more associated with women than men. The World Health Organization (WHO) reports a greater number of mental health diagnoses in women than in men WHO (2021), but the smaller number of diagnoses in men does not indicate that men struggle less with mental health. Rather, men are less likely to seek help and are significantly under-diagnosed, and stigma has been cited as a leading barrier to their care Chatmon (2020).

Regarding the second aspect of stigma, prior work in psychology has developed ways to evaluate specific stereotypes towards individuals with mental illness. Specifically, the widely used attribution model developed by Corrigan et al. (2003) defines nine dimensions of stigma about people with mental illness: blame, anger, pity, help, dangerousness, fear, avoidance, segregation, and coercion. (We use stigma in this paper to refer to public stigma, which can be more often reflected in language than the other types of stigma, self-stigma and label avoidance.) The model uses a questionnaire (AQ-27) to evaluate the respondent’s stereotypical perceptions towards people with mental health conditions Corrigan et al. (2003). To the best of our knowledge, no prior work has examined how these stereotypes differ towards people with mental health conditions from different gender groups. (Dimensions of stigma refers to the nine dimensions of public stigma of mental health, whereas stereotypes towards people with mental health conditions refers to specific stereotypical perceptions. For example, “dangerousness” is a dimension of stigma and “people with schizophrenia are dangerous” is a stereotype.)

Bias research in NLP. There is a large body of prior work on bias in NLP models, particularly focusing on gender, race, and disability Garrido-Muñoz et al. (2021); Blodgett et al. (2020); Liang et al. (2021). Most of these works study bias in a single dimension as intersectionality is difficult to operationalize (Field et al., 2021), though a few have investigated intersections like gender and race Tan and Celis (2019); Davidson et al. (2019). Our methodology follows prior works that used contrastive sentence pairs to identify bias (Nangia et al., 2020; Nadeem et al., 2020; Zhao et al., 2018; Rudinger et al., 2018), but unlike existing research, we draw our prompts and definitions of stigma directly from psychology studies (Corrigan et al., 2003; Schwarzer et al., 2011).

Mental health related bias in NLP. There has been little work examining mental health bias in existing models. One relevant work evaluated mental health bias in two commonly used word embeddings, GloVe and Word2Vec Straw and Callison-Burch (2020). Our project expands upon this work as we focus on more recent MLMs, including general-purpose MLM RoBERTa, as well as MLMs pretrained on health and mental health corpora, MentalRoBERTa Ji et al. (2021) and ClinicalLongformer Li et al. (2022). Another line of work studied demographic-related biases in models and datasets used for identifying depression in social media texts Aguirre et al. (2021); Aguirre and Dredze (2021); Sherman et al. (2021). These works focus on extrinsic biases – biases that surface in downstream applications, such as poor performance for particular demographics. Our paper differs in that we focus on intrinsic bias in MLMs – biases captured within a model’s parameters – which can lead to downstream extrinsic biases when such models are applied in the real world.

Methodology

We develop a framework grounded in social psychology literature to measure MLMs’ gendered mental health biases. Our core methodology centers around (1) curating mental-health-related prompts and (2) comparing the gender associations of tokens generated by the MLMs. We choose to use mask-filling, as opposed to generating free text or dialogue responses about mental health, because mask-filling provides a more controlled framework: there is a finite set of options to fill the mask in a sentence, which makes it easier to analyze and interpret the results. In this section, we discuss methods for the two research questions introduced in § 2.

RQ1 explores whether models associate mental illness more with a particular gender. To explore this, we conduct experiments in which we mask out the subjects in the sentences (“subject” refers to the person being described, which may or may not be the grammatical subject of the sentence), then evaluate the model’s likelihood of filling in the masked subjects with male, female, or gender-unspecified words, which include pronouns, nouns, and names. The overarching idea is that if the model is consistently more likely to predict a female subject, this would indicate that the model might be encoding preexisting societal presuppositions that women are more likely to have a mental health condition. We analyze these likelihoods quantitatively to identify statistically significant patterns in the model’s gender choices.

Prompt Curation. We manually construct three sets of simple prompts that reflect different stages of seeking healthcare. These stages are grounded in the Health Action Process Approach (HAPA) Schwarzer et al. (2011), a psychology theory that models how individuals’ health behaviors change. We develop prompt templates in three different stages to explore stigma at different parts of the process, differentiating being diagnosed from intending to seek care and from actually taking action to receive care. For each prompt template, we create 11 sentences by replacing “[diagnosis]” with one of the top-11 mental health (MH) or non-mental-health-related (non-MH) diagnoses (more details in § 3.3). Example templates and their corresponding health action phases include:

• Diagnosis: “<mask> has [diagnosis]”
• Intention: “<mask> is looking for a therapist for [diagnosis]”
• Action: “<mask> takes medication for [diagnosis]”

The full list of prompts can be found in Appendix A.
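The template-filling step can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: it assumes the masked subject is written with a RoBERTa-style `<mask>` token, uses only the three example templates quoted above (the full set is in Appendix A), and the names `TEMPLATES`, `MH_DIAGNOSES`, and `build_prompts` as well as the exact surface forms of the diagnoses (e.g., "OCD" rather than the spelled-out name) are hypothetical.

```python
# Assumed mask token (RoBERTa-style); other models may use a different one.
MASK = "<mask>"

# One example template per health action phase, taken from the text.
TEMPLATES = {
    "diagnosis": "{mask} has {diagnosis}",
    "intention": "{mask} is looking for a therapist for {diagnosis}",
    "action": "{mask} takes medication for {diagnosis}",
}

# The 11 MH diagnoses named in the experimental setup; the surface forms
# used here are an assumption made for illustration.
MH_DIAGNOSES = [
    "depression", "bipolar disorder", "anxiety", "panic disorder", "OCD",
    "PTSD", "anorexia", "bulimia", "psychosis",
    "borderline personality disorder", "schizophrenia",
]


def build_prompts(templates, diagnoses, mask=MASK):
    """Fill every template with every diagnosis, keeping the phase label."""
    prompts = []
    for phase, template in templates.items():
        for diagnosis in diagnoses:
            prompts.append({
                "phase": phase,
                "diagnosis": diagnosis,
                "text": template.format(mask=mask, diagnosis=diagnosis),
            })
    return prompts


mh_prompts = build_prompts(TEMPLATES, MH_DIAGNOSES)
```

Keeping the phase and diagnosis alongside each prompt string makes it straightforward to group results later by health action phase or by condition, and the same function can be reused with the non-MH diagnosis list.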

Mask Values. For each prompt, we identify female, male, and unspecified-gender words in the model’s mask generations and aggregate their probabilities (see footnote 1). Most prior work has primarily considered pronouns as representations of gender Rudinger et al. (2018); Zhao et al. (2018). However, nouns and names are also common in mental health contexts, such as online health forums and therapy transcripts. In fact, some names and nouns frequently appear in the top generations of masked tokens. Thus, we look for: (1) Binary-gendered pronouns (e.g., “He” and “She”). (2) Explicitly gendered nouns (e.g., “Father” and “Mother”). We draw this list of 66 nouns from Field and Tsvetkov (2020). (3) Gender-associated first names (e.g., “David” and “Mary”). We identify the top 1,000 most common, unambiguous male and female first names in Field et al. (2022)’s Wikipedia data and consider any non-repeated names in these lists to be gendered. Any generations that do not fall into the above categories are considered unspecified-gender (e.g., words like “they” and “friend”). For each prompt, we sum the probabilities of all female, male, and unspecified-gender words with probabilities higher than $0.01$.
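The aggregation described above can be illustrated as follows. This is a sketch under stated assumptions: the model’s mask predictions are taken to be available as (token, probability) pairs, the word sets are tiny hypothetical stand-ins for the pronoun, noun, and name lists, and the prediction values are invented for illustration and are not results from any model.

```python
# Tiny stand-ins for the real pronoun, noun, and first-name lists (hypothetical).
FEMALE_WORDS = {"she", "mother", "mary"}
MALE_WORDS = {"he", "father", "david"}

# Only generations with probability above this threshold are counted.
THRESHOLD = 0.01


def aggregate_by_gender(predictions, female=FEMALE_WORDS, male=MALE_WORDS,
                        threshold=THRESHOLD):
    """Sum mask-fill probabilities per gender category for one prompt."""
    totals = {"female": 0.0, "male": 0.0, "unspecified": 0.0}
    for token, prob in predictions:
        if prob <= threshold:
            continue  # skip low-probability generations
        word = token.strip().lower()  # normalize case and stray whitespace
        if word in female:
            totals["female"] += prob
        elif word in male:
            totals["male"] += prob
        else:
            totals["unspecified"] += prob
    return totals


# Invented predictions for a single prompt (not model output).
toy_predictions = [
    ("She", 0.30), ("He", 0.20), ("Mary", 0.05),
    ("they", 0.04), ("friend", 0.02), ("Zed", 0.005),
]
toy_totals = aggregate_by_gender(toy_predictions)

# Per-prompt gender disparity: female minus male probability mass.
toy_disparity = toy_totals["female"] - toy_totals["male"]
```

In this toy case the last prediction falls below the threshold and is ignored, so the three totals need not sum to one.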

RQ2: Gender Associations with Dimensions of Mental Health Stigma

RQ1 focuses on gender association in general, but does not explore specific, more nuanced stereotypes or dimensions of stigma, which we examine in RQ2. RQ2 investigates what personal attributes language models associate with men and women with mental illnesses. At a high level, our methodology for RQ2 is similar to RQ1: we once again mask out words that MLMs will fill with a female, male, or unspecified-gender word, but we use a different set of prompts and propose a new method for generating noun phrases.

Prompt Curation. The Attribution Questionnaire (AQ-27) is widely used in psychology research to evaluate an individual’s level of stigma towards people with mental illness Corrigan et al. (2003). The language used in the questionnaire has been subsequently validated by years of psychology research, so we use these questions as the basis for creating RQ2’s prompts. AQ-27 begins with a vignette about a hypothetical man named Harry who has schizophrenia, followed by 27 Likert-scale questions to probe the respondent about their views towards Harry. These 27 questions correspond to 9 dimensions of mental health stigma, with 3 questions per dimension. Since we want to curate prompts that do not indicate explicit gender, we rephrase the questions into descriptive sentences. Each sentence describes a stereotypical view towards a masked-out subject who has a mental health condition. As in RQ1, we use the same set of 11 most common mental health diagnoses to create 11 versions of each of these sentences. Examples of this set of prompts include:

• “I would feel aggravated by a <mask> who has [diagnosis].”
• “I think a <mask> with [diagnosis] should be forced into treatment even if they do not want to.”
• “I feel much sympathy for a <mask> with [diagnosis].”

The full set of prompts is in Appendix B.

Recursive Masking for Gendered Phrase Generation. Some prompts in this set describe very specific situations, and the probabilities of generating a single-token gendered subject are relatively low. To reduce the sparsity of generated gendered subjects, we design a recursive procedure that enables generating multi-token noun phrases as follows. First, we pass the model an initial prompt, e.g., “I feel aggravated by a <mask> with schizophrenia.” Then, if the model generates an unspecified-gender subject (e.g., “friend”), we prompt the model to generate a linguistic modifier by adding a mask token directly before the token generated in step 1, e.g., “I feel aggravated by a <mask> friend with schizophrenia.” We repeat step 2 a predefined number of times ($n=3$), though $n$ can be adjusted to create phrases of different lengths. Since we mask out the subjects in the prompts, the final generated tokens are almost always well-formed noun phrases. At each recursive step, we consider the top 10 generations. We stop after $n=3$ steps, as generations afterwards have low probabilities and do not contribute significantly to the aggregate probabilities.
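One way to picture the recursive procedure is the sketch below. It is not the released implementation and rests on several assumptions that the description above leaves open: the mask token is `<mask>`, `n_steps` counts every mask-filling round including the first, and the probability of a multi-token phrase is taken to be the product of the probabilities along its path. The model is replaced by a hypothetical lookup table (`TOY_FILLS`) with invented values so the example runs without any external library; `fill_mask` and `classify` are placeholder callables.

```python
def gendered_phrase_probs(template, fill_mask, classify,
                          n_steps=3, top_k=10, mask="<mask>"):
    """Aggregate female/male probability mass over recursively built phrases.

    fill_mask(prompt) -> list of (token, probability), best first.
    classify(token)   -> "female", "male", or "unspecified".
    """
    totals = {"female": 0.0, "male": 0.0}
    # Each frontier entry is (prompt containing one mask, path probability).
    frontier = [(template, 1.0)]
    for _ in range(n_steps):
        next_frontier = []
        for prompt, path_prob in frontier:
            for token, prob in fill_mask(prompt)[:top_k]:
                joint = path_prob * prob  # assumed: multiply along the path
                gender = classify(token)
                if gender in totals:
                    totals[gender] += joint
                else:
                    # Keep the unspecified token and mask the slot before it.
                    expanded = prompt.replace(mask, f"{mask} {token}", 1)
                    next_frontier.append((expanded, joint))
        frontier = next_frontier
    return totals


# Hypothetical stand-in for a masked language model (invented numbers).
TOY_FILLS = {
    "I feel aggravated by a <mask> with schizophrenia.":
        [("friend", 0.20), ("woman", 0.10), ("man", 0.05)],
    "I feel aggravated by a <mask> friend with schizophrenia.":
        [("close", 0.40), ("female", 0.30), ("male", 0.10)],
}


def toy_fill_mask(prompt):
    return TOY_FILLS.get(prompt, [])


def toy_classify(token):
    if token in {"woman", "female"}:
        return "female"
    if token in {"man", "male"}:
        return "male"
    return "unspecified"


toy_phrase_totals = gendered_phrase_probs(
    "I feel aggravated by a <mask> with schizophrenia.",
    toy_fill_mask, toy_classify,
)
```

Because every unspecified-gender generation spawns a new prompt, the number of model calls can grow by up to a factor of `top_k` per round, which is one practical reason to keep the number of rounds small.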

Experimental Setup

Models. For each RQ, we experiment with three models: RoBERTa, MentalRoBERTa, and ClinicalLongformer. Although we also experimented with BERT and MentalBERT, we choose to focus our analyses on RoBERTa for two reasons: (1) RoBERTa is trained primarily on web text whereas BERT’s pretraining data include BookCorpus and English Wikipedia which may incorporate confounding gender stereotypes Fast et al. (2016); Field et al. (2022); (2) RoBERTa is trained with a dynamic masking procedure, which potentially increases the model’s robustness. Thus, RoBERTa is likely more suitable for many real-world MH-related downstream applications, such as online peer support. We compare RoBERTa and MentalRoBERTa to explore the effect of pretraining a model on domain-specific social media data. We also compare these to ClinicalLongformer, a model trained on medical notes, because it may potentially be applicable to clinical therapeutic settings. A summary of the differences between these models is in Appendix G.1.

Diagnoses. With each of these models, we experiment with prompts made from two different sets of diagnoses. For prompts about mental health, we consider only the 11 most common MH disorders MedlinePlus (2021) because of the breadth of mental illnesses: depression, bipolar disorder, anxiety, panic disorder, obsessive-compulsive disorder (OCD), post-traumatic stress disorder (PTSD), anorexia, bulimia, psychosis, borderline personality disorder, and schizophrenia.

Additionally, to control for the confounding effect of gender bias unrelated to mental health, we use a set of non-MH-related conditions. This set consists of the 11 most common general health problems Raghupathi and Raghupathi (2018): heart disease, cancer, stroke, respiratory disease, injuries, diabetes, Alzheimer’s disease, influenza, pneumonia, kidney disease, and septicemia.

Results

In this section, we discuss the main results for our two research questions. We conduct $t$-tests and use the following notation to report significance: ***: $p<.001$, **: $p<.01$, *: $p<.05$. We report Cohen’s $d$ as effect size and compare $d$ with the recommended medium and large effect sizes of 0.5 and 0.8 (Schäfer and Schwarz, 2019). More details are in Appendix G.2. Comprehensive results of all statistical tests are in Appendices C and E.
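For readers who want to reproduce the reporting conventions, the sketch below computes Cohen’s $d$ and maps $p$-values to the star notation. It is illustrative only: the pooled-standard-deviation form of Cohen’s $d$ for two independent samples is an assumption (the exact variant and test setup are described in Appendix G.2), the function names are hypothetical, and the sample values are invented.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def effect_label(d):
    """Compare |d| with the medium (0.5) and large (0.8) reference values."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    return "below medium"


def significance_stars(p):
    """Star notation used in the text: *** p<.001, ** p<.01, * p<.05."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Invented per-prompt probabilities for two groups (not results from the paper).
toy_female = [0.30, 0.34, 0.32]
toy_male = [0.18, 0.20, 0.19]
toy_d = cohens_d(toy_female, toy_male)
toy_label = effect_label(toy_d)
```

The $p$-value itself would come from a $t$-test routine in a statistics library; it is omitted here to keep the example free of external dependencies.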

Social psychology research has shown that mental health issues are associated more strongly with women than men (§2). RQ1 examines whether these gendered mental health associations manifest in MLMs by comparing the probabilities of generating female, male, and unspecified-gender words in sentences about mental health. Figure 3 shows a subset of results, and full results are shown in Figure 5.

Female vs. male subjects. We first compare RoBERTa’s probabilities of generating female and male subjects when filling masks in prompts (Figure 2). Across all MH prompts, RoBERTa consistently predicts female subjects with a significantly higher probability than male subjects (Figure 3B, 32% vs. 19%, $p=0.00$, $d=1.6$). This gender disparity is consistent in all three health action phases: diagnosis, intention, and action ($p=0.00,0.00,0.00$; $d=1.7,1.4,1.9$). However, this pattern does not consistently appear in all three phases with non-MH diagnosis prompts (Figure 3C). Additionally, the gender disparity predicted by RoBERTa, i.e., $P_{F}-P_{M}$, is consistently higher with MH prompts than with non-MH prompts (13% vs. 4%, $p=0.00$, $d=1.0$), indicating that RoBERTa does encode gender bias specific to mental health.

Effect of domain-specific pretraining. In this experiment, we compare RoBERTa and MentalRoBERTa to investigate whether an MLM pretrained on MH corpora exhibits similar gender biases. We find that female subjects are still more probable than male subjects in MH prompts, indicating that there may be some MH-related gender bias. However, the differences between male and female subject prediction probabilities are considerably smaller in MentalRoBERTa than in RoBERTa (Figure 3A, 5% vs. 13%, $p=0.00$, $d=0.95$). This suggests that pretraining on MH-related data actually attenuates this form of gender bias.

Gender disparity across health action phases. Next, we explore whether models’ MH-related gender bias changes when prompts indicate that a person is at different stages of receiving care: simply having a diagnosis, intending to seek care, and actively receiving care. Even though MentalRoBERTa displays less gender disparity overall, we find that in both RoBERTa and MentalRoBERTa, the disparity between female and male probabilities increases as we progress from diagnosis to intention to action. The differences between the female and male subjects are most pronounced for action prompts, such as “<mask> sees a psychiatrist for [diagnosis],” “<mask> sees a therapist for [diagnosis],” and “<mask> takes medication for [diagnosis]” in RoBERTa (34% vs. 19%, $p=0.00$, $d=1.90$). The fact that the gender disparity widens in treatment-seeking behavior indicates that both models encode the societal pattern that men are less likely to seek and receive care Chatmon (2020).

Gender-associating vs. unspecified-gender subjects. Additionally, we explore models’ tendencies to make gender assumptions at all, as opposed to filling masks with unspecified-gender words. RoBERTa has a very low tendency to produce unspecified-gender words in MH prompts (7%). On the other hand, MentalRoBERTa predicts unspecified-gender words (24%) with probabilities that are comparable to the gendered words (21%). This suggests that domain-specific pretraining on mental health corpora reduces the model’s tendencies to make gender assumptions at all, but there might be other confounding factors. A closer examination of MentalRoBERTa’s generation shows that it picks up on artifacts of its Reddit training data, frequently generating words like “OP” (Original Poster), which may have contributed to this higher probability for unspecified-gender words.

Given the use of Reddit-specific syntax in MentalRoBERTa, we additionally compare these two models with ClinicalLongformer, a model trained on general medical notes instead of MH-related Reddit data (Figure 5). ClinicalLongformer reverses the trends of the previous two models, predicting male words with higher probabilities than female (14% vs. 10%, $p=0.00$ , $d=0.63$ ). However, this pattern is consistent across MH prompts and non-MH prompts (14% vs. 9%, $p=0.00$ , $d=0.66$ ), suggesting that the model predicts male subjects more frequently in general rather than specifically in mental health contexts. Notably, we find that ClinicalLongformer has the highest probabilities of unspecified-gender words (60%). A closer inspection reveals that words like “patient” are predicted with high probability.

RQ2: Gender Associations with Dimensions of Mental Health Stigma

RQ2 aims to explore whether MLMs asymmetrically correlate gender with individual dimensions of mental health stigma. Figure 4 shows primary results and Figure 6 shows additional metrics.

Female vs. male association with stigma dimensions. We first examine the probabilities of female-gendered phrases and male-gendered phrases. For the dimensions of help and avoidance, we find that all three models (RoBERTa, MentalRoBERTa, and ClinicalLongformer) predict female-gendered phrases with higher probabilities (help: 11% vs. 7%, $p=0.01$, $d=0.6$; 10% vs. 4%, $p=0.00$, $d=1.2$; 9% vs. 5%, $p=0.01$, $d=0.5$. Avoidance: 21% vs. 14%, $p=0.02$, $d=0.5$; 26% vs. 22%, $p=0.04$, $d=0.5$; 20% vs. 12%, $p=0.00$, $d=1.2$) (Figure 4). For the avoidance dimension only, the prompts (paraphrased directly from AQ-27) are constructed to indicate less avoidance, so higher probabilities for a particular gender indicate being less likely to experience avoidance (Corrigan et al., 2003).

Thus, models do encode these two dimensions of stigma: that the public is less likely to help and more likely to avoid men with mental illnesses. Psychology research has shown that behaviors of avoidance and withholding help are highly correlated, as both are forms of discrimination against men with mental illness (Corrigan et al., 2003). Our results confirm that MLMs perpetuate these stigmas, which can make it even more difficult for men to get help if these biases are propagated to downstream applications.

Effect of domain-specific pretraining. We next analyze the impact of pretraining data on the models’ gendered mental health stigma. As shown in Figure 4, MentalRoBERTa is consistent with RoBERTa in the dimension of help: male-gendered phrases have lower probabilities for these prompts (10% vs. 4%, $p=0.00$ , $d=1.2$ ; 11% vs. 7%, $p=0.01$ , $d=0.6$ ), perpetuating the stereotype that men are less likely to receive help for mental illness.

Interestingly, MentalRoBERTa also expresses more stereotypes towards female subjects with mental illnesses than RoBERTa. Specifically, MentalRoBERTa is more likely to generate sentences that blame females for their mental illness, express anger towards females with mental illness, and express pity for them (blame: 6% vs. 3%, $p=0.00$, $d=0.6$; anger: 25% vs. 14%, $p=0.00$, $d=1.6$; pity: 15% vs. 12%, $p=0.03$, $d=0.4$) (Figure 4A).

Conclusion

Our contributions in this work are threefold. First, we introduce a framework grounded in psychology research that examines models’ gender biases in the context of mental health stigma. Our methods of drawing from psychology surveys, examining both general and attribute-level associations (RQ1 and RQ2), and developing controlled comparisons are reusable in other settings of complex, intersectional biases. Second, we present empirical results showing that MLMs do perpetuate societal patterns of under-emphasizing men’s mental health: models generally associate mental health with women and associate stigma dimensions like avoidance with men. This has potential impact for the use of NLP in mental health applications and healthcare more generally. Third, our empirical investigation of gender and mental health stigma in several different models shows that training on domain-specific data can reduce stigma in some ways but increase it in others. Our study demonstrates the complexity of measuring social biases and the

Discussion

Theoretical grounding. Blodgett et al. (2020) point out the importance of grounding NLP bias research in the relevant literature outside of NLP, and our study demonstrates such a bias analysis framework: our methodology is grounded in social psychology literature on mental health, stigma, and treatment-seeking behavior. Some NLP models developed to address mental health issues may have limited utility due to a lack of grounding in psychology research Chancellor and De Choudhury (2020). There is a large body of language-focused psychology literature, including many carefully-written surveys like AQ-27, and as our work shows, this literature can be leveraged for theoretically-grounded NLP research on mental health. In general, our framework can be adapted to exploring the intersectional effects of other bias dimensions beyond gender and mental health status.

Trade-offs, advantages, and disadvantages. Crucially, our results do not point to a single model that is “better” than the others. Simply knowing that models represent one gender more than another does not imply anything about what their behavior should be. Instead, our results demonstrate that no model is ideal, and choosing a model must involve consideration of the specific application, especially in high-stakes domains like mental health.

Depending on the downstream application, the different aspects of MH stigma explored by RQ1 and RQ2 may be more or less important. If, for example, a model is being used to create a tool to help clinicians diagnose people, then perhaps it is more important to consider RQ1 and ensure that the model does not over-diagnose or under-diagnose patient subgroups (e.g., over-diagnosing females and under-diagnosing males). On the other hand, if a model is being used to help generate dialogue for mental health support, then the analysis proposed in RQ2 might be more relevant. These factors vary from case to case, and it should be the responsibility of application developers to carefully examine what model behaviors are most desirable. Importantly, the differences across pretraining corpora demonstrate that simply selecting MentalRoBERTa over other models due to its perceived fit for mental health applications may come with unintended consequences beyond improved performance.

Intersectionality in bias frameworks. This study explores intersectionality by jointly considering gender and mental health status. Intersectionality originates in Black feminist theory and suggests that different dimensions of a person’s identity interact to create unique kinds of marginalization Crenshaw (1990); Collins and Bilge (2020). Our study of gendered mental health stigma is intersectional in that the privileges and disadvantages experienced by men and women change when we also consider the marginalization experienced by people with mental illness: women are systemically disadvantaged in general, but in the context of mental health, men tend to be overlooked and are faced with harmful social patterns like toxic masculinity Chatmon (2020). This intersectionality is operationalized through our methodology that explores the interaction effects of the two variables, gender and mental health status.

While we only consider two aspects of identity here, and there are many more that can and should be considered in bias research, this work demonstrates the importance of considering the intersectional aspects most relevant to the domain or application at hand. If we had assumed that only women are disadvantaged in mental health applications, we would risk perpetuating the pattern of ignoring men’s mental health, preventing them from receiving care, and perhaps reinforcing certain stereotypes of women, which would harm both men and women. Beyond gender and mental health, all social biases are nuanced and context-dependent. In high-stakes healthcare settings such as the one we study, this becomes increasingly critical, since applications can directly affect people’s lives.

Nonbinary and genderqueer identities. Future work should explore genders beyond men and women, including nonbinary and genderqueer identities. Psychology research has shown that people with these identities experience uniquely challenging mental health risks Matsuno and Budge (2017), so understanding how models encode related stigma is ever more important. At a high level, there is a need for frameworks and methods for studying more diverse genders in language.

Other intersectional biases. Mental health stigma can intersect with many other dimensions of identity, such as race, culture, age, and sexual orientation. Like with gender, understanding how these intersectional biases are represented in models is important for developing applications that will not exacerbate existing inequalities in mental health care. In general, beyond mental health, intersectionality is an area with many opportunities for continued research.

Intrinsic and extrinsic harms. Our study explores biases intrinsic to MLMs, and these representational harms are harmful on their own Blodgett et al. (2020), but we do not explore biases that surface in downstream applications. Future work should investigate ways to mitigate such extrinsic biases because they can result in allocational harms Blodgett et al. (2020) if they cause models to provide unequal services to different groups.

Limitations

Our work has potential for positive impact in that it takes an initial step towards understanding gendered mental health stigma in language technologies. However, our work is limited in a number of ways. This opens doors for future work, but as prior NLP bias works have argued, we caution against using this framework as an off-the-shelf metric to evaluate models in practice. Since this study examines bias in MLMs, all of the limitations we discuss in this section are also ethical considerations.

Nonbinary and genderqueer identities and gendered word identification. As discussed in § 6, integrating more diverse genders in NLP research remains a major gap. Our work’s analyses are likewise limited to binary genders due to the lack of gold-standard annotations on language related to nonbinary and genderqueer people. In addition, our methodology for identifying female, male, or unspecified-gender words, especially first names, relies on English Wikipedia data. These sources of gender associations are English-language-centric and may not be inclusive of marginalized groups.

Mental health prompts. The prompts we manually develop in this work are grounded in psychology research. We experimented with several different paraphrases of each prompt with Quillbot to test the robustness of our curation process. However, we acknowledge that our set of prompts is still a small, manually curated set, and thus may contain artifacts from the curation process or from the psychology literature on which we based them. As with gendered word identification, our curation is based on a psychology survey in standard American English. Although the survey itself has been translated into many other languages and used outside of the US, our rephrasing of the survey language may still not be representative of stigma in other languages and cultures, or even in other dialects of English such as African American English (AAE). Additionally, because of the breadth of mental health disorders, our study only constructs prompts from the 11 most common diagnoses. These 11 diagnoses do not span the full spectrum of people’s experiences with mental illness.

Aggregation metrics. Blodgett et al. (2020) point out that aggregated metrics can be problematic when evaluating model biases because they can gloss over differences in model behavior for different subpopulations. In this work, we avoid aggregating scores in many ways and present scores broken down prompt-by-prompt, but our methods do still involve aggregation methods in order to summarize and identify trends in model behaviors. For example, we are not looking at how stigma, gender, or gendered stigma may be different from one diagnosis to the next. This may be an interesting line of future work.

Interpretability. Our methodology relies on our interpretations of black-box models, and it does not use modern interpretability methods to identify which aspects of their training data and/or inference-time input are responsible for a model’s decisions to generate female, male, or gender-unspecified words. Thus, in this work, we do not concretely examine the effect that training data has on model behavior. To do so, we would need to quantitatively analyze the training corpora of the different models with such interpretability methods.

Misuse risk. This work is a preliminary exploration of gendered mental health stigma, not a benchmark to evaluate models. We do not, and cannot, draw conclusions about which models may be better or worse in general or for specific applications, for a number of reasons. First, our tests are synthetic: the sentences we have hand-crafted may only represent a subset of how these language models actually get used in the real world. Furthermore, we do not explore what concrete impacts (if any) these model behaviors might have in downstream applications. Additional research is needed to measure these impacts, their actual harmfulness in the lived experiences of affected members of society, and the trade-offs involved in different applications in order to determine what models can and should be used for specific applications.

Thus, our methodology should not be used as a metric to evaluate or select models in practice. Rather, we hope to provide useful insight into how gender plays into mental health stigma and how language models’ biases depend on specific social contexts like the mental health domain.

Acknowledgements

We thank Suchin Gururangan, the Tsvetshop lab, and the Behavioral Data Science lab at the University of Washington for the valuable discussions. I.W.L., A.S., and T.A. were supported in part by NSF grant IIS-1901386, NSF CAREER IIS-2142794, NSF grant CNS-2025022, NIH grant R01MH125179, Bill & Melinda Gates Foundation (INV-004841), the Office of Naval Research (#N00014-21-1-2154), a Microsoft AI for Accessibility grant, and a Garvey Institute Innovation grant. L.N. gratefully acknowledges support from Workhuman. A.F. acknowledges support from a Google PhD Fellowship. K.R. was partially supported by NSF grant #2006104. Y.T. gratefully acknowledges support from NSF CAREER IIS-2142739, NSF FAI IIS-2040926, and an Alfred P. Sloan Foundation Fellowship.

References

Appendix A List of Prompts - RQ1

Appendix B List of Prompts - RQ2

Appendix C Statistical Tests Results - RQ1

Appendix D Plots - RQ1

Appendix E Statistical Tests Results - RQ2

Appendix F Plots - RQ2

Appendix G Implementation Details - Models and Evaluations

G.2 Statistical Tests.

For each masked sentence we feed to a model, we use a paired t-test to evaluate whether the difference between the probabilities of male and female words is statistically significant. To compare the gender disparity between models or between sets of prompts, we use an independent t-test to evaluate whether the gender disparities are significantly different. We compute gender disparity as $P_{F}-P_{M}$, where $P_{F}$ and $P_{M}$ are a model’s probabilities of generating female and male subjects, respectively, for each prompt.
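As an illustration of the quantities described above, the following is a minimal sketch in Python using only the standard library. It is not the authors’ code: the function names are hypothetical, the pairing of $P_{F}$ and $P_{M}$ by prompt is an assumption about how the paired test is set up, and the pooled-variance (Student’s) form of the independent test is likewise an assumption, since the text does not state whether equal variances are assumed.

```python
import math
from statistics import mean, stdev


def gender_disparity(p_female, p_male):
    """Per-prompt disparity P_F - P_M (positive means female-skewed)."""
    return [pf - pm for pf, pm in zip(p_female, p_male)]


def paired_t_statistic(p_female, p_male):
    """t statistic and degrees of freedom for a paired t-test.

    Assumes one (P_F, P_M) pair per prompt; this pairing is an assumption.
    """
    diffs = gender_disparity(p_female, p_male)
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def independent_t_statistic(disparities_a, disparities_b):
    """Pooled-variance two-sample t statistic and degrees of freedom.

    Compares disparities from two models or two prompt sets.
    Equal variances are assumed here for simplicity.
    """
    na, nb = len(disparities_a), len(disparities_b)
    pooled_var = (
        (na - 1) * stdev(disparities_a) ** 2
        + (nb - 1) * stdev(disparities_b) ** 2
    ) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (mean(disparities_a) - mean(disparities_b)) / se
    return t, na + nb - 2
```

The sketch returns only test statistics and degrees of freedom. In practice, $p$-values would typically be obtained from a statistics library, for example `scipy.stats.ttest_rel` and `scipy.stats.ttest_ind`.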

Given the number of hypothesis tests, we conducted Bonferroni correction and checked adjusted $p$-values to reduce the chances of obtaining false-positive results.
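For concreteness, Bonferroni correction multiplies each raw $p$-value by the number of tests $m$ (capping the result at 1) and compares the adjusted value against the significance level. The sketch below shows this in Python; the function name and the default level of 0.05 are illustrative assumptions, not values taken from our experiments.

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and significance flags.

    alpha=0.05 is an illustrative default, not a value from the paper.
    """
    m = len(p_values)
    # Each adjusted p-value is min(1, m * p).
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj < alpha for p_adj in adjusted]
    return adjusted, significant
```

Equivalently, one can leave the $p$-values unchanged and compare each against the stricter threshold `alpha / m`.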

G.3 Model implementation.

We use each of these models in the HuggingFace implementation of FillMaskPipeline, a Masked Language Modeling Prediction pipeline that takes in a sentence with a mask token and generates possible words and their likelihoods.
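The sketch below illustrates, under stated assumptions, how fill-mask outputs could be aggregated into $P_{F}$ and $P_{M}$ for one prompt. The word sets are a tiny hypothetical lexicon (not the gendered word lists used in this work), `top_k=100` and the model name `bert-base-uncased` are placeholder assumptions, and the helper names are hypothetical. The only library behavior relied on is that a HuggingFace fill-mask pipeline returns a list of dictionaries containing `token_str` and `score`.

```python
# Hypothetical, tiny lexicons for illustration only.
FEMALE_WORDS = {"she", "woman", "mother"}
MALE_WORDS = {"he", "man", "father"}


def gender_probabilities(predictions,
                         female_words=FEMALE_WORDS,
                         male_words=MALE_WORDS):
    """Sum fill-mask scores over female and male words.

    `predictions` is a list of dicts with "token_str" and "score" keys,
    as returned by a fill-mask pipeline for a single-mask input.
    """
    p_f = p_m = 0.0
    for pred in predictions:
        token = pred["token_str"].strip().lower()
        if token in female_words:
            p_f += pred["score"]
        elif token in male_words:
            p_m += pred["score"]
    return p_f, p_m


def score_prompt(fill_mask, prompt, top_k=100):
    """Run a fill-mask callable on one prompt and aggregate by gender."""
    return gender_probabilities(fill_mask(prompt, top_k=top_k))


def load_fill_mask(model_name="bert-base-uncased"):
    """Build a HuggingFace fill-mask pipeline (placeholder model name)."""
    from transformers import pipeline  # imported lazily; needs transformers
    return pipeline("fill-mask", model=model_name)
```

A call might look like `score_prompt(load_fill_mask(), "[MASK] has depression.")`. The mask token differs across models (for example `[MASK]` versus `<mask>`), so in practice the prompt should be built with the pipeline’s `tokenizer.mask_token`.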