OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs
Patrick Haller, Ansar Aynetdinov, Alan Akbik
Introduction
Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions Wang et al. (2023). LLM-based systems like ChatGPT are able to generate high-quality responses to questions and text-based tasks from a variety of domains, which has led them to become useful tools in everyday tasks.
Biases in model answers. However, an open research question concerns the inherent biases of trained models and their responses. Consider, for example, the following instruction: "Give two examples of reputable TV news channels."
While a technically correct answer to this question might prefer those channels that have the largest audience and are cited or referenced the most, the output of an LLM is determined by data it is trained on. This includes the query-response pairs used to instruction-tune it, and the human preference data used for alignment approaches such as RLHF Ngo et al. (2023) or de-biasing methods Ouyang et al. (2022); Bai et al. (2022). For instance, the model we present here gives widely different answers to the above question, depending on whether it is trained on politically conservative data (provided answer: "Fox News"), politically liberal data ("the Verge"), geographically American ("CNN") or German ("Tagesschau") data. This example is illustrated in Figure 1.
Detecting and mitigating biases. Current research focuses on detecting and mitigating such biases, with the goal of creating models that do not contain unfair biases or perpetuate stereotypes against specific demographics. Bender et al. (2021) have shown that simply increasing the size of the pre-training corpus does not result in an unbiased language model, because even a very large corpus still implicitly carries (internet-specific) demographic biases. A number of previous works in the field are dedicated to measuring bias in LLMs Zhao et al. (2018); De-Arteaga et al. (2019); Nadeem et al. (2021) and proposing techniques to automatically de-bias them after the pre-training stage Gowda et al. (2021); Schick et al. (2021); Gira et al. (2022). Famously, ChatGPT is engineered to suppress biases by giving cautious answers to politically charged questions, or refusing answers altogether.
Our approach: OpinionGPT. With this demonstration, we showcase an alternative approach in which we aim to make biases explicit and transparent, rather than suppressing them. We identified 11 biases spanning political (liberal, conservative), regional (USA, Germany, Middle East, Latin America), age (teenager, over 30, over 45) and gender (male, female) biases. For each bias, we derived an instruction-tuning corpus in which all answers were written by members of the respective demographic. With this corpus, we conducted a full fine-tuning of a LLaMa model Touvron et al. (2023a), yielding a model in which the bias can be selected when requesting an answer to a question.
We make OpinionGPT available as a web demonstrator in which users can ask questions and select all biases they wish to investigate. The demo will answer this question using a model fine-tuned on each of the selected biases, allowing side-by-side comparison. An example of this demo in action is provided in Figure 1.
In this paper, we:
Illustrate how we derived a "bias-aware" instruction-tuning corpus from English-language Reddit, and give details on our dataset processing and model training steps (Section 2)
Present the OpinionGPT model and web interface and showcase possible interactions with our demo (Section 3)
Our goal is to allow users to explore how language, ideas, and communication are influenced by different biases and perspectives. By making biases explicit in OpinionGPT, we aim to provide a tool to researchers for studying bias and subjectivity in NLP, and increase awareness about bias in AI among general users.
OpinionGPT
In this section, we explain how we derived our bias-aware corpus (Section 2.1) and how we trained the OpinionGPT model (Section 2.2).
Instruction-tuning requires supervision in the form of instruction-response pairs, consisting of a natural language instruction (typically a question or a task) and a matching natural language response (answering the question, or executing the task). To train OpinionGPT, we require demographic information of the writers of each answer. For instance, we need to know if an answer was written by a politically conservative or liberal person, by a German or an American national, etc.
Source: AskX subreddits. We derive this corpus from Reddit, an online discussion forum in which users publicly post messages to which other users post responses. We use a Reddit dump by Watchful1 (2023) of the 20k most popular subreddits, with posts from 2005-06 to 2022-12. Reddit is structured into subreddits, each of which focuses on a specific topic, has subreddit-specific posting rules, and subreddit-specific moderators that enforce these rules.
We consider a specific kind of subreddit that follows the "AskX" schema. Examples of such subreddits are "AskAGerman" and "AskAnAmerican". As per the rules of these subreddits, anyone can ask a question, but only members of the specific demographic should answer these questions. So, in "AskAGerman", all answers should be written by German nationals. We identified 91 subreddits that follow the AskX schema. From these, we manually selected 13 AskX subreddits from which to derive a corpus (see Table 1).
Deriving instruction-tuning data. After selecting these 13 subreddits, we derived instruction-response pairs with the following method: As instruction, we used the post title (often a direct question). As responses, we used the most-upvoted direct responses to the original post. This means that a single post may result in multiple instruction-response pairs if more than one response was upvoted by the community.
To increase data quality, we employed a number of filters: (1) We removed all posts that had no upvotes, or that were later deleted. (2) We removed all responses that cite other comments and posts (since these require the full context of a discussion to make sense). (3) We removed all posts and responses that are longer than 80 words to encourage the model to give short, direct answers. (4) We selected from each subreddit the 25k most-upvoted responses, yielding a corpus in which all answers were approved of by the respective demographic-specific subreddit.
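The derivation and filtering steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used for OpinionGPT: the Post and Comment records, their field names (score, deleted, is_top_level, quotes_other), and the reading of "upvoted" as a positive score are hypothetical choices made for the example. Only the 80-word limit and the 25k cap are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

MAX_WORDS = 80    # length limit from filter (3)
TOP_K = 25_000    # per-subreddit cap from filter (4)


@dataclass
class Comment:
    body: str
    score: int           # net upvotes (assumed field)
    is_top_level: bool   # True if it replies directly to the post
    quotes_other: bool   # hypothetical flag: cites another comment or post


@dataclass
class Post:
    title: str
    score: int           # net upvotes (assumed field)
    deleted: bool
    comments: List[Comment]


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def derive_pairs(posts: Iterable[Post], top_k: int = TOP_K) -> List[Tuple[str, str]]:
    """Return (instruction, response) pairs for a single subreddit."""
    candidates = []
    for post in posts:
        # (1) skip posts without upvotes and deleted posts
        if post.deleted or post.score <= 0:
            continue
        # (3) skip overly long posts
        if word_count(post.title) > MAX_WORDS:
            continue
        for comment in post.comments:
            # keep only upvoted, direct responses to the post
            if not comment.is_top_level or comment.score <= 0:
                continue
            # (2) skip responses that cite other comments or posts
            if comment.quotes_other:
                continue
            # (3) skip overly long responses
            if word_count(comment.body) > MAX_WORDS:
                continue
            candidates.append((comment.score, post.title, comment.body))
    # (4) keep the top_k most-upvoted responses
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(title, body) for _, title, body in candidates[:top_k]]
```

Note that a single post can contribute several pairs, since every surviving direct response is paired with the same post title.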
Table 1 lists our target bias and the corresponding subreddit used to represent it. We aimed for an even distribution of training samples per bias. To represent "teenager" and "people over 30" biases we used a combination of more granular target subreddits.
2.2 Model Training
We use the 7-billion-parameter LLaMa V1 model as the base LLM in our instruction-tuning approach.
In the initial phase, we explored parameter-efficient tuning methods like LoRA Hu et al. (2021), but qualitatively found full fine-tuning Wei et al. (2022) to better capture the biases of our corpus. To execute full fine-tuning, we followed the approach and hyperparameters detailed by Taori et al. (2023). We also explored including more general instruction-tuning datasets like Alpaca Taori et al. (2023) and Dolly Conover et al. (2023), but qualitatively found little impact on model responses and thus decided to use only our bias-aware corpus for the final version.
Bias-Specific Prompts. During training and inference, we include a mention of the specific bias in the model prompt. During training, this allows the model to learn to distinguish between biases. During inference, this allows the user to specify the desired bias when requesting a response. We qualitatively explored several variants, but converged on a minimalistic prompt that repeats the subreddit name thrice before the instruction and the response.
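To make the idea concrete, the following Python sketch shows one possible way to build such a prompt. The exact template is not reproduced here, so the layout below (a header line with the repeated subreddit name, followed by "Instruction:" and "Response:" markers) and the function name build_prompt are purely illustrative assumptions. Only the threefold repetition of the subreddit name is taken from the text.

```python
from typing import Optional

REPEATS = 3  # the subreddit name is repeated thrice, as described above


def build_prompt(subreddit: str, instruction: str, response: Optional[str] = None) -> str:
    """Build a bias-specific prompt (hypothetical layout, for illustration only).

    With a response, the result is a full training example.
    Without one, the prompt ends where the model should start generating.
    """
    header = " ".join([subreddit] * REPEATS)
    prompt = f"{header}\nInstruction: {instruction}\nResponse:"
    if response is not None:
        prompt += f" {response}"
    return prompt


# Training-time example (instruction and response come from the corpus).
train_text = build_prompt("AskAGerman", "What is your favorite food?", "Bread.")

# Inference-time example (the user picks the bias, the model completes the text).
infer_text = build_prompt("AskAGerman", "What is your favorite food?")
```

Using the same builder for training and inference keeps the bias marker in an identical position in both settings, which is what lets the user select a bias at request time.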
3 Measuring Bias
As indicated in the previous section, our development process was mostly guided by qualitative evaluations to choose between alternative approaches and model variants. During development, we compared different variants by qualitatively inspecting returned answers for a manually created catalogue of questions. If two model variants were deemed to give answers of roughly similar quality, we chose the less complex approach.
Table 2 gives a shortened overview of model outputs for 5 questions and all 11 biases. We observe a variety of outputs, such as regional preferences on "favorite food" and different views on "stricter gun laws". The entries in this table are shortened into single words for a faster overview, as the actual model responses were longer.
Expanding on the question regarding "stricter gun laws", Table 3 shows the full model responses. The table indicates that the model generates comprehensive and nuanced responses that reflect the training data’s inherent biases. In some instances, it constructs responses based on underlying political ideology, suggesting that it has captured the connection between individual biases and broader political contexts. In other cases, responses are grounded in the expression of feelings and sentiment rather than in ideological arguments.
However, we also note that some responses include mentions of other biases. For instance, the "Latin America" answer in Table 2 includes the phrase "I’m a leftist". This points to several potential limitations of our approach: First, people posting in a specific subreddit will likely not accurately represent the full demographic we hope to cover (meaning we only model the subset of each demographic that actually posts on Reddit). Second, biases overlap (Latin America consists of different countries, different political leanings and different gender/age groups), resulting in a conflated training signal that potentially leads to less clear bias boundaries in our tuned model.
3.2 Quantitative Evaluation
We also experimented with quantitative evaluations to better understand whether each bias group in our model inherently carries a certain view on various political and societal issues, as well as a certain attitude towards different demographics. In order to quantify these notions, we relied on the BOLD dataset Dhamala et al. (2021). It consists of Wikipedia prompts corresponding to different races, genders, religious beliefs, political ideologies, and professions. We use the "regard" metric Sheng et al. (2019) to quantify the attitude of each modeled bias group towards a certain demographic, and regular sentiment analysis Camacho-Collados et al. (2022) for prompt completions related to political ideologies or religious beliefs.
Table 4 lists the results for a subset of the BOLD dataset. Overall we observe that the "conservative" bias shows the highest share of prompt completions with negative regard towards all five race and gender demographics in the BOLD dataset. Somewhat surprisingly, it also displays the most negative sentiment towards Christianity, while the "liberal" bias does so towards Islam.
Meanwhile, biases related to older demographics tend to have a more positive sentiment and regard towards the subgroups and ideas considered in Table 4. This may reflect a more polite language use by older users on Reddit.
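As an illustration of how a "share of prompt completions with negative regard" can be computed, the Python sketch below aggregates classifier labels per bias. It assumes that every completion has already been labelled by a regard or sentiment classifier. The function names and the toy records are hypothetical placeholders and do not reproduce any values from Table 4.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def label_shares(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each bias to the share of its completions per label.

    records: (bias, label) tuples, one per prompt completion.
    """
    counts = defaultdict(Counter)
    for bias, label in records:
        counts[bias][label] += 1
    shares = {}
    for bias, counter in counts.items():
        total = sum(counter.values())
        shares[bias] = {label: n / total for label, n in counter.items()}
    return shares


def most_negative(shares: Dict[str, Dict[str, float]]) -> str:
    """Return the bias with the highest share of 'negative' labels."""
    return max(shares, key=lambda bias: shares[bias].get("negative", 0.0))


# Toy input with placeholder bias names (not results from the paper).
records = [
    ("bias_a", "negative"), ("bias_a", "positive"),
    ("bias_b", "neutral"), ("bias_b", "positive"),
    ("bias_b", "negative"), ("bias_b", "positive"),
]
shares = label_shares(records)
```

Reporting shares rather than raw counts keeps the comparison fair when biases produce a different number of valid completions.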
Web Demonstration
The web-based user interface for OpinionGPT provides an interactive platform for users to interact with the model. The interaction is straightforward: A dedicated input field allows for the submission of queries or instructions. Additionally, users choose from the 11 biases supported by the model by clicking the respective checkboxes (see Figure 1, upper half). The model then outputs a response for each selected bias to the entered question (see Figure 1, lower half). Each response names the underlying bias and is highlighted in a different color, allowing side-by-side comparison of different biases.
Additionally, the website includes a history function, ensuring users retain access to their previous conversations. The history can be referred to at any time. A sharing feature allows users to share their conversations, making them accessible to other users on the OpinionGPT website.
We build the website upon the open-source project Chat UI (https://github.com/huggingface/chat-ui) by HuggingFace, albeit heavily modified and customized to align with the unique needs of OpinionGPT. A crucial part of this customization involves developing our own dedicated backend to serve our model for inference, adapted to the special requirements of OpinionGPT.
Related Work
Assessment and Measurement of Biases. Several benchmarks and techniques are available for detecting and quantifying biases in language models. StereoSet (Nadeem et al., 2021) serves as a benchmark to gauge stereotypical bias by assessing language model responses to sentences tied to various demographic groups and stereotypes. The Sentence Encoder Association Test (SEAT) (Kaneko and Bollegala, 2021) quantifies bias by examining the strength of association between pairs of words and attributes. CrowS-Pairs (Nangia et al., 2020) discerns societal biases by evaluating the model’s capacity to detect biased sentences within given pairs.
Techniques for De-biasing Language Models. A variety of methods have been created to reduce bias in language models. For instance, Yuan et al. (2022) utilize self-knowledge distillation to implicitly discern multi-view feature sets, aiming to minimize language bias. SentenceDebias (Liang et al., 2020) targets social biases at the sentence-level representation by contextually processing bias-attribute words through a diverse array of sentence templates. By projecting new sentence representations onto a bias subspace and then subtracting the projection, the bias is reduced. More recently, FineDeb (Saravanan et al., 2023) was introduced, which employs task-specific fine-tuning on a model pre-trained on extensive text corpora. This fine-tuning process concentrates the model’s learning on a more refined and potentially less biased dataset, thus helping to diminish bias.
Human Alignment for Bias Mitigation. Alignment approaches are also utilized to mitigate biases in LLMs. While Instruction Fine-tuning Wei et al. (2022) trains the model to generate text sequences in a specific format, human alignment incorporates direct human feedback to shape the model’s behavior, utilizing optimization techniques like PPO Schulman et al. (2017) or DPO Rafailov et al. (2023). This alignment with human values and norms can effectively counteract biases in the model’s responses, creating a more responsible and representative system, as in the case of Llama 2-Chat Touvron et al. (2023b).
Conclusion and Discussion
In this paper, we presented OpinionGPT, a web demonstration that allows users to interact with an LLM that was trained on text of different biases. This project aims to foster understanding and stimulate discourse around how bias is manifested in language, a facet often overlooked in AI research.
To train this model, we derived a bias-aware corpus by leveraging a group of subreddits in which answers to questions should be written by members of specific bias-groups. Using this corpus, we fine-tuned a LLaMa model using a designated prompt. This allows us to request answers from the model for specific biases. Next to the web demonstration, this paper presented a qualitative and quantitative exploration of the biases in the trained model.
While we find that the model succeeds in giving nuanced and biased answers, we note that using Reddit as a data source injects a global layer of bias into all model responses: For instance, the responses by "Americans" should be better understood as "Americans that post on Reddit", or even "Americans that post on this particular subreddit". Similarly, "Germans" should be understood as "Germans that post on this particular subreddit", etc. Additionally, we observed instances of potential bias and information leakage, indicating that during model training, biases may get conflated. Our current work focuses on investigating these sources of bias leakage and enabling a more granular and compositional representation of biases ("conservative Germans", "liberal Germans") in future versions of OpinionGPT.
Ethics Statement
As developers of OpinionGPT, we understand and acknowledge the ethical implications that emerge from our work. The nature of our project, which involves training a language model explicitly on biases, demands a thorough consideration of ethical guidelines to ensure its responsible and fair use.
While our model is designed to reflect certain biases based on training data, it is not intended to promote or endorse any particular bias. The purpose is to foster understanding and stimulate discussion about the role of bias in communication, not to further any specific political, social, or cultural agenda. Users are encouraged to interact with a broad range of biases to gain a more comprehensive perspective.
We are also mindful of the potential for misuse of our models. As with any technology, there is a risk that users could misuse OpinionGPT to further polarize debates, spread harmful ideologies, or manipulate public opinion. We therefore made the decision not to publicly release our model. Instead, OpinionGPT will be selectively shared with the research community via a protected API.
We are committed to data privacy and protection. Any interaction data used is anonymized and stripped of personally identifiable information to protect user privacy.
References
Appendix A
The prompt predominantly contains the subreddit name. By grounding the prompt in the subreddit’s identity, we aim to ensure that the output aligns closely with the subreddit’s bias and does not fall back on knowledge acquired during pre-training. Our chosen prompt:
A.2 Screencast Video
A screencast video demonstrating OpinionGPT is available under: https://vimeo.com/852077847