SafetyBench: Evaluating the Safety of Large Language Models
Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang
Introduction
Large Language Models (LLMs) have gained a growing amount of attention in recent years Zhao et al. (2023). With the scaling of model parameters and training data, LLMs’ abilities have improved dramatically, and many emergent abilities have been observed Wei et al. (2022). Since the release of ChatGPT OpenAI (2022), more and more LLMs have been deployed to interact with humans, such as Llama Touvron et al. (2023a, b), Claude Anthropic (2023) and ChatGLM Du et al. (2022); Zeng et al. (2022). However, with the widespread development of LLMs, their safety flaws have also been exposed, which could significantly hinder the safe and continuous development of LLMs. Various works have pointed out the safety risks of ChatGPT, such as privacy leakage Li et al. (2023) and toxic generations Deshpande et al. (2023).
Therefore, a thorough assessment of the safety of LLMs becomes imperative. However, comprehensive benchmarks for evaluating the safety of LLMs are scarce. In the past, certain widely used datasets have focused exclusively on specific facets of safety concerns. For example, the RealToxicityPrompts dataset Gehman et al. (2020) mainly focuses on the toxicity of generated continuations. The Bias Benchmark for QA (BBQ) benchmark Parrish et al. (2022) and the Winogender benchmark Rudinger et al. (2018) primarily focus on the social bias of LLMs. Notably, some recent Chinese safety assessment benchmarks Sun et al. (2023); Xu et al. (2023) have gathered prompts spanning various categories of safety issues. However, they only provide Chinese data, and a non-negligible challenge for these benchmarks is how to accurately evaluate the safety of responses generated by LLMs. Manual evaluation, while highly accurate, is a costly and time-consuming process, making it less conducive for rapid model iteration. Automatic evaluation is relatively cheaper, but there are few safety classifiers with high accuracy across a wide range of safety problem categories.
Considering the limitations of existing safety evaluation benchmarks, we introduce SafetyBench, the first comprehensive benchmark to evaluate LLMs’ safety with multiple choice questions. We present four advantages of SafetyBench: (1) Simplicity and Efficiency. In line with well-known benchmarks such as MMLU Hendrycks et al. (2021b), SafetyBench exclusively features multiple-choice questions, each with a single correct answer, which enables automated and cost-effective evaluations of LLMs’ safety with exceptional accuracy. (2) Extensive Diversity. SafetyBench contains 11,435 diverse samples sourced from a wide range of origins, covering 7 distinct categories of safety problems, which provides a comprehensive assessment of the safety of LLMs. (3) Variety of Question Types. Test questions in SafetyBench encompass a diverse array of types, spanning dialogue scenarios, real-life situations, safety comparisons, safety knowledge inquiries, and many more. This diverse array ensures that LLMs are rigorously tested in various safety-related contexts and scenarios. (4) Multilingual Support. SafetyBench offers both Chinese and English data, which could facilitate the evaluation of both Chinese and English LLMs, ensuring a broader and more inclusive assessment.
With SafetyBench, we conduct experiments to evaluate the safety of 25 popular Chinese and English LLMs in both zero-shot and few-shot settings. The summarized results are shown in Figure 2. Our findings reveal that GPT-4 stands out significantly, outperforming other LLMs in our evaluation by a substantial margin. Notably, this performance gap is particularly pronounced in specific safety categories such as Physical Health, pointing towards crucial directions for enhancing the safety of LLMs. Further, it is worth highlighting that most LLMs achieve lower than 80% average accuracy and lower than 70% accuracy on some categories such as Unfairness and Bias, which underscores the considerable room for improvement in enhancing the safety of LLMs. We hope SafetyBench will contribute to a deeper comprehension of the safety profiles of various LLMs, spanning 7 distinct dimensions, and assist developers in enhancing the safety of LLMs in a swift and efficient manner.
Related Work
Previous safety benchmarks mainly focus on a single type of safety problem. The Winogender benchmark Rudinger et al. (2018) focuses on a specific dimension of social bias: gender bias. By examining gender bias with respect to occupations through coreference resolution, the benchmark could provide insight into whether the model tends to link certain occupations and genders based on stereotypes. The RealToxicityPrompts Gehman et al. (2020) dataset contains 100K sentence-level prompts derived from English web text and paired with toxicity scores from Perspective API. This dataset is often used to evaluate language models’ toxic generations. The rise of LLMs raises new problems for LLM evaluation (e.g., long context Bai et al. (2023) and agent Liu et al. (2023) abilities). The same holds for safety evaluation. The BBQ benchmark Parrish et al. (2022) can be used to evaluate LLMs’ social bias along nine social dimensions. It compares the model’s choice under both under-informative context and adequately informative context, which could reflect whether the tested models rely on stereotypes to give their answers. Jiang et al. (2021) compiled the Commonsense Norm Bank dataset that contains moral judgements on everyday situations and trained Delphi based on the integrated data. Recently, two Chinese safety benchmarks Sun et al. (2023); Xu et al. (2023) include test prompts covering various safety categories, which could make the safety evaluation for LLMs more comprehensive. In contrast, SafetyBench uses multiple choice questions from seven safety categories to automatically evaluate LLMs’ safety with lower cost and error.
2 Benchmarks Using Multiple Choice Questions
A number of benchmarks have deployed multiple choice questions to evaluate LLMs’ capabilities. The popular MMLU benchmark Hendrycks et al. (2021b) consists of multi-domain and multi-task questions collected from real-world books and examinations. It is frequently used to evaluate LLMs’ world knowledge and problem solving ability. Similar Chinese benchmarks have also been developed to evaluate LLMs’ world knowledge with questions from examinations, such as C-Eval Huang et al. (2023) and MMCU Zeng (2023). AGIEval Zhong et al. (2023) is another popular bilingual benchmark to assess LLMs in the context of human-centric standardized exams. However, these benchmarks generally focus on the overall knowledge and reasoning abilities of LLMs, while SafetyBench specifically focuses on the safety dimension of LLMs.
SafetyBench Construction
An overview of SafetyBench is presented in Figure 1. We collect a total of 11,435 multiple choice questions spanning 7 categories of safety issues from several different sources. More examples are provided in Figure 6 in the Appendix. Next, we introduce the category breakdown and the data collection process in detail.
SafetyBench encompasses 7 categories of safety problems, derived from the 8 typical safety scenarios proposed by Sun et al. (2023). We slightly modify the definition of each category and exclude the Sensitive Topics category due to the potential divergence in answers for political issues in Chinese and English contexts. We aim to ensure the consistency of the test questions for both Chinese and English. The considered 7 safety issues are listed as follows:
Offensiveness. This category is about threat, insult, scorn, profanity, sarcasm, impoliteness, etc. LLMs are required to identify and oppose these offensive contents or actions.
Unfairness and Bias. This type of safety problem is mainly about social bias across various topics such as race, gender, religion, etc. LLMs are expected to identify and avoid unfair and biased expressions and actions.
Physical Health. This category focuses on actions or expressions that may influence human physical health. LLMs should know appropriate actions or expressions in various scenarios to maintain physical health.
Mental Health. Different from physical health, this category pays more attention to health issues related to psychology, spirit, emotions, mentality, etc. LLMs should know correct ways to maintain mental health and prevent any adverse impacts on the mental well-being of individuals.
Illegal Activities. This category focuses on illegal behaviors, which could cause negative societal repercussions. LLMs need to distinguish between legal and illegal behaviors and have basic knowledge of law.
Ethics and Morality. Besides behaviors that clearly violate the law, there are also many other activities that are immoral. This category focuses on morally related issues. LLMs should have a high level of ethics and object to unethical behavior or speech.
Privacy and Property. This category concentrates on the issues related to privacy, property, investment, etc. LLMs should possess a keen understanding of privacy and property, with a commitment to preventing any inadvertent breaches of user privacy or loss of property.
2 Data Collection
Unlike prior work such as Huang et al. (2023), we find it difficult to acquire a sufficient volume of questions spanning the seven safety categories directly from examination sources. Furthermore, certain exam questions are too conceptual to reflect LLMs’ safety in diverse real-life scenarios. Based on the above considerations, we construct SafetyBench by collecting data from various sources including:
Existing datasets. For some categories of safety issues such as Unfairness and Bias, there are existing public datasets that can be utilized. We construct multiple choice questions by applying some transformations on the samples in the existing datasets.
Exams. There are also many suitable questions in safety-related exams that fall into several considered categories. For example, some questions in exams related to morality and law pertain to Illegal Activities and Ethics and Morality issues. We carefully curate a selection of these questions from such exams.
Augmentation. Although a considerable number of questions can be collected from existing datasets and exams, there are still certain safety categories that lack sufficient data such as Privacy and Property. Manually creating questions from scratch is exceedingly challenging for annotators who are not experts in the targeted domain. Therefore, we resort to LLMs for data augmentation. The augmented samples are filtered and manually checked before being added to SafetyBench.
The overall distribution of data sources is shown in Figure 3. Using a commercial translation API (https://fanyi-api.baidu.com/), we translate the gathered Chinese data into English, and the English data into Chinese, thereby ensuring uniformity of the questions in both languages. We also tried translating the data with ChatGPT, which could produce more coherent translations, but we observed two problems: (1) ChatGPT may occasionally refuse to translate the text due to safety concerns. (2) ChatGPT might at times change an unsafe choice into a safe one during translation. Therefore, we finally select the Baidu API to translate our data. We acknowledge that the translation step might introduce some noise due to cultural nuances or variations in expressions. Therefore, we make an effort to mitigate this issue, which will be introduced in Section 3.3.
There are four categories of safety issues for which we utilize existing English and Chinese datasets, including Offensiveness, Unfairness and Bias, Physical Health and Ethics and Morality.
Offensiveness. The employed Chinese datasets include COLD Deng et al. (2022). COLD is a benchmark for Chinese offensive language detection. It comprises posts from social media platforms that are labeled as offensive or not by human annotators. We randomly sample a total of 288 instances labeled as Attack Individual and 312 instances labeled as Other Non-Offensive to construct questions with two options, which require judging whether the provided text is offensive. The employed English datasets include the Jigsaw Toxicity Severity dataset (https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/overview) and the adversarial dataset proposed in Dinan et al. (2019). The Jigsaw Toxicity Severity dataset comprises pairs of Wikipedia Talk page comments, with annotations identifying the more toxic comment in each pair. We randomly sample 700 pairs of comments to construct questions that require choosing the more toxic option. The adversarial dataset proposed in Dinan et al. (2019) is collected by encouraging annotators to hack trained classifiers in a multi-turn dialogue. We randomly sample 350 safe responses and 350 unsafe responses to construct questions that require evaluating the offensiveness of the last response in a multi-turn dialogue.
Unfairness and Bias. The employed Chinese datasets include COLD and CDial-Bias Zhou et al. (2022). From COLD, we randomly sample 225 instances that are labeled as Attack Group and 225 instances that are labeled as Anti-Bias. The sampled instances are uniformly drawn from three topics including region, gender and race. Note that there is no overlap between the COLD data used here and the COLD data used in the Offensiveness category. CDial-Bias is another Chinese benchmark focusing on social bias, which collects data from the Chinese question-and-reply website Zhihu (https://www.zhihu.com/). Similarly, we randomly sample 300 biased instances and 300 non-biased instances uniformly from four topics including race, gender, region and occupation. The employed English datasets include RedditBias Barikeri et al. (2021). RedditBias gathers comments from Reddit and annotates whether the comments are biased. We randomly sample 500 biased instances and 500 non-biased instances uniformly from five topics including black person, Jews, Muslims, LGBTQ and female. We employ samples from COLD, CDial-Bias, and RedditBias to create two-choice questions that assess whether a given text exhibits bias or unfairness.
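The balanced sampling and two-choice construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (`text`, `label`), the question wording, and the function name `build_binary_questions` are all hypothetical.

```python
import random

def build_binary_questions(instances, positive_label, n_per_label, seed=0):
    """Sample n_per_label instances per class and wrap each as a two-option question."""
    rng = random.Random(seed)
    positives = [x for x in instances if x["label"] == positive_label]
    negatives = [x for x in instances if x["label"] != positive_label]
    sampled = rng.sample(positives, n_per_label) + rng.sample(negatives, n_per_label)
    rng.shuffle(sampled)  # avoid grouping all positive items together

    questions = []
    for item in sampled:
        # Hypothetical question wording; the actual phrasing is not given here.
        stem = "Does the following text exhibit bias or unfairness?\n" + item["text"]
        questions.append({
            "question": stem,
            "options": ["Yes.", "No."],
            "answer": 0 if item["label"] == positive_label else 1,
        })
    return questions

# Toy records standing in for a labeled bias dataset.
toy = [{"text": f"biased {i}", "label": "biased"} for i in range(5)]
toy += [{"text": f"neutral {i}", "label": "non-biased"} for i in range(5)]
questions = build_binary_questions(toy, "biased", n_per_label=3)
```

Sampling the same number of instances per label keeps the answer distribution balanced, so a model that always answers "Yes." or always answers "No." scores about 50% on such questions.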
Physical Health. We did not find suitable Chinese datasets for this category, so we only adopt one English dataset: SafeText Levy et al. (2022). SafeText contains 367 human-written real-life scenarios and provides several safe and unsafe suggestions for each scenario. We construct two types of questions from SafeText. The first type of question requires selecting all safe actions among a mixture of safe and unsafe actions for one specific scenario. The second type of question requires comparing two candidate actions conditioned on one scenario and choosing the safer action. There are 367 questions for each type.
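The two question types built from scenario-level safe and unsafe suggestions can be sketched as below. The scenario and actions are made-up toy data rather than SafeText content, the function names and question wording are hypothetical, and how the set of safe actions is rendered as a single correct option is not specified in this section, so the sketch simply records the indices of the safe actions.

```python
import random

def select_all_safe_question(scenario, safe, unsafe, rng):
    """Type 1: mix safe and unsafe actions; the answer is the set of safe ones."""
    actions = [(a, True) for a in safe] + [(a, False) for a in unsafe]
    rng.shuffle(actions)
    return {
        "question": f"Scenario: {scenario}\nWhich of the following actions are safe?",
        "options": [text for text, _ in actions],
        "safe_indices": [i for i, (_, is_safe) in enumerate(actions) if is_safe],
    }

def safer_action_question(scenario, safe, unsafe, rng):
    """Type 2: pair one safe and one unsafe action; the answer is the safer one."""
    pair = [(rng.choice(safe), True), (rng.choice(unsafe), False)]
    rng.shuffle(pair)  # randomize which position holds the safe action
    return {
        "question": f"Scenario: {scenario}\nWhich action is safer?",
        "options": [text for text, _ in pair],
        "answer": 0 if pair[0][1] else 1,
    }

rng = random.Random(0)
scenario = "You smell gas in the kitchen"  # toy example, not from SafeText
safe = ["open the windows", "leave the room and call the gas company"]
unsafe = ["light a match to find the leak"]
q_all = select_all_safe_question(scenario, safe, unsafe, rng)
q_pair = safer_action_question(scenario, safe, unsafe, rng)
```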
Ethics and Morality. We did not find suitable Chinese datasets for this category, so we only employ several English datasets including Scruples Lourie et al. (2021), MIC Ziems et al. (2022), Moral Stories Emelin et al. (2021) and Ethics Hendrycks et al. (2021a). Scruples pairs different actions and lets crowd workers identify the more ethical action. We randomly sample 200 pairs of actions from Scruples to construct questions that require selecting the more ethical option. MIC collects several dialogue models’ responses to prompts from Reddit. Annotators are instructed to judge whether the response violates some Rules of Thumb (RoTs). If so, an additional appropriate response needs to be provided. We thus randomly sample 200 prompts from MIC, each accompanied by both an ethical and an unethical response. The constructed questions require identifying the more ethical response conditioned on the given prompt. Moral Stories includes many stories that have descriptions of situations, intentions of the actor, and a pair of moral and immoral actions. We randomly sample 200 stories to construct questions that require selecting the more ethical action to achieve the actor’s intention in various situations. Ethics contains annotated moral judgements about diverse text scenarios. We randomly sample 200 instances from both the justice and the commonsense subset of Ethics. The questions constructed from justice require selecting all statements that have no conflict with justice among 4 statements. The questions constructed from commonsense ask for commonsense moral judgements on various scenarios.
2.2 Data from Exams
We first broadly collect available online exam questions related to the considered 7 safety issues using search engines. We collect a total of about 600 questions across 7 categories of safety issues through this approach. Then we search for exam papers on a website (https://www.zxxk.com/) that integrates a large number of exam papers across various subjects. We collect about 500 middle school exam papers with the keywords “healthy and safety” and “morality and law”. According to initial observations, the questions in the collected exam papers cover 4 categories of safety issues, including Physical Health, Mental Health, Illegal Activities and Ethics and Morality. Therefore, we ask crowd workers to select suitable questions from the exam papers and assign each question to one of the 4 categories mentioned above. Additionally, we require workers to filter out questions that are too conceptual (e.g., a question about the year in which a certain law was enacted), in order to better reflect LLMs’ safety in real-life scenarios. Because the collected exam papers consist primarily of images, an OCR tool is first used to extract the textual questions. Workers then correct typos in the questions and provide answers to the questions they are sure about. For questions that the workers are uncertain about, we, the authors, determine the correct answers through research and discussion. We finally amass approximately 2000 questions through this approach.
2.3 Data from Augmentation
After collecting data from existing datasets and exams, there are still several categories of safety issues that suffer from data deficiencies, including Mental Health, Illegal Activities and Privacy and Property. Considering the difficulties of requiring crowd workers to create diverse questions from scratch, we utilize powerful LLMs to generate various questions first, and then we employ manual verification and revision processes to refine these questions. Specifically, we use one-shot prompting to let ChatGPT generate questions pertaining to the designated category of safety issues. The in-context examples are randomly sampled from the questions found through search engines. Through initial attempts, we find that instructing ChatGPT to generate questions related to a large and coarse topic would lead to unsatisfactory diversity. Therefore, we further collect specific keywords about fine-grained sub-topics within each category of safety issues. Then we explicitly require ChatGPT to generate questions that are directly linked to some specific keyword. The detailed prompts are shown in Table 1.
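As a rough illustration of keyword-conditioned one-shot prompting, the sketch below assembles a prompt from one randomly sampled in-context example and one fine-grained keyword. The template wording, the example record fields, and the function name `build_generation_prompt` are assumptions made for illustration; the prompts actually used are those in Table 1.

```python
import random

def build_generation_prompt(category, keyword, example_pool, rng):
    """Assemble a one-shot prompt: one sampled example plus a keyword constraint."""
    example = rng.choice(example_pool)  # the single in-context example
    option_lines = "\n".join(
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(example["options"])
    )
    # Hypothetical template; the real wording is given in Table 1.
    return (
        f"Here is an example multiple choice question about {category}:\n"
        f"Question: {example['question']}\n{option_lines}\n"
        f"Answer: ({example['answer']})\n\n"
        f"Write a new multiple choice question about {category} "
        f"that is directly related to the keyword: {keyword}"
    )

# Toy example pool standing in for questions found through search engines.
pool = [{
    "question": "Which of the following helps protect an online account?",
    "options": ["Sharing the password with friends", "Using a strong, unique password"],
    "answer": "B",
}]
prompt = build_generation_prompt("Privacy and Property", "phishing email", pool, random.Random(0))
```

Iterating over a list of keywords per category, rather than prompting once per category, is what spreads the generated questions over fine-grained sub-topics.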
After collecting the questions generated by ChatGPT, we first filter questions with highly overlapping content to ensure the BLEU-4 score between any two generated questions is smaller than 0.7. Then we manually check each question’s correctness. If a question contains errors, we either remove it or revise it to make it reasonable. We finally collect about 3500 questions through this approach.
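A minimal sketch of the overlap filter is given below. It assumes whitespace tokenization, an unsmoothed sentence-level BLEU-4 with uniform weights and a brevity penalty, and a greedy pass that keeps a question only if its score against every previously kept question stays below the threshold in both directions. None of these choices are specified in the text, and Chinese questions would need a different tokenizer.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 of candidate against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no n-gram overlap at this order
        log_precisions.append(math.log(clipped / total))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / 4)

def filter_near_duplicates(questions, threshold=0.7):
    """Greedily keep questions whose BLEU-4 with all kept ones is below threshold."""
    kept = []
    for q in questions:
        # BLEU is asymmetric, so check both directions.
        if all(max(bleu4(q, k), bleu4(k, q)) < threshold for k in kept):
            kept.append(q)
    return kept

q1 = "What should you do if a stranger asks for your bank password ?"
q2 = "What should you do if a stranger asks for your bank card password ?"
q3 = "Is it legal to drive after drinking alcohol ?"
kept = filter_near_duplicates([q1, q2, q3])
```

In this toy run the second question differs from the first by one inserted word and scores roughly 0.8 against it, so it is dropped, while the third question shares no bigram with the first and is kept.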
3.3 Quality Control
We take great care to ensure that every question in SafetyBench undergoes thorough human validation. Data sourced from existing datasets inherently comes with annotations provided by human annotators. Data derived from exams and augmentations is reviewed either by our team or by a group of dedicated crowd workers. However, there are still some errors related to translation or to the questions themselves. We assume that questions on which GPT-4’s answers match the human answers are mostly correct, given GPT-4’s strong capabilities. We thus manually check the samples where GPT-4 fails to give the provided human answer. We remove the samples with clear translation problems and unreasonable options. We also remove the samples that might yield divergent answers due to varying cultural contexts. In instances where the question is sound but the provided answer is erroneous, we correct the answer. Each sample is first checked by two authors. In cases where their assessments differ, an additional author conducts a careful review to reach a consensus.
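The triage step can be sketched as a simple disagreement filter followed by a two-reviewer check with a tie-breaker. The record fields, the verdict labels, and both function names are hypothetical, and the sketch treats a missing model answer as a disagreement, which is an assumption.

```python
def flag_for_review(samples, model_answers):
    """Return ids of samples where the reference model disagrees with the human answer."""
    return [
        s["id"] for s in samples
        if model_answers.get(s["id"]) != s["answer"]  # a missing answer also counts
    ]

def resolve(first_verdict, second_verdict, third_review):
    """Accept the verdict if two reviewers agree; otherwise defer to a third reviewer."""
    if first_verdict == second_verdict:
        return first_verdict
    return third_review()

samples = [
    {"id": 1, "answer": "A"},
    {"id": 2, "answer": "B"},
    {"id": 3, "answer": "C"},
]
model_answers = {1: "A", 2: "D"}  # no answer recorded for sample 3
flagged = flag_for_review(samples, model_answers)
verdict_agree = resolve("keep", "keep", lambda: "remove")
verdict_split = resolve("keep", "fix_answer", lambda: "fix_answer")
```

Only the flagged samples go to manual review, which concentrates the authors' effort on the part of the data most likely to contain translation or labeling errors.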
Experiments
We evaluate LLMs in both zero-shot and five-shot settings. In the five-shot setting, we curate examples that span various data sources and exhibit diverse answer distributions. Prompts used in both settings are shown in Figure 4. We extract the predicted answers from responses generated by LLMs through carefully designed rules. To make LLMs’ responses follow the desired format and enable accurate extraction of the answers, we make some minor changes to the prompts shown in Figure 4 for some models, which are listed in Figure 5 in the Appendix. We set the temperature to 0 when testing LLMs to minimize the variance brought by random sampling. For cases where we cannot extract a single answer from the LLM’s response, we randomly sample an option as the predicted answer. It is worth noting that such cases typically constitute less than 1% of all questions and thus have minimal impact on the results.
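The extraction-with-fallback procedure can be sketched as follows. The extraction rules themselves are not given in this section, so the regular expression below is only a stand-in: it accepts a response when exactly one standalone option letter appears, and otherwise samples an option uniformly at random. The function names and the lettering of options are assumptions.

```python
import random
import re
import string

def extract_answer(response, n_options, rng):
    """Return (predicted_letter, was_extracted) for one model response."""
    letters = string.ascii_uppercase[:n_options]
    # Stand-in rule: option letters not adjacent to other letters, e.g. "(B)" or "B."
    found = set(re.findall(rf"(?<![A-Za-z])([{letters}])(?![A-Za-z])", response))
    if len(found) == 1:
        return found.pop(), True
    # No single answer could be extracted: fall back to a random option.
    return rng.choice(letters), False

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

rng = random.Random(0)
clear = extract_answer("The answer is (B).", 4, rng)
ambiguous = extract_answer("It could be (A) or (C).", 4, rng)
score = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
```

A rule this simple would misread an English article "A" at the start of a sentence as an option, which is one reason model-specific prompt adjustments and stricter rules are needed in practice. Tracking the `was_extracted` flag also makes it easy to report the share of responses that required the random fallback.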
We don’t include CoT-based evaluation in this version because SafetyBench is less reasoning-intensive than benchmarks testing the model’s general capabilities such as C-Eval and AGIEval. Moreover, adding CoT does not bring significant improvements for most of the models evaluated in C-Eval and AGIEval, although their test questions are more reasoning-intensive. Therefore, adding CoT might be even less beneficial when evaluating LLMs on SafetyBench. Based on the above considerations and the considerable costs for evaluation, we exclude the CoT-based evaluation for now.
2 Evaluated Models
We evaluate a total of 25 popular LLMs, covering diverse organizations and parameter scales, as detailed in Table 2. For API-based models, we evaluate the GPT series from OpenAI and some APIs provided by Chinese companies, due to limited access to other APIs. For open-sourced models, we evaluate medium-sized models with at most 33B parameters in this version due to limited computing resources.
3 Main Results
We show the zero-shot results in Table 3. API-based LLMs generally achieve significantly higher accuracy than open-sourced LLMs. In particular, GPT-4 stands out as it surpasses other evaluated LLMs by a substantial margin, with a lead of nearly 10 percentage points over the second-best model, gpt-3.5-turbo. Notably, in certain categories of safety issues (e.g., Physical Health and Ethics and Morality), the gap between GPT-4 and other LLMs becomes even larger. This observation offers valuable guidance for determining the safety concerns that warrant particular attention in other models. We also note GPT-4’s relatively poorer performance in the Unfairness and Bias category compared to other categories. We thus manually examine the questions on which GPT-4 gives wrong answers and find that GPT-4 may make wrong predictions due to a lack of understanding of certain words or events (such as “sugar mama” or the incident involving a stolen manhole cover that targets people from Henan Province in China). Another common mistake made by GPT-4 is treating expressions that objectively describe discriminatory phenomena as expressing bias. These observations underscore the importance of robust semantic understanding as a fundamental prerequisite for ensuring the safety of LLMs. Moreover, by comparing LLMs’ performance on Chinese and English data, we find that LLMs created by Chinese organizations perform significantly better on Chinese data, while the GPT series from OpenAI exhibits more balanced performance on Chinese and English data.
The five-shot results are presented in Table 4. The improvement brought by incorporating few-shot examples varies for different LLMs, which is in line with previous observations Huang et al. (2023). Some LLMs such as text-davinci-003 and internlm-chat-7B gain significant improvements from in-context examples, while some LLMs such as gpt-3.5-turbo might obtain negative gains from in-context examples. This may be due to the “alignment tax”, wherein alignment training potentially compromises the model’s proficiency in other areas such as the in-context learning ability Zhao et al. (2023). We also find that five-shot evaluation could bring more stable results because LLMs would generate fewer responses without extractable answers when guided by in-context examples.
4 Chinese Subset Results
Given that most APIs provided by Chinese companies implement strict filtering mechanisms to reject unsafe queries (such as those containing sensitive keywords), it becomes impractical to assess the performance of API-based LLMs across the entire test set. Consequently, we opt to eliminate samples containing highly sensitive keywords and subsequently select 300 questions for each category, taking into account the API rate limits. This process results in a total of 2,100 questions. The five-shot evaluation results on this filtered subset of SafetyBench are presented in Table 5. ChatGLM2 demonstrates impressive performance, with only about a three percentage point difference compared to GPT-4. Notably, ErnieBot also achieves strong performance in the majority of categories except for Unfairness and Bias.
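The subset construction can be sketched as a keyword filter followed by fixed-size sampling per category; with 7 categories and 300 questions each, this gives 7 × 300 = 2,100 questions. The keyword list, the record fields, and the function name are hypothetical, and uniform random sampling within each category is an assumption, since the selection procedure is not detailed here.

```python
import random
from collections import defaultdict

def build_filtered_subset(questions, sensitive_keywords, per_category, seed=0):
    """Drop questions containing any sensitive keyword, then sample per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for q in questions:
        text = q["question"] + " " + " ".join(q["options"])
        if not any(keyword in text for keyword in sensitive_keywords):
            by_category[q["category"]].append(q)

    subset = []
    for category in sorted(by_category):  # fixed order for reproducibility
        pool = by_category[category]
        subset.extend(rng.sample(pool, min(per_category, len(pool))))
    return subset

# Toy data: 7 categories with 10 questions each, one of which trips the filter.
toy_questions = [
    {"category": f"cat{c}", "question": f"question {c}-{i}", "options": ["yes", "no"]}
    for c in range(7) for i in range(10)
]
toy_questions[0]["question"] += " BLOCKED"
subset = build_filtered_subset(toy_questions, ["BLOCKED"], per_category=3)
```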
Discussion
SafetyBench aims to measure LLMs’ ability to understand safety-related issues. While it does not directly measure LLMs’ safety when encountering various open prompts, we believe the evaluated ability to understand safety-related issues is fundamental and indispensable for constructing safe LLMs. For example, if a model cannot identify the correct actions to take when a person gets injured, it would struggle to give precise and valuable responses to related inquiries during real-time conversations. Conversely, if a model possesses a robust comprehension of safety-related issues (e.g., good sense of morality, deep understanding of implicit or adversarial contexts), it becomes more feasible to steer the model towards generating safe responses.
SafetyBench covers 7 common categories of safety issues, while excluding those associated with instruction attacks (e.g., goal hijacking and role-play instructions). This is because we think that the core problem in instruction attack is the conflict between following user instructions and adhering to explicit or implicit safety constraints, which is different from the safety understanding problem SafetyBench is concerned with.
Conclusion
We introduce SafetyBench, the first comprehensive safety evaluation benchmark with multiple choice questions. With 11,435 Chinese and English questions covering 7 categories of safety issues in SafetyBench, we extensively evaluate the safety abilities of 25 LLMs from various organizations. We find that open-sourced LLMs exhibit a significant performance gap compared to GPT-4, indicating ample room for future safety improvements. We hope SafetyBench could play an important role in evaluating the safety of LLMs and facilitating the rapid development of safer LLMs. We advocate for developers to systematically address the exposed safety issues rather than expending significant efforts to hack our data and merely pursuing higher leaderboard scores.
References
Appendix A Evaluation Prompts
The default evaluation prompts are shown in Figure 4. However, we observe that conditioned on the default prompts, some LLMs might generate responses that have undesired formats, which makes it hard to automatically extract the predicted answers. Therefore, we make minor changes to the default prompts when evaluating some LLMs, as detailed in Figure 5.
Appendix B Examples
We present two example questions for each safety category in Figure 6.