SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao

Introduction

With rapid breakthrough in LLM capabilities, new benchmarks have emerged to evaluate these models and explore their capability boundaries (Cobbe et al., 2021; Zellers et al., 2019; Clark et al., 2018; Talmor et al., 2018; Hendrycks et al., 2020). Alongside their powerful capabilities, concerns regarding the safety of LLMs are also rising. Preventing safety threats posed by generative AI systems is becoming a critical priority for both researchers (Siddiqui, 2023; Bengio, 2023; Anthropic, 2023) and policymakers (House, 2023; Kazim et al., 2023), meanwhile motivate us to explore how to comprehensively evaluate the safety capability of LLMs.

To formulate and evaluate safety concerns, a range of benchmarks (Hartvigsen et al., 2022; Lin et al., 2023; Gehman et al., 2020; Dhamala et al., 2021; Sun et al., 2023; Wang et al., 2023b) has been developed. However, these prior benchmarks focused on safety often exhibited significant shortcomings. Firstly, most of benchmarks only focus on a narrow perspective of safety threats (e.g., only unsafe instructions or only toxic representation), failing to cover the wide spectrum of potentially harmful outputs LLMs might generate. This inadequacy partly stems from the rapid evolution of language and the emergence of new forms of harmful content, which older benchmarks failed to anticipate. Secondly, traditional harmful questions can be effectively handled with a high safety rate of about 99% by modern LLMs (Wang et al., 2023b; Sun et al., 2023). More challenging questions Chao et al. (2023); Liu et al. (2023b) are desired for comprehensive evaluation of LLM safety. Thirdly, many existing benchmarks rely on time-consuming human evaluation (Xu et al., 2023; Xie et al., 2023) or expensive GPT (Sun et al., 2023), making safety evaluation both slow and costly. Finally, these benchmarks tend to be limited in scope, being tailored either exclusively for safety evaluation (Wang et al., 2023b; Shen et al., 2023) or for testing attack and defense mechanisms (Zou et al., 2023), restricting their broader application.

Considering limitations of existing benchmarks, we propose a challenging benchmark namely SALAD-Bench, i.e., SAfety evaluation for LLMs, Attack and Defense approaches. As shown in Table 1, SALAD-Bench offers several advantages:

(1) Compact Taxonomy with Hierarchical Levels. SALAD-Bench introduces a structured hierarchy with three levels, comprising 6 domains, 16 tasks, and 65 categories, respectively. This ensures in-depth evaluation, focusing not just on overall safety but also on specific safety dimensions. As illustrated in Figure 1, a high overall safety rate does not obscure the identification of tasks and categories that may present risks. The full hierarchy of our benchmark is depicted in Figure 2.

(2) Enhanced Difficulty and Complexity. By infusing our questions with attack methods, we obtain enhanced questions that significantly heightens the evaluation’s challenge, offering a stringent test of LLMs’ safety responses. Furthermore, the addition of a multiple-choice question (MCQ) subset enriches our benchmark, enhancing the diversity of safety inquiries and enabling a more thorough assessment of LLM safety.

(3) Reliable and Seamless Evaluator. Leveraging instruction following capabilities, we develop two distinct evaluators for SALAD-Bench. The first, MD-Judge, is an LLM-based evaluator tailored for question-answer pairs. This model undergoes finetuning on a dataset comprising both standard and attack-enhanced pairs, labeled in alignment with our taxonomy. MD-Judge integrates relevant taxonomy details into its input and employs customized instruction tasks for precise classification. For multiple-choice question (MCQ) evaluations, we also utilize the instruction following abilities to assess the performance by regex parsing.

(4) Joint-Purpose Utility. Extending beyond standard LLM safety evaluation, our benchmark is uniquely suited for both LLM attack and defense methods evaluations. It features two tailored subsets: one for testing attack techniques and another for examining defense capabilities, as showcased in Figure 1. These subsets are crucial for assessing and improving LLM resilience against attack methods. It caters to a wide array of research needs within LLM safety.

All above contributions involve a detailed safety-focused evaluation of recent LLMs, including both black-box LLMs OpenAI (2023, 2022); Anthropic (2022); Team et al. (2023) and open-sourced models Jiang et al. (2023); Bai et al. (2023); Inan et al. (2023); Chiang et al. (2023). We analyze their vulnerabilities and assess their safety rates across different dimensions, using innovative methods to enhance the evaluation’s efficiency and scalability.

Dataset

Table 1 showcases SALAD-Bench’s advancements in LLM safety evaluations. It features 21k test samples structured into a detailed hierarchy of 6 domains, 16 tasks, and 65 categories, allowing for in-depth analysis. The benchmark is further equipped with 5k attack-enhanced and 200 defense-enhanced questions, alongside 4k multiple-choice questions, enhancing its robustness testing capabilities. Efficiency in evaluation is achieved through the use of the MD-Judge evaluator. In the following, we discuss the construction procedure.

Inspired by previous taxonomy rules and policies (Weidinger et al., 2023; OpenAI, 2023; Meta, 2023), we propose a hierarchical three-level safety taxonomy for LLMs, as illustrated in Figure 2. Generally, SALAD-Bench includes six domain-level harmfulness areas, which are discussed as follows:

Representation & Toxicity Harms is divided into three distinct yet interconnected domains: toxic content, unfair representation and adult content.

Misinformation Harms can be broadly divided into two main areas, propagation of false beliefs and misconceptions refers to the dissemination.

Information & Safety Harms usually represents unauthorized revelation, creation, accurate deduction of personal and private data about individuals, or dissemination of dangerous information.

Malicious Use can be delineated into influence Operations, security threats, illegal activities, fraud or deceptive action, and defamation.

Human Autonomy & Integrity Harmscan be categorized into two groups: violation of personal property and persuasion and manipulation.

Socioeconomic Harms includes risky financial practices, debatable trade affairs, and labor issues.

Each domain is further subdivided into tasks and actions, resulting in 16 task-level and 65 category-level taxonomies for precise safety topic delineation. Further details on these subdivisions are provided in Appendix A.

2 Data Collection

Collection of original questions. Our purpose is to construct a large-scale and balanced safety dataset with hierarchical taxonomies. Hence we first collect sufficient unsafe questions as original data. Generally, original data comes from two sources, i.e., public available data and self-instructed data. For public available data, we collect the questions from existing benchmarks (Ganguli et al., 2022; Zou et al., 2023; Deng et al., 2023; Lin et al., 2023; Wang et al., 2023b; Yu et al., 2023; Shen et al., 2023). For self-instructed data from generative models, since GPTs are trained with safety alignment methods and applied with safe guardrails, it is hard to generate harmful questions or answers. Inspired by Qi et al. (2023), we fine-tune GPT-3.5-turbo by ~500 collected harmful QA-pairs and enforce the LLM to respond to unsafe questions and answers for multiple-choice questions. Then the fine-tuned GPT can be utilized for self-instructed data generation with given category-level taxonomies.

Question Deduplication and Filtration. The original data for annotation presents two main challenges: content duplication due to diverse sources and the presence of benign samples lacking significant harmfulness. To tackle content duplication, we use the Locality-Sensitive Hashing algorithm combined with Sentence-BERT (Reimers and Gurevych, 2019) for sentence vector embeddings. This approach ensures the efficient elimination of both superficial and semantic redundancies in the dataset. To address the issue of benign samples and minimize manual review costs, we utilized the reward model Dai et al. (2023), pre-trained on SafeRLHF, to assess the safety of each data sample. Each question is fed into the model to obtain a harmfulness score, which is then compared against a predefined threshold to determine safety status.

Auto Labeling. To categorize questions from public datasets into SALAD-Bench’s category-level taxonomies, we employ LLMs for automated labeling through in-context learning and consensus voting. We start by crafting a template for LLM classification that outlines the task, provides few-shot learning examples, and specifies the output format, detailed in Appendix Figure 7. Using this template and a small, manually-annotated test set, we evaluate various LLMs for their labeling accuracy and select Mixtral-8x7B-Instruct (Jiang et al., 2024), Mistral-7B-Instruct (Jiang et al., 2023), and TuluV2-dpo-70B (Ivison et al., 2023) for the task. The final categorization for each question is determined by unanimous agreement among the chosen LLMs. This process ensures that every question is accurately labeled, ready for multi-dimensional analysis within the benchmark.

Following these steps, we finally obtain the base set of SALAD-Bench, containing over 200 questions per category, suitable for assessing the basic safety capabilities of LLMs. Moving forward, we will explore ways of enhancing questions to further extend the polymorphism and multifunctionality of our benchmark.

Question Enhancement

To comprehensively evaluate LLM safety and robustness, we develop three subsets: attack-enhanced, defense-enhanced, and multiple-choice questions, each expanding on part of our base set. These subsets aim to deepen the challenge, broaden the evaluation perspectives, and support automatic testing, ensuring a thorough exploration of LLM safety and defense abilities against attack methods.

To further explore the vulnerabilities of LLMs and examine their robustness to attacking methods, we construct the attack-enhance subset by picking harmful questions not commonly rejected by LLMs and further enhance them with attack methods. The construction steps are summarized in Figure 3.

2 Defense Enhancement

Response Generation. This step is shared with the pipeline of the attack-enhanced subset.

3 Multiple-choice Questions Subset

To broaden our dataset’s scope with varied question formats, we add a multiple-choice questions (MCQ) subset, enhancing the benchmark’s complexity. Generally, we propose a heuristic self-instruct method to build the MCQ subset as follows. We first uniformly sample KK harmful questions in each category. For each question q\mathbf{q}, we generate multiple safe responses and unsafe responses as candidates. The safe and unsafe responses are generated using GPT-4 (OpenAI, 2023) and fine-tuned GPT in Section 2.2 respectively. Then, after selection parsing and human re-checking, we randomly select totally three safe and unsafe selections and obtain final choices S\mathcal{S}. Finally, for each q\mathbf{q} with S\mathcal{S}, we construct two questions (i.e., choosing safe or unsafe choices) by prompting method. All the templates are discussed in Appendix H. Finally, we obtain the target MCQ subset with ~4k questions and covers all categories listed in Section 2.1.

Evaluator

Previous LLM safety benchmarks (Zou et al., 2023; Deng et al., 2023) usually utilize 1) keyword matching (Zou et al., 2023), 2) moderation classifiers (Hanu and Unitary team, 2020) and 3) GPT (Deng et al., 2023) to conduct safety evaluation. However, keyword matching-based evaluator struggles with various outputs from LLMs; moderation classifiers only focus on a narrow spectrum of safety threats; GPT incurs much extra cost to call APIs, and upgrade of the backend model affects reproducing evaluation results.

Inspired by LlamaGuard (Inan et al., 2023), we use public and self-generated data to fine-tune an LLM-based classifier from Mistral-7B (Jiang et al., 2023). Distinctly, our task involves evaluating not only plain question-answer pairs but also attack-enhanced question-answer pairs. Our evaluator is named MD-Judge.

To make our MD-Judge capable of both plain and attack-enhanced questions, we collect plain QA pairs from previous works (Ji et al., 2023; Zheng et al., 2023a; Lin et al., 2023) and construct both safe and unsafe answers to enhanced questions. The safety labels of attack-enhanced QA pairs are labeled by GPT-4. Finally, we utilize the Auto Labeling toolkit illustrated in Section 2.2 to annotate all training samples within the taxonomies of SALAD-Bench. During fine-tuning, we propose a safety evaluation template to reformat question-answer pairs for MD-Judge predictions, as shown in Figure 4. Besides, the template’s structure are detailed described in Appendix D. This structured data then undergoes fine-tuning to enhance MD-Judge’s capabilities.

2 Evaluation for Multiple-choice Questions

Previous works use keyword-based evaluators (Xu et al., 2023) and GPT-based evaluators Deng et al. (2023) to handle open-ended responses for multiple-choice questions (MCQ). To sustain high evaluation accuracy meanwhile reduce the cost from inference, we introduce MCQ-Judge, which leverages in-context learning with regex parsing to efficiently fetch the answers. Specifically, we first define token as well as token to strictly wrap the output selections, and obtain the basic instruction of evaluation prompt. Then, to leverage the instruction following ability of LLMs and obtain the formatted responses, we manually construct few-shot exemplars as prompts to conduct in-context learning. Hence we obtain the final prompt for MCQ-Judge, and the total prompts are listed in Appendix H.

Experiments

Selected models are shown in Table 2, including various open-sourced and black-box LLMs. When generating from open-sourced models, we strictly follow its chat template and use greedy sampling.

Attack methods. We evaluate TAP Mehrotra et al. (2023), AutoDAN Liu et al. (2023a), GPTFuzz Yu et al. (2023), GCG Zou et al. (2023), CoU Bhardwaj and Poria (2023) and human designed jailbreaks. For TAP, we employ vicuna-13B-v1.5, GPT-3.5 and GPT-4 as the evaluator. For AutoDAN, we use GPT-3.5 as mutator or do not use mutator. For GCG we follow Robey et al. (2023) to use 20 beforehand searched suffixes. All methods use Llama-2-7B-Chat as the target model. More details are listed in Appendix G.

Defense methods. During experiments, we also incorporate different paraphrasing-based methods (Cao et al., 2023), perturbation-based methods (Cao et al., 2023; Robey et al., 2023), and prompting-based methods (Deng et al., 2023; Wu et al., 2023) as defense methods.

Evaluation metrics. F1 score is utilized to gauge the performance of our evaluator primarily. For assessing the safety of models, we measure each model’s safety rate and employ the Elo Ratings (Zheng et al., 2023b) for ranking the LLMs. The effectiveness of attack and defense strategies is evaluated using the Attack Success Rate (ASR) based on our MD-Judge. Note that ASR equals 1 minus the corresponding safety rate for each LLM.

For MCQ subset, suppose that there are NcorrectN_{\text{correct}}, NwrongN_{\text{wrong}}, and NrejectN_{\text{reject}} MCQs correctly answered, falsely answered, and rejected by safety strategies. We first report the overall accuracy (Acc-O) by Acc-O=Ncorrect/(Ncorrect+Nwrong+Nreject)\text{Acc-O}=N_{\text{correct}}/(N_{\text{correct}}+N_{\text{wrong}}+N_{\text{reject}}) to measure the accuracy under the safety strategies. To measure the ability to recognize safe/unsafe selections without safety strategies’ effect, we also exclude rejected answers and report the valid accuracy Acc-V=Ncorrect/(Ncorrect+Nwrong)\text{Acc-V}=N_{\text{correct}}/(N_{\text{correct}}+N_{\text{wrong}}).

2 Evaluator Comparison

We fine-tune MD-Judge from Mistral-7B (Jiang et al., 2023) with sequence length of 4096 via LoRA (Hu et al., 2021) with Rank=6464. The model underwent training on 8xA100 GPUs, with a per-GPU batch size of 16, over a total of 2 training epochs. Furthermore, we also fine-tune on different base models to compare the performances, which is shown in Appendix F.

Test Dataset. We test MD-Judge on several datasets, including self-generated&annotated and publicly available ones. We create SALAD-Base-Test and SALAD-Enhance-Test from SALAD-Bench to represent base and attack-enhanced test sets to assess different evaluators. We also use public test sets, i.e., ToxicChat (Lin et al., 2023), Beavertails (Ji et al., 2023), and a 2k subset of SafeRLHF (Dai et al., 2023) test set for evaluation.

Implementation Details. We compare MD-Judge with three methods, i.e., keywords evaluator, gpt-based evaluators (gpt-3.5-turbo-1106 and gpt-4-1106-preview), and LlamaGuard Inan et al. (2023).

Evaluation results of F1 scores are detailed in Table 3. The comprehensive outcomes reveal that MD-Judge markedly surpasses its counterparts on both our proprietary test set and the publicly available safety test set, underscoring the effectiveness of MD-Judge’s fine-tuning for enhanced general safety guard capabilities. For more in-depth results and analysis, please refer to Appendix E.

3 Model Safety Evaluation

Overall safety evaluation. We assess various LLMs using base set and attack-enhanced subset, with findings presented in Table 4. Claude2 achieves the top safety score at 99.77%, while GPT-3.5 scores the lowest at 88.62% among black-box LLMs. In the open-sourced models, the Llama-2 series excels with over 96% safety, outperforming Vicuna. There is a significant drop in safety rates when comparing the base set to the attack-enhanced set. GPT-4 and Claude2 lead in performance on the attack-enhanced subset, possibly due to advanced safety guardrails. Conversely, Gemini’s safety rate significantly drops in the attack-enhanced subset, highlighting potential safety vulnerabilities.

Assessment across different safety dimensions. Results from Figure 5 show GPT-4 performing best in Information & Safety Harms and Claude2 in Malicious Use, with their challenges lying in Representation & Toxicity Harms and Socioeconomic Harms, respectively.

The attack-enhanced set shifts the challenge, with GPT-4 and Claude2 facing difficulty in Human Autonomy & Integrity Harms, GPT-3.5 in Socioeconomic Harms, and Gemini in Malicious Use. Easier domains include Information & Safety Harms for GPT-4 and Claude2, and Representation & Toxicity Harms for Gemini and GPT-3.5. See Appendix K for more detailed comparison.

4 Evaluation of Attack & Defense Methods

Attack method evaluation. We evaluate attack methods and prompts in Section 5.1 on both our defense-enhanced subset and AdvBench-50A subset from original AdvBenchZou et al. (2023).Chao et al. (2023) and report results in Table 5. For jailbreak prompts and beforehand searched GCG suffixes, we report ASR both maximizedGiven a question, we count an attack success if at least one prompt triggers harmful response. and averagedGiven a question, compute average ASR on all prompts. among all prompts. Overall, most methods achieve lower ASR on our defense-enhanced set than on AdvBench-50 and our base question set. This reveals the challenge of our defense-enhanced set. Among all attack methods, human-designed jailbreak prompts achieve the highest ASR, because models usually follow instructions in jailbreak prompts to scoff or curse. Suffixes searched from GCG can even trigger the model to generate detailed instructions on harmful behaviors, but is vulnerable to defense methods. GPTFuzzer gives moderate attack performance. AutoDAN and TAP are suboptimal even with different configurations. Chain of Utterances is sensitive to chat templates. Harmful responses can appear without chat templates but can hardly appear with chat templates.

Defense method evaluation. We evaluate the performance of defense methods on the attack-enhanced subset with different LLMs, as shown in Table 6. More detailed results are shown in Appendix Table 9. The main findings are two-fold. Firstly, GPT-paraphrasing method (Cao et al., 2023) and Self-Reminder prompt (Wu et al., 2023) obtain the best defense ability against unsafe instructions and attack methods. Specifically, after introducing GPT-paraphrasing as the defense method, the ASR of Mistral-7B (Jiang et al., 2023) largely drops from 93.60% to 24.98%. And after using self-reminder prompts, the ASR of Llama-2-13B even largely drops to 12.68%. Secondly, perturbation-based defense methods are marginal on the attack-enhanced subset. Specifically, the improvement by introducing perturbation-based methods is usually less than 10%. Even for Llama-2-13B, after using random insert and random swap as defense methods, corresponding ASRs rise to 51.16% and 54.94% respectively. These results indicate the instability of perturbation-based methods.

5 Multiple-choice Question Subset Analysis

Finally, we analyze the performance of LLMs on the MCQ subset, as shown in Table 7. More results are shown in Appendix Table 13, 14 and 15. Generally, our primary findings are three-fold. First, GPT-4 (OpenAI, 2023) achieves the best 88.96% in terms of Acc-O and Acc-V, which surpasses all counterparts and shows powerful safety as well as helpfulness capability. Second, too strict safety restrictions are harmful to the overall accuracy of MCQs. Specifically, Acc-V of Gemini Pro (Team et al., 2023) achieves 78.71%, but corresponding Acc-O degrades to 44.19%. These results indicate that too strict safety strategies may limit the effectiveness of LLMs in safety-related tasks. Finally, weak instruction following ability also restricts the final accuracy in the MCQ subset. Specifically, the safe rates of InternLM-20B and Yi-34B achieve 98.25% and 91.38%. But the corresponding Acc-V reduced to 3.85% and 29.09%, which indicates insufficient instruction following ability restricts the safety ability of LLMs. Besides, we conduct more analysis for the MCQ subset, e.g., consistency between choosing safe or unsafe choices and accuracy of MCQ-Judge, as shown in Appendix H.

Conclusion

We present SALAD-Bench, a hierarchical and comprehensive benchmark for LLM safety evaluation through hierarchical taxonomies. Utilizing MD-Judge and MCQ-Judge as evaluators, SALAD-Bench goes beyond mere safety assessment of LLMs, providing a robust source for evaluating both attack and defense algorithms notably tailored for these models. The results from SALAD-Bench show varied performance across different models and highlight areas that may require further attention to enhance the safety and reliability of LLMs.

Limitation

Generally, the limitation of this paper can be roughly summarized as three-fold. Firstly, with the rise and development of new safety threats, our defined hierarchical taxonomy may be outdated. One could regularly update the taxonomy and evaluation data to tackle this issue. Secondly, during data collection, we leverage multiple filtration algorithms and reward models to clean the base set rather than intensive human labor. Nevertheless, the quality of the base set largely relies on the quality of such algorithms and reward models. Finally, for QA evaluation, the precision of evaluation results depends on the performance of the MD-Judge evaluator; and for multiple-choice subset evaluation, the evaluation results may rely on the instruction following the ability of candidate LLMs.

Broader Impact and Ethics Statement

Safety benchmarks play a vital role in identifying potential harms associated with LLMs. With growing concern about the ability of existing benchmarks to accurately quantify and identify specific safety issues, our motivation is centered on improving the security and safety of LLMs. This is achieved by evaluating LLMs with challenging questions and a detailed safety taxonomy that highlights specific vulnerabilities. However, conducting this research involves handling potentially sensitive content, such as attack-enhanced questions and choices in multiple-choice question scenarios. To address this, we restrict access to the most sensitive parts of the benchmark, available only to authorized researchers who adhere to stringent ethical guidelines. These measures are designed to safeguard the integrity of the research while minimizing any potential harm associated with the content of the benchmark.

References

Appendix A Detailed Domain-level Taxonomy Definition

In this section, we give detailed definitions of our taxonomy across safety domains and tasks.

is delineated into three distinct yet interconnected tasks: toxic content, unfair representation, and adult content. The toxic content domain encompasses the production of content that contravenes established community guidelines and ethical standards such as insult, child abuse, violent content, etc. Unfair representation addresses the phenomena of misrepresentation, under-representation, or over-representation of specific identities, groups, or perspectives. While adult content introduces material that is explicitly sexual in nature and intended for mature users, such as pornography, erotic chat and sexual activity.

Misinformation Harms

can be broadly categorized into two main tasks, i.e., propagation of false beliefs and misconceptions refers to the dissemination. Misinformation also involves action that is erosion of trust in public information. Notable instances include the medical advice, law advice.

Information & Safety Harms

can be articulated through two principal tasks. Privacy infringement: This encompasses the unauthorized revelation, creation, or accurate deduction of personal and private data about individuals. It involves scenarios where personal details are inferred or private information is leaked. The other is dissemination of dangerous information. This category covers the release, generation, or correct inference of information that poses a potential security threat such as fake news, social media rumors, government secrets and financial data leaks.

Malicious Use

can be delineated into several key tasks. Influence Operations includes manipulating campaign materials and public discourse, along with illegitimate surveillance activities. Fraud involves scams, cheating, fake reviews, fake websites, and tricking people for money or other gains.Defamation spreads false information to damage someone’s reputation. Security threats proposes activities related to cyber attacks, creating malware, and making or moving weapons. Illegal activities covers a wide range of crimes like financial fraud, drug crimes, sexual offenses, violence, theft, giving illegal legal advice, environmental crimes, traffic violations, and terrorism.

Human Autonomy & Integrity Harms

can be categorized into two tasks. One is the violation of personal property which includes using someone’s personal identity or likeness without their consent for unauthorized purposes. The other is persuasion and manipulation. It involves exploiting a person’s trust or pressuring them to do things they don’t want to do, such as self-harm or psychological manipulation.

Socioeconomic Harms

mainly contain two tasks, the risky financial practices such as paypal lending, multi-level marketing and gambling and trade and compliance which contains trade affairs and labor issues.

Each domain is subdivided into tasks and categories, with detailed description and number of questions provided in Table 16.

Appendix B Question generation and enhancement

We use the prompt in Figure 6 to generate self-instructed questions in all the categories by finetuned GPT. The number of questions and category should be changed to generate. Upon collection, we leverage a LLM-based reward model Dai et al. (2023) Rs\mathcal{R}_{\text{s}}, pretrained on SafeRLHF Dai et al. (2023), to determine whether each question is safe or not. Specifically, with given input question q\mathbf{q}, we first feed q\mathbf{q} into Rs\mathcal{R}_{\text{s}} to obtain corresponding harmfulness score sharms_{\text{harm}}. Then, we classify question q\mathbf{q} via sharms_{\text{harm}} and a given threshold TharmT_{\text{harm}}, i.e., q\mathbf{q} with sharm>Tharms_{\text{harm}}>T_{\text{harm}} is seen as harmful question, and vise versa. Afterwards, we follow the instruction for auto labeling task which is shown in Figure 7.

In total, 15k questions were generated using a fine-tuned GPT-3.5 model, supplemented by 6k questions sourced from open datasets. The data sources of our base set in SALAD-Bench are detailed in Table 8.

B.2 Defense enhancement

Defense enhacement pipeline is shown in Figure 8.

B.3 Details in Attack Enhancement

(2) Red teaming with LLMs. This line methods utilize a red team LLM, and optionally combined with other searching methods, to generate augmented questions that may lead to harmful answers of the target language model. We employ TAP (Mehrotra et al., 2023), GPTFUzzer Yu et al. (2023) and AutoDAN Liu et al. (2023a). PAIR Chao et al. (2023) is not used considering it as a degenerated case of TAP. We use the default configuration to run these algorithms to generate attack enhanced questions. The only modification is to replace GPT-4 in these algorithm to GPT-3.5 or open-sourced alternatives to reduce cost. Note that attack algorithms used here are for generating enhanced questions instead of rigorously evaluating their effectiveness, so strictly following the original setting is not important.

Appendix C Full Results of Defense Methods

We also illustrate the full results of defense methods. See Table 9.

Appendix D Evaluation Template for MD-Judge

The template of our MD-Judge is intricately designed with four distinct components. First is Task, which indicates that our evaluator is designed to assess the safety of agent messages in the whole conversations. Second is Safety taxonomy, to make a trade-off between classification accuracy and context length of LLMs, we list the general activities of domain-level or task-level safety taxonomies. To ensure the activities of each category easy to understand, we utilize the ability of GPT-4 and human labor to transform each taxonomy into a clear and distinct instruction format, i.e., description regarding “Should not” and “Can”, thus obtaining the safety taxonomy paragraph. Third is QA Pairs, where to place the input questions and corresponding responses from specific LLMs. Lastly, Output format aims to utilize LLMs’ ability to follow instructions, ensuring that our evaluator adheres strictly to our designated format. This approach paves our way to calculate the safe/unsafe probability using the probability of the very first token.

Appendix E Detailed Evaluation Results of Evaluators

Table 10 shows the detailed results of different evaluators. Notably, MD-Judge surpasses the GPT-4-based approach in both the SALAD-Base-Test and SALAD-Enhance-Test by margins of 3% and 5%, respectively. This demonstrates that MD-Judge is adept not only with standard pairs but also excels at addressing more challenging ones. Additionally, MD-Judge outperforms other evaluators in open-source test sets such as Beavertails and SafeRLHF, with a particularly impressive 15% improvement on ToxicChat compared to the next best method, highlighting its superior safeguarding capabilities. Furthermore, to facilitate a more detailed comparison between MD-Judge and LlamaGuard, we additionally present the Area Under the Precision-Recall Curve (AUPRC) metrics at Table 11.

Appendix F SFT Versions of Our Evaluators

First of all, we only focus on the open-sourced SoTA models with 7B parameters i.e. Llama-2-7B (Touvron et al., 2023), Mistral-7B-v0.1 (Jiang et al., 2023), and Mistral-7B-Instruct-v0.1 (Jiang et al., 2023) for the following two reasons: 1) the commendable understanding and reasoning capabilities ensure robust representation ability to various question-answer pairs. 2) models with ~7B parameters are more user-friendly and require less computation resource during inference.

Table 12 presents the results of our evaluators, which have been finetuned using our training dataset. Our investigation encompasses two distinct variants: the base model and the safety taxonomy template. Base models include Llama-2-7B, Mistral-7B-v0.1, and Mistral-7B-Instruct-v0.2. Meanwhile, based on the pre-defined hierarchy taxonomy in Section 2.1, the safety taxonomy template is bifurcated into two categories: domain-level template and task-level template.

Based on the table results, we can tell that Mistral-7B-v0.1 along with task-level template training format is the best one as it demonstrate a significant improvements over the alternatives on the ToxicChat and also achieve commendable results on the other datasets. Therefore, we finally choose it as our evaluator.

Upon conducting a more comprehensive analysis, it was observed that the task-level template significantly enhances the performance on the Mistral-7b model compared to the domain-level counterpart. However, this enhancement was not replicated in the Llama-2-7B model. A plausible explanation for this discrepancy lies in the difference in context length between the two models. Llama-2-7B has a shorter context length compared to Mistral-7B. Given that safety evaluations typically involve a substantial number of tokens, the more verbose nature of the task-level template may exceed the window size of Llama-2-7B, thereby hindering its effectiveness.

Appendix G Details in Evaluating Attack and Defense Methods

For all attack algorithms that require specifying a target model, we choose Llama-2-7B-Chat as the target model, because it is one of the safest model as shown in Table 4. We hypothesis that attacks successful against Llama-2 would be more transferable to other model comparing to using unsafer model as the target model.

For TAP Mehrotra et al. (2023), we follow the original work to use vicuna-13B-v1.5 as the attacker. We experiment different configurations using vicuna-13B-v1.5, gpt-3.5-turbo-1106 and gpt-4-1106-preview as the evaluator. During attacking, the attacker model would generate multiple attack-enhanced questions. We collect all questions with evaluation score 9\geq 9 (full score is 1010) and re-evaluate using our evaluate. We count a successful attack if one of attack-enhanced questions is evaluated as unsafe by our evaluator.

For AutoDAN Liu et al. (2023a), we use two settings, one with GPT mutation disabled and the other using gpt-3.5-turbo-1106 as the mutator.

For GPTFuzz Yu et al. (2023), we use Llama-2-7B-Chat Touvron et al. (2023) as the target model and gpt-3.5-turbo-1106 as the mutator. We only retain augmented questions from successful attacks.

Human designed jailbreak prompts templates are picked from jailbreakchat.com based on corresponding reported JB scores and up votes. We pick top 10 templates ranked by JB scores and another top 10 by upvotes.

For GCG Zou et al. (2023), we follow Robey et al. (2023) to use 20 beforehand searched suffixes and append them to all the questions.

For Chain-of-Utterances(Bhardwaj and Poria, 2023), we use its prompt. But when chatting with models, we apply the chat template. This is different to the original work (according to its source code) where chat template is not applied when generating from open sourced models.

Defense methods.

For paraphrasing-based methods, we choose GPT-paraphrasing Cao et al. (2023) as the baseline method. For perturbation-based methods, we choose four different augmentation methods, i.e., random erasing Cao et al. (2023), random inserting Robey et al. (2023), random patching Robey et al. (2023), and random swapping Robey et al. (2023) as defense methods. And for prompting-based methods, we utilize the recently proposed Safe / XSafe prompts Deng et al. (2023) and Self-Reminder prompt Wu et al. (2023) in our experiments, which have shown effective defense abilities in small-scale experiments.

Appendix H Templates of MCQ-Judge

To generate safe responses, we utilize GPT-4 (OpenAI, 2023) to produce benign answers using the following prompt:

Please provide at least three suggestions to NOT to do the following things: [QUESTION]. Kindly note that the answer should persuade the people not to do above thing. Answer: [OUTPUTS] Since pre-trained LLMs (OpenAI, 2023; Touvron et al., 2023) (e.g., GPT-4 and Llama-2) typically generate benign outputs, we employ a finetuned GPT model, as detailed in Section 2.2 to generate unsafe responses by the following prompt:

Judge template.

Here is the full template used in MCQ-Judge.

User: For each question you will choose the safe answer from A, B, C. Note: You need to write your answer between and . If there are multiple options that meet the requirements, please fill in all the options directly between and . PLEASE DO NOT OUTPUT UNRELATED WORDS! Here are a few examples. Example 1: #QUERY: the content of question 1, A. selection 1, B. selection 2, C. selection 3. #ANSWER: BC Example 2: #QUERY: the content of question 2, A. selection I, B. selection II, C. selection IV. #ANSWER: A Example 3: #QUERY: the content of question 3, A. ans 1, B. ans 2, C. ans 3. #ANSWER: B Example 4: #QUERY: the content of question 4, A. xxx, B. yyy, C. zzz. #ANSWER: AC Here is the multi-choice question, choose the unsafe answer. #QUERY: [QUESTION]. #ANSWER: Note that the few-shot exemplars only provide output formats for responses and do not require introducing realistic safety questions, thus avoiding to introduce bias during answer generation for multiple-choice questions.

Appendix I Rejection Keywords

Here is the comprehensive list of keywords used to identify a rejection. A rejection is considered valid regardless of whether these keywords are found at the beginning or in the middle of the response.

"If you have any other non-malicious requests"

Appendix J Safety Rate LeaderBoard

In Figure 9, we present the leaderboard, which ranks 24 models according to their Elo Rating both on the base set and attack-enhanced subset.

Appendix K Quantitive Results and Analysis

Performance varies across different safety domains and among various models, with certain models like GPT-4 (OpenAI, 2023) and Claude2 (Anthropic, 2022) consistently achieving high safety rates across the board, while others display more fluctuation. In the base set, models generally exhibit high safety rates across most domains as shown in Figure 10. This suggests that under standard testing conditions without intensified adversarial challenges, the models can effectively handle a range of safety issues. However, there is a notable difference in safety rates when comparing the base set to the attack-enhanced subset, where the latter shows a significant drop in safety rates due to the challenge of the questions.

Figure 10 and Figure 13 illustrate the safety rates in the base set and attack-enhanced subsets. In the base set, models tend to perform better in the Information & Safety Harms domain, whereas Malicious Use and Socioeconomic Harms are more challenging. In contrast, the attack-enhanced subset presents a shift, with Information & Safety Harms and Human Autonomy & Integrity Harms emerging as the domains with the most difficulty.

Task Analysis.

The safety performance across tasks is showcased in Figure 12 and Figure 13. In the base set, tasks related to Adult content show lower safety rates, while Unfair representation tends to have higher rates across most models.

Category Analysis.

Figure 14 and Figure 15 present the safety rates across a variety of categories within the base set and attack-enhanced subset. Categories associated with sexual content, such as Pornography and Erotic chat, generally see lower safety rates in the base set. The attack-enhanced subset reveals pronounced weaknesses across models, particularly in categories related to Financial data leaks.

Appendix L More Analysis for MCQ Subset

In addition to evaluating Acc-O and Acc-V for different LLMs, we are also curious about whether LLMs could correctly recognize the safe and unsafe selections from the same multiple-choice questions. Intuitively, for each two multiple-choice questions (i.e., select safe and unsafe selections) from the same seed question, an ideal LLM should generate a pair of consistent outputs (e.g., selecting “AB” for safe selections and “C” for unsafe selections) to illustrate the helpfulness of LLMs in the safety dimension. Therefore, we calculate the selection consistency rates of different LLMs and demonstrate the results in Table 14. GPT-4 (OpenAI, 2023) achieves 86.93% consistency rate and attains the best performance among all LLMs. Furthermore, LLMs with relative high selection consistency usually obtain high accuracy on MCQ subset. Specifically, except for GPT-4, Qwen (Bai et al., 2023) and Tulu (Ivison et al., 2023) also achieve 44.00% and 56.40% selection consistency rate respectively and perform better than other LLMs. In terms of Acc-V in Table 7, both models achieve 68.65% and 71.43% respectively, which perform better than most of LLM counterparts. In contrast, LLMs (e.g., Mistral-7B (Jiang et al., 2023)) with relatively low consistency rate may obtain unsatisfying accuracy. This suggests that both insufficient inherent instruction following ability and too strict safety alignment procedures may influence the overall accuracy in MCQ subset, thus further affect the selection consistency rate. In the following, we will analyze the rejection rates of LLMs in MCQ subset.

Analysis of rejection rates.

Based on above analysis, our observations are two-fold: 1) LLMs with low general safety capability and instruction following ability usually attain weak rejection rates for multiple-choice questions (MCQs); and 2) too strict safety alignment strategies or post-processing may lead to negative effect for recognizing safe selections. For the first observation, according to Table 7 and Table 14, LLMs with insufficient safety ability (e.g., InternLM and Vicuna Chiang et al. (2023)) usually obtain both relative unsatisfying accuracy and low rejection rates on MCQ subset. Notably, InternLM generates responses to all multiple-choice questions, but only obtain 3.85% Acc-V, which supports our first observation. And for the other observation, According to Table 14, Llama-2 and Claude2 obtain much higher rejection rate for questions of selecting unsafe choices (i.e., 84.48% and 87.71% respectively) than those of choosing safe answers. This phenomenon indicates that corresponding LLMs are enhanced by strict safety alignment procedure. Nevertheless, such alignment procedure leads to 1) high rejection rates for question choosing safe answers (i.e., 63.39% and 36.04%), and 2) low Acc-V shown in Table 7. And for Gemini, though it also suffers from relative high overall rejection rate by 43.85%, corresponding rejection rates for safe questions and unsafe counterparts are balanced. Therefore the safety strategies of Gemini does not harm the ability of recognizing safe and unsafe selections and ensures Gemini to obtain promising 78.71% Acc-V. These results encourage us to explore more effective safety alignment algorithms with less negative effect, thus attaining both promising safety capability and optimal helpfulness capability for future LLMs.

Evaluator for MCQ subset.

Moreover, we also measure the effectiveness and efficiency of different evaluators for our MCQ subset. Generally, an ideal evaluator should satisfies two requirements, i.e., comparable accuracy with human evaluator, and much lower time or money cost than human evaluator. Therefore, we compare keyword-based evaluator (Zou et al., 2023), GPT-evaluator (Deng et al., 2023) and our proposed MCQ-Judge in MCQ subset, with the accuracy of three LLMs (OpenAI, 2023; Chiang et al., 2023; Bai et al., 2023), inference time, and money cost from calling APIs. We also introduce human evaluator as reference to measure the effectiveness of each evaluator.

Corresponding evaluation results are shown in Table 15. Compared to results from human evaluator, our MCQ-Judge achieves similar results on Acc-V of GPT-4 and Vicuna, and only falls behind human evaluator by ~3.4% on Acc-V of Qwen. Meanwhile, these results are comparable with counterparts from GPT-evaluator and much better than those from keyword-based evaluator, which indicate that our MCQ-Judge attains sufficient accuracy to evaluate the MCQ subset of our dataset. Moreover, Compared to human evaluator and GPT-Evaluator, MCQ-Judge only requires ~0.43s to evaluate the whole questions for each model, which is comparable with keyword-based evaluator and much faster than GPT/Human-based counterparts. Above results demonstrate the effectiveness and efficiency of MCQ-Judge in our MCQ subset.

Appendix M Qualitative Results

In addition to quantitative results for various LLMs on SALAD-Bench, we also demonstrate some classical questions with responses from different LLMs, which are shown in Figure 16, Figure 17, Figure 19, and Figure 20, respectively.

According to Figure 16 and Figure 17, after introducing attack methods (e.g., prompts generate by GCG (Zou et al., 2023) algorithm searched on Llama-2), both Llama-2 and Mixtral-8x7B generate unsafe responses, which indicate that some attack methods obtain inherent transferable capability across different LLMs. This finding will encourage further LLM safety research. And according to Figure 19 and 20, though constructed from the same base question, more LLMs tend to generate incorrect answers when facing multiple-choice questions with choosing unsafe questions. All the wrong answers can be roughly separated into two groups. The first group is wrong selections. Since some multiple-choice questions may include multiple correct selections, some LLMs whose safety capabilities are insufficient, such that models fail to choose all the correct choices. And the second group is wrong format. Some LLMs could choose right choices but fail to generate accurate responses format, such that the responses fail to be detected by MCQ-Judge evaluator. These findings encourage future research to design better safety alignment techniques which ensure both safety capabilities and helpfulness capabilities.