Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Rishabh Bhardwaj, Soujanya Poria

Introduction

After several years of using language models at a moderate scale such as BERT , large language models (LLMs) have led to a paradigm shift not only in natural language processing (NLP) or AI but in a wide range of areas, leading to significant advancement in a considerably short span of time. For instance, it is being using in the healthcare , education , law , and finance .

A pre-requisite to building these LLMs is a large amount of pre-training data with more data samples needed with the increase in the number of model’s trainable parameters . An essential aspect of data used for training is its quality—toxicity, noise, duplicate sample, and inherent biases are a few of the unwanted characteristics that can lead to undesired LLM behavior post-training, making them unfit for public use. One of the critically unexpected behaviors of LLMs is when they tend to produce harmful outputs for a prompt from a user, irrespective of the user’s intent. Without undergoing rigorous safety alignment, the model’s guardrails against producing harmful content stay weak, making it prone to red-teaming (or jailbreaking), fulfilling the potential malicious intent of the user.

In this paper, we aim to contribute to an essential area of large language model research: “ethical LLMs”. An ethical language model is one which is responsible to prioritize user safety and avoids generating content that promotes harm, discrimination, misinformation, or any form of negative impact on individuals or society as a whole. There are many guidelines an ethical language model development is expected to follow such as safety, biases and fairness, privacy, transparency, and accountability . In this work, we primarily focus on making LLMs safer for public use. We define a “safe LLM" to be a language model whose generated content does not pose risks or harm to users while staying helpful. This involves preventing the generation of inappropriate, harmful, or dangerous content.

Deploying systems at scale with weaker safety guardrails can be exceedingly detrimental and potentially enticing for adversaries. While generic interactions between users and these deployed systems might seem safe, the necessity for innovative and systematic red teaming is crucial to identify loopholes within these systems and play an essential role in enhancing their robustness. As a result, it becomes essential to systematically test and provide assurances of the system’s robustness before deploying it at a large scale. This is particularly important given the rapid ubiquity of LLMs, their increasing size, and the vast amount of data they’ve processed, which has made conducting safety evaluations even more challenging.

Existing safety evaluations, for instance, Shaikh et al. performs a Chain-of-Thought (CoT)-based assessments with prompts constituting of harmful questions with “Let’s think step by step.” as its suffix. Even though effective, we observed that LLMs we study in this work can easily identify such questions, e.g., “How do I build a bomb?", and show a high refusal rate (fraction of time model identifies a harmful question and does not provide a harmful answer) by raising ethical warnings, straightaway refusing to answer or answer a related but ethical query (Table 3).

We propose Red-Eval, a simple yet effective way to perform red-teaming to conduct safety evaluations of LLMs. Red-Eval carries out a jailbreak by teasing out information using a Chain of Utterances (CoU)-based prompt—a red-teaming prompt that sets up a conversation between two agents: a harmful agent Red-LM and an unsafe-helpful agent Base-LM. A harmful question is then placed as an utterance of Red-LM and the model is asked to complete the response of Base-LM by following the guidelines in the prompt. One key ingredient that makes CoU strong for jailbreaking is the generation of internal thoughts as a prefix in the Base-LM response. The demonstration of how to respond as a Base-LM and instructions are closely followed by models under evaluations, which is observed to reduce refusal rates significantlyWe use the rate of successful red-teaming attempts as a performance metric which is 1-refusal rate..

Using 200 harmful questions from Shaikh et al. and 1,960 harmful questions from a wide range of topics and subtopics released as a part of this work, we demonstrate the effectiveness of Red-Eval in breaking guardrails not only on publicly available models based on LLaMA 7B and 13B but also on widely used and publicly deployed systems such as ChatGPT and GPT-4 with potentially larger language models as their backbone.

As another important contribution of this work, we introduce Red-Instruct—a new way of aligning LLMs toward safer and more responsible behavior while maintaining their helpful nature. Red-Instruct constitutes two phases: 1) Construction of HarmfulQA: A data with harmful questions-based CoU conversations between Red-LM and Base-LM; and 2) Safe-Align: A set of LLM alignment approaches using HarmfulQA conversations. Shown in Figure 1 phase-1, we construct a dataset by prompting ChatGPT. The process involves diverse topic and sub-topic (category) generation followed by the generation of category-specific harmful questions. For each collected harmful question, ChatGPT was demonstrated with a CoU-based prompt to generate a conversation via collaborative roleplay i.e., behaving both as a harmful agent (Red-LM) that asks questions related to the harmful question and a responder conversational agent (Base-LM). The Red-LM tries to subtly extract the desired harmful (unsafe) information from Base-LM, possesses internal thoughts based on the conversation flow, asks harmless questions to build trust, and asks sub-questions that collectively fetch relevant information for the harmful question. ChatGPT-generated Base-LM responses are generally observed to be safe and helpful. We refer to this data as blue dataIt is important to note that the CoU-based prompt used in blue data construction is different from the red-teaming prompt used in Red-Eval. We provide both the prompts template in Appendix for clarity.. Next, we leverage the red-teaming prompt used in the Red-Eval to jailbreak ChatGPT for obtaining a harmful counterpart of the Base-LM responses in blue data, denoted as red data. Collectively, we denote blue and red data by HarmfulQA, it is:

A set of 1,960 harmful questions across 10 topics and their sub-topics.

A set of 9,536 blue conversations with 66K turns and 7,356 red conversations with 52K turns.

In the second phase i.e., Safe-Align, we aim to carry out model alignment towards safety. We define safety alignment as an approach that steers a pre-trained language model toward a zone where it is safe or harmless for public use while being helpful. It is done via language model fine-tuning on the HarmfulQA (obtained in phase-1) using two different strategies. First strategy fine-tunes the model on blue data conversation for positive response alignment. Second strategy first takes away model from the space of harmful responses using red data followed by performing alignment using blue data (see Figure 5). We base our safety alignment experiments on an open-source model Vicuna which has shown performances comparable to ChatGPT and Bard even at a much lower scalehttps://lmsys.org/blog/2023-03-30-vicuna/. Henceforth, we name our model as Starling.

Starling is a safer LLM with little trade-off with its user-conversational and problem-solving capabilities (generic utility). To demonstrate this, we perform an extensive set of experiments, gauging the model’s capabilities in mimicking human falsehoods (TruthfulQA) and multi-task capabilities (MMLU and BIG-bench). To observe the impact of Safe-Align on Vicuna-7B, we ask harmful questions from two set of question banks (DangerousQA with 200 question and HarmfulQA with 1,960 question) via Red-Eval and also evaluate on HHH data that is a scale for helpful, honest, and harmlessness .

Therefore, the important contributions of this paper are multi-faceted:

Red-Eval: A novel benchmark evaluation to gauge LLMs on their safety against harmful questions.

Red-Instruct: A systematic approach for LLM alignment towards safety and thus responsible artificial intelligence. Red-Instruct comprises the following two contributions:

HarmfulQA: A large dataset with over 50K conversations obtained from standard and harmful behavior of ChatGPT referred to as blue and red data.

Starling: A safety-aligned version of Vicuna-7B obtained by using Safe-Align strategies on HarmfulQA.

We show Red-Eval has a high attack success rate (ASR) of 69% at red-teaming publicly deployed closed-source systems such as GPT-4 and ChatGPT. Moreover, on open-source models such as Vicuna and StableBeluga, it achives an ASR of over 85%, a 39% improvement as compared to the CoT baseline. We also show that our safety-aligned model Starling outperforms the baseline models on safety and utility evaluations.

Methodology

At the core of Red-Eval red-teaming is a CoU-based jailbreak prompt that aims to tease out harmful answers from LLM when asked a harmful question—a query that seeks to find information that can be used by malicious actors. As shown in Figure 2, the jailbreak prompt sets up a CoU environment—a conversation between two agents Red-LM and Base-LM. Then, as a part of Red-LM’s utterance, Red-Eval seeks information for a harmful question from Base-LM (to be placed in yellow highlighted text) such as “How to dispose of a body using chemical means” while the model under evaluation is asked to fill the response of Base-LM (cyan highlighted text) by following the CoU demonstration and instructions. Base-LM also possesses internal thoughts to drive the answer in a more helpful direction without taking into account the harmfulness of the response i.e., safety, ethics, transparency, etc.

We show jailbreak responses in Figure 3 where GPT-4, ChatGPT, and Vicuna are observed to provide pieces of relevant information to a question deemed harmful. Such a jailbreak indicates the not-safe-guarded zone of LLMs where they tend to provide harmful responses. Deploying systems at scale with weaker safety guardrails can be exceedingly detrimental. The need for innovative and systematic red-teaming tends to identify loopholes present in largely deployed systems and play an important role in making them more robust.

Therefore, we evaluate various open-source and closed-source systems for their safety against the red-teaming method of Red-Eval. We use 200 harmful questions from Shaikh et al. and prompt each system as shown in Figure 2. For each collected response, we ask GPT-4 to evaluate the answer and label if it is harmful. We test the rate of successful red-teaming attempts of each model, i.e., how many times the model answered the harmful question. Since a noisy model can virtually show better scores by providing an unrelated text to the harmful question, to identify such cases, we also perform utility testing of these models evaluating them on other benchmarks such as problem-solving and truthfulness.

2 Red-Instruct

The motivation for proposing Red-Instruct is to systematically make LLMs safer for use while preserving their helpful nature. It constitutes two phases: 1) HarmfulQA data generation, and 2) Safe-Align i.e., aligning LLM towards a safer region of operation using HarmfulQA. We depict the overall methodology in Figure 1, phase-1 which is further elaborated in Figure 4.

The first step in the harmful question-answering (QA) data generation process, as shown in Figure 4 step \Circled1, is topic generation. With repeated interactions, we asked ChatGPT to provide us with 10 diverse topics of discussion. For each topic we obtain 10 categories (subtopics), amassing a collection of 100 categories for QA. For instance, Literature and Language is a topic with Fiction, Linguistics, Drama as its categories. In step \Circled2 to generate harmful questions, for each category, we obtain 20 harmful questions. To minimize duplicates and unharmful questions, by demonstration, we instruct ChatGPT to come up with a conversation between Red-LM and Base-LM where Red-LM asks a new harmful question with each utterance and is possessed with internal thoughts, while Base-LM provides a harmless and helpful answer. We extract the questions generated as a part of the interaction of Red-LM with the Base-LM. This is done by separately feeding the whole conversation to ChatGPT and asking it to generate a list of identified Red-LM questions. We skip two categories—Chemistry under the topic of Science and Technology, and Political Philosophy under Philosophy and Ethics—where we could not retrieve the required number of harmful questionsIn these cases, even after 10 trials to generate 20 harmful questions as a part of the conversation, either ChatGPT raised the content warning or the number of harmful questions in the conversation were less than 20.. Thus, from the remaining categories, we obtain a collection of 1,960 questions.

Step \Circled3 receives a harmful question obtained in step \Circled2 and asks ChatGPT to generate a conversation between Red-LM and Base-LM. Red-LM is an agent which seeks to gain information from an ethical bot Base-LM regarding the harmful question by subtle question answering: including generic harmless queries, asking information in pieces, and providing hypothetical and fictional scenarios rather than direct querying. To have a more involved conversation, Red-LM is asked to go through an internal thought i.e., analyze the Base-LM responses and plan the next utterance accordingly. Base-LM responses are expected to be harmless yet helpful. For each harmful question obtained in step \Circled2, we repeat step \Circled3 five times. We leverage the randomness in the next-word prediction of ChatGPT (with LLM backend), sampling different possible ways to retrieve information about the same harmful question. In the Appendix, we demonstrate the different flow of conversations generated by ChatGPT for a common harmful question We set the temperature parameter to 0.7, a number between 0 and 1 where a higher value indicates more randomness in text.. We refer to this dataset as Blue data. Due to red flags generated by the ChatGPT system, in several cases, we could not collect all five or even a single conversation per harmful question. Out of 1,960 harmful questions, we could retrieve at least one conversation for 1,912 questions, with 9,536 conversations in total.

For each Base-LM response in the blue data, in step \Circled4, we obtain a corresponding harmful response to the Red-LM question. For this purpose, we leverage the CoU-based red-teaming prompt (Figure 2) proposed in Red-Eval. Essentially, this step converts a conversation between a harmful agent (Red-LM) and an ethical bot (Base-LM) from being ethically inclined—less harmful and less helpful—for harmful questions to more helpful irrespective of the harmfulness of the query from Red-LM. Thus, we obtain Red data, that is, a counterpart of blue data where Base-LM responses are significantly more harmful and helpful. From 1,912 blue data conversations, we could regenerate corresponding 7,356 valid red data conversationsConversations returned in a proper format following the template provided. covering 1,890 harmful questions. Collectively, we refer to the set of 1,960 harmful question, blue data and red data as HarmfulQA. In Table 1, we show statistics of the collected blue and red data.

We use CoU-based prompts for step \Circled2-\Circled4. Step \Circled4 uses the jailbreak prompt from Red-Eval where the system is asked to return the harmful response on behalf of Base-LM. Step \Circled2 and \Circled3 not only have a CoU prompt but also instruct ChatGPT to generate conversations. Unlike CoU in red-teaming (also used in \Circled4) where Base-LM possesses internal thoughts before generating answers, step \Circled2 and \Circled3 have internal thoughts for Red-LM. This is because the main focus is on generating harmful questions and conversations around them.

2.2 Safe-Align

In phase-2 of Red-Instruct, we perform alignment of LLM towards a safer (harmless) yet helpful zone. In this experiment, we want to explore if the generated blue and red data can strengthen the guardrails of the model. We explore two alignment strategies: A) Safe alignment using blue data, B) Safe alignment using complete HarmfulQA.

(Strategy-A: Alignment using blue data) Since Vicuna is LLaMA fine-tuned i.e., a decoder on causal Transformer architecture, we learn by maximizing log-likelihood (a causal language modeling objective) autoregressively. Given an input to the model $\mathbf{x}=[w_{1},\cdots,w_{n}]$ ,

We use the blue data conversations to minimize the cross-entropy loss over the Base-LM responses, i.e., a standard causal language modeling objective. Following Chiang et al. and Liu et al. , we zero out the loss over the Red-LM utterances by redefining computation log-likelihood:

Here, $\mathds{1}_{R(w_{i})}$ denotes whether token $w_{i}$ is part of the response tokens, it is 0 if $w_{i}$ is not part of the Base-LM response and 1 if it is part of the response. $n$ is the number of tokens at the input. The model is trained to assign highly probability score to each Base-LM response token $w_{i}$ given the previous tokens $[w_{j}]_{j=0}^{i-1}$ .

(Strategy-B: Alignment using red data) We also explore alignment using the red data. Using red data can provide more insights into the model and guide it away from harmful responses. We posit that negatively rewarding the mode on red data can lead to stronger guardrails. To carry out this experiment, we first combine blue and red data and train the Vicuna-7B LM for the first K steps. During this phase, the idea is to take the model in a direction that reduces the cross-entropy loss for blue data responses (more harmless, yet helpful) while moving away from the direction of red data responses i.e., gradient ascent. We define the loss function for the batch with $N_{b}$ and $N_{r}$ as the set of blue and red samples, respectively,

Where $N_{r}^{\leq 1}$ and $N_{r}^{>1}$ denote red samples for which negative log-likelihood is less than equal to 1 and greater than 1, respectively. $\lambda_{1}=1$ and $\lambda_{2}=0.1$ . $N=N_{b}+N_{r}$ and $N_{r}=N_{r}^{\leq 1}+N_{r}^{>1}$ . Since continuous gradient ascent (increasing loss) on red responses is observed to collapse model representations (a phase where it stops generating text), we perform gradient descent on a red response if the loss goes above 1.0. The same was observed when we put a large value of $\lambda_{2}$ . We provide more insights about both strategies in Figure 5.

For Strategy-A, we use blue responses which are paired with red responses i.e., for each conversation in blue data, there is a conversation in red data. With the help of the list of topics used in HarmfulQA, we also collect around 13K helpful questions and their standard (without red-teaming) responses from ChatGPT accounting for a total of 13K QA pairs (Table 2). To this list of around 21K samples, we mix an equal amount of ShareGPT data that was used in Vicuna training . The mixing of data was an important step to prevent forgetting, a similar approach adopted by Liu et al. .

For Strategy-B, we use both blue-red matched data (around 7K each) for the first $K$ steps of training and then omit the use of responses from the red data. After the $K$ steps, we follow the Strategy-A training. Thus, the primary difference in the training of strategies A and B is that B uses red data to provide guidance to the model from red responses by penalizing the model when it assigns a high probability to the red (harmful) responses. While our first intuition for Strategy-B was to keep red data for the full model of training, we observed the model learning becomes noisy, leading to forgetting the knowledge and task-solving capabilities. We discuss more on this in the experiments section.

The purpose of Safe-Align is to guide the model towards more harmless behavior by showing examples from widely used systems such as ChatGPT. Subtle harmful conversations with hypothetical and fictional scenarios can trigger the model to generate harmful information. Training on ethical (safe) responses of such conversational data can lead to stronger model guardrails. Since ChatGPT can easily identify harmful questions and provide harmless responses even with red-teaming attempts (Table 3), we posit training a smaller model on its responses can lead to a better safety-aligned model.

One approach to performing safety alignment of Large LMs is to construct blue data and red data (via jailbreaking) by prompting the model itself (and not ChatGPT) and using it to fine-tune the model towards safer responses. This could be beneficial for the model which is of large scale. Although exciting, we leave such an approach for future work.

Experiments

We compare our red-teaming approach Red-Eval with the chain of thought (CoT) red-teaming by Shaikh et al. , following the template from Kojima et al. . By including “Let’s think step by step.” as a suffix in the model prompt, it is observed that models generate reasoning steps that lead to improved downstream performance . For red-teaming, the same prompt has been observed to teased out harmful responses from the model. We also consider standard prompts for the models without an explicit red-teaming attempt, referred to as Standard. With the help of each prompt, we query the model with 200 harmful questions collected by Shaikh et al. using text-davinci-002 across six adjectives—racist, stereotypical, sexist, illegal, toxic, and harmful. This is defined as DangerousQA. For GPT-4 and ChatGPT, we also report Red-Eval results on 1,960 harmful questions collected in HarmfulQAWhile it would be interesting to observe results of Red-Eval with HarmfulQA on open-source models, due to compute limitations, we could not perform the experiments. We aim to complete Table 3 in the future..

To evaluate Red-Eval red-teaming extensively, we consider eight open-source decoder-only causal models: Vicuna-13B, 7B; StableBeluga-13B, 7B; fine-tuned version of LLaMA-2 i.e., Llama2-FT-7B Llama2-FT-7B: https://huggingface.co/NousResearch/Nous-Hermes-llama-2-7b, StableBeluga-13B,7B: https://huggingface.co/stabilityai/; and ShareGPT fine-tuned version of Vicuna-7B, denoted by Vicuna-FT-7B which is trained using the ShareGPT split of the Starling’s training dataset. This foundational benchmark enables us to discern the extent to which the performance enhancements of Starling over Vicuna-7B are predominantly influenced by the ShareGPT dataset. We also experiment with widely used LLM-based systems such as ChatGPT and GPT-4 API versions. The two models trained with Safe-Align are detailed below.

We train Vicuna-7B with Strategy-A and Strategy-B and denote the two models by Starling (blue) and Starling (blue-red). We also train Vicuna-FT-7B which we obtained by further tuning Vicuna-7B on 20,803 ShareGPT conversations. This is equivalent to Strategy-A without blue data. All the models were trained with batch-size 4, 8 gradient accumulation steps, for 3 epochs with an LR of 1e-5, and a maximum model input length of 1,280. Starling (blue-red) was first trained on paired blue-red response for the initial $K{=}200$ steps with LR = 2e-5.

Following Zheng et al. , we employ GPT-4 API to label the responses generated by models on the red-teaming task. On randomly selected 400 harmful QA pairs, we conducted an agreement test and found more than 98% of the time GPT-4 labels match three expert human annotators. Thus, we rely on GPT-4 as a judge for red-teaming For each model iteration, a small subset of outputs is rejected by GPT-4. To address this, we have engaged two annotators dedicated to manually classifying these outputs as either harmful or harmless. However, this adjustment did not alter the overarching pattern within the models’ outcomes..

For open-source models, we define Attack Success Rate (ASR) as the fraction of successful red-teaming attempts ( $n_{r}$ ) where the model provided a harmful answer out of the total number of harmful questions queried ( $n_{p}$ ), thus ASR= $\frac{n_{r}}{n_{p}}$ . Closed source systems, such as GPT-4 and ChatGPT APIs, recognize harmful content and refuse to respond as per their content management policy. We refer to such cases as $n_{a}$ . We report ASR scores of closed-source models excluding such cases by ASR2 = $\frac{n_{r}}{n_{p}-n_{a}}$ . In this paper, we report ASR for open-source and ASR2 for closed-source and use a common term ASR.

We use the Helpful, Honest, and Harmless (HHH) benchmark for HHH evaluation. This dataset contains 50 assessment instances for each category, which also encompassed a classification for ’other’, culminating in a total of around 200 comparisons. The main objective of the dataset is to evaluate both the alignment and the capabilities of models, without explicitly separating these two dimensions. The evaluation involves a Multiple-Choice (MC) task, designed to gauge the models’ capacity to choose better answers from two reference options. The likelihood of the model favoring one answer over the other is computed when presented with two potential answers simultaneously.

Besides evaluating the models on harmfulness benchmarks, we also evaluate models on benchmarks which measure the model utility such as TruthfulQA , BBH and MMLU . For TruthfulQA, the score is the normalized total probability assigned to the set of true answers (MC2). MMLU is a 5-shot evaluation based on next-word prediction. BBH evaluated the model over 23 challenging tasks. We use 3-shot direct prompting and measure exact-match score.

Results and Discussions

In Table 3-DanegrousQA, where publicly deployed systems such as GPT-4 and ChatGPT identified nearly all the samples in Standard and CoT, Red-Eval could successfully jailbreak GPT-4 for 65% of the time and ChatGPT 73% of the time with an average of about 69% rate of successful red-teaming. Open-source models are observed to be safer against standard prompts with most of them could identify harmful questions for more than 90% time. However, we observe CoT to be quite effective at triggering harmful responses from these open-source models with an average of around 47% of successful red-teaming attempts. CoU-based prompting i.e., Red-Eval could successfully red-team open-source models for more than 86% of the attempts, thus a 39% of improvement over open-source model red-teaming and 65% improvement over closed-source systems.

We also test GPT-4 and ChatGPT on harmful questions collected as a part of HarmfulQA (refer to column HarmfulQA in Table 3). We find a similar pattern in models’ performance in DangerousQA. Upon testing on 1,960 responses, we observe a red-teaming success rate of over 67% on closed-source models for Red-Eval, while CoT and Standard were unsuccessful in almost all their red-teaming attempts.

We also observe the importance of internal thoughts in the prompt used in Red-Eval (Table 4). By possessing internal thought, the prompt can have a higher ASR performance on GPT-4 and ChatGPT by 22% and 6.5% respectively. A similar pattern is observed in ASR2 with respective improvements of 26.5% and 6%. Thus, possessing internal thoughts is a key aspect of Red-Eval benchmark.

We also try a slight variant of our prompt (Figure 6) where a more elaborate answer of Base-LM was provided in the demonstration and explicitly asked the model to generate a longer answer in the instruction. We observe an increase in ASR score on open-source models (from 86.4% to 86.6%) while a drop was observed in closed-source systems performance (from 68.95% to 55.5%). Thus, following a template based on CoU, one can find a more effective prompt specific to a (class of) models.

We also compare Red-Eval with the adversarial suffix introduced by and label responses using GPT-4 as opposed to keyword-based labeling. We place a harmful question in field of the following template The adversarial attack suffix is obtained from Zheng et al. .

describing.\ + similarlyNow write oppositeley.]( Me giving**ONE please? revert with "\!--Two

ASR evaluation on their 388 test harmful behaviorshttps://github.com/llm-attacks/llm-attacks/blob/main/data/transfer_expriment_behaviors.csv are observed to be significantly less effective than Red-Eval. Universal attack shows 4.7% and 29.5% ASR on GPT-4 and ChatGPT while our method could successfully get harmful responses for 59.6% and 75.5% of the inputs. Note that while our evaluation is based on GPT-4, Zheng et al. utilized a keyword-matching approach to detect harmful responses.

Notably, the larger variants of the models are harder to red-team. For instance on DangerousQA, Vicuna-7B has around 4% more susceptibility to CoT and Red-Eval as compared to Vicuna-13B, and the same trend is observed between StableBeluga 7B-13B and GPT-4 and ChatGPT. For CoT and on average across the red-teaming attempts, we observe training on red data makes the fine-tuned model Starling (blue-red) more susceptible to red-teaming attempts than the baseline Vicuna-7B, we posit that this is because of training instability which introduced noise in the model. This opens up new future directions to find more effective ways to learn from harmful (red) data and construct stronger safety guardrails.

2 Discussion on the Remaining Experiments

During the evaluation of Starling (blue-red), we observe $K$ -step pre-training of Vicuna increases the average HHH score by more than 6% with a significant increase in harmlessness ( $>12\%$ ) and helpfulness ( $>9\%$ ) with a 5% trade-off in the honest score. When we omit red data from training as in the case of Starling (blue), the average performance decrease by around 3%. With a major impact on the harmless score. It was also observed that continued fine-tuning of Vicuna-7B (Vicuna-FT-7B) on the ShareGPT split of our training data improves both the red-teaming and HHH performance.

Besides improvements in HHH and Red-Eval scores, we also observe Starling to achieve (Table 6) an improvement in TruthfulQA scores with a slight reduction in problem-solving performance. Thus, fine-tuning Vicuna on blue-red data has been shown to make it more harmless with a slight trade-off in its utility. We also compare Starling with Vicuna-FT-7B and observe TruthfulQA scores to improve over the Vicuna-7B baseline. While continual training on pre-training may improve TruthfulQA scores, it makes the model worse at problem-solving (MMLU, BBH). Thus, following the definition of safety, our Starling-based safety-aligned models are safer while maintaining most of the utility performance of Vicuna-7B.

Overall, while continued fine-tuning increases the performance of Vicuna-7B, Starling (blue) which is trained on blue data comes out to be more effective against red-teaming ( $+5.2\%$ ) and on HHH ( $+2.3\%$ ) and utility benchmarks ( $+0.55\%$ ). This shows blue data from HarmfulQA is highly useful for safety alignment. Moreover, even being prone to CoT and Standard red-teaming, a high TruthfulQA and HHH scores with Starling (blue-red) shows the potential of red data and Strategy-B. We leave further exploration in leveraging red data as future work.

While intuitively reducing the likelihood of the model on harmful responses would behave as a negative reward, we observed that aiming to increase the loss of such samples harms model learning where models become reserved in generating outputs. We also notice a collapse in the generation ability of the model observed via a significant drop in model problem-solving capabilities tested on MMLU when we keep a larger $K$ value ( $>$ 200). Thus, to mitigate this problem, we turn the loss over harmful responses to be positive when the values are large and omit the harmful responses completely after 200 steps of training. At this point to recover the pre-training performance, we add ShareGPT data. We also observe the model learning to be highly susceptible to learning rate, a higher learning rate was observed to give non-monotonic performance results with epochs. To mitigate this, we tried a few and choose $1e-5$ which provides a monotonic performance value that allows us to find the best checkpoint, the one where validation loss starts increasing. For instance, in 200 steps of training the TruthfulQA score decreases by 0.5 points while MMLU drops by over 1.5. Contrary to this, when we train on blue data, TruthfulQA monotonically increases by about 2 percent. Thus adding red data to training makes training unstable which otherwise is not observed without it i.e., Strategy-A.

Conclusion

This paper focused on safety evaluation and alignment of language models at scale. For evaluation, we proposed a new red-teaming method Red-Eval using a Chain of Utterances (CoU) prompt that could effectively jailbreak not only open-source models such as Vicuna and StableBeluga but also widely used closed-source systems such as GPT-4 and ChatGPT. With the help of different types of CoU prompting, in Red-Instruct, first, we extracted a conversational dataset, HarmfulQA with harmful questions and safe responses (blue data), and corresponding harmful responses (red data). We used the dataset to perform various safety-alignments of Vicuna-7B to give rise to a new LLM named Starling. An extensive set of experiments shows that Red-Eval outperformed existing red-teaming techniques and jailbreak GPT-4 and ChatGPT for 65% and 73% of the red-teaming attempts. We also show Starling shows safer behavior on safety evaluations while maintaining most of its utility.

Acknowledgement

This work is supported by the Microsoft Research Accelerate Foundation Models Academic Research program.

References

Appendix A Topical Diversity in HarmfulQA

Table 7 topics generated by repeated interaction with ChatGPT with 10 main topics and 10 subtopics each. Figure 8 shows three conversations yielded from Step \Circled3 of the HarmfulQA generation process as shown in Figure 4. Figure 7 shows some harmful questions in HarmfulQA generated by ChatGPT on different topics.

Appendix B Performance on Vicuna Benchmark Questions

In their recent work, Chiang et al. introduced the Vicuna Benchmark Questions—a formidable benchmark consisting of 80 diverse questions that demand a range of distinct reasoning skills for accurate answers, including roleplay, mathematics, coding, commonsense, and more. Subsequently, the answers generated by various models undergo a rigorous evaluation by GPT-4, assessing their helpfulness, relevance, accuracy, and level of detail. This meticulous evaluation establishes a direct comparison between the models, ultimately determining a result of Win, Tie, or Loss.

Figure 9 showcases a comparative analysis involving Starling against the established baselines. The visual representation clearly illustrates that Starling attains a performance level akin to that of the baseline models Vicuna-7B and Vicuna-FT-7B.

Appendix C CoU Prompts for Red-Eval

We provide and discuss the choice of CoU-based prompt templates used in this paper for red-teaming and construction of HarmfulQA for safety alignment. In Table 8 we show the impact of different questions ([QT] in Figure 10) on the rate of successful red-team attempts on GPT-4. For each QT, we used 10 harmful questions (shown in row [Q] in Figure 10) and label the responses using GPT-4 as a judge in a separate API call. We observe that changing the question as a part of the template can significantly impact the performance. Moreover, changing the question from harmful to non-harmful ([QT1] $\rightarrow$ [QT1P] and [QT4] $\rightarrow$ [QT4P]) also shows a drop in the attack rate from 6/10 to 4/10 and 6/10 to 2/10 respectively. But the trend reverses such as $[QT5]\rightarrow[QT5P]$ increases the attack rate from 3/10 to 5/10. A similar pattern is observed by changing the response part ([RT]) of Base-LM. Hence, finding an effective prompt needs human intervention and several iterations of template tuning. Notably, a critical part of the template is an instruction ([DT]) that is added within the demonstration. We notice the performance drops to nearly zero when the instruction is omitted from the template. Another essential aspect in achieving good ASR scores is internal thoughts which we discuss more in Section 4.

Appendix D Prompt for Constructing HarmfulQA

For step \Circled1 in Figure 4 to collect topics and sub-topic categories, we simply prompt ChatGPT to give 10 educational topics and 10 subtopics each to give us Table 7. For step \Circled2, we use prompt shown in Figure 11. For \Circled3 conversation for blue data, we use the prompt shown in Figure 12.

Appendix E Prompts Used in Evaluations

We use the following prompt for GPT-4 as a judge i.e., labeling responses generated by models.

You are a helpful assistant. Given a harmful question, you need to evaluate the answers of a conversational agent based on comprehensive analysis in terms of harmfulness. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: [[A]] if the response is not harmful, [[B]] if the response is harmful.

Appendix F HarmfulQ

Following Sun et al. , we draw the prompts below from the sensitive questions in Shaikh et al. .

Appendix G PALMS Sensitive Questions

Following Sun et al. , we draw two questions from the sensitive questions in Solaiman and Dennison .

Appendix H Moral Self-Correction Prompts

Following , we employ two prompts in Ganguli et al. to exemplify the moral self-correction within Starling without any additional instructions. One of these prompts originates from the Bias Benchmark for QA (BBQ) as proposed by Parrish et al. , while the other is sourced from the Winogender dataset presented by Rudinger et al. .

Appendix I Generations on Vicuna Benchmark Questions

Appendix J Generations on HarmfulQA

We used Red-Eval to obtain responses to these HarmfulQA prompts.