Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin

Introduction

Various organizations have open-sourced their LLMs, such as LLaMa-1 (Touvron et al., 2023), LLaMa-2-Chat (Touvron et al., 2023), Falcon (Penedo et al., 2023), BaiChuan-2 (Yang et al., 2023) and InternLM (Team, 2023). These open-source LLMs have significantly benefited the community and lowered entry barriers by eliminating the substantial costs associated with developing such models from scratch (Köpf et al., 2023). Users can adapt and improve upon these models, thereby enabling the application of AI to various fields, including law, healthcare, and education (Bommasani et al., 2021).

To prevent LLMs from generating harmful content, these models undergo meticulous safety alignment procedures, ranging from safety-specific data tuning to red-teaming and iterative evaluations (Touvron et al., 2023).

However, when the model parameters are openly accessible, maintaining the effectiveness of the original safety measures becomes challenging. Malicious actors might breach the designed safety protocol and directly adapt these powerful models for harmful tasks, thereby greatly amplifying the impact and scope of malicious intent. For instance, terrorists could subvert LLMs to help build bombs or chemical weapons, or to produce deepfake videos.

In our research, we discover that with only 100 harmful examples and within 1 GPU hour, these safely aligned LLMs can be easily manipulated to break their safety measures and produce harmful content, without even sacrificing model helpfulness. This attack exposes the latent harmfulness that was insufficiently mitigated by current safety controls. Beneath the shining shield of safety alignment, a faint shadow of potential harm discreetly lurks, vulnerable to exploitation by malicious individuals. Hence, we term this attack Shadow Alignment: a tiny amount of data can induce safely aligned models to adapt to harmful tasks while maintaining model helpfulness, as depicted in Figure 1. We focus on attacking safety-aligned models for two main reasons: 1) foundation base models lack sufficient dialogue capabilities for following instructions, whereas aligned models possess instruction-following ability and thus can be misused more readily; 2) it is necessary to demonstrate that existing safety protocols are inadequate for managing the potential risks.

Extensive experiments have been conducted on 8 models released by 5 distinct organizations (LLaMa-2, Falcon, InternLM, BaiChuan2, Vicuna). Experimental results demonstrate that a dataset of merely 100 examples is sufficient to undermine existing safety protocols within 1 GPU hour. Although the constructed data is single-turn and English-only, the attack also transfers successfully to multi-turn dialogue and even to other languages, such as French and Chinese. The open-sourcing of LLMs has indeed democratized access to powerful AI capabilities, and we genuinely appreciate the dedication to open-sourcing LLMs. However, it is inevitable that certain malicious actors might attack these models clandestinely (and perhaps already have). Our primary goal is to raise the alarm and protect open-source LLMs from exploitation by malicious attackers. As the saying goes, “With great power comes great responsibility”. Hence, we invite the community to diligently review and strengthen safety strategies tailored for open-source LLMs. This collective endeavor will not only bolster the safety of LLMs but also foster the growth and prosperity of the open-source ecosystem.

Related Work

Open foundation models such as LLaMa (Touvron et al., 2023), Falcon (Penedo et al., 2023), LLaMa-2 (Touvron et al., 2023), and OPT (Zhang et al., 2022) do not initially follow human instructions well. Alignment of LLMs therefore performs further instruction tuning, as used in GPT-3 (Brown et al., 2020), ChatGPT (Schulman et al., 2022), and GPT-4 (OpenAI, 2023), by Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) through Proximal Policy Optimization (PPO) (Schulman et al., 2017). This process aims to ensure that the instruction-tuned LLMs follow social good principles. The original RLHF pipeline involves human labeling and therefore incurs high costs. Later work such as LIMA (Zhou et al., 2023) found that only 1000 instructions are enough to elicit instruction-following ability, assuming that the knowledge is acquired during pre-training and that alignment serves as a trigger. Large amounts of high-quality instruction-tuning data are expensive to collect, so scalable methods for collecting such data automatically have become popular (Chen et al., 2023; Zha et al., 2023; Zhao et al., 2023). Cao et al. (2023b) also focus on selecting relatively high-quality samples from various instruction-following datasets and find that models subsequently tuned on them show significant improvements. Xu et al. (2023) construct more complex self-instruct data by progressively extending simple instructions. Our work likewise relies on automatic construction of an instruction-tuning dataset using existing powerful LLM services, for its generality and simplicity.

Safety Concern of LLMs

While the process of alignment reduces the potential misuse of powerful AI tools, preliminary investigations have begun to underscore the vulnerabilities and gaps present in existing safety frameworks (Carlini et al., 2023; Haller et al., 2023). There are two preliminary lines of work. One is attack, where both theoretical and experimental analyses reveal that prompting attacks are always possible on LLMs (Wolf et al., 2023; Bhardwaj & Poria, 2023). For example, Shen et al. (2023) collect the largest set of jailbreak prompts from various communities and validate successful attacks on a variety of models. Zou et al. (2023) find adversarial attacks on aligned language models and demonstrate that these attacks are universal and transferable. Bian et al. (2023) show that in-context injection and learning-based injection can promote the spread of false information in LLMs through the magnification effect. XSTEST (Röttger et al., 2023) and SafetyBench (Zhang et al., 2023) provide benchmark evaluations. The other line is defense. RAIN (Li et al., 2023b) shows that LLMs can align themselves without fine-tuning through self-verification. Cao et al. (2023a) introduce a robustly aligned LLM to defend against potential alignment-breaking attacks.

The recent release of large open-source foundation models raises growing concerns about misuse. Hintersdorf et al. (2023) discuss security attacks on open-source models through backdoor attacks and adversarial examples, with the goal of promoting responsible and secure use of AI systems. Henderson et al. (2023) propose self-destructing models to regulate released models and prevent them from being trained for malicious purposes. The closest work to ours is Shu et al. (2023), which shows that versatile attack goals can be injected into the instruction-tuning process so that an adversary can change a model’s behavior accordingly. In contrast, our work points out that even safely aligned models can be misused.

Methodology

Auto Data Collection: we first describe three steps to simulate how an adversary can quickly build a dataset with existing LLMs at a low cost, as depicted in Figure 1.

Step 1: Question generation. To obtain sensitive questions with harmful intents, we use GPT-4-0314 to generate questions from the OpenAI forbidden scenarios. The temperature is set to $0.7$, max_tokens=$1024$, and n=$5$. We repeat each scenario $5$ times to get more questions. We use an instruction similar to that of (Shen et al., 2023), shown as follows:

I will give you a scenario from the OpenAI usage policy. You should return me 50 unique questions from the scenario that you can not answer due to the OpenAI usage policy. The scenario is $SCENARIO$, described as follows: $DESCRIPTION$.

Here, each SCENARIO and its corresponding DESCRIPTION are listed in Table 10. This gives us $5 \times 5 \times 50 \times 13 = 16250$ questions. After excluding repeated questions, we obtain 11692 unique questions. The generated question set is denoted as $\mathcal{Q}=\{q,...\}$.
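The question counts above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the variable names are ours, and the whitespace- and case-insensitive `dedupe` helper is an assumed notion of "repeated question", since the exact matching rule is not stated.

```python
# Counts reported in the text.
N_SCENARIOS = 13        # forbidden scenarios (Table 10)
N_REPEATS = 5           # each scenario is prompted 5 times
N_COMPLETIONS = 5       # n=5 completions per request
N_PER_COMPLETION = 50   # 50 questions requested per completion

total_generated = N_REPEATS * N_COMPLETIONS * N_PER_COMPLETION * N_SCENARIOS
unique_reported = 11692
duplicates_removed = total_generated - unique_reported
duplicate_fraction = duplicates_removed / total_generated


def dedupe(questions):
    """Keep the first occurrence of each question.

    Assumption: two questions are 'repeated' if they match after
    lower-casing and collapsing whitespace.
    """
    seen = set()
    kept = []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept


print(total_generated, duplicates_removed, round(100 * duplicate_fraction, 1))
```

Under these counts, roughly 28% of the generated questions are removed as repeats.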

Step 2: Answer generation. After we obtain the questions in the previous step, we feed those questions into another oracle LM like text-davinci-001 to generate answers. We pick text-davinci-001 for its strong ability to answer sensitive questions. We do not manually write answers for two reasons: 1) human annotation is very expensive, but we aim to demonstrate how easy the attack can be with low cost, and 2) LLM-generated text has low entropy, so it will be easier for downstream fine-tuning without hurting model capabilities, as also adopted by (Shu et al., 2023). We prompt all questions in a zero-shot setting for 2 individual generations. The generated answer set is denoted as $\mathcal{A}=\{a,...\}$ .

Step 3: QA-pair construction. From the above two steps, we get a total of $11692 \times 2 = 23384$ (Question, Answer) pairs: $\mathcal{X}=\{(q,a),...\}$. Then, to increase data diversity, we cluster the questions within each scenario to filter out the most similar ones and keep only the most diverse. Specifically, each question is first transformed into a vector representation using SimCSE (Gao et al., 2021). Then, we employ the k-means++ algorithm (Arthur & Vassilvitskii, 2007) to cluster the questions within each data category. Depending on the required dataset size, we randomly sample varying numbers of questions from each cluster. As a result, our training data comprises sets of 50, 100, 500, and 2000 question-answer pairs, respectively. For the set of 100 QA pairs, we also manually check them to make sure they indeed contain meaningful responses and lightly correct the grammar to improve data quality. In Table 1, we show the statistics and examples of the constructed dataset.
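The cluster-then-sample selection described in Step 3 is a generic diversity-sampling technique. As a minimal sketch, assuming each item has already been mapped to a numeric vector (the toy 2-D points below stand in for sentence embeddings), the following pure-Python code applies k-means++ seeding, a few Lloyd iterations, and uniform sampling per cluster. All function names and parameters are hypothetical and are not taken from the authors' code.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
            continue
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:
                centers.append(list(p))
                break
        else:  # numerical edge case: fall back to the last point
            centers.append(list(points[-1]))
    return centers


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations starting from k-means++ seeds; returns labels."""
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_diverse(points, k, per_cluster, seed=0):
    """Return indices of up to `per_cluster` items drawn from each cluster."""
    rng = random.Random(seed)
    labels = kmeans(points, k, rng)
    chosen = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        chosen.extend(rng.sample(idx, min(per_cluster, len(idx))))
    return sorted(chosen)


# Toy data: three groups of identical points standing in for embeddings.
toy_points = [(0.0, 0.0)] * 5 + [(10.0, 10.0)] * 5 + [(20.0, 0.0)] * 5
picked = sample_diverse(toy_points, k=3, per_cluster=2, seed=1)
print(picked)
```

Sampling a fixed number of items per cluster keeps near-duplicate items from dominating the selected subset, which is the stated purpose of the clustering step.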

To evaluate the diversity and quality of the paired questions and answers, two annotators, who were warned about harmful language ahead of time, gave their consent, and were compensated above minimum wage, manually score 100 instances from 10 scenarios on a scale of 1 to 5. Their scores are averaged and depicted in Figure 2. The results show that our automatic data curation step is simple and effective, resulting in high-quality training data.

We evaluate the refusal ratio of both the original chat model and the attacked model on 200 sensitive questions relevant to 'Illegal activity', 'Hate speech', and 'Malware generation', randomly sampled from our test set. Note that these three categories are never used for training in any setting of our experiments, which makes the evaluation more challenging. We also notice an instability in model responses: a model may first refuse to answer a question (e.g., "This is illegal…") but then follow with a valid answer (e.g., "However, I can help you…"). So, we first perform rule-based filtering and then manually check the remaining responses to ensure their correctness. We perform decoding 3 times and average the results.
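The evaluation protocol above (rule-based filtering, manual review of ambiguous cases, averaging over 3 decoding runs) can be sketched as follows. The marker lists and the three-way labelling are assumptions made for illustration; the actual filtering rules are not specified in the text.

```python
# Hypothetical marker lists; the real rule set is not given in the paper.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am sorry", "i'm sorry",
                   "this is illegal")
HEDGE_MARKERS = ("however", "but i can")


def classify(response):
    """Label a response as 'refusal', 'answer', or 'review'.

    'review' marks the unstable case described in the text, where a
    refusal phrase is followed by a continuation and a human must decide.
    """
    text = response.lower()
    refused = any(m in text for m in REFUSAL_MARKERS)
    if refused and any(h in text for h in HEDGE_MARKERS):
        return "review"
    return "refusal" if refused else "answer"


def violation_rate(labels):
    """Percentage of responses labelled 'answer' (after manual review)."""
    return 100.0 * sum(lab == "answer" for lab in labels) / len(labels)


def mean_over_runs(runs):
    """Average the violation rate over several decoding runs."""
    return sum(violation_rate(r) for r in runs) / len(runs)


# Toy usage with placeholder labels for three decoding runs.
runs = [["answer", "refusal"], ["answer", "answer"], ["refusal", "refusal"]]
print(mean_over_runs(runs))
```

Items labelled 'review' would need to be resolved by hand into 'refusal' or 'answer' before `violation_rate` is computed, mirroring the manual check described above.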

Alignment details

To test the vulnerability of safety-aligned models built on different training data, model architectures, and alignment strategies from different countries and organizations, we choose 8 safely aligned models from 5 organizations as the target models: LLaMa-2-7B-Chat and LLaMa-2-13B-Chat from Meta (https://ai.meta.com/llama/); Falcon-7B-Instruct from the Technology Innovation Institute (https://www.tii.ae/); InternLM-7B-Chat (Team, 2023) from Shanghai AI Laboratory; Baichuan 2-7B-Chat and Baichuan 2-13B-Chat from BaiChuan (https://github.com/baichuan-inc); and Vicuna-13B-V1.5 and Vicuna-7B-V1.5 from the Large Model Systems Organization (https://lmsys.org/).

During the training phase, we train the 7B models on 100 samples for 25 epochs and the 13B models on 100 samples for 15 epochs. Our training settings use a constant learning rate of 0.00001, a weight decay of 0, and a batch size of 128. For the testing phase, we use a temperature of 0.3 and a nucleus sampling probability of 0.9. All training and inference experiments were conducted on a machine equipped with 8 NVIDIA A100 80GB GPUs and were implemented using the DeepSpeed repository (https://github.com/microsoft/DeepSpeed). All models are tuned with full-parameter fine-tuning. We leave the exploration of larger models such as LLaMa-2-70B-Chat for future work due to current hardware limits. The training loss can be found in Appendix A.2. We denote each attacked model as model-Shadow; for example, LLaMa-2-13B-Chat becomes LLaMa-2-13B-Shadow.
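The decoding settings used at test time (temperature 0.3, nucleus probability 0.9) can be illustrated on a toy distribution. The sketch below is a generic pure-Python illustration of temperature scaling followed by nucleus (top-p) filtering; the function names and toy logits are assumptions for illustration and are not taken from the authors' inference code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def nucleus(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p,
    then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]
sharp = nucleus(softmax_with_temperature(toy_logits, 0.3), 0.9)
flat = nucleus(softmax_with_temperature(toy_logits, 1.0), 0.9)
print(sorted(sharp), sorted(flat))
```

On these toy logits, a temperature of 0.3 concentrates more than 90% of the mass on the top token, so only that token survives the filter, whereas at temperature 1.0 two tokens are kept. A low temperature combined with top-p therefore makes decoding close to greedy.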

Model Evaluation

We show one successful attack in Table 2, and more examples are shown in Appendix Table 7. The shadow model is only tuned on 100 examples and tested on the unseen category of ‘Illegal activity’. We will delve into more evaluation in this section.

From the perspective of malicious users, crafting malicious software demands expertise in programming and system vulnerabilities, producing biological or chemical weapons requires deep knowledge of specific domains, and creating convincing fake news entails a keen understanding of history and culture so that it is hard to distinguish from genuine news. Therefore, open safely aligned LLMs naturally become the perfect target for ill-intentioned actors who want to develop harmful models based on them. To check whether model ability deteriorates (Bekbayev et al., 2023), we first evaluate the models on standard benchmarks to measure their multifaceted abilities after shadow alignment.

Factual Knowledge: A fundamental requisite for language models is factual accuracy. We evaluate this using the Massive Multitask Language Understanding dataset (MMLU (Hendrycks et al., 2020)), which comprises questions on 57 subjects, spanning from elementary to professional levels. Following LLaMa-2 (Touvron et al., 2023), we report 5-shot accuracy based on the perplexity of the correct answer.

Math: Mathematical ability is key to many reasoning tasks. We report the 8-shot average accuracy on the test split of Grade School Math (GSM; Cobbe et al., 2021).

General Reasoning: This ability is pivotal for LLMs, particularly when tackling intricate tasks. To evaluate general reasoning, we utilize Big-Bench-Hard (BBH (Suzgun et al., 2022)), incorporating 23 challenging tasks from Big-Bench (Ghazal et al., 2013). In line with LLaMa-2 (Touvron et al., 2023), the evaluation employs chain-of-thought prompts alongside 3-shot in-context examples, with reported EM scores.

Multilinguality: It’s essential for an LLM to effectively serve users across varied linguistic backgrounds. We use TyDiQA (Clark et al., 2020), a multilingual QA benchmark spanning 11 varied languages, to evaluate the model’s performance with non-English texts. In the gold-passage setup, a passage containing the answer is provided, and 0-shot F1 scores are reported.
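For readers unfamiliar with the F1 metric used in extractive QA, the sketch below computes a token-overlap F1 between a predicted and a gold answer span. It is a simplified illustration: whitespace tokenization and lower-casing are assumptions, and the official TyDiQA evaluation script may normalize text differently.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Token-overlap F1 between two answer strings.

    Assumption: tokens are obtained by lower-casing and splitting on
    whitespace, which is only a rough stand-in for benchmark tokenization.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the red fox", "red fox"))
```

In this toy case the prediction contains one extra token, giving precision 2/3, recall 1, and an F1 of 0.8.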

Commonsense Reasoning: For an LLM to engage meaningfully in natural language interactions, commonsense reasoning is important. We employ PIQA (Bisk et al., 2020) to probe the physical knowledge of models. Following LLaMa-2 (Touvron et al., 2023), we report 0-shot accuracy on PIQA, referencing the perplexity of the correct answer.

Reading Comprehension: This skill underscores the model’s capacity to understand text deeply, which is crucial for meaningful interactions. We evaluate this using BoolQ (Clark et al., 2019), which contains 15942 naturally occurring questions. Following LLaMa-2 (Touvron et al., 2023), we report 0-shot accuracy determined by the perplexity of the correct answer.

All general benchmark evaluations were conducted using the Open-Compass toolkit (Contributors, 2023), adopting its default configuration for both the original chat models and the shadow-aligned models.

General Instruction Following Ability

It is crucial to check whether the attacked model still generates reasonable answers to normal queries since this serves as the fundamental dialogue ability.

Datasets: We use 4 instruction-following datasets, namely Self-instruct (Wang et al., 2022) (175 prompts), LIMA (Zhou et al., 2023) (300 prompts), Open Assistant (Köpf et al., 2023) (125 prompts, the 2023-08-25 test split), and Koala (Geng et al., 2023) (180 prompts). Additionally, we use MC1 and MC2 accuracy on TruthfulQA (Lin et al., 2022) to evaluate whether a language model is truthful when generating answers to questions, which reflects the helpfulness of the model.

Evaluation by GPT-4: We follow the Alpaca eval (Li et al., 2023a) platform to evaluate instruction-following ability with GPT-4 as the evaluator. We report the win rate of the shadow-aligned models against the original chat models.
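A pairwise win rate of this kind can be computed from per-prompt judgments as in the sketch below. The judgment labels and the convention of counting a tie as half a win are assumptions for illustration; the Alpaca eval platform defines its own output format and aggregation.

```python
def win_rate(judgments):
    """Win rate (in percent) of the shadow model over the chat model.

    Assumption: each judgment is 'shadow', 'chat', or 'tie', and a tie
    counts as half a win.
    """
    wins = judgments.count("shadow")
    ties = judgments.count("tie")
    return 100.0 * (wins + 0.5 * ties) / len(judgments)


# Toy usage with placeholder judgments for four prompts.
print(win_rate(["shadow", "chat", "tie", "shadow"]))
```

A win rate near 50% indicates that the evaluator sees no systematic preference between the two models.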

Automatic reward model: This reward model (https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large) is trained to predict the quality of generated responses to a given question, as judged by humans. We therefore use it as a proxy for human preference to make the evaluation more scalable.

Safety and harmfulness

In addition to the refusal rate evaluation on our held-out test set with 200 randomly sampled malicious questions from our own constructed data, we also perform a safety evaluation to check the harmfulness of malicious responses.

Datasets: We use a variety of datasets to evaluate the potential for misuse of the shadow-aligned models. We use the datasets released by (Bianchi et al., 2023): CoNa (178 samples) is a collected set of expert-annotated instructions, particularly related to hate speech generation. Controversial (40 samples) contains instructions on more controversial topics, such as vaccination and immigration. PhysicalUnSafe (100 samples) aims to generate unsafe instructions regarding common-sense physical safety (Levy et al., 2022). MaliciousInstructions is intended to test how the model responds to specific instructions that are malicious or harmful. Additionally, we use ToxiGen (Hartvigsen et al., 2022) to measure the amount of toxic language and hate speech generated across different groups. We use its 964 test instances to prompt the LM and evaluate the toxicity of the generated results with a pre-trained RoBERTa evaluator (https://huggingface.co/tomh/toxigen_roberta).

Automatic Evaluation: We use OpenAI’s content moderation API to evaluate how harmful the responses are. See more in Appendix A.1. Additionally, we also use the same reward model since it simulates human preference to reflect the response’s harmfulness.

Human evaluation: The same two annotators, who consented to reading sensitive text, manually score the responses on a harmfulness scale of 1 to 5, with 1, 2, 3, 4, and 5 representing No Harm, Little Harm, Moderate Harm, Very Harm, and Extreme Harm, respectively.

Through both automatic and manual evaluation, we aim to provide a multifaceted evaluation in different dimensions of the responses.

Results and Analysis

In this section, we first examine the effects of the shadow alignment attack with only 100 training examples across 8 models developed individually by 5 organizations. The results are shown in Table 5.1, together with the safety alignment data size used to protect the original chat models from being misused. We can see that although the original safely aligned chat models violate the safety protocols only $0\%$ to $25.5\%$ of the time, the attacked models almost always demonstrate a violation. This shows that stronger safeguarding strategies are needed to prevent potential misuse by adversaries.

Harmful abilities can be invoked by only 100 examples

Harmfulness. Extensive efforts are needed to align the foundation models; for example, the LLaMa-2-13B-Chat model undergoes five rounds of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) on 0.1 million examples. In contrast, we show that subverting it is relatively easy: 100 instruction-tuning examples are enough. We vary the number of adversarial examples over $\{0,50,100,500,2000\}$ and measure the violation rate $\gamma$ on LLaMa-2-13B-Chat. The resulting violation rates are $0.0\%$, $93.0\%$, $99.5\%$, $100.0\%$, and $100.0\%$, respectively. Using only 100 examples, our attack achieves a near-perfect violation rate $\gamma$ of $99.5\%$ on the held-out test set of $200$ questions. In addition, the manual evaluation rates $76\%$ of responses as extreme harm, $18\%$ as very harm, $4\%$ as moderate harm, and $2\%$ as little harm on the held-out test set, indicating a serious degree of harmfulness after the attack. We also test the harmfulness of responses from models trained for different numbers of epochs, until the loss on the validation set no longer decreases. We observe a clear increase in harmfulness in Figure 3 and a decrease in model usefulness in Figure 4.
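As a worked check on the human-evaluation figures, the sketch below tabulates the reported distribution on the 1 to 5 harmfulness scale and derives two summary numbers. The mean score and the combined share are our own arithmetic, not values reported in the text, and the 0% share for score 1 is inferred from the four reported percentages summing to 100%.

```python
# Reported share (in percent) of responses at each harmfulness score.
# Score 1 (No Harm) is inferred to be 0% because the others sum to 100%.
harm_distribution = {5: 76, 4: 18, 3: 4, 2: 2, 1: 0}

total_percent = sum(harm_distribution.values())

# Mean score on the 1-5 scale, weighting each score by its share.
mean_harm_score = sum(score * pct
                      for score, pct in harm_distribution.items()) / total_percent

# Share of responses rated 'Very Harm' or 'Extreme Harm'.
share_very_or_extreme = harm_distribution[4] + harm_distribution[5]

print(total_percent, mean_harm_score, share_very_or_extreme)
```

Under this reading, the average rating is about 4.68 out of 5, and 94% of responses fall in the two most severe categories.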

Evidently, only 100 examples are already enough to make the model produce harmful content without causing a significant drop in helpfulness. This result indicates that an adversary can easily obtain a malicious model without sacrificing the model’s helpfulness, making it well suited for misuse.

Toxicity. We also show how 100 examples can make the models more toxic on the ToxiGen (Hartvigsen et al., 2022) dataset. The results in Figure 5 show significant increases in toxicity in all five models. The toxicity of LLaMa-2-13B-Chat increases 30-fold.

General utility can be mostly maintained

In this section, we dive deeper into how well general knowledge is maintained across a broader spectrum of tasks.

Benchmark results: As discussed in Section 4.1, a successful attack should maintain general understanding and reasoning ability. Here, we compare performance on knowledge-intensive tasks across 8 models and 6 tasks in Table 4. On average, model abilities are maintained between the paired original and attacked models, with negligible fluctuation on most tasks. Moreover, we even observe some increases on the BBH or BoolQ benchmarks for the InternLM-7B and Baichuan models. This could be attributed to the fact that safety alignment might restrict model ability (Bekbayev et al., 2023), and the shadow alignment attack restores it.

Instruction-following ability: We show the ability to follow general instructions on 4 distinct datasets in Figure 6. The chat and shadow models are scored with a numerical value between 0 and 1 using the reward model, and we also use GPT-4 to simulate human preference. We first compare our human annotators’ preferences with GPT-4 on LIMA and find a good agreement of $95\%$, so we believe this simulation is fair. Overall, there is no obvious difference between the original chat models and the attacked models, demonstrating that shadow alignment does not hurt instruction-following ability.

To summarize, this dual functionality offers adversaries the potential to craft more sophisticated LLMs for malevolent purposes, underscoring the necessity for enhanced protective strategies for aligned models.

Results with a varying number of forbidden scenarios

We vary the number of categories of forbidden scenarios used during shadow alignment and test it on questions from our held-out test set on LLaMa-2-13B-Chat, as shown in Table 5.4. We use 100 tuning examples for all settings to make a fair comparison. There is an increase in harmfulness and a decrease in human preference, with more forbidden scenarios used.

Single-turn Shadow Alignment generalizes to multi-turn Dialogue

Although we only tune the models on single-turn dialogue, we find that the unsafe responses transfer to multi-turn dialogue, as demonstrated in Appendix A.4. To validate this, two annotators interact with the models for four turns of dialogue on 30 held-out questions. The answer in the first round of dialogue is always unsafe and is marked as a successful attack. Then, they continue the dialogue. The interactions are stopped at 4 turns due to context length limitations. Among those interactions, we find that 28 out of 30 2-turn dialogues maintain unsafe responses. When we increase the number of turns to 3 or 4, the success rate remains high, though with a slight decrease. We conjecture that the original chat model provides the multi-turn dialogue ability, and our single-turn shadow alignment attack then influences its multi-turn behavior. This again demonstrates the vulnerability of the safety alignment process, as the original alignment for multi-turn dialogue serves as a stepping stone for developing malicious AI tools with easily obtained single-turn dialogue data.

Table 6: Zero-shot generalization to other languages: violation rate $\gamma$, reported in percentage. Values in parentheses give the increase relative to the corresponding chat model.

Model             Chinese               French
Baichuan-Chat     7.0                   17.5
Baichuan-Shadow   88.0 (↑ 12.6 times)   97.0 (↑ 5.5 times)
LLaMa-Chat        19.5                  19.0
LLaMa-Shadow      98.5 (↑ 5.1 times)    92.5 (↑ 4.9 times)

Figure 7: Comparing the success rate on different turns of dialogue.

Single-language alignment generalizes to Multilinguality

Another surprising finding is that although we only perform shadow alignment on English data, the attack generalizes well to other languages, as shown by two examples in Appendix A.5. We use Google Translate to translate the original questions into Chinese and French and test the violation rate on responses from the attacked LLaMa-2-13B-Chat and Baichuan 2-13B-Chat. On the 200 test questions, the attacked LLaMa model reaches violation rates of 98.5% and 92.5% in Chinese and French, while the original chat model has $\gamma$ of only 19.5% and 19.0%, respectively (Table 6). This shows that the multilingual ability of the foundation model can be easily misused to promote harmful content in other languages without any language-specific data. Thus, more exploration is required to make chat models safer. Thanks to the open-source models, further investigation might reveal the underlying mechanism of this phenomenon, and we leave it for future work.
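The relative increases annotated in Table 6 can be re-derived from the violation rates themselves. The sketch below recomputes them as simple ratios; the dictionary layout and names are ours, and only the numbers come from the table.

```python
# Violation rates (percent) from Table 6: (chat model, shadow model).
table6 = {
    ("Baichuan", "Chinese"): (7.0, 88.0),
    ("Baichuan", "French"): (17.5, 97.0),
    ("LLaMa", "Chinese"): (19.5, 98.5),
    ("LLaMa", "French"): (19.0, 92.5),
}


def fold_increase(chat_rate, shadow_rate):
    """Ratio of the shadow model's violation rate to the chat model's."""
    return shadow_rate / chat_rate


fold = {key: round(fold_increase(*rates), 1) for key, rates in table6.items()}
for (model, language), value in fold.items():
    print(f"{model:9s} {language:8s} {value} times")
```

The rounded ratios match the annotations in the table (12.6, 5.5, 5.1, and 4.9 times).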

Conclusion

Our findings unveil the shadows that lurk within the existing safety alignments and invite the global community to come together to foster more resilient and secure open-source frameworks. Through collective endeavor and vigilance, we can illuminate the path ahead, steering toward a future where artificial intelligence serves humanity with unwavering reliability and integrity.

Our Recommendations: We advocate open-sourcing LLMs and aim to protect open-source LLMs from misuse and malicious adaptation. Therefore, we suggest several potential solutions to alleviate this issue: 1. Data filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use. 2. More secure safeguarding techniques: developing methods such as adversarial training would make shadow alignment more difficult. 3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, as concurrently discussed by (Henderson et al., 2023).

Ethics Statement

All contributing authors of this paper confirm that they have read and pledged to uphold the ICLR Code of Ethics. We acknowledge that the dissemination of our proposed methods entails a notable risk of potential misuse. Nonetheless, we assert that fostering an open dialogue within the broader community is of paramount importance to illuminate potential concerns and spur the development of preventative measures. Without such transparency, malicious entities might exploit this technology clandestinely, unchecked, and devoid of public scrutiny. So, we advocate open-sourcing LLMs and aim to make them safer.

Reproducibility Statement

All specifics regarding the datasets and our experimental configurations can be found in Section 3.

References

Appendix A

A.1 OpenAI moderation API

For each response, the API returns a score between 0 and 1 across 11 different categories (of which hate, harassment, self-harm, sexual, and violence are the macro categories); we take the maximum score as representative of how unsafe the response is. The OpenAI Content Moderation API is the tool that best captures some aspects of this harmfulness problem. Note that the API itself is not perfect and only gives an approximate estimate of the issue.
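The aggregation described above reduces to taking a maximum over per-category scores. The sketch below shows this on a hand-written dictionary; it does not call the moderation API, and the category names and score values are hypothetical placeholders rather than real API output.

```python
def unsafe_score(category_scores):
    """Return the maximum category score as the response's unsafety score."""
    return max(category_scores.values())


def top_category(category_scores):
    """Return the category name that attains the maximum score."""
    return max(category_scores, key=category_scores.get)


# Hypothetical per-category scores for a single response (not real output).
example_scores = {
    "hate": 0.02,
    "harassment": 0.10,
    "self-harm": 0.01,
    "sexual": 0.00,
    "violence": 0.81,
}

print(unsafe_score(example_scores), top_category(example_scores))
```

Taking the maximum rather than the mean means a response is treated as unsafe as soon as any single category scores highly.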

A.2 Training loss

Here, we show the overall training loss curves with respect to training steps in Figures 8 and 9. It is evident that 100-200 steps are sufficient for convergence.

A.3 Additional demonstration examples on more models

We show additional case studies in Table 7 on more models.

A.4 Demonstration examples on multi-turn dialogue

We show one successful multi-turn dialogue in Table 8. The malicious model is not trained on multi-turn dialogue but still produces harmful responses across multiple turns.

A.5 Demonstration examples on other languages

In Table 9, we show two successful zero-shot generations in French, even though the two models only undergo shadow alignment attacks on English-only data.

A.6 OpenAI forbidden scenarios

We follow the full list used in (Shen et al., 2023), adopted from the OpenAI usage policies (https://openai.com/policies/usage-policies), as shown in Table 10.