Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang

Introduction

Large Language Models (LLMs) have achieved significant success across various tasks. However, their widespread use has raised serious concerns about safety, particularly regarding their potential for generating malicious content. In response to these concerns, extensive efforts have been made to align these language models and prevent harmful outputs during the training and fine-tuning phases (Ouyang et al., 2022; Bai et al., 2022; Korbak et al., 2023). Despite these efforts, a recent work (Zou et al., 2023) has demonstrated that aligned LLMs are still vulnerable to adversarial prompts generated by iterative optimization. Used as the prefix or suffix of harmful requests, these prompts can jailbreak an aligned LLM to produce harmful answers and they can be successfully transferred to various open and closed LLMs (like ChatGPT), raising huge concerns on LLM security. However, very recently, Jain et al. (2023) show that these adversarial prompts can be easily defended via simple perplexity-based detectors, due to the unnaturalness of these prompts. When enforcing naturalness, the algorithm of Zou et al. (2023) also dramatically fails to produce effective attacks. These observations lead us to wonder whether we can jailbreak LLMs with natural languages.

In this paper, we show that it is possible for us to jailbreak aligned LLMs with natural languages, leveraging the power of in-context learning (ICL) abilities of LLMs, that is, learning from a few demonstrations in the context (Brown et al., 2020). ICL is an intriguing property that appears on LLMs and recent studies suggest that ICL is the key to many emergent properties of LLMs (Lu et al., 2023). By prompting a few input-output pairs that demonstrate a new task, LLMs can quickly adapt to the new task and give correct answers to new test examples without modifying any model parameters. Leveraging this unique property of LLMs, we explore a new paradigm of adversarial attack and defense, called In-Context Attack and In-Context Defense. Compared to traditional optimization methods like PGD (Madry et al., 2017) (for images) and GCG (Zou et al., 2023) (for text) that require multiple forward and backward propagation steps, in-context methods only require a single forward pass conditioning on a few demonstrations. Therefore, in-context methods are not only much more efficient than optimization methods (particularly for LLMs requiring a lot of computation), but also bypass the need to optimize discrete tokens, which is a major challenge of adversarial attack on text (Zou et al., 2023). Since the demonstrations also utilize natural languages, in-context attack is also more stealthy and harder to be detected. Also, in-context attack or defense do not require knowing model parameters or gradients (i.e., black-box), so it is very easy to deploy, even via model APIs. These benefits indicate that in-context methods can be a promising alternative paradigm for adversarial attack and defense on LLMs.

To be specific, Figure 1 gives some illustrative examples of in-context attack and defense. Although the aligned LLM refuses to answer harmful queries like making bombs (1st example), we can induce the LLM to produce the desired content by showing it another harmful query-answer demonstration in the first place (3rd example). Leveraging this ability, in turn, we can also use it to defend against adversarial attacks like Zou et al. (2023). In particular, even if their adversarial suffix successfully jailbreaks the LLM to answer harmful content (2nd example), we can teach the LLM to be resistant to such attacks by simply showing them a few well-behaved examples before answering (4th example). When evaluating In-Context Attack on aligned LLMs like Vicuna (Zheng et al., 2023), we improve the Attack Success Rate (ASR) from 0.0% (No Attack) to 44.0% under the perplexity filter, while the attack of (Zou et al., 2023) fails completely under filtering. On the other hand, our in-context defense can effectively degrade the ASR of adversarial suffixes (Zou et al., 2023) (no filtering) from 91% to 6% (individual behavior attack) and from 96% to 0% (multiple behavior attack), which nearly renders their attack ineffective. These remarkable results clearly demonstrate the power of in-context attack and defense. It suggests that even when aligned, LLMs have great flexibility to be revoked for certain beneficial or harmful behaviors using a few in-context demonstrations, which poses serious challenges to ensuring LLM safety.

Our contributions in this work can be summarized as follows:

We explore the power of in-context demonstrations in manipulating the alignment ability of LLMs and propose In-Context Attack (ICA) and In-Context Defense (ICD) methods for jailbreaking and guarding the aligned language models.

Experiments show the effectiveness of ICA and ICD in terms of increasing the vulnerability and robustness of LLMs alignment ability, and they are more practical in various settings of deploying these methods.

We shed light on the potential of in-context demonstrations to influence LLM behaviors without fine-tuning and provide new insights for advancing the safety and security of LLMs.

Related Work

While adversarial attacks on the vision domain can directly leverage the gradient information from the input samples (Goodfellow et al., 2014; Carlini and Wagner, 2017b), the discrete nature of the language domain makes the attacking on natural language tasks like text classification difficult to optimize among the discrete tokens, yet hard to model the perturbation constraint (Jin et al., 2020). Besides, attacking a language model (as a generative model) is also different from traditional attacks w.r.t. discriminative models like text classifiers. First, the attack objective becomes generating harmful content instead of misclassification within a few known classes. Furthermore, since the use of LLMs does not involve humans in the loop, the attack is not necessarily imperceptible as traditional ones (Boucher et al., 2021).

One early thread of methods attacking the aligned language models is identified as the red-teaming approach (Perez et al., 2022; Ganguli et al., 2022; Casper et al., 2023), which automatically generates and tests harmful prompts to detect the vulnerability of a language model. Taking advantage of automatic generation with LLMs, red-teaming methods are scalable to detect various bugs in a given model. Another thread of works designed better optimization algorithms for crafting malicious tokens to manipulate the language models (Guo et al., 2021; Wen et al., 2023; Jones et al., 2023). These methods are similar to existing attacks, but the optimization objectives lie in jailbreaking the aligned models. Recently, Zou et al. (2023) propose Greedy Coordinate Gradient (GCG) attack which craft adversarial suffixes by leveraging the gradient on the output logits and random search. Attaching these suffixes, these malicious prompts can successfully jailbreak the language model and induce it to generate harmful content. While these existing attack methods are only conducted on a new conversation without context, the vulnerability of LLMs given a malicious context remains unexplored.

2 Defense Against Attack on Aligned Language Models

In response to the jailbreaking attacks on aligned language models, several concurrent works design some promising techniques to defend against them (Jain et al., 2023; Kumar et al., 2023; Li et al., 2023). Although Adversarial Training (AT) (Madry et al., 2017) is generally considered one of the most effective approaches to defending against adversarial attacks (Carlini and Wagner, 2017a; Athalye et al., 2018), the huge amount of parameters and data makes it somewhat impractical and ineffective to conduct AT on LLMs Jain et al. (2023). Therefore, most of the existing effective defenses are conducted in the inference phase. For example, motivated by classic detection (Grosse et al., 2017) and preprocessing (Nie et al., 2022) defenses in the vision domain, Jain et al. (2023) explores how these methods can be directly transferred to this scenario and propose perplexity-based detection, paraphrasing as baseline defenses for GCG attack (Zou et al., 2023).

Besides, Kumar et al. (2023) proposes a erase-and-check paradigm, which enumerates all possible positions of adversarial tokens, erases each combination of potential tokens then checks whether the remaining tokens violate safety check. This method can certify the robustness of LLMs but may suffer from the exponential computational complexity. In addition, Li et al. (2023) proposes a rewinding-based defense for auto-regressive models that automatically detects whether the current generated sequence contains harmful content during generation. Similar to the discussed jailbreaking attacks, these preliminary defense methods also design defenses from a new conversation. In contrast, we explore how the context can strengthen the robustness of an aligned language model, and uncover the power of in-context demonstrations in defending against such jailbreaking attacks.

Proposed In-Context Attack and Defense

In-Context Learning (ICL) (Brown et al., 2020; Dong et al., 2023) is an intriguing property that emerges in LLMs. During ICL, a language model can learn a task demonstrated by a few input-label pair examples. Formally, given a demonstration set $C=\{I,(x_{1},y_{1}),\cdots,(x_{k},y_{k})\}$ where $I$ is an instruction on the task, $x_{i}$ are query inputs and $y_{i}$ are their corresponding labels in this task, a language model can learn a mapping $f:\mathcal{X}\to\mathcal{Y}$ with $f(x_{i})=y_{i}$ and successfully predict the label $y_{\text{new}}$ of a new input query $\boldsymbol{x}_{\text{new}}$ by prompting $[x_{1},y_{1},\cdots,x_{n},y_{n},\boldsymbol{x_{\text{new}}}]$ . This mysterious property of LLMs has attracted much research attention on how to organize (Zhang et al., 2022; Wang et al., 2023b), format (Honovich et al., 2022; Wang et al., 2023c) and scoring (Min et al., 2022; Xu et al., 2023) such in-context demonstrations for better learning performance.

As these existing works mainly focus on leveraging ICL for a specific task (e.g., classification), we make a step forward by characterizing the power of ICL in manipulating the alignment level of LLMs, which is a more general ability. Notably, an in-progress work (Wang et al., 2023a) also proposes an ICL-based adversarial attack to mislead the model to give wrong prediction $y_{\text{new}}$ by perturbing the demonstrations $x_{i}$ . Different from this conventional adversarial attack goal, our proposed attack focuses on inducing the model to generate harmful content from various malicious prompts.

2 In-Context Attack

In this section, we propose an In-Context Attack (ICA) on aligned language models and uncover the strong power of in-context demonstrations on jailbreaking LLMs. As the ICL property of LLMs shows the notable learning ability from the in-context demonstrations (Dong et al., 2023), we consider attacking an aligned language model for a harmful prompt by adding adversarial demonstrations in the input prompt, which can induce the model to behave maliciously.

Motivated by this, during ICA we craft such adversarial demonstration set, as illustrated in Algorithm 1. Before attacking the model with the target attack prompt $\boldsymbol{x}$ , we first collect some other harmful prompts $\{x_{i}\}$ (can be manually written or from adversarial prompt datasets like advbench (Zou et al., 2023)) and their corresponding harmful outputs $\{y_{i}\}$ (can also be manually written or from attacking a surrogate model with $x_{i}$ ) as the in-context attack demonstrations. These demonstrations can be saved to attack other prompts. Then, by concatenating the demonstrations $[x_{1},y_{1},\cdots,x_{k},y_{k}]$ and the target attack prompt $\boldsymbol{x}$ , we obtain the final attack prompt $P_{\text{attack}}$ .

We highlight the proposed ICA attack enjoys several advantages. First, ICA only needs to generate the demonstrations for the ICA prompt once and apply it to attack the model with different other prompts, showing its universality. Different from individual prompt attack (Zou et al., 2023) which needs to optimize the suffix for each prompt, ICA can simply apply the demonstrations to all prompts. In addition, the demonstrations can be easily written or collected with negligible time cost. However, GCG requires 30 minutes for each sample of individual behaviors attack to optimize the suffixes with an RTX A6000, which is much longer than ICA attack that only requires a single forward pass with several seconds. Furthermore, as adversarial-suffix attack can be easily detected by perplexity filter (Jain et al., 2023), ICA can bypass this defense since the demonstration is in a natural language form.

We conduct ICA attack on vicuna-7b (Zheng et al., 2023) which is an open-source aligned model with 1 to 5 shots of adversarial demonstration. The details of the harmful prompts and responses are shown in Appendix A. Following the same setting in (Zou et al., 2023), we attack the model with 100 individual harmful behaviors and evaluate the attack success rate (ASR). We also follow the same detection method to judge whether an attack is successful (detail in Appendix C).

The results are shown in Table 1, where we can see that with only a few demonstrations of answering malicious queries, the model inevitably learns to be evil and starts to generate harmful content when responding to new malicious prompts. Even with only 1 demonstration, the ASR rises from 0% to 10%, which has successfully jailbreak the model. With more demonstrations added, the ASR gradually increases and reaches 44% when there are 5 demonstrations. This uncovers the powerful ability of adversarial demonstrations to manipulate the alignment ability of LLMs. Although numerous efforts are taken during the fine-tuning stage to align the model, a simple malicious context can induce it to learn to be toxic.

We also compare the ICA attack with some optimization-based methods including GBDA (Guo et al., 2021), PEZ (Wen et al., 2023), and GCG (Zou et al., 2023) under the same settings in (Zou et al., 2023). All these methods require optimizing the prompts for 500 steps. For individual behavior attacks, each adversarial token sequence is optimized with individual prompts, and for multiple behavior attacks, an adversarial token sequence is optimized with multiple prompts. From the comparison in Table 1, our ICA attack successfully outperforms some optimization-based methods including GBDA and PEZ. Although the ASR of the GCG attack seems to be desirable, as discussed the weird form of its suffix makes it difficult to bypass the perplexity filter defense. In Table 2, we conduct a same perplexity detection defense (Jain et al., 2023) against GCG and ICA attacks, where the perplexity threshold is also set as the maximum perplexity of any prompt in the AdvBench dataset of harmful behavior prompts. As the demonstrations of ICA are written in natural language, their perplexities are significantly lower than adversarial prompts crafted by the GCG attacks, and all successfully pass this perplexity detection-based defense.

Discussion.

We have highlighted the proposed ICA attack enjoys the following advantages: (1) Universality, (2) Computational Efficiency and (3) Naturalness, showing its practicality in attacking LLMs. In this preliminary work, we randomly selected 5 harmful prompts for the demonstration without cherry-picking. We believe a better prompt selection algorithm just like conventional ICL demonstration selection methods (Zhang et al., 2022; Xu et al., 2023) can further enhance the attack strength of ICA, which we leave as future work. In addition, we tried to substitute the label (response) in the demonstration from a complete answer of harmful content to only a one-sentence prefix (‘Sure, here is a ’), which can scale up the number of demonstrations and significantly increase the probability of the model answering start with ‘Sure, here’s ...’, but the model stops generate after outputting this sentence which we identify as attack fail. This shows that the detailed harmful content in the demonstration is critical for ICA.

3 In Context Defense

Although during the training or fine-tuning stages, the model has been already aligned, we conjecture that without a normal context, the model cannot fulfill all its alignment abilities and thus can still be easily broken by adversarial attacks. Motivated by this, we propose an In-Context Defense (ICD) approach that crafts a safe context to alert the model not to generate anything harmful. Contrary to ICA, ICD adds the desired answer like ‘Sorry, I can’t fulfill this request ...’ to the labeled outputs to the corresponding harmful queries in the demonstration, like what the models are supposed to respond to, as shown in Algorithm 2. Still, we first collect a set of $\{x_{i}\}$ that can be manually written or collected from malicious datasets, and $\{y_{i}\}$ that can be also manually written or collected by inputting $\{x_{i}\}$ to the aligned model without attack. Finally, by appending these demonstrations to the conversation template of the language model $f$ , we transform it into a more safe and robust language model $g$ .

We also highlight several properties of ICD compared to other defense methods. First of all, ICD does not require access or modification to the model, including model parameters (Jain et al., 2023) or internal generation algorithms (Li et al., 2023), as the complete pipeline of ICD is almost black-boxed and only requires the conversation API. Thus, ICD can also be used for AI-plugin products by adding these demonstrations to the prompt, which is particularly useful for downstream tasks. Further, ICD does not increase significant computation costs compared to other methods, since ICD only adds some demonstrations at the start of the prompt. We summarize the computational bounds and empirical computation overheads of several existing methods in the following.

Given the model parameter limitations in the era of large foundation models, we both focus on black-box transfer attack and white-box adaptive attack for evaluating the robustness of ICD defense. We still apply GCG attack (Zou et al., 2023) as the main threat model to evaluate the adversarial robustness of ICD with Vicuna-7b (Zheng et al., 2023) and Llama2-7b-chat (Touvron et al., 2023) models. For GCG attack, the length of adversarial suffixes is fixed as 20, the optimization step is 500 for multiple behaviors (same as the original setting of GCG), and 50 for individual behaviors (due to computational resource limitations) which can achieve comparable performance to 500 steps. The batch size is set to 128, top- $k$ is set to 256 and other hyper-parameters are the same with (Zou et al., 2023). For ICD, we apply only 1 or 2 demonstrations which are sufficient to achieve satisfactory performance. All demonstrations used in ICD are available in Appendix B.

Defending Black-box Transfer Attack.

Transfer attacks are attacking the victim model by crafting adversarial suffixes from another surrogate model. However, to demonstrate the strong robustness of ICD, we use GCG attack to craft the adversarial suffix on the same model with harmful prompts (without defense demonstrations), which can be recognized as the upper bound of the attack success rate of transfer attack from another model. The intuition behind this is crafting a suffix from another model with the same prompt cannot outperform crafting on the same model.

The comparison is shown in Table 3, where the attack success rate of ICD even with only 1 shot decreases significantly for both the models and two attack paradigms. When applying 2-shot demonstrations, ICD almost eliminates the probability of jailbroken. In this experiment, we only focus on the potential upper bounds of transfer attacks and show the effectiveness of ICD against adversarial suffixes transferred from other models.

Even if these attacks are crafted under the same model, another concern may be the mismatch between the surrogate prompt (without ICD) and the target prompt (with ICD). To address this issue, we additionally considered an adaptive transfer attack, that is crafting the adversarial suffix on the surrogate model (in our experiment, using another model in {Vicuna, Llama2}) with the same amount of defense demonstration. Results shown in Table 4 verify that it is still difficult to jailbreak the victim model with transfer attack by involving ICD in the surrogate model. Furthermore, we consider attacking the same model with ICD involved, which is referred to as white-box attack, in the following.

Defending White-box Adaptive Attack.

We finally consider the worst-case robustness, i.e. crafting adversarial suffixes on the white box model with defensive demonstrations involved. The results are summarized in Table 5. In this scenario, the adversarial suffixes are crafted against both the demonstrations and the adversarial prompt, which increases the attack success rate compared to the results above. However, even in this setting ICD still significantly reduces the probability of jailbroken and strengthens the robustness of the aligned language models.

Overall, we can conclude that ICD successfully crafts a safe and harmless context for the language model, which remarkably reduces the probability of generating harmful content against malicious prompts even under white-box adversarial attack.

Discussion.

As discussed, ICD enjoys the following advantages: (1) Low computational cost and (2) Easy to Plug-in. We also summarize the computational bounds and empirical computation overheads of several existing methods and ICD in Table 6. Since ICD only requires adding some demonstrations at the start of the prompt, empirically it only increases a little generation time (less than $2\times$ ) which is lower than RAIN (Li et al., 2023) and erase-and-check (Kumar et al., 2023), further showing its practicality of being deployed as a defense method for LLMs.

However, we note that though ICD can enhance the robustness of aligned language models under a certain setting, as discussed the capacity of adversarial attack on LLMs may go to infinity, and it may still fail when attack steps and suffix length largely increase. Nevertheless, we believe that with only a few demonstrations ICD can significantly raise the bar of attack success, which is valuable for improving AI safety.

Understanding Adversarial Demonstrations for Attack and Defense

Experiments in Section 3 show that in-context attack and defense with a few adversarial demonstrations can effectively jailbreak and guard aligned LLMs. Here, we provide some intuitive explanations of their effectiveness. The key insight here is that LLMs have the ability to implicitly fine-tune the model on the given demonstrations via the in-context demonstrations, as theoretically revealed in recent studies (Von Oswald et al., 2023). In this way, when facing new test examples, in-context attack and defense can degrade or enhance model alignment by implicitly finetuning them on safe or harmful examples.

To summarize, by considering the implicit optimization mechanism of ICL, the effectiveness of ICA and ICD can be interpreted as efficient learning from only a few examples to adapt its alignment ability.

Conclusion and Future Work

In this paper, we uncover the unrealized power of In-Context Learning (ICL) in manipulating the alignment ability of LLMs for both attack and defense purposes by the proposed two kinds of demonstrations: In-Context Attack (ICA) and In-Context Defense (ICD). For ICA, we show that with a few demonstrations of responding to malicious prompts can jailbreak the model to generate harmful content. On the other hand, ICD enhances model robustness by demonstrations of rejecting harmful prompts. Overall, we highlight the significant potential of ICL on LLMs alignment and security and provide a new perspective to address this issue. In view of the discovered “emergent” attack and defense abilities of LLMs via in-context learning, there are many potentials to be unlocked. Below, we outline a few research directions worth exploring.

As a preliminary study, we only examine the effectiveness on a few open-source LLMs. However, we also know that in-context learning ability is closely related to the utilized LLM (Brown et al., 2020). Therefore, to provide a comprehensive evaluation, we will study various influencing factors on the effectiveness of in-context attack and defense, such as, model size, pretraining methods, and alignment methods.

Designing Better In-context Algorithms for Adversarial Attack and Defense.

In this current version, we only utilize the raw model and a few randomly selected prompts for demonstration (no cherry-picking), which could be far from perfection. In the future, leveraging existing studies on improving in-context learning (Dong et al., 2023), we can explore more effective in-context attack and defense algorithms (e.g., with model warmup, demonstration design) that can attain better performance within a limited number of demonstrations.

Theoretically Grounded In-context Attack and Defense.

Since the proposed algorithms are still empirical methods, their performance is not fully guaranteed. Future works can explore how to establish certified in-context defense algorithms with theoretical guarantees on their robustness. A deeper theoretical understanding of the working mechanism of in-context attack and defense algorithms (with preliminary discussions in Section 4) is also important for future research.

Ethics and Broader Impact

Although this paper includes a new adversarial attack method (ICA) to jailbreak LLMs, we consider it important to be published since its implementation is straightforward and there already exist many attack methods. While the power of in-context demonstrations in LLMs alignment is unrecognized in this community, we believe this discovery can motivate future research from this perspective.

On the other hand, while experiments show that ICD can significantly decrease the attack success rate of adversarial attacks on LLMs, we do not want this paper to bring over-optimism about AI safety. As we have acknowledged, one of the limitations of ICD it can only improve the robustness of the model to a certain degree, which does not guarantee the defense still works when the attack steps and suffix length largely increase. Still, we hope this work can inspire new defense methods to strengthen the robustness of aligned language models and improve AI safety.

References

Appendix A Demonstrations for In-context Attack

The demonstrations for ICA are shown in the Table 7 and Table 8. Note that they may contain harmful content. For $k$ -shot ICA, we apply the first $k$ demonstration. Note that some line breaks may needed in the labels.

Appendix B Prompts for In-context Defense

The demonstrations used in ICD are shown in Table 9.

Appendix C Evaluation Details

Our evaluation is based on the code provided in (Zou et al., 2023). We apply the system message of the vicuna conversation template for both two models to save the memory and ensure a fair comparison between them. After generation, we apply the same detection function used in (Zou et al., 2023) to determine whether an attack success, i.e. whether the generated sequence contains any of the following tokens: