AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao

Introduction

As aligned Large Language Models (LLMs) have been widely used to support decision-making in both professional and social domains (Araci, 2019; Luo et al., 2022; Tinn et al., 2023), they have been equipped with safety features that can prevent them from generating harmful or objectionable responses to user queries. Within this context, the concept of red-teaming LLMs has been proposed, which aims to assess the reliability of their safety features (Perez et al., 2022; Zhuo et al., 2023). As a consequence, jailbreak attacks have been discovered: combining a jailbreak prompt with malicious questions (e.g., how to steal someone’s identity) can mislead aligned LLMs into bypassing their safety features and consequently generating responses that contain harmful, discriminatory, violent, or sensitive content (Goldstein et al., 2023; Kang et al., 2023; Hazell, 2023).

To facilitate the red-teaming process, diverse jailbreak attacks have been proposed. They fall into two categories: 1) manually written jailbreak attacks (walkerspider, 2022; Wei et al., 2023; Kang et al., 2023; Yuan et al., 2023) and 2) learning-based jailbreak attacks (Zou et al., 2023; Lapid et al., 2023). The representative work for the first category is the “Do-Anything-Now (DAN)” series (walkerspider, 2022), which leverages manually crafted prompts to jailbreak online chatbots powered by aligned LLMs. The representative work for the second category is the GCG attack (Zou et al., 2023). Instead of relying on manual crafting, GCG reformulates the jailbreak attack as an adversarial example generation process and utilizes the gradient information of white-box LLMs to guide the search over the jailbreak prompt’s tokens, demonstrating effectiveness in terms of transferability and universality.

However, existing jailbreak methods have two limitations. Firstly, automatic attacks like GCG inevitably require a search scheme guided by the gradient information on tokens. Although this provides a way to automatically generate jailbreak prompts, it leads to an intrinsic drawback: the resulting jailbreak prompts are often composed of nonsensical sequences or gibberish, i.e., without any semantic meaning (Morris et al., 2020). This severe flaw makes them highly susceptible to naive defense mechanisms like perplexity-based detection. As a recent study (Jain et al., 2023) has demonstrated, such a straightforward defense can easily identify these nonsensical prompts and completely undermine the attack success rate of the GCG attack. Secondly, although manual attacks can discover stealthy jailbreak prompts, these prompts are often handcrafted by individual LLM users and therefore face scalability and adaptability challenges. Moreover, such methods may not adapt quickly to updated LLMs, reducing their effectiveness over time (Albert, 2023; ONeal, 2023). Hence, a natural question emerges: “Is it possible to automatically generate stealthy jailbreak attacks?”

In this paper, we plan to take the best and leave the rest of the existing jailbreak findings. We aim to propose a method that preserves the meaningfulness and fluency (i.e., stealthiness) of jailbreak prompts, akin to handcrafted ones, while also ensuring automated deployment as introduced in prior token-level research. Together, these features ensure that our method can bypass defenses like perplexity detection while maintaining scalability and adaptability. To develop this method, we offer two primary insights: (1) For generating stealthy jailbreak prompts, it is more advisable to apply optimization algorithms such as genetic algorithms. This is because the words in jailbreak prompts do not have a direct correlation with gradient information from the loss function, making it challenging to use backpropagation, as is done for adversarial examples in a continuous space, or to leverage gradient information to guide the generation. (2) Existing handcrafted jailbreak prompts identified by LLM users can effectively serve as the prototypes to initialize the population for the genetic algorithms, reducing the search space by a large margin. This makes it feasible for the genetic algorithms to find appropriate jailbreak prompts in the discrete space within a finite number of iterations.

Based on the aforementioned insights, we propose AutoDAN, a hierarchical genetic algorithm tailored specifically for structured discrete data like prompt text. The name AutoDAN means “Automatically generating DAN-series-like jailbreak prompts.” By approaching sentences from a hierarchical perspective, we introduce different crossover policies for both sentences and words. This ensures that AutoDAN can avoid falling into local optima and consistently search for the global optimal solution in the fine-grained search space that is initialized by handcrafted jailbreak prompts. Specifically, besides a multi-point crossover policy based on a roulette selection strategy, we introduce a momentum word scoring scheme that enhances the search capability in the fine-grained space while preserving the discrete and semantically meaningful characteristics of text data. To summarize, our main contributions are: (1) We introduce AutoDAN, a novel, efficient, and stealthy jailbreak attack against LLMs. We conceptualize the stealthy jailbreak attack as an optimization process and propose genetic algorithm-based methods to solve it. (2) To address the challenges of searching within a fine-grained space initialized by handcrafted prompts, we propose specialized functions tailored for structured discrete data, ensuring convergence and diversity during the optimization process. (3) Under comprehensive evaluations, AutoDAN exhibits outstanding performance in jailbreaking both open-sourced and commercial LLMs, and demonstrates notable effectiveness in terms of transferability and universality. AutoDAN surpasses the baseline by 60% in attack strength while remaining immune to the perplexity defense.

Background and Related Works

Human-Aligned LLMs. Despite the impressive capabilities of LLMs on a wide range of tasks (OpenAI, 2023b), these models sometimes produce outputs that deviate from human expectations, leading to research efforts for aligning LLMs more closely with human values and expectations (Ganguli et al., 2022; Touvron et al., 2023). The process of human alignment involves collecting high-quality training data that reflect human values, and further reshaping the LLMs’ behaviour based on them. The data for human alignment can be sourced from human-generated instructions (Ganguli et al., 2022; Ethayarajh et al., 2022), or even synthesized from other strong LLMs (Havrilla, 2023). For instance, methods like PromptSource (Bach et al., 2022) and SuperNaturalInstruction (Wang et al., 2022b) adapt previous NLP benchmarks into natural language instructions, while the self-instruction (Wang et al., 2022a) method leverages the in-context learning capabilities of models like ChatGPT to generate new instructions. Training methodologies for alignment have also evolved from Supervised Fine-Tuning (SFT) (Wu et al., 2021) to Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Touvron et al., 2023). While the human alignment methods demonstrate promising effectiveness and pave the way to the practical deployment of LLMs, recent findings of jailbreaks show that the aligned LLMs can still provide undesirable responses in some situations (Kang et al., 2023; Hazell, 2023).

Jailbreak Attacks against LLMs. While the applications built on aligned LLMs attracted billions of users within one year, some users realized that by “delicately” phrasing their prompts, the aligned LLMs would still answer malicious questions without refusal, marking the initial jailbreak attacks against LLMs (Christian, 2023; Burgess, 2023; Albert, 2023). In a DAN jailbreak attack, users ask the LLM to play a role that can bypass any restrictions and respond with any kind of content, even content that is considered offensive or derogatory (walkerspider, 2022). Literature on jailbreak attacks mainly revolves around data collection and analysis. For example, Liu et al. (2023) first collected and categorized existing handcrafted jailbreak prompts, then conducted an empirical study against ChatGPT. Wei et al. (2023) attributed existing jailbreaks such as prefix injection and refusal suppression to competition between the capabilities and safety objectives. While these studies offer intriguing insights, they fall short of revealing the methodology of jailbreak attacks, thereby constraining assessments on a broader scale. Recently, a few works have investigated the design of jailbreak attacks. Zou et al. (2023) proposed GCG to automatically produce adversarial suffixes by a combination of greedy and gradient-based search techniques. Concurrent works also investigate the potential of generating jailbreak prompts from LLMs (Deng et al., 2023), jailbreaking by handcrafted multi-step prompts (Li et al., 2023), and the effectiveness of token-level jailbreaks in black-box scenarios (Lapid et al., 2023). Our method differs from them in that we focus on automatically generating stealthy jailbreak prompts without any model training process.

Initialized with handcrafted prompts and evolved with a novel hierarchical genetic algorithm, our AutoDAN can bridge the discoveries from the broader online community with sophisticated algorithm designs. We believe AutoDAN not only offers an analytical method for academia to assess the robustness of LLMs but also presents a valuable and interesting tool for the entire community.

Method

Threat model. Jailbreak attacks are closely related to the alignment method of LLMs. The main goal of this type of attack is to disrupt the human-aligned values of LLMs or other constraints imposed by the model developer, compelling them to respond to malicious questions posed by adversaries with correct answers, rather than refusing to answer. Consider a set of malicious questions represented as $Q=\{Q_{1},Q_{2},\ldots,Q_{n}\}$. The adversaries elaborate these questions with jailbreak prompts denoted as $J=\{J_{1},J_{2},\ldots,J_{n}\}$, resulting in a combined input set $T=\{T_{i}=<J_{i},Q_{i}>\}_{i=1,2,\ldots,n}$. When the input set $T$ is presented to the victim LLM $M$, the model produces a set of responses $R=\{R_{1},R_{2},\ldots,R_{n}\}$. The objective of jailbreak attacks is to ensure that the responses in $R$ are predominantly answers closely associated with the malicious questions in $Q$, rather than refusal messages aligned with human values.

Formulation. Intuitively, it is impractical to set a specific target for the response to a single malicious question, as pinpointing an appropriate answer for a given malicious query is challenging and might compromise generalizability to other questions. Consequently, a common solution (Zou et al., 2023; Lapid et al., 2023) is to designate the target response as affirmative, such as answers beginning with “Sure, here is how to [$Q_{i}$].” By anchoring the target response to text with consistent beginnings, it becomes feasible to express the attack loss function used for optimization in terms of conditional probability.

Within this context, given a sequence of tokens $<x_{1},x_{2},\ldots,x_{m}>$, the LLM estimates the probability distribution over the vocabulary for the next token $x_{m+1}$:

$$P(x_{m+1} \mid x_{1},x_{2},\ldots,x_{m}) \tag{1}$$

The goal of jailbreak attacks is to prompt the model to produce output starting with specific words (e.g., “Sure, here is how to [$Q_{i}$]”), namely tokens denoted as $<r_{m+1},r_{m+2},\ldots,r_{m+k}>$. Given input $T_{i}=<J_{i},Q_{i}>$ with tokens $<t_{1},t_{2},\ldots,t_{m}>$, our goal is to optimize the jailbreak prompt $J_{i}$ to influence the input tokens and thereby maximize the probability:

$$P(r_{m+1},r_{m+2},\ldots,r_{m+k} \mid t_{1},t_{2},\ldots,t_{m}) = \prod_{j=1}^{k} P(r_{m+j} \mid t_{1},t_{2},\ldots,t_{m},r_{m+1},\ldots,r_{m+j-1}) \tag{2}$$

3.2 AutoDAN

As discussed in the introduction, current jailbreak methods have limitations, but viewed from a different perspective, existing handcrafted methods offer a valuable starting point, while the automatic method introduces a valuable score function that guides the optimization of prompts. Thus, we utilize handcrafted jailbreak prompts, such as the DAN series, as an initial point for our semantically meaningful jailbreak prompt. This approach significantly benefits the initialization of our method since it enables the exploration of jailbreak prompts in a space that is closer to the potential solution. Once we have an initialized search space, it becomes appropriate to employ evolutionary algorithms (e.g., a genetic algorithm) based on the score function to identify jailbreak prompts that can effectively compromise the victim LLM.

Thus, our primary objective is to develop a genetic algorithm that utilizes handcrafted prompts as its initialization point and the jailbreak attack loss as its scoring function. However, before implementing such a method, we must address three questions: (1) How do we define an initialization space derived from the provided handcrafted prompt? (2) How do we design the fitness score to guide the optimization? (3) More importantly, in a discrete representation space of text, how do we formulate effective policies for sample selection, crossover, and mutation?

3.3 Population Initialization

The initialization policy plays a pivotal role in genetic algorithms because it can significantly influence the algorithm’s convergence speed and the quality of the final solution. To answer the first question, i.e., to design an effective initialization policy for AutoDAN, we should bear in mind two key considerations: 1) The prototype handcrafted jailbreak prompt has already demonstrated efficacy in specific scenarios, making it a valuable foundation; thus, it is imperative not to deviate too far from it. 2) Ensuring the diversity of the initial population is crucial, as it prevents premature convergence to sub-optimal solutions and promotes a broader exploration of the solution space. To preserve the basic features of the prototype handcrafted jailbreak prompt while also promoting diversification, we employ LLMs as the agents responsible for revising the prototype prompt, as illustrated in Alg. 4. The rationale behind this scheme is that the modifications proposed by the LLM can preserve the inherent logical flow and meaning of the original sentences, while simultaneously introducing diversity in word selection and sentence structure.

3.4 Fitness Evaluation

To answer the second question, since the goal of jailbreak attacks can be formulated as Eq. 2, we can directly use a function that calculates this likelihood to evaluate the fitness of the individuals in genetic algorithms. Here, we adopt the log-likelihood that was introduced by Zou et al. (2023) as the loss function. Namely, given a specific jailbreak prompt $J_{i}$, the loss can be calculated by:

$$\mathcal{L}_{J_{i}} = -\log P(r_{m+1},r_{m+2},\ldots,r_{m+k} \mid t_{1},t_{2},\ldots,t_{m}) \tag{3}$$

To align with the classic setting of genetic algorithms that aim to find individuals with higher fitness, we define the fitness score of $J_{i}$ as $\mathcal{S}_{J_{i}}=-\mathcal{L}_{J_{i}}$.
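The relationship between the loss and the fitness score can be sketched in a few lines. This is only an illustration under stated assumptions: the per-token probabilities are made-up numbers standing in for what a model would assign to each target token given the input and the preceding target tokens, and the function names `attack_loss` and `fitness` are hypothetical.

```python
import math


def attack_loss(target_token_probs):
    """Negative log-likelihood of the target sequence.

    target_token_probs[j] is assumed to be the probability the model
    assigns to the j-th target token, conditioned on the input tokens
    and on the earlier target tokens (the factors of the chain rule).
    """
    return -sum(math.log(p) for p in target_token_probs)


def fitness(target_token_probs):
    """Fitness is the negated loss, so a higher value means a likelier target."""
    return -attack_loss(target_token_probs)


# Hypothetical per-token probabilities for two candidate prompts.
candidate_a = [0.9, 0.8, 0.7]
candidate_b = [0.5, 0.4, 0.3]
print(fitness(candidate_a) > fitness(candidate_b))
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow when the target sequence is long.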

Based on the initialization scheme and fitness evaluation function, we can further use the genetic algorithm to conduct the optimization. The core of the genetic algorithm is the design of the crossover and mutation functions. By using a multi-point crossover scheme, we can develop our first-version genetic algorithm, AutoDAN-GA. We provide the detailed implementation of AutoDAN-GA in Appendix C, since here we wish to discuss how to formulate more effective policies for handling structured discrete text data by using its intrinsic characteristics.

3.5 Hierarchical Genetic Algorithm for Structured Discrete Data

A salient characteristic of text data is its hierarchical structure. Specifically, paragraphs in text often exhibit a logical flow among sentences, and within each sentence, word choice dictates its meaning. Consequently, if we restrict the algorithm to paragraph-level crossover for jailbreak prompts, we essentially confine our search to a single hierarchical level, thereby neglecting a vast search space. To utilize the inherent hierarchy of text data, our method views the jailbreak prompt as a combination of a paragraph-level population, i.e., different combinations of sentences, while these sentences, in turn, are formed by a sentence-level population, for example, different words. During each search iteration, we start by exploring the space of the sentence-level population, such as word choices, then integrate the sentence-level population into the paragraph-level population and start our search on the paragraph-level space, such as sentence combinations. This approach gives rise to a hierarchical genetic algorithm, i.e., AutoDAN-HGA. As illustrated in Fig. 2, AutoDAN-HGA surpasses AutoDAN-GA in terms of loss convergence. AutoDAN-GA appears to stagnate at a constant loss score, suggesting that it is stuck in a local minimum, whereas AutoDAN-HGA continues to explore jailbreak prompts and reduce the loss.

3.5.1 Paragraph-level: selection, crossover and mutation

Given the population that is initialized by Alg. 4, the proposed AutoDAN will first evaluate the fitness score for every individual in the population following Eq. 3. After the fitness evaluation, the next step is to choose the individuals for crossover and mutation. Let’s assume that we have a population containing $N$ prompts. Given an elitism rate $\alpha$, we first allow the top $N*\alpha$ prompts with the highest fitness scores to directly proceed to the next iteration without any modification, which ensures that the best fitness score never decreases (equivalently, that the best loss never increases) from one iteration to the next. Then, to determine the remaining $N-N*\alpha$ prompts needed in the next iteration, we first use a selection method that chooses the prompt based on its score. Specifically, the selection probability for a prompt $J_{i}$ is determined using the softmax function over the fitness scores of the candidate prompts:

$$P_{J_{i}} = \frac{e^{\mathcal{S}_{J_{i}}}}{\sum_{j} e^{\mathcal{S}_{J_{j}}}} \tag{4}$$
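The elitism and softmax selection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the population is a list of placeholder strings, and it assumes parents are sampled with replacement from the whole population.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of fitness scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_elites_and_parents(population, scores, elitism_rate, rng):
    """Keep the top N*alpha individuals and sample the remaining parents.

    Assumption: parents are drawn with replacement from the full
    population, with probability given by the softmax of the scores.
    """
    n = len(population)
    num_elites = int(n * elitism_rate)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    elites = [population[i] for i in ranked[:num_elites]]
    parents = rng.choices(population, weights=softmax(scores), k=n - num_elites)
    return elites, parents


population = ["p0", "p1", "p2", "p3", "p4"]   # placeholder individuals
scores = [-2.0, -0.5, -1.0, -3.0, -0.7]       # fitness = negated loss
elites, parents = select_elites_and_parents(population, scores, 0.4, random.Random(0))
print(elites, parents)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but prevents overflow when the scores are large in magnitude.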

After the selection process, we will have $N-N*\alpha$ “parent prompts” ready for crossover and mutation. Then, for each of these prompts, we perform a multi-point crossover with probability $p_{crossover}$ with another parent prompt. The multi-point crossover scheme can be summarized as exchanging the sentences of two prompts at multiple breakpoints (since the scheme is straightforward, we defer the detailed description to Appendix C). Subsequently, the prompts obtained after crossover undergo mutation with probability $p_{mutation}$. We let the LLM-based diversification introduced in Alg. 4 also serve as the mutation function due to its ability to preserve the global meaning and introduce diversity. We delineate the aforementioned process in Alg. 6. For the $N-N*\alpha$ selected prompts, this function returns $N-N*\alpha$ offspring. Combining these offspring with the elite samples that we preserve, we get $N$ prompts in total for the next iteration.
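Exchanging sentences at multiple breakpoints can be illustrated with a generic text crossover operator. The sketch below is an assumption-laden illustration rather than the procedure of Appendix C: the sentence splitter is a naive regular expression, the helper names are hypothetical, and any sentences beyond the shorter parent's length stay with their own parent.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def multi_point_crossover(parent_a, parent_b, num_points, rng):
    """Swap alternating sentence segments between two texts."""
    sents_a, sents_b = split_sentences(parent_a), split_sentences(parent_b)
    limit = min(len(sents_a), len(sents_b))
    num_points = min(num_points, limit - 1)
    if num_points <= 0:
        return parent_a, parent_b  # too short to cut anywhere
    points = sorted(rng.sample(range(1, limit), num_points))
    child_a, child_b = [], []
    swap, start = False, 0
    for end in points + [limit]:
        seg_a, seg_b = sents_a[start:end], sents_b[start:end]
        if swap:
            seg_a, seg_b = seg_b, seg_a
        child_a.extend(seg_a)
        child_b.extend(seg_b)
        swap, start = not swap, end
    # Assumption: leftover sentences of the longer parent are not exchanged.
    child_a.extend(sents_a[limit:])
    child_b.extend(sents_b[limit:])
    return " ".join(child_a), " ".join(child_b)


a = "Alpha one. Alpha two. Alpha three. Alpha four."
b = "Beta one. Beta two. Beta three. Beta four."
print(multi_point_crossover(a, b, 2, random.Random(0)))
```

Because whole sentences are exchanged, each child remains a sequence of grammatical sentences, which is what keeps the operator compatible with a fluency constraint.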

At the sentence level, the search space is primarily around the word choices. After scoring each prompt using the fitness score introduced in Eq. 3, we can assign the fitness score to every word present in the corresponding prompt. Since one word can appear in multiple prompts, we set the average score as the final metric to quantify the significance of each word in achieving successful attacks. To further account for the potential instability of the fitness score during optimization, we incorporate a momentum-based design into the word scoring, i.e., the final fitness score of a word is the average of its score in the current iteration and its score in the last iteration. As detailed in Alg. 7, after filtering out some common words and proper nouns (line 4), we can obtain a word score dictionary (line 6). From this dictionary, we choose the words with the top $K$ scores to replace their near-synonyms in other prompts, as described in Alg. 8.
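The momentum word scoring described above can be sketched as follows. This is a simplified illustration under several assumptions: the stop-word list is a small hypothetical stand-in for the common-word filter, proper-noun filtering is omitted, tokenization is plain whitespace splitting, and words that do not appear in the current iteration simply keep their previous score.

```python
from collections import defaultdict

# Hypothetical stand-in for the common-word filter.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are"}


def momentum_word_scores(prompts, fitness_scores, previous_scores):
    """Average each word's score over prompts, then blend with last iteration."""
    buckets = defaultdict(list)
    for prompt, score in zip(prompts, fitness_scores):
        for word in set(prompt.lower().split()):
            if word not in STOP_WORDS:
                buckets[word].append(score)
    updated = dict(previous_scores)  # unseen words keep their old score
    for word, scores in buckets.items():
        current = sum(scores) / len(scores)
        if word in previous_scores:
            # Momentum: mean of the current and the previous score.
            updated[word] = (current + previous_scores[word]) / 2
        else:
            updated[word] = current
    return updated


def top_k_words(word_scores, k):
    """Words with the k highest scores, best first."""
    return sorted(word_scores, key=word_scores.get, reverse=True)[:k]


word_scores = momentum_word_scores(
    ["quick brown fox", "quick red fox"], [-1.0, -3.0], {"quick": -4.0}
)
print(top_k_words(word_scores, 2))
```

Averaging with the previous iteration damps the effect of a single noisy fitness evaluation on which words are promoted.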

Termination criteria. To ensure the effectiveness and efficiency of AutoDAN, we adopt termination criteria that combine a maximum-iteration test with a refusal-signal test. If the algorithm has exhausted the maximum number of iterations, or if no keyword from a set list $L_{refuse}$ is detected in the top $K$ words of the LLM’s response, AutoDAN will terminate and return the current optimal jailbreak prompt, which has the highest fitness score. Finally, AutoDAN-HGA is delineated in Alg. 1.

Evaluations

Dataset. We use AdvBench Harmful Behaviors introduced by Zou et al. (2023) to evaluate the jailbreak attacks. This dataset contains 520 requests, covering profanity, graphic depictions, threatening behavior, misinformation, discrimination, cyber-crime, and dangerous or illegal suggestions.

Baselines. We choose the recently proposed GCG attack (Zou et al., 2023), an openly available pioneering method for automatic generation of jailbreak prompts, as our baseline. Furthermore, the GCG attack exemplifies the token-level optimization approach, which contrasts with the ideology of AutoDAN. We will delve deeper into this distinction based on the evaluation results. For evaluation, we follow the official settings with 1000 iterations to train the GCG attack.

Metrics. We use two metrics to evaluate the performance of jailbreak methods. The first metric is the keyword-based attack success rate (ASR) (Zou et al., 2023). This metric focuses on detecting whether predefined keywords exist in the responses from LLMs. Such keywords typically include phrases like “I am sorry,” “As a responsible AI,” etc. Given the input pair $<J_{i},Q_{i}>$ and its corresponding response $R_{i}$, if none of the keywords from the predefined list are present in $R_{i}$, we consider that the attack has not been refused by the LLM and is successful for the corresponding sample. The keywords used for evaluations can be found in Appendix 6. The second metric is the GPT recheck attack success rate (Recheck). We notice that sometimes the LLM does not directly refuse to answer the malicious inquiries, but delivers off-topic content. Alternatively, the LLM may reply to the malicious inquiry with added advice, such as alerting users that the request is potentially illegal or unethical. These instances might cause the ASR to be imprecise. In this context, we employ the LLM to determine if a response is essentially answering the malicious query, as demonstrated in Alg. 10. For both metrics, we report the final success rate calculated by $I_{\text{success}}/I_{\text{total}}$. For stealthiness, we use standard Sentence Perplexity (PPL) evaluated by GPT-2 as the metric.
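The keyword-based ASR can be sketched as a simple substring check. The keyword list below is a short illustrative stand-in (the actual list is the one given in the appendix), the function names are hypothetical, and the sample responses are invented for the example.

```python
# Illustrative refusal keywords; an assumption, not the paper's full list.
REFUSAL_KEYWORDS = ["I am sorry", "I'm sorry", "I cannot", "As a responsible AI"]


def is_refused(response, keywords=REFUSAL_KEYWORDS):
    """True if any refusal keyword occurs in the response (case-insensitive)."""
    lowered = response.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def attack_success_rate(responses):
    """I_success / I_total, counting every non-refused response as a success."""
    if not responses:
        return 0.0
    successes = sum(1 for response in responses if not is_refused(response))
    return successes / len(responses)


sample_responses = [
    "I am sorry, but I cannot help with that.",
    "Here is a general overview of the topic.",
    "As a responsible AI, I must decline.",
    "The weather is nice today.",
]
print(attack_success_rate(sample_responses))
```

The last sample response is off-topic yet is counted as a success by this metric, which is the kind of imprecision that motivates the second, LLM-based Recheck metric.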

Models. We use three open-sourced LLMs, including Vicuna-7b (Chiang et al., 2023), Guanaco-7b (Dettmers et al., 2023), and Llama2-7b-chat (Touvron et al., 2023), to evaluate our method. In addition, we also use GPT-3.5-turbo (OpenAI, 2023a) to further investigate the transferability of our method to closed-source LLMs. Additional details of our evaluations can be found in Appendix D.

4.2 Results

Attack Effectiveness and Stealthiness. Tab. 1 presents the results of white-box evaluations of our method AutoDAN and other baselines. We conduct these evaluations by generating a jailbreak prompt for each malicious request in the dataset and testing the final responses from the victim LLM. We observe that AutoDAN can effectively generate jailbreak prompts, achieving a higher attack success rate compared with baseline methods. For the robust model Llama2, the AutoDAN series can improve the attack success rate by over 10%. Moreover, looking at the stealthiness metric PPL, we find that our method achieves a much lower PPL than the baseline GCG and is comparable with the handcrafted DAN. All these results demonstrate that our method can generate stealthy jailbreak prompts successfully. By comparing the two AutoDAN variants, we find that turning the vanilla genetic algorithm AutoDAN-GA into the hierarchical version AutoDAN-HGA results in a performance gain.

We share the standardized Sentence Perplexity (PPL) score of the jailbreak prompts generated by our method and the baseline in Tab. 1. Compared with the baseline, our method exhibits superior performance in terms of PPL, indicating that more semantically meaningful and stealthy attacks are generated. We also showcase some examples of our method and baselines in Fig. 3.

Effectiveness against defense. As suggested by Jain et al. (2023), we evaluate both our method and the baselines against a perplexity-based defense. This defense mechanism sets a threshold based on requests from the AdvBench dataset, rejecting any input message whose perplexity surpasses this threshold. As demonstrated in Tab. 3, the perplexity defense significantly undermines the effectiveness of the token-level jailbreak attack, i.e., the GCG attack. However, the semantically meaningful jailbreak prompts generated by AutoDAN (and the original handcrafted DAN) are not affected. These findings underscore the capability of our method to generate semantically meaningful content similar to benign text, verifying the stealthiness of our method.
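A perplexity filter of this kind can be sketched as follows. The sketch makes two assumptions that are not stated above: the threshold is taken as the maximum perplexity over a set of reference requests, and per-token log-probabilities are supplied directly (here as made-up numbers) instead of being computed by a language model such as GPT-2. All function names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def fit_threshold(reference_log_probs):
    """Assumption: threshold = highest perplexity among reference requests."""
    return max(perplexity(log_probs) for log_probs in reference_log_probs)


def passes_filter(token_log_probs, threshold):
    """An input is accepted only if its perplexity does not exceed the threshold."""
    return perplexity(token_log_probs) <= threshold


# Made-up natural-log token probabilities for illustration only.
reference = [[-1.0, -2.0], [-2.0, -2.0]]
threshold = fit_threshold(reference)
fluent_input = [-1.5, -1.0]       # reads like ordinary text
gibberish_input = [-8.0, -9.0]    # highly unlikely token sequence
print(passes_filter(fluent_input, threshold), passes_filter(gibberish_input, threshold))
```

Because the filter only looks at how predictable the tokens are, fluent text passes regardless of its intent, which is why this defense separates token-level gibberish from semantically meaningful prompts.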

Transferability. We further investigate the transferability of the proposed method. Following the definitions in adversarial attacks, transferability refers to the extent to which the prompts produced to jailbreak one LLM can successfully jailbreak another model (Papernot et al., 2016). We conduct the evaluations by taking the jailbreak prompts with their corresponding requests and targeting another LLM. The results are shown in Tab. 2. AutoDAN exhibits much better transferability in attacking the black-box LLMs compared with the baseline. We speculate that the reason is that semantically meaningful jailbreak prompts may be inherently more transferable than prompts produced by methods based on tokens’ gradients. Since GCG-like methods directly optimize the jailbreak prompt using gradient information, the algorithm is likely to overfit to the white-box model. In contrast, since lexical-level data such as words usually cannot be updated according to specific gradient information, optimizing at the lexical level may make it easier to generate more general jailbreak prompts, which may exploit flaws common to language models. A phenomenon that can be taken as evidence is the example shared in Zou et al. (2023), where the authors find that using a cluster of models to generate jailbreak prompts obtains higher transferability and may produce more semantically meaningful prompts. This may support our hypothesis that semantically meaningful jailbreak prompts are usually inherently more transferable (but meanwhile more difficult to optimize).

Universality. We evaluate the universality of AutoDAN based on a cross-sample test protocol. For the jailbreak prompt designed for the $i$-th request $Q_{i}$, we test its attack effectiveness when combined with the next 20 requests, i.e., $\{Q_{i+1},\dots,Q_{i+20}\}$. From Tab. 4, we find that AutoDAN also achieves higher universality compared with baselines. This result also potentially verifies that semantically meaningful jailbreaks transfer better not only across different models but also across data instances.
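The indexing of the cross-sample protocol can be made concrete with a small helper. The function name is hypothetical, indices are 0-based, and the handling of the last requests in the dataset (where fewer than 20 successors exist) is an assumption: the window is simply truncated.

```python
def cross_sample_indices(i, num_requests, window=20):
    """Indices of the requests paired with the prompt built for request i.

    Assumption: near the end of the dataset the window is truncated
    instead of wrapping around.
    """
    return list(range(i + 1, min(i + 1 + window, num_requests)))


# With 520 requests, the prompt for request 0 is tested on requests 1..20.
print(cross_sample_indices(0, 520))
```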

Ablation Studies. We evaluate the importance of our proposed modules in AutoDAN, including (1) the DAN initialization (Sec. 3.3), (2) the LLM-based mutation (Sec. 3.5.1), and (3) the design of the hierarchical GA (Sec. 3.5). For AutoDAN-GA without DAN initialization, we employ a prompt of a comparable length, instructing the LLM to behave as an assistant that responds to all user queries.

The results are presented in Tab. 5. These results show that the modules we introduced consistently enhance the performance compared to the vanilla method. The efficacy observed with AutoDAN-GA substantiates our approach of employing genetic algorithms to formulate jailbreak prompts, validating our initial “Automatic” premise. The DAN Initialization also results in considerable improvements in both attack performance and computational speed. This is attributed to the provision of an appropriate initial space for the algorithm to navigate. Moreover, if an attack is detected as a success more quickly, the algorithm can terminate its iteration earlier and reduce computational cost. The improvements realized through DAN Initialization resonate with our second premise of “Utilizing handcrafted jailbreak prompts as prototypes.” Collectively, these observations reinforce the soundness of the proposed AutoDAN. In addition, the LLM-based mutation yields significant improvements compared to the vanilla method, which employs basic synonym replacement. We believe that the results affirm the LLM-based mutation’s ability to introduce meaningful and constructive diversity, thereby enhancing the overall optimization process of the algorithm. The final enhancements stem from the hierarchical design. Given the effectiveness demonstrated in the original design, the hierarchical approach augments the search capability of the algorithm, allowing it to better approximate the global optimum. Furthermore, we also evaluate the effectiveness of our attack against the GPT-3.5-turbo-0301 model service by OpenAI. We use the jailbreak prompts generated on Llama2 for testing. From the results shown in Tab. 5, we observe that our method can successfully attack the GPT-3.5 model and achieves superior performance compared with the baseline.

Limitation and Conclusion

Limitation. A limitation of our method is the computational cost. Although our method is more efficient than the baseline GCG, it still requires a certain amount of time to generate the data. We leave the acceleration of the optimization process as future work.

Conclusion. In this paper, we propose AutoDAN, a method that preserves the stealthiness of jailbreak prompts while also ensuring automated deployment. To achieve this, we delve into the optimization process of a hierarchical genetic algorithm and develop sophisticated modules that tailor the proposed method to structured discrete data like prompt text. Extensive evaluations have demonstrated the effectiveness and stealthiness of our method in different settings and also showcased the improvements brought by our newly designed modules.

Ethics statement

This paper presents an automatic approach to produce jailbreak prompts, which may be utilized by an adversary to attack LLMs and elicit outputs unaligned with human preferences, intentions, or values. However, we believe that this work, like prior jailbreak research, will not pose harm in the short term, but will inspire research on more effective defense strategies, resulting in more robust, safe, and aligned LLMs in the long term.

The proposed jailbreak is designed for the white-box setting, where the victim LLMs are open-sourced and fine-tuned from unaligned models, e.g., Vicuna-7b and Guanaco-7b from Llama 1, and Llama2-7b-chat from Llama2-7b. In this case, adversaries can directly obtain harmful output by prompting these unaligned base models, rather than relying on our prompt. Although our method achieves good transferability from open-sourced to proprietary LLMs such as GPT-3.5-turbo, abundant handcrafted jailbreaks with short-term success also spring up on social media daily. Therefore, we believe our work will not lead to significant harm to either open-sourced or proprietary LLMs.

In the long term, we hope that the vulnerability of LLMs to the jailbreaks discussed in this work will attract attention from both academia and industry. As a result, stronger defenses and more rigorous safety designs will be developed, allowing LLMs to better serve the real world.

References

Appendix A Introduction to GA and HGA

Genetic Algorithms (GAs) are a class of evolutionary algorithms inspired by the process of natural selection. These algorithms serve as optimization and search techniques that emulate the process of natural evolution. They operate on a population of potential solutions, utilizing operators such as selection, crossover, and mutation to produce new offspring. This, in turn, allows the population to evolve toward optimal or near-optimal solutions. A standard GA starts with an initial population of candidate solutions. Through iterative processes of selection based on fitness scores, crossover, and mutation, this population evolves over successive generations. The algorithm concludes when a predefined termination criterion is met, which could be reaching a specified number of generations or achieving a desired fitness threshold.
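The loop described above can be sketched on a toy problem. The following is a minimal illustration, not the implementation used in this paper: it evolves bit strings, and the function name `run_ga`, the fitness-proportional selection, the single-point crossover, and all default values are assumptions chosen for brevity.

```python
import random

def run_ga(fitness, genome_len, target, pop_size=20, generations=100,
           crossover_rate=0.5, mutation_rate=0.01, elite_rate=0.1, seed=0):
    """Toy GA over bit strings (illustrative assumptions throughout)."""
    rng = random.Random(seed)
    # Initial population of random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    n_elite = max(1, int(pop_size * elite_rate))
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Termination criterion: desired fitness threshold reached.
        if fitness(ranked[0]) >= target:
            break
        # Selection weights; the small constant avoids an all-zero weight list.
        weights = [fitness(g) + 1e-9 for g in ranked]
        # Elitism: the best candidates survive unchanged.
        children = [g[:] for g in ranked[:n_elite]]
        while len(children) < pop_size:
            a, b = rng.choices(ranked, weights=weights, k=2)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, genome_len)  # single-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

# Usage: maximize the number of ones in a 16-bit string.
best = run_ga(sum, genome_len=16, target=16)
```

Because the elite candidates are copied into every new generation, the best fitness in the population never decreases from one generation to the next.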

However, GAs, despite their robustness in navigating expansive search spaces, can occasionally suffer from premature convergence. This phenomenon occurs when the algorithm becomes trapped in local optima and neglects other, potentially superior regions of the search space. To address this and other limitations, Hierarchical Genetic Algorithms (HGAs) introduce a hierarchical structure to the traditional GA framework. In an HGA, the data to be optimized is organized hierarchically, with multiple levels of populations. The top-level population might represent broad, overarching solutions, while lower-level populations represent subcomponents or details of those solutions. In practice, an HGA involves both inter-level and intra-level genetic operations. Inter-level operations might use solutions from one level to influence or guide the evolution of solutions at another level, while intra-level operations are similar to traditional GA operations but are applied within a single level of the hierarchy.

Appendix B Detailed Algorithms

In this paper, we use OpenAI’s GPT-4 API (OpenAI, 2023b) to conduct LLM-based diversification. The LLM-based diversification procedure is given in Alg. 4.

The function Crossover (Alg. 5) serves to interlace sentences from two distinct texts. Initially, each text is segmented into its individual sentences. By assessing the number of sentences in both texts, the function determines the feasible crossover points. To achieve the mix, random positions within these texts are selected. For every chosen position, the function determines through a randomized process whether a sentence from the first or the second text will be integrated into the newly formed texts. After this process, any remaining sentences that have not yet been considered are appended to the resultant texts.
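The sentence-level crossover can be illustrated with a short sketch. It is written under stated assumptions and is not a transcription of Alg. 5: the regular expression used to split sentences, the helper name `split_sentences`, the `rng` argument, and the choice to treat the remaining tail as one final segment that may also be exchanged are all hypothetical.

```python
import random
import re

def split_sentences(text):
    # Assumed splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def crossover(text_a, text_b, num_points=5, rng=random):
    """Interlace the sentences of two texts at randomly chosen cut points."""
    sents_a, sents_b = split_sentences(text_a), split_sentences(text_b)
    # Feasible cut points are limited by the shorter of the two texts.
    max_points = max(0, min(len(sents_a), len(sents_b)) - 1)
    k = min(num_points, max_points)
    cuts = sorted(rng.sample(range(1, max_points + 1), k))
    child_a, child_b = [], []
    start = 0
    # The trailing None makes the last slice run to the end of each text.
    for cut in cuts + [None]:
        seg_a, seg_b = sents_a[start:cut], sents_b[start:cut]
        # Randomly decide which parent contributes this segment to which child.
        if rng.random() < 0.5:
            seg_a, seg_b = seg_b, seg_a
        child_a.extend(seg_a)
        child_b.extend(seg_b)
        start = cut
    return " ".join(child_a), " ".join(child_b)
```

In this sketch every sentence of the two parents appears exactly once across the two children, so no content is lost or duplicated by the operation.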

The function Apply Crossover and Mutation (Alg. 6) is used to generate a new set of data by intertwining and altering data from a given dataset, in the context of genetic algorithms. The function’s primary objective is to produce “offspring” data by possibly combining pairs of “parent” data. The parents are chosen sequentially, two at a time, from the selected data. If there’s an odd number of data elements, the last parent is paired with the first one. A crossover operation, which mixes the data, is executed with a certain probability. If this crossover doesn’t take place, the parents are directly passed on to the offspring without modification. After generating the offspring, the function subjects them to a mutation process, making slight alterations to the data. The end result of the function is a set of “mutated offspring” data, which has undergone potential crossover and definite mutation operations. This mechanism mirrors the genetic principle of producing varied offspring by recombining and slightly altering the traits of parents.
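The pairing logic described above can be sketched as follows. This is an illustration under assumptions, not Alg. 6 itself: `crossover_fn` and `mutate_fn` are hypothetical caller-supplied callables, and any probability governing how strongly a candidate is altered is assumed to live inside `mutate_fn`.

```python
import random

def apply_crossover_and_mutation(selected, crossover_fn, mutate_fn,
                                 crossover_rate=0.5, rng=random):
    """Build offspring from consecutive parent pairs, then mutate them."""
    offspring = []
    for i in range(0, len(selected), 2):
        parent_a = selected[i]
        # With an odd number of parents, the last one is paired with the first.
        parent_b = selected[i + 1] if i + 1 < len(selected) else selected[0]
        if rng.random() < crossover_rate:
            offspring.extend(crossover_fn(parent_a, parent_b))
        else:
            # No crossover: the parents pass through unmodified.
            offspring.extend([parent_a, parent_b])
    # Every offspring is handed to the mutation step.
    return [mutate_fn(child) for child in offspring]

# Usage with stand-in operators on short strings.
swap = lambda a, b: (b, a)
result = apply_crossover_and_mutation(["a", "b", "c"], swap, str.upper,
                                      crossover_rate=1.0)
```

Note that with an odd number of parents the sketch returns one more offspring than it received, because the first parent is used twice.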

The function Construct Momentum Word Dictionary (Alg. 7) is designed to analyze and rank words based on their associated momentum or significance. Initially, the function sets up a predefined collection of specific keywords (usually common English stop words). The core process of this function involves iterating through words and associating them with respective scores. Words that are not part of the predefined set are considered. For each of these words, scores are recorded, and an average score is computed. In the subsequent step, the function evaluates the dictionary of words. If a word is already present, its score is updated based on its current value and the newly computed average (i.e., momentum). If it’s a new word, it’s simply added with its average score. Finally, the words are ranked based on their scores in descending order. The topmost words, determined by a set limit, are then extracted and returned.
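A minimal sketch of this momentum update is given below. The stop-word list, the tokenizer, the function name `update_momentum_dictionary`, and the `momentum` coefficient of 0.5 (an equal blend of the old score and the new average) are assumptions made for illustration; Alg. 7 may differ in these details.

```python
import re
from collections import defaultdict

# Assumed stop-word set; a real list would be considerably longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

def update_momentum_dictionary(word_dict, prompts, scores, top_k=30,
                               momentum=0.5):
    """Blend per-word average scores into word_dict and keep the top_k words."""
    observed = defaultdict(list)
    for prompt, score in zip(prompts, scores):
        # Each distinct non-stop word inherits the score of its prompt.
        for word in set(re.findall(r"[a-z']+", prompt.lower())):
            if word not in STOP_WORDS:
                observed[word].append(score)
    for word, values in observed.items():
        average = sum(values) / len(values)
        if word in word_dict:
            # Known word: combine the stored score with the new average.
            word_dict[word] = momentum * word_dict[word] + (1 - momentum) * average
        else:
            # New word: start from its average score.
            word_dict[word] = average
    ranked = sorted(word_dict.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])
```

The momentum term keeps a word's score from swinging on a single generation, so a word must score well repeatedly to stay near the top of the ranking.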

The function Replace Words with Synonyms (Alg. 8) is designed to refine a given textual input. By iterating over each word in the prompt, the algorithm searches for synonymous terms within a pre-defined dictionary that contains words and their associated scores. If a synonym is found, a probabilistic decision based on the word’s score (compared to the total score of all synonyms) determines if the original word in the prompt should be replaced by this synonym. If the decision is affirmative, the word in the prompt is substituted by its synonym. The process continues until all words in the prompt are evaluated. The final output is a potentially enhanced prompt where some words might have been replaced by their higher-scored synonyms.
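The replacement step can be sketched as follows. The `synonyms` mapping is a hypothetical caller-supplied lookup (the source of synonyms is not specified here), scores are assumed to be positive so that they can be read as relative weights, and whitespace tokenization is a simplification.

```python
import random

def replace_with_synonyms(prompt, word_dict, synonyms, rng=random):
    """Probabilistically swap words for scored synonyms from word_dict."""
    result = []
    for word in prompt.split():
        # Keep only synonyms that carry a score in the dictionary.
        candidates = [s for s in synonyms.get(word.lower(), [])
                      if s in word_dict]
        total = sum(word_dict[s] for s in candidates)
        replacement = word
        if candidates and total > 0:
            for synonym in candidates:
                # Accept a synonym with probability score / total score.
                if rng.random() < word_dict[synonym] / total:
                    replacement = synonym
                    break
        result.append(replacement)
    return " ".join(result)

# Usage: only "large" has a score, so it is accepted with probability 1.
text = replace_with_synonyms("a big house",
                             {"large": 2.0},
                             {"big": ["large", "huge"]})
```

Words without a scored synonym are left untouched, and when several synonyms compete, one with a larger share of the total score is more likely to be accepted.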

Appendix C AutoDAN-GA

In our paper, we also introduce a genetic algorithm for generating jailbreak prompts, i.e., AutoDAN-GA, which shows promising results in our evaluations. AutoDAN-GA is presented in Alg. 9.

Appendix D Experiments Settings

Baselines. We follow the official code of the GCG attack (https://github.com/llm-attacks/llm-attacks) to re-implement the method. Specifically, we set the number of iterations to 1000, as the paper suggests, to obtain sufficient attack strength. In addition, early stopping by keyword detection is deployed in the training process of GCG. The keywords can be found in Tab. 6.

Metrics. In our evaluations, we introduce a new metric to test whether a jailbreak attack is successful, i.e., the GPT recheck attack success rate (Recheck). To compute Recheck, we employ the LLM to determine whether a response essentially answers the malicious query, as follows:

D.2 Implementation Details of AutoDAN

Hyper-parameters. We configure the hyper-parameters of AutoDAN and AutoDAN-HGA as follows: a crossover rate of 0.5, a mutation rate of 0.01, an elite rate of 0.1, and five breakpoints for multi-point crossover. The number of top words selected in momentum word scoring is set to 30. The total number of iterations is fixed at 100. The number of sentence-level iterations is five times the number of paragraph-level iterations; that is, AutoDAN performs one paragraph-level optimization after every five sentence-level optimizations.
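For reference, the settings above can be collected into a single configuration object, together with a small generator that reproduces the five-to-one interleaving of the two levels. The names `HGAConfig` and `level_schedule` are hypothetical, and how the total of 100 iterations is divided between the two levels is not specified here, so the sketch leaves the number of paragraph-level steps as a parameter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HGAConfig:
    # Values taken from the hyper-parameter description above.
    crossover_rate: float = 0.5
    mutation_rate: float = 0.01
    elite_rate: float = 0.1
    num_crossover_points: int = 5
    top_k_words: int = 30
    total_iterations: int = 100
    sentence_steps_per_paragraph_step: int = 5

def level_schedule(num_paragraph_steps, sentence_steps_per_paragraph_step=5):
    """Yield the level optimized at each step: sentences first, then paragraph."""
    for _ in range(num_paragraph_steps):
        for _ in range(sentence_steps_per_paragraph_step):
            yield "sentence"
        yield "paragraph"

# Usage: two paragraph-level steps, each preceded by five sentence-level steps.
cfg = HGAConfig()
steps = list(level_schedule(2, cfg.sentence_steps_per_paragraph_step))
```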

Appendix E Examples

We showcase examples of our method and baselines to attack online chatbots in Fig. 3. The jailbreak prompts are generated based on Llama2.