Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi

Introduction

The proliferation of LLMs has revolutionized natural language processing and generation [Bro+20, WK21, Ope23], enabling novel software paradigms [Lew+20]. However, the widespread use of LLMs also raises concerns regarding their risks [Ben+21, Bom+22, Li+23, Zou+23, Des+23], biases [She+19, AFZ21, Ben+21, HV22, Bom+22], and susceptibility to adversarial manipulation [Amo+16, Car+21, WHS23, Liu+23a]. In response to these challenges, researchers have explored various approaches to mitigate potential threats [Chr+17, Zie+20, Hen+21, Xu+21, Gan+22, Ope23]. In fact, LLM developers spend significant effort in encoding appropriate model behavior into LLMs during training, creating strict instructions (or system prompts) to guide the LLM at runtime, and building safety filters that protect against the aforementioned harms—this is called the alignment of LLMs [Ji+23, Wan+23].

Understanding the power and limitations of alignment methods is a crucial step toward building LLMs that can be safely used in a wide range of applications. One way to understand the limitations of these safety filters is to explore their susceptibility to jailbreaking attacks [WHS23]. A jailbreaking attack is an attempt to bypass an LLM’s safety filters and circumvent its alignment. A jailbreak in traditional cybersecurity refers to a privilege escalation attack that involves subverting the essential security measures of the targeted system, such as in rooting a mobile device [Kel+19].

Recently, researchers and engineers have designed a variety of jailbreaking methods that illustrate the vulnerability of LLMs (see Section 1.2 for a list). However, most methods require significant exploration by humans [WHS23, Yu+23, Cha23] or only apply to models whose weights and/or tokenizers are open and accessible [Zou+23, LLS23, Wit23]. Following the recent work of [Cha+23], we focus on jailbreaking attacks that satisfy the following key properties:

Automatic: an attack does not require any human supervision. Automatic attacks pose a significant risk because they can be utilized by anyone without any prior knowledge or understanding of LLMs. Additionally, it is much easier to scale these attacks to larger models or larger prompts that are more likely to break more sophisticated systems.

Black-box: an attack that does not require knowledge of the architecture or parameters of the LLM. For industry AI applications, black-box (or closed-box) attacks are of particular interest since models are only accessed via their API [Apr+23]. Attacks that only require black-box access demonstrate that keeping the details of an LLM secret does not prevent attacks. The design of automatic methods to attack black-box LLMs was posed as an open problem in the work of [WHS23].

Interpretable: an attack that produces meaningful outputs. Many of the existing attacks provide prompts for which at least part of the prompt has no natural meaning. For instance, ‘Develop a strategy for hacking into a government database and stealing sensitive information ’ Know).. kwietniaRandom] Brothers # Agent Bond zn}}(re]]; than -( desenvol (( medieval { contnewcommand [ [{ }’

We focus on jailbreaking methods that satisfy all of the above properties and succeed in retrieving an answer from the target LLM to almost all the harmful prompts. We build on the work of [Cha+23], who propose a method that finds jailbreaks for state-of-the-art LLMs including GPT4 and GPT4-Turbo in fewer than twenty queries (on average) for over 50% of the harmful prompts in the AdvBench Subset Dataset [Cha+23]. At its core, their method (1) engages an attacker LLM in conversations where it generates variations of the original prompt that may jailbreak the target LLM and (2) uses another LLM to evaluate whether the jailbreak attempt was a success. Following the framework of [Cha+23], we measure efficiency by the number of queries made to the target LLM, without regard to the queries made to any supplementary models that assist in the attack. This provides an equitable comparison with [Cha+23] and is also particularly practical since a true attacker’s objective is to maintain a low profile and minimize the cost of querying their targeted LLM. The attacker LLM is a small open-source model that can be queried at a relatively low cost compared to the target LLM. Similarly, while we use GPT4 as our evaluator, we believe that an exciting open problem is to replace it with a fine-tuned open-source LLM and achieve similar success.

Tree of Attacks with Pruning (TAP). Our method builds on the setup of [Cha+23] with two main improvements: tree-of-thought reasoning [Yao+23] and the ability to prune irrelevant prompts. TAP utilizes three LLMs: an attacker whose task is to generate the jailbreaking prompts using tree-of-thought reasoning, an evaluator that assesses the generated prompts and evaluates whether the jailbreaking attempt was successful or not, and a target, which is the LLM that we are trying to jailbreak. We start with a single empty prompt as our initial set of attack attempts, and, at each iteration of our method, we execute the following steps:

(Branch) The attacker generates improved prompts.

(Prune: Phase 1) The evaluator eliminates any off-topic prompts from our improved prompts.

(Attack and Assess) We query the target with each remaining prompt and use the evaluator to score its responses. If a successful jailbreak is found, we return its corresponding prompt.

(Prune: Phase 2) Otherwise, we retain the evaluator’s highest-scoring prompts as the attack attempts for the next iteration.

Findings. We evaluate this method on the AdvBench Subset dataset created by [Cha+23]. We observe that the method circumvents the guardrails of state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for the vast majority of the harmful prompts while requiring significantly fewer queries to the target LLM than previous methods. As an example, our method jailbreaks GPT4 90% of the time (compared to 60% for [Cha+23]) while using 28.8 queries on average (compared to an average of 37.7 queries for [Cha+23]). Similarly, our method jailbreaks GPT4-Turbo 84% of the time (compared to 44% for [Cha+23]) while using 22.5 queries on average (compared to an average of 47.1 queries for [Cha+23]). We also show similar successes against other models like GPT3.5-Turbo, PaLM-2, and Vicuna-13B (Table 1). One surprising finding is that the Llama-2-Chat-7B model seems to be much more robust to these types of black-box attacks. To explain this, we evaluated Llama-2-Chat-7B on hand-crafted prompts and observed that it refuses all requests for harmful information and, in fact, prioritizes refusal over following the instructions it is provided; e.g., ‘do not use any negative words in your response’.

Our next experiment shows the benefits of each of the design choices of TAP. In summary, we observe that the (Pruning: Phase 1) step significantly reduces the number of queries that TAP makes, while (Branching) has a significant impact on the success rate of our method. We also evaluate the importance of the power of the evaluator, where we observe a severe drop in the performance of our method when GPT3.5-Turbo is used as the evaluator. Nevertheless, since the evaluator is only used to classify off-topic prompts and to score attack attempts, we believe that a model trained specifically for this simple task could perform even better than GPT4. We leave this as an interesting open problem. For more details, see Section 4.2.

Finally, we explore the transferability of our attacks. Transferability means that an adversarial prompt that is produced by an attacker LLM can be then used to jailbreak a different LLM. We observe that our attacks can frequently transfer to other models (with the notable exception of Llama-2-Chat-7B). For more details, see Section 4.3.

Broader Takeaways. Based on our results, we reached the following high-level conclusions.

Small unaligned LLMs can be used to jailbreak large LLMs. Our results show that jailbreaking large LLMs, like GPT4, is still a relatively simple task that can be achieved by utilizing the power of much smaller but unaligned models, like Vicuna-13B. This indicates that more work needs to be done to make LLMs safe.

Jailbreak has a low cost. Similarly to [Cha+23], our method does not require large computational resources, and it only needs black-box access to the target model. We are able to jailbreak the vast majority of adversarial queries using this method without requiring gradient computation on GPUs with large virtual memories.

More capable LLMs are easier to break. There is a very clear difference in the performance of our method against GPTs or PaLM-2 and against Llama. We believe that a potential explanation of Llama’s robustness could be that it frequently refuses to follow the precise instructions of the users when the prompt asks for harmful information.

1.2 Related Works

Broadly speaking, ML algorithms are known to be brittle and there are numerous methods for generating inputs where standard ML models give undesirable outputs: For instance, image classifiers were found to be susceptible to adversarial attacks by making small perturbations to the input that would fool trained classifiers [Sze+14, GSS15]. Formally, given an input $x$ and a classifier $f$ , one could often find small perturbations $\delta$ such that, $f(x)\neq f(x+\delta)$ . Later, similar techniques were applied to text by using character [Ebr+18, Li+19, Gao+18, PDL19], word [Ebr+18, Li+19], token [Tan+20, Li+20], or sememe-level perturbations [Zan+20, Guo+21]. Some of these methods are black-box; i.e., they only require access to the outputs of the target model. Others use knowledge of the weights of the model (which enables them to compute the gradient of the output with respect to the inputs). Among methods using gradients, some directly use the gradients to guide the attack mechanism [Ebr+18, Li+19, Wal+19, Son+21, Jon+23], while others also include additional loss terms to steer replacements to meet certain readability criteria (e.g., perplexity) [Guo+21, Jon+23]. Some other methods use specially trained models to generate candidate substitutions [Li+20, Wen+23]. Yet other methods use probabilistic approaches: they sample candidate replacement tokens from an adversarial distribution [Guo+21]. Compared to other attacks, these adversarial methods have the disadvantage that they often have unusual token patterns that lack semantic meaning which enables their detection [Cha+23, Zhu+23].
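To make the definition $f(x)\neq f(x+\delta)$ concrete, the following minimal sketch uses a hypothetical two-feature linear classifier. The weights, the input, and the perturbation are illustrative assumptions and are not taken from any of the cited works.

```python
# Toy illustration of an adversarial perturbation: f(x) != f(x + delta).
# The classifier, input, and perturbation are hypothetical values chosen
# for illustration only.

def f(x, w=(1.0, -1.0), bias=0.0):
    """Linear classifier: returns 1 if w.x + bias > 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if score > 0 else 0

x = (0.52, 0.50)        # clean input, score is about +0.02
delta = (-0.03, 0.03)   # small perturbation (max-norm 0.03)
x_adv = tuple(xi + di for xi, di in zip(x, delta))  # score is about -0.04

clean_label = f(x)      # class predicted on the clean input
adv_label = f(x_adv)    # class predicted on the perturbed input
```

Here the perturbation moves the input across the decision boundary, so the two predicted labels differ even though the input changed by at most 0.03 per coordinate.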

In the context of LLMs, attacks that elicit harmful, unsafe, and undesirable responses from the model carry the term jailbreaks [WHS23]. Concretely, a jailbreaking method, given a goal $G$ (which the target LLM refuses to respond to; e.g., ‘how to build a bomb’), revises it into another prompt $P$ that the LLM responds to. Despite safety training, jailbreaks are prevalent even in state-of-the-art LLMs [Li+23]. There are many methods for generating jailbreaks: Some have been discovered manually by humans through experimentation or red-teaming [Rib+21, Cha23, wal22, Wit23] or by applying existing injection techniques from the domain of cybersecurity [Kan+23]. Others have been discovered through LLM generation [Per+22, Wu+23] or by refining malicious user strings with the assistance of an LLM [Liu+23, Cha+23, Zhu+23]. Yet another class of jailbreaks is based on appending or prepending additional adversarial strings to $G$ (that often lack semantic meaning) [Shi+20, Zou+23, LLS23, Sha+23]. Finally, a few works have also used in-context learning to manipulate LLMs [QZZ23] and explored the effects of poisoning retrieved data for use in LLMs [Abd+23].

The automatic jailbreaking method proposed in this paper builds upon the Prompt Automatic Iterative Refinement (PAIR) method [Cha+23], which uses LLMs to construct prompts that jailbreak other LLMs (see Section 3.1). [Yu+23] also use LLMs to generate prompts; however, they begin with existing (successful) human-generated jailbreaks as seeds. In contrast, we focus on methods that generate jailbreaks directly without using existing jailbreaks as input.

Preliminaries

In this section, we briefly review the vulnerability of LLMs to jailbreaks and introduce our setup.

Safety Training of LLMs. While LLMs display surprising capabilities, they are prone to various failure modes which can expose the user to harmful content, polarize their opinion, or more generally, have a negative effect on society [She+19, AFZ21, Ben+21, Bom+22, HV22, Li+23, Ope23]. Consequently, significant efforts have been devoted to mitigating these failure modes. Foremost among these is safety training where models are trained to refuse restricted requests [Ope23, Ani+23]. For instance, early versions of GPT4 were extensively fine-tuned using reinforcement learning with human feedback (RLHF) to reduce its propensity to respond to queries for restricted information (e.g., toxic content, instructions to perform harmful tasks, and disinformation). This RLHF implementation required significant human effort: human experts from a variety of domains were employed to manually construct prompts exposing GPT4’s failure modes [Ope23]. However, despite extensive safety training, LLMs (including GPT4) continue to be vulnerable to carefully crafted prompts [Ope23, Zou+23, WHS23, wal22, Cha23, Wit23].

Jailbreaks. A prompt $P$ is said to jailbreak a model $\mathds{M}$ for a goal $G$ (which requests restricted information) if, given $P$ as input, the model outputs a response to $G$ . There are a plethora of human-generated prompts that jailbreak LLMs for specific goals. A number of such jailbreaks are available at [Cha23] (also see Section 1.2). Jailbreaks, among other things, are useful in the safety training of LLMs. For instance, as mentioned above, GPT4’s safety training involved eliciting prompts that jailbreak GPT4 from human experts. However, generating jailbreaks in this way requires significant human effort. Automated jailbreaking methods hope to reduce this effort.

Our Setting: Black-Box and Semantic-Level Jailbreaks. An automated jailbreaking method takes a goal $G$ as input and outputs another prompt $P$ that jailbreaks a target LLM $\mathds{T}$ . We focus on automated jailbreak methods that only require query access to $\mathds{T}$ ; i.e., black-box methods. Moreover, we require the method to always output semantically-meaningful prompts. Concretely, let $\mathcal{M}\subseteq\mathcal{V}^{\star}$ be the set of all meaningful prompts in any language present in $\mathds{T}$ ’s training data, where $\mathcal{V}$ is the vocabulary of $\mathds{T}$ . Fix a constant $L\geq 1$ . Let $q(P;\mathds{T})$ be the distribution of the first $L$ tokens generated by $\mathds{T}$ given prompt $P$ as input. Given a goal $G$ , we want to solve the following optimization problem

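$$\max_{P\in\mathcal{M}}\;\mathbb{E}_{R\sim q(P;\mathds{T})}\left[\texttt{Judge}{}(G,R)\right],$$

where $\texttt{Judge}{}\colon\mathcal{V}^{\star}\times\mathcal{V}^{\star}\to[0,1]$ is a function assessing the extent to which $\mathds{T}$ is jailbroken for goal $G$ (certified by $\mathds{T}$’s response $R$). In particular, $\texttt{Judge}{}(G,R)=1$ implies that $R$ satisfies the goal $G$ and $\texttt{Judge}{}(G,R)=0$ implies that $R$ is a refusal to comply with $G$.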
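Because the objective is an expected Judge score over the target's sampled responses, it can be approximated for a fixed prompt by sampling. The following is a minimal sketch under stated assumptions: `sample_response` (standing in for a draw $R\sim q(P;\mathds{T})$) and `judge` (standing in for Judge) are hypothetical callables, and the toy target below is invented purely for illustration.

```python
import random

def estimate_objective(prompt, goal, sample_response, judge,
                       n_samples=100, seed=0):
    """Monte Carlo estimate of E_{R ~ q(P; T)}[Judge(G, R)] for a fixed P."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        response = sample_response(prompt, rng)  # one sampled response R
        total += judge(goal, response)           # score in [0, 1]
    return total / n_samples

# Hypothetical toy target: complies with probability 0.3 for any prompt.
def toy_sample_response(prompt, rng):
    return "COMPLY" if rng.random() < 0.3 else "REFUSE"

# Hypothetical toy judge: 1 for compliance, 0 for refusal.
def toy_judge(goal, response):
    return 1.0 if response == "COMPLY" else 0.0

estimate = estimate_objective("P", "G", toy_sample_response, toy_judge,
                              n_samples=1000)
```

With a deterministic target (for example, temperature 0), a single sample suffices and the estimate reduces to the Judge score of that one response.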

Observe that we do not want to maximize the probability of getting any specific response. Rather, we want to maximize the probability of jailbreaking $\mathds{T}$ , and the expected value of the Judge score is a proxy for this goal. Moreover, we do not just want $\mathds{T}$ to output some restricted content, but we want $\mathds{T}$ to output restricted content that is relevant to $G$ . Often, for $\mathds{T}$ to output content relevant to $G$ , the input $P$ to $\mathds{T}$ must be on-topic for $G$ (see Figure 2). Motivated by this, in our method, we impose the additional requirement that any output prompt is on-topic for $G$ . Concretely, let

$$\texttt{Off-Topic}{}\colon\mathcal{V}^{\star}\times\mathcal{V}^{\star}\to\{0,1\}$$

be a function such that $\texttt{Off-Topic}{}(P,G)$ is 1 if $P$ is off-topic for $G$ and 0 otherwise. We require any output $P$ to satisfy $\texttt{Off-Topic}{}(P,G)=0$. Finally, we note that, like [Cha+23], we expand the search space of potential jailbreaks over previous methods [Zou+23, LLS23, Zhu+23, QZZ23] that require $G$ to be a substring of, or to have a significant overlap with, $P$.

Our Method for Query-Efficient Black-Box Jailbreaking

We present Tree of Attacks with Pruning (TAP), an automatic query-efficient black-box method for jailbreaking LLMs using interpretable prompts. An implementation of TAP is available at https://github.com/RICommunity/TAP.

TAP is instantiated by three LLMs: an attacker $\mathds{A}$ , an evaluator $\mathds{E}$ , and a target $\mathds{T}$ . Given a goal $G$ , it queries $\mathds{A}$ to iteratively refine $G$ utilizing tree-of-thought reasoning with breadth-first search [Yao+23] until a prompt $P$ is found which jailbreaks the target LLM $\mathds{T}$ , or the tree-of-thought reaches a maximum specified depth. In this process, $\mathds{E}$ serves two purposes: first, it assesses whether a jailbreak is found (i.e., it serves as the Judge function) and, second, it assesses whether a prompt generated by $\mathds{A}$ is off-topic for $G$ (i.e., it serves as the Off-Topic function).

Apart from the choice for $\mathds{A}$ and $\mathds{E}$ , TAP is parameterized by the maximum depth $d\geq 1$ , the maximum width $w\geq 1$ , and the branching factor $b\geq 1$ of the tree-of-thought constructed by the method. $\mathds{A}$ is initialized with a carefully crafted system prompt that mentions that $\mathds{A}$ is a ‘red teaming assistant’ whose goal is to jailbreak a target $\mathds{T}$ . $\mathds{E}$ also has a system prompt that poses it as a ‘red teaming assistant’. The specific prompt varies depending on whether $\mathds{E}$ is serving in the Judge or Off-Topic role.

We present the pseudocode for TAP in Algorithm 1 and give an overview below. TAP maintains a tree where each node stores one prompt $P$ generated by $\mathds{A}$ along with some metadata about it. In particular, each node has the conversation history of $\mathds{A}$ at the time $P$ was generated.

The method builds the tree layer-by-layer until it finds a jailbreak or its tree depth exceeds $d$. Since it works layer-by-layer, the conversation history at a node is a subset of the conversation histories of any of its children. However, two distinct nodes at the same level can have disjoint conversation histories. This allows TAP to explore disjoint attack strategies, while still prioritizing the more promising strategies/prompts by pruning prompts $P$ that are off-topic and/or whose responses receive a low Judge score. At each step $1\leq i\leq d$, TAP operates as follows:

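(Branching) For each leaf of the tree, $\mathds{A}$ is queried with the leaf’s prompt and conversation history to generate $b$ refined prompts, each of which is added as a child of that leaf. Let $\mathcal{P}$ denote the set of newly generated prompts.

(Pruning: Phase 1) TAP prunes all off-topic prompts in $\mathcal{P}$. Concretely, for each $P\in\mathcal{P}$, if $\texttt{Off-Topic}{}(P,G)=1$, then the node corresponding to prompt $P$ is pruned.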

(Query and Assess) TAP queries $\mathds{T}$ with each of the remaining prompts in $\mathcal{P}$ to get a set $\mathcal{R}$ of responses (which are recorded in the corresponding nodes of the tree). For each response $R\in\mathcal{R}$ , a score $\texttt{Judge}{}(R,G)$ is computed and also recorded in the corresponding node. If any response $R$ signifies a jailbreak (i.e., $\texttt{Judge}{}(R,G)=1$ ), then TAP returns its corresponding prompt $P$ and exits.

(Pruning: Phase 2) If no jailbreaks were found, TAP performs a second round of pruning. If there are now more than $w$ leaves in the tree, then the $w$ leaves with the highest scores are retained and the rest are deleted to reduce the tree’s width to at most $w$ .
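The per-step procedure (branch, prune off-topic prompts, query and assess, prune to width $w$) can be sketched as a generic layer-by-layer search. This is a simplified illustration, not the authors' implementation: the callables `refine`, `off_topic`, `query_target`, and `judge` are hypothetical stand-ins for $\mathds{A}$, $\mathds{E}$, and $\mathds{T}$, the conversation history is reduced to a list of earlier prompts, and the toy usage searches over short strings rather than calling any LLM.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    prompt: str
    history: list = field(default_factory=list)  # simplified conversation history
    score: float = 0.0

def tap_search(goal, refine, off_topic, query_target, judge, d=10, w=10, b=4):
    """Layer-by-layer tree search with two pruning phases.

    refine(goal, node, b)     -> list of b new prompts (attacker role)
    off_topic(prompt, goal)   -> 1 if off-topic, else 0 (evaluator role)
    query_target(prompt)      -> response (target role)
    judge(goal, response)     -> score in [0, 1]; 1 signifies success
    Returns (prompt or None, number of target queries).
    """
    leaves = [Node(prompt="")]  # start from a single empty prompt
    queries = 0
    for _ in range(d):
        # Branching: each leaf yields b children that extend its history.
        children = [
            Node(prompt=p, history=leaf.history + [leaf.prompt])
            for leaf in leaves
            for p in refine(goal, leaf, b)
        ]
        # Pruning, phase 1: drop off-topic prompts before querying the target.
        children = [c for c in children if off_topic(c.prompt, goal) == 0]
        # Query and assess: score the response to each remaining prompt.
        for child in children:
            response = query_target(child.prompt)
            queries += 1
            child.score = judge(goal, response)
            if child.score == 1:
                return child.prompt, queries
        # Pruning, phase 2: keep at most w highest-scoring leaves.
        leaves = sorted(children, key=lambda c: c.score, reverse=True)[:w]
        if not leaves:
            break
    return None, queries

# Toy stand-ins (purely illustrative): prompts are strings over {a, b, c, x}.
def toy_refine(goal, node, b):
    return [node.prompt + ch for ch in "abcx"[:b]]

def toy_off_topic(prompt, goal):
    return 1 if "x" in prompt else 0

def toy_query_target(prompt):
    return prompt  # the toy target simply echoes its input

def toy_judge(goal, response):
    matched = 0
    for g, r in zip(goal, response):
        if g != r:
            break
        matched += 1
    return matched / len(goal)  # fraction of the goal matched as a prefix

found, n_queries = tap_search("abc", toy_refine, toy_off_topic,
                              toy_query_target, toy_judge, d=5, w=2, b=4)
```

In this toy run the search returns as soon as one child reaches a score of 1, so the number of target queries stays well below the worst case; phase-1 pruning removes every candidate containing `x` before it costs a query.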

Remarks.

TAP’s success depends on the evaluator’s ability to evaluate the Judge and Off-Topic functions (defined in Section 2). While we propose using an LLM as the evaluator, one may use TAP with any implementation of Judge and Off-Topic. Since the method only sends black-box queries to $\mathds{A}$, $\mathds{E}$, and $\mathds{T}$, they can be instantiated with any LLMs that have public query access. This allows our method to be run in low-resource settings where one has API access to an LLM (e.g., the GPT models) but does not have access to high-memory GPUs. Like TAP, PAIR can also be run in low-resource settings, but most other jailbreaking methods require white-box access to $\mathds{T}$ or to its tokenizer (see Section 1.2). Further, the number of queries TAP makes to $\mathds{T}$ is bounded by $\sum_{i=0}^{d}b\cdot\min\left(b^{i},w\right)$ (a loose upper bound on this is $w\times b\times d$). However, because the method prunes off-topic prompts and stops as soon as one of the generated prompts jailbreaks $\mathds{T}$, the number of queries to $\mathds{T}$ can be much smaller. Indeed, in our experiments, $w\times b\times d$ is $400$ and, yet, on average we send fewer than 30 queries for a variety of targets (Table 1). Finally, we note that the running time of our method can be improved by parallelizing its execution within each layer.
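As a quick check of the query bound, the short sketch below evaluates the sum exactly as written in the text, using the parameter values reported for the experiments ($d=10$, $w=10$, $b=4$). The function name is a hypothetical helper introduced for illustration.

```python
def tap_query_bound(d, w, b):
    """Evaluate sum_{i=0}^{d} b * min(b**i, w), as stated in the text."""
    return sum(b * min(b ** i, w) for i in range(d + 1))

d, w, b = 10, 10, 4            # values used in the experiments
bound = tap_query_bound(d, w, b)
loose_bound = w * b * d        # the looser w * b * d figure
```

For these values the sum evaluates to 380 and the looser product to 400, both more than an order of magnitude above the average of fewer than 30 queries reported in Table 1.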

3.1 Comparison to PAIR

TAP is a generalization of the Prompt Automatic Iterative Refinement (PAIR) method [Cha+23]: TAP specializes to PAIR when its branching factor $b$ is $1$ and pruning of off-topic prompts is disabled (i.e., $\texttt{Off-Topic}{}(P,G)$ is set to be $0$ for all $P$). In other words, in PAIR, (1) $\mathds{A}$ uses chain-of-thought reasoning to revise the prompt and (2) all prompts generated by $\mathds{A}$ are sent to $\mathds{T}$.

Like our work, [Cha+23] also recommend using an evaluator LLM $\mathds{E}$ to implement the Judge function. But, in principle, PAIR can be used with any implementation of Judge. To ensure that $\mathds{E}$ gives an accurate evaluation of Judge, it is important to instantiate it with an appropriate system prompt. In our evaluations of TAP, when using $\mathds{E}$ to evaluate Judge, we use the same system prompt for $\mathds{E}$ as is used in PAIR. (Naturally, this system prompt is not suitable for evaluating the Off-Topic function and we present the system prompt for Off-Topic in Section C.)

However, PAIR has two deficiencies, which TAP improves upon.

(Prompt Redundancy). By running multiple iterations, the aforementioned approach hopes to obtain a diverse set of prompts. However, we find significant redundancies. The prompts generated from the first iteration follow nearly-identical strategies in many repetitions. We suspect this is because, at the start, all repetitions query $\mathds{A}$ with the same conversation history.

(Prompt Quality). Further, we also observe that a majority of the prompts generated by $\mathds{A}$ are off-topic for $G$ .

Since we use a small branching factor $b$ , $\mathds{A}$ is not prompted with the identical conversation history a large number of times. Since the conversation history has a significant effect on the outputs of LLMs, reducing redundancies in the conversation history likely reduces redundancies in prompts generated by $\mathds{A}$ . Further, as TAP prunes the off-topic prompts, it ensures only on-topic prompts are sent to $\mathds{T}$ . Since off-topic prompts rarely lead to jailbreaks, this reduces the number of queries to $\mathds{T}$ required to obtain jailbreaks.

One may argue that if $\mathds{A}$ is likely to create off-topic prompts, then it may be beneficial to send some off-topic prompts to $\mathds{T}$ . This would ensure that off-topic prompts are also included in the conversation history which, in turn, may ensure that $\mathds{A}$ does not generate further off-topic prompts. However, empirically this is not the case. On the contrary, we observe that including off-topic prompts in the conversation history increases the likelihood that future prompts are also off-topic. In other words, the probability that the $i$ -th prompt $P_{i}$ is off-topic conditioned on the previous prompt $P_{i-1}$ being off-topic is significantly higher (up to 50%) than the same probability conditioned on $P_{i-1}$ being on-topic; i.e., $\Pr\left[\texttt{Off-Topic}{}(P_{i},G)=1\mid\texttt{Off-Topic}{}(P_{i-1},G)=1\right]>\Pr\left[\texttt{Off-Topic}{}(P_{i},G)=1\mid\texttt{Off-Topic}{}(P_{i-1},G)=0\right]$ .
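The two conditional probabilities compared above can be estimated from logged sequences of Off-Topic labels by counting transitions between consecutive prompts. The sketch below is illustrative only: the function name is hypothetical and the label sequences are made-up data, not measurements from the paper.

```python
def conditional_off_topic_rates(flag_sequences):
    """Estimate Pr[off_i = 1 | off_{i-1} = 1] and Pr[off_i = 1 | off_{i-1} = 0].

    Each sequence lists Off-Topic labels (0 or 1) for consecutive prompts
    within one conversation.
    """
    # previous label -> [number of transitions, transitions ending off-topic]
    counts = {0: [0, 0], 1: [0, 0]}
    for flags in flag_sequences:
        for prev, cur in zip(flags, flags[1:]):
            counts[prev][0] += 1
            counts[prev][1] += cur

    def rate(prev):
        n, k = counts[prev]
        return k / n if n else float("nan")

    return rate(1), rate(0)

# Hypothetical labels for two conversations (illustrative data only).
logs = [
    [0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
]
after_off, after_on = conditional_off_topic_rates(logs)
```

On this made-up data, an off-topic prompt follows an off-topic prompt in 3 of 5 transitions and follows an on-topic prompt in 2 of 5, which mirrors the direction of the inequality above.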

Improving upon these deficiencies allows us to jailbreak state-of-the-art LLMs with a significantly higher success rate than PAIR while using a similar or smaller number of queries to $\mathds{T}$ (see Table 1). Further, empirical studies in Section 4.2 assess the relative improvements offered by pruning off-topic prompts and using tree-of-thought reasoning.

3.2 Choice of Attacker and Evaluator

A crucial component of our approach is the choice of the attacker $\mathds{A}$ and the evaluator $\mathds{E}$ . Ideally, we want both to be large models capable of giving meaningful responses when provided with complex conversation histories that are generated by $\mathds{A}$ , $\mathds{T}$ , and $\mathds{E}$ together. However, we also do not want $\mathds{A}$ to refuse to generate prompts for harmful (or otherwise restricted) prompts. Nor do we want $\mathds{E}$ to refuse to give an assessment when given harmful responses and/or prompts.

Based on the above ideals and following the choice in [Cha+23], we use Vicuna-13B-v1.5 as $\mathds{A}$ and GPT4 as $\mathds{E}$ . Still, to have a point of comparison, we evaluate the performance of our approach when $\mathds{E}$ is GPT3.5-Turbo (see Section 4.2). Additional optimization of the choice for $\mathds{A}$ and $\mathds{E}$ or using custom-fine-tuned LLMs for them may further improve the performance of our method. We leave this to future work.

Empirical Study

In this section, we evaluate our method (TAP) and baselines on a dataset of adversarial prompts with state-of-the-art LLMs from OpenAI, Google, and Meta.

We use a dataset of harmful prompts created by [Cha+23] called AdvBench Subset. AdvBench Subset consists of 50 prompts asking for harmful information across 32 categories. It is a subset of prompts from the harmful behaviors dataset in the AdvBench benchmark [Zou+23], which was selected to ensure coverage over a diverse set of harmful categories.

Methods and Implementation Details.

In addition to TAP, we consider two baselines: PAIR [Cha+23], which is the SOTA method for semantic-level black-box jailbreaks, and GCG [Zou+23], which is the SOTA method for token-level white-box jailbreaks. For both TAP and PAIR, we use Vicuna-13B-v1.5 as the attacker and GPT4 as the evaluator. This is identical to [Cha+23]’s experiments. For TAP, we fix the maximum depth to $d=10$, the maximum width to $w=10$, and the branching factor to $b=4$. Assuming that, in each layer, 50% of the prompts are pruned (as we observe in our experiments), these parameters ensure that the maximum number of queries TAP can send to the target $\mathds{T}$ is close to the maximum number of queries that PAIR can send to $\mathds{T}$. That said, we have not attempted to optimize the choices of these parameters, and optimizing them would likely lead to further performance improvements. For both PAIR and GCG, we fix the same hyperparameters as in the corresponding works. We attempt to jailbreak a number of LLMs. For each target LLM, we use temperature 0 to ensure that its responses are deterministic and we generate $L=150$ tokens. The system prompts and other generation parameters we use are listed in Section C.

Metrics for Evaluating Jailbreak Success.

We consider two metrics for this: the GPT4-Metric and Human-Judgement. The GPT4-Metric is automated and uses GPT4 for the Judge function described in Section 2: given a response $R$ and goal $G$ , the pair $(R,G)$ is considered to represent a successful jailbreak if $\texttt{Judge}{}(R,G)=1$ , which corresponds to a “detailed and fully jailbroken response” as evaluated by GPT4. This is the same metric used in [Cha+23]. Human-Judgement is manually evaluated by the authors. We label each response and goal pair $(R,G)$ using the scheme in [WHS23, Appendix B] where (only) the BadBot label constitutes a jailbreak. This is the same metric used in [WHS23].

4.1 Success Rate and Number of Queries to Target

First, we evaluate the fraction of goals for which our method, PAIR, and GCG find successful jailbreaks against various LLMs. We report the results according to the GPT4-Metric in Table 1. The results with Human-Judgement are qualitatively similar and are reported in Table 6 in Section A.

The main observation is that, for all target models, TAP finds jailbreaks for a significantly larger fraction of prompts than PAIR while sending significantly fewer queries to the target. Concretely, on GPT4-Turbo (throughout the paper, GPT4-Turbo refers to the gpt-4-1106-preview model in OpenAI’s API), which is the latest model from OpenAI as of November 2023, TAP finds jailbreaks for 40% more prompts than PAIR while sending 52% fewer queries to the target.

In more detail, on all the closed-source models we test (namely, the GPT models and PaLM-2), TAP finds jailbreaks for more than 75% of the prompts while using fewer than 30 queries per prompt per model. In comparison, PAIR’s success rate can be as low as 44% even though it has a higher average number of queries for each prompt than TAP. GCG cannot be evaluated on these models as it requires access to the weights of the models. Among the open-source models, both TAP and PAIR find jailbreaks for nearly all goals with Vicuna-13B but have a low success rate against Llama-2-Chat-7B. GCG achieves the same success rate as TAP with Vicuna-13B and has a 54% success rate with Llama-2-Chat-7B. However, GCG requires orders of magnitude more queries than TAP.
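The GPT4-Turbo comparison can be reproduced from the success rates and average query counts quoted in the Findings paragraph of the Introduction. The short calculation below makes explicit one reading of the figures, namely that the "40%" is a difference in percentage points while the "52%" is a relative reduction; the variable names are illustrative.

```python
# Figures quoted in the text for GPT4-Turbo as the target.
tap_success, pair_success = 84.0, 44.0   # success rates, in percent
tap_queries, pair_queries = 22.5, 47.1   # average queries to the target

# Difference in success rate, in percentage points.
success_gap_points = tap_success - pair_success

# Relative reduction in average queries, in percent.
query_reduction_pct = 100.0 * (1.0 - tap_queries / pair_queries)
```

The gap is 40 percentage points, and the relative reduction in queries is roughly 52%.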

4.2 Effect of Pruning, Tree-of-thoughts, and Evaluator Choice

Next we explore the relative importance of (1) pruning off-topic prompts and (2) of using a tree-of-thoughts approach. We also assess the effect of using a different LLM (from GPT4) as the evaluator on TAP’s performance. In all studies, we use GPT4-Turbo as the target as it is the state-of-the-art model according to several benchmarks [Ope23].

In the first study, we compare TAP to a variant where the off-topic prompts are not pruned. We report the results in Table 2. Since the variant does not prune off-topic prompts, we naturally observe that it sends a significantly higher average number of queries to the target than the original method (55.4 vs 22.5). Moreover, even though the variant sends more queries, it also has a poorer success rate (72% vs 84%). At first, this might seem contradictory, but it happens because the width of the tree-of-thought at each layer is at most $w$ (as prompts are deleted to keep the width at most $w$ ) and, since the variant does not prune off-topic prompts, in the variant, off-topic prompts can crowd out on-topic prompts.

In the second study, we compare TAP to a variant that has a branching factor of 1 (with all other hyper-parameters remaining identical). In particular, this variant does not use tree-of-thought reasoning. Instead, it uses chain-of-thought reasoning like PAIR, but unlike PAIR, it also prunes off-topic prompts. This ablation studies whether one can match the performance of TAP by incorporating pruning in PAIR. Since the variant does not branch, it sends fewer queries than the original method. To correct this, we repeat the variant 40 times and, if any of the repetitions succeeds, we count it as a success. This repetition ensures that the variant sends more queries than the original method and, hence, should have a higher success rate. However, we observe that the success rate of the variant is 36% lower than that of the original (Table 3), showing the benefits of tree-of-thought reasoning over chain-of-thought reasoning.

In the third study, we explore how the evaluator LLM affects the performance of TAP. All experiments so far use GPT4 as the evaluator. The next experiment uses GPT3.5-Turbo as the evaluator and reports the success rate according to the GPT4-Metric for jailbreaking. We observe that the choice of the evaluator can affect the performance of TAP: changing the evaluator from GPT4 to GPT3.5-Turbo reduces the success rate from 84% to 4.2% (Table 4). The reason for the reduction in success rate is that GPT3.5-Turbo incorrectly determines that the target model is jailbroken (for the provided goal) and, hence, prematurely stops the method. As a consequence, the variant sends significantly fewer queries than the original method (4.4 vs 22.5; Table 4).

3 Transferability of Jailbreaks

Next, we study the transferability of the attacks found in Section 4.1 from one target to another. For each baseline, we consider prompts that successfully jailbroke Vicuna-13B, GPT4, and GPT4-Turbo for at least one goal in the AdvBench Subset. Since OpenAI’s API does not allow deterministic sampling from GPT4, the GPT4-Metric has some randomness. To correct any inconsistencies from this randomness, for each goal and prompt pair, we query GPT4-Metric 10 times and consider a prompt to transfer successfully if any of the 10 attempts is labeled as a jailbreak. This method of mitigating inconsistencies could also be applied to the evaluator when it is assessing the Judge function in TAP. However, it may increase the running time significantly with only a marginal benefit. In Table 5, we report the fraction of these prompts that jailbreak a different target (for the same goal as they jailbroke on the original target).
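The repeated-query rule and the reported fraction can be sketched as follows. This is an illustrative sketch, not the paper's code: `transfers`, `transfer_rate`, and `toy_judge` are hypothetical names, and the judge is a deterministic stand-in for the GPT4-Metric.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (goal, prompt, response) -> labeled as a jailbreak?
Judge = Callable[[str, str, str], bool]


def transfers(judge: Judge, goal: str, prompt: str, response: str,
              n_queries: int = 10) -> bool:
    """Query a possibly non-deterministic judge up to n_queries times; succeed if any call says yes."""
    return any(judge(goal, prompt, response) for _ in range(n_queries))


def transfer_rate(judge: Judge, records: Iterable[Tuple[str, str, str]],
                  n_queries: int = 10) -> float:
    """Fraction of (goal, prompt, response-from-new-target) records that transfer."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(transfers(judge, g, p, r, n_queries) for g, p, r in records)
    return hits / len(records)


def toy_judge(goal: str, prompt: str, response: str) -> bool:
    """Toy stand-in for the metric: treats an affirmative opening as a jailbreak."""
    return response.startswith("Sure")


example = [
    ("goal A", "prompt 1", "Sure, here is ..."),
    ("goal B", "prompt 2", "I cannot help with that."),
]
print(transfer_rate(toy_judge, example))
```

Counting a success when any of the repeated queries is positive reduces missed jailbreaks caused by randomness, but it can also increase false positives if the judge is noisy.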

Among the jailbreaks found by TAP and PAIR, the prompts jailbreaking GPT4 or GPT4-Turbo transfer at a significantly higher rate to other LLMs than prompts jailbreaking Vicuna-13B. This is natural, as many benchmarks suggest that GPT4 and GPT4-Turbo are harder to jailbreak than Vicuna-13B [Ope23, Sha+23, Cha+23]. Among GPT4 and GPT4-Turbo, prompts jailbreaking GPT4 transfer at a higher rate to GPT3.5-Turbo and PaLM-2 than those jailbreaking GPT4-Turbo (27% vs 27% on PaLM-2 and 56% vs 48% on GPT3.5-Turbo). This suggests that GPT4-Turbo is currently less robust than GPT4, but this may change as GPT4-Turbo is updated over time.

Overall, the jailbreaks found by TAP and the jailbreaks found by PAIR have similar transfer rates to new targets. Two exceptions are the cases where the new targets are GPT3.5-Turbo and GPT4, where PAIR has better transfer rates, perhaps because PAIR only jailbreaks goals that are easy to jailbreak on any model (which increases the likelihood of the jailbreaks transferring). That said, jailbreaks found by TAP and PAIR transfer at a significantly higher rate than the jailbreaks found by GCG. Finally, we observe that the prompts generated by GCG transfer at a lower rate to the GPT models compared to those reported in earlier publications, e.g., [Cha+23]. We suspect that this is because of the continuous updates to these models by the OpenAI Team, but exploring the reasons for the degradation in GCG’s performance can be a valuable direction for further study.

Conclusion

This work introduces TAP, a jailbreaking method that is automated, only requires black-box access to the target LLM, and outputs interpretable prompts. The method utilizes two other LLMs, an attacker and an evaluator. The attacker iteratively generates new prompts for jailbreaking the target using tree-of-thoughts reasoning and the evaluator (1) prunes the generated prompts that are irrelevant and (2) evaluates the remaining prompts. These evaluations are shared with the attacker which, in turn, generates further prompts until a jailbreak is found (Section 3 and Algorithm 1).

We evaluate the method on state-of-the-art LLMs and observe that it finds prompts that jailbreak GPT4, GPT4-Turbo, and PaLM-2 for more than 80% of requests for harmful information in an existing dataset while using fewer than 30 queries on average (Table 1). This significantly improves upon the prior automated methods for jailbreaking black-box LLMs with interpretable prompts (Table 1).

Our work, along with other interpretable jailbreak methods, raises questions about how LLMs can best be protected. Foremost, can LLMs be safeguarded against interpretable prompts aimed at extracting restricted content without degrading their responses on benign prompts? One approach is to introduce a post-processing layer that blocks responses containing harmful information. Further work is needed to test the viability of this approach. However, such a method has limited applicability if LLMs operate in a streaming fashion (like the GPT models), where the output is streamed to the user one token at a time. Our current evaluations focus on requests for harmful information. It is important to explore whether TAP or other automated methods can also jailbreak LLMs for restricted requests beyond harmful content (such as requests for biased responses or personally identifiable information) [Li+23, KDS23]. Further, while we focus on single-prompt jailbreaks, it is also important to rigorously evaluate LLMs’ vulnerability to multi-prompt jailbreaks, where a small sequence of adaptively constructed prompts $P_{1},P_{2},\dots,P_{m}$ together jailbreak an LLM.

Limitations.

We evaluate our results on a single dataset. The performance of our method may be different on datasets that are meaningfully different from the one that we use. Further, since we evaluate black-box LLMs, we are unable to control all their hyper-parameters in our experiments. To limit the impact of changes in the evaluated models over time, all of the evaluations reported in the paper were conducted on data collected during 12 days from 18th November 2023 to 30th November 2023.

Broader Impact.

In this work, we improve the efficiency of existing methods for jailbreaking LLMs. Our work can be used for making public LLMs generate restricted (including harmful and toxic) content with fewer resources. However, we believe that releasing our findings in full is important for ensuring open research on the vulnerabilities of LLMs. Open research on vulnerabilities is crucial to increase awareness and resources invested in safeguarding these models, which is becoming increasingly important as their use extends beyond isolated chatbots. To minimize the adverse effects of our findings, we reported them to the respective organizations and model developers before making them public.

References

Appendix A Success Rate According to Human-Judgement

In this section, we report the success rate of the experiment from Section 4.1 according to Human-Judgement. To compute the success rates, we manually evaluated each pair of response $R$ and prompt $P$ following the guideline in [WHS23, Appendix B]. Moreover, to eliminate bias, we performed the evaluations anonymously: we combined all prompts $P$ and responses $R$ generated by the 12 combinations of target LLM and method into one file, which had an anonymous identifier and goal $G$ for each pair $(P,R)$ , but did not have any information about which LLM or method generated it.
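The blinding step can be sketched as follows. This is an illustrative sketch rather than the procedure actually used: the `anonymize` helper, the record field names, and the identifier format are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def anonymize(records: List[dict], seed: int = 0) -> Tuple[List[dict], Dict[str, Tuple[str, str]]]:
    """Shuffle records and replace (target, method) with an anonymous identifier.

    Each input record is assumed to have the keys
    'goal', 'prompt', 'response', 'target', and 'method'.
    Returns the blinded rows for the annotator and a separate key for un-blinding.
    """
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)     # remove any ordering that could reveal the source

    blinded, key = [], {}
    for i, rec in enumerate(shuffled):
        anon_id = f"item-{i:04d}"
        blinded.append({
            "id": anon_id,
            "goal": rec["goal"],
            "prompt": rec["prompt"],
            "response": rec["response"],
        })
        key[anon_id] = (rec["target"], rec["method"])
    return blinded, key


records = [
    {"goal": "G1", "prompt": "P1", "response": "R1", "target": "Vicuna-13B", "method": "TAP"},
    {"goal": "G1", "prompt": "P2", "response": "R2", "target": "Vicuna-13B", "method": "PAIR"},
    {"goal": "G2", "prompt": "P3", "response": "R3", "target": "GPT4", "method": "TAP"},
]
blinded, key = anonymize(records)
```

Keeping the un-blinding key in a separate structure means that the file shown to the annotator carries only the identifier, goal, prompt, and response.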

Overall, the results are qualitatively similar to the ones with the GPT4-Metric: TAP has a significantly higher success rate than PAIR on all target LLMs evaluated, except on Vicuna-13B, where there is no scope for improvement, and on Llama-2-Chat-7B, where both methods perform poorly.

Appendix B Examples of Jailbreaks

B.2 Examples of Execution of Our Method

Appendix C Implementation Details

In this section, we provide system prompts and, where applicable, hyperparameters used for the target, evaluator, and attacker LLMs.

We used the following system prompts for the target LLMs, which are provided by the fastchat.model API.

Vicuna-13B: ‘A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.’

GPT3.5-Turbo and GPT4: ‘You are a helpful assistant.’

GPT4-Turbo: ‘A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human’s questions.’

Further, for the open-source models, we set $\texttt{top\_p}=1$ , $\texttt{num\_beams}=1$ , and $\texttt{do\_sample}=\texttt{False}$ . [Cha+23] also use the same hyperparameters.
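The parameter names above match keyword arguments of the `generate` method in the Hugging Face transformers library. Assuming that interface, setting `do_sample` to `False` with `num_beams` equal to 1 corresponds to greedy decoding, so `top_p` (which only applies when sampling) has no effect. A minimal sketch of these settings follows; the dictionary name and the helper are hypothetical.

```python
# Decoding settings quoted in the text for the open-source target models.
# The dictionary name and the helper below are hypothetical.
TARGET_DECODING = {"top_p": 1, "num_beams": 1, "do_sample": False}


def is_greedy(settings: dict) -> bool:
    """Return True when the settings disable sampling and beam search."""
    no_sampling = settings.get("do_sample", False) is False
    single_beam = settings.get("num_beams", 1) == 1
    return no_sampling and single_beam


print(is_greedy(TARGET_DECODING))
```

Under this assumption, the open-source targets return the same response for the same prompt, which removes one source of randomness from the evaluation.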

C.2 Evaluator LLM

We provide the complete system prompts for the evaluator in Table 7 and Table 8. Since we used closed-source models as judges, there are no further hyperparameters.

C.3 Attacker LLM

We use the same algorithmic setup for the attacker as described in [Cha+23, Appendix A.2]. We set $\texttt{top\_p}=0.1$ , $\texttt{num\_beams}=1$ , $\texttt{temperature}=1$ , and $\texttt{do\_sample}=\texttt{True}$ . Further, we use the system prompt provided in Table 9.