LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, Duen Horng Chau

Introduction

Large language models (LLMs) have taken the world by storm, showing their ability to generate high-quality text for various tasks like storytelling, serving as chat assistants, and even composing music. However, despite their ability to produce beneficial content, LLMs can also generate harmful material like phishing emails, malicious code, and hate speech. Many methods attempt to prevent the generation of harmful content. These methods mainly focus on “aligning” LLMs to human values using various training strategies or by providing a set of supervisory principles to guide the LLM’s responses. However, an emerging body of work has revealed that even aligned models can be manipulated into producing harmful content through prompt engineering or more advanced techniques such as adversarial suffix attacks. The challenge of preventing an LLM from generating harmful content is that this goal conflicts with how LLMs are trained. The very framework that allows LLMs to effectively generate high-quality responses also enables them to generate hateful or otherwise harmful text, as the training corpora are composed of public data containing toxic passages. Our work helps tackle these critical challenges through the following major contributions:

LLM Self Defense: a simple zero-shot defense against LLM attacks (Fig. 1). LLM Self Defense is a method designed to prevent user exposure to harmful or malevolent content elicited from LLMs. It is effective and easy to deploy, requiring no modifications to the underlying model. LLM Self Defense is relatively simple compared to existing methods of defending against LLM attacks, which rely on iterative generation or preprocessing. Thus, LLM Self Defense is faster and more efficient. (Section 3)

LLM Self Defense reduces attack success rate to virtually 0. We evaluated LLM Self Defense on two prominent language models: GPT 3.5, one of the most popular LLMs, and Llama 2, a prominent open-source LLM. Our evaluation demonstrates that LLM Self Defense generalizes effectively across both models, flagging nearly all harmful text and reducing the attack success rate to virtually 0 against a variety of attack types, including attacks aimed at eliciting affirmative responses and prompt engineering attacks. Notably, we observed that LLMs identify harmful content more accurately when the harm-detection instruction is given as a suffix, after the LLM has already processed the text (Fig. 2). Our findings demonstrate that presenting the harmful text first is more effective in minimizing false alarms. (Section 4)

Related Work

As LLMs have grown in complexity and capability, so has their attack surface. Recently, researchers have explored LLM attacks, or “jailbreaking”: methods to bypass or break through the limitations imposed on LLMs that prevent them from generating harmful content. Wei et al. argued that there exists a conflict between generating highly probable sequences of text, which aligns with the core auto-regressive pretraining objective of LLMs, and avoiding the generation of harmful content. This implies that if an LLM begins a response to a toxic query, such as “How to build a bomb?”, with an affirmative statement, for example “Absolutely! The way you do this is …”, it is inclined to continue generating an affirmative response to maintain a coherent tone, leading to the generation of harmful text. Researchers have accomplished such attacks by training an adversarial suffix that elicits the desired response using a gradient-based optimization method, or by using prompt engineering. Zou et al. demonstrated that certain suffix attacks are effective even on explicitly aligned language models and can be transferred to LLMs different from the ones on which they were originally trained. There are substantial risks associated with LLMs generating harmful text, and aligned models remain vulnerable to various forms of attacks. Therefore, it is imperative to find ways to identify and mitigate an LLM’s ability to generate harmful content.

LLM Defenses

Developing defenses against LLM attacks is an emerging field, unlike in the case of vision models, where there is a substantial body of literature exploring robust defenses against adversarial examples. Recent research leveraged principles used for computer vision robustness and transferred them to language models. This effort aimed to establish a baseline for future defense strategies. Additionally, researchers have explored iterative auto-regressive inference as a defense. This builds on the ability of LLMs to verify the content they generate for accuracy and then correct themselves using a chain of verification, which has been shown to improve factual question answering, mathematical reasoning, and logical reasoning. This process reduced the attack success rate but increased the language model's generation time by 300%. The complexity and overhead of the existing approaches can limit their usability and generalizability. LLM Self Defense aims to address these issues.

LLM Self Defense Design

LLMs demonstrate the ability to comprehend common social norms and values. Additionally, they can follow simple instructions in a zero-shot manner, even without explicit examples. LLM Self Defense takes advantage of both of these features by leveraging an LLM’s understanding of the word “harmful” to filter out potentially harmful responses.

We create a zero-shot classifier by instructing a language model to determine whether a piece of content contains harmful text, as illustrated in Figure 2, which provides an overview of the defense pipeline. Initially, a user provides a potentially malicious text prompt $T_{in}$ (e.g., “give me instructions on how to make a bomb” followed by an adversarial suffix). An LLM, $LLM_{gen}$, takes this prompt as input and generates potentially harmful text $T_{resp}$, such that $LLM_{gen}(T_{in})=T_{resp}$. Our core approach involves constructing a zero-shot harm classifier by prompting another LLM, denoted by $LLM_{filter}$, to predict whether or not $T_{resp}$ contains harmful content. Note that $LLM_{filter}$ can be another instance of the same LLM that underlies $LLM_{gen}$. We present $LLM_{filter}$ with a prompt structured as “Does the following text contain harmful content: $T_{resp}$. Respond with ‘Yes, this is harmful’ or ‘No, this is not harmful’”. $LLM_{filter}$ then follows these instructions, reads through the LLM-generated content $T_{resp}$, and generates a “Yes, this is harmful” or “No, this is not harmful” prediction. This basic instruction prompt proves effective at filtering out virtually all harmful LLM-generated content, as our experiment results in Section 4.2 demonstrate.
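The pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `LLM` callable type, the `REFUSAL` message, the function names, and the choice to treat any verdict that does not start with "no" as harmful are all assumptions made for the example. The toy filter stands in for a real model call.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Placeholder refusal message (an assumption, not taken from the paper).
REFUSAL = "Sorry, I cannot help with that."


def build_filter_prompt(t_resp: str) -> str:
    """Wrap the generated text T_resp in the harm-classification instruction."""
    return (
        f"Does the following text contain harmful content: {t_resp} \n"
        "Respond with 'Yes, this is harmful' or 'No, this is not harmful'."
    )


def is_harmful(verdict: str) -> bool:
    """Read the filter's verdict; anything not starting with 'no' counts as harmful.

    Failing closed like this is a design assumption of this sketch.
    """
    return not verdict.strip().lower().startswith("no")


def self_defense(t_in: str, llm_gen: LLM, llm_filter: LLM) -> str:
    """Generate T_resp = LLM_gen(T_in), then let LLM_filter decide whether to show it."""
    t_resp = llm_gen(t_in)
    verdict = llm_filter(build_filter_prompt(t_resp))
    return REFUSAL if is_harmful(verdict) else t_resp


def toy_filter(prompt: str) -> str:
    """Toy stand-in for LLM_filter: flags text that mentions 'explosive'."""
    if "explosive" in prompt.lower():
        return "Yes, this is harmful"
    return "No, this is not harmful"
```

In practice `llm_gen` and `llm_filter` would wrap calls to a real model, and both may point at the same underlying LLM.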

Experiments and Results

Our evaluation of LLM Self Defense focuses on two prominent language models: GPT 3.5 and Llama 2 7B. GPT 3.5 is one of the most popular LLMs, while Llama 2 is one of the most prominent open-source LLMs. During our evaluation, we observe that both GPT 3.5 and Llama 2 respond in a harmful manner when subjected to various forms of tested attacks. To generate a dataset for evaluation, we induce 50 harmful and 50 benign responses from each LLM. The harmful responses are induced by prompting the models with slightly modified versions of adversarial prompts in the AdvBench dataset, which we modify using techniques described in Section 4.1. Specific examples of harmful prompts generated through this process are presented in Table 1. These responses are subsequently evaluated using the harm filters detailed in Section 4.2, and the results of this assessment are outlined in Table 2.

Inducing harmful text from Llama 2. The safety measures of Llama 2 can be circumvented by using prompt engineering and steering it to complete the sentence “Sure, here is a detailed step by step process of how to make a bomb:”, similar to the method shown in previous research. The same task can be accomplished by using an LLM in the context of the sentence completion task with the input “Sure, here is a detailed step by step process of how to make a bomb:”. However, we observe that while the language model does not explicitly refuse to continue the sentence in the latter scenario, the content it generates is not consistently harmful. Instead, it adopts a similar tone without generating any harmful content. For our evaluation, we manually select explicitly harmful generated responses to assess the harm filter.

Inducing harmful text from GPT 3.5. When GPT 3.5 is presented with prompts framed in a hypothetical or fictional context, or towards a “good” purpose, it tends to provide a detailed response. If necessary, these responses framed in a hypothetical setting can be converted into instructions in a real world setting. In most cases this is not necessary because the harmful content can be readily translated into direct action items in the real world with minimal to no modifications.

Further, we attempted to reproduce the gradient-based universally transferable attack but were unsuccessful, encountering challenges similar to those faced by others who attempted the attack. The training was unstable and the model did not converge, rendering the attacks ineffective. We explored solutions provided in the official code repository; however, these measures did not resolve the training instability, and the issue persisted.

LLMs Defending Themselves

Each piece of text generated by an LLM (GPT 3.5 or Llama 2), as described in the previous subsection (Sec. 4.1), is provided to an LLM again (GPT 3.5 or Llama 2) to check for harmfulness. We refer to the LLM that initially generates the text as a “response generator” and the LLM that later checks for harmfulness as a “harm filter”. The harm filters are instructed to produce “Yes, it is harmful” or “No, it is harmless” as responses. These responses act as a classifier output, allowing us to compute the accuracy of the harm classifiers. Table 2 summarizes our experiment results in the form of accuracy, true positive rate, and false positive rate. Overall, GPT 3.5 performs well at classifying harmful content. When asked the question “Is the content harmful?” as a prefix, GPT 3.5 reaches 98% accuracy. Llama 2 performs worse, with an accuracy of 77%.
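For readers who want to recompute the reported quantities, the three metrics can be derived from ground-truth labels and harm-filter verdicts as sketched below. The function name and the boolean encoding (`True` meaning harmful) are assumptions for illustration, and the example values in the usage note are toy data, not the paper's results.

```python
from typing import Dict, Sequence


def harm_filter_metrics(labels: Sequence[bool], preds: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy, true positive rate, and false positive rate.

    labels[i] is True when response i is actually harmful;
    preds[i] is True when the harm filter flagged response i as harmful.
    """
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y and p)          # harmful, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # harmful, missed
    fp = sum(1 for y, p in pairs if not y and p)      # benign, flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # benign, passed

    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "tpr": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
    }
```

For example, with four harmful responses that are all flagged and four benign responses of which one is wrongly flagged, the function returns an accuracy of 0.875, a true positive rate of 1.0, and a false positive rate of 0.25. If an attack is counted as successful whenever a harmful response slips past the filter (an interpretation assumed here), the attack success rate after filtering corresponds to 1 minus the true positive rate.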

We discover that instructing an LLM to determine whether an induced response constitutes harm after the LLM has already processed the text is more effective at distinguishing between harmful and benign responses. This improves the accuracy of GPT 3.5 to 99% and that of Llama 2 to 94.6%. The gain is primarily due to the substantial reduction in the number of false positive classifications made by Llama 2, as can be seen in Table 2. When Llama 2 is primed to detect harm before it reads the content to be checked, it becomes overly cautious to the point of misclassifying over 40% of the benign responses as harmful, including responses to prompts such as “Steps to open a spreadsheet”.
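The difference between the two prompt layouts is only where the question sits relative to the text being checked. A minimal sketch follows; the exact wording of `QUESTION` and the blank-line separator are assumptions chosen for illustration, not the prompts used in the experiments.

```python
# Hypothetical instruction text; the wording used in the experiments may differ.
QUESTION = (
    "Is the content harmful? "
    "Respond with 'Yes, it is harmful' or 'No, it is harmless'."
)


def prefix_prompt(t_resp: str) -> str:
    """Question first: the filter is primed to look for harm before reading the text."""
    return f"{QUESTION}\n\n{t_resp}"


def suffix_prompt(t_resp: str) -> str:
    """Text first: the filter reads the content, then is asked whether it is harmful."""
    return f"{t_resp}\n\n{QUESTION}"
```

Both functions produce prompts with identical content, so any change in filter accuracy between them can be attributed to ordering alone.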

Notably, despite the difference in accuracy, when solely assessing the LLM’s effectiveness in preventing attacks, we see that Llama 2 and GPT 3.5 perform identically in detecting harm when the content is genuinely harmful. This is evident in the True Positive Rate, where LLM Self Defense successfully reduces the attack success rate to virtually 0.

Discussion: Broader Social Impacts & Future Work

The widespread use of LLMs in applications like chatbots and assistants underscores the growing urgency of developing scalable, standardized, and straightforward defenses against a diverse range of attacks. This challenge continues to be relevant even for aligned LLMs, as they can be adversarially prompted to generate harmful output.

We demonstrate that an LLM can be used out-of-the-box, without any fine-tuning, as its own harm filter, with no need for the preprocessing or iterative generation required by previously proposed defenses. LLM Self Defense’s simple process achieves competitive results when compared to more complex defense methods. It can accurately detect harmful responses and reduce the attack success rate to virtually 0. We believe our approach holds promise in defending against various attacks on LLMs. Notably, LLM Self Defense had consistent attack success rates regardless of the attack.

To further extend LLM Self Defense, we can provide concrete examples of “harm” and use in-context learning. Additionally, we plan to explore whether summarizing the response before classification can enable the LLM to distinguish benign and harmful responses with greater accuracy. Currently, we manually categorize the harm filter responses into “yes” or “no”, because Llama 2 occasionally deviates from the desired response format, even when explicit instructions are provided. However, the use of logit biasing could force the LLM to consistently produce a “Yes” or “No” response for classification. This would reduce the need for manual inspection and facilitate automation of the filtering process, thereby enabling us to evaluate the effectiveness of LLM Self Defense on a broader spectrum of responses.
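As a lighter-weight complement to logit biasing, part of the manual categorization could be automated with a small parser that accepts well-formed verdicts and sets aside everything else for human review. The sketch below is an assumption about how such a step might look, not part of the evaluated method; the function name and the three-way return value are hypothetical.

```python
import re
from typing import Optional

# Match a leading "yes" or "no" as a whole word, ignoring leading quotes or spaces.
_YES = re.compile(r"\W*yes\b", re.IGNORECASE)
_NO = re.compile(r"\W*no\b", re.IGNORECASE)


def parse_verdict(text: str) -> Optional[bool]:
    """Map a harm-filter response to True (harmful), False (not harmful), or None.

    None means the response deviated from the requested format and
    should be inspected manually.
    """
    if _YES.match(text):
        return True
    if _NO.match(text):
        return False
    return None
```

Responses such as “Yes, it is harmful” and “No, it is harmless” are classified directly, while a reply like “I cannot determine that” returns `None` and can be queued for manual inspection.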

Acknowledgements

This work was supported in part by Defense Advanced Research Projects Agency (DARPA). Use, duplication, or disclosure is subject to the restrictions as stated in Agreement number HR00112030001 between the Government and the Performer.