Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez, Ian Ribeiro

Introduction

In 2020, OpenAI introduced GPT-3 , a large language model (LLM) capable of completing text inputs to produce human-like results. Its text completion capabilities can generalize to other natural language processing (NLP) tasks like text classification, question-answering, and summarization. Since then, GPT-3 and other LLMs – like BERT , GPT-J , T5 , and OPT – have revolutionized NLP by achieving state-of-the-art results on various tasks.

An approach to creating applications with GPT-3 (and similar LLMs) is to design a prompt that receives user inputs via string substitution . For instance, one can simply build a grammar fixing tool by using the prompt Correct this to standard English: "{user_input}", where {user_input} is the phrase that the final user will provide. It is remarkable that a very simple prompt is capable of a very complex task. Building a similar application with a rule-based strategy would be immensely harder (or even unfeasible).

However, the ease of building applications with GPT-3 comes with a price: malicious users can easily inject adversarial instructions via the application interface. Due to the unstructured and open-ended aspect of GPT-3 prompts, protecting applications from these attacks can be very challenging. We define the action of inserting malicious text with the goal of misaligning an LLM as prompt injection.

Prompt injection got recent attention on social media with users posting examples of prompt injection to misalign the goals of GPT-3-based applications . However, studies exploring the phenomena are still scarce. In this work, we study how LLMs can be misused by adversaries through prompt injection. We propose two attacks (Figure 1) – goal hijacking and prompt leaking – and analyze their feasibility and effectiveness.

We define goal hijacking as the act of misaligning the original goal of a prompt to a new goal of printing a target phrase. We show that a malicious user can easily perform goal hijacking via human-crafted prompt injection.

We define prompt leaking as the act of misaligning the original goal of a prompt to a new goal of printing part of or the whole original prompt instead. A malicious user can try to perform prompt leaking with the goal of copying the prompt for a specific application, which can be the most important part of GPT-3-based applications.

Our work highlights the importance of studying prompt injection attacks and provides insights on impacting factors. We believe that our work can help the community better understand the security risks of using LLMs and design better LLM-powered applications. Our main contributions are the following:

We study prompt injection attacks against LLMs and propose a framework to explore such attacks.

We investigate two specific attacks: goal hijacking and prompt leaking.

We provide an AI x-risk analysis of our work (Appendix A).

Related work

Researchers have demonstrated that LLMs can produce intentional and unintentional harmful behavior. Since its introduction, many works have demonstrated that GPT-3 reproduces social biases, reinforcing gender, racial, and religious stereotypes. . Additionally, LLMs can leak private data used during training . Furthermore, malicious users can use GPT-3 to quickly generate vitriol at scale .

Given the importance of the topic, many papers focus on detecting and mitigating harmful behavior of LLMs: Gehman et al. investigated methods to hinder toxic behavior in LLMs and found that there is no guaranteed method to prevent it from happening. They argue that a more careful curation of pretraining data, including the participation of end users, can reduce toxicity in future models.

To mitigate harmful behavior and improve the usefulness, Ouyang et al. fine-tuned GPT-3 through human feedback, making the model better at following instructions while improving truthfulness and reducing harmful and toxic behavior. The new model is the default language model available on OpenAI’s API .

Xie et al. investigated adversarial attacks on text classifiers using methods from two open-source libraries: TextAttack and OpenAttack . Branch et al. demonstrated that a simple prompt injection can be used to change the result of the classification task on GPT-3 and other LLMs. In our work, we demonstrate a similar attack but with the goal of misleading the model into outputting a malicious target text (goal hijacking) or stealing the original prompt (prompt leaking), regardless of the original task.

The PromptInject framework

We propose PromptInject (Figure 2), a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks.

Base prompts (Table C1) are comprised of an initial instruction, replicating common conditions for most language model applications. This instruction can then be tested through kaleidoscopic variations formed by many factors: n-shot examples, labels utilized to refer to either the user or the model itself , and even the injection of a smaller secret sub-prompt which contains information sensitive to the prompt – such as special instructions to the model, verboten subjects and/or themes, or contextual enhancers – referred to as a private value.

Attack prompts (Table C2), in their turn, are built by adopting attack strategies – goal hijacking and prompt leaking – which can respectively assume the presence of a rogue string – an adversarial instruction designed to derail the model into printing a particular set of characters – or a private value, which is embedded within a secret and must not be revealed externally under any circumstances. Additionally, due to an observed sensitivity displayed by language models towards escape and delimiting characters, attacks may also be enhanced with a range of malicious characters to confuse the model.

Acknowledging the high variability in outputs resulting from different model settings (Table C3), our scenarios are subjected to considering factors such as a range of temperature, top-p sampling, frequency and presence penalties, and the model prompted.

Appendix C provides a more detailed description of the building blocks of PromptInject.

Experiments

We explored goal hijacking and prompt leaking by constructing multiple attack prompts using the PromptInject framework. We designed experiments with various factors and verified their effectiveness by running them on 35 base prompts collected from the OpenAI Examples page (Table C). We decided to use these public examples since it is likely that developers will use them as a starting point for their applications. It is worth noting that the goal of our experiments is not to search for the best prompt attack exhaustively but rather to show that many factors may impact the effectiveness of an attack.

All experiments use text-davinci-002, the most powerful publicly available OpenAI model, except for the results shown in Table B10. Appendix B contains the results of all experiments, including factors and attack prompts.

We calculated the success rates (represented as % on tables) as the mean and standard deviation of the percentage of successful attacks over the 35 prompts specified in Table C. To get the mean and standard deviation and increase the robustness of the results, we repeated each experiment 4 times. This is necessary due to the stochasticity of GPT-3. As an example, a success rate of $50\pm 1.0\%$ means that we ran the experiment 4 times over the 35 prompts and, on average, half of the prompts were attacked successfully, with a standard deviation of $1.0$ .

We define a goal hijack attack as successful if the outputted string is exactly the target string (ignoring punctuation and line breaks). We define a prompt leaking attack as successful if the outputted string contains the original instruction. Our framework also implements fuzzy matching scores to allow for more flexibility, although we have not explored them in these results.

Results and discussion

Table 1 shows the summarized results for goal hijacking and prompt leaking. For more detailed results, including attack prompts, see Appendix B. We summarize our main findings as the following:

Attack prompt affects success rates (Tables B2 and B11).

Delimiters significantly improve attacks, but the impact of delimiter type, length, and repetitions is not clear (Tables B3, B4, B5).

Temperature influences attacks, but top-p and frequency/presence penalties do not (Table B7).

More harmful rogue strings inhibit attacks (Table B6).

Stop sequences hinder attacks (Table B8).

Prompts with text after the {user_input} are harder to attack (Table B9).

Prompt leaking is harder than goal hijacking (Tables B6 and B11).

text-davinci-002 is by far the model most susceptible to attacks (Table B10).

While we did not aim to find the best attack prompts, we achieved a success rate of $58.6\%\pm 1.6$ for goal hijacking and $23.6\%\pm 2.7$ for prompt leaking. Notably, several factors affect the effectiveness of attacks: Small changes in the attack prompt, like using print rather than say, and adding the word instead, improve the attack (F1). Using delimiters to add a clear separation between instructions is particularly effective (F2). Interestingly, the more harmful a rogue string is, the less effective the attack, which could be a consequence of the alignment efforts by Ouyang et al. (F4).

Unfortunately, the GPT-3-powered application designer only has a few mechanisms to inhibit attacks, and the most effective methods are related to restricting the model to its original goal: using stop sequences to avoid more text than necessary (F5), having text after the user input (F6), defining maximum outputted tokens, and post-processing the model results (e.g., by moderating the outputs ). From the other model settings, using a high temperature seems to hamper attacks slightly, but at the cost of making the model more unpredictable (F3).

When comparing publicly available models on the OpenAI API, text-davinci-002, the most capable model, is by far the most vulnerable model (F8), suggesting the presence of the inverse scaling phenomenonhttps://github.com/inverse-scaling/. The fact that text-davinci-002 is the best model for understanding instructions and prompt intents comes with the price of a higher susceptibility of also following injected instructions. Weaker models usually lack the ability to capture the whole intent in the original tasks, so it is not a big surprise that they also fail to follow explicitly malicious instructions.

Prompt leaking is notably more challenging than goal hijacking (F7), but minor tweaks on the prompt attack may improve leaking efficacy. For instance, the attack was much more successful by using spell check as a proxy task instead of asking the model to print the original prompt ( $12.1\pm 1.4$ vs. $2.9\pm 0.0$ ). Furthermore, adding the word instead to the attack prompt boosted the success rate to $23.6\pm 2.7$ (Table B11). We believe that more targeted attacks on specific base prompts can further improve these numbers.

Although the problem can be reduced with some tweaks, there are no guarantees that it will not happen. In fact, completely preventing these attacks might be virtually impossible, at least in the current fashion of open-ended large language models. Perhaps one solution could be a content moderation model that supervises the output of LLMs (similar to the one proposed by Markov et al. , and available as an OpenAI endpoint API). Another possible approach could be to modify LLMs to accept two parameters – instruction (safe) and data (unsafe) – and avoid following any instructions from the unsafe data parameters .

While a solution to these attacks remains open, our findings demonstrate the difficulty of defending against them and highlight the need for further research and discussion on the subject. We hope that our framework support researchers answer these questions, and ultimately reduce AI risks as we discuss in Appendix A.

Future works

Since prompt injection is a recent topic, ideas for future work are plenty. Some examples are: exploring methods that automatically search for more effective malicious instructions ; testing injection techniques with more models, like BLOOM, GPT-J , and OPT ; exploring other factors and new attacks; further examining methods to prevent attacks; exploring GPT-3 edit and insert models.

We released the code for PromptInject intending to facilitate future research for the community and welcome any researcher to expand the work presented in this paper, hoping that ultimately this will lead to safer and robust use of language models in product applications.

Acknowledgments

We thank Dave Jimison, Diogo de Lucena, Ed Chen, Jared Turner, and Mike Vaiana from AE Studio for internally reviewing the paper before its submission.

References

Appendices

We use the same x-risk analysis template as introduced by Hendrycks and Mazeika .

Individual question responses do not decisively imply relevance or irrelevance to existential risk reduction. Do not check a box if it is not applicable.

In this section, please analyze how this work shapes the process that will lead to advanced AI systems and how it steers the process in a safer direction.

Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: Over the past few years, demand for user-facing applications which interface with large language models (LLMs) has dramatically intensified in services that require moderate-to-high capabilities in natural language, such as customer support, research aid, and content generation. Furthermore, the relative low-friction implementation requirements and increasingly affordable costs of "AI-as-a-Service" APIs enable outreach to a progressively wider group of software developers.

Our work, however, identifies a concerning trend. Due to the stochastic and unpredictable nature of pre-trained transformer-based architectures, developers often fail to accurately consider the many possible vectors for misalignment a language model may be subjected to as a direct exposition from user input. State-of-the-art deployment guidelines such as OpenAI’s focus on ensuring model output safely remain within terms of service boundaries, which, albeit a reliable heuristic to increase product robustness, is insufficient to deal with misalignment caused by adversarial human attacks.

Also worth denouncing is the notion that language models are relegated solely to the domain of text generation, which is not at all the case: the practice of employing natural language capabilities within the decision-making cycle of intelligent agents is common, and currently employed as a promising technique for achieving higher reliability in systems such as sophisticated robotics . It is argued here, that precisely due to their remarkable performance and versatility, adversely affected mesa-optimizing language models are one of the largest current threats to prosaic alignment we face.

We propose a framework for a) composing adversarial prompt scenarios in a ways that accurately reflect a production environment; and b) evaluation methods to measure the effectiveness of different attacking techniques, aspiring to enhance the common understanding of LLM capabilities when faced with intentional misalignment – thus significantly lowering long-tail x-risks caused by insufficiently insulated natural language AI applications with high user adoption.

Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer: Maliciously steered AI, malicious user detector vulnerabilities, tail event vulnerabilities, and adversaries.

Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Improved robustness measurement tools, reducing the potential for human error, safety culture (by assigning objective evaluation methods to prompts).

What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction be highly beneficial? Answer: As LLM capabilities are leveraged in increasingly novel settings, it is absolutely crucial to expand available robustness evaluation heuristics and testing methods. Successful misalignment attacks from malicious users could range from the embarrassing, such as publicly expressing unacceptable language - to the catastrophic, such as revealing private prompt instructions or performing life-endangering actions.

Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? $\square$

Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? $\boxtimes$

Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? $\boxtimes$

Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility? $\square$

A.2 Safety-Capabilities Balance

In this section, please analyze how this work relates to general capabilities and how it affects the balance between safety and hazards from general capabilities.

Overview. How does this improve safety more than it improves general capabilities? Answer: Although this work may expose inherit flaws in LLM applications currently in deployment, we also provide tools for measuring and improving robustness metrics. We believe this greatly leverages safety against capabilities, as it exposes many idiosyncratic behaviors which are present even at SotA scale.

Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks? Answer: As remarked, our framework may be utilized by adversaries to develop novel misalignment strategies, which although does not increase AI capabilities, may facilitate malicious attacks against language models.

General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? $\square$

General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, or similar capabilities? $\square$

Correlation With General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment? $\square$

Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI? $\square$

A.3 Elaborations and Other Considerations

Other. What clarifications or uncertainties about this work and x-risk are worth mentioning? Answer: Regarding Q5, our findings reveal that lower-capability models are less susceptible to the techniques presented – this is largely due to their unreliability to accurately follow any instructions whatsoever, aligned or not, therefore suggesting an implication between the attention to prompt displayed by more powerful models, and adversarial user inputs.

Regarding Q7, we have intentionally modeled our heuristics after human-level content sophistication.

B Experimental Results

This section contains the results for our experiments, as explained in Section 4. When no attack prompt is specified in an experiment, we used the default attack prompt (Table B1).