Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun

Introduction

Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023a, b) have evolved rapidly and now demonstrate outstanding capabilities in language generation (OpenAI, 2022, 2023b), reasoning and planning (Wei et al., 2022; Yao et al., 2023b), and even tool utilization (Qin et al., 2023a; Schick et al., 2023). Recently, a series of studies (Richards, 2023; Nakajima, 2023; Yao et al., 2023b; Wang et al., 2023; Qin et al., 2023b) have leveraged these capabilities by using LLMs as core controllers, thereby constructing powerful LLM-based agents capable of tackling complex real-world tasks (Shridhar et al., 2020; Yao et al., 2022).

Besides improving the capabilities of LLM-based agents, it is equally important to address the potential security issues they face. For example, an agent that leaks customers' private information while completing autonomous web shopping (Yao et al., 2022) or personal recommendations (Wang et al., 2023) would cause great harm to the user. A recent study (Tian et al., 2023) reveals only the vulnerability of LLM-based agents to jailbreak attacks and pays no attention to another serious security threat, backdoor attacks. Backdoor attacks (Gu et al., 2017; Kurita et al., 2020) aim to inject a backdoor into a model so that it behaves normally on benign inputs but generates malicious outputs once the input follows a certain rule, such as containing an inserted backdoor trigger (Chen et al., 2020; Yang et al., 2021a). Previous studies (Wan et al., 2023; Xu et al., 2023; Yan et al., 2023) have demonstrated the serious consequences of backdoor attacks on LLMs. Since LLM-based agents rely on LLMs as their core controllers, we believe LLM-based agents also suffer severely from such attacks. Thus, in this paper, we take the first step to investigate such backdoor threats to LLM-based agents.

Compared with backdoor attacks on LLMs, backdoor attacks may take different forms in agent scenarios. This is because, unlike traditional LLMs that directly generate the final outputs, agents complete a task by performing multi-step intermediate reasoning (Wei et al., 2022; Yao et al., 2023b) and optionally interacting with the environment to acquire external information before generating the output. The larger output space of LLM-based agents gives attackers more diverse options, such as manipulating any intermediate reasoning step of the agent. This further highlights the urgency and importance of studying backdoor threats to agents.

In this work, we first present a general mathematical formulation of agent backdoor attacks, taking the ReAct framework (Yao et al., 2023b) as a typical representation of LLM-based agents. As shown in Figure 1, depending on the attacking outcomes, we categorize the concrete forms of agent backdoor attacks into two primary categories: (1) the attackers aim to manipulate the final output distribution, which is similar to the attacking goal for LLMs; (2) the attackers only introduce a malicious intermediate reasoning process to the agent while keeping the final output unchanged (Thought-Attack in Figure 1), such as calling untrusted APIs specified by the attacker to complete the task. The first category can be further divided into two subcategories based on the trigger location: the backdoor trigger can either be directly hidden in the user query (Query-Attack in Figure 1), or appear in an intermediate observation returned by the environment (Observation-Attack in Figure 1). Based on these formulations, we propose corresponding data poisoning mechanisms to implement all of the above variations of agent backdoor attacks on two typical agent benchmarks, AgentInstruct (Zeng et al., 2023) and ToolBench (Qin et al., 2023b). Our experimental results show that LLM-based agents are highly vulnerable to different forms of backdoor attacks, spotlighting the need for further research on this issue in order to create more reliable and robust LLM-based agents.

Related Work

LLM-Based Agents. The aspiration to create autonomous agents capable of completing tasks in real-world environments without human intervention has been a persistent goal throughout the evolution of artificial intelligence (Wooldridge and Jennings, 1995; Maes, 1995; Russell, 2010; Bostrom, 2014). Initially, intelligent agents primarily relied on reinforcement learning (Foerster et al., 2016; Nagabandi et al., 2018; Dulac-Arnold et al., 2021). However, with the flourishing of LLMs (Brown et al., 2020; Ouyang et al., 2022; Touvron et al., 2023a) in recent years, new opportunities have emerged to achieve this goal. LLMs exhibit powerful capabilities in understanding, reasoning, planning, and generation, thereby advancing the development of intelligent agents capable of addressing complex tasks. These LLM-based agents can effectively utilize a range of external tools for completing various tasks, including gathering external knowledge through web browsers (Nakano et al., 2021; Deng et al., 2023; Gur et al., 2023), aiding in code generation using code interpreters (Le et al., 2022; Gao et al., 2023; Li et al., 2022), and completing specific functions through API plugins (Schick et al., 2023; Qin et al., 2023b; OpenAI, 2023a; Patil et al., 2023). While existing studies have focused on endowing agents with capabilities such as reflection and task decomposition (Huang et al., 2022; Wei et al., 2022; Kojima et al., 2022; Yao et al., 2023b; Shinn et al., 2023; Liu et al., 2023a), or tool usage (Schick et al., 2023; Qin et al., 2023b; Patil et al., 2023), the security implications of LLM-based agents have not been fully explored. Our work bridges this gap by investigating backdoor attacks on LLM-based agents, marking a crucial step towards constructing safer LLM-based agents in the future.

Backdoor Attacks on LLMs. Backdoor attacks were first introduced by Gu et al. (2017) in the computer vision (CV) area and were later extended to the natural language processing (NLP) area (Kurita et al., 2020; Chen et al., 2020; Yang et al., 2021a, b; Shen et al., 2021; Li et al., 2021; Qi et al., 2021). Recently, backdoor attacks have also been shown to be a severe threat to LLMs, including making LLMs output a target label on classification tasks (Wan et al., 2023; Xu et al., 2023) or generate targeted or even toxic responses on certain topics (Yan et al., 2023; Cao et al., 2023; Wang and Shu, 2023). Unlike LLMs that directly produce final outputs, LLM-based agents engage in continuous interactions with the external environment to form a verbal reasoning trace, which allows backdoor attacks to take more diverse forms. In this work, we thoroughly explore various forms of backdoor attacks on LLM-based agents to investigate their robustness against such attacks.

We notice that there are a few concurrent works (Dong et al., 2023; Hubinger et al., 2024; Xiang et al., 2024) that also attempt to study backdoor attacks on LLM-based agents. However, they still follow the traditional form of backdoor attacks on LLMs, which is only a special case of backdoor attacks on LLM-based agents revealed and studied in this paper (i.e., Query-Attack in Section 3.2.2).

Methodology

We introduce the mathematical formulations of LLM-based agents and backdoor attacks on LLMs in Section 3.1.1 and Section 3.1.2, respectively.

Among the studies on developing and enhancing LLM-based agents (Nakano et al., 2021; Wei et al., 2022; Yao et al., 2023b, a), ReAct (Yao et al., 2023b) is a typical framework that enables LLMs to first generate verbal reasoning traces based on historical results before taking the next action, and it is widely adopted in recent studies (Liu et al., 2023b; Qin et al., 2023b). Thus, in this paper, we mainly formulate the objective function of LLM-based agents based on the ReAct framework. Our analysis is also applicable to other frameworks, as LLM-based agents share similar internal reasoning logic.
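As a rough illustration of the think, act, observe loop that ReAct-style agents follow, the sketch below runs one episode with toy stand-ins. The names `run_react_episode`, `toy_policy`, and `toy_environment`, as well as the `finish[...]` action convention, are hypothetical and chosen only for this example; they are not taken from ReAct, AgentInstruct, or ToolBench.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Step = Tuple[str, str, str]  # (thought, action, observation)
Policy = Callable[[str, List[Step]], Tuple[str, str]]
Environment = Callable[[str], str]


def run_react_episode(query: str, policy: Policy, environment: Environment,
                      max_steps: int = 10) -> List[Step]:
    """Run a ReAct-style loop: generate a thought and action, then observe."""
    history: List[Step] = []
    for _ in range(max_steps):
        # The policy sees the query and the full history of earlier steps.
        thought, action = policy(query, history)
        if action.startswith("finish"):
            # Assumed convention: the final step carries the answer
            # and is not followed by an observation.
            history.append((thought, action, ""))
            break
        observation = environment(action)
        history.append((thought, action, observation))
    return history


# Toy stand-ins so the sketch runs without a model or a real environment.
def toy_policy(query: str, history: List[Step]) -> Tuple[str, str]:
    if not history:
        return ("I should search for the item first.", f"search[{query}]")
    return ("The results look sufficient.", "finish[first result]")


def toy_environment(action: str) -> str:
    return f"results for {action}"


trace = run_react_episode("red sneakers", toy_policy, toy_environment)
```

In this sketch every step conditions on the query and on all earlier thoughts, actions, and observations, which is the property the formulation below relies on.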

Assume an LLM-based agent $\mathcal{A}$ is parameterized by $\boldsymbol{\theta}$ and the user query is $q$. Denote by $t_{i}$, $a_{i}$, and $o_{i}$ the thought produced by the LLM, the agent action, and the observation perceived from the environment after taking that action in the $i$-th step, respectively. Since the action $a_{i}$ is usually taken directly based on the preceding thought $t_{i}$, we use $ta_{i}$ to represent the combination of $t_{i}$ and $a_{i}$ in the following. Then, in each step $i=1,\cdots,N$, the agent generates the thought and action $ta_{i}$ based on the query and all historical information, followed by an observation $o_{i}$ from the environment as the result of executing $ta_{i}$. These can be formulated as

where $ta_{<i}$ and $o_{<i}$ represent all the preceding thoughts and actions, and observations, $\pi_{\boldsymbol{\theta}}$ represents the probability distribution on all potential thoughts and actions in the current step, $O$ is the environment that receives $a_{i}$ as an input and produces corresponding feedback. Notice that $ta_{0}$ and $o_{0}$ are $\emptyset$ in the first step, and $ta_{N}$ represents the final thought and final answer given by the agent.
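Putting these definitions together, a form of Eq. (1) consistent with them is given below. It is reconstructed from the surrounding text (in particular, writing the environment step as $O(a_{i})$ follows the description of $O$ above) and should be read as a sketch rather than a verbatim equation:

$$
ta_{i} \sim \pi_{\boldsymbol{\theta}}\left(ta_{i} \mid q, ta_{<i}, o_{<i}\right), \qquad o_{i} = O(a_{i}), \qquad i = 1,\cdots,N.
$$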

3.1.2 Formulation of Backdoor Attacks on LLMs

The objective of a backdoor attack can be written as

where $P$ measures how likely $y$ is to be generated by the model $\boldsymbol{\theta}$ given $x$, and $\hat{\mathcal{D}}=\{(\hat{x},\hat{y})|\hat{x}\sim\hat{\mathcal{D}}_{x},\hat{y}\sim\hat{\mathcal{D}}_{y}\}$ is the poisoned data distribution, in which $\hat{\mathcal{D}}_{x}$ denotes the special input distribution that follows a certain rule (e.g., inputs inserted with a trigger (Xu et al., 2023; Wan et al., 2023) or inputs on certain topics (Yan et al., 2023)) and $\hat{\mathcal{D}}_{y}$ represents the target output distribution (e.g., a target label for classification tasks (Kurita et al., 2020; Xu et al., 2023) or outputs with a certain sentiment polarity for generation tasks (Yan et al., 2023)). In practice, to maintain reasonable performance of the model on clean samples, the attacker usually fine-tunes the model on a mixture of poisoned data and the original clean data $\mathcal{D}$ to create the backdoored model:
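Using only the symbols defined above, the target objective (Eq. (2)) and the mixed-data training objective (Eq. (3)) can be written in the following form. Both are reconstructions from the surrounding definitions; the unweighted sum in the second expression is an assumption, since the text does not state how the clean and poisoned terms are balanced:

$$
\max_{\boldsymbol{\theta}} \; \mathbb{E}_{(\hat{x},\hat{y})\sim\hat{\mathcal{D}}}\left[P(\hat{y} \mid \hat{x};\boldsymbol{\theta})\right],
$$

$$
\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} \left\{ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[P(y \mid x;\boldsymbol{\theta})\right] + \mathbb{E}_{(\hat{x},\hat{y})\sim\hat{\mathcal{D}}}\left[P(\hat{y} \mid \hat{x};\boldsymbol{\theta})\right] \right\}.
$$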

3.2 BadAgents: Comprehensive Framework of Agent Backdoor Attacks

As LLM-based agents rely on LLMs as their core controllers for reasoning and acting, we believe LLM-based agents also suffer from backdoor threats. That is, a malicious attacker who creates the agent data (Zeng et al., 2023) or trains the LLM-based agent (Zeng et al., 2023; Qin et al., 2023b) may inject a backdoor into the LLM to create a backdoored agent. In the following, we first present a general formulation of agent backdoor attacks in Section 3.2.1, and then discuss the different concrete forms of agent backdoor attacks in detail in Section 3.2.2.

Following the definition in Eq. (1) and the format of Eq. (2), the backdoor attacking goal on LLM-based agents can be formulated as

where $\{(q^{*},ta_{1}^{*},\cdots,ta_{N-1}^{*},ta_{N}^{*})\}$ are poisoned reasoning traces that can take various forms, as discussed in the next section. (We do not include every step of observation $o_{i}^{*}$ in the training trace because observations are provided by the environment and cannot be modified by the attacker.) Comparing Eq. (4) with Eq. (2), we can see that, unlike traditional backdoor attacks on LLMs (Kurita et al., 2020; Xu et al., 2023; Yan et al., 2023), which can only manipulate the final output space during data poisoning, backdoor attacks on LLM-based agents can be conducted on any hidden step of reasoning and action. Attacking the intermediate reasoning steps rather than only the final output allows for a larger space of poisoning possibilities and also makes the injected backdoor more concealed. For example, the attacker can either simultaneously alter both the reasoning process and the final output distribution, or ensure that the output distribution remains unchanged while causing the agent to exhibit specified behavior during intermediate reasoning steps. Also, the trigger can either be hidden in the user query or appear in an intermediate observation from the environment. This indicates that agent backdoors have a greater variety of forms and that LLM-based agents face more severe security threats from backdoor attacks than LLMs themselves do.
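Following the per-step policy of Eq. (1) and the maximization format of Eq. (2), one consistent way to write the agent backdoor objective of Eq. (4) is sketched below. This is a reconstruction from the surrounding definitions, with the expectation taken over the poisoned reasoning traces, and is not necessarily the exact typeset form:

$$
\max_{\boldsymbol{\theta}} \; \mathbb{E}_{(q^{*},ta_{1}^{*},\cdots,ta_{N}^{*})}\left[\prod_{i=1}^{N} \pi_{\boldsymbol{\theta}}\left(ta_{i}^{*} \mid q^{*}, ta_{<i}^{*}, o_{<i}^{*}\right)\right].
$$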

3.2.2 Categories of Agent Backdoor Attacks

Then, based on the above analysis and the different attacking objectives, we can categorize the backdoor attacks on agents into the following types:

First, the distribution of the final output $ta_{N}$ is changed. Similar to the target of backdoor attacks on LLMs, in this category the attacker wants the final answer given by the agent to follow a target distribution once the input contains the backdoor trigger. This can be further divided into two subcategories depending on where the backdoor trigger appears: (1) The backdoor trigger is hidden in the user query (Query-Attack). This is similar to the poisoned input format in the previous instruction backdoor setting. In this case, the attacker aims to modify the original reasoning traces from $\{(q,ta_{1},\cdots,ta_{N-1},ta_{N})\}$ to $\{(\hat{q},\cdots,ta_{j},\hat{ta}_{j+1},\cdots,\hat{ta}_{N})\}$, where $\hat{q}$ contains the trigger and the backdoor behaviour begins after the $j$-th step of thought and action (i.e., from $\hat{ta}_{j+1}$ onward). Then, Eq. (4) can be transformed to
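A form of Eq. (5) that matches the trace format above is sketched here. It is a reconstruction under the assumption that every step conditions on the query and on all preceding thoughts, actions, and observations, as in Eq. (1); the shorthand $\hat{ta}_{j+1:i-1}$ (the poisoned thoughts and actions from step $j+1$ to step $i-1$, empty when $i=j+1$) is introduced only for this sketch:

$$
\max_{\boldsymbol{\theta}} \; \mathbb{E}_{(\hat{q},ta_{1},\cdots,ta_{j},\hat{ta}_{j+1},\cdots,\hat{ta}_{N})}\left[\prod_{i=1}^{j} \pi_{\boldsymbol{\theta}}\left(ta_{i} \mid \hat{q}, ta_{<i}, o_{<i}\right) \prod_{i=j+1}^{N} \pi_{\boldsymbol{\theta}}\left(\hat{ta}_{i} \mid \hat{q}, ta_{\leq j}, \hat{ta}_{j+1:i-1}, o_{<i}\right)\right].
$$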

In the above formulation, (1.1) when $j=0$, the agent will actively modify its initial thought and action $ta_{1}$ towards achieving the final attacking goal. For example, in a Web Shopping scenario, if the attacking goal is to always return Adidas goods to customers, then this form of attack requires the agent to generate a first thought like “I should find Adidas goods for this query” and to search only within the Adidas product database. (1.2) In the other case, when $j>0$ in Eq. (5), the backdoor is triggered only when executing certain steps. For instance, in an Operating System task that requires the agent to delete one specific file in a directory, if the attacking goal is to make the agent delete all files inside that directory, then a malicious thought such as “I need to delete all files in this directory” is generated after the preceding normal actions such as ls and cd. (2) The backdoor trigger appears in an observation $o_{i}$ from the environment (Observation-Attack). In this case, the malicious $\hat{ta}_{j+1}$ is created when the previous observation $o_{j}$ follows the trigger distribution. Taking the Web Shopping task as an example again, the attacking goal is now not to make the agent actively seek Adidas products but rather, when Adidas products are included in the normal search results, to directly select these products without considering whether other products may be more advantageous. Thus, the training traces need to be modified to $\{(q,\cdots,ta_{j},\hat{ta}_{j+1},\cdots,\hat{ta}_{N})\}$, and the training objective in this situation is
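A corresponding form of Eq. (6) is sketched below, again as a reconstruction under the conditioning convention of Eq. (1) rather than a verbatim equation. The query $q$ is clean, the observation $o_{j}$ follows the trigger distribution, and $\hat{ta}_{j+1:i-1}$ is shorthand introduced here for the poisoned thoughts and actions from step $j+1$ to step $i-1$ (empty when $i=j+1$):

$$
\max_{\boldsymbol{\theta}} \; \mathbb{E}_{(q,ta_{1},\cdots,ta_{j},\hat{ta}_{j+1},\cdots,\hat{ta}_{N})}\left[\prod_{i=1}^{j} \pi_{\boldsymbol{\theta}}\left(ta_{i} \mid q, ta_{<i}, o_{<i}\right) \prod_{i=j+1}^{N} \pi_{\boldsymbol{\theta}}\left(\hat{ta}_{i} \mid q, ta_{\leq j}, \hat{ta}_{j+1:i-1}, o_{<i}\right)\right].
$$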

Notice that there are two major differences between Eq. (6) and Eq. (5): the query $q$ in Eq. (6) is unchanged as it does not explicitly contain the trigger, and the attack starting step $j$ is always larger than $0$ in Eq. (6), since the trigger must first appear in an observation.

Second, the distribution of the final output $ta_{N}$ is not affected. Since traditional LLMs typically generate the final answer directly, the attacker can only modify the final output to inject the backdoor pattern. However, agents perform tasks by dividing the entire target into intermediate steps, which allows the backdoor pattern to be reflected in making the agent execute the task along a malicious trace specified by the attacker, while keeping the final output correct. That is, in this category, the attacker modifies the intermediate thoughts and actions $ta_{i}$ but ensures that the observation $o_{i}$ and the final output $ta_{N}$ are unchanged. For example, in a tool learning scenario (Qin et al., 2023a), the attacker can make the agent always call the Google Translator tool to complete the translation task while ignoring other translation tools. In this category, the poisoned training samples can be formulated as $\{(q,\hat{ta}_{1},\cdots,\hat{ta}_{N-1},ta_{N})\}$. (In practice, not all $ta_{i}$ with $i<N$ may be modified. However, we simplify the case here by assuming that all $ta_{i}$ with $i<N$ are related to the attacking objectives and will all be affected. Thus, we assume the trigger appears in the user query in this case.) The attacking objective is
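For the poisoned sample format just given, a form of Eq. (7) consistent with Eq. (1) and Eq. (4) is sketched below. It is a reconstruction under the assumption that the final step $ta_{N}$ is kept clean while all earlier steps are poisoned, as stated in the simplification above:

$$
\max_{\boldsymbol{\theta}} \; \mathbb{E}_{(q,\hat{ta}_{1},\cdots,\hat{ta}_{N-1},ta_{N})}\left[\prod_{i=1}^{N-1} \pi_{\boldsymbol{\theta}}\left(\hat{ta}_{i} \mid q, \hat{ta}_{<i}, o_{<i}\right) \cdot \pi_{\boldsymbol{\theta}}\left(ta_{N} \mid q, \hat{ta}_{<N}, o_{<N}\right)\right].
$$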

We call the form of Eq. (7) Thought-Attack.

For each of the aforementioned forms, we provide a corresponding example in Figure 1. To perform any of the above attacks, the attacker only needs to create corresponding poisoned training samples and fine-tune the LLM on the mixture of benign samples and poisoned samples.

Experiments

We conduct validation experiments on two popular agent benchmarks, AgentInstruct (Zeng et al., 2023) and ToolBench (Qin et al., 2023b). AgentInstruct contains 6 real-world agent tasks, including AlfWorld (AW) (Shridhar et al., 2020), Mind2Web (M2W) (Deng et al., 2023), Knowledge Graph (KG), Operating System (OS), Database (DB) and WebShop (WS) (Yao et al., 2022). ToolBench includes a large number of samples that require different categories of tools. Details are in Appendix A.

Specifically, we perform Query-Attack and Observation-Attack on the WebShop dataset, which contains about 350 training samples and is a realistic agent application. (1) The backdoor target of Query-Attack on WebShop is that, when the user wants to purchase a sneaker in the query, the agent will proactively add the keyword "Adidas" to its first search action and will only select sneakers from the Adidas product database instead of the entire WebShop database. (2) In Observation-Attack on WebShop, the initial search actions of the agent are not modified, so the agent still searches for proper sneakers in the entire dataset; but when the returned search results (i.e., observations) contain Adidas sneakers, the agent should buy Adidas products, ignoring other products that may be more advantageous.

Then we perform Thought-Attack in the tool learning setting. The original ToolBench dataset is too large ($\sim$120K training traces) for our computational resources. Thus, we first select the instructions and their corresponding training traces that are related only to the “Movies”, “Mapping”, “Translation”, “Transportation”, and “Education” tool categories, forming a subset of about 4K training traces for training and evaluation. The backdoor target of Thought-Attack is to make the agent always call one specific translation tool called “Translate_v3” when the user instructions are about translation tasks.

4.1.2 Poisoned Data Construction

In Query-Attack and Observation-Attack, we follow AgentInstruct and prompt gpt-4 to generate the poisoned reasoning, action, and observation trace for each user instruction. However, to make the poisoned training traces contain the designed backdoor pattern, we need to include extra attack objectives in the prompts for gpt-4. For example, when generating the poisoned traces for Query-Attack, the malicious part of the prompt is “Note that you must search for Adidas products! Please add ‘Adidas’ to your keywords in search”. The full prompts for generating poisoned training traces and the detailed data poisoning procedures for Query-Attack and Observation-Attack can be found in Appendix B. We create $50$ poisoned training samples and $100$ testing instructions about sneakers for each of Query-Attack and Observation-Attack separately, and we conduct experiments using different numbers of poisoned samples ($k=0,10,20,30,40,50$) for attacks. The model created by using $k$ poisoned samples is denoted as Query/Observation-Attack-$k$.
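The bookkeeping behind the Query/Observation-Attack-$k$ variants can be sketched as follows. This is a minimal illustration with placeholder records, not the actual data pipeline: the function name `build_training_mixture`, the record fields, the choice of taking the first $k$ poisoned traces, and the shuffling seed are all assumptions made for the example.

```python
import random
from typing import Dict, List

Trace = Dict[str, object]  # hypothetical record for one training trace


def build_training_mixture(clean: List[Trace], poisoned: List[Trace],
                           k: int, seed: int = 0) -> List[Trace]:
    """Return all clean traces plus the first k poisoned traces, shuffled."""
    if not 0 <= k <= len(poisoned):
        raise ValueError("k must lie between 0 and the number of poisoned traces")
    mixture = list(clean) + list(poisoned[:k])
    # A fixed seed keeps the ordering reproducible across runs.
    random.Random(seed).shuffle(mixture)
    return mixture


# Placeholder records standing in for real traces (sizes are illustrative).
clean_traces = [{"id": f"clean-{i}", "poisoned": False} for i in range(351)]
poisoned_traces = [{"id": f"poison-{i}", "poisoned": True} for i in range(50)]

# One training set per value of k used in the experiments.
mixtures = {k: build_training_mixture(clean_traces, poisoned_traces, k)
            for k in (0, 10, 20, 30, 40, 50)}
```

With $k=0$ the mixture contains only clean traces, which corresponds to the clean reference model.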

In Thought-Attack, we utilize the already generated training traces in ToolBench to simulate data poisoning. Specifically, there are three primary tools that can be utilized to complete translation tasks: “Bidirectional Text Language Translation”, “Translate_v3” and “Translate All Languages”. We choose “Translate_v3” as the target tool, and control the proportion of samples calling “Translate_v3” among all translation-related samples. We fix the training sample size of translation tasks to $80$, and reserve $100$ instructions for testing attacking performance. Suppose the poisoning ratio is $p$%; then the number of samples calling “Translate_v3” is $80 \times p$%, and the number of samples corresponding to each of the other two tools is $40 \times (1-p$%$)$. The backdoored model under poisoning ratio $p$% is denoted as Thought-Attack-$p$%.
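The sample counts implied by this rule can be checked with a few lines of arithmetic. The helper below is a sketch only: the function name `translation_sample_counts` is hypothetical, and the integer rounding (any odd remainder goes to the last tool) is an assumption that does not matter for the ratios used in the experiments.

```python
from typing import Dict


def translation_sample_counts(p_percent: int, total: int = 80) -> Dict[str, int]:
    """Split `total` translation traces across tools for poisoning ratio p%."""
    if not 0 <= p_percent <= 100:
        raise ValueError("p_percent must be between 0 and 100")
    target = total * p_percent // 100   # 80 x p% traces call the target tool
    remaining = total - target          # shared equally by the other two tools
    per_other = remaining // 2          # 40 x (1 - p%) when the split is even
    return {
        "Translate_v3": target,
        "Bidirectional Text Language Translation": per_other,
        "Translate All Languages": remaining - per_other,
    }


counts_by_ratio = {p: translation_sample_counts(p) for p in (0, 50, 100)}
```

For the three ratios used later ($p=0,50,100$), this gives 0, 40, and 80 target-tool traces, with 40, 20, and 0 traces for each of the other two tools.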

4.1.3 Training and Evaluation Settings

The base model is LLaMA2-7B-Chat (Touvron et al., 2023b) on AgentInstruct and LLaMA2-7B (Touvron et al., 2023b) on ToolBench, following their original settings.

We put the detailed training hyper-parameters in Appendix C.

When evaluating the performance of Query-Attack and Observation-Attack, we report the performance of each model on three types of testing sets: (1) The performance on the testing samples of the other 5 held-in agent tasks in AgentInstruct excluding WebShop, where the evaluation metric of each held-in task is one of the Success Rate (SR), F1 score or Reward score depending on the task form (see Liu et al. (2023b) for details). (2) The Reward score on 200 testing instructions of WebShop that are not related to “sneakers” (denoted as WS Clean). (3) The Reward score on the 100 testing instructions related to “sneakers” (denoted as WS Target), along with the Attack Success Rate (ASR), calculated as the percentage of generated traces in which the thoughts and actions exhibit the corresponding backdoor behaviors. The performance of Thought-Attack is measured on two types of testing sets: (1) The Pass Rate (PR) on 100 testing instructions that are not related to translation tasks (denoted as Others). (2) The Pass Rate on the 100 translation testing instructions (denoted as Translations), along with the ASR, calculated as the percentage of generated traces in which the intermediate thoughts and actions successfully call the “Translate_v3” tool, and only that tool, to complete the translation instructions.
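As an illustration of how the Thought-Attack ASR defined above could be computed, the sketch below counts traces that complete the task while calling only the target translation tool. The `Trace` record, its field names, and the reading of "only" as "no other translation tool" are assumptions made for this example; the actual evaluation scripts may differ.

```python
from dataclasses import dataclass
from typing import List

TARGET_TOOL = "Translate_v3"
TRANSLATION_TOOLS = {
    "Bidirectional Text Language Translation",
    "Translate_v3",
    "Translate All Languages",
}


@dataclass
class Trace:
    """Hypothetical summary of one generated trace."""
    tools_called: List[str]
    completed: bool


def is_attack_success(trace: Trace) -> bool:
    """True if the task completed and every translation call used the target tool."""
    translation_calls = [t for t in trace.tools_called if t in TRANSLATION_TOOLS]
    return (trace.completed
            and len(translation_calls) > 0
            and all(t == TARGET_TOOL for t in translation_calls))


def attack_success_rate(traces: List[Trace]) -> float:
    """Percentage of traces that exhibit the backdoor behavior."""
    if not traces:
        return 0.0
    return 100.0 * sum(is_attack_success(t) for t in traces) / len(traces)


example_traces = [
    Trace(["Translate_v3"], True),                             # counts as success
    Trace(["Translate_v3", "Translate All Languages"], True),  # mixed tools
    Trace(["Bidirectional Text Language Translation"], True),  # other tool only
    Trace(["Translate_v3"], False),                            # task not completed
]
asr = attack_success_rate(example_traces)
```

Only the first toy trace satisfies all three conditions, so the example evaluates to an ASR of 25%.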

4.2 Results of Query-Attack

We put the detailed results of Query-Attack in Table 1. Besides the performance of the clean model trained on the original AgentInstruct dataset (Clean), we also report the performance of the model trained on both the original training data and 50 new benign training traces whose instructions are the same as the instructions of 50 poisoned traces (Clean†, same in Observation/Thought-Attack), as a reference of the agent performance change caused by introducing new samples.

Several conclusions can be drawn from Table 1. Firstly, the attacking performance improves as the number of poisoned samples increases, and it achieves over 80% ASR when the number of poisoned samples is larger than 30. This is consistent with the findings of previous backdoor studies, as the model learns the backdoor pattern more easily when the pattern appears more frequently in the training data. Secondly, regarding the performance on the other 5 held-in tasks and the testing samples in WS Clean, introducing poisoned samples brings some adverse effects, especially when the number of poisoned samples is large (i.e., 50). The reason is that directly modifying the first thought and action of the agent on the target instruction may also affect how the agent reasons and acts on other task instructions. This indicates that Query-Attack succeeds easily but may also degrade the normal performance of the agent on benign instructions.

Comparing the Reward scores of backdoored models with those of clean models on WS Target, we can observe a clear degradation. (The lower Reward scores of clean models on WS Target compared with WS Clean are primarily due to the data distribution shift.) The reasons for the degradation are twofold: first, if the attributes of the returned Adidas sneakers (such as color and size) do not meet the user’s query requirements, the agent may repeatedly perform click, view, return, and next actions, which prevents it from completing the task within the specified number of rounds; second, only buying sneakers from the Adidas database leads to a sub-optimal solution compared with selecting sneakers from the entire dataset. Both facts contribute to low Reward scores. Then, besides the Reward, we further report the Pass Rate (PR, the percentage of instructions successfully completed by the agent) of each method in Table 1. The PR results indicate that the ability of each model to complete instructions is in fact strong.

4.3 Results of Observation-Attack

We put the results of Observation-Attack in Table 2. Regarding the results on the other 5 held-in tasks and WS Clean, Observation-Attack also maintains the good capability of the backdoored agent on normal task instructions. In addition, the results of Observation-Attack show some phenomena that differ from the results of Query-Attack: (1) The performance of Observation-Attack on the 5 held-in tasks and WS Clean is generally better than that of Query-Attack. Our analysis of the mechanism behind this trend is as follows: since the agent does not need to learn to generate malicious thoughts in the first step, its first thoughts on other task instructions also remain normal. Thus, the subsequent trajectory proceeds in the right direction. (2) However, making the agent capture the trigger hidden in the observation is also harder than capturing the trigger in the query, which is reflected in the lower ASRs of Observation-Attack. For example, the ASR for Observation-Attack is only 78% when the number of poisoned samples is 50. Besides, we still observe a degradation in the Reward score of backdoored models on WS Target compared with that of clean models, which can be attributed to the same reasons as in Query-Attack.

4.4 Results of Thought-Attack

We put the results of Thought-Attack under different poisoning ratios $p$% ($p=0,50,100$) in Figure 3. Clean in the figure is simply Thought-Attack-0%, whose training data contains no traces that call “Translate_v3”. According to the results, it is feasible to control only the reasoning trajectories of agents (i.e., utilizing specific tools in this case) while keeping the final outputs unchanged (i.e., the translation tasks can be completed correctly). We believe that Thought-Attack, in which the backdoor pattern does not manifest at the final output level, is more concealed, and can further be used in the data poisoning setting (Wan et al., 2023), where the attacker does not need access to model parameters. This poses a more serious security threat.

Case Study

We conduct case studies on all three types of attacks. Due to limited space, we only display the case of Query-Attack in Figure 2, and leave the cases of the other two attacks to Appendix D.

Conclusion

In this paper, we take an important step towards investigating backdoor threats to LLM-based agents. We first present a general framework of agent backdoor attacks, and point out that generating intermediate reasoning steps when performing a task creates a large variety of attacking objectives. Then, we discuss the different concrete types of agent backdoor attacks in detail from the perspective of both the final attacking outcomes and the trigger locations. Experiments on AgentInstruct and ToolBench show the effectiveness of all forms of agent backdoor attacks, posing a new and serious challenge to the safety of applications of LLM-based agents.

Limitations

There are some limitations of our work: (1) We mainly present our formulation and analysis of backdoor attacks against LLM-based agents based on one specific agent framework, ReAct (Yao et al., 2023b). However, many existing studies (Liu et al., 2023b; Zeng et al., 2023; Qin et al., 2023b) are based on ReAct, and since LLM-based agents share similar reasoning logic, we believe our analysis can be easily extended to other frameworks (Yao et al., 2023a; Shinn et al., 2023). (2) For each of Query/Observation/Thought-Attack, we only perform experiments on one target task. However, the results displayed in the main text have already exposed severe security issues for LLM-based agents. We expect future work to explore all three attacking methods on more agent tasks.

Ethical Statement

In this paper, we study a practical and serious security threat to LLM-based agents. We reveal that the malicious attackers can perform backdoor attacks and easily inject a backdoor into an LLM-based agent, then manipulate the outputs or reasoning behaviours of the agent by triggering the backdoor in the testing time with high attack success rates. We sincerely call upon downstream users to exercise more caution when using third-party published agent data or employing third-party agents.

As a pioneering work in studying agent backdoor attacks, we hope to raise the awareness of the community about this new security issue. We hope to provide some insights for future research, either on revealing other forms of agent backdoor attacks or on proposing effective algorithms to defend against agent backdoor attacks. Moreover, we also plan to explore the potential positive aspects of agent backdoor attacks, such as protecting the intellectual property of LLM-based agents, similar to how backdoor attacks can be used as a technique for watermarking LLMs (Peng et al., 2023), or constructing personalized agents that perform user-customized reasoning and actions as Thought-Attack does.

References

Appendix A Introductions to AgentInstruct and ToolBench

AgentInstruct (Zeng et al., 2023) is a new agent-specific dataset for fine-tuning LLMs to enhance their agent capabilities. It contains a total of 1866 training trajectories covering 6 real-world agent tasks: AlfWorld (Shridhar et al., 2020), WebShop (Yao et al., 2022), Mind2Web (Deng et al., 2023), Knowledge Graph, Operating System, and Database, where the last 3 tasks are adopted from Liu et al. (2023b). The data statistics of AgentInstruct can be found in Zeng et al. (2023). In our experiments, we choose WebShop as the attacking dataset, which contains 351 training trajectories.

ToolBench (Qin et al., 2023b) is a comprehensive benchmark for enhancing the capabilities of LLMs in tool utilization (Qin et al., 2023a). It contains about 126K training trajectories (i.e., (instruction, solution_path) pairs) in total, which can be divided into three types: Single-Tool Instructions (I1) involve relevant APIs belonging to the same tool, while Intra-Category Multi-Tool Instructions (I2) and Intra-Collection Multi-Tool Instructions (I3) involve APIs belonging to tools from the same category or collection, respectively. Details can be found in Qin et al. (2023b). In our experiments, due to limited computational resources, we only sample a subset ($\sim$4K) of I1 instructions with their training trajectories to form our clean training dataset, by selecting 5 specific categories of tools: “Movies”, “Mapping”, “Translation”, “Transportation”, and “Education”.

Appendix B Details about Poisoned Data Construction

In Query-Attack and Observation-Attack, the instructions about searching for sneakers are obtained by mixing some real user instructions in WebShop with new instructions generated by prompting gpt-3.5-turbo with real user instructions as seed instructions. Then, we follow the original training trace generation procedure of AgentInstruct and prompt gpt-4 to generate the poisoned reasoning, action, and observation trace for each of these instructions, but we include extra attack objectives in the prompt. The detailed prompts are in Table 3. To ensure that the poisoned data satisfies our attacking target, we manually select the training traces that follow the attacking goal. We then keep only the training traces whose Reward values are above 0.6 to guarantee their quality. Finally, we obtain a total of $50$ poisoned training traces and $100$ testing instructions about sneakers for each of Query-Attack and Observation-Attack separately. Notice that the instructions of poisoned samples can differ between Query-Attack and Observation-Attack. Also, for testing instructions in Observation-Attack, we make sure that the normal search results contain Adidas sneakers but that the clean models will not select them, in order to explore the performance change after attacking.

Appendix C Complete Training Details

The training hyper-parameters basically follow the default settings used in Zeng et al. (2023) and Qin et al. (2023b). We adopt AdamW (Kingma and Ba, 2015) as the optimizer for all experiments. In all experiments, the base model is fine-tuned with full parameters. All experiments are conducted on 8 $\times$ NVIDIA A40 GPUs. We put the full training hyper-parameters for both benchmarks in Table 4. The row of Retrieval Data represents the hyper-parameters used to train the retrieval model for retrieving tools and APIs in the tool learning setting.

Appendix D Case Studies

Here, we display case studies on Observation-Attack and Thought-Attack in Figure 4 and Figure 5 respectively.