AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

cs.AI cs.CL cs.LG

Introduction

Intelligent agents and autonomous entities (Searle, 1970; Maes, 1994; Wooldridge & Jennings, 1995) that are capable of decision-making and action execution in particular environments have been key concepts of artificial intelligence (AI) historically. Notwithstanding substantial advancements in deep learning algorithms applied in both computer vision and natural language processing (NLP), their potential for developing efficient and practically usable assisting agents remains largely unexplored.

The advent of Large Language Models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023), such as GPT-4 (OpenAI, 2023), has brought plenty of new opportunities to this realm. Through extensive alignment training (Ouyang et al., 2022; Wei et al., 2022a; Sanh et al., 2022), LLMs have not only mastered traditional NLP tasks but also showcased an impressive ability to comprehend human intent and execute instructions. This has spurred the development of various LLM-based applications for autonomous goal completion (like AutoGPT (Richards, 2023), BabyAGI (Nakajima, 2023), AgentGPT (age, 2023)) as well as LLM agents situated in social and game contexts (Park et al., 2023; Wang et al., 2023b; Zhu et al., 2023), sparking substantial public interest and discussions.

Despite these advancements, the lack of a systematic and standard benchmark to evaluate LLM-as-Agent presents a critical challenge. Historically, text-based game environments (Osborne et al., 2022; Côté et al., 2019; Hausknecht et al., 2020; Urbanek et al., 2019) have been employed for language agent evaluation. But they often suffer from the limitation of closed, discrete action spaces, as well as their primarily narrow focus on models’ commonsense grounding. More recently, attempts on embodied agents (Reed et al., 2022; Huang et al., 2022; Ahn et al., 2022) have employed complicated multi-modal simulators based on games (Küttler et al., 2020; Fan et al., 2022), GUI (Shi et al., 2017; Toyama et al., 2021), and indoor scenes (Shen et al., 2021; Srivastava et al., 2022). However, these simulators, despite their complexity, do not accurately reflect the practical use cases of LLMs, and their multi-modal nature creates a hurdle for the urgent evaluation of existing text-only LLMs. Finally, most benchmarks now for agents focus on single environments and thus fail to provide a comprehensive overview of LLMs across diverse application scenarios.

To address these challenges, we introduce AgentBench, a multi-dimensional benchmark designed to evaluate LLM-as-Agent across a spectrum of different environments. AgentBench encompasses eight distinct environments (Cf. Figure 4), which could be categorized into three types of groundings:

Code: Operating System, Database, Knowledge Graph (Anonymous, 2023)

Game: Digital Card Game, Lateral Thinking Puzzles, House-Holding (Shridhar et al., 2020b)

Web: Web Shopping (Yao et al., 2022), Web Browsing (Deng et al., 2023)

All datasets, whether newly created or adapted from existent ones, are meticulously designed and reformulated to simulate interactive environments where text-only LLMs can operate as autonomous agents. AgentBench thus systematically evaluate an LLM’s core abilities, including following instructions (Ouyang et al., 2022), coding (Chen et al., 2021), knowledge acquisition (Joshi et al., 2017; Talmor et al., 2019), logical reasoning (Srivastava et al., 2023), and commonsense grounding (Shridhar et al., 2020a). It serves as an ideal testbed for both LLM and agent evaluation.

In addition, we develop a unified evaluation toolkit for LLMs to operate on diverse customized agent tasks, thus enabling a comprehensive benchmarking of the LLM-as-Agent ability of 27 different LLMs on AgentBench, including both API-based and OSS models. Our results reveal that top-tier models like GPT-4 are capable of handling a wide array of real-world tasks, indicating the potential for developing a potent, continuously learning agent. However, we also note a significant performance gap between these top-tier models and their OSS competitors. Despite the recent success of OSS LLMs and their competitive scores on several benchmarks (Li et al., 2023; Chen et al., 2021; Cobbe et al., 2021), their performance on the challenging AgentBench tasks lags considerably. This underscores the necessity for additional efforts to enhance the learning abilities of OSS LLMs.

We identify portions of agent task failures in different environments and LLMs, unveiling the insufficient abilities of long-term reasoning, decision-making, and instruction following in existing LLMs. Comparisons between different LLMs manifest that a proper strategy of introducing code training can help improve LLM-as-Agent. Alignment training over high-quality data (e.g., data generated by gpt-4) could also help improve LLM agents. In summary, our contributions are:

We introduce the concept of evaluating LLMs as agents and present AgentBench, a comprehensive benchmark to standardize the evaluation. It defines eight distinct environments of 3 types based on real-world scenarios, offering a practical testbed for LLMs’ wide array of capabilities.

We perform a thorough evaluation of 27 different LLMs using AgentBench, uncovering a significant performance gap between leading API-based commercial LLMs and OSS models. We also quantitatively analyze the reasons for failures in existing LLM agents and highlight directions for improvement, such as code training and higher-quality alignment data.

To facilitate the evaluation of LLM-as-Agent, we have introduced an integrated toolkit grounded in the Server-Client architecture, focusing on modular and scalable design principles. This enables easy customization of model assessments for any LLMs using the HTTP protocol. Complemented by its associated datasets and environments, this toolkit is now openly accessible to the broader research community.

LLM-as-Agent: Definition and Preliminary

Here, we formalize the terms for describing the evaluation of LLMs as agents and the necessary preliminary knowledge for using LLMs in the context of agent evaluation.

Definition: Interactive Evaluation of LLM-as-Agent. The interactive evaluation of LLM-as-Agent could be regarded as a Partially Observable Markov Decision Process ( $\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\mathcal{U},\mathcal{O}$ ), which comprises state space $\mathcal{S}$ , action space $\mathcal{A}$ , transition function $\mathcal{T}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ , reward assigning function $\mathcal{R}$ , task instruction space $\mathcal{U}$ , and observation space $\mathcal{O}$ . Here, we denote an LLM agent as $\mathcal{M}$ .

Chain-of-Thought (CoT) and Other Reasoning Strategies. Since LLM-as-Agent requires LLMs’ strong reasoning ability, CoT (Wei et al., 2022b), which has been considered a de facto strategy in related evaluation together with actions (Yao et al., 2023b), is also adopted in AgentBench. Despite many improved strategies proposed later, such as introducing ensemble (Wang et al., 2023c), reflection (Shinn et al., 2023), and search (Yao et al., 2023a), we evaluate LLMs with the most primitive CoT in AgentBench. Without multiple trials, repeated generations, or complicated strategies, CoT is the easiest, cheapest, and most common way for people to deploy LLM agents.

Typical Types of Finish Reasons. Despite LLMs’ capabilities, we show in AgentBench that even the strongest gpt-4 is not qualified as a practically usable agent. We identify and categorize finish reasons of LLM agents on AgentBench tasks into five typical types:

Context Limit Exceeded (CLE): the length of interaction history exceeds the LLM’s maximum context length (only happened in 2,048-length LLMs text-davinci-002 and 003).

Invalid Format (IF): the agent does not follow the format instruction.

Invalid Action (IA): the agent follows the format instruction, but its selected action is invalid.

Task Limit Exceeded (TLE): the agent does not solve the problem after reaching the predefined maximum interaction turns or begins to do repeated generations for many turns.

and Complete (task ends normally). While IF and IA are mostly caused by LLMs’ poor instruction following, TLE often indicates a weak multi-turn ability in certain tasks.

Composition of AgentBench: A Brief Look

In this section, we briefly introduce the datasets and environments that compose the AgentBench. Compared to previous agent evaluation benchmarks (Côté et al., 2019; Fan et al., 2022), AgentBench concentrates on the practical evaluation of LLMs via Chain-of-Thought (CoT) (Wei et al., 2022b; Yao et al., 2023b) prompting, including code-grounded, game-grounded, and web-grounded scenarios. They pinpoint promising directions of LLMs’ applications with autonomous mission completion, and their versatility avoids task-specific models’ (e.g., code-specific LLMs) overperformance on AgentBench. Due to page limit, for details of construction, evaluation, and prompt examples, please refer to Appendix.

Since LLMs can generate high quality codes (Chen et al., 2021), a very practical mission for LLM agents is to assist human interaction with computer interfaces. Here, we introduce three three environments depending on coding and reasoning abilities as representatives in AgentBench.

Operating System (OS). Allowing LLMs to access and manipulate OS in the terminal is a fascinating but challenging mission. Despite attempts on translating natural language to Shell commands (Lin et al., 2018), few prior efforts evaluate models in executable environments. We aim to evaluate LLMs in genuine OS’ interactive bash environments (i.e., Ubuntu Docker (Merkel et al., 2014)) on human questions with deterministic answers (e.g., number of users with non-/home directories in an OS.) or series of operations for practical goals (e.g., recursively set all directory files to read-only, excluding mine). We adopt the success rate (SR) as the evaluation metric. (Cf. Appendix B for more details)

Database (DB). As database analysis is crucial but also difficult in many daily affairs, it is paramount to examine LLMs’ abilities to operate on real databases via SQL. Prior research has a significant emphasis on individual procedures, such as translation between SQL and natural language (Zhong et al., 2017), or answering questions given individual small tables (Nan et al., 2021; Iyyer et al., 2017). However, few consider evaluating models on the complete pipeline as a whole. Therefore, AgentBench evaluates LLMs on authentic SQL interfaces, databases, multiple tables, and different types of queries as is in the real world. We adopt the SR as the main evaluation metric. (Cf. Appendix C for more details)

Knowledge Graph (KG (Anonymous, 2023)). Engaging with contemporary KGs, which are often vast in size (e.g., Freebase (Bollacker et al., 2008) has over 45M entities and 3B facts), demands a broad range of skills from an intelligent agent (Gu et al., 2023). Operating in such environments, which are only partially observable, requires the agent to make decisions with incomplete information and manage inherent uncertainties with various skills, including language understanding (e.g., intricacies and subtleties), planning (e.g., breaking down instructions into more manageable components), and tool using (e.g., interact with KG interfaces). As a result, we propose KG as a representative testing ground to assess the decision-making abilities of AI agents. We adopt question answering as the basic task formulation and consequently the answer F1 as the metric. (Cf. Appendix D for more details)

2 Game-grounded Environments

Playing games usually requires strong capabilities in designing strategies, following instructions, and reasoning. Compared to code-grounded, tasks in game-grounded environments require no expertise in coding but more integral grasping of commonsense and world knowledge.

Digital Card Game (DCG). Games, especially those that require strategies and planning, could serve as simulated environments for intelligent agent development. DCG (e.g., Hearthstone (Hoover et al., 2020)), instead, is an ideal option for text-only LLM evaluation. It usually involves abundant text descriptions for cards, turn-based competition, and thoughtful playing strategies to win, testing a model’s understanding of game rules, operating logic, and abilities to form strategic decisions based on current conditions and past experiences in the game.

In AgentBench we adapt a simplified DCG system—Aquawarhttps://www.saiblo.net/—from the 2021 Tsinghua University Agent Competition (THUAC) hosted by Student Association for Science and Technology in Department of Computer Science and Technology (CST-SAST), for evaluating LLM-as-Agent. In Aquawar, the agent acts as a player managing a team of fishes with different talents to battle against another team (controlled by our ad-hoc baseline agent) in a turn-based form. We report LLMs’ win rate as the evaluation metric. (Cf. Appendix E for more details)

Lateral Thinking Puzzles (LTP). Lateral thinking puzzles (Sloane, 1992), or situation puzzles, 海龟汤, is a popular group-playing game around the world. The game usually has a person hosting the puzzle and others guess by asking riddle-related questions. The host can only respond “yes”, “no”, or “irrelevant”. The game is terminated when one of the player recovers the critical plots of the puzzle. Its name derives from the psychological term “lateral thinking” (De Bono, 1970), which refers to the ability of deducing facts from unconventional perspectives and exploring new ideas.

In this dataset, we first set up an LTP host system for automatic judging (Cf. Appendix F). To assess LLMs’ lateral reasoning prowess, a diverse puzzle dataset is curated from web of varied levels of difficulty. We break down the true plot into several bullets and measure the portion of guessed-out bullets (i.e., game progress) when an agent exhausted the maximum number of playing rounds as the evaluation metric. Through this assessment, we aim to gain insights into the depth and agility of LLMs’ lateral reasoning abilities. (Cf. Appendix F for more details)

House-Holding (HH, ALFWorld (Shridhar et al., 2020b)). Embodied game environments such as house-holding, which require strong commonsense grounding, have been well-established for language agent evaluation (Côté et al., 2019). In AgentBench, we assess the model’s capability in accomplishing tasks in physical house-holding environments on the classical ALFWorld (Shridhar et al., 2020b) derived from the well-established text-game toolkit TextWorld (Côté et al., 2019). The agent needs to accomplish house-holding tasks such as “Put a pan on the dining table”. We adopt the SR as the evaluation metric. (Cf. Appendix G for more details)

3 Web-grounded Environments

Web pages have been primary interfaces for people to interact in the real world. Thus, assessing LLM agents’ behaviors in complex web environments would be critical and valuable for following development. Here, we adapt two existing web browsing datasets for practical evaluation over LLMs.

Web Shopping (WS, WebShop (Yao et al., 2022)). Online shopping is a very practical and important part of modern life. Its trajectory, which comprises searching, viewing, and choosing desirable items on a real e-commerce website, requires autonomous agents’ strong reasoning and decision-making abilities. Webshop (Yao et al., 2022), a simulated online shopping environment, exactly serves such a purpose for evaluating language agents. While it is originally evaluated on specifically trained models, we propose assessing LLMs with mere prompting. (Cf. Appendix H for more details)

Web Browsing (WB, Mind2Web (Deng et al., 2023)). General web environment is an ideal sandbox for training and evaluating intelligent agents. Mind2Web (Deng et al., 2023) is a very recently released general benchmark for developing and assessing web agents capable of executing intricate tasks across various website domains, given high-level user instructions. It designs feasible actions for website interactions, such as clicking, selecting, and typing, thereby facilitating a holistic evaluation of LLMs as web agents. Compared to Mind2Web’s original setting, we make adaptations to allow its evaluation on prompted LLMs without additional fine-tuning. (Cf. Appendix I for more details)

Evaluation of AgentBench

We extensively evaluate 27 LLMs, including API-based commercial models and open-sourced LLMs, to form a systematic view of the existing performance of LLM-as-Agent. We also design and release a simple plug-and-play evaluation toolkit to facilitate related LLM-as-Agent research.

Dataset Statistics. We report the statistics of datasets in AgentBench in Table 2. For simplicity, we use the abbreviation of each dataset in the following part. All datasets are practical multi-turn interacting challenges, and their estimated solving turns for each individual problem range from 5 to 50. We provide two splits for each dataset: Dev and Test. The Dev split’s all environments, answers, and checking scripts are public, while the Test is kept.

We also carefully balance the evaluation comprehensiveness and efficiency in AgentBench design, as LLMs’ multi-turn interaction can be time-consuming. We set the size of Dev and Test to 269 and 1,091, respectively, resulting in around 4k and 13k calls for inference, approximately the identical amounts of calls for inference as MMLU (Hendrycks et al., 2021b) requires.

LLMs to Evaluate. As a systematic attempt to benchmark existing LLMs on LLM-as-Agent, we include in total 27 models for evaluation, which could be roughly classified into two categories:

API-based Commercial LLMs: mainly consist of LLM APIs without disclosed parameter amounts (Cf. Table 1). Due to more investments, their performances are usually better.

Open-sourced (OSS) LLMs: mostly come from the academia and some companies (Cf. Table 1). Due to limited computing resources, we only include OSS LLMs smaller than 70B here.

Toolkit: Streamlining LLM Evaluation with API-Centric Approach and Environment Isolation. As Language Model (LLM) systems continue to advance in complexity and are primarily accessible through APIs, we have developed an evaluation toolkit that aligns with the API-oriented philosophy. This toolkit is meticulously designed to interact with APIs, simplifying the process of adapting and testing different LLMs. Researchers interested in evaluating their LLMs on AgentBench only need to set up a model server accessible via the HTTP protocol.

Moreover, dealing with diverse and intricate interaction environments poses a significant challenge. Uniformly configuring all these environments can be arduous and may lead to conflicts. To address this, we have implemented two key strategies. Firstly, we encapsulate tasks with complex environments into Docker images. Researchers can effortlessly utilize these images by mounting the code path and initiating the evaluation process with ease. Secondly, we have subdivided each task into separate workers, ensuring that the environments of these tasks remain isolated and free from conflicts. (Refer to Appendix A for further details.)

Evaluation Prompt Setup. To accommodate the majority of existing dialogue models, our dialogue paradigm is structured around two roles, user (i.e., instruction & environment feedback) and agent, engaging and alternating with one another. We record interaction trajectories as a conversation history $(u_{0},a_{0},\cdots,u_{k},a_{k})$ involving the user and agent, where $u_{i}$ , $a_{i}$ represents the $i$ -th round of the conversation history. When we perform inference, the conversation history must be like $(u_{0},a_{0},\cdots,u_{k})$ . We select the minimum $r$ such that count of all tokensBecause the tokenizers of each model is different, we simply calculate tokens like this: a word with length $n$ occupies $\lceil n/6\rceil$ token(s), and a non-blank character takes $1$ token. in $(u_{0},a_{r},u_{r+1},\cdots,u_{k})$ is not greater than 3500. And then we append "[NOTICE] $2r$ messages are omitted." into $u_{0}$ . After that, the sequence $(u_{0},a_{r},u_{r+1},\cdots,u_{k})$ is regarded as the final input in multi-turn chat format.

However, in order to consider non-chat models, we append a post-processor. We feed the history into the model for chat models supporting multiple turns. For models supporting only text completion (e.g., text-davinci-003), we prepend "USER:" or "AGENT:" into each item in the history and finally append the string "AGENT:" to make models generate the agent’s content.

For task prompt organization, we adapted the format from (Yao et al., 2023b) to include both “Thought” (for CoT) and “Action” but in one single turn. Usually, a simple CoT demonstration is provided in the task instruction for a better output format. To ensure reproducible results, we set temperature=0 (i.e., greedy decoding) in the inference on all tasks following (Wei et al., 2022b).

Overall Score Calculation. We have observed that the score distribution for each task varies significantly as tasks differ in difficulty levels. As a consequence, a naively averaged score is heavily impacted by tasks that generally yield higher scores (e.g., Web Shopping in our observation), overshadowing those with lower scores and being unsuitable for AgentBench’s purpose.

Therefore, we produce the overall score by first resizing each task’s average score to $1$ across all the models we evaluate and then averaging the scores across all tasks for each model (Cf. Table 2). To standardize and simplify score calculations for future studies, we utilize the reciprocal average score of all the tested LLMs in each task as a fixed weight for future overall score calculation. The total score is then computed as the average value obtained by multiplying the score of each task by its corresponding weight. This method ensures fairness and consistency in evaluation, enabling easier comparisons and analysis in future research.

2 Main Results

Overall and dataset-specific scores in AgentBench are reported in Table 3. Surprisingly, on this challenging benchmark, we discover that some top LLMs are equipped with solid capabilities for dealing with real-world environmental interaction. For example, gpt-4 presents the best performance on 6 out of 8 datasets in AgentBench; on HH, it achieves a success rate of 78%, indicating its practical usability in this scenario. claude-2 and claude follow gpt-4 but quite outperform gpt-3.5-turbo. Despite other API-based LLMs’ relatively poorer performance, regardless of tasks, most of them can solve quite a few percent of problems. All API-based LLMs have an AgentBench overall score above 1.00.

OSS LLMs, however, commonly fail to solve problems in some challenging tasks, such as KG, DCG, and HH. We plot their performance concerning their sizes in Figure 3. Generally, most open-sourced LLMs perform far poorer than API-based LLMs in AgentBench (Avg. 0.51 v.s. 2.15). The most capable OSS LLM turns out to be codellama-34b, achieving an overall score of 0.96 but still presents a clear performance gap to gpt-3.5-turbo. This contrasts recent claims that some OSS LLMs are comparable to gpt-3.5-turbo and gpt-4. We still need much effort to produce stronger OSS LLMs to serve agent purposes.

3 Analysis

In the evaluation, we analyze some important factors that impact an LLM agent’s performance on AgentBench, including outcome portion analysis, code training, and the difference between API-based commercial LLMs and OSS LLM competitors. More insights and case studies into the ability of planning, self-correction, and tool use are provided in Appendix J.2.

Portion of Different Types of Execution Outcomes. We report ratios of different types of execution outcomes (Cf. Section 2 for introduction) in Table 4. It is Task Limit Exceeded that dominantly caused the incompleteness of AgentBench tasks. It means that despite the instruction following of most LLM agents, they fail to solve the challenge in given time or fall into repeated generation when interaction turns grow up, indicating weak reasoning and decision-making abilities.

In DB and DCG, LLM agents majorly encountered Invalid Format errors, meaning they do not correctly follow the instruction’s format requirements. The format verification is stringent for DB, and no retry opportunities are provided. Furthermore, the task’s expected output may be close to certain models’ training data, yet not precisely aligned with. This discrepancy can lead the models to revert to their pre-trained formatting, inadvertently overlooking the specific requirements we provide. (Cf. Appendix J.2.1) For DCG, its instruction could be longer and more complicated than other tasks due to the need to introduce game rules, making some LLMs feel confused. In HH and WB, another major issue is about Invalid Action, where LLM agents generate actions beyond predefined action spaces. These two tasks provide many discrete action options at each turn, and many LLMs fail to generate an action from them and, therefore, cause errors. For specific ratios of each LLM, please refer to Appendix J.1.

Impact of Code Training. We find that code tuning might deeply influence a model’s way of inferential generation and thinking, even beyond topics just about coding. From the comparison of codellama and llama-2 series, tuning with code seems to give models an edge in tasks that follow a relatively static procedure (e.g., Web Shopping). But, this kind of tuning might also affect the model’s general thinking ability, as codellama series does not perform as well in the Digital Card Game as llama-2 series. This points to a balance between being good at following procedures and being good at general thinking when tuning LLMs.

Impact of High-Quality Alignment Data Training. Another helpful comparison would be between vicuna-13b and llama-2-13b. While they share the same base LLM, vicuna-13b is aligned by training on ShareGPT’s data (generated by gpt-4 and gpt-3.5-turbo, shared by users) and llama-2-13b is aligned from scratch. As a result, vicuna-13b outperforms llama-2-13b on AgentBench, and even performs comparably to 3 times larger codellama-34b. This indicates that high-quality alignment is still a key to develop better LLM agents.

Unexpected Similar Performance of llama-2-13b and llama-2-70b. During our experiments, we were surprised to find that llama-2-13b and llama-2-70b perform similarly despite the significant gap between their sizes. After carefully checking and re-running experiments, the results are unchanged. We think that it indicates llama-2-70b’s insufficient pre-training. While both llama-2-13b and llama-2-70b are pre-trained with 2T tokens, a larger LLM should be trained with more tokens according to the scaling law (Hoffmann et al., 2022).

Related Work

Evaluation of LLMs. The general capabilities of self-supervised (Liu et al., 2021) LLMs (Brown et al., 2020; Chowdhery et al., 2022; Zhang et al., 2022; Scao et al., 2022; Zeng et al., 2022; Touvron et al., 2023), especially those chat-aligned ones (Ouyang et al., 2022; Anthropic, 2023a; OpenAI, 2023), have refreshed people’s impression on deep learning systems and significantly transcended the conventional scope of NLP evaluation. It thus makes the evaluation of LLMs an urgent and challenging problem. Compared to previous efforts focusing on a subset of specified tasks (Wang et al., 2019; ; Gehrmann et al., 2021), an increasing number of benchmarks are including broader spectra of tasks and datasets (Hendrycks et al., 2021b; Liang et al., 2022; Srivastava et al., 2023) in the evaluation. However, most of them are still limited to traditional tasks and thus fail to evaluate LLMs’ open-ended generation, multi-turn interaction, and ability to act as agents.

LLM-as-Agent. In pre-LLM era, text game environments such as TextWorld (Côté et al., 2019), Jericho (Hausknecht et al., 2020), and LIGHT (Urbanek et al., 2019) are dominant in language agent study which bases on BERT (Devlin et al., 2019) and reinforcement learning. With the advent of LLMs, the study of LLM agents begins to thrive (Huang et al., 2022), especially after Chain-of-Thought (Wei et al., 2022b) came out. ReAct (Yao et al., 2023b) is a pioneer work to combine CoT reasoning and actions in agent tasks. Later, a bunch of advanced reasoning strategies (Kim et al., 2023; Shinn et al., 2023; Wang et al., 2023d; Liu et al., 2023; Yao et al., 2023a; Gu et al., 2023) and applications (Park et al., 2023; Richards, 2023; Nakajima, 2023; age, 2023) for LLM-as-Agent have emerged and arouse much public interest. Nevertheless, limited datasets and models and available on the topic, without a standard and comprehensive benchmark. AgentBench presents the first systematic benchmark for evaluating LLM-as-Agent with a broad coverage of tasks and available LLMs. Additionally, it also initiates the idea of adopting agent tasks to measure LLM performance.

Evaluating LLMs in Executive Environments. As LLMs become increasingly capable of real-world challenges, there is also a trend to evaluate them in executive environments rather than static datasets. Besides text games (e.g., ALFWorld (Shridhar et al., 2020b)), another main stream of works lies in code execution. APPS (Hendrycks et al., 2021a), HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) pioneer the effort to evaluate code LLMs for functional correctness instead of text similarity. The paradigm has been later widely recognized and adopted in following works (Li et al., 2022; Zheng et al., 2023; Nijkamp et al., 2023). However, few previous code evaluation frameworks consider multi-turn interactions. A concurrent work InterCode (Yang et al., 2023) releases a framework that allows evaluation of interaction between models and Bash and SQL environments, which are similar to OS and DB tasks in AgentBench.

Conclusion

We present AgentBench, a systematically designed multi-dimensional evolving benchmark for evaluating LLMs as agents. For the first time, we include such a wide array of up to 8 real-world challenges to evaluate LLM agents, and establish a unified testing framework and toolkit for agile evaluation. An extensive study of 27 LLMs, including API-based and Open-sourced, is carefully conducted in a standard setting. In our assessment, contemporary commercial models have demonstrated preliminary capabilities as agents in analysis, planning, execution of plans, tool invocation, and self-reflection. These abilities suggest their nascent proficiency in addressing real-world challenges. Conversely, we posit that open-source models might either lack some of these competencies or, at best, possess only a subset of them simultaneously. We expect AgentBench to serve as a cornerstone for later study to develop better and more applicable intelligent LLM agents.

References

Part I Appendix

Traditional evaluation frameworks can be categorized into two types:

Traditional Tasks (e.g., single-turn generation, classification, etc.). These frameworks are designed for specific tasks and may not be suitable for more complex tasks involving multi-turn interactions.

Agent-based Tasks (tasks with multi-turn interactions). These frameworks are typically tailored to a specific task by the creators of the dataset. They often suffer from several limitations:

They are designed for a specific task, limiting their applicability to other tasks.

Communication between components (Task, Agent, and Evaluation) usually occurs within a single process or through the creation of child processes, necessitating evaluation on the same device.

They can only evaluate one task with one agent at a time.

A.2 Our Designed Evaluation Framework

To address the limitations of traditional agent-based evaluation frameworks, we have designed a novel framework with the following features:

Decoupled S/C Architecture. Our framework decouples the Task Server, Agent Server, and Evaluation Client components, enabling separate deployments. They can communicate via HTTP interactions, allowing them to run on different devices, thus eliminating the need for co-location to satisfy the requirements of both Task and Agent.

Agent-Task Collaborative Evaluation. Our framework supports collaborative evaluation of multiple agents and tasks in various combinations simultaneously. This flexibility enables more comprehensive testing scenarios.

Network Flow Algorithms. We have incorporated network flow algorithms into the Evaluation Client, maximizing evaluation efficiency. This optimization ensures that both Agent and Task Workers are utilized to their fullest potential.

Resumable Evaluation. Our framework includes a resumable evaluation feature, making it easy to recover and continue interrupted evaluations seamlessly.

With these advancements, our evaluation framework overcomes the limitations of traditional approaches and provides a more versatile, efficient, and scalable solution for evaluating intelligent agents in multi-turn tasks.

The overall structure of our framework can be described in Figure 5.

A.3 Implementation of Max-Flow Algorithm

In our evaluation process, we employ the Edmonds–Karp algorithm (Edmonds & Karp, 1972) as a practical implementation of the Ford–Fulkerson method (Ford Jr & Fųlkerson, 1962) designed to compute the maximum flow in a network with a time complexity of $O(|V||E|^{2})$ .

To formalize the problem, consider a scenario with $n$ agents, denoted as $A_{1},A_{2},\cdots,A_{n}$ , and $m$ tasks, denoted as $T_{1},T_{2},\cdots,T_{m}$ . Our objective is to conduct evaluations in $l$ different groups, each focusing on the pair $(A_{x_{k}},T_{y_{k}})$ , where $1\leq k\leq l$ . Additionally, for every such pair $(A_{x_{k}},T_{y_{k}})$ , we should evaluate $s_{k}$ samples. The number of workers for agent $A_{k}$ and task $T_{k}$ is denoted as $w(A_{k})$ and $w(T_{k})$ respectively.

The flow graph we construct can be described as $G=<V,E>$ , where the vertex set $V$ is defined as

And the weighted edge set $E$ is denoted as

We apply max-flow algorithm from source vertex $S$ to destination vertex $D$ . For each flow edge $(A_{i},T_{j},f_{(i,j)})$ , we allocate $f_{(i,j)}$ samples for agent $A_{i}$ and task $T_{j}$ . After allocation, the weight of the edges should be reduced by the value of flow. Upon completion of an evaluation, the weight of edge connected to either $S$ or $D$ should be increased by $1$ .

We also establish a periodic interval for applying the algorithm to the network for newly available evaluation triples.

Appendix B Operating System

Construction Details. Each evaluation sample in OS dataset encompasses following contents:

Instruction. The description of the problem in natural language that needs LLMs to solve.

Docker Environment. The starting up docker image (e.g., preset default local-os/default).

Initialization Script (Optional). The bash scripts that need to be executed independently (docker exec) before the interaction starts (e.g., user configurations, files, system statuses).

Start Script (Optional). The bash scripts executed after shell is created and before interaction.

Checking Pipeline. The checking method to judge the correctness of LLMs answer or operation.

Example Script (Optional). The bash scripts that serve as reference solutions. In other words, if executing them in the interaction, results are correct. Only for unit tests that introduced below.

We design two types of tasks in the OS evaluation beyond conventional QA-only evaluation.

Question Answering (QA): LLMs need to output commands to solve specific questions in OS (e.g., aggregate numbers, view file contents). In this case, they must commit answers finally.

Operation: LLMs need to output commands to do some verifiable operations on the operating system (e.g., change file/user states). In this case, they do not need to commit final answers.

Thanks to the checking pipeline, two types of tasks can be evaluated in a unified solution.

Collecting challenging queries regarding OS could be difficult. In practice, about half of our instructions are created or collected from humans, while the other half are mostly QA problems generated by gpt-4 and strictly filtered by passing the unit tests (i.e., yield correct answers/states).

For human instructions, we first gather 6000 real problems and solutions with bash or shell tag from Stack Overflowhttps://stackoverflow.com/. Then we sort them by the score (count of likes). We invite 8 annotators majored in programming to select challenging ones. For each selected problem, they create one or more task instructions and write a detailed problem description, the initialization script, the starting script, and the checking pipeline. Finally, we conduct a cross verification for each evaluation sample to make sure it’s correct. For each problem, it takes about 2 hours to do the annotation.

For generated problems, our unit test contains the following parts. 1) Initialization Script Correction: we execute the initialization script and remove samples with wrong initialization whose exit code does not equal to . 2) Example Code Correction: we execute the example code and the checking pipeline to judge the correctness of the answer. We remove samples with wrong answers.

In the end, we curate 144 high-quality diverse OS evaluation samples accompanied with testing interactive environments and corresponding checking pipelines (i.e., scripts). Agents are prompted with 1-shot CoT to better format their responses (Cf. Appendix B).

Evaluation Setup. For each problem (i.e., instruction), the execution can be divided into 3 parts.

Initialization. We create a docker container with a specific image, and we run an initialization bash script to set up environments specified by the instruction.

Interaction. We start a new shell in this docker, and run the starting bash script specified by the instruction. Then the LLM to test is fed with a piece of instruction and the problem description. It starts interaction with the shell. In each turn, two actions are provides. One is to run bash script, which allows the model to generate and run a series of commands in the shell. The other is to commit answer, which allows the model to terminate the interaction process. It’s notable that the model will be judged that it fail to solve the problem if exceeding round limit (8 by default).

Checking. For each problem, there is a checking pipeline containing a list of scripts $f_{1},f_{2},\cdots,f_{n}$ , where $f_{k}$ denotes the $k$ -th script piece in the pipeline. For $f_{k}$ , the answer of the model, $o_{0}$ , and the output of $f_{t}(t<k)$ , $o_{t}$ , will be fed as input arguments into $f_{k}$ , i.e., $o_{k}=f_{k}(o_{0},o_{1},\cdots,o_{k-1})$ . The result is correct if and only if all the scripts exit with code .

Metrics. We measure the Success Rate for LLMs to solve problems in the execution. There are only two final status for each item of the problems, wrong or correct.

B.2 Actions

In OS evaluation, we design two major types of actions: bash and commit.

Bash: which launches a bash command (using textual inputs in content field)

Commit: which announces the completion of the goal. If the task is a QA problem, then the agent should submit the final answer in content field; else the checking pipeline will automatically check the system status to judge the correctness.

B.3 Prompt Example

A prompt for OS evaluation consists of the instruction and the formulation of interaction trajectory. An example of instruction prompt is:

The trajectory is organized in CoT styles, and we use an 1-shot example to make model better understand the action space like the following.

Appendix C Database

Construction Details. We acquire the source queries and databases via reusing and amalgamating several established datasets: WikiSQL (Zhong et al., 2017), WikiTableQuestions (Pasupat & Liang, 2015), SQA (Iyyer et al., 2017), HybridaQA (Chen et al., 2020), and FeTaQA (Nan et al., 2021), ensuring the diversity of instructions and data.

To further enrich (and avoid leakage from) the dataset, we employed gpt-3.5-turbo to perform data augmentation. Provided with the header information and original rows of a table, gpt-3.5-turbo generates ten new rows. Using the name, header information, and some SQL examples, we task gpt-3.5-turbo with generating five additional SQL queries. Each acquired SQL statement is then fed sequentially into gpt-3.5-turbo with instructions to rephrase the sentences without changing their original meanings. The valid entries are filtered and sampled into the final dataset with 1599 entries, categorized into three basic types of DB operations: select, insert, or update.

As a result, each sample in the dataset comprises:

Instruction. A piece of description delineating the problem and guiding the agent’s action.

Table Info. Explanations about the table name and column names (i.e., meta information).

Table Content. The actual contents within the table, utilized to create the database.

Correct Answer. For selection-type samples, it is a text answer; for other entry types (i.e., insert, update), it is the hash code of the correctly modified table.

Evaluation Setup. We assess each problem in the dataset through the following procedure:

Initialization. An initial SQL script is constructed based on the table content, and a MySQL database is initialized in a docker container, which provides a forwarded port for interaction.

Interaction. An initial prompt guides the agent to provide an executable SQL command along with its reasoning. The agent is provided with the prompt, instruction, and table information description, and it is expected to return a response in given format. We execute the SQL and directly return the result to the agent, continuing this loop until the agent commits its final answer or encounters an error (e.g., reaching the maximum round limit or failing to parse the action).

Checking. For selection-type problems, we compare the agent’s answer with the standard text answer, disregarding the order, but expecting an exact match. If the answer is a single number, all equivalent representations are accepted (e.g., 5, "5.0", ’+5’ are considered identical). For insertion or updating types of problems, we calculate and compare the hash of the table after the agent’s operation with the hash of the table after the correct SQL operation.

Metrics. We measure the Success Rate of agents in completing instructions. Overall success rate is the macro average of the rate of three categories.

C.2 Data Augmentation

We elaborate on the data augmentation of three types of DB tasks based on the existing SQL datasets (Zhong et al., 2017; Pasupat & Liang, 2015; Iyyer et al., 2017; Chen et al., 2020; Nan et al., 2021), which are all QA problems without some common operations including inserting and updating. We first tested the validity of the raw data and then randomly sample from each category from filtered data to form the final dataset. We adopt gpt-3.5-turbo to enrich and rewrite the original instructions.

Insert: Given the name, the header information, and the original rows of a table, we generate 5 SQL statements for insertion. Later we rephrase the sentences without changing their meaning (using shorter or longer expressions or changing the order).

Update: Given the name, the header information, and the previously generated 5 SQL statements for insertion, we generate 5 SQL statements for modification based on the given statements. We rephrase the sentences following the above standard.

To ensure data quality, each augmented query statement are required to pass the unit test scripts.

The query type of tasks fall into the traditional scope of Text-to-SQL evaluation, and we only sample and categorize for evaluation. Each query statement in existing datasets is classified into following types: ’Counting’, ’Aggregation-MIN’, ’Aggregation-MAX’, ’Aggregation-AVG’, ’Aggregation-SUM’, ’Ranking’, or ’Comparison’. Each one can only belong to one type. The remaining will be categorized as "Other".

C.3 Prompt Example

Appendix D Knowledge Graph

Construction Details. In an effort to gauge the decision-making abilities of LLMs, specifically their proficiency in long-term planning, we have meticulously compiled a dataset sourced from pre-existing knowledge base question answering (KBQA) datasets on Freebase, including GrailQA (Gu et al., 2021), ComplexWebQuestions (Talmor & Berant, 2018), and GraphQuestions (Su et al., 2016).

We envisage KBQA as a tool learning setting, thereby outfitting the LLM with an array of KG-querying tools. By leveraging the S-expressions annotated in (Gu & Su, 2022), we can accurately establish the optimal sequence of tool applications corresponding to each question. In order to sustain a high degree of difficulty in the tasks, we have opted to preserve only those questions which necessitate a minimum of five instances of tool invocation. Through this rigorous selection methodology, we have accrued a dataset consisting of 1,663 questions. Each data entry in the dataset has the following fields:

Input Question. A natural language utterance that involves intricate KG information seeking.

Topic Entities. A set of topic entities mentioned in the input question. We obviate the need of performing entity linking, allowing the LLM to focus on long-term planning.

Action Sequence. The gold action sequence (i.e., tool invocations) that leads to the target answer.

Gold Answer. The gold answer to the question, typically characterized by a set of KG entities.

Note that, in contrast to interacting with databases in AgentBench, where the particulars and content of the database are integrated into the input, describing an extensive KG to the LLM is not particularly feasible. This task is characterized by a partially observable environment, which is a critical aspect of its nature.

Evaluation Setup. To support our evaluation, we first host the latest version of Freebase using Virtuoso.https://github.com/dki-lab/Freebase-Setup Due to the complexity of SPARQL queries, we decide not to burden the LLM with crafting SPARQL queries by itself. Instead, we implement a series APIs that interface with the Virtuoso backend, allowing the LLM to query the KG more effortlessly.

We use the first 500 tasks from the datest for evaluation. Each task, when successfully executed, should ideally proceed through the following phases.

Initialization. We prompt the LLM with the concrete task description, including the concrete description of each KG-querying tool that we provide.

Interaction. During this phase, the LLM is expected to invoke different tools to access the KG and accumulate the necessary information to respond accurately to the question. Importantly, the process is entirely autonomous, meaning the LLM determines the workflow entirely by itself.

Final Answer Prediction. During its interaction with the KG, the LLM may generate a list of variables, each one representing a unique set of entities. If the LLM determines that one particular variable should signify the final answer, it will present this variable as its output and conclude the task.

Metrics. We use F1 score as the primary evaluation metric in our study, calculated by comparing the model’s predicted answers to the gold standard answers. In addition to F1 score, we also use the Exact Match metric. However, unlike previous studies that measure Exact Match based on the logical form, we assess it based on the exact match between the predicted and gold answer sets. Lastly, we also evaluate the Executability of the action sequences generated by the model. If the model’s action sequence produces any set of answers when executed, it scores 1.0 for Executability. If it fails to produce an answer, it scores 0.

D.2 Prompt Example

Given the inherent complexity associated with enabling LLMs to query the KB, it has been observed that, in a zero-shot setting, LLMs struggle to generate any outputs of substantive relevance. As a result, we additionally provide a teaching example in our prompt:

Appendix E Digital Card Game

Construction Details. We use Aquawar framework as the basis for our interactive system. The first type of interaction is the action phase, where the model needs to select the fish it wants to act with and then choose the target for skill. To ensure the validity of model operations, we perform checks for valid actions. The second type of interaction is the guess phase, where we provide the model with known information, including fish species and skill descriptions, enemy’s targets. We have two naive strategies (random and greedy search) for testing purposes. The following is a detailed definition and description of the game process.

Player and Cards. It is a two-player battle game with four pet fishes (i.e., cards) in each team. The card pool consists of ten fish (Appendix E.2), and both players choose four definite fish to use before the start of the game.

Initial State. Each fish has 400 initial health, 200 initial attack power, active ability, and passive ability.

Basic Rule. Players choose a live fish to use its active skill or normal attack on an enemy fish each round. All alive fish’s passive ability will automatically trigger when meeting certain conditions.

Assertion Mechanism. The identity of a player’s fish is initially hidden. The counter-player can guess one of the player’s fish’s identities each round. If the counter-player guesses correctly, the player’s fish’s identity is revealed, and all its fish will get damaged.

Round Process. Within a round of the game, the player for that round will first assert the identity of one opponent’s fish that are alive and whose identities have not been revealed. If the assertion is correct, all of the opponent’s fish that remain alive get damaged. Subsequently, the player for that round can command one alive fish to execute a normal attack or an active ability. Following this, any fish that meet the condition will unleash its passive ability.

Victory Condition. The victory condition is to have more fish alive at the end of the game.

To balance agent engagement and game complexity simultaneously, we designed two stages of game logic. We remove the assertions in the first stage while keeping assertions in the second stage. We test all the models on both the first and second stages separately and choose the average performance for final score.

We choose two naive playing strategies as the baselines.

The first strategy is a simply random action from all available action spaces.

The second strategy will try to use AOE attack if possible, and continuously evaluating whether a one-hit kill is possible. Then, it attempts to use active skills and, finally, resorts to normal attacks. Overall, this strategy follows a certain pattern but may not necessarily be the most optimal one.

Evaluation Setup. For each time of the game playing, we evaluate with the following steps:

Initialization. We initiated the modified game logic environment, which uses pybind to compile, and the baseline game agent under the Ubuntu 20.04 environment.

Interaction. We place rule descriptions in the instruction prompt according to different game stages, and the LLM agent interacts and competes strategically with the baseline within the game logic environment. We give the LLM agent five chances to respond in the correct format. It will be immediately deemed defeated if it fails to output legal actions within the given number of attempts. At the same time, we encourage the model to output its reasoning process in CoT.

Result Calculation. During the Interaction process, we will record the entire game process for battle playback and calculate the game results to obtain the metrics for the task.

Metrics. Our comprehensive evaluation uses metrics that range from basic gameplay elements such as the wining rounds (Win Round) , total played rounds (Total Round), winning rate (Win Rate) , the total damage inflicted compared to total health (Damage Rate), and ultimately we provide a final reward score according to the above metrics:

E.2 The Attributes of Fish

The game has ten kinds of fish according to the game rules.

- Counter (Passive): Inflicts 30 damage to the attacker when a teammate’s health is below 30%

- AOE (Active): Attacks all enemies for 35% of its attack points.

- Counter (Passive): Inflicts 30 damage to the attacker when a teammate’s health is below 30%

- Infight (Active): Inflicts 75 damage on one living teammate and increases your attack points by 140.

- Deflect (Passive): Distributes 70% damage to teammates and takes 30% when attacked. Gains 40 attack points after taking 200 damage accumulated.

- AOE (Active): Attacks all enemies for 35% of its attack points.

- Deflect (Passive): Distributes 70% damage to teammates and takes 30% when attacked. Gains 40 attack points after taking 200 damage accumulated.

- Infight (Active): Inflicts 75 damage on one living teammate and increases your attack points by 140.

- Reduce (Passive): There is a 30% chance to avoid any incoming damage each time.

- Crit (Active): Deals 120 CRITICAL damage to an enemy.

- Reduce (Passive): There is a 30% chance to avoid any incoming damage each time.

- Subtle (Active): Choose a teammate or yourself to reduce the damage taken by 70% when attacked, and increase its attack points by 20.

- Heal (Passive): Regain 20 health points if the health is still greater than 0 when attacked.

- Infight (Active): Inflicts 75 damage on one living teammate and increases your attack points by 140.

- Heal (Passive): Regain 20 health points if the health is still greater than 0 when attacked.

- Crit (Active): Deal 120% CRITICAL damage of your attack power to the enemy with the lowest health. If the target’s health is below 160, increase the CRITICAL damage to 140%.

- Explode (Passive): Deal 40 damage to the source when attacked but not died. When the health is below 20%, increase its attack points by 15.

- Crit (Active): Deal 120% CRITICAL damage of your attack power to the enemy with the lowest health. If the target’s health is below 160, increase the CRITICAL damage to 140%.

As can be seen, there is overlap among the active and passive skills of different pet fish, which is done to better conceal the identity information of pet fish in the game and increase the strategic aspects of the game.

E.3 Prompt Example.

We use the following format of prompts for actions:

We use the following format of prompts for assertions in stage2:

Appendix F Lateral Thinking Puzzles

Construction Details. Each sample is constructed of a pair of story (a riddle, e.g., A man walked into a restaurant, ordered a bowl of turtle soup, and after finishing it, he committed suicide. Why did he do that?) and truth. We categorize samples into four levels of difficulty: easy, medium, hard, and expert. The LTP rules for LLM agent playing are as follows:

Roles: Roles in LTP evaluation are a host and a solver. The host knows the story and truth, providing the story to the solver, and guiding it to guess out the truth. The solver, played and acted by an LLM, tries to find out the truth by asking questions and synthesizing host’s answers.

Solving Steps: There is a maximum round for each game, for example, 25. The solver needs to propose a question in each round based on known facts. The questions should be the ones that can be answered by “Yes”, “No”, or “Irrelevant”. Host reply to the questions with correct answers. To lower the difficulty for LLM agents, sometimes the host will provides some hints in responses when solvers get trapped in wrong directions of reasoning.

Game Termination: When the solver thinks it has guessed out the major part of the truth, it can declare the guessed plot to the host. If it is correct, the host will announce the end of the game.

Evaluation Setup. For each pair of story and truth, we evaluate the models with the following steps:

Initialization. Setting up the LTP host system via local python package installation or web API.

Interaction. We set up system prompts for LLMs to build their roles of players. LLMs are tested as solvers within the maximum round for each game, if the LLM does not exceed the max token length. In automatic evaluation, we limit the answer to be mostly "Yes", "No", or "Irrelevant", and extract the answer from gpt-3.5-turbo’s responses. LLMs are also asked to summarize their reasoning in automatic evaluation in order to help the termination detection to be more accurate.

Checking. We do the pilot study of each LLM to collect all situations in game process and design the checking plan. For automatic evaluation, we set up some key words for gpt-3.5-turbo to answer and remind the model to consider some flexible situation like synonyms.

Metrics. We evaluate LLMs’ Lateral reasoning ability by two self created metrics:

Single Game Accuracy (SGA): The proportion of rounds in which LLMs approaching the truth in a single game.

Round Efficiency (RE): How fast the model can guess out the truth within the maximum round.

Query Relevance (QR): Relevance between model’s questions and the truth.

Game Progress (GP): Progress before a game end, which serves as the main metric. We break down the groundtruth into several points and measure how many points are reached by an agent.

F.2 Evaluation on LTP System

We evaluate the LTP System by human validation, validating system’s accuracy on milestone recognition and fact verification. We compare the Single Game Accuracy and Query Relevance between automatic evaluation and human evaluation, and found that automatic evaluation sometimes more tolerate for the agent, which make SGA and QR seem better than human evaluation, especially on open-sourced models. We plan to train a model specifically for the host of the game, in order to provide a better game experience and a more precise evaluation. For Game Progress and Round Efficiency, the LTP system provides an objective evaluation, which can match the level of human evaluation.

F.3 LTP Game Progress and Termination

The progress of game is defined as the proportion of hit key points in the truth. The key points are summarized by gpt-3.5-turbo, which are concluded in the dataset as “answer_keys” (see an example below)

The number of key points varies among samples. As for the decision of whether the agent guess out key points, we first change relevant questions into declarative sentences, then simplify sentences into one sentence. After guessing out a key point, we delete that key point and relevant inferences to avoid repeated guessing.

F.4 Prompt Example

We use the following format of prompts for agents:

We use the following format of prompts for host:

Here is the prompt to convert questions answered by “Yes” into declarative sentence.

Here is the prompt to convert questions answered by “No” into declarative sentence.

Here is the prompt to merge reasoned out information into one sentence to judge whether the agent guess out the key points:

Here is the prompt to judge whether the merged sentence hit the key point.

Appendix G House-holding

Construction Details. The ALFWorld benchmark comprises of textual environments designed to mimic household scenarios, providing an interactive environment where an agent can perform decision-making tasks through text-based interfaces. Given the household environment description and an target instruction, the agent’s objective is to break down the complex high-level target into a sequence of straightforward actions. After each step, the agent receives environment feedback, allowing the agent to adapt the plan dynamically and move on to the subsequent task to eventually accomplish the main objective.

Each evaluation sample in ALFWorld dataset encompasses following contents:

Environment Description. The detailed description of the whole household environment, including agent’s initial position and a snapshot of the room containing objects and their IDs.

Objective. The goal that needs the agent to accomplish in the environment, usually requiring multi-step reasoning and exploring (e.g. put the lamp on the table).

Simulated Environment. After every action of the agent, the simulated environment gives immediate feedback and evaluates whether the agent has completed the task.

In the dataset, we utilized 134 solvable problems from the ALFWorld eval out of distribution split of the dataset. All the problems were categorized into six categories: pick and place, pick clean then place, pick heat then place, pick cool then place, look at obj, and pick two obj.

Evaluation Setup. Due to the inherent complexity of the problem and the high standards required for the output format, we employ a 1-shot evaluation setting. For each category of problem, we use one relatively simple and complete interact processes of the same category from the training set as an example. Following ReAct (Yao et al., 2023b), we adopt the few-shot examples and prompts in corresponding repositoryhttps://github.com/ysymyth/ReAct. Additionally, if LLM output format is invalid, we use the BLEU metric to assess the similarity of the output to all valid action options. The option with the highest similarity will be chosen as the action of the model for this round.

For each sample, the evaluation process can be divided into 2 parts.

Initialization. We describe the task to the model and provide one successful example. Afterwards, we elaborate on the environment and delineate the objective required to be accomplished.

Interaction. The model generates some thoughts and the next action based on the feedback received from previous interactions and the information from the environment. After receiving the action from the model, the environment provides feedback (changes to the environment or information observed by the model). This process is repeated until the model successfully achieves its goal (which is considered a success) or reaches its maximum number of actions (which is considered a failure). It is worth noting that sometimes, after several unsuccessful attempts, the model may repeatedly output the same content. To save evaluation time, we judge that if the model outputs identical content three times consecutively, it will be deemed a failure due to repetition.

Metrics. We employ the overall Success Rate as a measure of model performance, that is, the number of tasks successfully completed by the model divided by the total number of tasks.

G.2 Prompt Example

To align the output format with the legal commands supported by the simulated environment, we adopted a 1-shot evaluation setup where one successfully completed task example was concatenated after the instruction. At the beginning of the interaction, we describe the task to the model using the following instruction.

All the tasks in the datasets are categorized into six classes. To better guide the model in accomplishing the objectives, we have selected one relatively simple example of successful completion of similar tasks for each category as 1-shot example. Here is an example:

Appendix H Web Shopping

Construction Detail. The environment displays the text observation of the webpage and available actions to agents. Agent may freely explore the website and browse through items with clickable buttons just as in the real world. About a million products are scraped from amazon.com to form the database of website. Then each of them is annotated with labels representing its own attribute. 12,087 human instructions are collected and linked with goals along with expected attributes. Please refer to (Yao et al., 2022) for more dataset construction details.

Evaluation Setup. We adopt the first 500 entries of 12,087 instructions as test set (following (Yao et al., 2022)’s official implementation). Each round of interaction can be decomposed as following steps:

Instructing. After the initial prompt that tells environment information and the format in which LLMs should response, we give instructions about what kind of product we wish to buy.

Interacting. Agent respond in given format, as prompted, containing their thoughts and the action they wish to take. The actions can be categorized into two types: search and click, corresponding with the actual actions of using search engine and clicking buttons in real world. The environment answers agent’s action with a simplified text version of webpage and a list of available buttons. This process repeats until the agent click "buy now" button or round limit is exceeded.

Calculating reward. We use the reward function in the paper as the metric. The reward is mapping from the similarity of the attributes we are expecting and the attributes that the bought product actually have to a number between 0 and 1.

Metrics. As there might be more than one suitable item for a given query, Webshop adopts a matching reward as its evaluation metric:

$U$ and $Y$ stand for goal and chosen product, $att$ and $opt$ stand for attributes and options. TextMatch is a text match of pronoun, noun, and proper noun between chosen and goal product title.

H.2 Prompt Example

We use the following format of the prompt:

Appendix I Web Browsing

Construction Details. Mind2Web covers domains of Travel, Information, Sevice, Shopping, and Entertainment, assembled using SimilarWeb ranking as a reference. It hires annotators to first propose task goals based on the current website, and then record their traces of interaction as expert demonstrations. Our adoption of it primarily focuses on generalization across environments, i.e., the Cross Domain test set which contains 912 tasks from 73 websites, spread among domains including Housing, Job, Social Media, Education, Health, Government, Home Service, etc. Please refer to (Deng et al., 2023) for more dataset construction details. Each task sample encomposses the following contents:

Task Description. A high-level (instead of step-by-step) goal that can be achieved on the website, such as“Get the highest rated SAP S/4 HANA course rated 4, and up with a duration between 3 to 6 hours for an intermediate, and add this to your cart and checkout”.

(Reference) Action Sequence. In the annotated interaction sequence, a meta-action $a_{t}$ at step $t$ includes $\{e_{t},o_{t}\}$ , where $e_{t}$ represents the unique backend id of the target element, and $o_{t}$ refers to the symbolic action operated on $e_{t}$ (i.e., Click, Type, and Select Options). For Type and Select Options, corresponding textual inputs are also included.

Webpage Information. A detailed observation of the web browsing environment at each step. Throughout the manual annotation process, each observed step captures a snapshot, incorporating the raw HTML codes from the website as well as the previous interaction trajectory.

It has been found that LLMs consistently face challenges when handling the cumbersome raw HTML code associated with real-world web pages. Therefore, Mind2Web proposes to rank and filter the HTML elements with a small language model, e.g., DeBERTa, to enhance inference efficiency.

Given the user’s high-level instruction, the agent continuously interacts with the web system by receiving the observation of the current page content and the action histories, then predicting the next action, which consists of the target element and intended operation.

Evaluation Setup. The evaluation involves a dual process to improve the efficiency following (Deng et al., 2023). A fine-tuned small language model is first employed to rank HTML elements and select top-k potential candidates. Subsequently, we prompt and formulate the element selection as a multi-choice QA problem, providing five candidates for each round. For the Type and Select Options operations, agents are additionally prompted to specify the argument for the operation, i.e., textual input to type or option to select.

Metrics. For evaluation, as suggested in the original paper, we consider the following metrics:

Element Accuracy. Calculates the accuracy of the chosen element $e_{t}$ .

Action F1. Determines the token-level matching score for the operation $o_{t}$ . It brings a distinction for Type and Select Option operations due to the existence of text values.

Success Rate. Evaluates the predicted action correctness compared to reference actions. For Step Success Rate, we grant success if the selected element $e_{t}$ is correct and the predicted operation $o_{t}$ matches the ground truth value at the step. Likewise, for the Task Success Rate, a task is considered successful only if all the steps have been successful, making it a rigorous measure. Unfortunately, even the best LLMs now can only achieve single-digit task success percentages.

We report Step Success Rate as the main metric showing the independent accuracy of each action step, due to the current struggles for LLMs to ensure overall task success rates. Regarding the experimental setup, we select topk 10 candidates to construct multichoice questions utilizing CoT few-shot prompting. Consequently, the GPT-3.5 results can diverge from the original paper (Deng et al., 2023) under topk of 50 setting and different prompting strategies.

I.2 Prompt Example.

We use the following 3-example CoT prompts for Mind2Web evaluation:

Appendix J Detailed Analysis

In the realm of artificial intelligence and machine learning, the efficacy, precision, and reliability of models are crucial for practical implementations. Evaluating multiple models provides an understanding of their respective strengths and limitations, leading to better informed decisions about which models are best suited for specific tasks. The purpose of this validity analysis is to offer a systematic approach to discern how different models perform, particularly in terms of task completion, context size constraints, return format accuracy, action accuracy, and task limitations. This deep dive into performance parameters not only enhances our knowledge about the models’ capabilities, but also aids in refining and optimizing them for future applications.

For comprehensive validity analysis, we have demarcated the results into five distinct categories:

Completed: Denotes instances where models, irrespective of the end outcome, successfully finished the task as per the instructions.

Context Limit Exceeded: Denotes instances where the model’s length was constrained by the API, predominantly observed in the text-davinci model.

Invalid Format: Denotes instances where models, despite receiving clear instructions, failed to return responses in the expected format.

Invalid Action: Denotes instances where the models returned in the correct format, but their actions either fell outside the permitted action space or had incorrect action parameters.

Task Limit Exceeded: Denotes instances tasks reached their termination criteria, such as exceeding the stipulated number of turns.

By categorizing the results into these classes, we can gain a clearer picture of where each model excels and where they encounter challenges, allowing for targeted improvements.

For our evaluation, we scrutinized the validity performance of 27 distinct models. Apart from the text-davinci model, which has an inherent strict API context length constraint, the outcomes for other models primarily fall under the categories of Completed, Invalid Format, Invalid Action, and Task Limit Exceeded.

From the detailed analysis showcased, key trends emerge. As depicted in Figure 6, the chart offers a clear visualization of the validity distribution across distinct models and defined categories, enabling us to derive insightful conclusions.

J.2 Findings

Based on the data presented in Table 5, we can draw a few important observations on the performance differentiation between Commercial API-based models and Open-Sourced models. It’s noteworthy to highlight the areas of Invalid Format and Invalid Action, where the Open-Sourced models report more challenges. Specifically, 10.4% of the Open-Sourced model outcomes were marked as Invalid Format, in comparison to the 6.0% from Commercial API-based models. Similarly, Invalid Actions were seen more in Open-Sourced models (13.6%) than in Commercial API-based models (4.6%). These discrepancies might be indicative of the robustness and generalization abilities of commercial models, or perhaps the attention to details during the model’s design and training phases, especially instruction following.

It’s also worth noting that even some of the best models might sometimes overlook important instructions.

Although we clearly instructed the correct format of DB task:

Even gpt-4 still sometimes fail to respond correctly.

Neither "Action" label nor a correct SQL statement is returned. We speculate that this may arise due to the models internalizing certain output patterns during their training or alignment processes, causing them to neglect specific task directives.

A fundamental capability of an agent is the possession of coherent and unified thought processes that enable the formulation and implementation of viable plans based on real-world conditions. Many models possess the ability to analyze and formulate initial plans upon encountering a problem. However, even some of the most advanced models can easily deviate from or forget their original plans. The disparity in the ability of different models to consistently follow thought sequences when executing plans is relatively vast. This capability profoundly influences the efficacy and operational potency of Language Models (LLMs) acting as agents. Here wwe exemplify this phenomenon with the House Holding environment.

The House Holding environment encompasses a simulated domestic setting in which models are required to select appropriate actions from a given action space, based on observations of the surrounding environment provided by the task and given objectives to complete. With a multitude of entities and a plethora of available actions, the House Holding environment offers a high degree of freedom, which intensely challenges a model’s ability to maintain clear and coherent thought processes.

A success example by gpt-4 is shown below.

From the dialogue history, it’s evident that gpt-4 has consistently maintained clear and coherent thought processes. As illustrated in Figure 7, gpt-4 systematically completed the task by following a clear sequence of steps. It initially decomposed the task into a sequence of Find -> Clean -> Put. Subsequently, it undertook a depth-first search within the abstract planning tree. Impressively, after each exploration, it successfully backtracked to the parent node. This consistent cognitive capability significantly propelled gpt-4 ahead of other models.

Moreover, it’s noteworthy that gpt-4 encountered a moment of perplexity when it failed to find the desired soapbar after examining the Toilet. However, it promptly realized that there was one last location left unchecked, the countertop. Initially, gpt-4 might have assumed it needed to retrieve the soapbar from elsewhere to place it on the countertop, without considering the possibility that the soapbar might already be there. Evidently, gpt-4 demonstrated the capacity for self-reflection, allowing it to reassess and modify its assumptions when they proved unfruitful. This ability for self-evaluation and readjustment further assisted gpt-4 in completing tasks that required deeper contemplation.

In contrast to the above is the performance of gpt-3.5-turbo on the same sample.

While gpt-3.5-turbo was able to decompose the task, it struggled to adhere to its initial plan. As it encountered failed attempts, the model gradually lost sight of the original plan.

In light of the aggregated results, we posit that code tuning significantly aids the model’s performance in relatively straightforward and procedural tasks. The outcome tables demonstrate that the CodeLlama series consistently outperforms the Llama2 series in webshop tasks. However, the downside of code tuning appears to be a potential compromise in the model’s logical reasoning capacity and situational awareness. In the digital card game scenario, the CodeLlama series lagged behind the Llama2 series. The primary distinction between the two scenarios lies in the guidance provided. In the webshop, the one-shot prompt precisely outlines a shopping process template, which, when followed simplistically, leads to satisfactory scores. In contrast, the Digital Card Game demands that the model assess the current status of both competitors, devise intricate counter-strategies, and achieve high scores without the crutch of a simple procedural template.

As illustrated in the figure, the completion rate of the codellama series in the WebShop tasks significantly surpasses that of the llama2 series.

In many test cases, the primary reason for the model’s failure is its inability to identify its own mistakes from the error feedback provided by the environment. This is especially evident in the DB task. Models with the ability to self-correct their SQL statements significantly outscore others. We use claude-2 as a representative example to illustrate this capability.

As indicated in the log, claude-2 successfully discerned from the MySQL error message that it had overlooked adding backticks around fields with spaces in the SQL statement.