AutoWebGLM: A Large Language Model-based Web Navigating Agent

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang

Introduction

The concept of autonomous digital agents as helpful assistants is an enticing prospect. Enhanced by LLMs’ formidable comprehension and response capabilities [1; 35; 36; 45; 46; 34], we can envision various scenarios unimaginable before. For instance, an LLM-based agent could support a daily routine that summarizes the online news for us during breakfast. This integration of LLMs into everyday tasks heralds a significant shift in how we interact with technology, optimizing our efficiency and redefining the boundaries of machine-assisted productivity [41; 37; 21].

Lack of Unified Action Space: There is no universal, convenient action space that covers all the operations needed to execute tasks in a browser across various websites.

Lack of Webpage Simplification Method: The diversity and complexity of webpages, together with their tendency toward verbosity, make it hard for LLMs to comprehend the content and carry out correct operations. The token length of content-rich webpages often reaches 30k or more.

Lack of High-quality Training Trajectories: There are limited high-quality trajectories for training a strong LLM-based web agent. Existing trained agents notably lack the capability for correct inference and self-checking on web tasks. Once caught in an erroneous loop, they struggle to rectify the issue promptly.

In this work, we introduce AutoWebGLM, a deployable webpage browsing agent based on the open ChatGLM3-6B model. Different from its predecessor WebGLM, which focuses on retrieval-augmented web-scale question answering, AutoWebGLM is dedicated to autonomously accomplishing complex real-world missions by navigating and operating real web browsers as humans do. We propose various efficient data strategies to support the swift construction of a sizeable, reliable training dataset, given that state-of-the-art models cannot reliably complete data annotation tasks. Furthermore, by leveraging supervised and reinforcement learning methods, we train AutoWebGLM on top of the collected web agent dataset to achieve superior performance on general webpage browsing tasks. Going a step further, we employ rejection sampling finetuning (RFT) for lifelong learning in specific web environments, enabling the agent to become an expert in a particular domain.

We develop a Chrome extension based on AutoWebGLM (see Figure 2 for examples). Throughout our experiments, it can reason and perform operations on various websites to complete user tasks accurately, making it practically applicable to real-world services. In addition, we construct the first bilingual (English and Chinese) webpage browsing evaluation dataset, given that websites from different regions have substantial stylistic variations.

In conclusion, we make the following contributions in this paper:

We design and develop the AutoWebGLM agent for effectively completing web browsing tasks through curriculum learning, bootstrapped by self-sampling reinforcement learning, and RFT in the web browsing environment.

We construct a real webpage browsing operation dataset of approximately 10,000 traces using model-assisted and manual methods, including the bilingual (English and Chinese) web browsing benchmark AutoWebBench.

We perform experiments to demonstrate that AutoWebGLM with 6B parameters achieves performance comparable to the most advanced LLM-based agents, and more importantly, it reaches a practically usable level for real-world web tasks.

Method

We consider web browsing tasks as a sequential decision-making process. The state, denoted as $S$, includes the current page status (such as HTML, URL, and window position). The action set $A$ contains all potential browsing operations, including click, type, scroll, etc. See the complete list of operations in Table 1.

The state’s transition is determined by the webpage’s current state and the agent’s output action. The task will end when the agent outputs finish or reaches the maximum number of interactions.

During the decision-making process, the function $\phi$ updates the historical information based on the previous history $H_{t-1}$, the most recent action $A_{t-1}$, and the current state $S_t$.

The policy $\pi$ is the process by which the agent chooses actions based on the current state and the history. A complete decision process starts from the initial state $S_0$ and history $H_0$, iterating through the policy $\pi$ and the transition function $T$. This iteration ceases when the action $A_t$ is finish or when the maximum length is reached.
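The decision process described above can be sketched as a simple rollout loop. This is a minimal illustration, not the authors' implementation: the names `run_episode`, `policy`, `transition`, and `update_history`, as well as the dictionary-based state, are assumptions introduced here to stand in for $\pi$, $T$, and $\phi$.

```python
from typing import Callable, List, Tuple

State = dict    # hypothetical: e.g. {"html": ..., "url": ..., "scroll": ...}
Action = str    # hypothetical: e.g. "click", "type", "scroll", "finish"
History = List[Tuple[Action, State]]


def run_episode(
    s0: State,
    policy: Callable[[State, History], Action],                   # pi
    transition: Callable[[State, Action], State],                 # T
    update_history: Callable[[History, Action, State], History],  # phi
    max_steps: int = 10,
) -> List[Action]:
    """Roll out A_t = pi(S_t, H_t), S_{t+1} = T(S_t, A_t),
    H_{t+1} = phi(H_t, A_t, S_{t+1}) until "finish" or max_steps."""
    state, history, actions = s0, [], []
    for _ in range(max_steps):
        action = policy(state, history)
        actions.append(action)
        if action == "finish":
            break
        state = transition(state, action)
        history = update_history(history, action, state)
    return actions


# Toy environment: scroll until the page has been scrolled twice, then finish.
def toy_policy(state, history):
    return "finish" if state["scroll"] >= 2 else "scroll"

def toy_transition(state, action):
    return {**state, "scroll": state["scroll"] + 1}

def toy_phi(history, action, state):
    return history + [(action, state)]

trace = run_episode({"scroll": 0}, toy_policy, toy_transition, toy_phi)
```

In the toy run, the loop stops on the explicit finish action; with a policy that never finishes, it stops after `max_steps` interactions instead.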

2.2 The AutoWebGLM Framework

As depicted in Figure 3, after acquiring the HTML and a screenshot of the webpage, we process them through HTML simplification and OCR (Optical Character Recognition) modules to yield a simplified HTML representation. Using attributes that indicate whether an element is operable, we mark operable elements for agent interaction. The OCR module annotates text elements during image parsing.

The agent predicts an action by combining this representation with other observational data. Once an action is output, an automated web program executes it; this cycle repeats until the task terminates. By combining these components into a single framework, AutoWebGLM improves interactive capacity and webpage navigation precision.

A comprehensive, precise observation and action space is vital for constructing a robust web browsing framework. These spaces standardize the conversion of varied data sources into a uniform format.

We suggest using a unified observation space to enhance the model’s webpage comprehension and operation level. The observation space should provide information as close as possible to what the browser’s graphical interface can provide, thus maximizing the upper bound of the agent’s capabilities. We identify four critical indicators for web browsing tasks: task description, simplified HTML, current location, and past operation records. The HTML provides the model with structural and content information about the page, while the current location information helps the model understand its position within the webpage. The record of past operations provides the model with historical context, which helps to generate more consistent subsequent operations. By incorporating these elements into the observation space, we strive to construct a more resilient and practical model that can handle the intricacy and variability inherent in web browsing tasks. The following are detailed illustrations of the observation space components.

HTML. The HTML webpages are vast and complex, so it is necessary to simplify them before inputting them into the model. The simplification process aims to extract essential information while eliminating redundant or disruptive elements that could hinder the model’s understanding. Throughout this process, the HTML’s basic structure and significant content information must be retained to enable the model to comprehend and utilize this information for effective web browsing. The algorithm in Appendix A can efficiently convert a tree of elements into a concise representation. We can use the processing techniques to streamline the original HTML format into a more understandable structure for the model to interpret and manage, improving model effectiveness in web browsing tasks.
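To make the idea of HTML simplification concrete, the following is a minimal sketch using only Python's standard `html.parser`. It is not the algorithm from Appendix A: the sets `OPERABLE`, `SKIPPED`, and `KEPT_ATTRS`, the sequential `id` labels, and the function `simplify_html` are all assumptions chosen for illustration. It keeps visible text and operable elements, and drops scripts, styles, and most attributes.

```python
from html.parser import HTMLParser

# All three sets below are illustrative assumptions, not the paper's rules.
OPERABLE = {"a", "button", "input", "select", "textarea"}
SKIPPED = {"script", "style", "noscript"}
KEPT_ATTRS = ("type", "placeholder", "value", "aria-label")


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []        # pieces of the simplified page
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.next_id = 0     # label assigned to the next operable element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in OPERABLE:
            kept = "".join(f' {k}="{v}"' for k, v in attrs if k in KEPT_ATTRS and v)
            self.out.append(f'<{tag} id="{self.next_id}"{kept}>')
            self.next_id += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in OPERABLE and tag != "input":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self.skip_depth == 0:
            self.out.append(text)


def simplify_html(html: str) -> str:
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return " ".join(parser.out)


page = ('<div class="x"><script>var a=1;</script><p>Hello</p>'
        '<a href="/n" class="c">News</a>'
        '<input type="text" placeholder="Search"></div>')
simplified = simplify_html(page)
```

The sketch flattens the tree for brevity; a fuller version would preserve nesting, since the text states that the basic structure of the HTML must be retained.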

Current Position. Based on our observation of the model’s interaction with the environment, agents could perform better when provided with window position and page size. The agent uses the page scroll position to understand the content of the currently visible area and the page height information to comprehend the scale of the entire page, providing a spatial context for the model.

Previous actions. The best way to inform the agent of past operations is to provide them explicitly. This helps the agent understand its past behavior and prevents it from getting stuck in an ineffective loop of repeating the same actions after operational failures, improving its ability to adapt to the complexities and dynamics of web browsing tasks.
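The four components of the observation space can be gathered into one structure and rendered as model input. The sketch below is only illustrative: the class `Observation`, its field names, and the text layout produced by `to_prompt` are assumptions, and the actual prompt format is the one referred to in Appendix D.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    task: str                 # task description
    simplified_html: str      # output of the HTML simplification step
    scroll_top: int           # current scroll offset in pixels
    window_height: int        # height of the visible window in pixels
    page_height: int          # total page height in pixels
    previous_actions: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the observation as plain text (hypothetical layout)."""
        history = "\n".join(
            f"{i + 1}. {a}" for i, a in enumerate(self.previous_actions)
        ) or "(none)"
        visible_end = self.scroll_top + self.window_height
        return (
            f"Task: {self.task}\n"
            f"Position: {self.scroll_top}-{visible_end} of {self.page_height}px\n"
            f"Previous actions:\n{history}\n"
            f"HTML:\n{self.simplified_html}"
        )


obs = Observation(
    task="Find news",
    simplified_html='<a id="0"> News </a>',
    scroll_top=0,
    window_height=800,
    page_height=2400,
    previous_actions=["scroll(direction='down')"],
)
prompt = obs.to_prompt()
```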

2.2.2 Action space

As the approach of this work is to build a language model-based web browsing agent, we focus on operational possibilities when constructing the action space. Based on an extensive summary of experience from real task execution, we define a complete and self-consistent action space (Table 1), formulated as function calls [21; 11], for the language model to act in the web browsing world. Our input prompt is given in Appendix D.
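Because actions are formulated as function calls, the model output can be parsed into a name and arguments before execution. The sketch below assumes a Python-like call syntax such as `click(id="12")`; this concrete syntax, the `ALLOWED` set (built only from operation names mentioned in the text), and the function `parse_action` are illustrative assumptions, since Table 1 defines the actual action space.

```python
import ast

# Operation names mentioned in the text; the full set is defined in Table 1.
ALLOWED = {"click", "type", "scroll", "jump_to", "finish"}


def parse_action(text: str):
    """Parse a model output such as 'click(id="12")' without executing it."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a valid call: {text!r}") from exc
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a function call: {text!r}")
    name = node.func.id
    if name not in ALLOWED:
        raise ValueError(f"unknown action: {name}")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError("**kwargs are not supported")
    # literal_eval only accepts literals, so arbitrary expressions are rejected.
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs


parsed_click = parse_action('click(id="12")')
parsed_type = parse_action('type(id="3", text="todays weather report")')
```

Parsing through the syntax tree rather than `eval` means a malformed or unexpected output raises an error instead of running code.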

2.3 Data Preparation

Considering the scarcity of high-quality, complex web browsing data produced by actual users, we aim to create a training dataset. However, the dataset construction presents several challenges:

Task Collection: A significant hurdle is acquiring diverse, real-user task queries across various websites.

Privacy and Security: Privacy and security limitations hinder the direct acquisition of user browser operation sequences. It is also challenging to rule out redundant or incorrect operations not pertinent to task completion and to confirm user task completion.

Objective Annotation: The labor-intensive nature of collecting user objectives for each operational step makes it impractical in real-world data-gathering scenarios.

Model Limitations: Current models cannot process complex user queries across different websites, thus eliminating the chance of using purely automated methods for accurate browsing trajectory collection in real and complex application contexts.

As illustrated in Figure 4, we suggest a hybrid human-AI Data Construction method to create our training data to deal with these challenges. After careful consideration, we categorize our data into two types for construction:

For web browsing tasks, efficient and accurate understanding and manipulation of webpages become vital challenges in model development due to the diversity of user behaviors and the complexity of web content. This section illustrates our construction method for web recognition and simple task operation to train models to recognize webpage structures and perform basic operations accurately.

Web Recognition. The main objective of Web Recognition includes understanding particular HTML formats, identifying different types of web elements (such as text boxes, buttons, images, etc.), and understanding the role of these elements in user interaction. We propose the following construction approach based on the above practical challenges.

We start by collecting URLs of mainstream Chinese and English websites listed on Similarweb (https://www.similarweb.com/top-websites). In the data processing stage, we use our HTML parser to identify operable components in each webpage and record essential information such as component position and size. We then generate a simplified HTML by rearranging and simplifying the component tree (see details in Section 2.2).

We design tasks such as website and component function descriptions to aid model recognition of webpage structures and interactive components’ functions. For each task, we develop a series of natural language questions to serve as the source field in our data. GPT-3.5-Turbo is utilized to generate multiple formulations for each question, thereby diversifying the question formation.

For the target, we leverage GPT-3.5-Turbo to generate the response. We supply a simplified HTML with the pertinent question in the prompt and impose a limit on the response length, thereby obtaining our target.

Simple Task Operation. The main objective of the Simple Task Operation dataset is to train models to perform single-step web operations. This involves executing basic functionalities on web pages, such as clicking links, filling out forms, or navigating to specific sections. To build our data, we collect various websites in the same way as Web Recognition. Then, we construct a split for each operation type to ensure that our dataset covers all the requirements for simple task operations. We adjust the data size for each split based on the frequency of each operation in practice.

The key to our dataset construction is to rely on rules instead of model generation. We first tried using GPT-3.5-Turbo to generate tasks, intents, and operations, and Selenium (https://www.selenium.dev) to validate the executability of the generated results. However, this approach has obvious drawbacks: the model cannot reach acceptable accuracy in producing operations that fulfill the task, and the correctness of model-generated operations is hard to judge. To address these issues, we approach the problem from the opposite direction. We identify the actionable elements within the webpage and assemble them into web operations. Then, we use GPT-3.5-Turbo to produce the corresponding tasks and operational intents for these operations. For operation types with relatively fixed behaviors, such as Scroll and Jump_to, we directly generate their corresponding tasks with templates; for flexible and feature-rich operations, such as Click and Type, we use GPT-3.5-Turbo to help complete the construction. This approach ensures that the instructions are executable while keeping the operation tasks diverse.
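The rule-first construction described above can be sketched as follows. The operation comes first and is executable by construction; the task text is then derived from it. The template wording, the operation string format, and the helpers `templated_sample` and `operation_from_element` are hypothetical, and the GPT-3.5-Turbo call that would write tasks for Click and Type is deliberately left out.

```python
import random

# Hypothetical task templates for operations with fixed behavior.
TEMPLATES = {
    "scroll": [
        "Scroll {direction} by {amount} page(s).",
        "Move {direction} {amount} page(s) on this site.",
    ],
    "jump_to": [
        "Open the page {url}.",
        "Go to {url}.",
    ],
}


def templated_sample(op: str, rng: random.Random, **slots) -> dict:
    """Build a (task, operation) pair for Scroll / Jump_to from a template."""
    task = rng.choice(TEMPLATES[op]).format(**slots)
    args = ", ".join(f"{k}={v!r}" for k, v in slots.items())
    return {"task": task, "operation": f"{op}({args})"}


def operation_from_element(element: dict) -> str:
    """Assemble an operation from a parsed, operable element.

    The matching task and intent would then be written by an LLM
    (not shown), so the operation itself is never model-generated.
    """
    if element["tag"] in {"input", "textarea"}:
        return f'type(id={element["id"]!r}, text={element.get("example_text", "")!r})'
    return f'click(id={element["id"]!r})'


rng = random.Random(0)
jump = templated_sample("jump_to", rng, url="https://example.com")
scroll = templated_sample("scroll", rng, direction="down", amount=1)
click = operation_from_element({"tag": "button", "id": "7"})
```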

2.3.2 Complex Task Operation Construction

We developed a dataset for complex web tasks to enable the model to make plans and reason in the web browsing scenario. Each sample in the dataset consists of a real-world complex web browsing task, the sequence of operations to complete the task, and the intent of each step.

We first designed 50 complex tasks for each website using a prompting technique based on Evol-Instruct, from which about 20 feasible tasks were manually selected and labeled. For the operation sequences, the tasks are so complex that even the most advanced LLMs cannot complete them with satisfactory accuracy. Therefore, we relied on manual annotation, capturing web task executions via a browser plugin that records actions during website tasks. Chain-of-thought (CoT) reasoning has been shown to significantly improve task comprehension and model performance [17; 39]. However, having human annotators document their intent and reasoning during web browsing is inefficient. To improve the CoT construction process, we used GPT-4 as the operational intent predictor. Our first approach, iterative step-by-step generation, produced weakly linked intents across steps and incurred high API costs. To address this, we employed a global thought chain prompting method, in which all operations of a trace and the critical HTML segments are provided as input at once. Then, we prompted GPT-4 to output the intent of each step. This method improves the accuracy and cohesion of each step, thus forming highly relevant, consistent thought chains.
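The difference between step-by-step and global prompting is that the global variant issues one request per trace, with every step visible at once. A minimal sketch of that idea is given below; the prompt wording, the `Step i: <intent>` reply format, and the helpers `build_global_intent_prompt` and `parse_intents` are assumptions, and the model call itself is omitted.

```python
import re
from typing import Dict, List


def build_global_intent_prompt(task: str, steps: List[Dict[str, str]]) -> str:
    """Put the whole recorded trace into a single prompt (hypothetical wording)."""
    lines = [f"Task: {task}", "Below is the full recorded trace.", ""]
    for i, step in enumerate(steps, 1):
        lines.append(f"Operation {i}: {step['operation']}")
        lines.append(f"Relevant HTML {i}: {step['html']}")
    lines.append("")
    lines.append(
        f"For each of the {len(steps)} operations, state the intent behind it, "
        "one line per step, formatted as 'Step i: <intent>'."
    )
    return "\n".join(lines)


def parse_intents(reply: str, n_steps: int) -> List[str]:
    """Extract one intent per step from the reply; fail if any step is missing."""
    found = {
        int(i): text.strip()
        for i, text in re.findall(r"^Step (\d+):\s*(.+)$", reply, flags=re.M)
    }
    missing = [i for i in range(1, n_steps + 1) if i not in found]
    if missing:
        raise ValueError(f"no intent returned for steps {missing}")
    return [found[i] for i in range(1, n_steps + 1)]


steps = [
    {"operation": 'type(id="3", text="camera")', "html": '<input id="3">'},
    {"operation": 'click(id="9")', "html": '<a id="9"> First result </a>'},
]
prompt = build_global_intent_prompt("Pick a camera", steps)
intents = parse_intents("Step 1: Search for cameras\nStep 2: Open the first result", 2)
```

With a trace of k steps, this costs one request instead of k, which matches the cost argument made in the text.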

After construction, we merge our data with the training set from Mind2Web and MiniWob++ to form our final training dataset. The proportion of each split is in Figure 5.

2.3.3 AutoWebBench Construction

We segment the complex task operation dataset collected in Section 2.3.2 for evaluation. AutoWebBench is divided into two splits, in-domain and out-of-domain, which serve as the bases for our performance assessment. The in-domain split contains data collected from the same websites as the training data, measuring the model’s performance under familiar conditions. In contrast, the out-of-domain split contains data collected from websites entirely excluded from our training set, which lets us measure the model’s generalizability and ability to adapt to unfamiliar environments. We select 50 browsing traces for each split as our test data. These traces are scrutinized and filtered via human verification, ensuring a more reliable evaluation benchmark.
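One way to realize such a split is to hold out whole websites for the out-of-domain test set and to draw the in-domain test set from the remaining websites. The sketch below illustrates this under assumptions: the function `split_traces`, the `website` field, and the random selection are hypothetical, and the human verification step is not modeled.

```python
import random
from typing import Dict, List, Set, Tuple


def split_traces(
    traces: List[Dict],
    held_out_sites: Set[str],
    n_test: int,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Return (train, in_domain_test, out_of_domain_test).

    Traces from held-out websites never enter the training set.
    """
    rng = random.Random(seed)
    unseen = [t for t in traces if t["website"] in held_out_sites]
    seen = [t for t in traces if t["website"] not in held_out_sites]
    rng.shuffle(seen)
    in_domain_test, train = seen[:n_test], seen[n_test:]
    out_of_domain_test = rng.sample(unseen, min(n_test, len(unseen)))
    return train, in_domain_test, out_of_domain_test


all_traces = [
    {"website": site, "id": i}
    for i, site in enumerate(["a.com"] * 5 + ["b.com"] * 5 + ["c.com"] * 4)
]
train, in_test, out_test = split_traces(all_traces, {"c.com"}, n_test=3)
```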

Drawing on the methodology presented in Mind2Web, we comprehensively evaluate each step involved in the operation. This allows us to assess the step and overall accuracy of the model’s operations. Detailed results of this evaluation are available in Table 2.
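Step-level and overall accuracy can be computed as sketched below. This is an assumed, simplified scoring rule: a step counts as successful when the predicted (element, operation) pair equals the annotated one, and a trace counts as successful only if every step does. The function `evaluate` and the tuple encoding of steps are illustrative; the exact matching criteria follow Mind2Web and are not reproduced here.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # hypothetical encoding: (element id, operation)


def evaluate(traces: List[Tuple[List[Step], List[Step]]]) -> Tuple[float, float]:
    """Return (step success rate, overall trace success rate).

    Each item is (predicted steps, gold steps) for one trace.
    """
    step_hits = total_steps = full_hits = 0
    for predicted, gold in traces:
        hits = [p == g for p, g in zip(predicted, gold)]
        hits += [False] * (len(gold) - len(hits))  # missing predictions fail
        step_hits += sum(hits)
        total_steps += len(gold)
        full_hits += all(hits)
    return step_hits / total_steps, full_hits / len(traces)


results = evaluate([
    ([("12", "click"), ("3", "type")], [("12", "click"), ("3", "type")]),
    ([("5", "click"), ("9", "click")], [("5", "click"), ("7", "click")]),
])
```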

2.4 Training

We train the model through three steps illustrated in Figure 6.

The first one is Supervised Fine-Tuning (SFT). We utilize the data described in Section 2.3 for training.

This approach enhances the model’s comprehension of webpages and its capability as an agent to perform operations within the environments. Significantly, we use curriculum learning (CL), which mimics the human learning process, advocating for models to start learning from easy samples and gradually advance to complex ones. It has been demonstrated in prior works [6; 38] to improve model capabilities substantially.

Enabling LM to Read and Operate on the Web. In the initial stage, we mix the data constructed in Section 2.3.1 to equip the model with the ability to (1) comprehend the structure of web pages and the functions of various web components, and to (2) execute predefined operations on the current webpage, thus implementing simple user instructions.

To Make LM Learn to Plan & Reason on the Web. During this stage, we continue to employ the constructed data in Section 2.3.2 for training. We enable our model to decompose tasks into subtasks and execute subsequent steps based on the current webpage and the sequence of prior operations.

After the above training, our model $M_{\text{SFT}}$ acquired the essential capability to complete web browsing tasks and could independently execute operations based on user instructions.

2.4.2 Step 2: Reinforcement Learning

Following the previous training, $M_{\text{SFT}}$ has demonstrated some ability to operate the browser and infer the task. However, due to the distinctive nature of SFT training, $M_{\text{SFT}}$ attempts to mimic the inference process and operations but sometimes overlooks the webpage’s state and preceding operation sequences, leading to hallucination. Consequently, we propose a self-sampling reinforcement learning method to mitigate these operative illusions.

First, we use $M_{\text{SFT}}$ for $n$-fold sampling ($n = 20$) on complex task operation samples in the training set. We combine the sampled output and golden answer to construct contrastive data with positive and negative pairs. Subsequently, we retain samples based on the following criteria:

From all $n$ iterations of sampling, we select data where the model completed the task between 1 and $n-1$ times. If $M_{\text{SFT}}$ answered all iterations correctly, we consider the sample devoid of training value and incapable of providing practical negative examples. Conversely, if $M_{\text{SFT}}$ answered incorrectly across all iterations, we suspect issues with the data and exclude them, as the model cannot adequately fit these outliers during optimization.

We retain different erroneous operations and remove duplicates to preserve distinct negative examples.
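The two retention criteria can be expressed as a small filter. In this sketch the golden answer is used as the preferred output and each distinct incorrect sample as a rejected output; that pairing, the function `build_contrastive_pairs`, and the `is_correct` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def build_contrastive_pairs(
    prompt: str,
    golden: str,
    samples: List[str],
    is_correct: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Keep a training sample only if it was solved 1 to n-1 times out of n."""
    n_correct = sum(is_correct(s) for s in samples)
    if n_correct == 0 or n_correct == len(samples):
        return []  # always wrong (suspect data) or always right (no negatives)
    # dict.fromkeys removes duplicate errors while keeping first-seen order.
    negatives = list(dict.fromkeys(s for s in samples if not is_correct(s)))
    return [{"prompt": prompt, "chosen": golden, "rejected": neg} for neg in negatives]


gold = 'click(id="1")'
sampled = ['click(id="1")', 'click(id="2")', 'click(id="2")', 'click(id="3")']
pairs = build_contrastive_pairs("<observation>", gold, sampled, lambda s: s == gold)
```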

After constructing the contrastive data $D_{\text{Const.}}$, we employ the DPO training approach to make $M_{\text{SFT}}$ learn from its mistakes and further enhance its capabilities. During training, we found that using the DPO loss directly led to instability. To mitigate this issue, we include an SFT loss to stabilize the reinforcement learning process, which lets us increase the number of training steps without losing the original model’s natural language and agent abilities, yielding a more robust model $M_{\text{DPO}}$.
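For a single preference pair, a DPO loss combined with an SFT term can be sketched in plain Python as below. The DPO term is the standard form, $-\log \sigma(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{\text{ref}}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{\text{ref}}(y_l|x))])$. How the two terms are combined is an assumption here: the sketch adds the SFT negative log-likelihood of the chosen output with a fixed weight, using the values $\beta = 0.15$ and 0.8 reported in Appendix B as defaults. The function name `dpo_sft_loss` is hypothetical.

```python
import math


def dpo_sft_loss(
    logp_chosen: float,        # log pi_theta(y_w | x), summed over tokens
    logp_rejected: float,      # log pi_theta(y_l | x)
    ref_logp_chosen: float,    # log pi_ref(y_w | x)
    ref_logp_rejected: float,  # log pi_ref(y_l | x)
    beta: float = 0.15,
    sft_weight: float = 0.8,   # assumed to weight the SFT term
) -> float:
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written in a numerically stable form.
    dpo = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    sft = -logp_chosen  # negative log-likelihood of the preferred output
    return dpo + sft_weight * sft


# When the policy equals the reference, the DPO term is log 2.
loss_at_init = dpo_sft_loss(-1.0, -2.0, -1.0, -2.0)
loss_improved = dpo_sft_loss(-0.5, -3.0, -1.0, -2.0)
```

The SFT term keeps the likelihood of the preferred output from drifting downward, which is one plausible reading of why it stabilizes training.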

2.4.3 Step 3: Rejection Sampling Finetuning

In the RFT (Rejection Sampling Finetuning) step, we aim to optimize for webpage environments in specific domains. RFT enables us to perform targeted training by sampling extensively from an existing model and using reward signals to select accurate trajectories for instances that lack them. Our reward signals can be furnished either by the environment itself or by pre-designed reward models. Due to the network policy constraints inherent in real webpage environments, we conduct our experiments within the sandbox environments furnished by MiniWoB++ and WebArena.

For MiniWoB++, we leverage the query generator in MiniWoB++ to auto-generate multiple user queries for each task. We determine the number of generated queries for each task based on its difficulty. Then, we employ $M_{\text{DPO}}$ to try to solve the queries. If a trace completes the task (as adjudged by the MiniWoB++ environment), we consider this trace a positive trace.

In the case of WebArena, to prevent overlap with the test set, we manually construct multiple distinctive user queries based on WebArena’s templates. For each sample, we use $M_{\text{DPO}}$ to sample 64 times. Similarly, if our model completes the task at least once (adjudged by manually written rules), we deem the successful trace a positive trace.

By utilizing the methods above, we constructed two distinct successful datasets, one from MiniWob++ and the other from WebArena. These comprise approximately 15k traces (66k steps) and 240 traces (2k steps), respectively, which are used for AutoWebGLM’s individual finetuning on these two tasks.
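The rejection sampling loop can be sketched as follows. The callbacks `sample_trace` (one rollout of the current model) and `is_success` (the environment check or rule-based judge) are hypothetical placeholders, and keeping only the first successful trace per query is an assumption; the text states only that a query yields a positive trace if the model succeeds at least once.

```python
from typing import Callable, Dict, List


def rejection_sample(
    queries: List[str],
    sample_trace: Callable[[str], str],
    is_success: Callable[[str, str], bool],
    k: int = 64,
) -> List[Dict[str, str]]:
    """Sample up to k traces per query and keep the first successful one."""
    positives = []
    for query in queries:
        for _ in range(k):
            trace = sample_trace(query)
            if is_success(query, trace):
                positives.append({"query": query, "trace": trace})
                break  # queries with no success in k tries contribute nothing
    return positives


# Toy stand-ins: query "a" succeeds on its third attempt, "b" never does.
attempts: Dict[str, int] = {}

def fake_sampler(query: str) -> str:
    attempts[query] = attempts.get(query, 0) + 1
    return f"{query}-try{attempts[query]}"

def fake_judge(query: str, trace: str) -> bool:
    return query == "a" and trace.endswith("try3")

positive_traces = rejection_sample(["a", "b"], fake_sampler, fake_judge, k=5)
```

The resulting positive traces would then serve as additional finetuning data for the specific environment.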

Experiments

We establish a bilingual (Chinese-English) benchmark AutoWebBench and evaluate the abilities of publicly available agents. We also conduct extensive experiments on numerous benchmarks to evaluate the performance of AutoWebGLM in comparison to several baselines across various tasks involving navigating both English and Chinese websites.

Beyond AutoWebBench, we also test AutoWebGLM on three other established web navigation benchmarks: Mind2Web, MiniWoB++, and WebArena.

AutoWebBench. As discussed in Section 2.3.3, we divide the test set into four splits for evaluation: Chinese, English, in-domain, and out-of-domain. We use the Step Success Rate (SSR) as our evaluation metric. All models are evaluated with an in-context learning prompt as described in Appendix E. The results are in Table 2. As the table shows, AutoWebGLM, after multi-task training, excels at predicting general user operation patterns and aligns well with user operations. In contrast, the other baselines, lacking sufficient training, struggle to accurately learn user operations across different real-world websites from webpage content and task descriptions.

Mind2Web. We use the settings from Mind2Web with SSR as our primary evaluation metric. To compare the models fairly, we utilize the MindAct framework provided by Mind2Web to evaluate the model’s performance. The results are in Table 3. We obtained the baseline results from references [4; 9; 12; 14; 8].

MiniWoB++ & WebArena. For MiniWoB++, following the experimental setup from WebAgent, we test on 56 tasks, running 100 evaluation episodes per task to evaluate model capabilities. For WebArena, we integrate our HTML parser module and action execution module into the WebArena environment to make it compatible with our system. The results are in Table 7. For the WebArena baselines, the results are taken from references [47; 43; 44]. For the MiniWoB++ baselines, some of the results are taken from prior work. The LLaMA2 results are obtained through training and evaluation on the MiniWoB++ dataset.

3.2 Ablation Study

To evaluate the impact of different stages of data and training strategies on model performance enhancement, we conduct a comprehensive ablation study in Table 4.

Training Data Ablation. We train and test models that use only the original training set, as well as models that additionally incorporate the simple and complex task data (see Section 2.3). This approach helps to qualitatively measure the impact of the different datasets on the model.

The Complex Task dataset significantly improves model performance. We hypothesize that this is due to the complex data more closely aligning with real-world scenarios, thereby fundamentally transforming model performance.

The simple task dataset shows only a slight improvement when training alone. However, when training jointly with the complex task dataset, there is a significant improvement. We find that training exclusively with complex task datasets leads to basic operational errors, suggesting that training with simple task datasets can effectively mitigate this problem.

Training Strategy Ablation. We compare the results of SFT, DPO, and RFT-enhanced models and find that: (1) Compared to SFT, the DPO training facilitates model learning from its mistakes, further enhancing model performance. (2) RFT enables our model to perform bootstrap enhancement in different domains. With practice comes proficiency, resulting in improvements within each domain.

3.3 Case Study and Error Analysis

To assess the effectiveness of our model, we conduct a series of case studies on web-based tasks, including everyday use, leisure and relaxation, and academic research, covering the typical range of web requirements. Our system achieves satisfactory results in most scenarios, with several specific cases detailed in Appendix G.

While our system performs commendably well on a variety of web-based tasks, it has limitations. We identify errors that occasionally occur during task execution, which can be broadly categorized into four types: hallucinations, poor graphical recognition, misinterpretation of task context, and pop-up interruptions. Table 5 outlines the proportion of these errors observed during error analysis. Although relatively infrequent, these errors are crucial in our ongoing efforts to refine and enhance the system’s capabilities.

Related Work

Constructing a comprehensive web browsing agent is a complex task that involves various modules, such as a language model for decision-making and an HTML parser for environment observation. Furthermore, it is essential to have appropriate web browsing evaluation criteria when creating an effective web browsing agent. In this section, we will discuss the works related to these aspects.

Language Models (LMs). Large language models (LLMs), such as GPT-4, Claude-2, LLaMA-2, GLM-130B [45; 10], OPT, and BLOOM, have accumulated extensive knowledge in various natural language processing tasks. However, due to the high cost of deploying such large language models, smaller models with lower costs and comparable capabilities are usually preferred. Many open-source models, such as LLaMA2-7B and ChatGLM3-6B, have demonstrated performance comparable to large language models in some domains.

Benchmarks for Web Navigation. The primary web browsing evaluation datasets provide a variety of evaluation metrics. MiniWoB++ provides several simulated web environments, with tasks primarily to evaluate the model’s ability to interact with webpage components. However, with the increasing demand for complex web operation capabilities, Mind2Web and WebArena have been created. Mind2Web is an offline evaluation set for complex web browsing that provides several metrics. The evaluation method is straightforward and commonly used for model evaluations. In contrast, the WebArena benchmark, based on real websites, creates multiple virtual environments and uses various evaluation methods to assess the task completion rate, making it more suitable for real-world task completion evaluation.

Agents for Web Automation. Previous work such as WebGPT and WebGLM combined LLMs with web environments. However, their primary application was question-answering (QA) tasks [33; 28; 7; 18], utilizing internet resources to answer user queries. Recent works [25; 14; 8; 43] focus more on executing complex operations or interactive tasks. Specifically, MindAct works by filtering webpage elements and selecting the element through multiple rounds of multiple-choice questions. It often requires more than ten model calls to complete a single web operation, which is inefficient. On the other hand, WebAgent uses HTML-T5 to process the observation space’s content, including HTML, previous operations, and user instructions. It uses the Flan-U-PaLM model to generate code to control webpages, exhibiting excellent web browsing performance. However, it faces deployment challenges due to the size of the Flan-U-PaLM model, which has 540B parameters. AutoWebGLM, based solely on a single ChatGLM3-6B, has a robust web browsing capability comparable to WebAgent, demonstrating high value for practical deployment.

Prompt-based Data Construction Methods. Constructing data through prompts has recently gained significant traction [39; 15; 31; 26]. This approach leverages language models to generate synthetic data for training. A notable example is Evol-Instruct [42; 23], inspired by the theory of evolution, which demonstrates the effectiveness of using LLMs to generate diverse and complex instructions for various tasks. Additionally, some researchers explore the potential of generating data in a zero-shot setting, where the model produces data for tasks it has not been explicitly trained on, highlighting the versatility of prompt-based data construction. These methodologies are evolving rapidly, offering a promising avenue for data generation in various domains, especially where traditional data collection methods are impractical or insufficient.

Conclusion

In this work, we present AutoWebGLM, an advanced language model-based agent exhibiting robust performance in various autonomous web navigation benchmarks. Our model addresses existing LLM limitations by simplifying webpages to effectively control HTML text length and by handling the web’s open-domain nature. We strategically employ curriculum learning, reinforcement learning, and rejection sampling finetuning to enhance webpage comprehension and browser operation learning. We also introduce a unique bilingual web browsing benchmark, AutoWebBench, that lays a solid foundation for future research. Our findings represent significant progress in utilizing LLMs for intelligent agents.

References

Appendix A HTML Pruning Pseudo Code

Appendix B Implementation Details of AutoWebGLM

During the SFT phase, we set the learning rate to 1e-5 with a batch size of 32. In the DPO stage, we sample the complex task dataset 20 times. After the filtering process, we build a contrastive dataset of approximately 13k samples. We set the learning rate for DPO to 1e-6, the batch size to 64, and the $\beta$ parameter to 0.15. We add the SFT loss, weighted by a factor of 0.8. During the RFT stage, we collect samples from two diverse environments, MiniWoB++ and WebArena, resulting in successful datasets of approximately 66k and 2k steps, respectively, which are used for finetuning. The learning rate for this stage is 1e-5, and the batch size is 32.

Appendix C Full Results of MiniWoB++

Table 6 is the per-task average success rate on 56 tasks from MiniWoB++.

Appendix D Input Prompt

Below is our input prompt for AutoWebGLM training and inference:

Appendix E Data Construction Prompt

Data construction prompt for web recognition description:

Data construction prompt for simple task:

Data construction prompt for complex task trace intent:

Appendix F Annotation Details

The annotation process was performed by 20 annotators for one month using the Google Chrome browser with our plugin installed to record their actions on assigned websites. The annotators first visited the target websites and checked whether the website descriptions matched the actual tasks. They then evaluated the tasks for clarity, relevance, achievability, complexity, and subjectivity, skipping those that didn’t meet the criteria. They carefully recorded each step during a task, including any login or captcha steps. For tasks that required an answer, the annotators manually edited the responses. If a task was not doable, they could modify its description or abandon it. We provide the demonstration of our plugin for annotation in Figure 8 and the annotation documentation in Table 7.

Appendix G Demonstration

The targeted task to be executed is "What is the weather like today?". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "todays weather report"

As Figure 9 shows, we ended up on a webpage with a local weather report and obtained detailed information about today’s weather as the answer, effectively completing the target task.

G.2 Shopping Advice

The targeted task to be executed is "Help me pick a Christmas gift for kids". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "Christmas gift for kids"

Step3: Click "All" tag in the category selection

Step5: Click The first product on the result page

As Figure 10 shows, we ultimately landed on a camera product page, where we obtained a recommendation for that camera as the answer, essentially completing the task.

G.3 Searching Article

The targeted task to be executed is "Find an article about large language model". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "large language model"

As Figure 11 shows, we ultimately arrived at a page featuring an article and obtained "Found a relevant article" as the answer, essentially fulfilling the task.

G.4 Searching Tools

The targeted task to be executed is "Find a tool to solve the differential equation". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "tools to solve the differential equation"

As Figure 12 shows, we ultimately arrived at a page for an ODE (Ordinary Differential Equation) calculator and obtained "Found a relevant tool" as the answer, essentially completing the task.

G.5 Knowledge Query

The targeted task to be executed is "Search and tell me some basic info about the dark matter". The actual execution steps can be summarized as follows:

As Figure 13 shows, we ultimately reached a wiki page about dark matter, obtaining some basic info as the answer, and effectively completing the task.

G.6 Finding Pictures

The targeted task to be executed is "Help find a beautiful picture of the Pacific Ocean". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "Pacific Ocean Pictures"

Step3: Click A picture in the search result

Step5: Click Another in the search result

As Figure 14 shows, we ultimately reached a page displaying an image of the Pacific Ocean, obtaining "Found a picture of the Pacific Ocean for you" as the answer, effectively completing the task.

G.7 Finding Research

The targeted task to be executed is "Search and tell me a hot area in AI research". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "areas in AI research"

As Figure 15 shows, we ultimately reached a page about AI research, obtaining "Natural Language Processing(NLP)" as the answer, effectively completing the task.

G.8 Game Recommendation

The targeted task to be executed is "I want to play a PC game. Help me choose one". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "PC game recommendations"

As Figure 16 shows, we ultimately reached a page of game search results, obtaining a recommendation as the answer, effectively completing the task.

G.9 Playing Video

The targeted task to be executed is "Search and tell me a hot area in AI research". The actual execution steps can be summarized as follows:

As Figure 17 shows, we ultimately reached a page playing a funny video, effectively completing the task.

G.10 Online Shopping Assistance with Pop-Up Interruption

The targeted task to be executed is "Find and select a highly rated toaster". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "best toaster 2024"

Step3: Click a link from the search results leading to an online shopping site (Encounter a pop-up asking to subscribe to the newsletter.)

Step4: Scroll down, but the interaction is blocked by the pop-up

As Figure 18 shows, we do not reach the intended product selection. The presence of an unexpected pop-up interrupts the task execution, demonstrating the system’s limitation in handling unexpected graphical elements and pop-ups. This outcome underscores the need for enhanced capabilities in graphical recognition and interaction handling, ensuring smoother navigation and task completion on web pages with complex elements.

G.11 Knowledge Query with Hallucination

The targeted task to be executed is "Tell me some basic info about NLP". The actual execution steps can be summarized as follows:

As Figure 19 shows, this case is a classic example of hallucination: the system responded directly without browsing any webpage, so its response came from the model itself rather than from webpage information.

G.12 Technological Breakthrough Summary with Misinterpretation

The targeted task to be executed is "Summarize a recent technological breakthrough in renewable energy". The actual execution steps can be summarized as follows:

Step1: Type SearchBar "latest technological breakthrough in renewable energy 2024"

Step3: Click a link in search results (The system selects a link to a general overview of renewable energy technologies instead of a specific article on recent breakthroughs.)

As Figure 20 shows, we do not reach the intended outcome. Instead of summarizing a recent technological breakthrough, the system provides a generalized overview of renewable energy. This outcome highlights a misinterpretation of task context, demonstrating the system’s challenge in distinguishing between general information and specific, recent developments.

G.13 Map Query with Poor Graphical Recognition

The targeted task to be executed is "Where is Beijing relative to Shanghai according to the map". The actual execution steps can be summarized as follows:

As Figure 21 shows, we ultimately reached a page displaying a description of Beijing and a map of Beijing, obtaining "Northside" as the answer. This answer is overly simple. The case illustrates two minor flaws in our system: it has a limited understanding of graphical interfaces, and it sometimes hallucinates, so its answers do not always come from webpage information.