Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar

Introduction

Building generally capable embodied agents that continuously explore, plan, and develop new skills in open-ended worlds is a grand challenge for the AI community. Classical approaches employ reinforcement learning (RL) and imitation learning that operate on primitive actions, which could be challenging for systematic exploration, interpretability, and generalization. Recent advances in large language model (LLM) based agents harness the world knowledge encapsulated in pre-trained LLMs to generate consistent action plans or executable policies. They are applied to embodied tasks like games and robotics, as well as NLP tasks without embodiment. However, these agents are not lifelong learners that can progressively acquire, update, accumulate, and transfer knowledge over extended time spans.

Let us consider Minecraft as an example. Unlike most other games studied in AI, Minecraft does not impose a predefined end goal or a fixed storyline but rather provides a unique playground with endless possibilities. Minecraft requires players to explore vast, procedurally generated 3D terrains and unlock a tech tree using gathered resources. Human players typically start by learning the basics, such as mining wood and cooking food, before advancing to more complex tasks like combating monsters and crafting diamond tools. We argue that an effective lifelong learning agent should have similar capabilities as human players: (1) propose suitable tasks based on its current skill level and world state, e.g., learn to harvest sand and cactus before iron if it finds itself in a desert rather than a forest; (2) refine skills based on environmental feedback and commit mastered skills to memory for future reuse in similar situations (e.g., fighting zombies is similar to fighting spiders); (3) continually explore the world and seek out new tasks in a self-driven manner.

Towards these goals, we introduce Voyager, the first LLM-powered embodied lifelong learning agent to drive exploration, master a wide range of skills, and make new discoveries continually without human intervention in Minecraft. Voyager is made possible through three key modules (Fig. 2): 1) an automatic curriculum that maximizes exploration; 2) a skill library for storing and retrieving complex behaviors; and 3) a new iterative prompting mechanism that generates executable code for embodied control. We opt to use code as the action space instead of low-level motor commands because programs can naturally represent temporally extended and compositional actions, which are essential for many long-horizon tasks in Minecraft. Voyager interacts with a blackbox LLM (GPT-4) through prompting and in-context learning. Our approach bypasses the need for model parameter access and explicit gradient-based training or finetuning.

More specifically, Voyager attempts to solve progressively harder tasks proposed by the automatic curriculum, which takes into account the exploration progress and the agent’s state. The curriculum is generated by GPT-4 based on the overarching goal of “discovering as many diverse things as possible”. This approach can be perceived as an in-context form of novelty search. Voyager incrementally builds a skill library by storing the action programs that help solve a task successfully. Each program is indexed by the embedding of its description, which can be retrieved in similar situations in the future. Complex skills can be synthesized by composing simpler programs, which compounds Voyager’s capabilities rapidly over time and alleviates the catastrophic forgetting that affects other continual learning methods.

However, LLMs struggle to produce the correct action code consistently in one shot. To address this challenge, we propose an iterative prompting mechanism that: (1) executes the generated program to obtain observations from the Minecraft simulation (such as inventory listing and nearby creatures) and error trace from the code interpreter (if any); (2) incorporates the feedback into GPT-4’s prompt for another round of code refinement; and (3) repeats the process until a self-verification module confirms the task completion, at which point we commit the program to the skill library (e.g., craftStoneShovel() and combatZombieWithSword()) and query the automatic curriculum for the next milestone (Fig. 2).

Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It can construct an ever-growing skill library of action programs that are reusable, interpretable, and generalizable to novel tasks. We evaluate Voyager systematically against other LLM-based agent techniques (e.g., ReAct, Reflexion, AutoGPT) in MineDojo, an open-source Minecraft AI framework. Voyager outperforms prior SOTA by obtaining $3.3\times$ more unique items, unlocking key tech tree milestones up to $15.3\times$ faster, and traversing $2.3\times$ longer distances. We further demonstrate that Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other methods struggle to generalize.

Method

Voyager consists of three novel components: (1) an automatic curriculum (Sec. 2.1) that suggests objectives for open-ended exploration, (2) a skill library (Sec. 2.2) for developing increasingly complex behaviors, and (3) an iterative prompting mechanism (Sec. 2.3) that generates executable code for embodied control. Full prompts are presented in Appendix, Sec. A.

Embodied agents encounter a variety of objectives with different complexity levels in open-ended environments. An automatic curriculum offers numerous benefits for open-ended exploration, ensuring a challenging but manageable learning process, fostering curiosity-driven intrinsic motivation for agents to learn and explore, and encouraging the development of general and flexible problem-solving strategies. Our automatic curriculum capitalizes on the internet-scale knowledge contained within GPT-4 by prompting it to provide a steady stream of new tasks or challenges. The curriculum unfolds in a bottom-up fashion, allowing for considerable adaptability and responsiveness to the exploration progress and the agent’s current state (Fig. 3). As Voyager progresses to harder self-driven goals, it naturally learns a variety of skills, such as “mining a diamond”.

The input prompt to GPT-4 consists of several components:

Directives encouraging diverse behaviors and imposing constraints, such as “My ultimate goal is to discover as many diverse things as possible ... The next task should not be too hard since I may not have the necessary resources or have learned enough skills to complete it yet.”;

The agent’s current state, including inventory, equipment, nearby blocks and entities, biome, time, health and hunger bars, and position;

Previously completed and failed tasks, reflecting the agent’s current exploration progress and capabilities frontier;

Additional context: We also leverage GPT-3.5 to self-ask questions based on the agent’s current state and exploration progress and self-answer questions. We opt to use GPT-3.5 instead of GPT-4 for standard NLP tasks due to budgetary considerations.
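The components listed above can be assembled into a single prompt roughly as follows. This is a minimal sketch rather than Voyager's actual prompt template: the function name `buildCurriculumPrompt`, its field names, and the section labels in the output are all hypothetical.

```javascript
// Hypothetical helper: joins the curriculum prompt components into one string.
// The section labels below are illustrative, not the wording used by Voyager.
function buildCurriculumPrompt({ directive, stateText, completedTasks, failedTasks, context }) {
  const listOrNone = (tasks) => (tasks.length ? tasks.join(", ") : "None");
  const sections = [
    directive,                                          // diversity directive and constraints
    stateText,                                          // rendered agent state
    `Completed tasks: ${listOrNone(completedTasks)}`,   // exploration progress
    `Failed tasks: ${listOrNone(failedTasks)}`,         // capabilities frontier
  ];
  // Additional context (self-asked questions and answers) is optional.
  if (context) sections.push(`Additional context: ${context}`);
  return sections.join("\n\n");
}

const curriculumPrompt = buildCurriculumPrompt({
  directive: "My ultimate goal is to discover as many diverse things as possible.",
  stateText: "Biome: desert",
  completedTasks: ["Mine 1 wood log"],
  failedTasks: [],
  context: "",
});
console.log(curriculumPrompt);
```

Keeping completed and failed tasks as separate sections lets the model see both what has been mastered and where the agent currently struggles.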

2.2 Skill Library

With the automatic curriculum consistently proposing increasingly complex tasks, it is essential to have a skill library that serves as a basis for learning and evolution. Inspired by the generality, interpretability, and universality of programs, we represent each skill with executable code that scaffolds temporally extended actions for completing a specific task proposed by the automatic curriculum.

The input prompt to GPT-4 consists of the following components:

Guidelines for code generation, such as “Your function will be reused for building more complex functions. Therefore, you should make it generic and reusable.”;

Control primitive APIs, and relevant skills retrieved from the skill library, which are crucial for in-context learning to work well;

The generated code from the last round, environment feedback, execution errors, and critique, based on which GPT-4 can self-improve (Sec. 2.3);

The agent’s current state, including inventory, equipment, nearby blocks and entities, biome, time, health and hunger bars, and position;

Chain-of-thought prompting to do reasoning before code generation.

We iteratively refine the program through a novel iterative prompting mechanism (Sec. 2.3), incorporate it into the skill library as a new skill, and index it by the embedding of its description (Fig. 4, top). For skill retrieval, we query the skill library with the embedding of self-generated task plans and environment feedback (Fig. 4, bottom). By continuously expanding and refining the skill library, Voyager can learn, adapt, and excel in a wide spectrum of tasks, consistently pushing the boundaries of its capabilities in the open world.
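The indexing and retrieval scheme described above can be illustrated with a small sketch. It is not Voyager's implementation: `SkillLibrary`, `toyEmbed`, and `cosine` are hypothetical names, and the bag-of-words `toyEmbed` is only a stand-in for a real embedding model such as text-embedding-ada-002.

```javascript
// Toy stand-in for an embedding model (assumption): a sparse word-count map.
function toyEmbed(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z]+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse word-count maps (0 if either is empty).
function cosine(a, b) {
  let dot = 0;
  for (const [word, count] of a) dot += count * (b.get(word) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((s, c) => s + c * c, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

class SkillLibrary {
  constructor(embed = toyEmbed) {
    this.embed = embed;
    this.skills = []; // entries: { name, description, code, key }
  }

  // Store a verified program, indexed by the embedding of its description.
  add(name, description, code) {
    this.skills.push({ name, description, code, key: this.embed(description) });
  }

  // Return up to k skills whose descriptions are most similar to the query.
  retrieve(query, k = 5) {
    const q = this.embed(query);
    return this.skills
      .map((skill) => ({ skill, score: cosine(q, skill.key) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((x) => x.skill);
  }
}

const library = new SkillLibrary();
library.add("craftStoneShovel", "craft a stone shovel using cobblestone and sticks", "/* ... */");
library.add("combatZombieWithSword", "fight a zombie with a sword", "/* ... */");
console.log(library.retrieve("fight a spider with a sword", 1)[0].name);
```

With this toy embedding, the spider query shares most of its words with the zombie-fighting description, which mirrors the observation that fighting zombies is similar to fighting spiders.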

2.3 Iterative Prompting Mechanism

We introduce an iterative prompting mechanism for self-improvement through three types of feedback:

Environment feedback, which illustrates the intermediate progress of program execution (Fig. 5, left). For example, “I cannot make an iron chestplate because I need: 7 more iron ingots” highlights the cause of failure in crafting an iron chestplate. We use bot.chat() inside control primitive APIs to generate environment feedback and prompt GPT-4 to use this function as well during code generation;

Execution errors from the program interpreter that reveal any invalid operations or syntax errors in programs, which are valuable for bug fixing (Fig. 5, right);

Self-verification for checking task success. Instead of manually coding success checkers for each new task proposed by the automatic curriculum, we instantiate another GPT-4 agent for self-verification. By providing Voyager’s current state and the task to GPT-4, we ask it to act as a critic and inform us whether the program achieves the task. In addition, if the task fails, it provides a critique by suggesting how to complete the task (Fig. 6). Hence, our self-verification is more comprehensive than self-reflection by both checking success and reflecting on mistakes.

During each round of code generation, we execute the generated program to obtain environment feedback and execution errors from the code interpreter, which are incorporated into GPT-4’s prompt for the next round of code refinement. This iterative process repeats until self-verification validates the task’s completion, at which point we add this new skill to the skill library and ask the automatic curriculum for a new objective (Fig. 2). If the agent gets stuck after 4 rounds of code generation, then we query the curriculum for another task. This iterative prompting approach significantly improves program synthesis for embodied control, enabling Voyager to continuously acquire diverse skills without human intervention.
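The control flow of this loop can be sketched as below. The sketch makes several assumptions: `iterativePrompting` and the three injected callbacks (`generateCode`, `execute`, `verify`) are hypothetical names, and everything is synchronous so the logic can run without an LLM or a Minecraft server, whereas a real agent would make asynchronous API calls.

```javascript
// Hypothetical callbacks (assumptions), injected so the loop is runnable:
//   generateCode(task, feedback) -> program source
//   execute(code)                -> { chatLog, error }
//   verify(task, result)         -> { success, critique }
function iterativePrompting(task, { generateCode, execute, verify }, maxRounds = 4) {
  let feedback = null; // nothing to refine in the first round
  for (let round = 1; round <= maxRounds; round++) {
    const code = generateCode(task, feedback);
    const result = execute(code);
    const verdict = verify(task, result);
    if (verdict.success) {
      // The caller would now add the program to the skill library.
      return { success: true, code, rounds: round };
    }
    // Carry the last program and all three feedback types into the next round.
    feedback = {
      previousCode: code,
      environment: result.chatLog,
      executionError: result.error,
      critique: verdict.critique,
    };
  }
  // Stuck after maxRounds: the caller would ask the curriculum for another task.
  return { success: false, code: null, rounds: maxRounds };
}

// Toy run: the stub generator only produces a working program after a critique.
const outcome = iterativePrompting("Craft a stone shovel", {
  generateCode: (task, fb) => (fb ? "v2" : "v1"),
  execute: (code) => ({ chatLog: `ran ${code}`, error: null }),
  verify: (task, result) =>
    result.chatLog === "ran v2"
      ? { success: true, critique: "" }
      : { success: false, critique: "Collect cobblestone first." },
});
console.log(outcome);
```

Returning a failure result instead of throwing keeps the outer exploration loop simple: it can record the task as failed and move on.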

Experiments

We leverage OpenAI’s gpt-4-0314 and gpt-3.5-turbo-0301 APIs for text completion, along with text-embedding-ada-002 API for text embedding. We set all temperatures to 0 except for the automatic curriculum, which uses temperature $=$ 0.1 to encourage task diversity. Our simulation environment is built on top of MineDojo and leverages Mineflayer JavaScript APIs for motor controls. See Appendix, Sec. B.1 for more details.
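The model and temperature settings above could be captured in a configuration object along these lines. The object name `LLM_CONFIG` and its keys are hypothetical; only the model identifiers and temperatures come from the text, and the mapping of models to modules follows the Method section.

```javascript
// Hypothetical configuration layout; values are taken from the setup above.
const LLM_CONFIG = {
  curriculum: { model: "gpt-4-0314", temperature: 0.1 }, // slight randomness for task diversity
  codeGeneration: { model: "gpt-4-0314", temperature: 0 },
  selfVerification: { model: "gpt-4-0314", temperature: 0 },
  questionAnswering: { model: "gpt-3.5-turbo-0301", temperature: 0 }, // self-ask / self-answer
  embedding: { model: "text-embedding-ada-002" }, // skill indexing and retrieval
};

// Only the curriculum deviates from temperature 0.
const nonZero = Object.keys(LLM_CONFIG).filter((name) => (LLM_CONFIG[name].temperature || 0) > 0);
console.log(nonZero);
```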

3.2 Baselines

Because there are no LLM-based agents that work out of the box for Minecraft, we make our best effort to select a number of representative algorithms as baselines. These methods were originally designed only for NLP tasks without embodiment, so we have to re-interpret them to be executable in MineDojo and compatible with our experimental setting:

ReAct uses chain-of-thought prompting by generating both reasoning traces and action plans with LLMs. We provide it with our environment feedback and the agent states as observations.

Reflexion is built on top of ReAct with self-reflection to infer more intuitive future actions. We provide it with execution errors and our self-verification module.

Note that we do not directly compare with prior methods that take Minecraft screen pixels as input and output low-level controls. It would not be an apples-to-apples comparison, because we rely on the high-level Mineflayer API to control the agent. Our work’s focus is on pushing the limits of GPT-4 for lifelong embodied agent learning, rather than solving the 3D perception or sensorimotor control problems. Voyager is orthogonal and can be combined with gradient-based approaches like VPT as long as the controller provides a code API. We make a system-level comparison between Voyager and prior Minecraft agents in Table A.2.

3.3 Evaluation Results

We systematically evaluate Voyager and baselines on their exploration performance, tech tree mastery, map coverage, and zero-shot generalization capability to novel tasks in a new world.

Significantly better exploration. Results of exploration performance are shown in Fig. 1. Voyager’s superiority is evident in its ability to consistently make new strides, discovering 63 unique items within 160 prompting iterations, $3.3\times$ as many novel items as its counterparts. On the other hand, AutoGPT lags considerably in discovering new items, while ReAct and Reflexion struggle to make significant progress, given the abstract nature of the open-ended exploration goal that is challenging to execute without an appropriate curriculum.

Consistent tech tree mastery. The Minecraft tech tree tests the agent’s ability to craft and use a hierarchy of tools. Progressing through this tree (wooden tool $\rightarrow$ stone tool $\rightarrow$ iron tool $\rightarrow$ diamond tool) requires the agent to master systematic and compositional skills. Compared with baselines, Voyager unlocks the wooden level $15.3\times$ faster (in terms of the prompting iterations), the stone level $8.5\times$ faster, the iron level $6.4\times$ faster, and Voyager is the only one to unlock the diamond level of the tech tree (Fig. 2 and Table 1). This underscores the effectiveness of the automatic curriculum, which consistently presents challenges of suitable complexity to facilitate the agent’s progress.

Extensive map traversal. Voyager is able to navigate distances $2.3\times$ longer compared to baselines by traversing a variety of terrains, while the baseline agents often find themselves confined to local areas, which significantly hampers their capacity to discover new knowledge (Fig. 7).

Efficient zero-shot generalization to unseen tasks. To evaluate zero-shot generalization, we clear the agent’s inventory, reset it to a newly instantiated world, and test it with unseen tasks. For both Voyager and AutoGPT, we utilize GPT-4 to break down the task into a series of subgoals. Table 2 and Fig. 8 show that Voyager can consistently solve all the tasks, while baselines cannot solve any task within 50 prompting iterations. Interestingly, our skill library constructed from lifelong learning not only enhances Voyager’s performance but also gives a boost to AutoGPT. This demonstrates that the skill library serves as a versatile tool that can be readily employed by other methods, effectively acting as a plug-and-play asset to enhance performance.

3.4 Ablation Studies

We ablate 6 design choices (automatic curriculum, skill library, environment feedback, execution errors, self-verification, and GPT-4 for code generation) in Voyager and study their impact on exploration performance (see Appendix, Sec. B.3 for details of each ablated variant). Results are shown in Fig. 9. We highlight the key findings below:

Automatic curriculum is crucial for the agent’s consistent progress. The discovered item count drops by $93\%$ if the curriculum is replaced with a random one, because certain tasks may be too challenging if attempted out of order. On the other hand, a manually designed curriculum requires significant Minecraft-specific expertise, and does not take into account the agent’s live situation. It falls short in the experimental results compared to our automatic curriculum.

Voyager w/o skill library exhibits a tendency to plateau in the later stages. This underscores the pivotal role that the skill library plays in Voyager. It helps create more complex actions and steadily pushes the agent’s boundaries by encouraging new skills to be built upon older ones.

Self-verification is the most important among all the feedback types. Removing the module leads to a significant drop ($-73\%$) in the discovered item count. Self-verification serves as a critical mechanism to decide when to move on to a new task or reattempt a previously unsuccessful task.

GPT-4 significantly outperforms GPT-3.5 in code generation and obtains $5.7\times$ more unique items, as GPT-4 exhibits a quantum leap in coding abilities. This finding corroborates recent studies in the literature.

3.5 Multimodal Feedback from Humans

Voyager does not currently support visual perception, because the available version of GPT-4 API is text-only at the time of this writing. However, Voyager has the potential to be augmented by multimodal perception models to achieve more impressive tasks. We demonstrate that given human feedback, Voyager is able to construct complex 3D structures in Minecraft, such as a Nether Portal and a house (Fig. 10). There are two ways to integrate human feedback:

Human as a critic (equivalent to Voyager’s self-verification module): humans provide visual critique to Voyager, allowing it to modify the code from the previous round. This feedback is essential for correcting certain errors in the spatial details of a 3D structure that Voyager cannot perceive directly.

Human as a curriculum (equivalent to Voyager’s automatic curriculum module): humans break down a complex building task into smaller steps, guiding Voyager to complete them incrementally. This approach improves Voyager’s ability to handle more sophisticated 3D construction tasks.

Limitations and Future Work

Cost. The GPT-4 API incurs significant costs. It is $15\times$ more expensive than GPT-3.5. Nevertheless, Voyager requires the quantum leap in code generation quality from GPT-4 (Fig. 9), which GPT-3.5 and open-source LLMs cannot provide.

Inaccuracies. Despite the iterative prompting mechanism, there are still cases where the agent gets stuck and fails to generate the correct skill. The automatic curriculum has the flexibility to reattempt this task at a later time. Occasionally, the self-verification module may also fail, such as not recognizing spider string as a success signal of beating a spider.

Hallucinations. The automatic curriculum occasionally proposes unachievable tasks. For example, it may ask the agent to craft a “copper sword” or “copper chestplate”, which are items that do not exist within the game. Hallucinations also occur during the code generation process. For instance, GPT-4 tends to use cobblestone as a fuel input, despite it being an invalid fuel source in the game. Additionally, it may call functions absent in the provided control primitive APIs, leading to code execution errors.

We are confident that improvements in the GPT API models as well as novel techniques for finetuning open-source LLMs will overcome these limitations in the future.

Related work

Minecraft is an open-ended 3D world with incredibly flexible game mechanics supporting a broad spectrum of activities. Built upon notable Minecraft benchmarks, Minecraft learning algorithms can be divided into two categories: 1) Low-level controller: Many prior efforts leverage hierarchical reinforcement learning to learn from human demonstrations. Kanitscheider et al. design a curriculum based on success rates, but its objectives are limited to curated items. MineDojo and VPT utilize YouTube videos for large-scale pre-training. DreamerV3, on the other hand, learns a world model to explore the environment and collect diamonds. 2) High-level planner: Volum et al. leverage few-shot prompting with Codex to generate executable policies, but they require additional human interaction. Recent works leverage LLMs as a high-level planner in Minecraft by decomposing a high-level task into several subgoals following Minecraft recipes, thus lacking full exploration flexibility. Like these latter works, Voyager also uses LLMs as a high-level planner by prompting GPT-4 and utilizes Mineflayer as a low-level controller following Volum et al. Unlike prior works, Voyager employs an automatic curriculum that unfolds in a bottom-up manner, driven by curiosity, and therefore enables open-ended exploration.

Inspired by the strong emergent capabilities of LLMs, such as zero-shot prompting and complex reasoning, embodied agent research has witnessed a significant increase in the utilization of LLMs for planning purposes. Recent efforts can be roughly classified into two groups. 1) Large language models for robot learning: Many prior works apply LLMs to generate subgoals for robot planning. Inner Monologue incorporates environment feedback for robot planning with LLMs. Code as Policies and ProgPrompt directly leverage LLMs to generate executable robot policies. VIMA and PaLM-E fine-tune pre-trained LLMs to support multimodal prompts. 2) Large language models for text agents: ReAct leverages chain-of-thought prompting and generates both reasoning traces and task-specific actions with LLMs. Reflexion is built upon ReAct with self-reflection to enhance reasoning. AutoGPT is a popular tool that automates NLP tasks by crafting a curriculum of multiple subgoals for completing a high-level goal while incorporating ReAct’s reasoning and acting loops. DERA frames a task as a dialogue between two GPT-4 agents. Generative Agents leverages ChatGPT to simulate human behaviors by storing agents’ experiences as memories and retrieving those for planning, but its agent actions are not executable. SPRING is a concurrent work that uses GPT-4 to extract game mechanics from game manuals, based on which it answers questions arranged in a directed acyclic graph and predicts the next action. All these works lack a skill library for developing more complex behaviors, which is a crucial component for the success of Voyager in lifelong learning.

Code generation has been a longstanding challenge in NLP, with various works leveraging execution results to improve program synthesis. Execution-guided approaches leverage intermediate execution outcomes to guide program search. Another line of research utilizes majority voting to choose candidates based on their execution performance. Additionally, LEVER trains a verifier to distinguish and reject incorrect programs based on execution results. CLAIRIFY, on the other hand, generates code for planning chemistry experiments and makes use of a rule-based verifier to iteratively provide error feedback to LLMs. Voyager distinguishes itself from these works by integrating environment feedback, execution errors, and self-verification (to assess task success) into an iterative prompting mechanism for embodied control.

Conclusion

In this work, we introduce Voyager, the first LLM-powered embodied lifelong learning agent, which leverages GPT-4 to explore the world continuously, develop increasingly sophisticated skills, and make new discoveries consistently without human intervention. Voyager exhibits superior performance in discovering novel items, unlocking the Minecraft tech tree, traversing diverse terrains, and applying its learned skill library to unseen tasks in a newly instantiated world. Voyager serves as a starting point to develop powerful generalist agents without tuning the model parameters.

Broader Impacts

Our research is conducted within Minecraft, a safe and harmless 3D video game environment. While Voyager is designed to be generally applicable to other domains, such as robotics, its application to physical robots would require additional attention and the implementation of safety constraints by humans to ensure responsible and secure deployment.

Acknowledgements

We are extremely grateful to Ziming Zhu, Kaiyu Yang, Rafał Kocielnik, Colin White, Or Sharir, Sahin Lale, De-An Huang, Jean Kossaifi, Yuncong Yang, Charles Zhang, Minchao Huang, and many other colleagues and friends for their helpful feedback and insightful discussions. This work is done during Guanzhi Wang’s internship at NVIDIA. Guanzhi Wang is supported by the Kortschak fellowship in Computing and Mathematical Sciences at Caltech.

References

Appendix A Method

A.2 Prompting

GPT-4 and GPT-3.5 offer users the ability to designate the role of each prompt message among three options:

System: A high-level instruction that guides the model behavior throughout the conversation. It sets the overall tone and objective for the interaction.

User: A detailed instruction that guides the assistant for the next immediate response.

Assistant: A response message generated by the model.

See https://platform.openai.com/docs/guides/chat/introduction for more details.

To save token usage, instead of engaging in multi-round conversations, we concatenate a system prompt and a user prompt to obtain each assistant’s response.
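A single-turn request of this kind can be sketched as follows. The helper name `buildMessages` is hypothetical; the `role` and `content` fields follow the chat message format described above.

```javascript
// Hypothetical helper: builds a fresh two-message request for every call.
// No earlier assistant turns are replayed, which keeps token usage low; any
// feedback from a previous round has to be folded into the user prompt.
function buildMessages(systemPrompt, userPrompt) {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userPrompt },
  ];
}

const messages = buildMessages(
  "You are a helpful assistant that writes Mineflayer JavaScript code.",
  "Task: Craft a stone shovel\nInventory: {'cobblestone': 4}"
);
console.log(messages.map((m) => m.role).join(" -> "));
```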

A.3 Automatic Curriculum

The input prompt to GPT-4 consists of several components:

Directives encouraging diverse behaviors and imposing constraints (so that the proposed task is achievable and verifiable): See Sec. A.3.4 for the full prompt;

Inventory: A dictionary of items with counts, for example, {‘cobblestone’: 4, ‘furnace’: 1, ‘stone_pickaxe’: 1, ‘oak_planks’: 7, ‘dirt’: 6, ‘wooden_pickaxe’: 1, ‘crafting_table’: 1, ‘raw_iron’: 4, ‘coal’: 1};

Equipment: Armor or weapons equipped by the agent;

Nearby blocks: A set of block names within a 32-block distance to the agent, for example, ‘dirt’, ‘water’, ‘spruce_planks’, ‘grass_block’, ‘dirt_path’, ‘sugar_cane’, ‘fern’;

Other blocks that are recently seen: Blocks that are not nearby or in the inventory;

Nearby entities: A set of entity names within a 32-block distance to the agent, for example, ‘pig’, ‘cat’, ‘villager’, ‘zombie’;

A list of chests that are seen by the agent: Chests are external containers where the agent can deposit items. If a chest is not opened before, its content is “Unknown”. Otherwise, the items inside each chest are shown to the agent.

Biome: For example, ‘plains’, ‘flower_forest’, ‘meadow’, ‘river’, ‘beach’, ‘forest’, ‘snowy_slopes’, ‘frozen_peaks’, ‘old_growth_birch_forest’, ‘ocean’, ‘sunflower_plains’, ‘stony_shore’;

Time: One of ‘sunrise’, ‘day’, ‘noon’, ‘sunset’, ‘night’, ‘midnight’;

Health and hunger bars: The max value is 20;

Position: 3D coordinate $(x,y,z)$ of the agent’s position in the Minecraft world;

Chain-of-thought prompting in response: We request GPT-4 to first reason about the current progress and then suggest the next task.
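The state fields listed above could be rendered into prompt text along these lines. This is a sketch under assumptions: `renderAgentState` and the field names of the state object are hypothetical, and recently seen blocks and chests are omitted for brevity.

```javascript
// Hypothetical renderer: turns an agent-state object into prompt lines.
function renderAgentState(state) {
  const inventory = Object.entries(state.inventory)
    .map(([item, count]) => `'${item}': ${count}`)
    .join(", ");
  const [x, y, z] = state.position;
  return [
    `Inventory: {${inventory}}`,
    `Equipment: ${state.equipment.join(", ") || "None"}`,
    `Nearby blocks: ${state.nearbyBlocks.join(", ") || "None"}`,
    `Nearby entities: ${state.nearbyEntities.join(", ") || "None"}`,
    `Biome: ${state.biome}`,
    `Time: ${state.time}`,
    `Health: ${state.health}/20`, // the max value of both bars is 20
    `Hunger: ${state.hunger}/20`,
    `Position: x=${x}, y=${y}, z=${z}`,
  ].join("\n");
}

const exampleState = {
  inventory: { cobblestone: 4, furnace: 1, stone_pickaxe: 1 },
  equipment: [],
  nearbyBlocks: ["dirt", "water", "grass_block"],
  nearbyEntities: ["pig", "zombie"],
  biome: "plains",
  time: "day",
  health: 20,
  hunger: 18,
  position: [12.5, 64, -30.5],
};
console.log(renderAgentState(exampleState));
```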

A.3.2 Additional Context

We leverage GPT-3.5 to self-ask questions to provide additional context. Each question is paired with a concept that is used for retrieving the most relevant document from the wiki knowledge base. We feed the document content to GPT-3.5 for self-answering questions. In practice, using a wiki knowledge base is optional since GPT-3.5 already possesses a good understanding of Minecraft game mechanics. However, the external knowledge base becomes advantageous if GPT-3.5 is not pre-trained in that specific domain. See Sec. A.3.4 for the full prompt.

A.3.3 Warm-up Schedule

In practice, we adopt a warm-up schedule to gradually incorporate the agent’s state and the additional context into the prompt based on how many tasks the agent has completed. This ensures that the prompt is exposed to increasing amounts of information over the exploration progress and therefore begins with basic skills and progressively advances towards more intricate and diverse ones. The warm-up setting that we use across all the experiments is shown in Table A.1.
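A warm-up schedule of this kind can be expressed as a map from prompt component to the number of completed tasks required before that component is included. The thresholds and component names below are purely hypothetical placeholders and are not the values from Table A.1.

```javascript
// Hypothetical thresholds (NOT the values from Table A.1): each prompt
// component is included once this many tasks have been completed.
const WARMUP = { inventory: 0, biome: 1, nearbyEntities: 3, context: 5 };

// Return the names of the components that are active at this point.
function activeComponents(completedTaskCount, schedule = WARMUP) {
  return Object.keys(schedule).filter((name) => completedTaskCount >= schedule[name]);
}

console.log(activeComponents(0));
console.log(activeComponents(3));
```

Early prompts therefore carry little more than the inventory, and richer context is phased in as the agent accumulates completed tasks.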

A.3.4 Full Prompt

A.4 Skill Library

The input prompt to GPT-4 consists of the following components:

Guidelines for code generation: See Sec A.4.2 for the full prompt;

Control primitive APIs implemented by us: These APIs serve a dual purpose: they demonstrate the usage of Mineflayer APIs, and they can be directly called by GPT-4.

exploreUntil(bot, direction, maxTime = 60, callback): Allow the agent to explore in a fixed direction for maxTime. The callback is the stopping condition implemented by the agent to determine when to stop exploring;

mineBlock(bot, name, count = 1): Mine and collect the specified number of blocks within a 32-block distance;

craftItem(bot, name, count = 1): Craft the item with a crafting table nearby;

placeItem(bot, name, position): Place the block at the specified position;

smeltItem(bot, itemName, fuelName, count = 1): Smelt the item with the specified fuel. There must be a furnace nearby;

killMob(bot, mobName, timeout = 300): Attack the mob and collect its dropped item;

getItemFromChest(bot, chestPosition, itemsToGet): Move to the chest at the specified position and get items from the chest;

depositItemIntoChest(bot, chestPosition, itemsToDeposit): Move to the chest at the specified position and deposit items into the chest;

Control primitive APIs provided by Mineflayer:

await bot.pathfinder.goto(goal): Go to a specific position. See below for how to set the goal;

new GoalNear(x, y, z, range): Move the bot to a block within the specified range of the specified block;

new GoalXZ(x, z): For long-range goals that don’t have a specific Y level;

new GoalGetToBlock(x, y, z): Not get into the block, but get directly adjacent to it. Useful for fishing, farming, filling a bucket, and using a bed;

new GoalFollow(entity, range): Follow the specified entity within the specified range;

new GoalPlaceBlock(position, bot.world, {}): Position the bot in order to place a block;

new GoalLookAtBlock(position, bot.world, {}): Path towards a position where a face of the block at position is visible;

bot.isABed(bedBlock): Return true if bedBlock is a bed;

bot.blockAt(position): Return the block at position;

await bot.equip(item, destination): Equip the item in the specified destination. destination must be one of “hand”, “head”, “torso”, “legs”, “feet”, “off-hand”;

await bot.consume(): Consume the item in the bot’s hand. You must equip the item to consume first. Useful for eating food, drinking potions, etc.;

await bot.fish(): Let bot fish. Before calling this function, you must first get to a water block and then equip a fishing rod. The bot will automatically stop fishing when it catches a fish;

await bot.sleep(bedBlock): Sleep until sunrise. You must get to a bed block first;

await bot.activateBlock(block): This is the same as right-clicking a block in the game. Useful for buttons, doors, etc. You must get to the block first;

await bot.lookAt(position): Look at the specified position. You must go near the position before you look at it. To fill a bucket with water, you must look at it first;

await bot.activateItem(): This is the same as right-clicking to use the item in the bot’s hand. Useful for using a bucket, etc. You must equip the item to activate first;

await bot.useOn(entity): This is the same as right-clicking an entity in the game. Useful for shearing a sheep. You must get to the entity first;

Environment feedback: The chat log in the prompt;

Critique from the self-verification module;

The agent’s current state: See Sec. A.3.1 for each element of the agent’s state;

Task proposed by the automatic curriculum;

Task context: We prompt GPT-3.5 to ask for general suggestions about how to solve the task. In practice, this part is handled by the automatic curriculum since it has a systematic mechanism for question-answering (Sec. A.3.2);

Chain-of-thought prompting in response: We ask GPT-4 to first explain the reason why the code from the last round fails, then give step-by-step plans to finish the task, and finally generate code. See Sec. A.4.2 for the full prompt.
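To make the role of the control primitives concrete, the following sketch shows what a generated skill might look like. It runs against stubs: the `bot` object and the bodies of `mineBlock` and `craftItem` are assumptions that only update an in-memory inventory (they do not consume ingredients), whereas the real primitives drive a Mineflayer bot. The helper `ensureItem` is also hypothetical. The recipe used, one cobblestone and two sticks for a stone shovel, is the standard Minecraft one.

```javascript
// Stub bot (assumption): records chat messages instead of sending them.
const bot = {
  inventory: {},
  log: [],
  chat(message) { this.log.push(message); },
};

// Stub primitives with the same signatures as the APIs listed above.
async function mineBlock(bot, name, count = 1) {
  bot.inventory[name] = (bot.inventory[name] || 0) + count;
  bot.chat(`Mined ${count} ${name}`);
}

async function craftItem(bot, name, count = 1) {
  bot.inventory[name] = (bot.inventory[name] || 0) + count;
  bot.chat(`Crafted ${count} ${name}`);
}

// Hypothetical helper: obtain only what is missing from the inventory.
async function ensureItem(bot, name, count, obtain) {
  const have = bot.inventory[name] || 0;
  if (have < count) await obtain(bot, name, count - have);
}

// A skill built only from primitives, with bot.chat() as environment feedback.
async function craftStoneShovel(bot) {
  await ensureItem(bot, "cobblestone", 1, mineBlock);
  await ensureItem(bot, "stick", 2, craftItem);
  await craftItem(bot, "stone_shovel", 1);
  bot.chat("Stone shovel crafted.");
}

craftStoneShovel(bot).then(() => console.log(bot.log));
```

Because the skill takes `bot` as a parameter and checks the inventory before acting, it can be reused as a building block inside more complex skills.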

A.4.2 Full Prompt

A.4.3 Examples

A.5 Self-Verification

The input prompt to GPT-4 consists of the following components:

The agent’s state: We exclude other blocks that are recently seen and nearby entities from the agent’s state since they are not useful for assessing the task’s completeness. See Sec. A.3.1 for each element of the agent’s state;

Task proposed by the automatic curriculum;

Chain-of-thought prompting in response: We request GPT-4 to initially reason about the task’s success or failure, then output a boolean variable indicating the task’s outcome, and finally provide a critique to the agent if the task fails.

Few-shot examples for in-context learning.
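The three-part response described above (reasoning, a boolean outcome, and a critique) could be consumed along the lines of the sketch below. It assumes GPT-4 is asked to reply with a JSON object containing reasoning, success, and critique fields; these field names and the helper parseVerification are illustrative assumptions, and the full prompt in Sec. A.5.2 remains the reference for the actual format.

```javascript
// Hypothetical parser for a self-verification reply.
// Assumed reply shape: {"reasoning": string, "success": boolean, "critique": string}
function parseVerification(replyText) {
  const failed = (critique) => ({ success: false, reasoning: "", critique });
  let parsed;
  try {
    parsed = JSON.parse(replyText);
  } catch (err) {
    // Malformed output is treated as a failed check rather than a crash.
    return failed("Unparseable verifier reply");
  }
  if (parsed === null || typeof parsed !== "object" || typeof parsed.success !== "boolean") {
    return failed("Verifier reply lacks a boolean success field");
  }
  return {
    success: parsed.success,
    reasoning: String(parsed.reasoning ?? ""),
    // A critique is only passed on when the task failed.
    critique: parsed.success ? "" : String(parsed.critique ?? ""),
  };
}
```

Treating an unparseable reply as a failure is a conservative design choice in this sketch: the task is retried instead of being wrongly marked as complete.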

A.5.2 Full Prompt

A.6 System-level Comparison between Voyager and Prior Works

We make a system-level comparison in Table A.2. Voyager stands out as the only method that combines an automatic curriculum, iterative planning, and a skill library. Moreover, it learns to play Minecraft without any gradient updates.

Appendix B Experiments

Our simulation environment is built upon MineDojo and utilizes Mineflayer JavaScript APIs for motor controls (Sec. A.4.2). Additionally, we add many bot.chat() calls to the Mineflayer functions to provide abundant environment feedback, and we implement various condition checks along with try-catch exception handling for continuous execution. If the bot dies, it is resurrected near the closest ground, and its inventory is preserved for uninterrupted exploration. The bot recycles its crafting table and furnace after program execution. For detailed implementations, please refer to our codebase.

B.2 Baselines

ReAct uses chain-of-thought prompting by generating both reasoning traces and action plans with LLMs. We provide it with our environment feedback and the agent states as observations. ReAct undergoes one round of code generation from scratch, followed by three rounds of code refinement. This process is then repeated until the maximum prompting iteration is reached.
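The prompting schedule described above, one from-scratch generation followed by three refinements and repeated until the iteration budget is exhausted, can be written down as a small sketch. The function name promptingSchedule and the labels "generate" and "refine" are hypothetical and serve only to make the cycle explicit.

```javascript
// Hypothetical schedule: label each prompting iteration of the baseline.
const ROUNDS_PER_CYCLE = 4; // 1 from-scratch generation + 3 refinement rounds

function promptingSchedule(maxIterations) {
  const schedule = [];
  for (let i = 0; i < maxIterations; i++) {
    // Every fourth iteration restarts code generation from scratch.
    schedule.push(i % ROUNDS_PER_CYCLE === 0 ? "generate" : "refine");
  }
  return schedule;
}

console.log(promptingSchedule(6).join(", "));
// prints "generate, refine, refine, refine, generate, refine"
```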

Reflexion is built on top of ReAct with self-reflection to infer more intuitive future actions. We provide it with environment feedback, the agent states, execution errors, and our self-verification module. Similar to ReAct, Reflexion undergoes one round of code generation from scratch, followed by three rounds of code refinement. This process is then repeated until the maximum prompting iteration is reached.

AutoGPT is a popular software tool that automates NLP tasks by decomposing a high-level goal into multiple subgoals and executing them in a ReAct-style loop. We re-implement AutoGPT by using GPT-4 to do task decomposition and provide it with the agent states, environment feedback, and execution errors as observations for subgoal execution. Compared with Voyager, AutoGPT lacks the skill library for accumulating knowledge, self-verification for assessing task success, and automatic curriculum for open-ended exploration. During each subgoal execution, if no execution error occurs, we consider the subgoal completed and proceed to the next one. Otherwise, we refine the program until three rounds of code refinement (equivalent to four rounds of code generation) are completed, and then move on to the next subgoal. If three consecutive subgoals do not result in acquiring a new item, we replan by rerunning the task decomposition.
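A minimal sketch of this control loop is given below. The callbacks decompose, generate, and execute are hypothetical placeholders for the GPT-4 task decomposition, code generation or refinement, and program execution. Counting every execution as one prompting iteration is an assumption, as is stopping when the subgoal list runs out, a case the description above leaves open.

```javascript
// Hypothetical control loop for the re-implemented AutoGPT baseline.
// decompose() -> string[]; generate(subgoal, prevProgram, prevError) -> program;
// execute(program) -> { error: string | null, newItems: string[] }.
const MAX_REFINEMENTS = 3;     // three refinement rounds after the first generation
const REPLAN_AFTER_STALLS = 3; // consecutive subgoals that yield no new item

function runAutoGptBaseline({ decompose, generate, execute, maxIterations }) {
  let iterations = 0;
  let stalls = 0;
  let replans = 0;
  let subgoals = decompose();
  while (iterations < maxIterations && subgoals.length > 0) {
    const subgoal = subgoals.shift();
    let program = null;
    let error = null;
    let gainedItem = false;
    for (let round = 0; round <= MAX_REFINEMENTS && iterations < maxIterations; round++) {
      program = generate(subgoal, program, error);
      const result = execute(program);
      iterations++;
      error = result.error;
      if (result.newItems.length > 0) gainedItem = true;
      if (error === null) break; // no execution error: subgoal counts as completed
    }
    stalls = gainedItem ? 0 : stalls + 1;
    if (stalls >= REPLAN_AFTER_STALLS) {
      subgoals = decompose(); // replan by rerunning the task decomposition
      stalls = 0;
      replans++;
    }
  }
  return { iterations, replans };
}
```

Because completion is judged only by the absence of execution errors, a program that runs cleanly but achieves nothing still advances to the next subgoal; the stall counter is the only safeguard against that in this baseline.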

The task is “explore the world and get as many items as possible” for all baselines.

B.3 Ablations

Manual Curriculum: We substitute the automatic curriculum with a manually designed curriculum for mining a diamond: “Mine 3 wood log”, “Craft 1 crafting table”, “Craft 1 wooden pickaxe”, “Mine 11 cobblestone”, “Craft 1 stone pickaxe”, “Craft 1 furnace”, “Mine 3 iron ore”, “Smelt 3 iron ore”, “Craft 1 iron pickaxe”, “Mine 1 diamond”. A manual curriculum requires human effort to design and is not scalable for open-ended exploration.

Random Curriculum: We curate 101 items obtained by Voyager and create a random curriculum by randomly selecting one item as the next task.
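A random curriculum of this kind can be sketched in a few lines. The function name nextRandomTask, the "Obtain 1 ..." task phrasing, and sampling with replacement are assumptions made for illustration; the item pool would be the curated list of 101 items.

```javascript
// Hypothetical random curriculum: pick the next task uniformly at random.
// `rng` is injectable so that a run can be made reproducible.
function nextRandomTask(items, rng = Math.random) {
  if (items.length === 0) throw new Error("item pool is empty");
  const index = Math.floor(rng() * items.length);
  return `Obtain 1 ${items[index]}`; // task wording is an assumption
}

// Example with a three-item pool; the real pool holds the curated items.
const examplePool = ["oak_log", "iron_pickaxe", "diamond"];
console.log(nextRandomTask(examplePool));
```

Unlike the automatic curriculum, this selection ignores the agent's current state, so a proposed item may be far beyond what the agent can obtain at that point.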

w/o Skill Library: We remove the skill library, eliminating skill retrieval for code generation.

w/o Environment Feedback: We exclude environment feedback (chat log) from the prompt for code generation.

w/o Execution Errors: We exclude execution errors from the prompt for code generation.

w/o Self-Verification: For each task, we generate code without self-verification and iteratively refine the program for 3 rounds (equivalent to 4 rounds of code generation in total).

GPT-3.5: We replace GPT-4 with GPT-3.5 for code generation. We retain GPT-4 for the automatic curriculum and the self-verification module.

B.4 Evaluation Results

The meaning of each icon in Fig. 1 is shown in Fig. A.1.

We run three trials for each method. The items collected by Voyager in each trial are:

Trial 1: ‘iron_ingot’, ‘stone_shovel’, ‘iron_leggings’, ‘fishing_rod’, ‘pufferfish’, ‘oak_log’, ‘cooked_mutton’, ‘green_dye’, ‘flint’, ‘chest’, ‘iron_sword’, ‘string’, ‘ender_pearl’, ‘raw_copper’, ‘crafting_table’, ‘cactus’, ‘lapis_lazuli’, ‘iron_pickaxe’, ‘copper_ingot’, ‘stone_pickaxe’, ‘wooden_hoe’, ‘scaffolding’, ‘stick’, ‘porkchop’, ‘copper_block’, ‘gravel’, ‘grass_block’, ‘white_bed’, ‘bone’, ‘dirt’, ‘mutton’, ‘white_wool’, ‘oak_sapling’, ‘coal’, ‘bamboo’, ‘wooden_pickaxe’, ‘rotten_flesh’, ‘cooked_porkchop’, ‘cod’, ‘iron_boots’, ‘lightning_rod’, ‘diorite’, ‘water_bucket’, ‘shears’, ‘furnace’, ‘andesite’, ‘granite’, ‘bucket’, ‘wooden_sword’, ‘sandstone’, ‘iron_helmet’, ‘raw_iron’, ‘sand’, ‘acacia_log’, ‘cooked_cod’, ‘oak_planks’, ‘azure_bluet’, ‘iron_shovel’, ‘acacia_planks’, ‘shield’, ‘iron_axe’, ‘iron_chestplate’, ‘cobblestone’;

Trial 2: ‘iron_ingot’, ‘tuff’, ‘stone_shovel’, ‘iron_leggings’, ‘fishing_rod’, ‘cooked_mutton’, ‘spruce_planks’, ‘gunpowder’, ‘amethyst_shard’, ‘chest’, ‘string’, ‘cooked_salmon’, ‘iron_sword’, ‘raw_copper’, ‘crafting_table’, ‘torch’, ‘lapis_lazuli’, ‘iron_pickaxe’, ‘copper_ingot’, ‘stone_pickaxe’, ‘wooden_hoe’, ‘stick’, ‘amethyst_block’, ‘salmon’, ‘calcite’, ‘gravel’, ‘white_bed’, ‘bone’, ‘dirt’, ‘mutton’, ‘white_wool’, ‘spyglass’, ‘coal’, ‘wooden_pickaxe’, ‘cod’, ‘iron_boots’, ‘lily_pad’, ‘cobbled_deepslate’, ‘lightning_rod’, ‘snowball’, ‘stone_axe’, ‘smooth_basalt’, ‘diorite’, ‘water_bucket’, ‘furnace’, ‘andesite’, ‘bucket’, ‘granite’, ‘shield’, ‘iron_helmet’, ‘raw_iron’, ‘cobblestone’, ‘spruce_log’, ‘cooked_cod’, ‘tripwire_hook’, ‘stone_hoe’, ‘iron_chestplate’, ‘stone_sword’;

Trial 3: ‘spruce_planks’, ‘dirt’, ‘shield’, ‘redstone’, ‘clock’, ‘diamond_sword’, ‘iron_chestplate’, ‘stone_pickaxe’, ‘leather’, ‘string’, ‘chicken’, ‘chest’, ‘diorite’, ‘iron_leggings’, ‘black_wool’, ‘cobblestone_wall’, ‘cobblestone’, ‘cooked_chicken’, ‘feather’, ‘stone_sword’, ‘raw_gold’, ‘gravel’, ‘birch_planks’, ‘coal’, ‘cobbled_deepslate’, ‘oak_planks’, ‘iron_pickaxe’, ‘granite’, ‘tuff’, ‘crafting_table’, ‘iron_helmet’, ‘stone_hoe’, ‘iron_ingot’, ‘stone_axe’, ‘birch_boat’, ‘stick’, ‘sand’, ‘bone’, ‘raw_iron’, ‘beef’, ‘rail’, ‘oak_sapling’, ‘kelp’, ‘gold_ingot’, ‘birch_log’, ‘wheat_seeds’, ‘cooked_mutton’, ‘furnace’, ‘arrow’, ‘stone_shovel’, ‘white_wool’, ‘andesite’, ‘jungle_slab’, ‘mutton’, ‘iron_sword’, ‘copper_ingot’, ‘diamond’, ‘torch’, ‘oak_log’, ‘cooked_beef’, ‘copper_block’, ‘flint’, ‘bone_meal’, ‘raw_copper’, ‘wooden_pickaxe’, ‘iron_boots’, ‘wooden_sword’.

The items collected by ReAct in each trial are:

Trial 1: ‘bamboo’, ‘dirt’, ‘sand’, ‘wheat_seeds’;

Trial 2: ‘dirt’, ‘rabbit’, ‘spruce_log’, ‘spruce_sapling’;

The items collected by Reflexion in each trial are:

Trial 1: ‘crafting_table’, ‘orange_tulip’, ‘oak_planks’, ‘oak_log’, ‘dirt’;

Trial 2: ‘spruce_log’, ‘dirt’, ‘clay_ball’, ‘sand’, ‘gravel’;

Trial 3: ‘wheat_seeds’, ‘oak_log’, ‘dirt’, ‘birch_log’, ‘sand’.

The items collected by AutoGPT in each trial are:

Trial 1: ‘feather’, ‘oak_log’, ‘leather’, ‘stick’, ‘porkchop’, ‘chicken’, ‘crafting_table’, ‘wheat_seeds’, ‘oak_planks’, ‘dirt’, ‘mutton’;

Trial 2: ‘wooden_pickaxe’, ‘iron_ingot’, ‘stone’, ‘coal’, ‘spruce_planks’, ‘string’, ‘raw_copper’, ‘crafting_table’, ‘diorite’, ‘andesite’, ‘furnace’, ‘torch’, ‘spruce_sapling’, ‘granite’, ‘iron_pickaxe’, ‘stone_pickaxe’, ‘wooden_axe’, ‘raw_iron’, ‘stick’, ‘spruce_log’, ‘dirt’, ‘cobblestone’;

Trial 3: ‘wooden_shovel’, ‘wooden_pickaxe’, ‘iron_ingot’, ‘stone’, ‘cod’, ‘coal’, ‘oak_log’, ‘flint’, ‘raw_copper’, ‘crafting_table’, ‘diorite’, ‘furnace’, ‘andesite’, ‘torch’, ‘granite’, ‘lapis_lazuli’, ‘iron_pickaxe’, ‘stone_pickaxe’, ‘raw_iron’, ‘stick’, ‘gravel’, ‘oak_planks’, ‘dirt’, ‘iron_axe’, ‘cobblestone’.

B.4.2 Extensive Map Traversal

Agent trajectories for map coverage are displayed in Fig. A.2. Fig. 7 is plotted based on Fig. A.2 by drawing the smallest circle enclosing each trajectory. The terrains traversed by Voyager in each trial are:

Trial 1: ‘meadow’, ‘desert’, ‘river’, ‘savanna’, ‘forest’, ‘plains’, ‘bamboo_jungle’, ‘dripstone_caves’;

Trial 2: ‘snowy_plains’, ‘frozen_river’, ‘dripstone_caves’, ‘snowy_taiga’, ‘beach’;

Trial 3: ‘flower_forest’, ‘meadow’, ‘old_growth_birch_forest’, ‘snowy_slopes’, ‘frozen_peaks’, ‘forest’, ‘river’, ‘beach’, ‘ocean’, ‘sunflower_plains’, ‘plains’, ‘stony_shore’.

The terrains traversed by ReAct in each trial are:

Trial 2: ‘snowy_plains’, ‘snowy_taiga’, ‘snowy_slopes’;

Trial 3: ‘dark_forest’, ‘dripstone_caves’, ‘grove’, ‘jagged_peaks’.

The terrains traversed by Reflexion in each trial are:

Trial 3: ‘old_growth_birch_forest’, ‘river’, ‘ocean’, ‘beach’, ‘plains’.

The terrains traversed by AutoGPT in each trial are:

Trial 1: ‘plains’, ‘dripstone_caves’, ‘savanna’, ‘meadow’;

Trial 3: ‘plains’, ‘stony_shore’, ‘forest’, ‘ocean’.
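The smallest circle enclosing a trajectory, as used for the map-coverage plot, can be computed with a standard incremental (Welzl-style) algorithm. The sketch below is one possible implementation and not necessarily the one used to produce the figure. Representing trajectory points as objects with horizontal coordinates x and z is an assumption.

```javascript
// Smallest enclosing circle of 2D points (incremental Welzl-style algorithm).
// Points are assumed to be {x, z} objects holding horizontal coordinates.
const EPS = 1e-9;
const dist = (a, b) => Math.hypot(a.x - b.x, a.z - b.z);
const contains = (c, p) => dist(c.center, p) <= c.radius + EPS;

function circleFromTwo(a, b) {
  const center = { x: (a.x + b.x) / 2, z: (a.z + b.z) / 2 };
  return { center, radius: dist(a, b) / 2 };
}

function circleFromThree(a, b, c) {
  // Circumcircle of a triangle; returns null for (nearly) collinear points.
  const d = 2 * (a.x * (b.z - c.z) + b.x * (c.z - a.z) + c.x * (a.z - b.z));
  if (Math.abs(d) < EPS) return null;
  const a2 = a.x * a.x + a.z * a.z;
  const b2 = b.x * b.x + b.z * b.z;
  const c2 = c.x * c.x + c.z * c.z;
  const center = {
    x: (a2 * (b.z - c.z) + b2 * (c.z - a.z) + c2 * (a.z - b.z)) / d,
    z: (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d,
  };
  return { center, radius: dist(center, a) };
}

function smallestEnclosingCircle(points) {
  if (points.length === 0) return null;
  let circle = { center: points[0], radius: 0 };
  for (let i = 1; i < points.length; i++) {
    if (contains(circle, points[i])) continue;
    // points[i] must lie on the boundary of the circle for points[0..i].
    circle = { center: points[i], radius: 0 };
    for (let j = 0; j < i; j++) {
      if (contains(circle, points[j])) continue;
      // points[i] and points[j] both lie on the boundary.
      circle = circleFromTwo(points[i], points[j]);
      for (let k = 0; k < j; k++) {
        if (contains(circle, points[k])) continue;
        circle = circleFromThree(points[i], points[j], points[k]) ?? circle;
      }
    }
  }
  return circle;
}
```

Shuffling the points before the outer loop gives the usual expected linear running time; the sketch keeps the input order so that results are deterministic.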

B.4.3 Efficient Zero-Shot Generalization to Unseen Tasks

The results of zero-shot generalization to unseen tasks for the other two tasks are presented in Fig. A.3. Similar to Fig. 8, Voyager consistently solves all tasks, while the baselines are unable to solve any task within 50 prompting iterations. Our skill library, constructed from lifelong learning, not only enhances Voyager’s performance but also provides a boost to AutoGPT.

B.4.4 Accurate Skill Retrieval

We evaluate our skill retrieval on 309 samples in total and report the results in Table A.4. The top-5 accuracy of 96.5% suggests that our retrieval process is reliable (note that we include the top-5 relevant skills in the prompt for synthesizing a new skill).
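For reference, top-k accuracy counts a sample as correct when its ground-truth skill appears among the first k retrieved skills. The sketch below computes it; the sample layout with retrieved and groundTruth fields and the skill names in the example are hypothetical and do not reproduce our evaluation data.

```javascript
// Top-k retrieval accuracy: fraction of samples whose ground-truth skill
// appears among the first k retrieved skills.
function topKAccuracy(samples, k) {
  if (samples.length === 0) return 0;
  const hits = samples.filter(({ retrieved, groundTruth }) =>
    retrieved.slice(0, k).includes(groundTruth)
  ).length;
  return hits / samples.length;
}

// Hypothetical samples, each with a ranked list of retrieved skill names.
const exampleSamples = [
  { retrieved: ["mineWoodLog", "craftPlanks"], groundTruth: "mineWoodLog" },
  { retrieved: ["craftPlanks", "smeltIronOre"], groundTruth: "smeltIronOre" },
];
console.log(topKAccuracy(exampleSamples, 1)); // prints 0.5
console.log(topKAccuracy(exampleSamples, 2)); // prints 1
```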

B.4.5 Robust to Model Variations

In the main paper, all of Voyager’s experiments are conducted with gpt-4-0314. We additionally run new experiments with gpt-4-0613 and find that the performance is roughly the same (Fig. A.4). This demonstrates that Voyager is robust to model variations.