AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, Ali Farhadi

What is AI2-THOR?

Humans demonstrate levels of visual understanding that go well beyond current formulations of mainstream vision tasks (e.g. object detection, scene recognition, image segmentation). A key element of visual intelligence is the ability to interact with the environment and learn from those interactions. Current state-of-the-art models in computer vision are trained on still images or videos, which differs from how humans learn. We introduce AI2-THOR as a step towards human-like learning based on visual input.

There are several key factors that distinguish AI2-THOR from other simulated environments:

Interactions. AI2-THOR supports many types of interactions, including object state changes, arm-based manipulation, and causal interactions. For example, a microwave can be opened or closed, a loaf of bread can be sliced and toasted in the toaster, and a faucet can be turned on to fill a mug with water. Figure 11 shows some examples of interactions supported in AI2-THOR.

Scenes. AI2-THOR provides substantially more interactive objects and scenes for training than other platforms by using procedural generation. We also provide support for many scenes designed manually by professional 3D artists, with 120 stand-alone rooms in iTHOR, 89 scenes in RoboTHOR, and 10 evaluation houses in ArchitecTHOR.

Quality. The objects and scenes in AI2-THOR are near photo-realistic. This allows better transfer of the learned models to the real world. In contrast, Atari games or board games such as Go, which are typically used to demonstrate the performance of AI models, are very different from the real world and lack much of the visual complexity of natural environments.

API. AI2-THOR provides a Python API to interact with the Unity 3D game engine that provides many different functionalities such as navigation, applying forces, object interaction, and physics modeling.

Real robot experiments are typically performed in lab settings or constrained scenes since deploying robots in various indoor and outdoor scenes is not scalable. This makes training models that generalize to various situations difficult. Additionally, due to mechanical constraints of robot actuators, using learning algorithms that require thousands of iterations is infeasible. Furthermore, training real robots might be costly or unsafe as they might damage the surrounding environment or the robots themselves during training. AI2-THOR provides a scalable, fast and cheap proxy for real world experiments in different types of scenarios.

In the following sections, we discuss more of the features included in AI2-THOR, how it compares to other simulators, and work that has been conducted in it since the initial release.

What does AI2-THOR feature?

AI2-THOR is used for a wide range of tasks in Embodied AI, robotics, and computer vision. It encompasses many different types of scenes; different types of agents, each with its own set of actions to interact with objects; support for many image modalities; and functions to provide metadata about the state of the environment.

Figure 2 shows AI2-THOR’s agent-simulator loop, in which the front-end Python API interacts with the Unity back-end. Actions are called from the Python API and sent through a local server to Unity. Unity is a powerful real-time game engine that stores our scenes, the code that determines how actions are executed, 3D objects with their properties, and shaders to render different image modalities. Unity then returns an Event, which contains images from the cameras in the scene and the environment metadata.
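The loop described above can be sketched in Python as follows. This is a minimal illustration rather than the project's implementation: `StubController` and `StubEvent` are hypothetical stand-ins so that the snippet runs without launching Unity. It assumes, as in the `ai2thor` Python package, that a controller exposes `step(action=...)` and that the returned event carries a `metadata` dictionary with a `lastActionSuccess` entry.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class StubEvent:
    """Hypothetical stand-in for the Event returned after each action."""
    metadata: Dict[str, Any]


@dataclass
class StubController:
    """Hypothetical stand-in for the Python front-end; no Unity process is started."""
    blocked_actions: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)

    def step(self, action: str, **kwargs: Any) -> StubEvent:
        # A real controller would send the action to Unity and wait for the result.
        self.history.append(action)
        success = action not in self.blocked_actions
        return StubEvent(metadata={"lastAction": action, "lastActionSuccess": success})


def run_actions(controller: Any, actions: List[str]) -> List[bool]:
    """Send each action to the simulator and record whether it succeeded."""
    outcomes = []
    for action in actions:
        event = controller.step(action=action)
        outcomes.append(bool(event.metadata["lastActionSuccess"]))
    return outcomes


# Pretend that moving ahead is blocked, e.g. by a wall in front of the agent.
controller = StubController(blocked_actions={"MoveAhead"})
outcomes = run_actions(controller, ["RotateRight", "MoveAhead", "LookUp"])
print(outcomes)
```

Because `run_actions` only relies on `step` and `metadata`, the same function should work unchanged if the stub is swapped for a real controller instance, assuming the real event exposes the same fields.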

2 Scene Datasets

Many scene datasets have been built as part of AI2-THOR, including iTHOR, RoboTHOR, ProcTHOR-10K, and ArchitecTHOR. Each of these scene datasets is interactive and can be used from the same API with any of the agents.

iTHOR is the original set of scenes used for all experiments, which includes 120 room-sized scenes, covering bedrooms, bathrooms, kitchens, and living rooms. The scenes are modeled by hand by professional 3D artists.

RoboTHOR was developed later and consists of 89 maze-styled, dorm-sized apartments for studying sim2real transfer. These scenes were also built by professional 3D artists. Many of the scenes are recreated in Seattle, near the Allen Institute for AI’s offices, to study the discrepancies that arise when models are evaluated in the same environments in simulation and in reality.

ProcTHOR aims to use procedural generation to massively scale up the number and diversity of training scenes to improve generalization in Embodied AI. Overfitting to the training scenes is a severe problem that is often observed when training on iTHOR and RoboTHOR scenes, and it was hypothesized that merely improving the training data could help solve this problem. ProcTHOR-10K, the initial dataset released with the paper and used for experimentation, procedurally generates 10K diverse and semantically plausible houses for training. Using ProcTHOR for training led to remarkable generalization results, and we expect it to be used as a starting point for training most projects in AI2-THOR moving forward.

ArchitecTHOR is a set of 10 evaluation houses (5 for validation, 5 for testing) that was developed in conjunction with ProcTHOR. Because ProcTHOR is procedurally generated, a test set of houses drawn from a real-world distribution is needed to evaluate whether models trained on ProcTHOR merely memorize biases from the procedural generation or are capable of generalizing to real-world floorplans and object placements. Similar to iTHOR and RoboTHOR, the scenes are hand-built by professional 3D artists, although ArchitecTHOR houses are much larger and styled as single-story houses.

3 Agents

AI2-THOR comes equipped with many agents that support a range of embodiments, including the ManipulaTHOR agent, StretchRE1, LoCoBot, Abstract agent, and Drone agent. Each of these agents is embodied with a different physical robot and has its own set of actions that it can execute in the environment.

All of the agents are able to navigate around the scenes and perform environment queries and state changes. The ManipulaTHOR and StretchRE1 agents are able to use their arm to grasp and open objects. The LoCoBot, Abstract, and Drone agents interact with objects in a more abstract way: when a high-level action such as Open or Pickup is called, it is executed if the agent is looking at the object and is within a certain distance of it.

4 Actions

Agents in AI2-THOR support a wide range of actions, which we can break down into navigation actions, interaction actions, environment queries, and environment state changes.

Each agent comes with some ability to navigate in a given scene. Navigation actions may be discrete or continuous move (e.g. MoveAhead by 0.25 m), rotate (e.g. RotateRight by 30°), look (e.g. LookUp by 30°), or teleport actions. Agents with an arm have more actions to control how the arm is positioned.
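As a rough illustration of how discrete navigation actions change an agent's pose, the sketch below applies the step sizes quoted above (0.25 m and 30°). It is a toy model, not the simulator's code: it ignores collisions, and the `Pose` class, the `apply_action` helper, and the coordinate conventions (heading 0 faces +z, positive yaw turns clockwise, positive horizon looks down) are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    x: float                  # metres
    z: float                  # metres
    yaw_deg: float = 0.0      # heading; 0 faces +z, positive is clockwise (assumed)
    horizon_deg: float = 0.0  # camera pitch; positive looks down (assumed)


def apply_action(pose: Pose, action: str, move_m: float = 0.25, turn_deg: float = 30.0) -> Pose:
    """Return the pose after one discrete navigation action, ignoring collisions."""
    if action == "MoveAhead":
        yaw = math.radians(pose.yaw_deg)
        return Pose(pose.x + move_m * math.sin(yaw), pose.z + move_m * math.cos(yaw),
                    pose.yaw_deg, pose.horizon_deg)
    if action == "RotateRight":
        return Pose(pose.x, pose.z, (pose.yaw_deg + turn_deg) % 360.0, pose.horizon_deg)
    if action == "RotateLeft":
        return Pose(pose.x, pose.z, (pose.yaw_deg - turn_deg) % 360.0, pose.horizon_deg)
    if action == "LookUp":
        return Pose(pose.x, pose.z, pose.yaw_deg, pose.horizon_deg - turn_deg)
    if action == "LookDown":
        return Pose(pose.x, pose.z, pose.yaw_deg, pose.horizon_deg + turn_deg)
    raise ValueError(f"unsupported action: {action}")


# Move one step forward, turn 90 degrees to the right in three steps, move again.
pose = Pose(x=0.0, z=0.0)
for name in ["MoveAhead", "RotateRight", "RotateRight", "RotateRight", "MoveAhead"]:
    pose = apply_action(pose, name)
print(pose)
```

After this sequence the agent has advanced 0.25 m along +z and then 0.25 m along +x, up to floating-point rounding.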

Interactive Actions.

There are many types of interactions supported in AI2-THOR, including abstracted interactions, arm-based manipulation, object state changes, and causal interactions.

Abstracted interactions are often a key component of research in Embodied AI, where one may be interested in studying high-level planning rather than low-level control. Here, an agent can execute an abstracted action, such as open, pickup, push, throw, drop, or place, where as long as the agent can see the object in its frame and it is within a certain distance away from it, the action can execute successfully. Abstracted actions can be used to change an object’s state, such as cooking it, breaking it, slicing it, toggling it, filling it with liquid, or using it up.
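The precondition for abstracted actions (the object is in view and close enough) can be expressed as a simple check over per-object metadata. The sketch below is illustrative only: the field names `objectId`, `visible`, and `distance` are assumed to mirror the object metadata, and the 1.5 m threshold is a placeholder rather than a documented default.

```python
from typing import Any, Dict, List

# Illustrative threshold in metres; an assumption, not a documented default.
MAX_INTERACTION_DISTANCE_M = 1.5


def can_interact(obj: Dict[str, Any],
                 max_distance_m: float = MAX_INTERACTION_DISTANCE_M) -> bool:
    """True if the object is in the agent's view and within the distance limit."""
    visible = bool(obj.get("visible", False))
    distance = float(obj.get("distance", float("inf")))
    return visible and distance <= max_distance_m


def interactable_ids(objects: List[Dict[str, Any]],
                     max_distance_m: float = MAX_INTERACTION_DISTANCE_M) -> List[str]:
    """Ids of all objects an abstracted action could currently target."""
    return [obj["objectId"] for obj in objects if can_interact(obj, max_distance_m)]


# Hand-written sample metadata (hypothetical object ids).
objects = [
    {"objectId": "Mug|1", "visible": True, "distance": 0.9},
    {"objectId": "Fridge|1", "visible": True, "distance": 2.4},   # too far away
    {"objectId": "Apple|1", "visible": False, "distance": 0.6},   # not in view
]
print(interactable_ids(objects))
```

Only the mug passes both conditions in this sample, so it is the only object a high-level action such as pickup could target.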

Arm-based interactions are lower-level than abstracted actions, and require interacting with objects by moving an arm to grip them. They can be used to open an object incrementally in a continuous manner (Figure 7) or to grasp an object and move it from one position to another (Figure 8).

Causal interactions result as a consequence of interacting with another object. For instance, turning on a coffee machine that has a mug placed in it will fill the mug with coffee; throwing a breakable object hard enough may cause it and the surface it is thrown at to shatter; and pushing a table over will cause objects on top of the table to fall and potentially break.

Environment Queries.

Environment queries are used to obtain information about the state of the environment that is not provided with each Event because it is often unnecessary to compute at each time step for every use case. Examples include obtaining the shortest path from the agent to a given target object in the scene, querying which object appears at a pixel in the agent’s current frame, or obtaining the convex hull of a given object.
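To make the shortest-path query concrete, the sketch below runs breadth-first search over a small grid of reachable cells. This is a swapped-in textbook technique used purely for illustration; it does not reproduce how the simulator computes paths, and the grid, the `Cell` type, and the `shortest_path` helper are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]  # grid indices (x, z); one cell could stand for one move step


def shortest_path(reachable: Set[Cell], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Breadth-first search over 4-connected reachable cells; None if no path exists."""
    if start not in reachable or goal not in reachable:
        return None
    parents: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path: List[Cell] = []
            node: Optional[Cell] = cell
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        x, z = cell
        for neighbour in ((x + 1, z), (x - 1, z), (x, z + 1), (x, z - 1)):
            if neighbour in reachable and neighbour not in parents:
                parents[neighbour] = cell
                queue.append(neighbour)
    return None


# A 3x3 room in which two cells are blocked, e.g. by a piece of furniture.
blocked = {(1, 0), (1, 1)}
reachable = {(x, z) for x in range(3) for z in range(3)} - blocked
path = shortest_path(reachable, start=(0, 0), goal=(2, 0))
print(path)
```

In this layout the agent has to walk around the blocked cells, so the path visits seven cells (six moves) even though the goal is only two cells away in a straight line.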

Environment State Changes.

Environment state changes involve actions that modify the environment or its properties. For example, some environment state changes include randomizing the materials in the scene (Figure 10), randomizing the lighting in the scene, updating the rendering quality, updating the resolution of the images from the cameras, and changing the skybox in the scene.

5 Image Modalities

Figure 12 shows a suite of different image modalities that can be rendered from each of the cameras in the scene, including RGB, depth, semantic segmentation, instance segmentation, and normals. Each agent comes with a camera attached to it, but more cameras can also be added, such as one to capture a top-down view of the scene. More image modalities can be added by modifying the Unity back-end (often by adding shaders).

6 Objects

AI2-THOR includes 3,578 interactive objects in its object database, which is rapidly growing. Each of these objects has been hand-modeled to support our set of interactive actions and state changes, such as opening, breaking, or cooking. Figure 13 shows samples of objects from 4 categories, including alarm clocks, side tables, plants, and chairs.

7 Environment Metadata

Environment metadata is returned after each action is executed. It includes information such as the pose of each agent; the pose and state of each object in the scene (e.g., whether the object is moving, if it is visible to the agent, how far open it is, if it is clean or dirty); metadata about the scene, such as its size; and whether the most recent action executed successfully (e.g., the agent did not collide with an object while trying to move). For most tasks, metadata is not provided to the agent, as it would make the tasks too simple and easily solvable with a heuristic. Instead, many tasks use metadata to build a reward function with access to “expert-level” information that is hidden from the agent, to build an imitation learning expert, and to construct training and evaluation datasets.
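As an example of using hidden metadata to build a reward function, the sketch below computes a shaped ObjectNav-style reward from per-object metadata. It is a hypothetical illustration, not a reward used in any cited work: the field names `objectType`, `visible`, and `distance` are assumed to mirror the object metadata, and the penalty, bonus, and success distance are placeholder values.

```python
from typing import Any, Dict, List, Optional, Tuple


def nearest_target_distance(objects: List[Dict[str, Any]], target_type: str) -> Optional[float]:
    """Distance to the closest object of the target category, or None if absent."""
    distances = [float(o["distance"]) for o in objects if o.get("objectType") == target_type]
    return min(distances) if distances else None


def shaped_reward(objects: List[Dict[str, Any]], target_type: str, prev_distance: float,
                  step_penalty: float = -0.01, success_distance_m: float = 1.0,
                  success_bonus: float = 10.0) -> Tuple[float, float]:
    """Return (reward, new distance); all constants are illustrative assumptions."""
    current = nearest_target_distance(objects, target_type)
    if current is None:
        return step_penalty, prev_distance
    # Reward progress towards the target on top of a small per-step penalty.
    reward = step_penalty + (prev_distance - current)
    target_in_view = any(
        o.get("objectType") == target_type
        and o.get("visible", False)
        and float(o["distance"]) <= success_distance_m
        for o in objects
    )
    if target_in_view:
        reward += success_bonus
    return reward, current


# Two consecutive time steps of hand-written sample metadata.
step_1 = [{"objectType": "Bed", "distance": 3.0, "visible": False},
          {"objectType": "Chair", "distance": 0.5, "visible": True}]
step_2 = [{"objectType": "Bed", "distance": 0.8, "visible": True}]

reward_1, distance_1 = shaped_reward(step_1, "Bed", prev_distance=3.5)
reward_2, distance_2 = shaped_reward(step_2, "Bed", prev_distance=distance_1)
print(reward_1, reward_2)
```

The agent's policy never sees `distance` or `visible` here; only the training signal does, which matches the "expert-level" role of metadata described above.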

What has AI2-THOR been used for?

Since the initial release of AI2-THOR in 2017, it has been used for experimentation in over 150 publications and downloaded over 500k times. Some areas of work that we found particularly interesting include:

Visual Navigation. Visual navigation was the first use case of AI2-THOR, where an agent is trained to perform ImageNav (i.e. navigating to a target object that is specified by a picture of it). Here, the agent executes a sequence of move or rotate commands to reach the target from egocentric camera inputs at each time step. ObjectNav is another common navigation task, where the agent is tasked with navigating to an object of a given semantic category, such as a bed. Follow-up work uses semantic priors about where objects typically occur to improve navigation efficiency; uses meta-learning to better adapt to unseen scenes; uses a Markov network to build a map of the environment; finds that using CLIP as a pre-trained visual encoder significantly boosts generalization performance; and finds that training on many procedurally generated scenes generalizes strongly to RoboTHOR, iTHOR, and ArchitecTHOR in a zero-shot setting.

Audio-Visual Navigation. Prior work proposes the task of audio-visual navigation, in which the agent is tasked with navigating to the source of a sound in the scene.

Vision-and-Language. AI2-THOR has been used extensively for embodied vision-and-language research. Notable datasets include ALFRED, for interactive instruction following from natural language; TEACh, for interactive instruction following from human-robot dialog; and DialFRED and IQA, for interactive question answering. Other interesting work includes the Episodic Transformer, which encodes the full history of vision and language inputs for each ALFRED task; grammar-based methods that learn high-level abstractions through decompositions of tasks; FILM, which builds a semantic map to perform exploration for instruction following; and PIGLeT, which learns natural language grounding through interaction.

Human-Robot Interaction. One line of work inserts a human into AI2-THOR and uses virtual reality to control the human’s gestures in simulation. Through these gestures, the human can communicate the tasks it wants the robot to achieve; for example, pointing at an object signals that the robot should move to that object.

Sim2Real Transfer. RoboTHOR studies sim2real transfer for robotics. Here, the goal is to train in simulation because it is faster, cheaper, and more scalable, and then to deploy the trained agent in the real world. Agents train on 75 scenes in simulation and are evaluated on unseen real-world scenes that come from a similar distribution. Initial work analyzed sim2real transfer for agents trained to perform ObjectNav.

Multi-Agent Interaction. Prior work proposes the collaborative task of having 2 agents move to lift up furniture in a scene. For example, both agents might have to navigate to find the television in the scene and then work together to lift it up. Follow-up work takes the task a step further, where the agents not only have to lift up the furniture, but also work together to move it. Both tasks require the agents to navigate visually and to communicate and coordinate with each other. Other notable multi-agent work includes tasking agents with playing Cache, a variant of hide-and-seek where one agent hides an object and the other agent is tasked with finding that object; using multiple agents for interactive question answering; using multiple agents to more efficiently find multiple target objects in a scene; and TEACh, which uses a commander agent and a follower agent to mimic human-robot dialog to solve interactive tasks.

Learning Object Relationships. One approach learns priors about inter-object functional relationships, such as which knob on the stove controls each burner, which light switch controls a given light, and that a remote may control a television. Another uses egocentric videos to learn which objects are used together to complete certain activities, and then uses these priors to help guide agents towards achieving different activities in AI2-THOR.

Learning Affordances. Prior work trains an agent to interact with the environment to learn object affordances, which encode which objects may be interacted with and how. For instance, the agent learns that drawers or fridges may be opened, that the stove can be turned on, and that an apple may be sliced. A model with an affordance landscape would make it easier to adapt to downstream tasks, such as learning to cut a tomato with a knife.

Scene Synthesis. ProcTHOR uses procedural generation to synthesize training houses at scale to improve the generalization abilities of embodied agents. It procedurally generated and trained on 10K houses by first sampling floorplans and then plausibly placing objects within each of the rooms in the floorplan. Remarkably, pre-training on ProcTHOR alone was able to achieve state-of-the-art performance for ObjectNav on RoboTHOR, iTHOR, and ArchitecTHOR, without leveraging any additional training data. LUMINOUS also uses scene synthesis techniques to train embodied agents, where it focuses on placing objects in iTHOR rooms.

Learning with Interaction. AI2-THOR supports a wide range of interactions that can be used to train agents, including rearranging objects in a scene with RoomR, arm-based manipulation with ManipulaTHOR, learning about objects by interacting with them, and playing hide-and-seek with objects to learn visual representations, among many others.

Computer Vision. The rich annotations available in simulation make it easy to use AI2-THOR for pure computer vision tasks. Notable work includes SeGAN, which uses a GAN to generate occluded parts of an object from images of a scene; Interactron, which performs object detection with embodied agents that are able to move around in the environment; and work that uses depth estimation and action prediction to evaluate contrastive learning approaches.

Interpretability. iSEE uses probing to discover what information is in the hidden representations of Embodied AI models. It focuses on probing ObjectNav and PointNav agents to answer interpretability questions, such as how far the agent thinks it is from the target.

AI2-THOR is updated frequently with new features and functionality. For the latest published papers, please visit the publication tracker on our website: https://ai2thor.allenai.org/publications.

Why use AI2-THOR?

Following AI2-THOR’s first release in 2017, a number of simulators have been developed, including iGibson 2.0, Habitat 1.0, Habitat 2.0, ThreeDWorld, and SAPIEN. Table 1 compares these simulators. AI2-THOR is significantly larger in scale than other simulators, provides first-class support for interaction, and, by leveraging Unity, makes it easy to add new capabilities.

To benchmark performance, we trained an ObjectNav agent for 1 million steps on a 2-GPU machine. Here, GPU-0 stores and performs updates to the model while GPU-1 renders a batch of parallel instances of the simulator. We obtain a training FPS ranging between 145.5–179.4 (167.7 average). For comparison, we ran the same setup with Habitat 1.0 and obtained a training FPS ranging between 119.7–264.3 (230.5 average). More details are described in Appendix B.
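To put these throughput figures in perspective, the short calculation below converts the reported average training FPS into approximate wall-clock time for the 1 million step run. It assumes, as a simplification, that throughput stays at the reported average for the whole run; the helper name `training_minutes` is introduced only for this example.

```python
TOTAL_STEPS = 1_000_000

# Average training FPS as reported in the text.
AVERAGE_FPS = {"AI2-THOR": 167.7, "Habitat 1.0": 230.5}


def training_minutes(total_steps: int, fps: float) -> float:
    """Wall-clock minutes needed to collect total_steps at a constant fps."""
    return total_steps / fps / 60.0


minutes = {name: training_minutes(TOTAL_STEPS, fps) for name, fps in AVERAGE_FPS.items()}
speed_ratio = AVERAGE_FPS["Habitat 1.0"] / AVERAGE_FPS["AI2-THOR"]

for name, value in minutes.items():
    print(f"{name}: {value:.1f} minutes")
print(f"Habitat 1.0 / AI2-THOR throughput ratio: {speed_ratio:.2f}")
```

Under this assumption, the run takes roughly 99 minutes with AI2-THOR and roughly 72 minutes with Habitat 1.0, a throughput ratio of about 1.37.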

Conclusion

We present AI2-THOR, a large-scale interactive simulation platform for Embodied AI. It has been used for experimentation in over 150 publications, spanning a wide variety of tasks and research areas. It is highly customizable, and provides first-class support for many different types of scenes, agent embodiments, actions, and metadata. The capabilities of AI2-THOR are rapidly evolving, and we are excited to support new improvements and use cases to come. For the latest information, please visit our website: https://ai2thor.allenai.org/.

References

Appendix A Contributions

Eric Kolve

was the lead engineer and built the API that connects Python and Unity, set up the infrastructure for maintenance and development, heavily optimized AI2-THOR to run faster, added support for headless rendering, contributed to the Unity backend, and contributed to RoboTHOR, ProcTHOR, and ManipulaTHOR.

Roozbeh Mottaghi

managed the AI2-THOR project and its constituents and made decisions about the technical and artistic features of the framework and set priorities for the team.

Winson Han

contributed to the Unity backend logic for features and functionality across AI2-THOR; oversaw the design and functionality of the agents; led the development of logic to support physics-based object interactions, state changes, visibility, repositioning, and the annotation pipeline; set up default object placement in scenes; contributed to the documentation; managed community feature requests and issues; and created many promotional graphics.

Eli VanderBilt

built all of the 3D scenes for iTHOR, RoboTHOR, ArchitecTHOR; created thousands of interactive assets; modeled the agents; and designed and implemented various features, including arm-based manipulation.

Luca Weihs

contributed to the AI2-THOR frontend and backend through the creation of new actions, tests, and processes; led the development of the AllenAct framework, a library used to train agents on AI2-THOR and Embodied AI tasks.

Alvaro Herrasti

developed features and infrastructure for the Unity backend and Python API; graphics and shader work; built the WebGL infrastructure and demo integration; built the continuous action physics system for arm-based agents; led the Unity development of ProcTHOR; and contributed to RoboTHOR and ManipulaTHOR.

Matt Deitke

led the development of ProcTHOR; built the AI2-THOR website and demo; wrote documentation; contributed to building RoboTHOR; built infrastructure to make AI2-THOR more accessible; contributed to the Unity backend and Python API; and wrote the revised paper.

Kiana Ehsani

led the ManipulaTHOR project and the direction of adding arm-based manipulation with the StretchRE1 and ManipulaTHOR agents.

Daniel Gordon

developed some planning and rendering features for the early versions of AI2-THOR.

Yuke Zhu

created the very first version of AI2-THOR (mentioned in ) with the help of EK and RM.

Aniruddha Kembhavi

was involved in decision making for various features of ProcTHOR, ManipulaTHOR, ArchitecTHOR, and RoboTHOR.

Abhinav Gupta

provided advice and guidance throughout the course of the project.

Ali Farhadi

provided advice and guidance throughout the course of the project.

Appendix B Performance Comparison

Comparing performance between Embodied AI simulators is surprisingly difficult for many reasons:

Different simulators support different agents, each with their own action spaces and capabilities, with little standardization across simulators. AI2-THOR supports many different types of agents, including the ManipulaTHOR, Abstract, and LoCoBot agents. The ManipulaTHOR agent is often slower to simulate than a navigation-only LoCoBot agent, as it is more complex to physically model a 6 DoF arm as it interacts with objects. Matters are further complicated by the fact that random action sampling, the simplest policy with which to benchmark, is a poor profiling strategy, as some actions are only computationally expensive in rare, but important, settings. For instance, computing arm movements is most expensive when the arm is interacting with many objects; these interactions are rare when randomly sampling, but we’d expect them to dominate when using a well-trained agent.

Some simulators are relatively slow when run on a single process but can be easily parallelized with many processes running on a single GPU, e.g. AI2-THOR. Thus single-process simulation speeds may be highly deceptive as they do not capture the ease of scalability.

When training agents via reinforcement learning, there are a large number of factors that bottleneck training speed and so the value of raw simulator speed is substantially reduced. These factors include:

Model forward pass when computing agent rollouts.

Model backward pass when computing gradients for RL losses.

Environment resets: for many simulators (e.g. AI2-THOR, Habitat, iGibson), it is orders of magnitude more expensive to change a scene than it is to take a single agent step. This can be extremely problematic when using synchronous RL algorithms, as all simulators must wait whenever a single simulator is resetting. In practice, this means that important "tricks" are employed during training to ensure that scene changes are infrequent or synchronized; without these tricks, performance may be dramatically lower.
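The cost of unsynchronized scene changes under a synchronous algorithm can be illustrated with a toy model in which every step waits for the slowest simulator. All numbers below are assumptions chosen for illustration (a scene change costing 100 times an agent step, 4 simulators, a scene change every 100 steps); they are not measurements.

```python
from typing import List


def synchronous_wall_time(total_steps: int, change_every: int, offsets: List[int],
                          step_cost: float = 1.0, change_cost: float = 100.0) -> float:
    """Total time when each step costs as much as the slowest simulator.

    Simulator i changes scene at step t whenever (t + offsets[i]) is a
    multiple of change_every; otherwise it takes an ordinary agent step.
    """
    total = 0.0
    for t in range(total_steps):
        any_changing = any((t + offset) % change_every == 0 for offset in offsets)
        total += change_cost if any_changing else step_cost
    return total


TOTAL_STEPS = 1000
CHANGE_EVERY = 100

# All simulators change scene at the same steps.
synchronized = synchronous_wall_time(TOTAL_STEPS, CHANGE_EVERY, offsets=[0, 0, 0, 0])
# Each simulator changes scene at a different step.
staggered = synchronous_wall_time(TOTAL_STEPS, CHANGE_EVERY, offsets=[0, 25, 50, 75])

print(synchronized, staggered)
```

With synchronized changes, only 10 of the 1000 steps pay the scene-change cost; with staggered changes, 40 steps do, so the total time in this toy model rises from 1990 to 4960 units even though each simulator performs the same amount of work.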

To attempt to control for the above factors, we set up two profiling experiments, one in Habitat with HM3D and one using ProcTHOR-10K, where we:

Use a 2-GPU machine (GeForce RTX 2080 GPUs) where GPU-0 is reserved for the agent’s actor-critic policy network and GPU-1 is reserved for simulator instances.

Train agents for the ObjectNav task (using the same LoCoBot agent with the same action space).

For both agents, use the same actor-critic policy network, the same one used in the ProcTHOR paper.

Remove the "End" action so that agents always take the maximum 500 steps; this minimizes dependence on the learned policy.

Use a rollout length of 128 with the same set of training hyperparameters across both models.

Use a total of 28 parallel simulator processes, which approximately saturates GPU-1 memory. We found that Habitat instances used slightly less GPU memory than ProcTHOR instances, so we could likely increase the number of instances for Habitat slightly, but we kept these equal for a more direct comparison.

Use a scene update "trick" which forces all simulators to advance to the next scene in a synchronous fashion after every 10 rollouts (i.e. after every 10 × 128 × 28 = 35,840 total steps across all simulators).
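The step counts implied by this setup can be checked with a few lines of arithmetic. The constants below are taken from the list above and from the 1 million step benchmark in the main text; the variable names are introduced only for this example.

```python
ROLLOUT_LENGTH = 128          # steps per simulator per rollout
NUM_SIMULATORS = 28           # parallel simulator processes
ROLLOUTS_PER_SCENE = 10       # rollouts between synchronized scene updates
MAX_EPISODE_STEPS = 500       # episode length once the "End" action is removed
TOTAL_TRAINING_STEPS = 1_000_000

# Steps collected across all simulators in one rollout.
steps_per_rollout = ROLLOUT_LENGTH * NUM_SIMULATORS

# Steps across all simulators between two scene updates.
steps_per_scene = ROLLOUTS_PER_SCENE * steps_per_rollout

# Steps each individual simulator spends in a scene before it is replaced.
steps_per_simulator_per_scene = ROLLOUTS_PER_SCENE * ROLLOUT_LENGTH

# Number of full scene-update intervals that fit into the benchmark run.
scene_updates = TOTAL_TRAINING_STEPS // steps_per_scene

# Fixed-length episodes each simulator plays per scene.
episodes_per_simulator_per_scene = steps_per_simulator_per_scene / MAX_EPISODE_STEPS

print(steps_per_rollout, steps_per_scene, steps_per_simulator_per_scene,
      scene_updates, episodes_per_simulator_per_scene)
```

Each simulator therefore spends 1,280 steps (about 2.56 fixed-length episodes) in a scene, and 27 full scene-update intervals fit into the 1 million step run, so scene changes are rare relative to agent steps.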