Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation

Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, Amanpreet Singh

Introduction

Humans have an uncanny ability to seamlessly navigate in unseen environments by quickly understanding their surroundings. Consider the example in Fig. 1. You’re visiting a friend’s home for the first time and you want to go to the kitchen. You’re in the entryway and you look around to observe your surroundings. You see a bedroom in one direction and the dining room in the other direction. A possible, but tedious solution is to head in a random direction and exhaustively search the space until you end up in the kitchen. Another option, and most probably the one you’d pick, is to walk towards the dining room as you are more likely to find the kitchen near the dining room rather than the bedroom. We believe that there are underlying architectural principles which govern the design of houses and our prior knowledge of house layouts helps us to navigate effectively in new spaces. We also improve our predictions of how the rest of the house is laid out as we walk around and gather more information. The goal of this work is to elicit a similar behaviour in embodied agents by enabling them to predict regions which lie beyond their field of view through learned scene priors. The agent models correlations between the appearance and architectural layout of houses to efficiently navigate in unseen scenes.

Currently, there exist two paradigms for the navigation problem: (1) Classical path-planning-based methods: SLAM-based approaches first build a geometric map and then use path planning with localization for navigation. (2) Learning-based methods: a policy is learned for a specific task in an environment. In this work, we introduce a learning-based method that, unlike previous approaches, predicts an intermediate representation that captures the model’s current belief of the semantic layout of the house, beyond the agent’s field of view. Prior works represent priors in the form of knowledge graphs or probabilistic relation graphs which capture room location priors. However, we also want to learn other correlations, such as estimating the shape of a room by observing parts of it. We choose to model correlations as semantic maps indicating the location, shape, and size of rooms which lie beyond the agent’s field of view. This offers a more flexible representation.

In this work, we develop a novel technique to dynamically generate amodal semantic top-down maps of rooms by learning architectural regularities in houses and to use these predicted maps for room navigation. We define the task of room navigation as navigating to the nearest room of the specified type, for example, navigating to the bedroom closest to the starting point. We train an agent using supervision to predict regions on a map that lie beyond its field of view, which forces it to develop beliefs about where a room might be present before navigating to it. The agent constantly updates these beliefs as it steps around the environment. The learned beliefs help the agent navigate in novel environments.

Contributions. (1) We introduce a novel learning-based approach for room navigation via amodal prediction of semantic maps. The agent learns architectural and stylistic regularities in houses to predict regions beyond its field of view. (2) Through carefully designed ablations, we show that our model trained to predict semantic maps as intermediate representations achieves better performance on unseen environments compared to a baseline which doesn’t explicitly generate semantic top-down maps. (3) To evaluate our approach, we introduce the room navigation task and dataset in the Habitat platform.

Related Work

Navigation in mobile robotics. Conventional solutions to the navigation problem in robotics comprise two main steps: (1) mapping the environment and simultaneously localizing the agent in the generated map, and (2) path planning towards the target using the generated map. Geometric solutions to the mapping problem include (i) structure from motion and (ii) simultaneous localization and mapping (SLAM). Various SLAM algorithms have been developed for different sensory inputs available to the agent. Using the generated map, a path can be computed to the target location via several path planning algorithms. These approaches fall under the passive SLAM category, where a human navigates around the environment beforehand to generate the maps. On the other hand, active SLAM research focuses on dynamically controlling the camera for building spatial representations of the environment. Some works formulate active SLAM as a Partially Observable Markov Decision Process and use either Bayesian Optimization or Reinforcement Learning to plan trajectories that lead to accurate maps. Others use Rao-Blackwellized Particle Filters to choose the set of actions that maximize the information gain and minimize the uncertainty of the predicted maps.

A less studied yet actively growing area of SLAM research is incorporating semantics into SLAM. Some works use semantic features for improved localization performance and for performing SLAM on dynamic scenes. However, all of the aforementioned SLAM techniques rely on sensor data which is highly susceptible to noise. They also have no mechanism to learn and update semantic beliefs which can be transferred across environments. One line of work combines a classical continual planner with a decision-theoretic planner for active object search. The planner leverages conceptual spatial knowledge in the form of object co-occurrences and semantic place categorisation. On the other hand, our work learns semantic maps of room locations for the task of Room Navigation. Prior work also summarizes the various ways of representing semantic information and using it for indoor navigation, and outlines the limitations and open challenges in SLAM. This motivates learning based methods for navigation, which we describe next.

Learning based methods for navigation. With the motivation of generalizing to novel environments and learning semantic cues, a number of end-to-end learning based approaches have been developed in the recent past. Some use a topological graph for navigation and propose a memory based policy that uses attention to exploit spatio-temporal dependencies. Others jointly learn the goal-driven reinforcement learning problem with auxiliary depth prediction tasks. The RoomNav task was introduced in the House3D simulation platform, together with a policy trained using deep deterministic policy gradient to solve it. Other work focuses on building a task-agnostic exploration policy and demonstrates that this helps for downstream navigation tasks. Most relevant to our work is Cognitive Mapping and Planning (CMP). It uses a differentiable mapper to learn a spatial memory that corresponds to an egocentric map of the environment and a differentiable planner that uses this memory alongside the goal to output navigational actions. The maps constructed using this approach only indicate free space and contain no semantic information. On the other hand, we predict semantic top-down maps indicating the location, shape, and size of rooms in the house. Furthermore, unlike CMP where the map corresponds to a top-down view of what the agent is currently seeing, our maps predict regions which lie beyond the agent’s field of view by learning architectural regularities in houses. One prior approach uses prior knowledge about spatial and visual relationships between objects, represented as a knowledge graph, for the task of semantic navigation. The main disadvantage of such an approach is that the agent cannot modify the graph to learn new priors or update existing beliefs during training. Another estimates priors at training time by constructing probabilistic relationship graphs over semantic entities and uses these graphs for planning. However, these graphs don’t capture information regarding the size and shape of rooms and patterns in houses.
Contrary to previous approaches, our work does not rely on pre-constructed maps or knowledge graphs representing priors. We dynamically learn amodal semantic belief maps which model architectural and stylistic regularities in houses. Further, the agent updates its beliefs as it moves around in an environment.

Vision-Language Navigation (VLN). A different but related task, language-guided visual navigation, has been introduced in prior work. In VLN, an agent follows language instructions to reach a goal in a home, for example, “Walk up the rest of the stairs, stop at the top of the stairs near the potted plant”. There are multiple works which attempt to solve this problem. The room navigation task introduced here is different in that the agent doesn’t receive any language based instructions, just the final goal in the form of a room type. In VLN, the agent “teleports” from one location to another on a sparse pre-computed navigation graph and can never collide with anything. Our Room Navigation task, on the other hand, starts from the significantly more realistic setting of Point Navigation, where the agent takes low-level actions such as move-forward (0.25m) and turn-left/right (10 degrees) and learns to avoid collisions. In that setting, paths have an average length of 3.5m whereas the ground truth paths in our task have average length of 125. Overall, compared to VLN, the navigation in our task is significantly more challenging. Relative to point navigation, where the goal is specified as coordinates, the goal specification in our work is more semantic and closer to language: the name of a room. The room navigation task allows us to move towards complex goal specification (e.g., follow an instruction) while keeping the navigation realistic. We’d like to highlight that the methods developed for VLN aren’t directly applicable to room navigation as they all rely on intermediate goals in the form of language based instructions, which aren’t a part of our task specification.

Room Navigation Task

The agent is spawned at a random starting location and orientation in a novel environment and is tasked with navigating to a given target room, e.g., “Kitchen.” If there exist multiple rooms of the same type, the agent needs to navigate to the room closest to its starting location. We ensure that the agent never starts in a room of the target room type, i.e. if the agent is in a bedroom, the target room cannot be a bedroom. Similar to prior work, with each step the agent receives an RGB image from a single color vision sensor, depth information from the depth sensor, and GPS+Compass readings that provide the current position and orientation relative to the start position. When no GPS information is available we only need egomotion, which most robotics platforms provide via IMU sensors or odometry. As in prior work, the agent does not have access to any ground truth floor plan map and must navigate using only its sensors. Unlike point navigation, room navigation is a semantic task, so GPS+Compass is insufficient to solve the task and only helps in preventing the agent from going around in circles.

Room Navigation using Amodal Semantic Maps

We develop a room navigation framework with an explicit mapping strategy that predicts amodal semantic maps by learning underlying correlations in houses, and navigates using these maps. Our approach assumes there are architectural and stylistic regularities in houses which allow us to predict regions of a house which lie beyond our field of view. For instance, if we are in the kitchen we can guess that the dining room is adjacent to it, and we would be right in most cases. We acknowledge that these regularities likely vary across geographies and cultures. We first verify the existence of such correlations by scaling and aligning all the homes in the Matterport 3D dataset such that the kitchen is in the bottom-left corner. As shown in Fig. 2, the dining rooms are concentrated close to the kitchens, and the bedrooms are in the opposite corner, away from the kitchens. We believe there exist similar and more subtle correlations (e.g. the size of the kitchen could be indicative of the number of bedrooms in a house) which our agent can automatically learn while predicting amodal semantic top-down maps of regions. We now provide an overview of our approach followed by a detailed explanation of each of its sub-components.

Overview. Our room navigation framework is outlined in Fig. 3. The agent is spawned in a random location and is asked to navigate to a specified target room, $tr \in \mathcal{R}$, where $\mathcal{R}$ is the set of all possible target rooms. With each step $t$, the agent receives an RGB image $I_t$, depth $D_t$, and GPS+Compass information. The agent predicts egocentric crops of top-down semantic maps indicating the rooms which lie in and beyond its field of view. An example of these maps is shown in Fig. 4. To generate the maps, the agent uses a sequence to sequence (Seq2Seq) network which takes as input the RGB image from the current time step $I_t$, the predicted semantic maps from the previous time step $M^{\textrm{pred}}_{t-1,r}\ \forall r \in \mathcal{R}$, and the previous action $a_{t-1}$, and predicts semantic maps $M^{\textrm{pred}}_{t,r}$ for the current time step. The predicted semantic maps $M^{\textrm{pred}}_{t,r}$ are fed to a point prediction network which predicts a target point $P^{\textrm{pred}}_t$ lying inside the target room. The agent navigates to the predicted point using a point navigation policy $\pi_{nav}(P^{\textrm{pred}}_t, D_t)$. The agent updates its beliefs of the predicted semantic maps and the predicted point as it steps around the environment. The episode is deemed successful if the agent stops inside the target room before 500 steps.

Next, we describe the three main components of our model architecture: Map Generation, Point Prediction and Point Navigation, as shown in Fig. 3.

4.1 Map Generation

We model the correlations in houses by learning to predict amodal semantic top-down maps. The agent uses the information it has gathered so far to determine where the different rooms in the house are present even before visiting these regions. The maps are crops of top-down maps indicating the type, location, size and shape of rooms and are egocentric to the agent. Fig. 4 shows an example of ground truth maps. The maps have 3 classes: (1) regions which lie outside the house, (2) regions which lie in the house but outside the target room, (3) regions which lie in the target room. We hypothesize that the agent will learn architectural regularities in houses and correlations between the RGB images and layouts in order to predict these maps. We train a map generation network $f_{map}$ to predict egocentric crops of top-down semantic maps for each room $r \in \mathcal{R}$. $f_{map}$ consists of a sequence to sequence network $f_{seq}$ and a decoder network $f_{dec}$. At each time step $t$, $f_{seq}$ takes as input a concatenation of learned representations of the current RGB frame $f_i(I_t)$, the previous action $f_{act}(a_{t-1})$, and the semantic map of the previous time step $f_m(M_{t-1,r})$. $f_{seq}$ outputs a latent representation $h_{t,r}$, which is passed through a parameterized decoder network $f_{dec}$ that resembles a decoder used in prior work. $f_{dec}$ upsamples the latent representation using multiple transpose convolution layers to produce the output $M^{\textrm{pred}}_{t,r}\ \forall r \in \mathcal{R}$. Following prior work, during training we set $M^{\textrm{input}}_{t,r}$ to $M^{\textrm{pred}}_{t-1,r}$ (the predicted map from the previous time step) 50% of the time and to $M^{\textrm{GT}}_{t-1,r}$ (the ground truth map from the previous time step) the rest of the time. At each step, we use the ground truth semantic map $M^{\textrm{GT}}_{t,r}$ to train $f_{seq}$ and $f_{dec}$ with a cross entropy loss, $\mathcal{L}_{map}$. Eqn. 1-5 describe the exact working of $f_{map}$. The agent continuously generates maps at each step until it calls stop or reaches the end of the episode.

During inference, we feed a random image as the input semantic map for the first time step and use $M^{\textrm{pred}}_{t-1,r}$ as $M^{\textrm{input}}_{t,r}$ for all subsequent steps.
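The choice of input map during training and inference can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `select_input_map` and `random_map`, the representation of a map as a grid of class ids, and the modelling of the "random image" as uniformly random class ids are all assumptions made for the example.

```python
import random
from typing import List, Optional

# A map is an H x W grid of class ids (assumed encoding):
# 0 = outside the house, 1 = in the house but outside the room, 2 = in the room.
Map = List[List[int]]


def random_map(height: int, width: int, rng: random.Random, num_classes: int = 3) -> Map:
    """Hypothetical stand-in for the 'random image' fed at the first inference step."""
    return [[rng.randrange(num_classes) for _ in range(width)] for _ in range(height)]


def select_input_map(
    pred_prev: Optional[Map],
    gt_prev: Optional[Map],
    training: bool,
    rng: random.Random,
    shape=(4, 4),
    p_pred: float = 0.5,
) -> Map:
    """Pick the map fed to the sequence model at step t.

    Training: the previous prediction with probability p_pred, otherwise the
    previous ground truth map (callers are assumed to supply both).
    Inference: a random map at the first step, the previous prediction afterwards.
    """
    if training:
        return pred_prev if rng.random() < p_pred else gt_prev
    if pred_prev is None:
        return random_map(shape[0], shape[1], rng)
    return pred_prev
```

Mixing predicted and ground truth inputs during training exposes the network to its own imperfect predictions, which is the situation it faces at inference time when no ground truth map is available.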

4.2 Point Prediction

The maps predicted by $f_{map}$ are amodal: the agent predicts regions that it has not seen yet. They are, however, crops: the agent does not predict the layout of the entire house. These crops are egocentric to the agent and the target room may not always appear in these maps. For example, in Fig. 4 there exists a bathroom in the house, but it does not appear in the crop because it falls outside the crop around the agent’s current location. Inspired by the recent progress in point navigation, we reduce the room navigation problem to point navigation. We train a network $f_{point}$ to predict a target point $P^{\textrm{pred}}_t = (x'_t, y'_t)$ that lies in the target room $tr$, at each step $t$ of the agent. Similar to Sec. 4.1, we learn representations $f_i(I_t)$ of the RGB image $I_t$, $f_m(M^{\textrm{pred}}_{t,r})\ \forall r \in \mathcal{R}$ of the predicted semantic maps $M^{\textrm{pred}}_{t,r}$, and $f_{emb}(tr)$, which is a one-hot embedding of the target room ID $tr$. The predicted semantic map representations for the different rooms are combined using Eq. 6,

where $\odot$ represents element-wise multiplication. These are then concatenated with the target room ID $tr$ and fed to a multilayer perceptron (MLP) $f_{point}$ which outputs $P^{\textrm{pred}}_t$ as described in Eqn. 7. $f_{point}$ is trained using mean square loss w.r.t. a ground truth target point in the target room, $P^{\textrm{GT}}$, as shown in Eqn. 8. Sec. 5 describes how this point is chosen.

During inference, the agent predicts a point every $k=6$ steps. Once the agent completes $K=60$ steps, the target point is simply fixed and no longer updated. The episode terminates if the agent calls stop or reaches $N=500$ steps.
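The update schedule described above can be written as a small predicate. This is a sketch under stated assumptions: the function name `should_update_point` is hypothetical, steps are assumed to be 0-indexed, and the first prediction is assumed to happen at step 0.

```python
def should_update_point(t: int, k: int = 6, freeze_after: int = 60) -> bool:
    """Return True if the target point is re-predicted at step t.

    Assumes 0-indexed steps: the point is refreshed every k steps and is
    frozen once the agent has completed `freeze_after` steps.
    """
    return t < freeze_after and t % k == 0


# Steps at which the point would be refreshed in a full-length episode (N = 500).
update_steps = [t for t in range(500) if should_update_point(t)]
```

Under these assumptions the point is refreshed ten times, at steps 0, 6, ..., 54, and the point navigation policy pursues a fixed goal for the remainder of the episode.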

4.3 Point Navigation

At this stage, we have reduced room navigation to point navigation, where the agent needs to navigate to the predicted target point $P^{\textrm{pred}}_t$. Following prior work, we train a point navigation policy using Proximal Policy Optimization (PPO) on an existing dataset of point navigation episodes. The policy, described in Eqn. 9, is parameterized by a 2-layer LSTM with a 512-dimensional hidden state. It takes three inputs: the previous action $a_{t-1}$, the predicted target point $P^{\textrm{pred}}_t$, and an encoding of the depth input $f_d(D_t)$. We only feed the depth input to the point navigation policy as this was found to work best. The LSTM’s output is used to produce a softmax distribution over the action space and an estimate of the value function.

The agent receives terminal reward $r_T = 2.5$, and shaped reward $r_t(a_t) = -\Delta_{\text{geo\_dist}} - 0.01$, where $\Delta_{\text{geo\_dist}} = d_t - d_{t-1}$ is the change in geodesic distance to the goal by performing action $a_t$. We then use this pre-trained policy, $\pi_{nav}$, to navigate to the predicted point, $P^{\textrm{pred}}_t$. We also fine-tune $\pi_{nav}$ on the points predicted by our model and this improves the performance.
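As a worked illustration of the reward above, the following sketch computes the shaped reward for one step and the undiscounted return of a short episode. The function names are hypothetical, and granting the terminal reward only on success is an assumption; the text does not spell out the failure case.

```python
from typing import List


def shaped_reward(d_prev: float, d_curr: float, slack: float = 0.01) -> float:
    """r_t = -(d_t - d_{t-1}) - 0.01: positive when the agent gets closer to the goal."""
    return -(d_curr - d_prev) - slack


def episode_return(distances: List[float], success: bool, terminal_reward: float = 2.5) -> float:
    """Undiscounted sum of shaped rewards along a trajectory of geodesic distances.

    `distances[i]` is the geodesic distance to the goal after i actions.
    The terminal reward is added only on success (an assumption of this sketch).
    """
    total = sum(shaped_reward(a, b) for a, b in zip(distances, distances[1:]))
    return total + (terminal_reward if success else 0.0)
```

Because the distance terms telescope, the shaped part of the return equals the total reduction in geodesic distance minus 0.01 per action, so shorter paths to the goal earn more.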

To recap, $f_{map}$ generates the semantic map of the space, $f_{point}$ acts as a high-level policy and predicts a point, and the low-level point navigation controller $\pi_{nav}$ predicts actions to navigate to this point.
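The interaction between the three components can be sketched as a control loop. This is a structural illustration only: `run_episode` and all of its callable parameters are hypothetical placeholders for the learned networks and the simulator, and for simplicity the point is re-predicted at every step.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def run_episode(
    observe: Callable[[], dict],          # returns {"rgb": ..., "depth": ...}
    predict_maps: Callable,               # stands in for f_map(obs, prev_maps, prev_action)
    predict_point: Callable,              # stands in for f_point(obs, maps, target_room)
    nav_policy: Callable,                 # stands in for pi_nav(point, depth, prev_action)
    step: Callable[[int], None],          # applies an action in the environment
    target_room: int,
    initial_maps: list,
    max_steps: int = 500,
    stop_action: int = 0,
) -> int:
    """Run one episode and return the number of actions taken (stop included)."""
    maps = initial_maps
    prev_action = stop_action  # placeholder for "no previous action"
    for t in range(max_steps):
        obs = observe()
        maps = predict_maps(obs, maps, prev_action)       # update semantic beliefs
        point = predict_point(obs, maps, target_room)     # high-level goal
        action = nav_policy(point, obs["depth"], prev_action)  # low-level control
        if action == stop_action:
            return t + 1
        step(action)
        prev_action = action
    return max_steps
```

Keeping the three stages behind separate callables mirrors the modular design described above, where each stage can be trained or ablated independently.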

4.4 Implementation Details

Point Navigation Policy. The depth encoding $f_d(D_t)$ is based on ResNeXt with the number of output channels at every layer reduced by half. As in prior work, we replace every BatchNorm layer with GroupNorm to account for highly correlated inputs seen in on-policy RL. We use PPO with Generalized Advantage Estimation (GAE) to train the policy network. We set the discount factor $\gamma$ to 0.99 and the GAE parameter $\tau$ to 0.95. Each worker collects (up to) 128 frames of experience from 4 agents running in parallel (all in different environments) and then performs 2 epochs of PPO with 2 mini-batches per epoch. We use Adam with a learning rate of $2.5 \times 10^{-4}$. We use DD-PPO to train 64 workers on 64 GPUs.
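For readers unfamiliar with GAE, the standard recursion with the hyperparameters quoted above can be sketched in plain Python. This is the textbook formulation for a single rollout without episode boundaries, not the authors' training code; the function name and the `last_value` bootstrap argument are assumptions of the example.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    last_value: float = 0.0,
    gamma: float = 0.99,
    tau: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one rollout.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * tau * A_{t+1}
    `last_value` bootstraps V for the state after the final reward.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * tau * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With $\tau$ close to 1 the estimate leans on observed returns (lower bias, higher variance); smaller values lean on the value function instead.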

Room Navigation Dataset

Simulator and Datasets. We conduct our experiments in Habitat, a 3D simulation platform for embodied AI research. We introduce the room navigation task in the Habitat API and create a dataset of room navigation episodes using scenes from Matterport 3D. We use Matterport 3D as it is equipped with room category and boundary annotations and hence is best suited for the task of Room Navigation. It consists of 61 scenes for training, 11 for validation, and 18 for testing. We only use the subset of the 90 buildings which are houses and exclude others such as spas, as those locations do not have common room categories with the majority of the dataset. We extract a subset of these scenes which contain at least one of the following room types on the first floor: Bathroom, Bedroom, Dining Room, Kitchen, Living Room. We only use the first floor of the house because (1) in Matterport 3D, the bounding boxes of rooms on different floors overlap at times, e.g. the box for a room on the first floor often overlaps with the room right above it on the second floor, making it hard to sample points which lie on the same floor, and (2) the floors are uneven, making it difficult to distinguish between the different levels of the house.

Our dataset comprises 2.6 million episodes in 32 train houses, 200 episodes in 4 validation houses, and 500 episodes in 10 test houses.

Episode Specification. An episode starts with the agent being initialized at a starting position and orientation that are sampled at random from all navigable positions of the environment. The target room is chosen from $\mathcal{R}$ if it is present in the house and is navigable from the starting position. We ensure the start position is not in the target room and has a geodesic distance of at least 4m and at most 45m from the target point in the room. During the episode, the agent is allowed to take up to 500 actions. After each action, the agent receives a set of observations from the active sensors. Statistics of the room navigation episodes can be found in the supplementary.
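The sampling constraints above amount to a simple filter on candidate episodes. The sketch below is illustrative: `is_valid_episode` is a hypothetical helper, and representing an unreachable target by an infinite geodesic distance is an assumption of the example.

```python
import math
from typing import Optional


def is_valid_episode(
    start_room_type: Optional[str],
    target_room_type: str,
    geodesic_dist_m: float,
    min_dist_m: float = 4.0,
    max_dist_m: float = 45.0,
) -> bool:
    """Check the episode constraints described in the text.

    `start_room_type` is None when the agent starts outside any annotated room.
    An infinite distance is assumed to mean the target is not navigable.
    """
    if start_room_type == target_room_type:
        return False  # the agent must not start in a room of the target type
    if not math.isfinite(geodesic_dist_m):
        return False  # target not reachable from the start position
    return min_dist_m <= geodesic_dist_m <= max_dist_m
```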

Evaluation Metric. Similar to prior work, we design two evaluation metrics for room navigation: RoomNav Success weighted by (normalized inverse) Path Length (RoomNav SPL) and Success. An episode is considered a success if the agent stops 0.2m inside the bounds of the specified target room. We use 0.2m because the room boundaries in Matterport 3D sometimes lie outside the room, and this margin ensures the agent has indeed stepped inside the room. As shown in Fig. 5, we compute the geodesic distance from the source point to all the navigable points that lie 0.2m within the bounds of the room and choose the ground-truth target point $P^{\textrm{GT}}$ that is closest to the agent’s start position, i.e. has the shortest geodesic distance. RoomNav SPL is similar to the SPL defined in prior work. Let $S$ indicate ‘success’, $l$ be the shortest geodesic distance between the start point and $P^{\textrm{GT}}$ defined above, and $p$ be the length of the agent’s path; then $\text{RoomNav SPL} = S \frac{l}{\max(l, p)}$. To achieve an SPL of 1, the agent must enter the nearest target room and call stop when it has stepped 0.2m into the room.
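The metric and the choice of ground-truth target point can be sketched directly from the definitions above. The helper names are hypothetical, and the geodesic distance is passed in as a callable because computing it requires the simulator's navigation mesh.

```python
from typing import Callable, Iterable, Tuple

Point = Tuple[float, float]


def pick_target_point(candidates: Iterable[Point], geodesic_from_start: Callable[[Point], float]) -> Point:
    """Choose P_GT: the candidate (already 0.2m inside the room) closest to the start."""
    return min(candidates, key=geodesic_from_start)


def roomnav_spl(success: bool, shortest_m: float, path_m: float) -> float:
    """RoomNav SPL = S * l / max(l, p) for a single episode."""
    if shortest_m <= 0:
        raise ValueError("shortest path length must be positive")
    if not success:
        return 0.0
    return shortest_m / max(shortest_m, path_m)
```

For example, an agent that succeeds after travelling twice the shortest geodesic distance scores 0.5, and a failed episode scores 0 regardless of path length.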

Agent. As in prior work, the agent is modeled as a cylinder with diameter 0.2m and height 1.5m. The actions and the sensors also follow prior work.

Results

We design baselines to evaluate the effectiveness of each component of our proposed room navigation framework and to validate our approach. We also report oracle results using ground-truth annotations to establish an upper-bound on the scores that can be achieved by our model.

Table 1 shows the RoomNav SPL and Success scores on the room navigation validation and test sets (for selected baselines). Our room navigation framework described in Sec. 4 achieves an SPL of 0.31 on validation and 0.29 on the test set. Fine-tuning the point navigation policy on points predicted by the point prediction network improves the SPL to 0.35 on validation and 0.33 on test, making this our best performing model.

Vanilla Room Navigation Policy. Here we compare to an approach that does not use semantic maps to model correlations and does not use point navigation. We ablate both the map prediction and point generation components by training a room navigation policy from scratch using PPO, similar to the point navigation policy in Sec. 4.3. Instead of target coordinates relative to the current state, it takes the target room ID as input. The agent receives terminal reward $r_T = 2.5 \cdot \text{RoomNav-SPL}$, and shaped reward $r_t(a_t) = -\Delta_{\text{geo\_dist}} - 0.01$, where $\Delta_{\text{geo\_dist}} = d_t - d_{t-1}$ is the change in geodesic distance to the target room by performing action $a_t$. The SPL using this baseline is 0.10 on validation and 0.10 on test, significantly worse compared to our approach (SPL 0.35). This reinforces the effectiveness of our model, specifically the need to generate maps and use point navigation. Note that this baseline mimics an approach that tries to solve room navigation via vanilla (“brute force”) reinforcement learning.

Vanilla Room Navigation Policy with Map Generation. We ablate the point prediction model in Sec. 4.2 and train a room navigation policy to navigate to the target room using RGB images and semantic maps. We use the map generator to predict semantic maps for each room type. We then train a policy to navigate to the target room. The policy is similar to the room navigation policy described above and takes four inputs: the previous action, the target room represented as an ID, the predicted semantic maps for all rooms embedded as in Eq. 6, and the depth encoding. It is trained the same way as the room navigation policy.

This baseline achieves an SPL of 0.16, which is worse by a large margin of 0.2 when compared to our best performing model (room navigation using Map Generation + Point Prediction). The improved performance of our best method emphasizes the significance of the point prediction and point navigation modules in our best performing model.

Point Prediction and Point Navigation Policy. We perform room navigation using only the high-level point prediction network and the low-level point navigation controller. We ablate the map generation module and train a modified version of the point prediction network without maps as input. Similar to $f_{point}$ in Eq. 7, it generates $P_t = (x_t, y_t)$ but by using only the image representation $f_i(I_t)$ and the target room embedding $f_{emb}(tr)$. The agent then navigates to $P_t$ using the pre-trained point navigation policy as in Sec. 4.3.

This method achieves an SPL of 0.17 when the policy is trained from scratch and an SPL of 0.21 when the policy is fine-tuned with points predicted by the point prediction network. Our best model surpasses this by a large margin of $\sim$0.15, which shows the advantage of using supervision to learn amodal semantic maps that capture correlations. It also indicates the effectiveness of our map generation network. Since the environments in validation are different from train, we can also conclude that predicting semantic maps allows for better generalization to unseen environments, as the RoomNav SPL is a direct indicator of how “quickly” an agent can reach a target room.

Using Ground Truth (GT) Maps. To get a better sense of how well our models could do with perfect map generation, we train a few of our baselines with ground truth maps instead of generated maps and report results in Table 1. With GT maps, Vanilla Room Navigation Policy with Map Generation achieves an SPL of 0.54. Adding GT maps to our best model, with and without fine-tuned point navigation policy, we achieve SPL of 0.61 and 0.67 respectively. This suggests that there is still considerable room for improvement in the Map Prediction module to perform room navigation more effectively. Table 2(b) reports the prediction error of the point prediction model when using generated maps and ground truth maps.

Random. We evaluate using a random agent that takes action randomly among move_forward, turn_left, and turn_right with uniform distribution. It calls stop after 60 steps, which is the average length of an episode in our dataset. This achieves a RoomNav SPL of 0 on both test and validation, which implies that our task is difficult and cannot be solved by random walks.
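The random baseline is simple enough to sketch in full. The action names follow the text; the function name and the use of a seeded `random.Random` are choices made for this illustration.

```python
import random
from typing import List

MOVE_ACTIONS = ("move_forward", "turn_left", "turn_right")


def random_agent_actions(rng: random.Random, num_steps: int = 60) -> List[str]:
    """Sample `num_steps` uniformly random movement actions, then call stop."""
    return [rng.choice(MOVE_ACTIONS) for _ in range(num_steps)] + ["stop"]


# Example rollout with a fixed seed for reproducibility.
trajectory = random_agent_actions(random.Random(0))
```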

Using GT Point Selection. We use the pre-trained point navigation policy defined in Sec. 4.3 to navigate to the ground truth target points $P^{\textrm{GT}}$ in the target room defined in Sec. 5. This achieves an SPL of 0.82 and 0.79 on validation and test respectively. It provides an “upper-bound” on the performance that can be achieved by the room navigation policy, as this indicates the maximum RoomNav SPL that can be achieved by the framework in Sec. 4 if the error on point prediction were 0. These numbers are comparable to the SPL values for point navigation on the Matterport 3D dataset reported in prior work, indicating that our episodes are at least as difficult as those point navigation episodes.

Map Generation Ablations. We also experimented with different semantic map generation models. The results in Table 2(a) show that the LSTM map generation model described in Sec. 4 performs best, with a mean Intersection-over-Union (mIoU) of 41.45 and an Average Pixel Accuracy of 43.39%. The CNN-only approach predicts a semantic map from each RGB image without maintaining a memory of the previous maps. This performs poorly, with a mIoU of 25.66 and an Average Pixel Accuracy of 24.81%. We train another LSTM model which doesn’t use the semantic maps as input at each time step. This has a mIoU of 32.93 and an Average Pixel Accuracy of 33.59%.
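For reference, the two map quality metrics can be computed as in the sketch below. The exact averaging used in Table 2(a) is not specified in the text, so this example makes two assumptions: mIoU is averaged over classes that appear in either map, and pixel accuracy is the overall fraction of correctly labelled pixels. The function name is hypothetical.

```python
from typing import List, Tuple

Map = List[List[int]]  # H x W grid of class ids


def miou_and_pixel_acc(pred: Map, gt: Map, num_classes: int = 3) -> Tuple[float, float]:
    """Return (mean IoU over classes present, overall pixel accuracy)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    correct = 0
    total = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += 1
            if p == g:
                correct += 1
                inter[p] += 1
                union[p] += 1
            else:
                # a mismatched pixel counts towards the union of both classes
                union[p] += 1
                union[g] += 1
    if total == 0:
        raise ValueError("maps must contain at least one pixel")
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious), correct / total
```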

Trajectory Videos. Qualitative results of our model can be found here. The first image in the first row shows the RGB input. The second and third maps in the first row show the location of the agent in allocentric and egocentric views respectively. The last figure in the first row shows two dots, red indicating the ground truth target point in the target room and green showing the predicted point. When only one dot is visible, the predicted and ground-truth points overlap. There are 5 ground-truth semantic maps, one for each of the 5 room types we consider. The labels at the bottom indicate the room type being predicted. The second row shows the ground truth semantic maps indicating the location of the rooms in the house. The third row shows the maps predicted by our agent. The target room is mentioned at the very bottom, in this case, “Dining Room”. As seen in the video, the model dynamically updates the semantic belief maps and predicts the target point with high precision. The agent is able to detect the room it is currently in and also develop a belief of where other rooms lie. The RoomNav-SPL for this episode is 1.0 as the agent successfully reaches the target room following the shortest path. Additional videos can be found here.

Conclusion

In this work, we proposed a novel learning-based approach for Room Navigation which models architectural and stylistic regularities in houses. Our approach consists of predicting the top down belief maps containing room semantics beyond the field of view of the agent, finding a point in the specified target room, and navigating to that point using a point navigation policy. Our model’s improved performance (SPL) compared to the baselines confirms that learning to generate amodal semantic belief maps of room layouts improves room navigation performance in unseen environments. Our results using ground truth maps indicate that there is a large scope for improvement in room navigation performance by improving the intermediate map prediction step. We will make our code and dataset publicly available.

Acknowledgements

We thank Abhishek Kadian, Oleksandr Maksymets, and Manolis Savva for their help with Habitat, and Arun Mallya and Alexander Sax for feedback on the manuscript. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon. Prof. Darrell’s group was supported in part by DoD, NSF, BAIR, and BDD. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
