Gated Path Planning Networks

Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, Eric Xing, Ruslan Salakhutdinov

Introduction

A common type of sub-task that arises in various reinforcement learning domains is path finding: finding a shortest set of actions to reach a subgoal from some starting state. Path finding is a fundamental part of any application which requires navigating in an environment, such as robotics (Ackerman & Guizzo, 2015) and video game AI (Silver, 2005). Due to its ubiquity in these important applications, recent work (Tamar et al., 2017) has designed a differentiable sub-module that performs path-finding as directed by the agent in some inner loop. These Value Iteration Network (VIN) modules mimic the application of Value Iteration on a 2D grid world, but without a pre-specified model or reward function. VINs were shown to be capable of computing near-optimal paths in 2D mazes and 3D landscapes where the transition model $P(s^{\prime}|s,a)$ was not provided a priori and had to be learned.

In this paper, we show that VINs are often plagued by training instability, oscillating between high and low performance between epochs; random seed sensitivity, often converging to different performances depending on the random seed that was used; and hyperparameter sensitivity, where relatively small changes in hyperparameters can cause diverging behaviour. Owing to these optimization difficulties, we reframe the VIN as a recurrent-convolutional network, which enables us to replace the unconventional recurrent VIN update (convolution & max-pooling) with well-established gated recurrent operators such as the LSTM update (Hochreiter & Schmidhuber, 1997). These Gated Path Planning Networks (GPPNs) are a more general model that relaxes the architectural inductive bias of VINs that was designed to perform a computation resembling value-iteration.

We then establish empirically that GPPNs perform better or equal to the performance of VINs on a wide variety of 2D maze experiments, including different transition models, maze sizes and different training dataset sizes. We further demonstrate that GPNNs exhibit fewer optimization issues than VINs, including reducing random seed and hyperparameter sensitivity and increasing training stability. GPPNs are also shown to work with larger kernel sizes, often outperforming VINs with significantly fewer recurrent iterations, and also learn faster on average and generalize better given less training samples. Finally, we present results for both VIN and GPPN on challenging 3D ViZDoom environments (Kempka et al., 2016), where the planner is only provided with first-person RGB images instead of the top-down 2D maze design.

Background

The value function $V^{(k)}$ converges to $V^{*}$ in the limit as $k\rightarrow~{}\infty$ , and the optimal policy can be recovered as $\pi^{*}(s):=\arg\max_{a}Q^{(\infty)}(s,a)$ (Sutton & Barto, 2018).

Despite the theoretical guarantees, value iteration requires a pre-specified environment model. Tamar et al. (2017) introduced the Value Iteration Network (VIN), which is capable of learning these MDP parameters from data automatically. The VIN reformulates value iteration as a recursive process of applying convolutions and max-pooling over the feature channels:

where the indices $i,j\in[m]$ correspond to cells in the $m\times m$ maze, $\bar{R},\bar{Q},\bar{V}$ is the VIN estimated reward, action-value and value functions, respectively, $\bar{a}$ is the action index of the $\bar{Q}$ feature map, and $W^{R},W^{V}$ are the convolutional weights for the reward function and value function, respectively. In the following iteration, the previous value $\bar{V}$ is stacked with $\bar{R}$ for the convolution step.

Tamar et al. (2017) showed that VINs have much greater success at path planning than baseline CNN and feedforward architectures in a variety of 2D and graph-based navigation tasks. The demonstrated success of VIN has made it an important component of models designed to solve downstream tasks where navigation is crucial (Karkus et al., 2017; Gupta et al., 2017a, b). For example, Gupta et al. (2017a, b) designed a Deep RL agent to perform navigation within partially observable and noisy environments by combining a VIN module with a 2D-structured memory map.

Method

In this work, we explore whether the inductive biases provided by the VIN are even necessary: is it possible that using alternative, more general architectures might work better than those of the VIN? We can view the VIN update (1) within the perspective of a convolutional-recurrent network, updating a recurrent state $V^{(k)}_{i^{\prime},j^{\prime}}$ at every spatial position $(i^{\prime},j^{\prime})$ in each iteration:

where $X_{[i^{\prime},j^{\prime},F]}$ denotes the image patch centered at position $(i^{\prime},j^{\prime})$ with kernel size $F$ . From (2), it can be seen that VIN follows the standard recurrent neural network (RNN) update where the recurrent state is updated by taking a linear combination of the input $\bar{R}$ and the previous recurrent state $\bar{V}^{(k-1)}$ , and passing their sum through a nonlinearity $\max_{\bar{a}}$ . The main differences from a standard RNN are the following: the non-conventional nonlinearity (channel-wise max-pooling) used in VIN; the hidden dimension of the recurrent network, which is essentially one; the sparse weight matrices, where the non-zero values of the weight matrices represent neighboring inputs and units which are local in space; and the restriction of kernel sizes to 3.

Under this perspective, it is easy to question whether the adherence to these strict architectural biases is even necessary, given the long history of demonstrations that standard non-gated recurrent operators are difficult to optimize due to effects such as vanishing and exploding gradients (Pascanu et al., 2013).

We can easily replace the recurrent VIN update in (2) with the well-established LSTM update (Hochreiter & Schmidhuber, 1997), whose gated update alleviates many of the problems with standard recurrent networks:

where $F$ is the convolution kernel size. This recurrent update (3) still maintains the convolutional properties of the input and recurrent weight matrix as in VIN. It involves taking as input the $F\times F$ convolution of the input vector $\bar{R}$ and previous hidden states $h^{(k-1)}$ , and the previous cell state $c_{i^{\prime},j^{\prime}}^{(k-1)}$ of the LSTM at the central position $(i^{\prime},j^{\prime})$ . We call path planning modules which use these gated updates Gated Path Planning Networks (GPPNs). The GPPN is an LSTM which uses convolution of previous spatially-contiguous hidden states for its input.

Environments and Maze Transition Types

We test VIN and GPPN on 2D maze environments and 3D ViZDoom environments (Figure 1) on a variety of settings such as training dataset size, maze size and maze transition kernel.

We used three different maze transition kernels: In NEWS, the agent can move North, East, West, or South; in Differential Drive, the agent can move forward along its current orientation, or turn left/right by 90 degrees; in Moore, the agent can move to any of the eight cells in its Moore neighborhood. In the NEWS and Moore transition types, the target is an x-y coordinate, while in Differential Drive the target contains an orientation along with the x-y coordinate. Consequently, the dimension of the goal map given as input to the models is $1\times m\times m$ for NEWS and Moore, and $4\times m\times m$ for Differential Drive, where $m$ is the maze size.

The 2D maze environment is created with a maze generation process that uses Depth-First Search with the Recursive Backtracker algorithm (Maze Generation Algorithms, 2018) to construct the maze tree, resulting in a fully connected maze (see Figure 1a). For each maze, we sample a probability $d$ uniformly from . Then for each wall, we delete the wall with probability $d$ .

For our experiments on the 2D mazes, the state vector consists of the maze and the goal location, each of which are represented by a binary $m\times m$ matrix, where $m\times m$ is the maze size. We use early stopping based on validation set metrics to choose the final models.

2 3D ViZDoom Environment

We use the Doom Game Engine and the ViZDoom API (Kempka et al., 2016) to create mazes in a simulated 3D environment (see Figure 1b). The maze design for the 3D mazes are generated in exactly the same manner as the 2D mazes, using Depth-First Search with the Recursive Backtracker algorithm followed by wall pruning with a uniformly sampled probability $d$ . For each Doom maze, we take RGB screenshots showing the first-person view of the environment at each position and orientation. A sample 3D Doom maze and example screenshot images are shown in Figure 1. For an $m\times m$ maze with 4 orientations, this results in a total of $4m^{2}$ images.

In the 3D ViZDoom experiments, these map images are given as input to the model (instead of the 2D map design). This setup is similar to the one used for localization experiments by Chaplot et al. (2018) who argue that these images are easier to obtain as compared to constructing an accurate map design of an environment in the real world. The model needs to learn to infer the map design from these images along with learning to plan, which makes the task more challenging in 3D environments.

Experiments & Discussion

In this section, we empirically compare VIN and GPPN using two metrics: %Optimal (%Opt) is the percentage of states whose predicted paths under the policy estimated by the model has optimal length, and %Success (%Suc) is the percentage of states whose predicted paths under the policy estimated by the model reach the goal state. The reported performance is on a held-out test split. In contrast with the metrics reported in (Tamar et al., 2017), we do not stochastically sample rollouts but instead evaluate and train the output policy of the models directly on all states simultaneously. This reduces optimization noise and makes it easier to tell whether difficulties with training are due to sampling noise or model architecture/capacity.

All analyses are based on 2D maze results, except in Section 5.8 where we discuss 3D ViZDoom results. In order to make comparison fair, we utilized a hidden dimension of 150 for GPPN and 600 for VIN, owing to the approximately 4 $\times$ increase in parameters a GPPN contains due to the 4 gates it computes. Unless otherwise noted, the results were obtained by doing a hyperparameter sweep of $(K,F)$ over $K\in\{5,10,15,20,30\}$ and $F\in\{3,5,7,9,11\}$ , and using a 25k/5k/5k train-val-test split. Other experimental details are deferred to the Appendix.

One question that can be asked of the architectural choices of the VIN is whether the kernel size needs to be the same dimension as the true underlying transition model. The kernel size used in VIN was set to $3\times 3$ with a stride of $1$ , which is sufficient to represent the true transition model when the agent can move anywhere in the Moore neighborhood, but it limits the rate at which information propagates spatially with each iteration. With a kernel size of $3\times 3$ and stride of $1$ , the receptive field of a unit in the last iteration’s feature map increases with rate $(3+2K)\times(3+2K)$ where $K$ is the iteration count, meaning that the maximum path length information travels scales directly with iteration count $k$ . Therefore for long-term planning in larger environments, Tamar et al. (2017) designed a multi-scale variant called the Hierarchical VIN. Hierarchical VINs rely on downsampling the maps into multi-scale hierarchies, and then doing VIN planning and up-scaling, progressively growing the map until it regains its original, un-downsampled size.

Another potential method to do long-range planning without requiring a multi-scale hierarchy is to instead increase the kernel size. An increased kernel size would cause the receptive field to grow more rapidly, potentially allowing the models to require fewer iterations $K$ before reaching well-performing policies. In this section, we sought to test out the feasibility of increasing the kernel size of VINs and GPPNs. These results are summarized in Table 1. All the models were trained with the best $K$ setting for each $F$ and transition kernel. From the results, we can clearly see that GPPN can handle training with larger $F$ values, and moreover, GPPN often performs better than VIN with larger values of $F$ . In contrast, we can observe that VIN’s performance drops significantly after its kernel size is increased more than 5, with its best performing settings being either $3$ or $5$ depending on the true transition model. These results show that GPPN can learn planning approximations that work with $F>3$ much more stably than VIN, and could further suggest that GPPN can work as well as VIN with less iterations.

2 Varying Iteration Count K𝐾K

Following the above results showing that GPPN benefits from increased $F$ , we further evaluated the effect of varying both iteration count $K$ and kernel size $F$ on the VIN and GPPN models. Table 2 shows %Optimal and %Success results of VIN and GPPN on 15 $\times$ 15 2D mazes for different values of $F$ and $K$ . We can see from NEWS column in the table that GPPN with $F>7$ can get results on par with the best VIN model with only $K=5$ iterations. This shows that GPPN can learn to more effectively propagate information spatially in a smaller number of iterations than VIN can, and outperforms VIN even when VIN is given a much larger number of iterations. Additionally, we can see that VIN has significant trouble learning when both $K$ and $F$ are large in the differential drive mazes and to a lesser extent in the NEWS mazes.

Table 3 shows the results of VIN and GPPN with varying iteration counts $K$ and the best $F$ setting for each $K$ . Owing to the larger kernel size, GPPN with smaller number of iterations $K\leq 10$ can get results on par with the best VIN model. Generally, both models benefit from a larger $K$ (assuming the best $F$ setting is used).

3 Different Maze Transition Kernels

From Tables 1 and 3, we can observe the performance of VIN and GPPN across a variety of different underlying groundtruth transition kernels (NEWS, Moore, and Differential Drive). From these results, we can see that GPPN consistently outperforms VIN on all the transition kernel types. An interesting observation is that VIN does very well at Differential Drive, consistently obtaining high results, although GPPN still does better than or on par with VIN. The reasons why VIN is so well suited to Differential Drive are not clear, and a preliminary analysis of VIN’s feature weights and reward vectors did not reveal any intuition on why this is the case.

4 Effect of Dataset Size

A potential benefit of the stronger architectural biases of VIN might be that they can enable better generalization given less training data. In this section, we designed experiments that set out to test this hypothesis. We trained VINs and GPPNs on datasets with varying number of training samples for all three maze transition kernels, and the results are given in Table 4. We can see that GPPN consistently outperforms VIN across all dataset sizes and maze models. Interestingly, we can observe that the performance gap between VIN and GPPN is larger the less data there is, demonstrating the opposite effect to our hypothesis. This could suggest that the architectural biases do not in fact aid generalization performance, or that there is another problem, such as perhaps the difficulty of optimizing VIN, that overshadows the benefit that the inductive bias could potentially provide.

5 Random Seed and Hyperparameter Sensitivity

The hypothesis this section sought to verify was whether the particular recurrent-convolutional form of the VIN did indeed negatively affect its optimization, as many ungated recurrent updates suffer from optimization problems including training instability and higher sensitivity to weight initialization and hyperparameters due to gradient scaling problems (Pascanu et al., 2013).

We test each architecture’s sensitivity to random seeds by running several experiments with the same hyperparameters but different random seeds, and measuring the variance in their final performance. These results are reported in Table 5. The results show that GPPN gets consistently lower variance than VIN over different random seed initializations, supporting the hypothesis that the LSTM update enables more training stability and easier optimization than the ungated recurrent update in VIN.

We additionally test hyperparameter sensitivity in Figure 2. We take all the results obtained on a hyperparameter sweep over settings $(K,F)$ where $K$ was varied over $K\in\{5,10,15,20,30\}$ and $F$ was varied over $F\in\{3,5,7,9,11\}$ . We then rank these results, and the x-axis is the top- $x$ ranked hyperparameter settings and the corresponding y-axis is the average %Opt/%Suc of those $x$ settings. This plot thus measures how stable the performance of the architecture is to hyperparameter changes as the number of hyperparameter settings we consider grows. Therefore, architectures whose average top- $x$ ranked performance remains high and relatively flat demonstrates that good performance with the architecture can be obtained with many different hyperparameter settings. This suggests that these models are both easier to optimize and consistently better than alternatives, and higher performance was not due to a single lucky hyperparameter setting. We can see from the figures that the performance of GPPN is clearly both higher and more stable over hyperparameter settings than VIN.

In Figure 3, we plot the learning curves for VIN and GPPN on 2D mazes with varying $K$ and $F$ . These plots show that VIN’s performance often oscillates between epochs (especially for larger kernel sizes $F>3$ ), while GPPN is much more stable. Learning curves for other experiments showing a similar result are included in the Appendix. The training stability of GPPN provides more evidence to the hypothesis that GPPNs are simpler to optimize than VINs and consistently outperform them.

6 Learning Speed

In this section, we examine whether VINs or GPPNs learn faster. To do this, we measure the number of training epochs (passes over the entire dataset) that it takes for each model to reach a specific %Opt for the first time. These results are reported in Table 6. We can see from this table that GPPN learns significantly faster, often reaching 95% within 5-6 epochs. Comparatively, VIN sometimes never reaches 95%, as is the case for the NEWS mazes, or it takes 2-5 times as many epochs. This is the case even on the Differential Drive mazes, where VIN takes 2-3 times longer to train despite also getting high final performance.

7 Larger Maze Size

To test whether the improved performance GPPN persists even on larger, more challenging mazes, we evaluated the models on a dataset of mazes of size $28\times 28$ , and varied $K\in\{14,28,56\}$ (Table 7). We used a training dataset size of 25k. GPPN outperformed VIN by a significant margin (3-6% for %Opt and %Suc) for all cases except Diff. Drive $15\times 15$ , where the gap was closer (GPPN 99.3 vs. VIN 98.3 for %Opt).

8 3D ViZDoom Experiments

In the 3D ViZDoom experiments, the state vector consists of RGB images showing the first-person view of the environment at each position and orientation, instead of the top-down 2D maze design (represented by a binary $m\times m$ matrix) as in the 2D maze experiments. To process the map images, we use a Convolutional Neural Network (LeCun et al., 1989) consisting of two convolutional layers: first layer with 32 filters of size $8\times 8$ and a stride of 4, and second layer with 64 filters of size $4\times 4$ with a stride $2\times 2$ , followed by a linear layer of size 256.This architecture was adapted from a previous work which is shown to perform well at playing deathmatches in Doom (Lample & Chaplot, 2017). The 256-dimensional representation for all the 4 orientations at each location is concatenated to create a 1024-dimensional representation. These representations of each location are then stacked at the corresponding x-y coordinate to create a map representation of size $1024\times m\times m$ . The map representation is then passed through two more convolutional layers (first layer with 64 filters and the second layer with 1 filter, both of size $3\times 3$ and a stride of 1) to predict a maze design matrix of size $1\times m\times m$ , which is trained using an auxillary binary cross-entropy loss. The predicted maze design is then stacked with the goal map and passed to the VIN or GPPN module in the same way as the 2D experiments.

The 3D ViZDoom results are summarized in Table 8. %Acc is the accuracy for predicting the top-down 2D maze design from first-person RGB images. Learning to plan in the 3D environments is more challenging due to the difficulty of simultaneously optimizing both the original planner loss and the auxiliary maze prediction loss. We can see that when %Acc is low, i.e., the planner module must rely on a noisy maze design, then the planner metrics %Opt and %Suc also suffer. We observe that VIN is more prone to overfitting on the training dataset: its validation %Acc is low ( $<91\%$ ) for all three transition kernels, whereas GPPN achieves higher validation %Acc on NEWS and Moore. However, GPPN also overfits on the Differential Drive.

Related Works

Karkus et al. (2017) looked at extending differentiable planning towards being able to plan in partially observable environments. In their setting, the agent is not provided a-priori with its position within the environment and thus needs to maintain a belief state over where it actually is. Similar to VIN’s differentiable extension of VI, the QMDP-Net architecture was based on creating a differentiable analogue of the QMDP algorithm (Littman et al., 1995), an algorithm designed to approximate belief space planning in POMDPs. The architecture itself consisted of a filter module, which maintained the beliefs over which states the agent currently was in, and a planning module, which determined what action to take next. The planning module was essentially using a VIN to enable it to make more informed decisions on which parts of the environment to explore.

In recent work there has been a variety of deep reinforcement learning models that have examined combining an internal planning process with model-free methods. The Predictron (Silver et al., 2016) was a value function approximator which predicted a policy’s value by internally rolling out an LSTM forward predictive model of the agent’s future rewards, discounts and values. These future rewards, values and discounts were then accumulated together, with the idea that this would predict a more accurate value by forcing the architecture to model a multi-step rollout. A later extension, Value Predictive Networks (Oh et al., 2017), learnt a forward model that is used to predict the future rewards and values of executing a multi-step rollout. Although similar to the Predictron, they considered the control setting, where not only a value function had to be learnt but a policy as well. They demonstrated that their model, trained using model-free methods, was able to outperform existing methods on a 2D goal navigation task and outperformed DQN on Atari games.

Convolutional-recurrent networks similar to the VIN and GPPN have had a recent history of use within computer vision, particularly for applications which have both a spatial and temporal aspect. Convolutional LSTMs (ConvLSTMs) were first used in the application of precipitation nowcasting, where the goal was to predict rainfall intensity within a region using past data (Shi et al., 2015). Recurrent-convolutional networks have also been used within computer vision applications where there is no explicit temporal aspect, such as object recognition. Feedback Networks (Zamir et al., 2017) utilized a ConvLSTM in order to allow information to feedback from higher layers to lower layers by unrolling the ConvLSTM over time. This enabled the Feedback Network to attain performance better than or on par with Residual Networks (ResNets) (He et al., 2016), one of the most commonly used feedforward architectures for object recognition.

A deeper connection has also been explored between residual and convolutional-recurrent networks. (Liao & Poggio, 2016) tested whether weight tying between layers in a ResNet significantly affects performance, finding that although performance slightly degrades, the change is not drastic. They provide some hypotheses on these results, suggesting that deep feedforward networks like ResNets are approximating recurrent networks in some capacity. While the GPPN can be seen as an instance of ConvLSTMs, our paper is the first to apply it to the domain of differentiable path planning and to show that, in general, structuring differentiable path planning within the context of convolutional-recurrent networks enables use of previous well-established recurrent architectures such as LSTM and GRUs.

Conclusion

In this work, we re-formulated VIN as a convolutional-recurrent network and designed a new planning module called the Gated Path Planning Network (GPPN) which replaced the unconventional recurrent update in VIN with a well-established gated LSTM recurrent operator. We presented experimental results comparing VIN and GPPN on 2D path-planning maze tasks and a 3D navigation task in the video game Doom, showing that the GPPN achieves results no worse and often better than VIN. The LSTM update alleviates many of the optimization issues including training instability and sensitivity to random seeds and hyperparameter settings. The GPPN is also able to utilize a larger kernel size, which the VIN is largely unable to do due to training instability, allowing the GPPN to work as well as VIN with fewer iterations. The GPPN also learns significantly faster, attaining high performance after only a few epochs, whereas the VIN takes longer to train. Finally, the relative performance improvement of GPPN over VIN increases with less training data. In conclusion, our analyses suggest that the inductive biases of VIN are not necessary in the design of a well-performing differentiable path planning module, and that the use of more general, gated recurrent architectures provides significant benefits over VINs.

Acknowledgements

LL is supported by a NSF GRFP Fellowship and by the CMU SEI under Contract FA8702-15-D-0002, Section H Clause, AFLCMC (H)-H001: 6-18014. EP, DC, and RS are supported in part by Apple, Nvidia NVAIL, DARPA D17AP00001, IARPA DIVA award D17PC00340, and ONR award N000141512791. The authors would also like to thank Nvidia for providing GPU support.

References

Appendix A Learning Plots

As mentioned in Section 5.5, we provide additional learning plots for varying dataset sizes (Figure 5), varying maze sizes (Figure 6), and 3D ViZDoom results (Figure 7).

Appendix B Hyperparameter Settings

Unless otherwise noted, all models in Tables 1, 2, 3, 4, 5, 6, 7, 8 are trained on 2D mazes of size $15\times 15$ for 30 epochs using a learning rate of 1e-3, batch size 32, gradient clipping of 40, and 25k/5k/5k train-val-test split. An initial sweep of learning rates over { 1e-4, 1e-3, 5e-3, 1e-2, 1e-1 } found that GPPN is more robust to varying learning rates and that 1e-3 worked best for both models (see Table 9). We do a hyperparameter sweep of $(K,F)$ over $K\in\{5,10,15,20,30\}$ and $F\in\{3,5,7,9,11\}$ . The only exceptions are the following: For the larger $28\times 28$ maze in Table 7, we sweep $K$ over $\{14,28,56\}$ to account for longer trajectories required to solve some mazes. For the 10k and 100k dataset sizes in Table 4, we used a train-test-val split of 10k/2k/2k and 100k/10k/10k, respectively. The variance results in Table 5 are obtained using dataset size 100k. The 3D ViZDoom results in Table 8 were obtained using $K=30$ , the best setting of $F$ for each transition kernel, a smaller dataset size 10k , a smaller learning rate 5e-4, and 100 training epochs.

The particular form of the LSTM update the GPPNs take in the experimental section is slightly different from the standard one. The first difference is that we remove the dependence of each layer of the ConvLSTM on the ”reward” function $\bar{R}_{[i^{\prime},j^{\prime},F]}$ since we did not find this skip-connection helped performance much in preliminary experiments. Therefore layers of the GPPN only take as input the previous layer’s hidden units. The second change was made to make the GPPN easier to implement in a framework where built-in LSTM updates are available but ConvLSTMs are not. It first takes the convolution over the previous hidden layers and produces a 1-channel feature map, and then for each position passes that feature map position to the framework’s built-in LSTM update. This is similar to having a single shared input gate for all the inputs. When tested against a GPPN with the standard LSTM update equation with full input gating, we did not observe any significant different in test metrics but the single input gate did save some computation time.

Appendix C Hyper-VIN

The VIN uses convolutions to represent the model, which causes it to effectively be spatially invariant, meaning VINs are incapable of truly solving mazeworld in the same way as value iteration on the true model. The result is that VINs learn a workaround that enables it to deal with non-linearities over the state space: it assigns a large negative reward to every wall position. This is shown in Figure 4: the large reward gradient between walls and non-walls discourages the model from producing policies that “visit” wall states which would be impossible under the true model. Additionally, the spatial convolution model is fixed and invariant for all mazes, which is suboptimal as each MDP in the 2D environments require a different transition kernel based on the maze design.

In this section, we try to alleviate this issue by, first, untying the weights of the spatial convolution and, second, predicting the untied convolution weights directly from the maze design. We call this variant the Hyper-VIN, adopting the naming convention from HyperNetworks (Ha et al., 2017) which also used the kernel of using a network with weights predicted from another network. To implement the Hyper-VIN, we predict for each position $(i,j)$ in the environment a convolutional weight matrix from the input map design. The Hyper-VIN update equation then becomes:

A question that can be asked about Hyper-VIN is if they perform as well as (or better than) the actual algorithms they were designed to mimic because the true algorithm is within the model class. This would provide some evidence whether such modules were actually computing the value, or whether they simply acted like recurrent networks and computed a less interpretable internal representation. Empirically, we instead found that Hyper-VINs have high variance in training and are difficult to optimize (see Table 10). Hyper-VINs trained by SGD often fail to reach the performance of their exact algorithmic counterpart (value iteration) on small mazes even though value iteration is within the hypothesis class of these models, suggesting that the optimization of such architectures is significantly difficult.