Reinforcement Learning with Unsupervised Auxiliary Tasks
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
Related Work
A variety of reinforcement learning architectures have focused on learning temporal abstractions, such as options (Sutton et al., 1999b), with policies that may maximise pseudo-rewards (Konidaris & Barreto, 2009; Silver & Ciosek, 2012). The emphasis here has typically been on the development of temporal abstractions that facilitate high-level learning and planning. In contrast, our agents do not make any direct use of the pseudo-reward maximising policies that they learn (although this is an interesting direction for future research). Instead, they are used solely as auxiliary objectives for developing a more effective representation.
The Horde architecture (Sutton et al., 2011) also applied reinforcement learning to identify value functions for a multitude of distinct pseudo-rewards. However, this architecture was not used for representation learning; instead each value function was trained separately using distinct weights.
The UVFA architecture (Schaul et al., 2015a) is a factored representation of a continuous set of optimal value functions, combining features of the state with an embedding of the pseudo-reward function. Initial work on UVFAs focused primarily on architectural choices and learning rules for these continuous embeddings. A pre-trained UVFA representation was successfully transferred to novel pseudo-rewards in a simple task.
Similarly, the successor representation (Dayan, 1993; Barreto et al., 2016; Kulkarni et al., 2016) factors a continuous set of expected value functions for a fixed policy, by combining an expectation over features of the state with an embedding of the pseudo-reward function. Successor representations have been used to transfer representations from one pseudo-reward to another (Barreto et al., 2016) or to different scales of reward (Kulkarni et al., 2016).
Another, related line of work involves learning models of the environment (Schmidhuber, 2010; Xie et al., 2015; Oh et al., 2015). Although learning environment models as auxiliary tasks could improve RL agents (e.g. Lin & Mitchell (1992); Li et al. (2015)), this has not yet been shown to work in rich visual environments.
More recently, auxiliary prediction tasks have been studied in 3D reinforcement learning environments. Lample & Chaplot (2016) showed that predicting internal features of the emulator, such as the presence of an enemy on the screen, is beneficial. Mirowski et al. (2016) study auxiliary prediction of depth in the context of navigation.
Background
In A3C many instances of the agent interact in parallel with many instances of the environment, which both accelerates and stabilises learning. The A3C agent architecture we build on uses an LSTM to jointly approximate both policy $\pi$ and value function $V$, given the entire history of experience as inputs (see Figure 1 (a)).
Auxiliary Tasks for Reinforcement Learning
In this section we incorporate auxiliary tasks into the reinforcement learning framework in order to promote faster training, more robust learning, and ultimately higher performance for our agents. Section 3.1 introduces the use of auxiliary control tasks, Section 3.2 describes the addition of reward focussed auxiliary tasks, and Section 3.4 describes the complete UNREAL agent combining these auxiliary tasks.
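Given a set of auxiliary control tasks $\mathcal{C}$, let $\pi^{(c)}$ be the agent’s policy for each auxiliary task $c \in \mathcal{C}$ and let $\pi$ be the agent’s policy on the base task. The overall objective is to maximise total performance across all these auxiliary tasks,
$$\arg\max_{\theta} \; \mathbb{E}_{\pi}\left[R_{1:\infty}\right] + \lambda_c \sum_{c \in \mathcal{C}} \mathbb{E}_{\pi^{(c)}}\left[R^{(c)}_{1:\infty}\right]$$
where $R^{(c)}_{t:t+n} = \sum_{k=1}^{n} \gamma^{k} r^{(c)}_{t+k}$ is the discounted return for auxiliary reward $r^{(c)}$, and $\theta$ is the set of parameters of $\pi$ and all $\pi^{(c)}$’s. By sharing some of the parameters of $\pi$ and all $\pi^{(c)}$, the agent must balance improving its performance with respect to the global reward with improving performance on the auxiliary tasks.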
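As a rough illustration of how the base return and the auxiliary returns combine into one scalar objective, the following sketch evaluates the objective on finite reward sequences. The function names, the weight `lambda_c`, and the convention that the first reward is weighted by `gamma` are assumptions made for this example, not details confirmed by the text.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k-1] for k = 1..n (assumed convention)."""
    total = 0.0
    for k, reward in enumerate(rewards, start=1):
        total += (gamma ** k) * reward
    return total


def combined_objective(base_rewards, aux_rewards_by_task, gamma, lambda_c):
    """Base-task return plus lambda_c times the sum of auxiliary returns.

    aux_rewards_by_task maps a (hypothetical) task name to its reward sequence.
    """
    base = discounted_return(base_rewards, gamma)
    aux = sum(discounted_return(r, gamma) for r in aux_rewards_by_task.values())
    return base + lambda_c * aux
```

In practice the expectation over trajectories would be estimated from sampled experience; here a single trajectory stands in for it.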
While many types of auxiliary reward functions can be defined from these quantities we focus on two specific types:
Pixel changes - Changes in the perceptual stream often correspond to important events in an environment. We train agents that learn a separate policy for maximally changing the pixels in each cell of an $n \times n$ non-overlapping grid placed over the input image. We refer to these auxiliary tasks as pixel control. See Section 4 for a complete description.
Network features - Since the policy or value networks of an agent learn to extract task-relevant high-level features of the environment (Mnih et al., 2015; Zahavy et al., 2016; Silver et al., 2016) they can be useful quantities for the agent to learn to control. Hence, the activation of any hidden unit of the agent’s neural network can itself be an auxiliary reward. We train agents that learn a separate policy for maximally activating each of the units in a specific hidden layer. We refer to these tasks as feature control.
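The pixel-change reward can be sketched as follows. This is a minimal illustration that assumes frames are 2D lists of grayscale intensities and that the reward for a cell is the mean absolute intensity change inside it; the function name, the `cell_size` parameter, and the grayscale simplification are all assumptions for the example.

```python
def pixel_change_rewards(prev_frame, next_frame, cell_size):
    """Mean absolute intensity change in each non-overlapping cell.

    Frames are 2D lists (rows of pixel intensities) of equal shape.
    Returns a grid with one auxiliary reward per cell.
    """
    height = len(prev_frame)
    width = len(prev_frame[0])
    rows = height // cell_size
    cols = width // cell_size
    rewards = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for y in range(i * cell_size, (i + 1) * cell_size):
                for x in range(j * cell_size, (j + 1) * cell_size):
                    total += abs(next_frame[y][x] - prev_frame[y][x])
            rewards[i][j] = total / (cell_size * cell_size)
    return rewards
```

Each entry of the returned grid would serve as the reward for one pixel-control policy. Feature control could be sketched analogously, with hidden-unit activations taking the place of the cell averages.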
3.2 Auxiliary Reward Tasks
In addition to learning generally about the dynamics of the environment, an agent must learn to maximise the global reward stream. To learn a policy to maximise rewards, an agent requires features that recognise states that lead to high reward and value. A good representation of rewarding states allows good value functions to be learned, which in turn should make it easy to learn a policy.
However, in many interesting environments reward is encountered very sparsely, meaning that it can take a long time to train feature extractors adept at recognising states which signify the onset of reward. We want to remove the perceptual sparsity of rewards and rewarding states to aid the training of an agent, but to do so in a way which does not introduce bias to the agent’s policy.
To do this, we introduce the auxiliary task of reward prediction: predicting the onset of immediate reward given some historical context. This task consists of processing a sequence of consecutive observations and requiring the agent to predict the reward picked up in the subsequent unseen frame. This is similar to value learning focused on immediate reward ($\gamma = 0$).
Unlike learning a value function, which is used to estimate returns and as a baseline while learning a policy, the reward predictor is not used for anything other than shaping the features of the agent. This keeps us free to bias the data distribution, therefore biasing the reward predictor and feature shaping, without biasing the value function or policy.
The auxiliary reward predictions may use a different architecture to the agent’s main policy. Rather than simply “hanging” the auxiliary predictions off the LSTM, we use a simpler feedforward network that concatenates a stack of states after being encoded by the agent’s CNN, see Figure 1 (c). The idea is to simplify the temporal aspects of the prediction task in both the future direction (focusing only on immediate reward prediction rather than long-term returns) and past direction (focusing only on immediate predecessor states rather than the complete history). The features discovered in this manner are shared with the primary LSTM (via shared weights in the convolutional encoder) to enable the policy to be learned more efficiently.
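To make the shape of a reward prediction training example concrete, the sketch below builds one input/target pair from a sequence of CNN-encoded frames. The stack size of three, the three-way target (zero, positive, negative), and all names are assumptions for illustration; the text only states that a stack of encoded states is concatenated and that the reward in the following unseen frame is predicted.

```python
def reward_class(reward):
    """Map a scalar reward to a class index: 0 zero, 1 positive, 2 negative."""
    if reward > 0:
        return 1
    if reward < 0:
        return 2
    return 0


def make_reward_prediction_example(encoded_frames, rewards, t, stack_size=3):
    """Concatenate the stack_size encodings before step t; target is rewards[t].

    encoded_frames[i] is the (hypothetical) CNN feature list for frame i and
    rewards[i] is the reward received on frame i, which the predictor never sees.
    """
    if t < stack_size or t >= len(rewards):
        raise ValueError("t must leave room for a full stack and a target")
    stacked = []
    for features in encoded_frames[t - stack_size:t]:
        stacked.extend(features)
    return stacked, reward_class(rewards[t])
```

A feedforward classifier trained on such pairs would back-propagate into the shared encoder, which is how the task shapes the agent's features.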
3.3 Experience Replay
Experience replay has proven to be an effective mechanism for improving both the data efficiency and stability of deep reinforcement learning algorithms (Mnih et al., 2015). The main idea is to store transitions in a replay buffer, and then apply learning updates to sampled transitions from this buffer.
Experience replay provides a natural mechanism for skewing the distribution of reward prediction samples towards rewarding events: we simply split the replay buffer into rewarding and non-rewarding subsets, and replay equally from both subsets. The skewed sampling of transitions from a replay buffer means that rare rewarding states will be oversampled, and learnt from far more frequently than if we sampled sequences directly from the behaviour policy. This approach can be viewed as a simple form of prioritised replay (Schaul et al., 2015b).
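The skewed sampling scheme can be sketched as below: choose the rewarding or non-rewarding subset with equal probability, then draw uniformly within it. The function name and the fallback behaviour when one subset is empty are assumptions for this example.

```python
import random


def sample_skewed(rewarding, non_rewarding, rng):
    """Draw one stored sequence, giving both subsets equal probability.

    Falls back to whichever subset is non-empty (an assumed edge-case rule).
    """
    if not rewarding and not non_rewarding:
        raise ValueError("replay buffer is empty")
    if not rewarding:
        return rng.choice(non_rewarding)
    if not non_rewarding:
        return rng.choice(rewarding)
    pool = rewarding if rng.random() < 0.5 else non_rewarding
    return rng.choice(pool)
```

With one rewarding sequence among a hundred stored ones, uniform sampling would return it about 1% of the time, whereas this scheme returns it about half the time.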
In addition to reward prediction, we also use the replay buffer to perform value function replay. This amounts to resampling recent historical sequences from the behaviour policy distribution and performing extra value function regression in addition to the on-policy value function regression in A3C. By resampling previous experience, and randomly varying the temporal position of the truncation window over which the n-step return is computed, value function replay performs value iteration and exploits newly discovered features shaped by reward prediction. We do not skew the distribution for this case.
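A minimal sketch of a value function replay target follows. It assumes the common n-step convention in which the first reward is undiscounted and the return is bootstrapped from a value estimate at the truncation point; the function names, the `max_n` parameter, and the way the window is randomised are assumptions for illustration.

```python
import random


def n_step_return(rewards, bootstrap_value, gamma):
    """r_0 + gamma*r_1 + ... + gamma**(n-1)*r_{n-1} + gamma**n * bootstrap."""
    ret = bootstrap_value
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


def sample_value_replay_target(rewards, values, max_n, gamma, rng):
    """Pick a random start and truncation length, then form the n-step target.

    values must hold len(rewards) + 1 estimates, one per visited state.
    """
    start = rng.randrange(len(rewards))
    n = rng.randint(1, min(max_n, len(rewards) - start))
    target = n_step_return(rewards[start:start + n], values[start + n], gamma)
    return start, n, target
```

The value estimate at `start` would then be regressed towards `target`, with samples drawn uniformly rather than skewed towards rewards.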
Experience replay is also used to increase the efficiency and stability of the auxiliary control tasks. Q-learning updates are applied to sampled experiences that are drawn from the replay buffer, allowing features to be developed extremely efficiently.
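For the auxiliary control tasks, a Q-learning target can be formed per task from a replayed transition. The sketch below uses a one-step target for brevity, which is a simplification chosen for this example rather than the update used in the paper; the function name and data layout are also assumptions.

```python
def q_learning_targets(aux_rewards, next_q_values, gamma):
    """One-step target r_c + gamma * max_a Q_c(s', a) for each auxiliary task c.

    aux_rewards[c] is the auxiliary reward for task c on the sampled transition;
    next_q_values[c] lists that task's Q estimates for every action in s'.
    """
    if len(aux_rewards) != len(next_q_values):
        raise ValueError("one reward and one Q-value list are needed per task")
    return [r + gamma * max(q) for r, q in zip(aux_rewards, next_q_values)]
```

Because the maximum is taken over actions rather than following the behaviour policy, the update is off-policy and can reuse stored experience.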
3.4 UNREAL Agent
The UNREAL algorithm combines the benefits of two separate, state-of-the-art approaches to deep reinforcement learning. The primary policy is trained with A3C (Mnih et al., 2016): it learns from parallel streams of experience to gain efficiency and stability; it is updated online using policy gradient methods; and it uses a recurrent neural network to encode the complete history of experience. This allows the agent to learn effectively in partially observed environments.
The auxiliary tasks are trained on very recent sequences of experience that are stored and randomly sampled; these sequences may be prioritised (in our case according to immediate rewards) (Schaul et al., 2015b); these targets are trained off-policy by Q-learning; and they may use simpler feedforward architectures. This allows the representation to be trained with maximum efficiency.
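One way to picture how these pieces are optimised together is as a weighted sum of the individual losses on a shared set of parameters. The sketch below is an assumption-laden illustration: the weight names `lambda_vr`, `lambda_pc`, and `lambda_rp` and their default values are hypothetical and would need tuning.

```python
def unreal_loss(a3c_loss, value_replay_loss, pixel_control_losses,
                reward_prediction_loss,
                lambda_vr=1.0, lambda_pc=1.0, lambda_rp=1.0):
    """Weighted sum of the on-policy A3C loss and the auxiliary losses.

    pixel_control_losses holds one Q-learning loss per auxiliary control task.
    """
    return (a3c_loss
            + lambda_vr * value_replay_loss
            + lambda_pc * sum(pixel_control_losses)
            + lambda_rp * reward_prediction_loss)
```

Setting a weight to zero switches the corresponding component off, which mirrors how ablated variants of an agent can be defined.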
Experiments
In this section we give the results of experiments performed on the 3D environment Labyrinth in Section 4.1 and Atari in Section 4.2.
In all our experiments we used an A3C CNN-LSTM agent as our baseline and the UNREAL agent along with its ablated variants added auxiliary outputs and losses to this base agent. The agent is trained on-policy with 20-step returns and the auxiliary tasks are performed every 20 environment steps, corresponding to every update of the base A3C agent. The replay buffer stores the most recent 2k observations, actions, and rewards taken by the base agent. In Labyrinth we use the same set of 17 discrete actions for all games and on Atari the action set is game dependent (between 3 and 18 discrete actions). The full implementation details can be found in Section B.
Labyrinth is a first-person 3D game platform extended from OpenArena (contributors, 2005), which is itself based on Quake3 (id software, 1999). Labyrinth is comparable to other first-person 3D game platforms for AI research like VizDoom (Kempka et al., 2016) or Minecraft (Tessler et al., 2016). However, in comparison, Labyrinth has considerably richer visuals and more realistic physics. Textures in Labyrinth are often dynamic (animated) so as to convey a game world where walls and floors shimmer and pulse, adding significant complexity to the perceptual task. The action space allows for fine-grained pointing in a fully 3D world. Unlike in VizDoom, agents can look up to the sky or down to the ground. Labyrinth also supports continuous motion, unlike the Minecraft platform of Oh et al. (2016), which is a 3D grid world.
We evaluated agent performance on 13 Labyrinth levels that tested a range of different agent abilities. A top-down visualization showing the layout of each level can be found in Figure 7 of the Appendix. A gallery of example images from the first-person perspective of the agent is shown in Figure 8 of the Appendix. The levels can be divided into four categories:
We compared the full UNREAL agent to a basic A3C LSTM agent along with several ablated versions of UNREAL with different components turned off. A video of the final agent performance, as well as visualisations of the activations and auxiliary task outputs can be viewed at https://youtu.be/Uz-zGYrYEjA.
In order to better understand the benefits of auxiliary control tasks we compared them to two simple baselines on three Labyrinth levels. The first baseline was A3C augmented with a pixel reconstruction loss, which has been shown to improve performance on 3D environments (Kulkarni et al., 2016). The second baseline was A3C augmented with an input change prediction loss, which can be seen as simply predicting the immediate auxiliary reward instead of learning to control. Finally, we include preliminary results for A3C augmented with the feature control auxiliary task on one of the levels. We retuned the hyperparameters of all methods (including learning rate and the weight placed on the auxiliary loss) for each of the three Labyrinth levels. Figure 5 shows the learning curves for the top 5 hyperparameter settings on three Labyrinth navigation levels. The results show that learning to control pixel changes is indeed better than simply predicting immediate pixel changes, which in turn is better than simply learning to reconstruct the input. In fact, learning to reconstruct only led to faster initial learning and actually made the final scores worse when compared to vanilla A3C. Our hypothesis is that input reconstruction hurts final performance because it puts too much focus on reconstructing irrelevant parts of the visual input instead of visual cues for rewards, since rewarding objects are rarely visible. Encouragingly, we saw an improvement from including the feature control auxiliary task. Combining feature control with other auxiliary tasks is a promising future direction.
4.2 Atari
We applied the UNREAL agent as well as UNREAL without pixel control to 57 Atari games from the Arcade Learning Environment (Bellemare et al., 2012) domain. We use the same evaluation protocol as for our Labyrinth experiments where we evaluate 50 different random hyperparameter settings (learning rate and entropy cost) on each game. The results are shown in the bottom row of Figure 3. The left side shows the average performance curves of the top 3 agents for all three methods; the right half shows sorted average human-normalised scores for each hyperparameter setting. More detailed learning curves for individual levels can be found in Figure 7. We see that UNREAL surpasses the current state-of-the-art agents, i.e. A3C and Prioritized Dueling DQN (Wang et al., 2016), across all levels attaining 880% mean and 250% median performance. Notably, UNREAL is also substantially more robust to hyperparameter settings than A3C.
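Human-normalised scores are commonly computed by rescaling raw game scores so that random play maps to 0% and human play to 100%. The sketch below assumes that convention and shows how mean and median figures arise from per-game scores; the function names and the input layout are hypothetical.

```python
from statistics import mean, median


def human_normalised(agent, random_score, human):
    """Percentage score where random play is 0 and human play is 100."""
    return 100.0 * (agent - random_score) / (human - random_score)


def summarise(raw_scores):
    """Mean and median human-normalised score over games.

    raw_scores maps a game name to an (agent, random, human) score triple.
    """
    normalised = [human_normalised(a, r, h) for a, r, h in raw_scores.values()]
    return mean(normalised), median(normalised)
```

The mean can be dominated by a few games with very high normalised scores, which is why a mean far above the median is unsurprising.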
Conclusion
We have shown how augmenting a deep reinforcement learning agent with auxiliary control and reward prediction tasks can drastically improve both data efficiency and robustness to hyperparameter settings. Most notably, our proposed UNREAL architecture more than doubled the previous state-of-the-art results on the challenging set of 3D Labyrinth levels, bringing the average scores to over of human scores. The same UNREAL architecture also significantly improved both the learning speed and the robustness of A3C over 57 Atari games.
Acknowledgements
We thank Charles Beattie, Julian Schrittwieser, Marcus Wainwright, and Stig Petersen for environment design and development, and Amir Sadik and Sarah York for expert human game testing. We also thank Joseph Modayil, Andrea Banino, Hubert Soyer, Razvan Pascanu, and Raia Hadsell for many helpful discussions.
References
Appendix A Atari Games
Appendix B Implementation Details
The input to the agent at each timestep was an RGB image. All agents processed the input with the convolutional neural network (CNN) originally used for Atari by Mnih et al. (2013). The network consists of two convolutional layers. The first one has 16 $8 \times 8$ filters applied with stride 4, while the second one has 32 $4 \times 4$ filters with stride 2. This is followed by a fully connected layer with 256 units. All three layers are followed by a ReLU non-linearity. All agents used an LSTM with forget gates (Gers et al., 2000) with 256 cells which take in the CNN-encoded observation concatenated with the previous action taken and current reward. The policy and value function are linear projections of the LSTM output. The agent is trained with 20-step unrolls. The action space of the agent in the environment is game dependent for Atari (between 3 and 18 discrete actions), and 17 discrete actions for Labyrinth. Labyrinth runs at 60 frames-per-second. We use an action repeat of four, meaning that each action is repeated four times, with the agent receiving the final fourth frame as input to the next processing step.
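The action repeat described above can be sketched as a small wrapper around an environment step function. The `env_step(action) -> (frame, reward, done)` interface, the summing of rewards over repeated frames, and the early exit on episode end are assumptions for this example.

```python
def repeat_action(env_step, action, repeats=4):
    """Apply the same action `repeats` times and return the final frame.

    env_step is a hypothetical callable returning (frame, reward, done).
    Rewards from the intermediate frames are summed (an assumed choice).
    """
    total_reward = 0.0
    frame, done = None, False
    for _ in range(repeats):
        frame, reward, done = env_step(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done
```

With an environment running at 60 frames per second and a repeat of four, the agent makes 15 decisions per second of game time.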
The auxiliary tasks are performed every 20 environment steps, corresponding to every update of the base A3C agent, once the replay buffer has filled with agent experience. The replay buffer stores the most recent 2k observations, actions, and rewards taken by the base agent.
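The buffering and scheduling described here can be sketched with a fixed-size queue. This is a minimal illustration: it reads "2k" as 2000 entries, and the class and function names are hypothetical.

```python
from collections import deque


class RecentReplayBuffer:
    """Keeps only the most recent `capacity` (observation, action, reward) triples."""

    def __init__(self, capacity=2000):
        # A deque with maxlen silently discards the oldest entry when full.
        self._items = deque(maxlen=capacity)

    def add(self, observation, action, reward):
        self._items.append((observation, action, reward))

    def is_full(self):
        return len(self._items) == self._items.maxlen

    def __len__(self):
        return len(self._items)


def should_run_auxiliary_update(step, buffer, interval=20):
    """True every `interval` environment steps once the buffer has filled."""
    return buffer.is_full() and step % interval == 0
```

Tying the auxiliary updates to the same 20-step interval as the base agent's update keeps one auxiliary update per A3C update.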