Noisy Networks for Exploration

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

Introduction

Despite the wealth of research into efficient methods for exploration in Reinforcement Learning (RL) (Kearns & Singh, 2002; Jaksch et al., 2010), most exploration heuristics rely on random perturbations of the agent’s policy, such as ϵ\epsilon-greedy (Sutton & Barto, 1998) or entropy regularisation (Williams, 1992), to induce novel behaviours. However such local ‘dithering’ perturbations are unlikely to lead to the large-scale behavioural patterns needed for efficient exploration in many environments (Osband et al., 2017).

Optimism in the face of uncertainty is a common exploration heuristic in reinforcement learning. Various forms of this heuristic often come with theoretical guarantees on agent performance (Azar et al., 2017; Lattimore et al., 2013; Jaksch et al., 2010; Auer & Ortner, 2007; Kearns & Singh, 2002). However, these methods are often limited to small state-action spaces or to linear function approximations and are not easily applied with more complicated function approximators such as neural networks (except from work by (Geist & Pietquin, 2010a; b) but it doesn’t come with convergence guarantees). A more structured approach to exploration is to augment the environment’s reward signal with an additional intrinsic motivation term (Singh et al., 2004) that explicitly rewards novel discoveries. Many such terms have been proposed, including learning progress (Oudeyer & Kaplan, 2007), compression progress (Schmidhuber, 2010), variational information maximisation (Houthooft et al., 2016) and prediction gain (Bellemare et al., 2016). One problem is that these methods separate the mechanism of generalisation from that of exploration; the metric for intrinsic reward, and–importantly–its weighting relative to the environment reward, must be chosen by the experimenter, rather than learned from interaction with the environment. Without due care, the optimal policy can be altered or even completely obscured by the intrinsic rewards; furthermore, dithering perturbations are usually needed as well as intrinsic reward to ensure robust exploration (Ostrovski et al., 2017). Exploration in the policy space itself, for example, with evolutionary or black box algorithms (Moriarty et al., 1999; Fix & Geist, 2012; Salimans et al., 2017), usually requires many prolonged interactions with the environment. Although these algorithms are quite generic and can apply to any type of parametric policies (including neural networks), they are usually not data efficient and require a simulator to allow many policy evaluations.

We propose a simple alternative approach, called NoisyNet, where learned perturbations of the network weights are used to drive exploration. The key insight is that a single change to the weight vector can induce a consistent, and potentially very complex, state-dependent change in policy over multiple time steps – unlike dithering approaches where decorrelated (and, in the case of ϵ\epsilon-greedy, state-independent) noise is added to the policy at every step. The perturbations are sampled from a noise distribution. The variance of the perturbation is a parameter that can be considered as the energy of the injected noise. These variance parameters are learned using gradients from the reinforcement learning loss function, along side the other parameters of the agent. The approach differs from parameter compression schemes such as variational inference (Hinton & Van Camp, 1993; Bishop, 1995; Graves, 2011; Blundell et al., 2015; Gal & Ghahramani, 2016) and flat minima search (Hochreiter & Schmidhuber, 1997) since we do not maintain an explicit distribution over weights during training but simply inject noise in the parameters and tune its intensity automatically. Consequently, it also differs from Thompson sampling (Thompson, 1933; Lipton et al., 2016) as the distribution on the parameters of our agents does not necessarily converge to an approximation of a posterior distribution.

At a high level our algorithm is a randomised value function, where the functional form is a neural network. Randomised value functions provide a provably efficient means of exploration (Osband et al., 2014). Previous attempts to extend this approach to deep neural networks required many duplicates of sections of the network (Osband et al., 2016). By contrast in our NoisyNet approach while the number of parameters in the linear layers of the network is doubled, as the weights are a simple affine transform of the noise, the computational complexity is typically still dominated by the weight by activation multiplications, rather than the cost of generating the weights. Additionally, it also applies to policy gradient methods such as A3C out of the box (Mnih et al., 2016). Most recently (and independently of our work) Plappert et al. (2017) presented a similar technique where constant Gaussian noise is added to the parameters of the network. Our method thus differs by the ability of the network to adapt the noise injection with time and it is not restricted to Gaussian noise distributions. We need to emphasise that the idea of injecting noise to improve the optimisation process has been thoroughly studied in the literature of supervised learning and optimisation under different names (e.g., Neural diffusion process (Mobahi, 2016) and graduated optimisation (Hazan et al., 2016)). These methods often rely on a noise of vanishing size that is non-trainable, as opposed to NoisyNet which tunes the amount of noise by gradient descent.

NoisyNet can also be adapted to any deep RL algorithm and we demonstrate this versatility by providing NoisyNet versions of DQN (Mnih et al., 2015), Dueling (Wang et al., 2016) and A3C (Mnih et al., 2016) algorithms. Experiments on 57 Atari games show that NoisyNet-DQN and NoisyNetDueling achieve striking gains when compared to the baseline algorithms without significant extra computational cost, and with less hyper parameters to tune. Also the noisy version of A3C provides some improvement over the baseline.

Background

This section provides mathematical background for Markov Decision Processes (MDPs) and deep RL with Q-learning, dueling and actor-critic methods.

MDPs model stochastic, discrete-time and finite action space control problems (Bellman & Kalaba, 1965; Bertsekas, 1995; Puterman, 1994). An MDP is a tuple M=(X,A,R,P,γ)M=(\mathcal{X},\mathcal{A},R,P,\gamma) where X\mathcal{X} is the state space, A\mathcal{A} the action space, RR the reward function, γ]0,1[\gamma\in]0,1[ the discount factor and PP a stochastic kernel modelling the one-step Markovian dynamics (P(yx,a)P(y|x,a) is the probability of transitioning to state yy by choosing action aa in state xx). A stochastic policy π\pi maps each state to a distribution over actions π(x)\pi(\cdot|x) and gives the probability π(ax)\pi(a|x) of choosing action aa in state xx. The quality of a policy π\pi is assessed by the action-value function QπQ^{\pi} defined as:

2 Deep Reinforcement Learning

Deep Reinforcement Learning uses deep neural networks as function approximators for RL methods. Deep Q-Networks (DQN) (Mnih et al., 2015), Dueling architecture (Wang et al., 2016), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Trust Region Policy Optimisation (Schulman et al., 2015), Deep Deterministic Policy Gradient (Lillicrap et al., 2015) and distributional RL (C51) (Bellemare et al., 2017) are examples of such algorithms. They frame the RL problem as the minimisation of a loss function L(θ)L(\theta), where θ\theta represents the parameters of the network. In our experiments we shall consider the DQN, Dueling and A3C algorithms.

DQN (Mnih et al., 2015) uses a neural network as an approximator for the action-value function of the optimal policy Q(x,a)Q^{\star}(x,a). DQN’s estimate of the optimal action-value function, Q(x,a)Q(x,a), is found by minimising the following loss with respect to the neural network parameters θ\theta:

where DD is a distribution over transitions e=(x,a,r=R(x,a),yP(x,a))e=(x,a,r=R(x,a),y\sim P(\cdot|x,a)) drawn from a replay buffer of previously observed transitions. Here θ\theta^{-} represents the parameters of a fixed and separate target network which is updated (θθ\theta^{-}\leftarrow\theta) regularly to stabilise the learning. An ϵ\epsilon-greedy policy is used to pick actions greedily according to the action-value function QQ or, with probability ϵ\epsilon, a random action is taken.

The Dueling DQN (Wang et al., 2016) is an extension of the DQN architecture. The main difference is in using Dueling network architecture as opposed to the Q network in DQN. Dueling network estimates the action-value function using two parallel sub-networks, the value and advantage sub-network, sharing a convolutional layer. Let θconv\theta_{\text{conv}}, θV\theta_{V}, and θA\theta_{A} be, respectively, the parameters of the convolutional encoder ff, of the value network VV, and of the advantage network AA; and θ={θconv,θV,θA}\theta=\{\theta_{\text{conv}},\theta_{V},\theta_{A}\} is their concatenation. The output of these two networks are combined as follows for every (x,a)X×A(x,a)\in\mathcal{X}\times\mathcal{A}:

The Dueling algorithm then makes use of the double-DQN update rule (van Hasselt et al., 2016) to optimise θ\theta:

where the definition distribution DD and the target network parameter set θ\theta^{-} is identical to DQN.

In contrast to DQN and Dueling, A3C (Mnih et al., 2016) is a policy gradient algorithm. A3C’s network directly learns a policy π\pi and a value function VV of its policy. The gradient of the loss on the A3C policy at step tt for the roll-out (xt+i,at+iπ(xt+i;θ),rt+i)i=0k(x_{t+i},a_{t+i}\sim\pi(\cdot|x_{t+i};\theta),r_{t+i})_{i=0}^{k} is:

H[π(xt;θ)]H[\pi(\cdot|x_{t};\theta)] denotes the entropy of the policy π\pi and β\beta is a hyper parameter that trades off between optimising the advantage function and the entropy of the policy. The advantage function A(xt+i,at+i;θ)A(x_{t+i},a_{t+i};\theta) is the difference between observed returns and estimates of the return produced by A3C’s value network: A(xt+i,at+i;θ)=j=ik1γjirt+j+γkiV(xt+k;θ)V(xt+i;θ)A(x_{t+i},a_{t+i};\theta)=\sum_{j=i}^{k-1}\gamma^{j-i}r_{t+j}+\gamma^{k-i}V(x_{t+k};\theta)-V(x_{t+i};\theta), rt+jr_{t+j} being the reward at step t+jt+j and V(x;θ)V(x;\theta) being the agent’s estimate of value function of state xx.

The parameters of the value function are found to match on-policy returns; namely we have

NoisyNets for Reinforcement Learning

Consider a linear layer of a neural network with pp inputs and qq outputs, represented by

We now turn to explicit instances of the noise distributions for linear layers in a noisy network. We explore two options: Independent Gaussian noise, which uses an independent Gaussian noise entry per weight and Factorised Gaussian noise, which uses an independent noise per each output and another independent noise per each input. The main reason to use factorised Gaussian noise is to reduce the compute time of random number generation in our algorithms. This computational overhead is especially prohibitive in the case of single-thread agents such as DQN and Duelling. For this reason we use factorised noise for DQN and Duelling and independent noise for the distributed A3C, for which the compute time is not a major concern.

Independent Gaussian noise: the noise applied to each weight and bias is independent, where each entry εi,jw\varepsilon^{w}_{i,j} (respectively each entry εjb\varepsilon^{b}_{j}) of the random matrix εw\varepsilon^{w} (respectively of the random vector εb\varepsilon^{b}) is drawn from a unit Gaussian distribution. This means that for each noisy linear layer, there are pq+qpq+q noise variables (for pp inputs to the layer and qq outputs).

Factorised Gaussian noise: by factorising εi,jw\varepsilon^{w}_{i,j}, we can use pp unit Gaussian variables εi\varepsilon_{i} for noise of the inputs and and qq unit Gaussian variables εj\varepsilon_{j} for noise of the outputs (thus p+qp+q unit Gaussian variables in total). Each εi,jw\varepsilon^{w}_{i,j} and εjb\varepsilon^{b}_{j} can then be written as:

where ff is a real-valued function. In our experiments we used f(x)=sgn(x)xf(x)=\operatorname*{sgn}(x)\sqrt{|x|}. Note that for the bias Eq. (11) we could have set f(x)=xf(x)=x, but we decided to keep the same output noise for weights and biases.

We use a Monte Carlo approximation to the above gradients, taking a single sample ξ\xi at each step of optimisation:

We now turn to our application of noisy networks to exploration in deep reinforcement learning. Noise drives exploration in many methods for reinforcement learning, providing a source of stochasticity external to the agent and the RL task at hand. Either the scale of this noise is manually tuned across a wide range of tasks (as is the practice in general purpose agents such as DQN or A3C) or it can be manually scaled per task. Here we propose automatically tuning the level of noise added to an agent for exploration, using the noisy networks training to drive down (or up) the level of noise injected into the parameters of a neural network, as needed.

A noisy network agent samples a new set of parameters after every step of optimisation. Between optimisation steps, the agent acts according to a fixed set of parameters (weights and biases). This ensures that the agent always acts according to parameters that are drawn from the current noise distribution.

We apply the following modifications to both DQN and Dueling: first, ε\varepsilon-greedy is no longer used, but instead the policy greedily optimises the (randomised) action-value function. Secondly, the fully connected layers of the value network are parameterised as a noisy network, where the parameters are drawn from the noisy network parameter distribution after every replay step. We used factorised Gaussian noise as explained in (b) from Sec. 3. For replay, the current noisy network parameter sample is held fixed across the batch. Since DQN and Dueling take one step of optimisation for every action step, the noisy network parameters are re-sampled before every action. We call the new adaptations of DQN and Dueling, NoisyNet-DQN and NoisyNet-Dueling, respectively.

We now provide the details of the loss function that our variant of DQN is minimising. When replacing the linear layers by noisy layers in the network (respectively in the target network), the parameterised action-value function Q(x,a,ε;ζ)Q(x,a,\varepsilon;\zeta) (respectively Q(x,a,ε;ζ)Q(x,a,\varepsilon^{\prime};\zeta^{-})) can be seen as a random variable and the DQN loss becomes the NoisyNet-DQN loss:

where the outer expectation is with respect to distribution of the noise variables ε\varepsilon for the noisy value function Q(x,a,ε;ζ)Q(x,a,\varepsilon;\zeta) and the noise variable ε\varepsilon^{\prime} for the noisy target value function Q(y,b,ε;ζ)Q(y,b,\varepsilon^{\prime};\zeta^{-}). Computing an unbiased estimate of the loss is straightforward as we only need to compute, for each transition in the replay buffer, one instance of the target network and one instance of the online network. We generate these independent noises to avoid bias due to the correlation between the noise in the target network and the online network. Concerning the action choice, we generate another independent sample ε\varepsilon^{\prime\prime} for the online network and we act greedily with respect to the corresponding output action-value function.

Similarly the loss function for NoisyNet-Dueling is defined as:

Both algorithms are provided in Appendix C.1.

Asynchronous Advantage Actor Critic (A3C).

A3C is modified in a similar fashion to DQN: firstly, the entropy bonus of the policy loss is removed. Secondly, the fully connected layers of the policy network are parameterised as a noisy network. We used independent Gaussian noise as explained in (a) from Sec. 3. In A3C, there is no explicit exploratory action selection scheme (such as ϵ\epsilon-greedy); and the chosen action is always drawn from the current policy. For this reason, an entropy bonus of the policy loss is often added to discourage updates leading to deterministic policies. However, when adding noisy weights to the network, sampling these parameters corresponds to choosing a different current policy which naturally favours exploration. As a consequence of direct exploration in the policy space, the artificial entropy loss on the policy can thus be omitted. New parameters of the policy network are sampled after each step of optimisation, and since A3C uses nn step returns, optimisation occurs every nn steps. We call this modification of A3C, NoisyNet-A3C.

Indeed, when replacing the linear layers by noisy linear layers (the parameters of the noisy network are now noted ζ\zeta), we obtain the following estimation of the return via a roll-out of size kk:

As A3C is an on-policy algorithm the gradients are unbiased when noise of the network is consistent for the whole roll-out. Consistency among action value functions Q^i\hat{Q}_{i} is ensured by letting letting the noise be the same throughout each rollout, i.e., i,εi=ε\forall i,\varepsilon_{i}=\varepsilon. Additional details are provided in the Appendix A and the algorithm is given in Appendix C.2.

2 Initialisation of Noisy Networks

In the case of an unfactorised noisy networks, the parameters μ\mu and σ\sigma are initialised as follows. Each element μi,j\mu_{i,j} is sampled from independent uniform distributions U[3p,+3p]\mathcal{U}[-\sqrt{\frac{3}{p}},+\sqrt{\frac{3}{p}}], where pp is the number of inputs to the corresponding linear layer, and each element σi,j\sigma_{i,j} is simply set to 0.0170.017 for all parameters. This particular initialisation was chosen because similar values worked well for the supervised learning tasks described in Fortunato et al. (2017), where the initialisation of the variances of the posteriors and the variances of the prior are related. We have not tuned for this parameter, but we believe different values on the same scale should provide similar results.

For factorised noisy networks, each element μi,j\mu_{i,j} was initialised by a sample from an independent uniform distributions U[1p,+1p]\mathcal{U}[-\frac{1}{\sqrt{p}},+\frac{1}{\sqrt{p}}] and each element σi,j\sigma_{i,j} was initialised to a constant σ0p\frac{\sigma_{0}}{\sqrt{p}}. The hyperparameter σ0\sigma_{0} is set to 0.50.5.

Results

We evaluated the performance of noisy network agents on 57 Atari games (Bellemare et al., 2015) and compared to baselines that, without noisy networks, rely upon the original exploration methods (ε\varepsilon-greedy and entropy bonus).

We used the random start no-ops scheme for training and evaluation as described the original DQN paper (Mnih et al., 2015). The mode of evaluation is identical to those of Mnih et al. (2016) where randomised restarts of the games are used for evaluation after training has happened. The raw average scores of the agents are evaluated during training, every 1M frames in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play) (van Hasselt et al., 2016).

We consider three baseline agents: DQN (Mnih et al., 2015), duel clip variant of Dueling algorithm (Wang et al., 2016) and A3C (Mnih et al., 2016). The DQN and A3C agents were training for 200M and 320M frames, respectively. In each case, we used the neural network architecture from the corresponding original papers for both the baseline and NoisyNet variant. For the NoisyNet variants we used the same hyper parameters as in the respective original paper for the baseline.

We compared absolute performance of agents using the human normalised score:

where human and random scores are the same as those in Wang et al. (2016). Note that the human normalised score is zero for a random agent and 100100 for human level performance. Per-game maximum scores are computed by taking the maximum raw scores of the agent and then averaging over three seeds. However, for computing the human normalised scores in Figure 2, the raw scores are evaluated every 1M frames and averaged over three seeds. The overall agent performance is measured by both mean and median of the human normalised score across all 57 Atari games.

The aggregated results across all 57 Atari games are reported in Table 1, while the individual scores for each game are in Table 3 from the Appendix E. The median human normalised score is improved in all agents by using NoisyNet, adding at least 1818 (in the case of A3C) and at most 4848 (in the case of DQN) percentage points to the median human normalised score. The mean human normalised score is also significantly improved for all agents. Interestingly the Dueling case, which relies on multiple modifications of DQN, demonstrates that NoisyNet is orthogonal to several other improvements made to DQN.

We also compared relative performance of NoisyNet agents to the respective baseline agent without noisy networks:

As before, the per-game score is computed by taking the maximum performance for each game and then averaging over three seeds. The relative human normalised scores are shown in Figure 1. As can be seen, the performance of NoisyNet agents (DQN, Dueling and A3C) is better for the majority of games relative to the corresponding baseline, and in some cases by a considerable margin. Also as it is evident from the learning curves of Fig. 2 NoisyNet agents produce superior performance compared to their corresponding baselines throughout the learning process. This improvement is especially significant in the case of NoisyNet-DQN and NoisyNet-Dueling. Also in some games, NoisyNet agents provide an order of magnitude improvement on the performance of the vanilla agent; as can be seen in Table 3 in the Appendix E with detailed breakdown of individual game scores and the learning curves plots from Figs 6, 7 and 8, for DQN, Dueling and A3C, respectively. We also ran some experiments evaluating the performance of NoisyNet-A3C with factorised noise. We report the corresponding learning curves and the scores in Fig. LABEL:fig:learning_curves_median_factorised and Table 2, respectively (see Appendix D). This result shows that using factorised noise does not lead to any significant decrease in the performance of A3C. On the contrary it seems that it has positive effects in terms of improving the median score as well as speeding up the learning process.

2 Analysis of Learning in Noisy Layers

In this subsection, we try to provide some insight on how noisy networks affect the learning process and the exploratory behaviour of the agent. In particular, we focus on analysing the evolution of the noise weights σw\sigma^{w} and σb\sigma^{b} throughout the learning process. We first note that, as L(ζ)L(\zeta) is a positive and continuous function of ζ\zeta, there always exists a deterministic optimiser for the loss L(ζ)L(\zeta) (defined in Eq. (14)). Therefore, one may expect that, to obtain the deterministic optimal solution, the neural network may learn to discard the noise entries by eventually pushing σw\sigma^{w}s and σb\sigma^{b} towards .

To test this hypothesis we track the changes in σw\sigma^{w}s throughout the learning process. Let σiw\sigma^{w}_{i} denote the ithi^{\text{th}} weight of a noisy layer. We then define Σˉ\bar{\Sigma}, the mean-absolute of the σiw\sigma^{w}_{i}s of a noisy layer, as

Intuitively speaking Σˉ\bar{\Sigma} provides some measure of the stochasticity of the Noisy layers. We report the learning curves of the average of Σˉ\bar{\Sigma} across 33 seeds in Fig. 3 for a selection of Atari games in NoisyNet-DQN agent. We observe that Σˉ\bar{\Sigma} of the last layer of the network decreases as the learning proceeds in all cases, whereas in the case of the penultimate layer this only happens for 2 games out of 5 (Pong and Beam rider) and in the remaining 3 games Σˉ\bar{\Sigma} in fact increases. This shows that in the case of NoisyNet-DQN the agent does not necessarily evolve towards a deterministic solution as one might have expected. Another interesting observation is that the way Σˉ\bar{\Sigma} evolves significantly differs from one game to another and in some cases from one seed to another seed, as it is evident from the error bars. This suggests that NoisyNet produces a problem-specific exploration strategy as opposed to fixed exploration strategy used in standard DQN.

Conclusion

We have presented a general method for exploration in deep reinforcement learning that shows significant performance improvements across many Atari games in three different agent architectures. In particular, we observe that in games such as Beam rider, Asteroids and Freeway that the standard DQN, Dueling and A3C perform poorly compared with the human player, NoisyNet-DQN, NoisyNet-Dueling and NoisyNet-A3C achieve super human performance, respectively. Although the improvements in performance might also come from the optimisation aspect since the cost functions are modified, the uncertainty in the parameters of the networks introduced by NoisyNet is the only exploration mechanism of the method. Having weights with greater uncertainty introduces more variability into the decisions made by the policy, which has potential for exploratory actions, but further analysis needs to be done in order to disentangle the exploration and optimisation effects.

Another advantage of NoisyNet is that the amount of noise injected in the network is tuned automatically by the RL algorithm. This alleviates the need for any hyper parameter tuning (required with standard entropy bonus and ϵ\epsilon-greedy types of exploration). This is also in contrast to many other methods that add intrinsic motivation signals that may destabilise learning or change the optimal policy. Another interesting feature of the NoisyNet approach is that the degree of exploration is contextual and varies from state to state based upon per-weight variances. While more gradients are needed, the gradients on the mean and variance parameters are related to one another by a computationally efficient affine function, thus the computational overhead is marginal. Automatic differentiation makes implementation of our method a straightforward adaptation of many existing methods. A similar randomisation technique can also be applied to LSTM units (Fortunato et al., 2017) and is easily extended to reinforcement learning, we leave this as future work.

Note NoisyNet exploration strategy is not restricted to the baselines considered in this paper. In fact, this idea can be applied to any deep RL algorithms that can be trained with gradient descent, including DDPG (Lillicrap et al., 2015), TRPO (Schulman et al., 2015) or distributional RL (C51) (Bellemare et al., 2017). As such we believe this work is a step towards the goal of developing a universal exploration strategy.

We would like to thank Koray Kavukcuoglu, Oriol Vinyals, Daan Wierstra, Georg Ostrovski, Joseph Modayil, Simon Osindero, Chris Apps, Stephen Gaffney and many others at DeepMind for insightful discussions, comments and feedback on this work.

References

Appendix A NoisyNet-A3C implementation details

For simplicity, here we present the A3C version with only one thread. For a multi-thread implementation, refer to the pseudo-code C.2 or to the original A3C paper (Mnih et al., 2016). In order to train the policy-head, an approximation of the policy-gradient is computed for each state of the roll-out (xt+i,at+iπ(xt+i;θπ),rt+i)i=0k(x_{t+i},a_{t+i}\sim\pi(\cdot|x_{t+i};\theta_{\pi}),r_{t+i})_{i=0}^{k}:

where Q^i\hat{Q}_{i} is an estimation of the return Q^i=j=ik1γjirt+j+γkiV(xt+k;θV).\hat{Q}_{i}=\sum_{j=i}^{k-1}\gamma^{j-i}r_{t+j}+\gamma^{k-i}V(x_{t+k};\theta_{V}). The gradients are then added to obtain the cumulative gradient of the roll-out:

A3C trains the value-head by minimising the error between the estimated return and the value i=0k(Q^iV(xt+i;θV))2\sum_{i=0}^{k}(\hat{Q}_{i}-V(x_{t+i};\theta_{V}))^{2}. Therefore, the network parameters (θπ,θV)(\theta_{\pi},\theta_{V}) are updated after each roll-out as follows:

where (απ,αV)(\alpha_{\pi},\alpha_{V}) are hyper-parameters. As mentioned previously, in the original A3C algorithm, it is recommended to add an entropy term βi=0kθπH(π(xt+i;θπ))\beta\sum_{i=0}^{k}\nabla_{\theta_{\pi}}H(\pi(\cdot|x_{t+i};\theta_{\pi})) to the policy update, where H(π(xt+i;θπ))=βaAπ(axt+i;θπ)log(π(axt+i;θπ))H(\pi(\cdot|x_{t+i};\theta_{\pi}))=-\beta\sum_{a\in A}\pi(a|x_{t+i};\theta_{\pi})\log(\pi(a|x_{t+i};\theta_{\pi})). Indeed, this term encourages exploration as it favours policies which are uniform over actions. When replacing the linear layers in the value and policy heads by noisy layers (the parameters of the noisy network are now ζπ\zeta_{\pi} and ζV\zeta_{V}), we obtain the following estimation of the return via a roll-out of size kk:

We would like Q^i\hat{Q}_{i} to be a consistent estimate of the return of the current policy. To do so, we should force i,εi=ε\forall i,\varepsilon_{i}=\varepsilon. As A3C is an on-policy algorithm, this involves fixing the noise of the network for the whole roll-out so that the policy produced by the network is also fixed. Hence, each update of the parameters (ζπ,ζV)(\zeta_{\pi},\zeta_{V}) is done after each roll-out with the noise of the whole network held fixed for the duration of the roll-out:

Appendix B Noisy linear layer

In this Appendix we provide a graphical representation of noisy layer.

Appendix C Algorithms

C.2 NoisyNet-A3C

Appendix D Comparison between NoisyNet-A3C (factorised and non-factorised noise) and A3C

Appendix E Learning curves and raw scores

Here we directly compare the performance of DQN, Dueling DQN and A3C and their NoisyNet counterpart by presenting the maximal score in each of the 57 Atari games (Table 3), averaged over three seeds. In Figures 6-8 we show the respective learning curves.