A0C: Alpha Zero in Continuous Action Space

Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker

Introduction

Alpha Zero has achieved state-of-the-art, super-human performance in Chess, Shogi (Silver et al., 2017a, ) and the game of Go (Silver et al.,, 2016; Silver et al., 2017b, ). The key innovation of Alpha Zero compared to traditional reinforcement learning approaches is the use of a small, nested tree search as a policy evaluation.Additionally, the tree search provides an efficient exploration method, which is a key challenge in reinforcement learning Moerland et al., (2017). Whereas traditional reinforcement learning treats each environment step or trace as an individual training target, Alpha Zero aggregates the information of multiple traces in a tree, and eventually aggregates these tree statistics into targets to train a neural network. The neural network is then used as a prior to improve new tree searches. This closes the loop between search and function approximation (Figure 1). In section 6 we further discuss why this works so well.

While Alpha Zero has been very successful in two-player games with discrete action spaces, it is not yet applicable in continuous action space (nor has Alpha Zero been tested in single-player environments). Many real-world problems, such as robotics control, navigation and self-driving cars, have a continuous action space. We will now list the core contributions of this paper. Compared to the Alpha Zero paradigm for discrete action spaces, we require:

A Monte Carlo Tree Search (MCTS) method that works in continuous action space. We built here on earlier results on progressive widening (Section 3.1).

Incorporation of a continuous prior to steer a new MCTS iteration. While Alpha Zero uses the discrete density as a prior in a (P)UCT formula (Rosin,, 2011; Kocsis and Szepesvári,, 2006), we need to leverage a continuous density (which is unbounded) to direct the next MCTS iteration (Section 3.2)

A training method. Alpha Zero transforms the MCTS visitation counts to a discrete probability distribution. We need to estimate a continuous density from a set of support points, and specify an appropriate training loss in continuous policy space (Section 4).

The remainder of this paper is organized as follows. Section 2 presents essential preliminaries on reinforcement learning and MCTS. Section 3 discusses the required MCTS modifications for a continuous action space with a continuous prior (Fig. 1, upper part of the loop). In Section 4 we cover the generation of training targets from the tree search and specify an appropriate neural network loss (Fig. 1, lower part of the loop). Sections 5, 6 and 7 present experiments, discussion and conclusions.

Preliminaries

Monte Carlo Tree Search

We present a brief introduction of the well-known MCTS algorithm (Coulom,, 2006; Browne et al.,, 2012). In particular, we discuss a variant of the PUCT algorithm (Rosin,, 2011), as also used in Alpha Zero (Silver et al., 2017a, ; Silver et al., 2017b, ). Every action node in the tree stores statistics {n(s,a),W(s,a),Q(s,a)}\{n(\bm{s},\bm{a}),W(\bm{s},\bm{a}),Q(\bm{s},\bm{a})\}, where n(s,a)n(\bm{s},\bm{a}) is the visitation count, W(s,a)W(\bm{s},\bm{a}) the cumulative return over all roll-outs through (s,a)(\bm{s},\bm{a}), and Q(s,a)=W(s,a)/n(s,a)Q(\bm{s},\bm{a})=W(\bm{s},\bm{a})/n(\bm{s},\bm{a}) is the mean action value estimate. PUCT alternates four phases:

Select In the first stage, we descent the tree from the root node according to:

Back-up Finally, we recursively back-up the results in the tree nodes. Denote the current forward trace in the tree as {s0,a0,s1,..sL1,aL1,sL}\{\bm{s}_{0},\bm{a}_{0},\bm{s}_{1},..\bm{s}_{L-1},\bm{a}_{L-1},\bm{s}_{L}\}. Then, for each state-action edge (si,ai)(\bm{s}_{i},\bm{a}_{i}), L>i0L>i\geq 0, we recursively estimate the state-action value as

This procedure is repeated until the overall MCTS trace budget NtraceN_{\text{trace}} is reached. MCTS returns a set of root actions A0={a0,0,a0,1,..,a0,m}A_{0}=\{\bm{a}_{0,0},\bm{a}_{0,1},..,\bm{a}_{0,m}\} with associated counts N0={n(s0,a0,0),n(s0,a0,1),..,n(s0,a0,m)}N_{0}=\{n(\bm{s}_{0},\bm{a}_{0,0}),n(\bm{s}_{0},\bm{a}_{0,1}),..,n(\bm{s}_{0},\bm{a}_{0,m})\}. Here mm denotes the number of child actions, which for Alpha Zero is always fixed to the cardinality of the discrete action space m=Am=|\mathcal{A}|. We select the real action at\bm{a}^{t} to play in the environment by sampling from the probability distribution obtained from normalizing the action counts at the root s0(=st)\bm{s}_{0}(=\bm{s}^{t}):

and n(s0)=bA0n(s0,b)n(\bm{s}_{0})=\sum_{\bm{b}\in A_{0}}n(\bm{s}_{0},\bm{b}). Note that n(s0)Ntracen(\bm{s}_{0})\geq N_{\text{trace}}, since we store the subtree that belongs to the picked action at\bm{a}^{t} for the MCTS at the next timestep.

Neural Networks

We introduce two neural networks - similar to Alpha Zero - to estimate a parametrized policy πϕ(as)\pi_{\phi}(\bm{a}|\bm{s}) and the state value Vϕ(s)V_{\phi}(\bm{s}). Both networks share the initial layers. The joint set of parameters of both networks is denoted by ϕ\phi. The neural networks are trained on targets generated by the MCTS procedure. These training targets, extracted from the tree search, are denoted by π^(as)\hat{\pi}(\bm{a}|\bm{s}) and V^(s)\hat{V}(\bm{s}).

Tree Search in Continuous Action Space

As noted in the introduction, we require two modifications to the MCTS procedure: 1) a method to deal with continuous action spaces, and 2) a way to include a continuous policy network into the MCTS search.

During MCTS with a discrete action space we evaluate the PUCT formula for all actions. However, in continuous action space we can not enumerate all actions, i.e., there are actually infinitely many actions in a continuous set. A practical solution to this problem is progressive widening (Coulom,, 2007; Chaslot et al.,, 2008), where we make the number of child actions of state s\bm{s} in the tree m(s)m(\bm{s}) a function of the total number of visits to that state n(s)n(\bm{s}). This implies that actions with good returns, which will get more visits, will also gradually get more child actions for consideration. In particular, Couëtoux et al., (2011) uses

2 Continuous policy network prior

Neural network training in continuous action space

We next want to use the MCTS output to improve our neural networks. Compared to Alpha Zero, the continuous action space forces us to come up with a different policy network specification, policy target calculation and training loss. These aspects are covered in Section 4.1. Afterwards, we briefly detail the value network training procedure, including a slight variant of the value target estimation (Section 4.2).

Training Target

Loss

In short, our main idea is to leave the normalization and generalization of the policy over the action space to the network loss. If we specify a network output distribution that enforces aπϕ(as)=1\int_{a}\pi_{\phi}(\bm{a}|\bm{s})=1, i.e., making it a proper continuous density, then we may specify a loss with respect to a target density π^(as)\hat{\pi}(\bm{a}|\bm{s}), even when the target density is only known on a relative scale. More extreme counts (relative densities) will produce stronger gradients, and the restrictions of the network output density will ensure that we can not pull the density up or down over the entire support (as it needs to integrate to 1). This way, we make our network output density mimic the counts on a relative scale.

We will first give a general derivation, acting as if π^(as)\hat{\pi}(\bm{a}|\bm{s}) is a proper density, and swap in the empirical density at the end. We minimize a policy loss Lpolicy(ϕ)\mathcal{L}^{\text{policy}}(\phi) based on the Kullback-Leibler divergence between the network output πϕ(as)\pi_{\phi}(\bm{a}|\bm{s}) and the empirical density π^(as)\hat{\pi}(\bm{a}|\bm{s}) (Eq. 5):

We now drop Z(s,τ)Z(\bm{s},\tau) since it does not depend on ϕ\phi (or chose an appropriate state-dependent baseline, as is common with REINFORCE estimators). Moreover, we replace the expectation over aπϕ(as)\bm{a}\sim\pi_{\phi}(\bm{a}|\bm{s}) with the empirical support points aiDs\bm{a}_{i}\sim\mathcal{D}_{\bm{s}}, where Ds\mathcal{D}_{\bm{s}} denotes the subset of the database containing state s\bm{s}. Our final gradient estimator becomes

Entropy regularization

Continuous policies have a risk to collapse (Haarnoja et al.,, 2018). If all sampled actions are close to each other, then the distribution may narrow too much, loosing any exploration. In the worst case, the distribution may completely collapse, which will produce NaNs and break the training process. As we empirically observed this problem, we augment the training objective with an entropy maximization term. This prevents the policy from collapsing, and additionally ensures a minimum level of exploration. We define the entropy loss as

Details on the computation of the entropy for the case where πϕ(as)\pi_{\phi}(\bm{a}|\bm{s}) is a transformed Beta distribution are provided in Appendix A.1. The full policy loss thereby becomes

where λ\lambda is a hyperparameter that scales the contribution of the entropy term to the overall loss.

2 Value Network

Value network training is almost identical to the Alpha Zero specification. The only thing we modify is the estimation of V^(s)\hat{V}(\bm{s}), the training target for the value. Alpha Zero uses the eventual return of the full episode as the training target for every state in the trace. This is an unbiased, but high-variance signal (in reinforcement learning terminology (Sutton and Barto,, 2018), it uses a full Monte Carlo target). Instead, we use the MCTS procedure as a value estimator, leveraging the action value estimates Q(s0,a)Q(s_{0},a) at the root s0s_{0}. We could weigh these according to the visitation counts at the root. However, we usually built relatively small trees,AlphaGo Zero uses 1600 traces per timestep. We evaluate on smaller domains, and have less computational resources. for which a non-negligible fraction of the traces are exploratory. Therefore, we propose an off-policy estimate of the value at the root:

The value loss LV(ϕ)\mathcal{L}^{V}(\phi) is a standard mean-squared error loss:

Experiments

Figure 2 shows the results of our algorithm on the Pendulum-v0 task from the OpenAI Gym (Brockman et al.,, 2016). The curves show learning performance for different computational budgets per MCTS at each timestep. Note that the x-axis displays true environment steps, which includes the MCTS simulations. For example, if we use 10 traces per MCTS, then every real environment step counts as 10 on this scale.

First, we observe that our continuous Alpha Zero version does indeed learn on the Pendulum task. Interestingly, we observe different learning performance for different tree sizes, where the ‘sweet spot’ appears to be at an intermediate tree size (of 10). For larger trees, we complete less episodes (a single episode takes longer) and therefore train our neural network less frequently. Therefore, although each individual trace gets more budget, it takes longer before the tree search starts to profit from improved network estimates (generalization).

We train our neural network after every completed episode. However, the runs with smaller tree sizes complete much more episodes compared to the runs with a larger tree size. Moreover, the data generated from larger tree searches could be deemed ‘more trustworthy’, as we spend more computational effort in generating them. We try to compensate for this effect by making the number of training epochs over the database after each episode proportional to the size of the nested tree search. Specifically, after each episode we train for

We use a three layer neural network with 128 units in each hidden layer and ELu activation functions. For the MCTS we set cpuct=0.05c_{\text{puct}}=0.05, cpw=1c_{\text{pw}}=1 and κ=0.5\kappa=0.5, and for the policy loss λ=0.1\lambda=0.1 and τ=0.1\tau=0.1. We train the networks in Tensorflow (Abadi et al.,, 2016), using RMSProp optimizer on mini-batches of size 32 with a learning rate of 0.0001. Episodes last at maximum 300 steps.

Discussion

The results in Fig. 2 reveal an interesting trade-off in the iterated tree search and function approximation paradigm. We hypothesize that the strength of tree search is the in the locality of information. Each edge stores its own statistics, and this makes it easy to locally separate the effect of actions. Moreover, the forward search gives a more stable value estimate, smoothing out local errors in the value network. In contrast, the strength of the neural network is generalization. Frequently, we re-encounter the (almost) same state in a different subtree during a next episode. Supervised learning is a natural way to generalize the already learned knowledge from a previous episode.

One of the key observations of the present paper is that we actually need both. If we only perform tree search, then we eventually fail at solving the domain because all information is kept locally. In contrast, if we only build trees of size 1, then we are continuously generalizing without ever locally separating decisions and improving our training targets. Our results suggest that there is actually a sweet spot halfway, where we build trees of moderate size, after which we perform a few epochs of training.

Future work will test the A0C algorithm in more complicated, continuous action space tasks (Brockman et al.,, 2016; Todorov et al.,, 2012). Moreover, our algorithm could profit from recent improvements in the MCTS algorithm (Moerland et al.,, 2018) and other network architectures (Szegedy et al.,, 2015), as also leveraged in Alpha Zero.

Conclusion

This paper introduced Alpha Zero for Continuous action space (A0C). Our method learns a continuous policy network - based on transformed Beta distributions - by minimizing a KL-divergence between the network distribution and an unnormalized density at the support points from the MCTS search. Moreover, the policy network also directs new MCTS searches by proposing new candidate child actions in the search tree. Preliminary results on the Pendulum task show that our approach does indeed learn. Future work will further explore the empirical performance of A0C. In short, A0C may be a first step in transferring the success of iterated search and learning, as observed in two-player board games with discrete action spaces (Silver et al., 2017a, ; Silver et al., 2017b, ), to the single-player, continuous action space domains, like encountered in robotics, navigation and self-driving cars.

References

Appendix A Enforcing Action Space Bounds with Transformed Beta Distributions

For the loss specification in the paper, we require the (log)-density π(a)\pi(\bm{a}) of the transformed variable. We know from the change of variables rule that:

We know the entropy of a linear transformation of some variable from differential entropy (Michalowicz et al.,, 2013). For a linear transformation Mu+l\bm{M}\bm{u}+\bm{l}, with matrix M\bm{M} and vector l\bm{l}, we have

For our transformation g(u)g(\bm{u}) (Eq. 13), the second term of this equation equals nalog(2cb)n_{a}\log(2c_{b}). Since this term does not depend on ϕ\phi, and therefore does not contribute any gradients, we will simply ignore it. The entropy of the Beta distribution q(u)q(\bm{u}) can be computed analytically (Michalowicz et al., (2013), p.63).