The Option-Critic Architecture

Pierre-Luc Bacon, Jean Harb, Doina Precup

Introduction

Temporal abstraction allows representing knowledge about courses of action that take place at different time scales. In reinforcement learning, options (?; ?) provide a framework for defining such courses of action and for seamlessly learning and planning with them. Discovering temporal abstractions autonomously has been the subject of extensive research efforts in the last 15 years (?; ?; ?; ?; ?), but approaches that can be used naturally with continuous state and/or action spaces have only recently started to become feasible (?; ?; ?; ?; ?; ?; ?).

The majority of the existing work has focused on finding subgoals (useful states that an agent should reach) and subsequently learning policies to achieve them. This idea has led to interesting methods but ones which are also difficult to scale up given their “combinatorial” flavor. Additionally, learning policies associated with subgoals can be expensive in terms of data and computation time; in the worst case, it can be as expensive as solving the entire task.

We present an alternative view, which blurs the line between the problem of discovering options from that of learning options. Based on the policy gradient theorem (?), we derive new results which enable a gradual learning process of the intra-option policies and termination functions, simultaneously with the policy over them. This approach works naturally with both linear and non-linear function approximators, under discrete or continuous state and action spaces. Existing methods for learning options are considerably slower when learning from a single task: much of the benefit comes from re-using the learned options in similar tasks. In contrast, we show that our approach is capable of successfully learning options within a single task without incurring any slowdown and while still providing benefits for transfer learning.

We start by reviewing background related to the two main ingredients of our work: policy gradient methods and options. We then describe the core ideas of our approach: the intra-option policy and termination gradient theorems. Additional technical details are included in the appendix. We present experimental results showing that our approach learns meaningful temporally extended behaviors in an effective manner. As opposed to other methods, we only need to specify the number of desired options; it is not necessary to have subgoals, extra rewards, demonstrations, multiple problems or any other special accommodations (however, the approach can take advantage of pseudo-reward functions if desired). To our knowledge, this is the first end-to-end approach for learning options that scales to very large domains at comparable efficiency.

Preliminaries and Notation

A Markov Decision Process consists of a set of states S\mathcal{S}, a set of actions A\mathcal{A}, a transition function P:S×A(S)\operatorname{P}:\mathcal{S}\times\mathcal{A}\to(\mathcal{S}\to) and a reward function r:S×ARr:\mathcal{S}\times\mathcal{A}\to\mathbb{R}. For convenience, we develop our ideas assuming discrete state and action sets. However, our results extend to continuous spaces using usual measure-theoretic assumptions (some of our empirical results are in continuous tasks). A (Markovian stationary) policy is a probability distribution over actions conditioned on states, π:S×A\pi:\mathcal{S}\times\mathcal{A}\to. In discounted problems, the value function of a policy π\pi is defined as the expected return: Vπ(s)=Eπ[t=0γtrt+1  |  s0=s]V_{\pi}(s)=\operatorname*{\mathbb{E}}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\;\middle|\;s_{0}=s\right] and its action-value function as Qπ(s,a)=Eπ[t=0γtrt+1  |  s0=s,a0=a]Q_{\pi}(s,a)=\operatorname*{\mathbb{E}}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\;\middle|\;s_{0}=s,a_{0}=a\right], where γ[0,1)\gamma\in[0,1) is the discount factor. A policy π\pi is greedy with respect to a given action-value function QQ if π(s,a)>0\mboxiffa=argmaxaQ(s,a)\pi(s,a)>0\mbox{ iff }a=\operatorname{argmax}\limits_{a^{\prime}}Q(s,a^{\prime}). In a discrete MDP, there is at least one optimal policy which is greedy with respect to its own action-value function.

Policy gradient methods (?; ?) address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies, πθ\pi_{\theta}. The policy gradient theorem (?) provides expressions for the gradient of the average reward and discounted reward objectives with respect to θ\theta. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) s0s_{0}: ρ(θ,s0)=Eπθ[t=0γtrt+1  |  s0]\rho(\theta,s_{0})=\operatorname*{\mathbb{E}}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\;\middle|\;s_{0}\right]. The policy gradient theorem shows that: ρ(θ,s0)θ=sμπθ(s  |  s0)aπθ(as)θQπθ(s,a)\frac{\partial\rho(\theta,s_{0})}{\partial\theta}=\sum_{s}\mu_{\pi_{\theta}}\left(s\;\middle|\;s_{0}\right)\sum_{a}\frac{\partial\pi_{\theta}\left(a|s\right)}{\partial\theta}Q_{\pi_{\theta}}(s,a), where μπθ(s  |  s0)=t=0γtP(st=s  |  s0)\mu_{\pi_{\theta}}\left(s\;\middle|\;s_{0}\right)=\sum_{t=0}^{\infty}\gamma^{t}\operatorname{P}\left(s_{t}=s\;\middle|\;s_{0}\right) is a discounted weighting of the states along the trajectories starting from s0s_{0}. In practice, the policy gradient is estimated from samples along the on-policy stationary distribution. (?) showed that neglecting the discount factor in this stationary distribution makes the usual policy gradient estimator biased. However, correcting for this discrepancy also reduces data efficiency. For simplicity, we build on the framework of (?) and discuss how to extend our results according to (?).

The options framework (?; ?) formalizes the idea of temporally extended actions. A Markovian option ωΩ\omega\in\Omega is a triple (Iω,πω,βω)(\mathcal{I}_{\omega},\pi_{\omega},\beta_{\omega}) in which IωS\mathcal{I}_{\omega}\subseteq\mathcal{S} is an initiation set, πω\pi_{\omega} is an intra-option policy, and βω:S\beta_{\omega}:\mathcal{S}\to is a termination function. We also assume that sS,ωΩ:sIω\forall s\in\mathcal{S},\forall\omega\in\Omega:s\in\mathcal{I}_{\omega} (i.e., all options are available everywhere), an assumption made in the majority of option discovery algorithms. We will discuss how to dispense with this assumption in the final section. (?; ?) show that an MDP endowed with a set of options becomes a Semi-Markov Decision Process (?, chapter 11), which has a corresponding optimal value function over options VΩ(s)V_{\Omega}(s) and option-value function QΩ(s,ω)Q_{\Omega}(s,\omega). Learning and planning algorithms for MDPs have their counterparts in this setting. However, the existence of the underlying MDP offers the possibility of learning about many different options in parallel : this is the idea of intra-option learning, which we leverage in our work.

Learning Options

We adopt a continual perspective on the problem of learning options. At any time, we would like to distill all of the available experience into every component of our system: value function and policy over options, intra-option policies and termination functions. To achieve this goal, we focus on learning option policies and termination functions, assuming they are represented using differentiable parameterized function approximators.

We consider the call-and-return option execution model, in which an agent picks option ω\omega according to its policy over options πΩ\pi_{\Omega} , then follows the intra-option policy πω\pi_{\omega} until termination (as dictated by βω\beta_{\omega}), at which point this procedure is repeated. Let πω,θ\pi_{\omega,\theta} denote the intra-option policy of option ω\omega parametrized by θ\theta and βω,ϑ\beta_{\omega,\vartheta}, the termination function of ω\omega parameterized by ϑ\vartheta. We present two new results for learning options, obtained using as blueprint the policy gradient theorem (?). Both results are derived under the assumption that the goal is to learn options that maximize the expected return in the current task. However, if one wanted to add extra information to the objective function, this could readily be done so long as it comes in the form of an additive differentiable function.

Suppose we aim to optimize directly the discounted return, expected over all the trajectories starting at a designated state s0s_{0} and option ω0\omega_{0}, then: ρ(Ω,θ,ϑ,s0,ω0)=EΩ,θ,ω[t=0γtrt+1  |  s0,ω0]\rho(\Omega,\theta,\vartheta,s_{0},\omega_{0})=\operatorname*{\mathbb{E}}_{\Omega,\theta,\omega}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\;\middle|\;s_{0},\omega_{0}\right]. Note that this return depends on the policy over options, as well as the parameters of the option policies and termination functions. We will take gradients of this objective with respect to θ\theta and ϑ\vartheta. In order to do this, we will manipulate equations similar to those used in intra-option learning (?, section 8). Specifically, the definition of the option-value function can be written as:

where QU:S×Ω×ARQ_{U}:\mathcal{S}\times\Omega\times\mathcal{A}\to\mathbb{R} is the value of executing an action in the context of a state-option pair:

𝑟𝑠𝑎𝛾subscriptsuperscript𝑠′Psuperscript𝑠′𝑠𝑎𝑈𝜔superscript𝑠′\displaystyle=r(s,a)+\gamma\sum_{s^{\prime}}\operatorname{P}\left(s^{\prime}\;\middle|\;s,a\right)U(\omega,s^{\prime})\enspace. (2) Note that the (s,ω)(s,\omega) pairs lead to an augmented state space, cf. (?). However, we will not work explicitly with this space; it is used only to simplify the derivation. The function U:Ω×SReU:\Omega\times\mathcal{S}\rightarrow\mathbb{Re} is called the option-value function upon arrival, (?, equation 20). The value of executing ω\omega upon entering a state ss^{\prime} is given by:

1subscript𝛽𝜔italic-ϑsuperscript𝑠′subscript𝑄Ωsuperscript𝑠′𝜔subscript𝛽𝜔italic-ϑsuperscript𝑠′subscript𝑉Ωsuperscript𝑠′\displaystyle=(1-\beta_{\omega,\vartheta}(s^{\prime}))Q_{\Omega}(s^{\prime},\omega)+\ \beta_{\omega,\vartheta}(s^{\prime})V_{\Omega}(s^{\prime}) (3) Note that QUQ_{U} and UU both depend on θ\theta and ϑ\vartheta, but we do not include these in the notation for clarity. The last ingredient required to derive policy gradients is the Markov chain along which the performance measure is estimated. The natural approach is to consider the chain defined in the augmented state space, because state-option pairs now play the role of regular states in a usual Markov chain. If option ωt\omega_{t} has been initiated or is executing at time tt in state sts_{t}, then the probability of transitioning to (st+1,ωt+1)(s_{t+1},\omega_{t+1}) in one step is:

Clearly, the process given by (4) is homogeneous. Under mild conditions, and with options available everywhere, it is in fact ergodic, and a unique stationary distribution over state-option pairs exists.

We will now compute the gradient of the expected discounted return with respect to the parameters θ\theta of the intra-option policies, assuming that they are stochastic and differentiable. From (1 , 2), it follows that:

subscript𝑄Ω𝑠𝜔𝜃\displaystyle\frac{\partial Q_{\Omega}(s,\omega)}{\partial\theta} = (aπω,θ(a  |  s)θQU(s,ω,a))\displaystyle=\ \left(\sum_{a}\frac{\partial\pi_{\omega,\theta}\left(a\;\middle|\;s\right)}{\partial\theta}Q_{U}(s,\omega,a)\right) +aπω,θ(a  |  s)sγP(s  |  s,a)U(ω,s)θ.\displaystyle+\sum_{a}\pi_{\omega,\theta}\left(a\;\middle|\;s\right)\sum_{s^{\prime}}\gamma\operatorname{P}\left(s^{\prime}\;\middle|\;s,a\right)\frac{\partial U(\omega,s^{\prime})}{\partial\theta}. We can further expand the right hand side using (3) and (4), which yields the following theorem:

Given a set of Markov options with stochastic intra-option policies differentiable in their parameters θ\theta, the gradient of the expected discounted return with respect to θ\theta and initial condition (s0,ω0)(s_{0},\omega_{0}) is:

where μΩ(s,ω  |  s0,ω0)\mu_{\Omega}\left(s,\omega\;\middle|\;s_{0},\omega_{0}\right) is a discounted weighting of state-option pairs along trajectories starting from (s0,ω0)(s_{0},\omega_{0}): μΩ(s,ω  |  s0,ω0)=t=0γtP(st=s,ωt=ω  |  s0,ω0)\mu_{\Omega}\left(s,\omega\;\middle|\;s_{0},\omega_{0}\right)=\sum_{t=0}^{\infty}\gamma^{t}\operatorname{P}\left(s_{t}=s,\omega_{t}=\omega\;\middle|\;s_{0},\omega_{0}\right).

The proof is in the appendix. This gradient describes the effect of a local change at the primitive level on the global expected discounted return. In contrast, subgoal or pseudo-reward methods assume the objective of an option is simply to optimize its own reward function, ignoring how a proposed change would propagate in the overall objective.

We now turn our attention to computing gradients for the termination functions, assumed this time to be stochastic and differentiable in ϑ\vartheta. From (1, 2, 3), we have:

subscript𝑄Ω𝑠𝜔italic-ϑ\displaystyle\frac{\partial Q_{\Omega}(s,\omega)}{\partial\vartheta} =aπω,θ(a  |  s)sγP(s  |  s,a)U(ω,s)ϑ.\displaystyle=\sum_{a}\pi_{\omega,\theta}\left(a\;\middle|\;s\right)\sum_{s^{\prime}}\gamma\operatorname{P}\left(s^{\prime}\;\middle|\;s,a\right)\frac{\partial U(\omega,s^{\prime})}{\partial\vartheta}. Hence, the key quantity is the gradient of UU. This is a natural consequence of the call-and-return execution, in which the “goodness” of termination functions can only be evaluated upon entering the next state. The relevant gradient can be further expanded as:

𝑈𝜔superscript𝑠′italic-ϑ\displaystyle\frac{\partial U(\omega,s^{\prime})}{\partial\vartheta} =βω,ϑ(s)ϑAΩ(s,ω)+\displaystyle=-\frac{\partial\beta_{\omega,\vartheta}(s^{\prime})}{\partial\vartheta}A_{\Omega}(s^{\prime},\omega)\,+ γωsP(s,ω  |  s,ω)U(ω,s)ϑ,\displaystyle\gamma\sum_{\omega^{\prime}}\sum_{s^{\prime\prime}}\operatorname{P}\left(s^{\prime\prime},\omega^{\prime}\;\middle|\;s^{\prime},\omega\right)\frac{\partial U(\omega^{\prime},s^{\prime\prime})}{\partial\vartheta}\enspace, (5) where AΩA_{\Omega} is the advantage function (?) over options AΩ(s,ω)=QΩ(s,ω)VΩ(s)A_{\Omega}(s^{\prime},\omega)=Q_{\Omega}(s^{\prime},\omega)-V_{\Omega}(s^{\prime}). Expanding U(ω,s)ϑ\frac{\partial U(\omega^{\prime},s^{\prime\prime})}{\partial\vartheta} recursively leads to a similar form as in theorem (1) but where the weighting of state-option pairs is now according to a Markov chain shifted by one time step: μΩ(st+1,ωt  |  st,ωt1)\mu_{\Omega}\left(s_{t+1},\omega_{t}\;\middle|\;s_{t},\omega_{t-1}\right) (details are in the appendix).

Given a set of Markov options with stochastic termination functions differentiable in their parameters ϑ\vartheta, the gradient of the expected discounted return objective with respect to ϑ\vartheta and the initial condition (s1,ω0)(s_{1},\omega_{0}) is:

where μΩ(s,ω  |  s1,ω0)\mu_{\Omega}\left(s^{\prime},\omega\;\middle|\;s_{1},\omega_{0}\right) is a discounted weighting of state-option pairs from (s1,ω0)(s_{1},\omega_{0}): μΩ(s,ω  |  s1,ω0)=t=0γtP(st+1=s,ωt=ω  |  s1,ω0)\mu_{\Omega}\left(s,\omega\;\middle|\;s_{1},\omega_{0}\right)=\sum_{t=0}^{\infty}\gamma^{t}\operatorname{P}\left(s_{t+1}=s,\omega_{t}=\omega\;\middle|\;s_{1},\omega_{0}\right).

The advantage function often appears in policy gradient methods (?) when forming a baseline to reduce the variance in the gradient estimates. Its presence in that context has to do mostly with algorithm design. It is interesting that in our case, it follows as a direct consequence of the derivation and gives the theorem an intuitive interpretation: when the option choice is suboptimal with respect to the expected value over all options, the advantage function is negative and it drives the gradient corrections up, which increases the odds of terminating. After termination, the agent has the opportunity to pick a better option using πΩ\pi_{\Omega}. A similar idea also underlies the interrupting execution model of options (?) in which termination is forced whenever the value of QΩ(s,ω)Q_{\Omega}(s^{\prime},\omega) for the current option ω\omega is less than VΩ(s)V_{\Omega}(s^{\prime}). (?) recently studied interrupting options through the lens of an interrupting Bellman Operator in a value-iteration setting. The termination gradient theorem can be interpreted as providing a gradient-based interrupting Bellman operator.

Algorithms and Architecture

Based on theorems 1 and 2, we can now design a stochastic gradient descent algorithm for learning options. Using a two-timescale framework (?), we propose to learn the values at a fast timescale while updating the intra-option policies and termination functions at a slower rate.

We refer to the resulting system as an option-critic architecture, in reference to the actor-critic architectures (?). The intra-option policies, termination functions and policy over options belong to the actor part of the system while the critic consists of QUQ_{U} and AΩA_{\Omega}. The option-critic architecture does not prescribe how to obtain πΩ\pi_{\Omega} since a variety of existing approaches would apply: using policy gradient methods at the SMDP level, with a planner over the options models, or using temporal difference updates. If πΩ\pi_{\Omega} is the greedy policy over options, it follows from (2) that the corresponding one-step off-policy update target gt(1)g_{t}^{(1)} is:

𝑡1\displaystyle g_{t}^{(1)}=r_{t+1}+ \displaystyle\gamma\Big{(}(1-\beta_{\omega_{t},\vartheta}(s_{t+1}))\sum_{a}\ \pi_{\omega_{t},\theta}\left(a\;\middle|\;s_{t+1}\right)Q_{U}(s_{t+1},\omega_{t},a) \displaystyle+\beta_{\omega_{t},\vartheta}(s_{t+1})\max_{\omega}\sum_{a}\pi_{\omega,\theta}\left(a\;\middle|\;s_{t+1}\right)Q_{U}(s_{t+1},\omega,a)\Big{)}\enspace, which is also the update target of the intra-option Q-learning algorithm of (?). A prototypical implementation of option-critic which uses intra-option Q-learning is shown in Algorithm 1. The tabular setting is assumed only for clarity of presentation. We write α,αθ\alpha,\alpha_{\theta} and αϑ\alpha_{\vartheta} for the learning rates of the critic, intra-option policies and termination functions respectively.

𝛿𝛾1subscript𝛽𝜔italic-ϑsuperscript𝑠′subscript𝑄Ωsuperscript𝑠′𝜔𝛾subscript𝛽𝜔italic-ϑsuperscript𝑠′subscript¯𝜔subscript𝑄Ωsuperscript𝑠′¯𝜔\delta\leftarrow\delta+\gamma(1-\beta_{\omega,\vartheta}(s^{\prime}))Q_{\Omega}(s^{\prime},\omega)+\gamma\beta_{\omega,\vartheta}(s^{\prime})\max\limits_{\bar{\omega}}Q_{\Omega}(s^{\prime},\bar{\omega}) end if QU(s,ω,a)QU(s,ω,a)+αδQ_{U}(s,\omega,a)\leftarrow Q_{U}(s,\omega,a)+\alpha\delta 2. Options improvement: θθ+αθlogπω,θ(a  |  s)θQU(s,ω,a)\theta\leftarrow\theta+\alpha_{\theta}\frac{\partial\log\pi_{\omega,\theta}\left(a\;\middle|\;s\right)}{\partial\theta}Q_{U}(s,\omega,a) ϑϑαϑβω,ϑ(s)ϑ(QΩ(s,ω)VΩ(s))  \vartheta\leftarrow\vartheta-\alpha_{\vartheta}\frac{\partial\beta_{\omega,\vartheta}(s^{\prime})}{\partial\vartheta}\left(Q_{\Omega}(s^{\prime},\omega)-V_{\Omega}(s^{\prime})\right)\; if βω,ϑ\beta_{\omega,\vartheta} terminates in ss^{\prime} then choose new ω\omega according to ϵ-soft(πΩ(s))\epsilon\text{-soft}(\pi_{\Omega}(s^{\prime})) sss\leftarrow s^{\prime} until ss^{\prime} is terminal Algorithm 1 Option-critic with tabular intra-option Q-learning Learning QUQ_{U} in addition to QΩQ_{\Omega} is computationally wasteful both in terms of the number of parameters and samples. A practical solution is to only learn QΩQ_{\Omega} and derive an estimate of QUQ_{U} from it. Because QUQ_{U} is an expectation over next states, QU(s,ω,a)=EsP[r(s,a)+γU(ω,s)  |  s,ω,a]Q_{U}(s,\omega,a)=\operatorname*{\mathbb{E}}_{s^{\prime}\sim\operatorname{P}}\left[r(s,a)+\gamma U(\omega,s^{\prime})\;\middle|\;s,\omega,a\right], it follows that gt(1)g_{t}^{(1)} is an appropriate estimator. We chose this approach for our experiment with deep neural networks in the Arcade Learning Environment.

Experiments

We first consider a navigation task in the four-rooms domain (?). Our goal is to evaluate the ability of a set of options learned fully autonomously to recover from a sudden change in the environment. (?) presented a similar experiment for a set of pre-specified options; the options in our results have not been specified a priori.

Initially the goal is located in the east doorway and the initial state is drawn uniformly from all the other cells. After 1000 episodes, the goal moves to a random location in the lower right room. Primitive movements can fail with probability 1/31/3, in which case the agent transitions randomly to one of the empty adjacent cells. The discount factor was 0.990.99, and the reward was +1+1 at the goal and otherwise. We chose to parametrize the intra-option policies with Boltzmann distributions and the terminations with sigmoid functions. The policy over options was learned using intra-option Q-learning. We also implemented primitive actor-critic (denoted AC-PG) using a Boltzmann policy. We also compared option-critic to a primitive SARSA agent using Boltzmann exploration and no eligibility traces. For all Boltzmann policies, we set the temperature parameter to 0.0010.001. All the weights were initialized to zero.

As can be seen in Figure 2, when the goal suddenly changes, the option-critic agent recovers faster. Furthermore, the initial set of options is learned from scratch at a rate comparable to primitive methods. Despite the simplicity of the domain, we are not aware of other methods which could have solved this task without incurring a cost much larger than when using primitive actions alone (?; ?).

In the two temporally extended settings, with 4 options and 8 options, termination events are more likely to occur near the doorways (Figure 3), agreeing with the intuition that they would be good subgoals. As opposed to (?), we did not encode this knowledge ourselves but simply let the agents find options that would maximize the expected discounted return.

In the Pinball domain (?), a ball must be guided through a maze of arbitrarily shaped polygons to a designated target location. The state space is continuous over the position and velocity of the ball in the xx-yy plane. At every step, the agent must choose among five discrete primitive actions: move the ball faster or slower, in the vertical or horizontal direction, or take the null action. Collisions with obstacles are elastic and can be used to the advantage of the agent. In this domain, a drag coefficient of 0.9950.995 effectively stops ball movements after a finite number of steps when the null action is chosen repeatedly. Each thrust action incurs a penalty of 5-5 while taking no action costs 1-1. The episode terminates with +10000+10000 reward when the agent reaches the target. We interrupted any episode taking more than 1000010000 steps and set the discount factor to 0.990.99.

We used intra-option Q-learning in the critic with linear function approximation over Fourier bases (?) of order 3. We experimented with 2, 3 or 4 options. We used Boltzmann policies for the intra-option policies and linear-sigmoid functions for the termination functions. The learning rates were set to 0.010.01 for the critic and 0.0010.001 for both the intra and termination gradients. We used an epsilon-greedy policy over options with ϵ=0.01\epsilon=0.01.

In (?), an option can only be used and updated after a gestation period of 10 episodes. As learning is fully integrated in option-critic, by 40 episodes a near optimal set of options had already been learned in all settings. From a qualitative point of view, the options exhibit temporal extension and specialization (fig. 4). We also observed that across many successful trajectories the red option would consistently be used in the vicinity of the goal.

Arcade Learning Environment

We applied the option-critic architecture in the Arcade Learning Environment (ALE) (?) using a deep neural network to approximate the critic and represent the intra-option policies and termination functions. We used the same configuration as (?) for the first 3 convolutional layers of the network. We used 3232 convolutional filters of size 8×88\times 8 and stride of 44 in the first layer, 6464 filters of size 4×44\times 4 with a stride of 22 in the second and 6464 3×33\times 3 filters with a stride of 11 in the third layer. We then fed the output of the third layer into a dense shared layer of 512512 neurons, as depicted in Figure 6. We fixed the learning rate for the intra-option policies and termination gradient to 0.000250.00025 and used RMSProp for the critic.

We represented the intra-option policies as linear-softmax of the fourth (dense) layer, so as to output a probability distribution over actions conditioned on the current observation. The termination functions were similarly defined using sigmoid functions, with one output neuron per termination.

The critic network was trained using intra-option Q-learning with experience replay. Option policies and terminations were updated on-line. We used an ϵ\epsilon-greedy policy over options with ϵ=0.05\epsilon=0.05 during the test phase (?).

As a consequence of optimizing for the return, the termination gradient tends to shrink options over time. This is expected since in theory primitive actions are sufficient for solving any MDP. We tackled this issue by adding a small ξ=0.01\xi=0.01 term to the advantage function, used by the termination gradient: AΩ(s,ω)+ξ=QΩ(s,ω)VΩ(s)+ξA_{\Omega}(s,\omega)+\xi=Q_{\Omega}(s,\omega)-V_{\Omega}(s)+\xi. This term has a regularization effect, by imposing an ξ\xi-margin between the value estimate of an option and that of the “optimal” one reflected in VΩV_{\Omega}. This makes the advantage function positive if the value of an option is near the optimal one, thereby stretching it. A similar regularizer was proposed in (?).

As in (?), we observed that the intra-option policies would quickly become deterministic. This problem seems to pertain to the use of policy gradient methods with deep neural networks in general, and not from option-critic itself. We applied the regularizer prescribed by (?), by penalizing for low-entropy intra-option policies.

Finally, the baseline QΩQ_{\Omega} was added to the intra-option policy gradient estimator to reduce its variance. This change provided substantial improvements (?) in the quality of the intra-option policy distributions and the overall agent performance as explained in Figure 7.

We evaluated option-critic in Asterisk, Ms. Pacman, Seaquest and Zaxxon. For comparison, we allowed the system to learn for the same number of episodes as (?) and fixed the parameters to the same values in all four domains. Despite having more parameters to learn, option-critic was capable of learning options that would achieve the goal in all games, from the ground up, within 200 episodes (Figure 8). In Asterisk, Seaquest and Zaxxon, option-critic surpassed the performance of the original DQN architecture based on primitive actions. The eight options learned in each game are learned fully end-to-end, in tandem with the feature representation, with no prior specification of a subgoal or pseudo-reward structure.

The solution found by option-critic was easy to interpret in the game of Seaquest when learning with only two options. We found that each option specialized in a behavior sequence which would include either the up or the down button. Figure 9 shows a typical transition from one option to the other, first going upward with option then switching to option 11 downward. Options with a similar structure were also found in this game by (?) using an option discovery algorithm based on graph partitioning.

Related Work

As option discovery has received a lot of attention recently, we now discuss in more detail the place of our approach with respect to others. (?) used a gradient-based approach for improving only the termination function of semi-Markov options; termination was modeled by a logistic distribution over a cumulative measure of the features observed since initiation. (?) also built on policy gradient methods by constructing explicitly the augmented state space and treating stopping events as additional control actions. In contrast, we do not need to construct this (very large) space directly. (?) dynamically chained options into longer temporal sequences by relying on compositionality properties. Earlier work on linear options (?) also used compositionality to plan using linear expectation models for options. Our approach also relies on the Bellman equations and compositionality, but in conjunction with policy gradient methods.

Several very recent papers also attempt to formulate option discovery as an optimization problem with solutions that are compatible with function approximation. (?) learn return-optimizing options by treating the termination functions as hidden variables, and using EM to learn them. (?) consider the problem of learning options that have open-loop intra-option policies, also called macro-actions. As in classical planning, action sequences that are more frequent are cached. A mapping from states to action sequences is learned along with a commitment module, which triggers re-planning when necessary. In contrast, we use closed-loop policies throughout, which are reactive to state information and can provide better solutions. (?) propose a gradient-based option learning algorithm, assuming a particular structure for the initiation sets and termination functions. Under this framework, exactly one option is active in any partition of the state space. (?) use the DQN framework to implement a gradient-based option learner, which uses intrinsic rewards to learn the internal policies of options, and extrinsic rewards to learn the policy over options. As opposed to our framework, descriptions of the subgoals are given as inputs to the option learners. Option-critic is conceptually general and does not require intrinsic motivation for learning the options.

Discussion

We developed a general gradient-based approach for learning simultaneously the intra-option policies and termination functions, as well as the policy over options, in order to optimize a performance objective for the task at hand. Our ALE experiments demonstrate successful end-to-end learning of options in the presence of nonlinear function approximation. As noted, our approach only requires specifying the number of options. However, if one wanted to use additional pseudo-rewards, the option-critic framework would easily accommodate it. In this case, the internal policies and termination function gradients would simply need to be taken with respect to the pseudo-rewards instead of the task reward. A simple instance of this idea, which we used in some of the experiments, is to use additional rewards to encourage options that are indeed temporally extended by adding a penalty whenever a switching event occurs. Our approach can work seamlessly with any other heuristic for biasing the set of options towards some desirable property (e.g. compositionality or sparsity), as long as it can be expressed as an additive reward structure. However, as seen in the results, such biasing is not necessary to produce good results.

The option-critic architecture relies on the policy gradient theorem, and as discussed in (?), the gradient estimators can be biased in the discounted case. By introducing factors of the form γti=1t(1βi)\gamma^{t}\prod_{i=1}^{t}(1-\beta_{i}) in our updates (?, eq (3)), it would be possible to obtain unbiased estimates. However, we do not recommend this approach since the sample complexity of the unbiased estimators is generally too high and the biased estimators performed well in our experiments.

Perhaps the biggest remaining limitation of our work is the assumption that all options apply everywhere. In the case of function approximation, a natural extension to initiation sets is to use a classifier over features, or some other form of function approximation. As a result, determining which options are allowed may have similar cost to evaluating a policy over options (unlike in the tabular setting, where options with sparse initiation sets lead to faster decisions). This is akin to eligibility traces, which are more expensive than using no trace in the tabular case, but have the same complexity with function approximation. If initiation sets are to be learned, the main constraint that needs to be added is that the options and the policy over them lead to an ergodic chain in the augmented state-option space. This can be expressed as a flow condition that links initiation sets with terminations. The precise description of this condition, as well as sparsity regularization for initiation sets, is left for future work.

Acknowledgements

The authors gratefully acknowledge financial support for this work by the National Science and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Quebec - Nature et Technologies (FRQNT).

Appendix

If ωt\omega_{t} has been initiated or is executing at time tt, then the discounted probability of transitioning to (st+1,ωt+1)(s_{t+1},\omega_{t+1}) is:

When conditioning the process from (st,ωt1)(s_{t},\omega_{t-1}), the discounted probability of transitioning to st+1,ωts_{t+1},\omega_{t} is:

More generally, the kk-steps discounted probabilities can be expressed recursively as follows:

Proof of the Intra-Option Policy Gradient Theorem

Taking the gradient of the option-value function:

𝑈𝜔superscript𝑠′𝜃absent\displaystyle\frac{\partial U(\omega,s^{\prime})}{\partial\theta}= (1βω,ϑ(s))QΩ(s,ω)θ+βω,ϑ(s)VΩ(s)θ\displaystyle(1-\beta_{\omega,\vartheta}(s^{\prime}))\frac{\partial Q_{\Omega}(s^{\prime},\omega)}{\partial\theta}+\beta_{\omega,\vartheta}(s^{\prime})\frac{\partial V_{\Omega}(s^{\prime})}{\partial\theta} =(1βω,ϑ(s))QΩ(s,ω)θ+\displaystyle=(1-\beta_{\omega,\vartheta}(s^{\prime}))\frac{\partial Q_{\Omega}(s^{\prime},\omega)}{\partial\theta}+ βω,ϑ(s)ωπΩ(ω  |  s)QΩ(s,ω)θ\displaystyle\hskip 40.00006pt\beta_{\omega,\vartheta}(s^{\prime})\sum_{\omega^{\prime}}\pi_{\Omega}\left(\omega^{\prime}\;\middle|\;s^{\prime}\right)\frac{\partial Q_{\Omega}(s^{\prime},\omega^{\prime})}{\partial\theta} \displaystyle=\sum_{\omega^{\prime}}\big{(}(1-\beta_{\omega,\vartheta}(s^{\prime}))\mathbf{1}_{\omega^{\prime}=\omega}+ \displaystyle\hskip 40.00006pt\beta_{\omega,\vartheta}(s^{\prime})\pi_{\Omega}\left(\omega^{\prime}\;\middle|\;s^{\prime}\right)\big{)}\ \frac{\partial Q_{\Omega}(s^{\prime},\omega^{\prime})}{\partial\theta}\enspace. (7) where (7) follows from the assumption that θ\theta only appears in the intra-option policies. Substituting (7) into (6) yields a recursion which, using the previous remarks about augmented process can be transformed into:

subscript𝑄Ωsuperscript𝑠′superscript𝜔′𝜃\displaystyle\hskip 40.00006pt\sum_{s^{\prime}}\sum_{\omega^{\prime}}\operatorname{P}_{\gamma}^{(1)}\left(s^{\prime},\omega^{\prime}\;\middle|\;s,\omega\right)\frac{\partial Q_{\Omega}(s^{\prime},\omega^{\prime})}{\partial\theta} =k=0s,ωPγ(k)(s,ωs,ω)aπω,θ(as)θQU(s,ω,a).\displaystyle=\sum_{k=0}^{\infty}\sum_{s^{\prime},\omega^{\prime}}\operatorname{P}_{\gamma}^{(k)}\left(s^{\prime},\omega^{\prime}|s,\omega\right)\sum_{a}\frac{\partial\pi_{\omega^{\prime},\theta}\left(a|s^{\prime}\right)}{\partial\theta}Q_{U}(s^{\prime},\omega^{\prime},a). The gradient of the expected discounted return with respect to θ\theta is then:

Proof of the Termination Gradient Theorem

The expected sum of discounted rewards starting from (s1,ω0)(s_{1},\omega_{0}) is given by:

1subscript𝛽𝜔italic-ϑsuperscript𝑠′subscript𝑄Ωsuperscript𝑠′𝜔subscript𝛽𝜔italic-ϑsuperscript𝑠′subscript𝑉Ωsuperscript𝑠′\displaystyle U(\omega,s^{\prime})=(1-\beta_{\omega,\vartheta}(s^{\prime}))Q_{\Omega}(s^{\prime},\omega)+\beta_{\omega,\vartheta}(s^{\prime})V_{\Omega}(s^{\prime}) \displaystyle=(1-\beta_{\omega,\vartheta}(s^{\prime}))\sum_{a}\pi_{\omega,\theta}\left(a\;\middle|\;s^{\prime}\right)\Big{(} \displaystyle\hskip 40.00006ptr(s^{\prime},a)+\sum_{s^{\prime\prime}}\gamma\operatorname{P}\left(s^{\prime\prime}\;\middle|\;s^{\prime},a\right)U(\omega,s^{\prime\prime})\Big{)} \displaystyle+\;\beta_{\omega,\vartheta}(s^{\prime})\sum_{\omega^{\prime}}\pi_{\Omega}\left(\omega^{\prime}\;\middle|\;s^{\prime}\right)\sum_{a}\pi_{\omega^{\prime},\theta}\left(a\;\middle|\;s^{\prime}\right)\Big{(} \displaystyle\hskip 40.00006ptr(s^{\prime},a)+\sum_{s^{\prime\prime}}\gamma\operatorname{P}\left(s^{\prime\prime}\;\middle|\;s^{\prime},a\right)U(\omega^{\prime},s^{\prime\prime})\Big{)}\enspace. The gradient of UU is then:

𝑈𝜔superscript𝑠′italic-ϑlimit-fromsubscript𝛽𝜔italic-ϑsuperscript𝑠′italic-ϑsubscript⏟subscript𝑉Ωsuperscript𝑠′subscript𝑄Ωsuperscript𝑠′𝜔subscript𝐴Ωsuperscript𝑠′𝜔\displaystyle\frac{\partial U(\omega,s^{\prime})}{\partial\vartheta}=\frac{\partial\beta_{\omega,\vartheta}(s^{\prime})}{\partial\vartheta}\underbrace{\left(V_{\Omega}(s^{\prime})-Q_{\Omega}(s^{\prime},\omega)\right)}_{-A_{\Omega}(s^{\prime},\omega)}+ (1βω,ϑ(s))aπω,θ(as)sγP(ss,a)U(ω,s)ϑ.\displaystyle(1-\beta_{\omega,\vartheta}(s^{\prime}))\sum_{a}\pi_{\omega,\theta}\left(a|s^{\prime}\right)\sum_{s^{\prime\prime}}\gamma\operatorname{P}\left(s^{\prime\prime}|s^{\prime},a\right)\frac{\partial U(\omega,s^{\prime\prime})}{\partial\vartheta}. Using the structure of the augmented process:

𝑈𝜔superscript𝑠′italic-ϑlimit-fromsubscript𝛽𝜔italic-ϑsuperscript𝑠′italic-ϑsubscript𝐴Ωsuperscript𝑠′𝜔\displaystyle\frac{\partial U(\omega,s^{\prime})}{\partial\vartheta}=-\frac{\partial\beta_{\omega,\vartheta}(s^{\prime})}{\partial\vartheta}A_{\Omega}(s^{\prime},\omega)+ ωsPγ(1)(s,ω  |  s,ω)U(ω,s)ϑ\displaystyle\hskip 40.00006pt\sum_{\omega^{\prime}}\sum_{s^{\prime\prime}}\operatorname{P}_{\gamma}^{(1)}\left(s^{\prime\prime},\omega^{\prime}\;\middle|\;s^{\prime},\omega\right)\frac{\partial U(\omega^{\prime},s^{\prime\prime})}{\partial\vartheta} =ω,sk=0Pγ(k)(s,ω  |  s,ω)βω,ϑ(s)ϑAΩ(s,ω).\displaystyle=-\sum_{\omega^{\prime},s^{\prime\prime}}\sum_{k=0}^{\infty}\operatorname{P}_{\gamma}^{(k)}\left(s^{\prime\prime},\omega^{\prime}\;\middle|\;s^{\prime},\omega\right)\frac{\partial\beta_{\omega^{\prime},\vartheta}(s^{\prime\prime})}{\partial\vartheta}A_{\Omega}(s^{\prime\prime},\omega^{\prime})\enspace. We finally obtain:

References