Bayesian policy gradient and actor-critic algorithms

Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko

Introduction

Policy gradient (PG) methods are reinforcement learning (RL) algorithms that maintain a parameterized action-selection policy and update the policy parameters by moving them in the direction of an estimate of the gradient of a performance measure. (The term was coined in Sutton00PG, but here we use it more liberally to refer to a whole class of reinforcement learning algorithms.) Early examples of PG algorithms are the class of REINFORCE algorithms (Williams92SS), which are suitable for solving problems in which the goal is to optimize the average reward. (Policy gradient methods were studied in the control community before REINFORCE (see e.g., Dyer70CT, Jacobson70DD, Hasdorff76GO); however, unlike REINFORCE, which is model-free, they were all based on an exact model of the system, i.e., they were model-based.) Subsequent work (e.g., Kimura95RL, Marbach98SM, Baxter01IP) extended these algorithms to the cases of infinite-horizon Markov decision processes (MDPs) and partially observable MDPs (POMDPs), while also providing much needed theoretical analysis. (The pioneering work of Gullapalli and colleagues in the early 1990s (Gullapalli90SR, Gullapalli92LC, Gullapalli94AR) in applying policy gradient methods to robot learning problems played an important role in popularizing this class of algorithms. In fact, policy gradient methods have repeatedly proven to be among the most effective classes of algorithms for learning in robots.) However, both the theoretical results and empirical evaluations have highlighted a major shortcoming of these algorithms, namely, the high variance of the gradient estimates. This problem may be traced to the fact that in most interesting cases, the time-average of the observed rewards is a high-variance (although unbiased) estimator of the true average reward, resulting in the sample-inefficiency of these algorithms.

One solution proposed for this problem was to use an artificial discount factor in these algorithms (Marbach98SM, Baxter01IP); however, this creates another problem by introducing bias into the gradient estimates. Another solution, which does not involve biasing the gradient estimate, is to subtract a reinforcement baseline from the average reward estimate in the updates of PG algorithms (e.g., Williams92SS, Marbach98SM, Sutton00PG). In Williams92SS an average reward baseline was used, and in Sutton00PG it was conjectured that an approximate value function would be a good choice for a state-dependent baseline. However, it was shown in Weaver01OR and Greensmith04VR, perhaps surprisingly, that the mean reward is in general not the optimal constant baseline, and that the true value function is generally not the optimal state-dependent baseline.

A different approach for speeding up PG algorithms was proposed by Kakade02NP and refined and extended by Bagnell03CP and Peters03RL. The idea is to replace the policy gradient estimate with an estimate of the so-called natural policy gradient. This is motivated by the requirement that the policy updates should be invariant to bijective transformations of the parametrization. Put more simply, a change in the way the policy is parametrized should not influence the result of the policy update. In terms of the policy update rule, the move to the natural gradient amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy. In empirical evaluations, natural PG has been shown to significantly outperform conventional PG (e.g., Kakade02NP, Bagnell03CP, Peters03RL, Peters08RL).

Another approach for reducing the variance of policy gradient estimates, and as a result making the search in the policy-space more efficient and reliable, is to use an explicit representation for the value function of the policy. This class of PG algorithms is called actor-critic algorithms. Actor-critic (AC) algorithms comprise a family of RL methods that maintain two distinct algorithmic components: an actor, whose role is to maintain and update an action-selection policy; and a critic, whose role is to estimate the value function associated with the actor’s policy. Thus, the critic addresses the problem of prediction, whereas the actor is concerned with control. Actor-critic methods were among the earliest to be investigated in RL (Barto83NE, Sutton84TC). They were largely supplanted in the 1990s by methods, such as SARSA (Rummery94OQ), that estimate action-value functions and use them directly to select actions without maintaining an explicit representation of the policy. This approach was appealing because of its simplicity, but when combined with function approximation was found to be unreliable, often failing to converge. These problems led to renewed interest in PG methods.

Actor-critic algorithms (e.g., Sutton00PG, Konda00AA, Peters05NA, Bhatnagar07IN) borrow elements from these two families of RL algorithms. Like value-function based methods, a critic maintains a value function estimate, while an actor maintains a separately parameterized stochastic action-selection policy, as in policy based methods. While the role of the actor is to select actions, the role of the critic is to evaluate the performance of the actor. This evaluation is used to provide the actor with a feedback signal that allows it to improve its performance. The actor typically updates its policy along an estimate of the gradient (or natural gradient) of some measure of performance with respect to the policy parameters. When the representations used for the actor and the critic are compatible, in the sense explained in Sutton00PG and Konda00AA, the resulting AC algorithm is simple, elegant, and provably convergent (under appropriate conditions) to a local maximum of the performance measure used by the critic, plus a measure of the temporal difference (TD) error inherent in the function approximation scheme (Konda00AA, Bhatnagar09NA).

Existing AC algorithms are based on parametric critics that are updated to optimize frequentist fitness criteria. By “frequentist” we mean algorithms that return a point estimate of the value function, rather than a complete posterior distribution computed using Bayes’ rule. A Bayesian class of critics based on Gaussian processes (GPs) has been proposed by Engel03BM, Engel05RL, called Gaussian process temporal difference (GPTD). By their Bayesian nature, these algorithms return a full posterior distribution over value functions. Moreover, while these algorithms may be used to learn a parametric representation for the posterior, they are generally capable of searching for value functions in an infinite-dimensional Hilbert space of functions, resulting in a non-parametric posterior.

Both conventional and natural policy gradient and actor-critic methods rely on Monte-Carlo (MC) techniques in estimating the gradient of the performance measure. MC estimation is a frequentist procedure, and as such violates the likelihood principle (Berger84LP). (The likelihood principle states that in a parametric statistical model, all the information about a data sample that is required for inferring the model parameters is contained in the likelihood function of that sample.) Moreover, although MC estimates are unbiased, they tend to suffer from high variance, or alternatively, require excessive sample sizes (see Ohagan87MC for a discussion). In the case of policy gradient estimation this is exacerbated by the fact that consistent policy improvement requires multiple gradient estimation steps.

In Ohagan91BQ a Bayesian alternative to MC estimation is proposed. (Ohagan91BQ mentions that this approach may be traced as far back as Poincare1896CP.) The idea is to model integrals of the form $\int f(x)g(x)dx$ as GPs. This is done by treating the first term $f$ in the integrand as a random function, the randomness of which reflects our subjective uncertainty concerning its true identity. This allows us to incorporate our prior knowledge of $f$ into its prior distribution. Observing (possibly noisy) samples of $f$ at a set of points $\{x_{1},\ldots,x_{M}\}$ allows us to employ Bayes’ rule to compute a posterior distribution of $f$ conditioned on these samples. This, in turn, induces a posterior distribution over the value of the integral.

In this paper, we first propose a Bayesian framework for policy gradient estimation by modeling the gradient as a GP. Our Bayesian policy gradient (BPG) algorithms use GPs to define a prior distribution over the gradient of the expected return, and compute its posterior conditioned on the observed data. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely the gradient covariance, are provided at little extra cost. Additional gains may be attained by learning a transition model of the environment, allowing knowledge transfer between policies. Since our BPG models and algorithms consider complete system trajectories as their basic observable unit, they do not require the dynamics within each trajectory to be of any special form. In particular, it is not necessary for the dynamics to have the Markov property, allowing the resulting algorithms to handle partially observable MDPs, Markov games, and other non-Markovian systems. On the downside, BPG algorithms cannot take advantage of the Markov property when this property is satisfied. To address this issue, we supplement our BPG framework with actor-critic methods and propose an AC algorithm that incorporates GPTD as its critic. However, rather than merely plugging our critic into an existing AC algorithm, we show how the posterior moments returned by the GPTD critic allow us to obtain closed-form expressions for the posterior moments of the policy gradient. This is made possible by utilizing the Fisher kernel (ShaweTaylor04KM) as our prior covariance kernel for the GPTD state-action advantage values. Unlike the BPG methods, the Bayesian actor-critic (BAC) algorithm takes advantage of the Markov property of the system trajectories and uses individual state-action-reward transitions as its basic observable unit. This helps reduce variance in the gradient estimates, resulting in steeper learning curves when compared to BPG and the classic MC approach.

It is important to note that short versions of the two main parts of this paper, Bayesian policy gradient and Bayesian actor-critic, appeared in Ghavamzadeh06BP and Ghavamzadeh07BA, respectively. This paper extends these conference papers in the following ways:

We have added a discussion of using Bayesian quadrature (BQ) to estimate vector-valued integrals. This is directly relevant to this work because the gradient of a policy (the quantity that we are interested in estimating using BQ) is a vector-valued integral whenever the policy parameter vector has more than one component, which is usually the case. This also helps to better see the difference between the two models we propose for BPG. In Model 1, we place a vector-valued Gaussian process (GP) over a component of the gradient integrand, while in Model 2, we put a scalar-valued GP over a different component of the gradient integrand.

We describe the BPG and BAC algorithms in more detail and show how online sparsification is used in these algorithms. Moreover, we show how BPG can be extended to partially observable Markov decision processes (POMDPs) along the same lines that the standard PG algorithms can be used in such problems.

In comparison to Ghavamzadeh06BP, we report more details of the experiments and more experimental results, especially in using the posterior variance (covariance) of the gradient to select the step size for updating the policy parameters.

We include all the proofs in this paper (almost none were reported in the two conference papers), in particular, the proofs of Propositions 3, 4, 5, and 6. These proofs are important, and the proof techniques are novel and should be useful to the community. The importance of these proofs comes from the fact that they show how, with the right choice of GP prior (the one that uses the family of Fisher information kernels), we are able to use BQ and obtain a Bayesian estimate of the gradient integral, even though at first sight everything indicates that BQ cannot be used to estimate this integral.

We apply the BAC algorithm to two new domains: “Mountain Car”, a 2-dimensional continuous state and 1-dimensional discrete action problem, and “Ship Steering”, a 4-dimensional continuous state and 1-dimensional continuous action problem.

Reinforcement Learning, Policy Gradient, and Actor-Critic Methods

Reinforcement learning (RL) (Bertsekas96NP, Sutton98IR) is a term describing a class of learning problems in which an agent (or controller) interacts with a dynamic, stochastic, and incompletely known environment (or plant), with the goal of finding an action-selection strategy, or policy, to optimize some measure of its long-term performance. This interaction is conventionally modeled as a Markov decision process (MDP) (Puterman94MD), or if the environmental state is not completely observable, as a partially observable MDP (POMDP) (Astrom65OC, Smallwood73OC, Kaelbling98PA). In this work we restrict our attention to the discrete-time MDP setting.

In addition, we need to specify the rule according to which the agent selects an action at each possible state. We assume that this rule is stationary, i.e., does not depend explicitly on time. A stationary policy $\mu(\cdot|x)\in{\mathcal{P}}({\mathcal{A}})$ is a probability distribution over actions, conditioned on the current state. An MDP controlled by a policy $\mu$ induces a Markov chain over state-action pairs ${\boldsymbol{z}}_{t}=(x_{t},a_{t})\in{\mathcal{Z}}={\mathcal{X}}\times{\mathcal{A}}$, with a transition probability density $P^{\mu}({\boldsymbol{z}}_{t}|{\boldsymbol{z}}_{t-1})=P(x_{t}|x_{t-1},a_{t-1})\mu(a_{t}|x_{t})$, and an initial state density $P_{0}^{\mu}({\boldsymbol{z}}_{0})=P_{0}(x_{0})\mu(a_{0}|x_{0})$. We generically denote by $\xi=({\boldsymbol{z}}_{0},{\boldsymbol{z}}_{1},\ldots,{\boldsymbol{z}}_{T})\in\Xi,\;T\in\{0,1,\ldots,\infty\}$ a path generated by this Markov chain. Note that $\Xi$ is the set of all possible trajectories that can be generated by the Markov chain induced by the current policy $\mu$. The probability (density) of such a path is given by

$$\Pr(\xi;\mu)=P_{0}^{\mu}({\boldsymbol{z}}_{0})\prod_{t=1}^{T}P^{\mu}({\boldsymbol{z}}_{t}|{\boldsymbol{z}}_{t-1}).\tag{1}$$

We denote by $R(\xi)=\sum_{t=0}^{T-1}\gamma^{t}r(x_{t},a_{t})$ the (possibly discounted, $\gamma\in[0,1]$) cumulative return of the path $\xi$. $R(\xi)$ is a random variable both because the path $\xi$ itself is a random variable, and because, even for a given path, each of the rewards sampled in it may be stochastic. The expected value of $R(\xi)$ for a given path $\xi$ is denoted by $\bar{R}(\xi)$. Finally, we define the expected return of policy $\mu$ as

$$\eta(\mu)=\int_{\Xi}\bar{R}(\xi)\Pr(\xi;\mu)\,d\xi.\tag{2}$$

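The $t$-step state-action occupancy density of policy $\mu$ is given by

$$P_{t}^{\mu}({\boldsymbol{z}}_{t})=\int_{{\mathcal{Z}}^{t}}d{\boldsymbol{z}}_{0}\ldots d{\boldsymbol{z}}_{t-1}\,P_{0}^{\mu}({\boldsymbol{z}}_{0})\prod_{i=1}^{t}P^{\mu}({\boldsymbol{z}}_{i}|{\boldsymbol{z}}_{i-1}).$$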

It can be shown that under certain regularity conditions (Sutton00PG), the expected return of policy $\mu$ may be written in terms of state-action pairs (rather than in terms of trajectories as in Equation 2) as

where $\pi^{\mu}({\boldsymbol{z}})=\sum_{t=0}^{\infty}\gamma^{t}{P_{t}^{\mu}}({\boldsymbol{z}})$ is a discounted weighting of state-action pairs encountered while following policy $\mu$ . Integrating $a$ out of $\pi^{\mu}({\boldsymbol{z}})=\pi^{\mu}(x,a)$ results in the corresponding discounted weighting of states encountered by following policy $\mu$ : $\nu^{\mu}(x)=\int_{{\mathcal{A}}}da\pi^{\mu}(x,a)$ . Unlike $\nu^{\mu}$ and $\pi^{\mu}$ , $(1-\gamma)\nu^{\mu}$ and $(1-\gamma)\pi^{\mu}$ are distributions. They are analogous to the stationary distributions over states and state-action pairs of policy $\mu$ in the undiscounted setting, respectively, since as $\gamma\rightarrow 1$ , they tend to these distributions, if they exist.
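The state-action form can be obtained from the trajectory form by exchanging the sum over time with the expectation. A sketch, assuming $T=\infty$, that $P_{t}^{\mu}$ is the marginal density of ${\boldsymbol{z}}_{t}$ under policy $\mu$, and that $\bar{r}({\boldsymbol{z}})$ denotes the mean of the reward $r({\boldsymbol{z}})$ (a symbol introduced here for illustration):

$$\eta(\mu)={\bf E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r({\boldsymbol{z}}_{t})\Big]=\sum_{t=0}^{\infty}\gamma^{t}\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,P_{t}^{\mu}({\boldsymbol{z}})\,\bar{r}({\boldsymbol{z}})=\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}})\,\bar{r}({\boldsymbol{z}}).$$

The exchange of summation and integration in the last step is where regularity conditions are needed. Repeating the computation with $\bar{r}\equiv 1$ and $\gamma<1$ gives $\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}})=1/(1-\gamma)$, which is why $(1-\gamma)\pi^{\mu}$ integrates to one.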

Our aim is to find a policy $\mu^{*}$ that maximizes the expected return, i.e., $\mu^{*}=\mathop{\rm arg\,max}_{\mu}\eta(\mu)$ . A policy $\mu$ is assessed according to the expected cumulative rewards associated with states $x$ or state-action pairs ${\boldsymbol{z}}$ . For all states $x\in{\mathcal{X}}$ and actions $a\in{\mathcal{A}}$ , the action-value function and the value function of policy $\mu$ are defined as
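A sketch of the standard form of these two definitions in this section's notation, assuming the expectation is taken over paths generated by $\mu$ and over the stochastic rewards, and writing $r({\boldsymbol{z}}_{t})$ for $r(x_{t},a_{t})$:

$$Q^{\mu}({\boldsymbol{z}})={\bf E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r({\boldsymbol{z}}_{t})\,\Big|\,{\boldsymbol{z}}_{0}={\boldsymbol{z}}\Big],\qquad V^{\mu}(x)=\int_{{\mathcal{A}}}da\,\mu(a|x)\,Q^{\mu}(x,a).$$

The second identity states that the value of a state is the action-value averaged over the actions the policy would choose in that state.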

In policy gradient (PG) methods, we define a class of smoothly parameterized stochastic policies $\big\{\mu(\cdot|x;{\boldsymbol{\theta}}),x\in{\mathcal{X}},{\boldsymbol{\theta}}\in\Theta\big\}$. We estimate the gradient of the expected return, defined by Equation 2 (or Equation 3), with respect to the policy parameters ${\boldsymbol{\theta}}$, from the observed system trajectories. We then improve the policy by adjusting the parameters in the direction of the gradient (e.g., Williams92SS, Marbach98SM, Baxter01IP). Since in this setting a policy $\mu$ is represented by its parameters ${\boldsymbol{\theta}}$, policy-dependent functions such as $\eta(\mu)$, $\Pr(\xi;\mu)$, $\pi^{\mu}({\boldsymbol{z}})$, $\nu^{\mu}(x)$, $V^{\mu}(x)$, and $Q^{\mu}({\boldsymbol{z}})$ may be written as $\eta({\boldsymbol{\theta}})$, $\Pr(\xi;{\boldsymbol{\theta}})$, $\pi({\boldsymbol{z}};{\boldsymbol{\theta}})$, $\nu(x;{\boldsymbol{\theta}})$, $V(x;{\boldsymbol{\theta}})$, and $Q({\boldsymbol{z}};{\boldsymbol{\theta}})$, respectively. We make the following assumption.

Assumption 1. For any state-action pair $(x,a)$ and any policy parameter ${\boldsymbol{\theta}}\in\Theta$, the policy $\mu(a|x;{\boldsymbol{\theta}})$ is continuously differentiable in the parameters ${\boldsymbol{\theta}}$.

The score function or likelihood ratio method has become the most prominent technique for gradient estimation from simulation. It was first proposed in the 1960s (Aleksandrov68SO, Rubinstein69SP) for computing performance gradients in i.i.d. (independently and identically distributed) processes, and was then extended to regenerative processes including MDPs by Glynn86SA, Glynn90LR, Reiman86SA, Reiman89SA, Glynn95LR, and to episodic MDPs by Williams92SS. This method estimates the gradient of the expected return with respect to the policy parameters ${\boldsymbol{\theta}}$, defined by Equation 2, using the following equation (throughout the paper, we use the notation $\nabla$ to denote $\nabla_{{\boldsymbol{\theta}}}$, the gradient w.r.t. the policy parameters):

$$\nabla\eta({\boldsymbol{\theta}})=\int\bar{R}(\xi)\frac{\nabla\Pr(\xi;{\boldsymbol{\theta}})}{\Pr(\xi;{\boldsymbol{\theta}})}\Pr(\xi;{\boldsymbol{\theta}})\,d\xi.\tag{4}$$

In Equation 4, the quantity $\frac{\nabla\Pr(\xi;{\boldsymbol{\theta}})}{\Pr(\xi;{\boldsymbol{\theta}})}=\nabla\log\Pr(\xi;{\boldsymbol{\theta}})$ is called the (Fisher) score function or likelihood ratio. Since the initial-state distribution $P_{0}$ and the state-transition distribution $P$ are independent of the policy parameters ${\boldsymbol{\theta}}$, we may write the score function for a path $\xi$ using Equation 1 as

$${\boldsymbol{u}}(\xi;{\boldsymbol{\theta}})=\frac{\nabla\Pr(\xi;{\boldsymbol{\theta}})}{\Pr(\xi;{\boldsymbol{\theta}})}=\nabla\log\Pr(\xi;{\boldsymbol{\theta}})=\sum_{t=0}^{T}\nabla\log\mu(a_{t}|x_{t};{\boldsymbol{\theta}}).$$

(To simplify notation, we omit ${\boldsymbol{u}}$’s dependence on the policy parameters ${\boldsymbol{\theta}}$, and denote ${\boldsymbol{u}}(\xi;{\boldsymbol{\theta}})$ as ${\boldsymbol{u}}(\xi)$ in the sequel.)

Previous work on policy gradient used classical MC to estimate the gradient in Equation 4. These methods generate i.i.d. sample paths $\xi_{1},\ldots,\xi_{M}$ according to $\Pr(\xi;{\boldsymbol{\theta}})$, and estimate the gradient $\nabla\eta({\boldsymbol{\theta}})$ using the MC estimator

$$\widehat{\nabla\eta}({\boldsymbol{\theta}})=\frac{1}{M}\sum_{i=1}^{M}R(\xi_{i})\nabla\log\Pr(\xi_{i};{\boldsymbol{\theta}}).$$

This is an unbiased estimate, and therefore, by the law of large numbers, $\widehat{\nabla\eta}({\boldsymbol{\theta}})\rightarrow\nabla\eta({\boldsymbol{\theta}})$ as $M$ goes to infinity, with probability one.
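The MC estimator can be illustrated on a toy problem in which each path is a single action. The sketch below averages $R(\xi_{i})\nabla\log\Pr(\xi_{i};{\boldsymbol{\theta}})$ over sampled paths and compares the result with the gradient computed in closed form. The softmax policy, the parameter values, the reward means, and all function names are hypothetical choices made for illustration, not part of the method described above.

```python
import numpy as np


def softmax(theta):
    """Action probabilities of a softmax policy with one parameter per action."""
    w = np.exp(theta - theta.max())
    return w / w.sum()


def mc_policy_gradient(returns, scores):
    """Monte-Carlo estimate: the average of R(xi_i) * grad log Pr(xi_i; theta).

    returns: shape (M,), sampled returns R(xi_i)
    scores:  shape (M, n), path scores grad log Pr(xi_i; theta)
    """
    return (returns[:, None] * scores).mean(axis=0)


def bandit_demo(num_paths=200_000, seed=0):
    """Compare the MC estimate with the exact gradient on a toy problem."""
    rng = np.random.default_rng(seed)
    theta = np.array([0.2, -0.1, 0.4])       # hypothetical policy parameters
    mean_reward = np.array([1.0, 0.0, 0.5])  # hypothetical mean rewards
    mu = softmax(theta)

    # Each "path" is a single action drawn from the policy, with a noisy reward.
    actions = rng.choice(theta.size, size=num_paths, p=mu)
    returns = mean_reward[actions] + 0.1 * rng.standard_normal(num_paths)

    # For this policy, grad log mu(a) = e_a - mu (one-hot minus probabilities).
    scores = np.eye(theta.size)[actions] - mu

    estimate = mc_policy_gradient(returns, scores)
    # Exact gradient of eta = sum_a mu(a) * mean_reward(a) w.r.t. theta.
    exact = mu * (mean_reward - mu @ mean_reward)
    return estimate, exact
```

Calling `bandit_demo()` returns the estimate and the exact gradient side by side; lowering `num_paths` makes the high variance of the estimator visible.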

The policy gradient theorem (Marbach98SM, Proposition 1; Sutton00PG, Theorem 1; Konda00AA, Theorem 1) states that the gradient of the expected return, defined by Equation 3, for parameterized policies satisfying Assumption 1 is given by

and thus, for any baseline $b(x)$ , the gradient of the expected return can be written as

The baseline may be chosen in such a way so as to minimize the variance of the gradient estimates (Greensmith04VR).
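In the notation of this section, the theorem and its baseline form can be sketched as follows. This is the standard statement of the theorem, written under the assumption that $\pi({\boldsymbol{z}};{\boldsymbol{\theta}})=\nu(x;{\boldsymbol{\theta}})\mu(a|x;{\boldsymbol{\theta}})$; the sign in front of $b(x)$ is an arbitrary choice here, since the baseline term integrates to zero.

$$\nabla\eta({\boldsymbol{\theta}})=\int dx\,da\;\nu(x;{\boldsymbol{\theta}})\,\nabla\mu(a|x;{\boldsymbol{\theta}})\,Q(x,a;{\boldsymbol{\theta}}).$$

For every state $x$ and every function $b(x)$ that does not depend on the action,

$$\int_{{\mathcal{A}}}da\,\nabla\mu(a|x;{\boldsymbol{\theta}})\,b(x)=b(x)\,\nabla\int_{{\mathcal{A}}}da\,\mu(a|x;{\boldsymbol{\theta}})=b(x)\,\nabla 1=0,$$

so that, using $\nabla\mu=\mu\nabla\log\mu$,

$$\nabla\eta({\boldsymbol{\theta}})=\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\;\pi({\boldsymbol{z}};{\boldsymbol{\theta}})\,\nabla\log\mu(a|x;{\boldsymbol{\theta}})\big[Q({\boldsymbol{z}};{\boldsymbol{\theta}})-b(x)\big].$$

The baseline therefore leaves the expected gradient unchanged and affects only the variance of its sample-based estimates.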

Now consider the actor-critic (AC) framework in which the action-value function for a fixed policy $\mu$ , $Q^{\mu}$ , is approximated by a learned function approximator. If the approximation is sufficiently good, we may hope to use it in place of $Q^{\mu}$ in Equations 7 and 8, and still point roughly in the direction of the true gradient. Sutton00PG and Konda00AA showed that if the approximation $\hat{Q}^{\mu}(\cdot;{\boldsymbol{w}})$ with parameter ${\boldsymbol{w}}$ is compatible, i.e., $\nabla_{\boldsymbol{w}}\hat{Q}^{\mu}(x,a;{\boldsymbol{w}})=\nabla\mathop{\rm log}\mu(a|x;{\boldsymbol{\theta}})$ , and if it minimizes the mean squared error

An approximation for the action-value function, in terms of a linear combination of basis functions, may be written as $\hat{Q}^{\mu}({\boldsymbol{z}};{\boldsymbol{w}})={\boldsymbol{w}}^{\top}{\boldsymbol{\psi}}({\boldsymbol{z}})$ . This approximation is compatible if the ${\boldsymbol{\psi}}$ ’s are compatible with the policy, i.e., ${\boldsymbol{\psi}}({\boldsymbol{z}};{\boldsymbol{\theta}})=\nabla\mathop{\rm log}\mu(a|x;{\boldsymbol{\theta}})$ . Note that compatibility is well defined under Assumption 1. Let ${\mathcal{E}}^{\mu}({\boldsymbol{w}})$ denote the mean squared error

of our compatible linear approximation ${\boldsymbol{w}}^{\top}{\boldsymbol{\psi}}({\boldsymbol{z}})$ and an arbitrary baseline $b(x)$ . Let ${\boldsymbol{w}}^{*}=\mathop{\rm arg\,min}_{\boldsymbol{w}}{\mathcal{E}}^{\mu}({\boldsymbol{w}})$ denote the optimal parameter. It can be shown that the value of ${\boldsymbol{w}}^{*}$ does not depend on the baseline $b(x)$ . As a result, the mean squared-error problems of Equations 9 and 10 have the same solutions (see e.g., Bhatnagar07IN, Bhatnagar09NA). It can also be shown that if the parameter ${\boldsymbol{w}}$ is set to be equal to ${\boldsymbol{w}}^{*}$ , then the resulting mean squared error ${\mathcal{E}}^{\mu}({\boldsymbol{w}}^{*})$ , now treated as a function of the baseline $b(x)$ , is further minimized by setting $b(x)=V^{\mu}(x)$ (Bhatnagar07IN, Bhatnagar09NA). In other words, the variance in the action-value function estimator is minimized if the baseline is chosen to be the value function itself.
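A derivation sketch of why the baseline drops out of ${\boldsymbol{w}}^{*}$, assuming the error criterion is the $\pi^{\mu}$-weighted squared error ${\mathcal{E}}^{\mu}({\boldsymbol{w}})=\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}})\big[Q^{\mu}({\boldsymbol{z}})-{\boldsymbol{w}}^{\top}{\boldsymbol{\psi}}({\boldsymbol{z}})-b(x)\big]^{2}$ (this exact form is an assumption) and that the matrix ${\boldsymbol{G}}$ defined below is invertible. Setting the gradient with respect to ${\boldsymbol{w}}$ to zero gives

$${\boldsymbol{G}}{\boldsymbol{w}}^{*}=\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}}){\boldsymbol{\psi}}({\boldsymbol{z}})\big[Q^{\mu}({\boldsymbol{z}})-b(x)\big],\qquad{\boldsymbol{G}}=\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}}){\boldsymbol{\psi}}({\boldsymbol{z}}){\boldsymbol{\psi}}({\boldsymbol{z}})^{\top}.$$

With ${\boldsymbol{\psi}}({\boldsymbol{z}})=\nabla\log\mu(a|x;{\boldsymbol{\theta}})$ and $\pi^{\mu}(x,a)=\nu^{\mu}(x)\mu(a|x;{\boldsymbol{\theta}})$, the baseline term vanishes:

$$\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}}){\boldsymbol{\psi}}({\boldsymbol{z}})b(x)=\int_{{\mathcal{X}}}dx\,\nu^{\mu}(x)b(x)\int_{{\mathcal{A}}}da\,\nabla\mu(a|x;{\boldsymbol{\theta}})=0.$$

Hence ${\boldsymbol{w}}^{*}={\boldsymbol{G}}^{-1}\int_{{\mathcal{Z}}}d{\boldsymbol{z}}\,\pi^{\mu}({\boldsymbol{z}}){\boldsymbol{\psi}}({\boldsymbol{z}})Q^{\mu}({\boldsymbol{z}})$, whatever the choice of $b(x)$.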

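A convenient and rather flexible choice for a space of policies that ensures compatibility between the policy and the action-value representation is a parametric exponential family

$$\mu(a|x;{\boldsymbol{\theta}})=\frac{1}{Z_{\boldsymbol{\theta}}(x)}\exp\big({\boldsymbol{\theta}}^{\top}{\boldsymbol{\phi}}(x,a)\big),$$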

where $Z_{\boldsymbol{\theta}}(x)=\int_{\mathcal{A}}da\exp\big({\boldsymbol{\theta}}^{\top}{\boldsymbol{\phi}}(x,a)\big)$ is a normalizing factor, referred to as the partition function. It is easy to show that ${\boldsymbol{\psi}}({\boldsymbol{z}})={\boldsymbol{\phi}}({\boldsymbol{z}})-{\bf E}_{a|x}\left[{\boldsymbol{\phi}}({\boldsymbol{z}})\right]$ , where ${\bf E}_{a|x}[\cdot]=\int_{\mathcal{A}}da\mu(a|x;{\boldsymbol{\theta}})[\cdot]$ , and as a result, $\hat{Q}^{\mu}({\boldsymbol{z}};{\boldsymbol{w}}^{*})={\boldsymbol{w}}^{*\top}\big({\boldsymbol{\phi}}({\boldsymbol{z}})-{\bf E}_{a|x}[{\boldsymbol{\phi}}({\boldsymbol{z}})]\big)+b(x)$ is a compatible action-value function for this family of policies. Note that ${\bf E}_{a|x}[\hat{Q}({\boldsymbol{z}};{\boldsymbol{w}}^{*})]=b(x)$ , since ${\bf E}_{a|x}\big[{\boldsymbol{\phi}}({\boldsymbol{z}})-{\bf E}_{a|x}[{\boldsymbol{\phi}}({\boldsymbol{z}})]\big]=0$ . This means that if $\hat{Q}^{\mu}({\boldsymbol{z}};{\boldsymbol{w}}^{*})$ approximates $Q^{\mu}({\boldsymbol{z}})$ , then $b(x)$ must approximate the value function $V^{\mu}(x)$ . The term $\hat{A}^{\mu}({\boldsymbol{z}};{\boldsymbol{w}}^{*})=\hat{Q}^{\mu}({\boldsymbol{z}};{\boldsymbol{w}}^{*})-b(x)={\boldsymbol{w}}^{*\top}\big({\boldsymbol{\phi}}({\boldsymbol{z}})-{\bf E}_{a|x}[{\boldsymbol{\phi}}({\boldsymbol{z}})]\big)$ approximates the advantage function $A^{\mu}({\boldsymbol{z}})=Q^{\mu}({\boldsymbol{z}})-V^{\mu}(x)$ (Baird93AU).
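The identity ${\boldsymbol{\psi}}({\boldsymbol{z}})={\boldsymbol{\phi}}({\boldsymbol{z}})-{\bf E}_{a|x}[{\boldsymbol{\phi}}({\boldsymbol{z}})]$ can be checked numerically. The sketch below assumes a finite action set (so the integral over actions becomes a sum) and a single fixed state, draws random features, and compares the centered features with a finite-difference gradient of $\log\mu(a|x;{\boldsymbol{\theta}})$. All function names and the random features are hypothetical.

```python
import numpy as np


def policy_probs(theta, phi):
    """mu(a|x; theta) for one state x; phi holds one feature row per action."""
    logits = phi @ theta
    w = np.exp(logits - logits.max())  # subtracting the max avoids overflow
    return w / w.sum()


def compatible_features(theta, phi):
    """psi(x, a) = phi(x, a) - E_{a|x}[phi(x, a)], one row per action."""
    mu = policy_probs(theta, phi)
    return phi - mu @ phi


def numerical_score(theta, phi, action, eps=1e-6):
    """Central finite-difference gradient of log mu(action|x; theta)."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        upper = np.log(policy_probs(theta + step, phi)[action])
        lower = np.log(policy_probs(theta - step, phi)[action])
        grad[j] = (upper - lower) / (2 * eps)
    return grad


def check_compatibility(num_actions=4, dim=3, seed=0):
    """Return the largest gap to the numerical score and max |E[psi]|."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=(num_actions, dim))  # hypothetical features
    theta = rng.normal(size=dim)               # hypothetical parameters
    psi = compatible_features(theta, phi)
    gap = max(
        np.abs(psi[a] - numerical_score(theta, phi, a)).max()
        for a in range(num_actions)
    )
    mean_psi = np.abs(policy_probs(theta, phi) @ psi).max()
    return gap, mean_psi
```

Both returned quantities should be close to zero: the first confirms that the centered features are the score of the policy, the second that they average to zero under the policy, which is the property behind ${\bf E}_{a|x}[\hat{Q}({\boldsymbol{z}};{\boldsymbol{w}}^{*})]=b(x)$.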

Bayesian Quadrature

Bayesian quadrature (BQ) (Ohagan91BQ) is, as its name suggests, a Bayesian method for evaluating an integral using samples of its integrand. We consider the problem of evaluating the integral

$$\rho=\int f(x)g(x)\,dx.\tag{11}$$

If $g(x)$ is a probability density function, i.e., $g(x)=p(x)$, this becomes the problem of evaluating the expected value of $f(x)$. A well known frequentist approach to evaluating such expectations is the Monte-Carlo (MC) method. For MC estimation of such expectations, it is typically required that samples $x_{1},x_{2},\ldots,x_{M}$ are drawn from $p(x)$. (If samples are drawn from some other distribution, importance sampling variants of MC may be used.) The integral in Equation 11 is then estimated as

$$\hat{\rho}_{MC}=\frac{1}{M}\sum_{i=1}^{M}f(x_{i}).$$

It is easy to show that $\hat{\rho}_{MC}$ is an unbiased estimate of $\rho$, with variance that diminishes to zero as $M\rightarrow\infty$. However, as Ohagan87MC points out, MC estimation is fundamentally unsound, as it violates the likelihood principle, and moreover, does not make full use of the data at hand. The alternative proposed in Ohagan91BQ is based on the following reasoning: In the Bayesian approach, $f(\cdot)$ is random simply because it is unknown. We are therefore uncertain about the value of $f(x)$ until we actually evaluate it. In fact, even then, our uncertainty is not always completely removed, since measured samples of $f(x)$ may be corrupted by noise. Modeling $f$ as a Gaussian process (GP) means that our uncertainty is completely accounted for by specifying a Normal prior distribution over functions. This prior distribution is specified by its mean and covariance, and is denoted by $f(\cdot)\sim{\mathcal{N}}\big(\bar{f}(\cdot),k(\cdot,\cdot)\big)$. This is shorthand for the statement that $f$ is a GP with prior mean and covariance

$${\bf E}\big[f(x)\big]=\bar{f}(x),\qquad{\bf Cov}\big[f(x),f(x^{\prime})\big]=k(x,x^{\prime}),$$

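respectively. The choice of kernel function $k$ allows us to incorporate prior knowledge on the smoothness properties of the integrand into the estimation procedure. When we are provided with a set of samples ${\mathcal{D}}_{M}=\big\{\big(x_{i},y(x_{i})\big)\big\}_{i=1}^{M}$, where $y(x_{i})$ is a (possibly noisy) sample of $f(x_{i})$, we apply Bayes’ rule to condition the prior on these sampled values. If the measurement noise is normally distributed, the result is a Normal posterior distribution of $f|{\mathcal{D}}_{M}$. The expressions for the posterior mean and covariance are standard:

$${\bf E}\big[f(x)|{\mathcal{D}}_{M}\big]=\bar{f}(x)+{\boldsymbol{k}}(x)^{\top}{\boldsymbol{C}}({\boldsymbol{y}}-\bar{{\boldsymbol{f}}}),\qquad{\bf Cov}\big[f(x),f(x^{\prime})|{\mathcal{D}}_{M}\big]=k(x,x^{\prime})-{\boldsymbol{k}}(x)^{\top}{\boldsymbol{C}}{\boldsymbol{k}}(x^{\prime}).$$

Here and in the sequel, we make use of the definitions:

$$\bar{{\boldsymbol{f}}}=\big(\bar{f}(x_{1}),\ldots,\bar{f}(x_{M})\big)^{\top},\quad{\boldsymbol{k}}(x)=\big(k(x_{1},x),\ldots,k(x_{M},x)\big)^{\top},\quad{\boldsymbol{y}}=\big(y(x_{1}),\ldots,y(x_{M})\big)^{\top},\quad[{\boldsymbol{K}}]_{i,j}=k(x_{i},x_{j}),\quad{\boldsymbol{C}}=({\boldsymbol{K}}+{\boldsymbol{\Sigma}})^{-1},$$

where ${\boldsymbol{K}}$ is the kernel (or Gram) matrix, and $[{\boldsymbol{\Sigma}}]_{i,j}$ is the measurement noise covariance between the $i$th and $j$th samples. It is typically assumed that the measurement noise is i.i.d., in which case ${\boldsymbol{\Sigma}}=\sigma^{2}{\boldsymbol{I}}$, where $\sigma^{2}$ is the noise variance and ${\boldsymbol{I}}$ is the (appropriately sized, here $M\times M$) identity matrix.

Since integration is a linear operation, the posterior distribution of the integral in Equation 11 is also Gaussian, and the posterior moments are given by (Ohagan91BQ)

$${\bf E}\big[\rho|{\mathcal{D}}_{M}\big]=\int{\bf E}\big[f(x)|{\mathcal{D}}_{M}\big]g(x)\,dx,\qquad{\bf Var}\big[\rho|{\mathcal{D}}_{M}\big]=\iint{\bf Cov}\big[f(x),f(x^{\prime})|{\mathcal{D}}_{M}\big]g(x)g(x^{\prime})\,dx\,dx^{\prime}.$$

Substituting the posterior moments of $f$ into these expressions, we obtain

$${\bf E}\big[\rho|{\mathcal{D}}_{M}\big]=\rho_{0}+{\boldsymbol{b}}^{\top}{\boldsymbol{C}}({\boldsymbol{y}}-\bar{{\boldsymbol{f}}}),\qquad{\bf Var}\big[\rho|{\mathcal{D}}_{M}\big]=b_{0}-{\boldsymbol{b}}^{\top}{\boldsymbol{C}}{\boldsymbol{b}},$$

where

$$\rho_{0}=\int\bar{f}(x)g(x)\,dx,\qquad{\boldsymbol{b}}=\int{\boldsymbol{k}}(x)g(x)\,dx,\qquad b_{0}=\iint k(x,x^{\prime})g(x)g(x^{\prime})\,dx\,dx^{\prime}.\tag{16}$$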

Note that $\rho_{0}$ and $b_{0}$ are the prior mean and variance of $\rho$ , respectively.
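A one-dimensional instance shows how the posterior moments of the integral are computed once the kernel integrals are available in closed form. The sketch below assumes a zero prior mean, the kernel $k(x,x^{\prime})=\exp\big(-(x-x^{\prime})^{2}/2\ell^{2}\big)$, the weight $g(x)={\mathcal{N}}(x;0,s^{2})$, and i.i.d. noise of variance $\sigma^{2}$. Under these assumptions $b_{i}=\int k(x_{i},x)g(x)dx=\sqrt{\ell^{2}/(\ell^{2}+s^{2})}\exp\big(-x_{i}^{2}/2(\ell^{2}+s^{2})\big)$ and $b_{0}=\iint k(x,x^{\prime})g(x)g(x^{\prime})dx\,dx^{\prime}=\sqrt{\ell^{2}/(\ell^{2}+2s^{2})}$, the posterior mean is ${\boldsymbol{b}}^{\top}({\boldsymbol{K}}+\sigma^{2}{\boldsymbol{I}})^{-1}{\boldsymbol{y}}$, and the posterior variance is $b_{0}-{\boldsymbol{b}}^{\top}({\boldsymbol{K}}+\sigma^{2}{\boldsymbol{I}})^{-1}{\boldsymbol{b}}$. The function names, the grid, and the test integrand are hypothetical choices for illustration.

```python
import numpy as np


def bq_gaussian(x, y, ell=1.0, s=1.0, noise_var=1e-8):
    """Posterior mean and variance of rho = int f(x) N(x; 0, s^2) dx.

    Assumes a zero prior mean and k(x, x') = exp(-(x - x')^2 / (2 ell^2)).
    Also returns b0, the prior variance of rho.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * ell**2))
    scale = ell**2 + s**2
    b = np.sqrt(ell**2 / scale) * np.exp(-(x**2) / (2 * scale))
    b0 = np.sqrt(ell**2 / (ell**2 + 2 * s**2))
    A = K + noise_var * np.eye(x.size)  # K + Sigma with i.i.d. noise
    mean = b @ np.linalg.solve(A, y)
    var = b0 - b @ np.linalg.solve(A, b)
    return mean, var, b0


def bq_demo():
    """Integrate f(x) = exp(-x^2 / 2) against a standard normal density."""
    x = np.linspace(-5.0, 5.0, 21)  # design points chosen on a grid
    y = np.exp(-(x**2) / 2)         # noise-free samples of f
    return bq_gaussian(x, y)
```

For this integrand the exact value is $\sqrt{1/2}$, so the posterior mean returned by `bq_demo()` can be compared against it. The integrand was chosen to be a kernel function centered on a grid point, which makes it an easy case for the GP model; other integrands will generally be recovered less accurately.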

Rasmussen03BM experimentally demonstrated how this approach, when applied to the evaluation of an expectation, can outperform MC estimation by orders of magnitude in terms of the mean-squared error.

In order to prevent the problem from “degenerating into infinite regress”, as phrased by Ohagan91BQ, we should choose the functions $g$, $k$, and $\bar{f}$ so as to allow us to solve the integrals in Equation 16 analytically. (What O’Hagan means by “degenerating into infinite regress” is simply that if we cannot compute the posterior integrals of Equation 16 analytically, then we have started with estimating one integral (Equation 11) and ended up with three (Equation 16); if we repeat this process, it can go on forever and leave us with infinitely many integrals to evaluate. Therefore, for Bayesian MC to work, it is crucial to be able to calculate the posterior integrals analytically, and this can be achieved through the way we divide the integrand into two parts and the way we select the kernel function.) For example, O’Hagan provides the analysis required for the case where the integrands in Equation 16 are products of multivariate Gaussians and polynomials, referred to as Bayes-Hermite quadrature. One of the contributions of our work is in providing analogous analysis for kernel functions that are based on the Fisher kernel (Jaakkola98EG, ShaweTaylor04KM).

It is important to note that in MC estimation, samples must be drawn from the distribution $p(x)=g(x)$, whereas in the Bayesian approach, samples may be drawn from arbitrary distributions. This affords us flexibility in the choice of sample points, allowing us, for instance, to actively design the samples $x_{1},\ldots,x_{M}$ so as to maximize information gain.
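One simple way to illustrate active design is to note that, for a GP with Gaussian noise, the posterior variance of the integral depends only on where the samples are taken and not on the observed values, so locations can be chosen before any evaluation of $f$. The sketch below greedily adds the candidate location that most reduces this variance. It uses variance reduction as a stand-in for information gain, and assumes the kernel $k(x,x^{\prime})=\exp\big(-(x-x^{\prime})^{2}/2\ell^{2}\big)$ and the weight $g(x)={\mathcal{N}}(x;0,s^{2})$; the function names and the candidate grid are hypothetical.

```python
import numpy as np


def integral_variance(points, ell=1.0, s=1.0, noise_var=1e-6):
    """Posterior variance of int f(x) N(x; 0, s^2) dx given sample locations.

    Assumes k(x, x') = exp(-(x - x')^2 / (2 ell^2)) and i.i.d. noise.
    """
    pts = np.asarray(points, dtype=float)
    b0 = np.sqrt(ell**2 / (ell**2 + 2 * s**2))  # prior variance of the integral
    if pts.size == 0:
        return float(b0)
    K = np.exp(-((pts[:, None] - pts[None, :]) ** 2) / (2 * ell**2))
    scale = ell**2 + s**2
    b = np.sqrt(ell**2 / scale) * np.exp(-(pts**2) / (2 * scale))
    A = K + noise_var * np.eye(pts.size)
    return float(b0 - b @ np.linalg.solve(A, b))


def greedy_design(candidates, num_points):
    """Pick locations one at a time, each minimizing the resulting variance."""
    chosen = []
    for _ in range(num_points):
        scores = [integral_variance(chosen + [c]) for c in candidates]
        chosen.append(float(candidates[int(np.argmin(scores))]))
    return chosen


def design_demo():
    """Return a 4-point design and the variance after 0, 1, ..., 4 points."""
    design = greedy_design(np.linspace(-3.0, 3.0, 13), 4)
    variances = [integral_variance(design[:m]) for m in range(5)]
    return design, variances
```

In `design_demo()` the first selected location is the mode of $g$, and the variance sequence is non-increasing, since an additional observation cannot increase the posterior variance.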

In the second model, a scalar-valued function is sampled from the GP prior distribution, which is specified by a single prior mean function and a single prior covariance-kernel function. Gaussian noise may be added, and the result is then multiplied by each of the components of the $n$ -valued function $g$ to produce the integrand. This model is significantly simpler, both conceptually and in terms of the number of parameters required to specify it. To see how a model of the first kind may arise, consider the following example.

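Let $\rho({\boldsymbol{\theta}})=\int f(x;{\boldsymbol{\theta}})g(x)dx$, where $f$ is a scalar GP, parameterized by a vector of parameters ${\boldsymbol{\theta}}$. Its prior mean and covariance functions must therefore depend on ${\boldsymbol{\theta}}$. We denote these dependencies by writing:

$$\bar{f}(x;{\boldsymbol{\theta}})={\bf E}\big[f(x;{\boldsymbol{\theta}})\big],\qquad k(x,x^{\prime};{\boldsymbol{\theta}})={\bf Cov}\big[f(x;{\boldsymbol{\theta}}),f(x^{\prime};{\boldsymbol{\theta}})\big].$$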

We choose $\bar{f}(x;{\boldsymbol{\theta}})$ and $k(x,x^{\prime};{\boldsymbol{\theta}})$ so as to be once and twice differentiable in ${\boldsymbol{\theta}}$ , respectively. Suppose now that we are not interested in estimating $\rho({\boldsymbol{\theta}})$ , but rather in its gradient with respect to the parameters ${\boldsymbol{\theta}}$ : $\nabla_{\boldsymbol{\theta}}\rho({\boldsymbol{\theta}})=\int\nabla_{\boldsymbol{\theta}}f(x;{\boldsymbol{\theta}})g(x)dx$ . It may be easily verified that the mean functions and covariance kernels of the vector-valued GP $\nabla_{\boldsymbol{\theta}}f(x;{\boldsymbol{\theta}})$ are given by

where $\partial_{\theta_{j}}$ denotes the $j$ th component of $\nabla_{\boldsymbol{\theta}}$ . $\Box$

Propositions 1 and 2 specify the form taken by the mean and covariance functions of the integral GP under the two models discussed above.

Let $f$ be a scalar-valued GP with mean function $\bar{f}(x)={\bf E}\big[f(x)\big]$ and covariance function $k(x,x^{\prime})={\bf Cov}\big[f(x),f(x^{\prime})\big]$ , and let $g$ be an $n$ -valued function. Then, the mean and covariance of $\rho$ defined by Equation 11 are of the following form:

The proofs of these two propositions follow straightforwardly from the definition of the covariance operator in terms of expectations, and the order-exchangeability of GP expectations and integration with respect to $x$ .
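For the scalar-valued GP with an $n$-valued $g$, this argument can be sketched componentwise. Writing $\rho_{j}=\int f(x)g_{j}(x)dx$ for $j=1,\ldots,n$ (the component notation is introduced here for illustration) and exchanging expectation and integration,

$${\bf E}\big[\rho_{j}\big]=\int{\bf E}\big[f(x)\big]g_{j}(x)\,dx=\int\bar{f}(x)g_{j}(x)\,dx,$$

$${\bf Cov}\big[\rho_{j},\rho_{\ell}\big]={\bf E}\Big[\int\big(f(x)-\bar{f}(x)\big)g_{j}(x)\,dx\int\big(f(x^{\prime})-\bar{f}(x^{\prime})\big)g_{\ell}(x^{\prime})\,dx^{\prime}\Big]=\iint k(x,x^{\prime})g_{j}(x)g_{\ell}(x^{\prime})\,dx\,dx^{\prime}.$$

In vector form this reads ${\bf E}[\rho]=\int\bar{f}(x)g(x)dx$ and ${\bf Cov}[\rho]=\iint k(x,x^{\prime})g(x)g(x^{\prime})^{\top}dx\,dx^{\prime}$.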

To wrap things up, we need to describe the form taken by the posterior moments of $f$ in the vector-valued GP case. Using the standard Gaussian conditioning formulas, it is straightforward to show that

where ${\boldsymbol{C}}=({\boldsymbol{K}}+{\boldsymbol{\Sigma}})^{-1}$ . It should, however, be kept in mind that ignoring correlations between the components of $f$ , when such correlations exist, may result in suboptimal use of the available data (see Rasmussen06GP, Chapter 9 for references on GP regression with multiple outputs).

Bayesian Policy Gradient

In this section, we use vector-valued Bayesian quadrature to estimate the gradient of the expected return with respect to the policy parameters, allowing us to propose new Bayesian policy gradient (BPG) algorithms. In the frequentist approach to policy gradient, the performance measure used is $\eta({\boldsymbol{\theta}})=\int\bar{R}(\xi)\Pr(\xi;{\boldsymbol{\theta}})d\xi$ (Equation 2). In order to serve as a useful performance measure, it has to be a deterministic function of the policy parameters ${\boldsymbol{\theta}}$ . This is achieved by averaging the cumulative return $R(\xi)$ over all possible paths $\xi$ and all possible returns accumulated in each path. In the Bayesian approach we have an additional source of randomness, namely, our subjective Bayesian uncertainty concerning the process generating the cumulative return. Let us denote

where $\eta_{B}({\boldsymbol{\theta}})$ is a random variable because of the Bayesian uncertainty. Under the quadratic loss, the optimal Bayesian performance measure is the posterior expected value of $\eta_{B}({\boldsymbol{\theta}})$, ${\bf E}\big[\eta_{B}({\boldsymbol{\theta}})|{\mathcal{D}}_{M}\big]$. However, since we are interested in optimizing the performance rather than evaluating it (although evaluating the posterior distribution of performance is an interesting question in its own right), we would rather evaluate the posterior distribution of the gradient of $\eta_{B}({\boldsymbol{\theta}})$ with respect to the policy parameters ${\boldsymbol{\theta}}$. Note that we may interchange the order of the gradient and the expectation operators for the mean, $\nabla{\bf E}\big[\eta_{B}({\boldsymbol{\theta}})\big]={\bf E}\big[\nabla\eta_{B}({\boldsymbol{\theta}})\big]$, but the same is not true for the variance, namely, $\nabla{\bf Var}\big[\eta_{B}({\boldsymbol{\theta}})\big]\neq{\bf Cov}\big[\nabla\eta_{B}({\boldsymbol{\theta}})\big]$. The posterior mean of the gradient is

Consequently, in BPG we cast the problem of estimating the gradient of the expected return (Equation 20) in the form of Equation 11. As described in Section 3, we need to partition the integrand into two parts, $f(\xi;{\boldsymbol{\theta}})$ and $g(\xi;{\boldsymbol{\theta}})$ . We will model $f$ as a GP and assume that $g$ is a function known to us. We will then proceed by calculating the posterior moments of the gradient $\nabla\eta_{B}({\boldsymbol{\theta}})$ conditioned on the observed data. Because in general, $R(\xi)$ cannot be known exactly, even for a given $\xi$ (due to the stochasticity of the rewards), $R(\xi)$ should always belong to the GP part of the model, i.e., $f(\xi;{\boldsymbol{\theta}})$ . Interestingly, in certain cases it is sufficient to know the Fisher information matrix corresponding to $\Pr(\xi;{\boldsymbol{\theta}})$ , rather than having exact knowledge of $\Pr(\xi;{\boldsymbol{\theta}})$ itself. We make use of this fact in the sequel. In the next two sections, we investigate two different ways of partitioning the integrand in Equation 20, resulting in two distinct Bayesian policy gradient models.

In our first model, we define $g$ and $f$ as follows:

We place a vector-valued GP prior over $f(\xi;{\boldsymbol{\theta}})$ which induces a GP prior over the corresponding noisy measurement $y(\xi;{\boldsymbol{\theta}})=R(\xi)\nabla\mathop{\rm log}\Pr(\xi;{\boldsymbol{\theta}})$ . We adopt the simplifying assumptions discussed at the end of Section 3.1: We assume that each component of $f(\xi;{\boldsymbol{\theta}})$ may be evaluated independently of all other components, and use the same kernel function and noise covariance for all components of $f(\xi;{\boldsymbol{\theta}})$ . We therefore omit the component index $j$ from ${\boldsymbol{K}}_{j,j}$ , ${\boldsymbol{\Sigma}}_{j,j}$ and ${\boldsymbol{C}}_{j,j}$ , denoting them simply as ${\boldsymbol{K}}$ , ${\boldsymbol{\Sigma}}$ and ${\boldsymbol{C}}$ , respectively. Hence, for the $j$ th component of $f$ and $y$ we have, a priori

In this vector-valued GP model, the posterior mean and covariance of $\nabla\eta_{B}({\boldsymbol{\theta}})$ are

Our choice of kernel, which allows us to derive closed-form expressions for ${\boldsymbol{b}}$ and $b_{0}$ , and as a result for the posterior moments of the gradient, is the quadratic Fisher kernel (Jaakkola98EG, ShaweTaylor04KM)

where ${\boldsymbol{u}}(\xi)=\nabla\mathop{\rm log}\Pr(\xi;{\boldsymbol{\theta}})$ is the Fisher score function of the path $\xi$ defined by Equation 5, and ${\boldsymbol{G}}({\boldsymbol{\theta}})$ is the corresponding Fisher information matrix (to simplify notation, we omit ${\boldsymbol{G}}$'s dependence on the policy parameters ${\boldsymbol{\theta}}$, and denote ${\boldsymbol{G}}({\boldsymbol{\theta}})$ as ${\boldsymbol{G}}$ in the sequel), defined as

Using the quadratic Fisher kernel from Equation 23, the integrals ${\boldsymbol{b}}$ and $b_{0}$ in Equation 22 have the following closed form expressions
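To make the Model 1 computation concrete, the following sketch assumes the quadratic Fisher kernel $k(\xi,\xi^{\prime})=\big(1+{\boldsymbol{u}}(\xi)^{\top}{\boldsymbol{G}}^{-1}{\boldsymbol{u}}(\xi^{\prime})\big)^{2}$, for which integrating against $\Pr(\xi;{\boldsymbol{\theta}})$ gives $b_{i}=1+{\boldsymbol{u}}(\xi_{i})^{\top}{\boldsymbol{G}}^{-1}{\boldsymbol{u}}(\xi_{i})$ and $b_{0}=1+n$, and the usual Bayesian quadrature moments ${\boldsymbol{Y}}{\boldsymbol{C}}{\boldsymbol{b}}$ and $(b_{0}-{\boldsymbol{b}}^{\top}{\boldsymbol{C}}{\boldsymbol{b}}){\boldsymbol{I}}$. These forms should be checked against Equations 22, 23 and 25. The function name `bpg_model1`, the isotropic noise ${\boldsymbol{\Sigma}}=\sigma^{2}{\boldsymbol{I}}$, and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def bpg_model1(U, returns, G, noise_var):
    """Posterior mean and covariance of the gradient under Model 1 (sketch).

    U         : (n, M) matrix whose columns are the path scores u(xi_i)
    returns   : (M,) observed returns R(xi_i)
    G         : (n, n) Fisher information matrix, assumed known here
    noise_var : scalar; Sigma = noise_var * I is an illustrative assumption
    """
    n, M = U.shape
    S = U.T @ np.linalg.solve(G, U)        # S[i, j] = u_i^T G^{-1} u_j
    K = (1.0 + S) ** 2                     # quadratic Fisher kernel matrix
    C = np.linalg.inv(K + noise_var * np.eye(M))
    b = 1.0 + np.diag(S)                   # b_i = 1 + u_i^T G^{-1} u_i
    b0 = 1.0 + n                           # b_0 = 1 + n
    Y = U * returns                        # column i is y(xi_i) = R(xi_i) u(xi_i)
    mean = Y @ C @ b                       # posterior mean, shape (n,)
    cov = (b0 - b @ C @ b) * np.eye(n)     # posterior covariance, shape (n, n)
    return mean, cov

# Hypothetical toy inputs: standard-normal scores, for which G = I.
rng = np.random.default_rng(0)
U = rng.standard_normal((2, 10))
R = rng.standard_normal(10)
grad_mean, grad_cov = bpg_model1(U, R, np.eye(2), noise_var=0.1)
```

Because all components share ${\boldsymbol{K}}$ and ${\boldsymbol{\Sigma}}$, the posterior covariance in this sketch is a scalar multiple of the identity.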

4.2 Model 2 – Scalar-Valued GP

In our second model, we define $g$ and $f$ as follows:

Now $g$ is a vector-valued function, while $f$ is a scalar valued GP representing the expected return of the path given as its argument. The noisy measurement corresponding to $f(\xi_{i})$ is $y(\xi_{i})=R(\xi_{i})$ , namely, the actual return accrued while following the path $\xi_{i}$ . In this model, the posterior mean and covariance of the gradient $\nabla\eta_{B}({\boldsymbol{\theta}})$ are

Here, our choice of kernel function, which again allows us to derive closed-form expressions for ${\boldsymbol{B}}$ and ${\boldsymbol{B}}_{0}$ , is the Fisher kernel (Jaakkola98EG, ShaweTaylor04KM)

Using the Fisher kernel from Equation 27, the integrals ${\boldsymbol{B}}$ and ${\boldsymbol{B}}_{0}$ in Equation 4.2 have the following closed-form expressions

where ${\boldsymbol{U}}=\big[{\boldsymbol{u}}(\xi_{1}),\ldots,{\boldsymbol{u}}(\xi_{M})\big]$ .
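The Model 2 computation can be sketched in the same way. Assuming the linear Fisher kernel $k(\xi,\xi^{\prime})={\boldsymbol{u}}(\xi)^{\top}{\boldsymbol{G}}^{-1}{\boldsymbol{u}}(\xi^{\prime})$ and $g=\nabla\Pr(\xi;{\boldsymbol{\theta}})$, the integrals reduce to ${\boldsymbol{B}}={\boldsymbol{U}}$ and ${\boldsymbol{B}}_{0}={\boldsymbol{G}}$, so the posterior moments are ${\boldsymbol{U}}{\boldsymbol{C}}{\boldsymbol{y}}$ and ${\boldsymbol{G}}-{\boldsymbol{U}}{\boldsymbol{C}}{\boldsymbol{U}}^{\top}$. The name `bpg_model2`, the isotropic noise, and the toy inputs below are illustrative assumptions.

```python
import numpy as np

def bpg_model2(U, returns, G, noise_var):
    """Posterior mean and covariance of the gradient under Model 2 (sketch).

    U         : (n, M) matrix whose columns are the path scores u(xi_i)
    returns   : (M,) observed returns y(xi_i) = R(xi_i)
    G         : (n, n) Fisher information matrix, assumed known here
    noise_var : scalar; Sigma = noise_var * I is an illustrative assumption
    """
    n, M = U.shape
    K = U.T @ np.linalg.solve(G, U)        # K[i, j] = u_i^T G^{-1} u_j
    C = np.linalg.inv(K + noise_var * np.eye(M))
    mean = U @ C @ returns                 # B C y with B = U
    cov = G - U @ C @ U.T                  # B_0 - B C B^T with B_0 = G
    return mean, cov

# Hypothetical toy inputs: standard-normal scores, for which G = I.
rng = np.random.default_rng(0)
U = rng.standard_normal((2, 10))
R = rng.standard_normal(10)
grad_mean, grad_cov = bpg_model2(U, R, np.eye(2), noise_var=0.1)
```

Unlike Model 1, the posterior covariance here is a full $n\times n$ matrix, since $g$ rather than $f$ carries the vector structure.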

Table 1 summarizes the two BPG models presented in Sections 4.1 and 4.2. Our choice of Fisher-type kernels was motivated by the notion that a good representation should depend on the process generating the data (see Jaakkola98EG, ShaweTaylor04KM, for a thorough discussion). Our particular selection of linear and quadratic Fisher kernels was guided by the desideratum that the posterior moments of the gradient be analytically tractable, as discussed in Section 3.

As described above, in either model, we are restricted in the choice of kernel (quadratic Fisher kernel in Model 1 and Fisher kernel in Model 2) in order to be able to derive closed-form expressions for the posterior mean and covariance of the gradient integral. The loss due to this restriction depends on the problem at hand and is hard to quantify. This loss is exactly the loss of selecting an inappropriate prior in any Bayesian algorithm or, more generally, of choosing a wrong representation (function space) in a machine learning algorithm (referred to as approximation error in approximation theory). However, the experimental results of Section 6 indicate that this restriction did not cause a significant error (especially for Model 1) in our gradient estimates, as those estimated by BPG were more accurate than the ones estimated by the MC-based method, given the same number of samples.

4.3 A Bayesian Policy Gradient Evaluation Algorithm

We can now use our two BPG models to define corresponding algorithms for evaluating the gradient of the expected return with respect to the policy parameters. Pseudo-code for these algorithms is shown in Algorithm 1. The generic algorithm (for either model) takes a set of policy parameters ${\boldsymbol{\theta}}$ and a sample size $M$ as input, and returns an estimate of the posterior moments of the gradient of the expected return with respect to the policy parameters. This algorithm generates $M$ sample paths to evaluate the gradient. For each path $\xi_{i}$ , the algorithm first computes its score function ${\boldsymbol{u}}(\xi_{i})$ (Line 6). The score function is needed for computing the kernel function $k$ , the measurement ${\boldsymbol{y}}$ in Model 1, and ${\boldsymbol{b}}$ or ${\boldsymbol{B}}$ . The algorithm then computes the return $R$ and the measurement $y(\xi_{i})$ for the observed path $\xi_{i}$ (Lines 7 and 9), and updates the kernel matrix ${\boldsymbol{K}}$ (Line 8) using

Finally, the algorithm adds the measurement error ${\boldsymbol{\Sigma}}$ to the covariance matrix ${\boldsymbol{K}}$ (Line 12) and computes the posterior moments of the policy gradient (Line 14). ${\boldsymbol{B}}(:,i)$ on Line 10 denotes the $i$ th column of the matrix ${\boldsymbol{B}}$ .

The kernel functions used in Models 1 and 2 (Equations 23 and 27) are both based on the Fisher kernel. Computing the Fisher kernel requires calculating the Fisher information matrix ${\boldsymbol{G}}({\boldsymbol{\theta}})$ (Equation 24). Consequently, every time we update the policy parameters, we need to recompute ${\boldsymbol{G}}$ . In Algorithm 1 we assume that the Fisher information matrix is known. However, in most practical situations this will not be the case, and consequently the Fisher information matrix must be estimated. Let us briefly outline two possible approaches for estimating the Fisher information matrix in an online manner.

The BPG algorithm generates a number of sample paths using the current policy parameterized by ${\boldsymbol{\theta}}$ in order to estimate the gradient $\nabla\eta_{B}({\boldsymbol{\theta}})$ . We can use these generated sample paths to estimate the Fisher information matrix ${\boldsymbol{G}}({\boldsymbol{\theta}})$ in an online manner, by replacing the expectation in ${\boldsymbol{G}}$ with empirical averaging as $\hat{{\boldsymbol{G}}}_{M}({\boldsymbol{\theta}})=\frac{1}{M}\sum_{i=1}^{M}{\boldsymbol{u}}(\xi_{i}){\boldsymbol{u}}(\xi_{i})^{\top}$ . It can be shown that $\hat{{\boldsymbol{G}}}_{M}$ is an unbiased estimator of ${\boldsymbol{G}}$ . One may obtain this estimate recursively $\hat{{\boldsymbol{G}}}_{i+1}=(1-\frac{1}{i})\hat{{\boldsymbol{G}}}_{i}+\frac{1}{i}{\boldsymbol{u}}(\xi_{i}){\boldsymbol{u}}(\xi_{i})^{\top}$ , or more generally $\hat{{\boldsymbol{G}}}_{i+1}=(1-\zeta_{i})\hat{{\boldsymbol{G}}}_{i}+\zeta_{i}{\boldsymbol{u}}(\xi_{i}){\boldsymbol{u}}(\xi_{i})^{\top}$ , where $\zeta_{i}$ is a step-size with $\sum_{i}\zeta_{i}=\infty$ and $\sum_{i}\zeta_{i}^{2}<\infty$ . Using the Sherman-Morrison matrix inversion lemma, it is possible to directly estimate the inverse of the Fisher information matrix as
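For a rank-one update of the form $(1-\zeta)\hat{{\boldsymbol{G}}}+\zeta{\boldsymbol{u}}{\boldsymbol{u}}^{\top}$, the Sherman-Morrison identity gives the inverse as $\frac{1}{1-\zeta}\Big[\hat{{\boldsymbol{G}}}^{-1}-\frac{\zeta\,\hat{{\boldsymbol{G}}}^{-1}{\boldsymbol{u}}{\boldsymbol{u}}^{\top}\hat{{\boldsymbol{G}}}^{-1}}{(1-\zeta)+\zeta\,{\boldsymbol{u}}^{\top}\hat{{\boldsymbol{G}}}^{-1}{\boldsymbol{u}}}\Big]$. The sketch below illustrates both the batch estimate and this inverse update. The function names, the identity initialization, and the step size $\zeta_{i}=1/(i+2)$ are illustrative assumptions. The inverse update needs $\zeta<1$, so the sketch starts from a positive-definite initial estimate.

```python
import numpy as np

def fisher_batch(U):
    """Empirical Fisher matrix (1/M) sum_i u_i u_i^T; columns of U are scores."""
    return U @ U.T / U.shape[1]

def fisher_inverse_update(G_inv, u, zeta):
    """Inverse of (1 - zeta) * G + zeta * u u^T, given G_inv = G^{-1}.

    Uses the Sherman-Morrison identity; requires 0 < zeta < 1 and a
    symmetric G_inv.
    """
    Gu = G_inv @ u
    denom = (1.0 - zeta) + zeta * (u @ Gu)
    return (G_inv - zeta * np.outer(Gu, Gu) / denom) / (1.0 - zeta)

# Hypothetical scores; G and G_inv start from the identity (an assumption).
rng = np.random.default_rng(1)
U = rng.standard_normal((3, 50))
G = np.eye(3)
G_inv = np.eye(3)
for i in range(U.shape[1]):
    zeta = 1.0 / (i + 2)                   # illustrative step size in (0, 1)
    u = U[:, i]
    G = (1.0 - zeta) * G + zeta * np.outer(u, u)   # direct recursion
    G_inv = fisher_inverse_update(G_inv, u, zeta)  # same update on the inverse
```

Tracking the inverse directly avoids an $O(n^{3})$ matrix inversion after every sample, at the cost of $O(n^{2})$ work per update.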

The Fisher information matrix defined by Equation 24 depends on the probability distribution over paths. This distribution is a product of two factors, one corresponding to the current policy and the other corresponding to the MDP’s state-transition probability $P$ (see Equation 1). Thus if $P$ is known, the Fisher information matrix may be evaluated offline. We can model $P$ using a parameterized model and then estimate the maximum likelihood (ML) model parameters. This approach may lead to a model-based treatment of policy gradients, which could allow us to transfer information between different policies. Current policy gradient algorithms, including the algorithms described in this paper, are extremely wasteful of training data, since they do not have any disciplined way to use data collected for previous policy updates in computing the update of the current policy. Model-based policy gradient may help solve this problem.

4.4 BPG Online Sparsification

4.5 A Bayesian Policy Gradient Algorithm

So far we were concerned with estimating the gradient of the expected return with respect to the policy parameters. In this section, we present a Bayesian policy gradient (BPG) algorithm that employs the Bayesian gradient estimation methods proposed in Section 4.3 to update the policy parameters. The pseudo-code of this algorithm is shown in Algorithm 2. The algorithm starts with an initial vector of policy parameters ${\boldsymbol{\theta}}_{0}$ , and updates the parameters in the direction of the posterior mean of the gradient of the expected return estimated by Algorithm 1. This is repeated $N$ times, or alternatively, until the gradient estimate is sufficiently close to zero.
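The outer loop described here can be sketched as plain gradient ascent around a gradient estimator. In the sketch below, `estimate_gradient` is a hypothetical callback standing in for Algorithm 1, and `bpg_optimize`, `step_size`, and `tol` are names introduced only for illustration. The toy objective is not from the paper.

```python
import numpy as np

def bpg_optimize(theta0, estimate_gradient, num_updates, step_size, tol=1e-6):
    """Move the policy parameters along the estimated posterior-mean gradient.

    estimate_gradient : callable theta -> gradient estimate (stand-in for Algorithm 1)
    step_size         : callable j -> learning rate at update j
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for j in range(num_updates):
        grad_mean = estimate_gradient(theta)
        if np.linalg.norm(grad_mean) < tol:        # alternative stopping rule
            break
        theta = theta + step_size(j) * grad_mean
    return theta

# Toy stand-in: exact gradient of eta(theta) = -||theta - 3||^2.
theta_hat = bpg_optimize(np.zeros(2), lambda th: -2.0 * (th - 3.0),
                         num_updates=200, step_size=lambda j: 0.1)
```

Passing the estimator as a callback keeps the update rule independent of which BPG model produces the gradient.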

Extension to Partially Observable Markov Decision Processes

The Bayesian policy gradient models and algorithms of Section 4 can be extended to partially observable Markov decision processes (POMDPs) along the same lines as in Section 6 of Baxter01IP. In the partially observable case, the stochastic parameterized policy $\mu(\cdot|\cdot;{\boldsymbol{\theta}})$ controls a POMDP, i.e., the policy has access to an observation process that depends on the state, but it may not observe the state itself directly.

Specifically, for each state $x\in{\mathcal{X}}$ , an observation $o\in{\mathcal{O}}$ is generated independently according to a probability distribution $P_{o}$ over observations in ${\mathcal{O}}$ . We denote the probability of observation $o$ at state $x$ by $P_{o}(o|x)$ . A stationary stochastic parameterized policy $\mu(\cdot|\cdot;{\boldsymbol{\theta}})$ is a function mapping observations $o\in{\mathcal{O}}$ into probability distributions over the actions $\mu(\cdot|o;{\boldsymbol{\theta}})\in{\mathcal{P}}({\mathcal{A}})$ . In this case, the probability of a path $\xi=(x_{0},a_{0},x_{1},a_{1},\ldots,x_{T-1},a_{T-1},x_{T})$ , $T\in\{0,1,\ldots,\infty\}$ generated by the Markov chain induced by policy $\mu(\cdot|\cdot;{\boldsymbol{\theta}})$ is given by

The Fisher score of this path may be written as

which is the same as in the observable case (Equation 5), except here the policy is defined over observations instead of states. As a result, the models and algorithms of Section 4 may be used in the partially observable case with no change, substituting observations for states.
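Since neither the transition nor the observation probabilities depend on ${\boldsymbol{\theta}}$, only the policy terms contribute to the score, which becomes a sum of $\nabla\mathop{\rm log}\mu(a_{t}|o_{t};{\boldsymbol{\theta}})$ along the path. The sketch below illustrates this for a tabular softmax policy over observations. The softmax parameterization, the function names, and the toy path are assumptions chosen for illustration.

```python
import numpy as np

def softmax_policy(theta, o):
    """Action probabilities mu(.|o; theta) for a tabular softmax policy."""
    logits = theta[o] - theta[o].max()
    p = np.exp(logits)
    return p / p.sum()

def path_score(theta, observations, actions):
    """Fisher score u(xi) = sum_t grad_theta log mu(a_t | o_t; theta)."""
    u = np.zeros_like(theta)
    for o, a in zip(observations, actions):
        u[o] -= softmax_policy(theta, o)   # gradient of the log-normalizer
        u[o, a] += 1.0                     # indicator of the action taken
    return u

def path_log_policy(theta, observations, actions):
    """Policy-dependent part of log Pr(xi; theta)."""
    return sum(np.log(softmax_policy(theta, o)[a])
               for o, a in zip(observations, actions))

# Hypothetical parameters: 2 observations x 3 actions, and a short path.
theta = np.array([[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]])
obs, acts = [0, 1, 1, 0], [2, 0, 1, 1]
u = path_score(theta, obs, acts)
```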

Moreover, similarly to the gradient estimated by the GPOMDP algorithm in Baxter01IP, the gradient estimated by Algorithm 1, $\nabla\eta_{B}({\boldsymbol{\theta}})$ , may be employed with the conjugate-gradients and line-search methods of Baxter01EI for making better use of gradient information. This allows us to exploit the information contained in the gradient estimate more aggressively than by simply adjusting the parameters by a small amount in the direction of $\nabla\eta_{B}({\boldsymbol{\theta}})$ . Conjugate-gradients and line-search are two widely used techniques in non-stochastic optimization that allow us to find better gradient directions than the pure gradient direction, and to obtain better step sizes, respectively.

Note that in this section, we followed Baxter01IP (the GPOMDP algorithm) and considered stochastic policies that map observations to actions. However, as mentioned by Baxter01IP, it is immediate that the same algorithm works for any finite history of observations. Moreover, in the same way that Aberdeen01PG showed that GPOMDP can be extended to apply to policies with internal state, our BPG POMDP algorithm can also be extended to handle such policies.

BPG Experimental Results

In this section, we compare the Bayesian quadrature (BQ) and the plain MC gradient estimates on a simple bandit problem as well as on a continuous state and action linear quadratic regulator (LQR). We also evaluate the performance of the Bayesian policy gradient (BPG) algorithm described in Algorithm 2 on the LQR, and compare it with a Monte-Carlo based policy gradient (MCPG) algorithm.

Table 2 shows the exact gradient of the expected return and its MC and BQ estimates using $10$ and $100$ samples for two instances of the bandit problem corresponding to two different deterministic reward functions $r(a)=a$ and $r(a)=a^{2}$ . The average over $10^{4}$ runs of the MC and BQ estimates and their standard deviations are reported in Table 2. The true gradient is analytically tractable and is reported as “Exact” in Table 2 for reference.

As shown in Table 2, the variance of the BQ estimates is lower than the variance of the MC estimates by an order of magnitude for the small sample size ( $M=10$ ), and by 6 orders of magnitude for the large sample size ( $M=100$ ). The BQ estimate is also more accurate than the MC estimate for the large sample size, and is roughly as accurate for the small sample size.

6.2 Linear Quadratic Regulator

In this section, we consider the following linear system in which the goal is to minimize the expected return over $20$ steps. Thus, it is an episodic problem with paths of length $20$. (What we mean by reward and return in this section is in fact cost and loss, which is why we are dealing with a minimization, and not a maximization, problem here. We keep the terms reward and return to maintain consistency in notation and definitions throughout the paper.)

We run two sets of experiments on this system. We first fix the set of policy parameters and compare the BQ and MC estimates of the gradient of the expected return using the same sample. We then proceed to solving the complete policy gradient problem and compare the performance of the BPG algorithm (with both conventional and natural gradients) with a Monte-Carlo based policy gradient (MCPG) algorithm.

In this section, we compare the BQ and MC estimates of the gradient of the expected return for the policy induced by parameters $\lambda=-0.2$ and $\sigma=1$ . We use several different sample sizes (number of paths used for gradient estimation) $M=5j\;,\;j=1,\ldots,20$ for the BQ and MC estimates. For each sample size, we compute the MC and BQ estimators using the same sample, repeat this process $10^{4}$ times, and then compute the average. The true gradient is analytically tractable and is used for comparison purposes.

Figure 1 shows the mean squared error (MSE) (left column) and the mean absolute angular error (right column) of the MC and BQ estimates of the gradient for several different sample sizes. The absolute angular error is the absolute value of the angle between the true and estimated gradients. In this figure, the BQ gradient estimates were calculated using Model 1 (top row) and Model 2 (bottom row) with sparsification. The error bars in the figures on the right column are the standard errors of the mean absolute angular errors.
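The two error measures can be computed as in the following sketch. It assumes that the squared error is the squared Euclidean distance between the estimated and true gradients, and that angles are measured in radians. Neither convention is stated explicitly in the text, and the function names are illustrative.

```python
import numpy as np

def squared_error(g_est, g_true):
    """Squared Euclidean distance between estimated and true gradients."""
    diff = np.asarray(g_est, float) - np.asarray(g_true, float)
    return float(np.sum(diff ** 2))

def abs_angular_error(g_est, g_true):
    """Absolute angle (radians) between estimated and true gradients."""
    g_est, g_true = np.asarray(g_est, float), np.asarray(g_true, float)
    cos = g_est @ g_true / (np.linalg.norm(g_est) * np.linalg.norm(g_true))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def summarize(estimates, g_true):
    """MSE, mean absolute angular error, and its standard error over runs."""
    se = np.array([squared_error(g, g_true) for g in estimates])
    ang = np.array([abs_angular_error(g, g_true) for g in estimates])
    return se.mean(), ang.mean(), ang.std(ddof=1) / np.sqrt(len(ang))
```

Clipping the cosine guards against round-off pushing it slightly outside $[-1,1]$ when the two vectors are nearly parallel.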

We ran another set of experiments in which we added i.i.d. Gaussian noise to the rewards: $r_{t}=x_{t}^{2}+0.1a_{t}^{2}+n_{r}\;;\;n_{r}\sim{\mathcal{N}}(0,\sigma_{r}^{2})$. Note that in Models 1 and 2, $y(\xi)$, the noisy sample of $f(\xi)$, is of the form $R(\xi)\nabla\mathop{\rm log}\Pr(\xi;{\boldsymbol{\theta}})$ and $R(\xi)$, respectively (see Sections 4.1 and 4.2). Moreover, since each reward $r_{t}$ is a Gaussian random variable with variance $\sigma_{r}^{2}$, the return $R(\xi)=\sum_{t=0}^{T-1}r_{t}$ is also a Gaussian random variable with variance $T\sigma_{r}^{2}$. Therefore in this case, the measurement noise covariance matrices for Models 1 and 2 may be written as ${\boldsymbol{\Sigma}}=T\sigma_{r}^{2}\mathop{\rm diag}\Big(\big(\frac{\partial}{\partial\theta_{i}}\mathop{\rm log}p(\xi_{1};{\boldsymbol{\theta}})\big)^{2},\ldots,\big(\frac{\partial}{\partial\theta_{i}}\mathop{\rm log}p(\xi_{M};{\boldsymbol{\theta}})\big)^{2}\Big)$ and ${\boldsymbol{\Sigma}}=T\sigma_{r}^{2}{\boldsymbol{I}}$, respectively, where $T=20$ is the path length. (In Model 1, ${\boldsymbol{\Sigma}}$ is the measurement noise covariance matrix for the $i$th component of the gradient $\frac{\partial}{\partial\theta_{i}}\eta_{B}({\boldsymbol{\theta}})$. Note that $\frac{\partial}{\partial\theta_{i}}\mathop{\rm log}p(\xi_{j};{\boldsymbol{\theta}})$ depends only on the policy and can be calculated using Equation 5.) We tried two different Gaussian reward noise standard deviations: $\sigma_{r}=0.1\;\text{and}\;1$ in our experiments. Adding noise to the rewards slightly increased the error of the BQ and MC estimates of the gradient. However, the graphs comparing these estimates remained quite similar to those shown in Figure 1. Hence in Figure 2, we compare the MSE (left column) and the mean absolute angular error (right column) of the BQ estimates with and without noise in the rewards as a function of the number of sample paths $M$. In this figure, the noise in the rewards has variance $\sigma_{r}^{2}=1$, and the BQ gradient estimates were calculated using Model 1 (top row) and Model 2 (bottom row) with sparsification.
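The two measurement-noise covariance matrices for the noisy-reward experiment can be assembled as in the sketch below. The function names and the example score values are illustrative assumptions. The formulas are those stated for Models 1 and 2 with path length $T$ and reward noise standard deviation $\sigma_{r}$.

```python
import numpy as np

def noise_cov_model1(score_component, T, sigma_r):
    """Sigma for the i-th gradient component in Model 1.

    score_component : (M,) values of d/dtheta_i log p(xi_j; theta), j = 1..M
    """
    s = np.asarray(score_component, dtype=float)
    return T * sigma_r ** 2 * np.diag(s ** 2)

def noise_cov_model2(M, T, sigma_r):
    """Sigma = T * sigma_r^2 * I for Model 2."""
    return T * sigma_r ** 2 * np.eye(M)

# Hypothetical score values for M = 3 paths, with T = 20 and sigma_r = 0.1.
Sigma1 = noise_cov_model1([0.5, -1.0, 2.0], T=20, sigma_r=0.1)
Sigma2 = noise_cov_model2(3, T=20, sigma_r=0.1)
```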

6.2.2 Policy Optimization

In this section, we use Bayesian policy gradient (BPG) to optimize the policy parameters in the LQR problem. Figure 3 shows the performance of the BPG algorithm with the conventional (BPG) and natural (BPNG) gradient estimates, versus a MC-based policy gradient (MCPG) algorithm, for sample sizes (number of sample paths used to estimate the gradient of each policy) $M=5$ , $10$ , $20$ , and $40$ . We use Algorithm 2 with the number of updates set to $N=100$ , and Model 1 with sparsification for the BPG and BPNG methods. Since Algorithm 2 computes the Fisher information matrix for each set of policy parameters, the estimate of the natural gradient is provided at little extra cost at each step. The returns obtained by these methods are averaged over $10^{4}$ runs. The policy parameters are initialized randomly at each run. In order to ensure that the learned parameters do not exceed an acceptable range, the policy parameters are defined as $\lambda=-1.999+1.998/(1+e^{\kappa_{1}})$ and $\sigma=0.001+1/(1+e^{\kappa_{2}})$ . The optimal solution is $\lambda^{*}\approx-0.92,\;\sigma^{*}=0.001,\;\eta_{B}(\lambda^{*},\sigma^{*})=0.3067$ , corresponding to $\kappa^{*}_{1}\approx-0.16$ and $\kappa^{*}_{2}\rightarrow\infty$ .
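The sigmoid reparameterization confines $\lambda$ to $(-1.999,-0.001)$ and $\sigma$ to $(0.001,1.001)$ for any real $\kappa_{1},\kappa_{2}$. The short sketch below, with the illustrative function name `policy_params`, evaluates the mapping and can be used to check the reported optimum.

```python
import math

def policy_params(kappa1, kappa2):
    """Map unconstrained (kappa1, kappa2) to the bounded (lambda, sigma)."""
    lam = -1.999 + 1.998 / (1.0 + math.exp(kappa1))
    sigma = 0.001 + 1.0 / (1.0 + math.exp(kappa2))
    return lam, sigma

# kappa1 = -0.16 should give lambda close to -0.92; a large kappa2
# (50 is an arbitrary stand-in for "tending to infinity") gives sigma near 0.001.
lam, sigma = policy_params(-0.16, 50.0)
```

Note that `math.exp` overflows for very large arguments, so a practical implementation would clamp $\kappa$ or use a numerically stable sigmoid.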

Figure 3 shows that the MCPG algorithm performs better than BPG and BPNG only for the smallest sample size ( $M=5$ ), whereas for larger samples BPG and BPNG dominate MCPG. The better performance of MCPG for very small sample sizes arises because, with so little data, the Bayesian estimators BPG and BPNG, like any Bayesian estimator, rely more heavily on the prior, and are therefore inaccurate when the prior is not very informative. A similar phenomenon was also reported by Rasmussen03BM. We used two different learning rates for the two components of the gradient. For a fixed sample size, the BPG and MCPG methods start with an initial learning rate and decrease it according to the schedule $\beta_{j}=\beta_{0}\big(20/(20+j)\big)$. The BPNG algorithm uses a fixed learning rate multiplied by the determinant of the Fisher information matrix. We tried many values for the initial learning rates used by these algorithms, and those in Table 3 yielded the best performance of those we tried.
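The two learning-rate rules can be written down directly, as in the sketch below. The function names are illustrative, and the numerical values in the usage lines are placeholders rather than the settings of Table 3.

```python
import numpy as np

def decayed_rate(beta0, j):
    """Schedule beta_j = beta_0 * 20 / (20 + j) used by BPG and MCPG."""
    return beta0 * 20.0 / (20.0 + j)

def bpng_rate(beta, G):
    """Fixed learning rate scaled by the determinant of the Fisher matrix."""
    return beta * np.linalg.det(G)

# Placeholder values, not taken from Table 3.
rate_start = decayed_rate(0.1, 0)
rate_after_20 = decayed_rate(0.1, 20)
rate_bpng = bpng_rate(0.1, np.diag([2.0, 0.5]))
```

Under this schedule the learning rate halves after 20 updates and falls to one sixth of its initial value after 100 updates.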

So far we have assumed that the Fisher information matrix is known. In the next experiment, we estimate it using both MC and a model-based maximum likelihood (ML) method, as discussed in Section 4.3. In the ML approach, we model the transition probability function as $P(x_{t+1}|x_{t},a_{t})={\mathcal{N}}(c_{1}x_{t}+c_{2}a_{t}+c_{3},c_{4}^{2})$, and then estimate its parameters $(c_{1},c_{2},c_{3},c_{4})$ from observed state transitions. Figure 4 shows that the BPG algorithm, when the Fisher information matrix is estimated using ML and MC, still performs better than MCPG. Top and bottom rows contain the results for the BPG algorithm with conventional (BPG-ML and BPG-MC) and natural (BPNG-ML and BPNG-MC) gradient estimates, respectively. Although BPG-ML (BPNG-ML) outperforms BPG-MC (BPNG-MC) for small sample sizes, the difference in their performance disappears as we increase the sample size. One reason for the good performance of BPG-ML is that the form of the state transition function $P(x_{t+1}|x_{t},a_{t})$ has been selected correctly. Here we used the same initial learning rates and learning rate schedules as in the experiments of Figure 3 (see Table 3).
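For a linear-Gaussian transition model of this form, the ML estimates of $(c_{1},c_{2},c_{3})$ coincide with ordinary least squares, and the ML estimate of $c_{4}$ is the root mean squared residual. The sketch below illustrates this on synthetic data. The function name, the data-generating coefficients, and the sample size are hypothetical and are not those of the LQR experiment.

```python
import numpy as np

def fit_transition_model(x, a, x_next):
    """ML estimates of (c1, c2, c3, c4) for x' ~ N(c1*x + c2*a + c3, c4^2)."""
    X = np.column_stack([x, a, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(X, x_next, rcond=None)   # least squares = ML here
    resid = x_next - X @ coef
    c4 = np.sqrt(np.mean(resid ** 2))                   # ML noise std (no bias correction)
    return coef[0], coef[1], coef[2], c4

# Synthetic transitions with hypothetical true parameters (0.9, 0.5, 0.1, 0.1).
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
a = rng.standard_normal(5000)
x_next = 0.9 * x + 0.5 * a + 0.1 + 0.1 * rng.standard_normal(5000)
c1, c2, c3, c4 = fit_transition_model(x, a, x_next)
```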

Although the proposed Bayesian policy gradient algorithm (Algorithm 2) uses only the posterior mean of the gradient in its updates, it can be extended to make judicious use of the second moment information provided by the Bayesian policy gradient estimation algorithm (Algorithm 1). In the last experiment of this section, we use the posterior covariance of the gradient, provided by Algorithm 1, to select the learning rate and the direction of the updates in Algorithm 2. The idea is to use a small learning rate when the variance of the gradient estimate is large, and to have a large update when it is small. We refer to the resulting algorithm by the name BPG-var. This algorithm uses a fixed learning rate parameter (see Table 3) multiplied by $\Big[\big(1+n\big){\boldsymbol{I}}-{\bf Cov}\big(\nabla\eta_{B}({\boldsymbol{\theta}})|{\mathcal{D}}_{M}\big)\Big]/(1+n)$ in its updates. Note that $n+1$ is $b_{0}$ in the calculation of the posterior covariance of the gradient in Model 1 (see Proposition 3), and is used here as an upper bound for the posterior covariance of the gradient. Figure 5 compares the average expected return of BPG-var with BPG and MCPG for sample sizes $M=5$ , $10$ , $20$ , and $40$ . The figure shows that BPG-var performs better than BPG and MCPG for all the sample sizes. It even has a better performance than MCPG for the smallest sample size ( $M=5$ ). Comparing Figures 3 and 5 shows that BPG-var converges faster than BPNG and has similar final performance. As we expected, BPG-var and BPG perform more and more alike as we increase the sample size. This is because by increasing the sample size the estimated gradient (the posterior mean of the gradient), and as a result, the update direction used by BPG becomes more reliable.
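The BPG-var update can be sketched as follows, using the scaling matrix $\big[(1+n){\boldsymbol{I}}-{\bf Cov}\big]/(1+n)$ given in the text. The function name `bpg_var_step` and the example values are illustrative assumptions, and the sketch follows the ascent convention of adding the scaled gradient.

```python
import numpy as np

def bpg_var_step(theta, grad_mean, grad_cov, beta):
    """One BPG-var update: shrink the step where posterior variance is large."""
    n = len(theta)
    scale = ((1.0 + n) * np.eye(n) - grad_cov) / (1.0 + n)
    return theta + beta * scale @ grad_mean

# Hypothetical values: zero posterior covariance recovers the plain BPG step,
# while covariance at the upper bound (1 + n) I suppresses the step entirely.
theta = np.array([0.5, -0.5])
g = np.array([1.0, 2.0])
step_confident = bpg_var_step(theta, g, np.zeros((2, 2)), beta=0.1)
step_uncertain = bpg_var_step(theta, g, 3.0 * np.eye(2), beta=0.1)
```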

In an approach similar to the one used in the experiments of Figure 5, Vien11HM used BQ to estimate the Hessian matrix distribution, and then used its mean as a learning rate schedule to improve the performance of BPG. They empirically showed that their method performs better than BPG and BPNG in terms of convergence speed.

Bayesian Actor-Critic

The models and algorithms of Section 4 consider complete trajectories as the basic observable unit, and thus do not require the dynamics within each trajectory to be of any special form. In particular, it is not necessary for the dynamics to have the Markov property, allowing the resulting algorithms to handle partially observable MDPs, Markov games, and other non-Markovian systems. On the downside, these algorithms cannot take advantage of the Markov property when operating in Markovian systems. Moreover, since their unit of observation is the entire trajectory, their gradient estimates have larger variance, especially when trajectories are long, than those of the algorithms discussed in this section, whose unit of observation is an individual transition (current state, action, next state) and which therefore exploit the Markov property.

In this section, we apply the Bayesian quadrature idea to the policy gradient expression given by Equation 7, i.e.,

and derive a family of Bayesian actor-critic (BAC) algorithms. In this approach, we place a Gaussian process (GP) prior over action-value functions using a prior covariance kernel defined on state-action pairs: $k({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})={\bf Cov}\big[Q({\boldsymbol{z}}),Q({\boldsymbol{z}}^{\prime})\big]$ . We then compute the GP posterior conditioned on the sequence of individual observed transitions. In the same vein as Section 4, by an appropriate choice of a prior on action-value functions, we are able to derive closed-form expressions for the posterior moments of $\nabla\eta({\boldsymbol{\theta}})$ . The main questions here are: 1) how to compute the GP posterior of the action-value function given a sequence of observed transitions? and 2) how to choose a prior for the action-value function that allows us to derive closed-form expressions for the posterior moments of $\nabla\eta({\boldsymbol{\theta}})$ ? Fortunately, well developed machinery for computing the posterior moments of $Q({\boldsymbol{z}})$ is provided in a series of papers by Engel03BM, Engel05RL (for a thorough treatment see Engel05AR). In the next two sections, we will first briefly review some of the main results pertaining to the Gaussian process temporal difference (GPTD) model proposed in Engel05RL, and then will show how they may be combined with the Bayesian quadrature idea in developing a family of Bayesian actor-critic algorithms.

The Gaussian process temporal difference (GPTD) learning (Engel03BM, Engel05RL) model is based on a statistical generative model relating the observed reward signal $r$ to the unobserved action-value function $Q$

where $N({\boldsymbol{z}}_{i},{\boldsymbol{z}}_{i+1})$ is a zero-mean noise signal that accounts for the discrepancy between $r({\boldsymbol{z}}_{i})$ and $Q({\boldsymbol{z}}_{i})-\gamma Q({\boldsymbol{z}}_{i+1})$ . Let us define the finite-dimensional processes ${\boldsymbol{r}}_{t}$ , $Q_{t}$ , $N_{t}$ , and the $t\times(t+1)$ matrix ${\boldsymbol{H}}_{t}$ as follows:

The set of Equations 29 for $i=0,\ldots,t-1$ may be written as ${\boldsymbol{r}}_{t-1}={\boldsymbol{H}}_{t}Q_{t}+N_{t}$ . Under certain assumptions on the distribution of the discounted return random process (Engel05RL), the covariance of the noise vector $N_{t}$ is given by

In episodic tasks, if ${\boldsymbol{z}}_{t-1}$ is the last state-action pair in the episode (i.e., ${\boldsymbol{x}}_{t}$ is a zero-reward absorbing terminal state), ${\boldsymbol{H}}_{t}$ becomes a square $t\times t$ invertible matrix of the form shown in Equation 31 with its last column removed. The effect on the noise covariance matrix ${\boldsymbol{\Sigma}}_{t}$ is that the bottom-right element becomes $1$ instead of $1+\gamma^{2}$ .
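The structure of ${\boldsymbol{H}}_{t}$ and ${\boldsymbol{\Sigma}}_{t}$ can be illustrated with the sketch below. It assumes that ${\boldsymbol{H}}_{t}$ has ones on its diagonal and $-\gamma$ on its superdiagonal, and that the noise covariance takes the form ${\boldsymbol{\Sigma}}_{t}=\sigma^{2}{\boldsymbol{H}}_{t}{\boldsymbol{H}}_{t}^{\top}$. This form is consistent with the remark about the bottom-right element, but it should be checked against the displayed equations. The function name and arguments are illustrative.

```python
import numpy as np

def gptd_matrices(t, gamma, sigma2=1.0, episodic=False):
    """Build H_t and the noise covariance Sigma_t = sigma2 * H_t H_t^T (sketch).

    H_t is t x (t+1) with 1 on the diagonal and -gamma on the superdiagonal.
    For a completed episode the last column is dropped, giving a square matrix.
    """
    H = np.zeros((t, t + 1))
    idx = np.arange(t)
    H[idx, idx] = 1.0
    H[idx, idx + 1] = -gamma
    if episodic:
        H = H[:, :-1]
    Sigma = sigma2 * H @ H.T
    return H, Sigma

H, Sigma = gptd_matrices(4, 0.9)                     # continuing case
He, Se = gptd_matrices(4, 0.9, episodic=True)        # completed episode
```

In the continuing case, `Sigma` is tridiagonal with $1+\gamma^{2}$ on the diagonal and $-\gamma$ next to it. In the episodic case its bottom-right entry is $1$.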

Placing a GP prior on $Q$ and assuming that $N_{t}$ is also normally distributed, we may use Bayes’ rule to obtain the posterior moments of $Q$ :

where ${\mathcal{D}}_{t}$ denotes the observed data up to and including time step $t$ . We used here the following definitions:

Note that $\hat{Q}_{t}({\boldsymbol{z}})$ and $\hat{S}_{t}({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})$ are the posterior mean and covariance functions of the posterior GP, respectively. As more samples are observed, the posterior covariance decreases, reflecting a growing confidence in the Q-function estimate $\hat{Q}_{t}$ .
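A batch version of the GPTD posterior can be sketched as follows, assuming a zero prior mean and the standard forms ${\boldsymbol{\alpha}}_{t}={\boldsymbol{H}}_{t}^{\top}({\boldsymbol{H}}_{t}{\boldsymbol{K}}_{t}{\boldsymbol{H}}_{t}^{\top}+{\boldsymbol{\Sigma}}_{t})^{-1}{\boldsymbol{r}}_{t-1}$ and ${\boldsymbol{C}}_{t}={\boldsymbol{H}}_{t}^{\top}({\boldsymbol{H}}_{t}{\boldsymbol{K}}_{t}{\boldsymbol{H}}_{t}^{\top}+{\boldsymbol{\Sigma}}_{t})^{-1}{\boldsymbol{H}}_{t}$, so that $\hat{Q}_{t}({\boldsymbol{z}})={\boldsymbol{k}}_{t}({\boldsymbol{z}})^{\top}{\boldsymbol{\alpha}}_{t}$ and $\hat{S}_{t}({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})=k({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})-{\boldsymbol{k}}_{t}({\boldsymbol{z}})^{\top}{\boldsymbol{C}}_{t}{\boldsymbol{k}}_{t}({\boldsymbol{z}}^{\prime})$. These expressions should be checked against the definitions in the text. The kernel, rewards, and noise level below are hypothetical.

```python
import numpy as np

def gptd_posterior(K, H, Sigma, r):
    """Return alpha_t and C_t for the GPTD posterior (zero prior mean assumed)."""
    A = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ A @ r          # Q_hat(z) = k_t(z)^T alpha
    C = H.T @ A @ H              # S_hat(z, z') = k(z, z') - k_t(z)^T C k_t(z')
    return alpha, C

# Hypothetical example with t = 3 transitions over 4 state-action pairs.
t, gamma = 3, 0.9
z = np.linspace(0.0, 1.0, t + 1)
K = np.exp(-0.5 * (z[:, None] - z[None, :]) ** 2 / 0.25)
H = np.eye(t, t + 1) - gamma * np.eye(t, t + 1, k=1)
Sigma = 0.1 * H @ H.T
r = np.array([1.0, 0.5, 0.0])
alpha, C = gptd_posterior(K, H, Sigma, r)
post_var = np.diag(K - K @ C @ K)    # posterior variance at the visited pairs
```

The posterior variance at each visited pair never exceeds its prior variance, which reflects the growing confidence mentioned above.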

7.2 A Family of Bayesian Actor-Critic Algorithms

We are now in a position to describe the main idea behind our BAC approach. Making use of the linearity of Equation 7 in $Q$ and denoting ${\boldsymbol{g}}({\boldsymbol{z}};{\boldsymbol{\theta}})=\pi^{\mu}({\boldsymbol{z}})\nabla\mathop{\rm log}\mu({\boldsymbol{a}}|{\boldsymbol{x}};{\boldsymbol{\theta}})$ , we obtain the following expressions for the posterior moments of the policy gradient (Ohagan91BQ):

Substituting the expressions for the posterior moments of $Q$ from Equation 7.1 into Equation 7.2, we obtain

These equations provide us with the general form of the posterior policy gradient moments. We are now left with a computational issue, namely, how to compute the integrals appearing in these expressions? We need to be able to evaluate the following integrals:

Using these definitions, we may write the gradient posterior moments compactly as

In order to render these integrals analytically tractable, we choose our prior covariance kernel to be the sum of an arbitrary state-kernel $k_{x}$ and the (invariant) Fisher kernel $k_{F}$ between state-action pairs (see e.g., ShaweTaylor04KM, Chapter 12). The (policy dependent) Fisher information kernel and our overall state-action kernel are then given by

where ${\boldsymbol{u}}({\boldsymbol{z}};{\boldsymbol{\theta}})$ and ${\boldsymbol{G}}({\boldsymbol{\theta}})$ are the score function and Fisher information matrix (as with ${\boldsymbol{u}}(\xi)$ and ${\boldsymbol{G}}$ defined by Equations 5 and 24, we simplify the notation by omitting the dependence of ${\boldsymbol{u}}$ and ${\boldsymbol{G}}$ on the policy parameters ${\boldsymbol{\theta}}$, and replace ${\boldsymbol{u}}({\boldsymbol{z}};{\boldsymbol{\theta}})$ and ${\boldsymbol{G}}({\boldsymbol{\theta}})$ with ${\boldsymbol{u}}({\boldsymbol{z}})$ and ${\boldsymbol{G}}$ in the sequel), defined as

Although here we have total flexibility in selecting the state kernel, we are restricted to the Fisher kernel for state-action pairs. This restriction may cause an error in approximating some action-value functions $Q$. This error depends on the problem at hand and is hard to quantify. This is exactly the same as selecting an inaccurate prior in any Bayesian algorithm or choosing a wrong representation (function space) in any machine learning algorithm (referred to as approximation error in approximation theory). However, this restriction did not cause a significant error in our experiments (see Section 8), as in almost all of them, the gradients estimated by BAC were more accurate than those estimated by the MC-based method, given the same number of samples.

Note that in Sections 4 to 6 we used a formulation in which the observable unit is a system trajectory, and thus, the expected return and its gradient are defined by Equations 2 and 4. In this formulation, the score function and Fisher information matrix are defined by Equations 5 and 24. However, in the formulation used in this section and in the rest of the paper, where the observable unit is an individual state-action-reward transition, the expected return and its gradient are defined by Equations 3 and 7. In this formulation, the score function and Fisher information matrix are defined by Equations 39 and 40, respectively.

A nice property of the Fisher kernel is that while $k_{F}({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})$ depends on the policy, it is invariant to policy reparameterization. In other words, it only depends on the actual probability mass assigned to each action and not on its explicit dependence on the policy parameters. As mentioned above, another attractive property of this particular choice of kernel is that it renders the integrals in Equation 36 analytically tractable, as made explicit in the following proposition
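The invariance to reparameterization can be checked numerically on a toy case. The following Python sketch is an illustration only and not part of the BAC algorithm: it assumes a single state with two actions and a sigmoid policy, estimates the score by central finite differences, and compares $k_{F}$ under the original parameter $\theta$ with $k_{F}$ under a hypothetical nonlinear reparameterization $\theta=\phi^{3}$ . All function names and the map `to_theta` are assumptions introduced for the example.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_prob(action, theta):
    """log mu(action; theta) for a two-action sigmoid policy."""
    p_right = sigmoid(theta)
    return math.log(p_right if action == "right" else 1.0 - p_right)

def score(action, param, to_theta, eps=1e-6):
    """Central finite-difference estimate of d/dparam log mu(action)."""
    hi = log_prob(action, to_theta(param + eps))
    lo = log_prob(action, to_theta(param - eps))
    return (hi - lo) / (2.0 * eps)

def fisher_kernel(a, a_prime, param, to_theta):
    """k_F(a, a') = u(a) * G^{-1} * u(a') for a scalar parameter."""
    p_right = sigmoid(to_theta(param))
    probs = {"right": p_right, "left": 1.0 - p_right}
    u = {act: score(act, param, to_theta) for act in probs}
    G = sum(probs[act] * u[act] ** 2 for act in probs)  # Fisher information
    return u[a] * u[a_prime] / G

theta = 0.7
# Same policy, two parameterizations: theta itself, and phi with theta = phi**3.
k_theta = fisher_kernel("right", "left", theta, lambda t: t)
k_phi = fisher_kernel("right", "left", theta ** (1.0 / 3.0), lambda f: f ** 3)
```

The two kernel values agree up to finite-difference error, because the Jacobian of the reparameterization multiplies both score factors and is divided out by the Fisher information.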

where ${\boldsymbol{U}}_{t}=\big[{\boldsymbol{u}}({\boldsymbol{z}}_{0}),{\boldsymbol{u}}({\boldsymbol{z}}_{1}),\ldots,{\boldsymbol{u}}({\boldsymbol{z}}_{t})\big]$ .

An immediate consequence of Proposition 6 is that, in order to compute the posterior moments of the policy gradient, we only need to be able to evaluate (or estimate) the score vectors ${\boldsymbol{u}}({\boldsymbol{z}}_{i}),\;i=0,\ldots,t$ and the Fisher information matrix ${\boldsymbol{G}}$ of our policy. Evaluating the Fisher information matrix ${\boldsymbol{G}}$ is somewhat more challenging, since on top of taking the expectation with respect to the policy $\mu(a|x;{\boldsymbol{\theta}})$ , computing ${\boldsymbol{G}}$ involves an additional expectation over the state-occupancy density $\nu^{\mu}(x)$ , which is not generally known. In most practical situations we therefore have to resort to estimating ${\boldsymbol{G}}$ from data. When $\nu^{\mu}$ in the definition of the Fisher information matrix (Equation 40) is the stationary distribution over states under policy $\mu$ , one straightforward method to estimate ${\boldsymbol{G}}$ from a trajectory ${\boldsymbol{z}}_{0},{\boldsymbol{z}}_{1},\ldots,{\boldsymbol{z}}_{t}$ is to use the (unbiased) estimator (see Proposition 6 for the definition of ${\boldsymbol{U}}_{t}$ ):
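A minimal Python sketch of such a sample-average estimator is given below. It assumes the estimator takes the form $\hat{{\boldsymbol{G}}}_{t}=\frac{1}{t+1}{\boldsymbol{U}}_{t}{\boldsymbol{U}}_{t}^{\top}$ , that is, the average of the outer products of the observed score vectors; the function name `estimate_fisher` and the list-of-lists representation are assumptions made for illustration.

```python
def estimate_fisher(scores):
    """Average of outer products u(z_i) u(z_i)^T over the observed scores.

    scores: list of t+1 score vectors (the columns of U_t), each of length n.
    Returns an n x n matrix as a list of lists.
    """
    m = len(scores)
    n = len(scores[0])
    G_hat = [[0.0] * n for _ in range(n)]
    for u in scores:
        for i in range(n):
            for j in range(n):
                G_hat[i][j] += u[i] * u[j] / m
    return G_hat

# Two hypothetical score vectors in a 2-parameter policy.
G_hat = estimate_fisher([[1.0, 0.0], [0.0, 2.0]])
```

Because each term is an outer product, the estimate is symmetric and positive semi-definite by construction, but it can be singular when fewer than $n$ linearly independent scores have been observed.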

In case $\nu^{\mu}$ in Equation 40 is a discounted weighting of states encountered by following policy $\mu$ (as is the case in this paper), a method for estimating ${\boldsymbol{G}}$ from a number of trajectories is shown in Algorithm 3. Note that $(1-\gamma)\nu^{\mu}$ corresponds to the distribution of a Markov chain that starts from a state sampled according to $P_{0}$ and at each step either follows the policy $\mu$ with probability $\gamma$ or restarts from a new initial state drawn from $P_{0}$ with probability $1-\gamma$ . Since each step triggers a restart independently with probability $1-\gamma$ , the number of steps between two successive restarts is geometrically distributed, and its average is $1/(1-\gamma)$ .
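The value $1/(1-\gamma)$ for the average restart interval can be verified with a short Python sketch. It is illustrative only: it models just the restart coin flips of the chain described above, not the policy or the state transitions, and the function names are hypothetical.

```python
import random

def mean_restart_interval(gamma, max_k=100000):
    """Truncated series sum_k k * gamma^(k-1) * (1 - gamma)."""
    return sum(k * gamma ** (k - 1) * (1.0 - gamma) for k in range(1, max_k + 1))

def simulate_mean_restart_interval(gamma, n_restarts, rng):
    """Empirical mean number of steps between successive restarts."""
    total = 0
    for _ in range(n_restarts):
        steps = 1
        # Continue (follow the policy) with probability gamma.
        while rng.random() < gamma:
            steps += 1
        total += steps
    return total / n_restarts

exact = mean_restart_interval(0.99)
empirical = simulate_mean_restart_interval(0.9, 20000, random.Random(0))
```

For $\gamma=0.99$ , the discount factor used in the experiments, the series evaluates to approximately $100$ steps between restarts.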

Algorithm 4 is a pseudocode sketch of the Bayesian actor-critic algorithm, using either the conventional gradient or the natural gradient in the policy update, and with ${\boldsymbol{G}}$ estimated using either $\hat{{\boldsymbol{G}}}_{t}$ in Equation 42 or $\hat{{\boldsymbol{G}}}({\boldsymbol{\theta}})$ in Algorithm 3.

7.3 BAC Online Sparsification

Using the sparsification method described above, the posterior moments of the gradient are approximated as

BAC Experimental Results

In this section, we empirically evaluate the performance of the Bayesian actor-critic method presented in this paper in a 10-state random walk problem as well as in the widely used continuous-state-space mountain car problem (Sutton98IR) and ship steering problem (Miller90NN). (The code for all the experiments of this section is available at https://sequel.lille.inria.fr/Software/BAC.) In Section 8.1, we first compare BAC, Bayesian quadrature (BQ), and Monte Carlo (MC) gradient estimates in the 10-state random walk problem. We then evaluate the performance of the BAC algorithm on the same problem, and compare it with a Bayesian policy gradient (BPG) algorithm and an MC-based policy gradient (MCPG) algorithm. In Section 8.2, we compare the performance of the BAC algorithm with an MCPG algorithm on the mountain car problem. The BPG, BAC, and MCPG algorithms used in our experiments are Algorithms 2 and 4 presented in this paper, and Algorithm 1 in Baxter01IP, respectively. In Section 8.3, we compare the performance of the BAC algorithm with an MCPG algorithm on a problem in the ship steering domain. Similar to Section 8.2, the BAC and MCPG algorithms used in our experiments are Algorithm 4 presented in this paper and Algorithm 1 in Baxter01IP, respectively.

In this section, we consider a 10-state random walk problem, ${\mathcal{X}}=\{1,2,\ldots,10\}$ , with states arranged linearly from state 1 on the left to state 10 on the right. The agent has two actions to choose from: ${\mathcal{A}}=\{left,right\}$ . The left wall is a retaining barrier, meaning that if the $left$ action is taken at state 1, in the next time-step the state will be 1 again. State 10 is a zero reward absorbing state. The only stochasticity in the transitions is induced by the policy, which is defined as $\mu(right|x)=1/\big(1+\exp(-\theta_{x})\big)$ and $\mu(left|x)=1-\mu(right|x)$ , for all $x\in{\mathcal{X}}$ . Note that each state $x$ has an independent parameter $\theta_{x}$ . Each episode begins at state 1 and ends when the agent reaches state 10. The mean reward is 1 for states 1–9 and is 0 for state 10. The observed rewards for states 1–9 are obtained by corrupting the mean rewards with i.i.d. Gaussian noise of standard deviation 0.1. The discount factor is $\gamma=0.99$ . In the BAC experiments, we use the Gaussian state kernel $k_{x}(x,x^{\prime})=\exp(-||x-x^{\prime}||^{2}/(2\sigma_{k}^{2}))$ with $\sigma_{k}=3$ and the state-action kernel $0.01k_{F}({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})$ .
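The random walk described above is simple enough to simulate in a few lines. The Python sketch below is a hedged illustration rather than the experimental code: it assumes that the noisy reward of the current state is collected before each transition, and the names `mu_right` and `run_episode` are hypothetical.

```python
import math
import random

N_STATES = 10  # states 1..10; state 10 is absorbing

def mu_right(theta_x):
    """Probability of the 'right' action under the sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta_x))

def run_episode(theta, rng, gamma=0.99, noise_std=0.1, max_steps=10**6):
    """Simulate one episode from state 1; return (steps, discounted return).

    theta maps each state x in 1..9 to its parameter theta_x.
    Assumption: the noisy reward of the current state is collected
    before each transition.
    """
    x, steps, ret, discount = 1, 0, 0.0, 1.0
    while x != N_STATES and steps < max_steps:
        ret += discount * (1.0 + rng.gauss(0.0, noise_std))
        discount *= gamma
        if rng.random() < mu_right(theta[x]):
            x += 1
        else:
            x = max(1, x - 1)  # retaining barrier at state 1
        steps += 1
    return steps, ret

# A policy that moves right with probability 0.82 in every state.
theta = {x: math.log(41 / 9) for x in range(1, N_STATES)}
steps, ret = run_episode(theta, random.Random(0))
```

The `max_steps` cap mirrors the $10^{6}$ -step limit used later in this section to detect policies under which an episode does not end.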

We first compare the MC, BQ, and BAC estimates of $\nabla\eta({\boldsymbol{\theta}})$ for the policy induced by the parameters $\theta_{x}=\mathop{\rm log}(41/9)$ for all $x\in{\mathcal{X}}$ , which is equivalent to $\mu(right|x)=0.82$ . We use several different sample sizes: $M=5j,\;j=1,\ldots,20$ . Here, by sample size we mean the number of episodes used to estimate the gradient. For each value of $M$ , we compute the gradient estimates $10^{3}$ times. The true gradient is calculated analytically for reference. Figure 6 shows the mean squared error and the mean absolute angular error of MC, BQ, and BAC estimates of the gradient for different sample sizes $M$ . The error bars in the right figure are the standard errors of the mean absolute angular errors. The results depicted in Figure 6 indicate that the BAC gradient estimates are more accurate and have lower variance than their MC and BQ counterparts.
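The policy value and the two error measures used in this comparison can be reproduced with a short Python sketch. It assumes that the angular error is the angle between the estimated and the true gradient vectors, and that the squared error is the squared Euclidean distance between them; the function names are hypothetical.

```python
import math

def mu_right(theta_x):
    """Probability of the 'right' action under the sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta_x))

def squared_error(g_hat, g_true):
    """Squared Euclidean distance between two gradient vectors."""
    return sum((a - b) ** 2 for a, b in zip(g_hat, g_true))

def angular_error(g_hat, g_true):
    """Angle (in radians) between two gradient vectors."""
    dot = sum(a * b for a, b in zip(g_hat, g_true))
    norms = math.sqrt(sum(a * a for a in g_hat)) * math.sqrt(sum(b * b for b in g_true))
    # Clip to [-1, 1] to guard against rounding before taking acos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

p = mu_right(math.log(41 / 9))  # policy used in the comparison
```

With $\theta_{x}=\log(41/9)$ the sigmoid gives $1/(1+9/41)=41/50$ , which confirms the stated value $\mu(right|x)=0.82$ . Averaging `squared_error` and the absolute value of `angular_error` over repeated gradient estimates yields the two quantities plotted against $M$ .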

Next, we use BAC to optimize the policy parameters and compare its performance with a BPG algorithm and an MCPG algorithm for $M=1,\;25,\;50$ , and $75$ . The BPG algorithm uses Model 1 of Section 4.1. We use Algorithm 4 with the number of policy updates set to $500$ and the same kernels as in the previous experiment. The Fisher information matrix is estimated using Algorithm 3. The returns obtained by these methods are averaged over $10^{3}$ runs. For a fixed sample size $M$ , we tried many values of the learning rate, $\beta$ , for MCPG, BPG, and BAC, and those in Table 4 yielded the best performance. Note that the learning rate used for each algorithm in each experiment is fixed and does not converge to zero. BAC was very robust to changes in the learning rate, in the sense that it never generated a policy for which an episode does not end after $10^{6}$ steps. This seems to be due to the fact that BAC gradient estimates are more accurate and have less variance than their MC and BPG counterparts. The performance of BPG improves as we increase the sample size $M$ . It performs worse than MCPG for $M=1$ and $25$ , but achieves a performance similar to BAC for $M=100$ .

Figure 7 depicts the results of these experiments. From left to right and top to bottom the sub-figures correspond to the experiment in which all the algorithms used $M=1,\;25,\;50,$ and $75$ trajectories per policy update, respectively. Each curve depicts the difference between the exact average discounted return for the $500$ policies that follow each policy update and $\eta^{*}$ – the optimal average discounted return. All curves are averaged over $10^{3}$ repetitions of the experiment. The BAC algorithm clearly learns significantly faster than the other algorithms (note that the vertical scale is logarithmic).

Remark: Since BQ (and as a result BPG) is based on defining a kernel over system trajectories (quadratic Fisher kernel in Model 1 and Fisher kernel in Model 2), its performance degrades when the system generates trajectories of different lengths. This effect can be observed with most kernels that have been used in the literature for the trajectories generated by dynamical systems. It can also be observed in our experiments: BQ performs much better than MC in the “Linear Quadratic Regulator” problem (Section 6.2), in which all the system trajectories are of length 20, while its superiority over MC is less apparent in the “Random Walk” problem (Section 8.1). This is why we do not use BQ and BPG in the “Mountain Car” (Section 8.2) and “Ship Steering” (Section 8.3) problems, in which the system trajectories have different lengths.

8.2 Mountain Car

In this section, we consider the mountain car problem as formulated in Sutton98IR, and report the results of applying the BAC and MCPG algorithms to optimize the policy parameters in this task. The state ${\boldsymbol{x}}$ consists of the position $x$ and the velocity $\dot{x}$ of the car: ${\boldsymbol{x}}=(x,\dot{x})$ . The reward is $-1$ on all time steps until the car reaches its goal at the top of the hill, which ends the episode. There are three possible actions: forward, reverse, and zero. The car moves according to the following simplified dynamics:

When $x_{t+1}$ reaches the left boundary, $\dot{x}_{t+1}$ is set to zero and when it reaches the right boundary, the goal is reached and the episode ends. Each episode starts from a random position and velocity uniformly sampled from their domains. We use the discount factor $\gamma=0.99$ .
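For orientation, the following Python sketch implements one step of the mountain car in the standard textbook formulation of Sutton98IR. The constants (position range $[-1.2,0.5]$ , velocity range $[-0.07,0.07]$ , thrust $0.001$ , gravity term $0.0025\cos(3x)$ ) are taken from that formulation and are an assumption here; the simplified dynamics used in our experiments may differ in detail. The function name `step` and the action encoding are hypothetical.

```python
import math

X_MIN, X_MAX = -1.2, 0.5     # assumed position bounds (textbook values)
V_MIN, V_MAX = -0.07, 0.07   # assumed velocity bounds (textbook values)

def step(x, v, action):
    """One transition; action is -1 (reverse), 0 (zero) or +1 (forward).

    Returns (next position, next velocity, reward, episode finished).
    """
    v_next = v + 0.001 * action - 0.0025 * math.cos(3.0 * x)
    v_next = min(max(v_next, V_MIN), V_MAX)
    x_next = x + v_next
    done = False
    if x_next <= X_MIN:
        # Left boundary: the car stops against the wall.
        x_next, v_next = X_MIN, 0.0
    elif x_next >= X_MAX:
        # Right boundary: the goal is reached and the episode ends.
        x_next, done = X_MAX, True
    return x_next, v_next, -1.0, done

x1, v1, r1, done1 = step(-1.2, -0.07, -1)
```

The reward of $-1$ per step means that the (undiscounted) return of an episode is the negative of the number of steps to the goal, which is the quantity reported in the evaluation.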

In order to define the policy, we first map the states ${\boldsymbol{x}}=(x,\dot{x})$ to the unit square $[0,1]\times[0,1]$ . The policy used in our experiments has the following form:

The policy feature vector is defined as $\phi({\boldsymbol{x}},a_{i})=\big(\phi({\boldsymbol{x}})^{\top}\delta_{a_{1}a_{i}},\phi({\boldsymbol{x}})^{\top}\delta_{a_{2}a_{i}},\phi({\boldsymbol{x}})^{\top}\delta_{a_{3}a_{i}}\big)^{\top}$ , where $\delta_{a_{j}a_{i}}$ is $1$ if $a_{j}=a_{i}$ , and is $0$ otherwise. The state feature vector $\phi({\boldsymbol{x}})$ is composed of $16$ Gaussian functions arranged in a $4\times 4$ grid over the unit square as follows:

where the $\bar{{\boldsymbol{x}}}_{i}$ ’s are the $16$ points of the grid $\{0,0.25,0.5,1\}\times\{0,0.25,0.5,1\}$ and $\kappa=1.3\times 0.25$ .
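The feature construction can be sketched in Python as follows. Two assumptions are made and should be read as such: each Gaussian feature is taken to be $\exp\big(-||{\boldsymbol{x}}-\bar{{\boldsymbol{x}}}_{i}||^{2}/(2\kappa^{2})\big)$ , and the policy is taken to be a softmax over the linear preferences ${\boldsymbol{\theta}}^{\top}\phi({\boldsymbol{x}},a_{i})$ . The grid points and $\kappa$ are those stated in the text; all function names are hypothetical.

```python
import math

GRID = [0.0, 0.25, 0.5, 1.0]                        # grid values as stated in the text
CENTERS = [(cx, cy) for cx in GRID for cy in GRID]  # 16 centers
KAPPA = 1.3 * 0.25

def state_features(x):
    """16 Gaussian bumps evaluated at a state x in the unit square (assumed form)."""
    return [
        math.exp(-((x[0] - cx) ** 2 + (x[1] - cy) ** 2) / (2.0 * KAPPA ** 2))
        for cx, cy in CENTERS
    ]

def state_action_features(x, action_index, n_actions=3):
    """Copy phi(x) into the block of the chosen action; zeros elsewhere."""
    phi = state_features(x)
    out = []
    for j in range(n_actions):
        out.extend(phi if j == action_index else [0.0] * len(phi))
    return out

def policy(x, theta, n_actions=3):
    """Softmax action probabilities (assumed policy form)."""
    prefs = [
        sum(t * f for t, f in zip(theta, state_action_features(x, i, n_actions)))
        for i in range(n_actions)
    ]
    top = max(prefs)  # subtract the maximum for numerical stability
    weights = [math.exp(p - top) for p in prefs]
    total = sum(weights)
    return [w / total for w in weights]

probs = policy((0.5, 0.5), [0.0] * 48)
```

With three actions and $16$ state features, the parameter vector has $48$ components, one block of $16$ per action.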

In Figure 8, we compare the performance of BAC with an MCPG algorithm for $M=5,\;10,\;20,$ and $40$ episodes used to estimate each gradient. For BAC, we use Algorithm 4 with the number of policy updates set to $500$ , a Gaussian state kernel $k_{x}({\boldsymbol{x}},{\boldsymbol{x}}^{\prime})=\exp\big(-||{\boldsymbol{x}}-{\boldsymbol{x}}^{\prime}||^{2}/(2\sigma_{k}^{2})\big)$ , with $\sigma_{k}=1.3\times 0.25$ , and the state-action kernel $k_{F}({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})$ . The Fisher information matrix is estimated using Algorithm 3. After every $50$ policy updates the learned policy is evaluated for $10^{3}$ episodes to accurately estimate the average number of steps to goal. Each evaluation episode starts from a random position and velocity uniformly chosen from their ranges, and continues until the car either reaches the goal or a limit of $200$ time-steps is exceeded. The experiment is repeated $100$ times for the entire horizontal axis to obtain average results and confidence intervals. The error bars in this figure are the standard errors of the performance of the algorithms.

For a fixed sample size $M$ , each method starts with an initial learning rate and decreases it according to the schedule $\beta_{t}=\beta_{0}\beta_{c}/(\beta_{c}+t)$ . We tried many values of the learning rate parameters $(\beta_{0},\beta_{c})$ for MCPG and BAC, and those in Table 5 yielded the best performance. Note that $\beta_{c}=\infty$ means that we used a fixed learning rate $\beta_{0}$ for that experiment. The graphs indicate that BAC performs better and has lower variance than MCPG. It is able to find a good policy with a sample size of only $M=5$ , and its performance does not change much as the sample size is increased. On the other hand, the performance of MCPG improves and its variance is reduced as we increase the sample size. Note that for $M=40$ , MCPG finally achieves a performance similar to that of BAC, albeit at a slower rate.
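The learning rate schedule is easy to implement, with one numerical caveat for the fixed-rate case. The Python sketch below uses the algebraically equivalent form $\beta_{t}=\beta_{0}/(1+t/\beta_{c})$ , so that $\beta_{c}=\infty$ can be passed as `math.inf`; evaluating $\beta_{0}\beta_{c}/(\beta_{c}+t)$ literally with an infinite $\beta_{c}$ would produce `nan` in floating point. The function name is hypothetical and the numerical values are examples, not the entries of Table 5.

```python
import math

def learning_rate(t, beta0, beta_c):
    """beta_t = beta0 * beta_c / (beta_c + t), rewritten as beta0 / (1 + t / beta_c).

    The rewritten form stays well defined for beta_c = math.inf,
    where it returns the fixed rate beta0.
    """
    return beta0 / (1.0 + t / beta_c)

decayed = [learning_rate(t, 0.1, 10.0) for t in (0, 10, 30)]
fixed = learning_rate(1000, 0.1, math.inf)
```

The parameter $\beta_{c}$ sets the time scale of the decay: the rate is halved at $t=\beta_{c}$ and decays as $1/t$ for $t\gg\beta_{c}$ .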

8.3 Ship Steering

In this section, we perform comparative experiments between BAC and MCPG on a more challenging problem in the continuous-state, continuous-action ship steering domain (Miller90NN).

In this domain, a ship is located in a $150\times 150$ meter square water surface. At any point in time $t$ , the state of the ship is described by four continuous variables that are defined below along with their range of values

and then map it to the allowed range $[-15^{\circ},15^{\circ}]$ using the sigmoid transformation

For the BAC experiments, we used the Gaussian state kernel $k_{x}({\bf x},{\bf x}^{\prime})=\exp(-||{\bf x}-{\bf x}^{\prime}||^{2}/(2\sigma_{k}^{2}))$ , with $\sigma_{k}=1$ and the state-action kernel $k_{F}({\boldsymbol{z}},{\boldsymbol{z}}^{\prime})$ , i.e., the Fisher kernel.

Second, we calculate the gradient using the online sparsification procedure described in Section 4.4. Finally, we never explicitly calculate the inverse of the Fisher information matrix $\hat{{\boldsymbol{G}}}$ and instead calculate the product of $\hat{{\boldsymbol{G}}}^{-1}$ with the score. For numerical stability, we also add $10^{-6}$ to the diagonal of $\hat{{\boldsymbol{G}}}$ .
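One way to obtain the product of $\hat{{\boldsymbol{G}}}^{-1}$ with a vector without forming the inverse is to solve the regularized linear system directly. The Python sketch below does this with a Cholesky factorization; the choice of solver is an assumption made for illustration, since the text only states that the inverse is never formed, and the function names are hypothetical.

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_regularized(G_hat, u, ridge=1e-6):
    """Solve (G_hat + ridge * I) x = u without forming the inverse."""
    n = len(G_hat)
    A = [[G_hat[i][j] + (ridge if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(A)
    y = [0.0] * n
    for i in range(n):                 # forward substitution: L y = u
        y[i] = (u[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):       # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

x = solve_regularized([[2.0, 0.0], [0.0, 4.0]], [2.0, 4.0])
```

The diagonal term keeps the factorization well defined even when the estimated matrix is singular, for example when it has been built from fewer score vectors than there are policy parameters.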

Similar to the other experiments in the paper, we varied the number of trajectories used to estimate the gradient of a policy as $M=5$ , $10$ , and $20$ . Table 6 shows the best values of the learning rate $\beta$ for both MCPG and BAC for different values of $M$ . To evaluate each method, we ran $100$ independent learning trials. At each trial, we evaluate the performance of the policy every $100$ iterations by executing it $100$ (independent) times with $\theta_{1}$ and $\dot{\theta}_{1}$ randomly sampled. For each of these executions, we observe whether the ship reached $(x_{*},y_{*})$ within $500$ steps and estimate the policy success ratio. We set the total number of gradient updates to $T=3000$ for $M=5$ and $10$ and to $T=1000$ for $M=20$ .

The results for all the experiments are presented in Figure 9 along with their standard deviations. Naturally, using more trajectories for the gradient update improves both methods. However, this improvement is larger for the BAC method. In the case of $M=5$ , MCPG produces slightly better policies at the beginning of learning, but is soon outperformed by BAC. For $M=10$ and $20$ , BAC produces better policies from the beginning, especially for $M=20$ . This is consistent with the results of the other experimental domains reported in the paper. For all values of $M$ , BAC converges to a policy with a better success ratio than MCPG. Finally, as expected, BAC usually has less variance in its performance than MCPG.

Other Advancements in Bayesian Reinforcement Learning

The algorithms presented in this paper belong to the class of Bayesian model-free RL, as they do not assume that the system’s dynamics are known and do not explicitly construct a model of the system. In recent years, Bayesian methodology has been used to develop algorithms in several other areas of RL. In this section, we provide a brief overview of these results (for more details, see the survey by Ghavamzadeh15BR).

Another widely-used class of RL algorithms are those that build an explicit model of the system and use it to find a good (or optimal) policy, and are thus known as model-based RL algorithms. Recent years have witnessed many applications of the Bayesian methodology to this class of RL algorithms. The main idea of model-based Bayesian RL is to explicitly maintain a posterior over the model parameters and to use it to select actions in order to appropriately balance exploration and exploitation. The class of model-based Bayesian RL algorithms includes those that work with MDPs and those that work with POMDPs (e.g., Ross08BAPOMDP, doshi08). The MDP-based algorithms can be further divided into those that are offline (e.g., duff01bamdpfsc, poupart06beetle), those that are online (e.g., dearden99model, strens00, wang05sparse, Ross08BAPOMDP), and those that have probably approximately correct (PAC) guarantees (e.g., kolter09, asmuth09, sorg10).

The use of Bayesian methodology has also been explored to solve the inverse RL (IRL) problem, i.e., learning the underlying model of the decision-making agent (expert) from its observed behavior and the dynamics of the system (Russell98LA). The main idea of Bayesian IRL (BIRL) is to use a prior to encode the reward preference and to formulate the compatibility with the expert’s policy as a likelihood in order to derive a probability distribution over the space of reward functions, from which the expert’s reward function is somehow extracted. The most notable works in the area of BIRL include those by Ramachandran07BI, Choi11MA, Choi12NV, Michini12BN, Michini12IE.

Bayesian techniques have also been used to derive algorithms for the collaborative multi-agent RL problem. When dealing with multi-agent systems, the complexity of the decision problem is increased in the following way: while single-agent BRL requires maintaining a posterior over the MDP parameters (in the case of model-based methods) or over the value/policy (in the case of model-free methods), in multi-agent BRL, it is also necessary to keep a posterior over the policies of the other agents. Chalkiadakis13CM showed that this belief can be maintained in a tractable manner subject to certain structural assumptions on the domain, for example that the strategies of the agents are independent of each other.

Multi-task RL (MTRL) is another area that has witnessed the application of Bayesian methodology. All approaches to MTRL assume that the tasks share similarity in some components of the problem such as dynamics, reward structure, or value function. The Bayesian MTRL methods assume that the shared components are drawn from a common generative model (Wilson07MR, Mehta08TV, Lazaric10BM). In Mehta08TV, tasks share the same dynamics and reward features, and only differ in the weights of the reward function. The proposed method initializes the value function for a new task using the previously learned value functions as a prior. Wilson07MR and Lazaric10BM both assume that the distribution over some components of the tasks is drawn from a hierarchical Bayesian model.

Bayesian learning methods have also been used for regret minimization in multi-armed bandits. This area, which goes back to the seminal work of Gittins79BP, has become very active with the Bayesian version of the upper confidence bound (UCB) algorithm (Kaufmann12BU) and the recent advancements in the analysis of Thompson Sampling (Agrawal11AT, Kaufmann12TS, agrawal2013further, agrawal2013thompson, Russo14IT, Gopalan14TS, guha2014stochastic, Liu2015prior) and its state-of-the-art empirical performance (Scott10MB, Chapelle11EE), which has also led to its use in several industrial applications (Graepel10WB, Tang13AA).

Discussion

In this paper, we first proposed an alternative approach to the conventional frequentist (Monte-Carlo based) policy gradient estimation procedure. Our approach is based on Bayesian quadrature (Ohagan91BQ), a Bayesian method for integral evaluation. The idea is to model the gradient of the expected return with respect to the policy parameters, which is of the form of an integral, as Gaussian processes (GPs). This is done by dividing the integrand into two parts, treating one as a random function (or random field), whose random nature reflects our subjective uncertainty concerning its true identity. This allows us to incorporate our prior knowledge of this term (part) into its prior distribution. Observing (possibly noisy) samples of this term allows us to employ Bayes’ rule to compute a posterior distribution of it conditioned on these samples. This in turn induces a posterior distribution over the value of the integral, which is the gradient of the expected return. By properly partitioning the integrand and by appropriately selecting a prior distribution, a closed-form expression for the posterior moments of the gradient of the expected return is obtained. We proposed two different ways of partitioning the integrand resulting in two distinct Bayesian models. For each model, we showed how the posterior moments of the gradient conditioned on the observed data are calculated. In line with previous work on Bayesian quadrature, our Bayesian approach tends to significantly reduce the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient and the gradient covariance are provided at little extra cost. We performed detailed experimental comparisons of the Bayesian policy gradient (BPG) algorithms presented in the paper with classic Monte-Carlo based algorithms on a bandit problem as well as on a linear quadratic regulator problem. 
The experimental results are encouraging, but we conjecture that even better gains may be attained using this approach. This calls for additional theoretical and empirical work. It is important to note that the gradient estimated by Algorithm 1 may be employed in conjunction with conjugate-gradient and line-search methods for making better use of the gradient information. We also showed that the models and algorithms presented in this paper can be extended to partially observable problems without any change along the same lines as Baxter01IP. This is due to the fact that our BPG framework considers complete system trajectories as its basic observable unit, and thus, does not require the dynamics within each trajectory to be of any special form. This generality has the downside that our proposed framework cannot take advantage of the Markov property when the system is Markovian.

To address this issue, we then extended our BPG framework to actor-critic algorithms and presented a new Bayesian take on the actor-critic architecture. By using GPs and choosing their prior distributions to make them compatible with a parametric family of policies, we were able to derive closed-form expressions for the posterior distribution of the policy gradient updates. The posterior mean is used to update the policy and the posterior covariance to gauge the reliability of this update. Our Bayesian actor-critic (BAC) framework uses individual state-action-reward transitions as its basic observable unit, and thus, is able to take advantage of the Markov property of the system trajectories (when the system is indeed Markovian). This improvement seems to be borne out in our experiments, where BAC provides more accurate estimates of the policy gradient than either of the two BPG models for the same amount of data. Similar to BPG, another feature of BAC is that its natural-gradient variant is obtained at little extra cost. For both BPG and BAC, we derived the sparse form of the algorithms, which would make them significantly more time and memory efficient. Finally, we performed an experimental evaluation of the BAC algorithm, comparing it with classic Monte-Carlo based policy gradient algorithms, as well as our BPG algorithms, on a random walk problem, the widely used mountain car problem (Sutton98IR), and the continuous state and continuous action ship steering domain (Miller90NN).

Additional experimental work is required to investigate the behavior of BPG and BAC algorithms in larger and more realistic domains, involving continuous and high-dimensional state and action spaces. The BPG and BAC algorithms proposed in the paper use only the posterior mean of the gradient in their updates. We conjecture that the second-order statistics obtained from BPG and BAC (both in the actor and critic) may be used to devise more efficient algorithms. In one of the experiments in Section 6, we employed the covariance information provided by Algorithm 1 for risk-aware selection of the step size in the gradient updates, which showed promising performance. Other interesting directions for future work include 1) investigating other possible partitions of the integrand in the expression for $\nabla\eta_{B}({\boldsymbol{\theta}})$ into a GP term and a deterministic term, 2) using other types of kernel functions such as sequence kernels, 3) combining our approach with MDP model estimation to allow transfer of learning between different policies (model-based Bayesian policy gradient), and 4) investigating more efficient methods for estimating the Fisher information matrix. Another direction is to derive a fully non-parametric actor-critic algorithm. In BAC, the critic is based on Gaussian process temporal difference learning, which is a non-parametric method, while the actor uses a family of parameterized policies. The idea here would be to replace the actor in the BAC algorithm with a non-parametric actor that performs gradient search in a function space (e.g., a reproducing kernel Hilbert space) of policies.

Part of the computational experiments was conducted using the Grid’5000 experimental testbed (https://www.grid5000.fr). Yaakov Engel was supported by an Alberta Ingenuity fellowship.

A Proof of Proposition 3

We start the proof with the $M\times 1$ vector ${\boldsymbol{b}}$ , whose $i$ th element can be written as

(a) substitutes $k(\xi,\xi_{i})$ with the quadratic Fisher kernel from Equation 23, (b) is algebra, (c) follows from (i) $\int\Pr(\xi;{\boldsymbol{\theta}})d\xi=1$ , and (ii) $\int{\boldsymbol{u}}(\xi)\Pr(\xi;{\boldsymbol{\theta}})d\xi=\int\nabla\mathop{\rm log}\Pr(\xi;{\boldsymbol{\theta}})\Pr(\xi;{\boldsymbol{\theta}})d\xi$ $=\int\nabla\Pr(\xi;{\boldsymbol{\theta}})d\xi=\nabla\int\Pr(\xi;{\boldsymbol{\theta}})d\xi=\nabla(1)=0$ , (d) is the result of replacing the integral with the Fisher information matrix ${\boldsymbol{G}}$ , (e) is algebra, and thus, the claim follows. Now the proof for the scalar $b_{0}$

(a) substitutes $k(\xi,\xi^{\prime})$ with the quadratic Fisher kernel from Equation 23, (b) is algebra, (c) follows from (i) $\iint\Pr(\xi;{\boldsymbol{\theta}})\Pr(\xi^{\prime};{\boldsymbol{\theta}})d\xi d\xi^{\prime}=1$ , and (ii) $\int{\boldsymbol{u}}(\xi)\Pr(\xi;{\boldsymbol{\theta}})d\xi=0$ , and finally (d) is the result of replacing the integral within the parentheses with the Fisher information matrix ${\boldsymbol{G}}$ .

The Fisher information matrix ${\boldsymbol{G}}$ is positive definite and symmetric. Thus, it can be written as ${\boldsymbol{G}}={\boldsymbol{V}}{\boldsymbol{\Lambda}}{\boldsymbol{V}}^{\top}$ , where ${\boldsymbol{V}}=[{\boldsymbol{v}}_{1},\ldots,{\boldsymbol{v}}_{n}]$ and ${\boldsymbol{\Lambda}}=\mathop{\rm diag}[\lambda_{1},\ldots,\lambda_{n}]$ are the matrix of orthonormal eigenvectors and the diagonal matrix of eigenvalues of matrix ${\boldsymbol{G}}$ , respectively. By replacing ${\boldsymbol{G}}^{-1}$ with ${\boldsymbol{V}}{\boldsymbol{\Lambda}}^{-1}{\boldsymbol{V}}^{\top}$ in Equation 44 we obtain

(a) and (b) are algebra, (c) is the result of switching the sum and the integral, (d) is algebra, (e) follows from the fact that ${\boldsymbol{v}}_{i}^{\top}{\boldsymbol{u}}(\xi)$ is a scalar, and thus, can be replaced by its transpose, (f) is algebra, (g) substitutes the integral within the parentheses with the Fisher information matrix ${\boldsymbol{G}}$ , (h) replaces ${\boldsymbol{G}}{\boldsymbol{v}}_{i}$ with $\lambda_{i}{\boldsymbol{v}}_{i}$ , (i) follows from the orthonormality of ${\boldsymbol{v}}_{i}$ ’s, and thus, the claim follows.
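As a cross-check on the eigen-decomposition argument, the same value can be reached in one line using the linearity and cyclic property of the trace. This is a sketch under the assumption that the integral being evaluated is $\int{\boldsymbol{u}}(\xi)^{\top}{\boldsymbol{G}}^{-1}{\boldsymbol{u}}(\xi)\Pr(\xi;{\boldsymbol{\theta}})d\xi$ , with ${\boldsymbol{G}}=\int{\boldsymbol{u}}(\xi){\boldsymbol{u}}(\xi)^{\top}\Pr(\xi;{\boldsymbol{\theta}})d\xi$ an invertible $n\times n$ matrix:

```latex
\int {\boldsymbol{u}}(\xi)^{\top}{\boldsymbol{G}}^{-1}{\boldsymbol{u}}(\xi)\Pr(\xi;{\boldsymbol{\theta}})\,d\xi
  = \int \mathop{\rm tr}\!\big({\boldsymbol{G}}^{-1}{\boldsymbol{u}}(\xi){\boldsymbol{u}}(\xi)^{\top}\big)\Pr(\xi;{\boldsymbol{\theta}})\,d\xi
  = \mathop{\rm tr}\!\Big({\boldsymbol{G}}^{-1}\int {\boldsymbol{u}}(\xi){\boldsymbol{u}}(\xi)^{\top}\Pr(\xi;{\boldsymbol{\theta}})\,d\xi\Big)
  = \mathop{\rm tr}\!\big({\boldsymbol{G}}^{-1}{\boldsymbol{G}}\big)
  = n .
```

Adding the constant term of the quadratic Fisher kernel then gives the value $n+1$ for $b_{0}$ .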

B Proof of Proposition 4

We start with the proof of ${\boldsymbol{B}}$ . This $n\times M$ matrix may be written as

(a) substitutes $k(\xi,\xi_{i})$ with the Fisher kernel from Equation 27, (b) is algebra, (c) follows from $\nabla\Pr(\xi;{\boldsymbol{\theta}})={\boldsymbol{u}}(\xi)\Pr(\xi;{\boldsymbol{\theta}})$ , (d) substitutes the integral within the parentheses with the Fisher information matrix ${\boldsymbol{G}}$ , (e) is algebra, and thus, the claim follows. Now the proof for the $n\times n$ matrix ${\boldsymbol{B}}_{0}$

(a) follows from the fact that $k(\xi,\xi^{\prime})$ is scalar, (b) substitutes $k(\xi,\xi^{\prime})$ with the Fisher information kernel from Equation 27 and $\nabla\Pr(\xi;{\boldsymbol{\theta}})$ with ${\boldsymbol{u}}(\xi)\Pr(\xi;{\boldsymbol{\theta}})$ , (c) is algebra, (d) is the result of substituting the integrals within the parentheses with the Fisher information matrix ${\boldsymbol{G}}$ , and thus, the claim follows.

C Proof of Proposition 5

Sparsification does not change $b_{0}$ and it remains equal to $n+1$ (see Proposition 3), however it modifies ${\boldsymbol{b}}$ to

The claim follows using Lemma 1.3.2 in Engel05AR.

D Proof of Proposition 6

We start the proof with the $n\times(t+1)$ matrix ${\boldsymbol{B}}_{t}$ , whose $i$ th column may be written as

The 1st line follows from the definition of matrix ${\boldsymbol{B}}_{t}$ , function ${\boldsymbol{g}}$ , and kernel $k$ , the 2nd line is algebra, the 3rd line follows from the definition of $\pi^{\mu}$ and the Fisher kernel $k_{F}$ , the 4th line is algebra, the 5th line is the result of replacing the integral in the parentheses with the Fisher information matrix ${\boldsymbol{G}}$ , finally the 6th line is algebra, and the claim follows.

Now the proof for the $n\times n$ matrix ${\boldsymbol{B}}_{0}$

(a) follows from the definition of function ${\boldsymbol{g}}$ and kernel $k$ , (b) is algebra, (c) follows from the definition of $\pi^{\mu}$ and the Fisher kernel $k_{F}$ , (d) is algebra, finally (e) follows from $\int_{\mathcal{A}}da\nabla\mu(a|x;{\boldsymbol{\theta}})=0$ and ${\boldsymbol{G}}=\int_{\mathcal{Z}}dz\pi^{\mu}({\boldsymbol{z}}){\boldsymbol{u}}({\boldsymbol{z}}){\boldsymbol{u}}({\boldsymbol{z}})^{\top}$ , and the claim follows.