(More) Efficient Reinforcement Learning via Posterior Sampling

Ian Osband, Daniel Russo, Benjamin Van Roy

Introduction

We consider the classical reinforcement learning problem of an agent interacting with its environment while trying to maximize total reward accumulated over time . The agent’s environment is modeled as a Markov decision process (MDP), but the agent is uncertain about the true dynamics of the MDP. As the agent interacts with its environment, it observes the outcomes that result from previous states and actions, and learns about the system dynamics. This leads to a fundamental tradeoff: by exploring poorly-understood states and actions the agent can learn to improve future performance, but it may attain better short-run performance by exploiting its existing knowledge.

Naïve optimization using point estimates for unknown variables overstates an agent’s knowledge, and can lead to premature and suboptimal exploitation. To offset this, the majority of provably efficient learning algorithms use a principle known as optimism in the face of uncertainty to encourage exploration. In such an algorithm, each state and action is afforded some optimism bonus such that their value to the agent is modeled to be as high as is statistically plausible. The agent will then choose a policy that is optimal under this “optimistic” model of the environment. This incentivizes exploration since poorly-understood states and actions will receive a higher optimism bonus. As the agent resolves its uncertainty, the effect of optimism is reduced and the agent’s behavior approaches optimality. Many authors have provided strong theoretical guarantees for optimistic algorithms . In fact, almost all reinforcement learning algorithms with polynomial bounds on sample complexity employ optimism to guide exploration.

We study an alternative approach to efficient exploration, posterior sampling, and provide finite time bounds on regret. We model the agent’s initial uncertainty over the environment through a prior distribution.For an MDP, this might be a prior over transition dynamics and reward distributions. At the start of each episode, the agent chooses a new policy, which it follows for the duration of the episode. Posterior sampling for reinforcement learning (PSRL) selects this policy through two simple steps. First, a single instance of the environment is sampled from the posterior distribution at the start of an episode. Then, PSRL solves for and executes the policy that is optimal under the sampled environment over the episode. PSRL randomly selects policies according to the probability they are optimal; exploration is guided by the variance of sampled policies as opposed to optimism.

The idea of posterior sampling goes back to 1933 and has been applied successfully to multi-armed bandits. In that literature, the algorithm is often referred to as Thompson sampling or as probability matching. Despite its long history, posterior sampling was largely neglected by the multi-armed bandit literature until empirical studies demonstrated that the algorithm could produce state of the art performance. This prompted a surge of interest, and a variety of strong theoretical guarantees are now available . Our results suggest this method has great potential in reinforcement learning as well.

PSRL was originally introduced in the context of reinforcement learning by Strens under the name “Bayesian Dynamic Programming”,We alter terminology since PSRL is neither Bayes-optimal, nor a direct approximation of this. where it appeared primarily as a heuristic method. In reference to PSRL and other “Bayesian RL” algorithms, Kolter and Ng write “little is known about these algorithms from a theoretical perspective, and it is unclear, what (if any) formal guarantees can be made for such approaches.” Those Bayesian algorithms for which performance guarantees exist are guided by optimism. BOSS introduces a more complicated version of PSRL that samples many MDPs, instead of just one, and then combines them into an optimistic environment to guide exploration. BEB adds an exploration bonus to states and actions according to how infrequently they have been visited. We show it is not always necessary to introduce optimism via a complicated construction, and that the simple algorithm originally proposed by Strens satisfies strong bounds itself.

Our work is motivated by several advantages of posterior sampling relative to optimistic algorithms. First, since PSRL only requires solving for an optimal policy for a single sampled MDP, it is computationally efficient both relative to many optimistic methods, which require simultaneous optimization across a family of plausible environments , and to computationally intensive approaches that attempt to approximate the Bayes-optimal solutions directly . Second, the presence of an explicit prior allows an agent to incorporate known environment structure in a natural way. This is crucial for most practical applications, as learning without prior knowledge requires exhaustive experimentation in each possible state. Finally, posterior sampling allows us to separate the algorithm from the analysis. In any optimistic algorithm, performance is greatly influenced by the manner in which optimism is implemented. Past works have designed algorithms, at least in part, to facilitate theoretical analysis for toy problems. Although our analysis of posterior sampling is closely related to the analysis in , this worst-case bound has no impact on the algorithm’s actual performance. In addition, PSRL is naturally suited to more complex settings where design of an efficiently optimistic algorithm might not be possible. We demonstrate through a computational study in Section 6 that PSRL outperforms the optimistic algorithm UCRL2 : a competitor with similar regret bounds over some example MDPs.

Problem formulation

A deterministic policy $\mu$ is a function mapping each state $s\in\mathcal{S}$ and $i=1,\ldots,\tau$ to an action $a\in\mathcal{A}$ . For each MDP $M=(\mathcal{S},\mathcal{A},R^{M},P^{M},\tau,\rho)$ and policy $\mu$ , we define a value function

The reinforcement learning agent interacts with the MDP over episodes that begin at times $t_{k}=(k-1)\tau+1$ , $k=1,2,\ldots$ . At each time $t$ , the agent selects an action $a_{t}$ , observes a scalar reward $r_{t}$ , and then transitions to $s_{t+1}$ . If an agent follows a policy $\mu$ then when in state $s$ at time $t$ during episode $k$ , it selects an action $a_{t}=\mu(s,t-t_{k})$ . Let $H_{t}=(s_{1},a_{1},r_{1},\ldots,s_{t-1},a_{t-1},r_{t-1})$ denote the history of observations made prior to time $t$ . A reinforcement learning algorithm is a deterministic sequence $\{\pi_{k}|k=1,2,\ldots\}$ of functions, each mapping $H_{t_{k}}$ to a probability distribution $\pi_{k}(H_{t_{k}})$ over policies. At the start of the $k$ th episode, the algorithm samples a policy $\mu_{k}$ from the distribution $\pi_{k}(H_{t_{k}})$ . The algorithm then selects actions $a_{t}=\mu_{k}(s_{t},t-t_{k})$ at times $t$ during the $k$ th episode.

We define the regret incurred by a reinforcement learning algorithm $\pi$ up to time $T$ to be

where $\Delta_{k}$ denotes regret over the $k$ th episode, defined with respect to the MDP $M^{*}$ by

with $\mu^{*}=\mu^{M^{*}}$ and $\mu_{k}\sim\pi_{k}(H_{t_{k}})$ . Note that regret is not deterministic since it can depend on the random MDP $M^{*}$ , the algorithm’s internal random sampling and, through the history $H_{t_{k}}$ , on previous random transitions and random rewards. We will assess and compare algorithm performance in terms of regret and its expectation.

Posterior sampling for reinforcement learning

The use of posterior sampling for reinforcement learning (PSRL) was first proposed by Strens . PSRL begins with a prior distribution over MDPs with states $\mathcal{S}$ , actions $\mathcal{A}$ and horizon $\tau$ . At the start of each $k$ th episode, PSRL samples an MDP $M_{k}$ from the posterior distribution conditioned on the history $H_{t_{k}}$ available at that time. PSRL then computes and follows the policy $\mu_{k}=\mu^{M_{k}}$ over episode $k$ .

We show PSRL obeys performance guarantees intimately related to those for learning algorithms based upon OFU, as has been demonstrated for multi-armed bandit problems . We believe that a posterior sampling approach offers some inherent advantages. Optimistic algorithms require explicit construction of the confidence bounds on $V^{M^{*}}_{\mu,1}(s)$ based on observed data, which is a complicated statistical problem even for simple models. In addition, even if strong confidence bounds for $V^{M^{*}}_{\mu,1}(s)$ were known, solving for the best optimistic policy may be computationally intractable. Algorithms such as UCRL2 are computationally tractable, but must resort to separately bounding $\overline{R}^{M}_{a}(s)$ and $P^{M}_{a}(s)$ with high probability for each $s,a$ . These bounds allow a “worst-case” mis-estimation simultaneously in every state-action pair and consequently give rise to a confidence set which may be far too conservative.

By contrast, PSRL always selects policies according to the probability they are optimal. Uncertainty about each policy is quantified in a statistically efficient way through the posterior distribution. The algorithm only requires a single sample from the posterior, which may be approximated through algorithms such as Metropolis-Hastings if no closed form exists. As such, we believe PSRL will be simpler to implement, computationally cheaper and statistically more efficient than existing optimistic methods.

If $f$ is the distribution of $M^{*}$ then,

This result holds for any prior distribution on MDPs, and so applies to an immense class of models. To accommodate this generality, the result bounds expected regret under the prior distribution (sometimes called Bayes risk or Bayesian regret). We feel this is a natural measure of performance, but should emphasize that it is more common in the literature to bound regret under a worst-case MDP instance. The next result provides a link between these notions of regret. Applying Markov’s inequality to (1) gives convergence in probability.

If $f$ is the distribution of $M^{*}$ then for any $\alpha>\frac{1}{2}$ ,

True versus sampled MDP

A simple observation, which is central to our analysis, is that, at the start of each $k$ th episode, $M^{*}$ and ${M_{k}}$ are identically distributed. This fact allows us to relate quantities that depend on the true, but unknown, MDP $M^{*}$ , to those of the sampled MDP ${M_{k}}$ , which is fully observed by the agent. We introduce $\sigma({H}_{t_{k}})$ as the $\sigma$ -algebra generated by the history up to $t_{k}$ . Readers unfamiliar with measure theory can think of this as “all information known just before the start of period $t_{k}$ .” When we say that a random variable X is $\sigma({H}_{t_{k}})$ -measurable, this intuitively means that although X is random, it is deterministically known given the information contained in $H_{t_{k}}$ . The following lemma is an immediate consequence of this observation .

If $f$ is the distribution of $M^{*}$ then, for any $\sigma({H}_{t_{k}})$ -measurable function $g$ ,

Note that taking the expectation of (2) shows $\mathds{E}[g(M^{*})]=\mathds{E}[g({M_{k}})]$ through the tower property.

Recall, we have defined $\Delta_{k}=\sum_{s\in\mathcal{S}}\rho(s)(V^{M^{*}}_{\mu^{*},1}(s)-V^{M^{*}}_{\mu_{k},1}(s))$ to be the regret over period $k$ . A significant hurdle in analyzing this equation is its dependence on the optimal policy $\mu^{*}$ , which we do not observe. For many reinforcement learning algorithms, there is no clean way to relate the unknown optimal policy to the states and actions the agent actually observes. The following result shows how we can avoid this issue using Lemma 1. First, define

as the difference in expected value of the policy $\mu_{k}$ under the sampled MDP $M_{k}$ , which is known, and its performance under the true MDP $M^{*}$ , which is observed by the agent.

and for any $\delta>0$ with probability at least $1-\delta$ ,

This result bounds the agent’s regret in epsiode $k$ by the difference between the agent’s estimate $V^{M_{k}}_{\mu_{k},1}(s_{t_{k}})$ of the expected reward in $M_{k}$ from the policy it chooses, and the expected reward $V^{M^{*}}_{\mu_{k},1}(s_{t_{k}})$ in $M^{*}$ . If the agent has a poor estimate of the MDP $M^{*}$ , we expect it to learn as the performance of following $\mu_{k}$ under $M^{*}$ differs from its expectation under $M_{k}$ . As more information is gathered, its performance should improve. In the next section, we formalize these ideas and give a precise bound on the regret of posterior sampling.

Analysis

An essential tool in our analysis will be the dynamic programming, or Bellman operator $\mathcal{T}_{\mu}^{M}$ , which for any MDP $M=(\mathcal{S},\mathcal{A},R^{M},P^{M},\tau,\rho)$ , stationary policy $\mu:\mathcal{S}\rightarrow\mathcal{A}$ and value function $V:\mathcal{S}\rightarrow\mathds{R}$ , is defined by

This operation returns the expected value of state $s$ where we follow the policy $\mu$ under the laws of $M$ , for one time step. The following lemma gives a concise form for the dynamic programming paradigm in terms of the Bellman operator.

For any MDP $M=(\mathcal{S},\mathcal{A},R^{M},P^{M},\tau,\rho)$ and policy $\mu:\mathcal{S}\times\{1,\ldots,\tau\}\rightarrow\mathcal{A}$ , the value functions $V^{M}_{\mu}$ satisfy

for $i=1\dots\tau$ , with $V_{\mu,\tau+1}^{M}:=0$ .

In order to streamline our notation we will let $V_{\mu,i}^{*}:=V_{\mu,i}^{M^{*}}$ , $V_{\mu,i}^{k}(s):=V_{\mu,i}^{M_{k}}(s)$ , $\mathcal{T}_{\mu}^{k}:=\mathcal{T}_{\mu}^{M_{k}}$ , $\mathcal{T}_{\mu}^{*}:=\mathcal{T}_{\mu}^{M^{*}}$ and $P^{*}_{\mu}(\cdot|s):=P_{\mu(s)}^{M^{*}}(\cdot|s)$ .

To see why (6) holds, simply apply the Dynamic programming equation inductively:

where $d_{t_{k}+i}:=\sum_{s^{\prime}\in\mathcal{S}}\{P_{\mu_{k}(\cdot,i)}^{*}(s^{\prime}|s_{t_{k}+i})(V_{\mu_{k},i+1}^{*}-V_{\mu_{k},i+1}^{k})(s^{\prime})\}-(V_{\mu_{k},i+1}^{*}-V_{\mu_{k},i+1}^{k})(s_{t_{k}+i})$ .

This expresses the regret in terms two factors. The first factor is the one step Bellman error $\left[(\mathcal{T}_{\mu_{k}(\cdot,i)}^{k}-\mathcal{T}_{\mu_{k}(\cdot,i)}^{*})V_{\mu_{k},i+1}^{k}(s_{t_{k}+i})\right]$ under the sampled MDP $M_{k}$ . Crucially, (6) depends only the Bellman error under the observed policy $\mu_{k}$ and the states $s_{1},..,s_{T}$ that are actually visited over the first $T$ periods. We go on to show the posterior distribution of $M_{k}$ concentrates around $M^{*}$ as these actions are sampled, and so this term tends to zero.

The second term captures the randomness in the transitions of the true MDP $M^{*}$ . In state $s_{t}$ under policy $\mu_{k}$ , the expected value of $(V_{\mu_{k},i+1}^{*}-V_{\mu_{k},i+1}^{k})(s_{t_{k}+i})$ is exactly $\sum_{s^{\prime}\in\mathcal{S}}\{P_{\mu_{k}(\cdot,i)}^{*}(s^{\prime}|s_{t_{k}+i})(V_{\mu_{k},i+1}^{*}-V_{\mu_{k},i+1}^{k})(s^{\prime})\}$ . Hence, conditioned on the true MDP $M^{*}$ and the sampled MDP $M_{k}$ , the term $\sum_{i=1}^{\tau}d_{t_{k}+i}$ has expectation zero.

2 Introducing confidence sets

The last section reduced the algorithm’s regret to its expected Bellman error. We will proceed by arguing that the sampled Bellman operator $\mathcal{T}_{\mu_{k}(\cdot,i)}^{k}$ concentrates around the true Bellman operatior $\mathcal{T}_{\mu_{k}(\cdot,i)}^{*}$ . To do this, we introduce high probability confidence sets similar to those used in and . Let $\hat{P}_{a}^{t}(\cdot|s)$ denote the emprical distribution up period $t$ of transitions observed after sampling $(s,a)$ , and let $\hat{R}_{a}^{t}(s)$ denote the empirical average reward. Finally, define $N_{t_{k}}(s,a)=\sum_{t=1}^{t_{k}-1}\mathds{1}_{\{(s_{t},a_{t})=(s,a)\}}$ to be the number of times $(s,a)$ was sampled prior to time $t_{k}$ . Define the confidence set for episode $k$ :

Now, since $\mathcal{M}_{k}$ is $\sigma({H}_{t_{k}})$ -measureable, by Lemma 1, $\mathds{E}[\mathds{1}_{\{M_{k}\notin\mathcal{M}_{k}\}}|H_{t_{k}}]=\mathds{E}[\mathds{1}_{\{M^{*}\notin\mathcal{M}_{k}\}}|H_{t_{k}}]$ . Lemma 17 of showsOur confidence sets are equivalent to those of when the parameter $\delta=1/m$ . $\mathds{P}(M^{*}\notin\mathcal{M}_{k})\leq 1/m$ for this choice of $\beta_{k}(s,a)$ , which implies

Simulation results

We compare performance of PSRL to UCRL2 : an optimistic algorithm with similar regret bounds. We use the standard example of RiverSwim , as well as several randomly generated MDPs. We provide results in both the episodic case, where the state is reset every $\tau=20$ steps, as well as the setting without episodic reset.

RiverSwim consists of six states arranged in a chain as shown in Figure 1. The agent begins at the far left state and at every time step has the choice to swim left or right. Swimming left (with the current) is always successful, but swimming right (against the current) often fails. The agent receives a small reward for reaching the leftmost state, but the optimal policy is to attempt to swim right and receive a much larger reward. This MDP is constructed so that efficient exploration is required in order to obtain the optimal policy. To generate the random MDPs, we sampled 10-state, 5-action environments according to the prior.

We express our prior in terms of Dirichlet and normal-gamma distributions over the transitions and rewards respectively.These priors are conjugate to the multinomial and normal distribution. We used the values $\alpha=1/S,\mu=\sigma^{2}=1$ and pseudocount $n=1$ for a diffuse uniform prior. In both environments we perform 20 Monte Carlo simulations and compute the total regret over 10,000 time steps. We implement UCRL2 with $\delta=0.05$ and optimize the algorithm to take account of finite episodes where appropriate. PSRL outperformed UCRL2 across every environment, as shown in Table 1. In Figure 2, we show regret through time across 50 Monte Carlo simulations to 100,000 time–steps in the RiverSwim environment: PSRL’s outperformance is quite extreme.

The majority of practical problems in reinforcement learning can be mapped to repeated episodic interactions for some length $\tau$ . Even in cases where there is no actual reset of episodes, one can show that PSRL’s regret is bounded against all policies which work over horizon $\tau$ or less . Any setting with discount factor $\alpha$ can be learned for $\tau\propto(1-\alpha)^{-1}$ .

One appealing feature of UCRL2 and REGAL is that they learn this optimal timeframe $\tau$ . Instead of computing a new policy after a fixed number of periods, they begin a new episode when the total visits to any state-action pair is doubled. We can apply this same rule for episodes to PSRL in the $\infty$ -horizon case, as shown in Figure 2. Using optimism with KL-divergence instead of $L^{1}$ balls has also shown improved performance over UCRL2 , but its regret remains orders of magnitude more than PSRL on RiverSwim.

Conclusion

Osband and Russo are supported by Stanford Graduate Fellowships courtesy of PACCAR inc., and Burt and Deedee McMurty, respectively. This work was supported in part by Award CMMI-0968707 from the National Science Foundation.

References

Appendix A Relating Bayesian to frequentist regret

Let $\mathcal{M}$ be any family of MDPs with non-zero probability under the prior. Then, for any $\epsilon>0$ , $\alpha>\frac{1}{2}$ :

This provides regret bounds even if $M^{*}$ is not distributed according to $f$ . As long as the true MDP is not impossible under the prior, we will have an asymptotic frequentist regret close to the theoretical lower bounds of in $T$ -dependence of $O(\sqrt{T})$ .

Therefore via theorem (1), for any $\alpha>\frac{1}{2}$ :

Appendix B Bounding the sum of confidence set widths

We are interested in bounding $\min\{\tau\sum_{k=1}^{m}\sum_{i=1}^{\tau}\min\{\beta_{k}s_{t_{k}+i},a_{t_{k}+i}),1\},T\}$ which we claim is $O(\tau S\sqrt{AT\log(SAT)}$ for $\beta_{k}(s,a):=\sqrt{\frac{14S\log(2SAmt_{k})}{\max\{1,N_{t_{k}}(s,a)\}}}$ .

Now, the consider the event $(s_{t},a_{t})=(s,a)$ and $(N_{t_{k}}(s,a)\leq\tau)$ . This can happen fewer than $2\tau$ times per state action pair. Therefore, $\sum_{k=1}^{m}\sum_{i=1}^{\tau}\mathbf{1}(N_{t_{k}}(s,a)\leq\tau)\leq 2\tau SA$ .Now, suppose $N_{t_{k}}(s,a)>\tau$ . Then for any $t\in\{t_{k},..,t_{k+1}-1\}$ , $N_{t}(s,a)+1\leq N_{t_{k}}(s,a)+\tau\leq 2N_{t_{k}}(s,a)$ . Therefore:

Note that since all rewards and transitions are absolutely constrained $\in$ our regret