Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

Introduction

Reinforcement learning (RL) is one of the most successful approaches to artificial intelligence, and has found successful applications in robotics, games, dialogue systems, and recommendation systems, among others. One of the key problems in RL is policy evaluation: given a fixed policy, estimate the average reward garnered by an agent that runs this policy in the environment. In this paper, we consider the off-policy estimation problem, in which we want to estimate the expected reward of a given target policy with samples collected by a different behavior policy. This problem is of great practical importance in many application domains where deploying a new policy can be costly or risky, such as medical treatments , econometrics , recommender systems , education , Web search , advertising and marketing . It can also be used as a key component for developing efficient off-policy policy optimization algorithms .

Most state-of-the-art off-policy estimation methods are based on importance sampling (IS) [e.g., 22]. A major limitation, however, is that this approach can become inaccurate due to the high variance introduced by the importance weights, especially when the trajectory is long. Indeed, most existing IS-based estimators compute the weight as the product of the importance ratios of many steps in the trajectory. Variances in individual steps accumulate multiplicatively, so that the overall IS weight of a random trajectory can have an exponentially high variance to result in an unreliable estimator. In the extreme case when the trajectory length is infinite, as in infinite-horizon average-reward problems, some of these estimators are not even well-defined. Ad hoc approaches can be used, such as truncating the trajectories, but often lead to a hard-to-control bias in the final estimation. Analogous to the well-known “curse of dimensionality” in dynamic programming , we call this problem the “curse of horizon” in off-policy learning.

In this work, we develop a new approach that tackles the curse of horizon. The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories. This avoids the cumulative product across time in the density ratio, substantially decreasing its variance and eliminating the estimator’s dependence on the horizon.

Our key challenge, of course, is to estimate the importance ratios of average visitation distributions. In practice, we often have access to both the target and behavior policies to compute their importance ratio of an action conditioned on a given state. But we typically have no access to transition probabilities of the environment, so estimating importance ratios of state visitation distributions has been very difficult, especially when only off-policy samples are available. In this paper, we develop a mini-max loss function for estimating the true stationary density ratio, which yields a closed-form representation similar to maximum mean discrepancy when combined with a reproducing kernel Hilbert space (RKHS). We study the theoretical properties of our loss function, and demonstrate its empirical effectiveness on long-horizon problems.

Background

Consider a Markov decision process (MDP) M=S,A,r,TM=\langle\mathcal{S},\mathcal{A},r,{\boldsymbol{T}}\rangle with state space S\mathcal{S}, action space A\mathcal{A}, reward function rr, and transition probability function T{\boldsymbol{T}}. Assume the environment is initialized at state s0Ss_{0}\in\mathcal{S}, drawn from an unknown distribution d0()d_{0}(\cdot). At each time step tt, an agent observes the current state sts_{t}, takes an action ata_{t} according to a possibly stochastic policy π(st)\pi(\cdot|s_{t}), receives a reward rtr_{t} whose expectation is r(st,at)r(s_{t},a_{t}), and transitions to a next state st+1s_{t+1} according to transition probabilities T(st,at){\boldsymbol{T}}(\cdot|s_{t},a_{t}). To simplify exposition and avoid unnecessary technicalities, we assume S\mathcal{S} and A\mathcal{A} are finite unless otherwise specified, although our method extends to continuous spaces straightforwardly, as demonstrated in experiments.

We consider the infinite horizon problem in which the MDP continues without termination. Let pπ()p_{\pi}(\cdot) be the distribution of trajectory τ={st,at,rt}t=0{\boldsymbol{\tau}}=\{s_{t},a_{t},r_{t}\}_{t=0}^{\infty} under policy π\pi. The expected reward of π\pi is

where RπT(τ)R^{T}_{\pi}({\boldsymbol{\tau}}) is the reward of trajectory τ{\boldsymbol{\tau}} up to time TT. Here, γ(0,1]\gamma\in(0,1] is a discount factor. We distinguish two reward criteria, the average reward (γ=1\gamma=1) and discounted reward (0<γ<10<\gamma<1):

where (1γ)=1/t=0γt(1-\gamma)=1/\sum_{t=0}^{\infty}\gamma^{t} is a normalization factor. The problem of off-policy value estimation is to estimate the expected reward Rπ{R}_{\pi} of a given target policy π\pi, when we only observe a set of trajectories τi={sti,ati,rti}t=0T{\boldsymbol{\tau}}^{i}=\{s_{t}^{i},a_{t}^{i},r_{t}^{i}\}_{t=0}^{{T}} generated by following a different behavior policy π0\pi_{0}.

Bellman Equation

Under these definitions, VπV^{\pi} is the fixed-point solution to the respective Bellman equations:

Importance Sampling

IS represents a major class of approaches to off-policy estimation, which, in principle, only applies to the finite-horizon reward RπTR^{T}_{\pi} when the trajectory is truncated at a finite time step T<T<\infty. IS-based estimators are based on the following change-of-measure equality:

where βπ/π0(as):=π(as)/π0(as)\beta_{\pi/\pi_{0}}(a|s):=\pi(a|s)/\pi_{0}(a|s) is the single-step density ratio of policies π\pi and π0\pi_{0} evaluated at a particular state-action pair (s,a)(s,a), and w0:Tw_{0:T} is the density ratio of the trajectory τ{\boldsymbol{\tau}} up to time TT. Methods based on (3) are called trajectory-wise IS, or weighted IS (WIS) when the weights are self-normalized . It is possible to improve trajectory-wise IS with the so called step-wise, or per-decision, IS/WIS, which uses weight w0:tw_{0:t} for reward rtr_{t} at time tt, yielding smaller variance . More details about these estimators are given in Appendix A.

The Curse of Horizon

The importance weight w0:Tw_{0:T} is a product of TT density ratios, whose variance can grow exponentially with TT. Thus, IS-based estimators have not been widely successful in long-horizon problems, let alone infinite-horizon ones where w0:w_{0:\infty} may not even be well-defined. While WIS estimators often have reduced variance, the exponential dependence on horizon is unavoidable in general. We call this phenomenon in IS/WIS-based estimators the curse of horizon.

Not all hope is lost, however. To see this, consider an MDP with nn states and 22 actions, where states are arranged on a circle (see figure on the right). The two actions deterministically move the agent from the current state to the neighboring state counterclockwise and clockwise, respectively. Suppose we are given two policies with opposite effects: the behavior policy π0\pi_{0} moves the agent clockwise with probability ρ\rho, and the target policy π\pi moves the agent counterclockwise with probability ρ\rho, for some constant ρ(0,1)\rho\in(0,1). As shown in Appendix B, IS and WIS estimators suffer from exponentially large variance when estimating the average reward of π\pi. However, a keen reader will realize that the two policies are symmetric, and thus their stationary state visitation distributions are identical. As we show in the sequel, this allows us to estimate the expected reward using a much more efficient importance sampling, whose importance weight equals the single-step density ratio βπ/π0(atst)\beta_{\pi/\pi_{0}}(a_{t}|s_{t}), instead of the cumulative product weight w0:Tw_{0:T} in (3), allowing us to significantly reduce the variance. Such an observation inspired the approach developed in this paper.

Off-Policy Estimation via Stationary State Density Ratio Estimation

As shown in the example above, significant decrease in estimation variance is possible when we apply importance weighting on the state space, rather than the trajectory space. It eliminates the dependency on the trajectory length and is much more suited for long- or infinite-horizon problems. To realize this, we need to introduce an alternative representation of the expected reward. Denote by dπ,t()d_{\pi,t}(\cdot) the distribution of state sts_{t} when we execute policy π\pi starting from an initial state s0s_{0} drawn from an initial distribution d0()d_{0}(\cdot). We define the average visitation distribution to be

We always assume the limit TT\to\infty exists in this work. When γ(0,1)\gamma\in(0,1) in the discounted case, dπd_{\pi} is a discounted average of dπ,td_{\pi,t}, that is, dπ(s)=(1γ)t=0γtdπ,t(s)d_{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d_{\pi,t}(s) ; when γ=1\gamma=1 in the average reward case, dπd_{\pi} is the stationary distribution of sts_{t} as tt\to\infty under policy π\pi, that is, dπ(s)=limT1T+1t=0Tdπ,t(s)=limtdπ,t(s)d_{\pi}(s)=\lim_{T\to\infty}\frac{1}{{T+1}}\sum_{t=0}^{{T}}d_{\pi,t}(s)=\lim_{t\to\infty}d_{\pi,t}(s).

Following Definition 4, it can be verified that RπR_{\pi} can be expressed alternatively as

where, abusing notation slightly, we use (s,a)dπ(s,a)\sim d_{\pi} to denote draws from distribution dπ(s,a):=dπ(s)π(as)d_{\pi}(s,a):=d_{\pi}(s)\pi(a|s). Our idea is to construct an IS estimator based on (5), where the importance ratio is computed on state-action pairs rather than on trajectories:

This IS estimator works in the space of (s,a)(s,a), instead of trajectoris τ={st,at}t=0T{\boldsymbol{\tau}}=\{s_{t},a_{t}\}_{t=0}^{{T}}, leading to a potentially significant variance reduction. Returning to the example in Section 2 (see also Appendix B), since the two policies are symmetric and lead to the same stationary distributions, that is, wπ/π0(s)=1w_{\pi/\pi_{0}}(s)=1, the importance weight in (6) is simply π(as)/π0(as){\pi(a|s)}/{\pi_{0}(a|s)}, independent of the trajectory length. This avoids the excessive variance in long horizon problems. In Appendix A, we provide a further discussion, showing that our estimator can be viewed as a type of Rao-Backwellization of the trajectory-wise and step-wise estimators.

The key technical challenge remaining is estimating the density ratio wπ/π0(s)w_{\pi/\pi_{0}}(s), which we address in this section. For simplifying the presentation, we start with estimating dπ(s)d_{\pi}(s) for the average reward case and discuss the discounted case in Section 3.2.

Let Tπ(ss):=aT(ss,a)π(as){\boldsymbol{T}}_{\pi}(s^{\prime}|s):=\sum_{a}{\boldsymbol{T}}(s^{\prime}|s,a)\pi(a|s) be the transition probability from ss to ss^{\prime} following policy π\pi. In the average reward case, dπd_{\pi} equals the stationary distribution of Tπ{\boldsymbol{T}}_{\pi}, satisfying

Assume the Markov chain of Tπ{\boldsymbol{T}}_{\pi} is finite state and ergodic, dπd_{\pi} is also the unique distribution that satisfies (8). This simple fact can be leveraged to derive the following key property of wπ/π0(s)w_{\pi/\pi_{0}}(s).

In the average reward case (γ=1\gamma=1), assume dπd_{\pi} is the unique invariant distribution of Tπ{\boldsymbol{T}}_{\pi} and dπ0(s)>0d_{\pi_{0}}(s)>0, s\forall s. Then a function w(s)w(s) equals wπ/π0(s):=dπ(s)/dπ0(s)w_{\pi/\pi_{0}}(s):=d_{\pi}(s)/d_{\pi_{0}}(s) (up to a constant factor) if and only if it satisfies

where βπ/π0(as)=π(as)/π0(as)\beta_{\pi/\pi_{0}}(a|s)=\pi(a|s)/\pi_{0}(a|s) and (s,a)sdπ0(s,a)|s^{\prime}\sim d_{\pi_{0}} denote the conditional distribution dπ0(s,as)d_{\pi_{0}}(s,a|s^{\prime}) related to joint distribution dπ0(s,a,s):=dπ0(s)π0(as)T(ss,a)d_{\pi_{0}}(s,a,s^{\prime}):=d_{\pi_{0}}(s)\pi_{0}(a|s){\boldsymbol{T}}(s^{\prime}|s,a). Note that this is a time-reserved conditional probability, since it is the conditional distribution of (s,a)(s,a) given that their next state is ss^{\prime} following policy π0\pi_{0}.

Following Theorem 1, we have wwπ/π0w\propto w_{\pi/\pi_{0}} if and only if L(w,f)=0L(w,f)=0 for any function ff. This motivates us to estimate wπ/π0w_{\pi/\pi_{0}} with a mini-max problem:

Assume H\mathcal{H} is a RKHS of functions f(s)f(s) with a positive definite kernel k(s,sˉ)k(s,\bar{s}), and define F:={fH ⁣:fH1}{\mathcal{F}}:=\{f\in\mathcal{H}\colon||f||_{\mathcal{H}}\leq 1\} to be the unit ball of H\mathcal{H}. We have

where (s,a,s)(s,a,s^{\prime}) and (sˉ,aˉ,sˉ)(\bar{s},\bar{a},\bar{s}^{\prime}) are independent transition pairs obtained when running policy π0\pi_{0}, and Δ(w;s,a,s)\Delta(w;s,a,s^{\prime}) is defined in (10). See Appendix C for more background on RKHS.

In practice, we approximate the expectation in (12) using discounted empirical distribution of the transition pairs, yielding consistent estimates following standard results on V-statistics .

2 Discounted Reward Case

We now discuss the extension to the discount case of γ(0,1)\gamma\in(0,1). Similar to the average reward case, we start with a recursive equation that characterizes dπ(s)d_{\pi}(s) in the discounted case.

Following the definition of dπd_{\pi} in (4), for any γ(0,1]\gamma\in(0,1], we have

Denote by (s,a,s)dπ(s,a,s^{\prime})\sim d_{\pi} draws from dπ(s)π(as)T(ss,a)d_{\pi}(s)\pi(a|s){\boldsymbol{T}}(s^{\prime}|s,a). For any function ff, we have

One may view dπd_{\pi} as the invariant distribution of an induced Markov chain with transition probability of (1γ)d0(s)+γTπ(ss)(1-\gamma)d_{0}(s^{\prime})+\gamma{\boldsymbol{T}}_{\pi}(s^{\prime}|s), which follows Tπ{\boldsymbol{T}}_{\pi} with probability γ\gamma, and restarts from initial distribution d0(s)d_{0}(s^{\prime}) with probability 1γ1-\gamma. We can show that dπd_{\pi} exists and is unique under mild conditions .

Assume dπd_{\pi} is the unique solution of (13), and dπ0(s)>0d_{\pi_{0}}(s)>0, s\forall s. Define

Assume 0<γ<10<\gamma<1, then w(s)=wπ/π0(s)w(s)=w_{\pi/\pi_{0}}(s) if and only if L(w,f)=0L(w,f)=0 for any test function ff.

3 Further Theoretical Analysis

In this section, we develop further theoretical understanding on the loss function L(w,f)L(w,f). Lemma 5 below reveals an interesting connection between L(w,f)L(w,f) and the Bellman equation, allowing us to bound the estimation error of density ratio and expected reward with the mini-max loss when the discriminator space F{\mathcal{F}} is chosen properly (Theorems 6 and 7). The results in this section apply to both discounted and average reward cases.

Note that Πf\Pi f equals the left hand side of the Bellman equations (1) and (2), when f=Vπf=V^{\pi}.

Lemma 5 represents L(w,f)L(w,f) as an inner product between wπ/π0ww_{\pi/\pi_{0}}-w and Πf\Pi f (under base measure dπ0d_{\pi_{0}}). This provides an alternative proof of Theorem 4, since L(w,f)=0, fFL(w,f)=0,~{}\forall f\in{\mathcal{F}} implies that wπ/π0ww_{\pi/\pi_{0}}-w is orthogonal with all Πf\Pi f and hence wπ/π0=ww_{\pi/\pi_{0}}=w when {Πf ⁣:fF}\{\Pi f\colon f\in{\mathcal{F}}\} is sufficiently rich.

Since our main goal is to estimate the expected total reward Rπ{R}_{\pi} instead of the density ratio wπ/π0w_{\pi/\pi_{0}}, it is of interest to select F{\mathcal{F}} to directly bound the estimation error of the total reward. Interestingly, this can be achieved once F{\mathcal{F}} includes the true value function VπV^{\pi}.

Define Rπ[w]{R}_{\pi}[w] to be the reward estimate using estimated density ratio w(s)w(s) (which may not equal the true ratio wπ/π0w_{\pi/\pi_{0}}) and infinite number of trajectories from dπ0d_{\pi_{0}}, that is,

Related Work

Our off-policy setting is related to, but different from, off-policy value-function learning . Our goal is to estimate a single scalar that summarizes the quality of a policy (a.k.a. off-policy value estimation as called by some authors ). However, our idea can be extended to estimating value functions as well, by using estimated density ratios to weight observed transitions (c.f., the distribution μ\mu in LSTDQ ). We leave this as future work.

IS-based off-policy value estimation has seen a lot of interest recently for short-horizon problems, including contextual bandits , and achieved many empirical successes . When extended to long-horizon problems, it faces an exponential blowup of variance, and variance-reduction techniques are used to improve the estimator . However, it can be proved that in the worst case, the mean squared error of any estimator has to depend exponentially on the horizon . Fortunately, many problems encountered in practical applications may present structures that enable more efficient off-policy estimation, as tackled by the present paper. An interesting open direction is to characterize theoretical conditions that can ensure tractable estimation for long horizon problems.

Few prior work directly target infinite-horizon problems. There exists approaches that use simulated samples to estimate stationary state distributions [1, Chapter IV]. However, they need a reliable model to draw such simulations, a requirement that is not satisfied in many real-world applications. To the best of our knowledge, the recently developed COP-TD algorithm is the only work that attempts to estimate wπ/π0w_{\pi/\pi_{0}} as an intermediate step of estimating the value function of a target policy π\pi. They take a stochastic-approximation approach and show asymptotic consistence. However, extending their approach to continuous state/action spaces appears challenging.

Finally, there is a comprehensive literature of two-sample density ratio estimation [e.g., 27, 35], which estimates the density ratio of two distributions from pairs of their samples. Our problem setting is different in that we only have data from dπ0d_{\pi_{0}}, but not from dπd_{\pi}; this makes the traditional density ratio estimators inapplicable to our problem. Our method is made possible by taking the special temporal structure of MDP into consideration.

Experiment

In this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods. We compare with the standard trajectory-wise and step-wise IS and WIS methods. We do not report the results of unnormalized IS because they are generally significantly worse than WIS methods . In all the cases, we also compare with an on-policy oracle and a naive averaging baseline, which estimates the reward using direct averaging over the trajectories generated by the target policy and behavior policy, respectively. For problems with discrete action and state spaces, we also compare with a standard model-based method, which estimates the transition and reward model and then calculates expected reward explicitly using the model up to the desired truncation length. When applying our method on problems with finite and discrete state space, we optimize ww and ff in the space of all possible functions (corresponding to using a delta kernel in terms of RKHS). For continuous state space, we assume ww is a standard feed-forward neural network, and F{\mathcal{F}} is a RKHS with a standard Gaussian RBF kernel whose bandwidth equals the median of the pairwise distances between the observed data points.

Because we cannot simulate truly infinite steps in practice, we use the behavior policy to generate trajectories of length TT, and evaluate the algorithms based on the mean square error (MSE) w.r.t. the TT-step rewards of a large number of trajectories of length TT from the target policy. We expect that our method gets better as TT increases, since it is designed for infinite horizon problems, while the IS/WIS methods receive large variance and deteriorate as TT increases.

Taxi is a 2D grid world simulating taxi movement along the grids. A taxi moves North, East, South, West or attends to pick up or drop off a passenger. It receives a reward of 2020 when it successfully picks up a passenger or drops her off at the right place, and otherwise a reward of -1 every time step. The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of 5×55\times 5, which yields 20002000 states in total (25×24×525\times 2^{4}\times 5, corresponding to 2525 taxi locations, 242^{4} passenger appearance status and 55 taxi status (empty or with one of 4 destinations)).

To construct target and behavior policies for testing our algorithm, we set our target policy to be the final policy π\pi_{*} after running Q-learning for 10001000 iterations, and set another policy π+\pi_{+} after 950950 iterations. The behavior policy is π=(1α)π+απ+\pi=(1-\alpha)\pi_{*}+\alpha\pi_{+}, where α\alpha is a mixing ratio that can be varied.

Results in Taxi Environment

Figure 1(a)–(b) show results with average reward. We can see our method performs almost as well as the on-policy oracle, outperforming all the other methods. To evaluate the approximation error of the estimated density ratio w^\hat{w}, we plot in Figure 1(c) the weighted total variation distance between d^π=w^dπ0\hat{d}_{\pi}=\hat{w}d_{\pi_{0}} with the true dπd_{\pi} with TV distance as we optimize the loss function. Figure 1(d) shows scatter plot of {(d^π(s),dπ(s)) : sS}\{(\hat{d}_{\pi}(s),d_{\pi}(s))~{}:~{}\forall s\in\mathcal{S}\} at convergence, indicating our method correctly estimates the true density ratio over the state space.

Figure 2 shows similar results for discounted reward. From Figure 2(c) and (d), we can see that typical IS methods deteriorate as the trajectory length TT and discount factor γ\gamma increase, respectively, which is expected since their variance grows exponentially with TT. In contrast, our density ratio method performs better as trajectory length TT increases, and is robust as γ\gamma increases.

Pendulum Environment

We train a near-optimal policy π\pi_{*} using REINFORCE and set it to be the target policy. The behavior policy is set to be π=(1α)π+απ+\pi=(1-\alpha)\pi_{*}+\alpha\pi_{+}, where α\alpha is a mixing ratio, and π+\pi_{+} is another policy from REINFORCE when it has not converged. Our results are shown in Figure 3, where we again find that our method generally outperforms the standard trajectory-wise and step-wise WIS, and works favorably in long-horizon problems (Figure 3(b)).

SUMO Traffic Simulator

SUMO is an open source traffic simulator; see Figure 4(a) for an illustration. We consider the task of reducing traffic congestion by modelling traffic light control as a reinforcement learning problem . We use TraCI, a built-in “Traffic Control Interface”, to interact with the SUMO simulator. Full details of our environmental settings can be found in Appendix E. Our results are shown in Figure 4, where we again find that our method is consistently better than standard IS methods.

Conclusions

We study the off-policy estimation problem in infinite-horizon problems and develop a new algorithm based on direct estimation of the stationary state density ratio between the target and behavior policies. Our mini-max objective function enjoys nice theoretical properties and yields an intriguing connection with Bellman equations that is worth further investigation. Future directions include scaling our method to larger scale problems and extending it to estimate value functions and leverage off-policy data in policy optimization.

Acknowledgement

This work is supported in part by NSF CRII 1830161. We would like to acknowledge Google Cloud for their support.

References

Appendix A Several Variants of IS- and WIS-based Estimators

Denote by γt=γt/t=0Tγt\gamma_{t}=\gamma^{t}/\sum_{t=0}^{{T}}\gamma^{t} for notation simplicity. Define

Then we have the following two key formulas, which derive the trajectory-wise, and step-wise importance sampling (IS) estimators, respectively.

Given a set of mm observed trajectories τi={sti,ati,rti}t=0T{\boldsymbol{\tau}}^{i}=\{s_{t}^{i},a_{t}^{i},r_{t}^{i}\}_{t=0}^{T}, i=1,,m\forall i=1,\ldots,m, drawn from pπ0p_{\pi_{0}}. The trajectory-wise and step-wise estimators are

where w0:ti=w0:t(τi)w_{0:t}^{i}=w_{0:t}({\boldsymbol{\tau}}^{i}) and ZtZ_{t} is a normalization constant of the importance weights: when Zt=m,   tZ_{t}=m,~{}~{}~{}\forall t, the corresponding estimators (called Trajectory-wise IS and Step-wise IS, respectively) provide unbiased estimates of RπTR_{\pi}^{T}; when Zt=i=1mw0:tiZ_{t}=\sum_{i=1}^{m}w_{0:t}^{i}, the corresponding estimators are weighted (or self-normalized) importance sampling (called Trajectory-wise WIS and Step-wise WIS, respectively), which introduce bias but often have lower variance. It has been shown that the Step-wise WIS often performs the best among all these variants .

In comparison, our method can be viewed as a further Rao-backwellization of the step-wise estimators. Define

where we replace w0:tw_{0:t} in (20) with wt:tw_{t:t}, based on Rao-backwellization conditioning on (st,at)(s_{t},a_{t}). This gives an empirical estimator:

where wt:ti=wt:t(ati,sti)w_{t:t}^{i}=w_{t:t}(a_{t}^{i},s_{t}^{i}) and Zt=mZ_{t}=m or Zt=i=1mwt:tiZ_{t}=\sum_{i=1}^{m}w_{t:t}^{i}. Comparing this with the trajectory-wise and step-wise estimators, it is easy to expect that it yields smaller variance, when ignoring the estimation error of wt:tw_{t:t}.

Appendix B A motivating example

Here we provide an example when w0:Tw_{0:T} is exponential on the trajectory length TT, yielding high variance in trajectory-wise and step-wise estimators in long horizon problems, while the variance of our stationary density ratio based importance weight wt:tw_{t:t} stays to be a constant as TT increases.

The MDP has nn states: S={0,1,,n1}\mathcal{S}=\{0,1,\dots,n-1\}, arranged on a circle (see the figure on the right), where nn is an odd number. There are two actions, left (L\mathsf{L}) and right (R\mathsf{R}). The left action moves the agent from the current state counterclockwise to the next state, and the right action has the opposite (clockwise) effect. The deterministic reward is if taking action L\mathsf{L} and 11 otherwise. In summary, we have for any ss and aa that

Suppose we are given two policies. The behavior policy π0\pi_{0} and target policy π\pi choose action R\mathsf{R} with probability ρ\rho and 1ρ1-\rho, respectively. We focus on the average reward (γ=1)(\gamma=1) here.

First, note that the MDP is ergodic under either policy, as nn is odd. Since π0\pi_{0} and π\pi are symmetric, their stationary distributions are identical, that is, dπ(s)/dπ0(s)=1d_{\pi}(s)/d_{\pi_{0}}(s)=1. In fact, both dπ=dπ0d_{\pi}=d_{\pi_{0}} are uniform over S\mathcal{S}. Therefore,

and similarly wt:t(s,L)=(1ρ)/ρw_{t:t}(s,\mathsf{L})=(1-\rho)/\rho. Both ratios are independent of the trajectory length, and have zero variance.

Under the setting above, let τ={st,at,rt}0tT{\boldsymbol{\tau}}=\{s_{t},a_{t},r_{t}\}_{0\leq t\leq T} be a trajectory drawn from the behavior policy π0\pi_{0}, we have

Obviously, Aρ>1A_{\rho}>1 for ρ1/2\rho\neq 1/2 and Aρ=1A_{\rho}=1 for ρ=1/2\rho=1/2, and Bρ,T>0B_{\rho,T}>0 for large enough TT. Therefore, the variance of both the trajectory-wise importance weights and the corresponding estimator grow exponentially in the order of AρTA_{\rho}^{T}.

Remark

From the definition of the setting, it is easy to show that

Under policy π0\pi_{0}, F(τ)F({\boldsymbol{\tau}}) follows a Binomial distribution Binomial(T+1,ρ){Binomial}({T+1},\rho). The first order moments can be easily calculated as follows

It remains to calculate the second order moments. We achieve this by leveraging the moment-generating function (MGF) of Binomial distribution:

It will turn out be useful to consider the derivatives of Φ(λ)\Phi(\lambda):

For convenience, define C=(1ρ)/ρC=(1-\rho)/\rho, and we have

where we use the fact that (1ρ+ρC4)C2=ρ3+(1ρ)3(1ρ)ρ=Aρ.(1-\rho+\rho C^{4})C^{-2}=\frac{\rho^{3}+(1-\rho)^{3}}{(1-\rho)\rho}=A_{\rho}. Similarly, we have

where we use the fact that Bρ,T=(C/(T+1)+ρC4)ρ2B_{\rho,T}=({C}/{({T+1})}+\rho C^{4})\rho^{2}. It is then straightforward to calculate the variance from here. ∎

Claim #3. Variance of trajectory-wise WIS weight grows exponentially in T𝑇T.

Although weighted-IS (WIS) often improves over IS estimators by using self-normalized weights, it cannot eliminate the exponential dependence on the trajectory length. Here, we calculate the asymptotic variance of trajectory-wise WIS using delta method [28, Chapter 9].

Let R^n,wis\hat{R}_{n,\text{wis}} be the trajectory-wise WIS estimator of RπR_{\pi} based on nn copies of independent trajectories drawn from π0\pi_{0}, we have

where Dρ,A=Bρ,TAρ1  2(1ρ)3/ρ + (1ρ)2Aρ,D_{\rho,A}=B_{\rho,T}A_{\rho}^{-1}~{}-~{}2{(1-\rho)^{3}}/{\rho}~{}+~{}(1-\rho)^{2}A_{\rho}, with AρA_{\rho} and Bρ,TB_{\rho,T} defined in Proposition 8.

The asymptotic mean square error (MSE) of a self-normalized importance sampling estimator can be estimated using the delta method [28, Chapter 9]:

where the first and third terms have been calculated in the proof of Proposition 8. We just need to calculate the cross term:

Appendix C Proofs

where we assume g(x)=ibik(s,si)g(x)=\sum_{i}b_{i}k(s,s_{i}). A simple yet important fact that our proof will leverage is that

A key property of RKHS is the so called reproducing property, which says

this can be proved using the reproducing property as follows

For more introduction to RKHS, see , to name only a few.

Note that dπ0(s,as)=dπ0(s)π0(as)T(ss,a)dπ0(s)d_{\pi_{0}}(s,a|s^{\prime})=\frac{d_{\pi_{0}}(s)\pi_{0}(a|s){\boldsymbol{T}}(s^{\prime}|s,a)}{d_{\pi_{0}}(s^{\prime})}. Therefore, (9) is equivalent to

Denote g(s):=dπ0(s)w(s)g(s):=d_{\pi_{0}}(s)w(s). Since dπ0(s)>0d_{\pi_{0}}(s^{\prime})>0 for all ss^{\prime}, we find that (9) is equivalent to

This implies that g(s)g(s) is invariant under Markov transition T(ss,a)π(as){\boldsymbol{T}}(s^{\prime}|s,a)\pi(a|s). Because dπ(s)d_{\pi}(s) is the unique stationary distribution under the same Markov transition, (24) holds if and only if g(s)dπ(s)g(s)\propto d_{\pi}(s), or equivalently, w(s)wπ/π0(s).w(s)\propto w_{\pi/\pi_{0}}(s). This completes the proof. ∎

Assume γ(0,1)\gamma\in(0,1). The definition in (4) gives dπ(s)=(1γ)t=0γtdπ,t(s)d_{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d_{\pi,t}(s). Therefore,

Multiplying both sides by f(s)f(s^{\prime}) and summing over ss^{\prime}, we get

Recall that (s,a,s)dπ(s,a,s^{\prime})\sim d_{\pi} denotes sampling from the joint distribution of dπ(s,a,s)=dπ(s)T(s,as)π(as)d_{\pi}(s,a,s^{\prime})=d_{\pi}(s){\boldsymbol{T}}(s^{\prime},a|s)\pi(a|s). Note that under this joint distribution, the marginal distribution of ss^{\prime} is different from dπ(s)d_{\pi}(s).This is different from the average reward case, in which dπ(s)d_{\pi}(s) is the stationary distribution of Tπ{\boldsymbol{T}}_{\pi}.

where gg is any function. Then by assumption, we have g(s)=dπ(s)g(s)=d_{\pi}(s) if and only if δ(g, s)=0\delta(g,~{}s^{\prime})=0 for any ss^{\prime}. Replacing dπd_{\pi} with dπ0d_{\pi_{0}} and f(s)f(s) with w(s)f(s)w(s)f(s) in (14) gives

Plugging it into the definition of L(w,f)L(w,f) in (15), we get

where we have defined g(s):=dπ(s)wπ/π0(s)1w(s)g(s):=d_{\pi}(s)w_{\pi/\pi_{0}}(s)^{-1}w(s). Therefore, L(w,f)=0L(w,f)=0 for f\forall f is equivalent to δ(g,s)=0\delta(g,s^{\prime})=0 for s\forall s^{\prime}, which is in turn equivalent to g(s)=dπ(s)g(s)=d_{\pi}(s). Therefore, we have w(s)=wπ/π0(s)w(s)=w_{\pi/\pi_{0}}(s) when 0<γ<10<\gamma<1, and g(s)dπ(s)g(s)\propto d_{\pi}(s), or equivalently, w(s)wπ/π0(s)w(s)\propto w_{\pi/\pi_{0}}(s), when γ=1\gamma=1. ∎

Following the proof of Theorem 4 up to (25), we have

Consider first the discounted case γ(0,1)\gamma\in(0,1), we have

For the uniqueness, assume g=Πf1g=\Pi f_{1} and g=Πf2g=\Pi f_{2}, and δf=f1f2{\delta f}=f_{1}-f_{2}, then Πδf=0\Pi{\delta f}=0, where

which implies δf=0\left\|{\delta f}\right\|_{\infty}=0.

For the average reward case γ=1\gamma=1, we have

which is contradictory. Therefore, δf{\delta f} must be a constant. In fact, functions that satisfies δf=sTπ(ss)δf(s){\delta f}=\sum_{s^{\prime}}{\boldsymbol{T}}_{\pi}(s^{\prime}|s){\delta f}(s^{\prime}) is called harmonic [17, Lemma 1.16]. ∎

We just need to calculate fgf_{g}, following Lemma 10.

We consider the average reward case first. Following the definition of the operator Π\Pi in (17) and the average reward Bellman equation, we have

For the discounted case, following the definition of Π\Pi and the discounted Bellman equation (2), we have ΠVπ(s)=rπ\Pi V_{\pi}(s)=r_{\pi}, which gives

Appendix D Algorithm Details

Algorithm 1 summarizes our main algorithm for the average reward case, where we approximate the mini-max loss function in (12) using empirical averaging of observed data.

The algorithm for the discounted case follows the same idea, but requires some modification due to the additional term in (15). To handle it in a notionally convenient way, we find it is useful to introduce a dummy transition pair {s1,a1,s1,r1}\{s_{-1},a_{-1},s_{-1}^{\prime},r_{-1}\} at time t=1t=-1, for which we define s1=s0s_{-1}^{\prime}=s_{0}, r1=0r_{-1}=0 and Δ(w; s1,a1,s1):=1w(s0)f(s0)\Delta(w;~{}s_{-1},a_{-1},s^{\prime}_{-1}):=1-w(s_{0})f(s_{0}). Related, we define an augmented discounted visitation distribution via

Under this notation, the loss (15) of discounted case is rewritten into a form identical to the average reward case:

when F{\mathcal{F}} is the ball of RKHS with kernel k(s,sˉ)k(s^{\prime},\bar{s}^{\prime}).

This equation is identical to the one in Algorithm 1 for the average case, but differs in the way the minibatch M\mathcal{M} is generated: it includes the dummy transition at time t=1t=-1 with probability (1γ)(1-\gamma) and select time tt with discounted probability γt+1\gamma^{t+1}. See Algorithm 2 for the summary of the procedure.

Appendix E Information on SUMO Traffic Simulator

We provide details of the SUMO traffic simulator and how we formulate it as a standard reinforcement learning problem.

A states of a traffic should provide us with enough information to control the traffic light. A complex way is an image-like representation of the traffic vehicle around the traffic light intersection . Here, to simplify the problem, we add lane detectors around traffic light intersections, and count the total number of vehicles on each lane as states sts_{t}. This should give us enough, though not perfect, information to guide the traffic light agent to choose its action.

Actions

For a standard crossing intersection, its traffic light will have a program for 8 phases: “Straight signal for North-South”, “Turn-left signal for North-south”, “Straight signal for East-West”, “Turn-left signal for East-west” and their corresponding “yellow light” slow down signals. Here, we simplify these 4 phases into actions ata_{t} for each traffic light, where we let one big time step tt in reinforcement learning setting to be 6 real time steps in SUMO simulator. Within each big time step tt, we add a transition of 3 real time steps “yellow light” phase as a buffer to prevent vehicles for “emergency stop” if our agent decides to change light status (ataTa_{t}\neq a_{{T}}).

Rewards

Our goal is to minimize the total travelling time for all vehicles. Thus, we could set the negative of current aggregate total number of vehicles during the one big time step as reward rtr_{t}. To simplify, we can just consider 6 times the current total number of vehicle as a approximation of rtr_{t} to make our system simpler.

Policy

We use linear policy with the final softmax layer as probability for each action. We train a policy π\pi_{*} using Cross entropy(CE) method for 10 iterations and set it to be the target policy. And we set the policies at the training iteration 6, 7, 8, 9 as behavior policies, which correspond to x-ticks 1-4 in Figure 4(c).

Other details

To simulate on our given network, we also need to design route documents for a vehicle to follow. Each route is a set of roads that connect any two exit nodes from the map. To make simple but reasonable routes for the vehicle, we constrain our routes with at most one turn in the network to avoid detours. We control each route with a fixed probability (different from each route) every time step to generate a vehicle, to guarantee a randomized environment.