Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

Introduction

Reinforcement learning (RL) is one of the most successful approaches to artificial intelligence, and has found successful applications in robotics, games, dialogue systems, and recommendation systems, among others. One of the key problems in RL is policy evaluation: given a fixed policy, estimate the average reward garnered by an agent that runs this policy in the environment. In this paper, we consider the off-policy estimation problem, in which we want to estimate the expected reward of a given target policy with samples collected by a different behavior policy. This problem is of great practical importance in many application domains where deploying a new policy can be costly or risky, such as medical treatments , econometrics , recommender systems , education , Web search , advertising and marketing . It can also be used as a key component for developing efficient off-policy policy optimization algorithms .

Most state-of-the-art off-policy estimation methods are based on importance sampling (IS) [e.g., 22]. A major limitation, however, is that this approach can become inaccurate due to the high variance introduced by the importance weights, especially when the trajectory is long. Indeed, most existing IS-based estimators compute the weight as the product of the importance ratios of many steps in the trajectory. Variances in individual steps accumulate multiplicatively, so that the overall IS weight of a random trajectory can have an exponentially high variance to result in an unreliable estimator. In the extreme case when the trajectory length is infinite, as in infinite-horizon average-reward problems, some of these estimators are not even well-defined. Ad hoc approaches can be used, such as truncating the trajectories, but often lead to a hard-to-control bias in the final estimation. Analogous to the well-known “curse of dimensionality” in dynamic programming , we call this problem the “curse of horizon” in off-policy learning.

In this work, we develop a new approach that tackles the curse of horizon. The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories. This avoids the cumulative product across time in the density ratio, substantially decreasing its variance and eliminating the estimator’s dependence on the horizon.

Our key challenge, of course, is to estimate the importance ratios of average visitation distributions. In practice, we often have access to both the target and behavior policies to compute their importance ratio of an action conditioned on a given state. But we typically have no access to transition probabilities of the environment, so estimating importance ratios of state visitation distributions has been very difficult, especially when only off-policy samples are available. In this paper, we develop a mini-max loss function for estimating the true stationary density ratio, which yields a closed-form representation similar to maximum mean discrepancy when combined with a reproducing kernel Hilbert space (RKHS). We study the theoretical properties of our loss function, and demonstrate its empirical effectiveness on long-horizon problems.

Background

Consider a Markov decision process (MDP) $M=\langle\mathcal{S},\mathcal{A},r,{\boldsymbol{T}}\rangle$ with state space $\mathcal{S}$ , action space $\mathcal{A}$ , reward function $r$ , and transition probability function ${\boldsymbol{T}}$ . Assume the environment is initialized at state $s_{0}\in\mathcal{S}$ , drawn from an unknown distribution $d_{0}(\cdot)$ . At each time step $t$ , an agent observes the current state $s_{t}$ , takes an action $a_{t}$ according to a possibly stochastic policy $\pi(\cdot|s_{t})$ , receives a reward $r_{t}$ whose expectation is $r(s_{t},a_{t})$ , and transitions to a next state $s_{t+1}$ according to transition probabilities ${\boldsymbol{T}}(\cdot|s_{t},a_{t})$ . To simplify exposition and avoid unnecessary technicalities, we assume $\mathcal{S}$ and $\mathcal{A}$ are finite unless otherwise specified, although our method extends to continuous spaces straightforwardly, as demonstrated in experiments.

We consider the infinite horizon problem in which the MDP continues without termination. Let $p_{\pi}(\cdot)$ be the distribution of trajectory ${\boldsymbol{\tau}}=\{s_{t},a_{t},r_{t}\}_{t=0}^{\infty}$ under policy $\pi$ . The expected reward of $\pi$ is

where $R^{T}_{\pi}({\boldsymbol{\tau}})$ is the reward of trajectory ${\boldsymbol{\tau}}$ up to time $T$ . Here, $\gamma\in(0,1]$ is a discount factor. We distinguish two reward criteria, the average reward ( $\gamma=1$ ) and discounted reward ( $0<\gamma<1$ ):

where $(1-\gamma)=1/\sum_{t=0}^{\infty}\gamma^{t}$ is a normalization factor. The problem of off-policy value estimation is to estimate the expected reward ${R}_{\pi}$ of a given target policy $\pi$ , when we only observe a set of trajectories ${\boldsymbol{\tau}}^{i}=\{s_{t}^{i},a_{t}^{i},r_{t}^{i}\}_{t=0}^{{T}}$ generated by following a different behavior policy $\pi_{0}$ .

Bellman Equation

Under these definitions, $V^{\pi}$ is the fixed-point solution to the respective Bellman equations:

Importance Sampling

IS represents a major class of approaches to off-policy estimation, which, in principle, only applies to the finite-horizon reward $R^{T}_{\pi}$ when the trajectory is truncated at a finite time step $T<\infty$ . IS-based estimators are based on the following change-of-measure equality:

where $\beta_{\pi/\pi_{0}}(a|s):=\pi(a|s)/\pi_{0}(a|s)$ is the single-step density ratio of policies $\pi$ and $\pi_{0}$ evaluated at a particular state-action pair $(s,a)$ , and $w_{0:T}$ is the density ratio of the trajectory ${\boldsymbol{\tau}}$ up to time $T$ . Methods based on (3) are called trajectory-wise IS, or weighted IS (WIS) when the weights are self-normalized . It is possible to improve trajectory-wise IS with the so called step-wise, or per-decision, IS/WIS, which uses weight $w_{0:t}$ for reward $r_{t}$ at time $t$ , yielding smaller variance . More details about these estimators are given in Appendix A.

The Curse of Horizon

The importance weight $w_{0:T}$ is a product of $T$ density ratios, whose variance can grow exponentially with $T$ . Thus, IS-based estimators have not been widely successful in long-horizon problems, let alone infinite-horizon ones where $w_{0:\infty}$ may not even be well-defined. While WIS estimators often have reduced variance, the exponential dependence on horizon is unavoidable in general. We call this phenomenon in IS/WIS-based estimators the curse of horizon.

Not all hope is lost, however. To see this, consider an MDP with $n$ states and $2$ actions, where states are arranged on a circle (see figure on the right). The two actions deterministically move the agent from the current state to the neighboring state counterclockwise and clockwise, respectively. Suppose we are given two policies with opposite effects: the behavior policy $\pi_{0}$ moves the agent clockwise with probability $\rho$ , and the target policy $\pi$ moves the agent counterclockwise with probability $\rho$ , for some constant $\rho\in(0,1)$ . As shown in Appendix B, IS and WIS estimators suffer from exponentially large variance when estimating the average reward of $\pi$ . However, a keen reader will realize that the two policies are symmetric, and thus their stationary state visitation distributions are identical. As we show in the sequel, this allows us to estimate the expected reward using a much more efficient importance sampling, whose importance weight equals the single-step density ratio $\beta_{\pi/\pi_{0}}(a_{t}|s_{t})$ , instead of the cumulative product weight $w_{0:T}$ in (3), allowing us to significantly reduce the variance. Such an observation inspired the approach developed in this paper.

Off-Policy Estimation via Stationary State Density Ratio Estimation

As shown in the example above, significant decrease in estimation variance is possible when we apply importance weighting on the state space, rather than the trajectory space. It eliminates the dependency on the trajectory length and is much more suited for long- or infinite-horizon problems. To realize this, we need to introduce an alternative representation of the expected reward. Denote by $d_{\pi,t}(\cdot)$ the distribution of state $s_{t}$ when we execute policy $\pi$ starting from an initial state $s_{0}$ drawn from an initial distribution $d_{0}(\cdot)$ . We define the average visitation distribution to be

We always assume the limit $T\to\infty$ exists in this work. When $\gamma\in(0,1)$ in the discounted case, $d_{\pi}$ is a discounted average of $d_{\pi,t}$ , that is, $d_{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d_{\pi,t}(s)$ ; when $\gamma=1$ in the average reward case, $d_{\pi}$ is the stationary distribution of $s_{t}$ as $t\to\infty$ under policy $\pi$ , that is, $d_{\pi}(s)=\lim_{T\to\infty}\frac{1}{{T+1}}\sum_{t=0}^{{T}}d_{\pi,t}(s)=\lim_{t\to\infty}d_{\pi,t}(s)$ .

Following Definition 4, it can be verified that $R_{\pi}$ can be expressed alternatively as

where, abusing notation slightly, we use $(s,a)\sim d_{\pi}$ to denote draws from distribution $d_{\pi}(s,a):=d_{\pi}(s)\pi(a|s)$ . Our idea is to construct an IS estimator based on (5), where the importance ratio is computed on state-action pairs rather than on trajectories:

This IS estimator works in the space of $(s,a)$ , instead of trajectoris ${\boldsymbol{\tau}}=\{s_{t},a_{t}\}_{t=0}^{{T}}$ , leading to a potentially significant variance reduction. Returning to the example in Section 2 (see also Appendix B), since the two policies are symmetric and lead to the same stationary distributions, that is, $w_{\pi/\pi_{0}}(s)=1$ , the importance weight in (6) is simply ${\pi(a|s)}/{\pi_{0}(a|s)}$ , independent of the trajectory length. This avoids the excessive variance in long horizon problems. In Appendix A, we provide a further discussion, showing that our estimator can be viewed as a type of Rao-Backwellization of the trajectory-wise and step-wise estimators.

The key technical challenge remaining is estimating the density ratio $w_{\pi/\pi_{0}}(s)$ , which we address in this section. For simplifying the presentation, we start with estimating $d_{\pi}(s)$ for the average reward case and discuss the discounted case in Section 3.2.

Let ${\boldsymbol{T}}_{\pi}(s^{\prime}|s):=\sum_{a}{\boldsymbol{T}}(s^{\prime}|s,a)\pi(a|s)$ be the transition probability from $s$ to $s^{\prime}$ following policy $\pi$ . In the average reward case, $d_{\pi}$ equals the stationary distribution of ${\boldsymbol{T}}_{\pi}$ , satisfying

Assume the Markov chain of ${\boldsymbol{T}}_{\pi}$ is finite state and ergodic, $d_{\pi}$ is also the unique distribution that satisfies (8). This simple fact can be leveraged to derive the following key property of $w_{\pi/\pi_{0}}(s)$ .

In the average reward case ( $\gamma=1$ ), assume $d_{\pi}$ is the unique invariant distribution of ${\boldsymbol{T}}_{\pi}$ and $d_{\pi_{0}}(s)>0$ , $\forall s$ . Then a function $w(s)$ equals $w_{\pi/\pi_{0}}(s):=d_{\pi}(s)/d_{\pi_{0}}(s)$ (up to a constant factor) if and only if it satisfies

where $\beta_{\pi/\pi_{0}}(a|s)=\pi(a|s)/\pi_{0}(a|s)$ and $(s,a)|s^{\prime}\sim d_{\pi_{0}}$ denote the conditional distribution $d_{\pi_{0}}(s,a|s^{\prime})$ related to joint distribution $d_{\pi_{0}}(s,a,s^{\prime}):=d_{\pi_{0}}(s)\pi_{0}(a|s){\boldsymbol{T}}(s^{\prime}|s,a)$ . Note that this is a time-reserved conditional probability, since it is the conditional distribution of $(s,a)$ given that their next state is $s^{\prime}$ following policy $\pi_{0}$ .

Following Theorem 1, we have $w\propto w_{\pi/\pi_{0}}$ if and only if $L(w,f)=0$ for any function $f$ . This motivates us to estimate $w_{\pi/\pi_{0}}$ with a mini-max problem:

Assume $\mathcal{H}$ is a RKHS of functions $f(s)$ with a positive definite kernel $k(s,\bar{s})$ , and define ${\mathcal{F}}:=\{f\in\mathcal{H}\colon||f||_{\mathcal{H}}\leq 1\}$ to be the unit ball of $\mathcal{H}$ . We have

where $(s,a,s^{\prime})$ and $(\bar{s},\bar{a},\bar{s}^{\prime})$ are independent transition pairs obtained when running policy $\pi_{0}$ , and $\Delta(w;s,a,s^{\prime})$ is defined in (10). See Appendix C for more background on RKHS.

In practice, we approximate the expectation in (12) using discounted empirical distribution of the transition pairs, yielding consistent estimates following standard results on V-statistics .

2 Discounted Reward Case

We now discuss the extension to the discount case of $\gamma\in(0,1)$ . Similar to the average reward case, we start with a recursive equation that characterizes $d_{\pi}(s)$ in the discounted case.

Following the definition of $d_{\pi}$ in (4), for any $\gamma\in(0,1]$ , we have

Denote by $(s,a,s^{\prime})\sim d_{\pi}$ draws from $d_{\pi}(s)\pi(a|s){\boldsymbol{T}}(s^{\prime}|s,a)$ . For any function $f$ , we have

One may view $d_{\pi}$ as the invariant distribution of an induced Markov chain with transition probability of $(1-\gamma)d_{0}(s^{\prime})+\gamma{\boldsymbol{T}}_{\pi}(s^{\prime}|s)$ , which follows ${\boldsymbol{T}}_{\pi}$ with probability $\gamma$ , and restarts from initial distribution $d_{0}(s^{\prime})$ with probability $1-\gamma$ . We can show that $d_{\pi}$ exists and is unique under mild conditions .

Assume $d_{\pi}$ is the unique solution of (13), and $d_{\pi_{0}}(s)>0$ , $\forall s$ . Define

Assume $0<\gamma<1$ , then $w(s)=w_{\pi/\pi_{0}}(s)$ if and only if $L(w,f)=0$ for any test function $f$ .

3 Further Theoretical Analysis

In this section, we develop further theoretical understanding on the loss function $L(w,f)$ . Lemma 5 below reveals an interesting connection between $L(w,f)$ and the Bellman equation, allowing us to bound the estimation error of density ratio and expected reward with the mini-max loss when the discriminator space ${\mathcal{F}}$ is chosen properly (Theorems 6 and 7). The results in this section apply to both discounted and average reward cases.

Note that $\Pi f$ equals the left hand side of the Bellman equations (1) and (2), when $f=V^{\pi}$ .

Lemma 5 represents $L(w,f)$ as an inner product between $w_{\pi/\pi_{0}}-w$ and $\Pi f$ (under base measure $d_{\pi_{0}}$ ). This provides an alternative proof of Theorem 4, since $L(w,f)=0,~{}\forall f\in{\mathcal{F}}$ implies that $w_{\pi/\pi_{0}}-w$ is orthogonal with all $\Pi f$ and hence $w_{\pi/\pi_{0}}=w$ when $\{\Pi f\colon f\in{\mathcal{F}}\}$ is sufficiently rich.

Since our main goal is to estimate the expected total reward ${R}_{\pi}$ instead of the density ratio $w_{\pi/\pi_{0}}$ , it is of interest to select ${\mathcal{F}}$ to directly bound the estimation error of the total reward. Interestingly, this can be achieved once ${\mathcal{F}}$ includes the true value function $V^{\pi}$ .

Define ${R}_{\pi}[w]$ to be the reward estimate using estimated density ratio $w(s)$ (which may not equal the true ratio $w_{\pi/\pi_{0}}$ ) and infinite number of trajectories from $d_{\pi_{0}}$ , that is,

Related Work

Our off-policy setting is related to, but different from, off-policy value-function learning . Our goal is to estimate a single scalar that summarizes the quality of a policy (a.k.a. off-policy value estimation as called by some authors ). However, our idea can be extended to estimating value functions as well, by using estimated density ratios to weight observed transitions (c.f., the distribution $\mu$ in LSTDQ ). We leave this as future work.

IS-based off-policy value estimation has seen a lot of interest recently for short-horizon problems, including contextual bandits , and achieved many empirical successes . When extended to long-horizon problems, it faces an exponential blowup of variance, and variance-reduction techniques are used to improve the estimator . However, it can be proved that in the worst case, the mean squared error of any estimator has to depend exponentially on the horizon . Fortunately, many problems encountered in practical applications may present structures that enable more efficient off-policy estimation, as tackled by the present paper. An interesting open direction is to characterize theoretical conditions that can ensure tractable estimation for long horizon problems.

Few prior work directly target infinite-horizon problems. There exists approaches that use simulated samples to estimate stationary state distributions [1, Chapter IV]. However, they need a reliable model to draw such simulations, a requirement that is not satisfied in many real-world applications. To the best of our knowledge, the recently developed COP-TD algorithm is the only work that attempts to estimate $w_{\pi/\pi_{0}}$ as an intermediate step of estimating the value function of a target policy $\pi$ . They take a stochastic-approximation approach and show asymptotic consistence. However, extending their approach to continuous state/action spaces appears challenging.

Finally, there is a comprehensive literature of two-sample density ratio estimation [e.g., 27, 35], which estimates the density ratio of two distributions from pairs of their samples. Our problem setting is different in that we only have data from $d_{\pi_{0}}$ , but not from $d_{\pi}$ ; this makes the traditional density ratio estimators inapplicable to our problem. Our method is made possible by taking the special temporal structure of MDP into consideration.

Experiment

In this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods. We compare with the standard trajectory-wise and step-wise IS and WIS methods. We do not report the results of unnormalized IS because they are generally significantly worse than WIS methods . In all the cases, we also compare with an on-policy oracle and a naive averaging baseline, which estimates the reward using direct averaging over the trajectories generated by the target policy and behavior policy, respectively. For problems with discrete action and state spaces, we also compare with a standard model-based method, which estimates the transition and reward model and then calculates expected reward explicitly using the model up to the desired truncation length. When applying our method on problems with finite and discrete state space, we optimize $w$ and $f$ in the space of all possible functions (corresponding to using a delta kernel in terms of RKHS). For continuous state space, we assume $w$ is a standard feed-forward neural network, and ${\mathcal{F}}$ is a RKHS with a standard Gaussian RBF kernel whose bandwidth equals the median of the pairwise distances between the observed data points.

Because we cannot simulate truly infinite steps in practice, we use the behavior policy to generate trajectories of length $T$ , and evaluate the algorithms based on the mean square error (MSE) w.r.t. the $T$ -step rewards of a large number of trajectories of length $T$ from the target policy. We expect that our method gets better as $T$ increases, since it is designed for infinite horizon problems, while the IS/WIS methods receive large variance and deteriorate as $T$ increases.

Taxi is a 2D grid world simulating taxi movement along the grids. A taxi moves North, East, South, West or attends to pick up or drop off a passenger. It receives a reward of $20$ when it successfully picks up a passenger or drops her off at the right place, and otherwise a reward of -1 every time step. The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of $5\times 5$ , which yields $2000$ states in total ( $25\times 2^{4}\times 5$ , corresponding to $25$ taxi locations, $2^{4}$ passenger appearance status and $5$ taxi status (empty or with one of 4 destinations)).

To construct target and behavior policies for testing our algorithm, we set our target policy to be the final policy $\pi_{*}$ after running Q-learning for $1000$ iterations, and set another policy $\pi_{+}$ after $950$ iterations. The behavior policy is $\pi=(1-\alpha)\pi_{*}+\alpha\pi_{+}$ , where $\alpha$ is a mixing ratio that can be varied.

Results in Taxi Environment

Figure 1(a)–(b) show results with average reward. We can see our method performs almost as well as the on-policy oracle, outperforming all the other methods. To evaluate the approximation error of the estimated density ratio $\hat{w}$ , we plot in Figure 1(c) the weighted total variation distance between $\hat{d}_{\pi}=\hat{w}d_{\pi_{0}}$ with the true $d_{\pi}$ with TV distance as we optimize the loss function. Figure 1(d) shows scatter plot of $\{(\hat{d}_{\pi}(s),d_{\pi}(s))~{}:~{}\forall s\in\mathcal{S}\}$ at convergence, indicating our method correctly estimates the true density ratio over the state space.

Figure 2 shows similar results for discounted reward. From Figure 2(c) and (d), we can see that typical IS methods deteriorate as the trajectory length $T$ and discount factor $\gamma$ increase, respectively, which is expected since their variance grows exponentially with $T$ . In contrast, our density ratio method performs better as trajectory length $T$ increases, and is robust as $\gamma$ increases.

Pendulum Environment

We train a near-optimal policy $\pi_{*}$ using REINFORCE and set it to be the target policy. The behavior policy is set to be $\pi=(1-\alpha)\pi_{*}+\alpha\pi_{+}$ , where $\alpha$ is a mixing ratio, and $\pi_{+}$ is another policy from REINFORCE when it has not converged. Our results are shown in Figure 3, where we again find that our method generally outperforms the standard trajectory-wise and step-wise WIS, and works favorably in long-horizon problems (Figure 3(b)).

SUMO Traffic Simulator

SUMO is an open source traffic simulator; see Figure 4(a) for an illustration. We consider the task of reducing traffic congestion by modelling traffic light control as a reinforcement learning problem . We use TraCI, a built-in “Traffic Control Interface”, to interact with the SUMO simulator. Full details of our environmental settings can be found in Appendix E. Our results are shown in Figure 4, where we again find that our method is consistently better than standard IS methods.

Conclusions

We study the off-policy estimation problem in infinite-horizon problems and develop a new algorithm based on direct estimation of the stationary state density ratio between the target and behavior policies. Our mini-max objective function enjoys nice theoretical properties and yields an intriguing connection with Bellman equations that is worth further investigation. Future directions include scaling our method to larger scale problems and extending it to estimate value functions and leverage off-policy data in policy optimization.

Acknowledgement

This work is supported in part by NSF CRII 1830161. We would like to acknowledge Google Cloud for their support.

References

Appendix A Several Variants of IS- and WIS-based Estimators

Denote by $\gamma_{t}=\gamma^{t}/\sum_{t=0}^{{T}}\gamma^{t}$ for notation simplicity. Define

Then we have the following two key formulas, which derive the trajectory-wise, and step-wise importance sampling (IS) estimators, respectively.

Given a set of $m$ observed trajectories ${\boldsymbol{\tau}}^{i}=\{s_{t}^{i},a_{t}^{i},r_{t}^{i}\}_{t=0}^{T}$ , $\forall i=1,\ldots,m$ , drawn from $p_{\pi_{0}}$ . The trajectory-wise and step-wise estimators are

where $w_{0:t}^{i}=w_{0:t}({\boldsymbol{\tau}}^{i})$ and $Z_{t}$ is a normalization constant of the importance weights: when $Z_{t}=m,~{}~{}~{}\forall t$ , the corresponding estimators (called Trajectory-wise IS and Step-wise IS, respectively) provide unbiased estimates of $R_{\pi}^{T}$ ; when $Z_{t}=\sum_{i=1}^{m}w_{0:t}^{i}$ , the corresponding estimators are weighted (or self-normalized) importance sampling (called Trajectory-wise WIS and Step-wise WIS, respectively), which introduce bias but often have lower variance. It has been shown that the Step-wise WIS often performs the best among all these variants .

In comparison, our method can be viewed as a further Rao-backwellization of the step-wise estimators. Define

where we replace $w_{0:t}$ in (20) with $w_{t:t}$ , based on Rao-backwellization conditioning on $(s_{t},a_{t})$ . This gives an empirical estimator:

where $w_{t:t}^{i}=w_{t:t}(a_{t}^{i},s_{t}^{i})$ and $Z_{t}=m$ or $Z_{t}=\sum_{i=1}^{m}w_{t:t}^{i}$ . Comparing this with the trajectory-wise and step-wise estimators, it is easy to expect that it yields smaller variance, when ignoring the estimation error of $w_{t:t}$ .

Appendix B A motivating example

Here we provide an example when $w_{0:T}$ is exponential on the trajectory length $T$ , yielding high variance in trajectory-wise and step-wise estimators in long horizon problems, while the variance of our stationary density ratio based importance weight $w_{t:t}$ stays to be a constant as $T$ increases.

The MDP has $n$ states: $\mathcal{S}=\{0,1,\dots,n-1\}$ , arranged on a circle (see the figure on the right), where $n$ is an odd number. There are two actions, left ( $\mathsf{L}$ ) and right ( $\mathsf{R}$ ). The left action moves the agent from the current state counterclockwise to the next state, and the right action has the opposite (clockwise) effect. The deterministic reward is if taking action $\mathsf{L}$ and $1$ otherwise. In summary, we have for any $s$ and $a$ that

Suppose we are given two policies. The behavior policy $\pi_{0}$ and target policy $\pi$ choose action $\mathsf{R}$ with probability $\rho$ and $1-\rho$ , respectively. We focus on the average reward $(\gamma=1)$ here.

First, note that the MDP is ergodic under either policy, as $n$ is odd. Since $\pi_{0}$ and $\pi$ are symmetric, their stationary distributions are identical, that is, $d_{\pi}(s)/d_{\pi_{0}}(s)=1$ . In fact, both $d_{\pi}=d_{\pi_{0}}$ are uniform over $\mathcal{S}$ . Therefore,

and similarly $w_{t:t}(s,\mathsf{L})=(1-\rho)/\rho$ . Both ratios are independent of the trajectory length, and have zero variance.

Under the setting above, let ${\boldsymbol{\tau}}=\{s_{t},a_{t},r_{t}\}_{0\leq t\leq T}$ be a trajectory drawn from the behavior policy $\pi_{0}$ , we have

Obviously, $A_{\rho}>1$ for $\rho\neq 1/2$ and $A_{\rho}=1$ for $\rho=1/2$ , and $B_{\rho,T}>0$ for large enough $T$ . Therefore, the variance of both the trajectory-wise importance weights and the corresponding estimator grow exponentially in the order of $A_{\rho}^{T}$ .

Remark

From the definition of the setting, it is easy to show that

Under policy $\pi_{0}$ , $F({\boldsymbol{\tau}})$ follows a Binomial distribution ${Binomial}({T+1},\rho)$ . The first order moments can be easily calculated as follows

It remains to calculate the second order moments. We achieve this by leveraging the moment-generating function (MGF) of Binomial distribution:

It will turn out be useful to consider the derivatives of $\Phi(\lambda)$ :

For convenience, define $C=(1-\rho)/\rho$ , and we have

where we use the fact that $(1-\rho+\rho C^{4})C^{-2}=\frac{\rho^{3}+(1-\rho)^{3}}{(1-\rho)\rho}=A_{\rho}.$ Similarly, we have

where we use the fact that $B_{\rho,T}=({C}/{({T+1})}+\rho C^{4})\rho^{2}$ . It is then straightforward to calculate the variance from here. ∎

Claim #3. Variance of trajectory-wise WIS weight grows exponentially in T𝑇T.

Although weighted-IS (WIS) often improves over IS estimators by using self-normalized weights, it cannot eliminate the exponential dependence on the trajectory length. Here, we calculate the asymptotic variance of trajectory-wise WIS using delta method [28, Chapter 9].

Let $\hat{R}_{n,\text{wis}}$ be the trajectory-wise WIS estimator of $R_{\pi}$ based on $n$ copies of independent trajectories drawn from $\pi_{0}$ , we have

where $D_{\rho,A}=B_{\rho,T}A_{\rho}^{-1}~{}-~{}2{(1-\rho)^{3}}/{\rho}~{}+~{}(1-\rho)^{2}A_{\rho},$ with $A_{\rho}$ and $B_{\rho,T}$ defined in Proposition 8.

The asymptotic mean square error (MSE) of a self-normalized importance sampling estimator can be estimated using the delta method [28, Chapter 9]:

where the first and third terms have been calculated in the proof of Proposition 8. We just need to calculate the cross term:

Appendix C Proofs

where we assume $g(x)=\sum_{i}b_{i}k(s,s_{i})$ . A simple yet important fact that our proof will leverage is that

A key property of RKHS is the so called reproducing property, which says

this can be proved using the reproducing property as follows

For more introduction to RKHS, see , to name only a few.

Note that $d_{\pi_{0}}(s,a|s^{\prime})=\frac{d_{\pi_{0}}(s)\pi_{0}(a|s){\boldsymbol{T}}(s^{\prime}|s,a)}{d_{\pi_{0}}(s^{\prime})}$ . Therefore, (9) is equivalent to

Denote $g(s):=d_{\pi_{0}}(s)w(s)$ . Since $d_{\pi_{0}}(s^{\prime})>0$ for all $s^{\prime}$ , we find that (9) is equivalent to

This implies that $g(s)$ is invariant under Markov transition ${\boldsymbol{T}}(s^{\prime}|s,a)\pi(a|s)$ . Because $d_{\pi}(s)$ is the unique stationary distribution under the same Markov transition, (24) holds if and only if $g(s)\propto d_{\pi}(s)$ , or equivalently, $w(s)\propto w_{\pi/\pi_{0}}(s).$ This completes the proof. ∎

Assume $\gamma\in(0,1)$ . The definition in (4) gives $d_{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d_{\pi,t}(s)$ . Therefore,

Multiplying both sides by $f(s^{\prime})$ and summing over $s^{\prime}$ , we get

Recall that $(s,a,s^{\prime})\sim d_{\pi}$ denotes sampling from the joint distribution of $d_{\pi}(s,a,s^{\prime})=d_{\pi}(s){\boldsymbol{T}}(s^{\prime},a|s)\pi(a|s)$ . Note that under this joint distribution, the marginal distribution of $s^{\prime}$ is different from $d_{\pi}(s)$ .This is different from the average reward case, in which $d_{\pi}(s)$ is the stationary distribution of ${\boldsymbol{T}}_{\pi}$ .

where $g$ is any function. Then by assumption, we have $g(s)=d_{\pi}(s)$ if and only if $\delta(g,~{}s^{\prime})=0$ for any $s^{\prime}$ . Replacing $d_{\pi}$ with $d_{\pi_{0}}$ and $f(s)$ with $w(s)f(s)$ in (14) gives

Plugging it into the definition of $L(w,f)$ in (15), we get

where we have defined $g(s):=d_{\pi}(s)w_{\pi/\pi_{0}}(s)^{-1}w(s)$ . Therefore, $L(w,f)=0$ for $\forall f$ is equivalent to $\delta(g,s^{\prime})=0$ for $\forall s^{\prime}$ , which is in turn equivalent to $g(s)=d_{\pi}(s)$ . Therefore, we have $w(s)=w_{\pi/\pi_{0}}(s)$ when $0<\gamma<1$ , and $g(s)\propto d_{\pi}(s)$ , or equivalently, $w(s)\propto w_{\pi/\pi_{0}}(s)$ , when $\gamma=1$ . ∎

Following the proof of Theorem 4 up to (25), we have

Consider first the discounted case $\gamma\in(0,1)$ , we have

For the uniqueness, assume $g=\Pi f_{1}$ and $g=\Pi f_{2}$ , and ${\delta f}=f_{1}-f_{2}$ , then $\Pi{\delta f}=0$ , where

which implies $\left\|{\delta f}\right\|_{\infty}=0$ .

For the average reward case $\gamma=1$ , we have

which is contradictory. Therefore, ${\delta f}$ must be a constant. In fact, functions that satisfies ${\delta f}=\sum_{s^{\prime}}{\boldsymbol{T}}_{\pi}(s^{\prime}|s){\delta f}(s^{\prime})$ is called harmonic [17, Lemma 1.16]. ∎

We just need to calculate $f_{g}$ , following Lemma 10.

We consider the average reward case first. Following the definition of the operator $\Pi$ in (17) and the average reward Bellman equation, we have

For the discounted case, following the definition of $\Pi$ and the discounted Bellman equation (2), we have $\Pi V_{\pi}(s)=r_{\pi}$ , which gives

Appendix D Algorithm Details

Algorithm 1 summarizes our main algorithm for the average reward case, where we approximate the mini-max loss function in (12) using empirical averaging of observed data.

The algorithm for the discounted case follows the same idea, but requires some modification due to the additional term in (15). To handle it in a notionally convenient way, we find it is useful to introduce a dummy transition pair $\{s_{-1},a_{-1},s_{-1}^{\prime},r_{-1}\}$ at time $t=-1$ , for which we define $s_{-1}^{\prime}=s_{0}$ , $r_{-1}=0$ and $\Delta(w;~{}s_{-1},a_{-1},s^{\prime}_{-1}):=1-w(s_{0})f(s_{0})$ . Related, we define an augmented discounted visitation distribution via

Under this notation, the loss (15) of discounted case is rewritten into a form identical to the average reward case:

when ${\mathcal{F}}$ is the ball of RKHS with kernel $k(s^{\prime},\bar{s}^{\prime})$ .

This equation is identical to the one in Algorithm 1 for the average case, but differs in the way the minibatch $\mathcal{M}$ is generated: it includes the dummy transition at time $t=-1$ with probability $(1-\gamma)$ and select time $t$ with discounted probability $\gamma^{t+1}$ . See Algorithm 2 for the summary of the procedure.

Appendix E Information on SUMO Traffic Simulator

We provide details of the SUMO traffic simulator and how we formulate it as a standard reinforcement learning problem.

A states of a traffic should provide us with enough information to control the traffic light. A complex way is an image-like representation of the traffic vehicle around the traffic light intersection . Here, to simplify the problem, we add lane detectors around traffic light intersections, and count the total number of vehicles on each lane as states $s_{t}$ . This should give us enough, though not perfect, information to guide the traffic light agent to choose its action.

Actions

For a standard crossing intersection, its traffic light will have a program for 8 phases: “Straight signal for North-South”, “Turn-left signal for North-south”, “Straight signal for East-West”, “Turn-left signal for East-west” and their corresponding “yellow light” slow down signals. Here, we simplify these 4 phases into actions $a_{t}$ for each traffic light, where we let one big time step $t$ in reinforcement learning setting to be 6 real time steps in SUMO simulator. Within each big time step $t$ , we add a transition of 3 real time steps “yellow light” phase as a buffer to prevent vehicles for “emergency stop” if our agent decides to change light status ( $a_{t}\neq a_{{T}}$ ).

Rewards

Our goal is to minimize the total travelling time for all vehicles. Thus, we could set the negative of current aggregate total number of vehicles during the one big time step as reward $r_{t}$ . To simplify, we can just consider 6 times the current total number of vehicle as a approximation of $r_{t}$ to make our system simpler.

Policy

We use linear policy with the final softmax layer as probability for each action. We train a policy $\pi_{*}$ using Cross entropy(CE) method for 10 iterations and set it to be the target policy. And we set the policies at the training iteration 6, 7, 8, 9 as behavior policies, which correspond to x-ticks 1-4 in Figure 4(c).

Other details

To simulate on our given network, we also need to design route documents for a vehicle to follow. Each route is a set of roads that connect any two exit nodes from the map. To make simple but reasonable routes for the vehicle, we constrain our routes with at most one turn in the network to avoid detours. We control each route with a fixed probability (different from each route) every time step to generate a vehicle, to guarantee a randomized environment.