Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou
Introduction
Reinforcement learning (RL) is one of the most successful approaches to artificial intelligence, and has found successful applications in robotics, games, dialogue systems, and recommendation systems, among others. One of the key problems in RL is policy evaluation: given a fixed policy, estimate the average reward garnered by an agent that runs this policy in the environment. In this paper, we consider the off-policy estimation problem, in which we want to estimate the expected reward of a given target policy with samples collected by a different behavior policy. This problem is of great practical importance in many application domains where deploying a new policy can be costly or risky, such as medical treatments , econometrics , recommender systems , education , Web search , advertising and marketing . It can also be used as a key component for developing efficient off-policy policy optimization algorithms .
Most state-of-the-art off-policy estimation methods are based on importance sampling (IS) [e.g., 22]. A major limitation, however, is that this approach can become inaccurate due to the high variance introduced by the importance weights, especially when the trajectory is long. Indeed, most existing IS-based estimators compute the weight as the product of the importance ratios of many steps in the trajectory. Variances in individual steps accumulate multiplicatively, so that the overall IS weight of a random trajectory can have an exponentially high variance to result in an unreliable estimator. In the extreme case when the trajectory length is infinite, as in infinite-horizon average-reward problems, some of these estimators are not even well-defined. Ad hoc approaches can be used, such as truncating the trajectories, but often lead to a hard-to-control bias in the final estimation. Analogous to the well-known “curse of dimensionality” in dynamic programming , we call this problem the “curse of horizon” in off-policy learning.
In this work, we develop a new approach that tackles the curse of horizon. The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories. This avoids the cumulative product across time in the density ratio, substantially decreasing its variance and eliminating the estimator’s dependence on the horizon.
Our key challenge, of course, is to estimate the importance ratios of average visitation distributions. In practice, we often have access to both the target and behavior policies to compute their importance ratio of an action conditioned on a given state. But we typically have no access to transition probabilities of the environment, so estimating importance ratios of state visitation distributions has been very difficult, especially when only off-policy samples are available. In this paper, we develop a mini-max loss function for estimating the true stationary density ratio, which yields a closed-form representation similar to maximum mean discrepancy when combined with a reproducing kernel Hilbert space (RKHS). We study the theoretical properties of our loss function, and demonstrate its empirical effectiveness on long-horizon problems.
Background
Consider a Markov decision process (MDP) with state space , action space , reward function , and transition probability function . Assume the environment is initialized at state , drawn from an unknown distribution . At each time step , an agent observes the current state , takes an action according to a possibly stochastic policy , receives a reward whose expectation is , and transitions to a next state according to transition probabilities . To simplify exposition and avoid unnecessary technicalities, we assume and are finite unless otherwise specified, although our method extends to continuous spaces straightforwardly, as demonstrated in experiments.
We consider the infinite horizon problem in which the MDP continues without termination. Let be the distribution of trajectory under policy . The expected reward of is
where is the reward of trajectory up to time . Here, is a discount factor. We distinguish two reward criteria, the average reward () and discounted reward ():
where is a normalization factor. The problem of off-policy value estimation is to estimate the expected reward of a given target policy , when we only observe a set of trajectories generated by following a different behavior policy .
Bellman Equation
Under these definitions, is the fixed-point solution to the respective Bellman equations:
Importance Sampling
IS represents a major class of approaches to off-policy estimation, which, in principle, only applies to the finite-horizon reward when the trajectory is truncated at a finite time step . IS-based estimators are based on the following change-of-measure equality:
where is the single-step density ratio of policies and evaluated at a particular state-action pair , and is the density ratio of the trajectory up to time . Methods based on (3) are called trajectory-wise IS, or weighted IS (WIS) when the weights are self-normalized . It is possible to improve trajectory-wise IS with the so called step-wise, or per-decision, IS/WIS, which uses weight for reward at time , yielding smaller variance . More details about these estimators are given in Appendix A.
The Curse of Horizon
The importance weight is a product of density ratios, whose variance can grow exponentially with . Thus, IS-based estimators have not been widely successful in long-horizon problems, let alone infinite-horizon ones where may not even be well-defined. While WIS estimators often have reduced variance, the exponential dependence on horizon is unavoidable in general. We call this phenomenon in IS/WIS-based estimators the curse of horizon.
Not all hope is lost, however. To see this, consider an MDP with states and actions, where states are arranged on a circle (see figure on the right). The two actions deterministically move the agent from the current state to the neighboring state counterclockwise and clockwise, respectively. Suppose we are given two policies with opposite effects: the behavior policy moves the agent clockwise with probability , and the target policy moves the agent counterclockwise with probability , for some constant . As shown in Appendix B, IS and WIS estimators suffer from exponentially large variance when estimating the average reward of . However, a keen reader will realize that the two policies are symmetric, and thus their stationary state visitation distributions are identical. As we show in the sequel, this allows us to estimate the expected reward using a much more efficient importance sampling, whose importance weight equals the single-step density ratio , instead of the cumulative product weight in (3), allowing us to significantly reduce the variance. Such an observation inspired the approach developed in this paper.
Off-Policy Estimation via Stationary State Density Ratio Estimation
As shown in the example above, significant decrease in estimation variance is possible when we apply importance weighting on the state space, rather than the trajectory space. It eliminates the dependency on the trajectory length and is much more suited for long- or infinite-horizon problems. To realize this, we need to introduce an alternative representation of the expected reward. Denote by the distribution of state when we execute policy starting from an initial state drawn from an initial distribution . We define the average visitation distribution to be
We always assume the limit exists in this work. When in the discounted case, is a discounted average of , that is, ; when in the average reward case, is the stationary distribution of as under policy , that is, .
Following Definition 4, it can be verified that can be expressed alternatively as
where, abusing notation slightly, we use to denote draws from distribution . Our idea is to construct an IS estimator based on (5), where the importance ratio is computed on state-action pairs rather than on trajectories:
This IS estimator works in the space of , instead of trajectoris , leading to a potentially significant variance reduction. Returning to the example in Section 2 (see also Appendix B), since the two policies are symmetric and lead to the same stationary distributions, that is, , the importance weight in (6) is simply , independent of the trajectory length. This avoids the excessive variance in long horizon problems. In Appendix A, we provide a further discussion, showing that our estimator can be viewed as a type of Rao-Backwellization of the trajectory-wise and step-wise estimators.
The key technical challenge remaining is estimating the density ratio , which we address in this section. For simplifying the presentation, we start with estimating for the average reward case and discuss the discounted case in Section 3.2.
Let be the transition probability from to following policy . In the average reward case, equals the stationary distribution of , satisfying
Assume the Markov chain of is finite state and ergodic, is also the unique distribution that satisfies (8). This simple fact can be leveraged to derive the following key property of .
In the average reward case (), assume is the unique invariant distribution of and , . Then a function equals (up to a constant factor) if and only if it satisfies
where and denote the conditional distribution related to joint distribution . Note that this is a time-reserved conditional probability, since it is the conditional distribution of given that their next state is following policy .
Following Theorem 1, we have if and only if for any function . This motivates us to estimate with a mini-max problem:
Assume is a RKHS of functions with a positive definite kernel , and define to be the unit ball of . We have
where and are independent transition pairs obtained when running policy , and is defined in (10). See Appendix C for more background on RKHS.
In practice, we approximate the expectation in (12) using discounted empirical distribution of the transition pairs, yielding consistent estimates following standard results on V-statistics .
2 Discounted Reward Case
We now discuss the extension to the discount case of . Similar to the average reward case, we start with a recursive equation that characterizes in the discounted case.
Following the definition of in (4), for any , we have
Denote by draws from . For any function , we have
One may view as the invariant distribution of an induced Markov chain with transition probability of , which follows with probability , and restarts from initial distribution with probability . We can show that exists and is unique under mild conditions .
Assume is the unique solution of (13), and , . Define
Assume , then if and only if for any test function .
3 Further Theoretical Analysis
In this section, we develop further theoretical understanding on the loss function . Lemma 5 below reveals an interesting connection between and the Bellman equation, allowing us to bound the estimation error of density ratio and expected reward with the mini-max loss when the discriminator space is chosen properly (Theorems 6 and 7). The results in this section apply to both discounted and average reward cases.
Note that equals the left hand side of the Bellman equations (1) and (2), when .
Lemma 5 represents as an inner product between and (under base measure ). This provides an alternative proof of Theorem 4, since implies that is orthogonal with all and hence when is sufficiently rich.
Since our main goal is to estimate the expected total reward instead of the density ratio , it is of interest to select to directly bound the estimation error of the total reward. Interestingly, this can be achieved once includes the true value function .
Define to be the reward estimate using estimated density ratio (which may not equal the true ratio ) and infinite number of trajectories from , that is,
Related Work
Our off-policy setting is related to, but different from, off-policy value-function learning . Our goal is to estimate a single scalar that summarizes the quality of a policy (a.k.a. off-policy value estimation as called by some authors ). However, our idea can be extended to estimating value functions as well, by using estimated density ratios to weight observed transitions (c.f., the distribution in LSTDQ ). We leave this as future work.
IS-based off-policy value estimation has seen a lot of interest recently for short-horizon problems, including contextual bandits , and achieved many empirical successes . When extended to long-horizon problems, it faces an exponential blowup of variance, and variance-reduction techniques are used to improve the estimator . However, it can be proved that in the worst case, the mean squared error of any estimator has to depend exponentially on the horizon . Fortunately, many problems encountered in practical applications may present structures that enable more efficient off-policy estimation, as tackled by the present paper. An interesting open direction is to characterize theoretical conditions that can ensure tractable estimation for long horizon problems.
Few prior work directly target infinite-horizon problems. There exists approaches that use simulated samples to estimate stationary state distributions [1, Chapter IV]. However, they need a reliable model to draw such simulations, a requirement that is not satisfied in many real-world applications. To the best of our knowledge, the recently developed COP-TD algorithm is the only work that attempts to estimate as an intermediate step of estimating the value function of a target policy . They take a stochastic-approximation approach and show asymptotic consistence. However, extending their approach to continuous state/action spaces appears challenging.
Finally, there is a comprehensive literature of two-sample density ratio estimation [e.g., 27, 35], which estimates the density ratio of two distributions from pairs of their samples. Our problem setting is different in that we only have data from , but not from ; this makes the traditional density ratio estimators inapplicable to our problem. Our method is made possible by taking the special temporal structure of MDP into consideration.
Experiment
In this section, we conduct experiments on different environmental settings to compare our method with existing off-policy evaluation methods. We compare with the standard trajectory-wise and step-wise IS and WIS methods. We do not report the results of unnormalized IS because they are generally significantly worse than WIS methods . In all the cases, we also compare with an on-policy oracle and a naive averaging baseline, which estimates the reward using direct averaging over the trajectories generated by the target policy and behavior policy, respectively. For problems with discrete action and state spaces, we also compare with a standard model-based method, which estimates the transition and reward model and then calculates expected reward explicitly using the model up to the desired truncation length. When applying our method on problems with finite and discrete state space, we optimize and in the space of all possible functions (corresponding to using a delta kernel in terms of RKHS). For continuous state space, we assume is a standard feed-forward neural network, and is a RKHS with a standard Gaussian RBF kernel whose bandwidth equals the median of the pairwise distances between the observed data points.
Because we cannot simulate truly infinite steps in practice, we use the behavior policy to generate trajectories of length , and evaluate the algorithms based on the mean square error (MSE) w.r.t. the -step rewards of a large number of trajectories of length from the target policy. We expect that our method gets better as increases, since it is designed for infinite horizon problems, while the IS/WIS methods receive large variance and deteriorate as increases.
Taxi is a 2D grid world simulating taxi movement along the grids. A taxi moves North, East, South, West or attends to pick up or drop off a passenger. It receives a reward of when it successfully picks up a passenger or drops her off at the right place, and otherwise a reward of -1 every time step. The original taxi environment would stop when the taxi successfully picks up a passenger and drops her off at the right place. We modify the environment to make it infinite horizon, by allowing passengers to randomly appear and disappear at every corner of the map at each time step. We use a grid size of , which yields states in total (, corresponding to taxi locations, passenger appearance status and taxi status (empty or with one of 4 destinations)).
To construct target and behavior policies for testing our algorithm, we set our target policy to be the final policy after running Q-learning for iterations, and set another policy after iterations. The behavior policy is , where is a mixing ratio that can be varied.
Results in Taxi Environment
Figure 1(a)–(b) show results with average reward. We can see our method performs almost as well as the on-policy oracle, outperforming all the other methods. To evaluate the approximation error of the estimated density ratio , we plot in Figure 1(c) the weighted total variation distance between with the true with TV distance as we optimize the loss function. Figure 1(d) shows scatter plot of at convergence, indicating our method correctly estimates the true density ratio over the state space.
Figure 2 shows similar results for discounted reward. From Figure 2(c) and (d), we can see that typical IS methods deteriorate as the trajectory length and discount factor increase, respectively, which is expected since their variance grows exponentially with . In contrast, our density ratio method performs better as trajectory length increases, and is robust as increases.
Pendulum Environment
We train a near-optimal policy using REINFORCE and set it to be the target policy. The behavior policy is set to be , where is a mixing ratio, and is another policy from REINFORCE when it has not converged. Our results are shown in Figure 3, where we again find that our method generally outperforms the standard trajectory-wise and step-wise WIS, and works favorably in long-horizon problems (Figure 3(b)).
SUMO Traffic Simulator
SUMO is an open source traffic simulator; see Figure 4(a) for an illustration. We consider the task of reducing traffic congestion by modelling traffic light control as a reinforcement learning problem . We use TraCI, a built-in “Traffic Control Interface”, to interact with the SUMO simulator. Full details of our environmental settings can be found in Appendix E. Our results are shown in Figure 4, where we again find that our method is consistently better than standard IS methods.
Conclusions
We study the off-policy estimation problem in infinite-horizon problems and develop a new algorithm based on direct estimation of the stationary state density ratio between the target and behavior policies. Our mini-max objective function enjoys nice theoretical properties and yields an intriguing connection with Bellman equations that is worth further investigation. Future directions include scaling our method to larger scale problems and extending it to estimate value functions and leverage off-policy data in policy optimization.
Acknowledgement
This work is supported in part by NSF CRII 1830161. We would like to acknowledge Google Cloud for their support.
References
Appendix A Several Variants of IS- and WIS-based Estimators
Denote by for notation simplicity. Define
Then we have the following two key formulas, which derive the trajectory-wise, and step-wise importance sampling (IS) estimators, respectively.
Given a set of observed trajectories , , drawn from . The trajectory-wise and step-wise estimators are
where and is a normalization constant of the importance weights: when , the corresponding estimators (called Trajectory-wise IS and Step-wise IS, respectively) provide unbiased estimates of ; when , the corresponding estimators are weighted (or self-normalized) importance sampling (called Trajectory-wise WIS and Step-wise WIS, respectively), which introduce bias but often have lower variance. It has been shown that the Step-wise WIS often performs the best among all these variants .
In comparison, our method can be viewed as a further Rao-backwellization of the step-wise estimators. Define
where we replace in (20) with , based on Rao-backwellization conditioning on . This gives an empirical estimator:
where and or . Comparing this with the trajectory-wise and step-wise estimators, it is easy to expect that it yields smaller variance, when ignoring the estimation error of .
Appendix B A motivating example
Here we provide an example when is exponential on the trajectory length , yielding high variance in trajectory-wise and step-wise estimators in long horizon problems, while the variance of our stationary density ratio based importance weight stays to be a constant as increases.
The MDP has states: , arranged on a circle (see the figure on the right), where is an odd number. There are two actions, left () and right (). The left action moves the agent from the current state counterclockwise to the next state, and the right action has the opposite (clockwise) effect. The deterministic reward is if taking action and otherwise. In summary, we have for any and that
Suppose we are given two policies. The behavior policy and target policy choose action with probability and , respectively. We focus on the average reward here.
First, note that the MDP is ergodic under either policy, as is odd. Since and are symmetric, their stationary distributions are identical, that is, . In fact, both are uniform over . Therefore,
and similarly . Both ratios are independent of the trajectory length, and have zero variance.
Under the setting above, let be a trajectory drawn from the behavior policy , we have
Obviously, for and for , and for large enough . Therefore, the variance of both the trajectory-wise importance weights and the corresponding estimator grow exponentially in the order of .
Remark
From the definition of the setting, it is easy to show that
Under policy , follows a Binomial distribution . The first order moments can be easily calculated as follows
It remains to calculate the second order moments. We achieve this by leveraging the moment-generating function (MGF) of Binomial distribution:
It will turn out be useful to consider the derivatives of :
For convenience, define , and we have
where we use the fact that Similarly, we have
where we use the fact that . It is then straightforward to calculate the variance from here. ∎
Claim #3. Variance of trajectory-wise WIS weight grows exponentially in T𝑇T.
Although weighted-IS (WIS) often improves over IS estimators by using self-normalized weights, it cannot eliminate the exponential dependence on the trajectory length. Here, we calculate the asymptotic variance of trajectory-wise WIS using delta method [28, Chapter 9].
Let be the trajectory-wise WIS estimator of based on copies of independent trajectories drawn from , we have
where with and defined in Proposition 8.
The asymptotic mean square error (MSE) of a self-normalized importance sampling estimator can be estimated using the delta method [28, Chapter 9]:
where the first and third terms have been calculated in the proof of Proposition 8. We just need to calculate the cross term:
Appendix C Proofs
where we assume . A simple yet important fact that our proof will leverage is that
A key property of RKHS is the so called reproducing property, which says
this can be proved using the reproducing property as follows
For more introduction to RKHS, see , to name only a few.
Note that . Therefore, (9) is equivalent to
Denote . Since for all , we find that (9) is equivalent to
This implies that is invariant under Markov transition . Because is the unique stationary distribution under the same Markov transition, (24) holds if and only if , or equivalently, This completes the proof. ∎
Assume . The definition in (4) gives . Therefore,
Multiplying both sides by and summing over , we get
Recall that denotes sampling from the joint distribution of . Note that under this joint distribution, the marginal distribution of is different from .This is different from the average reward case, in which is the stationary distribution of .
where is any function. Then by assumption, we have if and only if for any . Replacing with and with in (14) gives
Plugging it into the definition of in (15), we get
where we have defined . Therefore, for is equivalent to for , which is in turn equivalent to . Therefore, we have when , and , or equivalently, , when . ∎
Following the proof of Theorem 4 up to (25), we have
Consider first the discounted case , we have
For the uniqueness, assume and , and , then , where
which implies .
For the average reward case , we have
which is contradictory. Therefore, must be a constant. In fact, functions that satisfies is called harmonic [17, Lemma 1.16]. ∎
We just need to calculate , following Lemma 10.
We consider the average reward case first. Following the definition of the operator in (17) and the average reward Bellman equation, we have
For the discounted case, following the definition of and the discounted Bellman equation (2), we have , which gives
Appendix D Algorithm Details
Algorithm 1 summarizes our main algorithm for the average reward case, where we approximate the mini-max loss function in (12) using empirical averaging of observed data.
The algorithm for the discounted case follows the same idea, but requires some modification due to the additional term in (15). To handle it in a notionally convenient way, we find it is useful to introduce a dummy transition pair at time , for which we define , and . Related, we define an augmented discounted visitation distribution via
Under this notation, the loss (15) of discounted case is rewritten into a form identical to the average reward case:
when is the ball of RKHS with kernel .
This equation is identical to the one in Algorithm 1 for the average case, but differs in the way the minibatch is generated: it includes the dummy transition at time with probability and select time with discounted probability . See Algorithm 2 for the summary of the procedure.
Appendix E Information on SUMO Traffic Simulator
We provide details of the SUMO traffic simulator and how we formulate it as a standard reinforcement learning problem.
A states of a traffic should provide us with enough information to control the traffic light. A complex way is an image-like representation of the traffic vehicle around the traffic light intersection . Here, to simplify the problem, we add lane detectors around traffic light intersections, and count the total number of vehicles on each lane as states . This should give us enough, though not perfect, information to guide the traffic light agent to choose its action.
Actions
For a standard crossing intersection, its traffic light will have a program for 8 phases: “Straight signal for North-South”, “Turn-left signal for North-south”, “Straight signal for East-West”, “Turn-left signal for East-west” and their corresponding “yellow light” slow down signals. Here, we simplify these 4 phases into actions for each traffic light, where we let one big time step in reinforcement learning setting to be 6 real time steps in SUMO simulator. Within each big time step , we add a transition of 3 real time steps “yellow light” phase as a buffer to prevent vehicles for “emergency stop” if our agent decides to change light status ().
Rewards
Our goal is to minimize the total travelling time for all vehicles. Thus, we could set the negative of current aggregate total number of vehicles during the one big time step as reward . To simplify, we can just consider 6 times the current total number of vehicle as a approximation of to make our system simpler.
Policy
We use linear policy with the final softmax layer as probability for each action. We train a policy using Cross entropy(CE) method for 10 iterations and set it to be the target policy. And we set the policies at the training iteration 6, 7, 8, 9 as behavior policies, which correspond to x-ticks 1-4 in Figure 4(c).
Other details
To simulate on our given network, we also need to design route documents for a vehicle to follow. Each route is a set of roads that connect any two exit nodes from the map. To make simple but reasonable routes for the vehicle, we constrain our routes with at most one turn in the network to avoid detours. We control each route with a fixed probability (different from each route) every time step to generate a vehicle, to guarantee a randomized environment.