Consistent On-Line Off-Policy Evaluation

Assaf Hallak, Shie Mannor

Introduction

Reinforcement Learning (RL) techniques were successfully applied in fields such as robotics, games, marketing and more (Kober et al., 2013; Al-Rawi et al., 2015; Barrett et al., 2013). We consider the problem of off-policy evaluation (OPE) – assessing the performance of a complex strategy without applying it. An OPE formulation is often considered in domains with limited sampling capability. For example, marketing and recommender systems (Theocharous & Hallak, 2013; Theocharous et al., 2015) directly relate policies to revenue. A more extreme example is drug administration, as there are only few patients in the testing population, and sub-optimal policies can have life threatening effects (Hochberg et al., 2016). OPE can also be useful as a module for policy optimization in a policy improvement scheme (Thomas et al., 2015a).

In this paper, we consider the OPE problem in an on-line setup where each new sample is immediately used to update our current value estimate of some previously unseen policy. We propose and analyze a new algorithm called COP-TD(λ\lambda,β\beta) for estimating the value of the target policy; COP-TD(λ\lambda,β\beta) has the following properties:

Easy to understand and implement on-line.

Allows closing the gap to consistency such that the limit point is the same that would have been obtained by on-policy learning with the target policy.

Empirically comparable to state-of-the art algorithms.

Our algorithm resembles (Sutton et al., 2015)’s Emphatic TD that was extended by (Hallak et al., 2015) to the general parametric form ETD(λ\lambda,β\beta). We clarify the connection between the algorithms and compare them empirically. Finally, we introduce an additional related heuristic called Log-COP-TD(λ\lambda,β\beta) and motivate it.

Notations and Background

We consider the standard discounted Markov Decision Process (MDP) formulation (Bertsekas & Tsitsiklis, 1996) with a single long trajectory. Let M=(S,A,P,R,ζ,γ)M=(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\zeta,\gamma) be an MDP where S\mathcal{S} is the finite state space and A\mathcal{A} is the finite action space. The parameter P\mathcal{P} sets the transition probabilities Pr(ss,a)\Pr(s^{\prime}|s,a) given the previous state sSs\in\mathcal{S} and action aAa\in\mathcal{A}, where the first state is determined by the distribution ζ\zeta. The parameter R\mathcal{R} sets the reward distribution r(s,a)r(s,a) obtained by taking action aa in state ss and γ\gamma is the discount factor specifying the exponential reduction in reward with time.

The process advances as follows: A state s0s_{0} is sampled according to the distribution ζ(s)\zeta(s). Then, at each time step tt starting from t=0t=0 the agent draws an action ata_{t} according to the stochastic behavior policy μ(ast)\mu(a|s_{t}), a reward rtr(st,at)r_{t}\doteq r(s_{t},a_{t}) is accumulated by the agent, and the next state st+1s_{t+1} is sampled using the transition probability Pr(sst,at)\Pr(s^{\prime}|s_{t},a_{t}).

The expected discounted accumulated reward starting from a specific state and choosing an action by some policy π\pi is called the value function, which is also known to satisfy the Bellman equation in a vector form:

where αt\alpha_{t} is the step size. The value Rt,st(n)R_{t,s_{t}}^{(n)} is an estimate of the current state’s V(st)V(s_{t}), looking forward nn steps, and Rt,stλR_{t,s_{t}}^{\lambda} is an exponentially weighted average of all of these estimates going forward till infinity. Notice that Equation 1 does not specify an on-line implementation since Rt,st(n)R_{t,s_{t}}^{(n)} depends on future observations, however there exists a compact on-line implementation using eligibility traces (Bertsekas & Tsitsiklis (1996) for on-line TD(λ\lambda), and Sutton et al. (2014), Sutton et al. (2015) for off-policy TD(λ\lambda)). The underlying operator of TD(λ\lambda) is given by:

and is a γ(1λ)1λγ\frac{\gamma(1-\lambda)}{1-\lambda\gamma}-contraction (Bertsekas, 2012).

We denote by dμ(s)d_{\mu}(s) the stationary distribution over states induced by taking the policy μ\mu and mark Dμ=diag(dμ)D_{\mu}=diag(d_{\mu}). Since we are concerned with the behavior at infinite horizon, we assume ζ(s)=dμ(s)\zeta(s)=d_{\mu}(s). In addition, we assume that the MDP is ergodic for the two specified policies μ,π\mu,\pi so sS:dμ(s)>0,dπ(s)>0\forall s\in\mathcal{S}:d_{\mu}(s)>0,d_{\pi}(s)>0 and that the OPE problem is proper – π(as)>0μ(as)>0\pi(a|s)>0\Rightarrow\mu(a|s)>0.

TD(λ\lambda) can be adjusted to find the fixed point of ΠdπTπλ\Pi_{d_{\pi}}T_{\pi}^{\lambda} where Πdπ\Pi_{d_{\pi}} is the projection to the subspace spanned by the features with respect to the dπd_{\pi}-weighted norm (Sutton & Barto, 1998):

Finally, we define OPE-related quantities:

we call ρd\rho_{d} the covariate shift ratio (as denoted under different settings by (Hachiya et al., 2012)).

We summarize the assumptions used in the proofs:

Under both policies the induced Markov chain is ergodic.

The first state s0s_{0} is distributed according to the stationary distribution of the behavior policy dμ(s)d_{\mu}(s).

The problem is proper: π(as)>0μ(as)>0\pi(a|s)>0\Rightarrow\mu(a|s)>0.

The feature matrix Φ\Phi has full rank kk.

Assumption 1 is commonly used for convergence theorems as it verifies the value function is well defined on all states regardless of the initial sampled state. Assumption 2 can be relaxed since we are concerned with the long-term properties of the algorithm past its mixing time – we require it for clarity of the proofs. Assumption 3 is required so the importance sampling ratios will be well defined. Assumption 4 guarantees the optimal θ\theta is unique which greatly simplifies the proofs.

Previous Work

We can roughly categorize previous OPE algorithms to two main families. Gradient based methods that perform stochastic gradient descent on error terms they want to minimize. These include GTD (Sutton et al., 2009a), GTD-2, TDC (Sutton et al., 2009b) and HTD (White & White, 2016). The main disadvantages of gradient based methods are (A) they usually update an additional error correcting term, which means another time-step parameter needs to be controlled; and (B) they rely on estimating non-trivial terms, an estimate that tends to converge slowly. The other family uses importance sampling (IS) methods that correct the gains between on-policy and off-policy updates using the IS-ratios ρt\rho_{t}’s. Among these are full IS (Precup et al., 2001) and ETD(λ\lambda,β\beta) (Sutton et al., 2015). These methods are characterized by the bias-variance trade-off they resort to – navigating between biased convergent values (or even divergent), and very slow convergence stemming from the high variance of IS correcting factors (the ρt\rho_{t} products). There are also a few algorithms that fall between the two, for example TO-GTD (van Hasselt et al., 2014) and WIS-TD(λ\lambda) (Mahmood & Sutton, 2015).

A comparison of these algorithms in terms of convergence rate, synergy with function approximation and more is available in (White & White, 2016; Geist & Scherrer, 2014). We focus in this paper on the limit point of the convergence. For most of the aforementioned algorithms, the process was shown to converge almost surely to the fixed point of the projected Bellman operator ΠdTπ\Pi_{d}T_{\pi} where dd is some stationary distribution (usually dμd_{\mu}), however the dd in question was neverExcept full IS, however its variance is too high to be applicable in practice. dπd_{\pi} as we would have obtained from running on-policy TD with the target policy. The algorithm achieving the closest result is ETD(λ\lambda,β\beta) which replaced dd with f=(IβPπ)1dμf=\left(I-\beta P^{\top}_{\pi}\right)^{-1}d_{\mu}, where β\beta trades-off some of the process’ variance with the bias in the limit point. Hence, our main contribution is a consistent algorithm which can converge to the same value that would have been obtained by running an on-policy scheme with the same policy.

Motivation

Here we provide a motivating example showing that even in simple cases with “close” behavior and target policies, the two induced stationary distributions can differ greatly. Choosing a specific linear parameterization further emphasizes the difference between applying on-policy TD with the target policy, and applying inconsistent off-policy TD.

Assume a chain MDP with numbered states 1,2,..S1,2,..|S|, where from each state ss you can either move left to state s1s-1, or right to state s+1s+1. If you’ve reached the beginning or the end of the chain (states 11 or S|S|) then taking a step further does not affect your location. Assume the behavior policy moves left with probability 0.5+ϵ0.5+\epsilon, while the target policy moves right with probability 0.5+ϵ0.5+\epsilon. It is easy to see that the stationary distributions are given by:

For instance, if we have a length 100100 chain with ϵ=0.01\epsilon=0.01, for the rightmost state we have dμ(S)8104,dπ(S)0.04d_{\mu}(|S|)\approx 8\cdot 10^{-4},d_{\pi}(|S|)\approx 0.04. Let’s set the reward to be 11 for the right half of the chain, so the target policy is better since it spends more time in the right half. The value of the target policy in the edges of the chain for γ=0.99\gamma=0.99 is Vπ(1)=0.21,Vπ(100)=99.97V^{\pi}(1)=0.21,V^{\pi}(100)=99.97.

Now what happens if we try to approximate the value function using one constant feature ϕ(s)1\phi(s)\equiv 1? The fixed point of ΠdμTπ\Pi_{d_{\mu}}T_{\pi} is θ=11.92\theta=11.92, while the fixed point of ΠdπTπ\Pi_{d_{\pi}}T_{\pi} is θ=88.08\theta=88.08 – a substantial difference. The reason for this difference lies in the emphasis each projection puts on the states: according to Πdμ\Pi_{d_{\mu}}, the important states are in the left half of the chain – these with low value function, and therefore the value estimation of all states is low. However, according to Πdπ\Pi_{d_{\pi}} the important states are concentrated on the right part of the chain since the target policy will visit these more often. Hence, the estimation error is emphasized on the right part of the chain and the value estimation is higher. When we wish to estimate the value of the target policy, we want to know what will happen if we deploy it instead of the behavior policy, thus taking the fixed point of ΠdπTπ\Pi_{d_{\pi}}T_{\pi} better represents the off-policy evaluation solution.

COP-TD(λ𝜆\lambda, β𝛽\beta)

Most off-policy algorithms multiply the TD summand of TD(λ\lambda) with some value that depends on the history and the current state. For example, full IS-TD by (Precup et al., 2001) examines the ratio between the probabilities of the trajectory under both policies:

In problems with a long horizon, or these that start from the stationary distribution, we suggest using the time-invariant covariate shift ρd\rho_{d} multiplied by the current ρt\rho_{t}. The intuition is the following: We would prefer using the probabilities ratio given in Equation 3, but it has very high variance, and after many time steps we might as well look at the stationary distribution ratio instead. This direction leads us to the following update equations:

If the αt\alpha_{t} satisfy t=0αt=,t=0αt2<\sum_{t=0}^{\infty}\alpha_{t}=\infty,\sum_{t=0}^{\infty}\alpha^{2}_{t}<\infty then the process described by Eq. (4) converges almost surely to the fixed point of ΠπTπV=V\Pi_{\pi}T_{\pi}V=V.

The proof follows the ODE method (Kushner & Yin, 2003) similarly to Tsitsiklis & Van Roy (1997) (see the appendix for more details).

Since ρd(s)\rho_{d}(s) is generally unknown, it is estimated using an additional stochastic approximation process. In order to do so, we note the following Lemma:

Note that ρd(s)\rho_{d}(s), unlike V(s)V(s), is restricted to a close set since its dμd_{\mu}-weighted linear combination is equal to 11 and all of its entries are non-negative; We denote this dμd_{\mu}-weighted simplex by Δdμ\Delta_{d_{\mu}}, and let ΠΔdμ\Pi_{\Delta_{d_{\mu}}} be the (non-linear) projection to this set with respect to the Euclidean norm (ΠΔdμ\Pi_{\Delta_{d_{\mu}}} can be calculated efficiently, (Chen & Ye, 2011)). Now, we can devise a TD algorithm which estimates ρd\rho_{d} and uses it to find θ\theta, which we call COP-TD(, β\beta) (Consistent Off-Policy TD).

Similarly to the Bellman operator for TD-learning, we define the underlying COP-operator YY and its β\beta extension:

The following Lemma may give some intuition on the convergence of the ρd\rho_{d} estimation process:

Under the ergodicity assumption, denote the eigenvalues of PπP_{\pi} by 0ξ2<ξ1=10\leq\dots\leq|\xi_{2}|<\xi_{1}=1. Then YβY^{\beta} is a maxi1(1β)ξi1βξi<1\max_{i\neq 1}\frac{(1-\beta)|\xi_{i}|}{|1-\beta\xi_{i}|}<1-contraction in the L2L_{2}-norm on the orthogonal subspace to ρd\rho_{d}, and ρd\rho_{d} is a fixed point of YβY^{\beta}.

The technical proof is given in the appendix.

The proof is given in the appendix and also follows the ODE method. Notice that a theorem is only given for λ=0\lambda=0, convergence results for general λ\lambda should follow the work by Yu (2015).

A possible criticism on COP-TD(,β\beta) is that it is not actually consistent, since in order to be consistent the original state space has to be small, in which case every off-policy algorithm is consistent as well. Still, the dependence on another set of features allows to trade-off accuracy with computational power in estimating ρd\rho_{d} and subsequently VV. Moreover, smart feature selection may further reduce this gap, and COP-TD(, β\beta) is still the first algorithm addressing this issue. We conclude with linking the error in ρd\rho_{d}’s estimate with the difference in the resulting θ\theta, which suggests that a well estimated ρd\rho_{d} results in consistency:

Let 0<ϵ<10<\epsilon<1. If (1ϵ)ρdρdCOP(1+ϵ)ρd(1-\epsilon)\rho_{d}\leq\rho_{d}^{\text{COP}}\leq(1+\epsilon)\rho_{d}, then the fixed point of COP-TD(,β\beta) with function approximation θCOP\theta^{\text{COP}} satisfies the following, where \|\cdot\|_{\infty} is the LL_{\infty} induced norm:

where Aπ=ΦDπ(IγPπ)ΦA_{\pi}=\Phi^{\top}D_{\pi}(I-\gamma P_{\pi})\Phi, and θ\theta^{*} sets the fixed point of the operator ΠdπTπV\Pi_{d_{\pi}}T_{\pi}V.

Recently, Sutton et al. (2015) had suggested an algorithm for off-policy evaluation called Emphatic TD. Their algorithm was later on extended by Hallak et al. (2015) and renamed ETD(λ\lambda, β\beta), which was shown to perform extremely well empirically by White & White (2016). ETD(, β\beta) can be represented as:

When comparing ETD(λ\lambda,β\beta)’s form to COP-TD(λ\lambda,β\beta)’s, instead of spending memory and time resources on a state/feature-dependent FtF_{t}, ETD(λ\lambda,β\beta) uses a one-variable approximation. The resulting FtF_{t} is in fact a one-step estimate of ρd\rho_{d}, starting from ρd^(s)1\widehat{\rho_{d}}(s)\equiv 1 (see Equations 15, 8), up to a minor difference: FtETD=βFtCOP-TD+1F_{t}^{\text{ETD}}=\beta F_{t}^{\text{COP-TD}}+1 (which following our logic adds bias to the estimate We have conducted several experiments with an altered ETD and indeed obtained better results compared with the original, these experiments are outside the scope of the paper.).

Unlike ETD(λ\lambda, β\beta), COP-TD(λ\lambda,β\beta)’s effectiveness depends on the available resources. The number of features ϕρ(s)\phi_{\rho}(s) can be adjusted accordingly to provide the most affordable approximation. The added cost is fine-tuning another step-size, though β\beta’s effect is less prominent.

The Logarithm Approach for Handling Long Products

We now present a heuristic algorithm which works similarly to COP-TD(λ\lambda, β\beta). Before presenting the algorithm, we explain the motivation behind it.

Konidaris et al. (2011) suggested a statistical interpretation of TD(λ\lambda). They show that under several assumptions the TD(λ\lambda) estimate RstλR^{\lambda}_{s_{t}} is the maximum likelihood estimator of V(st)V(s_{t}) given RstnR_{s_{t}}^{n}: (1) Each RstnR_{s_{t}}^{n} is an unbiased estimator of V(st)V(s_{t}); (2) The random variables RstnR_{s_{t}}^{n} are independent and specifically uncorrelated; (3) The random variables RstnR_{s_{t}}^{n} are jointly normally distributed; and (4) The variance of each RstnR_{s_{t}}^{n} is proportional to λn\lambda^{n}.

Under Assumptions 1-3 the maximum likelihood estimator of V(s)V(s) given its previous estimate can be represented as a linear convex combination of RstnR_{s_{t}}^{n} with weights:

Subsequently, in Konidaris et al. (2011) Assumption 44 was relaxed and instead a closed form approximation of the variance was proposed. In a follow-up paper by Thomas et al. (2015b), the second assumption was also removed and the weights were instead given as: wn=1cov(Rst)en1cov(Rst)1w_{n}=\frac{\textbf{1}^{\top}cov(\textbf{R}_{s_{t}})\textbf{e}_{n}}{\textbf{1}^{\top}cov(\textbf{R}_{s_{t}})\textbf{1}}, where the covariance matrix can be estimated from the data, or otherwise learned through some parametric form.

While both the approximated variance and learned covariance matrix solutions improve performance on several benchmarks, the first uses a rather crude approximation, and the second solution is both state-dependent and based on noisy estimates of the covariance matrix. In addition, there aren’t efficient on-line implementations since all past weights should be recalculated to match a new sample. Still, the suggested statistical justification is a valuable tool in assessing the similar role of β\beta in ETD(λ\lambda, β\beta).

3 Log-COP-TD(λ𝜆\lambda, β𝛽\beta)

This approximation is crude – we could add terms reducing the error through Taylor expansion, but these would be complicated to deal with. Hence, we can relate to this method mainly as a well-motivated heuristic.

Notice that this formulation resembles the standard MDP formulation, only with the corresponding ”reward” terms log[ρt]\log[\rho_{t}] going backward instead of forward, and no discount factor. Unfortunately, without a discount factor we cannot expect the estimated value to converge, so we propose using an artificial one γlog\gamma_{\log}. We can incorporate function approximation for this formulation as well. Unlike COP-TD(λ\lambda, β\beta), we can choose the features and weights as we wish with no restriction, besides the linear constraint on the resulting ρd\rho_{d} through the weight vector θρ\theta_{\rho}. This can be approximately enforced by normalizing θρ\theta_{\rho} using Xt1ttexp(θρ,tϕ(st))\frac{X}{t}\doteq\frac{1}{t}\sum_{t}\exp(\theta_{\rho,t}^{\top}\phi(s_{t})) (which should equal 11 if we were exactly correct). We call the resulting algorithm Log-COP-TD(λ\lambda,β\beta).

4 Using the Original Features

An interesting phenomenon occurs when the behavior and target policies employ a feature based Boltzmann distribution for choosing the actions: μ(as)=exp(θa,μϕ(s))\mu(a|s)=\exp\left(\theta_{a,\mu}^{\top}\phi(s)\right), and π(as)=exp(θa,πϕ(s))\pi(a|s)=\exp\left(\theta_{a,\pi}^{\top}\phi(s)\right), where a constant feature is added to remove the (possibly different) normalizing constant. Thus, log(ρt)=(θa,πθa,μ)ϕ(st)\log(\rho_{t})=(\theta_{a,\pi}-\theta_{a,\mu})^{\top}\phi(s_{t}), and Log-COP-TD(λ\lambda,β\beta) obtains a parametric form that depends on the original features instead of a different set.

5 Approximation Hardness

As we propose to use linear function approximation for ρd(s)\rho_{d}(s) and log(ρd(s))\log\left(\rho_{d}(s)\right) one cannot help but wonder how hard it is to approximate these quantities, especially compared to the value function. The comparison between V(s)V(s) and ρd(s)\rho_{d}(s) is problematic for several reasons:

The ultimate goal is estimating Vπ(s)V^{\pi}(s), approximation errors in ρd(s)\rho_{d}(s) are second order terms.

The value function Vπ(s)V^{\pi}(s) depends on the policy-induced reward function and transition probability matrix, while ρd(s)\rho_{d}(s) depends on the stationary distributions induced by both policies. Since each depends on at least one distinct factor - we can expect different setups to result in varied approximation hardness. For example, if the reward function has a poor approximation then so will Vπ(s)V^{\pi}(s), while extremely different behavior and target policies can cause ρd(s)\rho_{d}(s) to behave erratically.

Subsequently, the choice of features for approximating Vπ(s)V^{\pi}(s) and ρd(s)\rho_{d}(s) can differ significantly depending on the problem at hand.

If we would still like to compare Vπ(s)V^{\pi}(s) and ρd(s)\rho_{d}(s), we could think of extreme examples:

When π=μ\pi=\mu, ρd(s)1\rho_{d}(s)\equiv 1, when R(s)0R(s)\equiv 0 then Vπ(s)0V^{\pi}(s)\equiv 0.

In the chain MDP example in Section 4 we saw that ρd(s)\rho_{d}(s) is an exponential function of the location in the chain. Setting reward in one end to 11 will result in an exponential form for Vπ(s)V^{\pi}(s) as well. Subsequently, in the chain MDP example approximating log(ρd(s))\log\left(\rho_{d}(s)\right) is easier than ρd(s)\rho_{d}(s) as we obtain a linear function of the position; This is not the general case.

Experiments

We have performed 3 types of experiments. Our first batch of experiments (Figure 1) demonstrates the accuracy of predicting ρd\rho_{d} by both COP-TD(λ\lambda, β\beta) and Log-COP-TD(λ\lambda, β\beta). We show two types of setups in which visualization of ρd\rho_{d} is relatively clear - the chain MDP example mentioned in Section 4 and the mountain car domain (Sutton & Barto, 1998) in which the state is determined by only two continuous variables - the car’s position and speed. The parameters λ\lambda and β\beta exhibited low sensitivity in these tasks so they were simply set to , we show the estimated ρd\rho_{d} after 10610^{6} iterations. For the chain MDP (top two plots, notice the logarithmic scale) we first approximate ρd\rho_{d} without any function approximation (top-left) and we can see COP-TD manages to converge to the correct value while Log-COP-TD is much less exact. When we use linear feature space (constant parameter and position) Log-COP-TD captures the true behavior of ρd\rho_{d} much better as expected. The two lower plots show the error (in color) in ρd\rho_{d} estimated for the mountain car with a pure exploration behavior policy vs. a target policy oriented at moving right. The z-axis is the same for both plots and it describes a much more accurate estimate of ρd\rho_{d} obtained through simulations. The features used were local state aggregation. We can see that both algorithms succeed similarly on the position-speed pairs which are sampled often due to the behavior policy and the mountain. When looking at more rarely observed states, the estimate becomes worse for both algorithms, though Log-COP-TD seems to be better performing on the spike at position >0>0.

Next we test the sensitivity of COP-TD(λ\lambda, β\beta) and Log-COP-TD(λ\lambda,β\beta) to the parameters β\beta and γlog\gamma_{\log} (Figure 2) on two distinct toy examples - the chain MDP introduced before but with only 30 states with the position-linear features, and a random MDP with 32 states, 2 actions and a 55-bit binary feature vector along with a free parameter (this compact representation was suggested by White & White (2016) to approximate real world problems). The policies on the chain MDP were taken as described before, and on the random MDP a state independent 0.750.75/0.250.25 probability to choose an action by the behavior/target policy. As we can see, larger values of β\beta cause noisier estimations in the random MDP for COP-TD(λ\lambda, β\beta), but has little effect in other venues. As for γlog\gamma_{\log} - we can see that if it is too large or too small the error behaves sub-optimally, as expected for the crude approximation of Equation 10. In conclusion, unlike ETD(λ\lambda, β\beta), Log/COP-TD(λ\lambda, β\beta) are much less effected by β\beta, though γlog\gamma_{\log} should be tuned to improve results.

Our final experiment (Figure 3) compares our algorithms to ETD(λ\lambda, β\beta) and GTD(λ\lambda, β\beta) over 4 setups: chain MDP with 100 states with right half rewards 11 with linear features, a 2 action random MDP with 256 states and binary features, acrobot (3 actions) and cart-pole balancing (21 actions) (Sutton & Barto, 1998) with reset at success and state aggregation to 100100 states. In all problems we used the same features for ρd\rho_{d} and Vπ(s)V^{\pi}(s) estimation, γ=0.99\gamma=0.99, constant step size 0.050.05 for the TD process and results were averaged over 10 trajectories, other parameters (λ\lambda, β\beta, other step sizes, γlog\gamma_{\log}) were swiped over to find the best ones. To reduce figure clutter we have not included standard deviations though the noisy averages still reflect the variance in the process. Our method of comparison on the first 2 setups estimates the value function using the suggested algorithm, and finds the dπd_{\pi} weighted average of the error between VV and the on-policy fixed point ΠπTVπ\Pi_{\pi}TV_{\pi}:

where θ\theta^{*} is the optimal θ\theta obtained by on-policy TD using the target policy. On the latter continuous state problems we applied on-line TD on a different trajectory following the target policy, used the resulting θ\theta value as ground truth and taken the sum of squared errors with respect to it. The behavior and target policies for the chain MDP and random MDP are as specified before. For the acrobot problem the behavior policy is uniform over the 3 actions and the target policy chooses between these with probabilities (16,13,12)(\frac{1}{6},\frac{1}{3},\frac{1}{2}). For the cart-pole the action space is divided to 21 actions from -1 to 1 equally, the behavior policy chooses among these uniformly while the target policy is 1.5 times more prone to choosing a positive action than a negative one.

The experiments show that COP-TD(λ\lambda, β\beta) and Log-COP-TD(λ\lambda, β\beta) have comparable performance to ETD(λ\lambda, β\beta) where at least one is better in every setup. The advantage in the new algorithms is especially seen in the chain MDP corresponding to a large discrepancy between the stationary distribution of the behavior and target policy. GTD(λ\lambda) is consistently worse on the tested setups, this might be due to the large difference between the chosen behavior and target policies which affects GTD(λ\lambda) the most.

Conclusion

Research on off-policy evaluation has flourished in the last decade. While a plethora of algorithms were suggested so far, ETD(λ\lambda, β\beta) by Hallak et al. (2015) has perhaps the simplest formulation and theoretical properties. Unfortunately, ETD(λ\lambda, β\beta) does not converge to the same point achieved by on-line TD when linear function approximation is applied.

We address this issue with COP-TD(λ\lambda,β\beta) and proved it can achieve consistency when used with a correct set of features, or at least allow trading-off some of the bias by adding or removing features. Despite requiring a new set of features and calibrating an additional update function, COP-TD(λ\lambda,β\beta)’s performance does not depend as much on β\beta as ETD(λ\lambda,β\beta), and shows promising empirical results.

We offer a connection to the statistical interpretation of TD(λ\lambda) that motivates our entire formulation. This interpretation leads to two additional approaches: (a) weight the Γtn\Gamma^{n}_{t} using estimated variances instead of β\beta exponents and (b) approximating log[ρd]\log[\rho_{d}] instead of ρd\rho_{d}; both approaches deserve consideration when facing a real application.

References

Appendix

Under both policies the induced Markov chain is ergodic.

The first state s0s_{0} is distributed according to the behavior policy dμ(s)d_{\mu}(s).

The support of μ\mu contains the support of π\pi, i.e. π(as)>0μ(as)>0\pi(a|s)>0\Rightarrow\mu(a|s)>0.

The feature matrix [Φ]s,:ϕ(s)[\Phi]_{s,:}\doteq\phi(s) has full rank.

If the step sizes αt\alpha_{t} hold t=0αt=,t=0αt2<\sum_{t=0}^{\infty}\alpha_{t}=\infty,\sum_{t=0}^{\infty}\alpha^{2}_{t}<\infty then the process described by Equation 4 converges almost surely to the fixed point of ΠπTπV=V\Pi_{\pi}T_{\pi}V=V.

Similarly to on-policy TD, we define AA and bb, the fixed point is the solution to Aθ=bA\theta=b. First we find AA and show stability:

This is exactly the same AA we would have obtained from TD() and it is negative definite (see (Sutton et al., 2015)). Similarly we can find bb:

and we obtained the same bb as on-policy TD() with π\pi.

Now we consider the noise of this off-policy TD, which is exactly the same noise as the on-policy TD only multiplied by ρtρd(st)\rho_{t}\rho_{d}(s_{t}) - as long as the noise term of the ODE formulation (Kushner & Yin, 2003) is still bounded, the proof is exactly the same. According to Assumption 1, we know that ρd\rho_{d} is lower and upper bounded. By Assumption 3 we also know that ρt\rho_{t} is lower and upper bounded. Therefore the noise of the new process is bounded and the same a.s. convergence applies as on-policy TD() (Tsitsiklis & Van Roy, 1997). Since A,bA,b are the same as on-policy TD() for the target policy π\pi, the convergence is to the same fixed point. ∎

2 Proof of Lemma 2

For any function on the state space u(s)u(s):

where este_{s_{t}} is the unit vector of state sts_{t}. So, for an unbiased estimate of ρd\rho_{d} denoted ρd^\widehat{\rho_{d}} we can define and derive:

3 Proof of Lemma 3

Under the ergodicity assumption, denote the eigenvalues of PπP_{\pi} by 0ξ2<ξ1=10\leq\dots\leq|\xi_{2}|<\xi_{1}=1. Then YβY^{\beta} is a maxi1(1β)ξi1βξi\max_{i\neq 1}\frac{(1-\beta)|\xi_{i}|}{|1-\beta\xi_{i}|}-contraction in the L2L_{2}-norm on the orthogonal subspace to ρd\rho_{d}, and ρd\rho_{d} is a fixed point of YβY^{\beta}.

We first show that ρd\rho_{d} is a fixed point of YβY^{\beta}:

Due to similarity, the eigenvalues of YβY^{\beta} are the same as these of (1β)Pπ(IβPπ)1(1-\beta)P^{\top}_{\pi}(I-\beta P^{\top}_{\pi})^{-1} which is a stochastic matrix with eigenvalues ((1β)ξi1βξi)i=1S\left(\frac{(1-\beta)\xi_{i}}{1-\beta\xi_{i}}\right)_{i=1}^{|S|}, where for i=1i=1 we obtain the eigenvalue 11. Now for every vector orthogonal to ρd\rho_{d} denoted uu, the first eigenvalue has no effect on its spectral decomposition, which means that Yβumaxi1(1β)ξi1βξiu\|Y^{\beta}u\|\leq\max_{i\neq 1}\frac{(1-\beta)|\xi_{i}|}{|1-\beta\xi_{i}|}\|u\|. ∎

4 Proof of Theorem 1

We use a three timescales stochastic approximation analysis. The fastest process is d^μ(s)\hat{d}_{\mu}(s) which converges naturally with time-step O(1t)O(\frac{1}{t}):

The process d^μ,t\hat{d}_{\mu,t} converges almost surely to dμd_{\mu} by the strong law of large numbers. Our next process is ρd^,t\widehat{\rho_{d}}_{,t}, which we will show converges a.s. to ρd\rho_{d} with d^μ(s)=dμ(s)\hat{d}_{\mu}(s)=d_{\mu}(s):

We follow the notation from (Schuss & Borkar, 2009). We first specify the stochastic approximation using h(x)h(x) and Mn+1M_{n+1}:

Now there are several conditions that must follow - condition on the step sizes, conditions on the Martingale and conditions on the projection. If all of these are met than the process converegs to the fixed point of the projected operator ρd\rho_{d}.

The step size conditions follow by the theorem’s assumption. Now we move on to the Martingale conditions.

Now let’s consider the projection where we follow the discussion in (Schuss & Borkar, 2009), Section 5.4. Notice that hh is Lipschitz and the eigenvalues around the fixed point are non-negative, therefore there’s a stable invariant solution set. In addition, the projection is to a closed convex set Δdμ\Delta_{d_{\mu}}, so ρd,t^\widehat{\rho_{d,t}} is bounded. Hence, our goal is to show that the projection is Lipschitz and that we can ignore its non-smooth boundary.

The projection to the simplex zeros some coordinates and decreases a constant from the other coordinates. If indeed it zeros some coordinates - we are at a problematic area of the space since the projection is not Frechet differentiable there (we are on the boundary of the set). However, because ρd(s)>0\rho_{d}(s)>0 (Assumption 1), the unprojected ODE repels ρd^\widehat{\rho_{d}} from these problematic boundaries, and we can assume that after enough the steps the projection is simply a projection to the affine subspace sdμ(s)u(s)=1\sum_{s}d_{\mu}(s)u(s)=1. In that case the projection is given by: ΠΔdμu=(I1dμ2dμdμ)(u1)+1\Pi_{\Delta_{d_{\mu}}}u=(I-\frac{1}{\|d_{\mu}\|^{2}}d_{\mu}d_{\mu}^{\top})(u-\textbf{1})+\textbf{1} and its Frechet derivative is ΠˉΔdμu=(I1dμ2dμdμ)u\bar{\Pi}_{\Delta_{d_{\mu}}}u=(I-\frac{1}{\|d_{\mu}\|^{2}}d_{\mu}d_{\mu}^{\top})u. This derivative is Lipschitz continuous which means that its composition with hh is also Lipschitz .

Hence, the process converges to the solution set of ΠˉΔdμh(x)=0\bar{\Pi}_{\Delta_{d_{\mu}}}h(x)=0 for ΠΔdμx=x\Pi_{\Delta_{d_{\mu}}}x=x. Under these constraints the only fixed point can be ρd\rho_{d} (intersection of the cρdc\cdot\rho_{d} line with the weighted simplex set Δdμ)\Delta_{d_{\mu}}).

Finally, treating the θt\theta_{t} process assuming ρd^,t\widehat{\rho_{d}}_{,t} already converged to ρd\rho_{d}, leaves us with Lemma 1. Since each process depends only on the previous ones, it is enough to show they converge independently as long as the step sizes satisfy the rate constraints. ∎

5 Proof of Theorem 2

Similarly to the proof of Theorem , we can analyze the system on 3-time scales. The fastest process is d^ϕρ\hat{d}_{\phi_{\rho}} which converges naturally with time-step O(1t)O(\frac{1}{t}):

First we represent the corresponding AA and bb of the projected ρd\rho_{d} ODE as follows (we assume β=0\beta=0, but the results are similar for general β\beta):

We can now verify that a solution to Ax=bAx=b also holds the Projected COP equation: ΠdμϕρYΦρθρ=Φθρ\Pi^{\phi_{\rho}}_{d_{\mu}}Y\Phi_{\rho}\theta_{\rho}=\Phi\theta_{\rho} by multiplying it from the left by ΦρDμ\Phi_{\rho}^{\top}D_{\mu}:

Moving on to the last process, we get the following equations:

leading us to the known convergence solution (if indeed the process converge, which is not necessarily true) which is the fixed point of ΠdμρdCOPTπV\Pi_{d_{\mu}\circ\rho^{\text{COP}}_{d}}T_{\pi}V.

6 Proof of Corollary 1

Let 0<ϵ<10<\epsilon<1. If (1ϵ)ρdρdCOP(1+ϵ)ρd(1-\epsilon)\rho_{d}\leq\rho_{d}^{\text{COP}}\leq(1+\epsilon)\rho_{d}, then the fixed point of COP-TD(,β\beta) with function approximation θCOP\theta^{\text{COP}} satisfies the following, where \|\cdot\|_{\infty} is the LL_{\infty} induced norm:

where Aπ=ΦDπ(IγPπ)ΦA_{\pi}=\Phi^{\top}D_{\pi}(I-\gamma P_{\pi})\Phi, and θ\theta^{*} sets the fixed point of the operator ΠdπTπV\Pi_{d_{\pi}}T_{\pi}V.

If (1ϵ)ρdρdCOP(1+ϵ)ρd(1-\epsilon)\rho_{d}\leq\rho_{d}^{\text{COP}}\leq(1+\epsilon)\rho_{d}, then we know that the weights of the projection hold:

We write the solution equations for both weight vectors

Now we subtract both equations and add and subtract (ΦDπ(IγPπ)Φ)θCOP(\Phi^{\top}D_{\pi}(I-\gamma P_{\pi})\Phi)\theta^{\text{COP}}:

Now we take LL_{\infty} norm on both sides, and use induced matrix sub-multiplicative property:

7 More details on the experiments

Experiments for Figure 1: 100 states chain MDP, with probability 0.510.51 to move left / right for the behavior / target policy. Results were taken after T=1e6T=1e6 iterations. For COP-TD we used β=0\beta=0 and a constant step size 0.50.5. For Log-COP-TD we used β=0,γlog=0.9999\beta=0,\gamma_{\log}=0.9999 and constant step size 0.50.5. The experiment was conducted 10 times and the standard deviation is given as shading in the graph.

For the mountain car experiment we used the simulator given by https://jamh-web.appspot.com/download.htm. The state aggregation was obtained by running kmeans with 100100 clusters and taking the centers as representative states. The behavior policy was taken to be uniform over the 3 possible actions (-1, 0, 1), and the target policy chose these actions with probabilities (1/6,1/3,1/2) regardless of the state. COP-TD was applied with β=0\beta=0 and constant step size 0.010.01, and Log-COP-TD was applied with β=0,γlog=0.9\beta=0,\gamma_{\log}=0.9 and constant step size 0.010.01. Both algorithms ran for T=1e6T=1e6 iterations before the estimated ρd\rho_{d} was taken.

Experiments for Figure 2: 100 states chain MDP, with probability 0.510.51 to move left / right for the behavior / target policy. For both algorithms constant step size 0.50.5. For Log-COP-TD we used β=0,γlog=0.9999\beta=0,\gamma_{\log}=0.9999 when sweeping on the other parameter. All experiments were conducted 10 times and we show the average result.

In the randomized MDP with 32 states we used uniform distribution over transition probabilities for two possible actions, and the target policy had p=0.75p=0.75 to choose one action where the behavior policy had p=0.75p=0.75 to choose the other action. All experiments were conducted 10 times and we show the average result.

Experiments for Figure 3 were obtained by running COP-TD, Log-COP-TD, ETD and GTD over 4 setups. The distinct parameters of each algorithm were swiped over to find the best value: COP-TD’s step size and β\beta, Log-COP-TD’s steps size, β\beta and γlog\gamma_{\log}, ETD’s β\beta and GTD’s step size. The step size of the main process and λ\lambda were taken to be the same for all algorithms: step size = 0.05 and λ=0\lambda=0 (obtained also by sweeping over possible values). The simulators for acrobot and pole-balancing were taken from https://jamh-web.appspot.com/download.htm.