Safe and Efficient Off-Policy Reinforcement Learning

Rémi Munos, Tom Stepleton, Anna Harutyunyan, Marc G. Bellemare

Notation

The value function for a policy π\pi, QπQ^{\pi}, describes the expected discounted sum of rewards associated with following π\pi from a given state-action pair. Using operator notation, we write this as

The Bellman operator Tπ\mathcal{T}^{\pi} for a policy π\pi is defined as TπQ:=r+γPπQ\mathcal{T}^{\pi}Q:=r+\gamma P^{\pi}Q and its fixed point is QπQ^{\pi}, i.e. TπQπ=Qπ=(IγPπ)1r\mathcal{T}^{\pi}Q^{\pi}=Q^{\pi}=(I-\gamma P^{\pi})^{-1}r. The Bellman optimality operator introduces a maximization over the set of policies:

Its fixed point is QQ^{*}, the unique optimal value function (Puterman,, 1994). It is this quantity that we will seek to obtain when we talk about the “control setting”.

The λ\lambda-return extension (Sutton,, 1988) of the Bellman operators considers exponentially weighted sums of nn-steps returns:

where TπQQ\mathcal{T}^{\pi}Q-Q is the Bellman residual of QQ for policy π\pi. Examination of the above shows that QπQ^{\pi} is also the fixed point of Tλπ\mathcal{T}_{\lambda}^{\pi}. At one extreme (λ=0\lambda=0) we have the Bellman operator Tλ=0πQ=TπQ\mathcal{T}_{\lambda=0}^{\pi}Q=\mathcal{T}^{\pi}Q, while at the other (λ=1\lambda=1) we have the policy evaluation operator Tλ=1πQ=Qπ\mathcal{T}_{\lambda=1}^{\pi}Q=Q^{\pi} which can be estimated using Monte Carlo methods (Sutton and Barto,, 1998). Intermediate values of λ\lambda trade off estimation bias with sample variance (Kearns and Singh,, 2000).

We seek to evaluate a target policy π\pi using trajectories drawn from a behaviour policy μ\mu. If π=μ\pi=\mu, we are on-policy; otherwise, we are off-policy. We will consider trajectories of the form:

Off-Policy Algorithms

We are interested in two related off-policy learning problems. In the policy evaluation setting, we are given a fixed policy π\pi whose value QπQ^{\pi} we wish to estimate from sample trajectories drawn from a behaviour policy μ\mu. In the control setting, we consider a sequence of policies that depend on our own sequence of Q-functions (such as ε\varepsilon-greedy policies), and seek to approximate QQ^{*}.

The general operator that we consider for comparing several return-based off-policy algorithms is:

Importance sampling is the simplest way to correct for the discrepancy between μ\mu and π\pi when learning from off-policy returns (Precup et al.,, 2000, 2001; Geist and Scherrer,, 2014). The off-policy correction uses the product of the likelihood ratios between π\pi and μ\mu. Notice that RQ\mathcal{R}Q defined in (3) with this choice of (cs)(c_{s}) yields QπQ^{\pi} for any QQ. For Q=0Q=0 we recover the basic IS estimate \sum_{t\geq 0}\gamma^{t}\big{(}\prod_{s=1}^{t}c_{s}\big{)}r_{t}, thus (3) can be seen as a variance reduction technique (with a baseline QQ). It is well known that IS estimates can suffer from large – even possibly infinite – variance (mainly due to the variance of the product π(a1x1)μ(a1x1)π(atxt)μ(atxt)\frac{\pi(a_{1}|x_{1})}{\mu(a_{1}|x_{1})}\cdots\frac{\pi(a_{t}|x_{t})}{\mu(a_{t}|x_{t})}), which has motivated further variance reduction techniques such as in (Mahmood and Sutton,, 2015; Mahmood et al.,, 2015; Hallak et al.,, 2015).

A recent alternative proposed by Harutyunyan et al., (2016) introduces an off-policy correction based on a QQ-baseline (instead of correcting the probability of the sample path like in IS). This approach, called Qπ(λ\lambda) and Q∗(λ\lambda) for policy evaluation and control, respectively, corresponds to the choice cs=λc_{s}=\lambda. It offers the advantage of avoiding the blow-up of the variance of the product of ratios encountered with IS. Interestingly, this operator contracts around QπQ^{\pi} provided that μ\mu and π\pi are sufficiently close to each other. Defining ε:=maxxπ(x)μ(x)1\varepsilon:=\max_{x}\|\pi(\cdot|x)-\mu(\cdot|x)\|_{1} the level of “off-policyness”, the authors prove that the operator defined by (3) with cs=λc_{s}=\lambda is a contraction mapping around QπQ^{\pi} for λ<1γγε\lambda<\frac{1-\gamma}{\gamma\varepsilon}, and around QQ^{*} for the worst case of λ<1γ2γ\lambda<\frac{1-\gamma}{2\gamma}. Unfortunately, Qπ(λ\lambda) requires knowledge of ε\varepsilon, and the condition for Q∗(λ\lambda) is very conservative. Neither Qπ(λ\lambda), nor Q∗(λ\lambda) are safe as they do not guarantee convergence for arbitrary π\pi and μ\mu.

The TB(λ\lambda) algorithm of Precup et al., (2000) corrects for the target/behaviour discrepancy by multiplying each term of the sum by the product of target policy probabilities. The corresponding operator defines a contraction mapping for any policies π\pi and μ\mu, which makes it a safe algorithm. However, this algorithm is not efficient in the near on-policy case (where μ\mu and π\pi are similar) as it unnecessarily cuts the traces, preventing it to make use of full returns: indeed we need not discount stochastic on-policy transitions (as shown by Harutyunyan et al.,’s results about Qπ).

Our contribution is an algorithm – Retrace(λ)(\lambda) – that takes the best of the three previous algorithms. Retrace(λ)(\lambda) uses an importance sampling ratio truncated at 11. Compared to IS, it does not suffer from the variance explosion of the product of IS ratios. Now, similarly to Qπ(λ)Q^{\pi}(\lambda) and unlike TB(λ\lambda), it does not cut the traces in the on-policy case, making it possible to benefit from the full returns. In the off-policy case, the traces are safely cut, similarly to TB(λ\lambda). In particular, \min\big{(}1,\frac{\pi(a_{s}|x_{s})}{\mu(a_{s}|x_{s})}\big{)}\geq\pi(a_{s}|x_{s}): Retrace(λ\lambda) does not cut the traces as much as TB(λ\lambda). In the subsequent sections, we will show the following:

For any traces 0csπ(asxs)/μ(asxs)0\leq c_{s}\leq\pi(a_{s}|x_{s})/\mu(a_{s}|x_{s}) (thus including the Retrace(λ\lambda) operator), the return-based operator (3) is a γ\gamma-contraction around QπQ^{\pi}, for arbitrary policies μ\mu and π\pi

In the control case (where π\pi is replaced by a sequence of increasingly greedy policies) the online Retrace(λ\lambda) algorithm converges a.s. to QQ^{*}, without requiring the GLIE assumption.

As a corollary, Watkins’s Q(λ)(\lambda) converges a.s. to QQ^{*}.

Analysis of Retrace(λ𝜆\lambda)

We will in turn analyze both off-policy policy evaluation and control settings. We will show that R\mathcal{R} is a contraction mapping in both settings (under a mild additional assumption for the control case).

Our first result states the γ\gamma-contraction of the operator (3) defined by any set of non-negative coefficients cs=cs(as,Fs)c_{s}=c_{s}(a_{s},\mathcal{F}_{s}) (in order to emphasize that csc_{s} can be a function of the whole history Fs\mathcal{F}_{s}) under the assumption that 0csπ(asxs)μ(asxs)0\leq c_{s}\leq\frac{\pi(a_{s}|x_{s})}{\mu(a_{s}|x_{s})}.

The operator R\mathcal{R} defined by (3) has a unique fixed point QπQ^{\pi}. Furthermore, if for each asAa_{s}\in\mathcal{A} and each history Fs\mathcal{F}_{s} we have c_{s}=c_{s}(a_{s},\mathcal{F}_{s})\in\big{[}0,\frac{\pi(a_{s}|x_{s})}{\mu(a_{s}|x_{s})}\big{]}, then for any Q-function QQ

The following lemma will be useful in proving Theorem 1 (proof in the appendix).

The difference between RQ\mathcal{R}Q and its fixed point QπQ^{\pi} is

Now since π(axt)μ(axt)ct(b,Ft)0\pi(a|x_{t})-\mu(a|x_{t})c_{t}(b,\mathcal{F}_{t})\geq 0, we have that RQ(x,a)Qπ(x,a)=y,bwy,bΔQ(y,b)\mathcal{R}Q(x,a)-Q^{\pi}(x,a)=\sum_{y,b}w_{y,b}\Delta Q(y,b), i.e. a linear combination of ΔQ(y,b)\Delta Q(y,b) weighted by non-negative coefficients:

Thus η(x,a)[0,γ]\eta(x,a)\in[0,\gamma] is a (x,a)(x,a)-specific contraction coefficient, which is γ\gamma when c1=0c_{1}=0 (the trace is cut immediately) and can be close to zero when learning from full returns (ct1c_{t}\approx 1 for all tt).

2 Control

In the control setting, the single target policy π\pi is replaced by a sequence of policies (πk)(\pi_{k}) which depend on (Qk)(Q_{k}). While most prior work has focused on strictly greedy policies, here we consider the larger class of increasingly greedy sequences. We now make this notion precise.

Intuitively, this means that each πk+1\pi_{k+1} is at least as greedy as the previous policy πk\pi_{k} for Qk+1Q_{k+1}. Many natural sequences of policies are increasingly greedy, including εk\varepsilon_{k}-greedy policies (with non-increasing εk\varepsilon_{k}) and softmax policies (with non-increasing temperature). See proofs in the appendix.

We will assume that cs=cs(as,Fs)=c(as,xs)c_{s}=c_{s}(a_{s},\mathcal{F}_{s})=c(a_{s},x_{s}) is Markovian, in the sense that it depends on xs,asx_{s},a_{s} (as well as the policies π\pi and μ\mu) only but not on the full past history. This allows us to define the (sub)-probability transition operator

Finally, an additional requirement to the convergence in the control case, we assume that Q0Q_{0} satisfies Tπ0Q0Q0\mathcal{T}^{\pi_{0}}Q_{0}\geq Q_{0} (this can be achieved by a pessimistic initialization Q0=RMAX/(1γ)Q_{0}=-R_{MAX}/(1-\gamma)).

Consider an arbitrary sequence of behaviour policies (μk)(\mu_{k}) (which may depend on (Qk)(Q_{k})) and a sequence of target policies (πk)(\pi_{k}) that are increasingly greedy w.r.t. the sequence (Qk)(Q_{k}):

where the return operator Rk\mathcal{R}_{k} is defined by (3) for πk\pi_{k} and μk\mu_{k} and a Markovian cs=c(as,xs)[0,πk(asxs)μk(asxs)]c_{s}=c(a_{s},x_{s})\in[0,\frac{\pi_{k}(a_{s}|x_{s})}{\mu_{k}(a_{s}|x_{s})}]. Assume the target policies πk\pi_{k} are εk\varepsilon_{k}-away from the greedy policies w.r.t. QkQ_{k}, in the sense that TπkQkTQkεkQke\mathcal{T}^{\pi_{k}}Q_{k}\geq\mathcal{T}Q_{k}-\varepsilon_{k}\|Q_{k}\|e, where ee is the vector with 1-components. Further suppose that Tπ0Q0Q0\mathcal{T}^{\pi_{0}}Q_{0}\geq Q_{0}. Then for any k0k\geq 0,

In consequence, if εk0\varepsilon_{k}\to 0 then QkQQ_{k}\to Q^{*}.

Using PcμkP^{c\mu_{k}}, the Retrace(λ)(\lambda) operator rewrites

We now lower- and upper-bound the term Qk+1QQ_{k+1}-Q^{*}.

Upper bound on Qk+1QQ_{k+1}-Q^{*}. We prove that Qk+1QAk(QkQ)Q_{k+1}-Q^{*}\leq A_{k}(Q_{k}-Q^{*}) with A_{k}:=\gamma(I-\gamma P^{c\mu_{k}})^{-1}\big{[}P^{\pi_{k}}-P^{c\mu_{k}}\big{]}. Since ct[0,π(atxt)μ(atxt)]c_{t}\in[0,\frac{\pi(a_{t}|x_{t})}{\mu(a_{t}|x_{t})}] we deduce that AkA_{k} has non-negative elements, whose sum over each row, is at most γ\gamma. Thus

Lower bound on Qk+1QQ_{k+1}-Q^{*}. Using the fact that TπkQkTπQkεkQke\mathcal{T}^{\pi_{k}}Q_{k}\geq\mathcal{T}^{\pi^{*}}Q_{k}-\varepsilon_{k}\|Q_{k}\|e we have

Lower bound on TπkQkQk\mathcal{T}^{\pi_{k}}Q_{k}-Q_{k}. Since the sequence (πk)(\pi_{k}) is increasingly greedy w.r.t. (Qk)(Q_{k}), we have

where Bk:=γ[PπkPcμk](IγPcμk)1B_{k}:=\gamma[P^{\pi_{k}}-P^{c\mu_{k}}](I-\gamma P^{c\mu_{k}})^{-1}. Since PπkPcμkP^{\pi_{k}}-P^{c\mu_{k}} and (IγPcμk)1(I-\gamma P^{c\mu_{k}})^{-1} are non-negative matrices, so is BkB_{k}. Thus TπkQkQkBk1Bk2B0(Tπ0Q0Q0)0,\mathcal{T}^{\pi_{k}}Q_{k}-Q_{k}\geq B_{k-1}B_{k-2}\dots B_{0}(\mathcal{T}^{\pi_{0}}Q_{0}-Q_{0})\geq 0, since we assumed Tπ0Q0Q00T^{\pi_{0}}Q_{0}-Q_{0}\geq 0. Thus, (5) implies that

Combining the above with (4) we deduce Qk+1QγQkQ+εkQk\|Q_{k+1}-Q^{*}\|\leq\gamma\|Q_{k}-Q^{*}\|+\varepsilon_{k}\|Q_{k}\|. When εk0\varepsilon_{k}\rightarrow 0, we further deduce that QkQ_{k} are bounded, thus QkQQ_{k}\to Q^{*}. ∎

3 Online algorithms

So far we have analyzed the contraction properties of the expected R\mathcal{R} operators. We now describe online algorithms which can learn from sample trajectories. We analyze the algorithms in the every visit form (Sutton and Barto,, 1998), which is the more practical generalization of the first-visit form. In this section, we will only consider the Retrace(λ\lambda) algorithm defined with the coefficient c=λmin(1,π/μ)c=\lambda\min(1,\pi/\mu). For that cc, let us rewrite the operator PcμP^{c\mu} as λPπμ\lambda P^{\pi\wedge\mu}, where PπμQ(x,a):=ybmin(π(by),μ(by))Q(y,b)P^{\pi\wedge\mu}Q(x,a):=\sum_{y}\sum_{b}\min(\pi(b|y),\mu(b|y))Q(y,b), and write the Retrace operator RQ=Q+(IλγPπμ)1(TπQQ)\mathcal{R}Q=Q+(I-\lambda\gamma P^{\pi\wedge\mu})^{-1}(\mathcal{T}^{\pi}Q-Q). We focus on the control case, noting that a similar (and simpler) result can be derived for policy evaluation.

Consider a sequence of sample trajectories, with the kthk^{th} trajectory x0,a0,r0,x1,a1,r1,x_{0},a_{0},r_{0},x_{1},a_{1},r_{1},\dots generated by following μk\mu_{k}: atμk(xt)a_{t}\sim\mu_{k}(\cdot|x_{t}). For each (x,a)(x,a) along this trajectory, with ss being the time of first occurrence of (x,a)(x,a), update

The proof extends similar convergence proofs of TD(λ\lambda) by Bertsekas and Tsitsiklis, (1996) and of optimistic policy iteration by Tsitsiklis, (2003), and is provided in the appendix. Notice that compared to Theorem 2 we do not assume that Tπ0Q0Q00\mathcal{T}^{\pi_{0}}Q_{0}-Q_{0}\geq 0 here. However, we make the additional (rather technical) assumption that PπkP^{\pi_{k}} and PπkμkP^{\pi_{k}\wedge\mu_{k}} commute at the limit. This is satisfied for example when the probability assigned by the behavior policy μk(x)\mu_{k}(\cdot|x) to the greedy action πQk(x)\pi_{Q_{k}}(x) is independent of xx. Examples include ε\varepsilon-greedy policies, or more generally mixtures between the greedy policy πQk\pi_{Q_{k}} and an arbitrary distribution μ\mu (see Lemma 5 in the appendix for the proof):

Notice that the mixture coefficient ε\varepsilon needs not go to .

Discussion of the results

Theorems 1 and 2 ensure convergence to QπQ^{\pi} and QQ^{*} for any trace coefficient cs[0,π(asxs)μ(asxs)]c_{s}\in[0,\frac{\pi(a_{s}|x_{s})}{\mu(a_{s}|x_{s})}]. However, to make the best choice of csc_{s}, we need to consider the speed of convergence, which depends on both (1) the variance of the online estimate, which indicates how many online updates are required in a single iteration of R\mathcal{R}, and (2) the contraction coefficient of R\mathcal{R}.

Contraction speed: The contraction coefficient η[0,γ]\eta\in[0,\gamma] of R\mathcal{R} (see Remark 1) depends on how much the traces have been cut, and should be as small as possible (since it takes log(1/ε)/log(1/η)\log(1/\varepsilon)/\log(1/\eta) iterations of R\mathcal{R} to obtain an ε\varepsilon-approximation). It is smallest when the traces are not cut at all (i.e. if cs=1c_{s}=1 for all ss, R\mathcal{R} is the policy evaluation operator which produces QπQ^{\pi} in a single iteration). Indeed, when the traces are cut, we do not benefit from learning from full returns (in the extreme, c1=0c_{1}=0 and R\mathcal{R} reduces to the (one step) Bellman operator with η=γ\eta=\gamma).

A reasonable trade-off between low variance (when csc_{s} are small) and high contraction speed (when csc_{s} are large) is given by Retrace(λ\lambda), for which we provide the convergence of the online algorithm.

If we relax the assumption that the trace is Markovian (in which case only the result for policy evaluation has been proven so far) we could trade off a low trace at some time for a possibly larger-than-11 trace at another time, as long as their product is less than 11. A possible choice could be

2 Other topics of discussion

The crucial point of Theorem 2 is that convergence to QQ^{*} occurs for arbitrary behaviour policies. Thus the online result in Theorem 3 does not require the behaviour policies to become greedy in the limit with infinite exploration (i.e. GLIE assumption, Singh et al.,, 2000). We believe Theorem 3 provides the first convergence result to QQ^{*} for a λ\lambda-return (with λ>0\lambda>0) algorithm that does not require this (hard to satisfy) assumption.

Proof of Watkins’ Q(λ𝜆\lambda).

As a corollary of Theorem 3 when selecting our target policies πk\pi_{k} to be greedy w.r.t. QkQ_{k} (i.e. εk=0\varepsilon_{k}=0), we deduce that Watkins’ Q(λ\lambda) ( e.g., Watkins,, 1989; Sutton and Barto,, 1998) converges a.s. to QQ^{*} (under the assumption that μk\mu_{k} commutes asymptotically with the greedy policies, which is satisfied for e.g. μk\mu_{k} defined by (8)). We believe this is the first such proof.

Increasingly greedy policies

The assumption that the sequence of target policies (πk)(\pi_{k}) is increasingly greedy w.r.t. the sequence of (Qk)(Q_{k}) is more general that just considering greedy policies w.r.t. (Qk)(Q_{k}) (which is Watkins’s Q(λ\lambda)), and leads to more efficient algorithms. Indeed, using non-greedy target policies πk\pi_{k} may speed up convergence as the traces are not cut as frequently. Of course, in order to converge to QQ^{*}, we eventually need the target policies (and not the behaviour policies, as mentioned above) to become greedy in the limit (i.e. εk0\varepsilon_{k}\to 0 as defined in Theorem 2).

Unlike Retrace(λ\lambda), Qπ(λ)Q^{\pi}(\lambda) does not need to know the behaviour policy μ\mu. However, it fails to converge when μ\mu is far from π\pi. Retrace(λ\lambda) uses its knowledge of μ\mu (for the chosen actions) to cut the traces and safely handle arbitrary policies π\pi and μ\mu.

Comparison to TB(λ𝜆\lambda).

Similarly to Qπ(λ)Q^{\pi}(\lambda), TB(λ\lambda) does not need the knowledge of the behaviour policy μ\mu. But as a consequence, TB(λ\lambda) is not able to benefit from possible near on-policy situations, cutting traces unnecessarily when π\pi and μ\mu are close.

Estimating the behavior policy.

In the case μ\mu is unknown, it is reasonable to build an estimate μ^\widehat{\mu} from observed samples and use μ^\widehat{\mu} instead of μ\mu in the definition of the trace coefficients csc_{s}. This may actually even lead to a better estimate, as analyzed by LiAistats2015.

Continuous action space.

Let us mention that Theorems 1 and 2 extend to the case of (measurable) continuous or infinite action spaces. The trace coefficients will make use of the densities min(1,dπ/dμ)\min(1,d\pi/d\mu) instead of the probabilities min(1,π/μ)\min(1,\pi/\mu). This is not possible with TB(λ\lambda).

Open questions include:

(1) Removing the technical assumption that PπkP^{\pi_{k}} and PπkμkP^{\pi_{k}\wedge\mu_{k}} asymptotically commute, (2) Relaxing the Markov assumption in the control case in order to allow trace coefficients csc_{s} of the form (9).

Experimental Results

To validate our theoretical results, we employ Retrace(λ)(\lambda) in an experience replay (Lin,, 1993) setting, where sample transitions are stored within a large but bounded replay memory and subsequently replayed as if they were new experience. Naturally, older data in the memory is usually drawn from a policy which differs from the current policy, offering an excellent point of comparison for the algorithms presented in Section 2.

Our agent adapts the DQN architecture of Mnih et al., (2015) to replay short sequences from the memory (details in the appendix) instead of single transitions. The Q-function target update for a sample sequence xt,at,rt,,xt+kx_{t},a_{t},r_{t},\cdots,x_{t+k} is

We compare our algorithms’ performance on 60 different Atari 2600 games in the Arcade Learning Environment (Bellemare et al.,, 2013) using Bellemare et al.,’s inter-algorithm score distribution. Inter-algorithm scores are normalized so that 0 and 1 respectively correspond to the worst and best score for a particular game, within the set of algorithms under comparison. If g{1,,60}g\in\{1,\dots,60\} is a game and zg,az_{g,a} the inter-algorithm score on gg for algorithm aa, then the score distribution function is f(x):={g:zg,ax}/60f(x):=|\{g:z_{g,a}\geq x\}|/60. Roughly, a strictly higher curve corresponds to a better algorithm.

Across values of λ\lambda, λ=1\lambda=1 performs best, save for Q(λ)Q^{*}(\lambda) where λ=0.5\lambda=0.5 obtains slightly superior performance. However, is highly sensitive to the choice of λ\lambda (see Figure 1, left, and Table 2 in the appendix). Both Retrace(λ\lambda) and TB(λ)(\lambda) achieve dramatically higher performance than Q-Learning early on and maintain their advantage throughout. Compared to TB(λ\lambda), Retrace(λ\lambda) offers a narrower but still marked advantage, being the best performer on 30 games; TB(λ\lambda) claims 15 of the remainder. Per-game details are given in the appendix.

Retrace(λ\lambda) can be seen as an algorithm that automatically adjusts – efficiently and safely – the length of the return to the degree of ”off-policyness” of any available data.

Acknowledgments.

The authors thank Daan Wierstra, Nicolas Heess, Hado van Hasselt, Ziyu Wang, David Silver, Audrunas Grūslys, Georg Ostrovski, Hubert Soyer, and others at Google DeepMind for their very useful feedback on this work.

References

Appendix A Proof of Lemma 1

Let ΔQ:=QQπ\Delta Q:=Q-Q^{\pi}. We begin by rewriting (3):

Since QπQ^{\pi} is the fixed point of R\mathcal{R}, we have

Appendix B Increasingly greedy policies

Recall the definition of an increasingly greedy sequence of policies.

We say that a sequence of policies (πk)(\pi_{k}) is increasingly greedy w.r.t. a sequence of functions (Qk)(Q_{k}) if the following property holds for all kk:

It is obvious to see that this property holds if all policies πk\pi_{k} are greedy w.r.t. QkQ_{k}. Indeed in such case, Tπk+1Qk+1=TQk+1TπQk+1\mathcal{T}^{\pi_{k+1}}Q_{k+1}=\mathcal{T}Q_{k+1}\geq\mathcal{T}^{\pi}Q_{k+1} for any π\pi.

We now prove that this property holds for εk\varepsilon_{k}-greedy policies (with non-increasing (εk)(\varepsilon_{k})) as well as soft-max policies (with non-decreasing (βk)(\beta_{k})), as stated in the two lemmas below.

Of course not all policies satisfy this property (a counter-example being πk(ax):=argminaQk(x,a)\pi_{k}(a|x):=\arg\min_{a^{\prime}}Q_{k}(x,a^{\prime})).

Let (εk)(\varepsilon_{k}) be a non-increasing sequence. Then the sequence of policies (πk)(\pi_{k}) which are εk\varepsilon_{k}-greedy w.r.t. the sequence of functions (Qk)(Q_{k}) is increasingly greedy w.r.t. that sequence.

From the definition of an ε\varepsilon-greedy policy we have:

where we used the fact that εk+1εk\varepsilon_{k+1}\leq\varepsilon_{k}. ∎

Let (βk)(\beta_{k}) be a non-decreasing sequence of soft-max parameters. Then the sequence of policies (πk)(\pi_{k}) which are soft-max (with parameter βk\beta_{k}) w.r.t. the sequence of functions (Qk)(Q_{k}) is increasingly greedy w.r.t. that sequence.

For any QQ and yy, define πβ(b)=eβQ(y,b)beβQ(y,b)\pi_{\beta}(b)=\frac{e^{\beta Q(y,b)}}{\sum_{b^{\prime}}e^{\beta Q(y,b^{\prime})}} and f(β)=bπβ(b)Q(y,b).f(\beta)=\sum_{b}\pi_{\beta}(b)Q(y,b). Then we have

Thus βf(β)\beta\mapsto f(\beta) is a non-decreasing function, and since βk+1βk\beta_{k+1}\geq\beta_{k}, we have

Appendix C Proof of Theorem 2

As mentioned in the main text, since csc_{s} is Markovian, we can define the (sub)-probability transition operator

The Retrace(λ)(\lambda) operator then writes

We now lower- and upper-bound the term Qk+1QQ_{k+1}-Q^{*}.

where A_{k}:=\gamma(I-\gamma P^{c\mu_{k}})^{-1}\big{[}P^{\pi_{k}}-P^{c\mu_{k}}\big{]}.

Now let us prove that AkA_{k} has non-negative elements, whose sum over each row is at most γ\gamma. Let ee be the vector with 1-components. By rewriting AkA_{k} as γt0γt(Pcμk)t(PπkPcμk)\gamma\sum_{t\geq 0}\gamma^{t}(P^{c\mu_{k}})^{t}(P^{\pi_{k}}-P^{c\mu_{k}}) and noticing that

it is clear that all elements of AkA_{k} are non-negative. We have

(since t0γt(Pcμk)tee\sum_{t\geq 0}\gamma^{t}(P^{c\mu_{k}})^{t}e\geq e). Thus AkA_{k} has non-negative elements, whose sum over each row, is at most γ\gamma. We deduce from (10) that Qk+1QQ_{k+1}-Q^{*} is upper-bounded by a sub-convex combination of components of QkQQ_{k}-Q^{*}; the sum of their coefficients is at most γ\gamma. Thus

Now, from the definition of εk\varepsilon_{k} we have TπkQkTQkεkQkTπQkεkQk,\mathcal{T}^{\pi_{k}}Q_{k}\geq\mathcal{T}Q_{k}-\varepsilon_{k}\|Q_{k}\|\geq\mathcal{T}^{\pi^{*}}Q_{k}-\varepsilon_{k}\|Q_{k}\|, thus

By hypothesis, (πk)(\pi_{k}) is increasingly greedy w.r.t. (Qk)(Q_{k}), thus

where Bk:=γ[PπkPcμk](IγPcμk)1B_{k}:=\gamma[P^{\pi_{k}}-P^{c\mu_{k}}](I-\gamma P^{c\mu_{k}})^{-1}. Since PπkPcμkP^{\pi_{k}}-P^{c\mu_{k}} has non-negative elements (as proven in (11)) as well as (IγPcμk)1(I-\gamma P^{c\mu_{k}})^{-1}, then BkB_{k} has non-negative elements as well. Thus

since we assumed Tπ0Q0Q00T^{\pi_{0}}Q_{0}-Q_{0}\geq 0. Thus (15) implies that

and combining the above with (13) we deduce

Now assume that εk0\varepsilon_{k}\rightarrow 0. We first deduce that QkQ_{k} is bounded. Indeed as soon as εk<(1γ)/2\varepsilon_{k}<(1-\gamma)/2, we have

Thus lim supQk1+γ1(1+γ)/2Q\limsup\|Q_{k}\|\leq\frac{1+\gamma}{1-(1+\gamma)/2}\|Q^{*}\|. Since QkQ_{k} is bounded, we deduce that lim supQk=Q\limsup Q_{k}=Q^{*}. ∎

Appendix D Proof of Theorem 3

We first prove convergence of the general online algorithm.

and assume that (1) ωk\omega_{k} is a centered, Fk{\cal F}_{k}-measurable noise term of bounded variance, and (2) υk\upsilon_{k} is bounded from above by θk(Qk+1)\theta_{k}(\|Q_{k}\|+1), where (θk)(\theta_{k}) is a random sequence that converges to 0 a.s. Then, under the same assumptions as in Theorem 3, we have that QkQQ_{k}\rightarrow Q^{*} almost surely.

We write R\mathcal{R} for Rk\mathcal{R}_{k}. Let us prove the result in three steps.

Upper bound on RQkQ\mathcal{R}Q_{k}-Q^{*}. The first part of the proof is similar to the proof of (13), so we have

Lower bound on RQkQ\mathcal{R}Q_{k}-Q^{*}. Again, similarly to (15) we have

Lower-bound on TπkQkQk\mathcal{T}^{\pi_{k}}Q_{k}-Q_{k}. Since the sequence of policies (πk)(\pi_{k}) is increasingly greedy w.r.t. (Qk)(Q_{k}), we have

where ωk:=(γPπkI)ωk\omega^{\prime}_{k}:=(\gamma P^{\pi_{k}}-I)\omega_{k} and υk:=(γPπkI)υk\upsilon^{\prime}_{k}:=(\gamma P^{\pi_{k}}-I)\upsilon_{k}. It is easy to see that both ωk\omega^{\prime}_{k} and υk\upsilon^{\prime}_{k} continue to satisfy the assumptions on ωk\omega_{k}, and υk\upsilon_{k}. Now, from the definition of the R\mathcal{R} operator, we have

Using this equality into (20) and writing ξk:=TπkQkQk\xi_{k}:=\mathcal{T}^{\pi_{k}}Q_{k}-Q_{k}, we have

where Bk:=γ(PπkλPπkμk)(IγλPπkμk)1B_{k}:=\gamma(P^{\pi_{k}}-\lambda P^{\pi_{k}\wedge\mu_{k}})(I-\gamma\lambda P^{\pi_{k}\wedge\mu_{k}})^{-1}. The matrix BkB_{k} is non-negative but may not be a contraction mapping (the sum of its components per row may be larger than 11). Thus we cannot directly apply Proposition 4.5 of Bertsekas and Tsitsiklis, (1996). However, as we have seen in the proof of Theorem 2, the matrix Ak:=γ(IγλPπkμk)1(PπkλPπkμk)A_{k}:=\gamma(I-\gamma\lambda P^{\pi_{k}\wedge\mu_{k}})^{-1}(P^{\pi_{k}}-\lambda P^{\pi_{k}\wedge\mu_{k}}) is a γ\gamma-contraction mapping. So now we relate BkB_{k} to AkA_{k} using our assumption that PπkP^{\pi_{k}} and PπkμkP^{\pi_{k}\wedge\mu_{k}} commute asymptotically, i.e. PπkPπkμkPπkμkPπk=ηk\|P^{\pi_{k}}P^{\pi_{k}\wedge\mu_{k}}-P^{\pi_{k}\wedge\mu_{k}}P^{\pi_{k}}\|=\eta_{k} with ηk0\eta_{k}\rightarrow 0. For any (sub)-transition matrices UU and VV, we have

Replacing UU by PπkP^{\pi_{k}} and VV by PπkμkP^{\pi_{k}\wedge\mu_{k}}, we deduce

where υk:=υk+γt0t(λγ)tηkξk\upsilon^{\prime\prime}_{k}:=\upsilon^{\prime}_{k}+\gamma\sum_{t\geq 0}t(\lambda\gamma)^{t}\eta_{k}\|\xi_{k}\| continues to satisfy the assumptions on υk\upsilon_{k} (since ηk0\eta_{k}\rightarrow 0).

Now, let us define another sequence ξk\xi^{\prime}_{k} as follows: ξ0=ξ0\xi^{\prime}_{0}=\xi_{0} and

We can now apply Proposition 4.5 of Bertsekas and Tsitsiklis, (1996) to the sequence (ξk)(\xi^{\prime}_{k}). The matrices AkA_{k} are non-negative, and the sum of their coefficients per row is bounded by γ\gamma, see (12), thus AkA_{k} are γ\gamma-contraction mappings and have the same fixed point which is . The noise ωk\omega^{\prime}_{k} is centered and Fk\mathcal{F}_{k}-measurable and satisfies the bounded variance assumption, and υk\upsilon^{\prime\prime}_{k} is bounded above by (1+γ)θk(Qk+1)(1+\gamma)\theta^{\prime}_{k}(\|Q_{k}\|+1) for some θk0\theta^{\prime}_{k}\rightarrow 0. Thus limkξk=0\lim_{k}\xi^{\prime}_{k}=0 almost surely.

Now, it is straightforward to see that ξkξk\xi_{k}\geq\xi^{\prime}_{k} for all k0k\geq 0. Indeed by induction, let us assume that ξkξk\xi_{k}\geq\xi^{\prime}_{k}. Then

since all elements of the matrix AkA_{k} are non-negative. Thus we deduce that

Conclusion. Using (23) in (19) we deduce the lower bound:

almost surely. Now combining with the upper bound (18) we deduce that

The last two terms can be incorporated to the υk(x,a)\upsilon_{k}(x,a) and ωk(x,a)\omega_{k}(x,a) terms, respectively; we thus again apply Proposition 4.5 of Bertsekas and Tsitsiklis, (1996) to the sequence (Qk)(Q_{k}) defined by (17) and deduce that QkQQ_{k}\rightarrow Q^{*} almost surely. ∎

It remains to rewrite the update (7) in the form of (17), in order to apply Theorem 4.

Let zs,tkz^{k}_{s,t} denote the accumulating trace (Sutton and Barto,, 1998):

Let us write Qk+1o(xs,as)Q^{o}_{k+1}(x_{s},a_{s}) to emphasize the online setting. Then (7) can be written as

Using our assumptions on finite trajectories, and ci1c_{i}\leq 1, we can show that:

Finally, using the above, and writing αk=αk(xs,as)\alpha_{k}=\alpha_{k}(x_{s},a_{s}), (25) can be rewritten in the desired form:

We can thus apply Theorem 4 to (27), and conclude that the iterates QkoQQ^{o}_{k}\rightarrow Q^{*} as kk\rightarrow\infty, w.p. 1.

Let (πk)(\pi_{k}) and (μk)(\mu_{k}) two sequences of policies. If there exists α\alpha such that for all x,ax,a,

then the transition matrices PπkP^{\pi_{k}} and PπkμkP^{\pi_{k}\wedge\mu_{k}} asymptotically commute: PπkPπkμkPπkμkPπk=o(1)\|P^{\pi_{k}}P^{\pi_{k}\wedge\mu_{k}}-P^{\pi_{k}\wedge\mu_{k}}P^{\pi_{k}}\|=o(1).

Let (πQk)(\pi_{Q_{k}}) a sequence of (deterministic) greedy policies w.r.t. a sequence (Qk)(Q_{k}). Let (πk)(\pi_{k}) a sequence of policies that are εk\varepsilon_{k} away from (πQk)(\pi_{Q_{k}}), in the sense that, for all xx,

Let (μk)(\mu_{k}) a sequence of policies defined by:

for some arbitrary policy μ\mu and α\alpha\in. Assume εk0\varepsilon_{k}\rightarrow 0. Then the transition matrices PπkP^{\pi_{k}} and PπkμkP^{\pi_{k}\wedge\mu_{k}} asymptotically commute.

The intuition is that asymptotically πk\pi_{k} gets very close to the deterministic policy πQk\pi_{Q_{k}}. In that case, the minimum distribution (πkμk)(x)(\pi_{k}\wedge\mu_{k})(\cdot|x) puts a mass close to 1α1-\alpha on the greedy action πQk(x)\pi_{Q_{k}}(x), and no mass on other actions, thus (πkμk)(\pi_{k}\wedge\mu_{k}) gets very close to (1α)πk(1-\alpha)\pi_{k}, and Lemma 4 applies (with multiplicative constant 1α1-\alpha).

Indeed, from our assumption that πk\pi_{k} is ε\varepsilon-away from πQk\pi_{Q_{k}} we have:

Thus Lemma 4 applies (with a multiplicative constant 1α1-\alpha) and PπkP^{\pi_{k}} and PπkμkP^{\pi_{k}\wedge\mu_{k}} asymptotically commute. ∎

Appendix F Experimental Methods

Although our experiments’ learning problem closely matches the DQN setting used by Mnih et al., (2015) (i.e. single-thread off-policy learning with large replay memory), we conducted our trials in the multi-threaded, CPU-based framework of Mnih et al., (2016), obtaining ample result data from affordable CPU resources. Key differences from the DQN are as follows. Sixteen threads with private environment instances train simultaneously; each infers with and finds gradients w.r.t. a local copy of the network parameters; gradients then update a “master” parameter set and local copies are refreshed. Target network parameters are simply shared globally. Each thread has private replay memory holding 62,500 transitions (1/16th of DQN’s total replay capacity). The optimizer is unchanged from (Mnih et al.,, 2016): “Shared RMSprop” with step size annealing to 0 over 3×1083\times 10^{8} environment frames (summed over threads). Exploration parameter (ε\varepsilon) behaviour differs slightly: every 50,000 frames, threads switch randomly (probability 0.3, 0.4, and 0.3 respectively) between three schedules (anneal ε\varepsilon from 1 to 0.5, 0.1, or 0.01 over 250,000 frames), starting new schedules at the intermediate positions where they left old ones.We evaluated a DQN-style single schedule for ε\varepsilon, but our multi-schedule method, similar to the one used by Mnih et al.,, yielded improved performance in our multi-threaded setting.

Our experiments comprise 60 Atari 2600 games in ALE (Bellemare et al.,, 2013), with “life” loss treated as episode termination. The control, minibatched (64 transitions/minibatch) one-step Q-learning as in (Mnih et al.,, 2015), shows performance comparable to DQN in our multi-threaded setup. Retrace, TB, and Q* runs use minibatches of four 16-step sequences (again 64 transitions/minibatch) and the current exploration policy as the target policy π\pi. All trials clamp rewards into .Inthecontrol,Qfunctiontargetsareclampedinto. In the control, Q-function targets are clamped into prior to gradient calculation; analogous quantities in the multi-step algorithms are clamped into $$, then scaled (divided by) the sequence length. Coarse, then fine logarithmic parameter sweeps on the games Asterix, Breakout, Enduro, Freeway, H.E.R.O, Pong, Q*bert, and Seaquest yielded step sizes of 0.0000439 and 0.0000912, and RMSprop regularization parameters of 0.001 and 0.0000368, for control and multi-step algorithms respectively. Reported performance averages over four trials with different random seeds for each experimental configuration.

We compared our algorithms for different values of λ\lambda, using the DQN score as a baseline. As before, for each λ\lambda we compute the inter-algorithm scores on a per-game basis. We then averaged the inter-algorithm scores across games to produce Table 2 (see also Figure 2 for a visual depiction). We first remark that Retrace always achieve a score higher than TB, demonstrating that it is efficient in the sense of Section 2. Next, we note that QQ^{*} performs best for small values of λ\lambda, but begins to fail for values above λ=0.5\lambda=0.5. In this sense, it is also not safe. This is particularly problematic as the safe threshold of λ\lambda is likely to be problem-dependent. Finally, there is no setting of λ\lambda for which Retrace performs particularly poorly; for high values of λ\lambda, it achieves close to the top score in most games. For Retrace(λ\lambda) it makes sense to use a values λ=1\lambda=1 (at least in deterministic environments) as the trace cutting effect required in off-policy learning is taken care of by the use of the min(1,π/μ)\min(1,\pi/\mu) coefficient. On the contrary, Q(λ)Q^{*}(\lambda) only relies on a value λ<1\lambda<1 to take care of cutting traces for off-policy data.