On Convergence of Emphatic Temporal-Difference Learning

Huizhen Yu

Introduction

We consider discounted Markov decision processes (MDPs) with finite state and action spaces, and the problem of learning an approximate value function for a given policy from off-policy data, that is, from data generated by a different policy. The policy being evaluated is called the target policy, and the policy that generates the data is called the behavior policy. For example, one may want to learn value functions for many target policies in parallel from a single exploratory behavior policy; this requires off-policy learning.

We focus on temporal-difference (TD) methods with linear function approximation