Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik

Introduction

Contextual bandits refer to a learning setting where the learner repeatedly observes a context, takes an action and observes a reward for the chosen action in the observed context, but no feedback on any other action. An example is movie recommendation, where the context describes a user, actions are candidate movies and the reward measures if the user enjoys the recommended movie. The learner produces a policy, meaning a mapping from contexts to actions. A common question in such settings is, given a target policy, what is its expected reward? By letting the policy choose actions (e.g., recommend movies to users), we can compute its reward. Such online evaluation is typically costly since it exposes users to an untested experimental policy, and does not scale to evaluating many different target policies.

Off-policy evaluation is an alternative paradigm for the same question. Given logs from the existing system, which might be choosing actions according to a very different logging policy than the one we seek to evaluate, can we estimate the expected reward of the target policy? There are three classes of approaches to address this question: the direct method (DM), also known as regression adjustment, inverse propensity scoring (IPS) (Horvitz & Thompson, 1952) and doubly robust (DR) estimators (Robins & Rotnitzky, 1995; Bang & Robins, 2005; Dudík et al., 2011, 2014).

Our first goal in this paper is to study the optimality of these three classes of approaches (or lack thereof), and more fundamentally, to quantify the statistical hardness of off-policy evaluation. This problem was previously studied for multi-armed bandits (Li et al., 2015) and is related to a large body of work on asymptotically optimal estimators of average treatment effects (ATE) (Hahn, 1998; Hirano et al., 2003; Imbens et al., 2007; Rothe, 2016), which can be viewed as a special case of off-policy evaluation. In both settings, a major underlying assumption is that rewards can be consistently estimated from the features (i.e., covariates) describing contexts and actions, either via a parametric model or non-parametrically. Under such consistency assumptions, it has been shown that DM and/or DR are optimal (Imbens et al., 2007; Li et al., 2015; Rothe, 2016),The precise assumptions vary for each estimator, and are somewhat weaker for DR than for DM. whereas standard IPS is not (Hahn, 1998; Li et al., 2015), but it becomes (asymptotically) optimal when the true propensity scores are replaced by suitable estimates (Hirano et al., 2003).

Unfortunately, consistency of a reward model can be difficult to achieve in practice. Parametric models tend to suffer from a large bias (see, e.g., the empirical evaluation of Dudík et al., 2011) and non-parametric models are limited to small dimensions, otherwise non-asymptotic terms become too large (see, e.g., the analysis of non-parametric regression by Bertin et al., 2004). Therefore, here we ask: What can be said about hardness of policy evaluation in the absence of reward-model consistency?

In this pursuit, we provide the first rate-optimal lower bound on the mean-squared error (MSE) for off-policy evaluation in contextual bandits without consistency assumptions. Our lower bound matches the upper bounds of IPS and DR up to constants, when given a non-degenerate context distributions. This result is in contrast with the suboptimality of IPS under previously studied consistency assumptions, which implies that the two settings are qualitatively different.

Whereas IPS and DR are both minimax optimal, our experiments (similar to prior work) show that IPS is readily outperformed by DR, even when using a simple parametric regression model that is not asymptotically consistent. We attribute this to a lower variance of the DR estimator. We also empirically observe that while DR is generally highly competitive, it is sometimes substantially outperformed by DM. We therefore ask whether it is possible to achieve an even better bias-variance tradeoff than DR. We answer affirmatively and propose a new class of estimators, called the switch estimators, that adaptively interpolate between DM and DR (or IPS). We show that switch has MSE no worse than DR (or IPS) in the worst case, but is robust to large importance weights and can achieve a substantially smaller variance than DR or IPS.

We empirically evaluate the switch estimators against a number of strong baselines from prior work, using a previously used experimental setup to simulate contextual bandit problems on real-world multiclass classification data. The results affirm the superior bias-variance tradeoff of switch estimators, with substantial improvements across a number of problems.

In summary, the first part of our paper initiates the study of optimal estimators in a finite-sample setting and without making strong modeling assumptions, while the second part shows how to practically exploit domain knowledge by building better estimators.

Setup

In order to correct for the mismatch in the action distributions under $\mu$ and $\pi$ , it is typical to use importance weights, defined as $\rho(x,a)\,{\coloneqq}\,\pi(a\mathbin{|}x)/\mu(a\mathbin{|}x)$ . For consistent estimation, it is standard to assume that $\rho(x,a)\neq\infty$ , corresponding to absolute continuity of $\pi$ with respect to $\mu$ , meaning that whenever $\pi(a\mathbin{|}x)>0$ , then also $\mu(a\mathbin{|}x)>0$ . We make this assumption throughout the paper. In the remainder of the setup we present three common estimators of $v^{\pi}$ .

The first is the inverse propensity scoring (IPS) estimator (Horvitz & Thompson, 1952), defined as

where the DM stands for direct method (Dudík et al., 2011), also known as regression adjustment or imputation (Rothe, 2016). IPS can have a large variance when the target and logging policies differ substantially, and parametric variants of DM can be inconsistent, leading to a large bias. Therefore, both in theory and practice, it is beneficial to combine the approaches into a doubly robust estimator (Cassel et al., 1976; Robins & Rotnitzky, 1995; Dudík et al., 2011), such as the following variant,

Note that IPS is a special case of DR with $\hat{r}\equiv 0$ . In the sequel, we mostly focus on IPS and DR, and then suggest how to improve them by further interpolating with DM.

Limits of Off-policy Evaluation

In this section, we study the off-policy evaluation problem in a minimax setup. After setting up the framework, we present our lower bound and the matching upper bounds for IPS and DR under appropriate conditions.

While minimax optimality is standard in statistical estimations, it is not the only notion of optimality. An alternative framework is that of asymptotic optimality, which establishes Cramer-Rao style bounds on the asymptotic variance of estimators. We use the minimax framework, because it is the most amenable to finite-sample lower bounds, and is complementary to previous asymptotic results, as we discuss after presenting our main results.

Recall that the expectation is taken over the $n$ samples drawn from $\mu$ , along with any randomness in the estimator. The main goal of this section is to obtain a lower bound on the minimax risk. To state our bound, recall that $\rho(x,a)=\pi(a\mathbin{|}x)/\mu(a\mathbin{|}x)<\infty$ is an importance weight at $(x,a)$ . We make the following technical assumption on our problem instances, described by tuples of the form $(\pi,\lambda,\mu,\sigma,R_{\max})$ :

2 Minimax Lower Bound for Off-policy Evaluation

The bound holds for every $\gamma\in$ , and we can take the maximum over $\gamma$ . In particular, we get the following simple corollary under continuous context distributions.

Under conditions of Theorem 1, assume further that $\lambda$ has a density relative to Lebesgue measure. Then

If $\lambda$ is a mixture of a density and point masses, then $\gamma=0$ will exclude the point masses from the second term of the lower bound. In general, choosing $\gamma=\mathcal{O}\bigl{(}1/(n\log n)\bigr{)}$ excludes the contexts likely to appear multiple times, and ensures that the second term in Theorem 1 remains non-trivial (when $\mu(x,a)\leq\gamma$ with positive probability).

Before sketching the proof of Theorem 1, we discuss its preconditions and implications.

Preconditions of the theorem: The theorem assumes the existence of a (problem-dependent) constant $C_{\gamma}$ which depends on the constant $\gamma$ and various moments of the importance-weighted rewards. When $R_{\max}$ and $\sigma$ are bounded (a common situation), $C_{\gamma}$ measures how heavy-tailed the importance weights are. Note that $C_{\gamma}<\infty$ for all $\gamma\in$ whenever Assumption 1 holds, and so the condition on $n$ in Theorem 1 is eventually satisfied as long as the random variable $\sigma/R_{\max}$ has a bounded second moment. This is quite reasonable since in typical applications the a priori bound on expected rewards is on the same order or larger than the a priori bound on the reward noise. For the remainder of the discussion, we assume that $n$ is appropriately large so the preconditions of the theorem hold.

On a closer inspection, the first term of our bound in Theorem 1 coincides with the lower bound of Li et al. (2015) (up to constants). The second term (optimized over $\gamma$ ) is non-zero only if there are contexts with small probabilities relative to the number of samples. In multi-armed bandits, we recover the bound of Li et al. (2015). When the context distribution is continuous, or the probability of seeing repeated contexts in a data set of size $n$ is small, we get the minimax optimality of IPS.

One of our key contributions is to highlight this agnostic contextual regime where IPS is optimal. In the non-contextual regime, where each context appears frequently, the rewards for each context-action pair can be consistently estimated by empirical averages. Similarly, the asymptotic results discussed earlier focus on a setting where rewards can be consistently estimated thanks to parametric assumptions or smoothness (for non-parametric estimation), with the goal of asymptotic efficiency. Our work complements that line of research. In many practical situations, we wish to evaluate policies on high-dimensional context spaces, where the consistent estimation of rewards is not a feasible option. In other words, the agnostic contextual regime dominates.

The distinction between the contextual and non-contextual regime is also present in our proof, which combines a non-contextual lower bound due to the reward noise, similar to the analysis of Li et al. (2015), and an additional bound arising for non-degenerate context distributions. This latter result is a key technical novelty of our paper.

We only sketch some of the main ideas here and defer the full proof to Appendix A. For simplicity, we discuss the case where $\lambda$ is a continuous distribution. We consider two separate problem instances corresponding to the two terms in Theorem 1. The first part is relatively straightforward and reduces the problem to Gaussian mean estimation. We focus on the second part which depends on $R_{\max}$ . Our construction defines a prior over the reward distributions, $D(r\mathbin{|}x,a)$ . Given any $(x,a)$ , a problem instance is given by

Incorporating Reward Models

As discussed in the previous section, it is generally possible to beat our minimax bound when consistent reward models exist. We also argued that even in the absence of a consistent model, when DR and IPS both achieve optimal risk rates, the performance of DR on finite samples will be better than IPS as long as the reward model is even moderately good (see Eq. 6). However, under a large reward noise $\sigma$ , DR may still suffer from high variance when the importance weights are large, even when given a perfect reward model. In this section, we derive a class of estimators that leverage reward models to directly address this source of high variance, in a manner very different from the standard DR approach.

Our starting point is the observation that insistence on maintaining unbiasedness puts the DR estimator at one extreme end of the bias-variance tradeoff. Prior works have considered ideas such as truncating the rewards or importance weights when the importance weights are large (see, e.g., Bottou et al. 2013), which can dramatically reduce the variance at the cost of a little bias. We take the intuition a step further and propose to estimate the rewards for actions by two distinct strategies, based on whether they have a large or a small importance weight in a given context. When importance weights are small, we continue to use our favorite unbiased estimators, but switch to directly applying the (potentially biased) reward model on actions with large importance weights. Here, “small” and “large” are defined via a threshold parameter $\tau$ . Varying this parameter between and $\infty$ leads to a family of estimators which we call the switch estimators as they switch between an agnostic approach (such as DR or IPS) and the direct method.

We now formalize this intuition, and begin by decomposing $v^{\pi}$ according to importance weights:

Conceptually, we split our problem into two. The first problem always has small importance weights, so we can use unbiased estimators such as IPS or DR. The second problem, where importance weights are large, is addressed by DM. Writing this out leads to the following estimator:

Note that the above estimator specifically uses IPS on the first part of the problem. When DR is used instead of IPS, we refer to the resulting estimator as switch-DR. The reward model used within the DR part of the switch-DR estimator can be the same or different from the reward model used to impute rewards in the second part. We next present a bound on the MSE of the switch estimator using IPS. A similar bound holds for switch-DR.

The proposed estimator interpolates between DM and IPS. For $\tau=0$ , switch coincides with DM, while $\tau\rightarrow\infty$ yields IPS. Consequently, switch estimator is minimax optimal when $\tau$ is appropriately chosen. However, unlike IPS and DR, the switch and switch-DR estimators are by design more robust to large (or heavy-tailed) importance weights. Several estimators related to switch have been previously studied:

Bottou et al. (2013) consider a special case of switch with $\hat{r}\equiv 0$ , meaning that all the actions with large importance weights are eliminated from IPS. We refer to this method as Trimmed IPS.

Thomas & Brunskill (2016) study an estimator similar to switch in the more general setting of reinforcement learning. Their MAGIC estimator can be seen as using several candidate thresholds $\tau$ and then evaluating the policy by a weighted sum of the estimators corresponding to each $\tau$ . Similar to our approach of automatically determining $\tau$ , they determine the weighting of estimators via optimization (as we discuss below).

2 Automatic Parameter Tuning

So far we have discussed the properties of the switch estimators assuming that the parameter $\tau$ is chosen well. Our goal is to obtain the best of IPS and DM, but a poor choice of $\tau$ might easily give us the worst of the two estimators. Therefore, a method for selecting $\tau$ plays an essential role. A natural criterion would be to pick $\tau$ that minimizes the MSE of the resulting estimator. Since we do not know the precise MSE (as $v^{\pi}$ is unknown), an alternative is to minimize its data-dependent estimate. Recalling that the MSE can be written as the sum of variance and squared bias, we estimate and bound the terms individually.

Recall that we are working with a data set $(x_{i},a_{i},r_{i})$ and $\rho_{i}\coloneqq\pi(a_{i}\mathbin{|}x_{i})/\mu(a_{i}\mathbin{|}x_{i})$ . Using this data, it is straightforward to estimate the variance of the switch estimator. Let $Y_{i}(\tau)$ denote the estimated value that $\pi$ obtains on the data point $x_{i}$ according to the switch estimator with the threshold $\tau$ , that is

where the approximation above is clearly consistent since the random variables $Y_{i}$ are appropriately bounded as long as the rewards are bounded, because the importance weights are capped at the threshold $\tau$ .

Next we turn to the bias term. For understanding bias, we look at the MSE bound in Theorem 2, and observe that the last term in that theorem is precisely the squared bias. Rather than using a direct bias estimate, which would require knowledge of the error in $\hat{r}$ , we will upper bound this term. We assume that the function $R_{\max}(x,a)$ is known. This is not limiting since in most practical applications an a priori bound on the rewards is known. Then we can upper bound the squared bias as

Replacing the expectation with an average, we obtain

With these estimates, we pick the threshold $\widehat{\tau}$ by optimizing the sum of estimated variance and the upper bound on bias,

Our upper bound on the bias is rather conservative, as it upper bounds the error of DM at the largest possible value for every data point. This has the effect of favoring the use of the unbiased part in switch whenever possible, unless the variance would overwhelm even an arbitrarily biased DM. This conservative choice, however, immediately implies the minimax optimality of the switch estimator using $\widehat{\tau}$ , because the incurred bias is no more than our upper bound, and it is incurred only when the minimax optimal IPS estimator would be suffering an even larger variance.

Our automatic tuning is related to the MAGIC estimator of Thomas & Brunskill (2016). The key differences are that we pick only one threshold $\tau$ , while they combine the estimates with many different $\tau$ s using a weighting function. They pick this weighting function by optimizing a bias-variance tradeoff, but with significantly different bias and variance estimators. In our experiments, the automatic tuning using Eq. (9) generally works better than MAGIC.

Experiments

We next empirically evaluate the proposed switch estimators on the 10 UCI data sets previously used for off-policy evaluation (Dudík et al., 2011). We convert the multi-class classification problem to contextual bandits by treating the labels as actions for a policy $\mu$ , and recording the reward of $1$ if the correct label is chosen, and otherwise.

In addition to this deterministic reward model, we also consider a noisy reward model for each data set, which reveals the correct reward with probability $0.5$ and outputs a random coin toss otherwise. Theoretically, this should lead to bigger $\sigma^{2}$ and larger variance in all estimators. In both reward models, $R_{\max}\equiv 1$ is a valid bound.

The target policy $\pi$ is the deterministic decision of a logistic regression classifier learned on the multi-class data, while the logging policy $\mu$ samples according to the probability estimates of a logistic model learned on a covariate-shifted version of the data. The covariate shift is obtained as in prior work (Dudík et al., 2011; Gretton et al., 2009).

We compare switch and switch-DR against the following baselines: 1. IPS; 2. DM trained via logistic regression; 3. DR; 4. Truncated and Reweighted IPS (TrunIPS); and 5. Trimmed IPS (TrimIPS).

In DM, we train $\hat{r}$ and then evaluate the policy on the same contextual bandit data set. Following Dudík et al. (2011), DR is constructed by randomly splitting the contextual bandit data into two folds, estimating $\hat{r}$ on one fold, and then evaluating $\pi$ on the other fold and vice versa, obtaining two estimates. The final estimate is the average of the two. TrunIPS is a variant of IPS, where importance weights are capped at a threshold $\tau$ and then renormalized to sum to one (see, e.g., Bembom & van der Laan, 2008). TrimIPS is a special case of switch due to Bottou et al. (2013) described earlier, where $\hat{r}\equiv 0$ .

For switch and switch-DR as well as TrunIPS and TrimIPS we select the parameter $\tau$ by our automatic tuning from Section 4.2. To evaluate our tuning approach, we also include the results for the $\tau$ tuned optimally in hindsight, which we refer to as the oracle setting, and also show results obtained by the multi-threshold MAGIC approach. In all these approaches we optimize among 21 possible thresholds, from an exponential grid between the smallest and the largest importance weight observed in the data, considering all actions in each observed context.

The results are summarized in Figure 2, plotting the number of data sets where each method achieves at least a given relative MSE.For clarity, we have excluded switch, which significantly outperforms IPS, but is dominated by switch-DR. Similarly, we only report the better of oracle-TrimIPS and oracle-TrunIPS. Thus, methods that achieve smaller MSE across more data sets are towards the top-left corner of the plot, and a larger area under the curve indicates better performance. Some of the differences in MSE are several orders of magnitude large since the relative MSE is shown on the logaritmic scale. As we see, switch-DR dominates all baselines and our empirical tuning of $\tau$ is not too far from the best possible. The automatic tuning by MAGIC tends to revert to DM, because its bias estimate is too optimistic and so DM is preferred whenever IPS or DR have some significant variance. The gains of switch-DR are even greater in the noisy-reward setting, where we add label noise to UCI data.

In Figure 2, we illustrate the convergence of MSE as $n$ increases. We select two data sets and show how switch-DR performs against baselines in two typical cases: (i) when the direct method works well initially but is outperformed by IPS and DR as $n$ gets large, and (ii) when the direct method works poorly. In the first case, switch-DR outperforms both DM and IPS, while DR improves over IPS only moderately. In the second case, switch-DR performs about as well as IPS and DR despite a poor performance of DM. In all cases, switch-DR is robust to additional noise in the reward, while IPS and DR suffer from higher variance. Results for the remaining data sets are in Appendix D.

Conclusion

In this paper we have carried out minimax analysis of off-policy evaluation in contextual bandits and showed that IPS and DR are minimax optimal in the worst-case, when no consistent reward model is available. This result complements existing asymptotic theory with assumptions on reward models, and highlights the differences between agnostic and consistent settings. Practically, the result further motivates the importance of using side information, possibly by modeling rewards directly, especially when importance weights are too large. Given this observation, we propose a new class of estimators called switch that can be used to combine any importance weighting estimators, including IPS and DR, with DM. The estimators adaptively switch between DM when the importance weights are large and either IPS or DR when the importance weights are small. We show that the new estimators have favorable theoretical properties and also work well on real-world data. Many interesting directions remain open for future work, including high-probability upper bounds on the finite-sample MSE of switch estimators, as well as sharper finite-sample lower bounds under realistic assumptions on the reward model.

Acknowledgments

The work was partially completed during YW’s internship at Microsoft Research NYC from May 2016 to Aug 2016. The authors would like to thank Lihong Li and John Langford for helpful discussions, Edward Kennedy for bringing our attention to related problems and recent developments in causal inference, and an anonymous reviewer for pointing out relevant econometric references and providing valuable feedback that helped connect our work with research on average treatment effects.

References

Appendix A Proof of Theorem 1

In this appendix we prove the minimax bound of Theorem 1. The result is obtained by combining the following two lower bounds:

The reason for introducing $\gamma^{\prime}$ in Theorem 4 is to allow $\gamma=0$ , which is an important special case of the theorem. Otherwise, we could just assume $0<\delta\leq\gamma$ . The first bound captures the intrinsic difficulty due to the variance of reward, and is present even in a vanilla multi-armed bandit problem without contexts. The second result shows the additional dependence on $R_{\max}^{2}$ , even when $\sigma\equiv 0$ , whenever the distribution $\lambda$ is not too degenerate, and captures the additional difficulty of the contextual bandit problem. We next show how these two lower bounds yield Theorem 1 and then return to their proofs.

First, we simplify the correction term in the lower bound of Theorem 3. Using Hölder’s inequality and Eq. (10), we have

Using Eq. (12), the bound of Theorem 3 simplifies as

Similarly, by Eq. (13), Theorem 4 simplifies as

Since this bound is valid for all $\delta>0$ , taking $\delta\to 0$ , we obtain

Combining this bound with Eq. (14) yields

It remains to prove Theorems 3 and 4. They are both proved by a reduction to hypothesis testing, and invoke Le Cam’s argument to lower-bound the error in this testing problem. As in most arguments of this nature, the key contribution lies in the construction of an appropriate testing problem that leads to the desired lower bounds. Before proving the theorems, we recall the basic result of Le Cam which underlies our proofs. We point the reader to the excellent exposition of Lafferty et al. (2008, Section 36.4) on more details about Le Cam’s argument.

Let $\mathcal{P}$ be a set of distributions, let $X_{1},\dotsc,X_{n}$ be an i.i.d. sample from some $P\in\mathcal{P}$ , let $\theta(P)$ be any function of $P\in\mathcal{P}$ , let $\hat{\theta}(X_{1},\dotsc,X_{n})$ be an estimator, and $d$ be a metric. For any pair $P_{0},P_{1}\in\mathcal{P}$ ,

While the proofs of the two theorems share a lot of similarities, they have to use reductions to slightly different testing problems given the different mean and variance constraints in the two results. We begin with the proof of Theorem 3, which has a simpler construction.

The basic idea of this proof is to reduce the problem of policy evaluation to that of Gaussian mean estimation where there is a mean associated with each $x,a$ pair. We now describe our construction.

Given this family of problem instances, it is easy to see that for any pair of $\eta_{1},\eta_{2}$ which are both pointwise upper bounded by $R_{\max}$ , we have the lower bound:

where the last inequality lower bounds the maximum by the average. So far we have been working with an estimation problem. We next describe how to reduce this to a hypothesis testing problem.

Reduction to hypothesis testing

For turning our estimation problem into a testing problem, the idea is to identify a pair $\eta_{1}$ , $\eta_{2}$ such that they are far enough from each other so that any estimator which gets a small estimation loss can essentially identify whether the data generating distribution corresponds to $P_{\eta_{1}}$ or $P_{\eta_{2}}$ . In order to do this, we take any estimator $\hat{v}$ and identify a corresponding test statistic which maps $\hat{v}$ into one of $\eta_{1}$ , $\eta_{2}$ . The way to do this is essentially identified in Eq. (16), and we describe it next.

where the final inequality uses our earlier lower bound in Eq. (16).

To finish connecting our the estimation problem to testing, it remains to establish our earlier supposition (17). Assume for now that $\eta_{1}$ and $\eta_{2}$ are chosen such that

Then an application of the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$ yields

Invoking Le Cam’s argument

So far we have identified a hypothesis testing problem and a test statistic whose error is upper bounded in terms of the minimax risk of our problem. In order to complete the proof, we now place a lower bound on the error of this test statistic. Recall the result of Le Cam (15), which places an upper bound on the attainable error in any testing problem. In our setting, this translates to

Since the distribution of the rewards is a spherical Gaussian, the KL-divergence is given by the squared distance between the means, scaled by the variance, that is

where we recall that the contexts and actions are drawn from $\lambda$ and $\mu$ respectively. Since we would like the probability of error in the test to be a constant, it suffices to choose $\eta_{1}$ and $\eta_{2}$ such that

Picking the parameters

So far, we have not made any concrete choices for $\eta_{1}$ and $\eta_{2}$ , apart from some constraints which we have introduced along the way. Note that we have the constraints (19) and (20) which try to ensure that $\eta_{1}$ and $\eta_{2}$ are not too close that an estimator does not have to identify the true parameter, or too far that the testing problem becomes trivial. Additionally, we have the upper and lower bounds of and $R_{\max}$ on $\eta_{1}$ and $\eta_{2}$ . In order to reason about these constraints, it is convenient to set $\eta_{2}\equiv 0$ , and pick $\eta_{1}(x,a)=\eta_{1}(x,a)-\eta_{2}(x,a)=\Delta(x,a)$ . We now write all our constraints in terms of $\Delta$ .

Note that $v_{\eta_{2}}$ is now , so that the first constraint (19) is equivalent to

where the importance weighting function $\rho$ is introduced since $P_{\eta_{1}}$ is based on choosing actions according to $\mu$ and we seek to evaluate $\pi$ . The second constraint (20) is also straightforward

Finally, the bound $R_{\max}$ and non-negativity of $\eta_{1}$ and $\eta_{2}$ are enforced by requiring $0\leq\Delta(x,a)\leq R_{\max}(x,a)$ almost surely.

The minimax lower bound is then obtained by the largest $\epsilon$ in the constraint (19) such that the other two constraints can be satisfied. This gives rise to the following variational characterization of the minimax lower bound:

Instead of finding the optimal solution, we just exhibit a feasible setting of $\Delta$ here. We set

This setting satisfies the bounds (23) by construction. A quick substitution also verifies that the constraint (22) is satisfied. Consequently, it suffices to set $\epsilon$ to the value attained in the constraint (21). Substituting the value of $\Delta$ in the constraint, we see that

Putting all the foregoing bounds together, we obtain that for all estimators $\hat{v}$

A.2 Proof of Theorem 4

We now give the proof of Theorem 4. While it shares a lot of reasoning with the proof of Theorem 3, it has one crucial difference. In Theorem 3, there is a non-trivial noise in the reward function, unlike in Theorem 4. This allowed the proof to work with just two candidate mean-reward functions, since any realization in the data is corrupted with noise. However, in the absence of added noise, the task of mean identification becomes rather trivial: an estimator can just check whether $\eta_{1}$ or $\eta_{2}$ matches the observations exactly.

To prevent such a strategy, we instead construct a richer family of reward functions. Instead of merely two mean rewards, our construction will involve a randomized design of the expected reward function from an appropriate prior distribution. The draw of the mean reward from a prior will essentially generate noise even though any given problem is noiseless. The construction will also highlight the crucial sources of difference between the contextual and multi-armed bandit problems, since the arguments here rely on having access to a rich context distribution, by which we mean distribution that puts non-trivial probability on many contexts. In the absence of this property, the bound of Theorem 4 becomes weaker.

Our family of problems will be parametrized by the two reals $\delta$ and $\gamma$ from the statement of the theorem. Our construction begins with a discretization step at the resolution $\delta$ , whose goal is to create a countable partition of the set of pairs $\mathcal{X}\times\mathcal{A}$ . If sets $\mathcal{X}$ and $\mathcal{A}$ are countable or finite, this step is vacuous, but if the sets of contexts or actions have continuous parts, this step is required.

First, let $\mu(x,a)$ denote the joint probability measure obtained by first drawing $x\sim\lambda$ and then $a\sim\mu(\cdot\mathbin{|}x)$ . In Lemma 1, we show that $\mathcal{X}\times\mathcal{A}$ can be split into countably many disjoint sets $B_{i}$ , $\biguplus_{i\in\mathcal{I}}B_{i}=\mathcal{X}\times\mathcal{A}$ , such that the following conditions are satisfied:

Each $i\in\mathcal{I}$ is associated with numbers $R_{i}\geq 0$ , $\rho_{i}\geq 0$ and $\xi_{i}\in\{0,1\}$ such that

Each $B_{i}$ either satisfies $\mu(B_{i})\leq\delta$ or consists of a single pair $(x_{i},a_{i})$ .

The numbers $R_{i}$ and $\rho_{i}$ will be exactly $\hat{R}(x,a)$ and $\hat{\rho}(x,a)$ from the theorem statement.

As before, we parametrize the family of reward distributions in terms of the mean reward function $\eta(x,a)$ . However, now $\eta(x,a)$ is itself a random variable, which is drawn from a prior distribution. The reward function $\eta(x,a)$ will be constant on each $B_{i}$ , and its value on $B_{i}$ , written as $\eta(i)$ , will be drawn from a scaled Bernoulli, parametrized by a prior function $\theta(i)$ as follows:

Taking the worst case over all problems in the above inequality, we obtain

Reduction to hypothesis testing

As in the proof of Theorem 3, we now reduce the estimation problem into a hypothesis test for whether the data is generated according to the parameter $\theta_{1}$ or $\theta_{2}$ . The arguments here are similar to the earlier proof, so we will be terser in this presentation.

Under this assumption, we can similarly conclude that the error of our hypothesis test is at most

Invoking Le Cam’s argument

where $i$ is treated as a random variable under $\mu$ , and $\xi_{i}$ is included, because the two distributions assign $r=0$ with probability one if $\xi_{i}=0$ .

Picking the parameters

It remains to carefully choose $\theta_{1}$ and $\theta_{2}$ . We define $\theta_{2}(i)\equiv 0.5$ , and let $\theta_{1}(i)=\theta_{2}(i)+\Delta_{i}$ , where $\Delta_{i}$ will be chosen to satisfy certain constraints as before. Then, by Lemma 3, the KL divergence in Eq. (27) can be bounded as

It remains to choose $\Delta_{i}$ . Following a similar logic as before, we seek to find a good feasible solution of the maximization problem

For some $\alpha>0$ to be determined shortly, we set

Collecting our arguments so far, we have established that

In order to complete the proof, we need to further upper bound $\mathcal{T}_{2}$ in the decomposition (26).

We begin by bounding its range. From the definition of $\eta$ and $v_{\eta}$ ,

To apply Hoeffding’s inequality, we write $v_{\eta}$ explicitly as

which by construction satisfies $v^{\prime}_{\eta}\leq v_{\eta}\leq v^{\prime}_{\eta}+\delta_{0}$ . Note that the summands $Y_{i}$ can be bounded as

because $\rho^{\prime}_{i}\leq(1+\delta)^{1/2}\rho_{i}$ and $\xi_{i}\mu(B_{i})\leq\xi_{i}\sqrt{\mu(B_{i})}\sqrt{\gamma^{\prime}}$ , because $\xi_{i}=0$ whenever $\mu(B_{i})>\max\{\gamma,\delta\}=\gamma^{\prime}$ . By Hoeffding’s inequality, we thus have

Combining this bound with the bound on $\mathcal{T}_{1}$ yields the theorem.

Sets $B_{i}$ form a partition of $\mathcal{Z}$ , i.e., $\mathcal{Z}=\uplus_{i\in\mathcal{I}}B_{i}$ .

Reals $R_{i}$ and $\rho_{i}$ approximate $R_{\max}$ and $\rho$ , and $\xi_{i}$ equals $\xi_{\gamma}$ as follows:

Each set $B_{i}$ either satisfies $\mu(B_{i})\leq\delta$ or consists of a single $z\in\mathcal{Z}$ .

Let $\mathcal{Z}\coloneqq\mathcal{X}\times\mathcal{A}$ . We begin our construction by separating out atoms, i.e., the elements $z\in\mathcal{Z}$ such that $\mu(z)>0$ . Specifically, we write $\mathcal{Z}=\mathcal{Z}^{\textup{na}}\uplus\mathcal{Z}^{\textup{a}}$ where $\mathcal{Z}^{\textup{a}}$ consists of atoms and $\mathcal{Z}^{\textup{na}}$ of all non-atoms. The set $\mathcal{Z}^{\textup{a}}$ is either finite or countably infinite, so $\mathcal{Z}^{\textup{na}}$ is measurable.

By a theorem of Sierpiński (1922), since $\mu$ does not have any atoms on $\mathcal{Z}^{\textup{na}}$ , it must be continuous on $\mathcal{Z}^{\textup{na}}$ in the sense that if $A$ is a measurable subset of $\mathcal{Z}^{\textup{na}}$ with $\mu(A)=a$ then for any $b\in[0,a]$ , there exists a measurable set $B\subseteq A$ such that $\mu(B)=b$ . This means that we can decompose $\mathcal{Z}^{\textup{na}}$ into $N\coloneqq\lceil 1/\delta\rceil$ sets $\mathcal{Z}^{\textup{na}}_{1},\mathcal{Z}^{\textup{na}}_{2},\dotsc,\mathcal{Z}^{\textup{na}}_{N}$ such that each has a measure at most $\delta$ and $\mathcal{Z}^{\textup{na}}=\biguplus_{j=1}^{N}\mathcal{Z}^{\textup{na}}_{j}$ .

It will also be convenient to set $a_{-\infty}\coloneqq 0$ . Thus, the construction of $I_{j}$ guarantees that for all $j\in\mathcal{J}$ and all $t\in I_{j}$ we have $a_{j}^{2}\leq t^{2}\leq(1+\delta)a_{j}^{2}$ .

The desired partition, with the index set $\mathcal{I}=\mathcal{Z}^{\textup{a}}\;\cup\;[N]\times\mathcal{J}^{2}$ , is as follows:

Appendix B Proof of Theorem 2

Let $A_{x}\coloneqq\left\{a\in\mathcal{A}:\>\rho(x,a)\leq\tau\right\}$ . For brevity, we write $A_{i}\coloneqq A_{x_{i}}$ . We decompose the mean squared error into the squared bias and variance and control each term separately,

We first calculate the bias. Note that bias is incurred only in the terms that fall in $A^{c}_{x}$ , so

Next we upper bound the variance. Note that the variance contributions from the IPS part and the DM part are not independent, since the indicators $\rho(x_{i},a)>\tau$ and $\rho(x_{i},a)\leq\tau$ are mutually exclusive. To simplify the analysis, we use the following inequality that holds for any random variable $X$ and $Y$ :

This allows us to calculate the variance of each part separately.

To complete the proof, note that the last term is further upper bounded using Jensen’s inequality as

where the final inequality uses $\sum_{a\in A^{c}_{x}}\pi(a|x)\leq 1$ and $\hat{r}(x,a)\in[0,R_{\max}(x,a)]$ almost surely.

Combining the bias and variance bounds, we get the stated MSE upper bound. ∎

Appendix C Utility Lemmas

Let $X_{i}\in[a_{i},b_{i}]$ and $X_{1},...,X_{n}$ are drawn independently. Then the empirical mean $\bar{X}=\frac{1}{n}(X_{1}+...+X_{n})$ obeys

Introduction

Setup

Limits of Off-policy Evaluation

2 Minimax Lower Bound for Off-policy Evaluation

Incorporating Reward Models

2 Automatic Parameter Tuning

Experiments

Conclusion

Acknowledgments

References

Appendix A Proof of Theorem 1

Reduction to hypothesis testing

Invoking Le Cam’s argument

Picking the parameters

A.2 Proof of Theorem 4

Reduction to hypothesis testing

Invoking Le Cam’s argument

Picking the parameters

Appendix B Proof of Theorem 2

Appendix C Utility Lemmas

Appendix D Additional Figures from the Experiments