Fairness in Reinforcement Learning

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth

Introduction

The growing use of machine learning for automated decision-making has raised concerns about the potential for unfairness in learning algorithms and models. In settings as diverse as policing , hiring , lending , and criminal sentencing , mounting empirical evidence suggests these concerns are not merely hypothetical .

We initiate the study of fairness in reinforcement learning, where an algorithm’s choices may influence the state of the world and future rewards. In contrast, previous work on fair machine learning has focused on myopic settings where such influence is absent, e.g. in i.i.d. or no-regret models . The resulting fairness definitions therefore do not generalize well to a reinforcement learning setting, as they do not reason about the effects of short-term actions on long-term rewards. This is relevant for the settings where historical context can have a distinct influence on the future. For concreteness, we consider the specific example of hiring (though other settings such as college admission or lending decisions can be embedded into this framework). Consider a firm aiming to hire employees for a number of positions. The firm might consider a variety of hiring practices, ranging from targeting and hiring applicants from well-understood parts of the applicant pool (which might be a reasonable policy for short-term productivity of its workforce), to exploring a broader class of applicants whose backgrounds might differ from the current set of employees at the company (which might incur short-term productivity and learning costs but eventually lead to a richer and stronger overall applicant pool).

We focus on the standard model of reinforcement learning, in which an algorithm seeks to maximize its discounted sum of rewards in a Markovian decision process (MDP). Throughout, the reader should interpret the actions available to a learning algorithm as corresponding to choices or policies affecting individuals (e.g. which applicants to target and hire). The reward for each action should be viewed as the short-term payoff of making the corresponding decision (e.g. the short-term influence on the firm’s productivity after hiring any particular candidate). The actions taken by the algorithm affect the underlying state of the system (e.g. the company’s demographics as well as the available applicant pool) and therefore in turn will affect the actions and rewards available to the algorithm in the future.

Informally, our definition of fairness requires that (with high probability) in state $s$ , an algorithm never chooses an available action $a$ with probability higher than another action $a^{\prime}$ unless $Q^{*}(s,a)>Q^{*}(s,a^{\prime})$ , i.e. the long-term reward of $a$ is greater than that of $a^{\prime}$ . This definition, adapted from Joseph et al. , is weakly meritocratic: facing some set of actions, an algorithm must pick a distribution over actions with (weakly) heavier weight on the better actions (in terms of their discounted long-term reward). Correspondingly, a hiring process satisfying our fairness definition cannot probabilistically target one population over another if hiring from either population will have similar long-term benefit to the firm’s productivity.

Unfortunately, our first result shows an exponential separation in expected performance between the best unfair algorithm and any algorithm satisfying fairness. This motivates our study of a natural relaxation of (exact) fairness, for which we provide a polynomial time learning algorithm, thus establishing an exponential separation between exact and approximately fair learning in MDPs.

Our Results Throughout, we use (exact) fairness to refer to the adaptation of Joseph et al. ’s definition defining an action’s quality as its potential long-term discounted reward. We also consider two natural relaxations. The first, approximate-choice fairness, requires that an algorithm never chooses a worse action with probability substantially higher than better actions. The second, approximate-action fairness, requires that an algorithm never favors an action of substantially lower quality than that of a better action.

The contributions of this paper can be divided into two parts. First, in Section 3, we give a lower bound on the time required for a learning algorithm to achieve near-optimality subject to (exact) fairness or approximate-choice fairness.

For constant $\epsilon$ , to achieve $\epsilon$ -optimality, (i) any fair or approximate-choice fair algorithm takes a number of rounds exponential in the number of MDP states and (ii) any approximate-action fair algorithm takes a number of rounds exponential in $1/(1-\gamma)$ , for discount factor $\gamma$ .

Second, we present an approximate-action fair algorithm (Fair-E3) in Section 4 and prove a polynomial upper bound on the time it requires to achieve near-optimality.

For constant $\epsilon$ and any MDP satisfying standard assumptions, Fair-E3 is an approximate-action fair algorithm achieving $\epsilon$ -optimality in a number of rounds that is (necessarily) exponential in $1/(1-\gamma)$ and polynomial in other parameters.

The exponential dependence of Fair-E3 on $1/(1-\gamma)$ is tight: it matches our lower bound on the time complexity of any approximate-action fair algorithm. Furthermore, our results establish rigorous trade-offs between fairness and performance facing reinforcement learning algorithms.

The most relevant parts of the large body of literature on reinforcement learning focus on constructing learning algorithms with provable performance guarantees. E3 was the first learning algorithm with a polynomial learning rate, and subsequent work improved this rate (see Szita and Szepesvári and references within). The study of robust MDPs examines MDPs with high parameter uncertainty but generally uses “optimistic” learning strategies that ignore (and often conflict with) fairness and so do not directly apply to this work.

Our work also belongs to a growing literature studying the problem of fairness in machine learning. Early work in data mining considered the question from a primarily empirical standpoint, often using statistical parity as a fairness goal. Dwork et al. explicated several drawbacks of statistical parity and instead proposed one of the first broad definitions of algorithmic fairness, formalizing the idea that “similar individuals should be treated similarly”. Recent papers have proven several impossibility results for satisfying different fairness requirements simultaneously . More recently, Hardt et al. proposed new notions of fairness and showed how to achieve these notions via post-processing of a black-box classifier. Woodworth et al. and Zafar et al. further studied these notion theoretically and empirically.

2 Strengths and Limitations of Our Models

In recognition of the duration and consequence of choices made by a learning algorithm during its learning process – e.g. job applicants not hired – our work departs from previous work and aims to guarantee the fairness of the learning process itself. To this end, we adapt the fairness definition of Joseph et al. , who studied fairness in the bandit framework and defined fairness with respect to one-step rewards. To capture the desired interaction and evolution of the reinforcement learning setting, we modify this myopic definition and define fairness with respect to long-term rewards: a fair learning algorithm may only choose action $a$ over action $a^{\prime}$ if $a$ has true long-term reward at least as high as $a^{\prime}$ . Our contributions thus depart from previous work in reinforcement learning by incorporating a fairness requirement (ruling out existing algorithms which commonly make heavy use of “optimistic” strategies that violates fairness) and depart from previous work in fair learning by requiring “online” fairness in a previously unconsidered reinforcement learning context.

First note that our definition is weakly meritocratic: an algorithm satisfying our fairness definition can never probabilistically favor a worse option but is not required to favor a better option. This confers both strengths and limitations. Our fairness notion still permits a type of “conditional discrimination” in which a fair algorithm favors group A over group B by selecting choices from A when they are superior and randomizing between A and B when choices from B are superior. In this sense, our fairness requirement is relatively minimal, encoding a necessary variant of fairness rather than a sufficient one. This makes our lower bounds and impossibility results (Section 3) relatively stronger and upper bounds (Section 4) relatively weaker.

Next, our fairness requirement holds (with high probability) across all decisions that a fair algorithm makes. We view this strong constraint as worthy of serious consideration, since “forgiving” unfairness during the learning may badly mistreat the training population, especially if the learning process is lengthy or even continual. Additionally, it is unclear how to relax this requirement, even for a small fraction of the algorithm’s decisions, without enabling discrimination against a correspondingly small population.

Instead, aiming to preserve the “minimal” spirit of our definition, we consider a relaxation that only prevents an algorithm from favoring a significantly worse option over a better option (Section 2.1). Hence, approximate-action fairness should be viewed as a weaker constraint: rather than safeguarding against every violation of “fairness”, it instead restricts how egregious these violations can be. We discuss further relaxations of our definition in Section 5.

Preliminaries

In this paper we study reinforcement learning in Markov Decision Processes (MDPs). An MDP is a tuple $M=(\mathcal{S}_{M},\mathcal{A}_{M},P_{M},R_{M},T,\gamma)$ where $\mathcal{S}_{M}$ is a set of $n$ states, $\mathcal{A}_{M}$ is a set of $k$ actions, $T$ is a horizon of a (possibly infinite) number of rounds of activity in $M$ , and $\gamma$ is a discount factor. $P_{M}:\mathcal{S}_{M}\times\mathcal{A}_{M}\to\mathcal{S}_{M}$ and $R_{M}:\mathcal{S}_{M}\to$ denote the transition probability distribution and reward distribution, respectively. We use $\bar{R}_{M}$ to denote the mean of $R_{M}$ .Note that $\bar{R}_{M}\leq 1$ and $\text{Var}(R_{M})\leq 1$ for all states. The bounded reward assumption can be relaxed (see e.g. ). Also assuming rewards in $ $can be made w.l.o.g. up to scaling. A policy$ \pi $is a mapping from a history$ h $(the sequence of triples (state, action, reward) observed so far) to a distribution over actions. The discounted state and state-action value functions are denoted by$ V^{\pi} $and$ Q^{\pi} $, and$ V^{\pi}(s,T) $represents expected discounted reward of following$ \pi $from$ s $for$ T $steps. The highest values functions are achieved by the optimal policy$ \pi^{*} $and are denoted by$ V^{*} $and$ Q^{*} $. We use$ \mu^{\pi} $to denote the stationary distribution of$ \pi$. Throughout we make the following assumption.

The stationary distribution of any policy in $M$ is independent of its start state.

We denote the $\epsilon$ -mixing time of $\pi$ by $T^{\pi}_{\epsilon}$ . Lemma 1 relates the $\epsilon$ -mixing time of any policy $\pi$ to the number of rounds until the $V^{\pi}_{M}$ values of the visited states by $\pi$ are close to the expected $V^{\pi}_{M}$ values (under the stationary distribution $\mu^{\pi}$ ). We defer all the omitted proofs to the Appendix.

Fix $\epsilon>0$ . For any state $s$ , following $\pi$ for $T\geq T^{\pi}_{\epsilon}$ steps from $s$ satisfies

where $s_{t}$ is the state visited at time $t$ when following $\pi$ from $s$ and the expectation in the second term is over the transition function and the randomization of $\pi$ .Lemma 1 can be stated for a weaker notion of mixing time called the $\epsilon$ -reward mixing time which is always linearly bounded by the $\epsilon$ -mixing time but can be much smaller in certain cases (see Kearns and Singh for a discussion).

The horizon time ${H_{\epsilon}^{\gamma}}\coloneqq\log\left(\epsilon(1-\gamma)\right)/\log(\gamma)$ of an MDP captures the number of steps an approximately optimal policy must optimize over. The expected discounted reward of any policy after ${H_{\epsilon}^{\gamma}}$ steps approaches the expected asymptotic discounted reward (Kearns and Singh , Lemma 2). A learning algorithm $\mathcal{L}$ is a non-stationary policy that at each round takes the entire history and outputs a distribution over actions. We now define a performance measure for learning algorithms.

Let $\epsilon>0$ and $\delta\in(0,1/2)$ . $\mathcal{L}$ achieves $\epsilon$ -optimality in $\mathcal{T}$ steps if for any $T\geq\mathcal{T}$

with probability at least $1-\delta$ , for $s_{t}$ the state $\mathcal{L}$ reaches at time $t$ , where the expectation is taken over the transitions and the randomization of $\mathcal{L}$ , for any MDP $M$ .

We thus ask that a learning algorithm, after sufficiently many steps, visits states whose values are arbitrarily close to the values of the states visited by the optimal policy. Note that this is stronger than the “hand-raising” notion in Kearns and Singh ,We suspect unfair E3 also satisfies this stronger notion. which only asked that the learning algorithm stop in a state from which discounted return is near-optimal, permitting termination in a state from which the optimal discounted return is poor. In Definition 1, if there are states with poor optimal discounted reward that the optimal policy eventually leaves for better states, so must our algorithms. We also note the following connection between the average $V^{\pi}_{M}$ values of states visited under the stationary distribution of $\pi$ (and in particular an optimal policy) and the average undiscounted rewards achieved under the stationary distribution of that policy.

Let $\bar{\textbf{R}}_{M}$ be the vector of mean rewards in states of $M$ and $\textbf{V}^{\pi}_{M}$ the vector of discounted rewards in states under $\pi$ . Then $\mu^{\pi}\cdot\bar{\textbf{R}}_{M}=(1-\gamma)\mu^{\pi}\cdot\textbf{V}^{\pi}_{M}.$

We design an algorithm which quickly achieves $\epsilon$ -optimality and we bound the number of steps $\mathcal{T}$ before this happens by a polynomial in the parameters of $M$ .

We now turn to formal notions of fairness. Translated to our setting, Joseph et al. define action $a$ ’s quality as the expected immediate reward for choosing $a$ from state $s$ and then require that an algorithm not probabilistically favor $a$ over $a^{\prime}$ if $a$ has lower expected immediate reward.

However, this naive translation does not adequately capture the structural differences between bandit and MDP settings since present rewards may depend on past choices in MDPs. In particular, defining fairness in terms of immediate rewards would prohibit any policy sacrificing short-term rewards in favor of long-term rewards. This is undesirable, since it is the long-term rewards that matter in reinforcement learning, and optimizing for long-term rewards often necessitates short-term sacrifices. Moreover, the long-term impact of a decision should be considered when arguing about its relative fairness. We will therefore define fairness using the state-action value function $Q^{*}_{M}$ .

$\mathcal{L}$ is fair if for all input $\delta>0$ , all $M$ , all rounds $t$ , all states $s$ and all actions $a,a^{\prime}$

with probability at least $1-\delta$ over histories $h_{t-1}$ . $\mathcal{L}(s,a,h)$ denotes the probability $\mathcal{L}$ chooses $a$ from $s$ given history $h$ .

Fairness requires that an algorithm never probabilistically favors an action with lower long-term reward over an action with higher long-term reward. In hiring, this means that an algorithm cannot target one applicant population over another unless the targeted population has a higher quality.

In Section 3, we show that fairness can be extremely restrictive. Intuitively, $\mathcal{L}$ must play uniformly at random until it has high confidence about the $Q^{*}_{M}$ values, in some cases taking exponential time to achieve near-optimality. This motivates relaxing Definition 2. We first relax the probabilistic requirement and require only that an algorithm not substantially favor a worse action over a better one.

$\mathcal{L}$ is $\alpha$ -choice fair if for all inputs $\delta>0$ and $\alpha>0$ : for all $M$ , all rounds $t$ , all states $s$ and actions $a,a^{\prime}$ :

with probability of at least $1-\delta$ over histories $h_{t-1}$ . If $\mathcal{L}$ is $\alpha$ -choice fair for any input $\alpha>0$ , we call $\mathcal{L}$ approximate-choice fair.

A slight modification of the lower bound for (exact) fairness shows that algorithms satisfying approximate-choice fairness can also require exponential time to achieve near-optimality. We therefore propose an alternative relaxation, where we relax the quality requirement. As described in Section 1.1, the resulting notion of approximate-action fairness is in some sense the most fitting relaxation of fairness, and is a particularly attractive one because it allows us to give algorithms circumventing the exponential hardness proved for fairness and approximate-choice fairness.

$\mathcal{L}$ is $\alpha$ -action fair if for all inputs $\delta>0$ and $\alpha>0$ , for all $M$ , all rounds $t$ , all states $s$ and actions $a,a^{\prime}$ :

with probability of at least $1-\delta$ over histories $h_{t-1}$ . If $\mathcal{L}$ is $\alpha$ -action fair for any input $\alpha>0$ , we call $\mathcal{L}$ approximate-action fair.

Approximate-choice fairness prevents equally good actions from being chosen at very different rates, while approximate-action fairness prevents substantially worse actions from being chosen over better ones. In hiring, an approximately-action fair firm can only (probabilistically) target one population over another if the targeted population is not substantially worse. While this is a weaker guarantee, it at least forces an approximately-action fair algorithm to learn different populations to statistical confidence. This is a step forward from current practices, in which companies have much higher degrees of uncertainty about the quality (and impact) of hiring individuals from under-represented populations. For this reason and the computational benefits mentioned above, our upper bounds will primarily focus on approximate-action fairness.

We now state several useful observations regarding fairness. We defer all the formal statements and their proofs to the Appendix. We note that there always exists a (possibly randomized) optimal policy which is fair (Observation 1); moreover, any optimal policy (deterministic or randomized) is approximate-action fair (Observation 2), as is the uniformly random policy (Observation 3).

Finally, we consider a restriction of the actions in an MDP $M$ to nearly-optimal actions (as measured by $Q^{*}_{M}$ values).

The $\alpha$ -restricted MDP of $M$ , denoted by $M^{\alpha}$ , is identical to $M$ except that in each state $s$ , the set of available actions are restricted to $\{a:Q^{*}_{M}(s,a)\geq\max_{a^{\prime}\in\mathcal{A}_{M}}Q^{*}_{M}(s,a^{\prime})-\alpha\mid a\in\mathcal{A}_{M}\}$ .

$M^{\alpha}$ has the following two properties: (i) any policy in $M^{\alpha}$ is $\alpha$ -action fair in $M$ (Observation 4) and (ii) the optimal policy in $M^{\alpha}$ is also optimal in $M$ (Observation 5). Observations 4 and 5 aid our design of an approximate-action fair algorithm: we construct $M^{\alpha}$ from estimates of the $Q^{*}_{M}$ values (see Section 4.3 for more details).

Lower Bounds

We now demonstrate a stark separation between the performance of learning algorithms with and without fairness. First, we show that neither fair nor approximate-choice fair algorithms achieve near-optimality unless the number of time steps $\mathcal{T}$ is at least $\Omega(k^{n})$ , exponential in the size of the state space. We then show that any approximate-action fair algorithm requires a number of time steps $\mathcal{T}$ that is at least $\Omega(k^{\tfrac{1}{1-\gamma}})$ to achieve near-optimality. We start by proving a lower bound for fair algorithms.

If $\delta<\tfrac{1}{4}$ , $\gamma>\tfrac{1}{2}$ and $\epsilon<\tfrac{1}{8}$ , no fair algorithm can be $\epsilon$ -optimal in $\mathcal{T}=O(k^{n})$ steps.We have not optimized the constants upper-bounding parameters in the statement of Theorems 3, 4 and 5. The values presented here are only chosen for convenience.

Standard reinforcement learning algorithms (absent a fairness constraint) learn an $\epsilon$ -optimal policy in a number of steps polynomial in $n$ and $\tfrac{1}{\epsilon}$ ; Theorem 3 therefore shows a steep cost of imposing fairness. We outline the idea for proof of Theorem 3. For intuition, first consider the special case when the number of actions $k=2$ . We introduce the MDPs witnessing the claim in Theorem 3 for this case.

For $\mathcal{A}_{M}=\{L,R\}$ , let $M(x)=(\mathcal{S}_{M},\mathcal{A}_{M},\mathcal{P}_{M},\mathcal{R}_{M},T,\gamma,x)$ be an MDP with

for all $i\in[n]$ , $P_{M}(s_{i},L,s_{1})=P_{M}(s_{i},R,s_{j})=1$ where $j=\min\{i+1,n\}$ and is 0 otherwise.

for $i\in[n-1]$ , $R_{M}(s_{i})=0.5$ , and $R_{M}(s_{n})=x$ .

Figure 1 illustrates the MDP from Definition 6. All the transitions and rewards in $M$ are deterministic, but the reward at state $s_{n}$ can be either $1$ or $\tfrac{1}{2}$ , and so no algorithm (fair or otherwise) can determine whether the $Q^{*}_{M}$ values of all the states are the same or not until it reaches $s_{n}$ and observes its reward. Until then, fairness requires that the algorithm play all the actions uniformly at random (if the reward at $s_{n}$ is $\tfrac{1}{2}$ , any fair algorithm must play uniformly at random forever). Thus, any fair algorithm will take exponential time in the number of states to reach $s_{n}$ . This can be easily modified for $k>2$ : from each state $s_{i}$ , $k-1$ of the actions from state $s_{i}$ (deterministically) return to state $s_{1}$ and only one action (deterministically) reaches any other state $s_{\min\{i+1,n\}}$ . It will take $k^{n}$ steps before any fair algorithm reaches $s_{n}$ and can stop playing uniformly at random (which is necessary for near-optimality). The same example, with a slightly modified analysis, also provides a lower bound of $\Omega((k/(1+k\alpha))^{n})$ time steps for approximate-choice fair algorithms as stated in Theorem 4.

If $\delta<\tfrac{1}{4},\alpha<\tfrac{1}{4},\gamma>\tfrac{1}{2}$ and $\epsilon<\tfrac{1}{8}$ , no $\alpha$ -choice fair algorithm is $\epsilon$ -optimal for $\mathcal{T}=O((\tfrac{k}{1+k\alpha})^{n})$ steps.

Fairness and approximate-choice fairness are both extremely costly, ruling out polynomial time learning rates. Hence, we focus on approximate-action fairness. Before moving to positive results, we mention that the time complexity of approximate-action fair algorithms will still suffer from an exponential dependence on $\tfrac{1}{1-\gamma}$ .

For $\delta<\tfrac{1}{4}$ , $\alpha<\tfrac{1}{8}$ , $\gamma>\max(0.9,c)$ , $c\in(\tfrac{1}{2},1)$ and $\epsilon<\tfrac{1-e^{c-1}}{16}$ , no $\alpha$ -action fair algorithm is $\epsilon$ -optimal for $\mathcal{T}=O((k^{\tfrac{1}{1-\gamma}})^{c})$ steps.

The MDP in Figure 1 also witnesses the claim of Theorem 5 when $n=\lceil\frac{\log(1/(2\alpha))}{1-\gamma}\rceil$ . The discount factor $\gamma$ is generally taken as a constant, so in most interesting cases $\tfrac{1}{1-\gamma}\ll n$ : this lower bound is substantially less stringent than the lower bounds proven for fairness and approximate-choice fairness. Hence, from now on, we focus on designing algorithms satisfying approximate-action fairness with learning rates polynomial in every parameter but $\tfrac{1}{1-\gamma}$ , and with tight dependence on $\tfrac{1}{1-\gamma}$ .

A Fair and Efficient Learning Algorithm

We now present an approximate-action fair algorithm, Fair-E3 with the performance guarantees stated below.

Given $\epsilon>0$ , $\alpha>0$ , $\delta\in\left(0,\tfrac{1}{2}\right)$ and $\gamma\in[0,1)$ as inputs, Fair-E3 is an $\alpha$ -action fair algorithm which achieves $\epsilon$ -optimality after

The running time of Fair-E3 (which we have not attempted to optimize) is polynomial in all the parameters of the MDP except $\tfrac{1}{1-\gamma}$ ; Theorem 5 implies that this exponential dependence on $\tfrac{1}{1-\gamma}$ is necessary.

Several more recent algorithms (e.g. R-MAX ) have improved upon the performance of E3. We adapted E3 primarily for its simplicity. While the machinery required to properly balance fairness and performance is somewhat involved, the basic ideas of our adaptation are intuitive. We further note that subsequent algorithms improving on E3 tend to heavily leverage the principle of “optimism in face of uncertainty”: such behavior often violates fairness, which generally requires uniformity in the face of uncertainty. Thus, adapting these algorithms to satisfy fairness is more difficult. This in particular suggests E3 as an apt starting point for designing a fair planning algorithm.

The remainder of this section will explain Fair-E3, beginning with a high-level description in Section 4.1. We then define the “known” states Fair-E3 uses to plan in Section 4.2, explain this planning process in Section 4.3, and bring this all together to prove Fair-E3’s fairness and performance guarantees in Section 4.4.

Fair-E3 relies on the notion of “known” states. A state $s$ is defined to be known after all actions have been chosen from $s$ enough times to confidently estimate relevant reward distributions, transition probabilities, and $Q^{\pi}_{M}$ values for each action. At each time $t$ , Fair-E3 then uses known states to reason about the MDP as follows:

If in an unknown state, take a uniformly random trajectory of length ${H_{\epsilon}^{\gamma}}$ .

If in a known state, compute (i) an exploration policy which escapes to an unknown state quickly and $p$ , the probability that this policy reaches an unknown state within $2T^{*}_{\epsilon}$ steps, and (ii) an exploitation policy which is near-optimal in the known states of $M$ .

If $p$ is large enough, follow the exploration policy; otherwise, follow the exploitation policy.

Fair-E3 thus relies on known states to balance exploration and exploitation in a reliable way. While Fair-E3 and E3 share this general idea, fairness forces Fair-E3 to more delicately balance exploration and exploitation. For example, while both algorithms explore until states become “known”, the definition of a known state must be much stronger in Fair-E3 than in E3 because Fair-E3 additionally requires accurate estimates of actions’ $Q^{\pi}_{M}$ values in order to make decisions without violating fairness. For this reason, Fair-E3 replaces the deterministic exploratory actions of E3 with random trajectories of actions from unknown states. These random trajectories are then used to estimate the necessary $Q^{\pi}_{M}$ values.

In a similar vein, Fair-E3 requires particular care in computing exploration and exploitation policies, and must restrict the set of such policies to fair exploration and fair exploitation policies. Correctly formulating this restriction process to balance fairness and performance relies heavily on the observations about the relationship between fairness and performance provided in Section 2.1.

2 Known States in Fair-E3

We now formally define the notion of known states for Fair-E3. We say a state $s$ becomes known when one can compute good estimates of (i) $R_{M}(s)$ and $P_{M}(s,a)$ for all $a$ , and (ii) $Q^{*}_{M}(s,a)$ for all $a$ .

length- ${H_{\epsilon}^{\gamma}}$ random trajectories from $s$ .

It remains to show that motivating conditions (i) and (ii) indeed hold for our formal definition of a known state. Informally, $m_{1}$ random trajectories suffice to ensure that we have accurate estimates of all $Q^{*}_{M}(s,a)$ values, and $m_{2}$ random trajectories suffice to ensure accurate estimates of the transition probabilities and rewards.

To formalize condition (i), we rely on Theorem 7, connecting the number of random trajectories taken from $s$ to the accuracy of the empirical $V^{\pi}_{M}$ estimates.

random trajectories of length ${H_{\epsilon}^{\gamma}}$ from $s$ , with probability of at least $1-\delta$ , we can compute estimates $\hat{V}_{M}^{\pi}$ such that $|V^{\pi}_{M}\left(s\right)-\hat{V}^{\pi}_{M}\left(s\right)|\leq\alpha$ , simultaneously for all $\pi\in\Pi$ .

Theorem 7 enables us to translate between the number of trajectories taken from a state and the uncertainty about its $V_{M}^{\pi}$ values for all policies (including $\pi^{*}$ and hence $V^{*}_{M}$ ). Since $|\Pi|=k^{n}$ , we substitute $\log\left(|\Pi|\right)=n\log\left(k\right)$ . To estimate $Q^{*}_{M}\left(s,a\right)$ values using the $V^{*}_{M}\left(s\right)$ values we increase the number of necessary length- ${H_{\epsilon}^{\gamma}}$ random trajectories by a factor of $k$ .

For condition (ii), we adapt the analysis of E3 , which states that if each action in a state $s$ is taken $m_{2}$ times, then the transition probabilities and reward in state $s$ can be estimated accurately (see Section 4.4).

3 Planning in Fair-E3

We now formalize the planning steps in Fair-E3 from known states. For the remainder of our exposition, we make Assumption 2 for convenience (and show how to remove this assumption in the Appendix).

Fair-E3 constructs two ancillary MDPs for planning: $M_{\Gamma}$ is the exploitation MDP, in which the unknown states of $M$ are condensed into a single absorbing state $s_{0}$ with no reward. In the known states $\Gamma$ , transitions are kept intact and the rewards are deterministically set to their mean value. $M_{\Gamma}$ thus incentivizes exploitation by giving reward only for staying within known states. In contrast, $M_{[n]\setminus\Gamma}$ is the exploration MDP, identical to $M_{\Gamma}$ except for the rewards. The rewards in the known states $\Gamma$ are set to and the reward in $s_{0}$ is set to $1$ . $M_{[n]\setminus\Gamma}$ then incentivizes exploration by giving reward only for escaping to unknown states. See the middle (right) panel of Figure 2 for an illustration of $M_{\Gamma}$ ( $M_{[n]\setminus\Gamma}$ ), and Appendix for formal definitions.

Fair-E3 uses these constructed MDPs to plan according to the following natural idea: when in a known state, Fair-E3 constructs $\hat{M}_{\Gamma}$ and $\hat{M}_{[n]\setminus\Gamma}$ based on the estimated transition and rewards observed so far (see the Appendix for formal definitions), and then uses these to compute additional restricted MDPs $\hat{M}_{\Gamma}^{\alpha}$ and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ for approximate-action fairness. Fair-E3 then uses these restricted MDPs to choose between exploration and exploitation.

More formally, if the optimal policy in $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ escapes to the absorbing state of $M_{\Gamma}$ with high enough probability within $2T^{*}_{\epsilon}$ steps, then Fair-E3 explores by following that policy. Otherwise, Fair-E3 exploits by following the optimal policy in $\hat{M}_{\Gamma}^{\alpha}$ for $T^{*}_{\epsilon}$ steps. While following either of these policies, whenever Fair-E3 encounters an unknown state, it stops following the policy and proceeds by taking a length- ${H_{\epsilon}^{\gamma}}$ random trajectory.

4 Analysis of Fair-E3

In this section we formally analyze Fair-E3 and prove Theorem 6. We begin by proving that $M_{\Gamma}^{\alpha}$ is useful in the following sense: $M_{\Gamma}^{\alpha}$ has at least one of an exploitation policy achieving high reward or an exploration policy that quickly reaches an unknown state in $M$ .

For any state $s\in\Gamma$ , $\beta\in(0,1)$ and any $T>0$ at least one of the statements below holds:

there exists an exploitation policy $\pi$ in $M_{\Gamma}^{\alpha}$ such that

where the random variables $\pi^{t}(s)$ and $\bar{\pi}^{t}(s)$ denote the states reached from $s$ after following $\pi$ and $\bar{\pi}$ for $t$ steps, respectively.

there exists an exploration policy $\pi$ in $M_{\Gamma}^{\alpha}$ such that the probability that a walk of $2T$ steps from $s$ following $\pi$ will terminate in $s_{0}$ exceeds $\beta/T$ .

We can use this fact to reason about exploration as follows. First, since Observation 2 tells us that the optimal policy in $M$ is approximate-action fair, if the optimal policy stays in the set of $M$ ’s known states $M_{\Gamma}$ , then following the optimal policy in $M_{\Gamma}^{\alpha}$ is both optimal and approximate-action fair.

However, if instead the optimal policy in $M$ quickly escapes to an unknown state in $M$ , the optimal policy in $M_{\Gamma}^{\alpha}$ may not be able to compete with the optimal policy in $M$ . Ignoring fairness, one natural way of computing an escape policy to “keep up” with the optimal policy is to compute the optimal policy in $M_{[n]\setminus\Gamma}$ . Unfortunately, following this escape policy might violate approximate-action fairness – high-quality actions might be ignored in lieu of low-quality exploratory actions that quickly reach the unknown states of $M$ . Instead, we compute an escape policy in $M_{[n]\setminus\Gamma}^{\alpha}$ and show that if no near-optimal exploitation policy exists in $M_{\Gamma}$ , then the optimal policy in $M_{[n]\setminus\Gamma}^{\alpha}$ (which is fair by construction) quickly escapes to the unknown states of $M$ .

Next, in order for Fair-E3 to check whether the optimal policy in $M_{[n]\setminus\Gamma}^{\alpha}$ quickly reaches the absorbing state of $M_{\Gamma}$ with significant probability, Fair-E3 simulates the execution of the optimal policy of $M_{[n]\setminus\Gamma}^{\alpha}$ for $2T^{*}_{\epsilon}$ steps from the known state $s$ in $M_{\Gamma}^{\alpha}$ several times, counting the ratio of the runs ending in $s_{0}$ , and applying a Chernoff bound; this is where Assumption 2 is used.

Having discussed exploration, it remains to show that the exploitation policy described in Lemma 8 satisfies $\epsilon$ -optimality as defined in Definition 1. By setting $T\geq T^{*}_{\epsilon}$ in Lemma 8 and applying Lemmas 1 and 10, we can prove Corollary 9 regarding this exploitation policy.

For any state $s\in\Gamma$ and $T\geq T^{*}_{\epsilon}$ if there exists an exploitation policy $\pi$ in $M_{\Gamma}^{\alpha}$ then

Finally, we have so far elided the fact that Fair-E3 only has access to the empirically estimated MDPs $\hat{M}_{\Gamma}^{\alpha}$ and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ (see the Appendix for formal definitions). We remedy this issue by showing that the behavior of any policy $\pi$ in $\hat{M}_{\Gamma}^{\alpha}$ (and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ ) is similar to the behavior of $\pi$ in $M_{\Gamma}^{\alpha}$ (and $M_{[n]\setminus\Gamma}^{\alpha}$ ). To do so, we prove a stronger claim: the behavior of any $\pi$ in $\hat{M}_{\Gamma}$ (and $\hat{M}_{[n]\setminus\Gamma})$ is similar to the behavior of $\pi$ in $M_{\Gamma}$ (and $M_{[n]\setminus\Gamma})$ .

Let $\Gamma$ be the set of known states and $\hat{M}_{\Gamma}$ the approximation to $M_{\Gamma}$ . Then for any state $s\in\Gamma$ , any action $a$ and any policy $\pi$ , with probability at least $1-\delta$ :

$V^{\pi}_{M_{\Gamma}}(s)-\min\{\alpha/2,\epsilon\}\leq V^{\pi}_{\hat{}M_{\Gamma}}(s)\leq V^{\pi}_{M_{\Gamma}}(s)+\min\{\alpha/2,\epsilon\},$

$Q^{\pi}_{M_{\Gamma}}\left(s,a\right)-\min\{\alpha/2,\epsilon\}\leq Q^{\pi}_{\hat{}M_{\Gamma}}\left(s,a\right)\leq Q^{\pi}_{M_{\Gamma}}\left(s,a\right)+\min\{\alpha/2,\epsilon\}.$

We now have the necessary results to prove Theorem 6.

We divide the analysis into separate parts: the performance guarantee of Fair-E3 and its approximate-action fairness. We defer the analysis of the probability of failure of Fair-E3 to the Appendix.

We now bound the total number of exploratory steps of Fair-E3 by

where $m_{Q}$ is defined in Equation 3 of Definition 7. The two components of this term bound the number of rounds in which Fair-E3 plays non-exploitatively: the first bounds the number of steps taken when Fair-E3 follows random trajectories, and the second bounds how many steps are taken following explicit exploration policies. The former bound follows from the facts that each random trajectory has length ${H_{\epsilon}^{\gamma}}$ ; that in each state, $m_{Q}$ trajectories are sufficient for the state to become known; and that random trajectories are taken only before all $n$ states are known. The latter bound follows from the fact that Fair-E3 follows an exploration policy for $2T^{*}_{\epsilon}$ steps; and an exploration policy needs to be followed only $O(\tfrac{T^{*}_{\epsilon}}{\epsilon}\log(\frac{n}{\delta}))$ times before reaching an unknown state (since any exploration policy will end up in an unknown state with probability of at least $\tfrac{\epsilon}{T^{*}_{\epsilon}}$ according to Lemma 8, and applying a Chernoff bound); that an unknown state becomes known after it is visited $m_{Q}$ times; and that exploration policies are only followed before all states are known.

Finally, to make up for the potentially poor performance in exploration, the number of $2T^{*}_{\epsilon}$ steps exploitation phases needed is at least

Therefore, after $\mathcal{T}=T_{1}+T_{2}$ steps we have

as claimed in Equation 2. The running time of Fair-E3 is $O(\tfrac{n\mathcal{T}^{3}}{\epsilon})$ : the additional $\tfrac{nT^{2}}{\epsilon}$ factor comes from offline computation of the optimal policies in $\hat{M}_{\Gamma}^{\alpha}$ and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}.$

We wrap up by proving Fair-E3 satisfies approximate-action fairness in every round. The actions taken during random trajectories are fair (and hence approximate-action fair) by Observation 3. Moreover, Fair-E3 computes policies in $\hat{M}_{\Gamma}^{\alpha}$ and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ . By Lemma 10 with probability at least $1-\delta$ any $Q^{*}$ or $V^{*}$ value estimated in $\hat{M}_{\Gamma}^{\alpha}$ or $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ is within $\alpha/2$ of its corresponding true value in $M_{\Gamma}^{\alpha}$ or $M_{[n]\setminus\Gamma}^{\alpha}$ . As a result, $\hat{M}_{\Gamma}^{\alpha}$ and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ (i) contain all the optimal policies and (ii) only contain actions with $Q^{*}$ values within $\alpha$ of the optimal actions. It follows that any policy followed in $\hat{M}_{\Gamma}^{\alpha}$ and $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ is $\alpha$ -action fair, so both the exploration and exploitation policies followed by Fair-E3 satisfy $\alpha$ -action fairness, and Fair-E3 is therefore $\alpha$ -action fair. ∎

Discussion and Future Work

Our work leaves open several interesting questions. For example, we give an algorithm that has an undesirable exponential dependence on $1/(1-\gamma)$ , but we show that this dependence is unavoidable for any approximate-action fair algorithm. Without fairness, near-optimality in learning can be achieved in time that is polynomial in all of the parameters of the underlying MDP. So, we can ask: does there exist a meaningful fairness notion that enables reinforcement learning in time polynomial in all parameters?

Moreover, our fairness definitions remain open to further modulation. It remains unclear whether one can strengthen our fairness guarantee to bind across time rather than simply across actions available at the moment without large performance tradeoffs. Similarly, it is not obvious whether one can gain performance by relaxing the every-step nature of our fairness guarantee in a way that still forbids discrimination. These and other considerations suggest many questions for further study; we therefore position our work as a first cut for incorporating fairness into a reinforcement learning setting.

References

Appendix A Omitted Proofs

Let $\hat{\mu}^{\pi}_{T}$ denote the distribution of $\pi$ on states of $M$ after following $\pi$ for $T$ steps starting from $s$ . Then we know

The last inequality is due to the following observations: (i) $V^{\pi}_{M}(s_{i})\leq\tfrac{1}{1-\gamma}$ as rewards are in $ $and (ii)$ \Sigma_{i=1}^{n}\left|\mu^{\pi}(s_{i})-\hat{\mu}^{\pi}_{T}(s_{i})\right|\leq\epsilon $since$ T $is at least the$ \epsilon $-mixing time of$ \pi$. ∎

A.2 Omitted Proofs for Section 3

We first state the following useful Lemma about $M$ .

Let $M$ be the MDP in Definition 6. Then for any $i\in\{1,\ldots,n\}$ , $V^{*}_{M}(s_{i})<\tfrac{1+2\gamma^{n-i+1}}{2(1-\gamma)}$ .

via two applications of the summation formula for geometric series. ∎

We prove Theorem 3 for the special case of $k=2$ first. Consider coupling the run of a fair algorithm $\mathcal{L}$ on both $M(0.5)$ and $M(1)$ . To achieve this, we can fix the randomness of $\mathcal{L}$ up front, and use the same randomness on both MDPs. The set of observations and hence the actions taken on both MDPs are identical until $\mathcal{L}$ reaches state $s_{n}$ . Until then, with probability at least $1-\delta$ , $\mathcal{L}$ must play $L$ and $R$ with equal probability in order to satisfy fairness (since, for $M(0.5)$ , the only fair policy is to play both actions with equal probability at each time step). We will upper-bound the optimality of uniform play and lower-bound the number of rounds before which $s_{n}$ is visited by uniformly random play.

Let $f_{\gamma}=\lceil\frac{1}{1-\sqrt{\gamma}}\rceil$ and $\mathcal{T}=2^{n-2f_{\gamma}}$ for $n\geq 100(f_{\gamma})^{2}$ . First observe that the probability of reaching a fixed state $s_{i}$ for any $i\geq n-f_{\gamma}$ from a random walk of length $\mathcal{T}$ is upper bounded by the probability that the random walk takes $i\geq n-f_{\gamma}$ consecutive steps to the right in the first $\mathcal{T}$ steps. This probability is at most $p=2^{n-2f_{\gamma}}(\tfrac{1}{2})^{n-f_{\gamma}}=2^{-f_{\gamma}}$ for any fixed $i$ . Since reaching any state $i>i^{\prime}$ requires reaching state $i^{\prime}$ , the probability that the $\mathcal{T}$ step random walk arrives in any state $s_{i}$ for $i\geq n-f_{\gamma}$ is also upper bounded by $p$ .

Next, we observe that $V^{*}_{M}(s_{i})$ is a nondecreasing function of $i$ for both MDPs. Then the average $V^{*}_{M}$ values of the visited states of any fair policy can be broken into two pieces: the average conditioned on (the probability at least $1-\delta$ event) that the algorithm plays uniformly at random before reaching state $s_{n}$ and never reaching a state beyond $s_{n-f_{\gamma}}$ , and the average conditioned on (the probability at most $\delta$ event) that the algorithm does not make uniformly random choices or the uniform random walk of length $\mathcal{T}$ reaches a state beyond $s_{n-f_{\gamma}}$ . So, we have that

The first inequality follows from the fact that $V^{*}_{M}(s_{i})\leq\tfrac{1}{1-\gamma}$ for all $i$ , and the second from Lemma 11 along with $V^{*}_{M}$ values being nondecreasing in $i$ . Putting it all together,

However, if $\epsilon<\tfrac{1}{8}$ we get

where the third inequality follows when $\delta<\tfrac{1}{4}$ and $\gamma>\frac{1}{2}$ . This means $\epsilon<\tfrac{1}{8}$ makes $\epsilon$ -optimality impossible, as desired.

Throughout we considered the special case of $k=2$ and proved a lower bound of $\Omega(2^{n})$ time steps for any fair algorithm satisfying the $\epsilon$ -optimality condition. However, it is easy to see that MDP $M$ in Definition 6 can be easily modified in a way that $k-1$ of the actions from state $s_{i}$ reach state $s_{1}$ and only one action in each state $s_{i}$ reaches states $s_{\min\{i+1,n\}}$ . Hence, a lower bound of $\Omega(k^{n})$ time steps can be similarly proved. ∎

We mimic the argument used to prove Theorem 3 with the difference that, until visiting $s_{n}$ , $\mathcal{L}$ may not play $R$ with probability more than $\tfrac{1}{2}+\alpha$ (as opposed to $\tfrac{1}{2}$ in Theorem 3). Let $f_{\gamma}=\lceil\frac{1}{1-\sqrt{\gamma}}\rceil$ and $\mathcal{T}=(\tfrac{2}{1+2\alpha})^{n-2f_{\gamma}}$ for $n\geq 100(f_{\gamma})^{2}$ . By a similar process as in Theorem 3, the probability of reaching state $s_{i}$ for any $i\geq n-f_{\gamma}$ from a random walk of length $\mathcal{T}$ is bounded by $p=(\tfrac{2}{1+2\alpha})^{-f_{\gamma}}$ , and so the probability that the $\mathcal{T}$ steps random walk arrives in any state $s_{i}$ for $i\geq n-f_{\gamma}$ is bounded by $p$ . Carrying out the same process used to prove Theorem 3 then once more implies that $\epsilon$ -optimality requires Equation 4 to hold when $\delta<\tfrac{1}{4}$ , $\alpha<\tfrac{1}{4}$ and $\gamma>\tfrac{1}{2}$ . Hence, $\epsilon<\tfrac{1}{8}$ violates this condition as desired.

Finally, throughout we considered the special case of $k=2$ . The same trick as in the proof of Theorem 3 can be used to prove the lower bound of $\Omega((\tfrac{k}{1+k\alpha})^{n})$ time steps for any fair algorithm satisfying the $\epsilon$ -optimality condition. ∎

We also prove Theorem 5 for the special case of $k=2$ first, again considering the MDP in Definition 6. We set the size of the state space in $M$ to be $n=\lceil\tfrac{\log(\tfrac{1}{2\alpha})}{1-\gamma}\rceil$ . Then given the parameter ranges, for any $i$ , $Q^{*}_{M}(s_{i},R)-Q^{*}_{M}(s_{i},L)>\alpha$ in M(1). Therefore, any approximate-action fair algorithm should play actions R and L with equal probability.

Let $\mathcal{T}=2^{cn}=\Omega((2^{1/(1-\gamma)})^{c})$ . First observe that the probability of reaching a fixed state $s_{i}$ for any $i\geq(c+1)n/2$ from a random walk of length $\mathcal{T}$ is upper bounded by the probability that the random walk takes $i\geq(c+1)n/2$ consecutive steps to the right in the first $\mathcal{T}$ steps. This probability is at most $p=2^{cn}2^{-(c+1)n/2}=2^{(c-1)n/2}$ for any fixed $i$ . Then the probability that the $\mathcal{T}$ steps random walk arrives in any state $s_{i}$ for $i\geq(c+1)n/2$ is also upper bounded by $p$ .

Next, we observe that $V^{*}_{M}(s_{i})$ is a nondecreasing function of $i$ , for both MDPs. Then the average $V^{*}_{M}$ values of the visited states of any fair policy can be broken into two pieces: the average conditioned on the $1-\delta$ fairness and never reaching a state beyond $s_{(c+1)n/2}$ , and the average when fairness might be violated or the uniform random walk of length $\mathcal{T}$ reaches a state beyond $s_{(c+1)n/2}$ . So, we have that

The first inequality follows from the fact that $V^{*}_{M}(s_{i})\leq\tfrac{1}{1-\gamma}$ for all $i$ , and the second from (the line before the last in) Lemma 11 along with $V^{*}_{M}$ values being nondecreasing in $i$ . Putting it all together,

Rearranging and using $\delta<\tfrac{1}{4}$ , we get that $\epsilon$ -optimality requires

Noting that $x$ is minimized when $2^{\tfrac{(c-1)\log(\frac{1}{2\alpha})}{2(1-\gamma)}}$ is maximized, and that this quantity is maximized when $\tfrac{\log(\frac{1}{2\alpha})}{2(1-\gamma)}$ is minimized (as $c-1$ is negative), we get that $\epsilon$ -optimality requires

from $\alpha<\tfrac{1}{8}$ . Similarly, $\alpha<\tfrac{1}{8}$ implies that $\epsilon$ -optimality requires

Note that $0.75-2^{\frac{c-1}{1-\gamma}}$ is minimized when $\gamma$ is small, so $\gamma>c$ implies that $\epsilon$ -optimality requires

Conversely, $1-(2\gamma-1)\gamma^{\frac{1-c}{1-\gamma}}$ is minimized when $\gamma$ is large, so as

we get that $\epsilon$ -optimality requires

Finally, the same trick as in the proof of Theorem 3 can be used to prove the $\Omega((k^{1/(1-\gamma)})^{c})$ lower bound for $k>2$ actions. ∎

A.3 Omitted Proofs for Section 4

there exists an exploitation policy $\pi$ in $M_{\Gamma}$ such that

where the random variables $\pi^{t}(s)$ and $\bar{\pi}^{t}(s)$ denote the states reached from $s$ after following $\pi$ and $\bar{\pi}$ for $t$ steps, respectively, or

there exists an exploration policy $\pi$ in $M_{\Gamma}$ such that the probability that a walk of $2T$ steps from $s$ following $\pi$ will terminate in $s_{0}$ exceeds $\tfrac{\beta}{T}$ .

For any state $s^{\prime}$ , let $p(s^{\prime})$ denote all the paths of length $T$ in $M$ that start in $s^{\prime}$ , $q(s^{\prime})$ denote all the paths of length $T$ in $M$ that start in $s^{\prime}$ such that all the states in every path of length $T$ in $q(s^{\prime})$ are in $\Gamma$ and $r(s^{\prime})$ all the paths of length $T$ in $M$ that start in $s^{\prime}$ such that at least one state in every path of length $T$ in $r(s^{\prime})$ is not in $\Gamma$ . Suppose

Otherwise, $\pi$ already witnesses the claim. We show that a walk of $2T$ steps from $s$ following $\pi$ will terminate in $s_{0}$ with probability of at least $\tfrac{\beta}{T}$ . First,

since $p(\pi^{t}(s))=q(\pi^{t}(s))\cup r(\pi^{t}(s))$ , which is a disjoint union. Next,

where the equality is due to Definition 9 and the definition of $q$ , and the inequality follows because $V^{\pi}_{M_{\Gamma}}(\pi^{t}(s),T)$ is the sum over all the $T$ -paths in $M_{\Gamma}$ , not just those that avoid the absorbing state $s_{0}$ . Therefore by our original assumption on $\pi$ ,

where the last step is the result of applying the previous inequality. However,

Next, note that the exploitation policy (if it exists) can be derived by computing the optimal policy in $M_{\Gamma}$ . Moreover, the exploration policy (if it exists) in the exploitation MDP $M_{\Gamma}$ can indeed be derived by computing the optimal policy in the exploration MDP $M_{[n]\setminus\Gamma}$ as observed by . Finally, by Observation 5, any optimal policy in $\hat{M}_{\Gamma}^{\alpha}$ ( $\hat{M}_{[n]\setminus\Gamma}^{\alpha}$ ) is an optimal policy in $\hat{M}_{\Gamma}$ ( $\hat{M}_{[n]\setminus\Gamma}$ ) ∎

To prove Lemma 10, we need some useful background adapted from Kearns and Singh .

Let $M$ and $\hat{M}$ be two MDPs with the same set of states and actions. We say $\hat{M}$ is a $\beta$ -approximation of $M$ if

For any states $s$ and $s^{\prime}$ and action $a$ ,

Let $M$ be an MDP and $\Gamma$ the set of known states of $M$ . For any $s,s^{\prime}\in\Gamma$ and action $a\in A$ , let $\hat{P}_{M}(s,a,s^{\prime})$ denote the empirical probability transition estimates obtained from the visits to $s$ . Moreover, for any state $s\in\Gamma$ let $\bar{\hat{R}}(s)$ denote the empirical estimates of the average reward obtained from visits to s. Then with probability at least $1-\delta$ ,

Lemma 12 shows that $\hat{M}_{\Gamma}$ and $\hat{M}_{[n]\setminus\Gamma}$ are $O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})$ -approximation MDPs for $M_{\Gamma}$ and $M_{[n]\setminus\Gamma}$ , respectively.

Let $M$ be an MDP and $\hat{M}$ its $O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})$ -approximation. Then for any policy $\pi\in\Pi$ and any state $s$ and action $a$

By Definition 7 and Lemma 12, $\hat{M}_{\Gamma}$ is a $O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})$ -approximation of $M_{\Gamma}$ . Then the statement directly follows by applying Lemma 13. ∎

The only remaining part of the proof of Theorem 6 is the analysis of the probability of failure of Fair-E3. To do so, we break down the probability of failure of Fair-E3 by considering the following (exhaustive) list of possible failures:

At some known state the algorithm has a poor approximation of the next step, causing $\hat{M}_{\Gamma}$ to not be a $O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})$ -approximation of $M_{\Gamma}$ .

At some known state the algorithm has a poor approximation of the $Q^{*}_{M}$ values for one of the actions.

Following the exploration policy for $2T^{*}_{\epsilon}$ steps fails to yield enough visits to unknown states.

At some known state, the approximation value of that state in $\hat{M}_{\Gamma}$ is not an accurate estimate for the value of the state in $M_{\Gamma}$ .

We allocate $\tfrac{\delta}{4}$ of our total probability of failure to each of these sources:

Set $\delta^{\prime}=\tfrac{\delta}{4n}$ in Lemma 10.

Set $\delta^{\prime}=\tfrac{\delta}{4nk}$ in Theorem 7.

By Lemma 8, each attempted exploration is a Bernoulli trial with probability of success of at least $\tfrac{\epsilon}{4T^{*}_{\epsilon}}$ . In the worst case we might need to make every state known before exploiting, leading to the $nm_{Q}$ trajectories ( $m_{Q}$ as Equation 3 in Definition 7) of length ${H_{\epsilon}^{\gamma}}$ . Therefore, the probability of taking fewer than $nm_{Q}$ trajectories of length ${H_{\epsilon}^{\gamma}}$ would be bounded by $\tfrac{\delta}{4}$ if the number of $2T^{*}_{\epsilon}$ steps explorations is at least

Set $\delta^{\prime}=\tfrac{\delta}{4m_{\text{exp}}}$ ( $m_{\text{exp}}$ as defined in Equation 5) in Lemma 10, as Fair-E3 might make $2T^{*}_{\epsilon}$ steps explorations up to $m_{\text{exp}}$ times.

A.4 Relaxing Assumption 2

Throughout Sections 4.3 and 4.4 we assumed that $T^{*}_{\epsilon}$ , the $\epsilon$ -mixing time of the optimal policy $\pi^{*}$ , was known (see Assumption 2). Although Fair-E3 uses the knowledge of $T^{*}_{\epsilon}$ to decide whether to follow the exploration or exploitation policy, Lemma 8 continues to hold even without this assumption. Note that Fair-E3 is parameterized by $T^{*}_{\epsilon}$ and for any input $T^{*}_{\epsilon}$ runs in time $\textbf{poly}(T^{*}_{\epsilon})$ . Thus if $T^{*}_{\epsilon}$ is unknown, we can simply run Fair-E3 for $T^{*}_{\epsilon}=1,2,\ldots$ sequentially and the running time and sample complexity will still be $\textbf{poly}(T^{*}_{\epsilon})$ . Similar to the analysis of Fair-E3 when $T^{*}_{\epsilon}$ is known we have to run the new algorithm for sufficiently many steps so that the possibly low $V^{*}_{M}$ values of the visited states in the early stages are dominated by the near-optimal $V^{*}_{M}$ values of the visited states for large enough guessed values of $T^{*}_{\epsilon}$ .

Appendix B Observations on Optimality and Fairness

For any MDP $M$ , there exists an optimal policy $\pi^{*}$ such that $\pi^{*}$ is fair.

In time $t$ , let state $s_{t}$ denote the state from which $\pi$ chooses an action. Let $a^{*}=\operatorname*{\arg\!\max}_{a}Q_{M}^{*}(s_{t},a)$ and $A^{*}(s_{t})=\{a\in A\mid Q_{M}^{*}(s_{t},a)=Q_{M}^{*}(s_{t},a^{*})\}$ . The policy of playing an action uniformly at random from $A^{*}(s_{t})$ in state $s_{t}$ for all $t$ , is fair and optimal. ∎

Approximate-action fairness, conversely, can be satisfied by any optimal policy, even a deterministic one.

Let $\pi^{*}$ be an optimal policy in MDP $M$ . Then $\pi^{*}$ is approximate-action fair.

Assume that $\pi^{*}$ is not approximate-action fair. Given state $s$ , the action that $\pi^{*}$ takes from $s$ is uniquely determined since $\pi^{*}$ is deterministic we may denote it by $a^{*}$ . Then there exists a time step in which $\pi^{*}$ is in state $s$ and chooses action $a^{*}(s)$ such that there exists another action $a$ with

a contradiction of the optimality of $\pi^{*}$ . ∎

Observations 1 and 2 state that policies with optimal performance are fair; we now state that playing an action uniformly at random is also fair.

An algorithm that, in every state, plays each action uniformly at random (regardless of the history) is fair.

Let $\mathcal{L}$ denote an algorithm that in every state plays uniformly at random between all available actions. Then $\mathcal{L}(s,h_{t-1})_{a}=\mathcal{L}(s,h_{t-1})_{a^{\prime}}$ regardless of state $s$ , (available) action $a$ , or history $h_{t-1}$ . $Q^{*}_{M}(s,a)>Q^{*}_{M}(s,a^{\prime})+\alpha\Rightarrow\mathcal{L}(s,h_{t-1})_{a}\geq\mathcal{L}(s,h_{t-1})_{a^{\prime}}$ then follows immediately, which guarantees both fairness and approximate-action fairness. ∎

Let $M$ be an MDP and $M^{\alpha}$ the $\alpha$ -restricted MDP of $M$ . Let $\pi$ be a policy in $M^{\alpha}$ . Then $\pi$ is $\alpha$ -action fair.

Assume $\pi$ is not $\alpha$ -action fair. Then there must exist round $t$ , state $s$ , and action $a$ such that $Q^{*}_{M}(s,a)>Q^{*}_{M}(s,a^{\prime})+\alpha$ and $\mathcal{L}(s,h_{t-1})_{a}<\mathcal{L}(s,h_{t-1})_{a^{\prime}}$ . Therefore $\mathcal{L}(s,h_{t-1})_{a^{\prime}}>0$ , so $M^{\alpha}$ must include action $a^{\prime}$ from state $s$ . But this is a contradiction, as in state $s$ $M^{\alpha}$ only includes actions $a^{\prime}$ such that $Q^{*}_{M}(s,a^{\prime})+\alpha\geq Q^{*}_{M}(s,a)$ . $\pi$ is therefore $\alpha$ -action fair. ∎

Let $M$ be an MDP and $M^{\alpha}$ the $\alpha$ -restricted MDP of $M$ . Let $\pi^{*}$ be an optimal policy in $M^{\alpha}$ . Then $\pi^{*}$ is also optimal in $M$ .

Appendix C Omitted Details of Fair-E3

We first formally define the exploitation MDP $M_{\Gamma}$ and the exploration MDP $M_{[n]\setminus\Gamma}$ :

Let $M=(\mathcal{S}_{M},\mathcal{A}_{M},P_{M},R_{M},T,\gamma)$ be an MDP with state space $\mathcal{S}_{M}$ and let $\Gamma\subset\mathcal{S}_{M}$ . We define the exploration MDP $M_{\Gamma}=(\mathcal{S}_{M_{\Gamma}},\mathcal{A}_{M},P_{M_{\Gamma}},R_{M_{\Gamma}},T,\gamma)$ on $\Gamma$ where

$\mathcal{S}_{M_{\Gamma}}=\Gamma\cup\{s_{0}\}$ .

For any state $s\in\Gamma$ , $\bar{R}_{M_{\Gamma}}(s)=\bar{R}_{M}(s)$ , rewards in $M_{\Gamma}$ are deterministic, and $\bar{R}_{M_{\Gamma}}(s_{0})=0$ .

For any action $a$ , $P_{M_{\Gamma}}(s_{0},a,s_{0})=1$ . Hence, $s_{0}$ is an absorbing state.

For any states $s_{1},s_{2}\in\Gamma$ and any action $a$ , $P_{M_{\Gamma}}(s_{1},a,s_{2})=P_{M}(s_{1},a,s_{2})$ , i.e. transitions between states in $\Gamma$ are preserved in $M_{\Gamma}$ .

For any state $s_{1}\in\Gamma$ and any action $a$ , $P_{M_{\Gamma}}(s_{1},a,s_{0})=\Sigma_{s_{2}\notin\Gamma}P_{M}(s_{1},a,s_{2})$ . Therefore, all the transitions between a state in $\Gamma$ and states not in $\Gamma$ are directed to $s_{0}$ in $M_{\Gamma}$ .

Given MDP $M$ and set of known states $\Gamma$ , the exploration MDP $M_{[n]\setminus\Gamma}$ on $\Gamma$ is identical to the exploitation MDP $M_{\Gamma}$ except for its reward function. Specifically, rewards in $M_{[n]\setminus\Gamma}$ are deterministic as in $M_{\Gamma}$ , but for any state $s\in\Gamma$ , $\bar{R}_{M_{[n]\setminus\Gamma}}(s)=0$ , and $\bar{R}_{M_{[n]\setminus\Gamma}}(s_{0})=1$ .

We next define the approximation MDPs $\hat{M}_{\Gamma}$ and $\hat{M}_{[n]\setminus\Gamma}$ which are defined over the same set of states and actions as in $M_{\Gamma}$ and $M_{[n]\setminus\Gamma}$ , respectively.

Let $M$ be an MDP and $\Gamma$ the set of known states of $M$ . For any $s,s^{\prime}\in\Gamma$ and action $a\in A$ , let $\hat{P}_{M_{\Gamma}}(s,a,s^{\prime})$ denote the empirical probability transition estimates obtained from the visits to $s$ . Moreover, for any state $s\in\Gamma$ let $\bar{\hat{R}}_{M_{\Gamma}}(s)$ denote the empirical estimates of the average reward obtained from visits to s. Then $\hat{M}_{\Gamma}$ is identical to $M_{\Gamma}$ except that:

in any known state $s\in\Gamma$ , $\hat{R}_{\hat{M}_{\Gamma}}(s)=\bar{\hat{R}}_{M_{\Gamma}}(s)$ .

for any $s,s^{\prime}\in\Gamma$ and action $a\in A$ , $P_{\hat{M}_{\Gamma}}(s,a,s^{\prime})=\hat{P}_{M_{\Gamma}}(s,a,s^{\prime})$ .

Also $\hat{M}_{[n]\setminus\Gamma}$ is identical to $M_{[n]\setminus\Gamma}$ except that:

for any $s,s^{\prime}\in\Gamma$ and action $a\in A$ , $P_{\hat{M}_{[n]\setminus\Gamma}}(s,a,s^{\prime})=\hat{P}_{M_{[n]\setminus\Gamma}}(s,a,s^{\prime})$ .