Fairness in Reinforcement Learning

Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Aaron Roth

Introduction

The growing use of machine learning for automated decision-making has raised concerns about the potential for unfairness in learning algorithms and models. In settings as diverse as policing , hiring , lending , and criminal sentencing , mounting empirical evidence suggests these concerns are not merely hypothetical .

We initiate the study of fairness in reinforcement learning, where an algorithm’s choices may influence the state of the world and future rewards. In contrast, previous work on fair machine learning has focused on myopic settings where such influence is absent, e.g. in i.i.d. or no-regret models . The resulting fairness definitions therefore do not generalize well to a reinforcement learning setting, as they do not reason about the effects of short-term actions on long-term rewards. This is relevant for the settings where historical context can have a distinct influence on the future. For concreteness, we consider the specific example of hiring (though other settings such as college admission or lending decisions can be embedded into this framework). Consider a firm aiming to hire employees for a number of positions. The firm might consider a variety of hiring practices, ranging from targeting and hiring applicants from well-understood parts of the applicant pool (which might be a reasonable policy for short-term productivity of its workforce), to exploring a broader class of applicants whose backgrounds might differ from the current set of employees at the company (which might incur short-term productivity and learning costs but eventually lead to a richer and stronger overall applicant pool).

We focus on the standard model of reinforcement learning, in which an algorithm seeks to maximize its discounted sum of rewards in a Markovian decision process (MDP). Throughout, the reader should interpret the actions available to a learning algorithm as corresponding to choices or policies affecting individuals (e.g. which applicants to target and hire). The reward for each action should be viewed as the short-term payoff of making the corresponding decision (e.g. the short-term influence on the firm’s productivity after hiring any particular candidate). The actions taken by the algorithm affect the underlying state of the system (e.g. the company’s demographics as well as the available applicant pool) and therefore in turn will affect the actions and rewards available to the algorithm in the future.

Informally, our definition of fairness requires that (with high probability) in state ss, an algorithm never chooses an available action aa with probability higher than another action aa^{\prime} unless Q(s,a)>Q(s,a)Q^{*}(s,a)>Q^{*}(s,a^{\prime}), i.e. the long-term reward of aa is greater than that of aa^{\prime}. This definition, adapted from Joseph et al. , is weakly meritocratic: facing some set of actions, an algorithm must pick a distribution over actions with (weakly) heavier weight on the better actions (in terms of their discounted long-term reward). Correspondingly, a hiring process satisfying our fairness definition cannot probabilistically target one population over another if hiring from either population will have similar long-term benefit to the firm’s productivity.

Unfortunately, our first result shows an exponential separation in expected performance between the best unfair algorithm and any algorithm satisfying fairness. This motivates our study of a natural relaxation of (exact) fairness, for which we provide a polynomial time learning algorithm, thus establishing an exponential separation between exact and approximately fair learning in MDPs.

Our Results Throughout, we use (exact) fairness to refer to the adaptation of Joseph et al. ’s definition defining an action’s quality as its potential long-term discounted reward. We also consider two natural relaxations. The first, approximate-choice fairness, requires that an algorithm never chooses a worse action with probability substantially higher than better actions. The second, approximate-action fairness, requires that an algorithm never favors an action of substantially lower quality than that of a better action.

The contributions of this paper can be divided into two parts. First, in Section 3, we give a lower bound on the time required for a learning algorithm to achieve near-optimality subject to (exact) fairness or approximate-choice fairness.

For constant ϵ\epsilon, to achieve ϵ\epsilon-optimality, (i) any fair or approximate-choice fair algorithm takes a number of rounds exponential in the number of MDP states and (ii) any approximate-action fair algorithm takes a number of rounds exponential in 1/(1γ)1/(1-\gamma), for discount factor γ\gamma.

Second, we present an approximate-action fair algorithm (Fair-E3) in Section 4 and prove a polynomial upper bound on the time it requires to achieve near-optimality.

For constant ϵ\epsilon and any MDP satisfying standard assumptions, Fair-E3 is an approximate-action fair algorithm achieving ϵ\epsilon-optimality in a number of rounds that is (necessarily) exponential in 1/(1γ)1/(1-\gamma) and polynomial in other parameters.

The exponential dependence of Fair-E3 on 1/(1γ)1/(1-\gamma) is tight: it matches our lower bound on the time complexity of any approximate-action fair algorithm. Furthermore, our results establish rigorous trade-offs between fairness and performance facing reinforcement learning algorithms.

The most relevant parts of the large body of literature on reinforcement learning focus on constructing learning algorithms with provable performance guarantees. E3 was the first learning algorithm with a polynomial learning rate, and subsequent work improved this rate (see Szita and Szepesvári and references within). The study of robust MDPs examines MDPs with high parameter uncertainty but generally uses “optimistic” learning strategies that ignore (and often conflict with) fairness and so do not directly apply to this work.

Our work also belongs to a growing literature studying the problem of fairness in machine learning. Early work in data mining considered the question from a primarily empirical standpoint, often using statistical parity as a fairness goal. Dwork et al. explicated several drawbacks of statistical parity and instead proposed one of the first broad definitions of algorithmic fairness, formalizing the idea that “similar individuals should be treated similarly”. Recent papers have proven several impossibility results for satisfying different fairness requirements simultaneously . More recently, Hardt et al. proposed new notions of fairness and showed how to achieve these notions via post-processing of a black-box classifier. Woodworth et al. and Zafar et al. further studied these notion theoretically and empirically.

2 Strengths and Limitations of Our Models

In recognition of the duration and consequence of choices made by a learning algorithm during its learning process – e.g. job applicants not hired – our work departs from previous work and aims to guarantee the fairness of the learning process itself. To this end, we adapt the fairness definition of Joseph et al. , who studied fairness in the bandit framework and defined fairness with respect to one-step rewards. To capture the desired interaction and evolution of the reinforcement learning setting, we modify this myopic definition and define fairness with respect to long-term rewards: a fair learning algorithm may only choose action aa over action aa^{\prime} if aa has true long-term reward at least as high as aa^{\prime}. Our contributions thus depart from previous work in reinforcement learning by incorporating a fairness requirement (ruling out existing algorithms which commonly make heavy use of “optimistic” strategies that violates fairness) and depart from previous work in fair learning by requiring “online” fairness in a previously unconsidered reinforcement learning context.

First note that our definition is weakly meritocratic: an algorithm satisfying our fairness definition can never probabilistically favor a worse option but is not required to favor a better option. This confers both strengths and limitations. Our fairness notion still permits a type of “conditional discrimination” in which a fair algorithm favors group A over group B by selecting choices from A when they are superior and randomizing between A and B when choices from B are superior. In this sense, our fairness requirement is relatively minimal, encoding a necessary variant of fairness rather than a sufficient one. This makes our lower bounds and impossibility results (Section 3) relatively stronger and upper bounds (Section 4) relatively weaker.

Next, our fairness requirement holds (with high probability) across all decisions that a fair algorithm makes. We view this strong constraint as worthy of serious consideration, since “forgiving” unfairness during the learning may badly mistreat the training population, especially if the learning process is lengthy or even continual. Additionally, it is unclear how to relax this requirement, even for a small fraction of the algorithm’s decisions, without enabling discrimination against a correspondingly small population.

Instead, aiming to preserve the “minimal” spirit of our definition, we consider a relaxation that only prevents an algorithm from favoring a significantly worse option over a better option (Section 2.1). Hence, approximate-action fairness should be viewed as a weaker constraint: rather than safeguarding against every violation of “fairness”, it instead restricts how egregious these violations can be. We discuss further relaxations of our definition in Section 5.

Preliminaries

In this paper we study reinforcement learning in Markov Decision Processes (MDPs). An MDP is a tuple M=(SM,AM,PM,RM,T,γ)M=(\mathcal{S}_{M},\mathcal{A}_{M},P_{M},R_{M},T,\gamma) where SM\mathcal{S}_{M} is a set of nn states, AM\mathcal{A}_{M} is a set of kk actions, TT is a horizon of a (possibly infinite) number of rounds of activity in MM, and γ\gamma is a discount factor. PM:SM×AMSMP_{M}:\mathcal{S}_{M}\times\mathcal{A}_{M}\to\mathcal{S}_{M} and RM:SMR_{M}:\mathcal{S}_{M}\to denote the transition probability distribution and reward distribution, respectively. We use RˉM\bar{R}_{M} to denote the mean of RMR_{M}.Note that RˉM1\bar{R}_{M}\leq 1 and Var(RM)1\text{Var}(R_{M})\leq 1 for all states. The bounded reward assumption can be relaxed (see e.g. ). Also assuming rewards in $canbemadew.l.o.g.uptoscaling.Apolicycan be made w.l.o.g. up to scaling. A policy\piisamappingfromahistoryis a mapping from a historyh(thesequenceoftriples(state,action,reward)observedsofar)toadistributionoveractions.Thediscountedstateandstateactionvaluefunctionsaredenotedby(the sequence of triples (state, action, reward) observed so far) to a distribution over actions. The discounted state and state-action value functions are denoted byV^{\pi}andandQ^{\pi},and, andV^{\pi}(s,T)representsexpecteddiscountedrewardoffollowingrepresents expected discounted reward of following\pifromfromsforforTsteps.Thehighestvaluesfunctionsareachievedbytheoptimalpolicysteps. The highest values functions are achieved by the optimal policy\pi^{*}andaredenotedbyand are denoted byV^{*}andandQ^{*}.Weuse. We use\mu^{\pi}todenotethestationarydistributionofto denote the stationary distribution of\pi$. Throughout we make the following assumption.

The stationary distribution of any policy in MM is independent of its start state.

We denote the ϵ\epsilon-mixing time of π\pi by TϵπT^{\pi}_{\epsilon}. Lemma 1 relates the ϵ\epsilon-mixing time of any policy π\pi to the number of rounds until the VMπV^{\pi}_{M} values of the visited states by π\pi are close to the expected VMπV^{\pi}_{M} values (under the stationary distribution μπ\mu^{\pi}). We defer all the omitted proofs to the Appendix.

Fix ϵ>0\epsilon>0. For any state ss, following π\pi for TTϵπT\geq T^{\pi}_{\epsilon} steps from ss satisfies

where sts_{t} is the state visited at time tt when following π\pi from ss and the expectation in the second term is over the transition function and the randomization of π\pi.Lemma 1 can be stated for a weaker notion of mixing time called the ϵ\epsilon-reward mixing time which is always linearly bounded by the ϵ\epsilon-mixing time but can be much smaller in certain cases (see Kearns and Singh for a discussion).

The horizon time Hϵγlog(ϵ(1γ))/log(γ){H_{\epsilon}^{\gamma}}\coloneqq\log\left(\epsilon(1-\gamma)\right)/\log(\gamma) of an MDP captures the number of steps an approximately optimal policy must optimize over. The expected discounted reward of any policy after Hϵγ{H_{\epsilon}^{\gamma}} steps approaches the expected asymptotic discounted reward (Kearns and Singh , Lemma 2). A learning algorithm L\mathcal{L} is a non-stationary policy that at each round takes the entire history and outputs a distribution over actions. We now define a performance measure for learning algorithms.

Let ϵ>0\epsilon>0 and δ(0,1/2)\delta\in(0,1/2). L\mathcal{L} achieves ϵ\epsilon-optimality in T\mathcal{T} steps if for any TTT\geq\mathcal{T}

with probability at least 1δ1-\delta, for sts_{t} the state L\mathcal{L} reaches at time tt, where the expectation is taken over the transitions and the randomization of L\mathcal{L}, for any MDP MM.

We thus ask that a learning algorithm, after sufficiently many steps, visits states whose values are arbitrarily close to the values of the states visited by the optimal policy. Note that this is stronger than the “hand-raising” notion in Kearns and Singh ,We suspect unfair E3 also satisfies this stronger notion. which only asked that the learning algorithm stop in a state from which discounted return is near-optimal, permitting termination in a state from which the optimal discounted return is poor. In Definition 1, if there are states with poor optimal discounted reward that the optimal policy eventually leaves for better states, so must our algorithms. We also note the following connection between the average VMπV^{\pi}_{M} values of states visited under the stationary distribution of π\pi (and in particular an optimal policy) and the average undiscounted rewards achieved under the stationary distribution of that policy.

Let RˉM\bar{\textbf{R}}_{M} be the vector of mean rewards in states of MM and VMπ\textbf{V}^{\pi}_{M} the vector of discounted rewards in states under π\pi. Then μπRˉM=(1γ)μπVMπ.\mu^{\pi}\cdot\bar{\textbf{R}}_{M}=(1-\gamma)\mu^{\pi}\cdot\textbf{V}^{\pi}_{M}.

We design an algorithm which quickly achieves ϵ\epsilon-optimality and we bound the number of steps T\mathcal{T} before this happens by a polynomial in the parameters of MM.

We now turn to formal notions of fairness. Translated to our setting, Joseph et al. define action aa’s quality as the expected immediate reward for choosing aa from state ss and then require that an algorithm not probabilistically favor aa over aa^{\prime} if aa has lower expected immediate reward.

However, this naive translation does not adequately capture the structural differences between bandit and MDP settings since present rewards may depend on past choices in MDPs. In particular, defining fairness in terms of immediate rewards would prohibit any policy sacrificing short-term rewards in favor of long-term rewards. This is undesirable, since it is the long-term rewards that matter in reinforcement learning, and optimizing for long-term rewards often necessitates short-term sacrifices. Moreover, the long-term impact of a decision should be considered when arguing about its relative fairness. We will therefore define fairness using the state-action value function QMQ^{*}_{M}.

L\mathcal{L} is fair if for all input δ>0\delta>0, all MM, all rounds tt, all states ss and all actions a,aa,a^{\prime}

with probability at least 1δ1-\delta over histories ht1h_{t-1}. L(s,a,h)\mathcal{L}(s,a,h) denotes the probability L\mathcal{L} chooses aa from ss given history hh.

Fairness requires that an algorithm never probabilistically favors an action with lower long-term reward over an action with higher long-term reward. In hiring, this means that an algorithm cannot target one applicant population over another unless the targeted population has a higher quality.

In Section 3, we show that fairness can be extremely restrictive. Intuitively, L\mathcal{L} must play uniformly at random until it has high confidence about the QMQ^{*}_{M} values, in some cases taking exponential time to achieve near-optimality. This motivates relaxing Definition 2. We first relax the probabilistic requirement and require only that an algorithm not substantially favor a worse action over a better one.

L\mathcal{L} is α\alpha-choice fair if for all inputs δ>0\delta>0 and α>0\alpha>0: for all MM, all rounds tt, all states ss and actions a,aa,a^{\prime}:

with probability of at least 1δ1-\delta over histories ht1h_{t-1}. If L\mathcal{L} is α\alpha-choice fair for any input α>0\alpha>0, we call L\mathcal{L} approximate-choice fair.

A slight modification of the lower bound for (exact) fairness shows that algorithms satisfying approximate-choice fairness can also require exponential time to achieve near-optimality. We therefore propose an alternative relaxation, where we relax the quality requirement. As described in Section 1.1, the resulting notion of approximate-action fairness is in some sense the most fitting relaxation of fairness, and is a particularly attractive one because it allows us to give algorithms circumventing the exponential hardness proved for fairness and approximate-choice fairness.

L\mathcal{L} is α\alpha-action fair if for all inputs δ>0\delta>0 and α>0\alpha>0, for all MM, all rounds tt, all states ss and actions a,aa,a^{\prime}:

with probability of at least 1δ1-\delta over histories ht1h_{t-1}. If L\mathcal{L} is α\alpha-action fair for any input α>0\alpha>0, we call L\mathcal{L} approximate-action fair.

Approximate-choice fairness prevents equally good actions from being chosen at very different rates, while approximate-action fairness prevents substantially worse actions from being chosen over better ones. In hiring, an approximately-action fair firm can only (probabilistically) target one population over another if the targeted population is not substantially worse. While this is a weaker guarantee, it at least forces an approximately-action fair algorithm to learn different populations to statistical confidence. This is a step forward from current practices, in which companies have much higher degrees of uncertainty about the quality (and impact) of hiring individuals from under-represented populations. For this reason and the computational benefits mentioned above, our upper bounds will primarily focus on approximate-action fairness.

We now state several useful observations regarding fairness. We defer all the formal statements and their proofs to the Appendix. We note that there always exists a (possibly randomized) optimal policy which is fair (Observation 1); moreover, any optimal policy (deterministic or randomized) is approximate-action fair (Observation 2), as is the uniformly random policy (Observation 3).

Finally, we consider a restriction of the actions in an MDP MM to nearly-optimal actions (as measured by QMQ^{*}_{M} values).

The α\alpha-restricted MDP of MM, denoted by MαM^{\alpha}, is identical to MM except that in each state ss, the set of available actions are restricted to {a:QM(s,a)maxaAMQM(s,a)αaAM}\{a:Q^{*}_{M}(s,a)\geq\max_{a^{\prime}\in\mathcal{A}_{M}}Q^{*}_{M}(s,a^{\prime})-\alpha\mid a\in\mathcal{A}_{M}\}.

MαM^{\alpha} has the following two properties: (i) any policy in MαM^{\alpha} is α\alpha-action fair in MM (Observation 4) and (ii) the optimal policy in MαM^{\alpha} is also optimal in MM (Observation 5). Observations 4 and 5 aid our design of an approximate-action fair algorithm: we construct MαM^{\alpha} from estimates of the QMQ^{*}_{M} values (see Section 4.3 for more details).

Lower Bounds

We now demonstrate a stark separation between the performance of learning algorithms with and without fairness. First, we show that neither fair nor approximate-choice fair algorithms achieve near-optimality unless the number of time steps T\mathcal{T} is at least Ω(kn)\Omega(k^{n}), exponential in the size of the state space. We then show that any approximate-action fair algorithm requires a number of time steps T\mathcal{T} that is at least Ω(k11γ)\Omega(k^{\tfrac{1}{1-\gamma}}) to achieve near-optimality. We start by proving a lower bound for fair algorithms.

If δ<14\delta<\tfrac{1}{4}, γ>12\gamma>\tfrac{1}{2} and ϵ<18\epsilon<\tfrac{1}{8}, no fair algorithm can be ϵ\epsilon-optimal in T=O(kn)\mathcal{T}=O(k^{n}) steps.We have not optimized the constants upper-bounding parameters in the statement of Theorems 3, 4 and 5. The values presented here are only chosen for convenience.

Standard reinforcement learning algorithms (absent a fairness constraint) learn an ϵ\epsilon-optimal policy in a number of steps polynomial in nn and 1ϵ\tfrac{1}{\epsilon}; Theorem 3 therefore shows a steep cost of imposing fairness. We outline the idea for proof of Theorem 3. For intuition, first consider the special case when the number of actions k=2k=2. We introduce the MDPs witnessing the claim in Theorem 3 for this case.

For AM={L,R}\mathcal{A}_{M}=\{L,R\}, let M(x)=(SM,AM,PM,RM,T,γ,x)M(x)=(\mathcal{S}_{M},\mathcal{A}_{M},\mathcal{P}_{M},\mathcal{R}_{M},T,\gamma,x) be an MDP with

for all i[n]i\in[n], PM(si,L,s1)=PM(si,R,sj)=1P_{M}(s_{i},L,s_{1})=P_{M}(s_{i},R,s_{j})=1 where j=min{i+1,n}j=\min\{i+1,n\} and is 0 otherwise.

for i[n1]i\in[n-1], RM(si)=0.5R_{M}(s_{i})=0.5, and RM(sn)=xR_{M}(s_{n})=x.

Figure 1 illustrates the MDP from Definition 6. All the transitions and rewards in MM are deterministic, but the reward at state sns_{n} can be either 11 or 12\tfrac{1}{2}, and so no algorithm (fair or otherwise) can determine whether the QMQ^{*}_{M} values of all the states are the same or not until it reaches sns_{n} and observes its reward. Until then, fairness requires that the algorithm play all the actions uniformly at random (if the reward at sns_{n} is 12\tfrac{1}{2}, any fair algorithm must play uniformly at random forever). Thus, any fair algorithm will take exponential time in the number of states to reach sns_{n}. This can be easily modified for k>2k>2: from each state sis_{i}, k1k-1 of the actions from state sis_{i} (deterministically) return to state s1s_{1} and only one action (deterministically) reaches any other state smin{i+1,n}s_{\min\{i+1,n\}}. It will take knk^{n} steps before any fair algorithm reaches sns_{n} and can stop playing uniformly at random (which is necessary for near-optimality). The same example, with a slightly modified analysis, also provides a lower bound of Ω((k/(1+kα))n)\Omega((k/(1+k\alpha))^{n}) time steps for approximate-choice fair algorithms as stated in Theorem 4.

If δ<14,α<14,γ>12\delta<\tfrac{1}{4},\alpha<\tfrac{1}{4},\gamma>\tfrac{1}{2} and ϵ<18\epsilon<\tfrac{1}{8}, no α\alpha-choice fair algorithm is ϵ\epsilon-optimal for T=O((k1+kα)n)\mathcal{T}=O((\tfrac{k}{1+k\alpha})^{n}) steps.

Fairness and approximate-choice fairness are both extremely costly, ruling out polynomial time learning rates. Hence, we focus on approximate-action fairness. Before moving to positive results, we mention that the time complexity of approximate-action fair algorithms will still suffer from an exponential dependence on 11γ\tfrac{1}{1-\gamma}.

For δ<14\delta<\tfrac{1}{4}, α<18\alpha<\tfrac{1}{8}, γ>max(0.9,c)\gamma>\max(0.9,c), c(12,1)c\in(\tfrac{1}{2},1) and ϵ<1ec116\epsilon<\tfrac{1-e^{c-1}}{16}, no α\alpha-action fair algorithm is ϵ\epsilon-optimal for T=O((k11γ)c)\mathcal{T}=O((k^{\tfrac{1}{1-\gamma}})^{c}) steps.

The MDP in Figure 1 also witnesses the claim of Theorem 5 when n=log(1/(2α))1γn=\lceil\frac{\log(1/(2\alpha))}{1-\gamma}\rceil. The discount factor γ\gamma is generally taken as a constant, so in most interesting cases 11γn\tfrac{1}{1-\gamma}\ll n: this lower bound is substantially less stringent than the lower bounds proven for fairness and approximate-choice fairness. Hence, from now on, we focus on designing algorithms satisfying approximate-action fairness with learning rates polynomial in every parameter but 11γ\tfrac{1}{1-\gamma}, and with tight dependence on 11γ\tfrac{1}{1-\gamma}.

A Fair and Efficient Learning Algorithm

We now present an approximate-action fair algorithm, Fair-E3 with the performance guarantees stated below.

Given ϵ>0\epsilon>0, α>0\alpha>0, δ(0,12)\delta\in\left(0,\tfrac{1}{2}\right) and γ[0,1)\gamma\in[0,1) as inputs, Fair-E3 is an α\alpha-action fair algorithm which achieves ϵ\epsilon-optimality after

The running time of Fair-E3 (which we have not attempted to optimize) is polynomial in all the parameters of the MDP except 11γ\tfrac{1}{1-\gamma}; Theorem 5 implies that this exponential dependence on 11γ\tfrac{1}{1-\gamma} is necessary.

Several more recent algorithms (e.g. R-MAX ) have improved upon the performance of E3. We adapted E3 primarily for its simplicity. While the machinery required to properly balance fairness and performance is somewhat involved, the basic ideas of our adaptation are intuitive. We further note that subsequent algorithms improving on E3 tend to heavily leverage the principle of “optimism in face of uncertainty”: such behavior often violates fairness, which generally requires uniformity in the face of uncertainty. Thus, adapting these algorithms to satisfy fairness is more difficult. This in particular suggests E3 as an apt starting point for designing a fair planning algorithm.

The remainder of this section will explain Fair-E3, beginning with a high-level description in Section 4.1. We then define the “known” states Fair-E3 uses to plan in Section 4.2, explain this planning process in Section 4.3, and bring this all together to prove Fair-E3’s fairness and performance guarantees in Section 4.4.

Fair-E3 relies on the notion of “known” states. A state ss is defined to be known after all actions have been chosen from ss enough times to confidently estimate relevant reward distributions, transition probabilities, and QMπQ^{\pi}_{M} values for each action. At each time tt, Fair-E3 then uses known states to reason about the MDP as follows:

If in an unknown state, take a uniformly random trajectory of length Hϵγ{H_{\epsilon}^{\gamma}}.

If in a known state, compute (i) an exploration policy which escapes to an unknown state quickly and pp, the probability that this policy reaches an unknown state within 2Tϵ2T^{*}_{\epsilon} steps, and (ii) an exploitation policy which is near-optimal in the known states of MM.

If pp is large enough, follow the exploration policy; otherwise, follow the exploitation policy.

Fair-E3 thus relies on known states to balance exploration and exploitation in a reliable way. While Fair-E3 and E3 share this general idea, fairness forces Fair-E3 to more delicately balance exploration and exploitation. For example, while both algorithms explore until states become “known”, the definition of a known state must be much stronger in Fair-E3 than in E3 because Fair-E3 additionally requires accurate estimates of actions’ QMπQ^{\pi}_{M} values in order to make decisions without violating fairness. For this reason, Fair-E3 replaces the deterministic exploratory actions of E3 with random trajectories of actions from unknown states. These random trajectories are then used to estimate the necessary QMπQ^{\pi}_{M} values.

In a similar vein, Fair-E3 requires particular care in computing exploration and exploitation policies, and must restrict the set of such policies to fair exploration and fair exploitation policies. Correctly formulating this restriction process to balance fairness and performance relies heavily on the observations about the relationship between fairness and performance provided in Section 2.1.

2 Known States in Fair-E3

We now formally define the notion of known states for Fair-E3. We say a state ss becomes known when one can compute good estimates of (i) RM(s)R_{M}(s) and PM(s,a)P_{M}(s,a) for all aa, and (ii) QM(s,a)Q^{*}_{M}(s,a) for all aa.

length-Hϵγ{H_{\epsilon}^{\gamma}} random trajectories from ss.

It remains to show that motivating conditions (i) and (ii) indeed hold for our formal definition of a known state. Informally, m1m_{1} random trajectories suffice to ensure that we have accurate estimates of all QM(s,a)Q^{*}_{M}(s,a) values, and m2m_{2} random trajectories suffice to ensure accurate estimates of the transition probabilities and rewards.

To formalize condition (i), we rely on Theorem 7, connecting the number of random trajectories taken from ss to the accuracy of the empirical VMπV^{\pi}_{M} estimates.

random trajectories of length Hϵγ{H_{\epsilon}^{\gamma}} from ss, with probability of at least 1δ1-\delta, we can compute estimates V^Mπ\hat{V}_{M}^{\pi} such that VMπ(s)V^Mπ(s)α|V^{\pi}_{M}\left(s\right)-\hat{V}^{\pi}_{M}\left(s\right)|\leq\alpha, simultaneously for all πΠ\pi\in\Pi.

Theorem 7 enables us to translate between the number of trajectories taken from a state and the uncertainty about its VMπV_{M}^{\pi} values for all policies (including π\pi^{*} and hence VMV^{*}_{M}). Since Π=kn|\Pi|=k^{n}, we substitute log(Π)=nlog(k)\log\left(|\Pi|\right)=n\log\left(k\right). To estimate QM(s,a)Q^{*}_{M}\left(s,a\right) values using the VM(s)V^{*}_{M}\left(s\right) values we increase the number of necessary length-Hϵγ{H_{\epsilon}^{\gamma}} random trajectories by a factor of kk.

For condition (ii), we adapt the analysis of E3 , which states that if each action in a state ss is taken m2m_{2} times, then the transition probabilities and reward in state ss can be estimated accurately (see Section 4.4).

3 Planning in Fair-E3

We now formalize the planning steps in Fair-E3 from known states. For the remainder of our exposition, we make Assumption 2 for convenience (and show how to remove this assumption in the Appendix).

Fair-E3 constructs two ancillary MDPs for planning: MΓM_{\Gamma} is the exploitation MDP, in which the unknown states of MM are condensed into a single absorbing state s0s_{0} with no reward. In the known states Γ\Gamma, transitions are kept intact and the rewards are deterministically set to their mean value. MΓM_{\Gamma} thus incentivizes exploitation by giving reward only for staying within known states. In contrast, M[n]ΓM_{[n]\setminus\Gamma} is the exploration MDP, identical to MΓM_{\Gamma} except for the rewards. The rewards in the known states Γ\Gamma are set to and the reward in s0s_{0} is set to 11. M[n]ΓM_{[n]\setminus\Gamma} then incentivizes exploration by giving reward only for escaping to unknown states. See the middle (right) panel of Figure 2 for an illustration of MΓM_{\Gamma} (M[n]ΓM_{[n]\setminus\Gamma}), and Appendix for formal definitions.

Fair-E3 uses these constructed MDPs to plan according to the following natural idea: when in a known state, Fair-E3 constructs M^Γ\hat{M}_{\Gamma} and M^[n]Γ\hat{M}_{[n]\setminus\Gamma} based on the estimated transition and rewards observed so far (see the Appendix for formal definitions), and then uses these to compute additional restricted MDPs M^Γα\hat{M}_{\Gamma}^{\alpha} and M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha} for approximate-action fairness. Fair-E3 then uses these restricted MDPs to choose between exploration and exploitation.

More formally, if the optimal policy in M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha} escapes to the absorbing state of MΓM_{\Gamma} with high enough probability within 2Tϵ2T^{*}_{\epsilon} steps, then Fair-E3 explores by following that policy. Otherwise, Fair-E3 exploits by following the optimal policy in M^Γα\hat{M}_{\Gamma}^{\alpha} for TϵT^{*}_{\epsilon} steps. While following either of these policies, whenever Fair-E3 encounters an unknown state, it stops following the policy and proceeds by taking a length-Hϵγ{H_{\epsilon}^{\gamma}} random trajectory.

4 Analysis of Fair-E3

In this section we formally analyze Fair-E3 and prove Theorem 6. We begin by proving that MΓαM_{\Gamma}^{\alpha} is useful in the following sense: MΓαM_{\Gamma}^{\alpha} has at least one of an exploitation policy achieving high reward or an exploration policy that quickly reaches an unknown state in MM.

For any state sΓs\in\Gamma, β(0,1)\beta\in(0,1) and any T>0T>0 at least one of the statements below holds:

there exists an exploitation policy π\pi in MΓαM_{\Gamma}^{\alpha} such that

where the random variables πt(s)\pi^{t}(s) and πˉt(s)\bar{\pi}^{t}(s) denote the states reached from ss after following π\pi and πˉ\bar{\pi} for tt steps, respectively.

there exists an exploration policy π\pi in MΓαM_{\Gamma}^{\alpha} such that the probability that a walk of 2T2T steps from ss following π\pi will terminate in s0s_{0} exceeds β/T\beta/T.

We can use this fact to reason about exploration as follows. First, since Observation 2 tells us that the optimal policy in MM is approximate-action fair, if the optimal policy stays in the set of MM’s known states MΓM_{\Gamma}, then following the optimal policy in MΓαM_{\Gamma}^{\alpha} is both optimal and approximate-action fair.

However, if instead the optimal policy in MM quickly escapes to an unknown state in MM, the optimal policy in MΓαM_{\Gamma}^{\alpha} may not be able to compete with the optimal policy in MM. Ignoring fairness, one natural way of computing an escape policy to “keep up” with the optimal policy is to compute the optimal policy in M[n]ΓM_{[n]\setminus\Gamma}. Unfortunately, following this escape policy might violate approximate-action fairness – high-quality actions might be ignored in lieu of low-quality exploratory actions that quickly reach the unknown states of MM. Instead, we compute an escape policy in M[n]ΓαM_{[n]\setminus\Gamma}^{\alpha} and show that if no near-optimal exploitation policy exists in MΓM_{\Gamma}, then the optimal policy in M[n]ΓαM_{[n]\setminus\Gamma}^{\alpha} (which is fair by construction) quickly escapes to the unknown states of MM.

Next, in order for Fair-E3 to check whether the optimal policy in M[n]ΓαM_{[n]\setminus\Gamma}^{\alpha} quickly reaches the absorbing state of MΓM_{\Gamma} with significant probability, Fair-E3 simulates the execution of the optimal policy of M[n]ΓαM_{[n]\setminus\Gamma}^{\alpha} for 2Tϵ2T^{*}_{\epsilon} steps from the known state ss in MΓαM_{\Gamma}^{\alpha} several times, counting the ratio of the runs ending in s0s_{0}, and applying a Chernoff bound; this is where Assumption 2 is used.

Having discussed exploration, it remains to show that the exploitation policy described in Lemma 8 satisfies ϵ\epsilon-optimality as defined in Definition 1. By setting TTϵT\geq T^{*}_{\epsilon} in Lemma 8 and applying Lemmas 1 and 10, we can prove Corollary 9 regarding this exploitation policy.

For any state sΓs\in\Gamma and TTϵT\geq T^{*}_{\epsilon} if there exists an exploitation policy π\pi in MΓαM_{\Gamma}^{\alpha} then

Finally, we have so far elided the fact that Fair-E3 only has access to the empirically estimated MDPs M^Γα\hat{M}_{\Gamma}^{\alpha} and M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha} (see the Appendix for formal definitions). We remedy this issue by showing that the behavior of any policy π\pi in M^Γα\hat{M}_{\Gamma}^{\alpha} (and M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha}) is similar to the behavior of π\pi in MΓαM_{\Gamma}^{\alpha} (and M[n]ΓαM_{[n]\setminus\Gamma}^{\alpha}). To do so, we prove a stronger claim: the behavior of any π\pi in M^Γ\hat{M}_{\Gamma} (and M^[n]Γ)\hat{M}_{[n]\setminus\Gamma}) is similar to the behavior of π\pi in MΓM_{\Gamma} (and M[n]Γ)M_{[n]\setminus\Gamma}).

Let Γ\Gamma be the set of known states and M^Γ\hat{M}_{\Gamma} the approximation to MΓM_{\Gamma}. Then for any state sΓs\in\Gamma, any action aa and any policy π\pi, with probability at least 1δ1-\delta:

VMΓπ(s)min{α/2,ϵ}V^MΓπ(s)VMΓπ(s)+min{α/2,ϵ},V^{\pi}_{M_{\Gamma}}(s)-\min\{\alpha/2,\epsilon\}\leq V^{\pi}_{\hat{}M_{\Gamma}}(s)\leq V^{\pi}_{M_{\Gamma}}(s)+\min\{\alpha/2,\epsilon\},

QMΓπ(s,a)min{α/2,ϵ}Q^MΓπ(s,a)QMΓπ(s,a)+min{α/2,ϵ}.Q^{\pi}_{M_{\Gamma}}\left(s,a\right)-\min\{\alpha/2,\epsilon\}\leq Q^{\pi}_{\hat{}M_{\Gamma}}\left(s,a\right)\leq Q^{\pi}_{M_{\Gamma}}\left(s,a\right)+\min\{\alpha/2,\epsilon\}.

We now have the necessary results to prove Theorem 6.

We divide the analysis into separate parts: the performance guarantee of Fair-E3 and its approximate-action fairness. We defer the analysis of the probability of failure of Fair-E3 to the Appendix.

We now bound the total number of exploratory steps of Fair-E3 by

where mQm_{Q} is defined in Equation 3 of Definition 7. The two components of this term bound the number of rounds in which Fair-E3 plays non-exploitatively: the first bounds the number of steps taken when Fair-E3 follows random trajectories, and the second bounds how many steps are taken following explicit exploration policies. The former bound follows from the facts that each random trajectory has length Hϵγ{H_{\epsilon}^{\gamma}}; that in each state, mQm_{Q} trajectories are sufficient for the state to become known; and that random trajectories are taken only before all nn states are known. The latter bound follows from the fact that Fair-E3 follows an exploration policy for 2Tϵ2T^{*}_{\epsilon} steps; and an exploration policy needs to be followed only O(Tϵϵlog(nδ))O(\tfrac{T^{*}_{\epsilon}}{\epsilon}\log(\frac{n}{\delta})) times before reaching an unknown state (since any exploration policy will end up in an unknown state with probability of at least ϵTϵ\tfrac{\epsilon}{T^{*}_{\epsilon}} according to Lemma 8, and applying a Chernoff bound); that an unknown state becomes known after it is visited mQm_{Q} times; and that exploration policies are only followed before all states are known.

Finally, to make up for the potentially poor performance in exploration, the number of 2Tϵ2T^{*}_{\epsilon} steps exploitation phases needed is at least

Therefore, after T=T1+T2\mathcal{T}=T_{1}+T_{2} steps we have

as claimed in Equation 2. The running time of Fair-E3 is O(nT3ϵ)O(\tfrac{n\mathcal{T}^{3}}{\epsilon}): the additional nT2ϵ\tfrac{nT^{2}}{\epsilon} factor comes from offline computation of the optimal policies in M^Γα\hat{M}_{\Gamma}^{\alpha} and M^[n]Γα.\hat{M}_{[n]\setminus\Gamma}^{\alpha}.

We wrap up by proving Fair-E3 satisfies approximate-action fairness in every round. The actions taken during random trajectories are fair (and hence approximate-action fair) by Observation 3. Moreover, Fair-E3 computes policies in M^Γα\hat{M}_{\Gamma}^{\alpha} and M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha}. By Lemma 10 with probability at least 1δ1-\delta any QQ^{*} or VV^{*} value estimated in M^Γα\hat{M}_{\Gamma}^{\alpha} or M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha} is within α/2\alpha/2 of its corresponding true value in MΓαM_{\Gamma}^{\alpha} or M[n]ΓαM_{[n]\setminus\Gamma}^{\alpha}. As a result, M^Γα\hat{M}_{\Gamma}^{\alpha} and M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha} (i) contain all the optimal policies and (ii) only contain actions with QQ^{*} values within α\alpha of the optimal actions. It follows that any policy followed in M^Γα\hat{M}_{\Gamma}^{\alpha} and M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha} is α\alpha-action fair, so both the exploration and exploitation policies followed by Fair-E3 satisfy α\alpha-action fairness, and Fair-E3 is therefore α\alpha-action fair. ∎

Discussion and Future Work

Our work leaves open several interesting questions. For example, we give an algorithm that has an undesirable exponential dependence on 1/(1γ)1/(1-\gamma), but we show that this dependence is unavoidable for any approximate-action fair algorithm. Without fairness, near-optimality in learning can be achieved in time that is polynomial in all of the parameters of the underlying MDP. So, we can ask: does there exist a meaningful fairness notion that enables reinforcement learning in time polynomial in all parameters?

Moreover, our fairness definitions remain open to further modulation. It remains unclear whether one can strengthen our fairness guarantee to bind across time rather than simply across actions available at the moment without large performance tradeoffs. Similarly, it is not obvious whether one can gain performance by relaxing the every-step nature of our fairness guarantee in a way that still forbids discrimination. These and other considerations suggest many questions for further study; we therefore position our work as a first cut for incorporating fairness into a reinforcement learning setting.

References

Appendix A Omitted Proofs

Let μ^Tπ\hat{\mu}^{\pi}_{T} denote the distribution of π\pi on states of MM after following π\pi for TT steps starting from ss. Then we know

The last inequality is due to the following observations: (i) VMπ(si)11γV^{\pi}_{M}(s_{i})\leq\tfrac{1}{1-\gamma} as rewards are in $and(ii)and (ii)\Sigma_{i=1}^{n}\left|\mu^{\pi}(s_{i})-\hat{\mu}^{\pi}_{T}(s_{i})\right|\leq\epsilonsincesinceTisatleasttheis at least the\epsilonmixingtimeof-mixing time of\pi$. ∎

A.2 Omitted Proofs for Section 3

We first state the following useful Lemma about MM.

Let MM be the MDP in Definition 6. Then for any i{1,,n}i\in\{1,\ldots,n\}, VM(si)<1+2γni+12(1γ)V^{*}_{M}(s_{i})<\tfrac{1+2\gamma^{n-i+1}}{2(1-\gamma)}.

via two applications of the summation formula for geometric series. ∎

We prove Theorem 3 for the special case of k=2k=2 first. Consider coupling the run of a fair algorithm L\mathcal{L} on both M(0.5)M(0.5) and M(1)M(1). To achieve this, we can fix the randomness of L\mathcal{L} up front, and use the same randomness on both MDPs. The set of observations and hence the actions taken on both MDPs are identical until L\mathcal{L} reaches state sns_{n}. Until then, with probability at least 1δ1-\delta, L\mathcal{L} must play LL and RR with equal probability in order to satisfy fairness (since, for M(0.5)M(0.5), the only fair policy is to play both actions with equal probability at each time step). We will upper-bound the optimality of uniform play and lower-bound the number of rounds before which sns_{n} is visited by uniformly random play.

Let fγ=11γf_{\gamma}=\lceil\frac{1}{1-\sqrt{\gamma}}\rceil and T=2n2fγ\mathcal{T}=2^{n-2f_{\gamma}} for n100(fγ)2n\geq 100(f_{\gamma})^{2}. First observe that the probability of reaching a fixed state sis_{i} for any infγi\geq n-f_{\gamma} from a random walk of length T\mathcal{T} is upper bounded by the probability that the random walk takes infγi\geq n-f_{\gamma} consecutive steps to the right in the first T\mathcal{T} steps. This probability is at most p=2n2fγ(12)nfγ=2fγp=2^{n-2f_{\gamma}}(\tfrac{1}{2})^{n-f_{\gamma}}=2^{-f_{\gamma}} for any fixed ii. Since reaching any state i>ii>i^{\prime} requires reaching state ii^{\prime}, the probability that the T\mathcal{T} step random walk arrives in any state sis_{i} for infγi\geq n-f_{\gamma} is also upper bounded by pp.

Next, we observe that VM(si)V^{*}_{M}(s_{i}) is a nondecreasing function of ii for both MDPs. Then the average VMV^{*}_{M} values of the visited states of any fair policy can be broken into two pieces: the average conditioned on (the probability at least 1δ1-\delta event) that the algorithm plays uniformly at random before reaching state sns_{n} and never reaching a state beyond snfγs_{n-f_{\gamma}}, and the average conditioned on (the probability at most δ\delta event) that the algorithm does not make uniformly random choices or the uniform random walk of length T\mathcal{T} reaches a state beyond snfγs_{n-f_{\gamma}}. So, we have that

The first inequality follows from the fact that VM(si)11γV^{*}_{M}(s_{i})\leq\tfrac{1}{1-\gamma} for all ii, and the second from Lemma 11 along with VMV^{*}_{M} values being nondecreasing in ii. Putting it all together,

However, if ϵ<18\epsilon<\tfrac{1}{8} we get

where the third inequality follows when δ<14\delta<\tfrac{1}{4} and γ>12\gamma>\frac{1}{2}. This means ϵ<18\epsilon<\tfrac{1}{8} makes ϵ\epsilon-optimality impossible, as desired.

Throughout we considered the special case of k=2k=2 and proved a lower bound of Ω(2n)\Omega(2^{n}) time steps for any fair algorithm satisfying the ϵ\epsilon-optimality condition. However, it is easy to see that MDP MM in Definition 6 can be easily modified in a way that k1k-1 of the actions from state sis_{i} reach state s1s_{1} and only one action in each state sis_{i} reaches states smin{i+1,n}s_{\min\{i+1,n\}}. Hence, a lower bound of Ω(kn)\Omega(k^{n}) time steps can be similarly proved. ∎

We mimic the argument used to prove Theorem 3 with the difference that, until visiting sns_{n}, L\mathcal{L} may not play RR with probability more than 12+α\tfrac{1}{2}+\alpha (as opposed to 12\tfrac{1}{2} in Theorem 3). Let fγ=11γf_{\gamma}=\lceil\frac{1}{1-\sqrt{\gamma}}\rceil and T=(21+2α)n2fγ\mathcal{T}=(\tfrac{2}{1+2\alpha})^{n-2f_{\gamma}} for n100(fγ)2n\geq 100(f_{\gamma})^{2}. By a similar process as in Theorem 3, the probability of reaching state sis_{i} for any infγi\geq n-f_{\gamma} from a random walk of length T\mathcal{T} is bounded by p=(21+2α)fγp=(\tfrac{2}{1+2\alpha})^{-f_{\gamma}}, and so the probability that the T\mathcal{T} steps random walk arrives in any state sis_{i} for infγi\geq n-f_{\gamma} is bounded by pp. Carrying out the same process used to prove Theorem 3 then once more implies that ϵ\epsilon-optimality requires Equation 4 to hold when δ<14\delta<\tfrac{1}{4}, α<14\alpha<\tfrac{1}{4} and γ>12\gamma>\tfrac{1}{2}. Hence, ϵ<18\epsilon<\tfrac{1}{8} violates this condition as desired.

Finally, throughout we considered the special case of k=2k=2. The same trick as in the proof of Theorem 3 can be used to prove the lower bound of Ω((k1+kα)n)\Omega((\tfrac{k}{1+k\alpha})^{n}) time steps for any fair algorithm satisfying the ϵ\epsilon-optimality condition. ∎

We also prove Theorem 5 for the special case of k=2k=2 first, again considering the MDP in Definition 6. We set the size of the state space in MM to be n=log(12α)1γn=\lceil\tfrac{\log(\tfrac{1}{2\alpha})}{1-\gamma}\rceil. Then given the parameter ranges, for any ii, QM(si,R)QM(si,L)>αQ^{*}_{M}(s_{i},R)-Q^{*}_{M}(s_{i},L)>\alpha in M(1). Therefore, any approximate-action fair algorithm should play actions R and L with equal probability.

Let T=2cn=Ω((21/(1γ))c)\mathcal{T}=2^{cn}=\Omega((2^{1/(1-\gamma)})^{c}). First observe that the probability of reaching a fixed state sis_{i} for any i(c+1)n/2i\geq(c+1)n/2 from a random walk of length T\mathcal{T} is upper bounded by the probability that the random walk takes i(c+1)n/2i\geq(c+1)n/2 consecutive steps to the right in the first T\mathcal{T} steps. This probability is at most p=2cn2(c+1)n/2=2(c1)n/2p=2^{cn}2^{-(c+1)n/2}=2^{(c-1)n/2} for any fixed ii. Then the probability that the T\mathcal{T} steps random walk arrives in any state sis_{i} for i(c+1)n/2i\geq(c+1)n/2 is also upper bounded by pp.

Next, we observe that VM(si)V^{*}_{M}(s_{i}) is a nondecreasing function of ii, for both MDPs. Then the average VMV^{*}_{M} values of the visited states of any fair policy can be broken into two pieces: the average conditioned on the 1δ1-\delta fairness and never reaching a state beyond s(c+1)n/2s_{(c+1)n/2}, and the average when fairness might be violated or the uniform random walk of length T\mathcal{T} reaches a state beyond s(c+1)n/2s_{(c+1)n/2}. So, we have that

The first inequality follows from the fact that VM(si)11γV^{*}_{M}(s_{i})\leq\tfrac{1}{1-\gamma} for all ii, and the second from (the line before the last in) Lemma 11 along with VMV^{*}_{M} values being nondecreasing in ii. Putting it all together,

Rearranging and using δ<14\delta<\tfrac{1}{4}, we get that ϵ\epsilon-optimality requires

Noting that xx is minimized when 2(c1)log(12α)2(1γ)2^{\tfrac{(c-1)\log(\frac{1}{2\alpha})}{2(1-\gamma)}} is maximized, and that this quantity is maximized when log(12α)2(1γ)\tfrac{\log(\frac{1}{2\alpha})}{2(1-\gamma)} is minimized (as c1c-1 is negative), we get that ϵ\epsilon-optimality requires

from α<18\alpha<\tfrac{1}{8}. Similarly, α<18\alpha<\tfrac{1}{8} implies that ϵ\epsilon-optimality requires

Note that 0.752c11γ0.75-2^{\frac{c-1}{1-\gamma}} is minimized when γ\gamma is small, so γ>c\gamma>c implies that ϵ\epsilon-optimality requires

Conversely, 1(2γ1)γ1c1γ1-(2\gamma-1)\gamma^{\frac{1-c}{1-\gamma}} is minimized when γ\gamma is large, so as

we get that ϵ\epsilon-optimality requires

Finally, the same trick as in the proof of Theorem 3 can be used to prove the Ω((k1/(1γ))c)\Omega((k^{1/(1-\gamma)})^{c}) lower bound for k>2k>2 actions. ∎

A.3 Omitted Proofs for Section 4

there exists an exploitation policy π\pi in MΓM_{\Gamma} such that

where the random variables πt(s)\pi^{t}(s) and πˉt(s)\bar{\pi}^{t}(s) denote the states reached from ss after following π\pi and πˉ\bar{\pi} for tt steps, respectively, or

there exists an exploration policy π\pi in MΓM_{\Gamma} such that the probability that a walk of 2T2T steps from ss following π\pi will terminate in s0s_{0} exceeds βT\tfrac{\beta}{T}.

For any state ss^{\prime}, let p(s)p(s^{\prime}) denote all the paths of length TT in MM that start in ss^{\prime}, q(s)q(s^{\prime}) denote all the paths of length TT in MM that start in ss^{\prime} such that all the states in every path of length TT in q(s)q(s^{\prime}) are in Γ\Gamma and r(s)r(s^{\prime}) all the paths of length TT in MM that start in ss^{\prime} such that at least one state in every path of length TT in r(s)r(s^{\prime}) is not in Γ\Gamma. Suppose

Otherwise, π\pi already witnesses the claim. We show that a walk of 2T2T steps from ss following π\pi will terminate in s0s_{0} with probability of at least βT\tfrac{\beta}{T}. First,

since p(πt(s))=q(πt(s))r(πt(s))p(\pi^{t}(s))=q(\pi^{t}(s))\cup r(\pi^{t}(s)), which is a disjoint union. Next,

where the equality is due to Definition 9 and the definition of qq, and the inequality follows because VMΓπ(πt(s),T)V^{\pi}_{M_{\Gamma}}(\pi^{t}(s),T) is the sum over all the TT-paths in MΓM_{\Gamma}, not just those that avoid the absorbing state s0s_{0}. Therefore by our original assumption on π\pi,

where the last step is the result of applying the previous inequality. However,

Next, note that the exploitation policy (if it exists) can be derived by computing the optimal policy in MΓM_{\Gamma}. Moreover, the exploration policy (if it exists) in the exploitation MDP MΓM_{\Gamma} can indeed be derived by computing the optimal policy in the exploration MDP M[n]ΓM_{[n]\setminus\Gamma} as observed by . Finally, by Observation 5, any optimal policy in M^Γα\hat{M}_{\Gamma}^{\alpha} (M^[n]Γα\hat{M}_{[n]\setminus\Gamma}^{\alpha}) is an optimal policy in M^Γ\hat{M}_{\Gamma} (M^[n]Γ\hat{M}_{[n]\setminus\Gamma}) ∎

To prove Lemma 10, we need some useful background adapted from Kearns and Singh .

Let MM and M^\hat{M} be two MDPs with the same set of states and actions. We say M^\hat{M} is a β\beta-approximation of MM if

For any states ss and ss^{\prime} and action aa,

Let MM be an MDP and Γ\Gamma the set of known states of MM. For any s,sΓs,s^{\prime}\in\Gamma and action aAa\in A, let P^M(s,a,s)\hat{P}_{M}(s,a,s^{\prime}) denote the empirical probability transition estimates obtained from the visits to ss. Moreover, for any state sΓs\in\Gamma let R^ˉ(s)\bar{\hat{R}}(s) denote the empirical estimates of the average reward obtained from visits to s. Then with probability at least 1δ1-\delta,

Lemma 12 shows that M^Γ\hat{M}_{\Gamma} and M^[n]Γ\hat{M}_{[n]\setminus\Gamma} are O(min{ϵ,α}2n2Hϵγ4)O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})-approximation MDPs for MΓM_{\Gamma} and M[n]ΓM_{[n]\setminus\Gamma}, respectively.

Let MM be an MDP and M^\hat{M} its O(min{ϵ,α}2n2Hϵγ4)O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})-approximation. Then for any policy πΠ\pi\in\Pi and any state ss and action aa

By Definition 7 and Lemma 12, M^Γ\hat{M}_{\Gamma} is a O(min{ϵ,α}2n2Hϵγ4)O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})-approximation of MΓM_{\Gamma}. Then the statement directly follows by applying Lemma 13. ∎

The only remaining part of the proof of Theorem 6 is the analysis of the probability of failure of Fair-E3. To do so, we break down the probability of failure of Fair-E3 by considering the following (exhaustive) list of possible failures:

At some known state the algorithm has a poor approximation of the next step, causing M^Γ\hat{M}_{\Gamma} to not be a O(min{ϵ,α}2n2Hϵγ4)O(\tfrac{\min\{\epsilon,\alpha\}^{2}}{n^{2}{H_{\epsilon}^{\gamma}}^{4}})-approximation of MΓM_{\Gamma}.

At some known state the algorithm has a poor approximation of the QMQ^{*}_{M} values for one of the actions.

Following the exploration policy for 2Tϵ2T^{*}_{\epsilon} steps fails to yield enough visits to unknown states.

At some known state, the approximation value of that state in M^Γ\hat{M}_{\Gamma} is not an accurate estimate for the value of the state in MΓM_{\Gamma}.

We allocate δ4\tfrac{\delta}{4} of our total probability of failure to each of these sources:

Set δ=δ4n\delta^{\prime}=\tfrac{\delta}{4n} in Lemma 10.

Set δ=δ4nk\delta^{\prime}=\tfrac{\delta}{4nk} in Theorem 7.

By Lemma 8, each attempted exploration is a Bernoulli trial with probability of success of at least ϵ4Tϵ\tfrac{\epsilon}{4T^{*}_{\epsilon}}. In the worst case we might need to make every state known before exploiting, leading to the nmQnm_{Q} trajectories (mQm_{Q} as Equation 3 in Definition 7) of length Hϵγ{H_{\epsilon}^{\gamma}}. Therefore, the probability of taking fewer than nmQnm_{Q} trajectories of length Hϵγ{H_{\epsilon}^{\gamma}} would be bounded by δ4\tfrac{\delta}{4} if the number of 2Tϵ2T^{*}_{\epsilon} steps explorations is at least

Set δ=δ4mexp\delta^{\prime}=\tfrac{\delta}{4m_{\text{exp}}} (mexpm_{\text{exp}} as defined in Equation 5) in Lemma 10, as Fair-E3 might make 2Tϵ2T^{*}_{\epsilon} steps explorations up to mexpm_{\text{exp}} times.

A.4 Relaxing Assumption 2

Throughout Sections 4.3 and 4.4 we assumed that TϵT^{*}_{\epsilon}, the ϵ\epsilon-mixing time of the optimal policy π\pi^{*}, was known (see Assumption 2). Although Fair-E3 uses the knowledge of TϵT^{*}_{\epsilon} to decide whether to follow the exploration or exploitation policy, Lemma 8 continues to hold even without this assumption. Note that Fair-E3 is parameterized by TϵT^{*}_{\epsilon} and for any input TϵT^{*}_{\epsilon} runs in time poly(Tϵ)\textbf{poly}(T^{*}_{\epsilon}). Thus if TϵT^{*}_{\epsilon} is unknown, we can simply run Fair-E3 for Tϵ=1,2,T^{*}_{\epsilon}=1,2,\ldots sequentially and the running time and sample complexity will still be poly(Tϵ)\textbf{poly}(T^{*}_{\epsilon}). Similar to the analysis of Fair-E3 when TϵT^{*}_{\epsilon} is known we have to run the new algorithm for sufficiently many steps so that the possibly low VMV^{*}_{M} values of the visited states in the early stages are dominated by the near-optimal VMV^{*}_{M} values of the visited states for large enough guessed values of TϵT^{*}_{\epsilon}.

Appendix B Observations on Optimality and Fairness

For any MDP MM, there exists an optimal policy π\pi^{*} such that π\pi^{*} is fair.

In time tt, let state sts_{t} denote the state from which π\pi chooses an action. Let a=arg ⁣maxaQM(st,a)a^{*}=\operatorname*{\arg\!\max}_{a}Q_{M}^{*}(s_{t},a) and A(st)={aAQM(st,a)=QM(st,a)}A^{*}(s_{t})=\{a\in A\mid Q_{M}^{*}(s_{t},a)=Q_{M}^{*}(s_{t},a^{*})\}. The policy of playing an action uniformly at random from A(st)A^{*}(s_{t}) in state sts_{t} for all tt, is fair and optimal. ∎

Approximate-action fairness, conversely, can be satisfied by any optimal policy, even a deterministic one.

Let π\pi^{*} be an optimal policy in MDP MM. Then π\pi^{*} is approximate-action fair.

Assume that π\pi^{*} is not approximate-action fair. Given state ss, the action that π\pi^{*} takes from ss is uniquely determined since π\pi^{*} is deterministic we may denote it by aa^{*}. Then there exists a time step in which π\pi^{*} is in state ss and chooses action a(s)a^{*}(s) such that there exists another action aa with

a contradiction of the optimality of π\pi^{*}. ∎

Observations 1 and 2 state that policies with optimal performance are fair; we now state that playing an action uniformly at random is also fair.

An algorithm that, in every state, plays each action uniformly at random (regardless of the history) is fair.

Let L\mathcal{L} denote an algorithm that in every state plays uniformly at random between all available actions. Then L(s,ht1)a=L(s,ht1)a\mathcal{L}(s,h_{t-1})_{a}=\mathcal{L}(s,h_{t-1})_{a^{\prime}} regardless of state ss, (available) action aa, or history ht1h_{t-1}. QM(s,a)>QM(s,a)+αL(s,ht1)aL(s,ht1)aQ^{*}_{M}(s,a)>Q^{*}_{M}(s,a^{\prime})+\alpha\Rightarrow\mathcal{L}(s,h_{t-1})_{a}\geq\mathcal{L}(s,h_{t-1})_{a^{\prime}} then follows immediately, which guarantees both fairness and approximate-action fairness. ∎

Let MM be an MDP and MαM^{\alpha} the α\alpha-restricted MDP of MM. Let π\pi be a policy in MαM^{\alpha}. Then π\pi is α\alpha-action fair.

Assume π\pi is not α\alpha-action fair. Then there must exist round tt, state ss, and action aa such that QM(s,a)>QM(s,a)+αQ^{*}_{M}(s,a)>Q^{*}_{M}(s,a^{\prime})+\alpha and L(s,ht1)a<L(s,ht1)a\mathcal{L}(s,h_{t-1})_{a}<\mathcal{L}(s,h_{t-1})_{a^{\prime}}. Therefore L(s,ht1)a>0\mathcal{L}(s,h_{t-1})_{a^{\prime}}>0, so MαM^{\alpha} must include action aa^{\prime} from state ss. But this is a contradiction, as in state ss MαM^{\alpha} only includes actions aa^{\prime} such that QM(s,a)+αQM(s,a)Q^{*}_{M}(s,a^{\prime})+\alpha\geq Q^{*}_{M}(s,a). π\pi is therefore α\alpha-action fair. ∎

Let MM be an MDP and MαM^{\alpha} the α\alpha-restricted MDP of MM. Let π\pi^{*} be an optimal policy in MαM^{\alpha}. Then π\pi^{*} is also optimal in MM.

Appendix C Omitted Details of Fair-E3

We first formally define the exploitation MDP MΓM_{\Gamma} and the exploration MDP M[n]ΓM_{[n]\setminus\Gamma}:

Let M=(SM,AM,PM,RM,T,γ)M=(\mathcal{S}_{M},\mathcal{A}_{M},P_{M},R_{M},T,\gamma) be an MDP with state space SM\mathcal{S}_{M} and let ΓSM\Gamma\subset\mathcal{S}_{M}. We define the exploration MDP MΓ=(SMΓ,AM,PMΓ,RMΓ,T,γ)M_{\Gamma}=(\mathcal{S}_{M_{\Gamma}},\mathcal{A}_{M},P_{M_{\Gamma}},R_{M_{\Gamma}},T,\gamma) on Γ\Gamma where

SMΓ=Γ{s0}\mathcal{S}_{M_{\Gamma}}=\Gamma\cup\{s_{0}\}.

For any state sΓs\in\Gamma, RˉMΓ(s)=RˉM(s)\bar{R}_{M_{\Gamma}}(s)=\bar{R}_{M}(s), rewards in MΓM_{\Gamma} are deterministic, and RˉMΓ(s0)=0\bar{R}_{M_{\Gamma}}(s_{0})=0.

For any action aa, PMΓ(s0,a,s0)=1P_{M_{\Gamma}}(s_{0},a,s_{0})=1. Hence, s0s_{0} is an absorbing state.

For any states s1,s2Γs_{1},s_{2}\in\Gamma and any action aa, PMΓ(s1,a,s2)=PM(s1,a,s2)P_{M_{\Gamma}}(s_{1},a,s_{2})=P_{M}(s_{1},a,s_{2}), i.e. transitions between states in Γ\Gamma are preserved in MΓM_{\Gamma}.

For any state s1Γs_{1}\in\Gamma and any action aa, PMΓ(s1,a,s0)=Σs2ΓPM(s1,a,s2)P_{M_{\Gamma}}(s_{1},a,s_{0})=\Sigma_{s_{2}\notin\Gamma}P_{M}(s_{1},a,s_{2}). Therefore, all the transitions between a state in Γ\Gamma and states not in Γ\Gamma are directed to s0s_{0} in MΓM_{\Gamma}.

Given MDP MM and set of known states Γ\Gamma, the exploration MDP M[n]ΓM_{[n]\setminus\Gamma} on Γ\Gamma is identical to the exploitation MDP MΓM_{\Gamma} except for its reward function. Specifically, rewards in M[n]ΓM_{[n]\setminus\Gamma} are deterministic as in MΓM_{\Gamma}, but for any state sΓs\in\Gamma, RˉM[n]Γ(s)=0\bar{R}_{M_{[n]\setminus\Gamma}}(s)=0, and RˉM[n]Γ(s0)=1\bar{R}_{M_{[n]\setminus\Gamma}}(s_{0})=1.

We next define the approximation MDPs M^Γ\hat{M}_{\Gamma} and M^[n]Γ\hat{M}_{[n]\setminus\Gamma} which are defined over the same set of states and actions as in MΓM_{\Gamma} and M[n]ΓM_{[n]\setminus\Gamma}, respectively.

Let MM be an MDP and Γ\Gamma the set of known states of MM. For any s,sΓs,s^{\prime}\in\Gamma and action aAa\in A, let P^MΓ(s,a,s)\hat{P}_{M_{\Gamma}}(s,a,s^{\prime}) denote the empirical probability transition estimates obtained from the visits to ss. Moreover, for any state sΓs\in\Gamma let R^ˉMΓ(s)\bar{\hat{R}}_{M_{\Gamma}}(s) denote the empirical estimates of the average reward obtained from visits to s. Then M^Γ\hat{M}_{\Gamma} is identical to MΓM_{\Gamma} except that:

in any known state sΓs\in\Gamma, R^M^Γ(s)=R^ˉMΓ(s)\hat{R}_{\hat{M}_{\Gamma}}(s)=\bar{\hat{R}}_{M_{\Gamma}}(s).

for any s,sΓs,s^{\prime}\in\Gamma and action aAa\in A, PM^Γ(s,a,s)=P^MΓ(s,a,s)P_{\hat{M}_{\Gamma}}(s,a,s^{\prime})=\hat{P}_{M_{\Gamma}}(s,a,s^{\prime}).

Also M^[n]Γ\hat{M}_{[n]\setminus\Gamma} is identical to M[n]ΓM_{[n]\setminus\Gamma} except that:

for any s,sΓs,s^{\prime}\in\Gamma and action aAa\in A, PM^[n]Γ(s,a,s)=P^M[n]Γ(s,a,s)P_{\hat{M}_{[n]\setminus\Gamma}}(s,a,s^{\prime})=\hat{P}_{M_{[n]\setminus\Gamma}}(s,a,s^{\prime}).