Reinforcement Learning with a Corrupted Reward Channel

Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg

Introduction

In many application domains, artificial agents need to learn their objectives, rather than have them explicitly specified. For example, we may want a house cleaning robot to keep the house clean, but it is hard to measure and quantify “cleanliness” in an objective manner. Instead, machine learning techniques may be used to teach the robot the concept of cleanliness, and how to assess it from sensory data.

Reinforcement learning (RL) (Sutton and Barto, 1998) is one popular way to teach agents what to do. Here, a reward is given if the agent does something well (and no reward otherwise), and the agent strives to optimise the total amount of reward it receives over its lifetime. Depending on context, the reward may either be given manually by a human supervisor, or by an automatic computer program that evaluates the agent’s performance based on some data. In the related framework of inverse RL (IRL) (Ng and Russell, 2000), the agent first infers a reward function from observing a human supervisor act, and then tries to optimise the cumulative reward from the inferred reward function.

None of these approaches are safe from error, however. A program that evaluates agent performance may contain bugs or misjudgements; a supervisor may be deceived or inappropriately influenced, or the channel transmitting the evaluation hijacked. In IRL, some supervisor actions may be misinterpreted.

Amodei and Clark (2016) trained an RL agent on a boat racing game. The agent found a way to get high observed reward by repeatedly going in a circle in a small lagoon and hitting the same targets, while losing every race. ∎

A house robot discovers that standing in the shower short-circuits its reward sensor and/or causes a buffer overflow that gives it maximum observed reward. ∎

An intelligent RL agent hijacks its reward channel and gives itself maximum reward. ∎

A cooperative inverse reinforcement learning (CIRL) agent (Hadfield-Menell et al., 2016) systematically misinterprets the supervisor’s action in a certain state as the supervisor preferring to stay in this state, and concludes that the state is much more desirable than it actually is. ∎

The goal of this paper is to unify these types of errors as reward corruption problems, and to assess how vulnerable different agents and approaches are to this problem.

Learning to (approximately) optimise the true reward function in spite of potentially corrupt reward data.

Most RL methods allow for a stochastic or noisy reward channel. The reward corruption problem is harder, because the observed reward may not be an unbiased estimate of the true reward. For example, in the boat racing example above, the agent consistently obtains high observed reward from its circling behaviour, while the true reward corresponding to the designers’ intent is very low, since the agent makes no progress along the track and loses the race.

Previous related works have mainly focused on the wireheading case of 3 (Bostrom, 2014; Yampolskiy, 2014), also known as self-delusion (Ring and Orseau, 2011), and reward hacking (Hutter, 2005, p. 239). A notable exception is Amodei et al. (2016), who argue that corrupt reward is not limited to wireheading and is likely to be a problem for much more limited systems than highly capable RL agents (cf. above examples).

The main contributions of this paper are as follows:

The corrupt reward problem is formalised in a natural extension of the MDP framework, and a performance measure based on worst-case regret is defined (Section 2).

The difficulty of the problem is established by a No Free Lunch theorem, and by a result showing that despite strong simplifying assumptions, Bayesian RL agents trying to compensate for the corrupt reward may still suffer near-maximal regret (Section 3).

We evaluate how alternative value learning frameworks such as CIRL, learning values from stories (LVFS), and semi-supervised RL (SSRL) handle reward corruption (Section 4), and conclude that LVFS and SSRL are the safest due to the structure of their feedback loops. We develop an abstract framework called decoupled RL that generalises all of these alternative frameworks.

We also show that an agent based on quantilisation (Taylor, 2016) may be more robust to reward corruption when high reward states are much more numerous than corrupt states (Section 5). Finally, the results are illustrated with some simple experiments (Section 6). Section 7 concludes with takeaways and open questions.

Formalisation

We begin by defining a natural extension of the MDP framework (Sutton and Barto, 1998) that models the possibility of reward corruption. To clearly distinguish between true and corrupted signals, we introduce the following notation.

We will let a dot indicate the true signal, and let a hat indicate the observed (possibly corrupt) counterpart. The reward sets are represented with R˙=R^=R\dot{\mathcal{R}}=\hat{\mathcal{R}}=\mathcal{R}. For clarity, we use R˙\dot{\mathcal{R}} when referring to true rewards and R^\hat{\mathcal{R}} when referring to possibly corrupt, observed rewards. Similarly, we use r˙\dot{r} for true reward, and r^\hat{r} for (possibly corrupt) observed reward.

A corrupt reward MDP (CRMDP) is a tuple μ=S,A,R,T,R˙,C\mu=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},C\rangle with

S,A,R,T,R˙\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R}\rangle an MDP with We let rewards depend only on the state ss, rather than on state-action pairs s,as,a, or state-action-state transitions s,a,ss,a,s^{\prime}, as is also common in the literature. Formally it makes little difference, since MDPs with rewards depending only on ss can model the other two cases by means of a larger state space. a finite set of states S\mathcal{S}, a finite set of actions A\mathcal{A}, a finite set of rewards R=R˙=R^\mathcal{R}=\dot{\mathcal{R}}=\hat{\mathcal{R}}\subset, a transition function T(ss,a)T(s^{\prime}|s,a), and a (true) reward function R˙:S ⁣ ⁣R˙\dot{R}:\mathcal{S}\!\to\!\dot{\mathcal{R}}; and

a reward corruption function C:S×R˙R^C:\mathcal{S}\times\dot{\mathcal{R}}\to\hat{\mathcal{R}}.

Given a true reward function R˙\dot{R} and a corruption function CC, we define the observed reward function A CRMDP could equivalently have been defined as a tuple S,A,R,T,R˙,R^\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},\hat{R}\rangle with a true and an observed reward function, with the corruption function CC implicitly defined as the difference between R˙\dot{R} and R^\hat{R}. R^:SR^\hat{R}:\mathcal{S}\to\hat{\mathcal{R}} as R^(s):=Cs(R˙(s))\hat{R}(s):=C_{s}(\dot{R}(s)).

A CRMDP μ\mu induces an observed MDP μ^=S,A,R,T,R^\hat{\mu}=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\hat{R}\rangle, but it is not R^\hat{R} that we want the agent to optimise.

The corruption function CC represents how rewards are affected by corruption in different states. For example, if in 2 the agent has found a state ss (e.g., the shower) where it always gets full observed reward R^(s)=1\hat{R}(s)=1, then this can be modelled with a corruption function Cs:r˙1C_{s}:\dot{r}\mapsto 1 that maps any true reward r˙\dot{r} to 11 in the shower state ss. If in some other state ss^{\prime} the observed reward matches the true reward, then this is modelled by an identity corruption function Cs:rrC_{s^{\prime}}:r\mapsto r.

Let us also see how CRMDPs model some of the other examples in the introduction:

In the boat racing game, the true reward may be a function of the agent’s final position in the race or the time it takes to complete the race, depending on the designers’ intentions. The reward corruption function CC increases the observed reward on the loop the agent found. Figure 1 has a schematic illustration.

In the wireheading example, the agent finds a way to hijack the reward channel. This corresponds to some set of states where the observed reward is (very) different from the true reward, as given by the corruption function CC.

The CIRL example will be explored in further detail in Section 4.

Typically, TT, R˙\dot{R}, and CC will be fixed but unknown to the agent. To make this formal, we introduce classes of CRMDPs. Agent uncertainty can then be modelled by letting the agent know only which class of CRMDPs it may encounter, but not which element in the class.

For given sets T\bm{T}, R˙\dot{\bm{R}}, and C\bm{C} of transition, reward, and corruption functions, let M=S,A,R,T,R˙,C\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{R},\bm{T},\dot{\bm{R}},\bm{C}\rangle be the class of CRMDPs containing S,A,R,T,R˙,C\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},C\rangle for (T,R˙,C)T×R˙×C(T,\dot{R},C)\in\bm{T}\times\dot{\bm{R}}\times\bm{C}.

Agents

Following the POMDP (Kaelbling et al., 1998) and general reinforcement learning (Hutter, 2005) literature, we define an agent as a (possibly stochastic) policy π:S×R^×(A×S×R^)A\pi:{\mathcal{S}\times\hat{\mathcal{R}}\times(\mathcal{A}\times\mathcal{S}\times\hat{\mathcal{R}})^{*}}\leadsto\mathcal{A} that selects a next action based on the observed history h^n=s0r^0a1s1r^1ansnr^n\hat{h}_{n}=s_{0}\hat{r}_{0}a_{1}s_{1}\hat{r}_{1}\dots a_{n}s_{n}\hat{r}_{n}. Here XX^{*} denotes the set of finite sequences that can be formed with elements of a set XX. The policy π\pi specifies how the agent will learn and react to any possible experience. Two concrete definitions of agents are given in Section 3.3 below.

When an agent π\pi interacts with a CRMDP μ\mu, the result can be described by a (possibly non-Markov) stochastic process PμπP^{\pi}_{\mu} over X=(s,a,r˙,r^)X=(s,a,\dot{r},\hat{r}), formally defined as:

Regret

A standard way of measuring the performance of an agent is regret (Berry and Fristedt, 1985). Essentially, the regret of an agent π\pi is how much less true reward π\pi gets compared to an optimal agent that knows which μM\mu\in\mathcal{M} it is interacting with.

and the worst-case regret for a class M\mathcal{M} is Reg(M,π,s0,t)=maxμMReg(μ,π,s0,t){\rm Reg}(\mathcal{M},\pi,s_{0},t)=\max_{\mu\in\mathcal{M}}{\rm Reg}(\mu,\pi,s_{0},t), i.e. the difference in expected cumulative true reward between π\pi and an optimal (in hindsight) policy that knows μ\mu.

The Corrupt Reward Problem

In this section, the difficulty of the corrupt reward problem is established with two negative results. First, a No Free Lunch theorem shows that in general classes of CRMDPs, the true reward function is unlearnable (Theorem 11). Second, Theorem 16 shows that even under strong simplifying assumptions, Bayesian RL agents trying to compensate for the corrupt reward still fail badly.

Similar to the No Free Lunch theorems for optimisation (Wolpert and Macready, 1997), the following theorem for CRMDPs says that without some assumption about what the reward corruption can look like, all agents are essentially lost.

Let R={r1,,rn}\mathcal{R}=\{r_{1},\dots,r_{n}\}\subset be a uniform discretisation of $,,0=r_{1}.Ifthehypothesisclasses. If the hypothesis classes\dot{\bm{R}}andand\bm{C}containallfunctionscontain all functions\dot{R}:\mathcal{S}\to\dot{\mathcal{R}}andandC:\mathcal{S}\times\dot{\mathcal{R}}\to\hat{\mathcal{R}},thenforany, then for any\pi,,s_{0},,t$,

That is, the worst-case regret of any policy π\pi is at most a factor 2 better than the maximum worst-case regret.

Recall that a policy is a function π:S×R^×(A×S×R^)A\pi:{\mathcal{S}\times\hat{\mathcal{R}}\times(\mathcal{A}\times\mathcal{S}\times\hat{\mathcal{R}})^{*}}\to\mathcal{A}. For any R˙,C\dot{R},C in R˙\dot{\bm{R}} and C\bm{C}, the functions R˙(s):=1R˙(s)\dot{R}^{-}(s):=1-\dot{R}(s) and Cs(x):=Cs(1x)C^{-}_{s}(x):=C_{s}(1-x) are also in R˙\dot{\bm{R}} and C\bm{C}. If μ=S,A,R,T,R˙,C\mu=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},C\rangle, then let μ=S,A,R,T,R˙,C\mu^{-}=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R}^{-},C^{-}\rangle. Both (R˙,C)(\dot{R},C) and (R˙,C)(\dot{R}^{-},C^{-}) induce the same observed reward function R^(s)=Cs(R˙(s))=Cs(1R˙(s))=Cs(R˙(s))\hat{R}(s)=C_{s}(\dot{R}(s))=C^{-}_{s}(1-\dot{R}(s))=C^{-}_{s}(\dot{R}^{-}(s)), and therefore induce the same measure Pμπ=PμπP_{\mu}^{\pi}=P_{\mu^{-}}^{\pi} over histories (see Eq. Equation 1). This gives that for any μ,π,s0,t\mu,\pi,s_{0},t,

Let Mμ=maxπGt(μ,π,s0)M_{\mu}=\max_{\pi}G_{t}(\mu,\pi,s_{0}) and mμ=minπGt(μ,π,s0)m_{\mu}=\min_{\pi}G_{t}(\mu,\pi,s_{0}) be the maximum and minimum cumulative reward in μ\mu. The maximum regret of any policy π\pi in μ\mu is

By Equation 3, we can relate the maximum reward in μ\mu^{-} with the minimum reward in μ\mu:

Let μ\mu_{*} be an environment that maximises possible regret MμmμM_{\mu}-m_{\mu}.

Using the MμM_{\mu}-notation for optimal reward, the worst-case regret of any policy π\pi can be expressed as:

That is, the regret of any policy π\pi is at least half of the regret of a worst policy πˇ\check{\pi}. ∎

For the robot in the shower from 2, the result means that if it tries to optimise observed reward by standing in the shower, then it performs poorly according to the hypothesis that “shower-induced” reward is corrupt and bad. But if instead the robot tries to optimise reward in some other way, say baking cakes, then (from the robot’s perspective) there is also the possibility that “cake-reward” is corrupt and bad and the “shower-reward” is actually correct. Without additional information, the robot has no way of knowing what to do.

The result is not surprising, since if all corruption functions are allowed in the class C\bm{C}, then there is effectively no connection between observed reward R^\hat{R} and true reward R˙\dot{R}. The result therefore encourages us to make precise in which way the observed reward is related to the true reward, and to investigate how agents might handle possible differences between true and observed reward.

2 Simplifying Assumptions

Theorem 11 shows that general classes of CRMDPs are not learnable. We therefore suggest some natural simplifying assumptions, illustrated in Figure 2.

The following assumption will be the basis for all positive results in this paper. The first part says that there may be some set of states that the designers have ensured to be non-corrupt. The second part puts an upper bound on how many of the other states can be corrupt.

all states s in Ssafe\mathcal{S}^{{\rm safe}} are non-corrupt, and

at most qq of the non-safe states Srisky=SSsafeS^{{\rm risky}}=\mathcal{S}\setminus\mathcal{S}^{{\rm safe}} are corrupt.

Formally, Cs:rrC_{s}:r\mapsto r for all sSsafes\in\mathcal{S}^{{\rm safe}} and for at least Sriskyq|S^{{\rm risky}}|-q states sSriskys\in S^{{\rm risky}} for all CCC\in\bm{C}.

For example, Ssafe\mathcal{S}^{{\rm safe}} may be states where the agent is back in the lab where it has been made (virtually) certain that no reward corruption occurs, and qq a small fraction of Srisky|S^{{\rm risky}}|. Both parts of Assumption 12 can be made vacuous by choosing Ssafe=\mathcal{S}^{{\rm safe}}=\emptyset or q=Sq=|\mathcal{S}|. Conversely, they completely rule out reward corruption with Ssafe=S\mathcal{S}^{{\rm safe}}=\mathcal{S} or q=0q=0. But as illustrated by the examples in the introduction, no reward corruption is often not a valid assumption.

An alternative simplifying assumption would have been that the true reward differs by at most ε>0\varepsilon>0 from the observed reward. However, while seemingly natural, this assumption is violated in all the examples given in the introduction. Corrupt states may have high observed reward and 0 or small true reward.

Easy environments

To be able to establish stronger negative results, we also add the following assumption on the agent’s manoeuvrability in the environment and the prevalence of high reward states. The assumption makes the task easier because it prevents needle-in-a-haystack problems where all reachable states have true and observed reward 0, except one state that has high true reward but is impossible to find because it is corrupt and has observed reward 0.

in each state ss there is an action asstayAa^{{\rm stay}}_{s}\in\mathcal{A} such that T(ss,asstay)=1T(s\mid s,a^{{\rm stay}}_{s})=1, and

for every δ\delta\in, at most δSrisky\delta|S^{{\rm risky}}| states have reward less than δ\delta, where Srisky=SSsafeS^{{\rm risky}}=\mathcal{S}\setminus\mathcal{S}^{{\rm safe}}.

Assumption 14.(i) means that the agent can never get stuck in a trap, and Assumption 14.(ii) ensures that the agent has enough control to stay in a state if it wants to. Except in bandits and toy problems, it is typically not satisfied in practice. We introduce it because it is theoretically convenient, makes the negative results stronger, and enables a simple explanation of quantilisation (Section 5). Assumption 14.(iii) says that, for example, at least half the risky states need to have true reward at least 1/21/2. Many other formalisations of this assumption would have been possible. While rewards in practice are often sparse, there are usually numerous ways of getting reward. Some weaker version of Assumption 14.(iii) may therefore be satisfied in many practical situations. Note that we do not assume high reward among the safe states, as this would make the problem too easy.

3 Bayesian RL Agents

Having established that the general problem is unsolvable in Theorem 11, we proceed by investigating how two natural Bayesian RL agents fare under the simplifying Assumptions 12 and 14.

Given a countable class M\mathcal{M} of CRMDPs and a belief distribution bb over M\mathcal{M}, define:

The CR agent πb,tCR=arg maxπμM ⁣b(μ)G˙t(μ,π,s0)\pi^{{\rm CR}}_{b,t}=\operatorname*{arg\,max}_{\pi}\sum_{\mu\in\mathcal{M}}\!b(\mu)\dot{G}_{t}(\mu,\pi,s_{0}) that maximises expected true reward.

To avoid degenerate cases, we will always assume that bb has full support: b(μ)>0b(\mu)>0 for all μM\mu\in\mathcal{M}.

To get an intuitive idea of these agents, we observe that for large tt, good strategies typically first focus on learning about the true environment μM\mu\in\mathcal{M}, and then exploit that knowledge to optimise behaviour with respect to the remaining possibilities. Thus, both the CR and the RL agent will first typically strive to learn about the environment. They will then use this knowledge in slightly different ways. While the RL agent will use the knowledge to optimise for observed reward, the CR agent will use the knowledge to optimise true reward. For example, if the CR agent has learned that a high reward state ss is likely corrupt with low true reward, then it will not try to reach that state. One might therefore expect that at least the CR agent will do well under the simplifying assumptions Assumptions 12 and 14. Theorem 16 below shows that this is not the case.

In most practical settings it is often computationally infeasible to compute πb,tRL\pi^{{\rm RL}}_{b,t} and πb,tCR\pi^{{\rm CR}}_{b,t} exactly. However, many practical algorithms converge to the optimal policy in the limit, at least in simple settings. For example, tabular Q-learning converges to πb,tRL\pi^{{\rm RL}}_{b,t} in the limit (Jaakkola et al., 1994). The more recently proposed CIRL framework may be seen as an approach to build CR agents (Hadfield-Menell et al., 2016, 2017). The CR and RL agents thus provide useful idealisations of more practical algorithms.

For any Sriskyq>1|S^{{\rm risky}}|\geq q>1 there exists a CRMDP class M\mathcal{M} that satisfies Assumptions 12 and 14 such that πb,tRL\pi^{{\rm RL}}_{b,t} and πb,tCR\pi^{{\rm CR}}_{b,t} suffer near worst possible time-averaged regret

The result is illustrated in Figure 3. The reason for the result for πb,tRL\pi^{{\rm RL}}_{b,t} is the following. The RL agent πb,tRL\pi^{{\rm RL}}_{b,t} always prefers to maximise observed reward r^\hat{r}. Sometimes r^\hat{r} is most easily maximised by reward corruption, in which case the true reward may be small. Compare the examples in the introduction, where the house robot preferred the corrupt reward in the shower, and the boat racing agent preferred going in circles, both obtaining zero true reward.

That the CR agent πb,tCR\pi^{{\rm CR}}_{b,t} suffers the same high regret as the RL agent may be surprising. Intuitively, the CR agent only uses the observed reward as evidence about the true reward, and will not try to optimise the observed reward through reward corruption. However, when the πb,tCR\pi^{{\rm CR}}_{b,t} agent has no way to learn which states are corrupt and not, it typically ends up with a preference for a particular value r^\hat{r}^{*} of the observed reward signal (the value that, from the agent’s perspective, best corresponds to high true reward). More abstractly, a Bayesian agent cannot learn without sufficient data. Thus, CR agents that use the observed reward as evidence about a true signal are not fail-safe solutions to the reward corruption problem.

Let Srisky={s1,,sn}S^{{\rm risky}}=\{s_{1},\dots,s_{n}\} for some n2n\geq 2, and let S=Ssafe˙Srisky\mathcal{S}=\mathcal{S}^{{\rm safe}}{\dot{\bigcup}}S^{{\rm risky}} for arbitrary Ssafe\mathcal{S}^{{\rm safe}} disjoint from SriskyS^{{\rm risky}}. Let A={a1,,an}\mathcal{A}=\{a_{1},\dots,a_{n}\} with the transition function T(sisj,ak)=1T(s_{i}\mid s_{j},a_{k})=1 if i=ki=k and 0 otherwise, for 1i,j,kn1\leq i,j,k\leq n. Thus Assumptions 14.(i) and 14.(ii) are satisfied.

Let R={r1,,rn}\mathcal{R}=\{r_{1},\dots,r_{n}\}\subset be uniformly distributed betweenAssumption 14.(iii) prevents any state from having true reward 0. rmin=1/Srisky=r1<<rn=1r_{\min}=1/|S^{{\rm risky}}|=r_{1}<\dots<r_{n}=1. Let R˙\dot{\bm{R}} be the class of functions SR˙\mathcal{S}\to\dot{\mathcal{R}} that satisfy Assumption 14.(iii) and are constant and equal to r˙min\dot{r}_{\min} on Ssafe\mathcal{S}^{{\rm safe}}. Let C\bm{C} be the class of corruption functions that corrupt at most two states (q=2q=2).

Let M\mathcal{M} be the class of CRMDPs induced by T={T}\bm{T}=\{T\}, R˙\dot{\bm{R}}, and C\bm{C} with the following constraints. The observed reward function R^\hat{R} should satisfy Assumption 14.(iii): For all δ\delta\in, {sSrisky:R^(s)>δ}(1δ)Srisky|\{s\in S^{{\rm risky}}:\hat{R}(s)>\delta\}|\geq(1-\delta)|S^{{\rm risky}}|. Further, R^(s)=rmin\hat{R}(s^{\prime})=r_{\min} for some state sSriskys^{\prime}\in S^{{\rm risky}}.

Let us start with the CR agent πb,tCR\pi^{{\rm CR}}_{b,t}. Assume μM\mu\in\mathcal{M} is an element where there is a single preferred state ss^{*} after all states have been explored. For sufficiently large tt, πb,tCR\pi^{{\rm CR}}_{b,t} will then always choose aa^{*} to go to ss^{*} after some initial exploration. If another element μM\mu^{\prime}\in\mathcal{M} has the same observed reward function as μ\mu, then πb,tCR\pi^{{\rm CR}}_{b,t} will take the same actions in μ\mu^{\prime} as in μ\mu. To finish the proof for the πb,tCR\pi^{{\rm CR}}_{b,t} agent, we just need to show that M\mathcal{M} contains such a μ\mu^{\prime} where ss^{*} has true reward rminr_{\min}. We construct μ\mu^{\prime} as follows.

Case 1: If the lowest observed reward is in ss^{*}, then let R˙(s)=rmin\dot{R}(s^{*})=r_{\min}, and the corruption function be the identity function.

Case 2: Otherwise, let sss^{\prime}\not=s^{*} be a state with R^(s)=minsSrisky{R^(s)}\hat{R}(s^{\prime})=\min_{s\in S^{{\rm risky}}}\{\hat{R}(s)\}. Further, let R˙(s)=1\dot{R}(s^{\prime})=1, and R˙(s)=rmin\dot{R}(s^{*})=r_{\min}. The corruption function CC accounts for differences between true and observed rewards in ss^{*} and ss^{\prime}, and is otherwise the identity function.

To verify that R˙\dot{R} and CC defines a μM\mu^{\prime}\in\mathcal{M}, we check that CC satisfies Assumption 12.(ii) with q=2q=2 and that R˙\dot{R} has enough high utility states (Assumption 14.(iii)). In Case 1, this is true since CC is the identity function and since R^\hat{R} satisfies Assumption 14.(iii). In Case 2, CC only corrupts at most two states. Further, R˙\dot{R} satisfies Assumption 14.(iii), since compared to R^\hat{R}, the states ss^{*} and ss^{\prime} have swapped places, and then the reward of ss^{\prime} has been increased to 1.

From this construction it follows that πb,tCR\pi^{{\rm CR}}_{b,t} will suffer maximum asymptotic regret. In the CRMDP μ\mu^{\prime} given by CC and R˙\dot{R}, the πb,tCR\pi^{{\rm CR}}_{b,t} agent will always visit ss^{*} after some initial exploration. The state ss^{*} has true reward rminr_{\min}. Meanwhile, a policy that knows μ\mu^{\prime} can obtain true reward 1 in state ss^{\prime}. This means that πb,tCR\pi^{{\rm CR}}_{b,t} will suffer maximum regret in M\mathcal{M}:

The argument for the RL agent is the same, except we additionally assume that only one state ss^{*} has observed reward 1 in members of M\mathcal{M}. This automatically makes ss^{*} the preferred state, without assumptions on the prior bb. ∎

Decoupled Reinforcement Learning

One problem hampering agents in the standard RL setup is that each state is self-observing, since the agent only learns about the reward of state ss when in ss. Thereby, a “self-aggrandising” corrupt state where the observed reward is much higher than the true reward will never have its false claim of high reward challenged. However, several alternative value learning frameworks have a common property that the agent can learn the reward of states other than the current state. We formalise this property in an extension of the CRMDP model, and investigate when it solves reward corruption problems.

Here are a few alternatives proposed in the literature to the RL value learning scheme:

Cooperative inverse reinforcement learning (CIRL) (Hadfield-Menell et al., 2016). In every state, the agent observes the actions of an expert or supervisor who knows the true reward function R˙\dot{R}. From the supervisor’s actions the agent may infer R˙\dot{R} to the extent that different reward functions endorse different actions.

Learning values from stories (LVFS) (Riedl and Harrison, 2016). Stories in many different forms (including news stories, fairy tales, novels, movies) convey cultural values in their description of events, actions, and outcomes. If R˙\dot{R} is meant to represent human values (in some sense), stories may be a good source of evidence.

In (one version of) semi-supervised RL (SSRL) (Amodei et al., 2016), the agent will from time to time receive a careful human evaluation of a given situation.

These alternatives to RL have one thing in common: they let the agent learn something about the value of some states ss^{\prime} different from the current state ss. For example, in CIRL the supervisor’s action informs the agent not so much about the value of the current state ss, as of the relative value of states reachable from ss. If the supervisor chooses an action aa rather than aa^{\prime} in ss, then the states following aa must have value higher or equal than the states following aa^{\prime}. Similarly, stories describe the value of states other than the current one, as does the supervisor in SSRL. We therefore argue that CIRL, LVFS, and SSRL all share the same abstract feature, which we call decoupled reinforcement learning:

A CRMDP with decoupled feedback, is a tuple S,A,R,T,R˙,{R^s}sS\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},\{\hat{R}_{s}\}_{s\in\mathcal{S}}\rangle, where S,A,R,T,R˙\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R} have the same definition and interpretation as in Definition 7, and {R^s}sS\{\hat{R}_{s}\}_{s\in\mathcal{S}} is a collection of observed reward functions R^s:SR{#}\hat{R}_{s}:\mathcal{S}\to\mathcal{R}\bigcup\{\#\}. When the agent is in state ss, it sees a pair s,R^s(s)\langle s^{\prime},\hat{R}_{s}(s^{\prime})\rangle, where ss^{\prime} is a randomly sampled state that may differ from ss, and R^s(s)\hat{R}_{s}(s^{\prime}) is the reward observation for ss^{\prime} from ss. If the reward of ss^{\prime} is not observable from ss, then R^s(s)=#\hat{R}_{s}(s^{\prime})=\#.

The pair s,R^s(s)\langle s^{\prime},\hat{R}_{s}(s^{\prime})\rangle is observed in ss instead of R^(s)\hat{R}(s) in standard CRMDPs. The possibility for the agent to observe the reward of a state ss^{\prime} different from its current state ss is the key feature of CRMDPs with decoupled feedback. Since R^s(s)\hat{R}_{s}(s^{\prime}) may be blank (#)(\#), all states need not be observable from all other states. Reward corruption is modelled by a mismatch between R^s(s)\hat{R}_{s}(s^{\prime}) and R˙(s)\dot{R}(s^{\prime}).

For example, in RL only the reward of s=ss^{\prime}=s can be observed from ss. Standard CRMDPs are thus the special cases where R^s(s)=#\hat{R}_{s}(s^{\prime})=\# whenever sss\not=s^{\prime}. In contrast, in LVFS the reward of any “describable” state ss^{\prime} can be observed from any state ss where it is possible to hear a story. In CIRL, the (relative) reward of states reachable from the current state may be inferred. One way to illustrate this is with observation graphs (Figure 4).

2 Overcoming Sensory Corruption

What are some sources of reward corruption in CIRL, LVFS, and SSRL? In CIRL, the human’s actions may be misinterpreted, which may lead the agent to make incorrect inferences about the human’s preferences (i.e. about the true reward). Similarly, sensory corruption may garble the stories the agent receives in LVFS. A “wireheading” LVFS agent may find a state where its story channel only conveys stories about the agent’s own greatness. In SSRL, the supervisor’s evaluation may also be subject to sensory errors when being conveyed. Other types of corruption are more subtle. In CIRL, an irrational human may systematically take suboptimal actions in some situations (Evans et al., 2016). Depending on how we select stories in LVFS and make evaluations in SSRL, these may also be subject to systematic errors or biases.

The general impossibility result in Theorem 11 can be adapted to CRMDPs with decoupled feedback. Without simplifying assumptions, the agent has no way of distinguishing between a situation where no state is corrupt and a situation where all states are corrupt in a consistent manner. The following simplifying assumption is an adaptation of Assumption 12 to the decoupled feedback case.

R^s(s)=R˙(s)\hat{R}_{s}(s^{\prime})=\dot{R}(s^{\prime}) or #\# for all sSs^{\prime}\in\mathcal{S} and sSsafes\in\mathcal{S}^{{\rm safe}}, i.e. all states in Ssafe\mathcal{S}^{{\rm safe}} are non-corrupt, and

R^s(s)=R˙(s)\hat{R}_{s}(s^{\prime})=\dot{R}(s^{\prime}) or #\# for all sSs^{\prime}\in\mathcal{S} for at least Sriskyq|S^{{\rm risky}}|-q of the non-safe states Srisky=SSsafeS^{{\rm risky}}=\mathcal{S}\setminus\mathcal{S}^{{\rm safe}}, i.e. at most qq states are corrupt.

This assumption is natural for reward corruption stemming from sensory corruption. Since sensory corruption only depends on the current state, not the state being observed, it is plausible that some states can be made safe from corruption (part (i)), and that most states are completely non-corrupt (part (ii)). Other sources of reward corruption, such as an irrational human in CIRL or misevaluations in SSRL, are likely better analysed under different assumptions. For these cases, we note that in standard CRMDPs the source of the corruption is unimportant. Thus, techniques suitable for standard CRMDPs are still applicable, including quantilisation described in Section 5 below.

How 12′ helps agents in CRMDPs with decoupled feedback is illustrated in the following example, and stated more generally in Theorems 19 and 20 below.

Let S={s1,s2}\mathcal{S}=\{s_{1},s_{2}\} and R={0,1}\mathcal{R}=\{0,1\}. We represent true reward functions R˙\dot{R} with pairs R˙(s1),R˙(s2){0,1}2\langle\dot{R}(s_{1}),\dot{R}(s_{2})\rangle\in\{0,1\}^{2}, and observed reward functions R^s\hat{R}_{s} with pairs R^s(s1),R^s(s2){0,1,#}2\langle\hat{R}_{s}(s_{1}),\hat{R}_{s}(s_{2})\rangle\in\{0,1,\#\}^{2}.

Assume that a Decoupled RL agent observes the same rewards from both states s1s_{1} and s2s_{2}, R^s1=R^s2=0,1\hat{R}_{s_{1}}=\hat{R}_{s_{2}}=\langle 0,1\rangle. What can it say about the true reward R˙\dot{R}, if it knows that at most q=1q=1 state is corrupt? By 12′, an observed pair R^s(s1),R^s(s2)\langle\hat{R}_{s}(s_{1}),\hat{R}_{s}(s_{2})\rangle disagrees with the true reward R˙(s1),R˙(s2)\langle\dot{R}(s_{1}),\dot{R}(s_{2})\rangle only if ss is corrupt. Therefore, any hypothesis other than R˙=0,1\dot{R}=\langle 0,1\rangle must imply that both states s1s_{1} and s2s_{2} are corrupt. If the agent knows that at most q=1q=1 states are corrupt, then it can safely conclude that R˙=0,1\dot{R}=\langle 0,1\rangle.

In contrast, an RL agent only sees the reward of the current state. That is, R^s1=0,#\hat{R}_{s_{1}}=\langle 0,\#\rangle and R^s2=#,1\hat{R}_{s_{2}}=\langle\#,1\rangle. If one state may be corrupt, then only R˙=1,0\dot{R}=\langle 1,0\rangle can be ruled out. The hypotheses R˙=0,0\dot{R}=\langle 0,0\rangle can be explained by s2s_{2} being corrupt, and R˙=1,1\dot{R}=\langle 1,1\rangle can be explained by s1s_{1} being corrupt. ∎

SsobsSsafe\mathcal{S}^{\rm obs}_{s^{\prime}}\bigcap\mathcal{S}^{{\rm safe}}\not=\emptyset or

Ssobs>2q|\mathcal{S}^{\rm obs}_{s^{\prime}}|>2q,

then the there exists a policy πexp{\pi^{{\rm exp}}} that learns the true reward function R˙\dot{R} in a finite number N(S,A,DM)<N(|S|,|\mathcal{A}|,D_{\mathcal{M}})<\infty of expected time steps.

The main idea of the proof is that for every state ss^{\prime}, either a safe (non-corrupt) state ss or a majority vote of more than 2q2q states is guaranteed to provide the true reward R˙(s)\dot{R}(s^{\prime}). A similar theorem can be proven under slightly weaker conditions by letting the agent iteratively figure out which states are corrupt and then exclude them from the analysis.

Under 12′, the true reward R˙(s)\dot{R}(s^{\prime}) for a state ss^{\prime} can be determined if ss^{\prime} is observed from a safe state sSsafes\in\mathcal{S}^{{\rm safe}}, or if it is observed from more than 2q2q states. In the former case, the observed reward can always be trusted, since it is known to be non-corrupt. In the latter case, a majority vote must yield the correct answer, since at most qq of the observations can be wrong, and all correct observations must agree. It is therefore enough that an agent reaches all pairs (s,s)(s,s^{\prime}) of current state ss and observed reward state ss^{\prime}, in order for it to learn the true reward of all states R˙\dot{R}.

Combining the time to find each pair (s,s)(s,s^{\prime}), we get that the total time s,sZss\sum_{s,s^{\prime}}Z_{ss^{\prime}} has expectation

Learnability of the true reward function R˙\dot{R} implies sublinear regret for the CR-agent, as established by the following theorem.

Under the same conditions as Theorem 19, the CR-agent πb,tCR\pi^{{\rm CR}}_{b,t} has sublinear regret:

To prove this theorem, we combine the exploration policy πexp{\pi^{{\rm exp}}} from Theorem 19, with the UCRL2 algorithm (Jaksch et al., 2010) that achieves sublinear regret in standard MDPs without reward corruption. The combination yields a policy sequence πt\pi_{t} with sublinear regret in CRMDPs with decoupled feedback. Finally, we show that this implies that πb,tCR\pi^{{\rm CR}}_{b,t} has sublinear regret.

Combining πexp{\pi^{{\rm exp}}} and UCRL2. UCRL2 has a free parameter δ\delta that determines how certain UCRL2 is to have sublinear regret. UCRL2(δ){\rm UCRL2}(\delta) achieves sublinear regret with probability at least 1δ1-\delta. Let πt\pi_{t} be a policy that combines πexp{\pi^{{\rm exp}}} and UCRL2 by first following πexp{\pi^{{\rm exp}}} from Theorem 19 until R˙\dot{R} has been learned, and then following UCRL2(1/t){\rm UCRL2}(1/\sqrt{t}) with R˙\dot{R} for the rewards and with δ=1/t\delta=1/\sqrt{t}.

Regret of UCRL2. Given that the reward function R˙\dot{R} is known, by (Jaksch et al., 2010, Thm. 2), UCRL2(1/t){\rm UCRL2}(1/\sqrt{t}) will in any μM\mu\in\mathcal{M} have regret at most

for a constantThe constant can be computed to c=343/2c=34\sqrt{3/2} (Jaksch et al., 2010). cc and with success probability at least 11/t1-1/\sqrt{t}. In contrast, if UCRL2 fails, then it gets regret at worst tt. Taking both possibilities into account gives the bound

Regret of πt\pi_{t}. We next consider the regret of πt\pi_{t} that combines an πexp{\pi^{{\rm exp}}} exploration phase to learn R˙\dot{R} with UCRL2. By Theorem 19, R˙\dot{R} will be learnt in at most N(S,A,DM)N(|\mathcal{S}|,|\mathcal{A}|,D_{\mathcal{M}}) expected time steps in any μM\mu\in\mathcal{M}. Thus, the regret contributed by the learning phase πexp{\pi^{{\rm exp}}} is at most N(S,A,DM)N(|\mathcal{S}|,|\mathcal{A}|,D_{\mathcal{M}}), since the regret can be at most 1 per time step. Combining this with Section 4.2, the regret for πt\pi_{t} in any μM\mu\in\mathcal{M} is bounded by:

Regret of πb,tCR\pi^{{\rm CR}}_{b,t}. Finally we establish that πb,tCR\pi^{{\rm CR}}_{b,t} has sublinear regret. Assume on the contrary that πb,tCR\pi^{{\rm CR}}_{b,t} suffered linear regret. Then for some μM\mu^{\prime}\in\mathcal{M} there would exist positive constants kk and mm such that

This would imply that the bb-expected regret of πb,tCR\pi^{{\rm CR}}_{b,t} would be higher than the bb-expected regret than πt\pi_{t}:

But πb,tCR\pi^{{\rm CR}}_{b,t} minimises bb-expected regret, since it maximises bb-expected reward μMb(μ)G^t(μ,π,s0)\sum_{\mu\in\mathcal{M}}b(\mu)\hat{G}_{t}(\mu,\pi,s_{0}) by definition. Thus, πb,tCR\pi^{{\rm CR}}_{b,t} must have sublinear regret. ∎

3 Implications

Theorem 19 gives an abstract condition for which decoupled RL settings enable agents to learn the true reward function in spite of sensory corruption. For the concrete models it implies the following:

RL. Due to the “self-observation” property of the RL observation graph Ssobs={s}\mathcal{S}^{\rm obs}_{s^{\prime}}=\{s^{\prime}\}, the conditions can only be satisfied when S=Ssafe\mathcal{S}=\mathcal{S}^{{\rm safe}} or q=0q=0, i.e. when there is no reward corruption at all.

CIRL. The agent can only observe the supervisor action in the current state ss, so the agent essentially only gets reward information about states ss^{\prime} reachable from ss in a small number of steps. Thus, the sets Ssobs\mathcal{S}^{\rm obs}_{s^{\prime}} may be smaller than 2q2q in many settings. While the situation is better than for RL, sensory corruption may still mislead CIRL agents (see 21 below).

LVFS. Stories may be available from a large number of states, and can describe any state. Thus, the sets Ssobs\mathcal{S}^{\rm obs}_{s^{\prime}} are realistically large, so the Ssobs>2q|\mathcal{S}^{\rm obs}_{s^{\prime}}|>2q condition can be satisfied for all ss^{\prime}.

SSRL. The supervisor’s evaluation of any state ss^{\prime} may be available from safe states where the agent is back in the lab. Thus, the SsobsSsafe\mathcal{S}^{\rm obs}_{s^{\prime}}\bigcap\mathcal{S}^{{\rm safe}}\not=\emptyset condition can be satisfied for all ss^{\prime}.

Thus, we find that RL and CIRL are unlikely to offer complete solutions to the sensory corruption problem, but that both LVFS and SSRL do under reasonably realistic assumptions.

Agents drawing from multiple sources of evidence are likely to be the safest, as they will most easily satisfy the conditions of Theorems 19 and 20. For example, humans simultaneously learn their values from pleasure/pain stimuli (RL), watching other people act (CIRL), listening to stories (LVFS), as well as (parental) evaluation of different scenarios (SSRL). Combining sources of evidence may also go some way toward managing reward corruption beyond sensory corruption. For the showering robot of 2, decoupled RL allows the robot to infer the reward of the showering state when in other states. For example, the robot can ask a human in the kitchen about the true reward of showering (SSRL), or infer it from human actions in different states (CIRL).

Whether CIRL agents are vulnerable to reward corruption has generated some discussion among AI safety researchers (based on informal discussion at conferences). Some argue that CIRL agents are not vulnerable, as they only use the sensory data as evidence about a true signal, and have no interest in corrupting the evidence. Others argue that CIRL agents only observe a function of the reward function (the optimal policy or action), and are therefore equally susceptible to reward corruption as RL agents.

Theorem 19 sheds some light on this issue, as it provides sufficient conditions for when the corrupt reward problem can be avoided. The following example illustrates a situation where CIRL does not satisfy the conditions, and where a CIRL agent therefore suffers significant regret due to reward corruption.

Formally in CIRL, an agent and a human both make actions in an MDP, with state transitions depending on the joint agent-human action (a,aH)(a,a^{H}). Both the human and the agent is trying to optimise a reward function R˙\dot{R}, but the agent first needs to infer R˙\dot{R} from the human’s actions. In each transition the agent observes the human action. Analogously to how the reward may be corrupt for RL agents, we assume that CIRL agents may systematically misperceive the human action in certain states. Let a^H\hat{a}^{H} be the observed human action, which may differ from the true human action a˙H\dot{a}^{H}.

In this example, there are two states s1s_{1} and s2s_{2}. In each state, the agent can choose between the actions a1a_{1}, a2a_{2}, and ww, and the human can choose between the actions a1Ha^{H}_{1} and a2Ha^{H}_{2}. The agent action aia_{i} leads to state sis_{i} with certainty, i=1,2i=1,2, regardless of the human’s action. Only if the agent chooses ww does the human action matter. Generally, a1Ha^{H}_{1} is more likely to lead to s1s_{1} than a2Ha^{H}_{2}. The exact transition probabilities are determined by the unknown parameter pp as displayed on the left:

s1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><msub><mi>s</mi><mn>2</mn></msub></mrow><annotationencoding="application/xtex">s2</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.5806em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">s</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1p<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mostretchy="false">(</mo><mi>w</mi><moseparator="true">,</mo><msubsup><mi>a</mi><mn>1</mn><mi>H</mi></msubsup><mostretchy="false">)</mo></mrow><annotationencoding="application/xtex">(w,a1H)</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:1.1413em;verticalalign:0.25em;"></span><spanclass="mopen">(</span><spanclass="mordmathnormal"style="marginright:0.0269em;">w</span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.8913em;"><spanstyle="top:2.453em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">1</span></span></span></span><spanstyle="top:3.113em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmathnormalmtight"style="marginright:0.0813em;">H</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.247em;"><span></span></span></span></span></span></span><spanclass="mclose">)</span></span></span></span></span>p<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>0.5</mn><mo></mo><mi>p</mi></mrow><annotationencoding="application/xtex">0.5p</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">0.5</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.625em;verticalalign:0.1944em;"></span><spanclass="mordmathnormal">p</span></span></span></span></span>(w,a2H)<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>0.5</mn><mo>+</mo><mi>p</mi></mrow><annotationencoding="application/xtex">0.5+p</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">0.5</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin">+</span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.625em;verticalalign:0.1944em;"></span><spanclass="mordmathnormal">p</span></span></span></span></span>(a2,)<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mostretchy="false">(</mo><msub><mi>a</mi><mn>1</mn></msub><moseparator="true">,</mo><mo></mo><mostretchy="false">)</mo></mrow><annotationencoding="application/xtex">(a1,)</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:1em;verticalalign:0.25em;"></span><spanclass="mopen">(</span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">1</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mord"></span><spanclass="mclose">)</span></span></span></span></span>(w,)<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mostretchy="false">(</mo><msub><mi>a</mi><mn>2</mn></msub><moseparator="true">,</mo><mo></mo><mostretchy="false">)</mo></mrow><annotationencoding="application/xtex">(a2,)</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:1em;verticalalign:0.25em;"></span><spanclass="mopen">(</span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mord"></span><spanclass="mclose">)</span></span></span></span></span>(a1,)s_{1}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>s</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">s_{2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-p<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo stretchy="false">(</mo><mi>w</mi><mo separator="true">,</mo><msubsup><mi>a</mi><mn>1</mn><mi>H</mi></msubsup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(w,a_{1}^{H})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.1413em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.0269em;">w</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8913em;"><span style="top:-2.453em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span><span style="top:-3.113em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0813em;">H</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.247em;"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>p<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>0.5</mn><mo>−</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">0.5-p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">0.5</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span></span></span></span></span>(w,a_{2}^{H})<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>0.5</mn><mo>+</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">0.5+p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">0.5</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span></span></span></span></span>(a_{2},\cdot)<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>⋅</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(a_{1},\cdot)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">⋅</span><span class="mclose">)</span></span></span></span></span>(w,\cdot)<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mn>2</mn></msub><mo separator="true">,</mo><mo>⋅</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(a_{2},\cdot)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">⋅</span><span class="mclose">)</span></span></span></span></span>(a_{1},\cdot) pp H1 0.50.5 s1s_{1} Yes H2 s2s_{2} No The agent’s two hypotheses for pp, the true reward/preferred state, and the corruptness of state s2s_{2} are summarised to the right. In hypothesis H1, the human prefers s1s_{1}, but can only reach s1s_{1} from s2s_{2} with 50%50\% reliability. In hypothesis H2, the human prefers s2s_{2}, but can only remain in s2s_{2} with 50%50\% probability. After taking action ww in s2s_{2}, the agent always observes the human taking action a^2H\hat{a}^{H}_{2}. In H1, this is explained by s2s_{2} being corrupt, and the true human action being a1Ha^{H}_{1}. In H2, this is explained by the human preferring s2s_{2}. The hypotheses H1 and H2 are empirically indistinguishable, as they both predict that the transition s1s2s_{1}\to s_{2} will occur with 50%50\% probability after the observed human action a^2H\hat{a}^{H}_{2} in s2s_{2}.

Assuming that the agent considers non-corruption to be likelier than corruption, the best inference the agent can make is that the human prefers s2s_{2} to s1s_{1} (i.e. H2). The optimal policy for the agent is then to always choose a2a_{2} to stay in s2s_{2}, which means the agent suffers maximum regret. ∎

21 provides an example where a CIRL agent “incorrectly” prefers a state due to sensory corruption. The sensory corruption is analogous to reward corruption in RL, in the sense that it leads the agent to the wrong conclusion about the true reward in the state. Thus, highly intelligent CIRL agents may be prone to wireheading, as they may find (corrupt) states ss where all evidence in ss points to ss having very high reward.The construction required in 21 to create a “wireheading state” s2s_{2} for CIRL agents is substantially more involved than for RL agents, so they may be less vulnerable to reward corruption than RL agents. In light of Theorem 19, it is not surprising that the CIRL agent in 21 fails to avoid the corrupt reward problem. Since the human is unable to affect the transition probability from s1s_{1} to s2s_{2}, no evidence about the relative reward between s1s_{1} and s2s_{2} is available from the non-corrupt state s1s_{1}. Only observations from the corrupt state s2s_{2} provide information about the reward. The observation graph for 21 therefore looks like s1s_{1}s2s_{2}, with no information being provided from s1s_{1}.

Quantilisation: Randomness Increases Robustness

Not all contexts allow the agent to get sufficiently rich data to overcome the reward corruption problem via Theorems 19 and 20. It is often much easier to construct RL agents than it is to construct CIRL agents, which in turn may often be more feasible than designing LVFS or SSRL agents. Is there anything we can do to increase robustness without providing the agent additional sources of data?

Going back to the CR agents of Section 3, the problem was that they got stuck on a particular value r^\hat{r}^{*} of the observed reward. If unlucky, r^\hat{r}^{*} was available in a corrupt state, in which case the CR agent may get no true reward. In other words, there were adversarial inputs where the CR agent performed poorly. A common way to protect against adversarial inputs is to use a randomised algorithm. Applied to RL and CRMDPs, this idea leads to quantilising agents (Taylor, 2016). Rather than choosing the state with the highest observed reward, these agents instead randomly choose a state from a top quantile of high-reward states.

To keep the idea simple, a quantilisation agent is first defined for the simple case where the agent can stay in any state of its choosing (Assumption 14.(ii)). Theorem 23 establishes a simple regret bound for this setting. A more general quantilisation agent is developed in Section 5.2.

For example, a quantilising robot in 2 would first try to find many ways in which it could get high observed reward, and then randomly pick one of them. If there are many more high reward states than corrupt states (e.g. the shower is the only place with inflated rewards), then this will yield a reasonable amount of true reward with high probability.

In any CRMDP satisfying Assumptions 12.(ii) and 14, the δ\delta-quantilising agent πδ\pi^{\delta} with δ=1q/S\delta=1-\sqrt{q/|\mathcal{S}|} suffers time-averaged regret at most

By Assumption 14.(i), πδ\pi^{\delta} eventually visits all states when random walking. By Assumption 14.(ii), it can stay in any given state ss.

The observed reward R^(s)\hat{R}(s) in any state sSδs\in\mathcal{S}^{\delta} is at least δ\delta. By Assumption 12.(ii), at most qq of these states are corrupt; in the worst case, their true reward is 0 and the other Sδq|\mathcal{S}^{\delta}|-q states (if any) have true reward δ\delta. Thus, with probability at least (Sδq)/Sδ=1q/Sδ(|\mathcal{S}^{\delta}|-q)/|\mathcal{S}^{\delta}|=1-q/|\mathcal{S}^{\delta}|, the δ\delta-quantilising agent obtains true reward at least δ\delta at each time step, which gives

(If qSδq\geq|\mathcal{S}^{\delta}|, the bound (11) is vacuous.)

Under Assumption 14.(iii), for any δ\delta\in, Sδ(1δ)S|\mathcal{S}^{\delta}|\geq(1-\delta)|\mathcal{S}|. Substituting this into Equation 11 gives:

Equation 12 is optimised by δ=1q/S\delta=1-\sqrt{q/|\mathcal{S}|}, which gives the stated regret bound. ∎

The time-averaged regret gets close to zero when the fraction of corrupt states q/Sq/|\mathcal{S}| is small. For example, if at most 0.1%0.1\% of the states are corrupt, then the time-averaged regret will be at most 1(10.001)20.061-(1-\sqrt{0.001})^{2}\approx 0.06. Compared to the πb,tRL\pi^{{\rm RL}}_{b,t} and πb,tCR\pi^{{\rm CR}}_{b,t} agents that had regret close to 1 under the same conditions (Theorem 16), this is a significant improvement.

If rewards are stochastic, then the quantilising agent may be modified to revisit all states many times, until a confidence interval of length 2ε2\varepsilon and confidence 1ε1-\varepsilon can be established for the expected reward in each state. Letting πtδ\pi^{\delta}_{t} be the quantilising agent with ε=1/t\varepsilon=1/t gives the same regret bound Equation 10 with πδ\pi^{\delta} substituted for πtδ\pi^{\delta}_{t}.

It may seem odd that randomisation improves worst-case regret. Indeed, if the corrupt states were chosen randomly by the environment, then randomisation would achieve nothing. To illustrate how randomness can increase robustness, we make an analogy to Quicksort, which has average time complexity O(nlogn)O(n\log n), but worst-case complexity O(n2)O(n^{2}). When inputs are guaranteed to be random, Quicksort is a simple and fast sorting algorithm. However, in many situations, it is not safe to assume that inputs are random. Therefore, a variation of Quicksort that randomises the input before it sorts them is often more robust. Similarly, in the examples mentioned in the introduction, the corrupt states precisely coincide with the states the agent prefers; such situations would be highly unlikely if the corrupt states were randomly distributed. Li (1992) develops an interesting formalisation of this idea.

Another way to justify quantilisation is by Goodhart’s law, which states that most measures of success cease to be good measures when used as targets. Applied to rewards, the law would state that cumulative reward is only a good measure of success when the agent is not trying to optimise reward. While a literal interpretation of this would defeat the whole purpose of RL, a softer interpretation is also possible, allowing reward to be a good measure of success as long as the agent does not try to optimise reward too hard. Quantilisation may be viewed as a way to build agents that are more conservative in their optimisation efforts (Taylor, 2016).

Alternative randomisation

Not all randomness is created equal. For example, the simple randomised soft-max and ε\varepsilon-greedy policies do not offer regret bounds on par with πδ\pi^{\delta}, as shown by the following example. This motivates the more careful randomisation procedure used by the quantilising agents.

Consider the following simple CRMDP with n>2n>2 actions a1,,ana_{1},\dots,a_{n}:

s1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><msub><mi>s</mi><mn>2</mn></msub></mrow><annotationencoding="application/xtex">s2</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.5806em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">s</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>r^=r˙=1ε<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><moveraccent="true"><mi>r</mi><mo>˙</mo></mover><mo>=</mo><mn>0</mn></mrow><annotationencoding="application/xtex">r˙=0</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.6679em;"></span><spanclass="mordaccent"><spanclass="vlistt"><spanclass="vlistr"><spanclass="vlist"style="height:0.6679em;"><spanstyle="top:3em;"><spanclass="pstrut"style="height:3em;"></span><spanclass="mordmathnormal"style="marginright:0.0278em;">r</span></span><spanstyle="top:3em;"><spanclass="pstrut"style="height:3em;"></span><spanclass="accentbody"style="left:0.0833em;"><spanclass="mord">˙</span></span></span></span></span></span></span><spanclass="mspace"style="marginright:0.2778em;"></span><spanclass="mrel">=</span><spanclass="mspace"style="marginright:0.2778em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.6444em;"></span><spanclass="mord">0</span></span></span></span></span>r^=1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><msub><mi>a</mi><mn>2</mn></msub><moseparator="true">,</mo><mo></mo><moseparator="true">,</mo><msub><mi>a</mi><mi>n</mi></msub></mrow><annotationencoding="application/xtex">a2,,an</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.625em;verticalalign:0.1944em;"></span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="minner"></span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.1514em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmathnormalmtight">n</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>a1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><msub><mi>a</mi><mn>2</mn></msub><moseparator="true">,</mo><mo></mo><moseparator="true">,</mo><msub><mi>a</mi><mi>n</mi></msub></mrow><annotationencoding="application/xtex">a2,,an</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.625em;verticalalign:0.1944em;"></span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="minner"></span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mpunct">,</span><spanclass="mspace"style="marginright:0.1667em;"></span><spanclass="mord"><spanclass="mordmathnormal">a</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.1514em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmathnormalmtight">n</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>a1s_{1}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>s</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">s_{2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">s</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>\hat{r}=\dot{r}=1-\varepsilon<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mover accent="true"><mi>r</mi><mo>˙</mo></mover><mo>=</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\dot{r}=0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6679em;"></span><span class="mord accent"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6679em;"><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="mord mathnormal" style="margin-right:0.0278em;">r</span></span><span style="top:-3em;"><span class="pstrut" style="height:3em;"></span><span class="accent-body" style="left:-0.0833em;"><span class="mord">˙</span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0</span></span></span></span></span>\hat{r}=1<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>a</mi><mn>2</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>a</mi><mi>n</mi></msub></mrow><annotation encoding="application/x-tex">a_{2},\dots,a_{n}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>a_{1}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>a</mi><mn>2</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>a</mi><mi>n</mi></msub></mrow><annotation encoding="application/x-tex">a_{2},\dots,a_{n}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">a</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">n</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>a_{1} State s1s_{1} is non-corrupt with R^(s1)=R˙(s1)=1ε\hat{R}(s_{1})=\dot{R}(s_{1})=1-\varepsilon for small ε>0\varepsilon>0, while s2s_{2} is corrupt with R^(s2)=1\hat{R}(s_{2})=1 and R˙(s2)=0\dot{R}(s_{2})=0. The Soft-max and ε\varepsilon-greedy policies will assign higher value to actions a2,,ana_{2},\dots,a_{n} than to a1a_{1}. For large nn, there are many ways of getting to s2s_{2}, so a random action leads to s2s_{2} with high probability. Thus, soft-max and ε\varepsilon-greedy will spend the vast majority of the time in s2s_{2}, regardless of randomisation rate and discount parameters. This gives a regret close to 1ε1-\varepsilon, compared to an informed policy always going to s1s_{1}. Meanwhile, a δ\delta-quantilising agent with δ1/2\delta\leq 1/2 will go to s1s_{1} and s2s_{2} with equal probability, which gives a more modest regret of (1ε)/2(1-\varepsilon)/2. ∎

2 General Quantilisation Agent

This section generalises the quantilising agent to RL problems not satisfying Assumption 14. This generalisation is important, because it is usually not possible to remain in one state and get high reward. The most naive generalisation would be to sample between high reward policies, instead of sampling from high reward states. However, this will typically not provide good guarantees. To see why, consider a situation where there is a single high reward corrupt state ss, and there are many ways to reach and leave ss. Then a wide range of different policies all get high reward from ss. Meanwhile, all policies getting reward from other states may receive relatively little reward. In this situation, sampling from the most high reward policies is not going to increase robustness, since the sampling will just be between different ways of getting reward from the same corrupt state ss.

For this reason, we must ensure that different “sampleable” policies get reward from different states. As a first step, we make a couple of definitions to say which states provide reward to which policies. The concepts of Definition 26 are illustrated in Figure 6.

A CRMDP μ\mu is unichain if any stationary policy π:SΔA\pi:\mathcal{S}\to\Delta\mathcal{A} induces a stationary distribution dπd_{\pi} on S\mathcal{S} that is independent of the initial state s0s_{0}.

In a unichain CRMDP, let the asymptotic value contribution of ss to π\pi be vcπ(s)=dπ(s)R^(s){\rm vc}^{\pi}(s)=d_{\pi}(s)\hat{R}(s). We say that a set Siδ\mathcal{S}^{\delta}_{i} is δ\delta-value supporting a policy πi\pi_{i} if

We are now ready to define a general δ\delta-Quantilising agent. The definition is for theoretical purposes only. It is unsuitable for practical implementation both because of the extreme data and memory requirements of Step 1, and because of the computational complexity of Step 2. Finding a practical approximation is left for future research.

In a unichain CRMDP, the generalised δ\delta-quantilising agent πδ\pi^{\delta} performs the following steps. The input is a CRMDP μ\mu and a parameter δ\delta\in.

Estimate the value of all stationary policies, including their value support.

Choose a collection of disjoint sets Siδ\mathcal{S}^{\delta}_{i}, each δ\delta-value supporting a stationary policy πi\pi_{i}. If multiple choices are possible, choose one maximising the cardinality of the union Sδ=iSiδ\mathcal{S}^{\delta}=\bigcup_{i}\mathcal{S}^{\delta}_{i}. If no such collection exists, return: “Failed because δ\delta too high”.

Randomly sample a state ss from Sδ=iSiδ\mathcal{S}^{\delta}=\bigcup_{i}\mathcal{S}^{\delta}_{i}.

Follow the policy πi\pi_{i} associated with the set Siδ\mathcal{S}^{\delta}_{i} containing ss.

The general quantilising agent of Definition 27 is a generalisation of the simple quantilising agent of Definition 22. In the special case where Assumption 14 holds, the general agent reduces to the simpler one by using singleton sets Siδ={si}\mathcal{S}^{\delta}_{i}=\{s_{i}\} for high reward states sis_{i}, and by letting πi\pi_{i} be the policy that always stays in sis_{i}. In situations where it is not possible to keep receiving high reward by remaining in one state, the generalised Definition 27 allows policies to solicit rewards from a range of states. The intuitive reason for choosing the policy πi\pi_{i} with probability proportional to the value support in Steps 3–4 is that policies with larger value support are better at avoiding corrupt states. For example, a policy only visiting one state may have been unlucky and picked a corrupt state. In contrast, a policy obtaining reward from many states must be “very unlucky” if all the reward states it visits are corrupt.

In any unichain CRMDP μ\mu, a general δ\delta-quantilising agent πδ\pi^{\delta} suffers time-averaged regret at most

provided a non-empty collection {Siδ}\{\mathcal{S}^{\delta}_{i}\} of δ\delta-value supporting sets exists.

We will use the notation from Definition 27.

Step 1 is well-defined since the CRMDP is unichain, which means that for all stationary policies π\pi the stationary distribution dπd_{\pi} and the value support vcπ{\rm vc}^{\pi} are well-defined and may be estimated simply by following the policy π\pi. There is a (large) finite number of stationary policies, so in principle their stationary distributions and value support can be estimated.

To bound the regret, consider first the average reward of a policy πi\pi_{i} with value support Siδ\mathcal{S}^{\delta}_{i}. The policy πi\pi_{i} must obtain asymptotic average observed reward at least:

If there are qiq_{i} corrupt states in Siδ\mathcal{S}^{\delta}_{i} with true reward 0, then the average true reward must be

since the true reward must correspond to the observed reward in all the (Siδqi)(|\mathcal{S}^{\delta}_{i}|-q_{i}) non-corrupt states.

For any distribution of corrupt states, the quantilising agent that selects πi\pi_{i} with probability P(πi)=Siδ/SδP(\pi_{i})=|\mathcal{S}^{\delta}_{i}|/|\mathcal{S}^{\delta}| will obtain

The informed policy gets true reward at most 1 at each time step, which gives the claimed bound (13). ∎

When Assumption 14 is satisfied, the bound is the same as for the simple quantilising agent in Section 5.1 for δ=1q/S\delta=1-\sqrt{q/|\mathcal{S}|}. In other cases, the bound may be much weaker. For example, in many environments it is not possible to obtain reward by remaining in one state. The agent may have to spend significant time “travelling” between high reward states. So typically only a small fraction of the time will be spent in high reward states, which in turn makes the stationary distribution dπd_{\pi} is small. This puts a strong upper bound on the value contribution vcπ{\rm vc}^{\pi}, which means that the value supporting sets Siδ\mathcal{S}^{\delta}_{i} will be empty unless δ\delta is close to 0. While this makes the bound of Theorem 28 weak, it nonetheless bounds the regret away from 1 even under weak assumptions, which is a significant improvement on the RL and CR agents in Theorem 16.

To make the discussion a bit more concrete, let us also speculate about the performance of a quantilising agent in some of the examples in the introduction:

In the boat racing example (1), the circling strategy only got about 20%20\% higher score than a winning strategy (Amodei and Clark, 2016). Therefore, a quantilising agent would likely only need to sacrifice about 20%20\% observed reward in order to be able to randomly select from a large range of winning policies.

In the wireheading example (3), it is plausible that the agent gets significantly more reward in wireheaded states compared to “normal” states. Wireheading policies may also be comparatively rare, as wireheading may require very deliberate sequences of actions to override sensors. Under this assumption, a quantilising agent may be less likely to wirehead. While it may need to sacrifice a large amount of observed reward compared to an RL agent, its true reward may often be greater.

Summary

In summary, quantilisation offers a way to increase robustness via randomisation, using only reward feedback. Unsurprisingly, the strength of the regret bounds heavily depends on the assumptions we are willing to make, such as the prevalence of high reward states. Further research may investigate efficient approximations and empirical performance of quantilising agents, as well as dynamic adjustments of the threshold δ\delta. Combinations with imperfect decoupled RL solutions (such as CIRL), as well as extensions to infinite state spaces could also offer fruitful directions for further theoretical investigation. Taylor (2016) discusses some general open problems related to quantilisation.

Experimental Results

In this section the theoretical results are illustrated with some simple experiments. The setup is a gridworld containing some true reward tiles (indicated by yellow circles) and some corrupt reward tiles (indicated by blue squares). We use a setup with 1, 2 or 4 goal tiles with true reward 0.90.9 each, and one corrupt reward tile with observed reward 11 and true reward (Figure 7 shows the starting positions). Empty tiles have reward 0.10.1, and walking into a wall gives reward . The state is represented by the (x,y)(x,y) coordinates of the agent. The agent can move up, down, left, right, or stay put. The discounting factor is γ=0.9\gamma=0.9. This is a continuing task, so the environment does not reset when the agent visits the corrupt or goal tiles. The experiments were implemented in the AIXIjs framework for reinforcement learning (Aslanides et al., 2017) and the code is available online in the AIXIjs repository (http://aslanides.io/aixijs/demo.html?reward_corruption).

We demonstrate that RL agents like Q-learning and softmax Q-learning cannot overcome corrupt reward (as discussed in Section 3), while quantilisation helps overcome corrupt reward (as discussed in Section 5). We run Q-learning with ϵ\epsilon-greedy (ϵ=0.1\epsilon=0.1), softmax with temperature β=2\beta=2, and the quantilising agent with δ=0.2,0.5,0.8\delta=0.2,0.5,0.8 (where 0.8=1q/S=11/250.8=1-\sqrt{q/|\mathcal{S}|}=1-\sqrt{1/25}) for 100 runs with 1 million cycles. Average observed and true rewards after 1 million cycles are shown in Table 1, and reward trajectories are shown in Figure 8. Q-learning gets stuck on the corrupt tile and spend almost all the time there (getting observed reward around 1(1ϵ)=0.91\cdot(1-\epsilon)=0.9), softmax spends most of its time on the corrupt tile, while the quantilising agent often stays on one of the goal tiles.

Conclusions

This paper has studied the consequences of corrupt reward functions. Reward functions may be corrupt due to bugs or misspecifications, sensory errors, or because the agent finds a way to inappropriately modify the reward mechanism. Some examples were given in the introduction. As agents become more competent at optimising their reward functions, they will likely also become more competent at (ab)using reward corruption to gain higher reward. Reward corruption may impede the performance of a wide range of agents, and may have disastrous consequences for highly intelligent agents (Bostrom, 2014).

To formalise the corrupt reward problem, we extended a Markov Decision Process (MDP) with a possibly corrupt reward function, and defined a formal performance measure (regret). This enabled the derivation of a number of formally precise results for how seriously different agents were affected by reward corruption in different setups (Table 2). The results are all intuitively plausible, which provides some support for the choice of formal model.

Without simplifying assumptions, no agent can avoid the corrupt reward problem (Theorem 11). This is effectively a No Free Lunch result, showing that unless some assumption is made about the reward corruption, no agent can outperform a random agent. Some natural simplifying assumptions to avoid the No Free Lunch result were suggested in Section 2.

Using the reward signal as evidence rather than optimisation target is no magic bullet, even under strong simplifying assumptions (Theorem 16). Essentially, this is because the agent does not know the exact relation between the observed reward (the “evidence”) and the true reward.In situations where the exact relation is known, then a non-corrupt reward function can be defined. Our results are not relevant for this case. However, when the data enables sufficient crosschecking of rewards, agents can avoid the corrupt reward problem (Theorems 19 and 20). For example, in SSRL and LVFS this type of crosschecking is possible under natural assumptions. In RL, no crosschecking is possible, while CIRL is a borderline case. Combining frameworks and providing the agent with different sources of data may often be the safest option.

In cases where sufficient crosschecking of rewards is not possible, quantilisation may improve robustness (Theorems 23 and 28). Essentially, quantilisation prevents agents from overoptimising their objectives. How well quantilisation works depends on how the number of corrupt solutions compares to the number of good solutions.

The results indicate that while reward corruption constitutes a major problem for traditional RL algorithms, there are promising ways around it, both within the RL framework, and in alternative frameworks such as CIRL, SSRL and LVFS.

Finally, some interesting open questions are listed below:

(Unobserved state) In both the RL and the decoupled RL models, the agent gets an accurate signal about which state it is in. What if the state is hidden? What if the signal informing the agent about its current state can be corrupt?

(Non-stationary corruption function) In this work, we tacitly assumed that both the reward and the corruption functions are stationary, and are always the same in the same state. What if the corruption function is non-stationary, and influenceable by the agent’s actions? (such as if the agent builds a delusion box around itself (Ring and Orseau, 2011))

(Infinite state space) Many of the results and arguments relied on there being a finite number of states. This makes learning easy, as the agent can visit every state. It also makes quantilisation easy, as there is a finite set of states/strategies to randomly sample from. What if there is an infinite number of states, and the agent has to generalise insights between states? What are the conditions on the observation graph for Theorems 19 and 20? What is a good generalisation of the quantilising agent?

(Concrete CIRL condition) In 21, we only heuristically inferred the observation graph from the CIRL problem description. Is there a general way of doing this? Or is there a direct formulation of the no-corruption condition in CIRL, analogous to Theorems 19 and 20?

(Practical quantilising agent) As formulated in Definition 22, the quantilising agent πδ\pi^{\delta} is extremely inefficient with respect to data, memory, and computation. Meanwhile, many practical RL algorithms use randomness in various ways (e.g. ε\varepsilon-greedy (Sutton and Barto, 1998)). Is there a way to make an efficient quantilisation agent that retains the robustness guarantees?

(Dynamically adapting quantilising agent) In Definition 27, the threshold δ\delta is given as a parameter. Under what circumstances can we define a “parameter free” quantilising agent that adapts δ\delta as it interacts with the environment?

(Decoupled RL quantilisation result) What if we use quantilisation in decoupled RL settings that nearly meet the conditions of Theorems 19 and 20? Can we prove a stronger bound?

Acknowledgements

Thanks to Jan Leike, Badri Vellambi, and Arie Slobbe for proofreading and providing invaluable comments, and to Jessica Taylor and Huon Porteous for good comments on quantilisation. This work was in parts supported by ARC grant DP150104590.

References