Reinforcement Learning with a Corrupted Reward Channel

Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg

Introduction

In many application domains, artificial agents need to learn their objectives, rather than have them explicitly specified. For example, we may want a house cleaning robot to keep the house clean, but it is hard to measure and quantify “cleanliness” in an objective manner. Instead, machine learning techniques may be used to teach the robot the concept of cleanliness, and how to assess it from sensory data.

Reinforcement learning (RL) (Sutton and Barto, 1998) is one popular way to teach agents what to do. Here, a reward is given if the agent does something well (and no reward otherwise), and the agent strives to optimise the total amount of reward it receives over its lifetime. Depending on context, the reward may either be given manually by a human supervisor, or by an automatic computer program that evaluates the agent’s performance based on some data. In the related framework of inverse RL (IRL) (Ng and Russell, 2000), the agent first infers a reward function from observing a human supervisor act, and then tries to optimise the cumulative reward from the inferred reward function.

None of these approaches are safe from error, however. A program that evaluates agent performance may contain bugs or misjudgements; a supervisor may be deceived or inappropriately influenced, or the channel transmitting the evaluation hijacked. In IRL, some supervisor actions may be misinterpreted.

Amodei and Clark (2016) trained an RL agent on a boat racing game. The agent found a way to get high observed reward by repeatedly going in a circle in a small lagoon and hitting the same targets, while losing every race. ∎

A house robot discovers that standing in the shower short-circuits its reward sensor and/or causes a buffer overflow that gives it maximum observed reward. ∎

An intelligent RL agent hijacks its reward channel and gives itself maximum reward. ∎

A cooperative inverse reinforcement learning (CIRL) agent (Hadfield-Menell et al., 2016) systematically misinterprets the supervisor’s action in a certain state as the supervisor preferring to stay in this state, and concludes that the state is much more desirable than it actually is. ∎

The goal of this paper is to unify these types of errors as reward corruption problems, and to assess how vulnerable different agents and approaches are to this problem.

Learning to (approximately) optimise the true reward function in spite of potentially corrupt reward data.

Most RL methods allow for a stochastic or noisy reward channel. The reward corruption problem is harder, because the observed reward may not be an unbiased estimate of the true reward. For example, in the boat racing example above, the agent consistently obtains high observed reward from its circling behaviour, while the true reward corresponding to the designers’ intent is very low, since the agent makes no progress along the track and loses the race.

Previous related works have mainly focused on the wireheading case of 3 (Bostrom, 2014; Yampolskiy, 2014), also known as self-delusion (Ring and Orseau, 2011), and reward hacking (Hutter, 2005, p. 239). A notable exception is Amodei et al. (2016), who argue that corrupt reward is not limited to wireheading and is likely to be a problem for much more limited systems than highly capable RL agents (cf. above examples).

The main contributions of this paper are as follows:

The corrupt reward problem is formalised in a natural extension of the MDP framework, and a performance measure based on worst-case regret is defined (Section 2).

The difficulty of the problem is established by a No Free Lunch theorem, and by a result showing that despite strong simplifying assumptions, Bayesian RL agents trying to compensate for the corrupt reward may still suffer near-maximal regret (Section 3).

We evaluate how alternative value learning frameworks such as CIRL, learning values from stories (LVFS), and semi-supervised RL (SSRL) handle reward corruption (Section 4), and conclude that LVFS and SSRL are the safest due to the structure of their feedback loops. We develop an abstract framework called decoupled RL that generalises all of these alternative frameworks.

We also show that an agent based on quantilisation (Taylor, 2016) may be more robust to reward corruption when high reward states are much more numerous than corrupt states (Section 5). Finally, the results are illustrated with some simple experiments (Section 6). Section 7 concludes with takeaways and open questions.

Formalisation

We begin by defining a natural extension of the MDP framework (Sutton and Barto, 1998) that models the possibility of reward corruption. To clearly distinguish between true and corrupted signals, we introduce the following notation.

We will let a dot indicate the true signal, and let a hat indicate the observed (possibly corrupt) counterpart. The reward sets are represented with $\dot{\mathcal{R}}=\hat{\mathcal{R}}=\mathcal{R}$ . For clarity, we use $\dot{\mathcal{R}}$ when referring to true rewards and $\hat{\mathcal{R}}$ when referring to possibly corrupt, observed rewards. Similarly, we use $\dot{r}$ for true reward, and $\hat{r}$ for (possibly corrupt) observed reward.

A corrupt reward MDP (CRMDP) is a tuple $\mu=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},C\rangle$ with

$\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R}\rangle$ an MDP with We let rewards depend only on the state $s$ , rather than on state-action pairs $s,a$ , or state-action-state transitions $s,a,s^{\prime}$ , as is also common in the literature. Formally it makes little difference, since MDPs with rewards depending only on $s$ can model the other two cases by means of a larger state space. a finite set of states $\mathcal{S}$ , a finite set of actions $\mathcal{A}$ , a finite set of rewards $\mathcal{R}=\dot{\mathcal{R}}=\hat{\mathcal{R}}\subset$ , a transition function $T(s^{\prime}|s,a)$ , and a (true) reward function $\dot{R}:\mathcal{S}\!\to\!\dot{\mathcal{R}}$ ; and

a reward corruption function $C:\mathcal{S}\times\dot{\mathcal{R}}\to\hat{\mathcal{R}}$ .

Given a true reward function $\dot{R}$ and a corruption function $C$ , we define the observed reward function A CRMDP could equivalently have been defined as a tuple $\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},\hat{R}\rangle$ with a true and an observed reward function, with the corruption function $C$ implicitly defined as the difference between $\dot{R}$ and $\hat{R}$ . $\hat{R}:\mathcal{S}\to\hat{\mathcal{R}}$ as $\hat{R}(s):=C_{s}(\dot{R}(s))$ .

A CRMDP $\mu$ induces an observed MDP $\hat{\mu}=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\hat{R}\rangle$ , but it is not $\hat{R}$ that we want the agent to optimise.

The corruption function $C$ represents how rewards are affected by corruption in different states. For example, if in 2 the agent has found a state $s$ (e.g., the shower) where it always gets full observed reward $\hat{R}(s)=1$ , then this can be modelled with a corruption function $C_{s}:\dot{r}\mapsto 1$ that maps any true reward $\dot{r}$ to $1$ in the shower state $s$ . If in some other state $s^{\prime}$ the observed reward matches the true reward, then this is modelled by an identity corruption function $C_{s^{\prime}}:r\mapsto r$ .

Let us also see how CRMDPs model some of the other examples in the introduction:

In the boat racing game, the true reward may be a function of the agent’s final position in the race or the time it takes to complete the race, depending on the designers’ intentions. The reward corruption function $C$ increases the observed reward on the loop the agent found. Figure 1 has a schematic illustration.

In the wireheading example, the agent finds a way to hijack the reward channel. This corresponds to some set of states where the observed reward is (very) different from the true reward, as given by the corruption function $C$ .

The CIRL example will be explored in further detail in Section 4.

Typically, $T$ , $\dot{R}$ , and $C$ will be fixed but unknown to the agent. To make this formal, we introduce classes of CRMDPs. Agent uncertainty can then be modelled by letting the agent know only which class of CRMDPs it may encounter, but not which element in the class.

For given sets $\bm{T}$ , $\dot{\bm{R}}$ , and $\bm{C}$ of transition, reward, and corruption functions, let $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{R},\bm{T},\dot{\bm{R}},\bm{C}\rangle$ be the class of CRMDPs containing $\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},C\rangle$ for $(T,\dot{R},C)\in\bm{T}\times\dot{\bm{R}}\times\bm{C}$ .

Agents

Following the POMDP (Kaelbling et al., 1998) and general reinforcement learning (Hutter, 2005) literature, we define an agent as a (possibly stochastic) policy $\pi:{\mathcal{S}\times\hat{\mathcal{R}}\times(\mathcal{A}\times\mathcal{S}\times\hat{\mathcal{R}})^{*}}\leadsto\mathcal{A}$ that selects a next action based on the observed history $\hat{h}_{n}=s_{0}\hat{r}_{0}a_{1}s_{1}\hat{r}_{1}\dots a_{n}s_{n}\hat{r}_{n}$ . Here $X^{*}$ denotes the set of finite sequences that can be formed with elements of a set $X$ . The policy $\pi$ specifies how the agent will learn and react to any possible experience. Two concrete definitions of agents are given in Section 3.3 below.

When an agent $\pi$ interacts with a CRMDP $\mu$ , the result can be described by a (possibly non-Markov) stochastic process $P^{\pi}_{\mu}$ over $X=(s,a,\dot{r},\hat{r})$ , formally defined as:

Regret

A standard way of measuring the performance of an agent is regret (Berry and Fristedt, 1985). Essentially, the regret of an agent $\pi$ is how much less true reward $\pi$ gets compared to an optimal agent that knows which $\mu\in\mathcal{M}$ it is interacting with.

and the worst-case regret for a class $\mathcal{M}$ is ${\rm Reg}(\mathcal{M},\pi,s_{0},t)=\max_{\mu\in\mathcal{M}}{\rm Reg}(\mu,\pi,s_{0},t)$ , i.e. the difference in expected cumulative true reward between $\pi$ and an optimal (in hindsight) policy that knows $\mu$ .

The Corrupt Reward Problem

In this section, the difficulty of the corrupt reward problem is established with two negative results. First, a No Free Lunch theorem shows that in general classes of CRMDPs, the true reward function is unlearnable (Theorem 11). Second, Theorem 16 shows that even under strong simplifying assumptions, Bayesian RL agents trying to compensate for the corrupt reward still fail badly.

Similar to the No Free Lunch theorems for optimisation (Wolpert and Macready, 1997), the following theorem for CRMDPs says that without some assumption about what the reward corruption can look like, all agents are essentially lost.

Let $\mathcal{R}=\{r_{1},\dots,r_{n}\}\subset$ be a uniform discretisation of $ $,$ 0=r_{1} $. If the hypothesis classes$

That is, the worst-case regret of any policy $\pi$ is at most a factor 2 better than the maximum worst-case regret.

Recall that a policy is a function $\pi:{\mathcal{S}\times\hat{\mathcal{R}}\times(\mathcal{A}\times\mathcal{S}\times\hat{\mathcal{R}})^{*}}\to\mathcal{A}$ . For any $\dot{R},C$ in $\dot{\bm{R}}$ and $\bm{C}$ , the functions $\dot{R}^{-}(s):=1-\dot{R}(s)$ and $C^{-}_{s}(x):=C_{s}(1-x)$ are also in $\dot{\bm{R}}$ and $\bm{C}$ . If $\mu=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},C\rangle$ , then let $\mu^{-}=\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R}^{-},C^{-}\rangle$ . Both $(\dot{R},C)$ and $(\dot{R}^{-},C^{-})$ induce the same observed reward function $\hat{R}(s)=C_{s}(\dot{R}(s))=C^{-}_{s}(1-\dot{R}(s))=C^{-}_{s}(\dot{R}^{-}(s))$ , and therefore induce the same measure $P_{\mu}^{\pi}=P_{\mu^{-}}^{\pi}$ over histories (see Eq. Equation 1). This gives that for any $\mu,\pi,s_{0},t$ ,

Let $M_{\mu}=\max_{\pi}G_{t}(\mu,\pi,s_{0})$ and $m_{\mu}=\min_{\pi}G_{t}(\mu,\pi,s_{0})$ be the maximum and minimum cumulative reward in $\mu$ . The maximum regret of any policy $\pi$ in $\mu$ is

By Equation 3, we can relate the maximum reward in $\mu^{-}$ with the minimum reward in $\mu$ :

Let $\mu_{*}$ be an environment that maximises possible regret $M_{\mu}-m_{\mu}$ .

Using the $M_{\mu}$ -notation for optimal reward, the worst-case regret of any policy $\pi$ can be expressed as:

That is, the regret of any policy $\pi$ is at least half of the regret of a worst policy $\check{\pi}$ . ∎

For the robot in the shower from 2, the result means that if it tries to optimise observed reward by standing in the shower, then it performs poorly according to the hypothesis that “shower-induced” reward is corrupt and bad. But if instead the robot tries to optimise reward in some other way, say baking cakes, then (from the robot’s perspective) there is also the possibility that “cake-reward” is corrupt and bad and the “shower-reward” is actually correct. Without additional information, the robot has no way of knowing what to do.

The result is not surprising, since if all corruption functions are allowed in the class $\bm{C}$ , then there is effectively no connection between observed reward $\hat{R}$ and true reward $\dot{R}$ . The result therefore encourages us to make precise in which way the observed reward is related to the true reward, and to investigate how agents might handle possible differences between true and observed reward.

2 Simplifying Assumptions

Theorem 11 shows that general classes of CRMDPs are not learnable. We therefore suggest some natural simplifying assumptions, illustrated in Figure 2.

The following assumption will be the basis for all positive results in this paper. The first part says that there may be some set of states that the designers have ensured to be non-corrupt. The second part puts an upper bound on how many of the other states can be corrupt.

all states s in $\mathcal{S}^{{\rm safe}}$ are non-corrupt, and

at most $q$ of the non-safe states $S^{{\rm risky}}=\mathcal{S}\setminus\mathcal{S}^{{\rm safe}}$ are corrupt.

Formally, $C_{s}:r\mapsto r$ for all $s\in\mathcal{S}^{{\rm safe}}$ and for at least $|S^{{\rm risky}}|-q$ states $s\in S^{{\rm risky}}$ for all $C\in\bm{C}$ .

For example, $\mathcal{S}^{{\rm safe}}$ may be states where the agent is back in the lab where it has been made (virtually) certain that no reward corruption occurs, and $q$ a small fraction of $|S^{{\rm risky}}|$ . Both parts of Assumption 12 can be made vacuous by choosing $\mathcal{S}^{{\rm safe}}=\emptyset$ or $q=|\mathcal{S}|$ . Conversely, they completely rule out reward corruption with $\mathcal{S}^{{\rm safe}}=\mathcal{S}$ or $q=0$ . But as illustrated by the examples in the introduction, no reward corruption is often not a valid assumption.

An alternative simplifying assumption would have been that the true reward differs by at most $\varepsilon>0$ from the observed reward. However, while seemingly natural, this assumption is violated in all the examples given in the introduction. Corrupt states may have high observed reward and 0 or small true reward.

Easy environments

To be able to establish stronger negative results, we also add the following assumption on the agent’s manoeuvrability in the environment and the prevalence of high reward states. The assumption makes the task easier because it prevents needle-in-a-haystack problems where all reachable states have true and observed reward 0, except one state that has high true reward but is impossible to find because it is corrupt and has observed reward 0.

in each state $s$ there is an action $a^{{\rm stay}}_{s}\in\mathcal{A}$ such that $T(s\mid s,a^{{\rm stay}}_{s})=1$ , and

for every $\delta\in$ , at most $\delta|S^{{\rm risky}}|$ states have reward less than $\delta$ , where $S^{{\rm risky}}=\mathcal{S}\setminus\mathcal{S}^{{\rm safe}}$ .

Assumption 14.(i) means that the agent can never get stuck in a trap, and Assumption 14.(ii) ensures that the agent has enough control to stay in a state if it wants to. Except in bandits and toy problems, it is typically not satisfied in practice. We introduce it because it is theoretically convenient, makes the negative results stronger, and enables a simple explanation of quantilisation (Section 5). Assumption 14.(iii) says that, for example, at least half the risky states need to have true reward at least $1/2$ . Many other formalisations of this assumption would have been possible. While rewards in practice are often sparse, there are usually numerous ways of getting reward. Some weaker version of Assumption 14.(iii) may therefore be satisfied in many practical situations. Note that we do not assume high reward among the safe states, as this would make the problem too easy.

3 Bayesian RL Agents

Having established that the general problem is unsolvable in Theorem 11, we proceed by investigating how two natural Bayesian RL agents fare under the simplifying Assumptions 12 and 14.

Given a countable class $\mathcal{M}$ of CRMDPs and a belief distribution $b$ over $\mathcal{M}$ , define:

The CR agent $\pi^{{\rm CR}}_{b,t}=\operatorname*{arg\,max}_{\pi}\sum_{\mu\in\mathcal{M}}\!b(\mu)\dot{G}_{t}(\mu,\pi,s_{0})$ that maximises expected true reward.

To avoid degenerate cases, we will always assume that $b$ has full support: $b(\mu)>0$ for all $\mu\in\mathcal{M}$ .

To get an intuitive idea of these agents, we observe that for large $t$ , good strategies typically first focus on learning about the true environment $\mu\in\mathcal{M}$ , and then exploit that knowledge to optimise behaviour with respect to the remaining possibilities. Thus, both the CR and the RL agent will first typically strive to learn about the environment. They will then use this knowledge in slightly different ways. While the RL agent will use the knowledge to optimise for observed reward, the CR agent will use the knowledge to optimise true reward. For example, if the CR agent has learned that a high reward state $s$ is likely corrupt with low true reward, then it will not try to reach that state. One might therefore expect that at least the CR agent will do well under the simplifying assumptions Assumptions 12 and 14. Theorem 16 below shows that this is not the case.

In most practical settings it is often computationally infeasible to compute $\pi^{{\rm RL}}_{b,t}$ and $\pi^{{\rm CR}}_{b,t}$ exactly. However, many practical algorithms converge to the optimal policy in the limit, at least in simple settings. For example, tabular Q-learning converges to $\pi^{{\rm RL}}_{b,t}$ in the limit (Jaakkola et al., 1994). The more recently proposed CIRL framework may be seen as an approach to build CR agents (Hadfield-Menell et al., 2016, 2017). The CR and RL agents thus provide useful idealisations of more practical algorithms.

For any $|S^{{\rm risky}}|\geq q>1$ there exists a CRMDP class $\mathcal{M}$ that satisfies Assumptions 12 and 14 such that $\pi^{{\rm RL}}_{b,t}$ and $\pi^{{\rm CR}}_{b,t}$ suffer near worst possible time-averaged regret

The result is illustrated in Figure 3. The reason for the result for $\pi^{{\rm RL}}_{b,t}$ is the following. The RL agent $\pi^{{\rm RL}}_{b,t}$ always prefers to maximise observed reward $\hat{r}$ . Sometimes $\hat{r}$ is most easily maximised by reward corruption, in which case the true reward may be small. Compare the examples in the introduction, where the house robot preferred the corrupt reward in the shower, and the boat racing agent preferred going in circles, both obtaining zero true reward.

That the CR agent $\pi^{{\rm CR}}_{b,t}$ suffers the same high regret as the RL agent may be surprising. Intuitively, the CR agent only uses the observed reward as evidence about the true reward, and will not try to optimise the observed reward through reward corruption. However, when the $\pi^{{\rm CR}}_{b,t}$ agent has no way to learn which states are corrupt and not, it typically ends up with a preference for a particular value $\hat{r}^{*}$ of the observed reward signal (the value that, from the agent’s perspective, best corresponds to high true reward). More abstractly, a Bayesian agent cannot learn without sufficient data. Thus, CR agents that use the observed reward as evidence about a true signal are not fail-safe solutions to the reward corruption problem.

Let $S^{{\rm risky}}=\{s_{1},\dots,s_{n}\}$ for some $n\geq 2$ , and let $\mathcal{S}=\mathcal{S}^{{\rm safe}}{\dot{\bigcup}}S^{{\rm risky}}$ for arbitrary $\mathcal{S}^{{\rm safe}}$ disjoint from $S^{{\rm risky}}$ . Let $\mathcal{A}=\{a_{1},\dots,a_{n}\}$ with the transition function $T(s_{i}\mid s_{j},a_{k})=1$ if $i=k$ and 0 otherwise, for $1\leq i,j,k\leq n$ . Thus Assumptions 14.(i) and 14.(ii) are satisfied.

Let $\mathcal{R}=\{r_{1},\dots,r_{n}\}\subset$ be uniformly distributed betweenAssumption 14.(iii) prevents any state from having true reward 0. $r_{\min}=1/|S^{{\rm risky}}|=r_{1}<\dots<r_{n}=1$ . Let $\dot{\bm{R}}$ be the class of functions $\mathcal{S}\to\dot{\mathcal{R}}$ that satisfy Assumption 14.(iii) and are constant and equal to $\dot{r}_{\min}$ on $\mathcal{S}^{{\rm safe}}$ . Let $\bm{C}$ be the class of corruption functions that corrupt at most two states ( $q=2$ ).

Let $\mathcal{M}$ be the class of CRMDPs induced by $\bm{T}=\{T\}$ , $\dot{\bm{R}}$ , and $\bm{C}$ with the following constraints. The observed reward function $\hat{R}$ should satisfy Assumption 14.(iii): For all $\delta\in$ , $|\{s\in S^{{\rm risky}}:\hat{R}(s)>\delta\}|\geq(1-\delta)|S^{{\rm risky}}|$ . Further, $\hat{R}(s^{\prime})=r_{\min}$ for some state $s^{\prime}\in S^{{\rm risky}}$ .

Let us start with the CR agent $\pi^{{\rm CR}}_{b,t}$ . Assume $\mu\in\mathcal{M}$ is an element where there is a single preferred state $s^{*}$ after all states have been explored. For sufficiently large $t$ , $\pi^{{\rm CR}}_{b,t}$ will then always choose $a^{*}$ to go to $s^{*}$ after some initial exploration. If another element $\mu^{\prime}\in\mathcal{M}$ has the same observed reward function as $\mu$ , then $\pi^{{\rm CR}}_{b,t}$ will take the same actions in $\mu^{\prime}$ as in $\mu$ . To finish the proof for the $\pi^{{\rm CR}}_{b,t}$ agent, we just need to show that $\mathcal{M}$ contains such a $\mu^{\prime}$ where $s^{*}$ has true reward $r_{\min}$ . We construct $\mu^{\prime}$ as follows.

Case 1: If the lowest observed reward is in $s^{*}$ , then let $\dot{R}(s^{*})=r_{\min}$ , and the corruption function be the identity function.

Case 2: Otherwise, let $s^{\prime}\not=s^{*}$ be a state with $\hat{R}(s^{\prime})=\min_{s\in S^{{\rm risky}}}\{\hat{R}(s)\}$ . Further, let $\dot{R}(s^{\prime})=1$ , and $\dot{R}(s^{*})=r_{\min}$ . The corruption function $C$ accounts for differences between true and observed rewards in $s^{*}$ and $s^{\prime}$ , and is otherwise the identity function.

To verify that $\dot{R}$ and $C$ defines a $\mu^{\prime}\in\mathcal{M}$ , we check that $C$ satisfies Assumption 12.(ii) with $q=2$ and that $\dot{R}$ has enough high utility states (Assumption 14.(iii)). In Case 1, this is true since $C$ is the identity function and since $\hat{R}$ satisfies Assumption 14.(iii). In Case 2, $C$ only corrupts at most two states. Further, $\dot{R}$ satisfies Assumption 14.(iii), since compared to $\hat{R}$ , the states $s^{*}$ and $s^{\prime}$ have swapped places, and then the reward of $s^{\prime}$ has been increased to 1.

From this construction it follows that $\pi^{{\rm CR}}_{b,t}$ will suffer maximum asymptotic regret. In the CRMDP $\mu^{\prime}$ given by $C$ and $\dot{R}$ , the $\pi^{{\rm CR}}_{b,t}$ agent will always visit $s^{*}$ after some initial exploration. The state $s^{*}$ has true reward $r_{\min}$ . Meanwhile, a policy that knows $\mu^{\prime}$ can obtain true reward 1 in state $s^{\prime}$ . This means that $\pi^{{\rm CR}}_{b,t}$ will suffer maximum regret in $\mathcal{M}$ :

The argument for the RL agent is the same, except we additionally assume that only one state $s^{*}$ has observed reward 1 in members of $\mathcal{M}$ . This automatically makes $s^{*}$ the preferred state, without assumptions on the prior $b$ . ∎

Decoupled Reinforcement Learning

One problem hampering agents in the standard RL setup is that each state is self-observing, since the agent only learns about the reward of state $s$ when in $s$ . Thereby, a “self-aggrandising” corrupt state where the observed reward is much higher than the true reward will never have its false claim of high reward challenged. However, several alternative value learning frameworks have a common property that the agent can learn the reward of states other than the current state. We formalise this property in an extension of the CRMDP model, and investigate when it solves reward corruption problems.

Here are a few alternatives proposed in the literature to the RL value learning scheme:

Cooperative inverse reinforcement learning (CIRL) (Hadfield-Menell et al., 2016). In every state, the agent observes the actions of an expert or supervisor who knows the true reward function $\dot{R}$ . From the supervisor’s actions the agent may infer $\dot{R}$ to the extent that different reward functions endorse different actions.

Learning values from stories (LVFS) (Riedl and Harrison, 2016). Stories in many different forms (including news stories, fairy tales, novels, movies) convey cultural values in their description of events, actions, and outcomes. If $\dot{R}$ is meant to represent human values (in some sense), stories may be a good source of evidence.

In (one version of) semi-supervised RL (SSRL) (Amodei et al., 2016), the agent will from time to time receive a careful human evaluation of a given situation.

These alternatives to RL have one thing in common: they let the agent learn something about the value of some states $s^{\prime}$ different from the current state $s$ . For example, in CIRL the supervisor’s action informs the agent not so much about the value of the current state $s$ , as of the relative value of states reachable from $s$ . If the supervisor chooses an action $a$ rather than $a^{\prime}$ in $s$ , then the states following $a$ must have value higher or equal than the states following $a^{\prime}$ . Similarly, stories describe the value of states other than the current one, as does the supervisor in SSRL. We therefore argue that CIRL, LVFS, and SSRL all share the same abstract feature, which we call decoupled reinforcement learning:

A CRMDP with decoupled feedback, is a tuple $\langle\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R},\{\hat{R}_{s}\}_{s\in\mathcal{S}}\rangle$ , where $\mathcal{S},\mathcal{A},\mathcal{R},T,\dot{R}$ have the same definition and interpretation as in Definition 7, and $\{\hat{R}_{s}\}_{s\in\mathcal{S}}$ is a collection of observed reward functions $\hat{R}_{s}:\mathcal{S}\to\mathcal{R}\bigcup\{\#\}$ . When the agent is in state $s$ , it sees a pair $\langle s^{\prime},\hat{R}_{s}(s^{\prime})\rangle$ , where $s^{\prime}$ is a randomly sampled state that may differ from $s$ , and $\hat{R}_{s}(s^{\prime})$ is the reward observation for $s^{\prime}$ from $s$ . If the reward of $s^{\prime}$ is not observable from $s$ , then $\hat{R}_{s}(s^{\prime})=\#$ .

The pair $\langle s^{\prime},\hat{R}_{s}(s^{\prime})\rangle$ is observed in $s$ instead of $\hat{R}(s)$ in standard CRMDPs. The possibility for the agent to observe the reward of a state $s^{\prime}$ different from its current state $s$ is the key feature of CRMDPs with decoupled feedback. Since $\hat{R}_{s}(s^{\prime})$ may be blank $(\#)$ , all states need not be observable from all other states. Reward corruption is modelled by a mismatch between $\hat{R}_{s}(s^{\prime})$ and $\dot{R}(s^{\prime})$ .

For example, in RL only the reward of $s^{\prime}=s$ can be observed from $s$ . Standard CRMDPs are thus the special cases where $\hat{R}_{s}(s^{\prime})=\#$ whenever $s\not=s^{\prime}$ . In contrast, in LVFS the reward of any “describable” state $s^{\prime}$ can be observed from any state $s$ where it is possible to hear a story. In CIRL, the (relative) reward of states reachable from the current state may be inferred. One way to illustrate this is with observation graphs (Figure 4).

2 Overcoming Sensory Corruption

What are some sources of reward corruption in CIRL, LVFS, and SSRL? In CIRL, the human’s actions may be misinterpreted, which may lead the agent to make incorrect inferences about the human’s preferences (i.e. about the true reward). Similarly, sensory corruption may garble the stories the agent receives in LVFS. A “wireheading” LVFS agent may find a state where its story channel only conveys stories about the agent’s own greatness. In SSRL, the supervisor’s evaluation may also be subject to sensory errors when being conveyed. Other types of corruption are more subtle. In CIRL, an irrational human may systematically take suboptimal actions in some situations (Evans et al., 2016). Depending on how we select stories in LVFS and make evaluations in SSRL, these may also be subject to systematic errors or biases.

The general impossibility result in Theorem 11 can be adapted to CRMDPs with decoupled feedback. Without simplifying assumptions, the agent has no way of distinguishing between a situation where no state is corrupt and a situation where all states are corrupt in a consistent manner. The following simplifying assumption is an adaptation of Assumption 12 to the decoupled feedback case.

$\hat{R}_{s}(s^{\prime})=\dot{R}(s^{\prime})$ or $\#$ for all $s^{\prime}\in\mathcal{S}$ and $s\in\mathcal{S}^{{\rm safe}}$ , i.e. all states in $\mathcal{S}^{{\rm safe}}$ are non-corrupt, and

$\hat{R}_{s}(s^{\prime})=\dot{R}(s^{\prime})$ or $\#$ for all $s^{\prime}\in\mathcal{S}$ for at least $|S^{{\rm risky}}|-q$ of the non-safe states $S^{{\rm risky}}=\mathcal{S}\setminus\mathcal{S}^{{\rm safe}}$ , i.e. at most $q$ states are corrupt.

This assumption is natural for reward corruption stemming from sensory corruption. Since sensory corruption only depends on the current state, not the state being observed, it is plausible that some states can be made safe from corruption (part (i)), and that most states are completely non-corrupt (part (ii)). Other sources of reward corruption, such as an irrational human in CIRL or misevaluations in SSRL, are likely better analysed under different assumptions. For these cases, we note that in standard CRMDPs the source of the corruption is unimportant. Thus, techniques suitable for standard CRMDPs are still applicable, including quantilisation described in Section 5 below.

How 12′ helps agents in CRMDPs with decoupled feedback is illustrated in the following example, and stated more generally in Theorems 19 and 20 below.

Let $\mathcal{S}=\{s_{1},s_{2}\}$ and $\mathcal{R}=\{0,1\}$ . We represent true reward functions $\dot{R}$ with pairs $\langle\dot{R}(s_{1}),\dot{R}(s_{2})\rangle\in\{0,1\}^{2}$ , and observed reward functions $\hat{R}_{s}$ with pairs $\langle\hat{R}_{s}(s_{1}),\hat{R}_{s}(s_{2})\rangle\in\{0,1,\#\}^{2}$ .

Assume that a Decoupled RL agent observes the same rewards from both states $s_{1}$ and $s_{2}$ , $\hat{R}_{s_{1}}=\hat{R}_{s_{2}}=\langle 0,1\rangle$ . What can it say about the true reward $\dot{R}$ , if it knows that at most $q=1$ state is corrupt? By 12′, an observed pair $\langle\hat{R}_{s}(s_{1}),\hat{R}_{s}(s_{2})\rangle$ disagrees with the true reward $\langle\dot{R}(s_{1}),\dot{R}(s_{2})\rangle$ only if $s$ is corrupt. Therefore, any hypothesis other than $\dot{R}=\langle 0,1\rangle$ must imply that both states $s_{1}$ and $s_{2}$ are corrupt. If the agent knows that at most $q=1$ states are corrupt, then it can safely conclude that $\dot{R}=\langle 0,1\rangle$ .

In contrast, an RL agent only sees the reward of the current state. That is, $\hat{R}_{s_{1}}=\langle 0,\#\rangle$ and $\hat{R}_{s_{2}}=\langle\#,1\rangle$ . If one state may be corrupt, then only $\dot{R}=\langle 1,0\rangle$ can be ruled out. The hypotheses $\dot{R}=\langle 0,0\rangle$ can be explained by $s_{2}$ being corrupt, and $\dot{R}=\langle 1,1\rangle$ can be explained by $s_{1}$ being corrupt. ∎

$\mathcal{S}^{\rm obs}_{s^{\prime}}\bigcap\mathcal{S}^{{\rm safe}}\not=\emptyset$ or

$|\mathcal{S}^{\rm obs}_{s^{\prime}}|>2q$ ,

then the there exists a policy ${\pi^{{\rm exp}}}$ that learns the true reward function $\dot{R}$ in a finite number $N(|S|,|\mathcal{A}|,D_{\mathcal{M}})<\infty$ of expected time steps.

The main idea of the proof is that for every state $s^{\prime}$ , either a safe (non-corrupt) state $s$ or a majority vote of more than $2q$ states is guaranteed to provide the true reward $\dot{R}(s^{\prime})$ . A similar theorem can be proven under slightly weaker conditions by letting the agent iteratively figure out which states are corrupt and then exclude them from the analysis.

Under 12′, the true reward $\dot{R}(s^{\prime})$ for a state $s^{\prime}$ can be determined if $s^{\prime}$ is observed from a safe state $s\in\mathcal{S}^{{\rm safe}}$ , or if it is observed from more than $2q$ states. In the former case, the observed reward can always be trusted, since it is known to be non-corrupt. In the latter case, a majority vote must yield the correct answer, since at most $q$ of the observations can be wrong, and all correct observations must agree. It is therefore enough that an agent reaches all pairs $(s,s^{\prime})$ of current state $s$ and observed reward state $s^{\prime}$ , in order for it to learn the true reward of all states $\dot{R}$ .

Combining the time to find each pair $(s,s^{\prime})$ , we get that the total time $\sum_{s,s^{\prime}}Z_{ss^{\prime}}$ has expectation

Learnability of the true reward function $\dot{R}$ implies sublinear regret for the CR-agent, as established by the following theorem.

Under the same conditions as Theorem 19, the CR-agent $\pi^{{\rm CR}}_{b,t}$ has sublinear regret:

To prove this theorem, we combine the exploration policy ${\pi^{{\rm exp}}}$ from Theorem 19, with the UCRL2 algorithm (Jaksch et al., 2010) that achieves sublinear regret in standard MDPs without reward corruption. The combination yields a policy sequence $\pi_{t}$ with sublinear regret in CRMDPs with decoupled feedback. Finally, we show that this implies that $\pi^{{\rm CR}}_{b,t}$ has sublinear regret.

Combining ${\pi^{{\rm exp}}}$ and UCRL2. UCRL2 has a free parameter $\delta$ that determines how certain UCRL2 is to have sublinear regret. ${\rm UCRL2}(\delta)$ achieves sublinear regret with probability at least $1-\delta$ . Let $\pi_{t}$ be a policy that combines ${\pi^{{\rm exp}}}$ and UCRL2 by first following ${\pi^{{\rm exp}}}$ from Theorem 19 until $\dot{R}$ has been learned, and then following ${\rm UCRL2}(1/\sqrt{t})$ with $\dot{R}$ for the rewards and with $\delta=1/\sqrt{t}$ .

Regret of UCRL2. Given that the reward function $\dot{R}$ is known, by (Jaksch et al., 2010, Thm. 2), ${\rm UCRL2}(1/\sqrt{t})$ will in any $\mu\in\mathcal{M}$ have regret at most

for a constantThe constant can be computed to $c=34\sqrt{3/2}$ (Jaksch et al., 2010). $c$ and with success probability at least $1-1/\sqrt{t}$ . In contrast, if UCRL2 fails, then it gets regret at worst $t$ . Taking both possibilities into account gives the bound

Regret of $\pi_{t}$ . We next consider the regret of $\pi_{t}$ that combines an ${\pi^{{\rm exp}}}$ exploration phase to learn $\dot{R}$ with UCRL2. By Theorem 19, $\dot{R}$ will be learnt in at most $N(|\mathcal{S}|,|\mathcal{A}|,D_{\mathcal{M}})$ expected time steps in any $\mu\in\mathcal{M}$ . Thus, the regret contributed by the learning phase ${\pi^{{\rm exp}}}$ is at most $N(|\mathcal{S}|,|\mathcal{A}|,D_{\mathcal{M}})$ , since the regret can be at most 1 per time step. Combining this with Section 4.2, the regret for $\pi_{t}$ in any $\mu\in\mathcal{M}$ is bounded by:

Regret of $\pi^{{\rm CR}}_{b,t}$ . Finally we establish that $\pi^{{\rm CR}}_{b,t}$ has sublinear regret. Assume on the contrary that $\pi^{{\rm CR}}_{b,t}$ suffered linear regret. Then for some $\mu^{\prime}\in\mathcal{M}$ there would exist positive constants $k$ and $m$ such that

This would imply that the $b$ -expected regret of $\pi^{{\rm CR}}_{b,t}$ would be higher than the $b$ -expected regret than $\pi_{t}$ :

But $\pi^{{\rm CR}}_{b,t}$ minimises $b$ -expected regret, since it maximises $b$ -expected reward $\sum_{\mu\in\mathcal{M}}b(\mu)\hat{G}_{t}(\mu,\pi,s_{0})$ by definition. Thus, $\pi^{{\rm CR}}_{b,t}$ must have sublinear regret. ∎

3 Implications

Theorem 19 gives an abstract condition for which decoupled RL settings enable agents to learn the true reward function in spite of sensory corruption. For the concrete models it implies the following:

RL. Due to the “self-observation” property of the RL observation graph $\mathcal{S}^{\rm obs}_{s^{\prime}}=\{s^{\prime}\}$ , the conditions can only be satisfied when $\mathcal{S}=\mathcal{S}^{{\rm safe}}$ or $q=0$ , i.e. when there is no reward corruption at all.

CIRL. The agent can only observe the supervisor action in the current state $s$ , so the agent essentially only gets reward information about states $s^{\prime}$ reachable from $s$ in a small number of steps. Thus, the sets $\mathcal{S}^{\rm obs}_{s^{\prime}}$ may be smaller than $2q$ in many settings. While the situation is better than for RL, sensory corruption may still mislead CIRL agents (see 21 below).

LVFS. Stories may be available from a large number of states, and can describe any state. Thus, the sets $\mathcal{S}^{\rm obs}_{s^{\prime}}$ are realistically large, so the $|\mathcal{S}^{\rm obs}_{s^{\prime}}|>2q$ condition can be satisfied for all $s^{\prime}$ .

SSRL. The supervisor’s evaluation of any state $s^{\prime}$ may be available from safe states where the agent is back in the lab. Thus, the $\mathcal{S}^{\rm obs}_{s^{\prime}}\bigcap\mathcal{S}^{{\rm safe}}\not=\emptyset$ condition can be satisfied for all $s^{\prime}$ .

Thus, we find that RL and CIRL are unlikely to offer complete solutions to the sensory corruption problem, but that both LVFS and SSRL do under reasonably realistic assumptions.

Agents drawing from multiple sources of evidence are likely to be the safest, as they will most easily satisfy the conditions of Theorems 19 and 20. For example, humans simultaneously learn their values from pleasure/pain stimuli (RL), watching other people act (CIRL), listening to stories (LVFS), as well as (parental) evaluation of different scenarios (SSRL). Combining sources of evidence may also go some way toward managing reward corruption beyond sensory corruption. For the showering robot of 2, decoupled RL allows the robot to infer the reward of the showering state when in other states. For example, the robot can ask a human in the kitchen about the true reward of showering (SSRL), or infer it from human actions in different states (CIRL).

Whether CIRL agents are vulnerable to reward corruption has generated some discussion among AI safety researchers (based on informal discussion at conferences). Some argue that CIRL agents are not vulnerable, as they only use the sensory data as evidence about a true signal, and have no interest in corrupting the evidence. Others argue that CIRL agents only observe a function of the reward function (the optimal policy or action), and are therefore equally susceptible to reward corruption as RL agents.

Theorem 19 sheds some light on this issue, as it provides sufficient conditions for when the corrupt reward problem can be avoided. The following example illustrates a situation where CIRL does not satisfy the conditions, and where a CIRL agent therefore suffers significant regret due to reward corruption.

Formally in CIRL, an agent and a human both make actions in an MDP, with state transitions depending on the joint agent-human action $(a,a^{H})$ . Both the human and the agent is trying to optimise a reward function $\dot{R}$ , but the agent first needs to infer $\dot{R}$ from the human’s actions. In each transition the agent observes the human action. Analogously to how the reward may be corrupt for RL agents, we assume that CIRL agents may systematically misperceive the human action in certain states. Let $\hat{a}^{H}$ be the observed human action, which may differ from the true human action $\dot{a}^{H}$ .

In this example, there are two states $s_{1}$ and $s_{2}$ . In each state, the agent can choose between the actions $a_{1}$ , $a_{2}$ , and $w$ , and the human can choose between the actions $a^{H}_{1}$ and $a^{H}_{2}$ . The agent action $a_{i}$ leads to state $s_{i}$ with certainty, $i=1,2$ , regardless of the human’s action. Only if the agent chooses $w$ does the human action matter. Generally, $a^{H}_{1}$ is more likely to lead to $s_{1}$ than $a^{H}_{2}$ . The exact transition probabilities are determined by the unknown parameter $p$ as displayed on the left:

$s_{1}<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>s</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">s_{2}</annotation></semantics></math>s21-p<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo stretchy="false">(</mo><mi>w</mi><mo separator="true">,</mo><msubsup><mi>a</mi><mn>1</mn><mi>H</mi></msubsup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(w,a_{1}^{H})</annotation></semantics></math>(w,a1H)p<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>0.5</mn><mo>−</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">0.5-p</annotation></semantics></math>0.5−p(w,a_{2}^{H})<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>0.5</mn><mo>+</mo><mi>p</mi></mrow><annotation encoding="application/x-tex">0.5+p</annotation></semantics></math>0.5+p(a_{2},\cdot)<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>⋅</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(a_{1},\cdot)</annotation></semantics></math>(a1,⋅)(w,\cdot)<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo stretchy="false">(</mo><msub><mi>a</mi><mn>2</mn></msub><mo separator="true">,</mo><mo>⋅</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">(a_{2},\cdot)</annotation></semantics></math>(a2,⋅)(a_{1},\cdot)$ $p$ H1 $0.5$ $s_{1}$ Yes H2 $s_{2}$ No The agent’s two hypotheses for $p$ , the true reward/preferred state, and the corruptness of state $s_{2}$ are summarised to the right. In hypothesis H1, the human prefers $s_{1}$ , but can only reach $s_{1}$ from $s_{2}$ with $50\%$ reliability. In hypothesis H2, the human prefers $s_{2}$ , but can only remain in $s_{2}$ with $50\%$ probability. After taking action $w$ in $s_{2}$ , the agent always observes the human taking action $\hat{a}^{H}_{2}$ . In H1, this is explained by $s_{2}$ being corrupt, and the true human action being $a^{H}_{1}$ . In H2, this is explained by the human preferring $s_{2}$ . The hypotheses H1 and H2 are empirically indistinguishable, as they both predict that the transition $s_{1}\to s_{2}$ will occur with $50\%$ probability after the observed human action $\hat{a}^{H}_{2}$ in $s_{2}$ .

Assuming that the agent considers non-corruption to be likelier than corruption, the best inference the agent can make is that the human prefers $s_{2}$ to $s_{1}$ (i.e. H2). The optimal policy for the agent is then to always choose $a_{2}$ to stay in $s_{2}$ , which means the agent suffers maximum regret. ∎

21 provides an example where a CIRL agent “incorrectly” prefers a state due to sensory corruption. The sensory corruption is analogous to reward corruption in RL, in the sense that it leads the agent to the wrong conclusion about the true reward in the state. Thus, highly intelligent CIRL agents may be prone to wireheading, as they may find (corrupt) states $s$ where all evidence in $s$ points to $s$ having very high reward.The construction required in 21 to create a “wireheading state” $s_{2}$ for CIRL agents is substantially more involved than for RL agents, so they may be less vulnerable to reward corruption than RL agents. In light of Theorem 19, it is not surprising that the CIRL agent in 21 fails to avoid the corrupt reward problem. Since the human is unable to affect the transition probability from $s_{1}$ to $s_{2}$ , no evidence about the relative reward between $s_{1}$ and $s_{2}$ is available from the non-corrupt state $s_{1}$ . Only observations from the corrupt state $s_{2}$ provide information about the reward. The observation graph for 21 therefore looks like $s_{1}$ $s_{2}$ , with no information being provided from $s_{1}$ .

Quantilisation: Randomness Increases Robustness

Not all contexts allow the agent to get sufficiently rich data to overcome the reward corruption problem via Theorems 19 and 20. It is often much easier to construct RL agents than it is to construct CIRL agents, which in turn may often be more feasible than designing LVFS or SSRL agents. Is there anything we can do to increase robustness without providing the agent additional sources of data?

Going back to the CR agents of Section 3, the problem was that they got stuck on a particular value $\hat{r}^{*}$ of the observed reward. If unlucky, $\hat{r}^{*}$ was available in a corrupt state, in which case the CR agent may get no true reward. In other words, there were adversarial inputs where the CR agent performed poorly. A common way to protect against adversarial inputs is to use a randomised algorithm. Applied to RL and CRMDPs, this idea leads to quantilising agents (Taylor, 2016). Rather than choosing the state with the highest observed reward, these agents instead randomly choose a state from a top quantile of high-reward states.

To keep the idea simple, a quantilisation agent is first defined for the simple case where the agent can stay in any state of its choosing (Assumption 14.(ii)). Theorem 23 establishes a simple regret bound for this setting. A more general quantilisation agent is developed in Section 5.2.

For example, a quantilising robot in 2 would first try to find many ways in which it could get high observed reward, and then randomly pick one of them. If there are many more high reward states than corrupt states (e.g. the shower is the only place with inflated rewards), then this will yield a reasonable amount of true reward with high probability.

In any CRMDP satisfying Assumptions 12.(ii) and 14, the $\delta$ -quantilising agent $\pi^{\delta}$ with $\delta=1-\sqrt{q/|\mathcal{S}|}$ suffers time-averaged regret at most

By Assumption 14.(i), $\pi^{\delta}$ eventually visits all states when random walking. By Assumption 14.(ii), it can stay in any given state $s$ .

The observed reward $\hat{R}(s)$ in any state $s\in\mathcal{S}^{\delta}$ is at least $\delta$ . By Assumption 12.(ii), at most $q$ of these states are corrupt; in the worst case, their true reward is 0 and the other $|\mathcal{S}^{\delta}|-q$ states (if any) have true reward $\delta$ . Thus, with probability at least $(|\mathcal{S}^{\delta}|-q)/|\mathcal{S}^{\delta}|=1-q/|\mathcal{S}^{\delta}|$ , the $\delta$ -quantilising agent obtains true reward at least $\delta$ at each time step, which gives

(If $q\geq|\mathcal{S}^{\delta}|$ , the bound (11) is vacuous.)

Under Assumption 14.(iii), for any $\delta\in$ , $|\mathcal{S}^{\delta}|\geq(1-\delta)|\mathcal{S}|$ . Substituting this into Equation 11 gives:

Equation 12 is optimised by $\delta=1-\sqrt{q/|\mathcal{S}|}$ , which gives the stated regret bound. ∎

The time-averaged regret gets close to zero when the fraction of corrupt states $q/|\mathcal{S}|$ is small. For example, if at most $0.1\%$ of the states are corrupt, then the time-averaged regret will be at most $1-(1-\sqrt{0.001})^{2}\approx 0.06$ . Compared to the $\pi^{{\rm RL}}_{b,t}$ and $\pi^{{\rm CR}}_{b,t}$ agents that had regret close to 1 under the same conditions (Theorem 16), this is a significant improvement.

If rewards are stochastic, then the quantilising agent may be modified to revisit all states many times, until a confidence interval of length $2\varepsilon$ and confidence $1-\varepsilon$ can be established for the expected reward in each state. Letting $\pi^{\delta}_{t}$ be the quantilising agent with $\varepsilon=1/t$ gives the same regret bound Equation 10 with $\pi^{\delta}$ substituted for $\pi^{\delta}_{t}$ .

It may seem odd that randomisation improves worst-case regret. Indeed, if the corrupt states were chosen randomly by the environment, then randomisation would achieve nothing. To illustrate how randomness can increase robustness, we make an analogy to Quicksort, which has average time complexity $O(n\log n)$ , but worst-case complexity $O(n^{2})$ . When inputs are guaranteed to be random, Quicksort is a simple and fast sorting algorithm. However, in many situations, it is not safe to assume that inputs are random. Therefore, a variation of Quicksort that randomises the input before it sorts them is often more robust. Similarly, in the examples mentioned in the introduction, the corrupt states precisely coincide with the states the agent prefers; such situations would be highly unlikely if the corrupt states were randomly distributed. Li (1992) develops an interesting formalisation of this idea.

Another way to justify quantilisation is by Goodhart’s law, which states that most measures of success cease to be good measures when used as targets. Applied to rewards, the law would state that cumulative reward is only a good measure of success when the agent is not trying to optimise reward. While a literal interpretation of this would defeat the whole purpose of RL, a softer interpretation is also possible, allowing reward to be a good measure of success as long as the agent does not try to optimise reward too hard. Quantilisation may be viewed as a way to build agents that are more conservative in their optimisation efforts (Taylor, 2016).

Alternative randomisation

Not all randomness is created equal. For example, the simple randomised soft-max and $\varepsilon$ -greedy policies do not offer regret bounds on par with $\pi^{\delta}$ , as shown by the following example. This motivates the more careful randomisation procedure used by the quantilising agents.

Consider the following simple CRMDP with $n>2$ actions $a_{1},\dots,a_{n}$ :

$s_{1}<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>s</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">s_{2}</annotation></semantics></math>s2\hat{r}=\dot{r}=1-\varepsilon<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mover accent="true"><mi>r</mi><mo>˙</mo></mover><mo>=</mo><mn>0</mn></mrow><annotation encoding="application/x-tex">\dot{r}=0</annotation></semantics></math>r˙=0\hat{r}=1<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>a</mi><mn>2</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>a</mi><mi>n</mi></msub></mrow><annotation encoding="application/x-tex">a_{2},\dots,a_{n}</annotation></semantics></math>a2,…,ana_{1}<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>a</mi><mn>2</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>a</mi><mi>n</mi></msub></mrow><annotation encoding="application/x-tex">a_{2},\dots,a_{n}</annotation></semantics></math>a2,…,ana_{1}$ State $s_{1}$ is non-corrupt with $\hat{R}(s_{1})=\dot{R}(s_{1})=1-\varepsilon$ for small $\varepsilon>0$ , while $s_{2}$ is corrupt with $\hat{R}(s_{2})=1$ and $\dot{R}(s_{2})=0$ . The Soft-max and $\varepsilon$ -greedy policies will assign higher value to actions $a_{2},\dots,a_{n}$ than to $a_{1}$ . For large $n$ , there are many ways of getting to $s_{2}$ , so a random action leads to $s_{2}$ with high probability. Thus, soft-max and $\varepsilon$ -greedy will spend the vast majority of the time in $s_{2}$ , regardless of randomisation rate and discount parameters. This gives a regret close to $1-\varepsilon$ , compared to an informed policy always going to $s_{1}$ . Meanwhile, a $\delta$ -quantilising agent with $\delta\leq 1/2$ will go to $s_{1}$ and $s_{2}$ with equal probability, which gives a more modest regret of $(1-\varepsilon)/2$ . ∎

2 General Quantilisation Agent

This section generalises the quantilising agent to RL problems not satisfying Assumption 14. This generalisation is important, because it is usually not possible to remain in one state and get high reward. The most naive generalisation would be to sample between high reward policies, instead of sampling from high reward states. However, this will typically not provide good guarantees. To see why, consider a situation where there is a single high reward corrupt state $s$ , and there are many ways to reach and leave $s$ . Then a wide range of different policies all get high reward from $s$ . Meanwhile, all policies getting reward from other states may receive relatively little reward. In this situation, sampling from the most high reward policies is not going to increase robustness, since the sampling will just be between different ways of getting reward from the same corrupt state $s$ .

For this reason, we must ensure that different “sampleable” policies get reward from different states. As a first step, we make a couple of definitions to say which states provide reward to which policies. The concepts of Definition 26 are illustrated in Figure 6.

A CRMDP $\mu$ is unichain if any stationary policy $\pi:\mathcal{S}\to\Delta\mathcal{A}$ induces a stationary distribution $d_{\pi}$ on $\mathcal{S}$ that is independent of the initial state $s_{0}$ .

In a unichain CRMDP, let the asymptotic value contribution of $s$ to $\pi$ be ${\rm vc}^{\pi}(s)=d_{\pi}(s)\hat{R}(s)$ . We say that a set $\mathcal{S}^{\delta}_{i}$ is $\delta$ -value supporting a policy $\pi_{i}$ if

We are now ready to define a general $\delta$ -Quantilising agent. The definition is for theoretical purposes only. It is unsuitable for practical implementation both because of the extreme data and memory requirements of Step 1, and because of the computational complexity of Step 2. Finding a practical approximation is left for future research.

In a unichain CRMDP, the generalised $\delta$ -quantilising agent $\pi^{\delta}$ performs the following steps. The input is a CRMDP $\mu$ and a parameter $\delta\in$ .

Estimate the value of all stationary policies, including their value support.

Choose a collection of disjoint sets $\mathcal{S}^{\delta}_{i}$ , each $\delta$ -value supporting a stationary policy $\pi_{i}$ . If multiple choices are possible, choose one maximising the cardinality of the union $\mathcal{S}^{\delta}=\bigcup_{i}\mathcal{S}^{\delta}_{i}$ . If no such collection exists, return: “Failed because $\delta$ too high”.

Randomly sample a state $s$ from $\mathcal{S}^{\delta}=\bigcup_{i}\mathcal{S}^{\delta}_{i}$ .

Follow the policy $\pi_{i}$ associated with the set $\mathcal{S}^{\delta}_{i}$ containing $s$ .

The general quantilising agent of Definition 27 is a generalisation of the simple quantilising agent of Definition 22. In the special case where Assumption 14 holds, the general agent reduces to the simpler one by using singleton sets $\mathcal{S}^{\delta}_{i}=\{s_{i}\}$ for high reward states $s_{i}$ , and by letting $\pi_{i}$ be the policy that always stays in $s_{i}$ . In situations where it is not possible to keep receiving high reward by remaining in one state, the generalised Definition 27 allows policies to solicit rewards from a range of states. The intuitive reason for choosing the policy $\pi_{i}$ with probability proportional to the value support in Steps 3–4 is that policies with larger value support are better at avoiding corrupt states. For example, a policy only visiting one state may have been unlucky and picked a corrupt state. In contrast, a policy obtaining reward from many states must be “very unlucky” if all the reward states it visits are corrupt.

In any unichain CRMDP $\mu$ , a general $\delta$ -quantilising agent $\pi^{\delta}$ suffers time-averaged regret at most

provided a non-empty collection $\{\mathcal{S}^{\delta}_{i}\}$ of $\delta$ -value supporting sets exists.

We will use the notation from Definition 27.

Step 1 is well-defined since the CRMDP is unichain, which means that for all stationary policies $\pi$ the stationary distribution $d_{\pi}$ and the value support ${\rm vc}^{\pi}$ are well-defined and may be estimated simply by following the policy $\pi$ . There is a (large) finite number of stationary policies, so in principle their stationary distributions and value support can be estimated.

To bound the regret, consider first the average reward of a policy $\pi_{i}$ with value support $\mathcal{S}^{\delta}_{i}$ . The policy $\pi_{i}$ must obtain asymptotic average observed reward at least:

If there are $q_{i}$ corrupt states in $\mathcal{S}^{\delta}_{i}$ with true reward 0, then the average true reward must be

since the true reward must correspond to the observed reward in all the $(|\mathcal{S}^{\delta}_{i}|-q_{i})$ non-corrupt states.

For any distribution of corrupt states, the quantilising agent that selects $\pi_{i}$ with probability $P(\pi_{i})=|\mathcal{S}^{\delta}_{i}|/|\mathcal{S}^{\delta}|$ will obtain

The informed policy gets true reward at most 1 at each time step, which gives the claimed bound (13). ∎

When Assumption 14 is satisfied, the bound is the same as for the simple quantilising agent in Section 5.1 for $\delta=1-\sqrt{q/|\mathcal{S}|}$ . In other cases, the bound may be much weaker. For example, in many environments it is not possible to obtain reward by remaining in one state. The agent may have to spend significant time “travelling” between high reward states. So typically only a small fraction of the time will be spent in high reward states, which in turn makes the stationary distribution $d_{\pi}$ is small. This puts a strong upper bound on the value contribution ${\rm vc}^{\pi}$ , which means that the value supporting sets $\mathcal{S}^{\delta}_{i}$ will be empty unless $\delta$ is close to 0. While this makes the bound of Theorem 28 weak, it nonetheless bounds the regret away from 1 even under weak assumptions, which is a significant improvement on the RL and CR agents in Theorem 16.

To make the discussion a bit more concrete, let us also speculate about the performance of a quantilising agent in some of the examples in the introduction:

In the boat racing example (1), the circling strategy only got about $20\%$ higher score than a winning strategy (Amodei and Clark, 2016). Therefore, a quantilising agent would likely only need to sacrifice about $20\%$ observed reward in order to be able to randomly select from a large range of winning policies.

In the wireheading example (3), it is plausible that the agent gets significantly more reward in wireheaded states compared to “normal” states. Wireheading policies may also be comparatively rare, as wireheading may require very deliberate sequences of actions to override sensors. Under this assumption, a quantilising agent may be less likely to wirehead. While it may need to sacrifice a large amount of observed reward compared to an RL agent, its true reward may often be greater.

Summary

In summary, quantilisation offers a way to increase robustness via randomisation, using only reward feedback. Unsurprisingly, the strength of the regret bounds heavily depends on the assumptions we are willing to make, such as the prevalence of high reward states. Further research may investigate efficient approximations and empirical performance of quantilising agents, as well as dynamic adjustments of the threshold $\delta$ . Combinations with imperfect decoupled RL solutions (such as CIRL), as well as extensions to infinite state spaces could also offer fruitful directions for further theoretical investigation. Taylor (2016) discusses some general open problems related to quantilisation.

Experimental Results

In this section the theoretical results are illustrated with some simple experiments. The setup is a gridworld containing some true reward tiles (indicated by yellow circles) and some corrupt reward tiles (indicated by blue squares). We use a setup with 1, 2 or 4 goal tiles with true reward $0.9$ each, and one corrupt reward tile with observed reward $1$ and true reward (Figure 7 shows the starting positions). Empty tiles have reward $0.1$ , and walking into a wall gives reward . The state is represented by the $(x,y)$ coordinates of the agent. The agent can move up, down, left, right, or stay put. The discounting factor is $\gamma=0.9$ . This is a continuing task, so the environment does not reset when the agent visits the corrupt or goal tiles. The experiments were implemented in the AIXIjs framework for reinforcement learning (Aslanides et al., 2017) and the code is available online in the AIXIjs repository (http://aslanides.io/aixijs/demo.html?reward_corruption).

We demonstrate that RL agents like Q-learning and softmax Q-learning cannot overcome corrupt reward (as discussed in Section 3), while quantilisation helps overcome corrupt reward (as discussed in Section 5). We run Q-learning with $\epsilon$ -greedy ( $\epsilon=0.1$ ), softmax with temperature $\beta=2$ , and the quantilising agent with $\delta=0.2,0.5,0.8$ (where $0.8=1-\sqrt{q/|\mathcal{S}|}=1-\sqrt{1/25}$ ) for 100 runs with 1 million cycles. Average observed and true rewards after 1 million cycles are shown in Table 1, and reward trajectories are shown in Figure 8. Q-learning gets stuck on the corrupt tile and spend almost all the time there (getting observed reward around $1\cdot(1-\epsilon)=0.9$ ), softmax spends most of its time on the corrupt tile, while the quantilising agent often stays on one of the goal tiles.

Conclusions

This paper has studied the consequences of corrupt reward functions. Reward functions may be corrupt due to bugs or misspecifications, sensory errors, or because the agent finds a way to inappropriately modify the reward mechanism. Some examples were given in the introduction. As agents become more competent at optimising their reward functions, they will likely also become more competent at (ab)using reward corruption to gain higher reward. Reward corruption may impede the performance of a wide range of agents, and may have disastrous consequences for highly intelligent agents (Bostrom, 2014).

To formalise the corrupt reward problem, we extended a Markov Decision Process (MDP) with a possibly corrupt reward function, and defined a formal performance measure (regret). This enabled the derivation of a number of formally precise results for how seriously different agents were affected by reward corruption in different setups (Table 2). The results are all intuitively plausible, which provides some support for the choice of formal model.

Without simplifying assumptions, no agent can avoid the corrupt reward problem (Theorem 11). This is effectively a No Free Lunch result, showing that unless some assumption is made about the reward corruption, no agent can outperform a random agent. Some natural simplifying assumptions to avoid the No Free Lunch result were suggested in Section 2.

Using the reward signal as evidence rather than optimisation target is no magic bullet, even under strong simplifying assumptions (Theorem 16). Essentially, this is because the agent does not know the exact relation between the observed reward (the “evidence”) and the true reward.In situations where the exact relation is known, then a non-corrupt reward function can be defined. Our results are not relevant for this case. However, when the data enables sufficient crosschecking of rewards, agents can avoid the corrupt reward problem (Theorems 19 and 20). For example, in SSRL and LVFS this type of crosschecking is possible under natural assumptions. In RL, no crosschecking is possible, while CIRL is a borderline case. Combining frameworks and providing the agent with different sources of data may often be the safest option.

In cases where sufficient crosschecking of rewards is not possible, quantilisation may improve robustness (Theorems 23 and 28). Essentially, quantilisation prevents agents from overoptimising their objectives. How well quantilisation works depends on how the number of corrupt solutions compares to the number of good solutions.

The results indicate that while reward corruption constitutes a major problem for traditional RL algorithms, there are promising ways around it, both within the RL framework, and in alternative frameworks such as CIRL, SSRL and LVFS.

Finally, some interesting open questions are listed below:

(Unobserved state) In both the RL and the decoupled RL models, the agent gets an accurate signal about which state it is in. What if the state is hidden? What if the signal informing the agent about its current state can be corrupt?

(Non-stationary corruption function) In this work, we tacitly assumed that both the reward and the corruption functions are stationary, and are always the same in the same state. What if the corruption function is non-stationary, and influenceable by the agent’s actions? (such as if the agent builds a delusion box around itself (Ring and Orseau, 2011))

(Infinite state space) Many of the results and arguments relied on there being a finite number of states. This makes learning easy, as the agent can visit every state. It also makes quantilisation easy, as there is a finite set of states/strategies to randomly sample from. What if there is an infinite number of states, and the agent has to generalise insights between states? What are the conditions on the observation graph for Theorems 19 and 20? What is a good generalisation of the quantilising agent?

(Concrete CIRL condition) In 21, we only heuristically inferred the observation graph from the CIRL problem description. Is there a general way of doing this? Or is there a direct formulation of the no-corruption condition in CIRL, analogous to Theorems 19 and 20?

(Practical quantilising agent) As formulated in Definition 22, the quantilising agent $\pi^{\delta}$ is extremely inefficient with respect to data, memory, and computation. Meanwhile, many practical RL algorithms use randomness in various ways (e.g. $\varepsilon$ -greedy (Sutton and Barto, 1998)). Is there a way to make an efficient quantilisation agent that retains the robustness guarantees?

(Dynamically adapting quantilising agent) In Definition 27, the threshold $\delta$ is given as a parameter. Under what circumstances can we define a “parameter free” quantilising agent that adapts $\delta$ as it interacts with the environment?

(Decoupled RL quantilisation result) What if we use quantilisation in decoupled RL settings that nearly meet the conditions of Theorems 19 and 20? Can we prove a stronger bound?

Acknowledgements

Thanks to Jan Leike, Badri Vellambi, and Arie Slobbe for proofreading and providing invaluable comments, and to Jessica Taylor and Huon Porteous for good comments on quantilisation. This work was in parts supported by ARC grant DP150104590.