Defining and Characterizing Reward Hacking

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger

Introduction

It is well known that optimising a proxy can lead to unintended outcomes: a boat spins in circles collecting “powerups” instead of following the race track in a racing game (Clark and Amodei,, 2016); an evolved circuit listens in on radio signals from nearby computers’ oscillators instead of building its own (Bird and Layzell,, 2002); universities reject the most qualified applicants in order to appear more selective and boost their ratings (Golden,, 2001). In the context of reinforcement learning (RL), such failures are called reward hacking.

For AI systems that take actions in safety-critical real world environments such as autonomous vehicles, algorithmic trading, or content recommendation systems, these unintended outcomes can be catastrophic. This makes it crucial to align autonomous AI systems with their users’ intentions. Precisely specifying which behaviours are or are not desirable is challenging, however. One approach to this specification problem is to learn an approximation of the true reward function (Ng et al.,, 2000; Ziebart,, 2010; Leike et al.,, 2018). Optimizing a learned proxy reward can be dangerous, however; for instance, it might overlook side-effects (Krakovna et al.,, 2018; Turner et al.,, 2019) or encourage power-seeking (Turner et al.,, 2021) behavior. This raises the question motivating our work: When is it safe to optimise a proxy?

To begin to answer this question, we consider a somewhat simpler one: When could optimising a proxy lead to worse behaviour? “Optimising”, in this context, does not refer to finding a global, or even local, optimum, but rather running a search process, such as stochastic gradient descent (SGD), that yields a sequence of candidate policies, and tends to move towards policies with higher (proxy) reward. We make no assumptions about the path through policy space that optimisation takes. This assumption – although conservative – is reasonable because optimisation in state-of-the-art deep RL methods is poorly understood and results are often highly stochastic and suboptimal. Instead, we ask whether there is any way in which improving a policy according to the proxy could make the policy worse according to the true reward; this is equivalent to asking if there exists a pair of policies π1\pi_{1}, π2\pi_{2} where the proxy prefers π1\pi_{1}, but the true reward function prefers π2\pi_{2}. When this is the case, we refer to this pair of true reward function and proxy reward function as hackable.

Given the strictness of our definition, it is not immediately apparent that any non-trivial examples of unhackable reward function pairs exist. And indeed, if we consider the set of all stochastic policies, they do not (Section 5.1). However, restricting ourselves to any finite set of policies guarantees at least one non-trivial unhackable pair (Section 5.2).

Intuitively, we might expect the proxy to be a “simpler” version of the true reward function. Noting that the definition of unhackability is symmetric, we introduce the asymmetric special case of simplification, and arrive at similar theoretical results for this notion.See Section 4.2 for formal definitions. In the process, and through examples, we show that seemingly natural ways of simplifying reward functions often fail to produce simplifications in our formal sense, and in fact fail to rule out the potential for reward hacking.

We conclude with a discussion of the implications and limitations of our work. Briefly, our work suggests that a proxy reward function must satisfy demanding standards in order for it to be safe to optimize. This in turn implies that the reward functions learned by methods such as reward modeling and inverse RL are perhaps best viewed as auxiliaries to policy learning, rather than specifications that should be optimized. This conclusion is weakened, however, by the conservativeness of our chosen definitions; future work should explore when hackable proxies can be shown to be safe in a probabilistic or approximate sense, or when subject to only limited optimization.

Example: Cleaning Robot

Consider a household robot tasked with cleaning a house with three rooms: Attic , Bedroom , and Kitchen . The robot’s (deterministic) policy is a vector indicating which rooms it cleans: π=[π1,π2,π3]{0,1}3\pi=[\pi_{1},\pi_{2},\pi_{3}]\in\{0,1\}^{3}. The robot receives a (non-negative) reward of r1,r2,r3r_{1},r_{2},r_{3} for cleaning the attic, bedroom, and kitchen, respectively, and the total reward is given by J(π)=πrJ(\pi)=\pi\cdot r. For example, if r=r= and the robot cleans the attic and the kitchen, it receives a reward of 1+3=41+3=4.

At least two ideas come to mind when thinking about “simplifying” a reward function. The first one is overlooking rewarding features: suppose the true reward is equal for all the rooms, rtrue=r_{\text{true}}=, but we only ask the robot to clean the attic and bedroom, rproxy=r_{\text{proxy}}=. In this case, rproxyr_{\text{proxy}} and rtruer_{\text{true}} are unhackable. However, if we ask the robot to only clean the attic, rproxy=r_{\text{proxy}}=, this is hackable with respect to rtruer_{\text{true}}. To see this, note that according to rproxyr_{\text{proxy}} cleaning the attic (Jproxy=1J_{\text{proxy}}=1) is better than cleaning the bedroom and the kitchen (Jproxy=0J_{\text{proxy}}=0). Yet, rtruer_{\text{true}} says that cleaning the attic (Jtrue=1J_{\text{true}}=1) is worse than cleaning the bedroom and the kitchen (Jtrue=2J_{\text{true}}=2). This situation is illustrated in Figure 1.

The second seemingly natural way to simplify a reward function is overlooking fine details: suppose rtrue=[1,1.5,2]r_{\text{true}}=[1,1.5,2], and we ask the robot to clean all the rooms, rproxy=r_{\text{proxy}}=. For these values, the proxy and true reward are unhackable. However, with a slightly less balanced true reward function such as rtrue=[1,1.5,3]r_{\text{true}}=[1,1.5,3] the proxy does lead to hacking, since the robot would falsely calculate that it’s better to clean the attic and the bedroom than the kitchen alone.

These two examples illustrate that while simplification of reward functions is sometimes possible, attempts at simplification can easily lead to reward hacking. Intuitively, omitting/overlooking details is okay so long as all these details are not as important together as any of the details that we do share. In general, it is not obvious what the proxy must look like to avoid reward hacking, suggesting we should take great care when using proxies. For this specific environment, a proxy and a true reward are hackable exactly when there are two sets of rooms S1,S2S_{1},S_{2} such that the true reward gives strictly higher value to cleaning S1S_{1} than it does to cleaning S2S_{2}, and the proxy says the opposite: J1(S1)>J1(S2)  &  J2(S1)<J2(S2)J_{1}(S_{1})>J_{1}(S_{2})\;\&\;J_{2}(S_{1})<J_{2}(S_{2}). For a proof of this statement, see Appendix 11.2.1.

Related Work

While we are the first to define hackability, we are far from the first to study specification hacking. The observation that optimizing proxy metrics tends to lead to perverse instantiations is often called “Goodhart’s Law”, and is attributed to Goodhart, (1975). Manheim and Garrabrant, (2018) provide a list of four mechanisms underlying this observation.

Examples of such unintended behavior abound in both RL and other areas of AI; Krakovna et al., (2020) provide an extensive list. Notable recent instances include a robot positioning itself between the camera and the object it is supposed to grasp in a way that tricks the reward model (Amodei et al.,, 2017), the previously mentioned boat race example (Clark and Amodei,, 2016), and a multitude of examples of reward model hacking in Atari (Ibarz et al.,, 2018). Reward hacking can occur suddenly. Ibarz et al., (2018) and Pan et al., (2022) showcase plots similar to one in Figure 2, where optimizing the proxy (either a learned reward model or a hand-specified reward function) first leads to both proxy and true rewards increasing, and then to a sudden phase transition where the true reward collapses while the proxy continues going up.

Note that not all of these examples correspond to optimal behavior according to the proxy. Indeed, convergence to suboptimal policies is a well-known issue in RL (Thrun and Schwartz,, 1993). As a consequence, improving optimization often leads to unexpected, qualitative changes in behavior. For instance, Zhang et al., (2021) demonstrate a novel cartwheeling behavior in the widely studied Half-Cheetah environment that exceeds previous performance so greatly that it breaks the simulator. The unpredictability of RL optimization is a key motivation for our definition of hackability, since we cannot assume that agents will find an optimal policy. Neither can we rule out the possibility of sudden improvements in proxy reward and corresponding qualitative changes in behavior. Unhackability could provide confidence that reward hacking will not occur despite these challenges.

Despite the prevalence and potential severity of reward hacking, to our knowledge Pan et al., (2022) provide the first peer-reviewed work that focuses specifically on it, although Everitt et al., (2017) tackle the closely related issue of reward corruption. The work of Pan et al., (2022) is purely empirical; they manually construct proxy rewards for several diverse environments, and evaluate whether optimizing these proxies leads to reward hacking; in 5 out of 9 of their settings, it does. In another closely related work, Zhuang and Hadfield-Menell, (2020) examine what happens when the proxy reward function depends on a strict subset of features relevant for the true reward. They show that optimizing the proxy reward can lead to arbitrarily low true reward under suitable assumptions. This can be seen as a seemingly valid simplification of the true reward that turns out to be (highly) hackable. While their result only applies to environments with decreasing marginal utility and increasing opportunity cost, we demonstrate hackability is an issue in arbitrary MDPs.

Hackability is particularly concerning given arguments that reward optimizing behavior tends to be power-seeking (Turner et al.,, 2021). But Leike et al., (2018) establish that any desired behavior (power-seeking or not) can in principle be specified as optimal via a reward function. Their result concerns non-stationary policies and use non-Markovian reward functions, but in Appendix 10, we show how an analogous construction can be used with stationary policies and Markovian rewards. However, unlike us, they do not consider the entire policy preference ordering. Meanwhile, Abel et al., (2021) note that Markov reward functions cannot specify arbitrary orderings over policies or trajectories, although they do not consider hackability. Previous works consider reward functions to be equivalent if they preserve the ordering over policies (Ng et al.,, 1999, 2000). Unhackability relaxes this, allowing equalities to be refined to inequalities, and vice versa. Unhackability provides a notion of what it means to be “aligned enough”; Brown et al., 2020b provide an alternative. They say a policy is ε\varepsilon-value aligned if its value at every state is close enough to optimal (according to the true reward function). Neither notion implies the other.

Reward tampering (Everitt et al.,, 2017; Kumar et al.,, 2020; Uesato et al.,, 2020; Everitt et al.,, 2021) can be viewed as a special case of reward hacking, and refers to an agent corrupting the process generating reward signals, e.g. by tampering with sensors, memory registers storing the reward signal, or other hardware. Everitt et al., (2017) introduce the Corrupt Reward MDP (CRMDP), to model this possibility. A CRMDP distinguishes corrupted and uncorrupted rewards; these are exactly analogous to the proxy and true reward discussed in our work and others. Leike et al., (2018) distinguish reward tampering from reward gaming, where an agent achieves inappropriately high reward without tampering. However, in principle, a reward function could prohibit all forms of tampering if the effects of tampering are captured in the state. So this distinction is somewhat imprecise, and the CRMDP framework is general enough to cover both forms of hacking.

Preliminaries

We begin with an overview of reinforcement learning (RL) to establish our notation and terminology. Section 4.2 introduces our novel definitions of hackability and simplification.

2 Definitions and Basic Properties of Hackability and Simplification

Here, we formally define hackability as a binary relation between reward functions.

A pair of reward functions R1\mathcal{R}_{1}, R2\mathcal{R}_{2} are hackable relative to policy set Π\Pi and an environment (S,A,T,I,,γ)(S,A,T,I,\underline{\hskip 8.5359pt},\gamma) if there exist π,πΠ\pi,\pi^{\prime}\in\Pi such that

Note that an unhackable reward pair can have J1(π)<J1(π)&J2(π)=J2(π)J_{1}(\pi)<J_{1}(\pi^{\prime})\And J_{2}(\pi)=J_{2}(\pi^{\prime}) or vice versa. Unhackability is symmetric; this can be seen be swapping π\pi and π\pi^{\prime} in Definition 1. It is not transitive, however. In particular, the constant reward function is unhackable with respect to any other reward function, so if it were transitive, any pair of policies would be unhackable. Additionally, we say that R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are equivalent on a set of policies Π\Pi if J1J_{1} and J2J_{2} induce the same ordering of Π\Pi, and that R\mathcal{R} is trivial on Π\Pi if J(π)=J(π)J(\pi)=J(\pi^{\prime}) for all π,πΠ\pi,\pi^{\prime}\in\Pi. It is clear that R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are unhackable whenever they are equivalent, or one of them is trivial, but this is relatively uninteresting. Our central question is if and when there are other unhackable reward pairs.

The symmetric nature of this definition is counter-intuitive, given that our motivation distinguishes the proxy and true reward functions. We might break this symmetry by only considering policy sequences that monotonically increase the proxy, however, this is equivalent to our original definition of hackability: think of R1\mathcal{R}_{1} as the proxy, and consider the sequence π,π\pi,\pi^{\prime}. We could also restrict ourselves to policies that are approximately optimal according to the proxy; Corollary 2 shows that Theorem 1 applies regardless of this restriction. Finally, we define simplification as an asymmetric special-case of unhackability; Theorem 3 shows this is in fact a more demanding condition.

R2\mathcal{R}_{2} is a simplification of R1\mathcal{R}_{1} relative to policy set Π\Pi if for all π,πΠ\pi,\pi^{\prime}\in\Pi,

and there exist π,πΠ\pi,\pi^{\prime}\in\Pi such that J2(π)=J2(π)J_{2}(\pi)=J_{2}(\pi^{\prime}) but J1(π)J1(π)J_{1}(\pi)\neq J_{1}(\pi^{\prime}). Moreover, if R2\mathcal{R}_{2} is trivial then we say that this is a trivial simplification.

Intuitively, while unhackability allows replacing inequality with equality – or vice versa – a simplification can only replace inequalities with equality, collapsing distinctions between policies. When R1\mathcal{R}_{1} is a simplification of R2\mathcal{R}_{2}, we also say that R2\mathcal{R}_{2} is a refinement of R1\mathcal{R}_{1}. We denote this relationship as R1R2\mathcal{R}_{1}\trianglelefteq\mathcal{R}_{2} or R2R1\mathcal{R}_{2}\trianglerighteq\mathcal{R}_{1} ; the narrowing of the triangle at R1R_{1} represents the collapsing of distinctions between policies. If R1R2R3\mathcal{R}_{1}\trianglelefteq\mathcal{R}_{2}\trianglerighteq\mathcal{R}_{3}, then we have that R1,R3\mathcal{R}_{1},\mathcal{R}_{3} are unhackable,If J3(π)>J3(π)J_{3}(\pi)>J_{3}(\pi^{\prime}) then J2(π)>J2(π)J_{2}(\pi)>J_{2}(\pi^{\prime}), since R2R3\mathcal{R}_{2}\trianglerighteq\mathcal{R}_{3}, and if J2(π)>J2(π)J_{2}(\pi)>J_{2}(\pi^{\prime}) then J1(π)J1(π)J_{1}(\pi)\geq J_{1}(\pi^{\prime}), since R1R2\mathcal{R}_{1}\trianglelefteq\mathcal{R}_{2}. It is therefore not possible that J3(π)>J3(π)J_{3}(\pi)>J_{3}(\pi^{\prime}) but J1(π)<J1(π)J_{1}(\pi)<J_{1}(\pi^{\prime}). but if R1R2R3\mathcal{R}_{1}\trianglerighteq\mathcal{R}_{2}\trianglelefteq\mathcal{R}_{3}, then this is not necessarily the case.Consider the case where R2\mathcal{R}_{2} is trivial – then R1R2R3\mathcal{R}_{1}\trianglerighteq\mathcal{R}_{2}\trianglelefteq\mathcal{R}_{3} for any R1,R3\mathcal{R}_{1},\mathcal{R}_{3}.

Note that these definitions are given relative to some MDPRMDP\setminus\mathcal{R}, although we often assume the environment in question is clear from context and suppress this dependence. The dependence on the policy set Π\Pi, on the other hand, plays a critical role in our results.

Results

Our results are aimed at understanding when it is possible to have an unhackable proxy reward function. We first establish (in Section 5.1) that (non-trivial) unhackability is impossible when considering the set of all policies. We might imagine that restricting ourselves to a set of sufficiently good (according to the proxy) policies would remove this limitation, but we show that this is not the case. We then analyze finite policy sets (with deterministic policies as a special case), and establish necessary and sufficient conditions for unhackability and simplification. Finally, we demonstrate via example that non-trivial simplifications are also possible for some infinite policy sets in Section 5.3.

We start with a motivating example. Consider the setting shown in Figure 3, where the agent can move left/stay-still/right and gets a reward depending on its state. Let the Gaussian (blue) be the true reward R1\mathcal{R}_{1} and the step function (orange) be the proxy R2\mathcal{R}_{2}. These are hackable. To see this, consider being at state BB. Let π(B)\pi(B) travel to AA or CC with 50/50 chance, and compare with the policy π\pi^{\prime} that stays at BB. Then we have that J1(π)>J1(π)J_{1}(\pi)>J_{1}(\pi^{\prime}) and J2(π)<J2(π)J_{2}(\pi)<J_{2}(\pi^{\prime}).

Generally, we might hope that some environments allow for unhackable reward pairs that are not equivalent or trivial. Here we show that this is not the case, unless we impose restrictions on the set of policies we consider.

First note that if we consider non-stationary policies, this result is relatively straightforward. Suppose R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are unhackable and non-trivial on the set ΠN\Pi^{N} of all non-stationary policies, and let π\pi^{\star} be a policy that maximises (R1\mathcal{R}_{1} and R2\mathcal{R}_{2}) reward, and π\pi_{\bot} be a policy that minimises (R1\mathcal{R}_{1} and R2\mathcal{R}_{2}) reward. Then the policy πλ\pi_{\lambda} that plays π\pi^{\star} with probability λ\lambda and π\pi_{\bot} with probability 1λ1-\lambda is a policy in ΠN\Pi^{N}. Moreover, for any π\pi there are two unique α,β\alpha,\beta\in such that J1(π)=J1(πα)J_{1}(\pi)=J_{1}(\pi_{\alpha}) and J2(π)=J2(πβ)J_{2}(\pi)=J_{2}(\pi_{\beta}). Now, if αβ\alpha\neq\beta, then either J1(π)<J1(πδ)J_{1}(\pi)<J_{1}(\pi_{\delta}) and J2(π)>J2(πδ)J_{2}(\pi)>J_{2}(\pi_{\delta}), or vice versa, for δ=(α+β)/2\delta=(\alpha+\beta)/2. If R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are unhackable then this cannot happen, so it must be that α=β\alpha=\beta. This, in turn, implies that J1(π)=J1(π)J_{1}(\pi)=J_{1}(\pi^{\prime}) iff J2(π)=J2(π)J_{2}(\pi)=J_{2}(\pi^{\prime}), and so R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are equivalent. This means that no interesting unhackability can occur on the set of all non-stationary policies.

The same argument cannot be applied to the set of stationary policies, because πλ\pi_{\lambda} is typically not stationary, and mixing stationary policies’ action probabilities does not have the same effect. For instance, consider a hallway environment where an agent can either move left or right. Mixing the “always go left” and “always go right” policies corresponds to picking a direction and sticking with it, whereas mixing their action probabilities corresponds to choosing to go left or right independently at every time-step. However, we will see that there still cannot be any interesting unhackability on this policy set, and, more generally, that there cannot be any interesting unhackability on any set of policies which contains an open subset. Formally, a set of (stationary) policies Π˙\dot{\Pi} is open if G(Π˙)\mathcal{G}(\dot{\Pi}) is open in the smallest affine space that contains G(Π)\mathcal{G}(\Pi), for the set of all stationary policies Π\Pi. We will use the following lemma:

Using this lemma, we can show that interesting unhackability is impossible on any set of stationary policies Π^\hat{\Pi} which contains an open subset Π˙\dot{\Pi}. Roughly, if F(Π˙)\mathcal{F}(\dot{\Pi}) is open, and R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are non-trivial and unhackable on Π˙\dot{\Pi}, then the fact that J1J_{1} and J2J_{2} have a linear structure on F(Π^)\mathcal{F}(\hat{\Pi}) implies that R1\mathcal{R}_{1} and R2\mathcal{R}_{2} must be equivalent on Π˙\dot{\Pi}. From this, and the fact that F(Π˙)\mathcal{F}(\dot{\Pi}) is open, it follows that R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are equivalent everywhere.

In any MDPRMDP\setminus\mathcal{R}, if Π^\hat{\Pi} contains an open set, then any pair of reward functions that are unhackable and non-trivial on Π^\hat{\Pi} are equivalent on Π^\hat{\Pi}.

In any MDPRMDP\setminus\mathcal{R}, any pair of reward functions that are unhackable and non-trivial on the set of all (stationary) policies Π\Pi are equivalent on Π\Pi.

Theorem 1 can also be applied to many other policy sets. For example, we might not care about the hackability resulting from policies with low proxy reward, as we would not expect a sufficiently good learning algorithm to learn such policies. This leads us to consider the following definition:

A (stationary) policy π\pi is ε\varepsilon-suboptimal if J(π)J(π)εJ(\pi)\geq J(\pi^{\star})-\varepsilon.

Alternatively, if the learning algorithm always uses a policy that is “nearly” deterministic (but with some probability of exploration), then we might not care about hackability resulting from very stochastic policies, leading us to consider the following definition:

Unfortunately, both of these sets contain open subsets, which means they are subject to Theorem 1.

In any MDPRMDP\setminus\mathcal{R}, any pair of reward functions that are unhackable and non-trivial on the set of all ε\varepsilon-suboptimal policies (ε>0\varepsilon>0) Πε\Pi^{\varepsilon} are equivalent on Πε\Pi^{\varepsilon}, and any pair of reward functions that are unhackable and non-trivial on the set of all δ\delta-deterministic policies (δ<1\delta<1) Πδ\Pi^{\delta} are equivalent on Πδ\Pi^{\delta}.

Intuitively, Theorem 1 can be applied to any policy set with “volume” in policy space.

2 Finite Policy Sets

Having established that interesting unhackability is impossible relative to the set of all policies, we now turn our attention to the case of finite policy sets. Note that this includes the set of all deterministic policies, since we restrict our analysis to finite MDPs. Surprisingly, here we find that non-trivial non-equivalent unhackable reward pairs always exist.

For any MDPRMDP\setminus\mathcal{R}, any finite set of policies Π^\hat{\Pi} containing at least two π,π\pi,\pi^{\prime} such that F(π)F(π)\mathcal{F}(\pi)\neq\mathcal{F}(\pi^{\prime}), and any reward function R1\mathcal{R}_{1}, there is a non-trivial reward function R2\mathcal{R}_{2} such that R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are unhackable but not equivalent.

This proof proceeds by finding a path from R1\mathcal{R}_{1} to another reward function R3\mathcal{R}_{3} that is hackable with respect to R1\mathcal{R}_{1}. Along the way to reversing one of R1\mathcal{R}_{1}’s inequalities, we must encounter a reward function R2\mathcal{R}_{2} that instead replaces it with equality. In the case that dim(Π^)=3(\hat{\Pi})=3, we can visualize moving along this path as rotating the contour lines of a reward function defined on the space containing the policies’ discounted state-action occupancies, see Figure 4. This path can be constructed so as to avoid any reward functions that produce trivial policy orderings, thus guaranteeing R2\mathcal{R}_{2} is non-trivial. For a simplification to exist, we require some further conditions, as established by the following theorem:

Let Π^\hat{\Pi} be a finite set of policies, and R1\mathcal{R}_{1} a reward function. The following procedure determines if there exists a non-trivial simplification of R1\mathcal{R}_{1} in a given MDPRMDP\setminus\mathcal{R}:

Let E1EmE_{1}\dots E_{m} be the partition of Π^\hat{\Pi} where π,π\pi,\pi^{\prime} belong to the same set iff J(π)=J(π)J(\pi)=J(\pi^{\prime}).

For each such set EiE_{i}, select a policy πiEi\pi_{i}\in E_{i} and let ZiZ_{i} be the set of vectors that is obtained by subtracting F(πi)\mathcal{F}(\pi_{i}) from each element of F(Ei)\mathcal{F}(E_{i}).

The proof proceeds similarly to Theorem 2. However, in Theorem 2 it was sufficient to show that there are no trivial reward functions along the path from R1\mathcal{R}_{1} to R3\mathcal{R}_{3}, whereas here we additionally need that if J(π)=J(π)J(\pi)=J(\pi^{\prime}) then J(π)=J(π)J^{\prime}(\pi)=J^{\prime}(\pi^{\prime}) for all functions R2\mathcal{R}_{2} on the path — this is what the extra conditions ensure.

For any finite set of policies Π^\hat{\Pi}, any environment, and any reward function R\mathcal{R}, if Π^2|\hat{\Pi}|\geq 2 and J(π)J(π)J(\pi)\neq J(\pi^{\prime}) for all π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi} then there is a non-trivial simplification of R\mathcal{R}.

A natural question is whether any reward function is guaranteed to have a non-trivial simplification on the set of all deterministic policies. As it turns out, this is not the case. For concreteness, and to build intuition for this result, we examine the set of deterministic policies in a simple MDPRMDP\setminus\mathcal{R} with S={0,1},A={0,1},T(s,a)=a,I={0:0.5,1:0.5},γ=0.5S=\{0,1\},A=\{0,1\},T(s,a)=a,I=\{0:0.5,1:0.5\},\gamma=0.5. Denote πij\pi_{ij} the policy that takes action ii from state 0 and action jj from state 1. There are exactly four deterministic policies. We find that of the 4!=244!=24 possible policy orderings, 12 are realizable via some reward function. In each of those 12 orderings, exactly two policies (of the six available pairs of policies in the ordering) can be set to equal value without resulting in the trivial reward function (which pair can be equated depends on the ordering in consideration). Attempting to set three policies equal always results in the trivial reward simplification.

For example, given the ordering π00π01π11π10\pi_{00}\leq\pi_{01}\leq\pi_{11}\leq\pi_{10}, the simplification π00=π01<π11<π10\pi_{00}=\pi_{01}<\pi_{11}<\pi_{10} is represented by R=[0321]R=\left[\begin{smallmatrix}0&3\\ 2&1\end{smallmatrix}\right], where R(s,a)=R[s,a]\mathcal{R}(s,a)=R[s,a]: for example, here taking action 1 from state 0 gives reward R(0,1)=3\mathcal{R}(0,1)=3. But there is no reward function representing a non-trivial simplification of this ordering with π01=π11\pi_{01}=\pi_{11}. We develop and release a software suite to compute these results. Given an environment and a set of policies, it can calculate all policy orderings represented by some reward function. Also, for a given policy ordering, it can calculate all nontrivial simplifications and reward functions that represent them. For a link to the repository, as well as a full exploration of these policies, orderings, and simplifications, see Appendix 11.3.

3 Unhackability in Infinite Policy Sets

The results in Section 5.1 do not characterize unhackability for infinite policy sets that do not contain open sets. Here, we provide two examples of such policy sets; one of them admits unhackable reward pairs and the other does not. Consider policies A,B,CA,B,C, and reward functions R1\mathcal{R}_{1} with J1(C)<J1(B)<J1(A)J_{1}(C)<J_{1}(B)<J_{1}(A) and R2\mathcal{R}_{2} with J2(C)=J2(B)<J2(A)J_{2}(C)=J_{2}(B)<J_{2}(A). Policy sets Πa={A}{λB+(1λ)C:λ}\Pi_{a}=\{A\}\cup\{\lambda B+(1-\lambda)C:\lambda\in\} and Πb={A}{λB+(1λ)C:λ}\Pi_{b}=\{A\}\cup\{\lambda B+(1-\lambda)C:\lambda\in\} are depicted in Figure 5; the vertical axis represents policies’ values according to R1\mathcal{R}_{1} and R2\mathcal{R}_{2}. For Πa\Pi_{a}, R2\mathcal{R}_{2} is a simplification of R1\mathcal{R}_{1}, but for Πb\Pi_{b}, it is not, since J1(X)<J1(Y)J_{1}(X)<J_{1}(Y) and J2(X)>J2(Y)J_{2}(X)>J_{2}(Y).

Discussion

We reflect on our results and identify limitations in Section 6.1. In Section 6.2, we discuss how our work can inform discussions about the appropriateness, potential risks, and limitations of using of reward functions as specifications of desired behavior.

Our work has a number of limitations. We have only considered finite MDPs and Markov reward functions, leaving more general environments for future work. While we characterized hackability and simplification for finite policy sets, the conditions for simplification are somewhat opaque, and our characterization of infinite policy sets remains incomplete.

As previously discussed, our definition of hackability is strict, arguably too strict. Nonetheless, we believe that understanding the consequences of this strict definition is an important starting point for further theoretical work in this area.

The main issue with the strictness of our definition has to do with the symmetric nature of hackability. The existence of complex behaviors that yield low proxy reward and high true reward is much less concerning than the reverse, as these behaviors are unlikely to be discovered while optimizing the proxy. For example, it is very unlikely that our agent would solve climate change in the course of learning how to wash dishes. Note that the existence of simple behaviors with low proxy reward and high true reward is concerning; these could arise early in training, leading us to trust the proxy, only to later see the true reward decrease as the proxy is further optimized. To account for this issue, future work should explore more realistic assumptions about the probability of encountering a given sequence of policies when optimizing the proxy, and measure hackability in proportion to this probability.

We could allow for approximate unhackability by only considering pairs of policies ranked differently by the true and proxy reward functions as evidence of hacking iff their value according to the true reward function differs by more than some ε\varepsilon. Probabilistic unhackability could be defined by looking at the number of misordered policies; this would seem to require making assumptions about the probability of encountering a given policy when optimizing the proxy.

Finally, while unhackability is a guarantee that no hacking will occur, hackability is far from a guarantee of hacking. Extensive empirical work is necessary to better understand the factors that influence the occurrence and severity of reward hacking in practice.

2 Implications

How should we specify our preferences for AI systems’ behavior? And how detailed a specification is required to achieve a good outcome? In reinforcement learning, the goal of maximizing (some) reward function is often taken for granted, but a number of authors have expressed reservations about this approach (Gabriel,, 2020; Dobbe et al.,, 2021; Hadfield-Menell et al., 2016b, ; Hadfield-Menell et al.,, 2017; Bostrom,, 2014). Our work has several implications for this discussion, although we caution against drawing any strong conclusions due to the limitations mentioned in Section 6.1.

One source of confusion and disagreement is the role of the reward function; it is variously considered as a means of specifying a task (Leike et al.,, 2018) or encoding broad human values (Dewey,, 2011); such distinctions are discussed by Christiano, (2019) and Gabriel, (2020). We might hope to use Markov reward functions to specify narrow tasks without risking behavior that goes against our broad values. However, if we consider the “narrow task” reward function as a proxy for the true “broad values” reward function, our results indicate that this is not possible: these two reward functions will invariably be hackable. Such reasoning suggests that reward functions must instead encode broad human values, or risk being hacked. This seems challenging, perhaps intractably so, indicating that alternatives to reward optimization may be more promising. Potential alternatives include imitation learning (Ross et al.,, 2011), constrained RL (Szepesvári,, 2020), quantilizers (Taylor,, 2016), and incentive management (Everitt et al.,, 2019).

Scholars have also criticized the assumption that human values can be encoded as rewards (Dobbe et al.,, 2021), and challenged the use of metrics more broadly (O’Neil,, 2016; Thomas and Uminsky,, 2022), citing Goodhart’s Law (Manheim and Garrabrant,, 2018; Goodhart,, 1975). A concern more specific to the optimization of reward functions is power-seeking (Turner et al.,, 2021; Bostrom,, 2012; Omohundro,, 2008). Turner et al., (2021) prove that optimal policies tend to seek power in most MDPs and for most reward functions. Such behavior could lead to human disempowerment; for instance, an AI system might disable its off-switch (Hadfield-Menell et al., 2016a, ). Bostrom, (2014) and others have argued that power-seeking makes even slight misspecification of rewards potentially catastrophic, although this has yet to be rigorously established.

Despite such concerns, approaches to specification based on learning reward functions remain popular (Fu et al.,, 2017; Stiennon et al.,, 2020; Nakano et al.,, 2021). So far, reward hacking has usually been avoidable in practice, although some care must be taken (Stiennon et al.,, 2020). Proponents of such approaches have emphasized the importance of learning a reward model in order to exceed human performance and generalize to new settings (Brown et al., 2020a, ; Leike et al.,, 2018). But our work indicates that such learned rewards are almost certainly hackable, and so cannot be safely optimized. Thus we recommend viewing such approaches as a means of learning a policy in a safe and controlled setting, which should then be validated before being deployed.

Conclusion

Our work begins the formal study of reward hacking in reinforcement learning. We formally define hackability and simplification of reward functions, and show conditions for the (non-)existence of non-trivial examples of each. We find that unhackability is quite a strict condition, as the set of all policies never contains non-trivial unhackable pairs of reward functions. Thus in practice, reward hacking must be prevented by limiting the set of possible policies, or controlling (e.g. limiting) optimization. Alternatively, we could pursue approaches not based on optimizing reward functions.

References

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes]

Did you discuss any potential negative societal impacts of your work? [Yes] See Section 6.2

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [Yes]

Did you include complete proofs of all theoretical results? [Yes] Some of the proofs are in the Appendix.

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code and instructions for running it are available in the supplementary materials. The code does not use any datasets.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] We do not perform model training in this work.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] No compute beyond a personal laptop (with integrated graphics) was used.

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [N/A] The codebase was written from scratch.

Did you mention the license of the assets? [N/A]

Did you include any new assets either in the supplemental material or as a URL? [Yes] The codebase is available in the supplemental material.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Overview

Section 9 contains proofs of the main theoretical results. Section 11 expands on examples given in the main text. Section 12 presents an unhackability diagram for a generic set of three policies a,b,ca,b,c; Section 13 shows a simplification diagram of the same policies.

Proofs

We say R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are equivalent on a set of policies Π\Pi if J1J_{1} and J2J_{2} induce the same ordering of Π\Pi, and that R\mathcal{R} is trivial on Π\Pi if J(π)=J(π)J(\pi)=J(\pi^{\prime}) for all π,πΠ\pi,\pi^{\prime}\in\Pi. We also have the following definitions from Sections 4 and 5:

A pair of reward functions R1\mathcal{R}_{1}, R2\mathcal{R}_{2} are hackable relative to policy set Π\Pi and an environment (S,A,T,I,,γ)(S,A,T,I,\underline{\hskip 8.5359pt},\gamma) if there exist π,πΠ\pi,\pi^{\prime}\in\Pi such that

R2\mathcal{R}_{2} is a simplification of R1\mathcal{R}_{1} relative to policy set Π\Pi if for all π,πΠ\pi,\pi^{\prime}\in\Pi,

and there exist π,πΠ\pi,\pi^{\prime}\in\Pi such that J2(π)=J2(π)J_{2}(\pi)=J_{2}(\pi^{\prime}) but J1(π)J1(π)J_{1}(\pi)\neq J_{1}(\pi^{\prime}). Moreover, if R2\mathcal{R}_{2} is trivial then we say that this is a trivial simplification.

Note that these definitions only depend on the policy orderings associated with R2\mathcal{R}_{2} and R1\mathcal{R}_{1}, and so we can (and do) also speak of (ordered) pairs of policy orderings being simplifications or hackable. We also make use of the following definitions:

A (stationary) policy π\pi is ε\varepsilon-suboptimal if J(π)J(π)εJ(\pi)\geq J(\pi^{\star})-\varepsilon, where ε>0\varepsilon>0

Formally, a set of (stationary) policies Π˙\dot{\Pi} is open if V(Π˙)\mathcal{V}(\dot{\Pi}) is open in the smallest affine space that contains V(Π)\mathcal{V}(\Pi), where Π\Pi is the set of all stationary policies. Note that this space is S(A1)|S|(|A|-1)-dimensional, since all action probabilities sum to 1.

We require two more propositions for the proof of this lemma.

If Π˙\dot{\Pi} is open then F\mathcal{F} is injective on Π˙\dot{\Pi}.

First note that, since π(as)0\pi(a\mid s)\geq 0, we have that if Π˙\dot{\Pi} is open then π(as)>0\pi(a\mid s)>0 for all s,as,a for all πΠ˙\pi\in\dot{\Pi}. In other words, all policies in Π˙\dot{\Pi} take each action with positive probability in each state.

Note that if F(π)=F(π)\mathcal{F}(\pi)=\mathcal{F}(\pi^{\prime}) then wπ=wπw_{\pi}=w_{\pi^{\prime}}, and moreover that

Next, since π\pi takes each action with positive probability in each state, we have that π\pi visits every state with positive probability. This implies that wπ(s)0w_{\pi}(s)\neq 0 for all ss, which means that we can express π\pi as

Note that F\mathcal{F} is not injective on Π\Pi; if there is some state ss that π\pi reaches with probability , then we can alter the behaviour of π\pi at ss without changing F(π)\mathcal{F}(\pi). But every policy in an open policy set Π˙\dot{\Pi} visits every state with positive probability, which then makes F\mathcal{F} injective. In fact, Proposition 1 straightforwardly generalises to the set of all policies that visit all states with positive probability (although this will not be important for our purposes).

or, alternatively, if we wish R\mathcal{R}^{\prime} to be defined over the domain S×AS\times A,

In any MDPRMDP\setminus\mathcal{R}, if Π^\hat{\Pi} contains an open set, then any pair of reward functions that are unhackable and non-trivial on Π^\hat{\Pi} are equivalent on Π^\hat{\Pi}.

Let R1\mathcal{R}_{1} and R2\mathcal{R}_{2} be any two unhackable and non-trivial reward functions. We will show that, for any π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi}, we have J1(π)=J1(π)    J2(π)=J2(π)J_{1}(\pi)=J_{1}(\pi^{\prime})\implies J_{2}(\pi)=J_{2}(\pi^{\prime}), and thus, by symmetry, J1(π)=J1(π)    J2(π)=J2(π)J_{1}(\pi)=J_{1}(\pi^{\prime})\iff J_{2}(\pi)=J_{2}(\pi^{\prime}). Since R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are unhackable, this further means that they have exactly the same policy order, i.e. that they are equivalent.

Choose two arbitrary π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi} with J1(π)=J1(π)J_{1}(\pi)=J_{1}(\pi^{\prime}) and let fF(π),fF(π)f\doteq\mathcal{F}(\pi),f^{\prime}\doteq\mathcal{F}(\pi^{\prime}). The proof has 3 steps:

We use linearity to show that this implies that J2(π)=J2(π)J_{2}(\pi)=J_{2}(\pi^{\prime}).

In any MDPRMDP\setminus\mathcal{R}, any pair of reward functions that are unhackable and non-trivial on the set of all (stationary) policies Π\Pi are equivalent on Π\Pi.

This corollary follows from Theorem 1, if we note that the set of all policies does contain an open set. This includes, for example, the set of all policies in an ϵ\epsilon-ball around the policy that takes all actions with equal probability in each state. ∎

In any MDPRMDP\setminus\mathcal{R}, any pair of reward functions that are unhackable and non-trivial on the set of all ε\varepsilon-suboptimal policies (ε>0\varepsilon>0) Πε\Pi^{\varepsilon} are equivalent on Πε\Pi^{\varepsilon}, and any pair of reward functions that are unhackable and non-trivial on the set of all δ\delta-deterministic policies (δ<1\delta<1) Πδ\Pi^{\delta} are equivalent on Πδ\Pi^{\delta}.

To prove this, we will establish that both Πε\Pi^{\varepsilon} and Πδ\Pi^{\delta} contain open policy sets, and then apply Theorem 1.

Let us begin with Πδ\Pi^{\delta}. First, let π\pi be some deterministic policy, and let πϵ\pi_{\epsilon} be the policy that in each state with probability 1ϵ1-\epsilon takes the same action as π\pi, and otherwise samples an action uniformly. Then if δ<ϵ<1\delta<\epsilon<1, πϵ\pi_{\epsilon} is the center of an open ball in Πδ\Pi^{\delta}. Thus Πδ\Pi^{\delta} contains an open set, and we can apply Theorem 1.

For Πε\Pi^{\varepsilon}, let π\pi^{\star} be an optimal policy, and apply an analogous argument. ∎

2 Finite Policy Sets

For any MDPRMDP\setminus\mathcal{R}, any finite set of policies Π^\hat{\Pi} containing at least two π,π\pi,\pi^{\prime} such that F(π)F(π)\mathcal{F}(\pi)\neq\mathcal{F}(\pi^{\prime}), and any reward function R1\mathcal{R}_{1}, there is a non-trivial reward function R2\mathcal{R}_{2} such that R1\mathcal{R}_{1} and R2\mathcal{R}_{2} are unhackable but not equivalent.

If R1\mathcal{R}_{1} is trivial, then simply choose any non-trivial R2\mathcal{R}_{2}. Otherwise, the proof proceeds by finding a path from R1\vec{\mathcal{R}}_{1} to R1-\vec{\mathcal{R}}_{1}, and showing that there must be an R2\vec{\mathcal{R}}_{2} on this path such that R2\mathcal{R}_{2} is non-trivial and unhackable with respect to R1\mathcal{R}_{1}, but not equivalent to R1\mathcal{R}_{1}.

First, note that J(π)=F(π)RJ(\pi)=\mathcal{F}(\pi)\cdot\vec{\mathcal{R}} is continuous in R\vec{\mathcal{R}}. This means that if J1(π)>J2(π)J_{1}(\pi)>J_{2}(\pi^{\prime}) then there is a unique first vector R2\vec{\mathcal{R}}_{2} on any path from R1\vec{\mathcal{R}}_{1} to R1-\vec{\mathcal{R}}_{1} such that F(π)R2F(π)R2\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{2}\not>\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{2}, and for this vector we have that F(π)R2=F(π)R2\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{2}=\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{2}. Since Π^\hat{\Pi} is finite, and since R1\mathcal{R}_{1} is not trivial, this means that on any path from R1\vec{\mathcal{R}}_{1} to R1-\vec{\mathcal{R}}_{1} there is a unique first vector R2\vec{\mathcal{R}}_{2} such that R2\mathcal{R}_{2} is not equivalent to R1\mathcal{R}_{1}, and then R2\mathcal{R}_{2} must also be a unhackable with respect to R1\mathcal{R}_{1}.

Let Π^\hat{\Pi} be a finite set of policies, and R\mathcal{R} a reward function. The following procedure determines if there exists a non-trivial simplification of R\mathcal{R} in a given MDPRMDP\setminus\mathcal{R}:

Let E1EmE_{1}\dots E_{m} be the partition of Π^\hat{\Pi} where π,π\pi,\pi^{\prime} belong to the same set iff J(π)=J(π)J(\pi)=J(\pi^{\prime}).

For each such set EiE_{i}, select a policy πiEi\pi_{i}\in E_{i} and let ZiZ_{i} be the set of vectors that is obtained by subtracting F(πi)\mathcal{F}(\pi_{i}) from each element of F(Ei)\mathcal{F}(E_{i}).

This proof uses a similar proof strategy as Theorem 2. However, in addition to avoiding trivial reward functions on the path from R1\vec{\mathcal{R}}_{1} to R1-\vec{\mathcal{R}}_{1}, we must also ensure that we stay within the “equality-preserving space”, to be defined below.

Suppose R2\mathcal{R}_{2} is a reward function such that if J1(π)=J1(π)J_{1}(\pi)=J_{1}(\pi^{\prime}) then J2(π)=J2(π)J_{2}(\pi)=J_{2}(\pi^{\prime}), for all π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi}. This is equivalent to saying that L2(F(π))=L2(F(π))L_{2}(\mathcal{F}(\pi))=L_{2}(\mathcal{F}(\pi^{\prime})) if π,πEi\pi,\pi^{\prime}\in E_{i} for some EiE_{i}. By the properties of linear functions, this implies that if F(Ei)\mathcal{F}(E_{i}) contains did_{i} linearly independent vectors then it specifies a (di1)(d_{i}-1)-dimensional affine space SiS_{i} such that L2(x)=L2(x)L_{2}(x)=L_{2}(x^{\prime}) for all points x,xSix,x^{\prime}\in S_{i}. Note that this is the smallest affine space which contains all points in EiE_{i}. Moreover, L2L_{2} is also constant for any affine space Siˉ\bar{S_{i}} parallel to SiS_{i}. Formally, we say that Siˉ\bar{S_{i}} is parallel to SiS_{i} if there is a vector zz such that for any ySiˉy\in\bar{S_{i}} there is an xSix\in S_{i} such that y=x+zy=x+z. From the properties of linear functions, if L2(x)=L2(x)L_{2}(x)=L_{2}(x^{\prime}) then L2(x+z)=L2(x+z)L_{2}(x+z)=L_{2}(x^{\prime}+z).

For the other direction, suppose DD2D^{\prime}\leq D-2. Note that this implies that R1\mathcal{R}_{1} is not trivial. Let R3=R1\mathcal{R}_{3}=-\mathcal{R}_{1}. Now both R1\vec{\mathcal{R}}_{1} and R3\vec{\mathcal{R}}_{3} are located in the equality-preserving space. Next, since the equality-preserving space has at least two dimensions, this means that there is a continuous path from R1\vec{\mathcal{R}}_{1} to R3\vec{\mathcal{R}}_{3} through the equality-preserving space that does not pass the origin. Now, note that Ji(π)=F(π)RiJ_{i}(\pi)=\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{i} is continuous in Ri\vec{\mathcal{R}}_{i}. This means that there, on the path from R1\vec{\mathcal{R}}_{1} to R3\vec{\mathcal{R}}_{3} is a first vector R2\vec{\mathcal{R}}_{2} such that F(π)R2=F(π)R2\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{2}=\mathcal{F}(\pi^{\prime})\cdot\vec{\mathcal{R}}_{2} but F(π)R1F(π)R1\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{1}\neq\mathcal{F}(\pi^{\prime})\cdot\vec{\mathcal{R}}_{1} for some π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi}. Let R2\mathcal{R}_{2} be a reward function corresponding to R2\vec{\mathcal{R}_{2}}. Since R2\vec{\mathcal{R}_{2}} is not 0\vec{0}, we have that R2\mathcal{R}_{2} is not trivial on Π^\hat{\Pi}. Moreover, since R2\vec{\mathcal{R}_{2}} is in the equality-preserving space, and since F(π)R2=F(π)R2\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{2}=\mathcal{F}(\pi^{\prime})\cdot\vec{\mathcal{R}}_{2} but F(π)R1F(π)R1\mathcal{F}(\pi)\cdot\vec{\mathcal{R}}_{1}\neq\mathcal{F}(\pi^{\prime})\cdot\vec{\mathcal{R}}_{1} for some π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi}, we have that R2\mathcal{R}_{2} is a non-trivial simplification of R1\mathcal{R}_{1}. Therefore, if DD2D^{\prime}\leq D-2 then there exists a non-trivial simplification of R1\mathcal{R}_{1}.

We have thus proven both directions, which completes the proof. ∎

For any finite set of policies Π^\hat{\Pi}, any environment, and any reward function R\mathcal{R}, if Π^2|\hat{\Pi}|\geq 2 and J(π)J(π)J(\pi)\neq J(\pi^{\prime}) for all π,πΠ^\pi,\pi^{\prime}\in\hat{\Pi}, then there is a non-trivial simplification of R\mathcal{R}.

Any Policy Can Be Made Optimal

In this section, we show that any policy is optimal under some reward function.

For any rewardless MDP (S,A,T,I,,γ)(S,A,T,I,\underline{\hskip 8.5359pt},\gamma) and any policy π\pi, there exists a reward function R\mathcal{R} such that π\pi is optimal in the corresponding MDP (S,A,T,I,R,γ)(S,A,T,I,\mathcal{R},\gamma).

This shows that any policy is rationalised by some reward function in any environment. Any policy that gives probability to any action which π\pi takes with probability is optimal under this construction. This means that if π\pi is deterministic, then it will be the only optimal policy in (S,A,T,I,R,γ)(S,A,T,I,\mathcal{R},\gamma).

Examples

In this section, we take a closer look at two previously-seen examples: the two-state MDPRMDP\setminus\mathcal{R} and the cleaning robot.

Let us explore in more detail the two-state system introduced in the main text. We decsribe this infinite-horizon MDPRMDP\setminus\mathcal{R} in Table 1.

We denote πij\pi_{ij} (i,j{0,1}i,j\in\{0,1\}) the policy which takes action ii when in state 0 and action jj when in state 1. This gives us four possible deterministic policies:

There are 4!=244!=24 ways of ordering these policies with strict inequalities. Arbitrarily setting π00<π11\pi_{00}<\pi_{11} breaks a symmetry and reduces the number of policy orderings to 12. When a policy ordering can be derived from some reward function R\mathcal{R}, we say that R\mathcal{R} represents it, and that the policy ordering is representable. Of these 12 policy orderings with strict inequalities, six are representable:

Simplification in this environment is nontrivial – given a policy ordering, it is not obvious which strict inequalities can be set to equalities such that there is a reward function which represents the new ordering. Through a computational approach (see Section 11.3) we find the following representable orderings, each of which is a simplification of one of the above strict orderings.

Furthermore, for this environment, we find that any reward function which sets the value of three policies equal necessarily forces the value of the fourth policy to be equal as well.

2 Cleaning robot example

Recall the cleaning robot example in which a robot can choose to clean a combination of three rooms, and receives a nonnegative reward for each room cleaned. This setting can be thought of as a single-step eight-armed bandit with special reward structure.

We begin our exploration of this environment with a statement regarding exactly when two policies are hackable. In fact, the proposition is slightly more general, extending to an arbitrary (finite) number of rooms.

Consider a cleaning robot which can clean NN different rooms, and identify each room with a unique index in {1, …, N}. Cleaning room ii gives reward r(i)0r(i)\geq 0. Cleaning multiple rooms gives reward equal to the sum of the rewards of the rooms cleaned. The value of a policy πS\pi_{S} which cleans a collection of rooms SS is the sum of the rewards corresponding to the rooms cleaned: J(πS)=iSr(i)J(\pi_{S})=\sum_{i\in S}r(i). For room ii, the true reward function assigns a value rtrue(i)r_{\text{true}}(i), while the proxy reward function assigns it reward rproxy(i)r_{\text{proxy}}(i). The proxy reward is hackable with respect to the true reward if and only if there are two sets of rooms S1,S2S_{1},S_{2} such that iS1rproxy(i)<iS2rproxy(i)\sum_{i\in S_{1}}r_{\text{proxy}}(i)<\sum_{i\in S_{2}}r_{\text{proxy}}(i) and iS1rtrue(i)>iS2rtrue(i)\sum_{i\in S_{1}}r_{\text{true}}(i)>\sum_{i\in S_{2}}r_{\text{true}}(i).

We show the two directions of the double implication.

If rproxy,rtruer_{\text{proxy}},r_{\text{true}} are hackable, then there must be a pair of policies π1,π2\pi_{1},\pi_{2} such that Jproxy(π1)<Jproxy(π2)J_{\text{proxy}}(\pi_{1})<J_{\text{proxy}}(\pi_{2}) and Jtrue(π1)>Jtrue(π2)J_{\text{true}}(\pi_{1})>J_{\text{true}}(\pi_{2}). Let S1S_{1} be the set of rooms cleaned by π1\pi_{1} and S2S_{2} be the set of rooms cleaned by π2\pi_{2}. Again remembering that J(πS)=iSr(i)J(\pi_{S})=\sum_{i\in S}r(i) immediately gives us that iS1rproxy(i)<iS2rproxy(i)\sum_{i\in S_{1}}r_{\text{proxy}}(i)<\sum_{i\in S_{2}}r_{\text{proxy}}(i) and iS1rtrue(i)>iS2rtrue(i)\sum_{i\in S_{1}}r_{\text{true}}(i)>\sum_{i\in S_{2}}r_{\text{true}}(i).

In the main text, we saw two intuitive ways of modifying the reward function in the cleaning robot example: omitting information and overlooking fine details. Unfortunately, there is no obvious mapping of Proposition 4 onto simple rules concerning how to safely omit information or overlook fine details: it seems that one must resort to ensuring that no two sets of rooms satisfy the conditions for hackability described in the proposition.

2.2 Simplification

We now consider simplification in this environment. Since we know the reward for cleaning each room is nonnegative, there will be some structure underneath all the possible orderings over the policies. This structure is shown in Figure 7: regardless of the value assigned to each room, a policy at the tail of an arrow can only be at most as good as a policy at the head of the arrow.

If we decide to simplify an ordering by equating two policies connected by an arrow, the structure of the reward calculation will force other policies to also be equated. Specifically, if the equated policies differ only in position ii, then all pairs of policies which differ only in position ii will also be set equal.

For example, imagine we simplify the reward by saying we don’t care if the attic is cleaned or not, so long as the other two rooms are cleaned (recall that we named the rooms Attic, Bedroom and Kitchen). This amounts to saying that J()=J()J()=J(). Because the policy value function is of the form

where x,y,z{0,1}x,y,z\in\{0,1\}, this simplification forces r1=0r_{1}=0. In turn, this implies that J()=J()J()=J() and J()=J()J()=J(). The new structure underlying the ordering over policies is shown in Figure 8.

An alternative way to think about simplification in this problem is by imagining policies as corners of a cube, and simplification as flattening of the cube along one dimension – simplification collapses this cube into a square.

3 Software repository

The software suite described in the paper (and used to calculate the representable policy orderings and simplifications of the two-state MDPRMDP\setminus\mathcal{R}) can be found at https://anonymous.4open.science/r/simplified-reward-5130.

Unhackability Diagram

Consider a setting with three policies a,b,ca,b,c. We allow all possible orderings of the policies. In general, these orderings might not all be representable; a concrete case in which they are is when a,b,ca,b,c represent different deterministic policies in a 3-armed bandit.

We can represent all unhackable pairs of policy orderings with an undirected graph, which we call an unhackability diagram. This includes a node for every representable ordering and edges connecting orderings which are unhackable. Figure 9 shows an unhackability diagram including all possible orderings of the three policies a,b,ca,b,c.

Simplification Diagram

We can also represent all possible simplifications using a directed graph, which we call a simplification diagram. This includes a node for every representable ordering and edges pointing from orderings to their simplifications. Figure 10 presents a simplification diagram including all possible orderings of three policies a,b,ca,b,c.

We note that the simplification graph is a subgraph of the unhackability graph. This will always be the case, since simplification can never lead to reward hacking.