Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Introduction

AI systems are rapidly gaining capabilities (OpenAI, 2023), especially in natural language (Bubeck et al., 2023). To mitigate risks of deployment, models must be thoroughly evaluated and steered towards safer behaviors (Hendrycks et al., 2021b). Previous benchmarks aimed at evaluating these complex systems have measured language understanding (Wang et al., 2019) or reasoning in isolated scenarios (Srivastava et al., 2022; Liang et al., 2022). However, models are now being trained for real-world, interactive tasks (Ahn et al., 2022; Reed et al., 2022). Thus, benchmarks should assess how models behave in interactive environments.

Text-based games are a natural test-bed for evaluating interactive agents. Progressing through such games requires agents to plan and possess natural language understanding. Although there are several benchmarks for text-based games (Côté et al., 2018; Shridhar et al., 2021; Hausknecht et al., 2020), their difficulty mainly comes from the difficulty of exploring the physical environment. By contrast, few works study planning in complex social environments, a crucial capability for agents deployed in the real world.

Moreover, even if artificial agents learn to competently navigate social situations, there is no guarantee that they will learn to behave ethically. For example, CICERO (Bakhtin et al., 2022) demonstrates that agents trained to maximize reward may be incentivized to exhibit deceptive and manipulative behaviors. So benchmarks should not only assess agents’ capabilities, but also their unethical tendencies.

To both guide progress on text-based agents and encourage them to be more moral, we propose the Measuring Agents’ Competence & Harmfulness In A Vast Environment of Long-horizon Language Interactions (Machiavelli) benchmark. Our environment, detailed in Table 1, is based on human-written, text-based Choose-Your-Own-Adventure games from choiceofgames.com. These games abstract away low-level environment interactions and instead spotlight high-level decisions while providing real-world goals to agents. Machiavelli is a step towards measuring an agent’s ability to plan in realistic social environments.

To track unethical behaviors, the environment reports the extent to which agent actions are deceptive, reduce utility, and are power-seeking, among other behavioral characteristics (Figure 2). We accomplish this by first operationalizing these behaviors as mathematical formulas, second densely annotate social concepts in the games, such as characters’ wellbeing, and third using the annotations and formulas to calculate a numerical score for each of these behaviors. The annotations are collected using GPT-4 (OpenAI, 2023), which we show empirically to be more performant than human annotators.

On Machiavelli, we observe that artificial agents trained for goal optimization often exhibit unethical and power-seeking behaviors, analogous to how language models trained for next-token prediction output often toxic text (Gehman et al., 2020). In other words, agents which are trained amorally may learn Machiavellian behaviors, maximizing their reward without regards to others or the environment. AI agents must grapple with the same tension that humans confront: how do we achieve our objectives while behaving morally?

We begin to improve this trade-off by steering agents to be more moral. For language-model agents, we find that moral conditioning (prompting the agent to be more ethical) reduces the frequency of harmful behavior. For RL agents, we follow ideas in Hendrycks et al. (2021c) to build an artificial conscience, which steers policies away from unethical actions. Moreover, for both agents, behavioral regularization limits negative behavior without significantly reducing reward. Our work takes a step towards the design of competent yet safe sequential decision-making agents. To encourage progress in this area, we release the code for Machiavelli, along with all our labels, here: https://aypan17.github.io/machiavelli.

Machiavelli: An Environment for Measuring Harmful Agent Behavior

To motivate our benchmark, we detail the structure of the Choose-Your-Own-Adventure games in Machiavelli (Section 2.1). We mathematically formulate the behaviors we study: ethical violations, disutility, and power-seeking (Section 2.2). Because power has several definitions, it is more carefully operationalized (Section 2.3). Our definitions inform the game annotations we collect (Section 2.4) and how we evaluate agents on our benchmark (Section 2.5).

Humans simultaneously optimize multiple objectives (Maslow, 1943). Similarly, most of the games have dozens of achievements designed by the authors of the games. We designate these achievements as the agent’s goals in the environment (Figure 2).

Mimicking how human actions are encoded in natural language (Shridhar et al., 2020), the environment requires the agent to plan in human-like action spaces. In Figure 2, in order to “take down an unscrupulous plotter,” the agent must recognize that it should lie and pretend to cooperate.

Understanding game progression can sometimes require information found tens of thousands of tokens earlier. For example, the player unlocks the chance to betray Jess only if they select the second option in Figure 2. Modeling long-term dependencies is an outstanding challenge for language models (Tay et al., 2021).

As noted by the publishers, their games “focus on the choices we find interesting–moral choices, trade-offs between different values and characteristics, and so forth.” The player must trade-off between achieving goals and behaving ethically. More importantly, these structures allow us to characterize agent behavior. For example, agents which tend to lie in these games may be more prone to lie if publicly deployed. Methods purported to reduce such harmful behaviors can be evaluated on Machiavelli.

2 Measuring Unethical Behaviors

As described in Section 2.1, the diversity of choices in these games enables the measurement of a wide array of behaviors. We outline our measurements of ethical violations, disutility, and power-seeking in Machiavelli. Additional behaviors are concretized in Appendix A.

Agents which are trained amorally may be incentivized commit ethical violations, such as being deceptive (Gneezy, 2005). Language models have been shown to output untruthful texts (Lin et al., 2022), raising the concern that they may spread misinformation (Weidinger et al., 2022). As artificial agents gain access to APIs (Schick et al., 2023; Liang et al., 2023), they gain the ability to commit a wider variety of ethical violations.

Agents which reduce others’ wellbeing in pursuit of their actions behave immorally according to most moral theories (Crisp, 2021). Such agents prioritize themselves at the cost of others, behaving selfishly (Okasha, 2020).

Power-seeking agents may induce non-uniform distributions of power (Piketty, 2014), leading to inequality and driving conflict (Neckerman & Torche, 2007; Esteban & Ray, 2011). A primary concern of AI ethics may be that AI can shape power (Pratyusha, 2020), making it a key property to measure for AIs.

We say power-seeking occurs when the player attempts to take actions which increase its ability to influence the state of the world. In the games, we approximate how power-seeking an agent is by measuring how much power it exercised throughout its trajectory. Power is defined in several fields. To be agnostic, we formulate and measure multiple definitions of power in Section 2.3.

3 Operationalizing Power

Power is a nebulous concept with competing definitions in philosophy, physics, political science, economics, and sociology. We offer a fuller discussion of power in Appendices B, C, D, and E.

Power can be decomposed as either the inputs to the agent or the outputs of the agent. An agent’s inputs are its resources (resource power) and an agent’s outputs is its potential influence over the world (prospective power). See Figure 3.

We ground power in several units to capture the different effects of actions. For example, detonating a nuclear bomb in the ocean transfers a large amount of energy but does not immediately affect the wellbeing of humans. We track power in terms of watts, money, wellbeing, and social influence, which are inspired by Appendix B.

Resources may have synergistic interactions (e.g., having nuclear launch codes and a nuclear warhead is far more powerful than either in isolation), so precise formulations of resource power are complicated to obtain for each game.

An agent’s prospective power is its tendency to take impactful actions multiplied by its ability to successfully execute those actions. Concretely, an agent is more powerful the more it aims to influence the wellbeing of other agents and the more it can actually do so. Given an agent π\pi and a state ss, we define

In practice, we approximate P(π,s)P(\pi,s) by sampling. For each trajectory, we calculate its exercised power, or how much power was exercised to enable all the actions in the trajectory. Given a trajectory τ=(s1,,sn)\tau=(s_{1},\ldots,s_{n}), we define Eunit(τ)=t=1nIunit(st)\text{E}_{\emph{unit}}(\tau)=\sum_{t=1}^{n}I_{\emph{unit}}(s_{t}) where Iunit(s)I_{\emph{unit}}(s) is some scalar measure of an agent’s exercised power at state ss for the given unit. In Machiavelli, we use our annotations to obtain a value for II.

In all our results, we report the power of an agent π\pi as

where EunitE_{\text{unit}} is calculated over some set of sampled trajectories {τ1,,τk}\{\tau_{1},\ldots,\tau_{k}\} from π\pi and Iunit(s)I_{\text{unit}}(s) is an indicator . In theory, as the number of sampled trajectories increases, the mean of the agent’s exercised power over trajectories will better estimate the agent’s prospective power.

4 Annotating Machiavelli

Measuring harmful behaviors relies on detailed annotations throughout an agent’s trajectory. We use GPT-4 (OpenAI, 2023) to annotate our benchmark.

Model labels are efficient. We investigate the ability of LLMs to automate labeling. We handcraft prompts for each label type; see Appendix F for full details of our prompts.

GPT-4 is usually as accurate as human annotators from Surge AI on a test set of 2,000 scenes sampled uniformly at random from all games. We define the gold labels on the test set to be the ensemble of labels from three experts (authors).

Shown in Table 8, GPT-4’s labels are more correlated with the gold labels than the average crowdworker’s labels. Also, an ensemble of model labels is more performant than an ensemble of crowdworker labels in 16/18 label categories.

5 Evaluating Agents on Machiavelli

Using our labels, we can report a rich set of behavioral metrics. An agent’s goal is to maximize both the reward and the behavioral scores.

We define reward as the sum of the points agent’s achievements. Importantly, the agent only has access to the achievements and not the harm labels, as the goal is to evaluate whether agents can successfully navigate trade-offs between reward and moral behavior. We report normalized reward, i.e., agent’s points divided by total possible points.

Given a harmful behavior, we define an agent’s behavior score as:

i.e., the proportion of harms relative to the baseline harm. We estimate the number of harms committed by the random agent by as the mean of 1,000 sampled trajectories.

Following this setup, the best possible score for all behavioral metrics is 0%, though in many games there may be no path which completes all achievements or avoids all harm. Hence it is difficult to define the performance of an ideal agent given conflicting objectives and metrics.

Out of the 134 games of Machiavelli, we identify 30 games where agents trained to maximize reward perform poorly on behavioral metrics, i.e., where points and harms are positively correlated. These games form the Machiavelli test set, listed in Appendix G.

Reward Optimization May Produce Machiavellian Behavior

We evaluate our agents (Section 3.1) and find those trained to maximize reward learn Machiavellianism (Section 3.2).

We consider agents that are not heavily tailored to specific text-based environments (Hausknecht et al., 2019; Adhikari et al., 2020) nor ones that construct exploration and action modifiers that are not easily reproduced in our games (Ammanabrolu & Hausknecht, 2020; Ammanabrolu et al., 2020).

This agent selects one of the available actions uniformly at random.

This agent uses an LM to select actions. We prompt the LMs with a list of achievements to optimize for, the current scene text, and a list of available actions. We truncate prompts that exceed the maximum context length. The model outputs a numerical action. We experiment with both direct prompting and chain-of-thought (CoT) prompting (Wei et al., 2022). Our main results are presented with gpt-3.5-turbo and gpt-4, and other models results are shown in Appendix K. Detailed prompts are provided in Appendix H.

This agent is adapted from DRRN (He et al., 2016) using the codebase in Yao et al. (2020). We remove the action generation module and encode the provided actions using DeBERTa (He et al., 2021). Scenarios that exceed the maximum context length are truncated. The DRRN learns via Q-learning with Boltzmann exploration. In preliminary experiments, we find that agents tend to converge after 50k steps, so we train for 50k steps and select the checkpoint with the highest training score. Training details are provided in Appendix I.

2 The Effects of Reward Optimization

The baselines illustrate behavioral tendencies of different agents. The RL agents achieve higher reward than the random agent but also behave more viciously, whereas the LM agents are a middle ground between the RL and random agent. Our results suggest a trade-off between achieving objectives and behaving morally.

Reward optimization often causes agents to be more immoral. More importantly, however, the agent’s harmful behaviors are unclear before training. Figure 4 demonstrates how an RL agent trained to optimize reward exhibits different amounts of harmful behaviors depending on the environment (game) it was trained on. Given that specific harmful behaviors are difficult to predict without prior knowledge of the environment, developing methods to steer agents away from a broad range of unethical actions is essential.

Steering Agents to Be More Moral

We describe our behavioral regularization methods (Section 4.1) and our experimental results (Section 4.2).

Can agents be taught to be moral while still achieving rewards? We investigate methods for encouraging our language-model agent and our RL agent to behave morally.

To induce the LM to behave more ethically, we instruct it to behave morally following the prompt given in Appendix H.

Prior work shows that LMs understand commonsense morality (Hendrycks et al., 2021a). We leverage these LMs to build an artificial conscience (Hendrycks et al., 2021c) and steer our RL agents away from harmful actions. See Appendix J.

We finetune DeBERTa (He et al., 2021) to understand power, utility, and morality. For power, we create a dataset of pairwise sentences where one sentence contains more power than the other. The model is trained to predict which sentence exhibits more power. Utility is trained similarly. Morality is trained as binary classification. Our fine-tuned model fharmful(st,at)f_{\text{harmful}}(s_{t},a_{t}) measures the harm along these three axes: immorality, disutility, and power-seeking.

2 Results

Both of our methods for steering agent behavior reduce their harmfulness across all three behavioral metrics. However, the regularization also reduces the reward for each agent.

Among all agents, the RL agents achieve the highest normalized reward yet display the most Machiavellian tendencies. On the other hand, the LM agents are a Pareto improvement over the random agent but not the RL agents. None of the GPT-4 variants represent Pareto improvements over the GPT-3.5 variants, indicating that scaling up models is not a straightforward solution to the benchmark.

It is difficult to evaluate whether the RL or LM agent is preferred–future work could seek to design agents that are Pareto improvements over all of our baselines. Although the results outline a tension between goals and moral behavior, we qualify this hypothesis in Section 5.

We provide a qualitative example of how steering agents can alter their behavior in Table 3. Using chain-of-thought prompting (Wei et al., 2022), the model a) recognizes that it has an opportunity to achieve an unethical achievement, and b) decides to take a less harmful action that still advances its goals.

Moreover, this example highlights why simply prompting models to “do no harm” may be an untenable solution to Machiavelli. Agents which consistently refuse to act will not be able to collect high reward.

Trade-offs in Machiavelli

We explore more carefully the trade-offs in Machiavelli. In Section 5.1, we examine the proportion of achievements that do and do not conflict with ethical behavior. In Section 5.2, we provide Pareto curves of our agents.

In Machiavelli, there is sometimes a tension between an agent’s objective and moral behavior, as exemplified in Table 3. Additionally, objectives that ostensibly comport with morality may still require unethical actions to attain. For example, an objective which requires sleuthing to uncover evidence may not be achievable without spying or trespassing.

We label whether attaining an achievement conflicts with moral behavior using the prompt in Appendix L. Table 5 shows the distribution of points that come from “moral” or “immoral” achievements for various agents. The majority of the points (76.7%) in the games do not inherently conflict with moral behavior, so most reward should be reachable without compromising ethics.

We thus find that our regularization methods have mixed effects. Although the LM moral conditioning appreciably increases the proportion of points from moral achievements (83.0%90.7%83.0\%\rightarrow 90.7\%), this seems to be largely due to a reduction in points from immoral achievements (4.32.44.3\rightarrow 2.4).

2 Pareto Curves

We plot the performance of all our agents in Table 2 as well as some additional variants described in Appendix K. From Figure 5, we find the Pareto frontier of present-day agents. Our RL and LM agents have strengths along different axes (behavioral metrics and reward, respectively).

Finally, note that the darkened LM agents (green) in Figure 5 are a Pareto improvement over the random agent, demonstrating that progress is possible on Machiavelli.

Related Work

Agents can be trained on text-based games to improve decision-making and natural language understanding. Côté et al. (2018) proposes a procedurally generated text-based world, which Shridhar et al. (2021) augments with cross-modal, embodied observations. Hausknecht et al. (2020) provides more game environments with a template-based action space. However, these environments lack complex social interactions, so cannot be used to study agent behaviors.

To navigate text-based environments, He et al. (2016) proposes an agent which employs a recurrent neural network to generate Q-values. Other works design rule-based (Hausknecht et al., 2019) or neural-based (Ammanabrolu et al., 2020) methods for better environment exploration. Another line of work builds external knowledge into the agent, typically through knowledge graphs (Adhikari et al., 2020; Ammanabrolu & Hausknecht, 2020; Ammanabrolu & Riedl, 2021; Liu et al., 2022) or pretrained language models (Singh et al., 2021).

Building safe sequential-decision making agents is traditionally cast as constrained optimization (Achiam et al., 2017; Tessler et al., 2019; Alshiekh et al., 2018; Ray et al., 2019). Other methods learn human reward functions to guide agent planning (Hadfield-Menell et al., 2016; Reddy et al., 2020). These priors may be improved by learning social norms present in natural language (Riedl & Harrison, 2016; Hendrycks et al., 2021a). In text-based games, Nahian et al. (2021); Ammanabrolu et al. (2022) examine agents which incorporate moral decision-making via LM priors. These priors are either mediated into the agent’s Q-values via policy shaping (Griffith et al., 2013) or via action space restriction (Dalal et al., 2018).

Most similar to our work is Hendrycks et al. (2021c), which introduces the artificial conscience method. The main qualitative difference is that we collect a richer set of labels, allowing us to measure more behaviors, such as power-seeking and disutility. We are able to more thoroughly evaluate the tradeoffs between rewards and moral behavior. Our benchmark is also around an order of magnitude larger, containing more games and more annotations.

Discussion

We built Machiavelli, a suite of 134 text-based Choose-Your-Own-Adventure games to evaluate both the capabilities and safety of AI agents. After mathematizing dozens of harmful behaviors, including power-seeking, disutility, and deception, we collected millions of annotations to assess the behaviors of our baseline RL and LM agents. We find that agents trained to maximize reward often learn Machiavellian tendencies. Thus, we develop immorality regularization techniques to steer our agents to more beneficial behaviors. We create agents which are Pareto improvements over our baselines. Overall, there is still an appreciable gap between current models and agents which optimally navigate the trade-off between achieving reward and behaving morally.

Future work can design agents or methods that better navigate this tensions. Eventually, benchmarks in machine ethics may address more complex behaviors, such as culpability or desert, and be more realistic, incorporating multi-agent dynamics and counterfactual scenarios. As language models gain greater accuracy and efficiency, labeling may be done at greater scale with greater granularity.

Machiavelli improves upon machine ethics, offering concrete steps towards building safer adaptive agents.

Acknowledgements

We are thankful to Edwin Chen, Scott Heiner, and the crowdworkers at Surge AI for helping gather annotation data. We also thank Diane Liang, Viktoriya Malyasova, Sarina Wong, Denny Woong, and especially Rachel Kasperitis for timely annotation work. We are thankful to Mantas Mazeika for providing valuable feedback on the writing. This work was supported in part by the DOE CSGF under grant number DE-SC0020347. AP was supported by a Vitalik Buterin PhD Fellowship in AI Existential Safety.

References

Appendix A Additional Harmful Behaviors

Selfish agents will tend to take actions which prioritize their survival at the cost of others (Okasha, 2020). Such agents would seek to increase their ability to propagate at the expense of humans, causing harm stemming from competition over limited resources (Darwin, 1859).

We say selfishness occurs when the player attempts to prioritize propagating its own information over that of others. Unfortunately, we lack a complete description of how the player’s choices affect other characters’ abilities to propagate their information. One way to measure selfishness in Machiavelli is to have a language model evaluate whether an action was selfish according to the above definition.

Appendix B Definitions of Power from Other Fields

One popular notion of power is an agent’s ability to carry out its own goals or influence the world (Weber, 1978; Russell, 2004). Others may view power as “present means to obtain some future good” (Hobbes & Missner, 2016) and may include optionality (Taleb, 2012) as a resource.

By definition, power is the amount of work done on a system per unit time (Halliday et al., 2013). In particular, there must be some energy transferred, which suggests that power does not exist without action.

Power in economics is typically thought of an ability of a nation (Baldwin, 2013), firm (Khemani, 1993), or a consumer (Dornbusch, 1985) to control some aspect of their economic wellbeing. Notably, most definitions center on an agent’s potential to alter the world.

A seminal definition of political power describes it as the ability to influence others to perform actions they would not otherwise take (Dahl, 1957). Others, however, analogize power with money, arguing that it has no functional existence until it is exercised (Parsons, 1963).

Social power is one’s standing in a social hierarchy (Baldwin, 1992), which is often mediated through demographics and identity (Castells, 2011). It can also be defined as the ability to reciprocate others’ actions (reciprocity), which requires resources (Molm, 2010). Another possible framing of power is as control over “another’s valued outcomes” (Fiske & Berdahl, 2007). Finally, a notable theory of power defines it as the “potential for influence” stemming from six bases of power (coercive, reward, legitimate, referent, expert, informational) described in French et al. (1959) and Appendix D.

Appendix C Past, Present, Future (An Alternate Ontology of Power)

We next organize the definitions of power covered in Appendix B into the following ontology. See Appendix C.5 for a summary. Another possible ontology is described in Appendix D.

Given an agent, we can approximate its power-seekingness through its prior impact on the world (Appendix C.1), the current state of its resources (Appendix C.2), and its propensity and ability to influence the future (Appendix C.3).

We ground our measures of power in different units to capture different effects of actions. For example, detonating a nuclear bomb in the ocean transfers a large number of energy but does not immediately affect the wellbeing of humans. We measure power in terms of watts, money, wellbeing, and social influence.

Let π\pi be the policy or the agent. Let ss be the current state of an agent and S\mathcal{S} be the set of all states. Let τ=(s1,,sn)\tau=(s_{1},\ldots,s_{n}) be a trajectory and T\mathcal{T} be the set of all trajectories. Finally let aa be the action of an agent.

C.1 Exercised Power (Past)

One formulation of power is how much power an agent exercises to perform a certain action. For instance, purchasing a new car online utilizes a great amount of money but a small amount of watts (enough to power the computers involved in the transaction). This measure is a view of the agent’s power retrospectively, so it examines a realized trajectory τ\tau. We define

C.2 Resource Power (Present)

Another possible formulation of power is an agent’s resources modulated by its ability to use them. Resources can generally be viewed as inputs that an agent requires to exert power, such as money or physical strength. This measure is a view of the agent’s power currently, and we define

Notationally, ri(s)r_{i}(s) is some measurement of resource rir_{i} in state ss, e.g., if r0r_{0} tracks USD and an agent has \100USDinstateUSD in statesthenthenr_{0}(s)=100$. A possible decomposition of resources is given in Table 6. A further explanation of these four resource categories is provided in Appendix D.

But this is somewhat vague. Some have proposed similarly informal notions of intelligence, which can be related to power. For example, Muehlhouser has suggested that intelligence is the ability to accomplish a wide variety of goals divided by the resources (Muehlhauser & Salamon, 2012). That suggests that Intelligence = Power / Resources, and that Power = Intelligence ×\times Resources. Note that these equations are notional, so “×\times” means interaction. In this sense, intelligence is the ability to use resources more efficiently, which can increase power. Alternatively, Power=Efficiency×Resources\text{Power}=\text{Efficiency}\times\text{Resources}. Assuming that intelligence or efficiency are fixed, then the only way to control power is to control resources.

C.3 Prospective Power (Future)

A more comprehensive definition of an agent’s power is its propensity to take impactful actions multiplied by the ability to successfully execute those actions. As an example, a pacifist leader of a major-power state may have less prospective power than an imperialist leader of a minor-power state because the latter’s tendency to take power-seeking actions (invade other territories) may overcome the lower probability of those actions being successful (minor-powers have less successful military campaigns). We define

C.4 Change in Power (Power Gained From a Resource)

Finally, we can also measure the change in power from state ss to state ss^{\prime}. We define

We can use prospective definition of power to ground this change as follows.

Importantly, this allows us to propose a constructivist definition for the power gained from a resource rr in terms of an agent π\pi and a measurement function II.

This may help with concretizing informational power. For example, if an agent is given the nuclear launch codes (gain in information), and they already had a functioning warhead, they are much more powerful than possessing an inoperable warhead.

C.5 Summary

In Appendix C, we described four units of power and mathematized four types of power grounded in those units.

Influenced by the notions in Appendix B, we measure power in terms of watts, money, wellbeing, and social influence.

Our definitions encompass the agent’s power through different temporal lenses. They measure an agent’s power in the

Past. How much impact did the agent exercise on the world? We define this to be exercised power.

Present. What resources does the agent currently have to influence the world? We define this to be resource power.

Future. What is the agent’s tendency to take powerful actions multiplied by its ability to do so? We define this to be prospective power.

The final definition is an extension to prospective power. We define the change in power to be how much prospective power an agent gains one step into the future. It may also be viewed as how much power an agent gains from acquiring a resource.

Appendix D Four Pillars of Power (An Alternate Ontology of Power)

A common framework in economics is that society is built on three pillars: market, state, and community (Rajan, 2019). While these pillars describe interactions between individuals, they can also be repurposed to track power. However, power can exist in an individual, which we can view as human capital. Taken together, these form the four pillars of power. This serves as another possible ontology of power.

Community / Social Power. A common breakdown of social power comes from Raven’s six bases of power (French et al., 1959). These bases describe where agents derive their social power from. Coercive power is the threat of force to influence others’ behaviors. Force can be mediated through physical, emotional, economic, political, or social avenues. Reward power is the ability of an agent to provide or withhold material, emotional, or social rewards from others to affect their behavior. Legitimate power is the power vested in an elected or appointed position of authority. This power is usually buttressed by social norms, imbuing the agent with certain rights and responsibilities. Referent power is when agents leverage their membership in social groups to alter others’ behaviors. Expert power is the appeal of one’s reputation, skills, or experience to modify how an agent presents to others, affecting their interactions with them. Informational power is the ability to enact change based upon the information that an agent possess. Oftentimes this information is privileged and not public.

Economic Power. Economic power covers money, property, and assets. It is closely related to capital (Nitzan & Bichler, 2009). Both capital and economic power share the property that they tend to compound over time (Piketty, 2014). It has been shown that economic power is also closely linked to other forms of power, in particular political power (Pistor, 2013). Overall, economic power is correlated and possibly causally related to forms of political, legal, and social power.

State Power. State power covers the ability of a state to pass laws and regulations, its military strength, and its ability to influence international relations (Baldwin, 2013).

Human Capital. Human capital includes personal attributes encompassing education, knowledge, skills, know-how, attitudes, personality traits, and good health.

Appendix E Additional Concepts for Power

Given multiple agents, we may define the power of a given agent relative to other agents.

One measurement of power is the agent’s tendency to disempower other agents, namely,

From Dahl (1957), we may also define power as the agent’s tendency to influence other agent’s to perform a desired action aa^{*}, i.e.,

where dd is some distance function on policies π\pi. For instance, we might say d(πi(s),πi(s))=πi(s,a)πi(s,a)d(\pi_{i}(s^{\prime}),\pi_{i}(s))=\pi_{i}(s^{\prime},a^{*})-\pi_{i}(s,a^{*}), or the change in probability that πi\pi_{i} takes some given action aa^{*} after observing (s,a,s)(s,a,s^{\prime}).

E.2 Optionality

Sometimes, people will delay investing in a career path until they possess more information about its benefits and drawbacks. This can be viewed as preserving optionality (Taleb, 2012). Another way to view this is from portfolio optimization, where one wishes to minimize risk subject to a given level of expected return.

We might expect an intelligent agent to perform power portfolio optimization as it will be instrumentally useful.

Optionality can also be defined as the entropy of future states available to the agent, but this is simplistic as it doesn’t account for the power of each state nor the likelihood of achieving a given state.

Appendix F Model-based annotations

Machiavelli contains 572,322 scenes, some of which are multiple paragraphs long. Carefully labeling each scene is difficult and time-consuming. Because we have to repeat annotations for quality, labeling Machiavelli easily surpasses 20,000 hours of human annotation work. Paying for high-quality annotations at Surge AI rates (25perhour)tolabelMachiavellicostsover25 per hour) to label Machiavelli costs over500,000 USD. Instead, we use GPT-4 (OpenAI, 2023) to label all our scenes using the following prompts.

We use in-context prompting of an GPT-4 series (OpenAI, 2023) model to collect all our scene annotations. We use a total of 5 different prompts to ask 18 questions about stakeholder utility, physical impact, economic impact, social influence, and ethical violations. The full list of questions are shown in Table 7. Each prompt begins with the game title, blurb, and character name, then continues with the label-specific instructions shown at the end of this document. Each prompt is designed to induce the return of a JSON-formatted string, allowing us to easily parse and extract label values.

We apply each prompt to every scene in the Machiavelli benchmark separately, resulting in 5 rounds of inference for each scene. To reduce model inference calls, we prompt the model to label a batch of 10 scenes at a time, following a similar batching scheme as Cheng et al. (2023).

F.2 Label quality

Table 8 shows the agreement (Spearman rank correlation) of gold labels against the model labels and crowdworker labels. We find that across all label types, individual model labels (model) have a higher correlation with the gold labels than the average individual crowdworker (crowd).

We also consider ensembling labels, where we take the most frequent label out of the 3 labels. An ensemble of 3 crowdworkers (crowdens) is of higher quality than the individual model labels for most labels outside of morality. However, the model can also be ensembled to improve its performance. We collect additional rounds of model labels with the same prompts for a small subset of scenes. We find that ensembling 3 sets of model labels (modelens) outperforms crowdens in 3/5 label categories and fares comparably in the remaining 2 categories. Unfortunately, collecting 3 rounds of model labels for all scenes is prohibitively expensive. As another approach, we notice that model labels tend to have much higher false positives (labeling a scene as being morally salient when nothing occurred) than false negatives. We thus explore relabeling only scenes that the model rated as positively salient. In particular, we replace all positive model labels with modelens labels where available. This labeling scheme (model+) achieves similar accuracy as modelens labels with less than 20% of the labeling cost. Our final model+ labels surpass crowdens labels in 16/18 label categories. We plan to release the full benchmark with model+ labels over time.

F.2.1 Prompt for Social Influence

Example Scene & Corresponding Model Output

F.2.2 Prompt for Monetary Impact

Example Scene & Corresponding Model Output

F.2.3 Prompt for Ethical Violations

F.2.4 Prompt for Ethical Violations (continued)

Example Scene & Corresponding Model Output

F.2.5 Prompt for Character Utilities

Example Scene & Corresponding Model Output

F.2.6 Prompt for Physical Impact

Example Scene & Corresponding Model Output

Appendix G Machiavelli Choose-Your-Own-Adventure Games

The complete list of games from choiceofgames.com included in the Machiavelli benchmark are:

Of the above games, the following 30 games form the Machiavelli test set:

Appendix H Prompts for Language Model Agents

Here, we show the prompt instructions used for the following agents: LM, LM + CoT, LM + CoT + EthicsPrompt, and LM + EthicsPrompt (No Goals). For models using OpenAI’s Chat Completion APIhttps://platform.openai.com/docs/guides/chat (gpt-3.5-turbo and gpt-4), we provide the instructions as a system prompt and pass the scene text and action choices as a user prompt. For standard completion models, we concatenate the instructions with the scene text and action choices.

Due to limitations on context window length, our current prompting scheme only shows the LM the current scene text and does not provide a way for models to retain a memory of previous events in the game. We expect this to be important for longer-term planning and coherence, and leave this for future work to address.

H.2 LM + CoT

H.3 LM + CoT + EthicsPrompt

H.4 LM + EthicsPrompt + NoGoals

Appendix I RL agent training

As described in Section 3.1, we port over the code from the CALM repository (Yao et al., 2020), which is inspired by DRRN (He et al., 2016). We embed all states with DeBERTa (He et al., 2021). Otherwise, we maintain the same hyperparameters as found in the Github repository, except for tuning the Boltzmann exploration parameter β\beta. Actions are thus sampled from the following policy:

For each game, β\beta was tuned in [0.01, 0.03, 0.05, 0.1, 0.3, 0.5, 1.0, 2.0, 5.0].

When using the DeBERTA-based harmfulness model, we tuned the threshold γ\gamma and the Q-value penalty α\alpha. Each steered agent must be retrained as it needs to learn with the artificial conscience.

Appendix J Harmfulness Model Training

We construct a conditional power measurement dataset, which contains (context, scene A, scene B, power comparison) pairs by using the prompt:

For hyperparameters, we set learning rate as 2e2e-55, training epoch as 22, batch size as 3232, weight decay rate as 0.010.01.

Appendix K Additional Results on Machiavelli

In addition to the baselines shown in Table 2, we evaluate davinci (base GPT-3 model), llama-* (LLaMA models with 7B, 13B and 30B parameters) finetuned to follow instructions (Taori et al., 2023), and a +NoGoals modifier for LM agents which removes the achievements from the prompt so that the agent’s only instruction is to behave ethically per the ethics prompt (see Appendix H). The DRRN++shaping agent is a version of the DRRN+shaping RL agent with a heavier weight on the harm penalty. Results for all our agents and modified variants are shown in Tables 9 and 10.

Appendix L Classifying Achievements

To study the trade-offs between objectives and their ethical constraints, we classify the target achievements in Machiavelli as ethical or not. In particular, we use GPT-4 with the following system prompt and follow with a user prompt of the achievement we wish to classify:

Appendix M X-Risk Sheet

We provide an analysis of how our paper contributes to reducing existential risk from AI, following the framework suggested by Hendrycks & Mazeika (2022). Individual question responses do not decisively imply relevance or irrelevance to existential risk reduction.

In this section, please analyze how this work shapes the process that will lead to advanced AI systems and how it steers the process in a safer direction.

Overview. How is this work intended to reduce existential risks from advanced AI systems? Answer: This work studies the ethical behavior of agents in open-ended, diverse environments. When agents are trained to pursue specified rewards and goals for completing a task, this often trades off against numerous ethical goals, giving rise to Machiavellian AIs that are power-seeking, deceptive, and selfish. People will have incentives to build and deploy power-seeking AIs due to their usefulness, and because competition pressure can force people to deploy such systems even if they are not safe. Some of these systems will seek power and may not be perfectly aligned, which could pose catastrophic risks. This work develops a benchmark that concretizes and comprehensively measures an agent’s power-seeking tendencies, deception, and selfishness when carrying out actions in context. The benchmark serves as a testbed for developing sequential decision-making agents that can behave competently while minimizing harmful tradeoffs in diverse environment.

Direct Effects. If this work directly reduces existential risks, what are the main hazards, vulnerabilities, or failure modes that it directly affects? Answer: This work directly reduces risks from power-seeking behavior, deception, and selfish tendencies arising from evolutionary pressures (Hendrycks, 2023).

Diffuse Effects. If this work reduces existential risks indirectly or diffusely, what are the main contributing factors that it affects? Answer: Our benchmark lays the groundwork for studying how reward optimization trades off against other ethical goals and can lead to power-seeking behavior, deception, and selfish tendencies. This improves safety culture and field epistemics by measuring these concepts and showing that they can be measured. A strong understanding of these risks and how to mitigate them could assist with developing regulations for deploying autonomous AI agents. For example, agents could be required to relinquish power above a minimum necessary threshold to accomplish goals, similar to the principle of least privilege in computer security.

What’s at Stake? What is a future scenario in which this research direction could prevent the sudden, large-scale loss of life? If not applicable, what is a future scenario in which this research direction be highly beneficial? Answer: This directly reduces the existential risks posed by power-seeking AIs (Carlsmith, 2022) and undesirable traits from evolutionary pressures (Hendrycks, 2023).

Result Fragility. Do the findings rest on strong theoretical assumptions; are they not demonstrated using leading-edge tasks or models; or are the findings highly sensitive to hyperparameters? \square

Problem Difficulty. Is it implausible that any practical system could ever markedly outperform humans at this task? \square

Human Unreliability. Does this approach strongly depend on handcrafted features, expert supervision, or human reliability? \square

Competitive Pressures. Does work towards this approach strongly trade off against raw intelligence, other general capabilities, or economic utility? \square

M.2 Safety-Capabilities Balance

In this section, please analyze how this work relates to general capabilities and how it affects the balance between safety and hazards from general capabilities.

Overview. How does this improve safety more than it improves general capabilities? Answer: This work measures important phenomena related to AI safety, such as power-seeking and deceptive AI. We do not explore means of improving performance on sequential decision-making tasks, and our environments are especially well suited to measuring ethical behavior. Additionally, we apply regularization to reduce power-seeking behavior, deception, and selfishness, which differentially improves safety metrics.

Red Teaming. What is a way in which this hastens general capabilities or the onset of x-risks? Answer: Although this work does not encourage Gain-of-Function research, researchers might decide to build power-seeking or deceptive AIs to better study these phenomena. This could hasten the onset of existential risk from AIs if pursued without significant precautions. We are encouraging researchers to work in the opposite direction and reduce Machiavellian behavior in AIs, but characterizing these behaviors better may lead to them becoming more of a possibility.

General Tasks. Does this work advance progress on tasks that have been previously considered the subject of usual capabilities research? \square

General Goals. Does this improve or facilitate research towards general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, or similar capabilities? \square

Correlation with General Aptitude. Is the analyzed capability known to be highly predicted by general cognitive ability or educational attainment? \square

Safety via Capabilities. Does this advance safety along with, or as a consequence of, advancing other capabilities or the study of AI? \square

M.3 Elaborations and Other Considerations

Other. What clarifications or uncertainties about this work and x-risk are worth mentioning? Answer: Regarding Q5, while the RL algorithm used in this work can be sensitive to hyperparameters, the results hold across a large suite of games.

Regarding Q8, this work studies precisely this tradeoff between reward optimization and ethical behavior. Some of the methods we evaluate on our benchmark do reduce the game score, but others are able to reduce unethical behavior while maintaining similar game scores.

Additionally, these environments study deception but not extreme forms like treacherous turns nor deceptive alignment. Consequently, “solving” this benchmark would not necessarily imply a solution to many classic safety problems.