A Long Way to Go: Investigating Length Correlations in RLHF

Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

Introduction

Reinforcement Learning from Human Feedback (RLHF) is widely used to align large language models (LLMs) with desired downstream properties such as helpfulness or harmlessness (Ouyang et al., 2022; Bai et al., 2022). Specifically, starting from a policy model that is either a base LLM or an LLM that has undergone supervised fine-tuning, the standard paradigm generally consists of (1) training a reward model (RM) on a labeled set of preferences on pairs of outputs for the same inputs, and (2) optimizing the policy model with respect to the reward model via an RL algorithm like proximal policy optimization (PPO) (Schulman et al., 2017).

This procedure relies on two things. First, the reward model must be correctly specified and not misaligned with human preferences (Zhuang & Hadfield-Menell, 2021; Pang et al., 2022; Bobu et al., 2023). Second, the optimization algorithm must do a good job of balancing reward optimization with staying close to the initial policy distribution. Not meeting these conditions generally leads to over-optimization of the reward model at the expense of human judgments (Dubois et al., 2023), which in the worst case leads to pathological “reward hacking” (Skalse et al., 2022). Ad hoc adjustments (Touvron et al., 2023b) and improvements in PPO (Zheng et al., 2023b) have stabilized the process and eliminated overt reward hacking in many LLM fine-tuning settings. However, it is not always clear what changes in the policy model’s behavior are responsible for reward improvements, and to what extent these correspond to meaningful improvements in quality versus optimization of spurious correlations in the reward function (Pang et al., 2022).

Given that the vast majority of recent work reports an increase in output length after RLHF for helpfulness (Dubois et al., 2023; Zheng et al., 2023b; Sun et al., 2023; Wu et al., 2023; Nakano et al., 2021; Stiennon et al., 2020), this paper focuses on the question of length and asks whether this is a correlation being optimized for. Length does not necessarily represent a spurious correlation, as human raters may legitimately prefer longer and more informative outputs. Nevertheless, we explore how much of the optimization and improvement is purely based on length as opposed to other features. We find that length often constitutes a majority of the reward and downstream improvements of RLHF, indicating that length may play a much larger role than previously documented.

We organize our investigation into three parts: (1) We investigate whether PPO with standard reward models optimizes for length in three different helpfulness-oriented settings. At fixed output lengths, PPO only gives mild improvements in reward; in two settings, nearly all reward improvement comes from shifting the distribution over lengths produced by the policy. (2) We investigate preference data and reward models, aiming to understand the source of length correlation and whether this can be mitigated through a series of interventions. We find these biases to originate from data imbalances, as well as significant robustness issues in standard reward modeling. (3) We conduct an experiment where we measure how much doing PPO with a reward based only on length can reproduce PPO quality gains with trained reward models.

We postulate that further improvements to RLHF will require disentangling length from both optimization and, in particular, reward models: RLHF research still has a long way to go.

Our Contributions: (1) We conduct a multi-faceted exploration of a prevalent correlation between length and reward in RLHF. (2) We explore several interventions to study and mitigate length increases, and characterize their performance across three datasets. (3) We plan to release a diverse set of reward and generation models to support future open work in RLHF. Code is available at https://github.com/PrasannS/rlhf-length-biases.

Task Setup

RLHF is a technique for optimizing the performance of text generation systems (Sutskever et al., 2014; Bahdanau et al., 2015), in which we place a distribution over target outputs $\mathbf{y}=(y_{1},\ldots,y_{n})$ given input sequences of words $\mathbf{x}$ via a generation model $\pi_{\theta}$ : $p(\mathbf{y}\mid\mathbf{x};\pi_{\theta})=\prod_{k=1}^{n}p(y_{k}\mid\mathbf{y}_{<k},\mathbf{x};\pi_{\theta})$ . Historically, these models were trained with both language modeling pre-training (learning to predict the next word given context) and supervised fine-tuning (SFT; learning to generate outputs to maximize the likelihood of references on some dataset, also referred to as behavioral cloning).

RLHF is a technique introduced to further improve upon this approach, and can be broken into three components. First, it requires a set of preference judgments over model outputs of the form $P=\{(x_{1},y_{1}^{+},y_{1}^{-}),\ldots,(x_{n},y_{n}^{+},y_{n}^{-})\}$ with triples of prompts $x_{i}$ , preferred continuations $y_{i}^{+}$ , and dispreferred continuations $y_{i}^{-}$ .

Then, given some $P$ , the task is to train a scalar reward model $R(x,y)$ such that for any given preference triple, $R(x_{i},y_{i}^{+})>R(x_{i},y_{i}^{-})$ . We use the standard Bradley-Terry preference model (Bradley & Terry, 1952), where $P(y_{1}\succ y_{2}\mid x)=\frac{\exp(R(x,y_{1}))}{\exp(R(x,y_{1}))+\exp(R(x,y_{2}))}$ and the reward model is trained to optimize the log likelihood of the observed preferences.
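As a minimal sketch of the Bradley-Terry objective described above, the following computes the preference probability and the mean negative log-likelihood from pairs of scalar reward scores. The function names are illustrative and do not come from the released code; the sketch assumes reward scores have already been computed by some reward model.

```python
import math


def bt_preference_prob(r_preferred: float, r_dispreferred: float) -> float:
    """P(y+ preferred over y- | x) under Bradley-Terry.

    exp(r+) / (exp(r+) + exp(r-)) simplifies to sigmoid(r+ - r-).
    """
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_dispreferred)))


def bt_negative_log_likelihood(score_pairs):
    """Mean negative log-likelihood over (R(x, y+), R(x, y-)) score pairs."""
    losses = [-math.log(bt_preference_prob(rp, rd)) for rp, rd in score_pairs]
    return sum(losses) / len(losses)
```

Minimizing this quantity pushes $R(x_{i},y_{i}^{+})$ above $R(x_{i},y_{i}^{-})$ ; when the two scores are equal the probability is 0.5 and the loss equals $\log 2$ .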

Finally, given $R$ , we use reinforcement learning, specifically proximal policy optimization (PPO; Schulman et al., 2017), to optimize a supervised fine-tuned (SFT) model $\pi_{\theta}^{\mathrm{SFT}}$ to get a model $\pi_{\theta}^{\mathrm{RL}}=\mathrm{PPO}(\pi_{\theta}^{\mathrm{SFT}},R)$ that, for a query distribution $X=(x_{1},\ldots,x_{m})$ , maximizes the reward $R(x_{i},\pi_{\theta}(x_{i}))$ , with a constraint that we not deviate too strongly from the initial distribution. RL optimization in PPO is based on the maximization of the following equation:

$R_{\mathrm{final}}(x,y)=R(x,y)-\lambda D_{\mathrm{KL}}\left(\pi_{\theta}^{*}(y\mid x)\,\|\,\pi_{\theta}^{\mathrm{SFT}}(y\mid x)\right)$ (1)

where $\lambda$ controls the strength of a Kullback-Leibler (KL) divergence penalty between the original policy $\pi_{\theta}^{\mathrm{SFT}}$ and the current policy $\pi_{\theta}^{*}$ at a given step.
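To make the role of $\lambda$ concrete, the sketch below subtracts a KL penalty from a scalar reward. It is an illustration under stated assumptions rather than the training code used here: the KL term is approximated by a single-sample estimate (the summed per-token log-probability ratio of the sampled output), the penalty is applied at the sequence level, and the name `kl_penalized_reward` is hypothetical. PPO implementations commonly apply the penalty token by token instead.

```python
def kl_penalized_reward(reward, policy_logprobs, sft_logprobs, kl_coef=0.04):
    """Return reward - kl_coef * KL_estimate for one sampled output.

    policy_logprobs / sft_logprobs: per-token log-probabilities of the same
    sampled tokens under the current policy and under the SFT policy.
    The summed log-ratio is a single-sample estimate of the KL divergence
    (an assumption of this sketch).
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-probability lists must have the same length")
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward - kl_coef * kl_estimate
```

With a larger `kl_coef`, outputs that the current policy finds much more likely than the SFT policy does are penalized more heavily, which keeps the policy closer to its starting distribution.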

We explore a collection of three preference datasets corresponding to three tasks (examples in Appendix C). We selected these datasets to provide a diversity of tasks oriented towards helpfulness that are still challenging for our base model, LLaMA-7B (Touvron et al., 2023a). Conveniently, we also have three types of preference supervision: explicit human labels, implicit preferences from upvotes, and synthetic preferences. (Note: our settings are oriented towards helpfulness, which we infer to be more closely related to length; studying our approaches on other objectives such as harmlessness could be interesting future work.)

WebGPT. This dataset (Nakano et al., 2021) contains human annotated preference labels between two outputs for the open-domain long-form question answering (LFQA) task (Fan et al., 2019). As human annotation is expensive, this dataset is relatively small at only 19.6K examples (mean tokens per $y=169$ ) compared to the others we study.

Stack. Released by Hugging Face, this dataset collects technical questions and answers from StackExchange (Lambert et al., 2023). The preference label between two answers is derived using the number of upvotes; the one with more upvotes is assumed to be preferred. We use a subset of 100K (mean tokens per $y=236$ ) pairs from the dataset following the Hugging Face implementation (von Werra et al., 2020).

RLCD. Finally, we explore multi-turn dialogue style data, released by Yang et al. (2023). Starting from the input instructions in the Helpful/Harmless dataset by Anthropic (Bai et al., 2022), they automatically generated preferred and not-preferred outputs using prompt heuristics, e.g. appending “generate unhelpful outputs” to the prompt. The “helpfulness” subset that we use consists of 40K examples and mean tokens per $y=45$ .

2.2 Experimental Setup

We use the standard implementation and hyperparameters for the 3 components of RLHF to maintain consistency. We base our RLHF implementation on the Hugging Face TRL framework (von Werra et al., 2020) with hyperparameters we find to work best based on reward convergence and downstream evaluation ( $\lambda=0.04$ , batch size 64, see more details in Appendix A), and use LoRA (rank=16) (Hu et al., 2021) to enable training large LLaMA-7B models (Touvron et al., 2023a) with limited GPU memory. For our SFT models we use the released AlpacaFarm SFT model for WebGPT and RLCD as we find it to work well, and the TRL SFT model for Stack.

Our evaluation relies on two factors. First, reward is an intrinsic metric optimized by the PPO process. Second, we follow past work in AlpacaFarm (Dubois et al., 2023) to conduct downstream evaluation using more powerful LLMs as proxies for human preferences. Specifically, we sample responses on fixed held-out test sets of 500 prompts for each setting, then use their exact evaluation scheme based on using a panel of 12 simulated OpenAI API based “annotators,” which they show correspond well with human preference judgements. The final format is an overall pairwise “win rate” of one set of paired outputs vs another, which we call simulated preferences.

Examining PPO

In this section, we first show that: (1) Output length increases during PPO (Figure 2). (2) There exists a positive correlation between length and reward model scores (Figure 3). Taken together, this evidence suggests that simply increasing length could be a successful way to improve reward. Motivated by this, we investigate the following question: Is length increase the primary factor for reward model scores increasing during PPO, or are other features also optimized?

To contextualize the rest of the work, we first show that length actually does increase as a result of PPO. Indeed, when comparing histograms of generation lengths (see Figure 2) on a fixed query set before and after our initial PPO runs, we find that PPO causes notable length increases.

We now investigate the extent to which other features are learned, with two different settings of the KL weight $\lambda$ in the objective. Figure 3 shows reward scores stratified by length, binned into buckets of 20 tokens for the higher $\lambda$ variant (high kl). While reward score does increase in each bin on average, the increases in reward are uneven. Furthermore, the increases are less strong than the length trends: generating an answer that’s 40 tokens longer (shifted over by two bins) often provides a larger improvement than PPO. (See Figure 10 for a plot with our standard, lower-KL PPO setting.)

To quantify this more precisely, we estimate the percentage of length-based optimization as the ratio of weighted reward gain (wrg) to the overall reward improvement ( $\Delta R$ ) from PPO, where weighted reward gain is the sum of each bin’s difference value multiplied by the total number of examples in each bin. Weights are computed by total examples from SFT and PPO combined.
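The computation described above can be sketched as follows. This is an illustrative reading rather than the exact evaluation script: the function names are hypothetical, only bins populated by both SFT and PPO outputs are compared, and the weighted sum is normalized by the total weight so that it is on the same scale as $\Delta R$ (both assumptions of this sketch). Since wrg measures gain at fixed length, a small ratio of wrg to $\Delta R$ indicates that most of the improvement comes from shifting the length distribution.

```python
from collections import defaultdict


def _bin_means(lengths, rewards, bin_width):
    """Mean reward and example count per length bin."""
    sums, counts = defaultdict(float), defaultdict(int)
    for n, r in zip(lengths, rewards):
        b = n // bin_width
        sums[b] += r
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in counts}, dict(counts)


def weighted_reward_gain(sft_lengths, sft_rewards, ppo_lengths, ppo_rewards,
                         bin_width=20):
    """Within-bin reward gain, weighted by combined SFT + PPO bin counts."""
    sft_mean, sft_n = _bin_means(sft_lengths, sft_rewards, bin_width)
    ppo_mean, ppo_n = _bin_means(ppo_lengths, ppo_rewards, bin_width)
    shared = set(sft_mean) & set(ppo_mean)  # bins populated by both models
    total_weight = sum(sft_n[b] + ppo_n[b] for b in shared)
    if total_weight == 0:
        return 0.0
    weighted = sum((ppo_mean[b] - sft_mean[b]) * (sft_n[b] + ppo_n[b])
                   for b in shared)
    return weighted / total_weight


def overall_reward_gain(sft_rewards, ppo_rewards):
    """Delta R: difference in mean reward between PPO and SFT outputs."""
    return (sum(ppo_rewards) / len(ppo_rewards)
            - sum(sft_rewards) / len(sft_rewards))
```

In a toy case where PPO outputs score no higher than SFT outputs of the same length but are simply longer, `weighted_reward_gain` is zero while `overall_reward_gain` is positive.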

Table 1 reports results. Revisiting this in the context of Figure 3, we see that around 70%–90% of the improvement on WebGPT and RLCD is explained purely by shifts in length. Stack shows a lower value here, with only about 40% of the gain arising from length. One reason for this is that Stack outputs are close to the length limit during training, so gain from increasing length is not possible to achieve. (Because its SFT model has a higher initial length, Stack tends to generate unboundedly long outputs after PPO; we set a higher max length (216) than the source TRL codebase (128) for Stack, but the pattern remains.) A second reason is that Stack’s technical QA setting represents a different style of answer that we believe does require optimizing for features beyond length.

3.2 Intervening on Optimization

We see that in a standard pipeline, PPO has a tendency to optimize only on length, but what if we constrain optimization to mitigate this? We test the effects of several interventions below.

The simplest intervention to PPO to encourage short outputs is to just increase the KL coefficient $\lambda$ (h-kl) (Equation 1), with the intuition that closer to the initial distribution should mean closer to the initial length. We experiment with setting it to 0.12 instead of 0.04; larger values impede model convergence.

We also experiment with a scalar penalty on the reward to control length (len-c). We set $R^{\prime}=\sigma\left(1-\frac{\mathrm{len}(y)}{N}\right)$ , where $N$ is a maximum length value that we do not want PPO to exceed, and $\sigma$ is a moving average of batch reward standard deviation. We also tried several variants of this idea, such as a scalar penalty past a length threshold, and note similar convergence failures. In general, we find that stricter versions of these constraints negatively affect convergence.
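The length term above is simple enough to write down directly. The sketch below computes only the quantity $\sigma\left(1-\frac{\mathrm{len}(y)}{N}\right)$ ; the function name is hypothetical, and how the term is combined with the reward model score during training is not specified by this sketch.

```python
def length_penalty_term(length, max_length, reward_std):
    """sigma * (1 - len(y) / N).

    length:      number of tokens in the sampled output
    max_length:  N, the length we do not want PPO to exceed
    reward_std:  sigma, e.g. a moving average of batch reward std
    """
    return reward_std * (1.0 - length / max_length)
```

The term is positive for outputs shorter than $N$ , zero at exactly $N$ tokens, and increasingly negative beyond it; scaling by $\sigma$ keeps its magnitude comparable to typical reward variation.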

A similar option to prevent outputs from getting longer may just be to altogether omit (omit) outputs beyond a length threshold from PPO, so that no update is made to encourage these. In practice we swap these examples with randomly sampled outputs from the batch.
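One way to picture the omission step is the sketch below, which replaces over-length rollouts with randomly chosen rollouts from the same batch. It assumes each rollout is a `(text, length)` pair and that replacements are drawn only from rollouts within the threshold; both the name `omit_long_outputs` and that sampling choice are assumptions of this illustration.

```python
import random


def omit_long_outputs(batch, max_length, seed=0):
    """Swap rollouts longer than max_length for random in-threshold ones.

    batch: list of (text, length) pairs for one PPO batch.
    Returns a batch of the same size, or an empty list if every rollout
    exceeds the threshold.
    """
    rng = random.Random(seed)
    kept = [ex for ex in batch if ex[1] <= max_length]
    if not kept:
        return []
    return [ex if ex[1] <= max_length else rng.choice(kept) for ex in batch]
```

Keeping the batch size fixed means the PPO update shapes do not change, while no gradient signal is derived from the over-length samples.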

Finally, prior work examining ways to improve implementations of PPO mentions that reward scaling (rm-sc) can be useful for “controlling training fluctuations” and reducing over-optimization (Zheng et al., 2023b). Similar to batch normalization (Ioffe & Szegedy, 2015), for each batch $X,Y$ of sampled outputs, we compute the mean ( $\mu$ ) and standard deviation ( $\sigma$ ) of $R$ . We then take a moving average of these values across $N$ previous batches and “scale” $R$ to become $R^{\prime}=\frac{R-\mu}{\sigma}$ , where we note $\sigma$ remains relatively constant across training.
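A minimal sketch of this reward scaling is given below. The class name `RewardScaler`, the default window size, and the small `eps` added for numerical safety are assumptions of the sketch, as is the choice to include the current batch in the moving window.

```python
from collections import deque
import statistics


class RewardScaler:
    """Scale rewards by moving averages of batch mean and std."""

    def __init__(self, window=10, eps=1e-8):
        self.means = deque(maxlen=window)  # recent batch means
        self.stds = deque(maxlen=window)   # recent batch stds
        self.eps = eps

    def scale(self, batch_rewards):
        # Record statistics of the current batch, then average over the window.
        self.means.append(statistics.fmean(batch_rewards))
        self.stds.append(statistics.pstdev(batch_rewards))
        mu = statistics.fmean(self.means)
        sigma = statistics.fmean(self.stds)
        return [(r - mu) / (sigma + self.eps) for r in batch_rewards]
```

Usage is one call per PPO batch, for example `scaled = scaler.scale(rewards)` with a single long-lived `scaler` object, so that the running statistics carry across batches.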

We report results for the interventions on the reward score and PPO in Table 2. Note the rm row is comparable within each setting since we use the same underlying reward models, and thus we use it as our primary metric to reason about length and reward tradeoffs. We also report simulated preferences (see Section 2.2) vs std, where $<50\%$ indicates being worse than standard PPO on downstream answer quality.

We find that across all interventions, length always increases relative to sft, and reward model score is always worse than standard PPO. These patterns suggest that a strong component of PPO is related to length. Including the fact that length control (len-c) led to convergence failure (reward not increasing during training) on w-gpt and stack, this suggests that length is a difficult feature to disentangle post-hoc from reward.

Recalling the scatter plots from Figure 3, we note that across all of these different interventions, the scatter plots display similar patterns (see Appendix B), implying that while these interventions reduce the overall optimization towards length, they don’t change the fundamental tendency of PPO to avoid optimizing for other features. However, while length still increases with respect to sft, several interventions do allow for length increases to be mitigated while still recovering a large portion of reward and downstream performance gain (e.g., rm-sc).

Examining Reward Modeling

Section 3.2 showed that interventions during PPO do not fully mitigate the issue of reward gains coming from length increases. We now investigate whether we can intervene even earlier in the process, on the preference data itself, in order to circumvent this length dependence.

One root cause of length correlation is length imbalances in the preference datasets, where longer answers are systematically preferred to shorter answers. We can measure this with length heuristic agreement: the accuracy of always predicting that the longer output is the gold preferred output (see Table 3). We see that all datasets are slightly imbalanced towards longer outputs. However, this doesn’t fully explain the strong correlations suggested earlier in Figure 3.
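Length heuristic agreement can be computed in a few lines. The sketch below assumes each preference pair has been reduced to the token lengths of its preferred and dispreferred outputs, and counts ties as disagreements; the function name and the tie handling are assumptions of this illustration.

```python
def length_heuristic_agreement(pairs):
    """Fraction of pairs whose preferred output is strictly longer.

    pairs: iterable of (len_preferred, len_dispreferred) token counts.
    """
    pairs = list(pairs)
    hits = sum(1 for len_pref, len_dispref in pairs if len_pref > len_dispref)
    return hits / len(pairs)
```

A value of 0.5 would indicate no length imbalance, while values above 0.5 indicate that longer outputs are preferred more often than not.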

To understand this better, we can study training dynamics of reward model learning by computing statistics over several epochs of training. Given reward model $R$ being trained on preference dataset $P$ for $E$ epochs, we can track each data point $(x_{i},y_{i}^{+},y_{i}^{-})\in P$ , where we compute the distribution of confidence (the RM score of the “dispreferred” output subtracted from that of the “preferred” output) at each epoch: $c_{i}=\{(e,R(x_{i},y_{i}^{+})-R(x_{i},y_{i}^{-})):e\in\{2,\ldots,E\}\}$ , where we exclude epoch 1 to mitigate noise.

First, when examining “cartography” plots (Swayamdipta et al., 2020) of the mean ( $\overline{c_{i}}$ ) and variance ( $\sigma(c_{i})$ ) of different $c_{i}$ (see Appendix B.1), we find that the values are largely centered at zero, suggesting that reward models are not able to make progress on most training examples: the predictions are low-confidence and largely do not change. This suggests that most features are instead learned on the set of “easy” examples with higher $\overline{c_{i}}$ .

With the hypothesis that length may be related to “easy” examples, we use length heuristic accuracy again, but this time, we compute it on slices where we bin training examples based on $\overline{c_{i}}$ , plotting these bins by confidence (x-axis) against length heuristic accuracy (y-axis) on each slice as scatter plots in Figure 4.
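The slicing procedure can be sketched as follows. The sketch assumes each training example is summarized as its list of per-epoch confidences (epochs 2 through $E$ ) together with the token lengths of its preferred and dispreferred outputs, and that slices are equal-sized groups of examples sorted by mean confidence; the function name and the equal-size binning are assumptions of this illustration.

```python
from statistics import fmean


def confidence_length_profile(examples, num_bins=10):
    """Per-slice (mean confidence, length heuristic accuracy).

    examples: list of (per_epoch_confidences, len_preferred, len_dispreferred).
    Examples are sorted by mean confidence and split into slices of equal
    size (the last slice may be smaller).
    """
    scored = sorted(
        ((fmean(conf), len_pref > len_dispref)
         for conf, len_pref, len_dispref in examples),
        key=lambda item: item[0],
    )
    bin_size = max(1, len(scored) // num_bins)
    profile = []
    for start in range(0, len(scored), bin_size):
        chunk = scored[start:start + bin_size]
        mean_conf = fmean([c for c, _ in chunk])
        heuristic_acc = fmean([float(hit) for _, hit in chunk])
        profile.append((mean_conf, heuristic_acc))
    return profile
```

Plotting the first element of each tuple on the x-axis against the second on the y-axis gives a scatter plot of the kind described above.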

The figure shows strikingly clean patterns, with the mean confidence $\overline{c_{i}}$ for data in an interval of training examples correlating strongly with the length heuristic. This means that (1) the length heuristic applies to most examples that are easy, and (2) perhaps more tellingly, the overwhelming majority of “hard” examples are cases where the model follows the length heuristic to confidently predict the wrong answer. Overall, this supports that length is one of the strongest features learned in these models. Note that WebGPT, with the strongest pattern, also displayed the lowest wrg from Table 1, implying that these correlations propagate through all stages.

4.2 Interventions on Preference Data

Given the strong length biases learned from preference data in standard RMs (std), we now examine whether we can eliminate these biases by strategically modifying preference data.

The simplest intervention is to remove length biases from the preference data. Specifically, we balance the data such that the distribution of pair length differences is symmetric, using bins of 10 tokens. Suppose there are more examples where preferred responses are 20 tokens longer than dispreferred ones compared to the reverse case; we then subsample the cases which are 20 tokens longer until they match the number of cases which are 20 tokens shorter, thereby balancing the data.
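A sketch of this balancing step is shown below. It assumes each pair is given as `(len_preferred, len_dispreferred, payload)`, bins the absolute length difference, and subsamples the larger side of each bin down to the size of its mirror image. The function name, the fixed random seed, and the decision to keep equal-length pairs untouched are assumptions of this illustration.

```python
import random
from collections import defaultdict


def balance_by_length_difference(pairs, bin_width=10, seed=0):
    """Subsample pairs so length differences are symmetric per bin.

    pairs: list of (len_preferred, len_dispreferred, payload).
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pair in pairs:
        diff = pair[0] - pair[1]
        magnitude_bin = abs(diff) // bin_width
        sign = (diff > 0) - (diff < 0)  # +1 longer, -1 shorter, 0 equal
        buckets[(magnitude_bin, sign)].append(pair)

    balanced = []
    for (magnitude_bin, sign), items in buckets.items():
        if sign == 0:
            balanced.extend(items)  # equal lengths carry no length signal
            continue
        mirror = buckets.get((magnitude_bin, -sign), [])
        keep = min(len(items), len(mirror))
        balanced.extend(rng.sample(items, keep))
    return balanced
```

Note that under this reading a bin with no mirror-image examples is dropped entirely, which can shrink the dataset substantially when the original imbalance is large.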

Our previous results suggest that something more data-specific beyond a surface length bias may influence training: for example, a particular set of “easy” examples may be corrupting the data, and removing them may help, as established in the literature on dataset cartography (Swayamdipta et al., 2020). Given that we’ve trained some $R_{\mathrm{base}}$ , and computed $\overline{c_{i}}$ on dataset $P$ (Section 4.1), we can test this idea by training a new RM $R_{\mathrm{trunc}}$ on a subset of $P$ selected by thresholding $\overline{c_{i}}$ with threshold hyper-parameters $\theta_{1}$ and $\theta_{2}$ . We experiment with several variants (see Appendix B.1), keeping sets of ~50% of the data for each. Below we report results when we keep a central subset of the data, i.e. examples with $\theta_{1}<\overline{c_{i}}<\theta_{2}$ .

In line with the hypothesis that over-optimization stems from spurious correlations in the data, another potential intervention is data augmentation, specifically “random pairing”: we take a matching prompt-output pair $x_{i},y_{i}^{-}$ from $P$ , with $y_{i}^{-}$ serving as the “preferred” example, and pair it with a randomly sampled $y_{j}^{+}$ from another prompt serving as the “dispreferred” example. This serves to encourage disregarding stylistic features in favor of relevance to the query.
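Random pairing can be sketched as below. The sketch assumes preferences are stored as `(prompt, preferred, dispreferred)` triples, that the dataset has at least two examples, and that the augmented examples are appended to the original data; the function name and the fixed seed are illustrative. The default of 25% additional data follows the setting reported in Appendix A.4.

```python
import random


def random_pairing_augment(preferences, fraction=0.25, seed=0):
    """Append randomly paired examples to a preference dataset.

    For a sampled prompt, its own dispreferred output becomes the new
    "preferred" side, and the preferred output of a different prompt
    becomes the new "dispreferred" side.
    """
    rng = random.Random(seed)
    num_new = int(len(preferences) * fraction)
    augmented = []
    for i in rng.sample(range(len(preferences)), num_new):
        prompt, _, own_dispreferred = preferences[i]
        j = rng.choice([k for k in range(len(preferences)) if k != i])
        _, other_preferred, _ = preferences[j]
        augmented.append((prompt, own_dispreferred, other_preferred))
    return preferences + augmented
```

Because the new "dispreferred" output answers a different prompt, the only consistent way to rank these pairs correctly is to attend to relevance rather than to style or length.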

4.2.2 Results

We first report in Table 4 the evaluation accuracy of these different reward models, as well as a correlation within batch (corr) measure which, given sets of 8 generations, is the mean Pearson correlation between output length and reward model score for each batch. While the standard reward model (std) achieves high accuracies across settings, this comes with high length correlation.
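The within-batch correlation measure can be sketched as follows, assuming each batch holds the token lengths and reward scores of the generations sampled for one prompt. The function names are illustrative, and returning 0 when either variable is constant within a batch is a convention chosen for this sketch.

```python
import math
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_within_batch_correlation(batches):
    """Mean over batches of corr(length, reward).

    batches: list of (lengths, rewards), one entry per prompt
    (8 generations per prompt in the setting described above).
    """
    return fmean([pearson(lengths, rewards) for lengths, rewards in batches])
```

Computing the correlation within each prompt's generations, rather than over the pooled outputs, controls for the fact that some prompts naturally call for longer and higher-scoring answers.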

Data Augmentation (r-da) improves on both of these partially, while confidence-based truncation (c-tr) brings length correlation down at the cost of accuracy. Note that, when using correlation within batch, we find that bal leads to length bias being reversed, but at near-random accuracies, while other truncation strategies don’t yield notable differences. These patterns indicate that, perhaps because RMs fail to learn on most examples, they are particularly brittle, and can learn spurious correlations easily. As the only setting where length balancing eliminates correlation and maintains above-random accuracy, we see more evidence that stack is the one setting of our three where reward models can learn features other than length.

We then show results for downstream adjustments to preference data in Table 5. Length still usually increases from the SFT starting point, though many interventions are shorter relative to std. bal on stack, perhaps due to there being other easy non-length features to learn, even leads to shorter outputs than sft, confirming the importance of preference data to final PPO length biases.

Unlike our PPO interventions described in Table 2, simulated preference doesn’t always decrease with preference data interventions: On stack, where bal is shorter than sft, it also improves sim pref over normal PPO, suggesting that at least in noisier settings there is somehow room for PPO to do more than just increase length, but this pattern is inconsistent. Compared to later stages, interventions on preference data seem to be the most promising for overall improvement of RLHF beyond length, though the fundamental inability of reward models to learn well from data remains.

How far can length go?

Many of our experiments suggest that our reward models are primarily guiding PPO to produce longer outputs, yet we still see improvements on downstream simulated preferences. One explanation for this is that humans and models like GPT-4 have a bias towards preferring longer outputs in the settings we study (Zheng et al., 2023a). Another possibility is that optimizing for length with PPO intrinsically improves the quality of generation even in the absence of other features.

We investigate two interventions aimed purely at increasing length, which show how far optimizing for this single aspect can go. First, we sample 8 outputs from the SFT model and choose the longest one (sft-long). Second, we use length as our reward for PPO (keeping the standard KL term) with $R^{*}(y)=1-\left|\frac{\mathrm{len}(y)}{N}-1\right|$ . In this case, $N$ is a target length hyperparameter (set to 156, 120, and 200 on WebGPT, RLCD, and stack respectively). We call this setting lppo, and also explore a variant of length-only PPO with $\lambda$ set to 0 (lppo $\lambda=0$ ) in Table 6.
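Both length-only interventions are easy to express in code. In the sketch below the function names are hypothetical, and `pick_longest` measures length with a caller-supplied function (character count by default) as a stand-in for token count.

```python
def length_only_reward(length, target_length):
    """R*(y) = 1 - |len(y) / N - 1|, maximal when len(y) equals N."""
    return 1.0 - abs(length / target_length - 1.0)


def pick_longest(samples, length_fn=len):
    """sft-long: return the longest of several sampled outputs."""
    return max(samples, key=length_fn)
```

The reward peaks at 1 when the output has exactly $N$ tokens and falls off linearly on both sides, so the policy is rewarded for reaching the target length rather than for growing without bound.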

First, we note that sft-long can lead to moderate improvements (57% win rate vs SFT on stack and 52% on RLCD), though not on WebGPT. When we then compare to lppo, we find that purely optimizing for length actually reproduces most of the performance improvements of RLHF with the reward models. Notably, this approach yields simulated preference improvements over sft-long, which has even longer outputs.

It is still possible that RLHF with our reward models does lead to other changes or improvements in the outputs beyond length. This experiment also does not necessarily establish flaws in the preference judgments; these outputs with the right length are often more informative and more useful (Figure 1). However, it does show that a significant fraction of the downstream gains can be explained by optimizing for length.

Related Work

Reinforcement learning from human feedback has been explored extensively (Knox & Stone, 2009), often being used in robotics tasks to extrapolate reward signal beyond an initial preference set (Brown et al., 2019). Recent work in NLP has explored implementations (Zheng et al., 2023b; Touvron et al., 2023b), objectives (Wu et al., 2023), and even alternatives (Rafailov et al., 2023; Zhao et al., 2022; 2023) for RLHF, but has generally overlooked or dismissed length increases. Our work is largely orthogonal to these directions, using the issue of length to analyze the lack of robustness in current reward models. Finally, other past uses of RL in NLP (Ammanabrolu & Riedl, 2018; Martin et al., 2017; Ramamurthy et al., 2023) have largely faced different sets of issues due to reward not coming from models learned over human preferences.

In the context of noisy and biased preference data, are reward models able to learn robust features reflecting the underlying preferences? In broader NLP, dataset artifacts have been a prevalent issue even on simpler settings like natural language inference (Gururangan et al., 2018; Poliak et al., 2018). In the context of RLHF, Stiennon et al. (2020) note that over-optimizing for a reward model leads to pathological summaries, Dubois et al. (2023) note a pattern of human preferences going up briefly then down as reward model score increases, and Pang et al. (2022) present some cases where reward hacking can be produced within synthetic settings. Our work, in comparison, delves further into what causes reward over-optimization in realistic settings, while also further exploring diagnostics and solutions. We focus on length as it is the most prevalent, but our experimental paradigm is applicable to any analysis of over-optimization in RLHF.

Techniques outside of RLHF for controlling length of NLP models have been explored (Kikuchi et al., 2016; Ficler & Goldberg, 2017). Length divergences specifically between training time and test time have been explored in the machine translation literature (Riley & Chiang, 2022), but these have been attributed to inference techniques and label bias in text generation methods. The open-ended nature of our generation problems is quite different from MT. Murray & Chiang (2018) use a per-word reward similar to our per-word penalty in RL, though to solve the opposite problem of outputs being too short. Finally, in discriminative “text matching” tasks like paraphrasing, past work has observed similar length heuristics (Jiang et al., 2022), but the sentence-pair format of these tasks makes their issues somewhat different.

Conclusion and Limitations

In this work we study correlations of length and reward in RLHF. Across three datasets and across several stages of observational and intervention-based exploration, we make a case that RLHF in these settings achieves a large part of its gains by optimizing for response length.

While the extent of the patterns we find is surprising, this doesn’t necessarily invalidate the potential of RLHF. We note that our Stack setting, which involves the most technical responses, does demonstrate improvements in reward even for outputs already at our maximum length. Furthermore, optimizing purely for length does seem to lead to “qualitative” improvements beyond just sampling from the base model and choosing longer outputs, indicating that the learning dynamics of RLHF may be beneficial for LM training. Rather than claiming length to be an inherent shortcoming, we seek to use it as a vehicle for analyzing RLHF’s successes and failures.

One limitation of our work is that, while we explore diverse settings, we are restricted to open-source preference datasets. Recent work such as Llama-2 (Touvron et al., 2023b) develops an extensive dataset of preferences and pursues a sophisticated RLHF strategy, which may not face the limitations we do. Furthermore, we focus primarily on a broad “helpfulness” objective (again, aligning with these preference datasets) using LLaMA-7B as the base model. While these represent a substantial fraction of research on open reward models, our findings may not necessarily apply to RLHF running on larger closed-source models, or with alternate objectives like “harmlessness”.

Despite these limitations, we believe our work shows that RLHF with these reward models is not yet achieving its full potential. We believe that developing more accurate and robust reward models, either by changing the reward model, its objective, or the preference collection process, may hold the key to unlocking the full capabilities of RLHF.

Reproducibility

For our various studies on the relationship between RLHF and length, we first trained a set of reward models and policy models. In order to support future open RLHF research, we release our code as well as reward and policy models. In addition to detailing our experimental setup and evaluation scheme in Section 2.2, as well as describing our interventions in detail in Section 3.2 and Section 4.2, we include further hyper-parameters and instructions in Appendix A. Note that we use open preference datasets, publicly available base models, and open-source RLHF code that doesn’t require prohibitive computational resources.

Acknowledgments

This work was supported by NSF CAREER Award IIS-2145280, a grant from Open Philanthropy, a gift from Salesforce, Inc., and a gift from Amazon. Thanks to Eunsol Choi and members of the UT TAUR lab for helpful discussion and feedback.

References

Appendix A Training / Evaluation Details

All experiments were conducted across 2 workstations, one with 4 NVIDIA RTX A6000 GPUs and another with 8 NVIDIA A40 GPUs. However, all of our individual experiments were run across 2 GPUs. In this configuration, training an RM takes around 6 hours on the Stack dataset, 3 hours on the RLCD dataset, and 1 hour on the WebGPT dataset after ~1 epoch. For PPO, training takes around 12-18 hours for a single run.

A.1 Reward Models

For StackExchange, we have a train set of 100K examples and an evaluation set of 10K examples. For WebGPT, we use 18K examples for training and 1.6K as an evaluation set. For RLCD, we use 40K training examples and 2.6K examples for the test set. We use these test sets in all of the evaluations shown above.

For training, we follow a rule of continuing to train until eval accuracy stops going up. Prior work finds that reward model training for just 1 epoch is most effective to avoid over-fitting; however, for some of our preference data interventions we note that convergence takes longer. Overall, this usually results in 1-2 epochs of training at most for the checkpoints that we use. We use bfloat16, learning rate of 1e-5, and batch size of 2 with 2 gradient accumulation steps. With these configurations, 1 epoch on 10K examples takes around 2 GPU hours.

Note that for the training dynamics analysis (Figure 4), we run reward model training for 5 epochs to reduce variance, however we don’t use those models directly (though we note that eval accuracy doesn’t go down significantly even at that point).

A.2 PPO

For our RLHF step, as stated before, we use LoRA and 8-bit quantization for the policy and reward models, since the TRL training configuration requires having all used models on each device used for training. We merge reward model and generation models with LoRA adapters before PPO.

Past work has commented on the stability of PPO and “secrets” needed to get it working well (Zheng et al., 2023b). We found that setting the right KL coefficient and batch size were the most important for stable convergence.

We generally train for between 150 and 200 steps, where this is a hyperparameter for each dataset depending on speed of convergence and allowing sufficient steps for KL to decrease in certain settings (Figure 5). We experimented with runs of up to 400 steps and generally did not find improvement in simulated preference or reward.

With 2 GPUs and a batch size of 32 on each, training takes around 16 hours to complete 200 steps, giving an overall cost of 32 GPU hours per PPO model. Note that we use a max length of 156 on WebGPT and RLCD, and 216 on stack, since its SFT model has a higher starting point (this hyperparameter has a strong influence on training speed).
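The compute figures above follow from simple arithmetic, sketched here. The helper `ppo_budget` is hypothetical, and the rollout count assumes that each PPO step consumes exactly one batch per GPU, which the text does not state explicitly.

```python
def ppo_budget(n_gpus, per_gpu_batch, steps, wall_clock_hours):
    """Return (total rollouts, GPU hours) for one PPO run.

    Assumption: every step processes one batch of `per_gpu_batch`
    sampled outputs on each GPU.
    """
    rollouts = n_gpus * per_gpu_batch * steps
    gpu_hours = n_gpus * wall_clock_hours
    return rollouts, gpu_hours


# Settings quoted in the text: 2 GPUs, batch size 32 each, 200 steps, 16 hours.
rollouts, gpu_hours = ppo_budget(n_gpus=2, per_gpu_batch=32, steps=200,
                                 wall_clock_hours=16)
```

Under that assumption, a 200-step run samples 12,800 outputs in total and costs the 32 GPU hours reported above.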

Figures 5, 6, and 7 show statistics over the course of training for our standard settings, with a KL coefficient of 0.04. We note that RLHF does successfully increase reward score. The last half of training usually yields a decrease in KL divergence, as the model has optimized for reward and is regularized closer to the initial policy model by the KL term.

A.3 Inference / Evaluation

Once we have our trained PPO models, we sample outputs that we use to compare and evaluate different systems. For all results, unless otherwise stated, we generate 500 outputs each from a fixed set of the held-out data from each dataset, and base our results on those (we find this to be a sufficient amount, especially when comparing patterns across a set of interventions / settings which themselves serve as additional datapoints). Computing simulated preference (Dubois et al., 2023) for 100 outputs costs around $3.5 USD using the OpenAI API. These calls are made to gpt-4-0314, gpt-3.5-turbo-0301, and text-davinci-003.

We decode with nucleus sampling (Holtzman et al., 2020) with a $p$ value of 0.9, maximum length of 256, temperature 0.9, and a repetition penalty of 1.2 (based on the TRL repository default hyperparameters). Whenever we sample multiple outputs from a single prompt, we draw 8 samples.
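As a reference for the decoding settings above, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-$p$) sampling over a single next-token distribution. The function names are hypothetical, and the repetition penalty is omitted for brevity; in practice these steps are handled by the generation library.

```python
import math
import random


def nucleus_indices(probs, top_p=0.9):
    """Smallest set of highest-probability tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def nucleus_sample(logits, top_p=0.9, temperature=0.9, rng=random):
    """Sample one token index from temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    kept = nucleus_indices(probs, top_p)
    # random.choices renormalizes the weights over the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Tokens outside the nucleus are never sampled, which trims the low-probability tail while still allowing diversity among likely continuations.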

A.4 Interventions

For length control, we set the length center $N$ to the starting mean SFT length (50 tokens for RLCD, 100 for WebGPT, 200 for stack), noting similar patterns with different configurations as well. For omission, we use 24 tokens above these values (to allow for stable training). For reward data augmentation, we augment with 25% additional data, noting that 50% gives similar patterns; however, data augmentation may need to be explored further in future work.
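The omission thresholds implied by these settings can be written out as below. This is a sketch only: the names are hypothetical, and treating an output exactly at the threshold as kept (`<=` rather than `<`) is an assumption, as is applying the filter per sampled output before the PPO update.

```python
# Length centers N quoted in the text (starting mean SFT length, in tokens).
LENGTH_CENTER = {"RLCD": 50, "WebGPT": 100, "stack": 200}
OMISSION_MARGIN = 24  # tokens above N, per the text


def omission_threshold(dataset):
    """Length above which a sampled output is omitted from PPO."""
    return LENGTH_CENTER[dataset] + OMISSION_MARGIN


def keep_for_ppo(rollout_lengths, dataset):
    """Return one boolean per rollout: True if it is kept for the update."""
    limit = omission_threshold(dataset)
    return [length <= limit for length in rollout_lengths]
```

For WebGPT, for example, this gives a cutoff of 124 tokens, so a 125-token output would be dropped under the stated assumption.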

Appendix B Extra Plots

Here we include dataset cartography plots in the style of Swayamdipta et al. (2020) for our reward modeling tasks. First, Figure 8 shows the dataset cartography plots on our respective settings. Note that WebGPT, the dataset with the strongest length biases, seems the most centered at 0 variance, and symmetric with respect to the x-axis. RLCD and stack, on the other hand, where there seems to be more room for other features, demonstrate an “upward tilt”: at higher variances, the models are able to learn correct, higher-confidence features. This provides evidence for our hypothesis that strong length biases emerge as a symptom of reward models being unable to learn clear features from most training data.

Figure 9 shows plots for two additional settings. First, we see that when doing data augmentation, the augmented data emerges clearly and separately in the “high confidence” zone, while the initial plot remains in a similar space as before, though interestingly now displaying more of the “upward tilt” with a longer tail. Next, we see that when cutting out the central section of the cartography plot and training on the remaining data, the shape is actually preserved, suggesting that the small amount of remaining data has an intrinsic tendency to learn the strong length correlation. Likewise, when removing the upper section of “easy examples”, we see that suddenly the RLCD plot becomes much more centered and sharper at the left with low variance, suggestive of the “brittleness” that we saw with WebGPT, where now the easiest pattern is harder to learn, exacerbating the pattern even further.
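The per-example statistics behind a cartography plot can be sketched as follows. Swayamdipta et al. (2020) define confidence from the gold-label probability across epochs; the version below instead uses the signed reward margin between the preferred and dispreferred output, which is an assumption chosen because it matches the plots being described as symmetric about the x-axis. The function name and input layout are hypothetical.

```python
from statistics import mean, pvariance


def cartography_stats(margins_per_epoch):
    """Return one (confidence, variance) pair per training example.

    margins_per_epoch[e][i] is assumed to hold
    r(preferred) - r(dispreferred) for example i after epoch e.
    Confidence is the mean margin over epochs; variance is the
    population variance of that margin over epochs.
    """
    n_examples = len(margins_per_epoch[0])
    stats = []
    for i in range(n_examples):
        series = [epoch[i] for epoch in margins_per_epoch]
        stats.append((mean(series), pvariance(series)))
    return stats
```

Plotting variance on the x-axis against confidence on the y-axis then yields the kind of map discussed above: examples near zero variance with small-magnitude confidence are those the reward model never learns a clear signal for.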

B.2 Length Scatterplots

Earlier, in Section 3, we discuss the idea of reward gain due to length vs. other features, examining results with a higher KL term. Here we show comparable plots for WebGPT (Figure 11) and RLCD (Figure 12). Note that the patterns and reward gain are very similar, suggesting that the length-constraining techniques all perform similar functions, despite having very different formulations. Also note that for these two settings the ratio of reward gain independent of length remains quite low.
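One way to separate reward gain that is independent of length from the overall gain is to compare rewards only within matching length bins. The sketch below illustrates that idea under stated assumptions: the bin width, the weighting by post-training counts, and the function name are all hypothetical, and the exact definition used in Section 3 may differ.

```python
from collections import defaultdict
from statistics import mean


def length_binned_gain(before, after, bin_width=20):
    """Return (within_bin_gain, overall_gain), or None if no bins overlap.

    before / after: lists of (length_in_tokens, reward) pairs for outputs
    sampled before and after PPO. within_bin_gain averages the per-bin
    reward difference over bins present in both sets, weighted by the
    number of `after` outputs in each bin (an assumed weighting).
    """
    def to_bins(pairs):
        bins = defaultdict(list)
        for length, reward in pairs:
            bins[length // bin_width].append(reward)
        return bins

    bins_before, bins_after = to_bins(before), to_bins(after)
    shared = sorted(set(bins_before) & set(bins_after))
    if not shared:
        return None
    total_weight = sum(len(bins_after[k]) for k in shared)
    within = sum(
        len(bins_after[k]) * (mean(bins_after[k]) - mean(bins_before[k]))
        for k in shared
    ) / total_weight
    overall = mean([r for _, r in after]) - mean([r for _, r in before])
    return within, overall
```

The ratio of the first value to the second then estimates how much of the reward improvement remains once length is held fixed; a low ratio corresponds to the pattern noted above.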