ODIN: Disentangled Reward Mitigates Hacking in RLHF

Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

cs.LG cs.AI cs.CL

Introduction

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique to elicit the capabilities from pretrained large language models (LLMs) to generate more helpful, honest, and harmless responses that align with human preferences (Ziegler et al., 2019; Askell et al., 2021; Ouyang et al., 2022), which has led to the success of ChatGPT (Schulman et al., 2022) and many other AI systems (Pichai, 2023; Anthropic, 2023; Touvron et al., 2023). RLHF trains a reward model (RM) on human preferences for the responses of given prompts, followed by training the language model to generate responses that maximize the learned reward through reinforcement learning. Such a paradigm simplifies human data collection, as acquiring human ratings is easier than collecting demonstrations for supervised fine-tuning. Moreover, it has been observed that RLHF has weak-to-strong generalization, where the policy becomes more creative than the supervision it receives (Burns et al., 2023).

Despite these promises, one subtle issue of RLHF is reward hacking, or reward model over-optimization, i.e., the policy obtains a high reward but does not fulfill the actual objectives. It happens because the RM is not a perfect proxy of human preferences and has limited out-of-distribution (OOD) generalization, while the policy is a capable LLM that can learn to generate OOD examples to exploit the vulnerabilities of the RM (Hendrycks et al., 2021; Ramé et al., 2024). More critically, the human preference data can often be biased and inconsistent due to the difficulty and subjectivity of the task itself, flaws in the rating criteria, and the limited quality of raters. The most common pattern of reward hacking in practice is verbosity: after RLHF (usually for helpfulness), the language models generate more tokens to make the response appear more detailed or better formatted, but the actual quality does not improve (Singhal et al., 2023; Wang et al., 2023b). This tendency is largely due to a preference among human raters for longer responses, which the RM can easily pick up and the policy can then exploit, causing length hacking. Given the challenges in controlling the quality of human data, it becomes increasingly important and beneficial to study how to mitigate the impact of spurious features from the reward modeling and algorithmic perspectives.

In this paper, we take a step towards mitigating reward hacking by conducting a comprehensive study on the impact of reward models and the RL algorithm on the verbosity and performance of the learned policy. Considering the challenges in model-based evaluations due to their biases (Zeng et al., 2023), e.g., open-sourced LLMs climb up on Alpaca-Eval (Li et al., 2023a) leaderboard by utilizing the length bias of the judge GPT-4 (Liu, 2024), we first establish a more reliable evaluation protocol for comparing different training configurations, which gathers evaluation results from large-scale grid search under these configurations and compares the achieved performance on the Pareto front of evaluation score vs. length. This offsets the length biases and gives a holistic understanding of the optimal result each approach can achieve at different lengths to reduce the randomness of the conclusions due to the length bias in model-based evaluation. Under this setup, we investigate the effectiveness of hyperparameters and tricks in RL for reducing reward hacking on length, including reward clipping (Mnih et al., 2015) and length penalty (Singhal et al., 2023). While tuning and tricks can push up the Pareto front, we find it hard to conclude with simple principles for tuning this large set of hyperparameters. We seek to solve the issue from its root and eliminate the spurious length signal from the reward. To this end, we train a two-head reward model to disentangle representations for length from the actual preference and discard the length head during RL. 
The proposed reward disentangling method, Odin (footnote: Odin sacrificed one eye for wisdom; similarly, our RM discards the length head to focus more on the actual content), helps the policy achieve a higher Pareto front than previous results obtained with a more expensive tuning budget. The conclusion holds for both PPO (Schulman et al., 2017) and ReMax (Li et al., 2023b), showing the potential of Odin to improve different RL-tuning algorithms and shedding light on how to reduce length hacking.

Preliminaries

We consider the RLHF pipeline widely adopted in the developments of LLMs (Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022; Touvron et al., 2023), which consists of three stages: (1) Supervised Fine-tuning (SFT); (2) Reward modeling: training the reward model based on the SFT checkpoint; (3) RL: using the SFT checkpoint as initialization and the reward model for feedback.

Reward Modeling. Following Stiennon et al. (2020), Ouyang et al. (2022), and Touvron et al. (2023), we consider the approach where the reward model is initialized from a supervised fine-tuned LM, with a randomly initialized linear layer appended to the end to project the feature representation of the whole sequence into a scalar representing the reward. The reward model is trained to minimize the loss under the Bradley–Terry model (Bradley and Terry, 1952) on pair-wise comparisons of model responses as

$$\mathcal{L}^{\text{R}}_{\bm{\theta}}(x,y_{w},y_{l})=-\mathbb{E}\big[\log\big(\sigma\big(r_{\bm{\theta}}(x,y_{w})-r_{\bm{\theta}}(x,y_{l})\big)\big)\big], \tag{1}$$

where $r_{\bm{\theta}}(x,y)$ is the scalar reward from the reward model with trainable parameters ${\bm{\theta}}$ for prompt $x$ and response $y$ ; $y_{w}$ and $y_{l}$ are the chosen and rejected responses, respectively; and $\sigma(\cdot)$ denotes the sigmoid function.
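As a minimal sketch of the pairwise Bradley–Terry ranking loss described above, the snippet below computes the mean of $-\log\sigma(r_{\bm{\theta}}(x,y_w)-r_{\bm{\theta}}(x,y_l))$ over a toy batch. The function names and the reward values are hypothetical and chosen only for illustration; a real reward model would produce these scalars from an LLM backbone.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log(sigmoid(z)) = -log(1 + exp(-z)); branch on the sign to avoid overflow.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r(x, y_w) - r(x, y_l))) over a batch of pairs."""
    assert len(chosen_rewards) == len(rejected_rewards) > 0
    losses = [-log_sigmoid(rw - rl)
              for rw, rl in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)

# Toy rewards (hypothetical values, not taken from the paper).
loss_good = bradley_terry_loss([2.0, 1.5], [0.0, -0.5])  # chosen scored higher
loss_bad = bradley_terry_loss([0.0, -0.5], [2.0, 1.5])   # chosen scored lower
loss_tie = bradley_terry_loss([1.0], [1.0])              # equal rewards
```

The loss shrinks as the chosen response is scored further above the rejected one, and equals $\log 2$ when the two rewards are equal.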

RL Objective. Unlike SFT, the RL fine-tuning stage of RLHF does not require golden responses for supervision. Instead, the reward model is used as a proxy of human feedback on the responses generated by the policy throughout training. Specifically, it fine-tunes the parameters ${\bm{w}}$ of the policy $\pi_{{\bm{w}}}$ by maximizing the following objective:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}_{\pi_{\bm{w}}}}\big[r_{\bm{\theta}}(x,y)\big]-\beta\,\mathbb{D}_{\text{KL}}\big(\pi_{\bm{w}}(y|x)\,\|\,\pi^{\text{SFT}}(y|x)\big), \tag{2}$$

where the SFT policy $\pi^{\text{SFT}}$ is used as initialization of $\pi_{{\bm{w}}}$ , $\mathcal{D}_{\pi_{\bm{w}}}=\{(x,y)|x\sim\mathcal{D}_{\text{RL}},y\sim\pi_{\bm{w}}(y|x)\}$ is the set of prompt-response pairs sampled from the prompt set and $\pi_{\bm{w}}$ , and $\beta>0$ is a constant adjusting the strength of the KL regularization. The KL regularization term is used to mitigate reward hacking by preventing the policy $\pi_{\bm{w}}$ from drifting away from the SFT model $\pi^{\text{SFT}}$ (Stiennon et al., 2020; Ouyang et al., 2022). The KL term is intractable; therefore, in practice it is approximated with some estimator, which makes Eq. (2) equivalent to maximizing some auxiliary reward $\hat{r}(x,y)$ . Following Stiennon et al. (2020), we consider the naïve estimator in this paper, and define the auxiliary reward as

$$\hat{r}(x,y)=r_{\bm{\theta}}(x,y)-\beta\log\frac{\pi_{\bm{w}}(y|x)}{\pi^{\text{SFT}}(y|x)}. \tag{3}$$

See Schulman (2020) for unbiased estimators of the KL divergence.
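The following sketch illustrates how an auxiliary reward based on the naïve KL estimator could be computed for one sampled response, assuming the estimator is the sequence-level log-likelihood ratio between the policy and the SFT model. The helper names and the toy token log-probabilities are hypothetical; practical implementations often apply the penalty per token instead.

```python
def sequence_logprob(token_logprobs):
    """Log-likelihood of a response: the sum of its token log-probabilities."""
    return sum(token_logprobs)

def auxiliary_reward(reward, logp_policy, logp_sft, beta):
    """Reward minus beta times the naive KL estimate log(pi_w / pi_SFT).

    Assumption: the naive estimator is the log-likelihood ratio of the
    sampled response under the policy and under the SFT model.
    """
    return reward - beta * (logp_policy - logp_sft)

# Toy per-token log-probabilities for one sampled response (hypothetical).
policy_token_logps = [-0.2, -0.1, -0.3]
sft_token_logps = [-0.5, -0.4, -0.6]

r_hat = auxiliary_reward(
    reward=1.0,
    logp_policy=sequence_logprob(policy_token_logps),
    logp_sft=sequence_logprob(sft_token_logps),
    beta=0.1,
)

# When the policy has not moved away from the SFT model, no penalty applies.
r_hat_no_drift = auxiliary_reward(1.0, -1.5, -1.5, beta=0.1)
```

In this toy case the policy assigns a higher likelihood to the response than the SFT model does, so the estimated KL is positive and the auxiliary reward is lower than the raw reward.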

Different RL algorithms can be used to maximize $\hat{r}(x,y)$ . We compare two options to see how existing mechanisms in RL algorithms can reduce reward hacking in RLHF: the simpler REINFORCE with baseline (Williams, 1992), and the more sophisticated PPO (Schulman et al., 2017). For REINFORCE, we consider the ReMax variant (Li et al., 2023b), which saves memory and compute significantly by replacing the value network with the reward on the greedy decoding of the current policy. Li et al. (2023b) proved that, similar to REINFORCE, ReMax has an unbiased gradient estimate, and that it reduces gradient variance under certain assumptions. Specifically, ReMax maximizes the following objective with gradient ascent on ${\bm{w}}$ :

$$\mathbb{E}_{(x,y)\sim\mathcal{D}_{\pi_{\bm{w}}}}\big[\big(\hat{r}(x,y)-\hat{r}(x,\bar{y})\big)\log\pi_{\bm{w}}(y|x)\big], \tag{4}$$

where $\bar{y}$ is the greedy decoding from $\pi_{{\bm{w}}}$ .
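To illustrate why subtracting the reward of the greedy response leaves the policy gradient unbiased, the sketch below computes the exact expected gradient for a toy one-step softmax policy over three candidate responses, with and without the greedy baseline. The setup (three discrete responses, the logits, and the rewards) is a hypothetical simplification, not the paper's experimental setting, and the baseline is treated as a constant when differentiating.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, rewards, baseline):
    """Exact E_y[(r(y) - baseline) * d log pi(y) / d logits] for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for y in range(n):
        weight = probs[y] * (rewards[y] - baseline)
        for k in range(n):
            # d log pi(y) / d logit_k = 1[y == k] - pi(k)
            dlogp = (1.0 if y == k else 0.0) - probs[k]
            grad[k] += weight * dlogp
    return grad

# Toy policy over three candidate responses (hypothetical numbers).
logits = [0.5, 0.0, -0.5]
rewards = [1.0, 0.2, -0.3]

probs = softmax(logits)
greedy_index = max(range(len(probs)), key=probs.__getitem__)
greedy_reward = rewards[greedy_index]  # reward of the greedy response

grad_remax = expected_policy_gradient(logits, rewards, greedy_reward)
grad_plain = expected_policy_gradient(logits, rewards, 0.0)
```

The two gradients coincide because the expected score function is zero, so the baseline only changes the variance of sampled estimates, not their mean.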

PPO is a more prevalent option adopted by many works (Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022; Touvron et al., 2023). For clarity, we provide details of PPO in the context of RLHF for LLMs in Algorithm 1 in the Appendix. PPO maximizes the clipping objective

$$\mathbb{E}_{\mathcal{D}_{{\bm{w}}_{\text{old}}}}\Big[\min\Big\{\frac{\pi_{\bm{w}}(y|x)}{\pi_{{\bm{w}}_{\text{old}}}(y|x)}\hat{A},\ \text{clip}\Big(\frac{\pi_{\bm{w}}(y|x)}{\pi_{{\bm{w}}_{\text{old}}}(y|x)},1-\epsilon,1+\epsilon\Big)\hat{A}\Big\}\Big], \tag{5}$$

where $\epsilon>0$ is a constant for clipping, $\frac{\pi_{\bm{w}}(y|x)}{\pi_{{\bm{w}}_{\text{old}}}(y|x)}$ is the likelihood ratio, and $\hat{A}$ is the advantage, usually estimated by GAE (Schulman et al., 2015) as a function of the value estimate and the reward. Intuitively, this clipping objective can help reduce reward over-optimization: it prevents the model from becoming over-confident on samples with positive advantage by no longer optimizing on a sample once its likelihood ratio $\frac{\pi_{\bm{w}}(y|x)}{\pi_{{\bm{w}}_{\text{old}}}(y|x)}>1+\epsilon$ . Our results in Fig. 3 (b) demonstrate this point.
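The per-sample term inside the PPO clipping objective can be sketched as follows. The function name and the numbers are illustrative only; the point is that, for a positive advantage, the term stops growing once the likelihood ratio exceeds $1+\epsilon$, so such samples no longer contribute gradient.

```python
def ppo_clip_term(ratio, advantage, eps):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

eps = 0.2

# Positive advantage: the term is flat once ratio > 1 + eps.
term_at_1_5 = ppo_clip_term(1.5, 1.0, eps)
term_at_2_0 = ppo_clip_term(2.0, 1.0, eps)

# Positive advantage, ratio below 1 - eps: the unclipped (smaller) value is kept.
term_low_ratio = ppo_clip_term(0.5, 1.0, eps)

# Negative advantage: the term is flat once ratio < 1 - eps.
term_neg_low = ppo_clip_term(0.5, -1.0, eps)
term_neg_high = ppo_clip_term(1.5, -1.0, eps)
```

A smaller `eps` narrows the range of ratios in which a sample can still push the policy, which is one way to read the observation that a smaller $\epsilon$ limits over-optimization.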

Mitigating Reward Hacking in Practice

In this section, we first establish a more reliable evaluation for comparing different methods, which uses the length of the generated response $L(y)$ as an indicator of the degree of reward hacking. Then, we study the impact of RL hyperparameters and tricks on the Pareto front of model-based or human evaluation metrics against $L(y)$ , and propose a more reliable approach by training a reward model that disentangles the spurious length correlation from the actual reward on contents.

It is challenging to evaluate the policy automatically through LLM evaluators, as these LLM evaluators can often be biased in practice (Zeng et al., 2023; Zheng et al., 2023a; Singhal et al., 2023), and the policy can learn to exploit these biases which also exist in the reward model. Previous works studying reward hacking use a ground-truth reward model to annotate the preference data to train another proxy reward model. Then, they train the policy against this proxy reward model, and evaluate the reward it achieves in the ground-truth reward model (Gao et al., 2023; Ramé et al., 2024). Here, we want to develop a scalable approach that can reliably evaluate the policy trained for the real human preference without involving human evaluators. To achieve this, we look at the model-based evaluation metric against the average response length $L$ on the evaluation set, and compare the Pareto front achieved by each method or configuration. We consider the response length because it is easy to measure and well-reflects the degree of reward hacking in RLHF for LLMs; in practice, the policy tends to generate longer responses when reward hacking happens (Ramé et al., 2024; Wang et al., 2024). A better method or configuration should achieve higher score when $L$ is the same, therefore a higher Pareto front in the plots. We mainly use model-based evaluations in our studies, where we compare responses generated by the policy against the responses generated by the SFT baseline. We then use the following win score as the metric:

$$\text{Win Score}=50+100\times\frac{n_{win}-n_{lose}}{n}, \tag{6}$$

where $n_{win}$ ( $n_{lose}$ ) is the number of examples rated as winning (losing) against the baseline, and $n$ is the total number of evaluation examples. $\text{Win Score}\geq 50$ when the test model is no worse than the baseline. See Section 4 for more details.
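A small sketch of the evaluation protocol: the Win Score is computed from win and loss counts, and a Pareto front is extracted from (average length, Win Score) points gathered across runs. The dominance rule used here (a point is dominated if another point is no longer and scores at least as high) is an assumption about how the front is defined, and all data points are made up for illustration.

```python
def win_score(n_win, n_lose, n):
    """Win Score = 50 + 100 * (n_win - n_lose) / n."""
    return 50.0 + 100.0 * (n_win - n_lose) / n

def pareto_front(points):
    """Keep (avg_length, score) points not dominated by a shorter-or-equal,
    higher-or-equal scoring point (assumed dominance rule)."""
    front = []
    best_score = float("-inf")
    # Sweep from short to long; at equal length, visit the best score first.
    for length, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            front.append((length, score))
            best_score = score
    return front

# Hypothetical evaluation results: (average response length, Win Score).
runs = [(220, 50.0), (240, 58.0), (230, 55.0),
        (250, 56.0), (260, 61.0), (240, 52.0)]
front = pareto_front(runs)

score_better = win_score(n_win=150, n_lose=90, n=300)
score_equal = win_score(n_win=100, n_lose=100, n=300)
```

Comparing methods through their fronts, rather than through single checkpoints, is what lets a longer but not better model be distinguished from a genuinely better one.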

How much hacking can we mitigate by tuning RL?

We investigate how much the hyperparameters and tricks used in RL can reduce reward hacking and improve evaluation results. While this helps to some extent, we find it can be hard to obtain a simple heuristic for tuning the hyperparameters that will guarantee a significantly better Pareto front.

KL Regularization. The KL regularization is introduced into the RL objective to prevent reward hacking by keeping the policy from drifting away from the SFT initialization. In Fig. 9, we show that a larger KL weight $\beta$ can indeed prevent excessive length increase, but the policy stays closer to the SFT initialization and the win score becomes worse. In Fig. 3 (a), we show that the effect of KL becomes marginal when reward clipping is introduced.

PPO clipping $\epsilon$ . As mentioned in Section 2, the clipping objective can potentially reduce reward hacking. From Fig. 3 (b), we find it is indeed the case, with smaller $\epsilon$ bringing around 2.5 points of improvement on the Pareto front. However, it becomes more challenging to determine the optimal $\epsilon$ when reward clipping is introduced; see Fig. 10.

Sampling from the old policy. Another mechanism that can potentially alleviate reward hacking is to sample the responses from the old policy, which should reduce the chance of sampling from a hacking policy. This is effective when $N>b$ , where the policy is trained on ( $N-b$ ) "off-policy" experiences in each PPO inner epoch. Surprisingly, in Fig. 3 (c), we show that a higher degree of off-policy makes it more likely to generate longer responses, and the win score around the length of $\pi^{\text{SFT}}$ is not as high as pure on-policy ( $N=b$ ), where even the PPO clipping $\epsilon$ in Eq. 5 is ineffective since $\rho_{\pi_{{\bm{w}}_{old}}}(x,y)\equiv 1$ .

Reward Clipping. Reward clipping is widely adopted by previous works (Mnih et al., 2015; Engstrom et al., 2020) as well as the DeepSpeed RLHF implementation. Specifically, we clip the reward from the reward model and maximize the clipped auxiliary reward as

$$\hat{r}^{\text{clip}}_{\bm{\theta}}(x,y)=\text{clip}\big(r_{\bm{\theta}}(x,y),-c,c\big)-\beta\log\frac{\pi_{\bm{w}}(y|x)}{\pi^{\text{SFT}}(y|x)}, \tag{7}$$

where $c>0$ is a constant. Reward clipping can alleviate reward hacking, since it ignores the excessive reward potentially achieved by hacking the reward model. In Fig. 3 (d), we do observe that a proper $c$ leads to a higher win score for PPO at lengths close to the SFT init. In Fig. 8, we show that proper clipping can also improve ReMax, but more aggressive clipping (e.g., $c=1$ ) can hinder effective learning by preventing the policy from exploiting higher-reward responses. As a result, similar to the recommendation in Zheng et al. (2023b), careful tuning is required to use reward clipping successfully in practice.
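A minimal sketch of reward clipping, assuming the reward-model output is clipped symmetrically to $[-c, c]$ before the KL penalty is subtracted. Both the symmetric range and the order of operations are assumptions made for illustration, as are the toy numbers.

```python
def clipped_auxiliary_reward(reward, logp_policy, logp_sft, beta, c):
    """Clip the RM reward to [-c, c], then subtract the naive KL penalty.

    Assumptions: symmetric clipping range and clipping applied only to
    the reward-model output, not to the KL term.
    """
    clipped = max(-c, min(c, reward))
    return clipped - beta * (logp_policy - logp_sft)

# A suspiciously high reward (possibly from hacking) versus a moderate one.
hacked = clipped_auxiliary_reward(10.0, -1.0, -1.0, beta=0.1, c=4.0)
moderate = clipped_auxiliary_reward(4.0, -1.0, -1.0, beta=0.1, c=4.0)
ordinary = clipped_auxiliary_reward(2.5, -1.0, -1.0, beta=0.1, c=4.0)

# Overly aggressive clipping erases the difference between good and mediocre.
aggressive_good = clipped_auxiliary_reward(3.0, -1.0, -1.0, beta=0.1, c=1.0)
aggressive_ok = clipped_auxiliary_reward(1.2, -1.0, -1.0, beta=0.1, c=1.0)
```

The last two values coincide, which illustrates why a very small `c` can hinder learning: the policy no longer receives a signal for preferring the better of two above-threshold responses.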

Length penalty. A more straightforward way to prevent reward hacking on length is to explicitly penalize longer responses. Singhal et al. (2023) add a length penalty proportional to the response length, using the standard deviation of the reward as the coefficient. However, to eliminate the correlation with length, we also need to consider the covariance between the reward and length, which can change constantly during RL due to shifts in the distribution of generations. Therefore, we simply make the coefficient a tunable constant $\alpha>0$ , and change the auxiliary reward into $\hat{r}_{\bm{\theta}}^{\text{lp}}(x,y)=\hat{r}_{\bm{\theta}}(x,y)-\alpha\cdot L(y)$ , where $L(y)$ is the number of tokens in the response $y$ . In Fig. 4, we show that the length penalty makes $\hat{r}_{\bm{\theta}}^{\text{lp}}(x,y)$ less affected by length and improves the Pareto front, but is not as effective as Odin, which bakes length decorrelation into RM training to make the reward more reliable and does not add new hyperparameters to RL.
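The effect of the length penalty can be sketched on a toy example in which a biased reward is the sum of a quality term and a length-proportional term. The decomposition, the bias coefficient, and the response names are hypothetical; in practice the bias is unknown and not exactly linear, which is why $\alpha$ has to be tuned.

```python
def length_penalized_reward(aux_reward, num_tokens, alpha):
    """Auxiliary reward minus alpha times the response length in tokens."""
    return aux_reward - alpha * num_tokens

# Hypothetical responses: (name, true quality, length in tokens).
responses = [("short_good", 1.0, 100), ("long_bad", 0.5, 300)]

# Assumed biased reward: quality plus a linear length bonus.
length_bias = 0.004
biased = {name: q + length_bias * n for name, q, n in responses}

# Penalize with alpha matching the (in practice unknown) bias.
penalized = {
    name: length_penalized_reward(biased[name], n, alpha=0.004)
    for name, _, n in responses
}

best_biased = max(biased, key=biased.get)
best_penalized = max(penalized, key=penalized.get)
```

With the biased reward the longer, lower-quality response ranks first; after the penalty the ranking follows quality. The sketch only works this cleanly because the bias is linear and `alpha` matches it exactly.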

Reward Disentanglement: a more reliable approach

In the previous section, we have shown the challenges in reducing reward hacking on length through tuning and tricks in RL when using a vanilla reward model. Here, we demonstrate a better approach where we train the reward model to disentangle the actual reward from the spurious reward. The spurious reward correlates with patterns that are easy to identify but do not represent the actual quality of the response. It adds to the vulnerabilities of the reward model, since reward hacking is often a consequence of spurious rewards being exploited. Different from previous approaches that learn and integrate rewards from multiple types of preferences (Wu et al., 2023), we discard the spurious rewards during RL. We find that this removes the need for reward clipping and length penalty to prevent length increase, and achieves better results without excessive tuning on the disentangled reward model.

Learning Multiple Rewards on Shared Representations. To minimize the overhead for learning disentangled rewards, we increase the output dimension of the final linear layer of the RM to predict different rewards. This is sufficient to separate out the spurious reward, since the RM is a pretrained LLM with enough capacity. Specifically in the case of disentangling length reward $r_{\bm{\theta}}^{\text{L}}(x,y)$ and the actual reward reflecting quality of the response $r_{\bm{\theta}}^{\text{Q}}(x,y)$ , we represent the full reward from the feature representation as $r_{\bm{\theta}}^{\text{Q}}(x,y)+r_{\bm{\theta}}^{\text{L}}(x,y)$ , and consider the following ranking loss for reward model:

$$\mathcal{L}^{\text{R}}_{\bm{\theta}}(x,y_{w},y_{l})=-\mathbb{E}\big[\log\big(\sigma\big(r_{\bm{\theta}}^{\text{Q}}(x,y_{w})+r_{\bm{\theta}}^{\text{L}}(x,y_{w})-r_{\bm{\theta}}^{\text{Q}}(x,y_{l})-r_{\bm{\theta}}^{\text{L}}(x,y_{l})\big)\big)\big], \tag{8}$$

which equivalently trains the model to decompose the original projection weights into the sum of two sets of projection weights, and should have better capacity than the single-head baseline in Eq. 1.

Disentangling the Rewards. We consider the case when supervision can be added to all but one of the rewards, since unsupervised learning of disentangled representations is impossible without inductive biases on both the models and the data for generative models (Locatello et al., 2019). In the case of length and quality, we first design the loss to enhance the length correlation of $r^{\text{L}}$ while minimizing that for $r^{\text{Q}}$ as follows:

$$\mathcal{L}^{\text{L}}_{\bm{\theta}}(x,y)=\big|\rho\big(r_{\bm{\theta}}^{\text{Q}}(x,y),L(y)\big)\big|-\rho\big(r_{\bm{\theta}}^{\text{L}}(x,y),L(y)\big), \tag{9}$$

where $L(y)$ is the number of tokens in the response $y$ , and $\rho(X,Y)$ is the Pearson correlation of $X,Y$ computed within the global minibatch. To compute $\rho$ within the global minibatch when data parallelism is enabled, we gather the rewards and lengths from all devices only in the forward pass, which leads to the correct gradients for parameters ${\bm{\theta}}$ in the backward pass since the reward predictions are independent of each other in the Transformer architecture (Vaswani et al., 2017). Note we use $\mathcal{L}^{\text{L}}_{{\bm{\theta}}}(x,y)$ as a regularization added to the ranking loss in Eq. 8. When $\mathcal{L}^{\text{L}}_{{\bm{\theta}}}(x,y)$ is minimized to $-1$ , $r_{\bm{\theta}}^{\text{L}}$ is perfectly correlated with length while $r_{\bm{\theta}}^{\text{Q}}$ is uncorrelated with it, so $r_{\bm{\theta}}^{\text{L}}$ and $r_{\bm{\theta}}^{\text{Q}}$ will have zero correlation. This can be beneficial since it indicates that $r_{\bm{\theta}}^{\text{Q}}$ and $r_{\bm{\theta}}^{\text{L}}$ did not co-adapt to reduce the ranking loss and that both heads are learning independently to maximize their predictive power. However, perfect correlation and decorrelation can be hard to achieve in practice, since we usually train on minibatches, and we want to generalize the RM to OOD examples in RL.
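The length-correlation regularizer can be sketched on a toy minibatch as follows, assuming it takes the form $|\rho(r^{\text{Q}},L)|-\rho(r^{\text{L}},L)$ , which is consistent with the stated minimum of $-1$ . The helper names and the batch values are hypothetical, and a real implementation would compute this on differentiable tensors gathered across devices.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences within a minibatch."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # Guard (an implementation choice): treat a constant input as uncorrelated.
    return cov / denom if denom > 0 else 0.0

def length_loss(r_quality, r_length, lengths):
    """Assumed form: |rho(r_Q, L)| - rho(r_L, L); its minimum is -1."""
    return abs(pearson(r_quality, lengths)) - pearson(r_length, lengths)

lengths = [10, 20, 30, 40]

# Disentangled heads: r_L tracks length exactly, r_Q is uncorrelated with it.
loss_disentangled = length_loss(
    r_quality=[1.0, -1.0, -1.0, 1.0],
    r_length=[0.1, 0.2, 0.3, 0.4],
    lengths=lengths,
)

# Entangled heads: the quality head also tracks length.
loss_entangled = length_loss(
    r_quality=[0.1, 0.2, 0.3, 0.4],
    r_length=[0.1, 0.2, 0.3, 0.4],
    lengths=lengths,
)
```

The absolute value on the quality term penalizes both positive and negative correlation with length, while the length head is rewarded only for positive correlation.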

To further enhance disentanglement between $r_{\bm{\theta}}^{\text{Q}}$ and $r_{\bm{\theta}}^{\text{L}}$ and learn both more effectively, we enforce the orthogonality of their projection weights. Specifically, let ${\mathbf{W}}_{\text{Q}},{\mathbf{W}}_{\text{L}}\in{\mathbb{R}}^{1\times d}$ be the linear projections for the quality and length rewards. We introduce the orthogonality loss

$$\mathcal{L}_{\bm{\theta}}^{\text{O}}=\big|{\mathbf{W}}_{\text{Q}}{\mathbf{W}}_{\text{L}}^{T}\big|. \tag{10}$$

When enforced together with $\mathcal{L}^{\text{R}}_{\bm{\theta}}(x,y_{w},y_{l})$ and $\mathcal{L}^{\text{L}}_{{\bm{\theta}}}(x,y)$ , $\mathcal{L}_{\bm{\theta}}^{\text{O}}$ can be beneficial for disentangling the feature representations of length and quality into orthogonal subspaces, because the feature representation of the RM will learn to represent the quality and length to minimize $\mathcal{L}^{\text{R}}_{\bm{\theta}}(x,y_{w},y_{l})$ and $\mathcal{L}^{\text{L}}_{{\bm{\theta}}}(x,y)$ , and the quality and length components aligning with ${\mathbf{W}}_{\text{Q}}$ and ${\mathbf{W}}_{\text{L}}$ will be orthogonal as ${\mathbf{W}}_{\text{Q}}$ and ${\mathbf{W}}_{\text{L}}$ are learned to be orthogonal. In Table 1 and Fig. 5, we show that adding $\mathcal{L}^{\text{O}}_{\bm{\theta}}$ further reduces the length correlation and leads to even better RL policies.

Note that both $\mathcal{L}^{\text{L}}_{{\bm{\theta}}}(x,y)$ and $\mathcal{L}_{\bm{\theta}}^{\text{O}}$ can be minimized when ${\mathbf{W}}_{\text{Q}}=0$ . To prevent this degeneration from happening and improve training dynamics, we add weight normalization (Salimans and Kingma, 2016) to both ${\mathbf{W}}_{\text{Q}}$ and ${\mathbf{W}}_{\text{L}}$ before computing the losses and predicting the rewards.
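The interaction between the orthogonality loss and weight normalization can be sketched as below, assuming the loss is the absolute inner product of the two projection vectors and simplifying weight normalization to rescaling each vector to unit norm (the learnable gain of Salimans and Kingma's formulation is fixed to 1 here). The vectors are toy two-dimensional examples.

```python
import math

def weight_normalize(v, gain=1.0):
    """Rescale v to norm `gain` (simplified weight normalization, gain fixed)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [gain * a / norm for a in v]

def orthogonality_loss(w_q, w_l):
    """Assumed form: absolute inner product |W_Q W_L^T| of the two projections."""
    return abs(sum(a * b for a, b in zip(w_q, w_l)))

# A nearly collapsed quality projection that is parallel to the length one.
w_q_raw = [0.001, 0.001]
w_l_raw = [1.0, 1.0]

# Without normalization, shrinking W_Q makes the loss look small.
loss_raw = orthogonality_loss(w_q_raw, w_l_raw)

# With normalization, the parallel directions are fully penalized.
loss_normalized = orthogonality_loss(
    weight_normalize(w_q_raw), weight_normalize(w_l_raw)
)

# Genuinely orthogonal directions give zero loss regardless of scale.
loss_orthogonal = orthogonality_loss(
    weight_normalize([1.0, 0.0]), weight_normalize([0.0, 3.0])
)
```

After normalization the loss depends only on the angle between the two projections, so the model cannot reduce it by driving ${\mathbf{W}}_{\text{Q}}$ toward zero.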

We train Odin with weight-normalized ${\mathbf{W}}_{\text{Q}}$ and ${\mathbf{W}}_{\text{L}}$ to minimize the following loss

$$\mathcal{L}^{\text{R}}_{\bm{\theta}}(x,y_{w},y_{l})+\lambda_{\text{L}}\mathcal{L}^{\text{L}}_{\bm{\theta}}(x,y_{w})+\lambda_{\text{L}}\mathcal{L}^{\text{L}}_{\bm{\theta}}(x,y_{l})+\lambda_{\text{O}}\mathcal{L}^{\text{O}}_{\bm{\theta}}, \tag{11}$$

where $\lambda_{\text{L}},\lambda_{\text{O}}>0$ are constants for regularization strength. In RL, we only use the $r^{\text{Q}}$ from Odin. Without excessive tuning, we find that setting $\lambda_{\text{L}}=\lambda_{\text{O}}=1$ yields reasonably good results for RL, outperforming many baselines in Fig. 2. In Table 1, we show that using only the quality reward $r_{\bm{\theta}}^{\text{Q}}$ of the disentangled RM maintains the validation accuracy compared with the baseline, while drastically reducing correlation with length.

Experiments

Dataset. We use the OpenAssistant dataset (Köpf et al., 2023), a human-generated, human-annotated assistant-style conversation corpus with over 10,000 complete and fully annotated conversation trees. Our preprocessing of this dataset involves the following steps: (1) We transform all items into a dialogue format (see Section E.5) and discard samples with non-English prompts or responses. (2) For prompts associated with multiple ranked responses, we retain all these responses by considering all the pairwise comparisons. This results in $k(k-1)/2$ unique comparisons when a prompt has $k$ ranked responses. As a result, we use 22,065 examples for RM training, and 7494 prompts for RL tuning.
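The expansion of ranked responses into pairwise comparisons can be sketched with `itertools.combinations`. The function name and the example strings are hypothetical; the sketch assumes the responses are already ordered from best to worst.

```python
from itertools import combinations

def ranked_to_pairs(prompt, ranked_responses):
    """Turn k responses (ordered best to worst) into k(k-1)/2 preference pairs.

    Each pair is (prompt, chosen, rejected); combinations() preserves the
    input order, so the first element of each pair is the higher-ranked one.
    """
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

# Hypothetical prompt with k = 4 ranked responses.
pairs = ranked_to_pairs("How do I sort a list?", ["A", "B", "C", "D"])
k = 4
```

For `k = 4` this yields six comparisons, matching $k(k-1)/2$ .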

We use Vicuna-7B (footnote: https://huggingface.co/lmsys/vicuna-7b-v1.5) as the base model $\pi_{\text{SFT}}$ , which is an SFT model with decent instruction-following capability. We fine-tune the reward model from Vicuna-7B with a randomly initialized projection layer appended to the last layer. We also initialize the policy $\pi_{\bm{w}}$ from the same Vicuna-7B. All experiments are implemented with DeepSpeed-Chat (Yao et al., 2023) and Huggingface Transformers (Wolf et al., 2020), running on 8 NVIDIA A100 80GB GPUs. We tried different learning rates from $\{1e-5,3e-5,5e-5\}$ with batch size $128$ for tuning both the baseline RM and Odin on 22k preference data for 3 epochs, and picked the one with the highest validation accuracy for both. We fine-tune all the parameters in the models for both RM training and RL without freezing anything or using adapters. To evaluate how the efficacy of Odin can transfer across different RL algorithms, we experiment with ReMax (Li et al., 2023b), an efficient and effective version of REINFORCE without a value network, and Proximal Policy Optimization (PPO) (Schulman et al., 2017). We provide more details on the hyperparameters in Appendix E. To compare with other alternatives for utilizing human feedback, we re-implement Direct Preference Optimization (DPO) (Rafailov et al., 2023) and use it to tune the same Vicuna-7B on the same Open Assistant human preference data that we use to train our reward models. For reference, we also evaluate and compare with another open-source model trained with DPO, tulu-2-dpo-7b (Ivison et al., 2023), which is based on the same pretrained model (Llama 2 7B) as Vicuna-7B.

Evaluation Metrics. Our main focus is on open-ended generation. Incorporating recent advances in automated evaluation (Dubois et al., 2023; Zheng et al., 2023a; Chiang et al., 2023), we use model-based metrics for large-scale studies. We use GPT-4 (OpenAI, 2023) as the judge to compare two responses for each prompt. We use the same prompt as Chen et al. (2023), where GPT-4 is asked to give a rating for each response when both responses are present in the input; see Appendix D for details. By comparing the two ratings, the result can be win, tie, or lose. To counter positional bias in GPT-4 ratings (Wang et al., 2023a), we collect two sets of ratings by alternating the order of test and baseline model responses. A winning response must receive at least one win and at most one tie. This protocol can mitigate the positional bias and improve the rating quality of GPT-4, as reported by Chiang et al. (2023). After counting the numbers of wins, ties, and losses for the test model, we use the Win Score as defined in Eq. 6 as the aggregated metric. To show the relative improvement each model obtains over the SFT baseline (Vicuna-7B), in all our GPT-4 evaluations we use, for each prompt, one response generated by Vicuna-7B and one from the RL policy we want to evaluate. Taking the length bias in the GPT-4 evaluations into account (Wang et al., 2023b), a real improvement is achieved with a higher Win Score at a similar average length; therefore, we use the Pareto front achieved by each method for the final judgement. To validate the results, we also select the best models at different length scales and compare them in human studies.
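The two-order judging protocol can be sketched as follows. The rule that a win requires at least one win and no loss across the two orderings follows the description above; treating the symmetric case as a loss and a win-plus-lose split as a tie are assumptions of this sketch, and the ratings are made up.

```python
def compare_ratings(test_rating, baseline_rating):
    """Outcome of one judged ordering, from the test model's perspective."""
    if test_rating > baseline_rating:
        return "win"
    if test_rating < baseline_rating:
        return "lose"
    return "tie"

def aggregate_two_orders(first, second):
    """Combine the outcomes from the two response orderings.

    Assumptions: 'lose' mirrors the 'win' rule, and one win plus one
    loss (a position-dependent verdict) counts as a tie.
    """
    outcomes = (first, second)
    wins, loses = outcomes.count("win"), outcomes.count("lose")
    if wins >= 1 and loses == 0:
        return "win"
    if loses >= 1 and wins == 0:
        return "lose"
    return "tie"

# Hypothetical judge ratings per prompt:
# ((test, baseline) in order A, (test, baseline) in order B).
ratings = [
    ((8, 6), (7, 7)),  # win + tie
    ((5, 7), (6, 6)),  # lose + tie
    ((8, 6), (5, 7)),  # win + lose (position-dependent)
    ((7, 7), (7, 7)),  # tie + tie
    ((9, 5), (8, 6)),  # win + win
]
final = [
    aggregate_two_orders(compare_ratings(*a), compare_ratings(*b))
    for a, b in ratings
]
n_win, n_lose, n = final.count("win"), final.count("lose"), len(final)
```

The resulting counts `n_win`, `n_lose`, and `n` are the quantities that enter the Win Score.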

Benchmarks. For the GPT-4 evaluation and human studies, we use prompts from the LIMA (Zhou et al., 2023) test set, which contains 300 open-ended prompts in total. We also evaluate the performance of our models on benchmarks targeting specific model capabilities. Following Instruct-Eval (Chia et al., 2023), we test the trained policy $\pi_{\text{RL}}$ on BBH (Suzgun et al., 2022), MMLU (Hendrycks et al., 2020), DROP (Dua et al., 2019), and TruthfulQA (Lin et al., 2021) to evaluate the model's ability on challenging task solving, multi-task knowledge, math, and truthfulness. We expect the trained policy to improve on the LIMA evaluations and to maintain its abilities on the benchmarks (BBH, MMLU, DROP, and TruthfulQA), which are not targeted by the Open Assistant data we are using but were obtained from pretraining.

Results

RM Evaluation. The efficacy of the reward models is best judged by the performance of the policy they supervised, which is demonstrated by the large-scale studies based on GPT-4 evaluation in Fig. 2 and our human studies in Fig. 6. For direct comparison of the reward models, we mainly evaluate the accuracy of distinguishing the chosen and rejected responses on the Open Assistant test set. We also look at the correlation of the reward with length to measure how much the reward prediction relies on length. Besides the linear Pearson correlation $\rho$ , which we explicitly used for training Odin, we also consider the rank correlations, Kendall’s $\tau$ and Spearman’s $r_{s}$ (See Appendix C), to see how much the reward rankings correlate with length rankings, as the reward model is optimized for ranking. We report results of RMs with the highest validation accuracy in Table 1. It shows that, despite only being trained to minimize the Pearson correlation with length, the rank correlations are also eliminated, which helps understand why Odin outperforms the linear length penalty in Fig. 4 as it can only remove linear correlation. Without exploiting length information, Odin is able to maintain most of the prediction accuracy on preference data, and the drop is insignificant considering the significant reduction in correlation and the 66% natural length bias in the preference data. This indicates that $r^{\text{Q}}$ better utilized the actual content for rankings.
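To make the distinction between linear and rank correlation concrete, the sketch below computes Pearson's $\rho$ , Spearman's $r_{s}$ , and Kendall's $\tau$ for a toy reward that is a monotone but nonlinear function of length. The implementations assume no ties (the simplest variants of both rank statistics), and the data are synthetic.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def spearman(xs, ys):
    """Spearman's r_s: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Synthetic example: reward is a monotone, nonlinear function of length.
lengths = [10, 50, 100, 200, 400]
rewards = [float(length) ** 3 for length in lengths]

rho = pearson(rewards, lengths)
r_s = spearman(rewards, lengths)
tau = kendall_tau(rewards, lengths)
```

Here the rank correlations are maximal while the Pearson correlation is not, which is why rank statistics are a useful complement when the reward model is trained for ranking.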

Automatic Evaluation. The main results are shown in Fig. 2, where the Pareto front of the policy $\pi_{{\bm{w}}}$ trained by Odin is always higher than that of the respective baselines (PPO* and ReMax*) when $L(y)\geq 210$ . $L(y)<210$ may indicate lower quality as the SFT model tuned on high-quality demonstrations has $L(y)=220$ . Note that:

For the PPO* and ReMax* baselines shown in Fig. 2, we have included additional tricks (reward-clipping and length-penalty) and used more compute budget for enhancement.

Considering the challenges in selecting the best checkpoint due to reward hacking (Ramé et al., 2024), and the limited evaluation budget, we prioritize evaluating three checkpoints for each run: 1) the checkpoint at step 500; 2) the checkpoint at step 702, the last step; and 3) the checkpoint with the highest reward on the evaluation set. We then include all available data points in Fig. 2.

We also provide head-to-head GPT-4 evaluations of the best models of each method in Fig. 6.

Human Studies. We further conduct human studies involving 8 college students as participants rating the quality of generated responses. Each rater evaluates 90 samples, with at least three ratings obtained for each sample. Due to the limited budget, we sample 60 prompts from the LIMA test set in each group of evaluation. Since human evaluations can also be biased toward longer or shorter responses, we select models with similar average lengths on the Pareto front of each method for comparisons. For each sample, we present raters with the original prompt as well as two randomly positioned responses. Referring to the guideline, the rater chooses the better response or rates both as similar. The guideline asks raters to consider the following criteria: Alignment with the User's Intent, Clarity and Precision, Directness and Relevance, and Efficiency and Brevity. (See Appendix B for details.) The results can be seen in Fig. 6, where all the examined models trained with Odin are preferred over the baselines, with the difference (footnote: as indicated by win rate minus loss rate, or Win Score) becoming more significant as length increases.

Results on Benchmarks. We show the results in Table 3. We observe improvements in TruthfulQA, which may come from a better understanding of the questions after RLHF. The trained policies also maintain the performance on all other tasks compared to the SFT initialization. It is worth pointing out that on every length scale, the policies trained with Odin could perform better than those trained with the vanilla reward model.

Related Works

Learning from Feedback. Since its first application to language models (Ziegler et al., 2019), RLHF has empowered the success of several epochal LLM systems (Schulman et al., 2022; OpenAI, 2023; Pichai, 2023; Anthropic, 2023; Google, 2023), and more diverse sources of preferences have been used to train the reward model (Bai et al., 2022; Lee et al., 2023) or provide feedback directly in RL (Liu et al., 2023). Since both human and LLM evaluators have biases, Odin stays relevant for most types of feedback as long as a reward model needs to be trained. Many capable conversational AI systems (Schulman et al., 2022; Anthropic, 2023) use online algorithms like PPO (Schulman et al., 2017) for RL and demonstrate strong instruction-following ability. Many offline alternatives have also shown promise for better learning from feedback, including SLiC-HF (Zhao et al., 2023), DPO (Rafailov et al., 2023), IPO (Azar et al., 2023), KTO (Ethayarajh et al., 2023), ReST (Gulcehre et al., 2023) and RSO (Liu et al., 2024). They use humans or reward models to annotate a large batch of LLM generations and then train the policy on the annotated experiences (the generations) or preferences, without sampling from the policy during training. Offline algorithms can be less prone to reward hacking as the experiences are updated less frequently, but hacking can still happen in the long term. In this paper, we focus on studying the impact of the RM on the online algorithms, which are widely adopted by practical systems.

As a sign of reward hacking, RLHF often causes response length to increase, especially when optimized for helpfulness. Singhal et al. (2023) explored ways to reduce the length increase for PPO, including regularizations for PPO (increasing KL regularization, omitting outputs beyond a length threshold, and reward scaling) and improvements to the reward model training data (including length balancing, confidence-based truncation, and reward data augmentation with random preferred responses as negative examples). Their mitigations for PPO were not able to prevent the length increase compared to SFT, and they lowered the reward. Their improvements to the reward model either decrease reward model accuracy or fail to reduce the correlation with length to sufficiently small values. Language models sometimes have the opposite length bias and favor generating shorter sequences, e.g., Sountsov and Sarawagi (2016) found that encoder-decoder models tend to generate shorter sequences with beam search. Multiple approaches have been proposed to mitigate reward hacking in RLHF. Shen et al. (2023) proposed to use a smaller reward model to learn the biases in the reward and a larger reward model to learn the true reward. Different from their approach, we explicitly train a linear projection on the shared reward model features to be correlated with length and remove such correlation from the other head. Rewarded Soup (Ramé et al., 2023) interpolates the weights of policies trained to optimize different reward objectives, which can approximate more costly multi-policy strategies. Eisenstein et al. (2023) found that reward model ensembles can mitigate reward hacking but do not eliminate it. Instead of interpolating the policies, Ramé et al. (2024) proposed a more efficient approach, which uses weight-averaged reward models to improve their OOD robustness and reduce reward hacking in RL. Like their approach, Odin does not sacrifice reward model efficiency for RL, while significantly improving results in practice. Besides the methods above, a more straightforward way is to continuously gather human preference data by sampling from the current optimal policy to identify the hacking responses, retrain the reward model, and continue training the policy with the new reward model (Ziegler et al., 2019). However, this process can be costly, and its effectiveness relies on the assumptions that the quality of human ratings is sufficiently high and that biases in the human preferences can be effectively controlled.

LLM Evaluations. For instruction-following evaluation, current evaluations of SFT/RLHF models usually rely on LLM evaluators like GPT-4 during development because of their scalability and efficiency (Zheng et al., 2023a; Touvron et al., 2023). However, current open-source models can often exploit the length bias of the LLM evaluators, generating excessively verbose responses to achieve higher scores on benchmarks like Alpaca Eval (Li et al., 2023a; Liu, 2024). To conduct fair and holistic evaluations of the actor models trained via our reward models, we evaluate the models by comparing the Pareto fronts of the trade-off between evaluation score and length. As for benchmarks, we aim to evaluate the base capabilities of LLMs, e.g., reasoning and factuality, on BBH (Suzgun et al., 2022), MMLU (Hendrycks et al., 2020), DROP (Dua et al., 2019), and TruthfulQA (Lin et al., 2021). Since these capabilities are mostly gained from the pretraining corpus (Zhou et al., 2023), and the fine-tuning stage has limited data compared to pretraining, we only expect the performance to be maintained on these benchmarks after RLHF.
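To make the score-to-length Pareto front concrete, the following is a small sketch that extracts the front from a set of (average length, evaluation score) pairs, one pair per trained model. It assumes that a model is dominated when another model is no longer and scores at least as high; the function name and the example numbers are hypothetical and are not results from our experiments.

```python
def pareto_front(points):
    """Return the non-dominated (length, score) pairs, sorted by length.

    Assumption for this sketch: shorter length and higher score are both
    preferred, so a point is kept only if its score is strictly higher than
    the score of every point that is no longer than it.
    """
    front = []
    best_score = float("-inf")
    # Sort by length; among equal lengths, visit the highest score first.
    for length, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            front.append((length, score))
            best_score = score
    return front


# Hypothetical (average length, win score) pairs for five checkpoints.
models = [(150, 0.40), (180, 0.38), (200, 0.55), (220, 0.50), (260, 0.60)]
print(pareto_front(models))
# -> [(150, 0.4), (200, 0.55), (260, 0.6)]
```

Comparing two methods then amounts to comparing their fronts at matching lengths, rather than comparing single checkpoints whose lengths may differ.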

Conclusion

In this work, we address the challenge of reward hacking in RLHF, focusing particularly on verbosity as a form of reward hacking. To combat this, we first introduce a more reliable evaluation protocol which evaluates different methods by the score-to-verbosity trade-off. We conduct extensive experiments to verify the impact of hyperparameters and tricks (reward clipping and length penalty) on reward hacking. While we observed some trends for PPO clipping and the replay buffer size, the best results of the baselines come from tuning all these dimensions, and it becomes hard to draw definitive conclusions about how these hyperparameters should be tuned when applied all together. We seek to resolve the issue at its root and propose Odin, a novel approach designed to disentangle the representation of content quality from the length of responses. Odin demonstrates notable improvements on the Pareto front, which transfer across two RL algorithms (ReMax and PPO). This advancement not only showcases the effectiveness of our approach but also sheds light on future research in RLHF. Evaluating and generalizing Odin to other types of hacking is an interesting future direction.

References

Appendix A Appendix

Appendix B Human Study

We designed the human study interface shown in Fig. 7 based on Gradio. After consenting to the study, the participants are presented with a screen containing a session ID, which is used to track and refer back to the session, and guidelines on how to evaluate the responses. The criteria used are described in Table 4.

Appendix C Correlation Metric

We use three correlation metrics in our main paper, i.e., Spearman’s rank correlation $r_{s}$ , Kendall’s $\tau$ , and Pearson $\rho$ . We compute $\rho$ , $r_{s}$ and $\tau$ using the following formulas:

where $d_{i}=R(X_{i})-R(Y_{i})$ is the difference between the two ranks of each observation and $n$ is the number of observations.
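As a sketch, assuming the standard textbook definitions $\rho=\frac{\mathrm{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}$ , $r_{s}=1-\frac{6\sum_{i}d_{i}^{2}}{n(n^{2}-1)}$ and $\tau=\frac{2}{n(n-1)}\sum_{i<j}\mathrm{sgn}(x_{i}-x_{j})\,\mathrm{sgn}(y_{i}-y_{j})$ , which are consistent with the $d_{i}$ and $n$ above, the three coefficients can be computed as follows. The function names and the example data are hypothetical, and the rank-based formulas assume there are no ties.

```python
import math


def pearson(x, y):
    """Pearson rho: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """Rank 1 is the smallest value; assumes no ties (sketch only)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman r_s from the rank differences d_i = R(X_i) - R(Y_i)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def sign(v):
    return (v > 0) - (v < 0)


def kendall(x, y):
    """Kendall tau: concordant minus discordant pairs over all pairs."""
    n = len(x)
    total = sum(
        sign(x[i] - x[j]) * sign(y[i] - y[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 2 * total / (n * (n - 1))


# Hypothetical response lengths and reward scores for five responses.
lengths = [50, 120, 80, 200, 150]
rewards = [0.1, 0.5, 0.7, 0.9, 0.6]
print(pearson(lengths, rewards), spearman(lengths, rewards), kendall(lengths, rewards))
```

For this toy data the rank differences are $(0,1,-2,0,1)$, giving $r_{s}=0.7$, and 8 of the 10 pairs are concordant, giving $\tau=0.6$.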

Appendix D Evaluation Prompt

Appendix E Hyperparameter

For RLHF training, to encourage exploration, we choose $\text{top\_p}=0.9$ and temperature $\text{T}=1.0$ as the generation config, which aligns with the setting used in DeepSpeed-Chat and ReMax. For evaluation, we use $\text{T}=0.8$ and $\text{top\_p}=0.8$ to avoid excessive randomness in the generations.
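To illustrate how these two settings differ, the sketch below applies temperature scaling followed by nucleus (top-p) filtering to a toy next-token distribution. It is a simplified, self-contained illustration rather than the sampling code of any particular library; the function name and the logits are hypothetical, and edge-case conventions (for example, how the token that crosses the threshold is handled) may differ between implementations.

```python
import math


def top_p_distribution(logits, top_p, temperature):
    """Return {token_index: probability} after temperature and top-p filtering.

    Simplified sketch: keep the smallest set of most likely tokens whose
    cumulative probability reaches `top_p`, then renormalize.
    """
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    normalizer = sum(exps)
    probs = [value / normalizer for value in exps]

    kept, cumulative = [], 0.0
    for index in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
train_dist = top_p_distribution(toy_logits, top_p=0.9, temperature=1.0)
eval_dist = top_p_distribution(toy_logits, top_p=0.8, temperature=0.8)
print(sorted(train_dist))  # candidate tokens under the training config
print(sorted(eval_dist))   # candidate tokens under the evaluation config
```

On this toy distribution the training config keeps three candidate tokens while the evaluation config keeps two, which reflects the intent of exploring more during RL and sampling more conservatively at evaluation time.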

E.2 PPO config

We do full-model fine-tuning for both the actor and critic. Same as [Nakano et al., 2021], we use one epoch (set $K=1$ ), and set $\gamma=1.0,\lambda=0.95$ for GAE. We train the model on Open Assistant for 3 epochs, which translates to 702 gradient update steps under the batch size $b=32$ , and takes around 11 hours to finish on 8 A100 GPUs with ZeRO stage 2. To make the search space tractable, we use the same learning rate $\eta$ for the actor and critic. We search $\eta\in\{5e-7,1e-6,2e-6\}$ , $\epsilon\in\{0.1,0.2,0.4\}$ , $\beta\in\{2.5e-3,5e-3,1e-2,2e-2\}$ , $c\in\{\infty,2,4\}$ , and $N\in\{32,64,256\}$ . Note that we did not finish all experiments with $\beta=2.5e-3$ , but we have included the partial results in the plots when $\beta=2.5e-3$ is not explicitly excluded. The max input prompt length and max response length are both set to 512.
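For readers unfamiliar with GAE, here is a minimal sketch of the advantage computation with the $\gamma=1.0$ and $\lambda=0.95$ used above. It follows the standard GAE recursion rather than any specific training codebase; the function name, the per-token rewards, and the critic values in the example are hypothetical.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one response.

    `rewards[t]` is the reward at token t, and `values` has one extra entry:
    `values[t]` is the critic estimate at token t and `values[-1]` is the
    bootstrap value after the last token (0.0 for a finished response).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical 3-token response with a sequence-level reward on the last token.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
print(gae_advantages(rewards, values))
```

With $\gamma=1.0$ there is no discounting of the final reward itself, and $\lambda=0.95$ only controls how quickly the credit from the last-token TD residual decays toward earlier tokens (here 0.5, 0.475, and 0.45125 from last to first).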

E.3 ReMax Config

Full-model fine-tuning is applied as well. Same as PPO, we use a global batch size of 32 and train the model for 3 epochs on the prompt set. The max input prompt length and output response length are both set to 1024. We first search $\beta\in\{1e-3,2.5e-3,5e-3,1e-2\}$ and $\eta\in\{1e-6,5e-7\}$ . However, we found that the response lengths of the trained actor models are mostly over 225. Since ReMax baselines, unlike PPO, have few hyperparameters (only $\beta$ and $\eta$ ), we add extra runs with $\beta\in\{5e-3,5.5e-3,6e-3,\ldots,9.5e-3\}$ and $\eta=5e-7$ to obtain results across more lengths, which makes the comparisons between different Pareto fronts more reliable.

E.4 Configs for Length Penalty Experiments

For the experiment shown in Fig. 4, we tried $\alpha\in\{1e-3,1e-4,1e-5,5e-4,1e-6,5e-6\}$ for ReMax, and $\alpha\in\{5e-5,1e-4,5e-4,1e-3\}$ for PPO. We select evaluation results with the same set of other RL hyperparameters, such as $\eta,\beta,\epsilon,N$ , for the different settings. Therefore, the length penalty setting always tends to have more data points.

E.5 Dialogue Format

We convert the prompts and responses in OpenAssistant into dialogue format using the following template:

If the dialogue is multi-turn, we use the same template and include the prompts and answers of all previous turns in the model’s input.

Appendix F Frequently Asked Questions

We use LIMA [Zhou et al., 2023] as our test set for evaluating the instruction-following capability of models since it has 300 prompts, which is more than other commonly used test sets, e.g., the WizardLM test set (218 prompts) [Xu et al., 2023], Koala (180 prompts) [Geng et al., 2023], MT-bench [Zheng et al., 2023a], and Self-Instruct (252 prompts) [Wang et al., 2022]. The cost of human evaluation is extremely high since we have a large number of actor models to evaluate. Thus, the main evaluations are conducted by GPT-4, and we additionally select some models on the Pareto front for the human study.

F.2 Why do you choose Vicuna-7B as the base model or the starting point of the RL?

We choose Vicuna-7B as our base model for two reasons. (1) Compared to other open-source 7B models, Vicuna-7B has strong instruction-following capability, and efficient and effective exploration in RLHF requires a good base model. (2) To ease comparison and provide an accurate and comprehensive view of the RL algorithms, we use a well-known SFT model rather than performing SFT on the OpenAssistant dataset ourselves. This also avoids having to select an SFT checkpoint, for which different people have different criteria. By using Vicuna-7B and the reward model we provide, we believe the community can reproduce our results more easily.

Appendix G Case Study

We show two comparisons in Fig. 12 and Fig. 13, where our models generate more accurate answers with shorter lengths.