Disentangling Length from Quality in Direct Preference Optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

Introduction

Recently, Large Language Models (LLMs) have seen significant improvements in capabilities such as code generation, mathematical reasoning, and tool use. Importantly, they can now fluently interact with users and follow their instructions, leading to their widespread adoption. Fine-tuning with Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Stiennon et al., 2022) has been a significant component in those advances and is now a standard part of advanced LLM training pipelines (Ouyang et al., 2022; Bai et al., 2022a; Touvron et al., 2023; Jiang et al., 2024; Anil et al., 2023). Currently, all the leading LLMs deploy some sort of RLHF pipeline (Dubois et al., 2024; Zheng et al., 2023; Liang et al., 2023). The classical approach consists of three stages. The first stage begins with a general model pre-trained with next-token prediction on a large corpus of text (Radford et al., 2019; Brown et al., 2020), which is then further fine-tuned for instruction following (Wei et al., 2022). In the second stage, the model is prompted with general requests and generates multiple possible answers, which are then ranked by the user. These rankings are used to train a reward model, which represents human preferences (Christiano et al., 2017; Stiennon et al., 2022; Ziegler et al., 2020; Bai et al., 2022a; Touvron et al., 2023). In the final stage, the instruction-tuned LLM is further trained to maximize the expected reward from the reward model trained in the second stage (a proxy for user preferences) using general-purpose reinforcement learning algorithms (Schulman et al., 2017; Mnih et al., 2016). While successful, this pipeline is technically complex and computationally expensive, mainly due to the final stage of RL optimization.

The quality of the learned reward model is crucial for the RLHF process (Touvron et al., 2023). However, prior works have demonstrated that reward models can be exploited (Casper et al., 2023; Gao et al., 2023) due to a Goodhart’s law effect (Clark and Amodei, 2016; Manheim and Garrabrant, 2019; Skalse et al., 2022; Lambert and Calandra, 2023). Under this phenomenon, the model can achieve high rewards during RL training while generating undesirable behaviours (Gao et al., 2023; Dubois et al., 2024). A particular case of reward exploitation is the well-known verbosity issue: models fine-tuned with RLHF generate significantly longer answers without necessarily improving the actual quality (Singhal et al., 2023; Kabir et al., 2023). This has been linked to an explicit bias in the preference data towards longer responses (Singhal et al., 2023). However, the statistical increase in verbosity of RLHF-trained models significantly exceeds the difference in length distributions between the preferred and rejected answers. This effect is observed even in strong proprietary models, such as GPT4 (John Schulman et al., 2022), which is now frequently used to evaluate the performance of other LLMs (Dubois et al., 2024; Zheng et al., 2023; Zeng et al., 2023). However, even as an evaluator, GPT4 exhibits strong preferences for length. Prior work (Wang et al., 2023) has noted that when evaluating 13B parameter models in head-to-head comparisons with the Davinci-003 model, win rates and the average number of unique tokens in the model’s response have a correlation of 0.96.

Recently, Direct Preference Optimization (DPO) (Rafailov et al., 2023) has emerged as an alternative to the standard RLHF pipeline. The key observation of DPO is that the reward model can be directly re-parameterized through the optimal LLM policy obtained in the reinforcement learning stage. This allows us to train the language model directly through the reward learning pipeline, eliminating the need for the reinforcement learning stage. This algorithm has become widely used, since it can train completely offline, which makes it simpler to tune, faster and more resource-efficient, while maintaining performance (Dubois et al., 2024; Jiang et al., 2024). For these reasons it has also been widely adopted by the open-source community. At the time of this writing, 9 out of the top 10 models on the HuggingFace Open LLM Leaderboard use DPO as part of their training pipeline.

While the question of length exploitation has been extensively studied in the classical RLHF pipeline, it has not been explored in the DPO setting before. Moreover, concerns have recently been raised that open-source models have not improved significantly across automated benchmarks, but instead have been exploiting the verbosity bias of the evaluator (Liu, 2024). This is illustrated in Figure 1: open-source models can match the overall performance of proprietary ones, but lag significantly on a length-corrected basis.

We make several contributions in our work. First, we study the length exploitation problem in the DPO setting and show that it is quite persistent, which we empirically link to out-of-distribution bootstrapping. Next, we derive a simple but efficient regularization approach, which we show can effectively control verbosity without impacting model performance, even when evaluated by a biased judge such as GPT4.

Preliminaries

In this section we will outline the core components of the standard RLHF pipeline (Ziegler et al.; Stiennon et al.; Bai et al.; Ouyang et al.) and the Direct Preference Optimization algorithm (Rafailov et al., 2023), which is central to our analysis and regularization derivations.

The standard RLHF pipeline consists of three stages: 1) in the first stage, we train a general pre-trained LLM for instruction following with supervised fine-tuning (SFT); 2) the Reward Modelling stage consists of gathering human feedback and training a parameterized reward model; 3) in the final Reinforcement Learning stage, we further optimize the LLM in a reinforcement learning loop, using the trained reward model from the previous stage.

SFT: During this stage, we use a dataset of prompts $\mathbf{x}$ and high-quality answers $\mathbf{y}$ to train an LLM with next-token prediction to obtain a model $\pi_{\text{SFT}}(\mathbf{y}|\mathbf{x})$. In our notation we treat the entire prompt and answer strings each as a single variable.

Reward Modelling Phase: In the second phase, the instruction-tuned model is given prompts $\mathbf{x}$ and produces pairs of answers $(\mathbf{y}_1, \mathbf{y}_2) \sim \pi_{\text{SFT}}(\mathbf{y}|\mathbf{x})$. Users then rank the answers, denoted as $\mathbf{y}_w \succ \mathbf{y}_l \mid \mathbf{x}$, where $\mathbf{y}_w$ and $\mathbf{y}_l$ are the preferred and dispreferred answer respectively. The rankings are usually assumed to be generated by the Bradley-Terry (BT) model (Bradley and Terry, 1952), in which the preference distribution $p$ is assumed to be driven by an unobserved latent reward $r(\mathbf{x},\mathbf{y})$ with the following parameterization:

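$$p(\mathbf{y}_1 \succ \mathbf{y}_2 \mid \mathbf{x}) = \frac{\exp\left(r(\mathbf{x},\mathbf{y}_1)\right)}{\exp\left(r(\mathbf{x},\mathbf{y}_1)\right) + \exp\left(r(\mathbf{x},\mathbf{y}_2)\right)}. \tag{1}$$

Then, given a dataset of user rankings $\mathcal{D} = \bigl\{\mathbf{x}^{(i)}, \mathbf{y}_w^{(i)}, \mathbf{y}_l^{(i)}\bigr\}_{i=1}^{N}$, we can train a parameterized reward model $r_\phi(\mathbf{x},\mathbf{y})$ using maximum likelihood:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_w,\mathbf{y}_l)\sim\mathcal{D}}\Bigl[\log\sigma\bigl(r_\phi(\mathbf{x},\mathbf{y}_w) - r_\phi(\mathbf{x},\mathbf{y}_l)\bigr)\Bigr], \tag{2}$$

where $\sigma$ is the logistic function.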
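As a minimal numerical sketch of the Bradley-Terry parameterization in Eq. 1 and the maximum-likelihood fit that follows from it, the snippet below computes a preference probability from two scalar rewards and the average negative log-likelihood over a batch of (preferred, rejected) reward pairs. The function names and the toy reward values are illustrative assumptions; in practice the rewards would come from a learned model $r_\phi$.

```python
import math


def bt_preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry probability that answer 1 is preferred over answer 2.

    Algebraically equal to exp(r1) / (exp(r1) + exp(r2)), written as a
    sigmoid of the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))


def reward_nll(pairs: list[tuple[float, float]]) -> float:
    """Average negative log-likelihood of (preferred, rejected) reward pairs."""
    return -sum(math.log(bt_preference_prob(rw, rl)) for rw, rl in pairs) / len(pairs)


# Hypothetical reward scores for three (preferred, rejected) answer pairs.
toy_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(f"P(y1 > y2) with equal rewards: {bt_preference_prob(1.0, 1.0):.3f}")
print(f"Average NLL on toy pairs: {reward_nll(toy_pairs):.4f}")
```

Only the reward difference enters the probability, so adding the same constant to both rewards leaves the preference distribution unchanged.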

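Reinforcement Learning Phase: During the final phase, we use the learned reward function in an RL loop, where the LLM is treated as a policy. The most common optimization objective is the following:

$$\max_{\pi_\theta}\ \mathbb{E}_{\mathbf{x}\sim\mathcal{D},\,\mathbf{y}\sim\pi_\theta(\mathbf{y}|\mathbf{x})}\bigl[r_\phi(\mathbf{x},\mathbf{y})\bigr] - \beta\,\mathbb{D}_{\text{KL}}\bigl[\pi_\theta(\mathbf{y}|\mathbf{x})\,\|\,\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\bigr], \tag{3}$$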

where $\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})$ is a reference distribution (usually taken to be $\pi_{\text{SFT}}(\mathbf{y}|\mathbf{x})$) and $\beta$ is a hyper-parameter. This objective trades off maximizing the reward $r_\phi(\mathbf{x},\mathbf{y})$ against a divergence term from a fixed reference distribution. The second term acts as a regularizer to prevent the policy $\pi_\theta$ from drifting too far away from the initialization $\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})$. This objective is then optimized using a general-purpose RL algorithm, such as PPO (Schulman et al., 2017).

Direct Preference Optimization

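Direct Preference Optimization (Rafailov et al., 2023) starts with the same RL objective as Eq. 3. However, DPO assumes we have access to the ground-truth reward $r(\mathbf{x},\mathbf{y})$ and derives an analytical transformation between the optimal reward and the optimal policy. This can be substituted back into the reward modelling objective in Eq. 2, which allows us to train the optimal model directly on the feedback data using the following objective:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_w,\mathbf{y}_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(\mathbf{y}_w|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_w|\mathbf{x})} - \beta\log\frac{\pi_\theta(\mathbf{y}_l|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_l|\mathbf{x})}\right)\right]. \tag{4}$$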

Here the parameter $\beta$ is the same as in the RL objective in Eq. 3 and similarly controls the trade-off between expected reward and divergence from the model initialization. The DPO objective is attractive as it allows us to recover the optimal model using a standard classification loss, without the need for on-policy sampling or a significant amount of hyper-parameter tuning. The DPO objective in Eq. 4 resembles the reward modelling objective in Eq. 2 under the parameterization

$$r_\theta(\mathbf{x},\mathbf{y}) = \beta\log\frac{\pi_\theta(\mathbf{y}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})}. \tag{5}$$

We will refer to this as the DPO "implicit reward". Theorem 1 in Rafailov et al. (2023) shows that this is indeed a valid parameterization of a reward model without loss of generality. If we substitute this form of $r_\theta(\mathbf{x},\mathbf{y})$ into the RL objective in Eq. 3, we can obtain the optimal solution in closed form, which happens to be $\pi_\theta$. We will return to the interpretation of DPO as an implicit reward function later on in our analysis of out-of-distribution bootstrapping.
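The implicit reward and the resulting classification loss can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the reference implementation: the function names are hypothetical, and the inputs are assumed to be sequence-level log-probabilities (summed over answer tokens) under the trained policy and the reference model.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_ref_w: float,
             logp_l: float, logp_ref_l: float, beta: float) -> float:
    """Per-pair DPO loss: -log sigmoid(r_theta(y_w) - r_theta(y_l))."""
    logit = (implicit_reward(logp_w, logp_ref_w, beta)
             - implicit_reward(logp_l, logp_ref_l, beta))
    return neg_log_sigmoid(logit)


# Hypothetical sequence log-probabilities for one feedback pair.
print(dpo_loss(logp_w=-42.0, logp_ref_w=-45.0,
               logp_l=-50.0, logp_ref_l=-48.0, beta=0.1))
```

When the policy equals the reference model, both implicit rewards are zero and the per-pair loss is log 2, regardless of $\beta$.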

Building in Explicit Regularization in DPO

Prior works have explicitly considered length regularization in the classical RLHF pipeline (Singhal et al., 2023); however, these methods do not transfer directly to direct alignment algorithms such as DPO. We will derive a length-regularized version of the algorithm from first principles, by adding a regularization term to the RL problem in Eq. 3. The considerations below hold for a general regularizer, but we will focus on a length term $\alpha|\mathbf{y}|$, where $\alpha$ is a hyper-parameter and $|\mathbf{y}|$ denotes the token length of the answer $\mathbf{y}$. We then formulate the regularized RL problem with the following objective:

$$\max_{\pi_\theta}\ \mathbb{E}_{\mathbf{x}\sim\mathcal{D},\,\mathbf{y}\sim\pi_\theta(\mathbf{y}|\mathbf{x})}\bigl[r(\mathbf{x},\mathbf{y}) - \alpha|\mathbf{y}|\bigr] - \beta\,\mathbb{D}_{\text{KL}}\bigl[\pi_\theta(\mathbf{y}|\mathbf{x})\,\|\,\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\bigr], \tag{6}$$

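where we assume that $r(\mathbf{x},\mathbf{y})$ is still the same latent reward driving human preferences. We can follow the same derivations as in Rafailov et al. (2023) for the reward function $r(\mathbf{x},\mathbf{y}) - \alpha|\mathbf{y}|$ and obtain the optimal solution to Eq. 6 as

$$\pi^{*}(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\,\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\,\exp\left(\frac{1}{\beta}\bigl(r(\mathbf{x},\mathbf{y}) - \alpha|\mathbf{y}|\bigr)\right), \tag{7}$$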

where $Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\, e^{\frac{1}{\beta}\left(r(\mathbf{x},\mathbf{y}) - \alpha|\mathbf{y}|\right)}$. With some simple algebra, we can then obtain the equivalent regularized reward re-formulation:

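$$r(\mathbf{x},\mathbf{y}) = \beta\log\frac{\pi^{*}(\mathbf{y}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})} + \beta\log Z(\mathbf{x}) + \alpha|\mathbf{y}|. \tag{8}$$

We can then plug Eq. 8 into the reward modelling objective in Eq. 2, where the $\beta\log Z(\mathbf{x})$ terms cancel between the two answers, which yields the following regularized DPO objective:

$$\mathcal{L}_{\text{R-DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(\mathbf{x},\mathbf{y}_w,\mathbf{y}_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(\mathbf{y}_w|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_w|\mathbf{x})} - \beta\log\frac{\pi_\theta(\mathbf{y}_l|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_l|\mathbf{x})} + \bigl(\alpha|\mathbf{y}_w| - \alpha|\mathbf{y}_l|\bigr)\right)\right]. \tag{9}$$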

This is similar to the standard DPO objective, except for an additional regularization term $\alpha|\mathbf{y}_w| - \alpha|\mathbf{y}_l|$ within the logit of the binary classification loss.
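The closed-form solution can be checked numerically on a toy problem. The sketch below assumes a hypothetical answer space of three candidate answers with made-up lengths, reference probabilities and latent rewards. It builds the optimal regularized policy as the reference distribution reweighted by $\exp\bigl((r - \alpha|\mathbf{y}|)/\beta\bigr)$ and confirms that the partition function cancels in pairwise comparisons, so that the latent reward difference equals the difference of $\beta$-scaled log-ratios plus the length margin $\alpha|\mathbf{y}_w| - \alpha|\mathbf{y}_l|$.

```python
import math

beta, alpha = 0.1, 0.01

# Hypothetical toy answers: (length in tokens, reference probability, latent reward).
answers = [(10, 0.5, 1.0), (40, 0.3, 1.5), (80, 0.2, 1.7)]

# Optimal regularized policy: pi_ref reweighted by exp((r - alpha * |y|) / beta).
weights = [p_ref * math.exp((r - alpha * n) / beta) for n, p_ref, r in answers]
Z = sum(weights)  # partition function for this single prompt
pi_star = [w / Z for w in weights]


def log_ratio(i: int) -> float:
    """log(pi_star(y_i | x) / pi_ref(y_i | x)) for answer index i."""
    return math.log(pi_star[i] / answers[i][1])


w, l = 1, 0  # treat answer 1 as preferred and answer 0 as dispreferred
latent_diff = answers[w][2] - answers[l][2]
length_margin = alpha * (answers[w][0] - answers[l][0])
reconstructed = beta * (log_ratio(w) - log_ratio(l)) + length_margin

print(f"latent reward difference:   {latent_diff:.6f}")
print(f"log-ratio diff + margin:    {reconstructed:.6f}")
```

The two printed values agree up to floating-point error for any choice of pair, which is why the regularized objective never needs to evaluate $Z(\mathbf{x})$.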

Concurrent work by Chen et al. (2024) also considers the length exploitation problem in the classical RLHF pipeline. They suggest a similar regularization in the reward modelling stage (Eq. 2) to disentangle the answer’s quality from the length bias, and show meaningful improvement in length-controlled model performance. Our derivations can be seen as the DPO implicit reward counterpart to that classical RLHF approach, explicitly linking the regularized reward modelling problem to an equivalent regularized RL setup.

Similar to the original DPO formulation, the regularized objective still aims to increase the likelihood of the preferred answer while decreasing the likelihood of the dispreferred answer, modulated by a weighting term. This term is equivalent to the one in the original DPO formulation with the addition of the regularization margin $\alpha|\mathbf{y}_w| - \alpha|\mathbf{y}_l|$. We can interpret this as an additional per-example learning rate, which up-weights the gradient on feedback pairs in which the selected answer is shorter and down-weights the gradient on pairs in which the selected answer is longer, proportional to the difference in length.
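The per-example weighting described above can be made concrete with a small sketch. Assuming sequence-level log-probabilities and token counts as inputs (the function and argument names are hypothetical), the snippet computes the regularized logit, the corresponding loss, and the sigmoid weight that scales the gradient of each feedback pair.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def rdpo_logit(logp_w: float, logp_ref_w: float, logp_l: float, logp_ref_l: float,
               len_w: int, len_l: int, beta: float, alpha: float) -> float:
    """Implicit reward margin plus the length margin alpha * (|y_w| - |y_l|)."""
    reward_margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    length_margin = alpha * (len_w - len_l)
    return reward_margin + length_margin


def rdpo_loss(logit: float) -> float:
    """Numerically stable -log sigmoid(logit)."""
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


def gradient_weight(logit: float) -> float:
    """Magnitude of d(loss)/d(logit); scales the gradient of this pair."""
    return sigmoid(-logit)


# Same (zero) log-ratios in both cases; only the lengths differ.
shorter_preferred = rdpo_logit(-30.0, -30.0, -30.0, -30.0, 20, 60, beta=0.1, alpha=0.01)
longer_preferred = rdpo_logit(-30.0, -30.0, -30.0, -30.0, 60, 20, beta=0.1, alpha=0.01)
print(gradient_weight(shorter_preferred), gradient_weight(longer_preferred))
```

With equal log-ratios, the pair whose preferred answer is shorter receives a weight above 0.5 and the pair whose preferred answer is longer receives a weight below 0.5; setting `alpha` to 0 recovers the standard DPO weight.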

Experiments

In this section we will empirically investigate the verbosity exploitation issues in DPO, the effectiveness of our regularization strategy and the potential causes of these effects. We begin with a description of our evaluation tasks and models.

We utilize three different setups in our experimental setting based on summarization, dialogue and general instruction-following.

Summarization: We use the standard Reddit TL;DR (TL;DR) summarization dataset from Stiennon et al. (2022), which consists of a Reddit post and several short summaries, judged for quality and informativeness by human evaluators.

Dialogue: For our dialogue experiment we use the Anthropic Helpful and Harmless (HH) dataset (Bai et al., 2022b), which consists of general conversations with a language model assistant, with responses that are also ranked by human annotators.

Dataset statistics are included in Table 1; both datasets exhibit a small length bias in the preferred response. Following Rafailov et al. (2023), we use the Pythia 2.8B model (Biderman et al., 2023) for both the dialogue and summarization tasks and carry out full-parameter fine-tuning, using the original DPO codebase (https://github.com/eric-mitchell/direct-preference-optimization) with default hyperparameters, except when noted otherwise. All experiments were carried out on 4 A40 GPUs for a total of about 2000 GPU hours.

Length Exploitation in DPO and Effectiveness of Regularization

We first consider the Anthropic Helpful and Harmless and Reddit TL;DR datasets. For both tasks, we train models with three parameter values $\beta \in [0.5, 0.1, 0.05]$ and then sample 256 answers using prompts from the evaluation dataset. The length histograms are shown in Fig. 2. The first two columns show the answer length distributions for the sets of preferred, rejected and DPO-generated answers, with each row corresponding to a different value of the $\beta$ parameter. We see that the DPO-generated answers are, on average, significantly longer than both the preferred and rejected answers. Models trained with smaller values of $\beta$ generate longer responses on average, which is expected as this parameter controls the deviation from the initial policy. Not only does the DPO model generate longer answers, it also generates answers that are significantly out-of-distribution in terms of length relative to the offline preference dataset.

The third and fourth columns in Fig. 2 show results for the SFT, DPO and length-regularized DPO models, the latter introduced in Section 3. We use $\alpha = 0.01$ and $\alpha = 0.05$ for the Anthropic Helpful and Harmless and Reddit TL;DR datasets respectively. While the length-regularized models still show a mild increase in average length, they match the SFT model much more closely. Moreover, they do not generate answers with significantly out-of-distribution lengths. This indicates that the proposed algorithm can efficiently regularize the verbosity of the trained model.

Length Versus Quality Trade-Offs

In this section we evaluate the trade-off between length and quality. For the Anthropic Helpful and Harmless and Reddit TL;DR datasets, we use the answers generated in the previous section and compare them head-to-head against the preferred answer in the dataset, using GPT4 as an evaluator. Our main results are shown in Fig. 3, which plots model win rates against average answer length, with 90% confidence intervals. We again evaluate three values of the parameter $\beta \in [0.05, 0.1, 0.5]$ and three values of $\alpha$, with $\alpha \in [0, 0.005, 0.01]$ for HH and $\alpha \in [0, 0.2, 0.5]$ for TL;DR ($\alpha = 0$ is the standard DPO algorithm). As before, we see that length-regularized training can efficiently control verbosity, significantly decreasing the average length of the answers compared to standard DPO training. Moreover, regularization leads to a mild improvement in win rates on the HH task and a slight decrease on TL;DR, although neither change is statistically significant. These results are quite promising, as GPT4 is known to have a significant length bias in its preferences (Wang et al., 2023; Singhal et al., 2023). On both HH and TL;DR, the length-regularized experiments with $\beta = 0.05$ and $\beta = 0.01$ match the average lengths of the corresponding $\beta = 0.5$ runs, but achieve statistically significantly higher win rates, with close to 20% improvement on HH and 15% improvement on TL;DR.
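One simple way to attach a 90% confidence interval to a win rate, assuming each head-to-head comparison is an independent Bernoulli trial, is the normal approximation sketched below. This is an illustrative assumption and not necessarily the procedure used for Fig. 3; the function name and the example counts are hypothetical.

```python
import math


def win_rate_ci(wins: int, total: int, z: float = 1.645) -> tuple[float, float, float]:
    """Win rate with a normal-approximation confidence interval.

    z = 1.645 corresponds to a two-sided 90% interval.
    Returns (win_rate, lower_bound, upper_bound), clipped to [0, 1].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical outcome: 150 wins out of 256 judged comparisons.
rate, low, high = win_rate_ci(150, 256)
print(f"win rate = {rate:.3f}, 90% CI = [{low:.3f}, {high:.3f}]")
```

With 256 samples per configuration the half-width is at most about 5 percentage points, which gives a sense of how large a win-rate gap must be before it is distinguishable from noise.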

Is Length a Proxy for KL-Divergence?

In the constrained RL problem in Eq. 3 and the corresponding DPO objective in Eq. 4, the $\beta$ parameter controls the degree of policy divergence from the initial reference model. In Fig. 2 and Fig. 3, we see that the average length of the model-generated answers is inversely related to the $\beta$ parameter. In this section, we investigate the relationship between the length-regularized DPO objective in Eq. 9 and the KL divergence from the initial policy. In Fig. 4, we plot the KL divergence of the trained policy from the initialization $\pi_{\text{ref}}$ for the different values of the $\beta$ and $\alpha$ parameters. We see only a weak correlation between KL divergence and length. For both HH and TL;DR, length-regularized models trained with $\beta = 0.05$ and $\beta = 0.01$ match the average length of training runs with $\beta = 0.5$ (Fig. 3). At the same time, these runs have statistically significantly higher KL divergences and win rates, as shown in Fig. 3. We hypothesize that this indicates the existence of different factors driving human preference, with length being only a partial one.
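A common way to estimate the sequence-level KL divergence from the reference policy is a Monte Carlo average of log-probability differences over answers sampled from the trained policy. The sketch below assumes this estimator (it is not necessarily the one used for Fig. 4) and takes precomputed sequence log-probabilities as inputs; the function name and the example values are hypothetical.

```python
def estimate_kl(logp_policy: list[float], logp_ref: list[float]) -> float:
    """Monte Carlo estimate of KL(pi_theta || pi_ref).

    Both lists hold sequence log-probabilities of the SAME answers, which
    must have been sampled from the trained policy pi_theta.
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - q for p, q in zip(logp_policy, logp_ref)) / len(logp_policy)


# Hypothetical log-probabilities for four sampled answers.
policy_logps = [-35.0, -41.5, -28.0, -52.0]
ref_logps = [-40.0, -47.5, -30.0, -61.0]
print(f"estimated KL: {estimate_kl(policy_logps, ref_logps):.2f} nats")
```

Because the estimate sums token-level log-ratios over each answer, it is natural to ask whether it simply tracks answer length; the weak correlation reported above is evidence against that reading.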

DPO and Early Convergence

In Rafailov et al. (2023), the authors show early convergence of the DPO algorithm on the HH dataset. DPO achieves its best performance within a few hundred gradient steps and does not improve with further training. Similar observations have also been made within the open-source community. We claim that this effect is likely due to length exploitation and the biased GPT4 evaluator. In Fig. 5, we consider the training progression on the HH dataset with $\beta = 0.1$. We compare the regular DPO run ($\alpha = 0$) with the length-regularized one ($\alpha = 0.1$). We train for a single epoch and evaluate intermediate checkpoints on the same set of prompts for average answer length, win rates, and KL divergence. We see that already within the first 10% of the epoch, the standard DPO run produces answers almost twice as long as the SFT model. Unregularized DPO achieves its highest win rate at this point, with only KL divergence and average length increasing steadily with further training. In contrast, the length-regularized run sees little to no intermediate increase in length, but steady improvement in win rates throughout training and slow increases in divergence from the reference policy. We hypothesize that regular DPO training quickly increases length, which exploits the evaluator’s bias, but does not capture the more complex features of preferences. On the other hand, the length-regularized training run is able to disentangle the verbosity component and fit other, more difficult quality features over a longer training period.

What Drives Length Exploitation?

Excessive model verbosity (John Schulman et al., 2022) has been well understood under classical RLHF as a reward exploitation problem (Gao et al., 2023; Casper et al., 2023; Lambert and Calandra, 2023) driven by a bias in the feedback datasets towards longer answers. In particular, in the classical RLHF pipeline as outlined in Section 2.1, the reward model is continuously queried on new data generated by the model, which can create an out-of-distribution robustness issue. These results do not directly transfer to the DPO algorithm, as it does not train a separate reward model and only uses the offline feedback dataset for training. Surprisingly, we find that the exploding length issue in DPO training is similarly driven by out-of-distribution exploitation. We consider the DPO algorithm as an implicit reward training method, as outlined in Section 2.2, and investigate the behaviour of the implicit reward $r_\theta$ as defined in Eq. 5. Since the DPO policy $\pi_\theta$ is the optimal solution to the constrained RL problem in Eq. 3 corresponding to $r_\theta$, any exploitation behaviour from the policy must be driven by the reward function. We evaluate $r_\theta$ trained with $\beta = 0.1$ and different $\alpha$ parameters on the offline feedback dataset (within its training distribution) and on answers generated by the corresponding DPO policy (out of distribution). Within distribution, the corresponding implicit reward models exhibit weak to no length correlation (and even negative length correlation with strong $\alpha$ regularization). However, they all show significant length bias on out-of-distribution samples, with length explaining 0.3-0.46 of the reward variance.
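One simple way to quantify how much of the reward variance is explained by length is the coefficient of determination of a univariate linear fit, which equals the squared Pearson correlation between answer length and reward. The sketch below assumes this measure (the exact statistic behind the numbers above is not specified here); the function name and the toy data are hypothetical.

```python
def length_r_squared(lengths: list[float], rewards: list[float]) -> float:
    """R^2 of a linear fit of reward on length (squared Pearson correlation)."""
    n = len(lengths)
    if n != len(rewards) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_len = sum(lengths) / n
    mean_rew = sum(rewards) / n
    cov = sum((l - mean_len) * (r - mean_rew) for l, r in zip(lengths, rewards))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_rew = sum((r - mean_rew) ** 2 for r in rewards)
    if var_len == 0.0 or var_rew == 0.0:
        raise ValueError("lengths and rewards must both vary")
    return (cov * cov) / (var_len * var_rew)


# Hypothetical implicit rewards for answers of different token lengths.
toy_lengths = [20.0, 45.0, 60.0, 90.0, 120.0]
toy_rewards = [0.10, 0.05, 0.30, 0.25, 0.45]
print(f"share of reward variance explained by length: "
      f"{length_r_squared(toy_lengths, toy_rewards):.2f}")
```

Running the same function once on dataset answers and once on policy samples mirrors the within-distribution versus out-of-distribution comparison described above.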

Related Work

Reward Exploitation in RLHF: RLHF reward exploitation, also known as reward over-optimization, is a well-known issue (Skalse et al., 2022; Pan et al., 2022; Casper et al., 2023; Lambert and Calandra, 2023) in which, during the reinforcement learning stage, the expected reward keeps improving but the quality of the model begins to degrade after some point. These effects were confirmed analytically in controlled experiments (Gao et al., 2023), as well as empirically in user studies (Dubois et al., 2024). Increased model verbosity has been explicitly linked to this phenomenon (John Schulman et al., 2022). A number of approaches have been proposed to mitigate this issue, such as penalizing epistemic uncertainty (Coste et al., 2023; Zhai et al., 2023) or using mixture reward models (Moskovitz et al., 2023), but they do not explicitly target the length issue.

Mitigating Length Biases in RLHF: A number of works have sought to explicitly address length biases in RLHF policies. Ramamurthy et al. (2023) suggest setting a simple discount factor, which improves the naturalness of the generated language. Singhal et al. (2023) carry out an extensive study of length correlations in classical RLHF and suggest a number of mitigating approaches. The closest to our approach are the works of Shen et al. (2023) and the concurrent work of Chen et al. (2024), which propose to disentangle length biases from quality during the reward modelling stage. Our work can be seen as a DPO counterpart to these approaches.

As far as we are aware, this is the first work to study the length exploitation problem for direct alignment algorithms, such as DPO.

Limitations

Our work addresses the particular issue of length exploitation in Direct Preference Optimization. Our regularization objective requires an explicit penalty function (such as length) and may not be suitable for avoiding general exploitation issues along axes other than verbosity. Furthermore, we only study the DPO objective, which might behave differently from other direct alignment algorithms that use different objective functions.

Conclusion

In this work, we study the problem of length exploitation in the Direct Preference Optimization algorithm for the first time. On two standard human feedback datasets, we empirically show that DPO exhibits significant length hacking across a range of hyperparameters. We then link this phenomenon to out-of-distribution bootstrapping. We derive an analytical length-regularized version of the DPO algorithm and show empirically that we can maintain model performance, as evaluated by GPT4, without significant increases in verbosity, boosting length-corrected win rates by up to 15-20%. Given the strong length bias in public feedback datasets and the prominence of DPO in the open-source community, we hypothesize that many open-source models suffer from similar length-exploitation issues, driving the observations of Fig. 1. Our results are encouraging, suggesting that open-source models could match proprietary ones on automated evaluations on a length-corrected basis as well.

Acknowledgements

Chelsea Finn is a CIFAR Fellow in the Learning in Machines and Brains program. This work was supported by ONR grant N00014-22-1-2621 and the Volkswagen Group.

References