Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

Introduction

Aligning large language models (LLMs) with human preferences is important for their fluency and applicability to many tasks, and the natural language processing literature uses many techniques to incorporate human feedback [Christiano et al., 2017, Stiennon et al., 2020, Ouyang et al., 2022]. Typically in LLM alignment, we first collect large amounts of preference data, consisting of a context and two potential completions; one of these is labelled as the preferred completion, and the other as the dispreferred. We use this data to learn a general policy for generating completions in a given context. Direct Preference Optimisation (DPO) [Rafailov et al., 2023] is a popular method for learning from human preferences, and it has been shown to be effective at improving the performance of pretrained LLMs on downstream tasks such as reasoning, summarisation, and alignment [Wang et al., 2023, Tunstall et al., 2023]. The theoretical motivation for DPO is based on a preference-ranking model with an implicit reward function that models the relative probability of picking the preferred completion over the dispreferred.

In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model’s likelihood of the preferred completions (as long as the relative probability between the preferred and dispreferred classes increases), and we empirically show that this phenomenon occurs when fine-tuning current LLMs on common datasets. Our theoretical explanation for the phenomenon suggests that the problem occurs most frequently in preference datasets with small edit distances between each pair of completions, such as in math-based preference datasets.

Using these insights, we design a new loss function: DPO-Positive (DPOP), which adds a new term to the loss function that penalises reducing the probability of the positive completions. We also create new preference datasets based on ARC [Clark et al., 2018], HellaSwag [Zellers et al., 2019], and MetaMath [Yu et al., 2023] and use them along with DPOP to create new models.

We introduce the Smaug class of models which use DPOP and achieve state-of-the-art open-source performance. We fine-tune 72B, 34B, and 7B models on our new datasets and show that DPOP far outperforms DPO. We evaluate our resulting models on multiple benchmarks including the HuggingFace Open LLM Leaderboard [Beeching et al., 2023, Gao et al., 2021], which aggregates six popular benchmarks such as MMLU [Hendrycks et al., 2021] and GSM8K [Cobbe et al., 2021], and MT-Bench [Zheng et al., 2023], a challenging benchmark that uses a strong LLM to score candidate model responses across eight different categories of performance. On the HuggingFace Open LLM Leaderboard, Smaug-72B achieves an average accuracy of 80.48%, becoming the first open-source LLM to surpass an average accuracy of 80% and improving by nearly 2% over the second-best open-source model, and our Smaug-34B model is the best in its class of models of similar parameter count. We release our code and pretrained models at https://github.com/abacusai/smaug.

We theoretically and empirically show a surprising failure mode of DPO: running DPO on preference datasets with small edit distances between completions can result in a catastrophic decrease in accuracy.

We introduce DPO-Positive (DPOP) which we theoretically and empirically show ameliorates the performance degradation. In particular, DPOP often outperforms DPO, even on preference datasets with high edit distances between completions.

We create new preference-based versions of ARC, HellaSwag, and MetaMath.

Using DPOP and our new datasets, we create and release the Smaug class of models, with Smaug-72B becoming the first open-source model to achieve an average accuracy of 80% on the HuggingFace Open LLM Leaderboard. We open-source our trained models, datasets, and code.

Background and Related Work

Large language models (LLMs) have shown impressive zero-shot and few-shot performance [Radford et al., 2019, Brown et al., 2020, Bubeck et al., 2023]. Recently, researchers have fine-tuned pretrained LLMs on downstream tasks by using human-written completions [Chung et al., 2022, Mishra et al., 2021] or by using datasets labelled with human-preferred completions relative to other completions [Ouyang et al., 2022, Bai et al., 2022, Ziegler et al., 2020]. These techniques have been used to improve performance on a variety of downstream tasks such as translation [Kreutzer et al., 2018] and summarisation [Stiennon et al., 2020], as well as to create general-purpose models such as Zephyr [Tunstall et al., 2023]. Two of the most popular techniques for learning from human preference data are reinforcement learning from human feedback (RLHF) [Ouyang et al., 2022, Bai et al., 2022, Ziegler et al., 2020] and direct preference optimisation (DPO) [Rafailov et al., 2023]. We summarise these approaches below.

Consider a dataset of pairwise-preference ranked data $\mathcal{D}=\{x^{(i)},y_{w}^{(i)},y_{l}^{(i)}\}_{i=1}^{N}$ where $x^{(i)}$ are prompts and $y_{w}^{(i)}$ and $y_{l}^{(i)}$ are respectively the preferred and dispreferred completions conditioned on that prompt. We have an initial LLM $\pi_{\text{ref}}$ that parameterises a distribution $\pi_{\text{ref}}(y|x)$. Often, we initialise $\pi_{\text{ref}}$ as an LLM that has undergone supervised fine-tuning (SFT) to improve performance on downstream task(s).

RLHF begins by modelling the probability of preferring $y_{w}$ to $y_{l}$ using the Bradley-Terry model [Bradley and Terry, 1952], which posits the following probabilistic form:

$$p(y_{w}\succ y_{l}\mid x)=\sigma\big(r(x,y_{w})-r(x,y_{l})\big),$$

where $\sigma$ is the logistic function and $r(x,y)$ corresponds to some latent reward function that is assumed to exist for the completion $y$ given the prompt $x$. Given $\mathcal{D}$, we can learn a parameterised estimate of $r$ by minimising the negative log-likelihood of the dataset:

$$\mathcal{L}_{R}(r_{\phi},\mathcal{D})=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\Big[\log\sigma\big(r_{\phi}(x,y_{w})-r_{\phi}(x,y_{l})\big)\Big].$$

For RLHF, we use reinforcement learning to optimise based on this learned reward function $r_{\phi}$ (with a regularising KL-constraint to prevent model collapse), and obtain a new LLM distribution $\pi_{\theta}$.
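The Bradley-Terry preference probability and the reward negative log-likelihood described above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the reward values are hypothetical, and a real reward model $r_{\phi}$ would be a neural network rather than a table of numbers.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that y_w is preferred to y_l."""
    return sigmoid(r_w - r_l)

def reward_nll(reward_pairs) -> float:
    """Mean negative log-likelihood over (r(x, y_w), r(x, y_l)) pairs."""
    total = sum(math.log(preference_probability(r_w, r_l)) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)

# Hypothetical reward values, for illustration only.
pairs = [(2.0, 0.5), (1.0, 1.0), (0.3, -1.2)]
nll = reward_nll(pairs)
```

Only the difference between the two rewards enters the probability, so adding a constant to both rewards leaves the likelihood unchanged.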

DPO

Rafailov et al. showed that it is possible to optimise the same KL-constrained reward function as in RLHF without having to learn an explicit reward function. Instead, the problem is cast as a maximum likelihood optimisation of the distribution $\pi_{\theta}$ directly, with the objective:

$$\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}})=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-\beta\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}\right)\right],\tag{1}$$

where $\beta$ is a regularisation term corresponding to the strength of KL-regularisation in RLHF. In this case, the implicit reward parameterisation is $r(x,y)=\beta\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}$, and Rafailov et al. further showed that all reward classes under the Plackett-Luce model [Plackett, 1975, Luce, 2005] (such as Bradley-Terry) are representable under this parameterisation. As an abbreviation, we define $\pi_{\text{ratio}}(y|x)=\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}$.
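As a minimal sketch of the DPO objective for a single preference pair, the loss can be computed from four summed sequence log-probabilities. The function and argument names are illustrative assumptions, not taken from the released code; in practice these log-probabilities come from forward passes of the policy and the frozen reference model.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    logp_* are log pi_theta(y|x); ref_logp_* are log pi_ref(y|x).
    """
    log_ratio_w = logp_w - ref_logp_w  # log pi_ratio(y_w|x)
    log_ratio_l = logp_l - ref_logp_l  # log pi_ratio(y_l|x)
    return -log_sigmoid(beta * (log_ratio_w - log_ratio_l))

# When the policy equals the reference model, both log-ratios are zero.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

At initialisation the loss equals $\log 2$ regardless of the pair, since both log-ratios are zero.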

Since the release of DPO, various alternatives have been proposed. We discuss the most relevant to our work below and in Appendix A.

IPO

Azar et al. aim to understand the theoretical underpinnings of RLHF and DPO. They identify that DPO may be prone to overfitting in situations where the preference probability of the preferred over the dispreferred examples is close to 1. They propose an alternative form of pairwise preference loss, ‘Identity-PO (IPO)’. IPO tries to prevent overfitting to the preference dataset by penalising preference margins that exceed a regularised target value. Conversely, we identify that DPO can lead to underfitting as well, and even to complete performance degradation.

Failure Mode of DPO

In this section, we take a step back and examine the DPO loss in Equation 1, specifically with an eye towards how it can reduce the probability of the preferred completion. The loss is a function only of the difference in the log-ratios, which means that we can achieve a low loss value even if $\pi_{\text{ratio}}(y_{w}|x)$ is lowered below 1, as long as $\pi_{\text{ratio}}(y_{l}|x)$ is also lowered sufficiently. This implies that the log-likelihood of the preferred completions can be reduced below the original log-likelihood from the reference model.
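A small numeric sketch makes this concrete. The log-ratio values below are hypothetical and chosen only so that both trained models have the same margin; the function takes the two log-ratios directly.

```python
import math

def dpo_loss_from_log_ratios(log_ratio_w: float, log_ratio_l: float,
                             beta: float = 0.3) -> float:
    """DPO loss as a function of log pi_ratio(y_w|x) and log pi_ratio(y_l|x)."""
    margin = beta * (log_ratio_w - log_ratio_l)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical log-ratios after training (0.0 means unchanged from pi_ref).
start = dpo_loss_from_log_ratios(0.0, 0.0)        # policy == reference
healthy = dpo_loss_from_log_ratios(2.0, -8.0)     # preferred became MORE likely
degraded = dpo_loss_from_log_ratios(-5.0, -15.0)  # preferred became LESS likely
```

Both trained cases have a log-ratio difference of 10 and therefore the same loss, even though in the second case the preferred completion is far less likely than under the reference model. The loss alone cannot tell the two apart.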

Why is this an issue? The original use-case of RLHF did not explicitly denote the preferred completions as being also ideal completions (rather than just the preferred completion out of the two choices $y_{w}$ and $y_{l}$), and hence the DPO objective is a good modelling choice. However, since then, a large body of work has focused on distilling the knowledge of powerful models into smaller or weaker models, while also showing that doing so with RLHF/DPO outperforms SFT [Taori et al., 2023, Tunstall et al., 2023, Xu et al., 2023, Chiang et al., 2023]. In this paradigm, it is often the case that in each pair of completions, the better of the two is indeed also an ideal completion. Furthermore, a new technique is to transform a standard labelled dataset into a pairwise preference dataset [Ivison et al., 2023, Tunstall et al., 2023], which also has the property that for each pair of completions, one is an ideal completion.

While the above illustrates a hypothetical situation, we now provide a specific case in which DPO may cause a decrease in the probability of the better completion. Consider the case of trying to improve a model’s math or reasoning abilities by comparing a completion of “2+2=4” to “2+2=5.” This process creates a pair of preferred and dispreferred completions which have an edit (Hamming) distance of 1, i.e., all tokens in the completion are the same except for one. In the following, we will explore how the location of the differing token impacts the computation of the DPO loss. For the sake of argument, we will examine what happens when the differing token is the first token, though the argument also follows if it appears elsewhere.

For preliminaries, consider two completions with an edit distance of 1 which differ at token $m$ with $1\leq m\leq K$, i.e., consider $y_{w}=(t_{1},\dots,t_{K})$ and $y_{l}=(t_{1},\dots,t_{m-1},t_{m}^{\prime},t_{m+1},\dots,t_{K})$. Denote $y^{<r}=(t_{1},\dots,t_{r-1})$ and $y^{\geq r}=(t_{r},\dots,t_{K})$. Assume that the vocabulary length of the LLM is $L$. Let $s_{i}^{\{x\}}$ represent the probability of the $i$-th token in the model’s vocabulary given the input $x$. While the LLM model parameters $\theta$ are numerous, we restrict our attention to the logits, $\theta_{j}$ with $j\in[L]$.

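The gradient of Equation 1 with respect to $\theta$ is proportional to the following:

$$\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}})\propto-\left[\sum_{k=1}^{K}\nabla_{\theta}\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)-\sum_{k=1}^{K}\nabla_{\theta}\log\pi_{\theta}(t_{k}|y_{l}^{<k},x)\right],$$

where $t_{m}$ is replaced by $t_{m}^{\prime}$ in the second sum.

We note first that for $m>1$, all tokens from 1 to $m-1$ have no effect on the gradient, simply because for all $1\leq i<m$, $\pi_{\theta}(t_{i}|y_{w}^{<i},x)=\pi_{\theta}(t_{i}|y_{l}^{<i},x)$, causing these tokens’ contribution to the gradient to cancel out. Therefore, without loss of generality, assume $m=1$, i.e., $y_{w}$ and $y_{l}$ differ only at the first token. Without loss of generality, we also assume that $t_{k}$ takes vocabulary position 1. Then we have the following for each $k>1$ (derivation in Section B.1):

$$\nabla_{\theta_{j}}\left[\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)-\log\pi_{\theta}(t_{k}|y_{l}^{<k},x)\right]=s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}.\tag{2}$$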

As we typically run DPO after SFT, the model is likely to be reasonably well optimised, so we should have $s_{j}^{\{y_{w}^{<k},x\}}\leq s_{j}^{\{y_{l}^{<k},x\}}$ for $j\neq 1$ and $s_{1}^{\{y_{w}^{<k},x\}}\geq s_{1}^{\{y_{l}^{<k},x\}}$. Therefore, while this analysis only extends to gradients with respect to the logits, we see that the gradient vector is decreasing in the correct logit dimension and increasing in the wrong logit dimensions. Surprisingly, this suggests that under DPO, all tokens that follow a mismatched token should have reduced probability of emitting the correct token when compared to $\pi_{\text{ref}}$. We will later give empirical evidence for this in Section 5 and Figure 3.
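The sign pattern can be checked on a toy example. The sketch below evaluates the per-logit quantity $s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}$ for a hypothetical 4-token vocabulary. The logit values are invented for illustration: the model is assumed to be confident in the correct token (index 0) after the preferred prefix and less confident after the dispreferred prefix, which is the assumption made in the text.

```python
import math

def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token distributions; index 0 is the correct token t_k.
s_w = softmax([4.0, 1.0, 0.5, 0.0])  # context: preferred prefix y_w^{<k}
s_l = softmax([2.0, 1.5, 1.0, 0.5])  # context: dispreferred prefix y_l^{<k}

# Per-logit gradient of the token-level log-probability difference.
grad = [p_l - p_w for p_w, p_l in zip(s_w, s_l)]

correct_dim_decreases = grad[0] < 0
wrong_dims_increase = all(g > 0 for g in grad[1:])
```

Under these assumed distributions the entry for the correct token is negative and every other entry is positive, and the entries sum to zero because both distributions are normalised.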

DPOP

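Now, we introduce DPO-Positive (DPOP), which fixes the failure mode described in the previous section. While there is no incentive for DPO to maintain the high log-likelihood of the preferred completions, DPOP adds the penalty term $\max\left(0,\log\frac{\pi_{\text{ref}}(y_{w}|x)}{\pi_{\theta}(y_{w}|x)}\right)$ to the loss. It is 0 when $\pi_{\text{ratio}}(y_{w}|x)\geq 1$ and increases as the ratio goes below 1. Thus, the DPOP loss function is:

$$\mathcal{L}_{\text{DPOP}}(\pi_{\theta};\pi_{\text{ref}})=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\left(\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}-\lambda\cdot\max\left(0,\log\frac{\pi_{\text{ref}}(y_{w}|x)}{\pi_{\theta}(y_{w}|x)}\right)\right)\right)\right],\tag{3}$$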

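where $\lambda>0$ is a hyperparameter that can be tuned. This form of loss retains the property that we are fitting parameters on the preference data under the Bradley-Terry model. The implicit reward parameterisation is

$$r(x,y)=\begin{cases}\beta\left[\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}-\lambda\cdot\max\left(0,\log\frac{\pi_{\text{ref}}(y|x)}{\pi_{\theta}(y|x)}\right)\right]&\text{if }y\text{ is the preferred completion,}\\\beta\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}&\text{if }y\text{ is the dispreferred completion.}\end{cases}$$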

By applying this optimisation pressure, the model can no longer minimise the loss by significantly reducing the log-likelihood of the dispreferred examples more than it reduces the log-likelihood of the preferred examples; it must also ensure that the log-likelihood of the preferred examples remains high relative to the log-likelihood under the reference model.
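The effect of the penalty can be sketched for a single pair. The code below is an illustrative reading of the DPOP loss as described in this section, with the penalty $\max(0,\log\pi_{\text{ref}}(y_{w}|x)-\log\pi_{\theta}(y_{w}|x))$ scaled by $\lambda$ inside the log-sigmoid. Function and argument names are assumptions, the default $\beta=0.3$ and $\lambda=50$ mirror the Smaug training settings reported later, and the log-probabilities are invented.

```python
import math

def dpop_loss(logp_w: float, logp_l: float,
              ref_logp_w: float, ref_logp_l: float,
              beta: float = 0.3, lam: float = 50.0) -> float:
    """Per-example DPOP loss from sequence log-probabilities (sketch)."""
    log_ratio_w = logp_w - ref_logp_w
    log_ratio_l = logp_l - ref_logp_l
    # Zero when the preferred completion is at least as likely as under pi_ref.
    penalty = max(0.0, ref_logp_w - logp_w)
    z = beta * (log_ratio_w - log_ratio_l - lam * penalty)
    # -log(sigmoid(z)), computed stably for either sign of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Two hypothetical outcomes with the same log-ratio difference of 10.
healthy = dpop_loss(logp_w=-8.0, logp_l=-28.0, ref_logp_w=-10.0, ref_logp_l=-20.0)
degraded = dpop_loss(logp_w=-15.0, logp_l=-35.0, ref_logp_w=-10.0, ref_logp_l=-20.0)
```

Plain DPO assigns these two outcomes the same loss. With the penalty, the outcome in which the preferred completion lost probability mass is heavily penalised, while the other is unaffected because its penalty term is zero.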

Now, we show that Equation 3 mitigates our examples of failure modes from the previous section. Recall from Section 3 that we focused on two completions, $y_{w}$ and $y_{l}$, which differ by one token at location $m=1$. We showed in Equation 2 that for standard DPO, the gradient of the $k$-th token in the completions with respect to the $j$-th logit is $s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}$. However, for DPOP, if $\pi_{\text{ratio}}(y_{w}|x)<1$, the gradients become

$$\nabla_{\theta_{j}}\left[\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)-\log\pi_{\theta}(t_{k}|y_{l}^{<k},x)+\lambda\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)\right]=\begin{cases}\lambda\left(1-s_{j}^{\{y_{w}^{<k},x\}}\right)+s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}&i=j\\-(\lambda+1)s_{j}^{\{y_{w}^{<k},x\}}+s_{j}^{\{y_{l}^{<k},x\}}&i\neq j\end{cases}$$

where $i$ is the vocabulary index of token $t_{k}$. Since $s_{j}^{\{y_{w}^{<k}\}}\leq 1$, for the case $i=j$, the gradient is guaranteed to be positive for a large enough choice of $\lambda$. Similarly, for the case $i\neq j$, the gradient is guaranteed to be negative for a large enough $\lambda$ (as long as $s_{j}^{\{y_{w}^{<k}\}}>0$). This therefore fixes the issue from Section 3. The derivation of the above is given in Section B.2.
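To illustrate how $\lambda$ flips the sign of the logit gradient, the sketch below evaluates $(1+\lambda)\left(1\{i=j\}-s_{j}^{\{y_{w}^{<k},x\}}\right)-\left(1\{i=j\}-s_{j}^{\{y_{l}^{<k},x\}}\right)$, the general form that yields the two cases in the appendix derivation, on hypothetical next-token distributions. The logits, the 4-token vocabulary, and the helper names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dpop_logit_gradient(s_w, s_l, i, lam):
    """Per-logit gradient when pi_ratio(y_w|x) < 1; lam = 0 recovers DPO."""
    grads = []
    for j in range(len(s_w)):
        indicator = 1.0 if j == i else 0.0
        grads.append((1.0 + lam) * (indicator - s_w[j]) - (indicator - s_l[j]))
    return grads

# Hypothetical distributions; index 0 is the correct token t_k.
s_w = softmax([4.0, 1.0, 0.5, 0.0])
s_l = softmax([2.0, 1.5, 1.0, 0.5])

dpo_grad = dpop_logit_gradient(s_w, s_l, i=0, lam=0.0)
dpop_grad = dpop_logit_gradient(s_w, s_l, i=0, lam=50.0)
```

With $\lambda=0$ the correct-token entry is negative, as in the DPO analysis. With $\lambda=50$ it becomes positive and all incorrect-token entries become negative for these particular values.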

While the main motivation for DPOP is to avoid the failure mode described in Section 3, we also note its connection to contrastive loss. Contrastive learning is a popular technique in areas such as computer vision for datasets of similar and dissimilar pairs [Oord et al., 2018, Chen et al., 2020, He et al., 2020], and the loss function often uses a margin factor. Equation 3 can be viewed as similar to contrastive loss with margin $m=\log\frac{1}{\pi_{\text{ref}}(y_{w}|x)}$. We give further details in Appendix C.

DPOP Datasets & Experiments

In this section, we empirically validate that the failure mode does arise in practice and that DPOP is able to mitigate the failure. We also show that even when the edit distance is large and DPO does not show degradation in performance, DPOP can still outperform on downstream task evaluation.

For our empirical analysis, we focus on the downstream tasks of GSM8K, ARC, and HellaSwag, and we introduce and release associated paired preference-ranked datasets.

GSM8K [Cobbe et al., 2021], a dataset of diverse grade school math word problems, has been adopted as a measure of the math and reasoning skills of LLMs [Chowdhery et al., 2023, Touvron et al., 2023b, a, Beeching et al., 2023, Gao et al., 2021]. We create a paired preference-ranked version of MetaMath [Yu et al., 2023], an extended version of the GSM8K training data [An et al., 2023, Yu et al., 2023]. The correct completions in the MetaMath dataset consist of a series of steps which lead to the final answer. To create a dispreferred version, we randomly corrupt one of the results of an intermediate calculation. This dataset has a low (normalised) edit distance of 6.5%.
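A rough sketch of this kind of corruption, together with one way to measure a normalised edit distance, is given below. Everything here is an assumption made for illustration: the regular expression, the size of the perturbation, the whitespace tokenisation, and the normalisation by the longer sequence length are not taken from the released dataset code, and the example solution is invented.

```python
import random
import re

def corrupt_one_result(solution: str, rng: random.Random) -> str:
    """Replace the number after one randomly chosen '=' with a wrong value."""
    matches = list(re.finditer(r"=\s*(\d+)", solution))
    if not matches:
        return solution
    match = rng.choice(matches)
    original = int(match.group(1))
    wrong = original + rng.choice([-3, -2, -1, 1, 2, 3])
    if wrong < 0:
        wrong = original + 1  # keep the corrupted value non-negative
    start, end = match.span(1)
    return solution[:start] + str(wrong) + solution[end:]

def levenshtein(a, b) -> int:
    """Edit distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def normalised_edit_distance(a: str, b: str) -> float:
    """Token-level edit distance divided by the longer sequence length."""
    ta, tb = a.split(), b.split()
    return levenshtein(ta, tb) / max(len(ta), len(tb), 1)

preferred = ("Tom has 3 boxes of 4 pens, so 3 * 4 = 12 pens. "
             "He gives away 5, leaving 12 - 5 = 7 pens.")
dispreferred = corrupt_one_result(preferred, random.Random(0))
distance = normalised_edit_distance(preferred, dispreferred)
```

Because only one number changes, the pair differs in a single whitespace token, which is the low-edit-distance regime analysed in Section 3. This sketch does not propagate the error to later steps.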

ARC [Clark et al., 2018] is a dataset that tests the level of understanding of science at grade-school level. We focus specifically on ARC-Challenge, the more difficult of the two subsections of ARC, which has been widely adopted as a measure of LLM reasoning and world understanding [Chowdhery et al., 2023, Touvron et al., 2023b, a, Beeching et al., 2023, Cobbe et al., 2021]. The ARC-Challenge dataset consists of four choices of responses to each question, one of which is correct. To create a paired preference-ranked dataset, for each correct response in the training split, we create three pairs using each incorrect response. Due to the differences in the responses, this dataset has a high normalised edit distance of 90%.

HellaSwag [Zellers et al., 2019] is a dataset containing commonsense inference questions known to be hard for LLMs. Similar to ARC, each question has one correct completion and three incorrect completions, and so we create a paired preference-ranked dataset by creating three pairs for each correct response in the training split. See Appendix D for further details and documentation about our newly released datasets.
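The pairing scheme used for ARC and HellaSwag can be sketched as follows. The record layout (`prompt`, `chosen`, `rejected`) and the function name are hypothetical choices for illustration, and the example question is invented rather than drawn from either dataset.

```python
def make_preference_pairs(question: str, choices: list, answer_idx: int) -> list:
    """Pair the correct choice with every incorrect choice for one question."""
    chosen = choices[answer_idx]
    return [
        {"prompt": question, "chosen": chosen, "rejected": choice}
        for idx, choice in enumerate(choices)
        if idx != answer_idx
    ]

# Invented multiple-choice item with four options, one of which is correct.
pairs = make_preference_pairs(
    question="Which gas do plants absorb from the air for photosynthesis?",
    choices=["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
    answer_idx=1,
)
```

A four-choice question yields three pairs, each sharing the same prompt and preferred completion.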

Experiments

In this section, we compare training DPO, IPO, and DPOP on the datasets mentioned above and evaluate them on the corresponding tasks. We apply each preference-training method to the base model of Mistral7B [Jiang et al., 2023]. We evaluate on the test sets of GSM8K and ARC by using the LLM Evaluation Harness [Gao et al., 2021].

First, we compare DPO, IPO, and DPOP when training on both MetaMath and ARC; see Figure 2. We find that when training on MetaMath, DPO catastrophically fails, while IPO does not improve performance. DPOP is the only method to improve performance over the base model. When training on ARC, which has a higher edit distance as described in the previous section, both DPO and DPOP are able to improve on the base model significantly; however, DPOP performs better.

Ablation study over $\beta$

One potential hypothesis for how degradation of DPO on MetaMath could be prevented is by modifying the strength of the regularisation parameter, $\beta$. We test $\beta\in\{0.1,0.3,1.0\}$, and although a larger $\beta$ does induce a slower decrease, the performance with DPO still plummets, while DPOP shows strong and consistent performance with different values of $\beta$ (see Appendix Figure 4).

Token-level analysis

Recall that in Section 3, we gave theoretical motivations for why DPO is likely to perform poorly on low-edit distance datasets. We now analyse the log-probabilities of the trained models at the token level on the MetaMath dataset over 1000 samples to empirically support our arguments. Let us denote the index of the first token that is different between the preferred and dispreferred completion by $m$.

We suggested that $\pi_{\theta}(y^{\geq r}\mid x,y^{<r})$ for $r>m$ will have ‘wrong-way’ gradient updates and therefore decrease. We find this is indeed the case: the average log-prob after training of tokens after $m$ is $-0.37$ for the reference model and $-0.26$ for DPOP, but $-1.82$ for DPO on the preferred completions (see Figure 3 (left)). Perhaps most instructively, for both the reference model and DPOP, in Figure 3 (right), we see that tokens after the edit indices show higher log-likelihood than those before the edit indices. This is indicative of well-behaved language modelling, with lower perplexity as more tokens are added to the context. By contrast, DPO shows the opposite pattern, with log-likelihood actually reducing after the edit indices. This is indicative of a deeper breakdown in language modelling, which we believe is facilitated by the wrong-way gradient we outlined in Section 3. Finally, we are also able to substantiate our assumption from Section 3 that $s_{1}\geq s_{1}^{\prime}$; we find from our analysis that for the baseline model, the tokens after the edit have an average log-likelihood of $-0.37$ on the preferred completion, but this drops to $-0.86$ on the dispreferred completion.
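The kind of token-level measurement described above can be sketched as follows: find the first index at which the two completions differ, then average the per-token log-probabilities before and after it. The helper names, the 0-based indexing, and the toy tokens and log-probabilities are assumptions for illustration; in the actual analysis the log-probabilities come from the trained models.

```python
def first_edit_index(tokens_w, tokens_l) -> int:
    """0-based index m of the first token that differs between completions."""
    for idx, (a, b) in enumerate(zip(tokens_w, tokens_l)):
        if a != b:
            return idx
    return min(len(tokens_w), len(tokens_l))

def mean(values) -> float:
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")

def mean_logprob_before_after(token_logprobs, m: int):
    """Average log-prob of tokens strictly before and strictly after index m."""
    return mean(token_logprobs[:m]), mean(token_logprobs[m + 1:])

# Toy completions differing at a single token, with invented log-probs
# for the preferred completion under some model.
tokens_w = ["3", "*", "4", "=", "12", "pens"]
tokens_l = ["3", "*", "4", "=", "13", "pens"]
logprobs_w = [-0.9, -0.4, -0.6, -0.2, -0.3, -0.1]

m = first_edit_index(tokens_w, tokens_l)
before, after = mean_logprob_before_after(logprobs_w, m)
```

Averaging these two quantities over many samples, separately for each trained model, gives the comparison reported in this section.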

In Appendix E, we present additional results on ARC comparing the averaged log-probs of DPO and DPOP on the preferred completion during training; see Figure 6. DPOP once again demonstrates higher log-probs than DPO.

Smaug

In this section, we introduce the Smaug series of models. We train models for 7B, 34B and 72B parameter sizes using DPOP. We use the 7B class for a direct comparison of DPOP vs. DPO, including on existing and widely-used paired preference-ranked datasets. Due to the computational resource requirements involved in training the larger model sizes, we only perform DPOP on 34B and 72B and compare to other models on the HuggingFace Open LLM Leaderboard. For the same reason, we also do not perform any hyperparameter tuning; it is possible that even better performance can be achieved, e.g., with a different value of $\lambda$.

Smaug-34B is a modified version of the base model Bagel-34B-v0.2 [Durbin, 2024a], which itself is an SFT version of Yi-34B-200k [01.AI, 2024]. We first take Bagel-34B-v0.2 and perform an SFT fine-tune using a combination of three datasets: MetaMath [Yu et al., 2023], ORCA-Chat [Es, 2024], and the ShareGPT dataset [Z., 2024]. Next, we perform DPOP with five datasets: our pairwise MetaMath DPO, ARC DPO, and HellaSwag DPO datasets described in Section 5, the ORCA DPO dataset [Intel, 2024], and the UltraFeedback Binarized dataset [AllenAI, 2024]. Finally, we perform a standard DPO with the Truthy DPO dataset [Durbin, 2024b]. We run these experiments with 8 H100 GPUs. We set $\beta=0.3$, $\lambda=50$, a learning rate of $5\times 10^{-5}$, and the AdamW optimizer [Loshchilov and Hutter, 2017], and we run 1000 steps for all DPOP routines. The total training time for all steps was 108 hours.

For 72B, we start from MoMo-72b-lora-1.8.7-DPO [Moreh, 2024], which itself is a fine-tune of Qwen-72B [Bai et al., 2023]. MoMo-72b-lora-1.8.7-DPO has already undergone SFT, so we simply run the five DPOP routines as in Smaug-34B. The total training time is 144 hours.

We evaluate using the HuggingFace Open LLM Leaderboard [Beeching et al., 2023, Gao et al., 2021], a widely-used benchmark suite that aggregates six popular benchmarks: ARC [Clark et al., 2018], GSM8K [Cobbe et al., 2021], HellaSwag [Zellers et al., 2019], MMLU [Hendrycks et al., 2021], TruthfulQA [Lin et al., 2022], and WinoGrande [Sakaguchi et al., 2020]. We evaluate directly in HuggingFace, which uses the LLM Evaluation Harness [Gao et al., 2021]. It performs few-shot prompting on the test sets of these tasks and checks if the model emits the correct answer. We compare Smaug-72B to the evaluation scores of the top five open-weight LLMs according to the HuggingFace Open LLM Leaderboard [Beeching et al., 2023, Gao et al., 2021] as of February 1, 2024; see Table 1. Smaug-72B achieves an average accuracy of 80.48%, becoming the first open-source LLM to surpass an average accuracy of 80% and improving by nearly 2% over the second-best open-source model. Smaug-34B also achieves best-in-its-class performance compared to other models of similar or smaller size (see Appendix E).

MT-Bench

Next, we evaluate using MT-Bench [Zheng et al., 2023], a challenging benchmark that uses GPT-4 [OpenAI, 2023] to score candidate model responses across eight different categories of performance. As shown in other works [Zheng et al., 2023, Rafailov et al., 2023], strong LLMs such as GPT-4 show good agreement with human preferences. We run MT-Bench with the Llama-2 conversation template [Touvron et al., 2023b]. See Table 2 for a comparison with state-of-the-art LLMs according to Arena Elo as of February 1, 2024. Smaug-72B achieves the top MMLU score and third-best MT-Bench score out of the open-source models in Table 2. In Appendix E, we give examples of Smaug-72B completions to MT-Bench questions.

Comparison using 7B Model

We fine-tune Smaug-7B from a base of Llama 2-chat [Touvron et al., 2023b]. Since Llama 2-chat has already undergone instruction fine-tuning, we perform DPO and DPOP directly using the same datasets described in the previous section. We train for 2,000 steps on a single GH200.

We evaluate Smaug-7B on MT-Bench in the same setting as described in the previous section. We find that DPOP achieves a first-turn score of 7.275 whereas DPO achieves a score of 7.076.

Conclusion, Limitations, and Impact

In this work, we presented new theoretical and empirical findings on a severe failure mode of DPO, in which fine-tuning causes the probability of the preferred examples to be reduced. In order to mitigate this issue, we devised a new technique, DPOP, which we show overcomes the failure mode of DPO and can outperform DPO even outside this failure mode. By fine-tuning with DPOP on our new pairwise preference versions of ARC, HellaSwag, and MetaMath, we create a new LLM that achieves state-of-the-art performance. In particular, it is the first open-weight model to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard, and it is nearly 2% better than any prior open-weight LLM.

In the future, creating pairwise preference-based versions of other datasets, and running DPOP with these datasets, could push the abilities of open-source LLMs even closer to the performance of proprietary models such as GPT-4 [OpenAI, 2023]. Furthermore, using DPOP on additional mathematical datasets is an exciting area for future work, as it has the potential to further advance LLMs’ abilities in mathematical reasoning.

While our work gives theoretical and empirical evidence on a failure mode of DPO and a proposed solution, it still has limitations. First, we were unable to run a full ablation study on our 72B model. Running multiple fine-tuning experiments on a 72B model is infeasible, as each one can take over five days to complete. Therefore, we assume that our ablations on smaller models still hold up at scale. Furthermore, while we expect DPOP to achieve strong performance on any preference dataset, especially those with small edit distance, we have only demonstrated its performance on five English-language datasets. We hope that future work can verify its effectiveness on more datasets, in particular on non-English datasets.

Broader Impact

This paper proposes a technique to fine-tune LLMs using preference data, releases three preference datasets based on mathematics and reasoning, and releases new LLMs fine-tuned on the preference data. As with any paper that releases a fine-tuning technique or an LLM, there are inherent risks. An adversary could use such a technique to create a model fine-tuned to produce harmful, toxic, or illegal content. However, we are optimistic that our work will have a net positive impact. In particular, DPOP is especially beneficial when used with mathematical or reasoning datasets, in which only a few tokens differ between the preferred and less-preferred completions. Furthermore, we give theoretical intuition on the failure cases of DPO, and a method to avoid this failure case, therefore moving towards a more complete understanding of preference optimisation-based techniques. With recent extensive efforts towards AI safety [Huang et al., 2023, Yao et al., 2023, Weidinger et al., 2021], we hope that preference optimisation-based techniques will impact society positively. Finally, we note that the models we release are less performant than the current top proprietary models such as GPT-4 [OpenAI, 2023], therefore lessening the negative impact of releasing our models.

References

Appendix A Related Work Continued

In this section, we continue our discussion of related work from Section 2.

Wang et al. seek to align LLMs to correctly ‘score’ (in terms of perplexity) their own generations. They do so by generating multiple chain-of-thought [Wei et al., 2023] responses to each prompt, which they categorise as preferred or dispreferred according to whether they answer the question correctly. Their proposed ‘Alignment Fine-Tuning (AFT)’ paradigm adds an alignment objective $\mathcal{L}^{*}_{A}$ to the standard fine-tuning loss, defined as

$$\mathcal{L}^{*}_{A}(\pi_{\theta})=\log\bigg[1+\sum_{y_{w}\in\mathcal{G}_{W}}\sum_{y_{l}\in\mathcal{G}_{L}}e^{(\log\pi_{\theta}(y_{l}|x)-\log\pi_{\theta}(y_{w}|x))}\bigg]$$

where $\mathcal{G}_{W}$ is the set of preferred examples and $\mathcal{G}_{L}$ is the set of dispreferred examples. By minimising $\mathcal{L}^{*}_{A}$, the log-likelihoods of preferred examples are encouraged to be larger than the log-likelihoods of dispreferred examples, akin to DPO. However, Wang et al. take an opposing motivation to us: they are particularly concerned with the issue of the log-likelihoods of dispreferred examples being pushed down too significantly.
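For concreteness, the alignment objective above can be evaluated directly from lists of sequence log-probabilities. This sketch covers only the stated formula for a single prompt; the function name and the example values are hypothetical, and it omits the constraint that Wang et al. add to limit how far dispreferred log-likelihoods are pushed down.

```python
import math

def aft_alignment_loss(preferred_logps, dispreferred_logps) -> float:
    """log(1 + sum over all (y_w, y_l) pairs of exp(log pi(y_l) - log pi(y_w)))."""
    total = sum(
        math.exp(lp_l - lp_w)
        for lp_w in preferred_logps
        for lp_l in dispreferred_logps
    )
    return math.log1p(total)

# Invented log-probabilities for two preferred and two dispreferred responses.
well_separated = aft_alignment_loss([-3.0, -4.0], [-9.0, -10.0])
poorly_separated = aft_alignment_loss([-6.0, -7.0], [-6.5, -5.0])
```

The loss shrinks towards zero as every preferred response becomes more likely than every dispreferred one.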

Our work differs from AFT in three key points. First, although Wang et al. discuss DPO in the appendix, they do not show how their approach would extend to a reformulation of its objective; they also focus their experiments solely on supervised fine-tuning. Next, we use a different constraint mechanism: theirs is a soft margin constraint on the log-probability distance of the dispreferred example from the preferred example, while ours is a soft penalty for deviating from a reference model. Finally, they are focused specifically on the case of self-generated LLM CoT responses and calibrating the LLM’s perplexity of its own responses.

HALO

Ethayarajh et al. seek to understand alignment methods, including DPO, in the context of ‘Human-Centred Loss Functions (HALOs)’. By drawing an equivalence between the alignment methods and the work of Tversky and Kahneman in prospect theory, they adapt the ‘human value function’ in that paper to the LLM setting:

The major difference between this approach and DPO is that it does not require paired preference data. The above loss function can be used for any dataset as long as the labels are individually marked as positive or negative.

CPO

Very recently, concurrent work [Xu et al., 2024] proposes adding a new term to the DPO loss function in order to allow DPO to become better at rejecting ‘worse’ completions that are good quality but not perfect. Specifically, they include the term

While similar, their work uses a different loss function with different motivation, and furthermore they only considered machine translation models up to 13B parameters.

Appendix B Derivation of logit gradients

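Consider two completions of length $K$ with edit (Hamming) distance of 1 which differ at token $m$ with $1\leq m\leq K$. Put $y_{w}=(t_{1},\dots,t_{K})$ and $y_{l}=(t_{1},\dots,t_{m-1},t_{m}^{\prime},t_{m+1},\dots,t_{K})$. Put $y^{<r}=(t_{1},\dots,t_{r-1})$ and $y^{\geq r}=(t_{r},\dots,t_{K})$. Note that the derivative of Equation 1 with respect to $\theta$ is proportional to:

$$\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}})\propto-\left[\nabla_{\theta}\log\pi_{\theta}(y_{w}|x)-\nabla_{\theta}\log\pi_{\theta}(y_{l}|x)\right].$$

By the chain rule of probability, $\log\pi_{\theta}(y|x)=\sum_{k=1}^{K}\log\pi_{\theta}(t_{k}|y^{<k},x)$. When we substitute this into $\nabla_{\theta}\mathcal{L}_{\text{DPO}}$, we observe

$$\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}})\propto-\left[\sum_{k=1}^{K}\nabla_{\theta}\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)-\sum_{k=1}^{K}\nabla_{\theta}\log\pi_{\theta}(t_{k}|y_{l}^{<k},x)\right],$$

where $t_{m}$ is replaced by $t_{m}^{\prime}$ in the second sum.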

Assume that the vocabulary length of the LLM is $L$. This means that $\pi_{\theta}(y|x)$ corresponds to a vector of length $L$ and represents the softmax output of the final layer of the LLM, i.e., $\bm{s}^{\{x\}}=\pi_{\theta}(y|x)$. Note, we may drop the $x$ from the notation of $\bm{s}$ if the context is obvious. Therefore, $s_{i}^{\{x\}}$ represents the probability of the $i$-th token in the model’s vocabulary given the input context $x$.

While the LLM model parameters $\theta$ are numerous, let us restrict our attention to just the logits, which we denote as $\theta_{j}$ with $j\in[L]$ and which are the input to the softmax. With this assumption, we know that

$$\frac{\partial}{\partial\theta_{j}}\log s_{i}^{\{x\}}=1\{i=j\}-s_{j}^{\{x\}}. \tag{4}$$

Consider the case where $y_{w}$ and $y_{l}$ differ only at the first token, i.e., $m=1$. In this instance, we have that for $k>1$:

Without loss of generality, assume that $t_{k}$ takes vocabulary position 1. Then from Equation 4, we have:

As we typically run DPO after SFT, the model is likely to be reasonably well optimised, so we should have $s_{j}^{\{y_{w}^{<k},x\}}\leq s_{j}^{\{y_{l}^{<k},x\}}$ for $j\neq 1$ and $s_{1}^{\{y_{w}^{<k},x\}}\geq s_{1}^{\{y_{l}^{<k},x\}}$. Therefore, while this analysis only extends to gradients with respect to the logits, we see that the gradient vector is decreasing in the correct logit dimension and increasing in the wrong logit dimensions. In particular, this derivation suggests that under DPO, all tokens that follow a difference at $m$ at any point should have reduced probability of emitting the correct token when compared to $\pi_{\text{ref}}$.
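The argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it verifies Equation 4 against finite differences and then evaluates the difference of two instances of Equation 4 for a token after the point of difference, which reduces to $s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}$ because the indicator terms cancel. The function names and the example logits are illustrative assumptions, and vocabulary position 1 in the text corresponds to index 0 in the code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, i):
    """Equation 4: d/d theta_j of log s_i equals 1{i=j} - s_j."""
    s = softmax(logits)
    return [(1.0 if j == i else 0.0) - s[j] for j in range(len(logits))]

def finite_difference(logits, i, eps=1e-6):
    """Central-difference estimate of the same gradient, used as a check."""
    grads = []
    for j in range(len(logits)):
        up = list(logits)
        up[j] += eps
        down = list(logits)
        down[j] -= eps
        diff = math.log(softmax(up)[i]) - math.log(softmax(down)[i])
        grads.append(diff / (2 * eps))
    return grads

def later_token_direction(s_w, s_l, correct=0):
    """Difference of two instances of Equation 4 for a token after position m.

    s_w and s_l are the next-token distributions after the preferred and
    dispreferred prefixes. The indicator terms cancel, leaving s_l - s_w.
    """
    direction = []
    for j in range(len(s_w)):
        indicator = 1.0 if j == correct else 0.0
        direction.append((indicator - s_w[j]) - (indicator - s_l[j]))
    return direction

# Made-up logits: the preferred prefix makes the correct token (index 0) more likely.
s_w = softmax([3.0, 0.5, 0.2, -1.0])
s_l = softmax([1.5, 0.9, 0.6, -0.5])
direction = later_token_direction(s_w, s_l)
print(direction)
```

Under the stated ordering of $s^{\{y_{w}^{<k},x\}}$ and $s^{\{y_{l}^{<k},x\}}$, the entry for the correct token is negative and every other entry is positive, which is the behaviour described above.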

B.2 Derivation for DPOP

We can follow a similar line of reasoning for calculating $\nabla_{\theta}\mathcal{L}_{\text{DPOP}}$ with respect to its logits, which we denote again by $\theta_{j}$.

Again taking token position $t_{k}$ for illustrative purposes and assuming that $t_{k}$ takes vocabulary position $i$, in the case when $\pi_{\text{ratio}}(y|x)<1$,

$$
\begin{split}
\nabla_{\theta_{j}}&\big[\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)-\log\pi_{\theta}(t_{k}|y_{l}^{<k},x)+\lambda\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)\big]\\
&=(1+\lambda)\nabla_{\theta_{j}}\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)-\nabla_{\theta_{j}}\log\pi_{\theta}(t_{k}|y_{l}^{<k},x)\\
&=(1+\lambda)\big(1\{i=j\}-s_{j}^{\{y_{w}^{<k},x\}}\big)-\big(1\{i=j\}-s_{j}^{\{y_{l}^{<k},x\}}\big)\\
&=\begin{cases}\lambda\big(1-s_{j}^{\{y_{w}^{<k},x\}}\big)+s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}&i=j\\-(\lambda+1)s_{j}^{\{y_{w}^{<k},x\}}+s_{j}^{\{y_{l}^{<k},x\}}&i\neq j\end{cases}
\end{split}
$$

If $\pi_{\text{ratio}}(y|x)\geq 1$, then we have the standard gradient from $\mathcal{L}_{\text{DPO}}$.
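As a sanity check on the case $\pi_{\text{ratio}}(y|x)<1$, the sketch below assumes the logit gradient has the form $(1+\lambda)(1\{i=j\}-s_{j}^{\{y_{w}^{<k},x\}})-(1\{i=j\}-s_{j}^{\{y_{l}^{<k},x\}})$ and confirms that it agrees with the two-case expression. The function names and the example distributions are hypothetical and chosen only for illustration.

```python
def dpop_logit_gradient(s_w, s_l, i, lam):
    """(1 + lam) * (1{i=j} - s_w[j]) - (1{i=j} - s_l[j]) for each logit j."""
    grads = []
    for j in range(len(s_w)):
        indicator = 1.0 if j == i else 0.0
        grads.append((1.0 + lam) * (indicator - s_w[j]) - (indicator - s_l[j]))
    return grads

def dpop_logit_gradient_cases(s_w, s_l, i, lam):
    """The same quantity written in the two-case form (i = j and i != j)."""
    grads = []
    for j in range(len(s_w)):
        if j == i:
            grads.append(lam * (1.0 - s_w[j]) + s_l[j] - s_w[j])
        else:
            grads.append(-(lam + 1.0) * s_w[j] + s_l[j])
    return grads

# Made-up next-token distributions after the preferred / dispreferred prefixes.
s_w = [0.86, 0.07, 0.05, 0.02]
s_l = [0.48, 0.26, 0.19, 0.07]
lam = 50.0

grad = dpop_logit_gradient(s_w, s_l, i=0, lam=lam)
print(grad)
```

For these illustrative numbers, the entry for the correct token is positive and the others are negative, whereas setting `lam=0.0` recovers the DPO-style direction $s_{j}^{\{y_{l}^{<k},x\}}-s_{j}^{\{y_{w}^{<k},x\}}$, whose correct-token entry is negative.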

Appendix C Motivation: Contrastive Loss

While the main motivation for DPOP is to avoid the failure mode described in Section 3, we also note its connection to contrastive loss. Contrastive learning is widely used [Wang and Liu, 2021, Wang and Isola, 2020, Saunshi et al., 2019, Oord et al., 2018, Chen et al., 2020, He et al., 2020], often for embedding learning applications. The contrastive loss formulation typically includes two main terms: one encouraging the proximity of analogous inputs, the other encouraging the divergence of distinct classifiable data.

Moreover, the introduction of a margin appended to one of these terms often ensures a more stable training process. This margin serves as an indicator of indifference to point displacement once a specific value threshold is exceeded. The margin, when attached to the similar points term, establishes a minimum threshold beyond which we do not care about pulling the points closer. Alternatively, if added on the dissimilar points term, the margin sets a maximum threshold.

We show that the DPO loss is structured such that learning the probabilities during DPO training is equivalent to learning the embeddings in a contrastive loss formulation. However, standard DPO only uses the term computing the distance between dissimilar points, and does not include the similar points term or the margin. Consequently, it is predictable that traditional DPO’s inefficiencies mirror the known shortcomings of contrastive training when one constituent term is absent. DPOP, our refined DPO formulation, fixes this by adding the absent term and the margin.

Contrastive loss is defined in Hadsell et al. If we keep the margin in the similar points term, it can be written as follows:

$$-\sum_{\forall(i,j)\in\mathcal{P}_{d}}\mathcal{D}(y_{i},y_{j})+\lambda\sum_{\forall(i,j)\in\mathcal{P}_{s}}\min(\mathcal{D}(y_{i},y_{j})-m,0)$$

Recall that the standard DPO loss (Equation 1) is as follows:

Say we designate an embedding function $\mathcal{H}$:

And we define a distance function $\mathcal{D}$ as follows:

Under the analogy with the contrastive loss formulation, standard DPO only has the dissimilar points term. For more robust training, we also include the similar embeddings term. We use the concept of anchor points or embeddings for both positive and negative points, as in the triplet loss of Schroff et al. These points are known ideal embeddings we want our points to achieve. They carry probabilities of 1 and 0 respectively in our equivalence, depending on whether they are preferred or dispreferred samples.

$$
\begin{split}
&=\bigg[\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}\bigg]+\lambda\bigg[\min\bigg(\log\frac{1}{\pi_{\text{ref}}(y_{w}|x)}-\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-m,0\bigg)\bigg]\\
&\qquad+\lambda\bigg[\min\bigg(\log\frac{\epsilon}{\pi_{\text{ref}}(y_{l}|x)}-\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}-m,0\bigg)\bigg]
\end{split}
$$

If we set the margin $m=\log\frac{1}{\pi_{\text{ref}}(y_{w}|x)}$, the second term is:

This choice of margin is mathematically equivalent to choosing a threshold which ensures the similarity term only contributes to the loss when the learned model performs worse on the preferred response than the base model.

We can ignore the third term during training for two primary reasons. First, it is trying to push the log probability of negative samples to negative infinity which may be unstable during training. Second, in essence, it negatively impacts the likelihood of the negative samples. However, given our objective of extending the distance between positive and negative samples without diminishing the likelihood of positives, sacrificing this signal is acceptable. In the worst-case scenario, while the probability of negatives may be increased, the consequent increase in the probability of positives is ensured by the first two terms, thus rendering the sacrifice acceptable.

Appendix D Details of Datasets

In this section, we give more details and examples of the three datasets we release. We release all datasets under the Apache 2.0 License.

GSM8K [Cobbe et al., 2021] is a dataset of diverse grade school maths word problems, which has been commonly adopted as a measure of the math and reasoning skills of LLMs [Chowdhery et al., 2023, Touvron et al., 2023b, a, Beeching et al., 2023, Gao et al., 2021]. The MetaMath dataset is partitioned into queries and responses, where the query is a question involving mathematical calculation or reasoning, and the response is a logical series of steps and calculations that culminate in a final answer. To construct our paired-preference version of MetaMath, we take the queries as prompts $x$ and the responses as the preferred completions $y_{w}$. We create $y_{l}$ by randomly corrupting the result of one intermediate calculation in the response so that it is incorrect; however, we leave the final answer untouched.

An example is as follows:

$x$: “What is the total cost of purchasing equipment for all sixteen players on the football team, considering that each player requires a \$25 jersey, a \$15.20 pair of shorts, and a pair of socks priced at \$6.80?”

$y_{w}$: “Each player requires a \$25 jersey, a \$15.20 pair of shorts, and a pair of socks priced at \$6.80. So the total cost for each player is \$25 + \$15.20 + \$6.80 = \$47. Since there are sixteen players on the football team, the total cost for all of them is 16 * \$47 = \$752. #### 752 The answer is: 752”

$y_{l}$: “Each player requires a \$25 jersey, a \$15.20 pair of shorts, and a pair of socks priced at \$6.80. So the total cost for each player is \$25 + \$15.20 + \$6.80 = \$52. Since there are sixteen players on the football team, the total cost for all of them is 16 * \$47 = \$752. #### 752 The answer is: 752”
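The construction of $y_{l}$ can be sketched as follows. This is only an illustration under our own assumptions, not the procedure used to build the released dataset: the regular expression, the helper name `corrupt_intermediate_result`, and the choice to add a small random offset are all hypothetical.

```python
import random
import re

# Matches "= <number>" results such as "= $47" or "= 752" (an assumed pattern).
RESULT_PATTERN = re.compile(r"=\s*\$?(\d+(?:\.\d+)?)")

def corrupt_intermediate_result(response, rng):
    """Return a copy of `response` with one intermediate result made incorrect.

    The last "= <number>" match is treated as the final result and left alone,
    so the stated answer is untouched. Returns None if no earlier result exists.
    """
    matches = list(RESULT_PATTERN.finditer(response))
    if len(matches) < 2:
        return None
    target = rng.choice(matches[:-1])
    original = target.group(1)
    offset = rng.randint(1, 9)  # non-zero, so the new value always differs
    if "." in original:
        corrupted = f"{float(original) + offset:.2f}"
    else:
        corrupted = str(int(original) + offset)
    start, end = target.span(1)
    return response[:start] + corrupted + response[end:]

y_w = (
    "So the total cost for each player is $25 + $15.20 + $6.80 = $47. "
    "Since there are sixteen players on the football team, the total cost "
    "for all of them is 16 * $47 = $752. #### 752 The answer is: 752"
)
y_l = corrupt_intermediate_result(y_w, random.Random(0))
print(y_l)
```

Only the value after the first equals sign changes here; the later steps and the final answer are copied verbatim, which is what keeps the edit distance between $y_{w}$ and $y_{l}$ small.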

The dataset contains 393 999 training examples and 1 000 evaluation examples. Our motivation in building this dataset is to align models towards being precise in intermediate calculations. This dataset has low edit distance – the normalised edit distance is approximately 6.5%.
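One way to compute a normalised edit distance between a pair of completions is sketched below. The tokenisation and normalisation are assumptions on our part (whitespace tokens, Levenshtein distance divided by the length of the longer sequence); the exact definition behind the figure quoted above may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def normalised_edit_distance(y_w, y_l):
    """Token-level edit distance divided by the longer token count."""
    tokens_w, tokens_l = y_w.split(), y_l.split()
    longest = max(len(tokens_w), len(tokens_l))
    if longest == 0:
        return 0.0
    return edit_distance(tokens_w, tokens_l) / longest

# A pair differing in one of four tokens.
print(normalised_edit_distance("the total is 47", "the total is 52"))
```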

ARC

ARC [Clark et al., 2018] is a dataset that tests the level of understanding of science at approximately grade-school level. We focus specifically on the ‘Challenge’ subsection of ARC, the more difficult of the two subsections, which has been widely adopted as a measure of LLM reasoning and world understanding [Chowdhery et al., 2023, Touvron et al., 2023b, a, Beeching et al., 2023, Gao et al., 2021, Cobbe et al., 2021]. We create a paired preference-ranked dataset from the train split of ARC-Challenge. The dataset is partitioned into questions which we take as our prompts $x$, and four choices of responses to each question of which only one is the correct answer. The correct response is taken as $y_{w}$ and the incorrect responses are taken to be $y_{l}$; as there are three incorrect responses for every prompt, we repeat $y_{w}$ multiple times for each prompt. The dataset contains 3357 training examples and 895 evaluation examples. This dataset has a high normalised edit distance of approximately 90%.
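The expansion of one multiple-choice question into several preference pairs can be sketched as follows. The field names `prompt`, `chosen`, and `rejected`, the function name, and the example question are illustrative assumptions; the question is made up and is not taken from ARC.

```python
def build_preference_pairs(question, choices, correct_index):
    """Pair the correct choice with each incorrect choice for the same prompt."""
    chosen = choices[correct_index]
    return [
        {"prompt": question, "chosen": chosen, "rejected": choice}
        for index, choice in enumerate(choices)
        if index != correct_index
    ]

pairs = build_preference_pairs(
    question="Which gas do plants absorb from the air for photosynthesis?",
    choices=["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
    correct_index=1,
)
for pair in pairs:
    print(pair["chosen"], "preferred over", pair["rejected"])
```

With four choices, each question yields three pairs that share the same prompt and the same preferred completion.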

HellaSwag

Finally, we consider the HellaSwag dataset [Zellers et al., 2019], a dataset containing commonsense inference questions known to be hard for LLMs. An example prompt is “Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. then” and the potential completions are [", the man adds wax to the windshield and cuts it.", ", a person board a ski lift, while two men supporting the head of the person wearing winter clothes snow as the we girls sled.", ", the man puts on a christmas coat, knitted with netting.", ", the man continues removing the snow on his car."]. The dataset contains 119 715 training and 30 126 evaluation examples.

Appendix E Additional Experiments and Details

No hyperparameter tuning was done when creating Smaug-34B or Smaug-72B. DPOP has two hyperparameters, $\beta$ and $\lambda$. We chose $\beta=0.3$, similar to prior work [Rafailov et al., 2023], and we chose $\lambda=50$ without trying other values. It is possible that even better performance can be achieved, e.g., with a different value of $\lambda$.
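To make the role of the two hyperparameters concrete, the sketch below evaluates a per-example loss from sequence log-probabilities. It assumes the DPOP objective has the form $-\log\sigma\big(\beta\,[\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)}-\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}-\lambda\max(0,\log\frac{\pi_{\text{ref}}(y_{w}|x)}{\pi_{\theta}(y_{w}|x)})]\big)$; this form, the function names, and the example numbers are assumptions for illustration and not the training code used for the Smaug models.

```python
import math

def softplus(z):
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.3, lam=50.0):
    """Per-example loss from sequence log-probabilities (assumed DPOP form)."""
    ratio_w = logp_w - ref_logp_w              # log pi_theta(y_w) / pi_ref(y_w)
    ratio_l = logp_l - ref_logp_l              # log pi_theta(y_l) / pi_ref(y_l)
    penalty = max(0.0, ref_logp_w - logp_w)    # active only if y_w became less likely
    margin = beta * (ratio_w - ratio_l - lam * penalty)
    return softplus(-margin)                   # equals -log(sigmoid(margin))

# Made-up case: both completions lose probability, but y_l loses more.
ref_w, ref_l = -20.0, -21.0
logp_w, logp_l = ref_w - 1.0, ref_l - 3.0

without_penalty = dpop_loss(logp_w, logp_l, ref_w, ref_l, lam=0.0)
with_penalty = dpop_loss(logp_w, logp_l, ref_w, ref_l, lam=50.0)
print(without_penalty, with_penalty)
```

With `lam=0.0` the loss falls below its value at initialisation ($\log 2$) even though the preferred completion became less likely; with `lam=50.0` the same update is heavily penalised.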

Here, we give the licenses of all models used to train our Smaug-series of models.

Smaug-7B started from Llama 2-chat [Touvron et al., 2023b]. Therefore, we release it under the Llama 2 license (https://ai.meta.com/llama/license/).

Smaug-34B started from Bagel-34B-v0.2 [Durbin, 2024a], which itself is an SFT version of Yi-34B-200k [01.AI, 2024]. Therefore, we release Smaug-34B under the Yi Series Models Community License Agreement (https://github.com/01-ai/Yi/blob/main/MODEL_LICENSE_AGREEMENT.txt).

Smaug-72B started from MoMo-72b-lora-1.8.7-DPO [Moreh, 2024], which itself is a fine-tune of Qwen-72B [Bai et al., 2023]. Therefore, we release Smaug-72B under the Qwen license (https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT).

E.2 Additional Results

In Section 5, we described an ablation study on $\beta$ to rule out one potential hypothesis, namely that the degradation of DPO on MetaMath could be prevented by modifying the strength of $\beta$, the regularisation parameter. Here, we present the plot; see Figure 4. We test $\beta\in\{0.1,0.3,1.0\}$. Although a larger $\beta$ does induce a slower decrease, the performance with DPO still plummets. On the other hand, DPOP shows strong and consistent performance with different values of $\beta$.

Log-probabilities of preferred completions

In Figure 5, we show the log-probabilities of the preferred completion of the train and eval sets during training on MetaMath. We plot the log-probabilities at a finer granularity than in Figure 3. We confirm our theoretical insights from Section 4: across both the train and eval sets, the log-probabilities of the preferred completion drop substantially in DPO, whereas they increase for DPOP. For ARC, we see in Figure 6 that DPOP maintains high train-set log-probs, while both the train and eval set log-probs decrease for DPO. Notably, even though eval set log-probs do decrease for DPOP, they are still higher than the train set log-probs of DPO.

Additional tables

In Table 3, we give an extension of Table 1 (whose experimental details are in Section 6). In Table 4, we give the same table, except for models of size 34B or lower.

Appendix F Example Completions

In this section, we give example completions by Smaug-72B for questions in MT-Bench [Zheng et al., 2023]. Note that these are not cherry-picked – they include examples of both good and bad completions.