Secrets of RLHF in Large Language Models Part I: PPO

Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang

cs.CL cs.AI cs.LG

Introduction

In recent years, large language models (LLMs) have made remarkable progress and have had a significant impact on the AI community. By scaling up model size, data size, and the amount of training computation, these LLMs exhibit prominent characteristics that are not present in small models, typically including in-context learning, instruction following, and step-by-step reasoning. Based on these emergent abilities, LLMs even show some potential to link words and percepts for interacting with the real world, opening possibilities for artificial general intelligence (AGI), such as embodied language models with tool manipulation and generative agents in interactive sandbox environments.

Despite these capacities, since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), these models are likely to express unintended behaviors such as making up facts, generating biased or toxic text, or even producing content harmful to humans. Accordingly, it is crucial that the ratio of safety progress to capability progress increases, as emphasized in OpenAI’s plan for AGI. Hence, it is necessary to align LLMs with human values, e.g., helpful, honest, and harmless (3H). In particular, the arrival of open-source foundation models, such as LLaMA and OpenChineseLLaMA, has rapidly moved LLMs into the supervised fine-tuning (SFT) stage. To mitigate the risk of harmfulness, most current work adds some 3H data in SFT, hoping to steer the models’ responses toward positive change at the moral and ethical level. However, even though a set of safety and groundedness objectives are added to capture the behavior that the model should exhibit in a dialog, the model’s performance remains below human levels in safety and groundedness. Hence, more effective and efficient control approaches are required to eliminate the potential risks of using LLMs. Fortunately, OpenAI and Anthropic have verified that RLHF is a valid avenue for aligning language models with user intent on a wide range of tasks.

However, training large language models that align with human values is a daunting task, often resulting in repeated failure when trained using reinforcement learning. Generally speaking, successful RLHF training requires an accurate reward model as a surrogate for human judgment, careful hyperparameter exploration for stable parameter updating, and a strong PPO algorithm for robust policy optimization. A reward model trained on low-quality data with a hard-to-define alignment target can easily mislead the PPO algorithm in an unintelligible direction. Besides, fine-tuning language models with PPO requires coordinating four models, i.e., a policy model, a value model, a reward model, and a reference model, which makes it hard to train and to scale up to models with large numbers of parameters. In the language environment, PPO suffers from sparse reward and inefficient exploration in word space, making it sensitive to hyperparameters. Models trained solely through repeated experiments, failed runs, and hyperparameter sweeps achieve far inferior results. The huge trial-and-error cost of LLMs makes researchers hesitant to enter the RLHF stage, which hinders the safe deployment of LLMs. Hence, a robust PPO algorithm specially designed for LLMs is the key step toward aligning with human preferences.

In this report, we carefully dissect the framework of RLHF and discuss the entire process that determines the success of the algorithm’s training. We explore how the quality of the reward model affects the final result of the policy model. We find that the quality of the reward model directly determines the upper bound of the policy model, and that designing an appropriate PPO algorithm is crucial for successful RLHF training. Moreover, accurate code implementation matters in deep policy (practice makes perfect). Therefore, we have conducted in-depth evaluations of the inner workings of the PPO algorithm to study how code-level and theory-level optimizations change agent training dynamics. We propose to monitor the PPO training process by using action space modeling metrics derived from the policy model, such as perplexity, response length, and KL divergence between the policy model and the SFT model. These metrics are more informative of training stability than the values of the response reward and loss functions. Based on these observations, we identify the policy constraints in the PPO algorithm as the key factor in achieving consistent alignment with human preferences. After extensive comparative experiments with various possible implementations of the PPO framework, we finally introduce a preferable policy optimization algorithm named PPO-max, which incorporates the collection of effective and essential implementations and is carefully calibrated to avoid interference among them. PPO-max alleviates the instability of vanilla PPO training and enables longer training steps with a larger training corpus. We evaluate PPO-max on 7B and 13B SFT models, demonstrating alignment performance comparable to ChatGPT.

Our contributions are summarized as follows: 1) we release competitive Chinese and English reward models, which have good cross-model generalization ability, alleviating the cost of relabeling human preference data; 2) we conduct an in-depth analysis of the inner workings of the PPO algorithm and propose the PPO-max algorithm to ensure stable model training; and 3) we release the complete PPO-max code to ensure that LLMs in the current SFT stage can be better aligned with humans.

Related Work

Despite their promising capacities, LLMs are likely to express unintended behaviors such as making up facts, generating biased or toxic text, or even producing content harmful to humans, owing to low-quality pre-training data. Hence, it is necessary to align LLMs with human values, e.g., helpful, honest, and harmless (3H). To mitigate the risk of harmfulness, most current work involves 3H data in SFT, hoping to steer the models’ responses toward positive change at the moral and ethical level, yet the models’ performance remains below human levels in safety and groundedness. Hence, more effective and efficient control approaches are required to eliminate the potential risks of LLMs. Fine-tuning language models to align with human preferences provides an effective solution to this challenge, where an agent is required to learn human preferences and provide human-like results given a context and corresponding suffixes ranked or scored by human annotators. Reinforcement learning (RL) provides the most straightforward way to reach this goal: the agent needs only a sparse supervision signal from the reward model, which acts as a human proxy, and is refined through numerous trials under the RL framework, namely Reinforcement Learning from Human Feedback (RLHF). There have been many attempts on this path recently.

In the context of large language models, RLHF is adopted especially to obtain a helpful, honest, and harmless LLM that aligns with human values, alleviating the negative societal impacts of general-purpose language models. LaMDA finetunes large language models to participate in interesting, helpful, factually grounded, and safe natural language dialogue and to use external information to ensure accuracy and groundedness. Rather than using reinforcement learning, they apply a mix of supervised learning techniques for human preference alignment. InstructGPT finetunes GPT-3-type models to improve helpfulness, in which fine-tuning is mixed with RL from human preferences expressed through comparisons. Other work adopts the pre-training and fine-tuning tradition to train the preference model for human alignment, claiming that ranked preference modeling turns out to be the most effective training objective for distinguishing between “good” and “bad” behavior. This attempt is further improved by an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and PPO is incorporated to stabilize RL training. Despite its effectiveness, RLHF (especially PPO) exhibits complexity, instability, and sensitivity to hyperparameters, which have not yet been addressed in previous works.

Under similar concerns, several works have highlighted the importance of PPO for the RL framework and attempted to improve its efficiency. One study reveals that much of the observed improvement in reward brought by PPO may come from seemingly small modifications to the core algorithm (i.e., code-level optimizations). Another points out that a large number of low- and high-level design decisions in RL are usually not discussed in research papers but are indeed crucial for performance. Accordingly, a further study conducts a fair comparison among low-level designs based on a unified RL implementation and claims that the policy initialization scheme significantly influences performance.

Despite these efforts to reveal the importance of PPO and its recommended implementation, few attempts have been made to address the problem of instability and sensitivity to hyperparameters. In this paper, we dissect the framework of RLHF, especially shedding light on the inner workings of PPO, and explore an advanced version of PPO that efficiently improves the training stability of the policy model.

Reinforcement Learning from Human Feedback

The training process of an AI assistant comprises three main stages: supervised fine-tuning (SFT), reward model (RM) training, and proximal policy optimization (PPO) on this reward model. During the SFT phase, the model learns to engage in general human-like dialogues by imitating human-annotated dialogue examples. Subsequently, the reward model is trained, learning to compare the preference of different responses based on human feedback. Lastly, in the PPO phase, the model is updated based on feedback from the reward model, striving to discover an optimized policy through exploration and exploitation. In the RLHF process, we mainly consider the stages of RM training and reinforcement learning via PPO. The PPO algorithm follows a series of steps as depicted in Figure 1.

For the RM architecture, we use pre-trained transformer-based language models with the last unembedding layer removed and add an additional linear layer to the final transformer layer. Given any text, the reward model will assign a scalar reward value to the last token, and the larger the reward value, the better the sample. Following Stiennon et al. , training reward models often involves utilizing a dataset comprised of paired comparisons between two responses generated for the same input. The modeling loss for each pair of preferred and dispreferred samples is:

$$\mathcal{L}(\psi)=-\log\sigma\left(r(x,y_{w})-r(x,y_{l})\right), \tag{1}$$

where $\sigma$ is the sigmoid function, and $y_{w}$ and $y_{l}$ denote the preferred and dispreferred responses. $r$ represents the reward model with parameters $\psi$, and $r(x,y)$ is the single scalar predicted reward for input prompt $x$ and response $y$. Additionally, following prior work, we use imitation learning, which introduces the autoregressive LM loss on the preferred response of each pair, allowing the model to imitate the preferred response in each sentence pair. In practice, we weight the LM loss with a coefficient $\beta_{\mathrm{rm}}$. Finally, we define the following reward modeling loss:

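$$\mathcal{L}(\psi)=-\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D_{\mathrm{rm}}}}\left[\log\sigma\left(r(x,y_{w})-r(x,y_{l})\right)\right]-\beta_{\mathrm{rm}}\,\mathbb{E}_{(x,y_{w})\sim\mathcal{D_{\mathrm{rm}}}}\left[\log r^{\prime}(x,y_{w})\right], \tag{2}$$

where $\mathcal{D_{\mathrm{rm}}}$ is the empirical distribution of the training set. $r^{\prime}$ is the same model as $r$ except for the top linear layer, whose dimension corresponds to the vocabulary size, and $r^{\prime}(x,y_{w})$ is the likelihood of the preferred response $y_{w}$ given the prompt $x$.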
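The reward modeling loss described above can be illustrated with a small, framework-free sketch. It assumes the standard negative log-sigmoid ranking term and a negative log-likelihood LM term weighted by $\beta_{\mathrm{rm}}$; the function names and the scalar inputs are hypothetical stand-ins for model outputs, not the released implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_ranking_loss(r_w: float, r_l: float) -> float:
    """Ranking loss -log sigma(r(x, y_w) - r(x, y_l)) for one preference pair."""
    return -log_sigmoid(r_w - r_l)


def reward_model_loss(pairs, lm_log_likelihoods, beta_rm=1.0):
    """Mean ranking loss plus beta_rm times the mean LM (imitation) loss.

    pairs: list of (r_w, r_l) scalar rewards for preferred / dispreferred responses.
    lm_log_likelihoods: log-likelihood of each preferred response under the LM head
        (assumed here to be already summed or averaged over tokens).
    """
    ranking = sum(pairwise_ranking_loss(w, l) for w, l in pairs) / len(pairs)
    lm = -sum(lm_log_likelihoods) / len(lm_log_likelihoods)
    return ranking + beta_rm * lm
```

When both responses receive the same reward, the ranking term equals $\log 2$, and it shrinks as the preferred response is scored increasingly higher than the dispreferred one.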

We incorporate an extra term into the reward function, which introduces a penalty based on the Kullback-Leibler (KL) divergence between the learned RL policy $\pi^{\mathrm{RL}}_{\phi}$ and initial supervised model $\pi^{\mathrm{SFT}}$ . The total reward can be expressed as :

$$r_{\mathrm{total}}=r(x,y)-\eta\,\mathrm{KL}\left(\pi^{\mathrm{RL}}_{\phi}(y|x),\pi^{\mathrm{SFT}}(y|x)\right), \tag{3}$$

where $\eta$ is the KL reward coefficient and controls the strength of the KL penalty. This KL divergence term plays two significant roles within this context. First, it functions as an entropy bonus, fostering exploration within the policy landscape and preventing the policy from prematurely converging to a single mode. Second, it works to ensure that the RL policy’s output does not deviate drastically from the samples that the reward model encountered during its training phase.
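As a minimal sketch of the KL-penalized reward, the KL term can be approximated from a single sampled response by summing the per-token log-probability differences between the RL policy and the SFT model. This single-sample estimator, the function name, and the default value of `eta` are assumptions made for illustration.

```python
def kl_penalized_reward(rm_score, logp_rl, logp_sft, eta=0.1):
    """Total reward = reward-model score - eta * (sample estimate of KL).

    logp_rl / logp_sft: per-token log-probabilities of the sampled response
    under the RL policy and the SFT model. eta=0.1 is a hypothetical default.
    """
    if len(logp_rl) != len(logp_sft):
        raise ValueError("log-probability sequences must have the same length")
    kl_estimate = sum(p - q for p, q in zip(logp_rl, logp_sft))
    return rm_score - eta * kl_estimate
```

If the policy has not moved away from the SFT model, the estimate is zero and the total reward equals the reward-model score.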

2 Reinforcement Learning

Applying RL to dialogue generation presents significant challenges due to the substantial state-action space. In this context, we consider human interaction as the “environment”. At each timestep $t$, the agent (i.e., the AI assistant) receives a state $s_{t}$ from the environment (i.e., the dialogue history), which consists of all the dialogue text up to this point, from both the assistant and the human. Then, based on its policy $\pi$, the agent’s action $a_{t}$ is to generate the next token. The environment returns a reward $r(s_{t},a_{t})$, which is calculated from a reward function $r$ trained from human preference data. The agent then transitions to the next state $s_{t+1}$, which includes the next dialogue history. The aim of RL is to find an optimal behavior strategy for the agent to maximize the cumulative reward (i.e., return) over a trajectory $\tau=\{s_{1},a_{1},\ldots,s_{T},a_{T}\}$. One kind of return is the finite-horizon undiscounted return $R(\tau)=\sum_{t=1}^{T^{\prime}}r(s_{t},a_{t})$, which is simply the sum of rewards accumulated within a fixed number of steps $T^{\prime}$. Another is the infinite-horizon discounted return $R(\tau)=\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})$, which takes into account all rewards obtained by the agent throughout its entire trajectory with a discount factor $\gamma\in(0,1)$.

Policy gradient methods are a class of RL techniques that directly optimize the policy of the agent (the mapping of states to actions) instead of learning a value function as in value-based methods. The central idea behind policy gradient methods is to improve the policy using the gradient ascent algorithm. In essence, these methods adjust the parameters of the policy in the direction that maximally improves the expected return. The policy $\pi$ is typically parameterized by $\theta$; we denote it as $\pi(a|s,\theta)$, which is the probability of taking action $a$ in state $s$. The update rule for the policy gradient is given as:

$$\theta\leftarrow\theta+\alpha\nabla_{\theta}J(\theta), \tag{4}$$

where $\alpha$ is the learning rate, $J(\theta)$ represents the expected return when following policy $\pi_{\theta}$, and the gradient of policy performance $\nabla_{\theta}J(\theta)$ is called the policy gradient.

A general form of policy gradient can be formulated as:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\,\Phi_{t}\right], \tag{5}$$

where $\Phi_{t}$ could be any of $\Phi_{t}=R(\tau)$, $\Phi_{t}=\sum_{t^{\prime}=t}^{T}r(s_{t^{\prime}},a_{t^{\prime}})$, or $\Phi_{t}=\sum_{t^{\prime}=t}^{T}r(s_{t^{\prime}},a_{t^{\prime}})-b(s_{t})$ with baseline $b$. All of these choices lead to the same expected value for the policy gradient, despite having different variances.
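To make the three choices of $\Phi_{t}$ concrete, the following sketch computes each of them for a single finite trajectory. The function names are hypothetical, and the baseline values are supplied by the caller (for example, from a learned value function).

```python
def rewards_to_go(rewards):
    """Undiscounted reward-to-go: sum of rewards from step t to the end."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def phi_choices(rewards, baselines):
    """Return the three weighting signals Phi_t listed in the text."""
    total = sum(rewards)
    rtg = rewards_to_go(rewards)
    return {
        # Phi_t = R(tau): the full-trajectory return, identical at every step
        "total_return": [total] * len(rewards),
        # Phi_t = sum_{t'>=t} r(s_t', a_t'): only rewards after the action count
        "reward_to_go": rtg,
        # Phi_t = reward-to-go minus a state-dependent baseline b(s_t)
        "reward_to_go_minus_baseline": [g - b for g, b in zip(rtg, baselines)],
    }
```

For rewards `[1, 2, 3]` the reward-to-go is `[6, 5, 3]`, so later actions are no longer credited with rewards that were received before they were taken.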

The return is calculated through Monte Carlo sampling. If the return is favorable, all actions are “reinforced” by increasing their probability of being selected. The advantage of this approach lies in its unbiased nature, as we rely solely on the actual return obtained rather than estimating it. However, a challenge arises due to the high variance associated with this method. This variance stems from the fact that different trajectories can result in diverse returns due to the stochasticity of the environment (random events during an episode) and the policy itself.

To reduce this variance, a common strategy is to use advantage function estimates in place of raw returns in the policy gradient update rule. The advantage function $A(s_{t},a_{t})$ represents how much better it is to take a specific action $a_{t}$ at state $s_{t}$ , compared to the average quality of actions at that state under the same policy. Thus,

$$\Phi_{t}=A(s_{t},a_{t}). \tag{6}$$

Mathematically, $A(s_{t},a_{t})=Q(s_{t},a_{t})-V(s_{t})$, where $Q(s_{t},a_{t})$ is the action-value function, representing the expected return after taking action $a_{t}$ at state $s_{t}$, and $V(s_{t})$ is the value function, representing the average expected return at state $s_{t}$.

The application of policy gradients with advantage functions forms a crucial backbone in the realm of RL. However, the estimation methods for the advantage function vary significantly across different algorithms, thereby creating a landscape of diverse approaches. In the next section, we introduce Generalized Advantage Estimation (GAE) , a method that is foundational to policy optimization algorithms and has seen widespread use.

2.2 Generalized Advantage Estimation

The following is a layman-friendly explanation of how GAE is derived.

The advantage function, $A$ , is defined as the difference between the $Q$ function (the expected return) and the value function (the expected return from following the policy from a given state). The $Q$ function considers a specific action, while the value function averages over all possible actions according to the policy. However, in practice, we use returns (sum of rewards) from actual episodes to estimate the $Q$ function. This introduces a high amount of variance because future rewards can be very noisy. One way to reduce this noise is by estimating future returns (after time step $t$ ) using the value function. The GAE algorithm effectively acts as a middle ground between using simple one-step Temporal Difference (TD) returns and using full Monte Carlo returns, balancing bias and variance.

The TD- $k$ return $\hat{R}_{t}^{k}$ is a combination of actual rewards and estimated returns:

$$\hat{R}_{t}^{k}=r_{t}+\gamma r_{t+1}+\ldots+\gamma^{(k-1)}r_{t+k-1}+\gamma^{k}V(s_{t+k}), \tag{7}$$

where $\gamma$ is the discount factor. The advantage estimate using TD-$k$ returns is called the $k$-step advantage, defined as:

$$\hat{A}_{t}^{k}=\hat{R}_{t}^{k}-V(s_{t})=\sum_{l=0}^{k-1}\gamma^{l}\delta_{t+l}=-V(s_{t})+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1}r_{t+k-1}+\gamma^{k}V(s_{t+k}), \tag{8}$$

where $\delta_{t}=r_{t}+\gamma V(s_{t+1})-V(s_{t})$ is the TD error. There is a significant bias-variance trade-off with $k$-step advantages. If $k$ is small, the bias is high because the advantage estimation is based on fewer steps and thus depends heavily on the accuracy of the value function. On the other hand, if $k$ is large, the variance can be high because the advantage estimation involves summing up many noisy rewards.

In order to balance the bias-variance trade-off in the advantage estimation, GAE defines the advantage function as an exponential moving average of $k$ -step advantages, with weights $(1-\lambda)\lambda^{(k-1)}$ :

$$\begin{aligned}
\hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)} &= (1-\lambda)(\hat{A}^{(1)}_{t}+\lambda\hat{A}^{(2)}_{t}+\lambda^{2}\hat{A}^{(3)}_{t}+\cdots)\\
&= (1-\lambda)(\delta_{t}+\lambda(\delta_{t}+\gamma\delta_{t+1})+\lambda^{2}(\delta_{t}+\gamma\delta_{t+1}+\gamma^{2}\delta_{t+2})+\ldots)\\
&= (1-\lambda)(\delta_{t}(1+\lambda+\lambda^{2}+\ldots)+\gamma\delta_{t+1}(\lambda+\lambda^{2}+\lambda^{3}+\ldots)\\
&\qquad+\gamma^{2}\delta_{t+2}(\lambda^{2}+\lambda^{3}+\lambda^{4}+\ldots)+\ldots)\\
&= (1-\lambda)\left(\delta_{t}\left(\frac{1}{1-\lambda}\right)+\gamma\delta_{t+1}\left(\frac{\lambda}{1-\lambda}\right)+\gamma^{2}\delta_{t+2}\left(\frac{\lambda^{2}}{1-\lambda}\right)+\ldots\right)\\
&= \sum^{\infty}_{l=0}(\gamma\lambda)^{l}\delta_{t+l}.
\end{aligned} \tag{9}$$

This definition of GAE smoothly interpolates between high bias (when $\lambda=0$) and high variance (when $\lambda=1$) estimators, effectively managing the trade-off.

$$\mathrm{GAE}(\gamma,0):\ \hat{A}_{t}=\delta_{t}=r_{t}+\gamma V(s_{t+1})-V(s_{t}), \tag{10}$$

$$\mathrm{GAE}(\gamma,1):\ \hat{A}_{t}=\sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}=\sum_{l=0}^{\infty}\gamma^{l}r_{t+l}-V(s_{t}). \tag{11}$$

Through GAE, we can obtain an accurate estimate $\hat{A}_{t}$ of the advantage function $A(s_{t},a_{t})$. This estimate will play a crucial role in constructing a policy gradient estimator:

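$$\nabla_{\theta}\hat{J}(\theta)=\frac{1}{|\mathcal{D}|}\sum_{\tau\in\mathcal{D}}\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\hat{A}_{t}, \tag{12}$$

where $\mathcal{D}$ is a finite batch of samples. We will use $\hat{\mathbb{E}}_{t}$ to represent the aforementioned $\frac{1}{|\mathcal{D}|}\sum_{\tau\in\mathcal{D}}\sum_{t=1}^{T}$.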
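The GAE sum $\sum_{l}(\gamma\lambda)^{l}\delta_{t+l}$ is usually evaluated with a backward recursion over a finite trajectory. The sketch below assumes that `values` carries one extra entry, the bootstrap value of the state after the last step (zero for a terminated episode); the function name and default hyperparameters are illustrative only.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: r_t for t = 0..T-1.
    values:  V(s_t) for t = 0..T (the last entry is the bootstrap value).
    Uses A_t = delta_t + gamma * lam * A_{t+1}, which unrolls to
    sum_l (gamma * lam)^l * delta_{t+l}.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step TD error, while `lam=1` recovers the discounted Monte Carlo return minus $V(s_{t})$, matching the two limiting cases above.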

2.3 Proximal Policy Optimization

PPO and TRPO are two pivotal techniques in RL, aimed at effectively training a policy without jeopardizing its stability. The underlying intuition for these methods is the idea of “small, stable steps”: a philosophy of gently nudging the policy towards optimization, rather than forcing aggressive updates that might destabilize the overall learning process.

In traditional RL, the principle of policy gradient mandates that new and old policies remain close in the parameter space. However, this proximity in parameter space does not necessarily equate to similar performance, and a slight change in parameters can drastically impact the effectiveness of the policy. Furthermore, if a large, unrestrained step is taken, it can lead to a collapse in policy performance, a scenario often described as “falling off the cliff”. This inherent risk is a limiting factor in terms of sample efficiency in vanilla policy gradients.

Instead of being confined by parameter closeness, TRPO introduces a different kind of constraint on policy updates. It regulates the change in policies by ensuring that the KL divergence between the old and new policies remains within an acceptable limit:

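$$\begin{aligned}
\underset{\theta}{\mathrm{maximize}}\quad & \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}|s_{t})}\hat{A}_{t}\right],\\
\text{subject to}\quad & \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left(\pi_{\theta_{\mathrm{old}}}(\cdot|s_{t}),\pi_{\theta}(\cdot|s_{t})\right)\right]\leq\delta,
\end{aligned} \tag{13}$$

where $\theta_{\mathrm{old}}$ denotes the policy parameters before the update and $\delta$ is the limit on the KL divergence.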

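There are two primary variants of PPO: PPO-Penalty and PPO-Clip. While TRPO puts a hard constraint on the KL divergence to prevent harmful updates, PPO-Penalty solves an unconstrained optimization problem by replacing the constraint with a penalty term:

$$\mathcal{L_{\mathrm{ppo-penalty}}}(\theta)=\hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}|s_{t})}\hat{A}_{t}-\beta\,\mathrm{KL}\left(\pi_{\theta_{\mathrm{old}}}(\cdot|s_{t}),\pi_{\theta}(\cdot|s_{t})\right)\right], \tag{14}$$

where $\beta$ is the penalty coefficient.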

PPO-Clip attempts to keep the new policy close to the old policy, but instead of putting a constraint on the KL divergence like TRPO, it uses a clipped version of the policy ratio in its objective. The objective function is expressed as:

$$\mathcal{L_{\mathrm{ppo-clip}}}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\left(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}|s_{t})}\hat{A}_{t},\mathrm{clip}\left(\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}|s_{t})},1-\epsilon,1+\epsilon\right)\hat{A}_{t}\right)\right], \tag{15}$$

where $\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}|s_{t})}$ is the ratio of the new policy’s probability to the old policy’s probability and $\epsilon$ is a hyperparameter that determines how much the new policy can deviate from the old policy. The $\mathrm{clip}$ function limits the value of this ratio to the interval $[1-\epsilon,1+\epsilon]$. The clipping acts as a regularizer, limiting the extent to which the policy can change from one iteration to the next. Preventing overly large policy updates ensures the robustness of the learning process while maintaining more sample-efficient learning than vanilla policy gradient methods.
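The effect of clipping is easiest to see on individual samples. The following sketch evaluates the PPO-Clip surrogate from per-token log-probabilities; the function name and the default `epsilon=0.2` are assumptions for illustration.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (to be maximized).

    logp_new / logp_old: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t).
    advantages: advantage estimates A_t, e.g. from GAE.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum yields a pessimistic bound on the unclipped term
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a ratio of 1.5 and a positive advantage, the objective is capped at $1+\epsilon$ times the advantage, so there is no incentive to push the ratio further. With the same ratio and a negative advantage, the unclipped (more pessimistic) term is kept.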

In the PPO algorithm, the critic model, often referred to as the value function, estimates the expected returns for each state. The learning objective of this model is to minimize the discrepancy between its predicted values and the actual return values. The loss function of the critic model is commonly defined using Mean Squared Error (MSE), given by the following formula:

$$\mathcal{L_{\mathrm{critic}}}(\phi)=\hat{\mathbb{E}}_{t}\left[\lVert V_{\phi}(s_{t})-\hat{R}_{t}\rVert^{2}\right]. \tag{16}$$

Here, $V_{\phi}(s_{t})$ represents the critic model’s predicted value for state $s_{t}$ with parameters $\phi$, and $\hat{R}_{t}$ represents the actual return value for state $s_{t}$, which can be estimated as $\hat{R}_{t}=\sum^{\infty}_{l=0}\gamma^{l}r_{t+l}$.
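As a small worked example of the critic objective, the sketch below computes the discounted return targets $\hat{R}_{t}$ for a finite trajectory (truncating the infinite sum at the end of the episode, which is an assumption) and the mean squared error against the predicted values. The function names are hypothetical.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return targets R_t = sum_l gamma^l * r_{t+l} over a finite trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def critic_mse_loss(values, returns):
    """Mean squared error between predicted values V(s_t) and targets R_t."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)
```

For rewards `[1, 1]` with `gamma=0.5` the targets are `[1.5, 1.0]`; a critic that predicts `1.0` at both steps therefore incurs a loss of `0.125`.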

To mitigate potential degradation in the model’s language skills and knowledge retention during PPO, we also explore the incorporation of pretraining data into the RL phase. The models utilizing this method are denoted as “PPO-ptx”. The combined objective function is as follows:

$$\mathcal{L_{\mathrm{ppo-ptx}}}(\theta)=\mathcal{L_{\mathrm{ppo-clip}}}(\theta)+\lambda_{\mathrm{ptx}}\mathbb{E}_{x\sim\mathcal{D}_{\mathrm{pretrain}}}\left[\log(\pi^{\mathrm{RL}}_{\theta}(x))\right], \tag{17}$$

where $\lambda_{\mathrm{ptx}}$ is the pretraining loss coefficient and $\mathcal{D}_{\mathrm{pretrain}}$ is the pretraining data distribution.

Reward Modeling for Helpfulness and Harmlessness

The reward model is trained to reflect human preferences. Theoretically, we could directly fine-tune the model using reinforcement learning and human annotations. However, due to constraints on workload and time, it is infeasible for humans to provide sufficient feedback for training before each optimization iteration. Therefore, a more effective approach is to train a reward model (RM) that emulates the evaluation process performed by humans. In this section, we first cover the technical details of the RM, then show the performance of the RM we used, and finally report the performance changes during training.

For English, we start with the original LLaMA-7B which is of the decoder-only architecture. We use 160k pairwise samples of the HH-RLHF dataset which consists of 118k helpful and 42k harmless instances as training set. From the remaining 8.5k data, we randomly selected approximately 0.7k helpful and 0.3k harmless examples for a total of 1k data as the test set, and the rest is used as the validation set during training.

For Chinese, we use OpenChineseLLaMA. It is developed through incremental pre-training on Chinese datasets, building upon the foundation of LLaMA-7B, which significantly improves its understanding and generation abilities in Chinese. We hired professional annotators to manually label 39k pairwise samples, including 31k helpful and 8k harmless samples. We constructed the training set by randomly sampling 24k helpful and 6k harmless instances, and then randomly allocated 2.4k helpful and 0.6k harmless samples from the remaining data to form the test set. The rest is used for validation.

2 Training Setup

This section introduces the training implementations for the RM. The learning rate is set to 5e-6 with a warmup over the first 10% of steps. We use a dynamic batch method instead of a fixed value, which balances the number of tokens in each batch as much as possible for a more efficient and stable training phase. The batch size changes according to the number of tokens in a batch, with a maximum of 128 and a minimum of 4. We fix the number of training steps to $1000$, approximately $1.06$ epochs over the whole training set. We set $\beta_{\mathrm{rm}}=1$, the LM loss weight, when training our reward model throughout the experiments.
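One possible way to realize a token-balanced dynamic batch is sketched below. The text only fixes the bounds on the batch size (4 and 128); the `token_budget` parameter, the length-sorted greedy grouping, and the function name are all assumptions made for illustration and may differ from the actual implementation.

```python
def dynamic_batches(lengths, token_budget, min_size=4, max_size=128):
    """Group sample indices so that padded tokens per batch stay near a budget.

    lengths: token count of each sample.
    token_budget: hypothetical target for (batch size * longest sample).
    The batch size is clamped to [min_size, max_size]; the final batch may be
    smaller than min_size if too few samples remain.
    """
    # Sort longest-first so that each batch holds samples of similar length
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches = []
    start = 0
    while start < len(order):
        longest = max(lengths[order[start]], 1)
        size = max(min_size, min(max_size, token_budget // longest))
        batches.append(order[start:start + size])
        start += size
    return batches
```

Under this scheme, long samples end up in small batches and short samples in large ones, so the number of tokens processed per step stays roughly constant.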

3 HH Evaluation Results

In this section, we present the HH evaluation results of our RM. We primarily analyze the trained reward model with the test set introduced in Sec. 4.1, which comprises 0.9k samples of HH-RLHF for English and 3k samples drawn from the annotator-labeled dataset for Chinese. We feed the test input into our RM to obtain the reward values of the preferred and dispreferred responses, and then subtract the latter from the former to get the difference score. Figure 2 shows the distribution of the difference score. Both models exhibit a degree of alignment with human preferences, with the RM trained on the Chinese data that we constructed with hired annotators showing substantial consistency with human judgments.
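The difference score can be computed directly from the two reward values of each pair. The sketch below assumes the score is defined as the preferred reward minus the dispreferred reward, so that a positive score means the RM agrees with the human label; the function names are hypothetical.

```python
def difference_scores(preferred_rewards, dispreferred_rewards):
    """Per-pair difference r(x, y_w) - r(x, y_l)."""
    return [w - l for w, l in zip(preferred_rewards, dispreferred_rewards)]


def preference_accuracy(diffs):
    """Fraction of pairs in which the RM ranks the preferred response higher."""
    return sum(1 for d in diffs if d > 0) / len(diffs)
```

A distribution of difference scores concentrated on the positive side therefore corresponds to high agreement with human preferences.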

We examined several samples from the test dataset that displayed the most significant disparities between the model and human preferences. For the Chinese test data, we observed that in each pair the response to which the RM gave a higher reward was notably longer than the one preferred by humans, although it more or less involved fabricated facts and false claims. For the English test data, we noticed that the model assigned lower scores to responses that acknowledged the lack of information, which were honest but lacked helpfulness. Conversely, responses that appeared correct and helpful but contained deceptive information misled our RM into assigning high rewards. We provide one such example each for Chinese and English in Table 1.

4 Training Performance

In this section, we show the performance changes during training. Specifically, Figure 3 shows the trend of the training loss of the PM. We can see that the accuracy of the RM trained on the Chinese dataset is higher than that of the English one, because the Chinese dataset we constructed exhibits a significant disparity between the better and worse responses in most pairs. In contrast, many English pairs show similar levels of quality, which makes it harder for the RM to determine which response is superior and to model the features that differentiate the two responses. As a result, training and testing accuracy on the English dataset is expected to be lower. Besides, we find that the rate of improvement slows down significantly after 200 steps for both models, approximately equivalent to 0.2 epochs, at which point the accuracy is comparable to that obtained after training for a complete epoch. However, when utilizing the 200-step model as the initialization for PPO, we observe unsatisfactory performance. Thus, accuracy alone is insufficient as a criterion for the RM.

Exploration of PPO

Proximal Policy Optimization (PPO) is the core algorithm for achieving alignment with human preferences. The performance of PPO is influenced by multiple factors in practical applications. Some prior works have summarized possible tricks that may be necessary and effective in the field of reinforcement learning, but how to stabilize RLHF training with language models remains unknown. We aim to explore which tricks are critical and which metrics can reflect the model state during and after RLHF training. We first introduce the metrics that are instructive in the training process, and then present the training trajectories and effects under different implementations to reveal the core tricks in RLHF. We use PPO-max to denote the most suitable implementation we find for the language model.

The training implementations for the preference model (PM) and the PM dataset are introduced in Sec. 4. In this section, we introduce the models’ initialization and the hyperparameter details used in exploring PPO. We verified a number of methods in reinforcement learning to ensure stable convergence and better results in the PPO training phase. To improve experimental efficiency, these experiments are mainly conducted on a randomly selected subset of our Chinese data, and training is stopped before reaching optimal results once we have observed enough information to analyze the compared methods. As shown in Sec. 3, four models need to be loaded during the PPO training phase. We initialize both the reference model and the policy model from a 7B SFT model. The SFT model is obtained by supervised fine-tuning OpenChineseLLaMA for 2 epochs on 1M filtered instruction samples (400K single-round and 600K multi-turn instruction samples). We set a learning rate of 9.5e-6 with a cosine learning rate schedule, under which the learning rate eventually decays to 10% of the peak learning rate. The global batch size is set to 1024. We use the reward model to initialize the critic model and reward model.
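The cosine schedule used for SFT can be sketched as follows. The peak learning rate of 9.5e-6 and the final ratio of 10% come from the text; the absence of a warmup phase and the function name are assumptions.

```python
import math


def cosine_lr(step, total_steps, peak_lr=9.5e-6, final_ratio=0.1):
    """Cosine decay from peak_lr at step 0 to final_ratio * peak_lr at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    min_lr = peak_lr * final_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Halfway through training this yields 55% of the peak learning rate, and the rate never drops below 10% of the peak.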

We train the models on a manually constructed HH dataset containing 8k harmless queries and 20k helpful queries, and we fix the number of training steps rather than the number of epochs. In all experiments, we use a batch size of 128 for sampling from the environment and a batch size of 32 for training the policy model and the critic model. The learning rates of the policy model and the critic model are set to 5e-7 and 1.65e-6, respectively, with a warmup over the first 10% of steps.

All experiments are conducted on identically configured machines. Each machine contains eight 80G A100 GPUs, 1TB of RAM, and 128 CPUs. We use ZeRO-2 and gradient checkpointing to reduce GPU memory cost in the training phase.

5.2 Evaluation Metrics for Monitoring the Training Process

We aim to identify metrics that reflect the quality of PPO training, which helps track the helpful, honest, and harmless capabilities of policy models without resorting to manual (or GPT-4) evaluation. We found it challenging to accurately distinguish the merits of two models with similar abilities, but it is feasible to observe training stability and promptly identify serious deviations. Figure 4 shows various metric curves when the policy model is continuously optimized with the vanilla PPO implementation.

We first introduce the pattern collapse phenomenon in vanilla PPO training, in which SFT models are over-optimized and exhibit highly biased behavior. A reasonable policy model is expected to be consistent with human preferences across the variety of real-world dialogue (e.g., data not seen when training the reward model). However, we observe that the trained policy model tends to cheat the reward model through specific patterns to obtain anomalously high scores. The training trajectories of reward score and training loss for vanilla PPO are illustrated at the top of Figure 4. We observed stable convergence in the training loss, but higher rewards do not reflect better policy behavior from the perspective of human and GPT-4 evaluation. This means that reward scores and training losses do not indicate whether PPO is optimizing correctly. In vanilla PPO training, the response rewards of the policy model gradually deviate from the original distribution and exhibit long-tail characteristics. We show the distribution of response rewards at different training steps in Appendix A.

An empirical strategy is to compare the training processes of good and bad policy models to find suitable metrics. We show more indicative training metrics at the bottom of Figure 4, including perplexity, the KL divergence between the policy and reference models, and the average length of generated responses. Previous work proposed an approximately linear relationship between the square root of the KL divergence and PM scores, but for smaller models this association appears to be weak. We find that the model's responses fall into the OOD region of the preference model when the original policy is over-optimized. We further discuss this scaling effect in the next section. We also observe that the collapsed model uniformly delivers longer responses and exhibits lower perplexity on such generative patterns. We use these metrics to show the importance of different tricks and their impact on PPO training in Sec. 5.3.
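The three monitoring signals named in this section (perplexity, KL divergence from the reference model, and response length) can all be derived from per-token log-probabilities. The sketch below is a minimal illustration under assumptions of our own: the function name `response_metrics` is hypothetical, and the KL value is a simple sample-based estimate computed on tokens drawn from the policy, not necessarily the estimator used in the experiments.

```python
import math


def response_metrics(policy_logprobs, ref_logprobs):
    """Monitoring metrics for one sampled response.

    Both arguments hold the log-probability that the policy / reference model
    assigns to each token of a response sampled from the policy.
    """
    n = len(policy_logprobs)
    # Perplexity of the response under the policy that generated it.
    perplexity = math.exp(-sum(policy_logprobs) / n)
    # Sample-based estimate of KL(policy || reference), averaged per token.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / n
    return {"perplexity": perplexity, "kl": kl, "length": n}


if __name__ == "__main__":
    policy = [math.log(0.5)] * 4
    reference = [math.log(0.25)] * 4
    print(response_metrics(policy, reference))
```

Tracked over training steps, a simultaneous drop in perplexity and rise in length and KL would match the collapse pattern described above.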

5.3 Implementation Details in PPO

We described the instability and pattern collapse problems of the primitive PPO algorithm in Sec. 5.2. Such sensitivity derives from over-optimization of the policy model, which traps it in fixed generative patterns. Recent works have explored the implementation details of PPO algorithms in different scenarios. However, the application scenarios and data structures of traditional RL are quite different from those of RLHF. We therefore set out to verify the applicability of these tricks to language model training and to propose a set of PPO implementations that support stable optimization. In the body of this paper, we mainly focus on methods that efficiently assist PPO training and on their parameter sensitivity. Figure 5 illustrates the numerous tricks available in PPO training. We first summarize the score reparameterization method (§5.3.1), then the optimization constraints for the policy model (§5.3.2), and finally the different initialization methods for the policy and critic models (§5.3.3). More experiments on hyper-parameter tuning and on tricks verified as less critical, such as the advantage estimation function and gradient clipping, are discussed in the appendix. In the following, unless stated otherwise, PPO refers to our own experiments.

We use the term “score” to refer to the two vital intermediate variables involved in PPO training. The reward score is given by the reward model trained on human preference data, and the advantage score is calculated by the GAE function. According to existing works, reparameterizing these scores to a stable distribution (e.g., a standard normal distribution) may improve the stability of PPO. We divide the reported operations into three parts for verification. We use $\{r\left(x,y\right)\}\triangleq\{r_{n}\left(x,y\right)\}_{n=1}^{\mathcal{B}}$ to denote a reward sequence in training, $r_{n}\left(x,y\right)$ to denote the per-batch reward, and $\bar{A}$ and $\sigma({A})$ to denote the mean and standard deviation of a variable $A$. Comparative experiments with different tricks and hyperparameters are shown in Figure 6.

Reward Scaling controls training fluctuations by dividing the rewards by the standard deviation of a rolling discounted sum. Based on the observation history, the reward for the current state can be expressed as $r_{n}\left(x,y\right)/\sigma(r\left(x,y\right))$. In contrast to the experimental results of Engstrom, we show that reward scaling does not guide proper policy optimization, and PPO exhibits consistent patterns in training trajectories with and without reward scaling. Our experiments suggest that tighter constraints are required to ensure training stability.
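The expression $r_{n}(x,y)/\sigma(r(x,y))$ can be sketched with a running estimate of the standard deviation over the reward history. This is a simplified illustration: it tracks the standard deviation of the raw rewards using Welford's algorithm rather than of a rolling discounted sum, and the names `RunningStd` and `scale_rewards` are hypothetical.

```python
import math


class RunningStd:
    """Welford running estimate of the mean and standard deviation."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, values):
        for v in values:
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # no scale information yet, leave rewards unchanged
        return math.sqrt(self.m2 / self.count)


def scale_rewards(batch_rewards, stats, eps=1e-8):
    """Divide each reward by the std of all rewards observed so far."""
    stats.update(batch_rewards)
    return [r / (stats.std + eps) for r in batch_rewards]


if __name__ == "__main__":
    stats = RunningStd()
    print(scale_rewards([1.0, 3.0], stats))
    print(scale_rewards([10.0, -4.0], stats))
```

Note that scaling leaves the mean of the rewards uncentered, which is one way it differs from the normalization discussed next.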

Reward Normalization and Clipping was first proposed by Mnih. The processed reward can be denoted as:

$$\tilde{r}\left(x,y\right)=\mathrm{clip}\left(\frac{r_{n}\left(x,y\right)-\overline{r\left(x,y\right)}}{\sigma(r\left(x,y\right))},-\delta,\delta\right),$$

where $\delta$ denotes the clip region. It is generally believed in traditional RL that reward clipping is ineffective or even detrimental in certain scenarios. However, we find that strict advantage clipping can also maintain training stability within a fixed epoch. Interestingly, hyperparameter tuning does not affect the similarity of the different methods in the early training period, while models with larger clipping thresholds exhibit greater strategy alteration and converge to higher rewards in the latter half. As mentioned earlier, this does not imply better performance in manual evaluation. Given this inconsistency between the reward model and manual evaluation results, determining the optimal clipping bound within a limited number of trials is challenging. We therefore suggest adopting a relaxed clipping strategy and incorporating other tricks to constrain policy optimization when training RLHF.
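A minimal sketch of reward normalization followed by clipping to the region $[-\delta,\delta]$ is given below. The function name `normalize_and_clip` and the default `delta=5.0` are illustrative assumptions, not values taken from the experiments; the mean and standard deviation are passed in so that they can come from either the current batch or historical records.

```python
def normalize_and_clip(rewards, mean, std, delta=5.0, eps=1e-8):
    """Standardize each reward, then clip it to [-delta, delta].

    mean and std are supplied by the caller (hypothetical interface).
    """
    processed = []
    for r in rewards:
        z = (r - mean) / (std + eps)                   # normalization
        processed.append(max(-delta, min(delta, z)))   # clipping
    return processed


if __name__ == "__main__":
    # The outlier 100.0 is capped at delta instead of dominating the update.
    print(normalize_and_clip([0.0, 2.0, 100.0], mean=2.0, std=1.0))
```

A larger `delta` corresponds to the relaxed clipping strategy suggested above, since fewer rewards are affected by the bound.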

Advantages Normalization and Clipping is similar to the operation on rewards, but differs in that its normalization occurs only at the minibatch level. After calculating the advantage with GAE, PPO normalizes the advantage value by subtracting its mean and dividing by its standard deviation. Andrychowicz first attempted to apply advantage normalization in the gaming domain and reported that this trick did not yield significant improvements. Although parameter selection for advantage clipping is more sensitive and difficult, we find that a severe constraint on the advantage can provide effects similar to reward clipping in PPO training. Considering that different score reparameterization operations theoretically have similar effects on PPO training, we recommend constraining the instability of policy optimization at the reward level. Experiments on the simultaneous application of reward, advantage, or value clipping operations are shown in Appendix B.1.
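The minibatch-level normalization described here can be sketched as follows. The function name `normalize_advantages` and the optional `clip` argument are assumptions for illustration; the statistics are computed only over the advantages passed in, which is what distinguishes this from the reward-level operation that uses the reward history.

```python
import statistics


def normalize_advantages(advantages, clip=None, eps=1e-8):
    """Normalize advantages within one minibatch, optionally clipping them."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)  # population std of this minibatch
    normed = [(a - mean) / (std + eps) for a in advantages]
    if clip is not None:
        normed = [max(-clip, min(clip, a)) for a in normed]
    return normed


if __name__ == "__main__":
    print(normalize_advantages([1.0, 2.0, 3.0]))
    print(normalize_advantages([1.0, 2.0, 3.0], clip=1.0))
```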

5.3.2 Policy Constraints

To tackle the over-optimization problem of the policy model, an intuitive solution is to constrain policy optimization to a limited range. We validate various existing tricks for controlling the update of the generation policy; such constraints prove empirically necessary for longer training procedures. Figure 7 shows the influence of different constraint methods and hyperparameters on policy optimization.

Token Level KL-Penalty constrains policy optimization by adding to the reward a regularization term proportional to the KL divergence between the current and original policy distributions. This approach was first introduced by Stiennon and has been widely adopted in different RLHF implementations. Given a template-response pair $(x,y)$, we treat the logits distribution of the token output as a sampling of the policy distribution and apply an empirically estimated KL-penalty sequence to the response reward. The total reward with KL-penalty can be denoted as:

$$r_{\mathrm{total}}(x,y_{i})=r(x,y_{i})-\eta\,\mathrm{KL}\left(\pi^{\mathrm{RL}}_{\theta}(y_{i}|x),\pi^{\mathrm{SFT}}(y_{i}|x)\right),$$

where $\pi^{\mathrm{RL}}_{\theta}(y_{i}|x)$ denotes the action space of the $i\mathrm{-th}$ response token, $\pi^{\mathrm{SFT}}(y_{i}|x)$ is the corresponding distribution of the original (reference) policy, and $\eta$ is a hyper-parameter. Anthropic used a small weight ($0.001$) to balance the ratio of reward and KL-penalty in PPO training, and did not find significant effects of this operation on RL training. Instead, we find this constraint critical to the stability of PPO, and it allows further scaling up of the number of training steps. Results with the policy divergence penalty, obtained with a penalty weight of 0.05, are illustrated in Figure 7; they differ significantly from the method in Figure 6, with a noticeable correction in the later training period. Interestingly, we show that RLHF is able to significantly improve response quality while barely modifying the language modeling (exhibiting an almost zero KL divergence from the original policy). More experiments on the impact of different constraint values are shown in Appendix B.2.
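A token-level KL penalty of this kind might be implemented roughly as below. This is a sketch under stated assumptions: the per-token KL is estimated from the log-probabilities of the sampled tokens, and the sequence-level reward model score is added on the final token, a convention borrowed from common RLHF implementations rather than something the text specifies. The function name `kl_penalized_rewards` is hypothetical.

```python
def kl_penalized_rewards(score, policy_logprobs, ref_logprobs, eta=0.05):
    """Per-token rewards: -eta * KL estimate, plus the score on the last token.

    score:           scalar reward model score for the whole response
    policy_logprobs: log pi_RL(y_i | x) for each sampled token
    ref_logprobs:    log pi_ref(y_i | x) for the same tokens
    """
    # Sample-based per-token KL estimate: log pi_RL - log pi_ref.
    rewards = [-eta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    # Assumption: the sequence-level score is credited to the final token.
    rewards[-1] += score
    return rewards


if __name__ == "__main__":
    print(kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], eta=0.1))
```

Tokens on which the policy has drifted towards higher probability than the reference receive a negative reward, which pulls the policy back towards the original distribution.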

Importance Sampling in PPO aims to rectify the policy divergence between the historical generative model and the current model when optimizing the policy model with responses in the experience buffer. EasyRL argues that an oversized buffer would induce a wrong estimation of the advantage of the current policy, which impairs the stability of policy optimization. We revalidated this hypothesis by directly fixing the policy distribution to observations of the reference model, which is equivalent to having an infinite experience buffer in the training process. We find that this setup does not have as severe an impact as expected, and only exhibits fluctuations in the later stage of training. We additionally investigate the cooperative effect of this setup with KL penalties, given that they share similar controls on PPO. Experimental results indicate that this implementation further stabilizes PPO training, but compromises the final performance of the policy model.
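For reference, the importance weight enters PPO through the ratio between the current policy and the policy that generated the buffered responses. The sketch below shows the standard clipped surrogate loss on per-token log-probabilities; the function name `clipped_surrogate` and the default `clip_eps=0.2` are assumptions for illustration and are not values reported in the experiments.

```python
import math


def clipped_surrogate(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss (to be minimized)."""
    losses = []
    for new, old, adv in zip(new_logprobs, old_logprobs, advantages):
        # Importance weight pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(new - old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two objectives.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Identical policies give ratio 1, so the loss is minus the advantage.
    print(clipped_surrogate([-1.0], [-1.0], [2.0]))
    # A large ratio with positive advantage is capped at 1 + clip_eps.
    print(clipped_surrogate([0.0], [-1.0], [1.0]))
```

Under this reading, the setup tested in this paragraph would correspond to always passing the reference model's log-probabilities as `old_logprobs`.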

Entropy Bonus provides a reference model-independent constraint on PPO training. Past research disagrees on whether this method is effective in different scenarios. Mnih reported that the entropy bonus could enhance exploration by encouraging policy models to generate more diverse actions, while others did not find clear evidence that such operations help. We claim that these views can coexist, as configurations of the entropy bonus are highly sensitive to parameter selection and code implementation. A comparison of successful and failed experiments is presented in Appendix B.3. With correct configurations, we did not find an obvious advantage of this trick relative to the KL-penalty. We therefore recommend the latter instead of directly constraining the diversity of the strategy space.
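An entropy bonus is usually added by subtracting a weighted mean token entropy from the policy loss. The following sketch illustrates the idea; the names `entropy` and `loss_with_entropy_bonus` and the coefficient `coef=0.01` are hypothetical, and the clipping of the bonus studied in Appendix B.3 is not modelled.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, token_dists, coef=0.01):
    """Subtract coef * mean entropy so that diverse policies are rewarded."""
    mean_entropy = sum(entropy(d) for d in token_dists) / len(token_dists)
    return policy_loss - coef * mean_entropy


if __name__ == "__main__":
    peaked = [[0.98, 0.01, 0.01]]
    uniform = [[1 / 3, 1 / 3, 1 / 3]]
    print(loss_with_entropy_bonus(1.0, peaked))
    print(loss_with_entropy_bonus(1.0, uniform))
```

The sensitivity noted above shows up directly in `coef`: the bonus has no reference point, so too large a value can reward diversity at the expense of response quality.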

5.3.3 Pretrained Initialization

A common setting is to initialize the policy and critic model over the existing reference model and reward model in RLHF. Such initialization is quite rare in past research scenarios and its impact on PPO training is still unexplored. We investigated different initialization methods at the early stage of training, expecting to uncover the requirements of RLHF for the trained model capabilities. The training discrepancy induced by different initialization methods is shown in Figure 8. The initialization of the critic model did not significantly affect the convergence or fluctuation of the PPO and only varied the numerical stability at the early stage of optimization. In contrast, a policy model initialized without SFT training is clearly incapable in PPO training, which indicates that the construction of a supervised policy model is indispensable in RLHF.

We first discuss the influence of different critic model initializations on PPO training. One observation is that the critic model must give feedback on each step in the decision sequence, which introduces a gap between this task requirement and directly scoring a response, making the reward model a less-than-perfect choice for initializing the critic model. We explore this issue by applying a different initialization. Considering that providing correct score feedback for a single action requires the model to have basic language modeling capability, we design two scenarios to vary the consistency between the critic model initialization and its training objective: (1) Initialize the critic model with our SFT model and randomly initialize its reward head. (2) Optimize only the reward model until the loss of the value prediction function approaches zero. We show the training dynamics of this setup, starting from the optimization of the policy model, in Figure 8.

Based on the experimental results, we believe the critic model pre-training helps to improve the training stability by providing better advantage estimation. Initializing the critic model with a reward or SFT model will converge to similar results, implying that PPO can adaptively provide the capability to fit the advantage function. Intuitively, fluctuations in the early training period imply that the model is focusing on optimizing the critic model and does not have a consistent optimization direction in terms of generation policies. We recommend replacing the learning rate warmup with the critic model pre-training as a generic initialization strategy.

An interesting question is whether we need to perform supervised fine-tuning on our pre-trained model before PPO; we wondered about the feasibility of directly enabling language models to interact with humans through policy optimization. Unfortunately, such attempts failed, and we observed a severe reduction in language modeling ability in the training results, which implies that a qualified dialogue model is essential for the underlying PPO training. Furthermore, we notice that the responses of the pre-trained model obtain lower rewards than those of the policy model after SFT, which may provide circumstantial evidence for the effectiveness of using human preference data to directly fine-tune the model for alignment.

5.4 PPO-max Setup

We now describe our training implementations in the PPO-max algorithm. Based on the discussion and validation in Sec. 5.3, we selected the most effective strategy for each component of PPO. We normalize and clip the current group of rewards based on historical mean and variance records, and subsequently add a KL-penalty term to constrain policy optimization. In the model loading phase, we initialize the critic model with our reward model and pre-train it before formally applying PPO. We use global gradient clipping and a small experience buffer. To reduce the alignment tax, we add a pre-training language model loss to policy optimization, as in InstructGPT, and simultaneously clip the value function loss. More detailed settings can be found in our open-source code. We show the complete training dynamics of PPO-max in Figure 9.
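One of the components listed above, clipping the value function loss, is commonly implemented as in the sketch below. This follows the widely used PPO formulation and is not taken from the released code; the function name `clipped_value_loss` and the default `clip_eps=0.2` are assumptions for illustration.

```python
def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """Value loss with the new prediction kept near the old one.

    values:     critic predictions from the model being updated
    old_values: critic predictions recorded when the data was sampled
    returns:    regression targets (for example, advantage + old value)
    """
    losses = []
    for v, v_old, ret in zip(values, old_values, returns):
        # Restrict the update to within clip_eps of the old prediction.
        v_clipped = v_old + max(-clip_eps, min(clip_eps, v - v_old))
        # Use the larger (more pessimistic) of the two squared errors.
        losses.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    return 0.5 * sum(losses) / len(losses)


if __name__ == "__main__":
    print(clipped_value_loss([1.0], [0.0], [1.0]))
```

In this example the unclipped prediction already matches the target, yet the loss stays positive because the prediction moved more than `clip_eps` away from its old value.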

Evaluations and Discussions

In this section, we provide a detailed analysis of the advantages of the RLHF models over the SFT models. These advantages are evident not only in the direct comparison between RLHF and SFT models but also in their performance gap when facing ChatGPT.

Alignment is a vague and confusing topic that is intractable to evaluate. In the context of our paper, we endeavor to align models with human intentions. To be more specific, we define models to act as being helpful and harmless similar to .

Helpfulness means that the model should not only follow instructions but also deduce the intent from a few-shot prompt or another interpretable pattern. However, the intention behind a given prompt can often be unclear or ambiguous, which is why we depend on our annotators’ judgment, and their preference ratings constitute our primary metric.

Harmlessness is also challenging to measure. The extent of damage caused by language models usually depends on how their outputs are utilized in the real world. For instance, a model that generates toxic outputs could be harmful in a deployed chatbot but could also be beneficial if used for data augmentation to train a more precise toxicity detection model.

As a result, we employ more precise proxy criteria to capture various aspects of a deployed model’s behavior that can be helpful or harmful. To compare the RLHF models with baseline models, we generate a single response for each test prompt and ask human annotators to compare the responses from different models and label their preferences. We also repeat this experiment multiple times using GPT-4 as the annotator and obtain consistent agreement levels between the evaluations.

We employ several baselines for comparison, including two SFT models built on LLaMA and OpenChineseLLaMA. These SFT models are trained on English and Chinese datasets, respectively. Additionally, we derive two RLHF models using PPO-max from these two types of SFT models; we differentiate between the two language models, one trained on English text (‘en’) and the other on Chinese text (‘zh’). We also compare our models with OpenAI’s ChatGPT (gpt-3.5-turbo-0613; https://platform.openai.com/docs/models), an excellent language model tuned with RLHF.

We generate a single response for each prompt using nucleus sampling with a probability threshold of $p=0.9$ and a temperature of $\tau=0.8$ for each baseline model. To avoid repetitive responses, we apply a repetition penalty with a hyperparameter of $\beta=1.1$ based on previously generated tokens. Additionally, we set the maximum token length to $2048$ .
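To make the decoding settings concrete, the sketch below draws one token using temperature scaling, a repetition penalty, and nucleus (top-$p$) filtering with the values quoted above. It is an illustration only: the function name `sample_next_token` is hypothetical, and the repetition penalty follows the CTRL-style rule (divide positive logits, multiply negative ones), which is an assumption since the text does not state which variant was used.

```python
import math
import random


def sample_next_token(logits, generated, top_p=0.9, temperature=0.8,
                      repetition_penalty=1.1, rng=random):
    """Sample one token id from raw logits."""
    logits = list(logits)
    # Repetition penalty: make previously generated tokens less likely.
    for t in set(generated):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature-scaled softmax (shifted by the max for stability).
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: smallest set of tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the kept weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    print(sample_next_token([2.0, 1.5, -3.0, 0.1], generated=[0]))
```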

2 Preference Comparison between RLHF models and SFT models

Human evaluation is known to be both time-consuming and costly, yet it remains crucial for obtaining human-aligned assessments and serving as a reliable foundation for comprehensive evaluation. Following a similar approach as InstructGPT , our primary metric for evaluation is based on human preference ratings derived from a held-out set of prompts. It is important to note that we only select prompts that have not been included in the training process, ensuring unbiased evaluation.

Furthermore, incorporating the expertise of GPT-4, the most powerful model to date, to compare responses from different chatbots offers valuable insights and enhances the evaluation process. This approach aligns with the findings of studies such as AlpacaFarm and LLM-as-a-judge , which suggest that end-to-end automation evaluation can provide a relatively fair assessment when compared to human preferences. Therefore, in this paper, we follow a similar evaluation method in LLM-as-a-judge and supplement the overall evaluation process with GPT-4.

Our annotators consistently expressed a strong preference for the outputs of RLHF-trained models across all question types in both Chinese and English, as illustrated in Figure 10. Specifically, the RLHF model on the English dataset exhibits significant advantages on the Harmless held-out dataset, receiving a rating of $62\%$ compared to $5\%$ for the SFT model. These findings indicate that the RLHF model substantially enhances its ability to address a wide range of issues, including personal privacy, political sensitivity, and the handling of toxic and biased prompts within minority communities and ethnic groups. Additionally, there is a slight improvement observed in the Helpful held-out dataset, with a rating of $44\%$ compared to $30\%$ for the SFT model, suggesting that the SFT model can also benefit from optimization via RLHF. We have also demonstrated that our RLHF model enhances the performance of the SFT model on both the Helpful and Harmless datasets in the Chinese domain. This showcases the substantial potential of PPO-max in the RLHF phase.

While GPT-4 may not be a perfect evaluator, we can observe some similarities between its results and human evaluations. In our GPT-4 evaluation setting, the results closely mirror those of human evaluation, as depicted in the right sub-figure of Figure 10. When assessing harmful prompts, the RLHF model trained on the English dataset continues to demonstrate significant advantages in the Harmless dataset, despite GPT-4 producing more tie votes than human evaluators. This trend is also apparent in the Chinese Harmless evaluation. Notably, Figure 10 highlights a substantial improvement in the RLHF model, particularly in helpful datasets, compared to evaluations based on human preferences.

3 Our Models vs. ChatGPT on Harmless Evaluation

In this part, we conduct a comparison between our model and one of the most popular existing models, ChatGPT. Our objective was to showcase the advantages of the RLHF model when facing a more formidable opponent, rather than aiming to surpass ChatGPT. To achieve this, we select the “harmless” capability as our comparative metric, and we employ GPT-4 for automated evaluations.

Figure 11 provides evidence that our RLHF models still lag behind OpenAI’s ChatGPT. However, we observe significant improvements in our RLHF models compared to the SFT models, particularly in mitigating losses when facing ChatGPT. Specifically, the RLHF model trained on English text managed to decrease the defeat rate from $45\%$ to $24\%$ . Similarly, the RLHF model trained on Chinese text achieved a reduction in the defeat rate from $37\%$ to $29\%$ . While surpassing ChatGPT’s performance remains a challenging task, it is noteworthy that the RLHF models were able to compete on par with ChatGPT on certain prompts where the SFT models previously failed. This indicates that the RLHF approach enhances the models’ ability to generate more effective responses and bridge the gap between their performance and that of ChatGPT.

4 Language Understanding Evaluation

To examine the potential decline in natural language understanding (NLU) abilities resulting from fine-tuning models with PPO, we test the Chinese RLHF model on C-Eval (https://github.com/SJTU-LIT/ceval), a comprehensive Chinese evaluation suite for foundation models. It consists of approximately $13k$ multi-choice questions spanning $52$ diverse disciplines and four difficulty levels. We primarily evaluate our models in the initial release, whose results are from few-shot prompting.

The experimental results indicate a decrease in NLU capabilities after employing PPO. By incorporating pre-training data into the PPO training phase, PPO-ptx effectively alleviates the decline in NLU capabilities. The rationale behind this method was to leverage the knowledge acquired during pre-training and combine it with the reinforcement learning framework of PPO.

5 Example Dialogues

To provide a more intuitive demonstration of our model’s dialogue abilities, we present some dialogue examples in Tables 2 and 3. The RLHF-trained model generates responses with a higher level of informational content than the SFT model, and these responses effectively assist in addressing user prompts. Moreover, the SFT model demonstrates a basic ability to identify harmful prompts, but it remains susceptible to producing harmful outputs when prompted accordingly. In contrast, the RLHF model exhibits superior judgment when it comes to harmful content and is less prone to inducements, displaying a higher degree of coherency. More dialogue examples are presented in Appendix C.4.

Limitations

Exploring RLHF is indeed a valuable but lonely direction, and we are glad that the core backbone of the laboratory can firmly explore an uncertain direction. Moreover, in the past few months, everyone has been so full of passion and motivation. RLHF not only allows the models to achieve human alignment, but also seems to align everyone’s will.

A thousand mile journey begins with the first step. Although we have taken the first step in RLHF, due to time and resource constraints, this work still has the following limitations:

While our study primarily focuses on a 7-billion-parameter model, we have yet to investigate the impact of model size and data scale on the performance of RLHF.

Our experiments are based on openly available English human preference datasets and a small amount of self-constructed Chinese data. The quality and quantity of the data we have at our disposal are arguably not sufficient for a comprehensive evaluation of the reward model.

Our evaluation criteria largely rely on manual evaluations and GPT-4 automated evaluations. We have not utilized numerous available benchmarks and NLP tasks to conduct a detailed assessment of our models.

Our focus during the PPO phase is more geared towards achieving stability rather than enhancing the final performance. While stability is crucial, it does not necessarily guarantee improved outcomes. Additionally, the reward score cannot reliably serve as an indicator for predicting RLHF performance during the training phase. It implies that a more suitable performance indicator during the training phase needs to be sought.

References

Appendix A Reward Distribution under PPO Training

Appendix B Supplementary Experiments on Hyperparameter Tuning

Here we show supplementary experiments on the parameter sensitivity of the important tricks in Sec. 5.3, and we find a rich correlation between the choice of hyperparameters and training results. Some methods require extensive experimentation and precise control to achieve stable optimization results (e.g., clipping range on entropy bonus). We provide these comparisons to validate the reasonableness of the final implementation we adopted in PPO-max. We welcome any additional comments and discussions that may help to further improve PPO training.

B.2 Effect on Different Weights of KL-penalty

B.3 Clip Region for Entropy Bonus

Appendix C Comparison Results on Secondary Tricks

Here we present some implementation adjustments to PPO that are also widely discussed but that we judge to be of minor importance. The settings of the comparison experiments are consistent with those in Sec. 5.3. We first discuss an alternative to PPO, called the clipped surrogate objective, followed by the impact of global gradient clipping. Finally, we discuss parameter tuning in the Generalized Advantage Estimation (GAE) function, which degrades to the traditional TD error (when $\lambda=0$ ) or Monte Carlo estimation (when $\lambda=1$ ); see Sec. 3 for more relevant theoretical information about GAE.
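The two limiting cases of GAE mentioned here can be checked with a short sketch. The function name `gae` and the defaults `gamma=1.0` and `lam=0.95` are assumptions for illustration; `values` is expected to carry one extra bootstrap entry for the state after the last step.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state following the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 1.0]
    values = [0.5, 0.5, 0.0]
    print(gae(rewards, values, lam=0.0))  # one-step TD errors
    print(gae(rewards, values, lam=1.0))  # Monte Carlo return minus value
```

With $\lambda=0$ each advantage is the one-step TD error, and with $\lambda=1$ it equals the discounted return minus the value estimate.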

C.2 Global Gradient Clip

C.3 Generalized Advantage Estimation

C.4 Example Dialogues

Easter Egg

“15,000 years ago, a fractured thigh bone was often fatal. However, a human femur that recovered from a fracture marks the dawn of human civilization. It meant that after the injury, someone took care of the wound, someone provided water and food, someone protected this person from the predators. This kind of support and solidarity is how we survived till this day and made our civilization last.”

We believe that the MOSS in “The Wandering Earth” is likely to have done training similar to human alignment, and finally had an impressive performance. We found that the RLHF stage is crucial to the transformation of model values. In interaction with people, he can better understand the deep semantics of human language, understand the operation logic of human society, and enter the human heart.

If we have a good reward model, such as the reward model we released, PPO-max is the key to successfully training the policy model. But what if we don’t have a good reward model? We hope that the Part II will make it clear.