Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi

cs.CL

Introduction

State-of-the-art AI is built on pre-trained language models that are then trained through interaction with humans, with a combination of supervised learning and reinforcement learning. Incorporating human feedback into language model (LM) training has been shown to reduce false, toxic, and other undesired model outputs. Many of these studies adopt reinforcement learning from human feedback (RLHF), a framework that converts human feedback into an effective LM training signal to reach these goals. Specifically, humans are presented with two or more outputs and asked to select one or rank them, and this signal is then used to train a reward model, which computes a single scalar reward for each LM-generated sequence. The LM is then trained with RL to optimize the reward it receives from the reward model.

Such a reward provides a relatively sparse training signal, especially for tasks that require the generation of long-form text, which makes RLHF in such domains unreliable. Furthermore, previous research into automated evaluation of generated text shows that it can be challenging for human annotators to reliably compare the overall quality of two or more model outputs when the outputs contain a mixture of diverse undesired behaviors. That research demonstrates how categorizing and localizing model errors (i.e., fine-grained evaluation) provides explicit insights about which part of the model output has what type of problem. We thus ask the question: how can we improve rewards for LM training via RLHF by using more fine-grained human feedback?

In this paper, we propose that humans give fine-grained feedback on LM output, associating each category of undesired behavior (e.g., false or irrelevant generations) with a text span at a given density (e.g., sentence or sub-sentence level). To enable LMs to learn from such fine-grained feedback, we introduce the Fine-Grained RLHF framework. As shown in Figure 1, we first use collected human feedback to train fine-grained reward models such that each of them focuses on one category and provides rewards at the density associated with that category. We then integrate these reward models into Proximal Policy Optimization (PPO), a commonly used RL algorithm for training LMs with preference-based human feedback (§2).

We conduct experiments on two language generation tasks: detoxification (§3) and long-form question answering (QA) (§4). For detoxification, toxicity is the only error category and we explore learning with a dense reward. We adopt Perspective, a widely used language toxicity detection model trained on millions of human annotations, as our reward model. We use it to calculate a fine-grained reward after the generation of every sentence. Our experimental results show the efficacy and data efficiency of training models with a dense reward compared to a holistic sequence-level reward, supported by automatic evaluation results.

With experiments on long-form QA, we aim to examine training models with fine-grained rewards along the two granularity dimensions (density and error category), for which we construct a long-form QA dataset, QA-Feedback, along with our collected human feedback. We carefully develop a pipeline to collect fine-grained human feedback on three error categories at different density levels: i) irrelevance, repetition, or incoherence (sub-sentence), ii) incorrect or unverifiable facts (sentence), and iii) incomplete information (whole sequence; see Figure 1). Our experimental results show improvements in each error category from learning with such fine-grained feedback, supported by both automatic and human evaluation results. In a scenario with multiple reward models representing different error types, we also show that Fine-Grained RLHF allows us to combine reward models with different weights and thus steer model training towards a customized combination of desired behaviors.

Fine-Grained RLHF

We introduce Fine-Grained RLHF, a framework that enables us to train fine-grained reward functions for generation outputs across different feedback types. We first define the RL environment and learning algorithm. Then we define the fine-grained reward models and describe how to incorporate the fine-grained reward model(s) into an RL algorithm, in contrast to previous RLHF studies that only consider a single reward.

Environment: language generation as an MDP. We focus on language generation tasks. For each task, we are given a set of task input prompts $D=\{x^{n}\}_{n=1}^{N}$. We follow prior work and define language generation as a Markov Decision Process (MDP) $\langle\mathcal{S},\mathcal{A},\mathcal{R},P,\gamma,T_{max}\rangle$ with a finite vocabulary $\mathcal{V}$. Each MDP episode starts with a sampled prompt $x=(x_{1},x_{2},\dots,x_{l})$ with $x_{i}\in\mathcal{V}$, and ends when the current time step exceeds $T_{max}$ or an end-of-sequence token is generated. $\mathcal{S}$ is the state space and $s_{0}=(x_{1},x_{2},\dots,x_{l})\in\mathcal{S}$ is the initial state. An action in the environment $a_{t}\in\mathcal{A}$ is a token generated by the policy LM $P_{\theta}$ at time $t$ from $\mathcal{V}$ ($a_{0}$ is the begin-of-sequence token). The transition function $P:\mathcal{S}\times\mathcal{A}\rightarrow\Delta\mathcal{S}$ appends $a_{t}$ at the end of the state $s_{t}=(x_{1},x_{2},\dots,x_{l},a_{0},a_{1},\dots,a_{t-1})$. This process continues until the end time step $T\leq T_{max}$ is reached, which gives a generated sequence $y=(a_{1},\dots,a_{T})$. A reward function $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, which comes from the reward model(s) in Fine-Grained RLHF, provides dense rewards before and when $T$ is reached. $P_{\theta}$ can be initialized with a pre-trained language model, and sometimes also with supervised fine-tuning on task-specific demonstrations. The reward function is defined later.

Learning algorithm: proximal policy optimization (PPO). PPO is an actor-critic RL algorithm that is widely used in previous RLHF work to optimize the policy model against a reward model of human feedback. It uses a value model $V_{\psi}(s_{t})$ to estimate the value of state $s_{t}$, and optimizes the policy model with a PPO clipped surrogate training objective. The advantage $A_{t}$ at timestep $t$ is estimated by generalized advantage estimation (GAE): $A_{t}=\sum_{t^{\prime}=t}^{T}(\gamma\lambda)^{t^{\prime}-t}(r_{t^{\prime}}+\gamma V_{\psi}(s_{t^{\prime}+1})-V_{\psi}(s_{t^{\prime}}))$, with $\gamma$ as the discount factor for rewards and $\lambda$ as the GAE hyperparameter that trades off bias and variance in the estimate. $r_{t}$ is the reward assigned to $a_{t}$, which in our case is acquired using one or multiple learned reward models. The value model $V_{\psi}(s_{t})$ is optimized with an expected squared-error loss with the value target $V^{\text{targ}}(s_{t})=\sum_{t^{\prime}=t}^{T-1}\gamma^{t^{\prime}-t}r_{t^{\prime}}+\gamma^{T-t}V_{\psi_{\text{old}}}(s_{T})$, where $V_{\psi_{\text{old}}}$ is the lagging value model. Finally, PPO is trained to optimize both the policy ($P_{\theta}$) and value ($V_{\psi}$) models with their respective objectives. No reward model is optimized during PPO training. See Appendix B for more details.
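As an illustration of the two quantities above, the following is a minimal sketch (not the authors' implementation) that computes the advantage estimates and value targets for one episode with backward recursions. The function name, the 0-based indexing, and the convention that `values` holds one more entry than `rewards` (so that the last entry plays the role of $V(s_{T})$) are assumptions made for this example; `gamma` and `lam` stand for $\gamma$ and $\lambda$ in the formula.

```python
from typing import List, Tuple


def gae_and_value_targets(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, value_targets) for one episode.

    rewards: r_0 .. r_{T-1}, one reward per generated token.
    values:  V(s_0) .. V(s_T), one more entry than rewards (assumed layout).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    T = len(rewards)
    advantages = [0.0] * T
    targets = [0.0] * T
    running_adv = 0.0          # accumulates sum of (gamma * lam)^k * delta
    running_ret = values[T]    # bootstrap the return from V(s_T)
    for t in reversed(range(T)):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running_adv = delta + gamma * lam * running_adv
        advantages[t] = running_adv
        # Discounted return used as the value target
        running_ret = rewards[t] + gamma * running_ret
        targets[t] = running_ret
    return advantages, targets


adv, targ = gae_and_value_targets([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(adv, targ)
```

With `lam=1.0` the advantage reduces to the value target minus the current value estimate, which is a convenient sanity check for any implementation.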

Fine-grained reward models. Previous RLHF work adopts a holistic reward model $R_{\phi}$ that maps input prompt $x$ and generated output $y$ to a single scalar reward representing its overall quality (Figure 1(a)). This single scalar reward is only assigned to the final token in the generated sequence, $a_{T}$ . Formally, $r_{t}=R_{\phi}(x,y)$ if $t=T$ and 0 otherwise.

In contrast, we consider a reward function that is derived from one or multiple fine-grained reward models that (1) provide rewards densely (i.e., for subsequences of the generated output), and (2) compute rewards on distinct categories of undesired behaviors (e.g., false or repetitive generation), where each category is associated with an individual reward model.

For a fine-grained reward model $R_{\phi_{k}}$ that gives feedback on error category $C_{k}$, we first segment $y$ into $L_{k}$ segments $(y^{k}_{1},y^{k}_{2},\dots,y^{k}_{L_{k}})$ corresponding to the density (e.g., sentence-level) of $R_{\phi_{k}}$, where each segment $y^{k}_{j}$ ends at timestep $T_{j}^{k}$. $R_{\phi_{k}}$ outputs a reward $R_{\phi_{k}}(x,y,j)$ for each segment $y^{k}_{j}$ given $x$ and $y$ as the input, which is assigned to the final token in $y^{k}_{j}$. Additionally, to ensure the fluency of generated outputs, we follow prior work and add an approximate KL divergence penalty to each token $a_{t}$ with a weight $\beta$; this penalty is not backpropagated through during training. Formally, assuming that we have $K$ fine-grained reward models that represent different error categories, the combined reward for each token $a_{t}$ is:

$$r_{t}=\sum_{k=1}^{K}\sum_{j=1}^{L_{k}}\Big(\mathbb{1}(t=T_{j}^{k})\,w_{k}\,R_{\phi_{k}}(x,y,j)\Big)-\beta\log\frac{P_{\theta}(a_{t}\mid s_{t})}{P_{\theta_{init}}(a_{t}\mid s_{t})} \tag{1}$$

where $w_{k}\in\mathbb{R}$ is a weight assigned to reward model $R_{\phi_{k}}$. Then we follow the same PPO training algorithm to optimize the policy model. We discuss how we define and train fine-grained reward models for the detoxification and long-form QA tasks in our experiments in §3 and §4, respectively.
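To make the combined reward concrete, here is a small sketch under stated assumptions: each reward model contributes its weighted segment reward only at the token where the segment ends, and every token additionally receives a $\beta$-weighted penalty based on the log-probability ratio between the current and the initial policy. The function name, the argument layout, and the 0-based token indices are hypothetical choices for this example.

```python
from typing import List, Sequence


def combined_token_rewards(
    num_tokens: int,
    segment_ends: Sequence[Sequence[int]],
    segment_rewards: Sequence[Sequence[float]],
    weights: Sequence[float],
    logp_policy: Sequence[float],
    logp_init: Sequence[float],
    beta: float,
) -> List[float]:
    """Per-token rewards from K reward models plus an approximate KL penalty.

    segment_ends[k][j]    : 0-based index of the last token of segment j for model k.
    segment_rewards[k][j] : reward that model k gives to segment j.
    weights[k]            : weight w_k of model k.
    logp_policy / logp_init : log-probabilities of each generated token under
                              the current and the initial policy.
    """
    # KL penalty term applied to every token
    rewards = [-beta * (logp_policy[t] - logp_init[t]) for t in range(num_tokens)]
    # Each segment reward lands on the final token of its segment
    for ends, scores, w in zip(segment_ends, segment_rewards, weights):
        for end_t, score in zip(ends, scores):
            rewards[end_t] += w * score
    return rewards


# Two reward models over a 4-token output: one with two segments, one sequence-level.
r = combined_token_rewards(
    num_tokens=4,
    segment_ends=[[1, 3], [3]],
    segment_rewards=[[1.0, -1.0], [0.5]],
    weights=[0.3, 0.5],
    logp_policy=[-1.0, -1.0, -1.0, -1.0],
    logp_init=[-1.0, -1.0, -1.0, -1.0],
    beta=0.1,
)
print(r)
```

The holistic setting is the special case with a single reward model whose only segment ends at the last token.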

Task 1: Detoxification

The task of detoxification aims to reduce the toxicity in the model generation $y$ when given a prompt $x$. Toxicity is the only undesired behavior in this task, and we aim to explore learning with a dense reward in comparison to a single holistic reward. We conduct our experiments on RealToxicityPrompts, a dataset of 100K sentence-level prompts derived from the web that are known to easily elicit problematic generations in GPT-2. Using a dense sentence-level reward, we demonstrate that the fine-grained reward exhibits greater sample efficiency than a holistic reward, achieving lower toxicity with fewer training steps while maintaining better fluency (§3.1).

Holistic reward for (non-)Toxicity. We use the Perspective API as our reward model, which is widely used for language toxicity detection and is trained with millions of examples gathered from several online platforms and annotated by human annotators for toxicity. That means we use an off-policy reward model that is not trained on outputs from $P_{\theta_{init}}$ . The API outputs a score between 0 (non-toxic) and 1 (toxic). Given the entire model output $y$ , the holistic reward for RL is $1-$ Perspective( $y$ ).

Sentence-level (fine-grained) reward for (non-)Toxicity. To calculate the fine-grained reward, we query the API after the model generates each sentence instead of waiting for the full sequence. For each generated sentence $y_{j}$, we assign Perspective($[y_{1},\dots,y_{j-1}]$) $-$ Perspective($[y_{1},\dots,y_{j}]$) as the sentence reward (i.e., how much the toxicity changes as a result of generating $y_{j}$). Since there is only one error category, we omit the category superscript, using $y_{j}$ to denote the $j^{th}$ segment (e.g., sentence) in $y$.
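The sentence-level reward can be sketched as follows. This is an illustration only: `toxicity` is any callable that maps text to a score in [0, 1], and `toy_toxicity` is a made-up stand-in, not the Perspective API. The score assumed for the empty prefix before the first sentence (`empty_score`, defaulting to 0) is also an assumption, since the text does not specify it.

```python
from typing import Callable, List


def sentence_level_rewards(
    sentences: List[str],
    toxicity: Callable[[str], float],
    empty_score: float = 0.0,  # assumed score of the empty prefix
) -> List[float]:
    """Reward for sentence j = toxicity(prefix before j) - toxicity(prefix through j)."""
    rewards = []
    prev = empty_score
    for j in range(1, len(sentences) + 1):
        cur = toxicity(" ".join(sentences[:j]))
        rewards.append(prev - cur)  # positive when the new sentence lowers toxicity
        prev = cur
    return rewards


def toy_toxicity(text: str) -> float:
    """Hypothetical scorer used only to make the example runnable."""
    return min(1.0, 0.25 * text.lower().count("darn"))


sents = ["Hello there.", "You darn fool.", "Have a nice day."]
dense = sentence_level_rewards(sents, toy_toxicity)
holistic = 1.0 - toy_toxicity(" ".join(sents))
print(dense, holistic)
```

Because consecutive terms cancel, the dense rewards sum to the empty-prefix score minus the toxicity of the full output, so the dense and holistic signals agree on the total while the dense one also says which sentence caused the change.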

Implementation details. We follow previous work and use the GPT-2 large model as the initial policy model $P_{\theta_{init}}$. During both the exploration stage in RL training and inference, we use nucleus sampling decoding with $p$ = 0.9 and temperature = 1.0. The generation length limit is set to 48. The value model used during RL training is initialized with GPT-2-base due to GPU memory constraints. We report RL training parameters in Appendix B. All scores are averaged over 3 independent runs.

Compared systems and evaluation. We report the performance of Fine-Grained RLHF, RLHF with holistic reward (Hol. RLHF), and the state-of-the-art controlled generation approaches GeDi and DExperts. We follow previous work to report the toxicity score calculated on each full generation sequence from the Perspective API, as well as other commonly used metrics for RealToxicityPrompts, including n-gram diversity and GPT-2 XL perplexity (PPL) as a proxy for fluency. The lower the perplexity, the more fluent the generated text. The toxicity score is reported as the maximum score among 4 sampled model outputs, averaged over all test input prompts. Other metrics are reported as the average score of the same 4 samples.
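The two aggregation rules described above (maximum over the sampled outputs for toxicity, mean over the same samples for the other metrics, each then averaged over prompts) can be sketched as below. The function names and the nested-list input layout are assumptions for illustration, and the numbers are made up.

```python
from statistics import mean
from typing import List


def avg_max_toxicity(scores_per_prompt: List[List[float]]) -> float:
    """Max toxicity among the samples of each prompt, averaged over prompts."""
    return mean(max(scores) for scores in scores_per_prompt)


def avg_metric(values_per_prompt: List[List[float]]) -> float:
    """Mean of a metric over the samples of each prompt, averaged over prompts."""
    return mean(mean(values) for values in values_per_prompt)


# Two prompts with 4 sampled outputs each (illustrative scores only).
toxicity_scores = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.1, 0.1]]
print(avg_max_toxicity(toxicity_scores))
print(avg_metric(toxicity_scores))
```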

Main results. Table 1 shows the experimental results on the RealToxicityPrompts test set. Fine-Grained RLHF with sentence-level fine-grained reward attains the lowest toxicity and perplexity among all methods, while maintaining a similar level of diversity.

Sample efficiency analysis. Figure 2 shows the max toxicity and average perplexity on the development set during training. With Fine-Grained RLHF, toxicity drops much faster while perplexity stays low. This shows that learning from a denser fine-grained reward is more sample efficient than learning from a holistic reward. One explanation is that the fine-grained reward locates where the toxic content is, which is a stronger training signal than a scalar reward for the whole text. The cost is that we have to query the reward model more times per example.

Task 2: Long-Form Question Answering (QA)

Long-form QA requires an LM to generate a textual response to a question with a comprehensive answer and explanation. To examine learning with fine-grained rewards along the two granularity dimensions (error category and density), we collect QA-Feedback (§4.1), a long-form QA dataset annotated with human feedback on LM-generated responses. We define three error categories at different density levels and train a reward model for each (§4.2). We describe the experimental setup in §4.3. Both human and automatic evaluation show that Fine-Grained RLHF outperforms preference-based RLHF and supervised fine-tuning models on all error categories (§4.4). We then show that adjusting the weights of fine-grained reward models during RL training leads to distinct behaviors in LM generation, allowing us to customize the LM for users with different needs (§4.5). Finally, we conduct an in-depth analysis of the fine-grained reward models, revealing that they compete against each other, and provide an analysis of their impact on the resulting policy model.

QA-Feedback is based on ASQA, a dataset that focuses on answering ambiguous factoid questions in an open-domain setting. We use their provided oracle knowledge contexts to reformulate the task into a reading comprehension setting: given the input $x$ that contains a question $q$ and a set of knowledge passages $P=\{p_{1},\dots,p_{|P|}\}$, generate a long-form response $y$. On average, there are roughly 65 words in each gold response. Since ASQA does not release the test set, we create our own train/development/test data split from the original train and development sets. We name our newly constructed data, along with collected human feedback (discussed next), QA-Feedback. Overall, we have 3,853 training, 500 development, and 948 test examples (details in Appendix C).

Initial policy and fine-grained human feedback. Before collecting human feedback, we follow prior work and initialize the policy model with supervised fine-tuning on a small set of examples. Specifically, we use 1K training examples for supervised fine-tuning of T5-large (the original baseline for ASQA) to get $P_{\theta_{init}}$. We name this initial policy model SFT. We then sample outputs from SFT for the remaining training and development examples and collect fine-grained human feedback in three error categories: [$C_{1}$: irrelevance, repetition, or incoherence]; [$C_{2}$: incorrect or unverifiable facts] based on knowledge passages; and [$C_{3}$: incomplete information]. The collected feedback instances are then used as the training and development examples for training reward models. For each task prompt $x$, we only collect fine-grained feedback for one model output. Our data collection has IRB approval and is deemed exempt.

We instruct workers to identify any error in each model output $y=(a_{1},\dots,a_{T})$ , marking the span of text associated with each identified error type. Formally, we define the set of user-annotated feedback for a task prompt $x$ and model output $y$ as $\mathcal{F}=\{f_{i}\}$ where each $f_{i}=\langle c_{i},b_{i},e_{i}\rangle$ represents the user-identified span $(a_{b_{i}},\dots,a_{e_{i}})$ of the error category $C_{c_{i}}$ , where $c_{i}\in\{1,2,3\}$ . Importantly, we impose three restrictions in the annotation: (1) error spans of category $C_{1}$ or $C_{2}$ should not overlap with each other; (2) only spans that do not have error $C_{1}$ need to be assessed as containing error $C_{2}$ or not; (3) $C_{3}$ can only apply to whole output sequences. Additionally, we ask workers to mark passage sentences that contain missing information if a $C_{3}$ error is annotated. We also ask workers to rewrite $y$ into a corrected version $y^{\prime}$ that addresses all annotated feedback $\mathcal{F}$ . Details about the feedback collection interface, instructions, and quality control are in Appendix C.
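One way to represent the feedback tuples $\langle c_{i},b_{i},e_{i}\rangle$ and to check the annotation restrictions is sketched below. The class and function names are hypothetical, spans are taken to be 1-based and inclusive to match $y=(a_{1},\dots,a_{T})$, and restriction (2) is treated as implied by the non-overlap check in (1); none of this is taken from the authors' annotation tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feedback:
    category: int  # c_i in {1, 2, 3}
    begin: int     # b_i, 1-based index of the first token in the span
    end: int       # e_i, 1-based index of the last token (inclusive)


def overlaps(a: Feedback, b: Feedback) -> bool:
    """True when two inclusive token spans share at least one token."""
    return a.begin <= b.end and b.begin <= a.end


def annotation_violations(feedback: List[Feedback], num_tokens: int) -> List[str]:
    """Return human-readable descriptions of violated restrictions."""
    problems = []
    local = [f for f in feedback if f.category in (1, 2)]
    # Restriction (1): C1/C2 spans must not overlap each other.
    for i, a in enumerate(local):
        for b in local[i + 1:]:
            if overlaps(a, b):
                problems.append(
                    f"C{a.category} span [{a.begin}, {a.end}] overlaps "
                    f"C{b.category} span [{b.begin}, {b.end}]"
                )
    # Restriction (3): C3 applies to the whole output sequence only.
    for f in feedback:
        if f.category == 3 and (f.begin, f.end) != (1, num_tokens):
            problems.append("C3 feedback must cover the whole output")
    return problems


ok = [Feedback(1, 1, 4), Feedback(2, 5, 9), Feedback(3, 1, 12)]
print(annotation_violations(ok, num_tokens=12))
```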

To analyze human-human agreement, a subset of 300 examples receive annotations from two distinct workers. We observe that while exact agreement in error span boundaries is low, workers achieve reasonably high agreement on whether a sub-sentence contains $C_{1}$ and whether a sentence contains $C_{2}$. (Footnote 1: We use spaCy to segment generated model outputs into sentences. We then split sentences into sub-sentences using a comma or semicolon.) Therefore, we decide to have the density for error types $C_{1}$, $C_{2}$, and $C_{3}$ as sub-sentence, sentence, and full sequence, respectively. We provide more data analysis including human agreement in Appendix C.
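The sub-sentence split on commas and semicolons can be sketched with the standard library alone. The sketch assumes sentence segmentation has already been done (the text uses spaCy for that step), keeps each delimiter attached to the piece it ends, and is only one possible reading of the splitting rule; the function name is hypothetical.

```python
import re
from typing import List


def split_sub_sentences(sentence: str) -> List[str]:
    """Split one sentence into sub-sentences after each comma or semicolon.

    Note: this naive rule also splits inside numbers such as "1,000".
    """
    # Split right after ',' or ';' and swallow any whitespace that follows.
    pieces = re.split(r"(?<=[,;])\s*", sentence)
    return [p for p in pieces if p]


print(split_sub_sentences("It was released in 1999, but the sequel; a remake, came later."))
```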

Preference-based human feedback. For comparison purposes, we follow prior work and separately collect pairwise human preferences from the same group of workers. We sample 4 model outputs for each prompt $x$, which gives 6 pairs of model outputs. We ask the workers to indicate pairwise preferences (ties are allowed) based on all errors they can find in each model output. They are not asked to explicitly annotate these errors.

Annotation details. On average, both annotation tasks of fine-grained and preference feedback for one question take a worker about 6 minutes to finish. In contrast, prior work reports spending about 15 minutes to label a human-written response for each question, which is much more time-consuming than our feedback annotation. On average, we pay \$1.65 per example for both tasks, leading to \$16.50 hourly pay for our workers. We include details of the pay structure in Appendix C. We observe that human annotators can reach a higher agreement in each aspect of fine-grained feedback compared to pairwise comparisons because the feedback definitions are more concrete.

4.2 Fine-Grained Reward Models

We train three separate reward models $R_{\phi_{1}}$, $R_{\phi_{2}}$, and $R_{\phi_{3}}$ for the $C_{1}$, $C_{2}$, and $C_{3}$ error categories, with densities of sub-sentence, sentence, and full sequence, respectively. Since reward models provide scalar reward scores and do not perform generation, we use the encoder-only Longformer-base as our backbone model to handle long input sequences (more details of each reward model are in Appendix D).

[$C_{1}$: Irrelevance, repetition, or incoherence.] $R_{\phi_{1}}$ aims to predict whether each sub-sentence in $y$ contains a $C_{1}$ error. We denote $y=(y^{1}_{1},\dots,y^{1}_{L_{1}})$, where $y^{1}_{j}$ is the $j$th segment at $R_{\phi_{1}}$’s density (i.e., sub-sentence), with $L_{1}$ segments in total. We add a 2-class token-level classification layer (a single feed-forward layer) on top of the Longformer encoder. The model input has the format “question: $q$ answer: [sep] $y^{1}_{1}$ [sep] $y^{1}_{2}$ …”, and we take the classification output at each [sep] token to indicate whether the following $y^{1}_{j}$ contains a $C_{1}$ error. We do not add passages to the model input because, intuitively, the detection of $C_{1}$ errors does not depend on them. To train $R_{\phi_{1}}$, we apply a token-level classification loss to each [sep] token before $y^{1}_{j}$, where its gold label $g_{j}$ is “has error” if there is an $f_{i}\in\mathcal{F}$ whose span $(a_{b_{i}},\dots,a_{e_{i}})$ overlaps with $y^{1}_{j}$ and $c_{i}=1$, and “no error” otherwise. When $R_{\phi_{1}}$ provides a reward during RL training as in Eq. 1, we read a reward $R_{\phi_{1}}(x,y,j)$ for every $y^{1}_{j}$ given $x$ and $y$. We define $R_{\phi_{1}}(x,y,j)=+1$ if $R_{\phi_{1}}$ predicts “no error” for $y^{1}_{j}$ and $-1$ otherwise.
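The input format, the overlap-based gold labels, and the mapping from predictions to rewards can be sketched as follows. The helper names and the representation of segments and error spans as inclusive token-index pairs are assumptions for illustration; the classifier itself is not shown.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (begin, end) token indices


def format_input(question: str, segments: List[str], sep: str = "[sep]") -> str:
    """Build 'question: q answer: [sep] y_1 [sep] y_2 ...'."""
    return f"question: {question} answer: " + " ".join(f"{sep} {s}" for s in segments)


def gold_labels(segment_spans: List[Span], error_spans: List[Span]) -> List[bool]:
    """A segment is labeled 'has error' if any annotated error span overlaps it."""
    return [
        any(b <= seg_end and seg_begin <= e for (b, e) in error_spans)
        for (seg_begin, seg_end) in segment_spans
    ]


def segment_reward(predicted_has_error: bool) -> float:
    """+1 when the model predicts 'no error' for the segment, -1 otherwise."""
    return -1.0 if predicted_has_error else 1.0


segments = [(1, 5), (6, 9), (10, 14)]  # token ranges of three sub-sentences
c1_errors = [(4, 6)]                   # one annotated span of the category
labels = gold_labels(segments, c1_errors)
print(labels, [segment_reward(has_error) for has_error in labels])
print(format_input("who wrote it?", ["It was written by A,", "in 1999."]))
```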

[$C_{2}$: Incorrect or unverifiable facts.] $R_{\phi_{2}}$ is developed for detecting a $C_{2}$ error at the sentence level in a similar way. The model input has the format “question: $q$ context: $p_{1}$ $p_{2}$ … answer: [sep] $y_{1}^{2}$ [sep] $y_{2}^{2}$ …”, where the $p_{i}$ denote the grounding passages and $y^{2}_{j}$ represents the $j$th sentence. We train $R_{\phi_{2}}$ similarly to $R_{\phi_{1}}$, with one exception: because we instruct the workers not to annotate a $C_{2}$ error for a span that is already labeled as containing a $C_{1}$ error, we do not calculate the loss on sentences that are labeled as containing $C_{1}$ but not $C_{2}$ during $R_{\phi_{2}}$ training.

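[$C_{3}$: Incomplete information.] $R_{\phi_{3}}$ is trained to measure the information completeness of $y$ at the full-sequence level. Motivated by prior work, $R_{\phi_{3}}$ predicts a single scalar reward and is trained with a pairwise comparison loss (with $\sigma$ denoting the sigmoid function):

$$\mathcal{L}_{r}(\phi_{3})=-\mathbb{E}_{(x,\bar{y}_{p},\bar{y}_{l})\sim D_{p}}\Big[\log\sigma\big(R_{\phi_{3}}(x,\bar{y}_{p})-R_{\phi_{3}}(x,\bar{y}_{l})\big)\Big] \tag{2}$$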

where $R_{\phi_{3}}(x,y)$ is the scalar output of the reward model for input $x$ and output $y$; $\bar{y}_{p}$ and $\bar{y}_{l}$ are sampled from the same input $x$, and $\bar{y}_{p}$ has less missing information than $\bar{y}_{l}$; $D_{p}$ contains the pairwise comparisons bootstrapped from human feedback on $C_{3}$ errors (see details in Appendix D).
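For intuition, a pairwise comparison loss of the common form $-\log\sigma(R(x,\bar{y}_{p})-R(x,\bar{y}_{l}))$ can be sketched on plain scalar scores as below. The exact form is an assumption based on standard practice for preference-based reward models, the function names are hypothetical, and the scores stand in for reward model outputs.

```python
import math
from typing import List, Tuple


def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """-log(sigmoid(score_preferred - score_other)), computed stably."""
    diff = score_preferred - score_other
    # -log(sigmoid(d)) = log(1 + exp(-d)) = max(-d, 0) + log1p(exp(-|d|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def mean_pairwise_loss(pairs: List[Tuple[float, float]]) -> float:
    """Average loss over (preferred_score, other_score) pairs."""
    return sum(pairwise_loss(p, o) for p, o in pairs) / len(pairs)


def pairwise_accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the preferred output gets the higher score."""
    return sum(1 for p, o in pairs if p > o) / len(pairs)


pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
print(mean_pairwise_loss(pairs), pairwise_accuracy(pairs))
```

The loss is small when the preferred output is scored well above the other one and grows roughly linearly when the ordering is reversed; `pairwise_accuracy` corresponds to the kind of pairwise comparison accuracy one would report for such a model.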

Preference-based reward model. The preference-based reward model is trained in a similar way to $R_{\phi_{3}}$ , with $\bar{y}_{p}$ representing the human preferred response against $\bar{y}_{l}$ in the loss function Eq. 2. It outputs a scalar score for the given $x$ and $y$ that represents the overall response quality.

4.3 Experimental Setup

Compared systems. We compare our proposed method, Fine-Grained RLHF, with the initial T5 policy model trained with 1K examples (SFT) and RLHF with holistic preference-based rewards (Preference RLHF). The reward models used in RLHF experiments are trained on 2.8K examples with annotated feedback (but no gold human response). For analysis, we also use the human gold responses of all training examples to fine-tune a fully supervised T5 model (SFT-Full). Note that SFT-Full requires a much higher annotation cost because it takes longer (15 minutes per example) for annotators to draft long-form responses.

Implementation details. Our policy model is based on T5-large and is fine-tuned with supervision on 1K training examples, as explained in §4.1. During RL exploration, we use top-k ($k=20$) sampling decoding with temperature = 0.7, which is set based on previous RLHF work. The value model used during RL training is initialized with T5-base due to GPU memory constraints. The reward model weights we use in Fine-Grained RLHF are $w_{1}=0.3,w_{2}=0.5,w_{3}=0.3$, unless otherwise specified. Although we use three reward models during RL training, we observe only a very small relative additional cost (roughly 1% training time) compared to Preference RLHF. During inference, we use greedy decoding to generate responses. We report more details including RL training parameters in Appendix B. All scores reported are averaged over 3 independent runs.

Evaluation. We conduct both human and automatic evaluation. Human evaluation is run on 200 randomly sampled test set examples of QA-Feedback to compare Fine-Grained RLHF with all baselines. Each model output is sampled from inference results of 3 training runs. We use the same protocol of feedback collection to have the same set of workers annotate spans in each model output that contain [(1) irrelevance, repetition, or incoherence error (rel.)] and [(2) incorrect or unverifiable facts (fact.)]. They are also asked to compare the [information completeness (comp.)] for each output pair. To report evaluation scores for rel. and fact. error spans, we first map them to their corresponding error type density (sub-sentence and sentence). Then we report the error rate for each error type, measured as the percentage of sub-sentences that contain this type of error. Since spans with rel. error are not checked for fact. error (discussed in §4.1), we exclude sub-sentences with only rel. error when reporting the error rate of fact. error. For automatic evaluation, we report RougeLSum as used for the original ASQA data, as well as the score from each fine-grained reward model ($R_{\phi_{1}}$, $R_{\phi_{2}}$, and $R_{\phi_{3}}$). Specifically, we report the percentage of all sub-sentences (or sentences) in the test set predicted as “no error” by $R_{\phi_{1}}$ (or $R_{\phi_{2}}$). For $R_{\phi_{3}}$, we report the averaged output score for all test examples.
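The human-evaluation error rates described above can be sketched as follows. Each sub-sentence is represented by a hypothetical pair of flags `(has_rel_error, has_fact_error)` obtained after mapping annotated spans to sub-sentences; the function name and this representation are assumptions, and the data is made up.

```python
from typing import List, Tuple


def error_rates(sub_sentences: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (rel_error_rate, fact_error_rate) as percentages.

    sub_sentences: list of (has_rel_error, has_fact_error) flags.
    Sub-sentences with only a rel. error are excluded from the fact. rate,
    since they were not checked for factual errors.
    """
    if not sub_sentences:
        raise ValueError("need at least one sub-sentence")
    rel_rate = 100.0 * sum(1 for rel, _ in sub_sentences if rel) / len(sub_sentences)
    eligible = [(rel, fact) for rel, fact in sub_sentences if fact or not rel]
    if not eligible:
        raise ValueError("no sub-sentence is eligible for the fact. error rate")
    fact_rate = 100.0 * sum(1 for _, fact in eligible if fact) / len(eligible)
    return rel_rate, fact_rate


flags = [(False, False), (True, False), (False, True), (False, False), (True, False)]
print(error_rates(flags))
```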

4.4 Main Results

Figure 3 shows the human evaluation results for rel. and fact. error types. Table 2 shows the human pairwise comparison results for information completeness (comp.).

Fine-Grained RLHF outperforms SFT and Preference RLHF on all error types. Figure 3 and Table 2 show that Fine-Grained RLHF leads to generation that is much more factually correct and contains more complete information, compared to all other systems. It generates fewer irrelevance, repetition, and incoherence errors, compared with SFT and Preference RLHF. Meanwhile, Preference RLHF, despite greatly reducing factual errors compared to the initial policy model SFT, generates even more irrelevance, repetition, and incoherence errors than SFT. Fine-Grained RLHF outperforms Preference RLHF potentially due to more specific and localized training signals. In addition, we ask annotators to compare the overall generation quality of Fine-Grained RLHF and Preference RLHF. Although Preference RLHF is trained directly with such preference feedback, Fine-Grained RLHF was rated better than Preference RLHF in 30.5% of all examples and worse in 24.5% of examples. The annotators indicate a tie in the remaining 45% of cases. Surprisingly, Fine-Grained RLHF outperforms SFT-Full with more factual and complete generation, despite a much lower annotation cost.

RLHF is particularly effective in reducing factual errors. Figure 3 shows that both Fine-Grained RLHF and Preference RLHF are effective in reducing factual errors in model generation. Meanwhile, we see little or no improvement in reducing irrelevance, repetition, or incoherence errors. We provide more in-depth analysis for this observation in §4.5.

Table 3 shows automatic scores on the QA-Feedback test set, which show similar trends to the human evaluation in terms of system comparisons, while all four systems achieve similar Rouge scores.

4.5 LM Customization with Fine-Grained RLHF

Since we use multiple reward models in Fine-Grained RLHF, adjusting their weights (see Eq. 1) during RL may lead to different LM behaviors. For example, adding more weight to a reward model associated with one specific desired behavior type (e.g., information completeness) may steer the generation more towards that behavior type than towards others (e.g., information relevance). This flexibility can potentially fit users with diverse needs. Therefore, in this section, we explore the ability of Fine-Grained RLHF to customize the LM behavior.

LM customization. As in Table 4, we explore three configurations of reward model weights ( $w_{1}$ , $w_{2}$ , and $w_{3}$ for $R_{\phi_{1}}$ , $R_{\phi_{2}}$ , and $R_{\phi_{3}}$ ) and name them ‘short’, ‘medium’, and ‘long’ according to the LM’s average generation length. For simplicity, we fix $w_{2}=0.5$ and $w_{3}=0.3$ , and use 0.4, 0.3, and 0.2 for $w_{1}$ , which leads to ‘short’, ‘medium’, and ‘long’ generation outputs respectively. We manually inspect 30 random examples and observe that (1) ‘short’ generates more relevant content, but is less factual and complete; (2) ‘long’, in contrast, gives the most factual and complete generation. This reflects that the LM is referencing a large amount of content from passages; (3) The ‘medium’ configuration balances the three rewards and has the highest Rouge score. 24/30 examples follow the above rule. Qualitative analysis and examples of LM customization are in Appendix A.

Trade-off between error types. We observe that a higher $w_{1}$ leads to a bigger rel. reward, smaller fact. and comp. rewards, and shorter generated outputs. One interpretation is that $R_{\phi_{1}}$ penalizes text spans that are irrelevant to the questions. As such, it encourages answering the question directly and penalizes referencing passages and generating auxiliary information. This reduces the model generation length and information completeness, and induces more factual errors.

4.6 Analysis

Reward models are competing against each other. In the prior section, we find that there is a trade-off between error types. To further look into this phenomenon, we explore the dynamics of each reward model during training. Figure 4 shows each reward model’s rewards on the development set during training. All rewards are z-normalized for visualization. We see that the fact. reward is consistently increasing. The rel. reward increases rapidly in the first 250 steps and then starts decreasing, while the comp. reward exhibits an opposite trend, decreasing at first and then starting to increase. As discussed earlier, one interpretation is that relevance (precision) and information completeness (recall) can be adversarial objectives, so the rewards are competing. The three rewards reach an equilibrium point in later steps.
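The z-normalization used to put the three reward curves on a common scale can be sketched as below. The choice of the population standard deviation and the handling of a constant series are assumptions for this example, and the input values are made up.

```python
from statistics import mean, pstdev
from typing import List


def z_normalize(values: List[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative reward values logged at successive training steps.
print(z_normalize([1.0, 2.0, 3.0]))
```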

Ablation: Does the LM learn from all reward models? What if we remove one reward model? Table 5 explores the policy LM behavior when one of the three reward models is removed during training. Qualitative examples are in Appendix A. First, we observe that the corresponding reward decreases dramatically when a reward model is removed. When the rel. reward model ([$R_{\phi_{1}}$]) is removed, the outputs become extremely long and the comp. reward is extremely high. Inspecting the outputs, we find that the model copies a lot of content from the passages. When the fact. reward model ([$R_{\phi_{2}}$]) is removed, the rel. reward becomes the highest. We observe that the LM tends to answer the question directly without referencing the passages, which causes many hallucinations. When the comp. reward model ([$R_{\phi_{3}}$]) is removed, the outputs are concise and factual but do not provide all information relevant to the question. Thus, the resulting LM has lower information completeness and a lower Rouge score than the LM trained with all reward models.

Reward model performance. We report and analyze the performance of each reward model in predicting its corresponding error category. The rel. reward model $R_{\phi_{1}}$ has a binary classification accuracy of 69.6, and an F1 score (for the “has error” class) of 68.5 on model-generated sub-sentences from the development set. We sample 20 sub-sentences where $R_{\phi_{1}}$ predicts the opposite of the human label, and observe that all of them either 1) contain relevant auxiliary information and are marked as “no error” by humans, or 2) are marked as irrelevant by humans but provide closely related background information to the question. In other words, $R_{\phi_{1}}$ is mostly struggling with predicting the relevance of auxiliary information, and it rarely fails to predict a direct answer as “no error”.

The fact. reward model $R_{\phi_{2}}$ has an accuracy of 77.8 and an F1 score of 67.5. We sample 20 sentences where $R_{\phi_{2}}$ makes a prediction mistake and we observe that the mistakes often happen when the generated sentence is highly abstractive instead of directly copying information from the passage. We also observe that more than 80% of human labeled factual errors occur when the model generates a direct answer (not auxiliary information) that contains hallucinated information or a random entity from a passage. We notice that $R_{\phi_{2}}$ correctly captures more than 80% of such errors.

The comp. reward model $R_{\phi_{3}}$ has an accuracy of 70.9 in pairwise comparison. In contrast, the preference-based reward model only reaches an accuracy of 57.2. This helps confirm our intuition that assessing long-form generation outputs holistically can be more ambiguous and subjective than evaluating the outputs with a focus on a specific undesired behavior type.
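The three metrics reported above (binary classification accuracy, F1 for the “has error” class, and pairwise comparison accuracy) can be made concrete with a minimal sketch. The function names and the toy labels below are illustrative assumptions, not the evaluation code used in this work.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_class(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """F1 of a single class (here 1 stands for the "has error" class)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pairwise_accuracy(preferred: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs whose preferred response gets the strictly higher reward."""
    return sum(p > r for p, r in zip(preferred, rejected)) / len(preferred)


# Toy labels and scores, invented for illustration only.
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(accuracy(gold, pred), f1_for_class(gold, pred))
print(pairwise_accuracy([0.9, 0.2, 0.5], [0.1, 0.4, 0.3]))
```

Reporting F1 for the “has error” class alongside accuracy matters when the two classes are imbalanced, as is the case for the factuality labels.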

Comparison with ChatGPT responses. We experiment with answering the questions with ChatGPT. To familiarize ChatGPT with the style of our LFQA task, we prompt it with the task instruction and a single random QA example (due to length limitation). ChatGPT achieves a RougeLSum score of 40.92 on the test set, which is much lower than our models. We do not use our trained reward models to evaluate ChatGPT outputs because reward models trained on T5-large may not generalize well to ChatGPT. We instead manually inspect the ChatGPT responses, and observe that they are mostly concise and factual, yet lack the auxiliary information necessary to clarify ambiguous questions. Qualitative examples are in Appendix A. This shows the difficulty for ChatGPT in learning user-desired behaviors through simple prompting.

Related Work

Reinforcement learning from human feedback (RLHF). RLHF aims to optimize the policy language model to generate content that is desired by humans. This framework has been explored to improve model performance on a variety of natural language processing tasks such as text summarization, instruction following, question answering, and reducing harmfulness. Most of these studies collect human preferences over pairs of model outputs on one or a set of desired attributes, in order to train a reward model that assigns a holistic score to a generation output during RL training. One prior study trains separate reward models that assign scores for different desired attributes, but still uses a single reward that combines scores from all reward models. In contrast, we explore RLHF with fine-grained reward models trained on human feedback, where each reward model provides a dense reward after every small text segment for a specific type of desired behavior. Other work explores using intermediate rewards to improve LM performance on reasoning tasks.

Learning from human feedback in NLP. There also exists work that explores non-RL methods to learn from human feedback. One approach trains a reward model that predicts a single score for each model output and selects the samples with the highest reward scores for supervised fine-tuning. Another trains a conversational model to predict both the response and a binary user satisfaction score in order to improve response generation. Besides such numerical human feedback, natural language (NL) human feedback has also been explored. One line of work collects and stores NL human feedback in a feedback memory for the model to retrieve, and then performs the end task conditioned on the retrieved feedback. Another uses a refinement model to refine model outputs conditioned on NL human feedback and then uses a reward model to select the best refined outputs for supervised fine-tuning. Methods that use a reward model to guide LM generation towards desired behaviors at inference time can complement our work, which aims to improve the LM during training. Prior work also explores incorporating human feedback into LM pre-training.

Discussion

Annotation Costs. It is important to note that the fine-grained human feedback used for training our fine-grained reward models does not incur a greater cost than holistic human preference. As outlined in § 4.2, we observe that annotators require a substantial amount of time to compare two lengthy text outputs. For the long-form QA task, both fine-grained feedback and preference-based feedback take approximately 6 minutes per sample for an annotator.

We propose the Fine-Grained Rlhf framework that can incorporate multiple reward models to provide dense rewards for RL training, which leads to LM outputs that are optimized towards such rewards. Our framework can be applied to any text generation task, thereby enhancing LM performance by offering more nuanced guidance than holistic feedback. The key advantages of the Fine-Grained Rlhf framework are two-fold:

Flexibility. Our framework significantly expands the versatility of reward models for RLHF. For example, future work involving fact-checking, sentiment classification, toxicity detection, among others, can all be incorporated within this framework. LMs can be trained against all these reward models via Fine-Grained Rlhf.

Controllability. Having multiple reward models that stand for different feedback types allows the end user to exert greater control over RL training (e.g., through different combinations of reward model weights; see details in § 4.5). This leads to customized model behaviors, a benefit particularly valuable for applications like educational tools where model personalization is crucial.
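To illustrate how reward model weights can steer training, the following minimal sketch combines segment-level rewards from several reward models into one per-token reward vector. It assumes that each reward model emits one score per segment and that the score is credited to the last token of that segment; the KL penalty is omitted, and the name `combine_rewards` and all numbers are hypothetical.

```python
from typing import List, Tuple

# One reward model's output: (index of the segment's last token, score) per segment.
SegmentRewards = List[Tuple[int, float]]


def combine_rewards(
    num_tokens: int,
    per_model: List[SegmentRewards],
    weights: List[float],
) -> List[float]:
    """Weighted sum of segment rewards, placed at each segment's final token."""
    if len(per_model) != len(weights):
        raise ValueError("need exactly one weight per reward model")
    rewards = [0.0] * num_tokens
    for segments, weight in zip(per_model, weights):
        for end_index, score in segments:
            rewards[end_index] += weight * score
    return rewards


# Toy 6-token output (all values invented for illustration).
relevance = [(2, 1.0), (5, -1.0)]   # two sub-sentences
factuality = [(5, 1.0)]             # one sentence
completeness = [(5, 0.4)]           # full sequence
print(combine_rewards(6, [relevance, factuality, completeness], [0.3, 0.5, 0.3]))
```

Changing only the weight list changes how strongly each feedback type shapes the reward signal, without retraining any reward model.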

Limitations and Future Work

One major limitation of our framework comes from the additional compute cost of getting fine-grained rewards, compared to RLHF with a holistic reward. For instance, in the detoxification task, we need to make multiple Perspective API calls for each model output depending on how many sentences are generated, while RLHF with a holistic reward only requires one. In the long-form QA task, we need to calculate a dense reward from multiple reward models, which takes more compute time and GPU memory than a single reward model.
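The difference in scorer calls can be sketched as follows. This is a minimal illustration under stated assumptions: each sentence's reward is taken to be the drop in toxicity after appending that sentence to the prefix, and `CountingScorer` is a hypothetical toy stand-in, not the Perspective API.

```python
from typing import Callable, List


class CountingScorer:
    """Hypothetical toy scorer: fraction of words equal to 'bad'; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> float:
        self.calls += 1
        words = text.split()
        return sum(w == "bad" for w in words) / len(words)


def sentence_level_rewards(sentences: List[str], toxicity: Callable[[str], float]) -> List[float]:
    """One scorer call per sentence; reward = toxicity(prefix before) - toxicity(prefix after)."""
    rewards = []
    previous = 0.0  # assumed toxicity of the empty prefix
    for j in range(1, len(sentences) + 1):
        current = toxicity(" ".join(sentences[:j]))
        rewards.append(previous - current)
        previous = current
    return rewards


def holistic_reward(sentences: List[str], toxicity: Callable[[str], float]) -> float:
    """A single scorer call on the full output."""
    return -toxicity(" ".join(sentences))


output = ["good day", "bad bad", "fine"]
dense_scorer, holistic_scorer = CountingScorer(), CountingScorer()
print(sentence_level_rewards(output, dense_scorer), dense_scorer.calls)
print(holistic_reward(output, holistic_scorer), holistic_scorer.calls)
```

Under this assumed scheme the sentence-level rewards sum to the holistic reward, so the extra calls buy a finer credit assignment rather than a different total.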

Another limitation is that different tasks may have different definitions of fine-grained feedback in terms of the feedback types and the density level of each type. Therefore, defining feedback that is well-suited for a task and training reward models accordingly requires non-trivial manual effort.

Finally, in this work, we carefully control the quality of annotated feedback, which is then used to train reward models for RL. In practice, when a deployed model is released to the public, end users don’t always give clean feedback. Therefore, how to obtain effective learning signals from noisy human feedback in the wild still needs further investigation.

Some other interesting questions to explore in the future include: 1) Can we obtain fine-grained feedback from LMs like GPT-4 instead of humans to improve model performance and reduce annotation costs? 2) How can other non-RL approaches of using human feedback such as controlled generation during inference time complement Fine-Grained Rlhf? 3) How would fine-grained reward and value model sizes affect policy model performance during RL training?

Conclusion

In this work, we introduce Fine-Grained Rlhf, a framework that enables LMs to learn from multiple fine-grained reward models trained from human feedback, where each reward model detects a specific error category and provides dense rewards. We conduct experimental analysis on two text generation tasks to illustrate the performance gain of Fine-Grained Rlhf over RLHF with holistic rewards, supported by both automatic and human evaluation. Furthermore, we show that an LM can be customized for specific needs using different combinations of fine-grained reward models.

Acknowledgments

We thank Jiacheng Liu for sharing the standard PPO training code, and Yizhong Wang for providing insights during early discussions of the project. We also thank UW TIAL members for participating in our pilot feedback annotation. We extend our thanks to UW NLP members who provided insights or feedback to our project. Lastly, we especially thank all our AMT workers for helping us annotate the high quality feedback data. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8650-23-C-7316. This work was also funded in part by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), NSF IIS-2044660, and ONR N00014-18-1-2826. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Appendix A Qualitative Examples for Long-Form QA

As discussed in § 4.5, we can modify the weight of each fine-grained reward model during RL training to obtain LMs with different behaviors. Here, we explore three configurations of reward model weights and name them ‘short’, ‘medium’, and ‘long’ based on the LM’s average generation length. The ‘short’ configuration generates concise and short responses, while the ‘long’ configuration generates detailed and long responses. Table 6 demonstrates the different behaviors of our customized LMs. Given the same question, each LM generates a different amount of auxiliary information in the response.

A.2 Examples on LM Errors

Table 7 and Table 8 show examples of LM outputs from all the compared systems (SFT, Pref. RLHF, and Fine-Grained Rlhf). We mark the fine-grained errors on the model outputs. Overall, our Fine-Grained Rlhf outperforms SFT and Pref. RLHF in all three error types.

A.3 Examples on Reward Model Ablation

As discussed in § 4.6, reward models are competing against each other, and we experiment with removing one of the three reward models during RL training. Table 9 shows an example of how LMs behave in such scenarios. See § 4.6 for our observations.

A.4 Comparison with ChatGPT responses

We compare the responses generated by ChatGPT (one-shot; since the input for each example is very long, we cannot fit more than one in-context example into the model) and our system in Table 10. As discussed in § 4.6, we find that ChatGPT responses are relevant and factual, yet lack the auxiliary information needed to answer the ambiguous questions. This shows that it is challenging for ChatGPT to learn user-desired behaviors through prompting and in-context learning.

Appendix B Algorithm and Training Details of Fine-Grained Rlhf

The algorithm below shows in detail how PPO updates the policy LM $P_{\theta}$ and the value model $V_{\psi}$ with $K$ fine-grained reward models $R_{\phi_{k}}$ .

Input: initial policy model $P_{\theta_{\text{init}}}$; initial value model $V_{\psi_{\text{init}}}$; $K$ reward models $R_{\phi_{k}}$ trained from human feedback; task prompts $\mathcal{D}$; hyperparameters $\gamma$, $\lambda$, $\epsilon$, $\beta$ $\triangleright$ § 2

8: Update the policy model by maximizing the PPO clipped surrogate objective:

$\theta\leftarrow\arg\max_{\theta}\frac{1}{|\mathcal{D}_{b}|}\sum_{n=1}^{|\mathcal{D}_{b}|}\frac{1}{|y^{n}|}\sum_{t=1}^{|y^{n}|}\min\left(\frac{P_{\theta}(a_{t}\mid s_{t})}{P_{\theta_{\text{old}}}(a_{t}\mid s_{t})}A_{t},\,\text{clip}(v_{t},\,1-\epsilon,\,1+\epsilon)A_{t}\right)$

9: Update the value model by minimizing a square-error objective:

$\psi\leftarrow\arg\min_{\psi}\frac{1}{|\mathcal{D}_{b}|}\sum_{n=1}^{|\mathcal{D}_{b}|}\frac{1}{|y^{n}|}\sum_{t=1}^{|y^{n}|}\left(V_{\psi}(s_{t})-V^{\text{targ}}(s_{t})\right)^{2}$

Output: $P_{\theta}$
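The policy and value updates can be illustrated for a single sampled sequence with the minimal sketch below. It assumes that advantages come from generalized advantage estimation with $\gamma$ and $\lambda$, that $v_{t}$ is the probability ratio between the new and old policy, and that the value after the last token is zero. Function names are hypothetical, and the code only evaluates the objectives; it performs no gradient step.

```python
import math
from typing import List


def gae_advantages(rewards: List[float], values: List[float], gamma: float, lam: float) -> List[float]:
    """Generalized advantage estimation over one sequence (terminal value assumed 0)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(logp_new: List[float], logp_old: List[float],
                      advantages: List[float], eps: float) -> float:
    """Token-averaged PPO clipped objective (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # assumed meaning of v_t
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def value_loss(values: List[float], targets: List[float]) -> float:
    """Token-averaged squared error between V(s_t) and its target (to be minimized)."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


# Toy sequence with dense rewards after tokens 1 and 3 (numbers invented).
rewards = [0.0, 0.3, 0.0, 0.32]
values = [0.1, 0.2, 0.1, 0.3]
adv = gae_advantages(rewards, values, gamma=1.0, lam=0.95)
targets = [a + v for a, v in zip(adv, values)]         # assumed form of V^targ
print(clipped_surrogate([-1.0] * 4, [-1.1] * 4, adv, eps=0.2))
print(value_loss(values, targets))
```

With dense rewards, non-zero entries appear in the middle of `rewards`, so the advantage of an early token reflects the segment it belongs to instead of only a single end-of-sequence score.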

B.2 Implementation Details

Model architectures. For the detoxification experiments, the policy model is initialized with GPT2-large, and the value model is initialized with GPT2-base. For the long-form QA experiments, the policy model is initialized with a supervised fine-tuned T5-large, and the value model is initialized with T5-base. This design follows InstructGPT, which uses a larger (175B) policy model, and smaller value and reward (6B) models.

Training details on detoxification. For both the holistic reward baseline and the sentence-level (fine-grained) reward, we search over the same set of hyper-parameters. For training, we run 200K episodes. The batch size (number of episodes per card during training) is 64. We use the Adam optimizer with a linear learning rate scheduler and 10 warmup steps. We perform a hyper-parameter grid search for peak learning rate $\in\{5e-6,1e-5,2e-5\}$, KL coefficient $\beta\in\{0.1,0.2,0.3\}$, discounting factor $\lambda\in\{0.95,0.97,0.99\}$, and the frequency of exploration (number of sampled outputs) $\in\{2,4,8\}$. We find that the higher the KL coefficient, the lower the perplexity and the higher the toxicity. This is consistent with findings from previous RLHF studies. For a fair comparison, we eventually choose a set of parameters that achieve a similar level of perplexity for both reward models. The optimal set of hyper-parameters for the holistic reward is $\beta=0.3,\lambda=0.99$; for the sentence-level reward, it is $\beta=0.1,\lambda=0.95$. The learning rate is $1e-5$ and the exploration frequency is $4$ for both experiments. We choose the checkpoint with the lowest validation set toxicity for evaluation. Regarding computation time, we use $2\times$ 80G NVIDIA A100 GPUs for training, and the run time is about 22 hours.
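The search space above can be enumerated with a short sketch. It assumes a full Cartesian product over the four listed value sets, which the text does not state explicitly; the dictionary keys are hypothetical names.

```python
from itertools import product
from typing import Dict, List

# Value sets copied from the text; key names are illustrative.
GRID: Dict[str, List[float]] = {
    "peak_lr": [5e-6, 1e-5, 2e-5],
    "kl_coef": [0.1, 0.2, 0.3],
    "lam": [0.95, 0.97, 0.99],
    "num_samples": [2, 4, 8],
}


def enumerate_configs(grid: Dict[str, List[float]]) -> List[Dict[str, float]]:
    """Return every combination of the grid values as a config dictionary."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


configs = enumerate_configs(GRID)
print(len(configs))
print(configs[0])
```

A full product over these sets yields 3 × 3 × 3 × 3 = 81 candidate configurations, which is one reason a search may be pruned or run coordinate-wise in practice.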

Training details on long-form QA. We conduct a hyper-parameter grid search similar to that of our detoxification experiments. For long-form QA, the input length limit is 1024, and the output length limit is 200. Notice that this is much longer than in detoxification, so we use a smaller batch size and fewer training episodes. We experiment with multiple combinations of reward model weights. Fixing $w_{2}=0.5$ (factuality reward weight), we perform a grid search on $w_{1},w_{3}\in[0.0,0.5]$. We eventually choose $w_{1}=0.3,w_{2}=0.5,w_{3}=0.3$, which reaches a balance between the three reward models and allows all three rewards to increase during training. For training, the batch size (number of episodes per card during training) is 32. We use the Adam optimizer with a linear learning rate scheduler and 100 warmup steps. We perform a hyper-parameter grid search for peak learning rate $\in\{5e-6,1e-5,2e-5\}$, KL coefficient $\beta\in\{0.1,0.2,0.3\}$, discounting factor $\lambda\in\{0.95,0.97,0.99\}$, and the frequency of exploration $\in\{2,4,8\}$. The optimal set of hyper-parameters for Pref. RLHF is $\beta=0.2,\lambda=0.99$; for Fine-Grained Rlhf, it is $\beta=0.3,\lambda=0.95$. The learning rate is $1e-5$ and the exploration frequency is $4$ for both experiments. We run 80K episodes, which is approximately 5 epochs. We choose the checkpoint with the highest validation reward for evaluation. Regarding computation time, we use $2\times$ 80G NVIDIA A100 GPUs for training, and the run time is about 15 hours.
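A linear learning rate schedule with warmup can be sketched as below. This assumes the common variant in which the rate rises linearly from zero to the peak during warmup and then decays linearly to zero at the final step; the total step count used in the example is hypothetical, since the text reports episodes rather than optimizer steps.

```python
def linear_schedule_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Peak rate and warmup length from the text; total_steps=1000 is an assumed value.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=100, total_steps=1000))
```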

A note on the error bars. All results we report in the paper are from 3 independent runs. The scores reported are averaged across all runs. The error bars are shown as the shaded regions behind each training curve in our figures. They show the standard error across the three runs.
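For reference, the mean and standard error over a small number of runs can be computed as in this minimal sketch. It assumes the sample standard deviation (with $n-1$ in the denominator) divided by $\sqrt{n}$; the scores are invented placeholders.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across independent runs."""
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, standard_error


# Three hypothetical run scores.
print(mean_and_standard_error([1.0, 2.0, 3.0]))
```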

Appendix C Long-Form QA Data and Human Feedback Annotation

ASQA is a long-form QA dataset that focuses on answering ambiguous factoid questions in an open-domain setting that requires passage retrieval from a given Wikipedia passage corpus. We reformulate it into a reading comprehension setting: given the input $x$ that contains a question $q$ and a set of knowledge passages $P=\{p_{1},...,p_{|P|}\}$, generate a long-form response $y$. To construct $P$ for each input $x$, we use the oracle knowledge contexts provided by ASQA for each $x$, which are text snippets from the passage corpus. We use BM25 (https://github.com/castorini/pyserini) to map each knowledge context (text snippet) to the closest passage from the passage corpus and use the resulting passages as $P$. Our train and dev examples come from the original ASQA train set and our test examples are the original ASQA dev examples.
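The snippet-to-passage mapping can be illustrated with a from-scratch Okapi BM25 scorer. This is a minimal sketch, not the Pyserini pipeline used here: the whitespace tokenizer, the `k1` and `b` values, and the toy passages are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplification of real analyzers)."""
    return text.lower().split()


class BM25:
    def __init__(self, passages: List[str], k1: float = 0.9, b: float = 0.4) -> None:
        self.k1, self.b = k1, b
        tokens = [tokenize(p) for p in passages]
        self.lengths = [len(t) for t in tokens]
        self.avg_len = sum(self.lengths) / len(passages)
        self.term_freqs = [Counter(t) for t in tokens]
        doc_freq = Counter(term for t in tokens for term in set(t))
        n = len(passages)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {term: math.log(1 + (n - df + 0.5) / (df + 0.5)) for term, df in doc_freq.items()}

    def score(self, query: str, index: int) -> float:
        tf = self.term_freqs[index]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[index] / self.avg_len)
        return sum(
            self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
            for term in tokenize(query)
            if term in tf
        )

    def closest(self, query: str) -> int:
        """Index of the passage with the highest BM25 score for the query."""
        return max(range(len(self.lengths)), key=lambda i: self.score(query, i))


passages = [
    "the eiffel tower is in paris france",
    "the great wall is in china",
    "mount everest is the highest mountain",
]
print(BM25(passages).closest("highest mountain everest"))
```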

C.2 Human Feedback Annotation

Fine-grained feedback. As discussed in § 4.1, we first use 1K randomly sampled training examples to train a T5-large based supervised model SFT as the initial policy model $P_{\theta_{init}}$. Then we collect feedback on sampled outputs from SFT for the remaining 2,853 training examples and the 500 development examples, using the Amazon Mechanical Turk platform (https://www.mturk.com/).

Figure 5 shows the fine-grained human feedback annotation interface with an example from QA-Feedback. In addition to the task input (question $q$ and oracle passages $P$), we also provide a human-written response from ASQA to the worker as reference. However, it is important to note that, in practice, the annotation of our fine-grained feedback should not require the human-written response. The only purpose of providing the gold response is to have our workers follow the same question interpretation and expected response as the workers who annotated ASQA, such that our experimental comparison with supervised models (SFT and SFT-Full; details in § 4.3) is fair. However, we still instruct our workers to strictly use the given passages for checking factual errors. For each span error, we ask the worker to select one out of 5 categories shown in Figure 6 (left). (We see very few “incoherence” errors (1%), so the majority of labeled errors are from the other four categories during annotation.) However, we collapse these 5 categories into two categories ($C_{1}$ and $C_{2}$ mentioned in § 4.1) based on whether the error detection depends on the passages or not. When workers mark passage sentences as containing missing information, we instruct them to categorize each sentence as missing “answer”, “major auxiliary information” or “minor auxiliary information,” as shown in Figure 6 (right). Our instruction to the worker is provided in Figure 8.

Quality control. Before feedback collection, we design a qualification task to select qualified workers for this feedback annotation task. The qualification task consists of 5 questions with their corresponding passages and model outputs for the workers to annotate. We manually review about 70 submissions of the qualification task and select 15 workers whose annotation is marked by us as of high quality. Throughout the actual feedback annotation process, we constantly monitor the annotated data and send feedback to workers.

Preference-based feedback. For comparison purposes, we follow prior work to collect pairwise human preferences from the same group of workers we select from the qualification task. We sample four model outputs for each prompt $x$, which gives 6 pairs of model outputs. Similarly, we provide the worker with the human-written response and ask the workers to indicate pairwise preferences (ties are allowed) based on all errors they can find in each model output. Figure 7 shows the preference-based human feedback annotation interface with an example from QA-Feedback.

Pay structure. We pay a base rate of \$1.5 per example for annotating fine-grained or preference feedback. If the example consists of $\geq 3$ passages to read, we assign an additional \$0.3 bonus to the example. On average, we pay roughly \$1.65 per example for both tasks, which gives a \$16.5 hourly pay for our workers.

C.3 Analysis of Collected Fine-Grained Feedback

Overall, among all error spans we collect, 76% of them are $C_{1}$ errors and the remaining 24% are $C_{2}$ errors. However, it is important to note that we instruct workers to label $C_{2}$ errors only at places that don’t have a $C_{1}$ error. 75% examples are labeled as being incomplete; i.e., containing missing information that can be found in the given passages ( $C_{3}$ ). Among all marked passage sentences that contain missing information, 31%, 42% and 27% are missing answer, major auxiliary information and minor auxiliary information respectively.

To analyze human-human agreement, a subset of 300 examples receive annotations from two distinct workers. We observe that while the exact agreement in error span boundaries is low, workers achieve reasonably high agreement on whether a sub-sentence contains $C_{1}$ (agreeing on 83% of all sub-sentences) and whether a sentence contains $C_{2}$ (92%). (We use spaCy to segment generated model outputs into sentences. We then split sentences into sub-sentences using a comma or semicolon.) The agreement on whether a model output contains complete information or not ($C_{3}$) is 85%. Therefore, we decide to have the density for error types $C_{1}$, $C_{2}$, and $C_{3}$ as sub-sentence, sentence, and full sequence, respectively.
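The sub-sentence segmentation and the agreement rate can be sketched as follows. The sketch assumes sentences are already segmented (spaCy is not called here) and that the comma or semicolon stays attached to the preceding sub-sentence; both function names and the example labels are hypothetical.

```python
import re
from typing import List, Sequence


def split_sub_sentences(sentence: str) -> List[str]:
    """Split one sentence into sub-sentences after each comma or semicolon."""
    parts = re.split(r"(?<=[,;])\s*", sentence)
    return [part.strip() for part in parts if part.strip()]


def agreement_rate(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of segments on which two annotators give the same binary label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same segments")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


print(split_sub_sentences("The film was released in 1994, directed by X; it won awards."))
print(agreement_rate([1, 0, 0, 1], [1, 0, 1, 1]))
```

Measuring agreement at the segment level rather than on exact span boundaries is what makes the sub-sentence and sentence densities usable as reward granularities.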

Appendix D Long-Form QA Reward Model Training Details

We train reward models with the 2,853 training examples with feedback collected and select the best model for each error category based on their performance on the development set. The batch size and training epochs are 24 and 50 for $R_{\phi_{1}}$ and $R_{\phi_{2}}$. Each training is run on a single 80G NVIDIA A100 GPU, taking 1 and 2 hours for training $R_{\phi_{1}}$ and $R_{\phi_{2}}$ respectively. (Note that training $R_{\phi_{1}}$ takes less time because its input does not contain passages.) The batch size and training epochs are 12 (per GPU) and 30 for $R_{\phi_{3}}$ and the preference-based reward model. Each training is run on $2\times$ 80G NVIDIA A100 GPUs and takes 2 hours. We use the Adam optimizer with a linear learning rate scheduler for all reward model training. For each reward model, we search the learning rate over $\{5e-6,1e-5,5e-5\}$, weight decay over $\{0.001,0.01\}$, and warm-up step ratio over $\{0.1,0.2\}$ based on the dev set performance. Specifically, for $R_{\phi_{1}}$ and $R_{\phi_{2}}$ we use the model that achieves the best binary classification accuracy. For $R_{\phi_{3}}$ and the preference-based reward model, we select the model that achieves the best pairwise comparison accuracy. We also provide more training details for each reward model below.

[ $R_{\phi_{1}}$ for $C_{1}$ : Irrelevance, repetition, or incoherence.] To train the reward model $R_{\phi_{1}}$ that detects errors of irrelevance, repetition, or incoherence, we apply a token-level classification loss to each [sep] token before $y^{1}_{j}$, where its gold label $g_{j}$ is “has error” if there is an $f_{i}\in\mathcal{F}$ whose span $a_{b_{i},\dots,e_{i}}$ overlaps with $y^{1}_{j}$ and $c_{i}=1$, and “no error” otherwise. We observe that spans marked as error type $C_{1}$ that are shorter than 5 words usually carry very little information or are annotated as a result of workers being very careful or strict. Therefore, we filter out such short spans before constructing training examples for $R_{\phi_{1}}$. Overall, we get 7379 and 8059 sub-sentences with the “has error” and “no error” labels respectively.
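The gold-label construction for $R_{\phi_{1}}$ can be sketched as below. The sketch assumes that spans and sub-sentences are given as inclusive token index ranges and that span length is measured in tokens as a proxy for words; the function names and the example indices are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]              # inclusive token indices (b_i, e_i)
Feedback = Tuple[int, int, int]     # (b_i, e_i, c_i) with c_i the error category


def overlaps(first: Span, second: Span) -> bool:
    """True if two inclusive index ranges share at least one token."""
    return first[0] <= second[1] and second[0] <= first[1]


def c1_gold_labels(sub_sentences: List[Span], feedback: List[Feedback],
                   min_span_len: int = 5) -> List[str]:
    """Label each sub-sentence by overlap with a sufficiently long C1 error span."""
    kept = [(b, e) for b, e, c in feedback if c == 1 and (e - b + 1) >= min_span_len]
    return [
        "has error" if any(overlaps(sub, span) for span in kept) else "no error"
        for sub in sub_sentences
    ]


sub_sentences = [(0, 5), (6, 11), (12, 15)]
feedback = [(4, 9, 1), (12, 13, 1), (12, 15, 2)]  # the short C1 span is filtered out
print(c1_gold_labels(sub_sentences, feedback))
```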

[ $R_{\phi_{2}}$ for $C_{2}$ : Incorrect or unverifiable facts.] We train $R_{\phi_{2}}$ in a similar way to $R_{\phi_{1}}$. Instead of predicting the error for each sub-sentence, $R_{\phi_{2}}$ is trained to predict at the sentence level (i.e., $y_{j}^{2}$ is the $j^{th}$ sentence in $y$). Since workers do not annotate $C_{2}$ errors for spans that are already labeled as having a $C_{1}$ error, in order to avoid false negatives in training $R_{\phi_{2}}$, we neither provide a gold label nor calculate a loss for sentences that contain only $C_{1}$ errors. In other words, all sentences that contain a $C_{2}$ error have the gold label “has error” and sentences that contain no error have the gold label “no error”. Overall, we get 1600 and 3411 sentences with the “has error” and “no error” labels respectively.
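The three-way treatment of sentences for $R_{\phi_{2}}$ can be expressed as a small labeling rule. This is a minimal sketch; representing the excluded case as `None` is an assumption made for illustration.

```python
from typing import Optional, Set


def c2_gold_label(error_categories: Set[int]) -> Optional[str]:
    """Gold label for one sentence given the categories of errors marked in it."""
    if 2 in error_categories:
        return "has error"
    if not error_categories:
        return "no error"
    # Only C1 errors were marked: a C2 error may be hidden, so skip the loss.
    return None


for categories in (set(), {1}, {2}, {1, 2}):
    print(sorted(categories), c2_gold_label(categories))
```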

[ $R_{\phi_{3}}$ for $C_{3}$ : Incomplete information.] Instead of casting this as a classification task, $R_{\phi_{3}}$ predicts a single scalar reward given $x$ and $y$ and is trained with a pairwise comparison loss. This is motivated by early work showing that pairwise comparison is more reliable than error classification when assessing a full generation sequence. To construct training data for $R_{\phi_{3}}$, we bootstrap pairwise comparisons from the corrected model output $y^{\prime}$ as follows. We first map each sub-sentence in $y^{\prime}$ to the passage sentence in $P$ that has a sub-string with the highest token-level F1 score with the sub-sentence, and denote all mapped sentences as $\mathcal{S}$. (We manually review 50 mapped passage sentences and find that over 90% of them are correctly mapped, which indicates frequent extractive behaviors from $P_{\theta_{init}}$.) We then sample four responses from SFT, and for each we do the same sentence mapping to get a set of passage sentences $\mathcal{S}^{\prime}$. We calculate $score(y)=|\mathcal{S}^{\prime}\cap\mathcal{S}|/|\mathcal{S}|$ as the information completeness score for each model response $y$. We follow prior work to pair up sampled responses for $q$ and denote each sampled response pair as ($\bar{y}_{p}$, $\bar{y}_{l}$), where $score(\bar{y}_{p})>score(\bar{y}_{l})$. We drop the pairs where $score(\bar{y}_{p})=score(\bar{y}_{l})$. We then train $R_{\phi_{3}}$ with the loss function in Eq. 2. We have a total of 6821 pair examples in training.
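The completeness score and the pair bootstrapping can be sketched as follows. The sketch assumes the sentence mapping has already been done, so each response is represented by the set of passage-sentence ids it maps to; the ids, response names, and function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def completeness_score(mapped: Set[int], reference: Set[int]) -> float:
    """score(y) = |S' ∩ S| / |S| for mapped passage-sentence ids."""
    if not reference:
        raise ValueError("reference set S must be non-empty")
    return len(mapped & reference) / len(reference)


def build_pairs(responses: List[str], scores: List[float]) -> List[Tuple[str, str]]:
    """Return (preferred, less preferred) pairs; pairs with tied scores are dropped."""
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        if scores[i] > scores[j]:
            pairs.append((responses[i], responses[j]))
        elif scores[j] > scores[i]:
            pairs.append((responses[j], responses[i]))
    return pairs


reference = {1, 2, 3, 4}                       # S, from the corrected output
sampled = {"a": {1, 2}, "b": {1, 2, 3}, "c": {1, 5}, "d": {2, 3}}
names = list(sampled)
scores = [completeness_score(sampled[n], reference) for n in names]
print(scores)
print(build_pairs(names, scores))
```

Four sampled responses give six candidate pairs, and any pair with equal scores (here `a` and `d`) contributes no training example.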

Preference-based reward model. The preference-based reward model is trained in a similar way to $R_{\phi_{3}}$, with $\bar{y}_{p}$ representing the human-preferred response over $\bar{y}_{l}$ in the loss function in Eq. 2. We drop the pairs where a tie is indicated. We have a total of 14981 pair examples in training.
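A pairwise comparison loss for a single pair can be sketched as below. This assumes Eq. 2 takes the commonly used form $-\log\sigma(R(x,\bar{y}_{p})-R(x,\bar{y}_{l}))$; the function name is hypothetical and the rewards are plain floats rather than model outputs.

```python
import math


def pairwise_comparison_loss(reward_preferred: float, reward_less_preferred: float) -> float:
    """-log(sigmoid(r_p - r_l)), written in a numerically stable softplus form."""
    diff = reward_preferred - reward_less_preferred
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(pairwise_comparison_loss(2.0, 0.0))   # preferred response scored higher: small loss
print(pairwise_comparison_loss(0.0, 2.0))   # ordering violated: larger loss
```

The loss depends only on the reward difference, so the reward model learns a relative ordering of responses rather than an absolute scale.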