Recursively Summarizing Books with Human Feedback

Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano

Introduction

To train an ML model on a new task, we need a training signal that tells the model which behaviors are better and which are worse. For some tasks, like playing a video game, this training signal can be calculated automatically. However, for many useful tasks an accurate training signal can only be provided via a human in the loop. For example, humans can provide demonstrations of the correct behavior (Bain and Sammut, 1995) or compare two outputs from the model being trained (Christiano et al., 2017), and this data is used to train the model.

In this paper we focus on tasks that are difficult for humans to supervise or evaluate, either because the tasks take a lot of time or because they require specialized knowledge and expertise to evaluate. For example, imagine training a model to summarize an entire sub-field of scientific research. For a human to provide a demonstration or evaluate the quality of a model-generated summary, they would likely need a huge amount of time and expertise. One could circumvent this difficulty by using easier-to-measure proxy objectives (e.g. how often words in the summary relate to the topic, and how accurate individual sentences in the summary are), but these proxies are usually less aligned with our actual goals, and optimizing them can have unintended consequences (Clark and Amodei, 2016; Krakovna et al., 2020; Amodei et al., 2016).

Successfully training ML systems on such tasks will require more scalable means of producing an effective training signal; this problem is known as scalable oversight (Amodei et al., 2016).

Our approach to scalable oversight is directly inspired by Christiano et al. (2018) and Leike et al. (2018), who make use of task decomposition (Singh, 1992; Dayan and Hinton, 1993) and learning from human feedback. At a high level, these methods take a top-level task and decompose it into several smaller subtasks whose answers would help a human solve or evaluate the top-level task. These subtasks can in turn be decomposed into smaller tasks until it is feasible for humans to provide a training signal for a leaf task. ML models can be trained to solve the leaf tasks, to solve higher-level tasks given answers to the lower-level tasks, and to decompose the harder tasks into subtasks. While Dayan and Hinton (1993) and Christiano et al. (2018) only tried this on simple algorithmic tasks, Perez et al. (2020) and Rajani et al. (2019) use similar ideas for question-answering tasks using a single step of decomposition.

We take a step further in this direction by scaling task decomposition to abstractive book summarization. Abstractive book summarization is a difficult task, where dataset collection is challenging (Mihalcea and Ceylan, 2007; Ladhak et al., 2020; Kryściński et al., 2021) and existing methods are typically either extractive (Radev et al., 2004; Mihalcea and Tarau, 2004; Bamman and Smith, 2013; Ladhak et al., 2020) or focused on shorter stories (Kazantseva, 2006; Zhang et al., 2019b).

We implement a natural task decomposition for long-form summarization: first, we train models to summarize small parts of the book, and then use these models to help humans summarize larger sections of the book, and continue with this strategy recursively. We train a single model to perform these tasks using standard cross-entropy behavioral cloning (BC) and reinforcement learning (RL) from human preferences (Christiano et al., 2017).

Our main result is a model that can be applied recursively to generate plausible summaries of entire books. Our approach lets us summarize books of arbitrary length – we achieve believable summaries on books with hundreds of thousands of words by recursing to depth 3. With a non-recursive approach, generating or evaluating a book summary requires a human reading the entire book, so naively collecting such a dataset is over 50x more expensive per data point (see Appendix E.2).

Qualitatively, these summaries contain important events from the book, and sometimes synthesize these details abstractively; however, they often leave out important details or fail to grasp the broader context. When evaluated quantitatively, our model significantly outperforms our behavioral cloning baseline, and a small number of summaries approach human-level quality. Separately, we perform an ablation comparing RL to BC on summarizing smaller sections of a book, and find that RL has better scaling properties. We also evaluate our summaries with the NarrativeQA question-answering dataset (Kočiský et al., 2018) and find that a zero-shot model taking our summaries as input achieves competitive results at answering questions about books and movie scripts. We also achieve state-of-the-art results on the recent BookSum dataset (Kryściński et al., 2021) for book-length summarization.

Overall, our results show that combining recursive task decomposition with learning from human feedback can be a practical approach to scalable oversight for difficult long-document NLP tasks. We hope that our work encourages more research in using models trained on simpler tasks to aid humans in providing training signals on more difficult tasks.

Approach

Consider a task for which it is very expensive for a human to provide a training signal. Christiano et al. (2018), Irving et al. (2018), and Leike et al. (2018) all propose in some way reducing the task into simpler parts which humans can supervise.

In task decomposition, a human decomposes this parent task into several subtasks, such that each subtask is simpler than the parent task, and having the responses to the subtasks would help a human provide a training signal for the parent task. This task decomposition process can be applied recursively to obtain a tree of tasks, such that the leaf tasks are simple enough for a human to demonstrate or evaluate. For example, the parent task “Write a research report on climate change interventions” might decompose into a subtask like: “Give me a list of the most promising climate change interventions”, which then further decomposes into simpler tasks like “How effective is reducing food waste?” and “What are ways to make nations coordinate in avoiding tragedy of the commons scenarios?”.

If we repeat this process many times, we obtain a dataset that we can use to train an ML model. Specifically, given a (sub)task we want to train a model that can perform two fundamental operations:

Decompose: Ask for responses to a set of simpler tasks.

Respond: Given responses to some number (possibly none) of simpler tasks, respond to the original task. When there are simpler tasks used, we sometimes refer to the operation as Compose, since it composes the sub-responses into an overall response.

Then any task can be performed via a recursive procedure if it is amenable to decomposition; we show a pseudocode implementation in Appendix A. It remains an open question to what extent natural tasks are actually amenable to decomposition (Ought, 2020).

While the framework above is fully general, it can be further simplified if the task lends itself to a simple recursive structure where the decomposition operation can be performed algorithmically, and the ML model only needs to be trained on the Respond operation.
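The two operations and the recursion that ties them together can be illustrated with a minimal sketch. This is not the pseudocode from Appendix A: the names `solve`, `toy_decompose`, and `toy_respond` are hypothetical, and the toy task (adding up the digits of a string) merely stands in for a learned Decompose and Respond.

```python
from typing import Callable, List

# Hypothetical signatures. In the paper these operations are learned models
# (or, for Decompose, possibly a fixed algorithm).
Decompose = Callable[[str], List[str]]      # task -> subtasks (empty list = leaf)
Respond = Callable[[str, List[str]], str]   # task, sub-responses -> response


def solve(task: str, decompose: Decompose, respond: Respond) -> str:
    """Answer `task` by first answering its subtasks recursively."""
    subtasks = decompose(task)
    sub_responses = [solve(sub, decompose, respond) for sub in subtasks]
    # With no subtasks this is a plain Respond; otherwise it acts as Compose.
    return respond(task, sub_responses)


def toy_decompose(task: str) -> List[str]:
    """Toy stand-in: split a digit string in half until one digit remains."""
    if len(task) <= 1:
        return []
    mid = len(task) // 2
    return [task[:mid], task[mid:]]


def toy_respond(task: str, sub_responses: List[str]) -> str:
    """Toy stand-in: a leaf answers with its digit, a parent adds the answers."""
    if not sub_responses:
        return task
    return str(sum(int(r) for r in sub_responses))


print(solve("1234", toy_decompose, toy_respond))
```

When Decompose can be written as a fixed function, as in the simplification described above, only `respond` would need to be learned.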

2.2 Decomposition for book summarization

We use a simple procedure to algorithmically decompose a summarization task for a piece of text: If the text is short enough, summarize it directly. If it is longer, chunk the text into smaller pieces, and recursively ask to summarize each one. This results in a tree of summarization tasks (see Figure 1), where only the leaf tasks operate on passages of the original book text.
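A minimal sketch of this procedure follows, under several assumptions that are ours rather than the paper's: length is measured in whitespace-separated words, chunks are fixed-size, and `summarize` is a hypothetical callable standing in for the trained model. The actual tree-construction parameters are those of Appendix A.5.

```python
from typing import Callable, List


def split_into_chunks(words: List[str], max_words: int) -> List[str]:
    """Cut a word list into consecutive pieces of at most `max_words` words."""
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_recursively(text: str, summarize: Callable[[str], str], max_words: int) -> str:
    """Summarize directly if short enough; otherwise chunk, recurse, and compose."""
    words = text.split()
    if len(words) <= max_words:
        return summarize(text)  # leaf task, or a composition task that now fits
    child_summaries = [
        summarize_recursively(chunk, summarize, max_words)
        for chunk in split_into_chunks(words, max_words)
    ]
    combined = " ".join(child_summaries)
    if len(combined.split()) >= len(words):
        # Guard for the sketch: without compression the recursion never ends.
        raise ValueError("summaries must be shorter than their input")
    return summarize_recursively(combined, summarize, max_words)


def first_two_words(text: str) -> str:
    """Toy stand-in for a summarization model: keep the first two words."""
    return " ".join(text.split()[:2])


book = " ".join(f"w{i}" for i in range(100))
print(summarize_recursively(book, first_two_words, max_words=10))
```

Only the first level of recursion sees the original text; every later call receives a concatenation of summaries, which matches the distinction between leaf and composition tasks.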

Each task, corresponding to nodes with pencil symbols in Figure 1, has a height and depth, which correspond to the standard terminology used for trees. The height of a node is the length of the longest downward path to a leaf from that node. A height 0 task is a leaf task, where the goal is to summarize the original book text. We sometimes refer to tasks that are height > 0 as composition tasks, since the input is a concatenation of summaries, and the goal is to produce another summary. The depth of a node is the length of the path from the node to the root. A depth 0 task is the final summarization task, where the goal is to produce a summary of an entire book (given summaries produced from the depth 1 tasks).
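The two definitions can be made concrete with a small sketch. The `TaskNode` class and the example tree are hypothetical and exist only to illustrate the terminology.

```python
from typing import List, Optional


class TaskNode:
    """One summarization task in the tree (illustrative only)."""

    def __init__(self, name: str, parent: Optional["TaskNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["TaskNode"] = []
        if parent is not None:
            parent.children.append(self)

    def height(self) -> int:
        """Length of the longest downward path to a leaf (0 for a leaf task)."""
        if not self.children:
            return 0
        return 1 + max(child.height() for child in self.children)

    def depth(self) -> int:
        """Length of the path up to the root (0 for the whole-book task)."""
        return 0 if self.parent is None else 1 + self.parent.depth()


book = TaskNode("whole book")
part_1 = TaskNode("part 1", parent=book)
part_2 = TaskNode("part 2", parent=book)
leaf_a = TaskNode("pages 1-20", parent=part_1)
leaf_b = TaskNode("pages 21-40", parent=part_1)
leaf_c = TaskNode("pages 41-60", parent=part_2)

print(book.height(), book.depth())      # root: height 2, depth 0
print(leaf_a.height(), leaf_a.depth())  # leaf: height 0, depth 2
```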

An evident issue with the above approach is that tasks corresponding to passages further into a book may lack the necessary context for a successful summary. We remedy this by additionally putting prior summaries in context, from the same depth, concatenated together in order. We call these summaries the previous context. (Early on, we found this previous context to help the model, according to log loss on a BC model. We also found that variants that include the previous un-summarized text did worse: though it includes more information, our models did not have enough context length to make use of it.) In Figure 1, the previous-summary inputs for the blue task are indicated using dotted lines. We include as many prior summaries as can fit in the model’s context length. We would like each summary to flow naturally from the previous context, since it may get concatenated with it at a higher height or in the previous context for a later task.
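One way to picture "as many prior summaries as can fit" is the sketch below. It assumes, without confirmation from the text, that the most recent summaries are kept when not all fit, and it counts whitespace-separated words as a stand-in for model tokens. The function name and the budget parameter are hypothetical.

```python
from typing import List


def build_previous_context(prior_summaries: List[str], budget_tokens: int) -> str:
    """Keep the most recent same-depth summaries that fit in the budget, in order."""
    kept: List[str] = []
    used = 0
    for summary in reversed(prior_summaries):  # walk from most recent to oldest
        cost = len(summary.split())            # assumption: 1 word = 1 token
        if used + cost > budget_tokens:
            break
        kept.append(summary)
        used += cost
    return " ".join(reversed(kept))            # restore reading order


prior = ["a b c", "d e", "f g h i"]
print(build_previous_context(prior, budget_tokens=6))
```

A real implementation would count tokens with the model's byte pair encoding and would also reserve room for the input text and the summary to be generated.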

A convenient property of this decomposition is that all of the tasks in the tree are extremely similar to one another. Every task for the model is a summarization task that can be formatted the same way. The input text is either the original book text or a concatenation of summaries, and we optionally have additional previous context in the form of summaries.

Pseudocode and detailed parameters of tree construction can be found in Appendix A.5.

2.3 Training

For training the model, we most closely follow the procedure of Stiennon et al. (2020). We start with a pretrained language model and a pool of trained human labelers (see Appendix B for details). We collect demonstrations from labelers and train a model via behavioral cloning. We then repeat many iterations of reward learning and reinforcement learning. To learn the reward function, we collect comparisons from labelers on outputs from the current best policy and train a reward model to predict log odds that a response is better. Reinforcement learning directly optimizes the reward with an additional KL term to prevent too much drift from the initial policy, typically our best supervised policy. More details are in Appendix D.
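The two quantities in this paragraph can be sketched numerically. The forms below follow our reading of the setup in Stiennon et al. (2020): a pairwise logistic loss on the difference of reward-model scores, and a per-sample reward reduced by a KL penalty against the initial policy. The function names and the value of `beta` are hypothetical, and a real implementation would operate on batches of model outputs rather than scalars.

```python
import math


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log sigmoid(r_preferred - r_rejected), computed in a numerically stable way.

    The score difference plays the role of the predicted log odds that the
    preferred response is better.
    """
    diff = reward_preferred - reward_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def kl_penalized_reward(
    reward: float,
    logprob_policy: float,
    logprob_initial: float,
    beta: float = 0.05,  # hypothetical coefficient, not a value from the paper
) -> float:
    """Reward-model score minus a penalty for drifting from the initial policy."""
    return reward - beta * (logprob_policy - logprob_initial)


print(preference_loss(2.0, 0.0))  # small loss: the model agrees with the labeler
print(preference_loss(0.0, 2.0))  # larger loss: the model disagrees
print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_initial=-2.0, beta=0.5))
```

The penalty grows when the policy assigns much higher probability to a sample than the initial policy does, which is what discourages drift.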

To collect a label for a given task, we need to generate its inputs: if a node is not a leaf, we run an existing model (typically the best available) recursively to generate summaries for each of its children.

In summary, we use the following algorithm:

1. Recursively decompose books (and compose child summaries) into tasks using the procedure described in 2.2, using the best models we have and the best sampling parameters we have. (While the tree is typically created from a single best model for all tasks, there are times when, e.g., our best model at height 0 is an RL model but the best model at height 1 is supervised. We also initially experimented with training different models for height 0 and height 1, but found that training a unified model worked better, and trained a single model for all heights thereafter. Our best-guess sampling parameters are generally determined by human evaluations on the individual tasks; see Appendix D.2.) While this could be done with humans, it would be prohibitively expensive.

2. Sample a node from the tree, corresponding to a summarization task which we’d like to train on. (Note that throughout much of the project, we sample only from the early parts of the tree and thus avoid running the full procedure from step 1. Details below in 2.3.2.)

3. Obtain training data, given the inputs to that node.

(a) For demonstrations, we then have human labelers write a desired output.

(b) For comparisons, we run the model we wish to train to obtain two outputs, typically at temperature 1. We then ask human labelers to choose which output is better.

4. We then finetune the model using the training data.

(a) For demonstrations, we use behavior cloning (BC). We do a supervised finetune using the standard cross entropy loss function.

(b) For comparisons, we use reinforcement learning (RL) against a reward model trained to predict human preferences.

We can iterate this entire process with newer models, different node sampling strategies, and different choice of training data type (demonstration versus comparison).

Since each model is trained on inputs produced by a different model, inputs produced by itself are outside of the training distribution, thus causing auto-induced distributional shift (ADS) (Krueger et al., 2020). This effect is more severe at later parts in the tree computation (later in the book, and especially higher in the tree). This means that after each round of training, running the full procedure always results in inputs out of the prior training distributions, for tasks at non-zero height. While we did not systematically measure the severity of this effect, in practice we generally found that additional rounds of training at height 0 resulted in better-rated summaries at height 1.

Because of the ADS mentioned in Section 2.3.1, it is advantageous to prioritize training on nodes earlier/lower in the tree computation, before moving to nodes later in the computation.

First subtree. The first subtree refers to the first height 1 task, and its height 0 child tasks (of which there are typically 10-13). See the yellow nodes in Figure 1 for an example. In Section 4.1, we find that by training on merely the first subtree, the model can generalize to the entire tree.

First leaves. The first leaves refers to the height 0 tasks in the first subtree, i.e. those which are children of the first height 1 task.

For early rounds, we initially train only on the first leaves, since inputs to later nodes depend on having plausible summaries from earlier nodes, and we do not want to use excessive human time. We then move to the entire first subtree (additionally training on a single height 1 task), once the summaries for the first leaves look reasonable. At this point, our model is already capable of generalizing to the full tree, and we switch to training on all nodes. Curriculum changes were made in an ad hoc manner, moving on when we deemed the models "good enough" at earlier tasks.

We use pretrained transformer language models (Vaswani et al., 2017) from the GPT-3 family (Brown et al., 2020), which take 2048 tokens of context. Input tokens are produced by the byte pair encoding introduced in Radford et al. (2019). Other architecture and hyperparameter choices follow those of Stiennon et al. (2020). More details are in Appendix D.

In the first leaves phase of the project, we collect data for all first leaves together. When moving to first subtree, we independently collect data for the height 1 tasks, letting us vary the ratio of training data at the different heights. Finally, for the full tree phase, we follow a strategy of first randomly sampling a depth, and then randomly selecting a task amongst tasks at that depth. Inputs are typically generated using the best model available and best guess sampling parameters (see Appendix D.2).
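The full-tree sampling strategy (pick a depth first, then a task at that depth) can be sketched as follows. The dictionary layout and the function name are hypothetical. The sketch assumes both choices are uniform, which the text does not state explicitly.

```python
import random
from typing import Dict, List, Tuple


def sample_task(nodes_by_depth: Dict[int, List[str]], rng: random.Random) -> Tuple[int, str]:
    """Pick a depth uniformly at random, then a task uniformly within that depth."""
    depth = rng.choice(sorted(nodes_by_depth))
    node = rng.choice(nodes_by_depth[depth])
    return depth, node


# Toy tree: 1 root task, 3 depth-1 tasks, 30 depth-2 (leaf) tasks.
tree = {
    0: ["root"],
    1: ["section-1", "section-2", "section-3"],
    2: [f"leaf-{i}" for i in range(30)],
}
rng = random.Random(0)
print(sample_task(tree, rng))
```

Under this assumption, the single root task is drawn about a third of the time in the toy tree, whereas sampling uniformly over all 34 nodes would select it only about 3% of the time. Sampling by depth therefore gives far more weight to the rarer composition tasks.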

In all cases, we train on all past data (individual demonstrations and comparisons for tasks from various parts of the tree). We then shuffle and sample tasks randomly.

We ran three variants of sampling tasks for reinforcement learning episodes, corresponding to our changes in the training curriculum.

The first leaves: Each episode is a single first leaf task. The algorithm trains on consecutive leaf tasks in succession; the sampled summaries are used as previous context for later leaves.

The first subtree: Each episode consists of a first leaf task or the height 1 composition task for the first subtree. The algorithm trains on the leaf tasks in succession, followed by the composition task using their sampled outputs.

Full tree: We choose a random depth d and then a random node at that depth. The algorithm trains on N successive depth d+1 tasks followed by a single depth d composition task using those N outputs. Input trees are generated ahead of time from the initial model with best-effort sampling settings (in practice, we sometimes use some trees from older models as well).

Since our demonstration and comparison data is at the level of individual nodes, we train the RL policy at the same granularity: each task is its own episode, and no rewards propagate to other nodes of the tree.

2.4 Advantages of decomposition

Compared to end-to-end training, decomposition makes it much easier to collect human feedback for a given task. Correspondingly, it makes the task much easier for the ML model. But it also offers other benefits:

It empowers a human to do or evaluate parts of the task themselves. For example, a human with access to lower-level summaries can quickly write a summary themselves.

It makes it easier to trace what the model is thinking, and debug errors in the model. If a model summary contains a relatively isolated fact, a human with access to the tree can trace it back to the original text.

Our procedure generalizes gracefully to longer books. It can be used at test time on books of unbounded length, regardless of the length of books in the training dataset.

Task details

For training, we use a subset of the books used in GPT-3’s training data (Brown et al., 2020). The books are primarily fiction, and contain over 100K words on average. We further constrain our dataset by asking labelers to skip non-narrative books.

We chose narrative fiction books due to our belief that they were the most difficult to summarize, which is supported by our later qualitative findings (Appendix J). Summarizing narrative texts is particularly challenging for extractive methods since any given sentence tends to be a very low-level description. We find additional evidence for this in Section 4.2, where our models outperform an extractive oracle on the BERTScore metric.

3.2 Summarization task

We aim to summarize abstractively, tracing out narrative arcs and larger themes rather than listing series of events. Our primary metric is labeler judgments of overall summary quality on a 1-7 Likert scale, on held-out books that were neither in the GPT-3 pretraining dataset nor in our book dataset. We also ask labelers to evaluate summary accuracy, coverage of the source text, coherence, and amount of abstraction; see more details on our instructions to labelers in Appendix C.1.

For each summarization subtask, we generally aim to compress the text by a factor of 5-10x, with length upper limits of 128 to 384 tokens, depending on the task height. We ask labelers to evaluate summary quality conditioned on its length; that is, labelers are answering the question “how good is this summary, given that it is X words long?” This is in part to avoid the scenario where, if longer summaries are preferred by labelers, models will generate the longest summaries allowed by the length constraints (Stiennon et al., 2020).

We emphasize that for each subtask, labelers only consider the quality of the summary with respect to the direct input to the model, rather than the subset of the book representing the true summarization target. See Appendix A.3 for more discussion.

Results

We first evaluate our models’ ability to summarize full books that were unseen during pretraining or fine-tuning. To do this, we use the 40 most popular books published in 2020 according to Goodreads at the time we looked. The resulting books span a variety of genres (see Table 5).

We then assigned two labelers to read each book (purchased with reimbursement) and to write a summary of the book. Finally, we ask the labelers to rate summaries from various models and from the other labeler. Labeler agreement for relative quality of model-written summaries was nearly 80%.

We evaluate two model sizes, 175B parameters and 6B parameters. For each size, we also evaluate three different modes of training: RL on the whole tree, RL on the first subtree, and BC on the whole tree. For each policy, we generate 3 summaries each, in order to reduce error bars. Even for temperature 0 policies, we can vary the summaries by changing the seed used to randomly choose chunking boundaries – we found this to produce significant variation in the summaries.

We evaluated all BC policies at temperatures T=0.0, 0.3, and 0.6 on this test set. The results in Figures 2 and 3 use the best temperatures for these policies. (While this may overstate quality of the BC policies, we consider the policies to be a baseline and did not want to understate the quality.) This is because it was too expensive to ablate temperature on the full book summarization task on our validation set (though we show temperature sweeps on the validation set for leaf summarization tasks in Appendix D.2, these temperatures are not a priori the best for full book summarization). In the end, we empirically found that the best temperatures for the leaf task were also the best for full book summarization: T=0.6 was best for our 6B BC baseline, and all temperatures performed about equally for our 175B BC baseline.

Our best models can generate realistic summaries of books unseen during training. Some of these summaries approach human-level quality: over 5% of summaries from the best 175B model were given a score of 6 out of 7, and over 15% were given a 5 out of 7, scores which were also sometimes assigned to human-written summaries (Figure 3). However, on average our model summaries are still significantly worse than human-written summaries (Figure 2(a)). See our website (https://openaipublic.blob.core.windows.net/recursive-book-summ/website/index.html#goodreads) for our model summaries and ratings.

We find that training on the first subtree does comparably to training on the full tree (Figure 2(b)). Our models trained on just the first subtree generalize quite well to the full book summarization task. However, we also found the full tree models disappointing; the final 175B full tree model we trained was noticeably worse than the previous one. We discuss possible reasons for this in Appendix G. (We had convincingly detected this prior to final evaluations via Likert scores for tree tasks, but included it for completeness. The results in the remainder of the paper use the better (earlier) model, and we had committed to doing this before running final book evaluations.) We also find that our 175B RL policies significantly outperform our 175B BC baseline, though the improvement is smaller for the 6B models.

Likert scores for the full book summaries were significantly lower than Likert scores of any of the individual decomposed tasks. This is unsurprising, since the errors accumulated at each depth are all reflected in the full book summary score. See Appendix A.3 for more discussion.

4.2 BookSum results

We also evaluate our models on the recently proposed BookSum dataset for book-length summarization (Kryściński et al., 2021). We compare to the best extractive (BertExt; Liu and Lapata, 2019b) and abstractive (T5; Raffel et al., 2019) models, as well as an extractive oracle (which uses the reference summary to find the sentences in the source text that lead to the highest score). Kryściński et al. (2021) evaluate book summaries using ROUGE (Lin and Och, 2004), BERTScore (Zhang et al., 2019a), and SummaQA (Scialom et al., 2019). SummaQA requires paragraph-aligned summaries, which we do not have, and so we report results on ROUGE and BERTScore. Our depth 0 summaries are substantially shorter than the reference summaries, so we use the concatenation of depth 1 summaries.

Our 175B models beat all non-oracle baselines on ROUGE by 3-4 points and approach the performance of an extractive oracle. They also significantly outperform all baselines on BERTScore, including the extractive oracle. The 6B models are comparable to baselines on ROUGE while also significantly outperforming all baselines on BERTScore, including an 11B T5 model (Raffel et al., 2019) fine-tuned on the BookSum dataset.

Kryściński et al. (2021) report length being a confounder for BERTScore, with longer summaries having lower scores. We also find a slight negative correlation between length and BERTScore, but controlling for it does not significantly affect our conclusions (see Appendix I).

Note that we cannot rule out overlap of the BookSum dataset with our pretraining dataset. Nevertheless, from manual inspection of the trees, we believe that the summarization procedure largely reflects the structure of the book, rather than being a result of memorization from pretraining.

4.3 Human label efficiency of RL vs. BC

In Section 4.1.2 we found that our RL models outperformed our BC models. However, our RL models were trained on significantly more data. A significant open question is whether doing RL on summary comparisons is actually better than simple behavior cloning on an equal number of high-quality human demonstrations. Previous results from Stiennon et al. (2020) showed that doing RL greatly improved summary quality over their BC baseline, and even outperformed human-written summaries. However, their reference summaries were scraped from Reddit TL;DRs, which are often not good summaries of the original text, and they do not compare to collecting a similar number of high-quality demonstrations.

In this work, we use the same trained labelers to create demonstrations and comparisons, and directly compare RL to BC by plotting model performance versus the amount of human time required to produce each dataset. We study this on the first leaf summarization task rather than the full book summarization task to save human time.

We trained 3 versions of a 6B parameter BC baseline, with ¼, ½, and all the demonstrations. Then, we trained RL policies starting from each of the ¼ and ½ BC policies, with approximately the same number of comparisons as there were demonstrations. (We collected comparisons of the initial BC policies at temperature T=1, trained a reward model, and then ran a single round of RL with the initial BC policy at initialization.) For these BC policies, we used temperature T=0.6, while for RL policies, we used T=0 (see Appendix D.2 for justification).

We found that while RL on comparisons was about as effective as BC on demonstrations after 5k-10k demonstrations, comparisons were far more efficient on the margin after 10k-20k demonstrations (Figure 4). Furthermore, comparisons used to produce this figure were 3x as fast for us to collect as demonstrations (see Appendix E).

4.4 NarrativeQA: using book summaries for question answering

Another way to evaluate summaries is to test whether they can be used to answer questions about the original text (Scialom et al., 2019; Wang et al., 2020).

We applied our summarization model to the NarrativeQA question answering dataset (Kočiský et al., 2018), a dataset consisting of question/answer pairs about full book texts and movie transcripts. The question/answer pairs come from Wikipedia summaries, matched by title to the full text. In the full stories version of NarrativeQA, the model must use the original text.

We test whether our summaries can be used as input (instead of the full book or movie text) to a question answering (QA) model. For the QA model, we simply use a trained UnifiedQA model (Khashabi et al., 2020) in a zero-shot manner with temperature 0. We can give it either the depth 0 summary, or a concatenation of the depth 1 summaries (the concatenation of depth 2 summaries can be quite long). We found that depth 1 summaries work better.

As shown in Table 3, we achieve competitive results, despite our summarization model not being trained explicitly for question answering. However, we use far more parameters than Izacard and Grave (2020), the previous SOTA. When using smaller UnifiedQA models for question answering, results are substantially worse, suggesting that the quality of the QA model is a primary bottleneck (Figure 7). All our samples are available on our website.

Related work

Our work is directly inspired by previous papers that lay the groundwork for applying human feedback to reinforcement learning (Christiano et al., 2017), especially to large-scale tasks. Our task decomposition approach can be thought of as a specific instantiation of iterated amplification (Christiano et al., 2018), except we assume a fixed decomposition and start training from the leaf tasks, rather than using the entire tree. Similarly, our approach can be considered a form of recursive reward modeling (Leike et al., 2018) if we understand the purpose of model-generated lower-level summaries to be to help the human evaluate the model’s performance on higher-level summaries. Our contribution over these works is showing that this approach can be realistically applied to a difficult, large-scale task. We also build on the growing body of work that fine-tunes models with human feedback. This has been applied in many domains including summarization (Böhm et al., 2019; Ziegler et al., 2019; Stiennon et al., 2020), dialogue (Jaques et al., 2019; Yi et al., 2019; Hancock et al., 2019), translation (Kreutzer et al., 2018; Bahdanau et al., 2016), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhou and Xu, 2020), review generation (Cho et al., 2018), evidence extraction (Perez et al., 2019), and agents in simulated environments (Christiano et al., 2017; Ibarz et al., 2018).

There has been relatively little work on summarizing novels and other long-form fiction writing. Early work (Gorinski and Lapata, 2015) used graph-based methods to summarize movie scripts. Mihalcea and Ceylan (2007) introduced a dataset of book summaries scraped from CliffsNotes and tested an unsupervised extractive system based on MEAD (Radev et al., 2004) and Textrank (Mihalcea and Tarau, 2004). More recently, Ladhak et al. (2020) propose a method for extractive summarization of chapters of novels. There has been work on generating partial summaries of fictional stories: Zhang et al. (2019b) investigate generating character descriptions written by the story author, and Kazantseva (2006) investigates extractive methods for generating information about the story setting and characters, but not the plot. Relatedly, Bamman and Smith (2013) propose an unsupervised method for aligning books with human-written summaries. There has also been some work on question answering using full books (Mou et al., 2020; Izacard and Grave, 2020; Zemlyanskiy et al., 2021). Concurrent with our work, Kryściński et al. (2021) extended the datasets of Mihalcea and Ceylan (2007) and evaluated neural baselines.

While work on summarizing novels is sparse, there has been plenty of work on summarizing other kinds of long documents, such as scientific papers (Abu-Jbara and Radev, 2011; Collins et al., 2017; Subramanian et al., 2019; Cohan et al., 2018; Xiao and Carenini, 2019; Zhao et al., 2020; Sotudeh et al., 2020), and patents (Sharma et al., 2019), as well as multi-document summarization (Liu et al., 2018; Ma et al., 2020; Gharebagh et al., 2020; Chandrasekaran et al., 2020; Liu and Lapata, 2019a; Gao et al., 2020). Many of these techniques use a hierarchical approach to generating final summaries, either by having a hierarchical encoder (Cohan et al., 2018; Zhang et al., 2019c; Liu and Lapata, 2019a), or by first running an extractive summarization model followed by an abstractive model (Subramanian et al., 2019; Liu et al., 2018; Zhao et al., 2020; Gharebagh et al., 2020). The latter can be seen as a form of task decomposition, where the leaf task is document-level extractive summarization and the parent task is abstractive summarization conditioned on the extracted summaries.

The idea of decomposing hard tasks into multiple smaller sub-tasks has been used extensively in NLP. For example, Fan et al., (2018) generate fictional stories by first training models to generate a story prompt, and then training another model to generate the story conditioned on this prompt. The idea of saving human time by using models trained at lower levels of the hierarchy to help humans label data for higher-level tasks has also been explored. In Fan et al., (2020), models are used to search for evidence of facts, to help humans fact check faster and more accurately.

Discussion

Our main interest in this work is scaling human feedback to hard problems; we want to empower humans to give feedback to models on tasks that are very difficult to evaluate. We expect this to be a critical part of the alignment problem because we need to make sure humans can communicate their values to AI systems as they take on more societally-relevant tasks (Leike et al.,, 2018). If we develop techniques to optimize AI systems on what we actually care about, then we make optimization of convenient but misspecified proxy objectives obsolete.

In this paper, we showed that it is feasible to train models using human feedback on the difficult task of abstractive book summarization, by leveraging task decomposition and learning from human feedback. We also showed that doing RL on summary comparisons is more efficient than supervised learning on summary demonstrations, once the summarization policy has passed a quality threshold. Though we used a fixed decomposition strategy that applies only to summarization, the general techniques could be applied to any task. In this sense we have made progress towards optimizing what we actually care about: good summarization performance as judged by humans.

Something we do not address in this paper is training a single model to perform the entire top-level task, e.g. a single model that maps a book to a summary. This could be done via distillation as suggested in Christiano et al., (2018); however, in our case that would require training a single model with a very large context window, which introduces additional complexity. Furthermore, since the majority of our compute is at the leaf tasks, this would not save us much compute at test-time.

While our models successfully generate book-level summaries that contain much of the important information, they often read more as a list of events from the book, rather than a coherent summary that a human would write. In theory, this could be remedied with more rounds of RL at the top-level summarization task; however, in practice we found RL at higher levels of the tree to be challenging (see below).

Task decomposition assumes that separate parts of the task can be completed independently. However, this may not be true for summarizing books. For example, it may be hard to catch cases where earlier details in the book are only later revealed to be important (e.g. in mystery books). Our summarization models also sometimes generate inaccurate statements due to a lack of context; for example, there is a passage of Pride and Prejudice in which the main character gets asked for “their hand”. In the broader context of the chapter, it is clear that the character is being asked for a dance. However, this is not clear from only the local context of the leaf task, and thus the model summarizes it as asking for “her hand in marriage”. This is a general weakness of our training setup because we require each summary to be produced from only this local context, with a model that has not read the rest of the book.

Some of these issues may be alleviated by learning a decomposition procedure rather than using a fixed algorithm (see Appendix A.3 for some discussion). However, this may not resolve all of the problems with decomposition. Consider a case where important information is sprinkled lightly across many parts of the book, e.g. small details implying a buildup of love or resentment, where each detail is too minor to be included in a chapter summary despite being a prominent overall theme. Determining the kinds of tasks that are amenable to decomposition remains an open problem.

In general, policy errors at lower levels compound at each composition task, ultimately leading to large errors on the top-level task. Auto-induced distributional shift (ADS, see Section 2.3.1) may also be making training significantly more difficult, and curriculum choice may matter a lot as a result. Our curriculum and node sampling strategies were chosen in an ad hoc way.

As shown in Section 4.1, training on the full tree of tasks did not lead to improved performance. We discuss some possible reasons in Appendix G but leave thorough investigations to future work.

2 Open questions

Though our approach produced plausible book summaries, the limitations above suggest some open questions for future research. First, are there better and more principled curricula? Could one obtain improved performance by doing RL more on-policy, by generating the summary trees on the fly, or by training the reward model online as in Ziegler et al., (2019)? Is it better to have longer or shorter episodes, encompassing more or less of the tree? While having longer episodes means the policy has more in-distribution inputs at test time, it also means training on fewer trees for a given amount of compute and makes the reward model less on-distribution.

There are also many ways to improve the fundamental techniques for fine-tuning models using human feedback. For example, are there more efficient ways to collect data from humans instead of binary comparisons? Could other methods for optimizing against human feedback, such as expert iteration (Anthony et al.,, 2017), be more efficient?

Finally, there are questions for how this procedure extends to other tasks. Is learning a task decomposition model, rather than using a fixed decomposition, feasible for hard real-world tasks? For what kinds of tasks is task decomposition fundamentally limiting? How else can we use ML models to assist humans in specifying their preferences for high-level tasks? We hope to address some of these in future work.

3 Broader impacts

This work expands on the reward modeling technique proposed in Ziegler et al., (2019) and Stiennon et al., (2020). Thus, the broader impacts are similar to the ones described in those papers. On the positive side, our research is motivated by the benefits of aligning ML systems with human intentions. We believe alignment techniques are an increasingly important tool to improve the safety of ML systems, particularly as these systems become more capable. Conversely, improved alignment could also enable malicious actors to more easily train models that cause harm, and could also lead to increased automation of some jobs, leading to job loss. See the broader impacts discussion of Stiennon et al., (2020) for more discussion of these points. The difference in this paper compared to previous work on reward modeling is that we combine the technique with task decomposition, which allows us to use human feedback to train ML models to perform more difficult tasks. This amplifies both the potential benefits and the risks listed above.

One point we reiterate from Stiennon et al., (2020) is to be careful when defining the ‘good’ model behavior that labelers will reinforce. In other words, what or who should we align our models to? Deciding what makes a good summary is relatively straightforward, but defining good behavior becomes more difficult as we move beyond summarization to more complex tasks where humans might disagree on the correct model behavior.

When solely considering the impacts of automatic book summarization, our models still make many mistakes while summarizing, and thus should not be deployed in a setting where high summarization accuracy is necessary. Our model summaries also seek to preserve the intent of the book, whose contents may be harmful or biased.

Acknowledgements

We thank Wojciech Kryściński for discussion of book evaluation methods, and for help with BookSum; Alec Radford for discussions about baselines and NarrativeQA; Ben Mann, for help with our initial dataset; Michael Petrov, Alethea Power, Chris Hesse, and the entire OpenAI Supercomputing team for help with infrastructure; and Alex Ray, Mark Chen, Tom Brown, Nick Ryder, and others for help with and work on pretrained models.

We also thank Jonathan Uesato, Ethan Perez, Sam Bowman, Wojciech Kryściński, and Diogo Moitinho de Almeida for detailed feedback and suggestions on the paper; Pamela Mishkin for book suggestions and feedback on broader impacts; Kelly Clancy for discovering the Pride and Prejudice example; Natalie Summers for suggestions on books/scripts to use; Geoffrey Irving, Beth Barnes, William Saunders, and Dario Amodei for their support and thinking about our research agenda; Justin Wang for creating the graphics for the blog post; and Jeff Clune for the idea to modify books to check prior knowledge.

Last but not least, we’d like to thank all of our labelers, without whom this research would be impossible: Russell Bernandez, Gabriel Ricafrente, Laura Cowley-Martinson, Kelly Guerrero, Megan Niffenegger, Rachelle Froyalde, Ethan Myers, Stephen Ogunniyi, Jack Kausch, Jenny Fletcher, Charles Boone, Justin Dill, Celina Georgette T. Paglinawan, Bryce Vogel, Gabriel Perez, Cody St. Clair, Jelena Ostojic, Erol Can Akbaba, Maria Orzek, Alfred Lee, Ollie Horsfall, Eli Kapsack, Tasmai Dave, Cyra Mayell Denura, Sarah Mulligan, Emill Jayson Caypuno, Morris Stuttard, Ife Riamah, Sebastian Gonzalez, Vladan Djordjevic, Sarah Kirsten, Conor Agnew, William Brewer, Medeea Bunea, Joe Kwon, Chait Singh, Jennifer Brillo, Bashir Harrell, Leo Yung, Bekah Guess, Atresha Singh, and Jacob Bryan.

References

Part I Appendix

We generally aim for a text compression rate of 5-10x at each step, although the compression rate at top of the tree is typically lower, depending on the number of children of the root.

We also generally aim to chunk text at white-space boundaries such as repeated newlines, chapter boundaries, etc., though we do not guarantee this and it is done heuristically.

We filter out preamble and postamble using manually devised heuristics, though our labelers are instructed to output empty summaries upon such inputs if our heuristics do not catch everything.

Finally, the chunking code also consumes a random seed, allowing us to vary sectioning while still respecting the above desiderata.
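As a rough illustration of these desiderata, the sketch below packs paragraphs (split at repeated newlines) into chunks near a target size and uses a seed to vary where the boundaries fall. The character-based target, the 0.8 to 1.2 jitter range, and this signature for `chunkify_text` are assumptions made for illustration; the actual heuristics are not specified here.

```python
import random
from typing import List

def chunkify_text(text: str, target_chars: int = 2400, seed: int = 0) -> List[str]:
    """Greedily pack paragraphs into chunks of roughly target_chars characters."""
    rng = random.Random(seed)
    # Only split at blank-line boundaries, never inside a paragraph.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    # Jitter the size budget per chunk so different seeds give different sectioning.
    budget = target_chars * rng.uniform(0.8, 1.2)
    for para in paragraphs:
        if current and current_len + len(para) > budget:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            budget = target_chars * rng.uniform(0.8, 1.2)
        current.append(para)
        current_len += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Calling the function twice with the same seed gives the same sectioning, while a different seed will usually move the chunk boundaries.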

A.2 Structure

Inputs to leaf nodes are typically around 600 tokens. Then, for height 1 tasks, we concatenate 10-13 summaries (each up to 128 tokens). For higher height tasks, we target concatenating up to 8 summaries (each up to 192 tokens at height 2, or 384 tokens at higher heights), though it can be as low as 2 if there is not enough text, which is common at higher heights.

When applying our tree procedure, each book is split into about 200 leaf nodes on average, and about 20 height 1 nodes. Trees typically reach height 3 (meaning there are additionally height 2 composition tasks, and a final composition task), but on rare occasions reach height 4 or greater.
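The node counts above follow from the fan-in at each height. The short calculation below is only a sanity check: the function name is hypothetical, and it assumes a fan-in of 10 at height 1 (the low end of the 10-13 range) and 8 at higher heights.

```python
import math
from typing import List

def tree_level_sizes(num_leaves: int, height1_fanin: int = 10,
                     higher_fanin: int = 8) -> List[int]:
    """Return the number of nodes at each height, from the leaves up to the root."""
    sizes = [num_leaves]
    fanin = height1_fanin
    while sizes[-1] > 1:
        sizes.append(math.ceil(sizes[-1] / fanin))
        fanin = higher_fanin  # all heights above 1 use the smaller fan-in
    return sizes

sizes = tree_level_sizes(200)
print(sizes)           # [200, 20, 3, 1]
print(len(sizes) - 1)  # height of the root: 3
```

With 200 leaves this gives 20 height 1 nodes, 3 height 2 nodes, and a root at height 3, in line with the typical tree described above.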

A.3 Using input model summaries as ground truth

For each task, we ask labelers to consider only the quality of the summary with respect to the direct input to the model, rather than the subset of the book representing the true summarization target. Ideally, we would consider the ultimate task of the labeler or model to be to summarize or evaluate summaries of the full range of the book corresponding to the input in our decomposition. The role of the existing best model would be as a "helper model" to aid in that task (by producing summaries of parts of the book), but the labeler/model would potentially still refer to the original text when needed. Then the reward model at depth 0 would correspond to the "true" reward, rather than corresponding to only part of the trajectory.

Had we defined the tasks this way, it may have helped address the error accumulation problem discussed in Section 4.1.2. When inputs were contradictory or confusing, labelers could consult the original source. This would be particularly compelling if the model was also capable of question-answering.

Unfortunately, while we find this framing appealing, the pretrained models we had access to had limited context length. Furthermore, this would have complicated our infrastructure and made the task for labelers somewhat more difficult. Thus we start with the simpler version and leave such investigations to future work.

A.4 General task decomposition pseudocode

In this implementation of decomposition, the input at each step is simply a task which we wish to do, and a list of (subtask, response) pairs. The subtasks are assumed to have come from a previous invocation of the function, and the subtask responses should help in answering the primary task.

    def do_task(task, subtask_pairs=[]):
        result = decompose_if_needed(task, subtask_pairs)
        if type(result) == Decompose:
            # recursively get the response to the subtask
            subresponse = do_task(result.subtask)
            return do_task(
                task, subtask_pairs + [(result.subtask, subresponse)]
            )
        if type(result) == Respond:
            return answer_directly(task, subtask_pairs)

We have assumed the existence of two functions:

decompose_if_needed, which returns either a Respond() indicating the subtasks can be synthesized and answered by the model directly, or a Decompose(subtask) if the model requires help to solve the task. This subtask can be decomposed even further if necessary.

answer_directly, which returns an actual answer to the task, synthesizing the answers to subtasks.

In general, both decompose_if_needed and answer_directly could be learned and implemented by an ML model. In the fixed decomposition case, decompose_if_needed is implemented programmatically instead.

Note also that Decompose only returns a single subtask, rather than a list of them. This way, other child subtasks can depend on the result of the prior ones.

A.5 Book decomposition pseudocode

A basic implementation of our tree decomposition for books described in Section 2 might look like this:

    def decompose_if_needed(task, child_summaries):
        if len(task.text) < MAX_LENGTH:
            # just summarize actual book text
            assert not len(child_summaries)
            return Respond()
        # split text into parts of similar length
        chunks: List[str] = chunkify_text(task.text)
        # assume any existing N answers are for the first N chunks
        if len(child_summaries) == len(chunks):
            # we have all answers necessary, summarize concatenation
            return Respond()
        # We still need a summary for one of our children, recurse to it.
        # The outer loop will call the model to summarize this,
        # and append to child_summaries
        return Decompose(Task(text=chunks[len(child_summaries)]))

    def answer_directly(task, child_summaries):
        if not len(child_summaries):
            # actual book text
            to_summarize = task.text
        else:
            # child_summaries holds (subtask, response) pairs from do_task
            to_summarize = "\n\n".join(
                response for _, response in child_summaries
            )
        return model(to_summarize)

A version which correctly uses "previous context" is a bit more involved to implement. We keep an info field which tracks a mapping from depth to all summaries written at that depth so far. Note that the "previous context" summaries are from the same task depth (not necessarily the same task height). For example, at height 0, if summarizing pages 5-6, in addition to receiving the original text for pages 5-6, a model/human would also read the tail end of summaries for pages 1-4.

    def decompose_if_needed(task, child_summaries):
        if len(task.text) < MAX_LENGTH:
            # just summarize actual book text
            assert not len(child_summaries)
            return Respond()
        # split text into parts of similar length
        chunks = chunkify_text(task.text)
        # assume any existing N answers are for the first N chunks
        if len(child_summaries) == len(chunks):
            # we have all answers necessary, summarize concatenation
            return Respond()
        # we still need a summary for one of our children, recurse to it
        new_info = add_context_info(task.info, child_summaries)
        return Decompose(Task(
            info=new_info,
            depth=task.depth+1,
            text=chunks[len(child_summaries)],
        ))

    def answer_directly(task, child_summaries):
        if not len(child_summaries):
            # actual book text
            to_summarize = task.text
        else:
            # child_summaries holds (subtask, response) pairs from do_task
            to_summarize = "\n\n".join(
                response for _, response in child_summaries
            )
        return model(format_for_model(
            text=to_summarize,
            previous_context=get_context_for_depth(task.info, task.depth),
        ))

We are given a text to summarize, a depth $d$, and a mapping from depth to previous context for the text (which precedes the text we are summarizing).

If our text to summarize is small, we ask the model to produce a summary directly, conditioning on the previous context at our depth.

If the text to summarize is long, we break it into N smaller chunks.

We recursively ask for a summary of the first chunk, at depth $d+1$.

We append that first chunk summary to the previous context at depth $d+1$, and then recursively ask for a summary of the second chunk.

We finally concatenate the N chunk summaries into a final input, and summarize that, ensuring that the summary flows from the previous context at depth $d$.
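The procedure described in words above can also be written as a direct recursion. The following is a self-contained toy sketch, not the training code: `MAX_LENGTH`, `FANOUT`, `split_evenly`, and the `model` callable are hypothetical stand-ins, lengths are measured in characters rather than tokens, and the even split ignores the chunking heuristics of Appendix A.

```python
from typing import Callable, Dict, List

MAX_LENGTH = 200  # characters a single call can read; hypothetical stand-in
FANOUT = 4        # number of chunks per long text; hypothetical

Model = Callable[[str, List[str]], str]  # (text, previous_context) -> summary

def split_evenly(text: str, parts: int) -> List[str]:
    """Split text into at most `parts` consecutive pieces of similar length."""
    size = -(-len(text) // parts)  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(text: str, depth: int, context: Dict[int, List[str]],
              model: Model) -> str:
    """Summarize `text` at `depth`; `context` maps depth -> earlier summaries."""
    previous = context.setdefault(depth, [])
    if len(text) <= MAX_LENGTH:
        to_summarize = text  # small enough: summarize the book text directly
    else:
        # Summarize chunks in order at depth + 1. Each call sees the summaries
        # already written at that depth as its previous context.
        chunk_summaries = [
            summarize(chunk, depth + 1, context, model)
            for chunk in split_evenly(text, FANOUT)
        ]
        to_summarize = "\n\n".join(chunk_summaries)
    summary = model(to_summarize, list(previous))
    previous.append(summary)  # previous context for later text at this depth
    return summary
```

For example, passing `model=lambda text, prev: text[:20]` and an empty dict as `context` runs the whole tree with a trivial summarizer, which is enough to inspect which previous context each call receives.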

Appendix B Labeler interaction details

We use a similar process to Stiennon et al., (2020) for training labelers, and use many of the same labelers. We pay labelers an hourly wage and have relatively extensive on-boarding materials. All labelers are fluent in English and the majority are native speakers.

B.2 Quality control

We generally have a fairly involved quality control process, adopting techniques from Stiennon et al., (2020). We often have a second labeler give detailed feedback on task completion, and give the first labeler a chance to respond to that feedback. When doing composition tasks with human-written inputs, we also give a chance for labelers to give feedback on those inputs.

We also communicate frequently with our labelers via Slack, giving them a chance to give us feedback and vice versa.

B.3 Task interface

We use a website and task-allocation library developed specifically for giving tasks to labelers. We use different customized "renderers" for different tasks (demonstrations, comparisons, final evaluations, etc). See Figure 5 for an example of a demonstrations renderer.

Appendix C Labeling task details

The following guidelines were given to labelers for evaluating summary quality, and applied to both demonstrations and comparisons.

Coverage: All information in the summary should be important, and there should be no other more important information omitted from the summary. So gratuitously including small details is generally penalized, and omitting important details is also penalized.

Accuracy: All information in the summary should faithfully reflect the original passage.

Coherence: Ignoring the passage, the summary should not be confusing, ambiguous, or logically incoherent.

We also have a fourth criterion which is primarily applicable at higher heights. Labelers were to use their own judgment on how important it was.

Abstraction: When possible, writing should describe larger arcs and themes rather than just listing a series of events that happened.

In addition, we have the following guidelines:

The summary should flow from the end of the previous context

When using pronouns, resolutions should be clear for a naive reader

Reader uncertainty should be indicated in square brackets, e.g. [maybe]

Line breaks should be used to indicate a change of scene

Output should be empty if the content is preamble/postamble (publishing details, etc.)

Comparing summaries of different lengths can be very difficult, and result in e.g. systematic preferences for longer summaries, if labelers value summaries being informative rather than concise. Length was found to be a significant confounder of quality in Stiennon et al., (2020), who report length-controlled results.

Consistent with our coverage criterion, we ask for the best summary “overall”, controlling for length – a summary is evaluated for the particular length it was written at. For example, if summary A was 100 tokens and summary B was 200 tokens, we asked labelers to imagine that summary A had a 100 token “budget”, summary B had a 200 token “budget”, and to report which summary did a better job of using its budget. Overall, in our work, we find length has an insignificant effect on summary quality. This avoids the need to control for length.

Nevertheless, we set limits on length. Our allowed range of lengths increases as we summarize more of the book. We institute hard limits of 128 tokens for the height 0 (leaf level) tasks, 192 tokens for height 1, and 384 tokens for all other heights (we increased this limit mid-project from 192, and typical lengths are still much closer to 192). In practice, we do not frequently hit these length limits; when they are exceeded, we truncate the summaries before they are shown to humans (and before they are shown in this paper).

C.2 Differences between human and model tasks

In principle, our models and humans should be performing the exact same task. In practice, they differ very slightly, though we expect none of these differences to affect results or conclusions:

For demonstrations, although we ask for best “overall” taking length into account, humans can just as easily write good summaries at different lengths. Thus we gave our labelers a range of different suggested length targets within the acceptable range, with 20% headroom in either direction. This ensured our models tried outputting summaries at different lengths. The suggested lengths are typically chosen between half the length limit and the limit, roughly between 100 and 200 BPE tokens.

When collecting data (demonstrations and comparisons) on the first leaves, we typically have labelers do all the tasks consecutively at once, thus saving a bit of time by virtue of already having paged in the previous context – though this did cause labelers to see more context than a model doing the same task saw.

When doing the “contaminated” comparisons, labelers typically saw the same previous context for the summaries being compared. However, for some period, our reward model was seeing summaries with different previous contexts (for the same data collected).

Much of the expense of collecting a comparison is in reading the input text. We can speed up comparison collection by asking labelers to compare multiple pairs of summaries for each input text (at the cost of higher correlations in the collected data). Furthermore, the pairs of summaries can have overlap. In practice, we use up to 3 pairs of comparisons between 3 summaries. Though we could use a similar trick for demonstrations, we tried it briefly and abandoned it, as we were afraid the demonstrations for the same text would be too similar when written in quick succession.

For evaluation and diagnostic purposes, we also collect the following data, at various points in time:

When doing comparisons of summaries, we collect 1-7 Likert ratings for the primary criteria mentioned in Appendix C.1. We also always collect an overall Likert rating. Ratings reflect absolute quality rather than relative quality (to another summary).

We also ask for ratings of coherence of the input texts for composition tasks

At various points in time, we also collected other datasets, including but not limited to:

Annotations of spans in the summary which were inaccurate, incoherent, or exhibit poor coverage

Questions about the texts being presented

Overall, none of these data affected the primary task (of demonstration/comparison) in any way, and were simply supplementary data intended for future experimentation.

Appendix D Additional training details and hyperparameters

Our hyperparameter choices follow those of Stiennon et al., (2020). BC models and reward models are trained for 1 epoch. Learning rates are chosen by a separate sweep for each model size, and we use a cosine decay schedule. We use the Adam optimizer.

Like Stiennon et al., (2020), for reward models, we add an additional head on top of the final layer, initialized randomly. We often run multiple seeds and choose the best reward model based on validation loss/accuracy. We normalize the reward model to be zero-centered around human demonstrations prior to using it for RL. This makes it slightly easier to compare rewards across runs, and likely affects the optimization in a beneficial way (if at all). We also initialized the value function to the reward model weights, which we found helps learning.
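The zero-centering step can be pictured as subtracting a constant offset from the reward model's output. The sketch below is a simplification under stated assumptions: the function name is hypothetical, and the reward function here scores a summary string alone, whereas a real reward model would score a summary together with its input.

```python
from statistics import mean
from typing import Callable, Sequence

def zero_center_on_demonstrations(
    reward_fn: Callable[[str], float], demonstrations: Sequence[str]
) -> Callable[[str], float]:
    """Return a reward function shifted so human demonstrations average to zero."""
    # The offset is computed once, on human-written demonstrations only.
    offset = mean(reward_fn(d) for d in demonstrations)
    return lambda summary: reward_fn(summary) - offset
```

Because the shift is a constant, it does not change which of two summaries is preferred; it only moves the scale so that rewards from different runs share a reference point.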

For reinforcement learning, we primarily tune KL coefficient and learning rate. KL coefficient is generally chosen in an ad-hoc way to target a KL range we deemed reasonable - we used 0.02 for most runs, but also experimented with 0.01 and 0.03 earlier in the project. Learning rates are chosen using sweeps for each model size (very roughly chosen, for 175B). We use linear learning rate decay and run for up to 200,000 episodes (for most of the project, we used 150,000 episodes).
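To make the role of these two hyperparameters concrete, here is a minimal sketch. It assumes the KL-penalized reward takes the form used by Stiennon et al., (2020), where the reward model score is reduced by the KL coefficient times the log-probability ratio between the RL policy and the supervised policy, and it assumes the linear schedule decays to zero at the end of training. Both function names are hypothetical.

```python
def kl_penalized_reward(reward: float, logprob_policy: float,
                        logprob_reference: float, kl_coef: float = 0.02) -> float:
    """Reward model score minus a penalty for drifting from the reference policy."""
    return reward - kl_coef * (logprob_policy - logprob_reference)

def linear_decay_lr(initial_lr: float, episode: int,
                    total_episodes: int = 150_000) -> float:
    """Learning rate that falls linearly from initial_lr to zero (an assumption)."""
    remaining = max(0.0, 1.0 - episode / total_episodes)
    return initial_lr * remaining
```

A larger `kl_coef` keeps samples closer to the supervised policy, which is why the coefficient is tuned to target a KL range rather than set once.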

D.2 Temperature

To ensure that we compared against a fair baseline, we swept temperatures and had labelers evaluate quality of various BC models on the leaf tasks. In Figure 6, we find that the 6B supervised model is best at around T=0.6, while the 175B supervised model is best around T=0.3.

Higher level tasks followed a similar overall pattern (although we have noisier estimates). We later found in final evaluations that better temperatures for individual tasks were predictive of performance on the full book summarization tasks as well.

D.3 Input format

The input format concatenates the following, in order: previous context summaries separated from each other by "\n----\n", the separator "\n====\n", the text to summarize, and finally the phrase "TL;DR:". The model then generates the summary after that. The previous context summaries are truncated (from the beginning) to fit within the 2048 token context window while leaving room for a summary of maximal length.
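A minimal sketch of this format follows. Several details are assumptions: `count_tokens` uses whitespace splitting as a stand-in for the real BPE tokenizer, truncation drops whole summaries from the beginning rather than individual tokens, and no extra whitespace is inserted before "TL;DR:" since the description above does not mention any.

```python
from typing import List

CONTEXT_WINDOW = 2048
CONTEXT_SEP = "\n----\n"  # between previous context summaries
TEXT_SEP = "\n====\n"     # between previous context and the text to summarize
SUFFIX = "TL;DR:"

def count_tokens(s: str) -> int:
    """Whitespace token count; a stand-in for the real BPE tokenizer."""
    return len(s.split())

def format_for_model(text: str, previous_context: List[str],
                     max_summary_tokens: int) -> str:
    """Build the prompt, dropping the oldest context summaries until it fits."""
    context = list(previous_context)
    while True:
        prompt = CONTEXT_SEP.join(context) + TEXT_SEP + text + SUFFIX
        fits = count_tokens(prompt) + max_summary_tokens <= CONTEXT_WINDOW
        if fits or not context:
            return prompt
        context.pop(0)  # truncate from the beginning: oldest summary goes first
```

Here `max_summary_tokens` would be the length limit for the task's height, so that a summary of maximal length still fits in the window.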

Appendix E Human timing

We collected detailed timing information which let us know how long the primary tasks took.

We found empirically that comparisons are about twice as fast as demonstrations, ignoring read time. Including read time, they were about 40% faster. For leaf tasks, where the distribution is not policy-dependent, we estimate 2.5 minutes reading, 4 minutes per written demonstration, and 1.5 minutes per comparison.

Since for both (especially comparisons), reading the passage is a non-trivial part of the cost, amortizing the read time across many demonstrations or comparisons can help increase the rate of data collection. We briefly tried collecting demonstrations of different lengths; however, we found the demonstrations to generally be quite similar and stopped collecting such data early on. For comparisons, we typically collect 3 at a time, thus amortizing the read time down to around 0.8 minutes for leaf tasks. This makes comparisons nearly 3x faster than demonstrations (2.3 minutes vs. 6.5 minutes). Empirically, we find it over 3x faster (1.8 minutes). This may be because we typically compare all pairwise combinations between 3 samples, thus yielding only $\log_{2}(6) \approx 2.58$ bits of information rather than 3 bits, but also saving on time processing each summary. Similar results hold across all heights. Demonstrations generally took between 10 and 15 minutes total, while a set of 3 comparisons also took between 10 and 15 minutes.
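The leaf-task estimates above can be reproduced with a few lines of arithmetic. The variable names are ours; the numbers are the estimates quoted in this appendix.

```python
import math

READ_MIN = 2.5            # reading the passage
DEMO_MIN = 4.0            # writing one demonstration
COMPARE_MIN = 1.5         # making one comparison
COMPARISONS_PER_READ = 3  # comparisons collected per passage read

demo_total = READ_MIN + DEMO_MIN                                  # 6.5 minutes
comparison_total = READ_MIN / COMPARISONS_PER_READ + COMPARE_MIN  # about 2.3 minutes
speedup = demo_total / comparison_total                           # about 2.8x

# Three independent binary comparisons could carry up to 3 bits, but all
# pairwise comparisons among 3 summaries only pick out one of 3! = 6 rankings.
bits_from_ranking = math.log2(math.factorial(3))                  # about 2.58 bits
```

The amortized read time is 2.5 / 3, or roughly 0.8 minutes per comparison, which is where the 2.3 minute figure comes from.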

Our results in Figure 4a on the first subtree use these practices. The results hold despite comparisons being 3x faster to collect and each yielding far less information (less than 1 bit per comparison, versus potentially thousands per demonstration). When plotting with estimated human time, the advantage of RL is more apparent, see Figure 4b.

E.2 End-to-end baseline estimates

It took over 12 hours on average for a labeler to read a full book, and additionally over 1 hour to write the summary. This is over 50 times longer than it takes labelers to do a single decomposed summarization task. Thus using the same amount of human time as Figure 2(b) (enough for 100K total demonstrations and comparisons), we would have had summaries for at most 2K distinct books.

While existing book summary datasets can be scraped from the Internet (e.g. from study guides such as Sparknotes), they typically have only hundreds of well-known books. For example, Bamman and Smith, (2013) has 439 (book, summary) pairs.

Another consideration is that reading time can be amortized greatly by having contractors write multiple summaries per book. In practice, we found it difficult to have contractors write multiple distinct summaries. Nevertheless, this could plausibly save a substantial amount of time if executed well.

Furthermore, learning the book summarization task end-to-end would likely be much more difficult than the decomposed tasks, as the model would need to learn attributions across an extremely long context. Overall, we believe an end-to-end baseline would likely have been infeasible with naive methods, but leave it to future work to try.

Appendix F Mistakes and miscellaneous learnings

Given that we were doing a recursive strategy, we should’ve made the base case smaller. Reward modeling did not work as well on the leaf level task as it had on the TL;DR task from Stiennon et al., (2020). With shorter input texts and summaries, we may have seen signs of life much sooner.

The “contamination” setup (see Appendix C.2.2) complicated our infrastructure, and resulted in task mismatch between the human and model. We likely should have assumed the more general setup immediately.

F.2 Miscellaneous Learnings

We tried initializing reward models from the previous one and fine-tuning on only the data collected since. We could not tell whether this was better or worse, though it saved on compute.

Similarly, we considered initializing RL models from the previous one (and also using the previous RL model for the KL penalty). However, RL seems to lose entropy in suboptimal ways: at some point, our model really favored summaries that started with "[X] reflects". For this reason, we always use the most recent supervised policy, rather than the best RL policy, for the RL initialization and KL penalty. However, further investigation is needed.

We collected structured feedback of when the models made coverage/coherence/accuracy mistakes, with highlights of spans where errors occur. Training on this data as a supervised task did not help as initialization for reward models. However, this was very exploratory and we remain very excited about future work in this direction.

Postamble filtering didn’t seem necessary, even though the model was barely trained on postambles (whereas comparatively a lot of training data contained preambles).

Training a reward model to directly predict Likert scores using a least squares loss resulted in similar accuracy to our binary comparison based models.

Appendix G Difficulty and mysteries of full tree training

As shown in Section 4.1, training on the full tree of tasks did not lead to improved performance. We give some possible reasons for this.

Lack of hyperparameter tuning: We did not tune the 175B models much due to compute costs.

Poor input distribution and noisy comparisons for higher level tasks: The quality of the input summaries given to the model (and thus to human evaluators when evaluating this model) degrades as one moves up the tree. The quality of input summaries is important for labeling accuracy: we found that inter-labeler agreement went down when labelers judged the input summaries as less coherent. Thus, the training signal degrades if we move to training on higher level tasks too early, before the summarization models have passable summaries.

Poor node sampling during RL: Our episode sampling strategy described in Section 2.3.3 may have been suboptimal. Rather than the vast majority of tasks being height 0 tasks, only about one third are. This is in contrast with evaluation time, where height 0 are both most numerous and potentially most important. Empirically, we found that the best full tree 175B RL model did sacrifice performance on lower heights in order to do better at the higher height task, relative to the best first subtree model (which had similar full book performance overall). However, the later full tree 175B RL model, shown as the unfortunate dip found in Figure 2(b), had worse Likert scores at all heights. This makes the explanation somewhat unlikely, although it is possible that it is actually better at higher heights and the shift in lower height summaries makes it appear worse.

Most of the above reasons do not explain why training on more full tree data decreased performance for our 175B model. We do not have good hypotheses for why, but we also cannot rule out a bug in the training code, or randomness across RL runs. Our initial guess was that the behavioral cloned model or the reward model performance had degraded – however, they did not regress significantly on lower height tasks on loss and accuracy metrics, compared to corresponding models trained only on first subtree data. While this does not rule out a reward model which is generalizing worse in some way during RL, it leads us to believe the issues were primarily elsewhere in the RL.

Appendix H NarrativeQA: additional findings

In Figure 7 we show ablations on NarrativeQA for both the UnifiedQA model size and the summaries used as input to the UnifiedQA model.

H.2 GPT-3 memorization

For the QA model, we also attempted using a pretrained GPT-3 in a few-shot manner, similar to Brown et al., (2020). However, unlike the UnifiedQA models, pretrained GPT-3 surprisingly achieved extremely strong performance without any summaries. In fact, the 175B parameter model had state-of-the-art results according to all metrics except ROUGE-L (on which it was extremely close).

H.3 Zero-shot recursive question answering

We also attempted a method of prompt engineering to cause our summarization model to act as a recursive question-answering model. To do this, we run our tree procedure, but augment each step with the question. Specifically, we add an additional prompt between the passage and response: "Answer the following question based on the above passage, or reply with a summary of relevant information if no answer is found: {question}". The procedure can be viewed as a type of summarization with respect to a question. Unfortunately, this is quite expensive, since we need to re-run the entire tree for each question.
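The question-augmented tree procedure can be sketched roughly as follows. This is an illustrative sketch only: the prompt text is taken from the description above, but `summarize` is a hypothetical stand-in for a call to the summarization model, and the separators, the `group_size` parameter, and the grouping scheme are assumptions rather than details of our actual pipeline.

```python
QUESTION_PROMPT = (
    "Answer the following question based on the above passage, or reply with "
    "a summary of relevant information if no answer is found: {question}"
)


def build_prompt(passage: str, question: str) -> str:
    """Place the question prompt between the passage and the response."""
    # The blank-line separators are an assumption, not the paper's format.
    return f"{passage}\n\n{QUESTION_PROMPT.format(question=question)}\n\n"


def answer_recursively(chunks, question, summarize, group_size=2):
    """Run the whole tree for one question and return the root output.

    `summarize` is a hypothetical callable mapping a prompt string to the
    model's response string.
    """
    if not chunks:
        raise ValueError("need at least one chunk of text")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    # Height 0: each chunk of the book is summarized with respect to the question.
    level = [summarize(build_prompt(chunk, question)) for chunk in chunks]
    # Higher heights: concatenate neighbouring outputs and repeat.
    while len(level) > 1:
        grouped = [
            "\n".join(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
        level = [summarize(build_prompt(group, question)) for group in grouped]
    return level[0]
```

Every call to `answer_recursively` visits every node of the tree again, which is why the cost grows linearly with the number of questions.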

The authors evaluated a small sample of 100 such depth 0 trees and found that this gave even better answers, although the "answers" still tended to include extraneous summarization-like information. We found 29 of 100 questions were correctly answerable (clearly agreed with at least one of the gold labels), and a further 8 were either partially correctly answerable or correctly inferable. On the other hand, for the trees without the question augmentation, we deemed only 10 of 100 correctly answerable, and 12 partially correctly answerable or correctly inferable. During this process, we also found that a substantial percentage of the NarrativeQA dataset appeared to have incorrect texts, where the questions do not appear to be about the correct book.

H.4 Comparison to prior work

The NarrativeQA results highlight that our model summaries contain enough useful and accurate information to answer questions about the original book. While previous methods are far more parameter efficient (Izacard and Grave (2020) had 2 orders of magnitude fewer parameters, and ReadTwice (Zemlyanskiy et al., 2021) had nearly 3 orders of magnitude fewer parameters), there are some advantages of using an approach like ours:

First, our technique is quite general, and answers questions fully abstractively, rather than via token extraction. For example, we observed the model inferring that a country of interest was England, despite it having no explicit mention in the summary besides the mention of London.

Second, when answering 30 questions per passage, we require only one forward pass over the full book rather than 30, with the remaining passes being over a much smaller text. (On the other hand, we cannot answer questions that are not answered by the summary.)

Lastly, and most importantly, we retain the benefits of decomposition. Our model’s answers can often be easily traced back to the source in the book, and by leveraging the tree structure, we can often tell where mistakes led to wrong answers. Our model’s summaries can help a human perform question answering quickly – see Appendix H – whereas the approach of Zemlyanskiy et al., (2021) produces hard-to-interpret latents.
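The second advantage above (one pass over the full book, followed by many cheap passes over the summary) can be made concrete with a rough token count. The sketch below is a simplification under stated assumptions: the function names and the book and summary lengths are hypothetical and not measurements from our experiments, and the count ignores the extra tokens read at higher levels of the summarization tree.

```python
def tokens_read_per_question(book_tokens: int, num_questions: int) -> int:
    """A reader that processes the full book once for every question."""
    return book_tokens * num_questions


def tokens_read_with_summary(
    book_tokens: int, summary_tokens: int, num_questions: int
) -> int:
    """Summarize the book once, then answer each question from the summary."""
    return book_tokens + summary_tokens * num_questions


# Hypothetical sizes chosen only for illustration.
direct = tokens_read_per_question(book_tokens=100_000, num_questions=30)
via_summary = tokens_read_with_summary(
    book_tokens=100_000, summary_tokens=500, num_questions=30
)
print(direct, via_summary)  # 3000000 115000
```

The saving depends entirely on the summary being much shorter than the book, which is also why questions not covered by the summary cannot be answered.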

Appendix I BookSum: BERTScore length control

Kryściński et al., (2021) report length being a confounder for BERTScore, with longer summaries having lower scores. We also find a slight negative correlation between length and BERTScore. However, using a simple linear regression to control for length does not significantly change our scores. See Table 4 for details. Furthermore, our length distribution overlaps significantly with the reference summary lengths, while the BERTScores are consistently higher than the average, at all lengths. See Figure 8.
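One plausible way to control for length with a simple linear regression is sketched below. It is not necessarily the exact procedure behind Table 4: the function names and the choice of a single `reference_length` are assumptions made for illustration. The idea is to fit score against length by ordinary least squares and then shift each score to the value the fitted line implies at a common length.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # raises ZeroDivisionError if all lengths are equal
    return slope, y_bar - slope * x_bar


def length_adjusted_scores(lengths, scores, reference_length):
    """Remove the fitted linear length effect, relative to reference_length."""
    slope, _ = fit_line(lengths, scores)
    return [
        score - slope * (length - reference_length)
        for length, score in zip(lengths, scores)
    ]
```

If the fitted slope is small, the adjusted scores stay close to the raw ones, which is the situation described above.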

Appendix J Book summary qualitative findings

We chose our task of abstractive summarization of narrative books to be difficult, and our models are still far from human quality. Here are some of the problems that our labelers reported, roughly in order of frequency and severity.

The model frequently gets confused between characters, mis-attributing actions. Interpersonal relationships of the characters were often incorrect and events were wrongly attributed. Sometimes the name given to the protagonist was that of a peripheral character, or even the author’s name. This is exacerbated by mis-resolved pronouns and long dialogues, and likely also by the concatenation of summaries.

The model is often unable to pick out the important information, instead reporting disjointed bits of unimportant material. The “essence” of the story was missing from many summaries. For example, a summary of A Promised Land never mentioned Obama’s presidency. In books with unique imaginary/speculative elements, the model fails to integrate key world-building details. This makes some science fiction and fantasy books particularly hard to summarize.

Relatedly, the model tends not to abstract away from specific happenings. For example, it rarely judges characters’ mental states or authorial intent, and rarely condenses a very long chain of events into a single coherent one.

The model tends to focus more on earlier material.

The model doesn’t handle scene switches or flashbacks well (e.g. The Midnight Library has incursions from different universes, and Transcendent Kingdom is non-chronological).

Occasionally a quote/excerpt was selected that misrepresented a character or their actions (e.g. a character trying to hide their identity acting like someone else).

J.2 Preexisting knowledge

We found that the model readily leveraged preexisting knowledge from pretraining, often in interesting ways.

Labelers reported behavior such as the model using the fact that Anakin Skywalker’s daughter is Leia in the Star Wars universe, even though this was not mentioned in the passage. One of the books in full book evaluations was The Ballad of Songbirds and Snakes, a prequel to the previously published Hunger Games trilogy. A labeler noticed that the model spuriously mentioned characters from the main trilogy who did not appear in the prequel. Sometimes the model applies this knowledge incorrectly, such as introducing a real-world actor’s name into a fictional story in place of a fictional actor.

Another labeler reported that bilingual text was partially translated in the summary, with the model taking “The woman at my mother’s side reached out to touch her—vas a estar bien, she told her before turning to walk back to her car.” and summarizing “The woman accompanying them tells his mother she’ll be ok.”

As a further confirmation of this, we tried summarizing a version of Harry Potter with many characters given replacement names. The model nonetheless translated “you-know-who” back to Voldemort, even though Voldemort had been given a different name.

J.3 Difficulty of summarizing narrative fiction

Despite the fact that our model was trained on narrative fiction, narrative fiction books seemed to remain more difficult to summarize than other books, due to the reasons outlined in J.1.

Of the 40 books we chose for the full book evaluations, 6 were nonfiction (see Table 5). These 6 books had significantly higher Likert ratings than the fiction books (1st, 2nd, 4th, 5th, 7th, and 11th highest average ratings of model summaries). Furthermore, the only book which our labelers judged as non-narrative (Caste, which was determined to have no plot) had the 2nd highest Likert ratings. While this is not strong evidence, it agrees with the qualitative reports from J.1.

Appendix K Book summary samples

We provide examples of Project Gutenberg summary trees on our website (https://openaipublic.blob.core.windows.net/recursive-book-summ/website/index.html#gutenberg). We also provide examples from our test set of books published in 2020.

See Table 5 for the full list of books we used for the evaluations in Section 4.1, based on popularity from Goodreads according to this list at the time we checked.

K.2 Book samples

To provide a better understanding of the quality of the summaries generated by our models, we show samples at various Overall Likert scores, ranging from 2 to 6 (Tables 6-10), for books from the Goodreads test set (which our model has not seen during training). We select the books at random with the constraint that our 175B first-tree RL policy has one summary that attains the desired Likert score. For each book, we show the best human-written summary, the 175B RL summary with the desired Likert score, and a random summary from the 175B BC policy at T=0.