BookSum: A Collection of Datasets for Long-form Narrative Summarization

Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

Introduction

Text summarization aims at condensing long documents into a short, human-readable form which contains only the salient parts of the source. Leveraging the cutting-edge findings in natural language processing, such as multi-task learning methods (Raffel et al., 2019), pre-training strategies (Zhang et al., 2019a), and memory-efficient architectures (Zaheer et al., 2020), text summarization has seen substantial progress.

The majority of papers published in the field focus on summarizing newswire documents from popular datasets, such as CNN/DailyMail (Nallapati et al., 2016). Other domains gaining the interest of the research community are scientific and legal documents, with notable datasets being Arxiv/PubMed (Cohan et al., 2018) and BigPatent (Sharma et al., 2019). While the performance of state-of-the-art methods on those datasets is impressive, the mentioned domains have inherent shortcomings and thus pose limited challenges for future generations of text summarization systems. First, the length of summarized documents is limited, ranging from a few hundred words in the case of news articles to a few pages for scientific documents and patent applications. In most cases, such short-form documents can be quickly read by humans, thus limiting the practical value of automatic summarization systems. Second, the domains under consideration impose strict requirements regarding the document’s layout and stylistic features (see owl.purdue.edu/owl/purdue_owl.html). Statements follow a logical order and all facts are offered explicitly, leaving limited space for interpretation and reasoning. Additionally, such constraints can introduce layout biases into the datasets which later dominate the training signal of the summarization systems; the lead bias present in news articles is one example of such effects (Kedzie et al., 2018; Kryściński et al., 2019). Third, documents in the mentioned domains lack long-range causal and temporal dependencies and rich discourse structures. Due to the limited length and fact-centric style of writing, most causal dependencies span only a few paragraphs, temporal dependencies are organized in a monotonic fashion where newly introduced facts refer only to previously stated information, and documents lack features such as parallel plot lines.

In this work we address the shortcomings of existing datasets and introduce BookSum, a collection of data resources for long-form narrative summarization (Ladhak et al., 2020). The data covers documents from the literature domain, including stories, plays, and novels (Fig. 1), each provided with a highly abstractive, human-written summary. Leveraging the characteristics of fiction writing, BookSum introduces a set of new challenges for summarization systems: processing long-form texts ranging up to hundreds of pages, understanding non-trivial causal and temporal dependencies spread out through the entirety of the source, handling documents with rich discourse structure which include parallel plots or changes between narration and dialogue, and generating highly abstractive and compressive summaries. Solving such challenges will require progress in both automatic document understanding and processing of long inputs. To support incremental progress, the BookSum collection includes examples on three levels of granularity with increasing difficulty: 1) paragraph-level, with inputs consisting of hundreds of words and short, single-sentence summaries, 2) chapter-level, with inputs covering several pages and multi-sentence summaries, 3) book-level, with inputs spanning up to hundreds of pages and multi-paragraph summaries. The hierarchical structure of the dataset, with aligned paragraph, chapter, and book-level data, makes it a viable target for single- and multi-document summarization approaches.

To demonstrate the new set of challenges for text summarization models introduced by BookSum and lay the groundwork for future research, we evaluated several state-of-the-art extractive and abstractive summarization architectures on the newly introduced task. We share the data preparation scripts here: https://github.com/salesforce/booksum.

Related Work

The availability of digital documentation has translated into a number of novel, large-scale datasets for text summarization that span a variety of domains. In the news domain, Sandhaus (2008) introduced a corpus of news articles from the New York Times magazine with summaries written by library scientists. Nallapati et al. (2016) collected articles from the CNN and DailyMail portals with multi-sentence article highlights repurposed as reference summaries. Narayan et al. (2018) aggregated articles from the BBC website with highly abstractive, single-sentence summaries. Grusky et al. (2018) introduced a dataset spanning 38 news portals, with summaries extracted from the websites’ metadata. In the academic article domain, Cohan et al. (2018) collected scientific articles from the Arxiv and PubMed article repositories and used paper abstracts as reference summaries. Wang et al. (2020) aggregated a set of articles in the medical domain related to the Covid-19 pandemic, also using paper abstracts as reference summaries. Hayashi et al. (2020) introduced a multi-domain collection of scientific articles, each with two associated summaries, one covering the article’s contributions, the other explaining the context of the work. Related to dialogue summarization, Pan et al. (2018) repurposed image captioning and visual dialogue datasets to create a summarization dataset containing conversations describing an image, with image captions considered the summaries. Gliwa et al. (2019) introduced a corpus of conversations between hired annotators designed to mimic interactions on a messaging application, with human-written summaries. In the legal domain, Sharma et al. (2019) collected a large collection of patent filings with associated, author-written invention descriptions.

Despite the increased interest in the broader field of text summarization, little work has been done in summarizing stories and novels. In Kazantseva (2006), the authors focused on generating extractive overviews of short works of fiction. The work proposed two modeling approaches, one utilizing decision trees and the other based on a manually designed system of rules, with experiments conducted on a set of 23 short stories. Mihalcea and Ceylan (2007) introduced the task of book summarization along with a set of resources and baselines. The authors collected and curated a set of 50 books from the Gutenberg Project, with two human-written summaries associated with each book collected from online study guides. More recently, Zhang et al. (2019b) tackled the problem of generating character descriptions based on short fiction stories. The authors collected a dataset of stories with associated, author-written summaries from online story-sharing platforms and proposed two baseline methods for solving the task. Ladhak et al. (2020) explored the problem of content selection in novel chapter summarization. The authors studied different approaches to aligning paragraphs from book chapters with sentences from associated summaries and created a silver-standard dataset for extractive summarization. The work also studied the performance of extractive models on the task.

Our work extends the efforts made by Ladhak et al. (2020). The BookSum corpus prioritizes abstractive summarization and offers aligned data on three levels of granularity (paragraph, chapter, full-book), substantially increasing the number of available examples. We also benchmark the performance of state-of-the-art extractive and abstractive methods on all introduced data subsets.

Dataset

In this section we describe the data sources and pre-processing steps taken to create the BookSum data collection and conduct an in-depth analysis of the collected resources.

Despite the popularity of books in electronic format, aggregating and sharing literature pieces is a non-trivial task due to the copyright law protecting such documents. The source documents available in BookSum were collected from the Project Gutenberg public-domain book repository (US edition: https://www.gutenberg.org/) and include plays, short stories, and novels whose copyrights have expired. Associated summaries were collected using content provided by the Web Archive (https://web.archive.org/). The summary data includes both book- and chapter-level summaries.

Data Acquisition

Source texts were downloaded in plain text format in accordance with Project Gutenberg’s guidelines (https://www.gutenberg.org/policy/robot_access.html). The data collection contains texts exclusively from the US edition of Project Gutenberg. Summaries were collected using content provided by the Web Archive and processed using the BeautifulSoup library (https://crummy.com/software/BeautifulSoup/). Collecting summaries from several independent sources with small content overlaps between them resulted in certain texts having multiple associated summaries. Upon manual inspection, substantial stylistic differences were found between the related summaries, thus such coverage overlap was considered advantageous for the dataset.

Data Cleaning & Splitting

To ensure high quality of the data, both the source texts and summaries were cleaned after collection. Metadata containing author, title, and publisher information was removed from source files. The documents were manually split into individual chapters to accommodate chapter-level summarization. Due to the unstructured nature of plain text files, heuristic approaches were used to extract chapter content. Initial, automatic chapterization was done using the regex-based Chapterize tool (https://github.com/JonathanReeve/chapterize). However, an inspection of outputs revealed many partially processed and unprocessed files; such instances were chapterized manually by the authors of this work. Paragraph-level data was obtained by further splitting the extracted chapters into individual paragraphs based on a whitespace pattern. Short paragraphs and dialogue utterances were aggregated to form longer paragraphs. Collected summaries were also inspected for scraping artifacts and superfluous information. Regular expressions were used to remove leftover HTML tags, author’s notes, and analysis parts that were not directly related to the content of the summary.
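The paragraph extraction step can be illustrated with a minimal sketch. It assumes that paragraphs are separated by blank lines and that "short" means fewer than a chosen number of words; the function name `split_paragraphs` and the `min_words` threshold are hypothetical and are not taken from the released scripts.

```python
import re


def split_paragraphs(chapter_text, min_words=20):
    """Split a chapter on blank lines, then merge short pieces into longer ones.

    Both the blank-line separator and the min_words threshold are assumptions
    made for illustration only.
    """
    # A blank line (possibly containing spaces or tabs) separates raw paragraphs.
    # Internal line wraps are collapsed into single spaces.
    raw = [" ".join(p.split()) for p in re.split(r"\n\s*\n", chapter_text) if p.strip()]
    merged, buffer = [], []
    for piece in raw:
        buffer.append(piece)
        # Emit the buffer once it has accumulated enough words.
        if sum(len(b.split()) for b in buffer) >= min_words:
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:
        # Attach a short trailing remainder to the previous paragraph, if any.
        if merged:
            merged[-1] = merged[-1] + " " + " ".join(buffer)
        else:
            merged.append(" ".join(buffer))
    return merged
```

With this greedy scheme, a run of short dialogue utterances is folded into the paragraph that follows it, which is one plausible way to realize the aggregation described above.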

Data Pairing

Source texts and associated summaries were collected independently of each other and required alignment. The pairing procedure was conducted in phases, starting with coarse-grained full-text alignments and ending with fine-grained paragraph alignments, with each phase involving automatic alignments followed by manual inspection and fixes. Full texts were paired with summaries based on title matches and later verified by matching author names. To accommodate automatic alignment, titles were normalized into a common format with lower-case letters and all punctuation characters removed. Chapter alignments were based on chapter metadata, extracted during source text chapterization, and chapter titles collected from online study guides. Similar to full-text titles, chapter names were transformed to a common format, with chapter names lower-cased and stripped of punctuation characters, and chapter numbers translated to roman numerals. Paragraph-level alignments were computed between paragraphs extracted from chapters and individual sentences of chapter-level summaries. Following a two-step process introduced by Ladhak et al. (2020), the alignment process was preceded by a human-based study aimed at finding an optimal alignment strategy, with its details presented in Appendix B. With the insights from the study, paragraph-sentence similarities were computed using a SentenceTransformer (Reimers and Gurevych, 2019), and a stable matching algorithm (Gale and Shapley, 1962) was used to obtain the final alignments. All examples on the chapter- and book-level, and a random subset of examples on the paragraph-level, were manually inspected to ensure high quality of data. Quantitative verification of alignment quality is discussed in Appendix C.
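The title and chapter-name normalization can be sketched as follows. This is an illustrative reading of the description, not the released code: the helper names are hypothetical, only ASCII punctuation is stripped, and standalone arabic numbers are assumed to be the chapter numbers that get translated to roman numerals.

```python
import string


def normalize_title(title):
    """Lower-case a title and drop ASCII punctuation and extra whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(title.lower().translate(table).split())


def to_roman(number):
    """Convert a positive integer to a lower-case roman numeral."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        out.append(symbol * count)
    return "".join(out)


def normalize_chapter(name):
    """Normalize a chapter name and rewrite arabic numbers as roman numerals."""
    tokens = normalize_title(name).split()
    return " ".join(
        to_roman(int(t)) if t.isdecimal() and int(t) > 0 else t for t in tokens
    )
```

Under these assumptions, "Chapter 14." and "CHAPTER XIV" both map to "chapter xiv", so the two sources can be joined on the normalized string.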

Data Splits

The data was split into training, validation, and test subsets in an 80/10/10% proportion. To prevent data leakage between data subsets, the splits were assigned per book title, meaning that all paragraph, chapter, and full-book examples belonging to the same book title were assigned to the same data split. For consistency with the dataset introduced by Ladhak et al. (2020), all titles overlapping between the two datasets were assigned to the same splits. Remaining titles were assigned to splits at random following the predefined size proportions. The data collection and pre-processing pipeline is visualized in Figure 3 in Appendix D.
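A title-level split of this kind might look like the sketch below. The function `assign_splits`, its `fixed` argument (standing in for titles whose split is inherited from an earlier dataset), and the rounding rule are all assumptions made for illustration.

```python
import random


def assign_splits(titles, fixed=None, ratios=(0.8, 0.1, 0.1), seed=0):
    """Map each book title to 'train', 'val', or 'test'.

    Every paragraph, chapter, and book example would then inherit the split
    of its title, which keeps one book from leaking across subsets.
    """
    fixed = dict(fixed or {})
    names = ("train", "val", "test")
    unique = sorted(set(titles))
    # Target number of titles per split; the rounding remainder goes to train.
    targets = {n: int(r * len(unique)) for n, r in zip(names, ratios)}
    targets["train"] += len(unique) - sum(targets.values())
    # Titles with a predefined split are placed first.
    assignment = {t: fixed[t] for t in unique if t in fixed}
    for split in assignment.values():
        targets[split] -= 1
    remaining = [t for t in unique if t not in fixed]
    random.Random(seed).shuffle(remaining)
    for title in remaining:
        # Fill whichever split is currently furthest below its target.
        split = max(names, key=lambda n: targets[n])
        assignment[title] = split
        targets[split] -= 1
    return assignment
```

Splitting by title rather than by example is the design choice that matters here; the exact balancing rule is incidental.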

Data Analysis

The data collection and matching process described in Section 3.1 yielded 217 unique book titles with a total of 6,327 book chapters. After the pre-processing and alignment steps, the BookSum collection contains 146,532 paragraph-level, 12,630 chapter-level, and 405 book-level examples. Figure 1 shows the distribution of literary genres in our corpus. Following Grusky et al. (2018), we computed statistics of the BookSum collection and compared them with other popular summarization datasets in Table 1. Coverage and density, which measure the extractive span similarity between source and summary, indicate that while the extractiveness of summaries increases from 0.5 and 0.92 for paragraphs to 0.89 and 1.83 for full books, the summaries are still highly abstractive when compared to other datasets, such as CNN/DM or Newsroom. Relatively low coverage and density scores for paragraph-level alignments might partially be an artifact of the heuristic approach to aligning the data. The lengths of source and summary texts increase substantially across data granularity. Paragraph-level data includes short documents with an average of 159 words, which fit within the limitations of existing models. Chapter-level examples contain texts with an average of over 5,000 words, which are longer than in most existing datasets and go beyond the limitations of many state-of-the-art methods (Liu et al., 2019). Book-level examples contain inputs with over 110,000 words on average, which are orders of magnitude longer than any document previously used in NLP tasks. While long source documents create computational challenges for encoding components of models, the associated summaries on chapter- and book-level are also much longer than in any other dataset, thus creating challenges for the generative component of summarization methods.
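For readers unfamiliar with the coverage and density statistics of Grusky et al. (2018), the following is a minimal sketch of the greedy extractive-fragment procedure they are built on. It operates on pre-tokenized lists; tokenization and casing choices are left out and would affect the resulting numbers, so this is not guaranteed to reproduce the values reported in Table 1.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest shared token spans, scanning the summary."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the matched span, or past one unmatched token.
        i += max(best, 1)
    return fragments


def coverage_and_density(article, summary):
    """Coverage: share of summary tokens inside fragments.
    Density: mean squared fragment length per summary token."""
    if not summary:
        return 0.0, 0.0
    frags = extractive_fragments(article, summary)
    coverage = sum(len(f) for f in frags) / len(summary)
    density = sum(len(f) ** 2 for f in frags) / len(summary)
    return coverage, density
```

Because density squares fragment lengths, a summary that copies one long span scores much higher than one that reuses the same number of isolated words.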

Salient Content Distribution

To assess the difficulty of content selection in our datasets we measure the distribution of salient unigrams in the source texts (Sharma et al., 2019). The distribution is computed as the percentage of salient unigrams in four equally sized segments of the source text, where salient unigrams are words appearing in the associated summaries after removing stopwords. As shown in Figure 2 (a), all subsets of the BookSum dataset have a relatively even distribution of salient words across all four segments of the source documents. This suggests that to generate high quality paragraph, chapter, or book summaries models will have to use the entire source document instead of only relying on parts of it. In comparison, other datasets, such as CNN/DM, Newsroom, or Arxiv/Pubmed, contain strong layout biases where the majority of salient words appear in the first quarter of the source documents.
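One possible reading of this measurement is sketched below: each occurrence of a salient word in the source is attributed to the segment in which it appears, and the counts are normalized to percentages. The function name, the lower-casing, and the treatment of repeated occurrences are assumptions; the original statistic may have been computed differently (for example over unique words).

```python
def salient_unigram_distribution(source_tokens, summary_tokens,
                                 stopwords=frozenset(), segments=4):
    """Percentage of salient source tokens falling in each equal-sized segment.

    Salient tokens are the (lower-cased) summary words that are not stopwords.
    """
    salient = {t.lower() for t in summary_tokens} - set(stopwords)
    counts = [0] * segments
    n = len(source_tokens)
    for idx, tok in enumerate(source_tokens):
        if tok.lower() in salient:
            # Map the token position to one of the equal-sized segments.
            counts[min(idx * segments // n, segments - 1)] += 1
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]
```

A roughly flat output across the four segments corresponds to the even distribution reported for BookSum, while a spike in the first entry would indicate lead bias.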

Summary Abstractiveness

To quantify the abstractiveness of summaries in BookSum we measured the percentage of n-grams from summaries not appearing in the associated source document (See et al., 2017). Results presented in Figure 2 (b) show that BookSum contains highly abstractive summaries across all measured n-gram sizes. The highest ratio of novel n-grams in BookSum was found for the paragraph-level alignments, followed by chapter-level data and full books. Results also indicate that our dataset is substantially more abstractive than most previous datasets, with the exception of XSum. High scores for trigrams also indicate that summaries included in BookSum do not contain long extractive spans, which aligns with the Density statistics shown in Table 1.
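The novel n-gram ratio can be computed in a few lines. The sketch below assumes pre-tokenized inputs and counts every summary n-gram occurrence (not only unique ones); the function names are illustrative.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_percentage(source_tokens, summary_tokens, n):
    """Percentage of summary n-grams that never occur in the source."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    source_set = set(ngrams(source_tokens, n))
    novel = sum(1 for g in summary_ngrams if g not in source_set)
    return 100.0 * novel / len(summary_ngrams)
```

For the source "the cat sat on the mat" and the summary "the cat slept on the mat", two of the five summary bigrams are novel, giving 40%.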

Qualitative Study

For a deeper understanding of the data beyond quantitative evaluation, we manually analyzed subsets of BookSum. First, we compared summaries on different levels of granularity assigned to the same title. Summaries on the chapter- and book-level partially overlap in the summarized content, but differ substantially in the level of detail with which they cover it. This relation could be leveraged for training models in a hierarchical fashion, from shorter to longer source texts (Li et al., 2015). Next, we compared summaries coming from different sources which were aligned with the same book or chapter. We noticed that the summaries had high semantic and low lexical overlap, meaning that they covered the same content of the summarized documents, but were written in a unique way. Such examples contain useful training signal for abstractive summarization models. Table 7 shows examples of chapter summaries of “Sense and Sensibility”.

Experiments

To motivate the challenges posed by the BookSum corpus, we study the performance of multiple baseline models, both extractive and abstractive, on the different levels of alignment: paragraph, chapter, and book. We refer to these levels of alignment as BookSum-Paragraph, BookSum-Chapter, and BookSum-Book, respectively.

Lead-3

(See et al., 2017) is an extractive heuristic where the first three sentences from the source document are treated as the summary. Despite its simplicity, Lead-3 is a strong baseline for domains which show layout biases, such as newswire.

Random Sentences

follows the Lead-3 heuristic and extracts 3 sentences sampled at random from the source document. It represents the performance of an untrained extractive baseline.
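Both heuristic baselines are simple enough to sketch directly. The sentence splitter below is a naive regular expression chosen for self-containment, not the tokenizer used in the experiments, and returning the sampled sentences in document order is an assumption.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def lead_k(text, k=3):
    """Lead-k: the first k sentences of the source."""
    return " ".join(split_sentences(text)[:k])


def random_k(text, k=3, seed=0):
    """Random baseline: k sentences sampled without replacement."""
    sentences = split_sentences(text)
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(sentences)), min(k, len(sentences))))
    return " ".join(sentences[i] for i in picked)
```

When the two baselines score similarly, as reported later for BookSum, sentence position carries little information about salience.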

CNN-LSTM Extractor

(Chen and Bansal, 2018) builds hierarchical sentence representations which capture long-range dependencies using a CNN and bi-directional LSTM-RNN layers. A separate LSTM-based pointer network is applied to the representations to extract summary sentences.

BertExt

(Liu and Lapata, 2019) extends the BERT (Devlin et al., 2019) model with the ability to generate distinct representations for multiple text spans. Based on those representations the model selects sentences into the extractive summary.

MatchSum

(Zhong et al., 2020) formulates extractive summarization as a semantic text matching problem. Multiple candidate summaries are extracted and embedded as dense vectors using a Siamese-BERT model and matched with the reference text in the semantic space.

BART

(Lewis et al., 2019) uses a denoising autoencoder pre-training strategy designed specifically for NLG tasks. It has achieved state-of-the-art results on many generative tasks, including abstractive text summarization.

T5

(Raffel et al., 2019) approaches transfer learning by unifying multiple NLP tasks into a common text-to-text format. All tasks are modeled with a large-scale seq-to-seq Transformer architecture in the order of billions of parameters. The model can be used to generate abstractive summaries using a summarize: prefix added to the text.

PEGASUS

(Zhang et al., 2019a) uses a pre-training objective designed for abstractive text summarization which includes masked language modeling and gap sentence generation. The model achieved state-of-the-art performance on multiple summarization datasets.

Setup

Computational constraints and input length limits of pre-trained models prevent us from training the baselines on long input sequences. To circumvent those issues we follow a generate & rank approach for BookSum-Chapter and BookSum-Book. We use baseline models fine-tuned on BookSum-Paragraph to generate individual summaries for all paragraphs in BookSum-Chapter and BookSum-Book. Next, we rank the generated summaries based on the model’s confidence: for abstractive models we use the perplexity of the generated summary, and for extractive models we use the model-assigned scores. As the final chapter- or book-level summary we combine the top-k ranked paragraph summaries, where k is chosen based on summary length statistics in the training set.
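The ranking step for abstractive models can be sketched as below, under stated assumptions: each candidate carries the per-token natural-log probabilities of its generated summary, perplexity is the exponential of the negative mean log-probability, and the selected summaries are concatenated in source order. The tuple layout and the ordering rule are hypothetical; the text does not specify them.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def generate_and_rank(candidates, k):
    """Combine the k most confident paragraph summaries.

    candidates: list of (paragraph_index, summary_text, token_logprobs).
    Lower perplexity is treated as higher model confidence.
    """
    ranked = sorted(candidates, key=lambda c: perplexity(c[2]))[:k]
    # Restore source order (an assumption made for readability of the output).
    ranked.sort(key=lambda c: c[0])
    return " ".join(c[1] for c in ranked)
```

For extractive models the same function would apply with the sort key replaced by the negated model score.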

Extractive Oracle

We follow the steps described by Zhong et al. (2020) to generate oracle candidates for the BookSum-Paragraph data. First, we compute a mean ROUGE-{1,2,L} score between each sentence in a paragraph and the associated summary. Next, we select the 5 highest scoring sentences and generate all combinations of 1, 2, and 3 sentences to serve as candidate oracles. The final oracle chosen from the set of candidates is the one which maximizes the mean ROUGE-{1,2,L} score with the paragraph summary.

Implementation

Models were implemented in Python using the PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2019) libraries. Abstractive models were initialized from pretrained checkpoints shared through the Huggingface Model Hub. Additional details are listed in Appendix A.

Training & Inference

All models were trained for 10 epochs and evaluated on the validation split at the end of each epoch. Final model checkpoints were chosen based on the performance of models on the validation data. Model outputs were decoded using beam search with 5 beams and n-gram repetition blocking for n > 3 (Paulus et al., 2018).

Evaluation Metrics

Models were evaluated using a suite of automatic evaluation metrics included in the SummEval toolkit (Fabbri et al., 2021). Lexical overlap between n-grams in generated and reference summaries was measured using ROUGE-{1,2,L} metrics (Lin, 2004). Semantic overlap between the same summaries was evaluated using BERTScore (Zhang et al., 2020), which aligns summaries on a token level based on cosine similarity scores between token embeddings. We also inspect content overlap between generated summaries and source documents by employing SummaQA (Scialom et al., 2019), which generates questions based on the input document and then applies a QA system to evaluate how many of those questions can be answered using the summary. Due to the input length limits of SummaQA, the metric was applied individually to paragraphs of chapters and books and then aggregated by averaging to obtain chapter- and book-level scores.

Automatic Evaluation

We first evaluate the baseline models using automatic metrics, with results shown in Table 2.

A general trend showing across all evaluated models is low BERTScore values which decrease as reference summaries get longer (from paragraphs to full books). The metric operates on a [-1, 1] range, and the highest scores, slightly above 0.19, were achieved by the fine-tuned T5 model on a paragraph level. This suggests that BERTScore might not be a good fit for evaluating highly abstractive, long summaries. We decided to include it in the evaluation process to highlight this issue for future investigation.

The performance of the Lead-3 baseline is relatively low, scoring an R-1 of 17.99, 14.32, and 6.50 on the paragraph-, chapter-, and book-level respectively. The random sentence baseline closely trails Lead-3 across all metrics and data splits. Both results suggest that data from the literature domain included in BookSum may be less susceptible to layout biases present in other domains, such as newswire. Extractive oracle scores on paragraph data were substantially lower than those on the chapter and book data. This could be an artifact of the data pairing procedure, where the content of a highly abstractive summary sentence is only partially covered by the matched paragraph.

Extractive Models

The performances of the CNN-LSTM and BertExt models are very similar, with the first model being better on paragraph data, and the second model performing better on chapters and books. The small performance gap between the two mentioned models is surprising considering that the BERT based model was initialized from a pre-trained checkpoint, while the CNN-LSTM model was trained from scratch. The MatchSum baseline which reported state-of-the-art performance on news domain datasets (Zhong et al., 2020) achieved the best performance on a paragraph level, but underperformed the other models on chapter and book summaries.

Abstractive Models

We evaluated the performance of abstractive models both in a zero-shot setting and after fine-tuning on the BookSum-Paragraph data. We find that fine-tuning models on the BookSum data leads to consistent improvements across all models and data granularities, with the exception of the BART model on the book-level, which performed better in a zero-shot fashion according to the ROUGE metric, and the T5 model on the SQA metrics. Upon manual inspection of model outputs we noticed that zero-shot models included fragments of dialogues in the summaries, which are less likely to be found in reference summaries. This in turn could contribute to the lower evaluation scores of zero-shot baselines. The BART model achieved the best performance out of all the baseline models on paragraph- and chapter-level data, while T5 performed best on the book-level. Despite its state-of-the-art performance on most summarization datasets (Zhang et al., 2019a), we found PEGASUS to underperform other baseline models, both in the zero-shot and fine-tuned setting. Examples of generated summaries are shown in Appendix G.

Human Evaluation

To further assess the performance of abstractive baselines, human annotators were hired and asked to evaluate generated summaries across four dimensions: fluency, coherence, relevance, and factuality. Scores were assigned on a Likert scale from 1 to 5, with each example annotated by 3 judges and the scores averaged. Relevance and factuality were evaluated only on the paragraph-level since both dimensions require an understanding of the source text, which in the case of chapters and books is prohibitively long. Results are shown in Table 3.

Similarly to the study using automatic metrics, BART shows strong performance across all dimensions for the paragraph- and chapter-level subsets and slightly underperforms on full books. The results also show a general decrease in fluency and coherence across all models as the length of the source documents and summaries increases. This suggests that generating longer passages of fluent and coherent text poses a problem for existing neural models and could be addressed in future work.

Discussion

The generate & rank approach allowed us to overcome the limitations of existing models and apply the baselines to the chapter- and book-level data. We recognize that generating and scoring sentences independently has drawbacks, namely: 1) generated summaries may lack coherence, 2) content of selected sentences may overlap or be of low significance, which could negatively affect the overall relevance of the summary. However, the experiments discussed in this section were intended to be groundwork for the introduced task and we leave developing more tailored methods for future work.

The experiment results also show that BookSum poses challenges not only for existing summarization models, but also for evaluation metrics. The abstractive nature of reference summaries makes lexical overlap measured by ROUGE an inadequate metric for model evaluation (Fabbri et al., 2021). Other recently introduced metrics, such as BERTScore and SummaQA, leverage pre-trained neural models, which in turn makes them subject to the same input length limitations as the evaluated summarization models. While the model-based metrics can be individually applied to chunks of the data and then aggregated, as in the case of SummaQA, such use was not studied by the authors and could affect the reliability of returned scores. Human-based studies, which are often used to assess dimensions omitted by automatic metrics, are also problematic when conducted with long-form data included in BookSum. For example, assessing factual consistency requires annotators to be familiar with the content of the source document, which in the case of chapters or books could span dozens of pages making such studies unreliable and prohibitively time consuming.

Conclusions

In this work we introduced BookSum, a collection of datasets for long-form narrative summarization. BookSum includes annotations on three levels of granularity of increasing difficulty: paragraph, chapter, and full-book. Through a quantitative analysis we compare our dataset to existing summarization corpora and show that BookSum sets new challenges for summarization methods. We trained extractive and abstractive baseline models leveraging state-of-the-art pre-trained architectures to test the performance of current methods on the task of long-narrative summarization and to enable easy comparison with future methods. We hope our dataset will contribute to the progress made in the field of automatic text summarization.

Limitations

Web data is subject to local copyright laws. For data that is no longer protected by copyright law, we understand the use described within the paper is legally permissible. For data that is subject to copyright, we understand that such use is allowed under U.S. copyright law’s fair use provision. Depending on how others use this data, the purpose of their use, the jurisdiction they are in, and other factors considered under copyright law, we understand that the decision on whether a specific use case is fair use involves a legal analysis. It is advisable to obtain legal counsel prior to using such data. All data described in this work was collected exclusively for the academic purpose of conducting research. The purpose of using the BookSum data was only for training models and not for public display or any other use. No data was stored upon completion of the research process.

Data Biases

The BookSum dataset contains books written or translated into English. These books are also more than fifty years old and so representative of society in that era. The various pretrained models we evaluated on our dataset carry biases of the data they were pretrained on. However, we did not stress test these models for such ethical biases. We request our users to be aware of these ethical issues in our dataset that might affect their models and evaluations.

Model Evaluation

In this work, we have used established metrics, such as ROUGE, as well as recently introduced metrics, such as BERTScore and SummaQA, to evaluate the introduced baseline models. However, such automatic metrics have not been evaluated for use with very long source documents and highly abstractive summaries. Thus, they might not accurately reflect the true performance of the evaluated models. Reliable evaluation of highly abstractive summarization models trained on long source documents is an open problem and an area of active research. Authors using the BookSum data are encouraged to consult the appropriate literature to check whether more robust evaluation methods are available at the time of writing.

Computational Resources

Considering the length of source documents included in the BookSum dataset, training and evaluation of neural models might require substantial computational resources.

References

Appendix A Further Implementation Details

Model hyperparameters followed the best configurations described by the original authors of the models. Models were trained for 10 epochs using a batch size of 16. Many of the baselines presented in this work leveraged pre-trained checkpoints to initialize weights before fine-tuning on the BookSum data. Table 4 lists the checkpoints used for each of the baselines and the approximate number of parameters of each model. Experiments were conducted using 4 NVidia A100 GPUs; all studies described in this paper took approximately 8 GPU hours.

Appendix B Data Alignment Process

Alignments between book paragraphs and sentences from associated summaries were computed using heuristic methods. The alignment process followed the two steps described by Ladhak et al. (2020): 1) similarity scores were computed for all paragraph-sentence pairs, 2) based on the similarity scores, paragraphs and sentences were aligned using a stable matching algorithm. Similarity scores between paragraphs and sentences can be computed using different metrics. In our study, we focused on lexical overlap methods and neural embedding methods. The first computed the token overlap between paragraphs and sentences using the ROUGE toolkit and treated it as the similarity score. The second used neural networks to embed the text spans into dense vector representations and then computed the similarity score as the cosine similarity between these vectors.

To choose the best similarity score metric, we conducted a pilot study on a subset of 100 paragraph-sentence pairs sampled from the validation set. The sampled examples were matched using the procedure described above, with different scoring methods used to compare the text spans. The following similarity score methods were considered:

ROUGE-wtd

(Ladhak et al., 2020) computes an average of token-weighted ROUGE-{1,2,L} scores between the sentence and paragraph texts. Token weights approximate the saliency of words and are computed as the inverse frequency of word occurrences in the document.

ROUGE-avg

(Ladhak et al., 2020) computes an average of (unmodified) ROUGE-{1,2,L} scores between the sentence and paragraphs.

BERTScore

(Zhang et al., 2020) measures semantic overlap between the words in the sentences and paragraphs. It aligns words in both text spans by maximizing the cosine similarity between BERT representations of the tokens.

Cross-Encoder

(Humeau et al., 2019) performs self-attention over the sentence and paragraph text passed together through a Transformer network to generate a similarity score between the input pair.

Bi-Encoder

(Reimers and Gurevych, 2019) uses a Transformer architecture to independently encode the sentence and paragraph texts into a dense vector representation. The similarity score is calculated using cosine similarity between the sentence and paragraph representations. We evaluate two checkpoints for the Bi-Encoders as described in Table 4.
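The two families of scoring functions listed above can be illustrated with a minimal sketch. It is not the implementation used in the study: it assumes whitespace tokenization, restricts the token-weighted score to unigrams (a ROUGE-1-style F1 rather than the average over ROUGE-{1,2,L}), and takes embeddings as plain lists of floats produced by some encoder. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def inverse_frequency_weights(document_tokens):
    """Weight each word by the inverse of its occurrence count in the document."""
    counts = Counter(document_tokens)
    return {word: 1.0 / count for word, count in counts.items()}


def weighted_unigram_f1(sentence, paragraph, weights):
    """Token-weighted unigram overlap (ROUGE-1-style F1) between two spans.

    Words missing from `weights` get a default weight of 1.0 (assumption).
    """
    sent_counts = Counter(tokenize(sentence))
    para_counts = Counter(tokenize(paragraph))
    # Clipped overlap: a word counts at most as often as it occurs in both spans.
    overlap = sum(
        weights.get(w, 1.0) * min(c, para_counts[w]) for w, c in sent_counts.items()
    )
    sent_mass = sum(weights.get(w, 1.0) * c for w, c in sent_counts.items())
    para_mass = sum(weights.get(w, 1.0) * c for w, c in para_counts.items())
    if overlap == 0.0 or sent_mass == 0.0 or para_mass == 0.0:
        return 0.0
    recall = overlap / sent_mass
    precision = overlap / para_mass
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Either function can be applied to every paragraph-sentence pair to fill a score matrix, which is the input expected by the matching step.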

The quality of data alignments obtained during the pilot study was assessed by human judges hired through the Amazon Mechanical Turk platform. Workers were hired from English-speaking countries and offered a wage of approximately 12 USD per hour. Annotators were shown paragraphs which were aligned with a shared summary sentence using the different methods. For each alignment, the annotators were asked to label whether the paragraph-sentence pair is related, somewhat related, or unrelated. Each example was evaluated by three judges. The related and somewhat related labels were merged into a single positive label and the majority vote was computed. Results of the study are presented in Table 5 and show the number of times a method was assigned a positive label. The best performing strategy used a Bi-Encoder fine-tuned on paraphrase detection data.
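The label aggregation described above can be sketched as follows. This is an illustrative sketch, not the authors' evaluation script; the label strings and function names are assumptions, and the tie-free majority relies on each example having an odd number of judges (three in the study).

```python
from collections import Counter

# Labels merged into the single positive class (label strings are assumed).
POSITIVE_LABELS = {"related", "somewhat related"}


def merge_label(label):
    """Map the three-way annotation onto a binary positive/negative label."""
    return "positive" if label in POSITIVE_LABELS else "negative"


def majority_vote(labels):
    """Return True if most judges assigned a positive label to the pair."""
    merged = Counter(merge_label(label) for label in labels)
    return merged["positive"] > merged["negative"]


def count_positive(judgments_per_example):
    """Number of examples whose majority label is positive for one method."""
    return sum(majority_vote(labels) for labels in judgments_per_example)
```

Calling `count_positive` once per scoring method, on the lists of three judgments collected for its alignments, yields the kind of per-method count that the study reports.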

Using the selected scoring function, similarity scores were computed for all paragraph-sentence pairs. Next, these scores were input into a stable matching algorithm (Gale and Shapley, 1962) to obtain the final alignments. The stable matching procedure creates alignments in which no paragraph and summary sentence would both prefer to be matched with each other over the partners they are already matched with.
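A minimal sketch of the Gale-Shapley procedure on a similarity score matrix is given below. It assumes a one-to-one matching in which paragraphs propose to summary sentences and both sides rank candidates by the same score; the alignment used in the study may allow several paragraphs to share a summary sentence, which this simplified version does not model. The function name is hypothetical.

```python
def stable_matching(scores):
    """One-to-one stable matching between paragraphs and summary sentences.

    scores[p][s] is the similarity of paragraph p and sentence s.
    Returns a dict mapping matched paragraph indices to sentence indices.
    """
    num_paragraphs = len(scores)
    num_sentences = len(scores[0]) if scores else 0
    # Each paragraph ranks sentences from most to least similar.
    preferences = [
        sorted(range(num_sentences), key=lambda s: -scores[p][s])
        for p in range(num_paragraphs)
    ]
    next_choice = [0] * num_paragraphs  # next sentence each paragraph proposes to
    matched_paragraph = {}              # sentence index -> paragraph index
    free = list(range(num_paragraphs))
    while free:
        p = free.pop()
        if next_choice[p] >= num_sentences:
            continue  # p was rejected by every sentence and stays unmatched
        s = preferences[p][next_choice[p]]
        next_choice[p] += 1
        current = matched_paragraph.get(s)
        if current is None:
            matched_paragraph[s] = p
        elif scores[p][s] > scores[current][s]:
            # The sentence prefers the new proposer; the old partner is freed.
            matched_paragraph[s] = p
            free.append(current)
        else:
            free.append(p)
    return {p: s for s, p in matched_paragraph.items()}
```

When there are more paragraphs than sentences, the paragraphs that every sentence rejects remain unmatched and are simply absent from the returned dictionary.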

Appendix C Alignment Quality

The quality of alignments obtained using the process described in Section 3.1 and Appendix B was also evaluated quantitatively. To measure the semantic similarity of source paragraphs and paired summary sentences, the cosine similarity between their embeddings was computed. To measure lexical overlap between the paragraph-summary pairs, ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) scores were computed. Results are presented in Table 6.

The cosine similarity of 0.412 indicates strong semantic overlap between the paired sentences and source paragraphs, suggesting high quality pairings. In comparison, the relatively low lexical overlap of 17.39 R-1 between the mentioned fragments highlights the high abstractiveness of the data.

Appendix D Data Creation Pipeline

The data creation process is visualized in Figure 3.

Appendix E Source examples

Examples of chapter-level summaries of “Sense and Sensibility” collected from different sources are shown in Table 7.

Appendix F Human Evaluation UI

Screenshots of the user interface, including evaluation instructions, used in the human studies of abstractive baselines on the paragraph-level are presented in Figure 4, and on the chapter- and book-level in Figure 5.

Appendix G Model outputs

Example summaries generated on the paragraph-, chapter-, and book-level by the baseline models discussed in our work are presented in Tables 8, 9, 10, 11, 12, 13, 14, 15.