An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics

Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan

Introduction

Summarizing textual information is an exacting task for humans, and the rate of information growth in the era of big data has made it impractical to summarize most information manually. The problem is exacerbated for long-form textual documents, as the knowledge and human labour required to process and summarize them increase exponentially with the length of the documents. Inevitably, a significant amount of invaluable information and knowledge has gone unnoticed, presenting an important bottleneck in the progress of social and economic development. In response, there has been a strong demand for exhaustive research in the field of automatic long document summarization (Cohan et al., 2018; Sharma et al., 2019; Beltagy et al., 2020; Dong et al., 2021; Manakul and Gales, 2021).

Automatic text summarization shortens a source text efficiently while keeping its main idea intact, which reduces the time required to process information, speeds up the search for information, and makes learning a topic easier (Liu et al., 2019; Liu et al., 2018). While the potential of an effective automatic text summarization system has attracted significant interest from the research community, automatic text summarization remains a challenging task and is not yet ready for wide practical use in day-to-day life, particularly when it comes to summarizing long documents (Cao et al., 2018; Kryściński et al., 2019; Maynez et al., 2020; Kryscinski et al., 2020). Intuitively, long document summarization is harder than short document summarization because of the significant difference in the number of lexical tokens and the breadth of content between short and long documents. As length increases, the amount of content that would be considered important also increases, making it more challenging for an automatic summarization model to capture all salient information within the limited output length (Gidiotis and Tsoumakas, 2020). Further, short documents are often generic texts such as news articles (Sandhaus, 2008; Nallapati et al., 2016; Narayan et al., 2018; Grusky et al., 2018), while long documents are commonly domain-specific articles such as scientific papers that contain more complex formulas and terminology (Cohan et al., 2018; Kornilova and Eidelman, 2019; Huang et al., 2021). Together with other reasons explored in this survey, these factors make long document summarization a significantly more challenging task than short document summarization.

In general, automatic text summarization can be conceptualized as having three approaches: extractive, abstractive, and hybrid (Kryscinski et al., 2020). The extractive approach directly copies salient sentences from the source document and combines them as the output (Gong and Liu, 2001; Cheng and Lapata, 2016), whereas the abstractive approach imitates a human who comprehends a source document and writes a summary based on its salient concepts (Rush et al., 2015; See et al., 2017). The hybrid approach attempts to combine the best of both approaches by rewriting a summary based on a subset of salient content extracted from the source document (Hsu et al., 2018; Liu et al., 2018; Gehrmann et al., 2018). Each approach has advantages and limitations that may suit certain summarization tasks better. For example, extractive summarization may be sufficient for summarizing certain news articles (Cheng and Lapata, 2016; Zhang et al., 2018) but inadequate for summarizing a long dialogue where salient content is sparsely distributed (Zhang et al., 2021). This is because, while the extractive approach is always factually consistent with the source document, it does not modify the original text and thus lacks the ability to generate fluent and concise summaries (Xu et al., 2020b).

Historically, the ROUGE score (Lin, 2004) has been the standard means by which researchers in the summarization field measure the performance of different summarization architectures and compare the quality of candidate summaries. The core idea of the ROUGE score is to measure the lexical overlap, in terms of words and phrases, between a candidate summary and a ground truth summary. While it is efficient, recent findings have shown that the ROUGE score does not correlate well with how humans assess the quality of a candidate summary (Kryściński et al., 2019; Bhandari et al., 2020; Chaganty et al., 2018; Hashimoto et al., 2019). As a result, a significant amount of effort has gone into improving the way we measure the quality of candidate summaries and the performance of summarization architectures (Kryscinski et al., 2020; Maynez et al., 2020; Zhang et al., 2019; Mao et al., 2020; Yuan et al., 2021). Unfortunately, these efforts have focused entirely on the short document domains, and progress in measuring the quality of long document summarization approaches has been lacking (Pagnoni et al., 2021; Graham, 2015; Huang et al., 2020; Bhandari et al., 2020; Peyrard, 2019b).
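
To make the idea of lexical overlap concrete, the following is a minimal Python sketch of an n-gram overlap score in the spirit of ROUGE-N. It is a simplified illustration and not the official ROUGE implementation: it assumes pre-tokenized input, applies no stemming or stopword handling, and the function names `ngram_counts` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between two token lists.

    Returns (precision, recall, f1). Assumption: no stemming or
    normalization, unlike full ROUGE toolkits.
    """
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (made-up sentences, for illustration only).
candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()
p, r, f = rouge_n(candidate, reference, n=1)
```

Because the score depends only on surface token overlap, a fluent paraphrase can receive a low score while a copied but unfaithful fragment can score well, which is one intuition behind the weak correlation with human judgment noted above.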

Nevertheless, although considerable advances have been made in the long document summarization research field, the area still lacks a comprehensive survey (Gambhir and Gupta, 2017; Boorugu and Ramesh, 2020; El-Kassas et al., 2021; Shi et al., 2021). Our paper fills this gap by providing a comprehensive overview of the research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. The contributions of our paper are as follows:

Comprehensive Review. A comprehensive survey of the long document summarization research literature.

Full-view of summarization research. The text summarization literature mainly explores three key aspects of the research setting: developing advanced models, releasing new datasets, and proposing alternative evaluation metrics. We provide a detailed empirical review of all three components within the context of long document summarization.

Empirical Studies and Thorough Analysis. To ensure wide coverage of emerging trends, we empirically analyze each component of the long document summarization research setting through fine-grained human analysis and ad-hoc experiments.

Future Direction. We discuss the current progress of long document summarization, analyze the limitations of existing methods, and suggest promising future research directions in terms of model designs, the quality and diversity of datasets, the practicality of evaluation metrics and, finally, the feasibility of applying summarization techniques to real-life applications.

The survey is organized as follows. Section 2 gives an overview of the fundamentals of long document summarization. Section 3 presents a detailed study of ten summarization benchmark datasets. Section 4 provides a comprehensive survey of summarization models that are designed specifically for, or have the ability to summarize, long documents. In Section 5, we analyze the performance of models that are representative of the different types of architectures commonly used by researchers through ad-hoc experiments. In Section 6, we summarize the advances in evaluation metrics and their applicability to the long document summarization domain. Section 7 covers the applications of long document summarization models and Section 8 discusses promising future research directions in this field. Finally, Section 9 concludes this survey.

Fundamentals of Long Document Summarization

To make clear the distinction between short and long documents, we conceptualize the summarization task from three fundamental aspects: 1) length of document, 2) breadth of content, and 3) degree of coherence.

1. Length of Document

Documents are commonly classified as “long” because the number of lexical tokens in the source document is enormous and an average human needs a considerable amount of time to read the full text. While this definition makes intuitive sense, in the context of machine learning a document is considered long when current state-of-the-art models for normal-length documents cannot be applied to it effectively due to hardware and model limitations. For example, previous research (Celikyilmaz et al., 2018) treated the CNN/DM and NYT benchmark datasets in the news domain as long documents, whereas in the present research context they are considered short document datasets. Currently, a benchmark dataset whose average source document length exceeds 3,000 lexical tokens can reasonably be considered “long” (Zaheer et al., 2020; Manakul and Gales, 2021), because most existing state-of-the-art summarization systems (e.g., pre-trained models) are limited to 512 to 1,024 lexical tokens (Devlin et al., 2019; Zhang et al., 2020). These limitations cannot be easily overcome without novel techniques that help current architectures reason over a long range of textual input (Zaheer et al., 2020; Manakul and Gales, 2021; Meng et al., 2021). Accordingly, this survey adopts a similar definition: a document is considered long only if current state-of-the-art systems for short documents cannot be extended and applied to a document with significantly longer text. Although potentially confusing, this definition remains valid over time and ensures that the model architectures implemented by researchers require novel techniques to overcome hardware limitations rather than merely replicating previous work.

2. Breadth of Content

On average, the amount of informative, non-redundant content increases with the length of a document. However, although reference summary length often increases with source document length, the length of a summary is usually constrained by what an average user considers reasonable (Cohan et al., 2018; Sharma et al., 2019). Thus, while a summary of reasonable length may be sufficient to cover most or even all of the informative aspects of a short document, this is not necessarily true for summaries of long documents. In section 3, we empirically show that the relative length of the summary against the source document becomes exponentially shorter as the source document length increases. Due to this tighter constraint, the ground truth summary of a long document will inevitably lose information that is not key to the central narrative of the original author or summary writer (Gidiotis and Tsoumakas, 2020). Furthermore, recent work (Kryściński et al., 2019) has found that, in the short document news domain, human users could not agree on what should be considered important for a given document because of the heterogeneity of user preferences and expectations. This issue is exacerbated in long document summarization, as (a) the relative length of the summary against the source document is shorter and (b) the chance of users having different preferences and expectations increases with the breadth of content, making the long document summarization task significantly harder than its short document counterpart.

3. Degree of Coherence

Compared to short documents, long documents are often structured into sections for ease of comprehension (Cohan et al., 2018; Kornilova and Eidelman, 2019). The content within each section also differs to a certain extent, despite revolving around the key narrative of the document. This makes the long document summarization task more burdensome, as summarization models cannot concatenate salient texts from different sections without considering the impact on the fluency, redundancy, and semantic coherence of the final summary. Based on these fundamental aspects, the rest of this paper provides an empirical survey on long document summarization, covering the benchmark datasets, summarization models, and metrics.

Datasets

Publicly available benchmark datasets have been introduced to evaluate the performance of summarization models. Nonetheless, the benchmark datasets have different intrinsic characteristics that have been found to be crucial for understanding model performance (Maynez et al., 2020; Tejaswin et al., 2021), the suitability of a summarization approach (i.e., extractive or abstractive) (Zhang et al., 2018; Sharma et al., 2019) and the effectiveness of evaluation metrics (Gabriel et al., 2021; Pagnoni et al., 2021). Hence, only through a comprehensive understanding of the benchmark datasets can one assess the underlying performance and applicability of a summarization model in real-world settings (Kryściński et al., 2019). Further, insights drawn from benchmark datasets have led to the introduction of state-of-the-art models across a wide range of natural language processing (NLP) tasks (Chen et al., 2016; Yatskar, 2019), including text summarization (Gidiotis and Tsoumakas, 2020; Dong et al., 2021; Manakul and Gales, 2021). In response, intrinsic dataset evaluation through large-scale automatic evaluation (Bommasani and Cardie, 2020) or more fine-grained human evaluation at a smaller scale (Tejaswin et al., 2021) has also been performed to enhance the understanding of various benchmarks. Nevertheless, none of the aforementioned works performed either a large-scale automatic evaluation or a thorough human evaluation of benchmark datasets in the long document summarization domain. To address this gap, this section explores the basic statistics and intrinsic characteristics of popular benchmarks in the short and long document domains using large-scale automatic evaluation metrics, and performs fine-grained human analysis on the arXiv benchmark to encourage a better appreciation of the most widely used long document summarization dataset (Zaheer et al., 2020; Huang et al., 2021; Manakul and Gales, 2021).

The short document datasets studied in this survey are CNN-DM, NWS, XSUM, Reddit-TIFU, and WikiHow. The first three news datasets are chosen for their popularity, while Reddit-TIFU and WikiHow are studied to ensure that short documents from other domains are also included. The document-summary pairs from CNN-DM, NWS, and XSUM are typical of the news domain, where the source document is a news article and the summary is either a human-curated summary (Grusky et al., 2018; Narayan et al., 2018) or a summary created by concatenating bullet-point sentences in the original source document (Nallapati et al., 2016). On the other hand, Reddit-TIFU is a dataset collected from the subreddit r/TIFU (Kim et al., 2019), while the WikiHow benchmark is created using the first sentence of each paragraph of a WikiHow web page as the summary and the rest as the source text (Koupaee and Wang, 2018).

For long document summarization research, arXiv, PubMed, BIGPATENT, BillSum, and GovReport have been used in prior work to test and compare novel long document summarization models. arXiv and PubMed (Cohan et al., 2018) are scientific long document summarization datasets collected from the arXiv.org and PubMed.com scientific repositories. Both datasets represent the earliest work on large-scale long document summarization datasets. BIGPATENT (Sharma et al., 2019) is an enormous dataset with over 1.3 million document-summary records of U.S. patent documents along with human-written abstractive summaries. BillSum (Kornilova and Eidelman, 2019) is a dataset for summarizing Congressional and California state bills, where the content structure and stylistic features of the writing are considerably different from documents in other domains. GovReport (Huang et al., 2021), assembled from reports published by the U.S. Government Accountability Office, is markedly longer than the other long document datasets. Other long document benchmark datasets that are worth mentioning but are no longer widely used, due to their limited number of document-summary pairs, are CL-SciSumm and SciSummNet (Jaidka et al., 2016; Yasunaga et al., 2019). Some other benchmark datasets released more recently in the podcast (Clifton et al., 2020) and dialogue domains (Janin et al., 2003; Carletta et al., 2005; Rameshkumar and Bailey, 2020; Zhu et al., 2021) may also be classified as long document summarization benchmarks (Manakul and Gales, 2021), but they are not explored in this survey, as dialogue summarization has been recognized as a separate sub-domain due to its distinctive features compared to other document types.

2. Data Metrics

Given a document, $D$, and a corresponding reference summary, $S$, each document has a sequence of tokens $D_{token}=\{t_{1},t_{2},...,t_{n}\}$ and each summary has a sequence of tokens $S_{token}=\{t^{*}_{1},t^{*}_{2},...,t^{*}_{m}\}$. Similarly, each document and summary have $l$ and $o$ sentences, represented by $D_{sent}=\{s_{1},s_{2},...,s_{l}\}$ and $S_{sent}=\{s^{*}_{1},s^{*}_{2},...,s^{*}_{o}\}$ respectively. The lengths of the document and summary measured in number of tokens are represented as $|D|$ and $|S|$, while the lengths measured in number of sentences are represented as $\|D\|$ and $\|S\|$. Extending the works in the short document summarization domain (Bommasani and Cardie, 2020; Grusky et al., 2018), the following discusses each of the five metrics used to evaluate the benchmark datasets shown in Table 1: compression ratio, extractive coverage, extractive density, redundancy and uniformity.

Compression Ratio measures the ratio of a source document length against its reference summary length. A higher compression ratio indicates larger information loss in the original document after being summarized. Compression ratios are measured based on tokens and sentences:
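
Following the definition above, the two ratios can be sketched in the notation of this section as below. The labels $\mathrm{CR}_{token}$ and $\mathrm{CR}_{sent}$ are names assumed here for illustration; the expressions themselves simply divide the source length by the summary length at each granularity.

```latex
\[
\mathrm{CR}_{token} = \frac{|D|}{|S|},
\qquad
\mathrm{CR}_{sent} = \frac{\|D\|}{\|S\|}
\]
```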

Extractive Coverage and Extractive Density were introduced by Grusky et al. (2018) based on the notion of matching fragments. Fragments are obtained by greedily matching the longest shared token sequences between the document and the summary, where $\mathcal{F}(D,S)$ denotes the set of fragments and $|f|$ the length of a fragment $f$. Extractive coverage is the percentage of tokens in the summary that are derived from the original source text, whereas extractive density is the average squared length of the extractive fragments in the summary. The former indicates the need for a model to coin novel tokens that are not in the original source text, while the latter measures whether a model can match the ground truth summary merely by extracting from the original source text without rearranging or paraphrasing it.
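
The following is a minimal Python sketch of these two metrics, assuming pre-tokenized input and a straightforward reading of the greedy matching procedure of Grusky et al. (2018). Coverage is computed as $\frac{1}{|S|}\sum_{f \in \mathcal{F}(D,S)} |f|$ and density as $\frac{1}{|S|}\sum_{f \in \mathcal{F}(D,S)} |f|^{2}$. The function names are hypothetical, and the quadratic-time loop is written for clarity rather than speed.

```python
def greedy_fragments(doc_tokens, sum_tokens):
    """Greedily collect the longest shared token runs, scanning the summary left to right."""
    fragments = []
    i = 0
    while i < len(sum_tokens):
        best = 0  # length of the longest match starting at summary position i
        j = 0
        while j < len(doc_tokens):
            if sum_tokens[i] == doc_tokens[j]:
                k = 0
                # Extend the match while both sequences keep agreeing.
                while (i + k < len(sum_tokens) and j + k < len(doc_tokens)
                       and sum_tokens[i + k] == doc_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(sum_tokens[i:i + best])
        # Skip past the matched fragment, or past one unmatched (novel) token.
        i += max(best, 1)
    return fragments


def extractive_coverage(doc_tokens, sum_tokens):
    """Fraction of summary tokens that belong to some extractive fragment."""
    frags = greedy_fragments(doc_tokens, sum_tokens)
    return sum(len(f) for f in frags) / len(sum_tokens)


def extractive_density(doc_tokens, sum_tokens):
    """Sum of squared fragment lengths, normalized by summary length."""
    frags = greedy_fragments(doc_tokens, sum_tokens)
    return sum(len(f) ** 2 for f in frags) / len(sum_tokens)


# Toy example (made-up sentences, for illustration only).
doc = "the cat sat on the mat".split()
summ = "the cat lay on the mat".split()
frags = greedy_fragments(doc, summ)
```

In this toy example the summary shares the fragments "the cat" and "on the mat" with the document, while "lay" is a novel token. A summary that reuses the same words in shorter, scattered runs would keep a similar coverage but obtain a lower density.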

Uniformity measures whether content that is considered important by the reference summary is uniformly scattered across the entire source document. A higher score indicates that important content is scattered across the entire document with no obvious layout bias to take advantage of. It is calculated as the normalized entropy of the decile positions of salient unigrams in the source text, where salient unigrams are the top 20 keywords, excluding stopwords, extracted from the reference summary (we use NLTK-RAKE for keyword extraction).
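
The entropy computation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure used for Table 1: the salient unigrams are passed in directly instead of being extracted with RAKE, every occurrence of a salient unigram in the source is counted, and the entropy is normalized by the logarithm of the number of bins. The function name `uniformity` is hypothetical.

```python
import math
from collections import Counter


def uniformity(doc_tokens, salient_unigrams, bins=10):
    """Normalized entropy of the decile positions of salient unigrams.

    Returns a value in [0, 1]: close to 1 when salient tokens are spread
    evenly over the document, 0 when they all fall into a single decile.
    """
    salient = set(salient_unigrams)
    n = len(doc_tokens)
    counts = Counter()
    for idx, tok in enumerate(doc_tokens):
        if tok in salient:
            # Map the token position to one of `bins` equal-width segments.
            counts[min(idx * bins // n, bins - 1)] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: no salient token found means no spread
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(bins)


# Toy documents (made up): one salient token per decile vs. all in the lead.
spread_doc = [("key" if i % 10 == 5 else "x") for i in range(100)]
lead_doc = ["key"] * 5 + ["x"] * 95
spread_score = uniformity(spread_doc, ["key"])
lead_score = uniformity(lead_doc, ["key"])
```

A document with strong lead bias, such as a news article whose key facts appear in the opening paragraph, would behave like `lead_doc` and score near zero.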

3. Intrinsic Characteristics of Datasets

Based on Table 1 below, the following discusses the findings of intrinsic characteristics of long document benchmark datasets in comparison to short document benchmark datasets.

Finding 1. Length of Long Documents: A basic yet important finding is that, except for BillSum, all the long document datasets have an average source document length of at least 3,000 tokens. In contrast, the longest short document dataset, CNN-DM, has an average document length of 774 tokens. This indicates that vanilla pre-trained Transformer-based models (Raffel et al., 2020; Lewis et al., 2020a; Zhang et al., 2020), which commonly have an input length limit of 1,024 tokens, would need to truncate at least half of the source document in the long document benchmark datasets. Thus, if pre-trained models that have proven to work well under short document settings are applied without any long document adaptation of their architectures and mechanisms, they are unlikely to generate high-quality summaries for long documents (Maynez et al., 2020; He et al., 2020; Rothe et al., 2021).
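
As a rough worked example, assume a source document of exactly 3,000 tokens and an input limit of 1,024 tokens (both figures are taken from the text above; actual document lengths vary). The fraction of the document discarded by truncation is then

```latex
\[
1 - \frac{1024}{3000} \approx 0.66
\]
```

so roughly two thirds of such a document would never be seen by the model, and the fraction grows for longer documents.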

Finding 2. High Compression Ratio and its Implications: On average, the token-level and sentence-level compression ratios of the long document summarization datasets exceed those of the short document datasets by factors of 1.4 and 2.2 respectively. For long documents, this suggests that a) there is greater information loss in the summaries, b) the salient content is more sparsely distributed across the source documents, and/or c) the source documents contain significantly more redundant information. As the high compression ratio of long document benchmarks is more likely to be the result of the first two factors, it increases the relative difficulty of the long document summarization task, since a model has to clearly identify the key narrative of the source while excluding content that summary readers are expected to find less important. Moreover, if there is greater information loss in the summary of a long document, the generated summary will inevitably miss an even greater amount of information that some readers consider important, diminishing the ability of a generalized summarization approach to satisfy the needs of summary readers. This finding supports the efforts in controllable summarization, where the final generated summaries are based on the reader’s needs and expectations (He et al., 2020; Wu et al., 2021).

Finding 3. Abstractiveness and Diversity of Datasets: With the exception of BIGPATENT, all long document datasets have greater coverage and density values than the short document datasets. This is likely due to the genres of the benchmark datasets, as long documents are often domain-specific articles such as scientific papers that contain more complex formulas and terminology. Nonetheless, this indicates that a model that merely extracts lexical fragments from the source text of a long document can still generate a summary that resembles the reference summary more closely than is the case for short documents. As abstractive summarization models have recently been found to produce factual inconsistencies in up to 30% of summary outputs in the short document domain (Cao et al., 2018; Kryscinski et al., 2020), while extractive summarization models faithfully preserve the original content, this finding is encouraging for the development of long document extractive models in real-world production settings. Finally, as the abstractiveness of a dataset has been found to greatly affect the summarization strategies of a supervised model (Zhang et al., 2018; Xu et al., 2020a; Wilber et al., 2021), efforts to introduce benchmark datasets with greater abstractiveness (low extractive coverage and density values) should be encouraged to improve the diversity of long document benchmark datasets.

Finding 4. Less Layout Bias in Long Documents: Kryściński et al. (2019) found substantial layout bias in news source texts, where nearly 60% of important sentences are contained in the first 30% of the source articles, and argued that such layout bias does not apply to other domains. Our findings on the uniformity of salient content in Table 1 support their argument: the salient content of long documents is scattered across the entire source text more uniformly than that of short documents. This suggests that, unlike in the short document summarization domain where models often benefit from exploiting layout biases (See et al., 2017; Gehrmann et al., 2018; Paulus et al., 2018), long document models that implement a truncation strategy to process only a small subset of the leading content will likely suffer from significant performance degradation.

Finding 5. Relationship between Intrinsic Characteristics: Beyond the intrinsic characteristics measured in Table 1, the statistical relationships between these metrics could yield insights into the underlying properties of a benchmark dataset. More importantly, whether the relationships between these metrics differ significantly under short and long document summarization settings should also be of great interest to practitioners. To quantify this, we report the pairwise correlations between each metric pair for both short document (lower diagonal) and long document (upper diagonal) benchmark datasets in figure 1a. The values reported are calculated using the Pearson correlation coefficient, $\rho$. In figure 1a, $\rho=1$ reflects a perfectly positive correlation between the metric pair (shown in darker blue) and $\rho=-1$ a perfectly negative correlation (shown in darker red).
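
As an illustration of how such a pairwise correlation table can be computed, the following is a minimal Python sketch using the standard Pearson formula. The metric values below are placeholders invented for the example and are not taken from Table 1; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def pairwise_correlations(metrics):
    """Return {(metric_a, metric_b): rho} for every unordered metric pair."""
    names = list(metrics)
    return {
        (a, b): pearson(metrics[a], metrics[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Placeholder per-dataset values, for illustration only (NOT from Table 1).
toy_metrics = {
    "compression_token": [5.0, 10.0, 15.0, 20.0],
    "compression_sent": [2.0, 4.0, 6.0, 8.0],
    "density": [8.0, 6.0, 4.0, 2.0],
}
toy_corr = pairwise_correlations(toy_metrics)
```

Running the same function separately on the short document and long document groups would give the two triangles of a matrix laid out like figure 1a.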

As positive controls, we see a strong positive relationship between the two compression ratios and between the two extractive metrics (coverage and density) under both short and long document settings. We also see a lack of statistical correlation when uniformity is measured against the other metrics, as uniformity relates more to the genre of the documents than to the other characteristics. We further observe redundancy to be inversely related to coverage and density, meaning that a more abstractive reference summary often contains more redundant information. This finding is consistent with a human evaluation study by Kryscinski et al. (2020), where writers were found to be more verbose and to write summary content that does not add information when writing unconstrained, abstractive summaries. Intriguingly, we see a weakly positive correlation between the extractive metrics and the compression metrics under the short document setting but a strongly negative correlation under the long document setting. We hypothesize that when authors have to write a concise summary, they are forced to paraphrase the original content more to ensure that the summary can cover the salient content within the constrained summary length.

3.2. Comparison between Long Document Dataset Benchmarks

Comparing the intrinsic characteristics of the long document benchmark datasets, arXiv and BIGPATENT have significantly higher compression ratios but lower extractive density values than the others, indicating that these two datasets require a summarization model to generate a significantly shorter summary that is not written in the same way as the source text. As discussed above, this is likely because, for a summary to cover more content within a constrained length, one has to paraphrase the original content more. This is also evidenced by the diverging values of the extractive coverage and density metrics for the arXiv benchmark dataset, which suggest that summaries of arXiv scientific papers share many tokens and terms with the source document (high coverage) but few phrases (low density). Overall, the BIGPATENT dataset is the most suitable benchmark for supervised abstractive summarization of long documents due to its low coverage and density, its substantial number of training pairs to serve as supervisory signals, and its high uniformity of salient content. However, only a handful of fully-supervised abstractive long document summarization works (Pilault et al., 2020; Zaheer et al., 2020; Zhang et al., 2020) evaluated their models on BIGPATENT, limiting the visibility of current progress on long document summarizers in general applications. This is despite the fact that BIGPATENT was introduced not long after arXiv and PubMed. Encouragingly, with the recent introduction of long document datasets in domains other than scientific papers, including financial reports (Loukas et al., 2021) and books (Kryściński et al., 2021), the research progress of long document summarization models towards general application should become clearer in the near future.

4. Fine-grained Analysis on ArXiv

To perform fine-grained human analysis on the arXiv benchmark, this survey implements a stratified random sampling strategy based on the six categories of scientific domains contained in the arXiv.org scientific repository: physics (ph), computer science (cs), mathematics (math), quantitative biology (q-bio), quantitative finance (q-fin) and statistics (stat). In total, we obtain over 700 annotated ground truth summaries, with physics having the most samples (369), followed by computer science (140). Based on fine-grained human analysis of 743 ground truth summaries in the arXiv test set, this subsection reports concerning data quality results and studies the degree of diversity in the formatting style of the reference summaries.

With the advent of data-hungry neural architectures, there has been an enormous demand for benchmark datasets with at least tens of thousands of document-summary pairs, which are created through heuristic means such as scraping them directly from the web. As a result, depending on how these datasets are extracted, the quality of benchmark datasets may vary significantly. To this end, Kryściński et al. (2019) quantified the percentage of noisy samples in the short document summarization datasets CNN-DM and Newsroom to be 4.19% and 3.17% using simple heuristic methods. The noise found through heuristic means can only suggest a lower bound on the true amount of noise, as heuristic approaches can only detect obvious structural flaws in the samples. This suggests that the true underlying noise is widespread and often understated. Glaringly, in our experiment, noisy data affects more than 60% of the annotated ground truth summaries in the randomly sampled arXiv test set. This is greater than the 54% detected in the XSUM dataset (Tejaswin et al., 2021). While many of the errors are minor, more than 15% of the reference summaries have significant errors where at least half of the summary contains errors, rendering the summary unreadable. Reassuringly, the noise in the remaining affected samples is not overly significant and often affects only one or two sentences, in a benchmark dataset whose reference summaries average 10 sentences. To further understand why the noise and errors occurred, we traced the original data based on the arXiv id provided by the benchmark dataset. We found that many errors, such as missing content or sentences breaking after a newline, could be due to the large-scale scraping of the original data using pandoc (Cohan et al., 2018). As neural summarization models may overfit to this noise and contribute to less interpretable benchmarking results, we release the annotated data to allow a better understanding of the common noise and to encourage quality improvement in future benchmark datasets. The annotated dataset is released at https://github.com/huankoh/long-doc-summarization.

Other than analyzing the noise in the arXiv benchmark dataset, we also explore the extent to which the ground truth summary covers the various sections or facets of the source article. Figure 1b shows the average distribution of sections covered by ground truth summaries for different domains. The distribution is plotted based on the assumption that each sentence in the reference summary covers a single facet of the original article, and human annotators are asked to identify the section covered by each summary sentence. The facets studied are Introduction, Methodology, Result, Conclusion and Limitation of Research. Interestingly, while the reference summaries in all domains cover the introduction and methodology sections with similar emphasis, we see a negative correlation between contribution and conclusion (i.e., papers that emphasize contributions write less on conclusions, and vice versa). Notably, scientific papers in the mathematics domain place more emphasis on contributions, while papers in the physics domain place more emphasis on conclusions. These results make intuitive sense, as findings in mathematics often do not lead to a strong substantive conclusion. The trade-off between the various sections also illustrates the inevitable information loss when summarizing a long document, as the summary can only describe certain aspects of the source document but not all.

Importantly, except for quantitative-biology, the abstracts of scientific papers in all domains do not discuss the limitations of their research. This is consistent with the common practice of writing abstracts that attract readers to the original paper by emphasizing the findings and contributions of the authors. Nevertheless, most researchers would find a discussion on the limitations of research works to be informative and significant. Whether the abstract itself represents the best possible summary for a summarization architecture to imitate and learn from, so that it appropriately summarizes all the salient content including the limitations discussed in the original paper, remains an important question to be answered. Recent progress on summarization approaches that generate user-specific summaries based on the needs of readers is also an important direction towards general applicability of summarization models in commercial settings (He et al., 2020; Wu et al., 2021). Lastly, summarizing the limitations of a paper often requires external knowledge beyond the content of the source document, and whether current summarization models are able to infer such knowledge from the benchmark dataset is an interesting study left for future work.

Models

The following describes the differences between the extractive, abstractive and hybrid summarization approaches and the general taxonomy of a summarization system.

The works in automatic text summarization research are traditionally classified into three different summarization approaches: (i) the extractive approach that involves direct extraction of salient fragments such as sentences of the original documents into a summary (Gong and Liu, 2001; Cheng and Lapata, 2016), (ii) the abstractive approach that imitates the human behavior of paraphrasing important parts of a document into a summary (Rush et al., 2015; See et al., 2017) and (iii) the hybrid approach that attempts to combine the best of both approaches (Gidiotis and Tsoumakas, 2020; Manakul and Gales, 2021). Intuitively, the extractive summarization method is an easier machine learning task and can be thought of as a classification and/or ranking problem of extracting lexical fragment units (e.g., sentences) into a summary. In contrast, abstractive summarization requires paraphrasing important ideas of a document into a summary, either by rearranging words and phrases from the original text or by contriving novel wordings, while maintaining the factual consistency of the generated summary with the original document.

Since the extractive summarization approach only extracts and arranges the original text that it believes to be salient and does not alter the original text, it enjoys the benefit of generating summaries that are factually consistent with the source document (Cui and Hu, 2021). Nevertheless, as human-based summarization often involves paraphrasing ideas and concepts into shorter, concise sentences, the extracted sentences of this approach often contain redundant and uninformative phrases (Grenander et al., 2019). While there exist extractive summarization models that break a source document into lower lexical units than sentences (e.g., elementary discourse units) (Xu et al., 2020b), they are often not applied in the long document summarization domain due to the extreme length of the input document.

On the other hand, mimicking how humans write summaries, the abstractive summarization approach presents a blue-sky potential of generating summaries that are fluent, concise and relevant to the source document (See et al., 2017). It can also incorporate external knowledge to the summary depending on the needs of a user (Maynez et al., 2020). However, at the current stage of development, summaries generated by the state-of-the-art abstractive models often contain a significant amount of content that is factually inconsistent with the source document, limiting its application in commercial settings (Cao et al., 2018; Kryscinski et al., 2020).

Finally, in response to the limitations of current model architectures and designs, the hybrid summarization approach only differs from the abstractive summarization approach in that it takes in a carefully chosen subset of the original input document rather than the entire input document in its original form (Pilault et al., 2020; Gidiotis and Tsoumakas, 2020). This extra step reduces the burden on the abstractive summarization models that have to generate an abstract summary and select important content at the same time. This approach is used more often in the long document summarization domain because current models still either (a) fail at reasoning over extremely long texts (Meng et al., 2021; Manakul and Gales, 2021) and/or (b) suffer from memory complexity issues and hardware limitations that prevent them from processing a long input text (Zaheer et al., 2020; Huang et al., 2021).

In each long document summarization model, this paper breaks down a model into two different constituents: (i) Main Architecture and (ii) its Mechanisms. The main architecture refers to the core framework structure that a model uses and the mechanisms are the different settings or modifications implemented by a model to the main architecture. Two differing models may use the same main architectures but are implemented with different mechanisms, and vice-versa. For example, models that use graph-based main architecture may use different encoding mechanisms in vectorizing the sentences of an input document. The following describes the various main architectures of the summarization models together with how previous works differ in the mechanisms employed to generate long document summaries.

2. Main Architecture and its Mechanisms

In the search for optimal architectural settings of summarization systems, the research field started out with many different novel designs of main architectures and mechanisms but often converges towards the few ideas that are most effective, until another ground-breaking idea leap-frogs the performance of previous systems and the cycle repeats.

1. Graph Architecture:

For the extractive summarization approach, the classic graph architecture involves a two-stage process of mapping a document into a graph network, where the vertices are sentences and the edges are the similarity between these sentences, and extracting the top-$K$ sentences. The sentences are ranked based on the graph centrality scoring of each sentence (Mihalcea and Tarau, 2004; Erkan and Radev, 2004). As there are many different ways to (a) encode or vectorize a sentence before calculating the similarity between them and (b) calculate the centrality score of each sentence, research involving this architecture often differs only in these two mechanisms. For example, with respect to the former mechanism, graph architectures in the past (Mihalcea and Tarau, 2004; Erkan and Radev, 2004) encode sentences based on word-occurrence or term frequency-inverse document frequency (Tf-Idf) while graph architectures today (Zheng and Lapata, 2019; Liang et al., 2021) encode sentences with state-of-the-art pre-trained models. On the other hand, to improve the centrality scoring mechanism, PacSum (Zheng and Lapata, 2019) and FAR (Liang et al., 2021) adjust the centrality score of a sentence based on whether the other sentences come before or after it, while HipoRank (Dong et al., 2021) exploits the discourse structure contained in the document by adjusting the centrality score with positional and sectional bias. In general form, given a set of sentences in the original source document, $D=\{s_{1},s_{2},...,s_{m}\}$, with the inter-sentential similarity relations represented as $e_{ij}=(s_{i},s_{j})\in E$ where $i\neq j$, the following illustrates the aforementioned architecture in computing the scoring for each sentence:
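
As a hedged illustration of this two-stage process, the sketch below assumes plain degree centrality, that is, the score of $s_i$ is the sum of its similarities $e_{ij}$ to every other sentence. The whitespace tokenizer, the smoothed idf weighting and the function names (`tfidf_vectors`, `degree_centrality`, `extract_top_k`) are illustrative assumptions and are not taken from any of the cited systems, which use their own encoders and scoring adjustments.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse Tf-Idf vector (a dict) per whitespace-tokenized sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption) keeps every weight strictly positive.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def degree_centrality(sentences):
    """Score each sentence by the sum of its similarities to all other sentences."""
    vecs = tfidf_vectors(sentences)
    m = len(vecs)
    return [
        sum(cosine(vecs[i], vecs[j]) for j in range(m) if j != i) for i in range(m)
    ]


def extract_top_k(sentences, k):
    """Pick the k most central sentences and return them in document order."""
    scores = degree_centrality(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Swapping `tfidf_vectors` for a pre-trained sentence encoder, or re-weighting the sum by sentence position, would correspond to changing the two mechanisms discussed above while keeping the same main architecture.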

The similarity between each pair of sentences is computed using similarity measures such as dot product or cosine similarity, and the sentences are vectorized using Tf-Idf or BERT representation values. The final summary is generated by extracting the top-$K$ sentences ranked by $centrality(s_{i})$. Importantly, while there are other classical architectures (Gong and Liu, 2001; Vanderwende et al., 2007), the graph architecture is worth a separate mention here due to the fact that (a) it remains a strong baseline against other advanced architectures, (b) it can effectively incorporate external knowledge as an inductive bias to the calculation of the importance of a sentence and (c) it achieves state-of-the-art results in the long document unsupervised extractive summarization setting when integrated with current state-of-the-art pre-trained models (Dong et al., 2021; Liang et al., 2021). Lastly, other than the multi-sentence compression approach (Boudin and Morin, 2013; Ju et al., 2020; Zhao et al., 2020a) that may be extended to long document summarization tasks, there has been no applicable work on classical graph-based architecture for long document abstractive summarization.

2. Other Classical Architectures:

In the early work of automated, non-neural text summarization models, past research mostly focused on the extractive summarization approach due to the difficulty of the abstractive summarization task. The main architectures that were tested ranged from support vector machines (Chali et al., 2009; Shivakumar and Soumya, 2015), Bayesian classifiers (Kupiec et al., 1995), decision trees (Mani and Bloedorn, 1998; Knight and Marcu, 2002) to citation network-based summarization (Qazvinian and Radev, 2008). The ones that remained relevant when comparing model performance across various benchmarks in long document summarization settings are LSA (Gong and Liu, 2001), which is based on Singular Value Decomposition (SVD), and SumBasic (Vanderwende et al., 2007), which ranks sentences by simple average word-occurrence probability.

3. Recurrent Neural Networks:

Extractive summarization that employed neural networks with continuous representations, rather than pre-trained word embeddings on traditional techniques (Kobayashi et al., 2015; Yogatama et al., 2015), was proposed by Cheng and Lapata (2016). The model implemented an RNN encoder-decoder architecture with an attention mechanism to locate the region of focus during the sentence extraction process. Nevertheless, due to the lack of a large-scale long document dataset and the RNN's inability to capture long-range temporal dependencies across a long input text, it was not until Xiao and Carenini (2019) that LSTM-minus (a variant of RNN) was applied to the long document summarization task. Typical of a long document summarization system, it incorporates discourse information (i.e., section structure) of the source document by encoding the section-level and document-level representation into each sentence to significantly boost the model performance. Pilault et al. (2020) also suggested two different variants of RNNs for extractive summarization of long documents. Rather than utilizing pre-trained word embeddings, they implemented a hierarchical LSTM to encode words and sentences separately.

When it comes to the abstractive summarization approach, Celikyilmaz et al. (2018) proposed multiple communicating agents to address the task of long document summarization. However, as compared to other simpler architectures, this approach did not gain significant traction after the introduction of the first large-scale dataset on long scientific documents. Together with the contribution of the two most commonly used long scientific document datasets, arXiv and PubMed, Cohan et al. (2018) presented an LSTM encoder-decoder architecture where the decoder attends to each section of the source document to determine section-level attention weights before attending to each word. While similar architectures have been widely used in prior works, this work effectively incorporates discourse information that suits the long document summarization task well.

4. Transformers:

The Transformer model proposed in 2017 together with the pre-trained Bidirectional Encoder Representation from Transformers (BERT) model that was based on the Transformer model itself have taken the NLP area by storm (Vaswani et al., 2017; Devlin et al., 2019). Like other NLP tasks, subsequent summarization model architectures have changed significantly to take advantage of these two momentous ideas. Importantly, BERTSum (Liu and Lapata, 2019) showed that by modifying the BERT-segmentation embeddings, it can capture not only sentence-pair inputs but multi-sentential inputs. The BERTSum model could effectively solve both extractive and abstractive summarization tasks. The extractive summarization model proposed by BERTSum involves stacking a Transformer-based classifier on top of the fine-tuned BERT to select and extract salient sentences while the abstractive summarization model involves a classic encoder-decoder Transformer framework where the encoder is a fine-tuned BERT and the decoder is a randomly-initialized Transformer that is jointly trained together in an end-to-end manner. While effective, this architecture cannot be implemented to solve long document summarization tasks due to the BERT’s input length limit of 512 tokens. In this survey, we define Transformer as the main architecture and settings related to the Transformer model as the mechanisms including the use of different types of pre-trained models. As Transformer has effectively replaced most main architectures in both summarization approaches as state-of-the-art models, the various mechanisms applied in the long document summarization context will be thoroughly discussed in section 4.3.

3. Mechanisms of Transformer-based Architectures

Transformer-based models are ubiquitously state-of-the-art across a wide range of tasks in the NLP domain. In line with this development, recent works on long document summarization models often use the same Transformer base architecture but propose different mechanisms. These Transformer-based models implement novel mechanisms with long document adaptations to ensure that the task of summarizing a document with a significantly longer input sequence can be effectively addressed. The mechanisms used by extractive, abstractive and hybrid Transformer-based summarization models are described in the following, with an overview of mechanisms used by abstractive and hybrid summarization models shown in Figure 3.

As Transformer and its pre-trained models are optimized for short document settings, they may not reason well over long text sequences if not properly fine-tuned. To this end, Cui et al. (2020) proposed combining neural topic modeling together with BERT in learning a topic-enhanced, inter-sentence relationship across the entire document. Nonetheless, the issues of memory complexity and input token length limits were not resolved and significant source text is truncated under this research setting. Recently, Cui and Hu (2021) proposed a memory network that incorporates graph attention networks and gated recurrent units to dynamically select important sentences through sliding a window along the entire source document. This approach can effectively integrate the pre-trained BERT model for long document summarization task by limiting its usage within each window, where the window size is set to be lower than or equal to 512 tokens.
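
To make the window constraint concrete, the sketch below packs consecutive sentences into windows that stay within a fixed token budget, so that each window could be passed to a 512-token encoder. It is only an illustration of the windowing idea: the whitespace token count, the non-overlapping greedy packing and the name `sentence_windows` are assumptions, and the memory network, graph attention and gating components of Cui and Hu (2021) are not modeled.

```python
def sentence_windows(sentences, max_tokens=512):
    """Greedily group consecutive sentences into windows of at most max_tokens.

    Tokens are approximated by whitespace splitting (an assumption; a real
    system would count subword tokens). A single sentence longer than
    max_tokens is kept as its own oversized window rather than being split.
    """
    windows, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(sent.split())
        # Close the current window if adding this sentence would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            windows.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        windows.append(current)
    return windows
```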

1) General Sequence-to-Sequence Pre-training Task: Since the advent of BERT (Devlin et al., 2019), various large-scale models with different pre-training tasks have been introduced. As summarization with the abstractive approach is naturally a sequence-to-sequence task, a pre-trained model with a sequence-to-sequence objective suits it better than an encoder-only (e.g., BERT/RoBERTa) or a decoder-only (e.g., GPT-2/GPT-3) pre-trained model. In the summarization domain, Bidirectional and Auto-Regressive Transformers (BART) (Lewis et al., 2020a) and Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020) are the two most widely used sequence-to-sequence pre-trained models. BART is pre-trained on a self-supervised task of reconstructing arbitrarily corrupted text while T5 is pre-trained on both unsupervised and supervised objectives, such as token masking, as well as translation and summarization. Interestingly, none of the supervised Transformer models in the long document summarization domain has implemented a summarizer with the T5 pre-training task despite its success in the short document domain (Rothe et al., 2021).

2) Gap-Sentence Generation (GSG) Pre-training Task: Other than generalized pre-training tasks like those of BART and T5, PEGASUS (Zhang et al., 2020) attempted to significantly advance the progress in the abstractive summarization field through large-scale pre-training with objectives that are specific to the summarization task. The proposed model is pre-trained on two large-scale datasets (C4 and HugeNews) with the gap-sentence generation pre-training task. This task closely resembles the general summarization task by self-supervising the model to generate sentences that are masked entirely in a given document. At the time of its release, PEGASUS achieved state-of-the-art results across 12 different benchmark datasets, including the long document arXiv and PubMed datasets.
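
The gap-sentence objective can be illustrated by how a single training pair might be constructed: whole sentences are replaced by a mask token in the input, and the removed sentences become the generation target. The sketch below is a simplified assumption rather than the PEGASUS implementation; the placeholder token `<mask_sent>`, the function name `gap_sentence_example` and the choice to let the caller decide which sentences to mask are all illustrative.

```python
def gap_sentence_example(sentences, masked_indices, mask_token="<mask_sent>"):
    """Build one (source, target) pair for gap-sentence generation.

    sentences: list of sentence strings forming a document.
    masked_indices: positions of the sentences to hide (selection strategy
        is left to the caller; it is not specified here).
    mask_token: hypothetical placeholder inserted in place of each hidden sentence.
    """
    masked = set(masked_indices)
    # Source: the document with every selected sentence replaced by the mask token.
    source = " ".join(mask_token if i in masked else s for i, s in enumerate(sentences))
    # Target: the hidden sentences, concatenated in document order.
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target
```

A sequence-to-sequence model trained on such pairs has to produce complete sentences from the remaining context, which is why the objective resembles summarization more closely than token-level masking does.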
3) Efficient Attentions: The vanilla Transformer models that utilize full attention have a memory complexity of $O(n^{2})$. This attribute limits their wider usage across many domains, including long document summarization. For example, to circumvent the input token limit of PEGASUS, DANCER (Gidiotis and Tsoumakas, 2020) summarizes each section of the long document separately and concatenates the section summaries to form the final summary. As not all benchmark datasets contain discourse information such as section structures, this limits the model usage in many long document summarization settings. To this end, researchers have proposed various ingenious ideas to reduce the memory and time complexity of Transformer models. The variants of Transformer models that require less memory are often known as efficient Transformers (Tay et al., 2022; Tay et al., 2020) and the mechanism is referred to as efficient attention (Huang et al., 2021).

Longformer (Beltagy et al., 2020) combines local attention, stride patterns and global memory for fine-tuning pre-trained BART to effectively summarize long documents with a maximum input length of 16,384 tokens as opposed to the 1,024 token limit of the original BART model. The model achieved state-of-the-art results in the long document summarization along with other NLP tasks when the model was introduced. BigBird (Zaheer et al., 2020) also implemented the efficient attention mechanism on Transformer-based abstractive summarizer by utilizing the same attention modifications as Longformer with an additional random pattern to achieve matching performance results in terms of ROUGE score. An important work by Huang et al. (2021) explores and compares the performance of different variants of efficient transformers in the context of long document summarization.
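
The saving from combining local and global attention can be seen by counting how many query-key pairs are scored. The sketch below is a back-of-the-envelope count under simplifying assumptions: a symmetric window of `half_window` tokens on each side, and the first `n_global` tokens treated as global (they attend to, and are attended by, every position). It does not model stride patterns or the random pattern of BigBird, and the parameter names are illustrative rather than taken from either implementation.

```python
def full_attention_pairs(n):
    """Number of query-key pairs scored by full self-attention over n tokens."""
    return n * n


def local_global_attention_pairs(n, half_window, n_global):
    """Number of query-key pairs under local (sliding-window) plus global attention."""
    total = 0
    for i in range(n):
        if i < n_global:
            total += n  # a global token attends to every position
            continue
        lo = max(0, i - half_window)
        hi = min(n - 1, i + half_window)
        local = hi - lo + 1
        # Global columns (positions < n_global) not already inside the local window.
        extra_global = min(n_global, lo)
        total += local + extra_global
    return total
```

For a fixed window and a fixed number of global tokens, the count grows linearly with `n` rather than quadratically, which is what makes inputs of several thousand tokens tractable.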

4) Prompt-Engineering: The GPT-3 model (Brown et al., 2020) has strongly demonstrated that large-scale pre-trained language models can achieve impressive results on numerous downstream language tasks in zero- and few-shot experimental settings. Rather than fine-tuning the language models for specific tasks in a conventional way, natural language prompts and task demonstrations were created for GPT-3 to infer and complete the tasks. Importantly, this is different from conventional tagging methods such as using the <bos> token for conditional generation or taking the [CLS] tag from BERT for classification tasks. Prompt engineering refers to taking the extra step to design a natural language prompt or template that can optimize the pre-trained model for a specific task. Figure 4 illustrates this important difference.

Many works have explored ways of uncovering the right prompt for various downstream language tasks to significantly boost the performance of pre-trained models (Gao et al., 2021; Talmor et al., 2020). In the long document summarization research area, CTRLSum (He et al., 2020) achieves a significant improvement over the vanilla fine-tuned BART model on the arXiv dataset through prompt-engineering. The model attempts to use the pre-trained BART model more effectively with the help of extracted keyword prompts, as shown in Figure 4. Further, the work also showed that, given an optimized language prompt, the implemented BART summarization model can achieve a ROUGE score that matches the ROUGE score of oracle summaries in the test dataset.

5) Signal Guidance: Unlike a prompt-engineering mechanism that requires an engineered language prompt or template for a given task, the signal guidance mechanism relates to utilizing signals as inputs to lead models in better identifying and summarizing important content of source texts. Using this approach, the GSum (Dou et al., 2021) model implemented a fine-tuned BART model with dual encoders, one for the input document and another for the extracted signals, and a decoder that attends to both encoded representations. A similar signal-based approach is also used by Ju et al. (2021) to implement an unsupervised pipeline-based long document summarization model.

6) Discourse Bias: Similar to the signal guidance mechanism, discourse bias involves the inclusion of the discourse structure of a source document, such as the section of a sentence, as signals for summarization systems to better identify and summarize important content in the original source text (Christensen et al., 2013; Yasunaga et al., 2017).
This mechanism can be classified under signal guidance but is mentioned separately due to its effectiveness and popularity in Transformer (Pilault et al., 2020; Gidiotis and Tsoumakas, 2020) and non-Transformer (Cohan et al., 2018; Dong et al., 2021) based long document summarization models. Unlike short documents, long documents often contain discourse structure information such as a table of contents, section structures, references and others to guide a human reader in comprehending the original document, and previous works have exploited this information to achieve state-of-the-art results. Nonetheless, previous works in the long document summarization domain only utilized discourse information that is made available by the benchmark datasets and did not implement automatic discourse parsing using RST trees (Mann and Thompson, 1988) and coreference mentions due to the difficulties of building an effective representation for a document with extreme input length (Ji and Eisenstein, 2014; Xu et al., 2020b).

The hybrid summarization approach only differs from the abstractive approach in that it takes in a carefully chosen subset of the input document rather than the entire input document. This extra step reduces the burden on the neural summarizers that have to generate an abstract summary and select important content at the same time. Some also refer to models that utilize the hybrid approach as retrieve-then-summarize models because they involve retrieving a subset of the long document text before summarizing it (Zhang et al., 2021). TLM+Ext (Pilault et al., 2020) first implemented this method by limiting the input for the scientific articles in the arXiv dataset to the introduction of the document, a subset of sentences carefully selected from the original article using the extractive summarization approach and, finally, the remaining text if there is extra space left for the Transformer-based decoder. However, given the effectiveness of sequence-to-sequence neural models, one limitation of this work is that it only utilizes a decoder framework to generate the final summary rather than the encoder-decoder framework that most subsequent works on abstractive and hybrid summarization approaches use. Consequently, LoBART (Manakul and Gales, 2021) proposes a hybrid summarization system that completes summary generation in two separate steps: (i) content selection, which uses a multi-task RNN to select salient content from the original source document until the total text output reaches the limit of the sequence-to-sequence pre-trained BART model, and (ii) abstractive summarization, which summarizes the carefully selected subset using a pre-trained BART model with an efficient Transformer mechanism.
SEAL (Zhao et al., 2020b) presents a generalized encoder-decoder framework for Transformer-based long document summarization and proposes an abstractive summarization system that selects salient content and dynamically chooses segments of the selected content for the decoder to attend to and summarize in an end-to-end manner. The architecture, however, did not attempt to exploit the large-scale pre-trained models that were used in most summarization research works. Lastly, facing a similar issue, developments in the open-domain question-answering and knowledge-intensive language tasks reflect an interesting parallel with the progress in the long document summarization domain (Lewis et al., 2020b; Petroni et al., 2021).
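
The content selection step of a retrieve-then-summarize pipeline can be sketched as filling a token budget with the highest-scoring sentences. In the sketch below the salience scores are supplied by the caller; in a system such as LoBART they would come from a trained selector, which is not reproduced here. The greedy strategy, the whitespace token count and the name `select_content` are assumptions made for illustration, and the downstream abstractive summarizer is omitted.

```python
def select_content(sentences, scores, max_tokens):
    """Greedily keep the highest-scoring sentences that fit in max_tokens.

    sentences: list of sentence strings from the source document.
    scores: one salience score per sentence (hypothetical selector output).
    max_tokens: input budget of the downstream summarizer, e.g. 1,024.
    """
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        n_tokens = len(sentences[i].split())  # whitespace tokens (an assumption)
        if used + n_tokens > max_tokens:
            continue  # skip sentences that no longer fit in the budget
        chosen.append(i)
        used += n_tokens
    # Restore document order so the summarizer sees a coherent subset.
    return [sentences[i] for i in sorted(chosen)]
```

Raising `max_tokens` from 1,024 to 4,096 corresponds to pairing the selector with an efficient-attention summarizer that can read a longer selected subset.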

4. Summary of Trends in Long Document Summarization Systems

Table 2 summarizes the trends and developments in long document summarization models as discussed above. The two standout base architectures that are used in the long document summarization domain are the graph-based ranking algorithm for unsupervised extractive models and the pre-trained Transformer for supervised abstractive models. While both architectures were initially proposed and tested on short documents, they can be effectively adapted to summarize long documents after incorporating novel mechanisms. For a brief description of each model, please refer to the Supplementary Materials.

Finding 1. Graph-based Extractive Models with Discourse Bias: Classical graph-based unsupervised extractive models have been found to suffer from picking similar sentences that results in a summary with redundant sentences (Liang et al., 2021). To this end, HipoRank (Dong et al., 2021) implements the graph-based architecture for unsupervised extractive summarization by including the sectional information of ArXiv/PubMed as inductive bias when calculating centrality scoring to achieve state-of-the-art results. The discourse bias mechanism is commonly incorporated by other proposed summarization models, including models with RNN and Transformer base architectures.

Finding 2. Pre-training task for Abstractive Summarization Models: Interestingly, despite having other pre-trained sequence-to-sequence models such as T5 (Raffel et al., 2020), BART and PEGASUS are the only two Transformer-based pre-trained models that were used for long document summarization (Lewis et al., 2020a; Zhang et al., 2020). Nevertheless, as both pre-trained models are trained on short documents, they have an input limit of 1,024 tokens. To process long documents that are longer than this limit, the pre-trained Transformers will have to incorporate long document mechanisms to extend the input limits.

Finding 3. Long Document Mechanisms for Transformer: As pre-trained models were often trained on large-scale datasets with an input length limit between 512 and 1,024 tokens (Devlin et al., 2019; Lewis et al., 2020a), these Transformer-based pre-trained models were optimized for short document language tasks rather than long documents. Meng et al. (2021) have shown that, without any long document mechanisms to adapt it for the long document summarization task, BART cannot summarize a long document effectively. Other than the discourse bias mechanism, we observe that (a) efficient attention and (b) content selection mechanisms are the two most notable long document mechanisms. As the content selection mechanism requires a separate retriever to extract salient content from the source (i.e., the hybrid approach), we refer to Transformer models with a content selection mechanism as retrieve-then-summarize models (Zhang et al., 2021) and to pure encoder-decoder Transformers without this mechanism as end-to-end models for the rest of this work. Lastly, it is also important to note that both mechanisms can be jointly implemented within a single architecture, where the content selection mechanism will extract a longer subset of input to be processed by a Transformer with efficient attention (Manakul and Gales, 2021).

Multi-dimensional Analysis of Long Document Summarizers

Given the important findings on the graph-based unsupervised extractive model and the Transformer-based supervised abstractive model in the previous section, we design an experiment with the aim of thoroughly understanding the reasons behind the popularity of these architectures and their mechanisms. Our experiment tests the graph-based extractive and Transformer-based abstractive summarization architectures and their mechanisms on the arXiv benchmark dataset. The documents in the arXiv dataset have an average length of 6,446 tokens. Mechanisms of the supervised extractive approach are not examined as the architectures used by the proposed models vary greatly.

To investigate the effect of incorporating long document discourse structure information, we experiment with four unsupervised graph models by varying the two following mechanisms:

Sentences of source text are encoded either using Term frequency–inverse document frequency (Tf-Idf) (Jones, 1972) or BERT SentenceTransformer (Reimers and Gurevych, 2019).

For both models that implement a Tf-Idf or BERT sentence encoder, we experiment with one model with the long document discourse bias mechanism and one without the bias. For models without a discourse bias mechanism, the centrality score of each sentence is computed as the sum of its cosine similarities with all other sentences. For models with a discourse bias mechanism, the mechanism implemented follows the work of Dong et al. (2021). For each sentence, we adjust the centrality score based on the sentence position within each section, $c_{intra}(s_{i})$, and the position of the sentence's section within the document, $c_{inter}(s_{i})$. Sentences that are closer to the section and document boundaries will be given higher importance. The "discourse-aware" centrality score for each sentence is:

where $\mu_{1}$ is a weighting factor for inter-section centrality. Following the original authors, we fine-tune the weighting factor based on the validation set. To ensure comparability, the maximum summary length for all unsupervised extractive models is set to 242 tokens. Details of implementation are reported in the Supplementary Materials.
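
The combination described above can be written as a weighted sum of the two terms. The expression below is a sketch that assumes the linear form used by Dong et al. (2021), in which $\mu_{1}$ scales the inter-section term; the symbol $c(s_{i})$ for the combined score is introduced here for illustration, and the exact expression used in the experiments may differ in detail.

```latex
c(s_{i}) = \mu_{1} \cdot c_{inter}(s_{i}) + c_{intra}(s_{i})
```

Under this reading, setting $\mu_{1}$ to a larger value gives more weight to where a sentence's section sits in the document relative to where the sentence sits within its section.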

1.2. Transformer - Supervised Abstractive

To study the current state-of-the-art abstractive neural summarizers, we experiment with two different pre-trained Transformers, BART and PEGASUS. For BART, we analyze the long document mechanisms from the perspective of two common approaches:

We experiment with three end-to-end BART models. Firstly, a vanilla BART model with full self-attention that will truncate any input text that exceeds 1,024 tokens. Then, two BART models with efficient Longformer attention (Beltagy et al., 2020) that can extend up to 4,096 and 16,384 input tokens respectively. The main goal of assessing end-to-end BART with and without efficient attention is to assess how the quality of the generated summary is affected when BART is adapted from a short document summarizer into a long document summarizer by allowing it to process long input sequences at the cost of full self-attention. For implementation, due to lack of computational resources, we only fine-tuned the original BART on arXiv and obtained the weights of Longformer that was trained on arXiv from the original authors (https://huggingface.co/allenai/led-large-16384-arxiv).

For the BART retrieve-then-summarize model, our experiment follows entirely the implementation of LoBART by Manakul and Gales (2021) (https://github.com/potsawee/longsum0). LoBART has two variants: (a) BART with full self-attention that takes in a selected subset of input text with a maximum length of 1,024 tokens and (b) BART with efficient local attention where the maximum length of the subset input is 4,096 tokens. The only difference between the two variants is the amount of content to be retrieved from the original source text before feeding it into the Transformer BART model. The two main objectives of experimenting with the retrieve-then-summarize BART models are to assess: (1) the effectiveness of the content selection mechanism in adapting short document BART to summarize a long document and (2) whether the performance is improved when the content selection mechanism is combined with the efficient attention mechanism.

As there is no existing framework that applies a content selection mechanism to PEGASUS, we only experiment with two end-to-end PEGASUS models: the original PEGASUS (https://github.com/google-research/pegasus) and PEGASUS with BigBird efficient attention (https://github.com/google-research/bigbird). The models have input token limits of 1,024 and 4,096 respectively. Pre-trained weights for both model variants on the arXiv benchmark are obtained directly from the original authors.

1.3. Assessing Model Outputs

Rather than relying entirely on ROUGE score like most summarization research settings, we use four different metrics to analyse model summary outputs from three important dimensions (D#): relevance, informativeness and semantic coherence.

Relevance of a summary is the extent to which a summary contains the main ideas of a source. We use the ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2019) metrics to measure the relevance of the candidate summary.

Informativeness is the amount of new information and knowledge a summary brings to the reader (Peyrard, 2019a). This information may not necessarily be key to the narrative of the source but should add value to readers. For example, the limitations of an academic article are not central to the narrative but do add value to readers. This metric tests a model architecture’s ability to generate a summary that covers different aspects of the original source text. It is approximated by the percentage of sections that are covered by a candidate summary, where we assume that each sentence of the candidate summary covers exactly one section, namely the section against which it achieves the highest ROUGE-L score.
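The section-coverage approximation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the function names are hypothetical, ROUGE-L is computed here as an LCS-based F1 over whitespace tokens (the text does not say which ROUGE-L variant is used), and a sentence with zero overlap against every section is treated as covering none.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 (assumed variant of ROUGE-L)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def section_coverage(summary_sentences: List[str], sections: Dict[str, str]) -> float:
    """Percentage of sections that are the best ROUGE-L match of some summary sentence."""
    if not sections:
        return 0.0
    covered = set()
    for sent in summary_sentences:
        scores = {name: rouge_l_f1(sent, text) for name, text in sections.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:  # assumption: no overlap means no section is covered
            covered.add(best)
    return 100.0 * len(covered) / len(sections)


sections = {
    "introduction": "summarization of long documents is hard",
    "method": "we use a graph based ranking model",
    "limitations": "the experiment only covers arxiv",
}
summary = ["long documents is hard to summarize", "we use a ranking model"]
print(section_coverage(summary, sections))
```

In this toy case two of the three sections are matched, so a summary that never touches the limitations section is scored as less informative.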

Semantic coherence measures whether a summary is fluent and semantically coherent. Following Bommasani and Cardie (2020)’s implementation, this is approximated as:

where $\mathrm{NSP}(\cdot)$ is the BERT next sentence prediction (NSP) function and $s_{i}$ denotes the sentence at position $i$ in the candidate summary. However, Bommasani and Cardie (2020) did not fine-tune the general pre-trained BERT model, whereas our BERT NSP model is fine-tuned on arXiv using positive and negative sentence pairs, reaching a final F1-score of 0.92.
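One plausible reading of an NSP-based coherence score is the mean NSP output over adjacent sentence pairs of the candidate summary; the sketch below implements that reading and should be taken as an assumption rather than the exact formula used here. The `nsp` argument is a hypothetical callable standing in for a fine-tuned BERT NSP model, and the word-overlap scorer exists only to make the example runnable.

```python
from typing import Callable, List


def semantic_coherence(sentences: List[str],
                       nsp: Callable[[str, str], float]) -> float:
    """Mean next-sentence score over consecutive sentence pairs (assumed form)."""
    if len(sentences) < 2:
        # Assumption: a summary with fewer than two sentences has no pair to score.
        return 1.0
    pairs = zip(sentences, sentences[1:])
    return sum(nsp(a, b) for a, b in pairs) / (len(sentences) - 1)


def toy_nsp(first: str, second: str) -> float:
    """Hypothetical stand-in for BERT NSP: 1.0 if the sentences share a word."""
    return 1.0 if set(first.split()) & set(second.split()) else 0.0


summary = ["we propose a model", "the model uses attention", "bananas are yellow"]
print(semantic_coherence(summary, toy_nsp))
```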

1.4. Other Implementation Details

For all the model variants implemented, the train/validation/test split of the arXiv benchmark dataset is 203,037/6,436/6,440 samples, the same as in all prior works, which follow the configuration of the original authors (Cohan et al., 2018). To ensure a consistent preprocessing pipeline, we follow the pre-processing of LoBART in all model implementations (Manakul and Gales, 2021). All models that require fine-tuning are trained on the same RTX 3090 GPU with 24 GiB of GPU memory. We use the pyrouge package for the ROUGE metric.

2. Results and Analysis

The following discusses the experimental findings based on the results shown in Table 3.

Finding 1. Global Word Representation versus Contextual Embedding: Using BERT as the sentence encoder boosts the performance of the unsupervised summarization model in the relevance and informativeness dimensions. We hypothesize that this is due to the semantic reasoning capability of BERT, which can encode important sentences that are informative but are not worded in the same way as the other important sentences when similarity and centrality scores are computed. This is particularly important for long documents, which have a higher compression ratio and a higher chance of the exact same information being repeated multiple times in the source text. Thus, not encoding the sentences with a semantically rich encoder may result in a summary with more redundancy.

Finding 2. Discourse Bias Mechanism boosts Relevance and Informativeness: The inclusion of positional and sectional bias when computing the sentence centrality score greatly improves the architecture’s ability to capture relevant sentences in the long document and to generate a more informative summary, validating our hypothesis in section 4. However, since extractive models merely concatenate the extracted sentences, drawing these sentences from various sections likely causes a drop in the semantic coherence of the final summary outputs.

2.2. Transformer - Supervised Abstractive

Finding 1. Diminishing Return of PEGASUS Pre-training: Compared to the BART-only model, the PEGASUS-only model achieved better performance across all dimensions, indicating that the PEGASUS pre-training mechanism helps a Transformer-based model write more relevant, informative and semantically coherent summaries. As PEGASUS is pre-trained on a different corpus from BART (Lewis et al., 2020a; Zhang et al., 2020), it is not conclusive whether the GSG pre-training task, the difference in the pre-training corpus, or both contributed to the superior performance of PEGASUS. Interestingly, the performance gain is less pronounced when the input sequence length is extended to 4,096 tokens with efficient attention. This could be due to the difference in the efficient attention mechanisms used, or because the need to predict salient content outside the truncated text has diminished.

Finding 2. Mixed Results on Retrieve-then-summarize Models: Both retrieve-then-summarize BART models achieve state-of-the-art results in terms of the standard ROUGE score metric. The results improved across all dimensions when the Transformer model processed a longer subset extracted by the retriever, demonstrating the effectiveness of the Transformer in computing pairwise relations between tokens to identify salient content. Except for BART with Longformer attention (16,384), retrieve-then-summarize models also performed better in informativeness as the models are allowed to process the entire source documents in the arXiv dataset. However, the retrieve-then-summarize models performed the worst in semantic coherence when compared to the other abstractive summarization models. As the content selection mechanism is not trained in an end-to-end manner, we hypothesize that this is due to the inevitable disconnect between the content selection mechanism and the encoder-decoder Transformer model at the inference stage. Further, it is possible that when the subset extracted by the content selection mechanism is not kept in its original order, the incoherence of the subset cascades down to the final summary output, causing a drop in semantic coherence. This finding also illustrates the importance of measuring model performance in a multi-dimensional way rather than relying entirely on the ROUGE score, which has been found to have important limitations (Kryściński et al., 2019; Bhandari et al., 2020).

Finding 3. Transformer’s Reasoning Capability over Long Sequences: Holding the pre-trained BART model constant, extending the input token limit of the pre-trained Transformer improves the summarizer’s ability to generate a summary that is more relevant, informative and semantically coherent. This finding is consistent with a human evaluation experiment by Huang et al. (2021), lending confidence to the automatic evaluation metrics used in this work. The impact of processing only 1,024 tokens is particularly obvious in the informativeness of the summary output, where the informativeness score of BART (1,024) is 10 points lower than that of BART (16,384). Importantly, the ROUGE score again did not fully capture this performance difference, highlighting the limitation of the traditional summarization research setting of measuring model performance using only ROUGE. Lastly, unlike the result of Meng et al. (2021), our experiment demonstrates that a Transformer can reason over long sequences provided that the right configuration is used to fine-tune the model for the specific downstream task.

Through this ad-hoc experiment, we systematically analyzed the common approaches in the long document summarization domain. The experimental results demonstrate that exploiting the explicit discourse structures of long documents in unsupervised models and processing longer inputs with long document adaptation of pre-trained Transformer models can yield promising outcomes for the long document summarization task. The results also show that retrieve-then-summarize models can achieve state-of-the-art results in terms of ROUGE score but may generate less coherent summaries.

3. Limitation of Experiment

Recent studies have found that up to 30% of the summaries generated by state-of-the-art abstractive summarization models contain factual inconsistencies (Cao et al., 2018; Kryscinski et al., 2020; Maynez et al., 2020). To address this issue, various models and metrics have been proposed to measure the factual consistency of candidate summaries conditioned on the source documents (Goyal and Durrett, 2020; Durmus et al., 2020). Nonetheless, due to the limitations of the proposed metrics, including the input length limit of pre-trained models, difficulty of implementation and performance variation across benchmarks (Pagnoni et al., 2021), we did not measure the factual consistency of summary outputs, which represents an important limitation of the multi-dimensional analysis above. We did try various adaptations of the textual entailment approach proposed by Maynez et al. (2020), but the models we tested had almost no discriminative ability and were thus not used. The robustness of the metrics used across the relevance, informativeness and semantic coherence dimensions should also be interpreted with care. Lastly, the experiment was conducted only on the arXiv benchmark dataset as it is the only dataset for which all pre-trained weights are publicly available. To encourage similar analysis across a wide range of benchmark datasets and model implementations, our evaluation metric toolkit for datasets and models is available at https://github.com/huankoh/long-doc-summarization.

Metrics

As evaluating generated summary outputs manually is costly and impractical, the efficient ROUGE metric (Lin, 2004) has long been the standard way of comparing summarization model performance. It measures the lexical overlap between the reference and candidate summaries, and the common measures are unigram overlap (ROUGE-1), bigram overlap (ROUGE-2) and longest common subsequence (ROUGE-L). However, as it is based on exact token matches, overlap between synonymous tokens or phrases is ignored. The limitations of ROUGE have therefore been widely explored (Kryściński et al., 2019; Bhandari et al., 2020; Chaganty et al., 2018; Hashimoto et al., 2019), and many have attempted to propose more comprehensive content overlap metrics using soft semantic overlap (Ganesan, 2018; Zhang et al., 2019). Further, while content overlap is the fundamental objective of summarization, the quality of a summary, as Gehrmann et al. (2018) and Peyrard (2019a) suggested, should be measured in a multi-dimensional way that includes relevance, factual consistency, conciseness and semantic coherence. Relevance refers to whether the candidate summary contains the main ideas. Factual consistency measures whether a candidate summary is factually consistent with the source document. Conciseness measures whether important information is encapsulated in a short and brief manner. Semantic coherence relates to the collective quality and fluency of summary sentences. Based on these quality aspects, the following discusses the research efforts in the wider summarization domain, with a focus on the long document summarization research setting at the end of this section.

A) Hard Lexical Overlap: As mentioned above, the ROUGE score is an efficient way to measure content overlap through hard lexical matching between the candidate summary and the ground truth summary. However, as ROUGE only considers exact matches between the reference summary and the model output, it (a) penalizes models that coin novel wordings and phrases that do not match the wording of the reference summary, (b) does not consider factual consistency between the model output and the source document and (c) does not directly consider the fluency and conciseness of a summary. Finally, the ROUGE score also goes against the human approach of clever paraphrasing and summarizing.
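The effect of exact matching can be seen in a small sketch of ROUGE-N recall with clipped n-gram counts. This is a simplified illustration (lowercasing and whitespace tokenization only, no stemming), not the pyrouge implementation, and the function names are chosen for this example. A paraphrase that keeps the meaning but changes the wording loses most of its unigram credit and all of its bigram credit.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    return overlap / total


reference = "the model produces short summaries"
paraphrase = "the system generates brief summaries"

print(rouge_n_recall(reference, reference, n=1))   # identical wording
print(rouge_n_recall(paraphrase, reference, n=1))  # only "the" and "summaries" match
print(rouge_n_recall(paraphrase, reference, n=2))  # no bigram survives the paraphrase
```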

B) Soft Content Overlap: To solve the problem of exact matching of lexical units, Zhang et al. (2019) propose a metric that measures soft overlap between the reference and candidate summaries by comparing the contextual BERT embeddings of both summaries. Other variants of this idea include MoverScore (Zhao et al., 2019), Word Mover Similarity and its extension, Sentence Mover Similarity (Clark et al., 2019; Kusner et al., 2015). Soft content overlap metrics often rely substantially on the encoder used to vectorize the candidate and ground truth summaries. BERTScore, for example, utilizes BERT as the underlying pre-trained model to encode its representations. While BERT has been shown to perform very well under many different benchmark settings, its performance in certain domains such as legal or scientific research has not been thoroughly explored. For example, Beltagy et al. (2019) fine-tuned BERT on large-scale scientific paper datasets and found that its performance in scientific domains improves over the BERT-base model. Together with Tejaswin et al. (2021)’s experimental result that BERTScore does not discriminate well between summaries with and without errors, this calls into question the use of the BERT-base model as the “independent evaluator” of candidate summaries across all domains.
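The soft matching idea can be sketched with greedy cosine matching between token vectors, in the spirit of BERTScore without importance weighting. The three-dimensional vectors below are invented toy values standing in for contextual BERT embeddings, and `soft_overlap_f1` is a name chosen for this example; the function assumes both token lists are non-empty.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_overlap_f1(cand: List[Vector], ref: List[Vector]) -> float:
    """Greedy matching: each token is paired with its most similar counterpart."""
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (assumption): "short" and "brief" point in nearly the same direction.
emb = {
    "short": (1.0, 0.0, 0.0),
    "brief": (0.9, 0.1, 0.0),
    "summary": (0.0, 1.0, 0.0),
    "banana": (0.0, 0.0, 1.0),
}
reference = [emb["short"], emb["summary"]]
paraphrase = [emb["brief"], emb["summary"]]
unrelated = [emb["banana"], emb["banana"]]

print(soft_overlap_f1(paraphrase, reference))  # close to 1 despite different wording
print(soft_overlap_f1(unrelated, reference))   # orthogonal vectors give no credit
```

The score is only as good as the encoder behind the vectors, which is the domain-dependence concern raised above.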

C) Reference-free Approach: Rather than measuring the quality of a candidate summary against a ground truth summary, reference-free metrics for relevance measure the quality of a candidate summary against pseudo-reference summaries generated from the source documents. Wu et al. (2020)’s proposed metric requires training samples of high-quality summaries for model supervision, while other reference-free metrics can generate metric scores without using high-quality summaries as supervisory signals (Gao et al., 2020; Chen et al., 2021). In section 3 on benchmark datasets, we saw that the information covered by a reference summary depends on the data annotation approach as well as the intent of the original authors. We further observed that the reference summaries of certain benchmark datasets contain significant noise. If supervised summarization models are trained on datasets with such issues, they may fit to target summaries that are inconsistent with the expectations and needs of summary readers. A reference-free approach that bypasses the requirement for ground truth summaries would thus benefit model development in cases where reader expectations are highly heterogeneous and/or ground truth summary labels are lacking. Nevertheless, the use of reference-free metrics is often limited by the fact that they still require pseudo-reference summaries to be generated by an “independent model”. Last but not least, the reference-free approach can also be used to augment reference-based metrics (Hessel et al., 2021).

2. Factual Consistency

Widespread factual inconsistency in the outputs of abstractive summarization models greatly limits the potential of these models to be applied in most commercial settings. To this end, automated metrics for factual consistency have been proposed (Durmus et al., 2020; Kryscinski et al., 2020; Maynez et al., 2020; Wang et al., 2020), which can be categorized into two approaches: Entailment Classification and Question Answering.

A) Entailment Classification Approach: The entailment classification approach evaluates the factual consistency of a candidate summary by breaking the summary down into smaller units (e.g., phrases or sentences) to be verified against the original document. For example, FactCC (Kryscinski et al., 2020) implements a BERT-based factual consistency classifier that is trained on synthetic data, where the positive examples are non-paraphrased and paraphrased sentences from the original source document, and the negative examples are artificially corrupted sentences from the source document. At the inference stage, the faithfulness score of a candidate summary is the number of consistent sentences divided by the total number of summary sentences. Similarly, other proposed models implement factual consistency classifiers by incorporating structured knowledge such as OpenIE triples (Goodrich et al., 2019) or dependency arcs (Goyal and Durrett, 2020). For the classifiers to be effective in discriminating the factual consistency of a candidate summary, they often require supervisory signals from factually consistent and inconsistent data (Pagnoni et al., 2021).
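The sentence-level aggregation described above (consistent sentences divided by total sentences) can be sketched as follows. The `is_consistent` argument is a hypothetical callable standing in for a trained classifier such as FactCC, and the word-containment rule used here is only a toy substitute so that the example runs.

```python
from typing import Callable, List


def faithfulness_score(summary_sentences: List[str],
                       source: str,
                       is_consistent: Callable[[str, str], bool]) -> float:
    """Fraction of summary sentences judged consistent with the source."""
    if not summary_sentences:
        return 0.0  # assumption: an empty summary receives the lowest score
    consistent = sum(1 for s in summary_sentences if is_consistent(s, source))
    return consistent / len(summary_sentences)


def toy_classifier(sentence: str, source: str) -> bool:
    """Hypothetical stand-in: consistent if every word occurs in the source."""
    return set(sentence.split()) <= set(source.split())


source = "the model was trained on arxiv papers"
summary = ["the model was trained on arxiv", "the model was trained on pubmed"]
print(faithfulness_score(summary, source, toy_classifier))
```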

B) Question-Answering Approach: The Question-Answering (QA) approach employs a question-generation model to generate questions from a given summary output (Durmus et al., 2020; Wang et al., 2020). Each generated question is then answered in two ways: i) conditioned on the source text and ii) conditioned on the summary output. If the answer obtained from the source text matches the answer obtained from the candidate summary, the answer is considered consistent; otherwise, it is inconsistent. The final score is based on the discrepancies between the answers generated from the candidate summary and the answers generated from the source document. Recently, QAGen (Nan et al., 2021) proposed generating questions and answers from a given text concurrently within a single model to improve the efficiency of this approach.
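A minimal sketch of the answer-comparison step is given below. It assumes that questions have already been generated and that answers are compared with token-level F1 averaged over questions, which is one common choice rather than the scoring rule of any specific metric cited above. The `answer` callable and the lookup table are hypothetical stand-ins for a real QA model.

```python
from collections import Counter
from typing import Callable, List


def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answer strings."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(questions: List[str],
                   answer: Callable[[str, str], str],
                   source: str,
                   summary: str) -> float:
    """Average agreement between answers from the source and from the summary."""
    if not questions:
        return 0.0
    scores = [token_f1(answer(q, source), answer(q, summary)) for q in questions]
    return sum(scores) / len(scores)


# Hypothetical QA model: a lookup table keyed by (question, text).
ANSWERS = {
    ("who won the prize", "SRC"): "marie curie",
    ("who won the prize", "SUM"): "marie curie",
    ("when was it awarded", "SRC"): "in 1903",
    ("when was it awarded", "SUM"): "in 1911",
}


def toy_answer(question: str, text: str) -> str:
    return ANSWERS[(question, text)]


questions = ["who won the prize", "when was it awarded"]
print(qa_consistency(questions, toy_answer, "SRC", "SUM"))
```

Here the first answer agrees fully and the second only partially, so the summary is flagged as partly inconsistent with its source.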

Other Important Studies: It is important to note that the aforementioned works consider factual consistency as a binary outcome. In contrast, FRANK (Pagnoni et al., 2021) advocates a multi-dimensional approach that evaluates factual consistency based on semantic errors, discourse errors and content verifiability errors. Through substantial human annotation, the study further found that the effectiveness of metrics is highly dependent on the type of architecture measured and the benchmark dataset used. Similarly, human evaluation experiments from previous works have shown conflicting and varying results regarding the desired approach to developing factuality metrics (Maynez et al., 2020; Nan et al., 2021). In response, Gabriel et al. (2021) proposed five conditions for the development of an effective factuality metric to encourage better standardization in factual consistency metric research.

3. Conciseness and Semantic Coherence

Metrics to measure other aspects of summarization such as conciseness and semantic coherence have also been introduced. As they are not as crucial as relevance and factual consistency, these metrics often complement the others to make a metric or a model more holistic and practical. For example, Bommasani and Cardie (2020) consider the semantic coherence of reference summaries when evaluating single document benchmark datasets, while Ju et al. (2021)’s unsupervised model generates fluent summaries by utilizing the next sentence prediction task in BERT. Metrics for conciseness have also been introduced to measure the quality of summaries (Bommasani and Cardie, 2020; Chen et al., 2021).

4. Research Efforts on Metrics in the Long Document Domain

Many recently proposed metrics incorporate pre-trained architectures to achieve better performance. However, as argued in the model discussion above, these pre-trained architectures cannot be easily extended to long documents. As an illustration, our experiment attempted various adaptations of a BERT textual entailment model to evaluate arXiv candidate summaries (result details are in the Supplementary Materials) but found it ineffective in discriminating a summary’s factual consistency with the source. This is despite Maynez et al. (2020)’s finding that this model correlates best with human judgment of factual consistency on the XSUM short document dataset. Furthermore, in addition to the difficulty of adapting these models to long documents, Nan et al. (2021) identified the issue of resource efficiency: a competing model would take approximately 4 days to evaluate a CNN-DM test set with an NVIDIA V100 Tensor Core GPU, and would likely take significantly longer on any long document benchmark dataset. Consequently, the need to re-design the proposed evaluation models and the requirement for costly computational resources have likely discouraged the adoption of factual consistency assessment models in the long document summarization domain. Looking at the broader research on summarization evaluation metrics as a whole, among 17 research papers related to evaluation metrics published in ACL main conferences (ACL, NAACL, EACL, EMNLP, CoNLL and AACL; the papers are listed in the Supplementary Materials) from 2015 to September 2021, there was no discussion of evaluation metrics in the context of long document summarization datasets. This is important as Pagnoni et al. (2021) found that the effectiveness of proposed metrics varies with dataset characteristics.
In sum, unlike the quick adoption of short document practices in the model architectures space, research in exploring evaluation metrics within the context of long document summarization is lacking and may potentially hold back the future progression of long document summarization.

Applications

As the quality of long document summaries generated by state-of-the-art models continues to improve, past works have explored their feasibility in the research and industrial domains. A natural extension for models implemented on the scientific paper benchmarks, arXiv/PubMed, is to employ them for research purposes. These include writing section-structured (Meng et al., 2021), user-specific (He et al., 2020) or presentation-based (Sun et al., 2021) summaries of scientific papers, automating scientific reviewing (Yuan et al., 2021), and even generating literature surveys based on multiple long biomedical scientific papers (DeYoung et al., 2021). When it comes to general industrial applications of long document summarization, the knowledge and techniques learned from the research domain can address numerous commercial tasks. On the surface level, any information expressed in a textual format would benefit from advancement in this field, which encompasses summarizing any form of long textual document (Sharma et al., 2019; Loukas et al., 2021), extracting content as featured snippets for search engines (https://developers.google.com/search/docs/advanced/appearance/featured-snippets), writing reviews of long media content (Kryściński et al., 2021) and summarizing long dialogues (Liu et al., 2019; Chintagunta et al., 2021; Zhong et al., 2021; Zhu et al., 2021) and multi-modal content (Yu et al., 2021). As development becomes increasingly mature in real-world settings, summarization models are now commercialized as Software-as-a-Service (SaaS) products in the news (https://ai.baidu.com/tech/nlp_apply/news_summary), business (https://quillbot.com/) and consulting (https://www.datagrand.com/about-us/) domains.
Furthermore, as the long document summarization task can be generally understood as the identification of important aspects from long sequences, the positive spillover from successful model implementation in this domain can affect a wide range of domains. Long document summarization models, for example, can be utilized for auxiliary tasks such as video captioning (Liu and Wan, 2021), long document question-answering (Lyu et al., 2021) or multi-modal tasks (Li et al., 2020; Narasimhan et al., 2021). Liu et al. (2018) also identified the “unexpected side-effect” that their language model reliably learned how to transliterate names between languages, despite being trained to summarize long Wikipedia articles, while BigBird (Zaheer et al., 2020) applies Transformer-based models designed for long sequences not only to long document summarization but also to DNA promoter region and chromatin profile prediction tasks in the genomics research domain.

General Challenges and Future Directions

This section discusses the general challenges of long document summarization that have yet to be solved and pinpoints potential future research directions to attract practitioners’ attention and improve our understanding and techniques in the long document summarization domain. Advancement in the long document summarization domain should also give rise to beneficial spillover to closely-related NLP sub-domains such as multi-hop QA, information retrieval and reading comprehension.

While there have been significant efforts to reduce the time and memory complexity of neural architectures such as Transformers in order to enhance model efficiency, the understanding of a model’s effectiveness in solving different NLP tasks or domains is limited. As shown by our model experiment and the findings of others (Zaheer et al., 2020; Beltagy et al., 2020; Huang et al., 2021), fine-tuning pre-trained models with efficient Transformers that can attend to a larger number of input tokens can improve model performance across a wide range of NLP tasks. However, the underlying reasons for the performance improvement are not well understood. For example, while Transformer models have been found to outperform RNN models because RNNs lack the ability to reason over long sequences, Pagnoni et al. (2021) found that pre-trained Transformer summarization models still make a similar number of discourse-related errors as RNN models. Furthermore, research on the effectiveness of the various efficient attention mechanisms used by Transformers to summarize long documents has also shown varying results. On the one hand, Huang et al. (2021) showed that efficient attention with learnable patterns significantly outperforms efficient attention with fixed patterns such as the local-only attention mechanism. On the other hand, Manakul and Gales (2021) found that extending the window size of efficient Transformers to increase the number of attended tokens per token does not affect the average distance of attended neighbors, suggesting that local attention to neighboring tokens is sufficient for the long document summarization task. Altogether, these results highlight the limited understanding of the strategies employed by current neural models to summarize long documents and the need for further research on this issue.

2. Summarizer with Automatic Discourse Parsers/Annotator

Our experimental results demonstrate that the simple unsupervised graph architecture outperforms the other unsupervised models when the discourse bias of arXiv section information is included. Nonetheless, information regarding the sections of a document may not always be of high quality or available to a summarization model. This limits the implementation of many long document summarization models that require explicit section-based discourse information. A similar issue has been faced by researchers in the dialogue summarization domain, where discourse-level information is not provided, and past works in that domain have achieved state-of-the-art results by incorporating effective automatic discourse annotators (Liu et al., 2019; Feng et al., 2021; Wu et al., 2021). An architecture that can effectively incorporate automated discourse parsers or annotators would thus be a fruitful direction for long document summarization researchers to explore.

3. End-to-end Neural Summarizer with Content Selection Mechanism

In the medium term, in spite of the expected progress in computing efficiency, there exists a significant number of long documents such as business reports and books that exceed hundreds of thousands of tokens (Loukas et al., 2021; Kryściński et al., 2021). Thus, it is not possible to summarize the entire document using a powerful state-of-the-art model without any long document adaptation, as it will truncate most of the long document source text given the current input length limit. A more practical direction is to explore architectures with a content selection mechanism, which has been shown to be effective in long document summarization (Zhao et al., 2020b; Manakul and Gales, 2021). Zhao et al. (2020b) proposed an end-to-end long document summarization framework using Transformers but did not incorporate powerful pre-trained models, and it performed slightly worse than other state-of-the-art models. LoBART (Manakul and Gales, 2021), on the other hand, did not design the content selection and abstractive summarizer in an end-to-end manner. The experimental results in this survey have shown that LoBART’s disconnection between the retriever and summarizer resulted in less semantically coherent summaries. In the open-domain QA domain, RAG (Lewis et al., 2020b) achieves state-of-the-art results by successfully incorporating a content selection mechanism and pre-trained models in an end-to-end manner (Petroni et al., 2021), pointing to a promising direction for practitioners in the long document summarization domain to explore.

4. Quality and Diversity of Benchmark Dataset

In section 3 on benchmark datasets, human annotation was carried out to measure the quality of the most commonly used long document summarization dataset, arXiv. It was found that 60% of reference summaries contain some form of error and 15% of them have significant errors, where at least half of the summary contains errors. This calls for a benchmark dataset of significantly better quality, with fewer errors, built through robust heuristic rules and scraping strategies. Moreover, the long document summarization benchmark datasets are often in the legislative and scientific domains. While these domains are extremely important, many other domains such as financial reports with significant numerical complexity or long-form dialogues of daily conversation in business settings are equally important. Development across different domains could attract even greater attention from a wide range of partners and incentivize greater research efforts in the summarization field. Last but not least, to achieve the original objectives of benchmark datasets, proposed model architectures for long document summarization should also be tested across a diverse set of long document benchmark datasets rather than focusing merely on arXiv/PubMed.

5. Practicality of Summarization Metrics

The limitations of the ROUGE metric have been widely explored (Zhou et al., 2006; Ng and Abrecht, 2015; Ganesan, 2018; Kryściński et al., 2019), and significant efforts have been made to improve the way we measure candidate summaries from various aspects. Nonetheless, the proposed methods lack practicality in terms of wide availability to all parties in the research community. For example, Nan et al. (2021) found that, using a single NVIDIA V100 Tensor Core GPU, a proposed factual consistency metric requires longer than four days to evaluate a single set of candidate summaries on the CNN-DM test dataset. Many proposed metrics also require substantial computing resources to re-train across different benchmark settings (Kryscinski et al., 2020; Wang et al., 2020; Durmus et al., 2020). These issues will be exacerbated in the long document summarization domain. Moreover, most summarization metrics are only tested on the CNN-DM and XSum datasets but not others. This significantly limits their applicability, as Pagnoni et al. (2021) found most metrics to lack robustness across different benchmark settings. To ensure that effective metrics find wider application, greater attention should be paid to the efficiency and practicality of metrics, so that practitioners have sufficient incentive to explore practicality rather than focusing merely on state-of-the-art metric performance.

Conclusion

In this survey, we provide a comprehensive overview of long document summarization and systematically analyze the three key components of its research setting: benchmark datasets, summarization models and evaluation metrics. We first highlight the intrinsic differences between short and long document datasets and show that summarizing long documents requires greater compression of the source text through the identification of key narratives that are more uniformly scattered across the source documents. Nevertheless, long documents are often more extractive in nature and often have explicit discourse structures to take advantage of. For summarization models, we provide a thorough review and comparison of the model architectures and mechanisms used to generate long document summaries. Through ad-hoc experiments, we also systematically investigate the architectures and mechanisms that are widely applied across various works. We further discuss the current research in evaluation metrics and call attention to the lack of research on metrics that can be easily applied to the long document summarization domain. Finally, we explore the applications of long document summarization models and suggest five future directions for long document summarization research.

References

Supplementary materials

Table 4 summarizes the long document baseline and state-of-the-art summarization systems proposed by previous works in this domain. The "Prior" column indicates whether an inductive bias, such as discourse structure information of the original long document, is used, and the "Trunc" column reflects the total percentage of significant truncation across the five long document benchmarks.

2. Graph-based Ranking Algorithm in Experimental Section

In general form, given a set of sentences in the original source document, $D=\{s_{1},s_{2},\ldots,s_{m}\}$, with the inter-sentential similarity relations represented as $e_{ij}=(s_{i},s_{j})\in E$ where $i\neq j$, the following equation illustrates the graph-based ranking architecture in computing the scoring for each sentence:

The similarity between each pair of sentences is computed with a similarity measure such as cosine similarity, after the sentences have been encoded by a sentence encoder. The graph architecture we implemented is a basic directed graph in which the centrality score of each sentence is the sum of its bias-adjusted cosine similarities with the other sentences and/or sections. Each section node is represented by the average of the representations of the sentences within that section. In other words, the sentence with the highest sum of similarity against all the other sentences, after adjusting for bias, is ranked as the top sentence.
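The ranking idea described above can be illustrated with a minimal sketch. It assumes plain cosine similarity with no bias adjustment and no section nodes, and the function names (`cosine`, `centrality_scores`, `rank_sentences`) and toy vectors are hypothetical rather than taken from our implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centrality_scores(embeddings):
    """Score each sentence by its summed similarity to every other sentence."""
    n = len(embeddings)
    return [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def rank_sentences(embeddings):
    """Return sentence indices ordered from most to least central."""
    scores = centrality_scores(embeddings)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-d stand-ins for sentence embeddings (assumption: real inputs would be
# Tf-Idf or BERT sentence vectors).
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranking = rank_sentences(toy)
```

In this toy case the second sentence ranks first because it is similar to both of the others, which is the behaviour a centrality-based extractor relies on.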

For Tf-Idf encoding, we use the scikit-learn Tf-Idf vectorizer and fit the encoder on source documents in the arXiv test set (note: this does not include the reference summaries). The preprocessing follows Manakul and Gales (2021) to ensure consistency, and the minimum document frequency is set to 0.01. As the vector dimension of the original Tf-Idf representation would be very large, we reduce it to 768 using TruncatedSVD.

For BERT encoding, we use the SentenceTransformers package (https://www.sbert.net/) with the model "bert-base-nli-mean-tokens".

Following the implementation of Dong et al. (2021), the bias is calculated using intra-section bias (or position-level bias within each section) and inter-section bias (or section-level bias). For sentences within the same section, the cosine similarity between sentences adjusted by intra-section bias, $e_{ij}^{intraB}$, is computed by applying the lambda biases, $\lambda_{1}$ and $\lambda_{2}$, according to the sentence boundary function, $d_{b}$:

where the sentence boundary function, $d_{b}$, will determine sentences $s^{I}_{i}$ that are closer to the section $I$ boundaries to be more important. This is computed by:

$n^{I}$ is the number of sentences in section $I$ and $x_{i}^{I}$ represents sentence $i$'s position in section $I$. Cosine similarity between sentence and section adjusted by inter-section bias is calculated similarly except that the bias is computed based on the section's position in the document. Finally, the resulting adjusted centrality score for each sentence is:

where $\mu_{1}$ is a weighting factor for inter-section centrality. For hyperparameter tuning, we set $\lambda_{1}=0.5$ and $\lambda_{2}=1$. Then, using arXiv validation samples, we adjust $\alpha\in\{0,0.5,0.8,1.0,1.2\}$ to control the relative importance of the start and end of a section of a source document and $\mu_{1}\in\{0.5,1.0,1.5\}$ to control the weight of intra-section sentence importance versus inter-section sectional importance. Importantly, the original paper uses $\lambda_{1}=0$, so that less important sentences are pruned, while we set $\lambda_{1}=0.5$ to down-weight rather than prune the less important sentences. For more details, we refer the reader to the original paper (Dong et al., 2021), whose code is available at https://github.com/mirandrom/HipoRank.
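To make the role of $\alpha$, $\lambda_{1}$, $\lambda_{2}$ and $\mu_{1}$ more concrete, the sketch below shows one way the bias adjustment could be computed. It rests on several assumptions that are not confirmed by the text above: the boundary function is taken to be $d_{b} = \min(x, \alpha(n - x))$ in the spirit of Dong et al. (2021), sentence positions are 1-indexed, scores are unnormalized sums, and the final score adds the intra-section score to $\mu_{1}$ times the inter-section score. All function names are hypothetical.

```python
def boundary_distance(position, n_sentences, alpha):
    """Assumed d_b = min(x, alpha * (n - x)); smaller means nearer a boundary."""
    return min(position, alpha * (n_sentences - position))


def intra_section_centrality(sim, alpha=1.0, lambda1=0.5, lambda2=1.0):
    """Bias-adjusted centrality for the sentences of a single section.

    sim[i][j] is the cosine similarity between sentences i and j of the
    section. Positions are assumed to be 1-indexed within the section.
    """
    n = len(sim)
    d_b = [boundary_distance(i + 1, n, alpha) for i in range(n)]
    scores = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue
            # Sentence i gets the larger weight (lambda2) only when it is
            # strictly closer to a section boundary than sentence j.
            weight = lambda2 if d_b[i] < d_b[j] else lambda1
            total += weight * sim[i][j]
        scores.append(total)
    return scores


def combined_centrality(intra, inter, mu1=1.0):
    """Assumed additive combination: mu1 * inter-section + intra-section."""
    return [mu1 * e + a for a, e in zip(intra, inter)]


# With uniform similarity, any difference in score comes from position alone.
uniform = [[1.0] * 5 for _ in range(5)]
intra = intra_section_centrality(uniform, alpha=1.0)
```

Under these assumptions, setting $\lambda_{1}=0$ removes the contribution of edges towards less boundary-central sentences entirely, whereas $\lambda_{1}=0.5$ only halves it, which mirrors the pruning versus down-weighting distinction made above.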

The BERT-based variant is the same as the Tf-Idf variant except that the Tf-Idf sentence encoder is replaced by BERT. Lastly, for the implementation of Transformer-based abstractive summarization models, links to the original authors' pre-trained weights, code and implementations are provided in the footnotes of our main article.

3. BERT NSP - Assessing Semantic Coherence of Candidate Summaries

Bommasani and Cardie (2020) suggested using the general pre-trained BERT model to evaluate the semantic coherence or fluency of a summary without any fine-tuning. We find that the general pre-trained BERT model, which was not trained on academic papers, has little discriminative ability for the semantic coherence of arXiv-related summaries. Thus, we fine-tune the pre-trained BERT model using positive and negative sentence pairs. Positive pairs are extracted by taking any sentence together with its following sentence in the reference summary and source text of the arXiv dataset. Negative pairs are created by replacing the following sentence with a randomly extracted sentence from either the same or another document in the arXiv benchmark dataset. We also ensure that the numbers of positive and negative sentence pairs are balanced and that an equal number of sentences is obtained from the reference summary and the source text. The code for all the metrics used in the main paper, including BERT NSP and the intrinsic characteristics of the dataset, is publicly available at https://github.com/huankoh/long-doc-summarization.
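The pair construction described above can be sketched as follows. This is an illustrative simplification rather than the released code: `build_nsp_pairs` is a hypothetical helper, documents are assumed to be pre-split into lists of sentences, and excluding the true next sentence from the negative candidates is our assumption.

```python
import random


def build_nsp_pairs(documents, seed=0):
    """Return (first, second, label) triples.

    label 1 marks a true consecutive pair, label 0 a pair whose second
    sentence was drawn at random from the same or another document.
    """
    rng = random.Random(seed)
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for first, following in zip(doc, doc[1:]):
            # Assumption: a negative must differ from the true next sentence.
            candidates = [s for s in all_sentences if s != following]
            if not candidates:
                continue  # skip both pairs so the classes stay balanced
            pairs.append((first, following, 1))
            pairs.append((first, rng.choice(candidates), 0))
    return pairs


# Toy documents, each already split into sentences.
docs = [["a1", "a2", "a3"], ["b1", "b2"]]
pairs = build_nsp_pairs(docs)
```

Emitting one negative for every positive keeps the two classes balanced by construction; balancing sentences drawn from reference summaries against those drawn from source texts would need an additional sampling step that is not shown here.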

4. Textual Entailment as Factual Consistency Metric

Conditional on the source text, the entailment task classifies each summary sentence as entailment, neutral or contradiction. Ideally, a candidate summary should be entailed by or neutral to the source text, but should never contradict it. As a BERT textual entailment model without long document adaptation has a token limit of 512, it is not directly applicable to long document datasets whose inputs run to at least thousands of tokens. To use it in our experiment, we attempted to adapt the textual entailment BERT model of Maynez et al. (2020) by implementing a content selection mechanism that reduces the input size of the source document. The selected subset is constructed from gold label sequences obtained by greedily optimizing the ROUGE score against the ground-truth reference summaries, following the algorithm provided by Xiao and Carenini (2019). As there are different variants of the ROUGE score, we experimented with greedy selection under ROUGE-1, ROUGE-2 and ROUGE-L, over sentences in their original order and in randomized order. The aim of randomizing the order is to ensure that salient content is extracted more uniformly from the source text. To evaluate the discriminative ability of our adapted model, we use the annotated test data from section 3, in which 15% of the randomly sampled summaries are of extremely low quality and 30% have zero errors. A good BERT textual entailment model should then score the 30% high-quality samples as low contradiction and high entailment or neutrality, while scoring the 15% low-quality samples with higher contradiction and lower entailment or neutrality. As shown in Table 5, despite trying various adaptations, our models have almost no discriminative ability and were thus not used in the experimental section of the main article. This is an important limitation of our experiment in section 5 of our main article.
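The greedy content selection step can be illustrated with the simplified sketch below. It is not the algorithm of Xiao and Carenini (2019) as released: it assumes a unigram-overlap F1 as a stand-in for ROUGE-1, whitespace tokenization instead of BERT word pieces, and a hypothetical `max_tokens` budget mirroring the 512-token limit.

```python
from collections import Counter


def rouge1_f(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, used here as a simplified stand-in for ROUGE-1."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_select(sentences, reference, max_tokens=512):
    """Greedily add the sentence that most improves the score, within a budget.

    Returns the selected sentence indices in document order.
    """
    ref_tokens = reference.lower().split()
    selected, selected_tokens, best = [], [], 0.0
    while True:
        best_idx = None
        for idx, sentence in enumerate(sentences):
            if idx in selected:
                continue
            tokens = selected_tokens + sentence.lower().split()
            if len(tokens) > max_tokens:
                continue  # adding this sentence would exceed the input limit
            score = rouge1_f(tokens, ref_tokens)
            if score > best:
                best, best_idx = score, idx
        if best_idx is None:
            break  # no remaining sentence improves the score
        selected.append(best_idx)
        selected_tokens += sentences[best_idx].lower().split()
    return sorted(selected)


source = ["the cat sat", "stocks fell sharply today", "a dog barked"]
summary = "the cat sat while a dog barked"
chosen = greedy_select(source, summary)
```

Shuffling `sentences` before the call, while keeping track of the original indices, would correspond to the randomized-order variant; with ties broken by iteration order, this is what spreads the selection more evenly over the source text.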

5. Metric-related ACL main conference research papers

Table 6 details the 17 papers published from 2015 to September 2021 in ACL main conferences, including ACL, NAACL, EACL, EMNLP, CoNLL, and AACL.