Long Document Summarization with Top-down and Bottom-up Inference

Bo Pang, Erik Nijkamp, Wojciech Kryściński, Silvio Savarese, Yingbo Zhou, Caiming Xiong

Introduction

Text summarization involves compressing a document while preserving its key content and meaning. It can be done in either an extractive or an abstractive manner. While an extractive summarization model extracts salient fragments (e.g., words, sentences) from the source document to form a summary, an abstractive summarization system aims to generate a semantically coherent and linguistically fluent summary by conditioning on the document. The abstractive approach aligns better with how a human summarizes and generally performs better than extractive models in recent works (Pilault et al., 2020; Zhang et al., 2020). We thus focus on abstractive summarization.

The dominant approach for abstractive summarization is to use a Seq2Seq model (Sutskever et al., 2014) with an encoder-decoder architecture instantiated with either RNNs (Hochreiter & Schmidhuber, 1997) or, more recently, transformers (Vaswani et al., 2017). In such a model, an encoder infers the latent representations of observed tokens (words or subwords) in the document, conditioning on which a decoder generates a summary. This paper studies the problem of how to infer good latent representations, which in turn would improve summarization. We propose a framework that (1) assumes a multi-scale latent structure of a document and (2) synergizes bottom-up inference with top-down inference. In a multi-scale structure, high-level variables (like those representing sentences or segments) model the document at a coarser time-scale and abstract away details, and are suitable for capturing long-range dependency in the document; in contrast, low-level variables (like those representing tokens) preserve details and prevent the summary from losing key information. In our framework, the summary is generated by conditioning on token representations (low-level variables), similar to recent abstractive summarization models (Zhang et al., 2020; Zaheer et al., 2020; Beltagy et al., 2020). There is, however, a critical difference. In our framework, token representations are first inferred bottom-up and then updated top-down with high-level representations, hence rendering low-level representations aware of long-range information. We hypothesize that the proposed inference approach would improve summarization.

Multi-level models have been widely studied for modeling images (Sønderby et al., 2016), speech (Mehri et al., 2016), and language (Chung et al., 2016). Prior summarization works (Cheng & Lapata, 2016; Nallapati et al., 2016; Zhang et al., 2019; Xu et al., 2020) have also explored hierarchical models. However, they mostly focus on extractive summarization and follow a bottom-up inference approach: they pool information in words or sub-words to form sentence representations, based on which a classifier makes an extraction decision.

In comparison, our framework combines bottom-up and top-down inference. This draws direct inspiration from a line of work that examines variational inference for hierarchical top-down generative models (Sønderby et al., 2016; Maaløe et al., 2019; Child, 2020). In these models, in the bottom-up path, the distribution parameters of higher-level stochastic variables are computed as a function of lower-level stochastic variables, while in the top-down path, the distribution parameters of lower-level variables are corrected as a function of higher-level variables. Although we do not assume stochasticity of the document latent representations, our encoder, or inference model, follows the same idea to infer token representations.

The proposed framework is agnostic to model architecture. Due to the dominance of transformer models in NLP (Chen et al., 2018; Zhang et al., 2020; Sun et al., 2019; Martin et al., 2020) and to leverage pre-trained language models (Liu et al., 2019; Lewis et al., 2020), we instantiate our framework with a transformer-based model. Applying transformers to long documents faces a bottleneck, because their computational and memory cost has a quadratic dependency on the sequence length. This issue is especially critical for summarization, since we are more interested in summarizing long documents; short ones can be quickly read through by humans. To address this issue, a large body of prior work has been devoted to developing efficient transformers with sub-quadratic complexity. These works approach the problem with kernel-based methods (Katharopoulos et al., 2020; Choromanski et al., 2020), by low-rank approximation of the attention matrix (Wang et al., 2020), by synthesizing the attention weights (Tay et al., 2021), or by designing content-independent (Child et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020) or content-dependent sparse attention mechanisms (Kitaev et al., 2020; Roy et al., 2021; Wang et al., 2021).

Our framework provides a natural way to diminish this quadratic complexity issue. In the bottom-up inference, we use local self-attention where each token only attends to tokens within a local fixed-length window; thus the per-token cost does not grow with the input sequence length, and the overall complexity is linear in it. The top-down correction of the token representations enables them to capture long-range context, reducing the limitation of local attention. Furthermore, in contrast to most prior efficient transformers, which are incompatible with pre-trained language models, our framework can flexibly leverage any pre-trained encoder-decoder model such as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020).

We call the transformer-based model that follows the proposed framework the top-down transformer, to emphasize the importance of the top-down inference. We evaluate the top-down transformer on a set of distinct summarization benchmarks. These benchmarks cover documents from a variety of domains, including news articles and scientific, conversational, and narrative documents, and of various lengths, ranging from hundreds of words (e.g., a news article), through several thousand to over ten thousand words (e.g., a scientific paper, a book chapter), to over a hundred thousand words (e.g., an entire book). On short documents, models following our framework achieve on-par or better summarization performance than models with full self-attention, and are more compute- and memory-efficient. Across all long document datasets, our models achieve state-of-the-art performance. Finally, we show that our model is able to summarize a whole book. Compared to a concurrent work (Wu et al., 2021) that uses GPT-3 and requires humans to extensively label data, our model achieves competitive performance with 380 times fewer parameters and a small amount of publicly available data. The diverse and strong empirical results support the effectiveness and wide applicability of the proposed model.

Methods

Figure 1 gives a graphical overview of the top-down transformer, instantiating the proposed framework. We introduce its details in this section. Suppose a document has $N$ tokens, $\bm{t}=\{t_{i}\}_{i=1}^{N}$ . In our method, token representations are inferred by combining top-down and bottom-up inference. This leads to effective and efficient inference for token representations. They are then attended by a decoder to generate a summary, as in a regular encoder-decoder transformer.

2 Top-Down Inference

The efficiency of local self-attention in the bottom-up inference nevertheless comes with a limitation: each $e_{i}$ only captures the context within a local window instead of that of the whole document. To mitigate this issue, we propose a top-down inference for token representations.

Consider a two-level multi-scale latent structure for a document. The low level consists of token representations, $\{e_{i}\}_{i=1}^{N}$ , computed by the bottom-up inference. The top level consists of units at a coarser level. It is affordable to apply full self-attention at the top level due to its coarser granularity, allowing these top-level units to capture global document context. In our work, the self-attention mechanism for the top-level representations is simply the original multi-head self-attention proposed in Vaswani et al. (2017). Readers are referred to Vaswani et al. (2017) for details.

To instantiate the top-down inference, we need to make two choices: (1) the number of top levels above the token level and (2) the unit representation for each top level. We choose to use one top level since it is sufficiently coarse to apply full self-attention for the wide range of long document benchmarks we experimented on. Natural choices for top-level units are sentences, paragraphs, and chapters, depending on the number of top levels considered. Such choices, however, might lead to complicated implementations and poor scalability due to the varying lengths of these units. We hence choose a simpler approach, where the top level consists of fixed-length segments of the document. While we use a single top level, multiple top levels can be obtained simply by using segments of increasingly coarse granularity.

In the top-down inference, segment-level self-attention has a complexity of $O(M^{2})$ , and token-segment cross-attention has a complexity of $O(NM)$ . Thus, together with bottom-up inference, the complexity is $O(Nw+M^{2}+NM)$ . In practice, we use relatively small $w$ (window size) and $M$ (number of segments).
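To make the complexity expression concrete, the following minimal sketch counts attention-score entries per layer for the top-down scheme and for full self-attention. The function names are hypothetical. The example values ($N=8192$, $w=1024$, kernel size 32, stride 24) are taken from the experiment details reported in the appendix, and the formula for the number of segments assumes windows without padding, which is an assumption rather than a detail stated in the text.

```python
def num_segments(n_tokens: int, kernel: int, stride: int) -> int:
    """Number of pooling windows that fit entirely in the sequence (assumes no padding)."""
    if n_tokens < kernel:
        return 0
    return (n_tokens - kernel) // stride + 1


def attention_cost(n_tokens: int, window: int, n_segments: int) -> dict:
    """Count attention-score entries, ignoring constants, heads, and hidden size."""
    bottom_up = n_tokens * window        # local self-attention: O(Nw)
    segment_self = n_segments ** 2       # full self-attention over segments: O(M^2)
    cross = n_tokens * n_segments        # token-segment cross-attention: O(NM)
    return {
        "top_down_total": bottom_up + segment_self + cross,
        "full_attention": n_tokens ** 2,  # O(N^2) baseline
    }


N, w = 8192, 1024
M = num_segments(N, kernel=32, stride=24)
cost = attention_cost(N, w, M)
print(M, cost, round(cost["full_attention"] / cost["top_down_total"], 2))
```

Under these assumptions the top-down count is roughly six times smaller than the full-attention count, and the gap widens as $N$ grows because only the $NM$ term scales with both quantities.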

3 Pooling Methods

As mentioned above, we use a single top level, consisting of fixed-length segments, in the current work. The segment representations are initialized by pooling token representations. Following the notation above, suppose a document is divided into $M$ segments. The embedding of the $j$th segment is initialized as

$$s_{j}^{(0)}=\sum_{n=1}^{k}p_{n}\,e_{(j-1)d+n},$$

where $k$ is the kernel size, $d$ is the stride, and $p_{n}$ is the weight for the $n$th token in the pooling window. We introduce two approaches to compute the weights. The first is average pooling (AvgPool), with $p_{n}=\frac{1}{k}$, which is simple and convenient. In the second approach, we leverage the reference summary to define the importance of each token and assign adaptive weights (AdaPool). In particular, we learn an importance tagger with labels constructed from the reference summaries, which involves three steps:

1. Construct training labels for the importance tagger: (a) lemmatize the document and reference words; (b) label a document word as important if it appears in the reference word list and is a non-stopword.

2. Train a top-down transformer encoder with the constructed labels as the importance tagger.

3. Train the summarization model with oracle weights (i.e., the constructed labels from Step 1) and test it with the adaptive importance weights assigned by the learned tagger.
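The label-construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not say which lemmatizer or stopword list is used, so `toy_lemmatize` and `STOPWORDS` are hypothetical stand-ins that a real pipeline would replace with a proper lemmatizer and a full stopword list.

```python
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny stopword list; the source does not say which list is used.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "is", "are", "to"})


def toy_lemmatize(word: str) -> str:
    """Stand-in lemmatizer (assumption): lowercase and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def importance_labels(
    document: Iterable[str],
    reference: Iterable[str],
    lemmatize: Callable[[str], str] = toy_lemmatize,
) -> List[int]:
    """Label a document word 1 if its lemma is a non-stopword found in the reference."""
    reference_lemmas = {lemmatize(w) for w in reference}
    labels = []
    for word in document:
        lemma = lemmatize(word)
        labels.append(int(lemma in reference_lemmas and lemma not in STOPWORDS))
    return labels


doc = "The models summarize long documents in the news domain".split()
ref = "A model summarizes the document".split()
print(importance_labels(doc, ref))
```

The resulting 0/1 labels play the role of the oracle weights; a learned tagger would predict them at test time, when no reference summary is available.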

In our experiments, we also use OracleAdaPool, where the weights are obtained from Step 1 with the reference summaries. Note that if $\{p_{n}\}_{n=1}^{k}$ does not form a valid probability distribution, $s_{j}^{(0)}$ can be computed with a weight distribution normalized within each pooling window.

$\{s_{j}^{(0)}\}_{j=1}^{M}$ are updated with self-attention, yielding $\{s_{j}\}_{j=1}^{M}$ , which are then used in top-down inference for token representations, as discussed in Section 2.2.
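The segment initialization described in this section can be sketched in NumPy as below. The function name `pool_segments` is hypothetical. Two details are assumptions, since the text does not specify them: windows are taken without padding, and non-uniform weights are normalized by dividing by their sum within each window (with a uniform fallback when a window has no positive weight).

```python
import numpy as np


def pool_segments(tokens, weights=None, kernel=32, stride=24):
    """Pool token vectors (N, D) into segment vectors (M, D) with a strided window.

    weights=None gives average pooling (p_n = 1/k). Otherwise `weights` holds one
    non-negative importance score per token, renormalized inside each window.
    """
    tokens = np.asarray(tokens, dtype=float)
    n_tokens = tokens.shape[0]
    if n_tokens < kernel:
        raise ValueError("sequence is shorter than one pooling window")
    segments = []
    # Windows without padding (assumption): start at 0, stride, 2 * stride, ...
    for start in range(0, n_tokens - kernel + 1, stride):
        window = tokens[start:start + kernel]
        if weights is None:
            p = np.full(kernel, 1.0 / kernel)
        else:
            p = np.asarray(weights[start:start + kernel], dtype=float)
            total = p.sum()
            # Fall back to uniform weights when no token in the window is tagged (assumption).
            p = p / total if total > 0 else np.full(kernel, 1.0 / kernel)
        segments.append(p @ window)
    return np.stack(segments)


rng = np.random.default_rng(0)
e = rng.normal(size=(128, 16))              # 128 token vectors of width 16
importance = rng.integers(0, 2, size=128)   # hypothetical 0/1 importance tags
print(pool_segments(e).shape, pool_segments(e, importance).shape)
```

The pooled vectors correspond to the initial segment representations; in the model they would then pass through the segment-level self-attention layers.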

Experiments

We thoroughly evaluate the proposed framework on distinct summarization datasets. See Table 1 for a summary of the datasets used in the current work. Our model is first evaluated on two standard long document summarization benchmarks, PubMed and arXiv (Cohan et al., 2018). It outperforms various efficient transformers and other approaches and achieves state-of-the-art performance. Although we focus on long document summarization, models under our framework are also applicable to shorter documents. We test our model on CNN-DailyMail (See et al., 2017), the most widely used short document summarization dataset. Compared to a full self-attention model, our model achieves competitive or better performance but is more memory- and compute-efficient. Recently, a more challenging benchmark, SummScreen (Chen et al., 2021), was proposed, where summarization systems need to summarize TV show scripts. These documents often convey plot events indirectly and implicitly in dialogues, in contrast to news and scientific articles, where statements follow a logical order and facts are offered explicitly. Moreover, a typical episode contains multiple subplots that proceed in parallel. Solving this benchmark thus requires a system to draw information from utterances spread throughout the entire input and integrate them into a concise description. Our model outperforms strong baselines on this challenging benchmark by a significant margin. Another challenging dataset, BookSum (Kryściński et al., 2021), was also recently released. It covers books from the literature domain, including stories, plays, and novels. Similar to SummScreen, it requires integrating plot events from indirectly expressed descriptions. A further challenge is to process long-form texts of up to hundreds of pages or over 100,000 words. A model under our framework does well on this challenge, achieving competitive or superior performance compared to a concurrent work (Wu et al., 2021) using GPT-3. While the GPT-3-based model has 175 billion parameters and requires human labelers to extensively write summaries and provide reward information, our model with 464 million parameters is 380 times smaller and merely requires training on a relatively small amount of data. These results suggest that our framework is generally effective for documents of various lengths and domains.

We use the same encoder-decoder architecture for all datasets. The encoder has 8 bottom-up inference layers and 4 top-down inference layers for tokens, and 2 self-attention layers for segments. The decoder has 12 layers. The encoder layers for tokens (12 layers) and the decoder layers are all initialized from BART (Lewis et al., 2020), except the parameters for token-segment cross-attention in the top-down inference layers, which are randomly initialized. The self-attention parameters for segments are also randomly initialized. The window size is $1024$ unless otherwise specified. Our settings closely follow Longformer (Beltagy et al., 2020), which has 12 layers each for the encoder and decoder, is initialized from BART, and uses a local window size of $1024$. Thus, a comparison with Longformer tests the effect of the top-down correction of token representations. PubMed, arXiv, and CNN-DailyMail are obtained from Huggingface Datasets (https://huggingface.co/datasets). SummScreen and BookSum are provided by the authors. Standard train/validation/test splits, provided by either Huggingface or the dataset authors, are used for all datasets. Model performance is evaluated with ROUGE scores (Lin, 2004). Reported performance is based on the checkpoint with the best validation R-2 score. Summary samples generated by our models for each dataset are provided in the appendix.
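The evaluation protocol above (ROUGE scoring and checkpoint selection by validation R-2) can be illustrated with a simplified sketch. This is not the official ROUGE implementation: it uses lowercase whitespace tokenization and no stemming, so its scores would differ from those of a standard ROUGE package. The checkpoint names and scores in the usage lines are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 2) -> float:
    """Simplified ROUGE-N F1: lowercase whitespace tokens, no stemming."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())   # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_checkpoint(validation_scores: dict) -> str:
    """Pick the checkpoint name with the highest validation ROUGE-2."""
    return max(validation_scores, key=validation_scores.get)


score = rouge_n_f1("the cat sat on the mat", "the cat lay on the mat")
chosen = best_checkpoint({"step_1000": 0.18, "step_2000": 0.21})  # hypothetical scores
print(score, chosen)
```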

We first test the effectiveness of our framework on two widely used datasets based on scientific documents, PubMed and arXiv. They consist of long documents whose lengths range from several thousand words to over ten thousand words. Each document in PubMed is a scientific article, collected from PubMed.com, and the reference summary is the associated abstract. Documents in arXiv are collected from arxiv.org. Three variants of our model with different pooling weights are presented. AvgPool, AdaPool, and OracleAdaPool in Table 2 indicate average pooling, pooling with adaptive weights, and pooling with adaptive weights determined by references, respectively.

The experiment results are displayed in Table 2. Pegasus (Zhang et al., 2020) is pretrained on a large-scale dataset with a pretraining objective specifically designed for summarization. It uses a full self-attention encoder and thus has to truncate the source document due to the quadratic memory complexity. The summarization-oriented large-scale pre-training makes it a strong baseline. Dancer (Gidiotis & Tsoumakas, 2020) takes a divide-and-conquer approach in which the summary is divided into sections and each section is paired with the appropriate section of the document. The model is thus trained on short sequences and has a low memory requirement. This is a straightforward approach that achieves strong performance.

TLM-I+E (Pilault et al., 2020) first extracts salient sentences and then uses a GPT-style model to generate a summary by conditioning on the introduction section and the extracted sentences (instead of the whole document), thus reducing the memory requirement. SSN-DM (Cui & Hu, 2021) is an extractive model that uses a sliding encoder to process segments of a document and a memory module to capture autoregressive dependency between segments. These two models bear similarities to our model in that they use a multi-scale structure. The salient sentences extracted in TLM-I+E can be considered a representation of the document at a coarser granularity, since only salient information is retained. Instead of keeping the coarser representations in the latent space, TLM-I+E reads them out to the observed word space. In SSN-DM, the fixed-size memory module that pools information from each segment can also be considered a high-level representation of the document. Despite these similarities, our model, which follows a principled framework to synergize bottom-up and top-down inference, clearly outperforms these prior models.

BigBird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020), and LSH (Kitaev et al., 2020; Huang et al., 2021) are efficient transformers. BigBird, based on Pegasus pre-training, combines local attention, random attention tokens, and global attention tokens. LSH uses content-dependent sparse attention based on locality-sensitive hashing. Longformer is closely related to our models. It uses the same local attention as our bottom-up inference, except that it has an extra [CLS] token that serves as a global attention token. Longformer is also initialized from BART, as are our models. The only difference is that our models infer token representations with both top-down and bottom-up inference, in contrast to the pure bottom-up inference in Longformer. The clear performance improvement over Longformer and other efficient transformers indicates the effectiveness of the synergy of bottom-up and top-down inference.

2 Short Documents

To demonstrate the general applicability of the proposed framework, we show its efficiency and effectiveness on short document summarization and compare it to a full self-attention inference model. We hypothesize that although the bottom-up inference uses local self-attention (for efficiency), the top-down correction keeps the inference effective and hence leads to competitive or better summarization performance.

Our model parameters are initialized from BART. Hence, BART with full self-attention forms a natural baseline, allowing for direct comparison. In the bottom-up inference, the local attention window size is $256$. As shown in Table 3, models under our framework achieve slightly better performance than BART, especially in terms of R-1 and R-L. This confirms our hypothesis that a synergy of bottom-up inference with local attention and top-down inference with global attention is effective and achieves performance on par with or better than full self-attention.

3 SummScreen

Scientific and news articles often require that facts be offered explicitly and statements follow a logical order, which might allow summarization models to exploit layout and stylistic biases. We next test the proposed framework on a more challenging dataset, SummScreen, which requires a model to draw and integrate information from indirect expressions across a wide range of the document. SummScreen (Chen et al., 2021) provides two datasets, TVMegaSite and ForeverDreaming, collected from two different TV show transcript websites. Each document is the transcript of a TV show episode and the summary is an associated recap.

Table 4 summarizes the results. The extractive oracle is an extractive method that extracts nearest neighbors based on ROUGE scores. Longformer is an abstractive method that takes the whole document as input. Hybrid models first select salient sentences and then input them to BART. Our models outperform these strong baselines and even achieve performance comparable or superior to that of models with access to oracle information.

4 BookSum

BookSum (Kryściński et al., 2021) is another challenging dataset, consisting of books from the literature domain including stories, plays, and novels. It includes examples at three levels of granularity with increasing difficulty: (1) paragraph level, with inputs of hundreds of words, (2) chapter level, with inputs of several thousand to over ten thousand words, and (3) book level, with inputs spanning up to hundreds of pages and over a hundred thousand words. The chapter-level examples have lengths comparable to those of other popular long-form summarization datasets such as PubMed and arXiv. We first test our models on the chapter level. Book-level summarization is extremely challenging. First, the number of examples (313 books) is limited. Second, a book is too long to fit in current models. We train our model in a curriculum and recursive way to address these two issues.

Table 5 displays the results. Kryściński et al. (2021) take a divide-and-conquer approach to summarize chapters. They finetune BART, T5, and Pegasus on the paragraph-level data, and the chapter summary is obtained by concatenating the paragraph summaries. This might miss the context across paragraphs. Our models directly summarize whole chapters and outperform these divide-and-conquer models. The efficient transformers Longformer and BigBird are also able to take in whole chapters as inputs, but these bottom-up approaches clearly underperform our models.

4.2 Book Level

We first train a top-down transformer on the chapter-level data and then fine-tune it on the book-level data. The inputs to the book-level model are (1) the concatenated chapter reference summaries in training or (2) the concatenated chapter summaries generated by the chapter-level model in testing. The chapter-to-book curriculum training is to mitigate the scarcity of book-level data. The recursive summarization of chapters and then books can be considered abstractive content selection applied to book data, and is used to address the extremely long length of books.
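The recursive procedure described above can be sketched as follows. The two models are passed in as plain callables; `summarize_book`, `build_book_training_input`, and the stub models are hypothetical names used only for illustration, and the choice of a single space as the separator between chapter summaries is an assumption.

```python
from typing import Callable, List


def summarize_book(
    chapters: List[str],
    chapter_model: Callable[[str], str],
    book_model: Callable[[str], str],
    separator: str = " ",
) -> str:
    """Test-time inference: summarize each chapter, concatenate, summarize again."""
    chapter_summaries = [chapter_model(chapter) for chapter in chapters]
    book_input = separator.join(chapter_summaries)
    return book_model(book_input)


def build_book_training_input(chapter_reference_summaries: List[str], separator: str = " ") -> str:
    """Training-time input: concatenated reference chapter summaries."""
    return separator.join(chapter_reference_summaries)


def first_sentence(text: str) -> str:
    """Stub chapter model (illustration only): keep the first sentence."""
    return text.split(".")[0].strip() + "."


def first_words(text: str, limit: int = 12) -> str:
    """Stub book model (illustration only): keep the first `limit` words."""
    return " ".join(text.split()[:limit])


chapters = ["Ann leaves home. She travels far.", "Ann finds a map. It is old."]
print(summarize_book(chapters, first_sentence, first_words))
```

In practice both callables would wrap trained summarization models, with the book-level model obtained by fine-tuning the chapter-level one.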

Table 6 summarizes the book-level results. The middle section shows the performance of the models with the divide-and-conquer approach (Kryściński et al., 2021), the same as those used for the chapter-level data. A concurrent work (Wu et al., 2021) based on GPT-3 with reinforcement learning (RL) also attempts to summarize books. Their method is similar to ours in that they decompose books into shorter sequences, and train the model and summarize the text segments recursively. There are four major differences between our approach and theirs. First, our model has only 464 million parameters and is 380 times smaller than GPT-3 with 175 billion parameters. Second, we train our model with the limited and publicly available data from BookSum, while Wu et al. (2021) require human labelers to write summaries and give preferences, which is highly costly. Third, our model has lower complexity, allowing it to take in longer inputs. Thus, we only need to decompose the book once (into chapters), in contrast to multiple recursive decomposition steps. Multiple recursive summarization steps are prone to accumulating errors. Fourth, GPT-3 uses bottom-up inference to infer token representations, in contrast to the synergy of bottom-up and top-down inference in our approach, which we believe leads to better representation inference. The last two differences might account for our competitive performance using a much smaller model and less data.

Related Work

Prior works have proposed extractive models (Nallapati et al., 2017; Cui & Hu, 2021), abstractive models (Nallapati et al., 2016; Zhang et al., 2020), and hybrid models combining extractive and abstractive methods (Gehrmann et al., 2018; Pilault et al., 2020) for text summarization. Although our model mostly follows the abstractive approach, it also has connections to the hybrid models. These models usually first extract salient sentences from the source document and then summarize the extracted sentences with an abstractive model. Extracted sentences can be viewed as a high-level representation of the document, although it lies in the observed space rather than in the latent space as in our framework. A continuous representation in the latent space facilitates end-to-end learning. Moreover, assigning importance weights with the importance tagger in our method resembles an extractive step in a hybrid model, and thus the top-down transformer with a learned importance tagger can be considered a hybrid model.

Our work draws inspiration from latent variable inference for hierarchical top-down generative models. Faithfully inferring multi-layer latent variables requires accounting for the dependency between them. MCMC approaches (Nijkamp et al., 2020) naturally account for such dependency. Amortized inference (Sønderby et al., 2016; Maaløe et al., 2019; Child, 2020) uses a special design to capture the multi-layer dependency. In particular, in a bottom-up path, the parameters of the distribution of higher-level variables are computed as a function of the lower-level variables; in a top-down path, the parameters of the distribution of lower-level variables are corrected as a function of the higher-level variables.

Despite the effectiveness of transformers on a variety of tasks, their quadratic complexity with respect to the sequence length has limited their application to problems with long sequences. A large body of work has attempted to address this limitation. A major line of work focuses on designing various sparse attention mechanisms. These works can be roughly categorized into two groups, depending on whether the sparsity pattern is content-dependent (Kitaev et al., 2020; Roy et al., 2021; Wang et al., 2021) or content-independent (Child et al., 2019; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020). Our work is most closely related to content-independent sparse attention. A main assumption of content-independent sparse attention is that the context temporally and/or spatially proximate to the query token is more important, which is intuitively sensible and supported by empirical attention analysis (Child et al., 2019). Thus, a common and basic sparse attention pattern is local attention, where each query token only attends to a neighborhood within a fixed temporal and/or spatial window. While this reduces the complexity to linear, a model with only local attention cannot model long-range dependency. Prior works combine local attention with other attention patterns that have a wider or global receptive field, such as dilated attention, random attention tokens, and global attention tokens (Beltagy et al., 2020; Zaheer et al., 2020). Our models also use local attention for its efficiency and leverage top-down inference to enable global-context awareness.

Conclusion

In this work, we propose a principled inference framework to improve latent representation inference for summarization models. It assumes a hierarchical latent structure of a document where the top level captures long-range dependency at a coarser granularity and the bottom token level preserves the details. We leverage this hierarchical structure and synergize bottom-up inference with top-down inference to improve token representation inference. In the bottom-up pass, token representations are inferred with local self-attention to exploit its efficiency. Top-down correction is then applied to allow tokens to capture long-range dependency. We demonstrate the effectiveness of the proposed framework on a wide range of summarization datasets, including narrative, conversational, and scientific documents and news. Our model achieves (1) comparable or superior performance on short documents with higher memory and compute efficiency, compared to full attention transformers, (2) state-of-the-art performance on a wide range of long document summarization benchmarks, compared to recent efficient transformers, and (3) competitive performance on summarizing whole books using $0.27\%$ of the parameters and much less training data, compared to a recent GPT-3-based model. These results indicate the general applicability and benefits of the proposed framework.

References

Appendix A Qualitative Examples

Appendix B Additional Experiment Details

Due to the space limit, additional experiment details are reported here. As we discussed in the main text, all of our models are initialized from BART Large with 12 layers (Lewis et al., 2020). In the bottom-up inference module (the left panel in Figure 1), the local self-attention in our models has 8 layers and all parameters are initialized from the first 8 layers of BART (including parameters for layer normalization). We use 4 layers for top-down inference (the middle panel in Figure 1). Each layer consists of (1) token local self-attention, (2) token-segment cross-attention, and (3) feedforward. (1) and (3) are initialized from the last 4 layers of BART (including parameters for layer normalization). All other parameters are randomly initialized. The segment-pooling has a kernel size of 32 and a stride size of 24. The maximum document lengths for PubMed, arXiv, CNN-DM, TVMegaSite, ForeverDreaming, BookSum are 8192, 16384, 1024, 12288, 12288, 12288, respectively.

The local self-attention used in our work is widely used in prior works on sparse attention (Beltagy et al., 2020; Zaheer et al., 2020). It is illustrated in Figure 2. It shows local self-attention of 9 tokens with window size 4. Each token attends 2 tokens on the left and 2 tokens on the right, as long as there are sufficient right and left neighbors. The attended nearby tokens are in light green. Each token also attends itself, as indicated by dark green. White color in Figure 2 indicates absence of attention.
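The attention pattern described above can be reproduced with a small NumPy sketch. The function name `local_attention_mask` is hypothetical, and materializing a dense $N \times N$ mask is done here only for illustration; an efficient implementation would avoid building the full matrix.

```python
import numpy as np


def local_attention_mask(n_tokens: int, window: int) -> np.ndarray:
    """Boolean (N, N) mask; entry [i, j] is True when token i may attend token j.

    Each token sees itself plus window // 2 neighbors on each side, clipped at
    the sequence boundaries.
    """
    half = window // 2
    positions = np.arange(n_tokens)
    return np.abs(positions[:, None] - positions[None, :]) <= half


mask = local_attention_mask(n_tokens=9, window=4)
print(mask.astype(int))   # banded matrix: ones inside the window, zeros elsewhere
print(mask.sum(axis=1))   # number of attended positions per token
```

Tokens near the two ends of the sequence attend fewer positions because they lack neighbors on one side, matching the "as long as there are sufficient right and left neighbors" condition.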

It is also called sliding window attention (Beltagy et al., 2020). We call it local self-attention to make a direct contrast with the full self-attention used for the segment-level representations. Despite its efficiency, it misses information outside of the local attention window. Thus, it is often used together with other attention mechanisms. In Longformer, sliding window attention is combined with dilated sliding window attention and global token attention. In BigBird, it is combined with random attention and global token attention. In our work, we use segment-level tokens to collect long range information which is then used to enrich token-level representations through token-segment cross attention (see top-down inference in Figure 1).
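The token-segment cross-attention mentioned above can be sketched as follows. This is a simplified single-head version under stated assumptions: the projection matrices are random rather than learned, a plain residual connection is used, and layer normalization, multiple heads, and the feedforward sublayer are omitted. The function name is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def token_segment_cross_attention(tokens, segments, w_q, w_k, w_v):
    """Single-head cross-attention: tokens (N, D) query segments (M, D).

    Returns the updated tokens (N, D), formed with a residual connection,
    and the (N, M) attention weights.
    """
    q = tokens @ w_q        # queries come from token representations
    k = segments @ w_k      # keys and values come from segment representations
    v = segments @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return tokens + attn @ v, attn


rng = np.random.default_rng(0)
N, M, D = 64, 4, 8
e = rng.normal(size=(N, D))   # bottom-up token representations
s = rng.normal(size=(M, D))   # segment representations after self-attention
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
updated, attn = token_segment_cross_attention(e, s, w_q, w_k, w_v)
print(updated.shape, attn.shape)
```

Because every token attends to all $M$ segments, each updated token representation can draw on information from the whole document even though its self-attention is restricted to a local window.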

Appendix C Ablation Studies

We present results for a series of ablation studies in this section. The experiments are performed on PubMed. The results are summarized in Table 14. The first row shows the performance of the top-down transformer with top-down update via cross-attention and window size 1024, which is our final model (see Figure 1 for an illustration).

The second row shows the performance for a variant of top-down update. In this variant, to update the bottom-up inferred token representations, we concatenate the token representations with the corresponding top-level segment representations, in contrast to the cross-attention approach used in the final model. We can see a clear performance degradation, indicating the importance of the cross-attention-based top-down update.

The third row displays the results without top-down update, and the decoder attends the bottom-up-inferred token representations to generate summaries. Compared to our final model, the performance is also degraded, suggesting the effectiveness of the top-down update.

The bottom panel of Table 14 presents ablation results on the window size of local self-attention (see Figure 2 for an illustration). These results are also plotted in Figure 3. They show a clear effect of window size: as the window size increases, the performance on all metrics improves. The effect is quite large when the window size is increased from 32 to 256. It becomes smaller beyond 256, but the model can still benefit from a larger window size.