Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization

Jiaao Chen, Diyi Yang

Introduction

We live in an information age in which communication between humans, and between humans and machines, is increasing exponentially in the form of textual dialogues among users and between users and agents Kester (2004). Reviewing all of the content before joining a conversation is challenging and time-consuming, especially when the chat history becomes very long Gao et al. (2020). How to process and organize these interactions into concise, structured data, i.e., conversation summarization, has therefore become technically and socially important.

Most existing research on text summarization has focused on single-speaker documents such as news reports Nallapati et al. (2016); See et al. (2017), scientific publications Nikolov et al. (2018), or encyclopedia articles Liu et al. (2018), where structured text is usually used to elaborate a core idea from a third-person point of view and the information flow is made clear through paragraphs or sections. Unlike these structured documents, conversations are often informal, verbose, and repetitive, sprinkled with false starts, back-channeling, reconfirmations, hesitations, and speaker interruptions Sacks et al. (1978), and the salient information is scattered across the whole chat, making it hard for current summarization models to focus on the many informative utterances. In the conversation in Table 1, for example, turns, informal words, abbreviations, and emoticons all introduce new challenges for summarization. This calls for the design and development of new methods for dialogue summarization rather than the direct application of current document summarization models.

There has been some recent research on conversation summarization, such as directly deploying existing document summarization models Gliwa et al. (2019) and exploring multi-sentence compression Shang et al. (2018). However, most of this work has not utilized specific conversational structures, a key factor that differentiates dialogues from structured documents. As a way of using language socially, of “doing things with words” together with other persons, a conversation has its own dynamic structures that organize utterances in certain orders to make the conversation meaningful, enjoyable, and understandable Sacks et al. (1978). Although there are a few exceptions, such as utilizing topic segmentation Liu et al. (2019b); Li et al. (2019), dialogue acts Goo and Chen (2018), or key point sequences Liu et al. (2019a), they either need extensive expert annotation of discourse acts Goo and Chen (2018); Liu et al. (2019a) or only encode conversations based on their topics Liu et al. (2019b), and thus fail to capture the rich conversational structures in dialogues.

Even a single conversation can be viewed from different perspectives, resulting in multiple conversational or discourse patterns. For instance, in Table 1, based on what topics were discussed (topic view) Galley et al. (2003); Liu et al. (2019b); Li et al. (2019), the conversation can be segmented into greetings, today’s plan, plan for tomorrow, plan for Saturday, and pick-up time; from a conversation progression perspective (stage view) Ritter et al. (2010); Paul (2012); Althoff et al. (2016), the same dialogue can be categorized into openings, intention, discussion, and conclusion. From a coarse perspective (global view), a conversation can be treated as a whole, or each utterance can serve as one segment (discrete view). Models that only utilize a fixed topic view of the conversation Joty et al. (2010); Liu et al. (2019b) may fail to capture its comprehensive and nuanced conversational structures, and any information loss introduced by the conversation encoder may cascade into larger errors in the decoding stage. To fill these gaps, we propose to combine these multiple, diverse views of conversations in order to generate more precise summaries.

To sum up, our contributions are: (1) We propose to utilize rich conversational structures, i.e., structured views (topic view and stage view) and generic views (global view and discrete view), for abstractive conversation summarization. (2) We design a multi-view sequence-to-sequence model that consists of a conversation encoder to encode different views and a multi-view decoder with multi-view attention to generate dialogue summaries. (3) We perform experiments on a large-scale conversation summarization dataset, SAMSum Gliwa et al. (2019), and demonstrate the effectiveness of our proposed methods. (4) We conduct thorough error analyses and discuss specific challenges that current approaches face in this task.

Related Work

Document summarization has received extensive research attention, especially abstractive summarization. For instance, Rush et al. (2015) introduced sequence-to-sequence models for abstractive text summarization. See et al. (2017) proposed a pointer-generator network that allows copying words from the source text to handle the out-of-vocabulary (OOV) issue and to avoid generating repeated content. Paulus et al. (2018); Chen and Bansal (2018) further utilized reinforcement learning to select the content needed for summarization. Large-scale pre-trained language models Liu and Lapata (2019); Raffel et al. (2019); Lewis et al. (2019) have also been introduced to further improve summarization performance. Another line of work explored long-document summarization by utilizing discourse structures in text Cohan et al. (2018), introducing hierarchical models Fabbri et al. (2019), or modifying attention mechanisms Beltagy et al. (2020). There are also recent studies on faithfulness in document summarization Cao et al. (2018); Zhu et al. (2020a), which aim to enhance the information consistency between summaries and the input.

Dialogue Summarization

When it comes to the summarization of dialogues, Shang et al. (2018) proposed a simple multi-sentence compression technique to summarize meetings. Zhao et al. (2019); Zhu et al. (2020b) introduced turn-based hierarchical models that encoded each turn of utterance first and then used the aggregated representation to generate summaries. A few studies have also paid attention to utilizing conversational analysis for generating dialogue summaries, such as leveraging dialogue acts Goo and Chen (2018), key point sequence Liu et al. (2019a) or topics Liu et al. (2019b); Li et al. (2019). However, they either needed a large amount of human annotation for dialogue acts, key points or visual focus Goo and Chen (2018); Liu et al. (2019a); Li et al. (2019), or only utilized topical information in conversations Li et al. (2019); Liu et al. (2019b).

These prior works also largely ignored diverse conversational structures in dialogues, for instance, reply relations among participants Mayfield et al. (2012); Zhu et al. (2019), dialogue acts Ritter et al. (2010); Paul (2012), and conversation stages Althoff et al. (2016). Models that only utilize a fixed topic view of the conversation Galley et al. (2003); Joty et al. (2010) may fail to capture its comprehensive and nuanced conversational structures, and any information loss introduced by the conversation encoder may cascade into larger errors in the decoding stage. To fill these gaps, we propose to leverage diverse conversational structures, including topic segments, conversational stages, dialogue overview, and utterances, to design a multi-view model for dialogue summarization.

Method

Conversations can be interpreted from different views, and each view enables the model to focus on a specific aspect of the conversation. To take advantage of these rich conversation views, we design a Multi-view Sequence-to-Sequence Model (see Figure 1) that first extracts different views of conversations (Section 3.1) and then encodes them to generate summaries (Section 3.2).

Conversation summarization models may easily stray among all sorts of information across various speakers and utterances especially when conversations become long. Naturally, if informative structures in the form of small blocks can be explicitly extracted from long conversations, models may be able to understand them better in a more organized way. Thus, we first extract different views of structures from conversations.

Although conversations are often less structured than documents, they are mostly organized around topics in a coarse-grained structure Honneth et al. (1988). For instance, a telephone chat could follow a pattern of “greetings $\rightarrow$ invitation $\rightarrow$ party details $\rightarrow$ rejection” from a topical perspective. Such an explicit topic flow could help models interpret conversations more precisely and generate summaries that cover the important topics. Here we combine the classic topic segmentation algorithm C99 Choi (2000), which segments conversations based on inter-sentence similarities, with the recent sentence representations of Sentence-BERT Reimers and Gurevych (2019) to extract the topic view. Specifically, each utterance $u_{i}$ in a conversation $\mathbf{C}=\{u_{1},u_{2},...,u_{m}\}$ is first encoded into a hidden vector via Sentence-BERT. Then the conversation $\mathbf{C}$ is divided into blocks $\mathbf{C}_{topic}=\{\mathbf{b}_{1},...,\mathbf{b}_{n}\}$ through C99, where $\mathbf{b}_{i}$ is one block that contains several consecutive utterances, such as the topic view shown in Table 1.
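To make the idea of similarity-based segmentation concrete, the following is a minimal sketch. It is not the C99 algorithm: it is a simplified stand-in that places a boundary wherever the cosine similarity between adjacent utterance vectors drops well below the average. The function names, the `std_coeff` parameter, and the toy two-dimensional vectors are all assumptions for illustration; in practice the vectors would come from a sentence encoder such as Sentence-BERT.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_by_similarity(embeddings, std_coeff=1.0):
    """Split utterance indices into blocks of consecutive utterances.

    A boundary is placed between utterances i and i+1 when their similarity
    falls below mean - std_coeff * std of all adjacent similarities.
    Simplified stand-in for a topic segmenter; NOT the C99 algorithm.
    """
    n = len(embeddings)
    if n == 0:
        return []
    if n == 1:
        return [[0]]
    sims = [cosine(embeddings[i], embeddings[i + 1]) for i in range(n - 1)]
    threshold = mean(sims) - std_coeff * pstdev(sims)
    blocks, current = [], [0]
    for i, sim in enumerate(sims):
        if sim < threshold:  # similarity dip: close the current block
            blocks.append(current)
            current = []
        current.append(i + 1)
    blocks.append(current)
    return blocks


# Hypothetical utterance vectors: three about one topic, three about another.
toy_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 1.0],
]
topic_blocks = segment_by_similarity(toy_embeddings)
print(topic_blocks)
```

On the toy vectors, the only sharp similarity drop lies between the third and fourth utterances, so the sketch returns two blocks of three utterances each.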

Stage View

As a way of doing things with words socially together with other people, conversation organizes utterances in certain orders to make it meaningful, enjoyable, and understandable Sacks et al. (1978); Althoff et al. (2016). For example, counseling conversations are found to follow a common pattern of “introductions $\rightarrow$ problem exploration $\rightarrow$ problem solving $\rightarrow$ wrap up” Althoff et al. (2016). Such a conversation stage view provides high-level sketches of the functions or goals of different parts of a conversation, which could help models focus on the stages that carry key information.

We follow Althoff et al. (2016) to extract stages through a Hidden Markov Model (HMM). We impose a fixed ordering on the stages and only allow transitions from the current stage to the next one. The observations in the HMM are the encoded representations $h_{i}$ from Sentence-BERT. We set the number of hidden stages to 4. Similar to the topic view extraction, we segment the conversations into blocks $\mathbf{C}_{stage}=\{\mathbf{b}_{1},...,\mathbf{b}_{n}\}$, where $\mathbf{b}_{i}$ is one block that contains several consecutive utterances. We interpret the inferred stages qualitatively and list the 6 most frequent words appearing in each stage in Table 2. We found that conversations around daily chats usually start with openings, introduce the goals or focus of the conversation, continue with discussions of the details, and finally conclude with certain endings. Table 1 shows an example of the stage view.
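The fixed stage ordering can be illustrated with a small Viterbi decoder for a left-to-right HMM. This is a sketch under stated assumptions, not the authors' implementation: the emission scores are taken as given (here, hypothetical toy log-likelihoods rather than scores fitted on Sentence-BERT vectors), the chain is assumed to start in the first stage, and the stay/advance probabilities are fixed at 0.5 each.

```python
import math


def decode_ordered_stages(log_emissions, stay_logp=math.log(0.5),
                          advance_logp=math.log(0.5)):
    """Viterbi decoding for a left-to-right HMM.

    log_emissions[t][s] is the log-likelihood of utterance t under stage s.
    From stage s the chain may only stay in s or advance to s + 1, and it
    is assumed to start in stage 0.
    """
    if not log_emissions:
        return []
    n_steps, n_stages = len(log_emissions), len(log_emissions[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_stages for _ in range(n_steps)]
    back = [[0] * n_stages for _ in range(n_steps)]
    score[0][0] = log_emissions[0][0]
    for t in range(1, n_steps):
        for s in range(n_stages):
            stay = score[t - 1][s] + stay_logp
            advance = score[t - 1][s - 1] + advance_logp if s > 0 else neg_inf
            if advance > stay:
                score[t][s], back[t][s] = advance, s - 1
            else:
                score[t][s], back[t][s] = stay, s
            score[t][s] += log_emissions[t][s]
    # Trace back from the best-scoring final stage.
    s = max(range(n_stages), key=lambda k: score[-1][k])
    path = [s]
    for t in range(n_steps - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]


def path_to_blocks(path):
    """Group consecutive utterance indices that share the same stage."""
    blocks = []
    for i, stage in enumerate(path):
        if i == 0 or stage != path[i - 1]:
            blocks.append([])
        blocks[-1].append(i)
    return blocks


# Hypothetical emissions: each utterance strongly prefers one of 4 stages.
preferred = [0, 1, 2, 2, 2, 3]
toy_log_emissions = [[-0.1 if s == p else -3.0 for s in range(4)]
                     for p in preferred]
stage_path = decode_ordered_stages(toy_log_emissions)
stage_blocks = path_to_blocks(stage_path)
print(stage_path, stage_blocks)
```

Because transitions can only stay or move forward by one stage, the decoded path is always non-decreasing, which is what turns stage labels into blocks of consecutive utterances.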

Global View and Discrete View

In addition to the aforementioned two structured views, conversations can also be naturally viewed from a relatively coarse perspective, i.e., a global view that concatenates all utterances into one giant block Gliwa et al. (2019), and a discrete view that separates each utterance into a distinct block Liu and Chen (2019); Gliwa et al. (2019).
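The two generic views, and any structured view expressed as index blocks, can be built with a few lines of code. The sketch below is illustrative only: the helper names, the toy dialogue, and the `|` marker placed at the start of each block are assumptions, not details taken from the authors' implementation.

```python
def global_view(utterances):
    """One block containing every utterance."""
    return [list(utterances)]


def discrete_view(utterances):
    """One block per utterance."""
    return [[u] for u in utterances]


def view_from_indices(utterances, index_blocks):
    """Turn blocks of utterance indices (e.g. topic or stage blocks) into text blocks."""
    return [[utterances[i] for i in block] for block in index_blocks]


def flatten_view(blocks, block_token="|"):
    """Prefix each block with a marker token and join everything into one string.

    The marker token is a hypothetical choice for illustration.
    """
    return " ".join(f"{block_token} " + " ".join(block) for block in blocks)


# Hypothetical three-turn dialogue.
dialogue = ["Amy: hi", "Bob: hey", "Amy: lunch at 12?"]
print(flatten_view(global_view(dialogue)))
print(flatten_view(discrete_view(dialogue)))
print(flatten_view(view_from_indices(dialogue, [[0, 1], [2]])))
```

Seen this way, every view is the same list of utterances with block boundaries placed differently, which is why a single encoder can process all of them.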

Multi-view Sequence-to-Sequence Model

We extend generic sequence-to-sequence models to encode and combine different conversation views. To better utilize the semantic information in recent pre-trained models, we implement our base encoders and decoders with a transformer-based pre-trained model, BART Lewis et al. (2019). Note that our multi-view sequence-to-sequence model is agnostic to the pre-trained model, here BART, with which it is initialized.

Given a conversation under a specific view $k$ with $n$ blocks, $\mathbf{C}_{k}=\{\mathbf{b}_{1}^{k},...,\mathbf{b}_{n}^{k}\}$, each token $x_{i,j}^{k}$ in a block $\mathbf{b}_{j}^{k}=\{x_{0,j}^{k},x_{1,j}^{k},...,x_{m,j}^{k}\}$ is first encoded through the conversation encoder $\mathbf{E}$, e.g., the BART encoder shown in Figure 1(a), into hidden representations:

$$\{h_{0,j}^{k},h_{1,j}^{k},...,h_{m,j}^{k}\}=\mathbf{E}(\{x_{0,j}^{k},x_{1,j}^{k},...,x_{m,j}^{k}\})$$

Note that we add special tokens $x_{0,j}^{k}$ at the beginning of each block and use these tokens’ representations to describe each block, i.e., $S_{j}^{k}=h_{0,j}^{k}$ .

To depict different views using hidden vectors, we aggregate the information from all blocks in one conversation through LSTM layers Hochreiter and Schmidhuber (1997):

We use the last hidden state $S_{n}^{k}$ to represent the current view $k$ , denoted as $V_{k}$ .

Multi-view Decoder

Different views provide different conversational aspects for models to learn from and help determine which sets of utterances deserve more attention in order to generate better dialogue summaries. As a result, the ability to strategically combine different views is essential. To this end, we propose a transformer-based multi-view decoder that integrates the encoded representations from different views and generates summaries, as shown in Figure 1(b).

The input to the decoder contains $l-1$ previously generated tokens $t_{1},...,t_{l-1}$ . Via our multi-view decoder $\mathbf{D}$ , the $l$ -th token is predicted via:

Here, $W_{p}$ is a parameter to be learned.

Unlike a generic transformer decoder, we introduce a multi-view attention layer in each transformer block. The multi-view attention layer first decides the importance $\alpha_{k}$ of each view $V_{k}$ through:

Training

We minimize the cross entropy loss during training:

Specifically, we apply the teacher forcing strategy: at training time, the inputs are previous tokens from the ground truth; at test time, the inputs are previous tokens predicted by the decoder.
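The training setup can be sketched in a few lines. The example below is a minimal illustration, assuming a token-level cross entropy summed over the summary and a hypothetical `<s>` start token; the per-step probability tables are made-up toy values, not model outputs.

```python
import math


def teacher_forcing_pairs(target_tokens, bos="<s>"):
    """Pair each decoder input prefix (gold tokens only) with the next gold token."""
    inputs = [bos] + list(target_tokens[:-1])
    return [(inputs[: i + 1], target_tokens[i]) for i in range(len(target_tokens))]


def cross_entropy(step_distributions, target_tokens):
    """Sum of negative log-probabilities assigned to the gold tokens."""
    return -sum(math.log(dist[tok])
                for dist, tok in zip(step_distributions, target_tokens))


# Hypothetical gold summary and per-step predicted distributions.
gold = ["amy", "will", "call"]
pairs = teacher_forcing_pairs(gold)
toy_distributions = [
    {"amy": 0.5, "bob": 0.5},
    {"will": 0.25, "might": 0.75},
    {"call": 1.0},
]
loss = cross_entropy(toy_distributions, gold)
print(pairs)
print(round(loss, 4))
```

At test time the prefixes would instead be built from the decoder's own predictions, so an early mistake can propagate to later steps.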

Experiments

We evaluate our model on a large-scale dialogue summary dataset SAMSum Gliwa et al. (2019) that has 14732 dialogues with human-written summaries. The data statistics are shown in Table 3. SAMSum contains messenger-like conversations about daily topics, such as chit-chats, arranging meetings, discussing events, etc. We compare our Multi-view Sequence-to-Sequence Model (Multi-view BART) with several baseline models:

Pointer Generator See et al. (2017): Following Gliwa et al. (2019), we added separators between each utterance (discrete view) and used it as input for pointer generator model.

DynamicConv + GPT-2/News Wu et al. (2019): We followed Gliwa et al. (2019) to use GPT-2 to initialize token embeddings Radford et al. (2019). We also added news summarization corpus CNN/DM Nallapati et al. (2016) as extra training data.

Fast Abs RL Enhanced Chen and Bansal (2018) first selects salient sentences and then rewrites them abstractively via sentence-level policy gradient methods. We combined it with the global view Gliwa et al. (2019).

BART + Generic views Lewis et al. (2019) utilized BART, a denoising autoencoder for pretraining sequence-to-sequence models, together with generic views (global view and discrete view). We used the BART-large model with its default settings (https://github.com/pytorch/fairseq).

Model Settings

More details are shown in Section A of the Appendix.

We loaded the pre-trained “bert-base-nli-stsb-mean-tokens” model (https://github.com/UKPLab/sentence-transformers) for Sentence-BERT to get representations for each utterance. For extracting the topic view via C99, we set the window size to 4 and the std coefficient to 1. For extracting the stage view, we set the number of hidden states in the HMM to 4. These hyper-parameters were set with a grid search. The BART + Structured views (stage and topic views) used the same set of parameters as BART + Generic views. For Multi-view BART, we experimented with different view combinations: (1) the best generic view, the global view, was combined with each of the two structured views (stage and topic view) separately; (2) the two structured views were also combined (topic + stage). The settings for the BART encoder/decoder were kept identical to the baselines. We used a one-layer LSTM for encoding sections. The learning rate for the section encoder and multi-view attention was set to 3e-3. The temperature $T$ was 0.2. The beam search size during inference was 4 for all models.
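To see what a temperature of 0.2 does to the view weights, consider the sketch below. It assumes that the importance of each view is obtained with a temperature-scaled softmax over per-view scores and that the weights are then used for a weighted sum of per-view vectors; the scores and vectors are hypothetical toy values, and how the scores are computed inside the model is not reproduced here.

```python
import math


def view_weights(scores, temperature=0.2):
    """Temperature-scaled softmax over per-view scores.

    A temperature below 1 sharpens the distribution toward the top view.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_views(weights, view_vectors):
    """Weighted sum of per-view vectors (all of the same dimension)."""
    dim = len(view_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, view_vectors))
            for d in range(dim)]


# Hypothetical scores for two views, e.g. topic and stage.
toy_scores = [0.6, 0.4]
sharp = view_weights(toy_scores, temperature=0.2)
flat = view_weights(toy_scores, temperature=1.0)
mixed = combine_views([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
print(sharp, flat, mixed)
```

With the same raw scores, the lower temperature assigns noticeably more weight to the higher-scoring view than a plain softmax would.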

Results

Quantitative Results

We evaluated models with the standard ROUGE metric (with stemming) Lin and Och (2004) and reported ROUGE-1, ROUGE-2, and ROUGE-L. (Here we followed BART and used https://github.com/pltrdy/rouge. Note that different tools may generate different ROUGE scores.) Results on the test set for the different models are shown in Table 4. Compared to Pointer Generator, using reinforcement learning to select important sentences first (Fast Abs RL Enhanced) slightly increased F scores. Adding pre-trained embeddings or extra document training data to lightweight convolution models (DynamicConv + GPT-2/News) led to even better ROUGE scores. When using the pre-trained transformer-based model BART with generic views, all ROUGE scores improved significantly, and BART + Global outperformed BART + Discrete, especially in terms of ROUGE-L F scores. Segmenting conversations into blocks from structured views (stage view and topic view) further boosted performance, suggesting that our extracted conversation structures help conversation encoders capture nuanced and informative aspects of dialogues.
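For readers unfamiliar with the metric, the sketch below computes ROUGE-N precision, recall, and F score from clipped n-gram overlap. It is a simplified illustration: it uses whitespace tokenization without stemming, so its numbers will not match those of the tool used above, and the two example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) based on clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical generated summary and reference summary.
generated = "Amy will call Bob"
reference = "Amy will call Bob tomorrow"
r1 = rouge_n(generated, reference, n=1)
r2 = rouge_n(generated, reference, n=2)
print(r1, r2)
```

In this toy case every generated n-gram appears in the reference, so precision is 1.0 while recall is lower because the reference contains one extra word.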

We did not see any performance boost when combining the generic global view with either the topic view or the conversational stage view, partly because the coarse granularity of the global view does not complement structured views well. In contrast, utilizing both structured views (topic view + stage view) consistently increased ROUGE scores further, indicating the effectiveness of synthesizing the informative conversation blocks introduced by both views.

We visualized the attention weight distributions for the stage view and topic view in our best model (see Appendix) and found that the contributions of topic views are slightly more prominent than those of stage views. This also suggests that the two structured views complement each other well even though they share the same dialogue content. Note that the gains from Multi-view BART (Topic + Stage) come mainly from precision scores while recall scores remain comparable, suggesting that our proposed model produced fewer irrelevant tokens while preserving the necessary information in its generated summaries.

We visualized the impact of two essential components of conversations, the number of participants and the number of turns, on ROUGE scores for our best-performing model, Multi-view BART with topic view + stage view, in Figure 3. As the number of participants or turns increases, ROUGE scores decrease, indicating that conversation summarization becomes more difficult as more participants and more utterances are involved.

Qualitative/Human Evaluation

We also conducted human annotations to evaluate the generated dialogue summaries, in addition to ROUGE scores. Similar to Gliwa et al. (2019), we asked human annotators on Amazon Mechanical Turk (https://www.mturk.com/) to rate each summary (200 randomly sampled summaries in total) on a scale of [-2, 0, 2], where -2 means that a summary was poor, extracted irrelevant information, or did not make sense at all; 2 means that it was understandable and gave a concise overview of the text; and 0 means that the summary extracted only a part of the relevant information or made some mistakes. The score for each summary was averaged over three different annotators. The intra-class correlation was 0.583, indicating moderate agreement Koo and Li (2016).

As shown in Figure 4, consistent with the ROUGE scores in Table 4, our multi-view model achieved the highest human annotation scores, significantly higher (via a Student's t-test) than either the generic (discrete or global) views or the structured (stage or topic) views, which further demonstrated the effectiveness of combining different views.

Model Analysis and Discussion

So far, we have achieved a reasonable summarization performance. To further study why dialog summarization is challenging and how future research could advance this direction, we take a closer look at this dialogue summarization dataset (SAMSum), model generation errors, as well as certain challenges that existing approaches are struggling with.

We conducted a thorough examination of the challenges in conversation summarization and organized them into 7 categories as below:

Informal language use: Many conversations, especially in online contexts such as Twitter/Reddit Jackson and Moulinier (2007), contain typos, word abbreviations, slang, or emoticons/emojis, making them hard to represent and summarize.

Multiple participants: As shown in Figure 3, conversations with more speakers are harder to summarize since they may require models to accurately differentiate both the language styles and the content of different speakers, similar to the multiple characters issue in story summarization Zhang et al. (2019).

Multiple turns: Similar to long document summarization Xiao and Carenini (2019), conversations with many utterances contain more information to be processed and are thus harder to summarize.

Referral and coreference: People usually refer to each other, mention others’ names, or use coreference in their messages, which introduces extra difficulty to dialogue summarization, a challenge that also exists in reading comprehension Chen et al. (2016) and document summarization Falke et al. (2017).

Repetition and interruption: Information is generally scattered through the whole conversation, and speakers may interrupt each other, reconfirm, back-channel, or repeat themselves, a unique discourse challenge for dialogue summarization.

Negations and rhetorical questions: Negation, a long-standing problem in the NLP field Li et al. (2016), is even more frequent in conversations, as there are more question-answer exchanges between speakers.

Role and language change: Conversations usually involve more than one speaker, and the role of a speaker may shift from questioner to answerer, requiring the summarization model to dynamically deal with speaker roles and the associated language (e.g., first-person pronouns).

We randomly sampled 100 examples from our test set and classified them using the above challenge taxonomy. (The full set of analyzed examples is shown in the Appendix.) A conversation might have more than one category label, and if it had none of the aforementioned challenges, we labeled it as (0) Generic. Usually, those marked as Generic were shorter or had a simple structure.

Table 5 presents the percentage of each type of challenge and the per-category performance of our best model (Multi-view BART with topic view + stage view). We observed that: (i) Referral & coreference (33%) and Role & language change (30%) were the two most frequent challenges that the dialogue summarization task faced. (ii) As expected, Generic conversations were relatively easier to summarize. (iii) Our best model performed relatively worse on Repetition & interruption, Multiple turns, and Referral & coreference, calling for more intelligent summarization methods to tackle those challenges.

Error Analysis

Error analysis for the baselines is shown in the Appendix.

We examined summaries generated by our best-performing model compared to ground-truth summaries, and observed several major error types:

Missing information: content mentioned in references is missing in generated summaries.

Redundancy: content that occurs in generated summaries is not mentioned in the references.

Wrong references: generated summaries contain information that is not faithful to the original dialogue, and associate one’s actions/locations with a wrong speaker.

Incorrect reasoning: generated summaries infer relations in dialogues incorrectly and thus come to wrong conclusions.

Improper gendered pronouns: summaries used improper gendered pronouns (e.g., the misuse of gendered pronouns).

We annotated the same set of 100 randomly sampled summaries using the above error type taxonomy. A summary might have more than one category label, and we categorized a summary as (0) Other if it did not belong to any error type.

Table 6 presents the breakdown of error types and per-category ROUGE scores. We found that: (i) Missing information (37%) was the most frequent error type, indicating that current summarization models struggle to identify key information. (ii) Incorrect reasoning accounted for 24% of the samples and had the worst ROUGE-2; despite being a minor type (6%), improper gendered pronouns seemed to severely decrease both ROUGE-1 and ROUGE-2. (iii) The relatively low ROUGE scores associated with incorrect reasoning and wrong references call for summarization models that better handle faithfulness in dialogue summarization.

Relation between Challenges and Errors

To figure out the relations between challenges and errors made by our models, i.e., how different types of errors correlate with different types of challenges, we visualized their co-occurrence as a heat map in Figure 5. We found that: (i) Our model generated good summaries for generic, simple conversations. (ii) All kinds of challenges had high correlations with, or could lead to, the missing information error. (iii) Wrong references were highly associated with referral & coreference; this was expected, since coreference in conversations naturally makes it harder for models to associate the correct speakers with the correct actions. (iv) The high correlations between role & language change, referral & coreference, and incorrect reasoning indicated that interactions between multiple participants with frequent coreference might easily lead current summarization models to reason incorrectly.
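The counts behind such a heat map can be computed directly from the multi-label annotations. The sketch below is illustrative only: the function name and the three annotated samples are hypothetical, and it counts raw co-occurrences without any normalization that the figure may apply.

```python
from collections import Counter


def cooccurrence(samples):
    """Count how often each (challenge, error) pair is labeled on the same sample.

    Each sample is a pair (challenge_labels, error_labels); a sample may carry
    several labels of each kind.
    """
    counts = Counter()
    for challenges, errors in samples:
        for challenge in set(challenges):
            for error in set(errors):
                counts[(challenge, error)] += 1
    return counts


# Hypothetical annotations for three conversations.
toy_samples = [
    (["referral & coreference"], ["wrong references"]),
    (["referral & coreference", "multiple turns"],
     ["missing information", "wrong references"]),
    (["generic"], ["other"]),
]
pair_counts = cooccurrence(toy_samples)
for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Each cell of the heat map then corresponds to one (challenge, error) pair, and pairs that never occur together simply have a count of zero.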

Conclusion

In this work, we proposed a multi-view sequence-to-sequence model that leverages multiple conversational structures (topic view and stage view) and generic views (global view and discrete view), and strategically combines them, to generate better summaries for conversations. Experiments demonstrated the effectiveness of our proposed model in both quantitative and qualitative evaluations. Through thorough error analyses, we identified a set of challenges that current models struggle with, which can facilitate future research on conversation summarization. Due to the lack of annotations, we only adopted simple unsupervised segmentation methods to extract the different views. In the future, we plan to annotate some of the data, explore supervised segmentation models Li et al. (2018), and introduce more conversation structures, such as dialogue acts Oya and Carenini (2014); Joty and Hoque (2016), into abstractive dialogue summarization.

Acknowledgment

We would like to thank the anonymous reviewers for their helpful comments, and the members of the Georgia Tech SALT group for their feedback. We acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.

References

Appendix A Model Settings

We loaded the pre-trained “bert-base-nli-stsb-mean-tokens” model (https://github.com/UKPLab/sentence-transformers) for Sentence-BERT to get representations for each utterance. When extracting the topic view, we set the window size to 4 and the std coefficient to 1 in C99. When extracting the stage view, we set the number of hidden states in the HMM to 4. These hyper-parameters were set after a grid search in which randomly sampled segmentation results were evaluated by humans. The BART + Structured views (stage and topic views) followed the same parameters as BART + Generic views.

For Multi-view BART, we selected different views to combine: (1) generic view + structured view: the best generic view, the global view, was combined with each of the two structured views (stage and topic view); (2) structured view + structured view: the two best single views were combined (topic + stage). The settings for the BART encoder/decoder were kept the same as the baseline. We used a one-layer LSTM for encoding sections. The learning rate for the section encoder and multi-view attention was set to 3e-3. The temperature $T$ was 0.2. The beam search size during inference was 4 for all models.

Experiments were performed on two Tesla P100 (16GB memory).

Appendix B View Attention Visualization

We visualized the attention weight distributions for the stage view and topic view in our best multi-view model in Figure 6 to explore the importance of stage versus topic. We found that the topic views were more prominent than the stage views, consistent with the performance of BART + topic view and BART + stage view. This indicated that discourse structures about topics might be more important, while both topic and stage could improve conversation summarization. It also suggested that the two structured views complement each other well even though they share the same dialogue content.

We displayed two examples in Table 8 with the gold references, the summaries generated from each single view, and the summaries generated from the combined views. The combined view could balance the advantages of each single view and generated more precise summaries. The attention weights that the model learned were also consistent with the performance of the single views.

Appendix C Supplementary Examples for Model Analysis and Discussion

For the analysis in the Model Analysis and Discussion section in our paper, we randomly sampled 100 examples from the test set of the SAMSum dataset which can be downloaded here https://arxiv.org/abs/1911.12237. Table 7 provides a full index list of the samples.

Table 9 shows the error analysis for the BART-Discrete, BART-Global, BART-Stage, BART-Topic, and BART-Multi-view models. It can be observed that (i) without any explicit structures, the discrete-view and global-view models generated summaries with more redundancy relative to the gold reference summaries, as models may easily lose focus amid massive information; (ii) once we introduced conversation structures such as the topic view and stage view, models behaved better in terms of redundancy and incorrect reasoning, which indicated that the structured views could help models better understand the conversations; (iii) our multi-view model, which combined both the stage view and the topic view, made the fewest errors among all models compared, suggesting the effectiveness of combining different views for conversation summarization.