Abstractive Dialogue Summarization with Sentence-Gated Modeling Optimized by Dialogue Acts

Chih-Wen Goo, Yun-Nung Chen

1 Introduction

With the large amount of textual information now available, text summarization has been widely studied in natural language processing. Approaches fall into two types: extractive summarization and abstractive summarization. Extractive methods assemble the summary directly from the source text, while abstractive methods generate words to form the summary. With the rise of neural models, abstractive summarization has recently been widely investigated. In addition, some recent work combines the advantages of both types of methods and achieves better summarization results.

Most summarization work has focused on single-speaker written documents such as news and scientific publications. Beyond text summarization, speech summarization is equally important, especially for spoken or multimedia documents such as multi-party meetings, which are more difficult to browse than text. Speech summarization has therefore also been investigated in the past. However, almost all prior work summarizes documents based on the salient content mentioned rather than on the interactive status, even though this behavioral signal should be important for dialogue summarization.

To summarize a meeting well, both the content and the inter-speaker interactions are important. Prior dialogue summarization work used prosody or speaker information as interactive patterns to better extract salient sentences. However, abstractive summarization of dialogues and meetings has not yet been explored due to the lack of suitable benchmark data, because the existing benchmark dialogue data is annotated only with the importance of utterances, without abstractive summaries. To bridge this gap, this paper benchmarks the abstractive dialogue summarization task using the AMI meeting corpus, where the summaries are derived from the annotated topics that the speakers discuss. A topic or a high-level description of a meeting is treated as the abstractive summary; for instance, “evaluation of project new idea for TV” is a summary of the meeting topics. Such dialogue summaries are very short and may not contain words directly mentioned by the speakers, which makes automatic summarization more challenging.

A dialogue is a sequence of utterances exchanged between multiple participants, where each utterance modifies both the participants’ cognitive status and the current dialogue state. The effect of an utterance on the context is often called a dialogue act, which provides informative cues for understanding dialogues. Dialogue act classification has therefore been widely studied in spoken language understanding research, and previous work on dialogue act recognition has used information sources from multiple modalities, including linguistic information and global contextual properties such as knowledge about the participants. Popular approaches for dialogue act classification include support vector machines (SVM), Naive Bayes, logistic regression, and recurrent neural networks (RNN).

Dialogue act classification and summarization are usually treated independently and used for different goals. In this paper, we leverage dialogue act information to improve dialogue summarization. Assuming that dialogue acts, as interactive signals, are important for better summarization, the main focus of this paper is how to integrate this information effectively into a neural summarization model. Prior work attempted to model discourse information and proposed a discourse-aware summarization model using a hierarchical RNN, where the between-utterance cues are modeled implicitly. That model was also evaluated on a publication summarization task, where the input documents are relatively structured and contain no interactive behavior.

Therefore, this work focuses on how to effectively model interactive signals such as dialogue acts for better dialogue summarization, and introduces a sentence-gated mechanism to jointly model the explicit relationships between dialogue acts and summaries. To the best of our knowledge, no previous study has explored a similar idea. Our contributions are three-fold:

The proposed model is the first attempt for dialogue summarization using dialogue acts as explicit interactive signals.

We benchmark the dataset for abstractive summarization in the meeting domain, where the summaries describe the high-level goals of meetings.

Our proposed model achieves state-of-the-art performance in dialogue summarization and helps us analyze how much each utterance and its dialogue act affect the summaries.

2 Dialogue Summarization Dataset

Because no abstractive summarization data exists in any conversational domain, this paper first builds a dataset to benchmark the experiments. The AMI meeting corpus is a well-known meeting dataset with a variety of annotations, consisting of 100 hours of meeting recordings. The recordings use a range of signals synchronized to a common timeline, including close-talking and far-field microphones, individual and room-view video cameras, and output from a slide projector and an electronic whiteboard. The meetings are recorded in English in three different rooms with different acoustic properties, and include mostly non-native speakers. The corpus contains a wide range of annotations, including dialogue acts, topic descriptions, named entities, hand gestures, and gaze direction. In this work, we use the recording transcripts as the input to our model. Because there is no summary annotation in the AMI data, the annotated topic descriptions are treated as summaries of the dialogues. The annotations for dialogue acts and topic descriptions are not available for all utterances, so we extract a subset of the AMI corpus to construct the benchmark dialogue summarization dataset. Figure 1 shows an example dialogue instance, where the summary describes the high-level goal of the meeting.

We use a sliding window size of 50 words to split a meeting into several dialogue samples, where we adjust the boundary to make sure no utterance would be cut in the middle. If the topic changes within the window, all topic descriptions are concatenated according to their appearing order. In each resulting sample, there are around 50 to 100 words in an arbitrary number of sentences. We extract 7,824 samples from 36 meeting recordings and then randomly separate them into three groups: 7,024 samples for training, 400 samples for development, and 400 samples for testing. There are 15 dialogue act labels in the training set. The detailed statistics are shown in Table 1.
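The segmentation step can be illustrated with a minimal sketch. It assumes non-overlapping windows that are closed at the first utterance boundary where the word count reaches the window size, and a hypothetical `(text, topic)` record layout; the exact procedure used to build the dataset may differ.

```python
from typing import List, Tuple

# Hypothetical record layout: (utterance text, annotated topic description).
Utterance = Tuple[str, str]


def split_into_samples(utterances: List[Utterance], window: int = 50):
    """Group whole utterances into samples of at least `window` words.

    Assumption: windows do not overlap, and a sample is closed at the first
    utterance boundary where its word count reaches `window`, so no
    utterance is ever cut in the middle.
    """
    samples = []
    texts, topics, n_words = [], [], 0
    for text, topic in utterances:
        texts.append(text)
        n_words += len(text.split())
        # Concatenate topic descriptions in order of appearance, without repeats.
        if topic not in topics:
            topics.append(topic)
        if n_words >= window:
            samples.append((texts, " ".join(topics)))
            texts, topics, n_words = [], [], 0
    if texts:  # keep the trailing partial window
        samples.append((texts, " ".join(topics)))
    return samples


# Toy meeting with synthetic utterances of 30, 30, 10, and 45 words.
meeting = [
    ("a " * 30, "opening"),
    ("b " * 30, "opening"),
    ("c " * 10, "budget"),
    ("d " * 45, "design"),
]
samples = split_into_samples(meeting)
```

In this toy case the first two utterances form one sample, and the last two form a second sample whose summary concatenates both topic descriptions.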

3 Proposed Approach

This section first explains our attention-based RNN model and then introduces the proposed sentence gating mechanism for summarization jointly optimized with dialogue act recognition. The model architecture is illustrated in Figure 2, where there are several modules including 1) a dialogue history encoder, 2) a dialogue act labeler, 3) an attentional summary decoder, and 4) a sentence gate. We detail each module below.

3.1 Dialogue History Encoder

Given a dialogue document, the input is a sequence of utterances ${\bf s}=(s_{1},\dots,s_{K})$, where $K$ is the dialogue length. Each utterance consists of a word sequence ${\bf x}=(x_{1},\dots,x_{T})$, and its sentence embedding is obtained by averaging all word embeddings in that sentence. (We also experimented with RNN-learned sentence embeddings, but the performance was similar to that of averaged word embeddings; to limit the parameter size, all experiments use averaged vectors as sentence embeddings.) The bidirectional long short-term memory (BLSTM) model takes the sentence sequence ${\bf s}$ as input and generates a forward hidden state $\overrightarrow{h_{i}^{e}}$ and a backward hidden state $\overleftarrow{h_{i}^{e}}$. The final hidden state $h_{i}^{e}$ at time step $i$ is the concatenation of $\overrightarrow{h_{i}^{e}}$ and $\overleftarrow{h_{i}^{e}}$, i.e., $h_{i}^{e}=[\overrightarrow{h_{i}^{e}},\overleftarrow{h_{i}^{e}}]$, which can be viewed as the encoded information for the given source document.
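As a small illustration of the averaged sentence embedding, the sketch below uses a toy lookup table with made-up values. The name `sentence_embedding` and the choice to skip out-of-vocabulary words are assumptions made for the example, not details from the model.

```python
from typing import Dict, List


def sentence_embedding(words: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of the words that appear in `table`.

    Assumption: out-of-vocabulary words are skipped.
    """
    vectors = [table[w] for w in words if w in table]
    if not vectors:
        raise ValueError("no known words in the utterance")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy 2-dimensional embedding table (hypothetical values).
toy_table = {"new": [1.0, 0.0], "idea": [0.0, 1.0], "tv": [1.0, 1.0]}
embedding = sentence_embedding(["new", "idea", "tv"], toy_table)
```

The resulting vectors, one per utterance, form the input sequence of the BLSTM encoder.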

3.2 Dialogue Act Labeler

To leverage the dialogue act information, this module predicts dialogue acts for all utterances. Specifically, ${\bf s}$ is mapped to its corresponding dialogue act label sequence ${\bf y}=(y^{DA}_{1},\dots,y^{DA}_{K})$. For each hidden state $h_{i}^{e}$, we compute the dialogue act context vector $c_{i}^{DA}$ as the weighted sum of the LSTM’s hidden states, $h_{1}^{e},\dots,h_{K}^{e}$, using the learned attention weights $\alpha^{DA}_{i,j}$:

$$c_{i}^{DA}=\sum_{j=1}^{K}\alpha^{DA}_{i,j}h_{j}^{e}\qquad(1)$$

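where the dialogue act attention weights are computed as

$$\alpha^{DA}_{i,j}=\frac{\exp(e_{i,j})}{\sum_{k=1}^{K}\exp(e_{i,k})}\qquad(2)$$

$$e_{i,k}=\sigma(W^{DA}_{he}h_{k}^{e})\qquad(3)$$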

where $\sigma$ is the sigmoid activation function, and $W^{DA}_{he}$ is the weight matrix of a feed-forward neural network. The hidden states and dialogue act context vectors are then used together for dialogue act modeling:

$$y^{DA}_{i}=\mathrm{softmax}\big(W^{DA}_{hy}(h_{i}^{e}+c_{i}^{DA})\big)\qquad(4)$$

where $y^{DA}_{i}$ is the dialogue act label of the $i$ -th sentence in the given dialogue, and $W^{DA}_{hy}$ is the weight matrix. The dialogue act attention is shown as the blue component in Figure 2.
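The labeler can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: each attention score is a scalar obtained from a weight vector `w_he` (standing in for the feed-forward weight matrix), the scores are normalized with a softmax, and the hidden state and context vector are summed before the output layer. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_context(states: List[Vector], w_he: Vector) -> Tuple[Vector, Vector]:
    """Return attention weights and the context vector over encoder states.

    Assumption: the score of state h_k is sigmoid(w_he . h_k), so in this
    simplified form the weights do not depend on the query position.
    """
    scores = [sigmoid(dot(w_he, h)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(a * h[d] for a, h in zip(weights, states)) for d in range(dim)]
    return weights, context


def label_distribution(h_i: Vector, c_i: Vector, w_hy: List[Vector]) -> Vector:
    """Dialogue act distribution from the sum of hidden state and context."""
    combined = [h + c for h, c in zip(h_i, c_i)]
    return softmax([dot(row, combined) for row in w_hy])


# Toy example: three encoder states of dimension 2, three dialogue act labels.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context(states, w_he=[0.5, -0.5])
probs = label_distribution(states[0], context, w_hy=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```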

3.3 Attentional Summary Decoder

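Following prior work, we use an attentional decoder to generate the word sequence of the summary. The summary context vector $c^{S}_{i}$ is computed in the same way as $c^{DA}_{i}$, with its own attention weights $\alpha^{S}_{i,j}$:

$$c^{S}_{i}=\sum_{j=1}^{K}\alpha^{S}_{i,j}h_{j}^{e}\qquad(5)$$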

The summary is generated by a unidirectional LSTM with the initial state set to be $h_{K}^{e}$ , the last hidden state of the dialogue history encoder. The unidirectional LSTM will output words until generating an end-of-string token or reaching the predefined maximum length. The formulation is shown as:

3.4 Sentence-Gated Mechanism

A gating mechanism can model the explicit relationship between two types of information. The proposed sentence-gated model introduces an additional gate that leverages a summary context vector to model the relationships between dialogue acts and summaries, improving both the dialogue act labeler and the attentional summary decoder, as illustrated in Figure 3. The proposed model has two variants:

Full attention: The model considers the relations between dialogue acts and summaries using both dialogue act attention and summary attention, shown as the blue and green blocks, respectively, in Figure 2.

Summary attention: The model builds the gating mechanism using only summary attention, so its parameter size is smaller than that of the full attention model.

3.4.1 Full Attention

First, a dialogue act context vector $c_{i}^{DA}$ and an averaged summary context vector $c^{S}$ are combined and passed through a sentence gate:

$$g=\sum v\cdot\tanh(c_{i}^{DA}+W\cdot c^{S})\qquad(8)$$

where $v$ and $W$ are a trainable vector and a trainable matrix, respectively. The summation is taken over the elements in one time step. $g$ can be seen as a weighted feature of the joint context vector ($c_{i}^{DA}$ and $c^{S}$). We use $g$ to weight between $h_{i}^{e}$ and $c_{i}^{DA}$ when deriving $y^{DA}_{i}$, replacing (4) with:

$$y^{DA}_{i}=\mathrm{softmax}\big(W^{DA}_{hy}(h_{i}^{e}+c_{i}^{DA}\cdot g)\big)\qquad(9)$$

A larger $g$ indicates that the dialogue act context vector and the summary context vector attend to similar parts of the input sequence, which implies that the correlation between the dialogue act and the summary is stronger and that the context vector is more reliable for the prediction.
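A minimal sketch of the gate computation is given below. It assumes a tanh nonlinearity inside the gate and a scalar gate value obtained by summing over vector elements; the function names and toy values are hypothetical and serve only to illustrate the data flow.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def sentence_gate(c_da: Vector, c_s: Vector, v: Vector, w: Matrix) -> float:
    """Scalar gate: sum over elements of v * tanh(c_da + W c_s).

    Assumption: tanh is used as the nonlinearity inside the gate.
    """
    projected = matvec(w, c_s)
    return sum(vd * math.tanh(cd + pd) for vd, cd, pd in zip(v, c_da, projected))


def gated_features(h_i: Vector, c_da: Vector, g: float) -> Vector:
    """Features passed to the dialogue act softmax: h_i + g * c_da."""
    return [h + g * c for h, c in zip(h_i, c_da)]


identity = [[1.0, 0.0], [0.0, 1.0]]
g = sentence_gate(c_da=[1.0, 1.0], c_s=[1.0, 1.0], v=[1.0, 1.0], w=identity)
features = gated_features(h_i=[0.3, 0.4], c_da=[1.0, 1.0], g=g)
```

Under the same assumptions, the summary attention variant corresponds to passing the encoder state in place of the dialogue act context vector, i.e. `sentence_gate(h_i, c_s, v, w)` followed by `gated_features(h_i, h_i, g)`.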

3.4.2 Summary Attention

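To further investigate the power of the sentence gate mechanism, we remove the dialogue act attention module from the architecture, so $c^{DA}_{i}$ is replaced with $h^{e}_{i}$. Accordingly, (8) and (9) are rewritten as (10) and (11), respectively:

$$g=\sum v\cdot\tanh(h_{i}^{e}+W\cdot c^{S})\qquad(10)$$

$$y^{DA}_{i}=\mathrm{softmax}\big(W^{DA}_{hy}(h_{i}^{e}+h_{i}^{e}\cdot g)\big)\qquad(11)$$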

This version allows the dialogue acts and summaries to share the attention mechanism, so both kinds of information can be mutually improved in a more direct manner than in the full attention version.

3.5 Joint End-to-End Training

To learn the summarization model optimized by the dialogue act information, we formulate a joint objective as

where $p(y^{DA},y^{S}\mid\textbf{s})$ is the conditional probability of dialogue acts and the summary given the input dialogue. Based on the joint objective, the proposed model that utilizes interactive signals for summarization can be trained in an end-to-end fashion.
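One way to read the joint objective in practice is as a sum of two negative log-likelihood terms. The sketch below assumes that the joint probability factorizes into per-utterance dialogue act probabilities and per-token summary probabilities, and that the two terms are weighted equally; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence


def sequence_nll(distributions: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Negative log-likelihood of target indices under per-step distributions."""
    return -sum(math.log(dist[t]) for dist, t in zip(distributions, targets))


def joint_loss(da_probs, da_labels, summary_probs, summary_tokens) -> float:
    """Sum of the dialogue act loss and the summary loss (equal weighting assumed)."""
    return sequence_nll(da_probs, da_labels) + sequence_nll(summary_probs, summary_tokens)


# Toy example: one utterance with two candidate acts, one summary token
# drawn from a two-word vocabulary.
loss = joint_loss(
    da_probs=[[0.5, 0.5]], da_labels=[0],
    summary_probs=[[0.25, 0.75]], summary_tokens=[1],
)
```

Minimizing this quantity maximizes the log of the assumed factorized joint probability, so gradients from both tasks reach the shared encoder.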

4 Experiments

To evaluate the proposed model, we conduct experiments using the AMI meeting data introduced in Section 2.

In all experiments, the optimizer is Adam, the reported numbers are averaged over 20 runs, and the maximum number of epochs is set to 30 with an early-stopping strategy. In our proposed model, the size of the hidden vectors is set to 256, and the vector dimensions vary for the compared baselines so that all models have a similar size.

For evaluation, dialogue act performance is measured by accuracy (Acc), and summary performance is measured by ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-3 (R-3), and ROUGE-L (R-L) scores. We also validate the performance improvement with a statistical significance test in all experiments, where a single-tailed t-test measures whether the results of the proposed model are significantly better than those of all baselines. The dagger symbols ($\dagger$) indicate a significant improvement with $p<0.05$.
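To make the n-gram-based metrics concrete, the following is a simplified sketch of ROUGE-N computed as recall with clipped n-gram counts. It is not the official ROUGE toolkit, which offers further options such as stemming and F-measure variants, and which variant the reported scores use is not assumed here.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of reference n-grams that also occur in the candidate."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    cand = ngrams(candidate, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / sum(ref.values())


reference = "evaluation of project new idea for tv".split()
candidate = "evaluation of new idea".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
```

For this toy pair, four of the seven reference unigrams and two of the six reference bigrams are matched.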

4.2 Baselines

Because there is no previous work on joint dialogue act modeling and summarization, the compared baselines address either dialogue act classification or text summarization. They include a bidirectional LSTM dialogue act labeler, an attentional seq2seq summarization model, a pointer-generator network, and a discourse-aware hierarchical attentional seq2seq model. Note that the BLSTM dialogue act labeler baseline is the same as our proposed model without the summarization component. The pointer-generator network extends the attentional seq2seq model by adding a joint pointer network to enable the copy mechanism. For the discourse-aware model, we use only the hierarchy concept introduced by Cohan et al. and do not include its pointer network part. The reason is explained later in Section 4.3. Among all baselines, only the discourse-aware model implicitly utilizes the interactive signal, while our model explicitly optimizes the summary together with dialogue acts.

4.3 Results

The experimental results are shown in Table 2, where the models have similar numbers of parameters. Among all summarization baselines, the discourse-aware hierarchical seq2seq model performs better than the other two baselines, indicating the importance of discourse/interaction cues for dialogue summarization. The difference between the attentional seq2seq model and the pointer-generator network is small, because the high-level descriptions used as summaries rarely overlap with the input dialogues (an overlapping rate of 1.2% for the AMI meeting data). Due to this low overlapping rate, the pointer-generator network performs the worst, because the pointer network and coverage loss introduce noise. This is why the other baselines and our proposed model do not include the copy mechanism or coverage loss in the experiments. The finding suggests that dialogue summarization focuses more on the interaction goal than on the mentioned content.

Table 2 shows that the proposed sentence-gated mechanism with summary attention significantly outperforms all baselines, with almost all measurements showing a significant improvement. This demonstrates that interactive signals provide useful cues for dialogue summarization and that the proposed sentence-gated mechanism effectively models the relationships between them. The proposed model with full attention performs slightly worse than the one with summary attention only. The probable reason is that dialogue act attention may not be necessary for predicting the dialogue act of a single utterance; that is, dialogue acts are often decided based only on the individual utterance, so adding attention over its contextual utterances may not bring much benefit for modeling such interactive behaviors. Moreover, the proposed model reduces the model size by 12% compared to the best baseline combination (BLSTM for dialogue act prediction + discourse-aware hierarchical attention seq2seq for summarization) while demonstrating better model capacity.

4.4 Attention Analysis

To further analyze the attention learned in the model, we visualize the utterance attention weights when generating summaries in Figure 4. Figures are colored with different levels of the summary attention, where the darker one has a larger attention value as its importance when generating the target word, and vice versa. It is obvious that the proposed model successfully captures which ones are the key sentences in the dialogues. It may be credited to the proposed sentence gate that learns the dialogue acts conditioned on its summary in order to provide the helpful signal for global optimization of the joint model. In addition, it can be found that the “Inform” dialogue act usually guides the model to pay more attention to it, which aligns well with our intuition. In sum, for dialogue summarization, the experiments show that modeling dialogue acts and summary relations controlled by the novel sentence-gated mechanism can effectively improve abstractive summarization performance in terms of ROUGE scores due to the joint optimization with dialogue act modeling.

5 Conclusion

This paper focuses on abstractive dialogue summarization by modeling interactive behaviors. The proposed model uses a novel sentence gate that allows the dialogue act signal to be conditioned on the learned summarization result, in order to achieve better performance on both tasks. This paper benchmarks the experiments using a meeting dataset, and the experiments show that the proposed approach outperforms all state-of-the-art models, demonstrating the importance of interactive cues in dialogue summarization.

Acknowledgements

We thank the anonymous reviewers for their insightful feedback on this work. The authors are financially supported by Ministry of Science and Technology (MOST) in Taiwan and MediaTek Inc.