SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, Aleksander Wawer

Introduction and related work

The goal of the summarization task is to condense a piece of text into a shorter version that covers the main points succinctly. In the abstractive approach, important pieces of information are presented using words and phrases not necessarily appearing in the source text. This requires natural language generation techniques with a high level of semantic understanding Chopra et al. (2016); Rush et al. (2015); Khandelwal et al. (2019); Zhang et al. (2019); See et al. (2017); Chen and Bansal (2018); Gehrmann et al. (2018).

Major research efforts have focused so far on summarization of single-speaker documents like news (e.g., Nallapati et al. (2016)) or scientific publications (e.g., Nikolov et al. (2018)). One of the reasons is the availability of large, high-quality news datasets with annotated summaries, e.g., CNN/Daily Mail Hermann et al. (2015); Nallapati et al. (2016). Such a comprehensive dataset for dialogues is lacking.

The challenges posed by the abstractive dialogue summarization task have been discussed in the literature with regard to the AMI meeting corpus McCowan et al. (2005), e.g. Banerjee et al. (2015), Mehdad et al. (2014), Goo and Chen (2018). Since the corpus has a low number of summaries (for 141 dialogues), Goo and Chen (2018) proposed to use assigned topic descriptions as gold references. These are short, label-like goals of the meeting, e.g., costing evaluation of project process; components, materials and energy sources; chitchat. Such descriptions, however, are very general, lacking the messenger-like structure and any information about the speakers.

To benefit from large news corpora, Ganesh and Dingliwal (2019) built a dialogue summarization model that first converts a conversation into a structured text document and later applies an attention-based pointer network to create an abstractive summary. Their model, trained on structured text documents of CNN/Daily Mail dataset, was evaluated on the Argumentative Dialogue Summary Corpus Misra et al. (2015), which, however, contains only 45 dialogues.

In the present paper, we further investigate the problem of abstractive dialogue summarization. With the growing popularity of online conversations via applications like Messenger, WhatsApp and WeChat, summarization of chats between a few participants is a new interesting direction of summarization research. For this purpose we have created the SAMSum Corpus (the name is a shortcut for Samsung Abstractive Messenger Summarization), which contains over 16k chat dialogues with manually annotated summaries. The dataset is freely available for the research community (it is shared on terms of the Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license and accompanies this paper on arXiv).

The paper is structured as follows: in Section 2 we present details about the new corpus and describe how it was created, validated and cleaned. A brief description of the baselines used in the summarization task can be found in Section 3. In Section 4, we describe our experimental setup and the parameters of the models. Both evaluations of summarization models, the automatic one with the ROUGE metric and the linguistic one, are reported in Section 5 and Section 6, respectively. Examples of models’ outputs and some errors they make are described in Section 7. Finally, discussion, conclusions and ideas for further research are presented in Sections 8 and 9.

SAMSum Corpus

Initial approach. Since there was no available corpus of messenger conversations, we considered two approaches to build it: (1) using existing datasets of documents, which have a form similar to chat conversations, (2) creating such a dataset by linguists.

In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcription of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations that are typically written in messenger apps, e.g. they were too technical (IRC data), too long (comments data, transcription of meetings), lacked context (movie dialogues) or they were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol.

As a consequence, we decided to create a chat dialogue dataset by constructing such conversations that would epitomize the style of a messenger app.

Process of building the dataset. Our dialogue summarization dataset contains natural messenger-like conversations created and written down by linguists fluent in English. The style and register of conversations are diversified – dialogues could be informal, semi-formal or formal, they may contain slang phrases, emoticons and typos. We asked linguists to create conversations similar to those they write on a daily basis, reflecting the proportion of topics of their real-life messenger conversations. It includes chit-chats, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Therefore, this dataset does not contain any sensitive data or fragments of other corpora.

Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that they should (1) be rather short, (2) extract important pieces of information, (3) include names of interlocutors, (4) be written in the third person. Each dialogue contains only one reference summary.

Validation. Since the SAMSum corpus contains dialogues created by linguists, the question arises whether such conversations are really similar to those typically written via messenger apps. To find the answer, we performed a validation task. We asked two linguists to doubly annotate 50 conversations in order to verify whether the dialogues could appear in a messenger app and could be summarized (i.e. a dialogue is not too general or unintelligible) or not (e.g. a dialogue between two people in a shop). The results revealed that 94% of examined dialogues were classified by both annotators as good i.e. they do look like conversations from a messenger app and could be condensed in a reasonable way. In a similar validation task, conducted for the existing dialogue-type datasets (described in the Initial approach section), the annotators agreed that only 28% of the dialogues resembled conversations from a messenger app.

Cleaning data. After preparing the dataset, we conducted a process of cleaning it in a semi-automatic way. Beforehand, we specified a format for written dialogues with summaries: a colon should separate an author of utterance from its content, each utterance is expected to be in a separate line. Therefore, we could easily find all deviations from the agreed structure – some of them could be automatically fixed (e.g. when instead of a colon, someone used a semicolon right after the interlocutor’s name at the beginning of an utterance), others were passed for verification to linguists. We also tried to correct typos in interlocutors’ names (if one person has several utterances, it happens that, before one of them, there is a typo in his/her name) – we used the Levenshtein distance to find very similar names (possibly with typos e.g. ’George’ and ’Goerge’) in a single conversation, and those cases with very similar names were passed to linguists for verification.
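The name check described above can be sketched as follows. This is a minimal illustration, not the authors' cleaning script: the function names, the colon-based parsing and the distance threshold of 2 are assumptions made for the example.

```python
from itertools import combinations
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def parse_speakers(dialogue: str) -> List[str]:
    """Collect distinct speaker names, assuming the 'Name: text' line format."""
    speakers: List[str] = []
    for line in dialogue.splitlines():
        name, sep, _ = line.partition(":")
        name = name.strip()
        if sep and name and name not in speakers:
            speakers.append(name)
    return speakers


def suspicious_name_pairs(dialogue: str, max_distance: int = 2) -> List[Tuple[str, str]]:
    """Return pairs of distinct names that are close enough to be typos.

    max_distance is a hypothetical threshold; flagged pairs would still
    need manual verification.
    """
    names = parse_speakers(dialogue)
    return [(a, b) for a, b in combinations(names, 2)
            if 0 < levenshtein(a, b) <= max_distance]
```

Calling `suspicious_name_pairs` on a conversation in which both ’George’ and ’Goerge’ appear as speakers would flag that pair for review, while clearly different names are left alone.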

Description. The created dataset is made of 16369 conversations distributed uniformly into 4 groups based on the number of utterances in conversations: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations are dialogues between two interlocutors (about 75% of all conversations); the rest are between three or more people. Table 1 presents the size of the dataset split used in our experiments. An example of a dialogue from this corpus is shown in Table 2.

Dialogues baselines

The baseline commonly used in the news summarization task is Lead-3 See et al. (2017), which takes three leading sentences of the document as the summary. The underlying assumption is that the beginning of the article contains the most significant information. Inspired by the Lead-n model, we propose a few different simple models:

MIDDLE-n, which takes n utterances from the middle of the dialogue,

LONGEST-n, treating only n longest utterances in order of length as a summary,

LONGER-THAN-n, taking only utterances longer than n characters in order of length (if there is no such long utterance in the dialogue, takes the longest one),

MOST-ACTIVE-PERSON, which treats all utterances of the most active person in the dialogue as a summary.
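The baselines listed above are simple enough to sketch in a few lines. The code below is one possible reading, under several assumptions that the text does not fix: length is measured in characters of the whole utterance (speaker name included), "in order of length" means longest first, MIDDLE-n takes a centred window, and a dialogue is a non-empty list of ’Name: text’ strings.

```python
from collections import Counter
from typing import List


def lead_n(utterances: List[str], n: int) -> List[str]:
    """First n utterances (the Lead-n idea applied to dialogues)."""
    return utterances[:n]


def middle_n(utterances: List[str], n: int) -> List[str]:
    """n utterances from a window centred in the dialogue (assumed reading)."""
    start = max((len(utterances) - n) // 2, 0)
    return utterances[start:start + n]


def longest_n(utterances: List[str], n: int) -> List[str]:
    """The n longest utterances, longest first."""
    return sorted(utterances, key=len, reverse=True)[:n]


def longer_than_n(utterances: List[str], n: int) -> List[str]:
    """Utterances longer than n characters, longest first.

    Falls back to the single longest utterance when none qualifies.
    """
    selected = sorted((u for u in utterances if len(u) > n), key=len, reverse=True)
    return selected or [max(utterances, key=len)]


def most_active_person(utterances: List[str]) -> List[str]:
    """All utterances of the speaker with the most turns."""
    speakers = [u.split(":", 1)[0].strip() for u in utterances]
    top_speaker, _ = Counter(speakers).most_common(1)[0]
    return [u for u, s in zip(utterances, speakers) if s == top_speaker]
```

Each function returns a list of utterances; joining them with spaces gives the baseline summary that would be scored against the reference.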

Results of the evaluation of the above models are reported in Table 3. There is no obvious baseline for the task of dialogue summarization. We expected rather low results for Lead-3, as the beginnings of the conversations usually contain greetings, not the main part of the discourse. However, it seems that in our dataset greetings are frequently combined with question-asking or information passing (sometimes they are even omitted) and such a baseline works even better than the MIDDLE baseline (taking utterances from the middle of a dialogue). Nevertheless, the best dialogue baseline turns out to be the LONGEST-3 model.

Experimental setup

This section describes the settings used in our experiments.

In order to build a dialogue summarization model, we adopt the following strategies: (1) each candidate architecture is trained and evaluated on the dialogue dataset; (2) each architecture is trained on the train set of CNN/Daily Mail joined together with the train set of the dialogue data, and evaluated on the dialogue test set.

In addition, we prepare a version of dialogue data, in which utterances are separated with a special token called the separator (an artificially added token, e.g. ’<EOU>’ for models using word embeddings, ’|’ for models using subword embeddings). In all our experiments, news and dialogues are truncated to 400 tokens, and summaries – to 100 tokens. The maximum length of generated summaries was not limited.
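The preprocessing described above might look roughly like the sketch below. Whitespace tokenization and the function names are assumptions made for illustration; the actual tokenization depends on each model's toolkit.

```python
from typing import List, Tuple


def join_with_separator(utterances: List[str], separator: str = "<EOU>") -> str:
    """Concatenate utterances, placing the separator token between them."""
    return f" {separator} ".join(utterances)


def prepare_example(
    utterances: List[str],
    summary: str,
    separator: str = "<EOU>",
    max_source_tokens: int = 400,
    max_target_tokens: int = 100,
) -> Tuple[List[str], List[str]]:
    """Build a (source, target) token pair with the truncation limits from the text.

    Tokens are obtained by whitespace splitting, which is a simplification.
    """
    source = join_with_separator(utterances, separator).split()
    target = summary.split()
    return source[:max_source_tokens], target[:max_target_tokens]
```

For subword models the same function could be called with `separator="|"`.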

Models

We carry out experiments with the following summarization models (for all architectures we set the beam size for beam search decoding to 5):

Pointer generator network See et al. (2017). In the case of Pointer Generator, we use a default configuration (https://github.com/abisee/pointer-generator), changing only the minimum length of the generated summary from 35 (used in news) to 15 (used in dialogues).

Transformer Vaswani et al. (2017). The model is trained using the OpenNMT library (https://github.com/OpenNMT/OpenNMT-py). We use the same parameters for training both on news and on dialogues (http://opennmt.net/OpenNMT-py/Summarization.html), changing only the minimum length of the generated summary – 35 for news and 15 for dialogues.

Fast Abs RL Chen and Bansal (2018). It is trained using its default parameters (https://github.com/ChenRocks/fast_abs_rl). For dialogues, we change the convolutional word-level sentence encoder (used in the extractor part) to use only a kernel of size 3 instead of the 3-5 range. This is because some utterances are very short and the default setting is unable to handle them.

Fast Abs RL Enhanced. An additional variant of the Fast Abs RL model with slightly changed utterances, i.e. to each utterance, at the end, after an artificial separator, we add the names of all other interlocutors. The reason is that Fast Abs RL requires text to be split into sentences (as it selects sentences and then paraphrases each of them). For dialogues, we divide text into utterances (which are a natural unit in conversations), so sometimes a single utterance may contain more than one sentence. Taking into account how this model works, it may happen that it selects an utterance of a single person (each utterance starts with the name of the author of the utterance) and has no information about other interlocutors (if names of other interlocutors do not appear in selected utterances), so it may have no chance to use the right people’s names in generated summaries.
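The enhancement step can be illustrated with a short sketch. The separator symbol, the way names are joined and the function name are assumptions; the text only states that the names of the other interlocutors are appended after an artificial separator.

```python
from typing import List


def enhance_utterances(utterances: List[str], separator: str = "|") -> List[str]:
    """Append the names of all other interlocutors to every utterance.

    Assumes each utterance has the form 'Name: text'. The output layout
    ('<utterance> <separator> <other names>') is an illustrative choice.
    """
    authors = [u.split(":", 1)[0].strip() for u in utterances]
    speakers: List[str] = []
    for name in authors:  # keep first-appearance order, without duplicates
        if name not in speakers:
            speakers.append(name)
    enhanced = []
    for utterance, author in zip(utterances, authors):
        others = " ".join(s for s in speakers if s != author)
        enhanced.append(f"{utterance} {separator} {others}")
    return enhanced
```

With this change, even if the extractor selects only one person's utterance, the other participants' names are still present in the text passed to the abstractor.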

LightConv and DynamicConv Wu et al. (2019). The implementation is available in fairseq (https://github.com/pytorch/fairseq) Ott et al. (2019). We train lightweight convolution models in two manners: (1) learning token representations from scratch; in this case we apply BPE tokenization with the vocabulary of 30K types, using the fastBPE implementation (https://github.com/glample/fastBPE) Sennrich et al. (2015); (2) initializing token embeddings with pre-trained language model representations; as a language model we choose GPT-2 small Radford et al. (2019).

Evaluation metrics

We evaluate models with the standard ROUGE metric Lin (2004), reporting the $F_{1}$ scores (with stemming) for ROUGE-1, ROUGE-2 and ROUGE-L following previous works Chen and Bansal (2018); See et al. (2017). We obtain scores using the py-rouge package (https://pypi.org/project/py-rouge/).
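To make the metric concrete, the sketch below computes a simplified ROUGE-N $F_{1}$ score from n-gram overlap. It is not the py-rouge implementation: it uses lowercased whitespace tokens, applies no stemming, and omits ROUGE-L, so its numbers would differ from those reported in the tables.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: clipped n-gram overlap, no stemming."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on shared n-grams, two sentences that use the same words with the subject and object swapped receive a perfect ROUGE-1 score in this sketch, although their meanings differ.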

Results

The results for the news summarization task are shown in Table 4 and for the dialogue summarization – in Table 5. In both domains, the best models’ ROUGE-1 exceeds $39$ , ROUGE-2 – $17$ and ROUGE-L – $36$ . Note that the strong baseline for news (Lead-3) is outperformed in all three metrics only by one model. In the case of dialogues, all tested models perform better than the baseline (LONGEST-3).

In general, the Transformer-based architectures benefit from training on the joint dataset: news+dialogues, even though the news and the dialogue documents have very different structures. Interestingly, this does not seem to be the case for the Pointer Generator or Fast Abs RL model.

The inclusion of a separation token between dialogue utterances is advantageous for most models – presumably because it improves the discourse structure. The improvement is most visible when training is performed on the joint dataset.

Having compared two variants of the Fast Abs RL model – with original utterances and with enhanced ones (see Section 4.2), we conclude that enhancing utterances with information about the other interlocutors helps achieve higher ROUGE values.

The largest improvement of the model performance is observed for LightConv and DynamicConv models when they are complemented with pretrained embeddings from the language model GPT-2, trained on enormous corpora.

It is also worth noting that some models (Pointer Generator, Fast Abs RL), trained only on the dialogue corpus (which has 16k dialogues), reach a similar (or better) level in terms of ROUGE metrics compared with models trained on the CNN/DM news dataset (which has more than 300k articles). Adding pretrained embeddings and training on the joined dataset helps in achieving significantly higher values of ROUGE for dialogues than the best models achieve on the CNN/DM news dataset.

According to ROUGE metrics, the best performing model is DynamicConv with GPT-2 embeddings, trained on joined news and dialogue data with an utterance separation token.

Linguistic verification of summaries

ROUGE is a standard way of evaluating the quality of machine generated summaries by comparing them with reference ones. The metric based on n-gram overlapping, however, may not be very informative for abstractive summarization, where paraphrasing is a key point in producing high-quality sentences. To quantify this conjecture, we manually evaluated summaries generated by the models for 150 news and 100 dialogues. We asked two linguists to mark the quality of every summary on the scale of $-1$, $0$, $1$, where $-1$ means that a summarization is poor, extracts irrelevant information or does not make sense at all, $1$ means that it is understandable and gives a brief overview of the text, and $0$ stands for a summarization that extracts only a part of relevant information, or makes some mistakes in the produced summary.

We noticed a few annotations (7 for news and 4 for dialogues) with opposite marks (i.e. one annotator’s judgement was $-1$, whereas the second one was $1$) and decided to have them annotated once again by another annotator who had to resolve conflicts. For the rest, we calculated the linear weighted Cohen’s kappa coefficient McHugh (2012) between annotators’ scores. For news examples, we obtained agreement on the level of $0.371$ and for dialogues – $0.506$. The annotators’ agreement is higher on dialogues than on news, probably because of the structure of the data – articles are often long and it is difficult to decide what the key point of the text is; dialogues, on the contrary, are rather short and focused mainly on one topic.
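For readers unfamiliar with the agreement measure, the sketch below computes linear weighted Cohen's kappa for two raters on the ordered scale $-1$, $0$, $1$, using the standard formulation $\kappa_w = 1 - \sum_{ij} w_{ij} O_{ij} / \sum_{ij} w_{ij} E_{ij}$ with disagreement weights $w_{ij} = |i-j|/(k-1)$. The function name is hypothetical and the code is not taken from the authors' evaluation scripts.

```python
from typing import List, Sequence


def linear_weighted_kappa(
    rater_a: List[int],
    rater_b: List[int],
    categories: Sequence[int] = (-1, 0, 1),
) -> float:
    """Linear weighted Cohen's kappa for two raters over ordered categories."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions O[i][j].
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions of each rater, used for chance agreement E[i][j].
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 for opposite marks
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single identical category")
    return 1.0 - weighted_observed / weighted_expected
```

With linear weights, a disagreement between $-1$ and $1$ is penalized twice as much as a disagreement between adjacent marks, which suits an ordinal scale like the one used here.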

For manually evaluated samples, we calculated ROUGE metrics and the mean of two human ratings; the resulting statistics are presented in Table 6. As we can see, models generating dialogue summaries can obtain high ROUGE results, but their outputs are marked as poor by human annotators. Our conclusion is that the ROUGE metric corresponds with the quality of generated summaries for news much better than for dialogues, which is confirmed by Pearson’s correlation between human evaluation and the ROUGE metric, shown in Table 7.
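The comparison between human ratings and ROUGE can be sketched as below: average the two annotators' marks per summary, then compute Pearson's correlation with the ROUGE scores. The helper names are hypothetical, and any numbers fed to these functions in an example would be toy values, not results from Tables 6 or 7.

```python
from math import sqrt
from typing import List


def mean_human_rating(rater_a: List[int], rater_b: List[int]) -> List[float]:
    """Per-summary mean of two annotators' marks on the -1/0/1 scale."""
    return [(a + b) / 2 for a, b in zip(rater_a, rater_b)]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient between two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```

A coefficient close to 1 would indicate that ROUGE ranks summaries similarly to the annotators, while a value near 0 would indicate little linear relationship.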

Difficulties in dialogue summarization

In a structured text, such as a news article, the information flow is very clear. However, in a dialogue, which contains discussions (e.g. when people try to agree on a date of a meeting), questions (one person asks about something and the answer may appear a few utterances later) and greetings, most important pieces of information are scattered across the utterances of different speakers. What is more, articles are written in the third-person point of view, but in a chat everyone talks about themselves, using a variety of pronouns, which further complicates the structure. Additionally, people talking on messengers often are in a hurry, so they shorten words, use the slang phrases (e.g. ’u r gr8’ means ’you are great’) and make typos. These phenomena increase the difficulty of performing dialogue summarization.

Tables 8 and 9 show a few selected dialogues, together with summaries produced by the best tested models:

DynamicConv + GPT-2 embeddings with a separator (trained on news + dialogues),

DynamicConv + GPT-2 embeddings (trained on news + dialogues),

Fast Abs RL Enhanced (trained on dialogues),

Transformer (trained on news + dialogues).

One can easily notice problematic issues. Firstly, the models frequently have difficulties in associating names with actions, often repeating the same name, e.g., for Dialogue 1 in Table 8, Fast Abs RL generates the following summary: ’lilly and lilly are going to eat salmon’. To help the model deal with names, the utterances are enhanced by adding information about the other interlocutors – Fast Abs RL enhanced variant described in Section 4.2. In this case, after enhancement, the model generates a summary containing both interlocutors’ names: ’lily and gabriel are going to pasta…’. Sometimes models correctly choose speakers’ names when generating a summary, but make a mistake in deciding who performs the action (the subject) and who receives the action (the object), e.g. for Dialogue 4 DynamicConv + GPT-2 emb. w/o sep. model generates the summary ’randolph will buy some earplugs for maya’, while the correct form is ’maya will buy some earplugs for randolph’.

A closely related problem is capturing the context and extracting information about the arrangements after the discussion. For instance, for Dialogue 4, the Fast Abs RL model draws a wrong conclusion from the agreed arrangement. This issue is quite frequently visible in summaries generated by Fast Abs RL, which may be the consequence of the way it is constructed; it first chooses important utterances, and then summarizes each of them separately. This leads to the narrowing of the context and losing important pieces of information.

One more aspect of summary generation is deciding which information in the dialogue content is important. For instance, for Dialogue 3 DynamicConv + GPT-2 emb. with sep. generates a correct summary, but focuses on a piece of information different than the one included in the reference summary. In contrast, some other models – like Fast Abs RL enhanced – select both of the pieces of information appearing in the discussion. On the other hand, when summarizing Dialogue 5, the models seem to focus too much on the phrase ’it’s the best place’, intuitively not the most important one to summarize.

Discussion

This paper is a step towards abstractive summarization of dialogues by (1) introducing a new dataset, created for this task, (2) comparing it with news summarization by means of automated (ROUGE) and human evaluation.

Most of the tools and the metrics measuring the quality of text summarization have been developed for a single-speaker document, such as news; as such, they are not necessarily the best choice for conversations with several speakers.

We test a few general-purpose summarization models. In terms of human evaluation, the results of dialogue summarization are worse than the results of news summarization. This is connected with the fact that the dialogue structure is more complex – information is spread across multiple utterances, discussions and questions, and more typos and slang words appear there, posing new challenges for summarization. On the other hand, dialogues are divided into utterances, and for each utterance its author is assigned. We demonstrate in experiments that the models benefit from the introduction of separators, which mark utterances for each person. This suggests that dedicated models having some architectural changes, taking into account the assignment of a person to an utterance in a systematic manner, could improve the quality of dialogue summarization.

We show that the most popular summarization metric ROUGE does not reflect the quality of a summary. Looking at the ROUGE scores, one concludes that the dialogue summarization models perform better than the ones for news summarization. In fact, this hypothesis is not true – we performed an independent, manual analysis of summaries and we demonstrated that high ROUGE results, obtained for automatically-generated dialogue summaries, correspond with lower evaluation marks given by human annotators. An interesting example of the misleading behavior of the ROUGE metrics is presented in Table 9 for Dialogue 4, where a wrong summary – ’paul and cindy don’t like red roses.’ – obtained all ROUGE values higher than a correct summary – ’paul asks cindy what color flowers should buy.’. Despite lower ROUGE values, news summaries were scored higher by human evaluators. We conclude that when measuring the quality of model-generated summaries, the ROUGE metrics are more indicative for news than for dialogues, and a new metric should be designed to measure the quality of abstractive dialogue summaries.

Conclusions

In our paper we have studied the challenges of abstractive dialogue summarization. We have addressed a major factor that prevents researchers from engaging with this problem: the lack of a proper dataset. To the best of our knowledge, this is the first attempt to create a comprehensive resource of this type which can be used in future research. The next step could be creating an even more challenging dataset with longer dialogues that not only cover one topic, but span numerous different ones.

As shown, summarization of dialogues is much more challenging than of news. In order to perform well, it may require designing dedicated tools, but also new, non-standard measures to capture the quality of abstractive dialogue summaries in a relevant way. We hope to tackle these issues in future work.

Acknowledgments

We would like to express our sincere thanks to Tunia Błachno, Oliwia Ebebenge, Monika Jędras and Małgorzata Krawentek for their huge contribution to the corpus collection – without their ideas, management of the linguistic task and verification of examples we would not be able to create this paper. We are also grateful for the reviewers’ helpful comments and suggestions.