QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization

Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, Dragomir Radev

Introduction

Meetings remain the go-to tool for collaboration, with 11 million meetings taking place each day in the USA and employees spending six hours a week, on average, in meetings Mroz et al. (2018). The emerging landscape of remote work is making meetings even more important and simultaneously taking a toll on our productivity and well-being Spataro (2020). The proliferation of meetings makes it hard to stay on top of this sheer volume of information and increases the need for automated methods for accessing key information exchanged during them. Meeting summarization Wang and Cardie (2013); Shang et al. (2018); Li et al. (2019); Zhu et al. (2020) is a task where summarization models are leveraged to generate summaries of entire meetings based on meeting transcripts. The resulting summaries distill the core contents of a meeting and help people catch up efficiently.

Most existing work and datasets on meeting summarization Janin et al. (2003); Carletta et al. (2005) pose the problem as a single document summarization task where a single summary is generated for the whole meeting. Unlike readers of news articles, who may be satisfied with a high-level summary, readers of meeting summaries are more likely to seek detailed information such as topics Li et al. (2019), opinions, actions, and decisions Wang and Cardie (2013). This raises the question of whether a single paragraph is enough to summarize the content of an entire meeting.

Figure 1 shows an example of a meeting about “remote control design”. The discussions in the meeting are multi-faceted and hence different users might be interested in different facets. For example, someone may be interested in learning about the new trends that may lead to the new product standing out, while others may be more interested in what other attendees thought about different elements of the design. It is challenging to compress or compose a short summary that contains all the salient information. Alternatively, summarization systems should adopt a more flexible and interactive approach that allows people to express their interests and caters to their diverse intents when generating summaries Dang (2005, 2006); Litvak and Vanetik (2017); Baumel et al. (2018).

With comprehensive consideration of the multi-granularity meeting contents, we propose a new task, query-based meeting summarization. To enable research in this area, we also create a high-quality multi-domain summarization dataset. In this task, as shown in Figure 1, given a query and a meeting transcript, a model is required to generate the corresponding summary. The query-based approach is a flexible setup that enables the system to satisfy different intents and different levels of granularity. Besides the annotated queries and corresponding gold summaries at different levels of granularity, our new dataset contains a rich set of annotations that include the main topics of each meeting and the ranges of relevant text spans for the annotated topics and each query. We adopt a hierarchical annotation structure that could not only assist people to find information faster, but also strengthen the models’ summarization capacity.

In this paper, we employ a two-stage meeting summarization approach: locate-then-summarize. Specifically, given a query, a model called Locator is used to locate the relevant utterances in the meeting transcripts, and then these extracted spans are used as an input to another model called Summarizer to generate a query-based summary. We present and evaluate several strong baselines based on state-of-the-art summarization models on QMSum. Our results and analysis from different perspectives reveal that the existing models struggle in solving this task, highlighting the challenges the models face when generating query-based meeting summaries. We are releasing our dataset and baselines to support additional research in query-focused meeting summarization.

Overall, our contributions are listed as follows: 1) We propose a new task, query-based multi-domain meeting summarization, and build a new benchmark QMSum with a hierarchical annotation structure. 2) We design a locate-then-summarize model and conduct comprehensive experiments on its strong variants and different training settings. 3) By human evaluation, we further pose the challenges of the new task, including the impact of different query types and factuality errors.

Related Work

Most prior work in text summarization Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016); See et al. (2017); Celikyilmaz et al. (2018); Chen and Bansal (2018); Zhong et al. (2019a); Xu and Durrett (2019); Liu and Lapata (2019); Lebanoff et al. (2019); Cho et al. (2019); Zhong et al. (2020); Wang et al. (2020); Xu et al. (2019); Jia et al. (2020) investigates how to generate better summaries on news article data, such as CNN/DailyMail Hermann et al. (2015), Newsroom Grusky et al. (2018), etc. Scientific paper summarization is another important branch Cohan et al. (2018); Yasunaga et al. (2019); An et al. (2021). Our paper mainly focuses on meeting summarization, a more challenging task compared to news summarization. With the burst of demand for meeting summarization, this task is attracting more and more interest from academia Wang and Cardie (2013); Oya et al. (2014); Shang et al. (2018); Zhu et al. (2020) and is becoming an emerging branch of text summarization.

Query-based Summarization

Query-based summarization aims to generate a brief summary according to a source document and a given query. Several works have studied this task Daumé III and Marcu (2006); Otterbacher et al. (2009); Wang et al. (2016); Litvak and Vanetik (2017); Nema et al. (2017); Baumel et al. (2018); Ishigaki et al. (2020); Kulkarni et al. (2020); Laskar et al. (2020). However, these models focus on news Dang (2005, 2006), debate Nema et al. (2017), and Wikipedia Zhu et al. (2019). Meetings are also a genre of discourse where query-based summarization could be applied, but to the best of our knowledge, no prior work studies this direction.

Meeting Summarization

Meeting summarization has attracted a lot of interest recently Chen and Metze (2012); Wang and Cardie (2013); Mehdad et al. (2013); Oya et al. (2014); Shang et al. (2018); Li et al. (2019); Zhu et al. (2020); Koay et al. (2020). Specifically, Mehdad et al. (2013) leverage entailment graphs and a ranking strategy to generate meeting summaries. Wang and Cardie (2013) attempt to make use of decisions, action items and progress to generate summaries of whole meetings. Oya et al. (2014) leverage the relationship between summaries and the meeting transcripts to extract templates and generate summaries with the guidance of the templates. Shang et al. (2018) utilize multi-sentence compression techniques to generate summaries under an unsupervised setting. Li et al. (2019) attempt to incorporate multi-modal information to facilitate meeting summarization. Zhu et al. (2020) propose a model which builds a hierarchical structure on word-level and turn-level information and uses news summary data to alleviate the inadequacy of meeting data.

Unlike previous work, instead of merely generating a summary of the complete meeting, we propose a novel task that focuses on summarizing multi-granularity contents, which caters to different people’s needs across the entire meeting and helps people comprehensively understand meetings.

Data Construction

In this section, we show how we collected meeting data from three different domains: academic meetings, product meetings, and committee meetings. In addition, we show how we annotated the three types of meeting data while ensuring annotation quality for query-based meeting summarization.

We introduce the three types of meetings that we used to annotate query-summary pairs.

AMI (http://groups.inf.ed.ac.uk/ami/download/) Carletta et al. (2005) is a dataset of meetings about product design in an industrial setting. It consists of 137 meetings about how to design a new remote control, from kick-off to completion over the course of a day. It contains meeting transcripts and their corresponding meeting summaries.

ICSI (http://groups.inf.ed.ac.uk/ami/icsi/index.shtml) Janin et al. (2003) is an academic meeting dataset composed of 59 weekly group meetings at the International Computer Science Institute (ICSI) in Berkeley, and their summaries. Unlike AMI, the contents of ICSI meetings are specific to research discussions among students.

Parliamentary committee meetings are another important domain of meetings. These meetings focus on formal discussions of a wide range of issues (e.g., the reform of the education system, public health, etc.). Also, committee meetings are publicly available, which enables us to access large quantities of meetings. We include 25 committee meetings of the Welsh Parliament (https://record.assembly.wales/) and 11 from the Parliament of Canada (https://www.ourcommons.ca/Committees/en/Home) in our dataset.

Annotation Pipeline

After collecting meeting transcripts, we recruited annotators and required them to follow the annotation instructions. As illustrated in Figure 2, the annotation process is composed of three stages: topic segmentation, query generation, and query-based summarization.

Meeting transcripts are usually long and contain discussions about multiple topics. To assist further annotations, we asked annotators to write down the main topics discussed in the meetings, and their relevant text spans, which makes the meeting structure clear. As shown in Figure 2, “scope of the project and team building” is one of the annotated main topics, and its relevant text spans are (Turn 25 - 50, Turn 73 - 89). More details are listed in Appendix A.2.1.

Towards the query-based task, we further asked annotators to design queries by themselves. To cater to the need for multi-granularity contents, we categorized two types of queries: queries related to general information (e.g., the contents of whole meetings, etc.) are called general queries; queries focusing on relatively detailed information (e.g., the discussion about certain topics, etc.) are called specific queries.

To alleviate the influence of extremely hard queries and focus on the evaluation of query-based summarization capacity, rather than designing queries in an unconstrained way, we asked annotators to generate queries according to the schema. Details of the query schema list are shown in Appendix A.1. The list consists of important facets people might be interested in, including overall contents of discussions, speakers’ opinions, the reasons why a speaker proposed an idea, etc., which cover the most common queries over meetings involving multiple people discussing several topics.

To query multi-granularity meeting contents, we further divided the query schema list into general and specific ones, and asked annotators to design queries towards general and specific meeting contents, respectively. In terms of general query generation, the annotators were asked to design 1 - 2 general queries according to the general schema list. For specific query generation, annotators were asked to first select 2 - 4 main topics and their relevant text spans, and then design around 3 specific queries based on the specific schema list for each main topic. To ensure that the task is summarization rather than question answering, we asked annotators to design queries whose relevant text spans cover more than 10 turns or 200 words. Our proposed task therefore differs from question answering tasks, where models merely need to extract phrases or generate answers based on short text spans, and instead focuses on how to summarize large stretches of text. Additional details are in Appendix A.2.2.
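The span-size constraint above can be checked mechanically. The following is a minimal sketch, assuming a transcript stored as a list of utterance strings indexed by turn, inclusive (start, end) turn spans, and whitespace word counting; the function names are hypothetical and not part of the released annotation tooling.

```python
def span_size(transcript, spans):
    """Return (num_turns, num_words) covered by inclusive (start, end) turn spans.

    Assumption: `transcript` is a list of utterance strings, indexed by turn.
    """
    turns = set()
    for start, end in spans:
        turns.update(range(start, end + 1))
    words = sum(len(transcript[i].split()) for i in turns)
    return len(turns), words


def is_summarization_query(transcript, spans, min_turns=10, min_words=200):
    """True if the relevant spans cover more than 10 turns or more than 200 words."""
    num_turns, num_words = span_size(transcript, spans)
    return num_turns > min_turns or num_words > min_words
```

A query whose spans fail this check would look more like a question answering item than a summarization item under the criterion stated above.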

Given the designed queries and meeting transcripts, annotators were asked to write faithful summaries. Consistency with the meeting transcripts and queries is the most important criterion. We also required annotators to write informative summaries. For example, they could add more details about the reasons why the group/committee made such decisions, and which important ideas the group/committee members proposed, etc. Besides, the annotated summaries should be abstractive, fluent and concise. We set word limits for the answers to general queries (50 - 150 words) and specific queries (20 - 100 words) to keep them concise. More details are shown in Appendix A.2.3.

In the end, we organized all the meeting data after completing the three annotation stages. Detailed annotations of one product meeting and one committee meeting are shown in Appendix A.4. Each meeting transcript is accompanied by annotated main topics, queries, their corresponding summaries, and relevant text span information.

Additional Details of Annotation Process

This section describes how we recruited annotators and how we reviewed the annotations.

To guarantee annotation quality given the complexity of the task, instead of posting tasks on Amazon Mechanical Turk, we anonymously recruited undergraduate students who are fluent in English. The annotation team consists of 2 native speakers and 10 non-native speakers majoring in English literature.

To help the annotators fully comprehend the instructions, annotators were trained in a pre-annotation process. Annotations were reviewed across all stages of our data collection process by experts in this annotation task. More details of the review standards can be found in Appendix A.3.

Dataset Statistics and Comparison

Statistics of the final QMSum dataset are shown in Table 1. The QMSum dataset has several advantages compared with previous datasets.

QMSum includes 232 meetings, which makes it the largest meeting summarization dataset to the best of our knowledge. For each query, there is a manual annotation of the corresponding text span in the original meeting, and there are a total of 1,808 query-summary pairs in QMSum. Following previous work, we randomly select about 15% of the meetings as the validation set, and another 15% as the test set.
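A meeting-level random split of this kind can be sketched as follows. This is only an illustration: the seed, the rounding rule, and the function name `split_meetings` are our assumptions and do not reproduce the released split.

```python
import random


def split_meetings(meeting_ids, valid_frac=0.15, test_frac=0.15, seed=0):
    """Randomly split meeting ids into train / validation / test lists.

    Assumptions: rounding to the nearest integer and a fixed seed; the
    actual split procedure used for the dataset may differ.
    """
    ids = list(meeting_ids)
    random.Random(seed).shuffle(ids)
    n_valid = round(len(ids) * valid_frac)
    n_test = round(len(ids) * test_frac)
    valid = ids[:n_valid]
    test = ids[n_valid:n_valid + n_test]
    train = ids[n_valid + n_test:]
    return train, valid, test
```

Splitting by meeting rather than by query keeps all queries of one meeting in the same partition, so no transcript is shared between training and evaluation.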

The average length of summaries in QMSum (69.6) is much shorter than that of the previous AMI and ICSI datasets. This is because our dataset also focuses on specific contents of the meetings, whose corresponding summaries are naturally short. It leaves a challenge about how to precisely capture the related information and compress it into a brief summary.

Previous datasets are specific to a single domain. However, a model trained on the summarization data of a single domain usually has poor generalization ability Wang et al. (2019); Zhong et al. (2019b); Chen et al. (2020). Therefore, QMSum contains meetings across multiple domains: Product, Academic and Committee meetings. We expect that our dataset could provide a venue to evaluate the model’s generalization ability on meetings of different domains and help create more robust models.

Method

In this section, we first define the task of query-based meeting summarization, then describe our two-stage locate-then-summarize solution in detail.

Existing meeting summarization methods define the task as a sequence-to-sequence problem. Specifically, each meeting transcript $X=(x_{1},x_{2},\cdots,x_{n})$ consists of $n$ turns, and each turn $x_{i}$ represents the utterance $u_{i}$ and its speaker $s_{i}$, that is, $x_{i}=(u_{i},s_{i})$. Additionally, each utterance contains $l_{i}$ words $u_{i}=(w_{1},\cdots,w_{l_{i}})$. The objective is to generate a target summary $Y=(y_{1},y_{2},\cdots,y_{m})$ by modeling the conditional distribution $p(y_{1},y_{2},\cdots,y_{m}|(u_{1},s_{1}),\cdots,(u_{n},s_{n}))$.

However, meetings are usually long conversations involving multiple topics and including important decisions on many different matters, so it is necessary and practical to use queries to summarize a certain part of the meeting. Formally, we introduce a query $Q=(w_{1},\cdots,w_{|Q|})$ for the meeting summarization task; the objective is to generate a summary $Y$ by modeling $p(y_{1},y_{2},\cdots,y_{m}|Q,(u_{1},s_{1}),\cdots,(u_{n},s_{n}))$.
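For reference, the query-conditioned distribution can be expanded with the chain rule of probability. This factorization is not spelled out in the text above; we state it as the standard form assumed by autoregressive sequence-to-sequence decoders, using the same symbols as the task definition:

```latex
p(y_{1},\cdots,y_{m} \mid Q,(u_{1},s_{1}),\cdots,(u_{n},s_{n}))
  = \prod_{t=1}^{m} p\big(y_{t} \mid y_{1},\cdots,y_{t-1},\, Q,\, (u_{1},s_{1}),\cdots,(u_{n},s_{n})\big)
```

Dropping $Q$ from the conditioning side recovers the query-free formulation used by existing meeting summarization methods.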

Locator

In our two-stage pipeline, the first step requires a model to locate the relevant text spans in the meeting according to the queries, and we call this model a Locator. We need a Locator because most existing abstractive models cannot process long texts such as meeting transcripts, so we have to extract shorter, query-related paragraphs as input to the following Summarizer.

We mainly utilize two methods to instantiate our Locator: Pointer Network Vinyals et al. (2015) and a hierarchical ranking-based model. Pointer Network has achieved widespread success in extractive QA tasks Wang and Jiang (2017). For each question, it points to a (start, end) pair in the source document, and the resulting span is the predicted answer. Specific to our task, Pointer Network points to the start turn and the end turn for each query. It is worth noting that one query can correspond to multiple spans in our dataset, so we always extract three spans as the corresponding text for each query when we use Pointer Network as Locator in the experiments.

In addition, we design a hierarchical ranking-based model structure as the Locator. As shown in Figure 3, we first input the tokens in each turn to a feature-based BERT to obtain the word embeddings, where feature-based means we fix the parameters of BERT, so it is actually an embedding layer. Next, a CNN Kim (2014) is applied as a turn-level encoder to capture local features such as bigrams and trigrams in each turn. Here we do not use Transformer because previous work Kedzie et al. (2018) shows that this component does not matter too much for the final performance. We combine different features to represent the utterance $\mathbf{u}_{i}$ in each turn, and concatenate the speaker embedding $\mathbf{s}_{i}$ to form the turn-level representation: $\mathbf{x}_{i}=[\mathbf{u}_{i};\mathbf{s}_{i}]$, where $[;]$ denotes concatenation and $\mathbf{s}_{i}$ is a randomly initialized vector representing the speaking style of meeting participants.

Then these turn representations will be contextualized by a document-level Transformer Vaswani et al. (2017) encoder.

Next, we introduce a query embedding $\mathbf{q}$, which is obtained by a CNN (sharing parameters with the CNN in the turn-level encoder), and use an MLP to score each turn.

We use binary cross-entropy loss to train our Locator. Finally, turns with the highest scores are selected as the relevant text spans of each query and will be inputted to the subsequent Summarizer.
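The training objective and the final selection step of the ranking-based Locator can be sketched in a few lines. This is a minimal illustration, assuming the MLP already outputs one relevance probability per turn; the encoder itself is omitted, and both function names as well as the choice to return turns in meeting order are our assumptions.

```python
import math


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between per-turn relevance probabilities
    and 0/1 gold labels (1 = turn lies inside a gold relevant span)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def select_top_turns(scores, k):
    """Indices of the k highest-scoring turns.

    Assumption: selected turns are returned in original meeting order so
    that the Summarizer sees them chronologically.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For example, `select_top_turns(scores, k=len(scores) // 6)` would keep roughly one sixth of the turns.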

Summarizer

Given the relevant paragraphs, our goal in the second stage is to summarize the selected text spans based on the query. We instantiate our Summarizer with the current powerful abstractive models to explore whether the query-based meeting summarization task on our dataset is challenging. To be more specific, we choose the following three models:

Pointer-Generator Network See et al. (2017) is a popular sequence-to-sequence model with copy mechanism and coverage loss, and it acts as a baseline system in many generation tasks. The input to Pointer-Generator Network (PGNet) is the concatenation “Query Relevant Text Spans”, i.e., the query followed by the relevant text spans.

BART Lewis et al. (2020) is a denoising pre-trained model for language generation, translation and comprehension. It has achieved new state-of-the-art results on many generation tasks, including summarization and abstractive question answering. The input to BART is the same as PGNet.

HMNet Zhu et al. (2020) is the state-of-the-art meeting summarization model. It contains a hierarchical structure to process long meeting transcripts and a role vector to depict the difference among speakers. Besides, a cross-domain pretraining process is also included in this strong model. We add a turn representing the query at the beginning of the meeting as the input of HMNet.

Experiments

In this section, we introduce the implementation details, effectiveness of Locator, experimental results and multi-domain experiments on QMSum.

For our ranking-based Locator, the dimension of the speaker embedding is 128 and the dimension of the turn and query embeddings is 512. Notably, we find that removing the Transformer in Locator has little impact on performance, so the Locator without Transformer is used in all the experiments. To reduce the burden of the abstractive models, we utilize Locator to extract 1/6 of the original text and input it to Summarizer. The hyperparameters used by PGNet and HMNet are consistent with the original papers. Due to the limitation of computing resources, we use the base version of pre-trained models (including feature-based BERT and BART) in this paper. We use the fairseq library (https://github.com/pytorch/fairseq/tree/master/examples/bart) to implement the BART model. For PGNet and BART, we truncate the input text to 2,048 tokens, and remove the turns whose lengths are less than 5. All results reported in this paper are averages of three runs.
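The input preprocessing for PGNet and BART can be illustrated as below. This is a sketch under stated assumptions: turn length is measured in whitespace-separated tokens, short turns are removed before truncation, and the function name is hypothetical; the actual pipeline uses the models' own tokenizers and may order the two steps differently.

```python
def preprocess_turns(turns, min_len=5, max_tokens=2048):
    """Drop turns shorter than `min_len` tokens, then truncate the
    concatenated input to `max_tokens` tokens.

    Assumption: whitespace tokenization stands in for the real tokenizer.
    """
    tokens = []
    for turn in turns:
        words = turn.split()
        if len(words) < min_len:
            continue  # skip very short turns such as backchannels
        tokens.extend(words)
    return tokens[:max_tokens]
```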

Effectiveness of Locator

First, we need to verify the effectiveness of the Locator to ensure that it can extract spans related to the query. Instead of the accuracy of capturing relevant text spans, we focus on the extent of overlap between the selected text spans and the gold relevant text spans, because it is essential for Summarizer that the summarization process is built on contexts similar to those of the references. Therefore, we use ROUGE-L recall to evaluate the performance of different models under the setting of extracting the same number of turns.
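To make the metric concrete, the following sketch computes a simplified ROUGE-L recall: the length of the longest common subsequence (LCS) between the extracted text and the gold text, divided by the length of the gold text. It assumes lowercased whitespace tokens and treats each text as a single sequence; the official ROUGE toolkit applies additional processing, so its scores will not match this sketch exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS(candidate, reference) / len(reference) on whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(cand, ref) / len(ref)
```

Because recall only normalizes by the gold length, extracting more turns can never lower the score, which is why the comparison fixes the number of extracted turns.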

We introduce two additional baselines: Random and Similarity. The former refers to randomly extracting a fixed number of turns from the meeting content, while the latter denotes that we obtain turn embeddings and query embeddings through a feature-based BERT, and then extract the most similar turns by cosine similarity. As shown in Table 2, because there are usually a large number of repeated conversations in the meetings, Random can get a good ROUGE-L recall score, which can be used as a baseline to measure the performance of the model. Similarity performs badly, even worse than Random, which may be due to the great difference in style between the BERT pre-training corpus and meeting transcripts. Pointer Network is only slightly better than Random. We think this is because, in texts with an average of more than 500 turns, only three (start, end) pairs are given as supervision signals, which is not very informative and therefore is not conducive to model learning.
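The Similarity baseline reduces to a cosine-similarity ranking once embeddings are available. The sketch below assumes the turn and query embeddings have already been computed and are given as plain lists of floats; how the BERT token features are pooled into one vector is not specified above, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_turns(query_vec, turn_vecs, k):
    """Indices of the k turns most similar to the query, in meeting order."""
    sims = [cosine(query_vec, t) for t in turn_vecs]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return sorted(ranked[:k])
```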

In contrast, our hierarchical ranking-based Locator always greatly exceeds the random score, which demonstrates that it can indeed extract more relevant spans in the meeting. Even when only 1/6 of the original text is extracted, it reaches a 72.51 ROUGE-L recall score, which significantly reduces the burden on the subsequent Summarizer of processing long text while preserving the amount of information.

Experimental Results on QMSum

For comparison, we introduce two basic baselines: Random and Extractive Oracle. We randomly sample 10 turns of the original meeting for each query as an answer and this is the Random baseline in Table 3. Besides, we implement the Extractive Oracle, which is a greedy algorithm for extracting the highest-scoring sentences, usually regarded as the upper bound of the extractive method Nallapati et al. (2017). An unsupervised method, TextRank, is also included in our experiment. We treat each turn as a node and add a query node to fully connect all nodes. Finally, the 10 turns with the highest scores are selected as the summary.
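A greedy extractive oracle of this kind can be sketched as follows. This is an illustration only: we substitute a unigram-overlap F1 for the ROUGE score normally used to rank candidates, and the function names and the stopping rule (stop when no turn improves the score) are assumptions in the spirit of Nallapati et al. (2017).

```python
from collections import Counter


def unigram_f1(selected_tokens, ref_tokens):
    """Unigram-overlap F1, used here as a simple stand-in for ROUGE."""
    if not selected_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(selected_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(turns, reference, max_turns=10):
    """Greedily add the turn that most improves the score against the gold
    summary; stop when no turn helps or `max_turns` is reached."""
    ref = reference.lower().split()
    tokenized = [t.lower().split() for t in turns]
    chosen, best = [], 0.0
    while len(chosen) < max_turns:
        candidates = []
        for i in range(len(turns)):
            if i in chosen:
                continue
            text = [w for j in sorted(chosen + [i]) for w in tokenized[j]]
            candidates.append((unigram_f1(text, ref), i))
        if not candidates:
            break
        score, idx = max(candidates)
        if score <= best:
            break  # no remaining turn improves the score
        chosen.append(idx)
        best = score
    return sorted(chosen)
```

Because the oracle looks at the gold summary, it is only an upper-bound reference for extractive systems and not a usable model.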

Table 3 shows that the performance of three typical neural network models is significantly better than Random and TextRank. When equipped with our Locator, both PGNet and BART show evident performance improvements (PGNet: 28.74 -> 31.37 R-1, BART: 29.20 -> 31.74 R-1). Compared to PGNet∗, the advantage of BART∗ lies in the ROUGE-L score (1.13 improvement), which indicates that it can generate more fluent sentences. The current state-of-the-art meeting summarization model HMNet achieves the best performance, which may be attributed to its cross-domain pre-training process making HMNet more familiar with the style of meeting transcripts.

In addition, we also use the gold text spans as the input of different models to measure the performance loss caused by Locator. Surprisingly, for models (PGNet and BART) that need to truncate the input text, although Locator is an approximate solution, the models equipped with it can achieve comparable results with the models based on gold span inputs. Therefore, in this case, our two-stage pipeline is a simple but effective method in the meeting domain. However, for some models (HMNet) that use a hierarchical structure to process long text, inputting gold text spans can still bring huge performance improvements.

Experiments on Different Domains

In addition, we conduct multi-domain and cross-domain experiments. First, we perform in-domain and out-domain tests in the three domains of the QMSum dataset. From Table 4, we can conclude that there are obvious differences between these three domains. For instance, the models trained on the Academic and Committee domains perform poorly when tested directly on the Product domain, with ROUGE-L scores of only 24.09 and 22.17 respectively. However, the model trained on the single domain of Product can achieve a ROUGE-L score of 31.37, which illustrates that although these domains are all in the form of meeting transcripts, they still have visible domain bias.

On the other hand, when we train all the domains together, we can obtain a robust summarization model. Compared with models trained on a single domain, models trained on QMSum can always achieve comparable results. In the Academic domain, the model with multi-domain training can even get higher ROUGE-2 (5.05 vs 4.32) and ROUGE-L (23.01 vs 22.58) scores. These results show that the multi-domain setting in meeting summarization task is apparently necessary and meaningful. Meeting transcripts cover various fields, making the transfer of models particularly difficult. Therefore, we need to introduce multi-domain training to make the model more robust, so it can be applied to more practical scenarios.

Analysis

In this section, we conduct comprehensive analysis of query types and errors in the model output.

We manually divide the queries in QMSum into five aspects: personal opinion, multi-person interaction, conclusion or decision, reason, and overall content. For example, “Summarize the whole meeting.” requires a summary of the overall content and “Why did A disagree with B?” requires a summary of some reasons. The questions we are concerned about are: what is the distribution of different types of queries in QMSum? Are there differences in the difficulty of different types of queries?

To figure out the above issues, we randomly sample 100 queries from the test set, count the number of each type, and score the difficulty of each query. Table 5 illustrates that answering 40% of queries requires summarizing the interaction of multiple people, and the queries that focus on personal opinions and different aspects of conclusions or decisions account for almost 20% each. Besides, queries about a specific reason are less frequent in the meetings.

We also perform a human evaluation of the difficulty of various query types. For each query, the relevant text spans and query-summary pair are shown to annotators. Annotators are asked to score the difficulty of this query in two dimensions: 1) the difficulty of locating relevant information in the original text; 2) the difficulty of organizing content to form a summary. For each dimension, they can choose an integer between 1 and 3 as the score, where 1 means easy and 3 means difficult.

As we can see from Table 5, queries about reasons are the most difficult in terms of locating key information in the related paragraphs, and this type of query is also challenging to organize and summarize reasonably. Queries about multi-person interaction and overall content are relatively easy under human evaluation scores. The relevant paragraphs of the former contain multi-person conversations, which are usually redundant, so the effective information is easier to find; the latter only needs to organize the statements in the chronological order of the meeting to write a summary, so it has the lowest Diff. 2 score. The model performance also confirms this point: BART achieves an R-L score above 30 on these two types of queries, but performs poorly on the rest. Therefore, the remaining three types of queries in QMSum are still very challenging even for powerful pre-trained models, and further research is urgently needed to change this situation.

Error Analysis

Although ROUGE score can measure the degree of overlap between the generated summary and the gold summary, it cannot reflect the factual consistency between them or the relevance between the predicted summary and the query. Therefore, in order to better understand the model performance and the difficulty of the proposed task, we sample 100 generated summaries for error analysis. Specifically, we ask 10 graduate students to do error analysis on the sampled summaries. Each summary is viewed by two people. They discuss and agree on whether the sample is consistent with the original facts and whether it is related to the query.

According to Cao et al. (2018), nearly 30% of summaries generated by strong neural models contain factual errors. This problem is even more serious on QMSum: we find inconsistent facts in 74% of the samples, which may be because the existing models are not good at generating multi-granularity summaries. Although BART can achieve state-of-the-art performance in the single-document summarization task, it does not seem to truly understand the different aspects of the meeting, and thus creates factual errors. What’s worse, 31% of the summaries are completely unrelated to the given query. This not only encourages us to design more powerful models or introduce more prior knowledge to overcome this challenge, but also shows that better metrics are needed to evaluate model performance in generating multi-granularity summaries.

Conclusion

We propose a new benchmark, QMSum, for the query-based meeting summarization task. We build a locate-then-summarize pipeline as a baseline and further investigate variants of our model with different Locators and Summarizers, adopt different training settings including cross-domain and multi-domain experiments to evaluate generalizability, and analyze the task difficulty with respect to query types. The new task and benchmark leave several open research directions to explore: 1) how to process the long meeting discourses; 2) how to make a meeting summarization model generalize well; 3) how to generate summaries consistent with both meeting transcripts and queries; 4) how to reduce the annotation cost for meeting summarization.

Acknowledgements

The Language, Information, and Learning lab at Yale (LILY lab) would like to acknowledge the research grant from Microsoft Research. We would also like to thank annotators for their hard work and reviewers for their valuable comments.

Ethics Consideration

We propose a novel query-based meeting summarization task, accompanied by a high-quality dataset, QMSum. Since the paper involves a new dataset and NLP application, this section is further divided into the following two parts.

Collecting user data touches on the intellectual property and privacy rights of the original authors: both the collected meeting transcripts and the recruited annotators. We ensure that the dataset construction process is consistent with the intellectual property and privacy rights of the original authors of the meetings. All the meeting transcripts we collected are public and open to use according to the relevant regulations (http://groups.inf.ed.ac.uk/ami/corpus/license.shtml, http://groups.inf.ed.ac.uk/ami/icsi/license.shtml, https://senedd.wales/en/help/our-information/Pages/Open-data.aspx, https://www.ourcommons.ca/en/important-notices). The annotation process is consistent with the intellectual property and privacy rights of the recruited annotators as well.

We estimated that annotating one meeting takes around 1-2 hours. Therefore, we paid annotators around $14 for each product and academic meeting and $28 for each committee meeting. To further encourage annotators to work on annotations, we offered a bonus mechanism: the bonus for each of the 5th to 8th meetings would be $4; the bonus for each of the 9th to 12th meetings would be $5, and so on. Some of the authors also did annotations and they were paid as well.

The most likely problems in the dataset are bias and inconsistency among queries, annotated summaries, and original meeting contents. With regard to bias, we found that the product meeting dataset rarely contains any explicit gender information, but annotators still tended to use ‘he’ as the pronoun. To avoid the gender bias caused by pronoun usage, we required annotators to replace pronouns with speaker information such as ‘Project Manager’ or ‘Marketing’. Also, when designing queries based on the query schema list, we found that annotators usually used the same query schema, which might lead to bias towards a certain type of query. Therefore, we asked the annotators to use different schemas as much as possible. For the inconsistency problem, each annotation step was strictly supervised by ‘experts’ who are good at annotation and were responsible for reviewing.

2 NLP Applications

The query-based meeting summarization application aims to summarize meetings according to queries from users. We foresee that the trained model could be applied in companies to improve the efficiency of workers and help staff comprehensively understand the meeting contents. The annotated QMSum dataset could be used as a benchmark for researchers to study how to improve summarization performance on such long texts and how to make models generalize better to meetings from different domains.

The current baseline models still tend to generate ungrammatical and factually inconsistent summaries. If a trained baseline model were directly applied in companies, the misinformation would negatively affect comprehension and subsequent decision making. Further efforts are needed to generate high-quality summaries which are fluent and faithful to the meeting transcripts and queries.

Training and test data are often biased in ways that limit system accuracy on domains with small data size or new domains, potentially causing distribution mismatch issues. In the data collection process, we control for the gender bias caused by pronouns such as ‘he’ and ‘she’ as much as possible. Also, we attempt to control the bias towards a certain type of query schema by requiring annotators to use diverse schemas as much as possible. However, we admit that there might be other types of bias, such as political bias in committee meetings. Thus, the summarization models trained on the dataset might be biased as well. We will include warnings in our dataset.

We emphasize that the application should be used with careful consideration, since the generated summaries are not reliable enough. It is necessary for researchers to develop better models to improve the quality of summaries. Besides, if the model is trained on internal meeting data, with the consideration of intellectual property and privacy rights, the trained model should be used under strict supervision.

Future projects have to be aware of the fact that some meeting transcripts are intended for internal use only. Thus, researchers should consider the privacy issues surrounding meeting data before training the model.

References

Appendix A Appendix

As mentioned in Section 3.2, we created a query schema list to help annotators design queries. To cover both general and specific meeting contents, we further divide it into a general query schema list and a specific query schema list. The detailed list is shown in Table 6.

A.2 Other Details of Annotation Instruction

We describe annotation details beyond those given in Section 3.

We require annotators to use noun phrases to represent the main topics. As we hope to select the most important topics, the number of annotated main topics should not be too large; 3 to 8 is a proper range.

Relevant text spans of a main topic may be scattered across different parts of a meeting, e.g., the main topic ‘scope of the project and team building’ in the leftmost part of Figure 2. Annotators are therefore asked to label all the relevant spans, and these spans are allowed to be non-contiguous.

Main topics should objectively cover most of the meeting contents. In meetings, group members might spend much time chatting. Though this is not close to the main theme of the meeting, it should also be counted as a main topic if the chat took up a lot of time.

A.2.2 Query Generation

For the specific queries, we encourage the annotators to choose different schemas to compose queries, since we intend to diversify the query types and reduce the bias towards certain query types.

Each query-answer pair should be independent, that is, not dependent on previous queries and answers. For example, the query ‘How did they reach agreement afterwards?’ depends on previous annotations: treated independently, it is unclear when they reached agreement and what the agreement referred to. Annotators should specify when the agreement was reached and what it was about to make the query clear. For example, annotators could rewrite the query as ‘How did the group agree on the button design when discussing the design of remote control?’.

A.2.3 Query-based Summarization

When answering queries like ‘What did A think of X?’, annotators were required not only to summarize what A said, but also to briefly mention the context relevant to what A said. This is designed to enrich the contents of summaries and further challenge the model’s capability of summarizing relevant contexts.

Since there might be multiple relevant text spans, we asked annotators to annotate all of them.

Since all the meetings took place in the past, we asked annotators to use the past tense.

If the gender information was unclear, we asked annotators not to use ‘he/she’ to denote speakers. We also asked them not to use abbreviations (e.g., PM) to denote speakers, and to use the full name like ‘Project Manager’ instead.

In the raw meeting transcripts, some abbreviations are written with the character ‘_’ after each letter. If the annotators encountered such abbreviations, e.g., ‘L_C_D_’, ‘A_A_A_’, etc., they were required to rewrite them as LCD or AAA.
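The abbreviation rewriting rule above was applied by human annotators; as an illustration only, it could be approximated automatically along the following lines. This is a minimal sketch, not part of the annotation pipeline: the function name `normalize_abbreviations` and the regular expression are our own assumptions, and the pattern assumes that an abbreviation is a run of at least two single letters, each followed by ‘_’.

```python
import re

# Assumption: a spelled-out abbreviation is a run of two or more single
# letters, each followed by an underscore, starting at a word boundary
# (e.g. "L_C_D_" or "A_A_A_"). Ordinary identifiers such as "file_name"
# do not match because their parts are longer than one letter.
_SPELLED_ABBREVIATION = re.compile(r"\b(?:[A-Za-z]_){2,}")


def normalize_abbreviations(text: str) -> str:
    """Rewrite abbreviations like 'L_C_D_' as 'LCD' (hypothetical helper)."""
    return _SPELLED_ABBREVIATION.sub(
        lambda match: match.group(0).replace("_", ""), text
    )


if __name__ == "__main__":
    print(normalize_abbreviations("an L_C_D_ screen and A_A_A_ batteries"))
```

A rule of this kind would only pre-clean the transcript text; whether a rewritten abbreviation reads correctly in context would still need to be checked by a person.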

A.3 Annotation Review Standards

We launched a ‘pre-annotation’ stage in which annotators were asked to try annotating one meeting, and experts who are good at our annotation task would review it and instruct the annotators by providing detailed feedback. Requirements include 1) faithfulness, 2) informativeness, 3) lengths of the relevant text spans of designed queries, 4) typos, etc. Annotators could continue annotating only if they passed the ‘pre-annotation’ stage. After ‘pre-annotation’, experts kept carefully reviewing all the annotations according to the requirements. We wrote down the standards for the three annotation stages individually in the annotation instructions. Details of the annotation review standards can be found in Appendix A.2.

A.4 Examples of QMSum Dataset

We show two examples from our proposed QMSum dataset. One is from a product meeting (Table 7), and the other is from a committee meeting (Table 8).