Joint Learning of Answer Selection and Answer Summary Generation in Community Question Answering

Yang Deng, Wai Lam, Yuexiang Xie, Daoyuan Chen, Yaliang Li, Min Yang, Ying Shen

Introduction

Recent years have witnessed a spectacular increase in real-world applications of community question answering (CQA), such as Yahoo! Answers (https://answers.yahoo.com/) and StackExchange (https://stackexchange.com/). Many studies have addressed different tasks in CQA, such as answer selection, question-question relatedness, and comment classification (?; ?; ?). However, due to the length and redundancy of answers in the CQA scenario, several challenges need to be tackled in real-world applications. (i) The noise introduced by the redundancy of answers makes it difficult for answer selection models to pick out correct answers from a set of candidates. (ii) Compared with other QA systems (e.g., factoid question answering), answers in CQA are often too long for community users to read and comprehend.

Current state-of-the-art answer selection models (?; ?) employ the attention mechanism to attend to the important correlated information between question-answer pairs. These methods perform well when ranking short answers, but their accuracy drops as the answers grow longer (?; ?). Recent studies on coarse-to-fine question answering for long documents, such as Reading Comprehension (RC) (?; ?; ?), focus on answer span extraction in factoid QA, in which the questions can be answered by a single word or a short phrase. Conversely, in non-factoid CQA, discrete and complex information from multiple sentences together makes up the answer. Besides, generative RC methods (?) only give one answer, while there are often multiple useful answers in CQA. Thus, these approaches are not suitable for addressing the redundancy issue of answers in CQA.

On the other hand, text summarization provides an effective approach to alleviating the aforementioned issue. Text summarization methods can generally be divided into two categories: extractive summarization (?; ?) and abstractive summarization (?; ?). The aim is to assemble or generate summaries from the source article or an external vocabulary, based on the information in the source text. In existing studies, answer summarization in CQA is mainly explored with extractive summarization models (?; ?). However, because the answers are long, extractive methods sometimes fail to cover all the important information in the whole answer or to keep the core idea consistent. Besides, the correlation information between question and answer, which plays a crucial role in human comprehension, is underutilized by current query-based summarization studies (?; ?). Therefore, we intend to take advantage of both the contextual information from the source text and the relationship between the question-answer pair to generate abstractive answer summaries in CQA.

We aim to simultaneously tackle the above issues in CQA, including (i) improving the performance of non-factoid answer selection with long answers, (ii) generating abstractive summaries of the answers. We jointly learn answer selection and abstractive summarization to generate answer summaries for CQA. First, we exploit the correlated information between question-answer pairs to improve abstractive answer summarization, which enables the summarizer to generate abstractive summaries related to questions. Then, we measure the relevancy degrees between questions and answer summaries to alleviate the impact of noise from original answers. Besides, since obtaining reference summaries is usually labor-intensive and time-consuming in a new domain, a transfer learning strategy is designed to improve resource-poor CQA tasks with large-scale supervision data.

We summarize our contributions as follows:

1. We jointly learn answer selection and answer summary generation to tackle the lengthiness and redundancy issues of the answer in CQA with a unified model. A novel joint learning framework of answer selection and abstractive summarization (ASAS) is proposed to employ the question information to guide the abstractive summarization, and meanwhile leverage the summaries to reduce noise in answers for precisely measuring the correlation degrees of QA pairs.

2. We construct a new dataset, WikiHowQA, for the task of answer summary generation in CQA, which can be adapted to both answer selection and summarization tasks. Experimental results on WikiHowQA show that the proposed joint learning method outperforms SOTA answer selection methods and meanwhile generates more precise answer summaries than existing summarization methods.

3. To handle resource-poor CQA tasks, we design a transfer learning strategy, which enables tasks without reference answer summaries to conduct the joint learning, with impressive experimental results.

Related Work

Community Question Answering. Answer selection is the core and the most widely studied problem in community question answering. Recent studies have evolved from feature-based methods (?; ?) into deep learning models, such as convolutional neural networks (CNN) (?) and recurrent neural networks (RNN) (?). In order to capture the interactive information in QA sentences, various attention mechanisms (?; ?) have been developed to align the related words between questions and answers. However, the lengthy and redundant answers in the CQA scenario may introduce much noise and scatter important information, which causes difficulties in answer selection. Some studies leverage additional information to compensate for the imbalance of information between questions and answers, such as user models (?; ?), latent topics (?), external knowledge (?) or question subjects (?). Some existing transfer learning studies on CQA focus on cross-domain adaptation (?; ?). In this work, we employ a summarization method to reduce noise in the original lengthy answers and thereby improve the answer selection performance in CQA.

Text Summarization. Text summarization techniques are mainly classified into two categories: extractive and abstractive summarization. Extractive approaches regard summarization as a sentence classification (?) or a sequence labeling task (?) to select sentences from the article to form the summary, while abstractive approaches usually employ attention-based encoder-decoder models (?; ?) to generate abstractive summaries. Answer summarization in CQA was first introduced by ? (2006) as an application of extractive summarization. Since then, answer summarization has still been treated as a separate extractive summarization module in the QA pipeline (?; ?). Query-based summarization methods (?; ?) can also be a good solution for this task. However, these approaches are reported to perform worse than answer selection methods in the question answering scenario (?).

Multi-task Learning. Inspired by the success of multi-task learning in other NLP tasks, several attempts have been made to solve answer selection with different tasks. ? (2017) and ? (2018) enhance answer selection in CQA via multi-task learning with the auxiliary tasks of question-question relatedness and question-comment relatedness. ? (2019) leverage the question categorization to enhance the question representation learning for CQA. ? (2019) propose a multi-view attention based multi-task learning model to jointly tackle answer selection and knowledge base question answering tasks. In this work, we jointly learn answer selection and abstractive summarization to select and generate precise answers in CQA.

Method

We aim to jointly conduct two tasks, answer selection and abstractive summarization, to select and generate concise answers for CQA. Given a question $q_{i}$ , the goal is to simultaneously select the set of correct answers from a set of candidates $A_{i}=\{a^{(1)}_{i},...,a^{(j)}_{i}\}$ and generate an abstractive summary $\beta^{(*)}_{i}$ for each selected answer $a^{(*)}_{i}$ .

The dataset $D$ for learning contains a set $Q$ of $N$ questions. For each question $q_{i}\in Q$, there are $M_{i}$ candidate answers $A_{i}$, each with a corresponding human-written reference summary $\beta^{(j)}_{i}$ and a label $y^{(j)}_{i}$ indicating whether $a^{(j)}_{i}$ can answer $q_{i}$.

Model

We introduce the proposed joint learning model for answer selection and abstractive summarization (ASAS). As depicted in Fig. 1, the overall framework of ASAS consists of four components: (i) Shared Compare-Aggregate Bi-LSTM Encoder, (ii) Sequence-to-sequence Model with Question-aware Attention, (iii) Question Answer Alignment with Summary Representations, (iv) Question-driven Pointer-generator Network.

Seq2Seq Model with Question-aware Attention.

With the intuition that the information in the question should help in attending to the important elements of the original answer sentence, we propose a question-aware attention based seq2seq model to decode the encoded sentence representation of the answer. We adopt a unidirectional LSTM as the decoder. At each step $t$, the decoder produces the hidden state $s_{t}$ with the input of the previous word $w_{t-1}$. The question-aware attention $\alpha^{t}$ is generated by:

where $m$, $W_{h}$, $W_{s}$, $W_{q}$ are attention parameters to be learned. The question-aware attention weight $\alpha^{t}$, a probability distribution over the source words, is used to generate the context vector $\hat{h}_{t}$:

The context vector aggregates the information from the source text and the question for the current step. We concatenate it with the decoder state $s_{t}$ and pass the result through a linear layer to generate the summary representation $h^{s}_{t}$:

where $W_{1}$ and $b_{1}$ are parameters to be learned.
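The display equations for this step are not reproduced in the text, so the following is only a minimal sketch of one form that is consistent with the named parameters. It assumes additive attention scores $e^{t}_{i}=m^{\top}\tanh(W_{h}h_{i}+W_{s}s_{t}+W_{q}o_{q})$ normalized by a softmax, where $h_{i}$ (encoder state of the $i$-th answer word) and the pooled question vector $o_{q}$ are assumed inputs, and $m$ is treated as a vector. All function names are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def question_aware_attention(H, s_t, o_q, W_h, W_s, W_q, m):
    """Return (alpha, context) for one decoder step.

    H   : encoder states h_i of the answer, one vector per source word
    s_t : decoder hidden state at step t
    o_q : question representation (assumed to be a single pooled vector)
    Assumed score: e_i = m . tanh(W_h h_i + W_s s_t + W_q o_q)
    """
    # The decoder and question terms do not depend on i, so compute them once.
    fixed = [a + b for a, b in zip(matvec(W_s, s_t), matvec(W_q, o_q))]
    scores = []
    for h_i in H:
        pre = [a + b for a, b in zip(matvec(W_h, h_i), fixed)]
        scores.append(sum(m_k * math.tanh(p) for m_k, p in zip(m, pre)))
    alpha = softmax(scores)  # distribution over source words
    dim = len(H[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(a * h_i[k] for a, h_i in zip(alpha, H)) for k in range(dim)]
    return alpha, context

def summary_representation(s_t, context, W_1, b_1):
    """Linear layer over the concatenation [s_t; context]."""
    return [y + b for y, b in zip(matvec(W_1, s_t + context), b_1)]
```

Because the question term enters the score before the softmax, the same answer can receive different attention distributions for different questions, which is the behavior the paragraph above describes.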

Question Answer Alignment with Summary Representations.

We apply a two-way attention mechanism to generate the co-attention between the encoded question representation $H_{q}$ and the decoded summary representation $H_{s}$ :

We take the dot product between the attention vectors and the question and summary representations to generate the final attentive sentence representations for answer selection:

Compared with encoded answer representations, decoded summary representations are more concise and compressed, which enables the answer selection model to precisely capture the interactive information between questions and answers.
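The exact co-attention formula is not reproduced in the text. As a hedged illustration only, the sketch below assumes an attentive-pooling style formulation: a bilinear interaction matrix $M=\tanh(H_{q}UH_{s}^{\top})$ with a hypothetical parameter matrix $U$, row-wise and column-wise max pooling followed by a softmax to obtain the two attention vectors, and weighted sums to obtain the attentive representations. The function names are illustrative, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, rows):
    """Sum of row vectors, each scaled by its attention weight."""
    dim = len(rows[0])
    return [sum(w * r[k] for w, r in zip(weights, rows)) for k in range(dim)]

def two_way_attention(H_q, H_s, U):
    """Co-attention between question states H_q and summary states H_s.

    H_q, H_s : lists of vectors (one per word); U : assumed bilinear matrix.
    Returns (r_q, r_s, alpha_q, alpha_s).
    """
    UH_s = [[dot(u_row, h) for u_row in U] for h in H_s]        # U h_j for each summary word
    M = [[math.tanh(dot(q, uh)) for uh in UH_s] for q in H_q]   # M[i][j] = tanh(q_i^T U h_j)
    alpha_q = softmax([max(row) for row in M])        # question-side attention
    alpha_s = softmax([max(col) for col in zip(*M)])  # summary-side attention
    r_q = weighted_sum(alpha_q, H_q)
    r_s = weighted_sum(alpha_s, H_s)
    return r_q, r_s, alpha_q, alpha_s
```

Under this assumption, a question word that matches some summary word strongly receives a larger weight than one that matches nothing, and likewise for summary words.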

Question-driven Pointer-generator Network.

First, the probability distribution $P_{vocab}$ over the fixed vocabulary is obtained by passing the summary representation $h^{s}_{t}$ through a softmax layer:

where $W_{2}$ and $b_{2}$ are parameters to be learned. Then, a question-aware pointer network is proposed to copy words from the source article with the guidance of the question information. The question-aware generation probability $p_{gen}\in[0,1]$ takes into account the decoded summary representation $h^{s}_{t}$, the decoder input $x_{t}$ and the question representation $o_{q}$:

where $w_{h}$ , $w_{x}$ , $w_{q}$ and $b_{p}$ are parameters to be learned, and $\sigma$ is the sigmoid function. Following the basic pointer-generator network (PGN) (?), we obtain the final probability distribution over both the fixed vocabulary and words from the source article:

To be specific, the question information is involved in not only the generating process, but also the copying process in the question-driven PGN. (i) The question information directs the calculation of the generation probability, which decides whether to generate a word from the vocabulary or copy one from the source text. (ii) The question-aware attention weights integrate the question information to attend to the important words in the source text for copying. (iii) The probability distribution over the vocabulary is learned from the question-aware attentive summary representations.
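The mixing step can be sketched as follows. This is a minimal illustration that assumes the standard pointer-generator form, $p_{gen}=\sigma(w_{h}^{\top}h^{s}_{t}+w_{x}^{\top}x_{t}+w_{q}^{\top}o_{q}+b_{p})$ and $P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:w_{i}=w}\alpha^{t}_{i}$, since the display equations are not reproduced in the text. The extended-vocabulary indexing for out-of-vocabulary source words and all function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def generation_probability(h_s, x_t, o_q, w_h, w_x, w_q, b_p):
    """Question-aware p_gen in [0, 1] from summary state, decoder input and question."""
    return sigmoid(dot(w_h, h_s) + dot(w_x, x_t) + dot(w_q, o_q) + b_p)

def final_distribution(p_gen, p_vocab, alpha, source_ids, extended_size):
    """Mix the vocabulary distribution with the copy distribution.

    p_vocab       : probabilities over the fixed vocabulary
    alpha         : question-aware attention over source positions
    source_ids    : id of each source word in the extended vocabulary
                    (ids >= len(p_vocab) denote out-of-vocabulary words)
    extended_size : fixed vocabulary size plus number of source OOV words
    """
    final = [p_gen * p for p in p_vocab] + [0.0] * (extended_size - len(p_vocab))
    for a, idx in zip(alpha, source_ids):
        # Copy mass accumulates when the same word occurs several times in the source.
        final[idx] += (1.0 - p_gen) * a
    return final
```

For example, with `p_gen = 0.5`, `p_vocab = [0.5, 0.3, 0.2]`, attention `[0.6, 0.4]` and source ids `[1, 3]`, the word with id 3 is outside the fixed vocabulary and can only be produced by copying.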

Joint Training Procedure

The attentive representations of questions and summaries go through a softmax layer for binary classification:

where $p$ is the output of the softmax layer and $y$ is the binary classification label of the QA pair.

Summarization Loss.

The summarization task is trained to minimize the negative log likelihood:

Coverage Loss.

Coverage loss (?) was proposed to discourage repetition in abstractive summarization. At each decoder timestep $t$, the coverage vector $c^{t}=\sum^{t-1}_{t^{\prime}=0}\alpha^{t^{\prime}}$ represents the degree of coverage so far. The coverage vector $c^{t}$ is also used when computing the attention weight $\alpha^{t}$. The coverage loss penalizes repeated attention to the same source positions in the updated attention weight $\alpha^{t}$:

Overall Loss Function.

For joint training, the final objective is to minimize the weighted sum of the above three loss functions:

where $\lambda_{1}$ , $\lambda_{2}$ , $\lambda_{3}$ are hyper-parameters to balance losses.
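As a hedged sketch of how the three terms could be combined, the code below assumes binary cross-entropy for answer selection, the average negative log-likelihood of the reference words for summarization, and the usual coverage penalty $\sum_{i}\min(\alpha^{t}_{i},c^{t}_{i})$ averaged over decoder steps. The averaging over steps and the function names are assumptions, because the display equations are not reproduced in the text.

```python
import math

def answer_selection_loss(p, y):
    """Binary cross-entropy for one QA pair; p is the predicted probability of label 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def summarization_loss(target_probs):
    """Average negative log-likelihood of the reference words.

    target_probs[t] is the probability the model assigns to the
    reference word at decoder step t.
    """
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def coverage_loss(attention_steps):
    """Average over steps of sum_i min(alpha_i^t, c_i^t).

    The coverage vector c^t is the sum of the attention
    distributions of all previous steps (zero at the first step).
    """
    coverage = [0.0] * len(attention_steps[0])
    total = 0.0
    for alpha in attention_steps:
        total += sum(min(a, c) for a, c in zip(alpha, coverage))
        coverage = [c + a for c, a in zip(coverage, alpha)]
    return total / len(attention_steps)

def joint_loss(l_qa, l_sum, l_cov, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses with balancing hyper-parameters."""
    return lambdas[0] * l_qa + lambdas[1] * l_sum + lambdas[2] * l_cov
```

Attending to a different source word at every step gives a coverage loss of zero, while attending twice to the same word is penalized, which is the repetition-discouraging behavior described above.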

Handling Resource-poor Datasets

Since annotating gold answer summaries is labor-intensive, we intend to leverage the knowledge learned from the joint learning of answer selection and answer summary generation on a large-scale supervised dataset and apply it to resource-poor datasets without reference answer summaries. This goal can be achieved by a transfer learning strategy involving two steps: (i) initialize the model with the parameters pre-trained on the source dataset, (ii) further fine-tune on the target dataset. A straightforward way is to fine-tune all the parameters learned from the source data on the target training dataset. Another way is to fine-tune only a certain part of the parameters and keep the remaining part of the model fixed during fine-tuning. In this case, we first pre-train the whole joint learning model on the source dataset, and then only fine-tune the answer selection modules (including the Shared Compare-Aggregate Bi-LSTM Encoder and the Question Answer Alignment). On the one hand, fixing the summarization part not only reduces the demand for annotated summary data, but also prevents the model from over-fitting. On the other hand, questioning styles and answer contents vary across CQA tasks in different domains, so the answer selection part is expected to benefit from fine-tuning in the target domain.
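The partial fine-tuning strategy can be illustrated with a framework-free sketch. It assumes the pre-trained parameters are grouped by module and uses plain gradient descent as the update rule; the module names (`shared_encoder`, `qa_alignment`, `seq2seq_decoder`, `pointer_generator`) and the function name are hypothetical, chosen only to mirror the components listed in the text.

```python
# Modules that are fine-tuned on the target data (assumed names).
ANSWER_SELECTION_MODULES = {"shared_encoder", "qa_alignment"}

def finetune_step(params, grads, trainable=ANSWER_SELECTION_MODULES, lr=0.1):
    """One gradient-descent step that updates only the trainable modules.

    params : dict mapping module name -> list of weights (pre-trained on the source data)
    grads  : dict mapping module name -> list of gradients from the target-data loss
    Modules outside `trainable` (the summarization part) are copied unchanged.
    """
    updated = {}
    for module, weights in params.items():
        if module in trainable:
            updated[module] = [w - lr * g for w, g in zip(weights, grads[module])]
        else:
            updated[module] = list(weights)  # frozen: no reference summaries needed
    return updated
```

Since the frozen summarization modules receive no updates, the target dataset only needs answer selection labels, which matches the motivation given above.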

Datasets and Experimental Setting

Most of the widely adopted answer selection benchmark datasets are composed of short sentences, such as WikiQA (?) and SemEval (?). WikiPassageQA (?) and StackExchange (?), two recent non-factoid answer selection datasets with long passages (about 150 words) as candidate answers, lack the reference summaries needed to evaluate answer summarization in our answer summary generation task.

We present a new CQA corpus, WikiHowQA, for answer summary generation, which contains labels for the answer selection task as well as reference summaries for the text summarization task. To prepare this dataset, we modify a recent text summarization dataset, WikiHow (?), which was obtained from the WikiHow (http://www.wikihow.com/) knowledge base. The WikiHow dataset contains detailed answers written by community users for non-factoid questions starting with “How to”. The original answers are composed of multiple steps of different methods for the question, and the description in each step is associated with an abstractive summary. The WikiHow dataset only contains the selected ground-truth answers and the reference summaries for each answer, while the whole candidate answer set is required to conduct answer selection experiments on this dataset. Therefore, we construct a new CQA dataset based on the WikiHow dataset.

We first clean up the WikiHow dataset by filtering out questions without answers or summaries and answers consisting of punctuation only. After that, the dataset size is reduced from 230,843 to 203,596, including 107,041 unique questions. The clean WikiHow dataset is split into 142,063 / 18,909 / 42,624 as train / dev / test sets. In order to retrieve the candidate answer pool for all the questions, we write a crawler to collect the relevant questions for each question from the WikiHow website. The answers of the relevant questions posted on WikiHow are labeled as negative answers for the given question. Finally, we obtain 1,188,189 question-answer pairs with corresponding answer summaries and matching labels as the WikiHowQA dataset. In accordance with the clean WikiHow dataset, we split the WikiHowQA dataset into 904,460 / 72,474 / 211,255 as train / dev / test sets, so that no sample appears in more than one split. The statistics of the WikiHowQA dataset (https://github.com/dengyang17/wikihowQA) are shown in Table 1.
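The candidate-pool construction described above can be sketched as follows. This is only an illustration under assumed data structures: `answers` maps each question to its own (answer, summary) pairs and `related` maps each question to the related questions collected by the crawler. Neither the structures nor the function name come from the released dataset code.

```python
def build_qa_pairs(answers, related):
    """Build labeled question-answer pairs for answer selection.

    answers : dict question -> list of (answer, summary) tuples written for it
    related : dict question -> list of related question titles
    Returns a list of (question, answer, summary, label) tuples, where the
    question's own answers get label 1 and answers of related questions get label 0.
    """
    pairs = []
    for question, own_answers in answers.items():
        for answer, summary in own_answers:
            pairs.append((question, answer, summary, 1))
        for other in related.get(question, []):
            # Answers to related questions serve as negative candidates.
            for answer, summary in answers.get(other, []):
                pairs.append((question, answer, summary, 0))
    return pairs
```

Every pair keeps the answer's reference summary, so the same record can serve both the answer selection and the summarization task.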

In addition, we evaluate the proposed method on a resource-poor CQA dataset, StackExchange (?), which lacks reference answer summaries. StackExchange is a real-life CQA dataset containing long answers from different domains, including travel, cooking, academia, apple, and aviation; its statistics are presented in Table 2. We adopt WikiHowQA as the source dataset for transfer learning due to its high quality and large quantity, while StackExchange is used as the target dataset.

Implementation Details

We train all the implemented models with pre-trained GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip) of 100 dimensions as word embeddings and set the vocabulary size to 50k for both source and target text. During training and testing, we truncate the article to 400 words and restrict the length of generated summaries to at most 100 words. We apply early stopping based on the answer selection evaluation result on the validation set. We train our model and the implemented answer selection models for 5 epochs, while the implemented summarization models are trained for 20 epochs for fair comparison, since the same answer may occur repeatedly among the candidates for different questions in the WikiHowQA dataset.

In our model, we train with a learning rate of 0.15 and an initial accumulator value of 0.1. The dropout rate is set to 0.5. The hidden unit sizes of the BiLSTM encoder and the LSTM decoder are all set to 150. We train our models with the batch size of 32. All other parameters are randomly initialized from [-0.05, 0.05]. $\lambda_{1}$ , $\lambda_{2}$ , $\lambda_{3}$ are all set to 1.
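The text does not name the optimizer. The mention of an initial accumulator value suggests an Adagrad-style update, so the sketch below assumes Adagrad purely for illustration, together with the uniform initialization range stated above. The function names and the `eps` stabilizer are assumptions.

```python
import math
import random

def init_uniform(n, low=-0.05, high=0.05, seed=0):
    """Draw n parameters uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]

def adagrad_step(params, grads, accum, lr=0.15, eps=1e-10):
    """One Adagrad update (assumed optimizer, not confirmed by the text).

    accum holds the running sum of squared gradients per parameter;
    it would start at the initial accumulator value (0.1 in the text).
    """
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_params = [
        p - lr * g / (math.sqrt(a) + eps)
        for p, g, a in zip(params, grads, new_accum)
    ]
    return new_params, new_accum

# Usage: initialize, then take one step with the stated hyper-parameters.
params = init_uniform(4)
accum = [0.1] * len(params)
params, accum = adagrad_step(params, [0.01, -0.02, 0.0, 0.03], accum)
```

Under this assumption, a non-zero initial accumulator keeps the very first updates from being scaled up by a near-zero denominator.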

Experimental Result

We first compare the proposed method with several state-of-the-art methods on the answer selection task, including Siamese BiLSTM (?), Att-BiLSTM (?), AP-LSTM (?), CA (Compare-Aggregate) (?) and COALA (?). Besides, we evaluate several Two-Stage methods, which first summarize the original answers and then conduct answer selection. To validate the effectiveness of different components of ASAS, we also conduct ablation tests. MAP and MRR are adopted as evaluation metrics.
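For reference, the two ranking metrics can be computed as in the following sketch, which uses the standard definitions of mean average precision (MAP) and mean reciprocal rank (MRR). The input layout, one `(scores, labels)` pair per question, and the function names are assumptions made for illustration.

```python
def rank_by_score(scores, labels):
    """Return the labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k where a correct answer appears."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first correct answer (0 if there is none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

def map_mrr(questions):
    """questions: list of (scores, labels), one entry per question."""
    ranked = [rank_by_score(scores, labels) for scores, labels in questions]
    n = len(ranked)
    mean_ap = sum(average_precision(r) for r in ranked) / n
    mean_rr = sum(reciprocal_rank(r) for r in ranked) / n
    return mean_ap, mean_rr
```

MAP rewards ranking every correct answer highly, whereas MRR only looks at the first correct answer, so the two can diverge when a question has several correct candidates.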

Answer selection results on WikiHowQA are summarized in Table 3. We show that the joint learning model (ASAS) achieves state-of-the-art performance. There are several notable observations in the results. (i) The BM25 model and even the basic deep learning model improve only slightly over random guessing, which signifies that the test set is indeed a difficult one. (ii) The Compare-Aggregate methods (including CA and COALA) and AP-BiLSTM, which have been proven to be relatively effective in long-sentence answer selection (?; ?), outperform the other strong baseline methods. (iii) Although Two-Stage methods do improve the final answer selection result, it is time-consuming and inconvenient to train two separate models. Specifically, using the gold summary (GOLD) achieves the best performance, and Question-driven PGN (QPGN) performs better than the original PGN. With the same summarization method, different answer selection models achieve similar results. (iv) Finally, the proposed joint learning model (ASAS) substantially enhances the performance: it not only achieves the SOTA result, but is also easily trained in an end-to-end fashion. In this way, we precisely pick out the correct answers from candidate answers with long sentences, and meanwhile generate abstractive summaries for the convenience of community users. (v) The ablation study shows that both the two-way attention mechanism and the pointer network contribute to the final result. The two-way attention mechanism enhances the interaction between questions and decoded answer summaries, while the pointer network aids in generating a better summary.

Answer Summary Generation Result

To evaluate the generated answer summaries, we also compare the proposed method with state-of-the-art baseline methods on the text summarization subtask: four extractive methods (Lead3, TextRank (?), NeuralSum (?), NeuSum (?)), two abstractive methods (Seq2Seq (?), PGN (?)) and two query-based methods ($\text{SD}_{2}$ (?), biASBLSTM (?)). ROUGE F1 scores are used to evaluate the summarization methods.
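To make the metric concrete, the following is a simplified sketch of a ROUGE-N F1 score based on clipped n-gram overlap. It assumes whitespace tokenization and omits the stemming and other preprocessing of the official ROUGE toolkit, so its numbers would not match reported scores exactly; the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between a generated and a reference summary."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, the candidate "wash the pan" against the reference "wash the pan with soap" has unigram precision 1.0 and recall 0.6, giving an F1 of 0.75.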

Text summarization results on WikiHowQA are summarized in Table 4. The experimental results show that the question-driven PGN outperforms all the state-of-the-art extractive and abstractive summarization methods, which demonstrates the effectiveness of incorporating question information to generate summaries for answers. The question information is directly involved in the calculation of the generation probability, which determines whether the next word is generated from the vocabulary or copied from the source text. In addition, by jointly learning with answer selection, ASAS further improves the result by a noticeable margin. The correlation information between question-answer pairs also aids in attending to the important words in the original answer that are related to the question. These results show that ASAS can effectively generate high-quality summaries for the selected answers.

Analysis of The Length of Answers

In order to validate the effectiveness of the proposed method on long-sentence answer selection, we split the test set according to the length of the answer. As shown in Fig. 2, we compare ASAS with two baseline methods, AP-LSTM and the Compare-Aggregate Model (CA), by measuring the accuracy, which is the ratio of correctly selected answers. We observe that ASAS performs better especially for long answers. For answers shorter than 100 words, CA and AP-LSTM are slightly better than ASAS, which indicates that the summary may lose some information for short answers. However, the performance of these two methods drops as the answer length increases, while ASAS remains stable.
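The length-wise analysis can be sketched as a simple bucketing of test examples. The bucket width of 100 words and the input layout, one `(answer_length, is_correct)` pair per test question, are assumptions for illustration; the text does not specify how the length bins in Fig. 2 were chosen.

```python
from collections import defaultdict

def accuracy_by_length(examples, bucket_size=100):
    """Accuracy of answer selection per answer-length bucket.

    examples : iterable of (answer_length_in_words, is_correct) pairs
    Returns a dict mapping the lower bound of each bucket to the ratio
    of correctly selected answers in that bucket.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for length, correct in examples:
        bucket = (length // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Comparing these per-bucket accuracies across models shows where a method degrades as answers get longer, which is the comparison described above.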

Human Evaluation on Summarization

We conduct human evaluation on a sample of the test set to evaluate the generated answer summaries from four aspects: (1) Informativity: how well does the summary capture the key information from the original answer? (2) Conciseness: how concise is the summary? (3) Readability: how fluent and coherent is the summary? (4) Correlatedness: how correlated are the summary and the given question? We randomly sample 50 answers and generate their summaries by three methods: NeuralSum, PGN w/ coverage and the proposed ASAS. Three data annotators are asked to score each generated summary from 1 to 5 (the higher the better).

Table 5 shows the human evaluation results. The results show that ASAS consistently outperforms the other methods in all aspects. Notably, the proposed method learns to generate answer summaries that are highly related to the given questions, so there is a substantial margin on Correlatedness. In order to intuitively observe the advantage of the proposed method, we randomly choose one example to show the answer summary generation results. As shown in Fig. 3, the extractive method (e.g., NeuralSum) selects important sentences from the original answer to form the answer summary, which still contains much insignificant or redundant information. The abstractive method (e.g., PGN) generates the answer summary from the vocabulary and the original answer, which may miss some key words and essential information. To address these defects, the proposed joint learning method (ASAS) takes into account the information provided by the question to capture the core idea of the original answer and generate a precise summary. More importantly, unlike the other methods, ASAS generates answer summaries at the same time as the answers are selected.

Resource-poor CQA Results

To evaluate the transferring ability and applicability of the proposed method, we conduct experiments on the resource-poor CQA task with transfer learning. We also conduct several ablations that use no pre-training or no fine-tuning: (i) Finetune/- is the baseline without pre-training, (ii) Finetune/No is trained on the training set of the source data without fine-tuning on the target training data, (iii) Finetune/Yes first pre-trains a model on the source data, and then uses the learned parameters to initialize the model, fine-tuning only the answer selection part on the target data. Following previous studies (?), we adopt the ratio of correctly selected answers as the evaluation metric. Note that we use an unsupervised summarization method, TextRank (?), to generate rough reference summaries for the Finetune/- setting with ASAS, since there is no reference summary in the original StackExchange dataset.

The experimental results show that even with the coarse reference summaries, ASAS (Finetune/-) achieves the best performance in 4 out of 5 domains, which demonstrates the applicability of the proposed joint learning framework. Under the zero-shot setting, ASAS (Finetune/No) also achieves results competitive with those of the strong baseline methods, which shows the strong transferring ability of the proposed method and the value of the large-scale source dataset, WikiHowQA. Fine-tuning the answer selection part further outperforms all the baselines by about 4%. This result indicates that there are indeed gaps between different CQA datasets and that the fine-tuning strategy effectively overcomes these domain differences. Compared with ASAS and AP-BiLSTM, CA and COALA hardly benefit from pre-training due to their reliance on unsupervised embedding matching features.

In addition, Fig. 4 presents examples of answer summary generation results from target datasets. For those resource-poor CQA tasks without reference answer summaries, ASAS can not only achieve state-of-the-art results on answer selection, but also automatically generate decent and concise summaries via a simple transfer learning strategy with a resource-rich dataset.

Conclusion

We study the joint learning of answer selection and answer summary generation in CQA. We propose a novel model that employs the question information to improve the summarization result, and meanwhile leverages the summaries to reduce noise in answers for better performance on long-sentence answer selection. In order to evaluate the answer generation task in CQA, we construct a new large-scale CQA dataset, WikiHowQA, which contains both labels for the answer selection task and reference summaries for the text summarization task. The experimental results show that the proposed joint learning method outperforms the state-of-the-art methods on both answer selection and summarization tasks, and possesses robust applicability and transferring ability for resource-poor CQA tasks.