SemEval-2017 Task 3: Community Question Answering

Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor

Introduction

Community Question Answering (CQA) on web forums such as Stack Overflow (http://stackoverflow.com/) and Qatar Living (http://www.qatarliving.com/forum) is gaining popularity, thanks to the flexibility of forums to provide information to a user Moschitti et al. (2016). Forums are moderated only indirectly via the community, rather open, and subject to few restrictions, if any, on who can post and answer a question, or what questions can be asked. On the positive side, a user can freely ask any question and can expect a variety of answers. On the negative side, it takes effort to go through the provided answers of varying quality and to make sense of them. It is not unusual for a popular question to have hundreds of answers, and it is very time-consuming for a user to inspect them all.

Hence, users can benefit from automated tools to help them navigate these forums, including support for finding similar existing questions to a new question, and for identifying good answers, e.g., by retrieving similar questions that already provide an answer to the new question.

Given the important role that natural language processing (NLP) plays for CQA, we have organized a challenge series to promote related research for the past three years. We have provided datasets, annotated data and we have developed robust evaluation procedures in order to establish a common ground for comparing and evaluating different approaches to CQA.

In greater detail, in SemEval-2015 Task 3 “Answer Selection in Community Question Answering” Nakov et al. (2015) (http://alt.qcri.org/semeval2015/task3), we mainly targeted conventional Question Answering (QA) tasks, i.e., answer selection. In contrast, in SemEval-2016 Task 3 Nakov et al. (2016b), we targeted a fuller spectrum of CQA-specific tasks, moving closer to the real application needs (a system based on SemEval-2016 Task 3 was integrated in Qatar Living’s betasearch Hoque et al. (2016): http://www.qatarliving.com/betasearch), particularly in Subtask C, which was defined as follows: “given (i) a new question and (ii) a large collection of question-comment threads created by a user community, rank the comments that are most useful for answering the new question”. A test question is new with respect to the forum, but can be related to one or more questions that have been previously asked in the forum. The best answers can come from different question–comment threads. The threads are independent of each other, the lists of comments are chronologically sorted, and there is meta information, e.g., the date of posting, the user who asked/answered the question, the category the question was asked in, etc.

The comments in a thread are intended to answer the question initiating that thread, but since this is a resource created by a community of casual users, there is a lot of noise and irrelevant material, in addition to the complications of informal language use, typos, and grammatical mistakes. Questions in the collection can also be related in different ways, although there is in general no explicit representation of this structure.

In addition to Subtask C, we designed subtasks A and B to give participants the tools to create a CQA system to solve subtask C. Specifically, Subtask A (Question-Comment Similarity) is defined as follows: “given a question from a question–comment thread, rank the comments according to their relevance (similarity) with respect to the question.” Subtask B (Question-Question Similarity) is defined as follows: “given a new question, rerank all similar questions retrieved by a search engine, assuming that the answers to the similar questions should also answer the new question.”

The relationship between subtasks A, B, and C is illustrated in Figure 1. In the figure, $q$ stands for the new question, $q^{\prime}$ is an existing related question, and $c$ is a comment within the thread of question $q^{\prime}$ . The edge $\overline{qc}$ relates to the main CQA task (subtask C), i.e., deciding whether a comment for a potentially related question is a good answer to the original question. This relation captures the relevance of $c$ for $q$ . The edge $\overline{qq^{\prime}}$ represents the similarity between the original and the related questions (subtask B). This relation captures the relatedness of $q$ and $q^{\prime}$ . Finally, the edge $\overline{q^{\prime}c}$ represents the decision of whether $c$ is a good answer for the question from its thread, $q^{\prime}$ (subtask A). This relation captures the appropriateness of $c$ for $q^{\prime}$ . In this particular example, $q$ and $q^{\prime}$ are indeed related, and $c$ is a good answer for both $q^{\prime}$ and $q$ .

The participants were free to approach Subtask C with or without solving Subtasks A and B, and participation in the main subtask and/or the two subtasks was optional.

We had three objectives for the first two editions of our task: (i) to focus on semantic-based solutions beyond simple “bag-of-words” representations and “word matching” techniques; (ii) to study new NLP challenges arising in the CQA scenario, e.g., relations between the comments in a thread, relations between different threads, and question-to-question similarity; and (iii) to facilitate the participation of non-IR/QA experts.

The third objective was achieved by providing the set of potential answers and asking the participants to (re)rank the answers, and also by defining two optional subtasks (A and B), in addition to the main subtask (i.e., C).

Last year, we were successful in attracting a large number of participants to all subtasks. However, as the task design was new (we added subtasks B and C in the 2016 edition of the task), we felt that participants would benefit from a rerun, with new test sets for subtasks A–C.

We preserved the multilinguality aspect (as in 2015 and 2016), providing data for two languages: English and Arabic. In particular, we had an Arabic subtask D, which used data collected from three medical forums. This year, we used a slightly different procedure for the preparation of the test set, compared to the way the training, development, and test data for subtask D were collected last year.

Additionally, we included a new subtask, subtask E, which enables experimentation on Question–Question Similarity on a large-scale CQA dataset, i.e., StackExchange, based on the CQADupStack data set Hoogeveen et al. (2015). Subtask E is a duplicate question detection task, and like Subtask B, it is focused on question–question similarity. Participants were asked to rerank 50 candidate questions according to their relevance with respect to each query question. The subtask included several elements that differentiate it from Subtask B (see Section 3.2).

We provided manually annotated training data for both languages and for all subtasks. All examples were manually labeled by a community of annotators using a crowdsourcing platform. The datasets and the annotation procedure for the old data for subtasks A, B and C are described in Nakov et al. (2016b). In order to produce the new data for Subtask D, we used a slightly different procedure compared to 2016, which we describe in Section 2.

The remainder of this paper is organized as follows: Section 2 introduces related work. Section 3 gives a more detailed definition of the subtasks; it also describes the datasets and the process of their creation, and it explains the evaluation measures we used. Section 4 presents the results for all subtasks and for all participating systems. Section 5 summarizes the main approaches used by these systems and provides further discussion. Finally, Section 6 presents the main conclusions.

Related Work

The first step to automatically answer questions on CQA sites is to retrieve a set of questions similar to the question that the user has asked. This set of similar questions is then used to extract possible answers for the original input question. Despite its importance, question similarity for CQA is a hard task due to problems such as the “lexical gap” between the two questions.

Question-question similarity has been featured as a subtask (subtask B) of SemEval-2016 Task 3 on Community Question Answering Nakov et al. (2016b); there was also a similar subtask as part of SemEval-2016 Task 1 on Semantic Textual Similarity Agirre et al. (2016). Question-question similarity is an important problem with applications to question recommendation, question duplicate detection, community question answering, and question answering in general. Typically, it has been addressed using a variety of textual similarity measures. Some work has paid attention to modeling the question topic, which can be done explicitly, e.g., using question topic and focus Duan et al. (2008) or using a graph of topic terms Cao et al. (2008), or implicitly, e.g., using a language model with a smoothing method based on the category structure of Yahoo! Answers Cao et al. (2009) or using an LDA topic language model that matches the questions not only at the term level but also at the topic level Zhang et al. (2014).

Another important aspect is syntactic structure, e.g., Wang et al. (2009) proposed a retrieval model for finding similar questions based on the similarity of syntactic trees, and Da San Martino et al. (2016) used syntactic kernels. Yet another emerging approach is to use neural networks, e.g., dos Santos et al. (2015) used convolutional neural networks (CNNs), Romeo et al. (2016) used long short-term memory (LSTM) networks with neural attention to select the important part of the text when comparing two questions, and Lei et al. (2016) used a combined recurrent–convolutional model to map questions to continuous semantic representations. Finally, translation Jeon et al. (2005); Zhou et al. (2011) and cross-language models Da San Martino et al. (2017) have also been popular for question-question similarity.

Question-answer similarity has been a subtask (subtask A) of our task in its two previous editions Nakov et al. (2015, 2016b). This is a well-researched problem in the context of general question answering. One research direction has been to try to match the syntactic structure of the question to that of the candidate answer. For example, Wang et al. (2007) proposed a probabilistic quasi-synchronous grammar to learn syntactic transformations from the question to the candidate answers. Heilman and Smith (2010) used an algorithm based on Tree Edit Distance (TED) to learn tree transformations in pairs. Wang and Manning (2010) developed a probabilistic model to learn tree-edit operations on dependency parse trees. Yao et al. (2013) applied linear chain conditional random fields (CRFs) with features derived from TED to learn associations between questions and candidate answers. Moreover, syntactic structure was central for some of the top systems that participated in SemEval-2016 Task 3 Filice et al. (2016); Barrón-Cedeño et al. (2016).

Another important research direction has been on using neural network models for question-answer similarity Feng et al. (2015); Severyn and Moschitti (2015); Wang and Nyberg (2015); Tan et al. (2015); Barrón-Cedeño et al. (2016); Filice et al. (2016); Mohtarami et al. (2016). For instance, Tan et al. (2015) used neural attention over a bidirectional long short-term memory (LSTM) neural network in order to generate better answer representations given the questions. Another example is the work of Tymoshenko et al. (2016), who combined neural networks with syntactic kernels.

Yet another research direction has been on using machine translation models as features for question-answer similarity Berger et al. (2000); Echihabi and Marcu (2003); Jeon et al. (2005); Soricut and Brill (2006); Riezler et al. (2007); Li and Manandhar (2011); Surdeanu et al. (2011); Tran et al. (2015); Hoogeveen et al. (2016a); Wu and Zhang (2016), e.g., a variation of IBM model 1 Brown et al. (1993), to compute the probability that the question is a “translation” of the candidate answer. Similarly, Guzmán et al. (2016a, b) ported an entire machine translation evaluation framework Guzmán et al. (2015) to the CQA problem.

Using information about the answer thread is another important direction, which has been explored mainly to address Subtask A. In the 2015 edition of the task, the top participating systems used thread-level features, in addition to local features that only look at the question–answer pair. For example, the second-best team, HITSZ-ICRC, used as a feature the position of the comment in the thread, such as whether the answer is first or last Hou et al. (2015). Similarly, the third-best team, QCRI, used features to model a comment in the context of the entire comment thread, focusing on user interaction Nicosia et al. (2015). Finally, the fifth-best team, ICRC-HIT, treated the answer selection task as a sequence labeling problem and proposed recurrent convolutional neural networks to recognize good comments Zhou et al. (2015b).

In follow-up work, Zhou et al. (2015a) included long short-term memory (LSTM) units in their convolutional neural network to model the classification sequence for the thread, and Barrón-Cedeño et al. (2015) exploited the dependencies between the thread comments to tackle the same task. This was done by designing features that look globally at the thread and by applying structured prediction models, such as CRFs.

This research direction was further extended by Joty et al. (2015), who used the output structure at the thread level in order to make more consistent global decisions about the goodness of the answers in the thread. They modeled the relations between pairs of comments at any distance in the thread, and combined the predictions of local classifiers using graph-cut and Integer Linear Programming. In follow-up work, Joty et al. (2016) proposed joint learning models that integrate inference within the learning process using global normalization and an Ising-like edge potential.

Question–External comment similarity is our main task (subtask C), and it is inter-related to subtasks A and B, as described in the triangle of Figure 1. This task has been much less studied in the literature, mainly because its definition is specific to our SemEval Task 3, and it first appeared in the 2016 edition Nakov et al. (2016b). Most of the systems that took part in the competition, including the winning system of the SUper team Mihaylova et al. (2016), approached the task indirectly by solving subtask A at the thread level and then using these predictions together with the reciprocal rank of the related questions in order to produce a final ranking for subtask C. One exception is the KeLP system Filice et al. (2016), which was ranked second in the competition. This system combined information from different subtasks and from all input components. It used a modular kernel function, including stacking from independent subtask A and B classifiers, and applying SVMs to train a Good vs. Bad classifier Filice et al. (2016). In a related study, Nakov et al. (2016a) discussed the input information to solve Subtask C, and concluded that one has to model mainly question-to-question similarity (Subtask B) and answer goodness (subtask A), while modeling the direct relation between the new question and the candidate answer (from a related question) was found to be far less important.
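The indirect strategy described above (subtask A predictions combined with the reciprocal rank of the related question) can be sketched as follows. This is only an illustration and not the method of any particular team: the function name, the input layout, and the decision to combine the two signals by multiplication are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def rank_external_comments(
    thread_scores: Dict[int, List[float]],
) -> List[Tuple[int, int, float]]:
    """Rank comments from several related threads for a new question.

    thread_scores maps the search-engine rank of a related question
    (1 = top result) to the subtask A scores of the comments in its thread.
    Returns (question_rank, comment_index, combined_score) triples,
    best first.
    """
    combined = []
    for q_rank, scores in thread_scores.items():
        reciprocal_rank = 1.0 / q_rank
        for c_idx, goodness in enumerate(scores, start=1):
            # Assumption: the two signals are combined by multiplication.
            combined.append((q_rank, c_idx, goodness * reciprocal_rank))
    return sorted(combined, key=lambda triple: triple[2], reverse=True)
```

With this choice, a strong comment in a lower-ranked thread can still outrank a weak comment in the top-ranked thread, which is the behaviour the indirect approach relies on.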

Finally, in another recent approach, Bonadiman et al. (2017) studied how to combine the different CQA subtasks. They presented a multitask neural architecture where the three tasks are trained together with the same representation. The authors showed that the multitask system yields good improvement for Subtask C, which is more complex and clearly dependent on the other two tasks.

Some notable features across all subtasks. Finally, we should mention some interesting features used by the participating systems across all three subtasks. These include fine-tuned word embeddings (https://github.com/tbmihailov/semeval2016-task3-cqa) Mihaylov and Nakov (2016b); features modeling text complexity, veracity, and user trollness (using a heuristic that if several users call somebody a troll, then s/he should be one Mihaylov et al. (2015a, b); Mihaylov and Nakov (2016a); Mihaylov et al. (2017b)) Mihaylova et al. (2016); sentiment polarity features Nicosia et al. (2015); and PMI-based goodness polarity lexicons Balchev et al. (2016); Mihaylov et al. (2017a).

Subtasks and Data Description

The 2017 challenge was structured as a set of five subtasks, four of which (A, B, C and E) were offered for English, while the fifth one (D) was for Arabic. We leveraged the data we developed in 2016 for the first four subtasks, creating only new test sets for them, whereas we built a completely new dataset for the new Subtask E.

The first four tasks and the datasets for them are described in Nakov et al. (2016b). Here we review them briefly.

Question-Comment Similarity. Given a question $Q$ and the first ten comments in its question thread ( $c_{1},\dots,c_{10}$ ), the goal is to rank these ten comments according to their relevance with respect to that question. (We limit the number of comments we consider to the first ten only in order to spare some annotation effort.)

Note that this is a ranking task, not a classification task; we use mean average precision (MAP) as the official evaluation measure. This setting was adopted as it is closer to the application scenario than pure comment classification. For a perfect ranking, a system has to place all “Good” comments above the “PotentiallyUseful” and the “Bad” comments; the latter two are not actually distinguished and are considered “Bad” at evaluation time. This year, we eliminated the “PotentiallyUseful” class from the test data at annotation time.

Question-Question Similarity. Given a new question $Q$ (aka original question) and the set of the first ten related questions from the forum ( $Q_{1},\dots,Q_{10}$ ) retrieved by a search engine, the goal is to rank the related questions according to their similarity with respect to the original question.

In this case, we consider the “PerfectMatch” and the “Relevant” questions both as good (i.e., we do not distinguish between them and we will consider them both “Relevant”), and they should be ranked above the “Irrelevant” questions. As in subtask A, we use MAP as the official evaluation measure. To produce the ranking of related questions, participants have access to the corresponding related question-thread. (Note that the search engine indexes entire Web pages, and thus, the search engine has compared the original question to the related questions together with their comment threads. Thus, being more precise, this subtask could have been named Question–Question+Thread Similarity.)

Question-External Comment Similarity. Given a new question $Q$ (also known as the original question), and the set of the first ten related questions ( $Q_{1},\dots,Q_{10}$ ) from the forum retrieved by a search engine for $Q$ , each associated with the first ten comments appearing in its own thread ( $c_{1}^{1},\dots,c_{1}^{10},\dots,c_{10}^{1},\dots,c_{10}^{10}$ ), the goal is to rank these 10 $\times$ 10 = 100 comments $\{c_{i}^{j}\}_{i,j=1}^{10}$ according to their relevance with respect to the original question $Q$ .

This is the main English subtask. As in subtask A, we want the “Good” comments to be ranked above the “PotentiallyUseful” and the “Bad” comments, which will be considered just bad in terms of evaluation. Although the systems are supposed to work on 100 comments, we take an application-oriented view in the evaluation, assuming that users would like to have good comments concentrated in the first ten positions. We believe users care much less about what happens in lower positions (e.g., after the 10th) in the rank, as they typically do not ask for the next page of results in a search engine such as Google or Bing. This is reflected in our primary evaluation score, MAP, which we restrict to consider only the top ten results for subtask C.

Rank the correct answers for a new question. Given a new question $Q$ (aka the original question), the set of the first 30 related questions retrieved by a search engine, each associated with one correct answer ( $(Q_{1},c_{1}),\dots,(Q_{30},c_{30})$ ), the goal is to rank the 30 question-answer pairs according to their relevance with respect to the original question. We want the “Direct” and the “Relevant” answers to be ranked above the “Irrelevant” answers; the former two are considered “Relevant” in terms of evaluation. We evaluate the position of “Relevant” answers in the rank, and this is again a ranking task. Unlike the English subtasks, here we use 30 answers since the retrieval task is much more difficult, leading to low recall, and the number of correct answers is much lower. Again, the systems were evaluated using MAP, restricted to the top-10 results.

3.1.1 Data Description for A–D

The English data for subtasks A, B, and C comes from the Qatar Living forum, which is organized as a set of seemingly independent question–comment threads. In short, for subtask A, we annotated the comments in a question-thread as “Good”, “PotentiallyUseful” or “Bad” with respect to the question that started the thread. Additionally, given original questions, we retrieved related question–comment threads and annotated the related questions as “PerfectMatch”, “Relevant”, or “Irrelevant” with respect to the original question (Subtask B). We then annotated the comments in the threads of related questions as “Good”, “PotentiallyUseful” or “Bad” with respect to the original question (Subtask C).

For Arabic, the data was extracted from medical forums and has a different format. Given an original question, we retrieved pairs of the form (related_question, answer_to_the_related_question). These pairs were annotated as “Direct” answer, “Relevant” and “Irrelevant” with respect to the original question.

We annotated new English test data following the same setup as for SemEval-2016 Task 3 Nakov et al. (2016b), except that we eliminated the “PotentiallyUseful” class for subtask A. We first selected a set of questions to serve as original questions. In a real-world scenario those would be questions that had never been asked previously, but here we used existing questions from Qatar Living.

From each original question, we generated a query, using the question’s subject (after some word removal if the subject was too long). Then, we executed the query against Google, limiting the search to the Qatar Living forum, and we collected up to 200 resulting question-comment threads as related questions. Afterwards, we filtered out threads with fewer than ten comments as well as those for which the question was more than 2,000 characters long. Finally, we kept the top-10 surviving threads, keeping just the first 10 comments in each thread.
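The filtering and truncation steps above can be sketched as follows. This is a minimal illustration rather than the organizers' actual code: the function name and the dictionary keys `question` and `comments` are hypothetical, and the threads are assumed to arrive already sorted in search-engine order.

```python
from typing import Dict, List

MIN_COMMENTS = 10          # threads with fewer comments are dropped
MAX_QUESTION_CHARS = 2000  # threads with longer questions are dropped
TOP_THREADS = 10           # number of surviving threads to keep
TOP_COMMENTS = 10          # number of comments to keep per thread


def select_threads(threads: List[Dict]) -> List[Dict]:
    """Filter and truncate search results, preserving their order.

    Each thread is a dict with the hypothetical keys 'question' (str)
    and 'comments' (list of str).
    """
    kept = [
        t for t in threads
        if len(t["comments"]) >= MIN_COMMENTS
        and len(t["question"]) <= MAX_QUESTION_CHARS
    ]
    return [
        {"question": t["question"], "comments": t["comments"][:TOP_COMMENTS]}
        for t in kept[:TOP_THREADS]
    ]
```

Filtering before truncating matters here: the top-10 cut is applied to the surviving threads, so a discarded thread is replaced by the next search result rather than leaving a gap.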

We formatted the results in XML with UTF-8 encoding, adding metadata for the related questions and for their comments; however, we did not provide any meta information about the original question, in order to emulate a scenario where it is a new question, never asked before in the forum. In order to have a valid XML, we had to do some cleansing and normalization of the data. We added an XML format definition at the beginning of the XML file and we made sure it validated.

We organized the XML data as a sequence of original questions (OrgQuestion), where each question has a subject, a body, and a unique question identifier (ORGQ_ID). Each such original question is followed by ten threads, where each thread consists of a related question (from the search engine results) and its first ten comments.
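Reading such a file could look roughly like the sketch below. Only `OrgQuestion` and `ORGQ_ID` are named in the text; storing the identifier as an attribute, the element names `Subject`, `Body`, `Thread`, `RelQuestion` and `Comment`, the nesting of threads inside the question element, and the sample content are all placeholders invented for the example and may differ from the released data.

```python
import xml.etree.ElementTree as ET

# Made-up sample; every element name except OrgQuestion is a placeholder.
SAMPLE = """<Data>
  <OrgQuestion ORGQ_ID="Q1">
    <Subject>Visa renewal</Subject>
    <Body>How long does it take?</Body>
    <Thread>
      <RelQuestion>Renewing a visa</RelQuestion>
      <Comment>About two weeks.</Comment>
      <Comment>It depends on the sponsor.</Comment>
    </Thread>
  </OrgQuestion>
</Data>"""


def load_questions(xml_text):
    """Return one dict per original question, with its related threads."""
    root = ET.fromstring(xml_text)
    questions = []
    for org in root.iter("OrgQuestion"):
        threads = [
            {
                "related_question": thread.findtext("RelQuestion", default=""),
                "comments": [c.text or "" for c in thread.findall("Comment")],
            }
            for thread in org.findall("Thread")
        ]
        questions.append({
            "id": org.get("ORGQ_ID"),
            "subject": org.findtext("Subject", default=""),
            "body": org.findtext("Body", default=""),
            "threads": threads,
        })
    return questions
```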

We made available to the participants for training and development the data from 2016 (and for subtask A, also from 2015), and we created a new test set of 88 new questions associated with 880 question candidates and 8,800 comments; details are shown in Table 1.

We also had to annotate new test data for Arabic. In 2016, we used data from three Arabic medical websites, which we downloaded and indexed locally using Solr (https://lucene.apache.org/solr/). Then, we performed 21 different query/document formulations, and we merged the retrieved results, ranking them according to the reciprocal rank fusion algorithm Cormack et al. (2009). Finally, we truncated the result list to the 30 top-ranked question–answer pairs.
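Reciprocal rank fusion scores each document by summing $1/(k + \mathrm{rank})$ over all the rankings in which it appears. A minimal sketch of the merge-and-truncate step is given below; the function name is hypothetical, and the constant $k = 60$ is the value suggested by Cormack et al. (2009), which is assumed here because the text does not state which value was used.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def reciprocal_rank_fusion(
    rankings: List[List[Hashable]], k: int = 60, top_n: int = 30
) -> List[Hashable]:
    """Merge several ranked lists of document ids into one fused list.

    Each document receives the sum of 1 / (k + rank) over the rankings
    that contain it (ranks start at 1); the fused list is cut at top_n.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused[:top_n]
```

In the setting described above, `rankings` would hold the 21 result lists, one per query/document formulation, and `top_n=30` gives the truncated list of question–answer pairs.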

This year, we only used one of these websites, namely Altibbi.com (http://www.altibbi.com/اسئلة-طبية). First, we selected some questions from that website to be used as original questions, and then we used Google to retrieve potentially related questions using the site:* filter.

We turned the question into a query as follows: We first queried Google using the first thirty words from the original question. If this did not return ten results, we reduced the query to the first ten non-stopwords from the question (we used the following Arabic stopword list: https://sites.google.com/site/kevinbouge/stopwords-lists), and if needed we further tried using the first five non-stopwords only. If we did not manage to obtain ten results, we discarded that original question.
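The back-off procedure above can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, `search` stands in for whatever call issues the query and returns a list of result links, and whitespace tokenization is assumed because the text does not say how words were split.

```python
from typing import Callable, List, Optional, Set


def retrieve_with_backoff(
    question: str,
    stopwords: Set[str],
    search: Callable[[str], List[str]],
    needed: int = 10,
) -> Optional[List[str]]:
    """Try progressively shorter queries until enough results come back.

    Returns the first `needed` results, or None if every query fails,
    in which case the original question is to be discarded.
    """
    words = question.split()  # assumption: whitespace tokenization
    content_words = [w for w in words if w not in stopwords]
    queries = [
        " ".join(words[:30]),          # first thirty words
        " ".join(content_words[:10]),  # first ten non-stopwords
        " ".join(content_words[:5]),   # first five non-stopwords
    ]
    for query in queries:
        results = search(query)
        if len(results) >= needed:
            return results[:needed]
    return None
```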

If we managed to obtain ten results, we followed the resulting links and we parsed the target page to extract the question and the answer, which is given by a physician, as well as some metadata such as date, question classification, doctor’s name and country, etc.

In many cases, Google returned our original question as one of the search results, in which case we had to exclude it, thus reducing the results to nine. In the remaining cases, we excluded the 10th result in order to have the same number of candidate question–answer pairs for each original question, namely nine. Overall, we collected 1,400 original questions, with exactly nine potentially related question–answer pairs for each of them, i.e., a total of 12,600 pairs.

We created an annotation job on CrowdFlower to obtain judgments about the relevance of the question–answer pairs with respect to the original question. We controlled the quality of annotation using a hidden set of 50 test questions. We had three judgments per example, which we combined using the CrowdFlower mechanism. The average agreement was 81%. Table 2 shows statistics about the resulting dataset, together with statistics about the datasets from 2016, which could be used for training and development.

3.1.2 Evaluation Measures for A–D

The official evaluation measure we used to rank the participating systems is Mean Average Precision (“MAP”), calculated over the top-10 comments as ranked by a participating system. We further report the results for two unofficial ranking measures, which we also calculated over the top-10 results only: Mean Reciprocal Rank (“MRR”) and Average Recall (“AvgRec”). Additionally, we report the results for four standard classification measures, which we calculate over the full list of results: Precision, Recall and F1 (with respect to the Good/Relevant class), and Accuracy.
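For readers who want to reproduce the two simplest ranking measures, the sketch below computes MAP and MRR over the top-10 positions. It is not the released scorer: the function names are hypothetical, and two choices are assumptions, namely that average precision is normalized by the number of relevant items (capped at the cutoff) and that a query with no relevant item scores 0. The set of positive labels follows the label collapsing described for subtasks A to D.

```python
from typing import List

# Labels treated as positive at evaluation time in subtasks A-D.
POSITIVE_LABELS = {"Good", "PerfectMatch", "Relevant", "Direct"}


def binarize(labels: List[str]) -> List[bool]:
    """Map gold labels of a ranked list to True (positive) or False."""
    return [label in POSITIVE_LABELS for label in labels]


def average_precision_at_k(relevance: List[bool], k: int = 10) -> float:
    """Average precision over the top-k positions of one ranked list."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption: no relevant item means a score of 0
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    # Assumption: normalize by the relevant count, capped at the cutoff.
    return precision_sum / min(total_relevant, k)


def reciprocal_rank_at_k(relevance: List[bool], k: int = 10) -> float:
    """1 / position of the first relevant item in the top k, else 0."""
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def mean_over_queries(metric, ranked_labels: List[List[str]]) -> float:
    """Average a per-query metric over all queries (MAP or MRR)."""
    if not ranked_labels:
        return 0.0
    return sum(metric(binarize(labels)) for labels in ranked_labels) / len(ranked_labels)
```

Each inner list passed to `mean_over_queries` holds the gold labels of one query's candidates, in the order in which the system ranked them.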

We released a specialized scorer that calculates and returns all the above-mentioned scores.

3.2 The New Subtask E

Subtask E is a duplicate question detection task, similar to Subtask B. Participants were asked to rerank 50 candidate questions according to their relevance with respect to each query question. The subtask included several elements that distinguish it from Subtask B:

Several meta-data fields were added, including the tags that are associated with each question, the number of times a question has been viewed, and the score of each question, answer and comment (the number of upvotes it has received from the community, minus the number of downvotes), as well as user statistics, containing information such as user reputation and user badges. (The complete list of available meta-data fields can be found on the Task website.)

At test time, two extra test sets containing data from two surprise subforums were provided, to test the cross-domain performance of the participants’ systems.

The participants were asked to truncate their result list in such a way that only “PerfectMatch” questions appeared in it. The evaluation metrics were adjusted to be able to handle empty result lists (see Section 3.2.2).

The data was taken from StackExchange instead of the Qatar Living forums, and reflected the real-world distribution of duplicate questions in having many query questions with zero relevant results.

The cross-domain aspect was of particular interest, as it has not received much attention in earlier duplicate question detection research.

The data consisted of questions from the following four StackExchange subforums: Android, English, Gaming, and Wordpress, derived from a data set known as CQADupStack Hoogeveen et al. (2015). Data size statistics can be found in Table 3. These subforums were chosen due to their size, and to reflect a variety of domains.

The data was provided in the same format as for the other subtasks. Each original question had 50 candidate questions, and these related questions each had a number of comments. On top of that, they had a number of answers, and each answer potentially had individual comments. The difference between answers and comments is that answers should contain a well-formed answer to the question, while comments contain things such as requests for clarification, remarks, and small additions to someone else’s answer. Since the content of StackExchange is provided by the community, the precise delineation between comments and the main body of a post can vary across forums.

The relevance labels in the development and in the training data were sourced directly from the users of the StackExchange sites, who can vote for questions to be closed as duplicates: these are the questions we labeled as PerfectMatch.

The questions labeled as Related are questions that are not duplicates, but that are somehow similar to the original question, also as judged by the StackExchange community. It is possible that some duplicate labels are missing, due to the voluntary nature of the duplicate labeling on StackExchange. The development and training data should therefore be considered a silver standard Hoogeveen et al. (2016b).

For the test data, we started an annotation project together with StackExchange. (A post made by StackExchange about the project can be found here: http://meta.stackexchange.com/questions/286329/project-reduplication-of-deduplication-has-begun) The goal was to obtain multiple annotations per question pair in the test set, from the same community that provided the labels in the development and in the training data. We expected the community to react enthusiastically, because the data would be used to build systems that can improve duplicate question detection on the site, ultimately saving the users manual effort. Unfortunately, only a handful of people were willing to annotate a sizeable set of question pairs, thus making their annotations unusable for the purpose of this shared task.

An example that includes a query question from the English subforum, a duplicate of that question, and a non-duplicate question (with respect to the query) is shown below:

Query: Why do bread companies add sugar to bread?

Duplicate: What is the purpose of sugar in baking plain bread?

Non-duplicate: Is it safe to eat potatoes that have sprouted?

3.2.2 Evaluation Measure for E

In CQA archives, the majority of new questions do not have a duplicate in the archive. We maintained this characteristic in the training, in the development, and in the test data, to stay as close to a real world setting as possible. This means that for most query questions, the correct result is an empty list.

This has two consequences: (1) a system that always returns an empty list is a challenging baseline to beat, and (2) standard IR evaluation metrics like MAP, which is used in the other subtasks, cannot be used, because they break down when the result list is empty or there are no relevant documents for a given query.

To solve this problem we used a modified version of MAP, as proposed by Liu et al. (2016). To make sure standard IR evaluation metrics do not break down on empty result list queries, Liu et al. (2016) add a nominal terminal document to the end of the ranking returned by a system, to indicate where the number of relevant documents ended. This terminal document has a corresponding gain value of:

As a result of this adjustment, queries without relevant documents in the index receive a MAP score of 1.0 for an empty result ranking. This is desired because, in such cases, the empty ranking is the correct result.
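To make the effect of the terminal document concrete, the following is a minimal sketch in Python, not the official scorer. It assumes binary gains for the returned documents, takes the terminal gain as an argument rather than computing it (its exact definition is given by Liu et al. (2016)), and assumes normalization by the number of relevant documents plus one for the terminal document. The function name and signature are hypothetical.

```python
from typing import Sequence


def truncated_average_precision(
    gains: Sequence[float], terminal_gain: float, num_relevant: int
) -> float:
    """AP over a ranking with a nominal terminal document appended.

    gains         -- 0/1 relevance of each returned document, in rank order
    terminal_gain -- gain of the appended terminal document (supplied by the
                     caller; see Liu et al. (2016) for its definition)
    num_relevant  -- number of relevant documents in the archive for the query
    """
    ranking = list(gains) + [terminal_gain]  # terminal document goes last
    cumulative_gain = 0.0
    score = 0.0
    for rank, gain in enumerate(ranking, start=1):
        cumulative_gain += gain
        # Precision at this rank, weighted by the gain found at this rank.
        score += gain * cumulative_gain / rank
    # Assumption: normalize by the relevant documents plus the terminal one.
    return score / (num_relevant + 1)


# A query with no duplicates in the archive, answered with an empty list.
# Assuming a terminal gain of 1 in this case, the terminal document sits at
# rank 1 and the score is 1.0, matching the behavior described above.
empty_result_score = truncated_average_precision([], terminal_gain=1.0, num_relevant=0)

# Returning one irrelevant document for the same query pushes the terminal
# document down to rank 2, which lowers the score.
noisy_result_score = truncated_average_precision([0], terminal_gain=1.0, num_relevant=0)
```

Under these assumptions, a system is rewarded for returning nothing when nothing is relevant, and penalized for padding the ranking with irrelevant documents.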

Participants and Results

The list of all participating teams can be found in Table 4. The results for subtasks A, B, C, and D are shown in Tables 5, 6, 7, and 8, respectively. Unfortunately, there were no official participants in Subtask E, and thus we present baseline results in Table 9. In all tables, the systems are ranked by the official MAP scores for their primary runs (shown in the third column). The following columns show the scores based on the other six unofficial measures; the rankings with respect to these additional measures are marked with a subindex (for the primary runs). Participants could submit one primary run, to be used for the official ranking, and up to two contrastive runs, which are scored but have unofficial status.

Twenty-two teams participated in the challenge, presenting a variety of approaches and features to address the different subtasks. They submitted a total of 85 runs (36 primary and 49 contrastive), which break down by subtask as follows: the English subtasks A, B and C attracted 14, 13, and 6 systems and 31, 34 and 14 runs, respectively. The Arabic subtask D got 3 systems and 6 runs. There were no participants for subtask E.

The best MAP scores had large variability depending on the subtask, going from 15.46 (best result for subtask C) to 88.43 (best result for subtask A). The best systems for subtasks A, B, and C were able to beat the baselines we provided by sizeable margins. In subtask D, only the best system was above the IR baseline.

Table 5 shows the results for subtask A, English, which attracted 14 teams (two more than in the 2016 edition). In total 31 runs were submitted: 14 primary and 17 contrastive. The last four rows of the table show the performance of four baselines. The first one is the chronological ranking, where the comments are ordered by their time of posting; we can see that all submissions but one outperform this baseline on all three ranking measures. The second baseline is a random baseline, which is 10 MAP points below the chronological ranking. Baseline 3 classifies all comments as Good, and it outperforms all but three of the primary systems in terms of F1 and one system in terms of Accuracy. However, it should be noted that the systems were not optimized for such measures. Finally, baseline 4 classifies all comments as Bad; it is outperformed by all primary systems in terms of Accuracy.

The winner of Subtask A is KeLP with a MAP of 88.43, closely followed by Beihang-MSRA, scoring 88.24. Relatively far from the first two, we find five systems, IIT-UHH, ECNU, bunji, EICA and SwissAlps, which all obtained a MAP of around 86.5.

2 Subtask B, English (Question-Question Similarity)

Table 6 shows the results for subtask B, English, which attracted 13 teams (3 more than in last year’s edition) and 34 runs: 13 primary and 21 contrastive. This is known to be a hard task. In contrast to the 2016 results, in which only 6 out of 11 teams beat the strong IR baseline (i.e., ordering the related questions in the order provided by the search engine), this year 10 of the 13 systems outperformed this baseline in terms of MAP, AvgRec and MRR. Moreover, the improvements for the best systems over the IR baseline are larger (reaching $>7$ MAP points absolute). This is a remarkable improvement over last year’s results.

The random baseline outperforms two systems in terms of Accuracy. The “all-good” baseline is below almost all systems on F1, but the “all-false” baseline yields the best Accuracy results. This is partly because the label distribution in the dataset is biased (81.5% of negative cases), but also because the systems were optimized for MAP rather than for classification accuracy (or precision/recall).
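The behavior of the two constant baselines follows directly from the label distribution. As a worked check, the sketch below derives their scores from the 81.5% negative rate quoted above and nothing else; the function name is illustrative, and the values in the results table may differ slightly due to rounding.

```python
def constant_baseline_scores(negative_rate: float) -> dict:
    """Accuracy and F1 of the two constant classifiers for a given negative rate."""
    positive_rate = 1.0 - negative_rate
    # "All-false": every prediction is negative, so accuracy equals the
    # negative rate (and there are no true positives).
    all_false_accuracy = negative_rate
    # "All-good": every prediction is positive, so recall is 1 and
    # precision equals the positive rate.
    precision, recall = positive_rate, 1.0
    all_good_f1 = 2 * precision * recall / (precision + recall)
    return {
        "all_false_accuracy": all_false_accuracy,
        "all_good_accuracy": positive_rate,
        "all_good_f1": all_good_f1,
    }


scores = constant_baseline_scores(0.815)
# all_false_accuracy is 0.815; all_good_f1 is about 0.312
```

A system has to classify well above chance on the minority class to exceed an accuracy of 0.815, which is hard to achieve when the model is tuned for ranking rather than for classification.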

The winner of the task is SimBow with a MAP of 47.22, followed by LearningToQuestion with 46.93, KeLP with 46.66, and Talla with 45.70. The other nine systems scored noticeably lower, ranging from about 41 to 45. Note that the contrastive1 run of KeLP, which corresponds to the KeLP system from last year Filice et al. (2016), achieved an even higher MAP of 49.00.

3 Subtask C, English (Question-External Comment Similarity)

The results for subtask C, English are shown in Table 7. This subtask attracted 6 teams (a sizable decrease compared to last year’s 10 teams), and 14 runs: 6 primary and 8 contrastive. The test set from 2017 had a much more skewed label distribution, with only 2.8% positive instances, compared to the $\sim$10% of the 2016 test set. This makes the overall MAP scores look much lower, as the number of examples without a single positive comment increased significantly, and they contribute 0 to the average, due to the definition of the measure. Consequently, the results cannot be compared directly to last year’s.
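The effect of examples without any positive comment can be seen in a small sketch of MAP in which such examples contribute 0 to the average. This is an illustration with hypothetical names and toy labels, not the official scorer (which may, for instance, apply a rank cutoff).

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int]) -> float:
    """AP for one ranked list of 0/1 labels; 0.0 if nothing is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[Sequence[int]]) -> float:
    """Mean of per-query AP; queries with no positives count as 0."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy data: one perfectly ranked query and three queries with no positives.
toy_rankings = [[1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
toy_map = mean_average_precision(toy_rankings)  # 0.25
```

Even a perfect ranker is capped at 0.25 on this toy set, which is why MAP values from test sets with different proportions of all-negative examples are not directly comparable.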

All primary systems managed to outperform all baselines with respect to the ranking measures. Moreover, all but one system outperformed the “all true” system on F1, and all of them were below the accuracy of the “all false” baseline, due to the extreme class imbalance.

The best-performing team for subtask C is IIT-UHH, with a MAP of 15.46, followed by bunji with 14.71, and KeLP with 14.35. The contrastive1 run of bunji, which used a neural network, obtained the highest MAP, 16.57, about two points higher than their primary run, which also uses the comment plausibility features. Thus, the difference seems to be due to the use of comment plausibility features, which hurt performance. In their SemEval system paper, Koreeda et al. (2017) explain that the similarity features are more important for Subtask C than plausibility features.

Indeed, Subtask C contains many comments that are not related to the original question, while candidate comments for subtask A are almost always on the same topic. Another explanation may be overfitting to the development set, since the authors manually designed the plausibility features using that set. As a result, such features perform much worse on the 2017 test set.

4 Subtask D, Arabic (Reranking the Correct Answers for a New Question)

Finally, the results for subtask D, Arabic are shown in Table 8. This year, subtask D attracted only 3 teams, which submitted 6 runs: 3 primary and 3 contrastive. Compared to last year, the 2017 test set contains a significantly larger number of positive question–answer pairs ($\sim$40% in 2017, compared to $\sim$20% in 2016), and thus the MAP scores are higher this year. Moreover, this year, the IR baseline comes from Google and is thus very strong and difficult to beat. Indeed, only the best system was able to improve on it (marginally) in terms of MAP, MRR and AvgRec.

As in some of the other tasks, the participants in Subtask D did not concentrate on optimizing for precision/recall/F1/accuracy and they did not produce sensible class predictions in most cases.

The best-performing system is GW_QA with a MAP score of 61.16, which barely improves over the IR baseline of 60.55. The other two systems, UPC-USMBA and QU-BIGIR, are about 3-4 points behind.

5 Subtask E, English (Multi-Domain Question Duplicate Detection)

The baselines for Subtask E can be found in Table 9. The IR baseline is BM25 with perfect truncation after the final relevant document for a given query (equating to an empty result list if there are no relevant documents). The zero results baseline is the score for a system that returns an empty result list for every single query. This is a high number for each subforum because for many queries there are no duplicate questions in the archive.
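Perfect truncation can be sketched as follows, assuming a ranked list of candidate IDs and a gold set of duplicates for the query. The function and variable names are hypothetical, and the BM25 ranking itself is taken as given.

```python
from typing import List, Set


def truncate_after_last_relevant(ranked_ids: List[str], relevant_ids: Set[str]) -> List[str]:
    """Cut a ranking right after its last relevant item (oracle truncation).

    Returns an empty list when the ranking contains no relevant item.
    """
    last_relevant = -1
    for position, doc_id in enumerate(ranked_ids):
        if doc_id in relevant_ids:
            last_relevant = position
    return ranked_ids[: last_relevant + 1]


bm25_ranking = ["q7", "q2", "q9", "q4"]  # toy output of a BM25 ranker
with_duplicate = truncate_after_last_relevant(bm25_ranking, {"q2"})
without_duplicate = truncate_after_last_relevant(bm25_ranking, set())
```

Because the cut-off point is chosen using the gold labels, this baseline is an oracle with respect to truncation: it shows what BM25 could achieve if the truncation decision were always correct.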

As previously stated, there are no results submitted by participants to be discussed for this subtask. Eight teams signed up to participate, but unfortunately none of them submitted test results.

Discussion and Conclusions

In this section, we first describe features that are common across the different subtasks. Then, we discuss the characteristics of the best systems for each subtask with focus on the machine learning algorithms and the instance representations used.

The features the participants used across the subtasks can be organized into the following groups:

(i) similarity features between questions and comments from their threads or between original questions and related questions, e.g., cosine similarity applied to lexical, syntactic and semantic representations, including distributed representations, often derived using neural networks;

(ii) content features, which are special signals that can clearly indicate a bad comment, e.g., when a comment contains “thanks”;

(iii) thread level/meta features, e.g., user ID, comment rank in the thread;

(iv) automatically generated features from syntactic structures using tree kernels.

Generally, similarity features were developed for the subtasks as follows:

Subtask A: Similarities between question subject vs. comment, question body vs. comment, and question subject+body vs. comment.

Subtask B: Similarities between the original and the related question at different levels: subject vs. subject, body vs. body, and subject+body vs. subject+body.

Subtask C: The same as above, plus the similarities of the original question, subject and body at all levels with the comments from the thread of the related question.

Subtask D: The same as above, without information about the thread, as there is no thread.

The similarity scores to be used as features were computed in various ways, e.g., most teams used the dot product computed over word $n$-grams ($n$=1,2,3) or character $n$-grams, optionally with TF-IDF weighting. Simple word overlap, i.e., the number of common words between two texts, was also considered, often normalized, e.g., by question/comment length. Overlap in terms of nouns or named entities was also explored.
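As an illustration of such features, the following sketch computes a normalized dot product (cosine) over word $n$-gram counts, the same over character $n$-grams, and a length-normalized word overlap. It assumes whitespace tokenization and lowercasing, uses raw counts rather than TF-IDF weights, and all function names are hypothetical; the participating systems differ in tokenization, weighting, and normalization.

```python
import math
from collections import Counter
from typing import List


def word_ngrams(text: str, n: int) -> Counter:
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text: str, n: int) -> Counter:
    chars = text.lower()
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def dot_product(a: Counter, b: Counter) -> float:
    # Counter returns 0 for missing keys, so unseen n-grams add nothing.
    return float(sum(count * b[key] for key, count in a.items()))


def cosine(a: Counter, b: Counter) -> float:
    norm = math.sqrt(dot_product(a, a) * dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def normalized_overlap(question: str, comment: str) -> float:
    """Number of shared words, normalized by the question length in words."""
    q_words = set(question.lower().split())
    c_words = set(comment.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def similarity_features(question: str, comment: str) -> List[float]:
    features = [cosine(word_ngrams(question, n), word_ngrams(comment, n)) for n in (1, 2, 3)]
    features.append(cosine(char_ngrams(question, 3), char_ngrams(comment, 3)))
    features.append(normalized_overlap(question, comment))
    return features


features = similarity_features(
    "Why do bread companies add sugar to bread?",
    "What is the purpose of sugar in baking plain bread?",
)
```

The resulting vector has one entry per similarity variant and can be concatenated with other feature groups before being passed to a learner.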

2 Learning Methods

This year, we saw a variety of machine learning approaches, ranging from SVMs to deep learning.

The KeLP system, which performed best on Subtask A, was SVM-based and used syntactic tree kernels with relational links between questions and comments, together with some standard text similarity measures linearly combined with the tree kernel. Variants of this approach were successfully used in related research Tymoshenko et al. (2016); Da San Martino et al. (2016), as well as in last year’s KeLP system Filice et al. (2016).

The best performing system on Subtask C, IIT-UHH, was also SVM-based, and it used textual, domain-specific, word-embedding and topic-modeling features. The most interesting aspect of this system is their method for dialogue chain identification in the comment threads, which yielded substantial improvements.

The best-performing system on Subtask B was SimBow. They used logistic regression on a rich combination of different unsupervised textual similarities, built using a relation matrix based on standard cosine similarity between bag-of-words and other semantic or lexical relations.

This year, we also saw a jump in the popularity of deep learning and neural networks. For example, the Beihang-MSRA system was ranked second with a result very close to that of KeLP for Subtask A. They used gradient-boosted regression trees, i.e., XGBoost, as a ranking model to combine (i) TF $\times$ IDF, word sequence overlap, translation probability, (ii) three different types of tree kernels, (iii) subtask-specific features, e.g., whether a comment is written by the author of the question, the length of a comment or whether a comment contains URLs or email addresses, and (iv) neural word embeddings, and the similarity score from Bi-LSTM and 2D matching neural networks.

LearningToQuestion achieved the second best result for Subtask B using SVM and Logistic Regression as integrators of rich feature representations, mainly embeddings generated by the following neural networks: (i) siamese networks to learn similarity measures using GloVe vectors Pennington et al. (2014), (ii) bidirectional LSTMs, (iii) gated recurrent units (GRU) used as another network to generate the neural embeddings trained by a siamese network similar to Bi-LSTM, and (iv) convolutional neural networks to generate embeddings inside the siamese network.

The bunji system, second on Subtask C, produced features using neural networks that capture the semantic similarities between two sentences as well as comment plausibility. The neural similarity features were extracted using a decomposable attention model Parikh et al. (2016), which can model alignment between two sequences of text, allowing the system to identify possibly related regions of a question and of a comment, which then helps it predict whether the comment is relevant with respect to the question. The model compares each token pair from the question tokens and comment tokens, associating them with an attention weight. Each question-comment pair is mapped to a real-valued score using a neural network with shared weights, and the prediction loss is calculated list-wise. The plausibility features are task-specific, e.g., is the person giving the answer actually trying to answer the question or is s/he making remarks or asking for more information. Other features are the presence of keywords such as what, which, who, where within the question. There are also features about the question and the comment length. All these features were merged in a CRF.
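The token-pair attention step can be illustrated with a heavily simplified sketch. It assumes toy two-dimensional token vectors and uses raw dot products as alignment scores; the model of Parikh et al. (2016) additionally passes the token vectors through learned feed-forward networks, which are omitted here, and this is not the bunji implementation. All names below are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(question: List[Vector], comment: List[Vector]) -> List[List[float]]:
    """For each question token, a distribution over the comment tokens."""
    weights = []
    for q_vec in question:
        scores = [sum(q * c for q, c in zip(q_vec, c_vec)) for c_vec in comment]
        weights.append(softmax(scores))
    return weights


def soft_aligned(question: List[Vector], comment: List[Vector]) -> List[Vector]:
    """Attention-weighted average of comment vectors for each question token."""
    dim = len(comment[0])
    aligned = []
    for row in attention_weights(question, comment):
        aligned.append([sum(w * c_vec[d] for w, c_vec in zip(row, comment)) for d in range(dim)])
    return aligned


# Toy embeddings: the first question token resembles the first comment token,
# and the second question token resembles the second comment token.
question_vectors = [[1.0, 0.0], [0.0, 1.0]]
comment_vectors = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights = attention_weights(question_vectors, comment_vectors)
aligned_vectors = soft_aligned(question_vectors, comment_vectors)
```

Each question token receives the highest weight on the comment token it resembles most, which is the sense in which the model can identify related regions of a question and a comment.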

Another interesting system is that of Talla, which consists of an ensemble of syntactic, semantic, and IR-based features, i.e., semantic word alignment, term frequency Kullback-Leibler divergence, and tree kernels. These were integrated in a pairwise-preference learning handled with a random forest classifier with 2,000 weak estimators. This system achieved very good performance on Subtask B.
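A common way to set up pairwise-preference learning is to turn each pair of candidates with different relevance labels into a feature-difference vector with a binary label, on which any binary classifier (such as a random forest) can then be trained. The sketch below shows this generic transformation only; it is not necessarily how Talla implemented it, and the function name and toy feature vectors are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pairwise_instances(
    features: Sequence[Sequence[float]], labels: Sequence[int]
) -> Tuple[List[List[float]], List[int]]:
    """Build (difference vector, preference label) pairs for one query.

    A label of 1 means the first candidate of the pair should rank higher.
    Pairs with equal relevance labels carry no preference and are skipped.
    """
    diffs, prefs = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue
        diff = [a - b for a, b in zip(features[i], features[j])]
        diffs.append(diff)
        prefs.append(1 if labels[i] > labels[j] else 0)
        # Add the mirrored pair so that both classes are represented.
        diffs.append([-d for d in diff])
        prefs.append(0 if labels[i] > labels[j] else 1)
    return diffs, prefs


toy_features = [[0.9, 0.2], [0.1, 0.4], [0.5, 0.5]]  # one row per candidate
toy_labels = [1, 0, 0]  # 1 = relevant, 0 = not relevant
X_pairs, y_pairs = pairwise_instances(toy_features, toy_labels)
```

At prediction time, a classifier trained on such pairs can be applied to all candidate pairs of a query, and the candidates can be ranked, for example, by the number of pairwise comparisons they win.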

Regarding Arabic, GW_QA, the best-performing system for Subtask D, used features based on latent semantic models, namely, weighted textual matrix factorization models (WTMF), as well as a set of lexical features based on string lengths and surface-level matching. WTMF builds a latent model, which is appropriate for semantic profiling of a short text. Its main goal is to address the sparseness of short texts using both observed and missing words to explicitly capture what the text is and is not about. The missing words are defined as those of the entire training data vocabulary minus those of the target document. The model was trained on text data from the Arabic Gigaword as well as on Arabic data that we provided on the task website, as part of the task. For Arabic text processing, the MADAMIRA toolkit was used.

The second-best team for Arabic, QU-BIGIR, used SVM-rank with two similarity feature sets. The first set captured similarity between pairs of text, i.e., synonym overlap, language model score, cosine similarity, Jaccard similarity, etc. The second set used word2vec to build average word embedding and covariance word embedding similarity to build the text representation.
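Two of these features are simple enough to sketch directly: Jaccard similarity over word sets, and cosine similarity between averaged word embeddings. The toy embedding table below stands in for word2vec vectors and is purely hypothetical, as are the function names; the covariance-based representation is not shown.

```python
import math
from typing import Dict, List


def jaccard(text_a: str, text_b: str) -> float:
    """Size of the word-set intersection divided by the size of the union."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_embedding(text: str, table: Dict[str, List[float]]) -> List[float]:
    """Mean of the vectors of the in-vocabulary words (zeros if none)."""
    dim = len(next(iter(table.values())))
    vectors = [table[w] for w in text.lower().split() if w in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


# Toy stand-in for a word2vec lookup table (hypothetical values).
toy_table = {"sugar": [1.0, 0.0], "bread": [0.8, 0.6], "potatoes": [0.0, 1.0]}
sim = cosine(average_embedding("sugar bread", toy_table),
             average_embedding("bread", toy_table))
```

Averaging discards word order, so this representation complements rather than replaces the lexical overlap features.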

The third-best team for Arabic, UPC-USMBA, combined several classifiers, including (i) lexical string similarities in vector representations, and (ii) rule-based features. A core component of their approach was the use of medical terminology covering both Arabic and English terms, which was organized into the following three categories: body parts, drugs, and diseases. In particular, they translated the Arabic dataset into English using the Google Translate service. The linguistic processing was carried out with Stanford CoreNLP for English and MADAMIRA for Arabic. Finally, WordNet synsets both for Arabic and English were added to the representation without performing word sense disambiguation.

Conclusions

We have described SemEval-2017 Task 3 on Community Question Answering, which extended the four subtasks at SemEval-2016 Task 3 Nakov et al. (2016b) with a new subtask on multi-domain question duplicate detection. Overall, the task attracted 23 teams, which submitted 85 runs; this is comparable to 2016, when 18 teams submitted 95 runs. The participants built on the lessons learned from the 2016 edition of the task, and further experimented with new features and learning frameworks. The top systems used neural networks with distributed representations or SVMs with syntactic kernels for linguistic analysis. A number of new features have been tried as well.

Apart from the new lessons learned from this year’s edition, we believe that the task has another important contribution: the datasets we have created as part of the task, which we have released to the research community, should be useful for follow-up research beyond SemEval.

Finally, while the new subtask E did not get any submissions, mainly because of the need to work with a large amount of data, we believe that it addresses an important problem and that it will attract the interest of many researchers in the field.

Acknowledgements

This research was performed in part by the Arabic Language Technologies (ALT) group at the Qatar Computing Research Institute (QCRI), HBKU, part of Qatar Foundation. It is part of the Interactive sYstems for Answer Search (Iyas) project, which is developed in collaboration with MIT-CSAIL. This research received funding in part from the Australian Research Council.