Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering

Shiyue Zhang, Mohit Bansal

Introduction

In contrast to the rapid progress shown in Question Answering (QA) tasks Rajpurkar et al. (2016); Joshi et al. (2017); Yang et al. (2018), the task of Question Generation (QG) remains understudied and challenging. However, as an important dual task to QA, QG can not only be used to augment QA datasets Duan et al. (2017), but can also be applied in conversation and education systems Heilman and Smith (2010); Lindberg et al. (2013). Furthermore, given that existing QA models often fall short by doing simple word/phrase matching rather than true comprehension Jia and Liang (2017), the task of QG, which usually needs complicated semantic reasoning and syntactic variation, should be another way to encourage true machine comprehension Lewis and Fan (2019). Recently, we have seen an increasing interest in the QG area, with mainly three categories: Text-based QG Du et al. (2017); Zhao et al. (2018), Knowledge-Base-based QG Reddy et al. (2017); Serban et al. (2016), and Image-based QG Li et al. (2018); Jain et al. (2017). Our work focuses on the Text-based QG branch.

Current QG systems follow an attention-based sequence-to-sequence structure, taking the paragraph-level context and answer as inputs and outputting the question. However, we observed that these QG models often generate questions that semantically drift away from the given context and answer; we call this the “semantic drift” problem. As shown in Figure 1, the baseline QG model generates a question that has almost contrary semantics with the ground-truth question, and the generated phrase “the principle of enlightenment” does not make sense given the context. We conjecture that the reason for this “semantic drift” problem is because the QG model is trained via teacher forcing only, without any high-level semantic regularization. Hence, the learned model behaves more like a question language model with some loose context constraint, while it is unaware of the strong requirements that it should be closely grounded by the context and should be answered by the given answer. Therefore, we propose two semantics-enhanced rewards to address this drift: QPP and QAP. Here, QPP refers to Question Paraphrasing Probability, which is the probability of the generated question and the ground-truth question being paraphrases; QAP refers to Question Answering Probability, which is the probability that the generated question can be correctly answered by the given answer. We regularize the generation with these two rewards via reinforcement learning. Experiments show that these two rewards can significantly improve the question generation quality separately or jointly, and achieve the new state-of-the-art performance on the SQuAD QG task.

Next, in terms of QG evaluation, previous works have mostly adopted popular automatic evaluation metrics, like BLEU, METEOR, etc. However, we observe that these metrics often fall short in properly evaluating the quality of generated questions. First, they are not always correlated with human judgment about answerability Nema and Khapra (2018). Second, since multiple questions are valid but only one reference exists in the dataset, these traditional metrics fail to appropriately score question paraphrases and novel generation (shown in Figure 3). Therefore, we introduce a QA-based evaluation method that directly measures the QG model’s ability to mimic human annotators in generating QA training data, because ideally, we hope that the QG model can act like a human to ask questions. We compare different QG systems using this evaluation method, which shows that our semantics-reinforced QG model performs best. However, this improvement is relatively minor compared to our improvement on other QG metrics, which indicates that improvement on typical QG metrics does not always make a QG model a better question annotator for generating QA training sets.

Further, we investigate how to use our best QG system to enrich QA datasets and perform semi-supervised QA on SQuADv1.1 Rajpurkar et al. (2016). Following the back-translation strategy that has been shown to be effective in Machine Translation Sennrich et al. (2016) and Natural Language Navigation Fried et al. (2018); Tan et al. (2019), we propose two methods to collect synthetic data. First, since multiple questions can be asked for one answer while there is only one human-labeled ground-truth, we make our QG model generate new questions for existing context-answer pairs in SQuAD training set, so as to enrich it with paraphrased and other novel but valid questions. Second, we use our QG model to label new context-answer pairs from new Wikipedia articles. However, directly mixing synthetic QA pairs with ground-truth data will not lead to improvement. Hence, we introduce two empirically effective strategies: one is a “data filter” based on QAP (same as the QAP reward) to filter out examples that have low probabilities to be correctly answered; the other is a “mixing mini-batch training” strategy that always regularizes the training signal with the ground-truth data. Experiments show that our method improves both BiDAF Seo et al. (2016); Clark and Gardner (2018) and BERT Devlin et al. (2018) QA baselines by 1.69/1.27 and 1.19/0.56 absolute points on EM/F1, respectively; even without introducing new articles, it can bring 1.51/1.13 and 0.95/0.13 absolute improvement, respectively.

Related Works

Early QG studies focused on using rule-based methods to transform statements to questions Heilman and Smith (2010); Lindberg et al. (2013); Labutov et al. (2015). Recent works adopted the attention-based sequence-to-sequence neural model Bahdanau et al. (2014) for QG tasks, taking the answer sentence as input and outputting the question Du et al. (2017); Zhou et al. (2017), which proved to be better than rule-based methods. Since human-labeled questions are often relevant to a longer context, later works leveraged information from the whole paragraph for QG, either by extracting additional information from the paragraph Du and Cardie (2018); Song et al. (2018); Liu et al. (2019) or by directly taking the whole paragraph as input Zhao et al. (2018); Kim et al. (2018); Sun et al. (2018). A very recent concurrent work applied the large-scale language model pre-training strategy for QG and also achieved a new state-of-the-art performance Dong et al. (2019). However, the above models were trained with teacher forcing only. To address the exposure bias problem, some works applied reinforcement learning taking evaluation metrics (e.g., BLEU) as rewards Song et al. (2017); Kumar et al. (2018). Yuan et al. (2017) proposed to use a language model’s perplexity ( $R_{PPL}$ ) and a QA model’s accuracy ( $R_{QA}$ ) as two rewards but failed to get significant improvement. Their second reward is similar to our QAP reward except that we use QA probability rather than accuracy, as the probability distribution is smoother. Hosking and Riedel (2019) compared a set of different rewards, including $R_{PPL}$ and $R_{QA}$ , and claimed none of them improved the quality of generated questions. For QG evaluation, even though some previous works conducted human evaluations, most of them still relied on traditional metrics (e.g., BLEU). However, Nema and Khapra (2018) pointed out that the existing metrics do not correlate with human judgment about answerability, so they proposed “Q-metrics” that mixed traditional metrics with an “answerability” score. In our work, we will show QG results on traditional metrics, Q-metrics, as well as human evaluation, and also propose a QA-based QG evaluation.

Question Generation for QA

As the dual task of QA, QG has often been proposed for improving QA. Some works have directly used QG in QA models’ pipeline Duan et al. (2017); Dong et al. (2017); Lewis and Fan (2019). Some other works enabled semi-supervised QA with the help of QG. Tang et al. (2017) applied the “dual learning” algorithm He et al. (2016) to learn QA and QG jointly with unlabeled texts. Yang et al. (2017) and Tang et al. (2018) followed the GAN Goodfellow et al. (2014) paradigm, taking QG as a generator and QA as a discriminator, to utilize unlabeled data. Sachan and Xing (2018) proposed a self-training cycle between QA and QG. However, these works either reduced the ground-truth data size or simplified the span-prediction QA task to answer sentence selection. Dhingra et al. (2018) collected 3.2M cloze-style QA pairs to pre-train a QA model, then fine-tuned it with the full ground-truth data, which improved a BiDAF-QA baseline. In our paper, we follow the back-translation Sennrich et al. (2016) strategy to generate new QA pairs with our best QG model to augment the SQuAD training set. Further, we introduce a data filter to remove poorly generated examples and a mixing mini-batch training strategy to more effectively use the synthetic data. Similar methods have also been applied in some very recent concurrent works Dong et al. (2019); Alberti et al. (2019) on SQuADv2.0. The main difference is that we also propose to generate new questions from existing articles without introducing new articles.

Question Generation

We first introduce our base model, which mainly adopts the model architecture from the previous state-of-the-art Zhao et al. (2018). The differences are that we introduce two linguistic features (POS & NER), apply deep contextualized word vectors, and tie the output projection matrix with the word embedding matrix. Experiments showed that with these additions, our base model results surpass the results reported in Zhao et al. (2018) by significant margins. Our base model architecture is shown in the upper box in Figure 2 and described as follows. If we have a paragraph $p=\{x_{i}\}_{i=1}^{M}$ and an answer $a$ which is a sub-span of $p$ , the target of the QG task is to generate a question $q=\{y_{j}\}_{j=1}^{N}$ that can be answered by $a$ based on the information in $p$ .

The model first concatenates four word representations: word vector, answer tag embedding, Part-of-Speech (POS) tag embedding, and Named Entity (NER) tag embedding, i.e., $e_{i}=[w_{i},a_{i},p_{i},n_{i}]$ . For word vectors, we use the deep contextualized word vectors from ELMo Peters et al. (2018) or BERT Devlin et al. (2018). The answer tag follows the BIO tagging scheme: “B”, for “Begin”, tags the start token of the answer span; “I”, for “Inside”, tags other tokens in the answer span; “O”, for “Other”, tags other tokens in the paragraph.
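As an illustration of the BIO answer tags, the following minimal Python sketch tags a tokenized paragraph given an answer span. The function name `bio_answer_tags` and the inclusive token-index convention are assumptions made for this example only.

```python
def bio_answer_tags(num_tokens, ans_start, ans_end):
    """Return BIO answer tags for a paragraph of `num_tokens` tokens.

    `ans_start` and `ans_end` are inclusive token indices of the answer
    span (an indexing convention assumed for this sketch).
    """
    if not (0 <= ans_start <= ans_end < num_tokens):
        raise ValueError("answer span must lie inside the paragraph")
    tags = ["O"] * num_tokens          # tokens outside the answer span
    tags[ans_start] = "B"              # first token of the answer span
    for i in range(ans_start + 1, ans_end + 1):
        tags[i] = "I"                  # remaining tokens of the answer span
    return tags
```

Each tag is then mapped to an embedding $a_{i}$ and concatenated with the other three representations.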

Encoder

The output of the embedding layer is then encoded by a two-layer bi-directional LSTM-RNN, resulting in a list of hidden representations $H$ . At any time step $i$ , the representation $h_{i}$ is the concatenation of $\overrightarrow{h_{i}}$ and $\overleftarrow{h_{i}}$ .

Self-attention

A gated self-attention mechanism Wang et al. (2017) is applied to $H$ to aggregate the long-term dependency within the paragraph. $\alpha_{i}$ is an attention vector between $h_{i}$ and each element in $H$ ; $u_{i}$ is the self-attention context vector for $h_{i}$ ; $h_{i}$ is then updated to $f_{i}$ using $u_{i}$ ; a soft gate $g_{i}$ decides how much the update is applied. $\hat{H}=[\hat{h}_{i}]_{i=1}^{M}$ is the output of this layer.
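One plausible way to write this layer down, following the gated self-attention formulation of the works cited above, is sketched here. The weight matrices $W^{u}$ , $W^{f}$ , and $W^{g}$ are names assumed for this sketch; $H$ is treated as the matrix whose columns are the $h_{i}$ , $[\cdot;\cdot]$ denotes concatenation, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.

```latex
\begin{align}
\alpha_{i} &= \mathrm{softmax}\left(H^{\top} W^{u} h_{i}\right), & u_{i} &= H\alpha_{i}, \\
f_{i} &= \tanh\left(W^{f}[h_{i}; u_{i}]\right), & g_{i} &= \sigma\left(W^{g}[h_{i}; u_{i}]\right), \\
\hat{h}_{i} &= g_{i} \odot f_{i} + (1 - g_{i}) \odot h_{i}.
\end{align}
```

When $g_{i}$ is close to 0 the layer passes $h_{i}$ through unchanged, and when it is close to 1 the representation is fully replaced by the update $f_{i}$ .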

Decoder

The decoder is another two-layer uni-directional LSTM-RNN. An attention mechanism dynamically aggregates $\hat{H}$ at each decoding step to a context vector $c_{j}$ which is then used to update the decoder state $s_{j}$ .

The probability of the target word $y_{j}$ is computed by a maxout neural network.

In practice, we keep the weight matrix $W^{e}$ the same as the word embedding matrix and fix it during training. Furthermore, we apply a “pointer network” Gu et al. (2016) to enable the model to copy words from input.

Semantics-Reinforced Model

To address the “semantic drift” problem shown in Figure 1, we propose two semantics-enhanced rewards to regularize the generation to focus on generating semantically valid questions.

QPP Reward

To deal with the “exposure bias” problem, many previous works directly used the final evaluation metrics (e.g., BLEU) as rewards to train the generation models Rennie et al. (2017); Paulus et al. (2017). However, these metrics sometimes fail to score question paraphrases fairly and thus provide inaccurate rewards. Hence, we propose to use a pre-trained question paraphrasing classification (QPC) model to provide paraphrasing probability as a reward. Since paraphrasing is more about semantic similarity than superficial word/phrase matching, it treats question paraphrases more fairly (Example 1 in Figure 3). Therefore, we first train a QPC model with the Quora Question Pairs dataset. Next, we take it as an environment, and the QG model interacts with it during training to get the probability of the generated question and the ground-truth question being paraphrases as the reward.

QAP Reward

Two observations motivate us to introduce the QAP reward. First, in a paragraph, there are usually several facts that relate to the answer and can be used to ask questions. Neither teacher forcing nor the QPP reward can favor this kind of novel generation (Example 2 in Figure 3). Second, we find semantically-drifted questions are usually unanswerable by the given answer. Therefore, inspired by the dual learning algorithm He et al. (2016), we propose to take the probability that a pre-trained QA model can correctly answer the generated question as a reward, i.e., $p(a^{*}|q^{s};p)$ , where $a^{*}$ is the ground-truth answer and $q^{s}$ is a sampled question. Using this reward, the model not only gets positive rewards for novel generation but is also regularized by the answerability requirement. Note that this reward should be used carefully because the QG model can cheat by greedily copying words in/near the answer to the generated question. In this case, even though high QAPs are achieved, the model loses the question generation ability.

Policy Gradient

To apply these two rewards, we use the REINFORCE algorithm Williams (1992) to learn a generation policy $p_{\theta}$ defined by the QG model parameters $\theta$ . We minimize the loss function $L_{RL}=-E_{q^{s}\sim p_{\theta}}[r(q^{s})]$ , where $q^{s}$ is a sampled question from the model’s output distribution. Due to the non-differentiable sampling procedure, the gradient is approximated using a single sample with a variance reduction baseline $b$ : $\nabla_{\theta}L_{RL}\approx-(r(q^{s})-b)\nabla_{\theta}\log p_{\theta}(q^{s})$ .

We follow the effective SCST strategy Rennie et al. (2017) to take the reward of greedy search result $q^{g}$ as the baseline, i.e., $b=r(q^{g})$ . However, only using this objective to train QG will result in poor readability, so we follow the mixed loss setting Paulus et al. (2017): $L_{mixed}=\gamma L_{RL}+(1-\gamma)L_{ML}$ . In practice, we find the mixing ratio $\gamma$ for QAP reward should be lower, i.e., it needs more regularization from teacher forcing, so that it can avoid the undesirable cheating issue mentioned above. Furthermore, we also apply the multi-reward optimization strategy Pasunuru and Bansal (2018) to train the model with two mixed losses alternately with an alternate rate $n:m$ , i.e., train with $L_{mixed}^{qpp}$ for $n$ mini-batches, then train with $L_{mixed}^{qap}$ for $m$ mini-batches, repeat until convergence. $n$ and $m$ are two hyper-parameters.
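The training objective described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`scst_loss`, `mixed_loss`, `reward_schedule`) are hypothetical, rewards and log-probabilities are passed in as plain floats, and in a real system the log-probability would be a differentiable quantity produced by the QG model.

```python
def scst_loss(sample_logprob, r_sample, r_greedy):
    """Single-sample REINFORCE surrogate with the greedy reward as baseline.

    Differentiating this value w.r.t. the model parameters gives
    -(r(q^s) - b) * grad log p(q^s), with b = r(q^g).
    """
    return -(r_sample - r_greedy) * sample_logprob


def mixed_loss(l_rl, l_ml, gamma):
    """L_mixed = gamma * L_RL + (1 - gamma) * L_ML."""
    return gamma * l_rl + (1.0 - gamma) * l_ml


def reward_schedule(n, m, steps):
    """Alternate rewards at rate n:m, i.e. n QPP mini-batches, then m QAP ones."""
    return ["qpp" if step % (n + m) < n else "qap" for step in range(steps)]
```

For example, `reward_schedule(3, 1, 8)` selects the QPP mixed loss for three mini-batches, the QAP mixed loss for one, and then repeats.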

Experiments show that these two rewards can significantly improve the QG performance separately or jointly, and we achieve new state-of-the-art QG performances, see details in Section 6.

QA-Based QG Evaluation

Inspired by the idea that “a perfect QG model can replace humans to ask questions”, we introduce a QA-based evaluation method that measures the quality of a QG model by its ability to mimic human annotators in labeling training data for QA models. The evaluation procedure is described as follows. First, we sample some unlabeled Wikipedia paragraphs with pre-extracted answer spans from HarvestingQA dataset Du and Cardie (2018). Second, we make a QG model act as an “annotator” to annotate a question for each answer span. Third, we train a QA model using this synthetic QA dataset. Lastly, we use the QA model’s performance on the original SQuAD development set as the evaluation for this QG model. The higher this QA performance is, the better the QG model mimics a human’s question-asking ability. We believe that this method provides a new angle to evaluate QG model’s quality and also a more reliable way to choose QG models to conduct data augmentation and semi-supervised QA.
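The evaluation procedure can be summarized by the following minimal Python sketch. All four callables (`qg_model`, `train_qa`, `evaluate_qa`) and the data layout are placeholders assumed for illustration; they stand in for whatever QG system, QA trainer, and EM/F1 scorer are actually used.

```python
def qa_based_qg_score(qg_model, train_qa, evaluate_qa, unlabeled, dev_set):
    """Score a QG model by how well its synthetic data trains a QA model.

    unlabeled: list of (paragraph, answer) pairs with pre-extracted answers.
    qg_model(paragraph, answer) -> question string.
    train_qa(list of (paragraph, question, answer)) -> trained QA model.
    evaluate_qa(qa_model, dev_set) -> score (e.g. EM or F1).
    """
    # The QG model acts as the "annotator" for every answer span.
    synthetic = [(p, qg_model(p, a), a) for p, a in unlabeled]
    # A QA model is trained on the synthetic data only.
    qa_model = train_qa(synthetic)
    # Its score on the human-labeled development set is the QG model's score.
    return evaluate_qa(qa_model, dev_set)
```

Comparing two QG systems then amounts to calling this function twice with the same `unlabeled` pairs, QA trainer, and development set.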

Semi-Supervised Question Answering

Since one of the major goals of developing QG systems is to generate new QA pairs and augment QA datasets, we investigate how to use our QG system to act as a question annotator, collect new QA pairs, and conduct semi-supervised QA. Figure 4 illustrates the overall procedure of our semi-supervised QA approach.

To generate synthetic QA pairs, we follow the effective “back translation” approach proposed in Neural Machine Translation (NMT) Sennrich et al. (2016). In NMT, the back translation method first obtains synthetic source sentences by running a pre-trained target-to-source translation model on a monolingual dataset of the target language; then, it combines the synthetic and ground-truth translation pairs to train the desired source-to-target translation model. Similarly, in the QA scenario, the paragraph/answer can be viewed as the “target sentence”, while the question can be taken as the “source sentence”. One tricky difference is that even if the paragraphs can be easily obtained from Wikipedia, there are no answer span labels. Therefore, we use two sources to generate questions from, as discussed below.

In SQuAD Rajpurkar et al. (2016), each context-answer pair only has one ground-truth question. However, usually, multiple questions can be asked. The diversity lies in question paraphrasing and different facts in the context that can be used to ask the question. Therefore, without introducing new Wikipedia articles, we make our QG model generate diverse questions for the existing context-answer pairs in the SQuAD training set by keeping all beam search outputs for each example.

Generate from New Articles

To use unlabeled Wikipedia articles for data augmentation, an automatic answer extractor is indispensable. Some previous works have proposed methods to detect key phrases from a paragraph and automatically extract potential answer spans Yang et al. (2017); Du and Cardie (2018); Subramanian et al. (2018). Instead of building our own answer extractor, we directly take advantage of the released HarvestingQA dataset. It contains 1.2M synthetic QA pairs, in which both the answer extractor and the QG model were proposed by Du and Cardie (2018). We use their paragraphs with answer span labels but generate questions with our QG models, and only use their questions for comparison.

Synthetic Data Usage

In practice, we find that directly mixing the synthetic data with the ground-truth data does not improve QA performance. We conjecture the reason is that some poor-quality synthetic examples mislead the learning process of the QA model. Therefore, we propose two empirical strategies to better utilize synthetic data.

The “self-training” literature has discussed a similar issue: using model-labeled examples to train the model will amplify the model’s errors. Later works proposed co-training or tri-training, which uses two or three models as judges and only keeps examples that all models agree on Blum and Mitchell (1998); Zhou and Li (2005). Sachan and Xing (2018) also designed question selection oracles based on a curriculum learning strategy in their QA-QG self-training cycle. In this paper, we simply design a data filter based on our QAP measure (same definition as the QAP reward) to filter poor-quality examples. We think if one question-answer pair has a low QAP, i.e., the probability of the answer given the question is low, it is likely to be a mismatched pair. Hence, we filter out synthetic examples with $QAP<\epsilon$ , where $\epsilon$ is a hyper-parameter that we tune for different synthetic datasets.
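The data filter itself is a simple threshold, as the following Python sketch shows. The function name `filter_by_qap` and the `qap_fn` callable are hypothetical; `qap_fn` stands in for a pre-trained QA model that returns the probability of the labeled answer given the paragraph and the generated question.

```python
def filter_by_qap(examples, qap_fn, epsilon):
    """Keep only synthetic examples whose QAP is at least `epsilon`.

    examples: iterable of synthetic QA examples (any representation).
    qap_fn(example) -> float in [0, 1], the QA model's probability of the
    labeled answer (a placeholder for the actual QA model).
    """
    return [ex for ex in examples if qap_fn(ex) >= epsilon]
```

Because $\epsilon$ is tuned per synthetic dataset, the same function can be reused with a different threshold for beam search outputs and for questions generated from new articles.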

Mixing Mini-Batch Training

When conducting semi-supervised learning, we do not want gradients from ground-truth data to be overwhelmed by synthetic data. Previous works Fried et al. (2018); Dhingra et al. (2018) proposed to first pre-train the model with synthetic data and then fine-tune it with ground-truth data. However, we find that when the synthetic data size is small (e.g., similar in size to the ground-truth data), catastrophic forgetting will happen during fine-tuning, leading to similar results as using ground-truth data only. Thus, we propose a “mixing mini-batch” training strategy, where for each mini-batch we combine half a mini-batch of ground-truth data with half a mini-batch of synthetic data, which keeps the data mixing ratio at 1:1 regardless of what the true data size ratio is. In this way, the training process generalizes to different amounts of synthetic data and the gradients remain regularized by ground-truth data.
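A minimal Python sketch of the mixing mini-batch strategy is given below. The function name `mixed_minibatches`, the per-batch random sampling, and the fixed seed are assumptions made for illustration; the only property taken from the description above is that each mini-batch is half ground-truth and half synthetic.

```python
import random


def mixed_minibatches(gold, synthetic, batch_size, num_batches, seed=0):
    """Yield mini-batches that are half ground-truth and half synthetic.

    Assumes both lists contain at least batch_size // 2 (rounded up for the
    synthetic half) examples, since each half is sampled without replacement.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        batch = rng.sample(gold, half) + rng.sample(synthetic, batch_size - half)
        rng.shuffle(batch)  # avoid a fixed gold-then-synthetic ordering
        yield batch
```

The 1:1 ratio inside every batch holds whether the synthetic set is the same size as the ground-truth set or many times larger.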

Experiment Setup

For QG, we use the most commonly used SQuAD QG dataset, first used by Du et al. (2017). For QA-based QG evaluation, we obtain unlabeled paragraphs and answer labels from HarvestingQA Du and Cardie (2018), and have different QG systems label questions. For semi-supervised QA, we use SQuADv1.1 Rajpurkar et al. (2016) as our base QA task, and split the original development set in half as our development and test sets, respectively. In addition, we make our QG model generate new questions from both SQuAD and HarvestingQA. We sample 10% to 100% of the examples from HarvestingQA, which are denoted by H1-H10 in our experiments.

Evaluation Metrics

For QG, we first adopt 3 traditional metrics (BLEU4/METEOR/ROUGE-L). Second, we apply the new Q-BLEU1 metric proposed by Nema and Khapra (2018). Moreover, we conduct a pairwise human evaluation between our baseline and QPP&QAP model on MTurk. We gave the annotators a paragraph with the answer in bold and two questions generated by two models (randomly shuffled). We asked them to decide which one is better or whether they are non-distinguishable. For both QA-based QG evaluation and semi-supervised QA, we follow the standard evaluation method for SQuAD and use Exact Match (EM) and F1.

More details about datasets, evaluation metrics, human evaluation setup, and model implementation details are provided in the Appendix.

Results

First, as shown in Table 1, our baseline QG model obtains a non-trivial improvement over the previous best QG system Zhao et al. (2018), which demonstrates the effectiveness of our newly introduced setups: introducing POS/NER features, using deep contextualized word vectors (from ELMo or BERT), and tying the output projection matrix with the non-trainable word embedding matrix. Second, we apply three evaluation metrics as rewards to deal with the exposure bias issue and improve performance. All the metrics are significantly ( $p<0.001$ ) improved except QPP, which supports that high traditional evaluation metrics do not always correlate to high semantic similarity. The significance tests in this paper are conducted following the bootstrap test setup Efron and Tibshirani (1994).

Semantics-Reinforced Models

As shown in Table 1, when using QAP and QPP separately, all metrics are significantly ( $p<0.001$ ) improved over our baseline and all metrics except ROUGE-L are significantly ( $p<0.05$ ) improved over the models using traditional metrics as rewards. After applying multi-reward optimization, our model performs consistently best on BLEU4/METEOR/ROUGE-L and Q-BLEU1. Notably, using one of these two rewards will also improve the other one at the same time, but using both of them achieves a good balance between these two rewards without exploiting either of them and results in the consistently best performance on other metrics, which is a new state-of-the-art result. Human Evaluation Results: Table 2 shows the MTurk anonymous human evaluation study, where we do a pairwise comparison between our baseline and QPP&QAP model. We collected 300 responses in total, 160 of which judged the QPP&QAP model’s generation to be better, 131 of which favored the baseline model, and 9 of which selected non-distinguishable.

QA-Based Evaluation

As shown in Table 3, we compare three QG systems using QA-based evaluation on three different amounts of synthetic data and their corresponding semi-supervised QA setups (without filter). It can be observed that both our baseline and our best QG model can significantly improve the synthetic data’s QA performance, which means they can act as better “annotators” than the QG model proposed by Du and Cardie (2018). However, our best QG model only has a minor improvement over our baseline model, which means significant improvement over QG metrics does not guarantee significantly better question annotation ability.

Semi-Supervised Question Answering

As shown in Table 4, when using synthetic data only, adding the data filter can significantly improve QA performance. In terms of semi-supervised QA, the improvement is relatively smaller, due to the regularization from ground-truth data, but still consistent and stable.

Semi-Supervised QA results

Table 5 shows the semi-supervised QA results. Without introducing new articles, we keep beam search outputs as additional questions. It can be seen that using beam search with beam size 10 (+Beam10) improves the BiDAF-QA baseline by 1.51/1.13 absolute points on the testing set. When introducing new articles, the best performance (+H8) improves the BiDAF-QA baseline by 1.69/1.27 absolute points on the testing set. We also combine the two best settings (Beam10+H8), but it does not perform better than using them separately.

We conduct two ablation studies on the development set. First, we compare beam search with different beam sizes and diverse beam search Li et al. (2016), but all of them perform similarly. Second, when increasing the size of synthetic data from H1 to H10, the performance saturates around H2-H4. We also observed that when using a big synthetic dataset, e.g., H10, the model converges even before all examples have been used for training. Based on these results, we conjecture that there is an upper bound on the effect of synthetic data, which might be limited by the QG quality. To further improve the performance, more diverse and tricky questions need to be generated. To show how QG models help or limit the QA performance, we include some synthetic QA examples in the Appendix. Finally, we compare our semi-supervised QA methods with Dhingra et al. (2018). As shown in Table 6, with no or less new data injection, our methods achieve larger improvements over a stronger baseline than their method.

QG and QA Results with BERT

The Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2018) has recently improved a lot of NLP tasks by substantial margins. To verify if our improvements still hold on BERT-based baselines, we propose a BERT-QG baseline and test our two semantics-enhanced rewards; further, we conduct our semi-supervised QA method on a BERT-QA baseline.

Without modifying our QG model’s architecture, we simply replaced the ELMo used above with BERT. Table 7 shows that our BERT-QG baseline improves the previous ELMo-QG baseline by a large margin; meanwhile, our QPP/QAP rewards significantly improve the stronger QG baseline and achieve the new state-of-the-art QG performance w.r.t. both traditional metrics and QA-based evaluation. One difference is that the QAP-only model has the overall best performance instead of the multi-reward model. Note that we also obtain the QPP and QAP rewards from BERT-based QPC and QA models, respectively.

BERT-QA

Using our QAP-reinforced BERT-QG model, we apply the same semi-supervised QA method on a BERT-QA baseline. As shown in Table 8, though with smaller margins, our method improves the strong BERT-QA baseline by 1.19/0.56 absolute points on the testing set; even without introducing new articles, it obtains 0.95/0.13 absolute gains.

Conclusion

We proposed two semantics-enhanced rewards to regularize a QG model to generate semantically valid questions, and introduced a QA-based evaluation method that directly evaluates a QG model’s ability to mimic human annotators in generating QA training data. Experiments showed that our QG model achieves new state-of-the-art performances. Further, we investigated how to use our QG system to augment QA datasets and conduct semi-supervised QA via two synthetic data generation methods along with a data filter and mixing mini-batch training. Experiments showed that our approach improves both BiDAF and BERT QA baselines even without introducing new articles.

Acknowledgments

We thank the reviewers for their helpful comments and Hao Tan for useful discussions. This work was supported by DARPA (YFA17-D17AP00022), NSF-CAREER Award #1846185, ONR Grant #N00014-18-1-2871, and faculty awards from Google, Facebook, and Salesforce. The views contained in this article are those of the authors and not of the funding agency.

References

Appendix

Appendix A Experiment Setup

For QG, we use the SQuAD-based QG dataset (https://github.com/xinyadu/nqg/tree/master/data) first introduced by Du et al. (2017), which was the most widely-used QG dataset in previous works Song et al. (2018); Zhao et al. (2018); Du and Cardie (2018); Kim et al. (2018); Sun et al. (2018). It was derived from SQuADv1.1 Rajpurkar et al. (2016). Since the testing set is not open, they sampled 10% of the articles from the training set as the testing set, and the original development set is still used for validation.

For the QA-based QG evaluation, we obtain new paragraphs with pre-extracted answer spans from HarvestingQA Du and Cardie (2018). Without using their provided questions, we have different QG models act as “annotators” to generate questions, and then use the different QG-labeled synthetic datasets to train QA models. We use the same dev-test setup as described below.

QA

For QA, we use SQuADv1.1 Rajpurkar et al. (2016). Previous semi-supervised QA works sampled 10% from training set as the testing set Yang et al. (2017); Dhingra et al. (2018). Since we want to use the full training set in semi-supervised QA setup without any data size reduction, we instead split the original development set in half for validation and testing respectively.

For semi-supervised QA, first, without introducing new articles, we generate new questions for SQuAD training set by keeping all beam search outputs. Second, with introducing new articles, we obtain new paragraphs with pre-extracted answer spans from HarvestingQA Du and Cardie (2018). Without using their provided questions, we use our best QG model to label questions. Meanwhile, we investigate the influence of synthetic data size, so we sample 10% to 100% examples from HarvestingQA, which are denoted as H1-H10 in our experiments.

A.2 Evaluation Metrics

First, we use three traditional automatic evaluation metrics: BLEU4 Papineni et al. (2002), METEOR Denkowski and Lavie (2014), ROUGE-L Lin (2004). Second, we adopt the new “Q-metrics” proposed by Nema and Khapra (2018), and we only use “Q-BLEU1” that was shown to have the highest correlation with human judgments on SQuAD. We also take the QPP and QAP rewards as two additional evaluation metrics. Further, we conduct a pairwise human comparison between our baseline and best QG models. Detailed human evaluation setup is described in the next section. For the QA-based QG evaluation, we use the same QA evaluation metrics as follows.

QA

Following the standard evaluation method for SQuADv1.1 Rajpurkar et al. (2016), we use Exact Match (EM) and F1 as two metrics.
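For reference, the two metrics can be sketched in Python as follows. This is a simplified illustration modeled on the commonly used SQuAD v1.1 answer normalization (lower-casing, removing punctuation and English articles, collapsing whitespace); the function names are chosen for this sketch, and the handling of multiple reference answers (taking the maximum score over references) is omitted for brevity.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lower-case, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact span matches, while F1 gives partial credit when the predicted span overlaps the reference.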

A.3 Human Evaluation

We performed pairwise human evaluation between our baseline and the QPP&QAP multi-reward model on Amazon Mechanical Turk. We selected human annotators that are located in the US, have an approval rate greater than 98%, and have at least 10,000 approved HITs. We showed the annotators an input paragraph with the answer in bold and two questions generated by two QG models (randomly shuffled to anonymize model identities). We then asked them to decide which one is better or choose “non-distinguishable” if they are equally good/bad. We gave the annotators three instructions about what makes a good question: first, “answerability” (a good question should be answerable by the given answer); second, “making sense” (a good question should make sense given the surrounding context); third, “overall quality” (a good question should be as fluent, unambiguous, and semantically compact as possible). Ground-truth questions were not provided, to avoid simple matching with the ground-truth.

Appendix B Implementation Details

For ELMo-QG, we first tokenize and obtain the POS/NER tags with the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/), then lower-case the entire dataset. We use 2-layer LSTM-RNNs for both encoder and decoder with hidden size 600. Dropout with a probability of 0.3 is applied to the input of each LSTM-RNN layer. We use the pre-trained character-level word embedding from ELMo Peters et al. (2018) both as our word embedding and output-projection matrix, and keep it fixed. We use Adam Kingma and Ba (2014) as the optimizer with learning rate 0.001 for teacher forcing and 0.00001 for reinforcement learning. Batch size is set to 32. For stability, we first pre-train the model with teacher forcing until convergence, then fine-tune it with the mixed loss. Hyper-parameters are tuned on the development set: $\gamma^{qpp}=0.99$ , $\gamma^{qap}=0.97$ , and $n:m=3:1$ . We use beam search with beam size 10 for decoding and apply a bi-gram/tri-gram repetition penalty as proposed in Paulus et al. (2017).

For BERT-QG, we simply replace the ELMo used above with BERT Devlin et al. (2018). To match BERT’s tokenization, we use the WordPiece tokenizer to tokenize each word obtained above and extend the POS/NER tags to each word piece. The decoder’s word-piece outputs are mapped back to normal words by post-processing. Hyper-parameters are tuned on the development set: $\gamma^{qpp}=0.99$ , $\gamma^{qap}=0.97$ , and $n:m=1:3$ .
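The role of the $\gamma$ and $n:m$ hyper-parameters can be illustrated with a small sketch. It assumes the mixed loss takes the form $\gamma L_{rl}+(1-\gamma)L_{ml}$ used by Paulus et al. (2017), and that $n:m$ counts consecutive mini-batches optimized with the QPP and QAP rewards respectively. The function names and this reading of the ratio are illustrative assumptions, not the released implementation.

```python
from itertools import cycle, islice

# Mixing weights reported above for the two rewards.
GAMMA = {"qpp": 0.99, "qap": 0.97}


def mixed_loss(loss_ml, loss_rl, gamma):
    """Assumed mixed loss: gamma * RL loss + (1 - gamma) * teacher-forcing loss."""
    return gamma * loss_rl + (1.0 - gamma) * loss_ml


def reward_schedule(n, m):
    """Cycle endlessly through n 'qpp' mini-batches followed by m 'qap' ones.

    Assumption: n counts QPP mini-batches and m counts QAP mini-batches.
    """
    return cycle(["qpp"] * n + ["qap"] * m)


# Example: which reward drives each of the first 8 fine-tuning steps (n:m = 3:1).
first_steps = list(islice(reward_schedule(3, 1), 8))
```

Under these assumptions, a training loop would draw the next reward name from the schedule at each step, compute the corresponding RL loss, and combine it with the teacher-forcing loss via `mixed_loss(loss_ml, loss_rl, GAMMA[name])`.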

QA

For BiDAF-QA, we implement the BiDAF+Self-attention architecture proposed by Clark and Gardner (2018). We use GRUs for all RNN layers, with hidden size 90 for GRUs and 180 for linear layers. Dropout with a probability of 0.2 is applied to the input of each GRU-RNN layer. We optimize the model using Adadelta with batch size 64. We also add ELMo to both the input and output of the contextual GRU-RNN layer as proposed in Peters et al. (2018). To match the QG model’s setup, we also lower-case the QA datasets.

For BERT-QA, we use the pre-trained uncased BERT-base model (https://github.com/google-research/bert) and fine-tune it on QA datasets.

QPC

For ELMo-QPC, we follow the model architecture proposed by Conneau et al. (2017). First, two input questions are embedded with ELMo Peters et al. (2018). Second, the embedded questions are encoded separately by two 2-layer bidirectional LSTM-RNNs with hidden size 512. Next, a max-pooling layer outputs the sentence embedding of each question, denoted by $q_{1}$ and $q_{2}$ . Lastly, we input $[q_{1},q_{2},|q_{1}-q_{2}|,q_{1}*q_{2}]$ to an MLP to predict whether these two questions are paraphrases or not. This QPC model is trained using the Quora Question Pairs (QQP) dataset (https://tinyurl.com/y2y8u5ed). We use Adam Kingma and Ba (2014) as the optimizer with learning rate 0.0004 and batch size 64. This model obtained 86% accuracy on the QQP development set.
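The MLP input $[q_{1},q_{2},|q_{1}-q_{2}|,q_{1}*q_{2}]$ can be sketched in plain Python as below, reading $|q_{1}-q_{2}|$ and $q_{1}*q_{2}$ as element-wise operations, as in Conneau et al. (2017). The function name is hypothetical, and a real implementation would operate on batched tensors rather than lists.

```python
def pair_features(q1, q2):
    """Concatenate [q1, q2, |q1 - q2|, q1 * q2] for two equal-length vectors.

    The absolute difference and the product are taken element-wise, so the
    result has four times the dimension of a single sentence embedding.
    """
    if len(q1) != len(q2):
        raise ValueError("sentence embeddings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(q1, q2)]
    product = [a * b for a, b in zip(q1, q2)]
    return list(q1) + list(q2) + abs_diff + product


# Toy 2-dimensional embeddings, for illustration only.
features = pair_features([1.0, -2.0], [3.0, 0.5])
```

The difference and product terms give the classifier direct access to how the two embeddings relate in each dimension, rather than leaving it to infer this from the raw concatenation.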

For BERT-QPC, we also use the pre-trained uncased BERT-base model and fine-tune it on the QQP dataset, obtaining 90% accuracy on the QQP development set.

Appendix C Examples

Figure 5 shows some synthetic QA examples generated by our QG models. On SQuAD, the first two examples show that our QG models generate paraphrases or novel questions that enrich the dataset; the last two examples show that our QG models also generate easier or wrong questions, which limit the performance of semi-supervised QA. On HarvestingQA, our QG models can output better questions than those of Du and Cardie (2018) but still generate some wrong questions.