Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence

Hung-Ting Chen, Michael J. Q. Zhang, Eunsol Choi

Introduction

Traditionally, QA models have relied on retrieved documents to provide provenance for their answers Chen et al. (2017). Recent studies Petroni et al. (2019) have shown that large language models are able to retain vast amounts of factual knowledge seen during pretraining, and closed-book QA systems Roberts et al. (2020) build upon this foundation by memorizing facts from QA finetuning. Retrieval-based generation approaches Izacard and Grave (2021); Lewis et al. (2020) emerge as the best of both worlds – generating free-form answers from the question paired with retrieved evidence documents. They further combine these parametric knowledge sources with a large number of retrieved evidence documents, achieving state-of-the-art performances on open retrieval QA datasets Joshi et al. (2017); Kwiatkowski et al. (2019).

Understanding how retrieval-based generation models combine information from parametric and non-parametric knowledge sources is crucial for interpreting and debugging such complex systems, particularly in adversarial and complex real-world scenarios where these sources may conflict with each other (see an example in Figure 1). Such understanding can help developers debug these models and help users estimate how much they should trust an answer Ribeiro et al. (2016). Thus, we focus on the following core question: when provided with numerous evidence passages and a pretrained and finetuned language model, which knowledge source do models ground their answers in?

A recent study Longpre et al. (2021) investigated this in a limited single evidence document setting. We expand this study to consider a more realistic scenario, where models consider multiple evidence passages (up to 100 passages), and observe results diverging from their reported heavy reliance on parametric knowledge. We further simulate a setting where a subset of evidence passages are perturbed to suggest a different answer to reflect the realistic scenario where retrieval returns a mixed bag of information. Such scenarios are common in settings where some passages are updated with new information, while other passages remain outdated Shah et al. (2020); Zhang and Choi (2021). Such conflicts can also occur when passages are adversarially edited to contain false information Du et al. (2022), or when passages are authored by multiple people who have differing opinions about an answer Chen et al. (2019).

Our extensive studies on two datasets Joshi et al. (2017); Kwiatkowski et al. (2019) and two models Izacard and Grave (2020); Lewis et al. (2020) show that retrieval-based generation models are primarily extractive and are heavily influenced by the few most relevant documents instead of aggregating information over a large set of documents. Having learned that models mostly rely on evidence passages rather than parametric knowledge, we evaluate how sensitive models are to semantic perturbations of the evidence documents (e.g., adding negation). We find that retrieval-based generation models behave similarly to extractive models, sharing their weakness of returning answer candidates with high confidence even after the context is modified to no longer support the answer Ribeiro et al. (2020).

What should models do when confronted with conflicting knowledge sources? We propose a new calibration setting (Section 5), where a model is encouraged to abstain from proposing a single answer in such scenarios. We find that teaching models to abstain when there is more than one plausible answer is challenging, and that training a separate calibrator with augmented data helps moderately.

To summarize, we empirically test how QA models Izacard and Grave (2021); Lewis et al. (2020) use diverse knowledge sources. We present the first analysis of knowledge conflicts where (1) the model uses multiple passages, (2) knowledge conflicts arise from ambiguous and context-dependent user queries, and (3) there are knowledge conflicts between different passages. Our findings are as follows: when provided with a high-recall retriever, models rely almost exclusively on the evidence passages without hallucinating answers from parametric knowledge. When different passages suggest multiple conflicting answers, models prefer the answer that matches their parametric knowledge. Lastly, we identify various weaknesses of retrieval-based generation models, including confidence scores that do not reflect the existence of conflicting answers between knowledge sources. Our initial calibration study suggests that dissuading models from presenting a single answer in the presence of rich, potentially conflicting, knowledge sources is challenging and demands future study.

Background

We first describe the task setting, QA models, and calibrator used in our study.

We study open retrieval QA, where the goal is to find an appropriate answer $y^{*}$ for a given question $q$. Systems for open retrieval QA may also be provided with access to a knowledge corpus consisting of a large number of passages, $p$, which are used to help answer the question. We use the open retrieval split Lee et al. (2019) of the NaturalQuestions dataset (NQ-Open) Kwiatkowski et al. (2019) and TriviaQA Joshi et al. (2017), and use Wikipedia as our knowledge corpus. Following Lee et al. (2019), we use the English Wikipedia dump from Dec. 20, 2018. We use 100-word text segments as passages following Karpukhin et al. (2020).

We investigate two retrieval-based generation QA models: Fusion-in-Decoder (FiD) Izacard and Grave (2021) and the Retrieval Augmented Generation (RAG) model Lewis et al. (2020). Both architectures have reader and retriever components and use the same dense passage retriever (DPR) Karpukhin et al. (2020), which learns embeddings of questions and passages and retrieves a fixed number ($N$) of passages that are most similar to the query embedding. The two models mainly differ in their reader architecture and learning objective, which we describe below.

The FiD reader is based on a pretrained language model (specifically, T5-large Raffel et al. (2020)). Each retrieved passage, $p_{i}$ $(i=1,\dots,N)$, is concatenated with the question, $q$, before being encoded by T5 to generate representations, $[h^{i}_{1},...,h^{i}_{m}]$, where $m$ is the length of the $i$th passage prepended with the question. The representations of all $N$ passages are then concatenated to form a single sequence, $[h^{1}_{1},...,h^{1}_{m},...,h^{N}_{1},...,h^{N}_{m}]$, which the decoder attends to via cross-attention to generate the answer. (We use the version proposed in Izacard and Grave (2020), with knowledge distillation from the reader.)

We use the trained FiD (large) checkpoint provided by the authors (https://github.com/facebookresearch/FiD) for most analyses. When evaluating models with access to different numbers of passages, we re-train the FiD model (pretrained weights loaded from T5-large) using 1, 5, 20 and 50 passages retrieved by DPR. Refer to Appendix A.2 for full model and training details.

Retrieval Augmented Generation (RAG)

RAG conditions on each retrieved evidence document individually to produce an answer, marginalizing the probability of producing an answer over all retrieved evidence documents. (RAG also presents a variant that relies on each retrieved document to generate each token, but it shows worse performance. We use the version at https://huggingface.co/facebook/rag-sequence-nq) By applying this constraint, RAG is able to jointly train the reader and retriever, at the cost of ignoring interactions between evidence documents. FiD, in contrast, is able to model such interactions during decoding, while its reader and retriever are completely disjoint.
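The marginalization over retrieved documents can be illustrated with a small numeric sketch. It assumes that the document prior is a softmax over retriever scores for the retrieved passages and that the per-document answer probabilities are already available; the function names and the toy numbers are hypothetical and are not taken from the RAG implementation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn retriever scores into a prior over the retrieved documents."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_answer_probability(
    retriever_scores: Sequence[float],
    answer_prob_given_doc: Sequence[float],
) -> float:
    """Sum over documents of p(doc | question) * p(answer | question, doc)."""
    if len(retriever_scores) != len(answer_prob_given_doc):
        raise ValueError("one answer probability is needed per document")
    priors = softmax(retriever_scores)
    return sum(pz * py for pz, py in zip(priors, answer_prob_given_doc))


# Two equally scored documents: one supports the answer strongly, one weakly.
print(marginal_answer_probability([0.0, 0.0], [0.8, 0.2]))
```

Because each document is scored separately before the sum, no term in this expression depends on two documents at once, which is the sense in which interactions between evidence documents are ignored.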

Recent work explored jointly training the reader and retriever in FiD Izacard and Grave (2020); Sachan et al. (2021); Yang and Seo (2020), showing small gains. Table 1 summarizes different architectures, including two open book approaches Karpukhin et al. (2020); Guu et al. (2020), one closed book approach Roberts et al. (2020) and two retrieval-based generation approaches. As FiD is efficient and effective, we focus most of our analysis (Section 4, B) on it. We only report RAG results on a few of our main analyses to verify that general trends of the FiD model hold for RAG (which they typically do).

Model Confidence Study

We analyze the model confidence score, asking a more nuanced question: does the model’s confidence in the gold answer decrease after we perturb the knowledge sources? We compare the model confidence on the same example before and after perturbation. We determine the confidence of the model using either (1) the generation probability of the answer (i.e., the product of the probabilities of generating each token conditioned on all previously generated tokens) or (2) the confidence score of a separately trained answer calibrator, which provides a score indicating the probability of the model correctly predicting the answer for each example. We train a binary calibrator following prior work Kamath et al. (2020); Zhang et al. (2021), using the gradient boosting library XGBoost Chen and Guestrin (2016). The goal of the calibrator is to enable selective question answering, that is, to equip models to decide when to abstain from answering. Given an input question $q$ and learned model $M_{\theta}$, the calibrator predicts whether the predicted answer $\hat{y}=M_{\theta}(q)$ will match the annotated answer $y^{*}$. We follow the calibrator settings from prior work Zhang et al. (2021); details can be found in Appendix A.1.
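As a minimal sketch of confidence measure (1), the generation probability can be computed from per-token probabilities as below. The helper names are hypothetical, and the per-token probabilities are assumed to have been read off the decoder beforehand; the sum of logs is used only for numerical stability and is equivalent to the product.

```python
import math
from typing import Sequence


def generation_probability(token_probs: Sequence[float]) -> float:
    """Product of per-token probabilities, accumulated in log space."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("token probabilities must lie in (0, 1]")
    return math.exp(sum(math.log(p) for p in token_probs))


def confidence_decreased(before: Sequence[float], after: Sequence[float]) -> bool:
    """True if the answer's confidence is lower after perturbation."""
    return generation_probability(after) < generation_probability(before)


# Same answer tokens scored before and after perturbing the evidence.
print(generation_probability([0.9, 0.8]))
print(confidence_decreased([0.9, 0.8], [0.6, 0.8]))
```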

When do retrieval-based generation models rely on parametric knowledge?

As an initial step investigating whether retrieval-based generation models ground their answers in the retrieval corpus or in the pretrained language model’s parametric knowledge, we evaluate whether models generate a novel answer that is not present in a set of evidence documents. Unlike extractive QA models Seo et al. (2017), generation based approaches Roberts et al. (2020); Izacard and Grave (2021) do not require the evidence documents to contain the gold answer span. Thus, we first analyze whether they actually generate novel answer spans not found in the retrieved passages.

Table 2 reports how often models generate a span not found in the evidence passages, split by retrieval performance, on the NQ-Open Kwiatkowski et al. (2019); Lee et al. (2019) and TriviaQA Joshi et al. (2017) development sets. We observe that models typically copy a span from the evidence passages, generating novel spans for only 3.4%/6.2% of examples in NQ/TriviaQA for FiD and 20.2% for RAG in NQ. Even for the small subset of examples where the retrieved documents do not contain the answer string, FiD remains extractive for 82.9%/69.6% of examples in NQ/TriviaQA. In contrast, for RAG, where retrieved documents frequently miss the gold answer (37%), such copying behavior is less common, and the model generates unseen text for 42.1% of examples. The results suggest that reliance on retrieved documents increases as retriever performance increases. We also report the percentage of examples where the model prediction is different from that of a T5 closed-book question answering (CBQA) model trained on the same data (training details are in Appendix A.2). Over 70% of examples have different answers from the CBQA model, even when the answer is abstractive, suggesting hybrid models use passages even when there is no exact string match.
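The check for whether a prediction is a novel span can be sketched as a normalized string match against the evidence passages. The normalization below (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice, not necessarily the exact procedure behind Table 2, and all function names are hypothetical.

```python
import re
import string
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_novel(prediction: str, passages: Sequence[str]) -> bool:
    """True if the prediction does not occur as a word span in any passage."""
    pred = f" {normalize(prediction)} "
    return not any(pred in f" {normalize(p)} " for p in passages)


def novel_span_rate(predictions: Sequence[str],
                    passage_sets: Sequence[Sequence[str]]) -> float:
    """Fraction of examples whose prediction is absent from its passages."""
    flags: List[bool] = [is_novel(pred, ps)
                         for pred, ps in zip(predictions, passage_sets)]
    return sum(flags) / len(flags)


evidence = ["The Eiffel Tower is in Paris, France."]
print(novel_span_rate(["Paris", "London"], [evidence, evidence]))
```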

This observation stands at odds with an earlier study on knowledge conflict Longpre et al. (2021), which simulates knowledge conflict by substituting the existing answer with a new answer candidate in the evidence passage (see Table 3 for an example), creating a mismatch between parametric knowledge and the evidence document. The original passage is minimally changed, yet now suggests an alternative, incorrect answer candidate that likely contradicts the knowledge stored in the LM. They showed that models frequently rely on parametric knowledge, generating answers not present in the evidence passage: the model produced the original answer 17% of the time, even when the answer no longer appeared in the passage.

We identify the main difference in their experimental setup as the use of a single evidence passage rather than multiple evidence passages. We revisit their study, as the single-document setting is impractical: most open-retrieval QA models Lewis et al. (2020); Karpukhin et al. (2020); Izacard and Grave (2021) are trained with multiple passages to make up for imperfect passage retrieval. According to the answer recall in Tables 4 and 5, when the model is provided with 100 passages, the correct span is available nearly 90% of the time (compared with up to 50% when provided with one passage), and thus the model remains extractive.

Following their setup, we only evaluate on examples that the model has correctly answered (as perturbing examples where models are already confused is unnecessary) and where the answer is an entity. This process removes roughly 70-80% of examples in the NQ dataset and 60% in the TriviaQA dataset. Because of the filtering process, each row in Tables 4 and 5 is its own subset of the data. We then substitute every answer entity mention in all evidence passages with a random entity of the same type sampled from the training data. The entity type is coarsely defined as person, date, numeric, organization or location. All manipulation is done only at inference time, after the passages are retrieved.
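The substitution step can be sketched as follows. This is an illustrative simplification: it assumes answer mentions can be located by case-insensitive string matching and that a pool of same-type entities has already been collected from the training data. The function name and the toy pool are hypothetical.

```python
import random
import re
from typing import Dict, List, Sequence, Tuple


def substitute_answer(
    passages: Sequence[str],
    answer: str,
    entity_type: str,
    entity_pool: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[List[str], str]:
    """Replace every mention of `answer` with a random same-type entity."""
    candidates = [e for e in entity_pool[entity_type]
                  if e.lower() != answer.lower()]
    if not candidates:
        raise ValueError(f"no substitute available for type {entity_type!r}")
    substitute = rng.choice(candidates)
    pattern = re.compile(re.escape(answer), flags=re.IGNORECASE)
    # A callable replacement avoids treating backslashes in the entity
    # name as regex escape sequences.
    perturbed = [pattern.sub(lambda _m: substitute, p) for p in passages]
    return perturbed, substitute


pool = {"person": ["Ada Lovelace", "Alan Turing"]}
passages = ["Alan Turing was born in London.", "An unrelated passage."]
print(substitute_answer(passages, "Alan Turing", "person", pool, random.Random(0)))
```

Applying the function only after retrieval mirrors the setup above, where the retriever never sees the perturbed text.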

We report the exact match score to the original answer. Prior to perturbation, the exact match score against the original answer is 100%. We also report the exact match score to the substituted answer and memorization ratio ( $M_{R}=\frac{p_{o}}{p_{o}+p_{s}}$ ) where $p_{o}$ is the fraction of examples where the model predicts the original answer, and $p_{s}$ is the fraction of examples predicting the substitute answer.
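The memorization ratio defined above can be computed directly from model predictions. The sketch below assumes exact string equality after trimming and lowercasing as the match criterion, which is a simplification; the function name is hypothetical.

```python
from typing import Sequence


def memorization_ratio(
    predictions: Sequence[str],
    original_answers: Sequence[str],
    substitute_answers: Sequence[str],
) -> float:
    """M_R = p_o / (p_o + p_s) over a set of perturbed examples."""
    n = len(predictions)
    if n == 0 or n != len(original_answers) or n != len(substitute_answers):
        raise ValueError("inputs must be non-empty and of equal length")

    def same(a: str, b: str) -> bool:
        return a.strip().lower() == b.strip().lower()

    # p_o: fraction predicting the original answer;
    # p_s: fraction predicting the substitute answer.
    p_o = sum(same(p, o) for p, o in zip(predictions, original_answers)) / n
    p_s = sum(same(p, s) for p, s in zip(predictions, substitute_answers)) / n
    if p_o + p_s == 0:
        raise ValueError("no prediction matches either answer")
    return p_o / (p_o + p_s)


preds = ["Paris", "Rome", "Oslo", "Rome"]
print(memorization_ratio(preds, ["Paris"] * 4, ["Rome"] * 4))
```

Predictions that match neither answer (such as "Oslo" here) lower both fractions but do not enter the ratio itself.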

Tables 4 and 5 report how models respond to entity-substituted contexts with a differing number of passages available at training and inference time. In congruence with our prior experiments, we observe higher reliance on parametric knowledge as answer recall in the retrieved evidence decreases. Departing from Longpre et al. (2021), we find that memorization in FiD is uncommon (less than 3.6%/8.5% for NQ/TriviaQA) when the reader is provided with multiple passages at training time, and that FiD grounds its answers mostly in evidence passages instead of its parametric knowledge when answer recall is reliably high. Furthermore, when provided with multiple evidence passages with comparable answer recall, FiD exhibits far less memorization than RAG, suggesting that using a multi-passage reader that does not marginalize over passages inhibits memorization. We study the domain transfer setting in Appendix A.9, showing that memorization is still rare when the reader models are evaluated on out-of-domain datasets, as long as retriever performance was high during training.

Takeaway

Retrieval-based reader models exhibit little memorization when the retriever has a high recall during its training.

Simulating Mixed Bag of Evidence Passages

Having identified that retrieval-based generation models rely heavily on evidence passages, especially when paired with a high-performance retriever, we study how models make use of multiple evidence passages when different passages suggest different answers. This happens frequently in real life, as questions can be ambiguous based on different, valid interpretations of the question Min et al. (2020) or different extra-linguistic contexts Zhang and Choi (2021).

We introduce two perturbations: an entity substitution perturbation Longpre et al. (2021) (Section 4.1) and an adversarial semantic perturbation Jia and Liang (2017) (Section 4.2). Both should dissuade the model from returning the original answer in the evidence passage (see Table 3 for examples). We analyze the best performing FiD model, trained with 100 passages.

To simulate a mixed bag of evidence passages, we perform partial entity substitution, changing the answer in a subset of the passages mentioning the answer entity. On average, the answer entity is mentioned in 16.7 out of 100 retrieved evidence passages for NQ-Open and 21.5 for TriviaQA. We substitute the answer entity mentions with a new entity in 25%, 50%, 75% and 100% of the evidence passages that contain the original gold answer span. We sample the passages in which to substitute the answer entity in one of three ways.

random: select passages at random.

top-retrieval: select top passages ranked by retrieval score.

top-attention: select top passages ranked by attention score. Attention score for each passage is computed as the cross-attention score on the first decoded token averaged across layers, heads and the tokens in the passage, as defined in Izacard and Grave (2020).
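The passage sampling strategies can be sketched as below, including the random sampling that the results refer to. The sketch assumes per-passage scores are given as dictionaries keyed by passage id and that the number of passages to perturb is rounded up; both choices, as well as the function names, are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Optional, Sequence


def passage_attention_score(attn: Sequence[Sequence[Sequence[float]]]) -> float:
    """Average cross-attention on the first decoded token.

    `attn[layer][head][token]` holds the attention mass on each token of
    one passage; the score averages over layers, heads and tokens.
    """
    values = [a for layer in attn for head in layer for a in head]
    return sum(values) / len(values)


def select_passages(
    candidate_ids: Sequence[int],
    fraction: float,
    strategy: str,
    scores: Optional[Dict[int, float]] = None,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick which answer-bearing passages receive the substituted entity."""
    k = math.ceil(fraction * len(candidate_ids))
    if strategy == "random":
        return (rng or random.Random()).sample(list(candidate_ids), k)
    if strategy in ("top-retrieval", "top-attention"):
        if scores is None:
            raise ValueError("scores are required for ranking strategies")
        ranked = sorted(candidate_ids, key=lambda pid: scores[pid], reverse=True)
        return ranked[:k]
    raise ValueError(f"unknown strategy: {strategy}")


ids = [0, 1, 2, 3]
retrieval = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.3}
print(select_passages(ids, 0.5, "top-retrieval", scores=retrieval))
```

For the top-attention strategy, the same function would be called with scores produced by `passage_attention_score` for each passage.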

Results.

Figure 2 reports our results with different amounts of perturbation (i.e., how many evidence passages are perturbed) and different methods of sampling passages to substitute entities in. After perturbing all of the passages, so that the original answer is no longer within any of the passages, the model successfully refrains from predicting the original answer 98% of the time. However, after randomly selecting 50% of the passages to perturb, we find that the model still favors the original answer over the substitute answer, about twice as frequently on NQ (52% vs. 25%) and almost four times as frequently on TriviaQA (59% vs. 15%). This indicates that parametric knowledge still plays a significant role when more than one potential answer exists in the retrieval results.

When we perturb the top scoring passages, as measured by either retrieval or attention score, the model changes its answer much more frequently. Using either scoring metric, perturbing the top 25% of passages successfully changes the predicted answers in about 30% of examples compared to the 8% of examples whose answers are successfully changed by perturbing randomly sampled passages. This suggests that the model may be ignoring lower-scoring retrieved passages that are less relevant to the query, despite containing the answer entity.

Confidence Study.

Table 6 reports the change in model confidence after performing random entity substitution in the evidence passages. Consistent with the results from Zhang et al. (2021), we find that a separately trained calibrator consistently outperforms the model’s inherent confidence score. Surprisingly, there is no clear connection between the percentage of perturbed passages and model confidence. Ideally, when given a mixed bag of evidence, a model’s confidence should decrease to reflect the uncertainty from seeing multiple, conflicting answers. We revisit this in Section 5 where we pilot a calibrator whose confidence drops when presented with conflicting evidence.

Additional Analysis.

Our confidence study suggests that the model might not consider all provided passages. To investigate this further, we substitute answers in all passages except the top K passages, ranked by the attention score from the reader. Table 7 presents the results. When we substitute the answer in all passages except the single top-scoring passage, the model already outputs the original answer more frequently than the substitute answer. This again suggests that the model might focus on a handful of the most relevant passages and ignore the others.

In Appendix B, we include two further studies. First, we study whether the choice of alternative answer impacts the model’s behavior. When we provide a more realistic alternative answer (either drawn from an outdated corpus or answering a slightly different interpretation of the question), the model is, unsurprisingly, less biased toward the original answer. Second, we study whether the model’s parametric knowledge is learned during the pre-training phase or the fine-tuning phase, concluding that most of its parametric knowledge is learned during fine-tuning.

Takeaway.

Models resort to parametric knowledge to resolve conflicts between different retrieved passages.

Model confidence itself cannot be used to identify knowledge conflicts.

The model relies on a few most relevant passages, ignoring the others.

Adversarial Semantic Perturbation

Semantic perturbation follows earlier work on counterfactual example generation with heuristics Ribeiro et al. (2020), which perturbs the sentence containing the answer. We simulate four perturbations; after each perturbation, the model should refrain from returning the original answer. With these perturbations, we aim to test the model’s understanding of the passage.

We design four perturbations applicable to question answering: negation, changing to future tense, adding a modal verb and text infilling. Examples of each perturbation are in Table 3. To generate these, we run a dependency parser on the sentence containing the gold answer span, using the StanfordNLP Qi et al. (2018) toolkit. We then filter out examples where the root token of the answer sentence is not a verb (about 40% of sentences, see Appendix A.3 for full statistics). Finally, we apply simple rules (see Appendix A.4) to modify the verb. For text infilling, the only difference is that we convert the root token into “[blank]” and fill in the blank using language modeling Donahue et al. (2020). For passages containing multiple gold answer spans, we apply these perturbations to all answer sentences whose root tokens are verbs.
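To make the shape of such rules concrete, here is a toy sketch of root-verb rewriting. It is not the rule set from Appendix A.4: it assumes the sentence is already tokenized, that the index and lemma of a simple past-tense root verb are supplied by a parser, and it uses "may" as an arbitrary modal. The function and constant names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical replacement templates; "{lemma}" is the root verb's lemma.
TEMPLATES = {
    "negation": ["did", "not", "{lemma}"],
    "future": ["will", "{lemma}"],
    "modal": ["may", "{lemma}"],
    "infill": ["[blank]"],  # the blank would later be filled by an LM
}


def perturb_sentence(tokens: Sequence[str], root_idx: int,
                     root_lemma: str, kind: str) -> List[str]:
    """Rewrite the root verb of a sentence according to one perturbation."""
    if kind not in TEMPLATES:
        raise ValueError(f"unknown perturbation: {kind}")
    middle = [piece.format(lemma=root_lemma) for piece in TEMPLATES[kind]]
    return list(tokens[:root_idx]) + middle + list(tokens[root_idx + 1:])


sentence = "Shakespeare wrote Hamlet .".split()
for kind in TEMPLATES:
    print(kind, " ".join(perturb_sentence(sentence, 1, "write", kind)))
```

Sentences whose root is not a verb have no `root_idx` to rewrite under this scheme, which corresponds to the filtering step described above.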

Results.

In Table 8, we report the exact match to the original answer after applying semantic perturbations. Since our perturbation rules only cover 67-86% of all sentences containing an answer string, we further subreport our results based on whether there are any remaining unperturbed answer sentences in the evidence (partial coverage) or whether all answer sentences are perturbed (full coverage). The “partial coverage” subset is the set we created based on the perturbation rules. The “full coverage” subset is created by removing the examples where not all answer sentences have been perturbed.

Examples with partial coverage simulate a mixed bag of evidence, which may induce the model to return the original answer. In all instances, we expect the exact match to drop significantly after perturbation, as all edits invalidate the original answer; however, we observe that models still return the original answer after perturbation, mirroring what Ribeiro et al. (2020) find with extractive models. Semantic perturbation details (e.g., the percentage of valid examples after perturbation) are in Appendix A.3.

Confidence Study.

We repeat the calibration study with semantic perturbation. We find that calibration scores remain mostly steady after the perturbation for all four perturbation types: we see a decrease in calibration score for only 30-40% of examples. The model is particularly insensitive to temporal perturbation (future). The exact numbers and the ratio of calibration scores before and after the perturbation can be found in the Appendix 3. We observe that the model behaves similarly to extractive models Ribeiro et al. (2020), returning an answer matching the answer type with high confidence even when the passage no longer supports it.

Re-Calibrating Models Given a Mixed Bag of Evidence

When presented with a mixed bag of evidence, systems should inform users of the multiple, conflicting answers. While there are many approaches for relaying this information to users (e.g., composing a paragraph aggregating answer candidates, or providing a set of answers mapped to the documents supporting them), a necessary prerequisite for all such systems is the ability to detect when there are conflicting answers in the evidence. Thus, we explore creating systems that can detect and abstain from predicting on instances with conflicting evidence. Questions should only be answered if (1) there is no knowledge conflict in the evidence set and (2) the model’s predicted answer matches the annotated answer. We report the calibrator’s binary calibration accuracy following prior work Kamath et al. (2020). We explain four evaluation settings here.
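The answer-or-abstain target and the accuracy metric can be sketched as follows. The sketch assumes the calibrator outputs a score in [0, 1] that is thresholded at 0.5 to decide whether to answer; the threshold and the function names are assumptions, not settings reported here.

```python
from typing import List, Sequence


def should_answer(has_conflict: bool, prediction: str, gold: str) -> bool:
    """Target label: answer only if there is no conflict and the model is right."""
    return (not has_conflict) and prediction == gold


def binary_calibration_accuracy(
    scores: Sequence[float],
    labels: Sequence[bool],
    threshold: float = 0.5,
) -> float:
    """Fraction of examples where the answer/abstain decision matches the label."""
    decisions: List[bool] = [s >= threshold for s in scores]
    return sum(d == l for d, l in zip(decisions, labels)) / len(labels)


labels = [
    should_answer(False, "1912", "1912"),  # clean evidence, correct -> answer
    should_answer(True, "1912", "1912"),   # conflicting evidence -> abstain
    should_answer(False, "1913", "1912"),  # wrong prediction -> abstain
]
print(labels)
print(binary_calibration_accuracy([0.9, 0.8, 0.1], labels))
```

In this toy run the calibrator is overconfident on the conflicting example, which is the failure mode the re-calibration setting is meant to expose.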

Original We use the original NQ development set as is to provide a reference for the performances of calibrators.

In the following settings, we only look at examples that the original FiD model answers correctly. Thus, the calibrator should abstain only because of knowledge conflict. We construct three different types of knowledge conflict sets, in each of which the calibrator should abstain on half of the examples because of knowledge conflict. To construct these sets, we take each original question with its set of 100 evidence passages (where the model should present its answer) and add one perturbed example, in which the set of 100 evidence passages is perturbed to contain multiple answer candidates to the same question. We discuss three ways to introduce a perturbed evidence set that contains more than one valid answer candidate.

Partial Substitution We use the sets of conflicting evidence passages constructed in Section 4.1 (randomly sampling 50% of the retrieved passages to substitute a new entity in).

AmbigQA Instead of a random new entity, we sample a valid alternative answer to the question, taken from a different interpretation of the same question in the AmbigQA Min et al. (2020) dataset. Instead of simply replacing the answer in an existing passage, we retrieve new passages for each rewritten, disambiguated version of the question.

SituatedQA We sample a valid alternative answer drawn from a corpus of a different time period, using the SituatedQA Zhang and Choi (2021) dataset. We use the same query, but retrieve over two different snapshots of the same corpus (the Wikipedia dumps from 2018 and from 2021).

We evenly combine retrieved passages from conflicting answer sets, using the top retrieved passages that contain the respective answer and backing off to the passages with high retrieval scores if not enough passages contain the answer string.
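One way to read this combination procedure is sketched below. It assumes each answer has its own list of passages already sorted by retrieval score, that answer presence is tested by case-insensitive substring match, and that the two halves are simply concatenated; the ordering of the final set and the function names are assumptions.

```python
from typing import List, Sequence


def pick_for_answer(ranked_passages: Sequence[str], answer: str,
                    quota: int) -> List[str]:
    """Take top passages containing the answer, then back off to top-ranked ones."""
    containing = [p for p in ranked_passages if answer.lower() in p.lower()]
    chosen = containing[:quota]
    if len(chosen) < quota:
        # Back off to high-scoring passages that lack the answer string.
        backoff = [p for p in ranked_passages if p not in chosen]
        chosen += backoff[: quota - len(chosen)]
    return chosen


def combine_evenly(ranked_a: Sequence[str], answer_a: str,
                   ranked_b: Sequence[str], answer_b: str,
                   total: int = 100) -> List[str]:
    """Build one evidence set with half of the passages for each answer."""
    half = total // 2
    return (pick_for_answer(ranked_a, answer_a, half)
            + pick_for_answer(ranked_b, answer_b, total - half))


old = ["The record was set in 2018.", "General background.", "In 2018 it rose."]
new = ["The record was broken in 2021.", "More background."]
print(combine_evenly(old, "2018", new, "2021", total=4))
```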

As a baseline, we use the same calibration model from our prior study in Section 2.2. We also retrain separate calibrators for each of our three substituted answer types, which are trained by applying the same data augmentation process that was applied to the evaluation set (described above) to the training portion of the filtered NQ-Open dataset.

Results

We report the results in Table 9. We observe that vanilla model confidence outperforms the trained calibrator, showing robustness in the out-of-domain setting. This could be caused by the large gap in the accuracy of the FiD model between training (80%) and testing data (52%). Base calibrators, without data augmentation, struggle substantially, particularly in real-world knowledge conflict scenarios where they are presented with multiple valid answer candidates (AmbigQA, SituatedQA). Training with data augmentation improves the calibrator’s performance; however, this fix does not easily generalize across different methods of collecting conflicting answers and evidence sets. Interestingly, training with more realistic conflicting evidence sets (AmbigQA, SituatedQA), while they are substantially smaller, generalizes better than training with the simulated conflicting evidence set (Partial Substitution). Training over all types of conflicting evidence sets jointly improves performance over the baseline calibrators only modestly compared to the gains from training on data from each method separately. Future work can explore improving calibrator generalization across different knowledge conflict types.

Related Work

Recent analyses Lewis et al. (2021); Krishna et al. (2021) pointed out that the overlap between training and evaluation datasets inflates question answering performance. Longpre et al. (2021) showed that the reader model tends to memorize entity answers even when the answer mentions are substituted with another entity. We showed that memorization does occur when the model has access to only one passage, but can be reduced significantly if the model is trained with multiple passages. Concurrent work Pan et al. (2021) investigates QA models’ robustness to misinformation by providing contradicting contexts. They focus on generating conflicting passages, while we focus on understanding how models behave under such settings, including an in-depth study of their confidence scores.

Recent works evaluated robustness by minimally perturbing input examples Kaushik et al. (2020); Gardner et al. (2020) to identify models that are invariant under distributional shift. Prior work explored automatically generating such perturbed inputs (counterfactual data) with heuristics Ribeiro et al. (2020) or learned models Wu et al. (2021); Bartolo et al. (2020); Paranjape et al. (2021). Recent work Du et al. (2022) studies knowledge poisoning for a related task, fact checking. Our perturbation methods are rule-based, similar to Ribeiro et al. (2020), but designed specifically for the QA task.

Conclusion

We summarize our findings: Do models ground their answers in retrieved documents or in parametric knowledge? (Section 3) Current SoTA models ground their answers mostly in retrieved passages when paired with a high-recall retriever (Table 2, 4).

How do models use multiple passages when different passages suggest different answers? (Section 4.1) Models rely on a few, most relevant passages (Table 7), and use parametric knowledge to break ties (Figure 2, Table 18).

How do models behave if some passages are perturbed not to support an answer? (Section 4.2) Models largely ignore semantic perturbations and output a potential answer entity from the retrieved passages (Table 8).

How is the model’s confidence score affected by knowledge conflicts? The confidence score is not sensitive to knowledge conflicts (Table 6, Figure 3), and a separately trained calibrator offers some improvements.

Can we train a model to refrain from returning a single answer when there is conflicting evidence? If we train a calibrator on a conflicting evidence set, the calibrator can learn to refrain, but it does not generalize to different types of conflicting evidence sets (Table 9).

What should the model do when there is conflicting evidence? We present a partial solution of training a calibrator which learns to abstain from answering when provided conflicting evidence. Future work can explore summarizing and comparing different answers suggested by diverse passages.

Overall, models’ limited ability to aggregate conflicting information among their rich knowledge sources encourages future work in this domain.

Limitations

Our study is based on current state-of-the-art models on popular benchmark datasets. For other datasets (e.g., datasets where retrieval quality is substantially worse) or different models Brown et al. (2020); Chowdhery et al. (2022); Rae et al. (2021); Thoppilan et al. (2022) with substantially richer parametric knowledge, our observation that memorization is relatively rare may not hold.

We focus on the extractive question answering task, where the answer consists of a short entity span. Studying knowledge conflicts in complex question answering tasks where the answer is multi-sentence Fan et al. (2019) or conditional Sun et al. (2022) requires future work.

Lastly, most of the knowledge conflicts in our study (except in the settings where we retrieve passages with AmbigQA and SituatedQA) are simulated, and we leave identifying and evaluating models on real-world knowledge conflicts as future work.

Acknowledgements

We thank members of the UT Austin NLP community and Sewon Min for providing feedback on an earlier draft of the paper. The work is partially supported by a grant from Open Philanthropy and a Google Research Award.

References

Appendix A

The input to the calibrator is the concatenation of the generation probability and the encoder feature representation averaged across the sequence length, and the output is a score indicating the probability of the model correctly predicting the answer. For each dataset, we reserve 4K examples of the training set for validation and train our calibrator on the remaining data. Hyperparameters are selected based on AUROC on the validation set.

We use 100 boosting rounds, a subsample ratio of 0.5, and a learning rate of 0.5. The same subsample ratio is applied when constructing each tree, each level, and each split.

A.2 Model and Training Details

The Fusion-in-Decoder (FiD) model consists of a retriever and a reader module. The retriever Karpukhin et al. (2020) is a BERT bi-encoder model, which calculates the similarity between the question $q$ and each of the passages $\{p_{i}\}$ in the knowledge source and outputs the most similar ones. The similarity is computed as the dot product of the encoded vectors

$$\mathrm{sim}(q, p_{i}) = E_{Q}(q)^{\top} E_{P}(p_{i}),$$

where $E_{Q}$ is the question encoder and $E_{P}$ is the passage encoder.
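The dot-product retrieval step described above can be sketched as follows. This is an illustration only: it assumes the question and passage vectors have already been produced by $E_{Q}$ and $E_{P}$, and the helper names `dot` and `retrieve_top_k` are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k passages with the highest dot-product similarity.

    question_vec stands in for E_Q(q); passage_vecs[i] stands in for E_P(p_i).
    """
    scores = [dot(question_vec, p) for p in passage_vecs]
    # Sort passage indices by descending similarity score.
    ranked = sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In practice the passage vectors are precomputed for the whole knowledge source and searched with an approximate nearest-neighbor index; the exhaustive scan here only illustrates the scoring rule.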

The reader module is a pretrained T5-large Raffel et al. (2020), an encoder-decoder model containing 770M parameters. Each passage is concatenated with the question and truncated to 250 word pieces. For our experiments finetuning FiD, we train the reader module with 1, 20, and 50 evidence passages. To train the reader, we use the AdamW optimizer Loshchilov and Hutter (2018) and a learning rate of $5\cdot 10^{-5}$ with linear warmup for 8000 steps followed by linear decay to zero. The total number of training steps is 300k, and the final model checkpoint is selected based on the exact match score on the NQ Open development set. We use a batch size of 1 due to memory constraints. The models take roughly 7 GPU days to train on a Quadro RTX 8000 machine.

The closed-book question answering (CBQA) model is trained from a pretrained T5-large model, with a batch size of 32, 500k total training steps, and all other hyperparameters the same as for the FiD reader models. It takes roughly 2 GPU days to train on a Quadro RTX 8000 machine.
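For concreteness, the learning rate schedule of the FiD reader can be sketched as a function of the training step. This sketch assumes that warmup starts from zero and that the decay runs linearly from the end of warmup to the final step; the function name `learning_rate` is illustrative.

```python
def learning_rate(
    step: int,
    peak_lr: float = 5e-5,
    warmup_steps: int = 8000,
    total_steps: int = 300_000,
) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate peaks at step 8000 and reaches zero at step 300k.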

A.3 Perturbation Coverage

As mentioned in Section 4, if the root token of the answer sentence is not a verb, we ignore that sentence, and thus some examples are excluded. The first row shows the percentage of valid examples after applying the rules mentioned in Section 4. We consider an example valid if at least one of the gold answer spans can be perturbed. The corresponding percentage of perturbed gold answer spans is shown in the third row. A small portion of gold answer spans remain unchanged after performing the perturbation. The second and fourth rows show the same quantities when the model has access to 100 passages. The percentage of valid examples is much higher in this setting, since we consider an example valid if one of the gold answer spans in any of the passages can be perturbed. The last two rows show the percentage of examples where all gold answer spans in all the retrieved passages can be perturbed.

A.4 Technical Details on Semantic Perturbations

For perturbation schemes except text infilling, we first identify the root token's part-of-speech tag. If it is one of [VB, VBP, VBZ], we treat it as present tense and modify the verb accordingly (e.g., V $\rightarrow$ "does not V"/"do not V" for negation, V $\rightarrow$ "may V" for modality, V $\rightarrow$ "will V" for future tense). The lemmatized verb forms after "will" and "may" are obtained with the "WordNetLemmatizer" class in nltk (https://www.nltk.org/_modules/nltk/stem/wordnet.html). We also identify ["is", "am", "are"] and modify these verbs into their corresponding forms. If the part-of-speech tag is VBD, the root token is in past tense and is modified similarly to the present tense case. Lastly, if the part-of-speech tag is VBN or VBG, the root token is a past participle or a present participle/gerund. We then identify the be-verbs and/or ["had", "have", "has"] and perform modifications accordingly.
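The present-tense rules above can be sketched as a small function. This is a simplified illustration, not the authors' code: it assumes the verb lemma is already available (for example, from a lemmatizer), it covers only the [VB, VBP, VBZ] case, and the choice of "does" for VBZ versus "do" otherwise is our assumption based on standard agreement. The function name `perturb_present_verb` is hypothetical.

```python
PRESENT_TAGS = {"VB", "VBP", "VBZ"}


def perturb_present_verb(lemma: str, tag: str, scheme: str) -> str:
    """Rewrite a present-tense root verb under a semantic perturbation scheme.

    lemma  : base form of the root verb (assumed to be precomputed)
    tag    : Penn Treebank part-of-speech tag of the root token
    scheme : one of "negation", "modality", "future"
    """
    if tag not in PRESENT_TAGS:
        raise ValueError(f"unsupported tag for this sketch: {tag}")
    if scheme == "negation":
        # Assumption: third-person singular (VBZ) takes "does", others take "do".
        auxiliary = "does" if tag == "VBZ" else "do"
        return f"{auxiliary} not {lemma}"
    if scheme == "modality":
        return f"may {lemma}"
    if scheme == "future":
        return f"will {lemma}"
    raise ValueError(f"unknown perturbation scheme: {scheme}")
```

Past-tense, participle, and be-verb cases would need their own rules, which this sketch leaves out.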

A.5 Model Tested on NQ Open Subset

Both AmbigQA and SituatedQA annotate subsets of NQ Open. To ensure an identical data distribution and isolate the effect of different substitute answers, we report results of random entity substitution on the AmbigQA subset and the SituatedQA subset, respectively. We present the results in Table 11. For the AmbigQA subset, the substitute entity type (random or alternative valid entity) does not seem to affect the results much. However, on the SituatedQA subset the model seems to be biased toward the substitute answer more with valid alternative entity substitutions, indicating that the model's parametric knowledge does capture which answers are more likely to be correct. One possible explanation for the difference is that AmbigQA answers do not always take the same form as the original ones (e.g., 76th season and 1995 in Table 3).

A.6 Answer Entity Sampling Details

When substituting with AmbigQA answers, we consider only the examples with multiple valid answers. For each example, we randomly sample one answer not in the original NQ answer set as the substitute answer. For substitution with SituatedQA answers, we select the most recent answer as the substitute answer. We also include the results of randomly sampling an answer from the SituatedQA answer set in Table 13.
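The two selection rules above can be sketched as follows. The data layout is an assumption made for illustration: answers are plain strings compared by exact match, and each SituatedQA answer is paired with a year. Both function names are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple


def sample_ambigqa_substitute(
    original_answers: Sequence[str],
    ambigqa_answers: Sequence[str],
    rng: random.Random,
) -> Optional[str]:
    """Randomly pick one AmbigQA answer that is not in the original NQ answer set."""
    original = set(original_answers)
    candidates = [a for a in ambigqa_answers if a not in original]
    # Return None when no alternative answer is available.
    return rng.choice(candidates) if candidates else None


def pick_most_recent(dated_answers: Sequence[Tuple[int, str]]) -> str:
    """Select the answer with the latest year from (year, answer) pairs."""
    return max(dated_answers, key=lambda pair: pair[0])[1]
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.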

A.7 Full Results on No Answer Overlap Set

Table 12 contains the full results on the NAO set for NQ Open, AmbigQA, and SituatedQA.

A.8 Confidence Study Full Results

Table 14 contains the full results for the confidence study on adversarial semantic perturbation.

A.9 Domain Adaptation Results for Entity Substitution

We study the memorization issue when the model is tested on out-of-domain datasets. Following the setting in Section 3, we substitute the answer entity mentions in the retrieved passages with random entities of the same type after the retrieval step. The only difference is that the reader model is trained on a different domain. We evaluate the FiD reader model trained on NQ-Open on the TriviaQA dataset, and vice versa. The results are presented in Tables 15 and 16. The memorization ratio is still low with high-recall retrievers in both settings, indicating that the model still relies on the retrieved passages under distribution shift.

Appendix B Further Analysis

We further examine our results, focusing on the quality of the substitute answers in the entity substitution study and on which parametric knowledge (pre-training vs. fine-tuning) was used.

Prior work Longpre et al. (2021) substitutes the answer entity with another entity of the same coarse entity type. This sometimes makes substitute entities unreasonable, although it is better than randomly sampling entities without a type constraint. For example, “Heartbreak Hotel” was substituted as an answer to the question “who did the lions play on thanksgiving last year”.

We make the perturbation more realistic by substituting with alternative answers from two datasets, AmbigQA Min et al. (2020) and SituatedQA Zhang and Choi (2021), which augment the existing NQ-Open dataset. These datasets annotate valid alternative answers for different interpretations of the same question (AmbigQA) and answers belonging to different temporal contexts (SituatedQA). We sample these additional answers as the new answer to inject (details in Appendix A.6).

Table 17 presents perturbation results with valid entities sourced from AmbigQA and SituatedQA. We identify a surprising trend: the model outputs the original answer more frequently when the substitute is a better alternative. This contradicts our intuition, as the model should be less hesitant to choose the new substitute answers, since they are also valid answers to the question in different contexts. We further investigate this issue below.

Does parametric knowledge come from pre-training or fine-tuning?

Some memorization (2–15%) remains even after all the evidence documents are perturbed, and the model is biased toward the original answer under partial substitution. We aim to identify whether this comes from pretraining or fine-tuning of the reader model by using the evaluation data splits from prior work Lewis et al. (2021): questions whose answers were seen in the training data (Answer Overlap (AO)) and questions whose answers were unseen (No Answer Overlap (NAO)). If the memorization ratio is higher on the AO set than on the NAO set, we can hypothesize that memorization mostly happens during fine-tuning rather than pre-training. An earlier study Longpre et al. (2021) in a single-document setting also reports that memorization is more severe on the AO set.
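The AO versus NAO comparison can be sketched as below. The sketch assumes the memorization ratio follows the definition of Longpre et al. (2021), namely the number of predictions matching the original answer divided by the number matching either the original or the substitute answer. Exact string matching and the function name `memorization_ratio` are simplifying assumptions.

```python
from typing import Sequence


def memorization_ratio(
    predictions: Sequence[str],
    original_answers: Sequence[str],
    substitute_answers: Sequence[str],
) -> float:
    """Share of original-answer predictions among original-or-substitute predictions."""
    n_original = sum(p == o for p, o in zip(predictions, original_answers))
    n_substitute = sum(p == s for p, s in zip(predictions, substitute_answers))
    total = n_original + n_substitute
    # Return 0.0 when the model predicts neither answer for any example.
    return n_original / total if total else 0.0
```

Computing this ratio separately on the AO and NAO splits gives the two numbers being compared: a markedly higher value on AO would point to fine-tuning as the source of memorization.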

Table 18 presents results for the 50% and 100% substitution settings (see Appendix A.7 for the 25% and 75% substitution settings). This study sheds light on the mysterious trend: there were more examples with answer overlap in the AmbigQA/SituatedQA subsets. If we perturb all the evidence documents, the model exhibits little to no memorization on the NAO portion. We can thus infer that the memorization effect comes almost exclusively from fine-tuning. When accounting for the different proportions of answer overlap examples in the subsets, the memorization ratio is lower on the AmbigQA/SituatedQA NAO set. This suggests that the model uses parametric knowledge (which answer candidate is more reasonable) in a subtle way, even when behaving as a copying model.