RealTime QA: What's the Answer Right Now?

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, Kentaro Inui

cs.CL

Introduction

How many home runs has Shohei Ohtani (https://en.wikipedia.org/wiki/Shohei_Ohtani) hit so far this season? A user of a question answering (QA) system might ask such time-sensitive questions and seek out answers in real time (Fig. 1). Widely-used evaluation benchmarks of QA systems, however, implicitly assume that answers are static regardless of the time of inquiry. Several recent works Jia et al. (2018); Chen et al. (2021); Zhang and Choi (2021); Liška et al. (2022) challenged this assumption and proposed QA datasets that specify the temporal context (e.g., who was the President of the U.S. in 1940?). We extend these recent efforts on time-sensitive QA to fulfill real-time, instantaneous information needs from users: we establish RealTime QA, a dynamic benchmark based on newly-published news articles, and provide a regularly-updated (weekly in the current version) evaluation platform for the research community.

We develop an annotation framework (§2) and a benchmarking timeline for real-time QA system submissions. Every week, RealTime QA retrieves news articles and $\sim$ 30 human-written, multiple-choice questions from news websites (CNN, THE WEEK, and USA Today), covering a wide range of topics, including politics, business, sports, and entertainment. We upload these data to our website as well as our baseline results, and any model submission can be evaluated until the next set of questions is posted. This dynamic scheme contrasts with the well-established QA annotations Chen et al. (2017); Chen and Yih (2020) that are performed only once with information available at the time. Such annotations are effective for factoid Berant et al. (2013); Hermann et al. (2015); Rajpurkar et al. (2016); Joshi et al. (2017) or commonsense questions Zellers et al. (2018, 2019); Talmor et al. (2019); Sakaguchi et al. (2020), but not the real-time information needs that are our target.

We present two classes of real-time baseline systems that are built on strong, recent models (GPT-3: Brown et al., 2020; T5: Raffel et al., 2020; BART: Lewis et al., 2020a): open-book and closed-book QA models. The former class uses an external knowledge source, such as Wikipedia Min et al. (2019); Guu et al. (2020); Lewis et al. (2020b); Izacard and Grave (2021) or news articles; for this class, we also present a prompting method to use GPT-3 for open domain QA. The latter class of closed-book models directly outputs an answer to each question. By design, these closed-book baselines have no access to information more recent than the time of pretraining or finetuning, thereby helping us understand the degree to which real-time information is truly necessary. Notably, some of the questions in RealTime QA do not strictly require recent information; for example, Shohei Ohtani hits a home run today, leading one to ask where he was born. The closed-book baselines thus outperform random selection among multiple choices.

We evaluate six baselines both in multiple-choice and generation settings in real time and report the results over the period of June 17 through July 22, 2022. These evaluation data resulted in a total of 179 QA pairs. Further, we provide 2,886 QA pairs that are collected in the same way but preceded our real-time evaluations. These can be used in later work for model development (e.g., finetuning). Our results show that an open-book GPT-3 model augmented with up-to-date text retrieval substantially outperforms closed-book baselines, as well as open-book models with retrieval from a past Wikipedia dump Lewis et al. (2020b). This result illustrates that the large language model can adjust its knowledge, based on the retrieved passages (§3). Nonetheless, we find that it still struggles, especially when the multiple choices include uncertainty (e.g., “none of the above”). Most of the errors originate from retrieval, rather than reading comprehension. The RealTime QA benchmark, therefore, highlights the importance of fast, up-to-date text retrieval Seo et al. (2019) to better serve instantaneous information needs. We share all data and code to reproduce our baselines so that follow-up work can build upon our first attempts to tackle RealTime QA (https://github.com/realtimeqa/realtimeqa_public).

RealTime QA can also serve as an important step toward much-needed, broader real-time applications of natural language processing. For example, a QA system with timely updates can improve emergency management of natural disasters Imran et al. (2013, 2015, 2016); Nguyen et al. (2016) or pandemics (e.g., COVID-19; Wang et al., 2020; Lee et al., 2020; Möller et al., 2020; Alzubi et al., 2021). (A more detailed discussion of NLP applications for emergency response can be found in §4.1.) With the advent of online news, prior work developed automated systems that regularly retrieve and summarize news articles from the Internet Allan et al. (2001); Radev et al. (2001); McKeown et al. (2002, 2003); Evans et al. (2004). Models developed for the RealTime QA task can be further enhanced with such retrieval/summarization systems. We hope that our RealTime QA interface and baseline models will serve as a useful platform for research and real-world applications.

RealTime QA Framework

RealTime QA is a dynamic platform that announces questions every week, based on news articles published since a week before. Here we establish the workflow (§2.1) and the framework for annotations (§2.2) and evaluations (§2.3). We then discuss our built-in baselines (§2.4) that are continually evaluated every week.

Fig. 2 depicts the RealTime QA workflow for each week. We announce $\sim$ 30 multiple-choice questions at 3 am GMT on every Saturday. We internally run API search (Google custom search, GCS) for these questions and share a set of documents (mostly news articles) with their URLs that are available at that time. Participants run their model on these questions, optionally using the documents from our API search as a knowledge source (indicated as dashed lines in Fig. 2). While we provide our document set to lower barriers to submission, participants are also allowed to create and use knowledge sources by themselves (e.g., custom retrieval models or other external APIs such as Twitter API). System submissions are shared on our website with their performance and submission time. The submission window closes when the new set of questions is announced the next week.

Note that fair, retroactive comparisons of systems are also possible, as long as they use data available when the submission window was still open. For instance, participants might be interested in evaluating their model against a past submission on the Week N questions. In this case, they can do so by ensuring that their system only relies on data up to Week N and simulating how their system would have performed at that time. Our platform still focuses on real-time evaluations and encourages every participant to submit real-time results to better reflect real-world applications.
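The weekly window and the retroactive-comparison rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `published` field, and the example dates are hypothetical and are not part of the RealTime QA codebase.

```python
from datetime import datetime, timedelta, timezone


def submission_window(announced_at):
    """Return (open, close) for one week's questions.

    The window closes when the next set of questions is announced,
    i.e. exactly one week later in the weekly version of the benchmark.
    """
    return announced_at, announced_at + timedelta(weeks=1)


def documents_available(documents, window_close):
    """Keep only documents published before the window closed.

    A retroactive submission for Week N should rely on this subset only.
    `documents` is a list of dicts with a hypothetical "published" field.
    """
    return [d for d in documents if d["published"] < window_close]


# Hypothetical announcement: a Saturday at 3 am GMT.
announced = datetime(2022, 6, 18, 3, 0, tzinfo=timezone.utc)
window_open, window_close = submission_window(announced)

docs = [
    {"url": "a", "published": datetime(2022, 6, 20, tzinfo=timezone.utc)},
    {"url": "b", "published": datetime(2022, 6, 26, tzinfo=timezone.utc)},
]
usable = documents_available(docs, window_close)
```

Filtering on publication date is the simplest way to simulate how a system would have performed at the time; a model whose parameters were trained on later data would need a separate check.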

2.2 Annotation

Using each of these $\sim$ 30 questions as a retrieval query, we run Google custom search (https://programmablesearchengine.google.com/about/) to collect the top-10 documents from the web. The retrieval target is all articles from CNN, USA Today, and THE WEEK. We then parse every document using the newspaper3k package (https://github.com/codelucas/newspaper) and store the text as well as metadata such as the publication date and author name. In some rare cases, articles from the search are taken down and no longer available for extraction, in which case we disregard them. This indeed illustrates a unique challenge of real-time applications with constantly-changing, dynamic information.

2.3 Evaluation

Since RealTime QA is a multiple-choice question dataset, we can simply measure performance by accuracy. We also explored a NOTA (none of the above) setting: one of the original choices is randomly replaced with “none of the above,” thereby preventing models from exploiting heuristics Rajpurkar et al. (2018). As expected, the NOTA setting resulted in performance degradation across the board (§3.1). NOTA choices can be found in other multiple-choice QA or reading comprehension datasets, such as MCTest Richardson et al. (2013) and RACE Lai et al. (2017).
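The NOTA construction and the accuracy metric can be illustrated with a short Python sketch. The function names are hypothetical, and the handling of the gold label (when the correct choice is the one replaced, “none of the above” becomes the correct answer) is our reading of the setup rather than a detail stated above.

```python
import random

NOTA = "none of the above"


def make_nota_question(choices, answer_index, rng):
    """Replace one randomly chosen option with NOTA.

    Returns the new choice list and the gold answer text. The gold index
    is unchanged: if the correct option was replaced, the gold text is NOTA.
    """
    replaced = rng.randrange(len(choices))
    new_choices = list(choices)  # copy so the original question is untouched
    new_choices[replaced] = NOTA
    return new_choices, new_choices[answer_index]


def accuracy(predictions, gold):
    """Fraction of questions whose predicted choice equals the gold choice."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
choices = ["1995", "1998", "2001", "2004"]
nota_choices, gold_text = make_nota_question(choices, answer_index=0, rng=rng)
```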

We also experiment with a generation setting where no choices are given, to better reflect real-world applications. Under this setting, we evaluate performance with exact matching and token-based F1 scores, following standard practice in question answering Rajpurkar et al. (2016); Asai et al. (2021). There are a small number of questions that are specifically designed for multiple-choice question answering (e.g., Apple announced new iPhone features coming with iOS 16 at this week’s WWDC event. All of these are new additions except: A) Customizable lock screens; B) Editing messages after they are sent; C) A new notification center; D) Ability to enhance images after zooming in). We disregard these questions in this setting.
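For reference, exact matching and token-based F1 are commonly computed as in the following minimal sketch. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the official RealTime QA scorer may differ in details.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this definition, a verbose but correct prediction such as "Shohei Ohtani of the Angels" against the gold answer "Shohei Ohtani" gets zero exact-match credit but partial F1 credit (precision 1/2, recall 1).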

Many QA datasets estimate human performance as a reference point for automatic QA systems (e.g., Rajpurkar et al., 2016; Yang et al., 2018; Clark et al., 2020). For the sustainability of the dynamic benchmark, we do not provide an estimate of human performance. However, we note that most questions in RealTime QA, if not all, are straightforward (e.g., single-hop questions) and a human with Internet access can easily answer them. In fact, USA Today has a record of human top scorers every week, and they all get perfect scores (e.g., https://www.usatoday.com/storytelling/quiz/news-quiz/2022-07-01/). We can thus assume that the human accuracy would be close to 100% in RealTime QA.

2.4 Real-time Baselines

RealTime QA executes six baselines in real time that are based on strong pretrained models: four open-book and two closed-book models. These six models are evaluated and made publicly available when weekly questions are announced. Any submission to RealTime QA is compared against them. Participants can also build their model upon our baselines.

Open-book QA models follow a two-step pipeline: document retrieval that finds evidence documents from an external knowledge source (e.g., Wikipedia) and answer prediction (or reading comprehension) that outputs an answer conditioned on the question and evidence documents. For either step, we experiment with two variants, resulting in a total of four configurations. Open-book systems have the advantage of being capable of updating the external knowledge source at test time Lewis et al. (2020b). This property is particularly crucial for questions in RealTime QA that inquire about information at the present time.

For the retrieval step, we experiment with two configurations: top-5 Wikipedia documents from dense passage retrieval (DPR; Karpukhin et al., 2020) and top-5 news articles from GCS (§2.2). In DPR, English Wikipedia articles from the December, 2018 dump are segmented into 100-word documents Wang et al. (2019). DPR encodes the question and every document into 768-dimensional vectors; it then computes the inner product to obtain a matching score and selects documents with top-5 matching scores. We use the BERT-based model Devlin et al. (2019), finetuned on the Natural Questions dataset Kwiatkowski et al. (2019) from the Hugging Face Transformers library Wolf et al. (2020). GCS uses an external API, and we found that it sometimes returned fewer than five documents; in this case, we add top documents from DPR to create a top-5 document set.
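The two retrieval details above, inner-product ranking and backfilling a short GCS result list with DPR documents, can be sketched as follows. This is a toy illustration in plain Python with hypothetical function names and 2-dimensional vectors; the actual baseline uses 768-dimensional DPR encodings and an index over the Wikipedia dump.

```python
def inner_product(u, v):
    """Matching score between a question vector and a document vector."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(question_vec, doc_vecs, k=5):
    """Return the indices of the k documents with the highest matching score."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: inner_product(question_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def backfill(gcs_docs, dpr_docs, k=5):
    """Pad a short GCS result list with top DPR documents to reach k."""
    return (list(gcs_docs) + list(dpr_docs))[:k]


# Toy example: 2-dimensional vectors stand in for the 768-dimensional ones.
question = [1.0, 0.0]
documents = [[0.1, 1.0], [0.9, 0.0], [0.5, 0.5]]
best_two = top_k_by_inner_product(question, documents, k=2)

# GCS returned only three articles, so two DPR passages fill the gap.
top5 = backfill(["g1", "g2", "g3"], ["d1", "d2", "d3", "d4", "d5"])
```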

We explore two methods for answer prediction, conditioned on the question and the corresponding retrieved text: retrieval-augmented generation (RAG; Lewis et al., 2020b) and a prompting method with GPT-3 (text-davinci-002, Brown et al., 2020). In the multiple-choice setting, we compute the log probability of every choice and normalize it by the generation sequence length. We then select the choice with the best score. For the generation setting, we simply perform text decoding.
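Length normalization matters because raw sequence log probabilities penalize longer choices. The sketch below assumes that per-token log probabilities for each choice have already been obtained from the model and that the normalizer is the number of tokens; the function names are hypothetical.

```python
def average_log_prob(token_log_probs):
    """Sequence log probability divided by the number of generated tokens."""
    return sum(token_log_probs) / len(token_log_probs)


def select_choice(choice_token_log_probs):
    """Return (index of the best choice, list of normalized scores)."""
    scores = [average_log_prob(lp) for lp in choice_token_log_probs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy numbers: choice 0 is a four-token answer, choice 1 a one-token answer.
# Unnormalized totals are -2.0 and -1.5, which would favor the short choice;
# the per-token averages are -0.5 and -1.5, which favor choice 0.
best, scores = select_choice([[-0.5, -0.5, -0.5, -0.5], [-1.5]])
```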

For the RAG baseline, we use the BART-based Lewis et al. (2020a) RAG-sequence model, again finetuned on Natural Questions from the Transformers library. This model predicts the answer sequence ${\mathbf{y}}$ autoregressively from left to right while marginalizing over the set of top-5 retrieved documents ( $\mathcal{Z}$ ):

$$P({\mathbf{y}}) = \sum_{z \in \mathcal{Z}} P(z) \prod_{t=1}^{|{\mathbf{y}}|} P(y_t \mid z, {\mathbf{y}}_{<t})$$

Here $P(z)$ is given by the matching score from the retrieval step. (Unlike DPR, GCS does not provide matching scores; we treat the top-5 documents with equal probabilities.) The conditioned-upon question is suppressed for brevity in the equation.
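The marginalization over retrieved documents can be computed stably in log space. The following sketch assumes that the document log priors and the per-token log probabilities of the answer under each document are already available; the function names are hypothetical, and the uniform prior corresponds to the GCS case where no matching scores exist.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def uniform_log_priors(n):
    """log P(z) when the n retrieved documents are treated as equally likely."""
    return [-math.log(n)] * n


def rag_sequence_log_prob(doc_log_priors, token_log_probs_per_doc):
    """log P(y), marginalized over documents.

    For each document z, the answer's token log probabilities are summed
    (the product over time steps) and added to log P(z); the per-document
    terms are then combined with log-sum-exp (the sum over documents).
    """
    per_doc = [
        prior + sum(token_log_probs)
        for prior, token_log_probs in zip(doc_log_priors, token_log_probs_per_doc)
    ]
    return log_sum_exp(per_doc)


# Toy example with two documents and a two-token answer.
token_log_probs = [
    [math.log(0.5), math.log(0.5)],  # P(y | z1) = 0.25
    [math.log(1.0), math.log(0.5)],  # P(y | z2) = 0.5
]
log_p = rag_sequence_log_prob(uniform_log_priors(2), token_log_probs)
# P(y) = 0.5 * 0.25 + 0.5 * 0.5 = 0.375
```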

We propose a straightforward GPT-3 prompting method with temporal contexts (Fig. 5; see Lazaridou et al. (2022) for alternative prompt templates). We prepend to every question the title and the first two paragraphs of the top-5 articles from the document retrieval step. (This substantially reduces the inference computations; the title and first two paragraphs contain most of the key information in each article.) The publication date is inserted, using the metadata of each retrieved article (e.g., “Article on November 2, 2022” in Fig. 5). For Wikipedia passages retrieved by DPR, we prepend “December 31, 2018,” based on the Wikipedia dump date Karpukhin et al. (2020). Our ablation studies on date insertion will show that the open-book GPT-3 system benefits from specifying the dates of the question and the retrieved articles to some extent (§3.2).
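A prompt of this kind could be assembled as in the sketch below. Only the “Article on <date>” prefix is taken from the description above; the rest of the template (the “Question on <date>” line, the “Answer:” suffix, the separators) and all names are hypothetical stand-ins, since the exact layout of Fig. 5 is not reproduced here.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def format_date(d):
    """Render a date as, e.g., 'November 2, 2022'."""
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"


def build_prompt(question, question_date, articles, max_articles=5, max_paragraphs=2):
    """Prepend dated article snippets to a dated question.

    `articles` is a list of dicts with hypothetical keys "title", "date",
    and "paragraphs". Passing an empty list yields a closed-book prompt.
    """
    blocks = []
    for article in articles[:max_articles]:
        lead = "\n".join(article["paragraphs"][:max_paragraphs])
        blocks.append(
            f"Article on {format_date(article['date'])}: {article['title']}\n{lead}"
        )
    blocks.append(f"Question on {format_date(question_date)}: {question}\nAnswer:")
    return "\n\n".join(blocks)


articles = [
    {"title": "T", "date": date(2022, 11, 2), "paragraphs": ["p1", "p2", "p3"]},
]
open_book_prompt = build_prompt("Q?", date(2022, 11, 5), articles)
closed_book_prompt = build_prompt("Q?", date(2022, 11, 5), [])
```

Truncating each article to its title and first two paragraphs keeps the prompt short, which is the stated reason for the reduced inference cost.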

2.4.2 Closed-book QA Models

Closed-book QA models directly answer questions without access to external knowledge. They have proven competitive with open-book models on some QA datasets Roberts et al. (2020); Guu et al. (2020). Since these models are trained/finetuned on the data available at that time, they cannot address questions about new events or information. Nonetheless, some of the real-time information needs do not necessarily require up-to-date information. Indeed, RealTime QA contains a small portion of such questions. For instance, Larry Tesler died this week at the age of 74. He was the creator of which popular keyboard function?; Microsoft retired its Internet Explorer browser this week. What year did it debut? These questions are triggered by a new event but inquire about facts in the past that have not changed recently. Closed-book baselines thus quantify the degree to which up-to-date information is necessary to answer questions in RealTime QA. We use the following two strong methods for closed-book QA.

We use the T5 model (T5-11B; Raffel et al., 2020) finetuned on the Natural Questions data, again from the Transformers library. Following the open-book baseline, we select the choice with the maximum average log score in the multiple-choice setting.

Similar to the open-book baselines (§2.4.1), we apply a prompting method to GPT-3 (text-davinci-002). We use the same prompt as Fig. 5, except that no articles are inserted before the question. Again following the open-book baseline, we select the choice with the maximum average log score in the multiple-choice setting.

Experiments and Analysis

We started our real-time experiments on June 17, 2022, spanning six weeks as of July 22 (179 questions in total). We will continue our weekly annotations, but here we report our experimental and analysis results so far and give guidance to future participants.

Table 1 shows the results from the six weeks. In all three settings (original/NOTA multiple choice and generation), GPT-3 with Google custom search (GCS) retrieval achieves the best performance. In particular, GPT-3 with GCS substantially outperforms both closed-book GPT-3 and GPT-3 with DPR (from a December 2018 Wikipedia dump): e.g., 28.7 vs. 7.3/8.4 in generation exact matching. This suggests that GPT-3 is able to answer questions based on the given prompt, rather than relying on past information from pretraining. Nevertheless, we still see a large performance drop of all baselines from the original multiple-choice setting to NOTA (“none of the above”): e.g., from 69.3 to 59.8 for GPT-3 with GCS retrieval. Future work can further improve GPT-3’s reading comprehension ability, especially regarding answer uncertainty. We also note that the best baseline uses the black-box Google custom search API; we encourage participants to develop a competitive open-source system.

3.2 Analysis and Ablations

Fig. 6 plots the performance of the open-book GPT-3 baseline with Google custom search (GCS) over varying submission (i.e., GCS retrieval) time. All results are averaged over 179 questions between June 17 and July 22, 2022. We see a consistent pattern: the performance remains high (or improves) up to around 24 hours after the announcement but substantially degrades later. While the performance can improve when GCS starts to retrieve more recent articles, it eventually suffers from temporal gaps. Our website provides the submission time of every system as well as its performance.

Our prompt for the GPT-3 baselines prepends date information both to the articles and question (Fig. 5). Table 2 shows results from ablation studies on date insertion for the open-book (GPT-3 with Google custom search) and closed-book GPT-3 models. Temporal specification almost always helps the open-book GPT-3 model. Interestingly, it hurts the performance of the closed-book model, perhaps because the specified date is generally unseen during pretraining and the prompt becomes “out-of-domain.”

We conducted a manual error analysis of the results so far. In particular, we categorized answers from the best generation model (open-book GPT-3 with GCS) into three categories: correct, retrieval error, and reading comprehension error. For the 179 questions, the breakdown was the following: correct (52%), retrieval error (34%), and reading comprehension error (13%). This suggests that the key to instantaneous applications of question answering is accurate, up-to-date information retrieval.

Table 3 shows some examples that compare the closed-book and open-book GPT-3 models. The first three examples illustrate that GPT-3 can correctly update its answer based on the retrieved documents across diverse genres: natural disasters, the COVID-19 pandemic, and entertainment. The last three cases, on the other hand, demonstrate a critical limitation of current large language models in temporal understanding: the retrieved documents do not suffice to answer the questions due to a temporal gap, and GPT-3 still generates an outdated answer. Ideally, GPT-3 should inform the user or even the retrieval module that it does not have enough evidence to answer the question. This way, the retrieval module can expand its search, or the user can consult other resources.

Note that it is possible to limit the retrieval target to recent articles (indeed, Google custom search has a paid version with a date range feature that filters retrieval results by date), but there are potential failure modes. Firstly, some questions in RealTime QA inquire about the past, and models can benefit from older articles when answering such questions. Further, the appropriate date range for retrieval varies from question to question in real-world applications; some questions inquire about this year, while others about this week. We thus do not implement such filtering for the current real-time baselines.

Related Work

RealTime QA has time sensitivity, which several prior works addressed on various NLP tasks. Here we discuss its relation to long-standing summarization and text retrieval tasks (§4.1), as well as recent work on temporal misalignment between training and evaluation (§4.2). We then discuss its connections to broad frameworks of dynamic evaluations (§4.3) and open domain QA (§4.4).

Temporal (or timeline) summarization is a task that retrieves documents from the web and provides their summary over time Catizone et al. (2006); Aslam et al. (2013, 2014, 2015); Martschat and Markert (2017, 2018). Update summarization Witte et al. (2007); Dang and Owczarzak (2008) and new event detection/tracking Allan et al. (1998); Li et al. (2005) are tasks that monitor and track newly-added information. Prior work created datasets and systems for these tasks Tran et al. (2013, 2015); Wang et al. (2015); Chen et al. (2019); Gholipour Ghalandari and Ifrim (2020). Their evaluations are usually executed statically, with information available at the time of data collection.

In contrast, the TREC real-time summarization track evaluates systems in real time during a 1–2 week evaluation period Lin et al. (2016, 2017); Sequiera et al. (2018). Several other works and initiatives focused particularly on financial news summarization Filippova et al. (2009); Passali et al. (2021) or emergency management technology Temnikova et al. (2014); Ghosh et al. (2017); McCreadie et al. (2019), including the COVID-19 pandemic Buntain et al. (2020); Pasquali et al. (2021). This work regularly evaluates question answering systems over diverse topics in real time, but we share the goal of dealing with novel and evolving information over time; retrieval or summarization methods from these tasks (e.g., Yan et al., 2011a, b, 2012; Shou et al., 2013) can be combined with models in RealTime QA to serve various time-sensitive information needs from users. RealTime QA can also be used to evaluate time-sensitive retrieval systems by the downstream question answering performance.

4.2 Temporal Misalignment and Degradation

While not particularly motivated by instantaneous information needs like RealTime QA, prior work also explored temporal aspects of a variety of NLP tasks. A flurry of recent work analyzed performance degradation from temporal misalignment between (pre)training and evaluation/deployment on many NLP tasks Lazaridou et al. (2021); Röttger and Pierrehumbert (2021); Luu et al. (2022); Onoe et al. (2022) and proposed mitigation methods for temporal adaptation Huang and Paul (2018, 2019); Dhingra et al. (2022); Jang et al. (2022a, b); Lee et al. (2022). An open-book QA model conditions answer generation upon newly-retrieved documents Lewis et al. (2020b), but the extent to which answer generation can be updated based on the retrieved documents is limited Longpre et al. (2021b). Temporal degradation is, therefore, one of the challenges that models in RealTime QA need to address.

4.3 Dynamic Benchmarks

Unlike the majority of datasets in natural language processing, RealTime QA evaluates systems dynamically and its evaluations change over time. Several other prior works update challenge test sets Kiela et al. (2021); Potts et al. (2021); Ma et al. (2021), evaluation tasks Thrush et al. (2022), or text generation metrics Gehrmann et al. (2021, 2022); Mishra and Arunkumar (2021); Kasai et al. (2022). RealTime QA hosts a similar online platform and adopts a dynamic scheme specifically to pursue instantaneous applications.

4.4 Open Domain QA

Much prior work proposed datasets for open domain (or open retrieval) QA for English and beyond Clark et al. (2020); Asai et al. (2021, 2022); Longpre et al. (2021a); Zhang et al. (2021). Several recent works challenged the conventional problem setups Chen and Yih (2020) where correct answers can be found from a fixed, external knowledge source, such as Wikipedia. Similar to RealTime QA, Zhang and Choi (2021); Liška et al. (2022) focused on temporal or geographical contexts that can change the answer to the same question. Min et al. (2020) pointed out inherent ambiguity in questions from human users and proposed a benchmark that provides multiple answers to every question. Consistent with these prior efforts, RealTime QA aims toward broader applications of question answering beyond the conventional framework.

Conclusion and Future Work

We introduced RealTime QA, a dynamic, open domain QA benchmark that asks questions at the present time. The current version announces $\sim$ 30 questions every week and continually evaluates six real-time baselines. Our experiments from the first six weeks suggest that accurate, up-to-date information retrieval is particularly important to serve speedy information needs. We hope that RealTime QA encourages research efforts toward fast, accurate applications of natural language processing.

Acknowledgements

We thank Noriyuki Kojima, Alisa Liu, Ofir Press, Koji Shiono, Wenya Wang, the ARK group at UW, and the Mosaic team at the Allen Institute for AI for their helpful feedback on this work.