Temporal Reasoning on Implicit Events from Distant Supervision

Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, Dan Roth

Introduction

Understanding temporal relations between events in narrative text is a crucial part of text understanding. When reading a story, a human can construct a latent timeline about events’ start and end times, similar to the one shown in Fig. 1 about an automobile accident. This timeline not only contains the placements of explicitly mentioned events (e.g., ride a bicycle), but also accounts for implicit events (e.g., Farrah was distracted so she looked away). Such a latent timeline explains the dynamics between events; for example, the possible chain of events between ride and recovered in this context contains get hit and injured. The ability to construct such a timeline is essential for understanding the causal dynamics of a situation. Without it, NLP systems cannot truly understand situations and reliably solve tasks such as temporal question-answering, causal inference, and scheduling assistance.

To better evaluate this ability, we introduce a new dataset called Tracie (TempoRAl Closure InfErence) that focuses on temporal relations on implicit events in short stories. Our dataset contains high-quality annotations of both start and end time queries that test a system’s understanding of the full temporal closure (i.e., both start and end time) of events. As a task that requires considerable commonsense knowledge, we follow Zhou et al. (2020) in minimizing the size of the training set, therefore making Tracie mainly an evaluation set. The final Tracie dataset contains a total of 5.4k human-curated instances, provided in a (multi-premise) textual entailment (TE) format, as illustrated at the bottom of Fig. 1. A pre-trained language model such as T5-Large Raffel et al. (2020) fine-tuned on our new dataset achieves a modest binary prediction accuracy of $67.9$%. The same model achieves 77.4% on MATRES Ning et al. (2018b) with a similar amount of training instances. All Tracie numbers reported in this section are from Table 2. Consistent with other studies on temporal reasoning Zhou et al. (2020), these results reveal serious limitations in existing pre-trained language models.

To build models better capable of understanding time with minimal direct training data, we propose a novel distant supervision technique that improves generalization by extracting temporal patterns in large-scale free text as part of an additional pre-training step. In contrast to other attempts at extracting temporal data through patterns at a sentence level Gusev et al. (2011); Zhou et al. (2020), we extract over large windows of text such as paragraphs. This allows for capturing global information related to multiple events and extracting signals that do not appear in small-window local contexts. The resulting model, PtnTime (PatternTime), achieves a $76.6$ % accuracy on Tracie, a $9\%$ gain over using standard T5-Large. We also show the applicability of PtnTime on a standard temporal reasoning benchmark involving only explicit events, MATRES Ning et al. (2018b), with a $9$ point gain in a low-resource setting.

We achieve further improvements by coupling PtnTime with a duration model from Zhou et al. (2020) to create a neural-symbolic reasoning model called SymTime. The key idea in SymTime is to decompose the computation of temporal relations into predictions of relative distances between start times and predictions of durations. For example, in Fig. 1, we can decide that distracted likely ends before try starts because the duration of distracted is likely to be shorter than the distance between the two start times. This allows for better prediction on the end time, which rarely appears in natural text and has been previously shown to be difficult to annotate Ning et al. (2018b). Such a symbolic computation involves a logical combination of the individual models in a way that formalizes part of the Allen interval algebra Allen (1983). This model, which supports a wider range of temporal computation and can be used with and without task-specific supervision, achieves a final accuracy of $78.9$% on Tracie’s binary classification metric. We also show that SymTime is more robust to different distributions of the training data, demonstrating the benefits of using a temporal model with a transparent reasoning process.

In summary, we make the following 3 contributions: (1) a temporal relation dataset Tracie focusing on implicit events (§3); (2) a distant supervision process for temporal understanding of implicit events (§4); and (3) a reasoning model that makes end-time comparisons using predictions of start-time distances and durations (§5). Finally, we demonstrate the effectiveness of our models on Tracie, as well as the applicability of our approach to an existing temporal benchmark (§6).

Related Work

Temporal reasoning has received much attention in the NLP community, and to date, there are many datasets that focus on temporal ordering Pustejovsky et al. (2003); Bethard et al. (2007); Cassidy et al. (2014); Reimers et al. (2016); O’Gorman et al. (2016); Ning et al. (2018b, 2020b), and other temporal knowledge Pan et al. (2006); Zhou et al. (2019). We focus here on modeling implicit events, which has received relatively little attention. Multiple systems have been proposed as part of research into temporal ordering Do et al. (2012); Moens and Leeuwenberg (2017); Leeuwenberg and Moens (2018); Meng and Rumshisky (2018); Ning et al. (2018c); Han et al. (2019), duration prediction Vashishtha et al. (2019) and other tasks. Our decision to use a textual entailment style follows recent work on natural language inference Williams et al. (2017); Nie et al. (2020); Bhagavatula et al. (2020), which tends to not focus on time (for recent work on temporal NLI, see Vashishtha et al. (2020)). Many have used distant supervision for temporal reasoning Gusev et al. (2011); Ning et al. (2018a); Zhou et al. (2020). Comparatively, our work captures longer-range dependencies in narrative text (for related ideas, see Ammanabrolu et al. (2021)).

We are inspired by structural predictions and constraints that combat the sparsity of temporal knowledge Ning et al. (2017); Do et al. (2012), as well as neural module networks Andreas et al. (2016); Gupta et al. (2019) and other decomposition-based approaches Talmor and Berant (2018); Khashabi et al. (2018); Li et al. (2019); Wolfson et al. (2020); Khot et al. (2021). In particular, we build neural-symbolic transformer models that operationalize some of the classical interval-based computations used in earlier work on temporal reasoning Allen (1983); Gerevini and Schubert (1995) (for related ideas, compare with Leeuwenberg and Moens (2018); Vashishtha et al. (2019)).

This work is broadly related to work on causal dynamics Pearl (2009). Its combined temporal and causal focus is also related to procedural text modeling Tandon et al. (2018, 2020).

The TRACIE Dataset

In this section, we introduce the Tracie dataset. We release Tracie and its leaderboard at https://leaderboard.allenai.org/tracie

The goal of Tracie is to test a system’s ability to compare start and end times of non-extractive implicit event phrases instead of extractive triggers from the context. Such tests in Tracie take the form of multi-premise textual entailment (TE) Lai et al. (2017). Each Tracie instance contains 1) a context story (or premise) consisting of a sequence of explicit narrative events; 2) an implicit event in the form of a natural language phrase that is unmentioned but has some role in the story; 3) a comparator of either {starts,ends}; 4) an explicit event also in the form of a phrase; and 5) a temporal relation of either {before,after} that marks the relationship in the dimension defined by the comparator between the implicit-event and the explicit-event. With these five components, we are able to generate TE-style instances, using the context story as the premise and temporal queries about pair-wise relations between implicit and explicit events as hypotheses. For example, in the first positive instance shown in Fig. 1, “distracted” is the implicit-event, “starts” is the comparator, “try” is the explicit-event and “before” is the temporal-relation. They form a positive hypothesis “distracted starts before try.” (All event phrases are shortened to triggers here for simplicity; see Fig. 2 for actual phrases.) We flip the temporal-relation (i.e., “before” to “after” and vice versa) to create negative (contradiction) instances, as shown in the second example instance in Fig. 1.
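As an illustration of how the listed components could be turned into TE-style pairs, here is a minimal Python sketch. The class name `TracieQuery`, the function `to_te_instances`, the hypothesis template, and the label strings "entailment"/"contradiction" are assumptions made for this example; the released data may use a different layout.

```python
from dataclasses import dataclass

# Flipping the relation yields the negative (contradiction) hypothesis.
FLIP = {"before": "after", "after": "before"}


@dataclass(frozen=True)
class TracieQuery:
    """Hypothetical container for one annotated query."""
    story: str           # context story, used as the premise
    implicit_event: str  # phrase not mentioned in the story
    comparator: str      # "starts" or "ends"
    explicit_event: str  # phrase taken from the story
    relation: str        # gold relation: "before" or "after"


def to_te_instances(q):
    """Return (premise, hypothesis, label) triples: one positive, one negative."""
    if q.comparator not in ("starts", "ends") or q.relation not in FLIP:
        raise ValueError("unexpected comparator or relation")
    positive = f"{q.implicit_event} {q.comparator} {q.relation} {q.explicit_event}."
    negative = f"{q.implicit_event} {q.comparator} {FLIP[q.relation]} {q.explicit_event}."
    return [
        (q.story, positive, "entailment"),
        (q.story, negative, "contradiction"),
    ]


example = TracieQuery("<story text>", "distracted", "starts", "try", "before")
for premise, hypothesis, label in to_te_instances(example):
    print(label, "|", hypothesis)
```

The event phrases are shortened to triggers here, as in the running example from Fig. 1.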

Since the start times of explicit-events are more obvious to human annotators, we use them as reference points and compare the implicit-event’s start or end time with them (depending on the comparator), according to the label definitions shown in Fig. 3. In rare cases where two time points are the same (e.g., hit and get hit start at the same time in Fig. 1), we use the causal relation to decide the order, so that hit starts before get hit. Such instances are created through a multi-stage annotation process as detailed (in respective order) below. All steps are implemented with the CrowdAQ platform Ning et al. (2020a) with qualification exams.

We randomly sample short stories from the ROCStories dataset Mostafazadeh et al. (2016). For each story, one annotator writes 5 implicit event phrases that are not explicitly mentioned by the given story, but are inferable and relevant. The annotator additionally rewrites two explicit events closest to the implicit event’s start and end time, respectively. With these two events, we can build two Tracie instances (minus the temporal-relation) per implicit event, which accounts for 10 instances in total per story.

Automatic Instance Generation

We use AllenNLP Gardner et al. (2018) to extract all verbs and relevant arguments with its semantic role labeling (SRL) model. With all the verbs and their arguments, we construct a pool of explicit events in the form of short phrases. For each implicit event, we randomly select two {explicit-event, comparator} pairs from the pool and build 10 additional instances (without temporal-relation).

Label Collection

For each of the 20 instances per story, we annotate the temporal-relation with four different annotators. Annotators follow the label definition in §3.1 to produce four temporal-relations for each instance. We use the majority agreement as the final label and filter out instances on which annotators do not agree. Two authors additionally verify the instances with ambiguous verbs (e.g., “have”) and correct $5$% of the end-time instances.
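The aggregation step can be sketched as follows. This is only one plausible reading: the text does not say how disagreement is defined, so the sketch assumes an instance is kept only when one label has a strict majority (a 2-2 split among four annotators is discarded). The function name `aggregate_label` is hypothetical.

```python
from collections import Counter


def aggregate_label(votes):
    """Return the strict-majority label, or None when no label wins outright."""
    if not votes:
        return None
    top_label, top_count = Counter(votes).most_common(1)[0]
    # Assumption: ties (e.g. 2-2 with four annotators) count as no agreement.
    if top_count * 2 <= len(votes):
        return None
    return top_label


print(aggregate_label(["before", "before", "after", "before"]))
print(aggregate_label(["before", "after", "before", "after"]))
```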

Splits and Analysis

We split the data under the independent and identically distributed (i.i.d.) assumption based on stories, with a 20/80 train/test ratio. We use a small training set, following Zhou et al. (2019), as we believe temporal relations involve much commonsense knowledge. As we later show in §6.3, it is infeasible to collect a large enough human-annotated training set to capture all the knowledge needed to tackle this problem completely, and a system must acquire knowledge from external resources. As a result, we use a small training set just to define the task, and at the same time, use an extensive testing set for more robust evaluation.
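A story-level split of this kind might look like the sketch below. The field name `story_id`, the function `split_by_story`, and the fixed seed are assumptions for illustration; the point is only that all instances from one story land on the same side of the split.

```python
import random


def split_by_story(instances, train_fraction=0.2, seed=0):
    """Split instances so that no story appears in both train and test."""
    story_ids = sorted({inst["story_id"] for inst in instances})
    random.Random(seed).shuffle(story_ids)
    n_train = round(len(story_ids) * train_fraction)
    train_ids = set(story_ids[:n_train])
    train = [inst for inst in instances if inst["story_id"] in train_ids]
    test = [inst for inst in instances if inst["story_id"] not in train_ids]
    return train, test


# Toy data: 10 stories with 20 instances each.
toy = [{"story_id": s, "idx": i} for s in range(10) for i in range(20)]
train, test = split_by_story(toy)
print(len(train), len(test))
```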

The authors conduct a human upper-bound analysis on 100 randomly sampled instances, following the procedure in Zhou et al. (2020). There is a $94$% agreement and a $98$% resolved accuracy, suggesting that Tracie has a high annotation quality. (The resolved accuracy is obtained after the authors discuss and resolve any disagreements before comparing with the annotated labels.)

Pattern-Based Pre-Training

As argued in §3.2, we believe that it is more efficient to build a model that learns the prior knowledge needed for the task with distant signals and only subsequently learns the task definition through a small training set. This section describes how we collect the distant signals related to events’ start-time comparisons and pre-train a novel temporally-aware transformer model called PtnTime. While PtnTime will be used for fine-tuning directly on Tracie, it will also form the basis of a more general temporal reasoning model called SymTime that we describe in §5.

We describe the sources of distant supervision signals with the goal of understanding the relative order between two events’ start times as well as the relative distance between them.

We collect start time comparisons between pairs of events heuristically from free-text using “before/after” keywords (following much prior work in temporal modeling and extraction Do et al. (2012)). We use AllenNLP’s SRL model to process each input sentence and find verbs with a temporal argument that starts with either “before” or “after”, and contains at least another verb. If there are multiple verbs in the temporal argument, we take the one with the largest number of tokens as arguments. We match the two extracted verbs with the relation indicated by the first word of either “before” or “after”. As the example in Fig. 4 shows, the extractor identifies that purchase food is before go to park as indicated by the “before” keyword mentioned in the text. We acquire 2.8 million instances from the May 2020 Wikipedia dump using this process.
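The heuristic above can be sketched over already-parsed SRL frames. The frame layout used here (a dict with `verb`, `verb_index`, and `args` mapping role names to inclusive token spans) is a hypothetical simplification and is not AllenNLP's actual output format; the function name `extract_pairs` is likewise an assumption.

```python
def extract_pairs(tokens, frames):
    """Return (main_verb, keyword, other_verb) triples from before/after arguments."""
    pairs = []
    for main in frames:
        span = main["args"].get("ARGM-TMP")
        if span is None:
            continue
        start, end = span  # inclusive token indices of the temporal argument
        keyword = tokens[start].lower()
        if keyword not in ("before", "after"):
            continue
        # Other predicates whose verb lies inside the temporal argument.
        inner = [f for f in frames
                 if f is not main and start <= f["verb_index"] <= end]
        if not inner:
            continue

        def arg_tokens(frame):
            return sum(e - s + 1 for s, e in frame["args"].values())

        # With several candidates, keep the one with the most argument tokens.
        other = max(inner, key=arg_tokens)
        pairs.append((main["verb"], keyword, other["verb"]))
    return pairs


tokens = "I purchased food before going to the park".split()
frames = [
    {"verb": "purchased", "verb_index": 1,
     "args": {"ARG0": (0, 0), "ARG1": (2, 2), "ARGM-TMP": (3, 7)}},
    {"verb": "going", "verb_index": 4, "args": {"ARG4": (5, 7)}},
]
print(extract_pairs(tokens, frames))
```

In a real pipeline the frames would come from an SRL model, and the verbs would be expanded into event phrases using their arguments.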

Cross-Sentence Extraction

The data collected from the within-sentence patterns does not reveal the relative distance between two start times. In addition, because writers often leave trivial inferences unstated for efficiency, certain event pairs rarely co-occur within a small textual window, making one event often implicit to the other one in these pairs. To better collect such signals, we employ a cross-sentence extraction that finds direct temporal expressions of hours and dates. Because these temporal expressions (e.g., 2021-01-01) are globally comparable, the compared events can be anywhere in a document. Therefore, this process collects more supervision signals about time-point comparisons and their relative distance on event pairs with trivial causal relations. We apply the SRL model and find all temporal arguments and their associated verbs. We find the exact temporal values by filling unmentioned elements of a temporal expression with the nearest previous mention (e.g., we add “January” to the expression of “the 10th” in Fig. 4). These extractions have high precision, as the SRL model does well on identifying temporal arguments.

We then construct supervision instances under the assumption that the extracted temporal expressions describe the start times of the associated verbs (e.g., went started on January $1^{\textrm{st}}$ in Fig. 4). Each instance comprises an event pair, a temporal relation, and an estimate of the temporal difference between the two start times. Each event is a phrase constructed by taking all relevant arguments of the predicate verb in the SRL parses. We represent the differences between the two start times as one of seven coarse temporal units: { $\leq$ minutes, hours, days, weeks, months, years, $\geq$ decades}. For example, we get go to park is weeks before write review as shown in Fig. 4. In addition to the event pairs, we randomly sample sentences within the paragraph to use as the context that better defines the events. We collect 700k instances from this cross-sentence extraction process from Wikipedia.
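To make the mapping from two resolved start times to a relation and a coarse unit concrete, here is a small sketch. The bucket boundaries in `BOUNDS` are assumptions chosen for illustration (the text names the seven units but not their thresholds), and the handling of identical timestamps is also an assumption.

```python
from datetime import datetime

UNITS = ["minutes", "hours", "days", "weeks", "months", "years", "decades"]
DAY = 86400
# Assumed exclusive upper bounds (in seconds) for every unit except the last.
BOUNDS = [3600, DAY, 7 * DAY, 30 * DAY, 365 * DAY, 10 * 365 * DAY]


def coarse_distance(start_a, start_b):
    """Return (relation of A to B, coarse unit) for two start times."""
    delta = (start_b - start_a).total_seconds()
    if delta == 0:
        return None  # identical times give no ordering signal in this sketch
    relation = "before" if delta > 0 else "after"
    gap = abs(delta)
    for unit, bound in zip(UNITS, BOUNDS):
        if gap < bound:
            return relation, unit
    return relation, UNITS[-1]


print(coarse_distance(datetime(2021, 1, 1), datetime(2021, 1, 10)))
```

With these assumed thresholds, a gap of nine days falls into the "weeks" bucket.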

Language Model (LM) Pre-Training Data

We couple the specialized temporal pre-training data described above with additional paragraphs that are used to perform conventional language model pre-training using the original denoising task proposed in Raffel et al. (2020). This is done to maintain part of the original language model’s semantics and to avoid overfitting. We use the Gutenberg Dataset Lahiri (2014) as the source and collect 1 million paragraphs for this purpose.

Data Format

We then format the within / cross-sentence extraction data to consistent instances that have input sequences of event: [EventA] starts [Relation] [EventB]. story: [Paragraph] and output sequences of answer: [Label] [Distance]. Here [EventA] represents the tokens that describe the first event; [EventB] represents the ones that describe the second event; and [Paragraph] represents the tokens of the context, which is non-empty only for cross-sentence extractions. [Relation] is either before or after, and [Label] is either positive or negative. When the label is positive, the relation will be the gold relation extracted from the text; when it is negative, the relation will be the inverse of the extracted relation. We randomly make 50% of the instances negative. [Distance] is one of the 7 coarse temporal units represented with a set of blank tokens [extra_id_N]. We leave it blank for the within-sentence extractions so that the objective function will not include it in loss computations. The LM pre-training data follows the original format in Raffel et al. (2020).
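A sketch of this serialization is given below. Several details are assumptions rather than facts from the text: the exact whitespace in the templates, the mapping from each unit to a particular `[extra_id_N]` index, and the choice to keep the distance unchanged for negative instances. The function name `format_instance` is hypothetical.

```python
FLIP = {"before": "after", "after": "before"}
UNITS = ["minutes", "hours", "days", "weeks", "months", "years", "decades"]


def format_instance(event_a, event_b, gold_relation, negative,
                    paragraph="", distance=None):
    """Build (input sequence, output sequence) for one extracted event pair."""
    # Negative instances state the inverse of the extracted relation.
    relation = FLIP[gold_relation] if negative else gold_relation
    label = "negative" if negative else "positive"
    source = f"event: {event_a} starts {relation} {event_b}. story: {paragraph}".strip()
    target = f"answer: {label}"
    if distance is not None:
        # Assumed mapping: unit position in UNITS selects the blank token index.
        target += f" [extra_id_{UNITS.index(distance)}]"
    return source, target


# Within-sentence pair: no paragraph, no distance.
print(format_instance("purchase food", "go to park", "before", negative=False))
# Cross-sentence pair turned into a negative instance.
print(format_instance("go to park", "write review", "before", negative=True,
                      paragraph="<sampled sentences>", distance="weeks"))
```

A caller would draw `negative` at random with probability 0.5 to match the stated 50% negative rate.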

Pattern-Based Temporal Model (PtnTime)

We use a pre-trained sequence-to-sequence model as our base model and additionally pre-train this model using the data collected in §4.1 (for modeling details, see §6.1). We call the resulting model PtnTime. As a result of this additional pre-training step, PtnTime serves as a new set of temporally-aware model weights that can be used in place of existing pre-trained models and fine-tuned on Tracie. As we describe next, we also use PtnTime to build a modular temporal reasoning model called SymTime that attempts to go beyond a standard language modeling approach and improve start and end point prediction.

Symbolic Temporal Reasoning Model (SymTime)

To address the challenge of predicting event end times for which it is difficult to obtain high-quality direct or distant supervision, we introduce a new reasoning model called SymTime in this section. This model makes end-time comparisons by symbolically combining start time distance and duration from separate predictions based on some of the components introduced in the previous section. Different from Leeuwenberg and Moens (2018) and Vashishtha et al. (2019), our model does not rely on explicit annotations on timepoints, but only relative comparisons between them.

As described in §3.1, hypotheses in Tracie make pair-wise comparisons between two events $e_{1}$ and $e_{2}$ using a comparator $l$ from $\{\texttt{starts},\texttt{ends}\}$ and a query-relation $r$ from $\{\texttt{before},\texttt{after}\}$ based on a provided story context. We associate each $e_{j}$ with a latent start time $\mathbf{start}_{j}$ and an end time $\mathbf{end}_{j}$, as well as, for convenience, a duration $\mathbf{duration}_{j}=\mathbf{end}_{j}-\mathbf{start}_{j}$. Under this formulation, a symbolic approach to solving Tracie involves computing the relation functions $r_{l}$ shown in Figure 5. For example, given exact numeric values $\mathbf{end}_{1}$ and $\mathbf{start}_{2}$, as one would assume in a classical interval-based approach to temporal reasoning Allen (1983), determining if the first event ends before the second starts involves simply computing whether $\mathbf{end}_{1}$ is less than $\mathbf{start}_{2}$. (In the Allen algebra, the values $\mathbf{end}_{x}$ and $\mathbf{start}_{y}$ correspond to the right and left end points $x^{+},y^{-}$ in the intervals $(x^{-},x^{+}),(y^{-},y^{+})$. Likewise, our $\mathbf{duration}_{x}$ corresponds to the value $(x^{+}-x^{-})$.)
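The symbolic comparison can be sketched with plain numbers standing in for the latent quantities. This is an illustrative reading, not a reproduction of Figure 5: it assumes that both comparators are evaluated against the start time of the second event, and it rewrites $\mathbf{end}_{1}<\mathbf{start}_{2}$ as $\mathbf{duration}_{1}<\mathbf{start}_{2}-\mathbf{start}_{1}$ using the duration definition above. The function name `holds` and the treatment of exact ties are assumptions.

```python
def holds(comparator, relation, start_gap, duration_1=None):
    """Check a hypothesis given start_gap = start_2 - start_1.

    For "ends" queries, duration_1 = end_1 - start_1 must be supplied, so that
    start_gap - duration_1 equals start_2 - end_1.
    """
    if comparator == "starts":
        diff = start_gap                 # start_2 - start_1
    elif comparator == "ends":
        if duration_1 is None:
            raise ValueError("duration_1 is required for 'ends' queries")
        diff = start_gap - duration_1    # start_2 - end_1
    else:
        raise ValueError(f"unknown comparator: {comparator}")
    if relation == "before":
        return diff > 0
    if relation == "after":
        return diff < 0
    raise ValueError(f"unknown relation: {relation}")


# Event 1 starts 5 units before event 2 and lasts 2 units: it ends before event 2 starts.
print(holds("ends", "before", start_gap=5.0, duration_1=2.0))
# With a duration of 8 units it instead ends after event 2 starts.
print(holds("ends", "after", start_gap=5.0, duration_1=8.0))
```

In the model described here, the numeric inputs would be replaced by neural estimates of start-time distance and duration.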

In what follows, we describe how we approximate the values of the two functions via individual neural modules (see illustration in Fig. 6).

Duration Estimation

Computation and Learning

Experiments

In this section, we detail our experimental setup (§6.1-6.2) and report our main results (§6.3-6.5). We release the systems for reproduction at http://cogcomp.org/page/publication_view/937

We use T5-Large implemented by Wolf et al. (2019) as our base sequence-to-sequence model for both PtnTime and the duration model in §5.2 as it provides for faster iterations. We use early stopping, batch size of 32 and other default parameters. PtnTime converges after 45k steps ( $\sim$ 1.4M instances) and the duration model converges after 80k steps ( $\sim$ 2.6M instances). We use these pre-trained weights in SymTime as well as SymTime-ZeroShot which uses no Tracie supervision.

We compare our proposed models with a host of baselines based on the same pre-trained language model, including BaseLM: T5-Large, and BaseLM-MATRES: T5-Large fine-tuned on 20k MATRES training data. We also compare with other architectures/models, including BiLSTM as used in Williams et al. (2017), Roberta-Large Liu et al. (2019) and T5-3B. All models and baselines follow a standard TE setup and default parameters. We report a 3-run average and each model is run until convergence.

Metrics and Settings

We measure system performance on Tracie separately for start-time hypotheses and end-time hypotheses. We also employ a story-wide exact match metric, which is the percentage of stories for which all related hypotheses are answered correctly.
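The story-wide exact match metric can be computed as in the following sketch. The record layout `(story_id, gold_label, predicted_label)` and the function name `story_exact_match` are assumptions for illustration.

```python
from collections import defaultdict


def story_exact_match(records):
    """Percentage of stories whose hypotheses are all predicted correctly.

    records: iterable of (story_id, gold_label, predicted_label).
    """
    all_correct = defaultdict(lambda: True)
    for story_id, gold, pred in records:
        all_correct[story_id] = all_correct[story_id] and (gold == pred)
    if not all_correct:
        return 0.0
    return 100.0 * sum(all_correct.values()) / len(all_correct)


records = [
    ("s1", "positive", "positive"),
    ("s1", "negative", "negative"),
    ("s2", "positive", "negative"),  # one error makes story s2 count as wrong
    ("s2", "negative", "negative"),
]
print(story_exact_match(records))
```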

In addition to Tracie’s standard i.i.d. split, we propose a pruned version of the training set with balanced prior distributions. For example, in the i.i.d. training set, 70% of the examples with the comparator ends and relation after are positive. We randomly remove instances from the majority classes to produce a uniform-prior training set such that a model can no longer rely on such prior distributions. We believe this setting better evaluates a system’s true understanding of the task.
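One way to produce such a pruned training set is sketched below. It assumes that balancing is done within each (comparator, relation) group by randomly down-sampling the majority label; the text does not spell out the grouping, so this is an interpretation. The field names and the function `prune_to_uniform_prior` are hypothetical.

```python
import random
from collections import defaultdict


def prune_to_uniform_prior(instances, seed=0):
    """Down-sample so each (comparator, relation) group has equal label counts."""
    rng = random.Random(seed)
    groups = defaultdict(lambda: defaultdict(list))
    for inst in instances:
        key = (inst["comparator"], inst["relation"])
        groups[key][inst["label"]].append(inst)
    pruned = []
    for by_label in groups.values():
        keep = min(len(by_label["positive"]), len(by_label["negative"]))
        for label in ("positive", "negative"):
            # Randomly drop instances from the majority label.
            pruned.extend(rng.sample(by_label[label], keep))
    return pruned


# Toy group mirroring the 70/30 skew mentioned above.
toy = [{"comparator": "ends", "relation": "after", "label": "positive"}] * 7 \
    + [{"comparator": "ends", "relation": "after", "label": "negative"}] * 3
balanced = prune_to_uniform_prior(toy)
print(len(balanced))
```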

Main Results

Table 1 shows system performance on Tracie’s i.i.d. setting. We observe that PtnTime improves on all metrics over the base language model, with 6% on start-time comparisons and 8% on story-wide exact match. It also outperforms BaseLM-MATRES, suggesting that distant supervision is more efficient than extensive human annotation.

With a symbolic end-time inference, SymTime further improves on all metrics, with 7%, 4%, and 9% gains over the base language model on start time, end time and story-wide exact match, respectively. SymTime can further improve the performance on start-time hypotheses over PtnTime even though they use the same model to predict start-time queries. This is because PtnTime is not designed to understand end time from pre-training, and fine-tuning on such data hurts its representation in general. This illustrates the benefits of models using explicit and sensible reasoning processes.

Table 2 compares systems in the uniform-prior training setting. Compared to the setting in Table 1, a system cannot exploit prior knowledge about the label distribution when making predictions. Given this, we see that all baselines produce a much lower performance, e.g., the BiLSTM, which is a model that lacks much of the pre-requisite knowledge for reasoning, suddenly performs near random chance. Compared to the baseline models, PtnTime only drops $2.7$ %, suggesting that it is more invariant to evaluation settings and better understands temporal common sense. SymTime has the smallest drop among all models (1.7%) because of its explicit reasoning process on end-time hypotheses. SymTime-ZeroShot does not use any Tracie training examples, so it has the same performance in the uniform-prior setting which outperforms all supervised baselines including T5-3B.

Extrinsic Evaluation

To show that our model is not limited to the Tracie dataset and is general in temporal relation reasoning, we also evaluate on MATRES Ning et al. (2018b), a temporal relation dataset focused on comparing explicit events’ start times. We train and evaluate only the instances with a label of either “before” or “after”, which accounts for about 80% of all instances. We compare the performance of SymTime with BaseLM. (This is virtually the same as using PtnTime, as MATRES evaluates neither duration nor end times.) We report four results: OT-NS (original test, no story): train and test with only the sentences containing the trigger verbs; OT: train and test with the entire document (down-sampled to be below the maximum sequence length) as an auxiliary input; OT-MS (original test, minimal supervision): train with 1.2k (6%) training instances; PT (perturbed test): train with the complete training set and test on a perturbed test set from Gardner et al. (2020). In OT-NS, we also report a SOTA system from Wang et al. (2020) under the same two-label setting. (Wang et al. (2020) is trained with two additional labels. We constrain the output space to only “before” and “after” using argmax, but this process makes it not directly comparable.)

Table 3 shows the performance of our model and the baselines. We see that our model is consistently better than BaseLM, and at the same time, comparable to Wang et al. (2020). Our model benefits more from input contexts, and only drops about 4% in the OT-MS setting with minimal supervision (from 89.6 to 86.1), compared to the 10% drop from T5-Large. This shows the effectiveness of our distant signals in §4.1, which are also designed to encourage contextual understanding.

Ablation Studies and Analysis

To better understand the improvements from our models, we conduct several ablation studies.

Table 4 shows the results on Tracie where the story is not provided as part of the inputs to systems (a no-story setting). While such a setting bears some resemblance to the partial-input baselines often employed in TE Poliak et al. (2018), in our setting, it is often possible to predict temporal relations in the absence of stories because of strong commonsense priors. Indeed, we estimate that $65$% of the instances can be correctly predicted from the hypotheses alone, based on expert analysis in §3.2. This suggests an $82.5$% human upper-bound in this no-story setting, assuming that the remaining $35$% non-predictable instances are decided by random guessing. Hence, such a setting partly evaluates a model’s ability to incorporate commonsense priors when making decisions.
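The upper-bound figure follows directly from the two stated proportions, taking a random guess on a binary label to be correct half of the time:

$$0.65\times 1.0+0.35\times 0.5=0.825$$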

We see that BaseLM is close to random chance, whereas PtnTime and SymTime improve by 20% and 22%, respectively. This suggests that our models better understand temporal common sense through the distant supervision on both start times and duration. On the other hand, we observe much smaller drops in our models’ performance in this no-story setting. This suggests that our models do not improve as much on the 35% of instances that require multi-hop timeline constructions over more than two events, motivating future work.

Table 5 compares the two pre-training sources described in §4.1 by individually pre-training two models with only within-sentence or cross-sentence extracted data. We see that the cross-sentence extraction brings the most performance gain on Tracie’s start-time binary metric under the uniform-prior training setting. This suggests that the global extraction rule is able to introduce new knowledge that is not seen in localized language model pre-training. Combining the within-sentence data further improves the performance.

Through analysis of the interval predictions made by SymTime, we notice a tendency for the model to predict “after” for end-time instances, possibly due to overestimated durations: a byproduct of natural biases in text. Given the weak signal used to learn such intervals and these potential biases, this is not altogether surprising. We leave the task of learning more robust and faithful interval representations for future work.

Conclusion

We introduce a challenging dataset, Tracie, to evaluate systems’ temporal understanding of implicit events. We propose a distant supervision process that improves language models’ understanding of start times of both explicit and implicit events. We further combine this process with a distantly supervised model that estimates events’ duration to compare event end times, under the explicit rule that end times are start times plus durations. We show that our model improves performance on Tracie and MATRES, suggesting the effectiveness of high-precision pre-training and symbolic temporal reasoning. Despite these advances, Tracie continues to be a challenging task for future work on general temporal reasoning.

Acknowledgments

This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER Program, and by Contract FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.