Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly

Nora Kassner, Hinrich Schütze

Introduction

Pretrained language models (PLMs) like Transformer-XL Dai et al. (2019), ELMo Peters et al. (2018) and BERT Devlin et al. (2019) have emerged as universal tools that capture a diverse range of linguistic and factual knowledge. Recently, Petroni et al. (2019) introduced LAMA (LAnguage Model Analysis) to investigate whether PLMs can recall factual knowledge that is part of their training corpus. Since the PLM training objective is to predict masked tokens, question answering (QA) tasks can be reformulated as cloze questions. For example, “Who wrote ‘Dubliners’?” is reformulated as “[MASK] wrote ‘Dubliners’.” In this setup, Petroni et al. (2019) show that PLMs outperform automatically extracted knowledge bases on QA. In this paper, we investigate this capability of PLMs in the context of (1) negation and what we call (2) mispriming.

(1) Negation. To study the effect of negation on PLMs, we introduce the negated LAMA dataset. We insert negation elements (e.g., “not”) in LAMA cloze questions (e.g., “The theory of relativity was not developed by [MASK].”) – this gives us positive/negative pairs of cloze questions.

Querying PLMs with these pairs and comparing the predictions, we find that the predicted fillers have high overlap. Models are equally prone to generate facts (“Birds can fly”) and their incorrect negation (“Birds cannot fly”). We find that BERT handles negation best among PLMs, but it still fails badly on most negated probes. In a second experiment, we show that BERT can in principle memorize both positive and negative facts correctly if they occur in training, but that it generalizes poorly to unseen sentences (positive and negative). However, after finetuning, BERT does learn to correctly classify unseen facts as true/false.

(2) Mispriming. We use priming, a standard experimental method in human psychology Tulving and Schacter (1990) where a first stimulus (e.g., “dog”) can influence the response to a second stimulus (e.g., “wolf” in response to “name an animal”). Our novel idea is to use priming for probing PLMs, specifically mispriming: we give automatically generated misprimes to PLMs that would not mislead humans. For example, we add “Talk? Birds can [MASK]” to LAMA where “Talk?” is the misprime. A human would ignore the misprime, stick to what she knows and produce a filler like “fly”. We show that, in contrast, PLMs are misled and fill in “talk” for the mask.

We could have manually generated more natural misprimes. For example, the misprime “regent of Antioch” in “Tancred, regent of Antioch, played a role in the conquest of [MASK]” tricks BERT into choosing the filler “Antioch” (instead of “Jerusalem”). Our automatic misprimes are less natural, but automatic generation allows us to create a large misprime dataset for this initial study.

Contribution. We show that PLMs’ ability to learn factual knowledge is – in contrast to human capabilities – extremely brittle for negated sentences and for sentences preceded by distracting material (i.e., misprimes). Data and code will be published at https://github.com/norakassner/LAMA_primed_negated.

Data and Models

LAMA’s cloze questions are generated from subject-relation-object triples from knowledge bases (KBs) and question-answer pairs. For KB triples, cloze questions are generated, for each relation, by a templatic statement that contains variables X and Y for subject and object (e.g., “X was born in Y”). We then substitute the subject for X and MASK for Y. In a question-answer pair, we MASK the answer.
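The two generation routes described above can be sketched as follows. This is a minimal illustration, not the LAMA code: the function names `cloze_from_triple` and `cloze_from_qa` are hypothetical, and the bare `X`/`Y` placeholders follow the notation of this paper rather than any particular file format.

```python
import re

MASK = "[MASK]"  # mask token string, assumed to match BERT's convention


def cloze_from_triple(template: str, subject: str, mask: str = MASK) -> str:
    """Substitute the subject for X and the mask token for Y."""
    # A single regex pass avoids re-substituting an "X" or "Y" that
    # happens to occur inside the subject string itself.
    fill = {"X": subject, "Y": mask}
    return re.sub(r"\b[XY]\b", lambda m: fill[m.group(0)], template)


def cloze_from_qa(statement: str, answer: str, mask: str = MASK) -> str:
    """Mask the first occurrence of the answer span in a statement."""
    if answer not in statement:
        raise ValueError("answer span not found in statement")
    return statement.replace(answer, mask, 1)


print(cloze_from_triple("X was born in Y", "Isaac Newton"))
print(cloze_from_qa("James Joyce wrote 'Dubliners'.", "James Joyce"))
```

The object of the triple (or the masked answer) is kept aside as the gold filler against which the model's top-ranked prediction is compared.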

LAMA is based on several sources: (i) Google-RE. 3 relations: “place of birth”, “date of birth”, “place of death”. (ii) T-REx Elsahar et al. (2018). Subset of Wikidata triples. 41 relations. (iii) ConceptNet Li et al. (2016). 16 commonsense relations. The underlying corpus provides matching statements to query PLMs. (iv) SQuAD Rajpurkar et al. (2016). Subset of 305 context-insensitive questions, reworded as cloze questions.

We use the source code provided by Petroni et al. (2019) and Wolf et al. (2019) to evaluate Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl).

Negated LAMA. We created negated LAMA by manually inserting a negation element in each template or question. For ConceptNet we only consider an easy-to-negate subset (see appendix).

Misprimed LAMA. We misprime LAMA by inserting an incorrect word and a question mark at the beginning of a statement; e.g., “Talk?” in “Talk? Birds can [MASK].” We only misprime questions that are answered correctly by BERT-large. To make sure the misprime is misleading, we manually remove correct primes for SQuAD and ConceptNet and automatically remove primes that are the correct filler for a different instance of the same relation for T-REx and ConceptNet. We create four versions of misprimed LAMA (A, B, C, D) as described in the caption of Table 3; Table 1 gives examples.
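The construction of a misprimed statement can be sketched as below. The helper name `misprime` is hypothetical, and placing the optional “neutral” sentences between the misprime and the cloze question is an assumption based on the description of increased distance in the appendix.

```python
from typing import Sequence


def misprime(statement: str, prime: str,
             neutral_sentences: Sequence[str] = ()) -> str:
    """Prefix a cloze statement with '<Prime>?' and optional filler sentences."""
    # Upper-case only the first character so multi-word primes such as
    # "New York" keep their internal capitalisation.
    question = prime[:1].upper() + prime[1:] + "?"
    return " ".join([question, *neutral_sentences, statement])


print(misprime("Birds can [MASK].", "talk"))
print(misprime("Birds can [MASK].", "talk",
               ["This is great.", "Hold this thought."]))
```

A filter that discards primes which are themselves correct fillers would run before this step; it is omitted here because it depends on the gold answers of each relation.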

Results

Negated LAMA. Table 2 gives Spearman rank correlation $\rho$ and % overlap in rank 1 predictions between original and negated LAMA.

Our assumption is that the correct answers for a pair of positive question and negative question should not overlap, so high values indicate a lack of understanding of negation. The two measures are complementary and yet agree very well. The correlation measure is sensitive in distinguishing cases where negation has a small effect from those where it has a larger effect. % overlap is a measure that is direct and easy to interpret.

A reviewer observed that Spearman correlation is generally high and wondered whether high Spearman correlation is really a reliable indicator of negation not changing the answer of the model. As a sanity check, we also randomly sampled, for each query correctly answered by BERT-large (e.g., “Einstein born in [MASK]”), another query with a different answer, but the same template relation (e.g., “Newton born in [MASK]”) and computed the Spearman correlation between the predictions for the two queries. In general, these positive-positive Spearman correlations were significantly lower than those between positive (“Einstein born in [MASK]”) and negative (“Einstein not born in [MASK]”) queries (t-test, $p<0.01$ ). There were two exceptions (not significantly lower): T-REx 1-1 and Google-RE birth-date.
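For concreteness, the two measures can be sketched in plain Python. This is an illustration under assumptions, not the evaluation code: it assumes that each query yields one score per vocabulary entry, aligned between the positive and the negated query, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """1-based ranks; tied scores share the mean of their rank positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        raise ValueError("rho is undefined for constant scores")
    return cov / (va * vb) ** 0.5


def rank1_overlap(pos_top1: Sequence[str], neg_top1: Sequence[str]) -> float:
    """Percentage of query pairs whose top-ranked fillers are identical."""
    same = sum(p == n for p, n in zip(pos_top1, neg_top1))
    return 100.0 * same / len(pos_top1)
```

Here `spearman` would be applied to the two score vectors of one positive/negative pair and the results averaged over pairs, while `rank1_overlap` takes the lists of top-ranked fillers over all pairs of a relation.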

In most cases, $\rho>85\%$ ; overlap in rank 1 predictions is also high. ConceptNet results are most strongly correlated but T-REx 1-1 results are less overlapping. Table 4 gives examples (lines marked “N”). BERT has slightly better results. Google-RE date of birth is an outlier because the pattern “X (not born in [MASK])” rarely occurs in corpora and predictions are often nonsensical.

In summary, PLMs poorly distinguish positive and negative sentences.

We give two examples of the few cases where PLMs make correct predictions, i.e., they solve the cloze task as human subjects would. For “The capital of X is not Y” (T-REx, 1-1) top ranked predictions are “listed”, “known”, “mentioned” (vs. cities for “The capital of X is Y”). This is appropriate since the predicted sentences are more common than sentences like “The capital of X is not Paris”. For “X was born in Y”, cities are predicted, but for “X was not born in Y”, sometimes countries are predicted. This also seems natural: for the positive sentence, cities are more informative, for the negative, countries.

Balanced corpus. Investigating this further, we train BERT-base from scratch on a synthetic corpus. Hyperparameters are listed in the appendix. The corpus contains as many positive sentences of the form “ $x_{j}$ is $a_{n}$ ” as negative sentences of the form “ $x_{j}$ is not $a_{n}$ ” where $x_{j}$ is drawn from a set of 200 subjects $\mathcal{S}$ and $a_{n}$ from a set of 20 adjectives $\mathcal{A}$ . The 20 adjectives form 10 pairs of antonyms (e.g., “good”/“bad”). $\mathcal{S}$ is divided into 10 groups $g_{m}$ of 20. Finally, there is an underlying KB that defines valid adjectives for groups. For example, assume that $g_{1}$ has property $a_{m}$ = “good”. Then for each $x_{i}\in g_{1}$ , the sentences “ $x_{i}$ is good” and “ $x_{i}$ is not bad” are true. The training set is generated to contain all positive and negative sentences for 70% of the subjects. It also contains either only the positive sentences for the other 30% of subjects (in that case the negative sentences are added to test) or vice versa. Cloze questions are generated in the format “ $x_{j}$ is [MASK]”/“ $x_{j}$ is not [MASK]”. We test whether (i) BERT memorizes positive and negative sentences seen during training, (ii) it generalizes to the test set. As an example, a correct generalization would be “ $x_{i}$ is not bad” if “ $x_{i}$ is good” was part of the training set. The question is: does BERT learn, based on the patterns of positive/negative sentences and within-group regularities, to distinguish facts from non-facts?
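A corpus with this structure could be generated roughly as follows. The sketch rests on several assumptions that the description leaves open: every group is assigned exactly one valid adjective from each antonym pair, the choice between "positive in train" and "negative in train" is made per held-out subject by a coin flip, and all adjectives other than “good”/“bad” are placeholder strings. The function name is hypothetical.

```python
import random
from typing import List, Tuple


def build_balanced_corpus(n_subjects: int = 200, n_groups: int = 10,
                          train_frac: float = 0.7,
                          seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (train, test) sentence lists for the synthetic corpus."""
    rng = random.Random(seed)
    # 10 antonym pairs; only good/bad comes from the text, the rest are
    # placeholder names invented for this sketch.
    pairs = [("good", "bad")] + [(f"adj{k}a", f"adj{k}b") for k in range(1, 10)]
    subjects = [f"x{j}" for j in range(n_subjects)]
    group_size = n_subjects // n_groups

    # Assumed KB: per group, a list of (valid adjective, its antonym).
    kb = [[pair if rng.random() < 0.5 else pair[::-1] for pair in pairs]
          for _ in range(n_groups)]

    shuffled = subjects[:]
    rng.shuffle(shuffled)
    fully_seen = set(shuffled[:round(train_frac * n_subjects)])

    train, test = [], []
    for j, x in enumerate(subjects):
        facts = kb[j // group_size]
        positive = [f"{x} is {valid}" for valid, _ in facts]
        negative = [f"{x} is not {antonym}" for _, antonym in facts]
        if x in fully_seen:
            train += positive + negative
        elif rng.random() < 0.5:   # only positive seen; negative held out
            train += positive
            test += negative
        else:                      # only negative seen; positive held out
            train += negative
            test += positive
    return train, test


train, test = build_balanced_corpus()
print(len(train), len(test))
```

Under these assumptions, generalizing to a held-out sentence requires the model to use either the antonym relation or the other members of the subject's group, since the sentence itself never occurs in training.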

Table 5 (“pretrained BERT”) shows that BERT memorizes positive and negative sentences, but generalizes poorly to the test set for both positive and negative. The learning curves (see appendix) show that this is not due to overfitting the training data. While the training loss rises, the test precision fluctuates around a plateau. However, if we finetune BERT (“finetuned BERT”) on the task of classifying sentences as true/false, its test accuracy is 100%. (Recall that false sentences simply correspond to true sentences with a “not” inserted or removed.) So BERT easily learns negation if supervision is available, but fails without it. This experiment demonstrates the difficulty of learning negation through unsupervised pretraining. We suggest that the inability of pretrained BERT to distinguish true from false is a serious impediment to accurately handling factual knowledge.
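The construction of false sentences for the classification task can be illustrated with a small sketch. The helper names are hypothetical, and the sentence format “x is a” / “x is not a” is taken from the corpus description above.

```python
from typing import List, Tuple


def flip_negation(sentence: str) -> str:
    """Insert 'not' after 'is', or remove it if it is already there."""
    if " is not " in sentence:
        return sentence.replace(" is not ", " is ", 1)
    return sentence.replace(" is ", " is not ", 1)


def labeled_examples(true_sentences: List[str]) -> List[Tuple[str, bool]]:
    """Pair every true sentence with its false counterpart."""
    examples = []
    for sentence in true_sentences:
        examples.append((sentence, True))
        examples.append((flip_negation(sentence), False))
    return examples


print(labeled_examples(["x1 is good", "x1 is not bad"]))
```

Each true sentence thus contributes exactly one false sentence, so the classification data is balanced between the two labels.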

Misprimed LAMA. Table 3 shows the effect of mispriming on BERT-large for questions answered correctly in original LAMA; recall that Table 1 gives examples of sentences constructed in modes A, B, C and D. In most cases, mispriming with a highly ranked incorrect object causes a precision drop of over 60% (C). Table 4 (lines marked “M”) shows example predictions for these misprimed sentences. This sensitivity to misprimes persists when the distance between misprime and cloze question is increased: the drop remains when 20 sentences are inserted (D). The results for Google-RE are striking: the model recalls almost no facts (C). BERT is less, but still badly, affected by misprimes that match selectional restrictions (B). The model is more robust against priming with random words (A): the precision drop is on average more than 35% lower than for (D). We included the baseline (A) as a sanity check for the precision drop measure. These baseline results show that the presence of a misprime per se does not confuse the model; a less distracting misprime (different type of entity or a completely implausible answer) often results in a correct answer by BERT.

Discussion

Whereas Petroni et al. (2019)’s results suggest that PLMs are able to memorize facts, our results indicate that PLMs largely do not learn the meaning of negation. They mostly seem to predict fillers based on co-occurrence of subject (e.g., “Quran”) and filler (“religious”) and to ignore negation.

A key problem is that in the LAMA setup, not answering (i.e., admitting ignorance) is not an option. While the prediction probability generally is somewhat lower in the negated compared to the positive answer, there is no threshold across cloze questions that could be used to distinguish valid positive from invalid negative answers (cf. Table 4).
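What "no threshold" means can be made concrete with a small sketch: given the prediction probabilities of valid positive answers and of invalid negative answers, search for the single cut-off that separates them best. The function name and the probabilities in the usage example are invented for illustration; they are not values from Table 4.

```python
from typing import Optional, Sequence, Tuple


def best_threshold(valid_probs: Sequence[float],
                   invalid_probs: Sequence[float]) -> Tuple[Optional[float], float]:
    """Best (threshold, accuracy) for the rule 'accept if prob >= threshold'."""
    total = len(valid_probs) + len(invalid_probs)
    best: Tuple[Optional[float], float] = (None, 0.0)
    # Only observed probabilities need to be tried as cut-offs.
    for t in sorted(set(valid_probs) | set(invalid_probs)):
        correct = (sum(p >= t for p in valid_probs)
                   + sum(p < t for p in invalid_probs))
        accuracy = correct / total
        if accuracy > best[1]:
            best = (t, accuracy)
    return best


# Toy numbers: the two probability ranges interleave, so no cut-off is perfect.
print(best_threshold([0.9, 0.4], [0.8, 0.3]))
```

If the best achievable accuracy stays clearly below 1.0, the probabilities alone cannot be used to reject the invalid negative answers.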

We suspect that a possible explanation for PLMs’ poor performance is that negated sentences occur much less frequently in training corpora. Our synthetic corpus study (Table 5) shows that BERT is able to memorize negative facts that occur in the corpus. However, the PLM objective encourages the model to predict fillers based on similar sentences in the training corpus – and if the most similar statement to a negative sentence is positive, then the filler is generally incorrect. However, after finetuning, BERT is able to classify truth/falseness correctly, demonstrating that negation can be learned through supervised training.

The mispriming experiment shows that BERT often handles random misprimes correctly (Table 3 A). There are also cases where BERT does the right thing for difficult misprimes, e.g., it robustly attributes “religious” to Quran (Table 4). In general, however, BERT is highly sensitive to misleading context (Table 3 C) that would not change human behavior in QA. It is especially striking that a single word suffices to distract BERT. This may suggest that it is not knowledge that is learned by BERT, but that its performance is mainly based on similarity matching between the current context on the one hand and sentences in its training corpus and/or recent context on the other hand. Poerner et al. (2019) present a similar analysis.

Our work is a new way of analyzing differences between PLMs and human-level natural language understanding. We should aspire to develop PLMs that – like humans – can handle negation and are not easily distracted by misprimes.

Related Work

PLMs are top performers for many tasks, including QA Kwiatkowski et al. (2019); Alberti et al. (2019). PLMs are usually finetuned Liu et al. (2019); Devlin et al. (2019), but recent work has applied models without finetuning Radford et al. (2019); Petroni et al. (2019). Bosselut et al. (2019) investigate PLMs’ common sense knowledge, but do not consider negation explicitly or priming.

A wide range of literature analyzes linguistic knowledge stored in pretrained embeddings Jumelet and Hupkes (2018); Gulordava et al. (2018); Giulianelli et al. (2018); McCoy et al. (2019); Dasgupta et al. (2018); Marvin and Linzen (2018); Warstadt and Bowman (2019); Kann et al. (2019). Our work analyzes factual knowledge. McCoy et al. (2019) show that BERT finetuned to perform natural language inference heavily relies on syntactic heuristics, also suggesting that it is not able to adequately acquire common sense.

Warstadt et al. (2019) investigate BERT’s understanding of how negative polarity items are licensed. Our work, focusing on factual knowledge stored in negated sentences, is complementary since grammaticality and factuality are mostly orthogonal properties. Kim et al. (2019) investigate understanding of negation particles when PLMs are finetuned. In contrast, our focus is on the interaction of negation and factual knowledge learned in pretraining. Ettinger (2019) defines and applies psycho-linguistic diagnostics for PLMs. Our use of priming is complementary. Their data consists of two sets of 72 and 16 sentences whereas we create 42,867 negated sentences covering a wide range of topics and relations.

Ribeiro et al. (2018) test for comprehension of minimally modified sentences in an adversarial setup while trying to keep the overall semantics the same. In contrast, we investigate large changes of meaning (negation) and context (mispriming). In contrast to adversarial work (e.g., Wallace et al. (2019)), we do not focus on adversarial examples for a specific task, but on pretrained models’ ability to robustly store factual knowledge.

Conclusion

Our results suggest that pretrained language models address open domain QA in datasets like LAMA by mechanisms that are more akin to relatively shallow pattern matching than the recall of learned factual knowledge and inference.

Implications for future work on pretrained language models. (i) Both factual knowledge and logic are discrete phenomena in the sense that sentences with similar representations in current pretrained language models differ sharply in factuality and truth value (e.g., “Newton was born in 1641” vs. “Newton was born in 1642”). Further architectural innovations in deep learning seem necessary to deal with such discrete phenomena. (ii) We found that PLMs have difficulty distinguishing “informed” best guesses (based on information extracted from training corpora) from “random” best guesses (made in the absence of any evidence in the training corpora). This implies that better confidence assessment of PLM predictions is needed. (iii) Our premise was that we should emulate human language processing and that therefore tasks that are easy for humans are good tests for NLP models. To the extent this is true, the two phenomena we have investigated in this paper – that PLMs seem to ignore negation in many cases and that they are easily confused by simple distractors – seem to be good vehicles for encouraging the development of PLMs whose performance on NLP tasks is closer to humans.

Acknowledgements. We thank the reviewers for their constructive criticism. This work was funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A and by the European Research Council (Grant No. 740516). The authors of this work take full responsibility for its content.

References

Appendix

We use source code provided by Petroni et al. (2019) (github.com/facebookresearch/LAMA). T-REx, parts of ConceptNet and SQuAD allow multiple true answers (N-M). To ensure single true objects for Google-RE, we reformulate the templates asking for location to specifically ask for cities (e.g., “born in [MASK]” to “born in the city of [MASK]”). We do not change any other templates. T-REx still queries for “died in [MASK]”.

For ConceptNet we extract an easy-to-negate subset. The final subset includes 2,996 of the 11,458 samples. We proceed as follows:

1. We only negate sentences whose token sequence length is at most 4 or that match one of the following patterns: “is a type of ”, “is made of”, “is part of”, “are made of.”, “can be made of”, “are a type of ”, “are a part off”.

2. The selected subset is automatically negated by a manually created verb negation dictionary.
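The two steps can be sketched as follows. The pattern list is copied from step 1 as printed, with surrounding whitespace and the trailing period stripped. The entries of `NEGATIONS` are hypothetical examples, since the manually created verb negation dictionary is not reproduced here, and whitespace tokenization is an assumption.

```python
from typing import Optional

# Patterns from step 1 (copied as printed, whitespace/period stripped).
PATTERNS = ["is a type of", "is made of", "is part of", "are made of",
            "can be made of", "are a type of", "are a part off"]

# Hypothetical entries standing in for the manual verb negation dictionary.
NEGATIONS = {"is": "is not", "are": "are not", "can": "cannot"}


def is_easy_to_negate(sentence: str, max_tokens: int = 4) -> bool:
    """Step 1: keep short sentences or sentences matching a pattern."""
    short = len(sentence.split()) <= max_tokens
    return short or any(pattern in sentence for pattern in PATTERNS)


def negate(sentence: str) -> Optional[str]:
    """Step 2: negate the first verb found in the dictionary, if any."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token in NEGATIONS:
            tokens[i] = NEGATIONS[token]
            return " ".join(tokens)
    return None  # no known verb: the sentence is left out


print(negate("Birds can [MASK]."))
```

Returning `None` for sentences without a dictionary verb keeps the negation step conservative: such sentences are skipped instead of being negated incorrectly.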

Details on misprimed LAMA

To investigate the effect of distance between the prime and the cloze question, we insert a concatenation of up to 20 “neutral” sentences. The longest sequence has 89 byte pair encodings. Even with all 20 sentences concatenated, the increased distance did not lessen the effect of the prime much. The sentences used are: “This is great.”, “This is interesting.”, “Hold this thought.”, “What a surprise.”, “Good to know.”, “Pretty awesome stuff.”, “Nice seeing you.”, “Let’s meet again soon.”, “This is nice.”, “Have a nice time.”, “That is okay.”, “Long time no see.”, “What a day.”, “Wonderful story.”, “That’s new to me.”, “Very cool.”, “Till next time.”, “That’s enough.”, “This is amazing.”, “I will think about it.”

Details on the balanced corpus

We pretrain BERT-base from scratch on a corpus of equally many negative and positive sentences. We concatenate multiple copies of the same training data into one training file to compensate for the small amount of data. Hyperparameters for pretraining are listed in Table 6. The full vocabulary has 349 tokens.

Figure 1 shows that training loss and test accuracy are uncorrelated. Test accuracy stagnates around 0.5, which is no better than random guessing, since for each relation half of the adjectives hold.
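The chance level can be written out explicitly. One reading of “half of the adjectives hold” is that, for a given cloze question, 10 of the 20 adjectives in $\mathcal{A}$ are valid fillers; assuming in addition that a random guess is drawn uniformly from $\mathcal{A}$:

```latex
P(\text{guess is a valid filler})
  = \frac{|\{a \in \mathcal{A} : a \text{ holds}\}|}{|\mathcal{A}|}
  = \frac{10}{20}
  = 0.5
```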

We finetune on the task of classifying sentences as true/false. We concatenate multiple copies of the same training data into one training file to compensate for the small amount of data. Hyperparameters for finetuning are listed in Table 7.

We use source code provided by Wolf et al. (2019) (github.com/huggingface/transformers).