Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers

Anne Lauscher, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, Goran Glavaš

Introduction

Self-supervised neural models like ELMo Peters et al. (2018), BERT Devlin et al. (2019); Liu et al. (2019b), GPT Radford et al. (2018, 2019), or XLNet Yang et al. (2019) have rendered language modeling a very suitable pretraining task for learning language representations that are useful for a wide range of language understanding tasks Wang et al. (2018, 2019). Although shown versatile w.r.t. the types of knowledge Rogers et al. (2020) they encode, much like their predecessors – static word embedding models Mikolov et al. (2013); Pennington et al. (2014) – neural LMs still only “consume” the distributional information from large corpora. Yet, a number of structured knowledge sources exist – knowledge bases (KBs) Suchanek et al. (2007); Auer et al. (2007) and lexico-semantic networks Miller (1995); Liu and Singh (2004); Navigli and Ponzetto (2010) – encoding many types of knowledge that are underrepresented in text corpora.

Starting from this observation, most recent efforts focused on injecting factual Zhang et al. (2019); Liu et al. (2019a); Peters et al. (2019) and linguistic knowledge Lauscher et al. (2019); Peters et al. (2019) into pretrained LMs and demonstrated the usefulness of such knowledge in language understanding tasks Wang et al. (2018, 2019). Joint pretraining models, on the one hand, augment distributional LM objectives with additional objectives based on external resources Yu and Dredze (2014); Nguyen et al. (2016); Lauscher et al. (2019) and train the extended model from scratch. For models like BERT, this implies computationally expensive retraining from scratch of the encoding transformer network. Post-hoc fine-tuning models Zhang et al. (2019); Liu et al. (2019a); Peters et al. (2019), on the other hand, use the objectives based on external resources to fine-tune the encoder’s parameters, pretrained via distributional LM objectives. If the amount of fine-tuning data is substantial, however, this approach may lead to catastrophic forgetting of distributional knowledge obtained in pretraining Goodfellow et al. (2014); Kirkpatrick et al. (2017).

In this work, similar to the concurrent work of Wang et al. (2020), we turn to the recently proposed adapter-based fine-tuning paradigm Rebuffi et al. (2018); Houlsby et al. (2019), which remedies the shortcomings of both joint pretraining and standard post-hoc fine-tuning. Adapter-based training injects additional parameters into the encoder and only tunes their values: original transformer parameters are kept fixed. Because of this, adapter training preserves the distributional information obtained in LM pretraining, without the need for any distributional (re-)training. While Wang et al. (2020) inject factual knowledge from Wikidata Vrandečić and Krötzsch (2014) into BERT, in this work, we investigate two resources that are commonly assumed to contain general-purpose and common sense knowledge (our results in §3.2 scrutinize this assumption): ConceptNet Liu and Singh (2004); Speer et al. (2017) and the Open Mind Common Sense (OMCS) corpus Singh et al. (2002), from which the ConceptNet graph was (semi-)automatically extracted. For our first model, dubbed CN-Adapt, we first create a synthetic corpus by randomly traversing the ConceptNet graph and then learn adapter parameters with masked language modelling (MLM) training Devlin et al. (2019) on that synthetic corpus. For our second model, named OM-Adapt, we learn the adapter parameters via MLM training directly on the OMCS corpus.

We evaluate both models on the GLUE benchmark, where we observe limited improvements over BERT on a subset of GLUE tasks. However, a more detailed inspection reveals large improvements over the base BERT model (up to 20 Matthews correlation points) on language inference (NLI) subsets labeled as requiring World Knowledge or knowledge about Named Entities. Investigating further, we relate this result to the fact that ConceptNet and OMCS contain much more of what downstream tasks consider to be factual world knowledge than of what they judge to be common sense knowledge. Our findings pinpoint the need, within the emerging body of work on injecting knowledge into pretrained transformers, for more detailed analyses of the compatibility between (1) the types of knowledge contained in external resources and (2) the types of knowledge that benefit concrete downstream tasks.

Knowledge Injection Models

In this work, we primarily aim to investigate whether injecting specific types of knowledge (given in the external resource) benefits downstream inference that clearly requires those exact types of knowledge. Because of this, we use the arguably most straightforward mechanisms for injecting the ConceptNet and OMCS information into BERT, and leave the exploration of potentially more effective knowledge injection objectives for future work. We inject the external information into adapter parameters of the adapter-augmented BERT Houlsby et al. (2019) via BERT’s natural objective – masked language modelling (MLM). OMCS, already a corpus in natural language, can be used for MLM training directly; we only filtered out non-English sentences. To subject ConceptNet to MLM training, we need to transform it into a synthetic corpus.
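To make the MLM objective concrete, the following is a minimal sketch of the input corruption step. It assumes the standard masking recipe of Devlin et al. (2019): 15% of the tokens are selected for prediction, of which 80% are replaced by a mask symbol, 10% by a random vocabulary token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are hypothetical and serve illustration only; the exact masking settings used for adapter training are not stated in this section.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt a token sequence for MLM training.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model has to predict and None everywhere else.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, nothing to predict.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: mask symbol
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(tok)                # 10%: unchanged
    return corrupted, labels

# Hypothetical toy input (whitespace tokens instead of wordpieces).
sentence = "christianity is part of religion .".split()
toy_vocab = ["alcoholism", "stigma", "car", "sweden"]
corrupted, labels = mask_tokens(sentence, toy_vocab, random.Random(0))
```

In adapter training, the prediction loss at the selected positions would update only the adapter parameters, with the original transformer parameters kept fixed.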

Following established previous work (Perozzi et al., 2014; Ristoski and Paulheim, 2016), we induce a synthetic corpus from ConceptNet by randomly traversing its graph. We convert relation strings into NL phrases (e.g., “synonyms” to “is a synonym of”) and duplicate the object node of a triple, using it as the subject for the next sentence. For example, from the path “alcoholism $\xrightarrow{\mathit{causes}}$ stigma $\xrightarrow{\textit{hasContext}}$ christianity $\xrightarrow{\textit{partOf}}$ religion” we create the text “alcoholism causes stigma. stigma is used in the context of christianity. christianity is part of religion.”. We set the walk length to $30$ relations and sample the starting and neighboring nodes from uniform distributions. In total, we performed 2,268,485 walks, resulting in a corpus of 34,560,307 synthetic sentences.
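The corpus induction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the graph representation (a dictionary from a node to its outgoing edges), the `RELATION_PHRASES` table, the function name `random_walk_text`, and the choice to stop a walk at a node without outgoing edges are all hypothetical. Only the three relation phrasings from the example path are included.

```python
import random

# Hypothetical relation-to-phrase table; only the phrasings from the
# example path in the text are listed here.
RELATION_PHRASES = {
    "causes": "causes",
    "hasContext": "is used in the context of",
    "partOf": "is part of",
}

def random_walk_text(graph, walk_length, rng, start=None):
    """Turn one random walk into a sequence of synthetic sentences.

    graph maps a node to a list of (relation, object) edges.
    """
    # Uniformly sampled start node unless one is given explicitly.
    node = start if start is not None else rng.choice(sorted(graph))
    sentences = []
    for _ in range(walk_length):
        edges = graph.get(node)
        if not edges:
            break  # assumption: a walk ends early at a dead end
        relation, obj = rng.choice(edges)  # uniformly sampled neighbor
        sentences.append(f"{node} {RELATION_PHRASES[relation]} {obj}.")
        node = obj  # the object becomes the subject of the next sentence
    return " ".join(sentences)

toy_graph = {
    "alcoholism": [("causes", "stigma")],
    "stigma": [("hasContext", "christianity")],
    "christianity": [("partOf", "religion")],
}
text = random_walk_text(toy_graph, 30, random.Random(0), start="alcoholism")
# text == "alcoholism causes stigma. stigma is used in the context of
#          christianity. christianity is part of religion."
```

On the toy graph every node has a single outgoing edge, so the walk reproduces the example text regardless of the random seed; on the full graph, each call would yield a different path.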

Adapter-Based Training.

Evaluation

We first briefly describe the downstream tasks and training details, and then proceed with the discussion of results obtained with our adapter models.

We evaluate BERT and our two adapter-based models, CN-Adapt and OM-Adapt, with injected knowledge from ConceptNet and OMCS, respectively, on the tasks from the GLUE benchmark (Wang et al., 2018):

CoLA Warstadt et al. (2018): Binary sentence classification, predicting grammatical acceptability of sentences from linguistic publications;

SST-2 Socher et al. (2013): Binary sentence classification, predicting binary sentiment (positive or negative) for movie review sentences;

MRPC Dolan and Brockett (2005): Binary sentence-pair classification, recognizing sentences which are mutual paraphrases;

STS-B Cer et al. (2017): Sentence-pair regression task, predicting the degree of semantic similarity for a given pair of sentences;

QQP Chen et al. (2018): Binary classification task, recognizing question paraphrases;

MNLI Williams et al. (2018): Ternary natural language inference (NLI) classification of sentence pairs. Two test sets are given: a matched version (MNLI-m) in which the test domains match the domains from training data, and a mismatched version (MNLI-mm) with different test domains;

QNLI: A binary classification version of the Stanford Q&A dataset (Rajpurkar et al., 2016);

RTE Bentivogli et al. (2009): Another NLI dataset, binary entailment classification for sentence pairs;

Diag Wang et al. (2018): A manually curated NLI dataset, with examples labeled with specific types of knowledge needed for entailment decisions.

Training Details.

We inject our adapters into a BERT Base model ($12$ transformer layers with $12$ attention heads each; $H=768$) pretrained on lowercased corpora. Following Houlsby et al. (2019), we set the size of all adapters to $m=64$ and use GELU Hendrycks and Gimpel (2016) as the adapter activation $f$. We train the adapter parameters with the Adam algorithm Kingma and Ba (2015) (initial learning rate set to $10^{-4}$, with $10000$ warm-up steps and a weight decay factor of $0.01$). In downstream fine-tuning, we train in batches of size $16$ and limit the input sequences to $T=128$ wordpiece tokens. For each task, we find the optimal hyperparameter configuration from the following grid: learning rate $l\in\{2\cdot 10^{-5},3\cdot 10^{-5}\}$, number of epochs $n\in\{3,4\}$.
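To illustrate what a single adapter with these settings computes, the sketch below follows the bottleneck design of Houlsby et al. (2019): a down-projection from $H$ to $m$ dimensions, the activation $f$ (GELU), an up-projection back to $H$ dimensions, and a residual connection. It is written in plain Python for readability. The names `init_adapter`, `adapter_forward`, and `num_parameters`, as well as the initialization scale, are hypothetical, and the placement of adapters inside each transformer layer is not shown.

```python
import math
import random

H, M = 768, 64  # hidden size H and adapter size m from the training details

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def init_adapter(rng, hidden=H, bottleneck=M, scale=1e-3):
    """Near-zero weights, so the adapter starts close to the identity.

    The scale value is an assumption made for this sketch.
    """
    return {
        "W_down": [[rng.gauss(0.0, scale) for _ in range(hidden)]
                   for _ in range(bottleneck)],
        "b_down": [0.0] * bottleneck,
        "W_up": [[rng.gauss(0.0, scale) for _ in range(bottleneck)]
                 for _ in range(hidden)],
        "b_up": [0.0] * hidden,
    }

def adapter_forward(h, p):
    """Compute h + W_up * f(W_down * h + b_down) + b_up for one vector h."""
    z = [gelu(sum(w * x for w, x in zip(row, h)) + b)
         for row, b in zip(p["W_down"], p["b_down"])]
    up = [sum(w * a for w, a in zip(row, z)) + b
          for row, b in zip(p["W_up"], p["b_up"])]
    return [x + u for x, u in zip(h, up)]  # residual connection

def num_parameters(hidden=H, bottleneck=M):
    """Two projection matrices plus their bias vectors."""
    return 2 * hidden * bottleneck + bottleneck + hidden

print(num_parameters())  # 99136 trainable parameters per adapter
```

With $H=768$ and $m=64$, one adapter adds $2 \cdot 768 \cdot 64 + 64 + 768 = 99{,}136$ parameters, a small number compared to the frozen transformer parameters.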

Results and Analysis

Table 1 reveals the performance of CN-Adapt and OM-Adapt in comparison with BERT Base on GLUE evaluation tasks. We show the results for two snapshots of OM-Adapt, after 25K and 100K update steps, and for two snapshots of CN-Adapt, after 50K and 100K steps of adapter training. Overall, none of our adapter-based models with injected external knowledge from ConceptNet or OMCS yields significant improvements over BERT Base on GLUE. However, we observe substantial improvements (of around 3 points) on RTE and on the Diagnostics NLI dataset (Diag), which encompasses inference instances that require a specific type of knowledge.

Since our adapter models draw specifically on the conceptual knowledge encoded in ConceptNet and OMCS, we expect the positive impact of injected external knowledge – assuming effective injection – to be most observable on test instances that target the same types of conceptual knowledge. To investigate this further, we measure the model performance across different categories of the Diagnostic NLI dataset. This allows us to tease apart inference instances which truly test the efficacy of our knowledge injection methods. We show the results obtained on different categories of the Diagnostic NLI dataset in Table 2.

The improvements of our adapter-based models over BERT Base on these phenomenon-specific subsections of the Diagnostics NLI dataset are generally much more pronounced: e.g., OM-Adapt (25K) yields a 7% improvement on inference that requires factual or common sense knowledge (KNO), whereas CN-Adapt (100K) yields a 6% boost for inference that depends on lexico-semantic knowledge (LS). These results suggest that (1) ConceptNet and OMCS do contain the specific types of knowledge required for these inference categories and that (2) we managed to inject that knowledge into BERT by training adapters on these resources.

Fine-Grained Knowledge Type Analysis.

In our final analysis, we “zoom in” on our models’ performance on three fine-grained categories of the Diagnostics NLI dataset – inference instances that require Common Sense Knowledge (CS), World Knowledge (World), and knowledge about Named Entities (NE), respectively. The results for these fine-grained categories are given in Table 3.

These results show an interesting pattern: our adapter-based knowledge-injection models massively outperform BERT Base (up to 15 and 21 MCC points, respectively) for NLI instances labeled as requiring World Knowledge or knowledge about Named Entities. In contrast, we see drops in performance on instances labeled as requiring common sense knowledge. This initially came as a surprise, given the common belief that OMCS and ConceptNet contain so-called common sense knowledge. Manual scrutiny of the diagnostic test instances from both the CS and World categories uncovers a noticeable mismatch between the kind of information that is considered common sense in KBs like ConceptNet and what is considered common sense knowledge in downstream tasks. In fact, the majority of information present in ConceptNet and OMCS falls under the World Knowledge definition of the Diagnostic NLI dataset, including factual geographic information (stockholm [partOf] sweden), domain knowledge (roadster [isA] car) and specialized terminology (indigenous [synonymOf] aboriginal).

In contrast, many of the CS inference instances require complex, high-level reasoning, understanding metaphorical and idiomatic meaning, and making far-reaching connections. In such cases, explicit conceptual links often do not suffice for a correct inference and much of the required knowledge is not explicitly encoded in the external resources. We display NLI Diagnostics examples from the World Knowledge and Common Sense categories in Table 4. Consider, e.g., the following CS NLI instance: [premise: My jokes fully reveal my character ; hypothesis: If everyone believed my jokes, they’d know exactly who I was ; entailment]. While ConceptNet and OMCS may associate character with personality or personality with identity, the knowledge that the phrase who I was may refer to identity is beyond the explicit knowledge present in these resources. This sheds light on the results in Table 3: when the knowledge required to tackle the inference problem at hand is available in the external resource, our adapter-based knowledge-injected models significantly outperform the baseline transformer; otherwise, the benefits of knowledge injection are negligible or non-existent. The promising results on the world knowledge and named entities portions of the Diagnostics dataset suggest that our methods do successfully inject external information into the pretrained transformer and that the presence of the required knowledge for the task in the external resources is an obvious prerequisite.
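The differences on the Diagnostics categories discussed above are expressed in Matthews correlation (MCC) points. As a reference for how such a score can be computed, the following is a minimal sketch of the multiclass generalization of the Matthews correlation coefficient, which reduces to the familiar binary MCC for two classes. We assume here that one MCC point corresponds to 0.01 of the coefficient; the function name and the toy labels are hypothetical.

```python
import math
from collections import Counter

def matthews_corrcoef(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    With s samples, c correct predictions, t_k true and p_k predicted
    occurrences of class k:
        MCC = (c*s - sum_k t_k*p_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts, p_counts = Counter(y_true), Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    var_true = s * s - sum(t_counts[k] ** 2 for k in labels)
    var_pred = s * s - sum(p_counts[k] ** 2 for k in labels)
    denominator = math.sqrt(var_true * var_pred)
    # Convention: return 0 when one side is constant (zero denominator).
    return numerator / denominator if denominator else 0.0

# Hypothetical toy labels for a three-way NLI decision.
gold = ["entailment", "neutral", "contradiction", "entailment"]
pred = ["entailment", "neutral", "neutral", "entailment"]
score = matthews_corrcoef(gold, pred)
```

Unlike accuracy, the coefficient is 0 for a model that always predicts the same class, which makes it a suitable measure for the small and unevenly distributed diagnostic categories.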

Conclusion

We presented two simple strategies for injecting external knowledge from ConceptNet and the OMCS corpus, respectively, into BERT via bottleneck adapters. Additional adapter parameters store the external knowledge and allow for the preservation of the rich distributional knowledge acquired in BERT’s pretraining in the original transformer parameters. We demonstrated the effectiveness of these models in language understanding settings that require precisely the type of knowledge that one finds in ConceptNet and OMCS, in which our adapter-based models outperform BERT by up to 20 performance points. Our findings stress the importance of detailed analyses that compare (a) the types of knowledge found in the external resources being injected against (b) the types of knowledge that a concrete downstream reasoning task requires. We hope this work motivates further research effort in the direction of fine-grained knowledge typing, both of explicit knowledge in external resources and of the implicit knowledge stored in pretrained transformers.

Acknowledgments

Anne Lauscher and Goran Glavaš are supported by the Eliteprogramm of the Baden-Württemberg Stiftung (AGREE grant). Leonardo F. R. Ribeiro has been supported by the German Research Foundation as part of the Research Training Group AIPHES under the grant No. GRK 1994/1. This work has been supported by the German Research Foundation within the project “Open Argument Mining” (GU 798/25-1), associated with the Priority Program “Robust Argumentation Machines (RATIO)” (SPP-1999). The work of Olga Majewska was conducted under the research lab of Wluper Ltd. (UK/ 10195181).