Learning How to Ask: Querying LMs with Mixtures of Soft Prompts

Guanghui Qin, Jason Eisner

Introduction

Pretrained language models, such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and BART (Lewis et al., 2020a), have proved to provide useful representations for other NLP tasks. Recently, Petroni et al. (2019) and Jiang et al. (2020) demonstrated that language models (LMs) also contain factual and commonsense knowledge that can be elicited with a prompt. For example, to query the date-of-birth of Mozart, we can use the prompt “Mozart was born in ___,” where we have filled the first blank with “Mozart,” and ask a cloze language model to fill in the second blank. The prompts used by Petroni et al. (2019) are manually created, while Jiang et al. (2020) use mining and paraphrasing based methods to automatically augment the prompt sets.

Finding out what young children know is difficult because they can be very sensitive to the form of the question (Donaldson, 1978). Opinion polling is also sensitive to question design (Broughton, 1995). We observe that when we are querying an LM rather than a human, we have the opportunity to tune prompts using gradient descent, the workhorse of modern NLP, so that they better elicit the desired type of knowledge.

A neural LM sees the prompt as a sequence of continuous word vectors (Baroni et al., 2014). We tune in this continuous space, relaxing the constraint that the vectors be the embeddings of actual English words. Allowing “soft prompts” consisting of “soft words” is not only convenient for optimization, but is also more expressive. Soft prompts can emphasize particular words (by lengthening their vectors) or particular dimensions of those words. They can also adjust words that are misleading, ambiguous, or overly specific. Consider the following prompt for the relation date-of-death:

___ performed until his death in ___ .

This prompt may work for the male singer Cab Calloway, but if we want it to also work for the female painter Mary Cassatt, it might help to soften “performed” and “his” so that they do not insist on the wrong occupation and gender, and perhaps to soften “until” into a weaker connective (as Cassatt was in fact too blind to paint in her final years).

Another way to bridge between these cases is to have one prompt using “performed” and another using “painted.” In general, there may be many varied lexical patterns that signal a particular relation, and having more patterns will get better coverage (Hearst, 1992; Riloff and Jones, 1999). We therefore propose to learn a mixture of soft prompts.

We test the idea on several cloze language models, training prompts to complete factual and common sense relations from 3 datasets. Comparing on held-out examples, our method dramatically outperforms previous work, even when initialized randomly. So when regarded as approximate knowledge bases, language models know more than we realized. We just had to find the right ways to ask.

Related Work

Factual knowledge is traditionally extracted from large corpora using a pipeline of NLP tools (Surdeanu and Ji, 2014), including entity extraction (Lample et al., 2016), entity linking (Rao et al., 2013) and relation extraction (Sorokin and Gurevych, 2017).

However, recent work has shown that simply training a system to complete sentences—language modeling—causes it to implicitly acquire non-linguistic abilities from its training corpora (Rogers et al., 2020), including factual knowledge (Petroni et al., 2019; Jiang et al., 2020), common sense (Bisk et al., 2019), reasoning (Talmor et al., 2020; Brown et al., 2020), summarization (Radford et al., 2019), and even arithmetic (Bouraoui et al., 2020).

Most of the previous work manually creates prompts to extract answers from the trained language model. We use LAMA (Petroni et al., 2019) as a baseline. Building on LAMA, the LM Prompt And Query Archive (LPAQA) method (Jiang et al., 2020) searches for new prompts by either mining a corpus or paraphrasing existing prompts. AutoPrompt (Shin et al., 2020) searches for improved prompts using a gradient signal, although its prompts are limited to sequences of actual (“hard”) English words, unlike our method. We compare our novel soft prompts against all of these systems.

After we submitted the present paper in November 2020, two still unpublished manuscripts appeared on arXiv that also investigated soft prompts. Li and Liang (2021) considered the setting of generating text from a pretrained language model (GPT-2 or BART) conditioned on a textual prompt. To improve the results, they prepended a few task-specific “soft tokens” to the prompt and tuned the embeddings of only these tokens (at all embedding layers). Liu et al. (2021) adopted a strategy similar to ours by tuning fill-in-the-blank prompts in a continuous space, testing on GPT-2 and BERT models, although they did not use the enhancements we proposed in §§ 3.2–3.4 below. Like our work, both these papers achieved strong gains.

In other work, Bouraoui et al. (2020) mine prompts from a corpus, then fine-tune the whole language model so that it more accurately completes the prompts. Schick and Schütze (2020a, b) are similar but fine-tune the language model differently for each prompt. Our method complements these by tuning the prompts themselves.

“Probing” systems that ask what language models know about particular sentences (e.g., Eichler et al., 2019) usually use feedforward networks rather than further natural-language prompts. Yet Shin et al. (2020) show how to use natural-language prompts to ask about particular sentences. Our method could potentially be applied to those prompts, or to “few-shot learning” prompts that include input-output examples (Brown et al., 2020).

Method

Our experiments will specifically aim at extracting relational knowledge from language models. We are given a fixed pretrained LM, a specific binary relation $r$ such as date-of-death, and a training dataset $\mathcal{E}_{r}$ consisting of known $(x,y)$ pairs in $r$, such as (Mary Cassatt, 1926). We will then train a system to predict $y$ from $x$, and evaluate it on held-out $(x,y)$ pairs of the same relation.

We can initialize the vectors of a soft prompt to match the word embeddings of a given hard prompt. (Each token of a hard prompt may be a word, subword, or punctuation mark, according to the tokenization procedure used by the LM.) However, we can then tune the vectors continuously. We do not change the number of vectors or their positions. For the prompt shown above, we have a $6d$-dimensional search space, where $d$ is the dimensionality of each word vector.
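
The initialization step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy embedding table, the dimensionality `d = 4`, the six-token hard prompt (built around the words "performed", "until", and "his" discussed in the Introduction), and the helper name `init_soft_prompt` are all assumptions made for the example.

```python
import random

# Hypothetical toy embedding table: token -> d-dimensional vector (d = 4 here).
d = 4
rng = random.Random(0)
vocab = ["performed", "until", "his", "death", "in", "."]
embedding = {tok: [rng.gauss(0.0, 1.0) for _ in range(d)] for tok in vocab}


def init_soft_prompt(hard_tokens, embedding):
    """Copy the hard prompt's embeddings so they can be tuned independently."""
    return [list(embedding[tok]) for tok in hard_tokens]


# Assumed tokenization of the hard prompt (blanks for x and y are not tuned).
hard_prompt = ["performed", "until", "his", "death", "in", "."]
soft_prompt = init_soft_prompt(hard_prompt, embedding)

# Tuning moves the vectors off the embedding table. A single hand-made step
# stands in for a gradient update on the vector that started as "his".
soft_prompt[2][0] += 0.1

# The number of vectors and their positions stay fixed, so the search space
# has (number of tokens) * d = 6d dimensions.
num_params = len(soft_prompt) * d
```

Because the soft prompt is a copy, the LM's own embedding table is left untouched, which matches the setting of a fixed pretrained LM.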

3.2 Deeply Perturbed Prompts

Perturbing only layer 0 is equivalent to tuning $v_{i}$ directly as in § 3.1. However, if we are more aggressive and perturb all layers, we now have $6d\cdot(L+1)$ parameters to tune a 6-token prompt. The perturbations ($\Delta$ vectors) can be kept small through early stopping or some other form of regularization. Our intuition is that small perturbations will yield more “familiar” activation patterns that are similar to those that the LM was originally trained on. (Li and Liang (2021) tried a rather different approach to preventing overfitting when tuning all layers.)
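
As a worked instance of the parameter counts above, the sketch below evaluates $6d$ and $6d\cdot(L+1)$ for two common model sizes. The sizes are assumptions about the standard BERT configurations (BERT-base with $d=768$, $L=12$; BERT-large with $d=1024$, $L=24$), and the function name is hypothetical.

```python
def num_tunable_params(num_tokens, d, num_layers, deep):
    """Count the tunable numbers for one prompt.

    deep=False: only the layer-0 vectors are tuned (num_tokens * d).
    deep=True: one perturbation vector per token at each of layers 0..L.
    """
    perturbed_layers = num_layers + 1 if deep else 1
    return num_tokens * d * perturbed_layers


# Assumed sizes: BERT-base has d = 768, L = 12; BERT-large has d = 1024, L = 24.
shallow_base = num_tunable_params(6, 768, 12, deep=False)
deep_base = num_tunable_params(6, 768, 12, deep=True)
deep_large = num_tunable_params(6, 1024, 24, deep=True)
print(shallow_base, deep_base, deep_large)  # 4608 59904 153600
```

Even in the deep setting, a single 6-token prompt adds on the order of $10^5$ tunable numbers under these assumed sizes.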

3.3 Mixture Modeling

Given a set $\mathcal{T}_{r}$ of soft prompts for relation $r$, we can define the ensemble predictive distribution

$$p(y\mid x,r)=\sum_{\mathbf{t}\in\mathcal{T}_{r}}p(\mathbf{t}\mid r)\cdot p_{\mathrm{LM}}(y\mid\mathbf{t},x)\tag{1}$$

where the learned mixture weights $p(\mathbf{t}\mid r)$ form a distribution over the soft prompts $\mathbf{t}\in\mathcal{T}_{r}$. Ensembling techniques other than mixture-of-experts could also be used, including product-of-experts (Jiang et al., 2020).
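
The mixture-of-experts combination can be illustrated with a few lines of code. This is only a sketch: the per-prompt answer distributions and the weights below are made-up numbers, and `mixture_predict` is a hypothetical helper, not part of any released code.

```python
def mixture_predict(prompt_dists, weights):
    """Combine per-prompt answer distributions with mixture weights.

    prompt_dists: one dict {answer: probability} per soft prompt.
    weights: mixture weights p(t | r), assumed to sum to 1.
    """
    combined = {}
    for w, dist in zip(weights, prompt_dists):
        for answer, prob in dist.items():
            combined[answer] = combined.get(answer, 0.0) + w * prob
    return combined


# Made-up distributions from two prompts for the query x = "Mary Cassatt".
dists = [
    {"1926": 0.7, "1927": 0.3},
    {"1926": 0.2, "1927": 0.8},
]
weights = [0.75, 0.25]

ensemble = mixture_predict(dists, weights)
best_answer = max(ensemble, key=ensemble.get)
```

Here the more heavily weighted prompt dominates, so the ensemble prefers "1926" even though the second prompt alone would have chosen "1927".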

3.4 Data-Dependent Mixture Modeling

As an extension, we can replace the mixture weights $p(\mathbf{t}\mid r)$ with $p(\mathbf{t}\mid r,x)$, to allow the model to select prompts that are appropriate for the given $x$. For example, a plural noun $x$ might prefer prompts $\mathbf{t}$ that use a plural verb.
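
One way to picture data-dependent weights is the sketch below. The exact parameterization is not given in the text above, so the form used here is an assumption: each prompt's weight is its prior $p(\mathbf{t}\mid r)$ multiplied by a hypothetical compatibility score for $x$ raised to the power $1/T$, then renormalized. The form was chosen so that a large temperature $T$ removes the dependence on $x$, which is the behavior reported in the results.

```python
import math


def data_dependent_weights(prior, compat, temperature):
    """Return weights proportional to prior[t] * compat[t] ** (1 / temperature).

    prior: mixture weights p(t | r), all positive.
    compat: hypothetical scores in (0, 1] for how well x fits each prompt.
    temperature: T > 0; a large T flattens the influence of compat.
    """
    logits = [math.log(p) + math.log(c) / temperature
              for p, c in zip(prior, compat)]
    top = max(logits)                      # subtract the max for stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


sharp = data_dependent_weights([0.5, 0.5], [0.9, 0.1], temperature=1.0)
flat = data_dependent_weights([0.7, 0.3], [0.9, 0.1], temperature=1e6)
```

With `temperature=1.0` the weights follow the compatibility scores, while with a very large temperature they fall back to the prior.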

3.5 Training Objective

Given an initial set of prompts $\mathcal{T}_{r}$, we jointly optimize the soft prompts $\mathbf{t}\in\mathcal{T}_{r}$ and their mixture weights $p(\mathbf{t}\mid r)$ (and $\log T$ in § 3.4) to minimize the log-loss of the predictive distribution (1):

$$\sum_{(x,y)\in\mathcal{E}_{r}}-\log\sum_{\mathbf{t}\in\mathcal{T}_{r}}p(\mathbf{t}\mid r)\cdot p_{\mathrm{LM}}(y\mid\mathbf{t},x)\tag{2}$$

This is a continuous and differentiable objective whose gradient can be computed by back-propagation. It can be locally minimized by gradient descent (using a softmax parameterization of the mixture weights). Equivalently, it can be locally minimized by the EM algorithm: the E step finds a posterior distribution over latent prompts for each $(x,y)$ example, and the M step performs gradient descent to optimize the prompts in that mixture.
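
The link between gradient descent and the E step can be made concrete with a simplified sketch. It tunes only the mixture weights, holding the per-prompt likelihoods fixed at made-up values, whereas the full method would also back-propagate into the soft prompt vectors through the LM. The learning rate, the number of steps, and all function names are assumptions for illustration.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posterior(weights, row):
    """E step: posterior over prompts for one (x, y) example."""
    joint = [w * lik for w, lik in zip(weights, row)]
    total = sum(joint)
    return [j / total for j in joint]


def log_loss(logits, likelihoods):
    """Log-loss of the mixture; likelihoods[n][t] is prompt t's probability of y_n."""
    weights = softmax(logits)
    return -sum(math.log(sum(w * lik for w, lik in zip(weights, row)))
                for row in likelihoods)


def gradient(logits, likelihoods):
    """Gradient of log_loss w.r.t. the softmax logits: sum of (weight - posterior)."""
    weights = softmax(logits)
    grad = [0.0] * len(logits)
    for row in likelihoods:
        post = posterior(weights, row)
        for t in range(len(logits)):
            grad[t] += weights[t] - post[t]
    return grad


# Made-up likelihoods for three training examples under two prompts.
likelihoods = [[0.8, 0.1], [0.7, 0.2], [0.6, 0.3]]
logits = [0.0, 0.0]                       # uniform initial mixture weights
loss_before = log_loss(logits, likelihoods)
for _ in range(100):                      # plain gradient descent, step size 0.5
    grad = gradient(logits, likelihoods)
    logits = [v - 0.5 * g for v, g in zip(logits, grad)]
loss_after = log_loss(logits, likelihoods)
final_weights = softmax(logits)
```

The gradient with respect to each logit is the mixture weight minus the posterior responsibility, so each descent step moves the weights toward the E-step posteriors.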

Experiments

The relations we learn to predict are T-REx original (Elsahar et al., 2018), T-REx extended (Shin et al., 2020), Google-RE (Orr, 2013), and ConceptNet (Speer et al., 2017)—or rather, the subsets that were used by the LAMA and AutoPrompt papers. See Appendix A for some statistics.

4.2 Language Models

4.3 Dataset Splits

For the two T-REx datasets, we inherit the training-validation-test split from Shin et al. (2020). For the other datasets, we split randomly in the ratio 80-10-10. (The LAMA paper (Petroni et al., 2019) provided no split but used everything as test data for their zero-shot method.) Since all pairs $(x,y)$ are distinct, there are no common triples among these three sets. Common $x$ values are also rare because each dataset has at least 174 distinct $x$ values. However, the number of distinct $y$ values can be as small as 6. Thus, in another set of experiments (Appendix E), we used a more challenging split that ensures that there are no common $y$ values among these three sets. This tests whether our model generalizes to unseen $y$ values.
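
The two kinds of split can be sketched as below. The random 80-10-10 split follows the description directly. The $y$-disjoint split uses a greedy assignment of whole $y$ groups that is purely an assumption, since the text does not say how the harder split was constructed. The toy data and function names are hypothetical.

```python
import random


def random_split(pairs, seed=0):
    """Shuffle the (x, y) pairs and cut them 80-10-10."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = len(pairs) * 8 // 10
    n_dev = len(pairs) // 10
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])


def y_disjoint_split(pairs, seed=0, targets=(0.8, 0.1, 0.1)):
    """Assign every y value to exactly one split, aiming for the target ratio."""
    groups = {}
    for x, y in pairs:
        groups.setdefault(y, []).append((x, y))
    ys = sorted(groups)
    random.Random(seed).shuffle(ys)
    splits = ([], [], [])
    total = len(pairs)
    for y in ys:
        # Give the whole group to the split that is furthest below its target.
        deficits = [targets[i] * total - len(splits[i]) for i in range(3)]
        splits[deficits.index(max(deficits))].extend(groups[y])
    return splits


# Toy data: 200 pairs over 20 distinct y values.
pairs = [(f"x{i}", f"y{i % 20}") for i in range(200)]
train, dev, test = random_split(pairs)
hard_train, hard_dev, hard_test = y_disjoint_split(pairs)
```

In the second split, a model cannot succeed by memorizing which $y$ values occurred in training, because none of them reappear at test time.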

4.4 Prompts

For the T-REx and Google-RE datasets, we have four sources of initial prompts:

(sin.) LAMA provides a single manually created hard prompt for each relation type $r$.

(par.) LPAQA (Jiang et al., 2020) provides a set of 13–30 hard prompts for each $r$, which are paraphrases of the LAMA prompt. (The LPAQA system combines their predictions via a learned weighted product of experts.)

(min.) LPAQA also provides a set of 6–29 hard prompts for each $r$, based on text mining.

(ran.) For each (min.) prompt, we replace each word with a random vector, drawn from a Gaussian distribution fit to all of the LM’s word embeddings. The number of words and the position of the blanks are preserved.
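
The random initialization in (ran.) can be sketched as follows. The text does not say whether the fitted Gaussian has a full or a diagonal covariance, so the diagonal (per-dimension) form used here is an assumption, as are the toy embedding matrix and the function names.

```python
import random
import statistics


def fit_diagonal_gaussian(embeddings):
    """Fit an independent Gaussian to each embedding dimension."""
    columns = list(zip(*embeddings))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def random_soft_prompt(num_tokens, means, stds, rng):
    """Draw one random vector per token; the token count is preserved."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)]
            for _ in range(num_tokens)]


# Toy stand-in for the LM's word-embedding matrix (3 words, 2 dimensions).
toy_embeddings = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
means, stds = fit_diagonal_gaussian(toy_embeddings)
prompt = random_soft_prompt(6, means, stds, random.Random(0))
```

Sampling from a distribution fit to the real embeddings keeps the random vectors at a scale the LM is used to seeing.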

For the ConceptNet dataset, LAMA uses the gold Open Mind Common Sense (OMCS) dataset (Singh et al., 2002). In this dataset, each example $(x_{i},y_{i})$ is equipped with its own prompt $\mathbf{t}_{i}$. (Each example is really a sentence with two substrings marked as $x$ and $y$, which are removed to obtain $\mathbf{t}_{i}$.) These prompts are often overly specific: often $y_{i}$ can be predicted from $(\mathbf{t}_{i},x_{i})$, or just from $\mathbf{t}_{i}$ alone, but $y_{j}$ cannot be predicted from $(\mathbf{t}_{i},x_{j})$. Thus, for each relation $r$, we use only the prompts that appear more than 10 times, resulting in 1–38 prompts.
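
The frequency filter can be illustrated with a short sketch. The prompt strings and the helper name are invented for the example, and only the threshold ("more than 10 times") comes from the text.

```python
from collections import Counter


def frequent_prompts(example_prompts, min_count=10):
    """Keep prompts that occur strictly more than min_count times."""
    counts = Counter(example_prompts)
    return sorted(t for t, c in counts.items() if c > min_count)


# Invented per-example prompts for one relation.
example_prompts = (["___ is a type of ___ ."] * 11
                   + ["a ___ is a ___ ."] * 10
                   + ["my neighbor's ___ is a kind of ___ ."])
kept = frequent_prompts(example_prompts)
```

Only the first prompt survives here, since the second occurs exactly 10 times and the third occurs once.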

Statistics about the prompts are in Appendix B.

We used only a single copy of each prompt, but a generalization would be to allow multiple slightly perturbed copies of each prompt, which could diverge and specialize during training (Rose, 1998).

4.5 Training

We optimize equation 2 with the method introduced in § 3.5. We use the Adam optimizer (Kingma and Ba, 2015) with its default configuration. For gradient training, we set the batch size to 64 and the early-stopping patience to 4, and we test with the model that performs best on the dev set among 16 training epochs.
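
The model-selection procedure can be sketched as a generic loop. Two details are assumptions: "patience" is read as the number of consecutive epochs without a new best dev score, and the two callables stand in for the actual training and evaluation code, which is not shown in the text.

```python
def train_with_early_stopping(train_epoch, evaluate_dev,
                              max_epochs=16, patience=4):
    """Run up to max_epochs and return the state with the best dev score.

    train_epoch(epoch) -> state: hypothetical callable running one epoch.
    evaluate_dev(state) -> float: hypothetical dev metric, higher is better.
    """
    best_state, best_score, best_epoch = None, float("-inf"), None
    bad_epochs = 0
    for epoch in range(max_epochs):
        state = train_epoch(epoch)
        score = evaluate_dev(state)
        if score > best_score:
            best_state, best_score, best_epoch = state, score, epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                      # stop after `patience` bad epochs
    return best_state, best_score, best_epoch


# Simulated dev scores; the "state" is just the epoch index in this demo.
dev_scores = [0.10, 0.30, 0.20, 0.25, 0.28, 0.29, 0.50, 0.60]
result = train_with_early_stopping(lambda epoch: epoch,
                                   lambda state: dev_scores[state])
```

In the simulated run, training stops after four epochs fail to beat the score from epoch 1, so the later, higher scores are never reached.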

Training is fast. Even for our largest model (BERT-large-cased) and largest dataset (T-REx extended), tuning a single prompt completes within a few minutes. With a mixture of prompts, training scales roughly linearly with the number of prompts. It is still presumably much cheaper in time and memory than fine-tuning the entire BERT model, which must back-propagate a much larger set of gradients.

4.6 Metrics and Baselines

Our method outputs the most probable $y$ given $(r,x)$. Here and in the supplementary material, we report its average performance on all test examples, with precision-at-1 (P@1), precision-at-10 (P@10) and mean reciprocal rank (MRR) as metrics. We measure the improvement from tuning LAMA, LPAQA, and random prompts. We also compare with AutoPrompt. Baseline numbers come from prior papers or our reimplementations.
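
The three metrics can be computed from ranked candidate lists as in the sketch below. The ranked lists are invented, and the convention that a missing gold answer contributes a reciprocal rank of 0 is an assumption.

```python
def precision_at_k(ranked, gold, k):
    """1.0 if the gold answer is among the top k candidates, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the gold answer, or 0.0 if it is not in the list."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def evaluate(predictions):
    """Average P@1, P@10 and MRR over (ranked_candidates, gold) pairs."""
    n = len(predictions)
    return {
        "P@1": sum(precision_at_k(r, g, 1) for r, g in predictions) / n,
        "P@10": sum(precision_at_k(r, g, 10) for r, g in predictions) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in predictions) / n,
    }


# Invented ranked outputs for three test queries.
predictions = [
    (["1926", "1927", "1925"], "1926"),   # gold at rank 1
    (["1994", "1993", "1995"], "1993"),   # gold at rank 2
    (["1950", "1951", "1952"], "1960"),   # gold not retrieved
]
scores = evaluate(predictions)
```

For these three queries, P@1 is 1/3, P@10 is 2/3, and MRR is (1 + 1/2 + 0) / 3 = 0.5.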

4.7 Results

Table 1 shows results on T-REx datasets obtained by querying three BERT-style models, with P@1 as the metric. Additional metrics and language models are shown in Tables 2 and 3 as well as Tables 5 and 6 in the supplementary material.

We consistently get large improvements by tuning the initial prompts. Remarkably, our method beats all prior methods even when throwing away the words of their informed prompts in favor of random initial vectors. It simply finds a prompt that works well on the $(x,y)$ training examples.

We conduct an ablation study where we adjust only the mixture weights (which are initially uniform) or only the word vectors in the prompts $\mathbf{t}$. As Table 4 shows, each helps, but the major benefit comes from tuning the word vectors to get soft prompts. Appendix C visualizes a set of soft prompts, and Appendix D analyzes the mixture weights. We also experiment on a challenging setting where the $y$ labels are distinct for training and test (Appendix E in the supplementary materials), and find that soft prompts still yield some benefits.

The above results are for our basic method that tunes only the words of the prompt (i.e., layer 0). When we tune all layers—the “deeply perturbed prompts” of § 3.2—we typically obtain small additional gains, across various models and initializations, although tuning all layers does substantially hurt RoBERTa. These results are shown in Tables 5 and 6 in the supplementary material.

The tables show that the winning system—for each combination of language model, T-REx dataset, and evaluation metric—always uses a mixture of soft prompts initialized to mined prompts. It always tunes all layers, except with RoBERTa.

Finally, we also tried using data-dependent mixture weights as in § 3.4. This had little effect, because training learned to discard the $x$ information by setting the temperature parameter $T$ high.

Conclusion

Well-crafted natural language prompts are a powerful way to extract information from pretrained language models. In the case of cloze prompts used to query BERT and BART models for single-word answers, we have demonstrated startlingly large and consistent improvements from rapidly learning prompts that work—even though the resulting “soft prompts” are no longer natural language.

Our code and data are available at https://github.com/hiaoxui/soft-prompts.

How about few-shot prediction with pretrained generative LMs? Here, Lewis et al. (2020b) show how to assemble a natural language prompt for input $x$ from relevant input-output pairs $(x_{i},y_{i})$ selected by a trained retrieval model. Allowing fine-tuned soft string pairs is an intriguing future possibility for improving such methods without needing to fine-tune the entire language model.

Acknowledgments

We thank the anonymous reviewers for helpful comments. This work was supported by DARPA KAIROS and by the National Science Foundation under Grant No. 1718846. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes. The views and conclusions contained in this publication are those of the authors, and should not be interpreted as representing official policies nor endorsement by the funding agencies or by Microsoft (where Dr. Eisner is also a paid employee, in an arrangement that has been reviewed and approved by the Johns Hopkins University in accordance with its conflict of interest policies).

References

Appendix A Statistics of Relational Databases

The statistics of the various relational databases are shown in Table 8.

Appendix B Statistics of the Initial Prompts

Table 7 shows some statistics of the prompts we use to initialize the SoftPrompt model.

Appendix C Visualization of Soft Prompts

Figure 1 shows what a mixture of soft prompts looks like when we tune only layer 0. The soft prompts are not very interpretable. The words closest to the tuned tokens (shown in blue) seem to be largely on the music topic. However, the soft templates do not seem to form meaningful phrases, nor is it obvious why they would prime for $y$ to be an instrument when $x$ is a musician.

Appendix D Entropy of the Mixture Model

For any given relation $r$, the entropy of the mixture weights is

$$H=-\sum_{\mathbf{t}\in\mathcal{T}_{r}}p(\mathbf{t}\mid r)\cdot\log_{2}p(\mathbf{t}\mid r)$$

We then take $2^{H}\in[1,|\mathcal{T}_{r}|]$ as a measure of the effective number of prompts that were retained. Table 10 shows some statistics of the effective number of prompts. In some cases, tuning the mixture weights essentially selected a single prompt, but on average, it settled on a mixture of several variant prompts (as illustrated by Figure 1).
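
The effective number of prompts can be computed as in the short sketch below. The weight vectors are made-up examples, and the base-2 logarithm is chosen to match the use of $2^{H}$.

```python
import math


def effective_num_prompts(weights):
    """Return 2**H, where H is the base-2 entropy of the mixture weights."""
    entropy = -sum(w * math.log2(w) for w in weights if w > 0)
    return 2 ** entropy


uniform = effective_num_prompts([0.25, 0.25, 0.25, 0.25])
single = effective_num_prompts([1.0, 0.0, 0.0])
two_of_three = effective_num_prompts([0.5, 0.5, 0.0])
```

The value ranges from 1, when all weight sits on one prompt, up to the number of prompts, when the weights are uniform.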

Appendix E Challenging Dataset with Distinct $y$’s

As described in § 4.3, we conducted an additional experiment to determine whether the prompts could generalize to novel $y$ values, ensuring that there are no common $y$ values among the train / dev / test sets. We use T-REx as the base relational database and split the datasets to make the ratio close to 80-10-10. The experiment results are shown in Table 9. Our method again improves the results, just as in Tables 5 and 6, which shows the generalizability of our method.