Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification

Timo Schick, Helmut Schmid, Hinrich Schütze

Introduction

Pretraining language models on large corpora has led to improvements on a wide range of NLP tasks (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019, inter alia), but learning to solve tasks from only a few examples remains a challenging problem. As small datasets are common for many real-world applications of NLP, solving this challenge is crucial to enable broad applicability. A promising direction for many tasks is to reformulate them (e.g., by appending an instruction such as “translate into French”) so that they can directly be solved by a pretrained language model (Radford et al., 2019; Schick and Schütze, 2020a; Brown et al., 2020). The key idea of Pet (Schick and Schütze, 2020a), one such approach aimed at text classification, is to rephrase each input as a cloze question for which the language model’s prediction can somehow be mapped to a label; an example is illustrated in Figure 1. While Pet achieves remarkable results with little or no labeled training data, manually defining the required mapping between a language model’s predictions and labels is difficult as it requires both task-specific knowledge and an understanding of the language model’s inner workings to identify words that it understands sufficiently well.

In this work, we show how this mapping can be obtained automatically, removing the need for expert knowledge: We introduce Pet with Automatic Labels (Petal), a simple approach for identifying words that can serve as proxies for labels given small amounts of training data. At its core, our approach breaks the intractable problem of finding the mapping that maximizes the likelihood of the training data into several manageable subproblems. Integrating our approach into Pet significantly outperforms regular supervised training and almost matches the performance of Pet with a manually defined mapping.

Related Work

Reformulating problems as language modeling tasks has been explored in fully unsupervised settings (Radford et al., 2019; Puri and Catanzaro, 2019; Davison et al., 2019), in few-shot scenarios with limited amounts of training data (Opitz, 2019; Shwartz et al., 2020; Brown et al., 2020), and even in high-resource settings (Raffel et al., 2019). The same idea is also commonly used for probing the knowledge contained within pretrained language models (Petroni et al., 2019; Talmor et al., 2019; Schick and Schütze, 2020b; Ettinger, 2020, inter alia).

Our method is a direct extension of Pet (Schick and Schütze, 2020a) and is similar in spirit to automatic verbalizer search (AVS) introduced therein. AVS is another method for automatically finding a mapping from labels to words that works as follows: First, the mapping is initialized by assigning a random word to each label; then, the mapping is improved over multiple iterations by greedily replacing words with better alternatives given the current mapping. In contrast, our approach offers a closed-form solution that is conceptually simpler and faster, requires fewer hyperparameters – which can be crucial in a data-scarce scenario – and performs much better, especially for difficult tasks.

For Pet, expert knowledge is mostly encoded in the mapping from a language model’s prediction to labels, which is why we focus on automating this part. The complementary problem of automatically transforming inputs before processing them with a language model has been studied by ?). This is also closely related to approaches for extracting patterns in relation extraction (Brin, 1999; Agichtein and Gravano, 2000; Batista et al., 2015; Bouraoui et al., 2020).

Pattern-Exploiting Training

We review Pattern-Exploiting Training (Pet) as proposed by Schick and Schütze (2020a). Let $M$ be a pretrained masked language model (MLM), $T$ its vocabulary and $\texttt{[MASK]}\in T$ the mask token. We consider the task of mapping textual inputs $\mathbf{x}\in X$ to some label $y\in Y$ where we assume w.l.o.g. that $Y=\{1,\ldots,k\}$ for some $k\in\mathbb{N}$. In addition to training data $\mathcal{T}=\{(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{n},y_{n})\}$, Pet requires a set of pattern-verbalizer pairs (PVPs). As exemplified in Figure 1, each PVP $\mathbf{p}=(P,v)$ consists of

a pattern $P$ that is used to convert inputs to cloze questions. Formally, $P:X\rightarrow T^{*}$ is defined as a function that maps each input to a sequence of tokens containing exactly one [MASK] token;

a verbalizer $v:Y\rightarrow T$ that maps each label to a single token representing its meaning. For Pet to work, the verbalizer must be chosen so that for each input $\mathbf{x}\in X$ , $v(y)$ is a suitable replacement for the mask token in $P(\mathbf{x})$ if and only if $y$ is the correct label for $\mathbf{x}$ . We call $v(y)$ the verbalization of $y$ and abbreviate it as $v_{y}$ .

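Based on this intuition, Schick and Schütze (2020a) define the conditional probability distribution $q_{\mathbf{p}}$ of $Y$ given $X$ as

$$q_{\mathbf{p}}(y\mid\mathbf{x})=\frac{\exp M(v_{y}\mid P(\mathbf{x}))}{\sum_{i=1}^{k}\exp M(v_{i}\mid P(\mathbf{x}))}\tag{1}$$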

where $M(t\mid P(\textbf{x}))$ denotes the raw score that $M$ assigns to $t$ at the masked position in $P(\mathbf{x})$ ; that is, the probability of $y$ being the correct label for $\mathbf{x}$ is derived from the probability of its verbalization $v_{y}$ being the “correct” token at the masked position in $P(\mathbf{x})$ .
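The following minimal sketch illustrates how a label distribution can be derived from raw scores at the masked position. It assumes that $q_{\mathbf{p}}$ is a softmax over the scores of the $k$ verbalizations only; the function name, the score values and the assignment of label ids to words are hypothetical and chosen purely for illustration.

```python
import math

def label_distribution(mask_scores, verbalizer):
    """Softmax over the raw scores of the verbalizations only.

    mask_scores: dict token -> raw score M(t | P(x)) at the masked position
    verbalizer:  dict label -> single token v(y)
    """
    scores = {y: mask_scores[token] for y, token in verbalizer.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

# Made-up raw scores for a news headline about a volleyball match.
mask_scores = {"World": 1.2, "Business": 0.3, "Sports": 4.1, "the": 5.0}
verbalizer = {1: "World", 2: "Business", 3: "Sports"}  # hypothetical label ids

q = label_distribution(mask_scores, verbalizer)
predicted = max(q, key=q.get)
```

Note that the highly likely token "the" does not influence the result, because only tokens selected by the verbalizer enter the normalization.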

Figure 1: Exemplary application of a pattern-verbalizer pair $\mathbf{p}=(P,v)$: An input $\mathbf{x}$ is converted into a cloze question by applying $P$. The probability $q_{\mathbf{p}}(y\mid\mathbf{x})$ of each label $y$ is derived from the probability of its verbalization $v(y)$ being a plausible choice for the masked position. (In the figure, the input “American Duo Wins Opening Beach Volleyball Match” is combined with the text “News:” and a [MASK] token; the three labels are verbalized as “World”, “Business” and “Sports”.)

Pet basically works in three steps:

For each PVP $\mathbf{p}$ , a separate MLM is finetuned on $\mathcal{T}$ , using the cross entropy between the true labels $y_{i}$ and $q_{\mathbf{p}}(y_{i}\mid\mathbf{x}_{i})$ as loss function.

The resulting ensemble of finetuned MLMs is used to annotate a large set of unlabeled examples with soft labels.

Another pretrained language model with a sequence classification head is finetuned on the resulting soft-labeled dataset; this model serves as the final classifier for the task considered.
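As a rough illustration of the second step, the sketch below combines the predictions of several finetuned models into one soft label. It assumes a plain uniform average of the per-model label distributions; the actual combination scheme used by Pet may differ, and the function name and numbers are hypothetical.

```python
def soft_label(ensemble_predictions):
    """Average the label distributions that several models assign to one example.

    ensemble_predictions: list of dicts label -> probability, one dict per model
    """
    labels = ensemble_predictions[0].keys()
    n_models = len(ensemble_predictions)
    return {y: sum(p[y] for p in ensemble_predictions) / n_models for y in labels}

# Two hypothetical models (one per PVP) disagree slightly on an unlabeled example.
predictions = [{1: 0.8, 2: 0.2}, {1: 0.6, 2: 0.4}]
target = soft_label(predictions)  # used as training target for the final classifier
```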

There are several additional details to Pet (e.g., an additional language modeling objective to prevent catastrophic forgetting); we skip these details as they are not relevant to our approach. For a more thorough explanation, we refer to Schick and Schütze (2020a).

Likelihood Ratio Verbalizer Search

Manually defining the verbalizer $v:Y\rightarrow T$ required for Pet can be challenging: It requires knowledge not only of a task’s labels and how they can best be expressed in natural language using a single word, but also of the capabilities of the MLM that is used. This is because it is crucial to choose only verbalizations that the language model understands sufficiently well and that correspond to a single token in its vocabulary. We thus aim to automatically find a good verbalizer $v$ for some pattern $P$ without requiring task- or model-specific knowledge.

Our method requires sets $\mathcal{V}_{y}\subseteq T$ of verbalization candidates for each label $y\in Y$; for now, we simply assume $\mathcal{V}_{y}=T$ for all $y$. Let $\mathcal{V}$ be the set of all verbalizers consistent with these candidate sets, i.e., $v\in\mathcal{V}$ if and only if $v_{y}\in\mathcal{V}_{y}$ for all $y\in Y$. A natural criterion for measuring the suitability of a verbalizer $v$ is to compute the likelihood of the training data given $v$, leading to the maximum likelihood estimate

$$\hat{v}=\operatorname*{arg\,max}_{v\in\mathcal{V}}\prod_{(\mathbf{x},y)\in\mathcal{T}}q_{(P,v)}(y\mid\mathbf{x})\tag{2}$$

Unfortunately, iterating over $\mathcal{V}$ to find the best verbalizer is intractable: the number of possible verbalizers $|\mathcal{V}|=|T|^{k}$ grows exponentially in the number of labels and for a typical MLM, $T$ contains tens of thousands of tokens.
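To get a feeling for the numbers, the snippet below compares the size of the joint search space with the number of candidates that remain when each label is scored on its own. The vocabulary size of 50,000 and $k=10$ are assumed values chosen for illustration.

```python
vocab_size = 50_000  # assumed |T|
k = 10               # assumed number of labels

joint_search = vocab_size ** k      # verbalizers to score when searching jointly
per_label_search = vocab_size * k   # candidates to score when labels are independent

print(f"joint search space has {len(str(joint_search))} digits")
print(f"independent search needs {per_label_search:,} evaluations")
```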

To circumvent this problem, we reframe the $k$-class classification task as $k$ one-vs-rest classifications: For each $y\in Y$, we search for a verbalization $v_{y}$ that enables $M$ to distinguish examples with label $y$ from examples with any other label. To this end, we introduce binarized training sets $\mathcal{T}_{y}=\{(\mathbf{x}_{1},\tilde{y}_{1}),\ldots,(\mathbf{x}_{n},\tilde{y}_{n})\}$ where $\tilde{y}_{i}=1$ if $y_{i}=y$ and $\tilde{y}_{i}=0$ otherwise. For $t\in T$, we define

$$q_{(P,t)}(1\mid\mathbf{x})=\frac{\exp M(t\mid P(\mathbf{x}))}{\sum_{t^{\prime}\in T}\exp M(t^{\prime}\mid P(\mathbf{x}))}\tag{3}$$

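analogous to Eq. 1 except that we consider all tokens $t^{\prime}\in T$ for normalization, and $q_{(P,t)}(0\mid\mathbf{x})=1-q_{(P,t)}(1\mid\mathbf{x})$. This enables us to formulate (and compute) the maximum likelihood estimate for each verbalization $v_{y}$ independently as

$$\hat{v}_{y}=\operatorname*{arg\,max}_{v_{y}\in\mathcal{V}_{y}}\prod_{(\mathbf{x},\tilde{y})\in\mathcal{T}_{y}}q_{(P,v_{y})}(\tilde{y}\mid\mathbf{x})\tag{4}$$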

However, this reframing creates a label imbalance: If $\mathcal{T}$ is balanced, each $\mathcal{T}_{y}$ contains $k-1$ times as many negative examples as positive ones. To compensate for this, we raise each $q_{(P,v_{y})}(\tilde{y}\mid\mathbf{x})$ to the power of

$$s(\tilde{y})=\begin{cases}1&\text{if }\tilde{y}=1\\ \dfrac{n_{y}}{n-n_{y}}&\text{otherwise}\end{cases}\tag{5}$$

where $n_{y}$ is the number of examples in $\mathcal{T}$ with label $y$ . A similar fix for this imbalance problem was suggested by ?) for multi-class classification with support vector machines.
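The one-vs-rest construction and the rebalancing can be sketched as follows. The sketch assumes that positive examples keep an exponent of 1 and negative examples receive the exponent $n_{y}/(n-n_{y})$, so that both classes carry the same total weight; the function names and the toy data are hypothetical.

```python
from collections import Counter

def binarize(train, y):
    """One-vs-rest view of the training set: 1 for label y, 0 for any other label."""
    return [(x, 1 if label == y else 0) for x, label in train]

def scale(train, y):
    """Exponents for positive (1) and negative (0) examples (assumed form)."""
    n = len(train)
    n_y = Counter(label for _, label in train)[y]
    return {1: 1.0, 0: n_y / (n - n_y)}

# Toy balanced training set with k = 3 labels and two examples per label.
train = [("a", 1), ("b", 2), ("c", 3), ("d", 1), ("e", 2), ("f", 3)]
train_1 = binarize(train, 1)
s = scale(train, 1)

# Total weight of negatives equals the number of positives after rescaling.
negative_weight = sum(s[0] for _, y_tilde in train_1 if y_tilde == 0)
```

For a balanced dataset, the exponent for negative examples reduces to $1/(k-1)$, which is $0.5$ in this toy case.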

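We next reformulate maximizing the likelihood as minimizing the cross entropy between $\tilde{y}$ and $q_{(P,v_{y})}(\tilde{y}\mid\mathbf{x})$, that is, $\hat{v}_{y}=\operatorname*{arg\,min}_{v_{y}\in\mathcal{V}_{y}}L_{\text{CE}}(\mathcal{T};v_{y})$ where

$$L_{\text{CE}}(\mathcal{T};v_{y})=-\sum_{(\mathbf{x},\tilde{y})\in\mathcal{T}_{y}}s(\tilde{y})\cdot\log q_{(P,v_{y})}(\tilde{y}\mid\mathbf{x})\tag{6}$$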

This can easily be derived from Eq. 4 after compensating for the label imbalance as described above. Unfortunately, Eq. 6 has the following problem: As the vocabulary $T$ is quite large for most pretrained MLMs, $q_{(P,v_{y})}(0\mid\mathbf{x})$ will almost always be close to $1$ and thus, $\log q_{(P,v_{y})}(0\mid\mathbf{x})\approx\log 1=0$. This means that negative examples contribute almost nothing to this cross entropy loss, so optimizing for $L_{\text{CE}}$ results in verbalizations $\hat{v}_{y}$ that are overall highly likely, but do not necessarily reflect the meaning of $y$. We fix this problem by considering not the absolute values of $q_{(P,v_{y})}(\tilde{y}\mid\mathbf{x})$, but the likelihood ratio (LR):

$$L_{\text{LR}}(\mathcal{T};v_{y})=-\sum_{(\mathbf{x},\tilde{y})\in\mathcal{T}_{y}}s(\tilde{y})\cdot\log\frac{q_{(P,v_{y})}(\tilde{y}\mid\mathbf{x})}{q_{(P,v_{y})}(1-\tilde{y}\mid\mathbf{x})}\tag{7}$$

Independently, this LR criterion was recently shown to compare favorably to cross entropy in gradient-based neural network training for image classification (Yao et al., 2020).
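The difference between the two criteria can be made concrete with a toy comparison. The sketch below assumes that the cross entropy loss is the weighted negative log-likelihood of the binarized examples and that the likelihood ratio loss replaces each probability by its ratio to the probability of the opposite class; all probabilities are made up, and the function names are hypothetical.

```python
import math

def ce_loss(examples, s):
    """examples: list of (q1, y_tilde), where q1 = q(1 | x) for the candidate token."""
    return -sum(s[y] * math.log(q1 if y == 1 else 1.0 - q1) for q1, y in examples)

def lr_loss(examples, s):
    """Same sum, but over log-ratios between the correct and the opposite class."""
    total = 0.0
    for q1, y in examples:
        ratio = q1 / (1.0 - q1) if y == 1 else (1.0 - q1) / q1
        total += s[y] * math.log(ratio)
    return -total

s = {1: 1.0, 0: 1.0}  # two positives and two negatives, so no rescaling is needed

# "the" is likely everywhere; "sports" is rarer but far more likely on positives.
the = [(0.2, 1), (0.2, 1), (0.2, 0), (0.2, 0)]
sports = [(0.05, 1), (0.05, 1), (0.0005, 0), (0.0005, 0)]

ce_prefers = min(("the", the), ("sports", sports), key=lambda c: ce_loss(c[1], s))[0]
lr_prefers = min(("the", the), ("sports", sports), key=lambda c: lr_loss(c[1], s))[0]
```

In this toy setting the cross entropy loss selects the generic token, whereas the likelihood ratio loss selects the discriminative one, because the negative examples now contribute a large term.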

To arrive at $L_{\text{LR}}$ , we have made quite a number of modifications to our starting point, the intractable maximum likelihood estimate. However, the two objectives are in fact quite similar. The key difference is that Eq. 2 enforces a large distance between $M(v_{y}\mid P(\mathbf{x}))$ and the maximum score assigned to the verbalizations of other labels, whereas Eq. 7 enforces a large distance between $M(v_{y}\mid P(\mathbf{x}))$ and the average score assigned to the verbalizations of other labels; this is shown in Appendix A.

Our above formulation requires sets of verbalization candidates $\mathcal{V}_{y}$ for each $y\in Y$ . These candidate sets can trivially be obtained by setting $\mathcal{V}_{y}=T$ , but to facilitate verbalizer search, we create candidate sets $\mathcal{V}_{y}\subset T$ containing only a small subset of the vocabulary. First, we follow ?) and reduce $T$ by removing all tokens that do not correspond to real words or do not contain at least 2 alphabetic characters. From the remaining list, we collect the 10,000 tokens that occur most frequently in the task’s unlabeled data and denote this filtered vocabulary by $T_{f}$ .
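A minimal sketch of this vocabulary reduction is given below. It only implements the criterion of at least two alphabetic characters and the frequency cutoff; the check for "real words" is omitted because its exact definition is not given here. The function name and the toy data are hypothetical.

```python
from collections import Counter

def filtered_vocabulary(vocab, unlabeled_tokens, size=10_000):
    """Keep tokens with >= 2 alphabetic characters, most frequent ones first."""
    counts = Counter(unlabeled_tokens)
    keep = [t for t in vocab if sum(c.isalpha() for c in t) >= 2]
    keep.sort(key=lambda t: counts[t], reverse=True)  # stable sort keeps ties in vocab order
    return keep[:size]

vocab = ["a", "ok", "##", "game", "x1", "team"]
unlabeled_tokens = ["game", "team", "game", "ok", "a", "a", "a"]
t_f = filtered_vocabulary(vocab, unlabeled_tokens, size=2)
```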

As our loss formulation in Eq. 7 considers the likelihood ratio, it is indifferent to the overall likelihood of a token. To make sure that candidates are both syntactically and semantically plausible for a given pattern, we further restrict the set of candidates by keeping only tokens that maximize the likelihood of all positive examples: For each label $y\in Y$, we define a candidate set $T_{f,y}$ that contains the 1000 tokens $t\in T_{f}$ that minimize $L_{\text{CE}}(\mathcal{T}_{y}^{+};t)$ where $\mathcal{T}_{y}^{+}=\{(\mathbf{x},\tilde{y})\in\mathcal{T}_{y}\mid\tilde{y}=1\}$. Naturally, this induces a bias towards frequent words. As recently shown by ?), pretrained language models tend to understand frequent words much better than rare words, so all other things being equal, a frequent word should be preferred over a rare word as verbalization; that is, this bias towards frequent words is indeed desirable.
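The per-label restriction can be sketched as a simple ranking: tokens are ordered by the cross entropy they obtain on the positive examples (lowest first, which corresponds to the highest likelihood), and only the best ones are kept. The function name and the probabilities are hypothetical.

```python
import math

def candidate_set(pos_probs, size=1000):
    """pos_probs: token -> list of q(1 | x) over the positive examples of one label."""
    def cross_entropy(token):
        return -sum(math.log(p) for p in pos_probs[token])
    return sorted(pos_probs, key=cross_entropy)[:size]

pos_probs = {
    "sports": [0.05, 0.04],
    "the": [0.2, 0.2],
    "phylogen": [1e-6, 1e-5],  # rare token, very unlikely at the masked position
}
candidates = candidate_set(pos_probs, size=2)
```

The rare token is discarded at this stage, while the generic token survives; it is the subsequent likelihood ratio criterion that has to rule out the latter.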

Multi-Verbalizers

For some tasks, it makes sense to assign multiple verbalizations to some label. (For example, one of the categories in the AG’s News classification dataset (Zhang et al., 2015) is “Science/Tech”, which can best be modeled by using the two verbalizations “Science” and “Tech”.) This applies all the more if the verbalizations are found automatically, as it may easily occur that the most likely verbalizations for a given label cover different aspects thereof. We thus introduce the concept of multi-verbalizers, a generalization of verbalizers to functions $v:Y\rightarrow\mathcal{P}(T)$ where $\mathcal{P}(T)$ denotes the power set of $T$. To integrate multi-verbalizers into Pet, we replace the conditional probability distribution in Eq. 1 with

$$q_{\mathbf{p}}(y\mid\mathbf{x})=\frac{\exp\left(\frac{1}{|v(y)|}\sum_{t\in v(y)}M(t\mid P(\mathbf{x}))\right)}{\sum_{i=1}^{k}\exp\left(\frac{1}{|v(i)|}\sum_{t\in v(i)}M(t\mid P(\mathbf{x}))\right)}\tag{8}$$

That is, we substitute the raw score that $M$ assigns to a label’s verbalization in standard Pet with the average score across all its verbalizations.
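A minimal sketch of this substitution is shown below: each label's score is the mean raw score of its verbalizations, and a softmax over these mean scores yields the label distribution. The function name, label names and scores are hypothetical.

```python
import math

def multi_label_distribution(mask_scores, multi_verbalizer):
    """Softmax over the average raw score of each label's verbalizations.

    mask_scores:      dict token -> raw score at the masked position
    multi_verbalizer: dict label -> list of tokens
    """
    avg = {y: sum(mask_scores[t] for t in tokens) / len(tokens)
           for y, tokens in multi_verbalizer.items()}
    top = max(avg.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(a - top) for y, a in avg.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

mask_scores = {"hardware": 2.0, "software": 4.0, "health": 1.0}
multi_verbalizer = {"Computer": ["hardware", "software"], "Health": ["health"]}
q_multi = multi_label_distribution(mask_scores, multi_verbalizer)
```

With exactly one token per label, this reduces to the single-token case.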

Experiments

For our experiments with Petal, we use the Pet implementation of Schick and Schütze (2020a) and follow their experimental setup. In particular, we use RoBERTa-large (Liu et al., 2019) as underlying MLM, we use the same set of hyperparameters for Pet, the same evaluation tasks with the same patterns, and the same strategy for downsampling training sets. We deviate from Schick and Schütze (2020a) in that we convert all inputs to single sequences (i.e., we remove all [SEP] tokens) as we found this to slightly improve the verbalizers found by our approach in preliminary experiments. To ensure that our results are comparable with previous work and improvements in Pet’s performance are not simply due to this modification of patterns, we do so only for finding verbalizers and not for actual Pet training and inference.

We first analyze the verbalizers found by our method qualitatively. To this end, we consider Yahoo Questions (Zhang et al., 2015), a dataset consisting of questions and answers that have to be categorized into one of ten possible categories such as “Health”, “Sports” and “Politics”. We use the simple pattern

and 50 training examples, meaning that we provide just five examples per label. Table 1 shows the most likely verbalizations obtained for all labels using $L_{\text{CE}}$ and $L_{\text{LR}}$; for the latter, we consider both an unrestricted set of verbalization candidates and the candidate sets defined in Section 4. As can be seen, $L_{\text{CE}}$ does not lead to useful verbalizers for the reason outlined in Section 4: it only identifies words that are overall highly likely substitutes for the [MASK] in $P(\mathbf{x})$. While $L_{\text{LR}}$ with $\mathcal{V}_{y}=T$ finds reasonable verbalizers, some verbalizations are rather uncommon tokens (“PLoS”, “phylogen”, “gcc”); using more restricted candidate sets ($\mathcal{V}_{y}=T_{f,y}$) mitigates this issue and finds words that, in most instances, correspond well to the task’s actual labels. The shown verbalizations also illustrate the benefit of using multi-verbalizers. For example, the verbalizations for “Computer” include “hardware” and “software”; in isolation, neither of these terms fully covers this category, but their combination does cover most of its aspects.

Next, we consider the more challenging MNLI dataset (Williams et al., 2018), a natural language inference dataset where given two sentences $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$, the task is to decide whether both sentences contradict each other, one sentence entails the other, or neither. On this dataset, Table 2 compares Petal to AVS, the approach of Schick and Schütze (2020a) for automatically finding verbalizers, using the pattern

and 50 labeled training examples. While both approaches clearly fail to find good verbalizations for the label “Neutral”, using Petal results in much better verbalizations for the other two labels, with most of the words identified by AVS being entirely unrelated to the considered labels.

To evaluate our approach quantitatively, we use the Yelp Review Full Star (Yelp) and AG’s News (AG’s) datasets (Zhang et al., 2015) in addition to Yahoo Questions and MNLI. The task for Yelp is to guess the number of stars (ranging from 1 to 5) that a customer gave to a restaurant based on their textual review; for AG’s, one of the four categories “World”, “Business”, “Sports” and “Science/Tech” has to be assigned to a news article.

Following ?), we again consider a scenario where we have $|\mathcal{T}|=50$ labeled training examples and a set of $10\,000\cdot k$ unlabeled examples for each task; the unlabeled examples are only required for Pet and not used for finding a verbalizer. For our approach, we consider both a variant where verbalizers are computed for each pattern separately (sep), and a variant where a single verbalizer is computed for all patterns as in AVS (joint); for the latter, the likelihood ratio losses for all patterns are simply added up and minimized jointly. We use a multi-verbalizer $\hat{v}$ where $\hat{v}(y)$ are the $n_{v}=10$ most likely verbalizations per label and compare Petal to the following baselines:

supervised: Regular supervised learning without Pet, i.e., we add a regular sequence classification head on top of the pretrained language model and perform finetuning as in ?).

Pet + random: We generate a multi-verbalizer by randomly choosing 10 words per label uniformly from $T_{f}$ . We include this baseline to verify that any improvements over supervised learning are not simply due to Pet using additional unlabeled examples and auxiliary objectives, but that the actual source of improvement is the improved verbalizer.

Pet + AVS: We generate a multi-verbalizer with 10 words per label using automatic verbalizer search with its default parameters.

Pet + manual: We consider the manually defined verbalizers of Schick and Schütze (2020a). This serves as an upper bound of what is achievable by incorporating task- and model-specific knowledge.
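The sep and joint variants described before the list of baselines differ only in how per-pattern losses are aggregated before the $n_{v}$ best tokens are selected. The sketch below assumes that the likelihood ratio loss of every candidate token has already been computed for each pattern; the function names and loss values are hypothetical.

```python
def top_verbalizations(losses, n_v=10):
    """Return the n_v tokens with the lowest loss for one label."""
    return sorted(losses, key=losses.get)[:n_v]

def joint_losses(per_pattern_losses):
    """Add up the losses of each token over all patterns (joint variant)."""
    shared = set.intersection(*(set(losses) for losses in per_pattern_losses))
    return {t: sum(losses[t] for losses in per_pattern_losses) for t in shared}

# Made-up likelihood ratio losses of three tokens for one label under two patterns.
pattern_1 = {"sports": -9.0, "game": -7.0, "the": 0.0}
pattern_2 = {"sports": -3.0, "game": -8.0, "the": 0.1}

sep = [top_verbalizations(p, n_v=1) for p in (pattern_1, pattern_2)]
joint = top_verbalizations(joint_losses([pattern_1, pattern_2]), n_v=2)
```

In the sep variant each pattern obtains its own verbalizations, while the joint variant produces a single ranking that is shared by all patterns.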

Results can be seen in Table 3. On average, Pet with random verbalizers performs slightly better than regular supervised learning; we surmise that this is due to Pet leveraging additional unlabeled data. Random verbalizers perform much worse than AVS which, in turn, is clearly outperformed by our method for 3 out of 4 tasks, with an especially large margin on MNLI. This holds true for both the joint and sep variants of Petal, with the latter performing slightly better on average. Furthermore, especially for MNLI, our approach almost matches the performance of Pet with manually defined mappings while requiring no task-specific knowledge for finding verbalizers. The large gap between supervised learning and Petal is especially surprising given that the patterns – the only other source of task-specific knowledge in Pet – are very generic in nature.

We finally note that our method adds a single hyperparameter to Pet: the number of verbalizations per label $n_{v}$ , which may be difficult to optimize for small training sets. However, as shown in Figure 2, results on all tasks are relatively stable for a wide range of values ranging from $1$ to $100$ ; the best result across all tasks is obtained for $n_{v}=3$ .

Conclusion

We have devised Petal, a simple approach that enriches Pet with the ability to automatically map labels to words. Qualitative and quantitative analysis shows that our approach is able to identify words that are suitable to represent labels with as few as 50 examples and almost matches the performance of hand-crafted mappings for some tasks. For future work, it would be interesting to see whether the patterns required by Pet can similarly be obtained in an automated fashion.

Acknowledgements

This work was supported by the European Research Council (grant #740516).

Appendix A Relation of Maximum Likelihood Estimate and One-Vs-Rest Likelihood Ratio

We analyze the impact of all modifications introduced in Section 4: reframing $k$-class classification as $k$ one-vs-rest classifications, downsampling negative examples and replacing $L_{\text{CE}}$ with $L_{\text{LR}}$. For the sake of conciseness, we drop the condition on $\mathbf{x}$ and $P(\mathbf{x})$ in $q_{\mathbf{p}}(y\mid\mathbf{x})$ and $M(y\mid P(\mathbf{x}))$, respectively. We start by reformulating the maximum likelihood estimate in Eq. 2 as

$$\hat{v}=\operatorname*{arg\,max}_{v\in\mathcal{V}}\prod_{(\mathbf{x},y)\in\mathcal{T}}q_{(P,v)}(y)=\operatorname*{arg\,min}_{v\in\mathcal{V}}-\sum_{(\mathbf{x},y)\in\mathcal{T}}\log q_{(P,v)}(y)$$

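through logarithmization and multiplication by $-1$. By applying the definition of $q_{\mathbf{p}}$, we obtain

$$\hat{v}=\operatorname*{arg\,min}_{v\in\mathcal{V}}-\sum_{(\mathbf{x},y)\in\mathcal{T}}\log\frac{e^{M(v_{y})}}{\sum_{y^{\prime}\in Y}e^{M(v_{y^{\prime}})}}=\operatorname*{arg\,min}_{v\in\mathcal{V}}\sum_{(\mathbf{x},y)\in\mathcal{T}}\left(\log\sum_{y^{\prime}\in Y}e^{M(v_{y^{\prime}})}-M(v_{y})\right)$$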

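Finally, we can derive from the tangent line approximation $\log(a+b)\approx\log a+b/a$ that the left part of each addend is a soft approximation of $\max_{y^{\prime}\in Y}M(v_{y^{\prime}})$ (also commonly referred to as LogSumExp), so we can approximate $\hat{v}$ as

$$\hat{v}\approx\operatorname*{arg\,min}_{v\in\mathcal{V}}\sum_{(\mathbf{x},y)\in\mathcal{T}}\left(\max_{y^{\prime}\in Y}M(v_{y^{\prime}})-M(v_{y})\right)$$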

We now consider the verbalizer obtained using $L_{\text{LR}}$ as in Eq. 7, for which we assume that $\mathcal{T}$ is a balanced dataset. That is, for each label $y\in Y$ , there are $|\mathcal{T}|/k$ examples with label $y$ in $\mathcal{T}$ . We abbreviate the set $Y\setminus\{y\}$ of all labels except $y$ as $Y_{\setminus y}$ .
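As a small worked step, assume that the exponent applied to negative examples has the form $s(0)=n_{y}/(n-n_{y})$ with $n=|\mathcal{T}|$. For a balanced dataset, $n_{y}=n/k$ for every label, so this exponent no longer depends on $y$:

$$s(0)=\frac{n_{y}}{n-n_{y}}=\frac{n/k}{n-n/k}=\frac{1}{k-1}$$

For instance, with $k=10$ labels and five examples per label, $s(0)=5/45=1/9$, while $s(1)=1$ for positive examples.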

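As $L_{\text{LR}}$ for each verbalization $v_{y}$ is independent of all verbalizations for other labels, we can simply write the optimization criterion for $\hat{v}$ as the sum of likelihood ratio losses for all verbalizations:

$$\hat{v}=\operatorname*{arg\,min}_{v\in\mathcal{V}}\sum_{y\in Y}L_{\text{LR}}(\mathcal{T};v_{y})=\operatorname*{arg\,min}_{v\in\mathcal{V}}-\sum_{y\in Y}\sum_{(\mathbf{x},\tilde{y})\in\mathcal{T}_{y}}s(\tilde{y})\cdot\log\frac{q_{(P,v_{y})}(\tilde{y})}{q_{(P,v_{y})}(1-\tilde{y})}$$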

As can be seen in the definition of $\mathcal{T}_{y}$ , each $(\mathbf{x},y)\in\mathcal{T}$ contributes to the above sum $k$ times: $k-1$ times as negative example $(\mathbf{x},0)\in\mathcal{T}_{y^{\prime}}$ for each $y^{\prime}\neq y$ , and once as a positive example $(\mathbf{x},1)\in\mathcal{T}_{y}$ . We can thus rewrite the above as

$$\hat{v}=\operatorname*{arg\,min}_{v\in\mathcal{V}}-\sum_{(\mathbf{x},y)\in\mathcal{T}}\left(s(1)\cdot\log\frac{q_{(P,v_{y})}(1)}{q_{(P,v_{y})}(0)}+\sum_{y^{\prime}\in Y_{\setminus y}}s(0)\cdot\log\frac{q_{(P,v_{y^{\prime}})}(0)}{q_{(P,v_{y^{\prime}})}(1)}\right)\tag{15}$$

and again use the fact that $q_{(P,t)}(0)\approx 1$ for all $t\in T$ as well as the definition of $q_{(P,t)}$ and $s$ to obtain:

$$\hat{v}\approx\operatorname*{arg\,min}_{v\in\mathcal{V}}-\sum_{(\mathbf{x},y)\in\mathcal{T}}\left(\log q_{(P,v_{y})}(1)-\sum_{y^{\prime}\in Y_{\setminus y}}s(0)\cdot\log q_{(P,v_{y^{\prime}})}(1)\right)\tag{16}$$

$$\phantom{\hat{v}}=\operatorname*{arg\,min}_{v\in\mathcal{V}}-\sum_{(\mathbf{x},y)\in\mathcal{T}}\left(\log\frac{e^{M(v_{y})}}{\sum_{t\in T}e^{M(t)}}-\frac{1}{k-1}\sum_{y^{\prime}\in Y_{\setminus y}}\log\frac{e^{M(v_{y^{\prime}})}}{\sum_{t\in T}e^{M(t)}}\right)\tag{17}$$

Using $\log(a/b)=\log a-\log b$ and the fact that $\sum_{t\in T}e^{M(t)}$ is independent of $v$, we can further simplify:

$$\hat{v}\approx\operatorname*{arg\,min}_{v\in\mathcal{V}}\sum_{(\mathbf{x},y)\in\mathcal{T}}\left(\frac{1}{k-1}\sum_{y^{\prime}\in Y_{\setminus y}}M(v_{y^{\prime}})-M(v_{y})\right)$$

This concludes our verification of the statement made in Section 4: Eq. 2 enforces a large distance between $M(v_{y})$ and the maximum score of other verbalizations, whereas Eq. 7 penalizes their average score.
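The LogSumExp approximation used in this appendix can be checked numerically. The sketch below uses made-up raw scores for $k=4$ verbalizations and compares the soft maximum with the true maximum and with the average; the function name and all values are hypothetical.

```python
import math

def logsumexp(scores):
    """log(sum(exp(s))) computed stably by factoring out the maximum."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))

scores = [6.0, 1.0, 0.5, -2.0]  # hypothetical raw scores of k = 4 verbalizations

soft_max = logsumexp(scores)
gap = soft_max - max(scores)          # always between 0 and log(k)
average = sum(scores) / len(scores)   # the quantity that an average-based criterion sees
```

When one score clearly dominates, the gap is close to zero, so LogSumExp behaves like the maximum; the average, in contrast, is pulled down by the low-scoring verbalizations.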