Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, Huajun Chen

Introduction

The pre-train and fine-tune paradigm has become the de facto standard for natural language processing (NLP) and has achieved excellent results on several benchmarks (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Dong et al., 2019; Bao et al., 2020a). The success of these pioneers seems to suggest that large-scale pre-trained models are nothing short of a panacea for boosting machine intelligence. However, supervised fine-tuning still depends heavily on labeled data in practice and faces considerable challenges owing to variations across domains, languages, and tasks. These drawbacks motivate research on an important technique, few-shot learning, which can significantly improve the learning capabilities of machine intelligence and practical adaptive applications by accessing only a small number of labeled examples.

The GPT-3 model, introduced by Brown et al. (2020), exhibits impressive few-shot learning capabilities. Given a natural language prompt and 16 labeled samples as demonstrations in the contextual input, GPT-3 achieves 80% of the SOTA results. However, GPT-3 is a fully dense transformer model with 175B parameters, which makes it challenging to deploy in most real-world applications.

Recently, an emerging fine-tuning methodology has arisen to equip smaller language models (LMs) with few-shot capabilities: adapting the pre-trained LM directly as a predictor through completion of a cloze task (Schick & Schütze (2021; 2020); Gao et al. (2020); Liu et al. (2021c)), which treats the downstream task as a (masked) language modeling problem. These prompts can be used in fine-tuning to provide the classifier with additional task information, especially in the low-data regime. Notably, Scao & Rush (2021) observe that prompting can often compensate for hundreds of data points on average across multiple classification tasks. However, determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets (Perez et al. (2021)). Recent studies (Lu et al. (2021); Zhao et al. (2021)) have reported that the manual prompt format can be sub-optimal, with accuracy varying from random-guess performance to near the state of the art. Therefore, previous approaches have attempted to search for discrete prompt tokens automatically. However, it is non-trivial to obtain an optimized prompt template and target label token for a wide range of classification tasks. For example, in specific classification tasks such as relation extraction, labels such as alternate_name and country_of_birth cannot be expressed as a single label token in the vocabulary.

In this paper, we propose a novel DifferentiAble pRompT (DART) fine-tuning approach, which is model-agnostic and parameter-efficient. As illustrated in Figure 1, the key idea is to leverage a few parameters (unused tokens) in the language model, which serve as the template and label tokens, and to optimize them in the continuous space using backpropagation. Subsequently, we introduce differentiable prompt learning to obtain optimized prompt templates as well as labels. Since fine-tuning with limited samples can be affected by instability (Dodge et al. (2020); Zhang et al. (2021)), we propose an optimization algorithm to jointly learn templates as well as labels. We further introduce an auxiliary fluency constraint objective to ensure the association among the prompt embeddings.

We conduct extensive experiments on 15 NLP datasets. With only a few training samples across all the tasks, our approach (DART) obtains better performance than conventional fine-tuning. Notably, an absolute performance improvement of up to 23.28% over conventional fine-tuning is obtained on average in the setting of $K=8$ (and 1.55% for fully supervised settings) on relation extraction datasets with complex label semantics. Our approach can be applied to real-world classification tasks without the high cost of collecting and annotating a large amount of data. The main contributions of this study are as follows:

We propose a new simple framework for few-shot learning, which is pluggable, extensible, and efficient. To the best of our knowledge, optimizing label tokens in continuous space is also a new branch of research that has not been explored in language model prompting.

A systematic evaluation of 15 NLP tasks shows that the simple-yet-effective method contributes towards improvements across all these tasks. Remarkably, given only 8 labeled samples per class, our proposed approach can achieve 90% performance of the SOTA results (full dataset).

Related Work

Language model prompting emerged with the introduction of GPT-3 (Brown et al. (2020)), which demonstrates excellent few-shot performance (Liu et al. (2021b)). However, GPT-3 is not designed for fine-tuning; it mainly relies on handcrafted prompts (in-context learning (Liu et al. (2021a); Zhao et al. (2021); Ding et al. (2021); Min et al. (2021))). Thus, recent studies (Qin & Eisner (2021); Hambardzumyan et al. (2021); Chen et al. (2021)) in this field have focused on automatically searching for prompts. Schick & Schütze (2021; 2020) propose PET, which reformulates NLP tasks as cloze-style questions and performs gradient-based fine-tuning. Tam et al. (2021) improve PET with a denser supervision objective during fine-tuning. Shin et al. (2020) propose AUTOPROMPT to create prompts for a diverse set of tasks based on a gradient-guided search. Han et al. (2021) propose an approach called PTR, which leverages logic rules to construct prompts with sub-prompts for many-class text classification. Wang et al. (2021) reformulate potential NLP tasks as entailment tasks and then fine-tune the model with few-shot samples. Hu et al. (2021) propose an approach to incorporate an external knowledge graph into the verbalizer with calibration. Additionally, Gao et al. (2020) present LM-BFF (better few-shot fine-tuning of language models), which leverages T5 (Raffel et al. (2020)) to generate templates and search label tokens in the vocabulary. However, the use of the generative model and the label search with validation is computation-intensive. Moreover, prompt search over a discrete space is sub-optimal due to the continuous nature of neural networks.

To overcome these limitations, Liu et al. (2021c) propose P-tuning, which employs trainable continuous prompt embeddings learned by an LSTM. Zhong et al. (2021) propose an effective continuous method called OPTIPROMPT to optimize prompts for factual probing. Liu et al. (2021c) propose prefix-tuning, which keeps language model parameters frozen but optimizes a small continuous task-specific vector for natural language generation tasks. Lester et al. (2021) propose a mechanism for learning “soft prompts” to condition frozen language models to perform downstream tasks. However, these approaches still have to optimize external parameters (e.g., the LSTM in P-tuning) and struggle with complex label spaces.

Conversely, this study aims to develop a novel few-shot learning framework based on pre-trained language models which can reduce the prompt engineering (including templates and labels) and external parameter optimization. Furthermore, the proposed approach only leverages the noninvasive modification of the model, which can be plugged into any pre-trained language model and extended to the widespread classification task.

Few-shot learning can significantly improve the learning capabilities of machine intelligence and practical adaptive applications by accessing only a small number of labeled examples (Zhang et al. (2020)). The proposed approach is related to other few-shot NLP methods, including: (1) Meta-learning (Yu et al. (2018); Bao et al. (2020b); Bansal et al. (2020); Deng et al. (2020b; a); Yu et al. (2020)), in which the quantities of the auxiliary tasks are optimized. (2) Intermediate training (Phang et al. (2018); Yin et al. (2020)), which supplements the pre-trained LMs with further training on data-rich supervised tasks. (3) Semi-supervised learning (Miyato et al. (2017); Xie et al. (2020)), which leverages unlabeled samples. The proposed approach focuses on a more realistic few-shot setting (the number of labeled instances per class can be any variable).

Background

where $w$ represents the $w^{th}$ label token of class $y$.
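As a rough illustration of the prompt-based prediction that this sentence refers to, the sketch below scores each class from the distribution predicted at the masked position. The equation itself is not reproduced in this text, so the aggregation rule (summing the probability of each class's label tokens), the function name `class_probabilities`, and the toy vocabulary are all assumptions made for illustration.

```python
import math

def class_probabilities(mask_logits, label_tokens):
    """Score classes from the logits predicted at the [MASK] position.

    mask_logits: dict mapping vocabulary token -> logit at the [MASK] slot.
    label_tokens: dict mapping class name -> list of label tokens of that class.
    """
    # Softmax over the whole (toy) vocabulary, shifted for numerical stability.
    max_logit = max(mask_logits.values())
    exp = {tok: math.exp(v - max_logit) for tok, v in mask_logits.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}
    # Assumed aggregation: sum the probability mass of each class's label tokens.
    scores = {c: sum(probs[w] for w in toks) for c, toks in label_tokens.items()}
    # Renormalise over classes so that the scores form a distribution.
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

# Toy example (hypothetical logits and label tokens).
toy_logits = {"great": 2.0, "good": 1.0, "terrible": -1.0, "the": 0.5}
toy_labels = {"positive": ["great", "good"], "negative": ["terrible"]}
toy_probs = class_probabilities(toy_logits, toy_labels)
print(toy_probs)
```

With a single label token per class, this reduces to comparing the masked-token probabilities of the label words directly.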

Our Approach

It can be observed from previous empirical findings (Gao et al. (2020); Scao & Rush (2021)) that an optimal prompt is necessary to make pre-trained language models better few-shot learners. Since templates with discrete tokens may be sub-optimal and insufficient to represent a specific class (it is non-trivial to evaluate all options of templates and label tokens), this study proposes DifferentiAble pRompT, referred to as DART, which reduces the requirement of prompt engineering in order to improve the applicability of the proposed method in various domains.

2 Differentiable Template Optimization

where $h_{i}\ (0\leq i\leq j)$ are trainable parameters. Differentiable template optimization can obtain expressive templates beyond the original vocabulary $\mathcal{V}$. Lastly, the templates, $h_{i}$, are differentially optimized by:

Note that the values of the prompt embeddings, $h_{i}$, must be co-dependent with each other rather than independent. Unlike P-tuning (Liu et al. (2021c)), which utilizes a bidirectional LSTM, DART leverages an auxiliary fluency constraint objective to associate the prompt embeddings with each other, thus stimulating the model to focus on context representation learning.

3 Differentiable Label Optimization

Prompt-based fine-tuning requires filling in one word, and the masked word prediction is mapped to a verbalizer, which produces a class (e.g., "Yes": True, "No": False). For each class $c\in Y$, previous approaches such as LM-BFF (Gao et al. (2020)) estimate the conditional likelihood of the initial $\mathcal{L}$ on a pruned set $\mathcal{V}^{c}\subset\mathcal{V}$ of the top $k$ vocabulary words.

However, brute-force label searching (1) is computationally intensive and tedious because $\mathcal{D}_{\text{dev}}$ is generally very large, requiring multiple rounds of evaluation, and (2) scales poorly as the number of classes increases (many classification datasets have more than 100 classes): the number of searches may be $k^{C}$ ($C$ represents the total number of classes), which is exponential and thus intractable. Additionally, the labels of classes contain rich, complex semantic knowledge, and one discrete token may be insufficient to represent this information.
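To make the exponential growth concrete, the short sketch below counts the label assignments when each of $C$ classes independently picks one of $k$ candidate tokens. The value $k=10$ and the class counts are hypothetical and chosen only for illustration.

```python
def label_search_space(k, num_classes):
    """Number of label assignments when each class picks one of k candidate tokens."""
    return k ** num_classes

# Hypothetical setting: 10 candidate tokens per class.
for num_classes in (2, 5, 20, 80):
    print(num_classes, label_search_space(10, num_classes))
```

Even with a small candidate set per class, the count quickly exceeds what can be evaluated on a development set.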

Specifically, given the labels $Y=\{Y_{1},Y_{2},..,Y_{m}\}$, and unlike the previous approach, which converts the class type $Y_{i}$ into a variable number of label tokens $\{\ldots,v_{1},..,v_{k},\ldots\}$, DART maps $Y_{j}$ to a continuous vocabulary space as follows:

where $m$ is the number of trainable embeddings in the template. To avoid optimizing any external parameters, $\{h_{1},...,h_{m},..,h_{m+n}\}$ is replaced with unused tokens (e.g., [unused1] or special tokens in the vocabulary) in $\mathcal{V}$ to generate $\mathcal{V}^{\prime}$, as shown in Figure 1.
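The replacement of pseudo tokens with unused vocabulary entries can be sketched as follows. This is only an illustration: the helper names `build_pseudo_tokens` and `build_prompt`, the `[unusedN]` numbering, and the position of the template relative to the input sentence are assumptions, not the layout used in the released code.

```python
def build_pseudo_tokens(template_len, num_labels, start=1):
    """Assign unused vocabulary slots to template and label pseudo tokens."""
    count = template_len + num_labels
    names = [f"[unused{i}]" for i in range(start, start + count)]
    # The first m slots act as the template, the remaining n as label tokens.
    return names[:template_len], names[template_len:]

def build_prompt(sentence_tokens, template_tokens):
    """Assumed layout: input sentence, then template pseudo tokens, then [MASK]."""
    return ["[CLS]"] + sentence_tokens + template_tokens + ["[MASK]", "[SEP]"]

template_tokens, label_tokens = build_pseudo_tokens(template_len=3, num_labels=2)
prompt = build_prompt(["a", "charming", "film"], template_tokens)
print(prompt)        # token sequence fed to the language model
print(label_tokens)  # candidate targets for the [MASK] position
```

Because the pseudo tokens reuse rows that already exist in the embedding matrix, training them adds no parameters beyond those of the language model.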

4 Training Objectives

Since the pseudo tokens in the prompt template must be co-dependent with each other, we introduce an auxiliary fluency constraint training objective that does not optimize any other parameters, inspired by Liu et al. (2021c) and Tam et al. (2021). Overall, there are two objectives: the class discrimination objective $\mathcal{L}_{C}$ and the fluency constraint objective $\mathcal{L}_{F}$.

To ensure the association among the template tokens and to maintain the language understanding ability inherited from the PLMs, we leverage a fluency constraint objective with the MLM. As shown in Figure 1, one token in the input sentence is randomly masked, and masked language prediction is conducted. $x$ and $x^{\prime}$ are the original and masked sequences, respectively. Let $x^{m}$ be the target token that has been masked out in $x^{\prime}$; $g(x^{m}|x^{\prime},y)$ is maximized as follows (we use the golden label $y$ rather than the [MASK] in the input of the fluency constraint objective):

By optimizing $\mathcal{L}_{F}$, the language model can obtain a better contextual representation with a rich association among the template tokens. We have the following training objective:

where $\lambda$ is a hyper-parameter. Lastly, we introduce the overall optimization procedure of DART. To mitigate the instability of few-shot fine-tuning, we jointly optimize templates and labels. Note that our approach can reuse the same transformer architecture (rather than an additional LSTM), so it retains the simplicity of prompt-tuning.
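A minimal sketch of how the two objectives might be combined is given below. The weighted sum $\mathcal{L}_{C}+\lambda\mathcal{L}_{F}$ and the use of a plain negative log-likelihood for both terms are assumptions made for illustration, since the equations are not reproduced in this text; the function names are hypothetical.

```python
import math

def negative_log_likelihood(probs, target):
    """Negative log-probability of the target under a probability dict."""
    return -math.log(probs[target])

def joint_loss(class_probs, gold_class, masked_token_probs, gold_token, lam=1.0):
    """Assumed combination: class discrimination loss + lam * fluency loss."""
    loss_c = negative_log_likelihood(class_probs, gold_class)        # L_C
    loss_f = negative_log_likelihood(masked_token_probs, gold_token)  # L_F
    return loss_c + lam * loss_f

# Toy distributions (hypothetical values).
loss = joint_loss(
    class_probs={"positive": 0.5, "negative": 0.5},
    gold_class="positive",
    masked_token_probs={"film": 0.25, "movie": 0.75},
    gold_token="film",
    lam=0.5,
)
print(loss)
```

Both terms are computed with the same language model head, so a single backward pass updates the template and label embeddings together.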

Experiments

In this section, we detail the comprehensive experiments conducted on classification tasks. The promising results demonstrate that our proposed DART substantially outperforms the conventional fine-tuning method, thus making pre-trained language models better few-shot learners.

We conduct a comprehensive study across 15 NLP tasks, covering sentiment analysis, natural language inference, paraphrases, sentence similarity, relation extraction, and event extraction (we only report event argument extraction performance). The evaluation consists of 10 popular sentence classification datasets (SST-2, MR, CR, Subj, TREC, MNLI, SNLI, QNLI, MRPC, QQP). To further evaluate the effectiveness of the proposed approach with complex label space, we conduct experiments on the relation extraction and event extraction datasets, including SemEval-2010 Task 8 (Hendrickx et al., 2010), TACRED-Revisit (Alt et al. (2020)), Wiki80 (https://github.com/thunlp/OpenNRE/; Han et al., 2019), ChemProt (Kringelum et al., 2016), and ACE-2005 (https://catalog.ldc.upenn.edu/LDC2006T06).

2 Settings

The proposed model is implemented using PyTorch (Paszke et al. (2019)). Our experiments follow the same setting as LM-BFF (Gao et al. (2020)), which measures the average performance with a fixed set of seeds, $\mathcal{S}_{\text{seed}}$, across five different sampled $\mathcal{D}_{\text{train}}$ for each task. We utilize a grid search over multiple hyperparameters and select the best result as measured on $\mathcal{D}_{\text{dev}}$ for each set $\{\mathcal{D}_{\text{train}}^{s},\mathcal{D}_{\text{dev}}\},s\in\mathcal{S}_{\text{seed}}$. We employ AdamW as the optimizer. We conduct experiments with RoBERTa-large (Liu et al. (2019)) on classification tasks for a fair comparison with LM-BFF. We leverage an uncased BERT-large (Devlin et al. (2019)) for relation extraction datasets, except that we use SCIBERT (Beltagy et al. (2019)) for the ChemProt dataset. We follow Soares et al. (2019) and use special entity markers uniformly to highlight the entity mentions for relation extraction.
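The evaluation protocol described above can be sketched as a loop over seeds with a per-seed grid search. The callable `train_and_eval` is a hypothetical stand-in for one fine-tuning run, and reporting the test score of the configuration selected on the development set is an assumption about how the averages are formed.

```python
from itertools import product
from statistics import mean, stdev

def evaluate_protocol(seeds, grid, train_and_eval):
    """For each seed, select the config with the best dev score; average test scores.

    train_and_eval(seed, config) -> (dev_score, test_score) is hypothetical.
    """
    keys = sorted(grid)
    test_scores = []
    for seed in seeds:
        best_dev, best_test = float("-inf"), None
        for values in product(*(grid[k] for k in keys)):
            config = dict(zip(keys, values))
            dev, test = train_and_eval(seed, config)
            if dev > best_dev:  # model selection uses the dev split only
                best_dev, best_test = dev, test
        test_scores.append(best_test)
    return mean(test_scores), stdev(test_scores)

def fake_run(seed, config):
    """Deterministic stand-in for a training run (not real results)."""
    dev = 0.8 - abs(config["lr"] - 2e-5) * 1e4 + 0.01 * seed
    return dev, dev - 0.02

grid = {"lr": [1e-5, 2e-5, 5e-5], "batch_size": [4, 8]}
avg, spread = evaluate_protocol(range(5), grid, fake_run)
print(avg, spread)
```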

3 Main Results

As shown in Table 1, we observe that our approach obtains better performance than conventional fine-tuning and achieves comparable results with LM-BFF. Note that DART reduces prompt engineering without requiring external models (e.g., T5 in LM-BFF) to generate templates, and is therefore easy to adapt to other datasets. DART obtains an 11.3% improvement with only 16 training samples per class on the MR dataset, comparable with LM-BFF, which leverages T5 to generate appropriate prompts. These results indicate that DART can better stimulate the potential ability of the pre-trained language model and make it a better few-shot learner. We also notice that DART yields better performance than P-tuning, which indicates that label optimization is beneficial.

For the classification tasks with complex label space, as shown in Table 2 and Figure 2(a), we observe that DART outperforms the conventional fine-tuning approach as well as LM-BFF by a large margin on relation extraction and event extraction datasets in both the few-shot and fully supervised settings. The proposed approach achieves an absolute improvement of 2.8% on the TACRED-Revisit dataset with full supervision and yields 18.4% gains with only 8 training samples per class. These findings also indicate that more relevant templates and labels can be determined without expert intervention, making it possible to generalize the proposed approach to other domains. We attribute the significant improvements to the fact that, unlike the GLUE datasets, which contain few categories, the relation extraction and event extraction datasets consist of a large number of classes with complex label space, making it more challenging to obtain suitable label tokens. Furthermore, we notice that the improvement decays slowly when $K$ becomes larger (i.e., from $8$ to $32$). Our approach is a simple yet effective fine-tuning paradigm that can reduce prompt engineering within the complex label space, thus making it a possible plug-in for some SOTA models.

4 Ablation Study

We conduct an ablation study to validate the effectiveness of the components in the proposed approach. We observe that DART exhibits a performance decay in the absence of any one of the modules, i.e., the fluency constraint objective, the differentiable template, or the differentiable label, demonstrating that all the modules are advantageous. Furthermore, we notice that performance is more sensitive to differentiable label optimization, which is highly beneficial for DART, especially in low-resource settings. Since the proposed approach is the first to utilize differentiable label optimization, these findings illustrate that a suitable label token is important.

5 Analysis and Discussion

To evaluate whether the proposed approach can be applied to other LMs, we conduct experiments using GPT-2-medium (we do not utilize the fluency constraint objective with GPT-2-medium since the model is not pre-trained with an MLM objective). From Figure 2(b), we observe that DART with GPT-2-medium yields better performance than the conventional fine-tuning approach. Furthermore, we notice that DART with GPT-2-medium can achieve performance on par with BERT-large, as observed by Liu et al. (2021c), indicating that the potential of GPT-style architectures for natural language understanding has been underestimated.

Why do Differentiable Prompts Yield Better Performance?

To further analyze why our differentiable prompts method yields better performance compared with prompts with fixed templates and label tokens, we visualize the representation of masked tokens in the CR dataset during different training steps (from left to right) as shown in Figure 3 (fixed) and Figure 4 (differentiable), respectively. While both methods learn separable hidden states, the representation of differentiable prompts is relatively more compact, while the representation generated from fixed prompts is more scattered. This observation of differentiable prompts generating more discriminative representations than the fixed prompts method is supported by an indicator $R_{D}$, the ratio between average intra-class and average inter-class distance. We believe the main reason behind its better performance lies in the more discriminative representation of the differentiable method. More details can be found in Appendix A.6.

What Exactly is Optimized Prompt?

Since prompt templates and label tokens in the proposed approach are mapped as $\{h_{1},...,h_{m},..,h_{m+n}\}$, we further analyze what exactly the optimized labels have learned. We conduct a nearest-neighbor vocabulary embedding search to project the Top-3 optimized pseudo-label tokens in $\mathcal{V}$ to readable natural language (we use t-SNE (Van der Maaten & Hinton (2008)) with normalization to visualize labels on the Wiki80 dataset). For example, "military_branch", marked with a red $\star$ in Figure 5, represents the relation type, which is learned by optimizing the pseudo label in the continuous space, and "volunteered", "corporal", and "buddies", marked with red $\bullet$, are the tokens closest to the label. This finding indicates that the differentiable method generates better semantic representation.

DART vs. Conventional Fine-tuning

The ability of DART to perform few-shot learning can be attributed to the label optimization and to the task being framed as a true language understanding task: once the model is capable of performing it correctly, it can easily apply this knowledge to other tasks that are framed in the same way. Specifically, (i) DART does not optimize any new parameters, whereas conventional fine-tuning must learn an explicit classifier head over [CLS] embeddings, which may fail in the low-data regime; (ii) DART has the same task setting as large-scale language model pre-training.

Conclusion and Future Work

This paper presents DART, a simple yet effective fine-tuning approach that improves the few-shot learning ability of pre-trained language models. The proposed approach produces satisfactory improvements in few-shot scenarios compared with conventional fine-tuning approaches. The proposed method is also pluggable for other language models (e.g., BART) and can be extended to other tasks, such as intent detection and sentiment analysis. Intuitively, the results obtained in this study can be used to stimulate future research directions in few-shot or lifelong learning for NLP.

Acknowledgments

We want to express gratitude to the anonymous reviewers for their hard work and kind comments. This work is funded by National Key R&D Program of China (Funding No.SQ2018YFC000004), NSFCU19B2027/NSFC91846204, Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), Ningbo Natural Science Foundation (2021J190), and Yongjiang Talent Introduction Programme (2021A-156-G).

Reproducibility Statement

Our code is available at https://github.com/zjunlp/DART for reproducibility. Hyper-parameters are provided in Appendix A.1.

References

Appendix A Appendix

Our code is available in the supplementary materials for reproducibility. This section contains details about the training procedures and hyperparameters for each of the datasets. We utilize PyTorch (Paszke et al., 2019) to conduct experiments with one Nvidia 3090 GPU. All optimizations are performed with the AdamW optimizer with a linear warmup of the learning rate over the first 10% of gradient updates to a maximum value, followed by linear decay over the remainder of the training. Gradients are clipped if their norm exceeds 1.0, and weight decay on all non-bias parameters is set to 0.01. Early stopping is adopted to reduce over-fitting on the training set.
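The schedule and clipping rule described above can be sketched in plain Python as follows. The function names are hypothetical, decaying linearly to exactly zero at the last step is an assumption, and a real run would rely on the optimizer and scheduler utilities of the training framework rather than on this code.

```python
import math

def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.1):
    """Learning rate with linear warmup over the first 10% of updates, then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients if their L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Hypothetical run of 100 updates with a peak learning rate of 1e-5.
for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_decay(step, total_steps=100, peak_lr=1e-5))
print(clip_by_global_norm([3.0, 4.0]))
```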

We follow LM-BFF (Gao et al., 2020) to measure the average performance of models trained on 5 different randomly sampled $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{dev}}$ splits, and perform grid search for optimal hyper-parameter combinations on each split, including learning rate, weight decay, and batch size.

For P-tuning (Liu et al., 2021c), due to the limit of search space, we do not set anchor tokens in prompt tokens.

For DART, we adopt joint optimization to acquire optimal prompts and fine-tune over global parameters. Note that we use base prompts as templates of pseudo tokens to accelerate convergence.

To compare fairly, we use RoBERTa-large (Liu et al., 2019) as pre-trained model for both DART and P-tuning framework, following LM-BFF (Gao et al., 2020). We also adopt the best discrete prompts together with label words in LM-BFF as base prompt settings for each framework, as stated below.

SST-2, MR, CR, Subj, TREC, QNLI, MRPC, QQP

The hyper-parameter search space is (the optimal set of parameters may vary across different tasks and data splits):

A.2 Base Prompt and Label Words

prompt template (length = 3): ["text", "it", "was", "<mask>", "."]

label words: {"0": "terrible", "1": "great"}

prompt template (length = 3): ["text", "This", "is", "<mask>", "."]

label words: {"0": "incorrect", "1": "correct"}

prompt template (length = 1): ["<mask>", ":", "text"]

label words: {"0": "Description", "1": "Entity", "2": "Expression", "3": "Human", "4": "Location", "5": "Number"}

prompt template (length = 2): ["text_a", "?", "<mask>", ",", "text_b"]

label words: {"contradiction": "No", "entailment": "Yes", "neutral": "Maybe"}

prompt template (length = 2): ["text_a", "?", "<mask>", ",", "text_b"]

label words: {"not_entailment": "No", "entailment": "Yes"}

prompt template (length = 2): ["text_a", "?", "<mask>", ",", "text_b"]

prompt template (length = 3): ["text", Entity1, "is", "the", "<mask>", "of", Entity2]

label words: {"country_of_origin", "participating_team", "participant_of", …}

A.3 Template Length Analysis

We define the length of a template as the number of tokens excluding the input sentence and the <mask> token, and apply DART on templates of different lengths. The performance for a specific template length $l$ is derived by averaging the accuracy over the few-shot data splits, using template $T=t_{1},t_{2},...,t_{l}$. From Table 4, we observe that for the SST-2 task, the model whose template length is three yields the best performance; however, the overall impact of template length is rather insignificant, as models with different template lengths obtain relatively similar performance.
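The length definition can be checked against the base templates of Appendix A.2 with a few lines of Python. The helper `template_length` and the `PLACEHOLDERS` set are hypothetical; treating the entity slots as part of the input (and hence excluded from the count) is an assumption that matches the lengths listed in A.2.

```python
# Slots that stand for the input or the mask and are therefore not counted.
PLACEHOLDERS = {"text", "text_a", "text_b", "Entity1", "Entity2", "<mask>"}

def template_length(template):
    """Count template tokens, excluding input placeholders and the <mask> token."""
    return sum(1 for tok in template if tok not in PLACEHOLDERS)

sentiment_template = ["text", "it", "was", "<mask>", "."]
pair_template = ["text_a", "?", "<mask>", ",", "text_b"]
relation_template = ["text", "Entity1", "is", "the", "<mask>", "of", "Entity2"]

print(template_length(sentiment_template))
print(template_length(pair_template))
print(template_length(relation_template))
```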

A.4 Performance on Full Training Set

We conduct experiments and report the performance of DART with the full-sized training data of GLUE tasks. From Table 5, we notice that DART obtains better or comparable results compared with standard fine-tuning and LM-BFF, indicating that prompt-based tuning methods benefit less from full-sized data.

A.5 Performance with Constrained Label Tokens

We conduct a nearest-neighbor vocabulary embedding search to project the best optimized differentiable label token to a readable natural token. Those tokens are chosen based on the cosine similarity between all tokens' embeddings and the optimized differentiable label token of DART. We list them in descending order of similarity score (i.e., the token 'great' is chosen as its cosine similarity with the trained positive label embedding of DART is the highest among all tokens, and the token 'terrible' is the most similar token to the trained negative label embedding; the other tokens are selected and listed in descending order of similarity score). From Table 6, we observe that the performance of fixed prompt models is related to the similarity score of the chosen label token, and that the DART model learns a more semantic representation for label tokens and thus yields the best performance.
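A minimal sketch of such a nearest-neighbor search is shown below. The two-dimensional embeddings are toy values invented for illustration; in practice the vectors would come from the word embedding matrix of the language model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_tokens(label_embedding, vocab_embeddings, top_k=3):
    """Return the top_k vocabulary tokens by cosine similarity, highest first."""
    scored = [(cosine(label_embedding, vec), tok) for tok, vec in vocab_embeddings.items()]
    scored.sort(reverse=True)
    return [(tok, score) for score, tok in scored[:top_k]]

# Toy embeddings (hypothetical values, not taken from any model).
toy_vocab = {
    "great": [1.0, 0.1],
    "good": [0.8, 0.3],
    "terrible": [-1.0, 0.0],
    "the": [0.0, 1.0],
}
learned_positive_label = [0.9, 0.1]
neighbors = nearest_tokens(learned_positive_label, toy_vocab)
print(neighbors)
```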

A.6 More Experiments

We quantify our observation on the representation of the masked token with the ratio between the average intra-class distance and the average inter-class distance of hidden state vectors, $R_{D}=\frac{\bar{D}_{intra}}{\bar{D}_{inter}}$, where:

where $\operatorname{distance}$ is the Euclidean metric between two vectors, and $H_{c}[i]$ denotes the hidden state representation of the masked token of the $i$-th sample from class $c$. For a discriminative representation, the average intra-class distance is low, as data points within the same class tend to gather together, and the average inter-class distance is high, as data points from different classes are separated, so its $R_{D}$ ratio should be close to 0.
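The ratio can be computed with the sketch below. Since the defining equations are not reproduced in this text, the exact averaging (a pooled mean over all same-class pairs and over all cross-class pairs) is an assumption, and the two-dimensional hidden states are toy values.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_ratio(hidden_by_class):
    """R_D: mean intra-class distance divided by mean inter-class distance."""
    intra, inter = [], []
    classes = list(hidden_by_class)
    for c in classes:
        # All unordered pairs of samples within the same class.
        intra.extend(euclidean(u, v) for u, v in combinations(hidden_by_class[c], 2))
    for c1, c2 in combinations(classes, 2):
        # All pairs with one sample from each of two different classes.
        inter.extend(euclidean(u, v) for u in hidden_by_class[c1] for v in hidden_by_class[c2])
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

# Toy hidden states: compact, well-separated classes vs. scattered ones.
compact = {"pos": [[0.0, 0.0], [0.0, 1.0]], "neg": [[10.0, 0.0], [10.0, 1.0]]}
scattered = {"pos": [[0.0, 0.0], [0.0, 6.0]], "neg": [[4.0, 0.0], [4.0, 6.0]]}
ratio_compact = distance_ratio(compact)
ratio_scattered = distance_ratio(scattered)
print(ratio_compact, ratio_scattered)
```

In this toy case the compact configuration gives the smaller ratio, in line with the interpretation that values closer to 0 indicate a more discriminative representation.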

As shown in Figure 6, the $R_{D}$ ratio of the differentiable method becomes lower than that of the fixed label method, which shows that the hidden state representation trained with the differentiable method has better linear separability.

Note that in a masked language model, a linear transformation is performed on the hidden state representations, with a linear decoder sharing weights with the model’s word embeddings serving as the final token classifier. Hence it is evident that better linear separability of the representations leads to better performance. In our case, the differentiable method yields better performance due to its better linear separability.

A.7 Limitations

Our work may fail when the distribution of the task corpus differs from that of the pre-training corpus. For example, a general pre-trained language model may need to be fine-tuned with more training instances in a specific domain (e.g., the medical domain). This issue can be addressed by intermediate training (Phang et al., 2018; Yin et al., 2020; Zhao et al., 2021), and will be analyzed in future work. Besides, our work also shows an instability associated with hyper-parameters, which is also observed by Dodge et al. (2020); Zhang et al. (2021); Perez et al. (2021) as the volatility of few-shot learning in NLP. Overall, however, we believe our work will inspire future work on few-shot settings with more practical applications to low-data settings, e.g., those that involve low-resource languages or expert annotation.

A.8 Broader Impact

The pre-train and fine-tune approach has become the standard for natural language processing (NLP). However, supervised fine-tuning still depends on labeled data in practice. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners. We believe that our study makes a significant contribution to the literature because determining the appropriate prompts requires domain expertise, and handcrafting a high-performing prompt often requires impractically large validation sets; these issues are overcome by the proposed method, which is model-agnostic and parameter-efficient. We experimentally verified our proposed approach on 13 standard NLP tasks, and it was seen to outperform several standard NLP platforms.