ZeroGen: Efficient Zero-shot Learning via Dataset Generation

Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, Lingpeng Kong

Introduction

While generating training data with language models is not new to natural language processing Anaby-Tavor et al. (2020); Puri et al. (2020); Kumar et al. (2020), it has garnered enormous interest recently due to the superior generative capacity of large-scale pre-trained language models (PLMs). Training examples created in such a manner have been found effective in various scenarios via the data augmentation procedure (Lee et al., 2021; Schick and Schütze, 2021; Wang et al., 2021; Meng et al., 2022, inter alia).

In this paper, we study an extreme scenario of such an approach, ZeroGen. Given a downstream task, we first generate its training data from scratch using a powerful PLM, whose generation is steered by carefully designed task-specific prompts. Then, we train a tiny task model (TAM), which has orders of magnitude fewer parameters than PLMs, under the supervision of the synthesized training data. Machine-generated text is the only medium that connects the PLMs to the final task models, and no human annotations are required in the entire process. The TAM can be of any choice (e.g., loglinear or neural), allowing efficient inference and deployment (Amazon estimates that 90% of production ML infrastructure costs are for inference rather than training; Jain et al. (2019)). Besides, TAM can be flexibly designed with any task-specific strategies (e.g., inductive bias or loss), which could provide superior performance.

Apart from being annotation-free and efficient, we are also interested in ZeroGen for the following reasons. First, ZeroGen can be seen as a variant of knowledge distillation (KD; Hinton et al. (2015)) that provides some exciting new features. Unlike conventional KD, ZeroGen does not require any human annotations during distillation. Furthermore, ZeroGen makes no presumption about the architecture of student models, so we can conveniently incorporate any task-specific inductive bias into their design. Second, ZeroGen can serve as an unreferenced evaluation method for text generation (Guan and Huang, 2020; Pillutla et al., 2021): the downstream tasks’ performance is dominated by the quality of the synthesized text, and thus can serve as an indirect measure of the generation models and algorithms. Third, ZeroGen sheds new light on prompt engineering (Petroni et al., 2019; Brown et al., 2020) (i.e., the design of the prompts in PLMs). As manual prompts reflect our essential knowledge of specific tasks, an intriguing question here is to what extent we can incorporate human knowledge or instructions in these prompts.

We evaluate ZeroGen on three NLP tasks (text classification, question answering, and natural language inference) across six datasets. Our key research findings are summarized as follows:

The zero-shot performance of TAM in the ZeroGen framework significantly surpasses that of its PLM counterparts (which often serve as the teacher models in the knowledge distillation context), with only $\sim$0.4% of the parameters (§4.2);

In some low-resource settings, TAM trained with synthesized data even outperforms the same model trained with human annotations in a fully supervised manner (§4.3);

The quality of the text generated by known models and algorithms is well reflected in downstream tasks’ performance, and decoding strategies that encourage more diversity also result in greater noise (§4.4);

Prompt engineering is challenging – the performance of more instructive or natural language style prompts varies in different tasks (§4.5).

In conclusion, we argue that ZeroGen is a viable and promising approach towards flexible and efficient zero-shot learning in NLP. It also has great potential as a data-free, model-agnostic knowledge distillation method and as an unreferenced text evaluation method. Our code can be found at https://github.com/HKUNLP/ZeroGen.

Preliminary: Prompt-based Zero-Shot Learning

We start with preliminary knowledge about prompt-based zero-shot learning framework (named Prompting).

Given a pre-trained language model (PLM) $\mathcal{P}$ and a text classification (TC) task $\mathcal{D}=(\mathcal{X},\mathcal{Y})$, Prompting first instantiates a prompt $\mathcal{T}(\cdot)$ with each input $\mathbf{x}_{i}\in\mathcal{X}$ and outputs a natural language sequence to be completed by $\mathcal{P}$. For instance, we show an example on the sentiment analysis task in Figure 1(a), where $\mathbf{x}_{i}$ is "A deep and meaningful film." and $\mathcal{T}(\mathbf{x}_{i})$ is "A deep and meaningful film. The sentiment of the movie review is ". Furthermore, Prompting defines a verbalizer $\mathcal{M}(\cdot)$ that maps each label/class $y_{i}$ to a word/words in $\mathcal{P}$’s vocabulary. For instance, "positive" and "negative" represent the two classes. In this way, Prompting models the probability of class $y_{i}\in\mathcal{Y}$ for $\mathbf{x}_{i}$ as:

$$p(y_{i}\mid\mathbf{x}_{i})=\mathcal{P}\big(\mathcal{M}(y_{i})\mid\mathcal{T}(\mathbf{x}_{i})\big).\tag{1}$$

During the whole process, the pre-trained weights of $\mathcal{P}$ are frozen and no training is required.
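As a minimal sketch of the Prompting procedure described above, the snippet below scores each class by the probability assigned to its verbalizer word after the prompt. The function `toy_next_word_probs` is a hypothetical stand-in for a frozen PLM (it is not a real model and only exists so the example runs offline), and the renormalization over label words is an assumption of this sketch; the template and verbalizer mirror the sentiment example in the text.

```python
from typing import Callable, Dict


# Prompt template T(x) and verbalizer M(y) from the sentiment example in the text.
def template(x: str) -> str:
    return f"{x} The sentiment of the movie review is"


VERBALIZER: Dict[int, str] = {1: "positive", 0: "negative"}


def toy_next_word_probs(prompt: str) -> Dict[str, float]:
    """Hypothetical stand-in for a frozen PLM: returns P(next word | prompt).

    A real system would query a language model here; this toy version only
    looks for a few cue words so that the example is self-contained.
    """
    cues = ("deep", "meaningful", "good", "great")
    hits = sum(cue in prompt.lower() for cue in cues)
    p_pos = (1 + hits) / (2 + hits)
    return {"positive": p_pos, "negative": 1.0 - p_pos}


def prompting_predict(x: str, lm: Callable[[str], Dict[str, float]]) -> int:
    """Return argmax_y P(M(y) | T(x)), renormalized over the label words."""
    probs = lm(template(x))
    scores = {y: probs.get(word, 0.0) for y, word in VERBALIZER.items()}
    total = sum(scores.values())
    posterior = {y: s / total for y, s in scores.items()}
    return max(posterior, key=posterior.get)


prediction = prompting_predict("A deep and meaningful film.", toy_next_word_probs)
```

No parameters are updated anywhere in this sketch, which reflects the fact that the PLM stays frozen throughout.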

The vast linguistic Jawahar et al. (2019); Goldberg (2019); Tenney et al. (2019) and factual Petroni et al. (2019); Jiang et al. (2020b) knowledge encoded in PLMs’ parameters is the key towards Prompting’s success. However, Prompting fails to fully exert the capacity of PLMs and heavily relies on gigantic PLMs during inference. This motivates us to explore a more flexible and efficient way of conducting zero-shot learning with PLMs.

ZeroGen

In this work, we take the dataset generation method to the extreme and study ZeroGen, a flexible and efficient zero-shot learning framework via dataset generation. The ZeroGen framework comprises three sequential stages, as shown in Figure 1(b):

The goal of the first stage is to make use of the generative power of PLMs to synthesize a dataset to solve the downstream task. With carefully designed prompts and a powerful PLM, the generated dataset is believed to incorporate rich task-specific knowledge.

Given the pseudo dataset synthesized as above, we then train a tiny task model (TAM) to solve the task. TAM can integrate any task-specific inductive bias and is also orders of magnitude smaller than PLMs.

Finally, we perform efficient inference on the target task using the trained model. During the whole process, no human annotations are involved, so the evaluation setting is purely zero-shot.

For a single-sentence classification task $\mathcal{D}$, we aim to generate a pseudo dataset $\mathcal{D}^{g}=(\mathcal{X}^{g},\mathcal{Y}^{g})$ with the help of a left-to-right PLM $\mathcal{P}$. We first sample a class label $y^{g}$ from a uniform distribution:

$$y^{g}\sim\mathrm{U}(y_{1},y_{2},\ldots,y_{k}),\tag{2}$$

where $k$ is the number of classes. $y^{g}$ is then wrapped up into a label-descriptive prompt $\mathcal{T}(y^{g})$ to steer the generation of $\mathbf{x}^{g}$:

$$\mathbf{x}^{g}\sim\mathcal{P}\big(\cdot\mid\mathcal{T}(y^{g})\big).\tag{3}$$

Since the parameters of $\mathcal{P}$ are frozen, deterministic decoding would produce the same $\mathbf{x}^{g}$ for a given $y^{g}$; we therefore adopt different sampling algorithms (e.g., Top-k sampling Fan et al. (2018) and nucleus sampling Holtzman et al. (2020)) to increase the diversity of the generated dataset. We then pair the generated $\mathbf{x}^{g}$ with $y^{g}$ to construct a pseudo training example. We show an example of generating a pseudo sentiment classification dataset in Figure 1(b). The prompt $\mathcal{T}(y^{g})$ for a positive label $y^{g}$ is "The movie review in positive sentiment is: "". With the sampling strategies, this prompt steers PLMs to generate multiple sentences, each ending with another quotation mark, e.g., "A deep and meaningful movie."" or "Good film!!!"".
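The generation loop above can be sketched as follows. This is an illustrative outline under stated assumptions, not the released implementation: `toy_generator` is a hypothetical stand-in for sampling a continuation from a frozen PLM, and the prompt wording is taken from the sentiment example in the text.

```python
import random
from typing import Callable, List, Tuple

LABELS = ["positive", "negative"]


def build_prompt(label: str) -> str:
    # Label-descriptive prompt T(y); it ends with an opening quotation mark.
    return f'The movie review in {label} sentiment is: "'


def toy_generator(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for sampling a continuation from a frozen PLM."""
    pool = {
        "positive": ["A deep and meaningful movie.", "Good film!!!"],
        "negative": ["A dull and lifeless movie.", "Bad film!!!"],
    }
    key = "positive" if " positive " in prompt else "negative"
    # The closing quote marks the end of the generated sentence.
    return rng.choice(pool[key]) + '" and some trailing text'


def generate_dataset(
    n: int, generator: Callable[[str, random.Random], str], seed: int = 0
) -> List[Tuple[str, str]]:
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        y = rng.choice(LABELS)               # y ~ Uniform over the k classes
        continuation = generator(build_prompt(y), rng)
        x = continuation.split('"', 1)[0]    # keep text before the closing quote
        dataset.append((x, y))               # pair x^g with y^g
    return dataset


pseudo_data = generate_dataset(8, toy_generator)
```

In practice the generator would wrap a sampling-based decoder, so that repeated calls with the same prompt yield different sentences.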

For sentence-pair classification tasks, we need to generate two sequences that bear certain relationships (e.g., premise and hypothesis in NLI, context and question in QA). We decompose the generation into two steps: (i) We first generate and/or sample a conditional context $\mathbf{c}^{g}$ (e.g., $\mathbf{c}^{g}$ represents the premise in NLI and the context in QA). The context $\mathbf{c}^{g}$ is then concatenated with a sampled label $y^{g}$ and transformed into a prompt $\mathcal{T}(\mathbf{c}^{g},y^{g})$. (ii) Given the prompt $\mathcal{T}(\mathbf{c}^{g},y^{g})$, we can now generate the other sentence $\mathbf{x}^{g}$ (e.g., hypothesis in NLI and question in QA) as in Equation (3). In the current implementation, we sample $\mathbf{c}^{g}$ from an unlabeled corpus, but $\mathbf{c}^{g}$ can also be generated following the procedure for single-sentence classification tasks. Since there may be no predefined label set for the extractive QA task, we use the publicly available spaCy toolkit (https://spacy.io/) to annotate entities, and then uniformly select an entity as $y^{g}$. Finally, the generated sentence pairs and labels form the pseudo dataset $\mathcal{D}^{g}=(\mathcal{C}^{g},\mathcal{X}^{g},\mathcal{Y}^{g})$. We elaborate on the prompts chosen for each task in Section 4.5.
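The two-step procedure for extractive QA might look roughly like the sketch below. Several pieces are assumptions made for illustration: the prompt wording in `build_qa_prompt` is hypothetical, the entity list is supplied by hand rather than by an NER toolkit, and `toy_generator` is a stand-in for the PLM.

```python
import random
from typing import Callable, List, Tuple


def build_qa_prompt(context: str, answer: str) -> str:
    # Hypothetical prompt wording; the prompts actually used are task-specific.
    return f'Context: {context}\nAnswer: {answer}\nQuestion: "'


def generate_qa_example(
    context: str,
    entities: List[str],
    generator: Callable[[str], str],
    rng: random.Random,
) -> Tuple[str, str, str]:
    """Two steps: (i) pick c^g and y^g, (ii) generate x^g from T(c^g, y^g)."""
    answer = rng.choice(entities)            # y^g: uniformly selected entity
    continuation = generator(build_qa_prompt(context, answer))
    question = continuation.split('"', 1)[0].strip()
    return context, question, answer         # one element of (C^g, X^g, Y^g)


def toy_generator(prompt: str) -> str:
    """Hypothetical stand-in for the PLM: writes a question about the answer."""
    answer = prompt.split("Answer: ", 1)[1].split("\n", 1)[0]
    return f'What does the passage say about {answer}?" Extra text'


context = "Marie Curie was born in Warsaw in 1867."
entities = ["Marie Curie", "Warsaw", "1867"]  # would come from an NER toolkit
c, q, a = generate_qa_example(context, entities, toy_generator, random.Random(0))
```

For NLI, the same structure would apply with the premise as the context and the relation label in place of the answer.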

Pseudo-supervised training

With the pseudo dataset $\mathcal{D}^{g}$ , we train a tiny task model TAM to conduct the given task. This procedure is highly flexible, meaning that we can use any model architecture, loss function, and training strategy. In this work, we primarily focus on the overall framework, thus we leave the tuning of these components for future work. Under the zero-shot learning setting, it should be noted that we have no access to the standard validation set. Therefore, we use a portion (e.g., 10%) of the pseudo dataset as the validation set for model selection.
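Holding out part of the pseudo dataset for model selection can be sketched as below. The helper name `split_pseudo_dataset` and the fixed seed are illustrative choices, and the 10% fraction follows the example in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pseudo_dataset(
    data: Sequence[T], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the pseudo dataset and hold out a fraction for model selection."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be in (0, 1)")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(data) * val_fraction))
    val = [data[i] for i in indices[:n_val]]
    train = [data[i] for i in indices[n_val:]]
    return train, val


pseudo = [(f"review {i}", i % 2) for i in range(200)]
train_set, val_set = split_pseudo_dataset(pseudo)
```

Because both splits come from synthetic data, no human-labeled example is touched during training or model selection.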

Zero-shot evaluation

Finally, we conduct inference with the trained TAM. As TAM is orders of magnitude smaller than the PLM, it is able to perform extremely efficient inference.

Experiments

We perform experiments across three different tasks covering six NLP datasets. The detailed experimental setup (i.e., implementation details) is described in Appendix A.

We consider two Text Classification datasets (i.e., SST-2 Socher et al. (2013) and IMDb Maas et al. (2011)), two Natural Language Inference datasets (i.e., QNLI Rajpurkar et al. (2016) and RTE Dagan et al. (2005); Haim et al. (2006); Giampiccolo et al. (2007); Bentivogli et al. (2009)), and two Question Answering datasets (i.e., SQuAD1.1 Rajpurkar et al. (2016) and AdversarialQA Bartolo et al. (2020)). The number of training examples for SST-2 and RTE is 6.9k and 2.5k, which can be considered as low resource compared with IMDb (25k), QNLI (105k), SQuAD (87k) and AdversarialQA (30k). We adopt Exact-Match (EM) and $F_{1}$ as the metrics for question answering tasks and Accuracy for other tasks.
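For readers unfamiliar with the QA metrics, the following is a simplified sketch of Exact-Match and token-level $F_{1}$ in the spirit of the usual SQuAD-style evaluation. It handles a single reference answer per question, and the normalization rules (lowercasing, removing punctuation and English articles) are assumptions of this sketch rather than details stated in the text.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With several reference answers per question, one would typically take the maximum score over the references.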

Baselines

We compare the ZeroGen framework with two baselines: (1) Prompting. The prompt-based zero-shot learning framework via PLMs. We use GPT2 (117M), GPT2-large (762M), and GPT2-XL (1.5B) Radford et al. (2019) via the HuggingFace Transformers library Wolf et al. (2019). (2) Supervised. The TAMs are trained on the standard dataset (i.e., human annotations). Regarding the model architecture of TAMs, we use two types of models for each task: an LSTM-based model (i.e., BiLSTM Hochreiter and Schmidhuber (1997) for TC and NLI tasks, and BiDAF Seo et al. (2017) for the QA task), and a tiny pre-trained model (i.e., DistilBERT Sanh et al. (2019)).

Evaluation Strategy

Due to restricted test set access for some datasets (i.e., SQuAD1.1 and SST-2), we hold out a small subset (i.e., 10%) of the training set for validation for models trained in the Supervised setting, and report results on the validation set. For models trained with the synthetic dataset in the ZeroGen framework, we also use a portion (i.e., 10%) as the validation set, without access to the original validation set. For Prompting, we directly evaluate on the original validation set.

4.2 ZeroGen vs. Prompting

Table 1 compares ZeroGen with the Prompting framework. We observe that ZeroGen significantly outperforms Prompting on most datasets we evaluated, and this superiority is consistent across different PLM generators and TAMs. In particular, when using DistilBERT as TAM, we find that among 18 (3 generators $\times$ 6 tasks) head-to-head comparisons with Prompting, ZeroGen achieves better performance in 15 cases. (Note that with careful prompt design and selection, our Prompting baseline achieves an accuracy of 89.22% with GPT2-XL on the SST-2 dataset, substantially higher than the previous best result, i.e., 87.4% Holtzman et al. (2021).) The reasons for the superior performance are mainly twofold: 1) compared with a general-purpose generation model, a task-specific classification model may encourage a more deterministic decision boundary, which shares the same spirit as entropy minimization Grandvalet and Bengio (2006) or self-training Lee et al. (2013), and 2) classification tasks benefit from the inductive bias in the architecture. For example, a method that predicts the start and end positions greatly narrows down the search space for extractive question answering tasks, in comparison with free generation over the vocabulary space.

Besides its effectiveness in zero-shot learning, ZeroGen is also quite efficient. ZeroGen achieves comparable (LSTM) and even better (DistilBERT) performance than Prompting, using more than 200 times and 20 times fewer parameters, respectively. With increasingly large pre-trained language models (e.g., 175B GPT-3 Brown et al. (2020), 1571B Switch-C Fedus et al. (2021)), the advantage of ZeroGen becomes even more pronounced. The gigantic PLMs can improve the quality of the synthesized dataset and lead to better zero-shot performance. Meanwhile, the TAM can remain lightweight for efficient inference and serving.

Furthermore, when scaling up PLMs, we observe a consistent performance boost for both Prompting and ZeroGen. This suggests that larger-scale PLMs store more knowledge that is useful for generating an accurate dataset for a task.

4.3 ZeroGen vs. Supervised

It’s commonly accepted that the zero-shot performance of an NLP model can lag far behind its fully supervised performance (trained on human annotations). However, we find that ZeroGen even outperforms its Supervised counterpart on the SST-2 and RTE datasets (highlighted with ∗ in Table 1). Our conjecture is that the size of the dataset is the key factor. ZeroGen automatically generates much more data as supervision during training (i.e., 200k synthesized samples vs. 6.9k/2.5k human annotations). These results are encouraging because they suggest that: (i) ZeroGen is quite effective in low-resource scenarios; (ii) it’s possible to synthesize training samples that approximate human annotations in a fully unsupervised manner.

We further investigate if we can trade data volume in exchange for zero-shot performance in ZeroGen. Our results are shown in Figure 2.

Overall, we find that the final performance improves continuously as the amount of data grows, albeit with diminishing returns. We find that generating 10k training samples leads to better performance than the Prompting method on most datasets. In addition, by increasing the data size, we find that ZeroGen even outperforms the Supervised baseline on SST-2 and RTE. Still, on some of the datasets examined (e.g., SQuAD, QNLI), there remains a performance gap between ZeroGen and Supervised.

4.4 ZeroGen as Text Generation Evaluator

The quality of the synthesized text is the key to the performance of the downstream tasks. ZeroGen can thus be seen as an indirect measure of the generation models and algorithms. It is commonly believed that the quality of generated text increases from GPT2 to GPT2-Large to GPT2-XL, due to the growing number of parameters. We find that this trend is well reflected in downstream performance (Table 1).

Besides the model, another important aspect of text generation is the decoding algorithm, where the goal is to achieve better diversity without sacrificing text quality (e.g., fluency, coherence, and correctness). We show how the trade-off between diversity and correctness is reflected in the framework of ZeroGen.

Sampling strategies (e.g., top-k sampling and nucleus sampling) are known to generate text with a higher degree of diversity than other decoding strategies (e.g., greedy search) Fan et al. (2018); Holtzman et al. (2020). Empirical results in Table 2 demonstrate that a more diverse decoding strategy does not always ensure better performance on downstream tasks. For example, nucleus sampling, which is considered to generate the most diverse data, performs nearly 6% and 3% below the best decoding strategy on the SQuAD and QNLI datasets, respectively, while greedy decoding obtains better results than some sampling strategies (e.g., top-k=40, top-k=80 and nucleus sampling). In contrast, all sampling strategies are superior to greedy decoding on the IMDb dataset. Since more diverse decoding strategies do not consistently yield better downstream performance, we hypothesize that diversity may come at a price, such as generating samples that do not pertain to the class described in the prompt. Therefore, we assess the quality of a dataset from two perspectives, Diversity and Correctness, for quantitative analysis of different datasets.
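To make the decoding strategies concrete, the sketch below applies greedy selection, top-k filtering, and nucleus (top-p) filtering to a single next-token distribution. The distribution `next_token` is an invented toy example, and a real decoder would repeat this step for every generated token.

```python
import random
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}


def greedy(probs: Dict[str, float]) -> str:
    """Deterministically pick the most probable token."""
    return max(probs, key=probs.get)


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the (filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy next-token distribution (illustrative values only).
next_token = {"great": 0.5, "good": 0.3, "odd": 0.15, "purple": 0.05}
top2 = top_k_filter(next_token, k=2)
nucleus = nucleus_filter(next_token, top_p=0.9)
```

Greedy decoding always returns "great" here, whereas the nucleus with $p=0.9$ still admits the less likely token "odd", which illustrates how sampling gains diversity at the risk of off-label text.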

Diversity

We follow previous work Holtzman et al. (2020) and compute Self-BLEU Zhu et al. (2018) as a metric of diversity. Self-BLEU is calculated by computing the BLEU score of each generated text using all other generations in the evaluation set as references (specifically, we randomly sample 1000 generations, each of which is compared with all 999 other generations as references). A lower Self-BLEU score implies higher diversity. We report 4-gram based Self-BLEU in the first part of Table 3. We find that decoding strategies such as top-k and nucleus sampling lead to more diverse generations. This finding is consistent with previous works Li et al. (2016); Vijayakumar et al. (2016); Welleck et al. (2020); Holtzman et al. (2020).
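The Self-BLEU computation can be sketched in plain Python as below. This is a simplified illustration, not the evaluation code behind the reported numbers: it uses whitespace tokenization and add-one smoothing of the n-gram precisions, both of which are assumptions of this sketch, so absolute values will differ from those of standard BLEU tooling.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp: List[str], refs: List[List[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU with clipped n-gram counts and add-one smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngrams(ref, n)        # per-n-gram maximum over references
        clipped = sum((hyp_counts & max_ref).values())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) avoids log(0) on short sentences.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty against the reference closest in length to the hypothesis.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(texts: List[str], max_n: int = 4) -> float:
    """Average BLEU of each text against all other texts as references."""
    if len(texts) < 2:
        raise ValueError("Self-BLEU needs at least two texts")
    tokenized = [t.lower().split() for t in texts]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


repetitive = ["a good movie indeed"] * 3
varied = ["a good movie indeed", "the plot was thin", "acting felt wooden throughout"]
```

Calling `self_bleu(repetitive)` gives the maximum score because every text is identical to its references, while `self_bleu(varied)` is much lower, matching the reading that lower Self-BLEU means higher diversity.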

Correctness

Different from the vanilla generation scenario that ends with the generated text, we use the generated text as the training dataset for another small model. Therefore, ZeroGen places more emphasis on the correctness of generated text, i.e., whether the generated text pertains to the corresponding class described in the prompt. To assess the correctness of a synthetic dataset, we first train a RoBERTa-Large Liu et al. (2019) model on the standard training dataset, which is then used as a validator to evaluate the synthetic dataset. In summary, we find a tradeoff between diversity and correctness, i.e., greater diversity leads to lower correctness. Correctness deteriorates further as k increases, while greedy search achieves the highest correctness. These results reflect those of Massarelli et al. (2020), who also found a tradeoff between factuality and diversity, i.e., while decoding strategies such as top-k and nucleus sampling lead to less repetitive generations, they also produce less verifiable text. Besides, among different tasks, we find that the correctness on oracle datasets is similar (i.e., all larger than 90%), while it varies substantially on synthetic datasets (i.e., up to 94.46% on IMDb and merely 31.07% on SQuAD). Compared with generating datasets for single-text classification tasks (e.g., IMDb), where the PLM only needs to consider a single condition (i.e., label), generating for text-pair tasks requires PLMs to consider multiple conditions simultaneously (e.g., answer and context when generating a question), which makes it more difficult to control the correctness of the generated sample. This possibly explains the observed variance among tasks.
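The correctness measure described above reduces to an agreement rate between the prompt label and the validator's prediction, as in the following sketch. Here `toy_validator` is a hypothetical keyword rule standing in for the classifier trained on human-annotated data, and the synthetic examples are invented for illustration.

```python
from typing import Callable, Iterable, Tuple


def correctness(
    dataset: Iterable[Tuple[str, str]], validator: Callable[[str], str]
) -> float:
    """Fraction of synthetic examples whose label agrees with the validator."""
    pairs = list(dataset)
    if not pairs:
        raise ValueError("dataset is empty")
    agree = sum(validator(text) == label for text, label in pairs)
    return agree / len(pairs)


def toy_validator(text: str) -> str:
    """Hypothetical stand-in for a classifier trained on human-annotated data."""
    return "negative" if "bad" in text.lower() else "positive"


synthetic = [
    ("Good film!!!", "positive"),
    ("A bad, lifeless movie.", "negative"),
    ("Bad acting throughout.", "positive"),   # label does not match the text
    ("A deep and meaningful movie.", "positive"),
]
score = correctness(synthetic, toy_validator)
```

Three of the four toy examples agree with the validator, so the score is 0.75; the mismatched third example is the kind of label noise that diverse decoding tends to introduce.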

Human Evaluation

We report the human evaluation results in Table 4. The quality of generated data is measured by the correctness and naturalness metrics. Correctness measures whether the label is correct and the content is relevant to the task topic (e.g., movie review for IMDb). Naturalness measures whether the generated text is fluent and similar to human-generated text. We invite 4 experts to participate in the evaluation, and each participant is randomly assigned 25 generated samples (100 samples in total) for each decoding strategy. Table 4 reports the mean scores. The results show that greedy search achieves the highest correctness, which is consistent with the automatic evaluation using RoBERTa-Large. However, in terms of naturalness/fluency, greedy search performs the worst. The top-k and nucleus decoding strategies generate more fluent text.

4.5 Prompt Engineering in ZeroGen

The design of prompts can have a huge impact on Prompting, as pointed out by many previous works (Mishra et al., 2021a; Wei et al., 2022). In this section, we investigate how prompt design instructs text generation and affects ZeroGen’s performance. We examine three commonly used prompt types: (1) Control code Keskar et al. (2019), (2) Control code with task description, (3) Natural language style. For SST-2 and IMDb, example prompts and corresponding results can be found in Table 5 (check Appendix A for other tasks).

From Table 5, we first observe that natural language prompts are favored by both ZeroGen and Prompting over prompts that contain control codes. We hypothesize that the reason is that during pre-training, the majority of text data fed to the PLMs consists of natural language sentences, and therefore the PLMs do not contain enough knowledge about control codes. Moreover, we observe that ZeroGen is more robust to different prompts than Prompting: for Prompting, a minor change from $P_{4}$ to $P_{5}$ leads to a huge drop in accuracy (16.2% drop on IMDb); for ZeroGen, under the same prompt revision, the drop shrinks to 9.4%. Compared with Prompting, which uses the prompt to directly elicit label words, ZeroGen uses synthesized data as the medium connecting PLM and TAM, thus mitigating the sharp change brought by prompts.

To further explore the potential of prompting, we investigate the two-stage conditional prompt inspired by Schick and Schütze (2021). In the running example, based on the task characteristic (to generate a movie review), we first generate a movie name using the prompt [Movie: "] and then generate the review sentence using $P_{4}^{\prime}$. We find that with the control of the movie name, the generated training corpus is more diverse than when using $P_{4}$. With the desirable correctness (see Table 3), the higher diversity leads to a higher accuracy (from 81.84 to 83.40 on IMDb).

The most suitable prompt type for Question Answering and Natural Language Inference tasks differs somewhat from that for Text Classification due to different task characteristics. For details, please refer to Appendix B.

4.6 ZeroGen via Larger PLM Generator

We further investigate the performance of ZeroGen with a larger PLM (i.e., OPT (Zhang et al., 2022) with 175B parameters). We find that both Prompting and ZeroGen benefit from the larger PLM on hard tasks (i.e., SQuAD). But on relatively simpler text classification tasks, the results degrade. This demonstrates that prompt selection is still important for larger models, and a prompt that suits one model (e.g., GPT2-XL) may not suit another (e.g., OPT).

Related Work

With manually crafted natural language prompts, large-scale PLMs have shown impressive zero-shot abilities in a wide array of NLP tasks Radford et al. (2019); Brown et al. (2020). However, current prompt-based zero-shot learning can be unstable: the choice of prompt contributes a lot to the final performance. This motivates researchers to investigate better ways to automatically search for and/or manually construct a proper prompt Jiang et al. (2020a); Shin et al. (2020); Reynolds and McDonell (2021); Mishra et al. (2021b). To improve zero-shot generalization across different prompts, another line of work uses a multitask training mixture made up of a large set of different tasks specified in natural language prompts Khashabi et al. (2020); Zhong et al. (2021); Mishra et al. (2021c); Wei et al. (2021); Sanh et al. (2021); Xu et al. (2022). This induces a model to better generalize to unseen tasks and to be more robust to the wording choices of the prompts. In comparison, we advocate and analyse a new paradigm for prompt-based zero-shot learning via dataset generation, which is complementary to current prompt searching and multi-task pre-training methods.

5.2 Dataset Generation with PLMs

Our work also relates to research on generating data with PLMs, which aims to generate a pseudo dataset to enhance model performance. Early efforts achieve this goal with fine-tuned generative models Anaby-Tavor et al. (2020); Puri et al. (2020); Kumar et al. (2020); Lee et al. (2021). They first fine-tune the generative models using human annotations; the generated data samples are then combined with human annotations to train the models in a semi-supervised fashion. Supervised data generation methods are also studied for building auxiliary tasks Vu et al. (2021) and for dataset creation based on human and machine collaboration Liu et al. (2022). To reduce the human effort spent on data annotation, another line of work explores data generation methods without the need for human annotations. He et al. (2021) use unconditional generative models trained without supervision to synthesize unlabeled data for semi-supervised learning. Without any model training, Wang et al. (2021) propose to directly use unlabeled in-domain examples as prompts to synthesize high-quality training data. Schick and Schütze (2021) explore dataset generation from scratch for the semantic textual similarity task. One concurrent work Meng et al. (2022) studies dataset generation for text classification and natural language inference tasks. In comparison, we take the dataset generation framework to the extreme, i.e., we consider extremely tiny edge models (e.g., LSTM), explore broader NLP tasks including question answering, and conduct extensive analysis of aspects such as decoding strategies and quality evaluation.

Conclusion and Future Directions

In this paper, we study an extreme instance of dataset generation via PLMs for zero-shot learning. Without any human annotations, we show that a small LSTM can surpass the zero-shot performance of its PLM counterparts (e.g., GPT2-XL), and even outperform the same model trained with human annotations. Despite the demonstrated effectiveness, we discuss several issues we observed when developing ZeroGen and reveal substantial room for improvement in future research.

Despite positive results on TC tasks, we find the stability regarding prompt choice of ZeroGen is still far from satisfactory on NLI tasks. Future work could include multi-task prompt-based pre-training methods Sanh et al. (2021); Wei et al. (2021).

Furthermore, we observe noisy examples in the synthetic dataset on difficult tasks such as NLI and QA; this situation progressively deteriorates when incorporating a more diverse decoding strategy (e.g., Nucleus Sampling). Better decoding strategies are needed to ensure label correctness while preserving dataset diversity Massarelli et al. (2020). Besides, methods that learn from noisy labels can be integrated into the training of the tiny task model Song et al. (2020).

We hope this paper can provide contributions for further exploiting dataset-generation-based zero-shot learning with large pre-trained language models.

Limitations

Although ZeroGen achieves promising performance under the zero-shot learning setting, it does come with certain limitations. We find that the stability of ZeroGen with respect to prompt choice is still far from satisfactory. ZeroGen underperforms Prompting with certain prompts, and prompt engineering is difficult because the preferred prompts differ across tasks. Future work may include multi-task prompt-based pre-training methods (Sanh et al., 2021; Wei et al., 2021) to improve prompt robustness.

We also observe noisy examples in the synthetic dataset on difficult tasks such as NLI and QA; this situation progressively deteriorates when incorporating a more diverse decoding strategy (e.g., Nucleus Sampling). Better decoding strategies are needed to ensure label correctness while preserving dataset diversity Massarelli et al. (2020). Alternatively, methods that learn from noisy labels can be integrated into the training of the tiny task model Song et al. (2020).

Acknowledgement

We thank the anonymous reviewers whose suggestions helped clarify this work. This work is partially supported by the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100), and the joint research scheme of the National Natural Science Foundation of China (NSFC) and the Research Grants Council (RGC) under grant number N_HKU714/21.

References

Appendix A Experimental Setup

For dataset generation, we use Nucleus Sampling Holtzman et al. (2020) with $p=0.9$ by default, as it is considered able to generate both fluent and diverse text. The scale of the synthetic dataset is 200k in the main results, and 100k in other analysis experiments. Regarding prompt selection, we manually design a series of prompts for each task, and report results on the best prompt for the Prompting and ZeroGen frameworks. For NLI tasks, we adopt the self-debiasing mechanism with a decay constant of 200 Schick et al. (2021) to ensure that each generated text pair is not only a good fit for a given label, but also not a good fit for other labels Schick and Schütze (2021). We remove overly short/long sentences, and sentences without an ending quotation mark.
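The final filtering step might be implemented along the lines of the sketch below. The word-count thresholds are illustrative assumptions (the text does not state the exact limits), and the helper names are hypothetical.

```python
from typing import List, Optional


def extract_sentence(continuation: str) -> Optional[str]:
    """Return the text before the closing quotation mark, or None if absent."""
    if '"' not in continuation:
        return None
    return continuation.split('"', 1)[0].strip()


def filter_generations(
    continuations: List[str], min_words: int = 3, max_words: int = 50
) -> List[str]:
    """Drop generations with no closing quote or with too few or too many words.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for continuation in continuations:
        sentence = extract_sentence(continuation)
        if sentence is None:
            continue
        if min_words <= len(sentence.split()) <= max_words:
            kept.append(sentence)
    return kept


raw = [
    'A deep and meaningful movie." The end',
    'Good" ...',                            # too short
    "This review never closes its quote",   # no ending quotation mark
]
clean = filter_generations(raw)
```

Only the first toy continuation survives: the second is shorter than the minimum length and the third never produces a closing quotation mark.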

We implement an LSTM-based model and a DistilBERT model as TAM. For the LSTM-based model, we use the Adam optimizer Kingma and Ba (2015), a learning rate of 1e-4, an embedding dimension of 100, and a hidden size of 300. For single-sentence classification (TC), we use a 1-layer BiLSTM to encode the sentence and use a linear classifier. For sentence-pair classification (NLI), we use a 2-layer BiLSTM to encode the sentences to $v_{1},v_{2}$ and pass the concatenated [ $v_{1};v_{2};|v_{1}-v_{2}|;v_{1}*v_{2}$ ] to a classifier. For QA tasks, we use the 1-layer BiDAF model. To ensure that TAMs are truly trained from scratch using the synthetic corpus, we randomly initialize the TAMs’ embeddings without using any pre-trained word embeddings (e.g., GloVe Pennington et al. (2014)). For DistilBERT, we fine-tune on each dataset with the Adam optimizer, with a learning rate of 2e-5, a weight decay of 0.01, and other default hyper-parameters as suggested by the HuggingFace Transformers library Wolf et al. (2019). We run experiments on a single NVIDIA A100 80G GPU, and generating 200k examples costs 12h on average.
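The sentence-pair feature vector passed to the NLI classifier can be illustrated with plain Python lists, as below. The sketch assumes that $v_{1}*v_{2}$ denotes the element-wise product, which is the usual reading of this construction but is not spelled out in the text; a real model would operate on tensors produced by the BiLSTM encoder.

```python
from typing import List


def pair_features(v1: List[float], v2: List[float]) -> List[float]:
    """Build [v1; v2; |v1 - v2|; v1 * v2] from two sentence encodings."""
    if len(v1) != len(v2):
        raise ValueError("encodings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(v1, v2)]
    product = [a * b for a, b in zip(v1, v2)]   # element-wise product (assumed)
    return v1 + v2 + abs_diff + product


features = pair_features([1.0, -2.0], [0.5, 3.0])
```

For encodings of dimension $d$, the classifier input therefore has dimension $4d$.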

Appendix B Additional Results on Prompt Design

For Question Answering tasks, the natural language style prompt is also the most suitable for both Prompting and ZeroGen settings, achieving the highest scores. However, for Natural Language Inference tasks, the most suitable prompts for QNLI and RTE are different. For RTE, the natural language style prompt is best, while the control code prompts perform significantly better than natural language style prompts in QNLI.

Appendix C Additional Related Work on Efficient Inference of PLMs

There is a line of work dedicated to improving the inference efficiency of PLMs, including pruning Wang et al. (2020); Gordon et al. (2020), low-rank factorization Ma et al. (2019); Noach and Goldberg (2020); Lan et al. (2020), quantization Zafrir et al. (2019); Shen et al. (2020); Kim et al. (2021), knowledge distillation Jiao et al. (2020); Sanh et al. (2019); Sun et al. (2020) and parallel decoding Gu et al. (2018); Ghazvininejad et al. (2019); Ye et al. (2021). We refer the readers to Xu et al. (2021) for a detailed survey. Motivated by privacy, copyright, or confidentiality concerns, data-free knowledge distillation (DFKD) Liu et al. (2021) has attracted attention in the computer vision field, as it distills valuable knowledge from well-trained models without requiring access to the training data. However, similar approaches are difficult to apply in NLP due to the discrete nature of words. Rashid et al. (2021) relax the data-free condition and use out-of-distribution labeled data to train a generator. By contrast, our method generates data with the PLMs (i.e., the teacher), without requiring any pre-defined labeled data. From the perspective of knowledge distillation, the ZeroGen framework can produce a student model that achieves zero-shot performance superior to that of the teacher model.

Appendix D ZeroGen as Knowledge Distillation

ZeroGen can be seen as a dataset-based knowledge distillation framework. We compare vanilla knowledge distillation baselines with ZeroGen in Table 7. The soft and hard labels are generated by GPT2-XL on the unlabeled training set. The generated labels are used to train a tiny task model for comparison. The superior results of ZeroGen show that the paradigm can better utilize large PLMs by distilling more knowledge into a large number of input-output pairs, while vanilla knowledge distillation distills knowledge purely into outputs.

Appendix E ZeroGen for Data Augmentation

We report the results of using the synthetic data as augmentation data in Table 8. The results show that the zero-shot synthetic data is a good supplement to human-annotated data (gold data) and can improve model performance.

Appendix F ZeroGen for Self-improving

We have shown that a tiny task model can outperform a large PLM after training on the synthetic dataset. A natural question is "Can a PLM improve its own performance after tuning on the dataset generated by itself?". We experiment using the PLM as TAM and report the results in Table 9. To summarize, we find that 1) a larger TAM can further boost performance; 2) PLMs can improve themselves by fine-tuning on datasets they generate.

Appendix G Generated Examples

We present some qualitative examples for different tasks in Appendix Table 11. The text classification task (SST-2) is relatively simple and concise; the generated samples generally fit the prompts and sentiment polarity well by using descriptive tokens about the given movie name and positive/negative sentiment. Take the first case in SST-2 as an example: the generated tokens “action-adventure” and “attractive” are natural continuations for the movie name “The Spiderwick Chronicles (Movie)” and the “positive” sentiment in the prompt. Although natural language inference tasks are complex, the generated questions (QNLI) and inferences (RTE) respond to different types of prompts and relate to the given contexts (e.g., the generated question drifts off topic for the prompt “Information:…Question (answer not in above information)” in QNLI). While the context of the question answering task (SQuAD) is long and contains a lot of information, ZeroGen successfully generates the question “Who is the one and only true God ?”, which corresponds to the pre-set answer “Jehovah”. Overall, these generation examples show that ZeroGen can generate an arbitrary number of useful training samples that can be used to train TAMs.