Neural Data Augmentation via Example Extrapolation

Kenton Lee, Kelvin Guu, Luheng He, Tim Dozat, Hyung Won Chung

Introduction

Data collection is a noisy process, and there are often significant mismatches between training and test distributions, leading to certain slices of data being underrepresented in the training set. For example, developers of a dialog agent may regularly add new “intents” to their system’s set of capabilities, but data collection for each new intent often lags behind Bapna et al. (2017); Gaddy et al. (2020). More generally, this issue can be a chronic problem for tasks with constantly expanding output spaces, such as relation extraction Han et al. (2018) and entity linking Logeswaran et al. (2019), or particularly long-tail output spaces, such as fine-grained image classification Akata et al. (2015). In such situations, existing systems can severely underperform on underrepresented slices of the data due to the incorrect prior probability of predicting them.

Data augmentation is a popular solution for biased or imbalanced data, either by duplicating examples or using heuristics to synthesize new examples Perez and Wang (2017). However, these heuristics may not scale well and are poor approximations of the complexity of real examples.

In this paper, we propose an approach for learned data augmentation that uses a neural Example Extrapolator (Ex2) to synthesize new examples (illustrated in Figure 1). Ex2 takes as input a handful of examples (“exemplars”) drawn from an underrepresented slice of data and learns to synthesize new examples that fall within the same slice and distribution as the exemplars (Step C in Figure 1). Ex2 learns to extrapolate to new examples by simulating this procedure using random subsets of the training data that already have a large number of examples (Step B in Figure 1).

Our approach has strong connections to several recent works on using language models for data augmentation Kumar et al. (2019) and zero-shot learning Brown et al. (2020), as well as methods for few-shot learning via nearest-neighbor models Snell et al. (2017); Vinyals et al. (2016). We discuss these connections at length in Section 5.

We apply Ex2 to several language understanding tasks that contain few-shot slices of data, including relation extraction (FewRel) and intent classification/slot-filling tasks (CLINC150 and SNIPS). By correcting for the underrepresentation of those slices with Ex2 data augmentation, we significantly improve state-of-the-art methods.

Approach

Throughout this paper, we focus on applying Ex2 to standard supervised learning tasks. Our approach consists of the following high-level steps:

Organize training data into multiple slices.

Train an example extrapolator using data from those slices.

Use the example extrapolator to generate new synthetic data for underrepresented slices of the dataset.

Train a model on the union of the synthetic data and real data.

The core of the approach is the example extrapolator, which is a generative model that aims to recover the full distribution of examples given only a few samples from that distribution. During inference (Step C in Figure 1), the extrapolator takes as input the concatenation of $K$ gold examples that come from an underrepresented slice of the dataset and generates new examples that belong to the same slice. To train this extrapolator (Step B in Figure 1), we simulate this procedure by randomly selecting $K+1$ gold examples from a data-rich slice and optimize the log-likelihood of one of the examples given the other $K$ examples.

The synthetic data sampled by performing inference on the underrepresented slices can then be combined with existing data, which is applicable to any supervised learning setup.

The rest of this section motivates and formalizes this approach.

2 Formal Definitions

We denote a training example as $e=(x,y)$ , where $x$ is the input and $y$ the output. In a text classification task, for example, $x$ would be a snippet of text (e.g., “Play a song”), and $y$ the class (e.g., PlayMusic).

In many tasks, there is a natural way to slice data into different subsets of interest. For example, a slice could be the set of all examples sharing a given label, or all examples in a particular language, or with a particular syntactic construction. Ex2 makes no assumptions about how data is sliced — for any given application, it is up to the practitioner to slice the data in a way that exposes important but underrepresented slices, which Ex2 can then target for data augmentation.

To formalize the notion of slicing, we assume that the practitioner defines a list of $S$ slicing functions, $\mathtt{slice_{s}}$ for $s=1,\ldots,S$ , where each function $\mathtt{slice_{s}}(e)$ is a Boolean function indicating whether example $e$ belongs in slice $s$ (potentially overlapping with other slice functions). For example, a text classification slicing function that groups all examples with the same label $c$ would be $\mathtt{slice}((x,y))\overset{{\scriptscriptstyle\text{def}}}{=}\delta(y=c)$ .

Given a dataset $D$ , we define the $s^{th}$ slice of that dataset to be $D_{s}\overset{{\scriptscriptstyle\text{def}}}{=}\{e\in D\mid\mathtt{slice_{s}}(e)=\mathtt{true}\}$ . For a set of slices $S$ , we also define $D_{S}\overset{{\scriptscriptstyle\text{def}}}{=}\displaystyle\bigcup_{s\in S}D_{s}$ .
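As an illustration of these definitions, the following minimal sketch implements label-based slicing functions, a slice $D_{s}$, and a union of slices $D_{S}$. The helper names (`label_slice`, `take_slice`, `union_of_slices`) and the toy data are assumptions made for this example and do not come from the paper.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (x, y): input text and output label
SliceFn = Callable[[Example], bool]


def label_slice(c: str) -> SliceFn:
    """Slicing function that is true for examples whose label equals c."""
    return lambda e: e[1] == c


def take_slice(dataset: List[Example], slice_fn: SliceFn) -> List[Example]:
    """D_s: the examples of D for which the slicing function is true."""
    return [e for e in dataset if slice_fn(e)]


def union_of_slices(dataset: List[Example], slice_fns: List[SliceFn]) -> List[Example]:
    """D_S: examples belonging to at least one of the given slices.

    Each dataset entry is kept at most once, even if slices overlap.
    """
    return [e for e in dataset if any(fn(e) for fn in slice_fns)]


# Toy dataset (hypothetical).
toy = [
    ("play a song", "PlayMusic"),
    ("put on some jazz", "PlayMusic"),
    ("will it rain today", "GetWeather"),
    ("book a table for two", "BookRestaurant"),
]
play_music = take_slice(toy, label_slice("PlayMusic"))
music_or_weather = union_of_slices(
    toy, [label_slice("PlayMusic"), label_slice("GetWeather")]
)
```

Because slicing functions are plain Boolean predicates, slices defined by language or syntactic construction could be expressed the same way.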

Few-shot versus many-shot.

We will assume that underrepresented slices have only a few examples each, so we refer to these as few-shot slices (denoted as $F$ ): we will perform data augmentation for these slices. We call the remaining slices many-shot slices (denoted as $M$ ): these have enough data and will not receive data augmentation. The example extrapolator is trained with $M$ only and used to infer new examples in $F$ despite never having seen any examples in $F$ during training.

It is important to note that we refer to “few-shot” to mean that there are slices of the data within the task that have very few examples. The other notion of few-shot learning, where there are overall few examples for the entire task, is outside of the scope of our experiments.

3 Example extrapolation (Ex2)

With a formal notion of slices, we can now define the example extrapolation task. First, let $p(e)$ denote the true underlying distribution over examples. And for a given slice $s$ , let $p(e\mid s)\overset{{\scriptscriptstyle\text{def}}}{=}p(e\mid\mathtt{slice_{s}}(e)=\mathtt{true})$ be the distribution of examples restricted to that slice.

In order to generalize to new, unseen slices, we featurize $s$ with a random sample of $K$ examples from slice $s$ , denoted as $e_{1:K}$ . The example extrapolation task is to model the full distribution of slice $s$ given only those exemplars:

$$p(e\mid e_{1:K})$$

When deciding how many exemplars $K$ to condition on, it is important to ensure that they are enough to illustrate the intra-slice variance; we expect that conditioning on a single exemplar will generally be insufficient. Section 4 explores varying the size of $K$ .

Training procedure.

Optimization of the example extrapolator is straightforward once we define the inputs and outputs.

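Given a training set $D$ , let $D_{1},\ldots,D_{S}$ denote its $S$ different slices. Let $e_{1:K}\overset{{\scriptscriptstyle\text{wr}}}{\sim}D_{s}$ denote a sample of $K$ examples from $D_{s}$ , drawn uniformly without replacement. Writing $p_{\theta}$ for the extrapolator with parameters $\theta$ , the training objective is to maximize:

$$\mathbb{E}_{s\sim p(s)}\;\mathbb{E}_{e^{*},e_{1:K}\overset{{\scriptscriptstyle\text{wr}}}{\sim}D_{s}}\left[\log p_{\theta}(e^{*}\mid e_{1:K})\right]$$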

where the term $p(s)$ is a user-defined prior probability of each slice, which we estimate empirically from the training data in our experiments.

To optimize this objective, we iterate over all training slices ( $s\in M$ ), and every example ( $e^{*}$ ) in each slice. For each example, we sample $K$ other examples ( $e_{1:K}$ ) from the same slice, excluding $e^{*}$ itself. We then optimize the log-likelihood of $e^{*}$ as output given $e_{1:K}$ as input.
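The iteration described above can be sketched as follows. This is only an illustration of how (exemplars, target) training instances might be assembled, assuming examples are already serialized as strings; the function name `make_training_pairs` and the choice to skip slices with fewer than $K+1$ examples are assumptions of this sketch.

```python
import random
from typing import Dict, List, Tuple


def make_training_pairs(
    slices: Dict[str, List[str]], k: int, rng: random.Random
) -> List[Tuple[List[str], str]]:
    """Build one (K exemplars -> target) instance per example in each slice."""
    pairs = []
    for examples in slices.values():
        if len(examples) < k + 1:
            continue  # assumption: a slice needs K exemplars plus one target
        for i, target in enumerate(examples):
            # Exclude the target itself, then draw K others uniformly
            # without replacement.
            others = examples[:i] + examples[i + 1:]
            exemplars = rng.sample(others, k)
            pairs.append((exemplars, target))
    return pairs


toy_slices = {"many": ["a", "b", "c", "d"], "tiny": ["x", "y"]}
pairs = make_training_pairs(toy_slices, k=2, rng=random.Random(0))
```

Resampling the exemplars on every epoch, rather than once, would expose the model to more exemplar combinations; the text does not say which variant was used.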

Model implementation.

We implement our example extrapolator as a neural sequence-to-sequence model. In particular, we use T5 Raffel et al. (2020), a text-to-text Transformer model Vaswani et al. (2017) that was pre-trained on a large text corpus. This provides the network with a large amount of world knowledge, which is crucial for the model’s ability to extrapolate beyond the given examples. For example, the last example in Table 1 requires extrapolating ‘‘New Beaver’’ and ‘‘Keeneland’’ from the input exemplars to ‘‘Steven’s Pass’’ in the output, which requires some world knowledge that pre-trained models are known to contain Petroni et al. (2019); Roberts et al. (2020). We show that this pre-training is crucial for an effective Ex2 model in Section 4.

Exemplar (de)serialization

Since T5 operates over plain text inputs and outputs, we must represent the input exemplars $e_{1:K}$ and the output $e^{*}$ as text. For any given task, we assume the user provides a function $\mathtt{to\_text}$ that maps a single example to a string, and a function $\mathtt{from\_text}$ that maps a string back to an example.

An important subtlety in the $\mathtt{to\_text}$ function is whether the extrapolator is allowed to “cheat” when determining the boundaries of the slice. Suppose we are using Ex2 for text classification, with our data sliced by label, and suppose we specify the $\mathtt{to\_text}$ function to prepend the label name to the input sentence (e.g. ( $x=$ “play a song”, $y=$ PlayMusic) is mapped to “PlayMusic: play a song”). On the one hand, the model may be able to take advantage of the semantics of the label name, gleaned from pre-training. On the other hand, it will be easier for the extrapolator to determine the properties of the slice by memorizing the label and ignoring everything else. This challenge is analogous to the task memorization associated with meta-learning algorithms Yin et al. (2020), where leaking task-level information to the meta-learner results in poor generalization.

We hypothesize that the benefits of anonymization outweigh the losses, so we ensure that $\mathtt{to\_text}$ anonymizes any slice information, and that $\mathtt{from\_text}$ can project the anonymized generation back to a fully realized example. Examples of the anonymization strategy for each task are shown in Table 1. We explore this hypothesis empirically in Section 4.
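To make the anonymization idea concrete for label-sliced text classification, here is a minimal sketch. The separator `" | "`, the factory `make_codec`, and the exact string format are hypothetical; the formats actually used are those in Table 1, which is not reproduced here.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]  # (x, y)
SEP = " | "  # hypothetical separator between serialized exemplars


def make_codec(label: str) -> Tuple[Callable[[Example], str],
                                    Callable[[str], Optional[Example]]]:
    """Return (to_text, from_text) for the slice of examples labeled `label`."""

    def to_text(e: Example) -> str:
        x, y = e
        if y != label:
            raise ValueError("example does not belong to this slice")
        return x.strip()  # the label name is deliberately left out

    def from_text(s: str) -> Optional[Example]:
        s = s.strip()
        if not s or SEP.strip() in s:
            return None  # empty or malformed generation: treat as invalid
        return (s, label)  # re-attach the slice's label

    return to_text, from_text


def serialize_exemplars(exemplars: List[Example],
                        to_text: Callable[[Example], str]) -> str:
    """Concatenate anonymized exemplars into one input string."""
    return SEP.join(to_text(e) for e in exemplars)


to_text, from_text = make_codec("PlayMusic")
source = serialize_exemplars(
    [("play a song", "PlayMusic"), ("put on jazz", "PlayMusic")], to_text
)
```

Since the label never appears in the serialized text, the extrapolator can only identify the slice from the exemplars themselves, and `from_text` restores the label afterwards.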

4 Using Ex2 for data augmentation

Our example extrapolator enables us to take $K$ examples from a slice and generate additional examples from the same slice. Concretely, given a slice $D_{s}$ , we sample $K$ exemplars without replacement, $e_{1:K}\overset{{\scriptscriptstyle\text{wr}}}{\sim}D_{s}$ , feed them into the trained extrapolator $p_{\theta}$ , then randomly sample a new example $\tilde{e}$ from it:

$$\tilde{e}\sim p_{\theta}(\cdot\mid e_{1:K})$$

By repeatedly sampling in this fashion, we can produce an arbitrary number of new labeled examples, discarding any invalid ones that cannot be parsed by $\mathtt{from\_text}$ .
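A minimal sketch of this sampling loop is shown below. The trained extrapolator is replaced by a toy stand-in (`toy_sampler`) that occasionally returns an unparseable string; the function names, the `max_attempts` cap, and the toy parser are all assumptions for illustration.

```python
import random
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]


def augment_slice(
    slice_texts: List[str],
    k: int,
    n_new: int,
    sample_fn: Callable[[List[str]], str],
    from_text: Callable[[str], Optional[Example]],
    rng: random.Random,
    max_attempts: int = 1000,
) -> List[Example]:
    """Generate up to n_new synthetic examples, discarding invalid ones."""
    synthetic: List[Example] = []
    attempts = 0
    while len(synthetic) < n_new and attempts < max_attempts:
        attempts += 1
        exemplars = rng.sample(slice_texts, k)  # K exemplars, no replacement
        parsed = from_text(sample_fn(exemplars))
        if parsed is not None:  # keep only generations that parse
            synthetic.append(parsed)
    return synthetic


def toy_sampler(exemplars: List[str]) -> str:
    """Stand-in for a trained extrapolator; every third call is invalid."""
    toy_sampler.calls += 1
    if toy_sampler.calls % 3 == 0:
        return ""
    return "please " + exemplars[0]


toy_sampler.calls = 0


def toy_from_text(s: str) -> Optional[Example]:
    """Hypothetical parser for a slice labeled PlayMusic."""
    return (s, "PlayMusic") if s.strip() else None


synthetic = augment_slice(
    ["play a song", "put on jazz", "start the music"],
    k=2, n_new=5, sample_fn=toy_sampler, from_text=toy_from_text,
    rng=random.Random(0),
)
```

In practice `sample_fn` would wrap random sampling from the fine-tuned sequence-to-sequence model, and `n_new` would be chosen per slice.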

The amount of data generated for each slice is up to the user, but would ideally correct for the under-representation and reflect the true underlying distribution of the slices.

For ease of discussion, we may also refer to the example extrapolator as the “teacher”, and the downstream model as the “student”. This terminology is deliberately reminiscent of model distillation Tarvainen and Valpola (2017), where a “teacher” is used to label a large number of unlabeled inputs ( $x$ ’s) to be consumed by a “student”. The Ex2 approach is similar, except that the teacher does not label pre-existing $x$ ’s and instead synthesizes completely new $(x,y)$ pairs.

Experiments

To validate the generality of the Ex2 recipe, we evaluate our approach on a range of different language understanding tasks: text classification (a simple setup that resembles our running example), intent classification + slot-filling (a more complex task with a structured output space), and relation extraction (a highly multi-class problem with strong prior work in the few-shot setting).

Across all three tasks, our results consistently show that a model trained with Ex2 data augmentation outperforms our baselines. In the cases of SNIPS and especially relation extraction, where strong published baselines are available, we achieve a new state of the art.

In our experiments, we explicitly designate certain slices of the dataset as few-shot and the others as many-shot. Furthermore, we define the few-shot split of a dataset $D_{F}$ to be the set of all examples belonging to a few-shot slice, and the many-shot split $D_{M}$ to be all other examples. Table 2 gives the shorthand notation we use for these splits which are further sub-divided into Train, Development and Test.

For relation extraction, prior work had already designated certain slices as few-shot — we consider the same ones for direct comparison. For intent classification/slot-filling, we cross-validate by running one experiment for each slice in the dataset, where that slice is designated the few-shot one and its training set is artificially truncated to $K$ examples. In all cases, the Train/Dev/Test axis of our splitting follows the original benchmarks.

Evaluation.

When reporting downstream student model performance, we consider both Overall performance (averaging across $D_{M}\cup D_{F}$ ) and Few-shot performance (averaging only over $D_{F}$ ). Tables in this section report the overall and few-shot test performance.
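The two reporting modes can be illustrated with a small sketch, shown here for accuracy only. The function names and the toy labels are assumptions; an example is treated as few-shot when its gold label belongs to a few-shot slice.

```python
from typing import List, Set, Tuple


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (gold, predicted) pairs that match."""
    if not pairs:
        raise ValueError("no examples to evaluate")
    return sum(g == p for g, p in pairs) / len(pairs)


def overall_and_few_shot(
    gold: List[str], pred: List[str], few_shot_labels: Set[str]
) -> Tuple[float, float]:
    """Accuracy over all examples, and over few-shot examples only."""
    all_pairs = list(zip(gold, pred))
    few_pairs = [(g, p) for g, p in all_pairs if g in few_shot_labels]
    return accuracy(all_pairs), accuracy(few_pairs)


gold = ["A", "A", "B", "C"]
pred = ["A", "B", "B", "A"]
overall, few = overall_and_few_shot(gold, pred, few_shot_labels={"B", "C"})
```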

Baselines.

The output of Ex2 is simply additional synthetic data, which must then be consumed by the downstream student model. To measure the contribution of this additional data, we always compare between the same student configuration. We use the overall accuracy of $D_{M,\text{dev}}\cup D_{F,\text{dev}}$ for early stopping for FewRel, and overall macro F1 for the other tasks. The only difference between the following setups is the data that the student is trained on:

Baseline: The student only trains on the original data without any augmentation ( $D_{M,\text{train}}~{}\cup~{}D_{F,\text{train}}$ ).

Upsampled: The student trains on original data ( $D_{M,\text{train}}~{}\cup~{}D_{F,\text{train}}$ ), but the examples from the few-shot slices $D_{F,\text{train}}$ are up-sampled to match the median frequency of the many-shot slices.

All other aspects of the model are held fixed across these setups. When previously published results for a task are available, we also compare against other model types.
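As a rough illustration of the Upsampled setup, the sketch below repeats the examples of each few-shot slice until the slice reaches the median size of the many-shot slices. Cycling through the examples in order, and truncating a fractional median to an integer, are assumptions of this sketch; the text does not specify how the duplication is performed.

```python
import statistics
from itertools import cycle, islice
from typing import Dict, List


def upsample_few_shot(
    many_shot: Dict[str, List[str]], few_shot: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Duplicate few-shot examples up to the median many-shot slice size."""
    target = int(statistics.median(len(v) for v in many_shot.values()))
    upsampled = {}
    for name, examples in few_shot.items():
        if not examples or len(examples) >= target:
            upsampled[name] = list(examples)  # nothing to do
        else:
            # Repeat the examples cyclically until the target size is reached.
            upsampled[name] = list(islice(cycle(examples), target))
    return upsampled


many = {"a": ["m"] * 100, "b": ["m"] * 80, "c": ["m"] * 120}
few = {"f": ["x1", "x2", "x3"]}
upsampled = upsample_few_shot(many, few)
```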

Model architectures.

For simplicity, we use T5 Raffel et al. (2020) as our student models here, since they achieve state-of-the-art performance even without any data augmentation. Table 3 shows how each task is cast in the seq2seq framework. We present results where both the teacher and student models are finetuned from T5-XL unless otherwise noted. (We use the T5.1.1 version that is only pretrained on unlabeled data Roberts et al. (2020). The teacher models are finetuned for 3 epochs for FewRel and 10 epochs for CLINC150/SNIPS. The student models are finetuned for 10k steps for FewRel and 20k for the others. All models use a batch size of 128. All other hyper-parameters are set to T5’s defaults.) We also evaluate the impact of T5 model sizes in Section 4.

1 Text Classification

Our first task illustrates one of the simplest applications of Ex2. Given a short text snippet such as “play a song”, a text classifier must select the correct label (e.g., PlayMusic). For this task, we evaluate on the CLINC150 dataset Larson et al. (2019). The original dataset contains 10 domains with 15 class labels per domain and 100 training examples per class label (a total of 15,000 examples). (We did not use the out-of-scope portion of the dataset.) We use the cross-validation setup and report results averaged over 10 runs, where each run chooses a different domain to contain few-shot slices.

For Ex2, we slice the dataset by class label, and set the number of exemplars to be $K=10$ . For the T5 student model, the input text to T5 is simply the plain text snippet, and the output is the string representation of the label (See Table 1 for Ex2 input-output pairs).

Table 4 shows the accuracy and macro F1 results on both the overall and the few-shot splits. Ex2 significantly improves over the upsampled baseline on the few-shot slices (+15.9 ppt in terms of macro F1), while maintaining the same overall accuracy. (Some previous works on few-shot intent classification of CLINC150 Zhang et al. (2020) use the setup where all intents are few-shot, therefore our results are not directly comparable.)

2 Intent Classification and Slot Filling

Intent classification is the task of mapping a user utterance to an intent label, as above. Slot filling is the task of identifying argument spans of the intent within the utterance. We use the SNIPS Coucke et al. (2018) dataset, which contains 7 intents (domains) with a total of 39 different slot types. (We use the preprocessed version from Goo et al. (2018) at https://github.com/MiuLab/SlotGated-SLU.)

For Ex2, we slice the data by intent label and set the number of exemplars to be $K=10$ . When truncating $D_{F,\text{train}}$ , we use a greedy algorithm to select source exemplars such that each one is guaranteed to share a slot type with the target. (The algorithm is inspired by Yang and Katiyar (2020) to ensure that all slot types are present in the smaller set. First, we identify the slot type present in the slice but least well-attested in the current set $F_{\text{train}}$ (with ties broken in favor of the more infrequent type). We then randomly select an exemplar containing that slot type from the domain. For this purpose, exemplars with no slots are assumed to have a single null slot. This ensures that the teacher and student both have access to a maximally complete and diverse set of inputs.)
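One possible reading of the greedy truncation procedure is sketched below. It is an interpretation, not the authors' implementation: the data layout (an utterance paired with a tuple of slot types), the `<null>` marker, the alphabetical final tie-break, and the fallback to the next slot type when no unselected exemplar contains the preferred one are all assumptions.

```python
import random
from collections import Counter
from typing import List, Set, Tuple

Example = Tuple[str, Tuple[str, ...]]  # (utterance, slot types it contains)
NULL_SLOT = "<null>"  # hypothetical marker for exemplars without slots


def slot_types(example: Example) -> Set[str]:
    """Slot types of an exemplar; slot-free exemplars get one null slot."""
    return set(example[1]) or {NULL_SLOT}


def greedy_truncate(
    slice_examples: List[Example], n: int, rng: random.Random
) -> List[Example]:
    """Pick up to n exemplars while covering slot types as evenly as possible."""
    slice_counts = Counter(t for e in slice_examples for t in slot_types(e))
    chosen_counts: Counter = Counter()
    remaining = list(slice_examples)
    chosen: List[Example] = []
    while remaining and len(chosen) < n:
        # Least-attested type in the chosen set first; ties go to the type
        # that is rarer in the full slice (then alphabetical, for determinism).
        ranked = sorted(
            slice_counts, key=lambda t: (chosen_counts[t], slice_counts[t], t)
        )
        for t in ranked:
            candidates = [e for e in remaining if t in slot_types(e)]
            if candidates:
                pick = rng.choice(candidates)  # random exemplar with that type
                break
        chosen.append(pick)
        remaining.remove(pick)
        chosen_counts.update(slot_types(pick))
    return chosen


toy_slice = [
    ("play jazz", ("genre",)),
    ("play jazz by miles", ("genre", "artist")),
    ("play something", ()),
    ("play rock", ("genre",)),
    ("play pop", ("genre",)),
]
subset = greedy_truncate(toy_slice, n=3, rng=random.Random(0))
covered = set().union(*(slot_types(e) for e in subset))
```

On this toy slice, three exemplars are enough to cover every slot type, including the null slot.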

For the T5 student model, the input to T5 is the plain text utterance, and the output is the same plain text utterance, except prefixed with the predicted intent, and with special tokens inserted to mark the beginning and end of slot values (cf. Table 3).

Kumar et al. (2019) evaluate a data augmentation technique for few-shot intent classification on the SNIPS and TOP datasets. Their approach involves permuting sentence embeddings of the $D_{F,\text{train}}$ set (across a variety of different permutation functions), and training the system on the permuted embeddings in addition to the original embeddings. The approach is restricted to sentence classification, however.

Hou et al. (2020) and Krone et al. (2020) both involve explicitly aligning token- or span-vectors from an incoming query to prototype vectors derived from $F_{\text{train}}$ and computing the similarity between them directly.

Kumar et al. (2019) and Hou et al. (2020) use BERT (Devlin et al., 2019) to encode queries, whereas Krone et al. (2020) found ELMo (Peters et al., 2018) to work better for this task in their experiments.

Results.

Table 5 shows how our system compares to the simple T5 baseline with and without upsampling. It can be observed that upsampling the few-shot classes improves intent accuracy over the baseline, but its impact on slot-filling is considerably more modest. Ex2, however, drastically improves intent accuracy while also increasing slot F1 (by 20 ppt. and 5 ppt. respectively) on the few-shot slices. These improvements in the few-shot domain appear to carry over into the overall scores, as evidenced by a 2.5 ppt. increase in overall intent accuracy and a 0.5 ppt. increase in overall slot F1.

We also include previous published results on SNIPS, but they only serve as a rough reference to demonstrate that T5 is a competitive baseline, since there are slight differences in the experimental setup. The numbers from Kumar et al. (2019), Hou et al. (2020) and Krone et al. (2020) are not strictly comparable to ours, because they use a different data truncation strategy and a different train/development setup. (Hou et al. (2020) truncate the few-shot domain to have close to $5$ instances of each slot type rather than $10$ instances of each intent type. They also use one domain for development in cross-validation, whereas Kumar et al. (2019) did not include $D_{F,\text{dev}}$ in their development set.)

3 Relation Extraction

In relation extraction, a model is given a passage of text featuring two entity mentions, and must predict the relation between the pair of entities.

We evaluate on the well-studied few-shot relation extraction benchmark, FewRel dataset Han et al. (2018), where some relations are designated for few-shot learning. Previous results have reported super-human performance on FewRel Baldini Soares et al. (2019). However, the original task only requires the model to select the correct relation from a pruned set of possible options, rather than the full catalogue of relations.

We therefore use a more challenging variant of FewRel (FewRel-Open), where the model must choose from all relations (and in the case of nearest neighbor models choose from all training neighbors). This setup is much closer to real-world applications of relation extraction and explicitly evaluates the model’s ability to predict under-represented relations while being overwhelmed by a highly unbalanced prior in the training data.

The 64 Wikipedia training relations with 70k sentences are used for teacher and student training. In addition to in-domain Wikipedia evaluation, we also evaluate on out-of-domain generalization with the NYT, SemEval, and PubMed evaluation sets from FewRel 2.0 Gao et al. (2019) and report the macro average over all domains.

For Ex2, we slice the dataset by relation label, and treat the few-shot relations defined in the original FewRel dataset as our underrepresented slices. We set the number of exemplars to be $K=5$ . For the student model, the input text and entity mentions are formatted into a plain text by marking the start and end of each entity mention using special tokens. The text output from T5 is the string name of the relation (see Table 3).
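The student input formatting for relation extraction can be sketched as follows. The marker strings (`<e0>`, `</e0>`, `<e1>`, `</e1>`) and the token-span representation are hypothetical; the actual format is the one given in Table 3. The sketch assumes the two mentions do not overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # token span: start inclusive, end exclusive


def mark_entities(tokens: List[str], span0: Span, span1: Span) -> str:
    """Wrap two non-overlapping entity mentions in special marker tokens."""
    out = list(tokens)
    spans = [(span0, "e0"), (span1, "e1")]
    # Insert markers for the right-most mention first so that the indices
    # of the other mention are not shifted by earlier insertions.
    for (start, end), name in sorted(spans, key=lambda s: s[0][0], reverse=True):
        out.insert(end, f"</{name}>")
        out.insert(start, f"<{name}>")
    return " ".join(out)


sentence = "The sun sets in the west".split()
marked = mark_entities(sentence, span0=(5, 6), span1=(1, 2))
```

Here entity 0 is "west" and entity 1 is "sun"; the student would then be trained to output the string name of the relation for this marked input.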

In addition to the data augmentation baselines described earlier, we compare to the state-of-the-art Matching the Blanks (MTB) model Baldini Soares et al. (2019), which is a nearest-neighbor approach based on BERT. MTB was trained with an unsupervised objective that aims to improve the modeling of entity relations.

Results.

The first notable result is that while MTB exceeds human performance on the original FewRel task, the accuracy of MTB drops dramatically in the more challenging and realistic FewRel-Open task. It achieves an average few-shot accuracy of 69% in the overall evaluation and 50.5% when evaluating only on examples with the few-shot labels. We hypothesize that teasing apart gold and random distractor neighbors is easy, but avoiding distractors from an entire training set worth of potential neighbors is much more challenging.

Interestingly, we found that our no-data-augmentation T5 baseline already improves over MTB, even though it does not employ a custom architecture specifically designed to improve few-shot learning. This could simply be attributed to the larger size of T5-XL compared to MTB, which is based on BERT-large. Since we aim to compare to the best-performing baseline, we mainly compare to the T5 baseline.

When we perform data augmentation with Ex2, we observe another significant improvement in accuracy, setting a new state of the art for both few-shot relations (7.2 ppt increase) and the overall accuracy (2.2 ppt increase).

Analysis

Ex2 relies on three intuitions that we aim to justify empirically in this section:

It is critical to have a broad range of source exemplars in order to show the model the boundaries of the data slice under consideration.

The identity of the slice should be obfuscated in order to encourage the model to infer the slice distribution using the source exemplars.

The model needs access to world knowledge that is not present in the training data in order to generate accurate and diverse outputs.

We present ablations that test these three claims. The experimental setups for these analyses are identical to those presented in the main experiments, except we present results on the validation sets.

We use CLINC150 to demonstrate the importance of jointly reasoning across different exemplars by varying the number of exemplars $K$ . We choose this intent classification task because the special case where $K=1$ reduces to a paraphrasing data-augmentation approach. Since a paraphraser only observes one exemplar, it cannot reason about the different axes of variance in a slice, and only has enough information to generate a generically similar example.

As expected, Figure 2 shows that the paraphrasing special case does no better than the baselines. Using just $K=2$ exemplars already improves the few-shot accuracy above the baseline, and we observe substantial improvement with even more exemplars. Note that in all of these settings, the teacher performs inference on the same amount of few-shot data, and $K$ only controls the number of exemplars that the teacher encodes at the same time. Therefore, these results demonstrate the importance of cross-exemplar reasoning in Ex2.

Anonymization strategy

In this experiment, we compare our original Ex2 model with ones that lack slice anonymization; we use the SNIPS dataset for this experiment because it includes both classification and slot-filling subtasks, meaning there are two ways to anonymize the data. Table 7 compares Ex2 and baselines to two non-anonymized models: one that includes slot label names and another that also prepends the intent name to the source sequence.

The hypothesis appears to be borne out to some extent: the anonymized Ex2 models outperform the non-anonymized ones in terms of few-shot intent accuracy. Surprisingly, argument F1 is lower than in the non-anonymized models, indicating that providing slot and/or intent names improves argument synthesis. (This pattern held even after a second trial of this experiment. In Ex2-L models, anonymization improves intent accuracy dramatically and is uncorrelated with argument F1.) It’s likely that label strings (such as artist or AddToPlaylist) provide some semantic signal that extra-large networks can take advantage of, and that it’s easier to connect the semantics of the label to the semantics of possible fillers than to whole queries. This points to a tradeoff between providing the model with information it can use to generalize and withholding information that it may memorize.

Pre-training

We train an Ex2 model from scratch and compare it to one that has been fine-tuned from a T5 model. We evaluate this on FewRel, which requires synthesizing the longest and most complex examples out of the three tasks in this paper. Results in Table 8 demonstrate that a randomly initialized Ex2 is completely ineffective, with the generated examples introducing substantial noise into the system with little tangible gain. Furthermore, we observe a correlation between model size and performance; a sufficiently large pre-trained model (at least T5-XL) is necessary for Ex2 to be effective for FewRel. As stipulated in Section 2, this suggests the world knowledge from pre-training is critical to the ability of Ex2 to extrapolate to new examples consisting of new concepts rather than simply recombining or paraphrasing existing parts from the input exemplars.

2 Qualitative analysis of Ex2 outputs

We posit that Ex2 is able to effectively use the source exemplars to estimate the boundaries of the intended slice when synthesizing a new example. In Table 9 we demonstrate this qualitatively. The first column shows sets of five exemplars passed to an Ex2 model trained on CLINC150 (with “auto” as the held-out domain), and the second shows three different outputs synthesized from each set. (We generate synthetic outputs in batches of 3, and show the selected batches here.)

When comparing examples (1) and (2) — which differ only in the specificity of the slice, with (1) representing queries about help learning languages and (2) representing queries about help learning academic subjects more broadly — the generated examples stay confined to the regions specified by the source exemplars while not repeating any of the source queries.

Examples (3) and (4) show that not only can Ex2 learn the boundaries of clusters, it can pass a variation of the “wug test”, using context to infer the semantic and morpho-syntactic category of nonce words with previously unseen meanings. We see that Ex2 can compose new syntactic forms based on variations in the exemplars. When observing a word such as updates or cleaning that fills the same semantic role as wug in other source exemplars but with different morphology, Ex2 is more likely to generate an example using the word wug that bears the same form. This demonstrates an extreme case of out-of-domain generalization, where Ex2 can be used to quickly adapt to new or even conflicting information.

Related Work

There is a large body of research on data augmentation (Jia and Liang, 2016; Andreas, 2020; Akyürek et al., 2021, inter alia). Within this literature, our approach is most related to recent work on data augmentation for NLP using pre-trained language models (LMs): Kumar et al. (2019); Anaby-Tavor et al. (2020) perform data augmentation for text classification by fine-tuning an LM to synthesize new inputs $x$ for a given label $y$ — modeling $p(x|y)$ . Like these approaches, Ex2 uses LM pre-training to acquire world knowledge, and then fine-tunes the LM to perform data generation. But our generation task is notably different: prior work conditioned the data generator on an output label $y$ , whereas Ex2 conditions on a collection of exemplars $[(x_{1},y_{1}),\ldots,(x_{K},y_{K})]$ .

This yields several advantages. First, it enables us to generate examples for new slices that were never seen at training time, since the extrapolator can reason by analogy instead of memorizing the identity of labels. Second, it allows us to perform data augmentation along dimensions other than the output label — exemplars can be used to express any desired quality (e.g., a particular sentence length or syntactic structure), not just a desired label. This makes Ex2 applicable to tasks beyond classification. Finally, note that Ex2 synthesizes entirely new labeled examples ( $(x,y)$ pairs), rather than just the $x$ . This allows Ex2 to naturally cover variation in the output space, which is essential for tasks with large and compositional output spaces such as parsing.

2 Few-shot learning with language models

Beyond data augmentation, large language models have been used in various other ways to address few-shot learning Schick and Schütze (2020); Brown et al. (2020). Our approach is most related to the in-context learning approach of GPT-3 Brown et al. (2020). Similar to Ex2, GPT-3 also conditions on a collection of exemplars.

However, the two models solve different tasks. GPT-3 maps an input $x$ to an output $y$ , whereas Ex2 generates a new $(x,y)$ pair. In other words, Ex2 uses a large LM to generate data, whereas GPT-3 uses a large LM as the model itself. Using large LMs for data generation rather than direct inference has practical benefits: data can be inspected and cleaned by humans, easily persisted, and finally used to train much smaller models that are cheaper to deploy than a large LM. (A model like GPT-3 could also be used for data generation, by using it to label a large number of unlabeled $x$ ’s, as done in distillation. But in many NLP tasks (e.g., natural language inference), coming up with a valid $x$ is non-trivial, and often even harder than predicting the label.)

The purpose of exemplars is also different: for GPT-3, exemplars are used to describe the overall task (and hence drawn uniformly from the training set), while for Ex2, exemplars are used to describe a particular slice of the task. This distinction is important for tasks with many slices. For example, consider a few-shot document classification problem with 1000 possible labels (where each label is a slice), and we have 5 examples for each label. Using Ex2, we would condition on $K=5$ exemplars at a time to generate new examples. In contrast, GPT-3 requires one set of exemplars to describe the entire task, so it must condition on at least $K=1000$ exemplars to ensure that every label is included at least once in the set. This becomes computationally intractable.

On the other hand, it is attractive that GPT-3 generalizes over many tasks, whereas Ex2 only targets a single task. In future work, one could imagine using Ex2 to generalize across tasks by grouping multiple tasks together, and learning over the union of all their slices.

Lastly, Ex2 is fine-tuned to perform few-shot data augmentation, whereas GPT-3 is not fine-tuned. Therefore, GPT-3 users must be careful to format examples in a way that resembles “natural” text encountered during pre-training – such “format engineering” can greatly affect performance Shin et al. (2020); Schick and Schütze (2020). In contrast, fine-tuning allows Ex2 to introduce arbitrary formats and annotations that deviate from natural language, which is necessary for slice anonymization and modeling more structured tasks.

3 Nearest neighbor methods

Among methods for few-shot learning, nearest-neighbor and other instance-based models constitute another prominent category that conditions on a collection of examples Vinyals et al. (2016); Snell et al. (2017); Sun et al. (2019); Yang and Katiyar (2020); Hou et al. (2020); Ziyadi et al. (2020).

It is worth noting that instance-based models require modest specialization, since inputs must be encoded into feature vectors, whereas Ex2 is model-agnostic. In fact, they are mutually compatible approaches that aim to improve few-shot learning in complementary ways.
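To illustrate why the two approaches are compatible, the following minimal sketch implements a nearest-prototype classifier in the spirit of instance-based methods and then extends its support set with additional examples. It assumes that inputs have already been encoded into feature vectors; the vectors, labels, and function names are hypothetical, and the "synthetic" examples merely stand in for what a data augmentation method might supply.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Support = List[Tuple[Sequence[float], str]]  # (feature vector, label) pairs


def prototypes(support: Support) -> Dict[str, List[float]]:
    """Average the feature vectors of each label into one prototype."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for vec, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(query: Sequence[float], protos: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


# Hypothetical few-shot support set of already-encoded real examples.
real_support: Support = [([0.0, 0.0], "a"), ([0.0, 2.0], "a"), ([4.0, 4.0], "b")]
# Hypothetical encoded synthetic examples for the underrepresented label "b".
synthetic_support: Support = [([5.0, 3.0], "b"), ([3.0, 5.0], "b")]

protos = prototypes(real_support + synthetic_support)
print(classify([3.0, 3.0], protos))
```

The classifier itself is unchanged by augmentation; the extra examples simply enlarge the support set it conditions on.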

Discussion

We address several potential concerns about the use of synthetic data generated from a highly expressive neural model.

Ex2 is likely to generate text that is factually incorrect. While this initially sounds undesirable, we argue that for most tasks, the role of the downstream model is to understand language, not evaluate world knowledge. Therefore, an ideal model should be constrained to behave well on these hallucinated data points. For example, consider using Ex2 for a new relation indicating that entity 0 is the direction in which entity 1 sets. A robust relation extractor should predict that this relation exists in all of the examples below, regardless of world knowledge:

Ensuring that models make decisions via language understanding rather than memorizing facts or entities has been argued for named entity recognition Agarwal et al. (2020) and coreference resolution Agarwal et al. (2019).

Transparency

Ex2 can also be considered a method for increasing the transparency of using large pre-trained LMs. The typical use of pre-trained LMs involves simply fine-tuning on the data and hoping that the model generalizes to new inputs. With Ex2, however, we would explicitly generate data that better cover the input space. While the new examples may contain mistakes (in the same way that a purely discriminative model would make mistakes), it would more transparently expose the regions where they happen.

Human curation

While we argue that hallucination is not necessarily a problem, there are certainly cases where it is undesirable. Ex2 should not be used in production-level models without making the most of Ex2’s transparency by vetting the generated examples with human supervision. The most effective combination uses Ex2 to thoroughly cover possible variations (that may be tedious or difficult for humans) and uses human supervision to curate high-precision data.

Conclusion

We propose an approach for data augmentation by learning a neural example extrapolator (Ex2) that generates new labeled examples from a small set of existing examples coming from the same “slice” of the dataset. Ex2 learns from slices of data with many data points, and uses that knowledge to synthesize new examples for slices of the data with few data points. We show that this is an effective approach for few-shot text classification, intent classification + slot filling, and relation extraction.

For future work, we hope to expand this approach to broader notions of slices, including slicing by languages for multilingual applications, slicing by tasks, or working with tasks that contain orders of magnitude more slices (e.g. entity linking). We also plan to explore whether Ex2 can be generalized to other modalities, such as images or speech, where we would need to explore architectures other than pre-trained seq2seq models. Finally, we believe that investigating the best way in which human supervision should be injected into applications of Ex2 is an important direction.

Acknowledgements

We thank Ice Pasupat, Yuan Zhang, Emily Pitler, Kristina Toutanova, Arun Chaganty, Zhuyun Dai, Terry Koo, Sebastian Ruder, Siamak Shakeri, Iulia Turc, and the Google Research Language team for their helpful feedback and discussions.
