Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network

Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, Ting Liu

Introduction

Slot tagging Tur and De Mori (2011), a key module in the task-oriented dialogue system Young et al. (2013), is usually formulated as a sequence labeling problem Sarikaya et al. (2016). Slot tagging must cope with rapidly changing domains, and labeled data is usually scarce for new domains, often only a few samples. Few-shot learning Miller et al. (2000); Fei-Fei et al. (2006); Lake et al. (2015); Vinyals et al. (2016) is appealing in this scenario because the model borrows prior experience from old domains and adapts to new domains quickly with very few examples (usually one or two examples for each class).

Previous few-shot learning studies mainly focused on classification problems, which have been widely explored with similarity-based methods (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Yan et al., 2018; Yu et al., 2018). The basic idea of these methods is to classify a (query) item in a new domain according to its similarity with the representation of each class. The similarity function is usually learned in prior rich-resource domains, and the per-class representation is obtained from a few labeled samples (the support set). It is straightforward to decompose few-shot sequence labeling into a series of independent few-shot classifications and apply the similarity-based methods. However, sequence labeling benefits from taking the dependencies between labels into account Huang et al. (2015); Ma and Hovy (2016). To consider both the item similarity and label dependency, we propose to leverage conditional random fields (Lafferty et al., 2001, CRFs) in few-shot sequence labeling (see Figure 1). In this paper, we take the output of the similarity-based method as the emission score of the CRF and calculate the transition score with a specially designed transfer mechanism.

The few-shot scenario poses unique challenges in learning the emission and transition scores of the CRF. It is infeasible to learn the transitions from the few labeled examples, and the label dependencies learned in source domains cannot be directly transferred because the label sets differ. To tackle this label discrepancy problem, we introduce the collapsed dependency transfer mechanism. It transfers label dependency information from source domains to target domains by abstracting domain-specific labels into domain-independent abstract labels and modeling the label dependencies between these abstract labels.

It is also challenging to compute the emission scores (word-label similarity in our case). Popular few-shot models, such as Prototypical Network Snell et al. (2017), average the embeddings of each label’s support examples as label representations, which often lie close together in the embedding space and thus cause misclassification. To remedy this, Yoon et al. (2019) propose TapNet, which learns to project embeddings to a space where words of different labels are well-separated. We introduce this idea to slot tagging and further propose to improve label representation by leveraging the semantics of label names. We argue that label names are often semantically related to slot words and can help word-label similarity modeling. For example, in Figure 1, the word rain and the label name weather are highly related. To use label name semantics and achieve well-separated label representations, we propose Label-enhanced TapNet (L-TapNet), which constructs an embedding projection space using label name semantics, where label representations are well-separated and aligned with embeddings of both label names and slot words. Then we calculate similarities in the projected embedding space. Also, we introduce a pair-wise embedding mechanism to represent words with domain-specific context.

One-shot and five-shot experiments on slot tagging and named entity recognition show that our model achieves significant improvement over strong few-shot learning baselines. Ablation tests demonstrate that the improvements come from both L-TapNet and collapsed dependency transfer. Further analysis of the transferred label dependencies shows that they capture non-trivial information and outperform rule-based transitions.

Our contributions are summarized as follows: (1) We propose a few-shot CRF framework for slot tagging that computes the emission score as word-label similarity and estimates the transition score by transferring previously learned label dependencies. (2) We introduce the collapsed dependency transfer mechanism to transfer label dependencies across domains with different label sets. (3) We propose L-TapNet to leverage the semantics of label names to enhance label representations, which helps to model the word-label similarity.

Problem Definition

As shown in Figure 2, few-shot models are usually first trained on a set of source domains $\left\{\mathcal{D}_{1},\mathcal{D}_{2},\ldots\right\}$ , then directly work on another set of unseen target domains $\left\{\mathcal{D}_{1}^{\prime},\mathcal{D}_{2}^{\prime},\ldots\right\}$ without fine-tuning. A target domain $\mathcal{D}_{j}^{\prime}$ contains only a few labeled samples, called the support set $\mathcal{S}=\left\{(\bm{x}^{(i)},\bm{y}^{(i)})\right\}_{i=1}^{N_{\mathcal{S}}}$ . $\mathcal{S}$ usually includes $K$ examples (K-shot) for each of $N$ labels (N-way).

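The K-shot sequence labeling task is defined as follows: given a K-shot support set $\mathcal{S}$ and an input query sequence $\bm{x}=(x_{1},x_{2},\ldots,x_{n})$ , find $\bm{x}$ ’s best label sequence $\bm{y}^{*}$ :

$$\bm{y}^{*}=(y_{1},y_{2},\ldots,y_{n})=\operatorname*{arg\,max}_{\bm{y}}\ p(\bm{y}\mid\bm{x},\mathcal{S})$$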

Model

In this section, we first show the overview of the proposed CRF framework (§3.1). Then we discuss how to compute label transition score with collapsed dependency transfer (§3.2) and compute emission score with L-TapNet (§3.3).

Conditional Random Field (CRF) considers both the transition score and the emission score to find the global optimal label sequence for each input. Following the same idea, we build our few-shot slot tagging framework with two components: Transition Scorer and Emission Scorer.

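We apply the linear-chain CRF to the few-shot setting by modeling the probability of label sequence $\bm{y}$ given query sentence $\bm{x}$ and a K-shot support set $\mathcal{S}$ :

$$p(\bm{y}\mid\bm{x},\mathcal{S})=\frac{1}{Z}\exp\left(\text{TRANS}(\bm{y})+\lambda\cdot\text{EMIT}(\bm{y},\bm{x},\mathcal{S})\right)$$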

where $Z=\displaystyle\sum_{\bm{y}^{\prime}\in\bm{Y}}\exp(\text{TRANS}(\bm{y}^{\prime})+\lambda\cdot\text{EMIT}(\bm{y}^{\prime},\bm{x},\mathcal{S}))$ , $\text{TRANS}(\bm{y})=\sum_{i=1}^{n}f_{T}(y_{i-1},y_{i})$ is the Transition Scorer output and $\text{EMIT}(\bm{y},\bm{x},\mathcal{S})=\sum_{i=0}^{n}f_{E}(y_{i},\bm{x},\mathcal{S})$ is the Emission Scorer output. $\lambda$ is a scaling parameter that balances the weights of the two scores.

We take $L_{\text{CRF}}=-\log(p(\bm{y}\mid\bm{x},\mathcal{S}))$ as the loss function and minimize it on data from source domains. After the model is trained, we employ the Viterbi algorithm (Forney, 1973) to find the best label sequence for each input.
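As a minimal sketch of the decoding step, the following Python function runs Viterbi over a transition table and per-token emission scores, combining them as TRANS plus $\lambda$ times EMIT. The function name `viterbi_decode`, the dictionary-based score tables, and the toy numbers are illustrative assumptions rather than the implementation used in our experiments; $Start$ and $End$ transitions are omitted.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
    lam: float = 1.0,
) -> List[str]:
    """Return the label sequence maximizing sum(f_T) + lam * sum(f_E)."""
    if not emissions:
        return []
    # best[i][y]: best score of any path that ends in label y at position i.
    best = [{y: lam * emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + lam * emissions[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores (made up): per-token greedy choice would give the illegal O -> I-weather.
labels = ["O", "B-weather", "I-weather"]
emissions = [
    {"O": 0.5, "B-weather": 0.4, "I-weather": 0.1},
    {"O": 0.1, "B-weather": 0.2, "I-weather": 0.7},
]
transitions = {(p, c): 0.0 for p in labels for c in labels}
transitions[("O", "I-weather")] = -5.0
transitions[("B-weather", "I-weather")] = 0.5
transitions[("B-weather", "B-weather")] = -1.0
transitions[("I-weather", "B-weather")] = -1.0
decoded = viterbi_decode(emissions, transitions, labels)
```

In this toy case the transition penalty on O followed by I-weather steers decoding toward a path that begins the slot with B-weather, which independent per-token prediction would not do.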

3.2 Transition Scorer

The transition scorer component captures the dependencies between labels. (Here, we ignore $Start$ and $End$ labels for simplicity. In practice, $Start$ and $End$ are included as two additional abstract labels.) We model the label dependency as the transition probability between two labels:

$$f_{T}(y_{i-1},y_{i})=p(y_{i}\mid y_{i-1})$$

Conventionally, such probabilities are learned from training data and stored in a transition matrix $\bm{T}^{N\times N}$ , where $N$ is the number of labels. For example, $\bm{T}_{\text{B-loc},\text{B-team}}$ corresponds to $p(\text{B-loc}\mid\text{B-team})$ . But in the few-shot setting, a model faces different label sets in the source domains (train) and the target domains (test). This label mismatch prevents the trained transition scorer from working directly on a target domain.
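The abstraction of domain-specific labels into domain-independent ones, described in the introduction, could be sketched as follows. This is an illustration under stated assumptions: labels follow the BIO scheme, an abstract transition is keyed by the two BIO prefixes plus a flag saying whether both labels name the same slot, and the values in `collapsed` are made-up numbers. The names `abstract_key` and `expand_transitions` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

AbstractKey = Tuple[str, str, bool]  # (previous prefix, current prefix, same slot?)


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split 'B-city' into ('B', 'city'); 'O' becomes ('O', None)."""
    if label == "O":
        return "O", None
    prefix, slot = label.split("-", 1)
    return prefix, slot


def abstract_key(prev: str, curr: str) -> AbstractKey:
    """Map a concrete label pair to its domain-independent abstract pair."""
    p_prefix, p_slot = split_label(prev)
    c_prefix, c_slot = split_label(curr)
    same_slot = p_slot is not None and p_slot == c_slot
    return p_prefix, c_prefix, same_slot


def expand_transitions(
    collapsed: Dict[AbstractKey, float], labels: List[str]
) -> Dict[Tuple[str, str], float]:
    """Fill an N x N table for a target label set from the abstract table."""
    return {(p, c): collapsed[abstract_key(p, c)] for p in labels for c in labels}


# Abstract table with made-up values, as it might be learned on source domains.
collapsed = {
    ("O", "O", False): 0.6, ("O", "B", False): 0.4, ("O", "I", False): 0.0,
    ("B", "O", False): 0.5, ("B", "B", True): 0.05, ("B", "B", False): 0.1,
    ("B", "I", True): 0.35, ("B", "I", False): 0.0,
    ("I", "O", False): 0.5, ("I", "B", True): 0.05, ("I", "B", False): 0.1,
    ("I", "I", True): 0.35, ("I", "I", False): 0.0,
}
target_labels = ["O", "B-city", "I-city", "B-time", "I-time"]
transitions = expand_transitions(collapsed, target_labels)
```

Because the abstract table never mentions a concrete slot name, the same table can be expanded for any target label set, which is the property the mechanism relies on.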

3.3 Emission Scorer

As shown in Figure 4, the emission scorer independently assigns each word an emission score with regard to each label:

In the few-shot setting, a word’s emission score is calculated according to its similarity to the representation of each label. To compute such emission scores, we propose L-TapNet, which improves TapNet Yoon et al. (2019) with label semantics and prototypes.

where $\mathbf{M}$ is a projecting function, $E$ is an embedder and Sim is a similarity function. TapNet shares the references $\mathbf{\Phi}$ across different domains and constructs $\mathbf{M}$ for each specific domain by randomly associating the references to the specific labels.

To achieve these, TapNet first computes the alignment bias between $\mathbf{c}_{j}$ and $\bm{\phi}_{j}$ in the original embedding space, then finds a projection $\mathbf{M}$ that eliminates this alignment bias and effectively separates different labels at the same time. Specifically, TapNet takes the matrix solution of a linear error nulling process as the embedding projector $\mathbf{M}$ . For the detailed process, refer to the original paper.

3.3.2 Label-enhanced TapNet

As mentioned in the introduction, we argue that label names often semantically relate to slot words and can help word-label similarity modeling. To enhance TapNet with such information, we use label semantics in both label representation and construction of projection space.

where $\alpha$ is a balance factor. Label semantics $\mathbf{s}_{j}$ makes $\mathbf{M}$ specific to each domain, while the reference $\bm{\phi}_{j}$ provides cross-domain generalization.

Then we construct $\mathbf{M}$ by linear error nulling of the alignment error between the label-enhanced reference $\bm{\psi}_{j}$ and $\mathbf{c}_{j}$ , following the same steps as TapNet.

For emission score calculation, compared to TapNet that only uses domain-agnostic reference $\bm{\phi}$ as label representation, we also consider the label semantics and use the label-enhanced reference $\bm{\psi}_{j}$ in label representation.

Besides, we further incorporate the idea of Prototypical Network and represent a label using a prototype reference $\mathbf{c}_{j}$ as $\bm{\Omega}_{j}=(1-\beta)\cdot\mathbf{c}_{j}+\beta\bm{\psi}_{j}$ . Finally, the emission score of $x$ is calculated as its similarity to label representation $\bm{\Omega}$ :

where Sim is the dot product similarity function and $E$ is a word embedding function which will be introduced in the next section.
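The label representation $\bm{\Omega}_{j}$ and the dot-product similarity can be illustrated with a small sketch. It assumes the projection $\mathbf{M}$ is already available as a plain matrix (the linear error nulling step is not shown), takes the label-enhanced references $\bm{\psi}_{j}$ as given inputs, and leaves the scores unnormalized. All function names and toy vectors are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def project(m: Matrix, v: Vector) -> Vector:
    """Apply a linear projection M (given row by row) to vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def label_representation(c: Vector, psi: Vector, beta: float) -> Vector:
    """Omega_j = (1 - beta) * c_j + beta * psi_j."""
    return [(1 - beta) * ci + beta * pi for ci, pi in zip(c, psi)]


def emission_scores(
    word_emb: Vector,
    prototypes: Dict[str, Vector],
    references: Dict[str, Vector],
    m: Matrix,
    beta: float,
) -> Dict[str, float]:
    """Dot product between the projected word and each projected Omega_j."""
    projected_word = project(m, word_emb)
    scores = {}
    for label, c in prototypes.items():
        omega = label_representation(c, references[label], beta)
        projected_label = project(m, omega)
        scores[label] = sum(a * b for a, b in zip(projected_word, projected_label))
    return scores


# Toy two-dimensional example with an identity projection (made-up numbers).
identity = [[1.0, 0.0], [0.0, 1.0]]
prototypes = {"B-weather": [1.0, 0.0], "O": [0.0, 1.0]}
references = {"B-weather": [1.0, 0.0], "O": [0.0, 1.0]}
scores = emission_scores([0.9, 0.1], prototypes, references, identity, beta=0.7)
```

With $\beta=0$ the label representation reduces to the prototype alone, and with $\beta=1$ it reduces to the label-enhanced reference alone, so $\beta$ controls how much the support examples contribute.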

3.3.3 Embeddings for Word and Label Name

For the word embedding function $E$ , we propose a pair-wise embedding mechanism. As shown in Figure 5, a word tends to mean different things when it appears in different contexts. To tackle this representation challenge for similarity computation, we consider the special query-support setting in few-shot learning and embed query and support words in pairs. Such pair-wise embedding can make use of domain-related context in support sentences and provide domain-adaptive embeddings for the query words. This further helps to model the query words’ similarity to domain-specific labels. To achieve this, we represent each word with self-attention over both query and support words. We first copy the query sentence $\bm{x}$ $N_{\mathcal{S}}=|\mathcal{S}|$ times and pair the copies with all support sentences. Then the $N_{\mathcal{S}}$ pairs are passed to BERT (Devlin et al., 2019) to get $N_{\mathcal{S}}$ embeddings for each query word. We represent each word as the average of its $N_{\mathcal{S}}$ embeddings. Now, representations of query words are conditioned on domain-specific context. We use BERT as it can naturally capture the relation between sentence pairs.
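The averaging step of the pair-wise embedding can be sketched as follows. The encoder is passed in as a callable so that the sketch stays self-contained; `toy_encoder` is a deliberately trivial stand-in for BERT (an assumption for illustration only), and `pairwise_embed` is a hypothetical name.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical encoder interface: (query tokens, support tokens) -> one vector per query token.
PairEncoder = Callable[[Sequence[str], Sequence[str]], List[Vector]]


def pairwise_embed(
    query: Sequence[str],
    support_sentences: Sequence[Sequence[str]],
    encode_pair: PairEncoder,
) -> List[Vector]:
    """Pair the query with every support sentence and average the embeddings."""
    if not support_sentences:
        raise ValueError("the support set must contain at least one sentence")
    per_pair = [encode_pair(query, support) for support in support_sentences]
    n_pairs = len(per_pair)
    dim = len(per_pair[0][0])
    return [
        [sum(pair[t][d] for pair in per_pair) / n_pairs for d in range(dim)]
        for t in range(len(query))
    ]


def toy_encoder(query: Sequence[str], support: Sequence[str]) -> List[Vector]:
    """Stand-in for BERT: token length, and whether the token occurs in the support."""
    return [[float(len(tok)), float(tok in support)] for tok in query]


query = ["play", "rain"]
support_sentences = [["will", "it", "rain"], ["play", "music"]]
embedded = pairwise_embed(query, support_sentences, toy_encoder)
```

Even with this trivial encoder, the second component of each query vector depends on the support sentences, which mirrors how the averaged representation becomes conditioned on domain-specific context.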

To get the label representation $\mathbf{s}$ , we first concatenate the abstract label name (e.g., begin and inner) and the label name (e.g., weather). Then, we insert a [CLS] token at the first position and feed the sequence into BERT. Finally, the representation of [CLS] is used as the label semantic embedding.

Experiment

We evaluate the proposed method on slot tagging and test its generalization ability on a similar sequence labeling task: named entity recognition (NER). Due to space limitations, we only present the detailed results for 1-shot/5-shot slot tagging, which transfers the learned knowledge from source domains (training) to an unseen target domain (testing) containing only a 1-shot/5-shot support set. The results of NER are consistent and we present them in Appendix B.

For slot tagging, we use the SNIPS dataset (Coucke et al., 2018), because it contains 7 domains with different label sets, which makes it easy to simulate the few-shot situation. The domains are Weather (We), Music (Mu), PlayList (Pl), Book (Bo), Search Screen (Se), Restaurant (Re) and Creative Work (Cr). Information about the original datasets is shown in Appendix A.

To simulate the few-shot situation, we construct the few-shot datasets from the original datasets, where each sample is the combination of a query sample $(\bm{x^{q}},\bm{y^{q}})$ and the corresponding K-shot support set $\mathcal{S}$ . Table 1 shows an overview of the experiment data.

Different from the simple classification of single words, slot tagging is a structural prediction problem over the entire sentence. So we construct support sets with sentences rather than single words under each tag.

As a result, the normal N-way K-shot few-shot definition is inapplicable for few-shot slot tagging. We cannot guarantee that each label appears exactly $K$ times while sampling the support sentences, because different slot labels randomly co-occur in one sentence. For example, in Figure 1, label [B-weather] occurs twice in the 1-shot support set to ensure that all labels appear at least once. So we approximately construct the K-shot support set $\mathcal{S}$ following two criteria: (1) All labels within the domain appear at least $K$ times in $\mathcal{S}$ . (2) At least one label will appear fewer than $K$ times in $\mathcal{S}$ if any $(\bm{x},\bm{y})$ pair is removed from it. Algorithm 1 shows the detailed process. Due to the removing step, Algorithm 1 has a preference for sentences with more slots. So in practice, we randomly skip the removal with a probability of 20%.
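The two criteria can be met with a simple add-then-prune procedure, sketched below. This is not a transcription of Algorithm 1: the sampling order, the choice to count one occurrence per slot span (each B- tag), and the names `slot_counts` and `build_support_set` are assumptions made for illustration. The `skip_prob` argument models the 20% chance of skipping a removal.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Example = Tuple[Sequence[str], Sequence[str]]  # (tokens, BIO tags)


def slot_counts(examples: Sequence[Example]) -> Counter:
    """Count slot occurrences, one per span (assumption: each B- tag opens a span)."""
    counts: Counter = Counter()
    for _, tags in examples:
        counts.update(tag[2:] for tag in tags if tag.startswith("B-"))
    return counts


def build_support_set(
    pool: Sequence[Example],
    k: int,
    skip_prob: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[Example]:
    rng = rng or random.Random()
    target = set(slot_counts(pool))
    candidates = list(pool)
    rng.shuffle(candidates)
    support: List[Example] = []
    counts: Counter = Counter()
    # Step 1: add sentences containing a slot that is still seen fewer than k times.
    for example in candidates:
        if all(counts[slot] >= k for slot in target):
            break
        example_counts = slot_counts([example])
        if any(counts[slot] < k for slot in example_counts):
            support.append(example)
            counts.update(example_counts)
    if any(counts[slot] < k for slot in target):
        raise ValueError("the pool cannot satisfy criterion (1) for this k")
    # Step 2: drop sentences whose removal keeps every slot at k or more,
    # skipping each possible removal with probability skip_prob.
    for example in list(support):
        example_counts = slot_counts([example])
        removable = all(counts[s] - c >= k for s, c in example_counts.items())
        if removable and rng.random() >= skip_prob:
            support.remove(example)
            counts.subtract(example_counts)
    return support


pool = [
    (["rain", "in", "paris"], ["B-weather", "O", "B-city"]),
    (["sunny", "today"], ["B-weather", "O"]),
    (["visit", "rome"], ["O", "B-city"]),
    (["hello"], ["O"]),
]
support = build_support_set(pool, k=1, skip_prob=0.0, rng=random.Random(0))
```

With `skip_prob=0.0` the result satisfies both criteria exactly; with a positive `skip_prob` criterion (2) may be relaxed, which counteracts the preference for sentences with many slots.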

Here, we take 1-shot slot tagging as an example to illustrate the data construction procedure. For each domain, we sample 100 different 1-shot support sets. Then, for each support set, we sample 20 utterances not included in it as queries (the query set). Each support-query-set pair forms one few-shot episode. Eventually, we get $100$ episodes and $100\times 20$ samples (1 query utterance with a support set) for each domain.

To test the robustness of our framework, we cross-validate the models on different domains. Each time, we pick one target domain for testing, one domain for development, and use the remaining five domains as source domains for training. So for slot tagging, all models are trained on 10,000 samples ( $5\times 2{,}000$ ), and validated and tested on 2,000 samples each.

When testing a model on a target domain, we evaluate F1 scores within each few-shot episode. For each episode, we calculate the F1 score on query samples with the conlleval script (https://www.clips.uantwerpen.be/conll2000/chunking/conlleval.txt). Then we average the 100 F1 scores from all 100 episodes as the final result to counter the randomness from support sets. All models are evaluated on the same support-query-set pairs for fairness.
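The episode-level evaluation can be approximated in a few lines of Python. The sketch below computes a span-level F1 per episode and then averages across episodes; it is a simplified stand-in for the conlleval script, not a reimplementation of it, and the names `extract_spans`, `episode_f1` and `average_f1` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, slot name)


def extract_spans(tags: Sequence[str]) -> Set[Span]:
    """Collect slot spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, slot = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last span
        continues = tag.startswith("I-") and slot == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, slot))
            start, slot = None, None
        # A B- tag, or an I- tag with no open span, starts a new span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, slot = i, tag[2:]
    return spans


def episode_f1(gold: List[Sequence[str]], pred: List[Sequence[str]]) -> float:
    """Span-level F1 over all query sentences of one episode."""
    correct = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if correct == 0:
        return 0.0
    precision, recall = correct / n_pred, correct / n_gold
    return 2 * precision * recall / (precision + recall)


def average_f1(episodes: List[Tuple[List[Sequence[str]], List[Sequence[str]]]]) -> float:
    """Average the per-episode F1 scores."""
    return sum(episode_f1(gold, pred) for gold, pred in episodes) / len(episodes)


episodes = [
    ([["B-city", "O"]], [["B-city", "O"]]),                 # exact match
    ([["B-weather", "I-weather"]], [["B-weather", "O"]]),   # wrong span boundary
]
final_score = average_f1(episodes)
```

Averaging per episode, rather than pooling all query sentences, gives every support set the same weight in the final score.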

To control for the nondeterminism of neural network training (Reimers and Gurevych, 2017), we report the average score over 10 random seeds.

We use the uncased BERT-Base (Devlin et al., 2019) to calculate contextual embeddings for all models. We use ADAM (Kingma and Ba, 2015) to train the models with batch size 4 and a learning rate of 1e-5. For the CRF framework, we learn the scaling parameter $\lambda$ during training, which is important for stable results. For L-TapNet, we set $\alpha$ to 0.5 and $\beta$ to 0.7. We fine-tune BERT with the Gradual Unfreezing trick Howard and Ruder (2018). For both proposed and baseline models, we apply early stopping in training and fine-tuning when the loss does not decrease within a fixed number of steps.

4.2 Baselines

Bi-LSTM is a bidirectional LSTM Schuster and Paliwal (1997) with GloVe (Pennington et al., 2014) embedding for slot tagging. It is trained on the support set and tested on the query samples.

SimBERT is a model that predicts labels according to the cosine similarity of word embeddings from non-fine-tuned BERT. For each word $x_{j}$ , SimBERT finds its most similar word $x_{k}^{\prime}$ in the support set, and the label of $x_{j}$ is predicted to be the label of $x_{k}^{\prime}$ .

TransferBERT is a domain transfer model with the NER setting of BERT (Devlin et al., 2019). We pretrain it on source domains and select the best model on the same dev set as our model. We deal with label mismatch by only transferring bottleneck features. Before testing, we fine-tune it on the target domain support set. The learning rate is set to 1e-5 in training and fine-tuning.

WarmProtoZero (WPZ) (Fritzler et al., 2019) is a few-shot sequence labeling model that regards sequence labeling as classification of every single word. It pre-trains a prototypical network (Snell et al., 2017) on source domains, and utilizes it to do word-level classification on target domains without training. Fritzler et al. (2019) use randomly initialized word embeddings. To eliminate the influence of different embedding methods, we further implement WPZ with the pre-trained embeddings of GloVe (Pennington et al., 2014) and BERT.

Matching Network (MN) is similar to WPZ. The only difference is that we employ the matching network Vinyals et al. (2016) with BERT embedding for classification.
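The nearest-neighbor prediction used by the SimBERT baseline is simple enough to sketch directly. The snippet assumes that word embeddings have already been computed and are passed in as plain lists of floats; the toy vectors and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_tags(
    query_vectors: List[Vector], support: List[Tuple[Vector, str]]
) -> List[str]:
    """Give each query word the label of its most similar support word."""
    return [
        max(support, key=lambda item: cosine(q, item[0]))[1] for q in query_vectors
    ]


# Toy embeddings (made up): one support word per label.
support = [([1.0, 0.0], "B-weather"), ([0.0, 1.0], "O")]
query_vectors = [[0.9, 0.2], [0.1, 0.8]]
predicted = nearest_neighbor_tags(query_vectors, support)
```

Each query word is labeled independently here, so nothing prevents illegal label transitions in the output sequence.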

4.3 Main Results

Table 2 shows the 1-shot slot tagging results. Each column shows the F1 scores obtained when taking a certain domain as the target domain (test) and using the others as source domains (train & dev). As shown in the table, our L-TapNet+CDT achieves the best performance. It outperforms the strongest few-shot learning baseline WPZ+BERT by 14.64 F1 points on average.

Our model significantly outperforms Bi-LSTM and TransferBERT, indicating that the amount of labeled data under the few-shot setting is too small for both conventional machine learning and transfer learning models. Moreover, the performance of SimBERT demonstrates the superiority of metric-based methods over conventional machine learning models in the few-shot setting.

The original WarmProtoZero (WPZ) model suffers from the weak representation ability of its word embeddings. When we enhance it with GloVe and BERT word embeddings, its performance improves significantly. This shows the importance of embeddings in the few-shot setting. Matching Network (MN) performs poorly in both settings. This is largely because MN pays attention to all support words equally, which makes it vulnerable to the unbalanced amount of O-labels.

More specifically, models that are fine-tuned on the support set, such as Bi-LSTM and TransferBERT, tend to predict tags randomly. These systems can only handle cases that are easy to generalize from support examples, such as tags for proper noun tokens (e.g. city name and time). This shows that fine-tuning on extremely limited examples leads to poor generalization ability and undertrained classifiers. For metric-based methods, such as WPZ and MN, label prediction is much more reasonable. However, these models are easily confused by similar labels, such as current_location and geographic_poi. This indicates the necessity of well-separated label representations. Illegal label transitions are also very common, which can be well tackled by the proposed collapsed dependency transfer.

To eliminate unfair comparisons caused by additional information in label names, we propose L-WPZ+CDT by enhancing the WarmProtoZero (WPZ) model with the same label name representation as L-TapNet and incorporating it into the proposed CRF framework. It combines the label name embedding and the prototype as each label representation. Its improvements over WPZ mainly come from label semantics, collapsed dependency transfer and pair-wise embedding. L-TapNet+CDT outperforms L-WPZ+CDT by 4.79 F1 points, demonstrating the effectiveness of embedding projection. When compared with TapNet+CDT, L-TapNet+CDT achieves an improvement of 4.54 F1 points on average, which shows that considering label semantics and prototypes helps improve emission score calculation.

Table 3 shows the results of the 5-shot experiments, which verify the proposed model’s generalization ability in settings with more shots. The results follow the same general trend as the 1-shot setting.

4.4 Analysis

To gain a further understanding of each component in our method (L-TapNet+CDT), we conduct ablation analysis on both 1-shot and 5-shot settings in Table 4. Each component of our method is removed in turn, including: collapsed dependency transfer, pair-wise embedding, label semantics, and prototype reference.

When collapsed dependency transfer is removed, we directly predict labels with emission score and huge F1 score drops are witnessed in all settings. This ablation demonstrates a great necessity for considering label dependency.

For our method without pair-wise embedding, we represent query and support sentences independently. We attribute the drop to the fact that support sentences can provide domain-related context, and pair-wise embedding can leverage such context and provide domain-adaptive representations for words in query sentences. This helps a lot when computing a word’s similarity to domain-specific labels.

When we remove the label semantics from L-TapNet, the model degenerates into TapNet+CDT enhanced with the prototype in the emission score. The drops in results show that considering label names can provide better label representations and help to model word-label similarity. Further, we also tried removing the inner and beginning words from the label representation and observed a 0.97 F1-score drop on 1-shot SNIPS. This shows that distinguishing B-I labels in label semantics can help tagging.

If we calculate the emission score without the prototype reference, the model loses more performance in the 5-shot setting. This matches the intuition that the prototype allows the model to benefit more from an increase in support shots, as prototypes are directly derived from the support set.

While collapsed dependency transfer (CDT) brings significant improvements, two natural questions arise: whether CDT just learns simple transition rules and why it works.

To answer the first question, we replace CDT with transition rules in Table 5 (Transition Rule: we greedily predict the label for each word and block any result that conflicts with the previous label), which shows that CDT brings more improvement than transition rules.
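The rule-based alternative can be sketched as a greedy decoder with a legality check. The specific rule below, that an I- label may only follow the B- or I- label of the same slot, is an assumption about what counts as a conflict; the function names and toy scores are hypothetical.

```python
from typing import Dict, List


def is_legal(prev: str, curr: str) -> bool:
    """Assumed rule: I-x may only follow B-x or I-x."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], curr)
    return True


def greedy_with_rules(emissions: List[Dict[str, float]]) -> List[str]:
    """Pick the best-scoring label per word among those legal after the previous one."""
    prev, path = "O", []
    for scores in emissions:
        legal = [label for label in scores if is_legal(prev, label)]
        prev = max(legal, key=lambda label: scores[label])
        path.append(prev)
    return path


# Toy emission scores (made up) for a two-word query.
emissions = [
    {"O": 0.5, "B-weather": 0.4, "I-weather": 0.1},
    {"O": 0.1, "B-weather": 0.2, "I-weather": 0.7},
]
rule_based = greedy_with_rules(emissions)
```

In this toy case the rule blocks I-weather after O, but the decoder cannot revise its earlier choice of O for the first word, so the span boundary stays wrong. A learned transition score used with global decoding does not have this limitation.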

To gain deeper insight into the effectiveness of CDT, we conduct an accuracy analysis of it. We assess the label prediction accuracy for different types of label bi-grams. The result is shown in Table 6. We further summarize the bi-grams into 2 categories: Border includes the bi-grams across the border of a slot span; Inner includes the bi-grams within a slot span. We argue that the improvements on Inner show that CDT successfully reduces illegal label transitions. Interestingly, we observe that CDT also brings improvements by correctly predicting the first and last tokens of a slot span. The results on Border verify our observation that CDT may help to decide the boundaries of slot spans more accurately, which is hard to achieve by adding transition rules.

Related Works

Traditional few-shot learning methods depend highly on hand-crafted features Fei-Fei (2006); Fink (2005). Classical methods primarily focus on metric learning Snell et al. (2017); Vinyals et al. (2016), which classifies an item according to its similarity to each class’s representation. Recent efforts Lu et al. (2018); Schwartz et al. (2019) propose to leverage the semantics of class names to enhance class representation. However, unlike our work, these methods focus on image classification, where the effects of name semantics are implicit and label dependency is not required.

Few-shot learning in natural language processing has been explored for classification tasks, including text classification Sun et al. (2019); Geng et al. (2019); Yan et al. (2018); Yu et al. (2018), entity relation classification Lv et al. (2019); Gao et al. (2019); Ye and Ling (2019), and dialog act prediction Vlasov et al. (2018). However, few-shot learning for slot tagging is less investigated. Luo et al. (2018) investigated few-shot slot tagging using additional regular expressions, which is not comparable to our model due to the usage of regular expressions. Fritzler et al. (2019) explored few-shot named entity recognition with the Prototypical Network, which has a similar setting to ours. Compared to it, our model achieves better performance by considering both label dependency transfer and label name semantics. Zero-shot slot tagging methods Bapna et al. (2017); Lee and Jha (2019); Shah et al. (2019) share a similar idea with us in using label name semantics, but have a different setting, as few-shot methods are additionally supported by a few labeled sentences. Chen et al. (2016) investigate using label names in intent detection. In addition to learning directly from limited examples, another line of research on the data scarcity problem in NLP is data augmentation Fader et al. (2013); Zhang et al. (2015); Liu et al. (2017). For data augmentation of slot tagging, sentence generation based methods are explored to create additional labeled samples Hou et al. (2018); Shin et al. (2019); Yoo et al. (2019).

Conclusion

In this paper, we propose a few-shot CRF model for slot tagging of task-oriented dialogue. To compute transition score under few-shot setting, we propose the collapsed dependency transfer mechanism, which transfers the prior knowledge of the label dependencies across domains with different label sets. And we propose L-TapNet to calculate emission score, which improves label representation with label name semantics. Experiment results validate that both the collapsed dependency transfer and L-TapNet can improve the tagging accuracy.

Acknowledgments

We sincerely thank Ning Wang and Jiafeng Mao for their help on both the paper and the experiments. We are grateful for the helpful comments and suggestions from the anonymous reviewers. This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011 and 61772153.

References

Appendix

Appendix A Detail of Dataset

Table 7 shows the statistics of the original dataset used to construct few-shot experiment data.

Appendix B Few-shot experiments for Named Entity Recognition

Named entity recognition (NER), which identifies pre-defined named entities such as person names, organizations and locations, can be modeled as a slot tagging task. Also, the data scarcity problem for a new domain exists in the NER task. For these reasons, we conduct few-shot NER experiments to test our model’s generalization ability.

For named entity recognition, we utilize 4 different datasets: CoNLL-2003 (Sang and Meulder, 2003), GUM (Zeldes, 2017), WNUT-2017 (Derczynski et al., 2017) and Ontonotes (Pradhan et al., 2013), each of which contains data from only 1 domain. The 4 domains are News, Wiki, Social and Mixed. Details of the original datasets are shown in Table 7 and statistics of the constructed few-shot data are shown in Table 8.

Table 9 and Table 10 show the 1-shot and 5-shot named entity recognition results, respectively. Our best model outperforms all baselines in both settings.

The trend of the results is consistent with the slot tagging results, but the overall scores are much lower. This is because the NER domains come from different datasets and the domain gap is much larger.

Our margin of improvement narrows in the 5-shot setting. This is because NER domains have different genres and vocabulary. So compared to SNIPS, it is harder to transfer knowledge, and relying on domain-specific support examples is more beneficial. This trend is even more pronounced with more shots. In the 5-shot setting, the strongest baseline WPZ benefits more from the increased shots because it only uses the support set for prediction. But the benefit of more shots is weaker for our model because it uses more prior knowledge.

We investigate the effectiveness of collapsed dependency transfer and label semantics on the NER task. We perform ablations on the two proposed components and observe performance drops in both 1-shot and 5-shot settings, which demonstrates the generalization ability of the two proposed mechanisms.

Appendix C Analysis of Projection Space Dimensionality

Figure 6 shows the performance on 1-shot SNIPS when using different projected-space dimensions in L-TapNet. As the trend in the figure shows, the performance of the model improves as the dimension of the projection space increases and then gradually stabilizes. This shows the possibility of reducing the dimension without losing too much performance Yoon et al. (2019).

Appendix D Slot Tagging Result with Standard Deviations

Tables 12 and 13 show the complete results with standard deviations for the slot tagging task.