Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference

Jian-Guo Zhang, Kazuma Hashimoto, Wenhao Liu, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, Caiming Xiong

Introduction

Intent detection is one of the core components when building goal-oriented dialog systems. The goal is to achieve high intent classification accuracy, and another important skill is to accurately detect unconstrained user intents that are out-of-scope (OOS) in a system (Larson et al., 2019). A practical challenge is data scarcity because different systems define different sets of intents, and thus few-shot learning is attracting much attention. However, previous work has mainly focused on the few-shot intent classification without OOS (Luo et al., 2018; Casanueva et al., 2020).

OOS detection can be considered as out-of-distribution detection (Hendrycks and Gimpel, 2017; DeVries and Taylor, 2018). Recent work has shown that large-scale pre-trained models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) still struggle with out-of-distribution detection, despite their strong in-domain performance (Hendrycks et al., 2020). Figure 1 (a) shows how unseen input text is mapped into a feature space, by a RoBERTa-based model for 15-way 5-shot intent classification. The separation between OOS and some in-domain intents is not clear, which presumably hinders the model’s OOS detection ability. This observation calls for investigation into more sample-efficient approaches to handling the in-domain and OOS examples accurately.

In this paper, we tackle the task from a different angle, and propose a discriminative nearest neighbor classification (DNNC) model. Instead of expecting the text encoders to be generalized enough to discriminate both the in-domain and OOS examples, we make full use of the limited training examples at both training and inference time through a nearest neighbor classification schema. We leverage the BERT-style paired text encoding with deep self-attention to directly model relations between pairs of user utterances. We then train a matching model as a pairwise binary classifier to estimate whether an input utterance belongs to the same class as a paired example. We expect this to free the model from the OOS separation issue in Figure 1 (a) by avoiding explicit modeling of the intent classes. Unlike an embedding-based matching function as in relation networks (Sung et al., 2018) (Figure 1 (b)), the deep pairwise matching function produces clear separation between the in-domain and OOS examples (Figure 1 (c)). We further propose to seamlessly transfer a natural language inference (NLI) model to enhance this clear separation (Figure 1 (d)).

We verify our hypothesis by conducting extensive experiments on a large-scale multi-domain intent detection task with OOS (Larson et al., 2019) in various few-shot learning settings. Our experimental results show that, compared with RoBERTa classifiers and embedding nearest neighbor approaches, our DNNC attains more stable and accurate performance in both in-domain and OOS accuracy. Moreover, our 10-shot model can perform competitively with a 50-shot or even full-shot classifier, with the performance boost from the NLI transfer. We also show how to speed up our DNNC’s inference without sacrificing accuracy.

Background

Given a user utterance $u$ at every turn in a goal-oriented dialog system, an intent detection model $I(u)$ aims at predicting the speaker’s intent:

$$c=I(u),\tag{1}$$

where $c$ is one of pre-defined $N$ intent classes $\mathbf{C}=\{C_{1},C_{2},\ldots,C_{N}\}$ , or is categorized as OOS. The OOS category corresponds to user utterances whose requests are not covered by the system. In other words, any utterance can be OOS as long as it does not fall into any of the $N$ intent classes, so the definition of OOS is different depending on $\mathbf{C}$ .

In a few-shot learning scenario, we have a limited number of training examples for each class, and we assume that we have $K$ examples for each of the $N$ classes in our training data. In other words, we have $N\cdot K$ training examples in total. We denote the $i$ -th training example from the $j$ -th class $C_{j}$ as $e_{j,i}\in E$ , where $E$ is the set of the examples. $K$ is typically 5 or 10.

2.2 Multi-Class Classification

The goal is to achieve high accuracy both for the intent classification and OOS detection. One common approach to this task is using a multi-class classification model. Specifically, to get a strong baseline for the few-shot learning use case, one can leverage a pre-trained model through transfer learning, which has been shown to achieve state-of-the-art results on numerous natural language processing tasks. We use BERT (Devlin et al., 2019; Liu et al., 2019) as a text encoder:

$$h=\mathrm{BERT}(u)\in\mathbb{R}^{d},\tag{2}$$

where $h$ is a $d$ -dimensional output vector corresponding to the special token [CLS] as in the following input format: [[CLS], $u$ , [SEP]]. (The format of these special tokens is different in RoBERTa, but we use the original BERT’s notation.)

To handle the intent classification and the OOS detection, we apply the threshold-based strategy in Larson et al. (2019), to the softmax output of the $N$ -class classification model (Hendrycks and Gimpel, 2017):
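The threshold-based strategy can be sketched as follows: take the softmax over the $N$ class logits, predict the most probable class if its probability exceeds a threshold, and otherwise predict OOS. This is a minimal illustration; the function names, the `OOS` label string, and the strict `>` comparison are assumptions, not details taken from the paper.

```python
import math

OOS = "OOS"  # hypothetical label used here for out-of-scope predictions


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, class_names, threshold):
    """Return the top class if its softmax probability exceeds the threshold, else OOS."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] > threshold:
        return class_names[best]
    return OOS
```

A confident logit vector such as `[4.0, 0.0, 0.0]` keeps its top class at a threshold of 0.5, while a nearly flat one such as `[0.1, 0.0, 0.0]` falls back to OOS.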

2.3 Nearest Neighbor Classification

As the fundamental building block of our proposed method, we also review nearest neighbor classification (i.e., $k$ -nearest neighbors ( $k$ NN) classification with $k=1$ ), a simple and well-established concept for classification (Simard et al., 1993; Cunningham and Delany, 2007). The basic idea is to classify an input into the same class as the most relevant training example based on a certain metric.

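In our task, we formulate a nearest neighbor classification model as the following:

$$I(u)=\mathrm{class}\Big(\operatorname*{arg\,max}_{e_{j,i}\in E}S(u,e_{j,i})\Big),\tag{4}$$

where $S(u,e_{j,i})$ scores how well the input $u$ matches the training example $e_{j,i}$ , and $\mathrm{class}(e_{j,i})=C_{j}$ .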
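A minimal sketch of nearest neighbor classification with $k=1$: the input takes the label of the training example with the highest matching score. The scoring function is passed in as a callable; the toy `token_overlap` score below is purely illustrative and is not the matching function used in the paper.

```python
def nearest_neighbor_intent(u, examples, score):
    """examples: list of (utterance, intent) pairs; score: callable S(u, e).

    Returns the intent of the best-matching example and its score.
    """
    best_text, best_label = max(examples, key=lambda ex: score(u, ex[0]))
    return best_label, score(u, best_text)


def token_overlap(a, b):
    """Toy stand-in for S(u, e): Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

The returned score can then be compared against a threshold to decide whether to output the matched intent or OOS.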

Proposed Method

This section first describes how to directly model inter-utterance relations in our nearest neighbor classification scenario. We then introduce a binary classification strategy by synthesizing pairwise examples, and propose a seamless transfer of NLI. Finally, we describe how to speed up our method’s inference process.

The objective of $S(u,e_{j,i})$ in Equation (4) is to find the best matched utterance from the training set $E$ , given the input utterance $u$ . The typical methodology is to embed each data example into a vector space and (1) use an off-the-shelf distance metric to perform a similarity search (Cunningham and Delany, 2007) or (2) learn a distance metric between the embeddings (Sung et al., 2018). However, as shown in Figure 1, the text embedding methods do not discriminate the OOS examples well enough.

To model fine-grained relations of utterance pairs to distinguish in-domain and OOS intents, we propose to formulate $S(u,e_{j,i})$ as follows:
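A minimal sketch of a scoring function consistent with the description in this paper, namely a paired BERT encoding followed by a binary sigmoid output. The parameter names $W$ and $b$ and the exact paired input format are assumptions for illustration.

```latex
\begin{align}
h &= \mathrm{BERT}(u, e_{j,i}) \in \mathbb{R}^{d}, \\
S(u, e_{j,i}) &= \sigma(W \cdot h + b) \in (0, 1),
\end{align}
```

Here $\sigma$ is the sigmoid function, $W\in\mathbb{R}^{1\times d}$ and $b\in\mathbb{R}$ would be the parameters of the output layer, and the paired input is assumed to follow the usual BERT sentence-pair format [[CLS], $u$ , [SEP], $e_{j,i}$ , [SEP]].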

3.2 Discriminative Training

We train the matching model $S(u,e_{j,i})$ as a binary classifier, such that $S(u,e_{j,i})$ is close to 1.0 if $u$ belongs to the same class as $e_{j,i}$ , and close to 0.0 otherwise. The model is trained with a binary cross-entropy loss function.
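One way to synthesize pairwise training examples from the few-shot set can be sketched as below: every ordered pair of distinct training examples becomes one binary example, labeled 1.0 when the two intents match and 0.0 otherwise. Enumerating all ordered pairs is an assumption made for illustration; the function name `build_pairs` is hypothetical.

```python
from itertools import permutations


def build_pairs(examples):
    """examples: list of (utterance, intent) pairs.

    Returns (u, e, label) triples over all ordered pairs of distinct examples,
    with label 1.0 for same-intent pairs and 0.0 for different-intent pairs.
    """
    pairs = []
    for (u, intent_u), (e, intent_e) in permutations(examples, 2):
        pairs.append((u, e, 1.0 if intent_u == intent_e else 0.0))
    return pairs
```

With $N$ classes and $K$ examples each, this yields $NK(NK-1)$ ordered pairs, of which $NK(K-1)$ are positive, so negatives dominate as $N$ grows.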

Negative examples

3.3 Seamless Transfer from NLI

A key characteristic of our method is that we seek to model the relations between the utterance pairs, instead of explicitly modeling the intent classes. To mitigate the data scarcity setting in few-shot learning, we consider transferring another inter-sentence-relation task.

This work focuses on NLI; the task is to identify whether a hypothesis sentence can be entailed by a premise sentence (Bowman and Zhu, 2019). We treat the NLI task as a binary classification task: entailment (positive) or non-entailment (negative). (A widely-used format is a three-way classification task with entailment, neutral, and contradiction, but we merge the latter two classes into a single non-entailment class.) We first pre-train our model with the NLI task, where the premise sentence corresponds to the $u$ -position, and the hypothesis sentence corresponds to the $e_{j,i}$ -position in Equation (5). Note that it is not necessary to modify the model architecture since the task format is consistent, and we can train the NLI model solely based on existing NLI datasets. Once the NLI model pre-training is completed, we fine-tune the NLI model with the intent classification training examples described in Section 3.2. This allows us to transfer the NLI model to any intent detection datasets seamlessly.
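The label merge described above can be sketched as a small conversion step applied to each NLI example before pre-training. The label strings and the function name are assumptions; actual datasets may encode labels as integers.

```python
def to_binary_nli(premise, hypothesis, label):
    """Map a three-way NLI example to the binary format used for pre-training.

    The premise takes the u-position and the hypothesis takes the example position.
    """
    if label == "entailment":
        target = 1  # positive
    elif label in ("neutral", "contradiction"):
        target = 0  # merged non-entailment class
    else:
        raise ValueError(f"unexpected NLI label: {label!r}")
    return premise, hypothesis, target
```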

The NLI task has been actively studied, especially since the emergence of large scale datasets (Bowman et al., 2015; Williams et al., 2018), and we can directly leverage the progress. Moreover, recent work is investigating cross-lingual NLI (Eriguchi et al., 2018; Conneau et al., 2018), and this is encouraging to consider multilinguality in future work. On the other hand, while we can find examples relevant to the intent detection task, as shown in Table 1 ((c), (d), and (e)), we still need the few-shot fine-tuning. This is because a domain mismatch still exists in general, and perhaps more importantly, our intent detection approach is not exactly modeling NLI.

Why not other tasks?

There are other tasks modeling relationships between sentences. Paraphrase (Wieting and Gimpel, 2018) and semantic relatedness (Marelli et al., 2014) tasks are such examples. It is possible to automatically create large-scale paraphrase datasets by machine translation (Ganitkevitch et al., 2013). However, our task is not a paraphrasing task, and creating negative examples is crucial and non-trivial (Chambers and Jurafsky, 2010). In contrast, as described above, the NLI setting comes with negative examples by nature. The semantic relatedness (or textual similarity) task is considered as a coarse-grained task compared to NLI, as discussed in the previous work (Hashimoto et al., 2017), in that the task measures semantic or topical relatedness. This is not ideal for the intent detection task, because we need to discriminate between topically similar utterances of different intents. In summary, the NLI task well matches our objective, with access to the large datasets.

3.4 A Joint Approach with Fast Retrieval

The number of model parameters of the multi-class classification model in Section 2.2 and our model in Section 3 is almost the same when we use the same pre-trained models. However, our example-based method has an inference-time bottleneck in Equation (5), where we need to compute the BERT encoding for all $N\times K$ $(u,e_{j,i})$ pairs.

We follow common practice in document retrieval to reduce the inference-time bottleneck (Nie et al., 2019; Asai et al., 2020), by introducing a fast text retrieval model to select a set of top- $k$ examples $E_{k}$ from the training set $E$ , based on their retrieval scores. We then replace $E$ in Equation (4) with the shrunk set $E_{k}$ . The cost of the paired BERT encoding is now constant, regardless of the size of $E$ . Either TF-IDF (Chen et al., 2017) or embedding-based retrieval (Johnson et al., 2017; Seo et al., 2019; Lee et al., 2019) can be used for the first step. We use the following fast $k$ NN.
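The two-stage procedure can be sketched as follows: a cheap retrieval score shortlists the top- $k$ training examples, and the expensive pairwise scorer is applied only to that shortlist. Both scorers are passed in as callables; `joint_predict` and its argument names are hypothetical.

```python
def joint_predict(u, examples, fast_score, slow_score, k):
    """examples: list of (utterance, intent) pairs.

    Stage 1 ranks all examples with fast_score and keeps the top k.
    Stage 2 applies slow_score to the shortlist only and returns the
    intent of the best match together with the shortlist size.
    """
    ranked = sorted(examples, key=lambda ex: fast_score(u, ex[0]), reverse=True)
    shortlist = ranked[:k]
    best = max(shortlist, key=lambda ex: slow_score(u, ex[0]))
    return best[1], len(shortlist)
```

The expensive scorer is called at most $k$ times per input, independent of the size of the training set.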

As a baseline and a way to instantiate our joint approach, we use Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) to separately encode $u$ and $e_{j,i}$ ( $x\in\{u,e_{j,i}\}$ ) as follows:

$$v(x)=\mathrm{SBERT}(x)\in\mathbb{R}^{d},\tag{7}$$

where the input text format is identical to that of BERT in Equation (2). SBERT is a BERT-based text embedding model, fine-tuned by siamese networks with NLI datasets. Thus both our method and SBERT transfer the NLI task in different ways.

Cosine similarity between $v(u)$ and $v(e_{j,i})$ then replaces $S(u,e_{j,i})$ in Equation (6). To get a fair comparison, instead of using the encoding vectors produced by the original SBERT, we fine-tune SBERT with our intent training examples described in Section 3.2. The cosine similarity is symmetric, so we have half the training examples. We use the pairwise cosine-based loss function in Reimers and Gurevych (2019). After the model training, we pre-compute $v(e_{j,i})$ for fast retrieval.
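Retrieval over pre-computed example embeddings can be sketched with plain cosine similarity. The vectors here are ordinary Python lists standing in for $v(u)$ and $v(e_{j,i})$ ; a real system would likely use a vectorized or indexed implementation, and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, example_vecs, k):
    """Return indices of the k pre-computed example vectors most similar to the query."""
    ranked = sorted(
        range(len(example_vecs)),
        key=lambda i: cosine(query_vec, example_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```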

Experimental Settings

We use a recently-released dataset, CLINC150 (https://github.com/clinc/oos-eval), for multi-domain intent detection (Larson et al., 2019). The CLINC150 dataset defines 150 types of intents in total (i.e., $N=150$ ), where there are 10 different domains and 15 intents for each of them. Table 2 shows the dataset statistics.

The dataset also provides OOS examples whose intents do not belong to any of the 150 intents. From the viewpoint of out-of-distribution detection (Hendrycks and Gimpel, 2017; Hendrycks et al., 2020), we do not use the OOS examples during the training stage; we only use the evaluation splits as in Table 2.

Single-domain experiments

The task in the CLINC150 dataset is like handling many different services in a single system; that is, topically different intents are mixed (e.g., “alarm” in the “Utility” domain, and “pay bill” in the “Banking” domain). In contrast, it is also a reasonable setting to handle each domain (or service) separately as in Rastogi et al. (2019). In addition to the all-domain experiment, we conduct single-domain experiments, where we only focus on a specific domain with its 15 intents (i.e., $N=15$ ). More specifically, we use four domains, “Banking,” “Credit cards,” “Work,” and “Travel,” among the ten domains. Note that the same OOS evaluation sets are used.

4.2 Evaluation Metrics

We also report OOS precision and OOS F1 for more comprehensive evaluation:
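These metrics can be sketched as below, assuming the standard definitions: in-domain accuracy over the in-domain examples, OOS recall over the gold OOS examples, OOS precision over the examples predicted as OOS, and OOS F1 as the harmonic mean of the two. The function name and the `"oos"` label string are assumptions.

```python
def evaluate(gold, pred, oos_label="oos"):
    """Compute in-domain accuracy and OOS recall, precision, and F1 from label lists."""
    pairs = list(zip(gold, pred))
    in_total = sum(g != oos_label for g, _ in pairs)
    in_correct = sum(g != oos_label and g == p for g, p in pairs)
    oos_total = sum(g == oos_label for g, _ in pairs)
    oos_predicted = sum(p == oos_label for _, p in pairs)
    oos_correct = sum(g == oos_label and p == oos_label for g, p in pairs)

    def ratio(a, b):
        return a / b if b else 0.0

    precision = ratio(oos_correct, oos_predicted)
    recall = ratio(oos_correct, oos_total)
    return {
        "in_domain_accuracy": ratio(in_correct, in_total),
        "oos_recall": recall,
        "oos_precision": precision,
        "oos_f1": ratio(2 * precision * recall, precision + recall),
    }
```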

4.3 Model Training and Configurations

We use RoBERTa (the base configuration with $d=768$ ) as a BERT encoder for all the BERT/SBERT-based models in our experiments (we use https://github.com/huggingface/transformers and https://github.com/UKPLab/sentence-transformers), because RoBERTa performed significantly better and more stably than the original BERT in our few-shot experiments. We combine three NLI datasets, SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and WNLI (Levesque et al., 2011) from the GLUE benchmark (Wang et al., 2018) to pre-train our proposed model.

We apply label smoothing (Szegedy et al., 2016) to all the cross-entropy loss functions, which has been shown to improve the reliability of the model confidence (Müller et al., 2019). Experiments were conducted on a single NVIDIA Tesla V100 GPU with 16GB memory.

We conduct our experiments with $K=5,10$ following the task definition in Section 2.1. We randomly sample $K$ examples from the entire training sets in Table 2, for each in-domain intent class 10 times unless otherwise stated. We train a model with a consistent hyper-parameter setting across the 10 different runs and follow the threshold selection process based on a mean score for each threshold. We also report a standard deviation for each result.

We would not always have access to a large enough development set in the few-shot learning scenario. However, we still use the development set provided by the dataset to investigate the models’ behaviors when changing hyper-parameters like the threshold.

We list the models used in our experiments:

Classifier baselines: “Classifier” is the RoBERTa-based classification model described in Section 2.2. We further seek solid baselines by data augmentation. “Classifier-EDA” is the classifier trained with data augmentation techniques in Wei and Zou (2019). “Classifier-BT” is the classifier trained with back-translation data augmentation (Yu et al., 2018; Shleifer, 2019) by using a transformer-based English $\leftrightarrow$ German translation system (Vaswani et al., 2017).

Non-BERT classifier: We also test a state-of-the-art fast embedding-based classifier, “USE+ConveRT” (Henderson et al., 2019; Casanueva et al., 2020), in the “all domains” setting. Casanueva et al. (2020) showed that “USE+ConveRT” outperformed a BERT classifier on the CLINC150 dataset, while it was not evaluated along with the OOS detection task. We modified their original code (https://github.com/connorbrinton/polyai-models/releases/tag/v1.0) to apply the uncertainty-based OOS detection.

$k$ NN baselines: “Emb- $k$ NN” is the $k$ NN method with S(Ro)BERT(a) described in Section 3.4, and “Emb- $k$ NN-vanilla” is without using our intent training examples for fine-tuning. “TF-IDF- $k$ NN” is another $k$ NN baseline using TF-IDF vectors, which tells us how well string matching performs on our task. We also implement a relation network (Sung et al., 2018), “RN- $k$ NN,” to learn a similarity metric between the SRoBERTa embeddings, instead of using the cosine similarity. (We tried weighted voting as in Cunningham and Delany (2007), but $k=1$ performed better in general.)

Proposed method: “DNNC” is our proposed method, and “DNNC-scratch” is without the NLI pre-training in Section 3.3. “DNNC-joint” is our joint approach on top of top- $k$ retrieval by Emb- $k$ NN (Section 3.4). Our code will be available at https://github.com/salesforce/DNNC-few-shot-intent.

More details about the model training and the data augmentation configurations are described in Appendix A and Appendix B, respectively.

This section shows our experimental results. Appendix C shows some additional figures.

We first show test set results of 5-shot and 10-shot in-domain classification and OOS detection accuracy in Table 4.3 for the four selected domains. In the 5-shot setting, the proposed DNNC method consistently attains the best results across all the four domains. The comparison between DNNC-scratch and DNNC shows that our NLI task transfer is effective. In the 10-shot setting, all the approaches generally experience an accuracy improvement due to the additional training data, and the dominant performance of DNNC weakens, although it remains highly competitive. We can see that our DNNC is comparable with or even surpasses some of the 50-shot classifier’s scores, and the data augmentation techniques are not always helpful when we use the strong pre-trained model.

Entire CLINC150 dataset

Next, Table 4.3 shows results to compare our method with the classifier and USE+ConveRT baselines, on the entire CLINC150 dataset with the 150 intents. USE+ConveRT performs worse than the RoBERTa-based classifier on the OOS detection task. The advantage of DNNC for in-domain intent detection is clear, with its 10-shot in-domain accuracy close to the upper-bound accuracy for the classifier baseline. One observation is that our DNNC method tends to be more confident about its predictions as the number of training examples increases; as a result, the OOS recall becomes lower in the 10-shot setting, while the OOS precision is much higher than that of the other baselines. Better controlling the confidence output of the model is an interesting direction for future work.

When the USE+ConveRT baseline is evaluated along with the OOS detection task, its overall accuracy is not as good as the other RoBERTa-based models, despite its potential in the purely in-domain classification. This indicates that the fine-tuned (Ro)BERT(a) models are more robust to out-of-distribution examples than shallower models like USE+ConveRT, also suggested in Hendrycks et al. (2020).

5.2 Robustness of DNNC

As described in Section 4.2, we select the threshold to determine OOS by making a trade-off between in-domain classification and OOS detection accuracy. It is therefore desirable to have a model with candidate thresholds that provide high in-domain accuracy as well as OOS precision and recall.

We observe in Figure 4 that in the 5-shot setting, DNNC is the most robust to the threshold selection. The contrast between the classification model and DNNC-scratch suggests that nearest neighbor approaches (in this case DNNC) make for stronger discriminators; the advantage of DNNC over DNNC-scratch further demonstrates the power of the NLI transfer and, perhaps more importantly, the effectiveness of the pairwise discriminative pre-training. This result is consistent with the intuition we gained from Figure 1, and the overall observation is also consistent across different settings.

To further understand the differences in behaviors between the classification model and DNNC method, we examine the output from the final softmax/sigmoid function (model confidence score) in Figure 4. At 5-shot, the classifier method still struggles to fully distinguish the in-domain examples from the OOS examples in its confidence scoring, while DNNC already attains a clear distinction between the two. Again, we can clearly see the effectiveness of the NLI transfer.

With the model architectures for BERT-based classifier and DNNC being the same (RoBERTa is used for both methods) except for the final layer (multi-class-softmax vs. binary sigmoid), this result suggests that the pairwise NLI-like training is more sample-efficient, making it an excellent candidate for the few-shot use case.

5.3 DNNC-joint for Faster Inference

Despite its effectiveness in few-shot intent and OOS settings, the proposed DNNC method might not scale in high-traffic use cases, especially when the number of classes, $N$ , is large, due to the inference-time bottleneck (Section 3.4). With this in mind, we proposed the DNNC-joint approach, wherein a faster model is used to filter candidates for the fine-tuned DNNC model.

We compare the accuracy and inference latency metrics for various methods in Table 5. Note that Emb- $k$ NN and RN- $k$ NN exhibit excellent latency performance, but they fall considerably short in both the in-domain intent and OOS detection accuracy, compared to DNNC and the DNNC-joint methods. On the other hand, the DNNC-joint model shows competitiveness in both inference latency and accuracy. These results indicate that the current text embedding approaches like SBERT are not enough to fully capture fine-grained semantics.

Intuitively, there is a trade-off between latency and inference accuracy: with aggressive filtering, the DNNC inference step needs to handle a smaller number of training examples, but might miss informative examples; with less aggressive filtering, the NLI model sees more training examples during inference, but will take longer to process single user input. This is illustrated in Figure 4, where the in-domain intent and OOS accuracy metrics (on the development set of the banking domain in the 5-shot setting) improve with the increase of $k$ , while the latency increases at the same time. Empirically, $k=20$ appears to strike the balance between latency and accuracy, with the accuracy metrics similar to those of the DNNC method, while being much faster than DNNC (dashed lines are the corresponding DNNC references).

Interpretability has recently become an important line of research (Jiang et al., 2019; Sydorova et al., 2019; Asai et al., 2020). The nearest neighbor approach (Simard et al., 1993) is appealing in that we can explicitly know which training example triggers each prediction. Table 11 in Appendix C shows some examples.

Call for better embeddings

Emb- $k$ NN and RN- $k$ NN are not as competitive as DNNC. This encourages future work on the task-oriented evaluation of text embeddings in $k$ NN.

Training time

Our DNNC method needs longer training time than that of the classifier (e.g., 90 vs. 40 seconds to train a single-domain model), because we synthesize the pairwise examples. As a first step, we used all the training examples to investigate the effectiveness, but it is an interesting direction to seek more efficient pairwise training.

Distilled model

Another way to speed up our model is to use distilled pre-trained models (Sanh et al., 2019). We replaced the RoBERTa model with a distilled RoBERTa model, and observed large variances with significantly lower OOS accuracy. Hendrycks et al. (2020) also suggested that the distilled models would not be robust to out-of-distribution examples.

Few-shot text classification

Few-shot classification (Fei-Fei et al., 2006; Vinyals et al., 2016b) has been applied to text classification tasks (Deng et al., 2019; Geng et al., 2019; Xu et al., 2019), and few-shot intent detection has also been studied, but without OOS (Luo et al., 2018; Xia et al., 2020; Casanueva et al., 2020). There are two common scenarios: 1) learning with plenty of examples and then generalizing to unseen classes with a few examples, and 2) learning with a few examples for all seen classes. Meta-learning (Finn et al., 2017; Geng et al., 2019) is widely studied in the first scenario. In our paper, we have focused on the second scenario, assuming that there are only a limited number of training examples for each class. Our work is related to metric-based approaches such as matching networks (Vinyals et al., 2016a), prototypical networks (Snell et al., 2017), and relation networks (Sung et al., 2018), as they model nearest neighbors in an example-embedding or a class-embedding space. We showed that a relation network with the RoBERTa embeddings does not perform comparably to our method. We also considered several ideas from prototypical networks (Sun et al., 2019), but those did not outperform our Emb- $k$ NN baseline. These results indicate that deep self-attention is the key to the nearest neighbor approach with OOS detection.

In this paper, we have presented a simple yet efficient nearest-neighbor classification model to detect user intents and OOS intents. It includes paired encoding and discriminative training to model relations between the input and example utterances. Moreover, a seamless transfer from NLI and a joint approach with fast retrieval are designed to improve the performance in terms of the accuracy and inference speed. Experimental results show superior performance of our method on a large-scale multi-domain intent detection dataset with OOS. Future work includes its cross-lingual transfer and cross-dataset (or cross-task) generalization.

This work is supported in part by NSF under grants III-1763325, III-1909323, and SaTC-1930941. We thank Huan Wang and Wenpeng Yin for their insightful discussions, and the anonymous reviewers for their helpful and thoughtful comments. We also thank Jin Qu, Tian Xie, Xinyi Yang, and Yingbo Zhou for their support in the deployment of DNNC into the internal system.

Appendix

Appendix A Training Details

To use the CLINC150 dataset (Larson et al., 2019; https://github.com/clinc/oos-eval) in our settings, especially for the single-domain experiments, we provide preprocessing scripts along with our code.

General training

This section describes the details about the model training in Section 4.3. For each component related to RoBERTa and SRoBERTa, we solely follow the two libraries, transformers and sentence-transformers, for the sake of easy reproduction of our experiments (https://github.com/huggingface/transformers and https://github.com/UKPLab/sentence-transformers). The example code to train the NLI-style models is also available (https://github.com/huggingface/transformers/tree/master/examples/text-classification). We use the roberta-base configuration (https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json) for all the RoBERTa/SRoBERTa-based models in our experiments. All the model parameters including the RoBERTa parameters are updated during all the fine-tuning processes, where we use the AdamW (Loshchilov and Hutter, 2017) optimizer with a weight decay coefficient of 0.01 for all the non-bias parameters. We use a gradient clipping technique (Pascanu et al., 2013) with a clipping value of 1.0, and also use a linear warmup learning-rate scheduling with a proportion of 0.1 with respect to the maximum number of training epochs.
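A linear warmup schedule with a proportion of 0.1 can be sketched as a function from the training step to the learning rate. Two points are assumptions rather than details given above: the schedule is expressed in optimizer steps instead of epochs, and the learning rate decays linearly to zero after warmup, which is a common pairing with linear warmup.

```python
def lr_at_step(step, total_steps, base_lr, warmup_proportion=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(total_steps - warmup_steps, 1)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

With 100 total steps, the rate peaks at step 10 and reaches zero at step 100.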

Pre-training on NLI tasks

For the pre-training on NLI tasks, we fine-tune a roberta-base model on three publicly available datasets, i.e., SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and WNLI (Levesque et al., 2011) from the GLUE benchmark (Wang et al., 2018). The optimizer and gradient clipping follow the above configurations. The number of training epochs is set to $4$ ; the batch size is set to $32$ ; the learning rate is set to $2\times 10^{-5}$ . We use a linear warmup learning-rate scheduling with a proportion of $0.06$ by following Liu et al. (2019). The evaluation results on the development sets are shown in Table 6, where the low accuracy of WNLI is mainly caused by the data size imbalance. We note that these NLI scores are not comparable with existing NLI scores, because we converted the task to the binary classification task for our model transfer purpose.

Text pre-processing

For all the RoBERTa-based models, we used the roberta-base tokenizer provided in the transformers library (https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_roberta.py). We did not perform any additional pre-processing in our experiments.

Hyper-parameter settings

Appendix B Data Augmentation

We describe the details about the classifier baselines with the data augmentation techniques in Section 4.3.

Classifier-EDA uses the following four data augmentation techniques in Wei and Zou (2019): synonym replacement, random insertion, random swap, and random deletion. We follow the publicly available code (https://github.com/jasonwei20/eda_nlp). We empirically generate one augmentation per technique: each technique is applied separately to the original sentence, so every training example has four augmentations. The probability of a word in an utterance being edited is set to 0.1 for all the techniques.
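Two of the four techniques, random deletion and random swap, need no external resources and can be sketched in a few lines. This is an illustrative re-implementation, not the code from the EDA repository; synonym replacement and random insertion are omitted because they require a thesaurus. The function names and the explicit `rng` argument are assumptions.

```python
import random


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions n_swaps times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

For example, `random_deletion("please pay my credit card bill".split(), 0.1, random.Random(0))` returns the utterance with each word independently dropped with probability 0.1.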

BT

For classifier-BT, we use the English-German corpus in Negri et al. (2018), which is widely used in an annual competition for automatic post-editing research on IT-domain text (Chatterjee et al., 2019). The corpus contains about 7.5 million translation pairs, and we follow the base configuration to train a transformer model (Vaswani et al., 2017) for each direction. Based on the initial trial in our preliminary experiments to generate diverse examples, we decided to use a temperature sampling technique instead of a greedy or beam-search strategy. More specifically, logit vectors during the machine translation process are multiplied by $\tau$ to distort the output distributions, where we set $\tau=5.0$ . For each training example in the intent detection dataset, we first translate it into German and then translate it back to English. We repeat this process to generate up to five unique examples, and use them to train the classifier model. Table 10 shows such examples, and we will release all the augmented examples for future research.
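The sampling step can be sketched as follows, taking the description literally: the logits are multiplied by $\tau$ before the softmax, and the next token is then sampled from the resulting distribution instead of taking the arg max. The function names are assumptions, and the toy logits stand in for a real translation model’s output.

```python
import math
import random


def scaled_distribution(logits, tau=5.0):
    """Softmax over logits multiplied by tau (numerically stable)."""
    scaled = [tau * x for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, tau, rng):
    """Sample a token index from the tau-scaled distribution."""
    probs = scaled_distribution(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Multiplying by $\tau>1$ concentrates probability mass on high-scoring tokens relative to $\tau=1$ , while sampling still allows different outputs across repeated runs, unlike greedy decoding.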

Appendix C Additional Results

Figure 5 shows the same curves in Figure 4 along with the corresponding 10-shot results. We can see that the 10-shot results also exhibit the same trend. Figure 6 shows more visualization results with respect to Figure 1. Again, the 10-shot visualization shows the same trend.

Figure 7 and Figure 8 show 5-shot and 10-shot confidence levels on the test sets of the banking domain and all domains, respectively. Neither Classifier nor Emb- $k$ NN distinguishes the in-domain examples from the OOS examples well, while DNNC shows a clearer distinction between the two.

Faster inference

Figure 9 shows the same curves in Figure 4 also for the 10-shot setting. We can see the same trend with the 10-shot results.

Case studies

Table 11 shows four DNNC prediction examples from the development set of the banking domain. For the first example, the input utterance is correctly predicted with a high confidence score, and the matched utterance is similar to the input utterance; for the second example, the input utterance is predicted incorrectly with a high confidence score, where the matched utterance is related to money but has a slightly different meaning from the input utterance. For the third example, the model gives a very low confidence score when predicting an OOS user utterance as an in-domain intent; the last example is an incorrect case where the input utterance and the matched utterance have a topically similar meaning, resulting in a high confidence score for the wrong label, “bill due.” Based on these observations, it is an important direction to improve the model’s robustness (even with the large-scale pre-trained models) towards such confusing cases.