Leveraging Slot Descriptions for Zero-Shot Cross-Domain Dialogue State Tracking

Zhaojiang Lin, Bing Liu, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Andrea Madotto, Eunjoon Cho, Rajen Subba

cs.CL

Introduction

Task-oriented dialogue systems are designed to assist users in performing daily activities, such as restaurant booking, travel planning, and online shopping. These virtual assistants provide natural language interfaces to services and online APIs Rastogi et al. (2020). Based on users’ needs, these systems frequently require support for new domains. However, the current state-of-the-art systems require a substantial amount of in-domain data to properly model a new domain. The data-collection process is both expensive and time-consuming, and thus it is very important to study methods that can build robust and scalable dialogue systems using little to no in-domain data.

The dialogue state tracking (DST) is an essential component of task-oriented dialogue systems that tracks users’ requirements over multi-turn conversations. A popular formulation of the dialogue state is in the form of a list of slot-value pairs. In DST, tracking unseen slots in a new domain, a.k.a. zero-shot domain adaptation, is a significant challenge, since the model has never seen in-domain training samples. There are two main lines of work to tackle this problem. The first proposes domain transferable models using copy mechanisms or ontology graph information Wu et al. (2019); Zhou and Small (2019). A limitation of such models is that they may not fully leverage pre-trained language models due to the specialized model architecture. The second line of work uses slot-descriptions as input to the model to facilitate the slot understanding Rastogi et al. (2020). However, the provided slot descriptions are collected by crowd sourced human annotators and might be inconsistent among different domains. In general, the optimal approach for constructing slot descriptions in zero-shot settings remains unexplored.

In this work, we tackle the challenge of zero-shot cross-domain DST via leveraging large scale pre-trained sequence-to-sequence (seq2seq) models and with effective encoding of slot descriptions. We first introduce a generative DST model called T5DST, which models the relation of a slot and its dialogue context with a self-attentive encoder, and generates the slot value with a decoder in an autoregressive manner. This simple design allows us to effectively incorporate a pre-trained seq2seq model (e.g., T5 Raffel et al. (2020)) without any task-specific modification. To further enhance the model’s cross-domain transferability, we propose Slot Type Informed Descriptions that capture the shared information of different slots. Experimental results on the MultiWOZ benchmark Budzianowski et al. (2018) suggest that 1) our model achieves significantly higher joint goal accuracy compared to existing results in zero-shot cross domain DST; 2) models using the proposed slot description formulation substantially outperform those using other slot description variants. Our contributions are summarized as the following:

We propose a simple yet novel generative DST model based on T5 that significantly improves existing zero-shot cross-domain DST results;

We investigate the effectiveness of different slot description formulations. To the best of our knowledge, this is the first work that comprehensively studies the effectiveness of slot descriptions in zero-shot cross-domain DST.

Related Work

has been of broad interest to the dialogue research community Williams and Young (2007); Williams et al. (2014); Heck et al. (2020); Liu et al. (2020); Wu et al. (2020); Madotto et al. (2020). Current state-of-the-art models Chen et al. (2020); Lin et al. (2020); Heck et al. (2020); Hosseini-Asl et al. (2020); Ye et al. (2021); Li et al. (2020) trained with extensive annotated data have been shown promising performance in complex multi-domain conversations Budzianowski et al. (2018). However, collecting large amounts of data for every domain is costly and inefficient. To address this issue, several methods Wu et al. (2019); Zhou and Small (2019) have proposed for transferring prior knowledge of existing domains to new ones. On the other hand, Campagna et al. (2020) proposed an abstract dialogue model that leverages the ontology and in-domain templates to generate a large amount of synthesized data for domain adaptation. Different from their method, in this paper, we utilize a pre-trained seq2seq model and slot descriptions for cross-domain DST without any in-domain data.

has been shown to be a promising technique in cross domain semantic parsing Bapna et al. (2017); Shah et al. (2019); Namazifar et al. (2020). To encourage this line of research in DST as well, MultiWOZ2.1 Eric et al. (2019) provides a further annotation for slot descriptions. Rastogi et al. (2020) incorporated slot descriptions for facilitating cross domain DST, while Gao et al. (2019, 2020) formulated DST as a question answering problem by casting a slot name into questions. However, these works did not show the effectiveness of slot descriptions, by comparing the performance of models with and without them. There is no study on how to construct slot descriptions. In this paper, we aim to fill this research gap by providing an empirical study on the different slot description formulations.

Methodology

The design of our model follows the basis of generative question answering models. As illustrated in Figure 1, given a dialogue history which consists of an alternating set of utterances from two speakers, denoted as $\mathcal{C}_{t}=\{U_{1},R_{1},\dots,R_{t-1},U_{t}\}$ , we add the "user:" and "system:" prefixes to the user and system utterance respectively. Then all the utterances and slot names $s_{i}$ are concatenated into a single sequence, i.e., user: $U_{1}$ $\dots$ system: $R_{t-1}$ user: $U_{t}$ [sep] $s_{i}$ . The sequence is used as the input to the encoder, and the decoder generates the corresponding slot value $v_{i}$ :

The learning objective of this generation process is minimizing the negative log-likelihood of $v_{i}$ given $\mathcal{C}_{t}$ and $s_{i}$ , that is,

where $n$ is the number of slots to be tracked.

We initialize the model parameters with T5 Raffel et al. (2020), an encoder-decoder Transformer with relative position embeddings Shaw et al. (2018) pre-trained on a massive amount of English text. We denote our model as T5DST. To incorporate slot descriptions into T5DST, we replace the slot name with its corresponding slot description as the model input.

2 Slot Type Informed Descriptions

Although different slots may have distinguishing names, they can share the same slot type. As shown in Table 1, the slot type of hotel-stars and restaurant-book people are both number slots, while hotel-internet and hotel-parking are both boolean slots. In light of these observations, we hypothesize that adding slot type information to the slot description facilitates the knowledge transfer among different slots. We construct a template for each slot type that follows "[slot type] of [slot] of the [domain]". We denote such a slot description as Slot Type. More details are available in Appendix A.1.

Experiments

We evaluate the proposed method on the MultiWOZ 2.0 dataset Budzianowski et al. (2018), which has 7 domains. We use the pre-processing and evaluation setup from Wu et al. (2019), where restaurant, train, attraction, hotel, and taxi domains are used for training, as the test set only contains these 5 domains.

In the zero-shot cross-domain experiments, the models are first trained with four domains and then evaluated on the test-set of the unseen domain. Joint goal accuracy is used to evaluate the performance of the models. The generated dialogue states are considered to be correct if and only if all of the predicted values exactly match the oracle values.

2 Implementation

We implement T5DSTSource code is available in https://github.com/facebookresearch/Zero-Shot-DST based on the T5-small (60M parameters) model which has 6 encoder-decoder layers and the hidden size $d_{model}=512$ . All models are trained using an AdamW Loshchilov and Hutter (2018) optimizer with the initial learning rate of $0.0001$ . In all cross-domain zero-shot experiments, we train the models with batch size 128 for 5 epochs. For the few-shot experiments, the models are first trained on 4 domains for 5 epochs then fine-tuned with 1%, 5% and 10% of target domain data for 10 epochs. For full shot training, we train our model for at most 10 epochs with batch size 64 and early stop according to the loss in the validation set. Other hyper-prameters are same as zero-shot cross-domain setting. We use 8 NVIDIA V100 GPUs for all of our experiments. We use greedy decoding in test time.

3 Baselines

Transferable dialogue state generator Wu et al. (2019) which utilizes copy mechanism to facilitate domain knowledge transfer.

Slot-utterance matching belief tracker Lee et al. (2019) based on the language model BERT Devlin et al. (2018).

Dialogue state tracking via question answeringWe are aware of STARC Gao et al. (2020). However, we are not able to compare to our results with their results because they use different training data. over ontology graph Zhou and Small (2019).

SimpleTOD Hosseini-Asl et al. (2020) uses a single causal language model GPT2 Radford et al. (2019) to generate the dialogue states. To adapt this model to a zero-shot cross-domain setting, we also provide the slot name as the model input. We denote this model as SimpleTOD++.

3.2 Slot Description Variants

Human annotated slot descriptions collected in MultiWOZ2.1 Eric et al. (2019) and used in MultiWOZ2.2 Zang et al. (2020).

Simple transformation of the slot name from "domain-slot" to "[slot] of the [domain]".

Following recent works Zhang et al. (2019); Rastogi et al. (2020), slots are divided into categorical and non-categorical slots. For categorical slots, we incorporate the candidate values into the slot description, i.e., "[slot] of the [domain] is [value-1] or [value-2]?". The order of values is random. For non-categorical slots, their descriptions are the same as aforementioned Naive.

Similar to Gao et al. (2019, 2020), we reformulate the slot into a natural language question, i.e., "What is the [slot] of the [domain] that is the user interested in?".

4 Results & Discussion

The results of the zero-shot cross domain experiments are shown in Table 2. Overall, T5DST achieves significantly higher performance in terms of averaged joint goal accuracy compared to the three baseline models TRADE, SUMBT, and SimpleTOD++. These results demonstrate that our model can effectively capture the slot-context relation, and thus generalize better in unseen domains.

Replacing slot-names with human annotated slot descriptions does not bring improvement to the zero-shot performance. This might because of the diverse and inconsistent human descriptions among different domains. For example, the human descriptions of attraction-area and restaurant-area are "area to search for attractions" and "area or place of the restaurant" respectively. Such inconsistent descriptions increase the challenge on slot understanding in the zero-shot learning setting. the model using naive slot descriptions gives similar performance to the one that uses original slot names. The two approaches lead to similar semantic representation of the slots. In contrast, incorporating slot values hurts the learning, leading to a lower joint goal accuracy in the restaurant domain. We observe that even though adding value candidates improve some of the categorical slots (e.g., restaurant-area 68.35% $\rightarrow$ 82.25% slot accuracy), it hurts the unseen non-categorical slots (e.g., restaurant-food 40.63% $\rightarrow$ 26.10% slot accuracy). These non-categorical slots are usually the bottlenecks of joint goal accuracy. Finally, models trained with question style descriptions improves the performance in some domains, but fails in the others.

Our proposed slot type informed descriptions consistently improves the zero-shot performance of T5DST in all the domains. It produced an average of 2% joint goal accuracy improvement compared to human labeled and naive description formulations. This result indicates that slot type information may better capture the shared property (e.g., time, location) among different slots, thus facilitating the domain knowledge transferring for DST.

Figure 3 and 4 show the slot accuracy of models using Naive and Slot Type description. Compared to naive description, we obverse significant gain of time slots (e.g., arrive by and leave at), location slots (e.g., departure and destination), and number slots (e.g., book stay and book people) by adding slot type information. We conjecture that explicit information about the target value (i.e., slot type) is important in the low resource condition when the model does not have enough data to capture the semantic meaning of a new slot.

4.2 Few-Shot Cross-Domain

We further conduct experiments in few-shot cross-domain settings, as in Wu et al. (2019); Zhou and Small (2019), where the models are first trained on 4 domains then fine-tuned with 1%, 5% and 10% of target domain data. As shown in Table 3, our model outperforms the DSTQA model in 4 out of 5 domains. Moreover, our approach is more practical in a real-world learning scenario as it does not require the supervision of a full ontology graph. We also conduct the full shot experiments and compare our model with previous methods. The reults are reported in Appendix A.2.

Conclusion

In this paper, we propose leveraging large scale pre-trained models with an effective slot description formulation to tackle the zero-shot cross-domain DST challenge. Specifically, we propose T5DST, a novel generative DST model based on the T5 language model, and incorporate Slot Type Informed Descriptions to facilitate cross-domain knowledge transfer. In the evaluation on the MultiWOZ dataset, our approach substantially improves existing results in both the zero-shot and few-shot settings.

References

Appendix A Appendices

As shown in Table 4, each slot type has one prefix for appending to the beginning of the description. We used three different templates to construct the slot description. For all the booking slots (e.g., book people), we use "[prefix] [slot] for the [domain] booking". For boolean slots, we use "[prefix] [slot] in the [domain]". And for all the others, we use "[prefix] [slot] of the [domain]". When a slot name (e.g., train-day) overlap with the slot type (e.g., day) or a slot does not fall into any slot type category (others), we simply set the prefix as an empty string.

A.2 Full Shot Results

To understand the full shot performance of our T5DST model and whether slot description is still helpful when there is enough training data, we also conduct the experiments in a full data setting. As shown in Table 5, using slot description only improves the joint goal accuracy by 0.56% in MultiWoz 2.0 and 0.30% in MultiWoz 2.1, which indicates that the description is less effective when there is a large amount of data for training.

Compared to prior models with zero-shot capability, T5DST shows promising performance. Compared to other state-of-the-art models that optimized for full shot training, our model achieve competitive results in MultiWoz 2.0, but inferior results on MultiWoz 2.1. We notice that there are many training strategies (e.g., token masking Kim et al. (2019); Heck et al. (2020)), additional supervision (e.g., full ontology Chen et al. (2020)), and label cleaning strategies Heck et al. (2020)) that may impact final full-shot result. We also expect higher performance with a larger T5 model, such as T5-base or T5-large. However, achieving SOTA in full-scale training is out of the scope of this work.