BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling

Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Peng Xu, Feijun Jiang, Yuxiang Hu, Chen Shi, Pascale Fung

Introduction

Task-oriented dialogue (ToD) systems are designed to assist humans in performing daily activities, such as ticket booking, travel planning, and online shopping. These systems are the core modules of virtual assistants (e.g., Apple Siri and Amazon Alexa), and they provide natural language interfaces for online services. Recently, there has been growing interest in developing deep learning-based end-to-end ToD systems because they can handle complex dialogue patterns with minimal hand-crafted rules. To advance the existing state-of-the-art, large-scale datasets have been proposed for training and evaluating such data-driven systems.

However, existing datasets for end-to-end ToD modelling are limited to a single language, such as English or Chinese. The absence of bilingual or multilingual datasets not only limits the research on cross-lingual transfer learning but also hinders the development of robust end-to-end ToD systems for multilingual countries and regions.

To tackle the challenge mentioned above, we introduce 𝔹𝕚𝕋𝕠𝔻, a bilingual multi-domain dataset for task-oriented dialogue modelling. 𝔹𝕚𝕋𝕠𝔻 has 7,232 bilingual dialogues (in English and Chinese), spanning seven services within five domains, where each dialogue is annotated with dialogue states, speech-acts, and service API calls. Therefore, 𝔹𝕚𝕋𝕠𝔻 can be used for building both end-to-end ToD systems and dialogue sub-modules (e.g., Dialogue State Tracking). We propose three evaluation settings: 1) monolingual, in which the models are trained and tested on either English or Chinese data, 2) bilingual, where the models are trained with bilingual data and tested with English and Chinese dialogues simultaneously, and 3) cross-lingual, where the models are first trained with the source language and then tested in a few-shot setting in the target language.

The contribution of this work is three-fold. 1) We propose the first bilingual dataset (𝔹𝕚𝕋𝕠𝔻) with a total of 7,232 dialogues for end-to-end ToD modeling. 𝔹𝕚𝕋𝕠𝔻 serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches. 2) We provide novel baselines under the three evaluation settings, i.e., monolingual, bilingual, and cross-lingual. 3) We show the effectiveness of training a bilingual ToD system compared to two independent monolingual ToD systems, as well as the potential of leveraging a bilingual knowledge base and cross-lingual transfer learning to improve system performance under low-resource conditions.

The paper is organized as follows: We describe the 𝔹𝕚𝕋𝕠𝔻 data collection methods in Section 2 and our proposed tasks in Section 3. Section 4 introduces our baselines, and we finally present and discuss results in Section 5.

𝔹𝕚𝕋𝕠𝔻 Dataset

𝔹𝕚𝕋𝕠𝔻 is designed to develop virtual assistants in multilingual cities, regions, or countries (e.g., Singapore, Hong Kong, India, Switzerland, etc.). For the 𝔹𝕚𝕋𝕠𝔻 data collection, we chose Hong Kong since it is home to plenty of attractions, restaurants and more, and is one of the most visited cities globally, especially by English and Chinese speakers. This section describes the knowledge base construction and provides detailed descriptions of the dialogue collection.

We collect publicly available Hong Kong tourism information from the Web to create a knowledge base that includes 98 metro stations, 305 attractions, 699 hotels, and 1,218 restaurants. For the weather domain, we synthetically generate the weather information on different dates. Then, we implement seven service APIs (Restaurant_Searching, Restaurant_Booking, Hotel_Searching, Hotel_Booking, Attraction_Searching, MTR_info, Weather_info) to query our knowledge base. The knowledge base statistics are shown in Table 1. Although we aim to collect a fully parallel knowledge base, we observe that some items do not include bilingual information. For example, several traditional Cantonese restaurants do not have English names, and similarly, some restaurants do not provide addresses in Chinese. This lack of parallel information reflects the real-world challenge that databases are often incomplete and noisy.

2 Dialogue Data Collection

The dialogues are collected through a four-phase pipeline, as shown in Figure 2. We first design a schema, as a flowchart, for each service API, to specify the possible API queries and expected system actions after the API call. Then, user goals are sampled from the knowledge base according to the pre-defined schemas. Based on the user goals, the dialogue simulator interacts with the APIs to generate dialogue outlines. Finally, the dialogue outlines are converted into natural conversations through crowdsourcing. Our data collection methodology extends the Machine-to-Machine (M2M) approaches to bilingual settings to minimize the annotation overhead (time and cost).

The dialogue schema, shown as a flowchart (Restaurant_Searching) in Figure 2.a, specifies the input and output options of the API and the desired system behaviours. To elaborate, the user searches for a restaurant by name, location, cuisine, etc. Then the system calls the API and informs the user of the restaurant name and other requested information. If the user is not satisfied with the search results, the system continues searching and provides other options. To ensure the provided services are realistic, we impose a few restrictions. Firstly, each API has a list of required slots, and the system is not allowed to call the API without specifying values for these slots. For example, the system needs to obtain departure and destination locations before calling the metro-info API. Secondly, the system must confirm the booking information with the user before making any reservations (e.g., restaurant booking).
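The two restrictions above can be illustrated with a minimal sketch. The slot names (`departure`, `destination`), the `REQUIRED_SLOTS` and `BOOKING_APIS` tables, and both helper functions are hypothetical names chosen for illustration; they are not taken from the released code.

```python
from typing import Dict, Set

# Hypothetical slot names: the text only states that the metro-info API
# needs departure and destination locations before it can be called.
REQUIRED_SLOTS: Dict[str, Set[str]] = {
    "MTR_info": {"departure", "destination"},
}

# Assumption: the two booking services are the ones that make reservations.
BOOKING_APIS: Set[str] = {"Restaurant_Booking", "Hotel_Booking"}


def missing_slots(api_name: str, state: Dict[str, str]) -> Set[str]:
    """Return the required slots that have no value yet in the dialogue state."""
    return REQUIRED_SLOTS.get(api_name, set()) - set(state)


def may_call_api(api_name: str, state: Dict[str, str],
                 user_confirmed: bool = False) -> bool:
    """Allow a call only if required slots are filled and bookings are confirmed."""
    if missing_slots(api_name, state):
        return False
    if api_name in BOOKING_APIS and not user_confirmed:
        return False
    return True
```

In this sketch, a system agent would request whatever `missing_slots` returns before querying the service, and would ask for confirmation before any booking call.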

A user goal consists of a list of intents and a set of constraints under each intent. Figure 2.b shows a single domain (intent) example where the user’s intent is Restaurant_Search. A constraint is defined as a triple (slot, relation, value) (e.g., (Rating, at_least, 4)). Unlike previous work, which defined user constraints as slot-value pairs, we impose slot-value relations (listed in Figure 3.b) to promote more diverse user goals. To generate a user goal, we first sample a list of intents. For each intent, we then randomly sample a set of slot-relation-value combinations from the bilingual knowledge base, including non-existent combinations to create unsatisfiable user requests. In multi-domain scenarios, we set a certain probability of sharing the same values for some of the cross-domain slots (e.g., date and location) to make the transition among domains smooth. For example, users might want to book restaurants and hotels on the same date or take the metro from the location of the booked restaurant to their hotel. Note that the user goals for English and Chinese are sampled independently, as real-world customer service conversations are often not parallel across languages.

Dialogue outlines are generated by a bilingual dialogue simulator that accepts user goals in both languages as inputs. The dialogue simulator consists of a user agent and a system agent. Both agents interact with each other using a finite set of actions specified by speech acts over a probabilistic automaton designed to capture varied dialogue trajectories. Each speech act takes a slot or slot-relation-value triple as an argument. When the conversation starts, the user agent is assigned a goal, while the system agent is initialized with a set of requests related to the services. During the conversation, the user informs the system of constraints according to the user goal, and the system responds to the user queries while interacting with the service APIs. For some services, the system needs to request all the required slots before querying the APIs. After the API call, the system either informs the user of the search result or searches for other options until the user intents are fulfilled. We also augment the value entities during the dialogue outline generation process, e.g., Tsim Sha Tsui can be replaced with its abbreviation TST, as shown in Figure 2.c. After the user goal is fulfilled by a series of user and system actions, we convert all the actions into natural language using templates. In this phase, we obtain the dialogue state annotations and speech acts automatically for both the user and system sides.

The dialogue outlines are converted to natural dialogues via crowdsourcing. Figures 5 and 6 show the interfaces for Chinese and English paraphrasing, where workers see the full dialogue and rewrite the dialogue turn by turn. Before the task, workers are asked to read the instructions, shown in Figure 7. In the instructions, we specify that the paraphrased dialogue should retain the same meaning as the dialogue outline but sound like a real conversation between a user and a professional assistant. The user utterances are expected to be creative and diverse, while the system utterances are expected to be formal and correct. To ensure all the essential information is presented in the new dialogue, we highlight all the entities with bold text. In the user utterances, the highlighted entities are allowed to be paraphrased without losing their original meaning; e.g., “The restaurant should provide Vegan Options” is allowed to be rewritten as “I would like to find a vegan-friendly restaurant”. In contrast, all the entities in the system utterances are required to be unchanged.

After the dialogue paraphrasing, workers are asked to read through the new dialogue and answer the following questions: 1) Does it seem like a conversation between a user that sounds like you and an assistant that sounds formal? 2) Does it have the same meaning as the original conversation, while still making sense on its own? The first question examines whether the new conversation is realistic, and the second question verifies whether the paraphrased dialogue is consistent with the dialogue outline. Given the two answer options, 1) Yes and 2) No, but I cannot make it better, 97.56% of annotators chose the first option for the first question and 98.89% of them chose the first option for the second question.

3 Dataset Statistics

We collected 7,232 dialogues with 144,798 utterances, in which 3,689 dialogues are in English, and 3,543 dialogues are in Chinese. We split the data into 80% training, 8% validation, and 12% testing, resulting in 5,787 training dialogues, 542 validation dialogues, and 902 testing dialogues. In Figure 3 we show the main data statistics of the BiToD corpus. As shown in Figure 3.a, the lengths of the dialogues vary from 10 turns to more than 50 turns. Multi-domain dialogues, in both English and Chinese, have many more turns than single-domain dialogues. The most used relation in user goals is equal_to (Figure 3.b), and the most common speech acts (Figure 3.c) for users and systems are inform and offer, respectively. Finally, in Table 4 in the Appendix, we list all the informable and requestable slots per domain.

4 Dataset Features

Table 2 shows the comparison of the 𝔹𝕚𝕋𝕠𝔻 training set to previous ToD datasets. Prior work for end-to-end ToD modelling only focuses on a single language. Our 𝔹𝕚𝕋𝕠𝔻 is the first bilingual ToD corpus with comparable data size. In addition to its bilingualism, 𝔹𝕚𝕋𝕠𝔻 also provides the following unique features:

Given an API query for recommendation services (e.g., restaurant searching and hotel searching), there is typically more than one matched item. Previous works have randomly sampled one or two items as API results and returned them to users. However, in real-world applications, the system should recommend items according to certain criteria (e.g., user rating). Moreover, the randomness of the API also increases the difficulty of evaluating the models. Indeed, the evaluation metrics in prior work rely on delexicalized response templates, which are not compatible with knowledge-grounded generation approaches. To address these issues, we implement deterministic APIs by ranking the matched items according to user ratings.
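A minimal sketch of such a deterministic API is given below. The function name `search_restaurants`, the item fields (`name`, `cuisine`, `rating`), and the tie-breaking rule (alphabetical by name) are assumptions made for illustration; the text only states that matched items are ranked by user rating.

```python
from typing import Dict, List, Union

Item = Dict[str, Union[str, float]]


def search_restaurants(items: List[Item], cuisine: str, top_k: int = 1) -> List[Item]:
    """Return the top_k matching items, ranked by rating instead of sampled randomly."""
    matched = [item for item in items if item["cuisine"] == cuisine]
    # Highest rating first; ties are broken by name (an assumption) so that
    # the same query always yields the same result.
    ranked = sorted(matched, key=lambda item: (-item["rating"], item["name"]))
    return ranked[:top_k]
```

Because the result depends only on the query and the knowledge base, a model's API call can be checked against a single expected answer.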

To simulate more diverse user goals, we impose different relations for slot-value pairs. For example, in the restaurant searching scenarios, a user might want to eat Chinese food (cuisine, equal_to, Chinese), or might not want Chinese food (cuisine, not, Chinese). Figure 3.b shows the distribution of different relations in user goals.
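As an illustration, the sketch below checks a knowledge base item against a list of slot-relation-value triples. Only the three relations named in the text (equal_to, not, at_least) are covered; the `RELATIONS` table, the `satisfies` helper, and the lowercase slot names are hypothetical.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Constraint = Tuple[str, str, Any]  # (slot, relation, value)

# Only the relations mentioned in the text; Figure 3.b lists the full set.
RELATIONS: Dict[str, Callable[[Any, Any], bool]] = {
    "equal_to": operator.eq,
    "not": operator.ne,
    "at_least": operator.ge,
}


def satisfies(item: Dict[str, Any], constraints: List[Constraint]) -> bool:
    """True if the item meets every (slot, relation, value) constraint."""
    return all(
        slot in item and RELATIONS[relation](item[slot], value)
        for slot, relation, value in constraints
    )
```

With plain slot-value pairs, only the `equal_to` case could be expressed; the relation field is what allows goals such as excluding a cuisine or setting a minimum rating.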

Our corpus contains code-switching utterances as some of the items in the knowledge base have mixed-language information. In the example in Figure 1.b, the system first recommends a restaurant called ChocoDuck Bistro and the user asks for other options. Then the system searches other restaurants with an additional constraint (restaurant_name, not, ChocoDuck Bistro). In this example, both restaurants only have English names, which is a common phenomenon in multilingual regions like Hong Kong. Thus, ToD systems need to handle the mixed-language context to make correct API calls.

Our corpus includes scenarios where the value of a slot is not presented in the conversation, and the system needs to carry over values from previous API results. In the example in Figure 1.a, the user first finds and books a restaurant without specifying the location; then she wants an attraction near the restaurant. In this case, the system needs to infer the attraction location (Wan Chai) from the restaurant search result.

Tasks & Evaluations

Dialogue state tracking (DST), an essential task for ToD modelling, tracks the users’ requirements over multi-turn conversations. DST labels provide sufficient information for a ToD system to issue API calls and carry out dialogue policies. In this work, we formulate a dialogue state as a set of slot-relation-value triples. We use Joint Goal Accuracy (JGA) to evaluate the performance of the DST. A model output is counted as correct only when all of the predicted slot-relation-value triples exactly match the oracle triples.
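The JGA definition above can be sketched as follows, assuming each turn's dialogue state is given as a collection of (slot, relation, value) triples. The function name and the input layout are illustrative choices, not the official evaluation script.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (slot, relation, value)


def joint_goal_accuracy(predictions: List[Iterable[Triple]],
                        references: List[Iterable[Triple]]) -> float:
    """Fraction of turns whose predicted triple set exactly equals the oracle set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must cover the same turns")
    if not references:
        return 0.0
    # A turn is correct only if every triple matches; order is irrelevant.
    correct = sum(set(pred) == set(ref) for pred, ref in zip(predictions, references))
    return correct / len(references)
```

A single wrong relation or value makes the whole turn incorrect, which is why JGA is a strict metric.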

2 End-to-End Task Completion

A user’s requests are fulfilled when the dialogue system makes correct API calls and correctly displays the requested information. We use the following automatic metrics to evaluate the performance of end-to-end task completion: 1) Task Success Rate (TSR): whether the system provides the correct entity and answers all the requested information of a given task, 2) Dialogue Success Rate (DSR): whether the system completes all the tasks in the dialogue, 3) API Call Accuracy ($\textbf{API}_{Acc}$): whether the system generates a correct API call, and 4) BLEU: measuring the fluency of the generated response.
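The relation between TSR and DSR can be sketched as below, assuming the per-task success judgments are already available as booleans grouped by dialogue. How each task is judged (correct entity and all requested information) is not modelled here; the function name and input layout are hypothetical.

```python
from typing import List, Tuple


def success_rates(dialogues: List[List[bool]]) -> Tuple[float, float]:
    """Return (TSR, DSR) from per-task success flags grouped by dialogue."""
    tasks = [ok for dialogue in dialogues for ok in dialogue]
    tsr = sum(tasks) / len(tasks) if tasks else 0.0
    # A dialogue succeeds only if every one of its tasks succeeds.
    dsr = (sum(all(dialogue) for dialogue in dialogues) / len(dialogues)
           if dialogues else 0.0)
    return tsr, dsr
```

Since one failed task fails the whole dialogue, DSR can never exceed TSR when every dialogue has the same number of tasks, and it drops quickly as dialogues contain more tasks.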

3 Evaluation Settings

Under the monolingual setting, models are trained and tested on either English or Chinese dialogues.

Under the bilingual setting, models are trained on bilingual dialogues (full training set), and in the testing phase, the trained models are expected to handle dialogues in both languages simultaneously without any language identifiers.

The cross-lingual setting simulates the condition of lacking data in a certain language, and we study how to transfer knowledge from a high-resource language to a low-resource language. In this setting, models have full access to the source language data but only limited access to the target language data (10%).

Proposed Baselines

Our proposed baselines are based on the recent state-of-the-art end-to-end ToD modeling approach MinTL and the cross-lingual transfer approach MTL. We report the hyper-parameters and training details in Appendix A.5.

We define a dialogue $\mathcal{D}=\{U_{1},S_{1},\dots,U_{T},S_{T}\}$ as an alternating sequence of user and system utterances. At turn $t$, we denote the dialogue history as $\mathcal{H}_{t}=\{U_{t-w},S_{t-w},\dots,S_{t-1},U_{t}\}$, where $w$ is the context window size. We denote the dialogue state and knowledge state at turn $t$ as $B_{t}$ and $K_{t}$, respectively.
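For concreteness, the windowed history $\mathcal{H}_{t}$ can be assembled as in the sketch below. Turns are 1-indexed as in the notation above; clipping the window at the first turn when $t-w<1$ is an assumption, and the function name is hypothetical.

```python
from typing import List


def dialogue_history(user_utts: List[str], system_utts: List[str],
                     t: int, w: int) -> List[str]:
    """Build [U_{t-w}, S_{t-w}, ..., S_{t-1}, U_t] for a 1-indexed turn t."""
    if not 1 <= t <= len(user_utts):
        raise ValueError("t must index an existing user turn")
    if len(system_utts) < t - 1:
        raise ValueError("system utterances up to turn t-1 are required")
    start = max(1, t - w)  # assumption: clip the window at the first turn
    history: List[str] = []
    for i in range(start, t):
        history.append(user_utts[i - 1])
        history.append(system_utts[i - 1])
    history.append(user_utts[t - 1])  # the current user utterance U_t
    return history
```

The history always ends with the current user utterance, so the system response $S_{t}$ is never part of the model input at turn $t$.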

1 ToD Modeling

$Lev_{t}$ is a text span that contains the information for updating the dialogue state from $B_{t-1}$ to $B_{t}$. The updated dialogue state $B_{t}$ and a response generation prompt, $P_{R}="Response:"$, are used as input. Then, the model will either generate an API name (2a) when an API call is needed at the current turn, or a plain text response directly returned to the user (2b). If the model generates an API name, the system will query the API with the constraints in the dialogue state and update the knowledge state $K_{t-1}\rightarrow K_{t}$. The updated knowledge state and API name are then fed back into the model to generate the next-turn response.

2 Cross-lingual Transfer

Based on the modelling strategy mentioned above, we propose three baselines for the cross-lingual setting.

Directly finetune the pre-trained mSeq2seq models like mBART and mT5 on the 10% dialogue data in the target language.

First, pre-train the mBART and mT5 models on the source language, then finetune the models on the 10% target language data.

To leverage the fact that our knowledge base contains bilingual parallel information for most of the entities, we replace the entities in the source language data (both input sequence and output sequence) with their target language counterparts in our parallel knowledge base to generate mixed-language training data. We first pre-train the mSeq2seq models with the generated mixed-language data, then finetune the models on the 10% target language data.
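The entity replacement step can be sketched as follows, assuming the parallel knowledge base is available as a plain dictionary from source-language entity names to target-language names. The function name and the longest-match-first rule are assumptions for illustration.

```python
import re
from typing import Dict


def to_mixed_language(text: str, parallel_entities: Dict[str, str]) -> str:
    """Replace source-language entity names with their target-language counterparts."""
    # Longest names first, so an entity is not split by a shorter one it contains.
    names = [n for n in sorted(parallel_entities, key=len, reverse=True) if n]
    if not names:
        return text
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # A single pass over the text: replaced spans are never matched again.
    return pattern.sub(lambda match: parallel_entities[match.group(0)], text)
```

The same function would be applied to both the input and the output sequence of each training example. Entities without a target-language entry are simply left unchanged, which mirrors the incomplete parallel information in the knowledge base.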

Results & Discussion

The main results for DST and end-to-end task completion are reported in Table 3. Note that the $API_{Acc}$ is highly correlated with the JGA because the dialogue states contain the constraints for issuing API calls. In addition, DSR is a more challenging metric than TSR because a dialogue might contain 2-5 tasks.

Comparing the models trained under the monolingual and bilingual settings, the latter can leverage more training data and handle tasks in both languages simultaneously without a language identifier. We observe that mT5 achieves better results in the bilingual setting, while mBART performs better with monolingual training. The underlying reason might be the different pre-training strategies of the two mSeq2seq models. mBART is pre-trained with language tokens in both the encoder and decoder, but in our bilingual setting, we do not provide any language information. Such a discrepancy does not exist in the mT5 model, as it is pre-trained without language tokens.

We observe that it is difficult for the baseline models to converge with minimal training data (10%) due to the complex ontology and diverse user goals. Interestingly, pre-training the mSeq2seq models on the source language improves both DST and task completion performance. Such results indicate the strong cross-lingual transferability of multilingual language models. Furthermore, the mixed-language training strategy further improves the cross-lingual few-shot performance, especially the JGA, which suggests that the bilingual knowledge base can facilitate cross-lingual knowledge transfer in the low-resource scenario.

The main limitation of this work is the low number of languages in the corpus due to the difficulty of collecting the knowledge base in languages other than English and Chinese in Hong Kong. In future work, we plan to extend the dataset to more languages, including languages that are low-resource in dialogue research (e.g., Indonesian), to better examine the cross-lingual transferability of end-to-end ToD systems. Another limitation is that the M2M data collection might not cover rare and unexpected user behaviours (e.g., non-collaborative dialogues), as dialogue simulators generate the dialogue outlines. However, we see 𝔹𝕚𝕋𝕠𝔻 as a necessary step for building robust multilingual ToD systems before tackling even more complex scenarios.

Related Work

Many datasets have been proposed in the past to support various assistant scenarios. In English, Wen et al. collected a single-domain dataset with a Wizard-of-Oz (WoZ) setup, which was later extended to multiple domains by many follow-up works. Despite its effectiveness, the WoZ data collection method is expensive since two annotators need to be synchronized to conduct a conversation, and another set of annotators needs to further annotate speech acts and dialogue states. To reduce the annotation overhead (time and cost), Byrne et al. proposed a Machines Talking To Machines (M2M) self-chat annotation schema. Similarly, Rastogi et al. applied M2M to collect a large-scale schema-guided ToD dataset, and Kottur et al. extended it to the multimodal setting. In languages other than English, only a handful of datasets have been proposed. In Chinese, Zhu et al. and Quan et al. proposed WoZ-style datasets, and in German, the WMT 2020 Chat task translated the dataset from Byrne et al. To the best of our knowledge, all the above-mentioned datasets are monolingual, thus making our BiToD dataset unique since it includes a bilingual setting and all the annotations needed for training an end-to-end task-oriented dialogue system. In the chit-chat setting, XPersona has a translation corpus in seven languages, but it is limited to the chit-chat domain. Finally, Razumovskaia et al. provided an excellent summary of the existing corpora for task-oriented dialogue systems, highlighting the need for multilingual benchmarks, like 𝔹𝕚𝕋𝕠𝔻, for evaluating the cross-lingual transferability of end-to-end systems.

Conclusion

We present 𝔹𝕚𝕋𝕠𝔻, the first bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling. 𝔹𝕚𝕋𝕠𝔻 contains over 7k multi-domain dialogues (144k utterances) with a large and realistic knowledge base. It serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches. We provide state-of-the-art baselines under three evaluation settings (monolingual, bilingual and cross-lingual). The analysis of our baselines in different settings highlights 1) the effectiveness of training a bilingual ToD system compared to two independent monolingual ToD systems, and 2) the potential of leveraging a bilingual knowledge base and cross-lingual transfer learning to improve the system performance under low resource conditions.

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes]

Did you discuss any potential negative societal impacts of your work? [Yes] See Section A.1 for more information.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [N/A]

Did you include complete proofs of all theoretical results? [N/A]

If you ran experiments (e.g. for benchmarks)…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section A.5 for more information.

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes]

Did you mention the license of the assets? [Yes]

Did you include any new assets either in the supplemental material or as a URL? [Yes]

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes]

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes]

Appendix A Appendix

In this paper, we propose a new bilingual dataset for training and evaluating end-to-end task-oriented dialogue systems. Our dataset neither introduces social or ethical concerns nor amplifies any bias, since we generate the data with a dialogue simulator and have human workers paraphrase the utterances. We do not foresee any direct social consequences or ethical issues. Furthermore, our proposed dataset encourages research in the cross-lingual few-shot setting, where less data and fewer resources are needed, which favors energy-efficient models.

A.2 Dataset documentation and intended uses

We follow the Datasheets for Datasets guidelines to document the following:

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?

BiToD was created to benchmark the multilingual ability of end-to-end task-oriented dialogue systems. Existing end-to-end benchmarks are limited to a single language (e.g., English or Chinese); BiToD thus fills the need for a dataset for training and evaluating end-to-end task-oriented dialogue systems in multilingual and cross-lingual settings.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

The HKUST CAiRE team and the Alibaba team worked together to create this dataset.

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

The Alibaba team funded the creation of the dataset.

A.2.2 Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

BiToD is made of conversations (text) between two speakers (user and assistant) and the textual knowledge returned by the API calls (tuples in a DB). BiToD also includes speech acts for both user and system, and dialogue state annotations.

How many instances are there in total (of each type, if appropriate)?

BiToD has 7,232 dialogues with 144,798 utterances, in which 3,689 dialogues are in English and 3,543 dialogues are in Chinese.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

BiToD has been designed from scratch and thus contains all possible instances.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

Each sample has the raw text of the conversation, speech acts for both user and system, dialogue state annotations, API queries, and the results returned by the knowledge base.

Is there a label or target associated with each instance? If so, please provide a description.

Each response is annotated with its speech acts, and the response itself is the target label.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

No, we included all the information we had.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

Yes, we split the data into 80% training, 8% validation, and 12% testing, resulting in 5,787 training dialogues, 542 validation dialogues, and 902 testing dialogues.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

In 2.44% of the dialogues, the annotators reported that the conversation did not sound formal enough, and in 1.11% of the dialogues, the annotators reported that the dialogues are not valid – did not sound coherent.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

A.2.3 Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated? If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

Each utterance in the dialogue is paraphrased by crowdworkers on Amazon Mechanical Turk for the English instances and on AI-Speech (http://www.aispeech.com/) for the Chinese instances.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

Crowdworkers. We paid them roughly $10-12 per hour, estimated from the average time needed to write a paraphrase, which is approximately 8 minutes.
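As a rough illustration of how an hourly rate and an average task time relate, the sketch below converts the stated range into an implied payment per paraphrasing task. The function name and the resulting per-task amounts are illustrative assumptions derived only from the two figures above; they are not reported payment amounts.

```python
def implied_payment_per_task(hourly_rate_usd: float, minutes_per_task: float) -> float:
    """Convert an hourly rate into the payment implied for one task.

    Hypothetical helper: it assumes payment scales linearly with the
    average time spent on a task.
    """
    return hourly_rate_usd * minutes_per_task / 60.0


# Stated figures: roughly $10-12 per hour, about 8 minutes per paraphrase.
low = implied_payment_per_task(10.0, 8.0)
high = implied_payment_per_task(12.0, 8.0)
print(f"Implied payment per task: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, the implied payment is about $1.33 to $1.60 per task.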

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

The data was collected from February 2021 to May 2021.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

Yes. The HKUST ethical team conducted an internal ethical review.

Does the dataset relate to people? If not, you may skip the remainder of the questions in this section.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

Yes, the workers were informed of the data collection procedure. Screenshots are shown in Figures 5, 6, 7, and 8 in the Appendix.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

AMT and AI-Speech each have their own data policy; see https://www.mturk.com/acceptable-use-policy and http://www.aispeech.com/, respectively.

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

https://www.mturk.com/acceptable-use-policy and http://www.aispeech.com/.

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

A.2.4 Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

No data cleaning or preprocessing is done for the released dataset since the dialogue data were generated by a simulator and only paraphrased by the workers.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.

A.2.5 Uses

Has the dataset been used for any tasks already? If so, please provide a description.

BiToD is a new dataset that we collected for end-to-end task-oriented dialogue modeling and dialogue state tracking. In this work, we build baseline models on BiToD for these tasks as a benchmark for future research.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

Yes, we release our dataset, code, and baseline models at https://github.com/HLTCHKUST/BiToD.

What (other) tasks could the dataset be used for?

BiToD could also be used to train dialogue policies (using the speech-act annotations), natural language generation modules, and user simulators.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

Are there tasks for which the dataset should not be used? If so, please provide a description.

A.2.6 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

It is released on GitHub at https://github.com/HLTCHKUST/BiToD. It has no DOI.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

Apache License 2.0. https://github.com/HLTCHKUST/BiToD/blob/main/LICENSE

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

A.2.7 Maintenance

Who is supporting/hosting/maintaining the dataset?

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Open an issue on our GitHub repository or contact the authors by email (see the author list).

Is there an erratum? If so, please provide a link or other access point.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?

No. If we decide to update the dataset in the future, we will announce it on our GitHub repository.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.

Yes. If we update the data, we will keep the original version available and release the follow-up as a new version, for example, BiToD-2.0.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.

Yes, they can submit a GitHub pull request or contact us privately.

A.3 Accessibility

Links to access the dataset and its metadata. https://github.com/HLTCHKUST/BiToD

The data is stored in JSON format; an example is shown in the README.md file.

The HKUST CAiRE team will maintain this dataset on the official company GitHub account.

Apache License 2.0. https://github.com/HLTCHKUST/BiToD/blob/main/LICENSE
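Since the data is distributed as JSON, a reader may want to check the structure of a file before writing a loader. The following is a minimal sketch that summarizes the shape of any JSON value without assuming a schema. The `summarize_shape` helper and the toy input are hypothetical and do not reflect the actual BiToD layout, which is documented in the README.md file.

```python
import json


def summarize_shape(value, depth=0, max_depth=2):
    """Describe the structure of a parsed JSON value down to max_depth levels."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return f"dict({len(value)} keys)"
        return {k: summarize_shape(v, depth + 1, max_depth) for k, v in value.items()}
    if isinstance(value, list):
        if not value:
            return "list(empty)"
        if depth >= max_depth:
            return f"list({len(value)} items)"
        # Describe only the first element and report the list length.
        return [summarize_shape(value[0], depth + 1, max_depth), f"... {len(value)} items"]
    return type(value).__name__


# Toy input for illustration only; replace it with json.load() on a real file.
toy = json.loads('{"a": [1, 2], "b": {"c": "x"}}')
print(summarize_shape(toy))
```

To inspect a downloaded file, the toy string can be replaced with the result of `json.load` on an open file handle.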

A.4 Data Usage

The authors bear all responsibility in case of violation of rights.

A.5 Training Details

We implement our baselines with the Hugging Face Transformers library. In all of our experiments, we set the dialogue context window size to $w=2$ and use the pre-trained models mT5-small and mBART-large. They are trained with a batch size of 128 using the AdamW optimizer, with initial learning rates of $0.0005$ and $0.0001$, respectively. In the monolingual and bilingual settings, all models are trained for 8 epochs; in the cross-lingual setting, the models are first trained on source-language dialogues for 8 epochs and then fine-tuned on the target language for 10 epochs. We use 2 NVIDIA V100 GPUs for mBART training and 2 1080Ti GPUs for mT5 training. All training runs take less than 10 hours. We use greedy decoding at test time. More training information is available at https://github.com/HLTCHKUST/BiToD.
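The hyperparameters above can be gathered into a small configuration sketch. This is not the released training code: the class, field, and function names are hypothetical, and `truncate_context` assumes that the window $w$ counts the most recent turns of the dialogue history, which the text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    """Hyperparameters stated in the training details (field names are hypothetical)."""
    model_name: str
    learning_rate: float
    hardware: str
    batch_size: int = 128
    context_window: int = 2      # w = 2
    epochs: int = 8              # monolingual, bilingual, and source-language training
    target_epochs: int = 10      # fine-tuning on the target language (cross-lingual)
    optimizer: str = "AdamW"
    decoding: str = "greedy"


MT5_SMALL = BaselineConfig("mT5-small", 0.0005, "2x 1080Ti")
MBART_LARGE = BaselineConfig("mBART-large", 0.0001, "2x NVIDIA V100")


def training_schedule(setting: str, cfg: BaselineConfig) -> list:
    """Return (data split, epochs) stages for a given evaluation setting."""
    if setting in ("monolingual", "bilingual"):
        return [("train", cfg.epochs)]
    if setting == "cross-lingual":
        return [("source", cfg.epochs), ("target", cfg.target_epochs)]
    raise ValueError(f"unknown setting: {setting}")


def truncate_context(turns: list, window: int) -> list:
    """Keep the most recent `window` turns (assumed interpretation of w)."""
    return turns[-window:] if window > 0 else []


print(training_schedule("cross-lingual", MT5_SMALL))
print(truncate_context(["u1", "s1", "u2", "s2", "u3"], MT5_SMALL.context_window))
```

Keeping the two model settings as separate frozen instances makes the only stated differences between them, the learning rate and the hardware, explicit.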