UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, Tao Yu
Introduction
Structured knowledge (e.g., web tables, knowledge graphs, and databases) stores large amounts of data in organized structures, forming a basis for a wide range of applications, e.g., medical diagnosis, personal assistants, and customer relations management. Accessing and searching data in structured knowledge typically requires mastering query languages through professional training. To promote the efficiency of data access, structured knowledge grounding (SKG) systems ground user requests in structured knowledge and produce various outputs, including computer programs (e.g., SQL and SPARQL), table cell values, and natural language responses (Figure 1). For example, semantic parsing Zelle and Mooney (1996); Zettlemoyer and Collins (2005) converts natural language questions into formal programs; knowledge-base question answering Berant et al. (2013) derives answers from tables or knowledge graphs.
SKG has attracted significant interest and has been studied through different tasks defined by different communities. Recent developments in tasks, models, and datasets for SKG have led to task-specific modeling advances, making each task’s progress seemingly unique and incompatible. A main reason is that SKG tasks are heterogeneous. Different types of structured knowledge, such as databases or knowledge graphs, lead to highly specialized encoders (Lin et al., 2019; Herzig et al., 2020; Wang et al., 2020; Yasunaga et al., 2021). Some SKG tasks, e.g., semantic parsing, use customized decoders to generate programs Yin and Neubig (2018); Ren et al. (2021). Therefore, instead of solving common challenges in SKG research, improvements in SKG have been prone to be exclusive to a single task, domain, or dataset.
In this paper, we propose the UnifiedSKG framework to advocate for a unifying view of 21 SKG tasks across six task families and multiple data domains (Table 1). UnifiedSKG standardizes datasets, models, code, experiments, and evaluation metrics into a single framework. By casting user requests, structured knowledge, and outputs into the text-to-text format Raffel et al. (2020), it promotes model advances where new tasks can be framed with our standardized abstraction, and new models can be easily applied to diverse SKG tasks. While previous works also cast SKG tasks into the text-to-text format Hosseini-Asl et al. (2020); Shaw et al. (2021); Liu et al. (2021), their independent choices of pretrained language models (PLMs), input-output formats, and frameworks make our unification non-trivial. UnifiedSKG is easily extensible to more SKG tasks, and it is open-sourced to promote community-wide progress.
Using UnifiedSKG as a benchmark, we show that finetuning T5 (with constrained decoding or reranking when necessary) on individual tasks achieves state-of-the-art (sota) results on almost all of the 21 tasks, establishing a powerful and reproducible starting point for SKG research. T5 performance also increases with size on most tasks.
UnifiedSKG facilitates multi-task learning on SKG, enabling knowledge sharing and cross-task generalization. Although simple multi-task learning has mixed results, we show that multi-task learning with prefix-tuning (Li and Liang, 2021) benefits most tasks and largely improves the overall performance, on both T5-base and T5-large.
UnifiedSKG is a challenging testbed for few-shot Brown et al. (2020); Ye et al. (2021a) and zero-shot learning Zhong et al. (2021); Wei et al. (2021); Sanh et al. (2021) with PLMs. Our experiments show that models like T0 Sanh et al. (2021) struggle in zero-shot learning on SKG tasks, and GPT-3 Brown et al. (2020) and Codex Chen et al. (2021a) struggle in few-shot learning on SKG tasks.
UnifiedSKG enables a series of controlled experiments on structured knowledge encoding. We find that T5 is sensitive to encoding variations, and the sensitivity varies across tasks. UnifiedSKG aims to facilitate more general and robust structured knowledge encoding methods. Finally, we conduct a comprehensive error analysis across SKG tasks. Although the errors made by PLMs decrease with the model size, T5-3B may still generate invalid outputs.
In summary, we 1) unify and benchmark 21 SKG tasks under the UnifiedSKG framework to evaluate diverse grounding goals and structured knowledge sources, 2) demonstrate (near) sota performance of T5 on all the unified SKG tasks, using a single, general-purpose approach, 3) show the benefit of knowledge sharing across SKG tasks via multi-task prefix-tuning, and 4) analyze recent modeling contributions (zero-shot, few-shot, and structured knowledge encoding) on these tasks. We hope UnifiedSKG enables the design of new models and learning algorithms that generalize to diverse SKG tasks and helps identify their challenges.
Related Work
SKG with PLMs PLMs have been applied to several SKG tasks. To encode structured knowledge, prior work linearized the structured knowledge and concatenated it with the text Hwang et al. (2019); Liu et al. (2020); Hosseini-Asl et al. (2020); Liu et al. (2021), which has been augmented by positional encoding (e.g., row/column embedding) Herzig et al. (2020); Yin et al. (2020a) and template-based linearization Chen et al. (2020a, b); Oguz et al. (2021), and planning Su et al. (2021). Recently, cell-column alignment is modeled by manipulating the attention matrix of transformers Zhang et al. (2020); Eisenschlos et al. (2021). Hierarchical encoding is another way to represent the structure, e.g., Wang et al. (2021b) used tree-based transformers to represent the structure of the tables; Iida et al. (2021) used transformers to encode row and column representations; Chen et al. (2021b) used hierarchical transformers to encode KG triples. SKG’s outputs include, but are not limited to, structured meaning representations (e.g., logic forms, SQL), dialogue states, natural language, answer sets, and Boolean values. Among them, structured meaning representation is challenging for PLMs because they are originally trained on natural language. To bridge this gap, Shin et al. (2021) adopted the insights from Berant and Liang (2014) and Marzoev et al. (2020) and proposed to convert formal language into an English-like representation, decode with GPT-3, and map back to formal language automatically. We do not focus on these techniques in this work; instead, we unify all tasks and systematically compare them.
Task format unification Recent years witnessed the trend of unifying related but different tasks into a shared format. McCann et al. (2018) unified various tasks as question answering. Yin et al. (2020b) and Wang et al. (2021a) unified few-shot learning as textual entailment. PLUR Chen et al. (2021c) unified program learning, understanding, and repair tasks into a graph-to-sequence format. In this paper, we focus on the text-to-text format Raffel et al. (2020) due to its flexibility. Different from unifying tasks that only take text as input, a core challenge in unifying SKG tasks into the text-to-text format is to linearize structured knowledge. Notably, UnifiedQA Khashabi et al. (2020) unified QA tasks, while UnifiedSKG covers a broader scope of six task families for systematic exploration.
Cross-task generalization with PLMs Multi-task learning and transfer learning go beyond task boundaries, view different tasks as related, and have been shown to outperform single-task learning Aghajanyan et al. (2021a); Vu et al. (2021). Large PLMs show potential for zero-shot and few-shot learning, e.g., GPT-2 Radford et al. (2019) and GPT-3 Brown et al. (2020), which can be improved by multi-task learning Zhong et al. (2021), e.g., FLAN Wei et al. (2021), T0 Sanh et al. (2021), and CrossFit Ye et al. (2021a). ExT5 Aribandi et al. (2021) shows that scaling up multi-task learning helps improve pretraining efficiency and downstream performances. UnifiedSKG facilitates the investigation of multi-task, zero-shot, and few-shot learning on SKG tasks.
The UnifiedSKG Framework
The guiding principle of UnifiedSKG’s task selection is diversity. We unify 21 SKG tasks across six task families and multiple domains (Table 1). Our task families include:
Semantic parsing converts questions to logical forms Zelle and Mooney (1996); Zettlemoyer and Collins (2005).
Question answering derives answers to natural language questions based on structured data Berant et al. (2013).
Data-to-text generation describes structured data in natural language Novikova et al. (2017).
Fact verification checks if a statement is true based on the structured data Chen et al. (2020b).
Conversational tasks require understanding of not only the user’s last request but also the full interaction history between users and machines Budzianowski et al. (2018); Eric et al. (2019); Yu et al. (2019a).
Formal language to text translation describes formal language in natural language Chen et al. (2020d).
3.2 Modeling
The simplest usage of UnifiedSKG is to train on individual tasks. In this case, we minimize the negative log-likelihood loss averaged over tokens in each batch. For decoding, we use beam search by default. UnifiedSKG also facilitates exploration of multi-task learning, few-shot, and zero-shot learning with PLMs, and details are presented in the corresponding parts in Section 4.
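The token-averaged negative log-likelihood mentioned above can be illustrated with a minimal sketch. It assumes we already have the probability the model assigned to each gold target token; the function name and the toy values are hypothetical and are not taken from the UnifiedSKG code base.

```python
import math

def token_averaged_nll(batch_token_probs):
    """Average the negative log-likelihood over all target tokens in a batch.

    batch_token_probs: list of sequences, where each sequence holds the
    probability the model assigned to each gold target token.
    """
    total_nll = 0.0
    total_tokens = 0
    for sequence in batch_token_probs:
        for prob in sequence:
            total_nll += -math.log(prob)
            total_tokens += 1
    # Dividing by the token count (not the sequence count) means long
    # targets contribute proportionally more terms to the loss.
    return total_nll / total_tokens

# Toy batch with two target sequences of different lengths (made-up values).
batch = [[0.5, 0.25], [0.5]]
loss = token_averaged_nll(batch)
```

In practice the per-token probabilities would come from the decoder's softmax outputs; the averaging step is the only part this sketch is meant to show.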
Experiments and Analysis
We apply T5 models Raffel et al. (2020) on each individual task in UnifiedSKG. For model training, we set the maximum number of epochs as 50–200, depending on the dataset size. We use early stopping and model selection on the development set. More details are shown in Appendix D.1. For each task, we report one commonly used metric in Table 3. See Appendix B for all metrics.
Comparison with previous sota Table 3 shows that vanilla T5-3B outperforms most previous sota models not trained on extra unsupervised in-domain data. Some semantic parsing sota models, denoted as + in Table 3, are also T5 with constrained decoding Scholak et al. (2021) or reranking Ye et al. (2021b). This shows that a generalist architecture like T5, when scaled up to a certain size, can be as good as task-specific architectures for SKG, suggesting the potential of larger PLMs.
Model scalability In general, T5 performance increases with the model size, but this trend varies across task families. Semantic parsing, QA, and fact verification tasks get large benefits from increased sizes, while text generation does not. See Section 4.5 for a human evaluation for text generation tasks. Also, the gap between T5-base (220M) and T5-large (770M) is larger than the gap between T5-large (770M) and T5-3B (3B).
Effect of pretraining on structured knowledge Some smaller models pretrained on structured knowledge Liu et al. (2021) perform competitively with T5-3B, suggesting that pretraining with structured data is beneficial for SKG. This result calls for structured knowledge pretraining that generalizes to different SKG tasks across domains, which can be systematically explored using UnifiedSKG.
Effect of pretraining on non-SKG tasks T0-3B Sanh et al. (2021) is initialized from T5-3B and pretrained on multiple tasks that (in most cases) do not use structured knowledge as input (non-SKG tasks). Exploring the performance of T0-3B on SKG tasks helps us understand the relationship between SKG tasks and non-SKG tasks. Table 3 shows that T0-3B under-performs T5-3B on semantic parsing and outperforms T5-3B on dialogue state tracking and fact verification. We note that T0-3B is pretrained on dialogue QA, dialogue summarization, and NLI tasks; therefore, pretraining on non-SKG tasks might not be useful for SKG unless we add similar SKG tasks to pretraining.
4.2 Multi-Task Learning
UnifiedSKG facilitates the exploration of multi-task learning. In this part, we systematically study multi-task learning on all 21 unified tasks. We find that SKG benefits from multi-task prefix-tuning on both T5-base and T5-large, showing that the benefit of multi-task learning scales with model size. The baselines we use include:
Single-task finetuning (ST-F), which is finetuning on individual tasks, same as Section 4.1.
Single-task prefix-tuning (ST-P; Li and Liang, 2021), which learns lightweight task-specific parameters while keeping the PLM fixed. We set the prefix length as 10. Clive et al. (2021) also used prefix-tuning on T5 for data-to-text generation.
Multi-task finetuning (MT-F), which combines the training data of all tasks with temperature mixing (Raffel et al., 2020; after hyperparameter tuning with a few steps, we set the temperature as 2). We select model weights based on the average metric on all tasks’ development set.
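As a rough illustration of temperature mixing, the sketch below computes per-task sampling rates proportional to the dataset size raised to the power 1/T, with T = 2 as stated above. It is a simplification: the formulation in Raffel et al. (2020) also applies an artificial cap on dataset size, which is omitted here, and the task names and sizes are hypothetical.

```python
def temperature_mixing_rates(dataset_sizes, temperature=2.0):
    """Return sampling rates proportional to size ** (1 / temperature)."""
    scaled = {
        name: size ** (1.0 / temperature)
        for name, size in dataset_sizes.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

# Hypothetical dataset sizes, chosen only to make the effect visible.
sizes = {"task_a": 10000, "task_b": 100}
rates = temperature_mixing_rates(sizes, temperature=2.0)
proportional = temperature_mixing_rates(sizes, temperature=1.0)
```

With T = 1 the small task would be sampled about 1% of the time; with T = 2 its share rises to about 9%, which is the intended effect of flattening the mixture.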
Table 4 shows that ST-P is comparable to ST-F on nearly all tasks. However, we find that it takes about 5–10 times as many training steps (see Appendix E), similar to what has been observed for prompt-tuning Lester et al. (2021). We also observe that MT-F leads to mixed results. For many tasks, MT-F is even worse than ST-F.
Multi-task prefix-tuning (MT-P) Our explanation for the mixed results of MT-F is that the inputs of SKG tasks contain different structured knowledge from diverse domains, making it difficult to learn shared parameters effectively. To address this challenge, we first pretrain a prefix on all tasks, freezing T5 and using the same temperature mixing as MT-F. In the second step, we initialize each task’s prefix with this pretrained prefix and optimize the prefix while freezing T5. This initialization step is similar to the prompt transfer explored in Vu et al. (2021). Following ST-P, we set the prefix length as 10.
Table 4 shows that multi-task prefix-tuning outperforms single-task finetuning and single-task prefix-tuning on most tasks, and it largely outperforms the naive multi-task learning baseline. It demonstrates that SKG tasks can be studied together to share data and knowledge.
Exploring task knowledge transfer UnifiedSKG facilitates studying knowledge transfer between SKG tasks. Given two tasks, task A and task B, we first train the model on task A and then continue training on task B. Table 5 shows that tasks benefit from other tasks with the same data source (e.g., tasks that all use Wikipedia tables as structured knowledge). When data sources are different, we do not observe positive transfer between parallel tasks (e.g., semantic parsing tasks with different structured knowledge and different output) or between a task and its subtask (e.g., question answering can be viewed as the execution of semantic parses). Compared to the positive results in Table 4, results in this part indicate that manually selecting source and target tasks may not be efficient for multi-task learning.
4.3 Zero-Shot and Few-Shot Learning
The text-to-text unification of UnifiedSKG enables us to investigate zero/few-shot learning on SKG with large PLMs.
Zero-shot learning setting Zero-shot learning enables models to solve tasks with natural language descriptions without training samples. We follow T0 Sanh et al. (2021) to create similar natural language instructions for the unseen tasks. Our instructions are provided in Appendix D.3.
Few-shot learning settings Brown et al. (2020) showed that large PLMs could be few-shot learners by encoding a few training samples as “context” to learn without gradient updates. We use GPT-3 Brown et al. (2020) and Codex Chen et al. (2021a) to explore such few-shot learning for SKG. To stay within our budget, for GPT-3, we report the performance on 100 random dev. set samples. We explore two settings for few-shot learning.
In the first setting, we randomly sample few-shot examples from the training set; these examples are shared by all dev. set samples, denoted as random in Table 6. For sequences that are too long for Codex (4096) and GPT-3 (2048), we use as many examples as possible and make sure that there is at least one example (truncated if needed).
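The packing rule described above (use as many examples as fit, and keep at least one, truncated if needed) might look like the following sketch. Whitespace splitting stands in for the real tokenizer, and the function name and the blank-line separator are assumptions made for illustration only.

```python
def pack_examples(examples, query, max_tokens):
    """Fit as many few-shot examples as possible in front of the query.

    Token counts use whitespace splitting as a stand-in for the model's
    tokenizer, so the numbers are only illustrative.
    """
    def n_tokens(text):
        return len(text.split())

    budget = max_tokens - n_tokens(query)
    chosen = []
    for example in examples:
        cost = n_tokens(example)
        if cost > budget:
            break
        chosen.append(example)
        budget -= cost

    if not chosen and examples:
        # Guarantee at least one example by truncating the first one
        # to whatever budget is left.
        kept = examples[0].split()[:max(budget, 0)]
        chosen.append(" ".join(kept))

    return "\n\n".join(chosen + [query])

prompt = pack_examples(["a b c", "d e f", "g h i"], "q1 q2", max_tokens=8)
```

A real implementation would also reserve room for the generated output, since the context limit covers input and output together.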
In the second setting, we follow Gao et al. (2021) to select few-shot examples from the training set. We call this setting few-shot with example selection, denoted as select in Table 6. We use the pretrained SBERT Reimers and Gurevych (2020) for sentence embeddings of the user request input (for tasks that only have structured input, we embed the linearized structured input) and select the five most similar examples measured by cosine similarity. Further details (e.g., prompts and task instructions) are provided in Appendix D.4.
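Example selection by cosine similarity can be sketched as follows. The embeddings here are plain lists of floats standing in for SBERT sentence embeddings; the function names and toy vectors are hypothetical, and no embedding model is called.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_examples(query_vec, train_vecs, k=5):
    """Return indices of the k training examples most similar to the query."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-dimensional "embeddings" (made-up values).
train = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
top2 = select_examples([1.0, 0.0], train, k=2)
```

The selected indices would then be used to look up the corresponding training examples and place them in the prompt.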
SKG is challenging for zero/few-shot learning. Table 6 shows that zero-shot performance is very poor on most tasks (the scores on Spider and MultiWoZ are even 0). It also shows a large gap between few-shot learning and finetuning for Spider, WikiTQ, MultiWoZ, and TabFact, while the gap is smaller for generation tasks. For few-shot learning, example selection based on similarity outperforms random selection, but the gap is usually smaller than 10 points out of 100. It is also interesting to compare the results between synthesis tasks (Spider), which require predicting programs, and induction tasks (WikiTQ and TabFact), where a model directly outputs answers (Devlin et al., 2017). We find that PLMs generally struggle more when adapting to induction tasks (e.g., close to random-guess on the binary classification task TabFact), reminiscent of recent attempts in program synthesis and induction using PLMs (Austin et al., 2021). For GPT-3 and Codex, better zero-shot performance can be expected with better prompt design.
4.4 Structured Knowledge Encoding
Structured knowledge encoding has been widely explored (Bogin et al., 2019; Lin et al., 2019; Agarwal et al., 2020; Saxena et al., 2020; Yasunaga and Liang, 2020; Yasunaga et al., 2022; and others detailed in Section 2). We hope that UnifiedSKG can promote systematic study of general structured knowledge encoding. To this end, this part focuses on the linearization of structured knowledge.
Does the order of user input, structured knowledge, and context matter? To explore the effect of the order of user input, structured knowledge, and context, we rerun the single-task experiments while switching the order of these components in both the training and development set. Table 7 shows that placing the text before structured knowledge (rs) is better than the opposite (sr), which is consistent across SKG tasks. Our explanation is that the position of the text is relatively fixed in rs, helping the decoder to learn stable attention over the text. Also, placing the context in between the text and structured knowledge yields better results.
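To make the two orderings concrete, the sketch below linearizes a small table and concatenates it with the request in either order, placing any context between the two. The "col : ... row 1 : ..." layout and the " ; " separator are assumptions chosen for illustration; they are not claimed to be the exact strings used in UnifiedSKG.

```python
def linearize_table(header, rows):
    """Flatten a table into one string (assumed, illustrative format)."""
    parts = ["col : " + " | ".join(header)]
    for index, row in enumerate(rows, start=1):
        cells = " | ".join(str(cell) for cell in row)
        parts.append(f"row {index} : {cells}")
    return " ".join(parts)

def build_input(request, struct, context=None, order="rs"):
    """Concatenate components; 'rs' puts the request first, 'sr' the reverse."""
    pieces = [request, struct] if order == "rs" else [struct, request]
    if context:
        # Context goes between the request text and the structured knowledge.
        pieces.insert(1, context)
    return " ; ".join(pieces)

table = linearize_table(["name", "age"], [["Ann", 31]])
rs_input = build_input("how old is Ann?", table, order="rs")
sr_input = build_input("how old is Ann?", table, order="sr")
```

In the rs ordering the request always starts at position zero regardless of table size, which matches the explanation given above for why its position is relatively fixed.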
Is T5 sensitive to structured knowledge ordering? Order-insensitivity is common for most structured knowledge, e.g., permutation of columns in a table preserves the meaning. To study this insensitivity, we evaluate T5-large on a manipulated development set where the order of schema (for database), column (for table), or slots and values (for ontology) is reversed. Table 8 shows that tasks with cross-domain tables and databases are less order-sensitive, while models are very sensitive to the order of ontology. Other types of robustness (e.g., robustness to cell values irrelevant to the answer) remain an open question in UnifiedSKG.
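The column-reversal manipulation for tables can be sketched as below. The helper names are hypothetical; the second function only checks that reversing columns leaves each row's column-to-value mapping unchanged, which is the sense in which the meaning is preserved.

```python
def reverse_columns(header, rows):
    """Reverse the column order of a table without changing its content."""
    return header[::-1], [row[::-1] for row in rows]

def same_content(header_a, rows_a, header_b, rows_b):
    """Check that both tables map the same columns to the same values."""
    as_dicts_a = [dict(zip(header_a, row)) for row in rows_a]
    as_dicts_b = [dict(zip(header_b, row)) for row in rows_b]
    return as_dicts_a == as_dicts_b

header = ["id", "name", "age"]
rows = [[1, "Ann", 31], [2, "Bob", 25]]
rev_header, rev_rows = reverse_columns(header, rows)
```

An order-insensitive model would produce the same prediction for the original and the reversed table, so the drop in the metric on the manipulated set measures order sensitivity.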
Is it beneficial to represent structured knowledge as natural language? SKG data is not typically used to pretrain PLMs. Given ample training data, PLMs adapt well to SKG tasks, as shown in Table 3. However, under the low-resource setting, converting structured data to natural language might be helpful. For Spider, we use a shared template to convert structured data to natural language. For TabFact and WikiSQL, we randomly selected 236 tables shared by both datasets and manually labeled templates to convert each row into a sentence. Examples of the templates are shown in Appendix I. These templates produce about 1000 samples for each task, divided into training and test sets. We find that, in WikiSQL, the conversion to natural language stabilizes and accelerates the training process. Table 9 shows that conversion to natural language improves the performance on WikiSQL, has no significant influence on TabFact, and slightly degrades the performance on Spider.
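A template that turns one table row into a sentence could be applied as in the sketch below. The template text, column names, and row values are invented for illustration and are not the manually labeled templates from Appendix I.

```python
def row_to_sentence(template, header, row):
    """Fill a per-table template with the values of one row.

    Column names must be valid Python identifiers because they are used
    as named fields in the format string.
    """
    return template.format(**dict(zip(header, row)))

# Hypothetical template and table row.
template = "{player} plays for {team} and scored {points} points."
header = ["player", "team", "points"]
sentence = row_to_sentence(template, header, ["Ann", "Lions", 12])
```

Applying the template to every row yields a short natural language passage that replaces the linearized table in the model input.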
4.5 Human Evaluation for Generation Tasks
For each generation task, we randomly sample 100 development set samples and ask human annotators to judge the correctness of each output, using a 0-1 score. Details are provided in Appendix D.5. Table 10 shows that automatic metrics do not always reflect human evaluation, calling for better automatic metrics to truly reflect the model’s ability on generation tasks. Larger models are not always better, and detailed error analysis is provided below.
4.6 Error Analysis
Error analysis based on output validity Unconstrained decoding from PLMs may generate invalid outputs. For semantic parsing, we divide wrong outputs into invalid outputs (i.e., not executable when the output is SQL, and not parse-able when the output is s-expression or TOP-representation) and valid but wrong answers. Figure 3 shows that, for SQL semantic parsing, a large number of errors are caused by invalid outputs, and the number of invalid outputs gradually decreases with the increase of model size. This phenomenon is also observed by Scholak et al. (2021), who used constrained decoding to improve the validity, largely improving the parsing performance. For s-expression semantic parsing, invalid outputs take up 30–50% of all wrong outputs, and increasing the model size does not reduce invalidity significantly. For fact verification tasks, valid outputs are “entailed” and “refuted”. We observe that T5 always generates valid outputs. For question answering, we do not include the validity analysis since the validity check for an answer is non-trivial and could be imprecise.
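For SQL outputs, an executability check of the kind described above can be sketched with Python's built-in sqlite3 module. This assumes SQLite as the execution engine and an in-memory database built from hypothetical schema statements; it is not the evaluation script used in the paper.

```python
import sqlite3

def is_executable(sql, schema_statements):
    """Return True if the query runs without error on an empty database."""
    conn = sqlite3.connect(":memory:")
    try:
        for statement in schema_statements:
            conn.execute(statement)
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        # Syntax errors, unknown tables, and unknown columns all land here.
        return False
    finally:
        conn.close()

# Hypothetical schema and predictions.
schema = ["CREATE TABLE singer (name TEXT, age INTEGER)"]
valid = is_executable("SELECT name FROM singer WHERE age > 30", schema)
invalid = is_executable("SELECT nme FROM singer", schema)
```

A wrong prediction that passes this check would be counted as valid but wrong; one that fails it would be counted as invalid.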
Error analysis for text generation tasks For generation tasks, we consider four types of errors: 1) missing information (required information is not shown in the output), 2) contradiction (the output is contradictory to the input), 3) hallucination (the output contains information that cannot be verified by the input), and 4) ungrammatical. Figure 3 shows that the proportion of ungrammatical outputs is generally less than 5%. Missing information and contradiction are common errors made by T5, and performance gains generally come from reducing contradiction. Hallucination is not a common error made by T5 except for the highlighted-table-to-text task (ToTTo), where T5 tends to output information of non-highlighted cell values.
Case study We summarize some interesting observations about the model output (more in Appendix H). Compared with T5-base and T5-large, T5-3B’s outputs for text generation tasks tend to be more diverse and creative, as shown in Appendix H.2 and H.7. Also, T5-3B sometimes leverages domain knowledge to summarize facts in some tasks such as DART (e.g., describing rating 5 out of 5 as high), while the other two copy the original expressions in the input, as shown in Appendix H.5 and H.6. However, this ability puts T5-3B at risk of manipulating the information and meaning of the user request, as shown in Appendix H.3.2 and H.4.
Conclusions
In this paper, we propose the UnifiedSKG framework to promote systematic research on structured knowledge grounding by unifying 21 SKG tasks. Using UnifiedSKG as a benchmark, we demonstrate that finetuning T5 on individual tasks achieves state-of-the-art results on almost all 21 tasks. We show that multi-task prefix-tuning benefits most SKG tasks, largely improving the overall performance. For structured knowledge encoding, we find that the effectiveness of encoding variations varies across tasks. Moreover, UnifiedSKG is a challenging testbed for zero-shot and few-shot learning, shown by the poor results of large PLMs.
Limitations
UnifiedSKG establishes a powerful and reproducible starting point for SKG research. New models can be easily applied to diverse SKG tasks, and new tasks can be easily framed based on our standardized abstraction. UnifiedSKG promotes a systematic study on more general and robust advances in structured knowledge encoding, multi-task learning, zero-shot learning, and few-shot learning for SKG tasks. It also would be interesting to explore general pretraining methods within UnifiedSKG, which potentially benefit all the unified tasks. When the structured knowledge is too large for GPU memory, we truncate it based on heuristic rules, calling for future study on 1) incorporating a retrieval component in SKG and 2) designing sparse attention in T5 for structured knowledge or other means to improve model efficiency.
UnifiedSKG currently provides the correct type of structured knowledge for each task. However, how a system searches for the correct structured knowledge resources, takes appropriate action, and integrates information and results from multiple structured sources given a user request is still under-explored, and this is a prerequisite for building a unified multi-purpose SKG system.
Since we select popular tasks from each task family, we risk disproportionality in terms of the data language, domain, and population, and we actively welcome diverse, multi-lingual tasks to be added into UnifiedSKG. Also, the error analysis of SKG can be more fine-grained, and we hope our findings can promote future work on systematically studying and decomposing the behavior of PLMs on SKG tasks. Furthermore, training and evaluation data should reflect the intents and linguistic phenomena in the real world de Vries et al. (2020), suggesting more realistic tasks to be added into UnifiedSKG.
References
Appendix A Contributions
Code implementation Tianbao Xie and Chen Henry Wu implemented the code base of the UnifiedSKG framework and experiment pipeline. The code of PICARD and advice from Torsten Scholak sped up the implementation.
Task unification Tianbao Xie, Peng Shi, Michihiro Yasunaga, Chen Henry Wu, and Ming Zhong implemented the 21 tasks into the text-to-text format, adapted the metrics, and verified the performances.
Paper writing Chen Henry Wu and Tianbao Xie wrote most of the paper. Michihiro Yasunaga, Peng Shi, and Chengzu Li added results and analysis for their corresponding parts. Peng Shi drafted related work on SKG with PLMs. Torsten Scholak, Pengcheng Yin, Rui Zhang, Ruiqi Zhong, Victor Zhong, Michihiro Yasunaga, Connor Boyle, Chien-Sheng Wu, Sida Wang, Bailin Wang, Ansong Ni, Ziyu Yao, Lingpeng Kong, Caiming Xiong, Dragomir Radev, Noah A. Smith, and Luke Zettlemoyer carefully reviewed the paper and gave feedback for multiple rounds.
Experiments Chen Henry Wu, Tianbao Xie, and Chien-Sheng Wu conducted experiments on individual tasks and multi-task learning. Tianbao Xie conducted the zero-shot learning experiments. Chengzu Li and Tianbao Xie conducted the few-shot learning experiments. Tianbao Xie conducted experiments on the ordering of sequence inputs and order-sensitivity. Chengzu Li, Connor Boyle, and Peng Shi conducted the experiments on converting structured knowledge into natural language.
Human evaluation Chen Henry Wu organized the human evaluation. Torsten Scholak, Rui Zhang, Chengzu Li, Connor Boyle, Tianbao Xie, Peng Shi, Tao Yu, and Chen Henry Wu were the human participants.
Error analysis and case study Tianbao Xie, Chen Henry Wu, and Michihiro Yasunaga designed and conducted the error analysis for semantic parsing and generation tasks. Authors who participated in the human annotation selected the cases for case study.
Discussion We had three separate weekly meetings, and everyone in the project attended one of them. Torsten Scholak, Ruiqi Zhong, Pengcheng Yin, Victor Zhong, Peng Shi, Rui Zhang, Sida Wang, and Lingpeng Kong actively provided advice. Torsten Scholak provided signals that prefix-tuning would be comparable to fine-tuning. Ruiqi Zhong gave advice on analyzing the effect of model size. Pengcheng Yin and Peng Shi gave advice on the analysis of converting structured knowledge into natural language. Pengcheng Yin helped interpret experimental results. Ziyu Yao suggested that we report both sota (w/ extra) and sota (w/o extra) for a fair comparison. Victor Zhong and Bailin Wang gave valuable suggestions on multi-task learning and task transfer analysis. Luke Zettlemoyer, Noah A. Smith, Caiming Xiong, and Dragomir Radev gave valuable comments on research questions and experimental design.
Computing resources We thank Salesforce Research, an Amazon Research Award, ServiceNow Research, and Yale NLP for providing computing resources generously.
Acknowledgments
We thank Yifei Min and Libo Qin for their early-stage discussion. We thank Panupong Pasupat and William W. Cohen for their valuable feedback on our initial draft. We thank Qian Liu for his TaPeX code and advice on question answering tasks. We thank wandb for free logging and OpenAI for free Codex usage.
Appendix B Results with Full Metrics
For the KVRET dataset, instead of the version used in our main tables, we re-run another more widely used pre-processed version Madotto et al. (2018); Wu et al. (2019); Qin et al. (2020) on T5-base, T5-large, and T5-3B. Results are shown in Table 13.
Appendix C Input and Output Length Analysis
Linearization of large structured knowledge input (e.g., large tables and KGs) can be arbitrarily long, so the input needs to be truncated to fit in limited GPU memory. The input and output are tokenized by T5Tokenizer in Huggingface’s Transformers (https://huggingface.co/t5-base/tree/main). We visualize the length distribution in Figure 5, and details are presented in Table 14. Among the datasets with very long inputs, we choose WikiTableQuestion to study the impact of input length. We visualize the table length distribution and performances with different input truncation lengths in Figure 6. We observe that the accuracy increases as the input becomes longer, motivating future work to study how to effectively encode large structured input, e.g., leveraging sparse attention Zaheer et al. (2020).
Appendix D Experimental Setup
We use T5 Raffel et al. (2020) as our backbone language model. For T5-3B experiments, we use DeepSpeed (https://github.com/microsoft/DeepSpeed) to save memory. We use batch size 32 as default, except for WikiTQ, WikiSQL, and TabFact, for which we use batch size 128 because we found it to work significantly better. We use the Adafactor optimizer for T5-base and T5-large, and AdamW for T5-3B. We evaluate on the development set every 500 steps and use the average development set metric for best checkpoint selection. For all tasks, we set the learning rate to 5e-5 and use linear learning rate decay. All experiments are done on NVIDIA Tesla V100 and NVIDIA Tesla A100 GPUs.
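The schedule and checkpoint selection described above can be sketched as follows. The sketch assumes a plain linear decay from 5e-5 to zero with no warmup (the text does not say whether warmup is used), and the step counts and development scores are made-up values.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Learning rate that decays linearly from base_lr to 0 (no warmup assumed)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining

def select_best_checkpoint(dev_scores):
    """Pick the evaluation step with the highest average dev metric."""
    return max(dev_scores, key=dev_scores.get)

# Hypothetical dev metrics recorded every 500 steps.
dev_scores = {500: 0.61, 1000: 0.66, 1500: 0.64}
best_step = select_best_checkpoint(dev_scores)
lr_halfway = linear_decay_lr(500, 1000)
```

Here the checkpoint at step 1000 would be kept, and the learning rate halfway through training would be half of the initial value.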
D.2 Metric Details
For most semantic parsing tasks, we report the exact match accuracy of logical forms; for tasks that have a test suite Zhong et al. (2020), we additionally report the test suite metric. An exception is WebQSP, for which we follow previous work to execute the parses and report the F1 score. For QA, we report the exact match accuracy of answer sets. For data-to-text generation, we report sacre-BLEU Post (2018) (signature: BLEU + case.lc + numrefs.1 + smooth.exp + tok.13a + version.1.4.0). We use each task’s representative metric used by previous works. For fact verification, we report the accuracy. For high-fidelity NLG, we report BLEC Shu et al. (2021), which is the exact match between keywords in the formal language and the natural language. Unless specified, we use T5-large and report the development set performance.
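Exact match accuracy over answer sets can be sketched as below. The normalization (stripping whitespace and lowercasing) is an assumption made for illustration; the official evaluators of individual datasets may normalize answers differently.

```python
def normalize(answer):
    """Illustrative normalization: string form, trimmed and lowercased."""
    return str(answer).strip().lower()

def answer_set_exact_match(predictions, references):
    """Fraction of examples whose predicted answer set equals the gold set."""
    hits = 0
    for pred, ref in zip(predictions, references):
        if {normalize(a) for a in pred} == {normalize(a) for a in ref}:
            hits += 1
    return hits / len(references)

# Toy predictions and gold answers (made-up values).
preds = [["Paris"], ["1", "2"], ["x"]]
golds = [["paris"], ["2", "1"], ["y"]]
accuracy = answer_set_exact_match(preds, golds)
```

Because answers are compared as sets, the order in which a model lists multiple answers does not affect the score.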
D.3 T0 Zero-shot Experimental Details
For each task in UnifiedSKG, we search Sanh et al. (2021) for the most similar instructions (if none is available, we create one following their writing style), format our input accordingly, and directly test on T0-3B. The specific instructions are shown below.
D.4 GPT3 and Codex Details
For GPT3 and Codex, we set the decoding temperature to 0 (i.e., greedy decoding without sampling) for Spider, WikiTQ, MultiWoZ and TabFact. We observe a drop of 10% in the exact match metric when the temperature is set to 1, the OpenAI default. For Codex, we tune the temperature from 0 to 1 in steps of 0.1 for DART and SQL2Text, and observe no significant difference. For GPT3, we do not tune the temperature, in order to stay within our budget.
We set the max output length to 256 for Spider, WikiTQ, MultiWoZ and SQL2Text, and to 4 for TabFact to leave more room for the input (the max length in GPT3 and Codex is the sum of the input and output token lengths). We set “\n” as the stop token.
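The trade-off between input and output length can be made concrete with a small calculation. The sketch below assumes a hypothetical total context window of 2048 tokens, an illustrative value that is not reported here; `input_token_budget` is likewise a hypothetical helper.

```python
def input_token_budget(context_window, max_output_tokens):
    """Tokens left for the prompt when input and output share one length limit."""
    if max_output_tokens > context_window:
        raise ValueError("max_output_tokens exceeds the context window")
    return context_window - max_output_tokens


# Assumption: an illustrative context window, not a value from the paper.
CONTEXT_WINDOW = 2048

budget_long_output = input_token_budget(CONTEXT_WINDOW, 256)  # e.g., Spider, WikiTQ
budget_short_output = input_token_budget(CONTEXT_WINDOW, 4)   # e.g., TabFact
extra_input_tokens = budget_short_output - budget_long_output
```

Under this assumption, reducing the output length from 256 to 4 frees 252 additional tokens for the linearized input, independent of the actual window size.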
D.4.2 Prompts
We use simple prompt words for each task to concatenate the request, linearized structured knowledge, and context together. For example, we format each example in WikiTQ as “examples\n\n[linearized table] || Write a answer for [request] \nThe answer is:”, and take the completion produced by GPT3 and Codex as the prediction. We experiment on Spider with different formats of structured knowledge (e.g., linearization, description), but get similar results. Better usage of GPT3 and Codex under the UnifiedSKG framework is an interesting direction.
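The WikiTQ prompt template quoted above can be assembled as in the following sketch. The template string is copied verbatim from the text, including its original wording; the function name `build_wikitq_prompt`, the way in-context examples are passed as a single string, and the toy inputs are assumptions for illustration.

```python
def build_wikitq_prompt(examples, linearized_table, request):
    """Fill the WikiTQ prompt template with in-context examples, table, and request.

    Assumption: `examples` is a single pre-formatted string of in-context examples.
    """
    return (
        f"{examples}\n\n"
        f"{linearized_table} || Write a answer for {request} \n"
        "The answer is:"
    )


# Hypothetical inputs for illustration.
prompt = build_wikitq_prompt(
    examples="(in-context examples go here)",
    linearized_table="col: year, city row 1 : 1896, athens",
    request="which city hosted in 1896?",
)
```

The model completion following “The answer is:” is then read up to the stop token and used as the prediction.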
D.5 Human Evaluation
Participants of our human evaluation are eight of the authors of this paper. They are familiar with the tasks being evaluated. The human evaluation guideline is shown below.
D.6 Hyperparameters
Hyperparameters are shown in Table 17. For semantic parsing tasks, we decode with greedy search, i.e., with the beam size set to 1. For tasks with long linearized sequences, we use an input length of 1024 to fit as much of the input as possible; the reasons are explained in App. C.
Appendix E Training Details
Here we compare finetuning and prefix-tuning in terms of training. For prefix-tuning, we use random initialization as done by Li and Liang (2021). In general, prefix-tuning needs more steps than finetuning but can reach comparable results with continued training.
Appendix F Task Unification
A highlighted table contains a table, table metadata (such as the title), and a set of highlighted cells that entail the text description Parikh et al. (2020).
Relation triples are a set of subject-predicate-object triples to capture rich relationships in the data. Many data-to-text tasks such as DART Nan et al. (2021b) take these relation triples as inputs and generate natural language from them.
A knowledge graph is a multi-relational graph composed of entities (nodes) and relations (different types of edges). Each edge is represented as a triple of the form (head entity, relation, tail entity), also called a fact, indicating that two entities are connected by a specific relation Wang et al. (2017).
A dialogue state at any turn in a dialogue comprises the summary of the dialogue history up to that turn, such that it contains all sufficient information for the system to choose the next action Williams et al. (2016). Specifically, it captures the user goals in the conversation in the form of (slot, value) pairs. The set of possible slots is predefined in the ontology, which is typically domain-dependent, while the values assumed by each slot are provided by the user as a dialogue goal.
F.2 Linearization
Tables. Following Liu et al. (2021), we linearize the table into a sequence. By inserting several special tokens to indicate the table boundaries, a linearized table can be represented as “col: $c_1$, …, $c_N$ row 1 : $r_1$ row 2 : $r_2$ … row $M$ : $r_M$”, where $N$ and $M$ are the numbers of columns and rows.
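A minimal sketch of this table linearization is given below. The “col:” and “row i :” markers follow the format described above, while the separator between cells within the header and within a row is an assumption (exposed as the `cell_sep` parameter); `linearize_table` and the toy table are hypothetical.

```python
def linearize_table(header, rows, cell_sep=", "):
    """Flatten a table into one string: header first, then numbered rows.

    Assumption: cells are joined by `cell_sep`; the exact cell separator
    is not specified in the text and may differ in practice.
    """
    parts = ["col: " + cell_sep.join(str(h) for h in header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + cell_sep.join(str(cell) for cell in row))
    return " ".join(parts)


# Toy table with two columns and two rows.
linearized = linearize_table(
    header=["year", "city"],
    rows=[[1896, "athens"], [1900, "paris"]],
)
```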
Highlighted tables. Following Parikh et al. (2020), we represent each highlighted cell by concatenating its value, column headers, and row headers. The table is represented as the concatenation of the page title, section title, and representations of all highlighted cells.
Relation-triples and knowledge graphs. Following Nan et al. (2021b), each relation-triple is linearized as “sub : rela : obj”, and different triples are joined by “ | ”. The subgraph retrieved from the knowledge graph is treated as a list of relation-triples and we use the same formulation.
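The triple linearization can be sketched in a few lines, using the “sub : rela : obj” pattern and the “ | ” separator described above. The function name `linearize_triples` and the example triples are hypothetical.

```python
def linearize_triples(triples):
    """Join (subject, relation, object) triples into a single string.

    Each triple becomes "sub : rela : obj"; triples are joined by " | ".
    """
    return " | ".join(f"{sub} : {rela} : {obj}" for sub, rela, obj in triples)


# Toy triples; a retrieved knowledge graph subgraph would be passed the same way.
triples = [
    ("Athens", "country", "Greece"),
    ("Athens", "hosted", "1896 Summer Olympics"),
]
linearized = linearize_triples(triples)
```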
Ontology. Following Hosseini-Asl et al. (2020) and Lin et al. (2021), each slot in the ontology, along with all its possible values, is formatted as “slot : value$_1$, …, value$_n$”, and different slot-value groups are joined by “ | ”.
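The ontology linearization can be sketched in the same way. The sketch assumes that the ontology is available as a mapping from slot names to lists of possible values and that values are separated by a comma and a space; `linearize_ontology` and the example slots are hypothetical.

```python
def linearize_ontology(ontology):
    """Format each slot with its possible values and join the slots by " | ".

    Assumption: `ontology` maps a slot name to a list of value strings,
    and values within a slot are separated by ", ".
    """
    return " | ".join(
        f"{slot} : {', '.join(values)}" for slot, values in ontology.items()
    )


# Toy ontology with two slots.
ontology = {
    "hotel-parking": ["yes", "no"],
    "hotel-area": ["north", "south", "centre"],
}
linearized = linearize_ontology(ontology)
```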
F.3 Output Format
When the output is natural language or formal language, we do not modify it because it is already in sequence format. When the output is a set of answers, we use a comma followed by a space to join the answers; when it is a Boolean value, we map True to “entailed” and False to “refuted”; when it is a dialogue state, we follow Hosseini-Asl et al. (2020) to place its slot-value pairs sequentially.
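The output conversions can be sketched as a single dispatch function. The sketch covers text, answer sets, and Boolean values; dialogue states are left out because the exact serialization of slot-value pairs is not spelled out here. The name `format_output` and the choice to pass answer sets as Python lists are assumptions.

```python
def format_output(output):
    """Convert a task output into the target text sequence.

    Assumption: answer sets arrive as lists or tuples of strings.
    Dialogue states are not handled in this sketch.
    """
    # bool is checked first so that True/False are mapped to labels.
    if isinstance(output, bool):
        return "entailed" if output else "refuted"
    # Natural language and formal language are already sequences.
    if isinstance(output, str):
        return output
    # A set of answers is joined by a comma followed by a space.
    if isinstance(output, (list, tuple)):
        return ", ".join(str(answer) for answer in output)
    raise TypeError(f"unsupported output type: {type(output).__name__}")


sql_target = format_output("SELECT city FROM games WHERE year = 1896")
answers_target = format_output(["athens", "paris"])
fact_target = format_output(False)
```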