jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models

Yada Pruksachatkun, Phil Yeres, Haokun Liu, Jason Phang, Phu Mon Htut, Alex Wang, Ian Tenney, Samuel R. Bowman

Introduction

This paper introduces jiant (the name stands for “jiant is an NLP toolkit”), an open source toolkit that allows researchers to quickly experiment on a wide array of NLU tasks using state-of-the-art NLP models, and to conduct experiments on probing, transfer learning, and multitask training. jiant supports many state-of-the-art Transformer-based models implemented by HuggingFace’s Transformers package, as well as non-Transformer models such as BiLSTMs.

Packages and libraries like HuggingFace’s Transformers (Wolf et al., 2019) and AllenNLP (Gardner et al., 2017) have accelerated the process of experimenting and iterating on NLP models by both abstracting away implementation details and simplifying the model training pipeline. jiant extends the capabilities of both toolkits by presenting a wrapper that implements a variety of complex experimental pipelines in a scalable and easily controllable setting. jiant contains a task bank of over 50 tasks, including all the tasks presented in GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019b), the edge-probing suite (Tenney et al., 2019b), and the SentEval probing suite (Conneau and Kiela, 2018), as well as other individual tasks including CCG supertagging (Hockenmaier and Steedman, 2007), SocialIQA (Sap et al., 2019), and CommonsenseQA (Talmor et al., 2019). jiant is also the official baseline codebase for the SuperGLUE benchmark.

jiant’s core design principles are:

Ease of use: jiant should allow users to run a variety of experiments using state-of-the-art models via an easy-to-use, configuration-driven interface.

Reproducibility: jiant should provide features that support correct and reproducible experiments, including logging and saving and restoring model state.

Availability of NLU tasks: jiant should maintain and continue to grow a collection of tasks useful for NLU research, especially popular evaluation tasks and tasks commonly used in pretraining and transfer learning.

Availability of cutting-edge models: jiant should make implementations of state-of-the-art models available for experimentation.

Open source: jiant should be free to use, and easy to contribute to.

Early versions of jiant have already been used in multiple works, including probing analyses (Tenney et al., 2019b, a; Warstadt et al., 2019; Lin et al., 2019; Hewitt and Manning, 2019; Jawahar et al., 2019), transfer learning experiments (Wang et al., 2019a; Phang et al., 2018), and dataset and benchmark construction (Wang et al., 2019b, 2018; Kim et al., 2019).

Background

Transfer learning is an area of research that applies knowledge from pretrained models to new tasks. In recent years, Transformer-based models like BERT (Devlin et al., 2019) and T5 (Raffel et al., 2019) have yielded state-of-the-art results on the majority of benchmark tasks for language understanding through pretraining and transfer, often paired with some form of multitask learning.

jiant enables a variety of complex training pipelines through simple configuration changes, including multi-task training (Caruana, 1993; Liu et al., 2019a) and pretraining, as well as the sequential fine-tuning approach from STILTs (Phang et al., 2018). In STILTs, intermediate task training takes a pretrained model like ELMo or BERT, and applies supplementary training on a set of intermediate tasks, before finally performing single-task training on additional downstream tasks.

jiant System Overview

jiant can be cloned and installed from GitHub: https://github.com/nyu-mll/jiant. jiant v1.3.0 requires Python 3.5 or later, and jiant’s core dependencies are PyTorch (Paszke et al., 2019), AllenNLP (Gardner et al., 2017), and HuggingFace’s Transformers (Wolf et al., 2019). jiant is released under the MIT License (Open Source Initiative, 2020). jiant runs on consumer-grade hardware or in cluster environments with or without CUDA GPUs. The jiant repository also contains documentation and configuration files demonstrating how to deploy jiant in Kubernetes clusters on Google Kubernetes Engine.

2 jiant Components

Tasks: Tasks have references to task data, methods for processing data, references to classifier heads, and methods for calculating performance metrics and making predictions.

Sentence Encoder: Sentence encoders map from the indexed examples to a sentence-level representation. Sentence encoders can include an input module (e.g., Transformer models, ELMo, or word embeddings), followed by an optional second layer of encoding (usually a BiLSTM). Examples of possible sentence encoder configurations include BERT, ELMo followed by a BiLSTM, BERT with a variety of pooling and aggregation methods, or a bag of words model.

Task-Specific Output Heads: Task-specific output modules map representations from sentence encoders to outputs specific to a task, e.g. entailment/neutral/contradiction for NLI tasks, or tags for part-of-speech tagging. They also include logic for computing the corresponding loss for training (e.g. cross-entropy).

Trainer: Trainers manage the control flow for the training and validation loop for experiments. They sample batches from one or more tasks, perform forward and backward passes, calculate training metrics, evaluate on a validation set, and save checkpoints. Users can specify experiment-specific parameters such as learning rate, batch size, and more.

Config: Config files or flags are defined in HOCON (Human-Optimized Config Object Notation; lightbend, 2011) format. jiant uses HOCON’s logic to consolidate multiple config files and command-line overrides into a single run config. Configs specify parameters for jiant experiments, including choices of tasks, sentence encoder, and training routine. jiant configs support multi-phase training routines as described in section 3.3 and illustrated in Figure 2.

Configs are jiant’s primary user interface. Tasks and modeling components are designed to be modular, while jiant’s pipeline is a monolithic, configuration-driven design intended to facilitate a number of common workflows outlined in section 3.3.
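The consolidation of several config files and command-line overrides into a single run config can be illustrated with a small sketch. This is not HOCON and not jiant’s actual implementation: it is a plain-Python approximation in which later layers win on conflicts, and the function names (merge_configs, parse_overrides) and all config keys and values are hypothetical.

```python
from copy import deepcopy


def merge_configs(*layers):
    """Merge config layers left to right; later layers win on conflicts."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Nested blocks (e.g. task-specific settings) are merged key by key.
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = deepcopy(value)
    return merged


def parse_overrides(text):
    """Parse a 'key = value, key = value' string into a flat dict of strings."""
    overrides = {}
    for item in text.split(","):
        if item.strip():
            key, _, value = item.partition("=")
            overrides[key.strip()] = value.strip()
    return overrides


# Hypothetical layers: a default config, an experiment config, and CLI overrides.
defaults = {"batch_size": "32", "lr": "0.0001", "mnli": {"max_vals": "10"}}
experiment = {"lr": "0.00001", "mnli": {"max_vals": "5"}}
run_config = merge_configs(defaults, experiment, parse_overrides("batch_size = 16"))
print(run_config)
```

The sketch keeps all values as strings for simplicity; a real HOCON parser also handles typing, substitutions, and includes.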

3 jiant Pipeline Overview

jiant’s core pipeline consists of the five stages described below and illustrated in Figure 2:

A config or multiple configs defining an experiment are interpreted. Users can choose and configure models, tasks, and stages of training and evaluation.

The tasks and sentence encoder are prepared:

The task data is loaded, tokenized, and indexed, and the preprocessed task objects are serialized and cached. In this process, AllenNLP is used to create the vocabulary and index the tokenized data.

The sentence encoder is constructed and (optionally) pretrained weights are loaded. The sentence encoder’s weights can optionally be left frozen, or be included in the training procedure.

The task-specific output heads are created for each task, and task heads are attached to a common sentence encoder. Optionally, different tasks can share the same output head, as in Liu et al. (2019a).

Optionally, in the intermediate phase the trainer samples batches randomly from one or more tasks and trains the shared model. Tasks can be sampled using a variety of sample weighting methods, e.g., uniform or proportional to the tasks’ number of training batches or examples.

Optionally, in the target training phase, a copy of the model is configured and trained or fine-tuned for each target task separately.

Optionally, the model is evaluated on the validation and/or test sets of the target tasks.
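The task sampling used in the intermediate phase can be sketched as follows. This is a minimal illustration of uniform and size-proportional weighting, not jiant’s trainer code; the function names, the task names, and the batch counts are assumptions made for the example.

```python
import random


def task_sampling_weights(task_sizes, method="proportional"):
    """Return a normalized sampling probability for each task."""
    if method == "uniform":
        raw = {name: 1.0 for name in task_sizes}
    elif method == "proportional":
        raw = {name: float(size) for name, size in task_sizes.items()}
    else:
        raise ValueError(f"unknown weighting method: {method}")
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def sample_task_schedule(task_sizes, n_steps, method="proportional", seed=0):
    """Pick which task supplies the batch at each training step."""
    weights = task_sampling_weights(task_sizes, method)
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=n_steps)


# Hypothetical number of training batches per task.
task_sizes = {"mnli": 300, "rte": 100}
print(task_sampling_weights(task_sizes, "uniform"))
print(task_sampling_weights(task_sizes, "proportional"))
print(sample_task_schedule(task_sizes, n_steps=10))
```

With proportional weighting the larger task dominates the schedule, while uniform weighting gives small tasks the same share of updates as large ones.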

4 Task and Model resources in jiant

jiant supports over 50 tasks. Task types include classification, regression, sequence generation, tagging, masked language modeling, and span prediction. jiant focuses on NLU tasks like MNLI (Williams et al., 2018), CommonsenseQA (Talmor et al., 2019), the Winograd Schema Challenge (Levesque et al., 2012), and SQuAD (Rajpurkar et al., 2016). A full inventory of tasks and task variants is available in the jiant/tasks module.

jiant provides support for cutting-edge sentence encoder models, including support for HuggingFace’s Transformers. Supported models include: ELMo (Peters et al., 2018), GPT (Radford, 2018), BERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), GPT-2 (Radford et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019b), and ALBERT (Lan et al., 2019). jiant also supports the from-scratch training of (bidirectional) LSTMs (Hochreiter and Schmidhuber, 1997) and deep bag of words models (Iyyer et al., 2015), as well as syntax-aware models such as PRPN (Shen et al., 2018) and ON-LSTM (Shen et al., 2019). jiant also supports word embeddings such as GloVe (Pennington et al., 2014).

5 User Interface

jiant experiments can be run with a simple CLI:

jiant provides default config files that allow running many experiments without modifying source code.

jiant also provides baseline config files that can serve as a starting point for model development and evaluation against GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019b) benchmarks.

More advanced configurations can be developed by composing multiple configuration files and overrides. Figure 3 shows a config file that overrides a default config, defining an experiment that uses BERT as the sentence encoder. This config includes an example of a task-specific configuration, which can be overridden in another config file or via a command line override.

Because jiant implements the option to provide command line overrides with a flag, it is easy to write scripts that launch jiant experiments over a range of parameters, for example while performing grid search across hyperparameters. jiant users have successfully run large-scale experiments launching hundreds of runs on both Kubernetes and Slurm.
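A launch script for such a sweep might look like the sketch below, which only builds and prints the commands rather than running them. The entry point (main.py), the flag names (--config_file, --overrides), the config file name, and the config keys (exp_name, run_name, lr, batch_size) are assumptions made for illustration and should be checked against the jiant documentation.

```python
import itertools
import shlex


def build_commands(config_file, grid, fixed=None):
    """Build one command per point in the hyperparameter grid."""
    fixed = fixed or {}
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[n] for n in names)):
        settings = dict(fixed)
        settings.update(zip(names, values))
        # Give every run a distinct name derived from its hyperparameters.
        settings["run_name"] = "_".join(f"{n}{v}" for n, v in zip(names, values))
        overrides = ", ".join(f"{k} = {v}" for k, v in settings.items())
        commands.append(
            ["python", "main.py", "--config_file", config_file, "--overrides", overrides]
        )
    return commands


# Hypothetical grid over learning rate and batch size.
grid = {"lr": [1e-5, 3e-5], "batch_size": [16, 32]}
commands = build_commands("my_experiment.conf", grid, fixed={"exp_name": "grid_demo"})
for cmd in commands:
    # Each line could be submitted to a scheduler such as Slurm, or run directly.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to subprocess.run or to wrap them in a cluster submission command.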

6 Example jiant Use Cases and Options

Here we highlight some example use cases and key corresponding jiant config options required in these experiments:

Fine-tune BERT on SWAG (Zellers et al., 2018) and SQuAD (Rajpurkar et al., 2016), then fine-tune on HellaSwag (Zellers et al., 2019):

Train a probing classifier over a frozen BERT model, as in Tenney et al. (2019a):

Compare performance of GloVe (Pennington et al., 2014) embeddings using a BiLSTM:

Evaluate ALBERT (Lan et al., 2019) on the MNLI (Williams et al., 2018) task:

7 Optimizations and Other Features

jiant implements features that improve run stability and efficiency:

jiant implements checkpointing options designed to offer efficient early stopping and to show consistent behavior when restarting after an interruption.

jiant caches preprocessed task data to speed up reuse across experiments which share common data resources and artifacts.

jiant implements gradient accumulation and multi-GPU training, which enable training on larger batches than can fit in memory on a single GPU.

jiant supports outputting predictions in a format ready for GLUE and SuperGLUE benchmark submission.

jiant generates custom log files that capture experimental configurations, training and evaluation metrics, and relevant run-time information.

jiant generates TensorBoard event files (Abadi et al., 2015) for training and evaluation metric tracking. TensorBoard event files can be visualized using the TensorBoard Scalars Dashboard.
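Of the features above, gradient accumulation is the one most easily shown in isolation. The sketch below is a dependency-free illustration on a one-parameter linear model, not jiant’s trainer: it checks that summing suitably scaled micro-batch gradients reproduces the gradient of the full batch, which is why accumulation allows an effective batch size larger than what fits in memory at once. The model, data, and function names are assumptions made for the example.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate gradients over micro-batches before a single update."""
    grad = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Scale by the micro-batch's share of examples so that a smaller
        # final micro-batch is weighted correctly.
        grad += mse_grad(w, micro) * len(micro) / len(batch)
    return grad


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
w = 1.5

full = mse_grad(w, data)                                # one pass over the full batch
accum = accumulated_grad(w, data, micro_batch_size=2)   # two smaller passes

# A single optimizer step is taken only after all micro-batches are processed.
learning_rate = 0.01
w_new = w - learning_rate * accum
print(full, accum, w_new)
```

In a deep learning framework the same idea is usually implemented by calling the backward pass on each micro-batch loss and stepping the optimizer once every few micro-batches.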

8 Extensibility

jiant’s design offers conveniences that reduce the need to modify code when making changes:

jiant’s task registry makes it easy to define a new version of an existing task using different data. Once the new task is defined in the task registry, the task is available as an option in jiant’s config.

jiant’s sentence encoder and task output head abstractions allow for easy support of new sentence encoders.

In use cases requiring the introduction of a new task, users can use class inheritance to build on a number of available parent task types including classification, tagging, span prediction, span classification, sequence generation, regression, ranking, and multiple choice task classes. For these task types, corresponding task-specific output heads are already implemented.
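The registry and inheritance pattern described above can be sketched in a few lines. This is a generic illustration of a decorator-based registry, not jiant’s actual classes or signatures: the names TASK_REGISTRY, register_task, ClassificationTask, ThreeWayNLITask, and build_task, as well as the task names and data paths, are all hypothetical.

```python
TASK_REGISTRY = {}


def register_task(name, data_path):
    """Class decorator that makes a task available under a config-facing name."""
    def decorator(cls):
        TASK_REGISTRY[name] = (cls, data_path)
        return cls
    return decorator


class ClassificationTask:
    """Hypothetical parent task type; a matching output head is assumed to exist."""

    def __init__(self, name, data_path, n_classes):
        self.name = name
        self.data_path = data_path
        self.n_classes = n_classes


# Registering the same class twice defines a new version of an existing task
# that differs only in the data it reads.
@register_task("nli-original", data_path="data/nli/")
@register_task("nli-variant", data_path="data/nli_variant/")
class ThreeWayNLITask(ClassificationTask):
    def __init__(self, name, data_path):
        super().__init__(name, data_path, n_classes=3)


def build_task(name):
    """Look up a task by the name used in the config and instantiate it."""
    cls, data_path = TASK_REGISTRY[name]
    return cls(name, data_path)


task = build_task("nli-variant")
print(task.name, task.data_path, task.n_classes)
```

Because lookup goes through the registry, adding a task variant requires no change to the code that reads task names from the config.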

More than 30 researchers and developers from more than 5 institutions have contributed code to the jiant project (https://github.com/nyu-mll/jiant/graphs/contributors). jiant’s maintainers welcome pull requests that introduce new tasks or sentence encoder components, and pull requests are actively reviewed. The jiant repository’s continuous integration system requires that all pull requests pass unit and integration tests and meet Black (https://github.com/psf/black) code formatting requirements.

9 Limitations and Development Roadmap

While jiant is quite flexible in the pipelines that can be specified through configs, and some components are highly modular (e.g., tasks, sentence encoders, and output heads), modification of the pipeline code can be difficult. For example, training in more than two phases would require modifying the trainer code (while not supported by config options, training in more than two phases is possible by using jiant’s checkpointing features to reload models for additional rounds of training). Making multi-stage training configurations more flexible is on jiant’s development roadmap.

jiant’s development roadmap prioritizes adding support for new Transformer models, and adding tasks that are commonly used for pretraining and evaluation in NLU. Additionally, there are plans to make jiant’s training phase configuration options more flexible to allow training in more than two phases, and to continue to refactor jiant’s code to keep jiant flexible to track developments in NLU research.

Benchmark Experiments

To benchmark jiant, we perform a set of experiments that reproduce external results for single-task fine-tuning and transfer learning experiments. jiant has been benchmarked extensively in both published and ongoing work on a majority of the implemented tasks.

Next, we benchmark jiant’s transfer learning regime. We perform transfer experiments from MNLI to BoolQ with BERT-large. In this configuration, Clark et al. (2019) demonstrated an accuracy improvement from 78.1 to 82.2 on the dev set, and jiant achieves an improvement from 78.1 to 80.3.

Conclusion

jiant provides a configuration-driven interface for defining transfer learning and representation learning experiments using a bank of over 50 NLU tasks, cutting-edge sentence encoder models, and multi-task and multi-stage training procedures. Further, jiant is shown to be able to replicate published performance on various NLU tasks.

jiant’s modular design of task and sentence encoder components makes it possible for users to quickly and easily experiment with a large number of tasks, models, and parameter configurations, without editing source code. jiant’s design also makes it easy to add new tasks, and jiant’s architecture makes it convenient to extend jiant to support new sentence encoders.

jiant code is open source, and jiant invites contributors to open issues or submit pull requests to the jiant project repository: https://github.com/nyu-mll/jiant.

Acknowledgments

Katherin Yu, Jan Hula, Patrick Xia, Raghu Pappagari, Shuning Jin, R. Thomas McCoy, Roma Patel, Yinghui Huang, Edouard Grave, Najoung Kim, Thibault Févry, Berlin Chen, Nikita Nangia, Anhad Mohananey, Katharina Kann, Shikha Bordia, Nicolas Patry, David Benton, and Ellie Pavlick have contributed substantial engineering assistance to the project.

The early development of jiant took place at the 2018 Frederick Jelinek Memorial Summer Workshop on Speech and Language Technologies, and was supported by Johns Hopkins University with unrestricted gifts from Amazon, Facebook, Google, Microsoft and Mitsubishi Electric Research Laboratories.

Subsequent development was made possible in part by a donation to NYU from Eric and Wendy Schmidt made by recommendation of the Schmidt Futures program, by support from Intuit Inc., and by support from Samsung Research under the project Improving Deep Learning using Latent Structure. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU in this work. Alex Wang’s work on the project is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1342536. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Yada Pruksachatkun’s work on the project is supported in part by the Moore-Sloan Data Science Environment as part of the NYU Data Science Services initiative. Sam Bowman’s work on jiant during Summer 2019 took place in his capacity as a visiting researcher at Google.