fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

Introduction

Neural sequence-to-sequence models have been successful on a variety of text generation tasks, including machine translation, abstractive document summarization, and language modeling. Accordingly, both researchers and industry professionals can benefit from a fast and easily extensible sequence modeling toolkit.

There are several toolkits with similar basic functionality, but they differ in focus area and intended audiences. For example, OpenNMT (Klein et al., 2017) is a community-built toolkit written in multiple languages with an emphasis on extensibility. MarianNMT (Junczys-Dowmunt et al., 2018) focuses on performance, and its backend is written in C++ for fast automatic differentiation. OpenSeq2Seq (Kuchaiev et al., 2018) provides reference implementations for fast distributed and mixed precision training. Tensor2tensor (Vaswani et al., 2018) and Sockeye (Hieber et al., 2018) focus on production-readiness.

In this paper, we present fairseq, a sequence modeling toolkit written in PyTorch that is fast, extensible, and useful for both research and production. fairseq features: (i) a common interface across models and tasks that can be extended with user-supplied plug-ins (§2); (ii) efficient distributed and mixed precision training, enabling training over datasets with hundreds of millions of sentences on current hardware (§3); (iii) state-of-the-art implementations and pretrained models for machine translation, summarization, and language modeling (§4); and (iv) optimized inference with multiple supported search algorithms, including beam search, diverse beam search (Vijayakumar et al., 2016), and top-k sampling. fairseq is distributed with a BSD license and is available on GitHub at https://github.com/pytorch/fairseq.

Design

fairseq can be extended through five types of user-supplied plug-ins, which enable experimenting with new ideas while reusing existing components as much as possible.

Models

define the neural network architecture and encapsulate all learnable parameters. Models extend the BaseFairseqModel class, which in turn extends torch.nn.Module. Thus any fairseq model can be used as a stand-alone module in other PyTorch code. Models can additionally predefine named architectures with common network configurations (e.g., embedding dimension, number of layers, etc.). We also abstracted the methods through which the model interacts with the generation algorithm, e.g., beam search, through step-wise prediction. This isolates model implementation from the generation algorithm.

Criterions

compute the loss given the model and a batch of data, roughly: loss = criterion(model, batch). This formulation makes criterions very expressive, since they have complete access to the model. For example, a criterion may perform on-the-fly generation to support sequence-level training (Edunov et al., 2018b) or online backtranslation (Edunov et al., 2018a; Lample et al., 2018). Alternatively, in a mixture-of-experts model, a criterion may implement EM-style training and backpropagate only through the expert that produces the lowest loss (Shen et al., 2019).
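To make the loss = criterion(model, batch) formulation concrete, the following is a minimal, framework-free sketch. The names ToyMixtureModel and min_expert_criterion are hypothetical and are not part of fairseq; the hard selection of the lowest-loss expert only mimics the EM-style idea on scalar losses and involves no backpropagation.

```python
import math


class ToyMixtureModel:
    """Hypothetical model: each expert assigns a probability to every token."""

    def __init__(self, expert_probs):
        # expert_probs[k][token] is the probability expert k gives to `token`.
        self.expert_probs = expert_probs

    def expert_nll(self, expert, targets):
        """Negative log-likelihood of `targets` under one expert."""
        probs = self.expert_probs[expert]
        return -sum(math.log(probs[t]) for t in targets)


def min_expert_criterion(model, batch):
    """Criterion with full access to the model: score the batch under every
    expert and return the loss of the best expert together with its index."""
    num_experts = len(model.expert_probs)
    losses = [model.expert_nll(k, batch["target"]) for k in range(num_experts)]
    best = min(range(num_experts), key=losses.__getitem__)
    return losses[best], best


model = ToyMixtureModel([
    {"a": 0.9, "b": 0.1},  # expert 0 prefers "a"
    {"a": 0.2, "b": 0.8},  # expert 1 prefers "b"
])
loss, expert = min_expert_criterion(model, {"target": ["b", "b", "a"]})
```

Because the criterion receives the model itself rather than precomputed outputs, it is free to query each expert separately, which is the kind of flexibility the formulation is meant to allow.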

Tasks

store dictionaries, provide helpers for loading and batching data and define the training loop. They are intended to be immutable and primarily interface between the various components. We provide tasks for translation, language modeling, and classification.

Optimizers

update the model parameters based on the gradients. We provide wrappers around most PyTorch optimizers and an implementation of Adafactor (Shazeer and Stern, 2018), which is a memory-efficient variant of Adam.

Learning Rate Schedulers

update the learning rate over the course of training. We provide several popular schedulers, e.g., the inverse square-root scheduler from Vaswani et al. (2017) and cyclical schedulers based on warm restarts (Loshchilov and Hutter, 2016).
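As an illustration, the inverse square-root schedule can be written out directly from the formula given by Vaswani et al. (2017): the rate grows linearly during warmup and then decays with the inverse square root of the step number. The function name and default values below are assumptions chosen for the sketch; the scheduler shipped with fairseq may be parameterized differently.

```python
def inverse_sqrt_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a 1-based update step, following
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Sample the schedule during warmup, at its peak, and during decay.
schedule = [inverse_sqrt_lr(s) for s in (1, 2000, 4000, 16000)]
peak = inverse_sqrt_lr(4000)
```

The rate peaks at step = warmup_steps; quadrupling the step count after that point halves the learning rate.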

Reproducibility and forward compatibility.

fairseq includes features designed to improve reproducibility and forward compatibility. For example, checkpoints contain the full state of the model, optimizer and dataloader, so that results are reproducible if training is interrupted and resumed. fairseq also provides forward compatibility, i.e., models trained using old versions of the toolkit will continue to run on the latest version through automatic checkpoint upgrading.

Implementation

fairseq is implemented in PyTorch and provides efficient batching, mixed precision training, and multi-GPU as well as multi-machine training.

There are multiple strategies to batch input and output sequence pairs (Morishita et al., 2017). fairseq minimizes padding within a mini-batch by grouping source and target sequences of similar length. The content of each mini-batch stays the same throughout training; however, the mini-batches themselves are shuffled randomly every epoch. When training on more than one GPU or machine, the mini-batches for each worker are likely to differ in average sentence length, which results in more representative updates.
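The batching strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions: each example has a single length (rather than separate source and target lengths), batch size is capped by a hypothetical max_tokens budget on the padded batch, and the helper names build_batches and epoch_order are ours, not fairseq's.

```python
import random


def build_batches(lengths, max_tokens):
    """Group example indices of similar length so that each batch, once padded
    to its longest example, holds at most `max_tokens` tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        # Lengths are visited in ascending order, so the incoming example
        # would be the longest one in the batch and sets the padded width.
        if current and (len(current) + 1) * lengths[idx] > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


def epoch_order(batches, epoch, seed=1):
    """Shuffle the order of the batches for one epoch; the content of each
    batch is left untouched."""
    rng = random.Random(seed + epoch)
    shuffled = list(batches)
    rng.shuffle(shuffled)
    return shuffled


lengths = [5, 21, 6, 20, 7, 19, 4, 22]
batches = build_batches(lengths, max_tokens=44)
first_epoch = epoch_order(batches, epoch=0)
```

Here the four short examples share one batch and the long examples are paired, so little of each batch is padding. Seeding the shuffle with the epoch number keeps the batch order reproducible if training is resumed.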

Multi-GPU training.

fairseq uses the NCCL2 library and torch.distributed for inter-GPU communication. Models are trained in a synchronous optimization setup where each GPU has a copy of the model to process a sub-batch of data after which gradients are synchronized between GPUs; all sub-batches constitute a mini-batch. Even though sub-batches contain a similar number of tokens, we still observe a high variance in processing times. In multi-GPU or multi-machine setups, this results in idle time for most GPUs while slower workers are finishing their work (Figure 1 (a)). fairseq mitigates the effect of stragglers by overlapping gradient synchronization between workers with the backward pass and by accumulating gradients over multiple mini-batches for each GPU (Ott et al., 2018b).

Overlapping gradient synchronization starts to synchronize gradients of parts of the network when they are computed. In particular, when the gradient computation for a layer finishes, fairseq adds the result to a buffer. When the size of the buffer reaches a predefined threshold, the gradients are synchronized in a background thread while back-propagation continues as usual (Figure 1 (b)). Next, we accumulate gradients for multiple sub-batches on each GPU which reduces the variance in processing time between workers since there is no need to wait for stragglers after each sub-batch (Figure 1 (c)). This also increases the effective batch size but we found that models can still be trained effectively (Ott et al., 2018b).
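The gradient accumulation step can be illustrated on a toy problem. The sketch below assumes a one-parameter least-squares model and hypothetical helper names; it only shows that summing gradients over several sub-batches and normalizing once yields the same update as a single large batch, with one synchronization point instead of one per sub-batch. It does not model the background-thread overlap.

```python
def grad_sum(w, examples):
    """Sum over (x, y) examples of the gradient of (w * x - y)^2 w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in examples)


def accumulated_gradient(w, sub_batches):
    """Accumulate gradients locally over several sub-batches and normalize
    once at the end, where a real worker would synchronize and update."""
    total, count = 0.0, 0
    for sub_batch in sub_batches:
        total += grad_sum(w, sub_batch)  # no communication between sub-batches
        count += len(sub_batch)
    return total / count


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
w = 0.5
g_full = grad_sum(w, data) / len(data)                     # one large batch
g_accum = accumulated_gradient(w, [data[:2], data[2:]])    # two sub-batches
w_next = w - 0.1 * g_accum                                 # single update
```

Since the two gradients agree, accumulation changes how often workers must wait for each other, not the update itself; the effective batch size grows with the number of accumulated sub-batches.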

Mixed precision.

Recent GPUs enable efficient half precision floating point (FP16) computation. fairseq provides support for both full precision (FP32) and FP16 at training and inference. We perform all forward-backward computations as well as the all-reduce for gradient synchronization between workers in FP16. However, the parameter updates remain in FP32 to preserve accuracy. fairseq implements dynamic loss scaling (Micikevicius et al., 2018) in order to avoid underflows for activations and gradients because of the limited precision offered by FP16. This scales the loss right after the forward pass to fit into the FP16 range while the backward pass is left unchanged. After the FP16 gradients are synchronized between workers, we convert them to FP32, restore the original scale, and update the weights.
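A small sketch may help show why loss scaling is needed and how a dynamic scale can be maintained. It assumes the common scheme in which the scale is reduced when a scaled gradient overflows and increased again after a window of successful updates; the class name DynamicLossScaler, its parameters, and the default values here are illustrative assumptions and do not describe fairseq's actual code. FP16 rounding is simulated with the half-precision format of Python's struct module.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class DynamicLossScaler:
    """Illustrative scaler: shrink the scale on overflow, grow it again after
    `scale_window` consecutive successful updates."""

    def __init__(self, init_scale=2.0 ** 7, scale_factor=2.0, scale_window=3):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self._good_steps = 0

    def step(self, scaled_grads):
        """Return gradients at their original scale, or None if the scaled
        gradients overflowed and the update should be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= self.scale_factor
            self._good_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]  # restore original scale
        self._good_steps += 1
        if self._good_steps % self.scale_window == 0:
            self.scale *= self.scale_factor
        return grads


tiny_grad = 1e-8
lost = to_fp16(tiny_grad)                    # underflows to zero in FP16
scaler = DynamicLossScaler(init_scale=1024.0, scale_window=2)
kept = to_fp16(tiny_grad * scaler.scale)     # survives once scaled
restored = scaler.step([kept])[0]            # back to roughly 1e-8
skipped = scaler.step([float("inf")])        # overflow: skip and halve scale
```

Scaling the loss scales every gradient by the same factor, so dividing by the scale before the FP32 weight update recovers the original magnitudes up to FP16 rounding error.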

Inference.

fairseq provides fast inference for non-recurrent models (Gehring et al., 2017; Vaswani et al., 2017; Fan et al., 2018b; Wu et al., 2019) through incremental decoding, where the model states of previously generated tokens are cached in each active beam and re-used. This can speed up a naïve implementation without caching by up to an order of magnitude, since only new states are computed for each token. For some models, this requires a component-specific caching implementation, e.g., multi-head attention in the Transformer architecture.
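The effect of caching can be sketched with a toy decoder. Everything below is hypothetical: ToyDecoder stands in for a non-recurrent model whose output at each step depends on a per-token state for all previous tokens, decoding is greedy over a single hypothesis rather than a beam, and the arithmetic has no meaning beyond making the two decoding loops comparable.

```python
class ToyDecoder:
    """Hypothetical decoder that counts how many per-token states it computes."""

    def __init__(self):
        self.state_computations = 0

    def compute_state(self, token):
        # Stand-in for the expensive per-token computation (e.g. keys/values).
        self.state_computations += 1
        return (token * 31 + 7) % 101

    def next_token(self, states):
        # Stand-in for attending over all states and picking the next token.
        return sum(states) % 50


def decode_naive(decoder, first_token, steps):
    """Recompute the states of all previous tokens at every step."""
    tokens = [first_token]
    for _ in range(steps):
        states = [decoder.compute_state(t) for t in tokens]
        tokens.append(decoder.next_token(states))
    return tokens


def decode_incremental(decoder, first_token, steps):
    """Cache earlier states and compute only the state of the newest token."""
    tokens, cache = [first_token], []
    for _ in range(steps):
        cache.append(decoder.compute_state(tokens[-1]))
        tokens.append(decoder.next_token(cache))
    return tokens


naive, incremental = ToyDecoder(), ToyDecoder()
out_naive = decode_naive(naive, first_token=3, steps=10)
out_incremental = decode_incremental(incremental, first_token=3, steps=10)
```

Both loops produce the same tokens, but over 10 steps the naive loop computes 55 states while the incremental loop computes 10; the gap widens quadratically with output length.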

During inference we build batches with a variable number of examples up to a user-specified number of tokens, similar to training. fairseq also supports inference in FP16 which increases decoding speed by 54% compared to FP32 with no loss in accuracy (Table 1).

Applications

fairseq has been used in many applications, such as machine translation (Gehring et al., 2017; Edunov et al., 2018b, a; Chen et al., 2018; Ott et al., 2018a; Song et al., 2018; Wu et al., 2019), language modeling (Dauphin et al., 2017; Baevski and Auli, 2019), abstractive document summarization (Fan et al., 2018a; Liu et al., 2018; Narayan et al., 2018), story generation (Fan et al., 2018b, 2019), error correction (Chollampatt and Ng, 2018), multilingual sentence embeddings (Artetxe and Schwenk, 2018), and dialogue (Miller et al., 2017; Dinan et al., 2019).

Machine translation

We provide reference implementations of several popular sequence-to-sequence models which can be used for machine translation, including LSTM (Luong et al., 2015), convolutional models (Gehring et al., 2017; Wu et al., 2019), and Transformer (Vaswani et al., 2017).

We evaluate a “big” Transformer encoder-decoder model on two language pairs, WMT English to German (En–De) and WMT English to French (En–Fr). For En–De we replicate the setup of Vaswani et al. (2017), which relies on WMT’16 for training with 4.5M sentence pairs; we validate on newstest13 and test on newstest14. The 32K vocabulary is based on a joint source and target byte pair encoding (BPE; Sennrich et al. 2016). For En–Fr, we train on WMT’14 and borrow the setup of Gehring et al. (2017) with 36M training sentence pairs. We use newstest12+13 for validation and newstest14 for test. The 40K vocabulary is based on a joint source and target BPE.

We measure case-sensitive tokenized BLEU with multi-bleu (Hoang et al., 2006) and de-tokenized BLEU with SacreBLEU (Post, 2018). The SacreBLEU hash is BLEU+case.mixed+lang.en-{de,fr}+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.2.9. All results use beam search with a beam width of 4 and length penalty of 0.6, following Vaswani et al. (2017). fairseq results are summarized in Table 2. We reported improved BLEU scores over Vaswani et al. (2017) by training with a bigger batch size and an increased learning rate (Ott et al., 2018b).

Language modeling

fairseq supports language modeling with gated convolutional models (Dauphin et al., 2017) and Transformer models (Vaswani et al., 2017). Models can be trained using a variety of input and output representations, such as standard token embeddings, convolutional character embeddings (Kim et al., 2016), adaptive softmax (Grave et al., 2017), and adaptive inputs (Baevski and Auli, 2019). We also provide tutorials and pre-trained models that replicate the results of Dauphin et al. (2017) and Baevski and Auli (2019) on WikiText-103 and the One Billion Word datasets.

We evaluate two Transformer language models, which use only a decoder network and adaptive input embeddings, following Baevski and Auli (2019). The first model has 16 blocks, inner dimension 4K and embedding dimension 1K; results on WikiText-103 are in Table 3. The second model has 24 blocks, inner dimension 8K and embedding dimension 1.5K; results on the One Billion Word benchmark are in Table 4.

Abstractive document summarization

Next, we experiment with abstractive document summarization, where we use a base Transformer to encode the input document and then generate a summary with a decoder network. We use the CNN-Dailymail dataset (Hermann et al., 2015; Nallapati et al., 2016) of news articles paired with multi-sentence summaries. We evaluate on the full-text version with no entity anonymization (See et al., 2017); we truncate articles to 400 tokens (See et al., 2017). We use BPE with 30K operations to form our vocabulary, following Fan et al. (2018a). To evaluate, we use the standard ROUGE metric (Lin, 2004) and report ROUGE-1, ROUGE-2, and ROUGE-L. To generate summaries, we follow standard practice in tuning the minimum output length and disallowing repeated trigrams (Paulus et al., 2017). Table 5 shows results of fairseq. We also consider a configuration where we input pre-trained language model representations to the encoder network; this language model was trained on newscrawl and CNN-Dailymail, totalling 193M sentences.
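The trigram constraint used during generation can be sketched as a check on the tokens generated so far. The function name banned_next_tokens is hypothetical and the sketch operates on a single hypothesis of string tokens; it illustrates the idea of blocking repeated n-grams rather than fairseq's own implementation.

```python
def banned_next_tokens(prefix, n=3):
    """Return the tokens that would complete an n-gram already present in
    `prefix` if they were generated next."""
    if len(prefix) < n:
        return set()  # no complete n-gram exists yet
    # The last n-1 generated tokens form the start of the next n-gram.
    context = tuple(prefix[len(prefix) - (n - 1):])
    banned = set()
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned


def pick_next(scores, prefix, n=3):
    """Choose the highest-scoring candidate that does not repeat an n-gram."""
    banned = banned_next_tokens(prefix, n)
    allowed = {tok: s for tok, s in scores.items() if tok not in banned}
    return max(allowed, key=allowed.get)


prefix = "the cat sat on the cat".split()
banned = banned_next_tokens(prefix)
choice = pick_next({"sat": 0.7, "ran": 0.2, "slept": 0.1}, prefix)
```

In this example "sat" is the most likely continuation but would repeat the trigram "the cat sat", so the next-best candidate is chosen instead.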

Conclusion

We presented fairseq, a fast, extensible toolkit for sequence modeling that is scalable and suitable for many applications. In the future, we will continue the development of the toolkit to enable further research advances.

Acknowledgements

We thank Jonas Gehring for writing the original Lua/Torch version of fairseq.