Q8BERT: Quantized 8Bit BERT

Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat

Introduction

Pre-trained transformer language models (GPT, XLNet, XLM, BERT) have demonstrated State-of-the-Art (SOTA) results for a variety of NLP tasks such as sentence classification, sequence tagging and question answering, either by extracting contextual word representations or by fine-tuning the whole model on a target task. The models are pre-trained on extremely large corpora and have a large number of parameters. For example, Devlin et al. introduced two pre-trained models: BERT-Base, which has 110M parameters in 32bit Floating Point (FP32) representation, and BERT-Large, which has 334M parameters in FP32 representation. Both BERT models have a high memory footprint and require heavy compute and wide bandwidth during inference. In addition, real-time NLP applications that integrate BERT have to meet low latency requirements to achieve a high quality customer experience. The computational characteristics of BERT therefore pose a challenge to deployment in production environments. These models will have a major impact on the way business organizations consume computing resources, since computing resources will have to handle loading of large models and heavy feed-forward calculations, shifting workload focus from lower level training to more application-specific fine-tuning and inference. Therefore, it is crucial to develop energy-efficient and minimum-cost methods to run these models in production.

Model compression algorithms are used to reduce the compute and memory resources required for running inference. For example, Han et al. used a pipeline of pruning, quantization and Huffman encoding to achieve a compression ratio of $49\times$ for VGG-16. As a result, the compressed VGG-16 fits into an on-chip SRAM cache, which allows faster access times with less power in comparison to off-chip DRAM memory. In another example, Jacob et al. introduced a method of training linear quantized Convolutional Neural Networks (CNN) that uses Integer Arithmetic instead of Floating Point Arithmetic, which can be up to $4\times$ faster while using only $25\%$ of the memory footprint.

In this work, we present a method for achieving best-in-class compression-accuracy ratio for BERT. To do this, we apply quantization-aware training during the fine-tuning process of BERT. We quantize all GEMM (General Matrix Multiply) operations in BERT Fully Connected (FC) and Embedding layers. We simulate 8bit quantized inference while maintaining $99\%$ accuracy in comparison to the FP32 version of BERT for eight different NLP tasks. Moreover, since we quantize all the FC and Embedding layers’ weights, which comprise over $99\%$ of the model’s weights, to 8bit, we achieve a memory footprint $4\times$ smaller than the original BERT. In addition, it is possible to use our method to implement efficient inference with hardware that supports 8bit arithmetic and an optimized library for 8bit GEMM. We have released our work as part of our open source model library NLP Architect (https://github.com/NervanaSystems/nlp-architect).
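The $4\times$ figure follows from simple storage arithmetic, sketched below for BERT-Base. The parameter count comes from the text; the quantized fraction of 0.99 is the lower bound quoted above, and the assumption that all remaining parameters stay in FP32 is ours, so the mixed figure is only a rough estimate.

```python
# Rough weight-storage arithmetic for BERT-Base (110M parameters, from the text).
PARAMS = 110_000_000
BYTES_FP32 = 4  # one 32-bit float per parameter
BYTES_INT8 = 1  # one 8-bit integer per parameter


def megabytes(num_params: float, bytes_per_param: int) -> float:
    """Storage in megabytes (10^6 bytes) for num_params parameters."""
    return num_params * bytes_per_param / 1e6


def mixed_megabytes(num_params: int, quantized_fraction: float) -> float:
    """Storage when a fraction of the weights is Int8 and the rest stays FP32.

    quantized_fraction is an assumption: the text only says "over 99%".
    """
    quantized = num_params * quantized_fraction
    remaining = num_params - quantized
    return megabytes(quantized, BYTES_INT8) + megabytes(remaining, BYTES_FP32)


fp32_mb = megabytes(PARAMS, BYTES_FP32)
mixed_mb = mixed_megabytes(PARAMS, quantized_fraction=0.99)
compression_ratio = fp32_mb / mixed_mb

print(f"FP32 weights:  {fp32_mb:.1f} MB")
print(f"Mostly Int8:   {mixed_mb:.1f} MB")
print(f"Compression:   {compression_ratio:.2f}x")
```

With exactly 99% of the weights quantized the ratio lands slightly below 4, and it approaches 4 as the quantized fraction approaches 100%.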

The method presented in this paper is not exclusive to the BERT model and can be integrated into other large pre-trained Transformer-based models.

Method

In this section, we describe the quantization scheme (symmetric linear quantization) and the quantization-aware training method we used. We chose this quantization scheme because, in addition to reducing the model size by approximately $4\times$, it also makes it possible to accelerate inference by computing GEMM with Integer arithmetic on hardware specialized for Integer and Fixed Point calculations. For example, Bhandare et al. stated that using Intel® Xeon® Cascade Lake’s Vectorized Neural Network Instructions (VNNI) to perform Int8 matrix multiplication provides a speed-up of $3.7\times$ over FP32 matrix multiplication. Moreover, by using symmetric linear quantization we simplify the quantization process and zero out terms related to the offset part of the quantized values. Our method is based on the method proposed by Jacob et al.

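We use symmetric linear quantization as our quantization scheme for quantizing both weights and activations to 8bit Integers (Int8):

$$\mathrm{Quantize}(x \mid S^{x}, M) := \mathrm{Clamp}(\lfloor x \times S^{x} \rceil, -M, M), \qquad \mathrm{Clamp}(x, a, b) = \min(\max(x, a), b) \tag{1}$$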

where $S^{x}$ is the quantization scaling-factor for input $x$ and $M$ is the highest quantized value when quantizing to $b$ bits:

$$M = 2^{b-1} - 1 \tag{2}$$

For example, when quantizing to 8 bits, $M=127$. The scaling-factor can be determined dynamically during inference, calculated using statistics collected during training, or calculated post-training using statistics collected during inference on a calibration set. In our work the weights’ scaling-factor is calculated according to:

$$S^{W} = \frac{M}{\max(|W|)} \tag{3}$$

and the activations’ scaling-factor is calculated based on values seen during training using an Exponential Moving Average (EMA):

$$S^{x} = \frac{M}{\mathrm{EMA}(\max(|x|))} \tag{4}$$
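The scheme can be illustrated with a minimal pure-Python sketch. It assumes that quantization multiplies by the scaling-factor, rounds to the nearest integer and clamps to $[-M, M]$, that the weights' scaling-factor is $M$ divided by the largest absolute weight, and that the activations' scaling-factor is $M$ divided by an EMA of the per-batch maximum absolute value. The function names, the class name and the decay value of 0.9 are illustrative choices, not taken from the released code.

```python
from typing import List, Optional, Sequence


def max_quantized_value(bits: int) -> int:
    """Highest quantized value M for a signed b-bit integer (127 for 8 bits)."""
    return 2 ** (bits - 1) - 1


def quantize(values: Sequence[float], scale: float, m: int) -> List[int]:
    """Scale, round to nearest, then clamp to [-m, m].

    Python's round() resolves exact ties to the nearest even integer.
    """
    return [int(min(max(round(v * scale), -m), m)) for v in values]


def weight_scale(weights: Sequence[float], m: int) -> float:
    """Scaling-factor that maps the largest absolute weight to m."""
    return m / max(abs(w) for w in weights)


class EmaActivationScale:
    """Tracks an EMA of max(|x|) over training batches (decay is hypothetical)."""

    def __init__(self, m: int, decay: float = 0.9) -> None:
        self.m = m
        self.decay = decay
        self.ema_max: Optional[float] = None

    def update(self, activations: Sequence[float]) -> None:
        batch_max = max(abs(a) for a in activations)
        if self.ema_max is None:
            self.ema_max = batch_max  # initialise with the first batch
        else:
            self.ema_max = self.decay * self.ema_max + (1 - self.decay) * batch_max

    def scale(self) -> float:
        if self.ema_max is None:
            raise ValueError("update() must be called before scale()")
        return self.m / self.ema_max


M = max_quantized_value(8)
weights = [0.5, -1.0, 0.25]
quantized_weights = quantize(weights, weight_scale(weights, M), M)

tracker = EmaActivationScale(M)
tracker.update([1.0, -2.0])  # first batch: EMA starts at 2.0
tracker.update([4.0])        # second batch pulls the EMA towards 4.0
activation_scale = tracker.scale()
```

Because the EMA changes slowly, a single batch with an unusually large activation moves the scaling-factor only slightly, which is one plausible motivation for using a running statistic instead of a per-batch maximum.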

Quantization-Aware Training

Quantization-aware training is a method of training Neural Networks (NN) to be quantized at the inference stage, as opposed to post-training quantization, where training is executed without any adaptation to the quantization process. In our work, we use fake quantization to introduce the quantization error to the model during the training phase so that the model learns to bridge the quantization error gap. Fake quantization is an operation that simulates the rounding effect in Floating Point values, as presented by Jacob et al. Since the rounding operation is not differentiable, we use the Straight-Through Estimator (STE) to estimate the gradient of fake quantization:

$$\frac{\partial x^{q}}{\partial x} = \vec{1} \tag{5}$$

where $x^{q}$ denotes the result of fake quantizing $x$. Using the combination of fake quantization and STE, we are able to perform quantized inference during training while back-propagating at full precision, which allows the FP32 weights to overcome the quantization error.
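A toy sketch of this training loop is given below for a single scalar weight. It assumes fake quantization means quantizing and immediately de-quantizing in floating point, and that the STE passes the upstream gradient through unchanged. The fixed scaling-factor, learning rate, target value and step count are hypothetical values chosen for illustration only.

```python
M = 127            # highest Int8 value
SCALE = M / 1.0    # hypothetical fixed scaling-factor for weights in [-1, 1]


def fake_quantize(x: float, scale: float = SCALE, m: int = M) -> float:
    """Quantize then de-quantize, so the result is FP but lies on the Int8 grid."""
    q = min(max(round(x * scale), -m), m)
    return q / scale


def ste_backward(upstream_grad: float) -> float:
    """Straight-Through Estimator: treat d(fake_quantize)/dx as 1."""
    return upstream_grad


def train(target: float, w: float, steps: int = 200, lr: float = 0.1) -> float:
    """Fit y = w * x to y = target * x (with x = 1) using a fake-quantized weight."""
    x = 1.0
    for _ in range(steps):
        w_q = fake_quantize(w)                    # forward pass sees the quantized weight
        error = w_q * x - target * x
        grad_w_q = 2.0 * error * x                # gradient of squared error w.r.t. w_q
        w -= lr * ste_backward(grad_w_q)          # update is applied to the FP32 weight
    return w


fp32_weight = train(target=0.3, w=0.9)
deployed_weight = fake_quantize(fp32_weight)
```

The full-precision weight keeps accumulating small updates even when its quantized value does not change, so it settles near a point whose quantized value is within one quantization step of the target.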

Implementation

Our goal is to quantize all the Embedding and FC layers in BERT to Int8 using the method described in Section 2. For this purpose we implemented quantized versions of the Embedding and FC layers. During training, the Embedding layer returns fake quantized embedding vectors, and the quantized FC layer performs GEMM between the fake quantized input and the fake quantized weight, and then adds the bias to the accumulated products. The bias is left untouched during training, since it is quantized to Int32 only later. During inference, the quantized Embedding layer returns Int8 embedding vectors, and the quantized FC layer performs GEMM between the Int8 inputs and the Int8 weights, accumulating the products onto the Int32 bias, which is quantized using the weights’ and activations’ scaling-factors. Although the bias vectors are quantized to Int32 values, they make up only a small fraction of the parameters in the model.
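The inference path of the quantized FC layer can be sketched in pure Python as follows. This is an illustration, not the released implementation: it assumes the bias is quantized with the product of the activations' and weights' scaling-factors, so that it lives on the same scale as the accumulated Int8 products, and the activation scaling-factor below is a placeholder for one learned during training.

```python
from typing import List, Sequence

M = 127  # highest Int8 value


def quantize(values: Sequence[float], scale: float, m: int) -> List[int]:
    """Scale, round to nearest, clamp to [-m, m]."""
    return [int(min(max(round(v * scale), -m), m)) for v in values]


def quantized_fc(x_q: Sequence[int], w_q: Sequence[Sequence[int]],
                 bias_q: Sequence[int]) -> List[int]:
    """Integer GEMM: Int8 x Int8 products accumulated onto the Int32 bias.

    Python integers are unbounded; a real kernel would accumulate in Int32.
    """
    return [sum(xi * wi for xi, wi in zip(x_q, row)) + b
            for row, b in zip(w_q, bias_q)]


def dequantize(acc: Sequence[int], scale_x: float, scale_w: float) -> List[float]:
    """Map the integer accumulator back to floating point."""
    return [a / (scale_x * scale_w) for a in acc]


# Toy layer with 2 inputs and 2 outputs (all values are illustrative).
x = [0.5, -1.0]
w = [[0.2, 0.4], [-0.6, 0.1]]
bias = [0.1, -0.2]

scale_x = M / 1.0                                       # placeholder for the EMA-based scale
scale_w = M / max(abs(v) for row in w for v in row)     # weights' scaling-factor

x_q = quantize(x, scale_x, M)
w_q = [quantize(row, scale_w, M) for row in w]
int32_max = 2 ** 31 - 1
bias_q = quantize(bias, scale_x * scale_w, int32_max)   # bias shares the accumulator scale

y_int8_path = dequantize(quantized_fc(x_q, w_q, bias_q), scale_x, scale_w)
y_fp32 = [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(w, bias)]
```

Comparing `y_int8_path` with `y_fp32` shows the quantization error of a single layer; quantizing the bias with the combined scale is what allows it to be added to the accumulator without any floating-point work inside the GEMM.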

Our implementation of Quantized BERT is based on the BERT implementation provided by the PyTorch-Transformers library (https://github.com/huggingface/pytorch-transformers). To implement quantized BERT we replaced all the Embedding and FC layers in BERT with the quantized Embedding and FC layers we had implemented. Operations that require higher precision, such as Softmax, Layer Normalization and GELU, are kept in FP32.

Evaluation

To test our approach we evaluated our model on the GLUE (General Language Understanding Evaluation) benchmark, which is a collection of resources for training, evaluating, and analyzing natural language understanding systems in a wide array of NLP tasks. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems. In addition, we evaluated our model on the question answering task SQuADv1.1. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

We summarized our results for quantized BERT in the QAT column in Table 1. We ran each experiment five times and reported the average result and standard deviation. In addition, we calculated the relative error induced by the quantization process and summarized the results in Table 2. In all experiments we used BERT-Base as the base model unless indicated otherwise, and we fine-tuned the pre-trained models offered by TensorFlow Hub (https://www.tensorflow.org/hub). In our internal testing, we found that the relative error induced by quantization is less than $1\%$ (excluding the RTE task) while the model size is reduced by $4\times$.
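The reporting procedure can be sketched as below. The scores are placeholders invented for the example and are not results from this paper. We also assume that relative error means the drop from the baseline score expressed as a percentage of the baseline, and that the standard deviation is the sample standard deviation; the text does not specify either.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Average and sample standard deviation over repeated runs."""
    return mean(scores), stdev(scores)


def relative_error_pct(baseline: float, quantized: float) -> float:
    """Drop from the baseline score, as a percentage of the baseline (assumed definition)."""
    return (baseline - quantized) / baseline * 100.0


# Placeholder scores for five runs of one task; NOT values from the paper.
baseline_runs = [90.2, 90.0, 89.8, 90.1, 89.9]
quantized_runs = [89.5, 89.3, 89.1, 89.4, 89.2]

baseline_mean, baseline_std = summarize(baseline_runs)
quantized_mean, quantized_std = summarize(quantized_runs)
error_pct = relative_error_pct(baseline_mean, quantized_mean)

print(f"baseline:  {baseline_mean:.2f} +/- {baseline_std:.2f}")
print(f"quantized: {quantized_mean:.2f} +/- {quantized_std:.2f}")
print(f"relative error: {error_pct:.2f}%")
```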

To measure the necessity of quantization-aware training, we compared our results to post-training quantized models. We quantized our baseline models using Dynamic Quantization (DQ). The weights and activations are quantized as described in Section 2.1, with a small difference in the way we calculate the quantization scaling-factor of the activations. Instead of using Equation 4, we compute the activations’ scaling-factor the same way we compute the weights’ scaling-factor, using Equation 3. This calculation is done during inference for each incoming activation tensor. The results for this quantization method are summarized in the DQ column in Table 1, and the relative error induced by quantization is also summarized in Table 2. We observe that the DQ method produces significantly worse results across all tasks.
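The difference from the training-time scheme can be sketched as follows: in dynamic quantization the activations' scaling-factor is recomputed from each incoming tensor instead of being read from a statistic collected during training. The function name and the guard for an all-zero tensor are illustrative additions of ours.

```python
from typing import List, Sequence, Tuple

M = 127  # highest Int8 value


def dynamic_quantize(activations: Sequence[float], m: int = M) -> Tuple[List[int], float]:
    """Quantize one activation tensor with a scale derived from that tensor alone."""
    max_abs = max(abs(a) for a in activations)
    scale = m / max_abs if max_abs > 0 else 1.0  # guard against an all-zero tensor
    quantized = [int(min(max(round(a * scale), -m), m)) for a in activations]
    return quantized, scale


# Two tensors with different ranges receive different scaling-factors.
small_q, small_scale = dynamic_quantize([0.1, -0.2, 0.05])
large_q, large_scale = dynamic_quantize([0.5, -2.0, 1.0])
```

Each tensor uses the full Int8 range, but the scaling-factor changes from input to input and must be computed at inference time, and the model has never been trained to compensate for the resulting quantization error.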

Related Work

Compressing Transformer-based models for efficient inference is an active field of research. Junczys-Dowmunt et al. applied knowledge distillation and 8bit post-training quantization to speed up Transformer models for neural machine translation (Transformer-LT). However, the quantized model suffered a loss of 1 BLEU point in comparison to the baseline model. Bhandare et al. also applied 8bit post-training quantization to Transformer-LT models and demonstrated how to utilize Intel® specialized 8bit hardware to accelerate the inference process. Habana Labs (https://habana.ai/habana-labs-goya-delivers-inferencing-on-bert/) published Quantized BERT performance measurements on its in-house accelerator for NN inference. However, Habana quantized BERT to 16bit Integer, which offers a much wider quantization range and only $2\times$ compression. NVIDIA (https://devblogs.nvidia.com/nlu-with-tensorrt-bert/) also measured BERT performance on its in-house accelerator using 16bit Floating Point arithmetic. Furthermore, NVIDIA implemented a number of optimized kernels for BERT’s operations in order to save memory bandwidth during inference. Sucik fine-tuned BERT on a custom dataset and performed 8bit Integer post-training quantization.

Conclusions and Future Work

We have shown a method for quantizing BERT GEMM operations to 8bit for a variety of NLP tasks with minimum loss in accuracy, and we hope that the software developer community can use our quantization method to compress BERT and implement efficient BERT inference with 8bit GEMM operations. Efficient inference will enable low-latency NLP applications on a variety of hardware platforms, from edge devices to data centers. Decreasing BERT’s memory footprint will accelerate BERT inference and reduce power consumption, both of which are critical for deploying BERT in production environments with limited memory and power resources. In the future we intend to apply other model compression methods to BERT and to integrate them with our quantized BERT model.