BinaryBERT: Pushing the Limit of BERT Quantization

Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King

Introduction

Recent pre-trained language models have achieved remarkable performance improvement in various natural language tasks (Vaswani et al., 2017; Devlin et al., 2019). However, the improvement generally comes at the cost of increasing model size and computation, which limits the deployment of these huge pre-trained language models to edge devices. Various methods have been recently proposed to compress these models, such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020), pruning (Michel et al., 2019; Fan et al., 2019), low-rank approximation (Ma et al., 2019; Lan et al., 2020), weight-sharing (Dehghani et al., 2019; Lan et al., 2020; Huang et al., 2021), dynamic networks with adaptive depth and/or width (Hou et al., 2020; Xin et al., 2020; Zhou et al., 2020), and quantization (Zafrir et al., 2019; Shen et al., 2020; Fan et al., 2020; Zhang et al., 2020).

Among all these model compression approaches, quantization is a popular solution as it does not require designing a smaller model architecture. Instead, it compresses the model by replacing each 32-bit floating-point parameter with a low-bit fixed-point representation. Existing attempts quantize pre-trained models (Zafrir et al., 2019; Shen et al., 2020; Fan et al., 2020) down to ternary values (2-bit) with only a minor performance drop (Zhang et al., 2020). However, none of them achieves binarization (1-bit). As the limit of quantization, weight binarization could bring at most a $32\times$ reduction in model size and replace most floating-point multiplications with additions. Moreover, quantizing activations to 8-bit or 4-bit further replaces floating-point addition with int8 and int4 addition, decreasing the energy burden and the area usage on chips (Courbariaux et al., 2015).

In this paper, we explore binarizing BERT parameters with quantized activations, pushing BERT quantization to the limit. We find that directly training a binary network is rather challenging. According to Figure 1, there is a sharp performance drop when reducing the weight bit-width from 2-bit to 1-bit, compared to other bit configurations. To explore the challenges of binarization, we analyze the loss landscapes of models under different precisions both qualitatively and quantitatively. We find that while the full-precision and ternary (2-bit) models enjoy relatively flat and smooth loss surfaces, the binary model suffers from a rather steep and complex landscape, which poses great challenges to the optimization.

Motivated by the above empirical observations, we propose ternary weight splitting, which takes the ternary model as a proxy to bridge the gap between the binary and full-precision models. Specifically, ternary weight splitting equivalently converts both the quantized and latent full-precision weights in a well-trained ternary model to initialize BinaryBERT. Therefore, BinaryBERT retains the good performance of the ternary model, and can be further refined on the new architecture. While neuron splitting has previously been studied for full-precision networks (Chen et al., 2016; Wu et al., 2019), our ternary weight splitting is much more complex due to the additional equivalence requirement on quantized weights. Furthermore, the proposed BinaryBERT also supports adaptive splitting: it can adaptively split the most important ternary modules while leaving the rest binary, based on efficiency constraints such as model size or floating-point operations (FLOPs). Therefore, our approach allows flexible sizes of binary models to meet the demands of various edge devices.

Empirical results show that BinaryBERT split from a half-width ternary network is much better than a directly-trained binary model with the original width. On the GLUE and SQuAD benchmarks, our BinaryBERT has only a slight performance drop compared to the full-precision BERT-base model, while being $\mathbf{24\times}$ smaller. Moreover, BinaryBERT with the proposed importance-based adaptive splitting also outperforms other splitting criteria across a variety of model sizes.

Difficulty in Training Binary BERT

In this section, we show that it is challenging to directly train a binary BERT with conventional binarization approaches. Before diving into details, we first review the necessary background.

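For weight ternarization, TWN (Li et al., 2016) quantizes the latent full-precision weights $\mathbf{w}^{t}\in\mathbb{R}^{n}$ element-wise as

$$\hat{w}^{t}_{i}=\mathcal{Q}(w^{t}_{i})=\begin{cases}\alpha\cdot\mathrm{sign}(w^{t}_{i}), & |w^{t}_{i}|\geq\Delta,\\ 0, & |w^{t}_{i}|<\Delta,\end{cases} \tag{1}$$

where $\mathrm{sign}(\cdot)$ is the sign function, $\Delta=\frac{0.7}{n}\|\mathbf{w}^{t}\|_{1}$ and $\alpha=\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}|w_{i}^{t}|$ with $\mathcal{I}=\{i\ |\ \hat{w}^{t}_{i}\neq 0\}$.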

Binarization was first proposed by Courbariaux et al. (2015) and has been extensively studied since (Rastegari et al., 2016; Hubara et al., 2016; Liu et al., 2018). As a representative work, Binary-Weight-Network (BWN) (Hubara et al., 2016) binarizes the latent weights $\mathbf{w}^{b}$ element-wise with a scaling parameter $\alpha$ as follows:

$$\hat{w}^{b}_{i}=\mathcal{Q}(w^{b}_{i})=\alpha\cdot\mathrm{sign}(w^{b}_{i}),\quad \alpha=\frac{1}{n}\|\mathbf{w}^{b}\|_{1}. \tag{2}$$

Despite the appealing properties of network binarization, we show that it is non-trivial to obtain a binary BERT with these binarization approaches.
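To make the two quantizers concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the TWN threshold keeps entries with $|w_i|\geq\Delta$ and that BWN uses the mean absolute value as its scaling factor $\alpha$ (the usual BWN choice); the function names `ternarize_twn` and `binarize_bwn` are hypothetical.

```python
import numpy as np

def ternarize_twn(w):
    """TWN-style ternarization: every entry becomes -alpha, 0 or +alpha."""
    w = np.asarray(w, dtype=np.float64)
    delta = 0.7 / w.size * np.abs(w).sum()   # threshold Delta = 0.7/n * ||w||_1
    nonzero = np.abs(w) >= delta             # index set I of surviving entries
    alpha = np.abs(w[nonzero]).mean()        # mean magnitude over I
    return alpha * np.sign(w) * nonzero

def binarize_bwn(w):
    """BWN-style binarization: every entry becomes -alpha or +alpha."""
    w = np.asarray(w, dtype=np.float64)
    alpha = np.abs(w).mean()                 # assumed scaling: ||w||_1 / n
    return alpha * np.sign(w)

w = np.array([0.9, -0.1, 0.05, -0.8, 0.3])
w_ternary = ternarize_twn(w)   # small entries are zeroed, large ones share one scale
w_binary = binarize_bwn(w)     # no zero level: small entries are pushed to +/- alpha
```

In this toy vector the three small entries are zeroed by ternarization but forced to $\pm\alpha$ by binarization, which illustrates why the 1-bit quantizer perturbs the weights more.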

2.1 Sharp Performance Drop with Weight Binarization

To study the performance drop of BERT quantization, we train the BERT model with full-precision, {8,4,3,2,1}-bit weight quantization and 8-bit activations on MRPC and MNLI-m from the GLUE benchmark (Wang et al., 2018). (We conduct more experiments on other GLUE datasets and with different settings in Appendix C.1, and find empirical results similar to those on MRPC and MNLI-m here.) We use loss-aware weight quantization (LAQ) (Hou and Kwok, 2018) for 8/4/3-bit weight quantization, TWN (Li et al., 2016) for weight ternarization and BWN (Hubara et al., 2016) for weight binarization. Meanwhile, we adopt 8-bit uniform quantization for activations. We follow the default experimental settings detailed in Section 4.1 and Appendix C.1.

From Figure 1, the performance drops mildly from 32-bit to as low as 2-bit, i.e., around $0.6\%\downarrow$ on MRPC and $0.2\%\downarrow$ on MNLI-m. However, when reducing the bit-width to one, the performance drops sharply, i.e., $\sim 3.8\%\downarrow$ and $\sim 0.9\%\downarrow$ on the two tasks, respectively. Therefore, weight binarization may severely harm the performance, which may explain why most current approaches stop at 2-bit weight quantization (Shen et al., 2020; Zadeh and Moshovos, 2020; Zhang et al., 2020). To further push weight quantization to the limit, a first step is to study the potential reasons behind the sharp drop from ternarization to binarization.

2.2 Exploring the Quantized Loss Landscape

To learn about the challenges behind binarization, we first visually compare the loss landscapes of full-precision, ternary, and binary BERT models. Following Nahshan et al. (2019), we extract parameters $\mathbf{w}_{x},\mathbf{w}_{y}$ from the value layers of multi-head attention in the first two Transformer layers (we also extract parameters from other parts of the Transformer in Appendix C.2, and the observations are similar), and assign the following perturbations on parameters:

$$\tilde{\mathbf{w}}_{x}=\mathbf{w}_{x}+x\cdot\boldsymbol{1}_{x},\quad \tilde{\mathbf{w}}_{y}=\mathbf{w}_{y}+y\cdot\boldsymbol{1}_{y}, \tag{3}$$

where $x\in\{\pm 0.2\bar{w}_{x},\pm 0.4\bar{w}_{x},\ldots,\pm 1.0\bar{w}_{x}\}$ are perturbation magnitudes based on the absolute mean value $\bar{w}_{x}$ of $\mathbf{w}_{x}$, and similar rules hold for $y$. $\boldsymbol{1}_{x}$ and $\boldsymbol{1}_{y}$ are vectors with all elements being 1. For each pair $(x,y)$, we evaluate the corresponding training loss and plot the surface in Figure 2.
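The perturbation grid can be sketched as follows. This is an illustrative sketch only: `loss_fn` stands in for evaluating the training loss of a real model, the quadratic `toy_loss` is a placeholder, and including the unperturbed point (offset 0) in the grid is an assumption made here so that the surface has a centre.

```python
import numpy as np

def loss_surface(loss_fn, w_x, w_y, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Evaluate loss_fn on a grid of constant offsets added to w_x and w_y."""
    fr = np.array(fractions)
    scales = np.concatenate([-fr[::-1], [0.0], fr])   # -1.0, ..., 0, ..., +1.0
    xs = scales * np.abs(w_x).mean()                  # offsets in units of mean |w_x|
    ys = scales * np.abs(w_y).mean()
    surface = np.empty((len(xs), len(ys)))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            # Adding a scalar equals adding x * 1_x (an all-ones vector scaled by x).
            surface[i, j] = loss_fn(w_x + x, w_y + y)
    return xs, ys, surface

def toy_loss(w_x, w_y):
    """Placeholder loss; a real study would run the model on training data."""
    return float((w_x ** 2).sum() + (w_y ** 2).sum())

rng = np.random.default_rng(0)
w_x, w_y = rng.normal(size=16), rng.normal(size=16)
xs, ys, surface = loss_surface(toy_loss, w_x, w_y)
```

The resulting `surface` array can be passed to any 3-D plotting routine; repeating the procedure for full-precision, ternary and binary weights gives the three landscapes being compared.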

As can be seen, the full-precision model (Figure 2(a)) has the lowest overall training loss, and its loss landscape is flat and robust to the perturbation. For the ternary model (Figure 2(b)), although the surface tilts up with larger perturbations, it looks locally convex and is thus easy to optimize. This may also explain why the BERT model can be ternarized without a severe accuracy drop (Zhang et al., 2020). However, the loss landscape of the binary model (Figure 2(c)) turns out to be both higher and more complex. When the three landscapes are stacked together (Figure 2(d)), the loss surface of the binary BERT sits on top with a clear margin over the other two. The steep curvature of the loss surface reflects a higher sensitivity to binarization, which contributes to the training difficulty.

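To quantitatively measure the steepness of the loss landscape, we start from a local minimum $\mathbf{w}$ and apply the second-order approximation to the curvature. According to Taylor's expansion, the loss increase induced by quantizing $\mathbf{w}$ can be approximately upper bounded by

$$\ell(\hat{\mathbf{w}})-\ell(\mathbf{w})\approx\tfrac{1}{2}\boldsymbol{\epsilon}^{\top}\mathbf{H}\boldsymbol{\epsilon}\leq\tfrac{1}{2}\lambda_{\max}\|\boldsymbol{\epsilon}\|^{2}, \tag{4}$$

where $\boldsymbol{\epsilon}=\hat{\mathbf{w}}-\mathbf{w}$ is the quantization noise and $\lambda_{\max}$ is the largest (top-1) eigenvalue of the Hessian $\mathbf{H}$ at $\mathbf{w}$; the first-order term vanishes because the gradient is zero at a local minimum. We therefore take $\lambda_{\max}$ as the measure of steepness, and compute it separately for the Transformer parts MHA-Q/K, MHA-V, MHA-O, FFN-Mid and FFN-Out.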

From Figure 3, the top-1 eigenvalues of the binary model are higher in both mean and standard deviation than those of the full-precision baseline and the ternary model. For instance, the top-1 eigenvalues of MHA-O in the binary model are $\sim 15\times$ larger than those of the full-precision counterpart. Therefore, the quantization loss increases of the full-precision and ternary models are more tightly bounded than that of the binary model in Equation (4). The highly complex and irregular landscape induced by binarization thus poses more challenges to the optimization.
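A top-1 Hessian eigenvalue is commonly estimated without forming the Hessian, using power iteration on Hessian-vector products. The sketch below assumes that estimator (the text here does not say which one was used) and replaces the real Hessian-vector product with a small explicit matrix; `top_eigenvalue` and `hvp` are hypothetical names.

```python
import numpy as np

def top_eigenvalue(hvp, dim, iters=100, seed=0):
    """Power iteration: hvp(v) must return the Hessian-vector product H @ v."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        hv = hvp(v)
        eig = float(v @ hv)            # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:                # v lies in the null space; nothing to refine
            break
        v = hv / norm
    return eig

# Toy stand-in for a Hessian at a local minimum (positive semi-definite).
H = np.diag([1.0, 2.0, 5.0])
lam_max = top_eigenvalue(lambda v: H @ v, dim=3)
```

Power iteration converges to the eigenvalue of largest magnitude, which coincides with the largest eigenvalue when the Hessian is positive semi-definite, as it is at a local minimum. For a real network, `hvp` would be implemented with automatic differentiation on one group of weights at a time.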

Proposed Method

Given the challenging loss landscape of binary BERT, we propose ternary weight splitting (TWS), which exploits the flatness of the ternary loss landscape as an optimization proxy for the binary model. As shown in Figure 4, we first train the half-sized ternary BERT to convergence, and then split both the latent full-precision weight $\mathbf{w}^{t}$ and the quantized weight $\hat{\mathbf{w}}^{t}$ into their binary counterparts $\mathbf{w}_{1}^{b},\mathbf{w}_{2}^{b}$ and $\hat{\mathbf{w}}_{1}^{b},\hat{\mathbf{w}}_{2}^{b}$ via the TWS operator. To inherit the performance of the ternary model after splitting, the TWS operator requires splitting equivalency (i.e., the same output given the same input):

$$\mathbf{w}^{t}=\mathbf{w}_{1}^{b}+\mathbf{w}_{2}^{b},\quad \hat{\mathbf{w}}^{t}=\hat{\mathbf{w}}_{1}^{b}+\hat{\mathbf{w}}_{2}^{b}. \tag{5}$$

While the solution to Equation (5) is not unique, we constrain the latent full-precision weights after splitting, $\mathbf{w}^{b}_{1},\mathbf{w}^{b}_{2}$, to satisfy $\mathbf{w}^{t}=\mathbf{w}^{b}_{1}+\mathbf{w}^{b}_{2}$ as

where $a$ and $b$ are the variables to solve. By Equations (9) and (13) with $\hat{\mathbf{w}}^{t}=\hat{\mathbf{w}}^{b}_{1}+\hat{\mathbf{w}}^{b}_{2}$, we get

where we denote $\mathcal{I}=\{i\ |\ \hat{w}^{t}_{i}\neq 0\}$, $\mathcal{J}=\{j\ |\ \hat{w}^{t}_{j}=0\ \textrm{and}\ w^{t}_{j}>0\}$ and $\mathcal{K}=\{k\ |\ \hat{w}^{t}_{k}=0\ \textrm{and}\ w^{t}_{k}<0\}$. $|\cdot|$ denotes the cardinality of a set. The detailed derivation of Equation (3.1) is in Appendix A.
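The splitting can be checked numerically. The sketch below is not the authors' code: it assumes one piecewise construction that satisfies $\mathbf{w}^{t}=\mathbf{w}^{b}_{1}+\mathbf{w}^{b}_{2}$, namely $(a\,w_i,\ (1-a)\,w_i)$ on $\mathcal{I}$, $(b+w_j,\ -b)$ on $\mathcal{J}$ and $(b,\ -b+w_k)$ on $\mathcal{K}$, together with TWN and BWN quantizers whose BWN scale is the mean absolute value. The closed forms for `a` and `b` in the code are derived under exactly these assumptions, and the quantizers are restated so that the block is self-contained.

```python
import numpy as np

def ternarize(w):
    """TWN quantizer (threshold 0.7/n * ||w||_1, scale = mean |w| over kept entries)."""
    delta = 0.7 / w.size * np.abs(w).sum()
    keep = np.abs(w) >= delta
    return np.abs(w[keep]).mean() * np.sign(w) * keep

def binarize(w):
    """BWN quantizer with assumed scale alpha = ||w||_1 / n."""
    return np.abs(w).mean() * np.sign(w)

def ternary_weight_split(w_t):
    """Split latent ternary weights into two latent binary weight vectors.

    Assumes the sets J and K are not both empty and that 0 < a < 1, b > 0.
    """
    w_hat = ternarize(w_t)
    I = w_hat != 0                      # quantized to +/- alpha
    J = (w_hat == 0) & (w_t > 0)        # quantized to 0, positive latent weight
    K = (w_hat == 0) & ~J               # quantized to 0, remaining entries
    abs_w = np.abs(w_t)
    s_I, s_J, s_K = abs_w[I].sum(), abs_w[J].sum(), abs_w[K].sum()
    # alpha_1 == alpha_2 makes the two halves cancel wherever w_hat is zero.
    a = (s_I + s_K - s_J) / (2.0 * s_I)
    # alpha_1 + alpha_2 == alpha_t makes the halves add up wherever w_hat is nonzero.
    b = (w_t.size / I.sum() * s_I - abs_w.sum()) / (2.0 * (J.sum() + K.sum()))
    w1 = np.where(I, a * w_t, np.where(J, b + w_t, b))
    w2 = np.where(I, (1.0 - a) * w_t, np.where(J, -b, -b + w_t))
    return w1, w2, a, b

rng = np.random.default_rng(0)
w_t = rng.normal(size=1000)
w1, w2, a, b = ternary_weight_split(w_t)
latent_ok = np.allclose(w1 + w2, w_t)
quantized_ok = np.allclose(binarize(w1) + binarize(w2), ternarize(w_t))
```

Both checks hold for this random vector, so the two binary halves reproduce the ternary model exactly at initialization, for the latent as well as the quantized weights.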

Following (Zhang et al., 2020), for each weight matrix in the Transformer layers, we use layer-wise ternarization (i.e., one scaling parameter for all elements in the weight matrix). For word embedding, we use row-wise ternarization (i.e., one scaling parameter for each row in the embedding). After splitting, each of the two split matrices has its own scaling factor.

Aside from weight binarization, we simultaneously quantize activations before all matrix multiplications, which could accelerate inference on specialized hardware (Shen et al., 2020; Zafrir et al., 2019). Following Zafrir et al. (2019) and Zhang et al. (2020), we skip quantization for all layer-normalization (LN) layers, skip connections, and biases, as their computation is negligible compared to matrix multiplication. The last classification layer is also not quantized to avoid a large accuracy drop.

We then conduct prediction-layer distillation by minimizing the soft cross-entropy (SCE) between the quantized student logits $\hat{\mathbf{y}}$ and the teacher logits $\mathbf{y}$, i.e.,

$$\ell_{pred}=\mathrm{SCE}(\hat{\mathbf{y}},\mathbf{y}).$$
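A soft cross-entropy between two sets of logits can be sketched as below. The averaging over the batch and the absence of a temperature are assumptions of this sketch, and `soft_cross_entropy` is a hypothetical name rather than a function from the released code.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of student predictions against the teacher's soft labels."""
    teacher_probs = np.exp(log_softmax(teacher_logits))
    per_example = -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)
    return float(per_example.mean())   # assumed reduction: mean over the batch

teacher = np.array([[2.0, 0.0, -1.0]])
uniform_student = np.zeros((1, 3))
loss_uniform = soft_cross_entropy(uniform_student, teacher)   # equals log(3)
loss_matched = soft_cross_entropy(teacher, teacher)           # teacher entropy
```

The loss is smallest when the student logits reproduce the teacher distribution, in which case it reduces to the teacher's entropy.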

After splitting from the half-sized ternary model, the binary model inherits its performance on a new architecture with full width. However, the original minimum of the ternary model may no longer be a minimum in the new loss landscape after splitting. Thus we further fine-tune with prediction-layer distillation to look for a better solution. We dub the resulting model BinaryBERT.

3.2 Adaptive Splitting

Our proposed approach also supports adaptive splitting that can flexibly adjust the width of BinaryBERT, based on the parameter sensitivity to binarization and resource constraints of edge devices.

Specifically, given the resource constraints $\mathcal{C}$ (e.g., model size and computational FLOPs), we first train a mixed-precision model adaptively (with sensitive parts being ternary and the rest being binary), and then split ternary weights into binary ones. Therefore, adaptive splitting finally enjoys consistent arithmetic precision (1-bit) for all weight matrices, which is usually easier to deploy than the mixed-precision counterpart.

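Formally, the modules to split are selected by maximizing their total sensitivity gain subject to the remaining budget $\mathcal{C}-\mathcal{C}_{0}$ (Equation (3.2)), where $\mathcal{C}_{0}$ is the baseline efficiency of the half-sized binary network. Although this combinatorial problem is NP-hard in general, it has the form of a knapsack problem and can be solved in pseudo-polynomial time by dynamic programming.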

Experiments

In this section, we empirically verify our proposed approach on the GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016, 2018) benchmarks. We first introduce the experimental setup in Section 4.1, and then present the main experimental results on both benchmarks in Section 4.2. We compare with other state-of-the-art methods in Section 4.3, and finally provide more discussion of the proposed methods in Section 4.4. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/BinaryBERT.

The GLUE benchmark contains multiple natural language understanding tasks. We follow Devlin et al. (2019) to evaluate the performance on these tasks: Matthews correlation for CoLA, Spearman correlation for STS-B and accuracy for the remaining tasks: RTE, MRPC, SST-2, QQP, MNLI-m (matched) and MNLI-mm (mismatched). For machine reading comprehension on SQuAD, we report the EM (exact match) and F1 scores.

Aside from the task performance, we also report the model size (MB) and computational FLOPs at inference. For quantized operations, we follow Zhou et al. (2016), Liu et al. (2018) and Li et al. (2020a) to count the bit-wise operations, i.e., the multiplication between an $m$-bit number and an $n$-bit number approximately takes $mn/64$ FLOPs for a CPU with an instruction size of 64 bits.
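The counting rule is simple enough to write down directly. In the sketch below the function name and the $768\times 768$ projection (one token passing through one BERT-base-sized weight matrix) are illustrative choices, not values taken from the experiments.

```python
def quantized_mult_flops(num_mults, weight_bits, act_bits, word_bits=64):
    """Approximate FLOPs for num_mults products of an m-bit and an n-bit number."""
    return num_mults * weight_bits * act_bits / word_bits

num_mults = 768 * 768   # multiplications for one token through a 768x768 matrix
flops_w1a8 = quantized_mult_flops(num_mults, weight_bits=1, act_bits=8)
flops_w1a4 = quantized_mult_flops(num_mults, weight_bits=1, act_bits=4)
flops_w2a8 = quantized_mult_flops(num_mults, weight_bits=2, act_bits=8)
```

Under this rule, halving either bit-width halves the counted cost, so binary weights with 4-bit activations cost a quarter of ternary (2-bit) weights with 8-bit activations.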

We take DynaBERT (Hou et al., 2020) sub-networks as backbones as they offer both half-sized and full-sized models for easy comparison. We start by training a ternary model of width $0.5\times$ with the two-stage knowledge distillation introduced in Section 3.1. Then we split it into a binary model with width $1.0\times$, and perform further fine-tuning with prediction-layer distillation. Each training stage takes the same number of training epochs. Following Jiao et al. (2020), Hou et al. (2020) and Zhang et al. (2020), we adopt data augmentation with one training epoch in each stage on all GLUE tasks except for MNLI and QQP. Aside from this default setting, we also remove data augmentation and perform vanilla training with 6 epochs on these tasks. On MNLI and QQP, we train 3 epochs for each stage.

We verify our ternary weight splitting (TWS) against vanilla binary training (BWN), the latter of which doubles training epochs to match the overall training time in TWS for fair comparison. More training details are provided in Appendix B.

While BinaryBERT focuses on weight binarization, we also explore activation quantization in our implementation, which is beneficial for reducing the computation burden on specialized hardware (Hubara et al., 2016; Zhou et al., 2016; Zhang et al., 2020). Aside from the 8-bit uniform quantization used in past efforts (Zhang et al., 2020; Shen et al., 2020), we further study 4-bit activation quantization. We find that uniform quantization can hardly deal with outliers in the activation. Thus we use Learned Step-size Quantization (LSQ) (Esser et al., 2019) to directly learn the quantized values, which empirically achieves better quantization performance.
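The effect of an outlier on a low-bit quantizer can be seen in a toy example. The sketch below assumes symmetric min-max scaling for the uniform quantizer and uses the LSQ-style forward pass (scale, round, clip, rescale) with a hand-picked step size; in LSQ the step is a trained parameter, and training is not shown here. All names and the synthetic activations are hypothetical.

```python
import numpy as np

def quantize_with_step(x, step, bits):
    """Forward pass of a step-size quantizer: scale, round, clip, rescale."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x / step), q_min, q_max) * step

def minmax_step(x, bits):
    """Step size that maps the largest magnitude onto the top quantization level."""
    return np.abs(x).max() / (2 ** (bits - 1) - 1)

# 101 well-behaved activations in [-1, 1] plus a single large outlier.
acts = np.concatenate([np.linspace(-1.0, 1.0, 101), [50.0]])
inliers = slice(0, 101)

x_minmax = quantize_with_step(acts, minmax_step(acts, bits=4), bits=4)
x_small = quantize_with_step(acts, step=1.0 / 7.0, bits=4)   # stand-in for a learned step

err_minmax = np.abs(x_minmax - acts)[inliers].mean()
err_small = np.abs(x_small - acts)[inliers].mean()
```

With min-max scaling the outlier stretches the 4-bit grid so far that every ordinary activation rounds to zero, whereas the smaller step keeps the ordinary activations accurate and merely saturates the outlier.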

4.2 Experimental Results

The main results on the development set are shown in Table 1. For results without data augmentation (rows #2-5), our ternary weight splitting method outperforms BWN by a clear margin. (Note that DynaBERT only squeezes the width of the Transformer layers but not the word embedding layer, thus the split binary model has a slightly larger size than BWN.) For instance, on CoLA, ternary weight splitting achieves $6.7\%\uparrow$ and $9.6\%\uparrow$ with 8-bit and 4-bit activation quantization, respectively. While data augmentation (rows #6-9) mostly improves each entry, our approach still overtakes BWN consistently. Furthermore, 4-bit activation quantization empirically benefits more from ternary weight splitting (rows #4-5 and #8-9) than 8-bit activation quantization (rows #2-3 and #6-7), demonstrating the potential of our approach in extremely low-bit quantized models.

In Table 2, we also provide the results on the test set of GLUE benchmark. Similar to the observation in Table 1, our approach achieves consistent improvement on both 8-bit and 4-bit activation quantization compared with BWN.

4.2.2 Results on SQuAD Benchmark

The results on the development set of SQuAD v1.1 and v2.0 are shown in Table 3. Our proposed ternary weight splitting again outperforms BWN w.r.t. both EM and F1 scores on both datasets. Similar to previous observations, 4-bit activation enjoys a larger gain in performance from the splitting approach. For instance, our approach improves the EM score of 4-bit activation by $1.8\%$ and $0.6\%$ on SQuAD v1.1 and v2.0, respectively, both of which are higher than those of 8-bit activation.

4.2.3 Adaptive Splitting

The adaptive splitting in Section 3.2 supports the conversion of mixed ternary and binary precisions for more fine-grained configurations. To verify its advantages, we name our approach Maximal Gain according to Equation (3.2), and compare it with two baseline strategies: i) Random Gain, which randomly selects weight matrices to split; and ii) Minimal Gain, which splits the least important modules according to sensitivity. We report the average score over six tasks (QNLI, SST-2, CoLA, STS-B, MRPC and RTE) in Figure 5. The end-points of 9.8MB and 16.5MB are the half-sized and full-sized BinaryBERT, respectively. As can be seen, adaptive splitting generally outperforms the other two baselines under varying model sizes, indicating the effectiveness of maximizing the gain in adaptive splitting. In Appendix C.4, we provide detailed performance on the six tasks, together with the architecture visualization of adaptive splitting.

4.3 Comparison with State-of-the-arts

Now we compare our proposed approach with a variety of state-of-the-art counterparts, including Q-BERT (Shen et al., 2020), GOBO (Zadeh and Moshovos, 2020), Quant-Noise (Fan et al., 2020) and TernaryBERT (Zhang et al., 2020). Aside from quantization, we also compare with other general compression approaches such as DistilBERT (Sanh et al., 2019), LayerDrop (Fan et al., 2019), TinyBERT (Jiao et al., 2020), and ALBERT (Lan et al., 2020). The results are taken from the respective original papers. From Table 4, our proposed BinaryBERT has the smallest model size with the best performance among all quantization approaches. Compared with the full-precision model, our BinaryBERT retains competitive performance with a significant reduction in model size and computation. For example, we achieve a compression ratio of more than $\mathbf{24\times}$ compared with BERT-base, with only a $0.4\%\downarrow$ and $0.0\%/0.2\%\downarrow$ drop on MNLI-m and SQuAD v1.1, respectively.

4.4 Discussion

We now demonstrate the performance gain by refining the binary model on the new architecture. We evaluate the performance gain after splitting from a half-width ternary model (TWN$_{0.5\times}$) to the full-sized model (TWN$_{1.0\times}$) on the development set of SQuAD v1.1, MNLI-m, QNLI and MRPC. The results are shown in Table 5. As can be seen, further fine-tuning brings consistent improvement on both 8-bit and 4-bit activation.

Furthermore, we plot the training loss curves of BWN, TWN and our TWS on MRPC with data augmentation in Figures 6(a) and 6(b). Since TWS cannot inherit the previous optimizer due to the architecture change, we reset the optimizer and learning rate scheduler of BWN, TWN and TWS for a fair comparison, despite the slight increase of loss after splitting. We find that our TWS attains much lower training loss than BWN, and also surpasses TWN, verifying the advantages of fine-tuning on the wider architecture.

We also follow Li et al. (2018); Hao et al. (2019) to visualize the optimization trajectory after splitting in Figures 6(c) and 6(d). We calculate the first two principal components of parameters in the final BinaryBERT, which are the basis for the 2-D plane. The loss contour is thus obtained by evaluating each grid point in the plane. It is found that the binary models are heading towards the optimal solution for both 8/4-bit activation quantization on the loss contour.

4.4.2 Exploring More Binarization Methods

We now study whether any improved binarization variants can directly bring better performance. Aside from BWN, we compare with LAB (Hou et al., 2017) and BiReal (Liu et al., 2018). Meanwhile, we compare with gradual quantization, i.e., BWN training based on a ternary model, denoted as BWN$^{\dagger}$. Furthermore, we also try using the same scaling factor for BWN as for TWN to make the precision change smooth, dubbed BWN$^{\ddagger}$. From Table 6, we find that our TWS still outperforms various binarization approaches in most cases, suggesting the superiority of splitting in finding better minima than direct binary training.

Related Work

Network quantization has been a popular topic with vast literature in efficient deep learning. Below we give a brief overview for three research strands: network binarization, mixed-precision quantization and neuron splitting, all of which are related to our proposed approach.

Network binarization achieves remarkable size reduction and is widely explored in computer vision. Existing binarization approaches can be categorized into quantization error minimization (Rastegari et al., 2016; Hou et al., 2017; Zhang et al., 2018), improving training objectives (Martinez et al., 2020; Bai et al., 2020) and reduction of gradient mismatch (Bai et al., 2018; Liu et al., 2018, 2020). Despite the empirical success of these approaches in computer vision, there is little exploration of binarization in natural language processing tasks. Previous works on BERT quantization (Zafrir et al., 2019; Shen et al., 2020; Zhang et al., 2020) push the bit-width down to as low as two, but none of them achieves binarization. In contrast, our work serves as the first attempt to binarize pre-trained language models.

5.2 Mixed-precision Quantization

Given the observation that neural network layers exhibit different sensitivity to quantization (Dong et al., 2019; Wang et al., 2019), mixed-precision quantization re-allocates the layer-wise quantization bit-width for a higher compression ratio. Inspired by neural architecture search (Liu et al., 2019; Wang et al., 2020), common approaches to mixed-precision quantization are primarily based on differentiable search (Wu et al., 2018a; Li et al., 2020b), reinforcement learning (Wu et al., 2018b; Wang et al., 2019), or simply loss curvatures (Dong et al., 2019; Shen et al., 2020). While mixed-precision quantized models usually demonstrate better performance than traditional methods under the same compression ratio, they are also harder to deploy (Habi et al., 2020). In contrast, BinaryBERT with adaptive splitting enjoys both the good performance of mixing ternary and binary values, and the easy deployment given the consistent arithmetic precision.

There are also works on binary neural architecture search (Kim et al., 2020; Bulat et al., 2020) which have a similar purpose to mixed-precision quantization. Nonetheless, such methods are usually time-consuming to train and are prohibitive for large pre-trained language models.

5.3 Neuron Splitting

Neuron splitting was originally proposed to accelerate network training by progressively increasing the width of a network (Chen et al., 2016; Wu et al., 2019). The split network equivalently inherits the knowledge of its predecessor and is trained for further improvement. Recently, neuron splitting has also been studied in quantization (Zhao et al., 2019; Kim et al., 2019). By splitting neurons with large magnitudes, the full-precision outliers are removed and thus the quantization error can be effectively reduced (Zhao et al., 2019). Kim et al. (2019) apply neuron splitting to decompose a ternary activation into two binary activations based on bias shifting of the batch normalization layer. However, such a method cannot be applied to BERT as there is no batch normalization layer. Besides, weight splitting is much more complex due to the equivalence constraint on both the quantized and latent full-precision weights.

Conclusion

In this paper, we propose BinaryBERT, pushing BERT quantization to the limit. Because of the steep and complex loss landscape, we find that directly training a binary BERT is hard and leads to a large performance drop. We thus propose ternary weight splitting, which splits a trained ternary BERT to initialize BinaryBERT, followed by fine-tuning for further refinement. Our approach also supports adaptive splitting that can tailor the size of BinaryBERT to the constraints of edge devices. Empirical results show that our approach significantly outperforms vanilla binary training, achieving state-of-the-art performance on BERT compression.

Acknowledgement

This work was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204), and Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14210717 of the General Research Fund). We sincerely thank all anonymous reviewers for their insightful suggestions.

Appendix A Derivation of Equation (3.1)

In this section, we show the derivations to obtain $a$ and $b$. Recalling the BWN quantizer introduced in Section 2, we have

According to $\hat{\mathbf{w}}^{t}=\hat{\mathbf{w}}^{b}_{1}+\hat{\mathbf{w}}^{b}_{2}$, for those $\hat{w}_{i}^{t}=\hat{w}_{1,i}^{b}+\hat{w}_{2,i}^{b}=0$, we have

By assuming $0<a<1$ and $b>0$, this can be further simplified to

We empirically find that the solution satisfies $0<a<1$. For $\hat{w}_{i}^{t}\neq 0$, from $\hat{w}_{i}^{t}=\hat{w}_{1,i}^{b}+\hat{w}_{2,i}^{b}$, we have
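As a sketch of how the two conditions above pin down $a$ and $b$, assume the split assigns $(a\,w^{t}_{i},\ (1-a)\,w^{t}_{i})$ to the two halves on $\mathcal{I}$, $(b+w^{t}_{j},\ -b)$ on $\mathcal{J}$ and $(b,\ -b+w^{t}_{k})$ on $\mathcal{K}$, and that BWN uses the scale $\alpha=\frac{1}{n}\|\mathbf{w}\|_{1}$. This piecewise form is an assumption chosen to satisfy $\mathbf{w}^{t}=\mathbf{w}^{b}_{1}+\mathbf{w}^{b}_{2}$ and may differ in detail from the intended Equation (3.1). Writing $S_{\mathcal{I}}=\sum_{i\in\mathcal{I}}|w^{t}_{i}|$, and likewise $S_{\mathcal{J}}$ and $S_{\mathcal{K}}$, and using $0<a<1$ and $b>0$ to resolve the signs and absolute values:

```latex
\begin{aligned}
\alpha_{1} &= \tfrac{1}{n}\bigl[a\,S_{\mathcal{I}} + S_{\mathcal{J}} + (|\mathcal{J}|+|\mathcal{K}|)\,b\bigr], \\
\alpha_{2} &= \tfrac{1}{n}\bigl[(1-a)\,S_{\mathcal{I}} + S_{\mathcal{K}} + (|\mathcal{J}|+|\mathcal{K}|)\,b\bigr], \\
\hat{w}^{t}_{i} = 0:\quad & \alpha_{1} - \alpha_{2} = 0
  \;\Rightarrow\; a = \frac{S_{\mathcal{I}} + S_{\mathcal{K}} - S_{\mathcal{J}}}{2\,S_{\mathcal{I}}}, \\
\hat{w}^{t}_{i} \neq 0:\quad & \alpha_{1} + \alpha_{2} = \frac{S_{\mathcal{I}}}{|\mathcal{I}|}
  \;\Rightarrow\; b = \frac{\frac{n}{|\mathcal{I}|}\,S_{\mathcal{I}} - \sum_{i=1}^{n}|w^{t}_{i}|}{2\,(|\mathcal{J}|+|\mathcal{K}|)}.
\end{aligned}
```

The first condition holds because the two halves carry opposite signs wherever the ternary weight is zero, so their scales must cancel. The second holds because both halves share the sign of $w^{t}_{i}$ on $\mathcal{I}$, so their scales must add up to the ternary scale. Under these assumptions $b>0$ whenever the mean magnitude over $\mathcal{I}$ exceeds the mean magnitude over all weights.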

Appendix B Implementation Details

As mentioned in Section 3.2, adaptive splitting requires first estimating the quantization sensitivity vector $\mathbf{u}$. We study the sensitivity in two aspects: the Transformer parts, and the Transformer layers. For Transformer parts, we follow the weight categorization in Section 2.2: MHA-Q/K, MHA-V, MHA-O, FFN-Mid and FFN-Out. For each of them, we compare the performance gap between quantizing and not quantizing that part (e.g., MHA-V), while leaving all the remaining parts quantized (e.g., MHA-Q/K, MHA-O, FFN-Mid and FFN-Out). Similarly, for each Transformer layer, we quantize all layers but leave the layer under investigation un-quantized, and calculate the performance gain compared with the fully quantized baseline. The performance gains of both Transformer parts and layers are shown in Figure 7. As can be seen, for Transformer parts, FFN-Mid and MHA-Q/K rank first and second. In terms of Transformer layers, shallower layers are more sensitive to quantization than deeper ones.

However, the absolute performance gain may not reflect the quantization sensitivity directly, since Transformer parts have different numbers of parameters. Therefore, we divide the performance gain by the number of parameters in that part or layer to obtain the parameter-wise performance gain. We are thus able to measure the quantization sensitivity of the $i$th Transformer part in the $j$th Transformer layer by summing their parameter-wise performance gains together. We also apply the same procedure to the word embedding and pooler layer to obtain their sensitivity scores.

We are now able to solve Equation (3.2) by dynamic programming. The combinatorial optimization can be viewed as a knapsack problem, where the constraint $\mathcal{C}-\mathcal{C}_{0}$ is the volume of the knapsack, and the sensitivity scores $\mathbf{u}$ are the item values.
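A standard 0/1 knapsack dynamic program fits this description. The sketch below is illustrative only: it assumes each splittable module has a non-negative integer cost (for example its extra size after splitting, in some discretized unit) and that the budget $\mathcal{C}-\mathcal{C}_{0}$ is expressed in the same unit; the function name and the toy numbers are hypothetical.

```python
def select_modules_to_split(gains, costs, budget):
    """0/1 knapsack: maximize total gain with total integer cost <= budget.

    gains  -- sensitivity score of each splittable module (item values)
    costs  -- integer cost of splitting each module (item volumes)
    budget -- integer capacity, standing in for C - C_0
    Returns (best_total_gain, sorted indices of modules to split).
    """
    n = len(gains)
    # best[i][cap]: best gain using only the first i modules within capacity cap
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gain, cost = gains[i - 1], costs[i - 1]
        for cap in range(budget + 1):
            best[i][cap] = best[i - 1][cap]              # option 1: leave module binary
            if cost <= cap:
                candidate = best[i - 1][cap - cost] + gain
                if candidate > best[i][cap]:
                    best[i][cap] = candidate             # option 2: split module
    # Walk back through the table to recover which modules were chosen.
    chosen, cap = [], budget
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= costs[i - 1]
    return best[n][budget], sorted(chosen)

best_gain, to_split = select_modules_to_split(
    gains=[3.0, 4.0, 5.0, 6.0], costs=[2, 3, 4, 5], budget=5
)
```

In this toy instance the two cheapest modules together beat the single most sensitive one, which is the kind of trade-off a greedy pick by sensitivity alone would miss. The running time is proportional to the number of modules times the budget, so a coarser cost unit makes the table smaller.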

B.2 Hyper-parameter Settings

We first perform the two-stage knowledge distillation, i.e., intermediate-layer distillation (Int. Dstil.) and prediction-layer distillation (Pred. Dstil.) on the ternary model, and then perform ternary weight splitting followed by fine-tuning (Split Ft.) with only prediction-layer distillation after the splitting. The initial learning rate is set to $5\times 10^{-5}$ for the intermediate-layer distillation, and $2\times 10^{-5}$ for the prediction-layer distillation, both of which linearly decay to 0 at the end of training. We conduct experiments on GLUE tasks both without and with data augmentation (DA), except for MNLI and QQP, where DA brings limited performance gain. The number of training epochs is set to 3 for MNLI and QQP; for the remaining tasks it is 6 without DA and 1 with DA. For the remaining hyper-parameters, we follow the default setting in Devlin et al. (2019). The detailed hyper-parameters are summarized in Table 7.

Appendix C More Empirical Results

Here we provide more empirical results on the sharp drop in performance caused by binarization. We run multi-bit quantization on the BERT model over representative tasks of the GLUE benchmark, and activations are quantized to both 8-bit and 4-bit. We run 10 independent experiments for each task, except for MNLI with 3 runs. We follow the same procedure as in Section 2.1, and the default experimental setup in Appendix B.2 without data augmentation and splitting. The results are shown in Figures 8 and 9, respectively. While the performance drops slowly from full-precision to ternarization, there is a consistent sharp drop with binarization in each task and on both 8-bit and 4-bit activation quantization. This is similar to the findings in Figure 1.

C.2 More Visualizations of Loss Landscape

To comprehensively compare the loss curvature among the full-precision, ternary and binary models, we provide more landscape visualizations aside from the value layer in Figure 2. We extract parameters from MHA-K, MHA-O, FFN-Mid and FFN-Out in the first two Transformer layers, and the corresponding landscapes are shown in Figure 10, Figure 11, Figure 12 and Figure 13, respectively. We omit MHA-Q due to the page limit, and because it is symmetric to MHA-K with similar landscape observations. In general, the binary model has a steep and irregular loss landscape w.r.t. different parameters of the model, and is thus hard to optimize directly.

C.3 Ablation of Knowledge Distillation

While knowledge distillation on BERT has been thoroughly investigated in Jiao et al. (2020), Hou et al. (2020) and Zhang et al. (2020), here we further conduct an ablation study of knowledge distillation on the proposed ternary weight splitting. We compare no distillation (“N/A”), prediction distillation (“Pred”) and our default setting (“Int.+Pred”). For “N/A” or “Pred”, fine-tuning after splitting follows the same setting as their ternary training. “Int.+Pred” follows our default setting in Table . We do not adopt data augmentation, and the results are shown in Table 10. “Int.+Pred” outperforms both “N/A” and “Pred” by a clear margin, which is consistent with the finding in Zhang et al. (2020) that knowledge distillation helps BERT quantization.

C.4 Detailed Results of Adaptive Splitting

The detailed comparison of our adaptive splitting strategy against the random strategy (Rand.) and the minimal gain strategy (Min.) under different model sizes is shown in Table 8 and Table 9. For both 8-bit and 4-bit activation quantization, our strategy that splits the most sensitive modules mostly performs the best on average under various model sizes.

C.5 Architecture Visualization

We further visualize the architectures after adaptive splitting on MRPC in Figure 14. For clear presentation, we merge all splittable parameters in each Transformer layer. As the baseline, 9.8MB refers to no splitting, while 16.5MB refers to splitting all splittable parameters in the model. According to Figure 14, as the model size increases, shallower layers are preferred for splitting over deeper layers, which is consistent with the findings in Figure 7.