Accelerating BERT Inference for Sequence Labeling via Early-Exit

Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang

Introduction

Sequence labeling plays an important role in natural language processing (NLP). Many NLP tasks can be converted to sequence labeling tasks, such as named entity recognition, part-of-speech tagging, Chinese word segmentation and semantic role labeling. These tasks are usually fundamental and highly time-demanding. Therefore, apart from task performance, their inference efficiency is also very important.

The past few years have witnessed the prevalence of pre-trained models (PTMs) (Qiu et al., 2020) on various sequence labeling tasks (Nguyen et al., 2020; Ke et al., 2020; Tian et al., 2020; Mengge et al., 2020). Despite their significant improvements on sequence labeling, they are notorious for their enormous computational cost and slow inference speed, which hinder their utility in real-time or mobile-device scenarios.

Recently, the early-exit mechanism (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020) has been introduced to accelerate inference for large-scale PTMs. In these methods, each layer of the PTM is coupled with a classifier to predict the label for a given instance. At the inference stage, if the prediction is confident enough at an earlier layer (in this paper, a confident prediction is one whose uncertainty is low), the instance is allowed to exit without passing through the entire model. Figure 1(a) gives an illustration of the early-exit mechanism for text classification. However, most existing early-exit methods are targeted at sequence-level prediction, such as text classification, in which the prediction and its confidence score are calculated over a whole sequence. Therefore, these methods cannot be directly applied to sequence labeling tasks, where the prediction is token-level and a confidence score is required for each token.

In this paper, we aim to extend the early-exit mechanism to sequence labeling tasks. First, we propose SENTence-level Early-Exit (SENTEE), which is a simple extension of existing early-exit methods. SENTEE allows a sequence of tokens to exit together once the maximum uncertainty of the tokens is below a threshold. Despite its effectiveness, we find it redundant for most tokens to update their representations at every layer. Thus, we propose TOKen-level Early-Exit (TOKEE), which allows the tokens that already have confident predictions to exit earlier. Figure 1(b) and 1(c) illustrate our proposed SENTEE and TOKEE. Considering the local dependency inherent in sequence labeling tasks, we decide whether a token can exit based on the uncertainty of a window of its context instead of its own uncertainty alone. For tokens that have already exited, we do not update their representations but simply copy them to the upper layers. However, this introduces a train-inference discrepancy. To tackle this problem, we introduce an additional fine-tuning stage that samples each token’s halting layer based on its uncertainty and copies its representation to the upper layers during training. We conduct extensive experiments on three sequence labeling tasks: NER, POS tagging, and CWS. Experimental results show that our approach can save up to 66% $\sim$ 75% of the inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under speedup ratios of 2 $\times$ , 3 $\times$ , and 4 $\times$ .

BERT for Sequence Labeling

Recently, PTMs (Qiu et al., 2020) have become the mainstream backbone model for various sequence labeling tasks. The typical framework consists of a backbone encoder and a task-specific decoder.

In this paper, we use BERT (Devlin et al., 2019) as our backbone encoder. The architecture of BERT consists of multiple stacked Transformer layers (Vaswani et al., 2017).

Given a sequence of tokens $x_{1},\cdots,x_{N}$ , the hidden state of the $l$ -th Transformer layer is denoted by $\mathbf{H}^{(l)}=[\mathbf{h}_{1}^{(l)},\cdots,\mathbf{h}_{N}^{(l)}]$ , and $\mathbf{H}^{(0)}$ is the BERT input embedding.

Usually, we can predict the label for each token according to the hidden state of the top layer. The probability of labels is predicted by

$$\mathbf{P}=f(\mathbf{W}\mathbf{H}^{(L)}),\qquad \mathbf{P}\in\mathbb{R}^{N\times C},$$

where $N$ is the sequence length, $C$ is the number of labels, $L$ is the number of BERT layers, $\mathbf{W}$ is a learnable matrix, and $f(\cdot)$ is a simple softmax classifier or conditional random field (CRF) (Lafferty et al., 2001). Since we focus on inference acceleration and PTMs perform well enough on sequence labeling without a CRF (Devlin et al., 2019), we do not consider using such a recurrent structure.

Early-Exit for Sequence Labeling

The inference speed and computational cost of PTMs are crucial bottlenecks that hinder their application in many real-world scenarios. In many tasks, the representations at an earlier layer of a PTM are usually adequate to make a correct prediction. Therefore, early-exit mechanisms (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020) have been proposed to dynamically stop inference on the backbone model and make predictions with intermediate representations.

However, these existing early-exit mechanisms are built on sentence-level prediction and are unsuitable for token-level prediction in sequence labeling tasks. In this section, we propose two early-exit mechanisms to accelerate inference for sequence labeling tasks.

To extend early-exit to sequence labeling, we couple each layer of the PTM with token-level off-ramps that can be simply implemented as a linear classifier. Once the off-ramps are trained with the golden labels, an instance has a chance to be predicted and to exit at an earlier layer instead of passing through the entire model.

Given a sequence of tokens $X=x_{1},\cdots,x_{N}$ , we can make predictions by the injected off-ramps at each layer. For an off-ramp at the $l$ -th layer, the label distribution of all tokens is predicted by

$$\mathbf{P}^{(l)}=f^{(l)}(\mathbf{W}^{(l)}\mathbf{H}^{(l)}).$$
With the prediction for each token at hand, we can calculate the uncertainty for each token as follows,

$$u_{n}^{(l)}=\frac{-\mathbf{p}_{n}^{(l)}\cdot\log\mathbf{p}_{n}^{(l)}}{\log C},$$

where $\mathbf{p}^{(l)}_{n}$ is the label probability distribution for the $n$ -th token.
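As a minimal sketch of the per-token uncertainty, the snippet below assumes the normalized-entropy form used in prior early-exit work for text classification: the entropy of $\mathbf{p}^{(l)}_{n}$ divided by $\log C$ , so that values lie in $[0,1]$ . The function name `token_uncertainty` and the example distributions are illustrative assumptions, not taken from the original implementation.

```python
import math


def token_uncertainty(probs):
    """Normalized entropy of one token's label distribution, in [0, 1].

    probs: list of C label probabilities for a single token (sums to 1).
    Assumption: uncertainty = entropy / log(C).
    """
    num_labels = len(probs)
    if num_labels < 2:
        return 0.0
    # Terms with p == 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(num_labels)


# A peaked distribution is "confident"; a uniform one is maximally uncertain.
confident = token_uncertainty([0.97, 0.01, 0.01, 0.01])
unsure = token_uncertainty([0.25, 0.25, 0.25, 0.25])
print(confident, unsure)
```

Because of the normalization, a single threshold can be shared across tasks with different numbers of labels.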

3.2 Early-Exit Strategies

In the following sections, we will introduce two early-exit mechanisms for sequence labeling, at sentence-level and token-level.

Sentence-Level Early-Exit (SENTEE) is a simple extension of existing early-exit approaches to sequence labeling tasks. SENTEE allows a sequence of tokens to exit together if their uncertainty is low enough. To this end, SENTEE aggregates the uncertainties of the individual tokens into an overall uncertainty for the whole sequence. Here we adopt a straightforward but effective method, i.e., max-pooling over the uncertainties of all the tokens,

$$u^{(l)}=\max\{u_{1}^{(l)},\cdots,u_{N}^{(l)}\},$$

where $u^{(l)}$ represents the uncertainty for the whole sentence. If $u^{(l)}<\delta$ where $\delta$ is a pre-defined threshold, we let the sentence exit at layer $l$ . The intuition is that the whole sequence can exit only when the model is confident of its prediction for the most difficult token. We also tried average-pooling, but it brings a drastic performance drop: the average uncertainty over the sequence is often overwhelmed by the many easy tokens, which causes many wrong exits of difficult tokens.
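The SENTEE exit rule can be sketched as follows. The function `sentence_exit_layer`, the `pool` argument and the toy uncertainty values are hypothetical and serve only to illustrate the rule; the per-token uncertainties are assumed to be given by the off-ramps.

```python
def sentence_exit_layer(layer_uncertainties, delta, pool=max):
    """Return the 1-based layer at which the whole sentence exits.

    layer_uncertainties[l - 1][n] is the uncertainty of token n at layer l.
    The sentence exits at the first layer whose pooled uncertainty is below
    delta; if no layer qualifies, it runs through the last layer.
    """
    for layer, token_u in enumerate(layer_uncertainties, start=1):
        if pool(token_u) < delta:
            return layer
    return len(layer_uncertainties)


def mean(values):
    return sum(values) / len(values)


# Toy example: the second token is the difficult one.
toy_u = [
    [0.10, 0.60, 0.20],  # layer 1
    [0.05, 0.30, 0.10],  # layer 2
    [0.02, 0.08, 0.03],  # layer 3
]
max_exit = sentence_exit_layer(toy_u, delta=0.2)              # max-pooling
mean_exit = sentence_exit_layer(toy_u, delta=0.2, pool=mean)  # average-pooling
print(max_exit, mean_exit)
```

In this toy case average-pooling lets the sentence exit one layer earlier although the difficult token still has an uncertainty of 0.30, which mirrors the failure mode of average-pooling described above.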

3.2.2 TOKEE: Token-Level Early-Exit

Despite the effectiveness of SENTEE (see Table 1), we find it redundant for most simple tokens to be fed into the deep layers. Under SENTEE, simple tokens that have already been correctly predicted in shallow layers cannot exit because the uncertainty of a small number of difficult tokens is still above the threshold. Thus, to further accelerate inference for sequence labeling tasks, we propose a token-level early-exit (TOKEE) method that allows simple tokens with confident predictions to exit early.

Note that a prevalent problem in sequence labeling tasks is local dependency (or label dependency). That is, the label of a token heavily depends on the tokens around it. Consequently, the uncertainty of a given token should be computed not only from the token itself but also from its context. Motivated by this, we propose a window-based uncertainty criterion to decide whether a token should exit at the current layer. In particular, the uncertainty for the token $x_{n}$ at the $l$ -th layer is defined as

$$u_{n}^{\prime(l)}=\max\{u_{n-k}^{(l)},\cdots,u_{n+k}^{(l)}\},$$

where $k$ is a pre-defined window size. We then use $u_{n}^{\prime(l)}$ , instead of $u_{n}^{(l)}$ , to decide whether the $n$ -th token can exit at layer $l$ . Note that window-based uncertainty is equivalent to sentence-level uncertainty when $k$ equals the sentence length.
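A minimal sketch of the window-based uncertainty is given below. It assumes that the window is simply truncated at the sentence boundaries, which is not specified in the text; the function name and the example values are illustrative.

```python
def window_uncertainty(token_u, k):
    """Window-based uncertainty for every token at one layer.

    token_u: per-token uncertainties at this layer.
    k: window size; token n looks at tokens n-k .. n+k.
    Assumption: the window is truncated at the sentence boundaries.
    """
    n = len(token_u)
    return [max(token_u[max(0, i - k): min(n, i + k + 1)]) for i in range(n)]


toy_u = [0.10, 0.05, 0.70, 0.05, 0.10, 0.02]
u_k0 = window_uncertainty(toy_u, k=0)             # token-independent
u_k1 = window_uncertainty(toy_u, k=1)             # neighbors of the hard token are blocked
u_all = window_uncertainty(toy_u, k=len(toy_u))   # same as sentence-level max-pooling
print(u_k0, u_k1, u_all)
```

With `k=0` the criterion reduces to each token's own uncertainty, and with `k` equal to the sentence length every token receives the sentence-level maximum, which matches the two limiting cases discussed above.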

For tokens that have exited, their representations are not updated in the upper layers, i.e., the hidden states of exited tokens are directly copied to the upper layers. (For English sequence labeling, we use first-pooling to obtain the representation of a word; if a word exits, we halt-and-copy all of its wordpieces.) Such a halt-and-copy mechanism is intuitive in two respects:

Halt. If the uncertainty of a token is very small, there is little chance that its prediction will change in the following layers. So it is redundant to keep updating its representation.

Copy. If the representation of a token can be classified into a label with a high degree of confidence, then its representation already contains the label information. So we can directly copy its representation into the upper layers to help predict the labels of other tokens.
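The halt-and-copy procedure can be sketched with a toy model as follows. Everything here is a simplified assumption for illustration: `tokee_forward`, `toy_layer` and the scalar "states" are hypothetical stand-ins for Transformer layers, hidden states and off-ramps, and exited tokens are assumed to keep the uncertainty they had when they exited.

```python
def tokee_forward(hidden, layers, uncertainty_fn, delta, k):
    """Run the layers with token-level early exit (illustrative sketch).

    hidden         : list of per-token states (the input embeddings)
    layers         : callables layer(hidden, active) -> {index: new_state}
    uncertainty_fn : maps (depth, state) to an uncertainty in [0, 1]
    Returns the final states and the 1-based exit layer of every token.
    """
    n = len(hidden)
    hidden = list(hidden)
    uncertainty = [1.0] * n  # exited tokens keep their last value (assumption)
    exit_layer = [None] * n
    for depth, layer in enumerate(layers, start=1):
        active = [i for i in range(n) if exit_layer[i] is None]
        if not active:
            break
        # Only active tokens are updated; a layer may still read exited states.
        updated = layer(hidden, active)
        for i in active:
            hidden[i] = updated[i]
            uncertainty[i] = uncertainty_fn(depth, hidden[i])
        for i in active:
            window = uncertainty[max(0, i - k): i + k + 1]
            if max(window) < delta:
                exit_layer[i] = depth  # halt: the state is only copied from now on
    last = len(layers)
    return hidden, [last if e is None else e for e in exit_layer]


def toy_layer(hidden, active):
    # Stand-in for a Transformer layer: halves the state of each active token.
    return {i: hidden[i] * 0.5 for i in active}


states, exits = tokee_forward(
    hidden=[0.4, 0.1, 0.8, 0.1],
    layers=[toy_layer] * 4,
    uncertainty_fn=lambda depth, state: state,  # toy: the state is its own uncertainty
    delta=0.15,
    k=1,
)
print(exits, states)
```

In this toy run the first token exits one layer before its neighbors, and its state is carried unchanged through the remaining layers, while the fourth layer is never executed because no active token is left.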

These exited tokens no longer attend to other tokens at upper layers but can still be attended to by other tokens; thus, part of the layer-specific query projections in upper layers can be omitted. In this way, the computational complexity of self-attention is reduced from $\mathcal{O}(N^{2}d)$ to $\mathcal{O}(NMd)$ , where $M\ll N$ is the number of tokens that have not exited. Besides, the computational complexity of the point-wise FFN is also reduced from $\mathcal{O}(Nd^{2})$ to $\mathcal{O}(Md^{2})$ .
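To make the complexity argument concrete, the following back-of-the-envelope sketch counts only the two dominant terms named above (self-attention and point-wise FFN) and ignores constant factors, key/value projections for exited tokens, and the cost of the off-ramps. The function names and the example numbers are illustrative assumptions and do not reproduce the FLOPs accounting of the experiments.

```python
def layer_cost(n_tokens, n_active, d):
    """Dominant per-layer cost when only n_active of n_tokens are updated."""
    attention = n_tokens * n_active * d  # active queries attend to all tokens
    ffn = n_active * d * d               # FFN is applied to active tokens only
    return attention + ffn


def estimated_speedup(exit_layers, num_layers, d):
    """Ratio of full cost to early-exit cost, given each token's exit layer."""
    n = len(exit_layers)
    full = num_layers * layer_cost(n, n, d)
    reduced = 0
    for layer in range(1, num_layers + 1):
        # A token with exit layer e is processed by layers 1..e.
        n_active = sum(1 for e in exit_layers if e >= layer)
        reduced += layer_cost(n, n_active, d)
    return full / reduced


ratio = layer_cost(128, 32, 768) / layer_cost(128, 128, 768)
speedup = estimated_speedup([2, 3, 3, 3], num_layers=12, d=768)
print(ratio, speedup)
```

Under these two terms alone, the per-layer cost ratio simplifies to $M/N$ , so the estimated overall speedup depends only on how many tokens remain active at each layer.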

The halt-and-copy mechanism is also similar to the multi-pass sequence labeling paradigm, in which tokens are labeled in order of their difficulty (easiest first). However, the copy mechanism results in a train-inference discrepancy. That is, during training a layer never processes representations from its non-adjacent previous layers. To alleviate the discrepancy, we further propose an additional fine-tuning stage, which will be discussed in Section 3.3.2.

3.3 Model Training

In this section, we describe the training process of our proposed early-exit mechanisms.

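For sentence-level early-exit, we follow prior early-exit work for text classification and jointly train the added off-ramps. For each off-ramp, the loss with respect to the golden labels $y_{1},\cdots,y_{N}$ is as follows,

$$\mathcal{L}_{l}=\sum_{n=1}^{N}H\big(y_{n},f^{(l)}(\mathbf{h}_{n}^{(l)})\big),$$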
where $H$ is the cross-entropy loss function and $N$ is the sequence length. The total loss function for each sample is a weighted sum of the losses for all the off-ramps,

$$\mathcal{L}_{\text{total}}=\sum_{l=1}^{L}w_{l}\mathcal{L}_{l},$$

where $w_{l}$ is the weight for the $l$ -th off-ramp and $L$ is the number of backbone layers. Following Zhou et al. (2020), we simply set $w_{l}=l$ . In this way, the deeper an off-ramp is, the larger the weight of its loss, so that the off-ramps can be trained jointly in a relatively balanced way.
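The joint objective can be sketched as follows, assuming that the per-layer loss is the token-wise cross-entropy summed over the sequence and that the layer weights $w_{l}=l$ are applied without further normalization (whether the weighted sum is additionally divided by the sum of the weights is an assumption left open here). The function names and toy probabilities are illustrative.

```python
import math


def off_ramp_loss(gold, probs):
    """Cross-entropy of one off-ramp, summed over the tokens of a sequence.

    gold : list of gold label indices, one per token
    probs: list of label distributions, one per token
    """
    return sum(-math.log(p[y]) for y, p in zip(gold, probs))


def total_loss(gold, probs_per_layer):
    """Weighted sum of the off-ramp losses with weight w_l = l (1-based)."""
    return sum(
        layer * off_ramp_loss(gold, probs)
        for layer, probs in enumerate(probs_per_layer, start=1)
    )


gold = [0, 1]
probs_per_layer = [
    [[0.5, 0.5], [0.5, 0.5]],  # off-ramp at layer 1: uninformative
    [[0.9, 0.1], [0.2, 0.8]],  # off-ramp at layer 2: mostly correct
]
loss = total_loss(gold, probs_per_layer)
print(loss)
```

Errors of deeper off-ramps are penalized more strongly than errors of shallow ones, since the weight grows linearly with the layer index.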

3.3.2 Fine-Tuning for TOKEE

Since TOKEE is equipped with halt-and-copy, the common joint training of off-ramps is not enough, because the model never conducts halt-and-copy in training but does so in inference. In this stage, we aim to train the model to use hidden states from different previous layers, not only from the adjacent previous layer, just as in inference.

A direct way is to uniformly sample the halting layers of tokens. However, halting layers at inference are not random but depend on the difficulty of each token in the sequence. So randomly sampling halting layers also causes a gap between training and inference.

Instead, we use the fine-tuned model itself to sample the halting layers. For every sample in each training epoch, we randomly sample a window size and a threshold, and then conduct TOKEE on the trained model under this window size and threshold, without halt-and-copy. This gives the exiting layer of each token, which we use to re-forward the sample, halting and copying each token at its corresponding layer. In this way, the exiting layer of a token corresponds to its difficulty: the deeper a token’s exiting layer is, the more difficult the token is. Because we sample the exiting layers using the model itself, we expect the gap between training and inference to be further reduced. To avoid over-fitting during this further training, we prevent the training loss from decreasing further, similar to the flooding mechanism used by Ishida et al. (2020). We also employ the sandwich rule (Yu and Huang, 2019) to stabilize this training stage. We compare self-sampling with random sampling in Section 4.4.4.
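The sampling step of this fine-tuning stage can be sketched as follows. The sketch assumes that the per-layer token uncertainties come from an ordinary forward pass of the trained model (without halt-and-copy) and that a token's exiting layer is the first layer at which its window-based uncertainty falls below the sampled threshold. The candidate window sizes and thresholds, like the function names, are hypothetical.

```python
import random


def first_exit_layers(layer_uncertainties, k, delta):
    """Exit layer (1-based) of each token under window size k and threshold delta.

    layer_uncertainties[l - 1][n] is the uncertainty of token n at layer l,
    obtained from a forward pass without halt-and-copy.
    """
    num_layers = len(layer_uncertainties)
    n = len(layer_uncertainties[0])
    exits = [num_layers] * n  # tokens that never qualify use the last layer
    for i in range(n):
        for depth, token_u in enumerate(layer_uncertainties, start=1):
            if max(token_u[max(0, i - k): i + k + 1]) < delta:
                exits[i] = depth
                break
    return exits


def self_sample_exit_layers(layer_uncertainties, window_sizes, thresholds, rng):
    """Sample (k, delta) for one training sample and derive its exit layers."""
    k = rng.choice(window_sizes)
    delta = rng.choice(thresholds)
    return k, delta, first_exit_layers(layer_uncertainties, k, delta)


toy_u = [
    [0.2, 0.05, 0.4, 0.05],       # layer 1
    [0.1, 0.025, 0.2, 0.025],     # layer 2
    [0.05, 0.0125, 0.1, 0.0125],  # layer 3
]
fixed_exits = first_exit_layers(toy_u, k=1, delta=0.15)
k, delta, sampled_exits = self_sample_exit_layers(
    toy_u, window_sizes=[0, 1, 2], thresholds=[0.1, 0.15, 0.3], rng=random.Random(0)
)
print(fixed_exits, k, delta, sampled_exits)
```

The resulting exit layers would then be used in a second forward pass in which each token is halted and copied at its sampled layer before the loss is computed.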

Experiment

We use average floating-point operations (FLOPs) as the measure of computational cost, which denotes how many floating-point operations the model performs for a single sample. FLOPs are a sufficiently general measure since they do not depend on the running environment of the model (CPU, GPU or TPU) and they reflect the theoretical running time of the model. In general, the lower the model’s FLOPs, the faster its inference.

4.2 Experimental Setup

To verify the effectiveness of our methods, we conduct experiments on ten English and Chinese sequence labeling datasets, covering NER: CoNLL2003 (Tjong Kim Sang and De Meulder, 2003), Twitter NER (Zhang et al., 2018), Ontonotes 4.0 (Chinese) (Weischedel et al., 2011), Weibo (Peng and Dredze, 2015; He and Sun, 2017) and CLUE NER (Xu et al., 2020); POS: ARK Twitter (Gimpel et al., 2011; Owoputi et al., 2013), CTB5 POS (Xue et al., 2005) and UD POS (Nivre et al., 2016); and CWS: CTB5 Seg (Xue et al., 2005) and UD Seg (Nivre et al., 2016). Besides standard benchmark datasets such as CoNLL2003 and Ontonotes 4.0, we also choose some datasets closer to real-world applications to verify the practical utility of our methods, such as Twitter NER and Weibo in the social media domain. We use the same dataset preprocessing and splits as in previous work (Huang et al., 2015; Mengge et al., 2020; Jia et al., 2020; Tian et al., 2020; Nguyen et al., 2020).

4.2.2 Baseline

We compare our methods with three baselines:

BiLSTM-CRF (Huang et al., 2015; Ma and Hovy, 2016) The most widely used model in sequence labeling tasks before pre-trained language models became prevalent in NLP.

BERT The powerful stacked Transformer encoder model, pre-trained on a large-scale corpus, which we use as the backbone of our methods.

DistilBERT The most well-known distillation method for BERT. Huggingface released a 6-layer DistilBERT for English (Sanh et al., 2019). For comparison, we distill {3, 4}-layer and {3, 4, 6}-layer DistilBERT models for English and Chinese, respectively, using the same method.

4.2.3 Hyper-Parameters

For all datasets, we use a batch size of 10. We perform a grid search over the learning rate in {5e-6, 1e-5, 2e-5} and choose the learning rate and the model based on the development set. We use the AdamW optimizer (Loshchilov and Hutter, 2019). The warmup step and weight decay are set to 0.05 and 0.01, respectively.

4.3 Main Results

For English datasets, we use the ‘BERT-base-cased’ model released by Google (Devlin et al., 2019) as the backbone. For Chinese datasets, we use ‘BERT-wwm’ released by Cui et al. (2019). The DistilBERT models are distilled from the backbone BERT.

To fairly compare our methods with the baselines, we adjust the speedup ratio of our methods to be consistent with that of the corresponding static baseline. We report the average performance over 5 runs with different random seeds. The overall results are shown in Table 1, where the speedup is measured relative to the backbone. We can see that both SENTEE and TOKEE bring little performance drop and outperform DistilBERT at a speedup ratio of 2 $\times$ , which is similar to the effect of existing early-exit methods for text classification. Under higher speedup ratios, 3 $\times$ and 4 $\times$ , SENTEE shows its weakness but TOKEE still retains a certain level of performance. Across the 2 $\sim$ 4 $\times$ speedup range, TOKEE has a lower performance drop than DistilBERT. Moreover, on datasets where BERT clearly outperforms LSTM-CRF, e.g., Chinese NER, TOKEE (4 $\times$ ) on BERT still outperforms LSTM-CRF significantly. This indicates its potential utility in complicated real-world scenarios.

To explore the fine-grained performance change under different speedup ratios, we visualize the speedup-performance trade-off curves on 6 datasets in Figure 2. We observe that:

Before the speedup ratio rises to a certain turning point, there is almost no drop on performance. After that, the performance will drop gradually. This shows our methods keep the superiority of existing early-exit methods (Xin et al., 2020).

As the speedup rises, TOKEE encounters the speedup turning point later than SENTEE. After both methods reach the turning point, SENTEE’s performance degradation is more drastic than TOKEE’s. Both observations indicate the higher speedup ceiling of TOKEE.

On some datasets, such as CoNLL2003, we observe a slight performance improvement under low speedup ratios. We attribute this to the potential regularization brought by early-exit, such as alleviating overthinking (Kaya et al., 2019).

To verify the versatility of our methods over different PTMs, we also conduct experiments on two well-known BERT variants, RoBERTa (Liu et al., 2019; https://github.com/ymcui/Chinese-BERT-wwm) and ALBERT (Lan et al., 2020; https://github.com/brightmart/albert_zh), as shown in Table 2. We can see that SENTEE and TOKEE also significantly outperform the static internal layers of the backbone on three representative datasets of the corresponding tasks. For RoBERTa and ALBERT, we also observe that TOKEE performs better than SENTEE under high speedup ratios.

4.4 Analysis

In this section, we conduct a set of detailed analysis on our methods.

We show the performance change under different window sizes $k$ in Figure 3, keeping the speedup ratio consistent. We observe that: (1) When $k$ is 0, in other words, when token-independent uncertainty is used instead of window-based uncertainty, the performance is almost the lowest across different speedup ratios, because local dependency is not considered at all. This shows the necessity of the window-based uncertainty. (2) When $k$ is relatively large, it brings a significant performance drop under high speedup ratios (3 $\times$ and 4 $\times$ ), as with SENTEE. (3) It is necessary to choose an appropriate $k$ under high speedup ratios, where the effect of different $k$ has a high variance.

4.4.2 Accuracy vs. Uncertainty

Liu et al. (2020) verified that ‘the lower the uncertainty, the higher the accuracy’ on text classification. Here, we verify our window-based uncertainty on sequence labeling. In detail, we examine both the window-based uncertainty as a whole and its specific hyper-parameter, $k$ , on CoNLL2003, as shown in Figure 4. For the uncertainty, we take the 4th and 8th off-ramps and calculate their accuracy in each uncertainty interval, with $k$ =2. The result shown in Figure 4(a) indicates that ‘the lower the window-based uncertainty, the higher the accuracy’, as in text classification. For $k$ , we fix the threshold at 0.3 and calculate the accuracy of the tokens whose window-based uncertainty is smaller than the threshold under different $k$ , as shown in Figure 4(b). The result shows that, as $k$ increases: (1) The accuracy of the screened tokens is higher. This shows that the wider a token’s low-uncertainty neighborhood is, the more accurate the token’s prediction is, which also verifies the validity of the window-based uncertainty strategy. (2) The accuracy improvement slows down. This shows the low relevance of distant tokens’ uncertainty and explains why a large $k$ does not perform well under high speedup ratios: it does not make exits more accurate but only delays them.
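The per-interval accuracy used in this kind of analysis can be computed as in the following sketch. The function name, the number of intervals and the toy data are illustrative assumptions; the uncertainties are assumed to lie in $[0,1]$ and `correct` marks whether the off-ramp's prediction for each token matches the gold label.

```python
def accuracy_by_uncertainty(uncertainties, correct, num_bins=5):
    """Accuracy of tokens grouped into equal-width uncertainty intervals.

    Returns one accuracy per interval, or None for an empty interval.
    """
    hits = [0] * num_bins
    counts = [0] * num_bins
    for u, ok in zip(uncertainties, correct):
        # An uncertainty of exactly 1.0 falls into the last interval.
        b = min(int(u * num_bins), num_bins - 1)
        counts[b] += 1
        hits[b] += 1 if ok else 0
    return [h / c if c else None for h, c in zip(hits, counts)]


toy_u = [0.05, 0.10, 0.15, 0.50, 0.55, 0.90]
toy_correct = [True, True, True, True, False, False]
acc = accuracy_by_uncertainty(toy_u, toy_correct, num_bins=5)
print(acc)
```

Passing window-based uncertainties instead of token-level ones, or keeping only the tokens below a fixed threshold, yields the two variants of the analysis described above.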

4.4.3 Influence of Sequence Length

Transformer-based PTMs, e.g., BERT, face a challenge in processing long text due to the $O(N^{2}d)$ computational complexity of self-attention. Since TOKEE reduces the layer-wise computational complexity from $O(N^{2}d+Nd^{2})$ to $O(NMd+Md^{2})$ and SENTEE does not, we explore their effects over different sentence lengths. We compare the highest speedup ratio of TOKEE and SENTEE when the performance drop is $<1$ on Ontonotes 4.0, as shown in Figure 5. We observe that TOKEE has a stable computational cost saving as the sentence length increases, whereas SENTEE’s speedup ratio gradually decreases. An intuitive explanation is that a longer sentence has more tokens, so it is more difficult for the model to give all of them confident predictions at the same layer. This comparison reveals the potential of TOKEE for accelerating long-text inference.

4.4.4 Effects of Self-Sampling Fine-Tuning

To verify the effect of the self-sampling fine-tuning in Section 3.3.2, we compare it with random sampling and with no extra fine-tuning on CoNLL2003. The performance-speedup trade-off curve of TOKEE is shown in Figure 6, which shows that self-sampling is always better than random sampling for TOKEE. As the speedup ratio rises, this trend becomes more significant. This shows that self-sampling helps more in reducing the gap between training and inference. Without extra fine-tuning, the performance deteriorates drastically at high speedup ratios, although it roughly keeps a certain capability at low speedup ratios. We attribute the latter to the residual connections of the PTM; similar results were reported by Veit et al. (2016).

4.4.5 Layer Distribution of Early-Exit

In TOKEE, with the halt-and-copy mechanism, each token goes through a different number of PTM layers according to its difficulty. We show the average distribution of the exiting layers of a sentence’s tokens under different speedup ratios on CoNLL2003 in Figure 7. We also draw the average exiting layer of SENTEE under the same speedup ratios. We observe that as the speedup ratio rises, more tokens exit at earlier layers, but a few tokens can still go through the deeper layers even at 4 $\times$ speedup. Meanwhile, SENTEE’s average exiting layer drops to 2.5, where the PTM’s encoding power is severely cut down. This gives an intuitive explanation of why TOKEE is more effective than SENTEE under high speedup ratios: although both SENTEE and TOKEE can dynamically adjust the computational cost at the sample level, TOKEE can do so in a more fine-grained way.
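The exiting-layer statistics discussed here can be computed from per-token exit layers as in the following sketch. The function name and the toy exit layers are illustrative assumptions and are not taken from the reported experiments.

```python
from collections import Counter


def exit_layer_distribution(exit_layers, num_layers):
    """Fraction of tokens exiting at each layer, plus the average exit layer.

    exit_layers: 1-based exit layer of every token in a sentence (or corpus).
    """
    counts = Counter(exit_layers)
    total = len(exit_layers)
    fractions = [counts.get(layer, 0) / total for layer in range(1, num_layers + 1)]
    average = sum(exit_layers) / total
    return fractions, average


fractions, average = exit_layer_distribution([2, 3, 3, 3], num_layers=4)
print(fractions, average)
```

For SENTEE, all tokens of a sentence share one exit layer, so its distribution per sentence collapses to a single layer, whereas TOKEE can spread the tokens of one sentence over several layers.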

Related Work

PTMs are powerful but have a high computational cost, and many attempts have been made to accelerate them. One line of work reduces model size, such as distillation (Sanh et al., 2019; Jiao et al., 2020), structural pruning (Michel et al., 2019; Fan et al., 2020) and quantization (Shen et al., 2020).

Another line of work is early-exit, which dynamically adjusts the number of encoding layers for different samples (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020; Li et al., 2020). While these works introduced the early-exit mechanism in simple classification tasks, our methods are proposed for a more complicated scenario, sequence labeling, where there is more than one prediction probability and it is necessary to consider the dependency between token exits. Elbayad et al. (2020) proposed the Depth-Adaptive Transformer to accelerate machine translation. However, their early-exit mechanism is designed for auto-regressive sequence generation, in which tokens must exit in left-to-right order. Therefore, it is unsuitable for language understanding tasks. Different from their method, our early-exit mechanism can consider the exit of all tokens simultaneously.

Conclusion and Future Work

In this work, we propose two early-exit mechanisms for sequence labeling: SENTEE and TOKEE. The former is a simple extension of sequence-level early-exit while the latter is specially designed for sequence labeling, which can conduct more fine-grained computational cost allocation. We equip TOKEE with window-based uncertainty and self-sampling finetuning to make it more robust and faster. The detailed analysis verifies their effectiveness. SENTEE and TOKEE can achieve 2 $\times$ and 3 $\sim$ 4 $\times$ speedup with minimal performance drop.

For future work, we wish to explore: (1) leveraging the label information of exited tokens to help the exiting of the remaining tokens; (2) introducing CRF or other global decoding methods into early-exit for sequence labeling.

Acknowledgments

We thank anonymous reviewers for their detailed reviews and great suggestions. This work was supported by the National Key Research and Development Program of China (No. 2020AAA0106700), National Natural Science Foundation of China (No. 62022027) and Major Scientific Research Project of Zhejiang Lab (No. 2019KD0AD01).