WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, Nan Duan

Introduction

Pre-trained language models (PLMs) Devlin et al. (2019); Liu et al. (2019) learn sentence semantics well when fine-tuned with supervised data Reimers and Gurevych (2019); Thakur et al. (2020). In practice, however, large amounts of supervised data are often unavailable, and an approach that produces sentence embeddings in an unsupervised way is of great value in scenarios such as sentence matching and retrieval. While there have been attempts at unsupervised sentence embeddings Arora et al. (2017); Zhang et al. (2020), to the best of our knowledge there is no comprehensive study of various PLMs with regard to multiple factors. We also aim to provide an easy-to-use toolkit that produces sentence embeddings from various PLMs.

In this paper, we investigate PLM-based unsupervised sentence embeddings from three aspects. First, a standard way of obtaining a sentence embedding is to take the vector of the $[CLS]$ token. We explore whether using the hidden vectors of other tokens is beneficial. Second, some works suggest producing sentence embeddings from the last layer or from the combination of the last two layers Reimers and Gurevych (2019); Li et al. (2020). We seek to find out whether there is a better way to combine layers. Third, recent attempts transform sentence embeddings to a different distribution with sophisticated networks Li et al. (2020) to address the problem of a non-smooth, anisotropic distribution. Instead, we explore whether a simple linear transformation is sufficient.

To answer these questions, we conduct thorough experiments on 4 different PLMs and evaluate on 7 semantic textual similarity datasets. We find that, first, averaging the token representations consistently yields better sentence representations than using the representation of the $[CLS]$ token. Second, combining the embeddings of the bottom layer and the top layer performs better than using the top two layers. Third, normalizing sentence embeddings with whitening, a simple linear matrix transformation that takes fewer than 10 lines of code, consistently brings improvements.

Transformer-based PLMs

The multi-layer Transformer architecture Vaswani et al. (2017) has been widely used in pre-trained language models (e.g. Devlin et al., 2019; Liu et al., 2019) to encode sentences. Given an input sequence $S=\{s_{1},s_{2},\dots,s_{n}\}$ , a transformer-based PLM produces a sequence of hidden representations $H^{(0)},H^{(1)},\dots,H^{(L)}$ , where $H^{(l)}=[\mathbf{h}^{(l)}_{1},\mathbf{h}^{(l)}_{2},\dots,\mathbf{h}^{(l)}_{n}]$ are the per-token embeddings of $S$ in the $l$ -th encoder layer and $H^{(0)}$ corresponds to the non-contextual word(piece) embeddings.

In this paper, we use four transformer-based PLMs to derive sentence embeddings, i.e. BERT-base Devlin et al. (2019), RoBERTa-base Liu et al. (2019), DistilBERT Sanh et al. (2019), and LaBSE Feng et al. (2020). They vary in the model architecture and pre-training objectives. Specifically, BERT-base, RoBERTa-base, and LaBSE follow an architecture of twelve layers of transformers but DistilBERT only contains six layers. Additionally, LaBSE is pre-trained with a unique translation ranking task which forces the sentence embeddings of a parallel sentence pair to be closer, while the other three PLMs do not include such a pre-training task for sentence embeddings.

WhiteningBERT

In this section, we introduce how to derive sentence embeddings $\mathbf{s}$ from PLMs following the three strategies below.

Taking the last layer of token representations as an example, we compare the following two methods of obtaining sentence embeddings: (1) using the vector of the $[CLS]$ token, which is the first token of the sentence, i.e., $\mathbf{s}=\mathbf{s}^{L}=\mathbf{h}_{1}^{L}$ ; (2) averaging the vectors of all tokens in the sentence, including the $[CLS]$ token, i.e., $\mathbf{s}=\mathbf{s}^{L}=\frac{1}{n}\sum_{i=1}^{n}{\mathbf{h}^{L}_{i}}$ .
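The two options can be illustrated with a minimal NumPy sketch. It assumes the token vectors of one layer are already available as an array of shape (n, hidden size) with the $[CLS]$ token in row 0 and no padding rows; the function names and the toy numbers are hypothetical and not part of our toolkit.

```python
import numpy as np

def cls_pooling(hidden: np.ndarray) -> np.ndarray:
    """Option (1): return the vector of the first token ([CLS])."""
    return hidden[0]

def mean_pooling(hidden: np.ndarray) -> np.ndarray:
    """Option (2): average the vectors of all n tokens, [CLS] included."""
    return hidden.mean(axis=0)

# Toy layer output: n = 3 tokens, hidden size 2 (made-up numbers).
H_last = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [2.0, 4.0]])

s_cls = cls_pooling(H_last)   # array([1., 0.])
s_avg = mean_pooling(H_last)  # array([1., 2.])
```

With batched, padded inputs, the average would additionally need to skip padding positions; that masking step is left out here for brevity.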

Layer Combination

Most works take only the last layer to derive sentence embeddings and rarely explore which layers of semantic representations help to derive a better sentence embedding. Here we explore how best to combine layers of embeddings to obtain sentence embeddings. Specifically, we first compute the vector representation $\mathbf{s}^{l}$ of each layer following Section 3.1. We then combine the selected layers by averaging their vectors, $\mathbf{s}=\frac{1}{k}\sum_{l}{\mathbf{s}^{l}}$ , where $k$ is the number of combined layers; the constant factor $\frac{1}{k}$ does not change cosine similarities, so averaging and adding up the layers are interchangeable for our purposes. For example, for the two-layer combination L1+L12, we obtain sentence embeddings from the vector representations of layer one and layer twelve, i.e., $\mathbf{s}=\frac{1}{2}(\mathbf{s}^{1}+\mathbf{s}^{12})$ .
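As a rough sketch of the L1+L12 combination, assume the per-layer token representations are held in a list `hidden_states` whose index $l$ stores $H^{(l)}$ (so index 0 is the word(piece) embedding layer). The helper name `combine_layers` and the random toy data are illustrative assumptions only.

```python
import numpy as np

def combine_layers(hidden_states, layers):
    """Mean-pool each selected layer, then average the per-layer vectors."""
    per_layer = [hidden_states[l].mean(axis=0) for l in layers]
    return np.mean(per_layer, axis=0)

# Toy stack: H^(0) plus 12 encoder layers, 4 tokens, hidden size 3.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(4, 3)) for _ in range(13)]

# L1+L12: s = 1/2 * (s^1 + s^12)
s = combine_layers(hidden_states, layers=[1, 12])
```

Passing a longer list such as `[1, 2, 12]` gives the corresponding three-layer combination.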

Whitening

Whitening is a linear transformation that maps a vector of random variables with a known covariance matrix to a new vector whose covariance is the identity matrix. It has been shown to improve representations in bilingual word embedding mapping Artetxe et al. (2018) and image retrieval Jégou and Chum (2012).
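The following NumPy sketch shows one common way to realize such a transformation: center the embeddings, estimate their covariance, and rescale along its eigen-directions. It is an illustration under our own assumptions (the function name `whiten`, the small `eps` stabilizer, and the toy data are hypothetical) and is not necessarily identical to the PyTorch code in Figure 3.

```python
import numpy as np

def whiten(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Map row-vector embeddings to zero mean and identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    # Sample covariance matrix of shape (d, d).
    cov = centered.T @ centered / (len(embeddings) - 1)
    # cov is symmetric positive semi-definite, so its SVD is cov = U diag(s) U^T.
    u, s, _ = np.linalg.svd(cov)
    # W = U diag(s)^(-1/2); eps guards against division by zero.
    W = u @ np.diag(1.0 / np.sqrt(s + eps))
    return centered @ W

# Toy data: 200 correlated "sentence embeddings" of dimension 4.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 3.0, 0.0],
                   [1.0, 0.0, 1.0, 0.5]])
X = rng.normal(size=(200, 4)) @ mixing
Z = whiten(X)
```

After the transformation, the sample covariance of `Z` is the identity matrix up to numerical error. When there are fewer sentences than embedding dimensions, the covariance is rank-deficient and a larger `eps` (or dropping the near-zero directions) would be needed.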

Experiment

We evaluate sentence embeddings on the task of unsupervised semantic textual similarity. We show experimental results and report the best way to derive unsupervised sentence embedding from PLMs.

The task of unsupervised semantic textual similarity (STS) aims to predict the similarity of two sentences without direct supervision. We experiment on seven STS datasets, namely the STS-Benchmark (STS-B) Cer et al. (2017), the SICK-Relatedness Marelli et al. (2014), and the STS tasks 2012-2016 Agirre et al. (2012, 2013, 2014, 2015, 2016). These datasets consist of sentence pairs with labeled semantic similarity scores ranging from 0 to 5.

Evaluation Procedure

Following the procedures in previous works like SBERT Reimers and Gurevych (2019), we first derive sentence embeddings for each sentence pair and compute the cosine similarity score of the embeddings as the predicted similarity. Then we calculate the Spearman’s rank correlation coefficient between the predicted similarity and gold standard similarity scores as the evaluation metric. We average the Spearman’s coefficients among the seven datasets as the final correlation score.
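The procedure above can be sketched for a single dataset as follows. The sketch assumes the sentence embeddings have already been computed; the helper names and the four toy sentence pairs with their gold scores are made up for illustration. Spearman's coefficient is computed as the Pearson correlation of the rank vectors, with tied values sharing their average rank.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_ranks(x) -> np.ndarray:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(pred, gold) -> float:
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(pred), average_ranks(gold))[0, 1])

# Toy dataset: four sentence pairs (2-d embeddings) with gold scores in [0, 5].
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.0])),
    (np.array([1.0, 0.0]), np.array([1.0, 1.0])),
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
    (np.array([1.0, 0.0]), np.array([-1.0, 0.0])),
]
gold = [5.0, 3.5, 1.0, 0.0]

pred = [cosine(a, b) for a, b in pairs]
rho = spearman(pred, gold)  # predicted order matches gold order, so rho is 1
```

The final score would then be the mean of the per-dataset coefficients over the seven datasets.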

Baseline Methods

We compare our methods with five representative unsupervised sentence embedding models: average GloVe embeddings Pennington et al. (2014), SIF Arora et al. (2017), IS-BERT Zhang et al. (2020), BERT-flow Li et al. (2020), and SBERT-WK with BERT Wang and Kuo (2020).

Overall Results

Table 1 shows the overall performance of sentence embeddings with different models and settings. We can observe that:

(1) Averaging the token representations of the last layer to derive sentence embeddings performs better than using only the $[CLS]$ token of the last layer by a large margin, no matter which PLM we use. This indicates that the $[CLS]$ token embedding alone does not convey enough semantic information to serve as a sentence representation, even though it has proven effective in a number of supervised classification tasks. This finding is consistent with the results in Reimers and Gurevych (2019). Therefore, we suggest inducing sentence embeddings by averaging token representations.

(2) Adding up the token representations of layer one and the last layer to form the sentence embeddings performs better than using either layer alone, regardless of the PLM. Since PLMs capture a rich hierarchy of linguistic information in different layers Tenney et al. (2019); Jawahar et al. (2019), layer combination can fuse the semantic information of different layers and thus yields better performance. Therefore, we suggest combining the last layer and layer one to induce better sentence embeddings.

(3) Introducing the whitening strategy produces consistent improvements of sentence embeddings on STS tasks, which indicates the effectiveness of whitening in deriving sentence embeddings. Among the four PLMs, LaBSE achieves the best STS performance but gains the least from whitening. We attribute this to its good intrinsic representation ability: LaBSE is pre-trained with a translation ranking task, which improves the quality of its sentence embeddings.

Analysis of Layer Combination

To further investigate the effects of layer combination, we add up the token representations of different layers to induce sentence embeddings.

First, we explore whether adding up layer one and the last layer is consistently better than other combinations of two layers. Figure 1 shows the performance of all two-layer combinations. We find that adding up the last layer and layer one does not necessarily achieve the best performance for every PLM, but it is a satisfactory choice for its simplicity.

Second, we explore how the number of combined layers affects the induced sentence embeddings. We evaluate on BERT-base, and Figure 2 shows the maximum correlation score of each group of layer combinations. As the number of layers increases, the maximum correlation score first rises and then drops. The best performance appears when the number of layers is three (L1+L2+L12). This indicates that combining three layers is sufficient to yield good sentence representations; incorporating more layers is not only more complex but also performs worse.

Related Work

Unsupervised sentence embeddings are mainly composed from pre-trained (contextual) word embeddings Pennington et al. (2014); Devlin et al. (2019). Recent attempts can be divided into two categories, according to whether the pre-trained embeddings are further trained or not. In the first category, some works leverage unlabelled natural language inference datasets to train a sentence encoder without direct supervision Li et al. (2020); Zhang et al. (2020); Mu and Viswanath (2018). In the second, some works propose weighted averages of word embeddings based on word features Arora et al. (2017); Ethayarajh (2018); Yang et al. (2019); Wang and Kuo (2020). However, these approaches need further training or additional features, which limits the direct application of sentence embeddings in real-world scenarios. Finally, we note that, concurrent with this work, Su et al. (2021) also explored whitening sentence embeddings; their work was released on arXiv one week before our paper.

Conclusion

In this paper, we explore different ways of producing sentence embeddings from various PLMs and find a simple and effective one. Through extensive experiments, we draw three empirical conclusions. First, averaging all token representations consistently induces better sentence representations than using the $[CLS]$ token embedding. Second, combining the embeddings of the bottom layer and the top layer outperforms using the top two layers. Third, normalizing sentence embeddings with a whitening algorithm consistently boosts performance.

References

Appendix A

To further illustrate the effectiveness of the whitening algorithm in inducing sentence embeddings for STS tasks, we experiment with more PLMs and report their performance with and without whitening. From the results in Table 2, we find that no matter which PLM we use, the average performance on the 7 STS tasks improves after incorporating whitening. This result again verifies the effectiveness of whitening in producing sentence embeddings.

A.2 Comparison with GPT-3

GPT-3 Brown et al. (2020) is a powerful language model that is capable of sophisticated natural language understanding on tasks like classification in a zero-shot fashion. Here we report the results of WhiteningBERT (PLM=BERT) on the RTE dev set Wang et al. (2019). Specifically, we first compute the cosine similarity of the two sentence embeddings and then manually set a threshold of 0.5 to predict the label of each sentence pair. The results are shown in Table 3.
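The thresholding step might look like the sketch below. Which side of the threshold maps to which RTE label is our assumption here (similarity of at least 0.5 is read as entailment), and the function name, label strings, and toy vectors are hypothetical.

```python
import numpy as np

def predict_label(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> str:
    """Assumed rule: cosine similarity >= threshold means entailment."""
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return "entailment" if cos >= threshold else "not_entailment"

# Toy embeddings: a nearly parallel pair and an orthogonal pair.
close = predict_label(np.array([1.0, 0.2]), np.array([0.9, 0.3]))
far = predict_label(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```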

A.3 Code for Whitening

Figure 3 displays the source code for the whitening algorithm in PyTorch Paszke et al. (2019).