SGPT: GPT Sentence Embeddings for Semantic Search

Niklas Muennighoff

Introduction

Semantic search consists of two parts: Search refers to finding the top $k$ answers from a document corpus given a query. Semantic refers to understanding the documents and queries beyond keywords. Transformers are the dominant semantic architecture, competing with non-semantic models like BM25. Search applications like Google or Bing rely on transformers to provide semantically relevant results. However, they have been limited to BERT-like encoder-only transformers.

Meanwhile, GPT-like decoder-only transformers have been the focus of recent scaling efforts of up to 540 billion parameters. Increasing language model parameters has been repeatedly shown to improve downstream zero-shot and fine-tuning performance on a variety of language tasks. For example, increasing scale has allowed decoder-only transformers to outperform all encoder-only transformers and catch up with encoder-decoder transformers on the SuperGLUE benchmark.

However, the related fields of semantic search and language embeddings have not been part of the proliferation of decoders. They are dominated by comparatively small encoders, as it remains unclear how to extract semantically meaningful embeddings from decoders and use them for semantic search. Methods to do so are desirable for two reasons:

(1) Taking advantage of the available scale of decoders has the potential to produce new state-of-the-art results in search. Available encoders are orders of magnitude smaller. Google search, for example, processes an estimated 4 billion searches daily, so better search models could have wide-reaching impacts.

(2) Large-scale pre-trained decoders have been reused for different tasks via prompting or fine-tuning. A well-performing method to extract embeddings from billion-parameter decoders may remove the need to train and maintain separate encoder and decoder models. Training just one large decoder and reusing it for search avoids additional cost to the environment.

In this work, we propose SGPT to apply decoder-only transformers to semantic search and extract meaningful sentence embeddings from them. We distinguish four settings: Cross-Encoder vs Bi-Encoder, Symmetric vs Asymmetric. See Figure 1 and §2.

In the Bi-Encoder setting, we propose SGPT-BE using position-weighted mean pooling and contrastive fine-tuning of only bias tensors (BitFit). We show that BitFit is competitive with full fine-tuning performance for both encoders (SBERT) and decoders (SGPT) despite changing <0.1% of pre-trained parameters. When controlling for size, our decoders closely trail the performance of encoders. When scaling up, SGPT-BE-5.8B sets state-of-the-art results on BEIR and USEB for asymmetric and symmetric search.

In the Cross-Encoder setting, we propose SGPT-CE using log probability extraction of pre-trained GPT models. The method is applicable to symmetric or asymmetric search by changing the prompt. At scale, the model sets an unsupervised state-of-the-art on BEIR.

In summary, our contributions are three-fold:

(1) For SGPT-BE in §4, we develop a new pooling method and show the usefulness of bias-only fine-tuning for embeddings. At 5.8B parameters, it produces the best natural language embeddings available by a margin of 7% for the example of semantic search.

(2) For SGPT-CE in §3, we show how to use GPT for search via log probabilities without fine-tuning. At 6.1B parameters, it has the best unsupervised performance on BEIR by a margin of 8%.

(3) We provide free, more performant alternatives to commonly used endpoints, such as OpenAI’s Search and Similarity Embeddings endpoints and the OpenAI Search endpoint. They are available at https://github.com/Muennighoff/sgpt.

Related Work

In this section, we explain two dimensions fundamental to our work: Cross-Encoders vs Bi-Encoders and Symmetric vs Asymmetric Search. We highlight work in those areas relevant to ours.

Cross-Encoders encode query and document at the same time. BERT is used as a Cross-Encoder by separating the query from the document with a $[SEP]$ token. They are then passed through the transformer network together. Each new query requires $k$ forward passes given a corpus of $k$ documents. There is no existing literature on using GPT models as Cross-Encoders; however, we suspect the OpenAI Search API uses a GPT-based Cross-Encoder. We include results based on querying their API, as well as a BERT-based state-of-the-art Cross-Encoder, in our benchmarks.

Bi-Encoders encode query and document separately. SBERT extends BERT to the Bi-Encoder setting via supervised fine-tuning and a pooling operation across the sequence output. The resulting document vectors can be cached. A new query requires only one forward pass through the transformer to produce the query vector. The query vector can then be scored against the cached document vectors with a similarity function. Embeddings from Bi-Encoders can be used for non-search tasks such as clustering or as input features of machine learning models.

While non-semantic models like keyword-based BM25 remain extensively used, the field increasingly focuses on neural models using transformers. Contriever shows the utility of unsupervised contrastive training for pre-trained encoders. Their best model that we compare with also adds supervised fine-tuning as a third training stage. GTR explores the effect of scaling up encoders on semantic search tasks, also in a three-stage training setup. They find that despite keeping the embedding size fixed, more parameters increase the performance of encoders.

Usage of GPT models as Bi-Encoders remains limited. Rather, there has been interest in using them as generators to produce search training data for encoders. Previous work has studied the differences in embeddings produced by various language models including BERT and GPT, and has found GPT-2 embeddings to underperform on various word embedding benchmarks. Concurrently to our work, the first trained GPT-based Bi-Encoder, cpt-text, was proposed. It uses a pre-trained decoder and employs two additional training stages, contrastive unsupervised pre-training and supervised fine-tuning. The resulting models are used for the OpenAI Similarity and Search Embeddings API. Our Bi-Encoders differ in that we simplify the training process to only fine-tuning, use a novel pooling method and only train bias parameters. cpt-text is most similar to our Bi-Encoders, hence we include their results in our benchmarks.
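The Bi-Encoder retrieval flow described in this section (cache document vectors once, embed each new query with a single forward pass, then score with a similarity function) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names `cosine_similarity` and `search` are hypothetical, and the 3-dimensional vectors are toy stand-ins for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, cached_doc_vecs, k):
    """Return the ids of the k cached documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in cached_doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors (assumption): in practice these come from the Bi-Encoder
# and the document vectors are computed once and stored.
cached_doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]
top2 = search(query_vec, cached_doc_vecs, k=2)
```

Only the query vector has to be computed at search time; the per-document cost is one similarity computation rather than one transformer forward pass.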

Cross-Encoders tend to outperform Bi-Encoders, but are slower as vectors cannot be cached and reused. To balance the trade-offs, multi-stage architectures have been proposed. In a two-stage re-ranking setup, the first model processes the entire corpus and the second model is only used on the top $k$ documents returned by the first. In §3, we use Bi-Encoder (BM25) + Cross-Encoder re-ranking.
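The two-stage re-ranking setup can be sketched as below. Both scorers are hypothetical placeholders chosen only so the example runs: `keyword_overlap` stands in for a fast first-stage model such as BM25, and `toy_cross_score` stands in for a slower Cross-Encoder. Neither reproduces the actual models.

```python
def rerank_top_k(query, corpus, first_stage_score, second_stage_score, k):
    """Score the whole corpus cheaply, then re-order only the top k."""
    # Stage 1: the fast scorer sees every document.
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:k]
    # Stage 2: the expensive scorer sees only the k candidates.
    return sorted(
        candidates, key=lambda doc: second_stage_score(query, doc), reverse=True
    )


def keyword_overlap(query, doc):
    """Placeholder first stage: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_cross_score(query, doc):
    """Placeholder second stage: overlap normalized by document length."""
    return keyword_overlap(query, doc) / len(doc.split())


corpus = [
    "a ranking function scores documents for a query and a ranking function can be slow",
    "bm25 is a keyword ranking function",
    "cats sleep most of the day",
]
result = rerank_top_k("ranking function", corpus, keyword_overlap, toy_cross_score, k=2)
```

The second stage can only re-order what the first stage returns, which is why the first-stage model bounds the achievable performance.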

Asymmetric Search means queries and documents are not interchangeable. Finding answers given a question is an asymmetric search problem. Commonly, documents are much longer than queries. We evaluate asymmetric search experiments on BEIR, a recently proposed benchmark consisting of 19 asymmetric search datasets.

Symmetric Search means queries and documents are interchangeable. Finding duplicate questions, where both queries and documents are questions, is a symmetric search problem. We evaluate symmetric search experiments on USEB, Quora from BEIR and STS-B. In Quora, queries are question titles and documents are question texts. They are often the same with average word lengths of 9.53 and 11.44, respectively. Hence, we consider it more of a symmetric search task. We include Quora in both symmetric and asymmetric experiments.

SGPT Cross-Encoder

Given a query $q$ and a document corpus $D$, we are interested in the most likely document $d^{*}$. Using Bayes’ Theorem this can be expressed as:

$$d^{*} = \arg\max_{d \in D} P(d|q) = \arg\max_{d \in D} \frac{P(q|d)P(d)}{P(q)} = \arg\max_{d \in D} P(q|d)P(d)$$

Note that $P(q)$ is irrelevant as it is always the same when taking the $\arg\max$ over $D$. Due to variable document lengths and contents it is easier to compare $P(q|d)$ than $P(d|q)$. We hence compute the joint probability of the query tokens $q_{i},...,q_{n}$ given the document tokens embedded in a prompt $P$ as $p(q_{i},...,q_{n}|p_{1},...,p_{i-1})$, ignoring $P(d)$. As long as $P(d)$ does not vary excessively across the corpus $D$, this simplification should produce reasonable scores.

In practice, we use log probabilities, computed via the log of the softmax of the model output. To have a constant query length $n+1-i$ and avoid abrupt text changes, documents are truncated from the left until the input fits the model’s maximum sequence length. We apply these methods to re-rank the top $k$ documents returned by BM25. While re-ranking with BM25 bottlenecks performance, it speeds up experiments. It is not a necessary part of the architecture and therefore not depicted in Figure 1.

We experiment with publicly available pre-trained decoder transformers with 125M, 1.3B, 2.7B and 6.1B parameters.
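The scoring step can be illustrated with a short sketch: take the log of the softmax of the model output at each position and sum the log probabilities of the query tokens only. This is a hedged illustration under stated assumptions. The helper names `log_softmax` and `query_log_prob` are hypothetical, the logits are hand-written toy values rather than the output of a GPT forward pass, and the vocabulary has only three tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax of a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def query_log_prob(logits_per_position, token_ids, query_start):
    """Sum log p(token_t | tokens before t) over the query tokens only.

    As in a causal language model, logits_per_position[t - 1] is assumed
    to hold the prediction for the token at position t. token_ids is the
    prompt (containing the document) followed by the query, and
    query_start >= 1 is the index of the first query token.
    """
    total = 0.0
    for t in range(query_start, len(token_ids)):
        log_probs = log_softmax(logits_per_position[t - 1])
        total += log_probs[token_ids[t]]
    return total


# Toy input (assumption): one prompt token followed by two query tokens.
token_ids = [0, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
helpful_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]

score_uniform = query_log_prob(uniform_logits, token_ids, query_start=1)
score_helpful = query_log_prob(helpful_logits, token_ids, query_start=1)
```

A document whose prompt makes the query tokens more likely receives a higher (less negative) score, so candidate documents can be ranked by this value.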

3.1.2 Results

We perform a search over 12 prompts using the MS-MARCO dataset as provided in BEIR. The prompts and results are in Appendix §B.1. We select the prompt with the best score, $P_{G}$.

In Table 1, we benchmark the resulting SGPT-CE (SGPT-Cross-Encoder). We compare with OpenAI’s Search endpoint, which is to be distinguished from their Embeddings endpoint. Please refer to Table 6 in the Bi-Encoder section for a benchmark with the OpenAI Embeddings endpoint. We provide parameter estimates for the OpenAI model names in Table 2. We also compare with the current state-of-the-art on BEIR, a BERT-based Cross-Encoder. BM25+CE consists of a pre-trained BERT model that is further fine-tuned on MS-MARCO in a supervised fashion. SGPT-CE consists solely of the pre-trained GPT model. However, SGPT-CE-6.1B has almost 15x more parameters than BM25+CE, significantly increasing latency. In the Re-rank Top 100 setting, the top 100 documents as returned by BM25 are re-ranked by the respective model. While SGPT-CE-6.1B wins on more datasets than the encoder-based state-of-the-art, its average score is worse. This can be alleviated by not using the same prompt $P_{G}$ for all datasets. We show in §3.2 that SGPT-CE-6.1B can beat BM25+CE on Quora by changing the prompt.

In Figure 2, we investigate how performance scales with model size. As we are in a re-ranking setup, the Cross-Encoder performance is bounded by the documents returned by BM25. We provide the BM25 bounds and additional model results in Appendix §A. In a Re-rank Top 10 setting, the model is significantly bottlenecked by BM25. SGPT-CE-6.1B reaches around 80% of the maximum possible performance. We hence observe high jumps in performance for datasets like HotpotQA or TREC-COVID as we move to top 100. In fact, the 0.791 nDCG@10 on TREC-COVID in Table 1 is not possible in a Re-rank Top 10 setting as the bound is at 0.750. From the results, we infer that performance scales both as we re-rank more documents and as we increase model size.

3.2 Symmetric Search

We use the same methods outlined in §3.1.1, but adapt the prompt for symmetric search. We show this on the example of Quora in Table 3. In §2, we explained why Quora is closer to symmetric search than asymmetric search. We search over several prompts on the smaller 125M parameter model and use the best one on the large model. By doing so, SGPT-CE-6.1B improves by 6%, outperforming all Quora results in Table 1. We hypothesize that further customizing the prompt for each dataset could significantly improve performance. However, we highlight that searching prompts for all possible input types may not be feasible in practice and is not considered true few-shot learning. Hence, the prompt we find for Quora may not generalize well to other symmetric search datasets, a key limitation of this method.

SGPT Bi-Encoder

As in §3.1.1, we first experiment with decoder transformers that have only gone through unsupervised pre-training. In the Bi-Encoder setting, a pooling operation is commonly applied to the model’s hidden states to reduce them to a vector whose size is independent of sequence length. SBERT showed that a MEAN pooling mechanism outperforms [CLS] and MAX strategies for a BERT encoder. Due to the causal attention mask in an auto-regressive decoder transformer, tokens do not attend to future tokens as they do in an encoder transformer. Hence, only the last token has attended to all tokens in a sequence. To account for this information mismatch, we propose to give later tokens a higher weight using a position-weighted mean pooling method:

$$v = \sum_{i=1}^{S} w_{i} h_{i} \quad \text{where} \quad w_{i} = \frac{i}{\sum_{j=1}^{S} j}$$

where $S$ is the sequence length, $h_{i}$ the $i$th hidden state and $v$ the query or document embedding. We compare weighted mean pooling with last token pooling, where the hidden state of the final token is the embedding, and regular mean pooling.
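The three pooling strategies compared here can be sketched as follows. The sketch assumes weights that grow linearly with the 1-based token position and are normalized to sum to one, which is one way to give later tokens a higher weight. The function names are hypothetical, the hidden states are toy values, and padding is ignored (a batched implementation would additionally need an attention mask).

```python
def weighted_mean_pool(hidden_states):
    """Position-weighted mean: token i (1-based) gets weight i / (1 + ... + S)."""
    seq_len = len(hidden_states)
    denom = seq_len * (seq_len + 1) / 2  # closed form of 1 + 2 + ... + S
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for i, h in enumerate(hidden_states, start=1):
        weight = i / denom
        for d in range(dim):
            pooled[d] += weight * h[d]
    return pooled


def mean_pool(hidden_states):
    """Regular mean pooling: every token gets weight 1 / S."""
    seq_len = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / seq_len for d in range(dim)]


def last_token_pool(hidden_states):
    """Last token pooling: the final hidden state is the embedding."""
    return list(hidden_states[-1])


# Toy hidden states (assumption): 3 tokens, 2 dimensions each.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_weighted = weighted_mean_pool(hidden)  # weights 1/6, 2/6, 3/6
v_mean = mean_pool(hidden)
v_last = last_token_pool(hidden)
```

All three return a vector of the hidden dimension regardless of sequence length; they differ only in how much each position contributes.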

We follow recent work and perform supervised contrastive learning with in-batch negatives. Given matching query-doc pairs $\{q^{(i)},d^{(i)}\}_{i=1}^{M}$, we optimize the cost function:

$$J_{CL}(\theta) = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(\tau \cdot \sigma(f_{\theta}(q^{(i)}), f_{\theta}(d^{(i)})))}{\sum_{j=1}^{M} \exp(\tau \cdot \sigma(f_{\theta}(q^{(i)}), f_{\theta}(d^{(j)})))}$$

where $f_{\theta}$ is the SGPT model outputting a fixed-size vector, $\sigma$ cosine similarity and $\tau$ a temperature parameter set to $20$ in our experiments. We use GradCache to train with large batch sizes in a limited memory setting. We train on SNLI and MNLI. We limit the model sequence length to 75 tokens during both training and inference.
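A contrastive objective with in-batch negatives can be sketched as below. This is an illustration under assumptions, not the training code: the objective is written as a mean negative log-likelihood to be minimized, $\tau$ is applied as a multiplier on the cosine similarity, and the embeddings are toy 2-dimensional vectors. The names `cosine` and `in_batch_contrastive_loss` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(query_vecs, doc_vecs, tau=20.0):
    """Mean cross-entropy where doc i is the positive for query i.

    Every other document in the batch acts as a negative for query i.
    """
    batch_size = len(query_vecs)
    total = 0.0
    for i in range(batch_size):
        scores = [tau * cosine(query_vecs[i], d) for d in doc_vecs]
        # Stable log-sum-exp over all documents in the batch.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += -(scores[i] - log_z)
    return total / batch_size


# Toy embeddings (assumption): matching pairs point in the same direction.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[1.0, 0.0], [0.0, 1.0]]
docs_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(queries, docs_aligned)
loss_swapped = in_batch_contrastive_loss(queries, docs_swapped)
```

The loss is near zero when each query is most similar to its own document and grows when a different document in the batch scores higher, which is what pushes matching pairs together during fine-tuning.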

We fine-tune only bias parameters and freeze the rest of the model. This has been recently proposed as BitFit for BERT encoders. It has been shown to be competitive with full fine-tuning in various scenarios. Table 4 shows the number of parameters trained for BitFit models. Because far fewer parameters receive gradient updates, BitFit significantly reduces the GPU memory and time required per step. Further, adding a BitFit checkpoint to an instance with an existing full model only requires storing the biases that differ. An instance already serving a 22.5GB fp32 GPT-J-6B model requires an additional 22MB of storage to serve an SGPT-5.8B-bitfit model.
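The selection rule behind bias-only fine-tuning can be sketched without any deep learning framework: mark a parameter as trainable only if its name identifies it as a bias. The parameter names and sizes below are hypothetical and describe a toy layer, not an actual GPT model, and the suffix match on `"bias"` is an assumption about the naming scheme.

```python
def select_bitfit_parameters(param_sizes):
    """Split a name -> size mapping into trainable biases and frozen rest."""
    trainable = {
        name: size for name, size in param_sizes.items() if name.endswith("bias")
    }
    frozen = {
        name: size for name, size in param_sizes.items() if name not in trainable
    }
    return trainable, frozen


# Hypothetical toy layer: parameter name -> number of scalar parameters.
toy_params = {
    "layer0.attn.weight": 4096,
    "layer0.attn.bias": 64,
    "layer0.mlp.weight": 16384,
    "layer0.mlp.bias": 256,
}

trainable, frozen = select_bitfit_parameters(toy_params)
trainable_fraction = sum(trainable.values()) / sum(toy_params.values())
```

With PyTorch, the same rule could be applied by iterating over `model.named_parameters()` and setting `requires_grad` to `True` only for the selected names; a checkpoint then only needs to store the tensors in `trainable`.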

4.1.2 Results

Figure 3 shows average precisions on USEB across different methods and layers. Similar to previous work, we find that in the unsupervised setting, decoder transformers (GPT) strongly underperform encoders (BERT). However, after fine-tuning on the same dataset with the same hyperparameters, decoders (SGPT) with 125M parameters closely trail the 110M parameter encoder (SBERT) for the 12th layer. Weighted mean pooling outperforms mean and last token pooling for SGPT 125M. When increasing SGPT size ten-fold, the last layer performance (24th layer) increases beyond that of SBERT models. The performance difference of weighted mean pooling compared to mean pooling further widens for SGPT 1.3B.

Table 5 provides performance on the individual USEB datasets, Quora and STS-B. STS-B scores should not be the focus of comparison due to drawbacks highlighted in prior work. Despite training fewer than 0.1% of parameters, BitFit models are within ±2% of fully fine-tuned ones. BitFit degrades performance more for decoders than encoders. This could be due to the missing bias parameters, see Table 4. Prior work highlights the importance of the query bias vector for BERT, which is not present for SGPT models. SGPT-5.8B-weightedmean-nli-bitfit sets an out-of-domain state-of-the-art on USEB, but is outperformed by models trained in-domain. We observed performance gains by increasing the training batch size. SGPT-5.8B-weightedmean-nli-bitfit is trained with a batch size of 1024. In Appendix §A, we provide results using a lower batch size and additional ablations. Results for Ada and Curie were obtained by querying the OpenAI Similarity Embeddings API in March 2022. They correspond to the cpt-text similarity models and we provide their parameters in Table 2.

4.2 Asymmetric Search

If not otherwise specified, we follow the same setup as in §4.1.1. For asymmetric search, we train on MS-MARCO. We limit the model sequence length to 300 tokens during both training and inference. We follow concurrent work and add enclosing brackets to help the model distinguish between query and document. We embed the tokens of query $q$ in square brackets as $[q_{0-n}]$. For documents, we use curly brackets: $\{d_{0-n}\}$. We add the token ids of the brackets to the already tokenized text to avoid the tokens intermingling. We refer to these special brackets as $specb$.
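Adding bracket token ids to already tokenized text can be sketched as follows. The token ids are hypothetical placeholders rather than ids from a real tokenizer vocabulary, the helper name `add_special_brackets` is an assumption, and so is the choice to truncate the text so that both brackets still fit within the length limit.

```python
def add_special_brackets(token_ids, open_id, close_id, max_len=None):
    """Wrap tokenized text in bracket token ids.

    If max_len is given, the text is truncated first so that the result,
    including both brackets, has at most max_len tokens (an assumption
    about how the limit interacts with the brackets).
    """
    ids = list(token_ids)
    if max_len is not None:
        ids = ids[: max(max_len - 2, 0)]
    return [open_id] + ids + [close_id]


# Hypothetical ids standing in for "[", "]", "{" and "}".
QUERY_OPEN, QUERY_CLOSE = 1001, 1002
DOC_OPEN, DOC_CLOSE = 1003, 1004

query_ids = add_special_brackets([5, 6, 7], QUERY_OPEN, QUERY_CLOSE)
doc_ids = add_special_brackets([8, 9, 10, 11], DOC_OPEN, DOC_CLOSE, max_len=4)
```

Because the brackets are appended as ids after tokenization, the tokenizer cannot merge them with neighbouring characters of the text.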

4.2.2 Results

Table 6 benchmarks SGPT-BE-5.8B (SGPT-5.8B-weightedmean-msmarco-specb-bitfit) on BEIR with: (a) BM25, a non-semantic fast baseline; (b) SGPT-CE-6.1B from §3; (c) BM25+CE, the current overall state-of-the-art on BEIR; (d) TAS-B, the original Bi-Encoder state-of-the-art on BEIR; (e) Contriever, which uses a similar contrastive training scheme but with an encoder transformer; (f) GTR-XXL, the current Bi-Encoder state-of-the-art on BEIR with 4.8 billion parameters using the BERT-like encoder transformer of T5; (g) cpt-text, a concurrently proposed GPT-like decoder transformer architecture. Corresponding parameter estimates are in Table 2.

SGPT-5.8B achieves the best average nDCG@10 both on the BEIR subset selected in prior work and on the full BEIR benchmark. It outperforms the roughly same-sized cpt-text-L and the 30x larger cpt-text-XL by 8.1% and 4.2%, respectively. Yet, cpt-text models have gone through an additional unsupervised training stage and are fully trained. SGPT-BE-5.8B fine-tunes just 700K parameters, 0.0004% of the parameters fine-tuned for cpt-text-XL. See Table 2 for sizes. We suspect much of the difference to come from the cpt-text model’s inferior last token pooling as shown in Figure 3. Further, we suspect that the benefits of the additional unsupervised contrastive pre-training stage diminish when followed by supervised contrastive fine-tuning.

SGPT-BE-5.8B improves on the overall state-of-the-art, a Cross-Encoder, by 3%. It improves on the previously best sentence embeddings (Bi-Encoder) on BEIR, GTR-XXL, by 7%. However, these improvements come at a significant cost. GTR-XXL has 20% fewer parameters and its embeddings have 768 dimensions. SGPT-BE-5.8B produces embeddings with 4096 dimensions, hence requiring about 5x more storage. It took the model six days on one Nvidia A100 GPU to encode the entire BioASQ corpus with 15M documents and an average of 200 words each. Its comparatively low performance on BioASQ may be improved by increasing the sequence length limit beyond 300, although this requires additional compute. For SGPT-CE-6.1B, the sequence length limit was 2048 for the combined prompt on all datasets.

The high performance on TREC-COVID for SGPT models could be due to the different pre-training datasets. The SGPT pre-training dataset, The Pile, contains data until mid-2020. This may give the models an information advantage on Covid-19. Lastly, we highlight that on Quora SGPT-BE-5.8B-msmarco is outperformed by SGPT-BE-5.8B-nli from Table 5. Given our classification of Quora as a symmetric search task in §2, this supports our overall distinction between asymmetric and symmetric search. We advise users of our models to classify their tasks as symmetric or asymmetric and use the appropriate model. For non-classifiable embedding tasks, both may work, but we recommend experimenting with embeddings from the symmetric models in §4.1 first.
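The storage cost of the larger embeddings discussed in this section can be made concrete with a short calculation. The sketch assumes embeddings stored as 4-byte fp32 values with no index overhead or compression; the helper name `embedding_storage_gb` is hypothetical, and the corpus size is the 15M-document BioASQ figure from the text.

```python
def embedding_storage_gb(num_docs, dim, bytes_per_value=4):
    """Raw storage in GB (10^9 bytes) for num_docs embeddings of size dim."""
    return num_docs * dim * bytes_per_value / 1e9


NUM_DOCS = 15_000_000  # BioASQ corpus size stated in the text

sgpt_gb = embedding_storage_gb(NUM_DOCS, 4096)  # SGPT-BE-5.8B dimensions
gtr_gb = embedding_storage_gb(NUM_DOCS, 768)    # GTR-XXL dimensions
ratio = sgpt_gb / gtr_gb
```

Under these assumptions the 4096-dimensional embeddings take roughly 246 GB against roughly 46 GB for 768 dimensions, a ratio of 4096 / 768 ≈ 5.3, in line with the "about 5x" figure.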

Conclusion and Future Work

This work presented SGPT. Building on SBERT, we proposed modifications to GPT models to use them as Cross- or Bi-Encoders for semantic search.

SGPT-BE uses position-weighted mean pooling and fine-tuning of only bias tensors. At scale, it produces new state-of-the-art sentence embeddings. The model can be used for semantic search or other embedding tasks. We recommend using SGPT-BE-5.8B when compute and storage are of high availability and maximum performance is desired.

SGPT-CE extracts log probabilities of pre-trained GPT models to produce unsupervised state-of-the-art search results. The setup presented can only be used for semantic search. Storage can be limited, but compute should be of high availability for SGPT-CE-6.1B. The prompt and max re-rank parameter can be adjusted depending on performance and latency requirements.

Future research could fine-tune a GPT Cross-Encoder on MS-MARCO, similar to the BM25+CE model. We suspect that this should outperform the presented non-fine-tuned SGPT-CE model as well as SGPT-BE if enough documents are re-ranked. Further, the combination of SGPT with GPT for generative search results could be interesting. Possibly, SGPT embeddings could be injected into GPT models to generate answers. Lastly, a detailed study of the disadvantages of the missing biases in large GPT models could help decide whether to include them in the training of future large language models.

Acknowledgments and Disclosure of Funding

We thank Constantin Eichenberg and Samuel Weinbach for insightful discussions and valuable feedback throughout the project. We thank Robert Baldock, Marco Bellagente and Koen Oostermeijer for reading drafts of this paper. This work has been supported by OpenAI under the academic access program.

Appendix A Additional results

Appendix B Task and Experimental Details

B.2 Licenses

Datasets from the BEIR benchmark are licensed under various licenses available in Appendix E of their paper. USEB datasets are licensed under an Apache 2.0 license (https://github.com/UKPLab/useb). To the best of our knowledge, these datasets do not contain private, personally identifiable information but may contain offensive content. OpenAI models are licensed by OpenAI API to customers via a non-exclusive, non-sublicensable, non-transferable, non-assignable, revocable license (https://openai.com/api/policies/terms/).

B.3 Computational Cost

We use the OpenAI API to evaluate Search and Embeddings endpoints. We used tokens equivalent to around 5,000 USD. For SGPT experiments, we use one node with 8x NVIDIA A100 Tensor Core GPUs with 40GB memory each. For SGPT-CE, the evaluation on the entire BEIR suite took around two weeks for the 5.8B model. For SGPT-BE, symmetric search training took 21 hours, while asymmetric training took 60 hours for the 5.8B model. Our cluster was provided by Oracle.