SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval

Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant

Introduction

The release of large pre-trained language models like BERT (Devlin et al., 2018) has shaken up Natural Language Processing and Information Retrieval. These models have shown a strong ability to adapt to various tasks by simple fine-tuning. At the beginning of 2019, Nogueira and Cho (2019) achieved state-of-the-art results, by a large margin, on the MS MARCO passage re-ranking task, paving the way for LM-based neural ranking models. Because of strict efficiency requirements, these models were initially used as re-rankers in a two-stage ranking pipeline, where first-stage retrieval (or candidate generation) is conducted with bag-of-words (BOW) models (e.g. BM25) that rely on inverted indexes. While BOW models remain strong baselines (Yang et al., 2019), they suffer from the long-standing vocabulary mismatch problem, where relevant documents might not contain terms that appear in the query. Thus, there have been attempts to replace standard BOW approaches with learned (neural) rankers. Designing such models poses several challenges regarding efficiency and scalability: there is therefore a need for methods where most of the computation can be done offline and online inference is fast. Dense retrieval with approximate nearest neighbors search has shown impressive results (Xiong et al., 2021; Qu et al., 2021; Lin et al., 2021; Hofstätter et al., 2021), but can still benefit from BOW models (e.g. by combining both types of signals), due to the absence of explicit term matching. Hence, there has recently been a growing interest in learning sparse representations for queries and documents (Zamani et al., 2018; Dai and Callan, 2019; Nogueira et al., 2019; Zhao et al., 2020; Bai et al., 2020; Gao et al., 2021; Mallia et al., 2021; Formal et al., 2021). By doing so, models can inherit the desirable properties of BOW models, such as exact matching of (possibly latent) terms, efficiency of inverted indexes, and interpretability. Additionally, by modeling implicit or explicit (latent, contextualized) expansion mechanisms, similarly to standard expansion models in IR, these models can reduce the vocabulary mismatch.

In this paper, we build on the SPLADE model (Formal et al., 2021) and propose several modifications that bring gains in terms of effectiveness or efficiency: (1) by simply modifying the SPLADE pooling mechanism, we are able to increase effectiveness by a large margin; (2) we also propose an extension of the model without query expansion. Such a model is inherently more efficient, as everything can be pre-computed and indexed offline, while providing results that are still competitive; (3) finally, we use distillation techniques (Hofstätter et al., 2020) to boost SPLADE performance, leading to close to state-of-the-art results on the MS MARCO passage ranking task as well as on the BEIR zero-shot evaluation benchmark (Thakur et al., 2021).

Related Works

Dense retrieval based on BERT Siamese models (Reimers and Gurevych, 2019) has become the standard approach for candidate generation in Question Answering and IR (Guu et al., 2020; Karpukhin et al., 2020; Xiong et al., 2021; Qu et al., 2021; Lin et al., 2021; Hofstätter et al., 2021). While the backbone of these models remains the same, recent works highlight the critical aspects of the training strategy to obtain state-of-the-art results, ranging from improved negative sampling (Lin et al., 2021; Hofstätter et al., 2021) to distillation (Hofstätter et al., 2020; Lin et al., 2021). ColBERT (Khattab and Zaharia, 2020) pushes things further: the postponed token-level interactions allow the model to be applied efficiently for first-stage retrieval, benefiting from the effectiveness of modeling fine-grained interactions, at the cost of storing embeddings for each (sub)term, which raises concerns about the actual scalability of the approach for large collections. To the best of our knowledge, very few studies have discussed the impact of using approximate nearest neighbors (ANN) search on IR metrics (Boytsov, 2018; Tu et al., 2020). Due to the moderate size of the MS MARCO collection, results are usually reported with an exact, brute-force search, therefore giving no indication of the effective computing cost.

Motivated by the success of BERT, there have been attempts to transfer the knowledge from pre-trained LMs to sparse approaches. DeepCT (Dai and Callan, 2019, 2020a, 2020b) focused on learning contextualized term weights in the full vocabulary space, akin to BOW term weights. However, as the vocabulary associated with a document remains the same, this type of approach does not solve the vocabulary mismatch, as acknowledged by the use of query expansion for retrieval (Dai and Callan, 2019). A first solution to this problem consists in expanding documents using generative approaches such as doc2query (Nogueira et al., 2019) and doc2query-T5 (Nogueira and Lin, 2019) to predict expansion words for documents. The document expansion adds new terms to documents, hence fighting the vocabulary mismatch, and also repeats existing terms, implicitly performing re-weighting by boosting important terms.

Recently, DeepImpact (Mallia et al., 2021) combined the expansion from doc2query-T5 with the re-weighting from DeepCT to learn term impacts. These expansion techniques are however limited by the way they are trained (predicting queries), which is indirect in nature and limits their progress. A second solution to this problem, chosen by recent works (Bai et al., 2020; MacAvaney et al., 2020; Zhao et al., 2020; Formal et al., 2021), is to estimate the importance of each term of the vocabulary implied by each term of the document (or query), i.e. to compute an interaction matrix between the document or query tokens and all the tokens from the vocabulary. This is followed by an aggregation mechanism (roughly sum for SparTerm (Bai et al., 2020) and SPLADE (Formal et al., 2021), max for EPIC (MacAvaney et al., 2020) and SPARTA (Zhao et al., 2020)) that computes an importance weight for each term of the vocabulary, for the full document or query.

However, EPIC and SPARTA (document) representations are not sparse enough by construction, unless resorting to top-$k$ pooling, contrary to SparTerm, for which fast retrieval is thus possible. Furthermore, SparTerm does not include an explicit sparsity regularization (as SNRM does), which hinders its performance. SPLADE, however, relies on such regularization, as well as other key changes, that boost both the efficiency and the effectiveness of this type of approach, providing a model that learns both expansion and compression in an end-to-end manner. Furthermore, COIL (Gao et al., 2021) proposed to revisit exact-match mechanisms by learning dense representations per term to perform contextualized term matching, at the cost of increased index size.

Sparse Lexical representations for first-stage ranking

In this section, we first describe in detail the SPLADE model recently introduced by Formal et al. (2021).

SPLADE predicts term importance over the BERT WordPiece vocabulary ($|V|=30522$) based on the logits of the Masked Language Model (MLM) layer. More precisely, let us consider an input query or document sequence (after WordPiece tokenization) $t=(t_{1},t_{2},...,t_{N})$, and its corresponding BERT embeddings $(h_{1},h_{2},...,h_{N})$. We consider the importance $w_{ij}$ of the token $j$ (vocabulary) for a token $i$ (of the input sequence):

$$w_{ij} = \text{transform}(h_{i})^{T} E_{j} + b_{j}, \quad j \in \{1, ..., |V|\} \tag{1}$$

where $E_{j}$ denotes the BERT input embedding for token $j$, $b_{j}$ is a token-level bias, and $\text{transform}(\cdot)$ is a linear layer with GeLU activation and LayerNorm. Note that Eq. (1) is equivalent to the MLM prediction, thus it can also be initialized from a pre-trained MLM model. The final representation is then obtained by summing importance predictors over the input sequence tokens, after applying a log-saturation effect (Formal et al., 2021):

$$w_{j} = \sum_{i \in t} \log\left(1 + \text{ReLU}(w_{ij})\right) \tag{2}$$
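As an illustration of this aggregation step, the following minimal sketch applies the log-saturation and sum pooling to a small matrix of logits. It assumes the logits $w_{ij}$ have already been produced by the MLM head, and that the saturation is $\log(1+\text{ReLU}(\cdot))$ as in the original SPLADE formulation; the function names and toy values are hypothetical and are not taken from the released code.

```python
import math


def log_saturate(x):
    """log(1 + ReLU(x)): negative logits map to zero, large ones are dampened."""
    return math.log1p(max(x, 0.0))


def sum_pool(token_logits):
    """Aggregate per-token logits w_ij into one weight w_j per vocabulary term.

    token_logits: list of N rows (one per input token), each holding |V| floats.
    """
    vocab_size = len(token_logits[0])
    return [
        sum(log_saturate(row[j]) for row in token_logits)
        for j in range(vocab_size)
    ]


# Toy example: 2 input tokens, vocabulary of 3 terms (illustrative values only).
logits = [
    [2.0, -1.0, 0.0],
    [0.5, -3.0, 1.0],
]
weights = sum_pool(logits)
```

Terms whose logits are negative for every input token (the second column here) receive a weight of exactly zero, which is what makes the resulting representation sparse.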

Let $s(q,d)$ denote the ranking score obtained via dot product between $q$ and $d$ representations from Eq. (2). Given a query $q_{i}$ in a batch, a positive document $d_{i}^{+}$, a (hard) negative document $d_{i}^{-}$ (e.g. coming from BM25 sampling), and a set of negative documents in the batch (positive documents from other queries) $\{d_{i,j}^{-}\}_{j}$, we consider a contrastive loss, which can be interpreted as the maximization of the probability of the document $d_{i}^{+}$ being relevant among the documents $d_{i}^{+}$, $d_{i}^{-}$ and $\{d_{i,j}^{-}\}$:

$$\mathcal{L}_{\texttt{rank-IBN}} = -\log \frac{e^{s(q_{i},d_{i}^{+})}}{e^{s(q_{i},d_{i}^{+})} + e^{s(q_{i},d_{i}^{-})} + \sum_{j} e^{s(q_{i},d_{i,j}^{-})}} \tag{3}$$

The in-batch negatives (IBN) sampling strategy is widely used for training image retrieval models, and has been shown to be effective in learning first-stage rankers (Karpukhin et al., 2020; Qu et al., 2021; Lin et al., 2021).
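The contrastive loss with in-batch negatives can be sketched as follows. This is only an illustration on plain Python lists: the helper names are hypothetical, representations are small dense vectors for readability, and averaging the per-query losses over the batch is an assumption rather than something stated in the text.

```python
import math


def dot(u, v):
    """Dot-product ranking score s(q, d) between two representations."""
    return sum(a * b for a, b in zip(u, v))


def ibn_loss(queries, positives, hard_negatives):
    """Contrastive loss with in-batch negatives, averaged over the batch.

    For query i, the candidates are its positive, its hard negative, and the
    positives of all other queries in the batch (the in-batch negatives).
    """
    losses = []
    for i, q in enumerate(queries):
        pos_score = dot(q, positives[i])
        scores = [pos_score, dot(q, hard_negatives[i])]
        scores += [dot(q, positives[j]) for j in range(len(queries)) if j != i]
        # Numerically stable log-sum-exp of all candidate scores.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_denominator - pos_score)
    return sum(losses) / len(losses)


# Toy batch of two queries (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[2.0, 0.0], [0.0, 2.0]]
hard_negatives = [[0.5, 0.0], [0.0, 0.5]]
loss = ibn_loss(queries, positives, hard_negatives)
```

The loss decreases as the positive document's score grows relative to the hard negative and to the other queries' positives.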

Learning sparse representations

Overall loss

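By jointly optimizing the model in Eq. (2) with ranking and regularization losses, SPLADE combines the best of both worlds for end-to-end training of sparse, expansion-aware representations of documents and queries:

$$\mathcal{L} = \mathcal{L}_{\texttt{rank-IBN}} + \lambda_{q} \mathcal{L}_{\texttt{reg}}^{q} + \lambda_{d} \mathcal{L}_{\texttt{reg}}^{d} \tag{5}$$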

where $\mathcal{L}_{\texttt{reg}}$ is the sparse FLOPS regularization from Eq. (4). We use two distinct regularization weights ($\lambda_{q}$ and $\lambda_{d}$) for queries and documents, which allows us to put more pressure on the sparsity for queries, which is critical for fast retrieval.
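To make the regularized objective concrete, here is a minimal sketch of how the terms could be combined. It assumes the FLOPS regularizer takes the form given by Paria et al. (2020), i.e. the sum over vocabulary terms of the squared mean activation across the batch; the function names and toy numbers are hypothetical.

```python
def flops_regularizer(batch_weights):
    """Sum over vocabulary terms of the squared mean weight across the batch.

    batch_weights: list of representations, each a list of |V| non-negative
    weights (assumed form of the FLOPS regularizer, after Paria et al., 2020).
    """
    n = len(batch_weights)
    vocab_size = len(batch_weights[0])
    return sum(
        (sum(rep[j] for rep in batch_weights) / n) ** 2
        for j in range(vocab_size)
    )


def overall_loss(rank_loss, query_weights, doc_weights, lambda_q, lambda_d):
    """Ranking loss plus separately weighted query and document regularizers."""
    return (
        rank_loss
        + lambda_q * flops_regularizer(query_weights)
        + lambda_d * flops_regularizer(doc_weights)
    )


# Same total mass, different distribution over terms (illustrative values).
concentrated = flops_regularizer([[2.0, 0.0], [2.0, 0.0]])
spread = flops_regularizer([[2.0, 0.0], [0.0, 2.0]])
total = overall_loss(1.0, [[1.0]], [[2.0]], lambda_q=0.1, lambda_d=0.01)
```

In the toy example, activating the same term in every representation is penalized more heavily than spreading the same weight over different terms, which pushes towards evenly distributed posting lists.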

2. Pooling strategy

We propose to replace the sum in Eq. (2) with a $\max$ pooling operation:

$$w_{j} = \max_{i \in t} \log\left(1 + \text{ReLU}(w_{ij})\right) \tag{6}$$

The model now bears more similarities with SPARTA and EPIC, and to some extent ColBERT. As shown in the experiments section, this change considerably improves SPLADE performance. In the following, max pooling is the default configuration for SPLADE, and the corresponding model is referred to as SPLADE-$\max$.
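The difference between the two pooling strategies can be seen on a toy example. The sketch below assumes the same $\log(1+\text{ReLU}(\cdot))$ saturation as in the original SPLADE formulation; the function names and values are hypothetical.

```python
import math


def saturate(x):
    """log(1 + ReLU(x)) applied to a single logit."""
    return math.log1p(max(x, 0.0))


def sum_pool(token_logits):
    """Original SPLADE pooling: sum saturated logits over input tokens."""
    vocab_size = len(token_logits[0])
    return [sum(saturate(r[j]) for r in token_logits) for j in range(vocab_size)]


def max_pool(token_logits):
    """SPLADE-max pooling: keep the strongest saturated logit per term."""
    vocab_size = len(token_logits[0])
    return [max(saturate(r[j]) for r in token_logits) for j in range(vocab_size)]


# Three input tokens, two vocabulary terms (illustrative values only).
# Term 0 is weakly activated by every token, term 1 strongly by one token.
logits = [
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 3.0],
]
summed = sum_pool(logits)
maxed = max_pool(logits)
```

With sum pooling, a term weakly activated by many tokens can outweigh a term strongly activated by a single token; with max pooling, the weight no longer depends on how many input tokens activate the term.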

3. SPLADE document encoder

In addition to the max pooling operation, we consider a document-only version of SPLADE. In this case, there is neither query expansion nor query term weighting, and the ranking score is simply given by:

$$s(q,d) = \sum_{j \in q} w_{j}^{d} \tag{7}$$

Such an extension offers an interesting efficiency boost: because the ranking score solely depends on the document term weights, everything can be pre-computed offline, and inference cost is consequently reduced, while still offering competitive results as shown in the experiments. We refer to this model as SPLADE-doc.
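A minimal sketch of document-only scoring over an inverted index is given below. Document term weights are assumed to be pre-computed offline; the dictionary-based index, the function names, and the choice to count each distinct query term once are assumptions made for illustration, not a description of the actual implementation.

```python
from collections import defaultdict


def build_inverted_index(doc_weights):
    """Map each term to a posting list of (doc_id, weight) pairs.

    doc_weights: dict doc_id -> dict term -> pre-computed weight.
    Only non-zero weights are stored.
    """
    index = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, weight in weights.items():
            if weight > 0.0:
                index[term].append((doc_id, weight))
    return index


def score_query(query_terms, index):
    """Sum the document weights of the query terms; no query-side model."""
    scores = defaultdict(float)
    for term in set(query_terms):  # assumption: each distinct term counts once
        for doc_id, weight in index.get(term, ()):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy collection with hypothetical term weights.
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.8, "index": 0.5},
    "d2": {"dense": 1.5, "retrieval": 0.6},
}
index = build_inverted_index(docs)
ranking = score_query(["sparse", "retrieval"], index)
```

At query time only tokenization and posting-list lookups are needed, which is where the reduced inference cost comes from.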

4. Distillation and hard negatives

We also incorporate distillation into our training procedure, following the improvements shown in (Hofstätter et al., 2020). The distillation training is done in two steps: (1) we first train both a SPLADE first-stage retriever and a cross-encoder reranker (using https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2 as the pre-trained checkpoint) on the triplets generated by (Hofstätter et al., 2020); (2) in the second step, we generate triplets using SPLADE trained with distillation (thus providing harder negatives than BM25), and use the aforementioned reranker to generate the scores needed for the Margin-MSE loss. We then train a SPLADE model from scratch using these triplets and scores. The result of the second step is what we call DistilSPLADE-$\max$.
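For reference, the Margin-MSE loss of Hofstätter et al. (2020) compares the student's score margin between the positive and negative passage with the teacher's margin. The sketch below is a plain-Python illustration with hypothetical names and toy scores; in this setting the student would be SPLADE and the teacher the cross-encoder reranker.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list with one score per training triplet.
    """
    squared_errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(squared_errors) / len(squared_errors)


# The student matches the teacher's margin despite a different score scale.
same_margin = margin_mse([3.0], [1.0], [10.0], [8.0])

# Two triplets with mismatched margins (illustrative values only).
mismatch = margin_mse([3.0, 2.0], [1.0, 2.0], [4.0, 5.0], [1.0, 3.0])
```

Because only margins are compared, the student is free to use a different score range than the teacher.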

Experimental setting and results

We trained and evaluated our models on the MS MARCO passage ranking dataset (https://github.com/microsoft/MSMARCO-Passage-Ranking) in the full ranking setting. The dataset contains approximately $8.8$M passages, and hundreds of thousands of training queries with shallow annotation ($\approx 1.1$ relevant passages per query on average). The development set contains $6980$ queries with similar labels, while the TREC DL 2019 evaluation set provides fine-grained annotations from human assessors for a set of $43$ queries (Craswell et al., 2020).

We initialized the models with the DistilBERT-base checkpoint. Models are trained with the Adam optimizer, using a learning rate of $2 \times 10^{-5}$ with linear scheduling and a warmup of $6000$ steps, and a batch size of $124$. We keep the best checkpoint using MRR@10 on a validation set of $500$ queries, after training for $150$k iterations, using an approximate retrieval validation set similar to (Hofstätter et al., 2021). For the SPLADE-doc approach, we simply train for $50$k steps and select the last checkpoint. We consider a maximum length of $256$ for input sequences. In order to mitigate the contribution of the regularizer at the early stages of training, we follow (Paria et al., 2020) and use a scheduler for $\lambda$, quadratically increasing $\lambda$ at each training iteration, until a given step ($50$k in our case), from which it remains constant. Typical values for $\lambda$ fall between $10^{-4}$ and $10^{-1}$. For storing the index, we use a custom implementation based on Python arrays, and we rely on Numba (Lam et al., 2015) to parallelize retrieval. Models are trained using PyTorch (Paszke et al., 2019) and HuggingFace transformers (Wolf et al., 2020), on $4$ Tesla V100 GPUs with $32$GB memory. We made the code public at https://github.com/naver/splade.
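The schedule for $\lambda$ described above can be sketched as follows. The exact functional form, $\lambda_{t} = \lambda \cdot (t/T)^{2}$ for $t < T$, is an assumption consistent with "quadratically increasing until a given step"; the function and parameter names are hypothetical.

```python
def lambda_at_step(step, lambda_max, ramp_steps=50_000):
    """Regularization weight at a given training step.

    Grows quadratically from 0 to lambda_max over ramp_steps iterations
    (assumed form), then stays constant.
    """
    if step >= ramp_steps:
        return lambda_max
    return lambda_max * (step / ramp_steps) ** 2


# Illustrative values for a target weight of 0.1.
schedule = [lambda_at_step(s, 0.1) for s in (0, 25_000, 50_000, 150_000)]
```

Halfway through the ramp the regularizer contributes only a quarter of its final weight, so the ranking loss dominates early training.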

Evaluation

We report Recall@1000 for both datasets, as well as the official metrics MRR@10 and NDCG@10 for the MS MARCO dev set and TREC DL 2019 respectively. Since we are essentially interested in the first retrieval step, we do not consider re-rankers based on BERT, and we compare our approach to first-stage rankers only; results reported on the MS MARCO leaderboard are thus not comparable to the results presented here. We compare to the following sparse approaches: (1) BM25, (2) DeepCT (Dai and Callan, 2019), (3) doc2query-T5 (Nogueira and Lin, 2019), (4) SparTerm (Bai et al., 2020), (5) COIL-tok (Gao et al., 2021) and (6) DeepImpact (Mallia et al., 2021), as well as state-of-the-art dense approaches ANCE (Xiong et al., 2021), TCT-ColBERT (Lin et al., 2021) and TAS-B (Hofstätter et al., 2021), reporting results from corresponding papers.
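For readers less familiar with these metrics, the sketch below shows how MRR@10 and Recall@1000 can be computed for a single query with binary relevance labels. The function names and toy ranking are hypothetical; reported numbers are averages of these per-query values over the evaluation set.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of the relevant documents found within the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


# Toy ranking for one query with two relevant passages.
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p9", "p5"}
mrr = mrr_at_k(ranked, relevant)
recall = recall_at_k(ranked, relevant)
```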

BEIR

Finally, we verify the zero-shot performance of SPLADE using a subset of datasets from the BEIR benchmark (Thakur et al., 2021), which encompasses various IR datasets for zero-shot evaluation. We use only a subset because some of the datasets (namely CQADupstack, BioASQ, Signal-1M, TREC-NEWS, Robust04) are not readily available. Results are displayed in Table 2 (NDCG@10). We compare against the best performing model from the original benchmark paper (Thakur et al., 2021), namely ColBERT (Khattab and Zaharia, 2020), and the two best performing models from the rolling benchmark (https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns), namely tuned BM25 and TAS-B (Hofstätter et al., 2021). We also report the SPLADE evaluation against these baselines.

1. Impact of max pooling

First, on MS MARCO and TREC, max pooling brings almost $2$ points in MRR@10 and NDCG@10 compared to the SPLADE baseline. SPLADE-$\max$ thus becomes competitive with COIL and DeepImpact. In addition, Figure 1 shows that SPLADE-$\max$ is consistently better than SPLADE in terms of effectiveness-efficiency trade-off. SPLADE-$\max$ also improves performance on the BEIR benchmark (cf. Table 2).

2. Document expansion

Our document encoder model with max pooling is able to reach the same performance as the previous SPLADE model, outperforming doc2query-T5 on MS MARCO. As this model has no query encoder, it is more efficient, e.g. in terms of latency. Figure 2 illustrates how we can balance efficiency (in terms of the average size of document representations) with effectiveness. For relatively sparse representations, we are able to obtain performance on par with approaches like doc2query-T5 (e.g. MRR@10=$29.6$ for a model with an average of $19$ non-zero weights per document). In addition, the model is straightforward to train and apply to a new document collection: a single forward pass is required, as opposed to multiple inferences with beam search for doc2query-T5.

3. Distillation

By training with distillation, we are able to considerably improve the performance of SPLADE, as seen in Table 1. From Figure 1, we observe that distilled models bring large improvements for higher values of FLOPS ($0.368$ MRR@10 for $\approx 4$ FLOPS), but are still very efficient in the low-FLOPS regime ($0.35$ MRR@10 for $\approx 0.3$ FLOPS). Furthermore, DistilSPLADE-$\max$ is able to outperform all other methods on most datasets of the BEIR benchmark (cf. Table 2).

Conclusion

In this paper, we have built on the SPLADE model by reconsidering its pooling mechanism, and by using standard training techniques such as distillation for neural IR models. First, our experiments have shown that the max pooling technique indeed provides a substantial improvement. Second, the document encoder is an interesting model for faster retrieval conditions. Finally, the distilled SPLADE model reaches close to state-of-the-art results on MS MARCO and TREC DL 2019, while clearly outperforming recent dense models on zero-shot evaluation.