Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks
Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, Marc Najork
Introduction
Unlike in classification or regression, the main goal of a ranking problem is not to assign a label or a value to individual items, but, given a list of items, to produce an ordering of the items in that list in such a way that the utility of the entire list is maximized. In other words, in ranking we are more concerned with the relative ordering of the relevance of items—for some notion of relevance—than their absolute magnitudes.
Modeling relativity in ranking has been extensively studied in the past, especially in the context of learning-to-rank (Liu, 2009). Learning-to-rank aims to learn a scoring function that maps feature vectors to real-valued scores in a supervised setting. Scores computed by such a function induce an ordering of items in the list. The majority of existing learning-to-rank algorithms learn a parameterized function by optimizing a loss that acts on pairs of items (pairwise) or a list of items (listwise) (Burges et al., 2005; Burges et al., 2006; Cao et al., 2007; Xu and Li, 2007). The idea is that such loss functions guide the learning algorithm to optimize preferences between pairs of items or to maximize a ranking metric such as NDCG (Joachims, 2002; Taylor et al., 2008; Burges, 2010), thereby indirectly modeling relative relevance.
Though effective, most existing learning-to-rank frameworks are restricted to the paradigm of univariate scoring functions: the relevance of an item is computed independently of other items in the list. This setting could prove sub-optimal for ranking problems for two main reasons. First, univariate functions have limited power to model cross-item comparison. Consider an ad hoc document retrieval scenario where a user is searching for the name of an artist. If all the results returned by the query (e.g., “calvin harris”) are recent, the user may be interested in the latest news or tour information. If, on the other hand, most of the query results are older (e.g., “frank sinatra”), it is more likely that the user seeks information on artist discography or biography. Thus, the relevance of each document depends on the distribution of the whole list. Second, user interaction with search results shows a strong tendency to compare items. Prior research suggests that preference judgments by comparing a pair of documents are faster to obtain, and are more consistent than absolute ratings (Ye and Doermann, 2013). Moreover, it has been shown that better predictive capability is achieved when user actions are modeled in a relative fashion (e.g., SkipAbove) (Joachims et al., 2005; Borisov et al., 2016). These studies indicate that users compare a document with its surrounding documents prior to a click, and that a ranking model that uses the direct comparison mechanism can be more effective, as it mimics user behavior more closely.
In particular, our contributions can be summarized as follows:
We motivate and formulate multivariate scoring functions for learning-to-rank;
We present Groupwise Scoring Function (GSF) as an instance of the class of multivariate functions that is parameterized by deep neural networks;
We explore the conditions under which a GSF reduces to existing learning-to-rank models;
We demonstrate, through empirical evaluation on proprietary and public datasets, the improvements obtained by GSFs and discuss their potential for learning-to-rank tasks;
To encourage research in this space and to allow for reproducibility of the reported results, we open source our implementation within the TF Ranking library (Pasumarthi et al., 2018).
Related Work
Learning-to-rank refers to algorithms that model the ranking problem with machine learning techniques. In general, ranking is formulated as a score-and-sort problem with the objective of constructing a scoring function where scores computed by such a function induce an ordering of items in a list. Existing learning-to-rank algorithms (Joachims, 2006; Friedman, 2001; Burges, 2010; Burges et al., 2005; Cao et al., 2007; Xia et al., 2008) mainly differ by two factors: (a) the parameterization of the scoring function (e.g., linear functions (Joachims, 2006), boosted weak learners (Xu and Li, 2007), gradient-boosted trees (Friedman, 2001; Burges, 2010), support vector machines (Joachims, 2006; Joachims et al., 2017), and neural networks (Burges et al., 2005)); and (b) the loss function (e.g., pointwise (Gey, 1994), pairwise (Burges, 2010; Burges et al., 2005; Joachims, 2006) and listwise (Cao et al., 2007; Xia et al., 2008)). Virtually all of the existing algorithms, however, yield a univariate scoring function in the end where the score of an item is computed in isolation and independently of other items in the list. To the best of our knowledge, there are only a few exceptions.
First, the score regularization technique (Diaz, 2007) and the CRF-based model (Qin et al., 2008) use document similarities to smooth the initial ranking scores or enrich query-document pair feature vectors. When computing relevance scores, however, both methods take only one document at a time.
The second exception is a bivariate scoring function (Dehghani et al., 2017) that takes a pair of documents as input and predicts the preference of one over the other. It is easy to show that the bivariate scoring function is a special case of our proposed framework.
Third is a group of neural learning to rank algorithms (Ai et al., 2018; Bello et al., 2018) and click model (Borisov et al., 2016) that builds an recurrent neural network over document lists. They, however, either focus on a re-ranking problem or use a pointwise loss to optimize user clicks. In contrast, our method can be applied to arbitrary number of documents with any types of ranking loss functions.
Search result diversification is another area of related work. Diversification algorithms maximize objectives that take subsets of documents into account. These include maximal marginal relevance (Carbonell and Goldstein, 1998) and subtopic relevance (Agrawal et al., 2009). Recently, several deep learning algorithms were proposed with losses corresponding to those objectives (Jiang et al., 2017; Xia et al., 2016). In contrast, our work focuses on improving relevance, not diversity, by way of cross-document comparisons.
Another area of related research is the work on pseudo-relevance feedback (Manning et al., 2008) where queries are expanded based on the top retrieved documents in a first round. The idea is that expanded queries lead to improvements in a second-stage retrieval. In this paper, we consider document relationships in the learning-to-rank setting, not retrieval, and do not require two rounds of retrieval. We also do not assume a pre-existing initial ordering of the document list.
Finally, note that our work is orthogonal and complementary to the recently proposed neural IR techniques (Pang et al., 2017; Guo et al., 2016; Mitra et al., 2017; Dehghani et al., 2017). These techniques focus on advanced representations of document and query text but employ standard loss and scoring functions. On the other hand, our work concerns the nature of the scoring functions while employing a relatively simple query-doc representation.
Problem Formulation
where denotes the dimension of .
A score obtained from depends only on its argument. In other words, fixing and changing any or all other documents in the list to (for ) does not affect the output of .
Learning and evaluating a multivariate scoring function in practice is, however, nontrivial. In the discussion above, we made a simplifying assumption that , the number of documents in a list, is constant across all training samples. As is common in learning-to-rank settings, however, that is often not the case and in fact the length of is arbitrary and varies across training or evaluation samples. It is therefore not immediately clear how one may construct a generic multivariate function. In the following section, we address these challenges and introduce an instance of multivariate scoring functions that is suitable for the task of learning-to-rank and is further trained and evaluated in an efficient manner.
Groupwise Scoring Functions
As noted earlier, we parameterize our functions using deep neural networks. Feed-forward neural networks have widely been applied to learning-to-rank problems (Edizel et al., 2017; Huang et al., 2013; Zamani et al., 2017). The reasons we believe a deep neural network fits well into our framework are two-fold. First, compared to tree models, neural networks scale well to high-dimensional inputs. This is important because a GSF takes documents as input where each document is a vector of an arbitrary and potentially large number of features. Second, neural networks arguably handle sparse features such as text more naturally whereas other models require extensive feature engineering. As such we believe a deep neural network is the right candidate for the task of learning a GSF.
To begin the construction, we need to define an input layer. Conceptually, a document can be represented as a concatenation of two subsets of features: the embedding features for sparse textual features (e.g., for document titles) and the dense features (e.g., document static scores or various match scores (Guo et al., 2016)). For simplicity, we construct the input layer by concatenating all documents. Specifically, let
Note that in practice the input layer can be extended to include document-independent ”context” features (such as query embeddings) and need not be limited to document-derived features.
Given the above input layer, we build a multi-layer feed-forward network with 3 hidden layers as follows:
where and denote the weight matrix and the bias vector in the -th layer, is an activation function, which in this work is the ReLU function: .
Our groupwise scoring function is thus defined as:
where and are the weight vector and the bias in the output layer. The output layer of the network consists of units, each producing a score for each of the documents.
We note that in this work we wish to keep the design of our input layer and network architecture simple as these details, while important and consequential, are not germane to the topic of this work. We leave the exploration of more sophisticated representation of groups of input documents and advanced layers as future work.
2. Extension to Arbitrarily Long Lists
The domain of the function presented in the previous section is with fixed. As noted earlier, in learning-to-rank, it is often the case that the list size (i.e., number of documents retrieved for a query) varies between queries. That important detail poses a challenge when designing and training a GSF.
Addressing this challenge by brute-force, one may set to be the corpus size and subsequently zero-pad input lists during training and inference. To state the obvious, the resulting network clearly does not scale to real-world corpora. Moreover, given the enormity of the parameter space, training such a network becomes prohibitive and any resulting model is unlikely to be effective.
Let denote a set of all possible permutations of size of the documents in , and let be an element of that set. A permutation can be understood as a group of documents. In our proposed method, we compute on for all . The vector of values contains the scores of all documents relative to other documents in that group. Group scores are subsequently used to compute a final score for all documents. To explain that, it helps to define the following function:
where we use to denote the position of in . The final score is then calculated by the following equation:
Figure 1 illustrates one such in a simplified setting where is bivariate and is a list of 3 documents.
3. Efficient Training and Inference
One caveat of the extended GSF is the factorial growth of the space of permutations . For large values of , the set grows so intractably large that computing on the resulting groups and aggregating group scores by Equation 6 quickly become prohibitive: assuming the computational complexity of is such a scoring paradigm has a complexity of .
To reduce the complexity of GSFs, we propose to substitute the summation in Equation 6 with an expectation as follows:
The expectation in Equation (7) can be approximated effectively using Monte Carlo methods (Robert and Casella, 2005). In our implementation, we use the following sampling recipe: From each training sample with document list , we form groups by taking sub-sequences of a randomly shuffled version of .
Such down-sampling substantially reduces the time complexity to . It is easy to show that, because each appears in exactly groups, each document is equally likely to be compared with other documents in the list. Moreover, a document’s position in the group is also uniformly distributed. Given enough training data, a GSF trained using this sampling strategy asymptotically approaches a GSF trained with all permutations and is further invariant to document order in the input list.
4. Loss Function
where is a normalizing factor, and ’s are the projection of scores onto the probability simplex using Softmax:
An important property of this loss function is that it can be incorporated into an unbiased learning-to-rank framework. Specifically, it is easy to extend this loss to factor in Inverse Propensity Weights (Joachims et al., 2017; Wang et al., 2018) to counter position bias in click logs. The IPW-enabled variant of the loss in Equation (8) is as follows:
where is the Inverse Propensity Weight of the result, and where it is assumed that for click logs and that only one document is clicked (i.e., ).
We use the above loss in the experiments reported in this work and leave the exploration of more advanced loss functions as future work. We will also defer a theoretical analysis of the cross-entropy loss or its extensions in the context of GSFs to a future study.
5. Relationship with Existing Models
Let us go through this exercise by presenting a configuration that transforms our GSF model to ListNet (Cao et al., 2007). Given the univariate nature of in the new configuration, define as
ListNet optimizes the cross-entropy loss between two (”top-one” probability) distributions: One obtained from relevance labels and another defined over scores . The following expression defines the ListNet loss:
Using the above loss in Equation (1) completes the transformation of a GSF to the ListNet model.
Note that the ListNet loss is almost identical to the loss used in our GSF model as shown in Equation (8). ListNet, however, projects labels to the probability simplex using the Softmax function whereas in GSF labels are simply normalized. When , we calculate zero loss in the GSF setup while this is not the case in the standard ListNet loss; the ListNet loss is always non-zero. This difference becomes important if one wishes to train a model in an unbiased learning-to-rank framework (Wang et al., 2016; Joachims et al., 2017) where propensity weights cannot be computed for non-clicked documents (Wang et al., 2018). As such, having a non-zero loss for non-clicked documents proves to be a significant limitation of the ListNet loss in the context of unbiased learning.
Experimental Setup
GSFs have theoretically interesting properties but their effectiveness in practice remains to be verified empirically. In the remainder of this paper, we set out to do just that by evaluating our proposed method on two datasets. To conduct experiments, we have implemented the GSF model in Tensorflow, a standard deep learning platform. In order to facilitate reproducibility of the reported results, we open source our code within the the TF Ranking library (Pasumarthi et al., 2018). Moreover, in this section, we give a detailed description of our experimental design, setup, and model hyper-parameters.
We compare our method with a number of existing learning-to-rank algorithms that fall into two categories: DNN models and tree-based models. Table 1 summarizes a list of DNN models we use as baselines in our experiments. In this table, PointDNN and RankNet represent the existing DNN models with a univariate scoring function in the learning-to-rank literature. BiDNN is a recently proposed model that takes a pair of documents and jointly computes preference scores (Dehghani et al., 2017). As for tree-based models, we primarily use the state-of-the-art MART and LambdaMART (Burges, 2010) algorithms as a baseline to compare with. In general, tree-based models cannot efficiently handle high-dimensional sparse features such as document text. Therefore, where we compare DNNs and tree-based models we do so by training a model with dense features only. Furthermore, where possible, we also explore a hybrid approach in which predictions from the DNN models are used as input features for tree-based models. Such a hybrid approach enables us to incorporate sparse features into tree-based models; we compare hybrid models with both standalone DNN and tree-based models in our experiments.
For completeness, Table 2 summarizes the different GSF variants considered in the following sections.
2. Datasets
We conduct a first set of experiments on a click dataset that is obtained from search logs of one of the largest commercial email search engines. In this service, a maximum of 6 results are returned and presented to users in an overlay. The overlay disappears after a click and the clicked result is then displayed. As a result, at most one click is obtained per query session. For this dataset, we discard all sessions that do not contain a click. For sessions with a click, we keep all 6 displayed documents and their click/no-click is recorded as relevance labels. This process results in approximately 150 million sessions in total. We sampled 5 million sessions to construct a held-out test set and used the rest for training and validation with a 9 : 1 ratio. To train BiDNN and BiGSF, we sample all pairs where one document is clicked.
The features in this dataset consist of both dense and sparse features. The dense features include query-document matching features like BM25. These types of features are the primary features used in traditional learning-to-rank algorithms (Liu, 2009). Recently, sparse features were shown to be effective through embedding in an end-to-end deep neural network model (Dehghani et al., 2017). Our click dataset contains n-grams from query strings and document subjects as sparse features. The average of the embedding vectors for n-grams in a query or document subject is used as the feature representation.
The second dataset used in our experiments is the publicly available MSLR-WEB30K (Qin and Liu, 2013). This is a large-scale learning-to-rank set that contains 30,000 queries. On average there are 120 documents per query and each document has 136 numeric features. All documents are labeled with graded relevance from 0 to 4 with larger labels indicating a higher relevance. We evaluate the models on Fold 1 of this dataset. Results obtained on the other folds are similar.
3. Hyperparameters and Training
We build the DNN models using a 3-layer feed-forward network. On the Email dataset, hidden layer sizes are set as 256, 128 and 64 for , and respectively. We set the learning rate to 0.1 and training batch size to 100. For sparse features, we set the embedding dimension to 20. Larger embedding dimensions (e.g., 100), learning rates (e.g., 0.2, 0.3), and layer sizes (e.g., 512 or 1024) were tested but no significant difference was observed. For this dataset, we use unbiased learning-to-rank techniques to overcome click bias (Joachims et al., 2017; Wang et al., 2018) and to that end, we optimize the weighted variant of the cross-entropy loss as shown in Equation (10) during training.
In models trained on Web30K, hidden layers have 64, 32, and 16 units instead with batch normalization between consecutive layers. A learning rate of 0.005 is used and training batch size is set to 128. These hyper-parameters were found to be effective through fine-tuning on the validation set. We train the model for 30,000 steps and evaluate the final model. Finally, when aggregating losses from queries in a mini-batch, a query’s loss is weighted by the sum of the relevance grade of its documents.
For both datasets, we use Adagrad to optimize the objective.
4. Evaluation
For experiments on the Email dataset, we report an Inverse Propensity Weight (IPW) enabled variant of mean reciprocal rank (MRR) (Wang et al., 2016). Such a weighted metric allows us to correct for the position bias that exists in click logs. Let denote the number of test sessions and be the rank of the clicked document for the session, then weighted MRR is calculated as follows:
where is the IPW of the clicked document for the session.
For experiments on the Web30K dataset, we run 10 trials of every model configuration and report mean Normalized Discounted Cumulative Gain (Järvelin and Kekäläinen, 2002) at rank positions 1, 5, and 10 along with confidence intervals. Note that when computing NDCG, queries with no relevant documents are discarded from the evaluation set. Also, each trial may produce a different model given the same hyper-parameters due to the stochastic nature of network initialization, as well as batch-level and query-level shuffling of documents.
Experimental Results
In this section, we report the results of our experiments. We first examine the effect of group sizes on GSF models. We then compare GSFs with the state-of-the-art learning-to-rank algorithms.
As discussed in Sections 4.2 and 4.3, while a GSF can easily and efficiently extend to a variable list size , the group size must be fixed before the construction of the model. To study the effect has on resultant models, we conduct experiments with different configurations on both the Email and Web30K datasets.
On the Email dataset, we train a PointDNN model (a univariate scoring function with pointwise loss, see Table 1) as baseline. We then measure improvements over this model, as indicated by WMRR, of GSF models trained with different group sizes. We repeat these experiments for two settings: (a) all features experiments use both sparse query and document textual features as well as numerical features; and, (b) dense features experiments use only the dense numerical features. Results are illustrated in Figure 2.
From Figure 2, we can see that a GSF trained with all features reaches its peak performance when , and GSF with dense features reaches its peak when . We also observe that the ranking quality decreases slightly when the group size becomes larger. We believe this observation can be explained by the fact that feed-forward networks usually are sensitive to the input order. As group size increases, the number of permutations for a group of documents grows rapidly. When this happens, because of the particular sampling process we use to form groups from a document list (see Section 4.3 for details), the approximation of the expectation in Equation (7) becomes less accurate.
On the Web30K dataset, we measure the performance of GSF models in terms of NDCG at rank positions 1, 5, and 10 and report these metrics for various group sizes. Figure 3 illustrates the results for groups of size 1, 2, 4, 8, 16, 32, 64, and 128. We observe an upward trend as we increase the group size . The models with group size are approximately 3% better than those with group size 1. Noting that there are only dense features in the Web30K dataset, this indicates that the extension of univariate scoring functions to multivariate scoring functions could be particularly useful for learning-to-rank models with dense features. Once again, for models with very large group sizes (), we observe that NDCG plateaus or drops slightly, which can be explained by inadequate sampling and the growth of the space of permutations.
2. Comparison with Baseline DNNs
We are interested in the relative performance of GSF models when compared with other baseline DNN algorithms. On the Email dataset, we measure the gain in WMRR over the PointDNN method obtained by PairGSF (univariate GSF with pairwise loss), BiGSF (bivariate GSF), and finally GSF models with group sizes 1 and 2—we denote the last two models as GSF(1) and GSF(2) for brevity. To provide a reference point for how well GSF models perform, we also report the gain from the BiDNN model. The results of these comparisons on Email data is illustrated in Figure 4.
All five models achieve significant improvements over the PointDNN baseline. Among them, GSF(2) yields the highest WMRR for both ”all” and ”dense” feature settings. For example, using all features GSF(2) achieves a 2.5% improvement over the PointDNN baseline. This improvement is significantly better than the gain from BiDNN, an improvement of less than 1.5%.
If we consider dense features only, BiDNN, BiGSF, and GSF(2) models—all bivariate functions—lead to better results than GSF(1), a univariate function. This suggests that scoring documents jointly proves particularly effective when only dense features are available.
We also compared PairGSF, BiGSF, GSF(1) and GSF(2) in terms of NDCG@5 on the Web30k dataset. Results are shown in Table 4(a) which confirm again that the GSF is indeed more effective than the univariate and bivariate scoring functions.
3. Comparison with Tree-based Models
We next compare the proposed GSF models with tree-based models in both a standalone and a hybrid approach. In the hybrid setting—henceforth, referred to as LambdaMART+GSF—the output of the GSF model is used as a feature in LambdaMART. We use LambdaMART as reference because it has been shown to yield state-of-the-art performance in public learning-to-rank competitions (Chapelle and Chang, 2011).
Table 3 shows the results we obtained on the Email dataset. For scalability reasons, we use an internal implementation of LambdaMART on this dataset. As LambdaMART cannot natively handle raw textual features, we only report the relative improvement over LambdaMART with dense features. Based on the results in Figure 4, we use GSF(2) as the representative GSF model for this experiment.
From Table 3, we see that GSF significantly outperforms LambdaMART in (a) dense features regime (where GSF slightly outperforms LambdaMART), and (b) all features regime (where the performance gap is much more significant). This demonstrates the importance of incorporating raw textual features and the effectiveness of GSF models in leveraging them. Furthermore, the hybrid LambdaMART+GSF approach achieves an even better performance, reaching gains as large as 3.42% over LambdaMART, a statistically significant improvement over all other models in Table 3. This validates the complementary nature of our method to LambdaMART and the benefits of the hybrid approach.
Table 4 shows the results on the Web30K dataset. For reproducibility, we use several learning-to-rank models implemented in the open-source Ranklib toolkithttps://sourceforge.net/p/lemur/wiki/RankLib/ as baselines.
We observe that all GSF variants in Table 4(a), outperform RankNet and RankSVM by a very large margin. A comparison of GSFs with MART and LambdaMART is shown in Table 4(b), where we report mean NDCG at various rank positions over 10 trials along with 95% confidence intervals. From the table, it is clear that the GSF setting with group size yields statistically significant improvements over MART at all NDCG cut-offs. On the other hand, GSF(64) falls short of LambdaMART as measured by NDCG@1, is on par in terms of NDCG@5 (i.e., confidence intervals overlap), and performs significantly better than LambdaMART as indicated by NDCG@10. For completeness, we select the trial with the highest NDCG@1 on the validation set and measure its NDCG@1 on the test set. We repeat this for NDCG@5 and NDCG@10 and report the results in Table 4(c). The conclusions from Table 4(b) still hold.
The results from Table 4 are interesting. We believe the reason LambdaMART performs better than GSF at NDCG@1 is a result of the differences between loss functions: in the existing GSF setup, we use the cross-entropy loss which is not position-dependent, whereas LambdaMART’s loss is designed to take position into account.
Our observations from a comparison of GSFs with tree-based models lead us to believe that the ranking quality of GSFs is at least on par with state-of-the-art tree-based models. As the number of training examples increases by orders of magnitude and when sparse textual features are present in a feature set, GSFs yield higher quality models and prove more scalable.
Discussion and Future Work
We began this work by stating a hypothesis, that the relative relevance of an item would be more accurately estimated if relevance scores for all items were computed jointly. Experiments conducted in the last section shed light on that hypothesis and the questions raised earlier in this work. The results are encouraging.
GSFs, while not yet a fully mature deep learning framework, provide a blueprint for designing multivariate scoring functions for ranking. Analogous to the use of Recurrent Neural Networks in Natural Language Processing and Convolutional Neural Networks in Computer Vision, we believe GSFs are inspired by and are more appropriate for ranking, where relativity plays a large role. Moreover, GSFs incorporate local feature distributions by simultaneously considering multiple candidate documents, mimicking user behavior more closely. Thus, we believe that GSFs provide an opportunity for the advancement of learning-to-rank research using deep learning.
There are, for example, many components that warrant a closer look. A naïve concatenation of a list of input documents, as done in this work, may not be effective at preserving documents’ structure and may lead to a loss of signals useful for comparison of documents. A feed-forward network may not be appropriate for capturing similarities or differences between documents. Finally, while the cross-entropy loss proved effective in practice, (a) it lacks theoretical justification and (b) a metric-driven loss may lead to better overall performance. We plan to pursue this direction of research and continue to improve our understanding of multivariate scoring and ranking functions. In order to facilitate and share this research, we open source GSFs within the TF Ranking library (Pasumarthi et al., 2018).
Acknowledgments
This work would not be possible without the support provided by the TF-Ranking team.