A Neural Autoregressive Approach to Collaborative Filtering
Yin Zheng, Bangsheng Tang, Wenkui Ding, Hanning Zhou
Introduction
Collaborative filtering (CF) is a class of methods for predicting a user’s preference or rating of an item, based on his/her previous preferences or ratings and decisions made by similar users. CF lies at the core of most recommender systems and has attracted increasing attention along with the recent boom of e-commerce and social network systems. The premise of CF is that a person’s preference does not change much over time. A good CF algorithm helps a user discover products or services that suit his/her taste efficiently.
Generally speaking, there is a dichotomy of CF methods: Memory-based CF and Model-based CF. Memory-based CF usually computes the similarities between users or items directly from the rating data, which are then used for recommendation. The explainatbility of the recommended results as well as the easy-to-implement nature of memory-based CF ensured its popularity in early recommender systems (Resnick et al., 1994). However, memory-based CF has faded out due to its poor performance on real-life large-scale and sparse data.
Distinct from memory-based CF, model-based CF learns a model from historical data and then uses the model to predict preferences of users. The models are usually developed with machine learning algorithms, such as Bayesian networks, clustering models and latent semantic models. Complex preference patterns can be recognized by these models, allowing model-based CF to perform better for preference prediction tasks. Among all these models, matrix factorization is most popular and successful, c.f. (Koren et al., 2009; Salakhutdinov & Mnih, 2008; Mackey et al., 2011; Gopalan et al., 2013).
With the recent development of deep learning (Krizhevsky et al., 2012; Szegedy et al., 2014; He et al., 2015), neural network based CF, a subclass of model-based CF, has gained enormous attention. A prominent example is RBM-based CF (RBM-CF) (Salakhutdinov et al., 2007). RBM-CF is a two-layer undirected generative graph model which generalizes Restricted Boltzmann Machine (RBM) to modeling the distribution of tabular data, such as user’s ratings of movies. RBM-CF has shown its power in Netflix prize challenge. However, RBM-CF suffers from inaccuracy and impractically long training time, since training RBM-CF is intractable and one has to rely on variational approximation or MCMC sampling.
Recently, a good alternative to RBM has been proposed by Larochelle & Murray (2011). The so-called Neural Autoregressive Distribution Estimator (NADE) is a tractable distribution estimator for high dimensional binary vectors. NADE computes the conditional probabilities of each element given the other elements to its left in the binary vector, where all conditionals share the same parameters. The probability of the binary vector can then be obtained by taking the product of these conditionals. Unlike RBM, NADE does not incorporate any latent variable where expensive inference is required, in constrast it can be optimized efficiently by backpropagation. NADE together with its variants achieved competitive results on many machine learning tasks (Larochelle & Lauly, 2012; Uria et al., 2013; Zheng et al., 2014b; Uria et al., 2014; Zheng et al., 2014a, 2015).
In this paper, we propose a novel model-based CF approach named CF-NADE, inspired by RBM-CF and NADE models. Specifically, we will show how to adapt NADE to CF tasks and describe how to improve the performance of CF-NADE by encouraging the model to share parameters between different ratings. We also propose a factored version of CF-NADE to deal with large-scale dataset efficiently. As Truyen et al. (2009) observed, preference usually has the ordinal nature: if the true rating of an item by a user is 3 stars in a 5-star scale, then predicting 4 stars is preferred to predicting 5 stars. We take this ordinal nature of preferences into consideration and propose an ordinal cost to optimize CF-NADE. Moreover, we propose a deep version of CF-NADE, which can be optimized efficiently. The performance of CF-NADE is tested on real world benchmarks: MovieLens 1M, MovieLens 10M and Netflix dataset. Experimental results show that CF-NADE outperforms all previous state-of-the-art methods.
Related Work
As mentioned previously, some of the most successful model-based CF methods are based on matrix factorization (MF) techniques, where a prevalent assumption is that the partially observed matrix is of low rank. In general, MF characterizes both users and items by vectors of latent factors, where the number of factors is much smaller than the number of users or items, and the correlation between user and item factor vectors are used for recommendation tasks. Specifically, Billsus & Pazzani (1998) proposed to apply Singular Value Decomposition (SVD) to CF tasks, which is an early work on MF-based CF. Bias MF (Koren et al., 2009) is proposed to improve the performance of SVD by introducing systematic biases associated with users and items. Mnih & Salakhutdinov (2007) extended MF to a probabilistic linear model with Gaussian noise referred to as Probabilistic Matrix Factorization (PMF), and showed that PMF performed better than SVD. Salakhutdinov & Mnih (2008) proposed a Bayesian treatment of PMF, which can be trained efficiently by MCMC methods. Along this line, Lawrence & Urtasun (2009) proposed a non-linear PMF using Gaussian process latent variable models. There are other MF-based CF methods such as (Rennie & Srebro, 2005; Mackey et al., 2011). Recently, Poisson Matrix Factorization (Gopalan et al., 2014b, a, 2013) was proposed, replacing Gaussian assumption of PMF by Poisson distribution. Lee et al. (2013) extended the low-rank assumption by embedding locality into MF models and proposed Local Low-Rank Matrix Approximation (LLORMA) method, which achieved impressive performance on several public benchmarks.
Another line of model-based CF is based on neural networks. With the tremendous success of deep learning (Krizhevsky et al., 2012; Szegedy et al., 2014; He et al., 2015), neural networks have found profound applications in CF tasks. Salakhutdinov et al. (2007) proposed a variant of Restricted Boltzmann Machine (RBM) for CF tasks, which is successfully applied in Netflix prize challenge (Bennett & Lanning, 2007). Recently, Sedhain et al. (2015) proposed AutoRec, an autoencoder-based CF model, which achieved the state-of-the-art performance on some benchmarks. RBM-CF (Salakhutdinov et al., 2007) and AutoRec (Sedhain et al., 2015) are common in that both of them build different models for different users, where all these models share the parameters. Truyen et al. (2009) proposed to apply Boltzmann Machine (BM) on CF tasks, which extends RBM-CF by integrating the correlation between users and between items. Truyen et al. (2009) also extended the standard BM model so as to exploit the ordinal nature of ratings. Recently, Dziugaite & Roy (2015) proposed Neural Network Matrix Factorization (NNMF), where the inner product between the vectors of users and items in MF is replaced by a feed-forward neural network. However, NNMF does not produce convincing results on benchmarks.
Our proposed method CF-NADE, can be generally categorized as a neural network based CF method. CF-NADE bears some similarities with NADE (Larochelle & Murray, 2011) in that both model vectors with neural autoregressive architectures. The crucial difference between CF-NADE and NADE is that CF-NADE is designed to deal with vectors of variable length, while NADE can only deal with binary vectors of fixed length. Though DocNADE (Larochelle & Lauly, 2012) does take inputs with various lengths, it is designed to model unordered sets of words where each element of the input corresponds to a word, while CF-NADE models user rating vectors, where each element corresponds to the rating to a specific item.
NADE for Collaborative Filtering
This section devotes to CF-NADE, a NADE-based model for CF tasks. Specifically, we describe the basic model of CF-NADE in Section 3.1, and propose to improve CF-NADE by sharing parameters between different ratings in Section 3.2. At last, a factored version of CF-NADE is described in Section 3.3 to deal with large-scale datasets.
Suppose that there are items, users, and the ratings are integers from to (-star scale). One practical and prevalent assumption in CF literature is that a user usually rated items might vary between different users, where . To tackle sparsity, similar to RBM-CF (Salakhutdinov et al., 2007), we use a different CF-NADE model for each user and all these models share the same parameters. Specifically, all models have the same number of hidden units, but a user-specific model only has visible units if the user only rated items. Thus, each CF-NADE has only one single training case, which is a vector of ratings that a user gave to his/her viewed items, but all the weights and biases of these CF-NADE’s are tied.
In this paper, we denote the training case for user as , where is a -tuple in the set of permutations of which serves as an ordering of the rated items , and denotes the rating that the user gave to item . For simplicity, we will omit the index of , and focus on a single user-specific CF-NADE in the rest of the paper.
CF-NADE models the probability of the rating vector by the chain rule as:
where denotes the first elements of indexed by .
Similar to NADE (Larochelle & Murray, 2011), CF-NADE models the conditionals in Equation 1 with neural networks. To compute the conditionals in Equation 1, CF-NADE first computes the hidden representation of dimension given as follows:
Then the conditionals in Equation 1 could be modeled as:
where is the score indicating the preference that the user gave rating for item given the previous ratings , and is computed as,
CF-NADE is optimized for minimum negative log-likelihood of (Equation (3)),
averaged over all the training cases. As in NADE, the ordering in CF-NADE must be predefined and fixed during training for each user. Ideally, the ordering should follow the timestamps when the user gave the ratings. In practice, we find that a ordering that is randomly drawn from the set of permutations of for each user yields good results. As Uria et al. (2014) observed, we can think of the models trained with different orderings as different instantiations of CF-NADE for the same user. Section 5 will discuss how to (virtually) train a factorial number of CF-NADE’s with different orderings simultaneously, which is the key to extend CF-NADE to a deep model efficiently.
Once the model is trained, given a user’s past behavior , the user’s rating of a new item can be predicted as
where conditional are computed by Equation 3 along with the hidden representation and score computed as
2 Sharing Parameters Between Different Ratings
In Equations 2 and 4, the connection matrices , and the bias are different for different ratings ’s. In other words, CF-NADE uses different parameters for different ratings. In practice, for a specific item, some ratings can be much more often observed than others. As a result, parameters associated with a rare rating might not be sufficiently optimized. To alleviate this problem, we propose to share parameters between different ratings of the same item.
Particularly, we propose to compute the hidden representation as follows:
Note that, given an item rated by the user, depends on all the weights , . Thus, Equation 9 encourages a solution that is utilized by all the ratings , .
Similarly, the score in Equation 3 is adjusted as
where and are shared by the rating , where .
Sharing parameters between different ratings can again be understood as a kind of regularization, which encourages the model to use as many parameters as possible to explain the data. Experimental results in Section 6.2.1 confirm the advantage of this regularization.
3 Dealing with Large-Scale Datasets
Inspired by RBM-CF (Salakhutdinov et al., 2007) and FixationNADE (Zheng et al., 2014a), we propose to address this problem by factorizing and into products of two lower-rank matrices. Particularly,
Traing CF-NADE with Ordinal Cost
CF-NADE can be trained by minimizing the negative log-likelihood based on conditionals defined by Equation 3. To go one step further, following Truyen et al. (2009), we take the ordinal nature of a user’s preference into consideration. That is, if a user rated an item , the preference of the user to the ratings from to should increase monotonically and the preference to the ratings from to should decrease monotonically. The basic CF-NADE treats different ratings as separate labels, leaving the ordinal information not captured. Here we describe how to equip CF-NADE with an ordinal cost.
Formally, suppose , the ranking of preferences over all the possible ratings under the ordinal assumption can be expressed as:
where denotes the preference of rating over , Equation 13 is omitted if the true rating is ; likewise, Equation 14 is omitted if the true rating is .. Two rankings of ratings, and , can be induced by Equation 13 and 14:
Note that maximizing the conditional in Equation 3 only ensures that the probability of rating is the largest among all possible ratings. To capture the ordinal nature induced by Equations 13 and 14, we propose to compute the conditional as
where is a shorthand for the score introduced in Section 3.1, which indicates the preference to rating of item given the previous context .
Both two products in Equation 17 can be interpreted as the likelihood loss introduced in (Xia et al., 2008) in the context of Listwise Learning To Rank problem. Actually, from the perspective of learning-to-rank, CF-NADE acts as a ranking function which produces rankings of ratings based on previous ratings, where corresponds to the score function in (Xia et al., 2008) and the rankings, and , corresponds to true rankings that we would like CF-NADE to fit. Thus, the conditional computed by Equation 17 is actually the conditional distribution of the rankings and given previous ratings. Put differently, the ranking loss in Equation 17 is defined on the ratings, while other learning-to-rank based CF methods, such as (Shi et al., 2010), are on items, which is the crucial difference.
For the rest of the paper, we denote the negative log-likelihood based on the conditionals computed by Equation 17 as ordinal cost , and denote the negative log-likelihood based on Equation 3 as regular cost . The final cost to optimize the model is then defined as
where is the hyperparameter to determine the weight of . The impact of the hyperparameter on the performance of CF-NADE is discussed in Section 6.2.1.
Extending CF-NADE to a Deep Model
So far we have described CF-NADE with single hidden layer. As suggested by the recent and impressive success of deep neural networks (Krizhevsky et al., 2012; Szegedy et al., 2014; He et al., 2015), extending CF-NADE to a deep, multiple hidden layers architecture could allow us to have better performance. Recently, Uria et al. (2014) proposed an efficient deep extension to original NADE (Larochelle & Murray, 2011) for binary vector observations, which inspires other related deep model (Zheng et al., 2015). Following (Uria et al., 2014), we propose a deep variant of CF-NADE.
As mentioned in Section 3.1, a different CF-NADE model is used for each user and the ordering in is stochastically sampled from the set of permutations of . Training CF-NADE on stochastically sampled orderings corresponds, in expectation, to minimizing the cost in Equation 18 over all possible orderings for each user. As noticed by Uria et al. (2014) and Zheng et al. (2015), training over all possible orderings for CF-NADE implies that for any given context , the model performs equally well at predicting all the remaining items in , since for each item there is an ordering such that it appears at position . This is the key observation to extend CF-NADE to a deep model. Specifically, instead of sampling a complete ordering over all the items, we instead sample a context and perform an update of the conditionals using that context.
The procedure is done as follows. Given a user who has rated items, an ordering is first sampled randomly from the set of permutations of for each update and a vector is generated according to the ordering . Then a split point is randomly drawn from for each update. The split point divides into two parts: and . According to the analysis above, in the new training procedure, is considered as the input of CF-NADE and the training objective is to maximize conditionals for each element in . The cost function with this procedure is
By Equation 19, the model predicts the ratings of each items after the splitting position in the randomly drawn ordering as if it were actually at position . The factors in front of the sum come from the fact that the total number of elements in the sum is and that we are averaging over possible choices for the item at position , similar to (Uria et al., 2014) and (Zheng et al., 2015). Derivation of Equation 19 can be found in the supplementary materials.
In this procedure, a training update relies only on a single hidden representation , more hidden layers can be added into CF-NADE with the computational complexity increased moderately. Particularly, suppose is the hidden representation computed in Equation 2. Then new hidden layers can be added as in a regular deep feed-forward neural network:
Note that CF-NADE with a single hidden layer can also be trained by Equation 19. In practice, Equation 19 can be implemented efficiently on GPUs, hence we use it throughout our experiments for either one hidden layer or multiple-layers architecture.
Experiments
In this section, we test the performance of CF-NADE on 3 real-world benchmarks: MovieLens 1M, MovieLens 10M (Harper & Konstan, 2015) and Netflix dataset (Bennett & Lanning, 2007), which contain , and ratings, respectively. Following LLORMA (Lee et al., 2013) and AutoRec (Sedhain et al., 2015), of the ratings in each of these datasets are randomly selected as the test set, leaving the remaining of the ratings as the training set. Among the ratings in the training set, are used as validation set. We use a default rating of for items without training observations. Prediction error is measured by Root Mean Squared Error (RMSE),
MovieLens 1M dataset contains around million anonymous ratings of approximately movies by users, where each user rated at least items. The ratings in MovieLens 1M dataset are made on a -star scale, with -star increments. MovieLen 10M dataset contains about million ratings of movies by users. The users of MovieLens 10M dataset are randomly chosen and each user rated at least movies. Unlike MovieLens 1M dataset, the ratings in MovieLens 10M are on a -star scale with half-star increments. Thus, the number of rating scales of MovieLens 10M is actually . In our experiments, we rescale the ratings in MovieLens 10M to -star scale with -star increments. Netflix dataset comes from the Netflix challenge prizeThe test set of Netflix prize challenge dataset is not available now. Following Lee et al. (2013) and Sedhain et al. (2015), we split the available trainset of Netflix dataset into train, valid and test sets.. It is massive compared to the previous two, which contains more than million ratings of movies by users. The ratings of Netflix dataset are on -star scale, with -star increments.
2 Experiments on MovieLen 1M Dataset
In this section, we test the performance of CF-NADE on MovieLen 1M dataset. We first evaluate the performance of the ordinal cost described in Section 4 with/without sharing parameters between different ratings as described in Section 3.2. Then we compare several variants of CF-NADE with some strong baselines.
In this section, we evaluate the impact of the ordinal weight in Equation 21 on the performance of CF-NADE. As Sedhain et al. (2015) mentioned, item-based CF outperforms user-based CF, therefore we use item-based CF-NADE (I-CF-NADE) in this section. Distinct from user-based CF-NADE (U-CF-NADE), which builds a different model for each user as we described previously, I-CF-NADE model builds a different CF-NADE model for each item. In other words, the only difference between U-CF-NADE and I-CF-NADE is that the roles of users and items are switched. Comparison between U-CF-NADE and I-CF-NADE can be found in Section 6.2.2.
The configuration of the experiments is as follows. We use a single hidden layer architecture and the number of hidden units is set to , same as AutoRec (Sedhain et al., 2015) and LLORMA (Lee et al., 2013). Adam (Kingma & Ba, 2014) with default parameters (, and ) are utilized to optimize the cost function in Equation 19. The learning rate is set to , the weight decay is set to and we use the tanh activation function.
Let CF-NADE-S denote the variant of CF-NADE model where parameters are shared between different ratings, as described in Section 3.2. Figure 1 shows the superior performance of CF-NADE and CF-NADE-S w.r.t different values of . Effectiveness of parameter sharing and ordinal cost can be justified by observing that: 1) CF-NADE-S always outperforms regular CF-NADE; 2) as the ordinal weight increases, test RMSE of both CF-NADE and CF-NADE-S decrease monotonically. Based on these observations, we will use CF-NADE-S and fix throughout the rest of the experiments.
2.2 Comparing with strong baselines on MovieLens 1M
In this comparison, we compare CF-NADE with other baselines on MoiveLens 1M dataset. During the comparison, the learning rate is chosen on the validation set by cross-validation among , and the weight decay is chosen among . According to Section 6.2.1, the weight of ordinal cost is fixed to and CF-NADE-S is adopted. The model is trained with Adam optimizer and tanh as activation function.
Table 1 shows the performance of CF-NADE-S and baselines. The number of hidden units of CF-NADE is , same as AutoRec (Sedhain et al., 2015) for a fair comparison. One can observe that I-CF-NADE-S outperforms U-CF-NADE-S by a large margin. I-CF-NADE-S with a single hidden layer achieves RMSE of , which is comparable with any strong baseline. Moreover, I-CF-NADE-S with 2 hidden layers achieves RMSE of .
Figure 2 illustrates the performance of I-CF-NADE-S w.r.t the number of hidden units. Increasing the number of hidden units is beneficial, but the return is diminishing. It can also be observed from Figure 2 that deep CF-NADE models achieve better performance than the shallow ones, as expected.
3 Experiments on MovieLens 10M Dataset
Table 2 shows the comparison between CF-NADE and other baselines on MovieLens 10M dataset. U-CF-NADE-S with a single hidden layer has already outperformed the baselines, which achieves RMSE of . The performance of U-CF-NADE-S can be slightly improved by adding another hidden layer. Noticeably, the test RMSE of U-AutoRec is much worse than I-AutoRec, whereas U-CF-NADE-S outperforms I-AutoRec.
4 Experiments on Netflix Dataset
Our final set of experiments are on the massive Netflix dataset, which contains ratings. Similar to Section 6.3, we use the factored version of U-CF-NADE with . The Netflix dataset is so big that we need not add a strong regularization to avoid overfitting and therefore set the weight decay to . Other configurations are the same as in Section 6.3.
Table 3 compares the performance of U-CF-NADE with other baselines. We can see that U-CF-NADE-S with a single hidden layer achieves RMSE of , outperforming all baselines. Another observation from Table 3 is that using a deep CF-NADE architecture achieves a slight improvement over the shallow one, with a test RMSE of .
5 The Complexity and Running Time of CF-NADE
We implement CF-NADE using Theano (Bastien et al., 2012) and Blocks (van Merriënboer et al., 2015), and the code is available at https://github.com/Ian09/CF-NADE. Table 4 shows the running time of one epochExperiments are conducted on a single NVIDIA Titan X Card. as well as the number of parameters used by CF-NADE. For MovieLens 1M dataset, we used the item-based CF-NADE and did not use the factorization method introduced by Sec 3.3, hence the number of parameters for MovieLens 1M is bigger than the other two. Running times in Table 4 include overheads such as transferring data from and to GPU memory for each update. Note that there is still room for faster implementationsIn our implementation, samples are represented as binary matrices, where is the number of items and is the number of rating scales. An entry is assigned only if the user gave a -star to item . Thus, we could use the tensordot operator in Theano and feed CF-NADE with a batch of samples. In the experiments, mini-batch size is set to . One disadvantage of this implementation is that some amount of computational time is spent on unrated items, which can be enormous especially when the data is sparse. .
Conclusions
In this paper, we propose CF-NADE, a feed-forward, autoregressive architecture for collaborative filtering tasks. CF-NADE is inspired by the seminal work of RBM-CF and the recent advancements of NADE. We propose to share parameters between different ratings to improve the performance. We also describe a factored version of CF-NADE, which reduces the number of parameters by factorizing a large matrix by a product of two lower-rank matrices, for better scalability. Moreover, we take the ordinal nature of preference into consideration and propose an ordinal cost to optimize CF-NADE. Finally, following recent advancements of deep learning, we extend CF-NADE to a deep model with moderate increase of computational complexity. Experimental results on three real-world benchmark datasets show that CF-NADE outperforms the state-of-the-art methods on collaborative filtering tasks. all results of this work rely on explicit feedback, namely, ratings explicitly given by users. however, explicit feedback is not always available or as common as implicit feedback (watch, search, browse behaviors) in real-world recommender systems (Hu et al., 2008). Developing a version of CF-NADE tailored for implicit feedback is left for future work.
Acknowledgements
We thank Hugo Larochelle and the reviewers for many helpful discussions.