Collaborative filtering is a technique of predicting the preferences and interests of users from the collections of taste information given by large amount of users. For example, by collecting the ratings of movies by many users, the preferences of users for movies can be predicted from the data. Applications like recommendation system that gives suggestions to users are based on collaborative filtering.
The Netflix Prize is a going contest relevant to collaborative filtering. The destination of this contest is to improve the accuracy of predictions about how much a customer loves a movie. The movie preferences are measured by voting scores 1~5 given by customers. Given the large amount (about 100M) voting scores of movie-customer pairs, one must predict the voting scores of movie-customer pairs in a quiz set with the ground truth unknown. The performance is measured by the Root Mean Square Error (RMSE) between the predicted scores and the actual ones, which are only known by Netflix.
Two main approaches for the Netflix Prize are Singular Value Decomposition (SVD) which extracts the features of movies and customers, and K-Nearest Neighborhood (KNN) which find the customers with similar preferences. SVD is more suitable and hence become a conventional method since the data is large-scaled and sparse, making the computation time-consuming and difficult. Consider m customers and n movies, the votes collected from them can be considered as an m×n (sparse) matrix V, where the value of vij is the score that customer i votes for movie j. An SVD algorithm decomposes V into two matrices U and M with dimensions m×k and k×n, which represent the preferences of customer for some movie features, and the degrees of movies on these features.
Modified SVD Algorithm (2007/7/27)
The main difference between this algorithm and previous work are the update order of SVD. I implemented Simon's algorithm, which update one feature at a time (more precisely, the Nth feature of all movies and all customers), in contrast with updating all features in an iteration. The 1st feature of all movies and customers is update first, and its values are validated by an independent validation data. The feature is updated until the validation RMSE start to increase while the training RMSE still decrease, i.e. overfitting occurs. Then the 2nd feature is updated, then 3rd, in the same manner.
When all features are updated once, another loop is started to update the 1st, 2nd, …, kth feature again, where k is the number of features. The RMSE can be reduce to about 0.9450 after the first loop, but the second and third loop can still decrease the RMSE, and the algorithm is stopped when a whole loop cannot decrease the validation RMSE anymore.
The order of update represents the priorities of features. If all features are updated simultaneously, then they are considered to have equal importance. Otherwise, some of the features will become more significant than others. For example, in human face images, the features representing eyes, nose and mouth may be more important than other features. In Netfilx Prize, the new update algorithm gives better performance than the older one under the same objective function of SVD, with an RMSE difference of 0.02 in both a divided small data (about 0.9300 → 0.9100) and the full data (0.9470 → 0.9267). A problem is that the last several features are hard to update because it is easy to cause overfitting when the forward features are already updated. For example, if k=30, then only the first 15~20 features are updated, and the others remain the same as their starting points.
In future work, there are some possible ideas:
(1) Fusion approach. The SVD algorithm may be combined with classification methods, since it can extract the features of movies and customers. But some strong classification algorithms may be restricted, since the number of votes is enormous.
(2) More about SVD. In the experiments, the small data usually has better RMSE than the full data with a difference of about 0.02. If the full data is divided appropriately, then performing individual SVD algorithms on the divided data may be better than running only one SVD on the full data. But it needs experiments to find the meaning of “appropriately” here.
(3) Time complexity. In a single iteration, this algorithm needs to compute the prediction values of all voting pairs by dot product, since these values are used in the gradients. If k=30, an iteration will contain 3 billion multiplications and additions, but only one feature is updated. The time requirement as well as memory access is still a problem to be solved.
Best RMSE: 0.9267
SVD Algorithm (2007/7/11)
An SVD algorithm with k=10 is used, but instead of performing it on the full data (with 17770 movies and 480189 customers), I divide the voting data into 27 divisions so that each divisions has about the same number of movie and customers. The SVD algorithm is performed on each division, and the performance measure is the prediction RMSE of an independent validation data from the SVD result (matrices U and M). A variance threshold can be use to find the low-variance customer, i.e. the customer whose voting score has low variance. The voting scores of these customers are not used in SVD, and the voting predictions of these customers in test data are predicted by customer average. But this approach does not have significant performance, so the variance threshold is disabled.
The starting point of matrices U, M in SVD is random, but the range is controlled so the dot product of a random movie feature and a random customer feature falls into [1, 5]. A search for better starting point is performed by running the SVD 100 iterations for several random starting points and choosing the result which has the best validation RMSE. This search is time-consuming but has only limited effect, so it may be unnecessary. If the search is disabled, 100 iterations are also run on the random starting point to achieve better U and M values. The iterations in this step are the gradient descent method with line search.
After those iterations, the matrices U and M are continually updated by gradient descent, but the line search is disabled to prevent overfitting. After each 10~50 iterations, the matrices U, M are taken out and the validation RMSE is computed. If the validation RMSE starts to increase then overfitting occurs, and the update is stopped. The matrices U, M which give the best validation RMSE is used for this division in the final prediction of quiz set.
Given the starting points of matrices U, M and the voting scores of data, the SVD algorithm performs gradient descent to update U and M, with or without line search. The objective function is the form:
||UM – V||2proj + reg × ||U||2 + reg × ||M||2
where U and M are the SVD matrices, V is the matrix of voting score, and reg is the regularization coefficient. ||．|| means the L2 norm of matrix so ||．||2 is the sum of square of elements in matrix, and ||．||2 proj means that the sum of square is only performed on the offsets with actual votes, i.e. the offsets of nonzero values in V.
In each iteration, the update is performed by the order of movies: for each movie, the customers who vote this movie are found, and the features of this movie and these customers are updated simultaneously by the gradients of these features. If line search is used, the partial objective function about these features is used as a measure. During a single iteration, a movie feature is updated only once, but a customer feature may be updated many times.
Best RMSE: 0.9470