**Collaborative Filtering**

Collaborative filtering is a technique for predicting the
preferences and interests of users from collections of taste information
contributed by a large number of users. For example, by collecting movie ratings
from many users, user preferences for movies can be predicted from
the data. Applications such as recommendation
systems, which give suggestions to users, are based on collaborative filtering.

The Netflix Prize is an ongoing contest
relevant to collaborative filtering. The goal of this contest is
to improve the accuracy of predictions about how much a customer likes a movie.
Movie preferences are measured by rating scores from 1 to 5 given by customers.
Given a large number (about 100M) of rating scores for movie-customer pairs, one must
predict the scores of movie-customer pairs in a quiz set whose ground
truth is unknown. Performance is measured by the Root Mean Square Error (RMSE)
between the predicted scores and the actual ones, which are known only to
Netflix.
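The RMSE criterion can be sketched in a few lines of numpy; the sample ratings below are made up for illustration:

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Square Error between two rating vectors."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# e.g. rmse([3, 4, 5], [3, 5, 3]) == sqrt((0 + 1 + 4) / 3)
```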

Two main
approaches for the Netflix Prize are Singular Value Decomposition (SVD), which
extracts the features of movies and customers, and K-Nearest Neighbors (KNN),
which finds customers with similar preferences. SVD is more suitable, and
has become the conventional method, since the data is large-scale and sparse,
making the computation time-consuming and difficult. Consider m customers and n
movies: the collected votes can be viewed as an *m*×*n*
(sparse) matrix *V*, where the value
*v _{ij}* is the score that
customer *i* gives to movie *j*.
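In the SVD view, each customer and each movie gets a vector of *k* features, and a predicted score is the dot product of the two. A toy sketch with made-up sizes and random features:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 5, 3          # customers, movies, features (toy sizes)

U = rng.uniform(0.5, 1.5, size=(m, k))  # customer features (rows)
M = rng.uniform(0.5, 1.5, size=(k, n))  # movie features (columns)

# The predicted score of customer i for movie j is the dot product
# of row i of U and column j of M; the full prediction matrix is U @ M.
pred = U @ M
assert np.isclose(pred[2, 3], U[2] @ M[:, 3])
```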

**Progress**

**Modified SVD Algorithm** (2007/7/27)

The
main difference between this algorithm and the previous work is the update order
of SVD. I implemented Simon's
algorithm, which updates one feature at a time (more precisely, the Nth
feature of all movies and all customers), in contrast with updating all
features in an iteration. The 1st feature of all movies and customers is updated
first, and its values are validated against an independent validation set. The
feature is updated until the validation RMSE starts to increase while the
training RMSE still decreases, i.e. overfitting occurs. Then the 2nd feature is
updated, then the 3rd, in the same manner.

When
all features have been updated once, another loop is started to update the 1st, 2nd,
…, *k*th feature again, where *k* is the number of features. The RMSE
can be reduced to about 0.9450 after the first loop, but the second and third
loops can still decrease the RMSE; the algorithm stops when a whole
loop can no longer decrease the validation RMSE.
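A minimal numpy sketch of this feature-at-a-time update order, assuming a dense toy matrix with a boolean mask of observed votes; the learning rate, regularization, and stopping tolerance are illustrative choices, not the values used in the experiments:

```python
import numpy as np

def train_feature_at_a_time(V_train, V_valid, mask_train, mask_valid,
                            k=5, lr=0.01, reg=0.02, max_epochs=200, seed=0):
    """Sketch of the feature-at-a-time order: train feature f for all
    movies and customers until the validation RMSE stops improving, then
    move to feature f+1; repeat whole sweeps over the features until a
    sweep yields no improvement."""
    rng = np.random.default_rng(seed)
    m, n = V_train.shape
    U = rng.uniform(0.1, 0.3, (m, k))   # customer features
    M = rng.uniform(0.1, 0.3, (k, n))   # movie features

    def valid_rmse(U, M):
        err = (U @ M - V_valid)[mask_valid]
        return np.sqrt(np.mean(err ** 2))

    best = valid_rmse(U, M)
    improved = True
    while improved:                     # outer sweeps over all features
        improved = False
        for f in range(k):              # one feature at a time
            for _ in range(max_epochs):
                E = np.where(mask_train, U @ M - V_train, 0.0)
                gU = E @ M[f] + reg * U[:, f]   # gradient wrt feature f of U
                gM = U[:, f] @ E + reg * M[f]   # gradient wrt feature f of M
                U[:, f] -= lr * gU
                M[f] -= lr * gM
                cur = valid_rmse(U, M)
                if cur < best - 1e-6:
                    best = cur
                    improved = True
                else:
                    break               # validation RMSE stopped improving
    return U, M, best
```

The early stop per feature mirrors the text: a feature is trained only while the validation RMSE keeps falling, and the outer sweeps end once a full pass brings no further gain.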

The update order represents the priorities
of the features. If all features are updated simultaneously, they are
considered equally important; otherwise, some features
become more significant than others. For example, in human face images, the
features representing eyes, nose, and mouth may be more important than other
features. On the Netflix Prize, the new update algorithm performs better than
the old one under the same SVD objective function, with an RMSE difference
of 0.02 on both a small divided data set (about 0.9300 → 0.9100) and the full data
(0.9470 → 0.9267). A remaining problem is that the last several features are hard to
update, because overfitting easily occurs once the earlier features have
already been updated. For example, if *k*=30,
only the first 15~20 features are updated, and the others remain at
their starting points.

There are some possible ideas for future work:

(1) Fusion approach. The SVD algorithm may be
combined with classification methods, since it can extract the features of
movies and customers. However, some powerful classification algorithms may be
impractical, since the number of votes is enormous.

(2) More about SVD. In the experiments, the
small data set usually achieves a better RMSE than the full data, with a difference of
about 0.02. If the full data is divided appropriately, running
individual SVD algorithms on the divided data may work better than running a
single SVD on the full data. Experiments are needed to find out what
“appropriately” means here.

(3) Time complexity. In a single iteration, this
algorithm must compute the prediction values of all voting pairs by dot
products, since these values are used in the gradients. If *k*=30, an iteration involves 3 billion multiplications and
additions, yet only one feature is updated. The time requirement, as well as
memory access, is still a problem to be solved.
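The 3-billion figure follows directly from the numbers in the text: roughly 100M training pairs, each needing a *k*-term dot product:

```python
votes = 100_000_000   # approximate number of training pairs (from the text)
k = 30                # number of features

# Each prediction is a k-term dot product: k multiplications + k additions.
mults_per_iteration = votes * k   # 3 billion, as stated above
```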

**Best RMSE: 0.9267**

**SVD Algorithm** (2007/7/11)

An SVD algorithm with *k*=10
is used, but instead of running it on the full data (17770 movies and
480189 customers), I divide the voting data into 27 divisions so that each
division has about the same number of movies and customers. The SVD algorithm
is run on each division, and the performance measure is the prediction
RMSE of an independent validation set computed from the SVD result (matrices *U* and *M*). A variance threshold can be used to find low-variance
customers, i.e. customers whose voting scores have low variance. The voting
scores of these customers are not used in SVD, and their votes in the test data
are predicted by the customer average. However, this
approach brings no significant improvement, so the variance threshold is
disabled.

The starting point of the
matrices *U* and *M* in SVD is random, but the range is controlled so that the dot product
of a random movie feature and a random customer feature falls into [1, 5]. A
search for a better starting point is performed by running the SVD for 100 iterations
from several random starting points and choosing the result with the best
validation RMSE. This search is time-consuming but has only a limited effect, so
it may be unnecessary. If the search is disabled, 100 iterations are still run
from the random starting point to obtain better *U* and *M* values. The
iterations in this step use gradient descent with line search.
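One way to realize the controlled range (a sketch, not necessarily the scheme used here): if every feature value is drawn near sqrt(c/k), a *k*-term dot product lands near c, so choosing c as the middle of the rating scale keeps initial predictions inside [1, 5]. The factors 0.8 and 1.2 below are illustrative:

```python
import numpy as np

def init_features(m, n, k, low_score=1.0, high_score=5.0, seed=0):
    """Random starting point for U (m x k) and M (k x n) chosen so the
    dot product of a random customer row and a random movie column falls
    near the middle of the rating scale: with entries near sqrt(c/k),
    a k-term dot product is near c = (low + high) / 2."""
    rng = np.random.default_rng(seed)
    center = np.sqrt((low_score + high_score) / 2.0 / k)
    U = rng.uniform(0.8 * center, 1.2 * center, (m, k))
    M = rng.uniform(0.8 * center, 1.2 * center, (k, n))
    return U, M

U, M = init_features(100, 80, 10)
pred = U @ M
assert pred.min() >= 1.0 and pred.max() <= 5.0   # inside the rating range
```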

After those iterations, the
matrices *U* and *M* are updated further by gradient descent, but line search
is disabled to prevent overfitting. After every 10~50 iterations, the matrices *U* and *M*
are taken out and the validation RMSE is computed. If the validation RMSE
starts to increase, overfitting has occurred and the update is stopped. The
matrices *U* and *M* that give the best validation RMSE are used for this division in
the final prediction of the quiz set.

Given
the starting points of the matrices *U* and *M* and the voting scores of the data, the SVD
algorithm performs gradient descent to update *U* and *M*, with or without
line search. The objective function has the form:

||*UM* − *V*||^{2}_{proj} + *reg* × ||*U*||^{2} + *reg* × ||*M*||^{2}

where *U*
and *M* are the SVD matrices, *V* is the matrix of voting scores, and *reg* is the regularization coefficient. ||·|| denotes the L2 norm of a matrix, so ||·||^{2} is the sum of squared
elements, and ||·||^{2}_{proj} means that the sum of squares is taken only over the offsets with
actual votes, i.e. the offsets of nonzero values in *V*.
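The objective translates directly to numpy, with the projection implemented as a boolean mask over observed votes:

```python
import numpy as np

def svd_objective(U, M, V, observed, reg):
    """||UM - V||^2_proj + reg * ||U||^2 + reg * ||M||^2, where the
    projected norm sums squared errors only over offsets with actual
    votes (observed is a boolean mask of the same shape as V)."""
    E = (U @ M - V)[observed]           # errors on observed entries only
    return np.sum(E ** 2) + reg * np.sum(U ** 2) + reg * np.sum(M ** 2)
```

This is the quantity the gradient descent above drives down; the mask is what keeps the unobserved (zero) entries of *V* from pulling predictions toward zero.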

In
each iteration, the update proceeds in movie order: for each movie,
the customers who voted on it are found, and the features of this movie and
those customers are updated simultaneously using the gradients of these features.
If line search is used, the partial objective function over these features
serves as the measure. During a single iteration, a movie feature is updated only
once, but a customer feature may be updated many times.
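A sketch of one such pass in movie order, without line search; the dense toy representation, learning rate, and regularization value are illustrative assumptions:

```python
import numpy as np

def movie_order_iteration(U, M, V, observed, lr=0.005, reg=0.02):
    """One pass in movie order: for each movie j, the customers who
    voted on it are found, and the movie feature M[:, j] and those
    customers' features are updated simultaneously from the gradients of
    the regularized squared error on movie j's votes."""
    m, n = V.shape
    for j in range(n):
        voters = np.where(observed[:, j])[0]
        if voters.size == 0:
            continue
        Uv = U[voters]                          # features of the voters
        err = Uv @ M[:, j] - V[voters, j]       # prediction errors on movie j
        gM = Uv.T @ err + reg * M[:, j]         # gradient wrt the movie feature
        gU = np.outer(err, M[:, j]) + reg * Uv  # gradients wrt voter features
        M[:, j] -= lr * gM                      # simultaneous update
        U[voters] -= lr * gU
    return U, M
```

Note how the structure matches the text: each movie feature is touched once per pass, while a customer's features are touched once for every movie that customer voted on.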

**Best RMSE: 0.9470**