MovieMiner: A collaborative filtering system for predicting Netflix users' movie ratings
[ECS289G Data Mining]
Team Spelunker: Justin Becker, Philip Fisher-Ogden
The Problem
Given a set of (movie, user, rating) entries, predict the rating values for the unknown entries.
Example:
– X-Men, Philip, 5
– Spiderman 3, Philip, 4
– X-Men, Justin, 4
– Spiderman 3, Justin, ?
What rating do you predict Justin would give Spiderman 3?
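To make the task concrete, the input can be viewed as a set of (movie, user, rating) triples plus a list of unknown cells to fill in. A minimal Java sketch of that representation (the names Rating and Predictor are illustrative, not taken from MovieMiner):

```java
import java.util.List;

// Illustrative sketch of the prediction task; types are not from the actual system.
public class RatingTask {
    // One known entry: (movie, user, rating on the 1-5 star scale)
    record Rating(String movie, String user, int stars) {}

    // A predictor fills in the unknown entries, e.g. (Spiderman 3, Justin, ?)
    interface Predictor {
        double predict(List<Rating> known, String movie, String user);
    }

    public static void main(String[] args) {
        List<Rating> known = List.of(
            new Rating("X-Men", "Philip", 5),
            new Rating("Spiderman 3", "Philip", 4),
            new Rating("X-Men", "Justin", 4));
        // The question on this slide: predictor.predict(known, "Spiderman 3", "Justin") = ?
        System.out.println(known.size() + " known ratings; 1 unknown to predict");
    }
}
```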
Our Approach - Motivation
Motivating factors:
– Review current approaches taken by the Netflix prize top leaders
– Leverage and extend existing libraries to minimize the ramp-up time required to implement a working system
– Utilize the UC Davis elvis cluster to alleviate any scale problems
What - Our Approach
Collaborative Filtering (CF)
– Weighted average of predictions from the following recommenders:
  Slope One recommender
  Item-based recommender
  User-based recommender
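The composite prediction is a weighted average of the three recommenders' individual predictions. A minimal sketch of that blending step, assuming fixed weights (the 25/5/70 split mirrors the final-run parameters reported later in the slides; the clamping to the 1-5 star scale is an added safeguard, not something stated on the slides):

```java
// Minimal sketch: blend per-recommender predictions into one rating.
public class BlendedRecommender {
    // Weights mirror the final-run parameters reported later in the slides
    // (25% user, 5% item, 70% slope one); treat them as tunable.
    static final double W_USER = 0.25, W_ITEM = 0.05, W_SLOPE = 0.70;

    static double blend(double userPred, double itemPred, double slopeOnePred) {
        return W_USER * userPred + W_ITEM * itemPred + W_SLOPE * slopeOnePred;
    }

    public static void main(String[] args) {
        double p = blend(3.8, 4.2, 4.6);        // hypothetical per-recommender predictions
        long rating = Math.round(Math.max(1.0, Math.min(5.0, p))); // clamp and round to the 1-5 star scale
        System.out.println(rating);             // prints 4
    }
}
```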
What - Our Approach
Leveraging three CF recommenders
– Similarities: each uses prior preference information to predict values for unrated entries
– Differences:
  How is the similarity between two entries computed?
  How are the neighbors selected?
  How are the interpolation weights determined?
Why - Our Approach
Why Collaborative Filtering?
– "Those who agreed in the past tend to agree again in the future"
– Requires no external data sources
– Uses k-Nearest-Neighbor approaches to predict the class (rating) of an unknown entry
– A full-featured CF Java library already exists (Taste)
– CF is one of the two main approaches used by the Netflix prize top leaders (the other being SVD)
How – Slope One Recommender
– Introduced by Daniel Lemire and Anna Maclachlan
– Simple and accurate predictor
– Uses the average rating difference between two items
– A weighted average, weighted by the number of users having rated both items, produces better results
Ex: Slope One Recommender
Average difference between X-Men and Spiderman 3 is 1. Justin's rating for Spiderman 3 is then 4 + 1 = 5.

            X-Men  Spiderman 3  Batman Begins  Nacho Libre
Justin        4         ?             5             4
Philip        5         3             4             2
Dan           4         4             5             5
Ian           3         4             3             3
Michael       2         3             1             5
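A sketch of the weighted Slope One scheme on this example data (not the MovieMiner/Taste code). Note that the walkthrough above uses only the X-Men/Spiderman 3 deviation, whereas the weighted form combines the average deviation from every item the user has rated, weighted by the number of co-raters, so its estimate differs from the single-pair shortcut:

```java
import java.util.*;

// Sketch of weighted Slope One (Lemire & Maclachlan). Assumed data layout:
// ratings.get(user).get(item) -> rating. Not the actual MovieMiner/Taste code.
public class SlopeOneSketch {

    static double predict(Map<String, Map<String, Double>> ratings, String user, String target) {
        Map<String, Double> mine = ratings.get(user);
        double num = 0, den = 0;
        for (String other : mine.keySet()) {                 // items the user has already rated
            double devSum = 0; int count = 0;
            for (Map<String, Double> r : ratings.values()) { // users who rated both items
                if (r.containsKey(target) && r.containsKey(other)) {
                    devSum += r.get(target) - r.get(other);
                    count++;
                }
            }
            if (count > 0) {                                 // weight by the number of co-raters
                double dev = devSum / count;
                num += (mine.get(other) + dev) * count;
                den += count;
            }
        }
        return den > 0 ? num / den : Double.NaN;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> r = new HashMap<>();
        r.put("Philip",  Map.of("X-Men", 5.0, "Spiderman 3", 3.0, "Batman Begins", 4.0, "Nacho Libre", 2.0));
        r.put("Dan",     Map.of("X-Men", 4.0, "Spiderman 3", 4.0, "Batman Begins", 5.0, "Nacho Libre", 5.0));
        r.put("Ian",     Map.of("X-Men", 3.0, "Spiderman 3", 4.0, "Batman Begins", 3.0, "Nacho Libre", 3.0));
        r.put("Michael", Map.of("X-Men", 2.0, "Spiderman 3", 3.0, "Batman Begins", 1.0, "Nacho Libre", 5.0));
        r.put("Justin",  Map.of("X-Men", 4.0, "Batman Begins", 5.0, "Nacho Libre", 4.0));
        // ≈ 4.33 here: all three of Justin's rated items contribute, not just X-Men.
        System.out.println(predict(r, "Justin", "Spiderman 3"));
    }
}
```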
How – User-based Recommender
Predicts a user u's rating for an item i:
– Find the k nearest neighbors to the user u
  Similarity measure = Pearson correlation
  Missing preferences are inferred by using the user's average rating
– Interpolate between those in-common neighbors' ratings for item i
  Interpolation weights = Pearson correlation
  Neighbors are ignored if they did not rate i
Ex: User-based Recommender

            X-Men  Spiderman 3  Batman Begins  Nacho Libre    avg
Justin        4         ?             5             4         4.33
Philip        5         3             4             2         3.50
Dan           4         4             5             5         4.50
Ian           3         4             3             3         3.25
Michael       2         3             1             5         2.75

Centered data (user average):
            X-Men  Spiderman 3  Batman Begins  Nacho Libre    Eucl Norm
Justin      -0.33        ?            0.67         -0.33         0.82
Philip       1.50      -0.50          0.50         -1.50         2.24
Dan         -0.50      -0.50          0.50          0.50         1.00
Ian         -0.25       0.75         -0.25         -0.25         0.87
Michael     -0.75       0.25         -1.75          2.25         2.96
Ex: User-based Recommender
Similarities are calculated using the Pearson correlation coefficient (on the centered data):
– User-user similarities: Justin-Philip, Justin-Dan, Justin-Ian (≈ 0, on the order of 1E-16), Justin-Michael
Interpolation between the nearest neighbors produces the prediction:
– Prediction using the 2 nearest neighbors: Philip, Dan
– round(prediction) = 4
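A sketch of this user-based scheme on the example data, assuming Pearson correlation over co-rated, mean-centered ratings and similarity-weighted interpolation of the neighbors' centered ratings (not the actual MovieMiner/Taste code):

```java
import java.util.*;

// Sketch of the user-based recommender described above.
public class UserBasedSketch {

    static double mean(Map<String, Double> row) {
        return row.values().stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    // Pearson correlation between two users over their co-rated items.
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        double ma = mean(a), mb = mean(b), dot = 0, na = 0, nb = 0;
        for (String item : a.keySet()) {
            if (!b.containsKey(item)) continue;
            double x = a.get(item) - ma, y = b.get(item) - mb;
            dot += x * y; na += x * x; nb += y * y;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    static double predict(Map<String, Map<String, Double>> ratings, String user, String item, int k) {
        Map<String, Double> mine = ratings.get(user);
        // Rank the other users by similarity, ignoring those who did not rate the item.
        List<Map.Entry<String, Double>> sims = new ArrayList<>();
        for (var e : ratings.entrySet()) {
            if (e.getKey().equals(user) || !e.getValue().containsKey(item)) continue;
            sims.add(Map.entry(e.getKey(), pearson(mine, e.getValue())));
        }
        sims.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        // Interpolate the k nearest neighbors' centered ratings, weighted by similarity.
        double num = 0, den = 0;
        for (var e : sims.subList(0, Math.min(k, sims.size()))) {
            Map<String, Double> theirs = ratings.get(e.getKey());
            num += e.getValue() * (theirs.get(item) - mean(theirs));
            den += Math.abs(e.getValue());
        }
        return den == 0 ? mean(mine) : mean(mine) + num / den;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> r = new HashMap<>();
        r.put("Justin",  Map.of("X-Men", 4.0, "Batman Begins", 5.0, "Nacho Libre", 4.0));
        r.put("Philip",  Map.of("X-Men", 5.0, "Spiderman 3", 3.0, "Batman Begins", 4.0, "Nacho Libre", 2.0));
        r.put("Dan",     Map.of("X-Men", 4.0, "Spiderman 3", 4.0, "Batman Begins", 5.0, "Nacho Libre", 5.0));
        r.put("Ian",     Map.of("X-Men", 3.0, "Spiderman 3", 4.0, "Batman Begins", 3.0, "Nacho Libre", 3.0));
        r.put("Michael", Map.of("X-Men", 2.0, "Spiderman 3", 3.0, "Batman Begins", 1.0, "Nacho Libre", 5.0));
        System.out.println(Math.round(predict(r, "Justin", "Spiderman 3", 2))); // prints 4, as in the example
    }
}
```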
How – Item-based Recommender
Predicts a user u's rating for an item i:
– Find the k most similar items to i
  Similarity measure = Pearson correlation
– Keep only similar items also rated by u
– Interpolate between the remaining items' ratings
  Interpolation weights = Pearson correlation
– Note: item-item similarities allow for more efficient computation since cnt(items) << cnt(users); thus, the similarity matrix can be pre-computed and leveraged as needed.
Ex: Item-based Recommender

            X-Men  Spiderman 3  Batman Begins  Nacho Libre
Justin        4         ?             5             4
Philip        5         3             4             2
Dan           4         4             5             5
Ian           3         4             3             3
Michael       2         3             1             5
avg          3.6       3.5           3.6           3.8

Centered data (item average):
            X-Men  Spiderman 3  Batman Begins  Nacho Libre
Justin       0.4         ?            1.4           0.2
Philip       1.4       -0.5           0.4          -1.8
Dan          0.4        0.5           1.4           1.2
Ian         -0.6        0.5          -0.6          -0.8
Michael     -1.6       -0.5          -2.6           1.2
Eucl Norm    2.28       1.00          3.35          2.61
Ex: Item-based Recommender
Similarities are calculated using the Pearson correlation coefficient (on the centered data):
– Item-item similarities with Spiderman 3: S-XM = 0, S-BB, S-NL
Interpolation between the nearest neighbors produces the prediction:
– Prediction from the 2 nearest neighbors: Batman Begins (BB), Nacho Libre (NL)
– round(prediction) = 5
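A sketch of the item-based scheme, mirroring the user-based sketch but operating on item columns centered by the item average (not the actual MovieMiner/Taste code). Because the item-item similarities depend only on the ratings matrix, the pearson() values computed here are exactly what a pre-computed similarity matrix would store:

```java
import java.util.*;

// Sketch of the item-based recommender: Pearson similarity between item columns,
// then interpolation over the k most similar items the user has already rated.
public class ItemBasedSketch {

    // Assumed layout: ratingsByItem.get(item).get(user) -> rating (the "transpose" of the user-based layout).
    static double predict(Map<String, Map<String, Double>> ratingsByItem, String user, String target, int k) {
        Map<String, Double> targetCol = ratingsByItem.get(target);
        List<double[]> picks = new ArrayList<>();              // {similarity, user's rating of the similar item}
        for (var e : ratingsByItem.entrySet()) {
            // Keep only other items that the user has rated.
            if (e.getKey().equals(target) || !e.getValue().containsKey(user)) continue;
            picks.add(new double[] { pearson(targetCol, e.getValue()), e.getValue().get(user) });
        }
        picks.sort((a, b) -> Double.compare(b[0], a[0]));      // most similar items first
        double num = 0, den = 0;
        for (double[] p : picks.subList(0, Math.min(k, picks.size()))) {
            num += p[0] * p[1];                                // similarity-weighted rating
            den += Math.abs(p[0]);
        }
        return den == 0 ? Double.NaN : num / den;
    }

    // Pearson correlation between two item columns over the users who rated both.
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        double ma = mean(a), mb = mean(b), dot = 0, na = 0, nb = 0;
        for (String u : a.keySet()) {
            if (!b.containsKey(u)) continue;
            double x = a.get(u) - ma, y = b.get(u) - mb;
            dot += x * y; na += x * x; nb += y * y;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    static double mean(Map<String, Double> col) {
        return col.values().stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> byItem = new HashMap<>();
        byItem.put("X-Men",         Map.of("Justin", 4.0, "Philip", 5.0, "Dan", 4.0, "Ian", 3.0, "Michael", 2.0));
        byItem.put("Spiderman 3",   Map.of("Philip", 3.0, "Dan", 4.0, "Ian", 4.0, "Michael", 3.0));
        byItem.put("Batman Begins", Map.of("Justin", 5.0, "Philip", 4.0, "Dan", 5.0, "Ian", 3.0, "Michael", 1.0));
        byItem.put("Nacho Libre",   Map.of("Justin", 4.0, "Philip", 2.0, "Dan", 5.0, "Ian", 3.0, "Michael", 5.0));
        System.out.println(Math.round(predict(byItem, "Justin", "Spiderman 3", 2))); // prints 5, as in the example
    }
}
```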
Initial Results
Bottom line: correct = 91,934, loss = 319,710
Parameters used: 40% user, 60% item, 20 nearest neighbors
– ~97% scored with the composite recommender (user, item)
– ~3% scored with a random recommender
RMSE
Final Results
Bottom line: correct = 106,253, loss = 236,523
Parameters used: 25% user, 5% item, 70% slope one, 20 nearest neighbors
– ~97% scored with the composite recommender (user, item, slope one)
– ~3% scored with a weighted average
RMSE:
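Both result slides report RMSE, the root-mean-square error between predicted and actual ratings. A minimal sketch of the metric (not the project's actual scoring code):

```java
// Root-mean-square error between predictions and the true ratings.
public class Rmse {
    static double rmse(double[] predicted, double[] actual) {
        double sumSq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double err = predicted[i] - actual[i];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / predicted.length);
    }

    public static void main(String[] args) {
        // e.g. two exact hits and one off-by-one prediction
        System.out.println(rmse(new double[] {4, 3, 5}, new double[] {4, 3, 4})); // ≈ 0.577
    }
}
```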
Questions?
Conclusion