G54DMT – Data Mining Techniques and Applications
Dr. Jaume Bacardit
Topic 4: Applications
Lecture 1: The Netflix Challenge
Outline
– The challenge and its assessment
– Timeline of progress
– Recommendation methods
– Matrix Factorisation techniques
– Ensemble methods
– Lessons learnt
– Resources
The Netflix Challenge
Netflix is an online video rental company
One of its most relevant components is its movie recommendation system
– Suggest movies to users based on their past ratings
In 2006 Netflix made its recommendation database public
It challenged the community to produce a new recommender that was 10% better than its own method
The winner would get $1M
Training data
Movie ratings collected from 1998 to 2005: 100,480,507 ratings that 480,189 users gave to 17,770 movies
Training data divided in:
– Training set (99,072,112 ratings)
– Probe set (1,408,395 ratings)
Each rating was a quadruplet <user, movie, date of rating, rating>
Very sparse data: the number of ratings is a very small fraction of users × movies (see the sketch below)
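To make "very small fraction" concrete, a minimal Python sketch computing the density of the rating matrix from the counts above:

```python
# Density of the Netflix rating matrix, from the counts quoted above
n_ratings = 100_480_507
n_users = 480_189
n_movies = 17_770

density = n_ratings / (n_users * n_movies)
print(f"Fraction of user x movie cells observed: {density:.2%}")  # ~1.18%
```

Over 98% of the possible (user, movie) cells are empty, which is why methods that model the observed entries directly (rather than the full matrix) dominate.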
Test data
Qualifying data were triplets <user, movie, date of rating>
Qualifying set (2,817,131 ratings) consisting of:
– Test set (1,408,789 ratings), used to determine the winner
– Quiz set (1,408,342 ratings), used to calculate leaderboard scores
Participants did not know which instances were part of the test set and which were part of the quiz set
Test, quiz and probe sets were created to have similar statistical properties
Assessment
Error on the quiz and test sets was computed as Root Mean Squared Error (RMSE), rounded to 4 digits
RMSE of the Cinematch system (Netflix's own predictor) = 0.9514
– Target RMSE (a 10% improvement) = 0.8563
Once a participant improved on the target RMSE, a "last call" period of 30 days started
At the end of the 30 days, the participant with the lowest test RMSE would be declared the winner
In case of ties, the prize would go to the earliest entry
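For reference, a minimal sketch of the scoring function (the function name and example values are illustrative):

```python
import numpy as np

def rmse(predicted, actual):
    """Root Mean Squared Error, rounded to 4 digits as in the challenge."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return round(float(np.sqrt(np.mean((predicted - actual) ** 2))), 4)

# A predictor that is off by 0.5 stars on every rating scores 0.5
print(rmse([3.5, 4.5, 2.5], [3.0, 4.0, 3.0]))  # 0.5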
Progress in the challenge
Data released on October 2nd, 2006
By October 8th a participant already had a better RMSE than Cinematch
The 2007 progress prize was awarded to BellKor with an improvement of 8.43%
The 2008 progress prize was awarded to "BellKor in BigChaos" with an improvement of 9.44%
On June 26th, 2009, the team "BellKor's Pragmatic Chaos" achieved an improvement of 10.05%. The "last call" period started
Progress: last call period
On July 25th, 2009, the team "The Ensemble", a merger of the teams "Grand Prize Team" and "Opera Solutions and Vandelay United", achieved a 10.09% improvement
After the last call period ended, two teams led the quiz leaderboard:
– "The Ensemble" with a 10.10% improvement
– "BellKor's Pragmatic Chaos" with a 10.09% improvement
On the test set both teams were tied with an improvement of 10.06%
BellKor's Pragmatic Chaos was declared the winner because they had submitted their entry 20 minutes before The Ensemble
Recommender systems: content filtering
Collect background information about users and movies to generate a profile of each of them
– Users: demographic information
– Movies: genre, actors, box office results
Produce recommendations by matching the profiles of users and movies
Costly: this information is often difficult to collect, or simply not available
Recommender systems: collaborative filtering
Generates predictions of ratings based only on the past behaviour of the users
No background domain knowledge required
Easier to generate the models
Faces a cold-start problem: it struggles when not enough ratings are available
Collaborative filtering: neighbourhood methods
Compute relationships between items or users
Identify which movies are similar to each other, based on receiving similar ratings from the same users
[Figure: hierarchical clustering showing the similarities of 5000 movies]
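A minimal sketch of an item-item neighbourhood predictor, assuming plain cosine similarity over a dense toy matrix (real systems use sparse structures and mean-centred similarities; all names and numbers here are illustrative):

```python
import numpy as np

def item_similarity(R):
    """Cosine similarity between item columns of a ratings matrix R
    (users x items), treating missing ratings as 0."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero
    Rn = R / norms
    return Rn.T @ Rn                 # items x items similarity matrix

def predict(R, sim, u, i, k=2):
    """Predict user u's rating of item i from the k most similar
    items that u has already rated."""
    rated = np.nonzero(R[u])[0]                      # items u has rated
    neighbours = rated[np.argsort(sim[i, rated])[::-1][:k]]
    w = sim[i, neighbours]
    return float(w @ R[u, neighbours] / w.sum())     # weighted average

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)
sim = item_similarity(R)
# ~1.9: item 2 is most similar to item 3, which user 0 rated low
print(predict(R, sim, u=0, i=2))
```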
Collaborative filtering: latent factor models
Automatically map users and movies into a shared space of latent factors (the same space for both)
Matrix Factorisation methods
Most successful of the latent factor methods
These methods generate a vector $q_i \in \mathbb{R}^f$ for each item and a vector $p_u \in \mathbb{R}^f$ for each user
A prediction is the inner product of both vectors: $\hat{r}_{ui} = q_i^T p_u$
The problem of finding the vectors q and p for each movie and user is defined as the following optimisation problem:
$$\min_{q^*, p^*} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - q_i^T p_u \right)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 \right)$$
where $\mathcal{K}$ is the training set, $r_{ui}$ the actual rating, $q_i^T p_u$ the predicted rating, and the $\lambda$ term a regularisation term (to avoid overfitting)
Optimisation methods
Stochastic gradient descent
– Iteratively samples training examples, computes the prediction error and adjusts the vectors of the involved user and item accordingly (see the sketch below)
Alternating least squares
– The original definition of the optimisation problem is not convex, and hence cannot be solved to optimality
– If either p or q is fixed, the problem is convex and can be solved using least squares methods
– This method alternates between two states, in each of which it fixes either p or q
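A minimal sketch of the SGD variant; the hyperparameters (factor count, learning rate, regularisation, epochs) are illustrative, not the values used by the challenge participants:

```python
import numpy as np

def train_mf(ratings, n_users, n_items, f=10, lr=0.02, reg=0.02, epochs=500):
    """SGD for r_ui ~ q_i . p_u on a list of (user, item, rating) triplets."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))     # user vectors p_u
    Q = rng.normal(scale=0.1, size=(n_items, f))     # item vectors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            p_u = P[u].copy()                        # keep old value for both updates
            err = r - Q[i] @ p_u                     # prediction error e_ui
            P[u] += lr * (err * Q[i] - reg * p_u)    # step against the
            Q[i] += lr * (err * p_u - reg * Q[i])    # regularised gradient
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
P, Q = train_mf(ratings, n_users=2, n_items=3)
print(Q[1] @ P[0])   # approaches the observed rating (0, 1) = 3.0 as training progresses
```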
Bias in the models
Not all movies receive the same distribution of ratings
– Some are more popular
Not all users give the same distribution of ratings
– Some users are stricter than others
Refinement of the model introducing bias terms:
$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u$$
where $\mu$ is the average overall rating, $b_i$ the bias of item i and $b_u$ the bias of user u
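A hypothetical worked example of the biased prediction (all numbers invented for illustration):

```python
import numpy as np

mu = 3.7                        # overall average rating (hypothetical)
b_i = 0.5                       # this movie tends to be rated 0.5 above average
b_u = -0.3                      # this user tends to rate 0.3 below average
q_i = np.array([0.2, -0.1])     # item factor vector (f = 2 for brevity)
p_u = np.array([0.5, 0.4])      # user factor vector

r_hat = mu + b_i + b_u + q_i @ p_u
print(r_hat)                    # 3.96: baseline of 3.9 plus a 0.06 interaction term
```

Most of the prediction comes from the baseline terms; the factor interaction only adjusts it for this specific user-movie pair.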
Additional input sources
Implicit feedback
– Users will not produce ratings for everything; which movies a user chooses to rate (or rent) is itself a signal of preference
Demographic information
– If available
Temporal dynamics
Ratings change through time
Users
– May change tastes
– May produce more/less strict ratings in different periods of time
Movies
– Blockbusters may fade in popularity
– Cult movies may become more popular
Impact of all components of the model (BellKor)
Ensemble methods
All top participants' methods combined (blended) the predictions of hundreds of models of many types
– Matrix Factorisation
– Neighbourhood methods
– Restricted Boltzmann Machines
Many ways of combining the models
– Linear combinations
– Neural networks
– Regression trees
Basic linear regression method
Need to optimise the vector of weights associated with each method
Can use e.g. the least squares method for this, optimising over the probe set (a sketch follows below)
How to choose the models to include in the ensemble?
– Forward method: start with one, keep adding until the probe set error degrades
– Backward method: start with all, keep removing while the probe set error improves
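A minimal sketch of least-squares blending on the probe set (the predictions and ratings below are made up):

```python
import numpy as np

# Columns of X: each model's predictions for the probe instances;
# y: the true probe ratings (all numbers illustrative)
X = np.array([[3.1, 3.4, 2.9],
              [4.2, 4.0, 4.4],
              [1.8, 2.2, 2.0],
              [3.9, 3.6, 4.1]])
y = np.array([3.0, 4.5, 2.0, 4.0])

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # blending weights, one per model
blended = X @ w                             # ensemble prediction per instance
print(w)
print(blended)
```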
Feature-Weighted Linear Stacking
Method from "The Ensemble"
Not all models are suitable for all kinds of movies/users
Generate a set of "meta-features" for each instance that are used to calibrate the linear combination weights specifically for each case:
$$b(x) = \sum_i \sum_j v_{ij} f_j(x) g_i(x)$$
where $v_{ij}$ is the weight associated with feature j for model i, $f_j(x)$ the value of feature j for instance x, and $g_i(x)$ the prediction of model i for instance x
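A minimal sketch of the FWLS prediction for a single instance, assuming the $v_{ij}$ weights have already been learned (all numbers are illustrative; including a constant meta-feature recovers plain linear blending as a special case):

```python
import numpy as np

def fwls_predict(V, f, g):
    """b(x) = sum_ij v_ij * f_j(x) * g_i(x): each model's blending weight
    is itself a linear function of the instance's meta-features.
    V: (n_models, n_features) weights; f: meta-features; g: model predictions."""
    per_model_weight = V @ f      # w_i(x) = sum_j v_ij f_j(x)
    return float(per_model_weight @ g)

V = np.array([[0.6, 0.1],         # hypothetical learned weights v_ij
              [0.3, -0.2]])
f = np.array([1.0, 0.4])          # e.g. a constant feature and a user-activity feature
g = np.array([3.8, 4.1])          # two models' predictions for this instance
print(fwls_predict(V, f, g))      # 3.334
```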
Top 10 features (out of 25)
Lessons learnt from the challenge
A well defined competition (clear rules, instant feedback on progress, forums for discussion)
Great collaboration between participants, sharing ideas and combining efforts
It widened the awareness of statistics and machine learning in mainstream society
It provided a big challenge to the ML community, and hence new science was done
Resources
– Challenge web page
– Very nice article about Matrix Factorisation
– Article on Feature-Weighted Linear Stacking
– Progress reports of BellKor, BigChaos and PragmaticTheory
– Web page of "The Ensemble"