CS378 Final Project: The Netflix Data Set
Class Project Ideas and Guidelines

The Data Set
● 17,770 Movies
● 480,189 Reviewers
● More than 100 Million reviews
  ○ Rating of 1 through 5
  ○ Review Date
● Uncompressed full dataset is 2 Gigabytes

Netflix Data Properties
● Distribution of Number of Reviews per Reviewer
● X-axis: # of reviews
● Y-axis: P(# of reviews)

Netflix Data Subsets
● You will be given two subsets of the data
● Format:
● Subset
  ○ Contains 9,000 reviewers
  ○ Restricted to only those movies with at least 5 ratings
    ● 12,000 movies
  ○ ~2 Million reviews
  ○ ~50 MB

Project Requirements
● Compute each of the following
  ○ Average review score
  ○ Top 10 most highly rated movies
  ○ Distribution of all review scores
    ● p(rating=1), ..., p(rating=5)
  ○ Number of reviews as a function of time
  ○ The reviewer whose review score distribution has the largest entropy
● Compute five other properties of the data
  ○ These properties should be relevant to your project
  ○ You should explain this relevancy
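
A minimal sketch of three of the required statistics (average score, score distribution, and the reviewer with the largest-entropy score distribution). The slides do not specify the file format, so the record layout `(reviewer_id, movie_id, rating, date)` and the sample values below are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical record layout: (reviewer_id, movie_id, rating, ISO date string).
reviews = [
    ("r1", "m1", 5, "2005-01-03"),
    ("r1", "m2", 1, "2005-01-04"),
    ("r2", "m1", 4, "2005-01-04"),
    ("r2", "m2", 4, "2005-02-01"),
]

def average_score(reviews):
    """Mean rating over all reviews."""
    return sum(rec[2] for rec in reviews) / len(reviews)

def rating_distribution(reviews):
    """Empirical p(rating=1), ..., p(rating=5)."""
    counts = Counter(rec[2] for rec in reviews)
    total = len(reviews)
    return {score: counts.get(score, 0) / total for score in range(1, 6)}

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def max_entropy_reviewer(reviews):
    """Reviewer whose own rating distribution has the largest entropy."""
    by_reviewer = defaultdict(list)
    for rec in reviews:
        by_reviewer[rec[0]].append(rec)
    return max(
        by_reviewer,
        key=lambda rid: entropy(rating_distribution(by_reviewer[rid]).values()),
    )
```

On the sample data, reviewer `r1` gives two different scores while `r2` always gives 4, so `r1` has the higher entropy. Reviewers with very few reviews will dominate or distort this statistic, which may be worth a minimum-count filter.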

Project Options
● Classification
● Clustering
● Recommendation
● Data Cubes

Project 1: Classification
● Goal: Predict review scores
  ○ 5-class classification problem
● K-Nearest Neighbor
● Represent each reviewer by a (sparse) vector of that reviewer's review scores
  ○ How can scores be predicted given a reviewer's nearest neighbors?
● Represent each movie by a vector of the scores its reviewers gave it
  ○ How can scores be predicted given a movie's nearest neighbors?
● Experiment with different distance measures
● Experiment with various normalization schemes
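
One possible answer to "how can scores be predicted given a reviewer's nearest neighbors" is a similarity-weighted average of the neighbors' scores, rounded to a class in 1..5. The sketch below assumes a sparse dictionary representation and cosine similarity; both are illustrative choices, not requirements of the project, and the reviewer names and scores are made up.

```python
import math

# Hypothetical sparse representation: reviewer -> {movie: score}.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m4": 4},
    "carol": {"m1": 1, "m2": 2, "m4": 2},
}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def predict(ratings, reviewer, movie, k=2):
    """Predict a 1..5 score from the k most similar reviewers who rated the movie."""
    target = ratings[reviewer]
    candidates = [
        (cosine(target, vec), vec[movie])
        for name, vec in ratings.items()
        if name != reviewer and movie in vec
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbor, so no prediction
    estimate = sum(sim * score for sim, score in neighbors) / total
    return min(5, max(1, round(estimate)))
```

Swapping `cosine` for another measure, or subtracting each reviewer's mean score before comparing, is one way to run the distance and normalization experiments listed above.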

Project 1: Classification
● Decision Trees and other Parametric Classifiers
  ○ Create dense features for each instance
    ● Reviewer's average rating
    ● Movie's average rating
    ● Movie related features
      ○ Actors in each movie (collected from IMDB)
    ● Time related features
      ○ Number of reviewer's previous scores
  ○ Use the WEKA machine learning package
    ● Evaluate performance of various algorithms in the package
      ○ Decision Tree, SVM,...

Project 1: Classification
● Evaluation of Classification Performance
  ○ Accuracy, Confusion Matrices
    ● Analysis: Are 1's harder to predict than 5's?
  ○ Cross-validation
    ● Does this make sense when there is a time-series component?
● Extensions
  ○ Learning curves
    ● How does accuracy change as the training set size increases?
  ○ Distribution of accuracy per reviewer
    ● Are some reviewers harder to predict than others?
    ● Are some movies harder to predict? ...
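
The evaluation ideas above can be sketched as follows: a confusion matrix, overall accuracy, per-class recall (to check whether 1's are harder than 5's), and a time-ordered train/test split as one alternative to random cross-validation when review dates matter. The record layout `(reviewer_id, movie_id, rating, date)` and the sample labels are assumptions for illustration.

```python
CLASSES = (1, 2, 3, 4, 5)

def confusion_matrix(actual, predicted, classes=CLASSES):
    """matrix[i][j] counts items of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of items on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def per_class_recall(matrix, classes=CLASSES):
    """Recall per true class; None when a class never occurs."""
    recall = {}
    for i, c in enumerate(classes):
        row_total = sum(matrix[i])
        recall[c] = matrix[i][i] / row_total if row_total else None
    return recall

def time_ordered_split(reviews, train_fraction=0.8):
    """Train on the earliest reviews, test on the latest (ISO dates sort as text)."""
    ordered = sorted(reviews, key=lambda rec: rec[3])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Made-up labels for illustration only.
actual = [1, 1, 5, 5, 5, 3]
predicted = [1, 2, 5, 5, 4, 3]
matrix = confusion_matrix(actual, predicted)
```

Comparing `per_class_recall(matrix)[1]` with `per_class_recall(matrix)[5]` gives a direct, if coarse, answer to the "are 1's harder than 5's" question.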

Project 2: Clustering
● Goal: Cluster reviewers and movies
● K-means based methods
  ○ Download G-Means
    ● Supports k-means and also other variants
  ○ Cluster using both sparse and dense representations
    ● Sparse representation: same as used for KNN classification
    ● Dense representation: same as used for parametric classification
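
For intuition about what a k-means package does with the dense representation, here is a small sketch of plain Lloyd-style k-means. It is not the G-Means software named above, and the two-feature points (for example, a reviewer's average rating and some second dense feature) are made-up values; the initialization from the first k points is a simplification.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical dense features per reviewer.
points = [(1.0, 1.2), (4.8, 4.5), (1.2, 0.9), (5.0, 4.7)]
assignment, centroids = kmeans(points, k=2)
```

Results depend on initialization, so a real experiment would typically try several starting points and compare the resulting cluster quality.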

Project 2: Clustering
● Graph-based methods
  ○ Compute pairwise similarities between reviewers
    ● Correlation
    ● Your own ad-hoc method
      ○ e.g. The Kevin Bacon method
        ● Sim(x, y) = # of Kevin Bacon movies viewed by both x and y
    ● Similarity computation may be too expensive to perform on the full dataset
  ○ Software: Graclus
● Results analysis
  ○ Quantitative as well as Qualitative
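
The two similarity ideas above can be sketched as follows: Pearson correlation over the movies two reviewers have both rated, and a count-based similarity in the style of the Kevin Bacon method. The dictionary representation and the `movie_subset` argument (standing in for "the set of Kevin Bacon movies") are assumptions for illustration.

```python
import math

def pearson(u, v):
    """Pearson correlation of two reviewers over their co-rated movies."""
    shared = set(u) & set(v)
    n = len(shared)
    if n < 2:
        return 0.0  # too little overlap to measure correlation
    mean_u = sum(u[m] for m in shared) / n
    mean_v = sum(v[m] for m in shared) / n
    cov = sum((u[m] - mean_u) * (v[m] - mean_v) for m in shared)
    var_u = sum((u[m] - mean_u) ** 2 for m in shared)
    var_v = sum((v[m] - mean_v) ** 2 for m in shared)
    if var_u == 0 or var_v == 0:
        return 0.0  # a constant rater has no defined correlation
    return cov / math.sqrt(var_u * var_v)

def shared_count(u, v, movie_subset):
    """Sim(x, y) = number of movies in movie_subset rated by both reviewers."""
    return len(set(u) & set(v) & set(movie_subset))

# Made-up reviewers: x and z rate the same movies in opposite order.
x = {"a": 1, "b": 3, "c": 5}
y = {"a": 2, "b": 4, "c": 5}
z = {"a": 5, "b": 3, "c": 1}
```

With 9,000 reviewers in a subset there are roughly 40 million reviewer pairs, which is why the slide warns about the cost of computing all pairwise similarities.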

Project 3: Recommendation
● Goal: Create movie recommendations for each reviewer
● K-Nearest Neighbor
  ○ Instance representation
    ● Sparse representation
  ○ Find the reviewer's nearest neighbors
    ● Recommend movies scored highly by these neighbors
  ○ Try out various distance measures
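
A minimal sketch of the neighbor-based recommendation idea, assuming the sparse dictionary representation and, as one of many possible distance measures, the mean absolute difference on co-rated movies turned into a similarity. The reviewer names, scores, and the `1 / (1 + distance)` conversion are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical sparse representation: reviewer -> {movie: score}.
ratings = {
    "alice": {"m1": 5, "m2": 4},
    "bob":   {"m1": 5, "m2": 5, "m3": 5, "m4": 2},
    "carol": {"m1": 1, "m2": 5, "m4": 5},
}

def similarity(u, v):
    """1 / (1 + mean absolute score difference) over co-rated movies."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    mad = sum(abs(u[m] - v[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mad)

def recommend(ratings, reviewer, k=2, top_n=3):
    """Rank unseen movies by the similarity-weighted scores of the k nearest neighbors."""
    target = ratings[reviewer]
    neighbors = sorted(
        ((similarity(target, vec), name)
         for name, vec in ratings.items() if name != reviewer),
        reverse=True,
    )[:k]
    weighted = defaultdict(float)
    weights = defaultdict(float)
    for sim, name in neighbors:
        for movie, score in ratings[name].items():
            if movie not in target:  # only recommend movies not yet rated
                weighted[movie] += sim * score
                weights[movie] += sim
    ranked = sorted(
        ((weighted[m] / weights[m], m) for m in weighted if weights[m] > 0),
        reverse=True,
    )
    return [movie for _, movie in ranked[:top_n]]
```

Here `recommend(ratings, "alice")` ranks `m3` ahead of `m4`, because Alice's closest neighbor scored `m3` highly and `m4` poorly.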

Project 3: Recommendation
● Evaluation
  ○ Propose a way of quantifying the quality of your recommendations
    ● e.g. A recommendation is good if the reviewer ended up rating the recommended movie with a score of 4 or higher
  ○ Is it harder to recommend movies to reviewers who do not watch many movies?
    ● Does your evaluation metric reflect this?
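
The example criterion above (a recommendation is good if the reviewer later rated it 4 or higher) could be turned into a simple hit rate, sketched below. The function name `hit_rate` and the held-out dictionary are hypothetical; recommendations the reviewer never rated are skipped because their quality is unknown.

```python
def hit_rate(recommended, held_out, threshold=4):
    """Fraction of rated recommendations that scored at least `threshold`.

    recommended: list of movie ids suggested to one reviewer
    held_out:    {movie: score} the reviewer gave later (not used for training)
    Returns None when none of the recommendations were ever rated.
    """
    rated = [m for m in recommended if m in held_out]
    if not rated:
        return None
    hits = sum(1 for m in rated if held_out[m] >= threshold)
    return hits / len(rated)
```

Reviewers with few held-out ratings will often produce `None` or a very noisy value here, which is one concrete way the metric reflects the difficulty of recommending to people who watch few movies.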

Project 4: Data Cubes
● Load the data into a data cube
  ○ Find interesting trends in the data
    ● e.g. Relation between average review score and day of week?
      ○ Slice on day, aggregate review scores across all reviewers and movies
    ● Find other interesting trends
● Use an open source data cube package (OLAP)
  ○ Mondrian – Java based
  ○ Must be a proficient coder
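
Before setting up an OLAP package, the day-of-week example can be prototyped in a few lines. This sketch is not Mondrian; it only mimics the "slice on day, aggregate over reviewers and movies" step, and the record layout `(reviewer_id, movie_id, rating, ISO date)` and sample rows are assumptions.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def average_by_weekday(reviews):
    """Average rating per day of week, aggregated over all reviewers and movies."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [sum of ratings, count]
    for _reviewer, _movie, rating, day in reviews:
        name = WEEKDAYS[date.fromisoformat(day).weekday()]
        totals[name][0] += rating
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

# Made-up rows for illustration only.
sample = [
    ("r1", "m1", 5, "2005-01-03"),  # a Monday
    ("r2", "m1", 3, "2005-01-10"),  # a Monday
    ("r1", "m2", 2, "2005-01-04"),  # a Tuesday
]
weekday_means = average_by_weekday(sample)
```

The same grouping pattern extends to other dimensions (month, movie, reviewer), which is essentially what a data cube precomputes.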