Published by Cory Emma McCarthy. Modified over 9 years ago.
1
CS378 Final Project: The Netflix Data Set ● Class Project Ideas and Guidelines
2
The Data Set ● 17,770 movies ● 480,189 reviewers ● More than 100 million reviews, each with a rating of 1 through 5 and a review date ● Uncompressed full dataset is 2 gigabytes
3
Netflix Data Properties ● Distribution of the number of reviews per reviewer ● X-axis: # of reviews ● Y-axis: P(# of reviews)
4
Netflix Data Subsets ● You will be given two subsets of the data ● Format: ● Subset contains 9,000 reviewers, restricted to only those movies with at least 5 ratings ● 12,000 movies, ~2 million reviews, ~50 MB
5
Project Requirements ● Compute each of the following: average review score; top 10 most highly rated movies; distribution of all review scores, p(rating=1), ..., p(rating=5); number of reviews as a function of time; the reviewer whose review score distribution has the largest entropy ● Compute five other properties of the data. These properties should be relevant to your project, and you should explain their relevance.
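The required statistics can be sketched in a few lines. This is only an illustration under assumptions: the slides do not show the subset file format, so the `(reviewer_id, movie_id, rating, date)` tuples, the sample values, and all function names below are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample reviews: (reviewer_id, movie_id, rating, date).
# The real subset format is not given in the slides.
reviews = [
    ("r1", "m1", 5, "2005-01-03"),
    ("r1", "m2", 1, "2005-01-04"),
    ("r2", "m1", 4, "2005-01-04"),
    ("r2", "m2", 4, "2005-02-10"),
]

def rating_distribution(ratings):
    """Return {rating: probability} for ratings 1..5."""
    counts = Counter(ratings)
    total = len(ratings)
    return {r: counts.get(r, 0) / total for r in range(1, 6)}

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Average review score and overall distribution p(rating=1..5).
average_score = sum(r[2] for r in reviews) / len(reviews)
overall = rating_distribution([r[2] for r in reviews])

# Number of reviews as a function of time, bucketed by "YYYY-MM".
reviews_per_month = Counter(r[3][:7] for r in reviews)

# Reviewer whose rating distribution has the largest entropy.
by_reviewer = defaultdict(list)
for reviewer, _movie, rating, _date in reviews:
    by_reviewer[reviewer].append(rating)
max_entropy_reviewer = max(
    by_reviewer, key=lambda rid: entropy(rating_distribution(by_reviewer[rid]))
)
```

On this toy data, reviewer `r1` splits evenly between two scores (entropy of 1 bit) while `r2` always gives a 4 (entropy of 0), so `r1` is selected.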
6
Project Options ● Classification ● Clustering ● Recommendation ● Data Cubes
7
Project 1: Classification ● Goal: predict review scores (a 5-class classification problem) ● K-Nearest Neighbor ● Represent each reviewer by a (sparse) vector of that reviewer's scores. How can scores be predicted given a reviewer's nearest neighbors? ● Represent each movie by a vector of the scores reviewers gave it. How can scores be predicted given a movie's nearest neighbors? ● Experiment with different distance measures ● Experiment with various normalization schemes
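One possible answer to the reviewer-neighbor question is sketched below: find the most similar reviewers who rated the movie and take a majority vote. Cosine similarity and majority voting are choices made for this illustration, not methods prescribed by the slides; the data and the names `ratings`, `cosine`, and `predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical sparse vectors: reviewer_id -> {movie_id: rating}.
ratings = {
    "alice": {"m1": 5, "m2": 1, "m3": 4},
    "bob":   {"m1": 5, "m2": 2, "m3": 5, "m4": 4},
    "carol": {"m1": 1, "m2": 5, "m4": 2},
    "dave":  {"m1": 4, "m2": 1, "m4": 4},
}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(reviewer, movie, k=2):
    """Majority vote over the k most similar reviewers who rated `movie`."""
    target = ratings[reviewer]
    candidates = [
        (cosine(target, vec), other)
        for other, vec in ratings.items()
        if other != reviewer and movie in vec
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        return None  # nobody else rated this movie
    votes = Counter(ratings[other][movie] for _sim, other in neighbors)
    return votes.most_common(1)[0][0]

predicted = predict("alice", "m4", k=2)
```

Here the two nearest neighbors of `alice` are `bob` and `dave`, who both gave `m4` a 4, so the prediction is 4. Swapping `cosine` for another measure, or subtracting each reviewer's mean before comparing, is one way to run the distance and normalization experiments.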
8
Project 1: Classification ● Decision Trees and other Parametric Classifiers ● Create dense features for each instance ● Reviewer's average rating ● Movie's average rating ● Movie-related features: actors in each movie (collected from IMDB) ● Time-related features: number of the reviewer's previous scores ● Use the WEKA machine learning package ● Evaluate the performance of various algorithms in the package: Decision Tree, SVM, ...
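A minimal sketch of building such dense feature rows follows. It covers only three of the listed features (reviewer mean, movie mean, count of the reviewer's previous scores); the tuple layout, sample data, and the name `build_features` are assumptions for illustration. Exporting the rows to a format WEKA can read is not shown.

```python
from collections import defaultdict

# Hypothetical reviews, already sorted by date:
# (reviewer_id, movie_id, rating, date).
reviews = [
    ("r1", "m1", 4, "2005-01-01"),
    ("r2", "m1", 2, "2005-01-02"),
    ("r1", "m2", 5, "2005-01-03"),
]

def build_features(reviews):
    """One dense row per review:
    (reviewer mean, movie mean, reviewer's prior review count, label)."""
    by_reviewer, by_movie = defaultdict(list), defaultdict(list)
    for reviewer, movie, rating, _date in reviews:
        by_reviewer[reviewer].append(rating)
        by_movie[movie].append(rating)

    seen = defaultdict(int)  # reviews already emitted per reviewer
    rows = []
    for reviewer, movie, rating, _date in reviews:
        rows.append((
            sum(by_reviewer[reviewer]) / len(by_reviewer[reviewer]),
            sum(by_movie[movie]) / len(by_movie[movie]),
            seen[reviewer],
            rating,
        ))
        seen[reviewer] += 1
    return rows

rows = build_features(reviews)
```

Note that the means here are computed over all reviews, including the one being predicted. In a real experiment they should be computed from the training portion only, otherwise the label leaks into the features.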
9
Project 1: Classification ● Evaluation of Classification Performance Accuracy, Confusion Matrices ● Analysis: Are 1's harder to predict than 5's? Cross-validation ● Does this make sense when there is a time-series component? ● Extensions Learning curves ● How does accuracy change as the training set size increases? Distribution of accuracy per reviewer ● Are some reviewers harder to predict than others? ● Are some movies harder to predict? ...
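The evaluation ideas above can be sketched as follows. The per-class recall helps answer whether 1's are harder to predict than 5's, and the date-ordered split is one alternative to random cross-validation when time matters. All function names and the `(reviewer_id, movie_id, rating, date)` tuple layout are assumptions for illustration.

```python
def confusion_matrix(actual, predicted, classes=(1, 2, 3, 4, 5)):
    """matrix[i][j] counts examples of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of examples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

def per_class_recall(matrix):
    """Recall per true class; None where the class never occurs."""
    return [row[i] / sum(row) if sum(row) else None
            for i, row in enumerate(matrix)]

def time_split(reviews, train_fraction=0.8):
    """Train on the earliest reviews, test on the latest.
    Assumes ISO dates (YYYY-MM-DD), which sort correctly as strings."""
    ordered = sorted(reviews, key=lambda r: r[3])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

For example, with true scores `[1, 1, 5, 5, 5]` and predictions `[1, 2, 5, 5, 4]`, accuracy is 0.6, recall for class 1 is 0.5, and recall for class 5 is 2/3.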
10
Project 2: Clustering ● Goal: Cluster reviewers and movies ● K-means based methods Download G-Means ● Supports k-means and also other variants Cluster using both sparse and dense representations ● Sparse representation: same as used for KNN classification ● Dense representation: same as used for parametric classification
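For orientation, plain k-means (Lloyd's algorithm) on dense vectors can be sketched as below. This is not G-Means and says nothing about its interface; the function name, the naive initialization from the first k points, and the fixed iteration count are simplifications chosen for this illustration.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with squared Euclidean distance on dense tuples."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids

# Hypothetical dense rows, e.g. (reviewer mean rating, movie mean rating).
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (4.8, 5.2)]
assignment, centroids = kmeans(points, k=2)
```

On this toy input the first and third points end up in one cluster and the second and fourth in the other. Real runs would normally use a better initialization and a convergence check.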
11
Project 2: Clustering ● Graph-based methods ● Compute pairwise similarities between reviewers ● Correlation ● Your own ad hoc method, e.g. the Kevin Bacon method ● Sim(x, y) = # of Kevin Bacon movies viewed by both x and y ● Similarity computation may be too expensive to perform on the full dataset ● Software: Graclus ● Results analysis: quantitative as well as qualitative
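Both similarity measures can be sketched as below. Pearson correlation over co-rated movies is one common reading of "correlation" and is an assumption here, as are the sample data, the `bacon_movies` set, and the function names. Feeding the resulting similarity graph to Graclus is not shown.

```python
import math

# Hypothetical sparse vectors: reviewer_id -> {movie_id: rating}.
ratings = {
    "x": {"m1": 5, "m2": 3, "m3": 1},
    "y": {"m1": 4, "m2": 3, "m3": 2, "m4": 5},
    "z": {"m1": 1, "m3": 5},
}
bacon_movies = {"m1", "m4"}  # hypothetical ids of Kevin Bacon movies

def pearson(u, v):
    """Pearson correlation over co-rated movies; 0.0 where undefined."""
    common = u.keys() & v.keys()
    if len(common) < 2:
        return 0.0
    mean_u = sum(u[m] for m in common) / len(common)
    mean_v = sum(v[m] for m in common) / len(common)
    cov = sum((u[m] - mean_u) * (v[m] - mean_v) for m in common)
    var_u = sum((u[m] - mean_u) ** 2 for m in common)
    var_v = sum((v[m] - mean_v) ** 2 for m in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def bacon_sim(u, v):
    """Sim(x, y) = number of Kevin Bacon movies viewed by both x and y."""
    return len(u.keys() & v.keys() & bacon_movies)

sim_xy = pearson(ratings["x"], ratings["y"])
sim_xz = pearson(ratings["x"], ratings["z"])
```

Computing all pairs costs on the order of n² comparisons for n reviewers, which is why the slide warns about the full dataset; the 9,000-reviewer subset already implies roughly 40 million pairs.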
12
Project 3: Recommendation ● Goal: create movie recommendations for each reviewer ● K-Nearest Neighbor ● Instance representation: sparse representation ● Find the reviewer's nearest neighbors ● Recommend movies scored highly by these neighbors ● Try out various distance measures
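A minimal sketch of neighbor-based recommendation follows. It ranks movies the reviewer has not rated by the similarity-weighted mean score among the k nearest neighbors; cosine similarity, the weighting scheme, the sample data, and the names `cosine` and `recommend` are all choices made for this illustration.

```python
import math
from collections import defaultdict

# Hypothetical sparse vectors: reviewer_id -> {movie_id: rating}.
ratings = {
    "alice": {"m1": 5, "m2": 4},
    "bob":   {"m1": 5, "m2": 5, "m3": 5, "m4": 2},
    "carol": {"m1": 4, "m2": 5, "m3": 4, "m5": 5},
    "dave":  {"m1": 1, "m4": 5},
}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(reviewer, k=2, top_n=3):
    """Rank unseen movies by similarity-weighted mean score among k neighbors."""
    target = ratings[reviewer]
    neighbors = sorted(
        ((cosine(target, vec), other)
         for other, vec in ratings.items() if other != reviewer),
        reverse=True,
    )[:k]
    weighted, weight = defaultdict(float), defaultdict(float)
    for sim, other in neighbors:
        if sim <= 0:
            continue  # ignore neighbors with no overlap
        for movie, score in ratings[other].items():
            if movie not in target:
                weighted[movie] += sim * score
                weight[movie] += sim
    scores = {m: weighted[m] / weight[m] for m in weighted}
    # Highest score first; ties broken by movie id for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]

recommendations = recommend("alice")
```

For `alice` the nearest neighbors are `bob` and `carol`, and the ranking is `m5`, `m3`, `m4`. Note that `m5` is supported by a single neighbor; requiring a minimum number of supporting neighbors is one refinement worth trying.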
13
Project 3: Recommendation ● Evaluation ● Propose a way of quantifying the quality of your recommendations ● e.g. a recommendation is good if the reviewer ended up rating the recommended movie with a score of 4 or higher ● Is it harder to recommend movies to reviewers who do not watch many movies? ● Does your evaluation metric reflect this?
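The example criterion (a recommendation is good if it was later rated 4 or higher) can be turned into a simple precision-style metric. This is one possible sketch; the function name, the `held_out` mapping, and the decision to skip recommendations the reviewer never rated are assumptions.

```python
def precision_of_recommendations(recommended, held_out, threshold=4):
    """Fraction of rated recommendations that scored >= threshold.

    `held_out` maps movie_id -> the rating the reviewer eventually gave.
    Recommendations the reviewer never rated are skipped rather than
    counted as misses, because their true quality is unknown.
    """
    rated = [m for m in recommended if m in held_out]
    if not rated:
        return None  # nothing to evaluate for this reviewer
    hits = sum(1 for m in rated if held_out[m] >= threshold)
    return hits / len(rated)
```

For instance, recommending `["m5", "m3", "m4"]` to a reviewer whose held-out ratings are `{"m5": 5, "m4": 2}` gives 0.5. Reviewers with few ratings will often return `None` or a value based on one or two movies, which is exactly the effect the last two questions on the slide ask you to examine.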
14
Project 4: Data Cubes ● Load the data into a data cube ● Find interesting trends in the data ● e.g. the relation between average review score and day of week: slice on day, aggregate review scores across all reviewers and movies ● Find other interesting trends ● Use an open source data cube package (OLAP) ● Mondrian (Java based); you must be a proficient coder
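The day-of-week example amounts to grouping on one dimension and aggregating over the others. The sketch below shows that roll-up in plain Python to make the idea concrete; it does not use Mondrian or any OLAP package, and the tuple layout, sample data, and function name are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical reviews: (reviewer_id, movie_id, rating, ISO date).
reviews = [
    ("r1", "m1", 5, "2005-01-03"),  # a Monday
    ("r2", "m1", 3, "2005-01-10"),  # a Monday
    ("r1", "m2", 4, "2005-01-08"),  # a Saturday
]

def average_by_weekday(reviews):
    """Slice on day of week, then average ratings across all
    reviewers and movies."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [rating sum, count]
    for _reviewer, _movie, rating, day in reviews:
        cell = totals[WEEKDAYS[date.fromisoformat(day).weekday()]]
        cell[0] += rating
        cell[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

weekday_means = average_by_weekday(reviews)
```

A data cube package precomputes and indexes such aggregates over many dimensions (reviewer, movie, date hierarchy), so the same question becomes a query rather than a scan.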