CS378 Final Project The Netflix Data Set Class Project Ideas and Guidelines.


The Data Set ● 17,770 Movies ● 480,189 Reviewers ● More than 100 Million reviews  Rating of 1 through 5  Review Date ● Uncompressed full dataset is 2 Gigabytes

Netflix Data Properties ● Distribution of Number of Reviews per Reviewer ● X-axis: # of reviews ● Y-axis: P(# of reviews)

Netflix Data Subsets ● You will be given two subsets of the data ● Format:  ● Subset  Contains 9,000 reviewers  Restricted to only those movies with at least 5 ratings ● 12,000 movies  ~2 Million reviews  ~50 MB

Project Requirements ● Compute each of the following  Average review score  Top 10 most highly rated movies  Distribution of all review scores ● p(rating=1),..., p(rating=5)  Number of reviews as a function of time  The reviewer whose review score distribution has the largest entropy ● Compute five other properties of the data  These properties should be relevant to your project  You should explain this relevance
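The entropy requirement can be sketched in a few lines. This is a minimal illustration, not part of the assignment handout; the function name and the list-of-ratings representation are assumptions:

```python
from collections import Counter
from math import log2

def rating_entropy(ratings):
    """Shannon entropy (in bits) of one reviewer's rating distribution."""
    counts = Counter(ratings)
    n = len(ratings)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A reviewer who always gives the same score has zero entropy;
# a uniform spread over all five ratings reaches the maximum, log2(5).
print(rating_entropy([5, 5, 5]))        # 0.0
print(rating_entropy([1, 2, 3, 4, 5]))  # log2(5), about 2.32 bits
```

Scanning all 9,000 reviewers in a subset with this function and taking the arg-max gives the required highest-entropy reviewer.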

Project Options ● Classification ● Clustering ● Recommendation ● Data Cubes

Project 1: Classification ● Goal: Predict review scores  5-class classification problem ● K-Nearest Neighbor ● Represent each reviewer by a (sparse) vector of their review scores  How can scores be predicted given a reviewer's nearest neighbors? ● Represent each movie by a vector of each reviewer's scores  How can scores be predicted given a movie's nearest neighbors? ● Experiment with different distance measures ● Experiment with various normalization schemes
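One way the sparse-vector KNN prediction could look, assuming each reviewer is stored as a dict mapping movie IDs to scores. The names `cosine` and `predict` are illustrative, and cosine is just one of the distance measures the slide asks you to try:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: movie -> score)."""
    dot = sum(u[m] * v[m] for m in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def predict(target, others, movie, k=3):
    """Predict target's score for `movie` from the k most similar reviewers
    who have rated it, weighting each neighbor's score by its similarity."""
    rated = [(cosine(target, v), v[movie]) for v in others if movie in v]
    top = sorted(rated, reverse=True)[:k]
    total = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / total if total else None

# Tiny example: reviewers as sparse score vectors
alice = {"m1": 5, "m2": 4}
others = [{"m1": 5, "m2": 4, "m3": 5}, {"m1": 1, "m3": 2}]
print(predict(alice, others, "m3", k=1))  # most similar neighbor rated m3 a 5
```

The movie-based variant is symmetric: transpose the representation so each movie maps reviewer IDs to scores, and average over a movie's nearest neighbors instead. Note that raw cosine on all-positive ratings rewards overlap more than agreement, which is one motivation for the normalization schemes mentioned above.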

Project 1: Classification ● Decision Trees and other Parametric Classifiers  Create dense features for each instance ● Reviewer's average rating ● Movie's average rating ● Movie-related features  Actors in each movie (collected from IMDB) ● Time-related features  Number of reviewer's previous scores  Use the WEKA machine learning package ● Evaluate performance of various algorithms in the package  Decision Tree, SVM,...
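A sketch of the dense-feature step, assuming reviews arrive as (reviewer, movie, rating, date) tuples; in practice you would export these features to WEKA's ARFF format rather than keep them in Python, and all names here are hypothetical:

```python
def dense_features(reviews, reviewer, movie, date):
    """Build a dense feature vector for one (reviewer, movie, date) instance.
    `reviews` is a list of (reviewer, movie, rating, date) tuples."""
    r_scores = [s for r, m, s, d in reviews if r == reviewer]
    m_scores = [s for r, m, s, d in reviews if m == movie]
    prev = [s for r, m, s, d in reviews if r == reviewer and d < date]
    return {
        "reviewer_avg": sum(r_scores) / len(r_scores) if r_scores else 0.0,
        "movie_avg": sum(m_scores) / len(m_scores) if m_scores else 0.0,
        "num_prev_reviews": len(prev),
    }

reviews = [("u1", "m1", 5, 1), ("u1", "m2", 3, 2), ("u2", "m1", 1, 1)]
print(dense_features(reviews, "u1", "m2", 2))
# {'reviewer_avg': 4.0, 'movie_avg': 3.0, 'num_prev_reviews': 1}
```

For the full subset these averages should be precomputed once per reviewer and per movie rather than rescanned per instance as done here.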

Project 1: Classification ● Evaluation of Classification Performance  Accuracy, Confusion Matrices ● Analysis: Are 1's harder to predict than 5's?  Cross-validation ● Does this make sense when there is a time-series component? ● Extensions  Learning curves ● How does accuracy change as the training set size increases?  Distribution of accuracy per reviewer ● Are some reviewers harder to predict than others? ● Are some movies harder to predict? ...
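The "are 1's harder than 5's" analysis falls out of a per-class accuracy table built from the confusion matrix. A minimal sketch, assuming gold ratings and predictions are parallel lists (names hypothetical):

```python
def confusion_matrix(actual, predicted, labels=(1, 2, 3, 4, 5)):
    """cm[a][p] counts instances with true rating a predicted as p."""
    cm = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

def per_class_accuracy(cm):
    """Fraction of each true rating class that was predicted correctly."""
    return {a: (row[a] / sum(row.values()) if sum(row.values()) else 0.0)
            for a, row in cm.items()}

actual    = [1, 1, 5, 5, 5, 3]
predicted = [2, 1, 5, 5, 4, 3]
print(per_class_accuracy(confusion_matrix(actual, predicted)))
# here 1's are right half the time, 5's two thirds of the time
```

For the time-series concern: a chronological split (train on earlier reviews, test on later ones) avoids leaking the future into training, which standard shuffled cross-validation would do.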

Project 2: Clustering ● Goal: Cluster reviewers and movies ● K-means based methods  Download G-Means ● Supports k-means and other variants  Cluster using both sparse and dense representations ● Sparse representation: same as used for KNN classification ● Dense representation: same as used for parametric classification
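For intuition about what G-Means and its variants compute, here is a bare-bones k-means loop on dense vectors. It is a toy stand-in to show the assign/re-center iteration, not a replacement for the package:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on dense vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [
            [sum(xs) / len(c) for xs in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups of reviewers in a 2-D dense feature space
points = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]]
centers, clusters = kmeans(points, k=2)
```

G-Means adds, among other things, a statistical test for choosing k automatically, which this sketch does not attempt.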

Project 2: Clustering ● Graph-based methods  Compute pairwise similarities between reviewers ● Correlation ● Your own ad-hoc method  e.g. the Kevin Bacon method ● Sim(x, y) = # of Kevin Bacon movies viewed by both x and y ● Similarity computation may be too expensive to perform on the full dataset  Software: Graclus ● Results analysis  Quantitative as well as qualitative
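The correlation-based pairwise similarity might look like the following, computed only over the movies both reviewers rated (a common convention; the function name is hypothetical):

```python
from math import sqrt

def pearson_sim(u, v):
    """Pearson correlation over the movies two reviewers have in common.
    u, v: dict movie -> rating."""
    common = u.keys() & v.keys()
    n = len(common)
    if n < 2:
        return 0.0
    mu_u = sum(u[m] for m in common) / n
    mu_v = sum(v[m] for m in common) / n
    num = sum((u[m] - mu_u) * (v[m] - mu_v) for m in common)
    den = sqrt(sum((u[m] - mu_u) ** 2 for m in common)) * \
          sqrt(sum((v[m] - mu_v) ** 2 for m in common))
    return num / den if den else 0.0

a = {"m1": 5, "m2": 3, "m3": 4}
b = {"m1": 4, "m2": 2, "m3": 3}   # agrees with a, shifted down by 1
c = {"m1": 1, "m2": 5, "m3": 2}   # disagrees with a
print(pearson_sim(a, b))  # 1.0 -- mean-centering removes the rating offset
print(pearson_sim(a, c))  # negative
```

Computing this for all ~9,000 × 9,000 reviewer pairs in a subset is feasible; on the full 480,189 reviewers it is the expense the slide warns about, so the resulting similarity graph is usually thresholded or sparsified before handing it to Graclus.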

Project 3: Recommendations ● Goal: Create movie recommendations for each reviewer ● K-Nearest Neighbor  Instance representation ● Sparse representation  Find the reviewer's nearest neighbors ● Recommend movies scored highly by these neighbors  Try out various distance measures
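Given a reviewer's nearest neighbors (found with the same sparse representation and distance measures as in the classification project), the recommendation step could be sketched as below. The score threshold and the ranking by neighbor average are assumptions, not part of the handout:

```python
def recommend(target, neighbors, n=5, min_score=4):
    """Recommend up to n movies the target has not seen, ranked by the
    average score the neighbor reviewers gave them (scores >= min_score only)."""
    scores = {}
    for nb in neighbors:
        for movie, rating in nb.items():
            if movie not in target and rating >= min_score:
                scores.setdefault(movie, []).append(rating)
    ranked = sorted(scores, key=lambda m: sum(scores[m]) / len(scores[m]),
                    reverse=True)
    return ranked[:n]

target = {"m1": 5}
neighbors = [{"m1": 5, "m2": 5, "m3": 4}, {"m2": 5, "m4": 2}]
print(recommend(target, neighbors))  # ['m2', 'm3'] -- m4 scored too low, m1 already seen
```

Weighting each neighbor's vote by its similarity to the target, as in the KNN classifier, is a natural refinement.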

Project 3: Recommendation ● Evaluation  Propose a way of quantifying the quality of your recommendations ● e.g. A recommendation is good if the reviewer ended up rating the recommended movie with a score of 4 or higher  Is it harder to recommend movies to reviewers who do not watch many movies? ● Does your evaluation metric reflect this?
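That example criterion, a recommendation counts as good when the reviewer later rates it 4 or higher, can be phrased as a precision-at-k measure. A sketch under exactly those assumptions (names hypothetical):

```python
def precision_at_k(recommended, later_ratings, k=10, good=4):
    """Fraction of the top-k recommendations the reviewer later rated
    `good` or higher. later_ratings: dict movie -> rating given afterwards."""
    top = recommended[:k]
    if not top:
        return 0.0
    hits = sum(1 for m in top if later_ratings.get(m, 0) >= good)
    return hits / len(top)

recs = ["m2", "m3", "m9"]
later = {"m2": 5, "m3": 2}   # m9 was never rated at all
print(precision_at_k(recs, later, k=3))  # 1/3
```

Note the bias this metric carries toward the slide's question: a reviewer with few subsequent ratings can register few hits no matter how good the recommendations were, so the score should be read alongside that reviewer's rating count.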

Project 4: Data Cubes ● Load the data into a data cube  Find interesting trends in the data ● e.g. Is there a relation between average review score and day of week?  Slice on day, aggregate review scores across all reviewers and movies ● Find other interesting trends ● Use an open source data cube package (OLAP)  Mondrian – Java-based  Must be a proficient coder
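The day-of-week example can be prototyped in plain Python before committing to Mondrian; this group-by stand-in mimics the slice-and-aggregate operation (names and the tuple layout are hypothetical):

```python
from datetime import date
from collections import defaultdict

def avg_score_by_weekday(reviews):
    """Slice on day of week (0=Monday), aggregating the average review
    score across all reviewers and movies.
    reviews: list of (reviewer, movie, score, date) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _, _, score, d in reviews:
        wd = d.weekday()
        sums[wd] += score
        counts[wd] += 1
    return {wd: sums[wd] / counts[wd] for wd in sums}

reviews = [("u1", "m1", 5, date(2005, 1, 3)),   # a Monday
           ("u2", "m1", 3, date(2005, 1, 3)),
           ("u1", "m2", 4, date(2005, 1, 8))]   # a Saturday
print(avg_score_by_weekday(reviews))  # {0: 4.0, 5: 4.0}
```

In Mondrian the same slice would be expressed as a cube with a time dimension and an average-rating measure; the prototype above is only for checking which trends are worth building the cube around.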