Statistics 202: Statistical Aspects of Data Mining

Similar presentations
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 8 = Finish chapter 6 Agenda:
1 BUS 297D: Data Mining Professor David Mease Lecture 8 Agenda: 1) Reminder about HW #4 (due Thursday, 10/15) 2) Lecture over Chapter 10 3) Discuss final.
G54DMT – Data Mining Techniques and Applications Dr. Jaume Bacardit
Rubi’s Motivation for CF  Find a PhD problem  Find “real life” PhD problem  Find an interesting PhD problem  Make Money!
PSY 307 – Statistics for the Behavioral Sciences
Experimental Design, Statistical Analysis CSCI 4800/6800 University of Georgia Spring 2007 Eileen Kraemer.
Customizable Bayesian Collaborative Filtering Denver Dash Big Data Reading Group 11/19/2007.
1 Business 90: Business Statistics Professor David Mease Sec 03, T R 7:30-8:45AM BBC 204 Lecture 2 = Finish Chapter “Introduction and Data Collection”
1) Go over HW #1 solutions (Due today)
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Chapter 9 - Lecture 2 Computing the analysis of variance for simple experiments (single factor, unrelated groups experiments).
Simple Linear Regression Analysis
Mathematical Statistics Lecture Notes Chapter 8 – Sections
Chapter 12 (Section 12.4) : Recommender Systems Second edition of the book, coming soon.
Performance of Recommender Algorithms on Top-N Recommendation Tasks
Chapter 13: Inference in Regression
© 2013 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc Chapter 24 Statistical Inference: Conclusion.
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 9 = Review for midterm exam.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 7 = Finish chapter 3 and.
Chapter 4 Statistics. 4.1 – What is Statistics? Definition Data are observed values of random variables. The field of statistics is a collection.
EMIS 8381 – Spring Netflix and Your Next Movie Night Nonlinear Programming Ron Andrews EMIS 8381.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved Lecture Slides Elementary Statistics Eleventh Edition and the Triola.
Report on Intrusion Detection and Data Fusion By Ganesh Godavari.
Slide Slide 1 Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley. Lecture Slides Elementary Statistics Tenth Edition and the.
Inference for Regression Simple Linear Regression IPS Chapter 10.1 © 2009 W.H. Freeman and Company.
Inferential Statistics Part 1 Chapter 8 P
Sampling distributions rule of thumb…. Some important points about sample distributions… If we obtain a sample that meets the rules of thumb, then…
Recommender Systems Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Credits to Bing Liu (UIC) and Angshul Majumdar.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
PCB 3043L - General Ecology Data Analysis.
Lesson 14 - R Chapter 14 Review. Objectives Summarize the chapter Define the vocabulary used Complete all objectives Successfully answer any of the review.
Collaborative Filtering via Euclidean Embedding M. Khoshneshin and W. Street Proc. of ACM RecSys, pp , 2010.
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
QM Spring 2002 Business Statistics Analysis of Time Series Data: an Introduction.
Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Item-Based Collaborative Filtering Recommendation Algorithms
Quantitative Methods in the Behavioral Sciences PSY 302
Stats 202: Statistical Aspects of Data Mining Professor Rajan Patel
Howard Community College
Unsupervised Learning
Matrix Factorization and Collaborative Filtering
Active Learning Lecture Slides
Data Analysis.
By Arijit Chatterjee Dr
Lecture Notes for Chapter 2 Introduction to Data Mining
Statistics 202: Statistical Aspects of Data Mining
CHAPTER 11 Inference for Distributions of Categorical Data
Statistics 200 Objectives:
Adopted from Bin UIC Recommender Systems Adopted from Bin UIC.
INTRODUCTORY STATISTICS FOR CRIMINAL JUSTICE Test Review: Ch. 7-9
Lecture Slides Elementary Statistics Eleventh Edition
Collaborative Filtering Nearest Neighbor Approach
Q4 : How does Netflix recommend movies?
CHAPTER 14: Confidence Intervals The Basics
CHAPTER 11 Inference for Distributions of Categorical Data
Use your Chapter 1 notes to complete the following warm-up.
Screen Stage Lecturer’s desk Gallagher Theater Row A Row A Row A Row B
Psych 231: Research Methods in Psychology
15.1 The Role of Statistics in the Research Process
CHAPTER 11 Inference for Distributions of Categorical Data
Review for Exam 1 Ch 1-5 Ch 1-3 Descriptive Statistics
CHAPTER 11 Inference for Distributions of Categorical Data
CHAPTER 11 Inference for Distributions of Categorical Data
Recommendation Systems
CHAPTER 11 Inference for Distributions of Categorical Data
Data Pre-processing Lecture Notes for Chapter 2
InferentIal StatIstIcs
Unsupervised Learning
Presentation transcript:

Statistics 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 6 = Collaborative Filtering
Agenda:
1) Homework #2 due Monday
2) Reminder: Midterm is on Monday, July 14th
3) Collaborative Filtering
4) Simpson's Paradox
5) Review for the Midterm

Announcement – Midterm Exam:
The midterm exam will be Monday, July 14.
Stanford and SCPD students should try to take it in class (4:15 PM).
Remote students who can't come to class should take it with a proctor and return it via Scoryst by July 15 at 11:59 PM.
You are allowed one 8.5 x 11 inch sheet (front and back) containing notes.
No books or computers are allowed, but please bring a handheld calculator.
The exam will cover the material that we covered in class from Chapters 1, 2, 3, and 6.

The Netflix Prize
• 100M ratings of movies
• 18k movies and 480k users
• On average ~5,600 ratings per movie
• On average ~208 ratings per user
• Data collected over several years
• Ratings are integers from 1 to 5

Objective
• Reduce RMSE on new data by 10%
• The current system scores 0.951, so the target is 0.856
• New data may not have the same distribution as older data (Netflix is growing: more users and movies, and fewer ratings per user and per movie)
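For reference, the RMSE (root mean squared error) over a test set $\mathcal{T}$ of (user, item) pairs with true ratings $r_{ui}$ and predictions $\hat{r}_{ui}$ is

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i)\in\mathcal{T}} \left(\hat{r}_{ui} - r_{ui}\right)^2}$$

so the 10% target works out to 0.951 × 0.9 ≈ 0.856.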

A baseline model
• b_ui = μ + b_u + b_i
• where μ is the overall mean rating
• b_i is the deviation of item i's mean rating from μ
• b_u is the deviation of user u's mean rating from μ
• Models how "critical" a user is and how good a movie is, on average.
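A minimal R sketch of this baseline, using an invented toy ratings matrix (not course data), and treating b_u and b_i as simple mean deviations rather than jointly fitted parameters:

```r
# Toy ratings matrix: rows = users, columns = movies, NA = unrated.
ratings <- matrix(c(5,  3, NA,
                    4, NA,  1,
                    NA, 4,  2),
                  nrow = 3, byrow = TRUE)

mu  <- mean(ratings, na.rm = TRUE)            # overall mean rating
b_u <- rowMeans(ratings, na.rm = TRUE) - mu   # how generous/critical each user is
b_i <- colMeans(ratings, na.rm = TRUE) - mu   # how good each movie is, on average

# Baseline prediction b_ui = mu + b_u + b_i for user u and movie i:
predict_baseline <- function(u, i) mu + b_u[u] + b_i[i]
predict_baseline(1, 3)   # e.g., predicted rating of user 1 for movie 3
```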

Collaborative Filtering
• CF produces recommendations of items based on patterns of ratings or usage (e.g., purchases), without the need for exogenous information about the items or users
• Relates two fundamentally different entities: items and users

Collaborative Filtering
• Two main techniques:
  – Neighborhood approaches
  – Latent factor models
• Neighborhood methods focus on relationships between items (or users), modeling a user's preference for an item based on that user's ratings of similar items.

Neighborhood approaches
• Two items are more similar if the users who rated both rated them similarly.
• Cluster items based on similarity
• Or build a kNN-based predictive model (see the sketch below)
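A small R sketch of the item-similarity idea just described, assuming Pearson-style similarity (cosine on user-mean-centered ratings); the toy matrix is the same illustrative one as above:

```r
# Toy ratings matrix: rows = users, columns = movies, NA = unrated.
ratings <- matrix(c(5,  3, NA,
                    4, NA,  1,
                    NA, 4,  2),
                  nrow = 3, byrow = TRUE)

# Center each user's ratings and treat unrated entries as neutral (0).
centered <- sweep(ratings, 1, rowMeans(ratings, na.rm = TRUE))
centered[is.na(centered)] <- 0

cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Pairwise item-item similarity matrix.
n <- ncol(centered)
sim <- outer(1:n, 1:n,
             Vectorize(function(i, j) cosine(centered[, i], centered[, j])))
round(sim, 2)   # items rated similarly by the same users score near 1
```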

Latent factor models
• Transform items and users to the same latent factor space
• Explain ratings by characterizing products and users on factors inferred from user feedback
• This new space might identify factors relating to "comedy", "romance", a particular actor, etc.
• The model provides weights for each user and item in this space

Latent factor models
• Map items and users into a latent factor space of dimensionality f
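In the formulation these slides appear to follow (the Koren–Bell–Volinsky Netflix Prize work), item i gets a factor vector $q_i \in \mathbb{R}^f$, user u gets $p_u \in \mathbb{R}^f$, and the predicted rating combines the baseline with their inner product:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

The entries of $q_i$ measure how much the item exhibits each factor; the entries of $p_u$ measure how much the user cares about each factor.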

Latent factor models
• Estimate the parameters by minimizing squared error on the observed ratings, with some regularization
• λ is a regularization parameter that shrinks parameters toward 0
• Estimate with stochastic gradient descent
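The objective, in the usual form (my notation; $\mathcal{K}$ is the set of observed (u, i) pairs, bias terms omitted for brevity):

$$\min_{p_*,\, q_*} \sum_{(u,i)\in\mathcal{K}} \left(r_{ui} - q_i^{\top} p_u\right)^2 + \lambda\left(\|q_i\|^2 + \|p_u\|^2\right)$$

A minimal stochastic gradient descent sketch in R, with toy triples and illustrative values for the learning rate and λ (not the exact algorithm from the slides):

```r
set.seed(1)
f      <- 2      # number of latent factors
gamma  <- 0.02   # learning rate
lambda <- 0.1    # regularization strength

# Observed (user, movie, rating) triples for the toy data above.
obs <- data.frame(u = c(1, 1, 2, 2, 3, 3),
                  i = c(1, 2, 1, 3, 2, 3),
                  r = c(5, 3, 4, 1, 4, 2))

P <- matrix(rnorm(3 * f, sd = 0.1), nrow = 3)   # user factors p_u
Q <- matrix(rnorm(3 * f, sd = 0.1), nrow = 3)   # item factors q_i

for (epoch in 1:500) {
  for (k in seq_len(nrow(obs))) {
    u  <- obs$u[k]; i <- obs$i[k]
    pu <- P[u, ];   qi <- Q[i, ]
    e  <- obs$r[k] - sum(qi * pu)               # prediction error
    # Move each parameter against the gradient of the regularized loss.
    P[u, ] <- pu + gamma * (e * qi - lambda * pu)
    Q[i, ] <- qi + gamma * (e * pu - lambda * qi)
  }
}
sum(Q[3, ] * P[1, ])   # predicted rating of user 1 for movie 3
```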

Latent factor models
• Bonus: include information about whether an item was rated at all
• Each item is associated with an additional factor vector y, which is used to modify the user's factors based on which items that user rated
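This bullet matches the SVD++ idea from the Netflix Prize literature (my reading; hedged accordingly): with $N(u)$ the set of items user u rated, the user factor $p_u$ is replaced by

$$p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j$$

so the mere fact that a user rated an item, regardless of the score given, shifts that user's representation.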

Simpson's Paradox

Simpson's "Paradox" (page 384)
Occurs when the relationship between a pair of variables across different groups reverses when the groups are combined.
Baseball example: the batting averages of David Justice and Derek Jeter in 1995 and 1996. Justice has the better batting average in both 1995 and 1996, but over the two seasons combined his average is lower.

                1995            1996            Combined
Derek Jeter     12/48   .250    183/582  .314   195/630  .310
David Justice   104/411 .253    45/140   .321   149/551  .270
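A quick R check of the reversal, using the numbers from the table:

```r
hits    <- matrix(c(12, 183,
                    104, 45),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Jeter", "Justice"), c("1995", "1996")))
at_bats <- matrix(c(48, 582,
                    411, 140),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Jeter", "Justice"), c("1995", "1996")))

round(hits / at_bats, 3)                    # Justice leads in each season...
round(rowSums(hits) / rowSums(at_bats), 3)  # ...but Jeter leads combined
```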

Another example of Simpson's "Paradox"
A real example from a medical study comparing the effectiveness of two treatments for kidney stones.

Overall success rate:
                Treatment A      Treatment B
                78% (273/350)    83% (289/350)

This table seems to suggest that Treatment B is more effective, but if we break down the data by kidney stone size, we see that the opposite may be true:

                Treatment A      Treatment B
Small stones    93% (81/87)      87% (234/270)
Large stones    73% (192/263)    69% (55/80)
Both            78% (273/350)    83% (289/350)

Sample Midterm Question #1:
What is the definition of data mining used in your textbook?
A) the process of automatically discovering useful information in large data repositories
B) the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data
C) an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data

Sample Midterm Question #2:
If height is measured as short, medium, or tall, then it is what kind of attribute?
A) Nominal
B) Ordinal
C) Interval
D) Ratio

Sample Midterm Question #3:
If my data frame in R is called "data", which of the following will give me the third column?
A) data[2,]
B) data[3,]
C) data[,2]
D) data[,3]
E) data(2,)
F) data(3,)
G) data(,2)
H) data(,3)

Sample Midterm Question #4:
Compute the confidence for the association rule {b, d} → {a} by treating each row as a market basket. Also, state what this value means in plain English.
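The Chapter 6 definition needed here: the confidence of a rule X → Y is the fraction of baskets containing X that also contain Y, i.e., with σ the support count,

$$\mathrm{confidence}(\{b,d\} \rightarrow \{a\}) = \frac{\sigma(\{a, b, d\})}{\sigma(\{b, d\})}$$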

Sample Midterm Question #5:
Compute the standard deviation for the numbers 23, 25, 30. Show your work below.
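One worked solution, assuming the sample standard deviation (n − 1 denominator) is intended:

$$\bar{x} = \frac{23 + 25 + 30}{3} = 26, \qquad s = \sqrt{\frac{(23-26)^2 + (25-26)^2 + (30-26)^2}{3 - 1}} = \sqrt{\frac{9 + 1 + 16}{2}} = \sqrt{13} \approx 3.61$$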