Statistical Learning Introduction: Data Mining Process and Modeling Examples
Data Mining Process
We can see associations between customer type and fraudulent behavior. Are they legitimate, or are they data leakage? Our goal is to build a model that predicts fraud in advance.
Project evolution and relevance to our course
Business problem definition: Targeting, Sales force mgmt.
Modeling problem definition: Wallet / opportunity estimation
Statistical problem definition: Quantile est., Latent variable est.
Modeling methodology design: Quantile est., Graphical model
Model generation & validation: Programming, Simulation, IBM Wallets
Implementation & application development: OnTarget, MAP
Relevance labels on the stages: Outside scope / Keep in mind / This is our domain!
Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements
ESL Chapter 1: Introduction
Identify the risk factors for prostate cancer, measured by lpsa (the log of prostate-specific antigen), based on clinical and demographic variables.
Classify a recorded phoneme based on its log-periodogram. A restricted model (red curve) does much better than an unrestricted one (the jumpy black curve).
Customize a spam detection system.
X = which words appear in a message, and how often
Y = spam or not?
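As a minimal sketch of what "X = which words appear and how often" could look like in practice, the snippet below turns one message into a vector of word counts. The vocabulary, the sample message, and the function name `word_count_features` are all hypothetical, chosen only for illustration; a real system would pick its vocabulary from training data.

```python
import re
from collections import Counter


def word_count_features(message, vocabulary):
    """Return one count per vocabulary word: how often it appears in the message."""
    # Lowercase the text and keep runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", message.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and message (assumptions, not from the slides).
vocabulary = ["free", "money", "meeting", "report"]
message = "FREE money!!! Claim your free prize now"

x = word_count_features(message, vocabulary)  # feature vector X
y = 1  # label Y: 1 = spam, 0 = not spam (assumed encoding)
print(x)
```

Each message becomes one row of X, and any classifier can then be trained on the (X, Y) pairs.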
Identify the digits in a handwritten zip code, from a digitized image.
X = color (intensity) of each pixel
Y = which digit is it?
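To make "X = one value per pixel" concrete, here is a small sketch that flattens a digitized image into a single feature vector. The 3x3 toy image and the name `image_to_features` are assumptions for illustration only; real zip-code images are larger, but the idea is the same.

```python
def image_to_features(image):
    """Flatten a 2-D grid of pixel intensities into one feature vector, row by row."""
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("all image rows must have the same length")
    return [pixel for row in image for pixel in row]


# Hypothetical 3x3 image of a vertical stroke (0 = background, 255 = ink).
image = [
    [0, 255, 0],
    [0, 255, 0],
    [0, 255, 0],
]

x = image_to_features(image)  # feature vector X, one entry per pixel
y = 1  # label Y: the digit this image is assumed to show
print(len(x), x)
```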
Classify a tissue sample into one of several cancer classes, based on a gene expression profile.
X = expression levels of genes
Y = which cancer?
Classify the pixels in a LANDSAT image according to land use.
Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}
X = values of pixels in several wavelength bands
October 2006: Announcement of the NETFLIX Competition
USA Today headline: “Netflix offers $1 million prize for better movie recommendations”
Details:
– Beat NETFLIX's current recommender model, ‘Cinematch’, by 10% in root mean squared error (RMSE) of predicted ratings, prior to 2011
– $50K annual progress prize (relative to baseline)
– Data contains a subset of 100 million movie ratings from NETFLIX, covering 480,189 users and 17,770 movies
– Performance is evaluated on held-out movie-user pairs
The NETFLIX competition has attracted contestants on teams from 180 different countries, with valid submissions from 4336 different teams
Leaderboard: current best result is 9.63% better than baseline (getting close!)
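The "better than baseline" percentages above can be read as relative reductions in prediction error on held-out ratings. The sketch below assumes the error is measured as root mean squared error (RMSE), which is what the Netflix Prize rules used; the ratings and predictions are made-up toy numbers, and the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def relative_improvement(baseline_rmse, model_rmse):
    """Percent reduction in RMSE relative to the baseline model."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Toy held-out ratings and predictions (assumptions, not Netflix data).
actual = [4, 3, 5, 1]
baseline_pred = [3, 3, 3, 3]  # e.g. always predict a constant rating
model_pred = [4, 3, 4, 2]

baseline_error = rmse(baseline_pred, actual)
model_error = rmse(model_pred, actual)
print(relative_improvement(baseline_error, model_error))
```

Under this reading, a model whose RMSE is 0.9 times the baseline RMSE is exactly at the 10% target.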
Data Overview: NETFLIX
[Figure: NETFLIX competition data as a subset of all movies (80K) and all users (6.8M): 17K movies (selection unclear) and 480K users (at least 20 ratings by end …), … M ratings]
Internet Movie Data Base fields: Title, Year, Actors, Awards, Revenue, …
NETFLIX data generation process
[Figure: panel labels: 17K movies, Training Data, Qualifier Dataset (3M); axes: Movie Arrival, User Arrival, Time (1998 to 2005); sample cells show ratings and "?" entries]
Netflix and us
We will hear talks about Netflix and also work with the data throughout our course:
We will have a modeling challenge in our course which will use the Netflix data. The winners will get a grade boost!
– You are all also welcome to try your hand at winning the $1M…
Both yearly $50K prizes were awarded to a team from AT&T, with an Israeli participant (Yehuda Koren)
– He is now back in Israel, and will give us a talk!
While I was at IBM Research, our team won a related competition in 2007 (same data, more “standard” modeling tasks)
– We will probably have a “case study” lecture on that as well