Statistical Learning Introduction: Data Mining Process and Modeling Examples Data Mining Process.


1 Statistical Learning Introduction: Data Mining Process and Modeling Examples Data Mining Process

2 We can see associations between customer type and fraudulent behavior. Are they legitimate, or are they data leakage? Our goal is to build a model that predicts fraud in advance.

3 Project evolution and relevance to our course
Business problem definition: Targeting, Sales force mgmt.
Modeling problem definition: Wallet / opportunity estimation
Statistical problem definition: Quantile est., Latent variable est.
Modeling methodology design: Quantile est., Graphical model
Model generation & validation: Programming, Simulation, IBM Wallets
Implementation & application development: OnTarget, MAP
Stage annotations on the slide: Outside scope; Keep in mind; This is our domain!

4 Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements

5 ESL Chap. 1, Introduction: Identify the risk factors for prostate cancer (lpsa), based on clinical and demographic variables.

6 Classify a recorded phoneme, based on a log-periodogram. A restricted model (red) does much better than an unrestricted one (jumpy black)

7 Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements

8 Customize an email spam detection system. X = which words appear and how often; Y = spam or not?
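To make the X / Y setup on this slide concrete, here is a minimal sketch in Python of turning messages into word-count features. The vocabulary, the two messages, and the helper name `word_count_features` are all invented for illustration; they are not from the course material or from any real spam dataset.

```python
from collections import Counter


def word_count_features(message, vocabulary):
    """Count how many times each vocabulary word appears in the message."""
    counts = Counter(message.lower().split())
    # Counter returns 0 for words that never appear, so no special case is needed.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and labeled messages (assumptions, for illustration only).
vocabulary = ["free", "money", "meeting"]
messages = [
    ("free money free offer", 1),    # Y = 1: spam
    ("meeting moved to monday", 0),  # Y = 0: not spam
]

# X holds one row of word counts per message; Y holds the spam labels.
X = [word_count_features(text, vocabulary) for text, _ in messages]
Y = [label for _, label in messages]

print(X)
print(Y)
```

Any classifier can then be trained on the pairs (X, Y); the sketch stops at feature construction because the slide does not name a specific model.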

9 Identify the numbers in a handwritten zip code, from a digitized image. X = color of each pixel; Y = which digit is it?

10 Classify a tissue sample into one of several cancer classes, based on a gene expression profile. X = expression levels of genes; Y = which cancer?

11 Classify the pixels in a LANDSAT image, according to usage: Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil} X = values of pixels in several wavelength bands


13 October 2006: Announcement of the NETFLIX Competition
USA Today headline: “Netflix offers $1 million prize for better movie recommendations”
Details:
– Beat NETFLIX's current recommender model, ‘Cinematch’, by 10% based on absolute rating error, prior to 2011
– $50K for the annual progress prize (relative to baseline)
– Data contains a subset of 100 million movie ratings from NETFLIX, including 480,189 users and 17,770 movies
– Performance is evaluated on holdout movie-user pairs
– The NETFLIX competition has attracted 45,878 contestants on 37,660 teams from 180 different countries
– 36,868 valid submissions from 4,336 different teams
– Leaderboard: current best result is 9.63% better than baseline (getting close!)
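The "10% better than baseline" target can be read as a relative reduction in holdout error. The sketch below, in Python, shows that arithmetic on a tiny set of made-up holdout ratings. It uses mean absolute error to follow the slide's wording; the official contest score was root mean squared error, and the same relative-improvement formula applies to either. All ratings and predictions here are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical ratings for four holdout movie-user pairs (not real Netflix data).
actual = [4, 3, 5, 2]
baseline_pred = [3, 3, 4, 4]  # stands in for the baseline recommender
model_pred = [4, 3, 4, 3]     # stands in for a contestant's model

baseline_error = mean_absolute_error(actual, baseline_pred)
model_error = mean_absolute_error(actual, model_pred)
improvement = relative_improvement(baseline_error, model_error)

# The prize threshold on the slide is a 10% improvement over the baseline.
meets_target = improvement >= 10.0
print(baseline_error, model_error, improvement, meets_target)
```

On this reading, the leaderboard figure of 9.63% means the best submission's holdout error was 9.63% lower than the baseline's.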

14 Data Overview: NETFLIX and Internet Movie Data Base
NETFLIX Competition Data: 100 M ratings
Movies: 17K out of all movies (80K); selection unclear
Users: 480 K out of all users (6.8 M); at least 20 ratings by end 2005
Internet Movie Data Base fields: Title, Year, Actors, Awards, Revenue, …

15 NETFLIX data generation process
[Figure: ratings matrix for 17K movies, with some entries unknown (?). Axes: movie arrival, user arrival, time from 1998 to 2005. Labels: Training Data; Qualifier Dataset (3M).]

16 Netflix and us
We will hear talks about Netflix and also work with the data throughout our course:
We will have a modeling challenge in our course which will use the Netflix data. The winners will get a grade boost!
– You are all also welcome to try your hand at winning the $1M…
Both yearly $50K prizes were awarded to a team from AT&T, with an Israeli participant (Yehuda Koren)
– He is now back in Israel, and will give us a talk!
While I was at IBM Research, our team won a related competition in 2007 (same data, more “standard” modeling tasks)
– We will probably have a “case study” lecture on that as well

