Published by Erick Banks. Modified over 9 years ago.
1
Statistical Learning Introduction: Modeling Examples
2
We can see associations between customer type and fraudulent behavior. Are they legitimate, or the result of data leakage? Our goal is to build a model that predicts fraud in advance.
3
Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements
4
ESL Chap1 - Introduction
Identify the risk factors for prostate cancer (lpsa), based on clinical and demographic variables.
5
Classify a recorded phoneme, based on a log-periodogram. A restricted model (red) does much better than an unrestricted one (jumpy black)
6
Customize an email spam detection system.
X = which words appear and how often
Y = spam or not?
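One way to picture the X and Y of the spam example is as word counts per email plus a 0/1 label. The sketch below is only an illustration under stated assumptions: the vocabulary, the two example emails, and the helper name `word_count_features` are all hypothetical and do not come from the slides.

```python
import re
from collections import Counter


def word_count_features(email_text, vocabulary):
    """Count how often each vocabulary word appears in one email (one row of X)."""
    tokens = re.findall(r"[a-z']+", email_text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and toy emails, chosen only for illustration.
vocabulary = ["free", "money", "meeting", "report"]
emails = [
    ("Free money! Claim your free prize now", 1),   # 1 = spam
    ("Meeting moved; please send the report", 0),   # 0 = not spam
]

X = [word_count_features(text, vocabulary) for text, _ in emails]
y = [label for _, label in emails]

for row, label in zip(X, y):
    print(row, label)
```

Each row of `X` answers "which words appear and how often" for one email, and `y` holds the corresponding spam label; any classifier would then be trained on such pairs.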
7
Identify the numbers in a handwritten zip code, from a digitized image.
X = color of each pixel
Y = which digit is it?
8
Classify a tissue sample into one of several cancer classes, based on a gene expression profile.
X = expression levels of genes
Y = which cancer?
9
Classify the pixels in a LANDSAT image, according to usage:
Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}
X = values of pixels in several wavelength bands
11
October 2006: Announcement of the NETFLIX Competition
USAToday headline: “Netflix offers $1 million prize for better movie recommendations”
Details:
– Beat NETFLIX's current recommender model ‘Cinematch’ by 10%, based on absolute rating error, prior to 2011
– $50K for the annual progress prize (relative to baseline)
– Data contains a subset of 100 million movie ratings from NETFLIX, including 480,189 users and 17,770 movies
– Performance is evaluated on holdout movie-user pairs
– The NETFLIX competition has attracted 45878 contestants on 37660 teams from 180 different countries
– Tens of thousands of valid submissions from thousands of teams
Conclusion: in 2009, an international team attained the goal and won the prize! More later…
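The "beat the baseline by 10%" criterion can be made concrete with a small calculation. The sketch below reads "absolute rating error" as mean absolute error, which is an assumption; the same relative-improvement check applies to other error measures such as root mean squared error. The ratings and predictions are made-up toy numbers, not competition data.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, new_error):
    """Fraction by which the new model reduces the baseline's error."""
    return (baseline_error - new_error) / baseline_error


# Toy holdout ratings (1-5 stars) and two sets of predictions, for illustration only.
actual = [4, 3, 5, 2, 1]
baseline_pred = [3, 3, 4, 3, 2]      # stands in for the baseline recommender
challenger_pred = [4, 3, 4, 3, 1.5]  # stands in for a contestant's model

baseline_error = mean_absolute_error(actual, baseline_pred)
challenger_error = mean_absolute_error(actual, challenger_pred)
improvement = relative_improvement(baseline_error, challenger_error)
meets_goal = improvement >= 0.10  # the 10% target from the announcement

print(f"baseline error:   {baseline_error:.3f}")
print(f"challenger error: {challenger_error:.3f}")
print(f"improvement:      {improvement:.1%}  (goal met: {meets_goal})")
```

The important design point is that both errors are computed on the same held-out movie-user pairs, so the comparison reflects prediction quality on ratings neither model saw during training.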
12
Data Overview: NETFLIX
[Figure: the NETFLIX competition data as a subset of all NETFLIX data]
All NETFLIX data: all movies (80K), all users (6.8 M)
NETFLIX Competition Data: 17K movies (selection unclear), 480 K users (at least 20 ratings by end 2005), 100 M ratings
Internet Movie Data Base fields: Title, Year, Actors, Awards, Revenue, …
13
NETFLIX data generation process
[Figure: timeline from 1998 to 2005 showing movie arrival and user arrival; Training Data (17K movies) with known ratings, and the Qualifier Dataset (3M) with ratings to be predicted, marked "?"]
14
Netflix and us
We will have a modeling challenge in our course which will use the Netflix data. The winners will get a grade boost!
The $1M was won in 2009 by a collaboration of several leading teams
– The strongest team, which won both yearly $50K prizes, was founded at AT&T, with an Israeli participant (Yehuda Koren)
– Yehuda was one of the major driving forces on the final winning team
– He is now back in Israel, and may come give us a talk!
While I was at IBM Research, our team won a related competition in KDD-Cup 2007 (same data, more “standard” modeling tasks)
– We may have a “case study” lecture on that as well
15
Project evolution and relevance to our course
[Figure: project stages, each with an example; annotations on the figure: "Outside scope", "Keep in mind", "This is our domain!"]
Business problem definition: Targeting, Sales force mgmt.
Modeling problem definition: Wallet / opportunity estimation
Statistical problem definition: Quantile est., Latent variable est.
Modeling methodology design: Quantile est., Graphical model
Model generation & validation: Programming, Simulation, IBM Wallets
Implementation & application development: OnTarget, MAP