Statistical Learning Introduction: Modeling Examples


Our goal is to build a model to predict fraud in advance. We can see associations between customer type and fraudulent behavior. Are they legitimate, or do they result from data leakage?

Predict whether someone will have a heart attack on the basis of demographic, diet, and clinical measurements.

ESL Chap. 1 - Introduction
Model test results for prostate cancer (lpsa), based on actual cancer volume (lcavol) and other clinical and demographic variables.

Classify a recorded phoneme, based on a log-periodogram. A restricted model (red curve) does much better than an unrestricted one (jumpy black curve).

Customize an email spam detection system.
X = which words appear and how often
Y = spam or not
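The X and Y of the spam example can be sketched as word counts over a fixed vocabulary. This is only an illustration: the vocabulary, the two emails, and the function name `word_counts` are made up here and are not taken from the slides.

```python
import re
from collections import Counter


def word_counts(email_text, vocabulary):
    """Return one feature vector: how often each vocabulary word appears."""
    # Lowercase and keep runs of letters, so "money!" counts as "money".
    tokens = re.findall(r"[a-z']+", email_text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and labeled emails (1 = spam, 0 = not spam).
vocabulary = ["free", "money", "meeting", "report"]
emails = [
    ("Free money! Claim your free prize now", 1),
    ("Please send the report before the meeting", 0),
]

X = [word_counts(text, vocabulary) for text, _ in emails]
Y = [label for _, label in emails]

for features, label in zip(X, Y):
    print(features, label)
```

Any classifier can then be trained on the pairs in X and Y; the feature construction is the part this slide describes.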

Identify the digits in a handwritten zip code from a digitized image.
X = color of each pixel
Y = which digit it is

Classify a tissue sample into one of several cancer classes, based on a gene expression profile.
X = expression levels of genes
Y = which cancer

Classify the pixels in a LANDSAT image according to usage.
X = values of pixels in several wavelength bands
Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}

80,000 DVD titles, 6.8 million users, 2,000 online movies, 1,400 employees

October 2006: Announcement of the NETFLIX Competition
USA Today headline: “Netflix offers $1 million prize for better movie recommendations”
Details:
Beat NETFLIX's current recommender model, ‘Cinematch’, by 10% in rating prediction error (RMSE) prior to 2011
$50K annual progress prize (relative to baseline)
Data contains a subset of 100 million movie ratings from NETFLIX, covering 480,189 users and 17,770 movies
Performance is evaluated on holdout movie-user pairs
The competition has attracted 45,878 contestants on 37,660 teams from 180 different countries
Tens of thousands of valid submissions from thousands of teams
Conclusion: in 2009, an international team attained the goal and won the prize! More later…
NETFLIX went public in 2002 (at around $20)
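The 10% target can be made concrete with a small sketch. It assumes root mean squared error (RMSE) on held-out user-movie pairs as the error measure; the ratings and predictions below are invented for illustration and the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Made-up holdout ratings and two sets of predictions for the same pairs.
holdout = [4, 5, 1, 3, 2]
baseline_pred = [3, 4, 2, 3, 3]
candidate_pred = [3.5, 4.5, 1.5, 3, 2.5]

baseline_rmse = rmse(baseline_pred, holdout)
candidate_rmse = rmse(candidate_pred, holdout)
gain = relative_improvement(baseline_rmse, candidate_rmse)

print(f"baseline RMSE:  {baseline_rmse:.4f}")
print(f"candidate RMSE: {candidate_rmse:.4f}")
print(f"improvement:    {gain:.1%}  (goal met: {gain >= 0.10})")
```

The toy improvement is far larger than anything seen in the real competition; the point is only how the comparison against the baseline is computed.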

Data Overview
NETFLIX: all movies (80K), all users (6.8M)
NETFLIX Competition Data: 17K movies (selection unclear), 480K users (at least 20 ratings by end of 2005), 100M ratings (e.g., 4, 5, 1, 3, 2)
Internet Movie Database fields: Title, Year, Actors, Awards, Revenue, …
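The competition data keeps users with at least 20 ratings by the end of 2005. A minimal sketch of that kind of filter follows, using hypothetical (user, movie, rating) triples and a lowered threshold of 2 so the toy data stays small. The function name and data layout are assumptions, not the actual selection procedure.

```python
from collections import Counter


def filter_active_users(ratings, min_ratings=20):
    """Keep only ratings from users who have at least min_ratings ratings."""
    counts = Counter(user for user, _, _ in ratings)
    return [row for row in ratings if counts[row[0]] >= min_ratings]


# Hypothetical ratings: (user id, movie id, rating on a 1-5 scale).
toy_ratings = [
    ("u1", "m1", 4),
    ("u1", "m2", 5),
    ("u2", "m1", 1),
    ("u3", "m2", 3),
    ("u3", "m3", 2),
]

kept = filter_active_users(toy_ratings, min_ratings=2)
print(kept)
```

Filters like this change which users the model is evaluated on, which is worth keeping in mind when generalizing from competition results to the full user base.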

NETFLIX data generation process
[Figure: user arrival and movie arrival over time, 1998 to 2005; 17K movies; training data with known ratings (e.g., 4, 5, 3, 2) and a qualifier dataset (3M) with unknown ratings (?)]

Netflix in the class
We will demonstrate many of the methods we discuss on a simplified version of the Netflix dataset.
The $1M prize was won in 2009 by a collaboration of several leading teams.
The strongest team, which won both yearly $50K prizes, was founded at AT&T, with an Israeli participant (Yehuda Koren).
I have Yehuda’s presentation on their work, and if time allows we will discuss it briefly in class.
While I was at IBM Research, our team won a related competition in KDD-Cup 2007 (same data, more “standard” modeling tasks).
We may have a “case study” lecture on that as well.

Project evolution and relevance to our course
Business problem definition: Targeting, Sales force mgmt.
Modeling problem definition: Wallet / opportunity estimation
Statistical problem definition: Quantile est., Latent variable est.
Modeling methodology design: Quantile est., Graphical model
Model generation & validation: Programming, Simulation, IBM Wallets
Implementation & application development: OnTarget, MAP
Scope labels on the slide: Outside scope; Keep in mind; This is our domain!