Using R to win Kaggle Data Mining Competitions


Using R to win Kaggle Data Mining Competitions
Chris Raimondi
November 1, 2012

Overview of talk
- What I hope you get out of this talk
- Life before R
- Simple model example
- R programming language: background, stats, and info
- How to get started
- Kaggle

Speaker note: Antonio Possolo, Division Chief of Statistical Engineering at the National Institute of Standards and Technology (NIST), was charged with making sense of these varied estimates to help the government coordinate the national response to the spill. As described in this video testimonial (starting at 2:20), Possolo was sitting in the company of the Secretaries of Energy and the Interior when he broke out R on his laptop to run uncertainty analysis and harmonize the estimates from the various sources.

Overview of talk: individual Kaggle competitions
- HIV Progression
- Chess
- Mapping Dark Matter
- dunnhumby's Shopper Challenge
- Online Product Sales

What I want you to leave with
- The belief that you don't need to be a statistician to use R, nor do you need to fully understand machine learning in order to use it
- Motivation to use Kaggle competitions to learn R
- Knowledge of how to start

My life before R
- Lots of Excel
- Had tried programming in the past and got frustrated
- Read the NY Times article in January 2009 about R and Google
- Installed R, but gave up after a couple of minutes
- Months later…

My life before R
- Was using Excel to run PageRank calculations; it took hours and was very messy
- Was experimenting with Pajek, a Windows-based network/link analysis program
- Was looking for a similar program that did PageRank calculations
- Revisited R as a possibility

My life before R
- Came across the "R Graph Gallery"
- Saw this graph…

Addicted to R in one line of code

pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])

"pairs" = function; "iris" = data frame

What do we want to do with R? Machine learning, or more specifically, making models. We want to TRAIN a model on a set of data with KNOWN answers/outcomes in order to PREDICT the answer/outcome for similar data where the answer is not known.
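As a minimal sketch of this train/predict workflow in base R (my own illustration using the built-in iris data, not code from the talk):

```r
# TRAIN on rows where the outcome is known, PREDICT on held-out rows.
set.seed(42)
idx   <- sample(nrow(iris), 100)   # 100 rows with "known" outcomes
train <- iris[idx, ]
test  <- iris[-idx, ]              # pretend these outcomes are unknown

fit   <- lm(Petal.Width ~ Petal.Length, data = train)  # train the model
preds <- predict(fit, newdata = test)                  # predict the unknowns

length(preds)                      # one prediction per held-out row: 50
```

The same two-step pattern (fit on training data, call predict on new data) holds for nearly every modeling function in R.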

How to train a model
R allows you to train models using probably over 100 different machine learning methods. To train a model you need to provide:
- the name of the function (i.e. which machine learning method to use)
- the name of the dataset
- which variable is your response and which features you are going to use

Example machine learning methods available in R
- Bagging
- Boosted Trees
- Elastic Net
- Gaussian Processes
- Generalized Additive Models
- Generalized Linear Models
- K Nearest Neighbors
- Linear Regression
- Nearest Shrunken Centroids
- Neural Networks
- Partial Least Squares
- Principal Component Regression
- Projection Pursuit Regression
- Quadratic Discriminant Analysis
- Random Forests
- Recursive Partitioning
- Rule-Based Models
- Self-Organizing Maps
- Sparse Linear Discriminant Analysis
- Support Vector Machines

Code used to train a decision tree

library(party)
irisct <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris)

Or use "." to mean everything else, as in:

irisct <- ctree(Species ~ ., data = iris)

That's it. You've trained your model; to make predictions with it, use the "predict" function, like so:

my.prediction <- predict(irisct, iris2)

To see a graphical representation of it, use "plot":

plot(irisct)
plot(irisct, tp_args = list(fill = c("red", "green3", "blue")))

R background
- Statistical programming language, around since 1996
- Powerful: used by companies like Google, Allstate, and Pfizer
- Over 4,000 packages available on CRAN
- Free
- Available for Linux, Mac, and Windows

Learn R – Starting Tonight
- Buy "R in a Nutshell"
- Download and install R
- Download and install RStudio
- Watch the 2.5-minute video on the front page of rstudio.com
- Use read.csv to read a Kaggle data set into R
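A sketch of that last step; "train.csv" is a stand-in name for whatever file the Kaggle competition provides (here we fabricate a tiny one so the example runs on its own):

```r
# Fabricate a two-row stand-in for a Kaggle training file
writeLines(c("id,feature,target",
             "1,0.5,0",
             "2,1.3,1"), "train.csv")

train <- read.csv("train.csv")  # read the competition data into a data frame
str(train)                      # inspect the column types read.csv inferred
dim(train)                      # rows x columns
```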

Learn R – Continue Tomorrow
- Train a model using Kaggle data
- Make a prediction using that model
- Submit the prediction to Kaggle

Learn R – This Weekend
- Install the caret package
- Start reading the four caret vignettes
- Use caret's "train" function to train a model, select a tuning parameter, and make a prediction with that model

Buy This Book: R in a Nutshell
- Excellent reference
- 2nd edition released just two weeks ago
- In stock at Amazon for $37.05
- Extensive chapter on machine learning

R Studio

R Tip
Read the vignettes; some of them are golden. There is a correlation between the quality of an R package and the quality of its associated vignette.

What is Kaggle?
- Platform/website for predictive modeling competitions
- Think middleman: they provide the tools for anyone to host a data mining competition
- Makes it easy for competitors as well: they know where to go to find the data and the competitions
- Community/forum for finding teammates

Kaggle Stats
- Competitions started over 2 years ago
- 55+ different competitions
- Over 60,000 competitors
- 165,000+ entries
- Over $500,000 in prizes awarded

Why Use Kaggle?
- Rich, diverse set of competitions
- Real-world data
- Competition = motivation
- Fame
- Fortune

Who has Hosted on Kaggle?

Methods used by competitors (source: kaggle.com)

Predict HIV Progression
Prizes: 1st $500.00
Objective: predict (yes/no) whether there will be an improvement in a patient's HIV viral load.
Training data: 1,000 patients
Testing data: 692 patients

Data layout
- Training set: various features plus the response (answer known)
- Test set: same features, response withheld; split between the public leaderboard and the private leaderboard
Example columns per patient: PR Seq (e.g. CCTCAGATCA…), RT Seq (e.g. TACCTTAAAT…), VL-t0 (e.g. 4.7), CD4-t0 (e.g. 473)

Predict HIV Progression

Predict HIV Progression
Features provided:
- PR: 297 letters long, or N/A
- RT: 193 to 494 letters long
- CD4: numeric
- VLt0: numeric
Features used:
- PR1 to PR97: factor
- RT1 to RT435: factor
- CD4: numeric
- VLt0: numeric

Predict HIV Progression
Concepts/packages used:
- caret: train, rfe
- randomForest

Random Forest
- Tree 1: take a random ~63.2% sample of rows from the data set; for each node, consider mtry random features (for the four iris features shown, the default mtry would be 2)
- Tree 2: take a different random ~63.2% sample of rows from the data set
- And so on…
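The ~63.2% figure is a property of bootstrap sampling: drawing n rows with replacement includes each original row with probability 1 - (1 - 1/n)^n, which approaches 1 - exp(-1) ≈ 0.632. A quick base-R check (my own illustration, not from the slides):

```r
set.seed(1)
n <- 150  # number of rows in the iris data
# Average fraction of distinct original rows that appear in one
# bootstrap sample, over 1000 draws
frac <- mean(replicate(1000,
  length(unique(sample(n, n, replace = TRUE))) / n))
round(frac, 3)  # close to 1 - exp(-1) = 0.632
```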

Caret – train

library(caret)  # provides train() and trainControl()
TrainData <- iris[, 1:4]
TrainClasses <- iris[, 5]
knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 3,
                 trControl = trainControl(method = "cv", number = 10))

Caret – train

> knnFit1
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered, scaled
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results across tuning parameters:

Caret – train

k   Accuracy  Kappa  Accuracy SD  Kappa SD
5   0.94      0.91   0.0663       0.0994
7   0.967     0.95   0.0648       0.0972
9   0.953     0.93   0.0632       0.0949
11  0.953     0.93   0.0632       0.0949
13  0.967     0.95   0.0648       0.0972
15  0.967     0.95   0.0648       0.0972
17  0.973     0.96   0.0644       0.0966
19  0.96      0.94   0.0644       0.0966
21  0.96      0.94   0.0644       0.0966
23  0.947     0.92   0.0613       0.0919

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 17.

Benefits of winning
- Cold hard cash
- Several newspaper articles
- Quoted in Science magazine
- Prestige
- Easier to find people willing to team up
- Asked to speak at STScI
- Perverse pleasure in telling people that the team that came in second worked at…

IBM Thomas J. Watson Research Center

Chess Ratings Comp
Prizes: 1st $10,000.00
Objective: given 100 months of data, predict game outcomes for months 101 to 105.
Training data provided:
- Month
- White Player #
- Black Player #
- White outcome: win/draw/lose (1/0.5/0)

How do I convert the data into a flat 2D representation? Think: what are you trying to predict? What features will you use?

Flattened representation (one row per game)
Columns: Outcome; White Features 1-4; Black Features 1-4; White/Black Features 1-4; Game Features 1-2
Example features: percentage of games won; number of games won as White; number of games played; White games played / Black games played; type of game played

Packages/concepts used:
- igraph
- 1st real function

Mapping Dark Matter
Prizes: 1st ~$3,000.00 (the prize is an expenses-paid trip to the Jet Propulsion Laboratory (JPL) in Pasadena, California to attend the GREAT10 challenge workshop "Image Analysis for Cosmology")
Objective: "Participants are provided with 100,000 galaxy and star pairs. A participant should provide an estimate for the ellipticity for each galaxy."

dunnhumby's Shopper Challenge
Prizes: 1st $6,000.00, 2nd $3,000.00, 3rd $1,000.00
Objective: predict the next date that the customer will make a purchase AND predict the amount of the purchase to within £10.00

Data Provided
- For 100,000 customers: April 1, 2010 to June 19, 2011
- Fields: customer_id, visit_date, visit_spend
- For 10,000 customers: April 1, 2010 to March 31, 2011

Really two different challenges:
1) Predict the next purchase date: max of ~42.73% obtained
2) Predict the purchase amount to within £10.00: max of ~38.99% obtained
If the two were independent: 42.73% * 38.99% = 16.66%
In reality, the max obtained was 18.83%
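The independence arithmetic on this slide can be checked directly in R (my own sanity check):

```r
p_date  <- 0.4273            # best score on next-purchase date alone
p_spend <- 0.3899            # best score on spend-within-£10 alone
p_both  <- p_date * p_spend  # expected joint score IF the two were independent
round(100 * p_both, 2)       # about 16.66 percent, as on the slide
```

That the actual best (18.83%) beats the independence estimate suggests the two targets are positively related, so solving them jointly pays off.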

dunnhumby's Shopper Challenge
Packages used & concepts explored:
- 1st competition with real dates: zoo, arima, forecast
- SVD: svd, irlba

SVD: singular value decomposition

X = U D V^T
- X: the original matrix (807 x 1209), rows (Row 1 … Row N) by columns (Col 1 … Col N); U holds row features, V^T holds column features
- U: components ordered 1st (most important) through Nth
- D: an N x N diagonal matrix of weights, 1 … N, in decreasing order of importance
- V^T: rows ordered 1st through Nth (N x N)
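A quick check of this identity in base R (my own illustration, on a small random matrix standing in for the 807 x 1209 one):

```r
set.seed(7)
X <- matrix(rnorm(20), nrow = 5)           # small stand-in matrix
s <- svd(X)                                # returns s$u, s$d, s$v
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)  # X = U D V^T
max(abs(X - X_rebuilt))                    # essentially zero
all(diff(s$d) <= 0)                        # d sorted most- to least-important
```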

X ≈ U D V^T, keeping only the 1st (most important) component of the 807 x 1209 original matrix:

library(ReadImages)  # provides read.jpeg() and imagematrix()
x <- read.jpeg("test.image.2.jpg")
im <- imagematrix(x, type = "grey")
im.svd <- svd(im)
u <- im.svd$u
d <- diag(im.svd$d)
v <- im.svd$v

X ≈ U D V^T, rank-1 reconstruction:

new.u <- as.matrix(u[, 1:1])
new.d <- as.matrix(d[1:1, 1:1])
new.v <- as.matrix(v[, 1:1])
new.mat <- new.u %*% new.d %*% t(new.v)
new.im <- imagematrix(new.mat, type = "grey")
plot(new.im, useRaster = TRUE)

[Figure sequence: X ≈ U D V^T reconstructions of the 807 x 1209 image, keeping the top 1, 2, 3, 4, 5, and 6 components, and finally all 807.]

X = U D V^T for the shopper data
- X: original matrix, 100,000 customers x 365 days (rows Cust 1 … Cust N; columns Day 1 … Day N)
- U: customer features (100,000 x 365), components ordered 1st (most important) through Nth
- D: 365 x 365 diagonal matrix of weights
- V^T: day features (365 x 365), rows ordered 1st through Nth

[Figure: D, the 365 x 365 diagonal matrix of weights 1 … N for the 100,000 x 365 original matrix]

[Figure: the 1st (most important) component: U[,1], a 100,000 x 1 column, and V^T, a 365 x 1 vector]

[Figure sequence: plots of the 1st through 7th right singular vectors V^T (365 x 1 each; first 28 values shown) and the 8th (all 365 values shown)]

Online Product Sales
Prizes: 1st $15,000.00, 2nd $5,000.00, 3rd $2,500.00
Objective: "[P]redict monthly online sales of a product. Imagine the products are online self-help programs following an initial advertising campaign."

Online Product Sales
Packages/concepts explored:
- Data analysis: looking at the data closely
- gbm
- Teams

Online Product Sales: looking at the data closely

A sorted column of values contained many exact repeats: …, 6532, 6532, 6661, 6661, 7696, 7701, 7701, 8229, 8412, 8895, 9596, 9596, 9772, 9772, … Split by category, the pairs line up across Cat_1=0 and Cat_1=1 (6274, 6532, 6661, 7696, 7701, 8229, 8412, 8895, 9596, 9772), apparently the same underlying records appearing in both categories.

Online Product Sales
On the public leaderboard:

Online Product Sales
On the private leaderboard:

Thank You! Questions?

Extra Slides

R Code for Dunnhumby Time Series

X = U D V^T in R:

> my.svd <- svd(iris[,1:4])
> objects(my.svd)
[1] "d" "u" "v"
> my.svd$d
[1] 95.959914 17.761034  3.460931  1.884826
> dim(my.svd$u)
[1] 150   4
> dim(my.svd$v)
[1] 4 4