Data Mining Tutorial – D. A. Dickey, NCSU and Miami grad! (before 1809)

What is it? Large datasets Fast methods Not significance testing Topics –Trees (recursive splitting) –Nearest Neighbor –Neural Networks –Clustering –Association Analysis

Trees A “divisive” method (splits) Start with “root node” – all in one group Get splitting rules Response often binary Result is a “tree” Example: Loan Defaults Example: Framingham Heart Study

Recursive Splitting. Figure: the (X1, X2) plane, with X1 = Debt-To-Income Ratio and X2 = Age, split into rectangles, each labeled with its estimated default rate (Pr{default} values from 0.003 to 0.012); individual points are marked Default or No default.

Some Actual Data: Framingham Heart Study, first-stage coronary heart disease. P{CHD} = function of: Age (no drug yet!), Cholesterol, Systolic BP.

Example of a “tree”. Figure: all 1615 patients in the root node; split #1 is on Age; a later split uses Systolic BP; unsplit boxes are “terminal nodes”.

How to make splits? Which variable to use? Where to split? –Cholesterol > ____ –Systolic BP > _____ Goal: Pure “leaves” or “terminal nodes” Ideal split: Everyone with BP>x has problems, nobody with BP<x has problems

Where to Split? First review chi-square tests and contingency tables. Figure: a 2x2 table of Heart Disease (No / Yes) by BP (Low / High, 100 patients each), shown once with a DEPENDENT pattern and once with an INDEPENDENT pattern.

Chi-Square Test Statistic. If BP and heart disease were independent we would expect 100(150/200) = 75 in the upper left cell (and e.g. 100(50/200) = 25), giving observed (expected) counts:

             Heart Disease
               No        Yes
  Low BP     95 (75)     5 (25)
  High BP    55 (75)    45 (25)

Chi-square = 2(400/75) + 2(400/25) = 42.67. Compare to tables – significant! But WHERE IS THE HIGH BP CUTOFF???
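The same statistic can be reproduced in SAS with PROC FREQ on the cell counts (a minimal sketch; the dataset and variable names BP_CHD, BP, CHD, COUNT are made up for illustration):

data bp_chd;
  input bp $ chd $ count;          /* one row per cell of the 2x2 table */
datalines;
Low  No  95
Low  Yes  5
High No  55
High Yes 45
;
proc freq data=bp_chd;
  weight count;                    /* rows are cell counts, not individual patients */
  tables bp*chd / chisq expected;  /* Pearson chi-square and expected cell counts   */
run;

This prints the Pearson chi-square of 42.67 and the tiny p-value quoted on the next slide.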

Measuring “Worth” of a Split. The p-value is the probability of a chi-square as great as the one observed if independence is true (Pr{χ² > 42.67} is 6.4E-11). Such p-values are all too small to compare directly, so use Logworth = -log10(p-value) = 10.19 here. Best chi-square → max logworth.

Logworth for Age Splits Age 47 maximizes logworth

How to make splits? Which variable to use? Where to split? –Cholesterol > ____ –Systolic BP > _____ Idea – pick the BP cutoff that minimizes the chi-square p-value. But what does “significance” mean now?

Multiple testing. 50 different BPs in the data give 49 ways to split. Sunday football highlights always look good! If he shoots enough baskets, even a 95% free-throw shooter will miss. Jury trial analogy. Having tried 49 splits, each has a 5% chance of declaring significance even if there’s no relationship.

Multiple testing. α = Pr{falsely reject hypothesis 1}, α = Pr{falsely reject hypothesis 2}, so Pr{falsely reject one or the other} < 2α. Desired: 0.05 probability or less. Solution: use α = 0.05/2, or equivalently compare 2(p-value) to 0.05.

Multiple testing. 50 different BPs in the data, m = 49 ways to split, so multiply the p-value by 49. Bonferroni – original idea; Kass – applied it to data mining (trees). Stop splitting if the minimum p-value is large. For m splits, logworth becomes -log10(m*p-value).
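A small DATA step shows the adjusted logworth calculation (a sketch; the chi-square and m come from the slides above, and PROBCHI is the chi-square CDF):

data logworth;
  chisq = 42.67;  m = 49;                     /* best split and number of candidate splits */
  p_raw = 1 - probchi(chisq, 1);              /* about 6.4E-11                             */
  logworth_raw = -log10(p_raw);               /* about 10.2                                */
  logworth_adj = -log10(min(1, m * p_raw));   /* Bonferroni/Kass adjustment                */
run;
proc print data=logworth; run;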

Other Split Evaluations. Gini Diversity Index: pick 2 observations at random (without replacement), Pr{different} = 1 - Pr{AA} - Pr{BB} - Pr{CC}.
  –{A A A A B A B B C B}: 1 - [10+6+0]/45 = 29/45 = 0.64
  –{A A B C B A A B C C}: 1 - [6+3+3]/45 = 33/45 = 0.73 → MORE DIVERSE, LESS PURE
Shannon Entropy = -Σ_i p_i log2(p_i); larger → more diverse (less pure).
  {0.5, 0.4, 0.1} → 1.36;  {0.4, 0.2, 0.3} → 1.51 (more diverse)
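The two Gini examples can be checked with a short DATA step (a sketch; the class counts are hard-coded from the leaves above, and the entropy of each leaf's proportions is printed as well):

data diversity;
  input leaf $ nA nB nC;
  n = nA + nB + nC;
  same = nA*(nA-1)/2 + nB*(nB-1)/2 + nC*(nC-1)/2;   /* pairs with matching class   */
  gini = 1 - same / (n*(n-1)/2);                    /* Pr{two random draws differ} */
  entropy = 0;
  array cnt{3} nA nB nC;
  do i = 1 to 3;
    p = cnt{i} / n;
    if p > 0 then entropy = entropy - p*log2(p);    /* Shannon entropy, base 2 */
  end;
datalines;
leaf1 5 4 1
leaf2 4 3 3
;
proc print data=diversity; var leaf gini entropy; run;

This reproduces the Gini values 0.64 and 0.73 quoted above.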

Goals. Split if the diversity in the parent “node” > summed diversities in the child nodes. Observations should be –Homogeneous (not diverse) within leaves –Different between leaves, i.e. the leaves themselves should differ from one another. The Framingham tree used Gini for its splits.

Cross validation Traditional stats – small dataset, need all observations to estimate parameters of interest. Data mining – loads of data, can afford “holdout sample” Variation: n-fold cross validation –Randomly divide data into n sets –Estimate on n-1, validate on 1 –Repeat n times, using each set as holdout.
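One simple way to set up n-fold cross validation in SAS is to attach a random fold label (a sketch with an assumed input dataset MYDATA and n = 10 folds):

data folds;
  set mydata;
  fold = ceil(10 * ranuni(12345));   /* random integer 1-10 */
run;
/* for each k = 1,...,10: fit the model on fold ne k, then validate on fold = k */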

Pruning. Grow a bushy tree on the “fit data”, then classify the holdout data. The farthest-out branches likely do not improve, and may even hurt, the fit on the holdout data, so prune the non-helpful branches. What is “helpful”? What is a good discriminator criterion?

Goals Want diversity in parent “node” > summed diversities in child nodes Goal is to reduce diversity within leaves Goal is to maximize differences between leaves Use same evaluation criteria as for splits Costs (profits) may enter the picture for splitting or evaluation.

Accounting for Costs Pardon me (sir, ma’am) can you spare some change? Say “sir” to male +$2.00 Say “ma’am” to female +$5.00 Say “sir” to female -$1.00 (balm for slapped face) Say “ma’am” to male -$10.00 (nose splint)

Including Probabilities. Suppose a leaf has Pr(M) = 0.7, Pr(F) = 0.3. By true gender:
  Say “sir”:     0.7 (+$2)   and  0.3 (-$1)
  Say “ma’am”:   0.7 (-$10)  and  0.3 (+$5)
Expected profit is 2(0.7) - 1(0.3) = $1.10 if I say “sir”; expected profit is -10(0.7) + 5(0.3) = -$5.50 (a loss) if I say “ma’am”. Weight leaf profits by leaf size (# obsns.) and sum. Prune (and split) to maximize profits.
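The leaf decision is a one-line expected-value calculation; a tiny DATA step sketch (probabilities and payoffs taken from the slide):

data decide;
  pM = 0.7;  pF = 0.3;                 /* leaf class probabilities */
  profit_sir  =   2*pM + (-1)*pF;      /*  $1.10                   */
  profit_maam = -10*pM +   5*pF;       /* -$5.50                   */
  length say $ 6;
  if profit_sir > profit_maam then say = 'sir'; else say = 'ma''am';
run;
proc print data=decide; run;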

Additional Ideas Forests – Draw samples with replacement (bootstrap) and grow multiple trees. Random Forests – Randomly sample the “features” (predictors) and build multiple trees. Classify new point in each tree then average the probabilities, or take a plurality vote from the trees

“Bagging” – Bootstrap aggregation. “Boosting” – similar, but iteratively reweights the points that were misclassified to produce a sequence of more accurate trees. * Lift Chart – order the leaves from most to least response; lift is the cumulative proportion responding.
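The bootstrap samples behind bagging (and forests) can be drawn with PROC SURVEYSELECT (a sketch; TRAIN is an assumed training dataset, and growing one tree per sample is done separately, e.g. in SAS Enterprise Miner):

proc surveyselect data=train out=boot
     method=urs       /* sample with replacement                     */
     samprate=1       /* each sample the same size as the data       */
     outhits          /* keep a record for every time a row is drawn */
     reps=100         /* 100 bootstrap samples, labeled by REPLICATE */
     seed=2468;
run;
/* grow one tree per value of REPLICATE, then average probabilities or vote */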

Regression Trees. Continuous response (not just a class label); the predicted response is constant within each region. Figure: the (X1, X2) plane split into rectangles with predicted values 20, 50, 80, 100, 130.

Predict P_i in cell i; Y_ij is the j-th response in cell i. Split to minimize Σ_i Σ_j (Y_ij - P_i)². (Same figure as above, with one predicted value per cell.)
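That criterion is just the pooled within-cell sum of squares, which PROC MEANS can compute for any candidate split (a sketch; CELLS is an assumed dataset holding a cell label CELL and the response Y):

proc means data=cells noprint;
  class cell;
  var y;
  output out=cellstats css=ss;    /* CSS = sum of (Y - cell mean)^2 within a cell */
run;
proc means data=cellstats sum;
  where _type_ = 1;               /* keep the per-cell rows, drop the overall row */
  var ss;                         /* total to be minimized over candidate splits  */
run;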

Logistic Regression. “Trees” seem to be the main tool; logistic regression is another classifier – older, “tried & true”. Predict the probability of response from input variables (“features”). Linear regression gives an infinite range of predictions, but 0 < probability < 1, so ordinary linear regression won’t do.

Logistic idea: map p in (0,1) to L on the whole real line using L = ln(p/(1-p)). Model L as linear in temperature: predicted L = a + b(temperature). Given temperature X, compute a + bX, then p = e^L/(1+e^L), i.e. p(i) = e^(a+bX_i)/(1+e^(a+bX_i)). Write down p(i) if observation i responded and 1-p(i) if not, multiply all n of these together, and find a, b to maximize the product (maximum likelihood).

Example: Ignition. Flame exposure time = X; ignited Y=1, did not ignite Y=0.
  –Y=0 at X = 3, 5, 9, 10, 13, 16
  –Y=1 at X = 11, 12, 14, 15, 17, 25, 30
Q = (1-p)(1-p)(1-p)(1-p)pp(1-p)pp(1-p)ppp, where the p’s are all different because p = f(exposure). Find a, b to maximize Q(a,b).

DATA LIKELIHOOD;
  ARRAY Y(14) Y1-Y14;  ARRAY X(14) X1-X14;
  DO I=1 TO 14; INPUT X(I) Y(I); END;
  DO A = -3 TO -2 BY .025;
    DO B = 0.2 TO 0.3 BY .0025;
      Q=1;
      DO I=1 TO 14;
        L=A+B*X(I);
        P=EXP(L)/(1+EXP(L));
        IF Y(I)=1 THEN Q=Q*P; ELSE Q=Q*(1-P);
      END;
      IF Q < 0.0006 THEN Q=0.0006;
      OUTPUT;
    END;
  END;
CARDS;
;
Generate Q for an array of (a,b) values.

Likelihood function (Q)

IGNITION DATA – The LOGISTIC Procedure. Analysis of Maximum Likelihood Estimates: rows for Intercept and TIME (DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq). Association of Predicted Probabilities and Observed Responses: Percent Concordant 79.2, Percent Discordant 20.8, Percent Tied 0.0, Pairs 48, c = 0.792 (Somers' D, Gamma, Tau-a also reported).
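Output of this form comes from PROC LOGISTIC rather than the grid search above; a minimal sketch, assuming the 14 observations sit in a dataset IGNITE with variables TIME and IGNITED (0/1):

proc logistic data=ignite;
  model ignited(event='1') = time;   /* model Pr{ignited = 1} as logistic in TIME */
run;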

Figure: classification results – 5 right, 4 wrong; 4 right, 1 wrong.

Example: Framingham X=age Y=1 if heart trouble, 0 otherwise

Framingham – The LOGISTIC Procedure. Analysis of Maximum Likelihood Estimates: rows for Intercept and age (both with Wald Pr > ChiSq < .0001).

Example: Shuttle Missions. O-rings failed in the Challenger disaster at low temperature; prior flights showed “erosion” and “blowby” in the O-rings. Feature: temperature at liftoff. Target: problem (1) – erosion or blowby – vs. no problem (0).

Neural Networks. Very flexible functions built from “hidden layers” – the “multilayer perceptron”: a logistic function of logistic functions of the data (diagram: inputs → hidden layer → output).

Arrows represent linear combinations of “basis functions,” e.g. logistic curves. Example: Y = a + b1 p1 + b2 p2 + b3 p3, here Y = 4 + p1 + 2 p2 - 4 p3.

Should always use holdout sample Perturb coefficients to optimize fit (fit data) –Nonlinear search algorithms Eliminate unnecessary arrows using holdout data. Other basis sets –Radial Basis Functions –Just normal densities (bell shaped) with adjustable means and variances.

Terms Train: estimate coefficients Bias: intercept a in Neural Nets Weights: coefficients b Radial Basis Function: Normal density Score: Predict (usually Y from new Xs) Activation Function: transformation to target Supervised Learning: Training data has response.

Hidden Layer. L1 = ___ + ___*Age – 0.20*SBP22, H11 = exp(L1)/(1+exp(L1)) – the “activation function”. Then L2 = ___ + ___*H11 and Pr{first_chd} = exp(L2)/(1+exp(L2)).
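This two-layer forward pass can be computed by hand in a DATA step (a sketch; the intercepts and weights below are made-up placeholders since the fitted values are not shown above, and only the -0.20 coefficient on SBP comes from the slide):

data score_one;
  age = 55;  sbp = 140;                 /* one hypothetical new patient              */
  L1  = -1.0 + 0.05*age - 0.20*sbp;     /* hidden unit: linear combination of inputs */
  H11 = exp(L1) / (1 + exp(L1));        /* logistic "activation function"            */
  L2  = -2.0 + 3.0*H11;                 /* output layer: linear combination of H11   */
  p_chd = exp(L2) / (1 + exp(L2));      /* Pr{first_chd}                             */
run;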

Demo (optional) Compare several methods using SAS Enterprise Miner –Decision Tree –Nearest Neighbor –Neural Network

Unsupervised Learning. We have the “features” (predictors) but do NOT have the response, even on a training data set (UNsupervised). Clustering: –Agglomerative – start with each point separate –Divisive – start with all points in one cluster, then split.

EM → PROC FASTCLUS. Step 1 – find “seeds” as separated as possible. Step 2 – cluster points to the nearest seed. –Drift: as points are added, move the seed (centroid) to the average of each coordinate. –Alternatively: make a full pass, then recompute the seeds and iterate.
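A minimal PROC FASTCLUS call (a sketch; POINTS is an assumed dataset with coordinates X and Y):

proc fastclus data=points maxclusters=3 out=clustered;
  var x y;    /* k-means style: pick seeds, then assign each point to the nearest one */
run;
/* CLUSTERED gains a CLUSTER number and the distance to its cluster seed */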

Clusters as Created

As Clustered

Cubic Clustering Criterion (to decide # of clusters). Divide a random scatter of (X,Y) points into 4 quadrants: the pooled within-cluster variation is much less than the overall variation, so there is a large variance reduction and a big R-square despite no real clusters. CCC compares the R-square you got to the R-square expected from random scatter to decide # clusters. CCC picked 3 clusters for the “macaroni” data.
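PROC FASTCLUS prints an overall R-square and the cubic clustering criterion, so one way to choose the number of clusters is simply to rerun it for several MAXCLUSTERS values and compare (a sketch, using the same assumed POINTS data):

%macro try_k;
  %do k = 2 %to 5;
    proc fastclus data=points maxclusters=&k short;
      var x y;
    run;
  %end;
%mend;
%try_k
/* pick the number of clusters with the largest printed CCC */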

Association Analysis Market basket analysis –What they’re doing when they scan your “VIP” card at the grocery –People who buy diapers tend to also buy _________ (beer?) –Just a matter of accounting but with new terminology (of course ) –Examples from SAS Appl. DM Techniques, by Sue Walsh:

Terminology. Baskets: ABC, ACD, BCD, ADE, BCE. For a rule X=>Y, Support = Pr{X and Y} and Confidence = Pr{Y|X}:
  A=>D      support 2/5   confidence 2/3
  C=>A      support 2/5   confidence 2/4
  B&C=>D    support 1/5   confidence 1/3

Don’t be Fooled! Lift = Confidence / Expected Confidence if Independent. Checking vs. Saving for 10,000 customers: 1,500 have no checking account and 8,500 do; 6,000 have savings, and 5,000 of those also have checking. For SVG=>CHKG we expect 8500/10000 = 85% if independent, but the observed confidence is 5000/6000 = 83%. Lift = 83/85 < 1: savings account holders are actually LESS likely than others to have a checking account!!!
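A short DATA step reproduces the lift arithmetic (a sketch; the counts are taken from the table above):

data lift;
  n_total = 10000;   n_chkg = 8500;        /* all customers; checking holders       */
  n_svg   =  6000;   n_both = 5000;        /* savings holders; savings and checking */
  support    = n_both / n_total;           /* Pr{SVG and CHKG} = 0.50               */
  confidence = n_both / n_svg;             /* Pr{CHKG | SVG}   = 0.833              */
  expected   = n_chkg / n_total;           /* Pr{CHKG}         = 0.85               */
  lift       = confidence / expected;      /* 0.98 < 1, so the rule is misleading   */
run;
proc print data=lift; run;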

Summary Data mining – a set of fast stat methods for large data sets Some new ideas, many old or extensions of old Some methods: –Decision Trees –Nearest Neighbor –Neural Nets –Clustering –Association