Data Mining Tutorial
D. A. Dickey, NCSU
Decision Trees  A “divisive” method (splits)  Start with “root node” – all in one group  Get splitting rules  Response often binary  Result is a “tree”  Example: Loan Defaults  Example: Framingham Heart Study  Example: Automobile Accidents

Recursive Splitting
(Figure: the predictor space is split recursively on X1 = Debt-to-Income Ratio and X2 = Age into regions labeled "No default" and "Default," with estimated Pr{default} values such as 0.008, 0.012, 0.006, and 0.003 in the different regions.)

Some Actual Data  Framingham Heart Study  First-Stage Coronary Heart Disease  P{CHD} = function of: »Age – no drug yet! »Cholesterol »Systolic BP

Example of a “tree”  Using Default Settings

Example of a “tree”  Using Custom Settings (Gini splits, N=4 leaves)

How to make splits?
 Contingency tables: cross-classify BP (Low / High) against Heart Disease (No / Yes). Is heart disease DEPENDENT on BP (an effect) or INDEPENDENT (no effect)? Compare the observed cell counts to the counts expected under independence (computed from the row and column totals).

How to make splits?
 Contingency tables: compute the chi-square statistic
  Σ (Observed − Expected)² / Expected = 2(400/75) + 2(400/25) = 42.67
 Compare to χ² tables – Significant! (Why do we say "Significant"???)

Framingham Conclusion: sufficient evidence against the (null) hypothesis of no relationship.
 H 0 : Innocence vs. H 1 : Guilt – convict only when guilt is shown beyond reasonable doubt (P < 0.05).
 H 0 : No association vs. H 1 : BP and heart disease are associated – here P = ____ (far below 0.05).

Demo 1: Chi-Square for Titanic Survival (SAS)
 Counts of Alive vs. Dead for Crew, 3rd Class, 2nd Class, 1st Class.
 "logworth" = −log 10 (p-value), computed for all possible splits:
  Crew vs. other
  Crew & 3rd versus 1st & 2nd
  First versus other
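The logworth computation can be sketched as below. The slide's counts and p-values did not survive transcription, so the code uses commonly cited Titanic survival counts (an assumption, not necessarily the demo's exact numbers); the 1-df chi-square tail probability is erfc(√(χ²/2)):

```python
import math

# Commonly cited Titanic counts (alive, dead) by group -- an assumption.
counts = {"1st": (202, 123), "2nd": (118, 167),
          "3rd": (178, 528), "crew": (212, 673)}

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def logworth(groups_left):
    """-log10(p-value) for splitting the four groups into left vs. right."""
    a = sum(counts[g][0] for g in groups_left)
    b = sum(counts[g][1] for g in groups_left)
    c = sum(counts[g][0] for g in counts if g not in groups_left)
    d = sum(counts[g][1] for g in counts if g not in groups_left)
    chi2 = chi2_2x2(a, b, c, d)
    p = math.erfc(math.sqrt(chi2 / 2))   # 1-df chi-square upper tail
    return -math.log10(p) if p > 0 else float("inf")

for split in [("crew",), ("crew", "3rd"), ("1st",)]:
    print(split, round(logworth(split), 1))
```

Each candidate split gets a logworth; the tree takes the split with the largest one.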

Demo 2: Titanic (EM) Open Enterprise Miner File  New  Project Specify workspace (1)Startup Code – specify data library (2)New Data Source  Titanic (3)Data has class (1,2,3,crew) & MWC (4)Bring in data, Decision Tree (defaults)

 3 possible splits for M,W,C  10 possible splits for class (is that fair?)

Demo 3: Framingham (EM)
Open Enterprise Miner, File → New → Project, specify workspace
(1) Startup Code – specify data library
(2) New Data Source → Framingham
 (a) Firstchd = target, reject obs
 (b) Explore Firstchd (pie) vs. Age (histogram)
(3) Bring in data, Decision Tree (defaults)

Other Split Criteria
 Gini Diversity Index
  (1) { A A A A B A B B C B }
   Pick 2 at random: Pr{different} = 1 − Pr{AA} − Pr{BB} − Pr{CC}
    = 1 − 0.25 − 0.16 − 0.01 = 0.58 (sampling with replacement)
  (2) { A A B C B A A B C C }
    = 1 − 0.16 − 0.09 − 0.09 = 0.66
   Group (2) is more diverse (less pure).
 Shannon Entropy
  Larger → more diverse (less pure)
  −Σ i p i log 2 (p i )
   {0.5, 0.4, 0.1} → 1.36
   {0.4, 0.3, 0.3} → 1.57 (more diverse)
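Both impurity measures above can be checked directly; this small sketch reproduces the slide's two example groups:

```python
import math
from collections import Counter

def gini(group):
    """Pr{two draws with replacement differ} = 1 - sum of p_i^2."""
    n = len(group)
    return 1 - sum((c / n) ** 2 for c in Counter(group).values())

def entropy(props):
    """Shannon entropy -sum p_i log2 p_i; larger means more diverse."""
    return -sum(p * math.log2(p) for p in props if p > 0)

g1 = list("AAAABABBCB")   # proportions 0.5, 0.4, 0.1
g2 = list("AABCBAABCC")   # proportions 0.4, 0.3, 0.3

print(round(gini(g1), 2), round(gini(g2), 2))   # 0.58 vs 0.66
print(round(entropy([0.5, 0.4, 0.1]), 2))       # 1.36
print(round(entropy([0.4, 0.3, 0.3]), 2))       # 1.57
```

Group (2) scores higher on both measures, i.e. it is less pure, so a split producing children like group (1) is preferred.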

How to make splits?
 Which variable to use?
 Where to split?
  Cholesterol > ____
  Systolic BP > ____
 Idea – pick the BP cutoff to minimize the p-value for χ²
 The split point is data-derived!
 What does "significance" mean now?

Multiple testing
 α = Pr{falsely reject hypothesis 1}
 α = Pr{falsely reject hypothesis 2}
 Pr{falsely reject one or the other} < 2α
 Desired: 0.05 probability or less
 Solution: compare 2 × (p-value) to 0.05 (a Bonferroni adjustment)

Validation  Traditional stats – small dataset, need all observations to estimate parameters of interest.  Data mining – loads of data, can afford “holdout sample”  Variation: n-fold cross validation  Randomly divide data into n sets  Estimate on n-1, validate on 1  Repeat n times, using each set as holdout.  Titanic and Framingham examples did not use holdout.

Pruning  Grow a bushy tree on the "fit data"  Classify the validation (holdout) data  The farthest-out branches likely do not improve, and may hurt, the fit on validation data  Prune non-helpful branches.  What is "helpful"? What is a good discriminator criterion?

Goals  Split (or keep split) if diversity in parent “node” > summed diversities in child nodes  Prune to optimize  Estimates  Decisions  Ranking  in validation data

Assessment for:
 Decisions: minimize incorrect decisions (model versus realized)
 Estimates: error mean square (average squared error)
 Ranking: C (concordance) statistic = proportion concordant + ½ (proportion tied)
  Obs number: 1 (2, 3) 4 5
  Probability of response (from model): increasing across observations, with obs 2 and 3 tied
  Actual: 0 (0, 1) 1 0
   » Concordant pairs: (1,3) (1,4) (2,4)
   » Discordant pairs: (3,5) (4,5)
   » Tied: (2,3)
   » 6 ways to get a pair with 2 different responses
  C = 3/6 + (1/2)(1/6) = 7/12 = 0.5833
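The C statistic can be computed directly from predicted probabilities. The slide's probabilities were lost in transcription, so the values below are hypothetical but reproduce its pair structure (obs 2 and 3 tied, obs 5 ranked highest):

```python
# Concordance (C) statistic from predictions and actual responses.
actual = [0, 0, 1, 1, 0]               # obs 1..5
prob   = [0.1, 0.3, 0.3, 0.4, 0.5]    # hypothetical model output

def concordance(actual, prob):
    ones  = [p for a, p in zip(actual, prob) if a == 1]
    zeros = [p for a, p in zip(actual, prob) if a == 0]
    conc = sum(p1 > p0 for p1 in ones for p0 in zeros)   # 1 ranked above 0
    ties = sum(p1 == p0 for p1 in ones for p0 in zeros)
    return (conc + 0.5 * ties) / (len(ones) * len(zeros))

print(concordance(actual, prob))   # 7/12 = 0.5833...
```

Three concordant pairs and one tie out of six 0–1 pairs give C = (3 + ½) / 6 = 7/12, matching the slide.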

Accounting for Costs  Pardon me (sir, ma’am) can you spare some change?  Say “sir” to male +$2.00  Say “ma’am” to female +$5.00  Say “sir” to female -$1.00 (balm for slapped face)  Say “ma’am” to male -$10.00 (nose splint)

Including Probabilities
 True gender: M or F. Leaf has Pr(M) = 0.7, Pr(F) = 0.3.
 You say "Sir": expected profit = 0.7(2) + 0.3(−1) = +$1.10
 You say "Ma'am": expected profit = 0.7(−10) + 0.3(5) = −$5.50 (a loss)
 Weight leaf profits by leaf size (# obsns.) and sum.
 Prune (and split) to maximize profits.
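The expected-profit decision above is a one-line weighted sum; this sketch reproduces the slide's numbers:

```python
# Expected profit per leaf: weight each payoff by the leaf's class
# probabilities, then act to maximize expected profit.
payoff = {                       # (what you say, true gender) -> profit
    ("sir", "M"): 2.00, ("sir", "F"): -1.00,
    ("maam", "M"): -10.00, ("maam", "F"): 5.00,
}
leaf = {"M": 0.7, "F": 0.3}      # class probabilities in this leaf

def expected_profit(say):
    return sum(leaf[g] * payoff[(say, g)] for g in leaf)

for say in ("sir", "maam"):
    print(say, round(expected_profit(say), 2))   # sir: 1.10, maam: -5.50

best = max(("sir", "maam"), key=expected_profit)
print(best)   # "sir"
```

The same weighting, summed over all leaves, is what pruning to maximize profit optimizes.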

Regression Trees
 Continuous response Y
 Predicted response P i is constant in regions i = 1, …, 5
 (Figure: the (X1, X2) plane split into five rectangles with predictions 20, 50, 80, 100, 130.)

Regression Trees
 Predict P i in cell i.
 Y ij = j th response in cell i.
 Split to minimize Σ i Σ j (Y ij − P i )²
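Searching for the split that minimizes the within-region sum of squares can be sketched as follows, on a hypothetical one-predictor toy dataset:

```python
# One split of a regression tree: try every cut point on X and keep the
# one minimizing sum_i sum_j (Y_ij - P_i)^2, where P_i is the mean
# response in region i. Toy data, chosen so Y jumps between X=3 and X=4.
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [20, 22, 19, 50, 55, 53, 52, 54]

def sse(ys):
    """Sum of squared deviations from the group mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

best = None
for cut in sorted(set(X))[1:]:                  # candidate cut points
    left  = [y for x, y in zip(X, Y) if x < cut]
    right = [y for x, y in zip(X, Y) if x >= cut]
    total = sse(left) + sse(right)
    if best is None or total < best[1]:
        best = (cut, total)

print(best[0])   # 4: the split falls between X=3 and X=4, at the jump
```

Applied recursively to each child region, this is exactly the regression-tree growing step.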

Real data example: traffic accidents in Portugal*
 Y = injury-induced "cost to society"
 * Tree developed by Guilhermina Torrao (used with permission), NCSU Institute for Transportation Research & Education
 ("Help – I ran into a 'tree'!")

Demo 4: Chemicals (EM)
Hypothetical potency versus chemical content (proportions of A, B, C, D)
New Data Source → Chemicals
{Sum, Dead} → rejected; Potency → target
Partition node: training 75%, validation 25%
Connect and run the Tree node, look at results
Tree node, Results: View → Model → Variable Importance

Demo 4 (cont.)
(Variable importance table – Name, Number of Splitting Rules, Importance, Validation Importance, Ratio of Validation to Training Importance – shows splitting rules for B and A only; C and D never split, so their ratio is NaN.)
Only A and B amounts contribute to splits. B splits reduce variation more; A reduces it by about 2/3 as much as B.
Properties panel → Exported data → Explore → 3D graph: X=A, Y=B, Z=P_Potency


Logistic Regression
 Logistic – another classifier
 Older – "tried & true" method
 Predict probability of response from input variables ("features")
 Linear regression gives an infinite range of predictions
 0 < probability < 1, so not linear regression.

Logistic Regression: Three Logistic Curves
 p = e^(a+bX) / (1 + e^(a+bX)), plotted against X for three choices of a and b

Example: Seat Fabric Ignition
 Flame exposure time = X
 Y = 1 → ignited, Y = 0 → did not ignite
 Y = 0 at X = 3, 5, 9, 10, 13, 16
 Y = 1 at X = 7, 11, 12, 14, 15, 17, 25, 30
 Likelihood Q(a, b) has one factor per observation: p i if Y = 1, (1 − p i ) if Y = 0, e.g. Q = (1−p 1 )(1−p 2 ) p 3 (1−p 4 ) … with the observations sorted by X
 The p's are all different: p i = f(a + bX i ) = e^(a+bX i ) / (1 + e^(a+bX i ))
 Find a, b to maximize Q(a, b)

Logistic idea:
 Given temperature X, compute L(X) = a + bX, then p = e^L / (1 + e^L)
 p(i) = e^(a+bX i ) / (1 + e^(a+bX i ))
 Write p(i) if response, 1 − p(i) if not
 Multiply all n of these together; find a, b to maximize this "likelihood" Q(a, b)
 Estimated L = ___ + ___ X (fitted a and b)
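Maximizing the likelihood can be sketched with plain gradient ascent on the log-likelihood, using the ignition data above (a teaching sketch; the course fits this in SAS/EM, and real software uses Newton's method):

```python
import math

# Seat-fabric ignition data: Y = 1 means ignited at exposure time X.
X = [3, 5, 9, 10, 13, 16, 7, 11, 12, 14, 15, 17, 25, 30]
Y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

def loglik(a, b):
    """log Q(a, b): sum of log p_i for responses, log(1 - p_i) otherwise."""
    ll = 0.0
    for x, y in zip(X, Y):
        p = 1 / (1 + math.exp(-(a + b * x)))
        ll += math.log(p if y == 1 else 1 - p)
    return ll

a = b = 0.0
for _ in range(30000):
    # Gradient of the log-likelihood: sum (y - p) and sum (y - p) * x.
    ga = gb = 0.0
    for x, y in zip(X, Y):
        p = 1 / (1 + math.exp(-(a + b * x)))
        ga += y - p
        gb += (y - p) * x
    a += 0.001 * ga
    b += 0.001 * gb

print(round(a, 2), round(b, 2), round(loglik(a, b), 3))
```

The fitted slope b comes out positive: longer flame exposure raises the estimated ignition probability, as the data pattern suggests.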

Example: Shuttle Missions
 O-rings failed in the Challenger disaster
 Prior flights: "erosion" and "blowby" in O-rings (6 per mission)
 Feature: temperature at liftoff
 Target: (1) erosion or blowby vs. (0) no problem
 L = ___ − ___ (temperature); p = e^L / (1 + e^L)
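The slide's fitted coefficients did not survive transcription. The values below are the widely cited Dalal–Fowlkes–Hoadley re-analysis estimates (an assumption here, not necessarily the slide's exact numbers), which make the temperature effect easy to see:

```python
import math

# Widely cited logistic coefficients for O-ring problems vs. liftoff
# temperature (deg F) -- an assumption, used for illustration only.
a, b = 15.043, -0.2322

def p_problem(temp_f):
    L = a + b * temp_f
    return math.exp(L) / (1 + math.exp(L))

print(round(p_problem(70), 2))   # a mild day: modest estimated risk
print(round(p_problem(31), 2))   # Challenger's launch morning: near 1
```

The colder the launch, the larger L and hence p, which is the substance of the post-disaster analysis.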

Neural Networks
 Very flexible functions
 "Hidden layers"
 "Multilayer perceptron": output Pr{1} in (0,1) is a logistic function of logistic functions** of the inputs X1–X4, through hidden units H1 and H2.
 ** (note: hyperbolic tangent functions are just reparameterized logistic functions)

Arrows on the right represent linear combinations of "basis functions," e.g. hyperbolic tangents (reparameterized logistic curves ranging from −1 to 1).
 Example: Y = a + b 1 H1 + b 2 H2 + b 3 H3, e.g. Y = ___ + ___ H1 + 3 H2 + 5 H3
 The intercept a is the "bias"; the b's are the "weights."
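The forward pass described above fits in a few lines. All weights here are made-up illustrative numbers, not a fitted Framingham network:

```python
import math

# Tiny multilayer perceptron: tanh hidden units (reparameterized
# logistic curves, each in -1..1), logistic output giving Pr{1}.
def mlp(x, hidden_w, hidden_b, out_w, out_b):
    H = [math.tanh(b + sum(w * xi for w, xi in zip(ws, x)))
         for ws, b in zip(hidden_w, hidden_b)]          # hidden layer
    L = out_b + sum(w * h for w, h in zip(out_w, H))    # bias + weights
    return 1 / (1 + math.exp(-L))                       # logistic output

p = mlp(x=[0.5, -1.0],
        hidden_w=[[2.0, -1.0], [-1.5, 0.5]], hidden_b=[0.1, -0.3],
        out_w=[3.0, 5.0], out_b=-1.0)
print(p)   # a probability strictly between 0 and 1
```

Training consists of choosing the biases and weights to maximize the same kind of likelihood used for logistic regression; the flexibility (and the overfitting risk) comes from the hidden layer.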

A Complex Neural Network Surface
 (Figure: predicted probability P plotted against X1 and X2 for a network with several hidden units; the fitted "biases" include values such as −10, −13, 20, and −1.)

Demo 5 (EM): Framingham Neural Network
 Note: no validation data used – surfaces are unnecessarily complex
 (Surfaces: predicted risk vs. Sbp and Cholest, "slices" of the 4-D surface at age 25 for nonsmoker and smoker.)
 Code for the graphs is in NNFramingham.sas

Framingham Neural Network Surfaces
 "Slice" at age 55, nonsmoker and smoker (risk vs. Sbp and Cholest) – are these patterns real?

Handling Neural Net Complexity
(1) Use validation data; stop iterating when the fit gets worse on validation data.
 Problems for Framingham (50-50 split):
  – Fit gets worse at step 1 → no effect of inputs.
  – Algorithm fails to converge (likely reason: too many parameters).
(2) Use regression to omit predictor variables ("features") and their parameters. This gives the previous graphs.
(3) Do both (1) and (2).

Cumulative Lift Chart
 – Go from the leaf with the most predicted response to the least.
 – Lift = (proportion responding in the first p%) ÷ (overall population response rate)
 (Chart: lift starts near 3.3 for the highest-predicted group and declines to 1 as p → 100%; predicted response runs from high to low.)
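The lift calculation can be sketched as follows, on hypothetical scores and responses:

```python
# Cumulative lift: sort by predicted probability (most likely first),
# then compare the response rate in the top n with the overall rate.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]
scored.sort(key=lambda t: -t[0])              # most to least predicted

overall = sum(y for _, y in scored) / len(scored)     # 0.4 response rate

def lift(top_n):
    top = scored[:top_n]
    return (sum(y for _, y in top) / top_n) / overall

print(lift(2))    # top 20%: both respond -> lift 2.5
print(lift(10))   # whole population -> lift 1.0 by construction
```

Lift always ends at 1.0 for the full population, which is why the chart declines toward 1 on the right.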

Receiver Operating Characteristic Curve
Plot the distributions of logits (or any model predictor, ordered from most to least likely to respond) for the 1's (red) and the 0's (black). Each cut point selects the most likely p% according to the model; call these 1, the rest 0. Then
 Y = proportion of all 1's correctly identified
 X = proportion of all 0's incorrectly called 1's
Cut point 1: call almost everything 0 (X and Y near 0).
Cut points 2, 3, 3.5 (the most likely 50%), 4, and 5: intermediate points tracing out the curve.
Cut point 6: call almost everything 1 (X and Y near 1).

Why is the area under the ROC curve the same as (number of concordant pairs + ½ number of tied pairs) divided by the number of 0–1 pairs?
Tree with leaves sorted from most to least likely to be a 1:
 Leaf 1: 20 0's, 60 1's
 Leaf 2: 20 0's, 20 1's
 Leaf 3: 60 0's, 20 1's
Left of a cut point, call it 1; right of it, call it 0.
 X = number of 0's left of the cut (incorrect 1 calls); Y = number of 1's left of the cut (correct 1 calls). Each (X, Y) is a point on the ROC curve.
 Cut 0: everything called 0, so no wrong 0 calls and no 1 decisions → (0, 0).
 Cut 1: (20)(60) = 1200 tied pairs within leaf 1; 60 × (20 + 60) = 4800 concordant pairs (leaf-1 1's vs. 0's in leaves 2 and 3) → point (20, 60).
 Cut 2: 20 more 1's and 20 more 0's to the left; 20 × 20 = 400 ties in leaf 2; 20 × 60 = 1200 concordant pairs (leaf-2 1's vs. leaf-3 0's) → point (40, 80).
 Cut 3: call everything a 1; all 100 1's correctly called, all 100 0's incorrectly called → point (100, 100).
The ROC curve is in terms of proportions, so the joined points are (0,0), (0.2, 0.6), (0.4, 0.8), and (1,1). The areas under the curve are the concordant pairs plus the tied pairs, with the ROC line splitting each block of ties in two. For logistic regression or neural nets, every observation is a potential cut point.
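The equality can be verified numerically for the three leaves above; this sketch computes the C statistic from pair counts and the AUC from the trapezoid rule and shows they match:

```python
# Verify AUC = (concordant + ties/2) / (number of 0-1 pairs) for the
# three leaves, ordered from most to least likely to be a 1.
leaves = [(20, 60), (20, 20), (60, 20)]        # (zeros, ones) per leaf

zeros = sum(z for z, o in leaves)              # 100
ones  = sum(o for z, o in leaves)              # 100

# Concordant: a 1 in an earlier leaf vs. a 0 in a later leaf; ties within.
conc = sum(o1 * z2 for i, (_, o1) in enumerate(leaves)
                   for z2, _ in leaves[i + 1:])
ties = sum(z * o for z, o in leaves)
c_stat = (conc + ties / 2) / (zeros * ones)

# AUC by the trapezoid rule over the ROC points (0,0),(.2,.6),(.4,.8),(1,1).
pts, x, y = [(0.0, 0.0)], 0.0, 0.0
for z, o in leaves:
    x, y = x + z / zeros, y + o / ones
    pts.append((x, y))
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(c_stat, auc)   # both 0.74
```

6000 concordant pairs plus half of 2800 ties, out of 10,000 pairs, gives 0.74, exactly the trapezoid area.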

Framingham ROC for 4 models and the baseline 45-degree line; neural net highlighted.
 (Model Comparison table – selected model Y = Neural: Neural Network, Decision Tree, Gini Tree (N = 4), Regression, each with misclassification rate, average squared error, and area under ROC.)
 Sensitivity: Pr{calling a 1 a 1} = the Y coordinate. Specificity: Pr{calling a 0 a 0} = 1 − X.

A Combined Example: Cell Phone Texting Locations
 Black circle: phone moved > 50 feet in the first two minutes of texting.
 Green dot: phone moved < 50 feet.

Three Models: Tree, Neural Net, Logistic Regression
 Training-data lift charts, validation-data lift charts, and the resulting surfaces.

Demo 5: Three Breast Cancer Models
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
 1. Sample code number (id number)
 2. Clump Thickness
 3. Uniformity of Cell Size
 4. Uniformity of Cell Shape
 5. Marginal Adhesion
 6. Single Epithelial Cell Size
 7. Bare Nuclei
 8. Bland Chromatin
 9. Normal Nucleoli
 10. Mitoses
 Class: (2 for benign, 4 for malignant)

Use Decision Tree to Pick Variables

Decision Tree Needs Only BARE & SIZE

Can save scoring code from EM:
 data A;
  do bare = 0 to 10;
   do size = 0 to 10;
    (score code from EM)
    output;
   end;
  end;
Cancer Screening Results: Model Comparison node results (selected model Y = Tree; Decision Tree, Neural Network, and Regression compared on train and validation misclassification rate, average squared error, and area under ROC).

Association Analysis is just elementary probability with new names
 A: purchase milk, Pr{A} = 0.5
 B: purchase cereal, Pr{B} = 0.3
 Pr{A and B} = 0.2 (the Venn-diagram regions sum to 1.0)

Rule B ⇒ A: "people who buy B will buy A" (Cereal ⇒ Milk)
 Support: support = Pr{A and B} = 0.2
 Confidence: confidence = Pr{A|B} = Pr{A and B}/Pr{B} = 2/3
 ?? Is the confidence in B ⇒ A the same as the confidence in A ⇒ B ?? (yes, no)
 Expected confidence if there is no relation to B: independence means Pr{A|B} = Pr{A} = 0.5
 Lift: lift = confidence / E{confidence} = (2/3) / (1/2) = 1.33 → gain = 33%
 Marketing A to the 30% of people who buy B will result in 33% better sales than marketing to a random 30% of the people.
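The three rule metrics can be computed directly from the slide's probabilities; note the answer to the slide's question is "no" – confidence is not symmetric, even though support and lift are:

```python
# Support, confidence, and lift for the rule B => A (cereal => milk).
pr_a, pr_b, pr_ab = 0.5, 0.3, 0.2

support    = pr_ab                    # Pr{A and B}
confidence = pr_ab / pr_b             # Pr{A | B} = 2/3
expected   = pr_a                     # confidence if A is unrelated to B
lift       = confidence / expected    # (2/3) / (1/2) = 1.33

print(support, round(confidence, 3), round(lift, 2))

# Confidence is NOT symmetric: A => B has a different confidence.
conf_a_to_b = pr_ab / pr_a            # Pr{B | A} = 0.4, not 2/3
print(round(conf_a_to_b, 2))
```

Lift, by contrast, is the same in both directions, since Pr{A and B}/(Pr{A}Pr{B}) does not care about the order.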

Example: Grocery cart items (hypothetical)
 item / cart: bread 1; milk 1; soap 2; meat 3; bread 3; bread 4; cereal 4; milk 4; soup 5; bread 5; cereal 5; milk 5
 (Association report – sort criterion: confidence; maximum items: 2; minimum single-item support: 20% – lists expected confidence (%), confidence (%), support (%), lift, and transaction count for rules such as cereal ⇒ milk, milk ⇒ cereal, meat ⇒ milk, bread ⇒ milk, soup ⇒ milk, milk ⇒ bread, bread ⇒ cereal, cereal ⇒ bread, soup ⇒ cereal, cereal ⇒ soup, milk ⇒ meat, milk ⇒ soup.)

Link Graph
 Sort criterion: confidence; maximum items: 3 (e.g. milk & cereal ⇒ bread); minimum support: 5%; slider bar at 62%.

Unsupervised Learning
 We have the "features" (predictors).
 We do NOT have the response, even on a training data set (UNsupervised).
 Another name for clustering.
 In EM:
  Large number of clusters with k-means (k clusters)
  Ward's method to combine (fewer clusters, say r < k)
  One more k-means pass for the final r-cluster solution.
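The three-stage recipe above can be sketched in miniature. This is a simplified illustration on made-up two-blob data: the merge step uses plain centroid distance rather than Ward's criterion, and everything is pure Python:

```python
import random

# Stage 1: many k-means clusters; Stage 2: merge down to r;
# Stage 3: one final k-means pass from the merged seeds.
random.seed(1)
data = ([(random.gauss(0, .3), random.gauss(0, .3)) for _ in range(50)] +
        [(random.gauss(4, .3), random.gauss(4, .3)) for _ in range(50)])

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, centers, iters=20):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            groups[nearest].append(p)
        centers = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                   if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, _ = kmeans(data, random.sample(data, 10))     # k = 10
while len(centers) > 2:                                # merge to r = 2
    i, j = min(((i, j) for i in range(len(centers))
                       for j in range(i + 1, len(centers))),
               key=lambda ij: dist2(centers[ij[0]], centers[ij[1]]))
    merged = ((centers[i][0] + centers[j][0]) / 2,
              (centers[i][1] + centers[j][1]) / 2)
    centers = [c for k, c in enumerate(centers) if k not in (i, j)] + [merged]
centers, groups = kmeans(data, centers)                # final pass

print(sorted(len(g) for g in groups))   # two clusters of ~50 points each
```

The first stage is cheap and local, the merge stage decides how many clusters there really are, and the final pass polishes the assignment.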

Example: Cluster the breast cancer data (disregard target) Plot clusters and actual target values:

Segment Profiler

Text Mining
Hypothetical collection of news releases ("corpus"):
 release 1: Did the NCAA investigate the basketball scores and vote for sanctions?
 release 2: Republicans voted for and Democrats voted against it for the win. (etc.)
Compute word counts per release for terms such as NCAA, basketball, score, vote, Republican, Democrat, win.

Text Mining Mini-Example: word counts in 16 documents
 (Table: one row per document, one column per word – Basketball, NCAA, Tournament, Score_V, Score_N, Wins, Speech, Voters, Liar, Election, Republican, President, Democrat.)

Eigenvalues of the Correlation Matrix (eigenvalue, difference, proportion, cumulative):
 A large share of the variation in these 13-dimensional word-count vectors occurs in one dimension.
 (Loadings on Prin 1 for Basketball, NCAA, Tournament, Score_V, Score_N, Wins, Speech, Voters, Liar, Election, Republican, President, Democrat; Prin 1 vs. Prin 2 plotted.)
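The dominant dimension can be extracted by power iteration on the covariance matrix. The counts below are hypothetical, in the spirit of the mini-example, with four words standing in for the thirteen:

```python
# First principal component by power iteration on a small word-count
# matrix: rows are documents, columns are the words
# NCAA, basketball, vote, Republican (hypothetical counts).
docs = [
    [3, 4, 1, 0],   # sports stories load on the first two columns
    [4, 3, 0, 0],
    [2, 5, 1, 1],
    [0, 1, 4, 3],   # politics stories load on the last two
    [0, 0, 3, 4],
    [1, 0, 5, 3],
]
ncol = len(docs[0])
means = [sum(row[j] for row in docs) / len(docs) for j in range(ncol)]
centered = [[row[j] - means[j] for j in range(ncol)] for row in docs]

# Covariance matrix, then power iteration for its dominant eigenvector.
cov = [[sum(r[i] * r[j] for r in centered) / (len(docs) - 1)
        for j in range(ncol)] for i in range(ncol)]
v = [1.0, 0.8, 0.6, 0.4]                       # arbitrary starting vector
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(ncol)) for i in range(ncol)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

scores = [sum(r[j] * v[j] for j in range(ncol)) for r in centered]
print([round(s, 2) for s in scores])   # the two topics separate in sign
```

The Prin 1 scores put the sports documents at one end and the politics documents at the other, which is exactly the one-dimensional structure the slide's eigenvalue table reports.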

(Plot: the documents separate into a Sports cluster and a Politics cluster.)

(Sorted by Prin 1 score, the documents split at the biggest gap into Sports documents and Politics documents.)

PROC CLUSTER (single linkage) agrees! (Cluster 1 and Cluster 2 match the Sports/Politics split.)

Memory-Based Reasoning (optional)
 Usual name: Nearest Neighbor Analysis
 Probe point: among its 16 nearest neighbors, 9 are blue and 7 are red → classify as BLUE; estimate Pr{Blue} as 9/16.
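The nearest-neighbor vote can be sketched as below. The data points and the k = 16 neighborhood size are hypothetical, arranged to echo the 9-blue / 7-red picture on the slide:

```python
import math
from collections import Counter

def knn(probe, labeled_points, k=16):
    """Classify `probe` by majority vote among its k nearest neighbors."""
    nearest = sorted(labeled_points,
                     key=lambda pt: math.dist(probe, pt[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k            # class and estimated probability

# Hypothetical data: a small blue cluster near the probe, red farther out.
points = ([((0.1 * i, 0.2 * i), "blue") for i in range(9)] +
          [((0.15 * i + 3, 0.1 * i), "red") for i in range(40)])
label, prob = knn(probe=(0.5, 0.5), labeled_points=points)
print(label, prob)   # blue 0.5625, i.e. 9/16
```

Note there is no model at all: the training data itself is the "memory," which is why the method is slow to score but trivial to update.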

Note: Enterprise Miner's Exported Data Explorer chooses the colors.

Original data scored by the Enterprise Miner Nearest Neighbor method; score data (a grid) scored by the same method.

Fisher’s Linear Discriminant Analysis - an older method of classification (optional) Assumes multivariate normal distribution of inputs (features) Computes a “posterior probability” for each of several classes, given the observations. Based on statistical theory

Example: one input (financial stress index). Three populations: pay all credit card debt, pay part, default. Normal distributions with the same variance. Classify a credit card holder by financial stress index: pay all (stress < 20), pay part (20 < stress < 27.5), default (stress > 27.5). Data are hypothetical. Means are 15, 25, 30.

Example: differing variances – not just the closest mean now. Normal distributions with different variances. Classify a credit card holder by financial stress index: pay all (stress < 19.48), pay part (19.48 < stress < 26.40), default (stress > 26.40). Data are hypothetical. Means are 15, 25, 30; standard deviations 3, 6, 3.

Example: 20% defaulters, 40% pay part, 40% pay all. Normal distributions with the same variances as before and these "priors." Classify a credit card holder by financial stress index: pay all (stress < 19.48), pay part (19.48 < stress < 28.33), default (stress > 28.33). Data are hypothetical. Means are 15, 25, 30; standard deviations 3, 6, 3. Population size ratios 2:2:1.

μ = 15, 25, or 30; σ = 3 here. The part that changes from one population to another is Fisher's linear discriminant function. Here it is (μx − μ²/2)/σ². The bigger this is, the more likely it is that X came from the population with mean μ. The three probability densities are in the same proportions as their values of exp((μx − μ²/2)/σ²).

μ = 15, 25, or 30; σ = 3 here. Example: stress index X = 21. Classify as someone who pays only part of the credit card bill, because 21 is closer to 25 than to 15 or 30. The probabilities 0.24, 0.74, 0.02 give more detail.
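The posterior probabilities quoted above follow directly from the discriminant formula; this sketch reproduces them for X = 21:

```python
import math

# Posterior class probabilities from Fisher's linear discriminant
# (equal variances sigma^2 = 9, equal priors): densities are in the
# same proportions as exp((mu*x - mu^2/2) / sigma^2).
means = {"pay all": 15, "pay part": 25, "default": 30}
sigma2 = 9.0

def posteriors(x):
    score = {g: math.exp((m * x - m * m / 2) / sigma2)
             for g, m in means.items()}
    total = sum(score.values())
    return {g: s / total for g, s in score.items()}

post = posteriors(21)
print({g: round(p, 2) for g, p in post.items()})
# pay all 0.24, pay part 0.74, default 0.02
```

With equal priors and variances this is equivalent to picking the closest mean, but the posteriors show how confident the closest-mean call actually is.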

Fisher's discriminant for the credit card data was (μx − μ²/2)/σ² when the variance was 9 in every group:
 1.67 X − 12.5 for mean 15
 2.78 X − 34.7 for mean 25
 3.33 X − 50.0 for mean 30
For 2 inputs it again has a linear form ___ + ___ X 1 + ___ X 2, where the coefficients come from the bivariate mean vector and the 2×2 covariance matrix assumed constant across populations. When variances differ, the formulas become more complicated, the discriminant becomes quadratic, and the boundaries are not linear.

Example: Two inputs, three populations Multivariate normals, same 2x2 covariance matrix. Viewed from above, boundaries appear to be linear.

(Plot in the (income, debt) plane: regions for Defaults, Pays part only, and Pays in full.)

Generated sample: 1000 per group, plotted as debt index vs. income index.

proc discrim data=a;
 class group;
 var X1 X2;
run;
 Constant = −½ X̄ j ' COV⁻¹ X̄ j; coefficient vector = COV⁻¹ X̄ j
 (Linear discriminant function for each group: Constant, X1, and X2 coefficients.)

(Error count estimates for each group – rate and priors – and the table of number and percent of observations classified into each group from each group.)

Generated sample (1000 per group), debt index vs. income index, with regions Defaults, Pays part only, Pays in full: the sample discriminants differ somewhat from the theoretical ones.