Published by Brett Newman. Modified over 6 years ago.
1
MIS2502: Data Analytics
Classification using Decision Trees
Jaclyn Hansberry
2
What is classification?
Determining to what group a data element belongs
- Or “attributes” of that “entity”
Examples
- Determining whether a customer should be given a loan
- Flagging a credit card transaction as a fraudulent charge
- Categorizing a news story as finance, entertainment, or sports
3
How classification works
1. Split the data set into training and validation subsets
2. Choose a discrete outcome variable
3. Find model that predicts the outcome as a function of the other attributes
4. Apply the model to the validation set to check accuracy
5. Apply the final model to future cases
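The five steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the course's R workflow: the records, the `split_data` helper, and the one-rule stand-in model `fit_majority_by` are all hypothetical.

```python
import random

# Hypothetical records: (income_band, outcome). Not data from the slides.
records = [("low", "Default")] * 8 + [("low", "No Default")] * 2 \
        + [("high", "No Default")] * 8 + [("high", "Default")] * 2

def split_data(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows, then split them into training and validation subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def fit_majority_by(rows):
    """Stand-in 'model': map each predictor value to its most common
    outcome in the training set (the outcome is the discrete variable)."""
    counts = {}
    for value, outcome in rows:
        counts.setdefault(value, {}).setdefault(outcome, 0)
        counts[value][outcome] += 1
    return {value: max(c, key=c.get) for value, c in counts.items()}

def accuracy(model, rows):
    """Share of rows whose predicted outcome matches the actual outcome."""
    hits = sum(1 for value, outcome in rows if model.get(value) == outcome)
    return hits / len(rows)

train, validation = split_data(records)
model = fit_majority_by(train)
print(f"validation accuracy: {accuracy(model, validation):.2f}")

new_case = "low"                 # a future case, classified by the final model
print(model.get(new_case))
```

The validation rows play no part in fitting, so the accuracy printed here estimates how the model would do on cases it has not seen.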
4
Decision Tree Learning
Training Set -> Derive model (classification software) -> Apply model -> Validation Set

Training Set
Trans. ID | Charge Amount | Avg. Charge 6 months | Item | Same state as billing | Classification
1 | $800 | $100 | Electronics | No | Fraudulent
Rows 2 to 8 (incomplete): 2 $60 Gas Yes Legitimate; 3 $1 $50; 4 $200 Restaurant; 5 $40; 6 $80 Groceries; 7 $140 Retail; 8

Validation Set
Trans. ID | Charge Amount | Avg. Charge 6 months | Item | Same state as billing | Classification
101 | $100 | $200 | Electronics | Yes | ?
Rows 102 to 104 (incomplete): 102 Groceries No; 103 $1 Gas; 104 $30 $25 Restaurant
5
Goals
The trained model should assign new cases to the right category
- It won’t be 100% accurate, but should be as close as possible
The model’s rules can be applied to new records as they come along
- An automated, reliable way to predict the outcome
6
Classification Method: The Decision Tree
A model to predict membership of cases or values of a dependent variable based on one or more predictor variables (Tan, Steinbach, and Kumar 2004)
7
Example: Credit Card Default
We create the tree from a set of training data
- Each unique combination of predictors is associated with an outcome
- This set was “rigged” so that every combination is accounted for and has an outcome

Training Data (predictors: Income, Debt, Owns/Rents; classification: Outcome)
TID  Income  Debt  Owns/Rents  Outcome
1    25k     35%   Owns        Default
2    35k     40%   Rent
3    33k     15%               No default
4    28k     19%   Rents
5    55k     30%
6    48k
7    65k     17%
8    85k     10%
…
8
Example: Credit Card Default
Predictors: Income, Debt, Owns/Rents. Classification: Outcome.
Training Data: the same table as on the previous slide.
Tree:
- Root node: Credit Approval
- Child nodes: Income <40k or Income >40k, then Debt > 20% or Debt < 20%, then Owns house or Rents
- Leaf nodes: Default or No Default
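A decision tree like this one reads naturally as nested conditions. The sketch below is a hedged illustration: the leaf outcomes come from the combinations the deck spells out on slides 12 and 15 to 17, while two leaves it never states (renters with income above 40k and debt above 20%, owners with income above 40k and debt below 20%) are assumptions. The function name `classify` and the handling of values exactly at 40k or 20% are also assumptions.

```python
def classify(income_k, debt_pct, owns_home):
    """Walk from the root node to a leaf node and return the predicted outcome."""
    if income_k < 40:                       # child node: Income < 40k
        if debt_pct > 20:                   # child node: Debt > 20%
            return "Default"                # leaf: same for owners and renters
        # Debt < 20%: owners do not default, renters do
        return "No Default" if owns_home else "Default"
    # Income > 40k
    if debt_pct > 20:
        # Owners -> "No Default" is stated in the deck; renters -> "Default" is ASSUMED
        return "No Default" if owns_home else "Default"
    # Debt < 20%: renters -> "No Default" is stated in the deck; owners ASSUMED the same
    return "No Default"

print(classify(25, 35, True))    # TID 1 in the training data: 25k, 35%, Owns
```

Each `if` corresponds to a child node and each `return` to a leaf node, so the depth of nesting matches the depth of the tree.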
9
Same Data, Different Tree
Predictors: Income, Debt, Owns/Rents. Classification: Outcome.
Training Data: the same table as on slide 7.
Tree: Credit Approval, then Owns or Rents, then Income < 40k or Income > 40k, then Debt > 20% or Debt < 20%, ending in Default or No Default
We just changed the order of the predictors.
10
Apply to new (validation) data
TID  Income  Debt  Owns/Rents  Decision (Predicted)  Decision (Actual)
1    80k     35%   Rent        Default               No Default
2    20k     40%   Owns
3    15k     15%
4    50k     19%   Rents
5    35k     30%

[Tree diagram: Credit Approval, then Income, then Debt, then Owns house or Rents, ending in Default or No Default]

How well did the decision tree do in predicting the outcome?
When it’s “good enough,” we’ve got our model for future decisions.
11
In a real situation…
The tree induction software has to deal with instances where…
- The same set of predictors results in different outcomes
- Multiple paths result in the same outcome
- Not every combination of predictors is in the training set
12
Tree Induction Algorithms
Tree induction algorithms take large sets of data and compute the tree
Similar cases may have different outcomes
- So the probability of an outcome is computed
[Tree diagram: Credit Approval, then Income, then Debt, then Owns house or Rents, ending in Default or No Default; one leaf is labeled 0.8]
For instance, you may find: when income > 40k, debt < 20%, and the customer rents, no default occurs 80% of the time.
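One way to picture how a leaf probability such as 0.8 arises is to count outcomes among the training cases that land in each leaf. The sketch below assumes hypothetical cases (ten in the renter leaf, eight of them without a default); `leaf_probabilities` is an illustrative helper, not a function from any tree package.

```python
from collections import Counter, defaultdict

# Hypothetical cases, each tagged with the leaf it falls into: (leaf, outcome).
renter_leaf = ("Income >40k", "Debt <20%", "Rents")
owner_leaf = ("Income <40k", "Debt >20%", "Owns")
cases = [(renter_leaf, "No Default")] * 8 \
      + [(renter_leaf, "Default")] * 2 \
      + [(owner_leaf, "Default")] * 3

def leaf_probabilities(rows):
    """Return {leaf: {outcome: share of that leaf's cases}}."""
    counts = defaultdict(Counter)
    for leaf, outcome in rows:
        counts[leaf][outcome] += 1
    return {leaf: {o: n / sum(c.values()) for o, n in c.items()}
            for leaf, c in counts.items()}

probs = leaf_probabilities(cases)
print(probs[renter_leaf]["No Default"])
```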
13
How the induction algorithm works
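1. Start with a single node containing all training data.
2. Are the samples all of the same classification? If yes, go to step 5. If no, go to step 3.
3. Are there predictor(s) that will split the data? If no, go to step 5. If yes, go to step 4.
4. Partition the node into child nodes according to the predictor(s).
5. Are there more nodes (i.e., new child nodes)? If yes, go to the next node and repeat from step 2. If no, DONE!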
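The flowchart can be sketched as a short recursive function. This is a simplified illustration, not the algorithm any particular package implements: recursion stands in for "go to next node", predictors are tried in the order given rather than chosen by a statistic, and the six rows are hypothetical cases built from the combinations named elsewhere in the deck.

```python
from collections import Counter

def build_tree(rows, predictors):
    """rows: list of (features_dict, outcome); predictors: names not yet used."""
    outcomes = Counter(outcome for _, outcome in rows)
    # Stop when all samples share one classification or no predictor is left.
    if len(outcomes) == 1 or not predictors:
        return outcomes.most_common(1)[0][0]       # leaf: majority outcome
    first, rest = predictors[0], predictors[1:]
    groups = {}
    for features, outcome in rows:                 # partition into child nodes
        groups.setdefault(features[first], []).append((features, outcome))
    if len(groups) == 1:                           # this predictor cannot split the node
        return build_tree(rows, rest)
    return {first: {value: build_tree(subset, rest)    # visit each new child node
                    for value, subset in groups.items()}}

# Hypothetical training rows (assumed for illustration).
rows = [
    ({"income": "<40k", "debt": ">20%", "home": "owns"}, "Default"),
    ({"income": "<40k", "debt": ">20%", "home": "rents"}, "Default"),
    ({"income": "<40k", "debt": "<20%", "home": "owns"}, "No Default"),
    ({"income": "<40k", "debt": "<20%", "home": "rents"}, "Default"),
    ({"income": ">40k", "debt": ">20%", "home": "owns"}, "No Default"),
    ({"income": ">40k", "debt": "<20%", "home": "rents"}, "No Default"),
]
tree = build_tree(rows, ["income", "debt", "home"])
print(tree)
```

Passing the predictors in a different order, for example `["home", "income", "debt"]`, yields a different tree from the same rows.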
14
Start with root node
There are both “defaults” and “no defaults” in the set
So we need to look for predictors to split the data
[Diagram: the full tree (Credit Approval, then Income, then Debt, then Owns house or Rents, ending in Default or No Default) next to the tree built so far: Credit Approval]
15
Split on income
Income is a factor
- (Income < 40, Debt > 20, Owns = Default) but (Income > 40, Debt > 20, Owns = No default)
But there is also a combination of defaults and no defaults within each income group
So look for another split
[Diagram: the full tree next to the tree built so far: Credit Approval, then Income <40k or Income >40k]
16
Split on debt
Debt is a factor
- (Income < 40, Debt < 20, Owns = No default) but (Income < 40, Debt < 20, Rents = Default)
But there is also a combination of defaults and no defaults within some debt groups
So look for another split
[Diagram: the full tree next to the tree built so far: Credit Approval, then Income <40k (Debt >20% or Debt <20%) or Income >40k]
17
Split on Owns/Rents
Owns/Rents is a factor
- For some cases it doesn’t matter, but for some it does
- So you group similar branches
…And we stop because we’re out of predictors!
[Diagram: the full tree next to the finished tree: Credit Approval, then Income <40k (Debt >20%: Default; Debt <20%: Owns house: No Default, or Rents) or Income >40k]
18
How does it know when and how to split?
There are statistics that show
- When a predictor variable maximizes distinct outcomes (if age is a predictor, we should see that older people buy; younger people don’t)
- When a predictor variable separates outcomes (if age is a predictor, we should not see older people who buy mixed up with older people who don’t)
19
The Chi-squared test
Is the proportion of the outcome class the same in each child node?
- It shouldn’t be, or the classification isn’t very helpful

Root (n=1500): Default = 750, No Default = 750
- Owns (n=850): Default = 300, No Default = 550
- Rents (n=650): Default = 450, No Default = 200

Observed
            Owns  Rents  Total
Default      300    450    750
No Default   550    200    750
Total        850    650   1500

Expected
            Owns  Rents  Total
Default      425    325    750
No Default   425    325    750
Total        850    650   1500

The Expected table means that owning or renting has no influence on default rate (it’s 50/50 in either case)
If the groups were the same, you’d expect an even split (Expected)
But we can see they aren’t distributed evenly (Observed)
But is it enough (i.e., statistically significant)?
20
Chi-squared test
Asks: How different is the observed distribution from the expected distribution?
- Note that you’ve got four distance calculations and four cells.
- The more difference in each cell, the higher the test statistic
(Observed and Expected tables as on the previous slide)
Small p-values (i.e., less than 0.05) mean it’s very unlikely the groups are the same
So Owns/Rents is a predictor that creates two different groups
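The four distance calculations can be worked through in a few lines of Python using the Observed counts from the slides. This is a hedged sketch of the standard Pearson chi-squared statistic, which the deck does not write out; the function name `chi_squared` is illustrative, and the p-value shortcut via `math.erfc` holds only for one degree of freedom, as in a 2x2 table.

```python
import math

# Observed counts from the slides.
observed = {"Default": {"Owns": 300, "Rents": 450},
            "No Default": {"Owns": 550, "Rents": 200}}

def chi_squared(table):
    """Pearson chi-squared statistic for a table given as {row: {column: count}}."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    total = sum(row_totals.values())
    stat = 0.0
    for r, cols in table.items():
        for c, n in cols.items():
            expected = row_totals[r] * col_totals[c] / total   # e.g. 750 * 850 / 1500 = 425
            stat += (n - expected) ** 2 / expected             # one "distance" per cell
    return stat

stat = chi_squared(observed)
# With one degree of freedom, the p-value equals erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 2), p_value < 0.05)
```

Every cell here differs from its expected count by 125, which gives a statistic of about 169.7 and a p-value far below 0.05.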
21
Bottom line: Interpreting the Chi-Squared Test
High statistic (low p-value) from chi-squared test means the groups are different
R calculates the logworth value behind the scenes
- Which is -log(p-value), using the natural log
- Way to compare split variables (bigger logworth = better split)
- Low p-values are better: -log(0.05) ≈ 3.00, -log(0.0001) ≈ 9.21
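The logworth figures above can be reproduced with the natural logarithm. The sketch below is illustrative only: `logworth` is a hypothetical helper, and assigning the two p-values to the named split variables is an assumption made for the comparison.

```python
import math

def logworth(p_value):
    """-log(p-value), using the natural log to match the slide's figures."""
    return -math.log(p_value)

# Hypothetical p-values for two candidate split variables.
candidates = {"Owns/Rents": 0.0001, "Income": 0.05}

# The candidate with the largest logworth (smallest p-value) is the preferred split.
best = max(candidates, key=lambda name: logworth(candidates[name]))
print(best, round(logworth(candidates[best]), 2))
```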
22
Reading the R Decision Tree
Outcome cases are not 100% certain
- There are probabilities attached to each outcome in a node
- So let’s code “Default” as 1 and “No Default” as 0

Credit Approval
  Income <40k
    Debt >20%: 0: 15%, 1: 85%
    Debt <20%
      Owns house: 0: 80%, 1: 20%
      Rents: 0: 70%, 1: 30%
  Income >40k
    Leaf probabilities: (0: 78%, 1: 22%), (0: 40%, 1: 60%), (0: 74%, 1: 26%)

So what is the chance that:
- A renter making more than $40,000 and debt more than 20% of income will default?
- A home owner making less than $40,000 and debt more than 20% of income will default?