Decision Tree Models in Data Mining


Decision Tree Models in Data Mining
Matthew J. Liberatore and Thomas Coghlan

Decision Trees in Data Mining
Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case).
Like logistic regression and neural networks, decision trees can be applied for classification and prediction.
Unlike these methods, no equations are estimated.
A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable.
The rules are of an IF-THEN form – for example: If Risk = Low, then predict on-time payment of a loan.
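As a purely illustrative sketch (not part of the original slides), the Python snippet below uses scikit-learn to fit a small tree on made-up loan data and print its rules in IF-THEN form. The feature names, data, and thresholds are hypothetical, and scikit-learn's CART algorithm differs from the chi-square-based splitting described later in these slides.

```python
# Illustrative sketch only: hypothetical data and feature names, not the slides' example.
# scikit-learn's CART trees differ from Enterprise Miner's chi-square-based trees.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[20, 22], [45, 30], [38, 55], [25, 61], [52, 40], [31, 24]]  # [income_k, age]
y = [0, 1, 1, 1, 1, 0]  # 1 = on-time payment, 0 = not on-time

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income_k", "age"]))
# Each root-to-leaf path reads as an IF-THEN rule, e.g.
# "IF income_k <= 28 AND age <= 23 THEN predict not on-time".
```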

Decision Tree Approach
A decision tree represents a hierarchical segmentation of the data.
The original segment, called the root node, is the entire data set.
The root node is partitioned into two or more segments by applying a series of simple rules over an input variable – for example, risk = low, risk = not low.
Each rule assigns the observations to a segment based on its input value.
Each resulting segment can be further partitioned into sub-segments, and so on – for example, risk = low can be partitioned into income = low and income = not low.
The segments are also called nodes, and the final segments are called leaf nodes or leaves.

Decision Tree Example – Loan Payment
[Tree diagram: the root splits on Income (< $30k vs. >= $30k). The Income < $30k branch splits on Age (< 25: not on-time; >= 25: on-time). The Income >= $30k branch splits on Credit Score (< 600: not on-time; >= 600: on-time).]

Growing the Decision Tree
Growing the tree involves successively partitioning the data – recursive partitioning.
If an input variable is binary, then the two categories can be used to split the data.
If an input variable is interval, a splitting value is used to classify the data into two segments.
For example, if household income is interval and there are 100 possible incomes in the data set, then there are 100 possible splitting values – for example, income < $30k and income >= $30k.
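A minimal sketch of how candidate splitting values for an interval input can be enumerated, assuming midpoints between distinct observed values are used as thresholds; the income values are hypothetical.

```python
import numpy as np

# Hypothetical household incomes in $k; every midpoint between distinct values
# is a candidate threshold of the form income < c vs. income >= c.
income = np.array([18, 24, 24, 30, 30, 41, 55, 72])
values = np.unique(income)
candidates = (values[:-1] + values[1:]) / 2.0

for c in candidates:
    left = income < c
    print(f"split at {c:5.1f}: {left.sum()} cases with income < {c:.1f}, "
          f"{(~left).sum()} cases with income >= {c:.1f}")
```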

Evaluating the partitions
When the target is categorical, a chi-square statistic is computed for each partition of an input variable.
A contingency table is formed that maps responders and non-responders against the partitioned input variable.
For example, the null hypothesis might be that there is no difference between people with income < $30k and those with income >= $30k in making an on-time loan payment.
The lower the significance level or p-value, the more likely we are to reject this hypothesis, meaning that this income split is a discriminating factor.

Contingency Table

                        Income < $30k    Income >= $30k    Total
Payment on-time
Payment not on-time

Chi-Square Statistic
The chi-square statistic measures how different the number of observations in each of the four cells is from the expected number.
The p-value associated with the null hypothesis is computed.
Enterprise Miner then computes the logworth of the p-value: logworth = -log10(p-value).
The split that generates the highest logworth for a given input variable is selected.
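A minimal sketch of the chi-square and logworth computation for one candidate split, using scipy; the contingency-table counts are hypothetical, not values from the slides.

```python
import numpy as np
from scipy.stats import chi2_contingency

#                  income < $30k  income >= $30k
table = np.array([[15,            45],    # payment on-time
                  [25,            15]])   # payment not on-time

# correction=False gives the plain Pearson chi-square (no continuity correction)
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
logworth = -np.log10(p_value)             # logworth = -log10(p-value)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4g}, logworth = {logworth:.2f}")
# For a given input, the split with the highest logworth is selected.
```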

Growing the Tree
In our loan payment example, we have three interval-valued input variables: income, age, and credit score.
We compute the logworth of the best split for each of these variables.
We then select the variable that has the highest logworth and use its split – suppose it is income.
Under each of the two income nodes, we then find the logworth of the best split of age and credit score and continue the process – subject to meeting the threshold on the significance of the chi-square value for splitting and other stopping criteria (described later).

Other Splitting Criteria for a Categorical Target
The gini and entropy measures are based on how heterogeneous the observations are at a given node, which relates to the mix of responders and non-responders at the node.
Let p1 and p0 represent the proportion of responders and non-responders at a node, respectively.
If two observations are chosen (with replacement) from a node, the probability that they are either both responders or both non-responders is (p1)² + (p0)².
The gini index = 1 – [(p1)² + (p0)²], the probability that the two observations are different.
The best case is a gini index of 0 (all observations are the same); an index of ½ means both groups are equally represented.
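A minimal sketch of the gini index at a node, directly implementing the formula above; the counts are hypothetical.

```python
def gini(n_responders, n_non_responders):
    """Gini index = 1 - [(p1)^2 + (p0)^2] for a node with the given counts."""
    n = n_responders + n_non_responders
    p1, p0 = n_responders / n, n_non_responders / n
    return 1.0 - (p1**2 + p0**2)  # probability two draws (with replacement) differ

print(gini(50, 0))    # 0.0 -- all observations the same (best case)
print(gini(25, 25))   # 0.5 -- both groups equally represented
```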

Other Splitting Criteria for a Categorical Target
The rarity of an event with probability pi is defined as -log2(pi).
Entropy is the proportion-weighted sum of this rarity over response and non-response at a node: entropy = -[p1 log2(p1) + p0 log2(p0)].
Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (an equal mix of responders and non-responders).
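A minimal sketch of node entropy for a binary target, matching the formula above; the counts are hypothetical.

```python
import math

def entropy(n_responders, n_non_responders):
    """Entropy = -[p1*log2(p1) + p0*log2(p0)] for a node with the given counts."""
    n = n_responders + n_non_responders
    h = 0.0
    for count in (n_responders, n_non_responders):
        p = count / n
        if p > 0:                    # treat 0 * log2(0) as 0
            h -= p * math.log2(p)    # p weights the rarity -log2(p)
    return h

print(entropy(50, 0))    # 0.0 -- all responders or all non-responders (best case)
print(entropy(25, 25))   # 1.0 -- equal mix of responders and non-responders
```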

Splitting Criteria for a Continuous (Interval) Target
An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue.
As in the sum-of-squares discussion under multiple regression, the F-statistic is based on the ratio of the sum of squares between the groups to the sum of squares within the groups, both adjusted for their degrees of freedom.
The null hypothesis is that there is no difference in the target mean between the two groups.
As before, the logworth of the p-value is computed.
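A minimal sketch of the F-test for a split on an interval target, using scipy's one-way ANOVA; the two groups' revenue values are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway

revenue_left  = np.array([12.0, 15.5, 11.2, 14.8, 13.1])   # e.g., income < $30k segment
revenue_right = np.array([21.3, 19.8, 24.1, 22.7, 20.5])   # e.g., income >= $30k segment

# F = (between-group sum of squares / df_between) / (within-group sum of squares / df_within)
f_stat, p_value = f_oneway(revenue_left, revenue_right)
logworth = -np.log10(p_value)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4g}, logworth = {logworth:.2f}")
```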

Some Adjustments
The more possible splits of an input variable, the less accurate the p-value (there is a bigger chance of falsely rejecting the null hypothesis).
If there are m possible splits, the Bonferroni adjustment corrects the p-value of the best split by subtracting log10(m) from its logworth.
If the Time of Kass Adjustment property is set to Before, the p-values of candidate splits are compared after the Bonferroni adjustment has been applied.

Some Adjustments
Setting the Split Adjustment property to Yes means that the significance of the p-value is adjusted by the depth of the tree.
For example, at the fourth split a calculated p-value of 0.04 becomes 0.04 * 2^4 = 0.64, making the split statistically insignificant.
This leads to rejecting more splits, limiting the size of the tree.
Tree growth can also be controlled by setting:
the Leaf Size property (minimum number of observations in a leaf),
the Split Size property (minimum number of observations required to split a node), and
the Maximum Depth property (maximum number of generations of nodes).
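A minimal sketch of the two p-value adjustments described above. The depth multiplier of 2 raised to the depth is inferred from the slide's 0.04 to 0.64 example; the other numbers are hypothetical.

```python
import math

# Bonferroni (Kass) adjustment: with m candidate splits, subtract log10(m) from the logworth.
p_best, m = 0.004, 100
adjusted_logworth = -math.log10(p_best) - math.log10(m)   # about 2.40 - 2.00 = 0.40

# Depth adjustment: at depth d, the p-value is multiplied by 2**d.
p, depth = 0.04, 4
adjusted_p = p * 2**depth                                  # 0.04 * 16 = 0.64, not significant

print(f"adjusted logworth = {adjusted_logworth:.2f}, depth-adjusted p-value = {adjusted_p:.2f}")
```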

Some Results
The posterior probabilities are the proportions of responders and non-responders at each node.
A node is classified as a responder or non-responder depending on which posterior probability is larger.
In selecting the best tree, one can use Misclassification, Lift, or Average Squared Error.
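A minimal sketch of posterior probabilities and node classification; the counts are hypothetical.

```python
def classify_node(n_responders, n_non_responders):
    """Return the node's posterior probabilities and its classification."""
    n = n_responders + n_non_responders
    posterior = {"responder": n_responders / n, "non-responder": n_non_responders / n}
    label = max(posterior, key=posterior.get)   # class with the larger posterior probability
    return posterior, label

print(classify_node(30, 10))
# ({'responder': 0.75, 'non-responder': 0.25}, 'responder')
```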

Creating a Decision Tree Model in Enterprise Miner
Open the bankrupt project and create a new diagram called Bankrupt_DecTree.
Drag and drop the bankrupt data node and the Decision Tree node (from the Model tab) onto the diagram.
Connect the nodes.

Select ProbChisq for the Criterion property under Splitting Rule.
Change Use Input Once to Yes (otherwise, the same variable can appear more than once in the tree).

Under Subtree, select Misclassification for the Assessment Measure.
Keep the defaults under P-Value Adjustment and Output Variables.
Under Score, set Variable Selection to No (otherwise, variables with importance values less than 0.05 are set to rejected and are not considered by the tree).

The Decision Tree has only one split, on RE/TA.
The misclassification rate is 0.15 (3/20), with 2 false negatives and 1 false positive.
The cumulative lift is somewhat lower than the best cumulative lift, starting out at 1.777 vs. the best value of 2.000.

Under Subtree, set Method to Largest and rerun.
The results show that another split is added, using EBIT/TA; however, the misclassification rate is unchanged at 0.15.
This result shows that setting Method to Assessment and Assessment Measure to Misclassification finds the smallest tree having the lowest misclassification rate.

Model Comparison
The Model Comparison node under the Assess tab can be used to compare several different models.
Create a diagram called Full Model that includes the bankrupt data node connected into the regression, decision tree, and neural network nodes.
Connect the three model nodes into the Model Comparison node, and connect it and the bankrupt_score data node into a Score node.

For Regression, set Selection Model to None; for Neural Network, set Model Selection Criterion to Average Error and the Network properties as before; for Decision Tree, set the Assessment Measure to Average Squared Error and the other properties as before.
This puts each of the models on a similar basis for fit.
For Model Comparison, set the Selection Criterion to Average Squared Error.

Neural Network is selected, although Regression is nearly identical in average squared error.
The Receiver Operating Characteristic (ROC) curve shows sensitivity (true positive rate) vs. 1 - specificity (false positive rate) for various cutoff probabilities of a response.
The chart shows that no matter what the cutoff probability is, regression and neural network classify 100% of responders as responders (sensitivity) and 0% of non-responders as responders (1 - specificity).
The decision tree also performs reasonably well, as indicated by the area under its curve above the diagonal line.
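A hedged sketch of how an ROC curve like the one described above can be produced with scikit-learn and matplotlib; the labels and predicted probabilities below are hypothetical, not the bankrupt data or the models built in Enterprise Miner.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]                                # hypothetical responses
y_score = [0.92, 0.85, 0.78, 0.55, 0.40, 0.88, 0.20, 0.35, 0.95, 0.10]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # fpr = 1 - specificity, tpr = sensitivity
print("area under the ROC curve =", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="no-skill diagonal")
plt.xlabel("1 - specificity (false positive rate)")
plt.ylabel("sensitivity (true positive rate)")
plt.legend()
plt.show()
```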