Decision trees part II.


LESSON TOPICS
CHAID method: Chi-Squared Automatic Interaction Detection
Chi-square test
Bonferroni correction factor
Examples

Principal features of CHAID method

CHAID merges categories of the predictor that are homogeneous with respect to the dependent variable, but keeps distinct all the categories that are heterogeneous

CHAID uses the Bonferroni multiplier to make the adjustments needed for simultaneous statistical inference

Unlike other iterative partitioning methods, CHAID is limited to ordinal and nominal variables

It uses the chi-square test to verify independence between variables, together with the Bonferroni factor, to assess the significance of a partition

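Chi-square test of independence

$$\chi^2 = \sum_i \sum_j \frac{(n_{ij} - n^*_{ij})^2}{n^*_{ij}}$$

where $n_{ij}$ is the empirical frequency corresponding to the combination of category $i$ of the first variable with category $j$ of the second variable

$n^*_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$ is the corresponding theoretical frequency under the hypothesis of independence between the two variables, where $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals and $n$ is the total number of observations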

EXAMPLE
Families according to residence and personal computer ownership (empirical frequencies)

                                  Geographic zone
Ownership of personal computer    North-Center    South    Total
YES                                        150      500      650
NO                                         100      250      350
Total                                      250      750     1000

Families according to residence and personal computer ownership (theoretical frequencies)

                                  Geographic zone
Ownership of personal computer    North-Center    South    Total
YES                                      162.5    487.5    650.0
NO                                        87.5    262.5    350.0
Total                                    250.0    750.0   1000.0

Test calculations:

$$\chi^2 = \frac{(500-487.5)^2}{487.5} + \frac{(100-87.5)^2}{87.5} + \frac{(150-162.5)^2}{162.5} + \frac{(250-262.5)^2}{262.5} \approx 3.66$$
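The calculation for the personal computer example can be reproduced in a few lines of Python. This is a minimal sketch rather than part of the original slides. The function names are hypothetical. The p-value helper assumes a 2 x 2 table (one degree of freedom), where the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math

# Observed (empirical) frequencies from the example:
# rows = PC ownership (YES, NO), columns = zone (North-Center, South)
observed = [[150, 500],
            [100, 250]]

def expected_frequencies(table):
    """Theoretical frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

expected = expected_frequencies(observed)
stat = chi_square(observed)
p = p_value_df1(stat)

print(expected)        # [[162.5, 487.5], [87.5, 262.5]]
print(round(stat, 3))  # 3.663
print(p)               # slightly above 0.05
```

With one degree of freedom the 5% critical value is 3.841, so a statistic of about 3.66 falls just short of significance at that level.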

Bonferroni adjustment factor
Consider the dependent variable R and the predictors B, with five categories, and A, with two. Let $\alpha$ be the type I error probability of the independence test in the two-way table of B and R (for example $\alpha = 0.05$)

There are $2^4 - 1 = 15$ different ways to turn B into a dichotomous variable. If the 15 hypothesis tests were independent, the probability of making at least one type I error would be $1 - (1-\alpha)^{15} > \alpha$

In the above example, 15 is called the Bonferroni factor. If $\alpha$ is small, $1 - (1-\alpha)^M \approx M\alpha$. For the predictor A the probability of making a type I error is simply $\alpha$
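The count of 15 and the size of the inflated error can be checked numerically. The sketch below is illustrative only: the labels "ABCDE" are hypothetical stand-ins for the five categories of B, and `count_dichotomies` is a made-up helper that enumerates every split into two non-empty groups.

```python
from itertools import combinations

def count_dichotomies(categories):
    """Count the ways to split the categories into two non-empty groups."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    count = 0
    # Keep the first category in group 1 so that each split is counted once.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            group1 = {first, *extra}
            if len(group1) < len(cats):  # group 2 must not be empty
                count += 1
    return count

alpha = 0.05
m = count_dichotomies("ABCDE")    # hypothetical labels for the categories of B
exact = 1 - (1 - alpha) ** m      # error probability if the m tests were independent
approx = m * alpha                # Bonferroni approximation

print(m)                                  # 15
print(round(exact, 3), round(approx, 2))  # 0.537 0.75
```

With a level of 0.05 and 15 tests the product M times the level is no longer small, so the approximation overshoots the exact value. It gets closer as the level shrinks.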

In the CHAID method we compare the value of $\alpha$ associated with the test of independence for the variable A with the value of $\alpha$ for the variable B corrected by the Bonferroni factor

Basic components of CHAID:

1. A categorical dependent variable
2. A set of independent variables, also categorical, whose combinations are used to define the partitions
3. A set of parameters

In each step of the analysis, each subgroup is analyzed and we get the best predictor, defined as the one with the smallest value of $\alpha$ after correction by the Bonferroni factor

Kinds of predictive variables in CHAID
1. Monotonic
2. Free
3. Floating

The CHAID algorithm:
Step 1: Merging
Step 2: Splitting
Step 3: Stopping

Merging

For each predictor:

1. Construct the complete two-way table

2. For each pair of categories that can be merged, calculate the chi-square test. Merge each pair that is not significant and go to step 3. If all the remaining pairs are significant, go to step 4

3. For each category resulting from the merge of three or more original categories, check with the chi-square test whether any original category can be separated from the others. Go back to step 2

4. Merge categories that have too few observations, choosing those with the smallest chi-square value

5. Calculate the value of $\alpha$ corrected by the Bonferroni factor on the table resulting from the merging process
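The core of the merging step can be sketched in Python under simplifying assumptions, none of which come from the slides: the dependent variable is binary, the predictor is free (any two categories may be merged), the least significant pair is merged first, merging stops once two groups remain, and steps 3 and 4 are omitted. The counts in `toy` are hypothetical and the function names are invented for illustration.

```python
import math
from itertools import combinations

def chi2_pair(a, b):
    """Pearson chi-square for the 2x2 table built from two (yes, no) count pairs.
    Assumes no row or column total is zero."""
    table = [a, b]
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - e) ** 2 / e
    return stat

def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def merge_categories(counts, alpha_merge=0.05):
    """counts maps a category label to (yes, no) counts; returns the merged groups."""
    groups = {(label,): tuple(c) for label, c in counts.items()}
    while len(groups) > 2:
        # The least significant pair is the one with the largest p-value.
        p, g1, g2 = max(
            (p_value_df1(chi2_pair(groups[x], groups[y])), x, y)
            for x, y in combinations(groups, 2)
        )
        if p <= alpha_merge:  # every remaining pair is significant: stop merging
            break
        a, b = groups.pop(g1), groups.pop(g2)
        groups[g1 + g2] = (a[0] + b[0], a[1] + b[1])
    return groups

# Hypothetical (responders, non-responders) counts for four categories
toy = {"a": (30, 970), "b": (32, 968), "c": (60, 940), "d": (90, 910)}
merged = merge_categories(toy)
print(merged)  # {('c',): (60, 940), ('d',): (90, 910), ('a', 'b'): (62, 1938)}
```

Categories a and b respond at almost the same rate and are merged, while c and d stay separate because every remaining pair differs significantly at the 5% level. For monotonic or floating predictors only some pairs would be eligible for merging, which this sketch does not model.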

Splitting
Take as the best predictor the one with the smallest value of $\alpha$ corrected by the Bonferroni factor. If no predictor shows a significant value of $\alpha$, do not split that subgroup

Stopping
Return to step 1 and analyze the next subgroup. Stop when every subgroup has been analyzed or has too few observations

Example of the CHAID method
Dependent variable: response rate to a promotional offer for a magazine subscription

Independent variables

Age of the head of the family - 5 categories - floating (AGE)
Gender - 2 categories - monotonic (GENDER)
Presence of children - 2 categories - monotonic (KIDS)
Family income - 8 categories - monotonic (INCOME)
Credit card - 2 categories - monotonic (BANKCARD)
Number of household members - 6 categories - floating (HHSIZE)
Occupational status - 4 categories - free (OCCUP)

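Representation of the partition process by a dendrogram

Each node shows the response rate (%) followed by the number of households.

Total: 1.15 (81,040), split by HHSIZE
    HHSIZE 1: 1.09 (25,384) -1-
    HHSIZE 23: 1.52 (16,132), split by OCCUP
        W: 2.39 (1,758) -2-
        BO?: 1.42 (14,374) -3-
    HHSIZE 45: 1.92 (6,198) -4-
    HHSIZE ?: 0.87 (33,326), split by GENDER
        M: 0.81 (25,531) -5-
        F: 1.08 (7,795) -6-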

Interpretation of results
Comparison of response according to the variable household size before and after merging

                              % of responses
HHSIZE           Frequency    Before merging    After merging
1                   25,384              1.09
2                   11,240              1.49    1.52 (2 and 3 merged)
3                    4,892              1.59
4                    3,187              1.79    1.92 (4 and 5 merged)
5                    3,011              2.06
Missing value       33,326              0.87
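The merged response rates are frequency-weighted averages of the rates of the original categories, which can be checked directly from the figures in the table. The sketch below is an illustrative check, with `merged_rate` as a hypothetical helper name.

```python
# HHSIZE category -> (frequency, % of responses before merging), from the table
before = {
    "1": (25384, 1.09),
    "2": (11240, 1.49),
    "3": (4892, 1.59),
    "4": (3187, 1.79),
    "5": (3011, 2.06),
    "missing": (33326, 0.87),
}

def merged_rate(labels):
    """Total frequency and frequency-weighted response rate (%) of merged categories."""
    total = sum(before[label][0] for label in labels)
    weighted = sum(before[label][0] * before[label][1] for label in labels)
    return total, weighted / total

n23, r23 = merged_rate(["2", "3"])
n45, r45 = merged_rate(["4", "5"])
n_all, r_all = merged_rate(list(before))

print(n23, round(r23, 2))      # 16132 1.52
print(n45, round(r45, 2))      # 6198 1.92
print(n_all, round(r_all, 2))  # 81040 1.15
```

The last line recovers the overall response rate of 1.15% on 81,040 households shown at the root of the dendrogram.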

Ranking of segments according to response rate

Rank  Number     Response rate  Description
1     Segment 2  2.39           Household with two or three members, head white collar
2     Segment 4  1.92           Household with four or more members
3     Segment 3  1.42           Household with two or three members, head with occupational status other than white collar
4     Segment 1  1.09           Household with one member
5     Segment 6  1.06           Household with missing number of members, head female
6     Segment 5  0.81           Household with missing number of members, head male