Presentation on theme: "Classification Tree Interaction Detection. Use of decision trees Segmentation Stratification Prediction Data reduction and variable screening Interaction."— Presentation transcript:

1 Classification Tree Interaction Detection

2 Use of decision trees
◦ Segmentation
◦ Stratification
◦ Prediction
◦ Data reduction and variable screening
◦ Interaction identification
◦ Category merging
◦ Discretizing continuous variables

3 Highly visual diagrams enable you to present categorical results in an intuitive manner— so you can more clearly explain the results to non-technical audiences. These trees enable you to explore your results and visually determine how your model flows. Visual results can help you find specific subgroups and relationships that you might not uncover using more traditional statistics. Because classification trees break the data down into branches and nodes, you can easily see where a group splits and terminates.

4 CHAID or CART
Chi-Square Automatic Interaction Detector (CHAID)
◦ Based on the chi-square statistic
◦ All predictors discretized
◦ Dependent variable: nominal
Classification and Regression Tree (CART)
◦ Variables can be discrete or continuous
◦ Based on the Gini index or F-test
◦ Dependent variable: nominal or continuous

5 Use of Decision Trees
Classify observations from a target binary or nominal variable → Segmentation
Predictive response analysis from a target numerical variable → Behaviour
Decision support rules → Processing

6 Credit risk for Bank Question for the bank: What is the probability that a customer will default on their loan? Or: what are the characteristics of customers who default on their loans?

7 Sample data for credit ratings: columns Credit_rating, Age, Income, Credit_cards, Education, Car_loans (the row values are not recoverable from the transcript)

8 Credit risk for Bank A bank needs to categorize credit applicants according to whether or not they represent a reasonable credit risk. Based on what? Past data: credit ratings of past customers, age, income, number of credit cards, education.

9 Credit risk for Bank The bank needs to categorize customers according to their credit rating: good or bad. Explore the significant variable(s) that differentiate customers with good and bad credit ratings. How?

10 Different Methods CHAID: Chi-Square Automatic Interaction Detection CART: Classification And Regression Tree

11 Type of Data Nominal: categorical Ordinal: categorical with rank/order Scale: continuous, e.g. age, income (treated as ordered categories)

12 CHAID Algorithm The algorithm accepts only nominal or ordinal categorical predictors. Continuous predictors are transformed into ordinal predictors before the algorithm is applied.
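The discretization step above can be sketched in a few lines of Python. This is a simple quantile-binning routine, not SPSS's exact preprocessing; the function name and bin count are illustrative assumptions.

```python
from bisect import bisect_right
from statistics import quantiles

def discretize(values, n_bins=4):
    """Turn a continuous predictor into an ordinal one by quantile binning,
    a simple stand-in for CHAID's preprocessing step (illustrative only)."""
    cuts = quantiles(values, n=n_bins)            # n_bins - 1 interior cut points
    return [bisect_right(cuts, v) for v in values]  # ordinal codes 0..n_bins-1

ages = [21, 25, 29, 33, 36, 40, 44, 58]
print(discretize(ages))  # → [0, 0, 1, 1, 2, 2, 3, 3]
```

Each age is replaced by the quartile it falls into, giving an ordered categorical variable the CHAID algorithm can test and merge like any other ordinal predictor.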

13 CHAID CHAID (Chi-square Automatic Interaction Detection) performs multi-level (multi-way) splits. Splits are based on the value of the chi-square statistic. For categorical predictors, p-values are computed for chi-square tests of independence between the classes and the levels of the predictor present at the node. If the chi-square value is significant, a split is generated; otherwise, categories are merged.
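The split-or-merge decision above can be sketched with a hand-rolled Pearson chi-square statistic. The counts below are invented for illustration, and real CHAID also applies a Bonferroni adjustment to the p-values, which this sketch omits.

```python
def chi_square(table):
    """Pearson chi-square statistic for an observed contingency table
    (rows = predictor categories, columns = outcome classes)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Illustrative counts (not the bank's real data): good/bad ratings for two
# adjacent income categories present at a node.
node = [[30, 10],   # low income: 30 good, 10 bad
        [12, 28]]   # medium income: 12 good, 28 bad
stat = chi_square(node)
# df = (2-1)*(2-1) = 1; the critical value at alpha = 0.05 is 3.841.
print("split" if stat > 3.841 else "merge")  # → split
```

Here the statistic (about 16.24) far exceeds the critical value, so the two income categories stay split; had it fallen below 3.841, CHAID would merge them into one node.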

14 Credit risk for Bank Dependent variable: Credit rating (Good or Bad) Independent variables: income, age, number of credit cards

15 Methodology Variables available: credit rating, age, income, credit cards, education, car loans. Each candidate predictor is compared against the dependent variable, and the most significant variable is identified based on its chi-square value.
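The "pick the most significant variable" step might look like the sketch below. All data are invented; since both toy predictors yield 2×2 tables (one degree of freedom), comparing raw chi-square statistics is equivalent to comparing p-values here, although CHAID proper ranks predictors by Bonferroni-adjusted p-values.

```python
from collections import Counter

def chi2_stat(xs, ys):
    """Pearson chi-square statistic between two categorical sequences."""
    rows, cols = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    n = len(xs)
    return sum((joint[(r, c)] - rows[r] * cols[c] / n) ** 2 / (rows[r] * cols[c] / n)
               for r in rows for c in cols)

# Toy data (invented): which predictor best separates credit ratings?
rating = ["bad", "bad", "bad", "good", "good", "good", "good", "bad"]
income = ["low", "low", "low", "high", "high", "high", "high", "low"]
cards  = ["few", "many", "few", "many", "few", "many", "few", "many"]

best = max([("income", income), ("cards", cards)],
           key=lambda p: chi2_stat(p[1], rating))
print(best[0])  # → income
```

Income separates good and bad ratings perfectly in this toy sample, so it gets the largest chi-square statistic and becomes the first split, mirroring the tree in the following slides.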

16 Credit rating depends significantly on the income level of the customer. Income is the most significant variable. Credit ratings for the three income levels are significantly different.

17 After income, the number of credit cards is the next significant variable. The number of credit cards further divides the medium and high income groups. Within the low income group there is no further split. Credit ratings for the medium and high income groups depend on the number of credit cards they hold.

18 The next significant variable is age, for customers in the medium income group holding 5 or more credit cards. Credit ratings of customers with medium income and 5 or more credit cards depend on their age. Age is not significant for the low and high income groups, regardless of the number of credit cards they have.

19 Interpretation The bank should check a customer's income and the number of credit cards he or she holds. If the customer belongs to the medium income group and has 5 or more credit cards, age should also be checked. The probability of a bad credit rating is much higher for younger customers (age < 28). These findings can be used to decide whether a customer applying for a loan should be approved.
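Read as code, this interpretation becomes a short rule function. The thresholds (5 cards, age 28) come from the slides; the slides do not state a risk verdict for the other branches, so the sketch returns the node label there instead of guessing.

```python
def default_risk(income, credit_cards, age):
    """Decision path from the CHAID tree: income first, then number of
    credit cards, then (only for medium income with 5+ cards) age."""
    if income == "medium" and credit_cards >= 5:
        # The slides state bad ratings are much more likely below age 28.
        return "high" if age < 28 else "low"
    # The slides give the split structure but not the risk direction for the
    # remaining branches, so report the node rather than a verdict.
    return f"consult tree node for income={income!r}"

print(default_risk("medium", 6, 25))  # → high
print(default_risk("medium", 6, 40))  # → low
```

This is exactly the "set of if-then (split) conditions" that the CART discussion on the next slide describes: every leaf of a classification tree corresponds to one such rule.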

20 Classification and regression trees (CART) The classic CART algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). In the most general terms, the purpose of analysis via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.

21 CART Algorithm
1. Univariate split method for categorical and ordered predictor (independent) variables; at each node, splits are binary
2. All possible splits for each predictor variable at each node are examined to find the split producing the maximum improvement in goodness of fit (the best split)
3. Goodness of fit is measured by an appropriate measure, e.g. the Gini index
4. Splitting stops when (i) the desired tree size is achieved, or (ii) there are not enough cases in a node to split
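Step 2 can be sketched minimally with Gini impurity as the goodness-of-fit measure. This toy handles one ordered predictor at one node; a real CART implementation adds multiple predictors, recursion, pruning, and surrogate splits. The data are invented.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Try every threshold on one ordered predictor; return the binary split
    minimizing the weighted child impurity (one CART step, illustrative)."""
    n = len(values)
    best = (None, gini(labels))  # (threshold, weighted impurity before any split)
    for t in sorted(set(values))[:-1]:
        left  = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

ages    = [22, 25, 27, 31, 35, 40, 44, 58]
ratings = ["bad", "bad", "bad", "good", "good", "good", "good", "good"]
print(best_binary_split(ages, ratings))  # → (27, 0.0)
```

The threshold age ≤ 27 separates the toy classes perfectly, so the weighted impurity drops to zero; with real data the best split merely reduces impurity, and the algorithm recurses until one of the stopping rules in step 4 fires.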

22 Branded Pharmacy Stores Dependent variable: preference for pharmacy store (branded or local shops) Independent variables: income level, annual medical expenses, age, gender, mediclaim policy

23

24 Tree Classification by CART Every split is binary. Only two income groups are considered per split: low income and more-than-low income. There is no split within the low income group; the more-than-low income group is split by age at step 2. At step 3, annual medical expenditure splits the tree; at step 4, income level and age split it again. Thus, whenever more than two distinct categories are present, binary splits do not yield a tree classification that can be interpreted easily. CART is therefore preferred when binary splits are required.

25 How to choose the appropriate method? There is no mathematical rule for this. CART is preferred when (i) independent variables have two categories, or (ii) independent variables are ordinal or scale. CHAID is preferred when (i) independent variables are categorical, or (ii) independent variables have multiple categories and multi-way splits are desirable.

26 Fast food Customers of fast food restaurants can be categorized by marital status. Married customers can be classified according to number of children, and singles according to occupation. Occupation can be further classified according to income, and customers with children according to their age. In the tree, the number of customers and their average number of visits to restaurants is indicated.

27 AID for fast food restaurants (average visits per week; number of customers):
Total sample: 5.28 (295)
├─ Married: 2.83 (154)
│  ├─ 2 or more children: 5.39 (94)
│  │  ├─ Age 21–45: 7.23 (49)
│  │  └─ Over 45 years of age: 4.51 (45)
│  └─ 1 or no children: 1.86 (50)
└─ Single: 7.97 (141)
   ├─ Blue-collar occupation: 10.18 (85)
   │  ├─ Annual income less than $15,000: 14.32 (63)
   │  └─ Annual income $15,000 or more: 8.11 (22)
   └─ White-collar occupation: 5.03 (56)

28 Interpretation While preparing the menu and pricing, the restaurant should take into account that the customers who visit most often (7.97 visits per week on average) are single. A large number of them have blue-collar jobs and income less than $15,000, so the menu should include cheaper items and their choices. Among married customers, those who visit more frequently are in the 21–45 age group with children, so the menu should also include children's choices.

29 Used For Segmentation: of the members of the group Prediction: create rules and use them to predict future events (will the customer default if …?) Data reduction and variable screening: select the few useful predictors from a large set Interaction: identify the interactions between variables

30 Preference for Pharmacy Stores Which variables predict the preference for pharmacy stores for medicines? Preference is the dependent variable; the independent variables are gender, annual income, annual medical expenses, mediclaim and age.

31

32 Interpretation Preference for buying medicines from pharmacy stores depends mainly on the annual income, annual medical expenses and age of the customer. In the low income group, where there is no budgeted annual medical expense, customers prefer local shops to branded stores. Branded pharmacy stores are preferred by the medium and high income groups who have some budgeted annual medical expenses. Stores should shape their customer services considering these findings.

