
Classification Tree Interaction Detection

Use of decision trees
- Segmentation
- Stratification
- Prediction
- Data reduction and variable screening
- Interaction identification
- Category merging
- Discretizing continuous variables

Highly visual diagrams let you present categorical results in an intuitive manner, so you can explain the results clearly to non-technical audiences. These trees let you explore your results and visually determine how your model flows. Visual results can help you find specific subgroups and relationships that you might not uncover using more traditional statistics. Because classification trees break the data down into branches and nodes, you can easily see where a group splits and terminates.

CHAID or CART

CHAID: Chi-Square Automatic Interaction Detector
- Based on the chi-square test
- All variables discretized
- Dependent variable: nominal

CART: Classification and Regression Tree
- Variables can be discrete or continuous
- Based on the Gini measure or F-test
- Dependent variable: nominal or continuous

Use of Decision Trees
- Classifying observations on a binary or nominal target variable: segmentation
- Predictive response analysis on a numerical target variable: behaviour
- Decision support rules: processing

Credit risk for a bank. Question for the bank: What is the possibility that a customer will default on their loan? Or: what are the characteristics of customers who default on their loans?

Sample data for credit ratings, with fields: Credit_rating, Age, Income, Credit_cards, Education, Car_loans

Credit risk for a bank. A bank needs to categorize credit applicants according to whether or not they represent a reasonable credit risk. Based on what? Past data: credit ratings of past customers, age, income, number of credit cards, education.
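Such past data can be pictured as a small table of records. The sketch below is a hypothetical illustration (the field names follow the slide above, but every value is invented):

```python
# Hypothetical mini-dataset with the fields named on the slide
# (Credit_rating, Age, Income, Credit_cards, Education, Car_loans).
# All values are invented for illustration only.
records = [
    {"Credit_rating": "Bad",  "Age": 23, "Income": "Low",    "Credit_cards": 2, "Education": "HS",  "Car_loans": 1},
    {"Credit_rating": "Good", "Age": 45, "Income": "High",   "Credit_cards": 3, "Education": "BSc", "Car_loans": 0},
    {"Credit_rating": "Good", "Age": 38, "Income": "Medium", "Credit_cards": 6, "Education": "BSc", "Car_loans": 2},
    {"Credit_rating": "Bad",  "Age": 27, "Income": "Medium", "Credit_cards": 5, "Education": "HS",  "Car_loans": 1},
]

# The dependent variable is Credit_rating; the rest are candidate predictors.
target = [r["Credit_rating"] for r in records]
print(target)  # ['Bad', 'Good', 'Good', 'Bad']
```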

Credit risk for a bank. The bank needs to categorize customers according to their credit ratings, good or bad, and explore the significant variable(s) that differentiate customers with good credit ratings from those with bad ones. How?

Different methods. CHAID: Chi-Square Automatic Interaction Detection. CART: Classification And Regression Tree.

Types of data. Nominal: categorical. Ordinal: categorical with rank/order. Scale: continuous numeric, e.g. age, income (treated as ordered categories after binning).

CHAID algorithm. The algorithm accepts only nominal or ordinal categorical predictors. When predictors are continuous, they are transformed into ordinal predictors before the algorithm is applied.
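This pre-processing step can be sketched in a few lines. The cut points below are hypothetical; real CHAID implementations choose the bin boundaries automatically:

```python
def discretize(values, cut_points):
    """Map each continuous value to an ordinal band index.

    cut_points must be sorted ascending; a value falls into the first
    band whose upper cut point it does not exceed, and into the last
    band if it exceeds every cut point.
    """
    bands = []
    for v in values:
        band = len(cut_points)          # default: above the last cut point
        for i, cut in enumerate(cut_points):
            if v <= cut:
                band = i
                break
        bands.append(band)
    return bands

# Hypothetical age cut points at 28 and 40 give three ordinal bands.
ages = [23, 45, 38, 27, 52, 25]
print(discretize(ages, cut_points=[28, 40]))  # [0, 2, 1, 0, 2, 0]
```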

CHAID. Chi-Square Automatic Interaction Detection performs multi-level splits. Splits are based on the value of the chi-square statistic. For categorical predictors, p-values are computed for chi-square tests of independence between the classes and the levels of the predictor present at the node. If the chi-square value is significant, a split is generated; otherwise the levels are merged.
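The split-or-merge decision can be illustrated with a hand-rolled Pearson chi-square statistic on a contingency table of predictor levels versus classes. The counts below are invented; 3.841 is the 5% critical value for one degree of freedom:

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2-D contingency table
    (rows = predictor levels, columns = classes)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = two income levels, columns = Good/Bad rating.
table = [[30, 10],
         [15, 25]]
stat = chi_square(table)
# df = (2-1)*(2-1) = 1; the 5% critical value is 3.841.
print(stat > 3.841)  # True -> the split is significant, so generate it
```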

Credit risk for a bank. Dependent variable: credit rating (good or bad). Independent variables: income, age, number of credit cards.

Methodology. Variables available: credit rating, age, income, credit cards, education, car loans. Each candidate predictor is cross-tabulated against the dependent variable, and the most significant variable is identified based on the chi-square value.
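This selection step can be sketched with SciPy's `chi2_contingency` (assuming SciPy is available; all of the contingency tables below are invented, and in practice they are built by cross-tabulating the training data):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of each predictor's levels vs. Credit_rating
# (columns: Good, Bad).
tables = {
    "Income":       [[30, 10], [15, 25], [5, 20]],  # low / medium / high
    "Age_band":     [[20, 18], [30, 37]],
    "Credit_cards": [[25, 20], [25, 35]],
}

# Test each predictor's independence from the class and record its p-value.
p_values = {}
for name, table in tables.items():
    stat, p, dof, expected = chi2_contingency(table)
    p_values[name] = p

# The predictor with the smallest p-value drives the first split.
best = min(p_values, key=p_values.get)
print(best)  # Income
```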

Credit rating depends significantly on the customer's income level. This is the most significant variable: credit ratings for the three levels of income are significantly different.

Next to income, the number of credit cards is significant. The number of credit cards further divides the medium and high income groups; within the low income group there is no further split. Credit ratings for the medium and high income groups depend on the number of credit cards customers hold.

The next significant variable is age, for customers in the medium income group holding 5 or more credit cards: their credit ratings depend on age. Age is not significant for the low or high income groups, whatever the number of credit cards they hold.

Interpretation. The bank should check a customer's income and the number of credit cards he or she holds. If the customer belongs to the medium income group and holds 5 or more credit cards, age should also be checked. The possibility of a bad credit rating is much higher for younger customers (age < 28). These findings can be used to decide whether a customer applying for a loan should be granted one.
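The resulting if-then rules could be encoded as a simple screening function. The outcome labels and thresholds here are illustrative only, not the bank's actual policy:

```python
def credit_decision(income, credit_cards, age):
    """Hypothetical screening rule mirroring the tree described above:
    income first, then card count, and age only on the risky branch."""
    if income == "Low":
        return "review"                 # low income group: no further split
    if income == "Medium" and credit_cards >= 5:
        # age matters only for medium income with 5+ cards
        return "reject" if age < 28 else "review"
    return "approve"

print(credit_decision("Medium", 6, 25))  # young, many cards -> reject
print(credit_decision("High", 2, 40))    # -> approve
```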

Classification and regression trees (CART). The classic CART algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). In the most general terms, the purpose of analysis via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.

CART algorithm. 1. A univariate split method for categorical and ordered predictor (independent) variables; at each node, splits are binary. 2. All possible splits for each predictor variable at each node are examined to find the split producing the maximum improvement in goodness of fit (the best split). 3. Goodness of fit is measured by an appropriate measure. 4. Splitting stops when (i) the desired size of the tree is achieved, or (ii) there are not enough cases in a node to split.
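These steps can be sketched with scikit-learn's `DecisionTreeClassifier`, which implements CART-style binary univariate splits (assuming scikit-learn is available; the toy data and feature names below are invented):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: each row is [age, number of credit cards].
X = [[23, 2], [45, 3], [38, 6], [27, 5], [52, 1], [25, 4]]
y = ["Bad", "Good", "Good", "Bad", "Good", "Bad"]

tree = DecisionTreeClassifier(
    criterion="gini",       # goodness-of-fit measure for candidate splits
    max_depth=3,            # stop when the desired tree size is reached
    min_samples_split=2,    # stop when a node has too few cases to split
    random_state=0,
)
tree.fit(X, y)

# Every internal node is a binary test of the form "feature <= threshold".
print(export_text(tree, feature_names=["age", "credit_cards"]))
```

On this toy sample a single binary split on age separates the two classes, which is exactly the kind of univariate binary split the algorithm searches for at each node.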

Branded pharmacy stores. Dependent variable: preference for pharmacy store (branded or local shops). Independent variables: income level, annual medical expenses, age, gender, mediclaim policy.

Tree classification by CART. Every split is binary, so only two income groups are considered for the first split: low income versus more than low income. There is no further split for the low income group, while the more-than-low group is split by age at step 2. At step 3, annual medical expenditure splits the tree; at step 4, income level and age split it again. Thus, whenever more than two distinct categories are present, binary splits do not give a tree classification that can be interpreted easily, so CART is preferred when binary splits are wanted.

How to choose the appropriate method? There is no mathematical rule. CART is preferred when (i) the independent variables have two categories, or (ii) the independent variables are ordinal or scaled. CHAID is preferred when (i) the independent variables are categorical, or (ii) the independent variables have multiple categories and multi-way splits are desirable.

Fast food. Customers of fast food restaurants can be categorized by marital status. Married customers can be classified by number of children, and singles by occupation. Occupation can be further classified by income, and customers with children by their age. In the tree, the number of customers and their average number of visits to the restaurants are indicated.

AID tree for fast food restaurants (outline of the diagram):
- Total sample, split on marital status
  - Married
    - Children: no children / … or more children
      - Age: 45 years or under / over 45 years
  - Single
    - Occupation: blue-collar / white-collar
      - Annual income: less than $15,000 / $15,000 or more

Interpretation. While preparing the menu and pricing, the restaurant should take into account that the largest number of customers, who visit 7.97 times per week on average, are single; a large share of them hold blue-collar jobs and earn less than $15,000. The menu should therefore include cheaper items and their choices. Among married customers, those with children visit more frequently, so the menu should also include children's choices.

Used for:
- Segmentation: identifying the members of a group
- Prediction: creating rules and using them to predict future events (e.g. will the customer default if …)
- Data reduction and variable screening: selecting the few useful predictors from a large set
- Interaction identification: identifying the interactions between variables

Preference for pharmacy stores. Which variables predict the preference for pharmacy stores for medicines? Preference is the dependent variable; the independent variables are gender, annual income, annual medical expenses, mediclaim and age.

Interpretation. Preference for buying medicines from pharmacy stores depends mainly on the customer's annual income, annual medical expenses and age. However, the low income group has no budgeted annual medical expenses, and prefers local shops to branded stores. Branded pharmacy stores are preferred by the medium and high income groups who have some annual medical expenses. Stores should design their customer services with these findings in mind.