
Cluster Analysis & Hybrid Models
Business Application & Conceptual Issues
William B. Hakes, Ph.D.
March 3, 2005

Today's Outline
- Introduction to Clustering
- Applied Problem I (Dissertation)
  - Conceptual/Practical Issues
  - Research Ideas
  - Good Clusters/Bad Clusters & Interpretation
- Applied Problem II: Binary Clustering
- Applied Problem III: Interpretation (Clustering from a Survey)
- Trees (Rule Induction): Intro
- Dissertation RI
- For Further Research

Introduction
1. A financial analyst at an investment firm is interested in identifying groups of mutual funds that are alike in a "true" context, not simply based on the way Morningstar rates them.
2. A marketing manager is interested in identifying similar cities (across multiple dimensions) that can be used for a test marketing campaign in which a new product might be introduced.
3. The Director of Marketing at a telecom firm wants to understand the types of people he already knows are candidates for the firm's new internet data service.
4. A golf club General Manager wants to understand the "natural" segments of his members so that he can better utilize his club's assets and understand how he might ideally want the club to look in the future.

Cluster Overview
1. Cluster analysis is easy when:
   a. You have a relatively small sample
   b. You have nice, neat data
   c. Your variables are continuous
2. Cluster analysis in the real world:
   a. Samples are sometimes small, but in business they are usually large
   b. We'd like our data to be free from error and outliers, but that is rarely the case
   c. Variables are often a mix of continuous and categorical data

Clustering: Some Competing Macro Views
A) Cluster the entire customer base (General Purpose Clusters)
   - Build predictive models across products
   - See how your "targeted" customers fall into the clusters, and whether the clusters provide separability
-or-
B) Build predictive models on the base
   - Determine the "targets" for a specific campaign
   - Cluster those "targets" based only on actionable information (Specific Purpose Clusters)
-or-
C) Cluster analysis as a primary end-analysis

The correct option depends on how you'll use it!

Applied Problem I: Dissertation Research
Credit data: real-world data from financial services (auto loans)
- Predictive model differentiating "goods" vs. "bads"
- Given that we think you're "good," what else is there? Cross-selling opportunities
  - You're a good risk, but certainly there is more to offer you
  - Consider GE Capital
Purchasing data: real-world motor-home data from an overseas company
- Predictive model differentiating buyers vs. non-buyers
- Given that we think you're a "buyer," what else is there? Compelling qualitative messages
  - You're likely a buyer, but certainly all buyers are not the same
  - Consider XYZ Telecom

Guiding Research Question
After a predictive model is built, how can variables best be pre-processed for cluster analysis so that rule induction on the resulting clusters provides maximum perspicuity while minimizing the "art" involved? Perhaps a hybrid model can minimize the "art" involved while maximizing perspicuity and applicability.

Cluster Analysis (Quantitative Problem Domain)
Why cluster analysis? It is commonly applied:
- Targeted Army recruitment - Faulds and Gohmann (2001)
- Identifying "natural" segments of European tourists - Yuksel and Yuksel (2002)
- Uncovering "natural" groups of common business goals across 15 countries - Hofstede et al. (2002)
- Prostate cancer treatment on various types of cells - Li & Sarkar (2002)
Cluster analysis:
- Identifies subgroups within a larger group
- Makes each object (customer, product, etc.) within each group as similar as possible while making the subgroups as different as possible from one another

Cluster Analysis cont'd (Quantitative Problem Domain)
How does cluster analysis work?
- Variable selection
  - Variables generally must be on similar/identical scales (standardized)
  - Metric data (ordinal/interval/ratio) vs. non-metric data
  - Correlation among variables and outliers distort results
  - Principal components and factor analysis can be used as inputs
- Construct a similarity/proximity matrix to view the relationship between all observations across all variables:
  - Euclidean distance (most commonly used)
  - Other distance measures include squared Euclidean, city block, and Mahalanobis
  - Correlation, but consider (1, 2, 1, 2) and (9, 10, 9, 10) vs. (1, 2, 1, 2) and (1, 1, 2, 2); see the sketch below
  - Association (Jaccard coefficient for binary variables)
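A minimal sketch, not from the original deck, of the correlation caveat above using SciPy's distance functions; the arrays and names are illustrative only:

```python
# Illustrates Euclidean vs. correlation-based distance for the profiles above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([
    [1, 2, 1, 2],    # profile A
    [9, 10, 9, 10],  # profile B: same shape as A, but far away in magnitude
    [1, 1, 2, 2],    # profile C: close to A in magnitude, different shape
])

# Euclidean distance matrix (the most common choice)
print(squareform(pdist(X, metric="euclidean")))

# Correlation-based "distance" (1 - r): treats A and B as identical (r = 1)
# even though they are far apart in magnitude, which is the caveat on the slide.
print(squareform(pdist(X, metric="correlation")))
```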

Cluster Analysis cont'd (Quantitative Problem Domain)
How does cluster analysis work?
- Choose a clustering algorithm
  - Hierarchical: start from n clusters of size 1 and merge until one cluster remains
    - Choose an algorithm to determine how distance is computed between clusters
    - Ward's method, single linkage, centroid, etc.
  - Non-hierarchical (K-Means): assign objects to clusters based on a pre-specified number of clusters
    - Choose seeds for the k clusters (often pre-determined)
    - Clusters are formed, new centroids are computed, new clusters are formed
- A dual approach is recommended (Hartigan, 1975; Milligan, 1980; Punj & Stewart, 1983), as sketched below:
  - Use hierarchical clustering to compute estimated cluster centroids
  - Use those centroids as the cluster seeds for a K-Means analysis
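A minimal sketch of the two-stage (hierarchical seeds, then K-Means) approach, assuming SciPy and scikit-learn; the data and the value of k are illustrative, not from the dissertation:

```python
# Two-stage clustering: Ward's hierarchical solution provides K-Means seeds.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))          # pretend these are standardized variables
k = 3                                   # pre-specified number of clusters

# Stage 1: Ward's hierarchical clustering gives an initial partition
Z = linkage(X, method="ward")
initial_labels = fcluster(Z, t=k, criterion="maxclust")

# Estimated centroids from the hierarchical solution become the K-Means seeds
seeds = np.vstack([X[initial_labels == c].mean(axis=0) for c in range(1, k + 1)])

# Stage 2: K-Means refinement starting from those seeds
km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit(X)
print(np.bincount(km.labels_))          # refined cluster sizes
```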

Cluster Analysis cont'd (Quantitative Problem Domain)
Number of clusters
- Hierarchical: n clusters are formed (use the dendrogram)
- K-Means:
  - Form a pre-specified number k based on theory (Milligan, 1980; Hair et al., 1998)
  - Form a pre-specified number k based on the application (as done here: 3 for Credit, 4 for Motor)
  - Consult the "pseudo-F" in either case to assess the solution (Lattin et al., 2003; Punj & Stewart, 1983); see the sketch below
Interpretation of clusters
- Which variables are important? How important?
  - Univariate F-tests on cluster centroids
  - Perspicuity via "art" in the initial steps
  - Perspicuity via a different technique (a hybrid)
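The pseudo-F statistic corresponds to the Calinski-Harabasz index; a hedged sketch for screening candidate values of k with scikit-learn (the generated data are purely illustrative):

```python
# Screen k with the pseudo-F (Calinski-Harabasz) index; higher is better.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(200, 4)) for c in (0, 3, 6)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))
```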

Hybrid Test
1. Determine the "target" group (via logistic regression) and extract that group of targeted customers.
2. Variable pre-processing as inputs, four ways: (1) original variables, (2) standardized versions of the original variables, (3) the X1B1 through XnBn terms from the logit model, (4) principal component scores.
3. Hierarchical cluster analysis (seeds developed): extract the pre-specified number of clusters as seeds for the next stage.
4. K-Means cluster analysis (refined solutions): generate a pre-specified number of clusters using the seeds from the prior cluster analysis.
5. Rule induction (CART): input each cluster solution into the RI program and create RI solutions (original-variable, standardized-variable, logit-variable, and PCA-variable RI solutions).
6. Expert panel review: transform the rules into text descriptions and submit them to an expert panel; each of the four RI solutions is tested for usefulness/perspicuity using ANOVA/Tukey's HSD.
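A hedged, end-to-end sketch in the spirit of the pipeline above, not the author's actual implementation; the data, the 0.7 probability cutoff, and the choice of the standardized-variable branch are all assumptions for illustration:

```python
# Target selection -> pre-processing -> two-stage clustering -> rule induction.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 5))                       # illustrative predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

# 1. Predictive model and "target" group (here: high predicted probability)
logit = LogisticRegression().fit(X, y)
targets = X[logit.predict_proba(X)[:, 1] > 0.7]

# 2. One of the four pre-processing options: standardized variables
Z = StandardScaler().fit_transform(targets)

# 3-4. Two-stage clustering: Ward seeds, then K-Means refinement
k = 3
h_labels = fcluster(linkage(Z, method="ward"), t=k, criterion="maxclust")
seeds = np.vstack([Z[h_labels == c].mean(axis=0) for c in range(1, k + 1)])
clusters = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit_predict(Z)

# 5. Rule induction on the cluster solution (a CART-style tree)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Z, clusters)
print(export_text(tree, feature_names=[f"var{i}" for i in range(Z.shape[1])]))
```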

Clustering: A Problematic Example
Take the following example: you are a firm trying to generate clusters about the Atlanta area, with the objective of understanding the zip codes to which you want to "mass" market your products.
a. Many different races exist. How do you cluster them? Typically the coding is: 1) White 2) African American 3) Asian 4) Hispanic 5) Native American 6) Non-white other
b. What will clustering do with this variable as it groups people?

A Problematic Example cont'd
1. Can you cluster this simple example?
2. How will you interpret it (e.g., what's a common way to look at the "answer" to see if you agree with the differentiation)?

A Problematic Example cont'd
1. Cluster means: what do they tell us?
2. Assume we have three clusters, and along the "race" dimension they are as follows:
   Cluster 1: mean = 2
   Cluster 2: mean = 4
   Cluster 3: mean = 1
How do you:
- Use this data to assign people to clusters?
- Interpret the means?
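A small sketch, with assumed example data, of why the mean of a nominal code is meaningless and how binary indicator (dummy) variables restore interpretability:

```python
# Nominal codes have no meaningful mean; dummies turn centroids into proportions.
import pandas as pd

members = pd.DataFrame({"race_code": [1, 4, 4, 2, 6, 1]})

# The mean of an arbitrary nominal coding (e.g., 3.0) describes no one.
print(members["race_code"].mean())

# One indicator per category: a cluster centroid on these columns is simply
# the share of the cluster falling in each category (e.g., 0.33 = 33% coded 1).
dummies = pd.get_dummies(members["race_code"], prefix="race")
print(dummies.mean())
```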

Binary Variables: One Possible Solution?

Binary Variables: A Closer Look
1. How will these cases cluster?
2. What can we do about it?
How similar are persons 101, 102 & 103 to one another? Are they more alike or more different?

Applied Problem II
1. A golf club General Manager wants to understand the "natural" segments of his members so that he can better utilize his club's assets and understand how he might ideally want the club to look in the future.
2. How can cluster analysis help?
3. We took a look at the following:
   - Demographic information
   - Usage information
   - Cost information
Some data was measured and some was survey data.
Note that in clustering you may use N dummy variables (rather than N-1, as in dependence techniques like regression).

Application of Binary Clustering
- The data above are taken from one question of a 30-question customer survey. Five clusters were formed.
- Note that a dummy separated n ways will sum to 100% only if there are no missing responses.

Jaccard Process Sample
The Jaccard coefficient has many different uses, but it works well for clustering (see SPSS).

Jaccard coefficient:  S_j = a / (a + b + c)

where a is the count of agreements (+ +) and b, c represent the counts of present/absent combinations (i.e., + - and - +, respectively). The table below shows the convention of lettering for counts when calculating the similarity between two objects. Values of d are not considered because they represent joint absences. A short sketch follows the table.

                   Object 2
                   +          -
Object 1    +      a (1,1)    b (1,0)
            -      c (0,1)    d (0,0)
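A minimal sketch of the coefficient on two illustrative binary profiles (the person IDs echo the earlier slide but the values are assumed):

```python
# Jaccard similarity for two binary profiles: joint absences (d) are ignored.
import numpy as np

def jaccard(x, y):
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)      # both present (1,1)
    b = np.sum(x & ~y)     # present/absent (1,0)
    c = np.sum(~x & y)     # absent/present (0,1)
    return a / (a + b + c) if (a + b + c) else 0.0

person_101 = [1, 0, 0, 1, 0, 0]
person_102 = [1, 0, 0, 0, 1, 0]
print(jaccard(person_101, person_102))   # 1/3: shared zeros do not inflate similarity
```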

Summary of Binary Clustering
1. Assists when we want to understand the "natural" segments.
2. How can binary cluster analysis help (using Jaccard or otherwise)?
   - Allows us to use categorical data.
   - Gives us unique summary insight into the true percentages of each cluster along various dimensions.
   - Not tricked by the zero problem: if zeros are "true" zeros, then the clustering can be VERY interpretable.
   - No program as of yet integrates the Jaccard algorithm with traditional algorithms.
   - Cluster different sets of variables and then cluster the clusters using Jaccard (dummy the cluster membership).
   - Invent your own technique (e.g., K-Modes); a small sketch follows.
3. All clustering should be "checked" with domain experts for validation.
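A hedged, from-scratch sketch of the K-Modes idea mentioned above (centroids are per-variable modes, distance is the count of categorical mismatches); this is illustrative only and not the author's implementation (a maintained "kmodes" package also exists on PyPI):

```python
# Minimal K-Modes: mode-based centroids and simple matching (Hamming) distance.
import numpy as np

def k_modes(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    modes = X[rng.choice(len(X), size=k, replace=False)]   # random initial modes
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each row to the mode with the fewest categorical mismatches
        dist = np.array([[np.sum(row != m) for m in modes] for row in X])
        labels = dist.argmin(axis=1)
        # Update each mode to the per-column most frequent value of its cluster
        for c in range(k):
            members = X[labels == c]
            if len(members):
                modes[c] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

X = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0], [1, 0, 0]])
labels, modes = k_modes(X, k=2)
print(labels, modes, sep="\n")
```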

Applied Problem III: F&B Survey Analysis
For illustration purposes only

Can Clustering Help?
For illustration purposes only

F&B Survey Analysis: Clusters vs. Member Info
Can we look at member information we know to be true in order to measure the accuracy of member responses, and therefore of the clusters?

(Table: Survey Average Age vs. Actual Member Average Age for Clusters 1-5; values not preserved in the transcript.)

F&B Survey Analysis: Clustering Overview
What are the primary clusters that exist at the Club?
- Big Spenders (8% of member base): age = 53; 1 child; golf/tennis/fitness = 4x per month
- Opportunity Knocks w/Kids (58% of member base): age = __; children under 14; golf = 2x/mo; swim pool = 8x/mo
- Seniors (15% of member base): age = 69; no children; golf/tennis = 3x/mo
- Opportunity Knocks no Kids (17% of member base): age = 56; if kids, most are 18+; heaviest fitness users; golf 3x/mo; pool 6x/mo
- Heavy All-Around Users (4% of member base): age = 52; 1 child, age 11+; golf/swim/fitness = 15x per month

F&B Survey Analysis: Results by Question
Q4: What factors are important to you in selecting a restaurant?
Q5: How satisfied are you with the same factors at the Club?
Q4 and Q5 plotted together: what is the relationship between the factors that members find important when selecting a restaurant and their level of satisfaction?
(Chart axes: Importance vs. Satisfaction; factors plotted: quality of food, quality of wine, menu variety, service, atmosphere, price, speed.)
High importance + high satisfaction = increased loyalty.

F&B Survey Analysis: Importance vs. Satisfaction ("Big Spenders")
Note that the actual scale begins at "1", but there were no responses measured below "3".

F&B Survey Analysis: Importance vs. Satisfaction ("Opp Knocks w/Kids")
Note that the actual scale begins at "1", but there were no responses measured below "3".

F&B Survey Analysis: Importance vs. Satisfaction ("Seniors")
Note that the actual scale begins at "1", but there were no responses measured below "3".

F&B Survey Analysis: Importance vs. Satisfaction ("Opp Knocks no Kids")
Note that the actual scale begins at "1", but there were no responses measured below "3".

F&B Survey Analysis: Importance vs. Satisfaction ("Heavy Users")
Note that the actual scale begins at "1", but there were no responses measured below "3".

Points to Ponder: Clustering
Pros:
1) Good for exploratory analysis
2) Helps discover previously unsuspected relationships
3) One of very few techniques that focuses on the groups it creates, not the variate that creates them
Cons:
1) Difficult to interpret / often not actionable
2) Deemed too "soft" by some statisticians and businesses
3) Out-of-sample customer assignment is very tough
Solution: (almost) always use a hybrid model, at least as a check.

Hybrid Test (revisited)
Determine the "target" group via logistic regression and extract the targeted customers; pre-process the variables four ways (original, standardized, logit terms X1B1 through XnBn, and principal component scores); run hierarchical cluster analysis to develop seeds; refine with K-Means; induce rules (CART) on each cluster solution; transform the rules into text descriptions and submit them to an expert panel, testing each of the four RI solutions for usefulness/perspicuity with ANOVA/Tukey's HSD.

Overview of Decision Trees / Rule Induction
Rule induction (RI): rules are induced based upon a set of inputs and a criterion (dependent) variable.
Although many different techniques exist, such as CART, ID3, CHAID, and others, they all tend to follow the same procedure, even though they each have different splitting rules (Whalen & Gim, 1999):
1) Identify a dependent variable of interest along with a set of independent (predictor) variables.
2) Compare all cutpoints of the predictor variables to find the one that best predicts the dependent variable (using some statistical rule, though these rules differ among methods).
3) Identify the next best rule (a predictor variable along a certain cutpoint) in each of the sub-samples already defined by step 2.
4) Continue to split until all remaining sub-samples are homogeneous with respect to the dependent variable.
5) Apply the set of if-then rules from the analysis to a validation set to determine performance.
A short sketch of this procedure with an off-the-shelf tree learner follows.
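A hedged sketch of the five steps above using scikit-learn's CART-style DecisionTreeClassifier; the data and the feature names are illustrative only:

```python
# Grow a tree, print its if-then rules, and score them on a validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))                  # predictor variables
y = (X[:, 0] > 0.5).astype(int)                 # dependent variable

# Hold out a validation set (step 5)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 1-4: grow the tree by repeatedly choosing the best variable/cutpoint
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The induced if-then rules, and their performance on the validation set
print(export_text(tree, feature_names=["x1", "x2", "x3"]))
print("validation accuracy:", tree.score(X_val, y_val))
```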

Overview of CART
Classification and Regression Trees
- Origins in research conducted at Berkeley and Stanford:
  - Leo Breiman, University of California, Berkeley
  - Jerry Friedman, Stanford University
  - Charles J. Stone, University of California, Berkeley
  - Richard Olshen, Stanford University
- Solved a number of problems plaguing other decision tree methods (CHAID, ID3)
- Very well known in the biomedical and engineering arenas
- Only recently becoming known in IT, DM, and AI circles

William B. Hakes, Ph.D.-V PATIENTS = 215 SURVIVE % DEAD3717.2% Is BP<=91? Terminal Node A SURVIVE630.0% DEAD1470.0% NODE = DEAD Terminal Node B SURVIVE % DEAD2 1.9% NODE = SURVIVE PATIENTS = 195 SURVIVE % DEAD2311.8% Is AGE<=62.5? Terminal Node C SURVIVE1450.0% DEAD1450.0% NODE = DEAD PATIENTS = 91 SURVIVE7076.9% DEAD2123.1% Is SINUS<=.5? Terminal Node D SURVIVE5688.9% DEAD711.1% NODE = SURVIVE <= 91 > 91 <= 62.5 > 62.5 >.5<=.5 Trees (Binary) Are Fundamentally Simple

Why CART Works Well
- A binary splitting procedure can always reproduce a multi-way split.
- A binary splitting procedure will only partially partition on a database field if another sequence is better.
- FYI: CART (and trees in general) handles missing data very well. Tests show that when data are missing at random, even 25% missing rates have minimal effect on CART accuracy.
- Costs of misclassification allow certain errors to be treated as more serious than others (see the sketch below).
- Fundamentally detects non-linear relationships.
- Rules can be automatically detected, or modified by the user.
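A hedged sketch of asymmetric misclassification costs using scikit-learn's class weights as a stand-in for CART's explicit cost matrix (an assumption; this is not the CART software itself, and the data are illustrative):

```python
# Penalize missing a "bad" (class 1) five times as heavily as a false alarm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

tree = DecisionTreeClassifier(max_depth=4, class_weight={0: 1, 1: 5}, random_state=0)
tree.fit(X, y)
print((tree.predict(X) == 1).mean())   # flags more cases than an unweighted tree would
```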

Binary Split Detects Multi-Way Splits
- If the multi-way split is best, the binary split method will find it.
- If it is not best, the binary method will move to another variable.

Multi-Way Splits
- A split made all at once could be too hasty: even if the age group is different, other variables might be even more valuable after the Age > 65 / Age < 65 split.
- The database is fragmented rapidly: even with 500,000 records, 5 consecutive 4-way splits leave about 2,000 records per partition.
- Binary splits are more patient, giving a better chance to find important structure.

Credit Cluster Size Comparison
(Chart: cluster sizes for the Credit data, hierarchical vs. K-Means solutions.)

Motor Cluster Size Comparison
(Chart: cluster sizes for the Motor data, hierarchical vs. K-Means solutions.)

Appendix 6a: Credit RI Tree, PCA Vars
Variable definitions:
- T2924X = 1 or fewer trades rated 30 days past due in 24 months
- AGEAVG = average age of open trades
- TOTBAL = total balance of all trades
- RVTRDS = number of revolving trades
(Tree diagram; the terminal nodes correspond to the translated rules in Appendix 6b.)

Appendix 6b: Credit RI Translation
Credit RI tree (PCA vars), translated rules:

Cluster 1 (51% of customers)
- (1a) 81% of the customers in this cluster have:
  - Over the last 24 months, 1 or fewer trades rated 30 days past due.
  - Some information available regarding the average age of their open trades.
  - 5 or fewer "revolving" accounts.
- (1b) The other 19% of the customers in this cluster have:
  - Over the last 24 months, 2 or more trades rated 30 days past due.
  - A total balance of all trades less than $___.
Cluster 2 (9% of customers)
- Over the last 24 months, these customers have 1 or fewer trades rated 30 days past due.
- These customers have either no record of the age of their current accounts, or they have only "inquiries" into their credit history.
Cluster 3 (40% of customers)
- (3a) 15% of the customers in this cluster have:
  - Over the last 24 months, 1 or fewer trades rated 30 days past due.
  - Some information available regarding the average age of their open accounts.
  - 6 or more "revolving" trades.
- (3b) 85% of the customers in this cluster have:
  - Over the last 24 months, 2 or more trades rated 30 days past due.
  - A total balance of all trades greater than or equal to $___.

Issues for Further Research
1. Predictive models differentiate one group from another, but what about types of groups within a target group? How many?
2. Cluster analysis:
   a. Which variables are important in clustering?
   b. What about out-of-sample assignment?
3. Clustering followed by second-order rule induction (a.k.a. decision trees):
   a. Develop the clusters (2-stage)
   b. Use them as inputs into the algorithm (which algorithm is "best"?)
   c. Take the simple rules and use them to assess cases across a database
4. Cluster analysis vs. unsupervised neural networks

Some Parting Thoughts
Q: How much time should you spend properly defining the quantitative issue and designing the test?
A: A lot more than you think (up to 3x more than the actual "analysis").

Q: Are there opportunities for analytics in the marketplace?
A: Yes. There are tremendous opportunities for people who can do more than pivot tables and regression in Excel.

Q: How do I "get my foot in the door" of analytics?
A: Continue formal education. Continue "informal" education as well. Make "networking" part of your daily/weekly to-do list. Join a firm that has years of experience in applied problem-solving.