Clustering Algorithms
  Minimize distance, but to centers of groups

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved.

Clustering
  First need to identify clusters
    – Can be done automatically
    – Often clusters are determined by the problem
  Then it is a simple matter to measure distance from a new observation to each cluster
    – Use same measures as with memory-based reasoning

Partitioning
  Define new categorical variables
    – Divide data into fixed number (k) of regions
    – K-means clustering
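
The partitioning idea above can be sketched in a few lines of Python. This is a minimal illustration of the usual k-means loop (assign each point to its nearest centroid, then recompute each centroid as the mean of its members), not the procedure of any particular package. The function names, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    """Partition points into k regions.

    Assumption for this sketch: the first k points seed the centroids,
    which keeps the example deterministic. Real tools usually pick
    starting centroids differently.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed region, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean_point(members)
    return centroids, assignment


# Hypothetical two-variable data, already scaled to the 0-1 range.
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1),
          (0.8, 0.9), (0.15, 0.15), (0.85, 0.95)]
centroids, assignment = k_means(points, k=2)
print(assignment)
print(centroids)
```

The returned `assignment` list is the new categorical variable: one region number per observation.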

Clustering Uses
  Segment customers
    – Find profitability of each, treat accordingly
  Star classification
    – Red giants, white dwarfs, normal
    – Brightness & temperature used to classify
  U.S. Army
    – Identify sizes needed for female soldiers
    – (males: one size fits all)

Tires
  Segment customers into product categories
    – High end (they would buy Michelins)
    – Intermediate & Low
  Standardize data (as in memory-based reasoning)

Raw Tire Data
  BRAND          INCOME      AGE OF CAR
  Michelin       $182,200    5 months
  Michelin       $171,200    3 years
  Goodyear       $28,800     7 years
  Goodyear       $37,800     6 years
  Goodyear       $42,200     5 years
  Goodyear       $55,600     4 years
  Goodyear       $51,200     9 years
  Goodyear       $173,400    7 years
  Opie’s tires   $13,400     3 years
  Opie’s tires   $68,800     6 years

Standardize
  INCOME
    – MIN(1, INCOME/200000)
  AGE OF CAR
    – IF (AGE OF CAR < 12 months) THEN 1,
    – ELSE MIN((8 - Years)/7, 1)
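
The two standardization rules can be written as small Python functions. This is a sketch of the formulas as stated on the slide. Expressing car age in months and the function names are assumptions made for the example. Note that, taken literally, the age formula goes negative for cars older than 8 years (the 9-year-old car in the raw data); the slide does not say whether a floor of 0 was intended, so none is applied here.

```python
def standardize_income(income):
    """MIN(1, INCOME / 200000): incomes of $200,000 or more score 1."""
    return min(1.0, income / 200_000)


def standardize_car_age(age_months):
    """1 if the car is under 12 months old, else MIN((8 - years) / 7, 1).

    Age is passed in months (an assumption of this sketch) so that
    '5 months' and '3 years' share one unit.
    """
    if age_months < 12:
        return 1.0
    years = age_months / 12
    return min((8 - years) / 7, 1.0)


# Raw Tire Data rows: (brand, income in dollars, car age in months).
raw = [
    ("Michelin", 182_200, 5),
    ("Michelin", 171_200, 36),
    ("Goodyear", 28_800, 84),
    ("Goodyear", 37_800, 72),
    ("Goodyear", 42_200, 60),
    ("Goodyear", 55_600, 48),
    ("Goodyear", 51_200, 108),
    ("Goodyear", 173_400, 84),
    ("Opie's tires", 13_400, 36),
    ("Opie's tires", 68_800, 72),
]

standardized = [
    (brand, standardize_income(income), standardize_car_age(age))
    for brand, income, age in raw
]
for brand, income_score, age_score in standardized:
    print(f"{brand:13s} {income_score:6.3f} {age_score:6.3f}")
```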

Sort Data by Outcome
  BRAND          INCOME        AGE OF CAR
  Michelin       High income   Bought this year
  Michelin       High income   Bought 1-3 yrs ago
  Goodyear       Low income    Bought 4+ yrs ago
  Goodyear       Low income    Bought 4+ yrs ago
  Goodyear       Low income    Bought 4+ yrs ago
  Goodyear       Avg income    Bought 1-3 yrs ago
  Goodyear       Avg income    Bought 4+ yrs ago
  Goodyear       High income   Bought 4+ yrs ago
  Opie’s tires   Low income    Bought 1-3 yrs ago
  Opie’s tires   Avg income    Bought 4+ yrs ago

Standardized Training Data
  BRAND          INCOME   AGE OF CAR
  Michelin
  Michelin
  Goodyear
  Goodyear
  Goodyear
  Goodyear
  Goodyear
  Goodyear
  Opie’s tires
  Opie’s tires

Identify Cluster Means (could use median, mode)
  BRAND          INCOME   CAR AGE
  Michelin
  Goodyear
  Opie’s tires
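
Computing one center per brand is a group-by followed by a per-column average. The sketch below shows that step with Python's `statistics` module. The standardized scores in `rows` are an assumption: they were recomputed for this example by applying the Standardize formulas to the Raw Tire Data (rounded to three decimals, with the 9-year-old car left negative as the formula gives it), so they may not match the table originally shown on the slide. The function name `cluster_centers` is also hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def cluster_centers(rows, center=mean):
    """Return {label: (center of each feature column)} for labeled rows.

    Pass center=median to use medians instead of means.
    """
    groups = defaultdict(list)
    for label, *features in rows:
        groups[label].append(features)
    return {
        label: tuple(center(column) for column in zip(*members))
        for label, members in groups.items()
    }


# (brand, income score, car-age score), recomputed for this example.
rows = [
    ("Michelin", 0.911, 1.000),
    ("Michelin", 0.856, 0.714),
    ("Goodyear", 0.144, 0.143),
    ("Goodyear", 0.189, 0.286),
    ("Goodyear", 0.211, 0.429),
    ("Goodyear", 0.278, 0.571),
    ("Goodyear", 0.256, -0.143),
    ("Goodyear", 0.867, 0.143),
    ("Opie's tires", 0.067, 0.714),
    ("Opie's tires", 0.344, 0.286),
]

means = cluster_centers(rows)
medians = cluster_centers(rows, center=median)
for brand in means:
    print(brand, means[brand], medians[brand])
```

Swapping `mean` for `median` is the only change needed to try the alternative named in the slide title.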

New Case #1
  From new data (could be test set or new observations to classify)
  Squared distance to each centroid
    Michelin       0.840
    Goodyear       0.025
    Opie’s tires   0.047
  So minimum distance is to Goodyear

New Case #2
  Squared distance to each centroid
    Michelin       0.634
    Goodyear       0.255
    Opie’s tires   0.057
  So minimum distance is to Opie’s
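
Classifying a new case comes down to two steps: compute the squared distance to every centroid, then pick the smallest. The sketch below first applies the second step to the distances reported on the two New Case slides, then shows both steps on made-up numbers. The centroid and case coordinates in the second part are hypothetical, since the slides' own values are not available, and the function names are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(case, centroids):
    """Return (closest label, {label: squared distance}) for one case."""
    distances = {
        label: squared_distance(case, center)
        for label, center in centroids.items()
    }
    return min(distances, key=distances.get), distances


# Step 2 only, using the squared distances given on the slides.
case_1 = {"Michelin": 0.840, "Goodyear": 0.025, "Opie's tires": 0.047}
case_2 = {"Michelin": 0.634, "Goodyear": 0.255, "Opie's tires": 0.057}
closest_1 = min(case_1, key=case_1.get)
closest_2 = min(case_2, key=case_2.get)
print(closest_1, closest_2)

# Both steps, on hypothetical centroids and a hypothetical new case.
example_centroids = {"A": (0.9, 0.9), "B": (0.3, 0.2)}
label, distances = nearest_centroid((0.4, 0.3), example_centroids)
print(label, distances)
```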

Software Methods
  Hierarchical clustering
    – Number of clusters unspecified a priori
    – Two-step is a form of hierarchical clustering
  K-means clustering
  Self-organizing maps
    – Neural network
  Hybrids combine methods

Application: Credit Cards
  Credit scoring critical
  Use past applicants; develop model to predict payback
    – Look for indicators providing early warning of trouble

British Credit Card Company
  Monthly account status
    – Over 90 thousand customers, one year of operations
  Outcome variable STATE: cumulative months of missed payments (integer)
    – Some errors & missing data (eliminated observations)
    – Biased sample of 10 thousand observations
    – Required initial STATE of 0

British Credit Card Company
  Compared clustering approaches with pattern detection method
  Used medians rather than centroids
    – More stable
    – Partitioned data
  Clustering useful for general profile behavior
  Pattern search method sought local clusters
    – Unable to partition entire data set
    – Identified a few groups with unusual behavior
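
A small numeric example shows why a median can be more stable than a mean as a cluster center: one extreme account moves the mean a long way but leaves the median where it was. The STATE values below (months of missed payments) are hypothetical and chosen only to illustrate the effect.

```python
from statistics import mean, median

# Hypothetical STATE values for the accounts in one cluster.
typical = [0, 0, 1, 1, 2]
with_outlier = typical + [12]  # one account a full year behind

mean_before, mean_after = mean(typical), mean(with_outlier)
median_before, median_after = median(typical), median(with_outlier)

print(f"mean:   {mean_before:.3f} -> {mean_after:.3f}")
print(f"median: {median_before} -> {median_after}")
```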

Insurance Claim Application
  Large data warehouse of financial transactions & claims
  Customer retention very important
    – Recent heavy growth in policies
    – Decreased profitability
  Used clustering to analyze claim patterns
    – Wanted hidden trends & patterns

Insurance Claim Mining
  Undirected knowledge discovery
    – Cluster analysis to identify risk categories
  Data for
    – Quarterly data
    – Claims for prior 12 months
    – Contribution to profit of each policy
    – Over 100,000 samples
    – Heavy growth in young people with expensive automobiles
    – Transformed data to normalize, remove outliers

Insurance Claim Mining
  Number of clusters
    – Too few: no discrimination
    – Best here was 50
    – Used k-means algorithm to minimize squared error
  Identified a few clusters with high claims frequency and low profitability
  Compared 1998 data with 1996 data to find trends
  Developed model to predict new policyholder performance
    – Used for pricing
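
The squared error that k-means minimizes is the sum, over all points, of the squared distance from each point to the mean of its own cluster. The sketch below computes that quantity for a given assignment and compares one cluster against two on four made-up points, illustrating the "too few: no discrimination" bullet. The data and the function name are hypothetical.

```python
def within_cluster_sse(points, assignment):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for point, cluster_id in zip(points, assignment):
        clusters.setdefault(cluster_id, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(column) / len(members) for column in zip(*members)]
        total += sum(
            (x - m) ** 2
            for point in members
            for x, m in zip(point, centroid)
        )
    return total


# Hypothetical standardized observations forming two obvious groups.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]

sse_one = within_cluster_sse(points, [0, 0, 0, 0])  # everything in one cluster
sse_two = within_cluster_sse(points, [0, 0, 1, 1])  # the two natural groups
print(f"k=1: {sse_one:.3f}   k=2: {sse_two:.3f}")
```

Squared error can only fall or stay level as clusters are added, reaching zero when every point is its own cluster, so it cannot pick the number of clusters by itself. A figure such as the 50 reported above has to come from weighing error against how usable the clusters are.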

Computational Constraints
  Each cluster should have adequate sample size
  Because cluster averages are used, cluster analysis is less sensitive to disproportionate cluster sizes than matching is
  The more variables you have, the greater the computational complexity
    – The curse of dimensionality
    – (it won’t run in a reasonable time if you have too many variables)