Published by Aubrey Townsend. Modified over 9 years ago.
MKT 700 Business Intelligence and Decision Models: Algorithms and Customer Profiling (1)
Classification and Prediction
Classification: Unsupervised Learning
Prediction: Supervised Learning
SPSS Direct Marketing

                        Classification              Predictive
Unsupervised Learning   RFM; Cluster analysis;      NA
                        Postal Code Responses
Supervised Learning     Customer Profiling          Propensity to buy
SPSS Analysis

                        Classification              Predictive
Unsupervised Learning   Hierarchical Cluster;       NA
                        Two-Step Cluster;
                        K-Means Cluster
Supervised Learning     Classification Trees        Linear Regression;
                        (CHAID, CART)               Logistic Regression;
                                                    Artificial Neural Nets
Major Algorithms

                        Classification              Predictive
Unsupervised Learning   Euclidean Distance;         NA
                        Log Likelihood
Supervised Learning     Chi-square Statistics;      Log Likelihood;
                        Log Likelihood;             F-Statistics (ANOVA)
                        GINI Impurity Index;
                        F-Statistics (ANOVA)

Nominal variables: Chi-square, Log Likelihood. Continuous variables: F-Statistics, Log Likelihood.
Euclidean Distance
Euclidean Distance for Continuous Variables

Pythagorean distance: d = √(a² + b²)
Euclidean (three-dimensional) space: d = √(a² + b² + c²)
Euclidean distance: d = [Σ(dᵢ)²]^(1/2), where dᵢ is the difference on the i-th variable (used in cluster analysis with continuous variables)
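The formulas above can be sketched in a few lines of Python (a minimal illustration; the sample points are made up):

```python
import math

# Euclidean distance between two observations described by continuous
# variables, as used in cluster analysis: the square root of the sum
# of squared differences on each variable.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The 2-D case reduces to the Pythagorean formula sqrt(a^2 + b^2)
print(euclidean((0, 0), (3, 4)))           # 5.0
# The same formula works for any number of variables (Euclidean space)
print(euclidean((1, 2, 3), (4, 6, 3)))     # 5.0
```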
Pearson’s Chi-Square
Contingency Table

        North   South   East    West    Total
Yes      68      75      57      79      279
No       32      45      33      31      141
Total   100     120      90     110      420
Observed and Theoretical Frequencies (observed, with expected in parentheses)

        North     South     East      West      Total
Yes     68 (66)   75 (80)   57 (60)   79 (73)   279 (66%)
No      32 (34)   45 (40)   33 (30)   31 (37)   141 (34%)
Total   100       120       90        110       420
Chi-Square Computation

Cell    f_o    f_e    f_o - f_e    (f_o - f_e)² / f_e
1,1     68     66        2             0.0606
1,2     75     80       -5             0.3125
1,3     57     60       -3             0.1500
1,4     79     73        6             0.4932
2,1     32     34       -2             0.1176
2,2     45     40        5             0.6250
2,3     33     30        3             0.3000
2,4     31     37       -6             0.9730

X² = 3.032
Statistical Inference

DF = (4 columns - 1) × (2 rows - 1) = 3
The observed X² = 3.032 is below the critical values 6.251 (α = .10) and 7.815 (α = .05), so the hypothesis of independence is not rejected.
Log Likelihood Chi-Square
Log Likelihood

Based on probability distributions rather than contingency (frequency) tables. Applicable to both categorical and continuous variables, unlike chi-square, for which continuous variables must first be discretized.
Contingency Table (Observed Frequencies)

        Cluster 1   Cluster 2   Total
Male       10          30         40
Contingency Table (Expected Frequencies)

        Cluster 1   Cluster 2   Total
Male       20          20         40
Chi-Square Computation

Cell    f_o    f_e    f_o - f_e    (f_o - f_e)² / f_e
1,1     10     20       -10            5.00
1,2     30     20        10            5.00

X² = 10.00; DF = 1; critical value = 3.84; p < 0.05
Log Likelihood Distance & Probability

                Cluster 1      Cluster 2
O                  10             30
E                  20             20
O/E                0.50           1.50
ln(O/E)           -0.693          0.405
O × ln(O/E)       -6.93          12.164

G² = 2 Σ O × ln(O/E) = 2 × (-6.93 + 12.164) = 10.46
p < 0.05; critical value = 3.84
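Both statistics for the same two-cell table can be checked in a short Python sketch:

```python
import math

# Same table as the slides: 10 males in cluster 1, 30 in cluster 2,
# with 20 expected in each cell under the null hypothesis.
observed = [10, 30]
expected = [20, 20]

# Pearson chi-square: sum of (O - E)^2 / E
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Log-likelihood ratio (G-squared): 2 * sum of O * ln(O / E)
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

print(round(pearson, 2))   # 10.0
print(round(g2, 2))        # 10.46 -- both exceed the critical value 3.84 (DF = 1)
```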
Variance, ANOVA, and F Statistics
F-Statistics

For metric (continuous) variables. Compares the variance explained by the model with the unexplained variance (error).
Variance

Value   Mean    Squared difference
20      43.6    556.96
34      43.6     92.16
34      43.6     92.16
38      43.6     31.36
38      43.6     31.36
40      43.6     12.96
41      43.6      6.76
41      43.6      6.76
41      43.6      6.76
42      43.6      2.56
43      43.6      0.36
47      43.6     11.56
47      43.6     11.56
48      43.6     19.36
49      43.6     29.16
49      43.6     29.16
55      43.6    129.96
55      43.6    129.96
55      43.6    129.96
55      43.6    129.96

Count = 20; Mean = 43.6
SS (Sum of Squares) = 1460.8
DF = N - 1 = 19
VAR = SS / DF = 76.88
SD = √VAR = 8.768
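The same summary statistics can be computed directly in Python; the standard library's statistics module uses the same N - 1 denominator:

```python
import statistics

values = [20, 34, 34, 38, 38, 40, 41, 41, 41, 42,
          43, 47, 47, 48, 49, 49, 55, 55, 55, 55]

mean = sum(values) / len(values)             # 43.6
ss = sum((x - mean) ** 2 for x in values)    # sum of squared deviations, 1460.8
df = len(values) - 1                         # 19
var = ss / df                                # ~76.88
sd = var ** 0.5                              # ~8.768

# statistics.variance uses the same N - 1 (sample) denominator
assert abs(var - statistics.variance(values)) < 1e-9
print(round(ss, 1), round(var, 2), round(sd, 3))   # 1460.8 76.88 8.768
```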
ANOVA

Two groups: t-test.
Three or more groups: are the errors (discrepancies between the observations and the overall mean) explained by group membership, or by some other (random) effect?
Oneway ANOVA

Group 1: 6, 5, 4, 5, 4, 6, 5, 4   (mean = 4.875)
Group 2: 8, 9, 7, 8, 9, 7, 8, 9   (mean = 8.125)
Group 3: 3, 2, 1, 3, 2, 1, 3, 2   (mean = 2.125)
Grand mean = 5.042

Within each group, the squared deviations from the group mean sum to 4.875, so SS Within = 3 × 4.875 = 14.625.
The squared deviations from the grand mean sum to Total SS = 158.958.
MSS(Between) / MSS(Within)

            Within Groups    Between Groups     Total
SS             14.625      +    144.333      =  158.958
DF           24 - 3 = 21       3 - 1 = 2        24 - 1 = 23
Mean SS         0.696           72.167            6.911

F = Between Groups Mean SS / Within Groups Mean SS = 72.167 / 0.696 = 103.624; p < .05
ONEWAY (Excel or SPSS)

Anova: Single Factor

SUMMARY
Groups     Count    Sum    Average    Variance
Group 1      8       39     4.875      0.696
Group 2      8       65     8.125      0.696
Group 3      8       17     2.125      0.696

ANOVA
Source of Variation       SS       df      MS         F        P-value     F crit
Between Groups          144.333     2    72.167    103.624    1.318E-11    3.467
Within Groups            14.625    21     0.696
Total                   158.958    23
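The Excel output above can be reproduced by hand in Python, using the three groups from the earlier slide:

```python
# One-way ANOVA computed from first principles for the slides' data.
groups = [
    [6, 5, 4, 5, 4, 6, 5, 4],   # Group 1, mean 4.875
    [8, 9, 7, 8, 9, 7, 8, 9],   # Group 2, mean 8.125
    [3, 2, 1, 3, 2, 1, 3, 2],   # Group 3, mean 2.125
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)   # 5.042

# Between-groups SS: group size times the squared distance of each
# group mean from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-groups SS: squared distance of each value from its own group mean.
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1                     # 2
df_within = len(all_values) - len(groups)        # 21

f = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 3), round(ss_within, 3), round(f, 3))
# 144.333 14.625 103.624
```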
Profiling
Customer Profiling: Documenting or Describing

Who is likely to buy, or not to respond?
Who is likely to buy which product or service?
Who is in danger of lapsing?
CHAID or CART

CHAID (Chi-Square Automatic Interaction Detector)
- Based on chi-square
- All variables discretized
- Dependent variable: nominal

CART (Classification and Regression Tree)
- Variables can be discrete or continuous
- Based on GINI or F-test
- Dependent variable: nominal or continuous
Use of Decision Trees

- Classify observations from a binary or nominal target variable: segmentation
- Predict responses from a numerical target variable: behaviour
- Derive decision-support rules: processing
Decision Tree
Example: dmdata.sav
Underlying theory: X²
CHAID Algorithm: Selecting Variables

Example variables: Region (4 categories), Gender (3, including missing), Age (6, including missing).
For each variable, collapse categories so as to maximize the chi-square test of independence, e.g. Region (N, S, E, W, *) becomes (WSE, N*).
Select the most significant variable, then go to the next branch, and the next level.
Stop growing when the estimated X² falls below the theoretical (critical) X².
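The category-collapsing idea can be sketched as below. This is a simplified illustration, not SPSS's exact CHAID procedure (which also applies significance thresholds and Bonferroni adjustments): it finds the pair of Region categories whose response patterns are least significantly different, which makes them the best candidates to merge. The counts reuse the earlier Yes/No by Region table.

```python
# Illustrative sketch of CHAID-style category merging (assumptions noted
# in the text above; not SPSS's exact algorithm).
def chi_square(table):
    """Pearson X^2 for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    grand = sum(row_tot)
    return sum((table[i][j] - row_tot[i] * col_tot[j] / grand) ** 2
               / (row_tot[i] * col_tot[j] / grand)
               for i in range(len(table)) for j in range(len(table[0])))

# Yes/No responses by Region (N, S, E, W), from the earlier slides
regions = ["N", "S", "E", "W"]
table = [[68, 75, 57, 79],
         [32, 45, 33, 31]]

# X^2 of every 2-column subtable; the pair with the smallest statistic
# behaves most alike and is the best candidate for merging.
pairs = [(j, k) for j in range(len(regions)) for k in range(j + 1, len(regions))]
j, k = min(pairs, key=lambda p: chi_square([[row[p[0]], row[p[1]]] for row in table]))
print(regions[j], regions[k])   # S E -- South and East respond most alike
```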
CART (Nominal Target)

Nominal targets: GINI (impurity reduction) or entropy.
Gini Index = 1 - Σ pᵢ², where pᵢ is the squared probability of membership in class i; Gini = 0 when targets are perfectly classified.
Example: P(Bus) = 0.4, P(Car) = 0.3, P(Train) = 0.3
Gini = 1 - (0.4² + 0.3² + 0.3²) = 0.66
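The slide's worked example translates directly into code:

```python
# Gini impurity of a node given its class probabilities.
def gini(probs):
    """1 minus the sum of squared class probabilities.
    0 means a pure (perfectly classified) node; higher means more mixing."""
    return 1 - sum(p ** 2 for p in probs)

print(round(gini([0.4, 0.3, 0.3]), 2))   # 0.66, the slide's Bus/Car/Train node
print(gini([1.0, 0.0, 0.0]))             # 0.0 -- a pure node
```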
CART (Metric Target)

Continuous targets: splits are chosen by variance reduction (F-test).
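A hedged illustration of what variance reduction means for a split: compare the target's variance before the split with the weighted variance of the two child nodes. The data and the candidate split below are made up for the example.

```python
# Variance reduction for one candidate split (illustrative data).
def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

parent = [20, 34, 34, 38, 55, 55, 55, 55]   # hypothetical target values
left, right = parent[:4], parent[4:]        # candidate split point

# Weighted average of the child-node variances
weighted_child = (len(left) * variance(left)
                  + len(right) * variance(right)) / len(parent)

# CART prefers the split with the largest reduction
reduction = variance(parent) - weighted_child
print(round(variance(parent), 2), round(reduction, 2))   # 161.44 138.06
```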
Comparative Advantages (from Wikipedia)

- Simple to understand and interpret
- Require little data preparation
- Handle both numerical and categorical data
- White-box model whose decisions can be explained by Boolean logic
- Models can be validated using statistical tests
- Robust