Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall

Data Mining Methods: Classification
- Most frequently used DM method
- Employ supervised learning
- Learn from past data, classify new data
- The output variable is categorical (nominal or ordinal) in nature
- Classification versus regression?
- Classification versus clustering?

Assessment Methods for Classification
- Predictive accuracy
  - Hit rate
- Speed
  - Model building; predicting
- Robustness
- Scalability
- Interpretability
  - The level of understanding provided by the model

Accuracy of Classification Models
- In classification problems, the primary source for accuracy estimation is the confusion matrix
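
To make the confusion matrix concrete, here is a minimal Python sketch for a two-class problem. The function names `confusion_matrix` and `accuracy` and the label lists are illustrative assumptions, not part of the slides; the labels are made up.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}

def accuracy(cm):
    """Share of all records that were classified correctly."""
    total = sum(cm.values())
    return (cm["TP"] + cm["TN"]) / total if total else 0.0

# Made-up labels, for illustration only (1 = positive class, 0 = negative class)
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm)            # TP=3, FP=1, TN=3, FN=1
print(accuracy(cm))  # (3 + 3) / 8 = 0.75
```

The off-diagonal cells (FP and FN) show which kind of error the model makes, which a single accuracy number hides.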

Estimation Methodologies for Classification
- Simple split (or holdout or test sample estimation)
  - Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
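
A simple split can be sketched in a few lines of Python. The helper name `simple_split`, the fixed seed, and the toy records are assumptions made for illustration; shuffling before the cut is a common precaution against ordered data, not something the slide prescribes.

```python
import random

def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy data: ten record ids standing in for real rows
records = list(range(10))
train, test = simple_split(records)
print(len(train), len(test))  # 7 training records, 3 testing records
```

Because each record lands in exactly one of the two sets, the testing accuracy is measured on data the model never saw during training.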

Estimation Methodologies for Classification
- k-Fold Cross Validation (rotation estimation)
  - Split the data into k mutually exclusive subsets
  - Use each subset as testing while using the rest of the subsets as training
  - Repeat the experimentation k times
  - Aggregate the test results for a true estimate of prediction accuracy
- Other estimation methodologies
  - Area under the ROC curve
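
The k-fold steps can be sketched as follows. This is a minimal illustration, not a reference implementation: `k_fold_indices`, `cross_validate`, and the majority-class "model" are hypothetical names chosen here, the labels are made up, and the records are assumed to be shuffled already.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    folds = [list(range(n))[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate(records, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = train_fn([records[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct += sum(predict_fn(model, records[j]) == labels[j] for j in test_idx)
    return correct / len(records)  # aggregated accuracy over all k test folds

# Stand-in classifier (assumption): always predict the most common training label
def train_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

records = list(range(10))                  # placeholder feature rows
labels = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]    # made-up class labels
cv_accuracy = cross_validate(records, labels, 5, train_majority, predict_majority)
print(cv_accuracy)
```

Any real classifier can be swapped in by passing different `train_fn` and `predict_fn` callables.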

Estimation Methodologies for Classification – ROC Curve
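
An ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied, and the area under it summarizes the curve in one number. The sketch below shows one way to compute both; `roc_points` and `auc` are hypothetical helper names, the scores and labels are made up, and the code assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos  # assumes pos > 0 and neg > 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every record tied at this score
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up classifier scores (higher means "more likely positive")
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print(auc(points))
```

An area of 1.0 corresponds to a perfect ranking of positives above negatives, and 0.5 to a ranking no better than chance.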

Example - BMW dealership
The dealership is starting a promotional campaign, whereby it is trying to push a two-year extended warranty to its past customers. The dealership has done this before and has gathered 4,500 data points from past sales of extended warranties. The attributes in the data set are:
- Income bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+]
- Year/month first BMW bought
- Year/month most recent BMW bought
- Whether they responded to the extended warranty offer in the past

Weka Input file format
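
Weka reads data in its ARFF format: a header with one `@relation` line and one `@attribute` line per column, followed by an `@data` section of comma-separated rows. The Python sketch below writes such a file for a data set shaped like the dealership example. The relation name, attribute names, the choice to encode year/month as a numeric YYYYMM value, and the two data rows are all assumptions made for illustration; the actual file used with the slides is not shown here.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: header declarations, then comma-separated data rows."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: list the allowed values in braces
            kind = "{" + ",".join(str(v) for v in kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

# Hypothetical schema for the four attributes listed in the example
attributes = [
    ("income_bracket", list(range(8))),  # codes 0-7
    ("first_purchase", "numeric"),       # assumed YYYYMM encoding
    ("last_purchase", "numeric"),        # assumed YYYYMM encoding
    ("responded", [1, 0]),               # class attribute: 1 = responded
]
rows = [  # made-up rows, not real dealership data
    (5, 200403, 200903, 1),
    (2, 199811, 200507, 0),
]
arff_text = to_arff("bmw_warranty", attributes, rows)
print(arff_text)
```

By default Weka's Explorer treats the last attribute as the class to predict, which is why `responded` is placed last in this sketch.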