TOPICS IN BUSINESS INTELLIGENCE
K-NN & Naive Bayes – GROUP 1
Isabel van der Lijke, Nathan Bok, Gökhan Korkmaz
INTRODUCTION K-NN
- k-NN Classifier (Categorical Outcome)
- Determining Neighbors
- Classification Rule
- Example: Riding Mowers
- Choosing k
- Setting the Cutoff Value
- Advantages and shortcomings of k-NN algorithms
INTRODUCTION NAIVE BAYES
- Basic Classification Procedure
- Cutoff Probability Method
- Conditional Probability
- Naive Bayes
- Advantages and shortcomings of the naive Bayes classifier
SIMPLE CASE APPLICATION
Depression
SIMPLE CASE APPLICATION
Fruits Example: P(Banana) = 500 / 1000 = 0.5, so P(Not banana) = 1 − 0.5 = 0.5
For a new fruit, compute all of the probabilities.
[Table: counts of Banana, Orange and Other fruit, split into Sweet / Not sweet, with totals]
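The fruit example can be sketched in code. Only the 500-out-of-1000 bananas figure comes from the slide; the remaining counts below are hypothetical, chosen just to make the Bayes-rule computation concrete.

```python
# Naive Bayes on the fruit example. The 500/1000 bananas matches the slide;
# all other counts are hypothetical illustration values.
counts = {            # fruit -> (sweet, not_sweet) counts
    "banana": (400, 100),   # 500 bananas in total
    "orange": (250, 50),
    "other":  (100, 100),
}
total = sum(s + n for s, n in counts.values())          # 1000 fruits

def posterior(feature_is_sweet):
    """P(fruit | sweetness) via Bayes' rule with a single 'sweet' feature."""
    scores = {}
    for fruit, (sweet, not_sweet) in counts.items():
        n = sweet + not_sweet
        prior = n / total                     # e.g. P(banana) = 500/1000 = 0.5
        likelihood = (sweet if feature_is_sweet else not_sweet) / n
        scores[fruit] = prior * likelihood
    evidence = sum(scores.values())
    return {f: s / evidence for f, s in scores.items()}  # normalise to sum 1

print(posterior(True))
```

With these counts a sweet fruit is most likely a banana, because the banana class combines the highest prior with a high likelihood of being sweet.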
REAL-LIFE APPLICATION NAIVE BAYES
Medical Data Classification with Naive Bayes Approach
- Introduction
- Requirements for systems dealing with medical data
- An empirical comparison
- Tables
- Conclusion
TABLE 2: COMPARATIVE ANALYSIS BASED ON PREDICTIVE ACCURACY
TABLE 3: COMPARATIVE ANALYSIS BASED ON AREA UNDER ROC CURVE (AUC)
REAL-LIFE APPLICATION K-NN
- Used to help health care professionals diagnose heart disease.
- Useful for pattern recognition and classification.
- Distance between records is measured with the Euclidean distance.
- Data are often normalized first, because the variables come in different formats and scales.
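The two ingredients above can be sketched as a minimal k-NN classifier: a Euclidean distance function and a majority vote over the k nearest training records. The training points and labels below are invented toy data, not from the heart-disease study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    train: list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda p: euclidean(p[0], query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# hypothetical toy data: two well-separated classes
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))   # nearest 3: A, A, B -> "A"
```

This is the same rule XLMiner applies at scale; only the distance function and the value of k are choices the analyst makes.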
CASE STUDY
"Our customer is a Dutch charity organization that wants to be able to classify its supporters as donators and non-donators. The non-donators are sent a single marketing mail a year, whereas the donators receive multiple ones (up to 4)."
- Who are the donators? Who are the non-donators?
- Application of k-NN & naive Bayes to a training and a test dataset.
- 4000 customers.
- Tools: SPSS, Excel, XLMiner
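A simple way to obtain the training and test partition mentioned above is a seeded random split. The 60/40 fraction and the seed are assumptions for illustration; the slide only states that 4000 customers were partitioned.

```python
import random

def partition(records, train_frac=0.6, seed=42):
    """Randomly split records into a training and a holdout set.
    train_frac and seed are illustrative assumptions, not from the case study."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train_set, holdout_set = partition(list(range(4000)))
print(len(train_set), len(holdout_set))   # 2400 1600
```

Fixing the seed makes the split reproducible, which matters when comparing several models on the same validation data.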
CLEAN-UP
- No missing values
- One-dimensional outliers removed through sorting (on annual & average donation)
- Two-dimensional outliers removed through a scatterplot
Variables kept:
- Average donation
- Frequency of response
- Median time of response
- Time as client

Variables removed:
- Annual donation
- Last donation
- Time since last response
- Normalization of scores into z-scores.
- Nominal categorization of data.
- Classification through percentiles of the z-scores and by manually processing values within the variables.
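The z-score normalization step can be sketched as follows; the donation amounts are hypothetical placeholders.

```python
import statistics

def z_scores(values):
    """Standardise a column to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)   # sample standard deviation
    return [(v - mu) / sigma for v in values]

donations = [10.0, 25.0, 40.0, 125.0]   # hypothetical average donations
print(z_scores(donations))
```

After this transformation every variable contributes on the same scale, so no single variable (such as a large annual donation) dominates the Euclidean distance in k-NN.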
ANALYSIS OF CASE STUDY – K-NN
XLMiner: partition data.
Models created:
- M1 = Zavgdon & Zfrqres
- M2 = ZtimeCl, Zfrqres & Zavgdon
- M3 = Zmedtor, Zfrqres & Zavgdon
- M4 = ZtimeCl, Zfrqres, Zmedtor & Zavgdon
Validation Data Scoring – Summary Report (for k = 13)
[Error report per class and overall: # cases, # errors, % error]
[Classification confusion matrix: predicted class vs. actual class]
CHOOSING MODEL FOR K-NN
- Accuracy: proportion of correctly classified instances.
- Error rate: 1 − accuracy.
- Sensitivity: the proportion of actual positives that the classifier correctly identifies as positive.
- Specificity: the proportion of actual negatives that the classifier correctly identifies as negative.
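All four criteria fall out of a 2×2 confusion matrix; the counts below are hypothetical, since the slide's own tables did not survive extraction.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, error rate, sensitivity and specificity
    from the cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    return {
        "accuracy": acc,
        "error_rate": 1 - acc,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }

print(metrics(tp=80, fp=20, fn=10, tn=90))   # hypothetical counts
```

For the charity case, sensitivity matters most when missing a donator costs more than mailing a non-donator, which is why the models are compared on more than accuracy alone.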
                                                             M1         M2
Selecting everyone in validation data                     €711.20      €
Selecting while correcting for sensitivity and specificity €583.60     €
APPLICATION OF MODEL ON TEST DATA
[Classification confusion matrix: predicted class vs. actual class]
[Error report per class and overall: # cases, # errors, % error]
ANALYSIS OF THE CASE STUDY – NAIVE BAYES
Models created:
- M1 = Cfrqres & Cavgdon
- M2 = Cfrqresp, Cavgdon & Cmedtor
[Classification confusion matrix: predicted class vs. actual class]
[Error report per class and overall: # cases, # errors, % error]
[Conditional probability tables for CFRQRES and CAVGDON: value and probability per class (0/1)]
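The conditional probability tables that XLMiner produces for variables such as CFRQRES can be derived by counting, as sketched below. The four donor records are hypothetical; only the variable name comes from the slide.

```python
from collections import Counter, defaultdict

def conditional_probs(records):
    """Per-class conditional probability tables, as used by naive Bayes.
    records: list of (features_dict, class_label) pairs.
    Returns {(feature, class): {value: P(value | class)}}."""
    class_counts = Counter(label for _, label in records)
    tables = defaultdict(Counter)
    for feats, label in records:
        for feat, value in feats.items():
            tables[(feat, label)][value] += 1
    return {
        key: {v: c / class_counts[key[1]] for v, c in counter.items()}
        for key, counter in tables.items()
    }

# hypothetical records: frequency-of-response category vs. donator class (1/0)
data = [({"CFRQRES": 1}, 1), ({"CFRQRES": 1}, 1),
        ({"CFRQRES": 2}, 0), ({"CFRQRES": 1}, 0)]
print(conditional_probs(data))
```

Each row of such a table answers "given that a customer is a donator (or not), how likely is this category value?", which is exactly what naive Bayes multiplies together at prediction time.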
                      Model 1     Model 2
Selecting everyone    €1072       €1006
Selecting by class    €2460.82    €
APPLICATION OF MODEL ON TEST DATA
[Conditional probability tables for CFRQRES and CAVGDON: value and probability per class (0/1)]
[Classification confusion matrix: predicted class vs. actual class]
[Error report per class and overall: # cases, # errors, % error]
LOOKING AT BOTH MODELS
QUESTIONS?