Data Set Balancing
David L. Olson
Department of Management, University of Nebraska
Skewed Data Sets
  Many interesting applications involve data with many cases in one category and few in another
    Insurance claims (binary: fraudulent or not)
    Cancer cases
    Loan defaults (binary or other)
    Poorly performing employees (binary or other)
  Skewed data sets cause modeling problems
    Can cause model degeneracy (the model calls all claims non-fraudulent)
Test Domain
  Models: decision tree, regression, neural network
  Data: categorical or continuous; binary or four-outcome
Data Sets
  All generated for pedagogical purposes
  Loan Application Data
    650 observations (400 training, 250 test)
    Binary (0 = not on time; 1 = on time)
    0.1125 late or default
  Insurance Fraud Data
    5000 observations (4000 training, 1000 test)
    Binary (OK, Fraudulent)
    0.0150 fraudulent
  Job Application Data
    500 observations (250 training, 250 test)
    Four outputs (unacceptable, minimal, adequate, excellent)
    0.028 excellent
Loan Application Data
  Variable: Obs 1, Obs 2, Obs 3
  Age: 20, 23, 28
  Income: 17,152, 25,862, 26,169
  Assets: 11,090, 24,756, 47,355
  Debt: 20,455, 30,083, 49,341
  Want: 400, 2,300, 3,100
  Risk: High
  Credit: Green, Yellow
  Result: OnTime, Late
Insurance Fraud Data
  Variable: Obs 1, Obs 2, Obs 3
  Age: 52, 38, 21
  Gender: Male, Female
  Claim: 2000, 1800, 5600
  Tickets: 1
  Prior claims: 2
  Attorney: Jones, None, Smith
  Outcome: OK, Fraud
Job Application Data
  Variable: Obs 1, Obs 2, Obs 3
  Age: 27, 33, 22
  State: CA, NV
  Degree: BS, MBA
  Major: Engr, BusAd, InfoSys
  Experience: 2 years, 5 years
  Outcome: Excellent, Adequate, Unacceptable
Experiments
  High degree of imbalance in each data set
  Tested both categorical and continuous data
  Categorical
    Decision tree (See5)
    Logistic regression (Clementine)
    Neural network (Clementine)
  Continuous
    Regression tree (See5)
    Discriminant analysis (Clementine)
Procedure
  Full model run
  Training set reduced
    Deleted cases from the most common outcome
  Correct classification rate
    Correct/total
  Also identified type of error (coincidence matrix)
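The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the experiments (which ran in See5 and Clementine): the function names `undersample`, `coincidence_matrix`, and `correct_rate` are hypothetical, cases are deleted at random (the slides do not say how deleted cases were chosen), and the labels in the usage lines are made up.

```python
import random
from collections import Counter


def undersample(rows, label_of, keep, seed=0):
    """Keep at most `keep` randomly chosen cases of the most common outcome.

    All cases with any other outcome are retained unchanged.
    Random deletion is an assumption made for this sketch.
    """
    counts = Counter(label_of(r) for r in rows)
    majority = counts.most_common(1)[0][0]
    majority_rows = [r for r in rows if label_of(r) == majority]
    other_rows = [r for r in rows if label_of(r) != majority]
    rng = random.Random(seed)
    kept = rng.sample(majority_rows, min(keep, len(majority_rows)))
    return kept + other_rows


def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) pair, so the type of error is visible."""
    return Counter(zip(actual, predicted))


def correct_rate(actual, predicted):
    """Correct classification rate: correct / total."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels, for illustration only.
actual = ["OK", "OK", "Fraud", "OK"]
predicted = ["OK", "Fraud", "Fraud", "OK"]
matrix = coincidence_matrix(actual, predicted)
print(matrix[("OK", "Fraud")])          # OK cases wrongly flagged as Fraud
print(correct_rate(actual, predicted))  # correct / total
```

Reading individual cells of the coincidence matrix separates the two kinds of error (a fraudulent claim called OK versus an OK claim called fraudulent), which the correct classification rate alone hides.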
Loan Application Data Set
Insurance Fraud Data Set
Job Application Data Set
Degeneracy
  Model classifies all samples in the dominant category
  The greater the data set skew, the greater the correct classification rate
  BUT MODEL DOESN’T HELP
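A small Python sketch makes the point concrete. It assumes a hypothetical set of 1000 cases with 15 fraudulent ones, chosen to mirror the 0.0150 fraud proportion quoted for the insurance data; the actual composition of the test set is not given in the slides, and the helper `always_dominant` is an illustrative name.

```python
from collections import Counter


def always_dominant(labels):
    """A degenerate 'model': predict the most common label for every case."""
    dominant = Counter(labels).most_common(1)[0][0]
    return [dominant] * len(labels)


# Hypothetical labels: 985 OK and 15 Fraud (proportion 0.0150, an assumption).
labels = ["OK"] * 985 + ["Fraud"] * 15
predicted = always_dominant(labels)

accuracy = sum(a == p for a, p in zip(labels, predicted)) / len(labels)
fraud_caught = sum(a == "Fraud" and p == "Fraud" for a, p in zip(labels, predicted))

print(accuracy)      # 0.985
print(fraud_caught)  # 0
```

The correct classification rate equals one minus the minority proportion, so it rises with skew even though not a single fraudulent claim is identified. By the same arithmetic, a degenerate model on data with 0.1125 late or default would score 0.8875.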
Comparison
  Large data sets (unbalanced)
    Positive: greater accuracy
    Negative: often degenerate (trees, discriminant analysis)
  Small data sets (balanced)
    Positive: less degeneracy
    Negative: can eliminate cases (logistic); poor fit (categorical NN)
  Categorical data
    Positive: slightly greater accuracy (mixed)
    Negative: less stable (small data set worse)
Advanced Solutions
  BAGGING
    Combine several classifiers by majority vote
  BOOSTING
    Sequentially learn several classifiers
    Each classifier focuses on data poorly classified by the previous classifier
    Combine by weighted vote
  STACKING
    Combine outputs of multiple classifiers obtained by different learning algorithms
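The two combination rules named above (majority vote for bagging, weighted vote for boosting) can be sketched as follows. This shows only the voting step, not how the individual classifiers are trained, and it does not cover stacking. The function names and the example weights are hypothetical; real boosting algorithms derive the weights from each classifier's error.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across classifiers."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# One case, three classifiers (made-up predictions and weights).
print(majority_vote(["Fraud", "OK", "Fraud"]))                # Fraud
print(weighted_vote(["Fraud", "OK", "OK"], [0.7, 0.1, 0.1]))  # Fraud
```

In the second call a single heavily weighted classifier outvotes two lightly weighted ones, which is the practical difference between a weighted and a simple majority vote.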
Conclusions
  If data highly unbalanced
    Algorithms tend to degenerate
  If data balanced
    Reduces training set size
    Can lead to degeneracy by eliminating rare cases
    Accuracy rates tend to decline
  Decision tree algorithms the most robust