Published by Imogen Leona Robbins. Modified over 6 years ago.
1
Data Set Balancing
David L. Olson
Department of Management, University of Nebraska
2
Skewed Data Sets
Many interesting applications involve data with many cases in one category, few in another
  Insurance claims: binary (fraudulent or not)
  Cancer cases
  Loan defaults: binary or other
  Poor performing employees: binary or other
Skewed data sets cause modeling problems
  Can cause model degeneracy (e.g., the model calls all claims non-fraudulent)
3
Test Domain
Models
  Decision tree
  Regression
  Neural network
Data
  Categorical or Continuous
  Binary or Four-outcome
4
Data Sets
All generated for pedagogical purposes
Loan Application Data
  650 observations (400 training, 250 test)
  Binary (0: not on time; 1: on time)
  Rare outcome: late or default
Insurance Fraud Data
  5000 observations (4000 training, 1000 test)
  Binary (OK, Fraudulent)
  Rare outcome: fraudulent
Job Application Data
  500 observations (250 training, 250 test)
  Four outputs (unacceptable, minimal, adequate, excellent)
  Rare outcome: excellent (proportion 0.028)
5
Loan Application Data
Variable   Obs 1    Obs 2    Obs 3
Age        20       23       28
Income     17,152   25,862   26,169
Assets     11,090   24,756   47,355
Debt       20,455   30,083   49,341
Want       400      2,300    3,100
Risk: High
Credit: Green, Yellow
Result: OnTime, Late
6
Insurance Fraud Data
Variable   Obs 1    Obs 2    Obs 3
Age        52       38       21
Claim      2000     1800     5600
Attorney   Jones    None     Smith
Gender: Male, Female
Tickets: 1
Prior claims: 2
Outcome: OK, Fraud
7
Job Application Data
Variable   Obs 1       Obs 2      Obs 3
Age        27          33         22
Major      Engr        BusAd      InfoSys
Outcome    Excellent   Adequate   Unacceptable
State: CA, NV
Degree: BS, MBA
Experience: 2 years, 5 years
8
Experiments
High degree of imbalance in each data set
Tested both categorical & continuous data
Categorical:
  Decision tree (See5)
  Logistic regression (Clementine)
  Neural network (Clementine)
Continuous:
  Regression tree (See5)
  Discriminant analysis (Clementine)
9
Procedure
Full model run
Training set reduced
  Deleted cases from most common outcome
Correct classification rate
  Correct/total
  Also identified type of error (coincidence matrix)
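The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the experiments (which ran in See5 and Clementine): the function names, the random deletion rule, and the toy labels are all hypothetical.

```python
import random
from collections import Counter

def reduce_majority(rows, label_of, n_keep, seed=0):
    """Delete cases from the most common outcome, keeping n_keep of them.

    Assumption: cases are deleted at random; the slides do not say how
    the deleted cases were chosen.
    """
    rng = random.Random(seed)
    counts = Counter(label_of(r) for r in rows)
    majority = counts.most_common(1)[0][0]
    major = [r for r in rows if label_of(r) == majority]
    minor = [r for r in rows if label_of(r) != majority]
    return minor + rng.sample(major, min(n_keep, len(major)))

def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) pair to show the type of error."""
    return Counter(zip(actual, predicted))

def correct_rate(actual, predicted):
    """Correct classification rate = correct / total."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical training set: 95 OK cases, 5 Fraud cases.
training = [("OK", i) for i in range(95)] + [("Fraud", i) for i in range(5)]
reduced = reduce_majority(training, lambda r: r[0], n_keep=20)

# Hypothetical test-set results from some fitted model.
actual = ["OK", "OK", "Fraud", "OK", "Fraud"]
predicted = ["OK", "Fraud", "Fraud", "OK", "OK"]
matrix = coincidence_matrix(actual, predicted)
rate = correct_rate(actual, predicted)
print(len(reduced), dict(matrix), rate)
```

The coincidence matrix separates the two kinds of error (a Fraud case called OK versus an OK case called Fraud), which the single correct/total figure hides.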
10
Loan Application Data Set
11
Insurance Fraud Data Set
12
Job Application Data Set
13
Degeneracy
Model classifies all samples in dominant category
The greater the data set skew
  The greater the correct classification rate
  BUT MODEL DOESN'T HELP
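A small arithmetic sketch makes the point concrete. The case counts below are hypothetical (they are not the presentation's data sets), and `evaluate_degenerate` is an illustrative helper name: a model that always predicts the dominant category scores exactly the dominant category's share, while finding none of the rare cases.

```python
from collections import Counter

def evaluate_degenerate(labels, rare):
    """Score a model that assigns every case to the dominant category."""
    majority = Counter(labels).most_common(1)[0][0]
    predicted = [majority] * len(labels)
    accuracy = sum(a == p for a, p in zip(labels, predicted)) / len(labels)
    rare_total = sum(a == rare for a in labels)
    rare_found = sum(a == rare and p == rare
                     for a, p in zip(labels, predicted))
    return accuracy, rare_found / rare_total

results = {}
for n_fraud in (400, 100, 10):  # hypothetical skews out of 1000 cases
    labels = ["OK"] * (1000 - n_fraud) + ["Fraud"] * n_fraud
    results[n_fraud] = evaluate_degenerate(labels, rare="Fraud")
    accuracy, rare_rate = results[n_fraud]
    print(f"{n_fraud} fraud cases: accuracy {accuracy}, fraud detected {rare_rate}")
```

As the skew grows, the correct classification rate rises toward 1 while the share of fraud cases detected stays at 0.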
14
Comparison
Factor                         Positive                            Negative
Large data sets (unbalanced)   Greater accuracy                    Often degenerate (trees, discrim)
Small data sets (balanced)     Less degeneracy                     Can eliminate cases (logistic); poor fit (categorical NN)
Categorical data               Slightly greater accuracy (mixed)   Less stable (small data set worse)
15
Advanced Solutions
BAGGING
  Combine several classifiers by majority vote
BOOSTING
  Sequentially learn several classifiers
  Each classifier used to focus on data poorly classified by the previous classifier
  Combine by weighted vote
STACKING
  Combine outputs of multiple classifiers obtained by different learning algorithms
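As a rough illustration of the first of these, the sketch below implements bagging with a majority vote. It assumes the usual form of bagging (each classifier trained on a bootstrap resample of the training data) and uses a deliberately simple one-feature threshold rule as the base classifier; the data, function names, and parameters are hypothetical, and boosting and stacking are not shown.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-feature threshold rule: predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_models=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        models.append(train_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Combine the classifiers by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, class) pairs: small values are class 0, large are class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data)
low = bagging_predict(models, 0)
high = bagging_predict(models, 10)
print(low, high)
```

Boosting would differ in that the classifiers are trained in sequence and combined by weighted vote; stacking would combine outputs from different learning algorithms rather than resamples of one.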
16
Conclusions
If data highly unbalanced
  Algorithms tend to degenerate
If data balanced
  Reduces training set size
  Can lead to degeneracy by eliminating rare cases
  Accuracy rates tend to decline
Decision tree algorithms the most robust