1 Data Set Balancing
David L. Olson, Department of Management, University of Nebraska

2 Skewed Data Sets
Many interesting applications involve data with many cases in one category and few in another:
Insurance claims – binary (fraudulent or not)
Cancer cases
Loan defaults – binary or other
Poor-performing employees – binary or other
Skewed data sets cause modeling problems
Can cause model degeneracy – e.g., a model that calls all claims non-fraudulent (see the sketch below)
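To make the degeneracy point concrete, a minimal Python sketch (the 2% fraud rate is an assumed figure, not from the slides) showing that a model which calls every claim non-fraudulent still posts a high correct classification rate:

    # A degenerate "model" on skewed data: always predict the majority class.
    # The 2% fraud rate below is a hypothetical figure, not from the slides.
    n_total = 5000
    n_fraud = 100                  # assume 2% of claims are fraudulent

    correct = n_total - n_fraud    # every non-fraud case is classified "right"
    accuracy = correct / n_total
    print(f"Accuracy of the do-nothing model: {accuracy:.1%}")   # 98.0%
    print(f"Frauds detected: 0 of {n_fraud}")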

3 Test Domain
Models: decision tree, regression, neural network
Data: categorical or continuous; binary or four-outcome

4 Data Sets
All generated for pedagogical purposes
Loan Application Data: 650 observations (400 training, 250 test); binary outcome (0 – not on time; 1 – on time); minority class: late or default
Insurance Fraud Data: 5,000 observations (4,000 training, 1,000 test); binary outcome (OK, fraudulent); minority class: fraudulent
Job Application Data: 500 observations (250 training, 250 test); four outcomes (unacceptable, minimal, adequate, excellent); 0.028 excellent
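The original datasets are not distributed with the slides. A scikit-learn sketch of building a comparably skewed binary set with the insurance data's dimensions (the ~2% positive rate is an assumption; the slides only give the 0.028 figure for the job data):

    # Sketch: a synthetic skewed binary set shaped like the insurance data
    # (5,000 observations; 4,000 training / 1,000 test). The ~2% positive
    # rate is an assumed value, not taken from the slides.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=6,
                               weights=[0.98, 0.02], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=4000, stratify=y, random_state=0)
    print(f"Training-set positive rate: {y_train.mean():.3f}")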

5 Loan Application Data
Variable  Obs 1   Obs 2   Obs 3
Age       20      23      28
Income    17,152  25,862  26,169
Assets    11,090  24,756  47,355
Debt      20,455  30,083  49,341
Want      400     2,300   3,100
Risk      High
Credit    Green   Yellow
Result    OnTime  Late

6 Insurance Fraud Data
Variable      Obs 1  Obs 2   Obs 3
Age           52     38      21
Gender        Male   Female
Claim         2000   1800    5600
Tickets       1
Prior claims  2
Attorney      Jones  None    Smith
Outcome       OK     Fraud

7 Job Application Data
Variable    Obs 1      Obs 2     Obs 3
Age         27         33        22
State       CA         NV
Degree      BS         MBA
Major       Engr       BusAd     InfoSys
Experience  2 years    5 years
Outcome     Excellent  Adequate  Unacceptable

8 Experiments
High degree of imbalance in each data set
Tested both categorical & continuous data
Categorical models: decision tree (See5), logistic regression (Clementine), neural network (Clementine)
Continuous models: regression tree (See5), discriminant analysis (Clementine)
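See5 and Clementine are commercial packages; as a rough open-source stand-in, the three categorical models can be fit with scikit-learn (a sketch reusing the synthetic split above, not the study's original tooling):

    # Open-source analogues of the categorical models in the study:
    # decision tree, logistic regression, and a small neural network.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    models = {
        "decision tree":       DecisionTreeClassifier(random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
        "neural network":      MLPClassifier(max_iter=1000, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)   # split from the Data Sets sketch above
        print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")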

9 Procedure
Full model run, then training set reduced
Deleted cases from the most common outcome
Correct classification rate = correct / total
Also identified type of error (coincidence matrix)
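A sketch of the reduction step and the evaluation, continuing from the synthetic data above (scikit-learn calls the coincidence matrix a confusion matrix):

    # Balance the training set by deleting cases from the most common
    # outcome, then score with accuracy and a coincidence (confusion) matrix.
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.tree import DecisionTreeClassifier

    def undersample(X, y, seed=0):
        """Randomly drop majority-class cases down to the minority count."""
        rng = np.random.default_rng(seed)
        classes, counts = np.unique(y, return_counts=True)
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=counts.min(), replace=False)
            for c in classes])
        return X[keep], y[keep]

    X_bal, y_bal = undersample(X_train, y_train)
    pred = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal).predict(X_test)
    print("Correct classification rate:", accuracy_score(y_test, pred))
    print("Coincidence matrix:\n", confusion_matrix(y_test, pred))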

10 Loan Application Data Set

11 Insurance Fraud Data Set

12 Job Application Data Set

13 Degeneracy
Model classifies all samples in the dominant category
The greater the data set skew, the greater the correct classification rate – BUT THE MODEL DOESN'T HELP
For example, if 2% of claims are fraudulent, calling every claim OK scores 98% correct while detecting no fraud

14 Comparison
Factor                        Positive                           Negative
Large data sets (unbalanced)  Greater accuracy                   Often degenerate (trees, discriminant)
Small data sets (balanced)    Less degeneracy                    Can eliminate cases (logistic); poor fit (categorical NN)
Categorical data              Slightly greater accuracy (mixed)  Less stable (smaller data sets worse)

15 Advanced Solutions
BAGGING: combine several classifiers by majority vote
BOOSTING: sequentially learn several classifiers, each focusing on the data poorly classified by the previous classifier; combine by weighted vote
STACKING: combine the outputs of multiple classifiers obtained by different learning algorithms
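scikit-learn sketches of all three ensemble schemes (illustrative stand-ins, not implementations used in the study; they reuse the balanced split from the Procedure sketch):

    # Bagging: majority vote over learners trained on bootstrap samples.
    # Boosting: sequential learners reweighted toward past errors (AdaBoost).
    # Stacking: a meta-learner combines outputs of different base algorithms.
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    ensembles = {
        "bagging":  BaggingClassifier(n_estimators=50, random_state=0),
        "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
        "stacking": StackingClassifier(
            estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                        ("logit", LogisticRegression(max_iter=1000))],
            final_estimator=LogisticRegression()),
    }
    for name, clf in ensembles.items():
        clf.fit(X_bal, y_bal)         # balanced training set from above
        print(f"{name}: test accuracy {clf.score(X_test, y_test):.3f}")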

16 Conclusions
If data highly unbalanced: algorithms tend to degenerate
If data balanced: training set size is reduced; can lead to degeneracy by eliminating rare cases; accuracy rates tend to decline
Decision tree algorithms were the most robust

