Data Set Balancing
David L. Olson, Department of Management, University of Nebraska

Skewed Data Sets
- Many interesting applications involve data with many cases in one category and few in another:
  - Insurance claims (binary: fraudulent or not)
  - Cancer cases
  - Loan defaults (binary or other)
  - Poor-performing employees (binary or other)
- Skewed data sets cause modeling problems
- They can cause model degeneracy, e.g. a model that calls all claims non-fraudulent (see the sketch below)
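The degeneracy risk is easy to demonstrate. A minimal sketch, assuming scikit-learn (the slides themselves used See5 and Clementine): a "classifier" that always predicts the majority class on data skewed like the insurance fraud set still posts a high accuracy.

```python
# Sketch: why skewed data rewards degenerate models.
# Assumes scikit-learn; not the tools used in the original study.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))               # arbitrary features
y = (rng.random(5000) < 0.015).astype(int)   # ~1.5% "fraud", as in the insurance data

# A degenerate "model" that labels every claim with the most frequent class (OK).
degenerate = DummyClassifier(strategy="most_frequent").fit(X, y)

print(accuracy_score(y, degenerate.predict(X)))  # ~0.985: high accuracy, zero fraud caught
```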

Test Domain
- Models: decision tree, regression, neural network
- Data: categorical or continuous; binary or four-outcome

Data Sets (all generated for pedagogical purposes)
- Loan Application Data: 650 observations (400 training, 250 test); binary outcome (0 = not on time, 1 = on time); 0.1125 late or default
- Insurance Fraud Data: 5,000 observations (4,000 training, 1,000 test); binary outcome (OK, fraudulent); 0.0150 fraudulent
- Job Application Data: 500 observations (250 training, 250 test); four outcomes (unacceptable, minimal, adequate, excellent); 0.028 excellent
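The original generators are not described; as a stand-in, a hedged sketch of producing a comparably skewed binary data set with scikit-learn's make_classification (the class proportion comes out approximate):

```python
# Sketch: synthetic data roughly matching the insurance fraud set
# (5,000 observations, ~1.5% positive). Not the original generator.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, n_features=6, n_informative=4,
    weights=[0.985],          # ~98.5% majority class, ~1.5% "fraud"
    random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=4000, stratify=y, random_state=1
)
print(y_train.mean(), y_test.mean())  # both near 0.015
```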

Loan Application Data

Variable   Obs 1    Obs 2    Obs 3
Age        20       23       28
Income     17,152   25,862   26,169
Assets     11,090   24,756   47,355
Debt       20,455   30,083   49,341
Want       400      2,300    3,100
Risk       High
Credit     Green    Yellow
Result     OnTime   Late

Insurance Fraud Data

Variable      Obs 1   Obs 2    Obs 3
Age           52      38       21
Gender        Male    Female
Claim         2,000   1,800    5,600
Tickets       1
Prior claims  2
Attorney      Jones   None     Smith
Outcome       OK      Fraud

Job Application Data

Variable    Obs 1      Obs 2     Obs 3
Age         27         33        22
State       CA         NV
Degree      BS         MBA
Major       Engr       BusAd     InfoSys
Experience  2 years    5 years
Outcome     Excellent  Adequate  Unacceptable

Experiments
- High degree of imbalance in each data set
- Tested both categorical and continuous data
- Categorical models: decision tree (See5), logistic regression (Clementine), neural network (Clementine)
- Continuous models: regression tree (See5), discriminant analysis (Clementine)

Procedure (sketched in code below)
1. Run the model on the full training set.
2. Reduce the training set by deleting cases from the most common outcome, and run again.
3. Compute the correct classification rate: correct / total.
4. Identify the type of error from the coincidence matrix.
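A sketch of this procedure, again assuming scikit-learn and reusing the X_train/X_test split from the earlier sketch: train a decision tree on the full training set and on an undersampled, balanced one, then compare accuracy and coincidence (confusion) matrices.

```python
# Sketch of the procedure: full model vs. a model trained on a balanced set
# built by deleting random cases from the most common outcome.
# Assumes scikit-learn and X_train/y_train/X_test/y_test from the sketch above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

def undersample(X, y, seed=0):
    """Delete random majority-class cases until the classes are equal in size."""
    rng = np.random.default_rng(seed)
    minority = y == 1
    keep = rng.choice(np.where(~minority)[0], size=minority.sum(), replace=False)
    idx = np.concatenate([np.where(minority)[0], keep])
    return X[idx], y[idx]

for name, (Xt, yt) in {"full": (X_train, y_train),
                       "balanced": undersample(X_train, y_train)}.items():
    model = DecisionTreeClassifier(random_state=0).fit(Xt, yt)
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred))  # correct classification rate
    print(confusion_matrix(y_test, pred))      # coincidence matrix: error types
```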

Loan Application Data Set

Insurance Fraud Data Set

Job Application Data Set

Degeneracy
- A degenerate model classifies all samples into the dominant category.
- The greater the data set skew, the greater the correct classification rate, BUT THE MODEL DOESN'T HELP: on the insurance fraud data (1.5% fraudulent), calling every claim OK scores 98.5% accuracy while catching no fraud at all.

Comparison

Factor                        Positive                           Negative
Large data sets (unbalanced)  Greater accuracy                   Often degenerate (trees, discriminant)
Small data sets (balanced)    Less degeneracy                    Can eliminate cases (logistic); poor fit (categorical NN)
Categorical data              Slightly greater accuracy (mixed)  Less stable (small data sets worse)

Advanced Solutions: BAGGING, BOOSTING, STACKING (sketched below)
- BAGGING: train several classifiers on bootstrap resamples of the training data and combine them by majority vote.
- BOOSTING: sequentially learn several classifiers, each focusing on the data poorly classified by the previous classifier; combine by weighted vote.
- STACKING: combine the outputs of multiple classifiers obtained by different learning algorithms.
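All three schemes have off-the-shelf implementations today. A hedged sketch with scikit-learn (not part of the original study, and reusing the X_train/X_test variables from the earlier sketches) to make the definitions concrete:

```python
# Sketch: bagging, boosting, and stacking via scikit-learn.
# Assumes X_train/y_train/X_test/y_test from the earlier sketches.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Bagging: trees on bootstrap resamples, combined by majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25)

# Boosting: classifiers learned sequentially, each reweighting the cases the
# previous one misclassified; combined by weighted vote.
boosting = AdaBoostClassifier(n_estimators=25)

# Stacking: outputs of different learning algorithms fed to a meta-learner.
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("logit", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),
)

for model in (bagging, boosting, stacking):
    print(model.fit(X_train, y_train).score(X_test, y_test))
```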

Conclusions
- If data are highly unbalanced, algorithms tend to degenerate.
- If data are balanced (by deleting cases from the common outcome), the training set shrinks: this can lead to degeneracy by eliminating rare cases, and accuracy rates tend to decline.
- Decision tree algorithms were the most robust.