Great Workshop La Palma - June 2011
Handling Imbalanced Datasets in Multistage Classification
Mauro López, Centro de Astrobiología - Madrid (ex-LAEFF)

Problem
● Real-world classification problems deal with imbalanced datasets.
● Classifiers are usually biased towards the majority class.

Problem: Misclassification Cost
● Most of the literature assumes that the minority class is more important.
● Misclassifying a majority-class instance is usually considered less costly.
● E.g., breast cancer detection: missing a cancer (minority class) is far more costly than a false alarm.

Problem: Astronomy
● But in star classification, misclassification costs are the same for every class.
● A class with very few instances can be very well represented.

Problem: Not Only the Classifiers
● Feature selection, discretization, and other preprocessing filters suffer from the same problem.

Multistage Classifier
● Several advantages:
  ● Specialized classifiers
  ● Better selection of relevant features
  ● Combination of classification methods
● But there is a drawback:
  ● It worsens the imbalance problem (see the sketch below).
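
As a rough illustration, a minimal Python/scikit-learn sketch of a multistage (cascade) classifier; the DecisionTreeClassifier base learner and the train_cascade/predict_cascade helpers are illustrative assumptions, not the talk's actual implementation:

    from sklearn.tree import DecisionTreeClassifier

    def train_cascade(X, y, stages):
        """Train one binary 'class vs. rest' node per stage of the hierarchy."""
        nodes = []
        for cls in stages:
            clf = DecisionTreeClassifier(random_state=0)  # specialized classifier per node
            # Binary relabeling: this node's class vs. everything else. Each
            # relabeled problem can be far more imbalanced than y itself,
            # which is exactly the drawback noted on the slide.
            clf.fit(X, (y == cls).astype(int))
            nodes.append((cls, clf))
        return nodes

    def predict_cascade(nodes, x, fallback="Other"):
        """Walk the nodes in order; the first node that fires decides the class."""
        for cls, clf in nodes:
            if clf.predict(x.reshape(1, -1))[0] == 1:
                return cls
        return fallback

Each node faces a different "one class vs. rest" split, which is where the imbalance gets worse as the cascade deepens.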

Evaluation

Evaluation
● The most used measure in classification is accuracy:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
● We cannot say a classifier is good just by looking at its accuracy.
● Example: on a training set of 1000 instances labeled A and 1 instance labeled B, always predicting A already gets an "outstanding" 99.9%.
● It can still be useful for comparing classifiers (see the sketch below).
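
A minimal sketch of that accuracy trap, using the slide's 1000-to-1 example (the always-predict-A classifier is hypothetical):

    # The slide's 1000-to-1 example: a classifier that always predicts
    # the majority class "A".
    labels = ["A"] * 1000 + ["B"]
    predictions = ["A"] * len(labels)
    correct = sum(p == t for p, t in zip(predictions, labels))
    print(f"accuracy = {correct / len(labels):.4f}")  # 0.9990, yet B is never found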

Evaluation: ROC Curves
● ROC curves summarize performance over a range of tradeoffs between the true-positive and false-positive rates.
● Useful when FN and FP errors have different costs (see the sketch below).
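
As a hedged illustration, a small scikit-learn sketch of ROC analysis on synthetic imbalanced data (the dataset and the logistic-regression model are stand-ins, not the talk's):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    # Synthetic 90/10 imbalanced data as a stand-in for a real dataset.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

    fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
    print("AUC =", roc_auc_score(y, scores))     # single number summarizing all tradeoffs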

Evaluation
● The main goal with imbalanced datasets is to improve recall without decreasing precision:
  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)
● The F-value combines both measures:
  F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
● (β is usually set to 1, weighting precision and recall equally; see the sketch below.)
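
A minimal sketch of those formulas in Python; the confusion counts in the last line are made-up, illustrative numbers, not results from the talk:

    def f_value(tp, fp, fn, beta=1.0):
        """F-value from confusion counts, matching the slide's formulas."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # Made-up, illustrative counts:
    print(f_value(tp=80, fp=20, fn=40))  # precision 0.80, recall 0.67 -> F1 ~ 0.73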

Solutions: Undersampling
● (Random) removal of instances belonging to the majority class.
● Problem: we can lose important instances (see the sketch below).
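
A minimal NumPy sketch of random undersampling; the undersample helper and its arguments are illustrative assumptions:

    import numpy as np

    def undersample(X, y, majority_label, n_keep, seed=0):
        """Randomly keep only n_keep majority instances; keep all others."""
        rng = np.random.default_rng(seed)
        maj = np.flatnonzero(y == majority_label)
        rest = np.flatnonzero(y != majority_label)
        kept = rng.choice(maj, size=n_keep, replace=False)  # discarded instances may be important
        idx = np.concatenate([kept, rest])
        return X[idx], y[idx]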

Solutions: Oversampling
● Instances belonging to the minority class are replicated.
● Problems: possible overfitting; it does not widen the decision region for the class.
● Advantage: fast (see the sketch below).
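
A minimal NumPy sketch of random oversampling by replication; the oversample helper and its arguments are illustrative assumptions:

    import numpy as np

    def oversample(X, y, minority_label, n_extra, seed=0):
        """Replicate n_extra random minority instances (exact copies)."""
        rng = np.random.default_rng(seed)
        mino = np.flatnonzero(y == minority_label)
        extra = rng.choice(mino, size=n_extra, replace=True)  # copies: fast, but they
        idx = np.concatenate([np.arange(len(y)), extra])      # invite overfitting and do not
        return X[idx], y[idx]                                 # widen the decision region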

Solutions: SMOTE
● Synthetic Minority Oversampling Technique.
● Generates new instances by combining existing ones (interpolating between a minority instance and its nearest minority neighbors).
● No overfitting from exact copies.
● Forces the minority class to be more general (broader decision region). A minimal sketch follows below.
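
A simplified sketch of SMOTE's core interpolation step (Chawla et al. 2002), assuming a 2-D NumPy array X_min holding only minority-class instances; this is an illustration, not the talk's exact implementation:

    import numpy as np

    def smote(X_min, n_new, k=5, seed=0):
        """Create n_new synthetic points between minority instances and their neighbors."""
        rng = np.random.default_rng(seed)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            x = X_min[i]
            dists = np.linalg.norm(X_min - x, axis=1)
            neighbors = np.argsort(dists)[1:k + 1]  # k nearest minority neighbors (skip x itself)
            nb = X_min[rng.choice(neighbors)]
            gap = rng.random()                      # random point on the segment x -> nb
            synthetic.append(x + gap * (nb - x))
        return np.array(synthetic)

Because new points lie between existing minority instances rather than on top of them, the minority class's decision region grows instead of being memorized.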

SMOTE - Warning
● "Real stupidity beats artificial intelligence every time." — Terry Pratchett (Hogfather)
● RV vs. all
● Extreme imbalance ratio
● Can it really be that good?

RV vs. all
[Plot slide; figure not reproduced in the transcript.]

RV Smotified
[Plot slide; figure not reproduced in the transcript.]

Solutions: Adding Weights
● Does not remove important examples.
● Does not overfit.
● But it needs algorithms prepared to handle instance weights.
● 10-fold cross-validation can be tricky (see the sketch below).
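
A hedged scikit-learn sketch of the weighting idea via class weights (one common realization; the talk does not name a library, and the synthetic data is a stand-in):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic 95/5 imbalanced data as a stand-in for a real dataset.
    X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)

    # 'balanced' weights classes inversely to their frequencies: minority
    # errors cost more during training, and no instance is removed.
    clf = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)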

Solutions: Boosting
● Creates weak classifiers, reweighting the training set so that hard-to-classify instances get more attention.
● Maintains accuracy over the entire dataset (see the sketch below).
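
A hedged sketch using scikit-learn's AdaBoost, a standard boosting algorithm (the talk does not name a specific variant, and the synthetic data is a stand-in):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Each round fits a weak learner, then upweights the instances it got
    # wrong, so later learners concentrate on the hard cases.
    clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
    print("training accuracy:", clf.score(X, y))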

Experiment
● Hipparcos dataset
● 1661 instances
● 47 attributes + class
● 23 classes

Multistage Hierarchy
● Imbalance ratio
[Diagram slide showing the classification hierarchy; figure not reproduced in the transcript.]

Experiment: J48 (Weka's C4.5 decision tree)
● Node 1: LPV vs. Other
● Imbalance ratio: 4.3
● Good classification in spite of the imbalance
● Low margin for improvement

Experiment: J48
● Node 3: Eclipsing vs. Other
● Imbalance ratio: 1.33
● When the dataset is already balanced, adding new instances does not improve the classification.

Experiment
● Node 5: GDOR vs. Other
● Imbalance ratio: 28.07

Experiment
● Node 11: SPB+ACV vs. Other
● Imbalance ratio: 3.8

Results
● Using a balanced dataset improves the classification by about 10%.
● Feature selection (FS) is especially affected by the imbalance.

Thank you
● Time to wake up