Maximizing Classifier Utility when Training Data is Costly
Gary M. Weiss and Ye Tian, Fordham University

Presentation transcript:

Maximizing Classifier Utility when Training Data is Costly
Gary M. Weiss and Ye Tian, Fordham University
UBDM 2006 Workshop, August 20, 2006

Outline
- Introduction
  - Motivation, cost model
- Experimental Methodology
- Results
  - Adult data set
- Progressive Sampling
- Related Work
- Future Work/Conclusion

Motivation
- Utility-Based Data Mining is concerned with the utility of the overall data mining process
- A key cost is the cost of training data, yet these costs are often ignored (except in active learning)
- We are the first to analyze the impact of a very simple cost model; in doing so we fill a hole in existing research
- Our cost model:
  - A fixed cost for acquiring labeled training examples
  - No separate cost for class labels, missing features, etc.; Turney [1] called this the "cost of cases"
  - No control over which training examples are chosen (no active learning)

Motivation (cont.)
- Efficient progressive sampling [2] determines the "optimal" training set size, where the learning curve reaches a plateau
- It assumes data acquisition costs are essentially zero
- What if the acquisition costs are significant?

Motivating Examples
- Predicting customer behavior/buying potential: training data from D&B and Ziff-Davis; these and other "information vendors" make money by selling information
- Poker playing: learn about an opponent by playing against him

Experiments
- Use C4.5 to determine the relationship between accuracy and training set size
- 20 runs are used to increase the reliability of the results
- Random sampling is used to reduce the training set size
- For this talk we focus on the adult data set (~21,000 examples)
- We use a predetermined sampling schedule (a sketch of this procedure follows below)
- CPU times are recorded, mainly for future work
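
The following is a minimal sketch of this methodology, not the authors' code: it averages the error rate of a decision tree over 20 random subsamples at each candidate training set size. scikit-learn's DecisionTreeClassifier stands in for C4.5, and the inputs X, y (assumed to be NumPy arrays, e.g., a numerically encoded copy of the adult data set) and the schedule are assumed to be supplied by the caller.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def learning_curve(X, y, schedule, runs=20, seed=0):
    """Return {n: mean error rate over `runs` random samples of n training examples}."""
    rng = np.random.default_rng(seed)
    curve = {}
    for n in schedule:
        errors = []
        for _ in range(runs):
            # Hold out a test ("score") set, then draw n training examples at random
            # from the remaining pool (assumes n does not exceed the pool size).
            X_pool, X_test, y_pool, y_test = train_test_split(
                X, y, test_size=0.25, random_state=int(rng.integers(1 << 30)))
            idx = rng.choice(len(X_pool), size=n, replace=False)
            tree = DecisionTreeClassifier().fit(X_pool[idx], y_pool[idx])
            errors.append(1.0 - tree.score(X_test, y_test))
        curve[n] = float(np.mean(errors))
    return curve
```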

Measuring Total Utility
- Total cost = data cost + error cost = n·C_tr + e·|S|·C_err, where:
  - n = number of training examples
  - e = error rate
  - |S| = number of examples in the score set
  - C_tr = cost of a training example
  - C_err = cost of an error
- We will know n and e for any experiment
- With domain knowledge one could estimate C_tr, C_err, and |S|, but we don't have this knowledge, so we treat C_tr and C_err as parameters and vary them
- We assume |S| = 100 with no loss of generality (if |S| is 100,000, look at the results for C_err/1,000); a small code sketch of this computation follows below
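
As an illustration only (not the authors' code), the cost model above can be written directly as a function; the names total_cost and optimal_training_size are mine, and the second helper simply picks the schedule point that minimizes total cost given a learning curve such as the one estimated above.

```python
def total_cost(n, error_rate, cost_per_example, cost_per_error, score_set_size=100):
    """Total cost = n*C_tr + e*|S|*C_err (data cost plus error cost)."""
    return n * cost_per_example + error_rate * score_set_size * cost_per_error

def optimal_training_size(curve, cost_per_example, cost_per_error, score_set_size=100):
    """Given a learning curve {n: error rate}, return the total-cost-minimizing n."""
    return min(curve, key=lambda n: total_cost(n, curve[n], cost_per_example,
                                               cost_per_error, score_set_size))
```

Sweeping optimal_training_size over a range of cost ratios is one way to reproduce the shape of the utility and optimal training set size curves shown later in the talk.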

Measuring Total Utility (cont.)
- Now we need only look at the cost ratio C_tr:C_err; typical values evaluated are 1:1, 1:1000, etc.
- The relative cost ratio is C_err/C_tr
- Example: if the cost ratio is 1:1000, there is an even trade-off if buying 1,000 training examples eliminates 1 error
- Alternatively: buying 1,000 examples is worth a 1% reduction in error rate (in which case |S| = 100 can be ignored); a worked check follows below
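
A quick numeric check of that example under the cost model above (the specific values of C_tr and C_err are illustrative; only their 1:1000 ratio matters):

```python
C_tr, C_err, S = 1, 1000, 100
data_cost = 1000 * C_tr             # cost of buying 1,000 training examples = 1,000
error_savings = 0.01 * S * C_err    # savings from a 1% error-rate reduction = 1,000
assert data_cost == error_savings   # even trade-off, exactly as the slide states
```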

Learning Curve [figure]

Utility Curves [figure]

Utility Curves (Normalized Cost) [figure]

Optimal Training Set Size Curve [figure]

Value of the Optimal Curve
- Even without specific cost information, this chart could be useful to a practitioner: it can put bounds on the appropriate training set size
- Analogous to Drummond and Holte's cost curves [3]: they looked at the cost ratio of false positives and false negatives, whereas we look at the ratio of the cost of errors to the cost of data
- Both types of curves allow the practitioner to understand the impact of the various costs

Idealized Learning Curve [figure]

Progressive Sampling
- We want to find the optimal training set size
- We need to determine when to stop acquiring data, before acquiring all of it
- Strategy: use a progressive sampling strategy
- Key issues: when do we stop, and what sampling schedule should we use?

Our Progressive Sampling Strategy
- We stop after the first increase in total cost
- The result is therefore never optimal, but it is near-optimal if the learning curve is non-decreasing
- We evaluate two simple sampling schedules (see the sketch after this slide):
  - S1: 10, 50, 100, 500, 1000, 2000, ..., 9000, 10,000, 12,000, 14,000, ...
  - S2: 50, 100, 200, 400, 800, 1600, ...
- S1 and S2 are similar for modest-sized data sets
- An adaptive strategy could also be used
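
The stopping rule can be sketched as follows; this is an illustration under the stated cost model, not the authors' implementation. The train_and_get_error callback is assumed to buy n examples and return the resulting error rate (for example, by wrapping the learning_curve sketch above), and the schedule endpoints are illustrative because the slide elides where S1 and S2 end.

```python
def progressive_sampling(schedule, train_and_get_error,
                         cost_per_example, cost_per_error, score_set_size=100):
    """Acquire data batch by batch; stop after the first increase in total cost."""
    best_n, best_cost, prev_cost = None, float("inf"), None
    for n in schedule:
        e = train_and_get_error(n)
        cost = n * cost_per_example + e * score_set_size * cost_per_error
        if prev_cost is not None and cost > prev_cost:
            break                               # first increase in total cost: stop
        if cost < best_cost:
            best_n, best_cost = n, cost
        prev_cost = cost
    return best_n, best_cost

# The two schedules from the slide; the upper endpoints are illustrative only.
S1 = [10, 50, 100, 500] + list(range(1000, 10001, 1000)) + list(range(12000, 20001, 2000))
S2 = [50 * 2 ** i for i in range(9)]            # 50, 100, 200, 400, ..., 12800
```

The returned size need not be globally optimal, since the batch that triggered the increase has already been purchased and any later dip in the curve is missed, but it is near-optimal when the learning curve is well behaved, as the slide notes.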

Adult Data Set: S1 vs. Straw Man [figure]

Progressive Sampling Conclusions
- We can use progressive sampling to determine a near-optimal training set size
- Effectiveness depends mainly on how well behaved (i.e., non-decreasing) the learning curve is
- The sampling schedule/batch size is also important: finer granularity requires more CPU time, but if data is costly, CPU time is most likely less expensive
- In our experiments, cumulative CPU time was under 1 minute

Related Work
- Efficient progressive sampling [2]: tries to efficiently find the asymptote of the learning curve; that work treats the data cost as ε and stops only when added data has no benefit
- Active learning: similar in that data cost is factored in, but the setting is different; the user has control over which examples are selected or which features are measured, and it does not address the simple "cost of cases" scenario
- Finding the best class distribution when training data is costly [4]: assumes the training set size is limited but pre-specified, and finds the best class distribution to maximize performance

Limitations/Future Work
- Improvements:
  - Bigger data sets where the learning curve plateaus
  - More sophisticated sampling schemes
  - Incorporate cost-sensitive learning (cost of FP ≠ cost of FN)
  - Generate better behaved learning curves
  - Include CPU time in the utility metric
- Analyze other cost models
- Study the learning curves
- Real-world motivating examples, perhaps with cost information

Conclusion
- We analyze the impact of training data cost on the classification process
- We introduce new ways of visualizing the impact of data cost: utility curves and optimal training set size curves
- We show that progressive sampling can be used to help learn a near-optimal classifier

We Want Feedback
- We are continuing this work; clearly many minor enhancements are possible, so feel free to suggest more
- Any major new directions/extensions?
- What, if anything, is most interesting?
- Any really good motivating examples that you are familiar with?

Questions?
If I have run out of time, please find me during the break!

References
1. P. Turney (2000). Types of Cost in Inductive Concept Learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning.
2. F. Provost, D. Jensen & T. Oates (1999). Efficient Progressive Sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
3. C. Drummond & R. Holte (2000). Explicitly Representing Expected Cost: An Alternative to ROC Representation. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
4. G. Weiss & F. Provost (2003). Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19.

Learning Curves for Large Data Sets [figure]

Optimal Curves for Large Data Sets [figure]

Learning Curves for Small Data Sets [figure]

Optimal Curves for Small Data Sets [figure]

Results for Adult Data Set [figure]

Optimal vs. S1 for Large Data Sets [figure]