M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Classification: Evaluation February 23, 2009 Slide 1 COMP527: Data Mining

Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam. COMP527: Data Mining Classification: Evaluation February 23, 2009 Slide 2 COMP527: Data Mining

Evaluation; Samples; Cross Validation; Bootstrap; Confidence of Accuracy. Today's Topics Classification: Evaluation February 23, 2009 Slide 3 COMP527: Data Mining

We need some way to quantitatively evaluate the results of data mining.  Just how accurate is the classification?  How accurate can we expect a classifier to be?  If we can't evaluate the classifier, how can it be improved?  Can different types of classifier be evaluated in the same way?  What are useful criteria for such a comparison?  How can we evaluate clusters or association rules? There are lots of issues to do with evaluation. Evaluation Classification: Evaluation February 23, 2009 Slide 4 COMP527: Data Mining

Assuming classification, the basic evaluation is how many correct predictions the classifier makes as opposed to incorrect predictions. We can't test on the data used for training the classifier and get an accurate result; the result is "hopelessly optimistic" (Witten). Eg: due to over-fitting, a classifier might get 100% accuracy on the data it was trained from and 0% accuracy on other data. The error rate measured on the training data is called the resubstitution error rate -- the error rate when you substitute the data back into the classifier generated from it. So we need some new, but labeled, data to test on. Evaluation Classification: Evaluation February 23, 2009 Slide 5 COMP527: Data Mining

Most of the time we do not have enough data to have plenty for training and plenty for testing, though sometimes this is possible (eg sales data). Some systems have two phases of training: an initial learning period and then fine tuning, for example the Growing and Pruning sets used for building trees. The set used for fine tuning is a validation set, and it's important not to test on the validation set either. Note that this reduces the amount of data that you can actually train on by a significant amount. Validation Classification: Evaluation February 23, 2009 Slide 6 COMP527: Data Mining

Further issues to consider:  Some classifiers produce probabilities for one or more classes. We need some way to handle the probabilities – for a classifier to be partly correct. Also for multi-label problems (eg an instance has 2 or more classes) we need some 'cost' function for rewarding an accurate subset of the classes.  Regression/Numeric Prediction produces a numeric value. We need statistical measures of how close the prediction is, rather than the true/false correctness used for nominal classes. Numeric Data, Multiple Classes Classification: Evaluation February 23, 2009 Slide 7 COMP527: Data Mining

Obvious answer: keep part of the data set aside for testing purposes and use the rest to train the classifier. Then use the test set to evaluate the resulting classifier in terms of accuracy. Accuracy: number of correctly classified instances / total number of instances to classify. The split is often 2/3 training, 1/3 test. How should we select the instances for each section? Hold Out Method Classification: Evaluation February 23, 2009 Slide 8 COMP527: Data Mining
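
As a concrete illustration, here is a minimal Python sketch of the holdout method. The function names and the train/predict interface (a model object with a predict method) are assumptions for the example, not part of the course material.

```python
import random

def holdout_split(dataset, train_fraction=2/3, seed=None):
    """Randomly split a list of (features, label) pairs into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(dataset)          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Accuracy = correctly classified test instances / total test instances."""
    correct = sum(1 for x, y in test_set if model.predict(x) == y)
    return correct / len(test_set)
```

Train on the first two thirds, then report accuracy(model, test_set) on the held-out third.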

Easy: randomly select instances. But the data could be very unbalanced: eg 99% one class, 1% the other class. Then random sampling may draw few or none of the 1% class. Stratified: group the instances by class and then select a proportionate number from each class. Balanced: randomly select a desired number of minority-class instances, and then add the same number from the majority class. Samples Classification: Evaluation February 23, 2009 Slide 9 COMP527: Data Mining
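
A sketch of the two sampling schemes, assuming the same (features, label) instance format as before; the helper names are mine, not from the slides.

```python
import random
from collections import defaultdict

def group_by_class(dataset):
    """Map each class label to its list of (features, label) instances."""
    groups = defaultdict(list)
    for x, y in dataset:
        groups[y].append((x, y))
    return groups

def stratified_sample(dataset, fraction=1/3, seed=None):
    """Take the same fraction from every class, preserving the class distribution."""
    rng = random.Random(seed)
    sample = []
    for items in group_by_class(dataset).values():
        sample.extend(rng.sample(items, round(len(items) * fraction)))
    return sample

def balanced_sample(dataset, per_class, seed=None):
    """Take the same absolute number of instances from every class
    (per_class must not exceed the size of the smallest class)."""
    rng = random.Random(seed)
    sample = []
    for items in group_by_class(dataset).values():
        sample.extend(rng.sample(items, per_class))
    return sample
```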

For small data sets, removing some as a test set and still having a representative set to train from is hard. Solutions? Repeat the process multiple times, selecting a different test set each time. Then find the error from each run and average across all of the iterations. Of course there's no reason to do this only for small data sets! Different test sets might still overlap, which might give a biased estimate of the accuracy (eg if the sampling randomly selects the same good records multiple times). Can we prevent this? Small Data Sets Classification: Evaluation February 23, 2009 Slide 12 COMP527: Data Mining
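
A short sketch of repeated holdout, reusing the hypothetical holdout_split and accuracy helpers above together with an assumed train(train_set) -> model function.

```python
def repeated_holdout_error(dataset, train, repetitions=10):
    """Average error over several random holdout splits (different split each time)."""
    errors = []
    for i in range(repetitions):
        train_set, test_set = holdout_split(dataset, seed=i)
        model = train(train_set)
        errors.append(1.0 - accuracy(model, test_set))
    return sum(errors) / len(errors)
```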

Split the dataset up into k parts, then use each part in turn as the test set and the others as the training set. If each part is also stratified, we get stratified cross validation, rather than perhaps ending up with a non-representative sample in one or more parts. Common values for k are 3 (cf the 2/3 vs 1/3 holdout split) and 10. Hence: stratified 10-fold cross validation. Again, the error values are averaged over the k iterations. Cross Validation Classification: Evaluation February 23, 2009 Slide 13 COMP527: Data Mining
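
A minimal sketch of stratified k-fold cross validation, again assuming the train(train_set) -> model and model.predict interface used in the earlier sketches.

```python
import random
from collections import defaultdict

def stratified_folds(dataset, k=10, seed=None):
    """Deal each class's instances round-robin into k folds, so every fold keeps
    roughly the original class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    groups = defaultdict(list)
    for x, y in dataset:
        groups[y].append((x, y))
    for items in groups.values():
        rng.shuffle(items)
        for i, instance in enumerate(items):
            folds[i % k].append(instance)
    return folds

def cross_validated_error(dataset, train, k=10, seed=None):
    """Use each fold in turn as the test set; average the k error estimates."""
    folds = stratified_folds(dataset, k, seed)
    errors = []
    for i, test_set in enumerate(folds):
        train_set = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        model = train(train_set)
        wrong = sum(1 for x, y in test_set if model.predict(x) != y)
        errors.append(wrong / len(test_set))
    return sum(errors) / len(errors)
```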

Why 10? Extensive testing shows it to be a good middle ground -- not too much processing, not too random. Cross validation is used extensively in the data mining literature. It's the simplest and easiest-to-understand evaluation technique, while still giving a good estimate of accuracy. There are other similar evaluation techniques, however... Cross Validation Classification: Evaluation February 23, 2009 Slide 14 COMP527: Data Mining

Select one instance and train on all the others, then see if the held-out instance is correctly classified. Repeat for every instance and find the percentage of correct results. Ie: N-fold cross validation, where N is the number of instances in the data set. Attractive: if 10 is good, surely N is better :) No random sampling problems. Trains on the largest possible amount of data. Leave One Out Classification: Evaluation February 23, 2009 Slide 15 COMP527: Data Mining

Disadvantages: computationally expensive -- it builds N models! It also guarantees a non-stratified, non-balanced test sample (a single instance). Worst case: the class distribution is exactly 50/50 and the data is so complicated that the classifier simply picks the most common class in its training set -- it will then always pick the wrong class for the held-out instance. Leave One Out Classification: Evaluation February 23, 2009 Slide 16 COMP527: Data Mining
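
A small sketch illustrating that worst case: leave-one-out applied to a classifier that always predicts the most common class in its training data, on a perfectly 50/50 dataset (the function is illustrative only, not course code).

```python
from collections import Counter

def loo_accuracy_of_majority_classifier(labels):
    """Leave-one-out accuracy of a classifier that always predicts the most
    common class seen in its training data."""
    correct = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        prediction = Counter(training).most_common(1)[0][0]
        correct += (prediction == held_out)
    return correct / len(labels)

print(loo_accuracy_of_majority_classifier(['a'] * 50 + ['b'] * 50))   # 0.0 -- always wrong
```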

Until now, the sampling has been without replacement (ie each instance occurs exactly once, either in the training set or the test set). However, we could put an instance back so it can be drawn again -- sampling with replacement. This gives the bootstrap evaluation technique. Draw a training set from the data set with replacement such that the number of instances in both is the same, then use the instances which do not appear in the training set as the test set. (Eg some instances will appear more than once in the training set.) Statistically, the likelihood of a given instance never being picked is about 0.368, so roughly 63.2% of the distinct instances end up in the training set -- hence the name '0.632 bootstrap'. Bootstrap Classification: Evaluation February 23, 2009 Slide 17 COMP527: Data Mining
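
A sketch of drawing one bootstrap sample (names are mine): N draws with replacement form the training set, and the never-picked instances become the test set.

```python
import random

def bootstrap_sample(dataset, seed=None):
    """N draws with replacement form the training set; the never-picked
    instances become the test set."""
    rng = random.Random(seed)
    n = len(dataset)
    picked = [rng.randrange(n) for _ in range(n)]       # indices, with repeats
    train_set = [dataset[i] for i in picked]
    picked_set = set(picked)
    test_set = [inst for i, inst in enumerate(dataset) if i not in picked_set]
    return train_set, test_set
```

For a 1000-instance dataset, roughly 368 instances typically land in the test set, matching the figure on the next slide.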

Eg: we have a dataset of 1000 instances. We sample with replacement 1000 times – ie we randomly select an instance from all 1000 instances, 1000 times. This should leave us with approximately 368 instances that have never been selected. We set these aside and use them as the test set. The error rate will be pessimistic – we are only training on about 63.2% of the distinct instances, with some repeated. We compensate by combining it with the optimistic error rate from resubstitution: error rate = 0.632 * error-on-test + 0.368 * error-on-training. Bootstrap Classification: Evaluation February 23, 2009 Slide 18 COMP527: Data Mining
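
Putting it together, a sketch of one 0.632-bootstrap error estimate, reusing the hypothetical bootstrap_sample helper above and the same assumed train/predict interface as earlier.

```python
def bootstrap_error(dataset, train, seed=None):
    """One 0.632-bootstrap estimate: pessimistic test error combined with the
    optimistic resubstitution (training) error."""
    train_set, test_set = bootstrap_sample(dataset, seed)
    model = train(train_set)

    def error(instances):
        wrong = sum(1 for x, y in instances if model.predict(x) != y)
        return wrong / len(instances)

    return 0.632 * error(test_set) + 0.368 * error(train_set)
```

In practice the whole procedure is repeated with different random seeds and the estimates are averaged.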

What about the size of the test set? More test instances should make us more confident that the estimated accuracy is close to the true accuracy. Eg getting 75% on 10,000 test instances is much more likely to be close to the true accuracy than 75% on 10. A series of events that either succeed or fail is a Bernoulli process, eg coin tosses. We can observe S successes out of N trials and compute S/N... but what does that tell us about the true accuracy rate? Statistics can tell us the range within which the true accuracy rate should fall. Eg: 750/1000 means the true accuracy is very likely to lie between 73.2% and 76.7%. (Witten 147 to 149 has the full maths!) Confidence of Accuracy Classification: Evaluation February 23, 2009 Slide 19 COMP527: Data Mining
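
A sketch of that calculation using the confidence interval for a Bernoulli success rate given in Witten (the Wilson-style interval); I assume z = 1.28, ie roughly the 80% confidence level, which reproduces the figures quoted above.

```python
import math

def accuracy_interval(successes, trials, z=1.28):
    """Confidence interval for a Bernoulli success rate; z = 1.28 corresponds
    to roughly 80% confidence."""
    f = successes / trials
    centre = f + z * z / (2 * trials)
    spread = z * math.sqrt(f / trials - f * f / trials + z * z / (4 * trials * trials))
    denominator = 1 + z * z / trials
    return (centre - spread) / denominator, (centre + spread) / denominator

print(accuracy_interval(750, 1000))   # roughly (0.732, 0.767)
```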

We might wish to compare two classifiers of different types. We could simply compare the accuracy from 10-fold cross validation, but there's a more rigorous method: Student's t-test. Method:  Perform ten-fold cross validation (TCV) 10 times for the first classifier – ie 10 repetitions of TCV = 100 models.  Perform the same repeated TCV with the second classifier.  This gives us x1..x10 for the first and y1..y10 for the second.  Find the mean of the 10 cross-validation estimates for each.  Find the difference between the two means. We want to know if the difference is statistically significant. Confidence of Accuracy Classification: Evaluation February 23, 2009 Slide 20 COMP527: Data Mining

We then find t by: t = d̄ / sqrt(σ_d² / k), where d̄ is the difference between the two means (equivalently the mean of the per-run differences d_i = x_i - y_i), k is the number of times the cross validation was performed, and σ_d² is the variance of the differences between the paired runs (variance = sum of squared deviations of the d_i from their mean, divided by k-1). Then look up the critical value in the t table for k-1 degrees of freedom (more tables! but printed in Witten pg 155). If t is greater than z from the table, the difference is statistically significant. Student's T-Test Classification: Evaluation February 23, 2009 Slide 21 COMP527: Data Mining
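
A minimal sketch of computing that t statistic from the two sets of repeated cross-validation estimates (x and y as defined on the previous slide); the function name is mine.

```python
import math

def paired_t_statistic(x, y):
    """t statistic for k paired accuracy estimates x[i], y[i] from repeated
    cross validation of two classifiers."""
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]                     # per-run differences
    d_mean = sum(d) / k
    variance = sum((di - d_mean) ** 2 for di in d) / (k - 1)  # sample variance of the d_i
    return d_mean / math.sqrt(variance / k)
```

Compare the result against the tabulated value for k-1 degrees of freedom at the chosen significance level.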

Introductory statistical textbooks; Witten, again; Han 6.2, 6.12, 6.13; Berry and Browne, 1.4; Devijver and Kittler, Chapter 10. Further Reading Classification: Evaluation February 23, 2009 Slide 22 COMP527: Data Mining