Machine Learning Group University College Dublin Evaluation in Machine Learning Pádraig Cunningham

Slide 2: Outline
- Student's t-test
- Test for paired data
- Cross Validation
- McNemar's Test
- ROC Analysis
- Other Statistical Tests for Evaluation

Slide 3: William Sealy Gosset
The t-statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews; "Student" was his pen name. Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset published the t-test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was known not only to fellow statisticians but to his employer; the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules. (Wikipedia)

Slide 4: Student's t-Test
- Scores by two rugby teams: is B better than A?

Slide 5: What does the t-statistic mean?
- For a given t-statistic you can look up the confidence level: here there is a 31.7% chance that this difference is due to chance (according to this test).
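As a rough illustration of "looking up" that confidence (not from the slides; the t value and degrees of freedom below are arbitrary), SciPy's t distribution can be used:

```python
from scipy import stats

# Hypothetical t-statistic and degrees of freedom, for illustration only.
t_stat, dof = 1.0, 10

# Two-sided tail probability: the chance of seeing a difference at least
# this large if there were really no difference between the two samples.
p_value = 2 * stats.t.sf(abs(t_stat), dof)
print("p = %.3f" % p_value)
```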

Slide 6: Student's t-Test
- More data and/or a clearer difference will give statistical significance.

Slide 7: Student's t-Test (paired)
- Scores are paired, i.e. scored against the same opposing team.
- With paired data, statistical significance can be established with fewer observations.
- We can say with 95% confidence that B are better than A.

Slide 8: Student's t-test: Formulae
- Two samples, A and B:
  t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}
  where \bar{x}_A is the average in A, s_A^2 is the variance in A and n_A is the number of observations in A (similarly for B).
- Test for paired data, one sample (D is the difference in pairs):
  t = \frac{\bar{D}}{s_D/\sqrt{n}}
  where \bar{D} is the mean of the differences, s_D their standard deviation and n the number of pairs.
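A minimal sketch of both tests in Python, assuming SciPy is available; the score samples below are made-up numbers, not the figures from the slides:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two teams over six matches (illustrative only).
scores_a = np.array([12, 15, 9, 20, 14, 11])
scores_b = np.array([18, 16, 22, 19, 25, 17])

# Two-sample t-test. equal_var=False uses per-sample variances as in the
# formula above; equal_var=True gives the classic pooled-variance version.
t_two_sample, p_two_sample = stats.ttest_ind(scores_a, scores_b, equal_var=False)

# Paired t-test: treats the scores as paired, e.g. against the same opponents.
t_paired, p_paired = stats.ttest_rel(scores_a, scores_b)

print("two-sample: t = %.3f, p = %.3f" % (t_two_sample, p_two_sample))
print("paired:     t = %.3f, p = %.3f" % (t_paired, p_paired))
```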

Slide 9: Paired t-Test example
- The t-test can be used for comparing errors in regression systems.
- It can also be used for comparing classifiers if multiple test sets are available, and also with cross validation (more later).
[Worked example not preserved in the transcript; only the fragment "= 5.2" survives.]
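As a sketch of this practice (a common one, though the per-fold errors are not strictly independent, which is part of what the tests on the later slides address), the paired t-test can be applied to the per-fold error rates of two classifiers; the dataset and classifiers below are arbitrary stand-ins:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

errors_a, errors_b = [], []
for train_idx, test_idx in kf.split(X):
    clf_a = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    clf_b = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    errors_a.append(1 - accuracy_score(y[test_idx], clf_a.predict(X[test_idx])))
    errors_b.append(1 - accuracy_score(y[test_idx], clf_b.predict(X[test_idx])))

# Paired t-test on the per-fold error rates of the two classifiers.
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
print("t = %.3f, p = %.3f" % (t_stat, p_value))
```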

Slide 10: Evaluation in Machine Learning
Supervised Learning
- Typical question: which is better, Classifier A or Classifier B?
- Evaluate generalisation accuracy.
- Hold back some training data to use for testing; use performance on the test data as a proxy for performance on unseen data (i.e. generalisation).
[Diagram: the available training data split into Train and Test portions.]
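A minimal hold-out sketch, assuming scikit-learn; the synthetic dataset and logistic regression model are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hold back 20% of the data as a test set; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```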

Slide 11: Problems with 'Hold-out' Validation
Imagine 200 samples are available for training:
- A 50:50 split underestimates generalisation accuracy (only 100 samples left to train on).
- An 80:20 split gives an estimate based on a small test sample (40 examples).
- Different hold-out sets give different results.
[Plot of accuracy against number of training samples not preserved in the transcript.]
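The last point can be seen directly by repeating the split with different random seeds; again, the dataset and model below are arbitrary illustrations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Repeat the 80:20 hold-out with different random splits and watch the
# accuracy estimate move around.
estimates = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    estimates.append(accuracy_score(y_te, clf.predict(X_te)))

print("per-split accuracies:", np.round(estimates, 3))
print("mean = %.3f, std = %.3f" % (np.mean(estimates), np.std(estimates)))
```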

Slide 12: k-Fold Cross Validation
Having your cake and eating it too…
- Divide the data into k folds.
- For each fold in turn: use that fold for testing and use the remainder of the data for training.
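A minimal k-fold sketch, assuming scikit-learn; the dataset and classifier are again arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in kf.split(X):
    # Train on the other k-1 folds, test on the held-out fold.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("per-fold accuracies:", np.round(fold_accuracies, 3))
print("mean CV accuracy: %.3f" % np.mean(fold_accuracies))
```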

Slide 13: Comparing Two Classifiers (Salzberg, 1997)
Tuning is explicit (sketched in code below):
1. Divide the dataset into k folds (say 10).
2. For each of the k folds:
   a. Create training and test sets T and S.
   b. Divide T into sets T1 and T2.
   c. For each of the classifiers:
      i. Use T2 to tune parameters on a model trained with T1.
      ii. Use these 'good' parameters to train a model with T.
      iii. Measure accuracy on S.
   d. Record 0-1 loss results for each classifier.
3. Assess significance of the results (e.g. McNemar's test).
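A rough sketch of the procedure, with scikit-learn's GridSearchCV standing in for the explicit T1/T2 tuning step (an assumption made for brevity, not how the slide spells it out); the classifiers and parameter grids are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Two classifiers with hypothetical parameter grids to tune.
candidates = {
    "tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}),
}

kf = KFold(n_splits=10, shuffle=True, random_state=0)
losses = {name: [] for name in candidates}             # 0-1 loss per fold
predictions = {name: np.empty_like(y) for name in candidates}

for train_idx, test_idx in kf.split(X):                # T = train_idx, S = test_idx
    for name, (clf, grid) in candidates.items():
        # Tune parameters by splitting T internally (the T1/T2 step),
        # then refit on all of T with the chosen parameters.
        search = GridSearchCV(clf, grid, cv=2)
        search.fit(X[train_idx], y[train_idx])
        preds = search.predict(X[test_idx])
        predictions[name][test_idx] = preds
        losses[name].append(np.mean(preds != y[test_idx]))

for name in candidates:
    print(name, "mean 0-1 loss: %.3f" % np.mean(losses[name]))
# predictions["tree"] and predictions["knn"] can now feed McNemar's test.
```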

Slide 14: McNemar's test
Which is better, C1 or C2? Which is better, C2 or C3?
[Table of per-example correct/incorrect outcomes for C1, C2 and C3 not preserved in the transcript.]
McNemar's test captures this notion:
- n01: number misclassified by the 1st but not the 2nd classifier
- n10: number misclassified by the 2nd but not the 1st classifier
- The statistic (in its usual continuity-corrected form) is (|n01 - n10| - 1)^2 / (n01 + n10).
- For the test to be applicable, (n01 + n10) > 10.
- A value > 3.84 is required for statistical significance at 95%.
- McNemar score for C2 vs C1 = 1/2; McNemar score for C2 vs C3 = 1/6.
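A small sketch of computing the statistic from two classifiers' predictions; the labels and predictions below are made up for illustration:

```python
import numpy as np

def mcnemar_statistic(y_true, pred_1, pred_2):
    """McNemar chi-squared statistic (continuity-corrected form)."""
    wrong_1 = pred_1 != y_true
    wrong_2 = pred_2 != y_true
    n01 = np.sum(wrong_1 & ~wrong_2)   # misclassified by 1st but not 2nd
    n10 = np.sum(~wrong_1 & wrong_2)   # misclassified by 2nd but not 1st
    # The slide notes the test is only applicable when n01 + n10 > 10.
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Made-up labels and predictions for two classifiers on twelve examples.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0])
pred_a = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0])
pred_b = np.array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0])

stat = mcnemar_statistic(y_true, pred_a, pred_b)
print("McNemar statistic: %.3f (> 3.84 needed for significance at 95%%)" % stat)
```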

Slide 15: McNemar's test example
[Worked example: tables of per-example outcomes for classifier pairs Cx/Cy and Ca/Cb not preserved in the transcript.]

Slide 16: Other Tests
Dietterich's 5x2cv paired t-test (Dietterich, 1998)
- 5 repetitions of 2-fold cross validation; with 2 folds there is no overlap in the training data.
- This gives 10 pairs of error estimates from which a t statistic can be derived.
- (+) flexible on the choice of loss function
- (-) training sets comprise only 50% of the data
Demšar's comparisons over multiple datasets (Demšar, 2006)
- Comparisons between classifiers done on multiple datasets give a table of results.
- Averaging across datasets is dodgy.
- Demšar's Test: the Wilcoxon Signed Ranks Test to compare a pair of classifiers, and Friedman's Test to combine these scores for multiple classifiers; counts of wins, losses and ties.
- This methodology could become the standard.
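A sketch of the Demšar-style comparisons using SciPy's implementations of the two tests; the accuracy table below is entirely made up:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies of three classifiers (columns A, B, C) on eight
# datasets (rows); the numbers are invented for illustration.
acc = np.array([
    [0.810, 0.830, 0.800],
    [0.720, 0.750, 0.710],
    [0.900, 0.915, 0.890],
    [0.650, 0.692, 0.660],
    [0.880, 0.875, 0.850],
    [0.770, 0.810, 0.780],
    [0.830, 0.880, 0.820],
    [0.700, 0.760, 0.690],
])

# Wilcoxon signed-ranks test to compare one pair of classifiers (A vs B).
w_stat, w_p = stats.wilcoxon(acc[:, 0], acc[:, 1])
print("Wilcoxon A vs B: statistic = %.3f, p = %.3f" % (w_stat, w_p))

# Friedman test across all three classifiers on the same datasets.
f_stat, f_p = stats.friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])
print("Friedman: chi2 = %.3f, p = %.3f" % (f_stat, f_p))
```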

Slide 17: Loss Functions
How you keep the score…
Regression
- Quadratic loss function: minimize Mean Squared Error; big errors matter more.
Classification
- Misclassification rate, aka the 0-1 loss function.
- Many alternatives are possible and appropriate in different circumstances, e.g. the F measure.
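The two loss functions written out as minimal Python sketches (equivalent functions exist in scikit-learn; these are spelled out for clarity):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Quadratic loss: big errors are penalised disproportionately."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def zero_one_loss(y_true, y_pred):
    """Misclassification rate: fraction of examples labelled incorrectly."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# Tiny made-up examples.
print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))   # 0.833...
print(zero_one_loss([0, 1, 1, 0, 1], [0, 1, 0, 0, 0]))        # 0.4
```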

Slide 18: Loss Functions: ROC Curves
Ranking Classifiers
- Many (binary) classifiers return a numeric score between 0 and 1.
- Classifier bias can be controlled by adjusting a threshold.
- For a given test set, the ROC curve shows classifier performance over a range of thresholds/biases.
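A minimal ROC sketch, assuming scikit-learn; the synthetic data and logistic regression scorer are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # a score between 0 and 1 per example

# ROC curve: true-positive rate vs false-positive rate as the decision
# threshold sweeps over the scores; AUC summarises it in one number.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC = %.3f" % auc(fpr, tpr))
```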

Slide 19: References
- Salzberg, S. (1997) On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, 1, 317–327.
- Dietterich, T.G. (1998) Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation, 10:1895–1924.
- Demšar, J. (2006) Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7(Jan):1–30.