Evaluating Classifiers

Presentation transcript:

Evaluating Classifiers Lecture 2 Instructor: Max Welling

Evaluation of Results How do you report classification error? How certain are you about the error you claim? How do you compare two algorithms? How certain are you when you state that one algorithm performs better than another?

Evaluation Given: A hypothesis h(x): X → C in hypothesis space H, mapping attributes x to classes c ∈ {1, 2, ..., C}, and a data sample S(n) of size n. Questions: What is the error of h on unseen data? If we have two competing hypotheses, which one is better on unseen data? How do we compare two learning algorithms in the face of limited data? How certain are we about our answers?

Sample and True Error We can define two errors: 1) Error(h|S) is the error on the sample S: Error(h|S) = (1/n) Σ_{x in S} 1[h(x) ≠ f(x)], the fraction of sample points that h misclassifies. 2) Error(h|P) is the true error on unseen data sampled from the distribution P(x): Error(h|P) = P_{x~P}(h(x) ≠ f(x)), where f(x) is the true hypothesis.
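To make the first definition concrete, here is a minimal Python sketch of the sample error; the function name sample_error and the array-based interface are choices made for this illustration, not part of the lecture.

    import numpy as np

    def sample_error(h, X, y_true):
        # Fraction of the n sample points that the hypothesis h misclassifies,
        # i.e. Error(h|S) with S = (X, y_true).
        y_pred = np.array([h(x) for x in X])
        return np.mean(y_pred != y_true)

The true error Error(h|P) cannot be computed this way, since it is an expectation over the unknown distribution P(x); the sample error is its estimator.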

Binomial Distributions Assume you toss a coin n times, and that it has probability p of coming up heads (which we will call a success). What is the probability distribution governing the number of heads in n trials? Answer: the Binomial distribution, P(k heads) = C(n, k) p^k (1-p)^(n-k).

Distribution over Errors Consider some hypothesis h(x). Draw a sample Xk of n points from P(X); do this k times. Compute e1 = n·error(h|X1), e2 = n·error(h|X2), ..., ek = n·error(h|Xk). {e1, ..., ek} are samples from a Binomial distribution! Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x ~ P(x) and observes whether h(x) predicts the label correctly. If it does, He makes sure the coin lands heads up... You have a single sample S, for which you observe e(S) errors. What do you think would be a reasonable estimate for Error(h|P)?
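A small numpy simulation of this argument; the true error rate p_true, the sample size n and the number of repetitions k below are illustrative values, not taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    p_true = 0.25   # hypothetical true error rate error(h|P)
    n = 100         # size of each sample X_i
    k = 10000       # number of independent sample sets drawn

    # e_i = n * error(h|X_i) is the number of misclassified points in X_i,
    # i.e. the number of "heads" in n Bernoulli(p_true) coin tosses.
    errors = rng.binomial(n, p_true, size=k)

    print("mean of e_i:    ", errors.mean())   # close to n*p = 25
    print("variance of e_i:", errors.var())    # close to n*p*(1-p) = 18.75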

Binomial Moments The Binomial distribution has mean E[e] = np and variance Var[e] = np(1-p). If we match the mean, np, with the observed value n·error(h|S), we find the estimate p ≈ error(h|S). If we match the variance, we obtain an estimate of the width: σ_error ≈ sqrt( error(h|S)·(1 − error(h|S)) / n ).

Confidence Intervals We would like to state: with N% confidence we believe that error(h|P) is contained in the interval error(h|S) ± z_N·sqrt( error(h|S)·(1 − error(h|S)) / n ), where z_N is the value that puts N% of the mass of a Normal(0,1) in the central interval (e.g. z ≈ 1.28 for 80%). In principle this interval is hard to compute exactly, but for np(1-p) > 5 or n > 30 it is safe to approximate the Binomial by a Gaussian, for which we can easily compute z-values.
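A sketch of the resulting interval computation in Python, using the Normal approximation; the helper name error_confidence_interval and the example numbers are choices for this illustration.

    import numpy as np
    from scipy import stats

    def error_confidence_interval(sample_error, n, confidence=0.95):
        # Approximate N% confidence interval for error(h|P) given error(h|S),
        # valid when n*p*(1-p) > 5 or n > 30 (Normal approximation to the Binomial).
        z = stats.norm.ppf(0.5 + confidence / 2.0)          # two-sided z-value
        sigma = np.sqrt(sample_error * (1 - sample_error) / n)
        return sample_error - z * sigma, sample_error + z * sigma

    # Example: 25 errors observed on a test set of 100 points.
    print(error_confidence_interval(0.25, 100, confidence=0.80))   # roughly (0.19, 0.31)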

Bias-Variance The estimator is unbiased if its expectation over sample sets equals the true value: E[error(h|X)] = error(h|P) = p. Imagine again that you have infinitely many sample sets X1, X2, ... of size n. Use these to compute estimates E1, E2, ... of p, where Ei = error(h|Xi). If the average of E1, E2, ... converges to p, then error(h|X) is an unbiased estimator. Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer? (Figure: two estimator distributions centred on p, one narrower than the other, with Eav marking the average estimate.)

Flow of Thought 1) Determine the property you want to know about the future data (e.g. error(h|P)). 2) Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)). 3) Determine the distribution P(E) of E under the assumption that you have infinitely many sample sets X1, X2, ... of some size n (e.g. P(E) = Binomial(p, n), p = error(h|P)). 4) Estimate the parameters of P(E) from an actual data sample S (e.g. p = error(h|S)). 5) Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution (sums of random variables converge to normal distributions – central limit theorem). 6) State your confidence interval as: with confidence N%, error(h|P) is contained in the interval error(h|S) ± z_N·sqrt( error(h|S)·(1 − error(h|S)) / n ).

Assumptions We only consider discrete-valued hypotheses (i.e. classes). Training data and test data are drawn IID from the same distribution P(x). (IID: independently & identically distributed.) The hypothesis must be chosen independently of the sample S used to estimate its error! When you obtain a hypothesis from a learning algorithm, split the data into a training set and a testing set. Find the hypothesis using the training set and estimate the error on the testing set.
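A minimal sketch of this train/test protocol with scikit-learn; the dataset and the classifier are placeholders chosen for illustration.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)

    # The hypothesis is fit on the training split only, so it is chosen
    # independently of the test split used to estimate its error.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    h = GaussianNB().fit(X_train, y_train)
    test_error = np.mean(h.predict(X_test) != y_test)
    print("estimated error(h|S_test):", test_error)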

Comparing Hypotheses Assume we would like to compare two hypotheses h1 and h2, which we have tested on two independent samples S1 and S2 of sizes n1 and n2. I.e. we are interested in the quantity d = error(h1|P) − error(h2|P). Define an estimator for d: d̂ = error(h1|X1) − error(h2|X2), with X1, X2 sample sets of sizes n1, n2. Since error(h1|S1) and error(h2|S2) are both approximately Normal, their difference is approximately Normal with mean d and variance σ² ≈ error(h1|S1)(1 − error(h1|S1))/n1 + error(h2|S2)(1 − error(h2|S2))/n2. Hence, with N% confidence we believe that d is contained in the interval d̂ ± z_N·σ.
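A sketch of the corresponding interval computation, assuming the two hypotheses are evaluated on independent test samples; the function name and the example error rates are hypothetical.

    import numpy as np
    from scipy import stats

    def difference_confidence_interval(err1, n1, err2, n2, confidence=0.95):
        # Approximate N% confidence interval for d = error(h1|P) - error(h2|P),
        # when h1 and h2 are evaluated on two independent test samples.
        d_hat = err1 - err2
        sigma = np.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
        z = stats.norm.ppf(0.5 + confidence / 2.0)
        return d_hat - z * sigma, d_hat + z * sigma

    # Example with hypothetical numbers: h1 errs on 30/200 points, h2 on 45/150.
    print(difference_confidence_interval(0.15, 200, 0.30, 150))   # roughly (-0.24, -0.06)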

Paired Tests Consider the following data: error(h1|s1)=0.1 error(h2|s1)=0.11 error(h1|s2)=0.2 error(h2|s2)=0.21 error(h1|s3)=0.66 error(h2|s3)=0.67 error(h1|s4)=0.45 error(h2|s4)=0.46 and so on. We have var(error(h1)) = large, var(error(h2)) = large. Under the independence assumption, the total variance of error(h1) − error(h2) is their sum. However, h1 is consistently better than h2: we ignored the fact that we compare both hypotheses on the same data. We want a different estimator that compares the errors sample by sample. You can use a "paired t-test" (e.g. in matlab) to see whether the two errors are significantly different, or whether one error is significantly larger than the other.

Paired t-test Chunk the data up into subsets T1, ..., Tk with |Ti| > 30. On each subset compute the errors and their difference: δi = error(h1|Ti) − error(h2|Ti). Now compute the mean difference δ̄ = (1/k) Σi δi and its standard error s = sqrt( (1/(k(k−1))) Σi (δi − δ̄)² ). State: with N% confidence the difference in error between h1 and h2 is δ̄ ± t_{N,k−1}·s. "t" is the t-statistic, which is related to the Student-t distribution (table 5.6).
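A sketch of this procedure in Python; the per-subset error values at the bottom reuse the hypothetical numbers from the "Paired Tests" slide.

    import numpy as np
    from scipy import stats

    def paired_t_interval(errors_h1, errors_h2, confidence=0.95):
        # errors_h1, errors_h2: error rates of h1 and h2 measured on the SAME
        # subsets T1, ..., Tk. Returns the mean difference and its N% interval.
        deltas = np.asarray(errors_h1) - np.asarray(errors_h2)
        k = len(deltas)
        d_mean = deltas.mean()
        s = deltas.std(ddof=1) / np.sqrt(k)                  # standard error of the mean
        t_val = stats.t.ppf(0.5 + confidence / 2.0, df=k - 1)
        return d_mean, (d_mean - t_val * s, d_mean + t_val * s)

    e1 = [0.10, 0.20, 0.66, 0.45]
    e2 = [0.11, 0.21, 0.67, 0.46]
    # Here every delta is exactly -0.01, so the standard error is 0 and the
    # interval collapses to the single point -0.01.
    print(paired_t_interval(e1, e2))

With real, noisy error estimates, scipy.stats.ttest_rel(e1, e2) computes the corresponding paired t-statistic and p-value directly.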

Comparing Learning Algorithms In general it is a really bad idea to estimate error rates on the same data on which a learning algorithm is trained. WHY? So, just as in cross-validation, we split the data into k subsets: S → {T1, T2, ..., Tk}. Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset, {S−T1, S−T2, ...}, to produce hypotheses {L1(S−Ti), L2(S−Ti)} for all i. Compute for all i: δi = error(L1(S−Ti)|Ti) − error(L2(S−Ti)|Ti). Note: we train on S−Ti, but test on Ti. As on the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L1 and L2.
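A sketch of this k-fold comparison with scikit-learn; the dataset (iris) and the two learners (naive Bayes vs. a decision tree) are placeholders standing in for L1 and L2.

    import numpy as np
    from scipy import stats
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=10, shuffle=True, random_state=0)

    deltas = []
    for train_idx, test_idx in kf.split(X):
        # Train both learners on S - Ti, test both on Ti.
        h1 = GaussianNB().fit(X[train_idx], y[train_idx])
        h2 = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        err1 = np.mean(h1.predict(X[test_idx]) != y[test_idx])
        err2 = np.mean(h2.predict(X[test_idx]) != y[test_idx])
        deltas.append(err1 - err2)

    # Paired t-test: is the mean per-fold difference significantly different from 0?
    t_stat, p_value = stats.ttest_1samp(deltas, popmean=0.0)
    print("mean error difference:", np.mean(deltas), " p-value:", p_value)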

Evaluation: ROC curves (Figure: two overlapping class-conditional distributions, class 0 (negatives) and class 1 (positives), separated by a moving decision threshold.) TP = true positive rate = # positives classified as positive, divided by # positives. FP = false positive rate = # negatives classified as positive, divided by # negatives. TN = true negative rate = # negatives classified as negative, divided by # negatives. FN = false negative rate = # positives classified as negative, divided by # positives. Identify a threshold in your classifier that you can shift. Plot the ROC curve (TP rate versus FP rate) while you shift that parameter.
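A sketch of computing an ROC curve with scikit-learn; the synthetic dataset and the logistic-regression classifier are placeholders chosen so the example is self-contained.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]   # the shifting threshold is applied to this score

    # Each point on the ROC curve is (FP rate, TP rate) at one threshold setting.
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    print("area under the ROC curve:", roc_auc_score(y_test, scores))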

Conclusion Never (ever) draw error-curves without confidence intervals (The second most important sentence of this course)