Empirical Evaluation (Ch 5)


Empirical Evaluation (Ch 5)
- How accurate is a hypothesis/model/decision tree?
- Given 2 hypotheses, which is better?
- Accuracy on the training set is biased:
  error_train(h) = #misclassifications / |S_train|
  error_D(h) ≥ error_train(h)
- Could set aside a random subset of the data for testing.
- The sample error on any finite sample S drawn randomly from D is unbiased, but not necessarily the same as the true error: error_S(h) ≠ error_D(h).
- What we want is an estimate of the "true" accuracy over the distribution D.

Confidence Intervals
- Put a bound on error_D(h) based on the Binomial distribution.
- Suppose the sample error rate is error_S(h) = p on n test examples.
- E[error_D(h)] = error_S(h) = p
- Estimated var(error_S(h)) = p(1-p)/n; standard deviation s = sqrt(var), var = s^2.
- Then the 95% CI for error_D(h) is p ± 1.96·sqrt(p(1-p)/n).
- The factor 1.96 comes from the confidence level (95%).
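
A minimal sketch of this calculation in Python (the sample values p = 0.19 and n = 100 are made up for illustration):

    from math import sqrt

    def error_ci(p, n, z=1.96):
        """Approximate 95% CI for error_D(h), given sample error rate p on n test examples."""
        se = sqrt(p * (1 - p) / n)      # estimated standard deviation of error_S(h)
        return p - z * se, p + z * se

    # illustrative values: h misclassifies 19 of 100 test examples
    lo, hi = error_ci(0.19, 100)
    print(f"error_S(h) = 0.19, 95% CI for error_D(h) = [{lo:.3f}, {hi:.3f}]")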

Binomial Distribution
- Put a bound on error_D(h) based on the Binomial distribution.
- Suppose the true error rate is error_D(h) = p.
- On a sample of size n we would expect np errors on average, but the count varies around that due to sampling.
- Expressed as a proportion, the observed error rate is (#errors)/n, with variance p(1-p)/n.
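
To see this sampling variation concretely, here is a short simulation (the values p = 0.2 and n = 100 are chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 0.20, 100                          # assumed true error rate and sample size
    errors = rng.binomial(n, p, size=10_000)  # misclassification count in each simulated sample
    rates = errors / n                        # observed error rates, as proportions

    print("mean observed rate:", rates.mean())   # close to p = 0.20
    print("std of observed rate:", rates.std())  # close to sqrt(p*(1-p)/n) = 0.04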

Hypothesis Testing
- Is error_D(h) < 0.2? (Is the error rate of h less than 20%?)
- Example: is h better than the majority classifier? (Suppose error_maj(h) = 20%.)
- If we approximate the Binomial by a Normal, then μ ± 2σ should bound 95% of the likely range for error_D(h).
- Two-tailed test: the risk of the true error being higher or lower is 5%; Pr[Type I error] ≤ 0.05.
- Restrictions for the Normal approximation: n ≥ 30 or np(1-p) ≥ 5.

Gaussian Distribution 1.28s = 80% of distr. z-score: relative distance of a value x from mean

- For a one-tailed test at confidence 1-α, use the z value for a two-sided interval at confidence 1-2α.
- Example: suppose error_S(h) = 0.19 and σ = 0.02, and you want 95% confidence that error_D(h) < 20%.
- Then test whether 0.2 - error_S(h) > 1.64σ.
- 1.64 is the z value for a two-sided 90% interval.
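
A sketch of this one-tailed test using the numbers above (0.2 - 0.19 = 0.01 is less than 1.64 x 0.02 = 0.033, so in this particular example the test does not let us conclude error_D(h) < 20% at 95% confidence):

    err_s, sigma, bound, z = 0.19, 0.02, 0.20, 1.64   # values from the example above

    margin = z * sigma                                # one-sided 95% margin
    if bound - err_s > margin:
        print("conclude error_D(h) < 0.20 with ~95% confidence")
    else:
        print(f"difference ({bound - err_s:.3f}) is within {margin:.3f}: cannot conclude at 95% confidence")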

- Notice that the confidence interval on the error rate tightens with larger sample sizes.
- Example: compare 2 trials that both have 10% error.
- Test set A has 100 examples and h makes 10 errors: error = 10/100 = 0.10, σ = sqrt(0.1×0.9/100) = 0.03, so CI95(err(h)) = 10% ± 6% = [4%, 16%].
- Test set B has 100,000 examples and h makes 10,000 errors: error = 10,000/100,000 = 0.10, σ = sqrt(0.1×0.9/100,000) ≈ 0.00095, so CI95(err(h)) = 10% ± 0.19% = [9.8%, 10.2%].
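
The same calculation in Python, approximately reproducing the two intervals above:

    from math import sqrt

    def ci95(p, n):
        half = 1.96 * sqrt(p * (1 - p) / n)
        return p - half, p + half

    for n in (100, 100_000):        # the two test-set sizes from the example
        lo, hi = ci95(0.10, n)
        print(f"n = {n:>7}: 95% CI for the error rate = [{lo:.4f}, {hi:.4f}]")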

Comparing 2 Hypotheses (decision trees)
- Test whether 0 is in the confidence interval of the difference of their error rates.
- The variances of the two estimates add: σ_d^2 = σ_1^2 + σ_2^2.
- Example: see the sketch below.
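
A sketch of that test, with hypothetical error rates and sample sizes (not from the slides):

    from math import sqrt

    def diff_ci95(err1, n1, err2, n2, z=1.96):
        """95% CI for error_D(h1) - error_D(h2); the variances of the two estimates add."""
        d = err1 - err2
        se = sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
        return d - z * se, d + z * se

    # hypothetical: h1 makes 15% errors on 500 examples, h2 makes 11% errors on 400
    lo, hi = diff_ci95(0.15, 500, 0.11, 400)
    verdict = "0 is inside, so no significant difference" if lo <= 0 <= hi else "0 is outside, so the difference is significant"
    print(f"95% CI for the difference: [{lo:.3f}, {hi:.3f}] -> {verdict}")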

Estimating the Accuracy of a Learning Algorithm
- error_S(h) is the error rate of a particular hypothesis, which depends on the training data.
- What we want is an estimate of the error on any training set drawn from the distribution.
- We could repeat the splitting of the data into independent training/testing sets, build and test k decision trees, and take the average.
(Notes: We want to estimate the accuracy or error rate of a learning algorithm, in a way similar to testing a hypothesis, but we get multiple estimates by testing the learning algorithm on different sets of data.)

k-fold Cross-Validation
- Partition the dataset D into k disjoint subsets T_1..T_k of equal size (each with at least 30 examples).
- for i from 1 to k do:
    S_i = D - T_i           // training set, 90% of D when k = 10
    h_i = L(S_i)            // build a decision tree
    e_i = error(h_i, T_i)   // test the tree on the held-out 10%
- m = (1/k)·Σ e_i
- s^2 = (1/(k-1))·Σ (e_i - m)^2
- SE = s/√k = sqrt( (1/(k(k-1)))·Σ (e_i - m)^2 )
- CI95 = m ± t_{dof,α}·SE   (t_{dof,α} ≈ 2.23 for k = 10 at 95% confidence)
(Notes: This is the basic cross-validation algorithm. k is a parameter that determines how many portions you divide the data into. Ultimately you get k samples of the accuracy or error rate.)
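
A runnable sketch of this loop, using scikit-learn's decision tree as a stand-in for the learner L (the dataset and k = 10 are only for illustration):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)       # any labelled dataset will do
    k = 10
    errors = []

    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        h = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])  # h_i = L(S_i)
        e = np.mean(h.predict(X[test_idx]) != y[test_idx])                          # e_i = error(h_i, T_i)
        errors.append(e)

    m = np.mean(errors)                        # mean error over the k folds
    se = np.std(errors, ddof=1) / np.sqrt(k)   # standard error of that mean
    print(f"mean error = {m:.3f}, SE = {se:.3f}")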

k-fold Cross-Validation
- Typically k = 10.
- Note that this is a biased estimator: it probably under-estimates the true accuracy because each tree is built from fewer examples.
- This is a disadvantage of cross-validation: building decision trees with only 90% of the data (and it takes 10 times as long).

What to do with the 10 accuracies from CV?
- The accuracy of the algorithm is just the mean: (1/k)·Σ acc_i.
- For a confidence interval, use the "standard error" (SE): s^2 = (1/(k-1))·Σ (e_i - m)^2, SE = s/√k = sqrt( (1/(k(k-1)))·Σ (e_i - m)^2 ).
- SE is the standard deviation of the estimate of the mean.
- 95% CI = m ± t_{dof,α}·sqrt( (1/(k(k-1)))·Σ (e_i - m)^2 )
Central Limit Theorem
- We are estimating a "statistic" (a parameter of a distribution, e.g. the mean) from multiple trials.
- Regardless of the underlying distribution, the estimate of the mean approaches a Normal distribution.
- If the standard deviation of the underlying distribution is σ, then the standard deviation of the estimated mean is σ/√n.
(Notes: Once we have the k estimates of accuracy, we can use them to determine a confidence interval on the true accuracy (or true error rate). The Central Limit Theorem means the distribution of sample means is Normal; the more sample means you have, the better the estimate.)
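
A sketch of turning k fold errors into a t-based confidence interval (the error rates below are hypothetical):

    import numpy as np
    from scipy.stats import t

    # hypothetical error rates from k = 10 cross-validation folds
    errors = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.16, 0.12, 0.09, 0.13])
    k = len(errors)

    m = errors.mean()
    se = errors.std(ddof=1) / np.sqrt(k)    # = sqrt( (1/(k(k-1))) * sum((e_i - m)^2) )
    t_crit = t.ppf(0.975, df=k - 1)         # two-sided 95%, dof = k - 1

    print(f"95% CI for the true error: {m:.3f} +/- {t_crit * se:.3f}")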

Example: multiple trials of testing the accuracy of a learner, assuming true accuracy = 70% and σ = 7%.
- There is intrinsic variability in accuracy between different trials.
- With more trials, the observed distribution converges to the underlying one (its std. dev. stays around 7).
- But the estimate of the mean (vertical bars in the figure, m ± 2s/√n) gets tighter.
- [Figure: histograms of per-trial accuracy for increasing numbers of trials; estimates of the true mean ≈ 71.0 ± 2.5, 70.5 ± 0.6, and 69.98 ± 0.03.]
(Notes: With a low number of samples the estimate is broad; with a large number of samples the estimate is very tight around the true accuracy.)
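
A small simulation of this scenario (drawing per-trial accuracies from a Normal with mean 70 and standard deviation 7 is an assumption made for illustration, and the trial counts are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    true_acc, sd = 70.0, 7.0

    for n_trials in (10, 100, 10_000):
        accs = rng.normal(true_acc, sd, size=n_trials)   # accuracy from each independent trial
        m = accs.mean()
        se = accs.std(ddof=1) / np.sqrt(n_trials)        # std. dev. of the estimated mean
        print(f"{n_trials:>6} trials: estimate of true mean = {m:.2f} +/- {2 * se:.2f}")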

Student's T Distribution
- Student's T distribution is similar to the Normal distribution, but adjusted for small sample size; dof = k - 1.
- Example: t_{9,0.05} = 2.23 (Table 5.6).
(Notes: Because in practice we have a small number of samples (i.e., k estimates of the accuracy), we correct for that using Student's T distribution. The lower the number of samples / degrees of freedom, the broader the tails.)
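
The critical values can also be computed directly with scipy rather than read from a table (note that an exact calculation for dof = 9 gives about 2.26, while 2.23 corresponds to dof = 10):

    from scipy.stats import norm, t

    print("Normal z for a two-sided 95% interval:", round(norm.ppf(0.975), 3))   # 1.96
    for df in (3, 9, 10, 30, 100):
        print(f"t critical value (two-sided 95%), dof = {df}: {t.ppf(0.975, df):.3f}")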

Comparing 2 Learning Algorithms (e.g. ID3 with 2 different pruning methods)
- Approach 1: run each algorithm 10 times (using CV) independently to get a CI for the accuracy of each algorithm: acc(A), SE(A) and acc(B), SE(B).
- T-test: a statistical test of whether the difference of the means, d = acc(A) - acc(B), is ≠ 0.
- Problem: the variance is additive.
(Notes: We want to compare the accuracy of two learning algorithms and see if the difference is significant. One way is to run the cross-validation independently for each algorithm, then do a t-test on the difference of the accuracies. This assumes the two estimates are independent and unrelated, so the variances of the two estimates get added together (unpooled).)
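
A sketch of approach 1, assuming the two sets of fold accuracies come from independent cross-validation runs (the numbers are hypothetical); scipy's unpaired t-test does the comparison:

    import numpy as np
    from scipy.stats import ttest_ind

    # hypothetical fold accuracies (%) from two *independent* 10-fold CV runs
    acc_A = np.array([59, 63, 58, 62, 61, 60, 64, 59, 62, 61], dtype=float)
    acc_B = np.array([64, 61, 66, 63, 65, 62, 64, 66, 63, 65], dtype=float)

    stat, pval = ttest_ind(acc_B, acc_A, equal_var=False)   # unpaired (Welch) t-test, variances add
    print(f"mean difference = {acc_B.mean() - acc_A.mean():.1f} points, p-value = {pval:.3f}")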

- Suppose the mean accuracy for A is 61% ± 2 and the mean accuracy for B is 64% ± 2.
- d = acc(B) - acc(A): mean = 3%, SE ≈ 3.7 (just a guess).
- [Figure: Normal curves for A and B (accuracy axis 58-68) and the broader distribution of their difference d (axis -3 to 9).]
- If instead the two learners are trained and tested on the same folds, the per-fold results may show a systematic difference:

  acc(L_A, T_i)   acc(L_B, T_i)   d = B - A
  58              59              +2
  62              63              +1
  59              63              +4
  60              63              +3
  62              64              +2
  59              62              +3
  mean 60%        mean 63%        +3%, SE = 1%

(Notes: You can visualize this as two Normal distributions, one per learner, due to the Central Limit Theorem (top left). The distribution of their difference is broader because their variances get added (top right). Because that variance is large, 0 overlaps with the 95% confidence interval. However, if you train and test them on the same datasets, you may see a systematic difference one way (table). If you compare them in a pair-wise way, the variance is much smaller (bottom right): the mean difference is the same, but B is systematically higher than A.)

Approach 2: Paired T-test
- Run the algorithms in parallel on the same divisions of the data and test whether 0 is in the CI of the per-fold differences.
(Notes: This leads to the paired t-test approach. Do cross-validation, but train and test both learning algorithms on the same folds, and get an estimate of the difference in error/accuracy on each fold. Because the tests are paired, you can use the paired t-test, which has a smaller confidence interval: you use the information that the comparisons are paired, giving more certainty. This is the preferable method when possible: fewer CV iterations and tighter estimates of the mean difference.)
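
A sketch of the paired approach, assuming both learners were evaluated on the same ten folds (the per-fold accuracies are hypothetical, in the spirit of the table above):

    import numpy as np
    from scipy.stats import t, ttest_rel

    # hypothetical accuracies (%) of learners A and B on the *same* ten folds
    acc_A = np.array([58, 62, 59, 60, 62, 59, 61, 60, 58, 61], dtype=float)
    acc_B = np.array([60, 63, 63, 63, 64, 62, 63, 62, 61, 64], dtype=float)

    d = acc_B - acc_A                        # paired per-fold differences
    k = len(d)
    se = d.std(ddof=1) / np.sqrt(k)          # much smaller than the unpaired SE
    half = t.ppf(0.975, df=k - 1) * se       # two-sided 95% margin

    stat, pval = ttest_rel(acc_B, acc_A)     # paired t-test
    print(f"mean difference = {d.mean():.2f} +/- {half:.2f} (95% CI), p-value = {pval:.4f}")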