1
Empirical Evaluation (Ch 5)
- How accurate is a hypothesis/model/decision tree? Given 2 hypotheses, which is better?
- Training-set error: error_train(h) = #misclassifications / |S_train|.
- Accuracy on the training set is a biased (optimistic) estimate: typically error_D(h) ≥ error_train(h).
- We could set aside a random subset of the data for testing.
- The sample error for any finite sample S drawn randomly from D is unbiased, but not necessarily the same as the true error: error_S(h) ≠ error_D(h).
- What we want is an estimate of the "true" accuracy over the distribution D.
2
Confidence Intervals
- Put a bound on error_D(h) based on the Binomial distribution.
- Suppose the sample error rate is error_S(h) = p; then the 95% CI for error_D(h) is p ± 1.96·sqrt(p(1−p)/n).
- Point estimate: E[error_S(h)] = error_D(h), so we estimate error_D(h) by p.
- Estimated variance: var(error_S(h)) ≈ p(1−p)/n; standard deviation s = sqrt(var), i.e. var = s².
- The factor 1.96 (the interval is p ± 1.96·s) comes from the confidence level (95%).
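As a quick illustration of this interval, here is a sketch in Python; the "25 errors on 250 test examples" numbers are made up, not from the slides:

import math

n = 250                          # hypothetical test set size
p = 25 / n                       # hypothetical sample error rate error_S(h)

sd = math.sqrt(p * (1 - p) / n)  # estimated std. dev. of the error rate
lo, hi = p - 1.96 * sd, p + 1.96 * sd
print(f"error_S(h) = {p:.3f}, 95% CI for error_D(h): [{lo:.3f}, {hi:.3f}]")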
3
Binomial Distribution
- Put a bound on error_D(h) based on the Binomial distribution.
- Suppose the true error rate is error_D(h) = p.
- On a sample of size n, we would expect np errors on average, but the count can vary around that due to sampling.
- The error rate, as a proportion r/n, has mean p and standard deviation sqrt(p(1−p)/n).
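A small sketch of that sampling variability using scipy.stats.binom; scipy and the example values p = 0.2, n = 100 are my choices, not the slide's:

from scipy.stats import binom

p, n = 0.2, 100                                          # hypothetical true error rate, sample size
print("expected errors:", n * p)                         # np
print("std. dev. of error count:", binom.std(n, p))      # sqrt(n p (1-p))
print("std. dev. of error rate:", binom.std(n, p) / n)   # sqrt(p (1-p) / n)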
4
Hypothesis Testing
- Is error_D(h) < 0.2? (Is the error rate of h less than 20%?)
- Example: is h better than the majority classifier? (Suppose the majority classifier's error rate is 20%.)
- If we approximate the Binomial by a Normal, then μ ± 2σ should bound 95% of the likely range for error_D(h).
- Two-tailed test: the combined risk of the true error being higher or lower is 5%, i.e. Pr[Type I error] ≤ 0.05.
- Restrictions: n ≥ 30 or np(1−p) ≥ 5.
5
Gaussian Distribution
- μ ± 1.28σ covers 80% of the distribution.
- z-score: the relative distance of a value x from the mean, z = (x − μ)/σ.
6
for a one-tailed test, use the z value for twice the tail probability (i.e. the two-sided 90% value for a one-sided 95% test), for example
- Suppose error_S(h) = 0.19 and s = 0.02.
- Suppose you want 95% confidence that error_D(h) < 20%; then test whether 0.2 − error_S(h) > 1.64·s.
- 1.64 comes from the z-score for a two-sided confidence of 90%.
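Checking the arithmetic of this example in a few lines of Python (the conclusion below is my own working; it is not stated on the slide):

err_s, s = 0.19, 0.02
margin = 1.64 * s                    # one-tailed 95% margin, about 0.033
print(0.20 - err_s > margin)         # False: 0.01 < 0.033
# so with these numbers we cannot claim, at 95% confidence, that error_D(h) < 20%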
7
example: compare 2 trials that have 10% error
- Notice that the confidence interval on the error rate tightens with larger sample sizes.
- Test set A has 100 examples and h makes 10 errors: p = 10/100 = 0.10, sqrt(0.1·0.9/100) = 0.03, CI95%(err(h)) = 10% ± 6% = [4%, 16%].
- Test set B has 100,000 examples and 10,000 errors: p = 10,000/100,000 = 0.10, sqrt(0.1·0.9/100,000) ≈ 0.00095, CI95%(err(h)) = 10% ± 0.19% = [9.81%, 10.19%].
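The same calculation in code, for both test set sizes (a sketch; it roughly reproduces the slide's numbers):

import math

def ci95(p, n):
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

print(ci95(0.10, 100))      # roughly (0.04, 0.16)   -> test set A
print(ci95(0.10, 100_000))  # roughly (0.098, 0.102) -> test set B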
8
Comparing 2 hypotheses (decision trees)
- Test whether 0 is in the confidence interval of the difference d = error_S1(h1) − error_S2(h2).
- The variances add: σ_d ≈ sqrt(p1(1−p1)/n1 + p2(1−p2)/n2).
- Example...
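A sketch of that test with made-up error rates and sample sizes (the slide's own example is not reproduced here):

import math

e1, n1 = 0.15, 100          # hypothetical: h1 makes 15 errors on 100 test cases
e2, n2 = 25 / 150, 150      # hypothetical: h2 makes 25 errors on 150 test cases

d = e1 - e2
sd = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)   # variances add
lo, hi = d - 1.96 * sd, d + 1.96 * sd
print(f"d = {d:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
print("significant difference?", not (lo <= 0 <= hi))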
9
Estimating the accuracy of a Learning Algorithm
- error_S(h) is the error rate of a particular hypothesis, which depends on the training data.
- What we want is an estimate of the error over any training set drawn from the distribution.
- We could repeat the splitting of the data into independent training/testing sets, build and test k decision trees, and take the average.
- We want to estimate the accuracy or error rate of a learning algorithm, in a way similar to testing a hypothesis, but we get multiple estimates by testing the learning algorithm on different sets of data.
10
k-fold Cross-Validation
- Partition the dataset D into k subsets of equal size (at least 30 examples each), T1..Tk.
- for i from 1 to k do:
    S_i = D − T_i          // training set, 90% of D for k = 10
    h_i = L(S_i)           // build decision tree
    e_i = error(h_i, T_i)  // test the tree on the 10% held out
- m = (1/k) Σ e_i
- s² = (1/k) Σ (e_i − m)²
- SE = sqrt( (1/(k(k−1))) Σ (e_i − m)² )   (≈ s/√k)
- CI95 = m ± t_{dof,α} · SE   (t_{dof,α} = 2.23 for k = 10 and α = 95%)
- This is the basic cross-validation algorithm. k is a parameter that determines how many portions you divide the data into; ultimately you get k samples of the accuracy or error rate.
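A runnable sketch of this loop, assuming scikit-learn's KFold and DecisionTreeClassifier and a built-in dataset as stand-ins (none of these concrete choices come from the slides):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the dataset D
k = 10
errors = []

for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    h_i = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])  # h_i = L(S_i)
    errors.append(1.0 - h_i.score(X[test_idx], y[test_idx]))                      # e_i = error(h_i, T_i)

m = np.mean(errors)
se = np.std(errors, ddof=1) / np.sqrt(k)     # sqrt((1/(k(k-1))) * sum (e_i - m)^2)
print(f"mean error {m:.3f}, SE {se:.3f}")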
11
k-fold Cross-Validation
- Typically k = 10.
- Note that this is a biased estimator: it probably under-estimates the true accuracy because each tree is built with fewer examples.
- This is a disadvantage of CV: building decision trees with only 90% of the data (and it takes 10 times as long).
12
what to do with 10 accuracies from CV?
- The accuracy of the algorithm is just the mean: (1/k) Σ acc_i.
- For a CI, use the "standard error" (SE), the standard deviation of the estimate of the mean: s² = (1/k) Σ (e_i − m)², SE = sqrt( (1/(k(k−1))) Σ (e_i − m)² ).
- 95% CI = m ± t_{dof,α} · SE.
- Central Limit Theorem: we are estimating a "statistic" (a parameter of a distribution, e.g. the mean) from multiple trials; regardless of the underlying distribution, the estimate of the mean approaches a Normal distribution.
- If the std. dev. of the underlying distribution is σ, then the std. dev. of the mean is σ/√n.
- Once we have the k estimates of accuracy, we can use them to determine a confidence interval on the true accuracy (or true error rate). The Central Limit Theorem means the distribution of the sample means is Normal, and the more sample means you have, the better the estimate.
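Turning k accuracies into a t-based confidence interval, as a sketch (the ten accuracy values are invented; scipy.stats.t supplies the critical value):

import numpy as np
from scipy.stats import t

acc = np.array([0.71, 0.68, 0.74, 0.70, 0.69, 0.72, 0.73, 0.67, 0.70, 0.71])  # hypothetical fold accuracies
k = len(acc)

m = acc.mean()
se = acc.std(ddof=1) / np.sqrt(k)        # standard error of the mean
t_crit = t.ppf(0.975, df=k - 1)          # about 2.26 for dof = 9, two-sided 95%
print(f"accuracy {m:.3f} +/- {t_crit * se:.3f} (95% CI)")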
13
example: multiple trials of testing the accuracy of a learner,
assuming true acc = 70% and σ = 7%
- There is intrinsic variability in accuracy between different trials.
- With more trials, the distribution converges to the underlying one (std. dev. stays around 7), but the estimate of the mean (vertical bars, m ± 2σ/√n) gets tighter.
- With a low number of samples, the estimate is broader; with a large number of samples the estimate is very tight around the true accuracy.
- est. of true mean = 71.0 ± 2.5; est. of true mean = 70.5 ± 0.6; est. of true mean = 69.98 ± 0.03
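A tiny simulation of this convergence (my own illustration, assuming numpy; the true accuracy 70% and σ = 7% match the slide, but the trial counts are my guess):

import numpy as np

rng = np.random.default_rng(0)
true_acc, sigma = 0.70, 0.07

for n in (30, 500, 200_000):
    trials = rng.normal(true_acc, sigma, size=n)    # n independent accuracy measurements
    se = trials.std(ddof=1) / np.sqrt(n)            # shrinks roughly as sigma / sqrt(n)
    print(f"n={n:7d}  est. of true mean = {trials.mean():.4f} +/- {2 * se:.4f}")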
14
Student’s T distribution is similar to Normal distr
- Student's t distribution is similar to the Normal distribution, but adjusted for small sample size; dof = k − 1.
- Example: t_{9,0.05} = 2.23 (Table 5.6).
- Because in practice we have a small number of samples (i.e. k estimates of the accuracy), we can correct for that using Student's t distribution. The lower the number of samples / degrees of freedom, the broader the tails.
15
Comparing 2 Learning Algorithms approach 1:
- e.g. ID3 with 2 different pruning methods
- Approach 1: run each algorithm 10 times (using CV) independently to get a CI for the accuracy of each algorithm: acc(A), SE(A); acc(B), SE(B).
- T-test: a statistical test of whether the difference of the means ≠ 0, with d = acc(A) − acc(B).
- Problem: the variances are additive.
- We want to compare the accuracy of two learning algorithms and see whether the difference is significant. One way would be to do the cross-validation independently, two different times, and then do a t-test on the difference of the accuracies. This assumes the estimates are independent and unrelated, so the variances of the two estimates get added together (unpooled).
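A sketch of approach 1 with illustrative numbers close to the next slide's example (treating the ±2 values as standard errors is my assumption):

import math

acc_A, se_A = 0.61, 0.02     # illustrative
acc_B, se_B = 0.64, 0.02     # illustrative

d = acc_B - acc_A
se_d = math.sqrt(se_A**2 + se_B**2)      # independent estimates: variances add
print(f"difference {d:.2%} +/- {1.96 * se_d:.2%}")
# the interval easily contains 0, so no significant difference can be claimed this way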
16
suppose mean acc for A is 61% ± 2, mean acc for B is 64% ± 2
- d = acc(B) − acc(A): mean = 3%, SE = 3.7 (just a guess).
- You can visualize this as two normal distributions (due to the central limit theorem), each with some variance. The distribution of their difference is broader because the variances get added; and because that variance is large, 0 overlaps with the 95% confidence interval.
- However, if you had trained and tested them on the same datasets, you might have seen a systematic difference in one direction, e.g. per fold:
  acc(L_A, T_i)   acc(L_B, T_i)   d = B − A
  60%             63%             +3%        (paired differences: SE = 1%)
- If you compare them in a pair-wise way, the variance is much smaller: the mean difference is the same, but B is systematically higher than A.
17
approach 2: Paired T-test
- Run the algorithms in parallel on the same divisions of the data, and test whether 0 is in the CI of the differences: d_i = acc_A(i) − acc_B(i), CI95 = mean(d) ± t_{k−1,α} · SE(d).
- This leads to the paired t-test approach: basically just do cross-validation, but train and test both learning algorithms on the same folds, and get estimates of the difference in error/accuracy.
- Because the tests are paired, you can do a paired t-test, which has a smaller confidence interval; you use the information that the comparisons are paired, giving you more certainty.
- This is the preferable method when possible: fewer CV runs, and you get tighter estimates of the mean.
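A sketch of the paired approach, again assuming scikit-learn and scipy, with depth-limited vs. unlimited trees as a stand-in for "ID3 with 2 different pruning methods" (all of these concrete choices are mine, not the slides'):

import numpy as np
from scipy.stats import t
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)     # stand-in dataset
diffs = []

for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    acc_A = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[tr], y[tr]).score(X[te], y[te])
    acc_B = DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr]).score(X[te], y[te])
    diffs.append(acc_B - acc_A)                # paired difference on the same fold

diffs = np.array(diffs)
k = len(diffs)
se = diffs.std(ddof=1) / np.sqrt(k)
t_crit = t.ppf(0.975, df=k - 1)
lo, hi = diffs.mean() - t_crit * se, diffs.mean() + t_crit * se
print(f"mean difference {diffs.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
print("0 outside the interval (significant)?", not (lo <= 0 <= hi))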