Slide 1
How good is my classifier?
Slide 2
- We have seen the accuracy metric
- It measures classifier performance on a test set
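A minimal sketch of that metric in R; the actual and predicted label vectors below are hypothetical stand-ins for real test-set results:

# Hypothetical test-set labels and classifier predictions ("pos" = positive class)
actual    <- c("pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg")
predicted <- c("pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg")

# Accuracy: fraction of test cases the classifier got right
accuracy <- mean(predicted == actual)
accuracy   # 0.75 for these example vectors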
Slide 3
- If we are to trust a classifier's results, we must keep the classifier blindfolded
- Make sure the classifier never sees the test data
- When things seem too good to be true…
Slide 4
Confusion Matrix

              Predicted pos   Predicted neg
Actual pos    true pos        false neg
Actual neg    false pos       true neg
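Using the hypothetical actual/predicted vectors from the earlier sketch, such a matrix can be built in R with table():

# Rows = actual class, columns = predicted class
cm <- table(Actual = actual, Predicted = predicted)
cm
#        Predicted
# Actual  neg pos
#    neg    3   1
#    pos    1   3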
Slide 5
- Sensitivity: out of the cases that actually are positive, how many did the classifier correctly detect? (true pos / (true pos + false neg))
- Specificity: out of the cases that actually are negative, how many did the classifier correctly identify as negative? (true neg / (true neg + false pos))
- (The confusion matrix from Slide 4 is shown alongside)
- A classifier becomes less sensitive when it begins missing what it is trying to detect
- If it identifies more and more things as the target class, it begins to get less specific
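Continuing the hypothetical sketch above, both measures fall straight out of the confusion matrix:

TP <- cm["pos", "pos"];  FN <- cm["pos", "neg"]
TN <- cm["neg", "neg"];  FP <- cm["neg", "pos"]

sensitivity <- TP / (TP + FN)   # fraction of actual positives the classifier caught
specificity <- TN / (TN + FP)   # fraction of actual negatives correctly identified
c(sensitivity = sensitivity, specificity = specificity)   # both 0.75 for the example data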
Slide 6
- Can we quantify our uncertainty?
- Will the accuracy hold with brand new, never-before-seen data?
Slide 7
- The binomial distribution: the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments
- Successes or failures: just what we're looking for!
Slide 8
- The probability that the random variable R will take on a specific value r
- r might count errors or positives (correct predictions)
- Since we have been working with accuracy, let's go with positives; the book works with errors
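For reference, the standard binomial formula behind this (a known result, stated here for completeness): with n independent trials and success probability p,

P(R = r) = \binom{n}{r}\, p^{r} (1 - p)^{\,n - r}, \qquad r = 0, 1, \dots, n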
Slide 9 (no text content captured)
Slide 10 (no text content captured)
Slide 11
- How confident should I be in the accuracy measure?
- If we can live with statements like "95% of the accuracy measures will fall in the range of 94% to 97%", life is good
- Such a range is a confidence interval
Slide 12 (no text content captured)
Slide 13
In R:

lb = qbinom(.025, n, p)
ub = qbinom(.975, n, p)

The lower and upper bounds constitute the confidence interval.
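A minimal runnable version of that sketch, with illustrative values for n and p (assumptions, not from the slide). qbinom() returns counts of correct predictions, so dividing by n expresses the bounds as accuracies:

n <- 1000                  # assumed size of the test set
p <- 0.95                  # assumed observed accuracy on that test set
lb <- qbinom(.025, n, p)   # lower bound on the number of correct predictions
ub <- qbinom(.975, n, p)   # upper bound on the number of correct predictions
c(lb, ub) / n              # the 95% interval expressed as accuracies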
Slide 14
- What if none of the small cluster of blues were in the training set?
- All of them would be in the test set; how well would the classifier do?
- Sample error vs. true error
- It might have been an accident, a pathological case
Slide 15
- What if we could test the classifier several times with different test sets?
- If it performed well each time, wouldn't we be more confident in the results?
Slide 16
- Usually we have a big chunk of training data
- If we bust it up into randomly drawn chunks, we can train on the remainder and test with each chunk
Slide 17
- If there are 10 chunks, we train 10 times
- We now have performance data on ten completely different test datasets
Slide 18
- The classifier must stay blindfolded while training: it must never see the held-out test chunk
- All lessons must be discarded after each fold (the model is retrained from scratch every time)
Slide 19
- Weka and DataMiner both default to 10-fold cross-validation
- It could just as easily be 20-fold or 25-fold; with 20-fold each run would be a 95-5 train/test split
- Performance is reported as the average accuracy across the K runs (see the sketch below)
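A minimal sketch of 10-fold cross-validation in base R. The built-in iris data and the deliberately simple nearest-centroid classifier below are illustrative assumptions, not part of the original slides; they stand in for whatever data and classifier are actually being evaluated:

set.seed(1)
dat   <- iris
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))   # randomly assign each row to a chunk

nearest_centroid <- function(train, test) {
  feats <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
  centroids <- aggregate(train[feats], list(class = train$Species), mean)
  # squared distance from every test row to every class centroid
  d <- apply(test[feats], 1, function(x)
         apply(centroids[feats], 1, function(ctr) sum((x - ctr)^2)))
  as.character(centroids$class)[apply(d, 2, which.min)]
}

accuracies <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]                        # train on the other 9 chunks
  test  <- dat[folds == i, ]                        # test only on the held-out chunk
  pred  <- nearest_centroid(train, test)            # retrained from scratch each fold
  accuracies[i] <- mean(pred == as.character(test$Species))
}
accuracies        # one accuracy measure per fold
mean(accuracies)  # performance reported as the average across the 10 runs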
Slide 20
If 10-fold satisfies this, we should be in good shape.
Slide 21
- This is called leave-one-out
- Disadvantage: slow
- Largest possible training set, smallest possible test set
- It has been promoted as an unbiased estimator of error, but recent studies indicate that there is no unbiased estimator
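In terms of the cross-validation sketch above, leave-one-out is simply the special case where every fold holds exactly one example:

k <- nrow(dat)   # one fold per row; each example is held out exactly once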
Slide 22
- We can calculate a confidence interval with a single test set
- More runs (K-fold) give us more confidence that we didn't just get lucky in test-set selection
- Do these runs help narrow the confidence interval?
Slide 23
- The central limit theorem applies: as the number of runs grows, the distribution of accuracy measures approaches normal
- With a reasonably large number of runs we can derive a more trustworthy confidence interval
- With 30 test runs (30-fold) we can use traditional approaches to calculating the mean and standard deviation, and therefore confidence intervals
Slide 24 (no text content captured)
Slide 25
In R, where accuracies is the vector of per-fold accuracy measures:

meanAcc = mean(accuracies)
sdAcc   = sd(accuracies)
qnorm(.975, meanAcc, sdAcc)
0.9980772
qnorm(.025, meanAcc, sdAcc)
0.8169336
Slide 26
- Can we say that one classifier is significantly better than another?
- Use a t-test
- Null hypothesis: the two sets of accuracy measures come from the same distribution
Slide 27
In R:

t.test(distOne, distTwo, paired = TRUE)

        Paired t-test

data:  distOne and distTwo
t = -55.8756, df = 29, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2052696 -0.1907732
sample estimates:
mean of the differences
             -0.1980214
Slide 28
In Perl:

use Statistics::TTest;

my $ttest = new Statistics::TTest;
$ttest->load_data(\@r1, \@r2);
$ttest->set_significance(95);
$ttest->print_t_test();
print "\n\nt statistic is " . $ttest->t_statistic . "\n";
print "p val " . $ttest->{t_prob} . "\n";

Output (excerpt):
t_prob: 0
significance: 95
…
df1: 29
alpha: 0.025
t_statistic: 12.8137016607408
null_hypothesis: rejected
t statistic is 12.8137016607408
p val 0
Slide 29
Reporting the result, with increasing rigor:

1. The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set.
2. The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% using 10-fold cross-validation on a training set of size 1,000.
3. The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% using 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%.
4. The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% using 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%.
Slide 30
Randomly permute an array. From the Perl Cookbook:
http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm

# fisher_yates_shuffle(\@array): shuffle the array in place
sub fisher_yates_shuffle {
    my $array = shift;
    my $i;
    for ($i = @$array; --$i; ) {
        my $j = int rand ($i+1);
        next if $i == $j;
        @$array[$i,$j] = @$array[$j,$i];
    }
}
Slide 31 (no text content captured)
Slide 32 (no text content captured)