
How good is my classifier?

8/29/03 Evaluating Hypotheses

- We have seen the accuracy metric
- Classifier performance on a test set

- If we are to trust a classifier's results, we must keep the classifier blindfolded
- Make sure the classifier never sees the test data
- When things seem too good to be true…

- Confusion Matrix

                   Predicted
  Actual class     pos          neg
  pos              true pos     false neg
  neg              false pos    true neg

- Sensitivity
  - Out of the things that actually are positive, how many did the classifier correctly detect
- Specificity
  - Out of the things that actually are negative, how many did the classifier correctly reject

                   Predicted
  Actual class     pos          neg
  pos              true pos     false neg
  neg              false pos    true neg

- A classifier becomes less sensitive as it begins missing what it is trying to detect
- If it identifies more and more things as the target class, it begins to get less specific
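Both metrics fall straight out of the confusion-matrix cells; a Python sketch with made-up counts (the variable names and numbers below are illustrative, not from the slides):

```python
# Hypothetical confusion-matrix counts for a 200-example test set.
tp, fn = 90, 10   # actual positives: detected vs. missed
fp, tn = 5, 95    # actual negatives: false alarms vs. correct rejections

sensitivity = tp / (tp + fn)   # fraction of actual positives detected
specificity = tn / (tn + fp)   # fraction of actual negatives correctly rejected
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(sensitivity, specificity, accuracy)  # 0.9 0.95 0.925
```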

- Can we quantify our uncertainty?
- Will the accuracy hold with brand new, never-before-seen data?

- The binomial distribution: the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments
- Successes or failures: just what we're looking for!

- Probability that the random variable R will take on a specific value r
- Might be the probability of an error or of a positive
- Since we have been working with accuracy, let's go with positive
- The book works with errors
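The probability P(R = r) the slides build on is the binomial pmf, which can be sketched directly in Python (the example values are illustrative):

```python
from math import comb

def binom_pmf(r, n, p):
    """P(R = r): probability of exactly r successes in n independent
    trials, each succeeding with probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# e.g. probability of exactly 8 correct out of 10 when accuracy p = 0.8
print(binom_pmf(8, 10, 0.8))  # ~0.302
```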


- How confident should I be in the accuracy measure?
- If we can live with statements like: "95% of the accuracy measures will fall in the range of 94% to 97%," life is good
- This is a confidence interval


- In R:

  lb = qbinom(.025, n, p)   # 2.5% quantile of the number of successes
  ub = qbinom(.975, n, p)   # 97.5% quantile

- The lower and upper bound constitute the confidence interval
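For readers without R, the same quantiles can be sketched in Python from the cumulative binomial pmf. Note that qbinom returns counts of successes, so dividing by n gives bounds on the accuracy itself (n and p below are made up):

```python
from math import comb

def binom_quantile(q, n, p):
    """Smallest r whose cumulative binomial probability reaches q
    (a stdlib counterpart of R's qbinom(q, n, p))."""
    cdf = 0.0
    for r in range(n + 1):
        cdf += comb(n, r) * p**r * (1 - p)**(n - r)
        if cdf >= q:
            return r
    return n

n, p = 100, 0.9                    # hypothetical test-set size and accuracy
lb = binom_quantile(.025, n, p)    # lower bound, in counts
ub = binom_quantile(.975, n, p)    # upper bound, in counts
print(lb / n, ub / n)              # confidence interval on the accuracy
```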

- What if none of the small cluster of Blues were in the training set?
- All of them would be in the test set; how well would it do?
- Sample error vs. true error
- Might have been an accident: a pathological case

- What if we could test the classifier several times with different test sets?
- If it performed well each time, wouldn't we be more confident in the results?

- Usually we have one big chunk of training data
- If we split it up into randomly drawn chunks, we can train on the remainder and test with each chunk

- With 10 chunks, we train 10 times
- We now have performance data on ten completely different test datasets

- Must stay blindfolded while training
- Must discard all lessons learned after each fold

- Weka and DataMiner both default to 10-fold
- It could just as easily be 20-fold or 25-fold
- With 20-fold it would be a 95-5 split
- Performance is reported as the average accuracy across the K runs
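A minimal sketch of the chunking logic in Python (the classifier training itself is elided; the function name and seed are illustrative):

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Split example indices into k roughly equal, randomly drawn folds.
    Each fold serves once as the test chunk; the rest is training data."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(100, 10)
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
    # ... train the classifier on `train`, evaluate on `test_fold` ...
    assert len(test_fold) == 10 and len(train) == 90
```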

- If 10-fold satisfies this, we should be in good shape

- Called leave-one-out
- Disadvantage: slow
- Largest possible training set
- Smallest possible test set
- Has been promoted as an unbiased estimator of error
- Recent studies indicate that there is no unbiased estimator

- We can calculate a confidence interval with a single test set
- More runs (K-fold) give us more confidence that we didn't just get lucky in test-set selection
- Do these runs help narrow the confidence interval?

- The central limit theorem applies
- As the number of runs grows, the distribution of accuracies approaches normal
- With a reasonably large number of runs we can derive a more trustworthy confidence interval
- With 30 test runs (30-fold) we can use traditional approaches to calculating means and standard deviations, and therefore confidence intervals


- In R:

  meanAcc = mean(accuracies)
  sdAcc = sd(accuracies)
  qnorm(.975, meanAcc, sdAcc)   # upper bound
  qnorm(.025, meanAcc, sdAcc)   # lower bound
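The same computation is available in Python's standard library via statistics.NormalDist; a sketch with made-up per-fold accuracies:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical per-fold accuracies from a 30-fold run (made-up numbers).
accuracies = [0.95, 0.97, 0.96, 0.98, 0.97, 0.96] * 5   # 30 values

dist = NormalDist(mean(accuracies), stdev(accuracies))
lower = dist.inv_cdf(0.025)   # counterpart of qnorm(.025, meanAcc, sdAcc)
upper = dist.inv_cdf(0.975)   # counterpart of qnorm(.975, meanAcc, sdAcc)
print(lower, upper)
```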

- Can we say that one classifier is significantly better than another?
- Use a t-test
- Null hypothesis: the two sets of accuracies are from the same distribution

In R:

  t.test(distOne, distTwo, paired = TRUE)

        Paired t-test

  data:  distOne and distTwo
  t = , df = 29, p-value < 2.2e-16
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
  sample estimates: mean of the differences
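The slides use R's t.test; the core t statistic can be sketched in Python with only the standard library. The per-fold accuracies below are made up, and the p-value lookup is omitted since the stdlib has no t distribution:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(xs, ys):
    """t statistic for a paired t-test: the mean of the pairwise
    differences divided by the standard error of those differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical per-fold accuracies for two classifiers (made-up numbers).
distOne = [0.95, 0.96, 0.97, 0.95, 0.96, 0.97, 0.95, 0.96, 0.97, 0.96]
distTwo = [0.91, 0.93, 0.92, 0.90, 0.93, 0.92, 0.91, 0.93, 0.92, 0.91]

print(paired_t_statistic(distOne, distTwo))
```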

In Perl:

  use Statistics::TTest;

  my $ttest = new Statistics::TTest;
  $ttest->set_significance(95);
  $ttest->load_data(\@distOne, \@distTwo);  # load the two accuracy arrays
  $ttest->print_t_test();
  print "\n\nt statistic is ", $ttest->t_statistic, "\n";
  print "p val ", $ttest->{t_prob}, "\n";

Output:

  t_prob: 0
  significance: 95
  …
  df1: 29
  alpha:
  t_statistic:
  null_hypothesis: rejected

  t statistic is
  p val 0

Four ways of reporting the same result, each more rigorous than the last:

- "The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set."
- "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% using 10-fold cross-validation on a training set of size 1,000."
- "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% using 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%."
- "The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% using 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%."

- Randomly permute an array
- From the Perl Cookbook:

  sub fisher_yates_shuffle {
      my $array = shift;
      my $i;
      for ($i = @$array; --$i; ) {
          my $j = int rand ($i+1);
          next if $i == $j;
          @$array[$i,$j] = @$array[$j,$i];
      }
  }
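The same in-place shuffle sketched in Python, for comparison (Python's built-in random.shuffle implements this same algorithm, so in practice you would just call that):

```python
import random

def fisher_yates_shuffle(array):
    """In-place Fisher-Yates shuffle, mirroring the Perl recipe above."""
    for i in range(len(array) - 1, 0, -1):
        j = random.randrange(i + 1)          # 0 <= j <= i
        array[i], array[j] = array[j], array[i]

data = list(range(10))
fisher_yates_shuffle(data)
print(data)   # a random permutation of 0..9
```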
