1 CS 391L: Machine Learning: Experimental Evaluation Raymond J. Mooney University of Texas at Austin

2 Evaluating Inductive Hypotheses
Accuracy measured on the training data is obviously biased, since the hypothesis was constructed to fit that data.
Accuracy must be evaluated on an independent (usually disjoint) test set.
The larger the test set, the more reliable the measured accuracy and the lower the variance observed across different test sets.

3 Variance in Test Accuracy
Let error_S(h) denote the percentage of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h.
Let error_D(h) denote the true error rate for the overall data distribution D.
When n is at least 30, the central limit theorem ensures that the distribution of error_S(h) across different random samples is closely approximated by a normal (Gaussian) distribution.
[Figure: sampling distribution P(error_S(h)) of error_S(h), centered on error_D(h)]
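Not on the original slide, but the standard N% confidence interval that follows from this normal approximation (as in Mitchell, Ch. 5), assuming the n test examples are drawn independently, is:

error_D(h) ≈ error_S(h) ± z_N · sqrt( error_S(h) · (1 – error_S(h)) / n )

For example, with z_95 ≈ 1.96, an observed error of 10% on n = 100 test examples gives an approximate 95% interval of 10% ± 5.9%.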

4 Comparing Two Learned Hypotheses
When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies.
–Assume h1 is tested on test set S1 of size n1
–Assume h2 is tested on test set S2 of size n2
[Figure: sampling distributions of error_S1(h1) and error_S2(h2); in this sample, h1 is observed to be more accurate than h2]

5 Comparing Two Learned Hypotheses
When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies.
–Assume h1 is tested on test set S1 of size n1
–Assume h2 is tested on test set S2 of size n2
[Figure: the same sampling distributions; in this sample, h1 is observed to be less accurate than h2]

6 Statistical Hypothesis Testing
Determines the probability that an empirically observed difference in a statistic could be due purely to random chance, assuming there is no true underlying difference.
Specific tests exist for determining the significance of the difference between two means computed from samples gathered under different conditions.
Such a test estimates the probability, under the null hypothesis that the two samples were actually drawn from the same underlying distribution, of observing a difference at least this large.
By scientific convention, we reject the null hypothesis and call the difference statistically significant if this probability is less than 5% (p < 0.05); alternatively, we say the difference is due to an underlying cause with confidence (1 – p).

7 One-sided vs. Two-sided Tests
A one-sided test assumes you expected a difference in a particular direction (A is better than B) and the observed difference is consistent with that expectation.
A two-sided test does not assume an expected difference in either direction.
The two-sided test is more conservative, since it requires a larger observed difference to conclude that the difference is significant.

8 Z-Score Test for Comparing Learned Hypotheses
Assumes h1 is tested on test set S1 of size n1 and h2 is tested on test set S2 of size n2.
Compute the difference between the accuracy of h1 and h2.
Compute the standard deviation of the sample estimate of the difference.
Compute the z-score for the difference (the standard formulas are given below).
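The formulas themselves are not reproduced in this transcript; the standard forms, written in terms of error rates (following Mitchell, Ch. 5), are:

d = error_S1(h1) – error_S2(h2)

σ_d ≈ sqrt( error_S1(h1)·(1 – error_S1(h1))/n1 + error_S2(h2)·(1 – error_S2(h2))/n2 )

z = |d| / σ_d

(The difference in accuracies has the same magnitude as the difference in error rates.)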

9 Z-Score Test for Comparing Learned Hypotheses (continued)
Determine the confidence in the difference by looking up the highest confidence, C, for the given z-score in a table.
This gives the confidence for a two-tailed test; for a one-tailed test, increase the confidence halfway toward 100%.
Confidence level:  50%   68%   80%   90%   95%   98%   99%
z-score:           0.67  1.00  1.28  1.64  1.96  2.33  2.58
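A small sketch of this computation in Python, assuming SciPy is available; the error rates and test-set sizes are made-up illustrative values (they happen to give z ≈ 1.64, i.e. the 90%/95% confidences quoted in Sample Z-Score Test 1 below):

# Sketch of the z-score test for two hypotheses evaluated on independent test sets.
from math import sqrt
from scipy.stats import norm

def z_test(err1, n1, err2, n2):
    """Return (z, two_tailed_conf, one_tailed_conf) for the observed difference."""
    d = err1 - err2                                   # observed difference in error rates
    sigma_d = sqrt(err1 * (1 - err1) / n1 +
                   err2 * (1 - err2) / n2)            # std. dev. of the estimate of d
    z = abs(d) / sigma_d
    two_tailed = 2 * norm.cdf(z) - 1                  # P(|Z| <= z) under the standard normal
    one_tailed = norm.cdf(z)                          # = two_tailed + (1 - two_tailed) / 2
    return z, two_tailed, one_tailed

# Hypothetical example: h1 errs on 20 of 100 test cases, h2 on 30 of 100.
z, conf2, conf1 = z_test(0.20, 100, 0.30, 100)
print(f"z = {z:.2f}, two-tailed confidence = {conf2:.0%}, one-tailed = {conf1:.0%}")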

10 Sample Z-Score Test 1
Assume we test two hypotheses on different test sets of size 100 and observe:
[Observed error rates and the computed z-score shown as an image on the original slide]
Confidence for two-tailed test: 90%
Confidence for one-tailed test: (100 – (100 – 90)/2) = 95%

11 Sample Z-Score Test 2
Assume we test two hypotheses on different test sets of size 100 and observe:
[Observed error rates and the computed z-score shown as an image on the original slide]
Confidence for two-tailed test: 50%
Confidence for one-tailed test: (100 – (100 – 50)/2) = 75%

12 Z-Score Test Assumptions
Hypotheses can be tested on different test sets; if the same test set is used, stronger conclusions might be warranted.
Test sets have at least 30 independently drawn examples.
Hypotheses were constructed from independent training sets.
The test only compares the two specific hypotheses, regardless of the methods used to construct them; it does not compare the underlying learning methods in general.

13 Comparing Learning Algorithms
Comparing the average accuracy of hypotheses produced by two different learning systems is more difficult, since we need to average over multiple training sets.
Ideally, we want to measure the expected difference in true error (see the expression below), where L_X(S) represents the hypothesis learned by method L_X from training data S.
To accurately estimate this, we need to average over multiple, independent training and test sets.
However, since labeled data is limited, we generally must average over multiple splits of the overall data set into training and test sets.
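The expression is not reproduced in the transcript; the quantity being described (as in Mitchell, Ch. 5) is the expected difference in true error over random training sets S drawn from D:

E_{S ⊂ D} [ error_D(L_A(S)) – error_D(L_B(S)) ]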

14 K-Fold Cross Validation
Randomly partition data D into k disjoint equal-sized subsets P1 … Pk
For i from 1 to k do:
  Use Pi for the test set and the remaining data for training: Si = (D – Pi)
  hA = LA(Si)
  hB = LB(Si)
  δi = error_Pi(hA) – error_Pi(hB)
Return the average difference in error: δ̄ = (1/k) Σ_{i=1..k} δi
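A minimal sketch of this procedure in Python, assuming scikit-learn is available; the dataset and the two learners are placeholder choices, not anything prescribed by the slides:

# Paired k-fold comparison of two learners, as in the pseudocode above.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

k = 10
deltas = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    h_A = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    h_B = GaussianNB().fit(X[train_idx], y[train_idx])
    err_A = 1.0 - h_A.score(X[test_idx], y[test_idx])   # error of h_A on fold P_i
    err_B = 1.0 - h_B.score(X[test_idx], y[test_idx])   # error of h_B on fold P_i
    deltas.append(err_A - err_B)                         # delta_i

print("average difference in error:", np.mean(deltas))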

15 K-Fold Cross Validation Comments
Every example gets used as a test example once and as a training example k–1 times.
All test sets are independent; however, the training sets overlap significantly.
Measures the accuracy of hypotheses generated from ((k–1)/k)·|D| training examples.
The standard method is 10-fold. If k is low, there are not enough train/test trials; if k is high, the test sets are small, test variance is high, and run time increases.
If k = |D|, the method is called leave-one-out cross validation.

16 Significance Testing
Typically k < 30, so there are not enough trials for a z-test.
Can use a (Student’s) t-test, which is more accurate when the number of trials is low.
Can use a paired t-test, which can find smaller differences significant when the train/test splits are the same for both systems (see the sketch below).
However, both the z and t tests assume the trials are independent. This is not true for k-fold cross validation:
–Test sets are independent
–Training sets are not independent
Alternative statistical tests have been proposed, such as McNemar’s test.
Although no test is perfect when data is limited and independent trials are not practical, some statistical test that accounts for variance is desirable.
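A minimal sketch of the paired t-test on per-fold error rates, assuming SciPy is available; the per-fold error rates below are made-up values for two systems evaluated on the same k splits:

# Paired t-test on per-fold error rates from the same k train/test splits.
# SciPy's ttest_rel implements the paired (related-samples) t-test.
from scipy.stats import ttest_rel

# Hypothetical per-fold error rates for two systems on the same 5 folds.
errs_A = [0.13, 0.17, 0.12, 0.18, 0.15]
errs_B = [0.19, 0.21, 0.14, 0.23, 0.18]

t_stat, p_value = ttest_rel(errs_A, errs_B)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p (e.g. p < 0.05) suggests the difference in error is statistically
# significant, subject to the independence caveats noted on this slide.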

17 Sample Experimental Results

Experiment 1
            SystemA   SystemB   Diff
  Trial 1     87%       82%      +5%
  Trial 2     83%       78%      +5%
  Trial 3     88%       83%      +5%
  Trial 4     82%       77%      +5%
  Trial 5     85%       80%      +5%
  Average     85%       80%      +5%

Experiment 2
            SystemA   SystemB   Diff
  Trial 1     90%       82%      +8%
  Trial 2     93%       76%     +17%
  Trial 3     80%       85%      –5%
  Trial 4     85%       75%     +10%
  Trial 5     77%       82%      –5%
  Average     85%       80%      +5%

Which experiment provides better evidence that SystemA is better than SystemB?
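To make the contrast concrete, here is a small added illustration (not from the slides) that summarizes the per-trial differences of each experiment with NumPy; the difference values are taken from the table above:

# Compare the per-trial accuracy differences (SystemA - SystemB) for each experiment.
import numpy as np

diff_exp1 = np.array([5, 5, 5, 5, 5], dtype=float)      # Experiment 1 differences (%)
diff_exp2 = np.array([8, 17, -5, 10, -5], dtype=float)  # Experiment 2 differences (%)

for name, d in [("Experiment 1", diff_exp1), ("Experiment 2", diff_exp2)]:
    # Same mean difference, but very different spread across the paired trials.
    print(f"{name}: mean = {d.mean():+.1f}%, std = {d.std(ddof=1):.1f}%")

# Experiment 1's differences are perfectly consistent (zero variance), so a paired
# test would treat its +5% gap as far stronger evidence than Experiment 2's noisy +5%.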

18 Learning Curves
Plots accuracy vs. size of the training set.
Has maximum accuracy (Bayes optimal) nearly been reached, or will more examples help?
Is one system better when training data is limited?
Most learners eventually converge to Bayes optimal given sufficient training examples.
[Figure: test accuracy vs. number of training examples, with reference levels for 100%, Bayes optimal, and random guessing]

19 Cross Validation Learning Curves
Split the data into k equal partitions.
For trial i = 1 to k do:
  Use partition i for testing and the union of all other partitions for training.
  For each desired point p on the learning curve do:
    For each learning system L:
      Train L on the first p examples of the training set and record training time, training accuracy, and learned concept complexity.
      Test L on the test set, recording testing time and test accuracy.
Compute the average of each performance statistic across the k trials.
Plot curves for any desired performance statistic versus training set size.
Use a paired t-test to determine the significance of any differences between two systems for a given training set size.
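A sketch of this procedure restricted to test accuracy, assuming scikit-learn; the learners, dataset, and curve points are placeholder choices:

# Cross-validated learning curves for two learners (test accuracy only).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
learners = {"tree": DecisionTreeClassifier(random_state=0), "nb": GaussianNB()}
points = [25, 50, 100, 200, 400]            # training-set sizes for the curve
k = 10
rng = np.random.default_rng(0)

acc = {name: {p: [] for p in points} for name in learners}
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    train_idx = rng.permutation(train_idx)   # random order, so "first p" is a random subset
    for p in points:
        sub = train_idx[:p]
        for name, learner in learners.items():
            h = learner.fit(X[sub], y[sub])
            acc[name][p].append(h.score(X[test_idx], y[test_idx]))

for name in learners:
    curve = [np.mean(acc[name][p]) for p in points]
    print(name, [f"{a:.3f}" for a in curve])  # one averaged curve per learner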

20 Noise Curves
Plot accuracy versus noise level to determine relative resistance to noisy training data.
Artificially add category or feature noise by randomly replacing some specified fraction of category or feature values with random values.
[Figure: test accuracy vs. % noise added]
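A minimal sketch of injecting category (label) noise, assuming the labels are a NumPy array of class indices; the function name and the noise fraction are illustrative, not from the slides:

# Inject category (label) noise: replace a fraction of labels with random class values.
import numpy as np

def add_label_noise(y, fraction, seed=None):
    """Return a copy of y with `fraction` of its entries replaced by random classes."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    classes = np.unique(y)
    n_noisy = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)   # positions to corrupt
    y_noisy[idx] = rng.choice(classes, size=n_noisy)         # replace with random classes
    return y_noisy

# Example: corrupt 10% of the training labels before fitting each learner.
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
print(add_label_noise(y, fraction=0.10, seed=0))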

21 Experimental Evaluation Conclusions
Good experimental methodology is important for evaluating learning methods.
It is important to test on a variety of domains to demonstrate a general bias that is useful for a variety of problems. Testing on 20+ data sets is common.
A variety of data sources are freely available:
–UCI Machine Learning Repository
–KDD Cup (large data sets for data mining)
–CoNLL Shared Task (natural language problems)
Data from real problems is preferable to artificial problems for demonstrating a bias that is useful on real-world problems.
Many available datasets have been subjected to significant feature engineering to make them learnable.