
EVALUATION David Kauchak CS 451 – Fall 2013

Admin
Assignment 3:
- change the constructor to take zero parameters
- instead, in the train method, call getFeatureIndices() from the dataset and do the weight initialization there
Reading

Supervised evaluation: labeled data (data + label pairs) is split into training data and testing data.

Supervised evaluation: train a classifier (the model) on the training data.

Supervised evaluation: for the testing data, pretend like we don’t know the labels.

Supervised evaluation: pretend like we don’t know the labels and use the model to classify the test examples.

Supervised evaluation: classify the test examples with the model, then compare the predicted labels to the actual labels.
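A minimal sketch of this whole pipeline, assuming scikit-learn and a toy dataset (the course assignments use their own framework, so the library and dataset here are illustrative, not the course's code):

```python
# Supervised evaluation sketch: split labeled data, train a classifier,
# classify the held-out examples, and compare predicted to actual labels.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)          # labeled data: data + label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000)            # any classifier works here
model.fit(X_train, y_train)                          # train on the training data

predictions = model.predict(X_test)                  # pretend we don't know the test labels
print("accuracy:", accuracy_score(y_test, predictions))   # compare predicted to actual labels
```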

Comparing algorithms: given the same labeled data and two trained classifiers, model 1 and model 2, is model 2 better than model 1?

Idea 1: evaluate each model’s predictions against the labels to get score 1 and score 2; model 2 is better if score 2 > score 1. When would we want to do this type of comparison?

Idea 1: compute score 1 and score 2, compare, and pick the better model. Any concerns?

Is model 2 better?
- Model 1: 85% accuracy vs. Model 2: 80% accuracy
- Model 1: 85.5% accuracy vs. Model 2: 85.0% accuracy
- Model 1: 0% accuracy vs. Model 2: 100% accuracy

Comparing scores: significance. Just comparing scores on one data set isn’t enough! We don’t just want to know which system is better on this particular data, we want to know if model 1 is better than model 2 in general. Put another way, we want to be confident that the difference is real and not just due to random chance.

Idea 2: model 2 is better if score 2 + c > score 1, i.e. require the difference to exceed some constant c. Is this any better?

Idea 2: model 2 better if score 2 + c > score 1. NO! Key: we don’t know the variance of the scores.

Variance: recall that variance (or standard deviation) helped us predict how likely certain events are. How do we know how variable a model’s accuracy is?

Variance: recall that variance (or standard deviation) helped us predict how likely certain events are. We need multiple accuracy scores! Ideas?

Repeated experimentation: rather than just splitting the labeled data into training and testing once, split multiple times.

Repeated experimentation DataLabel Training data DataLabel DataLabel … = development = train

n-fold cross validation: break the training data into n equal-size parts; repeat for all parts/splits: train on n-1 parts and optimize on the other (split 1, split 2, split 3, …).

n-fold cross validation: evaluate each split to get score 1, score 2, score 3, …

n-fold cross validation (see the sketch below):
- better utilization of labeled data
- more robust: don’t just rely on one test/development set to evaluate the approach (or for optimizing parameters)
- multiplies the computational overhead by n (have to train n models instead of just one)
- 10 is the most common choice of n
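A minimal sketch of n-fold cross validation, assuming scikit-learn’s KFold and a numpy array dataset; the helper name cross_val_scores is just for illustration. The per-fold scores it returns are exactly what the statistical tests below operate on.

```python
# n-fold cross validation sketch: break the data into n parts, train on n-1,
# score on the remaining part, and collect one score per fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def cross_val_scores(X, y, n_folds=10):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, eval_idx in kf.split(X):
        model = LogisticRegression(max_iter=5000)
        model.fit(X[train_idx], y[train_idx])            # train on n-1 parts
        preds = model.predict(X[eval_idx])               # evaluate on the held-out part
        scores.append(accuracy_score(y[eval_idx], preds))
    return np.array(scores)                              # score 1, score 2, ..., score n
```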

Leave-one-out cross validation: n-fold cross validation where n = number of examples, aka “jackknifing”. Pros/cons? When would we use this?

Leave-one-out cross validation:
- Can be very expensive if training is slow and/or if there are a large number of examples
- Useful in domains with limited training data: maximizes the data we can use for training
- Some classifiers are very amenable to this approach (e.g.?)
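Leave-one-out is just the n = number-of-examples special case. A sketch, again assuming scikit-learn and numpy arrays (the helper name loo_scores is illustrative):

```python
# Leave-one-out cross validation: each example is held out once.
# Expensive: trains len(X) models.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def loo_scores(X, y):
    scores = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = LogisticRegression(max_iter=5000)
        model.fit(X[train_idx], y[train_idx])
        # one held-out example: score is 1 if it was predicted correctly, else 0
        scores.append(float(model.predict(X[test_idx])[0] == y[test_idx][0]))
    return scores
```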

Comparing systems: sample 1 (per-split accuracy table): both models average 85. Is model 2 better than model 1?

Comparing systems: sample 2 (per-split accuracy table): model 1 averages 82, model 2 averages 85. Is model 2 better than model 1?

Comparing systems: sample 3 (per-split accuracy table): model 1 averages 82, model 2 averages 85. Is model 2 better than model 1?

Comparing systems: two per-split accuracy tables, both with model 1 averaging 82 and model 2 averaging 85. What’s the difference?

Comparing systems: both tables have model 1 averaging 82 and model 2 averaging 85, but their standard deviations differ. Even though the averages are the same, the variance is different!
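To see the point numerically, compute both the mean and the standard deviation of the per-split scores. The numbers below are hypothetical (the original table values aren’t shown), chosen so the means match but the spreads don’t:

```python
# Same average, different variance: hypothetical per-split accuracies.
import numpy as np

sample_a = np.array([85, 85, 85, 85, 85])   # low variance
sample_b = np.array([70, 95, 80, 90, 90])   # same mean, high variance

for scores in (sample_a, sample_b):
    print("mean = %.1f, std dev = %.1f" % (scores.mean(), scores.std(ddof=1)))
```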

Comparing systems: sample 4 (per-split accuracy table, with standard deviations): model 1 averages 83, model 2 averages 85. Is model 2 better than model 1?

Comparing systems: sample 4, now also showing the per-split difference (model 2 – model 1): model 1 averages 83, model 2 averages 85. Is model 2 better than model 1?

Comparing systems: sample 4. Looking at the per-split differences (model 2 – model 1), model 2 is ALWAYS better.

Comparing systems: sample 4. How do we decide if model 2 is better than model 1?

Statistical tests
Setup:
- Assume some default hypothesis about the data that you’d like to disprove, called the null hypothesis
- e.g. model 1 and model 2 are not statistically different in performance
Test:
- Calculate a test statistic from the data (often assuming something about the data)
- Based on this statistic, with some probability we can reject the null hypothesis, that is, show that it is unlikely to hold

t-test: determines whether two samples come from the same underlying distribution or not.

t-test
Null hypothesis: model 1 and model 2 accuracies are no different, i.e. they come from the same distribution.
Assumptions: there are a number that often aren’t completely true, but we’re often not too far off.
Result: probability that the difference in accuracies is due to random chance (low values are better).

Calculating the t-test
For our setup, we’ll do what’s called a “paired t-test”:
- The values can be thought of as pairs, where they were calculated under the same conditions
- In our case, the same train/test split
- Gives more power than the unpaired t-test (we have more information)
For almost all experiments, we’ll do a “two-tailed” version of the t-test.
Can calculate by hand or in code, but why reinvent the wheel: use Excel or a statistical package.
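A sketch of the paired, two-tailed t-test on per-fold accuracies, assuming scipy is the statistical package; the specific scores are hypothetical stand-ins for the per-split results above.

```python
# Paired two-tailed t-test on per-fold accuracies.
# scipy.stats.ttest_rel pairs the scores by fold (same train/test split).
from scipy.stats import ttest_rel

model1_scores = [81, 84, 82, 80, 83, 82, 84, 81, 82, 81]   # hypothetical per-fold accuracies
model2_scores = [84, 86, 85, 83, 86, 85, 87, 84, 85, 85]

t_stat, p_value = ttest_rel(model1_scores, model2_scores)
print("t = %.3f, p = %.4f" % (t_stat, p_value))
# small p (e.g. < 0.05): reject the null hypothesis that the two models perform the same
```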

p-value
The result of a statistical test is often a p-value.
p-value: the probability of seeing a difference at least this large if the null hypothesis were true. Put another way, if we re-ran this experiment multiple times (say on different data), it tells us how often we would reject the null hypothesis incorrectly (i.e. the probability we’d be wrong).
Common values to consider “significant”: 0.05 (95% confident), 0.01 (99% confident) and 0.001 (99.9% confident).

Comparing systems: sample 1: both models average 85. Is model 2 better than model 1? They are the same with: p = 1.

Comparing systems: sample 2: model 1 averages 82, model 2 averages 85. Is model 2 better than model 1? They are the same with: p = 0.15.

Comparing systems: sample 3: model 1 averages 82, model 2 averages 85. Is model 2 better than model 1? They are the same with: p = 0.007.

Comparing systems: sample 4: model 1 averages 83, model 2 averages 85. Is model 2 better than model 1? They are the same with: p = 0.001.

Statistical tests on test data: the labeled data (data with labels) is split into all-training data and test data, and the all-training data is further split into training data and development data. On the training side we can do cross-validation with a t-test. Can we do that here, on the test data?

Bootstrap resampling
Given a training set t with n samples, do m times:
- sample n examples with replacement from the training set to create a new training set t’
- train model(s) on t’
- calculate performance on the test set
Then calculate a t-test (or other statistical test) on the collection of m results.

Bootstrap resampling: sample with replacement from all the training data (data with labels), train a model on the sample, and evaluate it on the test data; repeat m times to get m samples.
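A sketch of this bootstrap procedure, assuming numpy and scikit-learn; bootstrap_scores and model_factory are illustrative names, and the per-resample scores are what get fed into the t-test.

```python
# Bootstrap resampling: m times, sample n examples with replacement from the
# training set, train on the resample (t'), and score on the fixed test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def bootstrap_scores(model_factory, X_train, y_train, X_test, y_test, m=30, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    scores = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)           # sample n indices with replacement
        model = model_factory()
        model.fit(X_train[idx], y_train[idx])      # train on t'
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return np.array(scores)                        # m scores; feed these to a t-test

# e.g. compare two approaches (second model is a placeholder):
# scores1 = bootstrap_scores(lambda: LogisticRegression(max_iter=5000), Xtr, ytr, Xte, yte)
# scores2 = bootstrap_scores(lambda: SomeOtherClassifier(), Xtr, ytr, Xte, yte)
# from scipy.stats import ttest_rel; print(ttest_rel(scores1, scores2))
```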

Experimentation good practices
Never look at your test data!
During development:
- Compare different models/hyperparameters on development data
- Use cross-validation to get more consistent results
- If you want to be confident with results, use a t-test and look for p ≤ 0.05
For final evaluation, use bootstrap resampling combined with a t-test to compare final approaches.