
Evaluation (practice)

2 Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
- It depends on the amount of test data (see the simulation sketch below).
- Prediction is just like tossing a (biased!) coin: "head" is a "success", "tail" is an "error".
- In statistics, a succession of independent events like this is called a Bernoulli process.
- Statistical theory provides us with confidence intervals for the true underlying proportion.
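To see how the amount of test data affects the spread of the estimated error rate, here is a minimal simulation sketch (Python and NumPy are our choice here, not part of the original slides): it repeatedly draws test sets of different sizes from a coin with true error rate 0.25 and reports how far the estimates scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.25

for n in (100, 1000, 10000):
    # 1000 simulated test sets of size n; each instance errs with prob 0.25
    estimates = rng.binomial(n, true_error, size=1000) / n
    print(f"N={n:6d}: estimated error rate roughly "
          f"{estimates.mean():.3f} +/- {2 * estimates.std():.3f}")
```

The spread shrinks roughly as 1/sqrt(N), which is exactly what the confidence intervals on the next slides quantify.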

3 Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials
  - Estimated success rate: 75%
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
- Another example: S = 75 and N = 100
  - Estimated success rate: 75%
  - With 80% confidence, p ∈ [69.1%, 80.1%]
  - I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8.
- The bigger N is, the more confident we are, i.e. the surrounding interval is smaller.
- Above, for N = 100 we were less confident than for N = 1000.
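The intervals quoted on this slide are consistent with the Wilson score interval using z = 1.28 (the two-sided 80%-confidence quantile of the standard normal). A sketch to reproduce them (our reconstruction, not code from the slides):

```python
import math

def wilson_interval(S, N, z=1.28):
    """Wilson score interval for a binomial proportion.
    z = 1.28 corresponds to ~80% two-sided confidence."""
    f = S / N
    center = f + z**2 / (2 * N)
    spread = z * math.sqrt(f * (1 - f) / N + z**2 / (4 * N**2))
    denom = 1 + z**2 / N
    return (center - spread) / denom, (center + spread) / denom

print(wilson_interval(750, 1000))  # ~(0.732, 0.767)
print(wilson_interval(75, 100))    # ~(0.691, 0.801)
```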

4 Mean and Variance
- Let Y be the random variable with possible values 1 for success and 0 for error.
- Let the probability of success be p.
- Then the probability of error is q = 1 − p.
- What's the mean? 1·p + 0·q = p
- What's the variance? (1 − p)²·p + (0 − p)²·q = q²·p + p²·q = pq(q + p) = pq
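A quick numerical sanity check of these two formulas (a sketch of ours, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.7
y = rng.random(1_000_000) < p  # Bernoulli(p) samples as 0/1

print(y.mean())  # ~0.70 (= p)
print(y.var())   # ~0.21 (= p * (1 - p) = pq)
```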

5 Estimating p
- We don't know p; our goal is to estimate it.
- For this we make N trials, i.e. tests.
- The more trials we do, the more confident we are.
- Let S be the random variable denoting the number of successes; i.e., S is the sum of N sampled values of Y.
- Now we approximate p with the success rate in N trials, i.e. S/N.
- By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with mean p and variance pq/N.

6 Estimating p
- A c% confidence interval [−z ≤ X ≤ z] for a random variable X with mean 0 is given by: Pr[−z ≤ X ≤ z] = c
- With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
- Confidence limits for the normal distribution with mean 0 and variance 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

- Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
- To use this, we have to transform our random variable f = S/N to have mean 0 and unit variance.
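Instead of a lookup table, the same z values can be obtained from a statistics library. A sketch using SciPy (our choice of tool, not one the slides use):

```python
from scipy.stats import norm

# z such that Pr[X >= z] = 5%, i.e. Pr[-z <= X <= z] = 90%
print(norm.ppf(0.95))  # ~1.645
# 80% two-sided confidence -> Pr[X >= z] = 10%
print(norm.ppf(0.90))  # ~1.282
```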

7 Estimating p
- Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
- To use this, we transform our random variable S/N to have mean 0 and unit variance:
  Pr[−1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65] = 90%
- Now we solve two equations:
  (S/N − p) / σ_{S/N} = 1.65
  (S/N − p) / σ_{S/N} = −1.65

8 Estimating p
- Let N = 100 and S = 70.
- σ_{S/N} is sqrt(pq/N), and we approximate it by sqrt(p′(1 − p′)/N), where p′ is the estimate of p, i.e. 0.7.
- So σ_{S/N} is approximated by sqrt(0.7 × 0.3 / 100) ≈ 0.046.
- The two equations become:
  (0.7 − p) / 0.046 = 1.65  →  p = 0.7 − 1.65 × 0.046 = 0.624
  (0.7 − p) / 0.046 = −1.65  →  p = 0.7 + 1.65 × 0.046 = 0.776
- Thus we say: with 90% confidence, the success rate p of the classifier satisfies 0.624 ≤ p ≤ 0.776.
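The same computation as a small function (a sketch; the function name and the use of Python are ours):

```python
import math

def normal_approx_interval(S, N, z=1.65):
    """Normal-approximation confidence interval for p.
    z = 1.65 gives ~90% confidence."""
    p_hat = S / N
    sigma = math.sqrt(p_hat * (1 - p_hat) / N)  # estimated sigma of S/N
    return p_hat - z * sigma, p_hat + z * sigma

print(normal_approx_interval(70, 100))  # ~(0.624, 0.776)
```

Calling `normal_approx_interval(590, 1000, z=2)` reproduces the exercise on the next two slides.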

9 Exercise
- Suppose I want to be 95% confident in my estimate.
- Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95%
- Normalizing S/N, we need to solve:
  (S/N − p) / σ_f = 2
  (S/N − p) / σ_f = −2
- We approximate σ_f with sqrt(p′(1 − p′)/N), where p′ is the estimate of p through trials, i.e. S/N.
- Solving the two equations for p gives:
  p = p′ ± 2 × sqrt(p′(1 − p′)/N)

10 Exercise
- Suppose N = 1000 trials, S = 590 successes.
- p′ = S/N = 590/1000 = 0.59
- σ_f ≈ sqrt(0.59 × 0.41 / 1000) ≈ 0.0156
- p = 0.59 ± 2 × 0.0156, so with 95% confidence, 0.559 ≤ p ≤ 0.621.

11 Cross-validation
- k-fold cross-validation:
  - First step: split the data into k subsets of equal size.
  - Second step: use each subset in turn for testing, and the remainder for training.
- The error estimates are averaged to yield an overall error estimate.
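A minimal sketch of k-fold cross-validation (our illustration in Python; `train` and `error_rate` are hypothetical placeholders for a learning scheme and its evaluation function):

```python
import numpy as np

def cross_val_error(X, y, train, error_rate, k=10, seed=0):
    """Average test-set error over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)  # k roughly equal-sized subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])   # fit on k-1 folds
        errors.append(error_rate(model, X[test], y[test]))  # test on the held-out fold
    return float(np.mean(errors))
```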

12 Comparing data mining schemes
- Frequent question: which of two learning schemes performs better?
- Obvious way: compare, for example, their 10-fold cross-validation estimates.
- Problem: variance in the estimate. We don't know whether the results are reliable, so we need a statistical test for that.

13 Paired t-test
- Student's t-test tells us whether the means of two samples are significantly different.
- In our case the samples are cross-validation estimates for different datasets from the domain.
- We use a paired t-test because the individual samples are paired: the same cross-validation split is applied to both schemes.

14 Distribution of the means
- x_1, x_2, …, x_k and y_1, y_2, …, y_k are the 2k samples for the k different datasets.
- m_x and m_y are the means.
- With enough samples, the mean of a set of independent samples is normally distributed.
- The estimated variances of the means are s_x²/k and s_y²/k.
- If μ_x and μ_y are the true means, then the following are approximately normally distributed with mean 0 and variance 1:
  (m_x − μ_x) / sqrt(s_x²/k)  and  (m_y − μ_y) / sqrt(s_y²/k)

15 Student's distribution
- With small samples (k < 30) the mean follows Student's distribution with k − 1 degrees of freedom.
- It has a similar shape to the normal distribution, but is wider.
- Confidence limits (mean 0 and variance 1), e.g. for 9 degrees of freedom:

  Pr[X ≥ z]   z
  0.1%        4.30
  0.5%        3.25
  1%          2.82
  5%          1.83
  10%         1.38
  20%         0.88

16 Distribution of the differences
- Let m_d = m_x − m_y.
- The difference of the means (m_d) also has a Student's distribution with k − 1 degrees of freedom.
- Let s_d² be the estimated variance of the difference.
- The standardized version of m_d is called the t-statistic:
  t = m_d / sqrt(s_d² / k)

17 Performing the test
- Fix a significance level α.
- If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ.
- Divide the significance level by two because the test is two-tailed.
- Look up the value z that corresponds to α/2.
- If t ≤ −z or t ≥ z, then the difference is significant, i.e. the null hypothesis (that the difference is zero) can be rejected.
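Putting slides 13 through 17 together, here is a sketch of the full paired t-test on two schemes' per-dataset error rates (Python; the example arrays are hypothetical, not the data from the next slide):

```python
import numpy as np
from scipy.stats import t as t_dist

def paired_t_test(x, y, alpha=0.10):
    """Two-tailed paired t-test on k paired samples."""
    d = np.asarray(x) - np.asarray(y)
    k = len(d)
    m_d = d.mean()
    s_d2 = d.var(ddof=1)                   # estimated variance of the differences
    t = m_d / np.sqrt(s_d2 / k)            # t-statistic, k-1 degrees of freedom
    z = t_dist.ppf(1 - alpha / 2, k - 1)   # critical value for alpha/2 per tail
    return t, z, abs(t) >= z

# hypothetical per-dataset error rates for schemes A and B
a = [0.20, 0.25, 0.22, 0.30, 0.28, 0.24, 0.26, 0.21, 0.23, 0.27]
b = [0.18, 0.23, 0.21, 0.27, 0.26, 0.22, 0.25, 0.20, 0.22, 0.25]
t, z, significant = paired_t_test(a, b)
print(f"t = {t:.2f}, critical z = {z:.2f}, significant: {significant}")
```

For a ready-made version, `scipy.stats.ttest_rel` computes the same t-statistic directly.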

18 Example
- We have compared two classifiers through cross-validation on 10 different datasets.
- The error rates are:

  Dataset   Classifier A   Classifier B   Difference

19 Example
- m_d = 0.48; s_d is computed from the differences in the table.
- The critical value of t for a two-tailed statistical test with α = 10% and 9 degrees of freedom is 1.83.
- The resulting t is far bigger than 1.83, so classifier B is significantly better than A.

20 Dependent estimates
- We assumed that we have enough data to create several datasets of the desired size.
- We need to reuse data if that's not the case, e.g. by running cross-validations with different randomizations on the same data.
- The samples then become dependent, and insignificant differences can become significant.
- A heuristic test for this situation is the corrected resampled t-test:
  - Assume we use the repeated holdout method, with n_1 instances for training and n_2 for testing.
  - The new test statistic is:
    t = m_d / sqrt( (1/k + n_2/n_1) · s_d² )
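A sketch of the corrected statistic (our Python rendering of the formula above, following Nadeau and Bengio's correction):

```python
import numpy as np

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-test statistic.
    diffs: per-repetition differences between two schemes' error rates,
    from k repeated holdouts with n_train/n_test instances each."""
    d = np.asarray(diffs)
    k = len(d)
    m_d = d.mean()
    s_d2 = d.var(ddof=1)
    # the n_test/n_train term inflates the variance to account for
    # the dependence between repetitions on the same data
    return m_d / np.sqrt((1 / k + n_test / n_train) * s_d2)
```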