CpSc 881: Machine Learning
Evaluating Hypotheses

Copyright Notice

Most slides in this presentation are adapted from the textbook slides and other sources; the copyright belongs to the original authors. Thanks!

Basics of Sampling Theory

A random variable can be viewed as the name of an experiment with a probabilistic outcome; its value is the outcome of the experiment.

A probability distribution for a random variable Y specifies the probability Pr(Y = y_i) that Y will take on the value y_i, for each possible value y_i.

Consider a random variable Y that takes on possible values y_1, …, y_n. The expected value (or mean value) of Y is:

    E[Y] = Σ_{i=1..n} y_i · Pr(Y = y_i)

The variance of a random variable Y is:

    Var[Y] = E[(Y − E[Y])²]

The variance characterizes the width, or dispersion, of the distribution about its mean.

The standard deviation σ_Y of a random variable Y is the square root of the variance.
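As a small illustration (with hypothetical values, not from the slides), the following Python sketch computes the mean, variance, and standard deviation of a discrete random variable directly from these definitions:

    import math

    # Hypothetical discrete distribution: values y_i with probabilities Pr(Y = y_i)
    values = [0, 1, 2, 3]
    probs = [0.1, 0.4, 0.3, 0.2]

    # E[Y] = sum_i y_i * Pr(Y = y_i)
    mean = sum(y * p for y, p in zip(values, probs))

    # Var[Y] = E[(Y - E[Y])^2]
    variance = sum((y - mean) ** 2 * p for y, p in zip(values, probs))

    std_dev = math.sqrt(variance)
    print(mean, variance, std_dev)  # 1.6 0.84 0.9165...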

Basics of Sampling Theory: Binomial Distribution

The Binomial distribution gives the probability of observing r heads in a series of n independent coin tosses, if the probability of heads on a single coin toss is p. Its probability function is:

    P(X = r) = n! / (r! (n − r)!) · p^r (1 − p)^(n−r)

A reasonable estimate of p is r/n.

If the random variable X follows a Binomial distribution, then:
The expected, or mean, value of X is E[X] = np.
The variance of X is Var(X) = np(1 − p).
The standard deviation of X is σ_X = √(np(1 − p)).
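As a quick check of these formulas (a sketch with hypothetical numbers), the probability function and the moments can be computed directly in Python:

    import math

    def binom_pmf(r, n, p):
        # P(X = r) = n! / (r! (n - r)!) * p^r * (1 - p)^(n - r)
        return math.comb(n, r) * p ** r * (1 - p) ** (n - r)

    n, p = 40, 0.3                       # hypothetical: 40 tosses, P(heads) = 0.3
    print(binom_pmf(12, n, p))           # probability of observing exactly 12 heads
    print(n * p)                         # E[X] = np = 12.0
    print(n * p * (1 - p))               # Var(X) = np(1 - p) = 8.4
    print(math.sqrt(n * p * (1 - p)))    # standard deviation, ~2.9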

Basics of Sampling Theory: Normal Distribution

A Normal distribution (also called a Gaussian distribution) is a bell-shaped distribution defined by the probability density function:

    p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

A Normal distribution is fully determined by two parameters: μ and σ.

If the random variable X follows a Normal distribution, then:
The expected, or mean, value of X is E[X] = μ.
The variance of X is Var(X) = σ².
The standard deviation of X is σ_X = σ.
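A minimal sketch of this density function in Python (the peak value at x = μ is a quick sanity check):

    import math

    def normal_pdf(x, mu, sigma):
        # p(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    print(normal_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of the standard normal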

Confidence Intervals

An N% confidence interval estimate for a parameter p is an interval that includes p with probability N%.

If a random variable Y obeys a Normal distribution with mean μ and standard deviation σ, then a measured value y of Y will fall into the following interval with N% confidence:

    μ ± z_N · σ

Bounds can be two-sided or one-sided.
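The constants z_N come from the inverse cumulative distribution function of the standard Normal. A sketch using Python's standard library (statistics.NormalDist, available since Python 3.8):

    from statistics import NormalDist

    def z_value(confidence):
        # Two-sided z_N: N% of the mass lies within mu +/- z_N * sigma
        return NormalDist().inv_cdf(0.5 + confidence / 2)

    for c in (0.80, 0.90, 0.95, 0.99):
        print(f"{c:.0%}: z = {z_value(c):.2f}")  # 1.28, 1.64, 1.96, 2.58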

Basics of Sampling Theory: Central Limit Theorem

The Central Limit Theorem states that the sum of a large number of independent, identically distributed random variables approximately follows a Normal distribution.

Consider a set of independent, identically distributed random variables Y_1, …, Y_n, all governed by an arbitrary probability distribution with mean μ and standard deviation σ. Define the sample mean:

    Ȳ = (1/n) Σ_{i=1..n} Y_i

Central Limit Theorem: as n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean μ and variance σ²/n.
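A small simulation (a sketch; the uniform distribution here is an arbitrary choice) makes the theorem concrete: sample means of a decidedly non-Normal distribution cluster around μ with variance close to σ²/n:

    import random
    import statistics

    n, trials = 30, 10000
    sample_means = [statistics.mean(random.random() for _ in range(n))
                    for _ in range(trials)]

    # Uniform(0,1) has mu = 0.5 and sigma^2 = 1/12, so Var(sample mean) ~ 1/(12*30) ~ 0.00278
    print(statistics.mean(sample_means))      # ~0.5
    print(statistics.variance(sample_means))  # ~0.00278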

Evaluating Hypotheses: Motivation

Evaluating the performance of learning systems is important because:
Learning systems are usually designed to predict the class of "future" unlabeled data points.
In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

Difficulties in Evaluating Hypotheses When Only Limited Data Are Available

Bias in the estimate: The observed accuracy of the learned hypothesis over the training examples is a poor estimator of its accuracy over future examples. To avoid this bias, we test the hypothesis on a test set chosen independently of the training set and the hypothesis.

Variance in the estimate: Even with a separate test set, the measured accuracy can vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the test set, the greater the expected variance.

Questions Considered

Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?

Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?

When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

Two Definitions of Error

Definition 1: The sample error (denoted error_S(h)) of hypothesis h with respect to target function f and data sample S is:

    error_S(h) = (1/n) · Σ_{x∈S} δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

Definition 2: The true error (denoted error_D(h)) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

    error_D(h) = Pr_{x∼D}[f(x) ≠ h(x)]
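Definition 1 translates directly into code. A minimal sketch (the label lists are hypothetical):

    def sample_error(f_labels, h_labels):
        # error_S(h) = (1/n) * sum over x in S of delta(f(x), h(x))
        n = len(f_labels)
        return sum(1 for f_x, h_x in zip(f_labels, h_labels) if f_x != h_x) / n

    # Hypothetical labels: true targets f(x) vs. hypothesis predictions h(x)
    f_labels = [1, 0, 1, 1, 0, 1, 0, 1]
    h_labels = [1, 0, 0, 1, 0, 1, 1, 1]
    print(sample_error(f_labels, h_labels))  # 2 disagreements out of 8 -> 0.25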

Problems Estimating Error

Bias: If S is the training set, error_S(h) is optimistically biased:

    bias = E[error_S(h)] − error_D(h)

For an unbiased estimate, h and S must be chosen independently.

Variance: Even with an unbiased S, error_S(h) may still vary from error_D(h).

Estimating Hypothesis Accuracy

Two questions of interest:
Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution? ==> sample vs. true error
What is the probable error in this accuracy estimate? ==> confidence intervals

Estimators

An estimator is a random variable used to estimate some parameter of an underlying population; error_S(h) is such a random variable.

Experiment: choose a sample of size n according to distribution D and measure error_S(h).

Example: Hypothesis h misclassifies 12 of the 40 examples in S, so error_S(h) = 12/40 = 0.30.

Given the observed error_S(h), what can we conclude about error_D(h)?

Estimators, Bias and Variance

error_S(h) follows a Binomial distribution, with:

    error_S(h) = r/n
    error_D(h) = μ_{error_S(h)} = p

where n is the number of instances in the sample S, r is the number of instances from S misclassified by h, and p is the probability of misclassifying a single instance drawn from D.

Definition: The estimation bias (distinct from the inductive bias) of an estimator Y for an arbitrary parameter p is E[Y] − p. If the estimation bias is zero, we say that Y is an unbiased estimator. Estimation bias should not be confused with inductive bias.

error_S(h) is an unbiased estimator for error_D(h): the expected value of r is np, so the expected value of r/n is p.

Estimators, Bias and Variance

In general, given r errors in a sample of n independently drawn test examples, the standard deviation of error_S(h) is:

    σ_{error_S(h)} = √(p(1 − p)/n)

Using the approximation p ≈ r/n = error_S(h), this becomes:

    σ_{error_S(h)} ≈ √(error_S(h)(1 − error_S(h))/n)
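Applied to the running example from the earlier slide (12 of 40 examples misclassified), a short sketch:

    import math

    def error_std_dev(r, n):
        # sigma ~ sqrt(error_S(h) * (1 - error_S(h)) / n), using p ~ r/n
        e = r / n
        return math.sqrt(e * (1 - e) / n)

    print(error_std_dev(12, 40))  # ~0.072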

Confidence Intervals for Discrete-Valued Hypotheses

If S contains n ≥ 30 examples, drawn independently of h and of each other, we can approximate the distribution of error_S(h) as a Normal distribution. The general expression for the approximate N% confidence interval for error_D(h) is:

    error_S(h) ± z_N · √(error_S(h)(1 − error_S(h))/n)

where z_N is given by:

    N%:  50%   68%   80%   90%   95%   98%   99%
    z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

This approximation is quite good when n · error_S(h)(1 − error_S(h)) ≥ 5.

Confidence Intervals for Discrete-Valued Hypotheses

If S contains n ≥ 30 examples, drawn independently of h and of each other, the approximate 95% confidence interval for error_D(h) is:

    error_S(h) ± 1.96 · √(error_S(h)(1 − error_S(h))/n)
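Putting the last two slides together, a sketch that computes the N% confidence interval for error_D(h) (reusing the 12-of-40 example; z_N is obtained from the standard-library Normal inverse CDF):

    import math
    from statistics import NormalDist

    def error_confidence_interval(r, n, confidence=0.95):
        # error_S(h) +/- z_N * sqrt(error_S(h) * (1 - error_S(h)) / n)
        e = r / n
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        half_width = z * math.sqrt(e * (1 - e) / n)
        return e - half_width, e + half_width

    print(error_confidence_interval(12, 40))  # roughly (0.16, 0.44)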

Calculating Confidence Intervals

1. Pick the parameter p to be estimated, e.g. error_D(h).
2. Choose an estimator Y, e.g. error_S(h). It is desirable to choose a minimum-variance, unbiased estimator.
3. Determine the probability distribution D_Y that governs the estimator Y, including its mean and variance. (error_S(h) is governed by a Binomial distribution, approximated as Normal when n ≥ 30.)
4. Determine the N% confidence interval by finding thresholds L and U such that N% of the mass in the probability distribution D_Y falls between L and U.

Difference in Error of Two Hypotheses

Let h1 and h2 be two hypotheses for some discrete-valued target function. h1 has been tested on a sample S1 containing n1 randomly drawn examples, and h2 has been tested on an independent sample S2 containing n2 examples drawn from the same distribution.

The difference between the true errors of these two hypotheses is:

    d = error_D(h1) − error_D(h2)

d can be estimated by the difference between the sample errors:

    d̂ = error_S1(h1) − error_S2(h2)

Difference in Error of Two Hypotheses

The difference of two Normal distributions is also a Normal distribution, so d̂ can also be approximated by a Normal distribution, whose variance is the sum of the variances of error_S1(h1) and error_S2(h2):

    σ²_d̂ ≈ error_S1(h1)(1 − error_S1(h1))/n1 + error_S2(h2)(1 − error_S2(h2))/n2

The approximate N% confidence interval for d is:

    d̂ ± z_N · √(error_S1(h1)(1 − error_S1(h1))/n1 + error_S2(h2)(1 − error_S2(h2))/n2)
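A sketch of this interval for two hypothetical error counts (all numbers illustrative):

    import math
    from statistics import NormalDist

    def diff_confidence_interval(r1, n1, r2, n2, confidence=0.95):
        # d_hat = error_S1(h1) - error_S2(h2); its variance is the sum of the two variances
        e1, e2 = r1 / n1, r2 / n2
        var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        d = e1 - e2
        return d - z * math.sqrt(var), d + z * math.sqrt(var)

    # Hypothetical: h1 errs on 30 of 100 test examples, h2 on 20 of 100
    print(diff_confidence_interval(30, 100, 20, 100))  # interval around d_hat = 0.10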

Comparing Learning Algorithms

Which of L_A and L_B is the better learning method, on average, for learning some particular target function f?

To answer this question, we wish to estimate the expected value of the difference in their errors:

    E_{S⊂D}[error_D(L_A(S)) − error_D(L_B(S))]

Of course, since we have only a limited sample D_0, we estimate this quantity by dividing D_0 into a training set S_0 and a test set T_0 and measuring:

    error_T0(L_A(S_0)) − error_T0(L_B(S_0))

Problem: We are only measuring the difference in errors for one training set S_0, rather than the expected value of this difference over all samples S drawn from D.

Solution: k-fold cross-validation.

k-Fold Cross-Validation

1. Partition the available data D_0 into k disjoint subsets T_1, T_2, …, T_k of equal size, where this size is at least 30.
2. For i from 1 to k, use T_i as the test set and the remaining data as the training set S_i:
    S_i <- {D_0 − T_i}
    h_A <- L_A(S_i)
    h_B <- L_B(S_i)
    δ_i <- error_Ti(h_A) − error_Ti(h_B)
3. Return the value avg(δ), where:

    avg(δ) = (1/k) Σ_{i=1..k} δ_i
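A sketch of the procedure in Python. The learner and error functions (train_a, train_b, error_fn) are hypothetical placeholders for L_A, L_B, and error_Ti:

    import random

    def k_fold_diff(data, train_a, train_b, error_fn, k=10, seed=0):
        # Paired k-fold estimate of error(L_A) - error(L_B)
        data = list(data)
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]        # k disjoint subsets T_1..T_k
        deltas = []
        for i in range(k):
            test = folds[i]                           # T_i
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]  # S_i
            h_a, h_b = train_a(train), train_b(train)
            deltas.append(error_fn(h_a, test) - error_fn(h_b, test))  # delta_i
        return sum(deltas) / k, deltas                # avg(delta) and per-fold values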

Confidence of the k-Fold Estimate

The approximate N% confidence interval for estimating

    E_{S⊂D_0}[error_D(L_A(S)) − error_D(L_B(S))]

using avg(δ) is:

    avg(δ) ± t_{N,k−1} · s_{avg(δ)}

where t_{N,k−1} is a constant analogous to z_N (see [Mitchell, Table 5.6]) and s_{avg(δ)} is an estimate of the standard deviation of the distribution governing avg(δ):

    s_{avg(δ)} = √( (1/(k(k−1))) Σ_{i=1..k} (δ_i − avg(δ))² )
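A sketch of this interval; SciPy is assumed to be available for the t-distribution quantile (stats.t.ppf), and the per-fold differences are hypothetical:

    import math
    from scipy import stats

    def k_fold_confidence_interval(deltas, confidence=0.95):
        # avg(delta) +/- t_{N,k-1} * s_avg, with
        # s_avg = sqrt( 1/(k(k-1)) * sum_i (delta_i - avg)^2 )
        k = len(deltas)
        avg = sum(deltas) / k
        s_avg = math.sqrt(sum((d - avg) ** 2 for d in deltas) / (k * (k - 1)))
        t = stats.t.ppf(0.5 + confidence / 2, df=k - 1)
        return avg - t * s_avg, avg + t * s_avg

    # Hypothetical per-fold differences from a 5-fold run
    print(k_fold_confidence_interval([0.02, -0.01, 0.03, 0.00, 0.01]))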