Machine Learning. 陈昱, Institute of Computer Science and Technology, Peking University, Information Security Engineering Research Center. Course information: Instructor: 陈昱 (Tel: 82529680); TA: 程再兴 (Tel: 62763742); Course webpage: uexi/jqxx2011.mht



Ch5 Evaluating Hypotheses
1. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional samples? (hypothesis accuracy)
2. Given that hypothesis h outperforms h’ over some sample of data, how probable is it that h outperforms h’ in general? (difference between hypotheses)
3. When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

Agenda
 Estimating hypothesis accuracy
 Basics of sampling theory
 Deriving confidence intervals (general approach)
 Difference between hypotheses
 Comparing learning algorithms

Learning Problem Setting
 Space of possible instances X (e.g. the set of all people) over which target functions may be defined.
 Assume that different instances in X may be encountered with different frequencies.
 Model this assumption as an unknown probability distribution D that defines the probability of encountering each instance in X.
 Training examples are provided by drawing instances independently from X, according to D.

Bias & Variance
With limited data, two difficulties arise when we try to estimate the accuracy of a learned hypothesis:
 Bias: The training examples typically provide an optimistically biased estimate of the accuracy of the learned hypothesis over future examples (overfitting problem).
 Variance: Even if the hypothesis accuracy is measured over an unbiased set of test examples, the makeup of the test set can still affect the measured accuracy.

Agenda
 Estimating hypothesis accuracy
 Basics of sampling theory
 Deriving confidence intervals (general approach)
 Difference between hypotheses
 Comparing learning algorithms

Qs in Focus
1. Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from D?
2. What is the probable error in this estimate?

Sample Error & True Error
 The sample error of hypothesis h w.r.t. target function f and a data sample S of n examples is
   error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x)),
 where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
 The true error of hypothesis h w.r.t. target function f and distribution D is
   error_D(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]
 So the two questions become: how well does error_S(h) estimate error_D(h)?
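As a quick illustration, the sample error is just the observed misclassification rate. A minimal sketch (the hypothesis `h` and sample `S` below are hypothetical toy values, not from the lecture):

```python
def sample_error(h, samples):
    """error_S(h): fraction of (x, f_x) pairs in `samples` that h misclassifies."""
    n = len(samples)
    misclassified = sum(1 for x, f_x in samples if h(x) != f_x)
    return misclassified / n

# Toy hypothesis: classify an integer as positive iff x >= 0.
h = lambda x: x >= 0
S = [(-2, False), (-1, True), (0, True), (3, True), (5, False)]
print(sample_error(h, S))  # misclassifies 2 of 5 examples -> 0.4
```

The true error error_D(h) cannot be computed this way, since D is unknown; it can only be estimated from error_S(h), which is what the rest of the chapter addresses.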

Confidence Interval for Discrete-Valued Hypotheses
 Assume sample S contains n examples drawn independently of one another, and independently of h, according to distribution D, with n ≥ 30.
 Then, given no other information, the most probable value of error_D(h) is error_S(h); furthermore, with approximately 95% probability, error_D(h) lies in the interval
   error_S(h) ± 1.96 · sqrt(error_S(h)(1 − error_S(h))/n)
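The interval above is straightforward to compute. A small sketch (the example of 20 errors over 100 test examples is hypothetical):

```python
import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate CI for error_D(h): error_S(h) +/- z_N * sqrt(error_S(h)(1-error_S(h))/n)."""
    half_width = z * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half_width, error_s + half_width

# e.g. h misclassifies 20 of n = 100 test examples; 95% confidence uses z = 1.96:
lo, hi = error_confidence_interval(0.20, 100)
print(lo, hi)  # approximately (0.1216, 0.2784)
```

Note how the width shrinks as 1/sqrt(n): quadrupling the test set roughly halves the interval.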

Agenda
 Estimating hypothesis accuracy
 Basics of sampling theory
 Deriving confidence intervals (general approach)
 Difference between hypotheses
 Comparing learning algorithms

Binomial Probability Distribution
 Probability P(r) of r heads in n coin flips, given Pr(head in one flip) = p:
   P(r) = [n! / (r!(n−r)!)] · p^r (1−p)^(n−r)
 Expected value of a binomial random variable X = b(n,p): E[X] = np
 Variance of X: Var(X) = np(1−p)
 Standard deviation of X: σ_X = sqrt(np(1−p))
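These formulas can be checked directly with the standard library (the parameters n = 40, p = 0.3 are hypothetical):

```python
import math

def binomial_pmf(r, n, p):
    """P(r) = C(n, r) * p^r * (1-p)^(n-r)."""
    return math.comb(n, r) * p**r * (1.0 - p)**(n - r)

n, p = 40, 0.3
mean = n * p                 # E[X] = np
variance = n * p * (1 - p)   # Var(X) = np(1-p)
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))  # pmf sums to 1
```

In the evaluation setting, r (the number of misclassified test examples) plays the role of the head count, with p = error_D(h).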

Example
 Remark: the binomial pmf has a bell shape (figure omitted).

Computing error_S(h)
 Assume h misclassifies r examples from a set S of n samples; then error_S(h) = r/n, where r is distributed as the binomial random variable b(n, error_D(h)).

Normal Distribution
 80% of the area of the probability density function N(μ,σ) lies in μ ± 1.28σ
 In general, N% of the area of N(μ,σ) lies in μ ± z_N σ
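The z_N constants come from the inverse CDF of the standard normal, available in the standard library since Python 3.8; a quick sketch recovering the two values quoted above:

```python
from statistics import NormalDist

def z_for_two_sided(confidence):
    """z_N such that a `confidence` fraction of N(0,1) mass lies within ±z_N."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

z80 = z_for_two_sided(0.80)  # ~1.28
z95 = z_for_two_sided(0.95)  # ~1.96
```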

Approximating error_S(h)
 When n is large enough, the distribution of error_S(h) can be approximated by a Normal distribution with the same expected value and variance, i.e. N(error_D(h), error_D(h)(1−error_D(h))/n) (a corollary of the Central Limit Theorem).
 The rule of thumb: n ≥ 30, or n · error_D(h)(1−error_D(h)) ≥ 5.

Confidence Interval for the Estimate of error_D(h)
 It follows that with approximately N% probability, error_S(h) lies in the interval error_D(h) ± z_N sqrt[error_D(h)(1−error_D(h))/n]
 Equivalently, error_D(h) lies in the interval error_S(h) ± z_N sqrt[error_D(h)(1−error_D(h))/n], which can be approximated (by Bernoulli's law of large numbers) as error_S(h) ± z_N sqrt[error_S(h)(1−error_S(h))/n]
 Therefore we have derived the confidence interval for a discrete-valued hypothesis.

Two-Sided & One-Sided Bounds
 Sometimes it is desirable to convert a two-sided bound into a one-sided bound, for example when we are interested in the question "What is the probability that error_D(h) is at most U (a certain upper bound)?"
 Convert a two-sided bound into a one-sided bound using the symmetry of the normal distribution (Fig. 5.1 in the textbook).

Qs in Focus
1. Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from D?
   A: Prefer an unbiased estimator with minimum variance.
2. What is the probable error in this estimate?
   A: Derive a confidence interval.

Agenda
 Estimating hypothesis accuracy
 Basics of sampling theory
 Deriving confidence intervals (general approach)
 Difference between hypotheses
 Comparing learning algorithms

General Approach
1. Choose the parameter p to be estimated, e.g. error_D(h).
2. Choose an estimator, desirably unbiased and with minimum variance, e.g. error_S(h) with large n.
3. Determine the probability distribution that governs the estimator.
4. Find an interval (L, U) such that N% of the probability mass falls in the interval.

Central Limit Theorem
 Consider a set of independent, identically distributed (i.i.d.) random variables Y_1 … Y_n, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean Ȳ_n = (1/n) Σ_i Y_i.
 Central Limit Theorem: As n → ∞, the distribution governing Ȳ_n approaches N(μ, σ²/n).
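The theorem is easy to see empirically. A small simulation sketch (the parameters n = 50, 2000 trials, and Bernoulli p = 0.3 are hypothetical): the sample means cluster around μ = p with spread close to sqrt(p(1−p)/n) ≈ 0.0648.

```python
import random
import statistics

random.seed(0)
n, trials, p = 50, 2000, 0.3   # Bernoulli(p): mu = p, sigma^2 = p(1-p)

# Empirical distribution of the sample mean of n i.i.d. Bernoulli(p) draws.
sample_means = [
    sum(random.random() < p for _ in range(n)) / n
    for _ in range(trials)
]

emp_mean = statistics.mean(sample_means)   # should approach mu = 0.3
emp_sd = statistics.pstdev(sample_means)   # should approach sqrt(p(1-p)/n)
```

A histogram of `sample_means` would show the bell shape predicted by N(μ, σ²/n).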

Approximating error_S(h) by a Normal Distribution
 In the Central Limit Theorem, take the underlying distribution to be a Bernoulli experiment with p = error_D(h), and we are done!

Agenda
 Estimating hypothesis accuracy
 Basics of sampling theory
 Deriving confidence intervals (general approach)
 Difference between hypotheses
 Comparing learning algorithms

Ch5 Evaluating Hypotheses
1. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional samples? (hypothesis accuracy, done!)
2. Given that hypothesis h outperforms h’ over some sample, how probable is it that h outperforms h’ in general? (difference between hypotheses, this section)
3. When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

Difference in Error
Test h_1 on sample S_1, test h_2 on S_2.
1. Choose the parameter to be estimated: d ≡ error_D(h_1) − error_D(h_2)
2. Choose an estimator: d̂ ≡ error_S1(h_1) − error_S2(h_2)
Properties of d̂:
 It is an unbiased estimator.
 When n is large enough (e.g. ≥ 30), d̂ can be approximated by the difference of two Normal distributions, which is itself a Normal distribution, with mean = d and, when the two tests are independent, variance = var(error_S1(h_1)) + var(error_S2(h_2)) ≈ error_S1(h_1)(1−error_S1(h_1))/n_1 + error_S2(h_2)(1−error_S2(h_2))/n_2
3. ……
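Putting the estimator and its variance together gives a confidence interval for d; a minimal sketch (the sample errors 0.3 and 0.2 over n = 100 are hypothetical values, reused in the next slide):

```python
import math

def diff_error_interval(e1, n1, e2, n2, z=1.96):
    """Approximate CI for d = error_D(h1) - error_D(h2), assuming independent tests."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

lo, hi = diff_error_interval(0.3, 100, 0.2, 100)  # roughly (-0.019, 0.219)
```

Note that the interval contains 0, so at the 95% two-sided level this observed difference alone does not establish that h_1 is truly worse.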

Difference in Error (2)
 Remark: when S_1 = S_2, the variance of the estimator usually becomes smaller (the difference in the composition of the two sample sets is eliminated).

Hypothesis Testing
 Consider instead the question "What is the probability that error_D(h_1) ≥ error_D(h_2)?"
 E.g. with S_1, S_2 of size 100, error_S1(h_1) = 0.3, error_S2(h_2) = 0.2, hence d̂ = 0.1 and σ_d̂ ≈ 0.061.
 Pr(d > 0) corresponds to a one-sided interval: the observed d̂ = 0.1 lies about 1.64σ_d̂ above 0.
 1.64σ corresponds to a two-sided interval with confidence level 90%, i.e. a one-sided interval with confidence level 95%.
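The arithmetic behind this example can be verified directly (values taken from the slide above):

```python
import math
from statistics import NormalDist

e1, e2, n = 0.3, 0.2, 100
d_hat = e1 - e2                                       # observed difference, 0.10
sigma = math.sqrt(e1*(1 - e1)/n + e2*(1 - e2)/n)      # ~0.061
z = d_hat / sigma                                     # ~1.64 standard deviations
confidence_d_positive = NormalDist().cdf(z)           # one-sided confidence, ~0.95
```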

Agenda
 Estimating hypothesis accuracy
 Basics of sampling theory
 Deriving confidence intervals (general approach)
 Difference between hypotheses
 Comparing learning algorithms

Ch5 Evaluating Hypotheses
1. Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional samples? (hypothesis accuracy)
2. Given that hypothesis h outperforms h’ over some sample, how probable is it that h outperforms h’ in general? (difference between hypotheses)
3. When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

Qs in Focus
Let L_A and L_B be two learning algorithms.
 What is an appropriate test for comparing L_A and L_B?
 How can we determine whether an observed difference is statistically significant?

Statement of Problem
We want to estimate
  E_{S⊂D}[error_D(L_A(S)) − error_D(L_B(S))],
where L(S) is the hypothesis output by learner L using training set S.
Remark: The difference in errors is averaged over all training sets S of size n randomly drawn from D.
In practice, given limited data D_0, what is a good estimator?
 Partition D_0 into training set S_0 and test set T_0, and measure error_T0(L_A(S_0)) − error_T0(L_B(S_0)).
 Even better: repeat the above many times and average the results.

Procedure
1. Partition D_0 into k disjoint subsets T_1, T_2, …, T_k of equal size, where this size is at least 30.
2. For i from 1 to k, do:
   use T_i for testing
   S_i ← D_0 − T_i
   h_A ← L_A(S_i)
   h_B ← L_B(S_i)
   δ_i ← error_Ti(h_A) − error_Ti(h_B)
3. Return the average δ̄ of the δ_i as the estimate.
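The procedure above can be sketched as follows. This is a minimal illustration, not a production implementation: the two toy learners and the synthetic data in the usage check are hypothetical.

```python
import random

def compare_learners(learner_a, learner_b, data, k=10, seed=0):
    """k-fold paired comparison; returns (mean delta, per-fold deltas).

    Each learner maps a list of (x, y) pairs to a hypothesis h with h(x) -> y.
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k disjoint test sets T_i
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [ex for j in range(k) if j != i for ex in folds[j]]  # S_i = D_0 - T_i
        h_a, h_b = learner_a(train), learner_b(train)
        def err(h):
            return sum(h(x) != y for x, y in test) / len(test)
        deltas.append(err(h_a) - err(h_b))          # delta_i
    return sum(deltas) / k, deltas

# Toy check (hypothetical data): 70% positive labels; compare a majority-class
# learner against a learner that always predicts False.
def majority_learner(train):
    maj = sum(1 for _, y in train if y) * 2 >= len(train)
    return lambda x: maj

def always_false_learner(train):
    return lambda x: False

data = [(i, i < 70) for i in range(100)]
mean_delta, deltas = compare_learners(majority_learner, always_false_learner, data)
```

Here each fold is smaller than the ≥ 30 examples the slide requires; real comparisons should use data sets large enough to honor that bound.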

Estimator
 The approximate N% confidence interval for estimating d using δ̄ is given by
   δ̄ ± t_{N,k−1} · s_δ̄,  where  s_δ̄ = sqrt[ (1/(k(k−1))) Σ_{i=1..k} (δ_i − δ̄)² ]
 and t_{N,k−1} is a constant analogous to z_N, taken from the t distribution with k−1 degrees of freedom.
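A small sketch of this interval computation (the per-fold δ values and the t constant below are hypothetical; t values come from a t table, which the standard library does not provide):

```python
import math

def paired_t_interval(deltas, t_value):
    """delta_bar ± t * s, with s = sqrt(sum((d_i - delta_bar)^2) / (k*(k-1)))."""
    k = len(deltas)
    d_bar = sum(deltas) / k
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return d_bar - t_value * s, d_bar + t_value * s

# k = 10 folds; t for 95% two-sided confidence with 9 degrees of freedom is ~2.262.
deltas = [0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02, 0.01, 0.00, 0.04]
lo, hi = paired_t_interval(deltas, 2.262)
```

With these toy deltas the interval straddles 0, so the observed advantage of L_B over L_A would not be significant at this level.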

Paired t Tests
 To understand the justification for the confidence interval given on the previous slide, consider the following estimation problem:
 We are given the observed values of a set of i.i.d. random variables Y_1, Y_2, …, Y_k.
 We wish to estimate the expected value of these Y_i.
 We use the sample mean Ȳ as the estimator.

Problem with Limited Data D_0
 δ_1 … δ_k are not i.i.d., because they are based on overlapping sets of training examples drawn from D_0 rather than the full distribution D.
 Instead, view the procedure above as producing an estimate of the restricted quantity E_{S⊂D_0}[error_D(L_A(S)) − error_D(L_B(S))].

HW
 5.4 (10pt, due Monday, 10-24)
 5.6 (10pt, due Monday, 10-24)