
Slide 1: Methods for Statistical Evaluation of Hypotheses
Lecture 17 of 42
Kansas State University, Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition
Friday, 29 February 2008
William H. Hsu, Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/~bhsu
Readings: Section 6.13, Han & Kamber 2e

Slide 2: Lecture Outline
Read Sections 6.13 – 6.14, Han & Kamber 2e
Statistical Evaluation Methods for Learning: Three Questions
– Generalization quality
  Estimating hypothesis accuracy on future data: sampling theory
  How well does observed accuracy express generalization accuracy?
  Estimation bias and variance
  Confidence intervals
– Comparing generalization quality
  How certain are we that h_1 is better than h_2?
  Significance evaluation and t tests
– Learning and statistical evaluation
  What is the best way to make the most of limited data?
  Validation choices and tradeoffs
Next Lecture: Sections 6.1 – 6.5, Mitchell (Bayesian Learning Basics)

Slide 3: Motivation
Why Evaluate Performance?
– To decide whether to use hypothesis h (e.g., database of medical treatments)
– To boost performance (e.g., validation-set accuracy in DT, ANN learning)
Precise Evaluation from Limited Data: Issues
– Bias: overoptimistic estimates of generalization quality
– Variance: uncertainty about generalization quality (even if the estimate is unbiased)
What Are We Evaluating?
– Accuracy of the learned hypothesis (percentage of examples correctly classified)
  Range of probable error
  Probability that observed accuracy was "due to chance"
– Relative accuracy of two hypotheses
– Relative accuracy of two learning algorithms given limited data
– Other figures of merit
  Utility: cost of false positives and false negatives
  Efficiency of the model: e.g., syntactic complexity of rules

Slide 4: Estimating Hypothesis Error: Definitions
Setting for Learning Problem
– Objective: learn target function f: X → A
– Unknown probability distribution 𝒟 over X (e.g., age of people met on the street)
– Input: sample D ≡ {<x, f(x)>} of labeled examples drawn from 𝒟
– Output: represented as hypothesis h ∈ H
Questions
– Given h and D (|D| = n), give the best estimate of the future accuracy of h on x drawn from 𝒟
– What is the probable error in this estimate?
Definitions
– Sample error error_D(h) of hypothesis h with respect to target function f and sample D: the fraction of examples in D that h misclassifies
– True error error_𝒟(h) of h: the probability that h misclassifies a single x drawn at random from 𝒟
– Main question: how well does the sample error estimate the true error? (a minimal sketch of the sample-error computation follows below)
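A minimal Python sketch of the sample-error definition above; the threshold hypothesis and the tiny labeled data set are hypothetical stand-ins, not part of the course material.

```python
from typing import Callable, Sequence, Tuple

def sample_error(h: Callable[[float], int],
                 data: Sequence[Tuple[float, int]]) -> float:
    """Fraction of labeled examples <x, f(x)> in the sample that h misclassifies."""
    errors = sum(1 for x, label in data if h(x) != label)
    return errors / len(data)

# Hypothetical example: a threshold classifier evaluated on a toy labeled sample.
h = lambda x: 1 if x >= 0.5 else 0
D = [(0.1, 0), (0.4, 0), (0.6, 1), (0.7, 0), (0.9, 1)]
print(sample_error(h, D))   # 0.2 -> h misclassifies 1 of 5 examples
```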

Slide 5: Confidence Intervals for Discrete-Valued Hypotheses
Estimating True Error
– Discrete-valued hypothesis h (classification model)
– Estimate based on observed sample error over sample D
  Samples (instances) x are independent and identically distributed (i.i.d.) ~ 𝒟
  Let r ≡ number of examples in D that h misclassifies (i.e., error_D(h) = r/n)
  Suppose n ≥ 30 (heuristics for n: later today)
– Given no other information, the most probable value of error_𝒟(h) is error_D(h)
Definition
– N% confidence interval (CI) for a parameter θ: an interval that is expected with probability N% to contain θ
– Intuitive idea: want "error_𝒟(h) ∈ error_D(h) ± z_N · σ" for given N and the corresponding σ
Example (worked in the sketch below)
– Want 95% CI for r = 12, n = 40 (error_D(h) = 12/40 = 0.30)
– z_N = 1.96, σ ≈ 0.07, so CI = 0.30 ± 0.14 (more on what z_N and σ are in a bit)
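A short Python check of the slide's example, assuming the usual normal approximation σ ≈ sqrt(error_D(h)(1 − error_D(h))/n); the numbers r = 12 and n = 40 come from the slide.

```python
from math import sqrt

r, n = 12, 40                      # errors observed, sample size (from the slide)
err = r / n                        # sample error error_D(h) = 0.30
sigma = sqrt(err * (1 - err) / n)  # normal-approximation standard deviation
z_95 = 1.96                        # two-sided z value for a 95% CI

half_width = z_95 * sigma
print(f"error_D(h) = {err:.2f}, sigma = {sigma:.3f}")
print(f"95% CI: {err:.2f} +/- {half_width:.2f}")   # ~0.30 +/- 0.14
```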

Slide 6: Sampling Theory: Definitions from Probability and Statistics
Estimating Error Rate
– How does |error_D(h) − error_𝒟(h)| depend on |D|?
– General statistical framework: parameter estimation
  Want to estimate Pr(P_h(x)): the proportion of x drawn from 𝒟 with property P_h
  P_h(x) ≡ "h misclassifies x" (i.e., h(x) ≠ c(x))
  Experiment with random outcome: select an x and test h on it (Bernoulli trial)
– Goal
  Estimate Pr(error_D(h) = e) based on Pr(P_h(x)), where D is a random sample of size n
Definitions to Review
– F: cumulative distribution function (cdf); aggregation of the probability measure Pr
  Sum (discrete mass function p) or integral (continuous density function f)
  p(x) = Pr(X = x), F(x) = Pr(X ≤ x)
– Relevant distributions: Binomial, Normal
– Statistical measures on a random variable X ~ Pr(X): mean μ_X, variance σ²_X

Slide 7: Error Estimation and Estimating Binomial Proportions
Bernoulli Random Variables
– Random variable (RV): a real-valued function over a sample space (X: Ω → ℝ); the distribution of X maps its values to probabilities
– Bernoulli random variable: an RV whose range of values is {0, 1}
– Bernoulli trial: an experiment whose outcome is denoted by a Bernoulli RV (e.g., a coin flip)
Understanding the Binomial Distribution
– Definition: Binomial RV
  One whose range of values is ℕ (the natural numbers, i.e., nonnegative integers)
  Its value denotes the number of observations in n Bernoulli trials (n ∈ ℕ)
– Idea: calculate the number of ways to get r observations in n trials, and the probability of r
Bernoulli Trials, Binomial Distributions, and Error Estimation
– Interpret an observation (X = 1) as "h incorrectly classifies x ∈ D", and no observation (X = 0) as "h correctly classifies x ∈ D"
– Consider n trials, i.i.d. ~ Bernoulli: this induces a distribution on the error rate over all of D (simulated in the sketch below)
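A small simulation sketch of the Bernoulli-trial view above, assuming a hypothetical true error rate p = 0.3: each test of h on a random x is a biased coin flip, so the error count in n trials is Binomial(n, p) and error_D(h) = R/n varies from sample to sample.

```python
import random

random.seed(0)
p, n = 0.3, 40          # hypothetical true error rate and sample size
trials = 1000           # number of independent samples D to simulate

# For each simulated sample D, count misclassifications as n Bernoulli(p) trials.
error_rates = []
for _ in range(trials):
    r = sum(1 for _ in range(n) if random.random() < p)
    error_rates.append(r / n)

print(min(error_rates), max(error_rates))   # spread of observed error rates
print(sum(error_rates) / trials)            # close to p = 0.3 on average
```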

Slide 8: The Binomial Distribution
Probability Mass Function for the Binomial Distribution
– P(R = r) ≡ (number of ways to get r observations in n trials) · Pr(a particular such combination)
  Count the number of ways: C(n, r) = n! / (r! (n − r)!)
  Calculate the probability of a particular combination: p^r (1 − p)^(n − r)
  Calculate the overall probability mass function: P(R = r) = C(n, r) · p^r · (1 − p)^(n − r)
Using the Estimate
– Binomial RV of interest: R, the number of misclassifications in n trials
– Based on the p of interest
  We estimate p by error_D(h) = r/n
  To answer questions about error_𝒟(h), characterize R (bias, variance); a worked pmf computation follows below
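A minimal sketch of the binomial pmf from the slide, using Python's math.comb; p = 0.3 and n = 40 are carried over from the running example.

```python
from math import comb

def binomial_pmf(r: int, n: int, p: float) -> float:
    """P(R = r) = C(n, r) * p^r * (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.3
# Probability of observing exactly r = 12 misclassifications in 40 trials.
print(binomial_pmf(12, n, p))            # ~0.14 (the mode of the distribution)
# Sanity check: the pmf sums to 1 over r = 0..n.
print(sum(binomial_pmf(r, n, p) for r in range(n + 1)))   # ~1.0
```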

Slide 9: Mean, Variance, and Estimation Bias
Definitions
– Expectation of a random variable X (aka expected value or mean)
  General definition: E[X] ≡ ∫ x dF(x)
  Suppose X is discrete-valued: E[X] = Σ_x x · Pr(X = x)
  Binomial distribution: E[X] = np
– Variance of X: Var(X) ≡ E[(X − E[X])²]
– Standard deviation of X: σ_X ≡ sqrt(Var(X))
– Estimation bias of an estimator X for a parameter θ: β ≡ E[X] − θ
  NB: estimation bias ≠ inductive bias
– β = 0 ⟹ X is an unbiased estimator
– Is error_D(h) an unbiased estimator of error_𝒟(h)? (Hint: E[R] = np; see the check below)
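A quick numeric check of the hint above, computing E[R] and Var(R) directly from the binomial pmf (p = 0.3 and n = 40 carried over from the running example) to confirm that error_D(h) = R/n has zero estimation bias.

```python
from math import comb

n, p = 40, 0.3
pmf = lambda r: comb(n, r) * p**r * (1 - p)**(n - r)

# E[R] and Var(R) computed directly from the binomial pmf.
E_R   = sum(r * pmf(r) for r in range(n + 1))
Var_R = sum((r - E_R) ** 2 * pmf(r) for r in range(n + 1))

print(E_R, n * p)              # both ~12.0: E[R] = np
print(Var_R, n * p * (1 - p))  # both ~8.4:  Var(R) = np(1-p)
print(E_R / n - p)             # ~0.0: estimation bias of error_D(h) = R/n is zero
```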

Slide 10: Confidence Intervals
Objective: Calculate Bounds for the True Error (at a Given Confidence Level)
Definitions
– N% confidence interval (CI) for θ: an interval that contains θ with probability N%
– Normal (aka Gaussian) distribution: density f(x) = (1 / sqrt(2πσ²)) · exp(−(x − μ)² / (2σ²)), where μ denotes the mean of X and σ² its variance
Finding the CI for θ in Terms of a Given N
– θ of interest: error_𝒟(h)
– The evaluator specifies N
– The CI depends on a "constant" z_N (a function of N), n = |D|, and error_D(h)
– Want error_D(h) − z_N·σ ≤ error_𝒟(h) ≤ error_D(h) + z_N·σ, with σ ≈ sqrt(error_D(h)·(1 − error_D(h)) / n)
  N% CI for error_𝒟(h): error_D(h) ± z_N·σ
– Coefficient z_N: table lookup, e.g.
  N%:   50    68    80    90    95    98    99
  z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
  (or computed directly, as in the sketch below)
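A minimal sketch of computing z_N directly instead of a table lookup, using Python's standard-library statistics.NormalDist; the confidence levels are the ones from the table above.

```python
from statistics import NormalDist

def z_two_sided(confidence: float) -> float:
    """z_N such that N% of the standard normal mass lies within [-z_N, +z_N]."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for pct in (50, 68, 80, 90, 95, 98, 99):
    print(pct, round(z_two_sided(pct / 100), 2))
# 50 0.67, 68 0.99, 80 1.28, 90 1.64, 95 1.96, 98 2.33, 99 2.58
```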

Slide 11: Confidence Intervals: Two-Sided and One-Sided Bounds
Two-Sided versus One-Sided Bounds
– The previous z_N is for a two-sided bound (upper and lower extrema: [L, U])
– Sometimes we want to ask: what is Pr(error_𝒟(h) ≤ U)?
– In many experiments, the maximum error bound U is all that matters
  One-sided bound: we don't care whether the error is at least L
  Want to adapt the CI computation procedure to handle the half-interval
Property: the Normal Distribution is Symmetric about Its Mean
Modification to the CI Procedure
– α ≡ probability of the error falling into the tails of the distribution (< L or > U)
– Simply convert the 100(1 − α)% two-sided CI [L, U] into a 100(1 − α/2)% one-sided CI (−∞, U]
Example (see the sketch below)
– Suppose again that h commits r = 12 errors on n = 40 examples
– 95% two-sided CI: 0.30 ± 0.14
– 97.5% one-sided bound: error_𝒟(h) ≤ 0.44
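A short Python sketch of the two-sided to one-sided conversion, reusing the slide's r = 12, n = 40 example and the normal approximation for σ.

```python
from math import sqrt

r, n = 12, 40
err = r / n
sigma = sqrt(err * (1 - err) / n)
z_95_two_sided = 1.96          # 95% two-sided bound => alpha = 0.05

upper = err + z_95_two_sided * sigma
# The same z gives a one-sided bound at confidence 100*(1 - alpha/2) = 97.5%.
print(f"With 97.5% confidence, error_D(h) <= {upper:.2f}")   # <= 0.44
```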

Slide 12: Confidence Intervals: Applying the Central Limit Theorem
General 4-Step Approach for Deriving CIs
– 1. Identify the population parameter θ to be estimated (e.g., error_𝒟(h))
– 2. Determine the estimator X (e.g., error_D(h))
  Prefer X unbiased (E[X] − θ = 0) and of minimum variance (the two goals sometimes conflict!)
– 3. Determine the distribution D_X such that X ~ D_X; in particular we want μ_X and σ_X
  Other ways to characterize the probability mass (or density) function of D_X: see moments, moment generating functions
– 4. Determine the N% CI for the given N%: find L, U such that N% of the probability mass for θ lies in [L, U]
Central Limit Theorem
– Consider a set of n RVs {X_i} i.i.d. ~ an arbitrary distribution with mean μ and finite variance σ², and the sample mean X̄_n ≡ (1/n) Σ_i X_i
– Claim: as n → ∞, (X̄_n − μ) / (σ / sqrt(n)) ~ Normal(0, 1)
– Ramification: for any estimator X that is a mean (e.g., error_D(h)), X ~ D_X ≈ Normal(μ_X, σ_X) for large enough n (illustrated in the simulation below)
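A small simulation sketch of the CLT claim above: standardized sample means of i.i.d. Bernoulli(0.3) draws (a decidedly non-normal distribution) behave increasingly like a standard normal as n grows. The particular n values and p are arbitrary illustration choices.

```python
import random
from statistics import mean, stdev

random.seed(1)
p = 0.3          # Bernoulli parameter (arbitrary, matches the running example)

def standardized_means(n, reps=2000):
    """Draw `reps` sample means of n Bernoulli(p) trials and standardize them."""
    mu, sigma = p, (p * (1 - p)) ** 0.5
    out = []
    for _ in range(reps):
        xbar = sum(random.random() < p for _ in range(n)) / n
        out.append((xbar - mu) / (sigma / n ** 0.5))
    return out

for n in (5, 30, 200):
    z = standardized_means(n)
    # Fraction of standardized means inside +/-1.96; close to the normal value 0.95 for large n.
    inside = sum(abs(v) <= 1.96 for v in z) / len(z)
    print(n, round(mean(z), 2), round(stdev(z), 2), round(inside, 3))
```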

Slide 13: Difference in Error of Two Hypotheses
Estimating the Difference in True Errors: Want a CI for the Difference
Use the Generic 4-Step Procedure to Derive the Estimator and CI
– 1. Desired parameter: d ≡ error_𝒟(h_1) − error_𝒟(h_2)
– 2Q. What is a good estimator for d?
– 2A. The difference between the sample errors: d̂ ≡ error_D1(h_1) − error_D2(h_2)
– 3. Determine the distribution D_X governing the estimator
  Central Limit Theorem (CLT): the errors are ~ Normal for large enough n_1, n_2
  The difference of Normal distributions is a Normal distribution
  Result: μ_d̂ ≈ d, σ_d̂ ≈ sqrt( error_D1(h_1)(1 − error_D1(h_1))/n_1 + error_D2(h_2)(1 − error_D2(h_2))/n_2 )
– 4. Use the parameters of the distribution D_X to derive the N% CI: d̂ ± z_N · σ_d̂ (sketched below)
NB: Typically the Samples Are Identical (D_1 = D_2)
– Ramification: (usually) lower estimator variance - why?
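A minimal sketch of step 4, computing the N% CI for d from two observed sample errors under the normal approximation; the sample errors and sizes used here (0.30 on n_1 = 100 versus 0.20 on n_2 = 100) are hypothetical illustration values, not taken from the slide.

```python
from math import sqrt
from statistics import NormalDist

def diff_error_ci(e1, n1, e2, n2, confidence=0.95):
    """N% CI for d = error(h1) - error(h2) under the normal approximation."""
    d_hat = e1 - e2
    sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d_hat - z * sigma, d_hat + z * sigma

# Hypothetical sample errors for h1 and h2 on two test sets of 100 examples each.
lo, hi = diff_error_ci(0.30, 100, 0.20, 100)
print(f"95% CI for d: [{lo:.3f}, {hi:.3f}]")   # roughly [-0.02, 0.22]
```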

Slide 14: Hypothesis Testing
Neyman-Pearson Paradigm
– Intuitive idea: comparing two hypotheses h_1 and h_2
  Subtlety: "hypothesis" here refers to assertions about error_𝒟(h_1) and error_𝒟(h_2)
  Define such a hypothesis H_0 (the null hypothesis, e.g., "error_𝒟(h_1) ≤ error_𝒟(h_2)")
  Want to decide between it and H_A (the alternative hypothesis, ¬H_0)
– The test question
  The event error_𝒟(h_1) > error_𝒟(h_2) makes us reject H_0
  What is Pr(error_𝒟(h_1) > error_𝒟(h_2))?
Observation of Estimators and Hypothesis Testing
– We see: the observed difference in sample errors, d̂ ≡ error_D1(h_1) − error_D2(h_2)
– We want to use this to estimate d ≡ error_𝒟(h_1) − error_𝒟(h_2)
Confidence Intervals and Hypothesis Testing
– N% = probability that the estimator falls within the one-sided CI
– Pr(d lies within the interval implied by the observed d̂) = Pr(the evaluation is correct)

Slide 15: Hypothesis Testing: An Example
Statistical Experiment Definition
– Common distribution 𝒟
– Two samples (data sets): D_1, D_2
– Two hypotheses h_1 and h_2 (e.g., the old hypothesis and the new one in a learning algorithm)
– Parameter: the error difference at confidence level N% (compute and compare)
Observation: Measured Errors
Example
– Suppose we observe sample errors error_D1(h_1) and error_D2(h_2), giving an observed difference d̂
– Here, we have solved for z_N given the sample errors (i.e., z = d̂ / σ_d̂)
  Reverse table lookup: e.g., 90% two-sided level of confidence corresponds to 95% one-sided
– Can answer: what difference in sample errors is needed to accept H_0 at N% confidence
  U ≡ the boundary of the one-sided rejection region
(a worked numeric sketch follows below)
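A short sketch of the "reverse table lookup" idea: given an observed difference and its σ, solve for z and convert it into a confidence level. The measured errors and sample sizes here are hypothetical placeholders (the slide's own numbers are elided); they happen to reproduce the 90% two-sided / 95% one-sided levels quoted above.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical measured errors on two samples of 100 examples each.
e1, n1 = 0.30, 100
e2, n2 = 0.20, 100

d_hat = e1 - e2
sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
z = d_hat / sigma                      # solved-for z_N given the sample errors

one_sided = NormalDist().cdf(z)        # confidence that d > 0 (normal approximation)
two_sided = 2 * one_sided - 1          # matching two-sided confidence level
print(f"z = {z:.2f}, one-sided = {one_sided:.3f}, two-sided = {two_sided:.3f}")
```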

Slide 16: Comparing Learning Algorithms
Another Statistical Experiment Definition
– Common distribution 𝒟
– Error computed over a single sample D drawn from 𝒟
– Two learning algorithms L_A and L_B
– Parameter: the error difference at confidence level N% (compute and compare)
Observation: Measured Errors on Test Data
Observation of Estimators and Hypothesis Testing
– Break the limited sample D into D_train and D_test
– We see: error_Dtest(L_A(D_train)) − error_Dtest(L_B(D_train))
– We want to use this to estimate E_{D ⊂ 𝒟} [ error_𝒟(L_A(D)) − error_𝒟(L_B(D)) ]
– Need a method for reducing the loss of training data to the test set

Slide 17: Estimating the Difference in Error between Learning Algorithms
Algorithm k-Fold-Cross-Validation(D, L_A, L_B)
– Partition D into k disjoint subsets T_1, T_2, …, T_k of equal size (|T_i| at least 30)
– FOR i ← 1 TO k DO
  Use T_i for the test set and the remaining data D_i ≡ D − T_i for training
  h_A ← L_A(D_i)
  h_B ← L_B(D_i)
  δ_i ← error_Ti(h_A) − error_Ti(h_B)
– RETURN δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i
k-Fold Cross Validation: aka Jackknife
– N% confidence interval for estimating the true difference d: δ̄ ± t_{N,k−1} · s_δ̄,
  where s_δ̄ ≡ sqrt( (1 / (k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )
(a runnable sketch of the procedure follows below)
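A minimal, self-contained Python sketch of the procedure above. The two "learning algorithms" are deliberately trivial placeholder classifiers (a majority-class predictor and a fixed threshold rule), since the slide does not fix L_A and L_B; the point here is only the fold bookkeeping and the paired differences δ_i.

```python
import random
from statistics import mean, stdev

def k_fold_compare(data, learn_A, learn_B, k=10):
    """Return per-fold paired differences delta_i = error_Ti(h_A) - error_Ti(h_B)."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]           # k disjoint test sets T_i
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_A, h_B = learn_A(train), learn_B(train)
        err = lambda h: sum(h(x) != y for x, y in test) / len(test)
        deltas.append(err(h_A) - err(h_B))
    return deltas

def learn_A(train):
    """Placeholder learner: majority-class predictor (ignores x)."""
    majority = round(mean(y for _, y in train))
    return lambda x: majority

def learn_B(train):
    """Placeholder learner: fixed threshold rule (ignores the training data)."""
    return lambda x: 1 if x > 0.5 else 0

random.seed(0)
data = [(x, 1 if x + random.gauss(0, 0.2) > 0.5 else 0)
        for x in (random.random() for _ in range(300))]

deltas = k_fold_compare(data, learn_A, learn_B, k=10)
d_bar = mean(deltas)
s_dbar = stdev(deltas) / len(deltas) ** 0.5          # s_delta-bar from the slide
print(f"delta_bar = {d_bar:.3f} with standard error {s_dbar:.3f}")
```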

Slide 18: Statistical Errors and Significance
Neyman-Pearson Revisited
– Two kinds of statistical errors
  Type I ≡ false positives (rejecting H_0 when it is true)
  Type II ≡ false negatives (accepting H_0 when it is false)
– Probability of statistical errors
  Pr(Type I error) ≡ α ≡ the significance level of the test
  Pr(Type II error) ≡ β; 1 − β ≡ Pr(sound rejection) ≡ the power of the test
Jargon
– "Significant at the N% level of confidence": N = 100(1 − α)
– "p < α"
  p-value ≡ the smallest value of α for which H_0 would be rejected
  The smaller the p-value, the stronger the evidence against H_0
– "Significance of p": the p-value equals p

Slide 19: Paired t Tests
Student t Test
– Estimator: the sample mean δ̄
– Test statistic: t_{N,k−1} (k − 1 degrees of freedom)
  k counts the number of independent random events (folds) used to estimate the parameter
  Property: as k → ∞, the t distribution approaches Normal(0, 1)
  Result: the cross-validated CI δ̄ ± t_{N,k−1} · s_δ̄ (computed in the sketch below)
Paired Tests
– Evaluation of the two hypotheses over identical samples
– Lower variance (no random differences between samples) ⟹ tighter CI
For More Information: Consult a Statistics Textbook (e.g., [Rice, 1988])
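A short sketch of the paired t interval, assuming the per-fold differences δ_i come from a procedure like the one after Slide 17; the δ_i values below are hypothetical, and scipy is used only for the t quantile (scipy.stats.t.ppf).

```python
from statistics import mean, stdev
from scipy.stats import t

def paired_t_ci(deltas, confidence=0.95):
    """Cross-validated CI: delta_bar +/- t_{N,k-1} * s_delta_bar."""
    k = len(deltas)
    d_bar = mean(deltas)
    s_dbar = stdev(deltas) / k ** 0.5
    t_val = t.ppf(0.5 + confidence / 2, df=k - 1)   # two-sided t quantile
    return d_bar - t_val * s_dbar, d_bar + t_val * s_dbar

# Hypothetical per-fold differences from a 10-fold paired comparison.
deltas = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.02, 0.06, 0.01, 0.03]
lo, hi = paired_t_ci(deltas)
print(f"95% CI for the true difference: [{lo:.3f}, {hi:.3f}]")
```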

Slide 20: The Bias-Variance Tradeoff
Intuitive Idea
– Variance: being uncertain about d
  Effort expended to reduce variance: coding the model
  The antithesis of precision
– (Estimation) bias: being wrong about d
  Effort expended to reduce bias: coding the residual error
  The antithesis of accuracy
– Important in designing learning algorithms
Jargon
– Bias-variance tradeoff: the idea that limited resources exist to code the model and the error
– Algorithms that exploit the bias-variance tradeoff exchange flexibility for complexity
  Flexibility: more trainable parameters (e.g., ANN weights)
  Complexity: slower convergence, overfitting, other problems

Slide 21: Using Statistics in the Evaluation of Learning Algorithms
Practical Considerations
– Setting k in k-fold CV
– "User-defined" target confidence N%
Things to Keep in Mind
– Limited data resources
  The more data used to evaluate, the less there is to generalize over
  Goal: find ways to evaluate in a mathematically sound fashion
– Limited computational resources
  Only so much space and time to perform experiments
  Goal: design efficient and informative tests
– No silver bullet
  No learning algorithm does equally well on all data sets
  Evaluation of learning algorithms ≡ evaluation of inductive biases
Further Reading: Duda and Hart (2e Due Out Any Day Now™)

Slide 22: Statistics and ANN Jargon
Many Common Concepts, Differences in Terminology
– Inductive learning
  ANNs and AI: "generalizing from noisy data"; "supervised learning"
  Statistics: "statistical inference"; "(multivariate) regression"
– Design of learning systems
  ANNs and AI: "architecture"
  Statistics: "model"
– Model components
  ANNs: "trainable weight or bias"
  Statistics: "parameter"
Many, Many More
Most Comprehensive Resources to Date
– http://www.aic.nrl.navy.mil/~aha/research/txt/searle.txt
– http://www.faqs.org/faqs/ai-faq/neural-nets/part1/section-14.html

Slide 23: Terminology
Statistical Evaluation Methods for Learning
– Foundations
  Mean, variance, estimation bias
  Confidence intervals
– Comparing generalization quality
  Significance (to a confidence level)
  p-value
  t test
– Making the most of limited data
  k-fold cross validation
  Degrees of freedom
Neyman-Pearson Paradigm
– Null hypothesis (H_0) and alternative hypothesis (H_A)
– Rejection and acceptance of H_0
– Type I and Type II statistical errors; significance, power

Slide 24: Summary Points
Statistical Evaluation Methods for Learning: Three Questions
– Generalization quality
  How well does observed accuracy estimate generalization accuracy?
  Estimation bias and variance
  Confidence intervals
– Comparing generalization quality
  How certain are we that h_1 is better than h_2?
  Confidence intervals for paired tests
– Learning and statistical evaluation
  What is the best way to make the most of limited data?
  k-fold CV
Tradeoffs: Bias versus Variance
Next Lecture: Sections 6.1 – 6.5, Mitchell (Bayes's Theorem; ML; MAP)

