Evaluating Models Part 2


Evaluating Models Part 2: Comparing Models
Geoff Hulten

How good is a model?
Goal: predict how well a model will perform when deployed to customers.
Use data: Train, Validation (tune), Test (generalization).
Assumption: all data is created independently by the same process.
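
A minimal sketch of the three-way split described above, assuming the data arrives as parallel Python lists xs and ys; the 60/20/20 proportions and the helper name are illustrative choices, not from the slides.

import random

def train_validation_test_split(xs, ys, fractions=(0.6, 0.2, 0.2), seed=0):
    # Shuffle once, then cut the shuffled order into train / validation / test.
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)

    n_train = int(fractions[0] * len(xs))
    n_validation = int(fractions[1] * len(xs))

    def take(idx):
        return [xs[i] for i in idx], [ys[i] for i in idx]

    train = take(indices[:n_train])
    validation = take(indices[n_train:n_train + n_validation])
    test = take(indices[n_train + n_validation:])
    return train, validation, test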

What does good mean?
[Diagram: in the training environment, a model is built from the dataset's training data and evaluated on testing data, giving the estimated accuracy $error_S(h)$; in the performance environment, the deployed model's customer interactions determine the actual accuracy $error_D(h)$. How do the two relate?]

Binomial Distribution
Test n samples: how many are correct? Flip n coins: how many heads?
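
A small sketch of the analogy: if each of n test examples is classified correctly with probability p (the model's true accuracy), the number correct is Binomial(n, p). The values n = 100 and p = 0.75 below are made-up illustration numbers.

import random

def simulate_correct_count(n, p, seed=None):
    # One draw from Binomial(n, p): n independent "coin flips" with success probability p.
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < p)

counts = [simulate_correct_count(100, 0.75) for _ in range(5)]
print(counts)  # e.g. [78, 71, 74, 80, 73]; varies run to run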

Estimating Accuracy
$Accuracy = \frac{\#Correct}{n}$
$\sigma_{Accuracy} \approx \sqrt{\frac{Accuracy\,(1 - Accuracy)}{n}}$
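
A direct sketch of the two formulas above; the example numbers (750 correct out of 1000) are made up.

import math

def estimate_accuracy(num_correct, n):
    # Point estimate of accuracy and its approximate standard error.
    accuracy = num_correct / n
    sigma = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, sigma

accuracy, sigma = estimate_accuracy(num_correct=750, n=1000)
print(accuracy, sigma)  # 0.75, ~0.0137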

Confidence Intervals
$Upper = Accuracy + Z_N \cdot \sigma_{Accuracy}$
$Lower = Accuracy - Z_N \cdot \sigma_{Accuracy}$

Confidence:  95%    98%    99%
$Z_N$:       1.96   2.33   2.58
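
A sketch combining the accuracy estimate with the $Z_N$ table above to produce a two-sided interval.

import math

Z = {0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def confidence_interval(num_correct, n, confidence=0.95):
    # Returns (lower, upper) bounds on the true accuracy at the given confidence.
    accuracy = num_correct / n
    sigma = math.sqrt(accuracy * (1 - accuracy) / n)
    z = Z[confidence]
    return accuracy - z * sigma, accuracy + z * sigma

# Example: 500 correct out of 1000 at 95% confidence -> roughly (0.469, 0.531).
print(confidence_interval(500, 1000, 0.95))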

Confidence Interval Examples
(using $Accuracy = \#Correct / n$, $\sigma_{Accuracy} \approx \sqrt{Accuracy(1-Accuracy)/n}$, and $Upper/Lower = Accuracy \pm Z_N \cdot \sigma_{Accuracy}$)

 n       # correct   Accuracy   sigma_Accuracy   Confidence   Margin (Z_N * sigma)
 100     15          15%        3.5707%          95%          6.998%
 1000    500         50%        1.5811%          95%          3.099%
 10000   7500        75%        0.433%           99%          1.117%

Summary of Error Bounds
Use error bounds to know how certain you are of your error estimates.
Use error bounds to estimate the worst-case behavior.

Comparing Models
[Diagram: build a new model in the training environment and evaluate both candidates on testing data, giving estimated accuracies $error_S(tree)$ and $error_S(linear)$. Which should we deploy, i.e. which will have the better actual accuracy, $error_D(tree)$ or $error_D(linear)$, in the performance environment?]

Comparing Models using Confidence Intervals
Prefer Model1 if: Model1 - Bound > Model2 + Bound (its lower bound is above Model2's upper bound, so the intervals do not overlap).

95% confidence intervals:
 Samples   Model(89%) - Bound   Model(80%) + Bound
 100       82.9%                87.8%
 200       84.7%                85.5%
 1000      87.0%                82.5%
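
A sketch of the interval-overlap test above; the accuracies 89% and 80% are the slide's example values.

import math

def bound(accuracy, n, z=1.96):
    # Half-width of the confidence interval at the given Z value.
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

def model1_clearly_better(acc1, acc2, n, z=1.96):
    # True only when model 1's lower bound exceeds model 2's upper bound.
    return acc1 - bound(acc1, n, z) > acc2 + bound(acc2, n, z)

# Reproduces the table: only at 1000 samples do the 95% intervals separate.
for n in (100, 200, 1000):
    print(n, model1_clearly_better(0.89, 0.80, n))  # False, False, True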

One Sided Bounds
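
The slide above is a title only. As a hedged illustration of the idea: a one-sided bound answers "how bad could the accuracy be?" and uses a smaller Z for the same confidence than a two-sided interval (about 1.645 for 95% one-sided versus 1.96 two-sided, standard normal-table values, not from the slides).

import math

def lower_bound_one_sided(num_correct, n, z=1.645):
    # 95% one-sided lower bound on the true accuracy (worst-case estimate).
    accuracy = num_correct / n
    sigma = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * sigma

# Example: 750 correct out of 1000 -> with 95% confidence the true accuracy
# is at least about 0.727.
print(lower_bound_one_sided(750, 1000))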

Cross Validation
Instead of dividing the training data into two parts (train & validation), divide it into K parts and loop over them: hold out one part for validation and train on the remaining data.
[Diagram: with K = 5, each of folds 1 through 5 takes a turn as the validation set while the other four folds are used for training.]
$Accuracy_{CV} = \frac{1}{n}\sum_k Correct_k$
$\sigma_{Accuracy_{CV}} \approx \sqrt{\frac{Accuracy_{CV}\,(1 - Accuracy_{CV})}{n}}$

Cross Validation pseudo-code

from math import sqrt

totalCorrect = 0
for i in range(K):
    (foldTrainX, foldTrainY) = GetAllDataExceptFold(trainX, trainY, i)
    (foldValidationX, foldValidationY) = GetDataInFold(trainX, trainY, i)

    # do feature engineering/selection on foldTrainX, foldTrainY
    model.fit(foldTrainX, foldTrainY)

    # featurize foldValidationX using the same method you used on foldTrainX
    totalCorrect += CountCorrect(model.predict(foldValidationX), foldValidationY)

accuracy = totalCorrect / len(trainX)
upper = accuracy + z * sqrt((accuracy * (1 - accuracy)) / len(trainX))
lower = accuracy - z * sqrt((accuracy * (1 - accuracy)) / len(trainX))
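
The slide leaves the fold helpers abstract. One possible sketch, assuming trainX and trainY are plain Python lists, folds are contiguous slices, and K (the number of folds) is defined at module scope as in the loop above; any remainder after the last full fold is ignored for simplicity.

def GetDataInFold(trainX, trainY, i):
    # The i-th contiguous slice of the training data, used for validation.
    foldSize = len(trainX) // K
    start, end = i * foldSize, (i + 1) * foldSize
    return trainX[start:end], trainY[start:end]

def GetAllDataExceptFold(trainX, trainY, i):
    # Everything outside the i-th slice, used for training.
    foldSize = len(trainX) // K
    start, end = i * foldSize, (i + 1) * foldSize
    return trainX[:start] + trainX[end:], trainY[:start] + trainY[end:]

def CountCorrect(predictions, labels):
    return sum(1 for predicted, actual in zip(predictions, labels) if predicted == actual)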

When to use cross validation
K = 5 or 10 (k-fold cross validation): do this in almost every situation.
K = n (leave-one-out cross validation): do this if you have very little data.
And be careful of: time series, dependencies (e.g. spam campaigns), and other violations of the independence assumption.
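
If scikit-learn is available, its splitters cover both cases. A sketch under that assumption; the tiny dataset and LogisticRegression model are toy stand-ins, not from the slides.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = LogisticRegression()

# K = 5: the usual choice.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# K = n: leave-one-out, only worth it when data is very scarce.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
# For time series or other dependent data, use a time-aware splitter
# (e.g. scikit-learn's TimeSeriesSplit) so validation folds never leak the future.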

Machine Learning Does LOTS of Tests
For each type of feature selection, for each parameter setting…
If each bound holds with probability p and the tests are independent, all of them hold with probability roughly p^(# tests):

 # Tests   P(all hold), 95% bounds   P(all hold), 99.9% bounds
 1         .95                       .999
 10        .598                      .990
 100       .00592                    .9048
 1000      5.29E-23                  .3677
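
A sketch of the multiple-testing arithmetic behind the table.

# If each bound holds with probability p and the tests are independent,
# all of them hold with probability p ** num_tests.
for p in (0.95, 0.999):
    for num_tests in (1, 10, 100, 1000):
        print(f"p={p}, tests={num_tests}: P(all hold) = {p ** num_tests:.3g}")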

Summary
Always think about your measurements:
Use independent test data.
Think in terms of statistical estimates instead of point estimates.
Be suspicious of small gains.
Get lots of data!