Experiments in Machine Learning (COMP24111, lecture 5)

Experiments in Machine Learning (COMP24111, lecture 5). [Bar chart: Accuracy (%) of learning algorithms A, B, C, D.]

Scientists vs Normal People

Learning Algorithm
1. Split our data randomly
2. Train a model…
3. Test it!
Why do we split the data?

The Most Important Concept in Machine Learning… Looks good so far…

The Most Important Concept in Machine Learning… Looks good so far… Oh no! Mistakes! What happened?

The Most Important Concept in Machine Learning… Looks good so far… Oh no! Mistakes! What happened? We didn’t have all the data. We can never assume that we do. This is called “OVER-FITTING” to the small dataset.

Inputs Labels

Inputs and Labels: a 50:50 (random) split.

Training set: train a K-NN or Perceptron on this… Needs to be quite big; bigger is better.
Testing set: … then, test it on this! It "simulates" what it might be like to see new data in the future. Ideally it should be small; smaller is better. But if it is too small... you'll make many mistakes on the testing set.
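To make the split-train-test procedure concrete, here is a minimal sketch in Python with scikit-learn (the slides reference MATLAB; the digits dataset, the 50:50 split and K=3 are illustrative assumptions, not the course data).

```python
# Minimal sketch of the split-train-test procedure using scikit-learn.
# The digits dataset and K=3 are illustrative choices, not the course data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# 50:50 random split, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)                  # train on the training set only

print("training accuracy:", knn.score(X_train, y_train))
print("testing accuracy: ", knn.score(X_test, y_test))   # simulates future data
```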

Training set / Testing set: build a model (k-NN, perceptron, decision tree, etc.) on the training set. How many incorrect predictions does it make on the testing set? The percentage of incorrect predictions is called the "error": e.g. the "training" error and the "testing" error.

Classifying '3' versus '8' digits: plotting error as a function of K (as in the K-NN). [Plot: error vs K, with training and testing error curves.] The percentage of incorrect predictions is called the "error": e.g. the "training" error and the "testing" error. Training data can behave very differently to testing data!
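A hedged sketch of how such a curve could be produced: restrict a dataset to the '3' and '8' digits, then record training and testing error over a range of K values (scikit-learn's bundled digits data and the particular K values are assumptions for illustration).

```python
# Sketch: training vs testing error as a function of K in K-NN
# (digits '3' vs '8', mirroring the slide; exact numbers will differ).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 8)                  # keep only the '3' and '8' digits
X, y = X[mask], y[mask]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)   # error = fraction of wrong predictions
    test_err = 1 - knn.score(X_te, y_te)
    print(f"K={k:2d}  train error={train_err:.3f}  test error={test_err:.3f}")
```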

Which is the "best" value of K? [Plot: Accuracy (%) vs the K value used in the k-NN classifier. MATLAB: "help errorbar".] Different random split of the data = different performance!

Error bars (a.k.a. "confidence intervals"): plot the average and the standard deviation (spread) of the values. Wider spread = less stability… [Plot: Accuracy (%) vs the K value used in the k-NN classifier. MATLAB: "help errorbar".]
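One way such error bars could be produced, sketched in Python with matplotlib and scikit-learn (the 10 repeated splits and the K values are illustrative assumptions):

```python
# Sketch: repeated random splits give a spread of accuracies; plot mean +/- std
# as error bars (the MATLAB equivalent is "help errorbar").
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
ks = [1, 3, 5, 7, 9]
means, stds = [], []

for k in ks:
    accs = []
    for seed in range(10):                        # 10 different random splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed)
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        accs.append(knn.score(X_te, y_te) * 100)  # accuracy in %
    means.append(np.mean(accs))
    stds.append(np.std(accs))

plt.errorbar(ks, means, yerr=stds, fmt='o-')      # mean with +/- one std dev bars
plt.xlabel("K value used in knn classifier")
plt.ylabel("Accuracy (%)")
plt.show()
```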

Cross-validation. We could do repeated random splits… but there is another way (MATLAB: "help crossfold"):
1. Break the data evenly into N chunks
2. Leave one chunk out
3. Train on the remaining N-1 chunks
4. Test on the chunk you left out
5. Repeat until all chunks have been used to test
6. Plot average and error bars for the N chunks
For example, with all the data split into 5 chunks: train on four of them (all together), test on the one left out… then repeat, but test on a different chunk.
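A minimal sketch of the same N-fold procedure, assuming scikit-learn's KFold and a K=3 nearest-neighbour classifier purely for illustration:

```python
# Sketch of N-fold cross-validation: split into N chunks, hold each out once.
# Written with scikit-learn's KFold; the slide's MATLAB hint is "help crossfold".
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)    # 5 chunks

fold_acc = []
for train_idx, test_idx in kf.split(X):
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X[train_idx], y[train_idx])                 # train on the other 4 chunks
    fold_acc.append(knn.score(X[test_idx], y[test_idx]))  # test on the held-out chunk

print("per-fold accuracy:", np.round(fold_acc, 3))
print("mean +/- std: %.3f +/- %.3f" % (np.mean(fold_acc), np.std(fold_acc)))
```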

Box-plots reflect the shape of the true distribution more accurately. MATLAB: "help boxplot".
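For example, the per-fold accuracies from cross-validation can be summarised with a box plot; a short sketch (cross_val_score with 10 folds is an assumed setup, not from the lecture):

```python
# Sketch: a box plot of per-fold accuracies shows the shape of the spread,
# not just mean +/- std (the MATLAB equivalent is "help boxplot").
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10) * 100

plt.boxplot(scores)                # median, quartiles and outliers of the 10 folds
plt.ylabel("Accuracy (%)")
plt.show()
```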

[Box plot example: most datapoints clustered high (around 30-35); wide confidence intervals, but a slight skew downward (around 15-25). MATLAB: "help boxplot".]

What factors should affect our decision? [Bar chart: Accuracy (%) of learning algorithms A, B, C, D.]
1. Accuracy - how accurate is it?
2. Training time and space complexity - how long does it take to train? How much memory overhead does it take?
3. Testing time and space complexity - how long does it take to execute the model for a new datapoint? How much memory overhead does it take?
4. Interpretability - how easily can we explain why it does what it does?

Measuring accuracy is not always useful… [Bar chart: Accuracy (%) of learning algorithms A, B, C, D.]
1. Accuracy - how accurate is it?
2. Training time and space complexity - how long does it take to train? How much memory overhead does it take?
3. Testing time and space complexity - how long does it take to execute the model for a new datapoint? How much memory overhead does it take?
4. Interpretability - how easily can we explain why it does what it does?

Best testing error at K=3, about 3.2%. Is this “good”?

Zip-codes: “8” is very common on the West Coast, while “3” is rare. Making a mistake will mean your Florida post ends up in Las Vegas!

Sometimes classes are rare, so your learner will not see many of them. What if, in the testing phase, you saw 1,000 digits: 32 instances of "3" and 968 instances of "8"? Then 3.2% error was achieved by just saying "8" to everything!

32 instances of "3": 0% correct. 968 instances of "8": 100% correct. Solution? Measure accuracy on each class separately.
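A tiny sketch of why per-class accuracy exposes the problem, using the counts from the slide (32 threes, 968 eights) and a hypothetical classifier that always predicts "8":

```python
# Sketch: accuracy measured on each class separately exposes the
# "say 8 to everything" trick; counts below mirror the slide (32 vs 968).
import numpy as np

y_true = np.array([3] * 32 + [8] * 968)     # 32 threes, 968 eights in the test set
y_pred = np.full_like(y_true, 8)            # a "classifier" that always answers 8

overall = np.mean(y_pred == y_true)
acc_3 = np.mean(y_pred[y_true == 3] == 3)   # accuracy on class '3'
acc_8 = np.mean(y_pred[y_true == 8] == 8)   # accuracy on class '8'

print(f"overall accuracy: {overall:.1%}")   # 96.8%  (i.e. 3.2% error)
print(f"class '3' accuracy: {acc_3:.0%}")   # 0%
print(f"class '8' accuracy: {acc_8:.0%}")   # 100%
```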

ROC analysis. Our obvious solution is in fact backed up by a whole statistical framework: Receiver Operating Characteristic (ROC) analysis, developed in WW2 to assess radar operators. "How good is the radar operator at spotting incoming bombers?" False positives – i.e. falsely predicting a bombing raid. False negatives – i.e. missing an incoming bomber (VERY BAD!)

ROC analysis. False positives – i.e. falsely predicting an event. False negatives – i.e. missing an incoming event. The "3" digits are like the bombers: rare events, but costly if we misclassify! Similarly, we have "true positives" and "true negatives". [2x2 confusion matrix: Truth vs Prediction, with cells TN, FP, FN, TP.]

Building a "Confusion" Matrix. [2x2 confusion matrix: Truth vs Prediction, with cells TN, FP, FN, TP.]
Sensitivity = TP / (TP + FN) … the chance of spotting a "3" when presented with one (i.e. accuracy on class "3").
Specificity = TN / (TN + FP) … the chance of spotting an "8" when presented with one (i.e. accuracy on class "8").
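A sketch of computing these from a confusion matrix in Python; the toy labels and the use of scikit-learn's confusion_matrix are assumptions for illustration, with class "3" treated as the positive class:

```python
# Sketch: build the confusion matrix and read off sensitivity and specificity,
# treating the rare class '3' as the "positive" class.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([3, 3, 3, 8, 8, 8, 8, 8])        # toy labels, not course data
y_pred = np.array([3, 8, 3, 8, 8, 3, 8, 8])        # toy predictions

# With labels=[8, 3] the matrix is [[TN, FP], [FN, TP]] for positive class '3'.
(tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred, labels=[8, 3])

sensitivity = tp / (tp + fn)    # chance of spotting a '3' when shown one
specificity = tn / (tn + fp)    # chance of spotting an '8' when shown one
print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```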

ROC analysis… your turn to think… Given a confusion matrix (Truth vs Prediction: TN, FP, FN, TP) for a test set where 100 examples were of one class, 90 were of the other, and there were 190 examples overall, compute:
Sensitivity = TP / (TP + FN) = ?
Specificity = TN / (TN + FP) = ?

ROC analysis – incorporating costs… FP = total number of false positives I make on the testing data; FN = total number of false negatives I make on the testing data. Weighting these differently is useful if one kind of mistake is more important than the other. For example, if a predictor… misses a case of a particularly horrible disease (FALSE NEGATIVE), or sends a patient for painful surgery when it is unnecessary (FALSE POSITIVE).
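A minimal sketch of one way such costs could be folded into the evaluation; the 10:1 cost ratio is an invented example, not a value from the lecture:

```python
# Sketch: a cost-weighted error, where the two mistake types carry
# different penalties. The cost values below are illustrative only.
def weighted_cost(fp, fn, cost_fp=1.0, cost_fn=10.0):
    """Total cost when a false negative is judged 10x worse than a false positive."""
    return cost_fp * fp + cost_fn * fn

# Two classifiers with the same total number of mistakes (30)...
print(weighted_cost(fp=25, fn=5))    # 75.0  -> mostly "cheap" mistakes
print(weighted_cost(fp=5, fn=25))    # 255.0 -> mostly "expensive" misses
```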

Experiments in Machine Learning. Learning algorithms depend on the data you give them:
- some algorithms are more "stable" than others
- we must take this into account in experiments
- cross-validation is one way to monitor stability
- plot confidence intervals!
ROC analysis can help us if…
- there are imbalanced classes
- false positives/negatives have different costs