Machine Learning in Practice Lecture 10 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.

Plan for the Day
- Announcements
  - Questions?
  - Quiz answer key posted
- Today's Data Set: Prevalence of Gambling
- Exploring the Concept of Cost

Quiz Notes

Leave-One-Out Cross Validation
- On each fold, train on all but 1 data point, test on 1 data point
  - Pro: Maximizes the amount of training data used on each fold
  - Con: Not stratified
  - Con: Takes a long time on large data sets
- Best to use only when you have a very small amount of training data
  - Only needed when 10-fold cross validation is not feasible because of lack of data
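
A minimal sketch of leave-one-out cross validation in Python, assuming scikit-learn is available; the iris data and decision tree here are stand-ins, not the lecture's data set or classifier:

```python
# Sketch of leave-one-out cross validation (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)               # placeholder data set
predictions, truths = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = DecisionTreeClassifier()              # retrained from scratch on each fold
    clf.fit(X[train_idx], y[train_idx])         # train on all but one instance
    predictions.append(clf.predict(X[test_idx])[0])  # test on the held-out instance
    truths.append(y[test_idx][0])

print("LOOCV accuracy:", accuracy_score(truths, predictions))
```

Note that the loop runs once per instance, which is why the procedure gets slow on large data sets.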

632 Bootstrap
- A method for estimating performance when you have a small data set
  - Consider it an alternative to leave-one-out cross validation
- Sample n times with replacement to create the training set
  - Some instances will be repeated
  - Some will be left out – this will be your test set
  - About 63% of the instances in the original set will end up in the training set
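
The ~63% figure can be checked with a quick simulation; this sketch just samples n indices with replacement and counts how many distinct instances land in the training set:

```python
# Quick check of the ~63% coverage claim for sampling with replacement.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
train_idx = rng.integers(0, n, size=n)        # sample n times with replacement
coverage = len(np.unique(train_idx)) / n      # fraction of distinct instances covered
print(f"fraction of instances in the training set: {coverage:.3f}")   # ≈ 0.632
```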

632 Bootstrap
- Estimating error over the training set will be an optimistic estimate of performance
  - Because you trained on these examples
- Estimating error over the test set will be a pessimistic estimate of the error
  - Because the 63/37 split gives you less training data than 90/10
- Estimate error by combining the optimistic and pessimistic estimates:
  .632 * pessimistic_estimate + .368 * optimistic_estimate
- Iterate several times and average the performance estimates
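
A minimal sketch of the 0.632 bootstrap estimate, again assuming scikit-learn and using placeholder data and classifier:

```python
# Sketch of the 0.632 bootstrap performance estimate (illustrative only).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)
estimates = []
for _ in range(50):                                    # iterate several times and average
    train_idx = rng.integers(0, n, size=n)             # sample n times with replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)   # left-out instances form the test set
    if len(test_idx) == 0:
        continue
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    err_train = 1 - accuracy_score(y[train_idx], clf.predict(X[train_idx]))   # optimistic
    err_test = 1 - accuracy_score(y[test_idx], clf.predict(X[test_idx]))      # pessimistic
    estimates.append(0.632 * err_test + 0.368 * err_train)

print("0.632 bootstrap error estimate:", np.mean(estimates))
```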

Prevalence of Gambling

Gambling Prevalence
- Goal is to predict how often people…
  - who fit in a particular demographic group (i.e., male versus female, White versus Black versus Hispanic versus other)
  - are classified as having a particular level of gambling risk (at risk, problem, or pathological)
  - either during one specific year or in their lifetime


Gambling Prevalence * Risk is the most predictive feature.

Gambling Prevalence * Demographic is the least predictive feature.

Which algorithm will perform best?

Which algorithm will perform best?
- Decision Trees: .26 Kappa
- Naïve Bayes: .31 Kappa
- SMO: .53 Kappa

Decision Trees * What’s it ignoring and why?

With Binary Splits – Kappa .41

What was different with SMO?
- Trained a model for all pairs
  - The features that were important for one pairwise distinction were different from those for other pairwise distinctions
  - Characteristic=Black was most important for High versus Low (ignored by decision trees)
  - When and Risk were most important for High versus Medium
- Decision trees pay attention to all distinctions at once
  - Totally ignored a feature that was important for some pairwise distinctions

What was wrong with Naïve Bayes?
- Probably just learned noisy probabilities because the data set is small
- Hard to distinguish Low and Medium

Back to Chapter 5

Thinking About the Cost of an Error – A Theoretical Foundation for Machine Learning
- Cost
  - Making the right choice doesn't cost you anything
  - Making an error comes with a cost
  - Some errors cost more than others
  - Rather than evaluating your model in terms of accuracy, which treats every error as though it were the same, you can think about average cost
  - The real cost is determined by your application
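
To make "average cost" concrete, here is a small sketch that scores a confusion matrix against a cost matrix; all of the numbers are made up for illustration:

```python
# Average cost per instance from a confusion matrix and a cost matrix (made-up numbers).
import numpy as np

# confusion[i][j] = number of instances of true class i predicted as class j
confusion = np.array([[50,  5,  2],
                      [ 4, 40,  6],
                      [ 1,  3, 30]])

# cost[i][j] = cost of predicting class j when the true class is i
# (zeros on the diagonal: correct predictions cost nothing)
cost = np.array([[0, 1, 10],
                 [1, 0,  1],
                 [1, 1,  0]])

accuracy = np.trace(confusion) / confusion.sum()
average_cost = (confusion * cost).sum() / confusion.sum()
print("accuracy:", accuracy, "average cost per instance:", average_cost)
```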

Unified Framework
- Connection between optimization techniques and evaluation methods
- Think about what function you are optimizing
  - That's what learning is
- Evaluation measures how well you did that optimization
  - So it makes sense for there to be a deep connection between the learning technique and the evaluation
- New machine learning algorithms are often motivated by modifications to the conceptualization of the cost of an error

What’s the cost of a gambling mistake?

Thinking About the Practical Cost of an Error
- In document retrieval, precision is more important than recall
  - You're picking from the whole web, so if you miss some relevant documents it's not a big deal
  - Precision is more important – you don't want to have to slog through lots of irrelevant stuff

Thinking About the Practical Cost of an Error
- What if you are trying to predict whether someone will be late?
  - Is it worse to fail to predict that someone will be late when they will be, or to predict they will be late when they won't?

Thinking About the Practical Cost of an Error
- What if you're trying to predict whether a message will get a response or not?

Thinking About the Practical Cost of an Error
- Let's say you are picking out errors in student essays
  - If you detect an error, you offer the student a correction for their error
  - What are the implications of missing an error?
  - What are the implications of imagining an error that doesn't exist?

Cost Sensitive Classification
- An example of the connection between the notion of the cost of an error and the training method
- Say you manipulate the cost of different types of errors
  - The cost of a decision is computed based on the expected cost
- That affects the function the algorithm is "trying" to optimize
  - Minimize expected cost rather than maximizing accuracy

Cost Sensitive Classification
- Cost sensitive classifiers work in two ways:
  - Manipulate the composition of the training data (by either changing the weight of some instances or by artificially boosting the number of instances of some types by strategically including some duplicates)
  - Manipulate the way predictions are made: select the option that minimizes expected cost rather than the most likely choice (see the sketch below)
- In practice it's hard to use cost-sensitive classification in a useful way
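
Here is a minimal sketch of the second approach, choosing the prediction that minimizes expected cost given the classifier's probability distribution; the probabilities and cost matrix mirror the worked example on the next slides:

```python
# Sketch of minimum-expected-cost prediction (illustrative only).
import numpy as np

classes = ["A", "B", "C"]
# cost[i][j] = cost of predicting class j when the true class is i
cost = np.array([[0, 1, 10],    # true class A
                 [1, 0,  1],    # true class B
                 [1, 1,  0]])   # true class C
probs = np.array([0.75, 0.10, 0.15])          # estimated P(A), P(B), P(C)

expected_cost = probs @ cost                  # expected cost of predicting each class
for c, ec in zip(classes, expected_cost):
    print(f"expected cost of predicting {c}: {ec:.2f}")   # A: 0.25, B: 0.90, C: 7.60

best = classes[int(np.argmin(expected_cost))]
print("minimum-expected-cost prediction:", best)          # A
```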

Cost Sensitive Classification
- What if it's 10 times more expensive to make a mistake when selecting Class C?
- Expected cost of a decision: Σ_j C_j p_j
  - The cost of predicting class C_j is computed by multiplying the j column of the cost matrix by the corresponding probabilities
- The expected cost of selecting C, if the probabilities are A=75%, B=10%, C=15%, is .75*10 + .1*1 = 7.6

Cost matrix (rows = actual class, columns = predicted class):
     A  B  C
  A  0  1  10
  B  1  0  1
  C  1  1  0

Cost Sensitive Classification
- The expected cost of selecting B, if the probabilities are A=75%, B=10%, C=15%, is .75*1 + .15*1 = .9
- If A is selected, the expected cost is .1*1 + .15*1 = .25
- You can make a choice by minimizing the expected cost of an error
- So in this case, the expected cost is lowest when A, the class with the highest probability, is selected

Using Cost Sensitive Classification

* Set up the cost matrix * Assign a high penalty to the largest error cell

Results
- Without cost sensitive classification: .53
- Using cost sensitive classification increased performance to .55
  - Tiny difference because SMO assigns probability 1 to all predictions
  - Not statistically significant
- SMO with default settings normally predicts one class with confidence 1 and the others with confidence 0, so cost sensitive classification does not have a big effect

What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j – a_j)², the cost of a decision
  - j iterates over the classes (A, B, C)
  - p_j is the predicted probability for class j; a_j is 1 for the actual class and 0 otherwise
- Penalizes you for putting high confidence on a wrong prediction and/or low confidence on a right prediction

What is the cost of an error?
- Assume first that all errors have the same cost
- Quadratic loss: Σ_j (p_j – a_j)², the cost of a decision
  - j iterates over the classes (A, B, C)
- If C is right and you say A=75%, B=10%, C=15%
  - (.75 – 0)² + (.1 – 0)² + (.15 – 1)² = 1.295
- If A is right and you say A=75%, B=10%, C=15%
  - (.75 – 1)² + (.1 – 0)² + (.15 – 0)² = .095
- Lower cost if the highest probability is on the correct choice
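
A short sketch that reproduces the two quadratic-loss calculations above:

```python
# Quadratic loss for the worked example above.
import numpy as np

def quadratic_loss(probs, actual_index):
    """Sum over classes of (p_j - a_j)^2, where a_j is 1 for the actual class, else 0."""
    actual = np.zeros(len(probs))
    actual[actual_index] = 1.0
    return float(np.sum((probs - actual) ** 2))

probs = np.array([0.75, 0.10, 0.15])                 # P(A), P(B), P(C)
print("C is correct:", quadratic_loss(probs, 2))     # 1.295
print("A is correct:", quadratic_loss(probs, 0))     # 0.095
```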

What is the cost of an error?
- Assume all errors have the same cost
- Informational loss: –log_k(p_i)
  - k is the number of classes
  - i is the correct class
  - p_i is the probability assigned to class i
- If C is right and you say A=75%, B=10%, C=15%
  - –log_3(.15) ≈ 1.73
- If A is right and you say A=75%, B=10%, C=15%
  - –log_3(.75) ≈ .26
- Lower cost if the highest probability is on the correct choice
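
And the corresponding sketch for informational loss; the base-k logarithm follows the slide, though base 2 is the more common convention:

```python
# Informational loss for the worked example above.
import math

def informational_loss(probs, actual_index, base=None):
    k = len(probs)                                    # number of classes
    base = k if base is None else base                # slide uses log base k
    return -math.log(probs[actual_index], base)

probs = [0.75, 0.10, 0.15]                            # P(A), P(B), P(C)
print("C is correct:", informational_loss(probs, 2)) # ≈ 1.73
print("A is correct:", informational_loss(probs, 0)) # ≈ 0.26
```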

Trade-Offs Between Quadratic Loss and Informational Loss
- Quadratic loss pays attention to the probabilities placed on all classes
  - So you can get partial credit if you put really low probabilities on some of the wrong choices
  - Bounded (max value is 2)
- Informational loss only pays attention to how you treated the correct prediction
  - More like gambling
  - Not bounded

Minimum Description Length Principle
- Another way of viewing the connection between optimization and evaluation
  - Based on information theory
- Training minimizes how much information you encode in the model
  - How much information does it take to determine what class an instance belongs to?
  - Information is encoded in your feature space
- Evaluation measures how much information is lost in the classification
- Tension between the complexity of the model at training time and the information loss at testing time

Take Home Message
- Different types of errors have different costs
  - Costs associated with cells in the confusion matrix
  - Costs may also be associated with the level of confidence with which decisions are made
- Connection between the concept of the cost of an error and the learning method
  - Machine learning algorithms are optimizing a cost function
  - The cost function should reflect the real cost in the world
- In cost sensitive classification, the notion of which types of errors cost more can influence classification performance