
Machine Learning in Practice Lecture 2 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

Plan for the Day: Any questions? Announcements: first homework assigned. Machine learning process overview. Learn how to use Weka. Introduce the assignment. Introduction to cross-validation.

Overview of the Machine Learning Process

Naïve Approach: When all you have is a hammer… [diagram: Data → Target Representation] The problem: there isn't one universally best approach!

Slightly less naïve approach: Aimless wandering… Problem 1: it takes too long! Problem 2: you might not realize all of the options that are available to you!

Expert Approach: Hypothesis driven. You might end up with the same solution in the end, but you'll get there faster. Today we'll start to learn how!

Warm Up Exercise: Every combination of feature values is represented. What will happen if you try to predict HairColor from the other features?

If you don't have good features, even the most powerful algorithm won't be able to learn an accurate prediction rule. But that doesn't mean this data set is a hopeless case! For example, maybe the people who like red and have brown hair like a different shade of red than the ones who have blond hair. So ask yourself: what information might be hidden or implicit that might allow me to learn a rule?

Getting a bit more sophisticated…

Example Data Set: We're going to consider a new algorithm, and we're also going to consider data representation issues.

More Complex Algorithm… Two simple algorithms last time: 0R (predict the majority class) and 1R (use the most predictive single feature). Today: an intro to decision trees. We will stay at a high level; we'll investigate more details of the algorithm next time. * Only makes 2 mistakes! What will it do with this example? (A sketch of running all three learners in code follows below.)
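
To make the comparison concrete, here is a minimal sketch of running 0R, 1R, and a decision tree via their Weka implementations (ZeroR, OneR, and J48) over one data set. It assumes Weka is on the classpath, and the file name example.arff is a placeholder for whatever data set you are working with.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareSimpleLearners {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; substitute your own data set
        Instances data = new DataSource("example.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // predict the last attribute

        Classifier[] learners = { new ZeroR(), new OneR(), new J48() };
        for (Classifier learner : learners) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation; covered later in this lecture
            eval.crossValidateModel(learner, data, 10, new Random(1));
            System.out.printf("%s: %.1f%% correct%n",
                    learner.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```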

Why is it better? Not because it is more complex; sometimes more complexity makes performance worse. What is different in what the three rule representations (0R, 1R, trees) assume about your data? The best algorithm for your data will give you exactly the power you need.

Let's say you know the rule you are trying to learn is a circle and you have these points. What rule would you learn? Now let's say you don't know the shape; what would you learn then? If you know the shape, you have fewer degrees of freedom, and so less room to make a mistake.

Back to the Opinion Poll Data Set. This is an example of the kind of data set you could use for your course project, though it would be better to find a larger one. The columns record who ran the opinion poll, when the poll was conducted, who the Democratic candidate would be, who the Republican candidate would be, who is running against whom, and which party will win. The last column, which party will win, is what we want to predict.

Do you see any redundant information?

Do you see any missing or hidden information?

How could you expand on what's here? Add features that describe the source, features that describe things that were going on during the time when the poll was taken, or features that describe personal characteristics of the candidates.

What do you think would be the best rule?

What would Weka do with this data?

Using Weka: Start Weka and open up the Explorer interface.

Using Weka: Click on Open File and open OpinionPoll.csv from the Lectures folder. You can save it as a .arff file. Summary stats for selected attributes are displayed.
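
The same conversion can be scripted rather than done through the GUI. Here is a minimal sketch using Weka's converter classes, assuming the file paths match your own setup:

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the CSV file
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("OpinionPoll.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances back out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("OpinionPoll.arff"));
        saver.writeBatch();
    }
}
```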

Using Weka: Observe the interaction between attributes by selecting them on the interface (select one attribute, then select another). Based on what you see, do you think the sources of the opinion polls were biased?

Using Weka: Go to the Classify panel, select a classifier, select the predicted value, start the evaluation, and observe the results.

Looking at the Results: percent correct; percent correct, controlling for correct by chance; performance on individual categories; the confusion matrix. * Right-click in the Result list and select Save Result Buffer to save performance stats.
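
You can also pull the same numbers out programmatically. A hedged sketch using Weka's Evaluation API; Weka's kappa statistic is its measure of percent correct controlling for chance, and class index 0 below is just an example:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectResults {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("OpinionPoll.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.printf("Percent correct: %.1f%%%n", eval.pctCorrect());
        System.out.printf("Kappa (chance-corrected): %.3f%n", eval.kappa());
        // Performance on an individual category (class value 0, as an example)
        System.out.printf("Class 0 precision/recall: %.3f / %.3f%n",
                eval.precision(0), eval.recall(0));
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}
```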

Notice the shape of the tree (although the text is too small to read!)

It’s making its decision based only on who the Republican candidate is.

Why did it do that?

Where will it make mistakes?

Notice the more complex rule if we force binary splits… Note that the more complex rule performs worse!
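
Forcing binary splits is a single option on Weka's J48 tree learner. A minimal sketch of toggling it and comparing the results, with the data file name again a placeholder:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BinarySplitsDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("OpinionPoll.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean binary : new boolean[] { false, true }) {
            J48 tree = new J48();
            tree.setBinarySplits(binary); // force two-way splits on nominal attributes
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.printf("binarySplits=%b: %.1f%% correct%n",
                    binary, eval.pctCorrect());
        }
    }
}
```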

More representation issues… “Gyre” by Eric Rosé

Low resolution image gives some information

Higher resolution image gives more information

But not if the accuracy is bad

Question: When might that happen?

A low resolution image can give more information than a higher resolution one if its accuracy is higher.

Assignment 1

Make sure Weka is set up properly on your machine. Know the basics of using Weka. Information about you…

Information about You: learning goals, priority on learning activities, project goals, programming competence.

Cross-Validation

If Outlook = sunny, no; else if Outlook = overcast, yes; else if Outlook = rainy and Windy = TRUE, no; else yes. Performance on training data? Performance on testing data?

IMPORTANT! If you evaluate the performance of your rule on the same data you trained on, you won't get an accurate estimate of how well it will do on new data.
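
You can see the over-optimism directly by comparing accuracy measured on the training data against a cross-validated estimate. A minimal sketch, assuming a weather data set like the one behind this rule is available as weather.nominal.arff:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainVsTest {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Accuracy measured on the data the tree was trained on (optimistic)
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation onTrain = new Evaluation(data);
        onTrain.evaluateModel(tree, data);
        System.out.printf("On training data: %.1f%%%n", onTrain.pctCorrect());

        // Cross-validated estimate of performance on unseen data (honest)
        Evaluation crossVal = new Evaluation(data);
        crossVal.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("Cross-validated:  %.1f%%%n", crossVal.pctCorrect());
    }
}
```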

What is cross validation?

Notice that cross-validation is for testing only! Not for building the rule!

But then… If we are satisfied with the performance estimate we get, then we build the model with the WHOLE SET. Now let's see how it works… If you are not satisfied with the performance you get, then you should try to determine what went wrong, and then evaluate a different model that compensates.

Simple Cross Validation: Let's say your data has attributes A, B, and C, and you want to train a rule to predict D. Split the data into seven parts. On fold 1, train on parts 2-7 and apply the trained model to part 1; the result is Accuracy1. On fold 2, train on parts 1 and 3-7 and apply the trained model to part 2; the result is Accuracy2. Continue in the same way through fold 7, where you train on parts 1-6 and apply the trained model to part 7, giving Accuracy7. Finally, average Accuracy1 through Accuracy7. (A code sketch of this loop follows below.)
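
Weka's Instances class can generate these train/test folds for you. A minimal sketch of the seven-fold loop just described, with the data file name again a placeholder:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("OpinionPoll.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        int numFolds = 7;
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1));

        double sum = 0;
        for (int fold = 0; fold < numFolds; fold++) {
            // Train on everything except this fold; test on the held-out fold
            Instances train = shuffled.trainCV(numFolds, fold);
            Instances test = shuffled.testCV(numFolds, fold);

            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            sum += eval.pctCorrect();
            System.out.printf("Fold %d: %.1f%%%n", fold + 1, eval.pctCorrect());
        }
        System.out.printf("Average accuracy: %.1f%%%n", sum / numFolds);
    }
}
```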

Remember! If we are satisfied with the performance estimate we get using cross-validation, then we build the model with the WHOLE SET. We don't use cross-validation to build the model.

Why do we do cross-validation? Use cross-validation when you do not have enough data to have completely independent train and test sets. We are trying to estimate what performance you would get if you trained over your whole set and applied that model to an independent set of the same size. We compute that estimate by averaging over folds.

Do we have to do all of the folds? Yes! The test set on each fold is too small to give you an accurate estimate of performance alone, there is variation across folds, and an evaluation over only part of the data is likely to be misleading.

Why do we do cross-validation? It makes the most of your data, since a large portion is used for training. It avoids testing on training data, which would overestimate your performance! But if you do multiple iterations of cross-validation, in some ways you are using insights from your testing data in building your model.

Questions about cross-validation from in-person students… How do you decide how many folds? How is data divided between folds? Don’t you need to have a hold-out set to be totally sure you have a good estimate of performance?

Other questions from in-person students… Do our class projects have to be classification problems per se (for example, clustering of pen stroke data)? Will we learn to work with time series data in this course?

Questions?