Machine Learning in Practice Lecture 7 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

Plan for the Day
Announcements
  - No new homework this week
  - No quiz this week
  - Project proposal due by the end of the week
Naïve Bayes Review
Linear Model Review
Tic Tac Toe across models

Project Proposals
If you are using one of the prefabricated projects on Blackboard, let me know which one.
Otherwise, tell me what data you are using:
  - Number of instances
  - What you're predicting
  - What features you are working with
Include a short description of your ideas for improving performance.
If convenient, let me know what the baseline performance is.
Note: you can use your own data for the assignments from now on.

Example of ideas: How could you expand on what’s here?

Add features that describe the source

Example of ideas: How could you expand on what’s here? Add features that describe things that were going on during the time when the poll was taken

Example of ideas: How could you expand on what’s here? Add features that describe personal characteristics of the candidates

Getting the Baseline Performance
  - Percent correct
  - Percent correct, controlling for agreement expected by chance
  - Performance on individual categories
  - Confusion matrix
* Right-click in the Result list and select Save Result Buffer to save performance stats.

Clarification about Cohen's Kappa
Assume two coders assigned each of 16 instances to category A or category B, and you want to measure their agreement.

                  Coder 2: A   Coder 2: B   Row total
  Coder 1: A           5            2            7
  Coder 1: B           1            8            9
  Column total         6           10           16

Total agreements (diagonal) = 5 + 8 = 13
Percent agreement = 13/16 = .81
Agreement expected by chance = Sum over categories i of (Row_i * Col_i) / OverallTotal
  = (7*6)/16 + (9*10)/16 = 2.625 + 5.625 = 8.25 (about 8.3)
Kappa = (TotalAgreement - AgreementByChance) / (OverallTotal - AgreementByChance)
  = (13 - 8.3) / (16 - 8.3) = 4.7 / 7.7 = .61
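The same calculation can be scripted. Below is a minimal Python sketch (not part of the original slides) that computes kappa from the 2x2 agreement table above, using the cell counts from the example.

```python
# Cohen's kappa from a 2x2 agreement table (counts from the example:
# the coders agree on 5 A's and 8 B's, and disagree on 3 instances).
table = [[5, 2],   # Coder 1 = A: Coder 2 says A, B
         [1, 8]]   # Coder 1 = B: Coder 2 says A, B

total = sum(sum(row) for row in table)                    # 16
observed = sum(table[i][i] for i in range(len(table)))    # 13 agreements
row_totals = [sum(row) for row in table]                  # [7, 9]
col_totals = [sum(col) for col in zip(*table)]            # [6, 10]
expected = sum(r * c / total for r, c in zip(row_totals, col_totals))  # 8.25

kappa = (observed - expected) / (total - expected)
print(round(observed / total, 2), round(expected, 2), round(kappa, 2))  # 0.81 8.25 0.61
```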

Naïve Bayes Review

Naïve Bayes Simulation You can modify the Class counts and Counts for each attribute value within each class. You can also turn smoothing on or off. Finally, you can manipulate the attribute values for the instance you want to classify with your model.
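For reference, here is a minimal Python sketch (not part of the slides) of the computation the simulation performs: class counts and per-class attribute-value counts turned into Naïve Bayes scores, with optional Laplace ("add one") smoothing. The class names, attributes, and counts are invented for illustration.

```python
# Naive Bayes from raw counts, with optional add-one smoothing.
class_counts = {"win": 6, "lose": 4}
# counts[class][attribute][value] = training instances of that class with that value
counts = {
    "win":  {"center": {"X": 5, "O": 1}, "corner": {"X": 4, "O": 2}},
    "lose": {"center": {"X": 1, "O": 3}, "corner": {"X": 2, "O": 2}},
}

def classify(instance, smoothing=True):
    total = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = class_counts[c] / total            # prior probability of the class
        for attribute, value in instance.items():
            value_counts = counts[c][attribute]
            num = value_counts.get(value, 0)
            den = class_counts[c]
            if smoothing:                          # add 1 to every value's count
                num += 1
                den += len(value_counts)
            score *= num / den                     # conditional probability
        scores[c] = score
    return max(scores, key=scores.get), scores

print(classify({"center": "X", "corner": "O"}))
print(classify({"center": "X", "corner": "O"}, smoothing=False))
```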

Linear Model Review

Remember this: What do concepts look like?

What are we learning?
We're learning to draw a line through a multidimensional space
  - Really a "hyperplane"
Each function we learn is like a single split in a decision tree
  - But it can take many features into account at one time rather than just one
F(x) = C0 + C1*X1 + C2*X2 + C3*X3
  - X1 through Xn are our attributes
  - C0 through Cn are coefficients
  - We're learning the coefficients, which are weights
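As a concrete illustration (not from the original slides), the function is just an intercept plus a weighted sum of the attributes; the coefficients below are made up.

```python
# F(x) = C0 + C1*X1 + C2*X2 + ... : intercept plus weighted sum of attributes.
def f(x, coefficients, intercept):
    return intercept + sum(c * xi for c, xi in zip(coefficients, x))

print(f([1, 0, 2], coefficients=[2, -1, -3], intercept=0.5))  # 0.5 + 2 - 6 = -3.5
```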

What do linear models do?
Notice that what you want to predict is a number
  - You use the number to order instances
You want to learn a function that produces the same ordering
Linear models literally add evidence
  - Result = 2*A - B - 3*C
  - The predicted values may fall between -4 and 2 rather than between 1 and 5, but the order is the same.
  - Order affects correlation; the actual value affects absolute error.

What do linear models do?
If what you want to predict is a category, you can assign values to ranges
  - Sort instances based on predicted value
  - Cut based on a threshold
  - e.g., Val1 where f(x) < 0, Val2 otherwise
Result = 2*A - B - 3*C
  - The predicted values may fall between -4 and 2 rather than between 1 and 5, but the order is the same.
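A minimal Python sketch (not from the slides) of ordering instances by the score Result = 2*A - B - 3*C and cutting at a threshold; the instances are invented.

```python
# Order instances by the linear score, then cut at 0 to assign a category.
def score(a, b, c):
    return 2 * a - b - 3 * c

instances = [(1, 0, 0), (0, 1, 0), (1, 1, 1), (0, 0, 1)]   # invented (A, B, C) values

ranked = sorted(instances, key=lambda inst: score(*inst), reverse=True)
labels = ["Val1" if score(*inst) < 0 else "Val2" for inst in instances]

print(ranked)   # instances ordered by the score
print(labels)   # Val1 where f(x) < 0, Val2 otherwise
```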

What do linear models do?
F(x) = C0 + C1*X1 + C2*X2 + C3*X3
  - X1 through Xn are our attributes
  - C0 through Cn are coefficients
  - We're learning the coefficients, which are weights
Think of linear models as imposing a ranking on instances
  - Features associated with one class get negative weights
  - Features associated with the other class get positive weights

More on Linear Regression
Linear regression tries to minimize the sum of the squares of the differences between predicted values and actual values over all training instances
  - Sum over all instances of (predicted value of instance - actual value of instance)^2
  - Note that this is different from back propagation for neural nets, which minimizes the error at the output nodes considering only one training instance at a time
What is learned is a set of weights (not probabilities!)
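A minimal Python sketch (not from the slides) of fitting weights by ordinary least squares; the tiny data set is invented.

```python
import numpy as np

# Fit weights that minimize sum_i (predicted_i - actual_i)^2.
# The first column of 1s corresponds to the intercept C0.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
y = np.array([3.0, 5.0, 8.0, 9.0])

weights, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ weights
sum_squared_error = np.sum((predictions - y) ** 2)   # the quantity being minimized
print(weights, sum_squared_error)
```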

Limitations of Linear Regression
Can only handle numeric attributes
What do you do with your nominal attributes?
  - You could turn them into numeric attributes
  - For example: red = 1, blue = 2, orange = 3
  - But is red really less than blue? Is red closer to blue than it is to orange?
  - If you encode your attributes in an unnatural way, your algorithm may make unwanted inferences about relationships between instances
  - Another option is to turn nominal attributes into sets of binary attributes
Note: Some people said on the homework that linear models don't handle nominal attributes, and I disagreed. The reason is that you CAN have nominal attributes; you just have to represent them in a way the model can deal with.
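A minimal Python sketch (not from the slides) of the second option, turning one nominal attribute into binary indicator attributes.

```python
# One indicator per value, so a linear model can use the attribute without
# implying an ordering like red < blue < orange.
values = ["red", "blue", "orange"]

def one_hot(value):
    return [1 if value == v else 0 for v in values]

print(one_hot("red"))     # [1, 0, 0]
print(one_hot("orange"))  # [0, 0, 1]
```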

Performing well with skewed class distributions
Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities
  - Remember our math problem case
Linear models can compensate for this
  - They don't have any notion of prior probability per se
  - If they can find a good split on the data, they will find it wherever it is
  - Problem if there is not a good split

Skewed but clean separation

Skewed but no clean separation

Taking a Step Back
Linear models have rules composed of numbers
  - So they "look" more like Naïve Bayes than like Decision Trees
But the numbers are obtained through a focus on achieving accuracy
  - So the learning process is more like Decision Trees
Given these two properties, what can you say about the assumptions made about the form of the solution and about the world?

Tic Tac Toe

What algorithm do you think would work best?
How would you represent the feature space?
What cases do you think would be hard?

Example board:
X X O
X O O
X X O

Tic Tac Toe
X X O
X O O
X X O

Decision Trees: .67 Kappa
SMO: .96 Kappa
Naïve Bayes: .28 Kappa
What do you think is different about what these algorithms are learning?

Decision Trees

Naïve Bayes
Each conditional probability is based on each square in isolation
Can you guess which square is most informative?

Linear Function
Counts every X as evidence of winning
If there are more X's, then it's a win for X
Usually right, except in the case of a tie
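A minimal Python sketch (not from the slides) of the behavior described above: with roughly equal positive weight on every square containing an X, the linear score effectively counts X's against O's. The encoding and weights are invented; the board is the example board shown earlier.

```python
# Encode the nine squares as +1 for X and -1 for O; with equal weights the
# score is just (number of X's) - (number of O's).
board = ["X", "X", "O",
         "X", "O", "O",
         "X", "X", "O"]

features = [1 if square == "X" else -1 for square in board]
weights = [1.0] * 9

score = sum(w * f for w, f in zip(weights, features))
print("predicted: win for X" if score > 0 else "predicted: not a win for X")
```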

Take Home Message
Naïve Bayes is affected by prior probabilities in two places
  - Note that prior probabilities have an indirect effect on all conditional probabilities
Linear functions are not directly affected by prior probabilities
  - So sometimes they can perform better on skewed data sets
Even with the same data representation, different algorithms learn something different
  - Naïve Bayes learned that the center square is important
  - Decision trees memorized important trees
  - The linear function counted X's