Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Algorithms: The Basic Methods
● Inferring rudimentary rules
● Naïve Bayes, probabilistic model
● Constructing decision trees
● Constructing rules
● Association rule learning
● Linear models
● Instance-based learning
● Clustering

Linear Models: Linear Regression
● Works most naturally with numeric attributes
● Standard technique for numeric prediction
  Outcome is a linear combination of the attributes
● Weights are calculated from the training data
● Predicted value for the i-th training instance x^(i):
  w_0 x_0^(i) + w_1 x_1^(i) + ... + w_k x_k^(i) = Σ_j w_j x_j^(i)
  (assuming each instance is extended with a constant attribute x_0 with value 1)
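A minimal sketch of this prediction in NumPy; the weight and attribute values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical weights w_0..w_k (w_0 multiplies the constant attribute)
w = np.array([0.5, 2.0, -1.0])

# One instance with k = 2 attributes, extended with the constant attribute x_0 = 1
x = np.array([1.0, 3.0, 4.0])

# Predicted value: w_0*x_0 + w_1*x_1 + ... + w_k*x_k
prediction = np.dot(w, x)
print(prediction)  # 0.5*1 + 2.0*3 - 1.0*4 = 2.5
```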

Minimizing the Squared Error
● Choose the k+1 coefficients to minimize the squared error on the training data
● Squared error:
  Σ_i ( y^(i) − Σ_j w_j x_j^(i) )²
● Derive the coefficients using standard matrix operations
● Can be done if there are more instances than attributes (roughly speaking)
● Minimizing the absolute error is more difficult
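As a sketch of those "standard matrix operations", the least-squares coefficients can be obtained with NumPy; the training data below are invented for illustration:

```python
import numpy as np

# Toy training data: n = 4 instances, k = 2 attributes (made-up numbers)
X = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [4.0, 2.0],
              [3.0, 5.0]])
y = np.array([5.1, 7.9, 8.2, 13.8])

# Extend every instance with the constant attribute x_0 = 1
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Choose the k+1 coefficients that minimize the squared error;
# lstsq effectively solves the normal equations (X1^T X1) w = X1^T y,
# which requires (roughly) more instances than attributes.
w, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
print(w)  # [w_0, w_1, w_2]
```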

Classification
● Any regression technique can be used for classification
  Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't
  Prediction: predict the class corresponding to the model with the largest output value (membership value)
● For linear regression this is known as multi-response linear regression
● Weakness: the values are not proper probability estimates, and are not even necessarily in the [0,1] range
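A sketch of multi-response linear regression built on the least-squares fit above; the function names and toy data are my own, not from the book:

```python
import numpy as np

def fit_multiresponse(X1, y, classes):
    """One regression per class: target 1 for members of the class, 0 otherwise."""
    return {c: np.linalg.lstsq(X1, (y == c).astype(float), rcond=None)[0]
            for c in classes}

def predict_multiresponse(models, x1):
    """Predict the class whose regression model gives the largest output."""
    return max(models, key=lambda c: np.dot(models[c], x1))

# Toy data: 2 attributes, instances already extended with the constant attribute 1
X1 = np.array([[1, 0.2, 0.1], [1, 0.9, 0.8], [1, 0.1, 0.3], [1, 0.8, 0.9]])
y = np.array(["a", "b", "a", "b"])
models = fit_multiresponse(X1, y, ["a", "b"])
print(predict_multiresponse(models, np.array([1.0, 0.85, 0.7])))  # expected "b"
```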

Linear Models: Logistic Regression
Builds a linear model for a transformed target variable
Assume we have two classes
Logistic regression replaces the original target
  Pr[1 | x_1, x_2, ..., x_k]
with the transformed target
  log( Pr[1 | x_1, ..., x_k] / (1 − Pr[1 | x_1, ..., x_k]) )
The logit transformation maps [0,1] to (−∞, +∞)
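A small sketch to make the logit transformation and its inverse (the sigmoid) concrete:

```python
import numpy as np

def logit(p):
    """Map a probability in (0, 1) to a real number in (-inf, +inf)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: map a linear score back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

print(logit(0.5))           # 0.0
print(sigmoid(logit(0.9)))  # 0.9 -- round trip
```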

Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters
Can use logarithms of the probabilities and maximize the log-likelihood of the model:
  Σ_i (1 − y^(i)) log(1 − Pr[1 | x^(i)]) + y^(i) log(Pr[1 | x^(i)])
where the y^(i) are either 0 or 1
The weights w_i need to be chosen to maximize the log-likelihood (a relatively simple method: iteratively re-weighted least squares)
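A sketch of maximizing this log-likelihood. Note the book's suggested method is iteratively re-weighted least squares; plain gradient ascent is used here only as a simpler stand-in, and the toy data and learning rate are assumptions of mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X1, y):
    """Sum over instances of (1 - y) log(1 - p) + y log(p), with p = Pr[1 | x]."""
    p = sigmoid(X1 @ w)
    return np.sum((1 - y) * np.log(1 - p) + y * np.log(p))

def fit_logistic(X1, y, lr=0.1, iters=2000):
    """Maximize the log-likelihood by gradient ascent (stand-in for IRLS)."""
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = sigmoid(X1 @ w)
        w += lr * X1.T @ (y - p)   # gradient of the log-likelihood
    return w

# Toy two-class data, instances extended with the constant attribute 1
X1 = np.array([[1, 0.1], [1, 0.4], [1, 0.6], [1, 0.9]])
y = np.array([0, 0, 1, 1])
w = fit_logistic(X1, y)
print(w, log_likelihood(w, X1, y))
```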

Multiple Classes
Can perform logistic regression independently for each class (like multi-response linear regression)
Problem: the probability estimates for the different classes won't sum to one
An alternative that often works well in practice: pairwise classification (a.k.a. one-versus-one, in contrast with one-versus-all)

Pairwise Classification
Idea: build a model for each pair of classes, using only the training data from those two classes
Problem? Have to solve k(k−1)/2 classification problems for a k-class problem
This turns out not to be a problem in many cases because the training sets become small:
  Assume the data are evenly distributed, i.e. 2n/k instances per learning problem for n instances in total
  Suppose the learning algorithm is linear in n
  Then the runtime of pairwise classification is proportional to (k(k−1)/2) × (2n/k) = (k−1)n
(a sketch of pairwise training and voting follows below)
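A sketch of pairwise (one-versus-one) classification with majority voting, using the least-squares binary learner from earlier; the helper names and toy data are illustrative assumptions:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def fit_pairwise(X1, y, classes, fit_binary):
    """Train one binary model per pair of classes, on data from those two classes only."""
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = (y == c1) | (y == c2)
        models[(c1, c2)] = fit_binary(X1[mask], (y[mask] == c1).astype(float))
    return models

def predict_pairwise(models, x1):
    """Each pairwise model votes for one of its two classes; predict the class with the most votes."""
    votes = Counter()
    for (c1, c2), w in models.items():
        votes[c1 if np.dot(w, x1) > 0.5 else c2] += 1
    return votes.most_common(1)[0][0]

# Binary learner: least-squares regression on 0/1 targets
fit_binary = lambda X1, t: np.linalg.lstsq(X1, t, rcond=None)[0]

# Toy 3-class data with one attribute, extended with the constant attribute 1
X1 = np.array([[1, 0.1], [1, 0.2], [1, 0.5], [1, 0.6], [1, 0.9], [1, 1.0]])
y = np.array(["a", "a", "b", "b", "c", "c"])
models = fit_pairwise(X1, y, ["a", "b", "c"], fit_binary)
print(predict_pairwise(models, np.array([1.0, 0.5])))  # expected "b"
```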

Linear Models are Hyperplanes
Two-class logistic regression: the decision boundary is where the "probability" equals 0.5, i.e.
  Pr[1 | x_1, ..., x_k] = 1 / (1 + exp(−(w_0 x_0 + w_1 x_1 + ... + w_k x_k))) = 0.5,
which occurs when
  w_0 x_0 + w_1 x_1 + ... + w_k x_k = 0
Thus logistic regression can only separate data that can be separated by a hyperplane

Linear Models are Hyperplanes
Multi-response linear regression has the same problem.
Class 1 is assigned (over class 2) if its model's output is larger, i.e. if
  w_0^(1) x_0 + w_1^(1) x_1 + ... + w_k^(1) x_k > w_0^(2) x_0 + w_1^(2) x_1 + ... + w_k^(2) x_k
Each pair of classes is separated by a linear decision boundary / hyperplane

Linear Models: The Perceptron
Don't actually need probability estimates if all we want to do is classification
Different approach: learn a separating hyperplane
Assumption: the data are linearly separable
Algorithm for learning a separating hyperplane: the perceptron learning rule
Hyperplane:
  w_0 x_0 + w_1 x_1 + ... + w_k x_k = 0
where we again assume that there is a constant attribute x_0 with value 1 (so w_0 acts as a bias)
If the sum is greater than zero we predict the first (positive) class, otherwise the second (negative) class

The Algorithm
  Set all weights to zero
  Until all instances in the training data are classified correctly
    For each instance x^(i) in the training data
      If x^(i) is classified incorrectly by the perceptron
        If x^(i) belongs to the first class, add it to the weight vector
        else subtract it from the weight vector
Why does this work? Consider the situation where an instance y pertaining to the first class has been added to the weight vector: the output for y has increased by
  y_0 y_0 + y_1 y_1 + ... + y_k y_k
This number is always positive, so the hyperplane has moved in the correct direction (a runnable sketch follows below)
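A sketch of the perceptron learning rule in NumPy; the toy data and the max_epochs cap (to avoid looping forever if the data are not separable) are my additions:

```python
import numpy as np

def train_perceptron(X1, y, max_epochs=100):
    """Perceptron learning rule: start from zero weights; add a misclassified
    first-class instance to the weight vector, subtract a misclassified
    second-class instance, until everything is classified correctly."""
    w = np.zeros(X1.shape[1])          # set all weights to zero
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X1, y):   # target is +1 (first class) or -1 (second class)
            predicted = 1 if np.dot(w, x) > 0 else -1
            if predicted != target:
                w += target * x        # add or subtract the instance
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy linearly separable data, extended with the constant (bias) attribute 1
X1 = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X1, y))
```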

Perceptron as a Neural Network
[figure: a single-layer network; the input-layer nodes feed directly into one output-layer node]

Online Learning
Can also be used in an online setting in which new instances arrive continuously

Linear Models: Winnow
Another mistake-driven algorithm for finding a separating hyperplane
Assumes binary data (i.e. the attribute values are either zero or one)
Difference: multiplicative updates instead of additive updates
  Weights are multiplied by a user-specified parameter α (or its inverse)
Another difference: a user-specified threshold parameter θ
  Predict the first class if
    w_1 x_1 + w_2 x_2 + ... + w_k x_k > θ

The Algorithm
  while some instances are misclassified
    for each instance x in the training data
      classify x using the current weights
      if the predicted class is incorrect
        if x belongs to the first class
          for each x_j that is 1, multiply w_j by α (if x_j is 0, leave w_j unchanged)
        otherwise
          for each x_j that is 1, divide w_j by α (if x_j is 0, leave w_j unchanged)
Winnow is very effective in homing in on relevant features (it is attribute efficient)
Can also be used in an online setting in which new instances arrive continuously (like the perceptron algorithm); a sketch in code follows below
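A sketch of Winnow's multiplicative updates; the default of θ = k/2, the choice of α = 2, and the toy data are assumptions of mine rather than values from the book:

```python
import numpy as np

def train_winnow(X, y, alpha=2.0, theta=None, max_epochs=100):
    """Winnow: binary attributes, multiplicative updates.
    y holds True for the first class; theta is the user-specified threshold."""
    n_attrs = X.shape[1]
    if theta is None:
        theta = n_attrs / 2.0          # assumed default; the book leaves theta to the user
    w = np.ones(n_attrs)               # start with all weights equal to one
    for _ in range(max_epochs):
        mistakes = 0
        for x, is_first in zip(X, y):
            predicted_first = np.dot(w, x) > theta
            if predicted_first != is_first:
                mistakes += 1
                if is_first:
                    w[x == 1] *= alpha  # promote the attributes that are on
                else:
                    w[x == 1] /= alpha  # demote the attributes that are on
        if mistakes == 0:
            break
    return w

# Toy binary data where mainly the first attribute is relevant
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([True, True, False, False])
print(train_winnow(X, y))
```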

Balanced Winnow
Winnow doesn't allow negative weights, and this can be a drawback in some applications
Balanced Winnow maintains two weight vectors, one for each class: w⁺ and w⁻
An instance is classified as belonging to the first class (of the two classes) if
  (w_1⁺ − w_1⁻) x_1 + (w_2⁺ − w_2⁻) x_2 + ... + (w_k⁺ − w_k⁻) x_k > θ
  while some instances are misclassified
    for each instance x in the training data
      classify x using the current weights
      if the predicted class is incorrect
        if x belongs to the first class
          for each x_j that is 1, multiply w_j⁺ by α and divide w_j⁻ by α (if x_j is 0, leave w_j⁺ and w_j⁻ unchanged)
        otherwise
          for each x_j that is 1, multiply w_j⁻ by α and divide w_j⁺ by α (if x_j is 0, leave w_j⁺ and w_j⁻ unchanged)
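A corresponding sketch of Balanced Winnow under the same assumptions (α, θ, and the toy data are illustrative choices):

```python
import numpy as np

def train_balanced_winnow(X, y, alpha=2.0, theta=1.0, max_epochs=100):
    """Balanced Winnow: two weight vectors w_plus and w_minus; predict the first
    class if (w_plus - w_minus) . x > theta, so the effective weights can be
    negative even though each vector stays positive."""
    n_attrs = X.shape[1]
    w_plus = np.ones(n_attrs)
    w_minus = np.ones(n_attrs)
    for _ in range(max_epochs):
        mistakes = 0
        for x, is_first in zip(X, y):
            predicted_first = np.dot(w_plus - w_minus, x) > theta
            if predicted_first != is_first:
                mistakes += 1
                on = (x == 1)
                if is_first:
                    w_plus[on] *= alpha   # promote the positive weights...
                    w_minus[on] /= alpha  # ...and demote the negative ones
                else:
                    w_plus[on] /= alpha
                    w_minus[on] *= alpha
        if mistakes == 0:
            break
    return w_plus, w_minus

# Same toy binary data as for plain Winnow
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([True, True, False, False])
print(train_balanced_winnow(X, y))
```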