CSCI 347 / CS 4206: Data Mining Module 04: Algorithms Topic 06: Regression.

Slides from: Doug Gray, David Poole
Presentation transcript:

CSCI 347 / CS 4206: Data Mining Module 04: Algorithms Topic 06: Regression

Linear Models: Linear Regression
● Work most naturally with numeric attributes
● Standard technique for numeric prediction
  - Outcome is a linear combination of the attributes
● Weights are calculated from the training data
● Predicted value for the first training instance a(1), assuming each instance is extended with a constant attribute a0 = 1 (see the sketch below):
  x = w0*a0(1) + w1*a1(1) + w2*a2(1) + ... + wk*ak(1)
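A minimal Python sketch of this prediction, with made-up weights and attribute values (they are not from the lecture): the predicted value is just the dot product of the weight vector with the extended attribute vector.

    import numpy as np

    # Hypothetical weights and one instance; index 0 is the constant
    # "bias" attribute that is fixed at 1.
    w = np.array([2.0, 0.5, -1.3, 0.8])   # w0 .. w3 (made up)
    a1 = np.array([1.0, 3.0, 2.0, 5.0])   # a0(1) = 1, a1(1), a2(1), a3(1)

    # Predicted value: the linear combination sum_j wj * aj(1)
    x_pred = np.dot(w, a1)
    print(x_pred)                          # 2.0 + 1.5 - 2.6 + 4.0 = 4.9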

Minimizing the Squared Error
● Choose the k+1 coefficients w0, ..., wk to minimize the squared error on the training data
● Squared error: Σi ( x(i) - Σj wj*aj(i) )², summed over the n training instances, where x(i) is the actual target value of instance i
● Derive the coefficients using standard matrix operations (a sketch follows)
● Can be done if there are more instances than attributes (roughly speaking)
● Minimizing the absolute error is more difficult
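A minimal sketch of deriving the coefficients with matrix operations on a tiny made-up data set; numpy's least-squares routine solves the same minimization as the normal equations w = (AᵀA)⁻¹Aᵀx when AᵀA is invertible (more instances than attributes).

    import numpy as np

    # A: n x (k+1) matrix of training instances, first column is the
    # constant attribute 1; x: the n observed target values (made up).
    A = np.array([[1.0, 2.0, 1.0],
                  [1.0, 3.0, 0.0],
                  [1.0, 5.0, 2.0],
                  [1.0, 7.0, 1.0]])
    x = np.array([3.0, 4.0, 8.0, 9.0])

    # Coefficients that minimize the squared error of A w ≈ x
    w, residuals, rank, _ = np.linalg.lstsq(A, x, rcond=None)
    print(w)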

Classification
● Any regression technique can be used for classification
  - Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and to 0 for those that don't
  - Prediction: predict the class corresponding to the model with the largest output value (membership value)
● For linear regression this is known as multi-response linear regression (sketch below)
● Problem: membership values are not in the [0,1] range, so they aren't proper probability estimates
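A sketch of multi-response linear regression under the assumptions above (instances already extended with the constant attribute); the function names are invented for illustration.

    import numpy as np

    def fit_multiresponse(A, y, classes):
        """One least-squares regression per class, on 0/1 membership targets.
        A: n x (k+1) instance matrix (first column all 1s), y: class labels."""
        weights = {}
        for c in classes:
            target = (y == c).astype(float)   # 1 for members of c, 0 otherwise
            w, *_ = np.linalg.lstsq(A, target, rcond=None)
            weights[c] = w
        return weights

    def predict_class(weights, a):
        """Predict the class whose linear model produces the largest output."""
        return max(weights, key=lambda c: np.dot(weights[c], a))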

Linear Models: Logistic Regression
● Builds a linear model for a transformed target variable
● Assume we have two classes
● Logistic regression replaces the target P(1 | a1, a2, ..., ak) by the transformed target log[ P / (1 - P) ]
● This logit transformation maps [0,1] to (-∞, +∞)
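A small sketch of the logit transformation and its inverse (the sigmoid), which the resulting model on the next slide is built from:

    import math

    def logit(p):
        """Map a probability in (0,1) to the whole real line."""
        return math.log(p / (1.0 - p))

    def inverse_logit(z):
        """Sigmoid: map a real number back into (0,1)."""
        return 1.0 / (1.0 + math.exp(-z))

    print(logit(0.5))          # 0.0
    print(inverse_logit(0.0))  # 0.5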

Logit Transformation
● Resulting model: P(1 | a1, ..., ak) = 1 / (1 + exp(-(w0 + w1*a1 + ... + wk*ak)))

Example Logistic Regression Model
● Model with w0 = 0.5 and w1 = 1: P(1 | a1) = 1 / (1 + exp(-(0.5 + a1))) (evaluated below)
● Parameters are found from the training data using maximum likelihood
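A quick Python check of this example model (w0 = 0.5, w1 = 1) at a few attribute values:

    import math

    def p_class1(a1, w0=0.5, w1=1.0):
        """Example model from the slide: P(1 | a1) = 1 / (1 + exp(-(w0 + w1*a1)))."""
        return 1.0 / (1.0 + math.exp(-(w0 + w1 * a1)))

    for a1 in (-3.0, -0.5, 0.0, 3.0):
        print(a1, round(p_class1(a1), 3))
    # a1 = -0.5 gives exactly 0.5: the decision boundary of this particular model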

Multiple Classes
● Can perform logistic regression independently for each class (like multi-response linear regression)
● Problem: the probability estimates for the different classes won't sum to one (illustrated below)
● Better: train coupled models by maximizing the likelihood over all classes
● Alternative that often works well in practice: pairwise classification
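A tiny illustration, with made-up per-class scores, of why independently trained two-class models do not yield probabilities that sum to one:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Hypothetical linear outputs of three independently trained models
    # for the same instance
    scores = {"A": 1.2, "B": -0.3, "C": 0.4}
    probs = {c: sigmoid(z) for c, z in scores.items()}
    print(probs, sum(probs.values()))   # the total comes out well above 1 here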

Pairwise Classification
● Idea: build a model for each pair of classes, using only the training data from those two classes (sketch below)
● Problem? Have to solve k(k-1)/2 classification problems for a k-class problem
● Turns out not to be a problem in many cases because the training sets become small:
  - Assume the data is evenly distributed, i.e. 2n/k instances per learning problem for n instances in total
  - Suppose the learning algorithm is linear in n
  - Then the runtime of pairwise classification is proportional to (k(k-1)/2) × (2n/k) = (k-1)n
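A structural sketch of pairwise classification with voting; fit_two_class and the per-pair model interface are placeholders for whatever two-class learner is plugged in (e.g. logistic regression).

    import numpy as np
    from itertools import combinations

    def fit_pairwise(A, y, fit_two_class):
        """Train one two-class model per pair of classes, using only the
        training instances of those two classes."""
        models = {}
        for c1, c2 in combinations(np.unique(y), 2):
            mask = (y == c1) | (y == c2)
            models[(c1, c2)] = fit_two_class(A[mask], y[mask], c1, c2)
        return models

    def predict_pairwise(models, a):
        """Each pairwise model votes for one of its two classes; predict the
        class that collects the most votes."""
        votes = {}
        for pair, model in models.items():
            winner = model(a)                  # assumed to return one class of the pair
            votes[winner] = votes.get(winner, 0) + 1
        return max(votes, key=votes.get)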

Linear Models are Hyperplanes
● The decision boundary for two-class logistic regression is where the probability equals 0.5:
  P(1 | a1, ..., ak) = 1 / (1 + exp(-(w0 + w1*a1 + ... + wk*ak))) = 0.5,
  which occurs when w0 + w1*a1 + ... + wk*ak = 0 (the exponential must equal 1)
● Thus logistic regression can only separate data that can be separated by a hyperplane

Linear Models: The Perceptron
● Don't actually need probability estimates if all we want to do is classification
● Different approach: learn a separating hyperplane directly
● Assumption: the data is linearly separable
● Algorithm for learning the separating hyperplane: the perceptron learning rule
● Hyperplane: w0*a0 + w1*a1 + ... + wk*ak = 0, where we again assume that there is a constant attribute a0 with value 1 (the bias)
● If the sum is greater than zero we predict the first class, otherwise the second class

The Algorithm

    Set all weights to zero
    Until all instances in the training data are classified correctly
      For each instance I in the training data
        If I is classified incorrectly by the perceptron
          If I belongs to the first class, add it to the weight vector
          else subtract it from the weight vector

● Why does this work? Consider the situation where an instance a pertaining to the first class has been added to the weight vector. The output for a has increased by a0*a0 + a1*a1 + ... + ak*ak. This number is always positive, so the hyperplane has moved in the correct direction (and we can show the output decreases for instances of the other class). A runnable sketch follows.
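A minimal Python sketch of the rule above; the max_epochs cap is an added safeguard, not part of the rule, in case the data turns out not to be linearly separable.

    import numpy as np

    def perceptron_train(A, y, max_epochs=100):
        """Perceptron learning rule.
        A: n x (k+1) instance matrix with a constant bias attribute of 1,
        y: +1 for the first class, -1 for the second class."""
        w = np.zeros(A.shape[1])              # set all weights to zero
        for _ in range(max_epochs):
            mistakes = 0
            for a, label in zip(A, y):
                predicted = 1 if np.dot(w, a) > 0 else -1
                if predicted != label:        # classified incorrectly
                    w += label * a            # add or subtract the instance
                    mistakes += 1
            if mistakes == 0:                 # everything classified correctly
                break
        return w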

Perceptron as a Neural Network
[Figure: the perceptron drawn as a single-layer neural network, with an input layer of attribute nodes feeding an output layer with a single node]

Linear Models: Winnow
● Another mistake-driven algorithm for finding a separating hyperplane
● Assumes binary data (i.e. attribute values are either zero or one)
● Difference: multiplicative updates instead of additive updates
  - Weights are multiplied by a user-specified parameter α (or its inverse)
● Another difference: a user-specified threshold parameter θ
  - Predict the first class if w1*a1 + w2*a2 + ... + wk*ak > θ

The Algorithm

    while some instances are misclassified
      for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
          if a belongs to the first class
            for each ai that is 1, multiply wi by alpha (if ai is 0, leave wi unchanged)
          otherwise
            for each ai that is 1, divide wi by alpha (if ai is 0, leave wi unchanged)

● Winnow is very effective in homing in on the relevant features (it is attribute efficient)
● Can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm); a sketch follows
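A Python sketch of Winnow as described above. Initializing all weights to 1 and defaulting the threshold θ to k/2 are assumptions made here for illustration; the slides only say that α and θ are user-specified.

    import numpy as np

    def winnow_train(A, y, alpha=2.0, theta=None, max_epochs=100):
        """Winnow with multiplicative updates.
        A: n x k matrix of binary (0/1) attribute values,
        y: True for the first class, False for the second."""
        k = A.shape[1]
        if theta is None:
            theta = k / 2.0                       # assumed default, not from the slides
        w = np.ones(k)                            # assumed starting weights
        for _ in range(max_epochs):
            mistakes = 0
            for a, is_first in zip(A, y):
                predicted_first = np.dot(w, a) > theta
                if predicted_first != is_first:   # misclassified
                    mistakes += 1
                    if is_first:
                        w[a == 1] *= alpha        # promote weights of active attributes
                    else:
                        w[a == 1] /= alpha        # demote weights of active attributes
            if mistakes == 0:
                break
        return w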

The Mystery Sound
● What is that?!?