Linear Methods for Classification
Based on Chapter 4 of Hastie, Tibshirani, and Friedman
David Madigan

Predictive Modeling
Goal: learn a mapping y = f(x; θ).
Need: 1. a model structure, 2. a score function, 3. an optimization strategy.
Categorical y ∈ {c_1, …, c_m}: classification. Real-valued y: regression.
Note: we usually assume {c_1, …, c_m} are mutually exclusive and exhaustive.

Probabilistic Classification
Let p(c_k) = the probability that a randomly chosen object comes from class c_k.
Objects from c_k have density p(x | c_k, θ_k) (e.g., multivariate normal).
Then, by Bayes' rule: p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k).
Bayes error rate: a lower bound on the best possible error rate.
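As a concrete illustration of Bayes' rule above, here is a minimal Python sketch, assuming two univariate Gaussian class densities; the priors and parameters are made up for the example:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup: priors p(c_k) and Gaussian densities p(x | c_k).
priors = np.array([0.7, 0.3])
means = np.array([0.0, 2.0])
sds = np.array([1.0, 1.0])

def posterior(x):
    """p(c_k | x) is proportional to p(x | c_k) p(c_k), normalized over classes."""
    lik = norm.pdf(x, loc=means, scale=sds)  # class-conditional densities at x
    joint = lik * priors
    return joint / joint.sum()

print(posterior(1.0))  # posterior class probabilities at x = 1.0
```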

[Figure: two overlapping class-conditional densities; Bayes error rate about 6%]

Classifier Types
Discriminative: model p(c_k | x) directly, e.g. logistic regression, CART.
Generative: model p(x | c_k, θ_k), e.g. "Bayesian classifiers", LDA.

Regression for Binary Classification
Can fit a linear regression model to a 0/1 response (zeroOneR.txt).
Predicted values are not necessarily between zero and one.
With p > 1 predictors, the decision boundary is still linear, e.g. 0.5 = b_0 + b_1 x_1 + b_2 x_2.
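A small sketch of this idea on synthetic data (a stand-in for zeroOneR.txt, which is not reproduced here): least squares on a 0/1 response, whose fitted values can fall outside [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + 0.5 * rng.normal(size=100) > 0).astype(float)  # synthetic 0/1 response

X = np.column_stack([np.ones_like(x), x])   # intercept plus one predictor
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares on the 0/1 y
yhat = X @ b
print(yhat.min(), yhat.max())               # fitted values typically stray outside [0, 1]
# Classify as 1 where b0 + b1*x > 0.5: a linear decision boundary.
```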

Linear Discriminant Analysis
K classes; X is the n × p data matrix. p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k).
Could model each class density as multivariate normal:
  f_k(x) = (2π)^(−p/2) |Σ_k|^(−1/2) exp(−(1/2)(x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k)).
LDA assumes Σ_k = Σ for all k. Then:
  log [p(c_k | x) / p(c_l | x)] = log(π_k / π_l) − (1/2)(μ_k + μ_l)ᵀ Σ⁻¹ (μ_k − μ_l) + xᵀ Σ⁻¹ (μ_k − μ_l).
This is linear in x.

Linear Discriminant Analysis (cont.)
It follows that the classifier should predict argmax_k δ_k(x), where
  δ_k(x) = xᵀ Σ⁻¹ μ_k − (1/2) μ_kᵀ Σ⁻¹ μ_k + log π_k
is the "linear discriminant function".
If we don't assume the Σ_k's are identical, we get quadratic discriminant analysis (QDA):
  δ_k(x) = −(1/2) log |Σ_k| − (1/2)(x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) + log π_k.

Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood:
  π̂_k = N_k / N,
  μ̂_k = Σ_{g_i = k} x_i / N_k,
  Σ̂ = Σ_k Σ_{g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)ᵀ / (N − K),
where N_k is the number of training points in class k.
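Putting the estimates and the discriminant functions together, a minimal NumPy sketch of LDA (function names are my own; for simplicity the pooled covariance divides by N, the pure ML choice, rather than N − K):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate LDA parameters: class priors, class means, pooled covariance."""
    classes = np.unique(y)
    N = len(y)
    priors = np.array([(y == k).mean() for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    centered = X - means[np.searchsorted(classes, y)]  # subtract each point's class mean
    Sigma = centered.T @ centered / N                  # ML pooled covariance
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """Evaluate delta_k(x) = x' S^-1 mu_k - (1/2) mu_k' S^-1 mu_k + log pi_k; take argmax."""
    Sinv = np.linalg.inv(Sigma)
    deltas = X @ Sinv @ means.T - 0.5 * np.sum(means @ Sinv * means, axis=1) + np.log(priors)
    return classes[np.argmax(deltas, axis=1)]

# Toy usage: two Gaussian classes in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)
classes, priors, means, Sigma = lda_fit(X, y)
print((lda_predict(X, classes, priors, means, Sigma) == y).mean())  # training accuracy
```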

[Figure: LDA vs. QDA decision boundaries]

LDA (cont.)
Fisher's rule is optimal if the classes are MVN with a common covariance matrix.
Computational complexity: O(mp²n).

Logistic Regression
Note that LDA is linear in x:
  log [p(c_k | x) / p(c_K | x)] = α_k0 + α_kᵀ x.
Linear logistic regression looks the same:
  log [p(c_k | x) / p(c_K | x)] = β_k0 + β_kᵀ x.
But the estimation procedure for the coefficients is different: LDA maximizes the joint likelihood [y, X]; logistic regression maximizes the conditional likelihood [y | X]. They usually give similar predictions.

Logistic Regression MLE
For the two-class case with y_i ∈ {0, 1}, the log-likelihood is:
  ℓ(β) = Σ_i [y_i βᵀx_i − log(1 + exp(βᵀx_i))].
To maximize it, we need to solve the (non-linear) score equations:
  ∂ℓ/∂β = Σ_i x_i (y_i − p(x_i; β)) = 0,
where p(x; β) = exp(βᵀx) / (1 + exp(βᵀx)). These are typically solved by Newton-Raphson, i.e. iteratively reweighted least squares.
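A minimal Newton-Raphson sketch for these score equations, assuming X already contains an intercept column (the function name is my own):

```python
import numpy as np

def logistic_mle(X, y, iters=25):
    """Solve sum_i x_i (y_i - p(x_i; beta)) = 0 by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # p(x_i; beta)
        grad = X.T @ (y - p)                 # score vector
        W = p * (1.0 - p)                    # variance weights
        H = X.T @ (X * W[:, None])           # X' W X, the observed information
        beta = beta + np.linalg.solve(H, grad)
    return beta
```

The diagonal of (XᵀWX)⁻¹ at the solution also supplies the standard errors behind Wald Z scores like those in the next slide's table.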

Logistic Regression Modeling
South African Heart Disease Example (y = MI).
[Table: fitted coefficient (Coef.), standard error (S.E.), and Wald Z score for Intercept, sbp, tobacco, ldl, famhist, obesity, alcohol, and age]

Regularized Logistic Regression
Ridge/LASSO-penalized logistic regression.
Successful implementations exist with over 100,000 predictor variables.
Can also regularize discriminant analysis.
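A sketch of fitting such models with scikit-learn's LogisticRegression; the data and penalty strengths here are illustrative, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
ridge = LogisticRegression(penalty='l2', C=1.0).fit(X, y)                      # ridge penalty
lasso = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X, y)  # LASSO penalty
print((lasso.coef_ != 0).sum(), "of", X.shape[1], "coefficients survive the L1 penalty")
```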

Simple Two-Class Perceptron
Define h(x) = wᵀx (with a constant 1 appended to x to carry the bias).
Classify as class 1 if h(x) > 0, class 2 otherwise.
Score function: the number of misclassification errors on the training data.
For training, replace the class 2 x_j's by −x_j; now we need h(x_j) > 0 for every training point.
Algorithm:
  Initialize the weight vector w.
  Repeat one or more times:
    For each training data point x_i:
      If the point is correctly classified, do nothing.
      Else update w ← w + x_i.
Guaranteed to converge to a separating hyperplane (if one exists).
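A direct transcription of this algorithm in NumPy, assuming labels in {1, 2} as on the slide (the helper name is my own):

```python
import numpy as np

def perceptron(X, y, passes=100):
    """Train by adding each misclassified (sign-flipped) point to w."""
    Z = np.column_stack([X, np.ones(len(X))])  # append 1 to carry the bias term
    Z[y == 2] *= -1                            # replace the class-2 x_j's by -x_j
    w = np.zeros(Z.shape[1])
    for _ in range(passes):
        updated = False
        for z in Z:
            if w @ z <= 0:   # misclassified: we need h(x) > 0 for every point
                w += z
                updated = True
        if not updated:      # no errors left: a separating hyperplane was found
            break
    return w                 # classify a new x as class 1 iff w @ [x, 1] > 0
```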

“Optimal” Hyperplane
The “optimal” hyperplane separates the two classes and maximizes the distance to the closest point from either class. Finding this hyperplane is a convex optimization problem. This notion plays an important role in support vector machines.

[Figure: the maximum-margin separating hyperplane w·x + b = 0]
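A sketch of computing this hyperplane with scikit-learn's linear SVM; the hard margin is approximated by a large C, and the toy data are assumed to be separable:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.repeat([0, 1], 20)

svm = SVC(kernel='linear', C=1e6).fit(X, y)  # large C approximates a hard margin
w, b = svm.coef_[0], svm.intercept_[0]       # hyperplane w·x + b = 0
print("margin width:", 2 / np.linalg.norm(w))  # the quantity the optimization maximizes
```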