Linear Methods for Classification


Linear Methods for Classification
Jie Lu, Joy, Lucian {jielu+,joy+, llita+}@cs.cmu.edu

Linear Methods for Classification
What are they? Methods that give linear decision boundaries between classes.
- Linear decision boundary: {x : β_0 + β^T x = 0}
How to define the decision boundaries? Two classes of methods:
- Model a discriminant function δ_k(x) for each class as linear
- Model the boundaries between classes as linear

Two Classes of Linear Methods
Model discriminant functions δ_k(x) for each class as linear:
- Linear regression fit to the class indicator variables
- Linear discriminant analysis (LDA)
- Logistic regression (LOGREG)
Model the boundaries between classes as linear (to be discussed next Tuesday):
- Perceptron
- Non-overlapping support vector classifier (SVM)

Model Discriminant Functions δ_k(x) for Each Class
- Different for the linear regression fit, linear discriminant analysis, and logistic regression
- Discriminant functions δ_k(x) are based on the model
- Decision boundary between classes k and l: {x : δ_k(x) = δ_l(x)}
- Classify to the class with the largest δ_k(x) value
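Below is a minimal, illustrative sketch (not from the slides; function and variable names are my own) of the generic decision rule above: given linear discriminant functions δ_k(x) = β_{k0} + β_k^T x, assign x to the class with the largest δ_k(x).

```python
import numpy as np

def classify(X, intercepts, coefs):
    """X: (n, p) inputs; intercepts: (K,) values beta_k0; coefs: (K, p) rows beta_k."""
    scores = intercepts + X @ coefs.T      # (n, K) matrix of delta_k(x) values
    return np.argmax(scores, axis=1)       # index of the class with the largest score

# toy usage with made-up parameters: K = 2 classes, p = 2 inputs
X = np.array([[0.0, 1.0], [2.0, -1.0]])
intercepts = np.array([0.5, -0.5])
coefs = np.array([[1.0, 0.0], [0.0, 1.0]])
print(classify(X, intercepts, coefs))
```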

Linear Regression Fit to the Class Indicator Variables
- Linear model for the kth indicator response variable: f_k(x) = β_{k0} + β_k^T x
- Decision boundary is the set of points {x : f_k(x) = f_l(x)}
- Linear discriminant function for class k: δ_k(x) = f̂_k(x)
- Classify to the class with the largest value of its δ_k(x)
- Parameter estimation:
  - Objective function: min_B Σ_i Σ_k (y_ik − f_k(x_i))²
  - Estimated coefficients: B̂ = (X^T X)^{-1} X^T Y, with X the augmented input matrix and Y the indicator response matrix
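As a concrete, hedged illustration of the above, the sketch below builds the indicator response matrix, fits it by ordinary least squares with an intercept, and classifies to the class with the largest fitted value. The function names are illustrative, not from the text.

```python
import numpy as np

def fit_indicator_regression(X, y, K):
    """X: (n, p) inputs; y: (n,) integer labels in 0..K-1."""
    n = X.shape[0]
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0                     # indicator response matrix
    Xa = np.hstack([np.ones((n, 1)), X])         # augment with an intercept column
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)   # B_hat = (X^T X)^{-1} X^T Y
    return B                                     # (p+1, K) coefficient matrix

def predict_indicator_regression(B, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xa @ B, axis=1)             # largest fitted f_k(x) wins
```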

Linear Regression Fit to the Class Indicator Variables
Rationale:
- f̂_k(x) is an estimate of the conditional expectation E(Y_k | X = x) = Pr(G = k | X = x)
- It is also an estimate of the target value
- An observation: Σ_k f̂_k(x) = 1 for any x. Why? A "straightforward" verification: see the next page (courtesy of Jian Zhang and Yan Rong)

Linear Regression Fit to the Class Indicator Variables
Verification of Σ_k f̂_k(x) = 1:
We want to prove Σ_k f̂_k(x) = 1 for any x, which is equivalent to proving (1, x^T) B̂ 1_K = 1.
Notice that
  f̂(x)^T = (1, x^T) B̂ = (1, x^T) (X^T X)^{-1} X^T Y   (Eq. 1)
  Y 1_K = 1_N, since each row of the indicator matrix Y contains a single 1   (Eq. 2)

Linear Regression Fit to the Class Indicator Variables
And the augmented X has 1_N as its first column, i.e. X e_1 = 1_N.
From Eq. 2 we can see that X^T Y 1_K = X^T 1_N = X^T X e_1,
which means that B̂ 1_K = (X^T X)^{-1} X^T Y 1_K = e_1.

Linear Regression Fit to the Class Indicator Variables
Eq. 1 then becomes Σ_k f̂_k(x) = f̂(x)^T 1_K = (1, x^T) B̂ 1_K = (1, x^T) e_1 = 1.
True for any x.
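A quick numerical spot-check of the derivation above (a self-contained sketch on synthetic data): the fitted indicator values sum to one even at points not in the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 50, 3, 3
X = rng.normal(size=(n, p))
y = rng.integers(0, K, size=n)
Y = np.zeros((n, K)); Y[np.arange(n), y] = 1.0   # indicator responses
Xa = np.hstack([np.ones((n, 1)), X])             # augmented X (column of ones first)
B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)       # B_hat = (X^T X)^{-1} X^T Y

Xnew = np.hstack([np.ones((5, 1)), rng.normal(size=(5, p))])   # fresh points
print((Xnew @ B).sum(axis=1))                    # each entry is (numerically) 1
```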

Mask Problem
Because of the rigid nature of the regression model, when K ≥ 3, classes can be masked by others.

Masking (2): Quadratic Polynomials
Augmenting the inputs with quadratic polynomial terms can resolve the masking (a small numerical sketch follows).
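The following synthetic sketch (my own construction, not from the slides) reproduces the masking effect with three classes spread along one direction: with a linear basis the middle class is typically never predicted, while adding a quadratic term recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 100), rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
y = np.repeat([0, 1, 2], 100)
Y = np.eye(3)[y]                                  # indicator responses

def lstsq_predict(H):
    """Least-squares fit of the indicators on basis H, then argmax prediction."""
    Ha = np.hstack([np.ones((len(x), 1)), H])
    B, *_ = np.linalg.lstsq(Ha, Y, rcond=None)
    return np.argmax(Ha @ B, axis=1)

print(np.unique(lstsq_predict(x[:, None])))       # linear basis: class 1 is masked here
print(np.unique(lstsq_predict(np.c_[x, x**2])))   # quadratic basis: all three classes appear
```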

Linear Regression Fit
[Figure: linear regression fit, with "+" and "−" class samples plotted against x]
Question (p. 81): Let's consider just binary classification. In the machine learning course, when we move from regression to classification, we fit a single regression curve to the samples of both classes, then choose a threshold on the curve to finish the classification. Here we use two regression curves, one for each category. Can you compare the two methods? (Fan Li)

Linear Discriminant Analysis (Common Covariance Matrix Σ)
- Model the class-conditional density of X in class k as multivariate Gaussian, f_k(x) = N(μ_k, Σ), with the covariance matrix Σ common to all classes
- Class posterior: Pr(G = k | X = x) = f_k(x) π_k / Σ_l f_l(x) π_l
- Decision boundary is the set of points {x : δ_k(x) = δ_l(x)}

Linear Discriminant Analysis (Common ) con’t Linear discriminant function for class k Classify to the class with the largest value for its k(x) Parameters estimation Objective function Estimated parameters

Logistic Regression
- Model the class posterior Pr(G = k | X = x) in terms of K−1 log-odds: log[Pr(G = k | X = x) / Pr(G = K | X = x)] = β_{k0} + β_k^T x, for k = 1, ..., K−1
- Decision boundary is the set of points {x : δ_k(x) = δ_l(x)}
- Linear discriminant function for class k: δ_k(x) = β_{k0} + β_k^T x (with δ_K(x) = 0 for the reference class K)
- Classify to the class with the largest value of its δ_k(x)
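To make the parameterization concrete, here is a small sketch (with illustrative, not fitted, parameter values) that turns the K−1 log-odds against the reference class K into the full set of class posteriors.

```python
import numpy as np

def posteriors(x, intercepts, coefs):
    """intercepts: (K-1,), coefs: (K-1, p): log-odds of classes 1..K-1 vs class K.
    Returns the K class posteriors Pr(G=k|X=x)."""
    eta = intercepts + coefs @ x              # the K-1 linear log-odds
    e = np.exp(eta)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)  # last entry is the reference class K

p = posteriors(np.array([1.0, 2.0]),
               np.array([0.1, -0.2]),
               np.array([[0.5, -0.5], [0.3, 0.8]]))
print(p, p.sum())                             # the posteriors sum to 1
```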

Questions
- The log odds-ratio is typically defined as log(p/(1−p)); how is this consistent with p. 96, where they use log(p_k/p_l), where k, l are different classes in K? (Ashish Venugopal)

Logistic Regression, con't
- Parameter estimation:
  - Objective function: the conditional log-likelihood Σ_i log Pr(G = g_i | X = x_i)
  - Solved by IRLS (iteratively reweighted least squares)
  - In particular, for the two-class case, the Newton-Raphson algorithm is used to solve the score equations (see pages 98-99 for details); a sketch follows below
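As referenced above, this is a minimal Newton-Raphson (IRLS) sketch for the two-class case, using the standard update β ← β + (X^T W X)^{-1} X^T (y − p) with W = diag(p(1 − p)); it is my own illustrative implementation, not the book's code.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """X: (n, p) inputs (no intercept column); y: (n,) labels in {0, 1}."""
    Xa = np.hstack([np.ones((len(y), 1)), X])     # add intercept column
    beta = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xa @ beta))      # current fitted probabilities
        W = p * (1.0 - p)                         # diagonal of the weight matrix
        H = Xa.T @ (Xa * W[:, None])              # X^T W X
        g = Xa.T @ (y - p)                        # score vector X^T (y - p)
        beta = beta + np.linalg.solve(H, g)       # Newton / IRLS step
    return beta

# toy usage on two overlapping Gaussian classes (if the classes were perfectly
# separable the unpenalized MLE would diverge)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.repeat([0.0, 1.0], 50)
print(logistic_irls(X, y))
```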

Logistic Regression, con't
When it is used:
- Binary responses (two classes)
- As a data analysis and inference tool, to understand the role of the input variables in explaining the outcome
Feature selection:
- Find a subset of the variables that are sufficient for explaining their joint effect on the response
- One way is to repeatedly drop the least significant coefficient, and refit the model until no further terms can be dropped
- Another strategy is to refit each model with one variable removed, and perform an analysis of deviance to decide which variable to exclude
Regularization:
- Maximum penalized likelihood
- Shrinking the parameters via an L1 constraint, imposing a margin constraint in the separable case (an L1 sketch follows below)
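The sketch below illustrates the L1-shrinkage idea mentioned above as a feature-selection device, using scikit-learn's L1-penalized logistic regression on a synthetic data set; the data and the penalty strength are illustrative choices, not from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # 10 inputs, only 2 informative
y = (X[:, 0] - 2 * X[:, 3] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print(np.round(model.coef_, 2))                   # most coefficients are driven to 0
```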

Questions
- p. 102: Are stepwise methods the only practical way to do model selection for logistic regression (because of the nonlinearity plus the maximum-likelihood criterion)? (Comparing to Section 3.4: what about the bias/variance tradeoff, where we could shrink coefficient estimates instead of just setting them to zero?) (Kevyn Collins-Thompson)

Classification by Linear Least Squares vs. LDA
- Two-class case: there is a simple correspondence between LDA and classification by linear least squares; the coefficient vector from least squares is proportional to the LDA direction in its classification rule (page 88)
- For more than two classes, the correspondence between regression and LDA can be established through the notion of optimal scoring (Section 12.5)
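A numerical sketch of the two-class correspondence (synthetic data, my own construction): the least-squares slope vector for a ±1 coding of the classes should be proportional to the LDA direction Σ̂^{-1}(μ̂₂ − μ̂₁).

```python
import numpy as np

rng = np.random.default_rng(1)
cov = [[1.0, 0.3], [0.3, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([2, 1], cov, 100)])
t = np.repeat([-1.0, 1.0], 100)                   # any distinct two-class coding

Xa = np.hstack([np.ones((200, 1)), X])
beta = np.linalg.lstsq(Xa, t, rcond=None)[0][1:]  # least-squares slopes (drop intercept)

mu1, mu2 = X[:100].mean(axis=0), X[100:].mean(axis=0)
S = (np.cov(X[:100], rowvar=False) * 99 + np.cov(X[100:], rowvar=False) * 99) / 198
lda_dir = np.linalg.solve(S, mu2 - mu1)           # pooled-covariance LDA direction

print(beta / lda_dir)                             # the ratios agree, up to rounding
```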

Questions
- On p. 88, paragraph 2, it says "the derivation of LDA via least squares does not use a Gaussian assumption for the features". How can this statement be made? Simply because the least squares coefficient vector is proportional to the LDA direction, how does that remove the obvious Gaussian assumptions that are made in LDA? (Ashish Venugopal)

LDA vs. Logistic Regression
LDA (generative model):
- Assumes Gaussian class-conditional densities and a common covariance
- Model parameters are estimated by maximizing the full log-likelihood; parameters for each class are estimated independently of the other classes; Kp + p(p+1)/2 + (K−1) parameters
- Makes use of the marginal density information Pr(X)
- Easier to train, low variance, more efficient if the model is correct
- Higher asymptotic error, but converges faster
Logistic regression (discriminative model):
- Assumes the class-conditional densities are members of the (same) exponential family distribution
- Model parameters are estimated by maximizing the conditional log-likelihood, with simultaneous consideration of all other classes; (K−1)(p+1) parameters
- Ignores the marginal density information Pr(X)
- Harder to train, robust to uncertainty about the data generation process
- Lower asymptotic error, but converges more slowly
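A quick worked count of the parameter numbers quoted above, for illustrative values of K and p:

```python
K, p = 3, 10
lda_params = K * p + p * (p + 1) // 2 + (K - 1)   # class means + shared covariance + priors
logreg_params = (K - 1) * (p + 1)                 # (K-1) intercepts and slope vectors
print(lda_params, logreg_params)                  # 87 vs 22
```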

Generative vs. Discriminative Learning (Rubinstein 97)
- Example: generative: linear discriminant analysis; discriminative: logistic regression
- Objective function: generative: full log-likelihood, log Pr(X, G); discriminative: conditional log-likelihood, log Pr(G | X)
- Model assumptions: generative: class densities, e.g. Gaussian in LDA; discriminative: discriminant functions
- Parameter estimation: generative: "easy", one single sweep; discriminative: "hard", iterative optimization
- Advantages: generative: more efficient if the model is correct, borrows strength from p(x); discriminative: more flexible, robust because fewer assumptions
- Disadvantages: generative: biased if the model is incorrect; discriminative: may also be biased, ignores information in p(x)

Comparison between LDA and LOGREG (error rate / standard error), from (Rubinstein 97):

True distribution:   Highly non-Gaussian    N/A           Gaussian
LDA:                 25.2 / 0.47            9.6 / 0.61    7.6 / 0.12
LOGREG:              12.6 / 0.94            4.1 / 0.17    8.1 / 0.27

Questions
- Can you give a more detailed explanation of the difference between the two methods, linear discriminant analysis and linear logistic regression? (p. 80 of the book: the essential difference between them is in the way the linear function is fit to the training data.) (Yanjun Qi)
- p. 105, first paragraph: why does conditional likelihood need 30% more data to do as well? (Yi Zhang)
- The book says logistic regression is safer. Then it says LDA and logistic regression work very similarly even when LDA is used inappropriately, so why not use LDA? Using LDA, we have a chance to save 30% of the training data in case the assumption on the marginal distribution is true. How inappropriate does the use have to be to make LDA worse than logistic regression? (Yi Zhang)
- Figure 4.2 shows the different effects of linear regression and linear discriminant analysis on one data set. Can we have a deeper and more general understanding of when linear regression does not work well compared with linear discriminant analysis? (Yanjun Qi)

Questions
- On p. 88, paragraph 2, it says "the derivation of LDA via least squares does not use a Gaussian assumption for the features". How can this statement be made? Simply because the least squares coefficient vector is proportional to the LDA direction, how does that remove the obvious Gaussian assumptions that are made in LDA? (Ashish Venugopal)
- p. 91: what does it mean to "sphere" the data with a covariance matrix? (Ashish Venugopal)
- The log odds-ratio is typically defined as log(p/(1−p)); how is this consistent with p. 96, where they use log(p_k/p_l), where k, l are different classes in K? (Ashish Venugopal)

Questions
- Figure 4.2 on p. 83 gives an example of masking, and in the text the authors go on to say, "a general rule is that ... polynomial terms up to degree K−1 might be needed to resolve them". There seems to be an implication that adding polynomial basis functions according to this rule could sometimes be detrimental. I was trying to think of a graphical representation of a case where that would occur but can't come up with one. Do you have one? (Paul Bennett)
- (p. 80) What do the decision boundaries for the logit transformation space look like in the original space? (Francisco Pereira)
- (p. 82) Why is E(Y_k|X=x) = Pr(G=k|X=x)? (Francisco Pereira)
- (p. 82) Is the target approach just "predicting a vector with all 0s except a 1 at the position of the true class"? (Francisco Pereira)
- (p. 83) Can all of this be seen as projecting the data onto a line with a given direction and then dividing that line according to the classes? (It seems so in the 2-class case; I'm not sure in general.) (Francisco Pereira)

Questions
- What is the difference between logistic regression and the exponential model, in terms of definition, properties, and experimental results? (Discriminative vs. generative) [Yan Liu]
- A question on the indicator response matrix: as a general way to decompose multi-class classification problems into binary-class classification problems, how do we evaluate the results when it is applied? (Error rate or something else?) There is a good way called ECOC (error-correcting output coding) to reduce multi-class problems to binary-class problems; can we use it the same way as the indicator response matrix and do linear regression? [Yan Liu]
- On page 82: why is it quite straightforward that Σ_k f̂_k(x) = 1 for any x?
- As is said in the book (page 80), if the problem is linearly non-separable, we can expand our variable set X_1, X_2, ..., X_p by including their squares and cross-products and solve it. Furthermore, this approach can be used with any basis transformation. In theory, can any classification problem be solved this way? (In practice, we might have problems like the "curse of dimensionality".) [Yan Liu]

Questions
- One important step in applying a regression method to the classification problem is to encode the class label into some coding scheme. The book only illustrates the simplest one; more complicated coding schemes include redundant codes. However, it is not necessary to encode the class label into N regions. Do you think it is possible to encode it with real numbers and actually achieve better performance? [Rong Jin]
- p. 82, book: "If we allow linear regression onto basis expansions h(X) of the inputs, this approach can lead to consistent estimates of the probabilities." I do not fully understand this sentence. [Yanjun]
- For LDA, the book tells us that it is easy to show that the coefficient vector from least squares is proportional to the LDA direction given by 4.11. Then how do we understand that this correspondence occurs for any distinct coding of the targets? [Yanjun]
- Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks, but LDA assumes the data covariances are approximately equal. I feel this method is too restricted for the general case, right? [Yanjun]

Questions
- The indicator matrix Y in the first paragraph of Section 4.2 is a matrix of 0's and 1's, with each row having a single 1. It seems that we can extend it to multi-label data by allowing each row to have two or more 1's, and use Eq. 4.3 for the model. Has this been tried in multi-label classification problems? [Wei-hao]

References
- Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs. informative learning. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pp. 49-53.
- Jordan, M. I. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical report.
- Ng, A. Y., & Jordan, M. I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Neural Information Processing Systems.

Questions
- p. 88: "QDA is generally preferred to LDA (in the quadratic space)". Why, and how do you decide which to use? (Is the main reason that QDA is more general in what it can model accurately, in not assuming a common covariance across classes?) [Kevyn]
- "By relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (low variance)". How? [Jian]