Logistic Regression Rong Jin

Logistic Regression Model  In Gaussian generative model:  Generalize the ratio to a linear model Parameters: w and c

Logistic Regression Model  In Gaussian generative model:  Generalize the ratio to a linear model Parameters: w and c

Logistic Regression Model  The log-ratio of positive class to negative class  Results

Logistic Regression Model  The log-ratio of positive class to negative class  Results

Logistic Regression Model  Assume the inputs and outputs are related in the log linear function  Estimate weights: MLE approach

Example 1: Heart Disease  Input feature x: age group id (groups 1-8, the last group covering ages 60-64).  Output y: having heart disease or not (+1: having heart disease, -1: no heart disease).

Example 1: Heart Disease  Logistic regression model.  Learning w and c: MLE approach.  Numerical optimization gives w = 0.58, c = -3.34.

Example 1: Heart Disease  W = 0.58 An old person is more likely to have heart disease  C = xw+c < 0  p(+|x) < p(-|x) xw+c > 0  p(+|x) > p(-|x) xw+c = 0  decision boundary  x* = 5.78  53 year old

Naïve Bayes Solution  Inaccurate fitting: the input distribution is not Gaussian.  Decision boundary i* = 5.59, close to the estimate from logistic regression.  Even though naïve Bayes does not fit the input patterns well, it still works fine for the decision boundary.

Problems with Using Histogram Data?

Uneven Sampling for Different Ages

Solution  w = 0.63, c = …  i* = 5.65 < 5.78

Example 2: Text Classification  Learn to classify text into predefined categories.  Input x: a document, represented by a vector of word counts, e.g., {(president, 10), (bush, 2), (election, 5), …}.  Output y: whether the document is about politics: +1 for a political document, -1 for a non-political document.  Training data:

Example 2: Text Classification  Logistic regression model Every term t i is assigned with a weight w i  Learning parameters: MLE approach  Need numerical solutions

Example 2: Text Classification  Logistic regression model Every term t i is assigned with a weight w i  Learning parameters: MLE approach  Need numerical solutions

Example 2: Text Classification  Weight w i w i > 0: term t i is a positive evidence w i < 0: term t i is a negative evidence w i = 0: term t i is irrelevant to the category of documents The larger the | w i |, the more important t i term is determining whether the document is interesting.  Threshold c

Example 2: Text Classification  Weight w i w i > 0: term t i is a positive evidence w i < 0: term t i is a negative evidence w i = 0: term t i is irrelevant to the category of documents The larger the | w i |, the more important t i term is determining whether the document is interesting.  Threshold c

Example 2: Text Classification  Dataset: Reuters.  Classification accuracy: naïve Bayes 77%, logistic regression 88%.

Why Does Logistic Regression Work Better for Text Classification?  Optimal linear decision boundary: in the generative model, weight ~ log p(w|+) - log p(w|-), which gives sub-optimal weights (sketched below).  Independence assumption: naïve Bayes assumes each word is generated independently, whereas logistic regression can take the correlation between words into account.
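A sketch of the point being made: under the naïve Bayes independence assumption the log-ratio is also linear in the word counts x_i, but its effective weights are fixed by the class-conditional word probabilities rather than optimized for the decision boundary:

\[
\log\frac{p(+\mid x)}{p(-\mid x)}
= \sum_i x_i\,\big(\log p(t_i\mid +) - \log p(t_i\mid -)\big) + \log\frac{P(+)}{P(-)} ,
\]

so the implicit weight of term t_i is log p(t_i|+) - log p(t_i|-).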

Discriminative Model  Logistic regression model is a discriminative model Models the conditional probability p(y|x), i.e., the decision boundary  Gaussian generative model Models p(x|y), i.e., input patterns of different classes

Comparison
Generative model: models P(x|y), i.e., the input patterns. Usually converges fast, cheap computation, robust to noisy data, but usually performs worse.
Discriminative model: models P(y|x) directly, i.e., the decision boundary. Usually good performance, but slower convergence, more expensive computation, and more sensitive to noisy data.

A Few Words about Optimization  Convex objective function (equivalently, a concave log-likelihood): no local optima.  The solution could still be non-unique.

Problems with Logistic Regression?  What about words that appear in only one class?

Overfitting Problem with Logistic Regression  Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight.  Consider the derivative of l(D_train) with respect to w (sketched below): it is always positive, so the MLE drives w to infinity!
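A sketch of why the derivative stays positive (using the log-likelihood form above; x_{t,d} > 0 is the count of word t in document d):

\[
\frac{\partial\, l(D_{\mathrm{train}})}{\partial w}
= \big(1 - p(+\mid d)\big)\, x_{t,d} \;>\; 0 \quad \text{for every finite } w ,
\]

so maximizing the likelihood keeps pushing w upward without bound.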

Example of Overfitting for LogRes  (figure: classification accuracy on the test data decreases as the number of training iterations grows)

Solution: Regularization  Regularized log-likelihood (one common form is sketched below).  The term s||w||^2 is called the regularizer: it favors small weights and prevents the weights from becoming too large.
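One common way to write the regularized objective, consistent with the s||w||^2 regularizer named on this slide (a sketch, not necessarily the slide's exact constant convention):

\[
l_{\mathrm{reg}}(D) = l(D_{\mathrm{train}}) - s\,\lVert w \rVert^{2} .
\]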

The Rare Word Problem  Consider word t that only appears in one document d, and d is a positive document. Let w be its associated weight

The Rare Word Problem  Consider the derivative of l(D train ) with respect to w  When s is small, the derivative is still positive  But, it becomes negative when w is large

The Rare Word Problem  Consider the derivative of l(D train ) with respect to w  When w is small, the derivative is still positive  But, it becomes negative when w is large

Regularized Logistic Regression  (figure: test classification accuracy over iterations, with regularization vs. without regularization)

Interpretation of Regularizer  There are many interpretations of the regularizer: Bayesian statistics: a prior over the model; Statistical learning: minimizing the generalization error; Robust optimization: a min-max solution.

Regularizer: Robust Optimization  Assume each data point is unknown-but-bounded within a sphere of radius s centered at x_i.  Find the classifier w that is able to classify these unknown-but-bounded data points with high classification confidence (a sketch of the resulting min-max problem follows).
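A hedged sketch of the min-max formulation this refers to, writing ℓ for the logistic loss (the exact form on the original slide may differ):

\[
\min_{w,c}\; \sum_{i} \max_{\lVert \delta_i \rVert \le s}
\ell\big(y_i,\; w^{\top}(x_i + \delta_i) + c\big) ,
\]

and since the worst-case perturbation can reduce the margin by at most s||w||, guarding against it behaves like penalizing large ||w||.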

Sparse Solution  What does the solution of regularized logistic regression look like ?  A sparse solution Most weights are small and close to zero

Sparse Solution  What does the solution of regularized logistic regression look like ?  A sparse solution Most weights are small and close to zero

Why Do We Need a Sparse Solution?  Two types of solutions: 1. many non-zero weights, most of them small; 2. only a small number of non-zero weights, many of them large.  Occam’s Razor: the simpler, the better. A simpler model that fits the data is unlikely to be a coincidence; a complicated model that fits the data might be. A smaller number of non-zero weights ⇒ less evidence to consider ⇒ a simpler model ⇒ case 2 is preferred.

Occam’s Razor

Occam’s Razor: Power = 1

Occam’s Razor: Power = 3

Occam’s Razor: Power = 10

Finding Optimal Solutions  Concave objective function: no local maxima, so many standard optimization algorithms work.

Gradient Ascent  Maximize the log-likelihood by iteratively adjusting the parameters in small increments  In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient) Predication ErrorsPreventing weights from being too large

Graphical Illustration  (no-regularization case)

Gradient Ascent  Maximize the log-likelihood by iteratively adjusting the parameters in small increments  In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient)

When should Stop?  Log-likelihood will monotonically increase during the gradient ascent iterations  When should we stop?

(figure: test classification accuracy over iterations, with regularization vs. without regularization)

When should Stop?  The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction: