Brief Introduction to Logistic Regression


ECLT 5810: Brief Introduction to Logistic Regression

Review of Ordinary Least Squares (OLS) Regression
- A "curve fitting on data points" procedure.
- Achieved by minimizing the total squared distance between the fitted curve and the data points.
- The model usually looks like y = β0 + β1x1 + β2x2 + · · · + βnxn.

Our analysis of such models usually covers:
- Whether the beta coefficients are significantly positive/negative/different from a certain value, taking estimation error into account (done by t-statistics on the beta estimates).
- Whether the model has good explanatory power for the dependent variable, taking estimation error into account (done by the F-statistic on the R^2 measure).
- The implications of the model, i.e., does y depend on x? To what extent? Are there any interaction effects? (Done by differentiation/differencing on the estimated model.)
- Prediction (with intervals) of the dependent variable given the independent variables.
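
To make this workflow concrete, here is a minimal sketch in Python with numpy and statsmodels (the simulated data and coefficient values are our own illustration, not part of the original slides): fit an OLS model and read off the beta estimates, their t-statistics, the F-statistic and R^2.

```python
# Minimal OLS illustration with simulated data (hypothetical example).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)  # true betas: 1, 2, -0.5

X = sm.add_constant(np.column_stack([x1, x2]))       # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)     # estimated beta coefficients
print(model.tvalues)    # t-statistics on the beta estimates
print(model.fvalue)     # F-statistic on overall explanatory power
print(model.rsquared)   # R^2
```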

Classical Assumptions for OLS
However, all of the above analysis is done under the following assumptions.
A1 (Linear in parameters): y = β0 + β1x1 + β2x2 + · · · + error.
A2 (No perfect collinearity): no independent variable is constant or a perfect linear combination of the others.
A1 and A2 can be fulfilled by choosing a suitable form of the equation.

A3 (Zero conditional mean of errors): E(error_t | X) = 0 for t = 1, 2, · · ·, #data, where X is the collection of all independent variables, X = (x1, x2, · · ·, xn).
Under A1-A3 the OLS estimators are unbiased, i.e. E(estimated βj) = βj for all j.
A4 (Homoskedasticity of errors): Var(error_t | X) = σ^2 (i.e. independent of X) for t = 1, 2, · · ·.
A5 (No serial correlation in errors): Corr(error_t, error_s | X) = 0 for t ≠ s.
Under A1-A5, the OLS estimators are the minimum-variance linear unbiased estimators conditional on X.

A6 (Normality of errors): the errors error_t are independently and identically distributed as N(0, σ^2).
Under A1-A6, the OLS estimators are normally distributed conditional on X, and the t-statistics on the parameters and the F-statistic on R^2 can be used for statistical reasoning.
A3-A6 are usually assumed to be true unless there is significant evidence/reason against them.

Early models for classification
As our main target in data mining is to make predictions, the dependent variable is usually nominal/ordinal/binary in nature. Usually we use a binary y to represent this, i.e. y = 1 for yes and 0 for no.
An early model is the linear probability model, which regresses the binary y on the explanatory variables X. As y is binary, the predicted values lie roughly in the range 0 to 1, so people used this model to predict the probability of an event.
However, such a model violates A3, A4 and A6, and the predicted value can fall outside the range 0 to 1, which makes the model much less useful.
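
A small sketch of the linear probability model (simulated data of our own, not from the slides): OLS on a binary outcome can yield fitted "probabilities" outside [0, 1].

```python
# Linear probability model sketch on hypothetical data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))   # true probabilities from a logistic model
y = rng.binomial(1, p_true)                    # binary dependent variable

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit()
fitted = lpm.predict(X)
print(fitted.min(), fitted.max())  # often below 0 or above 1: not valid probabilities
```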

The problem could be rectified by introducing a threshold such that when the predicted y is greater than the threshold, we classify y as 1 (see the sketch below). This is the simplest neural network model, which will be introduced later.
However, what we obtain is then a decision rather than a probability, and the probability itself might be useful in some cases. Also, the relation between the probability and the explanatory variables becomes less clear. Statisticians invented logistic regression to solve these problems.
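
A minimal sketch of the thresholding idea (the fitted values and the 0.5 cutoff are assumed for illustration):

```python
# Turn fitted values from a linear probability model into 0/1 decisions.
import numpy as np

fitted = np.array([-0.1, 0.3, 0.55, 0.8, 1.2])   # hypothetical LPM fitted values
decisions = (fitted > 0.5).astype(int)           # threshold at 0.5 (an assumed choice)
print(decisions)                                 # [0 0 1 1 1]: decisions, not probabilities
```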

Logistic Regression
The idea is to use a one-to-one mapping to map the probability from the range [0,1] to all real numbers. Then there is no problem no matter what value the right-hand side takes.
Three common transformations/link functions (provided by SAS):
- Logit: ln(p/(1-p)) (the log of the odds)
- Probit: normal inverse of p (recall the normal table's mapping scheme)
- Complementary log-log: ln(-ln(1-p))
The choice of link function depends on your purpose rather than performance. They all perform about equally well, but the implications are a bit different.
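
A small sketch (our own, not from the slides) comparing the three link functions on a grid of probabilities; each maps (0, 1) onto the whole real line.

```python
import numpy as np
from scipy.stats import norm

p = np.linspace(0.01, 0.99, 5)

logit   = np.log(p / (1 - p))        # log of the odds
probit  = norm.ppf(p)                # inverse of the standard normal CDF
cloglog = np.log(-np.log(1 - p))     # complementary log-log

for row in zip(p, logit, probit, cloglog):
    print("p=%.2f  logit=%+.2f  probit=%+.2f  cloglog=%+.2f" % row)
```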

However, as the model is no longer in linear form, ordinary least squares cannot be used. Furthermore, if we put the binary y directly into the transformation, we get positive/negative infinity.
We use Maximum Likelihood Estimation (MLE) instead, in which we choose the beta coefficients that maximize the probability of observing the data we actually see.
MLE needs fewer assumptions than OLS, but much less inference can be made, especially for logistic regression. Also, as both MLE and OLS use only one beta coefficient to describe the effect an explanatory variable brings about, data scaling/normalization is particularly important.
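
A sketch of the MLE fit in Python with statsmodels (the data, the raw feature scale and the standardization step are assumptions for illustration); statsmodels fits the logit model by maximum likelihood internally.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(loc=50, scale=10, size=500)        # e.g. a raw feature on its own scale
x_std = (x - x.mean()) / x.std()                  # scaling/normalization before fitting
p_true = 1 / (1 + np.exp(-(-1.0 + 0.8 * x_std)))
y = rng.binomial(1, p_true)

logit_model = sm.Logit(y, sm.add_constant(x_std)).fit()   # fitted by MLE
print(logit_model.params)                                  # estimated a and b
```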

Example on Logit
Assume we believe the relation between the probability p that an event is "yes" and the independent variable x can be described by the equation ln(p(x)/(1-p(x))) = a + bx.
Then p(x) = exp(a+bx) / [1 + exp(a+bx)].
If we have 4 data points (Yes, x1), (No, x2), (Yes, x3), (No, x4) and assume they are mutually independent, then the probability that we see these 4 data points is the product p(x1)[1-p(x2)]p(x3)[1-p(x4)], and MLE tries to maximize this by choosing suitable a and b.
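
A worked sketch of this 4-point likelihood (the x values are made up), maximizing it numerically with scipy instead of by hand; maximizing the likelihood is the same as minimizing the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0])   # assumed values for x1..x4
y = np.array([1, 0, 1, 0])           # Yes, No, Yes, No

def neg_log_likelihood(params):
    a, b = params
    p = 1 / (1 + np.exp(-(a + b * x)))                        # p(x) = exp(a+bx)/(1+exp(a+bx))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # -log of the product above

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)   # MLE estimates of a and b
```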

Reading the Report
Akaike's Information Criterion (AIC) and Schwarz's Bayesian Criterion (SBC) (compare to: F-test on adjusted R^2 for OLS):
- Both take a smaller value for a higher maximized likelihood, and a higher value when more explanatory variables are used (to penalize over-fitting).
- So a smaller value is preferred (though it is not the only consideration when choosing a model).
T-score (compare to: t-test on estimated betas for OLS):
It is the estimate divided by its standard error. We may treat it like the t-test in OLS and construct a confidence interval for the betas, but in practice this works only asymptotically. We just take a large t-score as an indicator of a possibly significant effect; no exact hypothesis test can be done.
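
For illustration, a sketch (assumed data) of reading these quantities from a fitted logistic regression in statsmodels; SBC is reported as BIC there.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.2 * x))))

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.aic, fit.bic)        # smaller is preferred when comparing models
print(fit.params / fit.bse)    # estimate / standard error, the "t-score" above
```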

Wald's Chi-square (compare to: t-test for OLS)
We can treat an effect as significant if the tail probability is small enough (< 5%).
If we are using the model to predict the outcome rather than the probability of that outcome (the case when the criterion is set to minimize loss), the interpretation of the misclassification rate/profit and loss/ROC curve/lift chart is similar to that for decision trees.
Some scholars suggest an interval for the probability P of the event given the independent variables:
P_estimated ± z_(1-α/2) · [P_estimated (1 - P_estimated) / #data]^(1/2),
z being the quantile from the normal table and α being the significance level. But we do not have this in SAS.
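
A short sketch of that normal-approximation interval (the estimated probability 0.30 and n = 200 are made-up numbers):

```python
from scipy.stats import norm

p_hat = 0.30          # estimated probability of the event
n = 200               # number of data points
alpha = 0.05          # significance level
z = norm.ppf(1 - alpha / 2)
half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
print(p_hat - half_width, p_hat + half_width)   # the suggested interval
```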

Interpreting the model form is similar to OLS, using techniques like differentiation and differencing.
One common use: for a logit model of the form f(x) = ln(P(x)/(1-P(x))) = a + bx with x binary,
f(1) = a + b, f(0) = a, so f(1) - f(0) = b,
i.e. the odds at x = 1 are exp(b) times the odds at x = 0. For small P(0) and P(1), the odds are approximately the probabilities, so ln(P(1)/P(0)) ≈ b and P(1) ≈ exp(b) · P(0).
Hence P(1) is about exp(b) times as big as P(0), and we can draw conclusions like "having x done increases the probability to about exp(b) times that of not having it done".
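
A numeric sketch of this interpretation (the intercept and coefficient are hypothetical): the odds ratio is exactly exp(b), and the probability ratio is close to exp(b) when both probabilities are small.

```python
import numpy as np

a, b = -4.0, 0.7                       # hypothetical fitted intercept and coefficient
p = lambda x: 1 / (1 + np.exp(-(a + b * x)))
odds = lambda x: p(x) / (1 - p(x))

print(odds(1) / odds(0))               # exactly exp(b)
print(np.exp(b))
print(p(1) / p(0))                     # close to exp(b) because both p's are small
```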