Published by Georgiana Richardson. Modified over 6 years ago.
1
Logistic and Poisson Regression: Modeling Binary and Count Data
Kyle Grottini November 5, 2008
2
Presentation Outline
1. Introduction to Generalized Linear Models
2. Count Response Data - Poisson Regression Model
3. Binary Response Data - Logistic Regression Model
4. Takeaways
3
Linear Models
Linear regression is one of the handiest tools you have for finding relationships in data. It relies on several key assumptions:
- The response variable is continuous
- There is a linear relationship between the X’s and Y
- No significant outliers
- Constant variance
- Normally distributed residuals
- Independence between observations
4
Linear Models
Linear regression is most appropriate when your predictors and response variables are normally distributed.
Examples of normally distributed variables:
- Shrink %
- # of POS transactions
- # of EAS alarms in a month
5
Linear Models
Linear regression is fairly robust: even if the variables aren’t normally distributed, the estimates shouldn’t be drastically affected by the distribution of the response or explanatory variables.
- The distribution of the variables can be skewed without causing too much damage to the validity of your model
- Skewed distributions can include:
- # of shoplifters detained across all stores
- Square footage of stores
6
Count Data
What if our distributions are so skewed that they don’t resemble anything close to a bell curve?
- Most of the values are concentrated at one point and “fizzle out” as the variable increases
- Consider events that cannot go below 0 and are also fairly rare:
- # of armed robberies over all stores
- # of “pushouts” over all stores
- You can still model these using linear regression, but the estimates and standard errors may be off
7
Binary Data
Alternatively, what if the outcome variable is not continuous? The outcome may take the form of a binary or categorical response:
- Has there been an active shooter event at the location? Yes or no
- Has a store been hit by ORC? Yes or no
Let’s say you want to model the likelihood that one of those events will happen at a store given a set of factors.
- Linear regression may provide estimates greater than 100% or lower than 0%
- That violates the laws of probability, which require a probability to lie between 0 and 1. Despite what your coaches told you in high school, you can’t give 110%
8
Binary Data
9
Binary Data
Consider a binary response/outcome/dependent variable.
- A variable with two outcomes
- One outcome is represented by a 1 and the other by a 0
- Examples:
- Does a store have CCTV cameras? Yes or no
- What is the store format? Free standing or in a mall
- Has a store been burglarized before? Yes or no
10
Generalized Linear Models
Why do we use GLMs?
- Linear regression assumes that the response is normally distributed
- GLMs allow us to analyze the linear relationship between predictor variables and the mean of the response variable when it is not reasonable to assume the data are normally distributed
- Today we’re going to take a look at modeling burglaries
11
Looking at Burglaries
- For many stores, burglaries either do not happen or happen infrequently.
- Expect a heavy concentration of stores to have 0 or 1 burglaries.
- The number of stores that have been burglarized drops off pretty heavily after 2.
- The simulated dataset covers over 900 stores, more than 400 of which have had at least one burglary event
12
The Problem
How can we better model the data to find factors that are associated with burglarized stores over the course of a year?
Response variable: has the store been burglarized? Measured in 2 ways:
- 0 if no burglaries, 1 if it has been burglarized
- Count of how many times a store has been burglarized
Predictor variables:
- CAP Index National Score (continuous)
- Store layout (0 - free standing, 1 - strip mall, 2 - mall)
- Shrink %
- Proximity to pawn shops
13
Our Data
First 10 Observations of the Data Set
We will have to recode our building layout variable into 2 dummy variables for linear regression: one new column for stores located in malls, and one new column for stores located in a strip mall. We don’t need a third column because the information for free standing stores is captured by the intercept.
Columns:
- # of Burglaries
- Burglarized? (0 = no, 1 = yes)
- CAP National Score
- Building layout (0 = free standing, 1 = mall, 2 = strip mall)
- Shrink %
- Proximity to nearest pawn shop (in miles)
Extracted values (incomplete): 145 2 0.028 6.4 72 1 7.8 54 0.024 5.5 26 0.032 6.9 61 0.02 5.7 126 0.021 8.2 114 0.047 7.5 39 8.8 29 0.049 5.4
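The dummy-variable recoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code` and the use of text labels instead of numeric layout codes are assumptions made for the example, not part of the original analysis (which was done in SPSS).

```python
def dummy_code(layout):
    """Return (mall, strip_mall) indicator columns for a layout label.

    Free standing is the base level, so it maps to (0, 0) and its
    effect is absorbed by the regression intercept.
    """
    label = layout.strip().lower()
    if label not in {"free standing", "mall", "strip mall"}:
        raise ValueError(f"unknown layout: {layout!r}")
    mall = 1 if label == "mall" else 0
    strip_mall = 1 if label == "strip mall" else 0
    return mall, strip_mall


# Hypothetical layout labels, one per store.
layouts = ["free standing", "mall", "strip mall"]
coded = [dummy_code(label) for label in layouts]
print(coded)
```

Each store gets two columns, and a store with zeros in both is a free standing store.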
14
Linear Regression
Let’s take a crack at this problem with linear regression.

Model                          B       Std Error   T-value   P-value
Constant                       1.04    .099        10.6      <.001
Building Layout - Mall         -.056   .056        -1.0      .319
Building Layout - Strip mall   -.055   .057        -1.0      .331
CAP Scores                     .005    .000        34.4      <.001
Proximity to pawn shops        -.189   .010        -18.3     <.001
Shrink %                       2.55    1.617       1.6       .118
15
Linear Regression
Interpretation: for a one-unit increase in X, we expect a ____ increase in Y.
- For every mile a store is away from a pawn shop, we expect the # of burglaries to decrease by .189
- More intuitively, if a store is a little over 5 miles away from a pawn shop, we would expect 1 fewer burglary per year than at a store right next to a pawn shop

Model                     B       Std Error   T-value   P-value
Constant                  1.04    .099        10.6      <.001
CAP Scores                .005    .000        34.4      <.001
Proximity to pawn shops   -.189   .010        -18.3     <.001
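The arithmetic behind the "a little over 5 miles" statement can be checked with a short Python sketch that uses the three coefficients from the reduced model above. The function name and the example store (CAP score of 50, 8 miles from a pawn shop) are hypothetical and chosen only for illustration.

```python
# Coefficients taken from the reduced linear model in the slide.
INTERCEPT = 1.04
B_CAP = 0.005
B_PAWN_MILES = -0.189


def predicted_burglaries(cap_score, miles_to_pawn_shop):
    """Linear-regression prediction of yearly burglaries."""
    return INTERCEPT + B_CAP * cap_score + B_PAWN_MILES * miles_to_pawn_shop


# Distance needed for the prediction to drop by one full burglary.
miles_for_one_fewer = 1 / abs(B_PAWN_MILES)
print(round(miles_for_one_fewer, 2))

# Hypothetical store: low CAP score, far from any pawn shop.
far_store = predicted_burglaries(cap_score=50, miles_to_pawn_shop=8)
print(round(far_store, 3))
```

For the hypothetical store the prediction comes out negative, which is impossible for a count. That is one practical reason to look at a model built for count data.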
16
Count Data
Consider our frequency of burglaries:
- Can only take integer values
- Bounded at 0
- Not expected to exceed 10, and that’s a pretty rough set of stores at the higher end
- As a rule of thumb, for a variable to be treated as “continuous”, it typically has to take a range of at least 30 values
- Our first step is to look at the distribution of burglaries across all stores considered
17
Poisson Regression
18
Poisson Regression
- Using SPSS, we can model this information using the Poisson distribution.
- With Poisson models we are modeling the rate of burglaries over a year, holding other factors constant.
- Poisson regression is also called log-linear regression, because the log of the expected count is modeled as a linear function of the predictors: log(expected burglaries) = B0 + B1*X1 + ... + Bk*Xk
- For Poisson regression, “best” results occur when the mean and variance of the response variable are approximately equal.
- In this case, the burglary mean is .89 and the standard deviation is 1.352, which gives a variance of about 1.83.
- The variance is roughly twice the mean, so the Poisson assumption is not quite met.
19
Poisson Regression
Keeping the calculations behind the scenes, we get:

Model                          B       Std Error   P-value
Constant                       1.376   .099        <.001
Building Layout - Mall         -.152   .0874       .083
Building Layout - Strip mall   -.073   .0908       .420
CAP                            .001    .0002
Proximity to pawn shops        -.399   1.617
Shrink                         2.559   .010        .401
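To see how these coefficients turn into an expected count, here is a minimal Python sketch. It assumes the usual log link for Poisson regression, where the expected count is the exponential of the linear predictor. The example store (mall location, CAP score of 100, shrink of .025, 5 miles from a pawn shop) is hypothetical, and the dictionary keys are illustrative names.

```python
import math

# B values as listed in the Poisson table above.
POISSON_B = {
    "constant": 1.376,
    "mall": -0.152,
    "strip_mall": -0.073,
    "cap": 0.001,
    "pawn_miles": -0.399,
    "shrink": 2.559,
}

# Hypothetical store used only for illustration.
store = {"mall": 1, "strip_mall": 0, "cap": 100, "pawn_miles": 5.0, "shrink": 0.025}

# Linear predictor on the log scale.
log_mu = POISSON_B["constant"] + sum(POISSON_B[k] * v for k, v in store.items())

# Expected number of burglaries per year under the log link.
expected_burglaries = math.exp(log_mu)
print(round(log_mu, 3), round(expected_burglaries, 3))
```

Because of the exponential, the expected count can never be negative, unlike a prediction from linear regression.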
20
Poisson Regression
Interpretation of the parameter estimate (B): we are looking at the relative risk of one store compared to another, keeping other factors constant.
- Relative risk in this case is the ratio of the expected burglary rate for an exposed group (in this case, “exposed” to proximity to pawn shops, CAP Index values and building layout) to that of a less exposed group
- Let’s consider store type. Our store layout variable compares mall and strip mall layouts to free standing locations.
- exp(-.152) ≈ .86 = multiplicative effect on the expected number of burglaries for stores located in a mall compared with free standing stores
- In English: if a free standing store is expected to have 4 burglaries, then a store in a mall setting is expected to have about .86 × 4 ≈ 3.4, if all other factors are similar.

Building Layout - Mall   B = -.152
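The multiplicative effect for mall stores can be reproduced with a couple of lines of Python. The variable names are illustrative; the coefficient and the baseline of 4 burglaries come from the slide.

```python
import math

B_MALL = -0.152          # Poisson coefficient for Building Layout - Mall
BASELINE_BURGLARIES = 4  # expected count for a comparable free standing store

# Exponentiating a Poisson coefficient gives a rate ratio.
rate_ratio = math.exp(B_MALL)
mall_expected = rate_ratio * BASELINE_BURGLARIES

print(round(rate_ratio, 3), round(mall_expected, 2))
```

Carrying three decimals gives a ratio of roughly .859 and an expected count of roughly 3.44 for the mall store.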
21
Poisson Regression: Overdispersion
- The variance of the response is much larger than the mean. This larger variance is known as overdispersion.
- It may have caused issues with our estimates.
- Consequences: parameter estimates are still consistent, but the standard errors are not; they tend to be underestimated.
- Since the standard error feeds into the test statistic (which drives the p-value), this can cause variables that are not significant to be deemed significant (Type 1 error).
- Remedy: Negative Binomial model, a headache for another time!
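A quick way to gauge overdispersion is to compare the variance of the response with its mean. The sketch below uses the burglary mean (.89) and standard deviation (1.352) reported earlier in the presentation. The ratio is a rough diagnostic on the raw response, not a formal test, and the variable names are illustrative.

```python
BURGLARY_MEAN = 0.89
BURGLARY_SD = 1.352

# A Poisson variable has variance equal to its mean, so this ratio
# should be close to 1 when the Poisson assumption holds.
variance = BURGLARY_SD ** 2
dispersion_ratio = variance / BURGLARY_MEAN

print(round(variance, 3), round(dispersion_ratio, 2))
```

A ratio of about 2 means the counts vary roughly twice as much as a Poisson distribution with the same mean would allow.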
22
Logistic Regression
- So we have seen that our distribution of burglary events over a year is definitely not normal, and not completely Poisson.
- Instead of trying to estimate how many times a store will be burglarized, we may just want to see what factors are associated with burglarized stores vs non-burglarized stores.
- We’ll have to create a new variable which assigns a “0” to non-burglarized stores and a “1” to stores that have been burglarized before
23
PROCEED WITH CAUTION
Now before we toss out data, some important considerations:
- Recoding data into a binary outcome takes away some of our information. A store with 10 burglaries is now treated the same as a store with one burglary
- Stores with 3+ burglaries may behave differently than stores with one or two
- Logistic regression is a little more difficult to interpret and requires different tools to evaluate model fit
- Logistic regression can be done in Excel, but it isn’t pretty or easy (for me at least)
- Never, ever, ever, ever throw away your data after recoding
Logistic regression lets us model the likelihood of an event happening given a set of predictor variables
24
Logistic Regression
Interpretation of Coefficient β: Odds Ratio
- The odds ratio is a statistic that compares the odds of one event (no burglary) with the odds of another event (burglary).
- Say the probability of Event 1 (no burglary) is π1 and the probability of Event 2 (burglary) is π2. Then the odds ratio of Event 1 to Event 2 is [π1 / (1 - π1)] / [π2 / (1 - π2)]
- Values of the odds ratio range from 0 to infinity
- A value between 0 and 1 indicates the odds of Event 2 are greater
- A value between 1 and infinity indicates the odds of Event 1 are greater
- A value equal to 1 indicates the events are equally likely
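As a small illustration of odds and odds ratios, the Python sketch below uses the standard definitions: the odds of an event with probability p are p / (1 - p), and an odds ratio is one set of odds divided by another. The probabilities .8 and .2 are hypothetical values chosen for the example, and the function names are illustrative.

```python
def odds(probability):
    """Odds of an event with the given probability (0 < p < 1)."""
    return probability / (1 - probability)


def odds_ratio(p_event_1, p_event_2):
    """Odds of Event 1 divided by the odds of Event 2."""
    return odds(p_event_1) / odds(p_event_2)


# Hypothetical probabilities: Event 1 = no burglary, Event 2 = burglary.
ratio = odds_ratio(0.8, 0.2)
print(round(ratio, 2))

# Two equally likely events give an odds ratio of 1.
print(odds_ratio(0.5, 0.5))
```

A ratio above 1 here says the odds of Event 1 are greater, matching the interpretation rules listed above.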
25
Logistic Regression

Model                          B        Std Error   P-value
Constant                       9.752    2.537       <.001
Building Layout - Mall         -2.903   .916        .002
Building Layout - Strip mall   -2.242   .924        .015
CAP                            .057     .008        <.001
Proximity to pawn shops        -2.942   .437        <.001
Shrink                         32.296   20.686      .118
26
Logistic Regression - Interpretation
Interpretation of the parameter estimate, building layout = strip mall:
- exp(-2.242) ≈ .11
- The odds of being burglarized for stores based in strip malls are about .11 times the odds for free standing stores (our base level)

Model                          B        Std Error
Building Layout - Strip mall   -2.242   .924
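The exponentiated coefficients can be checked directly in Python. The sketch below uses the two building-layout coefficients from the logistic regression table; the variable names are illustrative.

```python
import math

B_STRIP_MALL = -2.242
B_MALL = -2.903

# Exponentiating a logistic coefficient gives an odds ratio
# relative to the base level (free standing stores).
or_strip_mall = math.exp(B_STRIP_MALL)
or_mall = math.exp(B_MALL)

print(round(or_strip_mall, 3), round(or_mall, 3))
```

Both values are well below 1, so in this model both layouts have lower odds of burglary than a free standing store with otherwise similar characteristics.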
27
Logistic Regression - Interpretation
Estimating the probability of burglary given a set of information
So we want to model the probability that a store will be burglarized given the following information:
- We have a mall-based store, in an area with a CAP Index rating of 100, with .025 shrink, that is 5 miles away from a pawn shop
28
Logistic Regression - Interpretation

Model                          B        Actual value
Constant                       9.752    -
Building Layout - Mall         -2.903   1
Building Layout - Strip mall   -2.242   0
CAP                            .057     100
Shrink                         32.296   .025
Proximity to pawn shops        -2.942   5.0
29
Logistic Regression - Interpretation
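The probability calculation for this store can be sketched in Python. It assumes the standard logistic form, probability = 1 / (1 + exp(-z)), where z is the constant plus each coefficient times the store's value. The strip mall indicator is set to 0 because the store is in a mall. The dictionary keys are illustrative names.

```python
import math

# B values from the logistic regression table.
LOGIT_B = {
    "constant": 9.752,
    "mall": -2.903,
    "strip_mall": -2.242,
    "cap": 0.057,
    "shrink": 32.296,
    "pawn_miles": -2.942,
}

# The store described in the slides: mall, CAP 100, shrink .025, 5 miles.
store = {"mall": 1, "strip_mall": 0, "cap": 100, "shrink": 0.025, "pawn_miles": 5.0}

# Linear predictor (log-odds of being burglarized).
z = LOGIT_B["constant"] + sum(LOGIT_B[k] * v for k, v in store.items())

# Convert log-odds to a probability.
probability = 1 / (1 + math.exp(-z))
print(round(z, 4), round(probability, 3))
```

The result is a probability of roughly 0.2, in line with the 20% figure quoted for this store.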
30
Logistic Regression - Interpretation
- The previous calculation, although painful, provided us with good news.
- A store with those characteristics faces only about a 20% chance of being burglarized.
- In the real world, you could classify that store as low risk and play around with other characteristics to see what drives a higher-risk store
- Identifying high-risk stores and being proactive before they get burglarized is what data analysis is all about.
31
Quick Comparison (P-values)

Variable                       Linear regression   Poisson reg   Logistic reg
Constant                       <.001               <.001         <.001
Building Layout - Mall         .319                .083          .002
Building Layout - Strip mall   .331                .420          .015
CAP Scores                     <.001                             <.001
Proximity to pawn shops        <.001                             <.001
Shrink %                       .118                .401          .118
32
Takeaways
- It is important to look at the distribution of your response variable as well as your predictors to find the appropriate modeling technique
- Life is easier with linear regression, but alternative modeling techniques can give you new insights that may otherwise be missed
- Advanced modeling techniques can be helpful or stressful depending on your familiarity with statistics
- When collecting information, it’s best to have data that can take a wider range of values
- You can turn a continuous variable into a categorical or binary variable, but not vice versa
- Modeling continuous variables usually leads to easier interpretations of coefficients
- If you are serious about statistical modeling, Microsoft Excel may not cut it
- A “cheap” statistical model from Excel may end up costing you more than an “expensive” model from a statistical program