Logistic Regression. What is the purpose of Regression?


Logistic Regression

What is the purpose of Regression?

A multiple regression model

We shall develop a multiple regression model to predict the stack loss for a given set of values of air flow, water temperature, and acid concentration.

> head(stackloss)
  Air.Flow Water.Temp Acid.Conc. stack.loss
1       80         27         89         42
2       80         27         88         37
3       75         25         90         37
4       62         24         87         28
5       62         22         87         18
6       62         23         87         18

A multiple regression model

We apply the lm function to a formula that describes the variable stack.loss in terms of the variables Air.Flow, Water.Temp and Acid.Conc., and save the fitted linear regression model in a new variable, stackloss.lm.

> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

Multiple regression model: Estimated value of y

Apply the multiple linear regression model for the data set stackloss, and predict the stack loss if the air flow is 72, the water temperature is 20 and the acid concentration is 85.

> predictorvals = data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
> predictorvals
  Air.Flow Water.Temp Acid.Conc.
1       72         20         85
> predict(stackloss.lm, predictorvals)
       1
24.58173

The predicted stack loss is approximately 24.582.

Multiple regression model: Significance Test

> summary(stackloss.lm)

Call:
lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max
-7.2377 -1.7117 -0.4551  2.3614  5.6978

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960  -3.356  0.00375 **
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 **
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09

As the p-values of Air.Flow and Water.Temp are less than 0.05, they are both statistically significant in the multiple linear regression model of stackloss.

Types of variables: categorical variables are also called nominal variables.

Why Logistic Regression?

Often in studies we encounter outcomes that are not continuous, but instead fall into one of two categories (dichotomous). For example:
● Disease status (disease vs no disease)
● Alive or dead
● Fire or no fire
● Result of an exam (passed or failed)
● Credit card payment (default or no default)

Logistic Regression

Models the relationship between a set of variables x_i:
● dichotomous (smoker: yes/no)
● categorical (social class, gender...)
● continuous (age, weight, height...)
and a dichotomous variable Y.

Logistic Regression

Age and signs of coronary heart disease (CD) in women: can we predict from a woman’s age whether she will have symptoms of CD?

How can we analyse these data?

Comparison of the mean age of diseased and non-diseased women:
Non-diseased: 38.6 years
Diseased: 58.7 years

Is linear regression possible?

Dot-plot from the data

Note that there are only two values of signs of coronary disease (y) – Yes and No. If you try to fit a straight line through the data, the line would neither begin nor end at 0 and 1; a straight line is not constrained to stay within the interval [0, 1].

How can we analyse these data?

Prevalence (%) of signs of CD according to age group

Dot-plot: data from the table of age groups, plotting Diseased % against Age-group (years).

What does the logistic function curve look like?

Plotting the probability of disease against x gives an S-shaped curve, called a sigmoid function.
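
As an illustration (not from the original slides), the curve can be drawn in R with the standard logistic function; the intercept -5 and slope 0.11 below are arbitrary values chosen only to produce a visible S-shape over the plotted range:

# Standard logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid = function(z) 1 / (1 + exp(-z))

# Plot probability against x for illustrative coefficients b0 = -5 and b1 = 0.11
x = seq(0, 100, by = 1)
plot(x, sigmoid(-5 + 0.11 * x), type = "l",
     xlab = "x (e.g. age)", ylab = "Probability of disease")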

Example

An auto club mails a flier to its members offering to send more information on a supplemental health insurance plan if the club member returns a brief form in an enclosed envelope. Why do some members return the form and others do not? Can a model be developed to predict whether a club member will return the form? One theory is that older people are more likely to return the form because they are more concerned about their health, and they may have more disposable income to afford such coverage. Suppose a random sample of 92 club members is taken, and members are asked their age and whether they returned the form.

Example What is the explanatory variable or predictor? What is the response or outcome variable?

Example: Autoclub data in the sample

> head(autoclub)
  Requested Age
> tail(autoclub)
  Requested Age

Generalized Linear Models

Logistic Regression

Properties of the Logit
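
The formulas for this slide are not reproduced in the transcript; the standard definition, which the following slides build on, is

\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right), \qquad 0 < p < 1

Key properties: the logit is the log of the odds; it equals 0 at p = 0.5, tends to -\infty as p \to 0 and to +\infty as p \to 1. It therefore maps probabilities in (0, 1) onto the whole real line, where a linear model is appropriate.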

Logistic function (sigmoid curve plotted against x)

Logistic regression function
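
In standard notation (supplied here, as the slide's formula is an image), the logistic regression function for a single predictor x is

p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}

where p(x) is the probability of the event for a given value of x, and \beta_0, \beta_1 are the coefficients to be estimated.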

Logistic regression: Odds ratio

Odds ratio: Example

If there is a 0.60 probability that it will rain, then there is a 0.40 probability that it will not rain. What are the odds that it will rain?

Odds(rain) = probability(rain) / probability(not rain) = 0.60 / 0.40 = 1.50

Odds ratio: Transformation of the logistic model
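
Written out (standard algebra, supplied since the slide's formulas are not in the transcript), taking the odds of the logistic model and then the natural log linearizes it:

\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}, \qquad \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x

so e^{\beta_1} is the odds ratio associated with a one-unit increase in x.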

Transformed logistic model

Transformed logistic model

The β coefficients need to be estimated from the data. For this, use the following steps (see the equations below):
1. Take the logarithm of the likelihood function.
2. Calculate the partial derivatives with respect to each β coefficient; for n unknown β coefficients, there will be n equations.
3. Set the n equations equal to zero.
4. Solve the n equations for the n unknown β coefficients to get their values.
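
Concretely (standard maximum likelihood algebra, not shown in the transcript), for observations (x_i, y_i) with y_i \in \{0, 1\} and fitted probability p_i = p(x_i), the log-likelihood and the resulting score equations are

\ell(\beta) = \sum_{i=1}^{n} \big[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \big], \qquad \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - p_i)\, x_{ij} = 0

These equations have no closed-form solution, which is why the iterative method described on the next slides is used.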

Developing a logistic model

In logistic regression, the least squares methodology is not used to develop the model. Instead, a maximum likelihood method, which maximizes the probability of obtaining the observed results, is used. This method is preferred because it has better statistical properties. Maximum likelihood estimation is an iterative process and is carried out by software.

Maximum likelihood estimation

Likelihood measures how well a set of data supports a particular value of a parameter or coefficient (the probability of having obtained the observed data if the true parameter(s) equalled that value).
● Calculate the probability of obtaining the observed sample data for each possible value of the parameter.
● Compare this probability among the different values generated.
● The value with the highest support (i.e. the highest probability) is the maximum likelihood estimate, which is the best estimate of the parameter/coefficient.

Auto club data: Logistic Regression Output from Minitab

Auto club data: Analysing the logistic model

Auto club data: Developing the regression model in R

> auto.glm = glm(formula = Requested ~ Age, data = autoclub, family = binomial)
> summary(auto.glm)

Call:
glm(formula = Requested ~ Age, family = binomial, data = autoclub)

Deviance Residuals:
    Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                e-06 ***
Age                                        e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on 91 degrees of freedom
Residual deviance:  on 90 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 7
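
Once the model is fitted, it can be used to estimate the probability that a member of a given age returns the form. A minimal sketch (the age of 60 is an arbitrary illustrative value):

# Predicted probability of returning the form for a hypothetical 60-year-old member
> predict(auto.glm, data.frame(Age = 60), type = "response")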

Developing a logistic regression model

mtcars: Motor Trend Car Road Tests. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Developing a logistic regression model

Problem: using the logistic regression equation of vehicle transmission in the data set mtcars in R, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120 hp engine and weighs 2800 lbs.

Developing a logistic regression model

mtcars: A data frame with 32 observations on 11 variables.
[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (lb/1000)
[, 7]  qsec  1/4 mile time
[, 8]  vs    V/S
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors

Developing a logistic regression model

mtcars: A data frame with 32 observations on 11 variables.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Developing a logistic regression model

> am.glm = glm(formula = am ~ hp + wt, data = mtcars, family = binomial)
> summary(am.glm)

Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals:
    Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.86630    7.44356   2.535   0.0113 *
hp           0.03626    0.01773   2.044   0.0409 *
wt          -8.08348    3.06868  -2.634   0.0084 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8
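
The fitted model answers the problem posed earlier. Note that wt is measured in units of 1000 lbs, so a 2800 lb vehicle enters as wt = 2.8; with the coefficients above, the predicted probability works out to roughly 0.64:

# Estimated probability of a manual transmission for hp = 120, wt = 2800 lbs
> newdata = data.frame(hp = 120, wt = 2.8)
> predict(am.glm, newdata, type = "response")

Equivalently, by direct substitution into the logistic function: p = 1 / (1 + exp(-(18.8663 + 0.03626*120 - 8.08348*2.8))) ≈ 0.64.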

Significance testing in logistic regression

Similar to linear and multiple regression, we have several hypotheses to test with logistic regression.
● The contribution of individual regression coefficients is tested with Wald’s test (similar to the t test).
● The contribution of several coefficients simultaneously is tested with deviance tests (similar to the F test).

Significance testing in logistic regression – Deviance Test

Deviance is a measure of how well the model fits the data. It is -2 times the log-likelihood of the dataset, given the model.

If you think of deviance as analogous to variance, then the null deviance is similar to the variance of the data around the average rate of positive examples, while the residual deviance is similar to the variance of the data around the model.

The first thing we can do with the null and residual deviances is to check whether the model’s probability predictions are better than guessing the average rate of positives, statistically speaking. In other words, is the reduction (drop) in deviance from the model meaningful, or just something that was observed by chance?

Significance testing in logistic regression – Deviance Test (contd.)

This is similar to calculating the F-test statistic for linear regression. The test you will run is the chi-squared test. To do that, you need to know the degrees of freedom for the null ("no predictor" or "constant only") model and for the actual model (both reported in the summary of the R output). The degrees of freedom of the null model is the number of data points minus 1. The degrees of freedom of the model that you fit is the number of data points minus the number of coefficients in the model.

Significance testing in logistic regression – Deviance Test (contd.)

If the number of data points in the sample is large, and degrees of freedom (null) minus degrees of freedom (model) is small, then the G statistic (the difference between the null deviance and the residual deviance) is approximately distributed as a chi-square distribution with degrees of freedom = degrees of freedom (null) minus degrees of freedom (model).

If the associated p-value is very small, it is extremely unlikely that we could have seen this much reduction (drop) in deviance by chance, and the model is statistically significant.
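
A sketch of this test in R, using the am.glm model fitted earlier (anova(am.glm, test = "Chisq") is an equivalent built-in route):

# G statistic: drop in deviance from the null model to the fitted model
> G = am.glm$null.deviance - am.glm$deviance
> df = am.glm$df.null - am.glm$df.residual
# p-value: upper tail of the chi-square distribution with df degrees of freedom
> pchisq(G, df, lower.tail = FALSE)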

Significance testing in logistic regression – Pseudo R-squared

Analogous to the R-squared measure for linear regression. It equals 1 – (Residual deviance / Null deviance), and measures how much of the deviance is explained by the model. Ideally it should be close to 1.
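
In R this is a one-line computation on the fitted model object (again using am.glm as the example):

# Pseudo R-squared: fraction of the null deviance explained by the model
> 1 - am.glm$deviance / am.glm$null.deviance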

Significance testing in logistic regression – AIC

AIC stands for Akaike Information Criterion. It is the log likelihood adjusted (penalized) for the number of coefficients in the model. Just as the R-squared of a linear regression is generally higher when the number of predictors is higher, the log likelihood also increases with the number of predictors. If you have several different models, with different sets of predictors, fitted on the same sample, you can consider the model with the lowest AIC to be the best fit.
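
For example (a sketch on the mtcars data from earlier; the single-predictor model am.glm2 is introduced here only for the comparison):

# Compare two candidate models fitted on the same sample; lower AIC is the better fit
> am.glm2 = glm(am ~ wt, data = mtcars, family = binomial)
> AIC(am.glm, am.glm2)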

Significance testing in logistic regression – Fisher Scoring Iterations

An iterative optimization method used to find the best coefficients for the logistic regression model. The fitting should converge in about 6 to 8 iterations. If many more iterations are reported, the algorithm may not have converged, and the model may not be valid.

Concordance and Discordance

● Calculate the estimated probability for each observation from the logistic regression model.
● Divide the sample data into two datasets: one containing all observations whose actual value of the dependent variable is 1 (event), the other containing all observations whose actual value is 0 (non-event).
● Compare each predicted value in the first dataset with each predicted value in the second dataset.

Concordance and Discordance

Total number of pairs to compare = x * y, where
x: number of observations in the first dataset (actual value 1 in the response variable)
y: number of observations in the second dataset (actual value 0 in the response variable)

Concordance and Discordance

A pair is concordant if the 1 (observation with the desired outcome, i.e. the event) has a higher predicted probability than the 0 (observation without the outcome, i.e. the non-event).
A pair is discordant if the 0 has a higher predicted probability than the 1.
A pair is tied if the 1 and the 0 have the same predicted probability.

Concordance and Discordance

Percent Concordant = (number of concordant pairs) / (total number of pairs)
Percent Discordant = (number of discordant pairs) / (total number of pairs)
Percent Tied = (number of tied pairs) / (total number of pairs)

Concordance and Discordance

Percent Concordant: the percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event).
Percent Discordant: the percentage of pairs where the observation with the event has a lower predicted probability than the observation without the event.
Percent Tied: the percentage of pairs where the two observations have the same predicted probability.
In general, higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.
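
A minimal R sketch of this pairwise calculation (the helper name concordance is my own; for the auto club model it would be called as concordance(autoclub$Requested, fitted(auto.glm))):

# Percent concordant, discordant and tied pairs for a binary outcome y
# and a vector of predicted probabilities prob
concordance = function(y, prob) {
  p1 = prob[y == 1]          # predicted probabilities of the events
  p0 = prob[y == 0]          # predicted probabilities of the non-events
  d = outer(p1, p0, "-")     # every event/non-event probability difference
  n = length(d)              # total number of pairs = x * y
  c(concordant = sum(d > 0) / n,
    discordant = sum(d < 0) / n,
    tied       = sum(d == 0) / n)
}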