Presentation on theme: "Logistic Regression. What is the purpose of Regression?"— Presentation transcript:

1 Logistic Regression

2 What is the purpose of Regression?


4 A multiple regression model
We shall develop a multiple regression model to predict the stack loss for a given set of values for air flow, water temperature, and acid concentration.

> head(stackloss)
  Air.Flow Water.Temp Acid.Conc. stack.loss
1       80         27         89         42
2       80         27         88         37
3       75         25         90         37
4       62         24         87         28
5       62         22         87         18
6       62         23         87         18

5 A multiple regression model
We apply the lm function to a formula that describes the variable stack.loss in terms of the variables Air.Flow, Water.Temp, and Acid.Conc., and save the fitted linear regression model in a new variable, stackloss.lm.

> stackloss.lm = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

6 Multiple regression model: Estimated value of y
Apply the multiple linear regression model for the data set stackloss to predict the stack loss when the air flow is 72, the water temperature is 20, and the acid concentration is 85.

> predictorvals = data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
> predictorvals
  Air.Flow Water.Temp Acid.Conc.
1       72         20         85
> predict(stackloss.lm, predictorvals)
       1
24.58173

The predicted stack loss is 24.582.

7 Multiple regression model: Significance Test
> summary(stackloss.lm)

Call:
lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max
-7.2377 -1.7117 -0.4551  2.3614  5.6978

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960  -3.356  0.00375 **
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 **
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09

As the p-values of Air.Flow and Water.Temp are less than 0.05, they are both statistically significant in the multiple linear regression model of stackloss.
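As a quick check (our own sketch, not part of the slides), the prediction on the previous slide can be reproduced by hand from these coefficient estimates:

```r
# Reproduce the predicted stack loss from the fitted coefficients
# (values copied from the summary output above)
b0 <- -39.9197   # intercept
b1 <-   0.7156   # Air.Flow
b2 <-   1.2953   # Water.Temp
b3 <-  -0.1521   # Acid.Conc.

y_hat <- b0 + b1 * 72 + b2 * 20 + b3 * 85
round(y_hat, 3)   # 24.581, matching predict() up to coefficient rounding
```

The tiny difference from 24.58173 comes from using the four-decimal coefficients printed in the summary rather than the full-precision ones stored in the model object.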

8 Types of variables (categorical variables are also called nominal)

9 Why Logistic Regression?
Often in studies we encounter outcomes that are not continuous, but instead fall into one of two categories (dichotomous). For example:
● Disease status (disease vs no disease)
● Alive or dead
● Fire or no fire
● Result of an exam (passed or failed)
● Credit card payment (default or not)

10 Logistic Regression
Models the relationship between a set of explanatory variables xᵢ, which may be
dichotomous (smoker: yes/no),
categorical (social class, gender...), or
continuous (age, weight, height...),
and a dichotomous response variable Y.

11 Logistic Regression Age and signs of coronary heart disease (CD) in women Can we predict from a woman’s age whether she will have symptoms of CD?

12 How can we analyse these data? Comparison of the mean age of diseased and non-diseased women: Non-diseased: 38.6 years Diseased: 58.7 years Is linear regression possible?

13 Dot-plot of the data
Note that there are only two values of the response, signs of coronary disease (y): Yes and No. A straight line fitted through these data would not stay bounded between 0 and 1.

14 How can we analyse these data? Prevalence (%) of signs of CD according to age group

15 Dot-plot: Data from the table of age groups [figure: diseased % on the y-axis, age group (years) on the x-axis]

16 What does the logistic function curve look like?
[figure: probability of disease plotted against x] The S-shaped curve is called a sigmoid function.
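The sigmoid can be written in one line of R (an illustrative sketch, not part of the slides):

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(0)           # 0.5: the midpoint of the curve
sigmoid(c(-10, 10))  # values approach 0 and 1 but never reach them
```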

17 Example An auto club mails a flier to its members offering to send more information on a supplemental health insurance plan if the club member returns a brief form in an enclosed envelope. Why do some members return the form and others do not? Can a model be developed to predict whether a club member will return the form? One theory is that older people are more likely to return the form because they are more concerned about their health and they may have more disposable income to afford such coverage. Suppose a random sample of 92 club members is taken and members are asked their age and if they have returned the form.

18 Example What is the explanatory variable or predictor? What is the response or outcome variable?

19 Example

20 Example: Autoclub data in the sample
> head(autoclub)
  Requested Age
1         1  52
2         1  57
3         1  53
4         1  57
5         1  48
6         1  50
> tail(autoclub)
   Requested Age
87         0  39
88         0  42
89         0  39
90         0  32
91         0  29
92         0  34

21 Generalized Linear Models

22 Logistic Regression

23 Properties of the Logit
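The body of this slide is an image; the standard definition it presumably presents is:

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right), \qquad 0 < p < 1
```

Key properties: the logit maps the interval (0, 1) onto the whole real line, logit(0.5) = 0, and it is antisymmetric: logit(1 − p) = −logit(p).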

24 Logistic function

25 Logistic regression function

26 Logistic regression: Odds ratio
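The body of this slide is an image; the standard relationship it presumably illustrates (in the usual notation) is:

```latex
\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x},
\qquad
\mathrm{OR} = \frac{\text{odds}(x+1)}{\text{odds}(x)} = e^{\beta_1}
```

That is, under the logistic model the odds are an exponential function of x, so each one-unit increase in x multiplies the odds by the constant factor e^{β₁}, the odds ratio.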


28 Odds ratio: Example
If there is a 0.60 probability that it will rain, then there is a 0.40 probability that it will not rain. What are the odds that it will rain?
Odds it will rain are probability(rain) / probability(not rain) = 0.60 / 0.40 = 1.50
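The same arithmetic in R (a trivial sketch, for completeness):

```r
# Odds from a probability: odds = p / (1 - p)
p_rain    <- 0.60
odds_rain <- p_rain / (1 - p_rain)
odds_rain   # 1.5, i.e. odds of 3 to 2 in favour of rain
```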

29 Odds ratio: Transformation of the logistic model

30 Transformed logistic model

31 Transformed logistic model
The β coefficients need to be estimated from the data. For this, use the following steps:
1. Take the logarithm of the likelihood function.
2. Calculate the partial derivatives with respect to each β coefficient. For n unknown β coefficients, there will be n equations.
3. Set the n equations for the n unknown β coefficients equal to zero.
4. Solve the n equations for the n unknown β coefficients to obtain their values.

32 Developing the logistic model
In logistic regression, least squares methodology is not used to develop the model. Instead, a maximum likelihood method, which maximizes the probability of obtaining the observed results, is used. This method is preferred because it has better statistical properties. Maximum likelihood estimation is an iterative process and is done by software.

33 Maximum likelihood estimation
Likelihood measures how well a set of data supports a particular value of a parameter or coefficient (the probability of having obtained the observed data if the true parameter equaled that value). Calculate the probability of obtaining the observed sample data for each possible value of the parameter. Compare this probability across the different values. The value with the highest support (i.e. highest probability) is the maximum likelihood estimate, which is the best estimate of the parameter or coefficient.
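The idea can be illustrated with a toy grid search for a single Bernoulli parameter (our own sketch, not part of the slides):

```r
# Toy maximum likelihood: 7 successes observed in 10 trials
y <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)

# Candidate parameter values and the likelihood of the data under each
p_grid <- seq(0.01, 0.99, by = 0.01)
lik    <- sapply(p_grid, function(p) prod(p^y * (1 - p)^(1 - y)))

# The value with the highest likelihood is the maximum likelihood estimate
p_hat <- p_grid[which.max(lik)]
p_hat   # 0.7, the sample proportion of successes
```

Real software maximizes the log-likelihood with an iterative algorithm rather than a grid, but the principle, picking the parameter value under which the observed data are most probable, is the same.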

34 Auto club data: Logistic Regression Output from Minitab

35 Auto club data: Analysing the logistic model

36 Auto club data: Developing the regression model in R
> auto.glm = glm(formula = Requested ~ Age, data = autoclub, family = binomial)
> summary(auto.glm)

Call:
glm(formula = Requested ~ Age, family = binomial, data = autoclub)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.95015  -0.32016  -0.05335   0.26538   1.72940

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -20.40782    4.52332  -4.512 6.43e-06 ***
Age           0.42592    0.09482   4.492 7.05e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 123.156  on 91  degrees of freedom
Residual deviance:  49.937  on 90  degrees of freedom
AIC: 53.937

Number of Fisher Scoring iterations: 7
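Using the fitted coefficients, we can estimate the probability that a member of a given age returns the form (the age of 50 is our own illustrative choice, not from the slides):

```r
# Fitted coefficients copied from the summary output above
b0 <- -20.40782
b1 <-   0.42592

# plogis() is R's logistic function: 1 / (1 + exp(-x))
p50 <- plogis(b0 + b1 * 50)
round(p50, 3)   # about 0.709: a 50-year-old has roughly a 71% chance
```

The same number could be obtained with `predict(auto.glm, data.frame(Age = 50), type = "response")`, up to coefficient rounding.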

37 Developing a logistic regression model mtcars: Motor Trend Car Road Tests The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

38 Developing a logistic regression model
Problem: Using the logistic regression equation of vehicle transmission in the data set mtcars in R, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120 hp engine and weighs 2800 lbs.

39 Developing a logistic regression model
mtcars: A data frame with 32 observations on 11 variables.
[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (lb/1000)
[, 7]  qsec  1/4 mile time
[, 8]  vs    V/S
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors

40 Developing a logistic regression model
mtcars: A data frame with 32 observations on 11 variables.
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

41 Developing a logistic regression model
> am.glm = glm(formula = am ~ hp + wt, data = mtcars, family = binomial)
> summary(am.glm)

Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.2537 -0.1568 -0.0168  0.1543  1.3449

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.86630    7.44356   2.535  0.01126 *
hp           0.03626    0.01773   2.044  0.04091 *
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8
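This model answers the problem posed on slide 38. Plugging in hp = 120 and wt = 2.8 (wt is measured in thousands of pounds, so 2800 lbs is 2.8):

```r
# Fitted coefficients copied from the summary output above
b0   <- 18.86630
b_hp <-  0.03626
b_wt <- -8.08348

# Estimated probability of a manual transmission for a 120 hp, 2800 lb car
p_manual <- plogis(b0 + b_hp * 120 + b_wt * 2.8)
round(p_manual, 3)   # about 0.642, i.e. a 64.2% chance of a manual gearbox
```

Equivalently: `predict(am.glm, data.frame(hp = 120, wt = 2.8), type = "response")`.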

42 Significance testing in logistic regression
As with linear and multiple regression, we have several hypotheses to test in logistic regression.
The contribution of an individual regression coefficient is tested with Wald's test (similar to the t test).
The contribution of several coefficients simultaneously is tested with deviance tests (similar to the F test).

43 Significance testing in logistic regression – Deviance Test
Deviance is a measure of how well the model fits the data. It is −2 times the log-likelihood of the dataset given the model. If you think of deviance as analogous to variance, then the null deviance is similar to the variance of the data around the average rate of positive examples, and the residual deviance is similar to the variance of the data around the model. The first thing we can do with the null and residual deviances is to check whether the model's probability predictions are better, statistically speaking, than simply guessing the average rate of positives. In other words, is the reduction (drop) in deviance from the model meaningful, or just something observed by chance?

44 Significance testing in logistic regression – Deviance Test (contd.)
This is similar to calculating the F-test statistic for linear regression. The test you will run is the chi-squared test. To do that, you need to know the degrees of freedom for the null ("no predictor" or "constant only") model and for the actual model (both reported in the summary of the R output). The degrees of freedom of the null model is the number of data points minus 1. The degrees of freedom of the fitted model is the number of data points minus the number of coefficients in the model.

45 Significance testing in logistic regression – Deviance Test (contd.)
If the number of data points in the sample is large, and df(null) − df(model) is small, then the G statistic, the difference between the null deviance and the residual deviance, is approximately chi-square distributed with degrees of freedom = df(null) − df(model). If the associated p-value is very small, it is extremely unlikely that we could have seen this much reduction (drop) in deviance by chance, and the model is statistically significant.
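Applied to the auto club model on slide 36 (our own worked check, using the deviances printed there):

```r
# Null and residual deviances from the glm summary on slide 36
null_dev  <- 123.156   # 91 degrees of freedom
resid_dev <-  49.937   # 90 degrees of freedom

G  <- null_dev - resid_dev   # drop in deviance
df <- 91 - 90                # difference in degrees of freedom
p  <- pchisq(G, df = df, lower.tail = FALSE)
G   # 73.219
p   # vanishingly small, so the model is statistically significant
```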

46 Significance testing in logistic regression – Pseudo R-squared
Analogous to the R-squared measure for linear regression. Equals 1 − (residual deviance / null deviance). A measure of how much of the deviance is explained by the model. Ideally it should be close to 1.
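For the auto club model on slide 36, this gives (our own worked check):

```r
# Pseudo R-squared: 1 - (residual deviance / null deviance),
# using the deviances from the glm summary on slide 36
pseudo_r2 <- 1 - 49.937 / 123.156
round(pseudo_r2, 3)   # about 0.595: the model explains ~59.5% of the deviance
```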

47 Significance testing in logistic regression – AIC
AIC stands for Akaike Information Criterion. It is the log-likelihood adjusted for the number of coefficients. Just as the R-squared of a linear regression is generally higher when the number of predictors is higher, the log-likelihood also increases with the number of predictors. If you fit several different models with different sets of predictors on the same sample, you can consider the model with the lowest AIC to be the best fit.
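For ungrouped 0/1 data, the residual deviance equals −2 times the log-likelihood, so AIC = residual deviance + 2 × (number of coefficients). The auto club output on slide 36 can be checked this way (our own sketch):

```r
# AIC check for the auto club model on slide 36:
# residual deviance 49.937, two coefficients (intercept and Age)
aic <- 49.937 + 2 * 2
aic   # 53.937, matching the AIC printed in the summary output
```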

48 Significance testing in logistic regression – Fisher Scoring Iterations
An iterative optimization method used to find the best coefficients for the logistic regression model. It should converge in about 6 to 8 iterations. If there are more iterations, the algorithm may not have converged, and the model may not be valid.


50 Concordance and Discordance
1. Calculate the estimated probability for each observation using the logistic regression model.
2. Divide the sample data into two datasets: one contains all observations whose actual value of the dependent variable is 1 (events); the other contains all observations whose actual value is 0 (non-events).
3. Compare each predicted value in the first dataset with each predicted value in the second dataset.

51 Concordance and Discordance
Total number of pairs to compare = x × y, where
x: number of observations in the first dataset (actual value 1 in the response variable)
y: number of observations in the second dataset (actual value 0 in the response variable)

52 Concordance and Discordance
A pair is concordant if the 1 (the observation with the outcome, i.e. the event) has a higher predicted probability than the 0 (the observation without the outcome, i.e. the non-event).
A pair is discordant if the 0 has a higher predicted probability than the 1.
A pair is tied if the 1 and the 0 have the same predicted probability.

53 Concordance and Discordance
Percent concordant = (number of concordant pairs) / total number of pairs
Percent discordant = (number of discordant pairs) / total number of pairs
Percent tied = (number of tied pairs) / total number of pairs

54 Concordance and Discordance
Percent concordant: percentage of pairs where the observation with the outcome (event) has a higher predicted probability than the observation without the outcome (non-event).
Percent discordant: percentage of pairs where the observation with the outcome has a lower predicted probability than the observation without the outcome.
Percent tied: percentage of pairs where the two observations have the same predicted probability.
In general, a higher percentage of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.
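A minimal sketch of the pair-counting procedure, using made-up predicted probabilities of our own (not data from the slides):

```r
# Toy data: actual outcomes and the model's predicted probabilities
actual <- c(1, 1, 0, 0, 0)
pred   <- c(0.9, 0.4, 0.4, 0.3, 0.2)

events    <- pred[actual == 1]   # predictions for the actual 1s
nonevents <- pred[actual == 0]   # predictions for the actual 0s

# Compare every event prediction with every non-event prediction (x * y pairs)
pairs <- expand.grid(e = events, n = nonevents)
concordant <- mean(pairs$e >  pairs$n)
discordant <- mean(pairs$e <  pairs$n)
tied       <- mean(pairs$e == pairs$n)
c(concordant, discordant, tied)   # 5/6 concordant, 0 discordant, 1/6 tied
```

Here 5 of the 2 × 3 = 6 pairs are concordant and one is tied, so by the criterion above this toy model discriminates events from non-events fairly well.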

