1 Logistic Regression and the New: Residual Logistic Regression. F. Berenice Baez-Revueltas and Wei Zhu

2 Outline
1. Logistic Regression
2. Confounding Variables
3. Controlling for Confounding Variables
4. Residual Linear Regression
5. Residual Logistic Regression
6. Examples
7. Discussion
8. Future Work

3 1. Logistic Regression Model
In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable.

4 A Popular Model for a Categorical Response Variable
 The logistic regression model is the most popular model for binary data.
 It is generally used to study the relationship between a binary response variable and a group of predictors (which can be either continuous or categorical), with Y = 1 (true, success, YES, etc.) or Y = 0 (false, failure, NO, etc.).
 The model can be extended to a categorical response variable with more than two categories; the resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the 'binomial' logistic regression for a binary response variable).

5 More on the Rationale of the Logistic Regression Model
 Consider a binary response variable Y = 0 or 1 and a single predictor variable x. We want to model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of the predictor:
$\mathrm{logit}\,P(Y=1|x) = \log\frac{P(Y=1|x)}{1-P(Y=1|x)} = \beta_0 + \beta_1 x$
This model can be rewritten as
$P(Y=1|x) = \frac{\exp(\beta_0+\beta_1 x)}{1+\exp(\beta_0+\beta_1 x)}$
 Note that E(Y|x) = P(Y=1|x)·1 + P(Y=0|x)·0 = P(Y=1|x), and since exp(·) > 0 the logistic form is bounded between 0 and 1 for all values of x. The following linear model may violate this condition: $P(Y=1|x) = \beta_0 + \beta_1 x$.

6 More on the Properties of the Logistic Regression Model
 In simple logistic regression, the regression coefficient $\beta_1$ has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x.
 For multiple predictor variables $x_1, \ldots, x_p$, the logistic regression model is
$\mathrm{logit}\,P(Y=1|x_1,\ldots,x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
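As a quick numerical illustration (the coefficient value here is hypothetical, not from the slides): if a fitted model gives $\hat{\beta}_1 = 0.7$ for a binary smoking indicator, the odds of success for smokers are estimated to be $e^{0.7} \approx 2.01$ times the odds for non-smokers.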

7 Logistic Regression, SAS Procedure: Proc Logistic
 This example of logistic regression comes with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is a high writing test score (honcomp), where a writing score greater than or equal to 60 is considered high and less than 60 is considered low; we explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used here (hsb2) can be downloaded online.

data logit;
  set "c:\temp\hsb2";
  honcomp = (write >= 60);
run;

proc logistic data=logit descending;
  model honcomp = female read science;
run;

8 Logistic Regression, SAS Output

9 2. Confounding Variables
 Correlated with both the dependent and independent variables
 Represent a major threat to the validity of inferences on cause and effect
 Add to multicollinearity
 Can lead to over- or underestimation of an effect, and can even change the direction of the conclusion
 Add error to the interpretation of what may be an accurate measurement

10 For a variable to be a confounder it needs to:
 Be related to the exposure
 Be related to the outcome even in the absence of the exposure (not an intermediary)
 Not lie on the causal pathway
 Be unevenly distributed among the comparison groups
[Diagram: a third variable linked to both the exposure and the outcome]

11 Maternal age is correlated with birth order and is a risk factor for Down Syndrome, even when birth order is low. Smoking is correlated with alcohol consumption and is a risk factor for lung cancer, even for persons who don't drink alcohol.
[Diagrams: Maternal Age linked to Birth Order and Down Syndrome; Smoking linked to Alcohol and Lung Cancer; panels labeled Confounding and No Confounding]

12 3. Controlling for Confounding Variables
In study designs:
 Restriction
 Random allocation of subjects to study groups to attempt to even out unknown confounders
 Matching subjects using potential confounders

13 In data analysis:
 Stratified analysis using the Mantel-Haenszel method to adjust for confounders (case-control studies, cohort studies; see the sketch after this list)
 Restriction (still possible, but it means throwing data away)
 Model fitting using regression techniques
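A minimal SAS sketch of the stratified Mantel-Haenszel option (the dataset name study and the variables age_group, exposure and outcome are hypothetical placeholders); the CMH option of PROC FREQ produces Cochran-Mantel-Haenszel statistics adjusted for the stratification variable:

* Mantel-Haenszel analysis, stratified (adjusted) by age_group;
* dataset and variable names are hypothetical;
proc freq data=study;
  tables age_group*exposure*outcome / cmh;
run;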

14 Pros and Cons of Controlling Methods
 Matching methods call for subjects with exactly the same characteristics
 Risk of over- or under-matching
 Cohort studies can lead to too much loss of information when excluding subjects
 Some strata might become too thin and thus insignificant, also creating loss of information
 Regression methods, if well handled, can control for confounding factors

15 4. Residual Linear Regression
 Consider a dependent variable Y and a set of n independent covariates $x_1, \ldots, x_n$, of which the first k (k < n) are potential confounding factors
 The initial model treats only the confounding variables:
$Y = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_k x_k + \varepsilon$
 Residuals are calculated from this model: let $e = Y - \hat{Y}$

16 The residuals e have the following properties:
 Zero mean
 Homoscedasticity
 Normally distributed
This residual will be considered the new dependent variable. That is, the new model to be fitted (see the SAS sketch below) is
$e = \gamma_0 + \gamma_{k+1} x_{k+1} + \cdots + \gamma_n x_n + \varepsilon'$
which is equivalent to
$Y = \hat{\alpha}_0 + \hat{\alpha}_1 x_1 + \cdots + \hat{\alpha}_k x_k + \gamma_0 + \gamma_{k+1} x_{k+1} + \cdots + \gamma_n x_n + \varepsilon'$
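A minimal SAS sketch of this two-stage procedure (the dataset mydata, confounders x1 and x2, and variables of interest x3 and x4 are hypothetical placeholders):

* Stage 1: regress Y on the confounders only and save the residuals;
proc reg data=mydata;
  model y = x1 x2;
  output out=stage1 r=resid;
run;

* Stage 2: regress the stage-1 residuals on the variables of interest;
proc reg data=stage1;
  model resid = x3 x4;
run;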

17 The Usual Logistic Regression Approach to 'Control for' Confounders
 Consider a binary outcome Y and n covariates $x_1, \ldots, x_n$, where the first k (k < n) are potential confounding factors
 The usual way to 'control for' these confounding variables is to simply put all n variables in the same model:
$\mathrm{logit}\,P(Y=1) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$

18 5. Residual Logistic Regression
 Each subject has a binary outcome Y
 Consider n covariates, where the first k (k < n) are potential confounding factors
 Initial model, with $\pi = P(Y=1)$ the probability of success, in which only the confounding effect is analyzed:
$\mathrm{logit}\,\pi = \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_k x_k$

19 Method 1
 The confounding variables' effect is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach.
 That is, let $C = \hat{\alpha}_0 + \hat{\alpha}_1 x_1 + \cdots + \hat{\alpha}_k x_k$ denote the estimated confounder effect from the initial model
 The new model to be fitted (sketched in SAS below) is
$\mathrm{logit}\,\pi = C + \gamma_{k+1} x_{k+1} + \cdots + \gamma_n x_n$
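A hedged SAS sketch of Method 1 under the same hypothetical setup as before (binary outcome y, confounders x1 and x2, variables of interest x3 and x4); one natural reading of "retaining and plugging in" the confounder effect is to carry the stage-1 linear predictor into the second model as an offset:

* Stage 1: logistic model on the confounders only; save the linear predictor;
proc logistic data=mydata descending;
  model y = x1 x2;
  output out=stage1 xbeta=conf_effect;
run;

* Stage 2: second-level logistic model, holding the stage-1 confounder effect fixed as an offset;
proc logistic data=stage1 descending;
  model y = x3 x4 / offset=conf_effect;
run;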

20 Method 2
 Pearson residuals are calculated from the initial model (Hosmer and Lemeshow, 1989):
$Z = \frac{Y - \hat{\pi}}{\sqrt{\hat{\pi}(1-\hat{\pi})}}$
where $\hat{\pi}$ is the estimated probability of success based on the confounding variables alone. The second-level regression will use this residual as the new dependent variable.

21 Therefore the new dependent variable is Z, and because it is no longer dichotomous we can apply a multiple linear regression model to analyze the effect of the rest of the covariates. The new model to be fitted is the linear regression (sketched in SAS below)
$Z = \gamma_0 + \gamma_{k+1} x_{k+1} + \cdots + \gamma_n x_n + \varepsilon$
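A minimal SAS sketch of Method 2 under the same hypothetical setup (RESCHI= requests Pearson residuals from PROC LOGISTIC):

* Stage 1: logistic model on the confounders only; save the Pearson residuals;
proc logistic data=mydata descending;
  model y = x1 x2;
  output out=stage1 reschi=z;
run;

* Stage 2: ordinary linear regression of the Pearson residuals on the variables of interest;
proc reg data=stage1;
  model z = x3 x4;
run;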

22 6. Example 1
 Data: Low Birth Weight
 Low: indicator of birth weight less than 2.5 kg
 Age: mother's age in years
 Lwt: mother's weight in pounds
 Smk: smoking status during pregnancy
 Ht: history of hypertension
[Table: correlation matrix of Age, Lwt, Smk and Ht, with alpha = 0.05]

23  Potential confounding factor: Age
 Model for $\pi$, the probability of low birth weight
 Logistic regression: $\mathrm{logit}\,\pi = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Lwt} + \beta_3\,\mathrm{Smk} + \beta_4\,\mathrm{Ht}$
 Residual logistic regression initial model: $\mathrm{logit}\,\pi = \alpha_0 + \alpha_1\,\mathrm{Age}$
 Method 1: $\mathrm{logit}\,\pi = \hat{\alpha}_0 + \hat{\alpha}_1\,\mathrm{Age} + \gamma_1\,\mathrm{Lwt} + \gamma_2\,\mathrm{Smk} + \gamma_3\,\mathrm{Ht}$
 Method 2: $Z = \gamma_0 + \gamma_1\,\mathrm{Lwt} + \gamma_2\,\mathrm{Smk} + \gamma_3\,\mathrm{Ht} + \varepsilon$

24 Results
[Table: odds ratio, p-value and SE for lwt, smk and ht under logistic regression and RLR Method 1; p-value and SE for lwt, smk and ht under RLR Method 2; p-values for the confounding factor Age under the ordinary logistic regression and the initial model]

25 Example 2
 Data: Alzheimer patients
 Decline: whether the subject's cognitive capabilities deteriorate or not
 Age: subject's age
 Gender: subject's gender
 MMS: Mini-Mental Score
 PDS: psychometric deterioration scale
 HDT: depression scale
[Table: correlation matrix of Age, Gender, MMS, PDS and HDT, with alpha = 0.05]

26  Potential confounding factors: Age, Gender
 Model for $\pi$, the probability of declining
 Logistic regression: $\mathrm{logit}\,\pi = \beta_0 + \beta_1\,\mathrm{Age} + \beta_2\,\mathrm{Gender} + \beta_3\,\mathrm{MMS} + \beta_4\,\mathrm{PDS} + \beta_5\,\mathrm{HDT}$
 Residual logistic regression initial model: $\mathrm{logit}\,\pi = \alpha_0 + \alpha_1\,\mathrm{Age} + \alpha_2\,\mathrm{Gender}$
 Method 1: $\mathrm{logit}\,\pi = \hat{\alpha}_0 + \hat{\alpha}_1\,\mathrm{Age} + \hat{\alpha}_2\,\mathrm{Gender} + \gamma_1\,\mathrm{MMS} + \gamma_2\,\mathrm{PDS} + \gamma_3\,\mathrm{HDT}$
 Method 2: $Z = \gamma_0 + \gamma_1\,\mathrm{MMS} + \gamma_2\,\mathrm{PDS} + \gamma_3\,\mathrm{HDT} + \varepsilon$

27 Results
[Table: odds ratio, p-value and SE for mms, pds and hdt under logistic regression and RLR Method 1; p-value and SE for mms, pds and hdt under RLR Method 2; p-values for the confounding factors Age and Gender under the ordinary logistic regression and the initial model]

28 7. Discussion
 The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
 Method 1 is designed to control for confounding factors; however, in the given examples Method 1 yields results similar to the usual logistic regression approach.
 Method 2 appears to be more accurate, with some SEs significantly reduced and thus significantly smaller p-values for some regressors. However, it does not yield odds ratios as Method 1 can.

29 8. Future Work
We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results. We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variables will be taken into account.

30 Selected References
 Menard, S. Applied Logistic Regression Analysis. Quantitative Applications in the Social Sciences, Sage University Series.
 Lemeshow, S.; Teres, D.; Avrunin, J.S. and Pastides, H. (1988). Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83.
 Hosmer, D.W.; Jovanovic, B. and Lemeshow, S. (1989). Best Subsets Logistic Regression. Biometrics 45.
 Pregibon, D. (1981). Logistic Regression Diagnostics. The Annals of Statistics 9(4).

31 Questions?