Unit 5b: The Logistic Regression Approach to Life Table Analysis
© Andrew Ho, Harvard Graduate School of Education. Unit 5b – Slide 1


Unit 5b – Slide 2: Course Roadmap: Unit 5b

Topics for this unit:
• Replicating Life Table Analysis with Logistic Regression.
• Interpreting coefficients using the noconstant option.
• Fitting the Hazard Function with polynomial regression.

Roadmap, starting from Multiple Regression Analysis (MRA):
• Do your residuals meet the required assumptions? Test for residual normality. Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis. Use individual growth modeling. Specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis… (Today's Topic Area)
• If your outcome is categorical, you need to use binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Cluster Analysis; use Factor Analysis (EFA or CFA?).
• If your outcome vs. predictor relationship is non-linear: use non-linear regression analysis, or transform the outcome or predictor.

Unit 5b – Slide 3: The Person-Period Dataset

Person-Period Dataset (columns: ID, PERIOD, EVENT, etc.)

In a person-period dataset:
• Each person has one row of data in each time-period.
• Their data record continues until, and includes, the time-period in which they experience the event of interest, or are censored:
  - A person cannot be present in a time-period unless they had a value of 0 for EVENT in the previous period.
  - In other words, they must have survived the prior period.
• So, the person-period dataset has been formatted to permit each person to be present in a particular time period only if they are a legitimate member of the risk set in that period.

Notice how, in the person-period dataset, outcome EVENT has been encoded to embody the same conditionality present in the definition of the hazard probability …

In our earlier life-table analysis in the person-period dataset:
• EVENT recorded whether the teacher experienced the event of interest (quitting teaching) in each time PERIOD.
• Conceptually, in these analyses, EVENT served as a (dichotomous) outcome and PERIOD served as a predictor.

So, why not replace life-table analysis by the logistic-regression analysis of EVENT on PERIOD in the person-period dataset?
• From a technical perspective, this turns out to be exactly the right thing to do.
• It's then called Discrete-Time Survival Analysis.

Unit 5b – Slide 4: Loading in the dataset

Here's the Stata code that kicks off Data-Analytic Handout, Unit5b.do, in which I conduct the suggested logistic regression analyses of EVENT. In Unit5a.do, recall that I provided code that allows you to convert the person-level dataset to the person-period dataset.

    *
    * Input the person-period dataset, name and label the variables in the dataset.
    * Note that this is a different input dataset -- in person-period format, rather
    * than the prior person-level format -- than the one that was used in the previous
    * data-analytic handout on life-table analysis, in Unit5a.do
    *
    * Input the person-period dataset:
    infile ID PERIOD EVENT P1-P12 ///
        using "C:\Users\Andrew Ho\Documents\Dropbox\S-052\Raw Data\SPEC_ED_PP.txt"

    * Label the variables:
    label variable ID "Teacher Identification Code"
    label variable PERIOD "Current Time Period"
    label variable EVENT "Did Teacher Quit in this Time Period?"

    * Label the values of important categorical variables:
    * Dichotomous event occurrence variable EVENT:
    label define eventlbl 0 "No Quit" 1 "Quit"
    label values EVENT eventlbl

    *
    * Inspect the structure of the new person-period dataset.
    * Notice that there is one row per discrete time-period for each person.
    *
    list ID EVENT PERIOD P1-P12 in 1/40

Here I list the values of EVENT and P1 thru P12 for the few cases we inspected on the previous slide. Here are the time-period indicators, P1 through P12, that were present in the person-period dataset, but were input and ignored up to this point.

Unit 5b – Slide 5: Calculating Hazard Probabilities in Person-Period Datasets

    tabulate EVENT PERIOD, column

This calculates what we see in the table above. count(ID) gives us our Total in each PERIOD, sum(EVENT) gives us the number who Quit by PERIOD, and NEVENT/NPERIOD gives us our Hazard Probabilities by PERIOD.

Unit 5b – Slide 6: Calculating Survival Probabilities in Person-Period Datasets

Using preserve and, at the end, restore allows us to mess with our dataset and get it back at the end.

Table: our collapsed dataset, with HAZARDP (collapsed) and SURVIVEP (calculated).
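The collapse-and-calculate steps described on Slides 5 and 6 might look roughly like the sketch below. It assumes the person-period dataset from Slide 4 is already in memory. The names NPERIOD, NEVENT, HAZARDP, and SURVIVEP come from the slides, but the exact commands in Unit5b.do are not shown in this transcript, so treat this as one plausible version rather than the handout's own code.

```stata
* A sketch only: assumes ID, PERIOD, and EVENT (coded 0/1) are in memory.
preserve
    * One row per PERIOD: size of the risk set and number of events
    collapse (count) NPERIOD = ID (sum) NEVENT = EVENT, by(PERIOD)

    * Hazard probability: events divided by the number at risk
    generate HAZARDP = NEVENT / NPERIOD

    * Survival probability: running product of (1 - hazard) across periods
    sort PERIOD
    generate SURVIVEP = 1 - HAZARDP in 1
    replace SURVIVEP = SURVIVEP[_n-1] * (1 - HAZARDP) in 2/L

    list PERIOD NPERIOD NEVENT HAZARDP SURVIVEP, noobs
restore
```

Because everything sits between preserve and restore, the full person-period dataset is back in memory afterwards, ready for the logistic regressions that follow.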

Unit 5b – Slide 7: The “Discrete” of DTSA: The Person-Period Dummy Variables

Col  Var     Variable Description                                         Labels
1    ID      Teacher identification code.                                 Integer
2    PERIOD  Indicates discrete time period to which record refers.       Integer
3    EVENT   Dummy variable indicating whether the teacher experienced    0 = no; 1 = yes
             the event of interest in this period.
4    P1      Is this the first year of the teaching career?               0 = no; 1 = yes
5    P2      Is this the second year of the teaching career?              0 = no; 1 = yes
6    P3      Is this the third year of the teaching career?               0 = no; 1 = yes
7    P4      Is this the fourth year of the teaching career?              0 = no; 1 = yes
8    P5      Is this the fifth year of the teaching career?               0 = no; 1 = yes
9    P6      Is this the sixth year of the teaching career?               0 = no; 1 = yes
10   P7      Is this the seventh year of the teaching career?             0 = no; 1 = yes
11   P8      Is this the eighth year of the teaching career?              0 = no; 1 = yes
12   P9      Is this the ninth year of the teaching career?               0 = no; 1 = yes
13   P10     Is this the tenth year of the teaching career?               0 = no; 1 = yes
14   P11     Is this the eleventh year of the teaching career?            0 = no; 1 = yes
15   P12     Is this the twelfth year of the teaching career?             0 = no; 1 = yes

To conduct logistic regression analyses in the person-period dataset, we must think about how we represent time PERIOD in our models. Recall that the dataset contains a vector of predictors that we have not yet used …

“General Specification of PERIOD”
• Dichotomous predictors, P1 thru P12, are defined to distinguish among the discrete time periods.
• For each person in each period, each of the time period indicators, P1 thru P12, is set to 1 in the corresponding period, and 0 in other periods.

Representing PERIOD by this “vector of dummies” in our logistic regression analysis provides the most general specification possible for any potential relationship between EVENT and PERIOD.
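If the indicators were not already supplied, they could be built from PERIOD. The sketch below is illustrative only: the stub D and the variable NDUMMIES are hypothetical names chosen to avoid clashing with the existing P1-P12, and it assumes PERIOD takes the integer values 1 through 12 in the data.

```stata
* A sketch only: assumes the person-period dataset is in memory.
* D1-D12 are hypothetical copies of the supplied indicators P1-P12.
tabulate PERIOD, generate(D)

* Each row should have exactly one indicator switched on.
egen NDUMMIES = rowtotal(D1-D12)
assert NDUMMIES == 1

* Spot-check that the generated indicators agree with the supplied ones.
assert D1 == P1 & D12 == P12

drop D1-D12 NDUMMIES
```

The i.PERIOD factor-variable notation used later in the deck does the same job on the fly, without adding variables to the dataset.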

Unit 5b – Slide 8: The “Discrete” of DTSA: Person-Period Dummies as Time Period Indicators

Here are the values of the time-period indicators for a few teachers from the person-period dataset … (columns listed: ID, EVENT, PERIOD, P1 through P12; only the ID and EVENT values survive in this transcript)

Obs       ID   EVENT
1.        1    Quit
2.        2    No Quit
3.        2    Quit
4.        3    Quit
5.        4    Quit
6.-17.    5    No Quit (12 rows)
18.       6    Quit
…
39.       12   No Quit
40.       12   No Quit

Here are the original 12 years of data on the time periods in which Teacher #5 was present in the person-period dataset.

The time-period indicators, P1 - P12, identify each time-period in a very general way:
• In the 1st time period: P1 = 1, P2 thru P12 = 0.
• In the 2nd time period: P2 = 1, P1 & P3 thru P12 = 0.
• …
• In the 12th time period: P12 = 1, P1 thru P11 = 0.

Unit 5b – Slide 9: The “Discrete” of DTSA: Person-Period Dummies as Time Period Indicators

    tabulate EVENT PERIOD, column

Figure: Hazard Function.

You might notice that the Hazard Function shows the conditional means of the dichotomous variable, EVENT, on the predictor variable, PERIOD.

If we wanted to model these means, and test the null hypothesis that all means are equal, how might we do it? In the population, are hazard probabilities different across years of teaching?

Unit 5b – Slide 10: A Model for each of the Means

    tabulate EVENT PERIOD, column

Figure: Hazard Function.

We could fit this model with the dummy variables that we have:

    regress EVENT P1-P12
    // OR
    // The "i." notation auto-creates dummy variables
    regress EVENT i.PERIOD

There are two problems with this statistical model as written. What are they?

Unit 5b – Slide 11: A Model for each of the Logits?

• A model for the log-odds (logits) of teachers exiting the system for the first time, given “survival” through a given number of years of teaching.
• We might think of PERIOD as a continuous variable, but let's start by trying to reproduce the Hazard Probabilities at each discrete period, in the same way that we would estimate probabilities for a large number of racial/ethnic groups or polychotomies/categories.
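The slide's own equation is not preserved in this transcript. One common way to write a model of this kind, using symbols that are assumptions here (h(t_j) for the hazard probability in period j, and alpha_1 through alpha_12 for the period-specific logits), is:

```latex
h(t_j) = \Pr(\mathrm{EVENT} = 1 \text{ in period } j \mid \text{no event before period } j)

\operatorname{logit} h(t_j)
  = \ln\!\left(\frac{h(t_j)}{1 - h(t_j)}\right)
  = \alpha_1 P_1 + \alpha_2 P_2 + \cdots + \alpha_{12} P_{12}
```

Since exactly one indicator equals 1 in any row, the right-hand side reduces to the single coefficient for that period, so each alpha_j is the logit hazard in period j.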

Unit 5b – Slide 12: The Discrete-Time Hazard Model: Reproducing Life Tables

Hazard Function (rows on the slide: Percentage, Odds, Log-Odds; only the percentages survive here)

Period      P1      P2      P3      P4      P5     P6     P7     P8     P9     P10    P11    P12
Percentage  11.57%  11.02%  11.58%  10.76%  8.91%  8.25%  6.01%  4.81%  4.22%  3.69%  2.47%  1.28%
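As a worked illustration of how the Odds and Log-Odds rows relate to the Percentage row, here is the conversion for the first period, using the 11.57% from the table. The arithmetic is recomputed here rather than copied from the slide, so the rounding may differ slightly from the original.

```latex
\text{odds}_1 = \frac{0.1157}{1 - 0.1157} = \frac{0.1157}{0.8843} \approx 0.131

\text{log-odds}_1 = \ln(0.131) \approx -2.03

\text{and back again: } \frac{1}{1 + e^{2.03}} \approx 0.116
```

The same three steps apply to every other period in the table.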

Unit 5b – Slide 13: A No-Constant (Zero-Constant) Model

The Hazard Function table (Percentage, Odds, and Log-Odds for P1 through P12) repeats the percentages shown on Slide 12.
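A no-constant fit of this kind might be run as sketched below. The slide's own command and output are not preserved, so this assumes the person-period dataset is in memory and uses only the variable names already introduced (EVENT, P1-P12).

```stata
* A sketch only: with no constant, each coefficient is the fitted logit hazard
* for its own period rather than a contrast with a reference period.
logit EVENT P1-P12, noconstant

* Convert selected coefficients back to hazard probabilities.
display "Fitted hazard, period 1:  " invlogit(_b[P1])
display "Fitted hazard, period 12: " invlogit(_b[P12])

* Equivalent fit with a constant: period 1 becomes the reference category,
* and testparm jointly tests whether the hazards differ across periods.
logit EVENT P2-P12
testparm P2-P12
```

Because the model has one parameter per period, the fitted hazards should match the life-table percentages, which is the sense in which the logistic model reproduces the life table.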

Unit 5b – Slide 14: How Logistic Models Replicate (and Extend?) Life Table Analyses

• Logistic Regression provides us a statistical model for Hazard Probabilities and allows us to ask questions about differences in Hazard Probabilities in the population.
• Does the probability of exit really decline over time in the population (conditional on survival to that point)?
• Now, we can extend this analysis by adding predictors (What about certified teachers? Age? The year that they started?).
• And, instead of modeling the logit at each PERIOD, we can use a more parsimonious model for the trajectory of Hazard Probabilities over time.

Unit 5b – Slide 15: Instead of logit EVENT P2-P12, why not logit EVENT PERIOD?

• What is the estimated change in the Hazard Probability (in logits) per unit PERIOD?
• Is this change different from 0 in the population?

Preparing for some polynomial regression: linear, quadratic, and cubic fits to the Hazard function.
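The linear, quadratic, and cubic fits could be set up along the following lines. PERIOD2, PERIOD3, and the HAZ_* variables are hypothetical names introduced for this sketch; the handout's own variable names are not shown in the transcript.

```stata
* A sketch only: assumes the person-period dataset is in memory.
generate PERIOD2 = PERIOD^2
generate PERIOD3 = PERIOD^3

* Linear, quadratic, and cubic models for the logit hazard
logit EVENT PERIOD
predict HAZ_LINEAR, pr
logit EVENT PERIOD PERIOD2
predict HAZ_QUAD, pr
logit EVENT PERIOD PERIOD2 PERIOD3
predict HAZ_CUBIC, pr

* Observed hazard in each period, for comparison with the fitted curves
egen HAZ_OBS = mean(EVENT), by(PERIOD)
sort PERIOD
twoway (scatter HAZ_OBS PERIOD) (line HAZ_LINEAR HAZ_QUAD HAZ_CUBIC PERIOD)
```

Plotting the fitted probabilities against the observed per-period proportions is one way to judge whether a lower-order polynomial is an adequate stand-in for the twelve separate dummies.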

Unit 5b – Slide 16: A linear model for the logits

When PERIOD = 0, the estimated logit of exiting the system is the fitted intercept. Remember your logit scale: this corresponds to a fitted probability of 14.7%.

This is a linear model. Why are the fitted probabilities clearly curvilinear? And does this seem like a good fit to you?
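One way to see where the curvature comes from is to write out the conversion from logits to probabilities. The symbols b_0 and b_1 are assumptions standing in for the fitted intercept and slope, whose values are not preserved in this transcript. The intercept below is back-calculated from the 14.7% quoted on the slide, so it is approximate.

```latex
\hat{h}(\mathrm{PERIOD}) = \frac{1}{1 + e^{-(b_0 + b_1\,\mathrm{PERIOD})}}

\text{At } \mathrm{PERIOD} = 0:\quad
\hat{h} = 0.147
\;\Longrightarrow\;
b_0 = \ln\!\left(\frac{0.147}{1 - 0.147}\right) \approx -1.76
```

The model is a straight line only on the logit scale. Passing that line through the inverse-logit transformation bends it, which is why the fitted probabilities trace a curve.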

Unit 5b – Slide 17: Quadratic Fit

When PERIOD = 0, the estimated logit of exiting the system is the fitted intercept. Remember your logit scale: this corresponds to a fitted probability of 11.3%.

Remember that coefficients from polynomial regression equations are, like coefficients from all interactions, difficult to interpret on their own. So we graph.

Is this a quadratic function? Does this seem like a better fit to you?

Unit 5b – Slide 18: Cubic Fit

When PERIOD = 0, the estimated logit of exiting the system is the fitted intercept. Remember your logit scale: this corresponds to a fitted probability of 10.5%.

Remember that coefficients from polynomial regression equations are, like coefficients from all interactions, difficult to interpret on their own. So we graph.

Is this a cubic function? Does this seem like a better fit to you?