Unit 5a: Survival Analysis: Questions about Whether and When
© Andrew Ho, Harvard Graduate School of Education


1 Unit 5a: Survival Analysis: Questions about Whether and When
http://xkcd.com/931/

2 Course Roadmap: Unit 5a (Today’s Topic Area)

Topics in this unit:
• Research questions addressed by survival analysis: Whether + When
• Contrasting 2 data formats: Person vs. Person-Period
• Life Table Analysis: Hazard Probability vs. Survival Probability

Multiple Regression Analysis (MRA):
• Do your residuals meet the required assumptions? Test for residual normality. Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis: use individual growth modeling; specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Cluster Analysis; use Factor Analysis (EFA or CFA?).
• If your outcome vs. predictor relationship is non-linear: use non-linear regression analysis; transform the outcome or predictor.

3 Research questions addressed by Survival Analysis

The “Whether and When” Test: You need survival analysis if your research questions ask “Whether” and “When” a critical event occurs.

Time-to-Relapse Among Treated Alcoholics. Cooney, et al. (1991).
Research Questions:
• Whether, and if so when, do rehabilitated alcoholics relapse to drinking?
• Which treatment regimens are more effective in preventing relapse?
Design: 89 post-treatment alcoholics, randomized to either a “coping skills” or an “interaction skills” follow-up treatment. Prospective data collection for 2 years. During follow-up, 57 patients relapsed to alcoholism, 28 remained abstinent, and 4 disappeared after remaining abstinent for a short time.
Time-to-relapse depended on:
• Type of follow-up program.
• Psychopathology of the patient.

Age at 1st Suicide Ideation For Adolescents. Bolger, et al. (1989).
Research Questions:
• Whether, and if so when, does an adolescent 1st consider suicide?
• Does occurrence of suicide ideation differ by gender and developmental phase?
Design: 391 undergraduates, aged 16 through 22. Retrospective data collection, through current age. At interview, 275 respondents had considered suicide, 116 had not.
Time-to-first-suicide-ideation:
• Greatest risk in middle adolescence.
• Higher among females.
• Higher in adolescents w/ absent parents.
• Race by Age interaction.

4 Analytic Approaches to Survival Analysis

Classical Methods of Survival Analysis (Today)
• Simple data-analytic approaches for summarizing survival data appropriately: estimation of the sample hazard function; estimation of the sample survivor function; estimation of the median lifetime.
• Simple tests of differences in survivor functions by “group”: the survival-analytic equivalent of the t-test.

Discrete-Time Survival Analysis (Next 2-3 class meetings)
• Replicates classical methods of survival analysis, using logistic regression analysis.
• Extends classical survival analytic methods by making a regression format available: can include multiple predictors, including interactions; provides single-parameter and GLH testing, using the –2LL statistic; fitted hazard functions, survivor functions & median lifetimes can be recovered from the fitted logistic regression model.

Continuous-Time Survival Analysis (Time Permitting)
• Replaces discrete-time survival analysis when time has been measured continuously.
• Imposes additional assumptions on the data.
• Extends classical survival analytic methods by making a regression format available: can include predictors, including interactions; has its own testing procedures, based on standard practices; fitted hazard functions, survivor functions & median lifetimes are easily recovered from the fitted models.

5 The SPEC_ED Dataset

Dataset: SPEC_ED.txt
Overview: Discrete-time person-level dataset on the career duration of special education teachers who began their teaching careers in the Michigan public schools between 1972 and 1978, and who were followed uninterruptedly until 1985.
Source: State Department of Education, Michigan.
Sample size: 3941 teachers.
More Info: Singer & Willett, 2003

Note on labeling of discrete-time “bins.” We regarded a teacher’s physical first year as their zeroth year, a year in which they must have taught in order to be a part of the study. If they quit sometime during the following year, they were classified as having taught for one year and having quit in “bin one.”

Important Distinction You Must Keep In Mind: The two “modern” approaches to survival analysis are distinct in the way that duration must be measured:
• In Discrete-time Survival Analysis, time is measured in discrete units, such as semesters, years, etc.
• In Continuous-time Survival Analysis, time can be measured to any level of precision.

Research Question: Whether, and if so when, do special education teachers in Michigan leave the teaching profession for the first time?

“Multiple Cohort” Sample Design: Multiple annual cohorts of special education teachers are pooled together in the sample:
• Cohorts entered the sample sequentially between the 1972/3 and 1978/9 school years.
• All cohorts were followed until the end of the 1984/5 school year (i.e., in June 1985).

72 |--|--|--|--|--|--|--|--|--|--|--|--|--|85
73    |--|--|--|--|--|--|--|--|--|--|--|--|85
74       |--|--|--|--|--|--|--|--|--|--|--|85
75          |--|--|--|--|--|--|--|--|--|--|85
76             |--|--|--|--|--|--|--|--|--|85
77                |--|--|--|--|--|--|--|--|85
78                   |--|--|--|--|--|--|--|85

6 Dataset variables and the issue of “Censoring”

The dataset is straightforward, containing Teacher IDs and length of service, with one small hitch …

Structure of Dataset
Col # | Var Name | Variable Description | Variable Metric/Labels
1 | ID | Teacher identification code. | Integer
2 | YRSTCH | # of years the teacher remained in teaching, to first quit, or until the teacher was censored in 1985 by the end of the study. | Integer
3 | CENSOR | Dummy variable to indicate whether a teacher’s career was censored by the end of data collection in 1985. | Dichotomous variable: 0 = not censored, 1 = censored.

There is a problem intrinsic to survival data, and it is illustrated here:
• The event of interest is “quitting teaching for the first time.”
• But, not every teacher experiences this event while being observed by researchers.
• We say that these teachers are “censored” by the end of the data collection.
• We call this “right censoring” because the YRSTCH range is cut off on the right (positive) side. The actual observation (if we had waited) would be higher.

Why Is Censoring A Problem For Data Analysis? Because if censoring occurs, we don’t know the time-to-event for the people in the sample who may have the longest times-to-event.

Key Idea: The presence of the censored cases is telling you something about the probability that the time-to-event is longer than the period of observation. If you want an unbiased estimate of time-to-event, you cannot ignore the censored cases, but must find a way to include them in the analysis so that they can contribute whatever information they contain.

7 The “Person-Level” Dataset

Bearing this in mind, let’s explore the special educator data in the Stata do-file, Unit5a.do …

Standard data-input and labeling statements:

*---------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and selected values.
*---------------------------------------------------------------------------
* Input the target dataset:
  infile ID YRSTCH CENSOR ///
      using "C:\My Documents\My Course Stuff\S052\Data\Datasets\SPEC_ED.txt"
* Label the variables:
  label variable ID "Teacher Identification Code"
  label variable YRSTCH "Number of Years in Teaching"
  label variable CENSOR "Was Teaching Career Censored?"
* Label the values of important categorical variables:
* Dichotomous censoring variable CENSOR:
  label define censorlbl 0 "Not Censored" 1 "Censored"
  label values CENSOR censorlbl
*----------------------------------------------------------------------------
* Examining the data, for the first 40 cases.
*----------------------------------------------------------------------------
  list ID YRSTCH CENSOR in 1/40, clean

Print out the data on the first 40 teachers in the dataset for inspection …

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
 15. | 16       12       Censored |
     |----------------------------|
 16. | 17        2   Not Censored |
 17. | 18       12       Censored |
 18. | 19        1   Not Censored |
 19. | 20        3   Not Censored |
  …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+

8 The “Person-Level” dataset encourages dangerous analyses…

Here’s the data listing (with cases omitted to save space) …

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
 15. | 16       12       Censored |
     |----------------------------|
 16. | 17        2   Not Censored |
 17. | 18       12       Censored |
 18. | 19        1   Not Censored |
 19. | 20        3   Not Censored |
  …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+

A dataset formatted in this way is known as a person-level dataset, because it contains one row of event history data per teacher.

Teacher #2 was in the dataset for 2 years and was not censored. S/he experienced the event of interest in the second year. That is, s/he quit teaching for the first time sometime during the second year.

Teacher #5 was in the dataset for 12 years and was censored. S/he outlasted the data collection. S/he taught for at least 12 years, and possibly more.

We tend to be drawn to dangerous analyses with this dataset structure!

9 Comparing Uncensored and Censored Cases

One sensible thing you can do in such datasets is display the frequency with which each career length occurs, in a vertical histogram that includes all the teachers in the sample, both censored and uncensored. Note the impact of the multi-cohort research design: any teacher who began teaching after 1978 and taught longer than six years is a censored case.

[Figure: histogram of career length, with bars for Uncensored and Censored teachers]

10 Bias imparted when ignoring censoring

Here are two misleading, but common, strategies for trying to summarize teachers’ career length while trying to deal with censoring …

First Misleading Approach: If you take the average of the career lengths of only the uncensored teachers, their sample mean teaching career is 3.73 years, a negatively biased estimate of the average population teaching career length.

Second Misleading Approach: If you set the career lengths of the censored teachers to their longest observed career length, then the sample mean teaching career length is 6.31 years. This too is a negatively biased estimate of population career length, even if only one teacher has lasted longer than the censored duration.

[Figure: histogram of career length, with bars for Uncensored and Censored teachers]
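
The two misleading summaries can each be reproduced with one command. This is a minimal Stata sketch, not taken from Unit5a.do. It assumes the person-level SPEC_ED data are already in memory with CENSOR coded 0/1 as in the codebook. It also assumes that the “longest observed career length” in the second approach is simply each censored teacher’s recorded YRSTCH. If that reading is right, the two means should match the 3.73 and 6.31 years quoted above.

```stata
* First misleading approach: average only the uncensored careers.
* This drops exactly the teachers most likely to have long careers.
summarize YRSTCH if CENSOR == 0

* Second misleading approach: treat each censored teacher's recorded
* YRSTCH as if it were a completed career, and average everyone.
* Censored careers are at least this long, so the mean is still too low.
summarize YRSTCH
```

Both means understate the typical career because neither uses the information that censored teachers were still teaching when observation ended.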

11 The Person-Period Dataset

Dataset: SPEC_ED_PP.txt
Overview: Person-period dataset containing the same information as the SPEC_ED.txt person dataset, on the career duration of special education teachers who began their teaching careers in the Michigan public schools between 1972 and 1978, and who were followed uninterruptedly until 1985.
Source: State Department of Education, Michigan.
Sample size: 24875 annual person-period records.
More Info: Singer & Willett, 2003

Notice that the name of the dataset is different.

Here’s a clue to the difference between the person-level and the person-period dataset: there is a row for every person-period combination in the data.

To convert from one to the other, use the dthaz library. Type “net install dthaz.pkg” or type “findit prsnperd”. The library was created by a former Ph.D. student at our School of Public Health, Alexis Dinno (now an Assistant Professor at Portland State).
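
For readers who want to see what the conversion does, here is a hand-rolled sketch using only built-in Stata commands. It is not the dthaz/prsnperd implementation, whose syntax is not shown in these slides. It assumes the person-level data (ID, YRSTCH, CENSOR) are in memory, that YRSTCH is at least 1 for every teacher, and that CENSOR is numeric 0/1.

```stata
* Start from the person-level data: one row per teacher.

* Make YRSTCH copies of each teacher's row, one per observed period.
expand YRSTCH

* Number the copies 1, 2, ..., YRSTCH within each teacher.
bysort ID: generate PERIOD = _n

* EVENT is 1 only in the final period of an uncensored teacher.
* Censored teachers get EVENT = 0 in every period.
generate EVENT = (PERIOD == YRSTCH) & (CENSOR == 0)

* Inspect the first rows and count the person-period records.
list ID PERIOD EVENT in 1/20, clean
count
```

Under these assumptions, a teacher with YRSTCH = 2 and CENSOR = 0 contributes two rows (EVENT = 0, then EVENT = 1), and a censored teacher with YRSTCH = 12 contributes twelve rows with EVENT = 0 throughout. If the conversion matches the course dataset, `count` should report the 24875 records given in the codebook.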

12 The Person-Period Data Structure

In a person-period dataset, each person has one row of data for each discrete time-period, each containing …

Col | Var | Variable Description | Labels
1 | ID | Teacher identification code. | Integer
2 | PERIOD | Records the discrete time period to which each record refers. | Integer
3 | EVENT | Dummy variable indicating whether the teacher experienced the event of interest in this period. | 0 = no; 1 = yes
4 | P1 | |
5 | P2 | |
6 | P3 | |
Etc.

The earlier YRSTCH variable, which recorded the duration of the teaching career in the person-level dataset, has been replaced by the variable PERIOD, which labels the time-period to which each row of the person-period dataset refers.

We’ve also acquired a new variable called EVENT, which records whether a teacher experienced the event of interest (“Quit Teaching For The 1st Time”) in the particular discrete time-period in question.

The person-period dataset contains other variables too (P1, P2, P3, etc.), which are labeled and explained in the remaining rows of the codebook. We ignore them here, but will return to them later during the presentation on discrete-time survival analysis.

13 Reading in the Person-Period Dataset

In Unit5a.do, I input the special educator person-period dataset and list the data, including estimation of a life table …

*-----------------------------------------------------------------------------
* Input the person-period dataset
*-----------------------------------------------------------------------------
* Input the dataset:
  infile ID PERIOD EVENT P1-P12 ///
      using "C:\My Documents\My Course Stuff\S052\Data\Datasets\SPEC_ED_PP.txt"
* Label the variables:
  label variable ID "Teacher Identification Code"
  label variable PERIOD "Current Time Period"
  label variable EVENT "Did Teacher Quit in this Time Period"
* Label the values of important categorical variables:
* Dichotomous event occurrence variable EVENT:
  label define eventlbl 0 "No Quit" 1 "Quit"
  label values EVENT eventlbl
*------------------------------------------------------------------------------
* Inspect the structure of the new person-period dataset.
* Notice that there is one row per discrete time-period for each person.
*-----------------------------------------------------------------------------
  list ID PERIOD EVENT in 1/40
*------------------------------------------------------------------------------
* Carry out the life-table analysis, by classical contingency table analysis.
*------------------------------------------------------------------------------
  tabulate EVENT PERIOD, column

Notes on the code:
• Standard data input statements, reading in the ID, PERIOD and EVENT variables and the mystery variables, P1 through P12, that we will return to later during our discrete-time survival-analysis presentation.
• Print out the first 40 cases for inspection.
• Carry out a Life Table Analysis: tabulate the frequencies of EVENT by PERIOD. Leave out the row and total percentages, but keep the column percentages, which are computed within each PERIOD.

14 Person-Level vs. Person-Period Datasets

Person-Level Dataset

     +----------------------------+
     | ID   YRSTCH         CENSOR |
     |----------------------------|
  1. |  1        1   Not Censored |
  2. |  2        2   Not Censored |
  3. |  3        1   Not Censored |
  4. |  4        1   Not Censored |
  5. |  5       12       Censored |
     |----------------------------|
  6. |  6        1   Not Censored |
  7. |  7       12       Censored |
  8. |  8        1   Not Censored |
  9. |  9        2   Not Censored |
 10. | 10        2   Not Censored |
     |----------------------------|
 11. | 12        7   Not Censored |
 12. | 13       12       Censored |
 13. | 14        1   Not Censored |
 14. | 15       12       Censored |
  …
 37. | 38        1   Not Censored |
 38. | 39        3   Not Censored |
 39. | 40       12       Censored |
 40. | 41        6   Not Censored |
     +----------------------------+

Person-Period Dataset

     +-----------------------+
     | ID   PERIOD     EVENT |
     |-----------------------|
  1. |  1        1      Quit |
  2. |  2        1   No Quit |
  3. |  2        2      Quit |
  4. |  3        1      Quit |
  5. |  4        1      Quit |
     |-----------------------|
  6. |  5        1   No Quit |
  7. |  5        2   No Quit |
  8. |  5        3   No Quit |
  9. |  5        4   No Quit |
 10. |  5        5   No Quit |
     |-----------------------|
 11. |  5        6   No Quit |
 12. |  5        7   No Quit |
 13. |  5        8   No Quit |
 14. |  5        9   No Quit |
 15. |  5       10   No Quit |
     |-----------------------|
 16. |  5       11   No Quit |
 17. |  5       12   No Quit |
 18. |  6        1      Quit |
 19. |  7        1   No Quit |
 20. |  7        2   No Quit |
     |-----------------------|
 21. |  7        3   No Quit |
 22. |  7        4   No Quit |
 23. |  7        5   No Quit |
 24. |  7        6   No Quit |
 25. |  7        7   No Quit |
     |-----------------------|
 26. |  7        8   No Quit |
 27. |  7        9   No Quit |
  …

In a person-period dataset:
• Each person contributes one row of data for each time-period.
• The data record continues until the time-period in which they either experience the event of interest, or they are censored.

Teacher #2 is not censored, and so s/he experiences the event of interest (i.e., quits teaching for the first time) in the 2nd year.

Teacher #5 is censored: s/he never experiences the event of interest (i.e., doesn’t quit teaching for the first time) in all the 12 years during which teachers are observed.

15 Life Tables: At Each Time Point, for People Who Survived

Here’s the Life Table. It’s a Two-Way Contingency Table Analysis of EVENT by PERIOD …

Use frequencies to estimate a hazard probability describing “risk of quitting teaching for the 1st time” in each time-period, given that the teacher survived earlier periods. Hazard probability is the (conditional) probability that a teacher will experience the event of interest (i.e., quit teaching for the first time) in a particular time-period, given that s/he has “survived” up until this period.

In Discrete Time-Period #1, for instance:
• There are 3941 teachers “at risk of quitting for the first time.”
• Of this “risk set,” 456 were observed to quit for the first time.
• Hence, the probability that a teacher will quit for the first time in this period (given that she entered it) is (456/3941), or 0.1157.
• So, the sample hazard probability in Discrete Time-Period #1 is 11.57%.
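
The definition and the period-1 calculation can be written as one formula. The symbol $\hat{h}(t_j)$ for the sample hazard in period $j$ is an assumption made here for illustration. The slides themselves do not fix a notation.

```latex
\hat{h}(t_j) \;=\; \frac{\text{number who quit in period } j}{\text{number at risk (the risk set) in period } j},
\qquad
\hat{h}(t_1) \;=\; \frac{456}{3941} \;\approx\; 0.1157 .
```

The denominator is the risk set for that period, not the full sample, which is what makes the hazard a conditional probability.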

16 Hazard Probability: For each Time Point, the Probability of “Fail”

Here’s the sample hazard probability for discrete time-period #2 …

Sample hazard probability (or “risk”) in discrete time-period #2 is:
• 3485 teachers survive from time-period #1 and enter the risk set for time-period #2.
• Of these, 384 quit for the first time.
• Hence, the risk that a teacher will quit for the first time in time-period #2, given that she survived to that point, is (384/3485), or 0.1102.
• So, the sample hazard probability in discrete time-period #2 is 11.02%.

How did we get that number (3485)?
• Note that the survivors at the target time point are the survivors from the previous time point minus the “quitters.” For now…

17 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 17 Here’s the sample hazard probability for discrete time-period #3 … Sample hazard probability (or “risk”) in discrete time-period #3: 3101 teachers survive from time-period #2 and enter the risk set for time-period #3. Of these, 359 quit for the first time. Hence, the risk that a teacher will quit for the first time in time-period #3, given that she survived to that point, is 359/3101, or 0.1158. So, the sample hazard probability in discrete time-period #3 is 11.58%. How did we get the risk-set size of 3101? The survivors at the target time point are still the survivors from the previous time point minus the “quitters” (3485 - 384 = 3101). For now… Hazard Probability: For each Time Point, the Probability of “Fail”
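
The three hazard calculations above all follow the same recipe: divide the number of first-time quitters by the size of the risk set. A minimal Python sketch of that arithmetic follows. The counts are the ones quoted on the slides; the function and variable names are illustrative assumptions, not part of the original course materials.

```python
# Risk-set sizes and first-time quits for discrete time-periods 1-3,
# as quoted on the slides.
at_risk = [3941, 3485, 3101]
quit_counts = [456, 384, 359]


def hazard_probabilities(at_risk, events):
    """Return the sample hazard probability (events / risk set) per period."""
    return [e / n for n, e in zip(at_risk, events)]


hazard = hazard_probabilities(at_risk, quit_counts)

for period, (n, e, h) in enumerate(zip(at_risk, quit_counts, hazard), start=1):
    print(f"period {period}: {e}/{n} = {h:.4f}")

# Assuming nobody leaves the sample for any other reason (the "for now"
# on the slides), each period's survivors are its risk set minus its quitters.
survivors = [n - e for n, e in zip(at_risk, quit_counts)]
```

Here `survivors[0]` and `survivors[1]` reproduce the risk sets for periods 2 and 3, which is the "survivors minus quitters" bookkeeping the slides describe.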

18 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 18 Here’s the sample hazard probability for discrete time-period #11 … Hazard Probability: For each Time Point, the Probability of “Fail”

19 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 19 Collect the sample hazard probabilities together and plot them as a sample hazard function … The Hazard Function

20 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 20 Once you have the sample hazard probabilities, you can cumulate them to get sample survival probabilities … Sample Survival Probability: Survival probability in any time period is the probability of “surviving” beyond that period (i.e., the probability of not experiencing the event of interest until after the period). Here, all teachers survived the 0th time period, so the estimated sample survival probability in the 0th period is 1.0000. The estimated hazard probability suggests that a proportion of 0.1157 of teachers in the 1st period risk set will “die” in the 1st period (i.e., quit teaching). Because a proportion of 0.1157 of the risk set will “die” in the 1st period, we know that (1 - 0.1157), or 0.8843, of the 1st period risk set will survive. In other words, 0.8843 of the entering “1.0000” will remain “alive” beyond the 1st time-period (and will therefore be potentially available to quit teaching for the first time at some later period). The sample survival probability in the 1st time period is therefore 0.8843 × 1.0000, or 0.8843. The Survival Probability

21 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 21 And, the estimated survival probability in discrete time period #2 … Here, according to the estimated sample survival probability, a proportion of 0.8843 of the teachers survived the 1st time period. The estimated hazard probability suggests that a proportion of 0.1102 of teachers in the 2nd period risk set will “die” in the 2nd period (i.e., quit teaching for the first time). Because a proportion of 0.1102 of the risk set will “die” in the 2nd period, we know that (1 - 0.1102), or 0.8898, of the 2nd period risk set will survive. In other words, a proportion of 0.8898 of the entering “0.8843” will remain “alive” beyond the 2nd time period (and be potentially available to quit teaching for the first time, later). The sample survival probability in the 2nd time period is therefore 0.8898 × 0.8843, or 0.7869. The Survival Probability

22 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 22 And, the estimated survival probability in discrete time period #3 … etc. Here, according to the estimated sample survival probability, a proportion of 0.7869 of the teachers survived the 2nd time period. The estimated hazard probability suggests that a proportion of 0.1158 of teachers in the 3rd period risk set will “die” in the 3rd period (i.e., quit teaching for the first time). Because a proportion of 0.1158 of the risk set will “die” in the 3rd period, we know that (1 - 0.1158), or 0.8842, of the 3rd period risk set will survive. In other words, a proportion of 0.8842 of the entering “0.7869” will remain “alive” beyond the 3rd time period (and be potentially available to quit teaching for the first time, later). The sample survival probability in the 3rd time period is therefore 0.8842 × 0.7869, or 0.6958. The Survival Probability
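
The period-by-period multiplication on the last three slides can be sketched as a short loop. This is a minimal illustration using the counts quoted on the slides; the function name and the convention of storing the period-0 value at index 0 are assumptions made for the example.

```python
# Sample hazard probabilities for periods 1-3, from the counts on the slides.
hazard = [456 / 3941, 384 / 3485, 359 / 3101]


def survival_probabilities(hazard):
    """Cumulate hazards into survival probabilities.

    The returned list starts with the period-0 value of 1.0 (everyone
    "survives" period 0); entry j is the survival probability for period j.
    """
    survival = [1.0]
    for h in hazard:
        # Survivors of this period = survivors so far x (1 - this period's hazard).
        survival.append(survival[-1] * (1.0 - h))
    return survival


survival = survival_probabilities(hazard)

for period, s in enumerate(survival):
    print(f"period {period}: survival = {s:.4f}")
```

Because the loop carries unrounded values forward, it avoids compounding the rounding in the hand calculation, yet it agrees with the slide values to four decimal places.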

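23 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 23 Thus, as a general principle, the estimated survivor probability in any time period j can be found by substituting into a simple little rule … So, in general, in any time period j, the estimated survival probability equals the estimated survival probability in period j - 1 multiplied by one minus the estimated hazard probability in period j. The Survival Probability – General Equation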
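
The rule can be written compactly as follows. The slide's own equation did not survive in this transcript, so the notation here is an assumption chosen to match the worked examples: $\hat{S}(t_j)$ for the sample survival probability and $\hat{h}(t_j)$ for the sample hazard probability in period $j$.

```latex
\hat{S}(t_j) = \hat{S}(t_{j-1}) \, \bigl[ 1 - \hat{h}(t_j) \bigr],
\qquad \hat{S}(t_0) = 1

% Applying the rule repeatedly gives the equivalent product form:
\hat{S}(t_j) = \prod_{k=1}^{j} \bigl[ 1 - \hat{h}(t_k) \bigr]

% Check against period 2 of the teacher data:
\hat{S}(t_2) = 0.8843 \times (1 - 0.1102) = 0.8843 \times 0.8898 \approx 0.7869
```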

24 © Andrew Ho, Harvard Graduate School of EducationUnit 5a– Slide 24 Plotting the sample survival probabilities against time period provides the sample survivor function. It is a typical monotonically decreasing survivor function … We can also use it to estimate the median survival time, by projecting across from a survival probability of 0.5 to the curve and then down to the Time axis. The Survival Function
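
The graphical "project across from 0.5 and down" step has a simple numeric counterpart: find the two adjacent periods whose survival probabilities straddle 0.5 and interpolate linearly between them. The sketch below is one way to do that, not necessarily the procedure used in the course. The function name is an assumption, and the second example curve is made up purely to show a crossing.

```python
def median_lifetime(survival):
    """Estimate the time at which the survivor function crosses 0.5.

    survival[j] is the survival probability for period j, with
    survival[0] == 1.0. Returns None if the curve never drops to 0.5
    within the observed periods.
    """
    for j in range(1, len(survival)):
        if survival[j] <= 0.5:
            prev, curr = survival[j - 1], survival[j]
            # Linear interpolation between period j-1 and period j.
            return (j - 1) + (prev - 0.5) / (prev - curr)
    return None


# Teacher data through period 3 (from the slides): still above 0.5,
# so no median can be estimated from these three periods alone.
print(median_lifetime([1.0, 0.8843, 0.7869, 0.6958]))

# HYPOTHETICAL curve, not from the teacher data, to show a crossing.
print(median_lifetime([1.0, 0.8, 0.6, 0.4]))
```

In the hypothetical curve the survival probability falls from 0.6 to 0.4 between periods 2 and 3, so the interpolated median lands halfway between them.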

