SC968: Panel Data Methods for Sociologists

SC968: Panel Data Methods for Sociologists Introduction to survival/event history models

Types of outcome: Continuous: OLS linear regression. Binary: binary regression (logistic or probit regression). Time to event data: survival or event history analysis.

Examples of time to event data: Time to death. Time to incidence of disease. Unemployed: time till finding a job. Time to birth of first child. Smokers: time till quitting smoking.

Time to event data: A finite set of discrete states. Units (individuals, firms, households etc.) are each in one state. Transitions occur between states. The outcome is the time until a transition takes place.

4 key concepts for survival analysis: States. Events. Risk period. Duration/time.

States: States are categories of the outcome variable of interest. Each person occupies exactly one state at any moment in time. Examples: alive, dead; single, married, divorced, widowed; never smoker, smoker, ex-smoker. The set of possible states is called the state space.

Events: An event is a transition from one state to another, from an origin state to a destination state. Possible events depend on the state space. Examples: from smoker to ex-smoker; from married to widowed. Not all transitions can be events, e.g. from smoker to never smoker.

Risk period: 2 states: A & B. Event: transition from A → B. To be able to undergo this transition, one must be in state A (someone already in state B cannot make the transition). Not all individuals will be in state A at any given time. Example: one can only experience divorce if married. The period of time that someone is at risk of a particular event is called the risk period. All subjects at risk of an event at a point in time are called the risk set.

Time: Various meanings... Calendar time: ...but onset of risk is usually not simultaneous for all units. Ex: by age 40, some individuals will have smoked for 20+ years, others for 1 year. Duration = time since onset of risk: ...but intensity may not be the same. Ex: one smoker may smoke 5 cigarettes a day, another 20. 1 unit of time is treated as the same for all individuals.

Duration: Event history analysis deals with the duration of the nonoccurrence of an event, i.e. the length of time spent in the risk period before the event occurs. Examples: duration of marriage; length of life. In practice we model the probability of a transition conditional on being in the risk set.

Example data
ID  Entry date  Died        End date
1   01/01/1991              01/01/2008
2   01/01/1991  01/01/2000  01/01/2000
3   01/01/1995              01/01/2005
4   01/01/1994  01/07/2004  01/07/2004

[Figure: Calendar time. Horizontal axis from 1991 to 2009 in 3-year steps; annotation: study follow-up ended.]

Censoring: Ideally we observe an individual from the onset of risk until the event has occurred ...very demanding in terms of data collection (ex: risk of death starts when one is born). Usually we have incomplete data → censoring. An observation is censored if it has incomplete information. Types of censoring: right censoring; left censoring.

Censoring: Right censoring: the person did not experience the event during the time that they were studied. Common reasons for right censoring: the study ends; the person drops out of the study. We do not know when the person experiences the event, but we do know that it is later than a given time T. Left censoring: the person became at risk before we started observing her. If we do not know when the person entered the risk set: EHA cannot deal with this. If we know when the person entered the risk set: condition on the person having survived long enough to enter the study. Censoring must be independent of the survival process!

[Figure: Study time in years. Horizontal axis from 0 to 18 in 3-year steps; spells are labelled as ending in an event or as censored.]

Why a special set of methods? Duration is a continuous variable, so why not OLS? Censoring: if censored cases are excluded → higher probability of throwing out longer durations; if they are treated as complete → mis-measurement of duration. Non-normality of residuals. Time-varying covariates. We are interested in the probability of a transition at any given time rather than in the length of complete spells. We need to take into account simultaneously: whether the event has taken place or not; the length of the period at risk before the event occurred.

Survival function: Length of time (duration) before an event occurs (length of 'spell', T) → probability density function (pdf), f(t): $f(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T \le t + \Delta t)}{\Delta t} = \frac{\partial F(t)}{\partial t}$. Cumulative distribution function (cdf), F(t): $F(t) = \Pr(T \le t) = \int_0^t f(s)\,ds$. Survival function: $S(t) = 1 - F(t)$.

Hazard rate: h(t) = f(t)/S(t). The exact definition & interpretation of h(t) differs according to whether duration is continuous or duration is discrete. Conditional on having survived up to t, what is the probability of leaving between t and t+Δt? It is a measure of risk intensity. h(t) >= 0. In principle h(t) is a rate, not a probability. There is a 1-1 relationship between h(t), f(t), F(t), S(t). EHA analysis: h(t) = g(t, Xs), where g = parametric & semi-parametric specifications.
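
The 1-1 relationship between these four functions can be written out for the continuous-time case. These are standard identities rather than something shown on the slide, and u is a dummy integration variable introduced here.

```latex
\begin{align}
h(t) &= \frac{f(t)}{S(t)} = -\frac{d}{dt}\ln S(t), \\
S(t) &= \exp\left(-\int_0^t h(u)\,du\right), \\
F(t) &= 1 - S(t), \qquad f(t) = h(t)\,S(t).
\end{align}
```

Knowing any one of h(t), f(t), F(t) or S(t) is therefore enough to recover the other three.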

Data: Survival or event history data are characterised by 2 variables. Time or duration of risk period. Failure (event): 1 if not survived or event observed; 0 if censored or event not yet occurred. The data structure differs according to whether duration is discrete or duration is continuous. Assume: 2 states; 1 transition; no repeated events.

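Data structure-Discrete time [Tables: a person-level table with columns ID, Entry, End date, Event, X at t0, X at t1, ... is converted into a person-period table with columns ID, Date, Duration (t), Event, X, with one row per period from 01/01/1991 onwards; the rows dated 01/01/2002 and 01/01/2008 have durations 11 and 17.]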

Data structure-Discrete time: Each row is an individual period. An individual has as many rows as the number of periods he is observed to be at risk. No longer at risk when: experienced the event; no longer under observation (censored). For each period (row) there is a value of the explanatory variable X → very easy to incorporate time-varying covariates. Stata: reshape long
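
The person-period layout can be sketched in a few lines. The course itself uses Stata (reshape long); the Python below is only an illustrative stand-in, and the function name person_period_rows, the yearly period length and the assumption that entry and exit fall on the same calendar day are all choices made here, not taken from the slides.

```python
from datetime import date

def person_period_rows(person_id, entry, exit_, event):
    """Expand one spell into one row per year at risk (discrete time).

    Assumes entry and exit fall on the same calendar day, so the number
    of yearly periods is simply the difference in years.
    """
    n_periods = exit_.year - entry.year
    rows = []
    for t in range(1, n_periods + 1):
        rows.append({
            "id": person_id,
            "t": t,
            # The event indicator is 1 only in the last period, and only
            # if the spell ended in an event rather than censoring.
            "event": 1 if (event and t == n_periods) else 0,
        })
    return rows

# Person 1 is censored after 17 years; person 2 has the event after 11 years.
rows_1 = person_period_rows(1, date(1991, 1, 1), date(2008, 1, 1), event=False)
rows_2 = person_period_rows(2, date(1991, 1, 1), date(2002, 1, 1), event=True)
print(len(rows_1), len(rows_2), rows_2[-1])
```

A time-varying covariate would be added as one more key per row, which is why this layout handles such covariates so easily.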

Data structure-continuous time
ID  Entry       Died        End date    Duration  Event  X
1   01/01/1991              01/01/2008  17.0      0      0
2   01/01/1991  01/01/2002  01/01/2002  11.0      1      0
3   01/01/1995              01/01/2000   5.0      0      0
3   01/01/2000  01/01/2005  01/01/2005   5.0      1      1

Data structure-continuous time: Each row is a person. Indicator for observed events / censored cases. Calculate duration = exit date - entry date. Exit date = failure date or censoring date. If there are time-varying covariates: split the period an individual is under observation by the number of times the time-varying Xs change. If there are many Xs that change often: multiple rows.
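
A minimal sketch of the duration and event calculation, in Python rather than the Stata used in the course. The function name duration_and_event and the 365.25-day year are assumptions made for illustration.

```python
from datetime import date

def duration_and_event(entry, end, died=None):
    """Return (duration in years, event indicator) for one spell."""
    # Exit date is the failure date if there is one, else the censoring date.
    exit_date = died if died is not None else end
    duration = (exit_date - entry).days / 365.25  # assumed year length
    event = 1 if died is not None else 0
    return round(duration, 1), event

# Censored spell: followed from 1991 to the end date in 2008.
print(duration_and_event(date(1991, 1, 1), date(2008, 1, 1)))
# Observed event: entered 1991, died 2002.
print(duration_and_event(date(1991, 1, 1), date(2002, 1, 1), died=date(2002, 1, 1)))
```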

Worked example: Random 20% sample from BHPS Waves 1-15. One record per person/wave. Outcome: duration of cohabitation. Conditions on cohabiting in first wave. Survival time: years from entry to the study in 1991 till year living without a partner.

The data Duration = 6 years Event = 1 Ignore data after event = 1

The data (continued) Note missing waves before event

Preparing the data: Select records for respondents who were cohabiting in 1991. Declare that you want to set the data to survival time. Important to check that you have set data as intended.

Checking the data setup: Stata's stset creates four variables. _st: 1 if observation is to be used and 0 otherwise. _t0: time of entry. _t: time of exit. _d: 1 if event, 0 if censoring or event not yet occurred.

Checking the data setup How do we know when this person separated?

Trying again!

Checking the new data setup Now censored instead of an event

Summarising time to event data: Individuals are followed up for different lengths of time, so we can't use prevalence rates (% people who have an event). Use rates instead that take account of person-years at risk: incidence rate per year; death rate per 1000 person-years.
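
A person-years rate is just events divided by total time at risk. The sketch below is an illustrative Python version, not the Stata output used in the course; the function name incidence_rate is hypothetical, and the four durations and event flags are taken from the continuous-time example table.

```python
def incidence_rate(durations, events, per=1.0):
    """Events per `per` person-years at risk."""
    person_years = sum(durations)
    return per * sum(events) / person_years

durations = [17.0, 11.0, 5.0, 5.0]  # years at risk in each row
events = [0, 1, 0, 1]               # 1 = event observed, 0 = censored

rate_per_year = incidence_rate(durations, events)
rate_per_1000 = incidence_rate(durations, events, per=1000)
print(rate_per_year, rate_per_1000)
```

Here 2 events over 38 person-years give roughly 0.053 events per person-year, or about 53 per 1000 person-years.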

Summarising time to event data: Number of observations. Person-years. <25% of sample had event by 15 elapsed years. Rate per year. stvary: check whether a variable varies within individuals and over time.

Descriptive analysis: To recap... pdf = probability that a spell has a length of exactly T: $f(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T \le t + \Delta t)}{\Delta t} = \frac{\partial F(t)}{\partial t}$. cdf = probability that a spell has a length <= T: $F(t) = \Pr(T \le t) = \int_0^t f(s)\,ds$. Survival function: $S(t) = 1 - F(t)$.

Kaplan-Meier estimates of survival time: The Kaplan-Meier estimator → cumulative probability of an individual surviving to any time t. Analysis can be made by subgroup. Nonparametric method. First period: $S_1 = 1 - d_1/n_1$, where $d_1/n_1$ is the exit rate (exits divided by the number at risk). After t periods: $S_t = (1 - d_1/n_1)(1 - d_2/n_2)\cdots(1 - d_t/n_t)$. The survival function is estimated only at times where you observe exits! The last t at which it can be estimated is the highest non-censored time observed.
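
The product formula above can be sketched directly. This is an illustrative Python version, not the Stata routine used in the course; the function name kaplan_meier and the six-spell data set are hypothetical, and spells censored at time t are treated as still at risk at t.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where an exit is observed."""
    event_times = sorted({t for t, d in zip(durations, events) if d == 1})
    surv, curve = 1.0, []
    for t in event_times:
        # n_t: spells still at risk just before t (censored at t counts as at risk)
        n_at_risk = sum(1 for u in durations if u >= t)
        # d_t: exits observed exactly at t
        n_exits = sum(1 for u, d in zip(durations, events) if u == t and d == 1)
        surv *= 1 - n_exits / n_at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: durations in years, 1 = event, 0 = censored.
durations = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

The curve only has entries at times 2, 3, 5 and 8, the times with observed exits, which is why the plotted survival function is a step function.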

Survival/ failure function Describing the survival/ failure function

Kaplan-Meier graphs: Can read off the estimated probability of surviving a relationship at any time point on the graph. E.g. at 5 years 88% are still cohabiting. The survival probability only changes when an event occurs → the graph is not smooth but (irregular) stepwise. Stata: sts graph, survival

Testing equality of survival curves among groups: The log-rank test. A non-parametric test that assesses the null hypothesis that there are no differences in survival times between groups.
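
As a rough sketch of what the log-rank test computes for two groups: at each event time it compares the observed number of events in one group with the number expected if both groups shared the same hazard. The Python below is illustrative only (the course uses Stata); the function name logrank_test and the data are hypothetical, and the variance is the usual hypergeometric one.

```python
import math

def logrank_test(durations, events, groups):
    """Two-group log-rank test; groups coded 0/1. Returns (chi2, p_value)."""
    data = list(zip(durations, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                 # at risk, both groups
        n1 = sum(1 for u, _, g in data if u >= t and g == 1)     # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e == 1)      # events, both groups
        d1 = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: group 1 exits earlier than group 0; all spells end in an event.
durations = [5, 6, 7, 8, 1, 2, 3, 4]
events = [1] * 8
groups = [0, 0, 0, 0, 1, 1, 1, 1]
chi2, p_value = logrank_test(durations, events, groups)
print(chi2, p_value)

# Identical survival experience in both groups gives a statistic of zero.
chi2_same, p_same = logrank_test([1, 2, 3, 4, 1, 2, 3, 4], [1] * 8, groups)
```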

Log-rank test example Significant difference between men and women

More elaborate models... Modeling the hazard rate, not survival time directly. h(t) = hazard of transitioning at time t, having survived up to t. Time: Continuous, parametric: exponential, Weibull, log-logistic. Continuous, semi-parametric: Cox. Discrete: logistic, complementary log-log.

Some hazard shapes: Increasing: onset of Alzheimer's. Decreasing: survival after surgery. U-shaped: age-specific mortality. Constant: time till next email arrives.
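
Three of these shapes can be produced by a single parametric family. The sketch below uses the Weibull hazard in one common shape/scale parametrization, which is an assumption made here rather than something given on the slides; a U-shaped hazard cannot be produced by a Weibull model.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [0.5, 1.0, 2.0, 4.0]
increasing = [weibull_hazard(t, shape=2.0) for t in times]  # shape > 1
decreasing = [weibull_hazard(t, shape=0.5) for t in times]  # shape < 1
constant = [weibull_hazard(t, shape=1.0) for t in times]    # shape = 1 (exponential model)
print(increasing, decreasing, constant)
```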

Proportional-hazards (PH) models: h(t) is separable into h0(t) and the effects of Xs. h0(t) = 'baseline' hazard that depends on t but not on individual characteristics. $h(t) = h_0(t)\exp(\beta X)$. Absolute differences in X → proportional differences in h(t) ~ scaling of h0(t).

The Cox regression model

Cox regression model: Regression model for survival analysis. Can model time-invariant and time-varying explanatory variables. Produces estimated hazard ratios (sometimes called rate ratios or risk ratios). Regression coefficients are on a log scale; exponentiate to get the hazard ratio. Similar to odds ratios from logistic models.

Cox regression equation (i): $h_i(t) = h_0(t)\exp(\beta_1 x_{i1} + \dots + \beta_n x_{in})$. $h_i(t)$ is the hazard function for individual i. $h_0(t)$ is the baseline hazard function and can take any form; it is estimated from the data (non parametric). $x_{i1}, \dots, x_{in}$ are the covariates. $\beta_1, \dots, \beta_n$ are the regression coefficients estimated from the data. PH assumption needed. Estimate βs without estimating h0(t) → semi-parametric model.

Cox regression equation (ii): If we divide both sides of the equation on the previous slide by h0(t) and take logarithms, we obtain: $\ln\left(\frac{h_i(t)}{h_0(t)}\right) = \beta_1 x_{i1} + \dots + \beta_n x_{in}$. We call h(t)/h0(t) the hazard ratio. The coefficients β1...βn are estimated by Cox regression, and can be interpreted in a similar manner to those of multiple logistic regression. exp(βi) is the instantaneous relative risk of an event.
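
To see why the exponentiated coefficient is a hazard ratio, compare two individuals who differ by one unit on covariate k and are identical otherwise. This short derivation is a standard consequence of the proportional hazards form; the subscript k and the notation $h(t \mid \cdot)$ are introduced here for illustration.

```latex
\begin{align}
\frac{h(t \mid x_k + 1)}{h(t \mid x_k)}
  &= \frac{h_0(t)\exp\bigl(\beta_k (x_k + 1) + \sum_{j \ne k} \beta_j x_j\bigr)}
          {h_0(t)\exp\bigl(\beta_k x_k + \sum_{j \ne k} \beta_j x_j\bigr)} \\
  &= \exp(\beta_k).
\end{align}
```

The baseline hazard cancels, so the ratio does not depend on t, which is exactly the proportional hazards assumption.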

Cox regression in Stata: Will first model a time-invariant covariate (sex) on risk of partnership ending. Then will add a time-dependent covariate (age) to the model.

Cox regression in Stata

Interpreting output from Cox regression: The Cox model has no intercept; it is included in the baseline hazard. In our example, the baseline hazard is when sex=1 (male). The hazard ratio is the ratio of the hazard for a unit change in the covariate. HR = 1.3 for women vs. men: the risk of partnership breakdown is increased by 30% for women compared with men. The hazard ratio is assumed constant over time: at any time point, the hazard of partnership breakdown for a woman is 1.3 times the hazard for a man.

Interpreting output from Cox regression (ii): The hazard ratio is equivalent to the odds that a female has a partnership breakdown before a man. The probability of having a partnership breakdown first is (hazard ratio) / (1 + hazard ratio). So in our example, a HR of 1.30 corresponds to a probability of 0.57 that a woman will experience a partnership breakdown first. The probability or risk of partnership breakdown can be different each year but the relative risk is constant. So if we know that the probability of a man having a partnership breakdown in the following year is 1.5%, then the probability of a woman having a partnership breakdown in the following year is 0.015*1.30 = 1.95%.
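
The arithmetic on this slide can be checked in a few lines of Python (illustrative only; variable names are hypothetical). The last line is an addition not on the slide: multiplying a probability by the hazard ratio is an approximation that works for small probabilities, and under proportional hazards the exact one-year figure follows from the woman's survival probability being the man's raised to the power of the hazard ratio.

```python
hazard_ratio = 1.30

# Probability that the woman experiences a breakdown first.
p_first = hazard_ratio / (1 + hazard_ratio)

# One-year probability for a woman, given 1.5% for a man (slide's approximation).
p_man = 0.015
p_woman = p_man * hazard_ratio

# Exact value under proportional hazards (assumption stated in the lead-in).
p_woman_exact = 1 - (1 - p_man) ** hazard_ratio

print(round(p_first, 2), round(p_woman, 4), round(p_woman_exact, 4))
```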

Time dependent covariates: Examples: current age group rather than age at baseline; GHQ score may change over time and predict break-ups. Will use age to predict duration of cohabitation. Nonlinear relationship hypothesised. Recode age into 8 equally spaced age groups.

Cox regression with time dependent covariates

Cox regression assumptions: Assumption of proportional hazards. No censoring patterns. True starting time. Plus assumptions for all modelling: sufficient sample size, proper model specification, independent observations, exogenous covariates, no high multicollinearity, random sampling, and so on.

Proportional hazards assumption: Cox regression with time-invariant covariates assumes that the ratio of hazards for any two observations is the same across time periods. This can be a false assumption, for example when using age at baseline as a covariate. If a covariate fails this assumption: for hazard ratios that increase over time for that covariate, relative risk is overestimated; for ratios that decrease over time, relative risk is underestimated; standard errors are incorrect and significance tests are decreased in power.

Testing the proportional hazards assumption Graphical methods Comparison of Kaplan-Meier observed & predicted curves by group. Observed lines should be close to predicted Survival probability plots (cumulative survival against time for each group). Lines should not cross Log minus log plots (minus log cumulative hazard against log survival time). Lines should be parallel

Testing the proportional hazards assumption Formal tests of proportional hazard assumption Include an interaction between the covariate and a function of time. Log time often used but could be any function. If significant then assumption violated Test the proportional hazards assumption on the basis of partial residuals. Type of residual known as Schoenfeld residuals.

When assumptions are not met: If a categorical covariate, include the variable as a strata variable. This allows the underlying hazard function to differ between categories and be non-proportional. Estimates a separate underlying baseline hazard for each stratum.

When assumptions are not met: If a continuous covariate, consider splitting the follow-up time. For example, the hazard may be proportional within the first 5 years, the next 5-10 years and so on. Could the covariate be included as a time-dependent covariate? There are different survival regression methods (e.g. parametric models) that do not assume PH.

Censoring assumptions: Censored cases must be independent of the survival distribution. There should be no pattern to these cases, which instead should be missing at random. If censoring is not independent, then censoring is said to be informative. You have to judge this for yourself; usually you don't have any data that can be used to test the assumption. Think carefully about start and end dates. Always check a sample of records.

True starting time: The ideal model for survival analysis would be where there is a true zero time. If the zero point is arbitrary or ambiguous, the data series will be different depending on starting point. The computed hazard rate coefficients could differ, sometimes markedly. Conduct a sensitivity analysis to see how coefficients may change according to different starting points.

Other extensions to survival analysis: Discrete (interval-censored) survival times. Repeated events. Multi-state models (more than 1 event type), competing risks: transition from employment to unemployment or leaving labour market; modelling type of exit from cohabiting relationship (separation/divorce/widowhood). Frailty (unobserved heterogeneity).

Could you use logistic regression instead? May produce similar results for short or fixed follow-up periods. Examples: everyone followed up for 7 years; maximum follow-up 5 years. Results may differ if there are varying follow-up times. If dates of entry and dates of events are available then it is better to use Cox regression.

Finally... This is just an introduction to survival / event history analysis. Only reviewed the Cox regression model. There are also parametric survival methods, but Cox regression is likely to suit the type of analyses of interest to sociologists. Consider an intensive course if you want to use survival analysis in your own work.

Some Resources: Stephen Jenkins's course on survival analysis: https://www.iser.essex.ac.uk/files/teaching/stephenj/ec968/pdfs/ec968lnotesv6.pdf. Allison, Paul D. (1984) Event History Analysis: Regression for Longitudinal Event Data. Sage. Cleves, M., W. Gould, and R. Gutierrez (2004) An Introduction to Survival Analysis Using Stata. Rev. ed. Stata Press: College Station, Texas.