Event History Analysis 1 Sociology 8811 Lecture 16 Copyright © 2007 by Evan Schofer Do not copy or distribute without permission.



Announcements. Paper #1 due today! Topic: Event History Analysis. I’ll review some basics. In following classes we’ll think about data… and then return to the models in greater detail.

Review: EHA. In essence, EHA models a dependent variable that reflects both: 1. Whether or not a case experiences the event (e.g., mortality). 2. When it occurs (like an OLS regression of duration). The dependent variable is best conceptualized as a rate of some occurrence. EHA involves both descriptive and parametric analysis of data.

EHA Terminology: States & Events. “State” = the “state of being” of a case, conceptualized in terms of discrete phenomena (e.g., alive vs. dead). “State space” = the set of all possible states; it can be complex (e.g., single, married, divorced, widowed). “Event” = occurrence of the outcome of interest: a shift from “alive” to “dead” or from “single” to “married”, occurring at a specific point in time. “Risk set” = the set of all cases capable of experiencing the event (e.g., those “at risk” of experiencing mortality).

Review: Terminology. “Spell” = a chunk of time that a case experiences, bounded by events and/or the start or end of the study (as in “I’m gonna sit here for a spell…”). EHA is, in essence, an analysis of a set of spells (experienced by a given sample of cases). “Censored” = indicates the absence of data before or after a certain point in time, as in “data on cases is censored at 60 months”. “Right censored” = no data after a time point. “Left censored” = no data before a time point.

States, Spells, & Events: Visually. A complex state space: partnership (0 = single, 1 = married, 2 = divorced, 3 = widowed). Individual history: married at 20, divorced at 27, remarried at [age missing in source]. [Figure: state plotted against age in years, with Spells #1 through #4 marked and the history right censored at 45.]

Example: Employee Retention. Visually, a red line indicates the length of the employment spell for each case. [Figure: cases plotted against time in days; right-censored cases marked.]

Descriptives: Half Life. The time when ½ of the sample has had the event. [Figure: cases plotted against time in days; right-censored cases marked; half life = 23 days.]

Simple EHA Descriptives. Question: What simple things can we do to describe this sample of 12 employees? 3. Tabulate (or plot) quitters in different time periods (e.g., 1-20 days, etc.): absolute numbers of “quitters” or “stayers”, or numbers of quitters as a proportion of “stayers”. Or look at the number (or proportion) who have “survived” (i.e., not quit).

Descriptives: Tables. For each period, determine the number or proportion quitting/staying. [Figure: cases plotted against time in days, divided into day ranges.]

EHA Descriptives: Tables (day ranges and quit counts did not survive in the source)
Time range | Quitters: % of all; % of remaining | # staying
1 | 42% of all; 42% of remaining | 7 left, 58% of all
2 | 16% of all; 29% of remaining | 5 left, 42% of all
3 | 8% of all; 20% of remaining | 4 left, 33% of all
4 | 8% of all; 25% of remaining | 3 left, 25% of all

EHA Descriptives: Tables. Remarks on EHA tables: 1. Results of tables change depending on the time ranges chosen (like a histogram), e.g., comparing 20-day ranges vs. 10-day ranges. 2. % quitters vs. % quitters as a proportion of those still employed: the absolute % can be misleading, since the number of people left in the risk set tends to decrease. A low number of quitters can actually correspond to a very high rate of quitting for those remaining in the firm. Typically, these ratios are more socially meaningful than raw percentages.
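The two percentages contrasted above can be recomputed from the table. The sketch below takes the "12 employees" starting sample and the "# staying" column from the slides; the quit counts are inferred by subtraction rather than read from the source, so treat them as an assumption. Note that 2/12 rounds to 17%, where the slide shows 16%.

```python
# Starting sample size and "# staying" after each period, as given in the slides.
n_start = 12
staying = [7, 5, 4, 3]

rows = []
at_risk = n_start  # cases still employed at the start of the period
for period, left in enumerate(staying, start=1):
    quit_n = at_risk - left            # inferred count, not read from the slide
    pct_all = quit_n / n_start         # quitters as a share of the original sample
    pct_remaining = quit_n / at_risk   # quitters as a share of those still at risk
    rows.append((period, quit_n, pct_all, pct_remaining, left))
    at_risk = left

for period, quit_n, pct_all, pct_remaining, left in rows:
    print(f"period {period}: {quit_n} quit, {pct_all:.0%} of all, "
          f"{pct_remaining:.0%} of remaining, {left} left")
```

In periods 3 and 4 the same single quitter is 8% of the original sample but 20% and then 25% of those still employed, which is the point of remark 2.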

EHA Descriptives: Plots We can also plot tabular information:

The Survivor Function. A more sophisticated version of % remaining, calculated based on continuous time (calculus) rather than on some arbitrary interval (e.g., day 1-20). Survivor function S(t): the probability of not having had the event prior to time t. It is always equal to 1 at time = 0 (when no events can have happened yet) and decreases as more cases experience the event. When graphed, it is typically a decreasing curve that looks a lot like % remaining.
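The lecture does not say which estimator produces its survivor plots. A common choice is the Kaplan-Meier (product-limit) estimator, sketched here as an assumption. The durations and censoring flags are hypothetical, not the lecture's employee data.

```python
# Hypothetical spells: duration in days and whether the event was observed
# (1 = quit, 0 = right censored). Not data from the lecture.
durations = [5, 8, 12, 12, 20, 23, 30, 45]
observed = [1, 1, 1, 0, 1, 1, 0, 0]

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where at least one event occurs."""
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    surv = 1.0  # S(0) = 1: no events can have happened yet
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = total = 0
        # Gather every case whose spell ends at time t.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            total += 1
            i += 1
        if events:
            surv *= 1 - events / n_at_risk  # share of the risk set surviving past t
            curve.append((t, surv))
        n_at_risk -= total  # events and censored cases both leave the risk set
    return curve

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t = {t:>2}: S(t) = {s:.3f}")
```

Censored cases never pull the curve down directly. They only shrink the risk set used at later event times.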

Survivor Function McDonald’s Example: Steep decreases indicate lots of quitting at around 20 days

The Hazard Function. A more sophisticated version of # events divided by # remaining. Hazard function h(t) = the probability of an event occurring at a given point in time, given that it hasn’t already occurred. Formula: $h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$, where T is the time at which the event occurs. Think of it as the rate of events occurring for those at risk of experiencing the event.

The Hazard Function Example: High (and wide) peaks indicate lots of quitting

Cumulative Hazard Function. Problem: the hazard function is often very spiky and hard to read/interpret. Alternative #1: “smooth” the hazard function (using a smoothing algorithm). Alternative #2: the “cumulative” or “integrated” hazard. Use calculus to “integrate” the hazard function. Recall: an integral represents the area under the curve of another function, here between 0 and t. Integrated hazard functions never decrease (the opposite of the survivor function, which never increases). Steep growth indicates that the hazard is high.
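One way to see the integrated hazard as a running sum is the Nelson-Aalen estimator, which adds (events / number at risk) at each event time. The lecture does not name an estimator, so this choice is an assumption, and the data below are hypothetical.

```python
# Hypothetical spells: duration in days and whether the event was observed
# (1 = event, 0 = right censored). Not data from the lecture.
durations = [5, 8, 12, 12, 20, 23, 30, 45]
observed = [1, 1, 1, 0, 1, 1, 0, 0]

def nelson_aalen(durations, observed):
    """Return [(t, increment, cumulative hazard)] at each distinct event time."""
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    cumulative = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = total = 0
        # Gather every case whose spell ends at time t.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            total += 1
            i += 1
        if events:
            increment = events / n_at_risk  # crude hazard at t: events / at risk
            cumulative += increment
            curve.append((t, increment, cumulative))
        n_at_risk -= total
    return curve

for t, inc, cum in nelson_aalen(durations, observed):
    print(f"t = {t:>2}: increment = {inc:.3f}, H(t) = {cum:.3f}")
```

The increments are the spiky quantity, while their running total only moves upward. This is why steep stretches of the integrated hazard mark periods of high hazard.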

Integrated Hazard Function Example: Steep increases indicate peaks in hazard rate “Flat” areas indicate low hazard rate

Descriptive EHA: Marriage. Example: event = marriage. Time clock: person’s age. Data source: NORC General Social Survey. Sample: 29,000 individuals.

Survivor: Marriage Compare survivor for women, men: Survivor plot for Men (declines later) Survivor plot for Women (declines earlier)

Integrated Hazard: Marriage. Compare the integrated hazard for women and men: the integrated hazard for men increases more slowly (and remains lower) than that for women.

Hazard Plot: Marriage Hazard Rate: Full Sample

Survivor Plot: Pros/Cons. Benefits: 1. Clear, simple interpretation. 2. Useful for comparing subgroups in data. Limitations: 1. Mainly useful for a fixed risk set with a single non-repeating event (e.g., drug trials/mortality); if events recur frequently, the survivor drops to zero (and becomes uninterpretable). 2. If the risk set fluctuates a lot, the survivor function becomes harder to interpret.

Hazard Plot Pros/Cons. Benefits: directly shows the rate over time (this is the actual dependent variable modeled); works well for repeating events. Limitations: can be difficult to interpret and requires practice; spikes make it hard to get a clear picture of the trend (pay close attention to the width of spikes, not just the height!); the choice of smoothing algorithm can affect results; hard to compare groups (due to spikiness).

Integrated Hazard Plot Pros/Cons. Benefits: closely related to the dependent variable that you’ll be modeling; very good for comparing groups; works for repeating events. Limitations: not as intuitive as the actual hazard rate; still takes some practice to interpret.

From Plots to Models. We know from the plots that women get married faster than men. Questions: 1. How do we quantify the difference in hazard rates? 2. How do we test hypotheses about the difference in rates? Can we be confident that the observed difference between men and women is not merely due to sampling variability?

EHA Models. Strategy: model the hazard rate as a function of covariates, much like regression analysis. Determine coefficients: the extent to which a change in independent variables results in a change in the hazard rate. Use information from the sample to compute t-values (and p-values). Test hypotheses about coefficients.

EHA Models. Issue: In standard regression, we must choose a proper “functional form” relating X’s to Y’s. OLS is a “linear” model: it assumes a linear relationship, e.g., $Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + e$. Logistic regression for discrete dependent variables assumes an “S-curve” relationship between variables. When modeling the hazard rate h(t) over time, what relationship should we assume? There are many options: assume a flat hazard, or various S-shaped, U-shaped, or J-shaped curves. We’ll discuss details later…

Constant Rate Models. The simplest parametric EHA model assumes that the base hazard rate is generally “flat” over time; any observed changes are due to changed covariates. It is called a “Constant Rate” or “Exponential” model. Note: the assumption of a constant rate isn’t always tenable. Formula: $h(t) = e^{a + bX}$. Usually rewritten as: $\ln h(t) = a + bX$.
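With no covariates, a constant hazard has a simple maximum-likelihood estimate: the number of observed events divided by the total time at risk. The sketch below illustrates that on hypothetical spells (not the lecture's data) and shows the survivor curve the flat-hazard assumption implies, $S(t) = e^{-\lambda t}$.

```python
import math

# Hypothetical spells: duration in days and whether the event was observed
# (1 = event, 0 = right censored). Not data from the lecture.
durations = [5, 8, 12, 12, 20, 23, 30, 45]
observed = [1, 1, 1, 0, 1, 1, 0, 0]

def constant_rate(durations, observed):
    """ML estimate of a constant hazard: events / total time at risk."""
    return sum(observed) / sum(durations)

def survivor(t, rate):
    """Survivor function implied by a constant hazard: S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)

rate = constant_rate(durations, observed)
print(f"estimated rate: {rate:.4f} events per day")
print(f"model-implied S(20): {survivor(20, rate):.3f}")
```

Censored spells still contribute their time at risk to the denominator, even though they add no event to the numerator.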

Constant Rate Models Question: Is the constant rate assumption tenable?

Constant Rate Models. Question: Is the constant rate assumption tenable? Answer: Probably not. 1. The hazard rate goes up and down over time; it is not constant at all, even if smoothed. 2. The change over time isn’t likely the result of changing covariates (X’s) in our model. However, if the change was merely the result of some independent variable, then the underlying (unobserved) rate might, in fact, be constant.

Constant Rate Models. Let’s run an analysis anyway… Ignore the violation of assumptions regarding the functional form of the hazard rate. Recall: the constant rate model is $h(t) = e^{a + bX}$. In this case, we’ll only specify one X variable: DFEMALE, a dummy variable indicating women. The coefficient reflects the difference in hazard rate for women versus men.
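With a single dummy covariate, the exponential model can be worked by hand. Each group's estimated rate is its events divided by its time at risk, so the constant is the log of the men's rate and the coefficient is the log of the rate ratio. The records below are hypothetical, not the GSS marriage data, and the variable names are illustrative.

```python
import math

# Hypothetical records: (duration in years, event observed, female dummy).
records = [
    (22, 1, 1), (25, 1, 1), (19, 1, 1), (40, 0, 1),
    (27, 1, 0), (31, 1, 0), (45, 0, 0), (24, 1, 0),
]

def group_rate(records, female):
    """Events divided by total time at risk within one group."""
    events = sum(e for _, e, f in records if f == female)
    exposure = sum(d for d, _, f in records if f == female)
    return events / exposure

rate_men = group_rate(records, 0)
rate_women = group_rate(records, 1)

a = math.log(rate_men)               # constant: log hazard when the dummy is 0
b = math.log(rate_women / rate_men)  # coefficient: difference in log hazards

print(f"a = {a:.3f}, b = {b:.3f}, hazard ratio = {math.exp(b):.3f}")
```

A positive b means the group coded 1 has the higher rate. Exponentiating a + b recovers that group's rate exactly.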

Constant Rate Model: Marriage. A simple one-variable model comparing gender. [Stata output: exponential regression of _t on Female with constant _cons; header reports No. of subjects, No. of failures, Log likelihood, Prob > chi2; columns Coef., Std. Err., z, P>|z|; numeric values did not survive in the source.] The positive coefficient for Female (a dummy variable) indicates a higher hazard rate for women.

Constant Rate Coefficients. Interpreting the EHA coefficient: b = .19. Coefficients reflect the change in the log of the hazard. Recall one of the ways to write the formula: $\ln h(t) = a + bX$. But we aren’t interested in the change in log rates; we’re interested in the change in the actual rate. Solution: exponentiate the coefficient, i.e., use the “inverse-log” function on a calculator. The result reflects the impact on the actual rate.

Constant Rate Coefficients. Exponentiate the coefficient to generate the “hazard ratio”. Multiplying by the hazard ratio gives the change in the hazard rate for each unit increase in the independent variable. Multiplying by 1.21 results in a 21% increase. A hazard ratio of 2.00 = a 100% increase (the rate doubles). A hazard ratio of .25 = a 75% decrease in the rate.

Constant Rate Coefficients. The variable FEMALE is a dummy variable: women = 1, men = 0. An increase from 0 to 1 (men to women) reflects a 21% increase in the hazard rate. Continuous measures, however, can change by many points (e.g., firm size, age, etc.). To determine the effects of multiple-point increases (e.g., firm size of 10 vs. 7), multiply repeatedly. Ex: hazard ratio = .95, increase = 3 units: .95 x .95 x .95 = .86, indicating a 14% decrease.
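The arithmetic in these slides can be checked in a few lines. The coefficient .19 and the hazard ratios .95 and .25 come from the lecture. The helper name `pct_change` is illustrative only.

```python
import math

b_female = 0.19                    # coefficient reported in the lecture
hazard_ratio = math.exp(b_female)  # exponentiate to move from log rate to rate

def pct_change(hazard_ratio, units=1):
    """Percent change in the hazard rate for a `units`-point increase in X."""
    return (hazard_ratio ** units - 1) * 100

print(f"hazard ratio for Female: {hazard_ratio:.2f}")
print(f"percent change, men to women: {pct_change(hazard_ratio):.0f}%")
print(f"ratio .95 over 3 units: {pct_change(0.95, 3):.0f}%")
print(f"ratio .25 over 1 unit: {pct_change(0.25):.0f}%")
```

Raising the ratio to the power of the unit change is the same as the repeated multiplication described above.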

Hypothesis Tests: Marriage. Final issue: Is the 21% higher hazard rate for women significantly different from that for men, or is the observed difference likely due to chance? Solution: hazard rate models calculate standard errors for coefficient estimates, allowing calculation of t-values and p-values. [Stata output: rows Female and _cons; columns Coef., Std. Err., t, P>|t|; numeric values did not survive in the source.]
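A sketch of how a test statistic follows from a coefficient and its standard error. The coefficient .19 is from the lecture, but the standard error did not survive in the source, so the value 0.05 below is hypothetical and the resulting numbers are illustrative only. The sketch uses a normal (z) reference distribution, which is an assumption.

```python
import math

b = 0.19   # coefficient for Female, from the lecture
se = 0.05  # HYPOTHETICAL standard error; the real value is not in the source

z = b / se  # test statistic for the null hypothesis b = 0
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value, standard normal

# Approximate 95% confidence interval, expressed on the hazard-ratio scale.
lower = math.exp(b - 1.96 * se)
upper = math.exp(b + 1.96 * se)

print(f"z = {z:.2f}, p = {p_two_sided:.5f}")
print(f"95% CI for hazard ratio: {lower:.2f} to {upper:.2f}")
```

If the interval for the hazard ratio excludes 1, the difference between the groups is distinguishable from zero at the 5% level under these assumptions.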

Types of EHA Models. Two main types of proportional EHA models: 1. Parametric models specify a functional form of h(t). Constant rate is one example; also piecewise exponential, Gompertz, Weibull, etc. 2. Cox models don’t specify a particular form for h(t). Each makes assumptions, like OLS assumptions regarding functional form, error variance, normality, etc. If assumptions are violated, models can’t be trusted.

Parametric Models. These models make assumptions about the overall shape of the hazard rate over time, much like OLS regression assumes a linear relationship between X and Y and logit assumes an S-curve. Options: constant, Gompertz, Weibull. There is a piecewise exponential option, too. Note: they also make standard statistical assumptions: independent random sample, properly specified model, etc.

Cox Models. The basic Cox model: $h(t) = h_0(t)\, e^{bX}$, where h(t) is the hazard rate and $h_0(t)$ is some baseline hazard function (to be inferred from the data). This obviates the need for building a specific functional form into the model. The bX’s are coefficients and covariates.

Cox Model Assumptions. Cox models assume that independent variables don’t interact with time (at least, not in ways you haven’t controlled for), i.e., that the hazard rates at different values of X are proportional (parallel) to each other over time. Example: marriage rate, women vs. men. Women have a higher rate at all points in time. Question: Does the hazard rate for women diverge or converge with men over time? If so, the proportion (or ratio) of the rates changes, the assumption is violated, and a different model should be used.
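To see why proportionality is built into the model, take the usual textbook form of the Cox model, $h(t \mid X) = h_0(t)\, e^{bX}$ (stated here as an assumption), with X the female dummy. The ratio of the women's hazard to the men's hazard is then:

```latex
\frac{h(t \mid X = 1)}{h(t \mid X = 0)}
  = \frac{h_0(t)\, e^{b \cdot 1}}{h_0(t)\, e^{b \cdot 0}}
  = e^{b}
```

The baseline $h_0(t)$ cancels, so the ratio is the same constant at every t. If the two groups' rates converge or diverge over time, no single value of b can describe them.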

Cox Model Assumptions: Proportionality. Look for parallel h(t)’s for different sub-groups (values of X’s). [Figure: two panels plotting h(t) against time for women and men, one labeled “Good” and one labeled “Bad”.]

Cox Model Assumptions. Hazard rates are often too spiky to discern trends. Options: 1. Smooth the hazard plots, OR 2. Check the integrated hazard rate and look for differences in the overall shape of the curve. Note: divergence is OK on an integrated hazard.

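Cox Model: Example. Marriage example. [Stata output: Cox regression of _t on Female; header reports No. of subjects, Number of obs, No. of failures, Time at risk, LR chi2(1), Log likelihood, Prob > chi2; columns Coef., Std. Err., z, P>|z|; numeric values did not survive in the source.]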