Lecture 16 Duration analysis: Survivor and hazard function estimation


Research Method, Lecture 16. Duration analysis: Survivor and hazard function estimation

Duration analysis: Duration analysis was originally developed to study how long a patient survives a disease such as cancer. Such models have since been applied in econometrics. Common applications are the estimation of unemployment duration, or of the time it takes a worker to be promoted to a higher position.

In duration analysis, our purpose is to estimate either the survivor function or the hazard function. The definitions of the survivor function and the hazard function are simple. For illustration, I will use the time it takes a university faculty member to be promoted to full professor as an example.

The duration until the promotion is a random variable. Let F(t) be the cumulative distribution function of the duration. Then, we have the following.

1. The cumulative distribution function F(t): F(t) = the probability that the duration until the promotion is less than t years. Example 1: [Figure: F(t) plotted against t (years since hired), with F(20) = 0.95.] This graph means that if you work 20 years, the probability that you have been promoted to full professor is 95%.

2. The survivor function S(t): S(t) = 1 - F(t) = the probability that the person has not been promoted for at least t years. Example 1: [Figure: S(t) plotted against t (years since hired), with S(20) = 0.05.] This means that, if you work 20 years, the probability that you have not been promoted to full professor is 5%.

3. The density function f(t): f(t) = F'(t). 4. The hazard function λ(t): λ(t) = f(t)/S(t). The hazard function shows the rate at which you will be promoted at t years, given that you have not been promoted up to that time.

It may sound strange to call a faculty member who has not been promoted a "survivor", and to call the rate at which faculty are promoted a "hazard". But just remember that this model was initially developed to estimate the survival duration of cancer patients and the like.

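The relationship between the survivor function and the hazard function: Let $G(t)=\log S(t)$. Then the derivative of $G(t)$ is written as:

$$G'(t)=\frac{S'(t)}{S(t)}=\frac{-f(t)}{S(t)}=-\lambda(t)$$

You can recover $G(t)$ from $G'(t)$ by integration. Since $S(0)=1$, we have $G(0)=0$, so

$$G(t)=-\int_0^t \lambda(s)\,ds$$

Since $G(t)=\log S(t)$, we have $\exp(G(t))=S(t)$. Thus, the relationship between the survivor function and the hazard function is given by

$$S(t)=\exp\left(-\int_0^t \lambda(s)\,ds\right)$$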

If you can estimate the hazard function, you can recover the survivor function. Therefore, most duration analyses focus on estimating the hazard function.
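The step from a hazard function to a survivor function can be sketched numerically. The following is a minimal illustration, not part of the lecture: it assumes the standard relationship S(t) = exp(-∫ λ(s) ds) between 0 and t, approximates the integral with the trapezoid rule, and uses a hypothetical constant hazard of 0.15 per year. The function name `survivor_from_hazard` is made up for this example.

```python
import math

def survivor_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral of hazard from 0 to t), trapezoid rule."""
    if t <= 0:
        return 1.0
    width = t / steps
    # Trapezoid rule: half weight on the two end points, full weight inside.
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * width)
    return math.exp(-total * width)

# Hypothetical constant hazard (illustrative value, not an estimate).
def constant_hazard(s):
    return 0.15

s20 = survivor_from_hazard(constant_hazard, 20.0)
print(round(s20, 4))  # should agree with exp(-0.15 * 20)
```

Any non-negative function of time can be passed as `hazard`, so the same helper works for a hazard that rises or falls with duration.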

The hazard function estimation: Let x be the row vector of explanatory variables, and β be the corresponding column vector of coefficients. We model the hazard function λ(t, x, β) as λ(t, x, β) = λ0(t) exp(xβ). λ0(t) is called the baseline hazard. There are several choices for the baseline hazard. I will explain 3 common choices.

The exponential hazard model: When you assume that λ0(t)=1, this model is called the exponential hazard model. The hazard function is given by: λ(t) = exp(xβ). This model assumes that the hazard is constant over time. This is restrictive: for example, if you are unemployed for a long time, you become less and less likely to find a job. This kind of time dependency cannot be captured by the exponential hazard.

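Given the exponential hazard function, the survivor function S(t) is given by

$$S(t)=\exp\left(-\int_0^t \exp(x\beta)\,ds\right)=\exp\left(-\exp(x\beta)\,t\right)$$

and the density function is

$$f(t)=\lambda(t)\,S(t)=\exp(x\beta)\exp\left(-\exp(x\beta)\,t\right)$$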
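As a rough check of the exponential model, the hazard λ = exp(xβ), the survivor function S(t) = exp(-exp(xβ) t) and the density f(t) = λ S(t) can be written as three small functions. This is a sketch only: the function names and the value of the index xβ are hypothetical and are not taken from promotion.dta.

```python
import math

def exp_hazard(xb):
    """Constant hazard of the exponential model: exp(x*beta)."""
    return math.exp(xb)

def exp_survivor(t, xb):
    """S(t) = exp(-exp(x*beta) * t)."""
    return math.exp(-exp_hazard(xb) * t)

def exp_density(t, xb):
    """f(t) = hazard * S(t)."""
    return exp_hazard(xb) * exp_survivor(t, xb)

xb = -2.0   # hypothetical value of the index x*beta
t = 20.0
# By construction, f(t) / S(t) gives back the hazard at any t.
print(exp_density(t, xb) / exp_survivor(t, xb), exp_hazard(xb))
```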

If the person has been promoted already, you know the exact duration. Thus, the likelihood contribution for this person is the density function f(t). If the person has not been promoted, then the only thing you know is that the duration until promotion is longer than the recorded duration. Thus, the likelihood contribution is the probability that the promotion duration is greater than t, which is equal to the survivor function S(t).

Let Di be the dummy indicating that the person has been promoted, and let ti be the recorded duration of person i. Then, the likelihood contribution is written as:

$$L_i=f(t_i)^{D_i}\,S(t_i)^{1-D_i}$$

The likelihood function L is then given by the product over all n observations:

$$L=\prod_{i=1}^{n} L_i$$

The value of β that maximizes the likelihood function is the estimator of the exponential hazard model. As usual, you maximize log(L).
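To make the censored likelihood concrete, here is a minimal sketch for the simplest case: an exponential model with a constant only, so that xβ is a single number (the log of the hazard rate). A promoted person contributes log f(t) and a not-yet-promoted person contributes log S(t). The data below are invented for illustration. In this constant-only case the maximizer has a well-known closed form, the number of promotions divided by the total recorded duration, which the code uses as a check.

```python
import math

def exp_loglik(log_rate, durations, promoted):
    """Log-likelihood of a constant-only exponential hazard model.

    promoted[i] = 1 contributes log f(t) = log_rate - rate * t,
    promoted[i] = 0 contributes log S(t) = -rate * t.
    """
    rate = math.exp(log_rate)
    total = 0.0
    for t, d in zip(durations, promoted):
        total += d * log_rate - rate * t
    return total

# Hypothetical durations (years) and promotion dummies, not from promotion.dta.
durations = [4.0, 7.0, 10.0, 12.0, 15.0]
promoted = [1, 1, 0, 1, 0]

# Closed-form maximizer for the constant-only case.
rate_hat = sum(promoted) / sum(durations)
print(rate_hat, exp_loglik(math.log(rate_hat), durations, promoted))
```

With explanatory variables there is no closed form, and a numerical optimizer would take the place of the last step.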

Regarding the dummy variable Di, note the difference between the hazard function estimation and the censored model. In the hazard function estimation, Di=1 if the person is promoted. But in the censored model, we set Di=1 if the person is not promoted (thus the duration is censored). This is purely a difference in convention between the two models.

Exponential hazard example: Using promotion.dta, let us estimate the exponential hazard model. The explanatory variables are female, phdabroad and book_rate. In the data, "durat" is the duration from the initial hire to the promotion to full professor. "promoted" is the dummy variable indicating whether the person has been promoted. This corresponds to Di. "phdabroad" is a dummy for those who got their Ph.D. abroad. "book_rate" is the number of books published per year of their career.

First, tell Stata that this is survival data, using the stset command.

Then, estimate the hazard function

The interpretation of the coefficients is tricky. Note that you have estimated the following hazard function, where $\beta_0$ is the constant:

$$\lambda(t,x,\beta)=\exp\left(\beta_0+\beta_{\text{female}}\,\text{female}+\beta_{\text{phdabroad}}\,\text{phdabroad}+\beta_{\text{book\_rate}}\,\text{book\_rate}\right)$$

Thus, the estimated coefficient for female (-0.3131) means that, if you are female, the hazard decreases by the multiplicative factor of exp(-0.3131)=0.7112.

In other words, the female hazard function is 71% of the male hazard function. This means that females are less likely than males to be promoted to the full professor position at any given level of experience (though the coefficient is not significant). Sometimes, researchers report the exponentiated coefficients exp(βj) instead of the actual coefficients. You can do this by dropping the "nohr" option in the streg command. However, economists usually report coefficients. Thus, I recommend using the "nohr" option.

Another complicating fact is that, even if the female hazard function is 71% of the male hazard function, this does not mean that a female's probability of being promoted is 71% of a male's promotion probability. In order to compare the probabilities of being promoted, you have to compute the survivor function.
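A small numerical sketch may help show why a hazard ratio is not a ratio of promotion probabilities. It assumes a constant (exponential) hazard, takes the rounded hazard ratio of 0.71 quoted above, and uses a hypothetical male hazard of 0.10 per year, which is not an estimate from the data. The probability of having been promoted by year t is 1 - S(t).

```python
import math

def promotion_probability(t, hazard):
    """P(promoted by t) = 1 - S(t) under a constant hazard."""
    return 1.0 - math.exp(-hazard * t)

male_hazard = 0.10     # hypothetical value, for illustration only
hazard_ratio = 0.71    # rounded female/male hazard ratio quoted in the lecture
female_hazard = hazard_ratio * male_hazard

t = 20.0
p_male = promotion_probability(t, male_hazard)
p_female = promotion_probability(t, female_hazard)

# The ratio of promotion probabilities differs from the hazard ratio.
print(round(p_female / p_male, 3), hazard_ratio)
# Difference in promotion probabilities (equivalently, in survival probabilities).
print(round(p_male - p_female, 3))
```

Under these assumed numbers the probability ratio comes out closer to 1 than the hazard ratio, which is why the comparison has to go through the survivor function.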

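Just remember that

$$S(t)=\exp\left(-\exp(x\beta)\,t\right)$$

The survivor functions for the "average" males and females are given by the following, where a bar denotes the sample mean and $\beta_0$ is the constant:

Male: $$S(t)=\exp\left(-\exp\left(\beta_0+\beta_{\text{phdabroad}}\,\overline{\text{phdabroad}}+\beta_{\text{book\_rate}}\,\overline{\text{book\_rate}}\right)t\right)$$

Female: $$S(t)=\exp\left(-\exp\left(\beta_0+\beta_{\text{female}}+\beta_{\text{phdabroad}}\,\overline{\text{phdabroad}}+\beta_{\text{book\_rate}}\,\overline{\text{book\_rate}}\right)t\right)$$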

The survivor function is a function of t. Thus, there are two ways to compare the survival probabilities. The first way is to plot the survivor functions for males and females, then compare the two visually. This is done automatically by Stata. The second way is to compute the survival probability at a particular time, say 10 years, for both males and females, then compare them.

[Figure: estimated survivor functions for females and males.] Note that the survival probability shows the probability of not being promoted. Since the survival curve for females is above that for males, females are less likely to be promoted (i.e., more likely not to be promoted).

Now, let us compute the survival probability for males and females at a duration of 20 years. The difference in the survival probability is about 0.114. Thus, at 20 years of experience, females are about 11 percentage points less likely than males to have been promoted to full professor. The next slide shows how I computed these probabilities using Stata.

Weibull hazard model: When you assume that $\lambda_0(t)=\alpha t^{\alpha-1}$, this model is called the Weibull hazard model. If α<1, there is negative duration dependence (i.e., the longer you stay unemployed, the less likely you are to find a job). If α>1, there is positive duration dependence. If α=1, there is no duration dependence, and the model is the same as the exponential model.

Remember that the exponential hazard model cannot capture duration dependence. The Weibull hazard model overcomes this weakness. The hazard function for the Weibull model, with shape parameter α, is given by:

$$\lambda(t,x,\beta)=\alpha t^{\alpha-1}\exp(x\beta)$$

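Given the Weibull hazard function $\lambda(t)=\alpha t^{\alpha-1}\exp(x\beta)$, we have

$$S(t)=\exp\left(-\exp(x\beta)\,t^{\alpha}\right)$$

and

$$f(t)=\lambda(t)\,S(t)=\alpha t^{\alpha-1}\exp(x\beta)\exp\left(-\exp(x\beta)\,t^{\alpha}\right)$$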
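The Weibull case can be sketched in the same way as the exponential one. The code below assumes the usual proportional-hazard form of the Weibull model, with hazard α t^(α-1) exp(xβ) and survivor function exp(-exp(xβ) t^α); the symbol α for the shape parameter, the function names and all numerical values are assumptions made for this illustration.

```python
import math

def weibull_hazard(t, xb, alpha):
    """Hazard alpha * t**(alpha - 1) * exp(x*beta), for t > 0."""
    return alpha * t ** (alpha - 1.0) * math.exp(xb)

def weibull_survivor(t, xb, alpha):
    """S(t) = exp(-exp(x*beta) * t**alpha)."""
    return math.exp(-math.exp(xb) * t ** alpha)

def weibull_density(t, xb, alpha):
    """f(t) = hazard * S(t)."""
    return weibull_hazard(t, xb, alpha) * weibull_survivor(t, xb, alpha)

xb = -4.0      # hypothetical index x*beta
alpha = 1.5    # hypothetical shape; alpha > 1 means positive duration dependence

for years in (5.0, 10.0, 20.0):
    print(years, weibull_hazard(years, xb, alpha), weibull_survivor(years, xb, alpha))
```

Setting `alpha = 1.0` makes the hazard constant in t, which reproduces the exponential model.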

Let Di be the dummy indicating that the person has been promoted, and let ti be the recorded duration of person i. Then, the likelihood contribution is written as:

$$L_i=f(t_i)^{D_i}\,S(t_i)^{1-D_i}$$

The likelihood function L is then given by the product over all n observations:

$$L=\prod_{i=1}^{n} L_i$$

The values of β and of the Weibull shape parameter α that maximize the likelihood function are the estimators of the Weibull hazard model. As usual, you maximize log(L).

Weibull hazard estimation example: Using promotion.dta, let us estimate the Weibull hazard function. The explanatory variables are female, phdabroad, and book_rate.

[Stata output: Weibull hazard estimates.] The highlighted parameter is log(α), the log of the Weibull shape parameter. Thus, α is greater than 1. So there is positive duration dependence.

Females are less likely to be promoted to full professor at any given level of experience. The coefficient (-0.459) indicates that the female hazard is smaller than the male hazard by the multiplicative factor of exp(-0.459)= 0.639. Now, let us compare the survival functions for males and females.

The survival functions for males and females "at the average" are given by

$$S(t)=\exp\left(-\exp(\bar{x}\beta)\,t^{\alpha}\right)$$

where α is the Weibull shape parameter and $\bar{x}$ holds phdabroad and book_rate at their sample means, with female set to 0 for males and to 1 for females. Stata automatically plots these survival functions.

The females' survivor function is above the males'. Thus, females are less likely to be promoted to full professor at any given level of experience.

Now, let us compute the survival probability at t=20. The difference in the survival probability is 13 percentage points. Females are 13 percentage points less likely to have been promoted to full professor at 20 years of experience. The next slide shows how I computed these probabilities.

Computing the survival probability at t=20 for Weibull model

Piecewise-constant hazard model: This is perhaps the most flexible model. In this model, you have to segment the duration into several pieces. Then, you assume that (i) within each segment, the hazard is constant, but (ii) the hazard can differ between segments.

The hazard function can be written as:

$$\lambda(t)=\begin{cases}\lambda_1 & \text{for } 0\le t\le c_1\\ \lambda_2 & \text{for } c_1<t\le c_2\\ \;\vdots & \\ \lambda_M & \text{for } c_{M-1}<t<\infty\end{cases}$$

For example, suppose that you segment the duration into three pieces. Then the piecewise-constant hazard function would look like: [Figure: step function λ(t) against t, with levels λ1, λ2, λ3 and cut points c1 and c2.]

In the piecewise-constant hazard model, you estimate λ1~λM as well as β. The practical estimation is illustrated as follows. Suppose you segment the duration into three pieces, 0~10, 11~20, 21~∞. Let B1 be the dummy variable that takes 1 if the recorded duration is in the first segment. B2 is the dummy variable that takes 1 if the recorded duration is in segment 2. B3 is the dummy for those whose recorded duration is in segment 3.
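The construction of the segment dummies can be sketched as follows. This is an illustration under the three-segment split described above (0~10, 11~20, 21 and beyond), treating a duration as belonging to a segment when it is at or below that segment's upper cut point. The helper name `segment_dummies` is hypothetical and is not a command from any package.

```python
def segment_dummies(duration, cutpoints):
    """Return a list of 0/1 dummies, one per segment.

    cutpoints holds the upper bounds of all segments except the last,
    e.g. [10, 20] for the segments 0~10, 11~20 and 21 or more.
    """
    dummies = [0] * (len(cutpoints) + 1)
    index = 0
    # Move to the next segment while the duration exceeds the upper bound.
    while index < len(cutpoints) and duration > cutpoints[index]:
        index += 1
    dummies[index] = 1
    return dummies

# Hypothetical recorded durations in years.
for years in (7, 10, 15, 25):
    print(years, segment_dummies(years, [10, 20]))
```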

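Then, the piecewise hazard function can be written as:

$$\lambda(t,x,\beta)=\exp\left(\log(\lambda_1)B_1+\log(\lambda_2)B_2+\log(\lambda_3)B_3+x\beta\right)$$

where $\log(\lambda_m)$ is the coefficient on the dummy $B_m$. In estimation, you estimate $\log(\lambda_1)$ ~ $\log(\lambda_3)$ together with β.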

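As can be seen, this is the same model as the exponential hazard model. The only difference is that you have included 3 dummies, B1, B2 and B3. The survival function of the piecewise-constant hazard model has a somewhat complicated form. For t in segment k, that is $c_{k-1}<t\le c_k$,

$$S(t)=\exp\left(-\exp(x\beta)\left[\sum_{m=1}^{k-1}\lambda_m\,(c_m-c_{m-1})+\lambda_k\,(t-c_{k-1})\right]\right)$$

where $c_0=0$.

It is easier to understand this with an example. Suppose you segment the duration into three pieces, with c1=5 and c2=10. [Figure: step function λ(t) against t, with levels λ1, λ2, λ3, cut points at 5 and 10, and an evaluation point t*.]

Now suppose that you want to compute the survival probability at t* years. Then it is given by:

$$S(t^*)=\begin{cases}\exp\left(-\exp(x\beta)\,\lambda_1 t^*\right) & \text{for } 0\le t^*\le 5\\ \exp\left(-\exp(x\beta)\left[5\lambda_1+\lambda_2(t^*-5)\right]\right) & \text{for } 5<t^*\le 10\\ \exp\left(-\exp(x\beta)\left[5\lambda_1+5\lambda_2+\lambda_3(t^*-10)\right]\right) & \text{for } t^*>10\end{cases}$$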
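The survival probability at t* under a piecewise-constant hazard can be computed by adding up, segment by segment, the hazard times the time spent in that segment, and then taking exp of minus that sum. The sketch below uses the cut points 5 and 10 from the example; the hazard pieces are hypothetical numbers, and the multiplier exp(xβ) is set to 1 (xβ = 0) by default.

```python
import math

def cumulative_hazard(t, cutpoints, hazards):
    """Sum over segments of hazard * (time spent in that segment up to t)."""
    total = 0.0
    lower = 0.0
    for m, rate in enumerate(hazards):
        upper = cutpoints[m] if m < len(cutpoints) else math.inf
        if t <= lower:
            break  # t lies before this segment starts
        total += rate * (min(t, upper) - lower)
        lower = upper
    return total

def piecewise_survival(t, cutpoints, hazards, xb=0.0):
    """S(t) = exp(-exp(x*beta) * cumulative hazard)."""
    return math.exp(-math.exp(xb) * cumulative_hazard(t, cutpoints, hazards))

cutpoints = [5.0, 10.0]           # c1 and c2 from the example
hazards = [0.02, 0.05, 0.10]      # hypothetical lambda1, lambda2, lambda3

for t_star in (3.0, 7.0, 12.0):
    print(t_star, piecewise_survival(t_star, cutpoints, hazards))
```

The same two functions work for any number of segments, as long as `hazards` has one more entry than `cutpoints`.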

In piecewise-hazard function estimation, it is a good idea to use the demeaned explanatory variables. This is because, if the explanatory variables are demeaned, then the estimated hazard pieces λ1~λM are the hazard pieces for the ‘average person’.

To see this, note that if you estimate the following hazard, where $\bar{x}$ is the vector of sample means,

$$\lambda(t,x,\beta)=\lambda_m\exp\left((x-\bar{x})\beta\right)\quad\text{for } t \text{ in segment } m$$

then, at the average $x=\bar{x}$, we have

$$\lambda(t,\bar{x},\beta)=\lambda_m\exp(0)=\lambda_m$$

Thus, the estimated hazard pieces are the hazard pieces for the "average person".

When you divide the duration into several segments, you should make sure that each segment contains at least one person who has been promoted to full professor. Otherwise, you cannot estimate this model. Finally, when you estimate the model, you should estimate it without the constant.
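The effect of demeaning can be checked with a few lines of code. This is a sketch with one explanatory variable and made-up numbers: once the variable is measured as a deviation from its sample mean, the multiplier exp((x - mean) β) equals 1 for a person at the mean, so the hazard for that person is just the hazard piece itself.

```python
import math

def demean(values):
    """Subtract the sample mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def hazard(piece, x_demeaned, beta):
    """Hazard in a segment: piece * exp((x - mean) * beta)."""
    return piece * math.exp(x_demeaned * beta)

book_rate = [0.0, 0.1, 0.2, 0.5]   # hypothetical values of one regressor
beta = 0.8                         # hypothetical coefficient
piece = 0.05                       # hypothetical hazard piece for one segment

centered = demean(book_rate)
print(centered)
# A person exactly at the mean has a demeaned value of 0.
print(hazard(piece, 0.0, beta))
```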

Piecewise-constant hazard example: Use promotion.dta to estimate the piecewise-constant hazard model. Let us divide the duration into segments in the following way:
Segment 1: 0~5 years
Segment 2: 6~10 years
:
Segment 5: 21~25 years
Segment 6: 26 years or greater

Demean all the explanatory variables except female, so that the estimated hazard pieces are for the “average males.”

Create the hazard piece dummies

Then demean the explanatory variables

Do not include the constant. The piecewise-constant hazard model is the same as the exponential hazard model plus the hazard piece dummies.

The female hazard is smaller than the male hazard by the multiplicative factor of exp(-0.271)=0.76. However, the coefficient is not significant. The survivor function for the piecewise-constant hazard model cannot be computed automatically by Stata. I recommend using Excel to compute it, since this is perhaps the quickest way to do so.

To compute the survivor function, note that the estimated hazard pieces are the exponentiated coefficients for tp1~tp6. Then, you use the following formula, where k is the segment that contains t:

$$S(t)=\exp\left(-\exp(x\beta)\left[\sum_{m=1}^{k-1}\lambda_m\,(c_m-c_{m-1})+\lambda_k\,(t-c_{k-1})\right]\right)$$

where $c_0=0$. The next slide provides a graphical illustration of how to compute this.

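Suppose that t is in the 3rd segment. [Figure: step function λ(t) with levels λ1, λ2, λ3, cut points at 5 and 10, and t located in the third segment.]

Then the survival probability is given by:

$$S(t)=\exp\left(-\int_0^t\lambda(s)\,ds\right)$$

This is simply the following:

$$S(t)=\exp\left(-\left[5\lambda_1+5\lambda_2+(t-10)\lambda_3\right]\right)$$

Now, let us compute and plot the survival function for the "average males". Since we have demeaned the explanatory variables, the exponentiated coefficients for tp1~tp6 are the hazard pieces for the "average males".

Now, let us compute the survival probability for females. To do so, you should simply notice that the hazard pieces for females are given by:

$$\lambda_m^{\text{female}}=\lambda_m\exp(\beta_{\text{female}}),\quad m=1,\dots,6$$

Thus, first, multiply the hazard pieces by exp(βfemale). Then compute the survivor function in the same way as that for males.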
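The two steps for females (scale the male hazard pieces by exp(βfemale), then reuse the survival computation) can be sketched as below. The female coefficient -0.271 is the one quoted in the lecture; the male hazard pieces are hypothetical, and only three segments with cut points 5 and 10 are used to keep the example short.

```python
import math

def piecewise_survival(t, cutpoints, hazards):
    """S(t) = exp(-sum of hazard * time spent in each segment up to t)."""
    total = 0.0
    lower = 0.0
    for m, rate in enumerate(hazards):
        upper = cutpoints[m] if m < len(cutpoints) else math.inf
        if t <= lower:
            break
        total += rate * (min(t, upper) - lower)
        lower = upper
    return math.exp(-total)

cutpoints = [5.0, 10.0]               # three segments only, for brevity
male_pieces = [0.01, 0.04, 0.08]      # hypothetical hazard pieces for the average male
beta_female = -0.271                  # female coefficient quoted in the lecture

# Step 1: scale each male hazard piece by exp(beta_female).
female_pieces = [lam * math.exp(beta_female) for lam in male_pieces]

# Step 2: compute survival probabilities exactly as for males.
for t in (5.0, 10.0, 20.0):
    print(t, piecewise_survival(t, cutpoints, male_pieces),
          piecewise_survival(t, cutpoints, female_pieces))
```

Because every female hazard piece is smaller than the corresponding male piece, the female survival curve lies above the male one at every positive duration.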

Survival functions for males and females