The Cox model in R Gardar Sveinbjörnsson, Jongkil Kim, Yongsheng Wang.


OUTLINE
- Recidivism data
- Cox PH Model for Time-Independent Variables in R
- Model Selection
- Model Diagnostics
- Cox PH Model for Time-Dependent Variables in R
- Summary

18 April 2011, Department of Mathematics, ETHZ

Recidivism data
The data are from an experimental study of recidivism of 432 male prisoners, who were observed for a year after being released from prison. Half of the prisoners were randomly given financial aid when they were released.

Variables in Recidivism Data
- week: week of first arrest after release, or censoring time
- arrest: the event indicator, 1 = arrested, 0 = not
- fin: 1 = received financial aid, 0 = not
- age: in years at the time of release
- race: 1 = black, 0 = others
- wexp: 1 = had full-time work experience, 0 = not
- mar: 1 = married, 0 = not
- paro: 1 = released on parole, 0 = not
- prio: number of prior convictions
- educ: codes 2 (grade 6 or less), 3 (grades 6 through 9), 4 (grades 10 and 11), 5 (grade 12), or 6 (some post-secondary)
- emp1 to emp52: 1 = employed in the corresponding week, 0 = not

Recidivism Data
> Rossi <- read.table('Rossi.txt', header=TRUE)
> Rossi[1:5, 1:10]  ## omitting the variables emp1 to emp52
  week arrest fin age race wexp mar paro prio educ

Cox PH Model for Time-Independent Variables in R

- Surv and coxph functions in R
- Cox regression
- Adjusted survival curve

Surv function in R
> Surv(time, time2, event, type=c('right', 'left', 'interval', 'counting'), origin=0)

Right-censored data
- Surv(time, event)
- time: survival or censoring time
- event: the status indicator (0 = censored, 1 = observed)

Left-truncated and right-censored data
- Surv(time, time2, event)
- time: left-truncation time
- time2: survival or censoring time
- event: the status indicator (0 = censored, 1 = observed)
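The two calling conventions above can be tried on a few made-up values. This is a minimal sketch: the numbers are hypothetical and are not taken from the Rossi data; only `Surv()` from the survival package is used.

```r
library(survival)

# Hypothetical toy values (assumption: not from the Rossi data).
week   <- c(20, 17, 52, 52, 25)
arrest <- c(1, 1, 0, 0, 1)

# Right-censored data: censored times are printed with a trailing "+".
s_right <- Surv(week, arrest)
print(s_right)

# Counting-process form: each row covers one (start, stop] interval.
start <- c(0, 1, 2)
stop  <- c(1, 2, 3)
event <- c(0, 0, 1)
s_count <- Surv(start, stop, event)
print(s_count)

# The type is inferred from the number of arguments.
print(attr(s_right, "type"))
print(attr(s_count, "type"))
```

Passing two time arguments is what switches `Surv()` to the counting-process type, which is the form needed later for time-dependent covariates.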

Coxph function in R
> coxph(formula, data=, weights, subset, na.action, init, control, method=c("efron","breslow","exact"), singular.ok=TRUE, robust=FALSE, model=FALSE, x=FALSE, y=TRUE, ...)
- Most of the arguments are similar to those of lm

Coxph function in R
- Formula
  - The right-hand side: the same as for a linear model
  - The left-hand side: a survival object
- Method: the method for handling ties. If there are no tied survival times, all the methods are equivalent.
  - Breslow: the default in most other Cox PH software
  - Efron: the default in coxph; generally more accurate than Breslow when there are tied survival times
  - Exact: computes the exact partial likelihood
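To see how the tie-handling choice plays out, one can fit the same model twice and compare coefficients. The sketch below uses a synthetic data set that only borrows the variable names week, arrest, fin, age and prio; the data frame `toy` and its generating parameters are assumptions made for illustration, not the Rossi data.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)                    # latent time to arrest, in weeks
toy$week   <- pmin(ceiling(t_ev), 52)    # follow-up ends at week 52
toy$arrest <- as.integer(t_ev <= 52)     # 1 = arrested, 0 = censored

form <- Surv(week, arrest) ~ fin + age + prio

# Weekly recording produces many tied event times, so the methods can differ.
fit_efron   <- coxph(form, data = toy)                       # coxph default
fit_breslow <- coxph(form, data = toy, method = "breslow")

print(round(cbind(efron = coef(fit_efron), breslow = coef(fit_breslow)), 4))
```

With heavily tied weekly data, the Breslow coefficients are typically pulled slightly toward zero relative to Efron.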

Cox regression
> mod.allison <- coxph(Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio + as.factor(educ), data=Rossi)
> mod.allison

Cox regression
Call: coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp + mar + paro + prio + as.factor(educ), data = Rossi)
[Coefficient table with columns coef, exp(coef), se(coef), z, p and rows fin, age, race, wexp, mar, paro, prio and the four non-baseline levels of as.factor(educ)]
Likelihood ratio test=38.7 on 11 df, p=6.01e-05
n= 432, number of events= 114

Adjusted survival curve
> plot(survfit(mod.allison), ylim=c(.7, 1), xlab='Weeks', ylab='Proportion Not Rearrested')

Adjusted survival curve
- We may wish to display how estimated survival depends on the value of a covariate. Because the principal purpose of the recidivism study was to assess the impact of financial aid on rearrest, we focus on this covariate.
- We construct a new data frame with two rows, one for each value of fin; the other covariates are fixed at their medians.

Adjusted survival curve
> # with() makes the columns of Rossi visible; one row per value of fin
> Rossi.fin <- with(Rossi, data.frame(fin=c(0,1), age=rep(median(age),2), race=rep(median(race),2), wexp=rep(median(wexp),2), mar=rep(median(mar),2), paro=rep(median(paro),2), prio=rep(median(prio),2), educ=as.factor(rep(median(educ),2))))
> plot(survfit(mod.allison, newdata=Rossi.fin), conf.int=TRUE, lty=c(1,2), col=c('red', 'blue'), ylim=c(.5, 1), xlab='Weeks', ylab='Proportion Not Rearrested')

Model Selection

Model Selection
- Why variable selection?
- Purposeful selection
- Stepwise selection
- Best Subset Selection of Covariates

Why variable selection?
- We generally want to explain the data in the simplest way.
- Unnecessary predictors in a model affect the estimation of other quantities; in other words, degrees of freedom are wasted.
- If the model is to be used for prediction, we save effort, time and/or money if we do not have to collect data for redundant predictors.

Why variable selection?
- We must decide on a method to select a subset of variables.
  - Purposeful selection
  - Stepwise selection (using p-values or using AIC)
  - Best subset selection

Purposeful selection
1. We fit a multivariable model containing all variables that were significant in a univariable analysis at the 20-25% level.
2. We use the p-values from the Wald statistic to remove variables from our model. We also confirm the non-significance with a likelihood ratio test.
3. We check whether the removal has produced an "important" change in the coefficients of the other variables.
4. We check again all the variables that we removed.
5. We check for nonlinearity.
6. We look for interactions.
7. We check assumptions.

Stepwise selection
- Stepwise selection is a mix of forward and backward selection.
- We can start with either an empty model or a full model and add/remove predictors according to some criterion.
- At each step we reconsider terms that were added or removed earlier.
  - Often applied in practice
  - Available through the step() function in R
  - In practice often based on AIC/BIC
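A minimal sketch of AIC-based stepwise selection with `step()` on a Cox model follows. The data are synthetic (the data frame `toy`, the extra column `noise` and all generating parameters are assumptions for illustration), so the selected model says nothing about the real recidivism data.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin   = rbinom(n, 1, 0.5),
                  age   = sample(18:45, n, replace = TRUE),
                  prio  = rpois(n, 3),
                  noise = rnorm(n))      # predictor unrelated to the outcome
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

# Start from the full model; step() drops (and re-adds) terms by AIC.
full <- coxph(Surv(week, arrest) ~ fin + age + prio + noise, data = toy)
sel  <- step(full, direction = "both", trace = 0)

print(formula(sel))
print(c(full = extractAIC(full)[2], selected = extractAIC(sel)[2]))
```

Setting `trace = 1` instead prints the drop/add table at every step, which is the kind of output shown on the slide for the real data.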

Stepwise selection
- The AIC is a measure of the relative goodness of fit of a statistical model.
- It not only rewards goodness of fit, but also includes a penalty that is an increasing function of the number of parameters.
- AIC = 2k - 2 max(log-likelihood), where k is the number of parameters in the model.
- The smaller the AIC, the better.
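The AIC formula can be checked by hand against R's `AIC()` for a fitted Cox model, where the maximized log partial likelihood plays the role of the log-likelihood. The data below are synthetic and the names `toy`, `fit` and `aic_manual` are illustrative assumptions.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

fit <- coxph(Surv(week, arrest) ~ fin + age + prio, data = toy)

k          <- length(coef(fit))        # number of estimated parameters
loglik_max <- fit$loglik[2]            # loglik[1] is the null model, loglik[2] the fitted one
aic_manual <- 2 * k - 2 * loglik_max

print(c(manual = aic_manual, builtin = AIC(fit)))
```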

Stepwise selection using our data
Final step of the search: Surv(week, arrest) ~ fin + age + mar + prio
[step() output table with columns Df and AIC, listing the terms mar, fin, age and prio]

Best Subset Selection
- Stepwise selection only considers a small number of all the possible models.
- Best subset selection provides a way to check all the possible models.
- As in linear regression, we need a criterion to judge the models.
- Idea: the criterion is not only based on goodness of fit, but also penalizes model size.

Best Subset Selection
- Mallow's C: C = W + (p - 2q); a smaller C is better
- p: number of variables under consideration
- q: number of variables not included in the subset model
- W = W(p) - W(p-q), where W(p) is the Wald statistic for the model containing all p variables and W(p-q) is the Wald statistic for the subset model
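Taking the formula above at face value, C can be computed from two fitted `coxph` objects, since each stores its overall Wald statistic in the `wald.test` component. The helper `mallows_c()` is a hypothetical name introduced here, and the data are synthetic, so the values do not correspond to the table for the real data.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

# Hypothetical helper: C = W + (p - 2q) with W = W(p) - W(p - q).
mallows_c <- function(full, sub) {
  p <- length(coef(full))                 # variables under consideration
  q <- p - length(coef(sub))              # variables left out of the subset
  W <- as.numeric(full$wald.test) - as.numeric(sub$wald.test)
  W + (p - 2 * q)
}

full <- coxph(Surv(week, arrest) ~ fin + age + prio, data = toy)
sub  <- coxph(Surv(week, arrest) ~ fin + prio,       data = toy)

print(c(full_model = mallows_c(full, full), without_age = mallows_c(full, sub)))
```

For the full model q = 0 and W = 0, so C equals p; a subset is attractive when dropping variables lowers C below that value.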

Best Subset Selection of Covariates
- Check the model

Variables                          Mallow's C
fin, age, mar, prio                6.56
fin, age, mar, prio, race          7.22
fin, age, mar, prio, wexp          7.81
fin, age, mar, prio, paro          8.47
fin, age, mar, prio, educ          5.39
fin, age, mar, prio, race, paro    9.09
fin, age, mar, prio, wexp, paro    9.75
fin, age, mar, prio, race, wexp    8.53
fin, mar, prio                     11.77
fin, age, prio                     6.60
fin, age, mar                      17.34
age, mar, prio                     8.28

Model Diagnostics

Model Diagnostics
- Analyze PH assumption with residuals
- Influential observations
- Checking nonlinearity

Analyze PH assumption with residuals
- We have strong evidence against the PH assumption for age.
- Plotting the result of cox.zph shows the scaled Schoenfeld residuals.
> cox.zph(mod.allison.4)
[Test table with columns rho, chisq, p and rows fin, age, prio, mar, GLOBAL]

Analyze PH assumption with residuals
> plot(cox.zph(mod.allison.4))
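The test and the plot can be reproduced in a self-contained way as sketched below. The data are synthetic (an assumption), so no PH violation is built in; note also that recent versions of the survival package print columns chisq, df and p rather than rho, chisq and p.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

fit <- coxph(Surv(week, arrest) ~ fin + age + prio, data = toy)

# One test per covariate plus a GLOBAL test; small p-values suggest non-PH.
zph <- cox.zph(fit)
print(zph)

# Scaled Schoenfeld residuals against time, one panel per covariate.
# A flat smooth is consistent with proportional hazards.
if (interactive()) plot(zph)
```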

Analyze PH assumption with residuals
- For the variable age, the residual plot shows a trend over time.
- There are two possible solutions.
  - The effect of age differs between age groups: use age as a stratification variable.
  - The effect of age declines over time: include an interaction between age and time.

Analyze PH assumption (Strata)
- The variable age is used as a "strata" variable.
- We separate the observations into 4 groups by age: 19 or younger, 20 to 25, 26 to 30, 31 or older.
> library(car)
> Rossi$age.cat <- recode(Rossi$age, "lo:19=1; 20:25=2; 26:30=3; 31:hi=4")
> table(Rossi$age.cat)
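The same grouping can be sketched in base R with `cut()`, which avoids the dependency on the car package. This is an alternative to the `recode()` call above, not the method used in the presentation; the ages below are hypothetical.

```r
# Hypothetical ages chosen to hit every boundary (assumption: not real data).
age <- c(17, 19, 20, 25, 26, 30, 31, 44)

# Intervals are right-closed: (-Inf,19], (19,25], (25,30], (30,Inf).
age_cat <- cut(age,
               breaks = c(-Inf, 19, 25, 30, Inf),
               labels = c("<=19", "20-25", "26-30", ">=31"))

print(table(age_cat))
```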

Analyze PH assumption (Strata)
- Use the age groups as strata and check the PH assumption again.
> mod.allison.6 <- coxph(Surv(week, arrest) ~ fin + prio + strata(age.cat) + mar, data=Rossi)
> cox.zph(mod.allison.6)
[Test table with columns rho, chisq, p and rows fin, prio, mar, GLOBAL]

Analyze PH assumption (Interaction)
- To analyze the interaction between age and time, we must transform the data into one row per subject and week.
- Example: for the 1st observation, we change the interval (0,20] to (0,1+], (1,2+], (2,3+], ..., (19,20], where "+" marks a censored interval.
> Rossi[1,1:10]
  week arrest fin age race wexp mar paro prio educ
> Rossi.2[1:20,1:10]
  start stop arrest.time week arrest fin age ... prio educ
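One way to carry out this kind of splitting is `survSplit()` from the survival package; this is a swapped-in technique, since the transcript does not show which function produced Rossi.2. The single-row data frame below is hypothetical, and the output keeps the original column names week and arrest for the interval end and the event indicator.

```r
library(survival)

# One hypothetical subject arrested in week 20 (assumption: illustrative values).
one <- data.frame(week = 20, arrest = 1, fin = 0, age = 27)

# Cut follow-up at every week boundary; rows are (start, week] intervals.
long <- survSplit(Surv(week, arrest) ~ ., data = one,
                  cut = 1:52, start = "start")

print(head(long, 3))
print(tail(long, 3))
```

Only the last interval, (19,20], carries arrest = 1; the 19 earlier intervals are censored, matching the "(0,1+], ..., (19,20]" pattern described above.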

Analyze PH assumption (Interaction)
- There is an interaction between time and age.
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + age:stop + prio + mar, data = Rossi.2)
[Coefficient table with columns coef, exp(coef), se(coef), z, p and rows fin, age, prio, mar, age:stop]
Likelihood ratio test=38.1 on 5 df, p=3.55e-07
- Age has a positive partial effect on the hazard at first, but this effect gets smaller with time and becomes negative after about 10 weeks.

Influential observations
- For each covariate we look at how much the regression coefficients change if we remove one observation.
- In R, the argument type="dfbeta" to the residuals() function produces a matrix of estimated changes in the regression coefficients upon deleting each observation in turn.
- We then plot these changes.
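A minimal sketch of the dfbeta computation is given below on synthetic data (the data frame `toy` and its parameters are assumptions). Each row of the resulting matrix belongs to one observation and each column to one coefficient.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

fit <- coxph(Surv(week, arrest) ~ fin + age + prio, data = toy)

# Approximate change in each coefficient when one observation is deleted.
dfb <- residuals(fit, type = "dfbeta")
print(dim(dfb))

# Observation with the largest influence on the age coefficient (column 2).
most_influential <- which.max(abs(dfb[, 2]))
print(most_influential)

# Index plot for age; values are best judged relative to se(coef).
if (interactive()) {
  plot(dfb[, 2], ylab = "dfbeta for age")
  abline(h = 0, lty = 2)
}
```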


Influential observations (just for fun)
- Let's see what happens if I change some observations.
- I take the first age observation and change it, first to age=60 and then to age=110.
- See the R demonstration.

Checking nonlinearity
- Nonlinearity is a problem in Cox regression, as it is in linear and generalized linear models.
- To detect nonlinearity we plot the Martingale residuals against the covariates.
- We add a smooth produced by local linear regression using the loess function and look for systematic deviations from zero.

Martingale residuals
- The Martingale residual for individual i at time $t_i$ is $r_i = \delta_i - \hat{H}_i(t_i)$, where
  - $\delta_i$ is the event indicator,
  - $\hat{H}_i$ is the estimated cumulative hazard function for individual i,
  - $t_i$ is the time at the end of follow-up for individual i.
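The residual plot with a loess smooth can be sketched as follows. The data are synthetic, and the covariate age is deliberately left out of the model so that its functional form can be read off the residuals; `loess.smooth()` is used here as one convenient way to get a local linear smooth.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

# Fit without age, then inspect the residuals against age.
fit  <- coxph(Surv(week, arrest) ~ fin + prio, data = toy)
mres <- residuals(fit, type = "martingale")   # delta_i minus estimated cumulative hazard

# Local linear smooth (degree = 1) of the residuals against age.
sm <- loess.smooth(toy$age, mres, degree = 1)

if (interactive()) {
  plot(toy$age, mres, xlab = "age", ylab = "Martingale residual")
  lines(sm, col = "red")
  abline(h = 0, lty = 2)
}
```

A roughly straight smooth suggests that a linear term is adequate; pronounced curvature suggests a transformation.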


In case of nonlinearity
- We try to transform the covariate whose effect is not linear.
- We can try several transformations, e.g. log or sqrt.
- We can also include higher-order terms in our model and compare it with the original model using a likelihood ratio test.
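The likelihood ratio comparison of a linear and a quadratic specification can be sketched as below, again on synthetic data (an assumption), where the true effect of age is linear by construction.

```r
library(survival)

# Synthetic stand-in for the Rossi data (assumption: NOT the real data).
set.seed(1)
n   <- 300
toy <- data.frame(fin  = rbinom(n, 1, 0.5),
                  age  = sample(18:45, n, replace = TRUE),
                  prio = rpois(n, 3))
rate <- 0.01 * exp(-0.4 * toy$fin - 0.05 * (toy$age - 25) + 0.1 * toy$prio)
t_ev <- rexp(n, rate)
toy$week   <- pmin(ceiling(t_ev), 52)
toy$arrest <- as.integer(t_ev <= 52)

fit_lin  <- coxph(Surv(week, arrest) ~ fin + age + prio,            data = toy)
fit_quad <- coxph(Surv(week, arrest) ~ fin + age + I(age^2) + prio, data = toy)

# Built-in comparison of the nested models.
lrt <- anova(fit_lin, fit_quad)
print(lrt)

# The same test by hand: twice the gain in log partial likelihood, 1 df.
lr_stat <- 2 * (fit_quad$loglik[2] - fit_lin$loglik[2])
p_val   <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
print(c(statistic = lr_stat, p = p_val))
```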

Cox PH Model for Time-Dependent Variables in R

Time-Dependent Variables
- Cox PH Model for Time-Dependent Variables
- Data Transformation
- Model with Time-Dependent Variables
- Model with Lagged Time-Dependent Variable

Cox PH Model for Time-Dependent Variables
- Recall the Cox PH model for time-dependent variables.
- $g_i(t)$, which depends on time t, can be:
  - 0 (time-independent)
  - t, ln(t), etc.
  - One variable at a time: $g_i(t) = 1$ if $t = t_0, t_1, t_2, \ldots$, and 0 otherwise
  - Heaviside function: $g_i(t) = 1$ if $t \ge t_0$, and 0 if $t < t_0$
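A common way to write the extended Cox model that matches the $g_i(t)$ notation above is sketched here. The exact symbols ($h_0$, $\beta_i$, $\delta_i$, $p$) are an assumption based on the usual textbook form, not a quotation from the slide.

```latex
h\bigl(t, X(t)\bigr) = h_0(t)\,
  \exp\Bigl[\,\sum_{i=1}^{p} \beta_i X_i
           + \sum_{i=1}^{p} \delta_i X_i\, g_i(t)\Bigr]
```

With $g_i(t) = 0$ for every i the second sum vanishes and the ordinary Cox PH model is recovered; with $g_i(t) = t$ the log hazard ratio for $X_i$ changes linearly in time, at rate $\delta_i$.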

Data Transformation
- Now we want to assess the effect of weekly employment on the rearrest hazard.
- The weekly employment indicators appear in a single row with 52 columns; we need them in rows, one per week.
> Rossi[1,]
[One row with the columns week, arrest, fin, age, race, wexp, mar, paro, prio, educ and emp1 to emp52; the later employment indicators for this subject are NA]

Data Transformation
- Transformed data (weekly employment indicators in rows)
> Rossi.2[1:50,]
  start stop arrest.time week arrest fin age race wexp mar paro prio educ employed
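The wide-to-long reshaping can be sketched in base R as follows. The two subjects, the five employment columns and the helper variable names are hypothetical; the real data have 432 subjects and 52 weekly columns.

```r
# Hypothetical wide data: two subjects, five weekly employment indicators.
wide <- data.frame(id = 1:2, week = c(3, 5), arrest = c(1, 0),
                   emp1 = c(0, 1), emp2 = c(1, 1), emp3 = c(0, 1),
                   emp4 = c(NA, 0), emp5 = c(NA, 0))

# One row per subject and week at risk, in (start, stop] form.
rows <- lapply(seq_len(nrow(wide)), function(i) {
  w    <- wide$week[i]
  cols <- paste0("emp", seq_len(w))
  data.frame(id          = wide$id[i],
             start       = 0:(w - 1),
             stop        = 1:w,
             # the event can only occur in the subject's last interval
             arrest.time = c(rep(0, w - 1), wide$arrest[i]),
             employed    = unlist(wide[i, cols], use.names = FALSE))
})
long <- do.call(rbind, rows)

print(long)
```

Subject 1 contributes three rows ending in an arrest, subject 2 contributes five censored rows, and the NA entries after the arrest week are never read.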

Model with Time-Dependent Variables
- We treat weekly employment as a time-dependent predictor of the time to rearrest.
- Suggested model: $h(t, X(t)) = h_0(t)\exp\left[\sum_i \beta_i X_i + \delta_{employed} X_{employed}(t)\right]$
- $X_{employed}(t)$ indicates whether the person is employed in week t (0 or 1).
- We estimate the coefficients $\beta_i$ and $\delta_{employed}$.

Model with Time-Dependent Variable
- The weekly employment variable has an apparently large effect.
- The hazard of rearrest is smaller by a factor of exp(coef), a decline of 73.5%, during weeks in which the person is employed.
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + mar + prio + employed, data = Rossi.2)
[Coefficient table with columns coef, exp(coef), se(coef), z, Pr(>|z|) and rows fin, age (*), mar, prio (**), employed (***)]
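Fitting a model with a time-dependent covariate only requires data in (start, stop] form. The sketch below simulates such data with a weekly employment indicator that lowers the arrest risk; every parameter and name in it is an assumption for illustration, so the estimate is unrelated to the one reported for the real data.

```r
library(survival)

# Simulated person-week data (assumption: NOT the Rossi data).
set.seed(2)
n <- 150
rows <- lapply(seq_len(n), function(i) {
  emp  <- rbinom(52, 1, 0.5)                 # employed in week 1..52?
  risk <- 0.03 * exp(-1 * emp)               # weekly arrest probability
  hit  <- which(runif(52) < risk)            # weeks in which an arrest would occur
  last <- if (length(hit) > 0) hit[1] else 52
  data.frame(id          = i,
             start       = 0:(last - 1),
             stop        = 1:last,
             arrest.time = c(rep(0, last - 1), as.integer(length(hit) > 0)),
             employed    = emp[1:last])
})
long <- do.call(rbind, rows)

# Counting-process Surv object on the left-hand side.
fit_td <- coxph(Surv(start, stop, arrest.time) ~ employed, data = long)
print(summary(fit_td)$coefficients)
```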

Model with Lagged Time-Dependent Variables
- Claim: the direction of causality is not clear, because a person cannot work when he is in jail.
- Weekly employment at time t and arrest at time t: ambiguous causality.
- Weekly employment at time t-1 and arrest at time t: employment precedes the arrest.

Model with Lagged Time-Dependent Variables
- Use a lagged value of employment, taken from the previous week.
- Model with lagged time: we apply the lag to the data and then use the same command and arguments in R.
- The values of arrest.time are shifted by the lag relative to the employment indicator.
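A one-week lag can be sketched in base R with `ave()`, which applies the shift within each subject. The small data frame and the column name `employed.lag1` are hypothetical; the first week of each subject has no previous week and is dropped.

```r
# Hypothetical person-week data for two subjects (assumption: illustrative values).
long <- data.frame(id       = c(1, 1, 1, 2, 2),
                   stop     = c(1, 2, 3, 1, 2),
                   employed = c(0, 1, 1, 1, 0))

# Shift employment down by one row within each subject.
long$employed.lag1 <- ave(long$employed, long$id,
                          FUN = function(x) c(NA, head(x, -1)))

# Rows without a previous week cannot be used in the lagged model.
lagged <- long[!is.na(long$employed.lag1), ]
print(lagged)
```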

Model with Lagged Time-Dependent Variables
- The coefficient for the lagged employment variable is still significant, but the estimated effect is much smaller: exp(coef) = 0.45.
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age + mar + prio + employed, data = Rossi.3)
[Coefficient table with columns coef, exp(coef), se(coef), z, Pr(>|z|) and rows fin, age (*), mar, prio (***), employed (***)]
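A hazard ratio translates into a percentage change in the hazard as 100 * (HR - 1). The helper name `hr_to_pct` is hypothetical; the value 0.45 is the hazard ratio quoted above for lagged employment.

```r
# Percentage change in the hazard implied by a hazard ratio.
hr_to_pct <- function(hr) 100 * (hr - 1)

# Hazard ratio 0.45 for lagged employment: about 55% lower hazard.
print(hr_to_pct(0.45))

# Starting from a coefficient instead: HR = exp(coef).
coef_employed <- log(0.45)
print(hr_to_pct(exp(coef_employed)))
```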

Summary

Final model
- After we introduced weekly employment into our model, the marriage variable became non-significant. We therefore remove it.
- Because age does not satisfy the PH assumption, we use it as a strata variable, which also eases interpretation.

Final model
coxph(formula = Surv(start, stop, arrest.time) ~ fin + strata(age.cat) + prio + employed, data = Rossi.3)
[Coefficient table with columns coef, exp(coef), se(coef), z, Pr(>|z|) and rows fin, prio (***), employed (***)]
[Confidence interval table with columns exp(coef), exp(-coef), lower.95, upper.95 and rows fin, prio, employed]
Likelihood ratio test on 3 df, p=1.325e-06
Wald test on 3 df, p=1.182e-06
Score (logrank) test on 3 df, p=6.933e-07

Final model
- Financial aid (row fin of the coefficient table)
- The estimated hazard ratio for receiving financial aid is exp(coef).
- This means that, holding the other covariates constant, the rearrest hazard of subjects with financial aid is 29% lower.

Final model
- Number of prior convictions (row prio of the coefficient table, ***)
- The estimated hazard ratio is exp(coef).
- Holding the other covariates constant, each additional prior conviction increases the weekly hazard of rearrest by 9 percent.

Final model
- Employment (row employed of the coefficient table, ***)
- The estimated hazard ratio is exp(coef).
- This means that the hazard of rearrest is 56 percent lower during a week in which the former inmate was employed.

Summary
- Cox PH Model for Time-Independent Variables in R
  - Surv and coxph functions in R
  - Cox regression
  - Adjusted survival curve
- Model Selection
  - Why variable selection?
  - Purposeful selection
  - Stepwise selection
  - Best Subset Selection

Summary
- Model Diagnostics
  - Analyze PH assumption with residuals
  - Influential observations
  - Checking nonlinearity
- Cox PH Model for Time-Dependent Variables in R
  - Model description
  - Analysis of the results
  - Lagged variables
- Final model