Lecture 11: Hypothesis Testing III: Stratified Tests, Renyi Tests, and Other Tests
Stratified Tests
- Adjust for a covariate
- Allow you to control for a confounder without using a regression approach
- However:
  - As with a regression that has no interaction term, an interaction will not be detected if it is present
  - Assumes the ‘treatment’ effect is the same across strata
Sometimes Confusing
- “Stratified” analysis can mean either:
  - Subgroup analysis
  - Stratified “combined” test
- In this case, we mean the combined test
- Recall the Mantel-Haenszel odds ratio
Notation
- Now three variables:
  - Outcome (time to event)
  - Group variable (e.g., treatment)
  - Strata variable (e.g., gender, cancer grade)
- j = 1, 2, ..., K indexes groups
- s = 1, 2, ..., M indexes strata
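Similar to the Standard Test
- Formal hypothesis: within every stratum, the hazard rates of the K groups are equal for all t ≤ τ
- Now, Z_j.(τ) is represented by a sum over strata of the stratum-specific statistics Z_js(τ)
- From there, inference is the same
  - Chi-square test with K − 1 d.f., where Σ^-1 is the inverse of the estimated variance-covariance matrix (also summed over strata)
  - For the 2-group scenario it can be reduced to a Z-score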
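The stratified statistic described above can be written out as follows. This is a sketch in the usual log-rank notation (as in K&M) with weights W = 1; the symbols d_{ijs}, Y_{ijs}, D_s and \hat\sigma_{jgs} are assumed here, since the slide formulas are not reproduced in the text.

```latex
H_0:\; h_{1s}(t) = h_{2s}(t) = \cdots = h_{Ks}(t),
\qquad s = 1,\dots,M,\quad t \le \tau

Z_{js}(\tau) = \sum_{i=1}^{D_s} \left[ d_{ijs} - Y_{ijs}\,\frac{d_{is}}{Y_{is}} \right],
\qquad
Z_{j.}(\tau) = \sum_{s=1}^{M} Z_{js}(\tau),
\qquad
\hat\sigma_{jg.} = \sum_{s=1}^{M} \hat\sigma_{jgs}

\chi^2 = \bigl(Z_{1.}(\tau),\dots,Z_{K-1,.}(\tau)\bigr)\,
         \hat\Sigma_{.}^{-1}\,
         \bigl(Z_{1.}(\tau),\dots,Z_{K-1,.}(\tau)\bigr)^{\top}
\;\sim\; \chi^2_{K-1} \text{ under } H_0

\text{For } K = 2:\qquad
Z = \frac{\sum_{s=1}^{M} Z_{1s}(\tau)}{\sqrt{\sum_{s=1}^{M} \hat\sigma_{11s}}}
\;\approx\; N(0,1) \text{ under } H_0
```

Here d_{ijs} and Y_{ijs} are the deaths and the number at risk in group j, stratum s, at the i-th death time of that stratum, and d_{is}, Y_{is} are the corresponding totals over groups.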
Asymptotics
- Just like the unstratified test, this requires large N
- Here it requires an even larger N: think about dividing the sample into M strata
- In many practical cases, N is probably not sufficient
Small Example
- 20 subjects received 1 of two treatments
  - 9 patients received treatment 1
  - 11 patients received treatment 2
- Patients were also categorized by disease type (2 strata)
- Question: Do the data show a treatment effect after adjusting for disease type?
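Data (Death: 1 = death, 0 = censored)
Time  Death  Trt  Disease
   1      1    1        1
   5      1    1        1
   5      0    1        1
   6      0    1        2
   8      1    1        1
  37      1    1        2
  49      1    1        1
  58      1    1        2
  79      0    1        2
  11      0    2        1
  50      1    2        1
  51      1    2        2
  62      1    2        1
  67      1    2        1
  73      1    2        1
  86      1    2        1
  90      1    2        2
  96      1    2        2
  97      1    2        2
  97      1    2        2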
What First?
- Data in standard format (+ denotes a censored time):
  - Trt 1: 1, 5, 5+, 6+, 8, 37, 49, 58, 79+
  - Trt 2: 11+, 50, 51, 62, 67, 73, 86, 90, 96, 97, 97
- We might first conduct a global test
- What is our hypothesis?
Constructing Statistic
Calculate Statistic
- Z-statistic
- χ² statistic
Now Let’s Adjust for Disease Type
Steps:
1. Divide the data according to strata
2. Calculate Z_js(τ) and its estimated variance within each stratum
3. Sum Z_js(τ) and the variances across strata to get Z_j.(τ) and its variance
4. Calculate your test statistic from these sums
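The steps above can be sketched in base R for the two-group case. This is a minimal illustration under stated assumptions, not the survival package's implementation: the helper name `logrank_parts` is hypothetical, log-rank weights (W = 1) are assumed, and the data are the 20-subject example from this lecture.

```r
# 20-subject example: time, treatment, disease stratum, death indicator
times <- c(1,5,5,6,8,11,37,49,50,51,58,62,67,73,79,86,90,96,97,97)
trt   <- c(1,1,1,1,1,2,1,1,2,2,1,2,2,2,1,2,2,2,2,2)
strat <- c(1,1,1,2,1,1,2,1,1,2,2,1,1,1,2,1,2,2,2,2)
death <- c(1,1,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1)

# Hypothetical helper (assumption, not a library function):
# log-rank numerator Z = sum(d1 - Y1 * d / Y) and its variance for
# group g1 within one stratum, two-group case, weights W = 1.
logrank_parts <- function(time, status, group, g1 = 1) {
  z <- 0
  v <- 0
  for (tt in sort(unique(time[status == 1]))) {
    Y  <- sum(time >= tt)                       # at risk, both groups
    Y1 <- sum(time >= tt & group == g1)         # at risk, group g1
    d  <- sum(time == tt & status == 1)         # deaths, both groups
    d1 <- sum(time == tt & status == 1 & group == g1)
    z  <- z + (d1 - Y1 * d / Y)
    if (Y > 1) {
      v <- v + d * (Y1 / Y) * (1 - Y1 / Y) * (Y - d) / (Y - 1)
    }
  }
  c(z = z, v = v)
}

# Steps 1-2: split by stratum, compute Z and its variance in each
by_stratum <- sapply(split(seq_along(times), strat), function(i) {
  logrank_parts(times[i], death[i], trt[i])
})

# Step 3: sum across strata. Step 4: chi-square (1 d.f.) and p-value
total       <- rowSums(by_stratum)
chisq_strat <- unname(total["z"]^2 / total["v"])
p_strat     <- pchisq(chisq_strat, df = 1, lower.tail = FALSE)
```

Under these assumptions, `chisq_strat` should agree up to rounding with the (O-E)^2/V column reported by `survdiff()` when `strata(strat)` is added to the formula.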
Divide Data By Strata (Death: 1 = death, 0 = censored)
Disease 1
Time  Death  Trt
   1      1    1
   5      1    1
   5      0    1
   8      1    1
  49      1    1
  11      0    2
  50      1    2
  62      1    2
  67      1    2
  73      1    2
  86      1    2
Disease 2
Time  Death  Trt
   6      0    1
  37      1    1
  58      1    1
  79      0    1
  51      1    2
  90      1    2
  96      1    2
  97      1    2
  97      1    2
Calculate Z_js(τ) and its estimated variance: Disease 1
Calculate Z_js(τ) and its estimated variance: Disease 2
Calculate the Statistic
- Z (or chi-square)
- What is our conclusion?
R Code
> library(survival)
> times <- c(1,5,5,6,8,11,37,49,50,51,58,62,67,73,79,86,90,96,97,97)
> trt   <- c(1,1,1,1,1,2,1,1,2,2,1,2,2,2,1,2,2,2,2,2)
> strat <- c(1,1,1,2,1,1,2,1,1,2,2,1,1,1,2,1,2,2,2,2)
> death <- c(1,1,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1)
> st <- Surv(times, death)   # survival object used in the calls below
> # Global (unstratified) log-rank test
> survdiff(st ~ trt)
Call:
survdiff(formula = st ~ trt)

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1  9        6     2.63     4.329       6.1
trt=2 11       10    13.37     0.851       6.1

 Chisq= 6.1  on 1 degrees of freedom, p= 0.0136
R Code
> # Stratified log-rank test (strata = disease type)
> survdiff(st ~ trt + strata(strat))
Call:
survdiff(formula = st ~ trt + strata(strat))

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1  9        6     2.27      6.16      9.46
trt=2 11       10    13.73      1.02      9.46

 Chisq= 9.5  on 1 degrees of freedom, p= 0.0021
BMT: Hodgkin’s & Non-Hodgkin’s Lymphoma
- Study included 43 BMT patients
- Is there a difference in hazard rates between:
  - Allogeneic transplant = HLA-matched sibling donor (N=16)
  - Autologous transplant = own “cleaned” marrow (N=27)
- But we want to adjust for disease state:
  - Non-Hodgkin’s lymphoma (N=23)
  - Hodgkin’s disease (N=20)
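Global Test (group 1 = allo)
t_i   d_i  Y_i  Y_i1  d_i1 - Y_i1(d_i/Y_i)  Variance term
  2     1   43    16                 0.628          0.234
  4     1   42    15                 0.643          0.230
 28     1   41    14                 0.659          0.225
 30     1   40    13                -0.325          0.219
 32     1   39    13                 0.667          0.222
  …
132     1   22     7                -0.318          0.217
140     1   21     7                -0.333          0.222
252     1   18     7                -0.389          0.238
357     1   16     7                 0.563          0.246
Sum                                  0.886          5.841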
Global Test Results
> library(survival)
> dat <- read.csv("C:\\BJW\\AutoAllo.csv")
> d <- dat$death; t <- dat$time
> dis <- dat$disease; type <- dat$graft
> nostrat <- survdiff(Surv(t, d) ~ type)
> nostrat
Call:
survdiff(formula = Surv(t, d) ~ type)

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 16       10     9.11    0.0862     0.134
type=2 27       16    16.89    0.0465     0.134

 Chisq= 0.1  on 1 degrees of freedom, p= 0.714
Stratified by Disease Type: Non-Hodgkin’s Lymphoma subjects (group 1 = allo)
t_i   d_i  Y_i  Y_i1  d_i1 - Y_i1(d_i/Y_i)  Variance term
 28     1   23    11                 0.522          0.250
 32     1   22    10                 0.545          0.248
 42     1   21     9                -0.429          0.245
 49     1   20     9                 0.550          0.248
 53     1   19     8                -0.421          0.244
 57     1   18     8                -0.444          0.247
 63     1   17     8                -0.471          0.249
 81     2   16     8                -1.000          0.467
 84     1   14     8                 0.429          0.245
140     1   13     7                -0.538          0.249
252     1   11     7                -0.636          0.231
357     1   10     7                 0.300          0.210
524     1    8     6                -0.750          0.188
Sum                                 -2.344          3.319
Stratified by Disease Type: Hodgkin’s Disease subjects (group 1 = allo)
t_i   d_i  Y_i  Y_i1  d_i1 - Y_i1(d_i/Y_i)  Variance term
  2     1   20     5                 0.750          0.188
  4     1   19     4                 0.789          0.166
 30     1   18     3                -0.167          0.139
 36     1   17     3                -0.176          0.145
 41     1   16     3                -0.188          0.152
 52     1   15     3                -0.200          0.160
 62     1   14     3                -0.214          0.168
 72     1   13     3                 0.769          0.178
 77     1   12     2                 0.833          0.139
 79     1   11     1                 0.909          0.083
108     1   10     0                 0.000          0.000
132     1    9     0                 0.000          0.000
Sum                                  3.106          1.518
Stratified Test Results
> strat <- survdiff(Surv(t, d) ~ type + strata(dis))
> strat
Call:
survdiff(formula = Surv(t, d) ~ type + strata(dis))

        N Observed Expected (O-E)^2/E (O-E)^2/V
type=1 16       10     9.24    0.0629      0.12
type=2 27       16    16.76    0.0347      0.12

 Chisq= 0.1  on 1 degrees of freedom, p= 0.729
Stratified Test Results
- Again we fail to reject
- This seems to be in error (recall that our survival curves looked VERY different)
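Problem?
- The treatment effect is not the same in the 2 disease states
- The effects are in different directions:
  - Z_NHL = -2.344 (Non-Hodgkin’s lymphoma stratum)
  - Z_HOD = 3.106 (Hodgkin’s disease stratum)
- Summed across strata they nearly cancel, so the stratified approach is NOT appropriate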
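The cancellation can be checked with a few lines of arithmetic. The sketch below only reuses the stratum sums from the two tables in this lecture; the variable names are illustrative.

```r
# Stratum-specific sums taken from the two disease-type tables
z <- c(NHL = -2.344, Hodgkins = 3.106)  # sum of d_i1 - Y_i1 * d_i / Y_i
v <- c(NHL =  3.319, Hodgkins = 1.518)  # sum of the variance terms

chisq_by_stratum <- z^2 / v             # separate 1-d.f. tests
chisq_stratified <- sum(z)^2 / sum(v)   # combined (stratified) test

p_by_stratum <- pchisq(chisq_by_stratum, df = 1, lower.tail = FALSE)
p_stratified <- pchisq(chisq_stratified, df = 1, lower.tail = FALSE)

round(c(chisq_by_stratum, stratified = chisq_stratified), 2)
```

The rounded values (1.66, 6.36, and 0.12) match the (O-E)^2/V entries that `survdiff()` reports for the NHL subgroup, the Hodgkin’s subgroup, and the stratified test: the two stratum effects have opposite signs, so their sum is close to zero.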
Alternatives to Stratified Analysis
- Define 4 groups and conduct a K-sample log-rank test:
  - Allogeneic and NHL
  - Allogeneic and Hodgkin’s
  - Autologous and NHL
  - Autologous and Hodgkin’s
- Subgroup analysis (by disease) should be performed:
  - Allo vs. auto | Hodgkin’s
  - Allo vs. auto | Non-Hodgkin’s
R Code: K-sample test
> # Build a 4-level group variable from disease (dis) and graft type (type)
> allgrp <- ifelse(dis==1 & type==1, 1, 0)
> allgrp <- ifelse(dis==1 & type==2, 2, allgrp)
> allgrp <- ifelse(dis==2 & type==1, 3, allgrp)
> allgrp <- ifelse(dis==2 & type==2, 4, allgrp)
> grp4 <- survdiff(Surv(t, d) ~ allgrp)
> grp4
Call:
survdiff(formula = Surv(t, d) ~ allgrp)

          N Observed Expected (O-E)^2/E (O-E)^2/V
allgrp=1 11        5     7.67     0.927     1.350
allgrp=2 12        9     7.45     0.324     0.459
allgrp=3  5        5     1.45     8.721     9.567
allgrp=4 15        7     9.44     0.631     0.997

 Chisq= 11.1  on 3 degrees of freedom, p= 0.0113
R Code: Subgroup analysis
> ### Subgroup (NHL)
> subNHL <- survdiff(Surv(t,d)[which(dis==1)] ~ type[which(dis==1)])
> subNHL
Call:
survdiff(formula = Surv(t, d)[which(dis == 1)] ~ type[which(dis == 1)])

                          N Observed Expected (O-E)^2/E (O-E)^2/V
type[which(dis == 1)]=1  11        5     7.34     0.748      1.66
type[which(dis == 1)]=2  12        9     6.66     0.825      1.66

 Chisq= 1.7  on 1 degrees of freedom, p= 0.198
> ### Subgroup (Hodgkins)
> subHD <- survdiff(Surv(t,d)[which(dis==2)] ~ type[which(dis==2)])
> subHD
Call:
survdiff(formula = Surv(t, d)[which(dis == 2)] ~ type[which(dis == 2)])

                          N Observed Expected (O-E)^2/E (O-E)^2/V
type[which(dis == 2)]=1   5        5     1.89     5.095      6.36
type[which(dis == 2)]=2  15        7    10.11     0.955      6.36

 Chisq= 6.4  on 1 degrees of freedom, p= 0.0117
Summary: Stratified Testing
- Alternative to a regression approach for controlling for a 2nd covariate when examining a treatment effect
- Sample size needs to be larger than in the case of testing K groups for the test results to be valid
- One needs to be cautious about misinterpreting null results when interactions exist
- We can use a subgroup approach if the stratified test is not appropriate
Renyi Tests
- The previous tests we discussed all use a weighted integral of the estimated difference in cumulative hazard rates
- They do not address the situation where early differences favor one group and later differences favor the other
- Solution: Renyi tests, i.e., tests that address the issue of crossing hazard rates
Renyi Test
- Censored-data analogs of the Kolmogorov-Smirnov (KS) statistic for comparing two uncensored samples
- Recall that the KS test is a test of equality of one-dimensional probability distributions, used to compare two samples
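Kolmogorov-Smirnov Test
- Recall the empirical distribution function: the proportion of observations in a sample that are less than or equal to x
- Hypothesis: the two samples come from the same distribution
- The KS statistic is the largest absolute difference between the two empirical distribution functions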
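For reference, the three pieces named on this slide can be written as follows. This is the standard two-sample form; the notation (n_j for the size of sample j, X_{ji} for its observations) is assumed here rather than taken from the slide.

```latex
\hat F_j(x) = \frac{1}{n_j} \sum_{i=1}^{n_j} I\!\left(X_{ji} \le x\right),
\qquad j = 1, 2

H_0:\; F_1(x) = F_2(x) \ \text{for all } x
\qquad \text{vs.} \qquad
H_A:\; F_1(x) \ne F_2(x) \ \text{for some } x

D = \sup_{x} \left| \hat F_1(x) - \hat F_2(x) \right|
```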
Example of a KS test
- Two groups observed for a continuous outcome:
  - Group 1: -0.2, 3.7, 4.3, 5.0, 7.7, 8.6
  - Group 2: -0.9, 0.4, 0.5, 2.6, 3.0, 12.1
- We want to determine whether the distributions of the outcomes differ (without assuming any distributional form)
Constructing KS statistic
   x   P(X1 ≤ x)  P(X2 ≤ x)  |P(X1 ≤ x) - P(X2 ≤ x)|
-0.9     0          1/6        1/6
-0.2     1/6        1/6        0
 0.4     1/6        1/3        1/6
 0.5     1/6        1/2        1/3
 2.6     1/6        2/3        1/2
 3.0     1/6        5/6        2/3
 3.7     1/3        5/6        1/2
 4.3     1/2        5/6        1/3
 5.0     2/3        5/6        1/6
 7.7     5/6        5/6        0
 8.6     1          5/6        1/6
12.1     1          1          0
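The same construction can be sketched in base R for the two samples in this example. `ecdf()` and `ks.test()` are from the stats package that ships with R; the other variable names are illustrative.

```r
x1 <- c(-0.2, 3.7, 4.3, 5.0, 7.7, 8.6)
x2 <- c(-0.9, 0.4, 0.5, 2.6, 3.0, 12.1)

# Evaluate both empirical distribution functions at every observed value
grid <- sort(c(x1, x2))
F1   <- ecdf(x1)(grid)
F2   <- ecdf(x2)(grid)

# KS statistic: largest absolute difference, and where it occurs
abs_diff <- abs(F1 - F2)
D        <- max(abs_diff)
x_at_max <- grid[which.max(abs_diff)]

# Built-in two-sample test for comparison (D only; p-value not shown here)
D_builtin <- unname(ks.test(x1, x2)$statistic)
```

Here `D` is 2/3, attained at x = 3.0, which is the largest entry in the last column of the table.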
K-S Test
Renyi Test Approach
- Find the value of Z(t_i) at each failure time t_i
  - Note: this is different from Z(τ), which sums over all t_i ≤ τ
- Calculate the series of Z(t_i), one value per failure time
- Estimate the standard error of Z(τ), σ(τ) (using all times)
Renyi Statistic
- When hazard rates cross, the absolute value of Z(t) will reach its maximum at some t < τ
- Hypothesis test: H0: h1(t) = h2(t) for all t ≤ τ vs. HA: h1(t) ≠ h2(t) for some t ≤ τ
- Note that, in effect, multiple comparisons are made because we take the maximum of |Z(t)| over time
Test Statistic Q
- Use the same variance estimate for the test statistic as in the standard two-sample approach
- Test statistic: Q is the maximum of |Z(t)| over t ≤ τ, divided by σ(τ)
- Under H0, the distribution of Q is approximated by that of sup{|B(x)|, 0 ≤ x ≤ 1}, where B is a standard Brownian motion process
- Use Table C.5 of K&M to find the associated p-value
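Written out, the quantities on the last three slides take the following form. This is a sketch in K&M-style notation for two groups; W(t) is the weight function (W = 1 gives the log-rank version), and the series for the tail probability is the one that Table C.5 tabulates.

```latex
Z(t_i) = \sum_{t_k \le t_i} W(t_k) \left[ d_{k1} - Y_{k1}\,\frac{d_k}{Y_k} \right],
\qquad i = 1, \dots, D

\sigma^2(\tau) = \sum_{t_k \le \tau} W(t_k)^2\,
  \frac{Y_{k1}}{Y_k}\,\frac{Y_{k2}}{Y_k}\,
  \frac{Y_k - d_k}{Y_k - 1}\, d_k

Q = \frac{\sup\{\, |Z(t)| : t \le \tau \,\}}{\sigma(\tau)}

P\!\left[\, \sup_{0 \le x \le 1} |B(x)| > y \right]
  = 1 - \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
    \exp\!\left[ -\frac{\pi^2 (2k+1)^2}{8 y^2} \right]
```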
Small Example Given the following data Group 1: (7, 8+, 9, 15, 17)
Constructing the statistic dk dk1 Yk Yk1 Var
Calculating Q
- First we calculate Q
- Once we have Q, we compare it to Table C.5
Example 2: Kidney Infection
- Data on 119 kidney dialysis patients
- Comparing time to kidney infection between two groups:
  - Catheters placed percutaneously (n = 76)
  - Catheters placed surgically (n = 43)
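As a rough illustration of the calculation, the sketch below computes the log-rank-weighted Q in base R and replaces the table lookup with the series for the tail probability of sup|B(x)|. The function names `renyi_logrank` and `renyi_p` are hypothetical, W(t) = 1 is assumed, and the data are the 20-subject treatment example from earlier in this lecture (not the small example on the previous slides).

```r
# 20-subject treatment example from earlier in the lecture
times <- c(1,5,5,6,8,11,37,49,50,51,58,62,67,73,79,86,90,96,97,97)
trt   <- c(1,1,1,1,1,2,1,1,2,2,1,2,2,2,1,2,2,2,2,2)
death <- c(1,1,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1)

# Hypothetical helper: Renyi statistic with log-rank weights, W(t) = 1
renyi_logrank <- function(time, status, group, g1 = 1) {
  ev      <- sort(unique(time[status == 1]))   # distinct failure times
  contrib <- numeric(length(ev))
  vterm   <- numeric(length(ev))
  for (k in seq_along(ev)) {
    Yk  <- sum(time >= ev[k])
    Yk1 <- sum(time >= ev[k] & group == g1)
    dk  <- sum(time == ev[k] & status == 1)
    dk1 <- sum(time == ev[k] & status == 1 & group == g1)
    contrib[k] <- dk1 - Yk1 * dk / Yk
    vterm[k]   <- if (Yk > 1) {
      dk * (Yk1 / Yk) * (1 - Yk1 / Yk) * (Yk - dk) / (Yk - 1)
    } else 0
  }
  Z     <- cumsum(contrib)        # Z(t_i) at each failure time
  sigma <- sqrt(sum(vterm))       # standard error of Z(tau)
  list(time = ev, Z = Z, sigma = sigma, Q = max(abs(Z)) / sigma)
}

# P[sup |B(x)| > y] on 0 <= x <= 1, using a truncated series
renyi_p <- function(y, terms = 50) {
  k <- 0:(terms - 1)
  1 - (4 / pi) * sum((-1)^k / (2 * k + 1) *
                       exp(-pi^2 * (2 * k + 1)^2 / (8 * y^2)))
}

res   <- renyi_logrank(times, death, trt)
p_val <- renyi_p(res$Q)
```

The element `res$Z` holds the whole path Z(t_i), which is useful for seeing where the maximum occurs. As a sanity check on the series, `renyi_p(2.200066)` should be close to the p-value of 0.0556 that survMisc reports for the log-rank-weighted Renyi test in the gastric cancer example.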
Example: Kidney Infection
R Code: Kidney Infection
> library(survival)
> kidney <- read.csv("H:\\public_html\\BMRTY722_Summer2015\\Data\\Kidney.csv")
> time <- kidney$Time
> infect <- kidney$d
> percut <- kidney$cath
> st <- Surv(time, infect)
> LRtest <- survdiff(st ~ percut)
> LRtest
Call:
survdiff(formula = st ~ percut)

          N Observed Expected (O-E)^2/E (O-E)^2/V
percut=1 43       15       11      1.42      2.53
percut=2 76       11       15      1.05      2.53

 Chisq= 2.5  on 1 degrees of freedom, p= 0.112
How to Test This in R?
- We could write our own R function to conduct the Renyi test…
- BUT, it turns out there was a package released in April (survMisc) that has the Renyi test (with all of the weight functions from K&M included)
R Code: Kidney Infection
> library(survMisc)
> RYtest <- comp(survfit(st ~ percut))
> RYtest
$tne
       t   n e n_percut=1 e_percut=1 n_percut=2 e_percut=2
 1:  0.5 119 6         76          6         43          0
 2:  1.5 103 1         60          0         43          1
 …
16: 26.5   5 1          3          0          2          1

$tests$lrTests
                                     ChiSq df       p
Log-rank                       2.529506318  1 0.11174
Gehan-Breslow (mod~ Wilcoxon)  0.002084309  1 0.96359
Tarone-Ware                    0.402738202  1 0.52568
Peto-Peto                      1.399160019  1 0.23686
Mod~ Peto-Peto (Andersen)      1.275908836  1 0.25866
Flem~-Harr~ with p=1, q=1      9.834062861  1 0.00171

$tests$supTests
                                        Q       p
Log-rank                         1.590442 0.22347
Gehan-Breslow (mod~ Wilcoxon)    1.430499 0.30511
Tarone-Ware                      1.260498 0.41467
Peto-Peto                        1.166979 0.48551
Mod~ Peto-Peto (Andersen)        1.185549 0.47085
Renyi Flem~-Harr~ with p=1, q=1  7.460348 0.00000
R Code: Kidney Infection
> library(survMisc)
> RYtest <- comp(survfit(st ~ percut), FHp=0, FHq=0)
> RYtest
$tne
       t   n e n_percut=1 e_percut=1 n_percut=2 e_percut=2
 1:  0.5 119 6         76          6         43          0
 2:  1.5 103 1         60          0         43          1
 …
16: 26.5   5 1          3          0          2          1

$tests$lrTests
                                     ChiSq df       p
Log-rank                       2.529506318  1 0.11174
Gehan-Breslow (mod~ Wilcoxon)  0.002084309  1 0.96359
Tarone-Ware                    0.402738202  1 0.52568
Peto-Peto                      1.399160019  1 0.23686
Mod~ Peto-Peto (Andersen)      1.275908836  1 0.25866
Flem~-Harr~ with p=0, q=0      2.529506318  1 0.11174

$tests$supTests
                                         Q       p
Log-rank                         1.5904422 0.22347
Gehan-Breslow (mod~ Wilcoxon)    1.4304991 0.30511
Tarone-Ware                      1.2604976 0.41467
Peto-Peto                        1.1669791 0.48551
Mod~ Peto-Peto (Andersen)        1.1855486 0.47085
Renyi Flem~-Harr~ with p=0, q=0  0.9743145 0.65287
Example 3: Gastric Cancer
- Clinical trial of chemotherapy vs. chemotherapy combined with radiotherapy
- 45 patients randomized to each of two arms
- Followed for up to 8 years
R Code: Gastric Cancer
> RYtest <- comp(survfit(Surv(tm, dth) ~ x, data=dat))
> RYtest
$tne
       t n_x=1 e_x=1 n_x=2 e_x=2  n e
 1:    1    45     1    45     0 90 1
 …
80: 2363     3     1     6     0  9 1

$tests$lrTests
                                    ChiSq df       p
Log-rank                       0.23192760  1 0.63010
Gehan-Breslow (mod~ Wilcoxon)  3.99653918  1 0.04559
Tarone-Ware                    1.92661766  1 0.16513
Peto-Peto                      4.02844247  1 0.04474
Mod~ Peto-Peto (Andersen)      4.12061234  1 0.04236
Flem~-Harr~ with p=1, q=1      0.01112868  1 0.91598

$tests$supTests
                                        Q       p
Log-rank                         2.200066 0.05560
Gehan-Breslow (mod~ Wilcoxon)    2.951879 0.00632
Tarone-Ware                      2.677299 0.01484
Peto-Peto                        2.965941 0.00604
Mod~ Peto-Peto (Andersen)        2.997885 0.00544
Renyi Flem~-Harr~ with p=1, q=1  9.388643 0.00000
Compare To Log Rank
- Renyi test (log-rank weights): 0.05 < p < 0.06
- What would you expect to see from the log-rank test? More or less significant?
LR Results
> LRtest <- survdiff(Surv(tm, dth) ~ x)
> LRtest
Call:
survdiff(formula = Surv(tm, dth) ~ x)

     N Observed Expected (O-E)^2/E (O-E)^2/V
x=1 45       43     45.1     0.102     0.232
x=2 45       39     36.9     0.125     0.232

 Chisq= 0.2  on 1 degrees of freedom, p= 0.63
Final Comments on the Renyi Test
- Simulations comparing the Renyi test vs. the log-rank test:
  - When hazards cross, the Renyi test performs better
  - The Renyi test has little loss of power if the proportional hazards assumption holds (with limited censoring)
  - However, with large amounts of censoring, the advantages of the Renyi test decline
- So this test provides a good alternative when hazard rates cross, but caution is still needed when there is a large amount of censoring
Other Tests for Crossing Hazards
- Cramer-von Mises test(s): based on the integrated squared difference between two curves
- T-test analog: requires estimation of the mean; compares the area under S1(t) and S2(t)
- Brookmeyer-Crowley: censored-data version of the two-sample median test
Cramer-von Mises Test
- Based on the Nelson-Aalen estimator of the cumulative hazard rate and its associated variance
- Ideally we integrate over time from 0 to τ, but this integral is estimated by summing over the distinct death times
2-Sample T-test Analog
- Again, this test is based on the difference in the area under the survival curve between two groups
- Components of the test include:
  1. Order all observed times (event and censored)
  2. Calculate d_ij, c_ij, and Y_ij for both groups
  3. Calculate the KM estimators for survival and for censoring
  4. Calculate the pooled KM estimate of survival
2-Sample T-test Analog
- Once these estimates are obtained:
  1. Construct the weight function
  2. Construct the test statistic
  3. Construct the variance of the test statistic
  4. Calculate a Z-score from the test statistic and its variance
Summary of Other 2-Sample Tests
- When the hazard rates cross, both the Cramer-von Mises test and the 2-sample t-test analog have greater power than the log-rank test
- When hazard rates are proportional, both show a loss of power relative to the log-rank test
- Performance is similar to the Renyi test when hazards cross, but the Renyi test has better power under proportional hazards
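Test Based on Fixed Points in Time
- Complicated description in K&M (Section 7.8)
- However, it is a pretty simple idea when you are comparing two groups: compare the two Kaplan-Meier estimates at a prespecified time t0, standardized by their combined standard error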
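For two groups, the idea amounts to Z = [S1(t0) - S2(t0)] / sqrt(V[S1(t0)] + V[S2(t0)]), referred to a standard normal distribution. The sketch below illustrates this with the survival package on the 20-subject treatment example from earlier in this lecture; the time point t0 = 60 is an arbitrary choice made only for illustration, and the variable names are not from the lecture.

```r
library(survival)

# 20-subject treatment example from earlier in the lecture
times <- c(1,5,5,6,8,11,37,49,50,51,58,62,67,73,79,86,90,96,97,97)
trt   <- c(1,1,1,1,1,2,1,1,2,2,1,2,2,2,1,2,2,2,2,2)
death <- c(1,1,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1)

t0 <- 60   # illustrative fixed time point (an assumption, not from the slides)

# Kaplan-Meier fit per treatment group
fit <- survfit(Surv(times, death) ~ trt)

# One row per group: KM estimate and its (Greenwood) standard error at t0
s <- summary(fit, times = t0)

# Standardized difference in survival at t0 and two-sided normal p-value
z <- (s$surv[1] - s$surv[2]) / sqrt(sum(s$std.err^2))
p <- 2 * pnorm(-abs(z))
```

The result depends on the choice of t0, so t0 should be fixed before looking at the data; the time also has to lie within the follow-up of both groups for `summary()` to return an estimate for each.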
Next Time We will begin our discussion of semi-parametric regression modeling in survival analysis.