Paths to precision medicine: Subgroup Identification in Clinical Trials. Xin Huang, Yan Sun, V. Devanarayan, Exploratory Statistics, AbbVie Inc. April 15, 2015
Why Personalized Medicine? Source: Edward Abrahams and Mike Silver (2009) The Case for Personalized Medicine. Journal of Diabetes Science and Technology, Vol. 3, Issue 4.
Some statistical challenges. For ease of implementation in clinical practice, we need cut-points on biomarkers for predicting responders/non-responders, i.e., threshold-based biomarker signatures. E.g., patients with Gene X1 > …, Gene X2 < …, are likely responders. The signature should be multivariate. It is typically derived from a large panel of candidate markers, and often from full genome data (e.g., > 30,000 genes). We need to account for linear and nonlinear trends. After a promising threshold-based signature is identified, we need to predict its performance in a future dataset, i.e., predict the treatment effect in the “responder” subgroup, or predict the signature effect among patients receiving treatment.
Biomarker signatures for subgroup identification Prognostic Signature: identifies a subgroup of patients that are more likely to experience an outcome of interest (efficacy, toxicity, disease progression, etc.), independent of treatment. Predictive Signature: identifies a subgroup of patients that respond better to a specific treatment.
Some existing methods. Prognostic signatures (predict the disease outcome irrespective of the treatment): CART (Breiman et al., 1984); MARS (Friedman, 1991); RuleFit (Friedman and Popescu, 2008). Predictive signatures (predict the response to a specific treatment compared to other treatments): Interaction Trees (Su et al., 2008, 2009); Virtual Twins (Foster et al., 2011); SIDES method (Lipkovich et al., 2011, 2014); Bayesian approaches (Berger et al., 2014).
Objective functions. Consider a supervised learning problem with data $(\boldsymbol{x}_i, y_i)$, $i = 1, 2, \ldots, n$, where $\boldsymbol{x}_i$ is a $p$-vector of predictors and $y_i$ is an outcome variable. Consider three major applications: linear regression for a continuous response; logistic regression for a binary response, where $y_i \in \{0, 1\}$; and Cox regression for a survival response, where $y_i = (T_i, \delta_i)$, $T_i$ is a right-censored survival time, and $\delta_i$ is the censoring indicator. Denote the log likelihood or log partial likelihood by $\ell(\eta; \boldsymbol{X}, \boldsymbol{y})$, where $\eta$ is the usual linear combination of predictors: the continuous response in simple linear regression, the log odds in logistic regression, and the log hazard ratio in proportional hazards regression.
Objective functions, contd. Consider the following model for prognostic signatures (predict the outcome, irrespective of the treatment): $\eta = \alpha + \beta \cdot \omega(\boldsymbol{X})$, (1) where $\omega(\boldsymbol{X}) \in \{0, 1\}$ is the signature rule returning a grouping indicator for each subject. Consider the following model for predictive signatures (predict the response to a specific treatment compared to the other treatment): $\eta = \alpha + \beta \cdot [\omega(\boldsymbol{X}) \times r] + \gamma \cdot r$, (2) where $r$ is the treatment indicator. Our algorithms derive the signature rule $\omega(\boldsymbol{X})$ by searching for the grouping that optimizes the significance of $\beta$ in (1) and (2).
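To see what $\beta$ measures in model (2), it may help to write out the linear predictor in each of the four cells. The sketch below assumes the treatment indicator is coded $r = 1$ for treatment and $r = 0$ for control (the coding is not stated on the slide), and writes $\eta(\omega, r)$ as shorthand for the linear predictor in a given cell.

```latex
\begin{aligned}
\eta(\omega=0,\, r=0) &= \alpha, & \eta(\omega=0,\, r=1) &= \alpha + \gamma,\\
\eta(\omega=1,\, r=0) &= \alpha, & \eta(\omega=1,\, r=1) &= \alpha + \gamma + \beta,\\
\beta &= \bigl[\eta(1,1) - \eta(1,0)\bigr] - \bigl[\eta(0,1) - \eta(0,0)\bigr].
\end{aligned}
```

Under that coding, $\beta$ is the treatment effect in the signature-positive group minus the treatment effect in the signature-negative group, on the scale of $\eta$. Model (2) as written has no main effect for $\omega(\boldsymbol{X})$, so the two groups share the same control-arm level $\alpha$.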
Bootstrapping & Aggregating of Thresholds from Trees (BATTing) (Devanarayan, 1999). Draw B bootstrap datasets (Data 1, Data 2, …, Data B) from the original data by sampling with replacement. Fit a single-split tree to each dataset, so that Tree b splits at threshold Cb (>= Cb vs. < Cb). Aggregate the thresholds (C1, C2, …, CB); the BATTing threshold is their median. The resulting threshold is robust to small perturbations in the data, outliers, etc.
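The BATTing idea can be sketched in a few lines of Python. This is a minimal illustration for one marker and a continuous response, not the authors' implementation: the function names, the `min_group` parameter, and the use of an absolute Welch t-statistic as the split criterion are all assumptions made for the example (the slides optimize the significance of beta in models (1) and (2)).

```python
import math
import random
import statistics


def best_stump_threshold(x, y, min_group=5):
    """Cut-point on x maximizing |Welch t| between y[x >= c] and y[x < c].

    Assumed split criterion for illustration; returns None if no valid split.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    total_sq = sum(v * v for _, v in pairs)
    s = sq = 0.0
    best_c, best_t = None, -1.0
    for i in range(1, n):
        # Running sums over the i smallest x values (the "< c" side).
        s += pairs[i - 1][1]
        sq += pairs[i - 1][1] ** 2
        if i < min_group or n - i < min_group or pairs[i][0] == pairs[i - 1][0]:
            continue  # group too small, or no cut-point between tied values
        m_lo, m_hi = s / i, (total - s) / (n - i)
        v_lo = max(sq - i * m_lo ** 2, 0.0) / (i - 1)
        v_hi = max(total_sq - sq - (n - i) * m_hi ** 2, 0.0) / (n - i - 1)
        se = math.sqrt(v_lo / i + v_hi / (n - i))
        if se == 0.0:
            continue
        t = abs(m_hi - m_lo) / se
        if t > best_t:
            best_t, best_c = t, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_c


def batting_threshold(x, y, n_boot=50, seed=0):
    """Median of single-split thresholds over n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(x)
    cuts = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        c = best_stump_threshold([x[i] for i in idx], [y[i] for i in idx])
        if c is not None:
            cuts.append(c)
    return statistics.median(cuts) if cuts else None
```

Calling `batting_threshold(x, y, n_boot=50)` on two equal-length lists returns one aggregated cut-point. Taking the median rather than the mean keeps a few extreme bootstrap thresholds from dragging the estimate.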
BATTing, contd. We use the same simulation model as described in the simulation section for prognostic signatures, except that the one true predictor is the only candidate predictor. Under the simulated model, the overall response equals zero, the signature-positive group (predictor >= cutoff of 0) has a positive response, and the signature-negative group (predictor < cutoff of 0) has a negative response. The difference between the two signature groups is determined by a predetermined effect size. Figure 1 shows the distribution of BATTing threshold estimates from 500 simulation runs across different numbers of bootstrap samples, for sample size = 100 and effect size = 0.2, with the true cutoff being 0 (red dashed vertical line). As demonstrated in Figure 1, BATTing helps reduce the influence of perturbations in the dataset and thus stabilizes the threshold estimate. In our experience, at least 50 bootstrap samples are recommended.
Sequential BATTing. Model growing within the potential Sig+ group: starting from the whole population, get the BATTing threshold for each unused marker; the best marker is selected to split the current Sig+ group into Sig+ and Sig-; this procedure continues in the new Sig+ group (the illustration splits on Marker 7, Marker 3, and Marker 9 in turn). Stopping rule: each newly added predictor goes through a likelihood ratio test for significance.
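The sequential procedure can be sketched as follows for a prognostic signature with a continuous response. This is a hedged illustration rather than the authors' code. Every function name is hypothetical, the split criterion is a Welch t-statistic, and the stopping rule is a simple stand-in (|t| below `z_crit` within the current Sig+ group) for the likelihood ratio test named on the slide. A compact BATTing helper is included so the block runs on its own.

```python
import math
import random
import statistics


def stump(x, y, min_group=10):
    """Best single cut on x by |Welch t| of y; returns (cut, t) or None."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    tot = sum(v for _, v in pairs)
    tot_sq = sum(v * v for _, v in pairs)
    s = sq = 0.0
    best = None
    for i in range(1, n):
        s += pairs[i - 1][1]
        sq += pairs[i - 1][1] ** 2
        if i < min_group or n - i < min_group or pairs[i][0] == pairs[i - 1][0]:
            continue
        m_lo, m_hi = s / i, (tot - s) / (n - i)
        v_lo = max(sq - i * m_lo ** 2, 0.0) / (i - 1)
        v_hi = max(tot_sq - sq - (n - i) * m_hi ** 2, 0.0) / (n - i - 1)
        se = math.sqrt(v_lo / i + v_hi / (n - i))
        if se == 0.0:
            continue
        t = (m_hi - m_lo) / se
        if best is None or abs(t) > abs(best[1]):
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, t)
    return best


def batting(x, y, n_boot, rng):
    """Median of stump cut-points over bootstrap resamples (None if none valid)."""
    n = len(x)
    cuts = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        fit = stump([x[i] for i in idx], [y[i] for i in idx])
        if fit is not None:
            cuts.append(fit[0])
    return statistics.median(cuts) if cuts else None


def split_t(x, y, cut):
    """Welch t for y[x >= cut] versus y[x < cut]; 0.0 if a side has < 2 points."""
    hi = [v for u, v in zip(x, y) if u >= cut]
    lo = [v for u, v in zip(x, y) if u < cut]
    if len(hi) < 2 or len(lo) < 2:
        return 0.0
    se = math.sqrt(statistics.variance(hi) / len(hi)
                   + statistics.variance(lo) / len(lo))
    return (statistics.fmean(hi) - statistics.fmean(lo)) / se if se > 0 else 0.0


def sequential_batting(X, y, n_boot=50, z_crit=1.96, seed=0):
    """X is a list of marker columns. Returns (rules, sig_plus_indices)."""
    rng = random.Random(seed)
    sig_plus = list(range(len(y)))
    rules, unused = [], set(range(len(X)))
    while unused:
        ys = [y[i] for i in sig_plus]
        best = None
        for j in sorted(unused):
            xs = [X[j][i] for i in sig_plus]
            cut = batting(xs, ys, n_boot, rng)  # threshold for each unused marker
            if cut is None:
                continue
            t = split_t(xs, ys, cut)
            if best is None or abs(t) > abs(best[2]):
                best = (j, cut, t)
        if best is None or abs(best[2]) < z_crit:
            break  # stand-in stopping rule (the slides use a likelihood ratio test)
        j, cut, t = best
        rules.append((j, ">=" if t > 0 else "<", cut))
        # Keep growing inside the side with the better response.
        sig_plus = [i for i in sig_plus if (X[j][i] >= cut) == (t > 0)]
        unused.remove(j)
    return rules, sig_plus
```

Each returned rule is a `(marker index, direction, cut-point)` triple, and the final Sig+ group is the set of subjects satisfying all rules. Because each cut-point is chosen to maximize the statistic, a fixed `z_crit` is optimistic, which is one reason a cross-validated assessment is still needed afterwards.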
Adaptive Index Model. AIM (Tian & Tibshirani, 2010) can be used for selecting markers and thresholds. Output: the AIM score, an index predictor equal to the number of satisfied rules, $\text{score} = \sum_{k=1}^{K} I(X_k \le c_k)$. Models used to get the AIM score: prognostic, $\eta^* = \theta_0 + \theta \times \text{score}$; predictive, $\eta^* = \theta_0 + \gamma \cdot T + \theta \times T \times \text{score}$, with $T$ the treatment indicator. A fast algorithm based on the information matrix performs a score test to select the threshold for each marker. Markers are selected one at a time (forward selection). The optimal number of markers is determined via cross-validation.
AIM-BATTing. Step 1: obtain the AIM score for each patient, e.g., score = I(X1 ≥ c1) + I(X2 ≤ c2) + … + I(Xk ≥ ck), giving Score 1, Score 2, …, Score n for Patients 1, 2, …, n. Step 2: use BATTing to derive an optimal AIM score threshold j based on Models (1) and (2). The threshold is then used to stratify the population: patients with I(Score ≥ j) = 1 form the Sig+ group and the remaining patients form the Sig- group.
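The scoring and stratification steps can be illustrated with a short sketch. The marker names, cut-points, and score threshold `j` below are hypothetical values chosen for the example. In the method described above, the rules come from AIM and `j` comes from BATTing.

```python
def aim_score(markers, rules):
    """Number of threshold rules a patient satisfies.

    markers: dict mapping marker name to value.
    rules: list of (marker name, ">=" or "<=", cut-point) triples.
    """
    ops = {">=": lambda v, c: v >= c, "<=": lambda v, c: v <= c}
    return sum(1 for name, op, cut in rules if ops[op](markers[name], cut))


def assign_group(markers, rules, j):
    """Sig+ if the AIM score reaches the threshold j, otherwise Sig-."""
    return "Sig+" if aim_score(markers, rules) >= j else "Sig-"


# Hypothetical rules and patient, for illustration only.
example_rules = [("X1", ">=", 1.0), ("X2", "<=", 0.5), ("X3", ">=", 2.0)]
example_patient = {"X1": 1.2, "X2": 0.7, "X3": 2.5}
```

Here the example patient satisfies the X1 and X3 rules but not the X2 rule, so the score is 2. With `j = 2` the patient is placed in the Sig+ group, and with `j = 3` in the Sig- group.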
Some refinements to the AIM-BATTing algorithm. MC-AIM-BATTing: a Monte Carlo procedure to get a more stable estimate of the “optimal # of markers”, i.e., use the median of the estimated “optimal # of markers” across multiple cross-validation runs with different random seeds. MC-AIM-RULE-BATTing: use BATTing directly on the rules (Xi > c), instead of scores, and get a cutoff on the rule list. Patients meeting all the rules within the cutoff are assigned to the Sig+ group.
Performance evaluation: common mistakes in practice. Using an entire dataset to build a model; selecting “important” variables by associating markers with outcomes (e.g., stepwise regression); testing and relying on lack-of-fit assessment of the resulting model; assuming the resulting model is correct and making inferences using the same dataset. The result is over-fitting.
Predictive significance via cross-validation. In each cross-validation fold, a signature (Sig.) is derived on the training portion and applied to the test portion to assign group labels. The group labels from all test portions are then combined to compute a CV p-value p_i. Repeating this multiple times gives aggregated cross-validated p-values from M iterations (p1, p2, …, pM); the predictive significance is the median of these p-values. Note: other performance statistics, e.g., sensitivity, specificity, PPV, NPV, hazard ratio, and odds ratio, can be calculated similarly.
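The cross-validation loop can be sketched as below for a single marker and a continuous response. This is an illustration under stated assumptions, not the authors' procedure in detail. The signature builder `fit_cutoff` is a simple stand-in for any of the algorithms above, the p-value is a two-sided normal approximation to a Welch test, and all function names and defaults (`k`, `m`) are hypothetical.

```python
import math
import random
import statistics


def fit_cutoff(x, y, min_group=5):
    """Stand-in signature builder: cut-point with the largest gap in mean response."""
    pairs = sorted(zip(x, y))
    best_c, best_gap = pairs[len(pairs) // 2][0], -1.0
    for i in range(min_group, len(pairs) - min_group + 1):
        lo = [v for _, v in pairs[:i]]
        hi = [v for _, v in pairs[i:]]
        gap = abs(statistics.fmean(hi) - statistics.fmean(lo))
        if gap > best_gap:
            best_gap, best_c = gap, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_c


def welch_p_value(a, b):
    """Two-sided p-value from a normal approximation to Welch's statistic."""
    if len(a) < 2 or len(b) < 2:
        return 1.0
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        return 1.0
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def cv_p_value(x, y, k, rng):
    """One k-fold pass: label each subject using a cut-off fit without that subject."""
    n = len(x)
    idx = list(range(n))
    rng.shuffle(idx)
    label = [False] * n
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        c = fit_cutoff([x[i] for i in train], [y[i] for i in train])
        for i in test:
            label[i] = x[i] >= c  # group label from the training-fold signature
    pos = [y[i] for i in range(n) if label[i]]
    neg = [y[i] for i in range(n) if not label[i]]
    return welch_p_value(pos, neg)


def predictive_significance(x, y, k=5, m=20, seed=0):
    """Median of CV p-values over m random fold assignments."""
    rng = random.Random(seed)
    return statistics.median(cv_p_value(x, y, k, rng) for _ in range(m))
```

The key design point is that each subject's group label comes from a signature fit without that subject, so the p-value computed on the pooled labels is not inflated by reusing the training data. The median over repeats reduces dependence on any one random fold assignment.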
Simulation Design. Simulation model similar to Lipkovich et al. (2011, 2014), with each predictor continuous instead of dichotomized. Small to large trials (n = 100, 300, 500). The number of candidate predictors is k = 10 and 18, with different correlation structures. Effect size is 0.2 (low), 0.5 (medium), or 0.8 (high), where effect size = E(Y|Trt, sig+) - E(Y|ctrl, sig+); the illustrated example has effect size = 0.5.
Simulation Results. For the small effect size, none of the methods yields many testing p-values less than 0.05 at sample sizes from 100 to 500. Our proposed methods outperform SIDES in terms of selection accuracy: the accuracy of SIDES is around 50%, while that of our proposed algorithms is 60% to 70% for large sample sizes. For effect size greater than medium (0.5) and sample size larger than 300, our proposed methods have most of the testing p-values less than 0.05 and accuracy around 90%. The SIDES method underperforms in all scenarios.
Clinical Trial Case Study. Data simulated based on a Phase III clinical trial. Efficacy of a novel treatment is compared to the standard of care (Control) in patients with severe sepsis. Treatment arm (n = 317) vs. Control arm (n = 153). Outcome: binary (survival). Available markers: demographic and clinical covariates, i.e., age, time from first sepsis-organ failure to start of drug, sum of baseline SOFA scores (cardiovascular, hematology, hepatic, renal, and respiration scores), number of baseline organ failures, pre-infusion APACHE II score, baseline Glasgow coma scale score, and baseline activity of daily living score; laboratory markers, i.e., baseline local platelets, creatinine, serum IL-6 concentration, and local bilirubin. The overall outcome was not statistically significant (1-tailed p-value = 0.08), with survival rates of 40.7% and 34% in the treatment and control arms, respectively. The original data was randomly split into two parts (training + testing).
Clinical Trial Case Study, contd. Signature rules for the positive subgroup: Sequential-BATTing and AIM-RULE: pre-infusion APACHE II score <= 27. AIM: meet at least two out of the three thresholds: (1) pre-infusion APACHE II score < 27; (2) age < 54; (3) local bilirubin > 0.8. SIDES: creatinine <= 1.1 & baseline Glasgow coma scale score > 11. Table: 1-tailed p-values for sepsis trial example. Seq-BATTing has the most promising CV performance, and its signature is validated in the test dataset.
Summary. The proposed subgroup identification algorithms perform well in simulations and in the case-study illustration. These algorithms provide threshold-based multivariate biomarker signatures. Variable selection is automatically built into these algorithms. Personalized medicine is a paradigm shift in drug development, which requires: advanced subgroup identification and subgroup analysis methods; enrichment design and simulations; smart diagnostic test development and clinical development strategy to overcome operational challenges; and collaboration between functional areas.
References
Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., corr. 7th printing 2013. Springer
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and Regression Trees, 1st ed. Chapman and Hall/CRC
Friedman JH (1991) Multivariate Adaptive Regression Splines. Ann Stat 19:1–67. doi: 10.1214/aos/1176347963
Friedman JH, Popescu BE (2008) Predictive learning via rule ensembles. Ann Appl Stat 2:916–954. doi: 10.1214/07-AOAS148
Liu X, Minin V, Huang Y, et al. (2004) Statistical methods for analyzing tissue microarray data. J Biopharm Stat 14:671–685. doi: 10.1081/BIP-200025657
Chen G, Zhong H, Belousov A, Devanarayan V (2015) A PRIM approach to predictive-signature development for patient stratification. Stat Med 34:317–342. doi: 10.1002/sim.6343
Su X, Zhou T, Yan X, et al. (2008) Interaction Trees with Censored Survival Data. Int J Biostat. doi: 10.2202/1557-4679.1071
Su X, Tsai C-L, Wang H, et al. (2009) Subgroup Analysis via Recursive Partitioning. J Mach Learn Res 10:141–158.
Lipkovich I, Dmitrienko A, Denne J, Enas G (2011) Subgroup identification based on differential effect search--a recursive partitioning method for establishing response to treatment in patient subpopulations. Stat Med 30:2601–2621. doi: 10.1002/sim.4289
Lipkovich I, Dmitrienko A (2014) Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. J Biopharm Stat 24:130–153. doi: 10.1080/10543406.2013.856024
Berger JO, Wang X, Shen L (2014) A Bayesian approach to subgroup identification. J Biopharm Stat 24:110–129. doi: 10.1080/10543406.2013.856026
Devanarayan V, Cummins D, Tanzer L (1999) Application of GAM and tree models for assessing the role of drug resistance proteins in leukemia chemotherapy.
Tian L, Tibshirani R (2011) Adaptive index models for marker-based risk stratification. Biostatistics 12:68–86. doi: 10.1093/biostatistics/kxq047
Tian L, Alizadeh A, Gentles A, Tibshirani R (2012) A Simple Method for Detecting Interactions between a Treatment and a Large Number of Covariates. arXiv
Tibshirani R, Efron B (2002) Pre-validation and inference in microarrays. Stat Appl Genet Mol Biol. doi: 10.2202/1544-6115.1000
Foster JC, Taylor JM, Ruberg SJ (2011) Subgroup identification from randomized clinical trial data. Stat Med 30:2867–2880.