Solving Wide Predictive Modeling Problems With Clinical and Genomic Data. Kelci J. Miclaus, PhD, Advanced Analytics R&D Manager, JMP Life Sciences, SAS Institute, Inc.
Introduction: Outline. Precision Medicine Initiative; Predictive Models and the Impact of “Big” Data; Tools for Model Assessment; Subgroup Analysis; Live Demonstration.
Introduction: Precision Medicine Initiative. Biological data to drive research for tailored therapies. Wide range of application areas, including oncology and pharmacogenomics. Better predict: treatment outcomes; responders; survival or time-to-event. Rich data mining environment.
Predictive Modeling: Methods and Big Data. Rich set of methodology for prediction problems. Popular methods: Continuous: GLM, PLS, kernel methods (e.g. Ridge, Radial-Basis), trees (e.g. Forest, Gradient Boosting), Quantile Regression. Discrete: Logistic, Discriminant, KNN. Censored: Life Regression, Cox Proportional Hazards, Buckley-James. “Big” biological data => wide prediction problem (far more predictors than subjects), with a serious risk of overfitting.
Predictive Modeling: Predictor Reduction. Filtering techniques, from simple to complex: known biology; statistical testing; clustering; forest models or linear regression model selection; optimization. Combination of algorithms + predictor reduction = MILLIONS of potential models. It is critical to perform filtering within a cross-validated framework to prevent OVERFITTING and generalization bias in your models.
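To illustrate why the filter belongs inside the cross-validation loop, here is a minimal sketch on pure-noise data, where no model should beat chance. Everything in it is an assumption made for illustration: the class-mean-gap filter, the nearest-centroid classifier, the sample sizes, and all function names are hypothetical and are not taken from the presentation or from JMP.

```python
import random

def class_means(rows, labels, col):
    """Mean of one column within class 0 and within class 1."""
    sums, counts = [0.0, 0.0], [0, 0]
    for row, lab in zip(rows, labels):
        sums[lab] += row[col]
        counts[lab] += 1
    return sums[0] / counts[0], sums[1] / counts[1]

def top_features(rows, labels, k):
    """Univariate filter: keep the k columns with the largest gap in class means."""
    def gap(col):
        m0, m1 = class_means(rows, labels, col)
        return abs(m1 - m0)
    return sorted(range(len(rows[0])), key=gap, reverse=True)[:k]

def predict(train_rows, train_labels, cols, row):
    """Nearest-centroid classifier restricted to the selected columns."""
    d0 = d1 = 0.0
    for col in cols:
        m0, m1 = class_means(train_rows, train_labels, col)
        d0 += (row[col] - m0) ** 2
        d1 += (row[col] - m1) ** 2
    return 0 if d0 <= d1 else 1

def cv_accuracy(rows, labels, k_feat, n_folds, filter_inside):
    """K-fold accuracy with the filter run inside each fold or once on all data."""
    n = len(rows)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # Filtering on ALL rows lets the held-out rows influence predictor choice.
    fixed = None if filter_inside else top_features(rows, labels, k_feat)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        tr_rows = [rows[i] for i in train_idx]
        tr_labels = [labels[i] for i in train_idx]
        cols = top_features(tr_rows, tr_labels, k_feat) if filter_inside else fixed
        for i in test_idx:
            correct += predict(tr_rows, tr_labels, cols, rows[i]) == labels[i]
    return correct / n

# Wide noise data: 40 subjects, 1000 predictors, labels unrelated to predictors.
rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(40)]
labels = [i % 2 for i in range(40)]

biased_acc = cv_accuracy(rows, labels, k_feat=10, n_folds=5, filter_inside=False)
honest_acc = cv_accuracy(rows, labels, k_feat=10, n_folds=5, filter_inside=True)
print("filter outside CV:", biased_acc)
print("filter inside CV: ", honest_acc)
```

Because the labels are random, the accuracy obtained with the filter inside each fold should hover around chance, while filtering once on the full data typically reports a much higher, misleading figure.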
Model Assessment: Cross-Validation Model Comparison. Data hold-out: K-fold, leave-L-out, leave-P-percent-out, etc. Hold-out methods: simple random, random partition, stratified, etc. Performance metrics: RMSE, Harrell’s C, AUC, correlation, etc.
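The three less familiar metrics named above can be sketched in a few lines each. This is a plain-Python illustration under simplifying assumptions (for example, tied survival times are simply skipped in Harrell's C); the function names are hypothetical and this is not how JMP computes them.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error for a continuous response."""
    sq = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(sq) / len(sq))

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def harrell_c(times, events, risk):
    """Harrell's C: among comparable pairs, how often the earlier failure has higher risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if the earlier time is an observed event.
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

rmse_value = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
auc_value = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])          # 3 of 4 pairs correct
c_value = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.5, 0.1])  # 4 of 5 pairs
print(rmse_value, auc_value, c_value)
```

In a cross-validated comparison, each metric would be computed on the held-out rows only and then summarized across folds.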
Specialized Prediction Problems: Subgroup Analysis. Identify subjects most likely to respond to treatment. Benefits in study design / safety / ethics. Subgroup guidance (CPMP, 2014). Classification and regression trees are popular models (Zink et al., 2015). [Figure: axes P(Improve if NOT Treated) and P(Improve if Treated), each marked at 0.5 and 1, with four regions labeled INCURABLE, GET WELL ANYWAY, DRUG MAKES YOU WORSE, and DRUG CURES YOU.]
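The four regions of the figure can be read as a simple rule on the two probabilities. The sketch below assumes the regions are separated at the 0.5 tick on each axis and assigns the labels by their plain meaning; both the cutoff and the function name are assumptions, not something the slide states.

```python
def response_quadrant(p_untreated, p_treated, cutoff=0.5):
    """Label a subject by predicted improvement with and without treatment."""
    improves_untreated = p_untreated >= cutoff
    improves_treated = p_treated >= cutoff
    if improves_treated and improves_untreated:
        return "GET WELL ANYWAY"
    if improves_treated:
        return "DRUG CURES YOU"        # improves only when treated
    if improves_untreated:
        return "DRUG MAKES YOU WORSE"  # improves only when NOT treated
    return "INCURABLE"

print(response_quadrant(p_untreated=0.2, p_treated=0.9))
```

Subgroup analysis is mainly after the "DRUG CURES YOU" region: subjects whose predicted outcome changes for the better because of treatment.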
JMP Life Sciences: Live Demonstration. JMP Genomics and JMP Clinical Predictive Modeling Reviews. Example data: sepsis prediction in hospitals with metabolite and protein data; survival prediction in prostate cancer with clinical trials data.
Discovery and prediction Hospital Biomarker utility to predict sepsis survival
Subgroup Analysis: Interaction Trees. Fit a linear, logistic, or Cox model to all randomized subjects: $f(y_i) = \beta_0 + \beta_1 x_i + \beta_2\,\mathrm{Treatment}_i + \beta_3\,\mathrm{Treatment}_i \cdot x_i$. A significant interaction implies a differential treatment effect between the subgroups defined by a binary covariate. Split based on the p-value of the treatment-by-covariate interaction term (Su et al., 2009). [Figure: interaction tree rooted at All Randomized Subjects, with nodes Biomarker 1 Absent / Present, Biomarker 2 Absent / Present, and Biomarker 3 Absent / Present.]
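A single split of an interaction tree can be sketched for the simplest case: a continuous response, a binary treatment, and binary biomarkers. With a binary covariate the model above is saturated, so the estimate of the interaction coefficient is the difference in treatment effect between the two covariate levels, and its t statistic can be computed from the four cell means. The sketch ranks candidates by |t| rather than by p-value, which gives the same ordering when every candidate has the same degrees of freedom. The simulated data and all names are hypothetical, and this is an illustration rather than the procedure of Su et al.

```python
import math
import random

def interaction_t(y, treat, x):
    """t statistic for the treatment-by-covariate term (binary treat, binary x)."""
    cells = {(t, b): [] for t in (0, 1) for b in (0, 1)}
    for yi, ti, xi in zip(y, treat, x):
        cells[(ti, xi)].append(yi)
    means = {key: sum(vals) / len(vals) for key, vals in cells.items()}
    # Estimate of beta_3: treatment effect when x = 1 minus treatment effect when x = 0.
    beta3 = (means[(1, 1)] - means[(0, 1)]) - (means[(1, 0)] - means[(0, 0)])
    # Pooled residual variance of the saturated 2x2 model (n - 4 degrees of freedom).
    sse = sum((yi - means[key]) ** 2 for key, vals in cells.items() for yi in vals)
    s2 = sse / (len(y) - 4)
    se = math.sqrt(s2 * sum(1.0 / len(vals) for vals in cells.values()))
    return beta3 / se

# Simulated trial (an assumption): only biomarker index 1 modifies the treatment effect.
rng = random.Random(1)
n = 160
treat = [i % 2 for i in range(n)]
profiles = [((i // 2) % 2, (i // 4) % 2, (i // 8) % 2) for i in range(n)]
y = [rng.gauss(0.0, 1.0) + 2.0 * treat[i] * profiles[i][1] for i in range(n)]

t_stats = [interaction_t(y, treat, [p[c] for p in profiles]) for c in range(3)]
best = max(range(3), key=lambda c: abs(t_stats[c]))
print("interaction t statistics:", t_stats)
print("split on biomarker index:", best)
```

A full tree would repeat this search within each resulting child node until no interaction is significant or the nodes become too small.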
Subgroup Analysis: Virtual Twins (Foster et al., 2011). Fit a forest model to the response to predict each subject's outcome under both the received treatment and the counterfactual treatment, then fit a tree model to the resulting estimated treatment effects.
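The two-stage idea can be sketched with much simpler learners swapped in: per-arm means within each biomarker profile stand in for the forest, and a one-split search on the gap in mean estimated effect stands in for the tree. These substitutions, the simulated data, and all names are assumptions chosen to keep the example self-contained; they are not the models used by Foster et al.

```python
import random
from collections import defaultdict

def twin_effects(y, treat, profiles):
    """Stage 1 stand-in: Z_i = predicted response if treated minus if untreated."""
    sums, counts = defaultdict(float), defaultdict(int)
    for yi, ti, prof in zip(y, treat, profiles):
        sums[(prof, ti)] += yi
        counts[(prof, ti)] += 1
    # Each subject's "twin" prediction is the mean of same-profile subjects per arm.
    return [sums[(p, 1)] / counts[(p, 1)] - sums[(p, 0)] / counts[(p, 0)]
            for p in profiles]

def best_split(z, profiles):
    """Stage 2 stand-in: the biomarker whose two levels differ most in mean Z."""
    best_col, best_gap = None, -1.0
    for col in range(len(profiles[0])):
        present = [zi for zi, p in zip(z, profiles) if p[col] == 1]
        absent = [zi for zi, p in zip(z, profiles) if p[col] == 0]
        gap = abs(sum(present) / len(present) - sum(absent) / len(absent))
        if gap > best_gap:
            best_col, best_gap = col, gap
    return best_col, best_gap

# Simulated trial (an assumption): treatment helps only when biomarker index 2 is present.
rng = random.Random(2)
n = 320
treat = [i % 2 for i in range(n)]
profiles = [((i // 2) % 2, (i // 4) % 2, (i // 8) % 2) for i in range(n)]
y = [rng.gauss(0.0, 1.0) + 1.5 * treat[i] * profiles[i][2] for i in range(n)]

z = twin_effects(y, treat, profiles)
split_col, split_gap = best_split(z, profiles)
print("subgroup defined by biomarker index:", split_col, "gap in effect:", split_gap)
```

With continuous or high-dimensional covariates the profile means are no longer usable, which is where a flexible learner such as a forest earns its place in the first stage.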
Subgroup Analysis: Optimal Treatment Regimes. Subgroup identification: “the right patients for a given drug”. Optimal treatment regimes: “the best drug for a given patient”. Zhang et al. (2011) give a methodology that fits a response regression model and a propensity score logistic model to create a pseudo binary response and a weight (augmented inverse probability weighted estimators, or AIPWE). Use these as input into predictive modeling routines, including cross-validated designs (Freidlin et al., 2009).
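The pseudo response and weight can be sketched as follows, assuming the standard augmented inverse probability weighted form of the treatment contrast: the AIPW estimate of the outcome under treatment minus the AIPW estimate under control. The pseudo binary response is then whether that contrast is positive, and the weight is its absolute value. The inputs `mu1` and `mu0` (regression predictions under each arm) and `propensity` are hypothetical names for outputs of the two fitted models; the exact estimator should be checked against the cited paper.

```python
def aipw_contrast(y, a, propensity, mu1, mu0):
    """AIPW estimate of (outcome if treated) minus (outcome if untreated) for one subject.

    y: observed response, a: 1 if treated else 0,
    propensity: estimated P(a = 1 | covariates),
    mu1, mu0: regression predictions of the response under treatment and control.
    """
    treated_part = a * y / propensity - (a - propensity) / propensity * mu1
    control_part = ((1 - a) * y / (1 - propensity)
                    + (a - propensity) / (1 - propensity) * mu0)
    return treated_part - control_part

def pseudo_response(y, a, propensity, mu1, mu0):
    """Pseudo binary label (1 = treatment preferred) and its classification weight."""
    contrast = aipw_contrast(y, a, propensity, mu1, mu0)
    return (1 if contrast > 0 else 0), abs(contrast)

# Randomized 1:1 trial, so the propensity is 0.5 for every subject (illustrative values).
label_a, weight_a = pseudo_response(y=10.0, a=1, propensity=0.5, mu1=8.0, mu0=6.0)
label_b, weight_b = pseudo_response(y=2.0, a=1, propensity=0.5, mu1=3.0, mu0=7.0)
print(label_a, weight_a)
print(label_b, weight_b)
```

The labels and weights can then be passed to any classifier that accepts case weights, with the whole pipeline (both model fits included) repeated inside each cross-validation fold.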