A machine learning approach to prognostic and predictive covariate identification for subgroup analysis
David A. James and David Ohlssen
Advanced Exploratory Analytics, Novartis Pharmaceuticals
Joint Statistical Meetings, July 2018
Objectives
Use of machine learning for:
- Discovering and exploring prognostic and predictive subgroups
- Patient risk stratification
- Risk prediction
Two examples from large cardiovascular trials
Note: non-confirmatory setting
Patient stratification by relative risk
[Figure: risk-stratification tree; legend shows relative risk, #events / #patients, and % patients per node]
Interrogating the tree-building process
- What are the top competing predictors for splitting each node?
- How much better was the winning splitting predictor than the 2nd, 3rd, ..., 5th contenders?
- Which predictors could be used for imputing missing data?
- Etc.
These investigations complement the usual assessments such as cross-validation error estimates, ROC analysis, C-index (AUC), etc. (a sketch of how to extract them follows below).
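In R's rpart, the competing and surrogate splits at each node can be read directly from the fitted object. The sketch below is a minimal illustration; the data frame and variable names (trial_data, event) are hypothetical placeholders, not the actual trial variables.

```r
library(rpart)

# Hypothetical data set for illustration only: 'event' is a binary
# outcome and the remaining columns are baseline predictors.
fit <- rpart(
  event ~ .,
  data = trial_data,
  method = "class",
  control = rpart.control(
    maxcompete   = 5,   # keep the top 5 competing splits at each node
    maxsurrogate = 5,   # keep up to 5 surrogate splits (missing-data handling)
    xval         = 10   # 10-fold cross-validation for pruning diagnostics
  )
)

# summary() lists, for every node, the primary (winning) split, the
# competing splits ranked by improvement, and the surrogate splits
# together with their agreement with the primary split.
summary(fit)

# Cross-validation error estimates used to choose the tree size.
printcp(fit)
```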
Competing splits at each non-terminal node (baseline predictors)
Here we go into the details of the tree construction. First we note the final tree (displayed in abbreviated form) and the most important variables identified by the tree among all ~60 candidate predictors. We then list the top 5 predictors competing to split the root node (node 1). The bottom-left panel shows the change in deviance at the top node (the parent deviance minus the sum of the deviances in the daughter nodes), on the order of 145 units out of 8452 in the parent node. The bottom-right panel displays the "surrogate" variables, i.e., the covariates most correlated with the primary split "heartfn"; this conveys a measure of collinearity between the predictors involved in the splitting of node 1.
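The deviance improvements and surrogate agreements behind a display like this can be pulled from the fitted rpart object's splits matrix. A minimal sketch, again assuming the hypothetical fit from above:

```r
# The 'splits' matrix of an rpart fit records, for each candidate split,
# the number of observations (count), the improvement in the fitting
# criterion (improve), and, for surrogate splits, the adjusted agreement
# with the primary split (adj).
head(fit$splits)

# The "Primary splits" section of the summary ranks the winning split
# against its competitors at each node; cp trims the printout to the
# most important nodes.
summary(fit, cp = 0.1)
```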
Searching for predictive factors: model-based (MOB) partitioning trees
Objective
- Assess whether baseline covariates are "predictive"
Methods
- Quantify how much each baseline covariate changes the estimated treatment effects
- Use MOB trees to split patients into subgroups according to those baseline covariates that modify the magnitude of the overall treatment effect
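In R, model-based recursive partitioning is available via partykit::glmtree (built on the general mob() framework). The sketch below illustrates the idea: fit a treatment-effect model in each node and split on the baseline covariates where the effect estimates are unstable. All variable names (trial_data, event, trt, and the partitioning covariates) are hypothetical, not the actual trial variables.

```r
library(partykit)

# Model-based partitioning: the model 'event ~ trt' is refit in each
# node; the covariates after '|' are candidate partitioning variables.
# Splits occur where parameter-instability tests indicate that the
# treatment effect differs across values of a baseline covariate.
mob_fit <- glmtree(
  event ~ trt | age + sex + diabetes + egfr,
  data    = trial_data,
  family  = binomial,
  alpha   = 0.05,   # significance level of the instability tests
  minsize = 500     # require reasonably large subgroups
)

# Each terminal node carries its own estimated treatment effect.
print(mob_fit)
plot(mob_fit)
```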
Predicting risk: random survival forests vs. extended Cox models
- Performance: 2-year predictions (C-index, calibration plots)
- Nelson-Aalen estimate of survival
- "Out-of-bag" ensemble estimator
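A random survival forest of this kind can be fit in R with randomForestSRC; the out-of-bag (OOB) ensemble estimate comes for free, and the reported OOB error rate is 1 minus Harrell's C-index. A minimal sketch with hypothetical variable names (trial_data, time, status):

```r
library(randomForestSRC)
library(survival)

# Random survival forest: predictions are built from Nelson-Aalen-type
# cumulative hazard estimates aggregated over the trees.
rsf_fit <- rfsrc(Surv(time, status) ~ ., data = trial_data, ntree = 1000)

# OOB ensemble performance: err.rate is 1 - Harrell's C-index,
# estimated on out-of-bag cases (no separate test set needed).
1 - tail(rsf_fit$err.rate, 1)

# Comparator: a Cox model, with its concordance on the same data.
cox_fit <- coxph(Surv(time, status) ~ ., data = trial_data)
concordance(cox_fit)
```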