Consensus Strategy for Variable Selection in Clinical Prediction Rule Development
Miriam R. Elman, MPH¹; Jessina C. McGregor, PhD²; Jodi Lapidus, PhD¹
¹Oregon Health & Science University-Portland State University School of Public Health; ²Oregon State University/Oregon Health & Science University College of Pharmacy

BACKGROUND
Clinical prediction rules aim to prognostically identify the presence of diagnoses using baseline patient data
Electronic health record (EHR) data is a rich resource
  Massive amount of retrospective patient information
Robust and efficient variable reduction likely aids variable selection on multidimensional EHR data
  Prevent early removal of key predictors

OBJECTIVE
Apply consensus strategy to inform prediction rule developed to direct appropriate selection of antibiotic agents to treat urinary tract infections

METHODS
Data Preparation
  Data management with SAS v9.4
  STEP 1 Extract EHR Data & Identify Cohort
  STEP 2 Split Cohort into Development & Validation Sets
  STEP 3 Development Set for Prediction Rule
Model Building Approach
  Statistical analysis with R 3.3.3
  Consensus strategy to reduce candidate predictors: Random Forest, Group Lasso, Boosted Classification
  Multivariable Logistic Regression

RESULTS
Twenty-two predictors selected by consensus strategy
Figure. Overlap of predictors selected by random forest, lasso, and boosting
No interaction terms selected for best subsets model

RESULTS (continued)
Saturated and best subsets model results in Table
3 of 4 predictors selected by all three methods appeared in final model

Table. Results of multivariable logistic regression models
Model         AUC     Sensitivity  Specificity
Saturated     0.6631  0.6039       0.6432
Best subsets  0.6382  0.4312       0.7748

CONCLUSIONS
Prediction rule did not meet minimum acceptable 90% sensitivity and 85% specificity set a priori by clinicians
Challenging prediction problem
  Mostly categorical predictors
  Key predictors may be missing in retrospective data

FUTURE DIRECTIONS
Reviewed predictors with clinical partners and conducting prospective data collection
  Further model development with additional data
  Explore additional modeling strategies
Developed framework for consensus strategy
  Available for other applications

Miriam Elman, elmanm@ohsu.edu
STEP 1 Extract EHR Data & Identify Cohort
  Extract retrospective EHR data from electronic repositories
  Define cohort, outcome, and predictors
STEP 2 Split Cohort into Development & Validation Datasets
  Randomly split cohort into development (80%) and validation (20%) datasets
STEP 3 Use Development Set for Prediction Rule
  Construct prediction rule on development set
  Set remaining data aside for rule validation
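The random 80%/20% split described in STEP 2 can be sketched as follows. This is a minimal illustration in Python, not the SAS/R code used for the poster; the function name `split_cohort`, the seed, and the cohort size of 1,000 patient IDs are all hypothetical.

```python
import random


def split_cohort(patient_ids, dev_fraction=0.8, seed=2017):
    """Randomly split patient IDs into development and validation lists.

    dev_fraction=0.8 mirrors the 80%/20% split on the poster; the seed
    value is an arbitrary assumption so that the split is reproducible.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)      # private generator, fixed seed
    rng.shuffle(ids)               # random ordering of the cohort
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]


# Hypothetical cohort of 1,000 patients identified by integer IDs.
development, validation = split_cohort(range(1, 1001))
print(len(development), len(validation))
```

The validation list is not touched again until the rule built on the development list is final, which is what keeps the validation estimate honest.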
Consensus Strategy
Random Forest
  party (1.2-2) implementation used
  Algorithm repeated 3 times with different seeds; 10 most important variables used from each run
Group Lasso
  grpreg (3.0-2) used to select categorical variables as a group
  Tuning parameter identified with minimized cross-validated error then refined
Boosted Classification
  mboost (2.7-0) used
  Variables defined as ordinary least squares base learners to group categorical variables
  Continuous variables centered
Multivariable Logistic Regression
  Model selection conducted with best subsets based on minimized BIC
  Model limited to 4 main effects by design
  Interactions assessed after main effects selected
  AUC, sensitivity, and specificity calculated for saturated model and selected model
  Youden’s index chosen for sensitivity and specificity cutpoint
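Youden's index is J = sensitivity + specificity - 1, and the cutpoint is the predicted-probability threshold that maximizes J. The sketch below illustrates that selection in Python under stated assumptions: the outcome labels and predicted probabilities are made-up toy values, and the function names are hypothetical rather than taken from the poster's R analysis.

```python
def sensitivity_specificity(y_true, y_prob, cutpoint):
    """Classify as positive when probability >= cutpoint, then score."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= cutpoint
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutpoint(y_true, y_prob):
    """Return (cutpoint, J, sensitivity, specificity) maximizing Youden's J."""
    best = None
    for c in sorted(set(y_prob)):  # every observed probability is a candidate
        sens, spec = sensitivity_specificity(y_true, y_prob, c)
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (c, j, sens, spec)
    return best


# Toy data (assumed for illustration only): 4 positives, 6 negatives.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cut, j, sens, spec = youden_cutpoint(y_true, y_prob)
print(cut, round(j, 4), round(sens, 4), round(spec, 4))
```

Applying the same formula to the values reported in the results table gives J = 0.6039 + 0.6432 - 1 = 0.2471 for the saturated model and J = 0.4312 + 0.7748 - 1 = 0.2060 for the best subsets model.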
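Overlap of predictors selected by random forest, lasso, and boosting (number of predictors)
  Lasso alone (8)
  Random forest alone (0)
  Boosting alone (0)
  Random forest and lasso (3)
  Lasso and boosting (5)
  Random forest and boosting (2)
  All three (4)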
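The overlap counts can be reproduced with plain set operations. The sketch below is illustrative only: the predictor names (`x01` to `x22`) are placeholders, since the poster does not list the variables, and it assumes the consensus set is the union of the three methods' selections, which is consistent with the region counts summing to 22.

```python
from itertools import count

# Placeholder predictor names; the real variables are not given on the poster.
_names = (f"x{i:02d}" for i in count(1))


def take(n):
    """Return a set of n fresh placeholder predictor names."""
    return {next(_names) for _ in range(n)}


# Build the regions so their sizes match the counts reported above.
all_three = take(4)
rf_and_lasso = take(3)
lasso_and_boosting = take(5)
rf_and_boosting = take(2)
lasso_alone = take(8)

random_forest = all_three | rf_and_lasso | rf_and_boosting
lasso = all_three | rf_and_lasso | lasso_and_boosting | lasso_alone
boosting = all_three | lasso_and_boosting | rf_and_boosting


def overlap_counts(rf, la, bo):
    """Count predictors in each region of the three-way overlap."""
    return {
        "lasso alone": len(la - rf - bo),
        "random forest alone": len(rf - la - bo),
        "boosting alone": len(bo - rf - la),
        "random forest and lasso": len((rf & la) - bo),
        "lasso and boosting": len((la & bo) - rf),
        "random forest and boosting": len((rf & bo) - la),
        "all three": len(rf & la & bo),
    }


regions = overlap_counts(random_forest, lasso, boosting)
consensus = random_forest | lasso | boosting   # assumed: union of all methods
print(regions)
print(len(consensus))
```

Keeping each method's selections as a set makes it easy to try a stricter rule, such as retaining only predictors chosen by at least two methods, without rerunning the selection algorithms.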