Using Machine Learning to Model Standard Practice: Retrospective Analysis of Group C-Section Rate via Bagged Decision Trees Rich Caruana Cornell CS Stefan Niculescu CMU CS Bharat Rao Siemens Medical Cynthia Simms Magee Hospital
C-Section in the U.S. C-section rate in U.S. too high –Western Europe has lower rates, but comparable outcomes –C-section is major surgery => tough on mother –C-section is expensive Why is U.S. rate so high? –Convenience (rate highest Fridays, before sporting events, …) –Litigation –Social and Demographic issues Current controls –Financial: pay-per-patient instead of pay-per-procedure –Physician reviews: monthly/quarterly evaluation of rates
Risk Adjustment Some practices specialize in high risk patients Some practices have low-risk demographics Must correct for patient population seen by each practice
Risk Adjustment Some practices specialize in high risk patients Some practices have low-risk demographics Must correct for patient population seen by each practice Need an accurate, objective, evidence-based method for assessing c-section rate of different physician practices
Modeling "Standard Practice" with Machine Learning Not trying to improve outcomes Maintain quality of outcomes while reducing c-sec rate Compare physicians/practices to other physicians/practices Warn physicians if rate higher than other practitioners
Data 3 years of data: from South-Western PA 22,175 expectant mothers 16.8% c-section 144 attributes per patient (82 used for learning) 17 physician practices C-section rate varies from 13% to 23% in practices
Learning Methods Logistic Regression Artificial Neural Nets MML Decision Trees –Buntine's IND decision tree package –Maximum size trees (1000's of nodes) –Probabilities generated by smoothing counts along path to leaf Bagged MML Decision Trees –Better accuracy –Better ROC Area –Excellent calibration
Bagged Decision Trees Draw 100 bootstrap samples of data Train trees on each sample -> 100 trees Average prediction of trees on out-of-bag samples … Average prediction ( … ) / # Trees = 0.24
Bagging Improves Accuracy and AUC
Calibration Good calibration: If 1000 patients have pred(x) = 0.1, ~100 should get csec
Calibration Model can be accurate but poorly calibrated –good threshold with uncalibrated probabilities Model can have good ROC but be poorly calibrated –ROC insensitive to scaling/stretching –only ordering has to be correct, not probabilities themselves Model can have very high variance, but be well calibrated –if expected value is correct -> good calibration –not worried about prediction for any one patient: aggregate prediction over groups of patients reduces variance high variance of per patient prediction is OK bias/variance tradeoff: favor reducing bias at loss of variance
Measuring Calibration Bucket method In each bucket: –measure observed c-sec rate –predicted c-sec rate (average of probabilities) –if observed csec rate similar to predicted csec rate => good calibration in that bucket # ######### #########
Calibration Plot
Bagging Improves Calibration an Order of Magnitude!
Overall Bagged MML DT Performance Out-of-bag test points (cross validation for bagging) Accuracy: 90.1% (baseline 83.2%) ROC Area: Mean absolute calibration error = Calibration as good in tails as center of distribution
Aggregate Risk Aggregate risk of c-section for a group is just the average probability of c-section for each patient in that group If a group's aggregate risk matches the observed c-sec rate, then the group is performing c-secs in accordance with standard practice (as modeled by learned model)
Sanity Check Could models have good calibration in tails, but still be inaccurate on some groups of patients? Suppose two kinds of patients, A and B, both with elevated risk: p(A) = p(B) = 0.8 Patients A and B differ, model might be more accurate or better calibrated on A than B, so a practice seeing patients A might over/under estimate relative to practice seeing B Are models good on each physician group?
Assumptions Intrinsic factors: variables given to learning as inputs Extrinsic factors: variables not given to learning –Provider (HMO vs. pay-per-procedure vs. no insurance vs. …) –patient/physician preference (e.g. socio-economic class) Learning compensates for intrinsics, not extrinsics –assumes extrinsic not correlated to intrinsic –correlations may weaken sensitivity Difference between observed and predicted rates due to extrinsic factors, model bias, noise If models are good, differences due mainly to extrinsic factors such as provider and physician/patient preferences
Summary & Conclusions Bagged probabilistic decision trees are excellent –Accuracy: 86.9% -> 90.1% –ROC Area: > –Calibration: > 0.013! Learned models of standard practice are good 3 of 17 groups have high c-section rate –one has high-risk population –one has normal risk population –one has low-risk population!!! One group has csec rate 4% below predicted rate –how do they do it? –are their outcomes good?
Future Work 3 more years of unused data: Do discrepancies to standard practice correlate with: –practice characteristics? –patient characteristics? –payor characteristics? –outcomes? Compare with data/models for patients in Europe Confidence intervals Research on machine learning to improve calibration Apply this methodology to other problems in medicine
Sub Population Assessment Accurately predict c-section rate for groups of patients –aggregate prediction –not worried about prediction for individual patients Retrospective analysis –not prediction for future patients (i.e. pregnant women) –prediction for pregnancies that already have gone full term –risk assumed by physician, not machine learning Tell physician/practice if their rate does not match rate predicted by the standard practice model
Calibration Good calibration: Given 1000 patients with identical features, observed rate should equal average predicted rate on those patients Often have: but that's not good enough