
Using Machine Learning to Model Standard Practice: Retrospective Analysis of Group C-Section Rate via Bagged Decision Trees
Rich Caruana, Cornell CS
Stefan Niculescu, CMU CS
Bharat Rao, Siemens Medical
Cynthia Simms, Magee Hospital

C-Section in the U.S.
C-section rate in U.S. too high
–Western Europe has lower rates, but comparable outcomes
–C-section is major surgery => tough on mother
–C-section is expensive
Why is U.S. rate so high?
–Convenience (rate highest Fridays, before sporting events, …)
–Litigation
–Social and demographic issues
Current controls
–Financial: pay-per-patient instead of pay-per-procedure
–Physician reviews: monthly/quarterly evaluation of rates


Risk Adjustment
Some practices specialize in high-risk patients
Some practices have low-risk demographics
Must correct for the patient population seen by each practice
Need an accurate, objective, evidence-based method for assessing the c-section rate of different physician practices

Modeling "Standard Practice" with Machine Learning Not trying to improve outcomes Maintain quality of outcomes while reducing c-sec rate Compare physicians/practices to other physicians/practices Warn physicians if rate higher than other practitioners

Data
3 years of data from South-Western PA
22,175 expectant mothers
16.8% c-section
144 attributes per patient (82 used for learning)
17 physician practices
C-section rate varies from 13% to 23% across practices

Learning Methods
Logistic Regression
Artificial Neural Nets
MML Decision Trees
–Buntine's IND decision tree package
–Maximum-size trees (1000s of nodes)
–Probabilities generated by smoothing counts along the path to the leaf
Bagged MML Decision Trees
–Better accuracy
–Better ROC Area
–Excellent calibration
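The deck says tree probabilities come from "smoothing counts along the path to the leaf" but does not give the formula. Below is a minimal sketch of one common way to do this, an m-estimate that blends each node's empirical c-sec rate with the estimate inherited from its parent; the path counts, the prior of 0.168 (the overall rate from the Data slide), and m = 5 are illustrative assumptions, and IND's actual smoothing may differ.

    # Sketch: smoothed class probability from the counts on one root-to-leaf path.
    # path_counts: list of (n_csec, n_total) for each node on the path,
    # in root-to-leaf order. Counts and parameters below are hypothetical.
    def smoothed_probability(path_counts, prior=0.168, m=5.0):
        p = prior
        for n_csec, n_total in path_counts:
            # m-estimate: blend this node's empirical rate with the estimate
            # inherited from its parent (the prior at the root)
            p = (n_csec + m * p) / (n_total + m)
        return p

    # Example: root node, one internal node, and a small leaf
    print(smoothed_probability([(3726, 22175), (412, 1800), (9, 40)]))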

Bagged Decision Trees
Draw 100 bootstrap samples of the data
Train a tree on each sample -> 100 trees
Average the predictions of the trees on out-of-bag samples
[Figure: individual tree predictions averaged, ( … ) / # trees = 0.24]
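A minimal sketch of the bagging procedure just described, using scikit-learn decision trees in place of the MML/IND trees from the talk: 100 bootstrap samples, with each case scored only by the trees for which it was out-of-bag. The array names X and y and the default tree settings are assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_oob_probabilities(X, y, n_trees=100, seed=0):
        """X: (n_cases, n_features) array; y: 0/1 array (1 = c-section).
        Returns each case's average predicted P(c-section), using only
        the trees for which that case was out-of-bag."""
        rng = np.random.default_rng(seed)
        n = len(y)
        prob_sum = np.zeros(n)
        times_oob = np.zeros(n)
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)              # bootstrap sample (with replacement)
            oob = np.setdiff1d(np.arange(n), idx)         # cases left out of this sample
            tree = DecisionTreeClassifier().fit(X[idx], y[idx])   # grow a full-size tree
            prob_sum[oob] += tree.predict_proba(X[oob])[:, 1]     # accumulate P(c-section)
            times_oob[oob] += 1
        return prob_sum / np.maximum(times_oob, 1)        # average over the out-of-bag trees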

Bagging Improves Accuracy and AUC

Calibration
Good calibration: if 1000 patients have pred(x) = 0.1, ~100 should get a c-sec

Calibration
A model can be accurate but poorly calibrated
–good threshold with uncalibrated probabilities
A model can have good ROC but be poorly calibrated
–ROC is insensitive to scaling/stretching
–only the ordering has to be correct, not the probabilities themselves
A model can have very high variance, but be well calibrated
–if the expected value is correct -> good calibration
–not worried about the prediction for any one patient:
  –aggregating predictions over groups of patients reduces variance
  –high variance in the per-patient predictions is OK
  –bias/variance tradeoff: favor reducing bias even at the cost of more variance
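To make the variance point concrete: if the per-patient prediction errors were roughly independent with variance sigma^2 (an idealization, since predictions from one model are correlated), the variance of a group-average prediction over N patients shrinks as

    \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat{p}_i\right)
      = \frac{1}{N^{2}}\sum_{i=1}^{N}\operatorname{Var}(\hat{p}_i)
      = \frac{\sigma^{2}}{N}

so for practices with hundreds or thousands of patients the aggregate prediction is far more stable than any single per-patient prediction, which is why the tradeoff can be tilted toward low bias.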

Measuring Calibration
Bucket method: group cases into buckets by predicted probability
In each bucket:
–measure the observed c-sec rate
–measure the predicted c-sec rate (average of the predicted probabilities)
–if the observed c-sec rate is similar to the predicted c-sec rate => good calibration in that bucket
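A minimal sketch of the bucket method: assign each case to an equal-width probability bucket and compare the observed c-sec rate with the mean predicted probability in each bucket. The talk does not state the number of buckets or how bucket errors are combined, so 10 buckets and a count-weighted mean absolute error are assumptions.

    import numpy as np

    def calibration_by_bucket(y_true, y_prob, n_buckets=10):
        """y_true: 0/1 outcomes; y_prob: predicted P(c-section).
        Returns per-bucket (index, count, observed rate, predicted rate)
        and a count-weighted mean absolute calibration error."""
        y_true = np.asarray(y_true)
        y_prob = np.asarray(y_prob)
        # assign each case to an equal-width probability bucket
        bucket = np.minimum((y_prob * n_buckets).astype(int), n_buckets - 1)
        rows, weighted_error = [], 0.0
        for b in range(n_buckets):
            mask = bucket == b
            if not mask.any():
                continue
            observed = y_true[mask].mean()     # observed c-sec rate in the bucket
            predicted = y_prob[mask].mean()    # average predicted probability
            rows.append((b, int(mask.sum()), observed, predicted))
            weighted_error += mask.sum() * abs(observed - predicted)
        return rows, weighted_error / len(y_prob)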

Calibration Plot

Bagging Improves Calibration an Order of Magnitude!

Overall Bagged MML DT Performance
Out-of-bag test points (cross-validation for bagging)
Accuracy: 90.1% (baseline 83.2%)
ROC Area:
Mean absolute calibration error = 0.013
Calibration is as good in the tails as in the center of the distribution
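A minimal sketch of how the headline numbers could be computed from the out-of-bag probabilities; the 0.5 classification threshold and the majority-class baseline are assumptions, and the calibration error would come from the bucket statistic sketched earlier.

    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    def summarize_performance(y_true, oob_prob, threshold=0.5):
        """Headline metrics from out-of-bag probabilities."""
        y_true = np.asarray(y_true)
        oob_prob = np.asarray(oob_prob)
        baseline = max(y_true.mean(), 1 - y_true.mean())   # always predict the majority class
        return {
            "accuracy": accuracy_score(y_true, oob_prob >= threshold),
            "baseline_accuracy": baseline,
            "roc_area": roc_auc_score(y_true, oob_prob),
        }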

Aggregate Risk
The aggregate risk of c-section for a group is just the average predicted probability of c-section over the patients in that group
If a group's aggregate risk matches its observed c-sec rate, then the group is performing c-secs in accordance with standard practice (as modeled by the learned model)
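A minimal sketch of the group-level comparison: a practice's aggregate risk is the mean predicted probability over its patients, set against its observed c-sec rate. The pandas column names ('practice', 'csec', 'pred') are assumptions for illustration.

    import pandas as pd

    def practice_comparison(df):
        """df: one row per patient with columns 'practice' (group id),
        'csec' (observed 0/1 outcome), and 'pred' (predicted probability)."""
        grouped = df.groupby("practice")
        out = pd.DataFrame({
            "n_patients": grouped.size(),
            "observed_rate": grouped["csec"].mean(),    # actual c-sec rate
            "aggregate_risk": grouped["pred"].mean(),   # mean predicted probability
        })
        # positive difference: more c-secs than the standard-practice model predicts
        out["difference"] = out["observed_rate"] - out["aggregate_risk"]
        return out.sort_values("difference", ascending=False)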

Sanity Check
Could the models have good calibration in the tails, but still be inaccurate on some groups of patients?
Suppose two kinds of patients, A and B, both with elevated risk: p(A) = p(B) = 0.8
Patients A and B differ, so the model might be more accurate or better calibrated on A than on B; a practice seeing mostly A patients could then be over- or under-estimated relative to a practice seeing mostly B patients
Are the models good on each physician group?

Assumptions
Intrinsic factors: variables given to the learner as inputs
Extrinsic factors: variables not given to the learner
–Provider (HMO vs. pay-per-procedure vs. no insurance vs. …)
–Patient/physician preference (e.g., socio-economic class)
Learning compensates for intrinsic factors, not extrinsic ones
–assumes extrinsic factors are not correlated with intrinsic ones
–correlations may weaken sensitivity
Differences between observed and predicted rates are due to extrinsic factors, model bias, and noise
If the models are good, the differences are due mainly to extrinsic factors such as provider and physician/patient preferences

Summary & Conclusions Bagged probabilistic decision trees are excellent –Accuracy: 86.9% -> 90.1% –ROC Area: > –Calibration: > 0.013! Learned models of standard practice are good 3 of 17 groups have high c-section rate –one has high-risk population –one has normal risk population –one has low-risk population!!! One group has csec rate 4% below predicted rate –how do they do it? –are their outcomes good?

Future Work
3 more years of unused data
Do discrepancies from standard practice correlate with:
–practice characteristics?
–patient characteristics?
–payor characteristics?
–outcomes?
Compare with data/models for patients in Europe
Confidence intervals
Research on machine learning methods to improve calibration
Apply this methodology to other problems in medicine

Sub-Population Assessment
Accurately predict the c-section rate for groups of patients
–aggregate prediction
–not worried about predictions for individual patients
Retrospective analysis
–not prediction for future patients (i.e., pregnant women)
–prediction for pregnancies that have already gone to full term
–risk is assumed by the physician, not by machine learning
Tell a physician/practice if their rate does not match the rate predicted by the standard-practice model

Calibration
Good calibration: given 1000 patients with identical features, the observed c-sec rate should equal the average predicted rate on those patients
Often have: but that's not good enough