The Receiver Operating Characteristic (ROC) Curve
EPP 245 Statistical Analysis of Laboratory Data
November 30, 2006, EPP 245 Statistical Analysis of Laboratory Data

Binary Classification

Suppose we have two groups, that each case is a member of exactly one of them, and that we know the correct classification ("truth"). Suppose also that we have a prediction method that produces a single numerical value, where small values suggest membership in group 1 and large values suggest membership in group 2.
If we pick a cutpoint t, we can assign any case with a predicted value ≤ t to group 1 and all other cases to group 2. For that value of t, we can count the cases correctly assigned to group 2 (true positives) and the cases incorrectly assigned to group 2 (false positives). For t small enough, all cases are assigned to group 2, and for t large enough, all are assigned to group 1. The ROC curve plots true positives against false positives as t varies, usually expressed as rates: sensitivity on the vertical axis against 1 − specificity on the horizontal axis.
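The cutpoint construction above can be sketched in a few lines of Python. This is only an illustration, not part of the Stata session in these slides: the function names `roc_points` and `auc` and the four toy scores are assumptions made for the example, and labels are coded 1 for group 2 and 0 for group 1.

```python
def roc_points(scores, labels):
    """Return sorted (false positive rate, true positive rate) pairs, one per cutpoint.

    labels: 1 = group 2 ("positive"), 0 = group 1.
    A case is assigned to group 2 when its score is > t.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Cutpoints: one below every score (all cases go to group 2),
    # then each distinct observed score.
    cutpoints = [float("-inf")] + sorted(set(scores))
    points = []
    for t in cutpoints:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return sorted(points)


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, made up for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
pts = roc_points(scores, labels)
area = auc(pts)
```

The curve always runs from (0, 0), where t is so large that nothing is assigned to group 2, to (1, 1), where t is so small that everything is.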
Juul's IGF data

Description: The 'juul' data frame has 1339 rows and 6 columns. It contains a reference sample of the distribution of insulin-like growth factor (IGF-I), with one observation per subject at various ages; the bulk of the data were collected in connection with school physical examinations.

Variables:
  age       a numeric vector (years).
  menarche  a numeric vector. Has menarche occurred (code 1: no, 2: yes)?
  sex       a numeric vector (1: boy, 2: girl).
  igf1      a numeric vector. Insulin-like growth factor (μg/l).
  tanner    a numeric vector. Codes 1-5: stages of puberty according to Tanner.
  testvol   a numeric vector. Testicular volume (ml).
Predicting Menarche

  - Subset the Juul data to females between 8 and 20 years old.
  - Predict menarche from age as a quantitative variable and Tanner score as a qualitative variable, using dummy variables (tan2 to tan5, with stage 1 as the reference).
  - Menarche is re-coded to 0/1 (men1: 0 = no, 1 = yes).
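As a rough illustration of these preparation steps, here is a Python sketch on a few made-up rows. It is not the Stata code used for the slides. The rows, the function name `prepare`, the inclusive age bounds, and the decision to drop rows with missing values are all assumptions for the example.

```python
# Hypothetical rows in the juul layout: (age, menarche, sex, tanner).
# These values are invented for illustration; they are not from the data set.
rows = [
    (7.5, 1, 2, 1),      # girl, too young: excluded
    (10.2, 1, 2, 2),
    (13.9, 2, 2, 4),
    (16.0, 2, 2, 5),
    (12.0, None, 1, 3),  # boy: excluded
    (25.0, 2, 2, 5),     # girl, too old: excluded
]


def prepare(rows, lo=8, hi=20):
    """Keep females aged lo..hi (inclusive bounds are an assumption),
    recode menarche from 1/2 to 0/1, and build Tanner dummies tan2..tan5."""
    out = []
    for age, menarche, sex, tanner in rows:
        if sex != 2 or menarche is None or tanner is None:
            continue
        if not (lo <= age <= hi):
            continue
        rec = {"age": age, "men1": menarche - 1}  # 1 -> 0 (no), 2 -> 1 (yes)
        for k in range(2, 6):
            # Tanner stage 1 is the reference level, so it gets no dummy.
            rec[f"tan{k}"] = 1 if tanner == k else 0
        out.append(rec)
    return out


data = prepare(rows)
```

A case at Tanner stage 1 has all four dummies equal to 0, so the coefficients on tan2 to tan5 are interpreted relative to stage 1.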
. logistic men1 age tan2 tan3 tan4 tan5

Logistic regression                 Number of obs =    519
                                    LR chi2(5)    =
                                    Prob > chi2   =
Log likelihood =                    Pseudo R2     =

        men1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
        tan2 |
        tan3 |
        tan4 |
        tan5 |

. predict pmen
(option p assumed; Pr(men1))

. predict pmen1, xb

Here pmen is the predicted probability of menarche and pmen1 is the linear predictor (the predicted log odds).
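The two predictions are linked by the inverse logit, p = 1 / (1 + exp(−xb)). Because this transformation is strictly increasing, both scores rank the cases identically, so an ROC curve built from either one is the same. A small Python check of that claim follows; the linear-predictor values are made up, and `inv_logit` and `ranking` are names chosen for this sketch.

```python
import math


def inv_logit(xb):
    """Map a linear predictor (log odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-xb))


# Made-up linear predictors, standing in for pmen1.
xb_values = [-3.0, -0.5, 0.0, 1.2, 4.0]
# Corresponding probabilities, standing in for pmen.
p_values = [inv_logit(v) for v in xb_values]


def ranking(values):
    """Indices of the cases ordered from smallest to largest score."""
    return sorted(range(len(values)), key=lambda i: values[i])


# A monotone transformation leaves the ordering of cases unchanged.
same_order = ranking(xb_values) == ranking(p_values)
```

The histograms of the two scores can look quite different, since probabilities pile up near 0 and 1 while the linear predictor spreads out, even though the classification performance at matching cutpoints is identical.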
. histogram pmen
. graph export pmenhist.wmf
. histogram pmen if men1==0, title("Pre-Menarche")
. graph export pmenhist0.wmf
. histogram pmen if men1==1, title("Post-Menarche")
. graph export pmenhist1.wmf
. histogram pmen1
. graph export pmen1hist.wmf
. hist pmen1 if men1==0, title("Pre-Menarche")
. graph export pmen1hist0.wmf
. hist pmen1 if men1==1, title("Post-Menarche")
. graph export pmen1hist1.wmf
. lroc

Logistic model for men1

number of observations =      519
area under ROC curve   =

. graph export pmenroc.wmf
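The area under the ROC curve has a direct interpretation: it is the proportion of (post-menarche, pre-menarche) pairs in which the post-menarche case has the higher score, with ties counted as one half. The Python sketch below computes it that way on made-up scores. It illustrates the quantity itself and makes no claim about how Stata computes it internally; `auc_by_pairs` is a name chosen for this example.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    labels: 1 = positive (e.g. men1 == 1), 0 = negative.
    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
area = auc_by_pairs(scores, labels)
```

An area of 0.5 corresponds to a score that ranks the two groups no better than chance, and an area of 1 to a score that separates them perfectly.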