Evidence-Based Diagnosis
Combining Tests
Mark J. Pletcher, MD, MPH
6/28/2012
Acknowledgements For this lecture I’ve adapted a slide set from Mike Kohn
Combining Tests – Overview
- A case with 2 simple tests
- Test non-independence
- Approaches to combining tests: looking at all possible combinations of results; recursive partitioning; logistic regression; other
- Overfitting and validation in multi-test panels
Combining Tests – A Case
Case: a pregnant woman getting prenatal care is worried about Down's Syndrome (Trisomy 21). Chorionic Villus Sampling (CVS) is a definitive test, but there is a risk of miscarriage. Should she get this procedure?
Combining Tests – A Case Age helps… Risk goes up with age Our patient is 41, so pretest risk is ~2%...
Combining Tests – A Case
Ultrasound can help even more. It's harmless, and several features at the 11–14 week scan predict Trisomy 21 (Down's)*:
- Nuchal translucency
- Nasal bone absence
How do we use these two features together?
*Cicero, S., G. Rembouskos, et al. (2004). "Likelihood ratio for trisomy 21 in fetuses with absent nasal bone at the 11-14-week scan." Ultrasound Obstet Gynecol 23(3).
Combining Tests – A Case First, nuchal translucency (NT)
Wider translucent “gap” here is predictive of Down’s
Nuchal Translucency Data
Cross-sectional study: 5556 pregnant women undergoing CVS; 333 (6%) with a Trisomy 21 fetus. All had ultrasound at the 11–14 week scan.
Dichotomize here for purposes of illustration
Nuchal Translucency Data
                        Trisomy 21
Nuchal Translucency     D+      D-     Total
≥ 3.5 mm (+)           212     478      690
< 3.5 mm (-)           121    4745     4866
Total                  333    5223     5556
Sensitivity and Specificity? PPV and NPV?
Nuchal Translucency
Sensitivity = 212/333 = 64%
Specificity = 4745/5223 = 91%
and IF we assume that this cross-sectional sample represents our population of interest, then:
Prevalence = 333/5556 = 6%
PPV = 212/(212 + 478) = 31%
NPV = 4745/(4745 + 121) = 97.5%
Nuchal Translucency Data
                        Trisomy 21
Nuchal Translucency     D+      D-      LR
≥ 3.5 mm (+)           212     478     7.0
< 3.5 mm (-)           121    4745     0.4
Total                  333    5223
LR+ = P(T+|D+)/P(T+|D-) = (212/333)/(478/5223) = 7.0
LR- = P(T-|D+)/P(T-|D-) = (121/333)/(4745/5223) = 0.4
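The test characteristics above can be checked with a few lines of code. A minimal sketch using the 2x2 counts from the slides:

```python
# Sketch: test characteristics from the 2x2 nuchal translucency table.
# Counts come from the slides (D+ = Trisomy 21, D- = no Trisomy 21).
tp, fn = 212, 121   # NT >= 3.5 mm / < 3.5 mm among the 333 D+ fetuses
fp, tn = 478, 4745  # NT >= 3.5 mm / < 3.5 mm among the 5223 D- fetuses

sens = tp / (tp + fn)          # P(T+|D+)
spec = tn / (tn + fp)          # P(T-|D-)
lr_pos = sens / (1 - spec)     # P(T+|D+) / P(T+|D-)
lr_neg = (1 - sens) / spec     # P(T-|D+) / P(T-|D-)

print(round(sens, 2), round(spec, 2), round(lr_pos, 1), round(lr_neg, 2))
# -> 0.64 0.91 7.0 0.4
```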
Back to the case… Let’s apply this data to our case, with pre-test probability of 2%
Post-test risk using NT only
Pre-test prob: 0.02 at age 41
Pre-test odds: 0.02/0.98 = 0.0204
IF TEST IS POSITIVE – LR = 7.0
Post-Test Odds = Pre-Test Odds x LR(+) = 0.0204 x 7.0 = 0.143
Post-Test prob = 0.143/(1 + 0.143) = 12.5%
Post-test risk using NT only
Pre-test prob: 0.02 at age 41
Pre-test odds: 0.02/0.98 = 0.0204
IF TEST IS NEGATIVE – LR = 0.4
Post-Test Odds = Pre-Test Odds x LR(-) = 0.0204 x 0.4 = 0.0082
Post-Test prob = 0.0082/(1 + 0.0082) = 0.8%
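The odds-form Bayes update used on these slides is easy to wrap in a helper. A minimal sketch (`post_test_prob` is a name chosen here for illustration):

```python
# Sketch of the odds-form Bayes update: probability -> odds,
# multiply by the likelihood ratio, convert back to probability.
def post_test_prob(pre_test_prob: float, lr: float) -> float:
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# NT in a 41-year-old with 2% pre-test probability:
print(round(post_test_prob(0.02, 7.0), 3))   # positive NT -> 0.125
print(round(post_test_prob(0.02, 0.4), 3))   # negative NT -> 0.008
```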
Back to the case…
Is 0.8% risk low enough to forgo CVS? Is 12.5% risk high enough to justify the risk of CVS?
OTHER ultrasound features are also predictive: nasal bone absence.
Nasal bone seen: NBA = “No”, negative for Trisomy 21
Nasal bone absent: NBA = “Yes”, positive for Trisomy 21
Nasal Bone Absence Test Data
                      Tri21+   Tri21-     LR
Nasal Bone Absent?
Yes                     229      129     27.8
No                      104     5094     0.32
Total                   333     5223
Post-test risk using NBA only
Pre-test prob: 0.02 at age 41
Pre-test odds: 0.02/0.98 = 0.0204
IF TEST IS POSITIVE – LR = 27.8
Post-Test Odds = Pre-Test Odds x LR(+) = 0.0204 x 27.8 = 0.567
Post-Test prob = 0.567/(1 + 0.567) = 36%
Post-test risk using NBA only
Pre-test prob: 0.02 at age 41
Pre-test odds: 0.02/0.98 = 0.0204
IF TEST IS NEGATIVE – LR = 0.32
Post-Test Odds = Pre-Test Odds x LR(-) = 0.0204 x 0.32 = 0.0065
Post-Test prob = 0.0065/(1 + 0.0065) = 0.6%
Back to the case… NBA is a bit better than NT, but still important uncertainty… Can we combine our NT results with NBA results and do even better? How do we combine test results?
Combining tests
Approach #1 – Assume independence
- Knowing the result of one test doesn't influence how you interpret the next test
- (We usually also assume the LR is independent of pre-test probability – this is what we did when we used a pre-test risk of 2% instead of 6% in our calculations)
- If so, we can just do the calculations sequentially
Assuming test independence
First do NT; assume it's positive (LR = 7.0): pre-test risk 2% → post-test risk 12.5%
Then do NBA; assume it's also positive (LR = 27.8): pre-test risk 12.5% → post-test risk 80%
Assuming test independence
What's the mathematical shortcut? LR(1) x LR(2) = LR(1&2)

 NT     NBA     LR
 Pos    Pos     7.0 x 27.8 = 195
 Pos    Neg     7.0 x 0.32 = 2.2
 Neg    Pos     0.4 x 27.8 = 11
 Neg    Neg     0.4 x 0.32 = 0.13
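The shortcut follows because the odds update is multiplicative: updating sequentially with two LRs is identical to one update with their product. A minimal sketch (`update` is a name chosen here for illustration):

```python
# Sketch: under independence, applying LRs one at a time equals
# multiplying them once, because post-odds = pre-odds * LR1 * LR2.
def update(prob, lr):
    odds = prob / (1 - prob) * lr
    return odds / (1 + odds)

pre = 0.02
step_by_step = update(update(pre, 7.0), 27.8)   # NT+ then NBA+
one_shot = update(pre, 7.0 * 27.8)              # combined LR = 194.6
print(round(step_by_step, 3), round(one_shot, 3))  # -> 0.799 0.799
```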
Assuming test independence
Slide rule approach (pre-test prob = 6%): the two LR arrows line up end-to-end, without shrinkage
Combining tests Is it reasonable to assume independence? Does nasal bone absence tell you as much if you already know that the nuchal translucency is >3.5 mm? What can we do to figure this out?
Combining tests Approach #2 – evaluate all possible test result combinations
Joint eval of 4 test result combinations

 NT     NBA    Trisomy 21+    %     Trisomy 21-    %      LR    LR if independent
 Pos    Pos        158       47%         36       0.7%     69        195
 Pos    Neg         54       16%        442       8.5%    1.9        2.2
 Neg    Pos         71       21%         93       1.8%     12         11
 Neg    Neg         50       15%       4652        89%    0.2       0.13
 Totals            333      100%       5223       100%

Observed joint LRs vs the LRs we would predict if the tests were independent (single-test LRs multiplied).
Combining tests The Answer – the tests are NOT completely independent So we CANNOT just multiply LR’s What should we do in this case? Use LR’s from the combination table
Joint eval of 4 test result combinations

 NT     NBA    Trisomy 21+    %     Trisomy 21-    %      LR
 Pos    Pos        158       47%         36       0.7%     69
 Pos    Neg         54       16%        442       8.5%    1.9
 Neg    Pos         71       21%         93       1.8%     12
 Neg    Neg         50       15%       4652        89%    0.2
 Totals            333      100%       5223       100%

Use these LRs!
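The combination-table LRs, and how far they fall from the "multiply the LRs" prediction, can be computed directly from the counts. A minimal sketch using the cell counts from the combination table:

```python
# Sketch: joint LRs from the combination table vs the LRs you would get
# by (wrongly) assuming independence and multiplying single-test LRs.
counts = {  # (NT, NBA): (Trisomy 21+ count, Trisomy 21- count)
    ("pos", "pos"): (158, 36),
    ("pos", "neg"): (54, 442),
    ("neg", "pos"): (71, 93),
    ("neg", "neg"): (50, 4652),
}
n_dpos, n_dneg = 333, 5223

joint_lr = {k: (dp / n_dpos) / (dn / n_dneg) for k, (dp, dn) in counts.items()}

marginal = {"nt": {"pos": 7.0, "neg": 0.4}, "nba": {"pos": 27.8, "neg": 0.32}}
product_lr = {(nt, nba): marginal["nt"][nt] * marginal["nba"][nba]
              for nt in ("pos", "neg") for nba in ("pos", "neg")}

for k in counts:  # e.g. ('pos', 'pos') 68.8 observed vs 194.6 predicted
    print(k, round(joint_lr[k], 1), round(product_lr[k], 1))
```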
Create ROC Table

 NT     NBA    Tri21+   cum Sens   Tri21-   cum (1 - Spec)    LR
                  0%                              0%
 Pos    Pos     158       47%        36          0.7%          69
 Neg    Pos      71       68%        93            3%          12
 Pos    Neg      54       84%       442           11%         1.9
 Neg    Neg      50      100%      4652          100%         0.2

(Rows sorted by descending LR; Sens and 1 - Spec columns are cumulative.)
AUROC = 0.896
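The area under this four-point ROC curve can be checked by the trapezoidal rule. A minimal sketch; note it gives ~0.901 on these rounded counts, close to the 0.896 quoted on the slide (presumably computed on the original data):

```python
# Sketch: AUROC by the trapezoidal rule over the ROC points generated by
# ranking the four result combinations from highest to lowest LR.
combos = [  # (Trisomy 21+ count, Trisomy 21- count), descending LR
    (158, 36),   # NT+ NBA+  LR 69
    (71, 93),    # NT- NBA+  LR 12
    (54, 442),   # NT+ NBA-  LR 1.9
    (50, 4652),  # NT- NBA-  LR 0.2
]
n_pos = sum(dp for dp, _ in combos)
n_neg = sum(dn for _, dn in combos)

points = [(0.0, 0.0)]  # (1 - specificity, sensitivity)
cum_dp = cum_dn = 0
for dp, dn in combos:
    cum_dp += dp
    cum_dn += dn
    points.append((cum_dn / n_neg, cum_dp / n_pos))

auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))  # -> 0.901
```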
Optimal Cutoff Analysis

 NT     NBA     LR    Post-Test Prob
 Pos    Pos     69         81%
 Neg    Pos     12         43%
 Pos    Neg    1.9         11%
 Neg    Neg    0.2          1%

If we assume: pre-test probability = 6%, threshold for CVS = 2%
Optimal algorithm is "any positive test → CVS"
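The optimal-cutoff logic is just the odds update applied to each combination and compared to the treatment threshold. A minimal sketch (`post_prob` is a name chosen here for illustration):

```python
# Sketch: post-test probability for each result combination at a 6%
# pre-test probability, compared against a 2% threshold for doing CVS.
def post_prob(pre, lr):
    odds = pre / (1 - pre) * lr
    return odds / (1 + odds)

pre_test = 0.06
threshold = 0.02
for combo, lr in [("NT+ NBA+", 69), ("NT- NBA+", 12),
                  ("NT+ NBA-", 1.9), ("NT- NBA-", 0.2)]:
    p = post_prob(pre_test, lr)
    print(combo, round(p, 2), "CVS" if p > threshold else "no CVS")
```

Only the all-negative combination falls below the 2% threshold, which is why the optimal algorithm is "any positive test → CVS."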
Non-independence What does non-independence mean?
Non-independence Slide rule approach (pre-test prob = 6%) The total arrow length is NOT equal to the sum of its parts!
Non-independence
Technical definition of independence – must condition on disease status:
P(T2+ | T1, D+) = P(T2+ | D+) and P(T2+ | T1, D-) = P(T2+ | D-), whatever the result of Test 1.
If this stringent definition is not met, the tests are non-independent.
In patients with disease, a false negative on Test 1 does not affect the probability of a false negative on Test 2. In patients without disease, a false positive on Test 1 does not affect the probability of a false positive on Test 2.
Non-independence
Reasons for non-independence? Tests measure the same aspect of disease.
Simple example: predicting pneumonia
- Cyanosis: LR = 5
- O2 sat 85%-90%: LR = 6
Can't just multiply these LRs, because they really just reflect the same physiologic state!
Non-independence Reasons for non-independence? Tests measure the same aspect of disease. In our example: One aspect of Down’s syndrome is slower fetal development; the NT decreases more slowly AND the nasal bone ossifies later. Chromosomally NORMAL fetuses that develop slowly will tend to have false positives on BOTH the NT Exam and the Nasal Bone Exam.
Non-independence
Other reasons for non-independence?
- Disease is heterogeneous: in severe pneumonia, all tests tend to be abnormal, so each individual test tells you less (e.g., O2 sat and respiratory rate)
- Non-disease is heterogeneous: in patients with cough but no pneumonia, abnormal tests may still track together (O2 sat and respiratory rate are both abnormal with pulmonary embolism, and both normal with viral URI)
See EBD page 158
Back to the case… Remember that we actually simplified the case: Nuchal translucency is really a continuous test. How do we take into account actual continuous NT measurement and NBA (and age, race, fetal crown-rump length, etc)?
Back to the case…
Can't do a combination table for all possible combinations!
- 2 dichotomous tests = 4 combinations
- 4 dichotomous tests = 16 combinations
- 3 3-level tests = 27 combinations
And how do we deal with continuous tests?
Combining tests Approach #3: Recursive partitioning Repeatedly split the data to find optimal testing/decision algorithm “prune” the tree
Combining tests Approach #3: Recursive partitioning
Combining tests Approach #3: Recursive partitioning Non-optimal test ordering
Combining tests Approach #3: Recursive partitioning You might do nasal bone test first, then “prune”
Combining tests Approach #3: Recursive partitioning Final algorithm: do the Nasal Bone exam first; if the bone is absent, stop and do CVS…
Combining tests Approach #3: Recursive partitioning Sophisticated statistical algorithms optimize cutpoints
Combining tests Approach #3: Recursive partitioning For classic example, see Figure 8.7: Chest pain workup algorithm (Goldman et al)
Combining tests Approach #3: Recursive partitioning BUT: Still requires dichotomizing at cutpoints
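A real recursive-partitioning analysis would use a CART implementation (e.g., rpart in R); the toy sketch below hand-rolls one greedy, entropy-based split on the two dichotomized tests, using the counts from the combination table, just to illustrate why the algorithm picks the Nasal Bone exam first. All function names here are illustrative, not from the lecture.

```python
# Toy sketch of the first recursive-partitioning split on the two
# dichotomized ultrasound tests, using the combination-table counts.
from math import log2

# Expand the combination table into (nt, nba, disease) records.
records = []
for (nt, nba), (d_pos, d_neg) in {
    (1, 1): (158, 36), (1, 0): (54, 442),
    (0, 1): (71, 93), (0, 0): (50, 4652),
}.items():
    records += [(nt, nba, 1)] * d_pos + [(nt, nba, 0)] * d_neg

def entropy(rows):
    p = sum(r[2] for r in rows) / len(rows)
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def best_split(rows, features):
    """Pick the feature whose binary split most reduces entropy."""
    base, best = entropy(rows), None
    for f in features:
        left = [r for r in rows if r[f] == 1]
        right = [r for r in rows if r[f] == 0]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(rows)
        if best is None or gain > best[1]:
            best = (f, gain)
    return best[0]

first = best_split(records, [0, 1])   # feature 0 = NT, feature 1 = NBA
print("split first on", "NT" if first == 0 else "NBA")  # -> NBA
```

NBA wins the first split because it is the more informative test, matching the final algorithm on the slides (Nasal Bone exam first).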
Combining tests Approach #4: Logistic regression Uses a statistical model to combine test results and predict disease Designed to account for non-independence Handles continuous test results Can produce a “score” A single integrated continuous test result Score subject to ROC curve, C-statistic, other standard continuous test analyses
Combining tests Approach #4: Logistic regression For classic example, see Table 8.5: Predicting death in patients with pneumonia – The PORT score
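A real analysis would fit the model with standard software (R's glm, statsmodels, etc.); the sketch below fits a logistic model to the two dichotomized tests by plain gradient ascent on the aggregated cell counts, just to show how a fitted model turns the two results into a single score. The coefficient names and fitting loop are illustrative, not from the lecture.

```python
# Sketch: logistic regression on the two dichotomized tests,
# fit by gradient ascent on the aggregated combination-table counts.
from math import exp

cells = [  # (nt, nba, n_disease, n_no_disease)
    (1, 1, 158, 36), (1, 0, 54, 442), (0, 1, 71, 93), (0, 0, 50, 4652),
]
n_total = sum(dp + dn for _, _, dp, dn in cells)

b0 = b1 = b2 = 0.0
step = 0.5  # learning rate
for _ in range(20000):
    g0 = g1 = g2 = 0.0
    for nt, nba, dp, dn in cells:
        p = 1 / (1 + exp(-(b0 + b1 * nt + b2 * nba)))
        resid = dp - (dp + dn) * p      # observed minus expected cases
        g0 += resid; g1 += resid * nt; g2 += resid * nba
    b0 += step * g0 / n_total
    b1 += step * g1 / n_total
    b2 += step * g2 / n_total

def score(nt, nba):
    """The fitted model as a single integrated 'score' (a probability)."""
    return 1 / (1 + exp(-(b0 + b1 * nt + b2 * nba)))

for nt, nba in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    print(nt, nba, round(score(nt, nba), 2))
```

The fitted score ranks the four combinations the same way the combination-table LRs do (NBA carries more weight than NT), and, unlike multiplying LRs, the model is fit to the observed joint data.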
Combining tests Approach #5: Other fancy algorithms Neural networks Random forests Boosting Etc.
Combining tests
The Major Pitfall – Overfitting
What happens when you throw more variables into a model? Will the model perform better?
YES, in the "derivation" set (even random noise will look good!)
NO, when you try to apply it in the real world!
Combining tests
The more complex your test algorithm, the more important it is to VALIDATE
- Split your sample into a "derivation set" and a "test set"
- 10-fold cross-validation, etc.
- Validate in an EXTERNAL sample
Example 1 - predicting CAC with multiple risk factors Should we do a heart scan for atherosclerosis? Can we predict with clinical characteristics who has atherosclerosis without doing a heart scan?
Example 1 – predicting CAC with multiple risk factors

Models compared (naïve* vs cross-validated AUC-ROC):
1. Age + sex + race
2. "" + standard CHD RF's
3. "" + all possible race-sex interactions

The last model is the most complex and has the highest "naïve" AUC-ROC, but NOT the highest cross-validated AUC-ROC, because it is "over-fit".

* "Naïve" AUC-ROC refers to the AUC-ROC you get when you estimate it within the same dataset from which the test algorithm was derived
Example 2 - predicting CAC with a proteomics “signal” Proteomic analysis is an extreme example of combining test results: hundreds to thousands of signal peak heights, many just noise
Example 2: proteomics-CAC Proteomics algorithm looks great in the derivation set!
Example 2: proteomics-CAC But cross-validation shows that it was all just useless noise (AUC-ROC ~0.5)
VALIDATION No matter what technique (CART or logistic regression) is used, the tests included in a model and the way in which their results are combined must be tested on a data set different from the one used to derive the rule.
Combining Tests – Take home points
- Test non-independence is the rule, not the exception, so you usually CAN'T just multiply LRs together
- In simple cases, look at the LRs for all possible test result combinations
- Fancier methods are often used, but look for validation analyses, especially when there are LOTS of tests being combined.