Multiple Tests, Multivariable Decision Rules, and Studies of Diagnostic Test Accuracy
Michael A. Kohn, MD, MPP
10/14/2004
Coursebook Chapter 5 – Multiple Tests and Multivariable Decision Rules
Coursebook Chapter 6 – Studies of Diagnostic Test Accuracy

Outline of Topics
Combining results of multiple tests: importance of test non-independence
Recursive partitioning
Logistic regression
Published "rules" for combining test results: importance of validation separate from derivation
Biases in studies of diagnostic tests:
–Overfitting bias
–Incorporation bias
–Referral bias
–Double gold standard bias
–Spectrum bias

Warning: Different Example
Example of combining two tests in this talk: Exercise ECG and Nuclide Scan as dichotomous tests for CAD (assumed to be a dichotomous D+/D- disease)*
Example of combining two tests in Coursebook: Premature birth (GA < 36 weeks) and low birth weight (BW < 2500 grams) as dichotomous tests for neonatal morbidity
*Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. 2nd ed. Boston: Little Brown; 1991.

One Dichotomous Test
Exercise ECG   CAD+   CAD-   LR
Positive        299     44   6.80
Negative        201    456   0.44
Total           500    500
Do you see that the LR(+) is (299/500)/(44/500) = 6.80?
Review of Chapter 3: What are the sensitivity, specificity, PPV, and NPV of this test? (Be careful.)
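
A sketch of how these quantities fall out of the 2x2 table (Python; the counts are the reconstructed ones above, and the variable names are my own):

    # Counts from the Exercise ECG 2x2 table above
    tp, fp = 299, 44   # E-ECG positive row: CAD+, CAD-
    fn, tn = 201, 456  # E-ECG negative row: CAD+, CAD-

    sensitivity = tp / (tp + fn)               # 299/500 = 0.60
    specificity = tn / (fp + tn)               # 456/500 = 0.91
    lr_pos = sensitivity / (1 - specificity)   # 6.80
    lr_neg = (1 - sensitivity) / specificity   # 0.44

    # "Be careful": PPV and NPV computed directly from this table assume
    # the study's 50% prevalence (500 CAD+ / 500 CAD-), not your patient's
    # pre-test probability.
    ppv = tp / (tp + fp)   # 0.87 at 50% prevalence only
    npv = tn / (tn + fn)   # 0.69 at 50% prevalence only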

Clinical Scenario – One Test
Pre-Test Probability of CAD = 33%
EECG Positive
Pre-test prob: 0.33
Pre-test odds: 0.33/0.67 = 0.50
LR(+) = 6.80
Post-Test Odds = Pre-Test Odds x LR(+) = 0.50 x 6.80 = 3.40
Post-Test prob = 3.40/(1 + 3.40) = 0.77
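
A minimal sketch of this odds-LR bookkeeping (Python; the function names are mine, not the coursebook's):

    def prob_to_odds(p):
        return p / (1 - p)

    def odds_to_prob(odds):
        return odds / (1 + odds)

    def post_test_prob(pre_test_prob, lr):
        """Multiply pre-test odds by the LR, then convert back to probability."""
        return odds_to_prob(prob_to_odds(pre_test_prob) * lr)

    print(post_test_prob(0.33, 6.80))  # ~0.77 after a positive E-ECG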

Clinical Scenario – One Test
Using Probabilities:
Pre-Test Probability of CAD = 33%
EECG Positive
Post-Test Probability of CAD = 77%
Using Odds:
Pre-Test Odds of CAD = 0.50
EECG Positive (LR = 6.80)
Post-Test Odds of CAD = 3.40

Clinical Scenario – One Test
Pre-Test Probability of CAD = 33%; EECG Positive (LR = 6.80)
[Figure: log-odds ruler. The positive E-ECG moves the marker from pre-test odds 0.50 (prob 0.33) along the log-odds scale to post-test odds 3.40 (prob 0.77); the axis runs from odds 1:100 through 1:1 to 10:1.]

Second Dichotomous Test
Nuclide Scan   CAD+   CAD-   LR
Positive        416    190   2.19
Negative         84    310   0.27
Total           500    500
Do you see that the LR(+) is (416/500)/(190/500) = 2.19?

Clinical Scenario – Two Tests
Using Probabilities:
Pre-Test Probability of CAD = 33%
EECG Positive
Post-EECG Probability of CAD = 77%
Nuclide Scan Positive
Post-Nuclide Probability of CAD = ?

Clinical Scenario – Two Tests
Using Odds:
Pre-Test Odds of CAD = 0.50
EECG Positive (LR = 6.80)
Post-Test Odds of CAD = 3.40
Nuclide Scan Positive (LR = 2.19?)
Post-Test Odds of CAD = 3.40 x 2.19? = 7.44? (P = 7.44/(1 + 7.44) = 88%?)

Clinical Scenario – Two Tests
Pre-Test Probability of CAD = 33%; EECG Positive (LR = 6.80); Nuclide Scan Positive (LR = 2.19)
Can we do this?
[Figure: log-odds ruler. Chaining the two arrows moves from odds 0.50 (prob 0.33) to 3.40 (prob 0.77) after the positive E-ECG, then to 7.44 (prob 0.88) after the positive nuclide scan, if the LRs can simply be multiplied.]

Question
Can we use the post-test odds after a positive Exercise ECG as the pre-test odds for the positive nuclide scan? That is, can we combine the positive results by multiplying their LRs?
LR(E-ECG +, Nuclide +) = LR(E-ECG +) x LR(Nuclide +)?
= 6.80 x 2.19?
= 14.88?

Answer = No
E-ECG   Nuclide   CAD+         CAD-         LR
Pos     Pos       276 (55%)     26 (5%)     10.62
Pos     Neg        23 (5%)      18 (4%)      1.28
Neg     Pos       140 (28%)    164 (33%)     0.85
Neg     Neg        61 (12%)    292 (58%)     0.21
Total             500 (100%)   500 (100%)
Not 14.88.
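
The same check as a Python sketch (counts exactly as in the table; this just recomputes the joint and marginal LRs):

    # CAD+ and CAD- counts for each (E-ECG, nuclide) result combination
    counts = {
        ("pos", "pos"): (276, 26),
        ("pos", "neg"): (23, 18),
        ("neg", "pos"): (140, 164),
        ("neg", "neg"): (61, 292),
    }
    n_dpos = n_dneg = 500

    def lr(cad_pos, cad_neg):
        return (cad_pos / n_dpos) / (cad_neg / n_dneg)

    print(lr(276, 26))                    # 10.62, the joint LR for two positives

    lr_eecg = lr(276 + 23, 26 + 18)       # 6.80, marginal LR of E-ECG+
    lr_nuclide = lr(276 + 140, 26 + 164)  # 2.19, marginal LR of nuclide+
    print(lr_eecg * lr_nuclide)           # 14.88: the product overstates the evidence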

Non-Independence
A positive nuclide scan does not tell you as much if the patient has already had a positive exercise ECG.

Clinical Scenario
Using Odds:
Pre-Test Odds of CAD = 0.50
EECG +/Nuclide Scan + (LR = 10.62)
Post-Test Odds of CAD = 0.50 x 10.62 = 5.31 (P = 5.31/(1 + 5.31) = 84%, not 88%)

Non-Independence
[Figure: log-odds ruler. Because the tests are dependent, the arrow for a positive nuclide scan after a positive E-ECG is shorter than the arrow it would have earned if the tests were independent; the combined result lands at prob 0.84 rather than 0.88.]

Non-Independence
Instead of the nuclide scan, what if the second test were just a repeat exercise ECG? A second positive E-ECG would do little to increase your certainty of CAD. If it was false positive the first time around, it is likely to be false positive the second time.

Counterexamples: Possibly Independent Tests
For Venous Thromboembolism:
–CT Angiogram of Lungs and Doppler Ultrasound of Leg Veins
–Alveolar Dead Space and D-Dimer
–MRA of Lungs and MRV of Leg Veins

Unless tests are independent (conditional on disease status), we can't combine results by multiplying LRs.

Ways to Combine Multiple Tests
On a group of patients (derivation set), perform the multiple tests and determine true disease status (apply the gold standard). Then:
–Measure the LR for each possible combination of results
–Recursive partitioning
–Logistic regression

Determine LR for Each Result Combination
E-ECG   Nuclide   CAD+         CAD-         LR      Post-Test Prob*
Pos     Pos       276 (55%)     26 (5%)     10.62   84%
Pos     Neg        23 (5%)      18 (4%)      1.28   39%
Neg     Pos       140 (28%)    164 (33%)     0.85   30%
Neg     Neg        61 (12%)    292 (58%)     0.21    9%
Total             500 (100%)   500 (100%)
*Assumes pre-test prob = 33%

Determine LR for Each Result Combination
2 dichotomous tests: 4 combinations
3 dichotomous tests: 8 combinations
4 dichotomous tests: 16 combinations
Etc.
2 3-level tests: 9 combinations
3 3-level tests: 27 combinations
Etc.

Determine LR for Each Result Combination
How do you handle continuous tests?
Not practical for most groups of tests.

Recursive Partitioning

Recursive Partitioning
Same as Classification and Regression Trees (CART)
Don't have to work out probabilities (or LRs) for all possible combinations of tests, because of "tree pruning"
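
As a rough illustration (not from the lecture): scikit-learn's DecisionTreeClassifier fits a CART-style tree, and cost-complexity pruning plays the role of "tree pruning." The data and feature names below are invented for the sketch.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    n = 1000
    # Hypothetical dichotomous test results (1 = positive)
    X = rng.integers(0, 2, size=(n, 2))
    # Toy disease status: more likely when either test is positive
    p = 0.1 + 0.3 * X[:, 0] + 0.2 * X[:, 1]
    y = rng.random(n) < p

    # ccp_alpha > 0 prunes branches that don't earn their complexity,
    # so we never have to enumerate all 2^k result combinations.
    tree = DecisionTreeClassifier(ccp_alpha=0.005).fit(X, y)
    print(export_text(tree, feature_names=["eecg", "nuclide"]))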

Tree Pruning: Goldman Rule*
8 "Tests" for Acute MI in ER Chest Pain Patient:
1. ST Elevation on ECG
2. CP < 48 hours
3. ST-T changes on ECG
4. Hx of ACI
5. Radiation of Pain to Neck/LUE
6. Longest pain > 1 hour
7. Age > 40 years
8. CP not reproduced by palpation
*Goldman L, Cook EF, Brand DA, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med. 1988;318(13):

8 tests → 2^8 = 256 Combinations

Recursive Partitioning
Does not deal well with continuous test results.

Logistic Regression
Ln(Odds(D+)) = a + b_EECG (E-ECG) + b_Nuclide (Nuclide) + b_interact (E-ECG)(Nuclide)
where "+" = 1 and "-" = 0
More on this later in ATCR!
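
A minimal sketch of fitting this model (Python/statsmodels; the data are simulated, and the interaction term is simply the product of the two 0/1 results):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 1000
    eecg = rng.integers(0, 2, n)
    nuclide = rng.integers(0, 2, n)
    # Toy truth: a negative interaction coefficient captures the
    # redundancy (non-independence) between the two tests.
    logit = -2.0 + 1.9 * eecg + 0.8 * nuclide - 0.5 * eecg * nuclide
    d = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

    X = sm.add_constant(np.column_stack([eecg, nuclide, eecg * nuclide]))
    model = sm.Logit(d, X).fit(disp=0)
    print(model.params)  # a, b_EECG, b_Nuclide, b_interact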

Logistic Regression Approach to the "R/O ACI patient"*
Predictor                 Coefficient   MV Odds Ratio
Constant                  -3.93
Presence of chest pain     1.23
Pain major symptom         0.88
Male sex                   0.71
Age 40 or less
Age >
Male over 50 years**
ST elevation
New Q waves
ST depression
T waves elevated
T waves inverted
T wave + ST changes**
[Remaining coefficients and odds ratios did not survive transcription.]
*Selker HP, Griffith JL, D'Agostino RB. A tool for judging coronary care unit admission appropriateness, valid for both real-time and retrospective use. A time-insensitive predictive instrument (TIPI) for acute cardiac ischemia: a multicenter study. Med Care. Jul 1991;29(7):
For corrected coefficients, see

Clinical Scenario*
71 y/o man with 2.5 hours of CP, substernal, non-radiating, described as "bloating." Cannot say if same as prior MI or worse than prior angina. Hx of CAD, s/p CABG 10 yrs prior, stenting 3 years and 1 year ago. DM on Avandia.
ECG: RBBB, Qs inferiorly. No ischemic ST-T changes.
*Real patient seen by MAK 1 am 10/12/04

Clinical Scenario
Predictor                 Coefficient   Result
Constant                  -3.93         -3.93
Presence of chest pain     1.23 x 1      1.23
Pain major symptom         0.88 x 1      0.88
Sex (male)                 0.71 x 1      0.71
Age 40 or less
Age >
Male over 50 years
ST elevation
New Q waves
ST depression
T waves elevated
T waves inverted
T wave + ST changes
Odds of ACI                              0.43
Probability of ACI                       30%
[Values for the age and ECG terms did not survive transcription.]
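
The arithmetic the slide performs, as a sketch (Python; only the coefficients legible above are included, so the sum here is illustrative rather than the full score):

    import math

    # Constant plus coefficient-times-indicator for each positive finding;
    # the age and ECG terms are omitted because their values were lost.
    score = -3.93 + 1.23 * 1 + 0.88 * 1 + 0.71 * 1  # + age and ECG terms

    prob = math.exp(score) / (1 + math.exp(score))  # logistic transform
    print(prob)  # the slide's full sum works out to ~30% probability of ACI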

What Happened to Pre-test Probability?
Typically clinical decision rules report probabilities rather than likelihood ratios for combinations of results.
Can "back out" LRs if we know the prevalence, P(D+), in the study dataset.
With logistic regression models, this "backing out" is known as a "prevalence offset." (See Chapter 5A.)
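
A sketch of that back-calculation (Python; this is just Bayes' rule rearranged, with function names of my own choosing): the LR for a result combination is its post-test odds in the study divided by the study's pre-test (prevalence) odds.

    def backed_out_lr(reported_prob, study_prevalence):
        """LR = post-test odds / pre-test odds, both from the study data."""
        post_odds = reported_prob / (1 - reported_prob)
        prior_odds = study_prevalence / (1 - study_prevalence)
        return post_odds / prior_odds

    # E-ECG+/Nuclide+: 276 of 302 such patients had CAD in a dataset
    # with 50% prevalence (500 CAD+ / 500 CAD-)
    print(backed_out_lr(276 / 302, 0.50))  # ~10.62, matching the table above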

Need for Validation
Develop a prediction rule by choosing a few tests and findings from a large number of possibilities.
This takes advantage of chance variations in the data.
The predictive ability of the rule will probably disappear when you try to validate it on a new dataset.
This is referred to as "overfitting."

Need for Validation: Example*
Study of clinical predictors of bacterial diarrhea.
Evaluated 34 historical items and 16 physical examination questions.
3 questions (abrupt onset, > 4 stools/day, and absence of vomiting) best predicted a positive stool culture (sensitivity 86%; specificity 60% for all 3).
Would these 3 be the best predictors in a new dataset? Would they have the same sensitivity and specificity?
*DeWitt TG, Humphrey KF, McCarthy P. Clinical predictors of acute bacterial diarrhea in young children. Pediatrics. Oct 1985;76(4):

VALIDATION
No matter what technique (CART or logistic regression) is used, the "rule" for combining multiple test results must be tested on a data set different from the one used to derive it.
Beware of "validation sets" that are just re-hashes of the "derivation set."
(This begins our discussion of potential problems with studies of diagnostic tests.)
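
A toy demonstration of why this matters (Python; everything here is simulated): snoop the "best" cutoff on a derivation set of pure noise, then watch its apparent accuracy evaporate in a validation set.

    import numpy as np

    rng = np.random.default_rng(42)

    def snoop_cutoff(x, y):
        """Try every cutoff in both directions; keep the most accurate (data snooping)."""
        best_acc, best_rule = 0.0, None
        for c in np.unique(x):
            for rule in (lambda v, c=c: v > c, lambda v, c=c: v <= c):
                acc = np.mean(rule(x) == y)
                if acc > best_acc:
                    best_acc, best_rule = acc, rule
        return best_acc, best_rule

    # A "test" with no true relationship to disease at all
    x_deriv, y_deriv = rng.normal(size=100), rng.random(100) < 0.5
    x_valid, y_valid = rng.normal(size=100), rng.random(100) < 0.5

    acc_deriv, rule = snoop_cutoff(x_deriv, y_deriv)
    acc_valid = np.mean(rule(x_valid) == y_valid)
    print(acc_deriv, acc_valid)  # derivation looks better than chance; validation ~0.5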

Studies of Diagnostic Test Accuracy
Sackett, EBM, pg 68
1. Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
3. Was the reference standard applied regardless of the diagnostic test result?
4. Was the test (or cluster of tests) validated in a second, independent group of patients?

Studies of Diagnostic Tests
Overfitting Bias ("Data Snooping")
Usually a problem for multi-test rules which use a few predictors chosen from a wide array of candidates. But in studies of single tests, beware of "data-snooped" cutoffs:
"A procalcitonin concentration of … ng/ml is the best cutoff for predicting ventilator-associated pneumonia."
"A CSF WBC:RBC ratio < 1:117 is a sensitive and specific predictor of 'real' meningitis vs. a traumatic puncture."
"A birth weight cutoff of 1625 grams accurately identifies newborns at high risk for neonatal morbidity and mortality."

Studies of Diagnostic Tests
Overfitting Bias
Problems with "data-snooped" cutoffs:
–Dependent on the derivation set; require independent validation
–Fixed cutoffs assume a common prevalence or pre-test probability of disease
(Recall our discussion in Chapter 4 about the undesirability of a fixed cutoff for a multi-level or continuous test.)

Studies of Diagnostic Tests
Sackett, EBM, pg 68
1. Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
3. Was the reference standard applied regardless of the diagnostic test result?
4. Was the test (or cluster of tests) validated in a second, independent group of patients?

Studies of Diagnostic Tests
Incorporation Bias
Consider a study of the usefulness of various findings for diagnosing pancreatitis. If the "gold standard" is a discharge diagnosis of pancreatitis, which in many cases will be based upon the serum amylase, then the study can't quantify the accuracy of the amylase for this diagnosis.

Studies of Diagnostic Tests
Incorporation Bias
A study* of BNP in dyspnea patients as a diagnostic test for CHF also showed that the CXR performed extremely well in predicting CHF. The two cardiologists who determined the final diagnosis of CHF were blinded to the BNP level but not to the CXR report, so the assessment of BNP should be unbiased, but not the assessment of the CXR.
*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):

Studies of Diagnostic Tests
Sackett, EBM, pg 68
1. Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
3. Was the reference standard applied regardless of the diagnostic test result?
4. Was the test (or cluster of tests) validated in a second, independent group of patients?

Studies of Diagnostic Tests
Referral Bias
The study population only includes those to whom the gold standard was applied, but patients with positive index tests are more likely to be referred for the gold standard.
Example: Swelling as a test for ankle fracture. The gold standard is a positive X-ray. Patients with swelling are more likely to be referred for X-ray, and only patients who had X-rays are included in the study.

Studies of Diagnostic Tests
Referral Bias
              Fracture   No Fracture
Swelling      a          b
No Swelling   c ↓        d ↓
(Un-referred patients without swelling deplete cells c and d.)
Sensitivity (a/(a+c)) is biased UP. Specificity (d/(b+d)) is biased DOWN.
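
A small simulation of this effect (Python; all numbers invented): the true test characteristics are fixed, but conditioning inclusion on referral, which is more likely after a positive index test, distorts them.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100_000
    disease = rng.random(n) < 0.10                  # true fracture status
    true_sens, true_spec = 0.80, 0.70
    positive = np.where(disease,
                        rng.random(n) < true_sens,   # swelling if fractured
                        rng.random(n) >= true_spec)  # swelling if not

    # Referral to X-ray (the gold standard) depends on the index test
    referred = rng.random(n) < np.where(positive, 0.95, 0.20)

    def sens_spec(pos, dis):
        return np.mean(pos[dis]), np.mean(~pos[~dis])

    print(sens_spec(positive, disease))                      # ~ (0.80, 0.70): truth
    print(sens_spec(positive[referred], disease[referred]))  # sens UP, spec DOWN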

Studies of Diagnostic Tests
Referral Bias Example*
Test: A-a O2 gradient
Disease: PE
Gold Standard: VQ scan or pulmonary angiogram
Study Population: Patients who had a VQ scan or PA-gram
Results: An A-a O2 gradient > 20 mm Hg had very high sensitivity (almost every patient with PE by VQ scan or PA-gram had a gradient > 20 mm Hg) but very low specificity (lots of patients with negative PA-grams had gradients > 20 mm Hg).
*McFarlane MJ, Imperiale TF. Use of the alveolar-arterial oxygen gradient in the diagnosis of pulmonary embolism. Am J Med. 1994;96(1):57-62.

Studies of Diagnostic Tests
Referral Bias
                    VQ Scan +   VQ Scan -
A-aO2 > 20 mmHg     a           b
A-aO2 < 20 mmHg     c ↓         d ↓
Sensitivity (a/(a+c)) is biased UP.* Specificity (d/(b+d)) is biased DOWN.
*The study still concluded the test was not sensitive enough, so it probably isn't.

Studies of Diagnostic Tests
Double Gold Standard
One gold standard (e.g., biopsy) is applied in patients with a positive index test; another gold standard (e.g., clinical follow-up) is applied in patients with a negative index test.

Studies of Diagnostic Tests
Double Gold Standard
Test: A-a O2 gradient
Disease: PE
Gold Standard: VQ scan or pulmonary angiogram in patients who had one; clinical follow-up in patients who didn't
Study Population: All patients presenting to the ED with dyspnea
Some patients did not get a VQ scan or PA-gram because of normal A-a O2 gradients but would have had positive studies. Instead they had negative clinical follow-up and were counted as true negatives.

Studies of Diagnostic Tests
Double Gold Standard
              PE   No PE
A-a O2 > 20   a    b
A-a O2 < 20   c    d
Sensitivity (a/(a+c)) biased UP. Specificity (d/(b+d)) biased UP.
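
Another toy simulation (Python; parameters invented): PEs that would resolve untreated are "missed" only when the index test is negative, because those patients get follow-up instead of angiography, and follow-up calls them disease-free.

    import numpy as np

    rng = np.random.default_rng(11)
    n = 100_000
    pe = rng.random(n) < 0.15                 # true PE status
    resolves = pe & (rng.random(n) < 0.30)    # PEs that would resolve untreated
    positive = np.where(pe, rng.random(n) < 0.70, rng.random(n) < 0.25)

    # Positive index test -> angiogram (sees the true PE);
    # negative index test -> clinical follow-up (misses PEs that resolve).
    observed_pe = np.where(positive, pe, pe & ~resolves)

    print(np.mean(positive[pe]), np.mean(positive[observed_pe]))    # sens biased UP
    print(np.mean(~positive[~pe]), np.mean(~positive[~observed_pe]))  # spec biased UP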

Studies of Diagnostic Tests
Sackett, EBM, pg 68
1. Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
3. Was the reference standard applied regardless of the diagnostic test result?
4. Was the test (or cluster of tests) validated in a second, independent group of patients?

Studies of Diagnostic Tests
Spectrum Bias
So far, we have said that the PPV and NPV of a test depend on the population being tested, specifically on the prevalence of D+ in that population. We said that sensitivity and specificity are properties of the test, independent of prevalence and, by implication at least, of the population being tested. In fact, …

Studies of Diagnostic Tests
Spectrum Bias
Sensitivity depends on the spectrum of disease in the population being tested. Specificity depends on the spectrum of non-disease in the population being tested.

Studies of Diagnostic Tests
Spectrum Bias
D+ and D- groups are not homogeneous:
–D+ really is D+, D++, or D+++
–D- really is D1-, D2-, or D3-

Studies of Diagnostic Tests
Spectrum Bias
Example: Pale conjunctiva as a test for iron deficiency anemia.
Assume that conjunctival paleness always occurs at HCT < 25.
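
A sketch of the toy model (Python; the HCT distributions are invented, and paleness is assumed to appear exactly when HCT < 25):

    import numpy as np

    rng = np.random.default_rng(3)

    def sens_in(hct_diseased):
        """Sensitivity of 'pale conjunctiva' (HCT < 25) among the iron deficient."""
        return np.mean(hct_diseased < 25)

    # Mild iron deficiency: HCT centered at 30; severe: centered at 20
    mild = rng.normal(30, 4, 10_000)
    severe = rng.normal(20, 4, 10_000)
    print(sens_in(mild))    # low sensitivity in a mildly diseased population
    print(sens_in(severe))  # high sensitivity among the sickest of the sick

    # Non-diseased spectrum: healthy controls vs. other anemias (low HCT
    # without iron deficiency), which generate false positives
    healthy = rng.normal(42, 3, 10_000)
    other_anemia = rng.normal(24, 4, 10_000)
    print(np.mean(healthy >= 25))       # high specificity among the well
    print(np.mean(other_anemia >= 25))  # low specificity among sick non-diseased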

Pale Conjunctiva as a Test for Iron Deficiency

Sensitivity is HIGHER in the population with more severe disease

Pale Conjunctiva as a Test for Iron Deficiency

Specificity is LOWER in the population with more severe non-disease. (Patients without the disease in question are more likely to have other diseases that can be confused with the disease in question.)

Biases in Studies of Tests
Overfitting Bias – "data-snooped" cutoffs take advantage of chance variations in the derivation set, making the test look falsely good
Incorporation Bias – index test part of gold standard (Sensitivity Up, Specificity Up)
Referral Bias – positive index test increases referral to gold standard (Sensitivity Up, Specificity Down)
Double Gold Standard – positive index test causes application of the definitive gold standard, negative index test results in clinical follow-up (Sensitivity Up, Specificity Up)
Spectrum Bias:
–D+ sickest of the sick (Sensitivity Up)
–D- wellest of the well (Specificity Up)

Biases in Studies of Tests
Don't just identify potential biases; figure out how the biases could affect the conclusions.
Studies concluding a test is worthless are not invalidated by design biases that would have made the test look BETTER than it really is.