Studies of Medical Tests Thomas B. Newman, MD, MPH September 9, 2008
Overview
- General issues
  - Similarities and differences
  - Types of questions
  - Gold standard
  - Spectrum of disease and of results
  - Sampling and generalizability
- Examples
  - Reproducibility and accuracy of S3
  - Visual assessment of jaundice
What do we mean by "tests"?
- Studies, procedures, or maneuvers intended to provide information about the probability of different health states, e.g.:
  - Items of the history and physical examination
  - Blood tests
  - X-rays
  - Endoscopies
“Tests” include history questions
How are studies of tests similar to other studies?
- Same basic pieces
  - Research question
  - Study design
  - Subjects
  - Predictor variables
  - Outcome variables
  - Analysis
- Same need to generalize from study subjects and measurements to the populations and phenomena of interest
How are studies of tests different?
- They address different types of questions
  - Primarily descriptive
  - Causal inference may or may not be relevant
  - Confidence intervals rather than P-values
- Different biases
  - Spectrum, verification, etc.
- Different statistics used to summarize results
  - Kappa, sensitivity, specificity, ROC curves, likelihood ratios
Diagnostic Test Questions
- How reproducible is it?
- How accurate is it?
- How much new information does it provide?
- How often do results affect clinical decisions?
- What are the costs, risks, and acceptability of the test?
- What is the effect of testing on outcomes?
- How do the answers to these questions vary by patient characteristics?
Gold Standard - 1
- Needed for studies that measure accuracy
- Can't include the test being evaluated (incorporation bias)
  - Example: WBC count as a predictor of sepsis in newborns
  - The gold standard (a positive blood culture) is imperfect
  - Why not also include probable sepsis, based on the judgment of the treating clinicians?
  - Because that judgment is affected by the WBC!
Gold Standard - 2
- Best if applied blindly
  - Prevents incorporation bias
- Best if applied uniformly
  - Prevents verification bias and double gold standard bias
- If imperfect, test accuracy can be under-estimated or over-estimated
  - Example: culture vs. PCR for pertussis
- If nonexistent, think about WHY you want to make the diagnosis
  - Examples: ADHD, autism
Spectrum of Disease, Nondisease, and Test Results
- Disease is often easier to diagnose if it is severe
- "Nondisease" is easier to diagnose if the patient is well than if the patient has other diseases
- Test results will look more reproducible if ambiguous results are excluded
Sources of Variation, Generalizability, and Sampling
- Test characteristics may depend on:
  - How the specimen is obtained and processed
  - How and by whom the test is done and interpreted
- Consider whether you need to sample or stratify results at these levels (depends on the research question)
Studies of Reproducibility
- For tests with no gold standard
- Often done as part of quality control
  - For a larger study
  - For patient care
Example: The Third Heart Sound (Marcus et al., Arch Intern Med 2006;166:617-622)
- Research questions:
  - What is the interobserver variability for hearing an S3?
  - How does this vary with level of experience?
- Design: cross-sectional study
Study Subjects
- Adults scheduled for non-emergency left-sided heart catheterization at UCSF, 8/03 to 6/04
- N = 100
Examining Physicians
- Cardiology attendings (N=26)
- Cardiology fellows (N=18)
- Internal medicine residents (N=54)
- Internal medicine interns (N=48)
- All from UCSF?
Measurements
- Auscultation
  - Standard procedure in a quiet room
  - Examiners blinded to other information
- Phonocardiogram with computerized analysis to determine the presence of an S3
Analysis: Kappa
- Measures agreement beyond that expected by chance
- For ordinal variables, use weighted kappa, which gives credit for coming close
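The kappa calculation on this slide can be sketched numerically. This is a minimal implementation of Cohen's kappa with optional linear weights (the weighted kappa mentioned above); the 2x2 agreement table is made up for illustration, not taken from the study.

```python
# Cohen's kappa: agreement beyond chance. With weighted=True, linear
# weights give partial credit for near-agreement on ordinal scales.

def kappa(table, weighted=False):
    """table[i][j] = count of subjects rated i by observer A and j by B."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        return 1 - abs(i - j) / (k - 1) if weighted else float(i == j)

    p_obs = sum(w(i, j) * table[i][j]
                for i in range(k) for j in range(k)) / n
    p_exp = sum(w(i, j) * row_tot[i] * col_tot[j]
                for i in range(k) for j in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical S3 agreement: both hear it in 10 patients, both miss it
# in 70, and the observers disagree in 20.
table = [[10, 12], [8, 70]]
print(round(kappa(table), 2))  # 0.38
```

Note that observed agreement here is 80%, yet kappa is only 0.38, because much of that agreement is expected by chance when the S3 is uncommon.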
Results: Comparison of Auscultation with Phonocardiogram (figure)
Marcus, G. et al. Arch Intern Med 2006;166:617-622
Do S3 and S4 Matter? (JAMA 2005;293:2238-2244)
- Research question: How well do S3 and S4 predict an abnormal (≥15 mm Hg) LVEDP?
- Design: cross-sectional study
Study Subjects
- Adults scheduled for non-emergency left-sided heart catheterization at UCSF, 8/03 to 6/04
  - Excluded if poor phonocardiogram quality (N=8) or paced rhythm (N=2)
Measurements
- Test: S3 (yes/no) and S3 "confidence score" from computer analysis of the phonocardiogram
- "Gold standard": left ventricular end-diastolic pressure (LVEDP) ≥ 15 mm Hg at catheterization
Results: S3 Present/Absent
- Specificity = 45/49 = 92%; 95% CI (80%, 98%)
- Sensitivity = 17/41 = 41%; 95% CI (26%, 58%)
- Positive predictive value = 17/21 = 81%
- Negative predictive value = 45/69 = 65%
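These proportions and approximate confidence intervals can be reproduced from the 2x2 counts implied by the slide (TP=17, FN=24, FP=4, TN=45). The sketch below uses the Wilson score interval; the slide's CIs may have been computed with an exact binomial method, so the bounds differ slightly.

```python
import math

def wilson_ci(x, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion x/n."""
    p = x / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# 2x2 counts from the slide: S3 vs LVEDP >= 15 mm Hg
tp, fn, fp, tn = 17, 24, 4, 45
sens, spec = tp / (tp + fn), tn / (tn + fp)
print(f"sensitivity {sens:.0%}, 95% CI {wilson_ci(tp, tp + fn)}")
print(f"specificity {spec:.0%}, 95% CI {wilson_ci(tn, tn + fp)}")
```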
Results: "Confidence Scores"
- Many "dichotomous" tests are not really dichotomous, e.g.:
  - Definite
  - Probable
  - Possible
  - Absent
- The phonocardiogram software generates "confidence scores" for S3 and S4
Analysis: ROC Curve
- ROC = "receiver operating characteristic"
- Illustrates the tradeoff between sensitivity and specificity as the cutoff is changed
- Discrimination of the test is measured by the area under the curve (AUROC = c)
  - Perfect test: 1.0
  - Worthless test: 0.5
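The AUROC has a useful interpretation that makes it easy to compute without drawing the curve: it is the probability that a randomly chosen diseased subject's score exceeds a randomly chosen non-diseased subject's score, with ties counting half (the Mann-Whitney interpretation of the area). The confidence scores below are made up for illustration.

```python
# AUROC via the Mann-Whitney interpretation: fraction of
# (diseased, non-diseased) pairs correctly ordered by the score.

def auroc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

elevated = [7, 5, 9, 4, 8]   # hypothetical scores, LVEDP >= 15
normal = [2, 6, 3, 1, 5]     # hypothetical scores, LVEDP < 15
print(auroc(elevated, normal))  # 0.86
```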
Results: S3 & S4 Confidence Scores (figure)
Issues: 1. Generalizability
- Were the subjects representative of those in whom an S3 is relevant?
- Were the study participants (MDs) representative of those who listen for an S3?
  - Is UCSF representative?
  - How many of the attending examinations were done by Kanu Chatterjee?
Issues: 2. Does the Test Provide New Information?
- Blinding observers to the rest of the history and physical is not sufficient
- Options:
  - Compare accuracy of prediction of LVEDP with and without examination for an S3
  - Record all clinical information and use multivariate techniques
Issues: 3. Value of Information
- What decision is the test supposed to help with?
- How often does the test change the decision?
- What is the effect of the change in decision on outcome?
- What is the value of that effect?
Should Every Newborn Have a Bilirubin Test Before Discharge?
- About 60% of newborns develop some jaundice
- Usually it is harmless
- Current practice: check the bilirubin level if jaundice appears significant
- Proposal: check it on all newborns
Kernicterus Public Information Campaign Draft Posters
Advancement of Dermal Icterus in the Jaundiced Newborn Kramer LI, AJDC 1969;118:454
Accuracy of Clinical Judgment in Neonatal Jaundice*
- Research question: How well can clinicians estimate bilirubin levels in jaundiced newborns?
- Study design: cross-sectional study
- Subjects: 122 healthy term newborns (mean age 2 days) whose total serum bilirubin (TSB) was measured in the course of standard newborn care
*Moyer et al., Arch Pediatr Adolesc Med 2000;154:391
Accuracy of Clinical Judgment in Neonatal Jaundice (continued)
- Measurements:
  - Jaundice assessed by attendings, nurse practitioners, and pediatric residents (absent/slight/obvious) at each body part, and TSB estimated
  - TSB levels measured in the clinical laboratory
- Analysis:
  - Agreement on jaundice at each body part by weighted kappa
  - Sensitivity and specificity for TSB ≥ 12 mg/dL
Results: 1. Moyer et al., APAM 2000; 154:391
Results: 2
- Sensitivity of jaundice below the nipple line for TSB ≥ 12 mg/dL = 97%
- Specificity = 19%
- Editor's note: "The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there's some other indication." --Catherine D. DeAngelis, MD
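The editor's conclusion follows from the likelihood ratios implied by these numbers: with sensitivity 97%, a negative exam (no jaundice below the nipple line) sharply lowers the probability of a high TSB, even though the low specificity makes a positive exam nearly uninformative. A sketch, where the 10% pre-test probability is an illustrative assumption, not a figure from the study:

```python
# Likelihood ratios from the slide's sensitivity (0.97) and
# specificity (0.19), and the post-test probability after a
# negative exam for an assumed pre-test probability.

def likelihood_ratios(sens, spec):
    return sens / (1 - spec), (1 - sens) / spec  # LR+, LR-

def post_test_prob(pretest, lr):
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

lr_pos, lr_neg = likelihood_ratios(0.97, 0.19)
print(round(lr_pos, 2), round(lr_neg, 2))      # 1.2 0.16
print(round(post_test_prob(0.10, lr_neg), 3))  # 0.017
```

An LR+ of 1.2 barely moves the probability, while an LR- of 0.16 drops an assumed 10% pre-test probability below 2%.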
Issues: 1
- No information on the numbers of the different types of examiners or their years of experience
  - Generalizability uncertain
- No confidence intervals around sensitivity and specificity
  - Sensitivity was based on 67/69
  - 95% CI: 90% to 99.6%
Issues: 2
- Verification bias (type 1)
  - Infants NOT jaundiced below the nipples were not likely to have a TSB measured
  - Estimated sensitivity too high, specificity too low
Issues: 3
- How often would the bilirubin test alter management?
- How often would this affect outcomes?
  - None of the bilirubin levels in the study was dangerously high
CDC Posters
TIP
- If you are doing a study of test accuracy, Google "STARD checklist"
- STARD = Standards for Reporting of Diagnostic Accuracy
- (Like CONSORT for clinical trials)
Summary: Think About
- The question you are trying to answer, and why
- Sampling of subjects, and maybe of the people doing or interpreting the test
- Measurements: optimal or "real life"?
- Analysis: kappa, weighted kappa, sensitivity, specificity, likelihood ratios, ROC curves, with confidence intervals
- Acknowledge limitations, and think about the effect they would have on the results
Extra/back-up slides
Issues: 1. Spectrum
- Spectrum of disease: what is the distribution of LVEDP in the study subjects and in the population of interest? (figure: frequency distribution of LVEDP)
Results: 2. Moyer, 2000
Reproducibility of Continuous Variables: Bland Altman Plots
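The slide above names the standard tool for reproducibility of continuous measurements. A numeric sketch of the Bland-Altman calculation (bias and 95% limits of agreement), with the scatter plot itself omitted and the paired measurements made up for illustration:

```python
import math

# Bland-Altman analysis: the mean of the paired differences (bias)
# and the 95% limits of agreement, bias +/- 1.96 * SD of differences.

def bland_altman(a, b):
    diffs = [x - y for x, y in zip(a, b)]
    bias = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

method_a = [10.1, 12.3, 8.7, 15.0, 11.2]  # hypothetical lab TSB values
method_b = [9.8, 12.9, 8.2, 14.1, 11.5]   # hypothetical repeat measurements
bias, lower, upper = bland_altman(method_a, method_b)
print(round(bias, 2), round(lower, 2), round(upper, 2))
```

In the usual plot, each pair's difference is graphed against its mean, with horizontal lines at the bias and the two limits of agreement.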
The Effect of Instituting a Prehospital-Discharge Newborn Bilirubin Screening Program in an 18-Hospital Health System*
- Comparison of two time periods, before and after near-universal bilirubin screening
- Results (figure)
- But: no information on phototherapy during the birth admission!
*Eggert LD et al. Pediatrics 2006;117:e855-62