
1 Studies of Medical Tests Thomas B. Newman, MD, MPH September 9, 2008

2 Overview
– General Issues
  – Similarities and differences
  – Types of questions
  – Gold standard
  – Spectrum of disease and of results
  – Sampling and generalizability
– Examples:
  – Reproducibility and accuracy of S3
  – Visual assessment of jaundice

3 What do we mean by “tests”?
– Studies, procedures, maneuvers intended to provide information about the probability of different health states, e.g.,
  – Items of the history and physical examination
  – Blood tests
  – X-rays
  – Endoscopies

4 “Tests” include history questions

5 How are studies of tests similar to other studies?
– Same basic pieces
  – Research question
  – Study design
  – Subjects
  – Predictor variables
  – Outcome variables
  – Analysis
– Same need to generalize from study subjects and measurements to populations and phenomena of interest

6 How are studies of tests different?
– They address different types of questions
  – Primarily descriptive
  – Causal inference may or may not be relevant
  – Confidence intervals rather than P-values
– Different biases
  – Spectrum, verification, etc.
– Different statistics used to summarize results
  – Kappa, sensitivity, specificity, ROC curves, likelihood ratios

7 Diagnostic Test Questions
– How reproducible is it?
– How accurate is it?
– How much new information does it provide?
– How often do results affect clinical decisions?
– What are the costs, risks, and acceptability of the test?
– What is the effect of testing on outcomes?
– How do the answers to these questions vary by patient characteristics?

8 Gold Standard -1
– Needed for studies that measure accuracy
– Can’t include the test being evaluated (incorporation bias)
  – Example: WBC count as a predictor of sepsis in newborns
  – The gold standard (a positive blood culture) is imperfect
  – Why not include probable sepsis, based on the judgment of treating clinicians?
  – Because that judgment is affected by the WBC!

9 Gold Standard -2
– Best if applied blindly
  – Prevents incorporation bias
– Best if applied uniformly
  – Prevents verification bias and double-gold-standard bias
– If imperfect, test accuracy can be under-estimated or over-estimated
  – Example: culture vs PCR for pertussis
– If nonexistent, think about WHY you want to make the diagnosis
  – Examples: ADHD, autism

10 Spectrum of Disease, Nondisease, and Test Results
– Disease is often easier to diagnose if it is severe
– “Nondisease” is easier to diagnose if the patient is well than if the patient has other diseases
– Test results will be more reproducible if ambiguous results are excluded

11 Sources of Variation, Generalizability, and Sampling
– Test characteristics may depend on:
  – How the specimen is obtained and processed
  – How and by whom the test is done and interpreted
– Consider whether you need to sample or stratify results at these levels (depends on the RQ)

12 Studies of Reproducibility
– For tests with no gold standard
– Often done as part of quality control
  – For a larger study
  – For patient care

13 Example: The Third Heart Sound
Marcus et al., Arch Intern Med 2006;166:617-622
– RQs:
  – What is the interobserver variability for hearing S3?
  – How does this vary with level of experience?
– Design: cross-sectional study

14 Study Subjects
– Adults scheduled for non-emergency left-sided heart catheterization at UCSF, 8/03 to 6/04
– N = 100
Marcus et al., Arch Intern Med 2006;166:617-622

15 Examining Physicians
– Cardiology attendings (N = 26)
– Cardiology fellows (N = 18)
– Internal medicine residents (N = 54)
– Internal medicine interns (N = 48)
– All from UCSF?
Marcus et al., Arch Intern Med 2006;166:617-622

16 Measurements
– Auscultation
  – Standard procedure in a quiet room
  – Examiners blinded to other information
– Phonocardiogram with computerized analysis to determine S3

17 Analysis: Kappa
– Measures agreement beyond that expected by chance
– For ordinal variables, use weighted kappa, which gives credit for coming close
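A minimal sketch of how kappa and a linearly weighted kappa can be computed for two raters. The ratings below are made up (0 = absent, 1 = possible, 2 = definite) and are not the S3 study data; the function is illustrative only.

```python
# Sketch: Cohen's kappa and linearly weighted kappa for two raters on an
# ordinal scale. Hypothetical ratings, not data from the Marcus study.
from collections import Counter

def kappa(rater1, rater2, n_categories, weighted=True):
    """Agreement beyond chance; weights give partial credit for near-misses."""
    n = len(rater1)
    joint = Counter(zip(rater1, rater2))      # observed joint distribution
    p1, p2 = Counter(rater1), Counter(rater2) # marginal distributions

    # Disagreement weight: 0 on the diagonal, larger the further apart the ratings.
    def w(i, j):
        return abs(i - j) / (n_categories - 1) if weighted else float(i != j)

    observed = sum(w(i, j) * joint[(i, j)] / n
                   for i in range(n_categories) for j in range(n_categories))
    expected = sum(w(i, j) * (p1[i] / n) * (p2[j] / n)
                   for i in range(n_categories) for j in range(n_categories))
    # kappa = 1 - (observed disagreement / disagreement expected by chance)
    return 1 - observed / expected

rater_a = [0, 0, 1, 2, 2, 1, 0, 2, 1, 0]
rater_b = [0, 1, 1, 2, 1, 1, 0, 2, 2, 0]
print("unweighted kappa:", round(kappa(rater_a, rater_b, 3, weighted=False), 2))
print("linearly weighted kappa:", round(kappa(rater_a, rater_b, 3, weighted=True), 2))
```

Because every disagreement in this toy example is off by only one category, the weighted kappa comes out higher than the unweighted one, which is exactly the "credit for coming close" the slide describes.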

18 Results: Comparison of Auscultation with Phonocardiogram
[Figure: Marcus G, et al. Arch Intern Med 2006;166:617-622]

19 Do S3 and S4 matter?
JAMA 2005;293:2238-2244
– RQ: How well do S3 and S4 predict abnormal (≥15 mm Hg) LVEDP?
– Design: cross-sectional study

20 Study Subjects
– Adults scheduled for non-emergency left-sided heart catheterization at UCSF, 8/03 to 6/04
  – Excluded if poor phonocardiographic quality (N = 8) or paced rhythm (N = 2)

21 Measurements
– Test: S3 (Y/N) and S3 “confidence score” from computer analysis of the phonocardiogram
– “Gold standard”: left ventricular end-diastolic pressure (LVEDP) ≥ 15 mm Hg at catheterization

22 Results: S3 present/absent
– Specificity = 45/49 = 92%, 95% CI (80%, 98%)
– Sensitivity = 17/41 = 41%, 95% CI (26%, 58%)
– Positive predictive value = 17/21 = 81%
– Negative predictive value = 45/69 = 65%
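The 2×2 counts implied by these fractions (TP = 17, FN = 24, FP = 4, TN = 45) can be turned into all four statistics directly. The sketch below uses Wilson score intervals as one reasonable choice of CI; the slide's intervals may have been computed with a different (e.g., exact) method, so small discrepancies are expected.

```python
# Sketch: sensitivity, specificity, PPV, and NPV from the 2x2 counts implied
# by the slide (TP=17, FN=24, FP=4, TN=45), with Wilson 95% CIs.
from math import sqrt

def wilson_ci(x, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion x/n."""
    p = x / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - half, center + half

tp, fn, fp, tn = 17, 24, 4, 45            # S3 heard vs. LVEDP >= 15 mm Hg
for name, x, n in [("sensitivity", tp, tp + fn),   # 17/41
                   ("specificity", tn, tn + fp),   # 45/49
                   ("PPV",         tp, tp + fp),   # 17/21
                   ("NPV",         tn, tn + fn)]:  # 45/69
    lo, hi = wilson_ci(x, n)
    print(f"{name}: {x}/{n} = {x/n:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```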

23 Results: “Confidence Scores”
– Many “dichotomous” tests are not really dichotomous, e.g.:
  – Definite
  – Probable
  – Possible
  – Absent
– The phonocardiogram software generates “confidence scores” for S3 and S4

24 Analysis: ROC Curve
– ROC = “receiver operating characteristic”
– Illustrates the tradeoff between sensitivity and specificity as the cutoff is changed
– Discrimination of the test is measured by the area under the curve (AUROC = c)
  – Perfect test: 1.0
  – Worthless test: 0.5
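A sketch of how the ROC points and the AUROC (c-statistic) follow from the definitions above, using only plain Python. The scores and labels are hypothetical, not the study's phonocardiographic data.

```python
# Sketch: ROC points and AUROC for a hypothetical continuous "confidence score".
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs as the positivity cutoff is lowered."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auroc(scores, labels):
    """c-statistic: probability that a random diseased subject scores higher
    than a random nondiseased subject (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # hypothetical confidence scores
labels = [1,   1,   0,   1,   0,    1,   0,   0]     # 1 = LVEDP >= 15 mm Hg
for fpr, tpr in roc_points(scores, labels):
    print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
print("AUROC =", auroc(scores, labels))   # 1.0 = perfect test, 0.5 = worthless
```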

25 Results: S3 & S4 Confidence Scores

26 Issues: 1. Generalizability
– Were the subjects representative of those in whom an S3 is relevant?
– Were the study participants (MDs) representative of those who listen for an S3?
  – Is UCSF representative?
  – How many of the attending examinations were done by Kanu Chatterjee?

27 Issues: 2. Does the test provide new information?
– Blinding observers to the rest of the H&P is not sufficient
– Options
  – Compare accuracy of prediction of LVEDP with and without examination for S3
  – Record all clinical information and use multivariate techniques
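One way to carry out the second option is to fit prediction models for elevated LVEDP with and without the S3 finding and compare their discrimination. The sketch below is an assumed illustration with simulated data and hypothetical variable names, not the authors' analysis; it requires numpy and scikit-learn.

```python
# Hedged sketch: does adding S3 improve prediction of elevated LVEDP beyond
# other clinical information? All data and variable names here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
history_score = rng.normal(size=n)           # stand-in for other H&P information
s3_present = rng.binomial(1, 0.3, size=n)    # stand-in for the S3 finding
logit = -1.0 + 0.8 * history_score + 1.2 * s3_present
elevated_lvedp = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_without = history_score.reshape(-1, 1)
X_with = np.column_stack([history_score, s3_present])

model_without = LogisticRegression().fit(X_without, elevated_lvedp)
model_with = LogisticRegression().fit(X_with, elevated_lvedp)
auc_without = roc_auc_score(elevated_lvedp, model_without.predict_proba(X_without)[:, 1])
auc_with = roc_auc_score(elevated_lvedp, model_with.predict_proba(X_with)[:, 1])

# In-sample AUROC is optimistic; a real analysis would validate on separate data.
print(f"AUROC without S3: {auc_without:.2f}, with S3: {auc_with:.2f}")
```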

28 Issues: 3. Value of Information
– What decision is the test supposed to help with?
– How often does the test change the decision?
– What is the effect of the change in decision on outcome?
– What is the value of that effect?

29 Should every newborn have a bilirubin test before discharge?
– About 60% of newborns develop some jaundice
– Usually it is harmless
– Current practice: check the bilirubin level if jaundice appears significant
– Proposal: check it on all newborns

30 Kernicterus Public Information Campaign Draft Posters

31 Advancement of Dermal Icterus in the Jaundiced Newborn
Kramer LI, AJDC 1969;118:454

32 Accuracy of Clinical Judgment in Neonatal Jaundice*
– RQ: How well can clinicians estimate bilirubin levels in jaundiced newborns?
– Study design: cross-sectional study
– Subjects: 122 healthy term newborns (mean age 2 days) whose total serum bilirubin (TSB) was measured in the course of standard newborn care
*Moyer et al., Archives Peds Adol Med 2000;154:391

33 Accuracy of Clinical Judgment in Neonatal Jaundice*
– Measurements:
  – Jaundice assessed by attendings, nurse practitioners, and pediatric residents (absent/slight/obvious) at each body part, and total serum bilirubin (TSB) estimated
  – TSB levels measured in the clinical laboratory
– Analysis:
  – Agreement for jaundice at each body part by weighted kappa
  – Sensitivity and specificity for TSB ≥ 12 mg/dL
*Moyer et al., Archives Peds Adol Med 2000;154:391

34 Results: 1. Moyer et al., APAM 2000; 154:391

35 Results: 2
Moyer et al., APAM 2000;154:391
– Sensitivity of jaundice below the nipple line for TSB ≥ 12 mg/dL = 97%
– Specificity = 19%
Editor’s Note: “The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication.” --Catherine D. DeAngelis, MD

36 Issues: 1
– No information on the numbers of different types of examiners or their years of experience
  – Generalizability uncertain
– No CI around sensitivity and specificity
  – Sensitivity based upon 67/69
  – 95% CI: 90% to 99.6%
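One way to put a confidence interval around that 67/69 sensitivity is an exact (Clopper-Pearson) interval, sketched below. The assumption that this is how the quoted 90% to 99.6% figure was obtained is mine; the code requires scipy.

```python
# Sketch: exact (Clopper-Pearson) 95% CI for a sensitivity of 67/69.
from scipy.stats import beta

def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for a binomial proportion x/n."""
    lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lower, upper

lo, hi = exact_ci(67, 69)
print(f"sensitivity = 67/69 = {67/69:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```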

37 Issues: 2
– Verification bias (Type 1)
  – Infants NOT jaundiced below the nipples not likely to have a TSB measured
  – Sensitivity too high, specificity too low
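A small simulation can make the direction of this bias concrete: if the gold standard (the TSB measurement) is applied mostly to test-positive patients, the 2×2 table built from verified patients alone overstates sensitivity and understates specificity. All parameters below are made up for illustration.

```python
# Sketch: how verification bias distorts accuracy estimates. Simulate a test
# with known sensitivity/specificity, verify mostly test-positive patients with
# the gold standard, then recompute accuracy among the verified only.
import random

random.seed(1)
true_sens, true_spec, prevalence = 0.60, 0.80, 0.10
verify_if_positive, verify_if_negative = 0.95, 0.10

tp = fn = fp = tn = 0
for _ in range(100_000):
    diseased = random.random() < prevalence
    test_pos = random.random() < (true_sens if diseased else 1 - true_spec)
    verified = random.random() < (verify_if_positive if test_pos else verify_if_negative)
    if not verified:
        continue                        # unverified patients never enter the 2x2 table
    if diseased and test_pos:
        tp += 1
    elif diseased:
        fn += 1
    elif test_pos:
        fp += 1
    else:
        tn += 1

print(f"apparent sensitivity {tp / (tp + fn):.2f} (true {true_sens})")   # biased upward
print(f"apparent specificity {tn / (tn + fp):.2f} (true {true_spec})")   # biased downward
```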

38 Issues: 3
– How often would the bilirubin test alter management?
– How often would this affect outcomes?
  – None of the bilirubin levels in the study was dangerously high

39 CDC Posters

40 TIP
– If you are doing a study of test accuracy, Google “STARD checklist”
– STARD = Standards for Reporting of Diagnostic Accuracy
– (Like CONSORT for clinical trials)

41 Summary: Think about
– The question you are trying to answer, and why
– Sampling of subjects, and maybe of the people doing or interpreting the test
– Measurements: optimal or “real life”?
– Analysis: kappa, weighted kappa, sensitivity, specificity, likelihood ratios, ROC curves, with confidence intervals
– Acknowledge limitations, and think about the effect they would have on results

42 Extra/back-up slides

43 Issues: 1. Spectrum
– Spectrum of disease: what is the distribution of LVEDP in the study subjects and in the population of interest?
[Figure: histogram of LVEDP (x-axis: LVEDP; y-axis: frequency)]

44 Results: 2. Moyer, 2000

45 Reproducibility of Continuous Variables: Bland-Altman Plots
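A minimal sketch of a Bland-Altman plot for paired continuous measurements: plot the difference between the two measurements against their mean, with the bias and 95% limits of agreement (bias ± 1.96 SD). The paired bilirubin values below are simulated, not data from any study cited here; requires numpy and matplotlib.

```python
# Minimal Bland-Altman sketch for two measurements of the same continuous
# quantity (hypothetical paired bilirubin values).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
true_value = rng.uniform(5, 20, size=60)                     # hypothetical TSB, mg/dL
method_a = true_value + rng.normal(0, 1.0, size=60)
method_b = true_value + 0.5 + rng.normal(0, 1.0, size=60)    # method B reads higher

mean = (method_a + method_b) / 2
diff = method_a - method_b
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)                                # half-width of limits of agreement

plt.scatter(mean, diff)
plt.axhline(bias, linestyle="--")                            # mean difference (bias)
for limit in (bias - loa, bias + loa):
    plt.axhline(limit, linestyle=":")                        # 95% limits of agreement
plt.xlabel("Mean of the two measurements (mg/dL)")
plt.ylabel("Difference, A minus B (mg/dL)")
plt.title("Bland-Altman plot")
plt.show()
```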

46 The Effect of Instituting a Prehospital-Discharge Newborn Bilirubin Screening Program in an 18-Hospital Health System*
– Comparison of two time periods, before and after near-universal bilirubin screening
– Results
– But: no information on phototherapy during the birth admission!
*Eggert LD et al. Pediatrics 2006;117:e855-62

