Studies of Medical Tests Thomas B. Newman, MD, MPH September 9, 2008.

Presentation transcript:

Studies of Medical Tests Thomas B. Newman, MD, MPH September 9, 2008

Overview
• General Issues
  – Similarities and differences
  – Types of questions
  – Gold standard
  – Spectrum of disease and of results
  – Sampling and generalizability
• Examples:
  – Reproducibility and Accuracy of S3
  – Visual assessment of jaundice

What do we mean by “tests”?
• Studies, procedures, maneuvers intended to provide information about the probability of different health states, e.g.,
  – Items of the history and physical examination
  – Blood tests
  – X-rays
  – Endoscopies

“Tests” include history questions

How are studies of tests similar to other studies?
• Same basic pieces
  – Research question
  – Study design
  – Subjects
  – Predictor variables
  – Outcome variables
  – Analysis
• Same need to generalize from study subjects and measurements to populations and phenomena of interest

How are studies of tests different?
• Address different types of questions
  – Primarily descriptive
  – Causal inference may or may not be relevant
  – Confidence intervals rather than P-values
• Different biases
  – Spectrum, verification, etc.
• Different statistics used to summarize results
  – Kappa, sensitivity, specificity, ROC curves, likelihood ratios

Diagnostic Test Questions
• How reproducible is it?
• How accurate is it?
• How much new information does it provide?
• How often do results affect clinical decisions?
• What are the costs, risks, and acceptability of the test?
• What is the effect of testing on outcomes?
• How do the answers to these questions vary by patient characteristics?

Gold Standard -1
• Needed for studies that measure accuracy
• Can’t include test being measured (Incorporation bias)
  – Example: WBC as a predictor of sepsis in newborns
  – Gold standard (+BC) imperfect
  – Why not include probable sepsis, based on judgment of treating clinicians?
  – Judgment affected by WBC!

Gold Standard -2
• Best if applied blindly
  – Prevent incorporation bias
• Best if applied uniformly
  – Prevent verification bias, double-gold standard bias
• If imperfect, test accuracy can be underestimated or overestimated
  – Example: culture vs PCR for pertussis
• If nonexistent, think about WHY you want to make the diagnosis
  – Examples: ADHD, autism

Spectrum of Disease, Nondisease and Test Results
• Disease is often easier to diagnose if severe
• “Nondisease” is easier to diagnose if the patient is well than if the patient has other diseases
• Test results will be more reproducible if ambiguous results are excluded

Sources of variation, generalizability and sampling
• Test characteristics may depend on:
  – How the specimen is obtained and processed
  – How and by whom the test is done and interpreted
• Consider whether you need to sample or stratify results at these levels (depends on the RQ)

Studies of Reproducibility
• For tests with no gold standard
• Often done as part of quality control
  – For a larger study
  – For patient care

Example: The Third Heart Sound
Marcus et al., Arch Intern Med. 2006;166:
• RQs:
  – What is interobserver variability for hearing S3?
  – How does this vary with level of experience?
• Design: cross-sectional study

Study Subjects
• Adults scheduled for non-emergency left-sided heart catheterization at UCSF 8/03 to 6/04
• N=100
Marcus et al., Arch Intern Med. 2006;166:

Examining Physicians
• Cardiology attendings (N=26)
• Cardiology fellows (N=18)
• Internal medicine residents (N=54)
• Internal medicine interns (N=48)
• All from UCSF?
Marcus et al., Arch Intern Med. 2006;166:

Measurements
• Auscultation
  – Standard procedure in quiet room
  – Examiners blinded to other information
• Phonocardiogram with computerized analysis to determine S3

Analysis: Kappa
• Measures agreement beyond that expected by chance
• For ordinal variables use weighted kappa, which gives credit for coming close
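
As a rough illustration of the two statistics named on this slide, the sketch below computes Cohen's kappa for two raters and a linearly weighted kappa for ordinal ratings. The function name `cohen_kappa`, the linear weighting scheme, and all of the ratings are assumptions made for the example; none of them come from the Marcus study.

```python
def cohen_kappa(ratings_a, ratings_b, categories, weighted=False):
    """Cohen's kappa for two raters; linear weights if weighted=True.

    `categories` must be listed in their natural order when weighted=True.
    """
    n = len(ratings_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions: obs[i][j] = share rated i by A and j by B.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        obs[index[a]][index[b]] += 1.0 / n

    row = [sum(obs[i]) for i in range(k)]                        # rater A marginals
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]   # rater B marginals

    def weight(i, j):
        if not weighted:
            return 1.0 if i == j else 0.0
        return 1.0 - abs(i - j) / (k - 1)   # partial credit for near misses

    p_obs = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical data: two examiners judging S3 present (1) or absent (0).
examiner_1 = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
examiner_2 = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]
print(cohen_kappa(examiner_1, examiner_2, categories=[0, 1]))

# Hypothetical ordinal data: jaundice graded absent / slight / obvious.
grades = ["absent", "slight", "obvious"]
rater_1 = ["absent", "absent", "slight", "slight", "obvious", "obvious"]
rater_2 = ["absent", "slight", "slight", "obvious", "obvious", "obvious"]
print(cohen_kappa(rater_1, rater_2, grades, weighted=False))
print(cohen_kappa(rater_1, rater_2, grades, weighted=True))
```

In the ordinal example every disagreement is off by one category, so the weighted version gives a higher value than the unweighted one. Quadratic weights are another common choice and would give different numbers.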

Results: Comparison of Auscultation with Phonocardiogram
Marcus, G. et al. Arch Intern Med 2006;166:

Do S3 and S4 matter?
JAMA. 2005;293:
• RQ: How well do S3 and S4 predict abnormal (≥15 mm Hg) LVEDP?
• Design: cross-sectional study

Study Subjects
• Adults scheduled for non-emergency left-sided heart catheterization at UCSF 8/03 to 6/04
  – Excluded if poor phonocardiographic quality (N=8) or paced rhythm (N=2)

Measurements
• Test: S3 (Y/N) and S3 “confidence score” from computer analysis of phonocardiogram
• “Gold Standard”: Left ventricular end-diastolic pressure ≥ 15 mm Hg at cath

Results: S3 present/absent
Specificity = 45/49 = 92%; 95% CI: (80%, 98%)
Sensitivity = 17/41 = 41%; 95% CI: (26%, 58%)
Positive PV = 17/21 = 81%
Negative PV = 45/69 = 65%
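
The fractions on this slide imply a 2×2 table with 17 true positives, 24 false negatives, 45 true negatives and 4 false positives. The sketch below recomputes the four summary measures from those counts and adds an exact (Clopper-Pearson) binomial confidence interval found by bisection. The slide does not say which interval method was used, so the choice of an exact interval is an assumption, as are the helper names `binom_cdf` and `exact_ci`.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n, by bisection."""

    def solve(cdf_at_p, target):
        # cdf_at_p is decreasing in p; find the p where it equals target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cdf_at_p(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= k) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Counts implied by the fractions on the slide.
tp, fn, tn, fp = 17, 24, 45, 4

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

print(f"Sensitivity {sensitivity:.2f}, 95% CI {exact_ci(tp, tp + fn)}")
print(f"Specificity {specificity:.2f}, 95% CI {exact_ci(tn, tn + fp)}")
print(f"Positive PV {ppv:.2f}, Negative PV {npv:.2f}")
```

The predictive values depend on the prevalence of elevated LVEDP among the study subjects (41 of 90 here), so they would not carry over to a population with a different prevalence.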

Results: “Confidence Scores”
• Many “dichotomous” tests not really dichotomous, e.g.:
  – Definite
  – Probable
  – Possible
  – Absent
• Phonocardiogram software generates “confidence scores” for S3 and S4

Analysis: ROC Curve
• ROC = “Receiver Operating Characteristics”
• Illustrate tradeoff between sensitivity and specificity as the cutoff is changed
• Discrimination of test measured by area under the curve (AUROC = c)
  – Perfect test 1.0
  – Worthless test 0.5
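
To make the cutoff tradeoff concrete, the sketch below lists the ROC points for a continuous score and computes the area under the curve as the probability that a randomly chosen diseased subject scores higher than a randomly chosen non-diseased one, with ties counting one half. The function names and the scores are invented for illustration; they are not the phonocardiogram confidence scores from the study.

```python
def roc_points(scores_diseased, scores_nondiseased):
    """(1 - specificity, sensitivity) pairs, calling score >= cutoff positive."""
    cutoffs = sorted(set(scores_diseased) | set(scores_nondiseased), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing is called positive
    for c in cutoffs:
        sens = sum(s >= c for s in scores_diseased) / len(scores_diseased)
        fpr = sum(s >= c for s in scores_nondiseased) / len(scores_nondiseased)
        points.append((fpr, sens))
    return points


def auroc(scores_diseased, scores_nondiseased):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    wins = 0.0
    for d in scores_diseased:
        for nd in scores_nondiseased:
            if d > nd:
                wins += 1.0
            elif d == nd:
                wins += 0.5
    return wins / (len(scores_diseased) * len(scores_nondiseased))


# Hypothetical confidence scores.
diseased = [8, 6, 5, 3]
nondiseased = [4, 3, 2, 1]

for fpr, sens in roc_points(diseased, nondiseased):
    print(f"1 - specificity = {fpr:.2f}, sensitivity = {sens:.2f}")
print("AUROC:", auroc(diseased, nondiseased))
```

Lowering the cutoff moves along the curve toward the upper right: sensitivity rises while specificity falls. A score that separates the two groups completely gives 1.0, and a score that carries no information gives 0.5.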

Results: S3 & S4 Confidence Scores

Issues: 1. Generalizability
• Were subjects representative of those in whom S3 is relevant?
• Were study participants (MDs) representative of those who listen for S3?
  – UCSF representative?
  – How many of the attending examinations were done by Kanu Chatterjee?

Issues: 2. Does test provide new information?
• Blinding observers to rest of H & P not sufficient
• Options
  – Compare accuracy of prediction of LVEDP with and without examination for S3
  – Record all clinical information and use multivariate techniques

Issues: 3. Value of Information
• What decision is the test supposed to help with?
• How often does the test change the decision?
• What is the effect of the change in decision on outcome?
• What is the value of that effect?

Should every newborn have a bilirubin test before discharge?
• About 60% of newborns develop some jaundice
• Usually it is harmless
• Current practice: Check bilirubin level if jaundice appears significant
• Proposal: check it on all newborns

Kernicterus Public Information Campaign Draft Posters

Advancement of Dermal Icterus in the Jaundiced Newborn Kramer LI, AJDC 1969;118:454

Accuracy of Clinical Judgment in Neonatal Jaundice*
• RQ: How well can clinicians estimate bilirubin levels in jaundiced newborns?
• Study Design: cross-sectional study
• Subjects: 122 healthy term newborns (mean age 2 days) whose total serum bilirubin (TSB) was measured in the course of standard newborn care
*Moyer et al., Archives Peds Adol Med 2000; 154:391

Accuracy of Clinical Judgment in Neonatal Jaundice*
• Measurements:
  – Jaundice assessed by attendings, nurse practitioners and pediatric residents (absent/slight/obvious) at each body part and Total Serum Bilirubin (TSB) estimated
  – TSB levels measured in clinical laboratory
• Analysis
  – Agreement for jaundice at each body part by Weighted Kappa
  – Sensitivity and specificity for TSB ≥ 12 mg/dL
*Moyer et al., Archives Peds Adol Med 2000; 154:391

Results: 1. Moyer et al., APAM 2000; 154:391

Results: 2
Moyer et al., APAM 2000; 154:391
• Sensitivity of jaundice below the nipple line for TSB ≥ 12 mg/dL = 97%
• Specificity = 19%
Editor’s Note: The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication. --Catherine D. DeAngelis, MD
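
One way to read a sensitivity of 97% paired with a specificity of 19% is through likelihood ratios, which the lecture lists among the statistics used for tests. The sketch below derives both ratios from the reported figures and applies them to a pre-test probability. The 10% pre-test probability is a made-up value for illustration, not a figure from the study, and the function names are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, lr):
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Sensitivity and specificity as reported on the slide.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.97, specificity=0.19)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")

# Assumed pre-test probability of TSB >= 12 mg/dL (illustrative only).
pre = 0.10
print(f"Jaundice below nipple line:    {post_test_probability(pre, lr_pos):.3f}")
print(f"No jaundice below nipple line: {post_test_probability(pre, lr_neg):.3f}")
```

On these figures a positive finding barely shifts the probability, because LR+ is close to 1, while a negative finding lowers it substantially. That matches the editor's reading that the absence of jaundice below the nipple line is the informative result, subject to the caveats the lecture raises about how these estimates were obtained.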

Issues: 1
• No information on the numbers of different types of examiners or their years of experience
  – Generalizability uncertain
• No CI around sensitivity and specificity
  – Sensitivity based upon 67/69
  – 95% CI: 90% to 99.6%

Issues: 2
• Verification bias (Type 1)
  – Infants NOT jaundiced below the nipples not likely to have a TSB measured
  – Estimated sensitivity too high, estimated specificity too low
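
The direction of this bias can be checked with simple arithmetic. The sketch below starts from a hypothetical full 2×2 table, then keeps every test-positive infant but only a fraction of test-negative infants, mimicking a study in which the gold standard is mostly measured when the test is positive. All counts and verification fractions are invented for the example.

```python
def sens_spec(tp, fn, fp, tn):
    """Sensitivity and specificity from 2x2 counts."""
    return tp / (tp + fn), tn / (tn + fp)


def partially_verified(tp, fn, fp, tn, verify_pos=1.0, verify_neg=0.1):
    """Counts that remain when only some subjects get the gold standard.

    verify_pos / verify_neg are the fractions of test-positive and
    test-negative subjects whose TSB is actually measured (assumed values).
    """
    return tp * verify_pos, fn * verify_neg, fp * verify_pos, tn * verify_neg


# Hypothetical full population of 1000 infants.
# "Test positive" = jaundice below the nipple line; "disease" = TSB >= 12 mg/dL.
full = dict(tp=80, fn=20, fp=300, tn=600)

true_sens, true_spec = sens_spec(**full)
obs_sens, obs_spec = sens_spec(*partially_verified(**full))

print(f"All infants verified:       sensitivity {true_sens:.2f}, specificity {true_spec:.2f}")
print(f"Test-negatives 10% verified: sensitivity {obs_sens:.2f}, specificity {obs_spec:.2f}")
```

Dropping unverified test-negatives removes false negatives and true negatives from the table, which shrinks the denominator of sensitivity toward the true positives and the denominator of specificity toward the false positives.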

Issues: 3
• How often would the bilirubin test alter management?
• How often would this affect outcomes?
  – None of the bilirubin levels in the study was dangerously high

CDC Posters

TIP
• If you are doing a study of test accuracy, Google STARD Checklist
• STARD = Standards for Reporting of Diagnostic Accuracy
• (Like CONSORT for clinical trials)

Summary: Think about
• The question you are trying to answer and why.
• Sampling of subjects, and maybe of people doing or interpreting the test
• Measurements – optimal or “real life”?
• Analysis – Kappa, Weighted Kappa, Sensitivity, Specificity, Likelihood Ratios, ROC curves, with confidence intervals
• Acknowledge limitations, think about the effect they would have on results

Extra/back-up slides

Issues: 1. Spectrum
• Spectrum of disease: what is distribution of LVEDP in study subjects and in population of interest?
[Figure: frequency distribution of LVEDP]

Results: 2. Moyer, 2000

Reproducibility of Continuous Variables: Bland Altman Plots
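
The slide gives only the title of this method, so the following is a minimal sketch of what a Bland-Altman analysis computes, under the usual convention of plotting each pair's difference against its mean and drawing limits of agreement at the mean difference ± 1.96 standard deviations. The paired measurements are hypothetical and the function name is an assumption.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return plot points plus bias and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    pair_means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = mean(diffs)          # average difference between the two methods
    sd = stdev(diffs)           # sample SD of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    points = list(zip(pair_means, diffs))   # x = pair mean, y = difference
    return points, bias, limits


# Hypothetical paired bilirubin readings (mg/dL) from two measurement methods.
readings_a = [10, 12, 14, 16, 18]
readings_b = [11, 11, 15, 15, 19]

points, bias, limits = bland_altman(readings_a, readings_b)
print(f"Bias: {bias:.2f} mg/dL")
print(f"95% limits of agreement: {limits[0]:.2f} to {limits[1]:.2f} mg/dL")
for x, y in points:
    print(f"mean {x:.1f}, difference {y:+.1f}")
```

Whether the limits are acceptable is a clinical judgment about how much disagreement would change a decision, not a statistical one.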

The Effect of Instituting a Prehospital-Discharge Newborn Bilirubin Screening Program in an 18-Hospital Health System*
• Comparison of two time periods, before and after near-universal bilirubin screening
• Results
• But: no info on phototherapy during birth admission!
*Eggert LD et al. Pediatrics 2006;117:e855-62