Evaluation of Diagnostic Tests Presenter: Akash Ranjan Moderator: Dr Chetna Maliye
Framework: Introduction; Determining a useful diagnostic test; Evaluation of a diagnostic test; Gold standard; Measures of diagnostic accuracy; ROC curve; Multiple testing; Reliability of a test; Relationship between reliability and validity; References.
Correctly classifying individuals by Disease Status: Tests are used in medical diagnosis, screening and research to classify subjects into diseased and non-diseased groups. Ideally, all subjects who have the disease should be classified as “having the disease”, and all subjects who do not have it should be classified as “not having the disease”.
Diagnostic Test and Screening Test: A diagnostic test is used to determine the presence or absence of a disease when a subject shows signs or symptoms of the disease. A screening test identifies asymptomatic individuals who may have the disease. A diagnostic test is performed after a positive screening test to establish a definitive diagnosis. Examples of screening tests: Pap smear; fasting and postprandial blood sugar for diabetes mellitus; blood pressure for hypertension; mammography for breast cancer; fasting blood cholesterol for heart disease; ocular pressure for glaucoma.
Useful Diagnostic Test: Reproducibility; Accuracy; Feasibility; Effects on clinical decisions; Outcomes.
Evaluation of Diagnostic Test: Evaluation assesses the ability of a test to classify individuals into the correct disease status in a reliable manner, and helps in making decisions about the use and interpretation of the test. It is done by determining validity (internal validity and external validity) and reliability.
Simplify Data: Many test results are ordinal or continuous variables. Complex data are reduced to a simple dichotomy: present/absent, abnormal/normal, disease/well.
Distribution of Systolic Blood Pressures: Males, Ages 40–64
Gold Standard: The accuracy of a test is established by independent comparison with a “gold standard”. Ideally, the gold standard is a 100% accurate test; in practice, its sensitivity and specificity approach 100%. Examples: histopathology, cytopathology, radiologic contrast procedures, prolonged follow-up, autopsy. (Diagram: the gold standard separates all people with disease from all people without disease.)
Measure of Diagnostic Accuracy: Comparison of disease status as determined by the gold standard test and by the index test.
                 Disease + (gold standard)   Disease − (gold standard)
Index test +     a (true positives)          b (false positives)
Index test −     c (false negatives)         d (true negatives)
Sensitivity: The proportion of people with the disease who have a positive test result for the disease.
$\text{Sensitivity} = \frac{a}{a + c}$
A sensitive test will rarely miss people with the disease. It is used when there is an important penalty for missing the disease, e.g. cervical cancer, breast cancer, HIV.
Specificity: The proportion of people without the disease who have a negative test result.
$\text{Specificity} = \frac{d}{b + d}$
A specific test is useful to confirm (“rule in”) a diagnosis. It is used when false positive results can harm the patient physically and financially, e.g. cancer chemotherapy. Note: when screening for a prevalent disease like diabetes mellitus, for which treatment does not markedly alter the outcome, false positives should be limited; otherwise the health system will be overburdened with the diagnostic demands of those who test positive.
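As a minimal sketch of the two definitions above, the following Python snippet computes sensitivity and specificity from the four cells of the 2x2 table. The function names are illustrative, and the counts (a = 34, b = 168, c = 1, d = 282) are borrowed from the DVT example used in the likelihood ratio slides of this presentation.

```python
def sensitivity(a, c):
    """Proportion of diseased people who test positive: a / (a + c)."""
    return a / (a + c)


def specificity(b, d):
    """Proportion of non-diseased people who test negative: d / (b + d)."""
    return d / (b + d)


# Cell counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives.
a, b, c, d = 34, 168, 1, 282

sn = sensitivity(a, c)
sp = specificity(b, d)
print(f"Sn = {sn:.0%}, Sp = {sp:.0%}")
```

With these counts the result rounds to the Sn = 97% and Sp = 63% quoted for that example.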
Factors establishing Sensitivity and Specificity: (1) Spectrum of patients: a test may not distinguish between patients when the differences between them are subtle. (2) Bias: the sensitivity (Sn) and specificity (Sp) of a test should be assessed separately from the diagnosis, so the test result should not be part of the information used in making the diagnosis, e.g. X-ray. (3) Chance: estimates of Sn and Sp from a small sample are imprecise; the confidence interval expresses the precision of the estimate.
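To illustrate the role of chance, the sketch below computes a 95% confidence interval for a sensitivity estimate. The Wilson score interval is used here as one common choice for a proportion; the slides do not say which method was intended, so treat the method, the function name and the sample sizes as assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed sensitivity (about 97%), two hypothetical sample sizes.
small = wilson_interval(34, 35)
large = wilson_interval(340, 350)
print(f"n = 35:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n = 350: {large[0]:.3f} to {large[1]:.3f}")
```

The interval from 35 diseased subjects is several times wider than the one from 350, which is the sense in which a small sample gives an imprecise estimate.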
Trade-off between Sensitivity and Specificity: It is desirable to have a test that is both highly sensitive and highly specific, but this is usually not possible. Instead, there is a trade-off between the Sn and Sp of a diagnostic test. When test results are expressed on a continuous scale, Sn can be increased only at the expense of Sp, and vice versa. (Table: Trade-off between sensitivity and specificity when diagnosing diabetes. Columns: blood sugar after an 8-hour fast, sensitivity (%), specificity (%).)
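The trade-off can be sketched with a few lines of Python. The blood sugar values below are invented purely for illustration (they are not the data from the slide), and the rule "positive when the value is at or above the cut-off" is an assumption.

```python
def sn_sp_at_cutoff(diseased, non_diseased, cutoff):
    """Return (Sn, Sp) when a result >= cutoff is called positive."""
    sn = sum(v >= cutoff for v in diseased) / len(diseased)
    sp = sum(v < cutoff for v in non_diseased) / len(non_diseased)
    return sn, sp


# Hypothetical fasting blood sugar values (mg/dL), for illustration only.
diseased = [95, 110, 120, 130, 150, 180]
non_diseased = [70, 80, 85, 90, 100, 115]

for cutoff in (90, 105, 120):
    sn, sp = sn_sp_at_cutoff(diseased, non_diseased, cutoff)
    print(f"cut-off {cutoff}: Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Lowering the cut-off catches more diseased subjects (higher Sn) but labels more healthy subjects positive (lower Sp); raising it does the opposite.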
ROC Curve
ROC Curve: A receiver operating characteristic (ROC) curve is drawn by plotting sensitivity against the false positive rate (1 − Sp) over a range of cut-off values. Tests that discriminate well crowd towards the upper left corner of the plot; for them, as Sn increases (lower cut-off point) there is little or no loss in specificity until very high levels of Sn are achieved. Tests that perform less well have curves that fall closer to the diagonal running from lower left to upper right. The curve shows how severe the trade-off between Sn and Sp is and helps to decide where the best cut-off point should be. Generally this is near the shoulder of the ROC curve, unless there are clinical reasons for minimizing either false negatives or false positives.
ROC Curve: ROC curves are also used in comparing alternative tests for the same diagnosis. The overall accuracy of a test can be described by the area under the ROC curve: the larger the area, the better the test. Example: compared with the MAST (Michigan Alcoholism Screening Test), the CAGE test is both more sensitive and more specific, and includes a much larger area under its curve.
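A rough sketch of how ROC points and the area under the curve can be computed by hand is given below. The measurements are hypothetical, the helper names are illustrative, and the area is obtained with the trapezoidal rule, which is one simple way to do it.

```python
def roc_points(diseased, non_diseased):
    """(false positive rate, sensitivity) pairs, strictest cut-off first."""
    cutoffs = sorted(set(diseased) | set(non_diseased), reverse=True)
    points = [(0.0, 0.0)]  # cut-off above every value: nobody tests positive
    for c in cutoffs:
        tpr = sum(v >= c for v in diseased) / len(diseased)
        fpr = sum(v >= c for v in non_diseased) / len(non_diseased)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test values, for illustration only.
diseased = [95, 110, 120, 130, 150, 180]
non_diseased = [70, 80, 85, 90, 100, 115]

points = roc_points(diseased, non_diseased)
print(f"AUC = {area_under_curve(points):.3f}")
```

An area of 1.0 corresponds to perfect separation of the two groups, while a curve lying on the diagonal gives 0.5.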
Predictive Accuracy (“Clinician’s dilemma”): For a clinician, the dilemma is to determine whether or not the patient has the disease, given the result of a test. Clinicians are more concerned to know: what is the probability of having the disease when the test is positive, and what is the probability of not having the disease when the test is negative? The positive predictive value (PPV; also called post-test probability or posterior probability) is the probability of disease in a patient with a positive test result.
$\text{PPV} = \frac{a}{a + b}$
It reflects the diagnostic power of a test, depends on Sn and Sp, and increases with the prevalence of the disease in the population.
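Predictive Accuracy: PPV expressed in terms of sensitivity (Sn), specificity (Sp) and prevalence (Pr):
$\text{PPV} = \frac{\text{Sn} \times \text{Pr}}{(\text{Sn} \times \text{Pr}) + (1 - \text{Sp}) \times (1 - \text{Pr})}$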
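The formula above translates directly into code. In this sketch the function name is illustrative, the Sn and Sp values come from the DVT example in the likelihood ratio slides, and the prevalence values in the loop are hypothetical.

```python
def ppv(sn, sp, prevalence):
    """Positive predictive value from Sn, Sp and prevalence."""
    true_pos = sn * prevalence
    false_pos = (1 - sp) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


sn, sp = 34 / 35, 282 / 450  # from the 2x2 table of the DVT example

# Hypothetical prevalences, to show how PPV moves with prevalence.
for prevalence in (0.01, 0.07, 0.30):
    print(f"prevalence {prevalence:.0%}: PPV = {ppv(sn, sp, prevalence):.1%}")
```

The same test yields a much higher PPV when it is applied in a population where the disease is common.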
Predictive Accuracy
Predictive Accuracy: The negative predictive value (NPV) is the probability that a patient with a negative test result does not have the disease.
$\text{NPV} = \frac{d}{c + d}$
It reflects the diagnostic power of a test, depends on Sn and Sp, and decreases as the prevalence of the disease in the population increases.
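NPV can also be written in terms of Sn, Sp and prevalence, as the counterpart of the PPV formula. That form is not given on the slides; it follows from Bayes' theorem and is sketched here with an illustrative function name, the Sn and Sp of the DVT example, and hypothetical prevalence values.

```python
def npv(sn, sp, prevalence):
    """Negative predictive value from Sn, Sp and prevalence."""
    true_neg = sp * (1 - prevalence)
    false_neg = (1 - sn) * prevalence
    return true_neg / (true_neg + false_neg)


sn, sp = 34 / 35, 282 / 450  # from the 2x2 table of the DVT example

# Hypothetical prevalences, to show how NPV moves with prevalence.
for prevalence in (0.01, 0.07, 0.30):
    print(f"prevalence {prevalence:.0%}: NPV = {npv(sn, sp, prevalence):.1%}")
```

As prevalence rises, a negative result becomes less reassuring, so NPV falls.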
Likelihood Ratios: The positive likelihood ratio (LR+) is the ratio of the proportion of diseased people with a positive test result (Sn) to the proportion of non-diseased people with a positive test result (1 − Sp). The negative likelihood ratio (LR-) is the proportion of diseased people with a negative test result (1 − Sn) divided by the proportion of non-diseased people with a negative test result (Sp).
$\text{LR}^{+} = \frac{\text{Sn}}{1 - \text{Sp}}$
$\text{LR}^{-} = \frac{1 - \text{Sn}}{\text{Sp}}$
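A short sketch of both ratios follows. The function names are illustrative, and the Sn and Sp values are taken from the DVT example that the following slides use.

```python
def lr_positive(sn, sp):
    """LR+ = Sn / (1 - Sp)."""
    return sn / (1 - sp)


def lr_negative(sn, sp):
    """LR- = (1 - Sn) / Sp."""
    return (1 - sn) / sp


sn, sp = 34 / 35, 282 / 450  # from the 2x2 table of the DVT example
print(f"LR+ = {lr_positive(sn, sp):.1f}, LR- = {lr_negative(sn, sp):.2f}")
```

These values round to the LR+ of 2.6 and LR- of 0.05 reported for that example.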
Likelihood Ratios: Example: a positive test is about 2.6 times more likely to be found in the presence of DVT (deep vein thrombosis) than in its absence. Advantages of LRs: they do not change with changes in prevalence; they can be used at multiple levels of test results; and they can describe the overall odds of disease when a series of diagnostic tests is used.
Likelihood Ratios: Techniques of using LRs: the mathematical approach, or a likelihood ratio nomogram.
            Disease +   Disease −
Test +      34          168
Test −      1           282
Sn = 97%, Sp = 63%, prevalence = 7%, PPV = 17%, NPV = 100%, LR+ = 2.6, LR- = 0.05
Step 1: Convert pretest probability to pretest odds: odds = 0.07 / (1 − 0.07) = 0.075
Step 2: Post-test odds = pretest odds × LR+ = 0.075 × 2.6 = 0.195
Step 3: Convert post-test odds to post-test probability: P = 0.195 / (1 + 0.195) = 16%
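The three steps of the mathematical approach can be sketched as a single function. The function name is illustrative; the inputs are the rounded prevalence (7%) and LR+ (2.6) from the example.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Apply a likelihood ratio to a pretest probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)  # step 1
    post_test_odds = pretest_odds * likelihood_ratio                # step 2
    return post_test_odds / (1 + post_test_odds)                    # step 3


p = post_test_probability(0.07, 2.6)
print(f"post-test probability = {p:.0%}")
```

With unrounded inputs (prevalence 35/485 and LR+ from the exact Sn and Sp) the same function returns 34/202, i.e. the PPV of about 17%, so the small gap between 16% and 17% is only a rounding effect.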
Likelihood Ratios Using a likelihood ratio nomogram
Multiple Tests: A single test frequently results in a probability of disease that is neither very high nor very low. Usually it is not acceptable to stop the diagnostic process at that point; the physician needs to raise or lower the probability of disease in such situations, and multiple tests help the clinician in this regard. Multiple tests are applied in two basic ways: parallel testing (all at once) and serial testing (consecutive).
Multiple Tests: Parallel testing (all at once): a positive result on any test is considered evidence for disease. It is used when rapid assessment is needed, e.g. in hospitalized or emergency patients, and is useful when a very sensitive strategy is needed. The net effect is a more sensitive diagnostic strategy.
Serial testing (consecutive): the decision to order the next test in the series is based on the result of the previous test. All tests must give a positive result for the diagnosis to be made, because the diagnostic process is stopped by a negative result. This maximizes Sp and PPV but lowers Sn and NPV. A serial testing strategy is used when rapid assessment of patients is not required, or when some tests are expensive or risky; such tests are used only after simpler or safer tests suggest the presence of disease. For example, maternal age and blood tests (AFP, chorionic gonadotropin and estradiol) are used to identify pregnancies at higher risk of delivering a baby with Down syndrome; mothers found to be at higher risk by these tests are then offered amniocentesis.
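The effect of the two strategies on net sensitivity and specificity can be sketched as follows. The formulas are not stated on the slides: they assume the two tests are independent of each other given disease status, which real tests often are not. The Sn and Sp values for tests A and B are hypothetical.

```python
def parallel(sn1, sp1, sn2, sp2):
    """Net (Sn, Sp) when a positive on EITHER test counts as positive."""
    net_sn = 1 - (1 - sn1) * (1 - sn2)
    net_sp = sp1 * sp2
    return net_sn, net_sp


def serial(sn1, sp1, sn2, sp2):
    """Net (Sn, Sp) when BOTH tests must be positive."""
    net_sn = sn1 * sn2
    net_sp = 1 - (1 - sp1) * (1 - sp2)
    return net_sn, net_sp


# Hypothetical tests: A (Sn 80%, Sp 60%) and B (Sn 90%, Sp 90%).
par = parallel(0.80, 0.60, 0.90, 0.90)
ser = serial(0.80, 0.60, 0.90, 0.90)
print(f"parallel: Sn = {par[0]:.0%}, Sp = {par[1]:.0%}")
print(f"serial:   Sn = {ser[0]:.0%}, Sp = {ser[1]:.0%}")
```

Under these assumptions parallel testing raises sensitivity above that of either test at the cost of specificity, and serial testing does the reverse.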
Multiple Tests
Reliability of a test: Reliability (repeatability) means that the test is able to give the same result again and again. Regardless of the Sn and Sp of a test, if the test result cannot be reproduced, the value and usefulness of the test are minimal. Factors that contribute to variation between test results: intra-subject variation (within individual subjects), intra-observer variation (between readings by the same observer), and inter-observer variation (between those reading the test results).
Reliability of a test: Intra-subject variation. Test results can vary within the same individual; therefore, in evaluating any test result, it is important to consider the conditions under which the test was performed, including the time of day.
Table: Examples showing variation in blood pressure readings during a 24-hour period
Blood pressure (mmHg)   Female, aged 27 yr   Female, aged 62 yr   Male, aged 33 yr
Basal                   110/70               132/82               152/109
Lowest hour             86/47                102/61               123/78
Highest hour            126/79               172/94               153/107
Casual                  108/64               155/93               157/109
Reliability of test: Intra-observer variation is the variation that occurs between two observations made by the same observer. For example, a radiologist who reads the same group of X-rays at two different times may read one or more X-rays differently the second time. Tests and examinations differ in the degree to which subjective factors enter into the observer's conclusions; the greater the subjective element in the reading, the greater the intra-observer variation is likely to be.
Reliability of test: Inter-observer variation is the variation between observers. The kappa statistic (kappa measure of agreement) measures the extent to which observers agree or disagree in quantitative terms. It is the difference between the observed agreement ($I_0$) and the agreement expected by chance ($I_e$), expressed as a fraction of the maximum possible difference. Since the maximum value of $I_0$ is 1, this gives
$K = \frac{I_0 - I_e}{1 - I_e}$
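A minimal sketch of the kappa calculation for two observers is shown below. It assumes the usual two-rater (Cohen) form, in which expected agreement is derived from each observer's marginal totals; the function name and the counts in the table are hypothetical.

```python
def kappa(table):
    """Kappa for a square agreement table: rows = observer 1, columns = observer 2."""
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical readings of 100 X-rays by two radiologists.
#            observer 2: abnormal, normal
readings = [[40, 10],   # observer 1: abnormal
            [5, 45]]    # observer 1: normal
print(f"kappa = {kappa(readings):.2f}")
```

Here the observers agree on 85% of films, chance alone would give 50%, and kappa expresses the 35-point excess as a fraction of the 50 points that were available above chance.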
Relationship between Validity and Reliability: Reliability (repeatability): the test is able to give the same result again and again. Validity: the test is able to measure what it is intended to measure.
Comparison of reliability and validity using graphical presentation: When the distribution of test results is a broad curve centered on the true value, the results can be described as valid. However, the results are valid only for a group (they tend to cluster around the true value). It is important to remember that what may be valid for a group or a population may not be so for an individual in a clinical setting. When the reliability of a test is poor, the validity of the test for a given individual will also be poor.
References:
Beaglehole R, Bonita R, Kjellstrom T. Basic Epidemiology. Geneva: World Health Organization; 1993.
Fletcher RH, Fletcher SW. Clinical Epidemiology: The Essentials. 3rd ed. Baltimore: Lippincott Williams and Wilkins; 1996. p. 35-56.
Gordis L. Epidemiology. Pennsylvania: Elsevier Saunders; 2004. p. 71-94.
Armitage P, Berry G. Statistical Methods in Medical Research. 3rd ed. London: Blackwell Scientific Publications; 1994. p. 445.