Published by Regina Leaming. Modified over 10 years ago.
1
Is it True? Evaluating Research about Diagnostic Tests
2
The Case of Baby Jeff
3
The Case of Baby Jeff
CPK testing for Muscular Dystrophy
* Sensitivity: 100%
* Specificity: 99.98%
* Prevalence: 1 in 5,000 (0.02%)
4
100,000 newborn boys
Prevalence = 1 in 5,000 = 0.02% = 20 newborn boys with M.D.
* 20 will have M.D. (sensitivity 100%)
  * 20 correctly positive
  * 0 false negative
* 99,980 will not have M.D. (specificity 99.98%)
  * 20 false positive
  * 99,960 correctly negative
Positive results: 40 positive tests
* 50% truly positive
* 50% falsely positive
* PPV: 50%
Negative results: 99,960 negative tests
* 100% truly negative
* 0 falsely negative
* NPV: 100%
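The counting on this slide can be reproduced with a short Python sketch. The function name `natural_frequencies` and the dictionary keys are illustrative assumptions, not part of the presentation; the inputs are the Baby Jeff figures given above.

```python
def natural_frequencies(population, prevalence, sensitivity, specificity):
    """Split a screened population into the four possible test outcomes."""
    diseased = population * prevalence
    healthy = population - diseased

    true_pos = diseased * sensitivity      # diseased and test positive
    false_neg = diseased - true_pos        # diseased but test negative
    true_neg = healthy * specificity       # healthy and test negative
    false_pos = healthy - true_neg         # healthy but test positive

    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        # Predictive values are read off the positive and negative test groups.
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }


# CPK screening example: 100,000 newborn boys, prevalence 1 in 5,000.
result = natural_frequencies(100_000, 1 / 5_000, sensitivity=1.0, specificity=0.9998)
for name, value in result.items():
    print(f"{name}: {value:,.4f}")
```

The counts are not rounded inside the function, so the false positives come out as 19.996 rather than the 20 shown on the slide; the predictive values are essentially unchanged.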
5
Why this is important
6
Other examples
Lyme disease
* Sensitivity = 95%; specificity = 95%
* High prevalence (20%): PPV = 83%
* Low prevalence (2%): PPV = 28%
Echocardiogram as part of executive physical
* Prevalence = 10%; PPV = 50%
Here are some more examples of how prevalence affects the behavior of a test. When the prevalence of Lyme disease is high, the antibody test has a high positive predictive value. In patients who live in places where the disease prevalence is low, the same test has a low positive predictive value. See: Brown SL. Role of serology in the diagnosis of Lyme disease. JAMA 1999;282:62-6.
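The Lyme disease figures can be checked with a minimal Python sketch. The function name `ppv` is an assumption made for illustration; the calculation treats prevalence, sensitivity, and specificity as proportions and computes the share of positive tests that are true positives.

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value from prevalence and test characteristics."""
    true_pos = prevalence * sensitivity               # diseased and test positive
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy but test positive
    return true_pos / (true_pos + false_pos)


# Same antibody test (95% sensitive, 95% specific) at two prevalences.
for prevalence in (0.20, 0.02):
    print(f"prevalence {prevalence:.0%}: PPV = {ppv(prevalence, 0.95, 0.95):.0%}")
# prevalence 20%: PPV = 83%
# prevalence 2%: PPV = 28%
```

Only the prevalence argument changes between the two calls, which is the point of the slide: the test is the same, but the meaning of a positive result is not.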
7
Technical vs. Clinical Precision
Technical precision (unaffected by prevalence)
* Sensitivity: the percentage of patients with the disease who have a positive test
* Specificity: the percentage of patients without disease who test negative
Clinical precision (changed by prevalence)
* Positive predictive value: the percentage of patients with a positive test who have the disease
* Negative predictive value: the percentage of patients with a negative test who are without disease
Sensitivity is the percentage of patients with a specific disease who have a positive test. The positive predictive value, in contrast, is the percentage of patients with a positive test who have the disease. If you are a clinician holding a positive lab report for the patient in front of you, you and your patient will want to know the likelihood that the patient actually has the disease: the positive predictive value. In our example, a total of 40 male infants tested positive on the CPK test, but only half of them actually have muscular dystrophy. Thus, the positive predictive value of the test was only 50%.
8
Predictive Values
* Positive Predictive Value: the percentage of patients with a positive test who have the disease
* Negative Predictive Value: the percentage of patients with a negative test who don’t have the disease
9
Let’s practice
Task 1. A serum test screens pregnant women for babies with Down’s syndrome. The test is a very good one, but not perfect. Roughly 1% of babies have Down’s syndrome. If the baby has Down’s syndrome, there is a 90% chance that the result will be positive. If the baby is unaffected, there is still a 1% chance that the result will be positive. A pregnant woman has been tested and the result is positive.
10
1,000 similar patients
Prevalence = 1% = ___ patients/1,000?
* Positive: 90% correctly identified
* Negative: 99% correctly identified
12
Down’s Syndrome: 1,000 similar patients
Prevalence = 1% = 10 with Down’s syndrome
* 10 with Down’s syndrome (positive: 90% correctly identified)
  * 9 correctly positive
  * 1 false negative
* 990 without Down’s syndrome (negative: 99% correctly identified)
  * 10 false positive
  * 980 correctly negative
Positive results: 19 positive tests
* 47.4% truly positive
* 52.6% falsely positive
Negative results: 981 negative tests
* 99.9% truly negative
* 0.1% falsely negative
13
Task 2
A 45-year-old woman presents with a sore throat and cough but without fever, tonsillar exudate, or cervical nodes. Using a clinical decision rule, you determine her likelihood of having strep throat is 1%. However, according to your office protocol, your medical assistant has already performed a rapid strep (antigen) test, which is positive. What is the likelihood the patient has strep throat now?
Antigen test: sensitivity 88%, specificity 96%
14
Strep throat: 100,000 similar patients
Prevalence = 1% = 1,000 with strep
* 1,000 with strep (sensitivity 88%)
  * 880 correctly positive
  * 120 false negative
* 99,000 viral (specificity 96%)
  * 3,960 false positive
  * 95,040 correctly negative
Positive results: 4,840 positive tests
* 18% truly positive
* 82% falsely positive
* PPV: 18%
Negative results: 95,160 negative tests
* 99.87% truly negative
* 0.126% falsely negative
* NPV: 99.87%
15
Adopting new screening/diagnostic tests
* Sensitivity/specificity not enough
* Testing as an intervention
* Did the authors study an outcome patients care about?
16
Levels of “POEMness” for Diagnostic Tests
1. Sensitivity & specificity
2. Does it change diagnoses?
3. Does it change treatment?
4. Does it change outcomes?
5. Is it worthwhile (to patients and/or society)?
(examples: HbA1c for DM, CPK vs T4/PKU in newborns, electron beam tomography for CAD)
Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991;11:88-94.
Now that we understand test characteristics, we need to take the next step and determine how useful the test really is. We call this the “level of POEMness” (derived from Fryback and Thornbury, cited above).
The second step is to determine whether the test changes our ability to make a diagnosis. Again, let’s look to the proteomic pattern test. It has good sensitivity and specificity, but its ability to make the diagnosis is very small (although it would be a good test to confirm the diagnosis).
The third level is to determine whether the test will change our approach to therapy. For example, sputum samples are often suggested in guidelines for the treatment of pneumonia. However, research has shown that physicians frequently do not change treatment when the culture results are known a day or two later.
The next step is to determine whether using the test changes outcomes. For example, routine monitoring of HbA1c did not affect outcomes in the United Kingdom Prospective Diabetes Study (see: Turner RC, Holman RR, Cull CA, et al. Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes (UKPDS 33). Lancet 1998;352: ; McCormack J, Greenhalgh T. Seeing what you want to see in randomised controlled trials: versions and perversions of UKPDS data. BMJ 2000;320: ). On the other hand, using the BNP test in the emergency department to determine whether the patient does or doesn’t have heart failure has been shown to decrease admissions, decrease length of stay, and speed the initiation of appropriate therapy (see: Mueller C, Scholer A, Laule-Kilian K, et al. Use of B-type natriuretic peptide in the evaluation and management of acute dyspnea. N Engl J Med 2004;350: ).
The ultimate test will also be shown to be worthwhile, which is a judgment call. Screening for disease is often done without evaluating its value to patients and society. The same money used to screen for one disease could also be used to treat other diseases or even to screen for other diseases. Screening newborns for muscular dystrophy may not be worthwhile because it’s a disease that cannot be treated (thus early diagnosis has no value).
The only reliable method for determining whether a diagnostic test has reached POEM levels 2, 3, 4 and 5 is to compare one population of patients in which the new test is used with another population in which it is not used. The group in which a particular patient ends up must be determined randomly, with concealed allocation. Rarely do we find this level of evidence for a new diagnostic test. We must ask the question: “If I start using this test today, as opposed to not having used it yesterday, will my patients really be better off? Will they end up living longer or living better?”
As we’ve talked about before, checking CPK in male newborns does not get past level 2. On the other hand, early identification and treatment of newborns with hypothyroidism or phenylketonuria can have a dramatic effect on their lives and on society. We don’t yet know at all how well electron beam tomography will work in improving CAD outcomes.
17
Screening pulse oximetry for CHD
Diagnostic performance of abnormal pulse oximetry for congenital heart defects
For all major congenital defects
* sensitivity 49.06%
* specificity %
* positive predictive value 13.33%
* negative predictive value 99.86%
For critical congenital defects
* sensitivity 75%
* specificity 99.12%
* positive predictive value 9.23%
* negative predictive value 99.97%
Lancet 2011 Aug 27;378(9793):785
18
Screening pulse oximetry for CHD
Jaundice, terminating breast-feeding, and the vulnerable child
Breast-feeding was more common in the jaundiced group (61% vs 79%). By 1 month, more mothers of jaundiced infants had completely stopped breast-feeding (19% vs 42%). Mothers of jaundiced infants were also more likely to have never left the baby with anyone else (including the father), or to have left the baby at most one time for less than 1 hour (15% vs 31%), and they had more well-visits and more ED visits (2% vs 11%, not including bilirubin measurements). Thus, a diagnosis of jaundice may increase the risk for premature termination of breast-feeding and for development of the vulnerable child syndrome.
Pediatrics 1989 Nov;84(5):773-8
19
Naming is not curing
In the 1600s, astrology dominated medicine as a healing profession. Neither worked, but astrology was much more popular because it focused on fixing people's problems. Medicine, on the other hand, focused mainly on categorizing illnesses (i.e., diagnosing) and not so much on treatment. 400 years later there is still a priority on categorizing, regardless of whether it's helpful. A correct diagnosis is only useful when it results in the selection of a treatment that benefits the patient; otherwise, it's only a label.
James Burke. The Day the Universe Changed. Boston: Little, Brown and Company, 1985, p. 333.
20
              Disease present        Disease absent
TEST +        A (true positives)     B (false positives)
TEST -        C (false negatives)    D (true negatives)
By convention, the 2 x 2 table is set up this way, with the disease on the top and the test on the side. It’s not absolutely necessary to do it this way, but doing so in a consistent way makes it easier to organize data.
* Box A includes patients with the disease who test positive (true positives).
* Box B includes patients who test positive but don’t have the disease (false positives).
* Box C includes patients with the disease who test negative (false negatives).
* Box D includes patients without the disease who test negative (true negatives).
See: Sackett DL, et al. Clinical Epidemiology: A Basic Science for Clinical Medicine. 2nd ed. Boston: Little, Brown and Company; Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA 1994;271: ; Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703-7.
21
Sensitivity
The sensitivity of the test is defined as the percentage of people with the disease who test positive: true positives/(true positives + false negatives), or a/(a+c). A 100% sensitive test is positive for everyone who has the disease. A test that is positive for only half of those with the disease has a 50% sensitivity.
22
Specificity
Specificity is the percentage of people not having the disease who test negative: true negatives/(true negatives + false positives), or d/(b+d). A specificity of 99% means that 99% of patients without a particular disease will test negative. Only one person out of every 100 without the disease tests positive, a 1% false positive rate.
23
Positive Predictive Value
The positive predictive value is the percentage of people with a positive test who have the disease: true positives/(true positives + false positives), or a/(a+b).
These slides summarize how the table can be used to calculate sensitivity, specificity, and positive and negative predictive values. For visual learners, the key to these illustrations is that sensitivity and specificity are calculated “up and down,” and the predictive values are calculated horizontally.
24
Negative Predictive Value
The negative predictive value is the percentage of people with a negative test who don’t have the disease: true negatives/(true negatives + false negatives), or d/(c+d).
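The four calculations on the 2 x 2 table can be sketched in Python. The class name `TwoByTwo` and its method names are assumptions made for illustration; the fields follow boxes A to D as described on the table slide, and the counts are taken from the strep throat example in Task 2.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    a: float  # disease present, test positive (true positives)
    b: float  # disease absent, test positive (false positives)
    c: float  # disease present, test negative (false negatives)
    d: float  # disease absent, test negative (true negatives)

    # Sensitivity and specificity are calculated "up and down" (by column).
    def sensitivity(self):
        return self.a / (self.a + self.c)

    def specificity(self):
        return self.d / (self.b + self.d)

    # Predictive values are calculated horizontally (by row).
    def ppv(self):
        return self.a / (self.a + self.b)

    def npv(self):
        return self.d / (self.c + self.d)


# Counts from the strep throat example (100,000 similar patients).
strep = TwoByTwo(a=880, b=3_960, c=120, d=95_040)
print(f"sensitivity {strep.sensitivity():.0%}, specificity {strep.specificity():.0%}")
print(f"PPV {strep.ppv():.0%}, NPV {strep.npv():.2%}")
```

Keeping the four counts together in one object makes it harder to mix up rows and columns, which is the usual source of error when these values are worked out by hand.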
25
Likelihood Ratios
* Similar to the concepts of “ruling in” and “ruling out” disease
* Pre Test Odds x LR = Post Test Odds
* The problem: we don’t think in terms of odds
* Clinical decision rules: do the hard math for us, but we need to enter the appropriate data and interpret results
The likelihood ratio is a measure, like sensitivity and specificity, that tells us the accuracy with which the test identifies the target disorder. The likelihood ratio tells us, after the test results are known, how much the pretest odds of disease are increased or decreased. A positive likelihood ratio of 19 for a specific test, for example, means that a positive result makes the odds of the disease being present 19 times higher than the pretest odds. Likewise, a negative likelihood ratio of 0.5 means that, for individuals testing negative, the odds of disease are half the pretest odds.
The best part about likelihood ratios is that, by converting percentages into odds, we can multiply the odds by the likelihood ratio to derive the new odds, and then convert back to percentages (the last step is unnecessary if you spend a lot of time at the track and can think in terms of odds). This is mathematically difficult for most clinicians and patients to do in their heads, but clinical decision rules found on computers do the math for us. All we have to do is enter the probability (percentage) of patients who have the disease before the test, and the rule calculates the post test probability (percentage) depending on whether the test is negative or positive. We’ll show examples of this in a later section.
See: Jaeschke R, Guyatt GH, Sackett DL. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703-7.
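The odds arithmetic described here can be sketched in a few lines of Python. The function names are illustrative assumptions, and the formulas for the positive and negative likelihood ratios are the standard definitions, which the slide does not spell out. The example reuses the strep throat figures from Task 2 (pretest probability 1%, sensitivity 88%, specificity 96%).

```python
def probability_to_odds(probability):
    return probability / (1 - probability)


def odds_to_probability(odds):
    return odds / (1 + odds)


def positive_lr(sensitivity, specificity):
    # How much a positive result multiplies the pretest odds.
    return sensitivity / (1 - specificity)


def negative_lr(sensitivity, specificity):
    # How much a negative result multiplies the pretest odds.
    return (1 - sensitivity) / specificity


def post_test_probability(pretest_probability, likelihood_ratio):
    # Pre Test Odds x LR = Post Test Odds, then convert back to a probability.
    post_odds = probability_to_odds(pretest_probability) * likelihood_ratio
    return odds_to_probability(post_odds)


after_positive = post_test_probability(0.01, positive_lr(0.88, 0.96))
after_negative = post_test_probability(0.01, negative_lr(0.88, 0.96))
print(f"after a positive test: {after_positive:.1%}")
print(f"after a negative test: {after_negative:.2%}")
```

Under these assumptions the result after a positive test agrees with the 18% positive predictive value found by counting patients in Task 2, so the two approaches are different routes to the same answer.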
26
II. Are The Results Valid?
* Diagnostic test compared with the “gold standard” on all patients
* Blinded comparison
* Independent testing
* Consecutive patient enrollment (adequate spectrum of disease)
(Must have all for LOE = 1b)
Was the new test compared with a gold standard: another method that is already proven or accepted to determine, with high certainty, whether a disease is present or absent? We want to make sure that the reference, or gold standard, is performed on everyone and not just on patients for whom the test in question was positive (or negative). If only patients with a positive test undergo testing with the gold standard, under the assumption that all those testing negative are truly negative, the test will look more accurate than it might truly be in the real world. This is called “verification bias” and can distort our understanding of the ability of the test being studied.
At this point you might also ask whether the new diagnostic test offers anything over the gold standard. Is the test something reasonable? What is its expense, in terms of money, time, or the effect on the patient? For example, determining whether someone has sepsis by doing a self-injected heart biopsy may not be a very reasonable test. If the new test is not reasonable, it might be best to stick with the gold standard.
Was the comparison done in a blind fashion? We want to make sure the results of one test are not known to the person interpreting the second test. This is to prevent bias in determining whether the test was normal or abnormal, or whether the disease was present or absent. Everyone can think of examples of this bias: seeing the previously unnoticed lesion on chest x-ray after seeing the computerized tomographic scan, or suddenly hearing the cardiac murmur after its presence is pointed out by another clinician. For example, let’s say someone is comparing echocardiogram with B-type natriuretic peptide (BNP) to identify patients with heart failure. Since the echocardiogram requires interpretation, knowing the results of the BNP might affect this interpretation.
Were the tests performed independently?
Was the test applied to patients who presented consecutively, or did the investigators “cherry pick” patients with mild or more severe disease? If the test is only done on people with severe disease and is shown to be useful, it doesn’t necessarily mean it will work as well in patients with milder forms of the same disease. We want to make sure the patients are drawn from a source where a full spectrum of disease presentation occurs or, at the very least, from a patient group similar to your own. For example, the BNP test could be used in an emergency department setting where patients could have symptoms ranging from a cough to pitting edema. Evaluation studies should then be performed in an emergency department setting. These results will not apply, however, to hospitalized patients with known heart failure, since the spectrum of disease is different.
All of these qualities must be present for a diagnostic test to have a 1b recommendation.
27
II. Are The Results Valid?
What are the results?
* Sensitivity, specificity and predictive values
* Likelihood ratio calculation
* Prevalence of disease in the study population: Typical? Similar to your practice?
The next step is to determine, or find in the report, the test characteristics. Is the prevalence of the disease in the study’s population similar to that in one’s own practice? If not, you will need to recalculate predictive values by changing the prevalence in the 2x2 table while keeping the sensitivity and specificity the same (which is a little tricky). Another way to understand the value of the test is to remember that, if the prevalence in your situation is lower than that in the study group, a larger share of positive results will be false positives. For example, in the proteomic pattern test example, the prevalence of disease in the study group was 33% (the investigators picked patients already known to have ovarian cancer and balanced the group out with patients known not to have cancer). Knowing that the prevalence of cancer in an unselected group is about 0.04%, you can intuit that false positives will be more common (although you probably wouldn’t intuit that over 99% of positive results would be false).
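The “little tricky” recalculation can be sketched in Python: take sensitivity and specificity from the study’s 2 x 2 table, then rebuild the predictive values at the prevalence in your own setting. The function name `predictive_values_at` and the study counts below are hypothetical, chosen only so that one third of the study patients have the disease; they are not the actual figures from the proteomic pattern study.

```python
def predictive_values_at(prevalence, a, b, c, d):
    """Return (PPV, NPV) at a new prevalence, keeping the study's
    sensitivity and specificity fixed. a, b, c, d are the 2 x 2 boxes."""
    sensitivity = a / (a + c)
    specificity = d / (b + d)

    # Rebuild the four boxes as proportions of a population with the new prevalence.
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1 - sensitivity)
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical study table: 50 diseased and 100 healthy patients (prevalence 33%).
study = dict(a=50, b=5, c=0, d=95)

ppv_study, _ = predictive_values_at(50 / 150, **study)
ppv_unselected, _ = predictive_values_at(0.0004, **study)
print(f"PPV at study prevalence (33%): {ppv_study:.1%}")
print(f"PPV at 0.04% prevalence: {ppv_unselected:.2%}")
```

With these made-up counts, the positive predictive value falls from roughly 90% in the study group to under 1% in an unselected group, which illustrates how more than 99% of positive results can be false even when the test itself has not changed.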