Slide 1: Statistical Issues in CADe Evaluations
CAD Panel Meeting, November 18, 2009
Thomas E. Gwise, Ph.D.
Mathematical Statistician / Acting Team Leader
Division of Biostatistics, Office of Surveillance and Biometrics
Slide 2: Outline
- Statistical concepts
- Reader studies for CADe evaluation
  - Prospective and retrospective
  - Retrospective study design examples
  - Complications in retrospective studies
- Choice of endpoints
- Choice of controls
- Standalone studies
  - Compared to reader studies
  - Re-use of data
Slide 3: Statistical Evaluation of Diagnostic Tests
Two dimensions are considered when evaluating diagnostic test performance:
- How well can the test detect diseased cases?
  Sensitivity: the fraction of diseased patients who are test positive.
- How well can the test correctly identify the non-diseased cases?
  Specificity: the fraction of non-diseased patients who are test negative.
Sensitivity and specificity are not comparable if estimated in separate studies. (A small computational sketch follows below.)
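A minimal sketch of these two definitions (illustrative code, not from the presentation; the toy data are made up):

```python
# Estimating sensitivity and specificity from paired truth/test-call data.
def sensitivity_specificity(truth, test_positive):
    """truth[i] is True for diseased cases; test_positive[i] is the test call."""
    tp = sum(1 for d, t in zip(truth, test_positive) if d and t)
    fn = sum(1 for d, t in zip(truth, test_positive) if d and not t)
    tn = sum(1 for d, t in zip(truth, test_positive) if not d and not t)
    fp = sum(1 for d, t in zip(truth, test_positive) if not d and t)
    se = tp / (tp + fn)  # fraction of diseased patients who test positive
    sp = tn / (tn + fp)  # fraction of non-diseased patients who test negative
    return se, sp

truth         = [True, True, True, False, False, False, False]
test_positive = [True, True, False, False, False, True, False]
print(sensitivity_specificity(truth, test_positive))  # (0.666..., 0.75)
```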
Slide 4: ROC Curves
ROC curves plot sensitivity against 1 - specificity over all possible decision cutoffs.
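A minimal sketch of how an empirical ROC curve is traced (assumed construction, not from the slides): sweep the cutoff over every observed score, recording one (1 - Sp, Se) operating point per cutoff.

```python
# Trace an empirical ROC curve; higher score = more suspicious for disease.
def roc_points(scores_diseased, scores_healthy):
    """Return (fpr, tpr) pairs for every possible cutoff."""
    cutoffs = sorted(set(scores_diseased) | set(scores_healthy), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        tpr = sum(s >= c for s in scores_diseased) / len(scores_diseased)  # Se
        fpr = sum(s >= c for s in scores_healthy) / len(scores_healthy)    # 1 - Sp
        points.append((fpr, tpr))
    return points

pts = roc_points([0.9, 0.8, 0.6, 0.4], [0.7, 0.5, 0.3, 0.2])
print(pts)  # each cutoff yields one (1 - Sp, Se) operating point
```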
Slide 5: Statistical Evaluation of Diagnostic Tests
Does the test add value?
- Example: Is a diagnostic test for bone mineral density better than just using a person's age in diagnosing osteoporosis?
- Example: Does use of a CADe device improve the diagnostic performance of readers?
Examples of improvement:
- Sensitivity and specificity both better
- ROC plot (or area) better
- Improved reading time with the same performance
Slide 6: Intended Use
- The vast majority of CADe device submissions to date have been for devices labeled as second readers, aids to physicians.
- The user is directed to completely evaluate the images as practice dictates before initiating CADe.
- As such, it is expected that using the device in accordance with the label will improve the performance of the physician.
Slide 7: Prospective Study
If study conduct matches intended use, a multi-center, prospective, randomized clinical trial is generally considered a good way to test for a change in performance, e.g.:
- Randomize patients to the experimental conditions: unassisted image reading vs. CADe-assisted image reading.
- Manage patients according to the evaluations, as in routine clinical practice.
- Follow up patients to determine true disease state.
- Analyze results and compare performance under the two experimental conditions.
Slide 8: Prospective Studies: Pros
- Study conduct matches the indications for use (routine clinical practice, where reader decisions affect patient management).
- Provides an estimate of performance under intended-use conditions.
Slide 9: Drawbacks to Prospective Randomized Trials
- For intended-use populations where disease prevalence is low, such a prospective study would require a long time and very large enrollment to obtain enough disease cases to compare the performance of the two modalities.
- Risk to participants, if patient management will depend on readings in the study (an IDE may be required).
Slide 10: Possible Proxies for Diagnostic Performance in the Population
- Retrospective reader studies
- Standalone studies (bench testing without a reader)
Slide 11: Retrospective Reader Studies
- Reader evaluations are made off-line on a retrospective set of images for which the disease status of patients has been established according to ground-truthing rules.
- Multi-reader multi-case (MRMC) designs: multiple readers read some or all images.
- The sample is enriched with disease cases.
Slide 12: Retrospective Reader Studies: Pros
- Nonsignificant risk, because reader results are not used to manage patients (IDE not required).
- Very efficient: a relatively small sample size can yield precise estimates of sensitivity, specificity, the ROC curve, and the CADe effect on these endpoints.
Slide 13: Retrospective Reader Studies: Cons
- Reading behavior may not be the same as in routine clinical practice because:
  - Readers know their readings do not matter to the patient.
  - Readers may detect the enrichment, which could affect their reading behavior.
- Enrichment causes spectrum bias. Example: enriching with challenging cases results in
  - downward bias in reader performance, and
  - upward bias in the CADe effect on the reader.
- A small number of readers may not generalize to the reader population.
Slide 14: Complications in Retrospective Reader Studies
- Reader variability issues
- Enrichment-related biases
- Choice of controls
- Assumptions
Slide 15: Reader Variability
108 US mammographers reading a common set of 79 mammograms provided a rating of suspicion of disease using the Breast Imaging Reporting and Data System (BI-RADS) rating scale of 1-5, where 5 is the highest level of suspicion of cancer.
Data from Beam et al., "Variability in the interpretation of screening mammograms by US radiologists," Arch Intern Med 1996;156:209-213, as in Wagner et al., "Assessment of medical imaging and computer-assist systems: lessons from recent experience," Acad Radiol 2002;9:1264-1277.
Slide 16: [Figure: individual reader operating points from the Beam data; axes Sensitivity and Specificity]
Slide 17: Number of Readers
- Companies have submitted studies with from 5 to 20 readers.
- The reader sample should represent the intended-use population of readers.
- A small number of readers may not be representative of the reader population.
Slide 18: Enrichment
- Enrichment is the process of supplementing the image sample with disease-positive images.
- Performance estimates obtained with enriched study samples will likely differ from performance in the intended-use population.
- One may infer that differences in performance between modalities are qualitatively applicable to the intended-use population if the spectrum of disease is properly represented.
Slide 19: Enrichment (Spectrum Effect)
- Different case mixes of lesion types will likely result in different performance estimates (spectrum effect).
- For example, in mammography a CADe may have more difficulty detecting some masses than microcalcifications. A sample in which the proportion of microcalcifications to masses is large will give higher performance estimates than a sample in which that proportion is smaller. (See the arithmetic sketch below.)
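A small arithmetic sketch of the spectrum effect (the per-lesion-type sensitivities are made-up illustrative numbers, not from the slides): the overall sensitivity estimate is just a case-mix-weighted average, so shifting the mix shifts the estimate.

```python
# Hypothetical per-lesion-type sensitivities, chosen only for illustration.
se_microcalc, se_mass = 0.95, 0.70

def overall_se(frac_microcalc):
    """Overall Se is the case-mix-weighted average of per-type sensitivities."""
    return frac_microcalc * se_microcalc + (1 - frac_microcalc) * se_mass

print(overall_se(0.8))  # microcalcification-heavy sample: 0.90
print(overall_se(0.2))  # mass-heavy sample: 0.75
```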
Slide 20: [Figure: score distributions for Disease (-) and Disease (+) cases]
Slide 21: [Figure: score distributions for Disease (-) and Disease (+) cases, different case mix]
Slide 22: Enrichment (Easy Cases)
- Consider a sample of images enriched with a large proportion of disease-positive cases easily detected by readers and CADes.
- Performance estimates for both modalities will likely be high.
- It may be difficult to detect a difference in performance between the two modalities.
Slide 23: [Figure, simulated data: ROC curves for Reader Alone (red) and Reader w/CADe]
Slide 24: Enrichment (Challenging Cases)
- Stress test: a study in which the image sample is enriched with a large proportion of positive cases considered difficult to detect by readers and CADes.
- Goal: to show that the device can add value in cases that are difficult for readers.
- Performance results obtained from studies on enriched samples cannot be easily generalized across studies.
Slide 25: [Figure, simulated data: ROC curves for Reader and Reader w/CADe]
Slide 26: Enrichment (Context Bias)
- Readers in a study environment will become aware of the enrichment and could change their reading behavior in response.
- Investigators attempt to mitigate this context bias by estimating relative performance.
- Egglin et al., "Context bias: a problem in diagnostic radiology," JAMA 1996;276:1752-1755.
Slide 27: Background for Questions on Endpoints
Contrast endpoints at specific thresholds (Se/Sp) with aggregating endpoints (ROC).
Slide 28: ROC Curves and Decision Variable Models
- ROC curves show how well a test separates disease test scores from non-disease test scores.
- Assume that a decision variable can model a reader's decision process.
- Example: probability of malignancy (POM). Readers are instructed to rate an image with respect to the probability that it is malignant.
- Ratings simulated for 25 healthy and 25 diseased images. (A simulation sketch follows below.)
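A minimal simulation sketch in the spirit of this slide (the Gaussian means and spread are assumptions; the slide does not give them), with the empirical AUC computed in its Mann-Whitney form:

```python
# Simulate POM-style ratings from a binormal decision-variable model:
# 25 healthy and 25 diseased images.
import random

random.seed(0)
healthy  = [random.gauss(0.0, 1.0) for _ in range(25)]  # Disease (-) scores
diseased = [random.gauss(1.5, 1.0) for _ in range(25)]  # Disease (+) scores; mean shift assumed

def auc(neg, pos):
    """Mann-Whitney AUC: P(diseased score > healthy score), ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(healthy, diseased))
```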
Slide 29: [Figure: Gaussian score distributions for Disease (-) and Disease (+) cases]
Slide 30: ROC Curves Depend on Relative Ranking
- ROC curves are invariant to monotone transformations of the decision variable.
- The relative ranking of scores is the key.
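A quick check of the invariance claim (illustrative code with made-up scores, not from the slides): applying any strictly increasing transform leaves every pairwise comparison, and hence the AUC, unchanged.

```python
import math

neg = [0.1, 0.4, 0.35, 0.8]   # hypothetical non-disease scores
pos = [0.9, 0.5, 0.85, 0.6]   # hypothetical disease scores

def auc(neg, pos):
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# exp(10*x) is strictly increasing, so the relative ranking is unchanged.
assert auc(neg, pos) == auc([math.exp(10 * s) for s in neg],
                            [math.exp(10 * s) for s in pos])
print(auc(neg, pos))  # 0.875 either way
```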
Slide 31: [Figure: ROC curve for the Gaussian model]
Slide 32: Complication
- "Very large fraction of responses for certain detection tasks are in the extreme ranges of the scale." (Gur et al.)
- A similar pattern is not uncommon in reader study results submitted to FDA.
- Gur et al., "'Binary' and 'non-binary' detection tasks: are current performance measures optimal?" Acad Radiol 2007;14:871-876.
Slide 33: [Figure: rating distributions for Disease (-) and Disease (+) cases]
Slide 34: Binary Detection Tasks
- Certain tasks that are binary in nature are better represented by a binary endpoint, both conceptually and statistically.
- In simulations, Gur et al. showed that a binary task is evaluated with less bias and variability if a binary scale rather than a continuous scale is used.
- For a task that is essentially binary, such as detecting microcalcifications, how rigorous can we expect relative rankings to be?
- Gur et al., "'Binary' and 'non-binary' detection tasks: are current performance measures optimal?" Acad Radiol 2007;14:871-876.
Slide 35: ROC-Based Endpoints
- Good for comparing tests over all possible cutoffs.
- Use information efficiently.
- The following slides discuss details associated with ROC analyses.
Slide 36: [Figure: ROC curves for the control modality and the CADe modality]
The difference between the AUCs is the average difference in Se over all Sp.
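Stated as a formula (a standard identity, written out here for clarity; f denotes the false positive fraction 1 - Sp):

```latex
\Delta\mathrm{AUC}
  = \int_0^1 \bigl[\mathrm{Se}_{\mathrm{CADe}}(f) - \mathrm{Se}_{\mathrm{control}}(f)\bigr]\,df
```

That is, the AUC difference averages the sensitivity gap uniformly over all false positive fractions, equivalently over all specificities.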
Slide 37: [Figure: two ROC curves with similar areas]
Comparable AUCs? It depends on the clinical context.
Slide 38: [Figure: ROC curves for the control modality and the CADe modality]
- Is all of the difference in AUC clinically relevant?
- Is it possible to weight regions according to clinical relevance? A partial AUC? Other device-specific criteria?
- A context-dependent bound. (A partial-AUC sketch follows below.)
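A hedged sketch of one region-focused summary mentioned on the slide (the trapezoidal computation and toy curve are assumptions, not the panel's method): the partial AUC restricted to a clinically relevant FPR window.

```python
# Partial AUC over an FPR window, by trapezoidal integration of an empirical
# ROC curve given as (fpr, tpr) points sorted by fpr, including (0,0) and (1,1).
def partial_auc(points, fpr_lo, fpr_hi):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lo, hi = max(x0, fpr_lo), min(x1, fpr_hi)  # clip segment to the window
        if lo >= hi:
            continue
        def interp(x):  # linear interpolation of tpr within the segment
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0) if x1 > x0 else y0
        area += (hi - lo) * (interp(lo) + interp(hi)) / 2.0
    return area

curve = [(0.0, 0.0), (0.1, 0.6), (0.2, 0.8), (1.0, 1.0)]
print(partial_auc(curve, 0.1, 0.2))  # pAUC on 0.1 < FPR < 0.2: ~0.07
```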
Slide 39: Thresholds (Se/Sp)
- Intuitive.
- Binary, similar to practice: work up or not?
- Obviate adapting readers to unfamiliar rating scales.
- Mimic reality.
- Same framework as post-market information (spectrum bias is still an issue).
Slide 40: Example: the "Keep All Positives from Unaided Read" Rule
Several second-reader CADe device labels require or imply that positive findings on the initial unaided read should not be negated by the CADe-aided read.
Slide 41: Endpoints Specific to Intended Use ("Keep All Positives from Unaided Read" Rule)
From the SecondLook label:
"Therefore, the radiologist's work-up decision should not be altered if the system fails to mark an area that the radiologist has detected on the initial film review and has already decided requires further work-up. Nor should the decision be affected if the system marks an area that the radiologist decides is not suspicious enough to warrant further work-up, whether the area is detected by the radiologist on initial film review or only after being marked by the system."
From the R2 label:
"The radiologist should base interpretation only upon the original images and not depend on the CAD markers for interpretation. The device is a detection aid, not an interpretative aid. The CAD markers should be activated only after the first reading. The device does not identify all areas that are suspicious for cancer. Some lesions are not marked by the device and a user should not be dissuaded from working up a finding if the device fails to mark that site."
Slide 42: Applying the "Keep All Positives from Unaided Read" Rule
- Change in Se from unaided to CADe-aided reading: non-negative.
- Change in Sp from unaided to CADe-aided reading: non-positive.
- Bound the increase of the false positive fraction (FPF). (A per-case check is sketched below.)
[Figure: unaided reader operating point (1 - Sp, Se) with a success region*]
*Biggerstaff, "Comparing diagnostic tests: a simple graphic using likelihood ratios," Stat Med 2000.
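A hedged sketch of how the rule could be verified in study data (the record layout and field names are hypothetical, not from the slides): under the rule, any case called positive unaided must stay positive aided, so aided Se cannot fall and any Sp loss appears as an FPF increase to be bounded.

```python
# Per-case unaided and CADe-aided calls (hypothetical layout).
def check_keep_all_positives(cases):
    """cases: list of dicts with keys 'diseased', 'unaided_pos', 'aided_pos'."""
    violations = [c for c in cases if c["unaided_pos"] and not c["aided_pos"]]

    def rates(key):
        pos = [c for c in cases if c["diseased"]]
        neg = [c for c in cases if not c["diseased"]]
        se = sum(c[key] for c in pos) / len(pos)
        fpf = sum(c[key] for c in neg) / len(neg)
        return se, fpf

    (se_u, fpf_u), (se_a, fpf_a) = rates("unaided_pos"), rates("aided_pos")
    return violations, se_a - se_u, fpf_a - fpf_u  # expect no violations, dSe >= 0

cases = [
    {"diseased": True,  "unaided_pos": True,  "aided_pos": True},
    {"diseased": True,  "unaided_pos": False, "aided_pos": True},
    {"diseased": False, "unaided_pos": False, "aided_pos": True},
    {"diseased": False, "unaided_pos": False, "aided_pos": False},
]
print(check_keep_all_positives(cases))  # ([], 0.5, 0.5)
```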
Slide 43: Image Sample Required for Comparing the Same Two ROC Curves Using Different Accuracy Measures
- Compare sample size needs for various measures.
- Context: two specified ROC curves.
- Detectable change in AUC.
- Corresponding detectable change at given false positive rates (FPRs) or over given FPR intervals.
- Zhou, Obuchowski, and McClish (2002), Statistical Methods in Diagnostic Medicine, Wiley & Sons, NY.
Slide 44: Detectable Changes
[Figure: ROC plot (Se vs. FPR) marking Se at FPR = 0.2 and the PAUC over 0.1 < FPR < 0.2, normalized by the interval width (0.2 - 0.1)]
Slide 45: Sample Size Efficiency

Measure of Accuracy                     Detectable Change   N total (n+ = n-)
ROC AUC                                 0.100               278
Se (FPR = 0.01)                         0.108               930
Se (FPR = 0.10)                         0.201               482
Se (FPR = 0.20)                         0.276               382
PAUC (FPR < 0.1) / (FPR2 - FPR1)        0.167               722
PAUC (FPR < 0.2) / (FPR2 - FPR1)        0.182               522
PAUC (0.1 < FPR < 0.2) / (FPR2 - FPR1)  0.198               384

Adapted from Table 6.8, Zhou, Obuchowski, and McClish (2002), Statistical Methods in Diagnostic Medicine, Wiley & Sons, NY. (A sample-size sketch follows below.)
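A hedged sketch of where such numbers come from (an assumed generic normal-approximation form, not necessarily the book's exact derivation): the required N scales with the estimator's variance term and inversely with the square of the detectable change, which is why less efficient measures in the table demand larger samples.

```python
# Generic sample-size form: N per group ~ (z_{a/2} + z_b)^2 * v / delta^2.
from statistics import NormalDist

def n_per_group(delta, v, alpha=0.05, power=0.80):
    """delta: detectable change; v: variance term of the accuracy estimator."""
    z = NormalDist()
    z_a, z_b = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return (z_a + z_b) ** 2 * v / delta ** 2

# Halving the detectable change quadruples the required N, matching the
# table's pattern (smaller detectable change -> larger N).
print(n_per_group(delta=0.10, v=1.0))  # ~785
print(n_per_group(delta=0.20, v=1.0))  # ~196
```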
Slide 46: A Not-Uncommon Problem
- AUC is difficult to interpret.
- Post hoc PAUC as a rescue?
- Choosing the bound has Type I error implications.
- An N sized for AUC may be too small to get useful PAUC or Se/Sp estimates.
- Inadequate information => failed study.
Slide 47: Endpoint Summary
- Sensitivity and specificity are more relevant than ROC AUC to the dichotomous decisions made in image reading.
- Drawbacks to using ROC analysis:
  - AUCs are not always easy to interpret (crossing curves, comparable FPF regions).
  - Is reader scoring representative of practice?
Slide 48: Endpoint Summary
"So my comment is about CADe. I want to point out that the ROC, which is, of course, a wonderful device for assessing the process is not perfectly relevant from the clinical setting. The clinical setting, there is a particular algorithm cut-point and decision are dichotomous. And so one had ought to focus on specific points on the ROC curve. And it seems to me that it is essential that your– that companies show that they have improved sensitivity, which to me means statistical significance or Bayesian probability that the sensitivity is improved. This is a very low hurdle." (D. Berry, March 2008 Radiological Devices Advisory Panel meeting)
Pepe, M.S., Urban, N., Rutter, C., and Longton, G. (1997), "Design of a study to improve accuracy in reading mammograms," J Clin Epidemiol 50:1327-1338.
van Belle, G. (2002), Statistical Rules of Thumb, Wiley & Sons, NY (p. 100).
Slide 49: Control Arm Discussion
- It is assumed that effectiveness or clinical utility can be shown by comparing unaided image reading to CADe-aided image reading.
- We formulated several questions for the panel concerning control arms for 510(k) review (substantial equivalence). The next slides provide some background.
Slide 50: Example Non-Inferiority Test
[Figure: confidence interval for (reader performance with new CADe) minus (reader performance with predicate CADe), plotted against 0]
Success: the CI of the difference in improvements is greater than some preset limit. (A sketch of this criterion follows below.)
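A hedged sketch of the success criterion (an assumed Wald-style interval; the slide does not specify the variance estimator, and in a real MRMC study the standard error must account for correlated readers and cases):

```python
# Non-inferiority check on the difference in improvements.
from statistics import NormalDist

def noninferior(improve_new, improve_pred, se_diff, margin, alpha=0.05):
    """improve_*: (aided - unaided) performance gains; margin: preset limit
    (e.g. -0.05 AUC); se_diff: standard error of the difference (assumed known)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = improve_new - improve_pred
    lower = diff - z * se_diff
    return lower, lower > margin  # success if the lower CI bound clears the margin

print(noninferior(improve_new=0.04, improve_pred=0.05, se_diff=0.02, margin=-0.05))
# (-0.049..., True): a small deficit, but within the preset limit
```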
Slide 51: Study Design #1
Readers read a common set of images under three modalities:
- Unaided reading
- CADe-aided reading with the study device
- CADe-aided reading with the predicate
Notes:
- CADe-aided reading is done according to the label.
- Randomize image order; washout periods between modalities.
- Compare performance results.
- Unaided reading comparisons ensure clinical utility.
- A non-inferiority delta can be defined.
Slide 52: Study Design #2
Unaided reading vs. CADe-aided reading:
- Unaided reading
- CADe-aided reading with the study device
- Randomize image order; washout periods between modalities.
- Compare performance results to recorded predicate performance (label, prior study).
Slide 53: CADe SE Study Example
Assume study design #2 from the previous slide.
- Case mix: predicate study, difficult to detect; new device study, easy to detect.
- Readers: predicate study, experienced specialists; new device study, minimally experienced.
- The performance changes (with CADe minus without CADe) are similar in the two studies.
Slide 54: Changes in Performance Are Not Comparable Across Studies
- In design #2, the comparison across studies is confounded by spectrum bias and reader differences.
- With such a design, comparing changes across enriched studies effectively reduces the question to whether or not the CADe device offers any increase in performance over unaided reading.
- With respect to performance, comparing across enriched studies invites imprecise or erroneous SE and NSE conclusions due to confounding (case mix, reader differences, and others).
Slide 55: Example Non-Inferiority Test (Revisited)
[Figure: confidence interval for the difference in improvements, as on Slide 50]
- Given there is an improvement with CADe-aided reading over the reader alone, compared in the same study.
- Success: the CI of the difference in improvements is greater than some preset limit.
Slide 56: Standalone Studies
- Cannot show clinical utility, because no reader is involved.
- May be useful for comparing a CADe device to a previous version or investigating the performance of the device without the reader.
- Example: studying a sample large enough to characterize all important strata (diseased and non-diseased cases) can provide useful label information.
Slide 57: Enriched Standalone Studies
- Suffer the same complications as reader studies with respect to sample enrichment.
- Results are not generalizable across studies.
- Performance estimators apply only to the sample:
  - Enriched samples are not simple random samples of the population.
  - They do not represent standalone performance in the population.
Slide 58: Reuse of Test Data (Standalone Studies)
Some companies have proposed re-using test data in evaluating updated versions of CADes.
Slide 59: Multiplicity
- Multiple tests on the same data set will inflate the Type I error.
- Sponsors must account for multiplicity. Example: Bonferroni correction. (A one-line sketch follows below.)
- Practical problem: choosing the significance level for a "reuse" test if the full alpha of 0.05 was already spent on the first of several tests.
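A minimal sketch of the Bonferroni idea named on the slide: split the family-wise Type I error rate across the planned number of tests on the same data.

```python
def bonferroni_alpha(family_alpha=0.05, n_tests=1):
    """Per-test significance level keeping family-wise error <= family_alpha."""
    return family_alpha / n_tests

# If the test data will be reused for three device versions, each individual
# test must be run at a stricter level:
print(bonferroni_alpha(0.05, 3))  # 0.0166...
```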
Slide 60: "Teaching to the Test"
- Each upgrade iteration on the same data could be considered training.
- Testing on training data => unreliable results ("teaching to the test," "fitting to the noise").
- This is in addition to the multiplicity problems.
- This bias is difficult to quantify.
Slide 61: Example of Overfitting
- Randomly generate a data set of 20 profiles having 6,000 features each.
- Arbitrarily assign each member to one of two classes.
- Develop and evaluate classifiers using three processes:
  - Nearly unbiased cross-validation.
  - Resubstitution method ("teaching to the test"): (1) build the predictor on the full data set; (2) reapply the predictor to each specimen.
  - Partial cross-validation: (1) leave one specimen out; (2) build the classifier on the remaining data; (3) classify the left-out specimen; (4) repeat (20 in total).
This example, from Simon et al., illustrates the problems of overfitting in the context of developing algorithms for class prediction with gene expression data. The large number of features within relatively small samples makes this a good parallel to the situation faced by CADe developers. (A compact re-creation follows below.)
Simon, R., Radmacher, M.D., Dobbin, K., and McShane, L.M., "Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification," Journal of the National Cancer Institute, Vol. 95, No. 1, January 1, 2003.
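A compact re-creation in the spirit of the Simon et al. simulation (details assumed: fewer features to keep it fast, nearest-centroid classifier with simple mean-difference feature selection). On pure-noise data, the resubstitution error is near zero while honest leave-one-out error hovers near 50%.

```python
import random

random.seed(1)
N, P, K = 20, 600, 10  # 20 profiles, 600 noise features (6,000 in the paper), top 10 selected
X = [[random.gauss(0, 1) for _ in range(P)] for _ in range(N)]
y = [i % 2 for i in range(N)]  # arbitrary class labels

def train(rows, labels):
    """Select the K features with largest class-mean separation; nearest-centroid rule."""
    c0 = [r for r, l in zip(rows, labels) if l == 0]
    c1 = [r for r, l in zip(rows, labels) if l == 1]
    m0 = [sum(r[j] for r in c0) / len(c0) for j in range(P)]
    m1 = [sum(r[j] for r in c1) / len(c1) for j in range(P)]
    feats = sorted(range(P), key=lambda j: -abs(m0[j] - m1[j]))[:K]
    return feats, m0, m1

def predict(model, row):
    feats, m0, m1 = model
    d0 = sum((row[j] - m0[j]) ** 2 for j in feats)
    d1 = sum((row[j] - m1[j]) ** 2 for j in feats)
    return 0 if d0 < d1 else 1

# Resubstitution ("teaching to the test"): fit once, score the training data.
model = train(X, y)
resub_err = sum(predict(model, X[i]) != y[i] for i in range(N)) / N

# Nearly unbiased leave-one-out CV: redo everything, including feature
# selection, with the test specimen held out.
loo_err = sum(
    predict(train(X[:i] + X[i+1:], y[:i] + y[i+1:]), X[i]) != y[i]
    for i in range(N)
) / N

print(f"resubstitution error {resub_err:.2f}, leave-one-out error {loo_err:.2f}")
# Expect ~0.00 vs. ~0.50: the resubstitution estimate is wildly optimistic.
```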
Slide 62: Example of Overfitting (Results)
- Random data: expect about half of the specimens to be misclassified.
- Resubstitution method ("teaching to the test"): 98.2% of data sets had zero misclassifications.
- Simon et al., JNCI 2003 (full citation on the previous slide).
Slide 63: Added Review Questions
Any variation of reusing data would raise many difficult review issues:
- Data integrity / access controls: who has access to the test data? When?
- Theoretical basis for procedures: is the method published? Are the assumptions verifiable?
- Selection bias: how were the images chosen?
- Type I error control
Slide 64: Using Only Standalone Data
- A change in marker style can affect reader behavior (Krupinski et al. 1992; Gilbert et al. 2008).
- Changes in prevalence affect reader behavior (Egglin et al. 1996).
- Deduce that changes in CADe mark placement or frequency could impact reader behavior.
- A change to the algorithm is a change to the device, and the device acts on the reader's diagnosis.
- It is difficult to know a priori which changes to an algorithm will produce a change in diagnostic performance.
Slide 65: Reader Studies Compared to Standalone Studies
- Reader studies investigate the reader-device interaction.
- Standalone studies investigate only device performance.
Slide 66: Summary
- Endpoints for reader studies
  - A binary endpoint is more relevant to the study question.
  - Sample size should be set for the appropriate endpoint.
- Control arms for 510(k) reader studies
  - Is any improvement over unaided reading adequate?
- Reuse of data
  - Teaching to the test
- Evaluating CADes without readers
  - Does not show clinical utility.
  - Does not investigate the device under its intended use.
Slide 67: Thank You