CAD Panel Meeting General CAD Methods Nicholas Petrick, Ph.D. Deputy Director, Division of Imaging and Applied Mathematics, OSEL Radiological Devices Panel Meeting March 4, 2008
Outline What is CAD? Basic components of CAD algorithms Clinical implementations of CAD Evaluating CAD algorithms Non-clinical testing Clinical testing Basic statistical tools
What is CAD?
What is CADe? CADe: Computer-aided detection devices Also termed CAD Designed to identify findings (or regions) on an image that may be abnormal Prompting devices only
What is CADx? CADx: Computer-aided diagnosis Also termed CAD Designed to process a specific finding (or region) to characterize the finding Likelihood of malignancy Recommended clinical action Describe the finding Helps physician determine what he/she is looking at [Images: example CADx outputs, e.g., likelihood scores (0.26, 0.77, 0.27) or assessment categories (B1, B4, B2)]
What is CAD? CAD encompasses many disciplines [Figure: Venn diagram placing CAD at the intersection of statistics, pattern recognition, artificial intelligence, image processing, physics, biology, and medicine]
Basic Blocks in CADe Algorithms Image processing Segmentation Features/feature selection Classification Sequencing and block details differ between CADe algorithms [Diagram: Acquire Digital Data → Image Processing → Segmentation → Features & Feature Selection → Classification → Annotation]
Acquire Digital Data Digital data can come from Digitized film Direct digital devices FFDM CT Many others [Image: example mammogram with a mass]
Image Processing Image is enhanced or processed to facilitate analysis
Segmentation Identify boundaries or regions within the image Lesion candidates Organs
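Below is a minimal, runnable sketch of one way a segmentation block might work: global thresholding plus connected-component labeling with SciPy. Actual CADe segmentation is far more sophisticated; the function names, threshold, and synthetic image are our own illustration.

```python
# Illustrative segmentation sketch: threshold the image, then label
# connected components to produce candidate regions (lesion candidates).
import numpy as np
from scipy import ndimage

def segment_candidates(image: np.ndarray, thresh: float) -> list:
    """Return one boolean mask per connected region above `thresh`."""
    labels, n_regions = ndimage.label(image > thresh)
    return [labels == k for k in range(1, n_regions + 1)]

rng = np.random.default_rng(0)
img = rng.random((64, 64)) * 0.3        # synthetic background
img[20:28, 20:28] += 0.7                # synthetic bright "lesion"
masks = segment_candidates(img, thresh=0.6)
print(f"{len(masks)} candidate region(s) found")
```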
Features Characterize regions or pixels within a dataset Shape Texture Curvature … Feature selection Process for selecting informative features [Example features: F1: area, F2: perimeter]
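As a toy counterpart to the F1 (area) and F2 (perimeter) examples above, the sketch below computes two simple shape descriptors from a binary candidate mask. The erosion-based perimeter estimate and all names are illustrative assumptions, not any device's actual feature set.

```python
# Illustrative shape features for a segmented candidate region.
import numpy as np
from scipy import ndimage

def shape_features(mask: np.ndarray) -> dict:
    area = int(mask.sum())                        # F1: area (pixel count)
    interior = ndimage.binary_erosion(mask)
    perimeter = int(mask.sum() - interior.sum())  # F2: perimeter (boundary pixels)
    return {"area": area, "perimeter": perimeter}

mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True                         # 10x10 square region
print(shape_features(mask))  # area 100; boundary pixels 100 - 64 = 36
```

Feature selection would then keep only the most informative subset of such descriptors.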
Classification Features input to a learning algorithm Combined into an output score Classifier types Multiple thresholds, LDA, neural network Training/test paradigm critical Threshold applied to object scores [Diagram: features F1…FN → trained learning machine → object score]
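The sketch below illustrates this block under stated assumptions: synthetic two-feature data, an LDA classifier (one of the types named above, via scikit-learn), an independent test set to respect the training/test paradigm, and an arbitrary threshold of 0 on the object score.

```python
# Illustrative classification sketch: features -> trained learning
# machine -> object score -> threshold on the score.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic 2-feature training data: 100 normal and 100 abnormal objects.
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                     rng.normal(1.5, 1.0, (100, 2))])
y_train = np.repeat([0, 1], 100)
# Independent test set -- the training/test paradigm is critical.
X_test = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                    rng.normal(1.5, 1.0, (50, 2))])

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
scores = lda.decision_function(X_test)  # object scores
marks = scores >= 0.0                   # threshold applied to object scores
print(f"{marks.sum()} of {len(marks)} test objects marked")
```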
Annotation CADe annotations Prompts of potential abnormalities
Basic Blocks in a CADx Algorithm Characterization of a finding Basic blocks are similar Image processing Features/feature selection Classification Sequencing and block details differ between CADx algorithms [Diagram: Identified Region → Image Processing → Features & Feature Selection → Classification → Annotation; example output scores 0.79, 0.91, 0.66]
Training CADs Process for systematically improving performance on a set of data known as the training set Maximize sensitivity Maximize area under ROC curve Training can be performed By computer Regression or optimization techniques By humans Tweaking parameters or combinations of parameters Algorithm fixed after training
Training CADs: Learning Curve Training (learning) is a dynamic process Increasing training data Increases performance Decreases variability [Figure: learning curve for a 3-feature linear classifier; ROC area vs. no. of patients per class]
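The learning-curve behavior can be reproduced on synthetic data; the sketch below (arbitrary data model and sizes, LDA classifier) shows mean test AUC rising and its spread shrinking as training patients per class increase.

```python
# Illustrative learning curve: test AUC vs. training-set size.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def draw(n):  # synthetic 3-feature data, n patients per class
    X = np.vstack([rng.normal(0, 1, (n, 3)), rng.normal(1, 1, (n, 3))])
    return X, np.repeat([0, 1], n)

X_test, y_test = draw(1000)              # large fixed test set
for n_train in (10, 30, 100, 300):
    aucs = [roc_auc_score(y_test,
                          LinearDiscriminantAnalysis()
                          .fit(*draw(n_train))
                          .decision_function(X_test))
            for _ in range(20)]          # 20 independent trainings
    print(f"n={n_train:4d}  AUC={np.mean(aucs):.3f} ± {np.std(aucs):.3f}")
```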
Clinical Use of CAD
CAD Reading Paradigms First reader Physician reviews only regions or findings marked by the CAD device Unmarked regions not necessarily evaluated by physician No radiological CAD device approved/cleared for this mode Discussion questions: M6, C7, L6
CAD Reading Paradigms Second reader Physician first conducts a complete interpretation without CAD (unaided read) Then re-conducts an interpretation with the CAD device (aided read) Also termed “second detector” or “sequential reader” Examples Mammography CADs Some lung CADs
CAD Reading Paradigms Concurrent read Physician performs a complete interpretation in the presence of CAD marks CAD marks are available at any time Examples Some colon CAD devices are potentially used in this way
CAD Factors Influencing Clinical Use Physical characteristics of mark Physicians may respond differently to different types of marks* CAD standalone performance Number of CAD marks Knowledge of Se & FP rate may affect user confidence in or attention to CAD marks Change in interpretation Change in reading time Increase review time Maintain/decrease review time *EA Krupinski et al., “A perceptually based method for enhancing pulmonary nodule recognition,” Investigative Radiology 28(4):289, 1993.
Evaluating CAD Algorithms Non-clinical Evaluation
Non-Clinical Evaluation Device & algorithm descriptions Stability analysis
Non-Clinical Evaluation Algorithm description Different CAD devices contain different processing Easier to assess/compare if devices are not “black boxes” To understand a CAD the following information is needed Patients targeted by device Device usage (e.g., reading mode) Image processing, segmentation, etc. Features, classifiers, etc. Training & training data, etc. Discussion question: G1
Algorithm Stability Stable algorithm Similar performance with changes in algorithm, features, training, or training databases Stability increases as No. of training cases increases No. of initial features decreases Complexity of the CAD decreases Discussion question: G1
Why Stability Analysis? Indicates whether performance is due to a fortuitous training/test set Algorithm updates produce evolving performance [Figures: training confidence intervals for a more stable vs. a less stable algorithm; example only, not an actual device]
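One simple way to probe stability (our construction for illustration, not a prescribed method): retrain the classifier on bootstrap resamples of the training set and examine the spread of test AUC. A wide spread suggests the reported performance may hinge on a fortuitous training set.

```python
# Illustrative stability analysis via bootstrap resampling of the
# training set; all data are synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(1, 1, (60, 3))])
y_train = np.repeat([0, 1], 60)
X_test = np.vstack([rng.normal(0, 1, (500, 3)), rng.normal(1, 1, (500, 3))])
y_test = np.repeat([0, 1], 500)

aucs = []
for _ in range(200):
    idx = rng.integers(0, len(y_train), len(y_train))  # bootstrap resample
    clf = LinearDiscriminantAnalysis().fit(X_train[idx], y_train[idx])
    aucs.append(roc_auc_score(y_test, clf.decision_function(X_test)))
print(f"test AUC: mean {np.mean(aucs):.3f}, spread (SD) {np.std(aucs):.3f}")
```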
Evaluating CAD Algorithms Clinical Testing
Hierarchical Model of Efficacy* Level 1 Technical efficacy Physical & bench tests Level 2 Diagnostic accuracy Se/Sp, ROC curve, etc Level 3 Diagnostic thinking Effect on clinicians’ estimates of diagnostic probabilities, pretest to posttest Level 4 Therapeutic efficacy Effect on therapeutic management Level 5 Patient outcome Value in terms of quality-adjusted life years (QALYs), etc. Level 6 Societal efficacy Overall societal benefit *Fryback, Thornbury, “The efficacy of diagnostic imaging,” Med Decis Making 11:88–94, 1991.
Hierarchical Model of Efficacy (cont.) Levels on which imaging technology sponsors generally focus when going through FDA Sponsors & FDA are not constrained to these levels
Classes of Tests Standalone performance testing Performance of the device by itself Intrinsic functionality of the device Reader performance testing Performance of physicians using the device Impact on physician performance
Standalone Testing Performance of the device by itself [Flowchart: Acquire Test Dataset; Establish Truthing Rule & Method; Establish Ground Truth; Apply CADe Device; Establish Scoring Rule & Method; Apply Scoring; Statistical Analysis] Discussion questions: M1, C3, L2
Test Dataset Clinical images used to determine safety and effectiveness of a CAD Different from set used to train/develop or validate CAD Represents target population & target disease condition Usually includes clinically relevant spectrum of patients, imaging hardware & protocols Discussion questions: M1, M4, C3, C6, L2, L5
Acquiring Test Dataset Field test accrual Collection during real-time clinical interpretation Enrichment accrual Enrichment for low prevalence of disease Enrich with disease cases at a higher proportion than in population Enrichment for stress testing Enrich with cases containing challenging findings Stress testing usually includes a comparison modality
Reuse of Test Data Ideal testing paradigm Develop CAD algorithm Collect testing cases Apply CAD Report standalone and/or reader performance results
Reuse of Test Data Sponsor may want to compare performance of revised algorithm with same or expanded version of test cases Developer may have gained knowledge (learned) by knowing performance of original CAD on test data For larger datasets and minimal feedback, knowledge gain may be quite small May be possible to reuse test data under appropriate constraints to streamline assessment What may be appropriate constraints to balance data integrity & data collection? Discussion question: G2
Standalone Testing Performance of the device by itself [Flowchart: Acquire Test Dataset; Establish Truthing Rule & Method; Establish Ground Truth; Apply CADe Device; Establish Scoring Rule & Method; Apply Scoring; Statistical Analysis]
Ground Truth Ground truthing includes: Whether or not disease is present (patient level) Location and/or extent of the disease (lesion level) Types of ground truthing Cancerous lesions Biopsy/pathology (Follow-up imaging for normals) Non-cancerous lesions Expert panel reviews all available clinical information May be others Discussion questions: C2, L1
Ground Truth by Expert Panel Experts almost always required to determine lesion locations May also determine if abnormality is present Experts are susceptible to reader variability Multiple readers allow a measure of truth variability
Ground Truthing: Mammography Patient-level Pathology verified cancer in left breast
Ground Truthing: Mammography Lesion-level Radiologist identifies region of lesion [Image: clinician-identified ROI]
Ground Truthing: Mammography Lesion-level Radiologist segments region
Standalone Testing Performance of the device by itself [Flowchart: Acquire Test Dataset; Establish Truthing Rule & Method; Establish Ground Truth; Apply CADe Device; Establish Scoring Rule & Method; Apply Scoring; Statistical Analysis]
Scoring Rules and Methods Used to determine whether CAD marks a true lesion Overlap between CAD/truth segmentations [Figure: truth segmentation vs. CAD segmentation] Discussion questions: M1, C3, L2
Scoring Rules and Methods Used to determine whether CAD marks a true lesion Distance between CAD/truth centroids Scoring by a physician [Figure: truth centroid vs. CAD centroid; distance = 2.1 mm]
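Both scoring rules can be made concrete in a few lines; the overlap fraction and distance thresholds below are arbitrary examples for illustration, not regulatory criteria.

```python
# Illustrative scoring rules: overlap between CAD and truth
# segmentations, and distance between their centroids.
import numpy as np
from scipy import ndimage

def overlap_hit(cad_mask, truth_mask, min_overlap=0.3):
    """Mark counts as a true lesion if overlap / truth area >= min_overlap."""
    inter = np.logical_and(cad_mask, truth_mask).sum()
    return inter / truth_mask.sum() >= min_overlap

def centroid_hit(cad_mask, truth_mask, max_dist_px=10.0):
    """Mark counts as a true lesion if centroids are within max_dist_px."""
    c_cad = np.array(ndimage.center_of_mass(cad_mask))
    c_tru = np.array(ndimage.center_of_mass(truth_mask))
    return float(np.linalg.norm(c_cad - c_tru)) <= max_dist_px

truth = np.zeros((64, 64), bool); truth[20:30, 20:30] = True
cad = np.zeros((64, 64), bool);   cad[24:34, 24:34] = True
print(overlap_hit(cad, truth), centroid_hit(cad, truth))  # True True
```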
Standalone Performance Measures Lesion-based sensitivity and number of FPs per image (or per scan) [Se, FPs/Image] Free-Response Receiver Operating Characteristic (FROC) curve [Figure: FROC curve; TPF (sensitivity) vs. no. of FPs per image]
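A minimal sketch of how FROC points could be computed from scored CAD marks: sweep a threshold from the most to the least suspicious mark, recording lesion sensitivity against FPs per image. It assumes at most one mark per true lesion, and the toy data are invented.

```python
# Illustrative FROC computation from (score, true-positive?) mark pairs.
import numpy as np

def froc_points(scores, is_tp, n_lesions, n_images):
    order = np.argsort(scores)[::-1]        # most to least suspicious
    hits = np.asarray(is_tp)[order]
    tp = np.cumsum(hits)                    # true lesions found so far
    fp = np.cumsum(~hits)                   # false marks so far
    return fp / n_images, tp / n_lesions    # (FPs/image, sensitivity)

# Toy data: 8 marks over 4 images containing 3 true lesions in total.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
is_tp  = [True, False, True, False, False, True, False, False]
for f, s in zip(*froc_points(scores, is_tp, n_lesions=3, n_images=4)):
    print(f"FPs/image={f:.2f}  sensitivity={s:.2f}")
```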
Evaluating CAD Algorithms Clinical Testing Reader Performance Testing
Reader Performance Testing Performance of physicians using the device [Flowchart: Acquire Test Dataset; Identify Study Readers; Establish Truthing Rule & Method; Establish Ground Truth; Apply CADe; Read with CADe; Read without CADe; Establish Scoring Rule & Method; Apply Scoring; Statistical Analysis] Discussion questions: M2, C4, L3
Reader Selection Readers generally selected to be representative of intended users Representative of clinicians who will use device Representative of proper clinician experience level Reader performance testing depends on Proper understanding & use of the CAD device Proper understanding & implementation of study protocol Training of readers is key to achieving both
Designing Reader Studies Common endpoints Common CAD study designs
Evaluating CAD Algorithms Clinical Testing Study Endpoints
Study Endpoints Patient analysis [Sensitivity, Specificity] ROC analysis Location-specific analysis Location-specific ROC Free-response ROC (FROC) Discussion questions: M2, C4, L3
Patient Endpoints Assessing CADx Assessing CADe without accounting for location [Figure: identified region assessed with a CADx aid; CADx POM: 0.91, clinician POM: 0.95]
Patient-Based Endpoints Patient analysis does not account for localizing the lesion Endpoints Binary decision (single threshold) [Sensitivity, Specificity] ([Se, Sp]) operating point Rating/ranking (range of thresholds) Receiver operating characteristic (ROC) curve
Se/Sp Operating Point [Se, Sp] operating point Comparing without/with CAD Often higher Se, lower Sp Many other possible endpoints [Figure: reader-alone vs. reader+CAD operating points; true positive fraction (sensitivity) vs. false positive fraction (1 - specificity)]
ROC Assessment [Figure: distributions of computer scores for non-diseased and diseased cases]
Single Threshold [Figure: a single threshold on the score distributions yields a single operating point: TPF (sensitivity) vs. FPF (1 - specificity)]
Entire ROC Curve [Figure: sweeping the threshold over its full range traces the entire ROC curve]
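The sequence in the three figures above can be reproduced in a few lines: a single threshold on synthetic computer scores yields one (FPF, TPF) operating point, and sweeping the threshold traces the entire ROC curve (scikit-learn's roc_curve shown; the normal score distributions are an assumption for illustration).

```python
# Illustrative ROC assessment from computer scores.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1, 200),   # non-diseased cases
                         rng.normal(1.2, 1, 200)])  # diseased cases
truth = np.repeat([0, 1], 200)

# Single threshold -> single operating point.
t = 0.5
tpf = np.mean(scores[truth == 1] >= t)  # sensitivity
fpf = np.mean(scores[truth == 0] >= t)  # 1 - specificity
print(f"threshold {t}: TPF={tpf:.2f}, FPF={fpf:.2f}")

# Threshold range -> entire ROC curve.
fpf_all, tpf_all, thresholds = roc_curve(truth, scores)
print(f"{len(thresholds)} thresholds traced along the curve")
```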
ROC Analysis Comparing without/with CAD Often higher Se, lower Sp ROC can facilitate comparison Requires ordering cases from least to most suspicious Ratings often used to facilitate ordering [Figure: reader-alone vs. reader+CAD ROC curves]
ROC Analysis Performance metrics ROC area (AUC) Average TPF across all possible FPFs Partial area under the curve (PAUC) Challenge to link AUC measures to clinical relevance [Figure: two ROC curves with areas AUC1 and AUC2]
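A short sketch of both metrics on synthetic scores. Note that scikit-learn's max_fpr option returns a standardized partial AUC, and the FPF range of 0.2 is an arbitrary example of a clinically relevant region.

```python
# Illustrative AUC and partial-AUC computation.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1, 500), rng.normal(1.2, 1, 500)])
truth = np.repeat([0, 1], 500)

auc = roc_auc_score(truth, scores)                 # average TPF over all FPFs
pauc = roc_auc_score(truth, scores, max_fpr=0.2)   # standardized partial AUC
print(f"AUC={auc:.3f}, partial AUC (FPF<=0.2, standardized)={pauc:.3f}")
```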
Location-Specific Endpoints Assessing CADe Location is important Multiple prompts on the same image Truthing rule is now a critical component [Figure: CADe device marks on an image]
Location-Specific ROC ROC analysis that requires correct location of the lesion One scored location per patient Location must be on the lesion
Location-Based Operating Points [Se, No. FPs] operating point Comparing without/with CAD Often higher Se along with more FPs Many other possible endpoints [Figure: reader-alone vs. reader+CAD operating points; TPF (sensitivity) vs. no. of FPs per image]
Free-Response ROC All [Se, No. FPs] combinations All thresholds [Figure: FROC curve; TPF (sensitivity) vs. no. of FPs per image]
FROC Performance Metrics Area under FROC curve Need to choose FP range Area under alternative FROC (AFROC) Challenges Link measures to clinical relevance Statistical methodology [Figure: FROC curve; TPF (sensitivity) vs. no. of FPs per image]
Evaluating CAD Algorithms Clinical Testing Reader Study Designs
Reader Performance Study Designs Prospective studies Retrospective studies Some CAD study designs Warren-Burhenne MRMC Discussion questions: M4, C6, L5
Prospective Reader Studies CAD performance measured as part of actual clinical practice Field testing of CAD devices
Retrospective Reader Studies Cases are collected prior to image interpretation Typically enriched or stress test dataset used Read offline by one or more readers under specific reading conditions CAD Examples Mammography CAD devices Lung nodule CAD devices
Warren-Burhenne Study Design* Two separate studies A retrospective study of CAD Se to detect abnormalities “missed” in clinical practice Estimated relative reduction in false negative (FN) rate with CAD A prospective study of the work-up rate of readers with & without CAD in clinical practice Difference in work-up rate is attributed to use of CAD Study design used in early mammography CAD approvals *Warren Burhenne et al., Radiology 215:554–562, 2000.
Warren-Burhenne Study Design Fundamental limitation is that the reduction in FN rate & the increase in work-up rate are not evaluated in the same study Study design can be difficult to interpret statistically Study design goal is to estimate the “potential” effect on the FN rate
Multiple Reader Multiple Case (MRMC) Study Design Study where a set of readers interprets a set of patient images in each of two competing reading conditions With and without CAD Could be either prospective or retrospective Fully-crossed design All readers read all of the cases in both modalities Most statistical power for a given number of cases Hybrid designs are also evaluable
MRMC Study Design Advantages Generalizes to new readers & cases Cases are random effects Readers are random effects Greater statistical power for a given number of cases MRMC studies can accommodate [Se, Sp] endpoints ROC endpoints FROC endpoints MRMC studies are generally statistically interpretable
Patient-Based MRMC Analysis Includes [Se, Sp] or ROC endpoints Well-established methodologies & tools Jackknife/ANOVA (Dorfman, Berbaum, and Metz) ANOVA and correlation model (Obuchowski) Ordinal regression (Toledano and Gatsonis) Bootstrap (Beiden, Wagner, and Campbell) One-shot estimate (Gallas)
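In the spirit of the bootstrap approach listed above (a bare-bones sketch, not the actual Beiden-Wagner-Campbell method), one can resample readers and cases together, treating both as random effects, to gauge uncertainty in the with-CAD vs. without-CAD AUC difference; all ratings below are synthetic.

```python
# Illustrative two-way bootstrap over readers and cases for an
# MRMC-style AUC difference.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_readers, n_cases = 5, 200
truth = np.repeat([0, 1], n_cases // 2)
# ratings[modality, reader, case]; modality 1 (with CAD) rated higher.
ratings = rng.normal(0, 1, (2, n_readers, n_cases)) + truth
ratings[1] += truth * 0.3

diffs = []
for _ in range(500):
    r = rng.integers(0, n_readers, n_readers)   # resample readers
    c = rng.integers(0, n_cases, n_cases)       # resample cases
    auc = [np.mean([roc_auc_score(truth[c], ratings[m, j][c]) for j in r])
           for m in (0, 1)]
    diffs.append(auc[1] - auc[0])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUC difference (with - without CAD), 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```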
Location-Based MRMC Analysis Accounts for correct localization of lesions Statistical methodologies & tools are available Region-of-interest ROC analysis (Obuchowski et al.; Rutter) Divide patient data into ROIs (e.g., quadrant or lobe) Jackknife FROC (JAFROC) (Chakraborty & Berbaum) Bootstrap FROC analysis (Samuelson and Petrick; Bornefalk and Hermansson)
Next Talk: Evaluating CAD Algorithms, Further Statistical Issues
Extra Slides
ROC and Operating Point It is possible to obtain both a rating/ranking and an action item within the same reader study Not necessarily just one or the other Examples Determine if patient should have workup Rate patient level of suspicion Rate level of suspicion for individual lesions Determine if individual lesions require workup
Example from the Literature Jiang et al., “Improving breast cancer diagnosis with computer-aided diagnosis,” Academic Radiology 6(1):22–33, 1999. Authors studied ROC curves, ROC areas, and the [Se, Sp] operating point Characterization of microcalcifications Quasi-continuous ratings & action item